4

I have to parse some HTML to find a set of values. The HTML isn't always well formed and I have no control over it (so Scanner does not seem to be an option).

This is a shopping cart, and within the cart are n rows, each containing a quantity dropdown. I want to get the sum total of products in the cart.

Given this HTML, I would want to match the values 2 and 5:

...
<select attr="other stuff" name="quantity">
    <option value="1" />
    <option value="2" selected="selected" />
</select>
....
<select name="quantity" attr="other stuff">
    <option selected="selected" value="5" />
    <option value="6" />
</select>

I've made a number of pitiful attempts, but given the number of variables (for example, the order of the 'value' and 'selected' attributes), most of my solutions either don't work or are really slow.

The last Java code I ended up with is the following:

Pattern pattern = Pattern.compile("select(.*?)name=\"quantity\"([.|\\n|\\r]*?)option(.*?)value=\"(/d)\" selected=\"selected\"", Pattern.DOTALL);
Matcher matcher = pattern.matcher(html);
if (matcher.find()) {
   ....
}

It's very slow and does not work when the attribute order changes. My regex knowledge is not good enough to write an efficient pattern.

Lightness Races in Orbit
Nick Cardoso

4 Answers

4

Instead of using a regular expression, you can use an XPath expression to retrieve all value attributes for the HTML you have in the question:

//select[@name="quantity"]/option[@selected="selected"]/@value

In words:

  • Find all <select> elements in the XML whose name attribute equals quantity, then their child <option> elements whose selected attribute equals selected.
  • Retrieve the value attributes.

I would really consider using XPath/XQuery; that is what it is made for. See the answer to the question "How to read XML using XPath in Java" for how to retrieve the values. There is an introduction to XPath expressions here.


Suppose that in the future you need to find only the options that have selected="selected" and, for example, status="accepted". The XPath expression would simply become:

//select[@name="quantity"]/option[@selected="selected" and @status="accepted"]/@value

The XPath expression is easy to extend, easy to review, easy to prove what it is doing.
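
As a rough illustration of that extension, here is a minimal sketch that evaluates the extended expression with the standard javax.xml.xpath API. The class name AcceptedQuantities, the helper acceptedValues, the <cart> wrapper element and the sample markup are all made up for this example, and the status attribute is the hypothetical one from the paragraph above. It also assumes the markup is well-formed XML, because this API parses its input as XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AcceptedQuantities {
    // The original expression plus the extra (hypothetical) status condition.
    static final String XPATH =
        "//select[@name='quantity']/option[@selected='selected' and @status='accepted']/@value";

    // Returns the value attribute of every option matched by XPATH.
    static List<String> acceptedValues(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            XPATH, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Illustrative, well-formed sample: only the first option is "accepted".
        String xml =
            "<cart>"
          + "<select name=\"quantity\">"
          + "<option value=\"2\" selected=\"selected\" status=\"accepted\" />"
          + "</select>"
          + "<select name=\"quantity\">"
          + "<option value=\"5\" selected=\"selected\" status=\"pending\" />"
          + "</select>"
          + "</cart>";
        System.out.println(acceptedValues(xml)); // prints [2]
    }
}
```

Only the predicate inside the square brackets changed; the Java code around it is the same as for the simpler expression.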

Now what kind of RegEx monster would you have to create for the added condition? Hard to write, even harder to maintain. How can a code-reviewer tell what the complex (cf bobble bubble's answer) regular expression is doing? How do you prove that the regular expression is actually doing what it is supposed to do?

You can of course document the regular expression, something you should always do for regular expressions. But that doesn't prove anything.

My advice: Stay away from regular expressions unless there is absolutely no other way.


For fun, I wrote a snippet showing the basics of this approach:

import java.io.StringReader;
import javax.xml.xpath.*;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ReadElementsFromHtmlUsingXPath {
    private static final String html=
"<html>Read more about XPath <a href=\"www.w3schools.com/xsl/xpath_intro.asp\">here</a>..."+
"<select attr=\"other stuff\" name=\"quantity\">"+
    "<option value=\"1\" />"+
    "<option value=\"2\" selected=\"selected\" />"+
"</select>"+
"<i><b>Oh and here's the second element</b></i>"+
"<select name=\"quantity\" attr=\"other stuff\">"+
    "<option selected=\"selected\" value=\"5\" />"+
    "<option value=\"6\" />"+
"</select>"+
"And that's all folks</html>";

    private static final String xpathExpr = 
"//select[@name=\"quantity\"]/option[@selected=\"selected\"]/@value";

    public static void main(String[] args) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            XPathExpression expr = xpath.compile(xpathExpr);
            NodeList nodeList = (NodeList) expr.evaluate(new InputSource(new StringReader(html)),XPathConstants.NODESET);
            for( int i = 0; i != nodeList.getLength(); ++i )
                System.out.println(nodeList.item(i).getNodeValue());
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
    }
}

This outputs:

2
5
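
Since the question asks for the sum total rather than the individual values, one possible variation is to let XPath do the addition with its built-in sum() function. The sketch below is only an illustration: the class name SumSelectedQuantities, the helper sumSelected and the <cart> wrapper element are made up, and it assumes well-formed XML in which every selected value is a whole number.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class SumSelectedQuantities {
    // sum() converts each matched value attribute to a number and adds them up.
    static final String SUM_XPATH =
        "sum(//select[@name='quantity']/option[@selected='selected']/@value)";

    static int sumSelected(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // With XPathConstants.NUMBER the result is returned as a Double.
        Double total = (Double) xpath.evaluate(
            SUM_XPATH, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
        return total.intValue();
    }

    public static void main(String[] args) throws XPathExpressionException {
        String xml =
            "<cart>"
          + "<select attr=\"other stuff\" name=\"quantity\">"
          + "<option value=\"1\" />"
          + "<option value=\"2\" selected=\"selected\" />"
          + "</select>"
          + "<select name=\"quantity\" attr=\"other stuff\">"
          + "<option selected=\"selected\" value=\"5\" />"
          + "<option value=\"6\" />"
          + "</select>"
          + "</cart>";
        System.out.println(sumSelected(xml)); // prints 7
    }
}
```

If a value might not be numeric, sum() yields NaN, so in that case it is probably safer to fetch the node list as in the snippet above and parse each value separately.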
TT.
  • 1
    Note that you can shorten the search by adding `[1]` to take the first (and only) `option` tag that has the `selected` attribute. Since `selected="selected"` is only an XHTML idiom, you don't need to test the value; a `selected` attribute can only have the value `selected`: `//select[@name="quantity"]/option[@selected][1]/@value`. This way XPath doesn't try to find another `option` tag with the `selected` attribute under the same `select` parent: the search stops and jumps to the next `select` tag immediately. – Casimir et Hippolyte Feb 27 '16 at 17:04
  • @CasimiretHippolyte Thanks for your insights. – TT. Feb 28 '16 at 07:55
4

It surely depends on how malformed your HTML could be. A parser solution is to be preferred.

A regex that matches your requirement is not much of a challenge; it just needs putting together.

(?xi) # i-flag for caseless, x-flag for comments (free spacing mode) 

# 1.) match <select with optional space at the end
<\s*select\s[^>]*?\bname\s*=\s*["']\s*quantity[^>]*>\s*

# 2.) match lazily any amount of options until the "selected"
(?:<\s*option[^>]*>\s*)*?

# 3.) match selected using a lookahead and capture number from value
<\s*option\s(?=[^>]*?\bselected)[^>]*?\bvalue\s*=\s*["']\s*(\d[.,\d]*)

Try the pattern at regex101 or RegexPlanet (Java). As a Java String:

"(?i)<\\s*select\\s[^>]*?\\bname\\s*=\\s*[\"']\\s*quantity[^>]*>\\s*(?:<\\s*option[^>]*>\\s*)*?<\\s*option\\s(?=[^>]*?\\bselected)[^>]*?\\bvalue\\s*=\\s*[\"']\\s*(\\d[.,\\d]*)"

There is not much magic in it. It is a long, ugly pattern mostly because it is parsing HTML.

  • \s is shorthand for whitespace; in Java, [ \t\n\x0B\f\r]
  • \d is shorthand for a digit, [0-9]
  • \b matches a word boundary
  • (?: opens a non-capturing group
  • [^>] is a negated class: it matches any character that is not >
  • (?=[^>]*?\bselected) is a lookahead that checks for selected anywhere in the tag, which makes the match independent of attribute order
  • (\d[.,\d]*) captures the number: one required digit followed by any number of digits, dots or commas

The matched number is in group(1), the first capturing (parenthesized) group.
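
To show how the pattern could be used from Java, here is a minimal sketch that loops over all matches and adds up group(1). The class name RegexQuantitySum and the helper sumSelected are made up for this example, and it assumes the captured values are whole numbers (Integer.parseInt would throw on a value such as 1.5).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexQuantitySum {
    // The Java String version of the pattern above, split over three lines.
    private static final Pattern SELECTED_QUANTITY = Pattern.compile(
        "(?i)<\\s*select\\s[^>]*?\\bname\\s*=\\s*[\"']\\s*quantity[^>]*>\\s*"
      + "(?:<\\s*option[^>]*>\\s*)*?"
      + "<\\s*option\\s(?=[^>]*?\\bselected)[^>]*?\\bvalue\\s*=\\s*[\"']\\s*(\\d[.,\\d]*)");

    // Adds up the captured value of the selected option of every quantity select.
    static int sumSelected(String html) {
        int total = 0;
        Matcher matcher = SELECTED_QUANTITY.matcher(html);
        while (matcher.find()) {
            total += Integer.parseInt(matcher.group(1));
        }
        return total;
    }

    public static void main(String[] args) {
        String html = "...\n"
            + "<select attr=\"other stuff\" name=\"quantity\">\n"
            + "    <option value=\"1\" />\n"
            + "    <option value=\"2\" selected=\"selected\" />\n"
            + "</select>\n"
            + "....\n"
            + "<select name=\"quantity\" attr=\"other stuff\">\n"
            + "    <option selected=\"selected\" value=\"5\" />\n"
            + "    <option value=\"6\" />\n"
            + "</select>";
        System.out.println(sumSelected(html)); // prints 7
    }
}
```

Using while (matcher.find()) instead of a single if is what picks up every quantity dropdown in the cart rather than only the first one.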

bobble bubble
  • =) Compare the elegance of the XPath expression with the regular expression in your answer. Try debugging that if there's an error in it... – TT. Feb 23 '16 at 20:08
  • Seems like RegEx101 agrees with you. GJ. – TT. Feb 23 '16 at 20:24
  • 1
    I'm giving this the bounty as it's the answer that I was actually looking for. I'm going to accept the XPath answer though, because I think that's where future visitors should really be looking – Nick Cardoso Feb 29 '16 at 16:05
  • @NickCardoso thank you! : ) Yes, TT's answer is very detailed and of course much more elegant. – bobble bubble Feb 29 '16 at 16:14
2

Let's Divide and Conquer.

First, create a class called Option:

public class Option {

    private String value;
    private boolean selected;

    public Option() {
    }

    public Option(String value, boolean selected) {
        this.value = value;
        this.selected = selected;
    }

    public String getValue() {
        return value;
    }

    public void setValue(String value) {
        this.value = value;
    }

    public boolean isSelected() {
        return selected;
    }

    public void setSelected(boolean selected) {
        this.selected = selected;
    }

    @Override
    public String toString() {
        return "{" +
                "value='" + value + '\'' +
                ", selected=" + selected +
                '}';
    }

}

Second, we need a regex to find the option tags:

static final Pattern OPTION_TAG_PATTERN = Pattern.compile("<option\\s*(value=\"\\w+\"\\s+(?:selected=\"selected\")?|(?:selected=\"selected\")?\\s+value=\"\\w+\")\\s*/>");

and another to extract the value of the value attribute:

static final Pattern VALUE_PATTERN = Pattern.compile("value=\"(\\w+)\"");

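And finally:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {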

    private static final Pattern OPTION_TAG_PATTERN = Pattern.compile("<option\\s*(value=\"\\w+\"\\s+(?:selected=\"selected\")?|(?:selected=\"selected\")?\\s+value=\"\\w+\")\\s*/>");
    private static final Pattern VALUE_PATTERN = Pattern.compile("value=\"(\\w+)\"");

    public static void main(String[] args) {
        String html = "...\n" +
                "<select attr=\"other stuff\" name=\"quantity\">\n" +
                "    <option value=\"1\" />\n" +
                "    <option value=\"2\" selected=\"selected\" />\n" +
                "</select>\n" +
                "....\n" +
                "<select name=\"quantity\" attr=\"other stuff\">\n" +
                "    <option selected=\"selected\" value=\"5\" />\n" +
                "    <option value=\"6\" />\n" +
                "</select>";
        findOptions(html).forEach(System.out::println);
    }

    public static List<Option> findOptions(String htmlContent) {
        List<Option> options = new ArrayList<>();
        Matcher optionMatcher = OPTION_TAG_PATTERN.matcher(htmlContent);
        while (optionMatcher.find()) {
            options.add(toOption(htmlContent.substring(optionMatcher.start(), optionMatcher.end())));
        }
        return options;
    }

    private static Option toOption(String htmlTag) {
        Option option = new Option();
        Matcher valueMatcher = VALUE_PATTERN.matcher(htmlTag);
        if (valueMatcher.find()) {
            option.setValue(valueMatcher.group(1));
        }
        if (htmlTag.contains("selected=\"selected\"")) {
            option.setSelected(true);
        }
        return option;
    }

}

Output:

{value='1', selected=false}
{value='2', selected=true}
{value='5', selected=true}
{value='6', selected=false}

I hope this helps you!

FaNaJ
0

I believe regex is not the best tool for this, simply because the complexity makes the code hard to read and debug. We can still use regex, but break the logic down into smaller steps so that it is easier to follow and improve:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; it only wraps the snippet so that it compiles.
public class CartTotal {
    public static void main(String[] args) {
        String html = "<select attr=\"other stuff\" name=\"quantity\">" +
                "<option value=\"1\" /> " +
                "<option value=\"2\" selected=\"selected\" /> " +
                "</select> " +
                "<select name=\"quantity\" attr=\"other stuff\"> " +
                "<option selected=\"selected\" value=\"5\" /> " +
                "<option value=\"6\" /> " + "</select>";
        // Everything between "<option" and the closing "/>" of each option tag.
        String options = "(?<=<option).*?(?=/>)";
        Pattern pat = Pattern.compile(options, Pattern.DOTALL);
        Matcher m = pat.matcher(html);
        // The text between value=" and the next double quote.
        Pattern values = Pattern.compile("(?<=value=\").*?(?=\")");
        Pattern selected = Pattern.compile("selected=\"selected\"");
        int counter = 0;
        while (m.find()) {
            // Only count options that are marked as selected.
            Matcher sel = selected.matcher(m.group());
            if (sel.find()) {
                Matcher val = values.matcher(m.group());
                if (val.find()) {
                    counter += Integer.parseInt(val.group());
                }
            }
        }
        System.out.println(counter);
    }
}

This prints the required total, 7.

bmbigbang