Let's say we have a String in Java, which contains HTML code.

I would like to return every substring of this string that has the form "<li>stuff here</li>". I also realize that the opening li tag may have attributes. The BIGGEST problem is that there might be multiple <li></li> pairs on one line, especially if whoever wrote the HTML likes to have everything compressed and less human-readable! ;)

I have thought for a while about using things like string split and going through the array of strings programmatically, setting a boolean flag to true when I'm inside a <li> tag and to false when I exit. Maybe this will work, but it feels very inelegant.

How can I design a method that returns, say, an ArrayList<String> of all results? Can I do this without regex? I have looked up regex and it seems powerful, but sometimes the syntax can be very complicated. If I have to resort to regex I will, but simpler, clearer solutions are appreciated!

If there is no elegant and clear way without regex, I will deal with the regex patterns.
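
For reference, the scanning idea described above can be written without regex by using plain index searches instead of a boolean flag. This is only a minimal sketch under loud assumptions: the items are not nested, every <li> has an explicit </li>, and the input contains no comments or scripts that happen to include "<li". The class name `LiExtractor` and the method names are hypothetical, chosen for illustration. For real-world HTML a proper parser is the safer choice.

```java
import java.util.ArrayList;
import java.util.List;

final class LiExtractor {

    /**
     * Returns every "<li ...>...</li>" substring of html, in order of appearance.
     * Assumption: items are not nested and each <li> has a closing </li>.
     */
    static List<String> extractListItems(String html) {
        List<String> items = new ArrayList<>();
        int from = 0;
        while (true) {
            int open = indexOfIgnoreCase(html, "<li", from);
            if (open < 0) {
                break; // no further opening tag
            }
            int afterName = open + 3;
            if (afterName >= html.length()) {
                break; // input ends right after "<li"
            }
            char next = html.charAt(afterName);
            // Skip tags that merely start with "li", such as <link>.
            if (next != '>' && !Character.isWhitespace(next)) {
                from = afterName;
                continue;
            }
            int close = indexOfIgnoreCase(html, "</li>", afterName);
            if (close < 0) {
                break; // unterminated item: stop rather than guess
            }
            int end = close + "</li>".length();
            items.add(html.substring(open, end));
            from = end; // continue after this item, so several per line are fine
        }
        return items;
    }

    /** Case-insensitive indexOf, so that <LI> and <li> are both found. */
    private static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

Calling `LiExtractor.extractListItems("<ul><li>one</li><li class=\"x\">two</li></ul>")` would return a list with the two elements `<li>one</li>` and `<li class="x">two</li>`. With nested lists, this sketch would cut an outer item off at the first inner `</li>`.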

Drifter64
  • You {c,sh}ould use an HTML parser for this, such as jsoup – fge Apr 11 '14 at 20:35
  • See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 (But really, try a HTML parser.) – ethanbustad Apr 11 '14 at 20:42
  • @fge simply brilliant. hopefully i'm not the only one to get it. – tenub Apr 11 '14 at 20:56
  • @fge: The objects in this are unfamiliar, but I just did a test run looking for things in an HTML string, and it works great! Thank you! As a note, what I needed in this case was Document.select("li").get(int); this let me cycle through all the matched li tags, after getting their count from Document.select("li").size() – Drifter64 Apr 11 '14 at 21:06
  • As I understand, you want to extract the `<li>...</li>`? I'm not familiar with java, but as lists can be [nested](http://www.w3schools.com/html/tryit.asp?filename=tryhtml_lists2), if using regex, could try with a pattern like [this](http://fiddle.re/89wb3) (click on "java" to test) for only matching the innermost `<li>...</li>`. If you know, it's not nested, something like [that](http://fiddle.re/awyp3) should be convenient. For further explanation see [this FAQ](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075) – Jonny 5 Apr 11 '14 at 21:52