
I have a regex that should match the content attribute of an HTML meta tag and capture its value. For example:

<meta name="description" content="Some website description.">

In this case I want to get

Some website description.

and nothing more. In my case I am using this pattern:

private static Pattern siteMetaTagDescriptionAttributePattern = Pattern.compile("name=\"description\"(\\s*)content=\"(.*)\"");
Matcher matcher = siteMetaTagDescriptionAttributePattern.matcher(siteContentLine);
String siteDescription = "";
while(matcher.find()) {
  siteDescription = matcher.group(2);
}

But the match runs to the end of the line, so I get this:

Some website description.">

What should I do to get only the inner content of the content attribute, in this case

Some website description.

Thanks a lot.

George
  • Consider using Jsoup, if you are extracting data here and there in the page. – nhahtdh Feb 10 '14 at 20:16
  • [Obligatory "don't do this" link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Boris the Spider Feb 10 '14 at 20:16
  • Hi there George. See @BoristheSpider link. It's hard to match HTML with Regex; however, you may *try* `(">)` at the end of your expression to see if that works. – Alvin Bunk Feb 10 '14 at 20:24
  • Seems to be working fine to me. Here's your example in a Java regex tester - http://fiddle.re/mwvtf - `.group(2) = "Some website description."` – Bryan Elliott Feb 10 '14 at 20:29
  • [Obligatory link](http://stackoverflow.com/q/4231382/471272): please link to actual answers, not to non-answers. – tchrist Jun 08 '14 at 20:10

2 Answers


Consider using a parser instead of a regex. For instance, you can use Jsoup like this:

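import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<meta name=\"description\" content=\"Some website description.\">";

// Parse the markup, select the description meta tag, and read its content attribute
Document doc = Jsoup.parse(html);
System.out.println(doc.select("meta[name=description]").attr("content"));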

output:

Some website description.
Pshemo
  • @downvoter If you found something wrong with this answer can you tell what is it? I would like to improve/correct it. – Pshemo Feb 11 '14 at 16:57

If you insist:

(?<=name=\"description\" content=\")[^\"]*(?=\")
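
A minimal sketch of how that pattern might be used from Java, assuming the tag always has exactly one space between the two attributes and lists name before content, as the pattern requires. The class name MetaDescriptionLookaround and the method extractDescription are made up for illustration. Because the lookbehind and lookahead do not consume any text, the whole match (group 0) is the attribute value itself:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class MetaDescriptionLookaround {
    // The lookbehind requires name="description" content=" immediately before the match,
    // [^"]* matches up to (but not including) the next double quote,
    // and the lookahead requires that closing quote to follow.
    private static final Pattern DESCRIPTION =
            Pattern.compile("(?<=name=\"description\" content=\")[^\"]*(?=\")");

    // Returns the content value, or an empty string when the line has no match.
    static String extractDescription(String line) {
        Matcher matcher = DESCRIPTION.matcher(line);
        return matcher.find() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        String line = "<meta name=\"description\" content=\"Some website description.\">";
        System.out.println(extractDescription(line)); // prints: Some website description.
    }
}
```

Note that a tag with extra whitespace or a different attribute order will not match this pattern, which is one reason the parser approach is more robust.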
tenub
  • good regex, but I think you would need to escape the " in the char set `[^"]` also. Because he has the pattern enclosed in double quotes. I think it needs to be like this : `Pattern.compile("(?<=name=\"description\" content=\")[^\"]*(?=\")");` – Bryan Elliott Feb 10 '14 at 20:35
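
For comparison, the same negated character class can be dropped into the capturing-group pattern from the question, so that the existing group(2) call keeps working. The sketch below is only an illustration: the class name GreedyVersusNegatedClass, the helper group2, and the sample line with an extra lang attribute are assumptions, chosen so that the greedy `.*` has a later quote to run to.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class, not taken from the question or the answers.
class GreedyVersusNegatedClass {
    // Pattern from the question: the greedy .* extends to the last quote on the line.
    static final Pattern GREEDY =
            Pattern.compile("name=\"description\"(\\s*)content=\"(.*)\"");

    // Same pattern with [^"]* in group 2, which stops at the first closing quote.
    static final Pattern NEGATED =
            Pattern.compile("name=\"description\"(\\s*)content=\"([^\"]*)\"");

    // Returns group 2 of the first match, or an empty string when nothing matches.
    static String group2(Pattern pattern, String line) {
        Matcher matcher = pattern.matcher(line);
        return matcher.find() ? matcher.group(2) : "";
    }

    public static void main(String[] args) {
        // Assumed input: a second quoted attribute follows content on the same line.
        String line = "<meta name=\"description\" content=\"Some website description.\" lang=\"en\">";
        System.out.println(group2(GREEDY, line));  // prints: Some website description." lang="en
        System.out.println(group2(NEGATED, line)); // prints: Some website description.
    }
}
```

On a line where the content value holds the last double quote, both patterns return the same text, which may explain why the original pattern appeared to work in the regex tester.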