
I'm writing a little program that finds email addresses given a URL, but there seems to be something wrong with my regex. It's printing the same thing multiple times and matching text that I'm not looking for.

Cleaner cleaner = new Cleaner(Whitelist.basic());
String url = "http://www.fon.hum.uva.nl/paul/";
Document doc = cleaner.clean(Jsoup.connect(url).get());
Elements emails = doc.select(":matches(" + 
                "[0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4}"
                +")");
for (Element e : emails) {
   System.out.println(e.text());
}

I won't post the complete result here, because it's too long, but it's matching an email, and also a bunch of repeated text that doesn't follow the pattern.

"Paul Boersma Professor of Phonetic Sciences   University of Amsterdam   "...
"Paul Boersma Professor of Phonetic Sciences   University of Amsterdam   "...
"Paul Boersma Professor of Phonetic Sciences   University of Amsterdam   "...

Does anyone know what the problem could be? Is it the regex, or does it have something to do with printing e.text()?

Thank you.

Edit: I have also tried a more complicated expression:

[\\w-]+(\\.[\\w-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,4})

But I have had the same issue with it.

Edit 2: I have used this regex in Notepad++, and it seems to work well. I only have this issue when matching text from webpages.

Edit 3: I tried running it on regexplanet.com and interestingly enough, it matches correctly. So is this a Jsoup thing then? Something having to do with Elements, maybe?

pushkin
    Two questions from the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496) that may be of interest: [validating email addresses](http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address), and [validating urls](http://stackoverflow.com/a/190405/2736496) (listed under "Common Validation Tasks"). The only thing that stands out to me at first glance, is that many emails have dots before the at-sign, and your regex doesn't allow it. – aliteralmind Apr 22 '14 at 01:24

2 Answers

The problem comes from the CSS query. Because the query does not name a specific element, :matches is tested against the text of every element, and that text includes the text of all its descendants. What you get back is the element containing an email and ALL its ancestors up to the root node (<html>), each printed with its full text.

I can see two options for you:

1. Use a specific css query

a:matches([0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4})

Demo: http://try.jsoup.org/~fsXXqnQtTNEOSTR3TPvyONtWS64

2. Extract the node immediately containing the email

:matchesOwn([0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4})

Demo: http://try.jsoup.org/~RgbUgekyWIoSe5bvFhZdQju9ibM
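
To see why the original query prints whole blocks of page text, here is a rough analogy in plain java.util.regex (it is not Jsoup code). It assumes that a selector like :matches only asks whether an element's text contains a match somewhere, while printing e.text() prints the entire text. The class name, method names, and sample string are hypothetical and only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContainsVersusExtract {
    // The regex from the question, unchanged.
    static final Pattern EMAIL =
            Pattern.compile("[0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4}");

    // True when the text contains a match anywhere; says nothing about where.
    static boolean containsEmail(String text) {
        return EMAIL.matcher(text).find();
    }

    // Returns only the first matched substring, or null when there is none.
    static String firstEmail(String text) {
        Matcher m = EMAIL.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the combined text of an ancestor element.
        String ancestorText =
                "Paul Boersma Professor of Phonetic Sciences someone@example.com more page text";
        System.out.println(containsEmail(ancestorText)); // the "does it match?" question
        System.out.println(ancestorText);                // what printing the whole text shows
        System.out.println(firstEmail(ancestorText));    // the matched substring only
    }
}
```

Every ancestor of the element holding the address passes the "contains" check, which would explain the repeated output in the question.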

Stephan
  • Thank you for your help. While this seems to work for the url I provided, it fails for another url. What I ended up doing was using the Pattern class in Java to find the matches instead of Jsoup. I'll post my solution as an edit. – pushkin Apr 25 '14 at 16:53
  • @Pushkin The issue for the other url may come from the regex – Stephan Apr 25 '14 at 22:13
  • I believe my regex works, because I've tested it in Notepad++ (and it's successfully matched the pattern) and I used it in the new solution (posted as an edit) and that works too. Are there any issues with the regex that stand out to you? – pushkin Apr 26 '14 at 00:57
  • @Pushkin The regex used in the new solution is quite different from the original. – Stephan Apr 26 '14 at 03:41
  • I just took the \\s? out to simplify. The one with \\s? still had issues with some urls, but using the Pattern class fixes it. – pushkin Apr 26 '14 at 04:17

I solved this by running java.util.regex.Pattern over the page text instead of using a Jsoup selector for the matching:

Pattern pattern = Pattern.compile("[\\w-]+(\\.[\\w-]+)*\\s?@\\s?[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,4})");
Document doc = cleaner.clean(Jsoup.connect(url).get());
String text = doc.text();
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group());
}
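
As a self-contained sketch of the same idea, the snippet below runs the pattern over a plain string, so it needs no network access or Jsoup. The class name EmailFinder, the method findEmails, and the sample text are hypothetical. Collecting matches into a LinkedHashSet is an addition of mine, assuming you want each address printed only once when a page repeats it.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailFinder {
    // Same pattern as above, including the optional whitespace around '@'.
    private static final Pattern EMAIL = Pattern.compile(
            "[\\w-]+(\\.[\\w-]+)*\\s?@\\s?[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,4})");

    // Returns each distinct match once, in order of first appearance.
    static Set<String> findEmails(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for doc.text().
        String sample = "Contact: jane.doe@example.org or jane.doe@example.org, "
                + "also bob@mail.example.com";
        for (String email : findEmails(sample)) {
            System.out.println(email);
        }
    }
}
```

With a real page, the string passed to findEmails would be doc.text() from the code above.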
pushkin