
I am looking for a way to remove a sentence that contains a URL in Java. Note that I want to remove the entire sentence and not just the URL.

I found an approach that works, but I am looking for a simpler one, maybe with just one regex. What I do now:

  1. Detect each sentence (it can end with ., ? or !) using BreakIterator, as described in "Split string into sentences".
  2. Run a regex over each sentence to detect a URL, as described in "Detect and extract url from a string?". If one is found, remove the whole sentence.
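import java.text.BreakIterator;
import java.util.regex.Pattern;

// Assumed definitions (omitted from the original snippet):
BreakIterator iterator = BreakIterator.getSentenceInstance();
Pattern SENT = Pattern.compile("https?://\\S+"); // placeholder URL pattern; substitute your own

String source = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!";
iterator.setText(source);
int start = iterator.first();
int end = iterator.next();
while (end != BreakIterator.DONE) {
    if (SENT.matcher(source.substring(start, end)).find()) {
        // Cut the sentence out, then restart iteration on the shortened string.
        source = source.substring(0, start) + source.substring(end);
        iterator.setText(source);
        start = iterator.first();
    } else {
        start = end;
    }
    end = iterator.next();
}
System.out.println(source);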

This prints: Sorry, we are closed today. Thank you and have a nice day!

sindhunk

2 Answers


It would be best to split the text into sentences first, before passing it through an expression.

Then this expression would return only those lines (sentences) that do not contain a URL:

^(?!.*https?[^\s]+.*).*$

Here, we'd be defining a URL as https?[^\s]+.

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        final String regex = "^(?!.*https?[^\\s]+.*).*$";
        final String string = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!\n\n"
             + "Sorry, we are closed today. Visit our website tomorrow at. Thank you and have a nice day!\n\n"
             + "Sorry, we are closed today. Visit our website tomorrow at https://www.goog. Thank you and have a nice day!\n";

        // MULTILINE makes ^ and $ anchor at every line, so each line is tested separately.
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        // Only lines without a URL match. Blank lines also match, since they contain no URL.
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
        }
    }
}
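
The test above works on lines, not sentences. As a rough sketch of the "split first, then filter" idea, the expression could be applied to each sentence produced by BreakIterator, keeping only the sentences it matches. The class name SentenceUrlFilter, the method name dropUrlSentences, the use of Locale.US, and the DOTALL flag are assumptions made for this illustration, not part of the answer.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.regex.Pattern;

final class SentenceUrlFilter {
    // The expression from the answer: matches text that contains no "http(s)..." token.
    // DOTALL is an added assumption so that a sentence spanning a line break is still tested whole.
    private static final Pattern NO_URL =
            Pattern.compile("^(?!.*https?[^\\s]+.*).*$", Pattern.DOTALL);

    // Hypothetical helper: rebuilds the text from the sentences that contain no URL.
    static String dropUrlSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        StringBuilder kept = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end); // includes trailing whitespace
            if (NO_URL.matcher(sentence).matches()) {
                kept.append(sentence);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        System.out.println(dropUrlSentences(source));
    }
}
```

Appending to a StringBuilder avoids resetting the iterator after every removal, at the cost of still needing the explicit split that the question hoped to avoid.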

Emma
  • That regex says absolutely nothing about sentences. It matches _lines_ that contain a URL. – Amadan Jun 15 '19 at 00:10
  • I'd thought splitting them and stitching them up too, but that was something I was trying to avoid in general. This is a good alternate though. thanks! – sindhunk Jun 15 '19 at 00:32
0
"(?<=^|[?!.])[^?!.]+" + urlRegex + ".*?(?:$|[?!.])"

This will match each whole sentence (according to your definition of a sentence) that contains a match for urlRegex; you can use replaceAll to get rid of them. (There are many URL regexes around, and you didn't specify which one you were using, so I left the exact definition of a URL to you.)
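
A minimal sketch of how this might be wired up with replaceAll. The answer leaves urlRegex open, so the URL_REGEX below is only a hypothetical choice, as are the names UrlSentenceRemover and removeUrlSentences. One thing to watch: a URL pattern that ends in a greedy \S+ would swallow the period that closes the sentence, and the lazy tail would then run on into the next sentence, so this sketch forces the URL to end on a character other than ., ? or !.

```java
import java.util.regex.Pattern;

final class UrlSentenceRemover {
    // Hypothetical URL definition (not from the answer): "http://" or "https://"
    // followed by non-whitespace, ending on a character that is not ., ? or !
    // so that the sentence terminator stays outside the URL.
    static final String URL_REGEX = "https?://\\S*[^\\s.?!]";

    // The expression from the answer, with URL_REGEX spliced in.
    static final Pattern SENTENCE_WITH_URL = Pattern.compile(
            "(?<=^|[?!.])[^?!.]+" + URL_REGEX + ".*?(?:$|[?!.])");

    // Removes every sentence that contains a URL, including its terminator.
    static String removeUrlSentences(String text) {
        return SENTENCE_WITH_URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        System.out.println(removeUrlSentences(source));
    }
}
```

With the sample string from the question, this prints: Sorry, we are closed today. Thank you and have a nice day!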

Amadan
  • This is nice, it worked! Just curious, what does the "?<=^" indicate? – sindhunk Jun 15 '19 at 00:34
  • (?<=...) is a positive lookbehind zero-width assertion ("there has to be ... just before, but don't include it into the match"). `^` is start of string (or line if you use `Pattern.MULTILINE`). See [here](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) for more details. – Amadan Jun 15 '19 at 00:38