Strip off a sentence that contains a URL

Question

I am looking for a way to remove a sentence that contains a URL in Java. Note that I want to remove the entire sentence and not just the URL.

I found a way to do this and it works, but I am looking for a simpler way to do this, maybe with just one RegEx?

Detect each sentence (it can end with ., ? or !) using BreakIterator; see "Split string into sentences".
Use a regex to check every sentence for a URL pattern; see "Detect and extract url from a string?". If one is found, remove the whole sentence.

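import java.text.BreakIterator;
import java.util.Locale;
import java.util.regex.Pattern;

// SENT is the URL pattern; this simple definition is a placeholder for whichever URL regex you use.
Pattern SENT = Pattern.compile("(?:https?://|ftp://|www\\.)\\S+");
BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);

String source = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!";
iterator.setText(source);
int start = iterator.first();
int end = iterator.next();
while (end != BreakIterator.DONE) {
    if (SENT.matcher(source.substring(start, end)).find()) {
        // The sentence contains a URL: cut it out and rescan the shortened text from the beginning.
        source = source.substring(0, start) + source.substring(end);
        iterator.setText(source);
        start = iterator.first();
    } else {
        start = end;
    }
    end = iterator.next();
}
System.out.println(source);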

This prints : Sorry, we are closed today. Thank you and have a nice day!

A *particular* URL, or *any* URL? – Bohemian Jun 15 '19 at 00:03
Any URL. It can start with http, https, www, ftp, etc. – sindhunk Jun 17 '19 at 20:45

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

It would be best to split the text into sentences first, before passing it through an expression.

Then, with one sentence per line, this expression matches only those lines (sentences) that do not contain a URL:

^(?!.*https?[^\s]+.*).*$

Here, we'd be defining a URL as https?[^\s]+.

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "^(?!.*https?[^\\s]+.*).*$";
final String string = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!\n\n"
     + "Sorry, we are closed today. Visit our website tomorrow at. Thank you and have a nice day!\n\n"
     + "Sorry, we are closed today. Visit our website tomorrow at https://www.goog. Thank you and have a nice day!\n";

final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);

while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}
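
The test above works on lines, so it assumes the sentences have already been split. As a rough sketch of that splitting step, the following uses BreakIterator (as in the question) and keeps only the sentences in which a URL pattern does not occur. The class name SentenceFilter, the method removeUrlSentences, and the URL pattern itself are assumptions made for illustration, not part of the original answer.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.regex.Pattern;

public class SentenceFilter {
    // Assumed URL definition: http(s)://, ftp:// or www. followed by non-space characters.
    private static final Pattern URL = Pattern.compile("(?:https?://|ftp://|www\\.)\\S+");

    // Splits the text into sentences and concatenates those that contain no URL.
    public static String removeUrlSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        StringBuilder kept = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end);
            if (!URL.matcher(sentence).find()) {
                kept.append(sentence); // trailing whitespace is part of the sentence span
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        System.out.println(removeUrlSentences(source));
    }
}
```

Building the result in a StringBuilder avoids resetting the iterator after every removal, at the cost of still needing the split-and-stitch step the asker hoped to avoid.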

That regex says absolutely nothing about sentences. It matches _lines_ that contain a URL. – Amadan Jun 15 '19 at 00:10
I'd thought of splitting them and stitching them back together too, but that was something I was trying to avoid in general. This is a good alternative though, thanks! – sindhunk Jun 15 '19 at 00:32

Amadan · Accepted Answer · 2019-06-15T00:08:27.577

0

"(?<=^|[?!.])[^?!.]+" + urlRegex + ".*?(?:$|[?!.])"

This will match each whole sentence whose part matches urlRegex, according to your definition of a sentence; you can use replaceAll to get rid of them. (There are many URL regexes around, and you didn't specify which one you were using, so I left the exact definition of URL to you.)
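
A minimal sketch of how this could be wired up with replaceAll follows. The URL definition here is an assumption chosen for illustration: it ends in a character class that excludes ?, ! and ., so the URL does not swallow the sentence's closing punctuation (a greedy \S+ would, and the match would then run on into the next sentence). The class name UrlSentenceRegex and the method strip are likewise hypothetical.

```java
import java.util.regex.Pattern;

public class UrlSentenceRegex {
    // Assumed URL definition; the last character must not be whitespace or ?, !, .
    static final String URL_REGEX = "https?://\\S*[^\\s?!.]";

    // A sentence: starts after the beginning of the input or after ?, !, .
    // contains a URL, and runs lazily to the next terminator or the end of the input.
    static final Pattern SENTENCE_WITH_URL =
            Pattern.compile("(?<=^|[?!.])[^?!.]+" + URL_REGEX + ".*?(?:$|[?!.])");

    public static String strip(String text) {
        return SENTENCE_WITH_URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        System.out.println(strip(source));
    }
}
```

With the sample text from the question this prints "Sorry, we are closed today. Thank you and have a nice day!". Note that the leading space of the removed sentence is part of the match, which is why no double space is left behind.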

edited Jun 15 '19 at 00:08

answered Jun 15 '19 at 00:02

This is nice, it worked! Just curious, what does the "?<=^" indicate? – sindhunk Jun 15 '19 at 00:34
(?<=...) is a positive lookbehind zero-width assertion ("there has to be ... just before, but don't include it into the match"). `^` is start of string (or line if you use `Pattern.MULTILINE`). See [here](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) for more details. – Amadan Jun 15 '19 at 00:38
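
To see the lookbehind in isolation, here is a small sketch that reuses the same (?<=^|[?!.]) assertion to pick out the first word of each sentence. The class name LookbehindDemo, the method firstWords, and the sample text are made up for illustration. The punctuation that the lookbehind checks for is required, but it never becomes part of the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindDemo {
    // Matches optional whitespace plus one word, but only at the start of the
    // input or directly after ?, ! or . (which stay outside the match).
    private static final Pattern FIRST_WORD = Pattern.compile("(?<=^|[?!.])\\s*\\w+");

    public static List<String> firstWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = FIRST_WORD.matcher(text);
        while (m.find()) {
            words.add(m.group().trim());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(firstWords("Hi there. Bye now! Ok")); // [Hi, Bye, Ok]
    }
}
```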
