Exclude url pattern in regex

Question

This is my input string

<div>http://google.com</div><span data-user-info="{\"name\":\"subash\", \"url\" : \"http://userinfo.com?userid=33\"}"></span><a href="https://contact.me"></a>http://byebye.com is a dummy website.

In this case I need to match only the first and last occurrences of http, because from an HTML point of view those are inner text. Occurrences of http inside attribute values must be ignored. I built the following regex:

(?<!href=\"|src=\"|value=\"|href=\'|src=\'|value=\'|=)(http://|https://|ftp://|sftp://)

It works fine for the first and last occurrences, but it also matches the second occurrence of http. Links (http) inside attributes should not be matched.
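The behaviour described here can be reproduced with a short sketch. It is written in JavaScript only for convenience (lookbehind needs an ES2018-capable engine); the variable names are assumptions for illustration, and the pattern is the one from the question rewritten as a JavaScript literal.

```javascript
// The question's input; \\" in this source yields a literal \" in the string.
const input =
  '<div>http://google.com</div>' +
  '<span data-user-info="{\\"name\\":\\"subash\\", \\"url\\" : \\"http://userinfo.com?userid=33\\"}"></span>' +
  '<a href="https://contact.me"></a>http://byebye.com is a dummy website.';

// The question's first pattern; the lookbehind only rejects href=", src=", value=" (or ') and a bare "=".
const pattern =
  /(?<!href="|src="|value="|href='|src='|value='|=)(http:\/\/|https:\/\/|ftp:\/\/|sftp:\/\/)/g;

// Keep 20 characters from each match position so the hits are easy to recognise.
const hits = [...input.matchAll(pattern)].map(m => input.slice(m.index, m.index + 20));

console.log(hits.length); // 3
console.log(hits);
```

The URL inside data-user-info is preceded by \" rather than by one of the listed attribute prefixes, so the lookbehind does not reject it, while the href URL is rejected as intended.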

FYI: I also tried a negative lookahead, but that does not seem to help. This is the version with the negative lookahead.

(?<!href=\"|src=\"|value=\"|href=\'|src=\'|value=\'|=)(http://|https://|ftp://|sftp://).*?(?!>)

The link wasn't meant to point to a solution, but to show that you tagged your question with two different languages (which I see you have now corrected). Anyway, your question is a bit unclear (at least to me). "*I need to match only first and last occurrence of http*": for that we can use the `indexOf` and `lastIndexOf` methods of the String class, no regex needed. — Pshemo, Mar 20 '17 at 19:48
Yes. The problem is that we need to ignore the http occurrences inside attributes; in the example above those are the 2nd and 3rd http, which sit in attribute values. We need to match only the HTML inner text. — subash, Mar 20 '17 at 19:52
So is your goal finding links which are not HTML attributes? If yes, do you really want to find just first and last one or do you want to find all such links? — Pshemo, Mar 20 '17 at 19:56
I would avoid using regex with HTML (http://stackoverflow.com/a/1732454). Instead we can use an HTML parser like jsoup to parse the HTML and extract the text it represents (this gets rid of the HTML tags), something like: http://stackoverflow.com/a/15982332. Then we can safely use regex to extract links, as described here: http://stackoverflow.com/a/5713866 — Pshemo, Mar 20 '17 at 20:10
@Pshemo thanks for your suggestion. I will try an HTML parser. Will an HTML parser support plain text too? In the example above the last occurrence of http is plain text. — subash, Mar 20 '17 at 20:20
If you are asking about jsoup then yes, just use `Jsoup.parse(yourString)` as shown in the linked answer. Other parsers will most likely support it too, since it is one of the basic use cases. — Pshemo, Mar 20 '17 at 20:23
BTW, while it is nice to receive up-votes, let's try to avoid serial voting, since (1) the purpose of votes is to show how many people agree that a given solution is correct, not how many people are grateful for other posts, and (2) such votes may one day be removed, leaving a negative reputation balance for that day. If my advice was useful then you are welcome; if you still have a problem let me know, or maybe better, post a new question. — Pshemo, Mar 20 '17 at 20:40
@Pshemo, thanks. I have looked into jsoup. It is the perfect solution for my case, but unfortunately I am not going to use it, because jsoup takes considerably longer to build the node tree than a regex/indexOf text match; the time ratio is about 10:2. In most cases I have a very large string in which http occurs only once or twice, and those I can handle with regex/indexOf. I know jsoup would give 100% accurate output compared with the regex/indexOf mix, but my mentor requires performance over accuracy. — subash, Mar 21 '17 at 11:41
One possible solution could be removing all HTML tags manually with something like `html = html.replaceAll("<[^>]*>"," ")` and then searching the resulting text for links. But this simple mechanism can break if, for instance, your HTML contains JavaScript using the `<` or `>` operator, or if tags have attribute values that contain `>`. So, as you can see, this limits the possible input. Maybe there is a parser that handles text nodes on the fly while iterating over the HTML, so that you could append them to a StringBuilder; maybe JAXB, but I have never used it so I can't help more. — Pshemo, Mar 21 '17 at 14:46
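
A minimal sketch of the tag-stripping idea from the last comment, written in JavaScript to match the answer's snippets (in Java the same tag pattern would go through `String.replaceAll`). The helper name `findTextLinks` and the simplified URL pattern are assumptions for illustration. The tag regex is deliberately naive: it breaks as soon as a `>` appears inside an attribute value or an inline script.

```javascript
// Hypothetical helper: strip tags with a naive regex, then collect links from what is left.
function findTextLinks(html) {
  // Replace every <...> run with a space; assumes no ">" inside attribute values or scripts.
  const text = html.replace(/<[^>]*>/g, " ");
  // Simplified URL pattern (an assumption): a scheme followed by non-whitespace characters.
  const linkPattern = /\b(?:https?|s?ftp):\/\/\S+/g;
  // match() returns null when nothing is found, so fall back to an empty array.
  return text.match(linkPattern) || [];
}

// The question's input; \\" in this source yields a literal \" in the string.
const input =
  '<div>http://google.com</div>' +
  '<span data-user-info="{\\"name\\":\\"subash\\", \\"url\\" : \\"http://userinfo.com?userid=33\\"}"></span>' +
  '<a href="https://contact.me"></a>http://byebye.com is a dummy website.';

console.log(findTextLinks(input)); // [ 'http://google.com', 'http://byebye.com' ]
```

Because the whole `<span ...>` and `<a ...>` tags are removed before the link search, the URLs in their attributes never reach the second regex.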

pbalaga · Answer 1 · 2017-03-20T19:36:32.760

Update after having more details

Another approach is to take advantage of the regex's "greediness". /(http).*(http)/g will match as much text as possible, from the first to the last occurrence of "http". The example below illustrates this behavior. The (http) parts are capturing groups; replace them with your full regex. I simplified the regex to make it easier to follow.

var text ='<div>http://google.com</div><span data-user-info="{\"name\":\"subash\", \"url\" : \"http://userinfo.com?userid=33\"}"></span><a href="https://contact.me"></a>http://byebye.com is a dummy website.'
var regex = /(http).*(http)/g;
var match = regex.exec(text);
//match[0] is entire matched text
var firstMatch = match[1]; // = "http"
var lastMatch = match[2]; // = "http"

This example is specific to JavaScript, but Java regexes (and many other regex engines) work the same way, so (http).*(http) would work there too.
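
To make the greediness point concrete, here is a small sketch that contrasts the greedy `.*` with the lazy `.*?` and checks the result against `indexOf`/`lastIndexOf`. The shortened sample string and the variable names are assumptions for illustration only.

```javascript
// Shortened sample (an assumption) with three occurrences of "http".
const sample = 'http://a.com <a href="http://b.com"></a> http://c.com';

// Greedy: .* consumes as much as possible, so group 2 is the LAST "http".
const greedy = /(http).*(http)/.exec(sample);
// Lazy: .*? consumes as little as possible, so group 2 is the SECOND "http".
const lazy = /(http).*?(http)/.exec(sample);

// The full match ends right after group 2, so group 2 starts at (match end - group length).
const greedyLastIndex = greedy.index + greedy[0].length - greedy[2].length;
const lazySecondIndex = lazy.index + lazy[0].length - lazy[2].length;

console.log(greedyLastIndex === sample.lastIndexOf("http")); // true
console.log(lazySecondIndex === sample.indexOf("http", 1));  // true
```

Neither variant knows anything about attributes: the greedy pattern simply finds the first and last "http" in the string, which happen to be text nodes in the question's input.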

Do you aim to match the first and the last line or the first and the last occurrence of a string?

If the former is correct, I would split the text into lines first, and then regex-match the first and the last line.

//Split into lines:
var lines = yourMultiLineText.split(/[\r\n]+/g);

If the latter is correct, find all matches with your basic pattern and from the array of matches take the first and the last one, e.g.:

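//Match using a simpler regex (it needs the g flag so that match() returns every occurrence)
var matches = yourMultiLineText.match(yourRegex);
//Store the result here
var result;
//Make sure that there are at least 2 matches in total for this to make sense.
//match() returns null when nothing matches, so check for that first.
if(matches && matches.length > 1){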
   //Grab the first and the last match.
   result = [matches[0], matches[matches.length - 1]];
} else {
   result = [];
}

Oh, my bad. For clarification I wrote it as multiple lines, but it is actually a single-line string. Sorry, I will update the question. — subash, Mar 20 '17 at 19:06
@subash, in that case I would go either with the second approach as described initially or yet another suggestion - see my updated answer. — pbalaga, Mar 20 '17 at 19:38
