Get list of links from large text file

Question

I have a huge text file, 20k+ lines, and I want to extract links from it.

What I need is a regular expression that generates a clean list of links.

The links I need start with http:// (without www) and end with .html.

What would the expression look like?

Is Notepad++ an absolute requirement? grep would be perfect for this job. I'm thinking with Notepad++ you'll have to run the regex search/replace more than once (once to find and somehow flag your links, another time to delete all non-link text)... — Anssssss, Apr 10 '14 at 19:58
I found a way to do it, but with several steps: 1) Mark all lines that contain http://. 2) Delete all unmarked lines. 3) Find everything before http:// and delete it. 4) Find everything after .html and delete it. — Michael Rogers, Apr 10 '14 at 20:11
@user3521013 That's bound to produce errors sooner or later; your regex idea is way better. — deW1, Apr 10 '14 at 20:35
For more information, here's an answer about [matching urls with regex](http://stackoverflow.com/a/190405/2736496) from the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496), which is listed under "Common Validation Tasks > Internet". — aliteralmind, Apr 10 '14 at 20:44

deW1 · Accepted Answer · 2014-04-11T21:19:39.640

0

For general URLs (http or https) that end in .html, it would look like this:

(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}.+[a-zA-Z0-9\-\.]\.html

And to match exactly what you specified:

http\://[a-zA-Z0-9\-]+\.[a-z]{2,}\/[a-zA-Z0-9\-]+\.html

Then just Ctrl+X and Ctrl+V into a new file and you've got it.

It works in JavaScript, Notepad++, and so on.

\b matches a word boundary, so it finds whole words only: if the link stands on its own in the text, as in ewkgml http://test.com/a.html lamklwmwtmk, it will be found. \B is the negation of \b, so a link embedded in other characters, as in wegniwgnwkjnhttp://test.com/a.htmllmwtlkmt34lt, will match too. | is the "or" operator.
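As a rough illustration outside Notepad++, the stricter expression can be tried in Python. This is only a sketch: it writes the expression with the literal dots escaped, and the names LINK_RE, BOUNDED_RE, and extract_links, as well as the sample text, are made up for the example.

```python
import re

# Stricter expression from the answer, with the literal dots escaped.
LINK_RE = re.compile(r"http://[a-zA-Z0-9\-]+\.[a-z]{2,}/[a-zA-Z0-9\-]+\.html")

# Same expression with a leading \b, so the link must start at a word boundary.
BOUNDED_RE = re.compile(r"\b" + LINK_RE.pattern)

def extract_links(text, pattern=LINK_RE):
    """Return every substring of text that matches pattern, in order."""
    return pattern.findall(text)

sample = (
    "ewkgml http://test.com/a.html lamklwmwtmk\n"
    "wegniwgnwkjnhttp://test.com/b.htmllmwtlkmt34lt"
)

# Without \b, both the standalone and the embedded link are found.
print(extract_links(sample))
# With \b, the embedded link is skipped because "n" followed by "h" is not a word boundary.
print(extract_links(sample, BOUNDED_RE))
```

Leaving \b out is what lets the expression pick up links that are glued to surrounding characters.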

edited Apr 11 '14 at 21:19

answered Apr 10 '14 at 20:14

deW1


What do you mean by `(\b|\B)`? – Toto Apr 11 '14 at 15:22
@M42 I added the answer to your question to my answer. – deW1 Apr 11 '14 at 15:37
Then you could remove it :-) – Toto Apr 11 '14 at 17:07
Thank you. I just added it because for some reason my notepad selected more than just those strings. Or maybe it was just too late ;) Anyway, it works fine now without the b's. – deW1 Apr 11 '14 at 21:20

drkunibar · Answer 2 · 2014-04-11T14:59:47.820

0

In Notepad++, open the Replace dialog (Ctrl+H) and insert

.*?(http://.*?\.html).*?

in the Find what: input field and

$1\n

in the Replace with: input field.

You have to select the Regular expression search mode and check the ". matches newline" checkbox.

After you click Replace All, you get a list of all links, one per line.
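The same idea can be sketched in Python for anyone who prefers a script to the Replace dialog. This is a minimal sketch, not part of the Notepad++ procedure: the function name links_one_per_line and the sample text are assumptions, and re.DOTALL stands in for the ". matches newline" option.

```python
import re

# The capture group from the Find what expression; re.DOTALL lets "." match
# newlines, mirroring the ". matches newline" checkbox.
LINK_RE = re.compile(r"http://.*?\.html", re.DOTALL)

def links_one_per_line(text):
    """Return all lazily matched http://....html substrings, one per line."""
    return "\n".join(LINK_RE.findall(text))

sample = "foo http://a.com/x.html bar\nbaz http://b.org/y/z.html qux"
print(links_one_per_line(sample))
```

Because the match is lazy but may cross line breaks, an http:// that is not followed by .html on the same line will run on to the next .html in the text, so the result is worth a quick check.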

edited Apr 11 '14 at 14:59

answered Apr 10 '14 at 20:42

drkunibar

