I have a huge text file, 20k+ lines, and I want to extract links from it.

What I need is a regular expression that generates a clean list of links.

The links I need start with http:// (without www) and end with .html.

What would the expression look like?

admdrew
Michael Rogers
  • Is Notepad++ an absolute requirement? grep would be perfect for this job. I'm thinking with Notepad++ you'll have to run the regex search/replace more than once (once to find and somehow flag your links, another time to delete all non-link text)... – Anssssss Apr 10 '14 at 19:58
  • I found a way to do it, but with several steps: 1) Mark all lines that contain http://. 2) Delete all unmarked lines. 3) Find everything before http:// and delete it. 4) Find everything after .html and delete it. – Michael Rogers Apr 10 '14 at 20:11
  • @user3521013 That's bound to produce errors sooner or later; your regex idea is way better. – deW1 Apr 10 '14 at 20:35
  • For more information, here's an answer about [matching urls with regex](http://stackoverflow.com/a/190405/2736496) from the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496), which is listed under "Common Validation Tasks > Internet". – aliteralmind Apr 10 '14 at 20:44
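
The four manual steps described in the comments above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `extract_links_stepwise` and the sample text are made up here, and, like the manual procedure, it keeps at most one link per line.

```python
def extract_links_stepwise(text):
    """Mimic the manual steps: keep lines with a link, then trim both sides."""
    links = []
    for line in text.splitlines():
        start = line.find("http://")
        if start == -1:
            continue  # steps 1-2: drop lines that contain no http://
        end = line.find(".html", start)
        if end == -1:
            continue  # no .html after the scheme, so there is nothing to keep
        # steps 3-4: cut everything before http:// and everything after .html
        links.append(line[start:end + len(".html")])
    return links


# Hypothetical sample input
sample = "foo http://test.com/a.html bar\nno link here\nx http://example.org/b/c.html y"
print(extract_links_stepwise(sample))
```

A line holding two links would lose the second one, which is one reason a single regex pass tends to be more robust.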

2 Answers

For general links (http or https) that end in .html, it would look like this:

(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}.+?[a-zA-Z0-9\-\.]\.html

And to match exactly what you specified:

http\://[a-zA-Z0-9\-]+\.[a-z]{2,}\/[a-zA-Z0-9\-]+\.html

Then just Ctrl+X the matches and Ctrl+V them into a new file, and you've got it.

This works in JavaScript, Notepad++, and so on.
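
The same idea carries over to a short script if an editor is not required. Below is a minimal Python sketch. The pattern in it is a simplified assumption rather than the exact expression from this answer: it takes any run of non-whitespace characters after http://, stops at the first .html, and uses a negative lookahead to skip hosts that begin with www. The sample text is hypothetical.

```python
import re

# Assumed, simplified pattern: http:// not followed by "www.", then the
# shortest run of non-whitespace characters that ends in ".html".
LINK_PATTERN = re.compile(r"http://(?!www\.)\S+?\.html")


def extract_links(text):
    """Return every matching link in the order it appears in the text."""
    return LINK_PATTERN.findall(text)


# Hypothetical sample input
text = (
    "ewkgml http://test.com/a.html lamklwmwtmk https://x.com/b.html "
    "http://www.skip.com/c.html http://site.org/dir/page.html"
)
for link in extract_links(text):
    print(link)
```

For a real file, the text would come from something like `open(path, encoding="utf-8").read()`, where `path` is whatever file holds the 20k+ lines.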

\b matches a word boundary, so it finds the link only when it stands as a whole word, as in: ewkgml http://test.com/a.html lamklwmwtmk. \B is the negation of \b, so a link embedded in other text, as in wegniwgnwkjnhttp://test.com/a.htmllmwtlkmt34lt, will match too. | is the "or" operator.
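
The difference between \b and \B can be checked with the two example strings from this answer. This is a small Python sketch; the two patterns are assumptions made up for the demonstration (a simplified link pattern wrapped in \b or \B), not expressions taken from the answer.

```python
import re

# Assumed demo patterns: the same simplified link pattern, anchored on
# word boundaries (\b) or on non-boundaries (\B).
bounded = re.compile(r"\bhttp://\S+?\.html\b")
embedded = re.compile(r"\Bhttp://\S+?\.html\B")

spaced = "ewkgml http://test.com/a.html lamklwmwtmk"
glued = "wegniwgnwkjnhttp://test.com/a.htmllmwtlkmt34lt"

print(bounded.findall(spaced))   # link stands alone, so \b matches
print(bounded.findall(glued))    # link is glued to other letters, so \b fails
print(embedded.findall(spaced))  # \B fails next to the spaces
print(embedded.findall(glued))   # \B matches inside the run of letters
```

Dropping both anchors makes the pattern match in either situation, which is usually what is wanted when pulling links out of messy text.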

deW1

In Notepad++, open the Replace dialog (Ctrl+H) and enter

.*?(http://.*?\.html).*?

in the Find what: input field and

$1\n

in the Replace with: input field.

Select the Regular expression search mode and check the . matches newline checkbox.

After you click Replace All, you get a list of all the links, one per line.
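
To see what this replacement does, the same find and replace pair can be tried with Python's `re.sub`, using `re.DOTALL` as a stand-in for the ". matches newline" option. This is a sketch under assumptions: Python's regex engine is not the one Notepad++ uses, and the sample text is made up. In Python, at least, any text after the last link is left behind, so the sketch filters the result lines afterwards.

```python
import re

# Hypothetical sample input spanning two lines
text = "aaa http://a.com/x.html bbb\nmore http://b.com/y.html ccc"

# Same pattern as above; \1 is Python's spelling of $1.
result = re.sub(r".*?(http://.*?\.html).*?", r"\1\n", text, flags=re.DOTALL)
print(repr(result))

# Keep only the lines that are links, dropping the leftover tail.
links = [line for line in result.splitlines() if line.startswith("http://")]
print(links)
```

Because the trailing `.*?` is lazy, it matches nothing, so the text between links is consumed only by the leading `.*?` of the next match.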

drkunibar