Python web scraping, symbols meaning

Question

In the code below, what does each element of the pattern string in re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) mean?

import urllib2
import re

htmltext = urllib2.urlopen("https://en.wikipedia.org/wiki/Linkin_Park")
htmlread = htmltext.read()
htmlread = re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread)
regex = '(?<=Linkin Park was founded)(.*)(?=the following year.)'
pattern = re.compile(regex)
htmlread = re.findall(pattern, htmlread)
print "Linkin Park was founded" + htmlread[0] + "the following year."

http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean — Jean-François Fabre, Aug 10 '16 at 14:53

score 0 · Answer 1 · edited May 23 '17 at 11:51


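The line htmlread = re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) removes every occurrence of any of the following from htmlread:

anything between < and > (that is, an HTML tag), OR
a newline, OR
a number between square brackets, or empty square brackets (for example a footnote marker such as [1])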
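To go through the pattern element by element, here is a small sketch that writes the same three alternatives with re.VERBOSE so each piece can carry a comment. The name STRIP_PATTERN and the sample string are made up for illustration; they are not from the question.

```python
import re

# The question's pattern, split across lines with re.VERBOSE.
# STRIP_PATTERN is a hypothetical name used only in this sketch.
STRIP_PATTERN = re.compile(r"""
    <          # a literal '<'
    [^>]*      # zero or more characters that are NOT '>'
    >          # a literal '>'
  |            # OR
    [\n]       # a character class holding only the newline character
  |            # OR
    \[         # a literal '[' (escaped, since '[' normally opens a class)
    [0-9]*     # zero or more digits
    \]         # a literal ']'
""", re.VERBOSE)

# Made-up sample text standing in for the downloaded page.
sample = "<p>Linkin Park[1] is a band.[]</p>\n"
cleaned = STRIP_PATTERN.sub("", sample)
print(cleaned)
```

Because [0-9]* allows zero digits, empty brackets such as [] are removed as well, which matches the third item in the list.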

There is an interesting community wiki post on this topic: Reference - What does this regex mean?

answered Aug 10 '16 at 14:55 by Jean-François Fabre, edited May 23 '17 at 11:51 by Community

score 0 · Answer 2 · answered Aug 10 '16 at 14:55


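re.sub replaces every match of the pattern with '' (the empty string), which means the matched text is deleted from the htmlread variable. Characters that do not match the pattern are left untouched.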
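To make concrete what replacing with '' does, here is a minimal sketch using re.subn, which returns the new string together with the number of replacements made. The sample text is invented for the example.

```python
import re

# Invented sample text, not taken from the Wikipedia page.
text = "founded in 1996[2]\nby <b>three</b> friends"

# re.subn works like re.sub but also reports how many matches were replaced.
cleaned, n_removed = re.subn(r'<[^>]*>|[\n]|\[[0-9]*\]', '', text)

print(repr(cleaned))
print(n_removed)
```

Only the four matches ([2], the newline, <b> and </b>) disappear. Note that deleting the newline joins "1996" and "by" with no space between them, which may or may not be what you want.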

Please read more about regular expressions.

answered Aug 10 '16 at 14:55 by Maciej Wolski
