Using regular expressions to parse HTML

Question

I am new to Python. A coder helped me out by giving me some code to parse HTML. I'm having trouble understanding how it works. My idea is for it to grab (consume?) HTML from funtweets.com/random and basically tell me a funny joke in the morning as an alarm clock. It currently extracts all jokes on the page and I only want one. Either modifying the code or a detailed explanation as to how the code works would be helpful to me. This is the code:

import re 
import urllib2

page = urllib2.urlopen("http://www.m.funtweets.com/random").read() 
user = re.compile(r'<span>@</span>(\w+)') 
text = re.compile(r"</b></a> (\w.*)") 
user_lst =[match.group(1) for match in re.finditer(user, page)] 
text_lst =[match.group(1) for match in re.finditer(text, page)] 
for _user, _text in zip(user_lst, text_lst):
    print '@{0}\n{1}\n'.format(_user,_text)

and here we go one more time... please read this http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — jambriz, Apr 23 '14 at 21:35
@jambriz that answer doesn't help at all. Please read [this meta thread](http://meta.stackexchange.com/questions/182189) — HamZa, Apr 23 '14 at 21:52
Why don't you ask the coder to explain it to you? Otherwise, check the python manual and this [regex reference](http://stackoverflow.com/questions/22937618) — HamZa, Apr 23 '14 at 21:54
@HamZa That's probably why he didn't write it as an answer, even the meta acknowledges it has "historical value". Incidentally, the 3rd answer to that old SO has a very specific explanation of why regexes can't parse HTML, and the 4th answer provides a specific example. — Endophage, Apr 23 '14 at 22:57
Just because [you **can** indeed use patterns to parse HTML](http://stackoverflow.com/a/4234491/471272) doesn’t mean you should. — tchrist, Jun 06 '14 at 22:41

zx81 · Accepted Answer · 2014-04-24T11:00:24.800

0

user3530608, so you want a single match instead of iterating through all the matches?

This is a nice way to get started with Python regular expressions.

Here is a small tweak to your code. I don't have Python in front of me to test it, so let me know if you run into any issues.

import re 
import urllib2

page = urllib2.urlopen("http://www.m.funtweets.com/random").read() 
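# re.search returns only the first match (or None if nothing matches).
umatch = re.search(r"<span>@</span>(\w+)", page)
# group() with no argument returns the whole match, tags included;
# group(1) would return only the captured user name.
user = umatch.group()
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group()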
print '@{0}\n{1}\n'.format(user,text)
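
To make the difference between the whole match and the captured group concrete, here is a minimal sketch in Python 3 that runs on a string instead of the live page. SAMPLE_PAGE is made-up markup that only imitates what the two patterns expect, and first_joke and all_jokes are illustrative helper names, not part of the original code.

```python
import re

# Hypothetical sample markup; the real page may be structured differently.
SAMPLE_PAGE = (
    '<a href="/u/alice"><b><span>@</span>alice</b></a> Why did the chicken cross the road?\n'
    '<a href="/u/bob"><b><span>@</span>bob</b></a> To get to the other side.\n'
)

USER_RE = re.compile(r"<span>@</span>(\w+)")
TEXT_RE = re.compile(r"</b></a> (\w.*)")


def first_joke(page):
    """Return (user, text) for the first joke on the page, or None."""
    user_match = USER_RE.search(page)  # search() stops at the first match
    text_match = TEXT_RE.search(page)
    if user_match is None or text_match is None:
        return None
    # group(0) is the whole match, tags included;
    # group(1) is only the part inside the parentheses.
    return user_match.group(1), text_match.group(1)


def all_jokes(page):
    """Return every (user, text) pair, like the original finditer loops."""
    return list(zip(USER_RE.findall(page), TEXT_RE.findall(page)))


if __name__ == "__main__":
    print(USER_RE.search(SAMPLE_PAGE).group(0))  # whole match, with the tags
    print(first_joke(SAMPLE_PAGE))               # captured groups only
```

In Python 3 the downloaded page arrives as bytes, so it would need to be decoded to a string before being passed to these helpers.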

edited Apr 24 '14 at 11:00

answered Apr 23 '14 at 22:06


Thank you, that code worked just fine, except that it also returns the surrounding HTML tags, but getting rid of those shouldn't be too hard. Thanks again. @all the users who posted useless links that said "it can't be done" while not explaining a better way to do it: it obviously can be done, because this code works just fine. – user3530608 Apr 24 '14 at 12:39
@user3530608 You're welcome, it was a pleasure. Hey, I notice that you haven't yet voted on StackOverflow. If this answer or another answer solves your problem, please consider "accepting it" by clicking the checkmark and arrow on the left, as this is how the reputation system works. Of course there is no obligation to do so. Later when you have more reputation you can also upvote questions. Thanks for listening to my 20-second SO tutorial. :) – zx81 Apr 24 '14 at 19:16

score 0 · Answer 2 · answered Apr 24 '14 at 11:24

Although you can parse HTML with regular expressions, I strongly suggest using a third-party Python library instead.

My favorite HTML parsing library is PyQuery, which you can use much like jQuery. For example:

from pyquery import PyQuery as pq
page=pq(url='http://www.m.funtweets.com/random')
users=page("#user_id")
a_first=page("a:first")
...

You can find it here: https://pypi.python.org/pypi/pyquery

Just:

pip install PyQuery
or 
easy_install PyQuery

You'll love it!

Another HTML parsing library: https://pypi.python.org/pypi/beautifulsoup4/4.3.2
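
For comparison, the parser-based approach can also be sketched with only the standard library's html.parser, swapped in here for PyQuery so that nothing has to be installed. The class name, the helper, and the `<div class="tweet">` markup are all assumptions made for illustration; the real page may use different elements and classes.

```python
from html.parser import HTMLParser

# Hypothetical markup; adjust the tag and class to match the real page.
SAMPLE = (
    '<div class="tweet"><a href="/u/alice"><b><span>@</span>alice</b></a>'
    ' Why did the chicken cross the road?</div>'
    '<div class="tweet"><a href="/u/bob"><b><span>@</span>bob</b></a>'
    ' Tom &amp; Jerry walk into a bar.</div>'
)


class TweetTextParser(HTMLParser):
    """Collect the visible text of every <div class="tweet"> element."""

    def __init__(self):
        super().__init__()
        self.tweets = []    # finished tweet texts
        self._depth = 0     # greater than 0 while inside a tweet div
        self._chunks = []   # text pieces of the tweet being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "div":
                self._depth += 1  # nested div inside a tweet
        elif tag == "div" and ("class", "tweet") in attrs:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and normalise whitespace.
                self.tweets.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def first_tweet(html_text):
    """Return the text of the first tweet div, or None if there is none."""
    parser = TweetTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.tweets[0] if parser.tweets else None


if __name__ == "__main__":
    print(first_tweet(SAMPLE))
```

The parser hands over text with the tags already removed and entities such as `&amp;` already decoded, so no separate clean-up step is needed.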

score 0 · Answer 3 · answered Apr 24 '14 at 13:38

If anyone is interested in getting only one joke from the page, with the HTML tags removed, here is the final code:

import re 
import urllib2
def remove_html_tags(text):
    # Removes only the literal '</b></a>' that the match starts with,
    # not HTML tags in general.
    pattern = re.compile(r'</b></a>') 
    return pattern.sub('', text) 

page = urllib2.urlopen("http://www.m.funtweets.com/random").read() 
umatch = re.search(r"<span>@</span>(\w+)", page) 
user = umatch.group() 
utext = re.search(r"</b></a> (\w.*)", page) 
text = utext.group()
print remove_html_tags(text)
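
If the joke itself can contain other tags or entities, a slightly more general clean-up might look like the sketch below (Python 3). TAG_RE and strip_tags are names made up for this example, and the regex is a heuristic that can be fooled by a `>` inside an attribute value.

```python
import html
import re

# Heuristic: anything between '<' and the next '>' is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove anything that looks like an HTML tag and decode entities."""
    without_tags = TAG_RE.sub("", fragment)
    # html.unescape turns entities such as &amp; back into plain characters.
    return html.unescape(without_tags).strip()


if __name__ == "__main__":
    print(strip_tags("</b></a> Tom &amp; Jerry walk into a <i>bar</i>."))
```

Using group(1) on the match avoids the leading `</b></a>` in the first place; this helper only matters for markup inside the captured text.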
