I am scraping a webpage that contains HTML that looks like this in the browser:

<td>LGG&reg; MAX multispecies probiotic consisting of four bacterial trains</td>
<td>LGG® MAX helps to reduce gastro-intestinal discomfort</td>

Taking just the LGG®: in the first instance the source code contains the entity LGG&reg;, while in the second instance ® is written as a literal ® character in the source code.

I am using Python 2.7, mechanize and BeautifulSoup.

My difficulty is that the &reg; is picked up by mechanize, carried through unchanged, and ultimately printed out or written to file.

There are many other special characters. Some are 'converted' on output, and each ® is turned into a muddle of characters.

The webpage is declared as UTF-8, and the only place I refer to an encoding is when I open my output file, where I declare UTF-8. If I don't, writing to the file fails on other characters.

I am working on Windows 7. Other details:

>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp850'
>>> locale.getdefaultlocale()
('en_GB', 'cp1252')
>>>
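One possible reading of the "muddle", assuming the text reaches the console as UTF-8 bytes while the console decodes them with cp850 (the `sys.stdout.encoding` shown above): the two bytes that make up ® come out as two unrelated characters. This is a minimal sketch of that mismatch, not a diagnosis of the actual script; the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: how UTF-8 bytes look when decoded with the console code page.
text = u"LGG\xae MAX"                 # u"\xae" is the registered sign

utf8_bytes = text.encode("utf-8")     # the sign becomes two bytes: C2 AE
muddled = utf8_bytes.decode("cp850")  # what a cp850 console would display

# Encoding explicitly for the console avoids the mismatch; "replace"
# substitutes "?" for characters cp850 cannot represent (e.g. the TM sign).
console_bytes = text.encode("cp850", "replace")
trademark_bytes = u"\u2122".encode("cp850", "replace")
```

Here `muddled` is one character longer than `text`, because the single ® has been read back as two characters.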

Can anyone give me any tips on the best way to handle the special characters? Or should they be called HTML entities? This must be a fairly common problem but I haven't been able to find any straightforward explanations on the web.

UPDATE: I've made some progress here. The basic algorithm is

  1. Read the webpage in mechanize
  2. Use BeautifulSoup to do what... as I write it down, I have no idea what this pre-processing stage is for, exactly.
  3. Use BeautifulSoup to extract information from a table that is orderly apart from its treatment of special characters.
  4. Write the information to file delimited by | to account for punctuation in long cell entries and to allow for importing into Excel etc.

The progress is in stage 3. I've used a regex and htmlentitydefs to convert the entities, cell entry by cell entry. See this blog post.
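The regex-plus-entity-table approach described here might look roughly like the sketch below. It is not the code from the blog post; `decode_entities` and `ENTITY_RE` are hypothetical names, and the import fallbacks are only there so the same snippet runs on Python 2.7 and Python 3.

```python
import re

try:
    from htmlentitydefs import name2codepoint    # Python 2
except ImportError:
    from html.entities import name2codepoint     # Python 3

try:
    unichr                                       # Python 2 built-in
except NameError:
    unichr = chr                                 # Python 3 equivalent

# Matches &#xAE; (hex), &#174; (decimal) and &reg; (named) references.
ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace named and numeric HTML entities in a unicode string."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2].lower() == "#x":
                return unichr(int(body[2:], 16))
            if body.startswith("#"):
                return unichr(int(body[1:]))
            return unichr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            return match.group(0)   # leave unknown entities untouched
    return ENTITY_RE.sub(replace, text)
```

Calling `decode_entities(u"LGG&reg; MAX")` yields a unicode string containing the real ® character, so both table cells end up with the same text.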

Remaining difficulty: the text written to file (and printed to screen) is still incorrect, but it appears that the problem is now a matter of specifying the encoding correctly. The problem seems smaller, at least.
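For the file-writing stage, one option is to keep every cell as a unicode string and let the file object do the encoding. The sketch below assumes the cells have already been extracted and decoded; `rows` and the temporary output path are placeholders, not part of the original script.

```python
import io
import os
import tempfile

# Placeholder data standing in for the cells extracted from the table.
rows = [
    [u"LGG\xae MAX", u"helps to reduce gastro-intestinal discomfort"],
]

path = os.path.join(tempfile.mkdtemp(), "out.txt")  # hypothetical output path

# io.open accepts an encoding argument in both Python 2.7 and Python 3,
# so unicode strings are encoded to UTF-8 as they are written.
with io.open(path, "w", encoding="utf-8") as out:
    for row in rows:
        out.write(u"|".join(row) + u"\n")

# Read the file back with the same encoding to check the round trip.
with io.open(path, "r", encoding="utf-8") as check:
    written = check.read()
```

Printing to the Windows console is a separate matter, since the console uses its own code page rather than UTF-8.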

  • There is a page about how the markup system works: http://stackoverflow.com/editing-help It's linked from the question mark in the top-right when you edit something. – Martin Geisler Jan 14 '12 at 18:22
  • to convert html entities you could use [`unescape()` function](http://effbot.org/zone/re-sub.htm#unescape-html) – jfs Jan 14 '12 at 18:34
  • [Python, Windows console and Unicode don't play nice](http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console) – jfs Jan 14 '12 at 19:15

1 Answer

To answer the question from the title:

# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup

html = u"""
<td>LGG&reg; MAX multispecies probiotic consisting of four bacterial trains</td>
<td>LGG® MAX helps to reduce gastro-intestinal discomfort</td>
"""

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
print(''.join(soup('td', text=True)))

Output

LGG® MAX multispecies probiotic consisting of four bacterial trains
LGG® MAX helps to reduce gastro-intestinal discomfort