Being new to Python, I spent about an hour trying to extract a string from a heading inside a div, using Python 2.7.x and Beautiful Soup:

import urllib2  # Python 2.7 standard library
from bs4 import BeautifulSoup

# Fetch the page, then parse the response body
request = urllib2.Request("http://somerandomurl.org")
response = urllib2.urlopen(request)
soup = BeautifulSoup(response, "html.parser")

The HTML file looks like this:

<div class="ABC">
    <h1>My string</h1>
</div>

I can't list everything from the Beautiful Soup documentation that I tried (including print soup.div('ABC').h1 …), but I assume I misread something badly. Thank you for your help.

lejonet

1 Answer

You wanted:

soup.find('div', class_='ABC').h1

This finds the first div tag with the ABC class, then navigates to the first h1 tag inside it:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <div class="ABC">
...     <h1>My string</h1>
... </div>
... ''')
>>> soup.find('div', class_='ABC').h1
<h1>My string</h1>
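
Since the question asks for the string rather than the tag, here is a minimal sketch of one way to get it. It assumes the HTML from the question; the helper name heading_text is made up for illustration, and 'html.parser' is just one possible parser choice. find() returns None when nothing matches, so the helper checks for that before touching .h1:

```python
from bs4 import BeautifulSoup

html = '''
<div class="ABC">
    <h1>My string</h1>
</div>
'''
# Naming the parser explicitly avoids depending on whichever one is installed.
soup = BeautifulSoup(html, 'html.parser')


def heading_text(soup, class_name):
    """Hypothetical helper: text of the first <h1> inside the first
    <div> with the given class, or None if either tag is missing."""
    div = soup.find('div', class_=class_name)
    if div is None or div.h1 is None:
        return None
    # get_text(strip=True) drops surrounding whitespace from the text
    return div.h1.get_text(strip=True)


text = heading_text(soup, 'ABC')
missing = heading_text(soup, 'no-such-class')
print(text)
```

Returning None for a missing div or h1 is only one option; raising an exception with a clearer message would work just as well.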
Martijn Pieters
  • Thank you. In my example this gives me the following error: "AttributeError: 'NoneType' object has no attribute 'h1'" – lejonet Mar 12 '13 at 22:13
  • @lejonet8: then you do *not* have a div with that class in your input HTML. – Martijn Pieters Mar 12 '13 at 22:15
  • XYZ

    The original class name contains a `-`. Might this be a problem?
    – lejonet Mar 12 '13 at 22:21
  • @lejonet8: does `soup.find('div')` return anything? Does `soup.find(class_='ABC')` match anything? You need to narrow this down. No, a dash (`-`) in the class name does not make a difference. – Martijn Pieters Mar 12 '13 at 22:22
  • It works, the problem was sitting in front of the computer. Thank you again. – lejonet Mar 12 '13 at 23:25
  • Martijn, please allow one additional question concerning the soup.find approach: when the class changes slightly between documents (e.g. `ABC1`, `ABC2` or `ABC3`), is there a way to handle this with BeautifulSoup options? Thank you. – lejonet Mar 13 '13 at 00:55
  • Yes, you can use a regular expression to match the class; `import re` then `.find('div', class_=re.compile(r'ABC\d'))` would match all those. – Martijn Pieters Mar 13 '13 at 08:08
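
The regular-expression suggestion in the last comment can be sketched roughly as follows. The sample HTML and heading texts are invented for illustration. The pattern here is anchored with ^ and $, which is an assumption on my part: Beautiful Soup applies the pattern with search(), so the unanchored r'ABC\d' from the comment would also match a class such as XABC12.

```python
import re
from bs4 import BeautifulSoup

# Invented sample input: two matching classes and one that should be skipped.
html = '''
<div class="ABC1"><h1>First</h1></div>
<div class="ABC2"><h1>Second</h1></div>
<div class="other"><h1>Ignored</h1></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Match exactly "ABC" followed by a single digit.
pattern = re.compile(r'^ABC\d$')

# find_all returns every matching div; find would return only the first.
divs = soup.find_all('div', class_=pattern)

# Skip any matching div that has no <h1>, since div.h1 would be None there.
titles = [div.h1.get_text() for div in divs if div.h1 is not None]
print(titles)
```

If only the first match is needed, swapping find_all for find keeps the one-liner from the comment.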