
I've rechecked my code and looked at comparable examples of opening a URL and passing the web data into Beautiful Soup, but for some reason my code just doesn't return anything, even though it appears to be in the correct form:

>>> from bs4 import BeautifulSoup
>>> from urllib3 import poolmanager
>>> connectBuilder = poolmanager.PoolManager()
>>> content = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')
>>> content
<urllib3.response.HTTPResponse object at 0x00000000032EC390>
>>> soup = BeautifulSoup(content)
>>> soup.title
>>> soup.title.name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'name'
>>> soup.p
>>> soup.get_text()
''
>>> content.data
a stream of data follows...

As shown, urlopen() returns an HTTP response, which is captured in the variable content. It makes sense that I can read the status of the response, but after it's passed into Beautiful Soup, the web data doesn't seem to get converted into a Beautiful Soup object (the variable soup). You can see that I've tried to read a few tags and the text; get_text() returns an empty string, which is strange.

Strangely, when I access the web data via content.data, the data shows up, but it's not useful since I can't use Beautiful Soup to parse it. What is my problem? Thanks.

user3885774
  • It clearly is getting converted to a `BeautifulSoup` object—otherwise, `soup.title` would have raised an exception rather than giving you `None`. A better way to tell is to print out `type(soup)`. – abarnert Jul 31 '14 at 19:47
  • your code is getting nothing, try printing content.read() – Padraic Cunningham Jul 31 '14 at 19:55
  • Is there a reason you're manually constructing a pool and then calling ["the lowest level call for making a request"](http://urllib3.readthedocs.org/en/latest/pools.html?highlight=urlopen#urllib3.connectionpool.HTTPConnectionPool.urlopen) on it? – abarnert Jul 31 '14 at 19:59
  • @abarnert I see, thanks. – user3885774 Jul 31 '14 at 20:33
  • @PadraicCunningham content.read() gives b'' – user3885774 Jul 31 '14 at 20:37
  • b for bytes and an empty string – Padraic Cunningham Jul 31 '14 at 20:38
  • @abarnert I'd also looked that the module and read about the lowest level but I didn't understand it well and thought urlopen() was the lowest level, so I chose the latter. – user3885774 Jul 31 '14 at 20:40
  • @user3885774: Yes, `urlopen` is the lowest level. Unless you have some good reason, you do not want to use the lowest level. Especially if you're just learning. That's why that same documentation recommends, at least twice, that you use one of the convenience methods. While you _could_ learn all the nitty-gritty details of how `urllib3` works under the covers, wouldn't you rather first learn how to use it the easy way, and write some working code you can play with to learn further? – abarnert Jul 31 '14 at 20:50
  • @abarnert Agreed, I didn't interpret/understand the module's notes well. : ) – user3885774 Jul 31 '14 at 20:59

4 Answers


If you just want to scrape the page, requests will get the content you need:

from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.crummy.com/software/BeautifulSoup/')
soup = BeautifulSoup(r.content)

In [59]: soup.title
Out[59]: <title>Beautiful Soup: We called him Tortoise because he taught us.</title>

In [60]: soup.title.name
Out[60]: 'title'
Padraic Cunningham

urllib3 returns a Response object whose .data attribute contains the preloaded body payload.

Per the top quickstart usage example here, I would do something like this:

import urllib3
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/')

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.data)  # Note the use of the .data property
...

The rest should work as intended.

--

A little about what went wrong in your original code:

You passed the entire response object rather than the body payload. Normally this would be fine, because the response object is a file-like object, except that in this case urllib3 has already consumed the entire response and parsed it for you, so there is nothing left to .read(). It's like passing a file pointer that has already been read to the end. .data, on the other hand, gives you access to the already-read data.
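The "already-read file pointer" analogy can be sketched with a plain stdlib `BytesIO` buffer standing in for the response body (an illustration only, no urllib3 or network involved):

```python
import io

# A buffer standing in for an HTTP response body.
buf = io.BytesIO(b'<html><title>hi</title></html>')

# Simulate preloading: the library reads the whole body up front.
preloaded = buf.read()

# A second read from the same "file pointer" finds nothing left,
# which is why BeautifulSoup ended up parsing an empty document.
leftover = buf.read()

print(preloaded)  # b'<html><title>hi</title></html>'
print(leftover)   # b''
```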

If you want to use urllib3 response objects as file-like objects, you'll need to disable content preloading, like this:

response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/', preload_content=False)
soup = BeautifulSoup(response)  # We can pass the original `response` object now.

Now it should work as you expected.

I understand that this is not very obvious behaviour, and as the author of urllib3 I apologize. :) We plan to make preload_content=False the default someday. Perhaps someday soon (I opened an issue here).

--

A quick note on .urlopen vs .request:

.urlopen assumes that you will take care of encoding any parameters passed to the request. In this case it's fine to use .urlopen because you're not passing any parameters to the request, but in general .request will do all the extra work for you so it's more convenient.
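To illustrate the difference, here is a sketch of the kind of parameter encoding `.urlopen` leaves to you, done by hand with the stdlib (the search URL and parameters are made up for the example):

```python
from urllib.parse import urlencode

# With .urlopen you would encode GET parameters yourself...
params = {'q': 'beautiful soup', 'page': 2}
query = urlencode(params)
url = 'http://www.crummy.com/search?' + query

print(url)  # http://www.crummy.com/search?q=beautiful+soup&page=2

# ...whereas .request accepts them directly, e.g.:
#   http.request('GET', 'http://www.crummy.com/search', fields=params)
```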

If anyone would be up for improving our documentation to this effect, that would be greatly appreciated. :) Please send a PR to https://github.com/shazow/urllib3 and add yourself as a contributor!

shazow
  • I really appreciate your explanations, I admit I had no idea what content preloading was in exact terms. I'm new to Python and related items, while I knew that URL params were often needed for more precise operations, I thought that urlopen was more basic and a standard/preferred method. : ) – user3885774 Jul 31 '14 at 21:31
  • No worries, your experience is useful feedback for me. :) – shazow Aug 01 '14 at 22:26

As shown, it's clear that urlopen() returns an HTTP response which is captured by the variable content…

What you've called content isn't the content, but a file-like object that you can read the content from. BeautifulSoup is perfectly happy taking such a thing, but it's not very helpful to print it out for debugging purposes. So, let's actually read the content out of it to make this easier to debug:

>>> response = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')
>>> response
<urllib3.response.HTTPResponse object at 0x00000000032EC390>
>>> content = response.read()
>>> content
b''

This should make it pretty clear that BeautifulSoup is not the problem here. But continuing on:

… but after it's passed into Beautiful Soup, the web data doesn't get converted into a Beautiful Soup object (variable soup).

Yes it does. The fact that soup.title gave you None instead of raising an AttributeError is pretty good evidence, but you can test it directly:

>>> type(soup)
bs4.BeautifulSoup

That's definitely a BeautifulSoup object.

When you pass BeautifulSoup an empty string, exactly what you get back depends on which parser it's using under the covers; if it's relying on the Python 3.x stdlib parser, what you'll get is an html node with an empty head, an empty body, and nothing else. So, when you look for a title node, there isn't one, and you get None.
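A stdlib-only sketch of why that happens: Python's own `html.parser` emits no tag events at all for an empty string, so there is nothing for the tree builder to hang a `title` on:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records every start tag the parser encounters."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

parser = TagCollector()
parser.feed('')      # the empty "document" BeautifulSoup received
parser.close()
print(parser.tags)   # []
```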


So, how do you fix this?

As the documentation says, you're using "the lowest level call for making a request, so you’ll need to specify all the raw details." What are those raw details? Honestly, if you don't already know, you shouldn't be using this method. Teaching you how to deal with the under-the-hood details of urllib3 before you even know the basics would not be doing you a service.

In fact, you really don't need urllib3 here at all. Just use the modules that come with Python:

>>> # on Python 2.x, instead do: from urllib2 import urlopen 
>>> from urllib.request import urlopen
>>> r = urlopen('http://www.crummy.com/software/BeautifulSoup/')
>>> soup = BeautifulSoup(r)
>>> soup.title.text
'Beautiful Soup: We called him Tortoise because he taught us.'
abarnert
  • Thanks, but when I tried further parsing, I didn't get anything like soup.find_all(True) and soup.get_text(), so I was confused. – user3885774 Jul 31 '14 at 20:44
  • @user3885774: That's what my last paragraph explains: you may have an empty soup, or a soup with just an `html` node with an empty `head` and `body`, but it really doesn't matter; there's no useful data, so who cares exactly how that lack of useful data is represented? – abarnert Jul 31 '14 at 20:48
  • urllib3 actually returns a file-liked object but it's consumed by default (this is not ideal, as I mentioned in my answer below and opened an issue). To fix that, use preload_content=False in the request parameter. – shazow Jul 31 '14 at 21:18
  • @shazow: Or, more simply, just use `r.data`, which is where the preloaded content goes. Or, even more simply, don't use `urllib3` if you don't need it and it's too complicated for you to find what you need in the docs… – abarnert Jul 31 '14 at 21:45
  • @abarnert Or give the author of urllib3 feedback for how to make it not too complicated so that he can fix it. :) Or even more preferred, come help with improving it! – shazow Aug 01 '14 at 22:27
  • @shazow: Honestly, I've only ever really looked at `urllib3` twice. Both times, I expected `requests` to be able to do something for me like magic, and it couldn't, so I looked under the covers, saw that `urllib3` made it easy to do what I wanted, and wrote a patch to expose the behavior to `requests`. Both times, I didn't see anything to be unhappy about in `urllib3`, so I don't have any real suggestions for improving it. But I did reply to your #436. – abarnert Aug 03 '14 at 05:42

My Beautiful Soup code was working in one environment (my local machine) and returning an empty list in another (an Ubuntu 14 server).

I resolved the problem by changing the installation; details are in another thread:

Html parsing with Beautiful Soup returns empty list

Community
  • Note that [link-only answers](http://meta.stackoverflow.com/tags/link-only-answers/info) are discouraged, SO answers should be the end-point of a search for a solution (vs. yet another stopover of references, which tend to get stale over time). Please consider adding a stand-alone synopsis here, keeping the link as a reference. – kleopatra Jul 24 '15 at 22:21