I'm messing around with web scraping using requests and BeautifulSoup, and I'm getting some odd results when I loop through multiple pages of message board data by incrementing the page number by 1 on each iteration.

The code below is an example where I loop through page 1 of the message board and then page 2. Just to check myself, I print the URL I'm hitting and then the first record found on that page. The URLs look correct, but the first post is the same for both. If I copy and paste those two URLs into a browser, though, I definitely see a different set of content on each page.

Can anyone tell me if this is a problem with my code or if it has something to do with how the data is structured on that forum that is giving me these results? Thanks in advance!

from bs4 import BeautifulSoup

import requests

n_pages = 2
base_link = 'http://tigerboard.com/boards/list.php?board=4&page='

for i in range(1, n_pages + 1):
    link = base_link + str(i)
    html_doc = requests.get(link)
    soup = BeautifulSoup(html_doc.text, "lxml")
    bs_tags = soup.find_all("div", {"class": "msgline"})
    # Reset the list on every page so posts[0] is that page's first post
    posts = []
    for post in bs_tags:
        posts.append(post.text)
    print(link)
    print(posts[0])

>     http://tigerboard.com/boards/list.php?board=4&page=1
>     52% of all websites are in English, but  - catbirdseat MU - 3/23/17 14:41:06
>     http://tigerboard.com/boards/list.php?board=4&page=2
>     52% of all websites are in English, but  - catbirdseat MU - 3/23/17 14:41:06
mizzou541
  • `posts` is a list of posts on the current page, not a cumulative list of all posts seen so far. – John Gordon Mar 23 '17 at 20:05
  • @JohnGordon Yeah in my original code, I define the list "posts" outside the main loop so I get a running record of everything but for troubleshooting purposes, I moved it inside so I could refresh after each page. – mizzou541 Mar 23 '17 at 20:17
  • Using `range(2, n_pages + 1)`, one also obtains only the results from the first page. I tried various ways to avoid a redirect, such as `requests.get(link, allow_redirects=False)` and what has been discussed e.g. [here](http://stackoverflow.com/a/20475712/6009280) and [here](http://stackoverflow.com/questions/24897373/how-to-use-beautifulsoup-to-get-redirect-html), but no success thus far. – Spherical Cowboy Mar 23 '17 at 21:04

1 Answer

The website's implementation is broken. For some reason, it requires the cookie `PHPSESSID` to be set; without it, the server returns the first page regardless of the `page` parameter.
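One way to investigate this kind of behavior is to look at what each response carries back. The sketch below is only an illustration: `summarize_response` is a hypothetical helper, and it assumes the server sets the cookie through an ordinary `Set-Cookie` header, which `requests` exposes as `response.cookies`.

```python
import requests


def summarize_response(response, cookie_name="PHPSESSID"):
    """Collect the details that help explain why two URLs return the same page.

    `summarize_response` is a hypothetical helper, not part of requests.
    """
    return {
        # HTTP status code of the final response
        "status": response.status_code,
        # Number of redirects followed before the final response
        "redirects": len(response.history),
        # Value of the named cookie set by this response, or None if absent
        "cookie": response.cookies.get(cookie_name),
    }
```

Calling `summarize_response(requests.get(link))` for each page URL would show whether a redirect happened and whether a `PHPSESSID` value came back with the response.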

Setting this cookie fixes the problem:

from bs4 import BeautifulSoup

import requests

n_pages = 2
base_link = 'http://tigerboard.com/boards/list.php?board=4&page='

for i in range(1, n_pages + 1):
    link = base_link + str(i)
    # Any PHPSESSID value works; the server only checks that the cookie exists
    html_doc = requests.get(link, headers={'Cookie': 'PHPSESSID=notimportant'})
    soup = BeautifulSoup(html_doc.text, "lxml")
    bs_tags = soup.find_all("div", {"class": "msgline"})
    posts = []
    for post in bs_tags:
        posts.append(post.text)
    print(link)
    print(posts[0])

Another solution would be to use a session (`requests.Session`): the first request, for the first page, sets the cookie to a real value, and the session sends that cookie with every later request.
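A minimal sketch of the session approach, assuming the site sets `PHPSESSID` on the first response as described above. The helper names `extract_posts` and `scrape_pages` are hypothetical, the `timeout` value is an arbitrary choice, and `html.parser` is used here instead of `lxml` only to avoid the extra dependency.

```python
import requests
from bs4 import BeautifulSoup

BASE_LINK = 'http://tigerboard.com/boards/list.php?board=4&page='


def extract_posts(html):
    """Return the text of every <div class="msgline"> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.text for tag in soup.find_all("div", {"class": "msgline"})]


def scrape_pages(n_pages, base_link=BASE_LINK):
    """Fetch pages 1..n_pages through one shared session.

    Returns a list of (url, first_post) tuples; first_post is None
    when a page contains no matching posts.
    """
    results = []
    # The session stores cookies from each response and sends them
    # automatically with every later request it makes.
    with requests.Session() as session:
        for i in range(1, n_pages + 1):
            link = base_link + str(i)
            response = session.get(link, timeout=10)
            posts = extract_posts(response.text)
            results.append((link, posts[0] if posts else None))
    return results
```

Calling `scrape_pages(2)` would request page 1 first, so the cookie is already in place by the time page 2 is fetched. If the loop started at a later page, the first request would still arrive without a cookie and would presumably return page 1, so a warm-up request might be needed in that case.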

It was fun to debug!

Antoine Bolvy