
I want to know why the lists all_links and all_titles don't receive any records from the lists titles and links. I have also tried the .extend() method and it didn't help.

import requests
from bs4 import BeautifulSoup
all_links = []
all_titles = []

def title_link(page_num):
    page = requests.get(
    'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
    % (page_num, page_num, page_num))
    soup = BeautifulSoup(page.content, 'html.parser')
    links = ['https://www.gumtree.pl' + link.get('href')
                for link in soup.find_all('a', class_ ="href-link tile-title-text")]
    titles = [flat.next_element for flat in soup.find_all('a', class_ = "href-link tile-title-text")] 
    print(titles)

for i in range(1,5+1):
    title_link(i)
    all_links = all_links + links
    all_titles = all_titles + titles
    i+=1
    print(all_links)

import pandas as pd
df = pd.DataFrame(data = {'title': all_titles ,'link': all_links})
df.head(100)
#df.to_csv("./gumtree_page_1.csv", sep=';',index=False, encoding = 'utf-8')
#df.to_excel('./gumtree_page_1.xlsx')

3 Answers


Try this:

import requests
from bs4 import BeautifulSoup
all_links = []
all_titles = []

def title_link(page_num):
    page = requests.get(
    'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
    % (page_num, page_num, page_num))
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.content, 'html.parser', from_encoding='utf-8')
    links = ['https://www.gumtree.pl' + link.get('href')
             for link in soup.find_all('a', class_="href-link tile-title-text")]
    titles = [flat.next_element
              for flat in soup.find_all('a', class_="href-link tile-title-text")]
    print(titles)
    return links, titles

for i in range(1,5+1):
    links, titles = title_link(i)
    all_links.extend(links)
    all_titles.extend(titles)
    # i+=1 not needed in python
    print(all_links)

import pandas as pd
df = pd.DataFrame(data = {'title': all_titles ,'link': all_links})
df.head(100)

I think you just needed to return links and titles from title_link(page_num).

Edit: removed the manual incrementing per comments

Edit: changed the all_links = all_links + links to all_links.extend(links)

Edit: the website is UTF-8 encoded, so added page.encoding = 'utf-8' and, as an extra (probably unnecessary) measure, from_encoding='utf-8' to the BeautifulSoup call
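
For anyone hitting a similar decoding problem, here's a minimal sketch of how to inspect and override the encoding that requests picked (the URL is just an example):

import requests

# Example URL only; substitute the page you are scraping.
resp = requests.get('https://www.gumtree.pl/')

# requests guesses the encoding from the HTTP headers; if the guess is wrong
# (e.g. ISO-8859-1 for a UTF-8 page), resp.text comes out mis-decoded.
print(resp.encoding)           # encoding taken from the headers
print(resp.apparent_encoding)  # encoding detected from the response body

# Override before reading resp.text so Polish characters decode correctly.
resp.encoding = 'utf-8'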

  • I'd love to see that another answer is being typed :D; wouldn't have bothered writing practically the same – julka Mar 15 '20 at 23:22
  • feature request! how do you reference another user? like say: ggorlen caught the i+=1 – E. Bassett Mar 15 '20 at 23:23
  • hi there E. Bassett, I ran the code on a freshly installed Atom on Win10 and saw the following: `Traceback (most recent call last): links, titles = title_link(i) \_examples_\gumtree_pl2.py", line 14, in title_link print(titles) File "C:\Program Files\Python37\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u015a' in position 8: character maps to <undefined>` – any idea? Love to hear from you – zero Mar 24 '20 at 22:22
  • If I had to guess it's a requests encoding issue since that's a Unicode character. Take a look at `page.encoding`. Should find an answer [here](https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8/44203507#44203507) – E. Bassett Mar 24 '20 at 22:51
  • hi there – many thanks for the reply and the hint – you're right, now it works. Many thanks for this great asset in learning Python. Keep up your great job here – it rocks!! – zero Mar 25 '20 at 01:11

When I ran your code, I got

NameError                                 Traceback (most recent call last)
<ipython-input-3-6fff0b33d73b> in <module>
     16 for i in range(1,5+1):
     17     title_link(i)
---> 18     all_links = all_links + links
     19     all_titles = all_titles + titles
     20     i+=1

NameError: name 'links' is not defined

That points to the problem: the variable links is not defined in the global scope (where you add it to all_links). You can read about Python scopes here. You'd need to return links and titles from title_link, something similar to this:

def title_link(page_num):
    # your code here
    return links, titles


for i in range(1,5+1):
    links, titles = title_link(i)
    all_links = all_links + links
    all_titles = all_titles + titles
    print(all_links)

This code exhibits confusion about scoping. titles and links inside of title_link are local to that function. When the function ends, the data disappears and it cannot be accessed from another scope such as main. Use the return keyword to return values from functions. In this case, you'd need to return a tuple pair of titles and links, like return titles, links.
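
A minimal sketch of that scoping behavior, with hypothetical names:

def scrape():
    results = ['a', 'b']  # local to scrape; it disappears when the function returns
    return results        # hand the data back to the caller

# print(results)    # NameError: 'results' only exists inside scrape
results = scrape()   # bind the returned list to a name in the calling scope
print(results)       # ['a', 'b']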

Since functions should do one task only, having to return a pair reveals a possible design flaw. A function like title_link is overloaded and should probably be two separate functions, one to get titles and one to get links.
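
For illustration, that split might look something like this (a sketch; get_titles and get_links are hypothetical names, and the inline HTML stands in for a fetched page):

from bs4 import BeautifulSoup

def get_titles(soup):
    # One job: extract the visible title text from each listing anchor.
    return [a.next_element
            for a in soup.find_all("a", class_="href-link tile-title-text")]

def get_links(soup):
    # One job: build absolute URLs from the relative hrefs.
    return ["https://www.gumtree.pl" + a.get("href")
            for a in soup.find_all("a", class_="href-link tile-title-text")]

html = '<a class="href-link tile-title-text" href="/x">Flat in Mokotow</a>'
soup = BeautifulSoup(html, "html.parser")
print(get_titles(soup))  # ['Flat in Mokotow']
print(get_links(soup))   # ['https://www.gumtree.pl/x']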

Having said that, the functions here seem like premature abstractions since the operations can be done directly.

Here's a suggested rewrite:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d"
data = {"title": [], "link": []}

for i in range(1, 6):
    page = requests.get(url % (i, i, i))
    soup = BeautifulSoup(page.content, "html.parser")
    titles = soup.find_all("a", class_="href-link tile-title-text")
    data["title"].extend([x.next_element for x in titles])
    data["link"].extend("https://www.gumtree.pl" + x.get("href") for x in titles)

df = pd.DataFrame(data)
print(df.head(100))

Other remarks:

  • i+=1 is unnecessary; for loops move forward automatically in Python.
  • (1,5+1) is clearer as (1, 6).
  • List comprehensions are great, but if they run multiple lines, consider writing them as normal loops or creating an intermediate variable or two.
  • Imports should be at the top of a file only. See PEP-8.
  • list.extend(other_list) is preferable to list = list + other_list, which is slow and memory-intensive, creating a whole copy of the list (see the sketch below).
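
A quick illustration of that last point (timings vary by machine; the sizes here are arbitrary):

import timeit

def concat():
    # acc = acc + [...] copies the whole accumulated list on every
    # iteration, so building a big list this way is quadratic overall.
    acc = []
    for _ in range(1000):
        acc = acc + [1, 2, 3]
    return acc

def extend():
    # extend appends in place, so the same work is linear overall.
    acc = []
    for _ in range(1000):
        acc.extend([1, 2, 3])
    return acc

print(timeit.timeit(concat, number=100))  # noticeably slower
print(timeit.timeit(extend, number=100))  # much faster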
  • hi there dear ggorlen, I re-ran the code on a freshly installed Atom on Win 10 and got this back: `Traceback (most recent call last): File "C:\Users\Kasper\Documents\_mk_\_dev_\python\_examples_\gumtree_pl.py", line 16, in <module> print(df.head(100)) File "C:\Program Files\Python37\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u015b' in position 115: character maps to <undefined>` – any idea? – zero Mar 24 '20 at 22:23
  • It works OK for me just now on Windows and Python 3.8. It looks like a unicode error during the print. If you comment out the print, does it work? If you're ultimately dumping the data to CSV or something, it might not matter; otherwise see threads such as [this](https://stackoverflow.com/questions/14630288/unicodeencodeerror-charmap-codec-cant-encode-character-maps-to-undefined). – ggorlen Mar 24 '20 at 22:38
  • hi there dear ggorlen – many thanks for the hint. I guess you're right – assumption: the website is utf-8 encoded. See also the comment of E. Bassett above, which gave the same hint: added page.encoding = 'utf-8' and as an extra (probably unnecessary) measure, from_encoding='utf-8' to the BeautifulSoup. Keep up the great job – it rocks – thanks ggorlen – zero Mar 25 '20 at 01:19