How to remove accents to webscrape name using python

Question

I have a list of names, but some contain accented characters. I want to find each person's page without manually removing the accents from the name, because the accents currently break the search. Is there a way to do this?

import requests
from bs4 import BeautifulSoup
import pandas as pd
from pandas import DataFrame

base_url = 'https://basketball.realgm.com'

player_names=['Ante Žižić','Anžejs Pasečņiks', 'Dario Šarić', 'Dāvis Bertāns', 'Jakob Pöltl']

# Empty DataFrame
result = pd.DataFrame()

for name in player_names:
    url = f'{base_url}/search?q={name.replace(" ", "+")}' 
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    if url == response.url:
        # Get all NBA players
        for player in soup.select('.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]'): 
            response = requests.get(base_url + player['href'])
            player_soup = BeautifulSoup(response.content, 'lxml') 
            player_data = get_player_stats(search_name=player.text, real_name=name, player_soup=player_soup) 
            result = result.append(player_data, sort=False).reset_index(drop=True)
    else:
        player_data = get_player_stats(search_name=name, real_name=name, player_soup=soup) 
        result = result.append(player_data, sort=False).reset_index(drop=True)

score 2 · Answer 1 · answered Jan 31 '20 at 18:43

You could install a package called unidecode (pip install unidecode).

Now you can do something like this before processing the list further:

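from unidecode import unidecode

player_names=['Ante Žižić','Anžejs Pasečņiks', 'Dario Šarić', 'Dāvis Bertāns', 'Jakob Pöltl']

for player in player_names:
    # Transliterate accented characters to their closest ASCII equivalents
    player = unidecode(player)
    print(player)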

Output:

Ante Zizic
Anzejs Pasecniks
Dario Saric
Davis Bertans
Jakob Poltl
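
If installing a third-party package isn't an option, a rough standard-library alternative is to decompose each character with unicodedata.normalize and drop the combining marks. This is a different technique from unidecode, not a drop-in replacement, and only a sketch: strip_accents is a hypothetical helper name, and letters that have no decomposition (for example ł or đ) pass through unchanged, whereas unidecode would transliterate them.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks from text (hypothetical helper)."""
    # NFKD splits e.g. "Ž" into a plain "Z" followed by a combining caron.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


player_names = ['Ante Žižić', 'Anžejs Pasečņiks', 'Dario Šarić', 'Dāvis Bertāns', 'Jakob Pöltl']

for player in player_names:
    print(strip_accents(player))
```

For the five names in the question this should print the same result as the unidecode loop above.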

Cohan · Accepted Answer · 2020-01-31T19:10:57.183

python-slugify (pip install python-slugify) can handle both the spaces and the Unicode characters. Since you're building a search string, just convert each - to + with a simple replace('-', '+').

from slugify import slugify

base_url = "https://basketball.realgm.com"

player_names = [
    "Ante Žižić",
    "Anžejs Pasečņiks",
    "Dario Šarić",
    "Dāvis Bertāns",
    "Jakob Pöltl",
]

for name in player_names:
    url = f"{base_url}/search?q={slugify(name).replace('-', '+')}"
    print(url)

Output:

https://basketball.realgm.com/search?q=ante+zizic
https://basketball.realgm.com/search?q=anzejs+pasecniks
https://basketball.realgm.com/search?q=dario+saric
https://basketball.realgm.com/search?q=davis+bertans
https://basketball.realgm.com/search?q=jakob+poltl

Granted, the unidecode module that the others have mentioned works as well.

from unidecode import unidecode

for name in player_names:
    url = f"{base_url}/search?q={unidecode(name).replace(' ', '+')}"
    print(url)

The site doesn't seem to care whether the names are lowercase or title case, so these URLs work too:

https://basketball.realgm.com/search?q=Ante+Zizic
https://basketball.realgm.com/search?q=Anzejs+Pasecniks
https://basketball.realgm.com/search?q=Dario+Saric
https://basketball.realgm.com/search?q=Davis+Bertans
https://basketball.realgm.com/search?q=Jakob+Poltl

You can open the links above to confirm that they work.
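
Building the query string by hand is fine for these names, but a name containing an apostrophe or an ampersand would need escaping. Below is a small sketch using the standard library's urllib.parse.urlencode, which turns spaces into + and percent-escapes reserved characters. build_search_url is a hypothetical helper, and it assumes the name has already been run through unidecode or slugify.

```python
from urllib.parse import urlencode

BASE_URL = "https://basketball.realgm.com"


def build_search_url(ascii_name, base_url=BASE_URL):
    """Return the search URL for an already-transliterated name (hypothetical helper)."""
    # urlencode escapes the value and encodes spaces as '+'.
    return f"{base_url}/search?{urlencode({'q': ascii_name})}"


# Names as they look after unidecode() has been applied.
for name in ["Ante Zizic", "Jakob Poltl"]:
    print(build_search_url(name))
```

The generated URLs have the same form as the ones above, so the rest of the scraping loop should not need to change.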

I'm getting an error: ```NameError: name 'unicode' is not defined```. I'm using Python 3. — J. Doe, Jan 31 '20 at 19:35
unidecode worked perfectly (thank you!), but why do I get this error when I'm trying to use slugify? — J. Doe, Jan 31 '20 at 19:37
You can use either `slugify` by itself or `unidecode` by itself. — Cohan, Jan 31 '20 at 20:38

score 0 · Answer 3 · blacktj · answered Jan 31 '20 at 18:40

Try answer 2 of this question: "Replace non-ASCII characters with a single space". It suggests the unidecode module.
