
I was trying to scrape https://stores.pandora.net/en-au/ for all store locations in Australia and their addresses using ParseHub, but it wasn't returning results as it normally would.

[Image: ParseHub screenshot]

As shown in the picture, the live preview shows the table perfectly fine, but when I run the project it only returns junk values (like 2 stores in the US).

I tried my hand at Beautiful Soup, but the classes looked more complicated than I first assumed. (It looks like the data is sitting in a maplist array, but I'm not sure how to extract that bit.)

Any help here would be greatly appreciated! Thanks :)

political scientist

1 Answer


This site fetches its data from the API https://maps.pandora.net/api/getAsyncLocations, with the search value passed as a query parameter. The result is a JSON object with a field maplist that contains HTML data (a single div). This div embeds several comma-delimited JSON objects:

curl 'https://maps.pandora.net/api/getAsyncLocations?level=domain&template=domain&search=Melbourne+Victoria%2C+Australie'

So we need to rearrange the comma-delimited JSON objects into an array in order to parse them. The following example uses curl, jq (JSON parser), sed & pup (HTML parser) to extract the data:

search="Melbourne+Victoria+Australie"
curl -s -G 'https://maps.pandora.net/api/getAsyncLocations' \
    -d 'level=domain' \
    -d 'template=domain' \
    -d "search=$search" | \
    jq -r '.maplist' | \
    pup -p div text{} | \
    sed '$ s/.$//' | \
    sed -e "\$a]" | \
    sed '1s/^/[/' | \
    jq '.[] | { 
        location: .location_name, 
        address: .address_1, 
        complement: (.city + "," + .big_region + " " + .location_post_code) 
    }'
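
The three sed calls above do nothing more than drop the trailing comma and wrap the objects in brackets. A minimal offline sketch of that same step, assuming the div text looks like the made-up `raw` string below (the store names and addresses are invented for illustration; only the field names come from the API response):

```python
import json

# Made-up sample mimicking the text inside the maplist <div>:
# JSON objects separated by commas, with a trailing comma.
raw = (
    '{"location_name": "Store A", "address_1": "1 Example St"},'
    '{"location_name": "Store B", "address_1": "2 Sample Rd"},'
)


def to_records(text):
    """Drop any trailing comma, wrap the objects in [] and parse them."""
    body = text.strip().rstrip(",")
    return json.loads("[" + body + "]")


records = to_records(raw)
for record in records:
    print(record["location_name"], "-", record["address_1"])
```

Using `rstrip(",")` rather than cutting the last character also tolerates input that has no trailing comma.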

In Python with requests & BeautifulSoup:

import requests
from bs4 import BeautifulSoup
import json

# Use spaces here: requests URL-encodes them, whereas a literal "+" would be sent as %2B
search = "Melbourne Victoria Australie"

response = requests.get(
    'https://maps.pandora.net/api/getAsyncLocations',
    params = {
        'level':'domain',
        'template':'domain',
        'search': search
    }
)
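# 'maplist' holds an HTML fragment: a single <div> whose text is the data
soup = BeautifulSoup(response.json()['maplist'], 'html.parser')

# The div text is comma-delimited JSON objects with a trailing comma;
# drop that last character and wrap in [] to get a valid JSON array
formatted_json = "[{}]".format(soup.div.string[:-1])
data = json.loads(formatted_json)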

print([
    (i['location_name'], i['address_1'], i['city'], i['big_region'], i['location_post_code']) 
    for i in data
])
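
Since the endpoint takes a single search value, covering all of Australia would presumably mean repeating the request for several cities and merging the results. The sketch below is only an illustration of that merge step: `merge_stores` is a hypothetical helper, the sample lists are invented, and it assumes that the pair (location_name, address_1) is enough to identify a store.

```python
def merge_stores(batches):
    """Combine result lists from several searches, keeping one entry per store."""
    seen = {}
    for batch in batches:
        for store in batch:
            # Assumption: name + first address line identifies a store
            key = (store.get("location_name"), store.get("address_1"))
            seen.setdefault(key, store)  # keep the first occurrence
    return list(seen.values())


# Invented results of two overlapping searches
melbourne = [
    {"location_name": "Store A", "address_1": "1 Example St"},
    {"location_name": "Store B", "address_1": "2 Sample Rd"},
]
sydney = [
    {"location_name": "Store B", "address_1": "2 Sample Rd"},
    {"location_name": "Store C", "address_1": "3 Demo Ave"},
]

all_stores = merge_stores([melbourne, sydney])
for store in all_stores:
    print(store["location_name"], "-", store["address_1"])
```

In practice each list would be the `data` value parsed above for one search term.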
Bertrand Martel