
I am very new to scraping data from websites and am at a loss as to how to grab data from a website that uses Tableau Public.

website: https://showmestrong.mo.gov/data/public-health/

I've been reading several sources on how to inspect the page elements and find the table within them, but I'm stuck. I've tried using requests and BeautifulSoup in Python but don't know how to get past this point:

import requests
from bs4 import BeautifulSoup
import json
import re

r = requests.get("https://showmestrong.mo.gov/data/public-health/")
soup = BeautifulSoup(r.text, "html.parser")

The parsed page doesn't seem to contain any tables, for example the ones about cases and deaths.

Any tips or documentation/forums about this would be appreciated!

Doughey

1 Answer


The tableau.js library seems to load another URL, from which it gets the data:

https://public.tableau.com/views/COVID-19inMissouri/COVID-19inMissouri?:embed=y&:showVizHome=no&:host_url=https%3A%2F%2Fpublic.tableau.com%2F&:embed_code_version=3&:tabs=no&:toolbar=no&:animate_transition=yes&:display_static_image=no&:display_spinner=no&:display_overlay=yes&:display_count=yes&:language=en&:loadOrderID=0
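The query string of that URL can be split into a base URL and a dictionary of parameters, which is the form `requests.get` accepts through its `params` argument. The following is a minimal sketch using only the standard library; the variable names `base_url` and `params` are illustrative.

```python
from urllib.parse import urlsplit, parse_qsl

url = (
    "https://public.tableau.com/views/COVID-19inMissouri/COVID-19inMissouri"
    "?:embed=y&:showVizHome=no&:host_url=https%3A%2F%2Fpublic.tableau.com%2F"
    "&:embed_code_version=3&:tabs=no&:toolbar=no&:animate_transition=yes"
    "&:display_static_image=no&:display_spinner=no&:display_overlay=yes"
    "&:display_count=yes&:language=en&:loadOrderID=0"
)

parts = urlsplit(url)

# Everything before the "?" is the address to request.
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qsl splits the query on "&" and "=" and decodes percent-escapes,
# so "https%3A%2F%2F..." becomes "https://...".
params = dict(parse_qsl(parts.query))

for key, value in params.items():
    print(f"{key!r}: {value!r}")
```

Note that every value comes back as a string here (for example `"3"` rather than `3`), which makes no difference once the parameters are encoded back into a URL.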

From there, the approach is very similar to this answer and this one: extract the JSON configuration from a textarea tag, then use the sessionid it contains to build the URL that returns the data:

import requests
from bs4 import BeautifulSoup
import json
import re

r = requests.get("https://public.tableau.com/views/COVID-19inMissouri/COVID-19inMissouri", 
    params = {
    ":embed": "y",
    ":showVizHome": "no",
    ":host_url": "https://public.tableau.com/",
    ":embed_code_version": 3,
    ":tabs": "no",
    ":toolbar": "no",
    ":animate_transition": "yes",
    ":display_static_image": "no",
    ":display_spinner": "no",
    ":display_overlay": "yes",
    ":display_count": "yes",
    ":language": "en",
    ":loadOrderID": 0
})
soup = BeautifulSoup(r.text, "html.parser")

tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})
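# The body holds two JSON objects, each preceded by "<number>;".
# A raw string keeps "\d" from being treated as a string escape.
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))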

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

The response as a whole is not valid JSON, so it has to be parsed with a regex that extracts the JSON objects from it, as shown in the code above.
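To see what the regex in the code above does without making a network request, here is a small sketch that runs it on a synthetic string. The string is made up for illustration: it only mimics the shape the pattern expects (a number, a semicolon, a JSON object, then the same again), and the keys inside it are placeholders rather than real Tableau fields.

```python
import json
import re

# Synthetic stand-in for the response body (not real Tableau output):
# two JSON objects, each preceded by "<number>;".
first = json.dumps({"sheetName": "demo"})
second = json.dumps({"secondaryInfo": {"rows": [1, 2, 3]}})
body = f"{len(first)};{first}{len(second)};{second}"

# Same pattern as in the answer: each group captures one JSON object.
match = re.search(r"\d+;({.*})\d+;({.*})", body)
if match is None:
    raise ValueError("response did not have the expected two-part shape")

info = json.loads(match.group(1))
data = json.loads(match.group(2))

print(info)
print(data["secondaryInfo"]["rows"])
```

Checking for `None` before calling `group` is worth keeping in real code too, since the pattern will simply fail to match if the server returns an error page instead of data.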


Bertrand Martel
  • Thank you so much! You've helped a lot with this. If I may ask, do you know how to parse this huge line of data, or of any forums that could help with it? – Doughey Oct 02 '20 at 22:52