Edit
I've made a Tableau scraper library to extract the worksheet data into a pandas DataFrame.
The code is simpler, but in your case you still need to build the URL with the trusted ticket yourself:
from tableauscraper import TableauScraper as TS
import requests
from bs4 import BeautifulSoup
init_url = "https://idph.illinois.gov/OpioidDataDashboard/"
r = requests.get(init_url)
soup = BeautifulSoup(r.text, "html.parser")
paramTags = dict([
    (t["name"], t["value"])
    for t in soup.find("div", {"class": "tableauPlaceholder"}).findAll("param")
])
url = f'{paramTags["host_url"]}trusted/{paramTags["ticket"]}{paramTags["site_root"]}/views/{paramTags["name"]}'
ts = TS()
ts.loads(url)
dashboard = ts.getWorkbook()
for t in dashboard.worksheets:
    # show worksheet name
    print(f"WORKSHEET NAME : {t.name}")
    # show dataframe for this worksheet
    print(t.data)
Try this on repl.it
Original post
It's a bit complex since it involves a combination of the following:
- the Tableau "configuration page" (the one containing the tsConfigContainer textarea) is not part of the original page; its URL is built dynamically from some param html tags
- it uses a cross-site request forgery (XSRF) token stored in cookies, but in order to get that cookie you need to call a specific API whose URL is also built dynamically from those param html tags
- from the tsConfigContainer JSON, we can build the data URL as you've found out in other Stack Overflow posts such as this, this and this
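To illustrate the last point, here is a minimal sketch of how the data URL is assembled from the parsed tsConfigContainer JSON. The JSON payload below is a hand-written stand-in containing only the two fields the full code uses (vizql_root and sessionid), with values taken from the example URLs in this answer:

```python
import json

# Hand-written stand-in for the tsConfigContainer JSON (only the fields we need).
ts_config = json.loads("""{
    "vizql_root": "/vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity",
    "sessionid": "2A3E3BA96A6C4E65B36AEDB4A536D09F-1:0"
}""")

host = "https://interactive.data.illinois.gov"

# POST target for the bootstrapSession call that returns the dashboard data.
data_url = f'{host}{ts_config["vizql_root"]}/bootstrapSession/sessions/{ts_config["sessionid"]}'
print(data_url)
```

In the real flow, host comes from the host_url param tag and the JSON comes from the scraped textarea.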
The flow is the following:
- call
GET https://idph.illinois.gov/OpioidDataDashboard/
and scrape the param tags under the div with class tableauPlaceholder. From there the host is https://interactive.data.illinois.gov
- from the former param tags, build the "session URL", which looks like this:
GET /trusted/{ticket}/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
This URL is only used to store the cookies (including the XSRF token)
- from the former param tags, build the "configuration URL", which looks like this:
GET /t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
Extract the textarea with id tsConfigContainer and parse the JSON from it
- build the "data URL" from the JSON extracted above; the URL looks like this:
POST /vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity/bootstrapSession/sessions/{session_id}
- the response is JSON with a string prepended to it to prevent JSON hijacking. You need a regex to extract it, and can then parse the huge JSON data
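The anti-hijacking framing in that last response is two length-prefixed JSON chunks back to back. A minimal sketch of extracting both with the same regex idea as in the full code (the body string here is made up, not an actual Tableau response):

```python
import json
import re

# Simplified stand-in for the bootstrapSession response body:
# <length>;<info json><length>;<data json>
body = '16;{"info": "meta"}19;{"data": [1, 2, 3]}'

# Greedily capture the two JSON objects, each preceded by a length and a semicolon.
m = re.search(r'\d+;({.*})\d+;({.*})', body)
info = json.loads(m.group(1))
data = json.loads(m.group(2))
print(info, data)
```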
All the URLs needed look like this:
GET https://idph.illinois.gov/OpioidDataDashboard/
GET https://interactive.data.illinois.gov/trusted/yIm7jkXyRQuH9Ff1oPvz_w==:790xMcZuwmnvijXHg6ymRTrU/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
GET https://interactive.data.illinois.gov/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
POST https://interactive.data.illinois.gov/vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity/bootstrapSession/sessions/2A3E3BA96A6C4E65B36AEDB4A536D09F-1:0
The full code:
import requests
from bs4 import BeautifulSoup
import json
import re
s = requests.Session()
init_url = "https://idph.illinois.gov/OpioidDataDashboard/"
print(f"GET {init_url}")
r = s.get(init_url)
soup = BeautifulSoup(r.text, "html.parser")
paramTags = dict([
    (t["name"], t["value"])
    for t in soup.find("div", {"class": "tableauPlaceholder"}).findAll("param")
])
# get xsrf cookie
session_url = f'{paramTags["host_url"]}trusted/{paramTags["ticket"]}{paramTags["site_root"]}/views/{paramTags["name"]}'
print(f"GET {session_url}")
r = s.get(session_url)
config_url = f'{paramTags["host_url"][:-1]}{paramTags["site_root"]}/views/{paramTags["name"]}'
print(f"GET {config_url}")
r = s.get(config_url, params={
    ":embed": "y",
    ":showVizHome": "no",
    ":host_url": "https://interactive.data.illinois.gov/",
    ":embed_code_version": 2,
    ":tabs": "yes",
    ":toolbar": "no",
    ":showShareOptions": "false",
    ":display_spinner": "no",
    ":loadOrderID": 0,
})
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
dataUrl = f'{paramTags["host_url"][:-1]}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
print(f"POST {dataUrl}")
r = s.post(dataUrl, data={
    "sheet_id": tableauData["sheetId"],
})
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))
print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])
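Each entry of that dataColumns list holds a dataType and a flat list of dataValues; the dashboard then maps worksheet columns onto these value dictionaries through additional index tables, which is beyond this answer. A simplified sketch of reading them, assuming that two-field shape and using made-up values:

```python
# Made-up, simplified stand-in for dataSegments["0"]["dataColumns"]:
# each entry is a value dictionary keyed by its data type.
data_columns = [
    {"dataType": "integer", "dataValues": [2017, 2018, 2019]},
    {"dataType": "cstring", "dataValues": ["Cook", "DuPage", "Lake"]},
]

# Index the value dictionaries by data type for easy lookup.
by_type = {col["dataType"]: col["dataValues"] for col in data_columns}
print(by_type["integer"])
print(by_type["cstring"])
```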
Try this on repl.it