3

I am trying to pull locations and names of Naloxone distribution centers in Illinois for a research project on the opioid crisis.

This tableau generated dashboard is accessible from here from the department of public health https://idph.illinois.gov/OpioidDataDashboard/

I've tried everything I could find. First changing the url to "download" the data using Tableau's interface. That only let me download a pdf map not the actual dataset behind it. Second, I modified the python script I've seen a few times on Stack overflow to try to request the data. But, I think it runs into some kind of error. Code below.

url = "https://interactive.data.illinois.gov/t/DPH/views/opioidTDWEB_prod/NaloxoneDistributionLocations"

r = requests.get(
    url,
    params= {
        ":embed":"y",
        ":showAppBanner":"false",
        ":showShareOptions":"true",
        ":display_count":"no",
        "showVizHome": "no"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
print(soup)
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'https://tableau.ons.org.br{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})

dataReg = re.search('\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

Appreciate any help.

1 Answers1

4

Edit

I've made a tableau scraper library to extract the worksheets data into a pandas dataframe.

The code is simpler, but in your case you still need to build the URL with the xsrf token :

from tableauscraper import TableauScraper as TS
import requests
from bs4 import BeautifulSoup

init_url = "https://idph.illinois.gov/OpioidDataDashboard/"
r = requests.get(init_url)
soup = BeautifulSoup(r.text, "html.parser")
paramTags = dict([
    (t["name"], t["value"])
    for t in soup.find("div", {"class": "tableauPlaceholder"}).findAll("param")
])

url = f'{paramTags["host_url"]}trusted/{paramTags["ticket"]}{paramTags["site_root"]}/views/{paramTags["name"]}'

ts = TS()
ts.loads(url)
dashboard = ts.getWorkbook()

for t in dashboard.worksheets:
    # show worksheet name
    print(f"WORKSHEET NAME : {t.name}")
    # show dataframe for this worksheet
    print(t.data)

Try this on repl.it


Original post

It's a bit complex since there are a combination of the following :

  • the tableau "configuration page" where there is tsconfig textarea is not part of the original page. The url is built dynamically from some param html tag
  • it uses a cross forgery token in cookies but in order to get that cookie you need to call a specific api whose url is built dynamically from some param html tag
  • from the tsconfig parameter, we can build the data url as you've found out in other stackoverflow post such as this, this and this

The flow is the following :

  • call GET https://idph.illinois.gov/OpioidDataDashboard/, scrape the param tags under the div with class tableauPlaceholder

From there the host is : https://interactive.data.illinois.gov

  • from the formerĀ param tags, build the "session URL" which looks like this :

    GET /trusted/{ticket}/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
    

the url above will be only used to store the cookies (including xsrf token in the cookies)

  • from the former param tags, build the "configuration URL" which looks like this :

    GET /t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
    

Extract the textarea with id tsConfigContainer and parse the json from it

  • build the "data url" from the json extracted above, the url looks like this :

    POST /vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity/bootstrapSession/sessions/{session_id}
    

Then you have a json response with some string in front of it to prevent json hijacking. You need regex to extract it and then parse the huge json data

All url needed would be like :

GET https://idph.illinois.gov/OpioidDataDashboard/
GET https://interactive.data.illinois.gov/trusted/yIm7jkXyRQuH9Ff1oPvz_w==:790xMcZuwmnvijXHg6ymRTrU/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
GET https://interactive.data.illinois.gov/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
POST https://interactive.data.illinois.gov/vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity/bootstrapSession/sessions/2A3E3BA96A6C4E65B36AEDB4A536D09F-1:0

The full code:

import requests
from bs4 import BeautifulSoup
import json
import re

s = requests.Session()

init_url = "https://idph.illinois.gov/OpioidDataDashboard/"
print(f"GET {init_url}")
r = s.get(init_url)
soup = BeautifulSoup(r.text, "html.parser")
paramTags = dict([
    (t["name"], t["value"]) 
    for t in soup.find("div", {"class":"tableauPlaceholder"}).findAll("param")
])

# get xsrf cookie
session_url = f'{paramTags["host_url"]}trusted/{paramTags["ticket"]}{paramTags["site_root"]}/views/{paramTags["name"]}'
print(f"GET {session_url}")
r = s.get(session_url)

config_url = f'{paramTags["host_url"][:-1]}{paramTags["site_root"]}/views/{paramTags["name"]}'
print(f"GET {config_url}")
r = s.get(config_url,
    params = {
        ":embed": "y",
        ":showVizHome": "no",
        ":host_url": "https://interactive.data.illinois.gov/",
        ":embed_code_version": 2,
        ":tabs": "yes",
        ":toolbar": "no",
        ":showShareOptions": "false",
        ":display_spinner": "no",
        ":loadOrderID": 0,
})
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'{paramTags["host_url"][:-1]}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
print(f"POST {dataUrl}")
r = s.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})
dataReg = re.search('\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

Try this on repl.it

Bertrand Martel
  • 32,363
  • 15
  • 95
  • 118