
I am trying to figure out whether, and how, I can scrape tooltip values from a Tableau graph embedded in a webpage using Python.

Here is an example of a graph that shows tooltips when the user hovers over the bars:

https://public.tableau.com/views/NumberofCOVID-19patientsadmittedordischarged/DASHPublicpage_patientsdischarges?:embed=y&:showVizHome=no&:host_url=https%3A%2F%2Fpublic.tableau.com%2F&:embed_code_version=3&:tabs=no&:toolbar=yes&:animate_transition=yes&:display_static_image=no&:display_spinner=no&:display_overlay=yes&:display_count=yes&publish=yes&:loadOrderID=1

I grabbed this URL from the original webpage that I want to scrape:

https://covid19.colorado.gov/hospital-data

Any help is appreciated.

dmornad
    I suggest looking at resources such as https://www.tableau.com/covid-19-coronavirus-data-resources before trying to scrape the data from a published visualization. You may find the original source and a more reliable way of obtaining the data you want. – Alex Blakemore May 24 '20 at 00:34

1 Answer


Edit

I've made a Python library to scrape Tableau dashboards. With it, the implementation is more straightforward:

from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/Colorado_COVID19_Data/CO_Home"

ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()

for t in dashboard.worksheets:
    #show worksheet name
    print(f"WORKSHEET NAME : {t.name}")
    #show dataframe for this worksheet
    print(t.data)



Old answer

The graphic seems to be generated in JS from the result of an API call that looks like this:

POST https://public.tableau.com/TITLE/bootstrapSession/sessions/SESSION_ID 

The SESSION_ID parameter is found (among other things) in the tsConfigContainer textarea of the page served at the URL used to build the iframe.

Starting from https://covid19.colorado.gov/hospital-data :

  • find the element with class tableauPlaceholder
  • get its param element whose name attribute is set to name
  • its value gives you the URL: https://public.tableau.com/views/{urlPath}
  • that page contains a textarea with id tsConfigContainer holding a set of JSON values
  • extract the session id (sessionid) and the root path (vizql_root)
  • make a POST to https://public.tableau.com/ROOT_PATH/bootstrapSession/sessions/SESSION_ID with the sheetId as form data
  • extract the JSON from the result (the result as a whole is not valid JSON)
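
The last step can also be done without a regular expression. What follows is a minimal sketch, assuming the response body is a sequence of chunks of the form `<digits>;<JSON object>`, which is the shape the regex in the code below relies on. The helper name `split_bootstrap_response` and the sample string are hypothetical and only illustrate the idea.

```python
import json

def split_bootstrap_response(text):
    """Return the JSON objects found in a '<digits>;<json>' chunked body."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while pos < len(text):
        # Each chunk is assumed to start with a numeric prefix and a ';'.
        sep = text.find(";", pos)
        if sep == -1 or not text[pos:sep].strip().isdigit():
            break
        # raw_decode parses one JSON value and reports the index where it ended.
        obj, pos = decoder.raw_decode(text, sep + 1)
        objects.append(obj)
    return objects

# Made-up sample in the assumed format, not a real Tableau response.
sample = '12;{"a": 1}25;{"b": {"c": [1, 2]}}'
info, data = split_bootstrap_response(sample)
print(info, data)
```

Because `raw_decode` stops at the end of each object, this does not depend on greedy matching of braces the way a `{.*}` pattern does.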

Code:

import requests
from bs4 import BeautifulSoup
import json
import re

r = requests.get("https://covid19.colorado.gov/hospital-data")
soup = BeautifulSoup(r.text, "html.parser")

# get the second Tableau placeholder on the page (index 1)
tableauContainer = soup.find_all("div", { "class": "tableauPlaceholder"})[1]
urlPath = tableauContainer.find("param", { "name": "name"})["value"]

r = requests.get(
    f"https://public.tableau.com/views/{urlPath}",
    params= {
        ":showVizHome":"no",
    }
)
soup = BeautifulSoup(r.text, "html.parser")

tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})

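# the response is two JSON objects, each prefixed by "<digits>;"
# (raw string so that \d is not treated as an invalid string escape)
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)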
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

From there you have all the data. You will need to work out how the data is split, as it seems everything is dumped into a single list. Looking at the other fields in the JSON object would probably be useful for that.
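
One way to start untangling that list is to group the values by type. This is only a minimal sketch, assuming each entry of `dataColumns` is an object with a `dataType` string and a `dataValues` list; that layout is an assumption, so check it against your own response. The helper name `values_by_type` and the sample values are hypothetical.

```python
def values_by_type(data_columns):
    """Group the values of each dataColumns entry under its dataType."""
    grouped = {}
    for column in data_columns:
        # .get() keeps the sketch from failing on entries that lack a key.
        data_type = column.get("dataType", "unknown")
        grouped.setdefault(data_type, []).extend(column.get("dataValues", []))
    return grouped

# Made-up sample shaped like the assumed layout, not real Tableau output.
sample_columns = [
    {"dataType": "integer", "dataValues": [12, 7, 30]},
    {"dataType": "cstring", "dataValues": ["Admitted", "Discharged"]},
]
grouped = values_by_type(sample_columns)
for data_type, values in grouped.items():
    print(data_type, len(values), values[:5])
```

In the script above you would pass in the list that the final `print` call outputs. Mapping these values back to individual worksheets and columns still requires the other fields of the JSON object.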

Bertrand Martel
  • Great response! Thanks for your time. Would you answer the following questions, so that I understand how to tackle other sites where I need to do the same? 1- I don't see the "param" elements when I inspect in Chrome. Why not? 2- How did you figure out the GET parameter "showVizHome=no"? I don't see it anywhere. – dmornad May 24 '20 at 01:38
  • @dmornad there are some `param` elements inside div with class `tableauPlaceholder` of https://covid19.colorado.gov/hospital-data – Bertrand Martel May 24 '20 at 01:59
  • @dmornad when you inspect the graphic, you can see that it is embedded in an iframe. In fact, the Tableau JS library creates the iframe URL dynamically, and I've just managed to reproduce that URL. When you inspect it you can see that there are a lot of URL params, but I've found that only showVizHome is necessary to get the data. – Bertrand Martel May 24 '20 at 02:06
  • Full url is : https://public.tableau.com/views/NumberofCOVID-19patientsadmittedordischarged/DASHPublicpage_patientsdischarges?:embed=y&:showVizHome=no&:host_url=https%3A%2F%2Fpublic.tableau.com%2F&:embed_code_version=3&:tabs=no&:toolbar=yes&:animate_transition=yes&:display_static_image=no&:display_spinner=no&:display_overlay=yes&:display_count=yes&publish=yes&:loadOrderID=1 – Bertrand Martel May 24 '20 at 02:06