Extracting underlying data via RSelenium with embedded leaflet svg, and more

Question

I would like to extract information about each ad in this link. I have got to the stage where I can automatically click See Ad Details, but much of the underlying data is not straightforward to wrangle into a neat data frame.

library(RSelenium)
rs <- rsDriver()
remote <- rs$client
remote$navigate(
  paste0(
    "https://www.facebook.com/ads/library/?", 
    "active_status=all&ad_type=political_and_issue_ads&country=US&", 
    "impression_search_field=has_impressions_lifetime&", 
    "q=actblue&view_all_page_id=38471053686"
  )
)

## Click the first "See Ad Details" button
test <- remote$findElement(using = "xpath", "//*[@class=\"_7kfh\"]")
test$clickElement()
## Manually figured out element (the details panel)
test <- remote$findElement(using = "xpath", "//*[@class=\"_7lq0\"]")
test$getElementText()

The output text is messy itself, but I believe that with some time and effort it can be wrangled into something useful. The problem is wrangling the underlying data in (1) the graph, which seems to be just an image, and (2) the leaflet SVG, which displays data when the cursor hovers over it.

I am at a loss as to how to systematically extract this image and especially the leaflet SVG. How would I take each ad and then extract the full data available in the details in this case?

Are you restricted to RSelenium or open to alternatives? And did you check whether it's allowed to scrape the data? Thanks! — Tonio Liebrand, Feb 18 '20 at 15:59
@TonioLiebrand definitely open to alternatives, and this report is open to public---no logins or credentials required. — Kim, Feb 18 '20 at 19:57
I've had a good look at this Kim to see if there is an obvious answer, but the question remains a bit vague. There are lots of different types of data you could extract from the page, but they can't all be wrangled into a neat data frame. Perhaps you'd get more help if you were a bit more specific? — Allan Cameron, Feb 22 '20 at 12:39

score 4 · Accepted Answer · edited Feb 24 '20 at 13:39

The age and gender graphic is a canvas element. To get it as an image, you can take a screenshot of the element. Python example:

# find_element_by_tag_name was removed in Selenium 4; 'tag name' is the locator strategy string
driver.find_element('tag name', 'canvas').screenshot("age_and_gender.png")

The "Where this ad was shown" graphic is an SVG, and you can save it as an image in the same way. The result will not be very accurate, because the visible part of the SVG differs from its actual extent, but you can crop the image afterwards. Python example:

# Same locator form as above; takes the first svg element on the page
driver.find_element('tag name', 'svg').screenshot("where_this_ad_was_shown.png")

You cannot extract the full data from these elements with Selenium alone. One way to get the data is to configure a proxy server, capture the API requests, and read the data from the responses, which are in JSON format. And yes, it's possible.

An easier way is to send the requests directly to get the ads and their details, without Selenium. Working Python example:

import json
import requests

params = (
    ('q', 'actblue'),
    ('count', '1000'), # default is 30, for 38471053686 it will return about 300 results.
    ('active_status', 'all'),
    ('ad_type', 'political_and_issue_ads'),
    ('countries/[0/]', 'US'),
    ('impression_search_field', 'has_impressions_lifetime'),
    ('view_all_page_id', '38471053686'),
)

data = {'__a': '1', }

with requests.Session() as s:
    # Search for the ads; the response body is JSON prefixed with 'for (;;);'
    response = s.post('https://www.facebook.com/ads/library/async/search_ads/', params=params, data=data)
    ads = json.loads(response.text.replace('for (;;);', ''))['payload']['results']
    for ad in ads:
        ad_details_params = (
            ('ad_archive_id', ad[0]['adArchiveID']),
            ('country', 'US'),
        )
        response = s.post('https://www.facebook.com/ads/library/async/insights/', params=ad_details_params, data=data)
        # Parse the details the same way; replace() is a no-op if the prefix is absent
        details = json.loads(response.text.replace('for (;;);', ''))['payload']
        print(details)
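
The prefix handling in the example above can be isolated into a small helper so that both requests are parsed the same way. This is only a sketch: the function name `parse_async_response` is made up for illustration, and it assumes the `for (;;);` prefix appears at most once, at the very start of the body. The sample string is hand-written, not a real response.

```python
import json

# Prefix seen in the example above; assumed to occur only at the start of the body.
PREFIX = 'for (;;);'

def parse_async_response(text):
    """Strip the leading prefix (if present) and parse the remaining JSON."""
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return json.loads(text)

# Hand-written stand-in for a response body, not real data.
sample = 'for (;;);{"__ar": 1, "payload": {"results": []}}'
parsed = parse_async_response(sample)
print(parsed['payload']['results'])
```

Unlike `str.replace`, this only touches the start of the text, so an ad whose own text happened to contain the same characters would be left alone.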

Note: Facebook does not allow automated data collection without written permission: https://www.facebook.com/apps/site_scraping_tos_terms.php

But as we all know, Facebook does not refuse to collect our data.

The response for each ad's details will look like this:

{
  "__ar": 1,
  "payload": {
    "ageGenderData": [
      {
        "age_range": "18-24",
        "female": 0.03,
        "male": 0.05,
        "unknown": 0
      },
      {
        "age_range": "25-34",
        "female": 0.12,
        "male": 0.12,
        "unknown": 0.01
      },
      {
        "age_range": "35-44",
        "female": 0.16,
        "male": 0.09,
        "unknown": 0
      },
      {
        "age_range": "45-54",
        "female": 0.11,
        "male": 0.05,
        "unknown": 0
      },
      {
        "age_range": "55-64",
        "female": 0.09,
        "male": 0.04,
        "unknown": 0
      },
      {
        "age_range": "65+",
        "female": 0.09,
        "male": 0.03,
        "unknown": 0
      }
    ],
    "currency": "USD",
    "currencyMatched": true,
    "impressions": "35\u00a0B - 40\u00a0B",
    "locationData": [
      {
        "reach": 0,
        "region": "Alabama"
      },
      {
        "reach": 0,
        "region": "Utah"
      },
      {
        "reach": 0,
        "region": "Maine"
      },
      {
        "reach": 0,
        "region": "Louisiana"
      },
      {
        "reach": 0,
        "region": "Kentucky"
      },
      {
        "reach": 0,
        "region": "Kansas"
      },
      {
        "reach": 0,
        "region": "Idaho"
      },
      {
        "reach": 0,
        "region": "Delaware"
      },
      {
        "reach": 0,
        "region": "Connecticut"
      },
      {
        "reach": 0,
        "region": "Arkansas"
      },
      {
        "reach": 0,
        "region": "Hawaii"
      },
      {
        "reach": 0,
        "region": "Alaska"
      },
      {
        "reach": 0,
        "region": "Montana"
      },
      {
        "reach": 0,
        "region": "West Virginia"
      },
      {
        "reach": 0,
        "region": "Vermont"
      },
      {
        "reach": 0,
        "region": "Mississippi"
      },
      {
        "reach": 0,
        "region": "Wyoming"
      },
      {
        "reach": 0,
        "region": "Oklahoma"
      },
      {
        "reach": 0,
        "region": "North Dakota"
      },
      {
        "reach": 0,
        "region": "New Mexico"
      },
      {
        "reach": 0,
        "region": "New Hampshire"
      },
      {
        "reach": 0,
        "region": "Nebraska"
      },
      {
        "reach": 0,
        "region": "Rhode Island"
      },
      {
        "reach": 0,
        "region": "South Dakota"
      },
      {
        "reach": 0.01,
        "region": "Wisconsin"
      },
      {
        "reach": 0.01,
        "region": "Missouri"
      },
      {
        "reach": 0.01,
        "region": "Oregon"
      },
      {
        "reach": 0.01,
        "region": "Minnesota"
      },
      {
        "reach": 0.01,
        "region": "Maryland"
      },
      {
        "reach": 0.01,
        "region": "New Jersey"
      },
      {
        "reach": 0.01,
        "region": "Tennessee"
      },
      {
        "reach": 0.01,
        "region": "Washington, District of Columbia"
      },
      {
        "reach": 0.01,
        "region": "Indiana"
      },
      {
        "reach": 0.02,
        "region": "Michigan"
      },
      {
        "reach": 0.02,
        "region": "Iowa"
      },
      {
        "reach": 0.02,
        "region": "North Carolina"
      },
      {
        "reach": 0.02,
        "region": "Georgia"
      },
      {
        "reach": 0.02,
        "region": "Colorado"
      },
      {
        "reach": 0.02,
        "region": "Ohio"
      },
      {
        "reach": 0.02,
        "region": "Arizona"
      },
      {
        "reach": 0.02,
        "region": "Pennsylvania"
      },
      {
        "reach": 0.02,
        "region": "Virginia"
      },
      {
        "reach": 0.03,
        "region": "Washington"
      },
      {
        "reach": 0.03,
        "region": "Massachusetts"
      },
      {
        "reach": 0.04,
        "region": "Illinois"
      },
      {
        "reach": 0.04,
        "region": "Florida"
      },
      {
        "reach": 0.06,
        "region": "New York"
      },
      {
        "reach": 0.13,
        "region": "California"
      },
      {
        "reach": 0.19,
        "region": "Texas"
      }
    ],
    "singleCountry": "US",
    "spend": "$500 - $599",
    "pageSpend": {
      "currentWeek": null,
      "isPoliticalPage": true,
      "weeklyByDisclaimer": {
        "WARREN FOR PRESIDENT, INC.": 270970
      },
      "lifetimeByDisclaimer": {
        "Elizabeth for MA": 781272,
        "Warren for President": 3396973,
        "": 13584,
        "WARREN FOR PRESIDENT, INC.": 4081618,
        "the Elizabeth Warren Presidential Exploratory Committee": 219471
      },
      "hasPoliticalSpendInAnyCountry": true
    },
    "pageBlurb": "United States Senator from Massachusetts, former teacher, and candidate for President of the United States. (official campaign account)"
  },
  "bootloadable": {},
  "ixData": {},
  "bxData": {},
  "gkxData": {},
  "qexData": {},
  "lid": "6796246259692811543"
}
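
Since the question asks for a neat data frame, here is a rough sketch of how the `ageGenderData` and `locationData` arrays in a response like the one above could be flattened into rows, one table per array, keyed by ad. The helper names `flatten_details` and `rows_to_csv` and the ID `example-id` are placeholders invented for this illustration, and the sample is a trimmed copy of the response above.

```python
import csv
import io
import json

def flatten_details(ad_id, payload):
    """Return (age_gender_rows, location_rows) for one ad's payload."""
    age_rows = [
        {'ad_archive_id': ad_id, 'age_range': row['age_range'],
         'female': row['female'], 'male': row['male'], 'unknown': row['unknown']}
        for row in payload.get('ageGenderData', [])
    ]
    loc_rows = [
        {'ad_archive_id': ad_id, 'region': row['region'], 'reach': row['reach']}
        for row in payload.get('locationData', [])
    ]
    return age_rows, loc_rows

def rows_to_csv(rows):
    """Serialize a non-empty list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Trimmed copy of the response above; 'example-id' is a placeholder, not a real ad ID.
sample = json.loads(
    '{"payload": {"ageGenderData": [{"age_range": "18-24", "female": 0.03,'
    ' "male": 0.05, "unknown": 0}], "locationData": [{"reach": 0.19, "region": "Texas"}]}}'
)
age_rows, loc_rows = flatten_details('example-id', sample['payload'])
print(rows_to_csv(age_rows))
print(rows_to_csv(loc_rows))
```

Keeping two long tables rather than one wide one avoids having a column per state, and rows from many ads can simply be appended before writing.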

Finally, to run this Python code from R, use reticulate and run the entire Python script as a string. Note that if the Python script doesn't contain any " characters, it can be dropped straight into an R string, like so:

library(reticulate)
py_run_string("import json
import requests
rest of script etc 
etc 
etc")

You will also need to install the requests library, which the script uses; json is part of the Python standard library and needs no installation. On a Mac, open Terminal and run pip install requests.
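
To keep the results rather than only printing them, the per-ad details could be collected in a dictionary and written to a JSON file. The following is a minimal sketch with stand-in data: the function name `save_details`, the file name `ad_details.json`, and the key `example-id` are all assumptions made for illustration. It uses only single quotes, so it can sit inside an R string as described above.

```python
import json
import os
import tempfile

def save_details(details_by_ad, path):
    """Write a dict of {ad id: details payload} to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(details_by_ad, f, ensure_ascii=False, indent=2)

# Stand-in data; in the real script this dict would be filled inside the loop over ads.
details_by_ad = {'example-id': {'currency': 'USD', 'spend': '$500 - $599'}}

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), 'ad_details.json')
save_details(details_by_ad, out_path)
print(out_path)
```

The resulting file can then be read back into R with jsonlite's fromJSON.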

In the form data, does one of the parameters correspond to an Ad ID? (if so, perhaps grabbing the ad IDs and iterating through them is a possibility?). `__spin_t` looks like a possibility? — stevec, Feb 22 '20 at 12:58
When I try it (using R), I get an empty body response? Could the session_id in the request url be the issue (I have copied exactly the one in the answer) — stevec, Feb 22 '20 at 13:13
Whether you use R, Python, or curl, the complete and correct form data is required to get a non-empty response. — Sers, Feb 22 '20 at 13:24
Second that @stevec, the only thing that changes in the headers across requests is the sessionid, where a session seems to be more of a single page view. I tried to find the session id in previous responses from the server but wasn't successful. Maybe Sers can share a reproducible example? — Tonio Liebrand, Feb 22 '20 at 13:26
I used the string you provided as the body of a `POST`. Would any part(s) of it need to be edited (e.g. parameters etc), or should it work as-is? — stevec, Feb 22 '20 at 13:26
Try to copy from Dev Tools as a curl and test. But complete solution for OP question, as I mentioned in the answer, will be to use Proxy Server and catch the requests. There is no sessionid in the body. — Sers, Feb 22 '20 at 13:28
@Sers Thank you for your answer. My Python skills are basic at best, and it's not very straightforward to jump from your answers/comments to a workable code---could you provide some reproducible code that produced the response? I have grabbed ad IDs, so I can iterate as stevec suggested. — Kim, Feb 23 '20 at 23:10
@Sers In fact, I'm not entirely sure what is the API that you mentioned. If you were so kind as to fill in some of the intermediate steps, I can award you the bounty straightaway. Thank you. — Kim, Feb 24 '20 at 04:49
check the answer update with working example and some explanation. — Sers, Feb 24 '20 at 09:59
@Sers Kim will need to save the data. Can you include a line writes a standard JSON file? I will edit your answer to include how to run your code from R — stevec, Feb 24 '20 at 13:33
@Sers I included how to run the python script from R. It works from my mac. It just doesn't write the data (yet). If it can be written as JSON in the python script, and @Kim wants the data in R, can simply use `library(jsonlite); fromJSON('path/to/file.json')` to read it back into R — stevec, Feb 24 '20 at 13:41
@Kim I think setting `count` to a small number (e.g. 5, rather than 1000) may make for faster testing in case it doesn't work perfectly first time (although I only tried it once to check that it worked as expected so I am not completely sure) — stevec, Feb 24 '20 at 13:48
@Sers It worked, and now I understand what's going on. Thank you very much, and also to stevec. — Kim, Feb 24 '20 at 23:24

score 3 · Answer 2 · answered Feb 22 '20 at 12:52

This is not a complete answer, but hopefully it may help.

I had a go at scraping/parsing, but couldn't make sense of the graph data, as it seems to be spread across many files that show up in the 'network' tab of Chrome dev tools. (I found patches of the data by using command+f inside the network tab and searching for words contained in the graphs, e.g. 'Women', 'Unknown', etc.)

Someone who is familiar with ReactJS may have more luck!

What may work

You could try a totally different method using optical character recognition (OCR).

That is, take a screenshot (i.e. remote$screenshot()), convert from base64 to image, read it, extract the relevant area (i.e. the locations of the specific data you're after), and use methods described here to convert the areas containing the data you're after into text! (I will update if I get a chance to try it, but not looking likely, keen to hear how you go)
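
The "convert from base64 to image" step amounts to decoding the string and writing the bytes to a file. Below is a rough sketch in Python (the language of the other answer), assuming the screenshot arrives as a base64-encoded PNG, which is what WebDriver screenshots normally are. The function name `save_base64_image` is made up, and the input here is a hand-made stand-in, not a real screenshot.

```python
import base64
import os
import tempfile

# First eight bytes of every PNG file.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'

def save_base64_image(b64_string, path):
    """Decode a base64 string, check it looks like a PNG, and write it to path."""
    raw = base64.b64decode(b64_string)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError('decoded data does not look like a PNG')
    with open(path, 'wb') as f:
        f.write(raw)
    return len(raw)

# Stand-in input: the PNG signature plus filler bytes. This is NOT a viewable image;
# a real string would come from the browser screenshot call.
fake = base64.b64encode(PNG_SIGNATURE + b'placeholder').decode('ascii')
out_path = os.path.join(tempfile.gettempdir(), 'screenshot.png')
n_bytes = save_base64_image(fake, out_path)
print(n_bytes)
```

Cropping and OCR would then operate on the saved file.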

stevec

Be it either `tesseract` or `magick`, an OCR approach here is inadequate because you have to hover over leaflet svgs to "see" the underlying data. Thank you for your answer though. – Kim Feb 23 '20 at 23:06
Hello @stevec, what library do you use for `remote$screenshot()`? – Manu Feb 24 '20 at 18:57
@Manu it’s part of `RSelenium` – stevec Feb 24 '20 at 20:10
Thank you @stevec, I've been following this post and it's interesting. – Manu Feb 24 '20 at 20:42
