
I want to capture the traffic to the sites I browse with Selenium and Python. Since the traffic will be HTTPS, a plain proxy won't get me far.

My idea was to run PhantomJS with Selenium and have PhantomJS itself execute a script (not on the page via webdriver.execute_script(), but in PhantomJS). I was thinking of the netlog.js script (from https://github.com/ariya/phantomjs/blob/master/examples/netlog.js).

Since it works like this in the command line

phantomjs --cookies-file=/tmp/foo netlog.js https://google.com

there must be a similar way to do this with selenium?

Thanks in advance

Update:

Solved it with browsermob-proxy.

pip3 install browsermob-proxy

Python3 code

from selenium import webdriver
from browsermobproxy import Server

server = Server("<path to browsermob-proxy>")  # replace the placeholder with your local path
server.start()
proxy = server.create_proxy({'captureHeaders': True, 'captureContent': True, 'captureBinaryContent': True})

service_args = ["--proxy=%s" % proxy.proxy, '--ignore-ssl-errors=yes']
driver = webdriver.PhantomJS(service_args=service_args)

proxy.new_har()
driver.get('https://google.com')
print(proxy.har)  # this is the archive
# for example:
all_requests = [entry['request']['url'] for entry in proxy.har['log']['entries']]
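
Since the HAR returned by proxy.har is a plain dictionary, it can be written to disk for later inspection. The following is only a minimal sketch: save_har, load_har, the file name capture.har and the tiny example_har dictionary are hypothetical names introduced for illustration, and example_har merely stands in for a real proxy.har.

```python
import json
import os
import tempfile


def save_har(har, path):
    """Write a HAR dictionary to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(har, handle, indent=2)


def load_har(path):
    """Read a HAR file back into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical stand-in for proxy.har (a real archive has many more fields)
example_har = {"log": {"entries": [{"request": {"url": "https://google.com/"}}]}}

# Hypothetical output location inside a fresh temporary directory
har_path = os.path.join(tempfile.mkdtemp(), "capture.har")
save_har(example_har, har_path)
restored = load_har(har_path)
print(restored["log"]["entries"][0]["request"]["url"])
```

With a live session, save_har(proxy.har, har_path) would be the equivalent call.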
isedwards
Bart
  • In addition to installing the Python library with `pip`, you also need to download the latest release of bmp from `https://github.com/lightbody/browsermob-proxy/releases` and install the Java runtime environment (`apt-get install default-jre`). `<path to browsermob-proxy>` is then set to the path you downloaded bmp to. – isedwards Jul 02 '16 at 20:36

3 Answers


I am using a proxy for this

from selenium import webdriver
from browsermobproxy import Server

server = Server(environment.b_mob_proxy_path)
server.start()
proxy = server.create_proxy()
service_args = ["--proxy-server=%s" % proxy.proxy]
driver = webdriver.PhantomJS(service_args=service_args)

proxy.new_har()
driver.get('url_to_open')
print proxy.har  # this is the archive
# for example:
all_requests = [entry['request']['url'] for entry in proxy.har['log']['entries']]

The HAR (HTTP Archive format) contains a lot of other information about the requests and responses; it's very useful to me.
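
As a rough illustration of that extra information, the sketch below pulls the method, URL, status code and MIME type out of each entry. It runs against a small hand-written dictionary (sample_har, hypothetical data shaped like a HAR 1.2 log) rather than a live proxy.har, and summarize_har and count_statuses are helper names invented for this example.

```python
from collections import Counter

# Hypothetical, hand-written HAR fragment standing in for proxy.har
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://example.com/"},
                "response": {"status": 200, "content": {"mimeType": "text/html"}},
            },
            {
                "request": {"method": "GET", "url": "https://example.com/api/items"},
                "response": {"status": 200, "content": {"mimeType": "application/json"}},
            },
            {
                "request": {"method": "POST", "url": "https://example.com/missing"},
                "response": {"status": 404, "content": {"mimeType": "text/plain"}},
            },
        ]
    }
}


def summarize_har(har):
    """Return one (method, url, status, mime_type) tuple per HAR entry."""
    rows = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        response = entry["response"]
        # "content" may be missing in a sparse capture, so fall back to ""
        mime_type = response.get("content", {}).get("mimeType", "")
        rows.append((request["method"], request["url"], response["status"], mime_type))
    return rows


def count_statuses(har):
    """Count how often each HTTP status code occurs in the archive."""
    return Counter(entry["response"]["status"] for entry in har["log"]["entries"])


rows = summarize_har(sample_har)
status_counts = count_statuses(sample_har)
for row in rows:
    print(row)
```

Passing proxy.har instead of sample_har should work the same way, assuming the proxy was created with header and content capture enabled as needed.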

Installing on Linux:

pip install browsermob-proxy
Jonathan
  • Thanks, that did it as well. Though for Python3 you need to change some code and a phantomJS parameter. Updated it in my post. – Bart Apr 25 '16 at 08:07
  • used `driver = webdriver.Chrome(service_args=service_args)` instead and it worked like charm – Indra Jan 31 '18 at 09:26

If anyone here is looking for a pure Selenium/Python solution, the following snippet might help. It uses Chrome to log all requests and, as an example, prints the URL and body of every JSON response.

import json  # needed for json.loads below
from time import sleep

from selenium import webdriver
from selenium.webdriver import DesiredCapabilities

# make chrome log requests
capabilities = DesiredCapabilities.CHROME.copy()  # copy so the shared default dict is not mutated

capabilities["loggingPrefs"] = {"performance": "ALL"}  # chromedriver < ~75
# capabilities["goog:loggingPrefs"] = {"performance": "ALL"}  # chromedriver 75+

driver = webdriver.Chrome(
    desired_capabilities=capabilities, executable_path="./chromedriver"
)

# fetch a site that does xhr requests
driver.get("https://sitewithajaxorsomething.com")
sleep(5)  # wait for the requests to take place

# extract requests from logs
logs_raw = driver.get_log("performance")
logs = [json.loads(lr["message"])["message"] for lr in logs_raw]

def log_filter(log_):
    return (
        # is an actual response
        log_["method"] == "Network.responseReceived"
        # and json
        and "json" in log_["params"]["response"]["mimeType"]
    )

for log in filter(log_filter, logs):
    request_id = log["params"]["requestId"]
    resp_url = log["params"]["response"]["url"]
    print(f"Caught {resp_url}")
    print(driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id}))

Gist: https://gist.github.com/lorey/079c5e178c9c9d3c30ad87df7f70491d
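
The performance log also carries the outgoing side of each exchange, so requests and responses can be matched through their shared requestId. The sketch below shows one way to do that on hypothetical sample data (sample_logs_raw, shaped like the output of driver.get_log("performance")); pair_requests_with_responses is a helper name made up for this example, and the event and field names assume the Chrome DevTools Protocol Network domain.

```python
import json


def make_raw(method, params):
    """Build one entry shaped like an item of driver.get_log("performance")."""
    return {"message": json.dumps({"message": {"method": method, "params": params}})}


# Hypothetical sample data standing in for a real performance log
sample_logs_raw = [
    make_raw("Network.requestWillBeSent",
             {"requestId": "1", "request": {"method": "GET", "url": "https://example.com/api/items"}}),
    make_raw("Network.responseReceived",
             {"requestId": "1", "response": {"url": "https://example.com/api/items",
                                             "status": 200, "mimeType": "application/json"}}),
    make_raw("Network.requestWillBeSent",
             {"requestId": "2", "request": {"method": "GET", "url": "https://example.com/pending"}}),
]


def pair_requests_with_responses(logs_raw):
    """Match each received response to the request that shares its requestId."""
    requests = {}
    pairs = []
    for raw in logs_raw:
        event = json.loads(raw["message"])["message"]
        params = event.get("params", {})
        if event["method"] == "Network.requestWillBeSent":
            requests[params["requestId"]] = params["request"]
        elif event["method"] == "Network.responseReceived":
            request = requests.get(params["requestId"])
            if request is not None:
                pairs.append({
                    "request_id": params["requestId"],
                    "method": request["method"],
                    "url": request["url"],
                    "status": params["response"]["status"],
                })
    return pairs


pairs = pair_requests_with_responses(sample_logs_raw)
for pair in pairs:
    print(pair)
```

Requests that never received a response (such as the second one in the sample) are simply left out of the result.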

lorey

I use a solution without a proxy server for this. I modified the Selenium source code according to the link below in order to add the executePhantomJS function.

https://github.com/SeleniumHQ/selenium/pull/2331/files

Then I execute the following script after getting the phantomJS driver:

from selenium.webdriver import PhantomJS

driver = PhantomJS()

script = """
    var page = this;
    page.onResourceRequested = function (req) {
        console.log('requested: ' + JSON.stringify(req, undefined, 4));
    };
    page.onResourceReceived = function (res) {
        console.log('received: ' + JSON.stringify(res, undefined, 4));
    };
"""

driver.execute_phantomjs(script)
driver.get("http://ariya.github.com/js/random/")
driver.quit()

All requests are then logged to the console (usually the ghostdriver.log file).

kosnik