I'm fairly new to Python and I learn by doing, so I thought I'd give this project a shot. I'm trying to create a script that finds the Google Analytics request for a given website, parses the request payload, and does something with it.

Here are the requirements:

  1. Ask the user for 2 URLs (to compare the payloads from 2 different HARs)
  2. Use Selenium to open the two URLs, and browsermob-proxy/PhantomJS to capture the HAR for each
  3. Store the HAR entries as a list
  4. From that list, find the Google Analytics request, including its payload
  5. If the Google Analytics tag is found, do things with it: parse the payload, compare the payloads, etc.

Issue: Sometimes, for a website that I know has Google Analytics (e.g. nytimes.com), the HAR that I get is incomplete. My program says "No GA Found", but that's only because the complete HAR was not captured, so when the regex ran to find the matching request it wasn't there. This issue is intermittent and does not happen every time. Any ideas?

I'm thinking that, due to some dependency or latency, the script moved on before the complete HAR was captured. I tried proxy.wait_for_traffic_to_stop, but maybe I didn't do it right.

Also, as a bonus, I would appreciate any help on how to make this script run faster; it's fairly slow. As I mentioned, I'm new to Python, so go easy :)

This is what I've got thus far.

import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime


def cleanup():
    s.stop()
    driver.quit()

proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any']  # so that i can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)

urlLists = []
collectTags = []
gaCollect = 0
varList = []

for x in range(0,2): # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)

    if not url:
        print "You need to type something in...here"
        sys.exit()
    #gets the two user url and stores in list

for urlList in urlLists:

    print urlList, 'start 2nd loop' #printing for debug purpose, no need for this

    if not urlList:
        print 'Your Url list is empty'
        sys.exit()

    proxy.new_har()
    driver.get(urlList)
    #proxy.wait_for_traffic_to_stop(15, 30) #<-- tried this but did not do anything

    for ent in proxy.har['log']['entries']:
        gaCollect = (ent['request']['url'])

        print gaCollect

        if re.search(r'google-analytics.com/r\b', gaCollect):

            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
    else:

        print 'No GA Found - Ending Prog.'
        cleanup()
        sys.exit()

cleanup()
Asif R.

1 Answer

This might be a stale question, but I found an answer that worked for me.

You need to change two things:

  1. Remove sys.exit(). It causes your programme to stop after the first pass through the ent list, so if what you want is not found in that pass, it won't be found at all.
  2. Call new_har with the captureContent option enabled to get the payload of requests: proxy.new_har(options={'captureHeaders': True, 'captureContent': True})
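
Once the payload is being captured, it still has to be pulled out of the HAR. Below is a minimal sketch of one way to do that, not a definitive implementation. The function name find_ga_hits and the URL pattern are my own assumptions: the pattern assumes hits go to /collect or /r/collect, which is a bit stricter than the regex in the question. It collects every matching request instead of exiting on a miss, and reads the payload from the query string or, for POST hits, from postData.text.

```python
import re

try:                                    # Python 3
    from urllib.parse import parse_qs, urlsplit
except ImportError:                     # Python 2, as used in the question
    from urlparse import parse_qs, urlsplit

# Assumption: GA hits are sent to /collect or /r/collect.
GA_PATTERN = re.compile(r'google-analytics\.com/(?:r/)?collect\b')


def find_ga_hits(har):
    """Return one dict of payload parameters per GA request in a HAR dict."""
    hits = []
    for entry in har['log']['entries']:
        request = entry['request']
        if not GA_PATTERN.search(request['url']):
            continue
        # GET hits carry the payload in the query string ...
        raw = urlsplit(request['url']).query
        # ... POST hits carry it in postData.text (needs captureContent).
        post_text = (request.get('postData') or {}).get('text')
        if post_text:
            raw = post_text
        # parse_qs returns a list per key; keep the first value of each.
        hits.append({key: values[0] for key, values in parse_qs(raw).items()})
    return hits


# Hand-written stand-in for proxy.har, for illustration only.
sample_har = {'log': {'entries': [
    {'request': {'url': 'https://www.example.com/app.js'}},
    {'request': {'url': 'https://www.google-analytics.com/r/collect'
                        '?v=1&t=pageview&tid=UA-0000-1'}},
]}}
hits = find_ga_hits(sample_har)
```

In the script from the question this would be called as find_ga_hits(proxy.har) after driver.get(...). An empty list then means "no GA request in this HAR", which the caller can report without ending the whole run.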

See if that helps.
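
If the GA request is still missing now and then, the latency theory from the question may be right: driver.get() can return before the analytics hit has gone out. One option is to poll the HAR for a short while instead of reading it once. This is only a sketch under that assumption; wait_for_entry and its parameters are hypothetical names of mine, not part of browsermob-proxy.

```python
import time


def wait_for_entry(get_har, matches, timeout=30.0, interval=0.5):
    """Poll a HAR until some request URL satisfies `matches`.

    get_har -- zero-argument callable returning the current HAR dict
    matches -- callable taking a request URL and returning True or False
    Returns the first matching entry, or None if `timeout` seconds pass.
    """
    deadline = time.time() + timeout
    while True:
        for entry in get_har()['log']['entries']:
            if matches(entry['request']['url']):
                return entry
        if time.time() >= deadline:
            return None
        time.sleep(interval)
```

With the objects from the question, the call would look something like wait_for_entry(lambda: proxy.har, lambda url: 'google-analytics.com' in url). Because it returns as soon as the hit shows up, it also avoids the fixed time.sleep(2.0) calls, which may help a little with the speed.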