
I have an automation tool built with Selenium 2 and BrowserMob Proxy that works well for most of what I need. However, I've run into a snag capturing network traffic.

I want to capture the HAR generated by a click before the page redirects. For example, one analytics call fires on the click, which I want to capture, and another fires on the subsequent page load, which I don't want to capture.

All of my attempts so far capture the HAR too late, so I see both the click analytics call and the page-load one. Is there any way to get this working? My current relevant code sections are below.

METHODS INSIDE HELPER CLASS
class _check_for_page_load(object):
    def __init__(self, browser, parent):
        self.browser = browser
        self.maxWait = 5
        self.parent = parent

    def __enter__(self):
        self.old_page = self.browser.find_element_by_tag_name('html')

    def wait_for(self,condition_function):
        start_time = time.time()
        while time.time() < start_time + self.maxWait:
            if condition_function():
                return True
            else:
                time.sleep(0.01)
        raise Exception(
            'Timeout waiting for {}'.format(condition_function.__name__)
        )

    def page_has_loaded(self):
        new_page = self.browser.find_element_by_tag_name('html')
        ###self.parent.log("testing ---- " + str(new_page.id) + " " + str(self.old_page.id))
        return new_page.id != self.old_page.id

    def __exit__(self, *_):
        try:
            self.wait_for(self.page_has_loaded)
        except:
            pass


def startNetworkCalls(self):
    if self._p is not None:
        self._p.new_har("Step" + str(self._currStep))


def getNetworkCalls(self, waitForTrafficToStop=True):
    if self._p is not None:
        if waitForTrafficToStop:
            # wait until traffic has been quiet for 5 s, giving up after 30 s
            self._p.wait_for_traffic_to_stop(5000, 30 * 1000)
        return self._p.har
    else:
        return "{}"


def click(self, selector):
    ''' clicks on an element '''
    self.log("Clicking element '" + selector + "'")
    el = self.findEl(selector)
    traffic = ""

    with self._check_for_page_load(self._d, self):
        try:
            self._curr_window = self._d.window_handles[0]
            el.click()
        except:
            actions = ActionChains(self._d)
            actions.move_to_element(el).click().perform()
    traffic = self.getNetworkCalls(False)

    try:
        popup = self._d.switch_to.alert
        if popup is not None:
            popup.dismiss()
    except Exception:
        pass
    try:
        window_after = self._d.window_handles[1]
        if window_after != self._curr_window:
            self._d.close()
            self._d.switch_to.window(self._curr_window)
    except Exception:
        pass

    return traffic
INSIDE FILE THAT RUNS MULTIPLE SELENIUM ACTIONS
##inside a for loop, we get an action that looks like "click('#selector')"
util.startNetworkCalls()
if action.startswith("click"):
    temp_traffic = eval(action)


if temp_traffic == "":
    temp_traffic = util.getNetworkCalls()
traffic = json.dumps(temp_traffic, sort_keys=True) ##gives json har info that is saved later

As these snippets show, the runner calls the "click" function, which returns the network traffic. Inside the click function, the click itself is wrapped in the "_check_for_page_load" class. However, the first time execution reaches this line:

###self.parent.log("testing ---- " + str(new_page.id) + " " + str(self.old_page.id))

The log (when enabled) shows that the element ids already differ the first time it logs, indicating that the page load has already started. I'm pretty stuck right now, as I've tried everything I can think of to get this working.

sroskelley

1 Answer


I found a solution to my own question, though it isn't perfect. I told the proxy to capture headers when starting the HAR:

def startNetworkCalls(self):
    if self._p is not None:
        self._p.new_har("Step" + str(self._currStep), {"captureHeaders": "true"})

Then, when I retrieve the HAR data, I look for the "Referer" header on each request and compare it with the URL of the page that was loaded before the click triggered the redirect. From there, I can split the HAR into two separate lists of network calls to process further later.
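A minimal sketch of that split, not my exact code. It assumes the HAR is the usual dict with entries under har["log"]["entries"] and request headers stored as a list of name/value pairs. The function name split_har_by_referer and the choice to put requests without a Referer header into the second list are assumptions made for illustration.

```python
from urllib.parse import urldefrag


def split_har_by_referer(har, initial_url):
    """Return (before, after): entries whose Referer is initial_url, and the rest."""
    # Browsers do not send the fragment in Referer, so drop it before comparing.
    initial = urldefrag(initial_url)[0]
    before, after = [], []
    for entry in har.get("log", {}).get("entries", []):
        headers = entry.get("request", {}).get("headers", [])
        # Header names are case-insensitive, so match on the lowercased name.
        referer = next(
            (h.get("value", "") for h in headers
             if h.get("name", "").lower() == "referer"),
            None,
        )
        if referer is not None and urldefrag(referer)[0] == initial:
            before.append(entry)
        else:
            # Assumption: no Referer, or a different one, means the new page.
            after.append(entry)
    return before, after
```

Calling split_har_by_referer(har, url_before_click) with the URL recorded before the click gives the click-time requests in the first list and everything else in the second.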

This works for my needs, but it isn't perfect. Some requests, such as images, are sometimes sent with the previous page's URL as their referrer, so the split puts them in the first bucket rather than the second, where they belong. However, since I'm more interested in requests that aren't on the same domain, this isn't really an issue.
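Since only off-domain requests matter here, a bucket can be narrowed with a hostname check, which also drops most of the misplaced same-site image requests. This is a rough sketch under the same HAR-layout assumption (each entry has a request with a "url" key). The helper name off_domain_entries is made up for illustration, and it compares hostnames exactly, so a subdomain counts as a different domain.

```python
from urllib.parse import urlparse


def off_domain_entries(entries, page_url):
    """Keep only entries whose request host differs from the page's host."""
    page_host = urlparse(page_url).hostname
    return [
        entry for entry in entries
        if urlparse(entry["request"]["url"]).hostname != page_host
    ]
```

Whether exact hostname comparison is strict enough depends on the site; matching on the registered domain instead would need a separate rule.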

sroskelley