I am trying to scrape Instagram with Selenium using the Chrome WebDriver. I need to get the XHR response info. I tried "browsermob-proxy", but the info it gave me wasn't enough:

import time

from browsermobproxy import Server
from selenium import webdriver

# Start browsermob-proxy and create a proxy instance.
server = Server("/home/doruk/Downloads/browsermob-proxy 2.1.4/bin/browsermob-proxy")
server.start()
time.sleep(1)
proxy = server.create_proxy()
time.sleep(1)

# Point Chrome at the proxy.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--proxy-server={0}".format(proxy.proxy))
browser = webdriver.Chrome(chrome_options=chrome_options)

This is the output of proxy.har in JSON format:
 {
    "comment": "", 
    "serverIPAddress": "155.245.9.55", 
    "pageref": "", 
    "startedDateTime": "2018-05-21T16:44:41.053+03:00", 
    "cache": {}, 
    "request": {
      "comment": "", 
      "cookies": [], 
      "url": "https://scontent-sof1-1.cdninstagram.com/vp/e95312434013bc43a5c00c458b53022cb/5BC46751/t51.2885-19/s150x150/26432586_139925760144086_726193654523232256_n.jpg", 
      "queryString": [], 
      "headers": [], 
      "headersSize": 528, 
      "bodySize": 0, 
      "method": "GET", 
      "httpVersion": "HTTP/1.1"
    }, 

When I click "Load More Comments" on a post, a link something like this

https://www.instagram.com/graphql/query/?query_hash=33ba35000cb50da46f5b5e889df7d159&variables=%7B"shortcode"%3A"Bi9ZURdA6Gn"%2C"first"%3A36%2C"after"%3A"AQBr-wP7U4Ykr1QRH7PYJ1a0KQivhS0Ndwae-5F8vrZ5sf1eA_Bfgn4dZ0ql0pwUf9GXPm_LPyhtCnlhH6YOHfuNstwXK9VZuUIR4zD3k24s6Q"%7D

shows up, and I need the info inside it. Is there any way to handle this situation?

I just need the "?query_hash=" part.
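Once such a URL has been captured, the query_hash value itself can be pulled out with the standard library. The following is only a minimal sketch: the function name extract_query_hash is hypothetical, and the example URL is a shortened form of the one above, not a real capture.

```python
from urllib.parse import urlparse, parse_qs


def extract_query_hash(url):
    """Return the query_hash parameter of a URL, or None if it is absent."""
    query = urlparse(url).query
    values = parse_qs(query).get("query_hash")
    return values[0] if values else None


# Shortened version of the GraphQL URL from the question (assumption:
# the "after" cursor was dropped to keep the example short).
example = (
    "https://www.instagram.com/graphql/query/"
    "?query_hash=33ba35000cb50da46f5b5e889df7d159"
    "&variables=%7B%22shortcode%22%3A%22Bi9ZURdA6Gn%22%2C%22first%22%3A36%7D"
)

print(extract_query_hash(example))  # prints 33ba35000cb50da46f5b5e889df7d159
```

This covers only the parsing step; the URL still has to be captured by the proxy first.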

doruksahin
  • After you click the link in question, can you wait, say, 10 seconds and then export the HAR? I know it's silly, but sometimes a lot of requests are happening in the background, and you may be exporting the HAR before browsermob-proxy has captured the information you are looking for. – Satish May 21 '18 at 14:59

1 Answer

I've done it! The trick for me was simply to wait until the page had finished loading entirely. Waiting for the DOM ready state was not enough, because the page continues loading after that. There is a way to remove the arbitrary sleep and ask the driver whether the page has really finished loading, but I do not recall the code; I will have to search for it.

from browsermobproxy import Server
import json
from selenium import webdriver
import time

urle = "https://www.yoururl.com"

# Start browsermob-proxy and create a proxy for the browser to use.
server = Server(path="./browsermob-proxy-2.1.4/bin/browsermob-proxy")
server.start()
proxy = server.create_proxy()

# Route Firefox traffic through the proxy.
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile, executable_path='./geckodriver')

# Start recording; capture headers and response bodies as well.
proxy.new_har(urle, options={'captureHeaders': True, 'captureContent': True})
driver.get(urle)
time.sleep(10)  # arbitrary wait so that background requests can finish

result = json.dumps(proxy.har, ensure_ascii=False)
print(result)

# Shut down the browser, then the proxy server.
driver.quit()
server.stop()
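One possible way to avoid the fixed sleep is to poll the HAR until the request you are waiting for shows up. This is only a sketch, not the method the answer refers to: the helper name wait_for_har_entry and its parameters are hypothetical, and it assumes the HAR has the usual log/entries layout with a request url field per entry, as in the output shown in the question.

```python
import time


def wait_for_har_entry(get_har, url_fragment, timeout=30.0, interval=0.5):
    """Poll get_har() until a request URL containing url_fragment appears.

    get_har is any zero-argument callable returning a HAR dictionary.
    Returns the list of matching URLs, or an empty list on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        har = get_har()
        entries = har.get("log", {}).get("entries", [])
        matches = [
            entry["request"]["url"]
            for entry in entries
            if url_fragment in entry["request"]["url"]
        ]
        # Stop as soon as something matched, or once the time is up.
        if matches or time.monotonic() >= deadline:
            return matches
        time.sleep(interval)
```

With the code above it could be called, after clicking "Load More Comments", as wait_for_har_entry(lambda: proxy.har, "graphql/query") in place of time.sleep(10). The timeout keeps the loop from hanging if the request never fires.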