I am new to scraping websites with Python 3. Currently, requests to one particular site (www.tink.de) are really slow: every request takes around 40 seconds. When I try my script with other sites, the response comes back immediately.

I have already read this, this, this and a lot of other material on this issue, but I couldn't solve it. I also tried running the script on a different machine and OS, and even with a different internet connection.

My current workaround is to use Selenium (which is indeed faster), but I would like to solve the problem with the requests module.

Can anyone help?

Here is my example code:

import requests
from datetime import datetime

url = 'https://www.tink.de'

headers = {
    'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/45.0.2454.101 Safari/537.36')
}

print('Process started! ' + str(datetime.now()))

r = requests.get(url, headers=headers) # I also tried with stream=True
print(r.content)

print('Process finished! ' + str(datetime.now()))

Update: here are my response headers:

{'Date': 'Sun, 10 Feb 2019 22:27:15 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '69400', 'Connection': 'keep-alive', 'Server': 'nginx/1.10.3 (Ubuntu)', 'X-Frame-Options': 'SAMEORIGIN', 'X-Aoestatic-Action': 'cms_index_index', 'X-Tags': 'PAGE-14-1', 'X-Aoestatic': 'cache', 'X-Aoestatic-Lifetime': '86400', 'X-Aoestatic-Debug': 'true', 'Expires': 'Mon, 30 Apr 2008 10:00:00 GMT', 'X-Url': '/', 'Cache-Control': 'public', 'X-Aoestatic-Fetch': 'Removed cookie in vcl_backend_response', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'X-Varnish': '134119436 128286748', 'Age': '33396', 'Via': '1.1 varnish-v4', 'X-Cache': 'HIT (2292)', 'Client-ip': '10.XX.XX.XX', 'Accept-Ranges': 'bytes'}

Thanks a lot for your help!

matze
  • fwiw, same code runs for me in under 100ms from halfway across the globe.. it's likely an issue with your machine configuration or network. – Corey Goldberg Feb 10 '19 at 16:20
  • It's really a network latency issue. – Jeevan Chaitanya Feb 10 '19 at 16:52
  • Have you been making lots of requests to this website? if so you could be rate limited. Can you post your response headers? – Dan-Dev Feb 10 '19 at 17:24
  • Thanks for your feedback. Do you know how I can figure out "my" network issue? – matze Feb 10 '19 at 17:32
  • @Dan-Dev: I updated my response header in my first post. – matze Feb 10 '19 at 22:29
  • Are the headers sent from your code identical to the headers sent by your browser? I guess not. Snoop on what your browser sends when it gets a fast response, using e.g. Telerik Fiddler, then replicate that in your code. Does that work any quicker? – barny Feb 14 '19 at 22:19

2 Answers

If it's fast on other sites and only 'www.tink.de' is slow, then it's probably down to that site being slow. You could always try the request without any headers, so just a simple:

import requests

url = 'http://tink.de'
resp = requests.get(url)

print("Status: {}".format(resp.status_code))
print("Content:")
print(resp.content)

Hope this helps.

Joe Tilsed
  • unfortunately the same. takes around 40 seconds :-( – matze Feb 10 '19 at 16:20
  • I just tried, it did take longer than a normal request for me, however not 40 seconds.
    ```
    import requests
    from datetime import datetime

    url = 'http://tink.de'
    print("Calling: {}".format(url))
    start = datetime.utcnow()
    resp = requests.get(url)
    end = datetime.utcnow()
    time_taken = end - start
    print("Took: {}".format(time_taken))
    ```
    And the output:
    ```
    Calling: http://tink.de
    Took: 0:00:03.095986
    ```
    – Joe Tilsed Feb 10 '19 at 16:31
  • But this would mean it's not only a network issue on my side, wouldn't it? – matze Feb 10 '19 at 17:32
  • How long does it take for you to call other websites, for example "http://api.instagram.com"? For me this only takes `0.45` of a second, whereas tink.de took about 3. So tink might just be a slow site, and combined with different internet speeds this could make the difference. It would be interesting to see how long the response from "http://api.instagram.com" takes for you. – Joe Tilsed Feb 10 '19 at 17:38
  • yeah it's fast. It took 0:00:01.532907 – matze Feb 10 '19 at 22:25
  • I'm not too sure then mate, how fast was it today? – Joe Tilsed Feb 11 '19 at 20:43
  • It's slow every day, so it is a consistent issue. I updated my response header, could you have a look at it? Thanks again for your help! – matze Feb 12 '19 at 07:38
  • I think there is a larger issue here than the headers. What happens when you try and load the site in a browser, is it just as slow? – Joe Tilsed Feb 12 '19 at 09:02
  • No, it's fast. That's why I am surprised! Currently, I am using Selenium as a workaround. – matze Feb 12 '19 at 11:04
  • do you have any idea, Joe? – matze Feb 13 '19 at 16:53
  • Now I used a proxy and it's fast! 0:00:00.330741 secs. So they slowed down the requests for my IP? But why is it fast, when I am browsing the website directly by using chrome or firefox? – matze Feb 13 '19 at 17:19
  • Are the headers sent from your code *identical* to the headers sent by your browser? I guess not. Snoop on what your browser sends when it gets a fast response, using e.g. Telerik Fiddler, then replicate that in your code. Does that work any quicker? – barny Feb 14 '19 at 22:18
  • While playing around with Fiddler, I disabled my IPv6 connection and only allowed IPv4 in Windows. Now the request takes 1-4 seconds instead of 40 seconds. But why?! And why does this happen only when I am requesting www.tink.de and not other websites?? – matze Feb 15 '19 at 09:18
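
One way to check the IPv4/IPv6 difference described in the last comment without changing any Windows settings is to time a plain TCP connect over each address family. The following is only a rough sketch using the standard library; `time_connect` is a made-up helper name, not part of requests, and the default timeout is an arbitrary choice.

```python
import socket
import time


def time_connect(host, port=443, family=socket.AF_INET, timeout=10.0):
    """Try one TCP connect over `family`; return (connected, seconds_spent)."""
    try:
        # Only ask the resolver for addresses of the requested family.
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # The host has no address of this family (or the lookup failed).
        return False, 0.0

    af, socktype, proto, _canonname, sockaddr = infos[0]
    start = time.perf_counter()
    try:
        with socket.socket(af, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
        connected = True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        connected = False
    return connected, time.perf_counter() - start
```

Calling `time_connect("www.tink.de", 443, socket.AF_INET6)` and then the same with `socket.AF_INET` should show whether the IPv6 attempt is the part that stalls. If the IPv6 call fails only after a long wait while the IPv4 call succeeds quickly, that would match the behaviour above.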
For now, I forced Python to use an IPv4 connection instead of IPv6 by adding the following code to my script:

import socket
import ssl

import requests
from requests.packages.urllib3.connection import VerifiedHTTPSConnection


class MyHTTPSConnection(VerifiedHTTPSConnection):
    def connect(self):
        # AF_INET restricts the socket to IPv4, so no IPv6 address is ever tried.
        self.sock = socket.socket(socket.AF_INET)
        self.sock.connect((self.host, self.port))
        if self._tunnel_host:
            self._tunnel()
        # Note: ssl.wrap_socket does not verify the server certificate here,
        # and it was removed from the standard library in Python 3.12.
        self.sock = ssl.wrap_socket(self.sock, self.key_file, self.cert_file)

# Make the urllib3 bundled with requests use the IPv4-only connection class.
requests.packages.urllib3.connectionpool.HTTPSConnection = MyHTTPSConnection
requests.packages.urllib3.connectionpool.VerifiedHTTPSConnection = MyHTTPSConnection
requests.packages.urllib3.connectionpool.HTTPSConnectionPool.ConnectionCls = MyHTTPSConnection

socket.AF_INET does the trick and forces requests to use an IPv4 connection.

Thanks to @user2824140: https://stackoverflow.com/a/39233701/3956043

To disable the insecure-request warning, add:

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
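
A lighter alternative, shown here only as a sketch, is to leave the connection classes alone and restrict name resolution to IPv4 for the duration of a request. It assumes that requests (through urllib3) resolves hostnames with `socket.getaddrinfo`; `force_ipv4` is a hypothetical helper name. Because the normal HTTPS code path is still used, certificate verification stays enabled and no warning needs to be silenced. The patch is process-wide while active, so it is not suitable for code that makes requests from several threads at once.

```python
import contextlib
import socket


@contextlib.contextmanager
def force_ipv4():
    """Temporarily make socket.getaddrinfo return IPv4 results only."""
    original = socket.getaddrinfo

    def ipv4_only(host, port, family=0, type=0, proto=0, flags=0):
        # Ignore the requested family and always ask for AF_INET.
        return original(host, port, socket.AF_INET, type, proto, flags)

    socket.getaddrinfo = ipv4_only
    try:
        yield
    finally:
        # Always restore the real resolver, even if the request raised.
        socket.getaddrinfo = original


# Intended usage (requires the requests package):
#
#     with force_ipv4():
#         r = requests.get('https://www.tink.de', headers=headers)
```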
matze