3

My internet connection goes through a proxy with authentication, and when I try to run the scrapy library on the simplest example, for instance:

scrapy shell http://stackoverflow.com

Everything is fine until I request something with the XPath selector; then the response is the following:

>>> hxs.select('//title')
[<HtmlXPathSelector xpath='//title' data=u'<title>ERROR: Cache Access Denied</title'>]

Likewise, running any spider created inside a project gives me the following error:

C:\Users\Victor\Desktop\test\test>scrapy crawl test
2012-08-11 17:38:02-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: test)
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled item pipelines:
2012-08-11 17:38:02-0400 [test] INFO: Spider opened
2012-08-11 17:38:02-0400 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2012-08-11 17:38:47-0400 [test] DEBUG: Retrying <GET http://automation.whatismyip.com/n09230945.asp> (failed 1 times): TCP connection timed out: 10060: [translated from Spanish] A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because the connected host has failed to respond..
2012-08-11 17:39:02-0400 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...
2012-08-11 17:39:29-0400 [test] INFO: Closing spider (finished)
2012-08-11 17:39:29-0400 [test] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 3,
     'downloader/request_bytes': 732,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 8, 11, 21, 39, 29, 908000),
     'log_count/DEBUG': 9,
     'log_count/ERROR': 1,
     'log_count/INFO': 5,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2012, 8, 11, 21, 38, 2, 876000)}
2012-08-11 17:39:29-0400 [test] INFO: Spider closed (finished)

It appears that my proxy is the problem. If somebody knows a way to use Scrapy with an authenticating proxy, please let me know.

Vkt0r

2 Answers

5

Scrapy supports proxies by using HttpProxyMiddleware:

This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value to Request objects. Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:

  • http_proxy
  • https_proxy
  • no_proxy
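As a quick way to see what the middleware will pick up, Python 3's standard-library `urllib.request.getproxies()` reads the same environment variables the quote above describes. A minimal sketch, using a placeholder proxy address and credentials:

```python
import os
import urllib.request

# Hypothetical proxy address and credentials -- replace with your own.
# Note the username:password pair embedded in the URL.
os.environ["http_proxy"] = "http://USERNAME:PASSWORD@YOUR_PROXY_IP:8080"

# getproxies() scans the *_proxy environment variables, which is the
# same source HttpProxyMiddleware consults for its defaults.
print(urllib.request.getproxies().get("http"))
```

If this prints your proxy URL, Scrapy should see the same value when it starts.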

Also see:

alecxe
  • The thing is that the first link does not work for me, I don't know why, and I've already read the second one, but I don't know how to set the http_proxy variable. Could you help me? – Vkt0r Aug 11 '13 at 22:36
  • Have you set `http_proxy` or `https_proxy` environment variable? – alecxe Aug 11 '13 at 22:47
  • Yes man, the problem is fixed, thanks. The link http://mahmoud.abdel-fattah.net/2012/04/07/using-scrapy-with-proxies/ was very useful. – Vkt0r Aug 12 '13 at 21:09
  • @alecxe can I set the https_proxy per request? – William Kinaan Sep 01 '16 at 12:48
  • If anyone is looking for the first link, the Wayback Machine has it: `https://web.archive.org/web/*/mahmoud.abdel-fattah.net/2012/04/07/using-scrapy-with-proxies/` – Craig van Tonder Dec 22 '16 at 19:30
  • @alecxe, as I'm very new to `scrapy`, I'm a little behind on understanding where to put the `middlewares.py` file. The first link you provided gives the instruction `Create a new file called "middlewares.py" and save it in your scrapy project and add---`. However, I already have a file named `middlewares.py`, which was created by default. Should I replace it with the suggested one? Thanks in advance. – SIM Apr 20 '18 at 11:15
0

Repeating the answer by Mahmoud M. Abdel-Fattah, because the page is no longer available. Credit goes to him; however, I made slight modifications.

If middlewares.py already exists, add the following code to it:

import base64


class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy. b64encode avoids the
        # trailing newline that the deprecated base64.encodestring added,
        # and keeping the value as bytes avoids wrapping it in str(),
        # which would produce a literal "b'...'" in the header.
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode())
        request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
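The header line is where this recipe most often goes wrong, so here is a self-contained check of the value it produces, using the same placeholder `USERNAME:PASSWORD` pair as above:

```python
import base64

proxy_user_pass = "USERNAME:PASSWORD"  # placeholder credentials
encoded_user_pass = base64.b64encode(proxy_user_pass.encode())
header_value = b'Basic ' + encoded_user_pass

# b64encode produces clean bytes with no trailing newline, so the
# header is exactly "Basic <base64 of user:pass>".
print(header_value)
# -> b'Basic VVNFUk5BTUU6UEFTU1dPUkQ='
```

Wrapping `encoded_user_pass` in `str()` instead would send the proxy a header containing the characters `b'...'`, which is why authentication silently fails with the original version.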

In the settings.py file, add the following code:

DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.ProxyMiddleware': 100,
}

This should work by setting http_proxy. However, in my case I'm trying to access a URL over HTTPS and need to set https_proxy, which I'm still investigating. Any lead on that would be of great help.
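For the HTTPS case, one lead worth trying: the middleware honours a separate `https_proxy` environment variable (per the documentation quoted in the other answer), so exporting it alongside `http_proxy` may be enough. A sketch with a placeholder address, not a verified fix:

```shell
# Placeholder proxy address and credentials -- replace with your own.
# Many setups point both variables at the same authenticating proxy.
export http_proxy="http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT"
export https_proxy="http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT"
```

With both set, the same ProxyMiddleware approach above would still handle the Proxy-Authorization header.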

Jacob Nelson