
I got a subscription for Splash on Scrapinghub and I want to use it from a script running on my local machine. The instructions I have found so far are:

1) Edit the settings file:

# I got this one from my Scrapinghub account
SPLASH_URL = 'http://xx.x0-splash.scrapinghub.com'


DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

From that I have one doubt: when I try to open the Splash server in the browser it asks me for a username, and I don't see where to set this in Scrapy.


2) The spider file:

import scrapy
import json
from scrapy import Request
from scrapy_splash import SplashRequest

data_dir = 'data/'  # directory for output files


class ListSpider(scrapy.Spider):

    name = 'list'
    allowed_domains = ['medium.com']  # domains only, not full URLs
    start_urls = ['https://medium.com/']

    def parse(self, response):
        print(response.body)
        with open('data/cookies_file.json') as f:
            cookies_data = json.loads(f.read())[0]
        url = 'https://medium.com/'
        # cookies=cookies_data could also be passed to the request below
        yield Request(url, callback=self.afterlogin, meta={'splash': {'args': {'html': 1, 'png': 1}}})

    def afterlogin(self, response):
        with open(data_dir + 'after_login_page.html', 'w') as f:
            f.write(str(response.body))

I'm not getting errors, but I'm not sure whether Splash is working either. Also, besides the server address, Scrapinghub provides a password which I don't know where to use in this script.

After using SplashRequest and adding the API key, this is the traceback I'm getting. The content of the sites is still not loading.

2019-07-17 10:10:08 [scrapy.core.engine] INFO: Spider opened
2019-07-17 10:10:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-17 10:10:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-07-17 10:10:09 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "www.meetmindful.com"; '*.meetmindful.com'!='www.meetmindful.com'
2019-07-17 10:10:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.meetmindful.com/> (referer: None)
2019-07-17 10:10:13 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
2019-07-17 10:10:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.meetmindful.com/login via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
2019-07-17 10:10:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-17 10:10:21 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
2019-07-17 10:10:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
2019-07-17 10:10:26 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-17 10:10:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 2952,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 3,
 'downloader/response_bytes': 28104,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 7, 17, 14, 10, 26, 292646),
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'log_count/WARNING': 3,
 'memusage/max': 54104064,
 'memusage/startup': 54104064,
 'request_depth_max': 2,
 'response_received_count': 3,
 'retry/count': 1,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'splash/render.html/request_count': 2,
 'splash/render.html/response_count/200': 2,
 'start_time': datetime.datetime(2019, 7, 17, 14, 10, 8, 200073)}
2019-07-17 10:10:26 [scrapy.core.engine] INFO: Spider closed (finished)

Edit:

This is the complete log I'm getting:

INFO: Scrapy 1.5.2 started (bot: meetmindfull)
INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.7.3 (default, Mar 27 2019, 22:11:17) - [GCC 7.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1  11 Sep 2018), cryptography 2.6.1, Platform Linux-4.15.0-20-generic-x86_64-with-debian-buster-sid
INFO: Overridden settings: {'BOT_NAME': 'meetmindfull', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'LOG_FILE': 'log.txt', 'LOG_FORMAT': '%(levelname)s: %(message)s', 'NEWSPIDER_MODULE': 'meetmindfull.spiders', 'SPIDER_MODULES': ['meetmindfull.spiders']}
INFO: Telnet Password: 4a122ec20dcf75e1
INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO: Enabled item pipelines:
[]
INFO: Spider opened
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG: Telnet console listening on 127.0.0.1:6023
WARNING: Remote certificate is not valid for hostname "www.meetmindful.com"; '*.meetmindful.com'!='www.meetmindful.com'
DEBUG: Crawled (200) <GET https://www.meetmindful.com/> (referer: None)
WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
DEBUG: Crawled (200) <GET https://app.meetmindful.com/login via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
DEBUG: Retrying <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
DEBUG: Crawled (200) <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
INFO: Closing spider (finished)
INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 2952,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 3,
 'downloader/response_bytes': 28096,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 7, 17, 14, 47, 46, 604347),
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'log_count/WARNING': 3,
 'memusage/max': 54267904,
 'memusage/startup': 54267904,
 'request_depth_max': 2,
 'response_received_count': 3,
 'retry/count': 1,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'splash/render.html/request_count': 2,
 'splash/render.html/response_count/200': 2,
 'start_time': datetime.datetime(2019, 7, 17, 14, 47, 28, 791792)}
INFO: Spider closed (finished)
Gallaecio

1 Answer


If you look at their example file, you can see how to use it:

https://github.com/scrapy-plugins/scrapy-splash/blob/e40ca4f3b367ab463273bee1357d3edfe0601f0d/example/scrashtest/spiders/quotes.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

from scrapy_splash import SplashRequest


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    # http_user = 'splash-user'
    # http_pass = 'splash-password'

    def parse(self, response):
        ...

Also, you need to yield SplashRequest instead of Request; you are not actually using Splash at all in your code.

yield Request(url, callback=self.afterlogin, meta={'splash': {'args': {'html': 1, 'png': 1}}})

should be

yield SplashRequest(url, callback=self.afterlogin, meta={'splash': {'args': {'html': 1, 'png': 1}}})
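As for the username/password prompt: the commented-out `http_user` / `http_pass` attributes in the example spider above are one way to do it per spider. A sketch of the settings.py alternative, assuming scrapy-splash >= 0.8 (which reads `SPLASH_USER` and `SPLASH_PASS` and sends them as HTTP Basic Auth to the Splash server); the key value below is a placeholder:

```python
# settings.py (sketch) -- scrapy-splash >= 0.8 uses these two settings to
# authenticate against the Splash server; this is where the API key goes.
SPLASH_URL = 'http://xx.x0-splash.scrapinghub.com'
SPLASH_USER = 'your-scrapinghub-api-key'  # placeholder: the key from your account page
SPLASH_PASS = ''                          # Scrapinghub Splash uses an empty password
```

With this in place, the browser-style authentication prompt is handled by scrapy-splash and no credentials need to appear in the spider itself.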
Tarun Lalwani