I have a subscription for Splash on Scrapinghub and I want to use it from a script running on my local machine. The instructions I have found so far are:

1) Edit the settings file:

# I got this one from my Scrapinghub account
SPLASH_URL = 'http://xx.x0-splash.scrapinghub.com'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

I have one doubt about that: when I try to open the Splash server in the browser it asks me for a username, and I don't see where to set this in Scrapy.


2) The spider file:

import scrapy
import json
from scrapy import  Request
from scrapy_splash import SplashRequest
import scrapy_splash

class ListSpider(scrapy.Spider):

    name = 'list'
    allowed_domains = ['medium.com']
    start_urls = ['https://medium.com/']

    def parse(self, response):
        print (response.body)
        with open('data/cookies_file.json') as f:
            cookies_data = json.loads(f.read())[0]
        #print (cookies_data)
        url = 'https://medium.com/' 
        yield Request(url,  callback=self.afterlogin,meta={'splash': {'args': {'html': 1, 'png': 1,}}})

    def afterlogin(self, response):
        # Save the page returned to this callback so it can be inspected later
        with open(data_dir + 'after_login_page.html', 'w') as f:
            f.write(response.text)

I'm not getting errors, but I'm not sure Splash is working either. Also, besides the server address, Scrapinghub provides a password, which I don't know where to use in this script.

After switching to SplashRequest and adding the API key, this is the log I'm getting; the content of the sites is still not loading.

2019-07-17 10:10:08 [scrapy.core.engine] INFO: Spider opened
2019-07-17 10:10:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-17 10:10:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on
2019-07-17 10:10:09 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "www.meetmindful.com"; '*.meetmindful.com'!='www.meetmindful.com'
2019-07-17 10:10:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.meetmindful.com/> (referer: None)
2019-07-17 10:10:13 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
2019-07-17 10:10:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.meetmindful.com/login via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
2019-07-17 10:10:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-17 10:10:21 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "uyu74ur0-splash.scrapinghub.com"; '*.scrapinghub.com'!='uyu74ur0-splash.scrapinghub.com'
2019-07-17 10:10:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.meetmindful.com/grid via https://uyu74ur0-splash.scrapinghub.com/render.html> (referer: None)
2019-07-17 10:10:26 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-17 10:10:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 2952,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 3,
 'downloader/response_bytes': 28104,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 7, 17, 14, 10, 26, 292646),
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'log_count/WARNING': 3,
 'memusage/max': 54104064,
 'memusage/startup': 54104064,
 'request_depth_max': 2,
 'response_received_count': 3,
 'retry/count': 1,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'splash/render.html/request_count': 2,
 'splash/render.html/response_count/200': 2,
 'start_time': datetime.datetime(2019, 7, 17, 14, 10, 8, 200073)}
2019-07-17 10:10:26 [scrapy.core.engine] INFO: Spider closed (finished)


If you look at their example file, it shows how to use it:


# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    # http_user = 'splash-user'
    # http_pass = 'splash-password'

    def parse(self, response):
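For context, http_user and http_pass are the spider attributes that Scrapy's built-in HttpAuthMiddleware reads in order to send HTTP Basic credentials, which is presumably what the browser prompt is asking for. The sketch below uses only the standard library to build (without sending) the same kind of authenticated request against Splash's render.html endpoint, so the credentials can be checked outside Scrapy. The host name, the API key value, and the idea that the key goes in as the user name with an empty password are assumptions to adapt to your own account; build_splash_check is a hypothetical helper, not part of any library.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_splash_check(splash_url, target_url, user, password="", wait=0.5):
    """Build (but do not send) a GET request to Splash's render.html endpoint."""
    query = urlencode({"url": target_url, "wait": wait})
    request = Request(f"{splash_url.rstrip('/')}/render.html?{query}")
    # HTTP Basic auth: base64("user:password") in the Authorization header
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder values: substitute the Splash URL and key from your own account
req = build_splash_check(
    "http://xx.x0-splash.scrapinghub.com",
    "https://medium.com/",
    user="YOUR_API_KEY",
)
print(req.full_url)
```

Passing the returned request to urllib.request.urlopen should give back rendered HTML if the credentials are accepted, and raise an HTTP 401 error if they are not.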

Also, you need to yield SplashRequest instead of Request; as written, your code is not actually using Splash at all.

yield Request(url,  callback=self.afterlogin,meta={'splash': {'args': {'html': 1, 'png': 1,}}})

should be

yield SplashRequest(url,  callback=self.afterlogin,meta={'splash': {'args': {'html': 1, 'png': 1,}}})
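As a side note, SplashRequest also accepts the Splash options directly through its args and endpoint parameters, which scrapy-splash stores under request.meta['splash']. Its default endpoint is render.html, which returns only the HTML, so asking for both html and png most likely needs the render.json endpoint instead. The plain-Python sketch below only shows the shape of that meta dictionary and how the base64-encoded png field of a render.json result can be decoded; splash_meta and decode_png are hypothetical helpers, and fake_result is a stand-in, not real Splash output.

```python
import base64


def splash_meta(args, endpoint="render.json"):
    """Return a meta dict in the shape scrapy-splash reads from request.meta."""
    return {"splash": {"args": dict(args), "endpoint": endpoint}}


def decode_png(data):
    """Decode the base64 'png' field of a render.json result into raw bytes."""
    return base64.b64decode(data["png"])


meta = splash_meta({"html": 1, "png": 1})

# Stand-in for a render.json result; a real one comes back from Splash
fake_result = {
    "html": "<html></html>",
    "png": base64.b64encode(b"\x89PNG").decode("ascii"),
}
print(meta["splash"]["endpoint"], decode_png(fake_result)[:4])
```

With SplashRequest itself, the same options would be passed as args={'html': 1, 'png': 1} together with endpoint='render.json', and the decoded fields read from response.data in the callback.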
