Questions tagged [scrapinghub]

Scrapinghub is a web scraping development and services company that supplies cloud-based web crawling platforms.

175 questions
11 votes, 1 answer

Not able to run/deploy a custom script with shub-image

I have a problem running/deploying a custom script with shub-image. setup.py: from setuptools import setup, find_packages setup( name = 'EU-Crawler', version = '1.0', packages = find_packages(), scripts = [ …
parik • 1,924 • 10 • 36 • 62
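For comparison, a minimal `setup.py` sketch of the shape shub-image expects; the script path and package name below are placeholders, not the asker's actual files. Scripts listed under `scripts` get installed into the image and can then be invoked as custom scripts from Scrapy Cloud (commonly as `py:myscript.py`).

```python
from setuptools import setup, find_packages

setup(
    name='EU-Crawler',
    version='1.0',
    packages=find_packages(),
    # Files listed here are installed as executables in the image;
    # 'bin/myscript.py' is a placeholder path.
    scripts=['bin/myscript.py'],
    # Tells Scrapy Cloud where the project settings live
    # ('eu_crawler' is a placeholder package name).
    entry_points={'scrapy': ['settings = eu_crawler.settings']},
)
```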
7 votes, 4 answers

scrapy passing custom_settings to spider from script using CrawlerProcess.crawl()

I am trying to programmatically call a spider through a script. I am unable to override the settings through the constructor using CrawlerProcess. Let me illustrate this with the default spider for scraping quotes from the official scrapy site (last…
hAcKnRoCk • 942 • 3 • 11 • 27
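The usual pattern here, hedged as a sketch: `custom_settings` is read from the spider *class* before the instance is created, so per-run overrides are passed to `CrawlerProcess` itself (or set on the class) rather than through the constructor. The spider and feed names below are placeholders.

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

# Settings given here are merged over the project settings for this run,
# taking the role that custom_settings plays on the class.
process = CrawlerProcess(settings={
    'FEEDS': {'quotes.json': {'format': 'json'}},
    'DOWNLOAD_DELAY': 1,
})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes
```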
6 votes, 1 answer

Scrapy hidden memory leak

Background - TL;DR: I have a memory leak in my project. I spent a few days looking through the Scrapy memory-leak docs and can't find the problem. I'm developing a medium-sized Scrapy project, ~40k requests per day. I am hosting this using…
Hector Haffenden • 1,128 • 6 • 22
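For leaks like this, Scrapy's own `trackref` utilities are the usual first step; a sketch of how they are typically used, e.g. from the telnet console of the running crawl:

```python
from scrapy.utils.trackref import print_live_refs, get_oldest

# Counts of live Request/Response/Item objects per class; a count that
# only ever grows points at what is being kept alive.
print_live_refs()

# Inspect the oldest live Request to see what is still holding it.
oldest = get_oldest('Request')
if oldest is not None:
    print(oldest.url, oldest.meta)
```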
6 votes, 0 answers

Pygsheets unable to find the server at www.googleapis.com

I'm trying to use pygsheets in a script on ScrapingHub. The pygsheets part of the script begins with: google_client = pygsheets.authorize(service_file=CREDENTIALS_FILENAME, no_cache=True) spreadsheet = google_client.open_by_key(SHEET_ID) Where…
6 votes, 0 answers

Scrapy concurrent requests with stateful sessions

I've been web scraping for some time but am relatively new to Python. I recently switched all my scraping activity from Ruby over to Python, primarily because of Scrapy and Scrapinghub, which seem to provide better support for large-scale…
acowpy • 306 • 3 • 7
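Scrapy's stock answer to concurrent stateful sessions is the `cookiejar` request meta key, which keeps one cookie jar per key value; a sketch (URLs are placeholders):

```python
import scrapy

class SessionSpider(scrapy.Spider):
    name = 'sessions'

    def start_requests(self):
        # One independent cookie jar (session) per numbered login.
        for i in range(3):
            yield scrapy.Request(
                'http://example.com/login',
                meta={'cookiejar': i},
                callback=self.after_login,
                dont_filter=True,
            )

    def after_login(self, response):
        # Re-send the same cookiejar key so follow-up requests
        # reuse that session's cookies.
        yield response.follow(
            '/account',
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parse_account,
        )

    def parse_account(self, response):
        yield {'session': response.meta['cookiejar'], 'status': response.status}
```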
5 votes, 1 answer

Scrapy does not fetch markup on response.css

I've built a simple scrapy spider running on scrapinghub: class ExtractionSpider(scrapy.Spider): name = "extraction" allowed_domains = ['domain'] start_urls = ['http://somedomainstart'] user_agent = "Mozilla/5.0 (Windows NT 10.0;…
qubits • 925 • 2 • 14 • 39
4 votes, 1 answer

scrapy how to load urls from file at scrapinghub

I know how to load data into a Scrapy spider from an external source when working locally. But I struggle to find any info on how to deploy this file to Scrapinghub and what path to use there. Now I use this approach from the SH documentation - enter link…
Billy Jhon • 879 • 12 • 23
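One approach that works on Scrapy Cloud, sketched under the assumption that the file is bundled into the deployed egg (the package and file names are hypothetical): declare the file in `setup.py`'s `package_data` and read it with `pkgutil` instead of a filesystem path, so the same code works locally and on the platform.

```python
import pkgutil

# 'myproject' and 'resources/urls.txt' are placeholder names; the file
# must be listed in setup.py's package_data so shub bundles it.
raw = pkgutil.get_data('myproject', 'resources/urls.txt')
start_urls = [
    line.strip()
    for line in raw.decode('utf-8').splitlines()
    if line.strip()
]
```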
3 votes, 0 answers

Splash - Scrapy - HAR data

In general I understand how to work with Scrapy and XPath to parse the HTML. However, I don't know how to grab the HAR data. import scrapy from scrapy_splash import SplashRequest class QuotesSpider(scrapy.Spider): name = 'quotes' …
Zach • 371 • 1 • 3 • 9
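A hedged sketch of grabbing the HAR alongside the HTML: Splash's `render.json` endpoint returns a HAR log when `har=1` is passed, and scrapy-splash exposes the JSON body as `response.data`.

```python
import scrapy
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url, self.parse,
                endpoint='render.json',   # render.html returns no HAR
                args={'html': 1, 'har': 1},
            )

    def parse(self, response):
        har = response.data['har']        # full network log of the render
        yield {
            'url': response.url,
            'requests_made': len(har['log']['entries']),
        }
```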
3 votes, 1 answer

Why is scrapy with crawlera running so slow?

I am using scrapy 1.7.3 with crawlera (C100 plan from scrapinghub) and python 3.6. When running the spider with crawlera enabled I get about 20 - 40 items per minute. Without crawlera I get 750 - 1000 (but I get banned quickly of course). Have I…
Wramana • 151 • 2 • 15
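That gap is roughly what Crawlera's tuning advice predicts: the proxy does its own per-site throttling, so Scrapy-side delays and autothrottle stack on top of it. A settings sketch along the lines of that advice (the values are assumptions tied to the plan's concurrency limit):

```python
# C100 plan allows 100 concurrent proxy connections; match Scrapy to it.
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 100
# Let Crawlera do the throttling instead of Scrapy.
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0
# Proxied responses can be slow; allow time rather than timing out.
DOWNLOAD_TIMEOUT = 600
```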
3 votes, 1 answer

Use Splash from Scrapinghub locally

I got a subscription for Splash on Scrapinghub and I want to use it from a script that is running on my local machine. The instructions I have found so far are: 1) Edit the settings file: # I got this one from my Scrapinghub account SPLASH_URL =…
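For reference, the usual scrapy-splash wiring for a hosted instance looks like the sketch below; the URL is a placeholder, and authenticating with the instance's API key as the HTTP Basic username (empty password) is an assumption based on how Scrapinghub's hosted Splash is commonly configured.

```python
# settings.py (local project)
SPLASH_URL = 'https://<your-instance>.splash.scrapinghub.com'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# In the spider class (assumption: API key as HTTP Basic username):
# http_user = '<your-splash-api-key>'
# http_pass = ''
```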
3 votes, 1 answer

ScrapingHub Environment Variables Not Loaded

I'm deploying a bunch of spiders on ScrapingHub. The spider itself is working. I would like to change the feed output depending on whether the spider is running locally or on ScrapingHub (if it is running locally then output to a temp folder, if it…
Ze Xuan • 56 • 6
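A common way to branch on this, sketched here on the assumption that `SHUB_JOBKEY` is present in the environment of Scrapy Cloud jobs and absent locally (the s3 path is a placeholder):

```python
import os

def running_on_scrapinghub():
    """True when executing inside a Scrapy Cloud job."""
    return 'SHUB_JOBKEY' in os.environ

# Feed goes to a bucket on Scrapy Cloud, to a temp file locally.
if running_on_scrapinghub():
    FEED_URI = 's3://my-bucket/%(name)s-%(time)s.json'
else:
    FEED_URI = 'file:///tmp/%(name)s.json'
```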
3 votes, 1 answer

scrapinghub starting job too slow

I am new to scraping and I am running different jobs on Scrapinghub. I run them via their API. The problem is that starting and initializing the spider takes too much time, around 30 seconds. When I run it locally, it takes up to 5 seconds to finish…
Mara M • 153 • 1 • 1 • 10
3 votes, 2 answers

Scrapy and Splash time out for a specific site

I have an issue with Scrapy, Crawlera and Splash when trying to fetch responses from this site. I tried the following without luck: pure Scrapy shell - times out Scrapy + Crawlera - times out Scrapinghub Splash instance (small) - times…
3 votes, 2 answers

Download project's source-code from Scrapinghub

I have a project deployed on Scrapinghub, and I do not have any copy of that code at all. How can I download the whole project's code from Scrapinghub to my local machine?
Umair Ayub • 13,220 • 12 • 53 • 124
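If the project was deployed with shub, the eggs built at deploy time are still stored on the platform and can be pulled back down; a sketch (12345 is a placeholder project ID, and the exact name of the downloaded zip may vary):

```shell
pip install shub
shub login                # prompted for your Scrapinghub API key
shub fetch-eggs 12345     # downloads a zip of the project's deployed eggs
unzip eggs-12345.zip -d recovered-source
```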
3 votes, 2 answers

How to install xvfb on Scrapinghub for using Selenium?

I use Python-Selenium in my Scrapy spider. To use Selenium I need to install xvfb on Scrapinghub. When I use apt-get to install xvfb I get this error message: E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied) …
parik • 1,924 • 10 • 36 • 62
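apt-get fails there because Scrapy Cloud jobs run as an unprivileged user; system packages have to be baked into a custom Docker image at build time instead, where root is available. A Dockerfile sketch (the base image tag is an assumption; pick the stack that matches your project):

```dockerfile
FROM scrapinghub/scrapinghub-stack-scrapy:1.3
# Install xvfb at image build time, where apt-get has root access.
RUN apt-get update -qq && \
    apt-get install -y --no-install-recommends xvfb && \
    rm -rf /var/lib/apt/lists/*
```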