Questions tagged [scrapinghub]
Scrapinghub, a web scraping development and services company, supplies cloud-based web crawling platforms.
175 questions
11 votes, 1 answer
Not able to run/deploy custom script with shub-image
I have a problem running/deploying a custom script with shub-image.
setup.py
from setuptools import setup, find_packages

setup(
    name='EU-Crawler',
    version='1.0',
    packages=find_packages(),
    scripts=[
…
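For reference, shub-image runs custom scripts that are declared in the package's scripts list. A hedged sketch of how such a setup.py is usually completed; the bin/update_data.py path and the eucrawler.settings module are placeholder assumptions, not taken from the question:

```python
# Sketch only: the script path and settings module below are hypothetical.
from setuptools import setup, find_packages

setup(
    name='EU-Crawler',
    version='1.0',
    packages=find_packages(),
    # Scripts listed here are shipped with the image and can be
    # started on Scrapy Cloud as standalone jobs.
    scripts=['bin/update_data.py'],
    # Tells shub/Scrapy Cloud where the project settings live.
    entry_points={'scrapy': ['settings = eucrawler.settings']},
)
```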
parik · 1,924 · 10 · 36 · 62
7 votes, 4 answers
scrapy passing custom_settings to spider from script using CrawlerProcess.crawl()
I am trying to programmatically call a spider from a script. I am unable to override the settings through the constructor using CrawlerProcess. Let me illustrate this with the default spider for scraping quotes from the official scrapy site (last…
hAcKnRoCk · 942 · 3 · 11 · 27
6 votes, 1 answer
Scrapy hidden memory leak
Background - TLDR: I have a memory leak in my project
Spent a few days looking through the memory leak docs with scrapy and can't find the problem.
I'm developing a medium-sized scrapy project, ~40k requests per day.
I am hosting this using…
Hector Haffenden · 1,128 · 6 · 22
6 votes, 0 answers
Pygsheets unable to find the server at www.googleapis.com
I'm trying to use pygsheets in a script on ScrapingHub. The pygsheets part of the script begins with:
google_client = pygsheets.authorize(service_file=CREDENTIALS_FILENAME, no_cache=True)
spreadsheet = google_client.open_by_key(SHEET_ID)
Where…
osjerick · 566 · 1 · 5 · 19
6 votes, 0 answers
Scrapy concurrent requests with stateful sessions
I've been web scraping for some time but am relatively new to Python. I recently switched all my scraping activity from Ruby to Python, primarily because of Scrapy and Scrapinghub, which seem to provide better support for large-scale…
acowpy · 306 · 3 · 7
5 votes, 1 answer
Scrapy does not fetch markup on response.css
I've built a simple scrapy spider running on scrapinghub:
class ExtractionSpider(scrapy.Spider):
    name = "extraction"
    allowed_domains = ['domain']
    start_urls = ['http://somedomainstart']
    user_agent = "Mozilla/5.0 (Windows NT 10.0;…
qubits · 925 · 2 · 14 · 39
4 votes, 1 answer
scrapy how to load urls from file at scrapinghub
I know how to load data into a Scrapy spider from an external source when working locally. But I struggle to find any info on how to deploy this file to scrapinghub and what path to use there. Now I use this approach from the SH documentation - enter link…
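A common pattern for this is to bundle the file inside the deployed package and read it with pkgutil, which resolves the path the same way locally and on Scrapy Cloud. A sketch, assuming a hypothetical myproject package whose resources/urls.txt is declared in setup.py's package_data so shub ships it:

```python
# Sketch: reading a data file bundled inside the deployed package.
import pkgutil

def parse_urls(text):
    """Return non-empty, stripped lines as a list of URLs."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def load_start_urls(package='myproject', resource='resources/urls.txt'):
    # pkgutil.get_data resolves the file relative to the installed
    # package, so the same call works locally and after `shub deploy`.
    # Package and resource names here are placeholder assumptions.
    return parse_urls(pkgutil.get_data(package, resource).decode('utf-8'))
```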
Billy Jhon · 879 · 12 · 23
3 votes, 0 answers
Splash - Scrapy - HAR data
In general I understand how to work with Scrapy and x-path to parse the html. However, I don't know how to grab the HAR data.
import scrapy
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
…
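For reference, Splash exposes the HAR log through its Lua API: a script sent to the /execute endpoint can return splash:har() alongside the HTML, and a scrapy-splash spider then reads it from response.data. A hedged sketch of such a script:

```python
# Sketch: a Splash Lua script that returns the HAR log along with the HTML.
lua_source = """
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(1.0))
  return {
    html = splash:html(),
    har  = splash:har(),  -- the network-activity log the question asks about
  }
end
"""

# Inside the spider it would be used roughly like this (not executed here):
#   yield SplashRequest(url, self.parse, endpoint='execute',
#                       args={'lua_source': lua_source})
# and in parse():
#   har = response.data['har']
```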
Zach · 371 · 1 · 3 · 9
3 votes, 1 answer
Why is scrapy with crawlera running so slow?
I am using scrapy 1.7.3 with crawlera (C100 plan from scrapinghub) and python 3.6.
When running the spider with crawlera enabled I get about 20 - 40 items per minute. Without crawlera I get 750 - 1000 (but I get banned quickly of course).
Have I…
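For context, a sketch of the settings.py adjustments generally recommended when using Crawlera; the concurrency numbers are assumptions tied to the C100 plan mentioned above, and the API key is a placeholder:

```python
# Sketch: typical settings.py adjustments for Crawlera.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your-api-key>'  # placeholder

# Throughput is capped by the plan's concurrency (100 slots on a C100
# plan), so let Scrapy actually use those slots:
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 100

# AutoThrottle adds its own delays on top of Crawlera's pacing and is a
# common cause of very low item rates:
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600  # individual proxied responses can be slow
```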
Wramana · 151 · 2 · 15
3 votes, 1 answer
Use Splash from Scrapinghub locally
I got a subscription for Splash on Scrapinghub and I want to use it from a script running on my local machine. The instructions I have found so far are:
1) Edit the settings file:
# I got this one from my Scrapinghub account
SPLASH_URL =…
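For reference, a sketch of the usual scrapy-splash settings, with placeholders where the account-specific values go; the instance URL below is an assumption to be replaced with the one shown in your Scrapinghub account:

```python
# Sketch: the usual scrapy-splash configuration in settings.py.
SPLASH_URL = 'https://<your-instance>.splash.scrapinghub.com'  # placeholder

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Hosted Splash instances use HTTP Basic auth; one common approach is
# Scrapy's HttpAuthMiddleware, setting these on the spider class:
#   http_user = '<your-api-key>'
#   http_pass = ''
```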
Luis Ramon Ramirez Rodriguez · 6,361 · 20 · 65 · 123
3 votes, 1 answer
ScrapingHub Environment Variables Not Loaded
I'm deploying a bunch of spiders on ScrapingHub. The spider itself is working. I would like to change the feed output depending on whether the spider is running locally or on ScrapingHub (if it is running locally then output to a temp folder, if it…
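One way this is commonly detected: Scrapy Cloud sets environment variables such as SHUB_JOBKEY in every job, so the spider can branch on their presence. A minimal sketch; the S3 bucket is a placeholder:

```python
# Sketch: choose a feed target depending on where the spider is running.
import os
import tempfile

def feed_uri():
    if os.environ.get('SHUB_JOBKEY'):  # set by Scrapy Cloud in every job
        return 's3://my-bucket/%(name)s/%(time)s.json'  # placeholder bucket
    # local run: write to a temp folder instead
    return os.path.join(tempfile.gettempdir(), 'output.json')
```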
Ze Xuan · 56 · 6
3 votes, 1 answer
scrapinghub starting job too slow
I am new to scraping and I am running different jobs on scrapinghub. I run them via their API. The problem is that starting and initializing the spider takes too long, around 30 seconds. When I run it locally, it takes up to 5 seconds to finish…
Mara M · 153 · 1 · 1 · 10
3 votes, 2 answers
Scrapy and Splash times out for a specific site
I have an issue with Scrapy, Crawlera and Splash when trying to fetch responses from this site.
I tried the following without luck:
- pure Scrapy shell - times out
- Scrapy + Crawlera - times out
- Scrapinghub Splash instance (small) - times…
Szabolcs · 3,041 · 13 · 31
3 votes, 2 answers
Download project's source-code from Scrapinghub
I have a project deployed on Scrapinghub, but I do not have any copy of that code at all.
How can I download the whole project's code on my localhost from Scrapinghub?
Umair Ayub · 13,220 · 12 · 53 · 124
3 votes, 2 answers
How to install xvfb on Scrapinghub for using Selenium?
I use Selenium with Python in my Scrapy spider; to use Selenium I need to install xvfb on Scrapinghub.
When I use apt-get to install xvfb I get this error message:
E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied) …
parik · 1,924 · 10 · 36 · 62