3

I am new to scraping and I run several jobs on Scrapinghub, starting them via their API. The problem is that starting and initializing the spider takes too much time, around 30 seconds before anything happens. When I run the same spider locally, it finishes in about 5 seconds, but on Scrapinghub it takes 2:30 minutes. I understand that closing a spider after all requests are finished takes a bit more time, and that is not the issue.

My problem is the gap between the moment I call the API to start the job (I see it appear in the running jobs list instantly) and the moment the first request is actually made; I have to wait far too long. Any idea how I can make the startup as fast as it is locally? Thanks!
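For reference, this is roughly how I start the job and watch for it to begin; a minimal sketch using the python-scrapinghub client, with a placeholder API key, project ID and spider name:

    import time

    from scrapinghub import ScrapinghubClient

    # Placeholders: replace with your own API key, project ID and spider name.
    client = ScrapinghubClient("MY_API_KEY")
    project = client.get_project(123456)

    scheduled_at = time.time()
    job = project.jobs.run("myspider")

    # Poll until the job leaves the pending state; fetching the job again
    # each time avoids relying on any client-side metadata caching.
    while client.get_job(job.key).metadata.get("state") == "pending":
        time.sleep(1)

    print(f"Job {job.key} left the queue after {time.time() - scheduled_at:.1f}s")
    # The remaining delay, until the first request shows up in the job log,
    # is the part I cannot explain.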

I already tried setting AUTOTHROTTLE_ENABLED = False, as suggested in another question on Stack Overflow.
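To be explicit, this is the change I made in settings.py (a minimal sketch; note that the value is the Python boolean False, and everything else is left at Scrapy's defaults):

    # settings.py

    # Disable AutoThrottle so Scrapy adds no adaptive delays between requests.
    AUTOTHROTTLE_ENABLED = False

    # Keep the fixed download delay at its default of zero as well.
    DOWNLOAD_DELAY = 0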

Mara M

1 Answer

0

According to the Scrapy Cloud docs:
Scrapy Cloud jobs run in containers. These containers can be of different sizes defined by Scrapy Cloud units.

A Scrapy Cloud unit provides: 1 GB of RAM, 2.5 GB of disk space, 1x CPU and 1 concurrent crawl slot.

Resources available to the job are proportional to the number of units allocated.

This means that allocating more Scrapy Cloud units to your job may solve the problem.
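Since you already start jobs through the API, you can also request the units per job when scheduling it. A minimal sketch with the python-scrapinghub client (project ID and spider name are placeholders; if I remember the client correctly, jobs.run accepts a units argument, but check the current API docs):

    from scrapinghub import ScrapinghubClient

    client = ScrapinghubClient("MY_API_KEY")
    project = client.get_project(123456)

    # Request 2 Scrapy Cloud units for this job, i.e. twice the RAM, disk,
    # CPU and crawl slots of a single-unit container.
    job = project.jobs.run("myspider", units=2)
    print(job.key)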

Georgiy