Questions tagged [splash-js-render]

Splash is a JavaScript rendering service: a lightweight web browser with an HTTP API, implemented in Python using Twisted and Qt. It is often used as an alternative to Selenium.

https://splash.readthedocs.io/en/stable/

Splash - A javascript rendering service

Splash is a JavaScript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python using Twisted and Qt. The (Twisted) Qt reactor is used to make the server fully asynchronous, which allows it to take advantage of WebKit concurrency via the Qt main loop. Some of Splash's features:

  • process multiple webpages in parallel;
  • get HTML results and/or take screenshots;
  • turn OFF images or use Adblock Plus rules to make rendering faster;
  • execute custom JavaScript in page context;
  • write Lua browsing scripts;
  • develop Splash Lua scripts in Splash-Jupyter Notebooks;
  • get detailed rendering info in HAR format.
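
Several of the features above map onto endpoints of the Splash HTTP API (`render.html`, `render.png`, `render.har`). The following is a minimal sketch of how such request URLs might be built, assuming a Splash instance listening on `localhost:8050`; the helper name `build_splash_url` and the target page `http://example.com` are hypothetical and exist only for illustration.

```python
from urllib.parse import urlencode


def build_splash_url(base, endpoint, page_url, **options):
    """Build a GET URL for a Splash HTTP API endpoint (hypothetical helper)."""
    # The page to render goes in the `url` argument; other Splash
    # arguments (wait, images, width, ...) are appended after it.
    params = {"url": page_url, **options}
    return f"{base.rstrip('/')}/{endpoint}?{urlencode(params)}"


# Assumed address of a locally running Splash instance.
SPLASH = "http://localhost:8050"

# Rendered HTML, waiting 0.5 s after load, with images turned off.
html_url = build_splash_url(SPLASH, "render.html", "http://example.com",
                            wait=0.5, images=0)

# PNG screenshot, resized to a width of 320 pixels.
png_url = build_splash_url(SPLASH, "render.png", "http://example.com",
                           width=320)

# Rendering details in HAR format.
har_url = build_splash_url(SPLASH, "render.har", "http://example.com")

print(html_url)
print(png_url)
print(har_url)
```

Each URL can then be fetched with any HTTP client, for example `urllib.request.urlopen(html_url)`, once a Splash server is actually running at that address.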
134 questions
22
votes
3 answers

Scrapy Shell and Scrapy Splash

We've been using scrapy-splash middleware to pass the scraped HTML source through the Splash javascript engine running inside a docker container. If we want to use Splash in the spider, we configure several required project settings and yield a…
alecxe
13
votes
3 answers

Adding a wait-for-element while performing a SplashRequest in python Scrapy

I am trying to scrape a few dynamic websites using Splash for Scrapy in python. However, I see that Splash fails to wait for the complete page to load in certain cases. A brute force way to tackle this problem was to add a large wait time (eg. 5…
NightFury13
10
votes
1 answer

How to set splash timeout in scrapy-splash?

I use scrapy-splash to crawl web pages and run the Splash service on Docker. Command: docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600 But I got a 504 error. "error": {"info": {"timeout": 30}, "description": "Timeout exceeded rendering…
Jhon Smith
9
votes
3 answers

Scrapy CrawlSpider + Splash: how to follow links through linkextractor?

I have the following code that is partially working, class ThreadSpider(CrawlSpider): name = 'thread' allowed_domains = ['bbs.example.com'] start_urls = ['http://bbs.example.com/diy'] rules = ( Rule(LinkExtractor( …
eN_Joy
8
votes
2 answers

how does scrapy-splash handle infinite scrolling?

I want to reverse engineer the content generated by scrolling down the webpage. The problem is in the URL https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933. screwrand doesn't seem to follow…
Bowen Liu
7
votes
0 answers

Splash containers stops working after 30 minutes

I have an issue with Aquarium and Splash: they stop working 30 minutes after the start. The number of pages to load is 50K-80K. I made a cron job to automatically reboot each Splash container every 10 minutes, but it didn't work. How…
amarynets
7
votes
2 answers

Using docker, scrapy splash on Heroku

I have a scrapy spider that uses splash which runs on Docker localhost:8050 to render javascript before scraping. I am trying to run this on heroku but have no idea how to configure heroku to start docker to run splash before running my web: scrapy…
HearthQiu
7
votes
2 answers

How to install python-gtk2, python-webkit and python-jswebkit on OSX

I've read through many of the related questions but am still unclear how to do this as there are many software combinations available and many solutions seem outdated. What is the best way to install the following on my virtual environment on…
jyek
6
votes
1 answer

scrapy, splash, lua, button click

I am new to all the tools here. My goal is to extract all URLs from a lot of pages which are connected more or less by a "Weiter"/"next" button, and to do that for several URLs. I decided to try that with Scrapy. The page is dynamically generated. Then I…
P. Guyan
6
votes
0 answers

Docker Scrapinghub/splash exited with 139

I'm using Scrapy to do some crawling with Splash, using the scrapinghub/splash Docker container. However, the container exits by itself after a while with exit code 139. I'm running the scraper on an AWS EC2 instance with 1GB of swap assigned. I also tried…
MtziSam
6
votes
1 answer

scrapy-splash returns its own headers and not the original headers from the site

I use scrapy-splash to build my spider. Now I need to maintain the session, so I use the scrapy.downloadermiddlewares.cookies.CookiesMiddleware, and it handles the set-cookie header. I know it handles the set-cookie header because I set…
Roman Smelyansky
6
votes
1 answer

Splash lua script to do multiple clicks and visits

I'm trying to crawl Google Scholar search results and get the BibTeX entry for each result matching the search. Right now I have a Scrapy crawler with Splash. I have a Lua script which will click the "Cite" link and load up the modal window…
5
votes
2 answers

Google App Engine: Load another Docker Image for Scrapy + Splash

I'd like to scrape a javascript website using Scrapy + Splash in Google App Engine. The Splash plugin is a Docker image. Is there any way to use this within Google App Engine? App Engine itself uses a Docker image, but I'm not sure how to load and…
bgolson
5
votes
1 answer

Scrapy does not fetch markup on response.css

I've built a simple scrapy spider running on scrapinghub: class ExtractionSpider(scrapy.Spider): name = "extraction" allowed_domains = ['domain'] start_urls = ['http://somedomainstart'] user_agent = "Mozilla/5.0 (Windows NT 10.0;…
qubits
5
votes
0 answers

FileNotFoundError: [Errno 2] after pushing splash to heroku

I'm trying to deploy the latest scrapinghub/splash. I am using git-bash on Win10. I forked the repo to https://github.com/kc1/splash/blob/master and have been trying to follow "Using docker, scrapy splash on Heroku" to modify the Dockerfile. After…
user1592380