Questions tagged [scraper]

Synonym of [web-scraping]

Synonym of : Let's [scrape] these tags off the bottom of our shoe

366 questions
86
votes
3 answers

XPath:: Get following Sibling

I have the following HTML structure: I am trying to build a robust method to extract the second color digest element, since there will be many of these tags within the DOM. …
add-semi-colons
66
votes
5 answers

crawler vs scraper

Can somebody distinguish between a crawler and a scraper in terms of scope and functionality?
Nayn
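One way to picture the distinction, offered as a minimal sketch rather than a definitive answer: a crawler discovers pages by following links, while a scraper extracts specific data from a page it is handed. The in-memory `PAGES` site and the names `crawl` and `scrape_title` are hypothetical and exist only so the example runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site" so the sketch runs offline.
PAGES = {
    "/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<title>Page A</title><a href="/b">B</a>',
    "/b": '<title>Page B</title><a href="/">Home</a>',
}


class LinkParser(HTMLParser):
    """Collects href values of anchor tags (the crawler's concern)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class TitleParser(HTMLParser):
    """Collects the text inside <title> (the scraper's concern)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def crawl(start):
    """Breadth-first link discovery: returns URLs in the order visited."""
    seen, queue = [start], [start]
    while queue:
        url = queue.pop(0)
        parser = LinkParser()
        parser.feed(PAGES[url])
        for link in parser.links:
            if link in PAGES and link not in seen:
                seen.append(link)
                queue.append(link)
    return seen


def scrape_title(url):
    """Extract one specific piece of data from a single page."""
    parser = TitleParser()
    parser.feed(PAGES[url])
    return parser.title
```

In practice the two are often combined: the crawler supplies the URLs and the scraper is applied to each of them.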
41
votes
7 answers

BeautifulSoup: extract text from anchor tag

I want to extract the text from the src of the image tag and the text of the anchor tag that is inside the div with class data. I successfully managed to extract the img src, but am having trouble extracting the text from the anchor tag.
add-semi-colons
40
votes
3 answers

How to scrape a website that requires login first with Python

First of all, I think it's worth saying that I know there are a bunch of similar questions, but NONE of them works for me... I'm a newbie to Python, HTML and web scraping. I'm trying to scrape user information from a website which needs to login…
user2830451
30
votes
3 answers

scrape websites with infinite scrolling

I have written many scrapers, but I am not really sure how to handle infinite scrollers. These days most websites, e.g. Facebook and Pinterest, have infinite scrollers.
add-semi-colons
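One common approach, sketched here under the assumption that the page loads further items from a paged endpoint as the user scrolls, is to request those pages directly until one comes back empty. `fetch_page` is a hypothetical callable standing in for the real HTTP request; a fake in-memory version is used so the sketch runs offline.

```python
from typing import Callable, Iterator, List


def iter_scrolled_items(fetch_page: Callable[[int], List[str]],
                        max_pages: int = 1000) -> Iterator[str]:
    """Yield items page by page until an empty page signals the end."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:  # empty batch: nothing more to "scroll" to
            return
        yield from items


# Hypothetical stand-in for the site's paged endpoint: 5 items, 2 per page.
_FAKE_FEED = ["post-1", "post-2", "post-3", "post-4", "post-5"]


def fake_fetch_page(page: int, page_size: int = 2) -> List[str]:
    start = (page - 1) * page_size
    return _FAKE_FEED[start:start + page_size]


all_items = list(iter_scrolled_items(fake_fetch_page))
```

The `max_pages` cap is a safety net against an endpoint that never returns an empty page. If the content can only be produced by running the page's JavaScript, driving a real browser and scrolling it is the usual alternative.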
13
votes
5 answers

Facebook meta tags scraped with locale not working

My website is multi-language and I have a FB like button. I'd like to have the like posts in different languages. According to Facebook documentation, if I use the meta tag og:locale and og:locale:alternate, the scraper would get my site info…
Alouw Net
12
votes
2 answers

How to use Selenium Webdriver on Heroku?

I am developing a Node.js app, and I use Selenium Webdriver on it for scraping purposes. However, when I deploy on Heroku, Selenium doesn't work. How can I make Selenium work on Heroku?
Athanasios Canko
11
votes
1 answer

Crawling LinkedIn while authenticated with Scrapy

So I've read through Crawling with an authenticated session in Scrapy and I am getting hung up. I am 99% sure that my parse code is correct; I just don't believe the login is redirecting and being successful. I also am having an issue with the…
Gates
10
votes
5 answers

BeautifulSoup: Strip specified attributes, but preserve the tag and its contents

I'm trying to 'defrontpagify' the HTML of an MS FrontPage generated website, and I'm writing a BeautifulSoup script to do it. However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in…
Kurtosis
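The idea of dropping chosen attributes while keeping each tag and its contents can be sketched as below. Note that this swaps in the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages; the names `AttributeStripper` and `strip_attributes` are hypothetical, and the serializer is deliberately minimal (tag and attribute names come out lowercased).

```python
from html import escape
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emits the parsed HTML, omitting any attribute named in `unwanted`."""

    def __init__(self, unwanted):
        # convert_charrefs=False keeps entities such as &amp; intact in text.
        super().__init__(convert_charrefs=False)
        self.unwanted = set(unwanted)
        self.out = []

    def _render(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            if name in self.unwanted:
                continue
            if value is None:  # valueless attribute, e.g. "disabled"
                parts.append(name)
            else:  # the parser unescapes values, so escape them again
                parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_attributes(html, unwanted):
    parser = AttributeStripper(unwanted)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


cleaned = strip_attributes(
    '<p class="MsoNormal" style="margin:0" id="x">Hi <b style="c">there</b></p>',
    ["class", "style"],
)
```

With BeautifulSoup itself, the equivalent step would be to visit every tag and remove the unwanted keys from its attribute dictionary.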
10
votes
2 answers

Facebook scraper doesn't load dynamic meta-tags

I am creating the HTML meta-tags dynamically using the function below (GWT). It takes 1 second to have this on the DOM. It is working fine except for Facebook. When I share a link from my website, the scraper gets the meta-tags that are in the HTML:…
user411103
10
votes
7 answers

Print Python output by PHP Code

I have a scraper, written in Python, which scrapes one site. While scraping the site, it prints the lines that are about to be written to a CSV file. I now want to execute it via PHP code. My question is how can I print…
Rajiv Pingale
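On the Python side, one thing that may matter when another process reads the scraper's output is buffering: when standard output is a pipe, lines can sit in a buffer until the script exits. A minimal sketch, assuming the rows are plain lists of strings, flushes after every row. The `emit_rows` name and the sample rows are hypothetical.

```python
import csv
import io
import sys


def emit_rows(rows, stream=sys.stdout):
    """Write each row as CSV to `stream`, flushing so a caller sees it at once."""
    writer = csv.writer(stream, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
        stream.flush()  # avoid output sitting in a buffer when stdout is a pipe


# Hypothetical scraped rows, written to an in-memory stream for demonstration.
buffer = io.StringIO()
emit_rows([["title", "price"], ["Widget", "9.99"]], stream=buffer)
```

Called with the default `stream`, the rows go to standard output, which a PHP caller could then read with a function such as `passthru()` or `shell_exec()`.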
9
votes
2 answers

Scrapy Body Text Only

I am trying to scrape only the text from the body using Python Scrapy, but haven't had any luck yet. I hope some scholars might be able to help me scrape all the text from the <body> tag.
mmrs151
8
votes
2 answers

Can't get Scrapy pipeline to work

I have a spider that I have written using the Scrapy framework. I am having some trouble getting any pipelines to work. I have the following code in my pipelines.py: class FilePipeline(object): def __init__(self): self.file =…
Jim Jeffries
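For reference, a Scrapy item pipeline is a plain class whose `open_spider`, `close_spider` and `process_item` methods Scrapy calls by name. The sketch below is not the asker's code (which is truncated above); the JSON-lines output, the `items.jl` path and the `myproject` module path are assumptions made for illustration.

```python
import json


class FilePipeline:
    """Writes each item as one JSON line to a file."""

    def __init__(self, path="items.jl"):  # hypothetical output path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # returning the item passes it on to the next pipeline


# In settings.py (the module path is hypothetical):
# ITEM_PIPELINES = {"myproject.pipelines.FilePipeline": 300}
```

One thing worth checking when a pipeline appears to do nothing is whether it is listed in the `ITEM_PIPELINES` setting, since Scrapy only runs pipelines that are enabled there.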
8
votes
3 answers

How to crawl with php Goutte and Guzzle if data is loaded by Javascript?

Many times when crawling we run into problems where content that is rendered on the page is generated with JavaScript, and therefore Scrapy is unable to crawl it (e.g. Ajax requests, jQuery).
Batman
7
votes
3 answers

Accessing Metacritic API and/or Scraping

Does anybody know where the documentation for the Metacritic API is, or whether it still works? There used to be a Metacritic API at https://market.mashape.com/byroredux/metacritic-v2#get-user-details which disappeared today. Otherwise I'm trying to scrape the…
boblikesoup