Questions tagged [scrapy-pipeline]
193 questions
17
votes
1 answer
Scrapy: how to use items in spider and how to send items to pipelines?
I am new to scrapy and my task is simple:
For a given e-commerce website:
crawl all website pages
look for products page
If the URL points to a product page
Create an Item
Process the item to store it in a database
I created the spider but…
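A minimal sketch of the spider-to-pipeline hand-off this question is about; the class and field names (`ProductPipeline`, the `stored` list) are illustrative, not from the original post. Scrapy passes every item a spider yields through `process_item` of each enabled pipeline:

```python
# Hypothetical pipeline; names are illustrative.
class ProductPipeline:
    def __init__(self):
        self.stored = []          # stand-in for a real database

    def process_item(self, item, spider):
        # Scrapy calls this once per item the spider yields.
        # Items can be plain dicts or scrapy.Item subclasses.
        self.stored.append(dict(item))
        return item               # pass the item on to later pipelines

# In the spider, a product page would be turned into an item like:
#     def parse_product(self, response):
#         yield {"name": ..., "price": ..., "url": response.url}
# and the pipeline is enabled in settings.py:
#     ITEM_PIPELINES = {"myproject.pipelines.ProductPipeline": 300}
```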
farhawa
- 8,406
- 16
- 37
- 80
5
votes
2 answers
Scrapy file download: how to use a custom filename
For my scrapy project I'm currently using the FilesPipeline. The downloaded files are stored with a SHA1 hash of their URLs as the file names.
[(True,
{'checksum': '2b00042f7481c7b056c4b410d28f33cf',
'path':…
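The usual approach (assuming a reasonably recent Scrapy) is to subclass `FilesPipeline` and override `file_path`, which returns the relative path used instead of the SHA1 name. A sketch, with only the pure helper runnable here and the subclass shown in comments:

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    """Derive a filename from the last path segment of a URL."""
    return os.path.basename(urlparse(url).path)

# Plugged into a FilesPipeline subclass (sketch; class name is illustrative):
#     from scrapy.pipelines.files import FilesPipeline
#
#     class CustomNameFilesPipeline(FilesPipeline):
#         def file_path(self, request, response=None, info=None, *, item=None):
#             return "files/" + filename_from_url(request.url)
```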
Michael
- 1,950
- 1
- 29
- 46
5
votes
1 answer
Django relations with Scrapy: how are items saved?
I just need to understand how I can detect whether scrapy saved an item in the spider. I'm fetching items from a site and after that I'm fetching comments on each item. So first I have to save the item, and after that I'll save the comments. But when I'm…
Murat Kaya
- 1,143
- 1
- 26
- 49
5
votes
1 answer
Scrapy, make http request in pipeline
Assume I have a scraped item that looks like this
{
name: "Foo",
country: "US",
url: "http://..."
}
In a pipeline I want to make a GET request to the url and check some headers like content_type and status. When the headers do not meet…
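One way to sketch this: keep the header check as a pure function and raise `DropItem` when it fails. `DropItem` is stubbed here so the snippet is self-contained (in a real project it comes from `scrapy.exceptions`), and the acceptance rule is an assumption. Note that a blocking `urlopen` inside a pipeline stalls the Twisted reactor, so a production version should hand the request back to Scrapy's downloader or use an async client:

```python
from urllib.request import urlopen  # blocking; fine only for a sketch

class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

def headers_ok(status, content_type):
    # Assumed rule: only 200 responses serving HTML pass.
    return status == 200 and content_type.startswith("text/html")

class UrlCheckPipeline:
    def process_item(self, item, spider):
        with urlopen(item["url"]) as resp:  # blocks the reactor!
            ctype = resp.headers.get("Content-Type", "")
            if not headers_ok(resp.status, ctype):
                raise DropItem(f"bad headers for {item['url']}")
        return item
```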
Upvote
- 65,847
- 122
- 353
- 577
5
votes
0 answers
Twisted (Scrapy) and Postgres
I'm using Scrapy (which runs on Twisted) and Postgres as a database.
After a while my connections seem to fill up and then my script gets stuck. I checked this with the query SELECT * FROM pg_stat_activity; and read that it's caused because Postgres…
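A common cause is opening a new connection per item (or per pipeline instantiation) and never closing it. A sketch of the one-connection-per-crawl pattern, with the DSN and table assumed and the connection factory injectable for testing:

```python
class PostgresPipeline:
    """One connection for the whole crawl, closed when the spider closes."""

    def __init__(self, connect=None):
        # `connect` is injectable for testing; defaults to psycopg2.
        if connect is None:
            import psycopg2
            connect = lambda: psycopg2.connect("dbname=scrapydb")  # assumed DSN
        self._connect = connect
        self.conn = None

    def open_spider(self, spider):
        self.conn = self._connect()

    def process_item(self, item, spider):
        with self.conn.cursor() as cur:
            cur.execute("INSERT INTO items (name) VALUES (%s)", (item["name"],))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()  # without this, idle connections pile up
```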
lony
- 5,002
- 6
- 50
- 71
5
votes
2 answers
scrapy - handling multiple types of items - multiple and related Django models and saving them to database in pipelines
I have the following Django models. I am not sure of the best way to save these inter-related objects, once scraped in the spider, to the database in Django using scrapy pipelines. It seems like the scrapy pipeline was built to handle only one 'kind' of…
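One pipeline can in fact handle several kinds of item by dispatching on type. A sketch with hypothetical item classes standing in for the Django models, and a list standing in for the `.save()` calls:

```python
# Hypothetical item kinds; in the real project these would mirror the
# inter-related Django models.
class AuthorItem(dict):
    pass

class BookItem(dict):
    pass

class MultiModelPipeline:
    """One pipeline handling several 'kinds' of item by dispatching on type."""

    def __init__(self):
        self.saved = []  # stand-in for the Django .save() calls

    def process_item(self, item, spider):
        if isinstance(item, AuthorItem):
            self.saved.append(("author", dict(item)))
        elif isinstance(item, BookItem):
            # related authors are already saved by the time books arrive,
            # so foreign keys can be resolved here
            self.saved.append(("book", dict(item)))
        return item
```

The design choice is to yield the parent item before its children from the spider, so the pipeline sees them in dependency order.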
dowjones123
- 3,233
- 5
- 34
- 72
5
votes
1 answer
When saving a scraped item and a file, Scrapy inserts empty lines in the output csv file
I have a Scrapy (version 1.0.3) spider in which I both extract some data from the web page and download a file, like this (simplified):
def extract_data(self, response):
title = response.xpath('//html/head/title/text()').extract()[0].strip()
…
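For the blank-line symptom specifically (typical on Windows), the usual fix is to open the CSV file with `newline=""` so the `csv` module's own `\r\n` terminators are not doubled by the platform's newline translation. A self-contained sketch with assumed column names:

```python
import csv

# newline="" prevents csv's "\r\n" row endings from being translated into
# "\r\r\n" on Windows, which shows up as a blank line after every row.
with open("items.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "file_url"])
    writer.writerow(["Example page", "http://example.com/file.pdf"])
```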
zdenulo
- 214
- 2
- 11
5
votes
1 answer
Closing database connection from pipeline and middleware in Scrapy
I have a Scrapy project that uses custom middleware and a custom pipeline to check and store entries in a Postgres DB. The middleware looks a bit like this:
class ExistingLinkCheckMiddleware(object):
def __init__(self):
... open…
Jamie Brown
- 903
- 9
- 12
4
votes
1 answer
Custom Files Pipeline in Scrapy never downloads files even though logs show all functions being accessed
I have the following custom pipeline for downloading JSON files. It was functioning fine until I needed to add the __init__ function, in which I subclass the FilesPipeline class in order to add a few new properties. The pipeline takes URLs that are to…
CaffeinatedMike
- 1,456
- 2
- 22
- 43
4
votes
1 answer
Export scrapy items to different files
I'm scraping reviews from MOOCs like this one
From there I'm getting all the course details, 5 items, and another 6 items from each review itself.
This is the code I have for the course details:
def parse_reviews(self, response):
l =…
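A pipeline can route the two kinds of item to separate output files. A sketch using the stdlib `csv` module (the real project would use Scrapy's `CsvItemExporter` the same way); the `kind` discriminator field and the column choices are assumptions:

```python
import csv

class SplitCsvPipeline:
    """Route course items and review items to separate CSV files."""

    def open_spider(self, spider):
        self.files = {
            "course": open("courses.csv", "w", newline="", encoding="utf-8"),
            "review": open("reviews.csv", "w", newline="", encoding="utf-8"),
        }
        self.writers = {k: csv.writer(f) for k, f in self.files.items()}

    def process_item(self, item, spider):
        kind = item.get("kind", "review")  # assumed discriminator field
        self.writers[kind].writerow([item.get("title"), item.get("text")])
        return item

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()
```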
Luis Ramon Ramirez Rodriguez
- 6,361
- 20
- 65
- 123
4
votes
2 answers
Scrapy store returned items in variables to use in main script
I am quite new to Scrapy and want to try the following:
Extract some values from a webpage, store them in variables, and use them in my main script.
Therefore I followed their tutorial and changed the code for my purposes:
import scrapy
from scrapy.crawler…
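One common pattern for this: a collector pipeline that appends every item to a list the main script can read after the crawl finishes. A sketch (the spider name and pipeline priority are illustrative):

```python
class CollectorPipeline:
    """Collect every scraped item into a list the main script can read."""

    items = []  # class attribute, so the caller can read it after the crawl

    def process_item(self, item, spider):
        self.items.append(item)
        return item

# In the main script (sketch):
#     from scrapy.crawler import CrawlerProcess
#     process = CrawlerProcess(
#         {"ITEM_PIPELINES": {"__main__.CollectorPipeline": 100}})
#     process.crawl(MySpider)
#     process.start()              # blocks until the crawl finishes
#     print(CollectorPipeline.items)
```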
MaGi
- 89
- 1
- 1
- 7
4
votes
1 answer
Scrapy Pipelines to Separate Folder/Files - Abstraction
I am currently finalising a Scrapy project; however, I have quite a lengthy pipelines.py file.
I noticed that in my settings.py the pipelines are shown as follows (trimmed down):
ITEM_PIPELINES = {
'proj.pipelines.MutatorPipeline': 200,
…
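pipelines.py can be turned into a package (`proj/pipelines/` with one module per pipeline); the `ITEM_PIPELINES` keys then just become longer dotted paths. Module and class names below are illustrative:

```python
# proj/pipelines/__init__.py can be left empty; each pipeline lives in its
# own module, e.g. proj/pipelines/mutator.py. settings.py then references
# the full dotted path:
ITEM_PIPELINES = {
    "proj.pipelines.mutator.MutatorPipeline": 200,
    "proj.pipelines.validator.ValidatorPipeline": 300,
}
```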
Matt The Ninja
- 2,389
- 2
- 22
- 48
4
votes
1 answer
How to download image using Scrapy?
I am a newbie to scrapy. I am trying to download an image from here. I was following the Official-Doc and this article.
My settings.py looks like:
BOT_NAME = 'shopclues'
SPIDER_MODULES = ['shopclues.spiders']
NEWSPIDER_MODULE =…
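For reference, the minimal settings the built-in images pipeline needs (the `image_urls`/`images` field names are Scrapy's defaults; the storage directory is an assumption):

```python
# settings.py additions (sketch):
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "images"  # directory where downloaded images are saved

# The item the spider yields must carry the URLs in an `image_urls` field;
# the pipeline fills in `images` with the download results.
```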
Prashant Prabhakar Singh
- 871
- 4
- 11
- 30
4
votes
2 answers
Understand the scrapy framework architecture
Recently, I've been trying to get to grips with scrapy. I feel that if I had a better understanding of the architecture, I'd move a lot faster. The current, concrete problem I have is this: I want to store all of the links that scrapy extracts in a…
user3185563
- 1,103
- 2
- 10
- 13
3
votes
1 answer
Pass file_name argument to pipeline for csv export in scrapy
I need scrapy to take an argument (-a FILE_NAME="stuff") from the command line and apply it to the file created by my CSVWriterPipeLine in my pipelines.py file. (The reason I went with pipelines.py was that the built-in exporter was repeating data…
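`-a NAME=value` arguments are set as attributes on the spider instance, so a pipeline can read them in `open_spider` via `getattr`. A sketch (the default filename and the column handling are assumptions):

```python
import csv

class CSVWriterPipeLine:
    """Filename comes from `scrapy crawl myspider -a FILE_NAME=stuff.csv`;
    -a arguments become attributes on the spider instance."""

    def open_spider(self, spider):
        name = getattr(spider, "FILE_NAME", "output.csv")  # assumed default
        self.file = open(name, "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow(item.values())
        return item

    def close_spider(self, spider):
        self.file.close()
```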
Josh Usre
- 619
- 1
- 9
- 30