I am creating a Scrapy program to scrape profile pages for numerical data. Each profile has a section that lists the different tags the user uses. Each of those tags links to a paginated set of pages showing all the posts made under that tag, along with each post's word count. Every profile has the same layout with the same selectors, so no worries there.
The data I want from each profile:
- number of posts in each tag
- total word count of each tag
I have a list of all the profiles I want to scrape in start_urls, so there's no need to crawl to find profiles. I just want to understand the logic behind using Item Loaders and/or Pipelines to ensure the data is organized by profile and by tag. Here's some pseudo-code from my spider to make things clearer:
class ProfileScraper(scrapy.Spider):
    name = 'scraper'
    start_urls = ['https://example.com/fancyuser1', 'https://example.com/coooluser2', 'https://example.com/superuser3', ... ]

    def parse(self, response):
        tag_links = response.css('.tags + a::attr(href)').getall()
        yield from response.follow_all(tag_links, callback=self.parse_tagspage)

    def parse_tagspage(self, response):
        number_of_posts = int(response.css('div.postcount span::text').get())
        word_counts = response.css('div.wordcount span::text').getall()
        page_sum_words = sum(int(number) for number in word_counts if number.strip())
        # pseudo-code is here
        all_that_data[username].append([number_of_posts, page_sum_words])
        # let's pretend there is only one page for this tag
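For context, I figured the profile key itself could come from the profile URL, since in my start_urls the path is just the username (assuming that holds for every profile). Something like:

```python
from urllib.parse import urlparse

def username_from_url(url):
    """Pull the profile name out of a URL like https://example.com/fancyuser1."""
    return urlparse(url).path.strip('/')
```

I could then hand that name to parse_tagspage via cb_kwargs (or response.meta) so every tag page knows which profile it belongs to, but I'm not sure that's the intended way.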
My ideal dict would look like the following. The nested lists represent the different tags, each holding that tag's number of posts and total word count.
all_that_data = {
    'fancyuser1': [[3, 200], [10, 500], [6, 450]],
    'coooluser2': [[1, 150], [6, 800], [2, 100], [4, 400]],
    'superuser3': [[3, 350], [5, 400]],
    ...
}
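One approach I've been considering (no idea if it's idiomatic Scrapy): yield one flat item per tag page, shaped like {'user': ..., 'posts': ..., 'words': ...}, and only group them at the end in a pipeline. The grouping itself would be roughly:

```python
from collections import defaultdict

def group_by_profile(items):
    """Collect flat per-tag items into {username: [[posts, words], ...]}."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item['user']].append([item['posts'], item['words']])
    return dict(grouped)
```

I imagine a pipeline would accumulate items in process_item and build this dict in close_spider, but is that the right place for it?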
Each profile can have any number of tags, so when I follow those links, how do I ensure the data from each tag stays separate while still being nested under the correct profile?