
I am creating a Scrapy program to scrape profile pages for numerical data. Each profile has a section that lists the different tags that the user uses. Each of those tags links to a paginated set of pages showing all the posts made under that tag, along with each post's word count. Each profile has this same layout with the same selectors, so no worries there.

The data I want from each profile:

  • number of posts in each tag
  • total word count of each tag

I have a list of all the profiles I want to scrape in start_urls, so there's no need to crawl to find profiles. I just want to know the logic behind using Item Loaders and/or Pipelines to ensure the data is organized by profile and by tag. Here's some pseudo-code from my spider to make things clearer:

class ProfileScraper(scrapy.Spider):
    name = 'scraper'
    
    start_urls = ['https://example.com/fancyuser1', 'https://example.com/coooluser2', 'https://example.com/superuser3', ... ]
    
    def parse(self, response):
        tags_pages = response.css('.tags + a')
        yield from response.follow_all(tags_pages, callback=self.parse_tagspage)

    def parse_tagspage(self, response):
        number_of_posts = int(response.css('div.postcount span::text').get())
        number_of_words = response.css('div.wordcount span::text').getall()
        page_sum_words = sum(int(number) for number in number_of_words if number.strip())

        # pseudo-code is here
        all_that_data[username].append([number_of_posts, page_sum_words])
        # let's pretend there is only one page for this tag      

My ideal dict would look like the following. The 2D array represents the different tags, with each tag's number of posts and total words.

all_that_data = {
    'fancyuser1': [[3, 200], [10, 500], [6, 450]],
    'coooluser2': [[1, 150], [6, 800], [2, 100], [4, 400]],
    'superuser3': [[3, 350], [5, 400]],
    ...
}

Each profile can have as many tags as it wants, so when I follow those links, how do I ensure the data from each tag stays separate from each other, but all nested under the correct profile?

harada

2 Answers

In your parse_tagspage you need to:

yield {'username': username, 'number_of_posts': number_of_posts, 'page_sum_words': page_sum_words}

And in your pipelines.py you can work with an all_that_data class variable that is filled in process_item.
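A minimal sketch of such a pipeline (the class name and the dict layout are illustrative, not from your code; it assumes the items yielded above):

```python
# pipelines.py -- collect per-profile stats in a dict shared across items.
class TagStatsPipeline:
    all_that_data = {}  # class variable, filled as items come in

    def process_item(self, item, spider):
        user = item['username']
        # append this tag's [posts, words] pair under the user's key
        self.all_that_data.setdefault(user, []).append(
            [item['number_of_posts'], item['page_sum_words']]
        )
        return item
```

Because process_item is called once per yielded item, the dict grows one [posts, words] pair at a time, in the order the tag pages are scraped.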

gangabass

Here is the full solution that worked for me. The other answer here really helped me think out the process that the data goes through when using a pipeline.

parse_tagspage should have this in it, which will yield the data to be processed by the pipeline:

## get the important parts of the URL
url_info = response.url.replace('https://example.com/', '').split('/')

userkey = url_info[0]  ## just the profile name part
tag = url_info[1].replace('posts?tag=', '')  ## just the tag part of the link

yield {'username': userkey, 'tag': tag, 'number_of_posts': number_of_posts, 'page_sum_words': page_sum_words}

My pipeline takes that information and turns it into a dict. It's not the exact organization I wanted originally, but it is easy to create a new dict and populate it with the same data using loops through the existing dict. Long story short, I got the 2D arrays I wanted.
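For instance, a sketch of that conversion (the nested dict here is sample data; dicts preserve insertion order in Python 3.7+, so the tag order is kept):

```python
# Flatten {user: {tag: [posts, words]}} into {user: [[posts, words], ...]}
nested = {
    'fancyuser1': {'fun_tag': [3, 200], 'happy_tag': [10, 500]},
    'superuser3': {'happy_tag': [3, 350], 'super_tag': [5, 400]},
}
flat = {user: list(tags.values()) for user, tags in nested.items()}
```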

## self.all_that_data is a dict defined in the pipeline's __init__
def process_item(self, item, spider):
    ## unpack the item into local variables
    username = item['username']
    tag = item['tag']
    number_of_posts = item['number_of_posts']
    page_sum_words = item['page_sum_words']

    if username not in self.all_that_data:
        self.all_that_data[username] = {}
        self.all_that_data[username][tag] = [number_of_posts, page_sum_words]
    else:
        if tag not in self.all_that_data[username]:
            self.all_that_data[username][tag] = [number_of_posts, page_sum_words]
        else:
            ## if one tag has multiple pages, each page's word sum is added
            self.all_that_data[username][tag][1] += page_sum_words
    return item  ## process_item has to return the item, even if it isn't changed

The all_that_data ends up looking like this:

all_that_data = {
    'fancyuser1': { 'fun_tag':[3, 200], 'happy_tag':[10, 500], 'meme_tag':[6, 450] },
    'coooluser2': { 'happy_tag':[1, 150], 'fun_tag':[6, 800], 'meme_tag':[2, 100], 'super_tag':[4, 400] },
    'superuser3': { 'happy_tag':[3, 350], 'super_tag':[5, 400] },
    ...
}
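For completeness, the class around process_item looks roughly like this (the class name is a placeholder; close_spider is optional):

```python
# pipelines.py -- a sketch of the surrounding pipeline class.
class ProfilePipeline:
    def __init__(self):
        # the dict that process_item fills in, one item at a time
        self.all_that_data = {}

    def close_spider(self, spider):
        # called once when the crawl finishes -- a good place to
        # convert or export the collected data
        spider.logger.info('collected: %r', self.all_that_data)
```

The pipeline also has to be enabled in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.ProfilePipeline': 300} (the 'myproject' path is a placeholder for your project module).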