
I'm using BeautifulSoup4 to scrape websites. I'm only interested in a couple of things on each site, and most of them are inside an article tag. However, some pages have no article tag; there the content sits inside a div tag with a specific class name.

Source: an example of the site that I'm trying to scrape is https://news.3m.com/English/3m-stories/3m-details/2020/3M-Foundation-awards-1M-to-four-local-organizations-in-support-of-racial-equity/default.aspx

On this page, the article I'm interested in is not inside an article tag; it is inside a div tag with the class name module_body.

Here's what I have done so far:

  • read each URL
  • save each page as .html in the directory. (Reason: we wanted to make sure the data we collected is not updated, and to check the differences a few months later)
  • after all sites are stored, read each .html file
  • use BeautifulSoup to parse html and extract the text within the tag/class.

Helper functions

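import os

from bs4 import BeautifulSoup


def parse_article(response, tag):
    # Collect the text of every element matching `tag`, one per line.
    article = [e.get_text() for e in response.find_all(tag)]
    article = '\n'.join(article)
    return article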



def check_article(response):
    tags_classess_query = [
        ('article'), 
        ('div', {'class': 'module_body'})
    ]
    
    for item in tags_classess_query: 
        print('checking for {}'.format(item))
        
        if response.find(item):
            return item

    return None



# list all html files downloaded
html_files = [file for file in os.listdir(path) if '.html' in file]

# loop html_files to process each file
for file in html_files:
    filepath = os.path.join(path, file)

    # file name to store the extracted text using BS4
    article_file = os.path.splitext(filepath)[0] + '.txt'

    with open(filepath, 'r', encoding='utf-8') as f:
        html = BeautifulSoup(f, 'html.parser')

    # check if selected tag exists in HTML.
    tag = check_article(html)

    if tag is not None:
        # This is where I'm running into the issue: it still saves the whole HTML page, not just the text inside the selected tag/class
        article = parse_article(html, tag)

        with open(article_file, 'w+', encoding='utf-8') as w:
            w.write(article)
    else:
        print("tag not found for %s" % file)

I'm now running into an issue where it extracts everything on the page rather than only the text in the selected tag. What am I doing wrong?

Kuni

1 Answer


You were passing the tuple ('div', {'class': 'module_body'}) as a single argument instead of 'div', {'class': 'module_body'}. Note that the latter is two separate arguments. So just replace this line in your parse_article function:

def parse_article(response, tag):
    # tag[0] is the tag name, tag[1] is the attribute filter
    article = [e.text for e in response.find_all(tag[0], tag[1])]
    return '\n'.join(article)
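To illustrate the two-argument call, here is a minimal sketch run against a tiny made-up page. `SAMPLE_HTML` and its contents are assumptions for illustration only, not taken from the 3M site; the `find_all(name, attrs)` call is the one described above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, standing in for one of the saved .html files.
SAMPLE_HTML = """
<html><body>
  <div class="header">Site navigation</div>
  <div class="module_body">Article text</div>
  <div class="footer">Copyright notice</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

tag = ("div", {"class": "module_body"})

# Pass the tag name and the attribute filter as two separate arguments,
# so only <div class="module_body"> elements are matched.
matched = [e.get_text() for e in soup.find_all(tag[0], tag[1])]

print(matched)  # ['Article text']
```

Only the div with class module_body is returned; the header and footer divs are left out.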

Since your other entry, ('article'), does not have two elements, indexing it with tag[0] and tag[1] will not work as intended, so you can use the unpacking operator * instead:

def parse_article(response, tag):
    # *tag expands the sequence into separate positional arguments
    article = [e.text for e in response.find_all(*tag)]
    return '\n'.join(article)
venky__
  • If I'm not wrong, this is how you wanted to suggest `.find_all([tag[0],tag[1]])`, right? – SIM Oct 04 '20 at 17:19
  • No, just `soup.findAll(tag[0], tag[1])`. Since `tag = ('div',{'class':'module_body'})`, it will be `find_all('div' , {'class':'module_body'})` – venky__ Oct 04 '20 at 17:23
  • You are still doing it wrong. When you wanna use multiple tags within `.find_all()`, you need to use them within a list. Check out [this link](https://stackoverflow.com/a/20649408/9189799) for better clarity. Thanks. – SIM Oct 04 '20 at 17:28
  • But it's not multiple tags though. `'div' , {'class':'module_body'}` is one tag – venky__ Oct 04 '20 at 17:31
  • Yes, I got you. I didn't notice that you edited your comment to clarify. Thanks. – SIM Oct 04 '20 at 17:32
  • @SIM good point though if there is just one tag it will throw an error. updated my answer to handle that. – venky__ Oct 04 '20 at 17:36
  • Splendid! I was just about to ask you that. You thought of that before I could ask. +1 – Kuni Oct 04 '20 at 17:40
  • @venky__, one thing I realized is that if `*tag` is used in `article` tag, it unpacks to `['a', 'r', 't', 'i', 'c', 'l', 'e']`. For now, I've done this: `if isinstance(tag, str): args = tag else: *args,=tag` Is there a better way to solve it? – Kuni Oct 05 '20 at 14:03
  • @Kuni aghh. Instead of a tuple you can make each entry a list: `tags_classess_query = [ ['article'], ['div', {'class': 'module_body'}]]` – venky__ Oct 05 '20 at 14:18
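The string-unpacking problem raised in the last two comments comes from `('article')` being a plain string rather than a one-element tuple. A possible sketch of the fix, assuming every query is stored as a real sequence so that `*` always expands to a tag name plus an optional attribute filter, is shown below. `SAMPLE_HTML` is a made-up page used only for illustration, and the function bodies are one way to write it, not necessarily what either poster ended up with.

```python
from bs4 import BeautifulSoup

# Every entry is a real sequence. Note the trailing comma in ("article",):
# without it, ("article") is just the string "article", and unpacking it
# yields the individual characters.
tags_classess_query = [
    ("article",),
    ("div", {"class": "module_body"}),
]


def check_article(soup):
    """Return the first query that matches an element in soup, else None."""
    for query in tags_classess_query:
        if soup.find(*query) is not None:
            return query
    return None


def parse_article(soup, query):
    """Join the text of every element matched by query, one per line."""
    return "\n".join(e.get_text() for e in soup.find_all(*query))


# Hypothetical page that has no <article> tag, only module_body divs.
SAMPLE_HTML = """
<div class="sidebar">Related links</div>
<div class="module_body">First paragraph.</div>
<div class="module_body">Second paragraph.</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
query = check_article(soup)
text = parse_article(soup, query)

print(query)
print(text)
```

With this layout there is no need for an `isinstance(tag, str)` check, because the same `*query` call works for both the bare tag name and the tag-plus-class case.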