How to extract only the text inside a given class or tag using BeautifulSoup?

Question

I'm using BeautifulSoup4 to scrape websites. I'm only interested in a couple of things on each site, and most of them are inside an article tag. However, some are not inside an article tag but inside a div tag with a class name.

Source: an example of the site that I'm trying to scrape is https://news.3m.com/English/3m-stories/3m-details/2020/3M-Foundation-awards-1M-to-four-local-organizations-in-support-of-racial-equity/default.aspx

On this site, I'm only interested in the article text, which is not inside an article tag but inside a div tag with the class name module_body.

Here's what I have done so far:

1. Read each URL.
2. Save each website as .html in the directory. (Reason: we wanted to make sure the data we collected does not change, and to check the differences a few months later.)
3. After all sites are stored, read each .html file.
4. Use BeautifulSoup to parse the HTML and extract the text within the tag/class.

Helper Function

def parse_article(response, tag):
    
    article = [e.get_text() for e in response.find_all(tag)]
   
    article = '\n'.join(article)

    return article



def check_article(response):
    tags_classess_query = [
        ('article'), 
        ('div', {'class': 'module_body'})
    ]
    
    for item in tags_classess_query: 
        print('checking for {}'.format(item))
        
        if response.find(item):
            return item

    return None



# list all html files downloaded
html_files = [file for file in os.listdir(path) if '.html' in file]

# loop html_files to process each file
for file in html_files:
    
    filepath = os.path.join(path,file)
    article_file = os.path.splitext(filepath)[0]
    
    # file name to store the extracted text using BS4
    article_file = article_file + '.txt'
    
    
    with open(filepath, 'r', encoding='utf-8') as f:
    
        html = BeautifulSoup(f, 'html.parser')
        

    
    # check if selected tag exists in HTML. 
    
    tag = check_article(html)
        
    if tag is not None:
        #This is where I'm running into this issue where it still saves all of html page not just the text inside the selected tag/class

        article = parse_article(html, tag)
        
        w = open(article_file, 'w+', encoding='utf-8') 
        w.write(article)
        w.close()

    else:
        print("tag not found for %s" % file)

The problem is that this extracts the text of the whole page rather than only the text inside the selected tag/class. What am I doing wrong?

venky__ · Accepted Answer · 2020-10-04T17:35:20.267

1

You were passing the single tuple ('div',{'class':'module_body'}) to find_all instead of 'div',{'class':'module_body'}. Note that the latter is 2 separate arguments: the tag name and the attribute filter. So just replace this line in your parse_article function.

def parse_article(response, tag):
    # pass the tag name and the attribute dict as two separate arguments
    article = [e.text for e in response.find_all(tag[0], tag[1])]
    return '\n'.join(article)

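Since your other entry doesn't have 2 elements, indexing tag[1] will not work for it (a one-element sequence raises an IndexError), so you can use the unpacking operator * instead. For unpacking to work, every entry must be a sequence such as ['article'] or ('article',); note that ('article') is just the string 'article'.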

def parse_article(response, tag):
    # *tag expands to find_all(name) or find_all(name, attrs)
    article = [e.text for e in response.find_all(*tag)]
    return '\n'.join(article)
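
To see the unpacking approach end to end, here is a minimal sketch on a made-up HTML snippet. The markup, the `queries` list, and the helper name `first_matching_query` are illustrative assumptions, not taken from the 3M page or the original code; every query is written as a list so that `*` always expands to a tag name and, optionally, an attribute filter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one saved .html file.
html = """
<div class="header">Site navigation</div>
<div class="module_body"><p>Story text</p></div>
<div class="footer">Copyright</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Each query is a list: [name] or [name, attrs].
queries = [["article"], ["div", {"class": "module_body"}]]


def first_matching_query(soup, queries):
    """Return the first query that matches something in the page, else None."""
    for query in queries:
        # *query expands to find(name) or find(name, attrs)
        if soup.find(*query):
            return query
    return None


def parse_article(soup, query):
    """Join the text of every element matched by the query."""
    return "\n".join(e.get_text() for e in soup.find_all(*query))


query = first_matching_query(soup, queries)
text = parse_article(soup, query)
print(query)
print(text)
```

The same unpacking is used for the existence check and for the extraction, so both look at the same elements.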

edited Oct 04 '20 at 17:35

answered Oct 04 '20 at 17:10

venky__


If I'm not wrong, this is how you wanted to suggest `.find_all([tag[0],tag[1]])`, right? – SIM Oct 04 '20 at 17:19
No, just `soup.findAll(tag[0], tag[1])`, since `tag = ('div',{'class':'module_body'})`. So it will be `find_all('div', {'class':'module_body'})` – venky__ Oct 04 '20 at 17:23
You are still doing it wrong. When you wanna use multiple tags within `.find_all()`, you need to use them within a list. Check out [this link](https://stackoverflow.com/a/20649408/9189799) for better clarity. Thanks. – SIM Oct 04 '20 at 17:28
But it's not multiple tags though. `'div' , {'class':'module_body'}` is one tag – venky__ Oct 04 '20 at 17:31
Yes, I got you. I didn't notice that you edited your comment to clarify. Thanks. – SIM Oct 04 '20 at 17:32
@SIM good point though if there is just one tag it will throw an error. updated my answer to handle that. – venky__ Oct 04 '20 at 17:36
Splendid! I was just about to ask you that. You thought of that before I could ask. +1 – Kuni Oct 04 '20 at 17:40
@venky__, one thing I realized is that if `*tag` is used in `article` tag, it unpacks to `['a', 'r', 't', 'i', 'c', 'l', 'e']`. For now, I've done this: `if isinstance(tag, str): args = tag else: *args,=tag` Is there a better way to solve it? – Kuni Oct 05 '20 at 14:03
@Kuni aghh. Instead of tuples you can make each entry a list, then `tags_classess_query = [['article'], ['div', {'class': 'module_body'}]]` – venky__ Oct 05 '20 at 14:18
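
The string-unpacking pitfall from the last two comments can be reproduced without BeautifulSoup. The sketch below is one possible way to accept both a bare tag name and a (name, attrs) pair; the helper name `as_find_args` is hypothetical and not part of any library.

```python
# Parentheses alone do not make a tuple: ('article') is just a string.
plain = ('article')
one_tuple = ('article',)

# Unpacking a string yields its characters, which is what Kuni observed.
chars = [*plain]


def as_find_args(query):
    """Return positional arguments for find()/find_all().

    Accepts a bare tag name (str) or a sequence such as
    ['div', {'class': 'module_body'}].
    """
    if isinstance(query, str):
        return (query,)
    return tuple(query)


print(chars)
print(as_find_args('article'))
print(as_find_args(['div', {'class': 'module_body'}]))
```

With such a helper, a call like `response.find_all(*as_find_args(tag))` would work for either form of entry, although writing every entry as a list, as suggested above, avoids the need for it.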
