I'm using BeautifulSoup4 to scrape websites. I'm only interested in a few parts of each page, and most of them are inside an article tag. However, some pages have no article tag, and the content sits inside a div tag with a class name instead.
Source: an example of a site that I'm trying to scrape is https://news.3m.com/English/3m-stories/3m-details/2020/3M-Foundation-awards-1M-to-four-local-organizations-in-support-of-racial-equity/default.aspx
On this page, I'm only interested in the article text, which is not inside an article tag but inside a div tag with the class name module_body.
Here's what I have done so far:
- read each URL
- save each page as .html in the directory. (Reason: we want a fixed snapshot of the data we collected, so that we can check what has changed a few months later.)
- after all pages are stored, read each .html file
- use BeautifulSoup to parse the HTML and extract the text within the tag/class.
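The first two steps (saving each page as a local .html snapshot) could look roughly like the sketch below. The naming scheme, the `snapshot_name` and `save_snapshot` helpers, and the example URL are all hypothetical and only serve as illustration; the actual download is left out, so `html_text` stands in for whatever the HTTP response returned.

```python
import os
import tempfile
from urllib.parse import urlparse


def snapshot_name(url):
    """Derive a file name from the URL path (hypothetical naming scheme)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Drop a trailing page name such as default.aspx and use the slug before it.
    if len(parts) > 1 and "." in parts[-1]:
        parts = parts[:-1]
    slug = parts[-1] if parts else "index"
    return slug + ".html"


def save_snapshot(html_text, url, directory):
    """Write the raw HTML to directory and return the path that was written."""
    filepath = os.path.join(directory, snapshot_name(url))
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(html_text)
    return filepath


# Made-up URL and page body, so nothing is fetched here.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_snapshot(
        "<html></html>",
        "https://example.com/stories/some-story/default.aspx",
        tmp,
    )
    saved_name = os.path.basename(saved)
```

Keeping the raw HTML on disk means the parsing step can be rerun or changed later without fetching the pages again.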
Helper Function
import os

from bs4 import BeautifulSoup

# Directory holding the saved .html files (placeholder, point this at your own folder)
path = '.'


def parse_article(response, tag):
    # join the text of every element that find_all() returns for the query
    article = [e.get_text() for e in response.find_all(tag)]
    article = '\n'.join(article)
    return article


def check_article(response):
    # queries to try, in order of preference
    tags_classess_query = [
        ('article'),
        ('div', {'class': 'module_body'})
    ]
    for item in tags_classess_query:
        print('checking for {}'.format(item))
        if response.find(item):
            return item
    return None


# list all html files downloaded
html_files = [file for file in os.listdir(path) if '.html' in file]

# loop html_files to process each file
for file in html_files:
    filepath = os.path.join(path, file)
    article_file = os.path.splitext(filepath)[0]
    # file name to store the extracted text using BS4
    article_file = article_file + '.txt'
    with open(filepath, 'r', encoding='utf-8') as f:
        html = BeautifulSoup(f, 'html.parser')
    # check if selected tag exists in HTML.
    tag = check_article(html)
    if tag is not None:
        # This is where I run into the issue: it still saves the text of the
        # whole HTML page, not just the text inside the selected tag/class
        article = parse_article(html, tag)
        with open(article_file, 'w', encoding='utf-8') as w:
            w.write(article)
    else:
        print("tag not found for %s" % file)
The problem is that this doesn't extract only the text inside the selected tag/class; it extracts everything on the page. What am I doing wrong?
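One possible explanation, offered as a sketch rather than a confirmed diagnosis: `('article')` is a plain string, not a tuple, and passing the whole tuple `('div', {'class': 'module_body'})` as the first argument of `find()` / `find_all()` appears to be treated as a list of alternative tag names, so the class filter is never applied and every div matches. The version below keeps each query as a (name, attrs) pair and unpacks it before calling BeautifulSoup. The `QUERIES`, `find_query` and `extract_text` names and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

# Each query is a (tag name, attribute filter) pair; an empty dict means
# "no attribute filter". The two entries mirror the ones in the question.
QUERIES = [
    ("article", {}),
    ("div", {"class": "module_body"}),
]


def find_query(soup, queries=QUERIES):
    """Return the first (name, attrs) pair that matches something in soup."""
    for name, attrs in queries:
        if soup.find(name, attrs=attrs) is not None:
            return name, attrs
    return None


def extract_text(soup, query):
    """Join the text of every element matching the (name, attrs) pair."""
    name, attrs = query
    return "\n".join(e.get_text() for e in soup.find_all(name, attrs=attrs))


# Hypothetical page, made up for illustration only.
sample = """
<html><body>
  <div class="nav">Menu</div>
  <div class="module_body"><p>Story text</p></div>
  <div class="footer">Footer</div>
</body></html>
"""
soup = BeautifulSoup(sample, "html.parser")
query = find_query(soup)
text = extract_text(soup, query)
```

With the name and the attribute filter passed as separate arguments, only the div carrying the module_body class is selected in this sample, and the navigation and footer divs are left out.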