I have a list of URLs that I collected from this page; each one leads to quotes from a person, and I want to save the quotes from each URL in a separate file.
To get the URL list, I have used:
from urllib.request import Request, urlopen as uReq
from bs4 import BeautifulSoup as soup
import re
#define url of interest
my_url = 'http://archive.ontheissues.org/Free_Trade.htm'
# send a known browser User-Agent so the request isn't rejected with an HTTPError
req=Request(my_url,headers={'User-Agent': 'Mozilla/5.0'})
#opening up connection, grabbing the page
uClient = uReq(req)
page_html = uClient.read()
uClient.close()
# parse the raw HTML
page_soup = soup(page_html, "html.parser")
# sanity check: print the page title
print(page_soup.title)
# the quote pages are linked via javascript:pop(...) hrefs
tags = page_soup.find_all("a", href=re.compile("javascript:pop"))
print(tags)
# print the list of all quote-page URLs
for tag in tags:
    link = tag.get('href')
    if "java" in link:
        print("http://archive.ontheissues.org" + link[18:-3])
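For what it's worth, here is a minimal sketch of the "separate file per URL" part I have in mind; the sample URLs and the `filename_for` helper are hypothetical, and the placeholder text stands in for the quotes that would actually be fetched and extracted:

```python
import os
import tempfile
from urllib.parse import urlsplit

# hypothetical sample of URLs in the form printed by the loop above
quote_urls = [
    "http://archive.ontheissues.org/Celeb/Donald_Trump_Free_Trade.htm",
    "http://archive.ontheissues.org/Celeb/Joe_Biden_Free_Trade.htm",
]

def filename_for(url):
    # use the last path component, swapping .htm for .txt
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[0] + ".txt"

out_dir = tempfile.mkdtemp()
for url in quote_urls:
    path = os.path.join(out_dir, filename_for(url))
    # placeholder text; real code would fetch the page and extract the quotes
    with open(path, "w", encoding="utf-8") as f:
        f.write("quotes from " + url + "\n")

print(sorted(os.listdir(out_dir)))
# → ['Donald_Trump_Free_Trade.txt', 'Joe_Biden_Free_Trade.txt']
```

The open question is what goes inside that loop instead of the placeholder write.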
How would I go about extracting the content (text, bullet points, paragraphs) from each of those links and saving it to a separate file per URL? I also want to exclude anything that isn't a quote, such as other URLs within those pages.