
I have a list of URLs that I collected from this page. Each one points to a page that is basically just quotes from people, and I want to save the quotes in a separate file for each URL.

To get the URL list, I have used:

import bs4
from urllib.request import Request,urlopen as uReq
from bs4 import BeautifulSoup as soup
import re
#define url of interest
my_url = 'http://archive.ontheissues.org/Free_Trade.htm'

# set a known browser user agent on the request to avoid an HTTP 403 (Forbidden) error
req=Request(my_url,headers={'User-Agent': 'Mozilla/5.0'})

#opening up connection, grabbing the page
uClient = uReq(req)
page_html = uClient.read()
uClient.close()

#html is jumbled at the moment, so call html using soup function
soup = soup(page_html, "html.parser")

# Test: print title of page
print(soup.title)


tags = soup.findAll("a", href=re.compile("javascript:pop"))
print(tags)

# get list of all URLS
for links in tags:
    link = links.get('href')
    if "java" in link: 
        print("http://archive.ontheissues.org" + link[18:len(link)-3])

How would I go about extracting the content, including text, bullet points, and paragraphs, from each of those links, and then saving it to a separate file for each one? Also, I don't want things that are not quotes, such as other URLs within those pages.

HonsTh

2 Answers


The 'quote' pages that you wish to scrape have some incomplete/dangling HTML tags. These can be a pain to parse if you don't understand the parser that you're using. To get a hint about them, see this page.
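
For example, here is a minimal illustration (a toy snippet, not taken from the quote pages) of how differently html.parser and lxml repair the same dangling markup:

from bs4 import BeautifulSoup

broken = "<p>quote one<p>quote two<ul><li>first point<li>second point"  #unclosed/dangling tags
print(BeautifulSoup(broken, "html.parser").prettify())
print(BeautifulSoup(broken, "lxml").prettify())  #lxml adds html/body wrappers and closes the tags differently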

Now coming back to the code: for my convenience, I made use of the lxml parser. Moving ahead, if you observe the page source of any of those 'quote' pages, you'll see that most of the text you wish to scrape is present in one of the following tags: {h3, p, ul, ol}. Also, note that there's a string that sits right next to every h3 tag. This string can be captured using .next_sibling. Now that the conditions are set, let's move on to the code.

import bs4
from urllib.request import Request, urlopen as uReq
from urllib.error import HTTPError
#Import HTTPError in order to skip links with no content/resource of interest
from bs4 import BeautifulSoup as soup_
import re
#define url of interest
my_url = 'http://archive.ontheissues.org/Free_Trade.htm'

#Creating a function to harness the power of scraping frequently
def make_soup(url):
    # set a known browser user agent on the request to avoid an HTTP 403 (Forbidden) error
    req=Request(url,headers={'User-Agent': 'Mozilla/5.0'})

    #opening up connection, grabbing the page
    uClient = uReq(req)
    page_html = uClient.read()
    uClient.close()

    #html is jumbled at the moment, so call html using soup function
    soup = soup_(page_html, "lxml") 
    return soup

# Test: print title of page
#soup.title

soup = make_soup(my_url)
tags = soup.findAll("a", href=re.compile(r"javascript:pop\("))
#print(tags)

# get list of all URLS
for links in tags:
    link = links.get('href')
    if "java" in link: 
        print("http://archive.ontheissues.org" + link[18:len(link)-3])
        main_url = "http://archive.ontheissues.org" + link[18:len(link)-3] 
        try:
            sub_soup = make_soup(main_url)
            content_collexn = sub_soup.body.contents #Splitting up the page into contents for iterative access 
            #text_data = [] #This list can be used to store data related to every person
            for item in content_collexn:
                #Accept an item if it belongs to the following classes
                if isinstance(item, str): #Plain text nodes (NavigableString) are str subclasses
                    print(item)
                elif(item.name == "h3"):
                    #Note that over here, every h3 tagged title has a string following it
                    print(item.get_text())   
                    #Hence, grab that string too
                    print(item.next_sibling) 
                elif(item.name in ["p", "ul", "ol"]):
                    print(item.get_text())
        except HTTPError: #Takes care of missing pages and related HTTP exception
            print("[INFO] Resource not found. Skipping to next link.")

        #print(text_data)
Argon
  • Hi argon, this code works beautifully in taking the entire content of the page. I just had two questions. Firstly, I am terrible at reading HTML, so I was wondering, is there some way to make an exclusion so that the script doesn't pick up the very bottom of the page where it says "Click here for definitions & background information on Free Trade." and onwards? Second question, is it possible to save each different page that is scraped in a txt file, using some kind of loop, with the name of each candidate for the file name? – HonsTh Jul 06 '19 at 15:24
  • For the first question, you can try checking the text part of each content item for "Click here for definitions..." type of text, using regular expressions. If it is present in any of the content items, then you can either skip that item or replace that specific text with an empty string. And for the second question, yes, you can definitely do it. You can use the `text_data` list commented above to record the data of the page & then write it into a file. To know more about how to write a list to a file, refer to [this](https://stackoverflow.com/questions/899103/writing-a-list-to-a-file-with-python). – Argon Jul 06 '19 at 16:03
  • Also, if you take the `link[18:len(link)-3]` expression you used to create your individual page links, you'll be able to extract the names from it. Simply store this part in a variable, say `sub_link`. You'll have to extract the name of the person from this string. For this, consider one of the results of `sub_link`: `2020/Justin_Amash_Free_Trade.htm`. Here, replace the `2020/` & `_Free_Trade.htm` parts with an empty string. The remaining string is basically the name of the person. Use this string as the file name argument of the `open()` function when you write the page data into a file. – Argon Jul 06 '19 at 16:20
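
Putting these two comments together, a minimal sketch of the file-writing loop might look like the following. It re-uses make_soup, tags, re, and HTTPError from the answer above; the "Click here for definitions" stop phrase, the name-derivation step, and the .txt file naming are assumptions drawn from the comments, not code from the original answer.

for links in tags:
    link = links.get('href')
    if "java" in link:
        sub_link = link[18:len(link)-3]   #e.g. '2020/Justin_Amash_Free_Trade.htm'
        main_url = "http://archive.ontheissues.org" + sub_link
        #Assumed name derivation: strip the directory part and the trailing '_Free_Trade.htm'
        person = sub_link.split("/")[-1].replace("_Free_Trade.htm", "")
        try:
            sub_soup = make_soup(main_url)
            text_data = []   #Collects this person's quote text
            for item in sub_soup.body.contents:
                if isinstance(item, str):
                    text = str(item)
                elif item.name == "h3":
                    #Grab the h3 title plus the string sitting right next to it
                    text = item.get_text() + "\n" + str(item.next_sibling or "")
                elif item.name in ["p", "ul", "ol"]:
                    text = item.get_text()
                else:
                    continue
                #Assumed stop phrase: everything from here down is page boilerplate, so stop collecting
                if re.search(r"Click here for definitions", text):
                    break
                text_data.append(text)
            with open(person + ".txt", "w", encoding="utf-8") as f:
                f.write("\n".join(text_data))
        except HTTPError:   #Takes care of missing pages, as in the answer above
            print("[INFO] Resource not found. Skipping to next link.")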

These are a couple of side points to help.

You can use a Session object for the efficiency of re-using the underlying connection.

With bs4 4.7.1, you can condense your opening code for getting the right URLs to the version shown below, where I use an attribute = value CSS selector to restrict matches to hrefs containing javascript:pop. The * is the contains operator.

[href*="javascript:pop"]

Then add the :contains pseudo selector to further restrict to links whose innerText has the word quote in it. This refines the list of matched elements to exactly those required.

:contains(quote)

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    r = s.get('http://archive.ontheissues.org/Free_Trade.htm')
    soup = bs(r.content, 'lxml')
    links = [item['href'] for item in soup.select('[href*="javascript:pop"]:contains(quote)')]
    for link in links:
        pass #rest of code working with Session: rebuild each page URL from link, then call s.get(...)

References:

  1. CSS Attribute selectors
  2. CSS selectors
  3. Session object
  4. HTTP Sessions
QHarr
  • Thanks so much for this. I must admit, I only understood about 20% of the words in your post, as I am quite new to python, but I tried this code and got all the links I was after, much more concisely. "You can use Session object for efficiency of re-using connection." - what does this mean? – HonsTh Jul 06 '19 at 15:28
  • Hi, the Session object is explained here: https://2.python-requests.org/en/master/user/advanced/ . It is a way of re-using an existing connection rather than creating a new one each time. _with requests.Session() as s:_ <=== that line creates the Session object, _s_, which is re-used within the _with_ statement, so you would do _s.get(link)_ in your loop over _links_. – QHarr Jul 06 '19 at 15:37
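
To make that concrete, a small sketch of re-using the session inside the loop might look like this (the link[18:len(link)-3] slice for rebuilding each page URL is taken from the question and is an assumption about the href format):

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    r = s.get('http://archive.ontheissues.org/Free_Trade.htm')
    soup = bs(r.content, 'lxml')
    links = [item['href'] for item in soup.select('[href*="javascript:pop"]:contains(quote)')]
    for link in links:
        #Rebuild the real page URL from the javascript:pop(...) href, as in the question
        page_url = "http://archive.ontheissues.org" + link[18:len(link)-3]
        r = s.get(page_url)            #Re-uses the same underlying connection
        page_soup = bs(r.content, 'lxml')
        print(page_soup.title)         #placeholder: do the real quote extraction here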