
Code:

from bs4 import BeautifulSoup
import urllib.request
import sys
import time
import re



for num in range(680):
    address = ('http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num))
    html = urllib.request.urlopen(address).read()
    soup = BeautifulSoup((html), "html.parser")

    for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
        find = re.compile('/recipes/20(.*?)"')
        searchRecipe = re.search(find, str(link))
        recipe = searchRecipe.group(1)
        urllinks = ('http://www.diabetes.org/mfa-recipes/recipes/20' + str(recipe))
        urllinks = urllinks.replace(" ","")
        outfile = open('C:/recipes/recipe.txt', 'a')
        outfile.write(str(urllinks) + '\n')


f = open('recipe.txt', 'r')
for line in f.readlines():
    id = line.strip('\n')
    url = "urllinks".format(id)

    html_two = urllib.request.urlopen(url).read()
    soup_two = BeautifulSoup((html_two), "html.parser")
    for div in soup.find_all('div', class_='ingredients'):
        print(div.text)
    for div in soup.find_all('div', class_='nutritional_info'):
        print(div.text)
    for div in soup.find_all('div', class_='instructions'):
        print(div.text)

The first section (which ends with the outfile) definitely works: when I run the program it stores all the links, but it doesn't do anything after that. In the second part I'm trying to open the file "recipe.txt", go to each link, and scrape certain data (ingredients, nutritional_info, and instructions).

  • shouldn't there just be `url = line.strip()` instead of `url = "urllinks".format(id)`? otherwise, `url` is always equal to "urllinks" – ewcz Mar 05 '17 at 09:06
  • Yes, what exactly are you trying to do with `"urllinks".format(id)`? Also, don't use `id` as a variable name; it shadows the built-in `id()` function in Python. – elena Mar 05 '17 at 09:46
  • honestly I don't know what I was doing with "urllinks".format(id). I just got it off another forum. –  Mar 05 '17 at 10:04
  • I changed the `url = "urllinks".format(id)` to `url = line.strip()` but the only code that is executing is the top portion with the urls. –  Mar 05 '17 at 10:09

2 Answers

f = open('C:/recipes/recipe.txt', 'r')
for line in f.readlines():
    wholeline = line.strip()
    # url = "urllinks".format(wholeline) Don't know what's this supposed to do ?

    html_two = urllib.request.urlopen(wholeline).read()
    soup_two = BeautifulSoup((html_two), "html.parser")
    for div in soup_two.find_all('div', class_='ingredients'):
        print(div.text)
    for div in soup_two.find_all('div', class_='nutritional_info'):
        print(div.text)
    for div in soup_two.find_all('div', class_='instructions'):
        print(div.text)

You used the same variable twice in your original code: `soup` instead of `soup_two`. And since the line is already stripped, there's no need to format it.

Yono
  • So I changed everything described above and when I run the program it runs the first section but doesn't do anything for the second section. There are no errors either. –  Mar 05 '17 at 18:32

So I've modified your code quite a bit. First of all, I would recommend using requests instead of urllib, since it's much easier to use (What are the differences between the urllib, urllib2, and requests module?).
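
For example, here is a quick side-by-side sketch (the URL is just the first page of the archive from your own code):

import urllib.request
import requests

url = 'http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=0'

# urllib: you get raw bytes back and handle errors and encodings yourself
html_urllib = urllib.request.urlopen(url).read()

# requests: .content gives the same bytes, .text gives decoded text,
# and raise_for_status() turns HTTP error codes into exceptions
response = requests.get(url)
response.raise_for_status()
html_requests = response.content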

Second of all, use the with statement for opening files. Then you don't have to worry about closing the file in the right place (What is the python "with" statement designed for?).
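
A minimal sketch of the difference, using a made-up URL instead of one of your scraped links (note that your original loop reopens recipe.txt on every iteration and never closes it):

# manual open/close: easy to forget close(), and the file stays open if an error occurs first
outfile = open('recipe.txt', 'a')
outfile.write('http://www.diabetes.org/mfa-recipes/recipes/2017-01-example.html\n')
outfile.close()

# with statement: the file is closed automatically when the block exits,
# even if an exception is raised inside it
with open('recipe.txt', 'a') as outfile:
    outfile.write('http://www.diabetes.org/mfa-recipes/recipes/2017-01-example.html\n')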

Third of all, I believe some method names were changed in bs4, so use find_all instead of findAll (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names). I left these untouched in the code below, though; you can change them yourself.
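
For example, the link search from your first loop can be written either way; against a tiny made-up snippet of HTML, both calls take the same arguments and return the same result:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/recipes/2017-01-example.html">Example</a>', "html.parser")

# old bs3-style name, kept in bs4 as an alias
print(soup.findAll('a', attrs={'href': re.compile("/recipes/20")}))

# current bs4 name, same behaviour
print(soup.find_all('a', attrs={'href': re.compile("/recipes/20")}))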

Another note: don't use names like id or find for your own variables, since they clash with existing Python names (id() is a built-in function, and find is a common string/BeautifulSoup method), which makes the code confusing and can hide bugs.
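
A contrived example (not from your code) of why shadowing a built-in bites later:

# id() normally returns an object's identity
print(id("hello"))

# once the name is rebound, the built-in is hidden for the rest of the module
id = 42
try:
    print(id("hello"))
except TypeError as err:
    print(err)  # 'int' object is not callable

With all of that in mind, here is the modified version of your script: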

from bs4 import BeautifulSoup
import requests
import sys
import time
import re


with open('file_with_links', 'w+') as f:
    for num in range(680):
        address = 'http://www.diabetes.org/mfa-recipes/recipes/recipes-archive.html?page=' + str(num)
        html = requests.get(address).content
        soup = BeautifulSoup(html, "html.parser")

        for link in soup.findAll('a', attrs={'href': re.compile("/recipes/20")}):
            print(link)
            find_urls = re.compile('/recipes/20(.*?)"')
            searchRecipe = re.search(find_urls, str(link))
            recipe = searchRecipe.group(1)
            urllinks = 'http://www.diabetes.org/mfa-recipes/recipes/20' + str(recipe)
            urllinks = urllinks.replace(" ", "")
            f.write(urllinks + '\n')

with open('file_with_links', 'r') as f:
    for line in f:
        url = line.strip()
        print(url)
        html_two = requests.get(url).content
        soup_two = BeautifulSoup(html_two, "html.parser")
        for div in soup_two.find_all('div', class_='ingredients'):
            print(div.text)
        for div in soup_two.find_all('div', class_='nutritional_info'):
            print(div.text)
        for div in soup_two.find_all('div', class_='instructions'):
            print(div.text)

One more important piece of advice for the future: try to understand each line of the code and what exactly it is doing.

elena