
I am trying to use BeautifulSoup to get all the images on a site that have a certain class. My issue is that when I run the code just to check whether it can find each image, it only gets images 1-5. I think the issue is the HTML, since images 6 onward are located in a nested div, but find_all should be able to find every img with the same class.

import requests, os, bs4, sys, webbrowser

url = 'https://mangapanda.onl/chapter/'
os.makedirs('manga', exist_ok=True)

comic = sys.argv[1:]
aComic = '-'.join(sys.argv[1:])  

issue = input('which issue do you want?')
aIssue = ('/chapter-' + issue)
aComic = (aComic + '_110' +  aIssue) 

comicUrl = (url + aComic)
res = requests.get(comicUrl)
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, 'html.parser')


comicElem = soup.find_all(class_="PB0mN")  
if comicElem == []:
    print('nothing in the list')
else:
    print('There are ' + str(len(comicElem)) + ' on this page')
    for i in range(len(comicElem)):
        comicPage = comicElem[i].get('src')
        print(str(comicPage) + '\n')

Is there something I am missing about BeautifulSoup that could have helped me solve this? Is the HTML causing the problem? Was there a better way I could have diagnosed this myself that would have been within my capabilities? (Side note: I am currently going through the book "Automating the Boring Stuff with Python". It is where I got the idea for this mini project, and it gives a decent idea of my skill level with Python.) Lastly, I am using BeautifulSoup to learn more about it. If possible I would like to solve this issue with BeautifulSoup, but I will research other ways of parsing HTML if I need to.

Using Firefox Quantum 59.0.2 and Python 3.

PS: if you know of other questions that may already have answered this problem, feel free to just link me to them. I really wanted to figure out the answer from someone else's question, but my issue seems to be pretty unique.

Mr.Magik
  • You should take a look at your soup.prettify() output and see whether those other image sources are even visible. I am looking at the source right now for a comic from the site you linked, and it seems like only images 1 through 5 are visible. They have a simple naming scheme, so a workaround is possible. But you should first see whether your soup can even see the images; post the soup.prettify() output here if you need to. – WildCard Apr 25 '18 at 05:30
  • To clarify, printing soup.prettify() will give you a more legible version of the HTML you are trying to parse, so that you can see what you are working with. – WildCard Apr 25 '18 at 05:31

1 Answer


The problem is that some of the images are added to the DOM via JavaScript after the page is loaded. So

res = requests.get(comicUrl)

gets the HTML as sent by the server, before any modifications are made by JavaScript. This is why

soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
len(comicElem) # = 5

only finds 5 images.
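
To make this concrete, here is a minimal sketch using a made-up HTML string (the markup, file names, and the loadRemainingPages function are illustrative assumptions, not taken from the real page). It suggests that find_all does descend into nested divs, so nesting is not the cause; images that a script would add later are simply absent from the text that requests receives.

```python
import bs4

# Hypothetical markup standing in for the server's response: two images are in
# the HTML itself (one of them inside nested divs), and a script would add the
# remaining pages once a browser runs it.
static_html = """
<div id="reader">
  <img class="PB0mN" src="page-1.jpg">
  <div><div><img class="PB0mN" src="page-2.jpg"></div></div>
  <script>loadRemainingPages();</script>
</div>
"""

soup = bs4.BeautifulSoup(static_html, 'html.parser')

# find_all searches the whole tree, so the nested image is found too.
found = [img.get('src') for img in soup.find_all('img', class_='PB0mN')]
print(found)  # ['page-1.jpg', 'page-2.jpg']
```

BeautifulSoup only parses the text it is given and never runs the script, so any image the script would create is missing from the result.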

If you want to get all the images that are loaded, the requests library alone is not enough, because it does not execute JavaScript. Here is an example using Selenium:

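from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 takes the chromedriver path through a Service object.
browser = webdriver.Chrome(service=Service('/Users/glenn/Downloads/chromedriver'))
comicUrl = "https://mangapanda.onl/chapter/naruto_107/chapter-700.5"
browser.get(comicUrl)
# find_elements_by_class_name was removed in Selenium 4; use find_elements with By.
images = browser.find_elements(By.CLASS_NAME, "PB0mN")
for image in images:
    print(image.get_attribute('src'))
len(images) # = 18 images
browser.quit()  # close the browser when done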

See this post for additional resources for scraping javascript pages: Web-scraping JavaScript page with Python
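
Since the question asks to keep using BeautifulSoup, one possible combination is to let Selenium render the page and then hand browser.page_source to BeautifulSoup. The sketch below assumes Selenium 4.6 or later (which can locate a Chrome driver on its own) and an installed Chrome. The helper names image_sources and rendered_html, the min_images threshold, and the timeout are illustrative choices, not part of either library.

```python
import bs4


def image_sources(html, class_name="PB0mN"):
    """Return the src of every <img> with the given class in an HTML string."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [img.get("src") for img in soup.find_all("img", class_=class_name)]


def rendered_html(url, class_name="PB0mN", min_images=6, timeout=10):
    """Load url in Chrome and return the DOM once enough images are present."""
    # Imported here so image_sources() also works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()  # assumption: Selenium 4.6+ finds the driver
    try:
        browser.get(url)
        # Wait until scripts have added images beyond the ones in the static
        # HTML; raises TimeoutException if that never happens.
        WebDriverWait(browser, timeout).until(
            lambda d: len(d.find_elements(By.CLASS_NAME, class_name)) >= min_images
        )
        return browser.page_source
    finally:
        browser.quit()


# Example usage (opens a real browser, so it is left commented out):
# html = rendered_html("https://mangapanda.onl/chapter/naruto_107/chapter-700.5")
# for src in image_sources(html):
#     print(src)
```

Keeping the parsing in its own function means the BeautifulSoup code from the question can stay almost unchanged; only the source of the HTML differs.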

Regarding how to tell whether the HTML is being modified by JavaScript:

I don't have any hard rules, but these are some investigative steps you can carry out:

As you observed, finding only 5 images with requests while seeing more images on the page is the first clue that the DOM is being changed after it is loaded.

A second step: using the browser's Developer Tools -> Scripts, you can see that there are several JavaScript files associated with the page. Note that not all JavaScript modifies the DOM, so the presence of these scripts does not necessarily mean they are modifying it.

For further verification that the DOM is being modified after the page is loaded:

Copy the HTML from Developer Tools -> View Page Source into an HTML formatter tool like http://htmlformatter.com, format the HTML, and look at the line count. The Developer Tools -> View Page Source HTML is what the server sent, without any modifications.

Then copy the HTML from Developer Tools -> Elements (be sure to get the whole thing, from <html> to </html>) into the same kind of HTML formatter tool, format it, and look at the line count. The Developer Tools -> Elements HTML is the complete, modified DOM.

If the line counts are significantly different, then you know the DOM is being modified after it is loaded.

Comparing line counts for "https://mangapanda.onl/chapter/naruto_107/chapter-700.5" shows 479 lines for the source HTML and 3245 lines for the complete DOM, so you know something is modifying the DOM after the page is loaded.
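
The same comparison can be scripted. The sketch below swaps line counts for tag counts, which is a variation on the steps above rather than the exact method described, and it uses only the standard library. The two HTML strings are tiny made-up stand-ins for the saved page source and the saved Elements HTML; TagCounter, count_tags, and added_tags are hypothetical helper names.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags by name while parsing an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def added_tags(source_html, dom_html):
    """Tags that occur more often in the rendered DOM than in the source."""
    # Counter subtraction keeps only positive differences.
    return count_tags(dom_html) - count_tags(source_html)


# Made-up inputs standing in for the two saved copies of the page.
source_html = '<div><img src="1.jpg"><script src="app.js"></script></div>'
dom_html = ('<div><img src="1.jpg"><img src="2.jpg"><img src="3.jpg">'
            '<script src="app.js"></script></div>')

print(added_tags(source_html, dom_html))  # Counter({'img': 2})
```

A non-empty result, especially extra img tags, points the same way as a large difference in line counts: something added elements after the server's HTML arrived.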

glenn15
  • Thank you very much, glenn15. I had looked into Selenium before starting this project, but I read that it no longer worked on the most up-to-date version of Firefox, so I gave up on it earlier than I should have. It seems like a pretty powerful tool, albeit a bit slow (or at least that is how it seems from what I read). Anyway, I think I will start looking into Selenium more and try to find another tool for scraping JavaScript pages with Firefox. If you do not mind me asking, how can you tell that the HTML after page 5 was added using JavaScript? – Mr.Magik Apr 25 '18 at 12:55