
This code should download the HTML page and just print it to the screen, but instead I get an HTTP 500 error exception, which I can't figure out how to handle.

Any ideas?

import requests, bs4

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'}

#Load mainPage
_requestResult = requests.get("http://www.geometriancona.it/categoria_albo/albo/", headers=headers, timeout=20)
_requestResult.raise_for_status()
_htmlPage = bs4.BeautifulSoup(_requestResult.text, "lxml")
print(_htmlPage)

#search for stuff in html code
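
For reference, a minimal sketch of catching that exception rather than letting it crash the script (the exception classes are the standard ones raised by requests; the URL is the one from the question):

import requests, bs4

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'}

try:
    _requestResult = requests.get("http://www.geometriancona.it/categoria_albo/albo/",
                                  headers=headers, timeout=20)
    _requestResult.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
except requests.exceptions.HTTPError as err:
    # the server answered, but with an error status such as 500
    print("HTTP error:", err)
except requests.exceptions.RequestException as err:
    # network-level failure: timeout, DNS error, connection refused, ...
    print("Request failed:", err)
else:
    _htmlPage = bs4.BeautifulSoup(_requestResult.text, "lxml")
    print(_htmlPage)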

2 Answers


You can use the urllib module to download individual URLs, but this will just return the data. It will not parse the HTML and automatically download things like CSS files and images. If you want to download the "whole" page, you will need to parse the HTML and find the other things you have to download, as sketched below. You could use something like Beautiful Soup to parse the HTML you retrieve. This question has some sample code doing exactly that.
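
A minimal sketch of that approach, assuming you only want to collect the URLs of the stylesheets and images a page references (the page URL below is a placeholder; the tag and attribute names are standard HTML):

import requests
import bs4
from urllib.parse import urljoin

page_url = "http://www.example.com/"  # placeholder URL, substitute your own

response = requests.get(page_url, timeout=20)
response.raise_for_status()
soup = bs4.BeautifulSoup(response.text, "lxml")

# resolve relative references against the page URL
css_urls = [urljoin(page_url, link["href"])
            for link in soup.find_all("link")
            if link.get("href") and "stylesheet" in link.get("rel", [])]
img_urls = [urljoin(page_url, img["src"])
            for img in soup.find_all("img") if img.get("src")]

print(css_urls)
print(img_urls)

Each collected URL can then be fetched with a further requests.get call.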

  • I don't need the CSS or image files, just the raw HTML. The code should work, but with some websites it returns that 500 error. – Steve Jan 15 '17 at 12:15

Try to visit http://www.geometriancona.it/categoria_albo/albo/ with your browser in anonymous (incognito) mode: it gives an HTTP 500 error because you need to log in, don't you?

Maybe you should try this syntax:

r = requests.get('https://api.github.com/user', auth=('user', 'pass'))

Your code works, provided you keep the print(_htmlPage) at the end; try it with a different URL, for example:

_requestResult = requests.get("http://www.google.com", headers=headers, timeout=20)

UPDATE

The problem was the cookies: after a packet analysis I found four cookies, so this is the code that works for me:

import requests, bs4

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'}

# cookies captured from a browser session; note that values such as PHPSESSID
# are session-specific and will eventually expire
jar = requests.cookies.RequestsCookieJar()
jar.set('PHPSESSID', '1bj8opfs9nb41l9dgtdlt5cl63', domain='geometriancona.it')
jar.set('wfvt', '587b6fcd2d87b', domain='geometriancona.it')
jar.set('_iub_cs-7987130', '%7B%22consent%22%3Atrue%2C%22timestamp%22%3A%222017-01-15T12%3A17%3A09.702Z%22%2C%22version%22%3A%220.13.9%22%2C%22id%22%3A7987130%7D', domain='geometriancona.it')
jar.set('wordfence_verifiedHuman', 'e8220859a74b2ee9689aada9fd7349bd', domain='geometriancona.it')

# Load mainPage, sending the captured cookies along with the request
_requestResult = requests.get("http://www.geometriancona.it/categoria_albo/albo/", headers=headers, cookies=jar)
_requestResult.raise_for_status()
_htmlPage = bs4.BeautifulSoup(_requestResult.text, "lxml")
print(_htmlPage)

That's my output: http://prnt.sc/dvw2ec
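
As a possible refinement, a sketch that lets a requests.Session collect the cookies automatically instead of hardcoding session-specific values (which expire). This assumes the server sets its cookies on a plain first GET; if Wordfence sets them via JavaScript in the browser, copying them by hand as above may still be necessary:

import requests, bs4

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'}

with requests.Session() as session:
    session.headers.update(headers)
    # first request: let the server set whatever cookies it wants on the session
    session.get("http://www.geometriancona.it/", timeout=20)
    # second request: the session sends the stored cookies back automatically
    _requestResult = session.get("http://www.geometriancona.it/categoria_albo/albo/", timeout=20)
    _requestResult.raise_for_status()
    _htmlPage = bs4.BeautifulSoup(_requestResult.text, "lxml")
    print(_htmlPage)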

  • It does not give me any error when I open the URL with my browser. – Steve Jan 15 '17 at 12:14
  • @Steve, if it works with www.google.com, the problem is the website you are visiting, maybe because you have to **log in**. – Alessandro Lodi Jan 15 '17 at 12:24
  • Now it looks like this, but it still gives the error: _requestResult = requests.get("http://www.geometriancona.it/categoria_albo/albo/", auth=('user', 'pass'), headers=headers) – Steve Jan 15 '17 at 12:32
  • Someone else had this issue: http://stackoverflow.com/questions/11892729/how-to-log-in-to-a-website-using-pythons-requests-module – Alessandro Lodi Jan 15 '17 at 12:38
  • The way to log in with Python requests depends on the site's login method; my **first idea** is to **analyze the packets** while you log in, with [Wireshark](https://www.wireshark.org/). – Alessandro Lodi Jan 15 '17 at 12:40
  • What do you mean by "log in"? In my browser I can see the page normally without logging in at all. See here: http://prntscr.com/dvvrse – Steve Jan 15 '17 at 12:42
  • OK, now I see the right screen, so it was a real internal server error; try to launch your program again. – Alessandro Lodi Jan 15 '17 at 12:47
  • I checked online and it says that an HTTP 500 error is a general error caused by the server you are visiting, but I don't get why this is happening, since the page loads with a normal browser but not with Python (I tried at least 15 times). – Steve Jan 15 '17 at 13:01
  • With **Wireshark** I saw that when I visit the site with a browser **it sends those cookies**, and I thought that **"wordfence_verifiedHuman"** is important to **prevent visits from robots**, so I added it to the request; to be safe **I added all the cookies**, that's all. Wireshark is useful, consider it ;) – Alessandro Lodi Jan 15 '17 at 13:43