
My goal is to scrape data from consumerreports.org, so I am using `requests` and `beautifulsoup4` for this project. Web scraping aside, I am having a lot of trouble successfully logging in to consumerreports.org through `requests`.

Here is my code. I write the POST and GET responses to two text files so I can check whether the login succeeded.

import requests
import os.path

# file paths for saving the POST and GET responses,
# so I can check if the login is successful
save_path = '/Users/myName/Documents/Webscraping Project/'
login_url = 'https://www.consumerreports.org/cro/index.htm'
my_url = 'https://www.consumerreports.org/cro/index.htm'
pName = os.path.join(save_path, 'post text file.txt')
rName = os.path.join(save_path, 'response text file.txt')

# login using the Session class from the requests package
with requests.Session() as s:

    payload = {"userName": "myName@university.edu", "password": "my_password"}
    p = s.post(login_url, data=payload)
    print(p.text)

    r = s.get(my_url)

    # saves the responses to see if login was successful
    with open(pName, "w", encoding="utf-8") as post_file:
        post_file.write(p.text)
    with open(rName, "w", encoding="utf-8") as response_file:
        response_file.write(r.text)


print('Files created.')

This is what I got:

<!DOCTYPE html>
<html>
  <head>
    <title>405 Not allowed.</title>
  </head>
  <body>
    <h1>Error 405 Not allowed.</h1>
    <p>Not allowed.</p>
    <h3>Guru Meditation:</h3>
    <p>XID: #some number </p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>

In addition, I checked the contents of 'response text file.txt' and confirmed with a simple Ctrl+F search that the login had not succeeded.

It seems that the web server does not accept the POST method, at least for this particular URL, which is why it returns the error. However, I don't know how to proceed from here. I looked online, and someone suggested using

response = requests.get(login_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'})

to set a User-Agent header in order to "log in" or whatever. I'm still fairly new to Python, so any advice will be appreciated.

Gino Mempin
dxxg-syz35

2 Answers


You may need to add headers in `s.post`. There is a solution to this error here; it worked for me. Hope this helps.
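For example, a minimal sketch of sending browser-like headers with the session. The header values below are assumptions; copy the real ones for the login request from your browser's developer tools (Network tab):

```python
import requests

# Browser-like headers -- these exact values are assumptions; replace
# them with what your browser actually sends for the login request.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/66.0.3359.181 Safari/537.36"
    ),
    "Referer": "https://www.consumerreports.org/cro/index.htm",
}

s = requests.Session()
s.headers.update(headers)  # every request in this session now sends them
# p = s.post(login_url, data=payload)  # then POST as before
```

Setting the headers on the `Session` object (rather than per request) means the follow-up `s.get(my_url)` sends them too.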

S. B.

The reason for this is that the sign-in form is created via JavaScript. Because the login form is added to the DOM as the result of a click event, it doesn't exist when you execute the request; all `requests` does is fetch the existing page content. If the URL changed to reflect the state (i.e. displaying the login form), then you could use that URL, but it doesn't.

What you need to do is use a headless browser (Chrome or Firefox in headless mode) combined with a library like Selenium. You can load the site in the headless browser and write Selenium code to interact with it. However, this is significantly more involved to implement.
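A rough sketch of that approach, assuming Selenium 4 and a Chrome driver are installed. The element selectors below are placeholders; you would need to inspect the real page and substitute the actual ids/names:

```python
def login_with_browser(email, password,
                       login_url="https://www.consumerreports.org/cro/index.htm"):
    """Drive a headless Chrome through the click-created login form.

    All selectors here are placeholders (assumptions), not the site's
    real element names -- find the actual ones in your browser's
    developer tools.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(login_url)
        # Click whatever opens the login form (placeholder selector),
        # so the form is actually added to the DOM before we fill it.
        driver.find_element(By.CSS_SELECTOR, "a.sign-in").click()
        driver.find_element(By.NAME, "userName").send_keys(email)
        driver.find_element(By.NAME, "password").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
        # Search the returned HTML for your name to confirm the login.
        return driver.page_source
    finally:
        driver.quit()
```

Calling `login_with_browser("myName@university.edu", "my_password")` would return the post-login HTML, which you could then hand to BeautifulSoup as planned.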

Jason
  • Thanks for the response. Luckily CR has an alternate login page that doesn't require a click event. I replaced the original `login_url` with `https://secure.consumerreports.org/ec/login`, and am no longer getting the original error. However, it's still not logging in! I checked `response_file.txt`, and still didn't find my name. The HTML code of the webpage when successfully logged in contains my name, so `response_file.txt` should as well, correct? And I was thinking since the webpage contains a submit button, do I have to include some sort of click action in my post request? – dxxg-syz35 Jun 12 '18 at 00:05