
I'm trying to scrape a web forum, and having trouble accessing pages that are behind a login. Inspecting the elements of the login page, I found that the ID of the username and password input elements change each time I refresh the page. My current strategy is to

  1. Create and use a requests session
  2. Make GET request for the forum login page
  3. Use BeautifulSoup to extract the IDs of the username and password input elements
  4. Use the extracted IDs as the keys, and my account username and password as values, for a payload dict that is passed into a POST request for the login page
  5. Make GET request for a page on the forum

I'm running into a problem in step 4: the status code of the POST request is 400, indicating that I'm doing something wrong.

Here's an MWE, in which the variables KIWIFARMS_USERNAME and KIWIFARMS_PASSWORD have been changed to not be my actual account username and password:

import os

import requests
from bs4 import BeautifulSoup

# login url for forum, and fake forum credentials (they're real in my script)
LOGIN_URL = 'https://kiwifarms.net/login/'
KIWIFARMS_USERNAME = 'username'
KIWIFARMS_PASSWORD = 'password'

with requests.Session( ) as session:

  # step 2: GET the forum login page
  r = session.get( LOGIN_URL )

  # step 3 (parse the page)
  soup = BeautifulSoup( r.content, 'lxml' )

  # step 3 (extract the input IDs)
  username_id = soup.find( 'input', { 'autocomplete' : 'username' } )[ 'id' ]
  password_id = soup.find( 'input', { 'type' : 'password' } )[ 'id' ]

  payload = {
    username_id: KIWIFARMS_USERNAME,
    password_id : KIWIFARMS_PASSWORD }

  # step 4
  post = session.post( LOGIN_URL, data = payload )

  # failure of step 4 (prints 400)
  print( post.status_code )

I've looked at a lot of pages and links, including this, this, this, and this, but I still can't figure out why my POST request is getting a 400 Bad Request error.

I have a version of this working in Selenium, but I'd really like to know the mistake I'm making and get this working using Requests. Any help would be greatly appreciated.

  • The general way to solve this sort of problem is to inspect how a browser login works using a network tracing program like Telerik Fiddler, then make sure your code provides the needed headers and data. – barny Mar 08 '20 at 09:00
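
For example, a minimal sketch of that approach with requests (the header names and values below are placeholders; copy whatever your browser actually sent in the trace):

import requests

session = requests.Session()

# placeholders copied from a browser's network trace of the login
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (placeholder)',
    'Referer': 'https://kiwifarms.net/login/',
})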

2 Answers


The website generates a _xfToken during login, and you are also missing some form data in your POST request.

Here I maintain the session using requests.Session(), parse the value of _xfToken from the GET request, and then pass it along in the POST request.

import requests
from bs4 import BeautifulSoup


def Main():
    with requests.Session() as req:
        # GET the login page and parse the hidden CSRF token (_xfToken) out of the form
        r = req.get("https://kiwifarms.net/login/login")
        soup = BeautifulSoup(r.text, 'html.parser')
        token = soup.find("input", {'name': '_xfToken'}).get("value")
        # form data the login form actually submits
        data = {
            'username': 'test',
            'password': 'test',
            'remember': '1',
            '_xfRedirect': '/',
            '_xfToken': token
        }
        # POST the credentials together with the token, within the same session
        r = req.post("https://kiwifarms.net/login/login", data=data)
        print(r)


Main()

Output:

<Response [200]>

If you check r.text, you will see that we are on the right track.

<div class="blockMessage blockMessage--error blockMessage--iconic">
The requested user could not be found.
</div>

That confirms we are doing it correctly, since I didn't pass a valid username/password.
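
If you want to check for that programmatically rather than by eye, you can look for the error block in the response. A minimal sketch, assuming XenForo keeps rendering login errors in a blockMessage--error element as in the snippet above:

from bs4 import BeautifulSoup

def login_error(html):
    # return the text of the login error block, or None if no error was rendered
    soup = BeautifulSoup(html, 'html.parser')
    error = soup.find('div', class_='blockMessage--error')
    return error.get_text(strip=True) if error else None

# after the POST above:
# print(login_error(r.text))  # e.g. "The requested user could not be found."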


You're trying to POST to https://kiwifarms.net/login/, while the login form action is /login.

I got the same error when I had url/login/ in the URL. The status code changed to 200 when I simply changed it to url/login (basically, I just removed the redundant trailing slash!).
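
Applied to the question's MWE, you don't even have to hard-code the corrected URL: you can resolve the form's action attribute against the page URL. A minimal sketch, assuming the login form is the first <form> on the page:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

LOGIN_URL = 'https://kiwifarms.net/login/'

with requests.Session() as session:
    r = session.get(LOGIN_URL)
    soup = BeautifulSoup(r.content, 'lxml')

    # assumption: the login form is the first <form> on the page
    form = soup.find('form')

    # resolve the action attribute against the page URL instead of
    # posting back to LOGIN_URL with its trailing slash
    action_url = urljoin(LOGIN_URL, form['action'])
    print(action_url)

    # build the payload as in the question (plus the _xfToken from the
    # other answer), then POST to action_url instead of LOGIN_URL:
    # post = session.post(action_url, data=payload)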
