
I wish to extract all forms from a given website using Python3 and BeautifulSoup.

Here is an example that does this, but fails to pick up all forms:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.qantas.com/au/en.html'
data = urlopen(url)
parser = BeautifulSoup(data, 'html.parser')
forms = parser.find_all('form')
for form in forms:
    print(form)
    print('\n\n')

If you run the code and visit the URL, you will notice that the "Book a trip" form is not scraped by the parser.

The code above only picks up three forms, whereas Chrome's Developer tools > Elements panel shows 13 <form> elements. However, if I view the page source (Ctrl+U in Chrome), the source contains only the three forms that BeautifulSoup scraped.

How can I scrape all forms?

Josh
    Not sure what is going on here, but if you go to View Source for the page, it shows only three forms there, which is exactly what you're getting. Could it be that the other forms are generated from a server request *after* the page is loaded? – Abid Hasan Mar 27 '17 at 00:52

2 Answers


It seems that the web page uses JavaScript to load part of its content. Try viewing the page in your browser with JavaScript disabled.

Check whether your form is still there. If it is not, check whether any XHR request in the browser's network console fetches the form. If there is no such request, you should consider using Selenium with the PhantomJS headless browser, or abandon scraping this site.

The headless browser will allow you to get the content of the dynamically created web page and feed that content to BeautifulSoup.
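As a minimal sketch of that approach, the snippet below renders the page in a headless browser and hands the resulting HTML to BeautifulSoup. Note that I am substituting headless Chrome for PhantomJS here, and the `count_forms` helper is a name I introduce for illustration, not something from the question:

```python
from bs4 import BeautifulSoup

def count_forms(html):
    """Count <form> elements in an HTML string using BeautifulSoup."""
    return len(BeautifulSoup(html, "html.parser").find_all("form"))

if __name__ == "__main__":
    # Selenium is only needed when actually fetching the live page
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://www.qantas.com/au/en.html")
        # page_source reflects the DOM *after* JavaScript has run
        print(count_forms(driver.page_source))
    finally:
        driver.quit()
```

The key point is that `driver.page_source` returns the DOM after JavaScript execution, unlike `urlopen`, which returns only the raw server response.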

Christos Papoulas

With the help of PhantomJS (http://phantomjs.org/download.html) and Selenium you can do this.

Steps:

1. On a terminal or cmd, run: pip install selenium
2. Download PhantomJS and unzip it, then put "phantomjs.exe" on the Python path, for example on Windows: C:\Python27

Then use this code; it will give you the desired result:

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.qantas.com/au/en.html'

# Load the page in the headless browser so its JavaScript can run
driver = webdriver.PhantomJS()
driver.get(url)

# page_source contains the DOM after JavaScript execution
data = driver.page_source
parser = BeautifulSoup(data, 'html.parser')

forms = parser.find_all('form')
for form in forms:
    print(form)
    print('\n\n')

driver.quit()

It will print all 13 forms.

Note: due to the length limit, I am not able to include the output in the answer.

thebadguy