
This is my code:

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.basketball-reference.com/draft/NBA_2014.html"
html = urlopen(url)
soup = BeautifulSoup(html)
column_headers = [th.getText() for th in soup.findAll('tr',limit=2)[1].findAll('th')]
data_rows = soup.findAll('tr')[2:]
player_data = [[td.getText() for td in data_rows[i].findAll('td')] for i in range(len(data_rows))] #PLAYER DATA 

type(soup)
type(data_rows)

df = pd.DataFrame(player_data,columns=column_headers)

The error seems to occur in the last line.

sophros
Aditya Gade
  • Can you elaborate to make your post clearer? – nyedidikeke Nov 28 '16 at 23:16
  • Please read [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). It will help you to revise your question so that you communicate the information that we need to help you. Learning [how to ask good questions](http://stackoverflow.com/help/how-to-ask) is painful but well worth the effort. – MikeJRamsey56 Nov 29 '16 at 02:59
  • I will re-post the above with clear content – Aditya Gade Nov 29 '16 at 03:24
  • "22 columns passed, passed data had 21 columns" is pretty self-descriptive. – ivan_pozdeev Nov 30 '16 at 16:58

1 Answer


First of all, the error is pretty straightforward: your column_headers list has 22 columns, but the player_data entries only have 21. So you need to find out which column is missing and why. Comparing the entries with the headers list visually, it appears that one of the first two columns is missing. player_data[0] returns

1, CLE, Andrew Wiggins, University of Kansas,...

but it should be

1, 1, CLE, Andrew Wiggins, University of Kansas,...
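
Instead of comparing by eye, the lengths can be checked directly. A minimal sketch, assuming a made-up, shortened sample in place of the scraped lists (the real table has 22 headers; the names expected, row_lengths and mismatched are only for illustration):

```python
from collections import Counter

# Hypothetical, shortened stand-ins for the scraped lists.
column_headers = ["Rk", "Pk", "Tm", "Player", "College"]
player_data = [
    ["1", "CLE", "Andrew Wiggins", "University of Kansas"],
    ["2", "MIL", "Jabari Parker", "Duke University"],
]

expected = len(column_headers)

# Count how many rows exist for each row length.
row_lengths = Counter(len(row) for row in player_data)

# Indices of the rows whose length does not match the header count.
mismatched = [i for i, row in enumerate(player_data) if len(row) != expected]

print("headers:", expected)
print("row lengths:", dict(row_lengths))
print("rows with a different length:", mismatched)
```

If every row is short by the same amount, as here, the cause is most likely how the cells are selected rather than a few malformed rows.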

The problem is the table itself. Open the page in your browser, right-click the table and choose Inspect.

The first row of data (underneath the 'Rk' header) consists of 21 td elements and 1 th element. The 'Rk' entry is actually a th element, not a td:

[Screenshot of the table of provided data]

That is why your

player_data = [[td.getText() for td in data_rows[i].findAll('td')] for i in range(len(data_rows))]

skips the first column: it only iterates over td elements, hence the different length. I don't know how important the first column is; a quick fix would be to drop the Rk column from your headers list.
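
The quick fix could look roughly like this. It is only a sketch on a made-up, shortened sample, and it assumes that Rk really is the first header; headers_without_rk and records are illustrative names:

```python
# Hypothetical, shortened stand-ins for the scraped lists.
column_headers = ["Rk", "Pk", "Tm", "Player"]
player_data = [
    ["1", "CLE", "Andrew Wiggins"],
    ["2", "MIL", "Jabari Parker"],
]

# Drop the leading "Rk" header so the headers line up with the td-only rows.
if column_headers[0] == "Rk":
    headers_without_rk = column_headers[1:]
else:
    headers_without_rk = column_headers

# Every row now has exactly one value per remaining header.
assert all(len(row) == len(headers_without_rk) for row in player_data)

# Pair each header with its value to confirm the alignment.
records = [dict(zip(headers_without_rk, row)) for row in player_data]
print(records[0])
```

With the real data, passing headers_without_rk as columns to pd.DataFrame should then match the 21-value rows.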

Alternatively, search for both td and th elements:

player_data = [[td.getText() for td in data_rows[i].findAll(['td','th'])] for i in range(len(data_rows))]
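
To see the difference between the two selections without hitting the website, here is a small sketch on a made-up HTML snippet that mimics the structure described above (a th cell at the start of each data row). The snippet and the names td_only and td_and_th are assumptions for illustration, not the real page:

```python
from bs4 import BeautifulSoup

# Made-up, trimmed-down table: two header rows, then data rows that
# start with a th cell, as in the inspected page.
html = """
<table>
  <tr><th></th><th colspan="3">Draft</th></tr>
  <tr><th>Rk</th><th>Pk</th><th>Tm</th><th>Player</th></tr>
  <tr><th>1</th><td>1</td><td>CLE</td><td>Andrew Wiggins</td></tr>
  <tr><th>2</th><td>2</td><td>MIL</td><td>Jabari Parker</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")
column_headers = [th.get_text() for th in rows[1].find_all("th")]
data_rows = rows[2:]

# Only td cells: the leading th cell of each row is skipped.
td_only = [[cell.get_text() for cell in row.find_all("td")] for row in data_rows]

# Both td and th cells, returned in document order.
td_and_th = [[cell.get_text() for cell in row.find_all(["td", "th"])] for row in data_rows]

print(len(column_headers), len(td_only[0]), len(td_and_th[0]))
print(td_and_th[0])
```

Here find_all and get_text are the newer spellings of findAll and getText; both forms work in bs4.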
marts