
I've been trying to get some information from a table on a stock exchange website (https://www.idx.co.id/en-us/listed-companies/company-profiles/)

using Python (lxml, requests & pandas). This is the reference I used:

https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059

Since I am an absolute newbie to Python/programming, maybe somebody has an idea of how to apply .xpath to only the row elements in the table body and then extract their content? I have looked into using bs4/BeautifulSoup as well but didn't get that to work either. Any help or suggestion is much appreciated! Thank you for your time.

My code

from lxml import html as lh
import requests
import pandas as pd

# fetch the contents of the website
page = requests.get('http://www.idx.co.id/en-us/listed-companies/company-profiles/')
# parse the HTML into an element tree
doc = lh.fromstring(page.content)
# select the table body (note: this grabs <tbody> itself, not the <tr> rows)
tr_elements = doc.xpath('//*[@id="companyTable"]/tbody')

#create empty list
col = []
i = 0

for j in range(0,len(tr_elements)):
    #T is our j'th row
    T = tr_elements[j]

    #If row is not of size 4, the //tr data is not from our table
    if len(T)!=4:
        break

    # i is column index
    i=0

    # Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()

        #Append the data to the empty list of the i'th column
        col[i][1].append(data)

        #Increment i for the next column
        i+=1
[len(C) for (title,C) in col] # check the number of values in each column

Dict = {title:column for (title,column) in col}
df = pd.DataFrame(Dict)

print(df)

Output of print(df)

Empty DataFrame
Columns: []
Index: []

The expected output:

Columns: [No, Code, Name, Listing Date]  
Index: [1, AALI, Astra Agro Lestari Tbk, 09 Dec 1997]
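For reference, the XPath part of the question has a simple answer on a static table: select the `tr` children of `tbody` and iterate their cells. A minimal, self-contained sketch using a hypothetical HTML snippet (as the answer below explains, the live page fills the table via JavaScript, so this alone would not fix the empty result):

```python
from lxml import html as lh

# A small static HTML snippet standing in for the real page (hypothetical)
snippet = """
<table id="companyTable">
  <tbody>
    <tr><td>1</td><td>AALI</td><td>Astra Agro Lestari Tbk</td><td>09 Dec 1997</td></tr>
  </tbody>
</table>
"""
doc = lh.fromstring(snippet)

# Select the <tr> elements inside the tbody, not the <tbody> itself
rows = doc.xpath('//*[@id="companyTable"]/tbody/tr')

# Extract the text of each cell, row by row
data = [[td.text_content() for td in row] for row in rows]
print(data)  # [['1', 'AALI', 'Astra Agro Lestari Tbk', '09 Dec 1997']]
```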
  • Perhaps this [thread](https://stackoverflow.com/questions/1064968/how-to-use-xpath-contains-here) can help. – YusufUMS Feb 22 '19 at 02:39
  • Hey @Yusuf, thanks for the recommendation; sadly I don't really understand enough to apply this to my problem. I'll have to spend my weekend going through the documentation and I'll get it eventually. – Nick Feb 22 '19 at 04:44
  • Can you provide the expected output if the code is going well? – YusufUMS Feb 22 '19 at 05:22
  • I think the `td` elements in the table can't be accessed. That's why the result is always empty. Try using `selenium`; refer to [here](https://stackoverflow.com/questions/45499517/beautifulsoup-parser-cant-access-html-elements) – YusufUMS Feb 22 '19 at 07:43
  • @Yusuf So I'm still struggling to make it work but you definitely put me on the right path with using selenium. Thank you so much for that, really appreciate the help! – Nick Feb 22 '19 at 09:37

1 Answer


The reason you get an empty result is that the page uses AJAX to load the table's content (it uses https://datatables.net). If you want to scrape JavaScript-generated content, requests is insufficient since it does not execute JavaScript. You need to drive a browser or a headless browser such as Chromedriver using a library like selenium-python. If you want to go down that path, there are a lot of tutorials available on the internet.

However, there is a better way. If you understand how AJAX works, the page obviously needs to call an API to retrieve the data. Once you find that API, you can retrieve the data from it directly with just a few lines of code:

import requests
import pandas as pd

res = requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfiles?draw=1&columns%5B0%5D%5Bdata%5D=KodeEmiten&columns%5B0%5D%5Bname%5D&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=KodeEmiten&columns%5B1%5D%5Bname%5D&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=NamaEmiten&columns%5B2%5D%5Bname%5D&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=false&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=TanggalPencatatan&columns%5B3%5D%5Bname%5D&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=700&search%5Bvalue%5D&search%5Bregex%5D=false&_=155082600847')
data = res.json()
df = pd.DataFrame.from_dict(data['data'])
print(df.columns)
print(df)

Result:

Index(['Alamat', 'BAE', 'DataID', 'Divisi', 'EfekEmiten_EBA', 'EfekEmiten_ETF',
       'EfekEmiten_Obligasi', 'EfekEmiten_SPEI', 'EfekEmiten_Saham', 'Email',
       'Fax', 'JenisEmiten', 'KegiatanUsahaUtama', 'KodeDivisi', 'KodeEmiten',
       'Logo', 'NPKP', 'NPWP', 'NamaEmiten', 'PapanPencatatan', 'Sektor',
       'Status', 'SubSektor', 'TanggalPencatatan', 'Telepon', 'Website', 'id'],
      dtype='object')
                                                Alamat ... id
0    Jl Pulo Ayang Raya Blok OR No. 1  Kawasan Indu... ...  0
1    Sahid Office Boutique, Blok G Jl Jend Sudirman... ...  0
2    Plaza ABDA Lt. 27  Jl. Jend. Sudirman Kav. 59 ... ...  0
3    Gedung TMT 1 Lantai 18  Jl. Cilandak KKO No. 1... ...  0
4    Gedung Kawan Lama Lantai 5  Jl. Puri Kencana N... ...  0
5    ACSET Building, Jalan Majapahit No.26, Kelurah... ...  0
6    Perkantoran Hijau Arkadia Tower C Lantai 15\rJ... ...  0
7         Jalan Raya Pasar Minggu Km. 18 Jakarta 12510 ...  0
8    Gedung The Landmark I Lantai 26-31\r\nJl. Jend... ...  0
9    Gedung Wisma 46 Kota BNI Kav 1 LT. 20 JL Jend.... ...  0
..                                                 ... ... ..
620  Gedung Graha Irama lt. 2-E\rJl. H.R. Rasuna Sa... ...  0
621  Plaza Mutiara Lt. 5,\rJl. Dr. Ide Anak Agung G... ...  0
622  Jl. Jemur Sari Selatan IV/3, \r\nSurabaya 6023... ...  0
623  Jl. Pantai Indah Selatan I, Elang Laut Blok A ... ...  0
624  Jalan Karet Pedurenan No. 240, Karet Kuningan,... ...  0

[625 rows x 27 columns]
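Since the expected output only needs the code, name, and listing date, the DataFrame can be trimmed to those three API fields and renamed. A sketch on hypothetical sample records shaped like entries in the API's `data` list:

```python
import pandas as pd

# Two hypothetical records shaped like entries in the API's 'data' list
sample = [
    {'KodeEmiten': 'AALI', 'NamaEmiten': 'Astra Agro Lestari Tbk',
     'TanggalPencatatan': '1997-12-09', 'Alamat': 'Jl Pulo Ayang Raya ...'},
    {'KodeEmiten': 'XXXX', 'NamaEmiten': 'Example Company Tbk',
     'TanggalPencatatan': '2002-04-03', 'Alamat': 'Example address ...'},
]
df = pd.DataFrame(sample)

# Keep only the columns matching the question's expected output, then rename
out = df[['KodeEmiten', 'NamaEmiten', 'TanggalPencatatan']].rename(
    columns={'KodeEmiten': 'Code', 'NamaEmiten': 'Name',
             'TanggalPencatatan': 'Listing Date'})
print(out)
```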
Yohanes Gultom
  • Thanks a lot for that! May I know how you got the URL for the API? I've tried finding it myself on a specific company's page by looking at the HTML, but while I do see the references to /umbraco, I can't seem to find a way to get an actual URL. Either way, thank you, sir! – Nick Feb 22 '19 at 11:25
  • I just used Chrome developer tools to monitor XHR network requests. If I'm not mistaken, there were only 3 of them on the page, so I just checked them one by one: https://developers.google.com/web/tools/chrome-devtools/network/reference – Yohanes Gultom Feb 22 '19 at 11:47
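As a side note, the long query string in the answer does not have to be pasted verbatim: requests can assemble DataTables-style parameters from a dict. A sketch with a hypothetical subset of the parameters (the URL is only built here, not actually requested):

```python
import requests

base = 'https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfiles'

# Hypothetical subset of the DataTables query parameters from the answer
params = {
    'draw': 1,
    'columns[0][data]': 'KodeEmiten',
    'start': 0,
    'length': 700,
}

# Prepare the request without sending it, just to inspect the encoded URL
req = requests.Request('GET', base, params=params).prepare()
print(req.url)
```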