I've been trying to extract a table from a stock exchange site (https://www.idx.co.id/en-us/listed-companies/company-profiles/)
using Python (lxml, requests & pandas). This is the reference I used:
https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059
Since I am an absolute newbie to Python/programming, maybe somebody has an idea of how to apply .xpath
to only the row elements in the table body and then extract their content? I have looked into using bs4/BeautifulSoup as well but didn't get that to work either. Any help or suggestion is much appreciated! Thank you for your time.
My code:
from lxml import html as lh
import requests
import pandas as pd

# Create a handle, page, for the contents of the website
page = requests.get('http://www.idx.co.id/en-us/listed-companies/company-profiles/')
# Store the contents under doc
doc = lh.fromstring(page.content)
# Parse data stored between <tr>..</tr> of the HTML
tr_elements = doc.xpath('//*[@id="companyTable"]/tbody')
# Create empty list
col = []
i = 0
for j in range(0, len(tr_elements)):
    # T is our j'th row
    T = tr_elements[j]
    # If row is not of size 4, the //tr data is not from our table
    if len(T) != 4:
        break
    # i is the column index
    i = 0
    # Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        # Append the data to the empty list of the i'th column
        col[i][1].append(data)
        # Increment i for the next column
        i += 1

[len(C) for (title, C) in col]  # check the number of values in all columns
Dict = {title: column for (title, column) in col}
df = pd.DataFrame(Dict)
print(df)
Output of print(df):
Empty DataFrame
Columns: []
Index: []
The expected output:
Columns: [No, Code, Name, Listing Date]
Index: [1, AALI, Astra Agro Lestari Tbk, 09 Dec 1997]
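For reference, here is a minimal sketch of selecting the row elements directly with `/tbody/tr` instead of stopping at the `tbody`, run against a hypothetical static HTML snippet that mimics the `#companyTable` markup (if the live page renders its rows with JavaScript, `requests` will not see any `<tr>` elements at all, and this xpath would match nothing):

```python
from lxml import html as lh
import pandas as pd

# Hypothetical static snippet mimicking the #companyTable markup;
# the live page may serve an empty <tbody> to requests.
snippet = """
<table id="companyTable">
  <tbody>
    <tr><td>1</td><td>AALI</td><td>Astra Agro Lestari Tbk</td><td>09 Dec 1997</td></tr>
    <tr><td>2</td><td>ABBA</td><td>Mahaka Media Tbk</td><td>03 Apr 2002</td></tr>
  </tbody>
</table>
"""

doc = lh.fromstring(snippet)
# Select the <tr> elements inside the tbody, not the tbody itself
rows = doc.xpath('//*[@id="companyTable"]/tbody/tr')
# One record per row: the text of each <td>, stripped of whitespace
records = [[td.text_content().strip() for td in row] for row in rows]
df = pd.DataFrame(records, columns=['No', 'Code', 'Name', 'Listing Date'])
print(df)
```

Building a list of row records and passing it to `pd.DataFrame` with explicit column names avoids maintaining the per-column lists by hand.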