
I am scraping data from a website and found that the table data shows only as "loading" in the page's source code. How can I collect that data using Python? It seems to be a React JS web app.

URL: https://www.ycombinator.com/companies/

KunduK

2 Answers


I couldn't find the data as a request under XHR, so you could use Selenium, which lets the page render, and then grab the table with pandas:

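from io import StringIO

import pandas as pd
from selenium import webdriver

# Selenium 4.6+ locates a matching chromedriver itself; on older versions,
# pass the driver path (e.g. 'C:/chromedriver_win32/chromedriver.exe') instead.
driver = webdriver.Chrome()

url = 'https://www.ycombinator.com/companies/'
try:
    driver.get(url)
    # Parse the rendered HTML; StringIO avoids the warning newer pandas
    # versions raise when read_html is given a literal HTML string.
    df = pd.read_html(StringIO(driver.page_source))[0]
finally:
    # quit() ends the whole browser session; close() only closes the window.
    driver.quit()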

Output:

print(df)
[                    0      1                                                  2
0          Actiondesk  s2019  Google Sheets meets Zapier. Actiondesk lets no...
1               Alana  s2019  Helping large companies in LATAM hire blue-col...
2        Apero Health  s2019                            Modern medical billing.
3             Apurata  s2019  Small loans for the Latin American middle clas...
4        Arpeggio Bio  s2019  Arpeggio builds technology to watch and learn ...
5              Asayer  s2019  Asayer is a session replay tool for developers...
6           Asher Bio  s2019                    We build better immunotherapies
7          AudioFocus  s2019                                                NaN
8          Axite Labs  s2019  A modern IP licensing platform to accelerate t...
9               basis  s2019  Software to automate construction workflows, s...
10         Beacons AI  s2019  Helping creators monetize through short video ...
11              Binks  s2019  Binks is a chain of trusted micro-boutiques th...
12              Blair  s2019  Financing college education through Income Sha...
13       Boost Biomes  s2019                                                NaN
14            Bouncer  s2019  SDK for scanning and verifying credit cards an...
15         Brave Care  s2019  Modern healthcare for kids. We do that with a ...
16          Breadfast  s2019  Breadfast delivers fresh bread, milk and eggs ...
17        BuildStream  s2019              A market network for industrial labor
18     Business Score  s2019     Connecting startups with the things they need.
19              Canix  s2019  Canix makes it easy to get and stay compliant ...
20              Carry  s2019  Carry plans, books, and supports corporate tra...
21              Carve  s2019                                                NaN
22            Cloosiv  s2019  Cloosiv is an order-ahead app for independent ...
23               Coco  s2019  The Venezuelan Instacart - allowing Venezuelan...
24     CoLab Software  s2019              Jira for Mechanical Engineering Teams
25           Compound  s2019  Compound helps people who work at startups und...
26            Courier  s2019  Send your product's user notifications to the ...
27             Covela  s2019     The digital insurance broker for SMEs in LATAM
28              Cuboh  s2019  Cuboh helps restaurants use several delivery p...
29              Curri  s2019  We provide on-demand material delivery for the...
              ...    ...                                                ...
2009           Zenter  w2007                                                NaN
2010          Jamglue  s2006                                                NaN
2011         Jumpchat  s2006                                                NaN
2012       Likebetter  s2006                                                NaN
2013           Omgpop  s2006                                                NaN
2014       Pollground  s2006                                      Online polls.
2015           Scribd  s2006                    World's largest online library.
2016         Shoutfit  s2006                                                NaN
2017          Talkito  s2006                                                NaN
2018       Thinkature  s2006                                                NaN
2019            Xobni  s2006                                                NaN
2020        Zanbazaar  s2006                                                NaN
2021        Audiobeta  w2006                                                NaN
2022         Clustrix  w2006                                                NaN
2023            Flagr  w2006                                                NaN
2024          Inkling  w2006                                                NaN
2025  Project Wedding  w2006                                                NaN
2026         Snipshot  w2006                  We sold Snipshot to Ansa in 2013.
2027            Wufoo  w2006                               Online form builder.
2028          Airtime  s2005                                                NaN
2029       Clickfacts  s2005                                                NaN
2030         Infogami  s2005                                                NaN
2031             Kiko  s2005  We're the best online calendar solution to eve...
2032            Loopt  s2005                                                NaN
2033           Memamp  s2005                                                NaN
2034          Parakey  s2005                                                NaN
2035        Posthaven  s2005                                   Blogging forever
2036           Reddit  s2005                     The frontpage of the internet.
2037          Simmery  s2005                                                NaN
2038        TextPayMe  s2005                                                NaN

[2039 rows x 3 columns]]
chitown88
  • I am getting an unresolved import error with selenium – srinivas muralidharan Dec 20 '19 at 13:00
  • Did you install selenium and the correct drivers? – chitown88 Dec 20 '19 at 13:16
  • @chitown88 There is an API. If you look under the All tab in the Network panel, you will find it. – KunduK Dec 20 '19 at 13:22
  • @KunduK, good find! Yup, that's what I was looking for, but I was only looking under XHR. I'll have to remember to check under the All tab in the future. Thanks for posting that! Srinivas, accept KunduK's solution. While Selenium will work, theirs is the better alternative. – chitown88 Dec 20 '19 at 13:38
  • To go more in depth, how does scraping differ from API calls? Is this scraping or an API call? – srinivas muralidharan Dec 20 '19 at 13:45
  • Going through selenium, or requests, or beautifulsoup (and actually pandas' `pd.read_html()` can use beautifulsoup under the hood), you'd be scraping: you are parsing the HTML source to extract the data. A request to an API just gets the data directly. You aren't really scraping the data then; you're querying for it directly from the source that feeds the data into the HTML – chitown88 Dec 20 '19 at 13:52
  • An API is always the better way to go if you can. It's usually nicely structured, and a lot of the time you can get additional metadata that's not easily seen in the HTML. – chitown88 Dec 20 '19 at 13:53
  • @chitown88: When you said you found nothing under XHR, I had a doubt, so I checked the All tab and found that API link. However, I appreciate your effort. – KunduK Dec 20 '19 at 14:03
  • Thanks @KunduK. I appreciate your efforts towards the solution. I am still curious about a web scraping approach, though – srinivas muralidharan Dec 20 '19 at 15:31
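
To make the scraping vs. API distinction from the comments concrete, here is a rough sketch using only the standard library. The HTML snippet and the JSON payload are hypothetical stand-ins (the real page markup and API response may well look different); the two sample records are taken from the output shown above. Scraping has to walk the markup and guess which cells matter, while the API route just decodes structured data.

```python
import json
from html.parser import HTMLParser

# Hypothetical rendered markup; the real page's HTML may look different.
HTML_PAGE = """
<table>
  <tr><td>Wufoo</td><td>w2006</td><td>Online form builder.</td></tr>
  <tr><td>Reddit</td><td>s2005</td><td>The frontpage of the internet.</td></tr>
</table>
"""

# Hypothetical API payload carrying the same two records.
API_BODY = """[
  {"name": "Wufoo", "batch": "w2006", "description": "Online form builder."},
  {"name": "Reddit", "batch": "s2005", "description": "The frontpage of the internet."}
]"""


class TableScraper(HTMLParser):
    """Collect the text of every <td>, grouped by the enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text inside the cell


def scrape(html):
    """Scraping: parse the HTML source and pull the cell text out of it."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return [tuple(row) for row in parser.rows]


def query(body):
    """API call: the body is already structured, so just decode and pick fields."""
    return [(c["name"], c["batch"], c["description"]) for c in json.loads(body)]


scraped = scrape(HTML_PAGE)
queried = query(API_BODY)
print(scraped == queried)
```

Both routes end up with the same tuples here, but the scraper breaks as soon as the markup changes, whereas the JSON keys tend to be more stable.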

If you open the Network tab in your browser's developer tools, you will find the API below, which returns the data in JSON format. You don't need Selenium or BeautifulSoup.

https://api.ycombinator.com/companies/export.json?

Here is the code:

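import requests

# The endpoint returns a JSON list with one object per company.
res = requests.get("https://api.ycombinator.com/companies/export.json?", timeout=30)
res.raise_for_status()

for item in res.json():
    # Print each field that is present; .get() avoids a KeyError on
    # missing keys, and None values are skipped instead of raising.
    for label, key in [('name', 'name'), ('URL', 'url'),
                       ('batch', 'batch'), ('Description', 'description')]:
        value = item.get(key)
        if value is not None:
            print(f'{label}:{value}')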
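
If you want to keep the data rather than print it, something along these lines could work. This is only a sketch: it assumes the endpoint still returns a JSON list of objects with the `name`, `url`, `batch` and `description` keys used above, and the helper names `fetch_companies` and `companies_to_csv` are made up for illustration. It uses the standard library's `urllib` so it runs without extra installs; `requests` would do the same job.

```python
import csv
import io
import json
import urllib.request

API_URL = "https://api.ycombinator.com/companies/export.json"
FIELDS = ["name", "url", "batch", "description"]  # keys used in the answer above


def fetch_companies(url=API_URL, timeout=30):
    """Download and decode the JSON list of company records (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def companies_to_csv(companies, fields=FIELDS):
    """Render the records as CSV text, leaving missing or null fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for company in companies:
        # Keep only the chosen fields; absent keys and None become "".
        writer.writerow({field: company.get(field) or "" for field in fields})
    return buffer.getvalue()


# Example usage (needs network access, so it is left commented out):
# with open("companies.csv", "w", newline="", encoding="utf-8") as f:
#     f.write(companies_to_csv(fetch_companies()))
```

Splitting the download from the formatting means you can try the CSV part on a couple of hand-written records before hitting the real API.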

[Screenshot: snapshot of the API request]

[Screenshot: the API response]

KunduK