I am trying to scrape flight prices from the Expedia website using rvest, with SelectorGadget to find the CSS selector. Here is my code:


library(rvest)
library(lubridate)  

url <- paste('https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A', 06,'%2F', 10,'%2F',2018,'TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com', sep = "")

webpage <- read_html(url)

departure_time_data_html <- html_nodes(webpage, '.medium-bold span:nth-child(1)')
departure_time_data <- html_text(departure_time_data_html)
departure_time_data

[1] "11:40am" "7:45am" "6:29am" "6:00am" "5:55am"

On the actual website there are 42 entries on a single page, but the code extracts only 5 values. Here is the link to the page:

https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A6%2F10%2F2018TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com

I would be glad to hear from any of you. Thank you.

  • You may want to check [this](https://stackoverflow.com/questions/29861117/r-rvest-scraping-a-dynamic-ecommerce-page) answer – tyumru Jun 05 '18 at 07:37

1 Answer

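The website stores the flight data in a JSON string embedded in the page, and the browser parses that string to render the results. The CSS selector only matches the few entries already present in the static HTML, which is why you get 5 values instead of 42. You can extract the information directly from the JSON string instead. (The screenshot below shows the page source.)

[Screenshot: page source containing the JSON string]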
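
To illustrate the idea without depending on the live site, here is a minimal sketch on a made-up page. Only the id `cachedResultsJson` and the `content` and `legs` fields come from the real page source; the `.time` class, the leg keys, and the flat `departureTime` field are simplified assumptions for the example. It shows a JSON object whose `content` field is itself a JSON string, which is why two calls to `fromJSON` are needed.

```r
library(rvest)
library(jsonlite)

# Hypothetical inner payload: three legs, keyed by made-up ids.
inner <- toJSON(list(legs = list(
  a1 = list(departureTime = "7:05am"),
  b2 = list(departureTime = "5:30pm"),
  c3 = list(departureTime = "7:50pm")
)), auto_unbox = TRUE)

# The outer object stores the inner JSON as a plain string in `content`.
outer <- toJSON(list(content = as.character(inner)), auto_unbox = TRUE)

# Made-up page: one entry rendered as HTML, all three entries inside the script tag.
html <- paste0(
  '<html><body>',
  '<span class="time">7:05am</span>',
  '<script id="cachedResultsJson" type="application/json">', outer, '</script>',
  '</body></html>'
)
page <- read_html(html)

# A CSS selector only sees what is present as HTML nodes.
rendered <- page %>% html_nodes(".time") %>% html_text()

# The embedded JSON holds every entry; parse the outer string, then its content.
json_text <- page %>% html_node("#cachedResultsJson") %>% html_text()
outer_parsed <- fromJSON(json_text)
inner_parsed <- fromJSON(outer_parsed$content)

times <- vapply(inner_parsed$legs, function(leg) leg$departureTime, character(1))

length(rendered) # entries visible to the selector
length(times)    # entries available in the JSON
```

The same two-step parse is applied to the real page in the code that follows.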

library(rvest)
library(jsonlite)
library(purrr)

url <- paste('https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A', 06,'%2F', 10,'%2F',2018,'TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com', sep = "")

webpage <- read_html(url)

departure_time_data_html <- html_node(webpage, '#cachedResultsJson') # node whose id marks the embedded JSON string
json_text <- departure_time_data_html %>% html_text() # get the JSON string as text

result <- fromJSON(json_text) # parse the outer JSON string into a list
result1 <- fromJSON(result$content) # the content field is itself a JSON string, so parse it again

result1$legs$`0c46a88d484464ad78b9a0985e80ab4e`$timeline$departureTime # example: departure info for one flight

map(result1$legs, ~ .x$timeline$departureTime) # extract departure info for every flight using map

Sample result:

> map(result1$legs,~ .x$timeline$departureTime)
$`0c46a88d484464ad78b9a0985e80ab4e`
       date dateLongStr   time     dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 7:05am 1.528632e+12   06/10/18 2018-06-10T07:05:00.000-05:00   NA
2      <NA>        <NA>   <NA>           NA       <NA>                          <NA>   NA
3 6/10/2018 Sun, Jun 10 9:02am 1.528639e+12   06/10/18 2018-06-10T09:02:00.000-05:00   NA

$`90341ad9782711784a797ffeb22a5e44`
       date dateLongStr   time    dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 5:30pm 1.52867e+12   06/10/18 2018-06-10T17:30:00.000-05:00   NA

$c40e4d757819356926cc693ca1820827
       date dateLongStr   time     dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 7:50pm 1.528678e+12   06/10/18 2018-06-10T19:50:00.000-05:00   NA
2      <NA>        <NA>   <NA>           NA       <NA>                          <NA>   NA
3 6/10/2018 Sun, Jun 10 9:42pm 1.528685e+12   06/10/18 2018-06-10T21:42:00.000-05:00   NA

$`83d7b1595e668e9c4fa886b164202f37`
       date dateLongStr   time     dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 5:54pm 1.528671e+12   06/10/18 2018-06-10T17:54:00.000-05:00   NA
2      <NA>        <NA>   <NA>           NA       <NA>                          <NA>   NA
3 6/10/2018 Sun, Jun 10 7:45pm 1.528678e+12   06/10/18 2018-06-10T19:45:00.000-05:00   NA
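
If you want one row per flight rather than a list of data frames, something like the following sketch may help. The `departures` list is a hand-made stand-in for the output of `map(result1$legs, ~ .x$timeline$departureTime)`, reduced to two columns, and the leg names are hypothetical. It also assumes that the all-`NA` rows separate the segments of a connecting flight, which is an interpretation of the output above rather than something documented.

```r
library(purrr)

# Hypothetical stand-in for map(result1$legs, ~ .x$timeline$departureTime)
departures <- list(
  leg_a = data.frame(
    time = c("7:05am", NA, "9:02am"),
    isoStr = c("2018-06-10T07:05:00.000-05:00", NA, "2018-06-10T09:02:00.000-05:00"),
    stringsAsFactors = FALSE
  ),
  leg_b = data.frame(
    time = "5:30pm",
    isoStr = "2018-06-10T17:30:00.000-05:00",
    stringsAsFactors = FALSE
  )
)

# First non-missing departure time of each leg
first_departure <- map_chr(departures, ~ .x$time[!is.na(.x$time)][1])

# Number of non-missing rows per leg (assumed to be the number of segments)
segments <- map_int(departures, ~ sum(!is.na(.x$time)))

# One row per leg
summary_df <- data.frame(
  leg = names(first_departure),
  departure = unname(first_departure),
  segments = unname(segments),
  stringsAsFactors = FALSE
)
summary_df
```

`map_chr` and `map_int` are used instead of `map` so that a leg with an unexpected shape raises an error instead of silently producing a ragged list.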