0

I have installed those packages: rvest stringr tidyr data.table plyr xml2 selectr tibble purr datapasta jsonlite

I have scraped UN FAO website for research and finally got this:

FAO_AreaName <-"MEX"
news_url <- paste0("http://www.fao.org/countryprofiles/common/allnews/en/?iso3=",FAO_AreaName,"&allnews=no&limit=2")
news<- fromJSON(news_url)
title <- news[3]
date <- news[6]
FAO_AreaName_1 <- news[5]
content_MEX <- news[5]
MEX <- cbind(FAO_AreaName, FAO_AreaName_1, date,title, content_MEX)

FAO_AreaName <-"FSM"
news_url <- paste0("http://www.fao.org/countryprofiles/common/allnews/en/?iso3=",FAO_AreaName,"&allnews=no&limit=2")
news<- fromJSON(news_url)
title <- news[3]
date <- news[6]
FAO_AreaName_1 <- news[5]
content_FSM <- news[5]
FSM <- cbind(FAO_AreaName, FAO_AreaName_1, date,title, content_FSM)

When I merge the two datasets I get this:

MERGE <- merge(MEX, FSM, by="FAO_AreaName", all=T)

str(MERGE)
'data.frame':   4 obs. of  7 variables:
 $ FAO_AreaName : Factor w/ 2 levels "MEX","FSM": 1 1 2 2
 $ date_format.x: chr  "27/11/2017" "16/11/2017" NA NA
 $ title.x      : chr  "México es sede de reunión regional sobre la iniciativa de Crecimiento Azul de la FAO   " "Lograr el hambre cero pasa por reducir la pérdida y desperdicio de alimentos" NA NA
 $ bodytext.x   : chr  " \r\n\r\nLa Comisión Nacional de Acuacultura y Pesca de México es anfitriona de la principal reunión sobre la actividad en Amér"| __truncated__ " \r\n\r\nSe realiza  Foro sobre el desperdicio y pérdida de alimentos en México: retos y soluciones, organizado en el Senado de"| __truncated__ NA NA
 $ date_format.y: chr  NA NA "11/11/2017" "11/11/2017"
 $ title.y      : chr  NA NA "Pacific leaders alarmed over climate change’s negative impact on food systems and food security" "Pacific leaders alarmed over climate change’s negative impact on food systems and food security"
 $ bodytext.y   : chr  NA NA "11 November 2017, Rome – Climate change poses an alarming threat to food systems and food security in the Pacific islands, warn"| __truncated__ "11 November 2017, Rome – Climate change poses an alarming threat to food systems and food security in the Pacific islands, warn"| __truncated__

Of course, I do not want the variables duplication such as .x and .y

Ileeo
  • 19
  • 5

1 Answers1

0

You were so close, even using cbind to create lists (not data frames or data tables!) as intermediary data types. We'll use rbind and modify the way you construct your intermediary data structures to get you to the right solution.

The steps are:

  1. Read in your data
  2. Create intermediary data tables
  3. Row bind your data tables

Read in your data & Create intermediary data tables

There is one minor change required for your code. Rather than generating a list with cbind, this solution generates a data frame using data.frame. Note the parameter stringsAsFactors is set to FALSE, to make sure your strings are of type chr instead of FACTOR.

library(pacman)
p_load(rvest, 
       stringr, 
       tidyr,
       data.table,
       plyr,
       xml2,
       selectr,
       tibble,
       purrr,
       datapasta,
       jsonlite)

# code omitted for brevity, see original post above
MEX <- data.frame(FAO_AreaName, FAO_AreaName_1, date,title, content_MEX,
                  stringsAsFactors = F)
# code omitted for brevity, see original post above
FSM <- data.frame(FAO_AreaName, FAO_AreaName_1, date,title, content_FSM,
                  stringsAsFactors = F)

Row bind your data tables

This one is simple.

> df <- rbind(MEX, FSM)
> dim(df)
[1] 4 5
> str(df, nchar.max = 30)
'data.frame':   4 obs. of  5 variables:
 $ FAO_AreaName: chr  "MEX" "MEX" "FSM" "FSM"
 $ bodytext    : chr  " \r\n\r\nLa C"| __truncated__ " \r\n\r\nSe r"| __truncated__ "11 November 2"| __truncated__ "11 November 2"| __truncated__
 $ date_format : chr  "27/11/2017" "16/11/2017" "11/11/2017" "11/11/2017"
 $ title       : chr  "México es sed"| __truncated__ "Lograr el ham"| __truncated__ "Pacific leade"| __truncated__ "Pacific leade"| __truncated__
 $ bodytext.1  : chr  " \r\n\r\nLa C"| __truncated__ " \r\n\r\nSe r"| __truncated__ "11 November 2"| __truncated__ "11 November 2"| __truncated__
Suggestions for improvement
  • Consider building an extractor function that takes in a URL and FAO area name, and outputs a data frame with 5 columns of information.
  • Learn the different row binding methods. rbind is one, and dplyr has a number of them, including bind_rows and various types of joins.
  • Remove irrelevant characters, such as in bodytext (FAO Area Name 1) the characters \r\n\r\nLa C the escape characters with a backslash can be removed.

NOTES:

Note 1

Some people will have issues with fromJSON because they are at work and there is some kind of timeout error. If that's the case, use this workaround:

# workaround in office 
download.file(news_url, destfile = "scrapedpage.html", quiet=TRUE)
news<- fromJSON("scrapedpage.html")
Note 2

Because of the method you used to extract data from the JSON data structure, you data has names. For example:

> names(date)
[1] "date_format"

If you want to override these names, you'll need to modify your code slightly to:

> MEX <- data.frame(FAO_AreaName = FAO_AreaName, FAO_AreaName_1, date,title, content_MEX,
+                   stringsAsFactors = F)
> names(MEX)
[1] "FAO_AreaName" "bodytext"     "date_format"  "title"        "bodytext.1"
Kamil
  • 392
  • 2
  • 11