
I have JSON strings inside a dataframe column. I want to expand each JSON string into new columns of that dataframe.

# Input
JsonID <- as.factor(c(1,2,3))
JsonString1 = "{\"device\":{\"site\":\"Location1\"},\"tags\":{\"Engine Pressure\":\"150\",\"timestamp\":\"2608411982\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"2608411982\"}"
JsonString2 = "{\"device\":{\"site\":\"Location2\"},\"tags\":{\"Engine Pressure\":\"160\",\"timestamp\":\"3608411983\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"3608411983\"}"
JsonString3 = "{\"device\":{\"site\":\"Location3\"},\"tags\":{\"Brake Fluid\":\"100\",\"timestamp\":\"4608411984\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"4608411984\"}"
JsonStrings = c(JsonString1, JsonString2, JsonString3)
Example <- data.frame(JsonID, JsonStrings)

Using the jsonlite library I can turn each JSON string into a one-row dataframe.

library(jsonlite)

# One row dataframes
DF1 <- data.frame(fromJSON(JsonString1))
DF2 <- data.frame(fromJSON(JsonString2))
DF3 <- data.frame(fromJSON(JsonString3))

Unfortunately the JsonID column is lost. All the JSON strings share common column names such as "time", but there are also column names they don't share. By pivoting the data longer I could rbind() all the dataframes together, as sketched after the code below.

library(dplyr)
library(tidyr)

# Row bindable one row dataframes
DF1_RowBindable <- DF1 %>%
  rename_all(~gsub("tags.", "", .x, fixed = TRUE)) %>%  # strip the literal "tags." prefix
  tidyr::pivot_longer(cols = c(colnames(.)[2]))
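
In principle the same reshaping could be applied to each one-row dataframe and the results bound back together. A rough sketch of that idea (make_bindable() is just a hypothetical helper sketched here to wrap the steps above, not existing code):

# Hypothetical helper wrapping the reshaping shown above
make_bindable <- function(df) {
  df %>%
    rename_all(~gsub("tags.", "", .x, fixed = TRUE)) %>%    # strip the literal "tags." prefix
    tidyr::pivot_longer(cols = c(colnames(.)[2]))            # pivot the non-shared tag column
}

# Bind the reshaped frames; .id adds an identifier comparable to JsonID
AllRows <- dplyr::bind_rows(lapply(list(DF1, DF2, DF3), make_bindable), .id = "JsonID")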

Is there a better way to do this?

I have never worked with JSON strings before. The solution must be computationally scalable.

Brad

1 Answer


We can store the output of fromJSON as a list-column in the dataframe itself, so we don't lose any information we already have in the data, and then use unnest_wider to create new columns from the named lists.

library(dplyr)
library(tidyr)
library(jsonlite)

Example %>%
  rowwise() %>%                                   # work one row (one JSON string) at a time
  mutate(data = list(fromJSON(JsonStrings))) %>%  # parse the string and keep the result as a list-column
  unnest_wider(data) %>%                          # spread the top-level elements (device, tags, online, time)
  select(-JsonStrings) %>%                        # the raw string is no longer needed
  unnest_wider(tags) %>%                          # spread the nested tags list
  unnest_wider(device)                            # spread the nested device list

# JsonID site      `Engine Pressure` timestamp  historic adhoc `Brake Fluid` online time      
#  <fct>  <chr>     <chr>             <chr>      <lgl>    <lgl> <chr>         <lgl>  <chr>     
#1 1      Location1 150               2608411982 FALSE    FALSE NA            TRUE   2608411982
#2 2      Location2 160               3608411983 FALSE    FALSE NA            TRUE   3608411983
#3 3      Location3 NA                4608411984 FALSE    FALSE 100           TRUE   4608411984

Since the nested lists in data, tags and device have different lengths, unnest_wider has to be applied to each of them separately.
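
If rowwise() turns out to be a bottleneck on larger data, the same unnesting should also work without it by parsing the strings up front and attaching the results as a list-column. A rough, untested sketch of that variant (same libraries as above):

# Parse every JSON string once; as.character() guards against factor columns
parsed <- lapply(as.character(Example$JsonStrings), fromJSON)

Example %>%
  mutate(data = parsed) %>%   # attach the parsed results as a list-column
  select(-JsonStrings) %>%
  unnest_wider(data) %>%      # same unnesting steps as above
  unnest_wider(tags) %>%
  unnest_wider(device)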

Ronak Shah
  • Azure Databricks throws errors on rowwise(). How can we perform the same operation using Spark/sparklyr tools, while avoiding bringing the data into the Databricks node, which exhausts the memory? – Brad Feb 22 '21 at 01:23
  • Unfortunately, I don't have any experience with Spark so I am not sure how to translate this. Maybe you can ask a new question about solving this specific problem in Spark. – Ronak Shah Feb 22 '21 at 01:25
  • Would there be a way to obtain the same results using dplyr's transmute(), while avoiding rowwise()? That would be Spark-friendly. – Brad Feb 26 '21 at 08:38
  • Here is a small example where transmute is used. Individual unnested columns are manually named. Unfortunately I cannot afford to manually name all the unnested columns. https://therinspark.com/data.html – Brad Feb 26 '21 at 08:39