6

I need to process data that is being provided in a password protected Excel (xlsx) workbook. For legal reasons, I cannot create an unprotected Excel file or a csv file etc and process from there. None of the Excel import packages can deal with password protected workbooks.

From this answer Import password-protected xlsx workbook into R I have managed to extract the data. However it is imported in a list of character lists format. The dput of my list looks like this:

list(list("ID", "ID1", "ID2"),
     list("V2", NULL, "text2"),
     list("Name", "John Smith", "Mary Brown"),
     list("Score", 1, 2),
     list("email", "JS@gmail.com", "MB@gov.uk"))

What I want is a dataframe with columns ID, V2 etc that looks like this:

   ID    V2     Name        Score  email
   ID1   NULL   John Smith  1      JS@gmail.com
   ID2   text2  Mary Brown  2      MS@gov.uk

There are empty cells in the original Excel workbook, so solutions with unlist will not work.

Using a combination of answers from R list to data frame and other similar questions, I have the following code (where listform is the name of the list):

matform <- as.matrix(sapply(listform, function(s) s)) # retains empty
df <- data.frame(matform[2:nrow(matform),])
names(df) = matform[1,]

This is close, but the dataframe has lists as columns. So str(df) yields:

'data.frame':   2 obs. of  5 variables:
 $ ID:List of 2
  ..$ : chr "ID1"
  ..$ : chr "ID2"
 $ V2:List of 2
  ..$ : NULL
  ..$ : chr "text2"
and so on
Community
  • 1
  • 1
JenB
  • 16,759
  • 2
  • 13
  • 42

3 Answers3

3

First step: extract the names:

names = lapply(listform, `[[`, 1)
data = setNames(lapply(listform, `[`, -1), names)

Second step: unlist data and bind columns

result = as.data.frame(sapply(data, unlist))

The magic here happens in sapply, which internally calls simplify2array. as.data.frame at the end is needed to make a data.frame out of the resulting matrix, and to assign proper data types to the individual columns.

The above code has got one problem though: the column types probably aren’t what you want. This can be fixed as follows:

col_classes = sapply(lapply(listform, `[[`, 2), typeof)
result = as.data.frame(sapply(data, unlist), stringsAsFactors = FALSE)
for (col in seq_len(ncol(result)))
    class(result[[col]]) = col_classes[col]

Now you’ll get the following result:

> str(result)
'data.frame':   2 obs. of  5 variables:
 $ ID   : chr  "ID1" "ID2"
 $ V2   : chr  "text1" "text2"
 $ Name : chr  "John Smith" "Mary Brown"
 $ Score: num  1 2
 $ email: chr  "JS@gmail.com" "MB@gov.uk"

And this, I think, is what you want.

Konrad Rudolph
  • 482,603
  • 120
  • 884
  • 1,141
  • Thanks, I just tried this and it doesn't seem to work. Am I supposed to be using my list (listform) where you have data or my datafram (df)? Either way I didn't seem to get a sensible dataframe out. – JenB Aug 20 '15 at 11:54
  • @JenB Yes, sorry. Use `listform` instead of `data`, I’ll amend the answer. This code definitely works with your example data. – Konrad Rudolph Aug 20 '15 at 11:56
  • This gets the correct structure but unfortunately has the problem that it can't deal with empty cells in the Excel workbook (which is a problem with unlist). I will have a go with the `function(s) s` in the sapply, whcih I think enters NULL for the empties. – JenB Aug 20 '15 at 12:06
  • @JenB `function (s) s` does nothing at all, actually. It’s the same as `identity`, and applying the identity isn’t a meaningful operation. You can just call `simplify2array(listform)` directly, instead of calling `sapply` for this effect. However, a better solution would be to fill the empty elements before applying my code. Incidentally, the `as.matrix` in your code is also a redundant null operation. – Konrad Rudolph Aug 20 '15 at 12:08
  • You are correct, made no difference :) So I will have to investigate adding some prior code to fill the lists with NA – JenB Aug 20 '15 at 12:25
  • Thanks, I really wanted this one to work but ended up accepting the other because it also solved my NULL problem – JenB Aug 20 '15 at 15:09
3

"SetDT" from the "data.table" package seems to be very powerful:

> library(data.table)

> null2na <- function(x){ ifelse(is.null(x),NA,x)}

> f <- function(x){sapply(x,null2na)}

> L <- list(list("ID", "ID1", "ID2"),
+           list("V2", NULL, "text2"),
+           list("Name", "John Smith", "Mary Brown"),
+           list("S ..." ... [TRUNCATED] 

> L <- setDT(L)[, lapply(.SD, f)]

> setnames(L,colnames(L),unlist(L[1,]))

> L <- L[-1,]

> L
    ID    V2       Name Score        email
1: ID1    NA John Smith     1 JS@gmail.com
2: ID2 text2 Mary Brown     2    MB@gov.uk

> str(L)
Classes ‘data.table’ and 'data.frame':  2 obs. of  5 variables:
 $ ID   : chr  "ID1" "ID2"
 $ V2   : chr  NA "text2"
 $ Name : chr  "John Smith" "Mary Brown"
 $ Score: chr  "1" "2"
 $ email: chr  "JS@gmail.com" "MB@gov.uk"
 - attr(*, ".internal.selfref")=<externalptr> 
> 

(A data table is a better data frame.)

The function "f" does two jobs: It "unlist"s and turns NULL into NA.

mra68
  • 2,930
  • 1
  • 7
  • 17
2

Here's one way using data.table v1.9.5 (for the transpose() function):

require(data.table) # v1.9.5+
setDT(sapply(ll, function(x) setattr(transpose(x[-1L]), 'names', x[[1L]])))[]
#     ID    V2       Name Score        email
# 1: ID1    NA John Smith     1 JS@gmail.com
# 2: ID2 text2 Mary Brown     2    MB@gov.uk

Use setDF() instead of setDT() to return a data.frame instead.

Arun
  • 108,644
  • 21
  • 263
  • 366