I want to create a data.frame with two columns, where each column itself contains multiple columns. (I need this to feed plsr in the pls package.)

It's like the oliveoil data.

> oliveoil
   chemical.Acidity chemical.Peroxide chemical.K232 chemical.K270 chemical.DK sensory.yellow sensory.green
G1           0.7300           12.7000        1.9000        0.1390      0.0030           21.4          73.4
G2           0.1900           12.3000        1.6780        0.1160     -0.0040           23.4          66.3
G3           0.2600           10.3000        1.6290        0.1160     -0.0050           32.7          53.5
G4           0.6700           13.7000        1.7010        0.1680     -0.0020           30.2          58.3
G5           0.5200           11.2000        1.5390        0.1190     -0.0010           51.8          32.5
I1           0.2600           18.7000        2.1170        0.1420      0.0010           40.7          42.9
I2           0.2400           15.3000        1.8910        0.1160      0.0000           53.8          30.4
I3           0.3000           18.5000        1.9080        0.1250      0.0010           26.4          66.5
I4           0.3500           15.6000        1.8240        0.1040      0.0000           65.7          12.1
I5           0.1900           19.4000        2.2220        0.1580     -0.0030           45.0          31.9
S1           0.1500           10.5000        1.5220        0.1160     -0.0040           70.9          12.2
S2           0.1600            8.1400        1.5270        0.1063     -0.0020           73.5           9.7
S3           0.2700           12.5000        1.5550        0.0930     -0.0020           68.1          12.0
S4           0.1600           11.0000        1.5730        0.0940     -0.0030           67.6          13.9
S5           0.2400           10.8000        1.3310        0.0850     -0.0030           71.4          10.6
S6           0.3000           11.4000        1.4150        0.0930     -0.0040           71.4          10.0
   sensory.brown sensory.glossy sensory.transp sensory.syrup
G1          10.1           79.7           75.2          50.3
G2           9.8           77.8           68.7          51.7
G3           8.7           82.3           83.2          45.4
G4          12.2           81.1           77.1          47.8
G5           8.0           72.4           65.3          46.5
I1          20.1           67.7           63.5          52.2
I2          11.5           77.8           77.3          45.2
I3          14.2           78.7           74.6          51.8
I4          10.3           81.6           79.6          48.3
I5          28.4           75.7           72.9          52.8
S1          10.8           87.7           88.1          44.5
S2           8.3           89.9           89.7          42.3
S3          10.8           78.4           75.1          46.4
S4          11.9           84.6           83.8          48.5
S5          10.8           88.1           88.5          46.7
S6          11.4           89.5           88.5          47.2

And it is a data.frame with 2 columns:

> is.data.frame(oliveoil)
[1] TRUE

> dim(oliveoil)
[1] 16  2

I tried the following code:

x = data.frame(a = c(1,2,3), b = c(1,3,4))
y = data.frame(c = c(3,4,5), d = c(5,4,2))

d = data.frame(x = x, y = y)

it returns:

> d
  x.a x.b y.c y.d
1   1   1   3   5
2   2   3   4   4
3   3   4   5   2

but I cannot call x with d$x

> d$x
NULL

what I expect is:

> d$x
  a b
1 1 1
2 2 3
3 3 4

I was expecting an argument to the data.frame function that would make this work, something like:

d = data.frame(x = x, y = y, merge.columns = F)

But I cannot find any argument that does this in the docs.

Xin Niu
  • Does this answer your question? [How do I make a list of data frames?](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames) – camille Feb 14 '20 at 04:20
  • @camille unfortunately no. what i want is a dataframe, not a list – Xin Niu Feb 14 '20 at 04:36
  • Hang on, I mean - `data.frame(x = I(as.matrix(x)), y = I(as.matrix(y)))` ? – thelatemail Feb 14 '20 at 04:39
  • @thelatemail. Yes, that works. Thanks a lot!. But as.matrix change the values in y to string. I will try if plsr works when I convert it to numeric – Xin Niu Feb 14 '20 at 04:42

1 Answer

The pls::plsr() function does not require data to be set up exactly like oliveoil. plsr() allows the response term to be a matrix, and oliveoil has a particular way of storing matrices, but you can supply any matrix to plsr().

For example, this fits a model without error:

library(pls)

n <- 100  # total number of values; gives 10 x 10 matrices
y <- matrix(rnorm(n), nrow = 10)
x <- matrix(rnorm(n), nrow = 10)

plsr(y ~ x)
# Partial least squares regression , fitted with the kernel algorithm.
# Call:
# plsr(formula = y ~ x)

Also, the yarn dataset used in the pls docs just stores regular matrices in a data frame, rather than using the I() approach of oliveoil.

For a bit more explanation:

The sub-components of oliveoil are not actually of class data.frame.
If you run str(oliveoil), you'll see the sensory and chemical objects in oliveoil are cast as AsIs objects. They're not technically data frame-classed objects, and in fact they were probably matrices with named rows and columns to begin with.

str(oliveoil)

'data.frame':   16 obs. of  2 variables:
 $ chemical: 'AsIs' num [1:16, 1:5] 0.73 0.19 0.26 0.67 0.52 0.26 0.24 0.3 0.35 0.19 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr  "G1" "G2" "G3" "G4" ...
  .. ..$ : chr  "Acidity" "Peroxide" "K232" "K270" ...
 $ sensory : 'AsIs' num [1:16, 1:6] 21.4 23.4 32.7 30.2 51.8 40.7 53.8 26.4 65.7 45 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr  "G1" "G2" "G3" "G4" ...
  .. ..$ : chr  "yellow" "green" "brown" "glossy" ...

The AsIs class means they were stored in oliveoil using the I() function (its help page is titled "Inhibit Interpretation/Conversion of Objects", and the class name reads as "as is"). I() protects an object from being converted into something else during an operation, like storage into a data frame.

You can reproduce this with a simple example (although note that if you try and store two data frames in a data frame with I() you'll get an error):

n <- 100
matrix_a <- matrix(rnorm(n), nrow = 10)
matrix_b <- matrix(rnorm(n), nrow = 10)

df <- data.frame(a = I(matrix_a), b = I(matrix_b))

str(df)

'data.frame':   10 obs. of  2 variables:
 $ a: 'AsIs' num [1:10, 1:10] -0.817 -0.233 -1.987 0.523 -1.596 ...
 $ b: 'AsIs' num [1:10, 1:10] 1.9189 -0.7043 0.0624 0.0152 -0.5409 ...

And df now contains matrix_a as $a and matrix_b as $b:

df$a
            [,1]        [,2]       [,3]        [,4]        [,5]        [,6]
 [1,] -0.8167554 -0.61629222  0.3673423  1.30882012  0.97618868 -0.53124825
 [2,] -0.2329451  0.08556506 -0.5839086  0.86298000  1.20452166  0.09825958
 [3,] -1.9873738 -0.93537922  0.1057309  0.63585036 -1.09604531  1.33080572
 [4,]  0.5227912  1.89505993  1.1184905  1.20683770 -0.02431886 -1.15878634
# ...
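
The same idea applied to the small x and y data frames from the question might look like the sketch below, which follows thelatemail's comment. It assumes every column is numeric, so as.matrix() returns a numeric matrix; if a block contains a non-numeric column, as.matrix() falls back to a character matrix, and data.matrix() is one base R alternative worth trying (it coerces everything to numeric, which may or may not be what you want).

```r
# Data frames from the question
x <- data.frame(a = c(1, 2, 3), b = c(1, 3, 4))
y <- data.frame(c = c(3, 4, 5), d = c(5, 4, 2))

# Convert each block to a matrix and wrap it in I(), so that
# data.frame() keeps it as a single matrix column
d <- data.frame(x = I(as.matrix(x)), y = I(as.matrix(y)))

dim(d)          # two columns, like oliveoil
colnames(d$x)   # the original column names survive inside the matrix
is.numeric(d$y) # check that the values were not turned into strings
```

With this layout d$x is no longer NULL; it returns the whole a/b block as one matrix.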

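Note that passing a bare matrix to data.frame() does not keep it together: data.frame(a = matrix_a) expands the matrix into separate columns (a.1, a.2, ...), which is the same flattening seen in the question. To store a matrix as a single column without I(), assign it to an existing data frame instead:

# also works: assignment keeps each matrix as one column
df2 <- data.frame(foo = letters[1:10])
df2$a <- matrix_a
df2$b <- matrix_b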

TL;DR - plsr() takes any matrix, but if you want your data stored in a data frame, create a matrix and save it into a data frame, with or without I().
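
To tie the TL;DR together, here is a minimal sketch of the whole workflow: build two matrices, store them in a data frame with I(), and pass that data frame to plsr() through its data argument. The names predictors, responses, dat and n_obs are made up for illustration, the data are random noise, and the model call is skipped if pls is not installed.

```r
set.seed(1)
n_obs <- 20

# Hypothetical predictor and response blocks (random noise)
predictors <- matrix(rnorm(n_obs * 5), nrow = n_obs)
responses  <- matrix(rnorm(n_obs * 2), nrow = n_obs)

# Each matrix becomes a single column of the data frame
dat <- data.frame(x = I(predictors), y = I(responses))
dim(dat)

# Fit only when the pls package is available
if (requireNamespace("pls", quietly = TRUE)) {
  fit <- pls::plsr(y ~ x, ncomp = 2, data = dat)
  print(fit)
}
```

The formula refers to the matrix columns by name, in the same way sensory ~ chemical would for oliveoil.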

andrew_reece