read.csv, header on first line, skip second line

Question

I have a CSV file with two header rows, the first row I want to be the header, but the second row I want to discard. If I do the following command:

data <- read.csv("HK Stocks bbg.csv", header = T, stringsAsFactors = FALSE)

The first row becomes the header and the second row of the file becomes the first row of my data frame:

  Xaaaaaaaaa       X X.1     Xbbbbbbbbbb     X.2 X.3
1         Date PX_LAST  NA         Date PX_LAST  NA
2   31/12/2002  38.855  NA   31/12/2002  19.547  NA
3   02/01/2003  38.664  NA   02/01/2003  19.547  NA
4   03/01/2003  40.386  NA   03/01/2003  19.547  NA
5   06/01/2003  40.386  NA   06/01/2003  19.609  NA
6   07/01/2003  40.195  NA   07/01/2003  19.609  NA

I want to skip this second row of the CSV file and just get

  X1.HK.Equity       X X.1 X2.HK.Equity     X.2 X.3
2   31/12/2002  38.855  NA   31/12/2002  19.547  NA
3   02/01/2003  38.664  NA   02/01/2003  19.547  NA
4   03/01/2003  40.386  NA   03/01/2003  19.547  NA
5   06/01/2003  40.386  NA   06/01/2003  19.609  NA
6   07/01/2003  40.195  NA   07/01/2003  19.609  NA

I tried data <- read.csv("HK Stocks bbg.csv", header = T, stringsAsFactors = FALSE, skip = 1) but that returns:

        Date PX_LAST  X     Date.1 PX_LAST.1 X.1
1 31/12/2002  38.855 NA 31/12/2002    19.547  NA
2 02/01/2003  38.664 NA 02/01/2003    19.547  NA
3 03/01/2003  40.386 NA 03/01/2003    19.547  NA
4 06/01/2003  40.386 NA 06/01/2003    19.609  NA
5 07/01/2003  40.195 NA 07/01/2003    19.609  NA
6 08/01/2003  40.386 NA 08/01/2003    19.547  NA

The header row comes from the second line of my CSV file, not the first line.

Thank you.

You can try two step approach. Read the data starting from the 3rd line without header in the first step. Read only the first line as a character vector and then add the character vector as names for the data. — djhurio, Apr 07 '13 at 07:56
I'm confused why this question is marked as a duplicate of a question that was asked a year later, shouldn't it be the other way round? — Adrian Tompkins, Jun 07 '20 at 09:30

Paul Hiemstra · Accepted Answer · 2013-04-07T13:48:50.227

56

This should do the trick:

all_content = readLines("file.csv")
skip_second = all_content[-2]
dat = read.csv(textConnection(skip_second), header = TRUE, stringsAsFactors = FALSE)

The first step using readLines reads the entire file into a list, where each item in the list represents a line in the file. Next, you discard the second line using the fact that negative indexing in R means select all but this index. Finally, we feed this data to read.csv to process it into a data.frame.

edited Apr 07 '13 at 13:48

answered Apr 07 '13 at 08:12

Paul Hiemstra

56,833
11
132
142

Thanks for your reply. The last line `dat = read.csv(skip_second, header = TRUE, stringsAsFactors = FALSE)` gives me an error `Error in file(file, "rt") : invalid 'description' argument`. How can I get read.csv to accept a variable instead of a file path? – mchangun Apr 07 '13 at 08:39
2

Use `textConnection` in addition. – Paul Hiemstra Apr 07 '13 at 09:03
6

As a heads up Paul, this approach worked brilliantly with smaller files (less than 5MB), but had trouble with larger files. I asked a question on it and provided an answer after getting it working nicely on larger files here: http://stackoverflow.com/questions/24921387/long-lag-time-importing-large-csvs-in-r-with-header-in-second-row – Nathaniel Payne Jul 23 '14 at 22:02

score 5 · Answer 2 · edited Aug 10 '18 at 09:00

You can strip the first line(s) after the header directly from the dataframe, to allow you to do this in one line:

df<-read.csv("test.txt",header=T)[-1,]

if my datafile "test.txt" is the following:

var1, var2
units1, units2
2.3,6.8
4.5,6.7

this gives me

> read.csv("test.txt",header=T)[-1,]
var1 var2
2  2.3  6.8
3  4.5  6.7

This answers your question exactly, but just to generalize the answer, you can also skip the Nth to the Mth lines in this way:

df<-read.csv("test.txt",header=T)[-N:-M,]

where N and M are integers of course.

Note: This method will convert all columns into factor.

str(read.csv("test.csv", header = TRUE)[-1,])
# 'data.frame': 2 obs. of  2 variables:
#   $ var1: Factor w/ 3 levels "2.3","4.5","units1": 1 2
#   $ var2: Factor w/ 3 levels " units2","6.7",..: 3 2

score 1 · Answer 3 · answered Oct 10 '18 at 04:56

1

On Linux (or Mac) you can take advantage of being able to use linux commands in data.table::fread, so

data.table::fread("sed -e '2d' myfile.txt", data.table = F)

will skip the second line.

answered Oct 10 '18 at 04:56

daknowles

1,325
12
18

read.csv, header on first line, skip second line

3 Answers3

Linked

Related