31

I have a CSV file with two header rows, the first row I want to be the header, but the second row I want to discard. If I do the following command:

data <- read.csv("HK Stocks bbg.csv", header = T, stringsAsFactors = FALSE)

The first row becomes the header and the second row of the file becomes the first row of my data frame:

  Xaaaaaaaaa       X X.1     Xbbbbbbbbbb     X.2 X.3
1         Date PX_LAST  NA         Date PX_LAST  NA
2   31/12/2002  38.855  NA   31/12/2002  19.547  NA
3   02/01/2003  38.664  NA   02/01/2003  19.547  NA
4   03/01/2003  40.386  NA   03/01/2003  19.547  NA
5   06/01/2003  40.386  NA   06/01/2003  19.609  NA
6   07/01/2003  40.195  NA   07/01/2003  19.609  NA

I want to skip this second row of the CSV file and just get

  X1.HK.Equity       X X.1 X2.HK.Equity     X.2 X.3
2   31/12/2002  38.855  NA   31/12/2002  19.547  NA
3   02/01/2003  38.664  NA   02/01/2003  19.547  NA
4   03/01/2003  40.386  NA   03/01/2003  19.547  NA
5   06/01/2003  40.386  NA   06/01/2003  19.609  NA
6   07/01/2003  40.195  NA   07/01/2003  19.609  NA

I tried data <- read.csv("HK Stocks bbg.csv", header = T, stringsAsFactors = FALSE, skip = 1) but that returns:

        Date PX_LAST  X     Date.1 PX_LAST.1 X.1
1 31/12/2002  38.855 NA 31/12/2002    19.547  NA
2 02/01/2003  38.664 NA 02/01/2003    19.547  NA
3 03/01/2003  40.386 NA 03/01/2003    19.547  NA
4 06/01/2003  40.386 NA 06/01/2003    19.609  NA
5 07/01/2003  40.195 NA 07/01/2003    19.609  NA
6 08/01/2003  40.386 NA 08/01/2003    19.547  NA

The header row comes from the second line of my CSV file, not the first line.

Thank you.

smci
  • 26,085
  • 16
  • 96
  • 138
mchangun
  • 8,274
  • 18
  • 64
  • 94
  • 4
    You can try two step approach. Read the data starting from the 3rd line without header in the first step. Read only the first line as a character vector and then add the character vector as names for the data. – djhurio Apr 07 '13 at 07:56
  • I checked your code and this line works for me: df – Andrii Oct 02 '17 at 13:42
  • 1
    I'm confused why this question is marked as a duplicate of a question that was asked a year later, shouldn't it be the other way round? – Adrian Tompkins Jun 07 '20 at 09:30

3 Answers3

56

This should do the trick:

all_content = readLines("file.csv")
skip_second = all_content[-2]
dat = read.csv(textConnection(skip_second), header = TRUE, stringsAsFactors = FALSE)

The first step using readLines reads the entire file into a list, where each item in the list represents a line in the file. Next, you discard the second line using the fact that negative indexing in R means select all but this index. Finally, we feed this data to read.csv to process it into a data.frame.

Paul Hiemstra
  • 56,833
  • 11
  • 132
  • 142
  • Thanks for your reply. The last line `dat = read.csv(skip_second, header = TRUE, stringsAsFactors = FALSE)` gives me an error `Error in file(file, "rt") : invalid 'description' argument`. How can I get read.csv to accept a variable instead of a file path? – mchangun Apr 07 '13 at 08:39
  • 2
    Use `textConnection` in addition. – Paul Hiemstra Apr 07 '13 at 09:03
  • 6
    As a heads up Paul, this approach worked brilliantly with smaller files (less than 5MB), but had trouble with larger files. I asked a question on it and provided an answer after getting it working nicely on larger files here: http://stackoverflow.com/questions/24921387/long-lag-time-importing-large-csvs-in-r-with-header-in-second-row – Nathaniel Payne Jul 23 '14 at 22:02
5

You can strip the first line(s) after the header directly from the dataframe, to allow you to do this in one line:

df<-read.csv("test.txt",header=T)[-1,]

if my datafile "test.txt" is the following:

var1, var2
units1, units2
2.3,6.8
4.5,6.7

this gives me

> read.csv("test.txt",header=T)[-1,]
var1 var2
2  2.3  6.8
3  4.5  6.7

This answers your question exactly, but just to generalize the answer, you can also skip the Nth to the Mth lines in this way:

df<-read.csv("test.txt",header=T)[-N:-M,]

where N and M are integers of course.


Note: This method will convert all columns into factor.

str(read.csv("test.csv", header = TRUE)[-1,])
# 'data.frame': 2 obs. of  2 variables:
#   $ var1: Factor w/ 3 levels "2.3","4.5","units1": 1 2
#   $ var2: Factor w/ 3 levels " units2","6.7",..: 3 2
zx8754
  • 42,109
  • 10
  • 93
  • 154
Adrian Tompkins
  • 4,261
  • 1
  • 23
  • 56
1

On Linux (or Mac) you can take advantage of being able to use linux commands in data.table::fread, so

data.table::fread("sed -e '2d' myfile.txt", data.table = F)

will skip the second line.

daknowles
  • 1,325
  • 12
  • 18