I have a simple tsv file with the following structure:

0 - headerline
1 - empty line
2 - PIG schema
3 - empty line
4 - 1-st line of DATA
5 - 2-nd line of DATA

I would like to read it, possibly using readr::read_tsv but here is the problem.

As you can see, the first row contains the headers. The next three rows (1 to 3) are ones I do NOT want to read (they contain some super weird data coming from Apache PIG), and the data starts at row 4. In Pandas, I would do something like

df = pd.read_csv('/localpath/data.tsv', sep='\t', skiprows=[1,2,3])

which allows me to read the headers AND to skip row one, two, three.

I don't see a similar option in readr::read_tsv. That is, the closest I can write is:

df = read_tsv('/localpath/data.tsv', col_names = TRUE, skip = 4)

which skips the header line too, so the headers are not parsed...

Any ideas?

demongolem
ℕʘʘḆḽḘ

1 Answer

Posting my comment as an answer. Basically, we read in the first row as our header, and then read in the remaining rows as the data:

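library(readr)
# Read only the first line (the header) as a one-row data frame
names_t <- read_tsv('/localpath/data.tsv', col_names = FALSE, n_max = 1)
# Skip lines 0-3 (header, empty line, PIG schema, empty line) and read the data
df1 <- read_tsv('/localpath/data.tsv', col_names = FALSE, skip = 4)
# Flatten the one-row header into a character vector before assigning it
names(df1) <- as.character(unlist(names_t, use.names = FALSE))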

Note that in my comment I specified nrows = 1 to read in the names (this would work for read.csv), but the equivalent argument in readr::read_tsv is n_max.
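If you want something closer to the pandas skiprows = [1, 2, 3] call, here is a minimal sketch using only base R rather than readr. The helper name read_skipping_rows and the demo file contents are made up for illustration; only the layout (header, empty line, PIG schema, empty line, data) comes from the question. It reads all lines, drops the unwanted ones by their 0-based index, and parses what is left.

```r
# Hypothetical helper (not part of readr or base R): read a TSV while
# dropping specific 0-based line indices, similar in spirit to pandas' skiprows.
read_skipping_rows <- function(path, skip_rows) {
  lines <- readLines(path)
  # Convert 0-based indices to R's 1-based indices and keep everything else
  keep <- setdiff(seq_along(lines), skip_rows + 1)
  read.delim(text = lines[keep], sep = "\t", header = TRUE,
             stringsAsFactors = FALSE)
}

# Demo file with the layout from the question; the contents are invented
demo_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname",
             "",
             "pig: (id:int, name:chararray)",
             "",
             "1\talpha",
             "2\tbeta"),
           demo_path)

df <- read_skipping_rows(demo_path, skip_rows = c(1, 2, 3))
print(df)
```

This loads the whole file into memory as text before parsing, so for very large files the two-pass read_tsv approach above is probably the better fit.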

bouncyball