dplyr : how to read a tsv file with headers while skipping some lines?

Question

I have a simple tsv file with the following structure:

0 - headerline
1 - empty line
2 - PIG schema
3 - empty line
4 - 1-st line of DATA
5 - 2-nd line of DATA

I would like to read it, possibly using readr::read_tsv, but here is the problem.

As you can see, the first row contains the headers. Then I have three rows that I do NOT want to read (they contain some super weird data coming from Apache PIG), and the data starts at row 4. In Pandas, I would do something like

df = pd.read_csv('/localpath/data.tsv', sep='\t', skiprows=[1,2,3])

which allows me to read the headers AND to skip rows one, two, and three.

I don't see a similar option in readr::read_tsv. The closest I can get is:

df = read_tsv('/localpath/data.tsv', col_names = TRUE, skip = 4)

which does not parse the headers, because the header line itself is skipped too...

Any ideas?

Maybe read in the first row as a separate object, and then read in the remaining rows? — bouncyball, Nov 17 '16 at 14:01
how would you code that? are there other alternatives with other packages? I want to reduce playing with the data as much as possible — ℕʘʘḆḽḘ, Nov 17 '16 at 14:02
@rawr it does not seem `skip` allows a list of rows though? `skip integer: the number of lines of the data file to skip before beginning to read data.` — ℕʘʘḆḽḘ, Nov 17 '16 at 14:12
yes that's true http://stackoverflow.com/questions/15860071/read-csv-header-on-first-line-skip-second-line — rawr, Nov 17 '16 at 14:27
thanks, but the solution provided there fails with large data. I think I'll have to go with the two-step process... — ℕʘʘḆḽḘ, Nov 17 '16 at 14:28

bouncyball · Accepted Answer · 2016-11-17T14:54:57.137

Posting my comment as an answer. Basically, we read in the first row as our header, and then read in the remaining rows as the data:

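library(readr)
# Read only the first line to capture the column names
names_t <- read_tsv('/localpath/data.tsv', col_names = FALSE, n_max = 1)
# Skip the header and the three unwanted lines, then read the data
df1 <- read_tsv('/localpath/data.tsv', col_names = FALSE, skip = 4)
# unlist() turns the one-row tibble into a plain vector of names
names(df1) <- unlist(names_t, use.names = FALSE)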

Note that in my comment I specified nrows = 1 to read in the names (this would work for read.csv), but it appears that this argument is replaced by n_max in readr::read_tsv.
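
For comparison, here is a minimal sketch of the same two-step idea in base R, using read.delim with nrows = 1 as mentioned above. The temporary file and its contents (the id/name/score columns and the schema line) are made up for illustration and only mimic the layout described in the question. Adjust skip if your file has a different number of unwanted lines.

```r
# Hypothetical sample file mirroring the layout in the question:
# header, blank line, schema line, blank line, then the data rows.
path <- tempfile(fileext = ".tsv")
writeLines(c("id\tname\tscore",
             "",
             "(id:int,name:chararray,score:double)",
             "",
             "1\talice\t3.5",
             "2\tbob\t4.0"), path)

# Step 1: read only the first line, as text, to get the column names.
header <- read.delim(path, header = FALSE, nrows = 1,
                     colClasses = "character")

# Step 2: skip the header plus the three unwanted lines and read the data.
df <- read.delim(path, header = FALSE, skip = 4,
                 stringsAsFactors = FALSE)

# Step 3: attach the names; unlist() flattens the one-row data frame
# into a plain character vector.
names(df) <- unlist(header, use.names = FALSE)

str(df)
```

Both calls stream the file rather than loading every line into memory with readLines first, which may matter for the large files mentioned in the comments.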
