
I was given some usage-statistics data: months of data split across multiple TSV files with no header:

07-01-2017_01.tsv
07-01-2017_02.tsv
07-02-2017_01.tsv
07-02-2017_02.tsv
07-03-2017_01.tsv
07-03-2017_02.tsv
07-04-2017_01.tsv
07-04-2017_02.tsv
07-04-2017_03.tsv

I would like to merge the data per day, add a header, and export it as CSV. I managed to do it with the following code for one day, but I was wondering whether there is a way to automate it so that I don't need to run the code for each day of the month.

  data_part1 <-  read.delim("~/07-01-2017_01.tsv", header = FALSE, sep = "\t", quote = "", stringsAsFactors=FALSE)
  data_part2 <-  read.delim("~/07-01-2017_02.tsv", header = FALSE, sep = "\t", quote = "", stringsAsFactors=FALSE)
  data_merged <- rbind(data_part1, data_part2)
  names(data_merged) <-
    c(
      "post_visid_high",
      "post_visid_low",
      "quarterly_visitor",
      "visid_timestamp",
      "visid_type",
      "visit_keywords",
      "visit_num",
      "visit_page_num",
      "visit_ref_domain",
      "visit_ref_type",
      "visit_referrer",
      "visit_search_engine",
      "visit_start_page_url",
      "visit_start_pagename",
      "visit_start_time_gmt"
    )
  write.csv(data_merged, "~/07-01-2017_merged.csv")

Expected Output

07-01-2017_merged.csv
07-02-2017_merged.csv
07-03-2017_merged.csv
07-04-2017_merged.csv
Efe
    Read all the data [into a list of data frames, as explained here](https://stackoverflow.com/a/24376207/903061). That answer also shows how to combine the list into a single data frame. – Gregor Thomas Dec 12 '17 at 17:15
  • @Gregor Hi Gregor, great documentation! My only issue with those examples is that I would end up with one giant file that I cannot process. While merging files, I still need to keep the day split. So rather than `07-01-2017_01.tsv, 07-01-2017_02.tsv, 07-02-2017_01.tsv, 07-02-2017_02.tsv`, I want to have `07-01-2017.tsv, 07-02-2017.tsv` – Efe Dec 12 '17 at 21:14

2 Answers


You can do something like this:

setwd("path")

temp = list.files(pattern = "\\.tsv$")  # list.files takes a regex, not a glob

allCsv <- data.frame()

for (file in seq_along(temp)) {
  currFile <- temp[file]
  # there is no read.tsv; use read.delim with a tab separator
  k <- read.delim(currFile, header = FALSE, sep = "\t",
                  quote = "", stringsAsFactors = FALSE)
  allCsv <- rbind(allCsv, k)
}

colnames(allCsv) =c(
                       "post_visid_high",
                       "post_visid_low",
                       "quarterly_visitor",
                       "visid_timestamp",
                       "visid_type",
                       "visit_keywords",
                       "visit_num",
                       "visit_page_num",
                       "visit_ref_domain",
                       "visit_ref_type",
                       "visit_referrer",
                       "visit_search_engine",
                       "visit_start_page_url",
                       "visit_start_pagename",
                       "visit_start_time_gmt"
                     )

write.csv(allCsv, "outputPath")
DataTx
  • But wouldn't I end up with one giant file when I do it this way? I don't think I have the processing power for it; that's why I wanted to merge them per day: `07-01-2017_01.csv`, `07-02-2017_01.csv`, `07-03-2017_01.csv`. – Efe Dec 12 '17 at 18:18

You just need to adapt the list solution to your specific case. A for loop works fine:

all_files = list.files(pattern = "\\.tsv$")
output_path = "/path/to/output_dir/"
cnames =  c(
      "post_visid_high",
      "post_visid_low",
      "quarterly_visitor",
      "visid_timestamp",
      "visid_type",
      "visit_keywords",
      "visit_num",
      "visit_page_num",
      "visit_ref_domain",
      "visit_ref_type",
      "visit_referrer",
      "visit_search_engine",
      "visit_start_page_url",
      "visit_start_pagename",
      "visit_start_time_gmt"
    )

all_files = data.frame(fn = all_files, date = sub(pattern = "_[0-9]+\\.tsv", replacement = "",  x = all_files))

for (d in unique(all_files$date)) {
  data_list = lapply(all_files$fn[all_files$date == d],
    read.delim, header = FALSE, sep = "\t",
    quote = "", stringsAsFactors = FALSE
  )
  merged_data = do.call(rbind, data_list)
  names(merged_data) = cnames
  write.csv(merged_data, paste0(output_path, d, "_merged.csv"))
}

If you want to speed this up, use `library(data.table)` and replace `read.delim` with `fread`, `do.call(rbind, data_list)` with `rbindlist(data_list)`, and `write.csv` with `fwrite`.
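For reference, a minimal sketch of that data.table variant, reusing the same grouping idea and the `cnames` vector defined in the answer above (untested against your files; adjust `output_path` to your folder):

```r
library(data.table)

# Same per-day grouping as the base-R loop above; fread/fwrite and
# rbindlist are data.table's faster equivalents.
all_files <- data.frame(fn = list.files(pattern = "\\.tsv$"))
all_files$date <- sub("_[0-9]+\\.tsv", "", all_files$fn)
output_path <- "/path/to/output_dir/"

for (d in unique(all_files$date)) {
  data_list <- lapply(all_files$fn[all_files$date == d], fread,
                      header = FALSE, sep = "\t", quote = "")
  merged_data <- rbindlist(data_list)
  setnames(merged_data, cnames)  # cnames as defined above
  fwrite(merged_data, paste0(output_path, d, "_merged.csv"))
}
```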

Gregor Thomas
  • Thanks a lot, I will try it immediately. In order to locate the correct files, would I need to change the first line to `all_files = list.files(path = "C://Clickstream Data/Aug14", pattern = "\\.tsv$")` – Efe Dec 12 '17 at 21:48
  • Great. You don't really need to tell me the filepath to your files - feel free to tweak on your own. (You may also prefer to just set your working directory to that folder). – Gregor Thomas Dec 12 '17 at 21:55
  • Hi Gregor - When I ran the code this morning, I got stuck at a `+` prompt, as R is expecting more input from the line of code. – Efe Dec 14 '17 at 21:56
  • I had an extra `)`, deleted it now. Make sure all your `{}`, `()` and `[]` pairs match. – Gregor Thomas Dec 14 '17 at 22:04
  • Good morning Gregor - I just checked the code for matching `()`, `{}`, and `[]`, then added `path` to the first line, and it looks clean now. But I'm still stuck at that same point when I run the code. I took a screenshot: https://prnt.sc/hnz1sj. I think something is wrong with this line; `all_files = data.frame(fn = all_files, date = sub(pattern = "_[0-9]+\\.tsv", replacement = "", x = all_files))` – Efe Dec 15 '17 at 17:08
  • You are missing the `}` at the bottom that ends the for loop. It's the last code line in my answer, `}` on a line by itself. – Gregor Thomas Dec 15 '17 at 17:11