-1

I have a large space-delimited .txt file (about 50 MB) structured as shown below. I want to remove the first 8 space-delimited columns from every line.

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.

Desired output (in .txt):

They do not!
They do to!
I hope so.
She okay?
...

How can I do this in Python 2.7 or 3.4 (please specify the version), in R, or with the Linux command line? Thank you!

Meilin
  • 1
    Possible duplicate of [Using awk to print all columns from the nth to the last](http://stackoverflow.com/questions/2961635/using-awk-to-print-all-columns-from-the-nth-to-the-last). It's not exactly the same question because it says 'awk' instead of 'Python', but it pretty much is the same, and the answers cover many ways of doing that with Linux command line tools. – TessellatingHeckler Nov 19 '15 at 02:30
  • 3
    In R, `sub("^[+]+ ", "", data.table::fread(filename, sep = "$", header = FALSE, select = 5)[[1L]])` works for me and might be sufficiently efficient. – Rich Scriven Nov 19 '15 at 02:46
  • 1
    @RichardScriven: post as answer? – Ben Bolker Nov 19 '15 at 02:47
  • @BenBolker - I thought about it, but it seems a bit pointless when one could just read in the entire file as a single column then `gsub` out the unwanted parts. – Rich Scriven Nov 19 '15 at 04:17

4 Answers

8

On my Linux system (Ubuntu 12.04) this works fine:

cut -f 9- -d " " tmp.tmp >newfile.out

`-f 9-` selects fields 9 onwards; `-d " "` sets the field delimiter to a single space.

I would expect this to be fast, since cut is a tool built for exactly this purpose. It could probably be done in a couple of lines of Python, though that might be a little slower; doing it in R would probably be slow and inefficient.
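For comparison, here is a minimal Python sketch of the same field selection; it should run unchanged on both 2.7 and 3.x. The helper name `drop_leading_fields` is hypothetical, not something from this thread. It assumes, as `cut -d " "` does, that every single space is a delimiter, and it passes lines without any delimiter through unchanged, which matches what `cut` does when `-s` is not given.

```python
def drop_leading_fields(line, n=8, sep=' '):
    """Return `line` without its first `n` `sep`-delimited fields.

    Hypothetical helper mirroring `cut -f 9- -d " "` when n=8.
    """
    line = line.rstrip('\n')
    if sep not in line:
        # Like cut without -s: delimiter-free lines pass through unchanged.
        return line
    # At most n splits, so the remaining text keeps its internal spacing.
    parts = line.split(sep, n)
    # Fewer than n + 1 pieces means there is no field n + 1 to keep.
    return parts[n] if len(parts) > n else ''


sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"
result = drop_leading_fields(sample)
```

To process a whole file, call the helper on each line while iterating over the open input file and write `result + '\n'` to the output file.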

Ben Bolker
  • Thank you! It works for printing the results. Could you also answer how to save the output to a .txt file? Thanks! – Meilin Nov 19 '15 at 02:39
  • 3
    @Meilin usually you just do `> filename` at the end of the command to direct the output to file – Rich Scriven Nov 19 '15 at 02:42
2

An R approach:

txt <- "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie."

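# Read the sample text line by line (for a real file, use readLines("file.txt"))
txt_obj <- readLines(textConnection(txt))
# Drop the first 8 tokens, each a run of non-spaces followed by one space
txt8 <- gsub( "^(([^ ]+[ ]){8})", "", txt_obj)
txt8
#----------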
[1] "They do not!"                                  
[2] "They do to!"                                   
[3] "I hope so."                                    
[4] "She okay?"                                     
[5] "Let's go."                                     
[6] "Wow"                                           
[7] "Okay -- you're gonna need to learn how to lie."
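The same regular expression carries over to Python almost unchanged. The sketch below is only an illustration of that pattern, not part of the R answer; the names `FIRST_EIGHT` and `strip_first_eight` are made up for the example.

```python
import re

# Eight repetitions of "run of non-space characters followed by one space",
# anchored at the start of the line (the same idea as the R pattern).
FIRST_EIGHT = re.compile(r'^(?:[^ ]+ ){8}')


def strip_first_eight(line):
    """Remove the first eight space-terminated tokens, if all eight are present."""
    return FIRST_EIGHT.sub('', line)


lines = [
    "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!",
    "L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow",
]
cleaned = [strip_first_eight(l) for l in lines]
```

A line with fewer than eight leading tokens does not match the pattern and is returned unchanged.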
IRTFM
1

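This is easy to do with a Python slice (the code works in both Python 2.7 and 3.x):

with open('in_file') as in_f, open('out_file', 'w') as out_f:
    for line in in_f:
        if line != '\n':  # skip empty lines
            # split() breaks on any whitespace; [8:] drops the first 8 tokens
            out_f.write(' '.join(line.split()[8:]) + '\n')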
Casimir Crystal
0

This removes everything up to and including the last +++ and the whitespace that follows it (the `\+` quantifier is a GNU sed extension):

sed 's/.*+++[[:blank:]]\+//' file
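As a rough Python counterpart to the sed expression, the sketch below uses `re` with the same greedy match. The names `LAST_SEPARATOR` and `text_after_last_separator` are hypothetical. Note one consequence of the greedy `.*`: if the dialogue itself contains `+++` followed by a blank, everything before that point is removed as well.

```python
import re

# Greedy ".*" backtracks to the LAST "+++" that is followed by blanks
# (space or tab), mirroring the sed expression s/.*+++[[:blank:]]\+//
LAST_SEPARATOR = re.compile(r'.*\+\+\+[ \t]+')


def text_after_last_separator(line):
    """Drop everything up to and including the last '+++' plus trailing blanks."""
    return LAST_SEPARATOR.sub('', line, count=1)


sample = "L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so."
result = text_after_last_separator(sample)
```

Lines that contain no `+++` at all are returned unchanged, just as sed leaves them alone.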
Avinash Raj