perl text-processing (in particular when loading files)

Question

Loading files and sorting columns is usually easy in shell with a combination of grep, cut, sed, awk & so on.

However, when I have to do it in Perl, I often end up doing long and painful things with many splits, one after another, plus regexes, and the result is dirty code that looks something like this:

open $FH, "<", $file;
@file = <$FH>;
close $FH;
foreach $line (@file) {
    ( $foo, $bar, $some, $thing) = ( split(/,/, $line) )[3,8,9,15];
    ( $new_some ) = (split(/-/, $some))[2];
    ($new_foo = $foo) =~ s/xx//;
    $uc_bar = uc($bar);
    # and so on.....
}

Isn't there a more elegant way of doing such things (splitting fields, replacing patterns, etc.)? Or a "quicker" way (not necessarily elegant)?

Also, is there a way to load just the required part of the file at loading time (that is, filtering before loading instead of loading everything into memory)?

There's no "magic bullet", you have to write the logic you want. If you have a more specific set of requirements you want to clean up, people may have suggestions. However, in a oneliner the [-a option](https://perldoc.pl/perlrun#-a) may help. — Grinnz, Feb 13 '19 at 17:46
Well, specifying indices `3,8,9,15` ... ? How are you going to do that "elegantly" in the shell? You have to write what you want done, but that can also be done in a number of ways and some are far nicer than others. Also, for some jobs there are ready solutions. So, what do you want to do? — zdim, Feb 13 '19 at 17:51
Note, the Perl code you show will not be "_easy_" with "_a combination_" of those 4 ("_& so on_") tools, two of which are themselves programming languages. — zdim, Feb 13 '19 at 17:52
Autosplit mode (`-a`) in combination with `-l`, `-n` and often `-F` is about as simple as using awk. See `man perlrun` for details on those options. — Shawn, Feb 13 '19 at 18:14
The first `split()` indicates that your input data may be CSV. So the first change would be not to write [yet another faulty CSV parser](https://stackoverflow.com/a/24950812/8866606) and use [Text::CSV](https://metacpan.org/pod/Text::CSV) instead. — Stefan Becker, Feb 13 '19 at 19:20
Why does your code read the file into memory instead of using the standard Perl idiom `while (<$FH>) { ... }`? Better yet, if you have only one input file to process, why don't you implement your code as a filter from `STDIN` to `STDOUT`, i.e. `while (<STDIN>) {`? — Stefan Becker, Feb 13 '19 at 19:22
Thanks for your answers. I was "thinking" that this kind of Perl code was dirty, but looking at your comments it looks like there's nothing really wrong in my example (apart from the fact that I'm loading the whole file into memory), which is a little bit disappointing (I was thinking that I could do simpler things in Perl), but I am glad that my code is not so dirty after all. Thank you all for your comments. — olivierg, Feb 13 '19 at 20:26
OK, fair enough, thanks for the feedback. Look, the code you show can be written more "nicely" (it has far too many parens, for instance) -- but the main thing here is that your example has no well-defined task to put together a "nicer" code snippet for. On the other hand, try splitting a string and extracting some fields in C++, for instance. Perhaps you just aren't enjoying Perl's syntax, with all the @$% (not swearing!) ... ? :) — zdim, Feb 13 '19 at 21:44

score 2 · Accepted Answer · answered Feb 13 '19 at 19:17

Elegance is subjective, but I can answer at least one of your questions, and suggest some things that might shorten or improve your code.

"is there a way to load just the required part of the file at loading time" - in the code you showed, I don't see the need to load the entire file into memory. The typical pattern for processing files line by line, and the equivalent of what Perl's -n and -p switches do, is:

open my $fh, '<', $file or die "$file: $!";
while (<$fh>) {          # reads line into $_
    my @fields = split;  # splits $_ on whitespace, like awk
    my ($foo, $bar, $some, $thing) = @fields[3,8,9,15];
    ...
}
close $fh;

I consider that fairly elegant, but based on what you're writing I guess you're comparing that to oneliners of piped commands that fit within maybe 100 characters. Perl can do that too: as the comments have already mentioned, have a look at the switches -n, -p, -a, -F, and -i. If you show some concrete examples of things you want to do, you'll probably get some replies showing how to do it shorter with Perl.
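To make those switches concrete, here is a rough sketch of what a one-liner such as `perl -F, -lane 'print join ",", uc($F[1]), $F[3] =~ s/xx//r' file.txt` does when written out as a script. The sample data, the field positions, the file name `file.txt`, and the `process_line` helper are all made up for illustration, and `s///r` assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: roughly the per-line work that -l, -a and -F, set up.
sub process_line {
    my ($line) = @_;
    chomp $line;                          # -l strips the line ending
    my @F = split /,/, $line;             # -a with -F, autosplits into @F
    my ($foo, $bar) = @F[3, 1];           # array slice picks fields by index
    my $new_foo = $foo =~ s/xx//r;        # /r returns the changed copy
    return join ',', uc($bar), $new_foo;
}

# Made-up sample input, read through an in-memory filehandle.
my $data = "a,b,c,xxd\ne,f,g,hxx\n";
open my $fh, '<', \$data or die "in-memory open: $!";
while (my $line = <$fh>) {                # -n wraps the code in this loop
    print process_line($line), "\n";      # -l adds the newline back on print
}
close $fh;
```

Only one line is held in memory at a time, so the same loop should work unchanged on a large file opened by name.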

But if you're going to be doing more, then it's usually better to expand that into a script like the one above. IMHO putting things into a script gives you more power: it's not ephemeral like the command-line history, it's more easily extensible, it's easier to use modules, and you can add command-line options, process multiple files, and so on. Just for example, with the following snippet, you get all the power of Text::CSV - support for quoting, escaping, multiline strings, etc.

use Text::CSV;
my $csv = Text::CSV->new({binary=>1, auto_diag=>2, eol=>$/});
open my $fh, '<', $file or die "$file: $!";
while ( my $row = $csv->getline($fh) ) {
    ...
    $csv->print(select, $row);
}
$csv->eof or $csv->error_diag;
close $fh;

You might also want to check out that module's csv function, which provides a lot of functionality in a single function call. If you still think that's all too "painful" and "dirty" and you'd rather do stuff with less code, then there are a few shortcuts you could take. For example, you can slurp a whole file into memory with `my $data = do { local (*ARGV, $/) = [$file]; <> };`, or do the same as the -i command-line switch:

local ($^I, @ARGV) = ('.bak', $file);
while (<>) {
    # s///; or @F=split; or whatever
    print;  # prints $_ back out
}
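As for the csv function mentioned above, a minimal sketch of reading everything into a list of rows might look like this. The sample data and column names are invented, and the scalar reference passed as `in` stands in for a file name; check the Text::CSV documentation for the options your installed version supports.

```perl
use strict;
use warnings;
use Text::CSV qw(csv);

# Made-up sample data; one quoted field contains the separator.
my $data = qq{name,city\n"Doe, Jane",Paris\nBob,Oslo\n};

# in => \$data parses the string; a file name such as "data.csv" also works.
# headers => "auto" uses the first line as keys, giving an array of hashes.
my $rows = csv(in => \$data, headers => 'auto');

for my $row (@$rows) {
    print "$row->{name} lives in $row->{city}\n";
}
```

Note that this reads all rows into memory at once, so for very large files the getline loop shown earlier is probably the better fit.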

One thing I like about Perl is that it lets you express yourself in lots of different ways, whether you want to hack together a really short script to take care of a one-time task or write a big OO project. TIMTOWTDI (there's more than one way to do it).

thank you for your answer, this will help me :), the comma was just to give an example separator, i don't intend to use CSV specifically, but it will help for sure. Also, i didn't know that i could use that @fields[1,2,3] syntax, ty — olivierg, Feb 13 '19 at 20:28
