
Background:

(1) Here is what I extract from a huge ASCII file of around 700 MB:

0, 0, 0, 0, 0, 0, 0, 0, 3.043678e-05, 3.661498e-05, 2.070347e-05,
    2.47175e-05, 1.49877e-05, 3.031176e-05, 2.12128e-05, 2.817522e-05,
    1.802658e-05, 7.192285e-06, 8.467806e-06, 2.047874e-05, 9.621194e-05,
    4.467542e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000421869,
    5.0003081213, 0.0001938675, 8.70334e-05, 0.0002973858, 0.0003385935,
    8.763598e-05, 2.743326e-05, 0, 0.0001043894, 3.409237e-05, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;

(2) I would like to do two tasks:

(2.1) Find the maximum among the numbers separated by commas and semicolons.

It is 5.0003081213 in the above extracted lines.

(2.2) Find the largest 4 (say) values among the lines.

They are 5.0003081213, 0.000421869, 0.0003385935 and 0.0002973858 in the above extracted lines.


My thought:

(3) I expect to do the work with Perl.

(4) I think that I can match the numbers with ([0-9.e-]+).


My Problem:

(5) However, I am new to Perl and Unix, and I do not know how to proceed to find the maximum values.

(6) I searched similar questions for half a day and found that I might make use of List::Util. I do not know whether it is an appropriate choice for my problem, and I actually do not know how this module can be used.
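From what I have read so far, List::Util's max simply takes a list and returns the largest element. Here is a tiny sketch of how it might combine with the pattern from (4); the sample line is made-up toy data, not my real file:

use strict;
use warnings;
use List::Util qw(max);

my $line = '0, 0, 3.043678e-05, 0.000421869, 5.0003081213;';  # toy data for illustration
my @numbers = $line =~ /([0-9.e-]+)/g;                        # the pattern from (4)
print 'max = ', max(@numbers), "\n";                          # prints: max = 5.0003081213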

(7) Say the numbers are contained in a file named input.txt. May I know if it is possible to finish the tasks with a one-line script?

Thanks for your understanding, and I appreciate your help very much.


Further Question raised:

Thanks to many warm replies and help from Stack Overflow users, I got the above question solved. However, suppose I would like to find the maximum only from Line 3 to Line 6 of the following data:

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.193129938e-07, 0, 0, 0, 0, 0, 0,
    0, 2.505016514e-05, 4.835713883e-05, 6.128770648e-05, 1.38018881e-05, 2.303402101e-05,
    0, 0, 0, 0, 3.5838803e-05, 0.000104883779, 0, 0, 1.813278467e-05, 0.0001350646297,
    0.0007846746908, 0.001728603877, 0.001082733652, 0.001511217708, 0.0009537032505,
    0.0004436753321, 0.002182536356, 0.0005719495782, 9.055173127e-05, 1.245663419e-05,
    0.0004568318755, 0.0003056741688, 3.186642459e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0.000101613512, 5.451410965e-05, 0, 0, 0, 0, 0.001172270099, 7.088900819e-05, 0,
    1.848198352e-06, 0.0006870109246, 0.00276857581, 0.002038545509, 0.001111047938,
    0.0007607533934, 0.0007915864957, 0.001105735631, 0.001456989534, 0.0007245351113,
    0.0004262289031, 0.0003041285247, 0.0001528418892, 2.332078749e-05, 9.695149464e-05,
    1.004024021e-07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

That is,

0, 0, 0, 0, 3.5838803e-05, 0.000104883779, 0, 0, 1.813278467e-05, 0.0001350646297,
    0.0007846746908, 0.001728603877, 0.001082733652, 0.001511217708, 0.0009537032505,
    0.0004436753321, 0.002182536356, 0.0005719495782, 9.055173127e-05, 1.245663419e-05,
    0.0004568318755, 0.0003056741688, 3.186642459e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Then, how can I modify the script grep -o '[0-9e.-]*' file | sort -rg | head -1 to achieve this purpose?

I know that the sed command can work on specific lines of a file with an address range such as 3,6p. So I am wondering if I can modify the above script by adding something like that; a sketch of what I have in mind is below. I appreciate your help again.
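For illustration only, this is the kind of modification I imagine (untested), assuming sed -n '3,6p' prints just lines 3 to 6 before the rest of the pipeline runs:

# keep only lines 3-6, then reuse the number-extracting pipeline from above
sed -n '3,6p' input.txt | grep -o '[0-9e.-]*' | sort -rg | head -1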

nam
7 Answers


I would use a combination of grep and sort:

grep -o '[0-9e.-]*' file | sort -rg | head -N
  • The command grep -o '[0-9e.-]*' (using the character class suggested in the question) extracts all the numbers in the file.
  • Then sort -g sorts numerically, taking exponential notation into account; with -r we reverse the results so that the largest values come first.
  • Finally, head -N keeps the top N values.

Top value:

$ grep -o '[0-9e.-]*' file | sort -rg | head -1
5.0003081213

Top 4:

$ grep -o '[0-9e.-]*' file | sort -rg | head -4
5.0003081213
0.000421869
0.0003385935
0.0002973858
fedorqui 'SO stop harming'
    If not bound to perl and the input is not extremely huge, I'd go for this solution too instead of mine. – Dmitry Egorov May 29 '15 at 10:29
    @fedorqui Thank you. It works amazingly! I have extracted a part (http://sprunge.us/XCeD) of my huge ascii file. It works with only one script! Thanks again :) – nam May 29 '15 at 11:03
  • @fedorqui May I know, if I would like to make use of `grep -o '[0-9e.-]*' file | sort -rg | head -1` on specific lines of a file only, say `Line 123` to `Line 456`, what I should do? I know there is a similar function in `sed` but I do not know how to apply it in this case. I appreciate your help so much. – nam Jun 01 '15 at 11:30
  • @nam I don't understand what exactly you mean here. Could you update your question providing more insight? – fedorqui 'SO stop harming' Jun 01 '15 at 11:34
  • @fedorqui I have updated the question, and I hope that I have formulated my problem clearly. – nam Jun 01 '15 at 16:39
  • @nam aaah now I understand what you meant. I would just add a command before the `grep`. Something like `awk 'NR==3,NR==6' file`, or any variant you can find in [How can I extract a range of lines from a text file on Unix?](http://stackoverflow.com/q/83329/1983854). All together it would be `awk 'NR==3,NR==6' file | grep -o '[0-9e.-]*' | sort -rg | head -N` – fedorqui 'SO stop harming' Jun 01 '15 at 21:26
  • @fedorqui I appreciate your help again. Your script saves me a lot of time. I do not need to extract the content from the huge file with `sed` first; I can just apply your script. Thanks again :) – nam Jun 02 '15 at 02:31

awk can work with numbers - even in scientific notation. Setting the record separator RS to "[,\n;]" makes every comma-, semicolon- or newline-delimited token its own record, so $0 holds a single number on each iteration. You can use the following script to get the maximum:

awk '{m=(m>$0)?m:$0}END{print m}' RS="[,\n;]" input.file
hek2mgl
  • @fedorqui Which version of `awk` are you using? Which OS? – hek2mgl May 29 '15 at 08:45
  • I wasn't sure if the `;` was just a typo in the question, that's why I didn't add it (and removed it from the input file). I should have mentioned that. – hek2mgl May 29 '15 at 08:47
  • @hek2mgl Thanks so much. I am able to find the maximum with the neat script you provided. The `;` is not a typo. It appears in my file where it marks a shift to another chemical. – nam May 29 '15 at 08:51
  • For the `2.2` version, I would say `awk '{val[$0]=$0} END{n=asort(val); for (i=n-1; i>(n-1)-4; i--) print val[i]}' RS="[,\n;]" file`. – fedorqui 'SO stop harming' May 29 '15 at 08:52
  • @fedorqui Seems working and looks great! Can you post this as an answer? (Looks like I will not find something better at the moment) – hek2mgl May 29 '15 at 08:55
  • Feel free to use it! You did find the key point here by using this specific `RS`, so get all the credit :) – fedorqui 'SO stop harming' May 29 '15 at 08:56
  • @fedorqui Seems the script omitted `5.0003081213` as one of the candidates. – nam May 29 '15 at 08:56
    @nam not to me. In my case it returns 5.0003081213, 0.000421869, 0.0003385935 and 0.0002973858. – fedorqui 'SO stop harming' May 29 '15 at 08:57
  • @nam It is working for me too. Are we all working with the *same* input? – hek2mgl May 29 '15 at 08:58
  • @nam it can be a matter of changing the starting point of the `for` loop. I used `n-1` since I assumed it starts by 0, so it has n-1 elements. You can play around that and see what is best. – fedorqui 'SO stop harming' May 29 '15 at 09:02
  • @nam This is my terminal: https://i.na.cx/cj4465.png I used a chinese website, wow!! Don't ask me what I've clicked there!! :) I need to be AFK for an urgent task, I'll have a look here later again. – hek2mgl May 29 '15 at 09:14

This solution is very verbose and assumes you already know how to get the data into the program. There is no need to find the numbers with a regex. You can just split on commas, get a list and sort it by size.

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'max';

# I'm assuming you already have that data in one line in a variable
my $data = qq{0, 0, 0, 0, 0, 0, 0, 0, 3.043678e-05, 3.661498e-05, 2.070347e-05, 2.47175e-05, 1.49877e-05, 3.031176e-05, 2.12128e-05, 2.817522e-05, 1.802658e-05, 7.192285e-06, 8.467806e-06, 2.047874e-05, 9.621194e-05,4.467542e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000421869,    5.0003081213, 0.0001938675, 8.70334e-05, 0.0002973858, 0.0003385935,8.763598e-05, 2.743326e-05, 0, 0.0001043894, 3.409237e-05, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;};

# remove the semicolon
chop $data;

# split to a list on comma and possible whitespace
my @numbers = split /,\s*/, $data;

# this is from List::Util
say 'Max: ' . max(@numbers);

# sort numerical and grab the highest 4
say $_ for ( reverse sort { $a <=> $b } @numbers )[ 0 .. 3 ];
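
If the data is not hard-coded but sits in a file, one possible way to slurp it into $data is sketched below (the filename input.txt is illustrative, and this only makes sense if the whole file fits in memory):

# read the entire file into $data instead of hard-coding the string above
my $data = do {
    open my $fh, '<', 'input.txt' or die "Cannot open input.txt: $!";
    local $/;    # undefine the input record separator => read everything at once
    <$fh>;
};
$data =~ s/\n//g;    # join the wrapped lines; the rest of the script then applies unchanged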
simbabque
  • Thanks. But I do not know how to get a file of a few million lines into `$mydata`, and thus I have adopted another of the solutions. Thank you very much for your help too. – nam May 29 '15 at 11:10

I understand your question to mean that you want to filter numbers out of your huge input file. So splitting at delimiters is not sufficient; instead you need to extract the numbers with a regex.

This is my attempt:

use strict;
use warnings;

my(@numbers);
while (my $line = <>) {
    while($line =~ m|([-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)|g) {
        push @numbers, $1;
    }
}
@numbers = sort { $b <=> $a } @numbers;

print "largest value:\n  $numbers[0]\n";
print "next four numbers: \n  " . join("\n  ",@numbers[1..4]) . "\n";

It's not a one-liner, but it is perhaps easier to read.

Use it like this: perl findNumbers.pl input.txt where findNumbers.pl is the script as above.

leu
  • Thank you for your help too. It will take some time for me, a beginner, to learn from your script. – nam May 29 '15 at 11:08

UPDATE: The one-liner:

perl -nle 'foreach (split(",|;")) { $_ += 0; @top_n = sort {$b <=> $a} ($_, @top_n); pop @top_n if @top_n > 4; } END { print foreach @top_n; }' input.txt

Nam, the other solutions are just fine and, I believe, have already helped you in solving your problem. However, they don't take the huge input into account. Even leu's solution implies storing the entire array in memory and performing the sort operation against all these hundreds of megabytes. That said, I totally support leu's idea of not redefining the input record separator and of reading line by line. This really helps when processing huge files.

There are only about 5 lines of actual code. The rest are comments which will help you understand what's going on behind the scenes and hopefully help you learn a bit of Perl.

#!/usr/bin/perl -nl

# 0) The -n from above would make the script read the input line by line
# and the -l parameter would automatically strip off any newline chars
# from input and add a newline to every output line

# 1.1) So, the -n parameter made perl read a line from STDIN and place it
# into $_ variable for you. The following code (excluding the END{} block)
# is executed for every input line.
# 1.2) split() takes this $_ string and breaks it into a series of numbers
# (technically still sub-strings), returning the series as an array
# 1.3) Then foreach loops through this array placing each array's item into
# $_ again. (NB. Yes, we're losing the previous $_'s value which was an input
# string but we don't care about it any longer since we've already processed
# it with split().)
foreach (split(",|;")) {

    # 2) Ensure it's stored internally as a number by adding zero to it.
    # This would save us a bit of conversion when sorting values and also
    # make final output nicer. Still, you'll get what you want if you
    # comment the following line out.
    $_ += 0;

    # 3.1) Compose a new array by adding the current value ($_) to what
    # we already have (@top_n). The new array is "($_, @top_n)". It's OK
    # if @top_n has nothing in it or even undefined so far, perl will
    # define and initialise it with an empty array when it encounters
    # the @top_n variable first time. (Note: we should better use -w
    # perl command line parameter and define @top_n explicitly beforehand
    # but I'm omitting it here for the sake of simplicity.)
    # 3.2) Then sort the new array. The "$b <=> $a" expression will make
    # it sorted in descending order.
    @top_n = sort {$b <=> $a} ($_, @top_n);

    # 3.3) Finally, throw away the last item (pop does this) if our top-N
    # array has grown beyond the length of interest (4 in this example).
    # This helps keep our script's memory consumption reasonably low.
    # Without doing this we'd end up with several hundred megabytes
    # in memory which would require sorting.
    pop @top_n if @top_n > 4;
}

# 4) This block is only executed once, after all the input file is read and
# processed.
END {
    # 4.1) Here our good old foreach iterates over the @top_n array, storing
    # the current value in $_ on each iteration.
    # 4.2) Called without parameters, print() outputs the value
    # of the $_ variable. Remember, it also adds a newline to the output
    # - we told it to do so by adding -l in the very first line of the
    # script.
    print foreach @top_n;
}

Usage: perl top_n.pl input.txt, provided top_n.pl is the script name.

Dmitry Egorov
  • Thank you for your reminder about working with a huge file. I am at the stage of working on part of the big file and will refer to your kind advice later. I appreciate you so much. – nam May 29 '15 at 11:05

If you really want to use a one-line script, you can use this to get the largest value:

$/=undef;print "largest: " .(sort {$b <=> $a} split /,/ , scalar <> =~ tr/\n ;//rd)[0] . "\n";

And this to get the four largest values:

$/=undef;print join ("," , (sort {$b <=> $a} split /,/ , scalar <> =~ tr/\n ;//rd)[0..3]) . "\n";

Save one of these lines into a file, say sort.pl, and execute cat /path/to/input.txt | perl /path/to/sort.pl. Setting $/ to undef makes <> slurp the whole input at once; tr/\n ;//rd returns a copy of that string with newlines, spaces and semicolons deleted (the /r modifier needs Perl 5.14 or newer); the result is then split on commas, sorted numerically in descending order, and the top value (or the top four) is taken.

Although it does what it should, it is not the prettiest solution.

lamchob
  • Thank you for your help; I learned how to write a simple script that processes another file. – nam May 29 '15 at 11:06

From a Perl perspective, it is useful to know that $/ is the record separator. By default it's a linefeed, but you can set it to anything you like.

Looking at your sample data, I'd therefore say:

#!/usr/bin/perl

use strict;
use warnings;
use List::Util qw ( max );

$/ = ';';

while (<>) {
    s/;//g;
    my @lines = split("\n");
    s/\s+//g;
    my $block_max = max( split(",") );
    last unless defined $block_max;
    print $block_max, "\n";

    my @top;
    foreach my $line (@lines) {
        $line =~ s/\s+//g;
        my @numbers = split( ",", $line );
        my $max_num = max(@numbers);
        if ( defined $max_num ) { push( @top, $max_num ) }
    }

    print "Top 5:\n";
    print join( "\n", ( sort { $b <=> $a } (@top) )[ 0 .. 4 ] );
}

What we do is:

  • iterate your file based on ;.
  • Split on \n to get some lines.
  • split on , to get individual values.
  • use max on the block - print that.
  • use max on each line, stuff that in an array @top.
  • print the sorted first 5 elements from @top.

Then move on to the next ; delimited 'chunk'.

To extend this - based on your original file - you can include a regex to extract the numbers.

E.g.

my @numbers = m/[\de+.-]+/g;

Because of the way Perl handles regular expressions in list context, this will 'match' all the chunks that fit this particular 'format'. (Of course, if someone includes ee-44 in the file, that'll match too.)

I would suggest - don't go looking for one-liners. It's a false economy. It is far better to have a script that you can write out, comment and actually understand later, than a compact block of text where no one can tell what's going on in 12 months' time.

Sobrique