
I have a bash script that extracts log lines from a file between two timestamps. However, as the files get bigger (more than 2 GB, up to 10 GB), it takes considerably longer to complete (more than 20 minutes).

My log structure looks like this:

087B0037 08AD0056 03/09 02:40:40 [MMS:Main,INF] MMS state changed 
087B0037 096100BE 03/09 02:40:41 [Navigation,INF] CDDClient Initialize...
EndeavourDriver: 03/09/2017 02:40:42 :
00400004 047B0012 EndeavourDriver: 71 [SDIO87871]:

087B0037 0BE10002 03/10 06:40:40 [NNS:NNS,INF] Initializing NNS thread id 0x0BE10002...
087B0037 08AD0056 03/10 06:40:40 Initialized state: BITServer

My script uses the following command:

grep -a -A 1000000 "03/09" fileName.txt | grep -a -B 1000000 "03/10"

But it takes too long. If I add the time (e.g. "03/09 02:") it is faster, but the logger is not always running, so some time values might be missing. The date values are always in the third column, so I tried using awk:

awk '$3 >= "03/09" && $3 <= "03/10"' fileName.txt

But that does not collect the following lines:

EndeavourDriver: 03/09/2017 02:40:42 :
00400004 047B0012 EndeavourDriver: 71 [SDIO87871]:

I'm not too familiar with awk, sed and grep, so any suggestions would be appreciated. Perhaps something in a different language like Python would be better? Thanks

Erik
  • Those lines `EndeavourDriver: 03/09/2017 02:40:42 :` and `00400004 047B0012 EndeavourDriver: 71 [SDIO87871]:` have a linebreak between them, but you want them to be treated as a single line. Why? – RomanPerekhrest Mar 19 '17 at 18:24
  • The most efficient way is to start with a timestamp format that can be compared without parsing its elements. – karakfa Mar 19 '17 at 18:29
  • @Erik, could there be more than two lines which should be treated as a single line? – RomanPerekhrest Mar 19 '17 at 18:55
  • @RomanPerekhrest That is the way the logs are saved. I can't control that. I don't think it would make a difference with my awk command if it was in a single line? – Erik Mar 19 '17 at 18:57
  • @Erik, if it were a single line it would be much easier to parse. Otherwise the format is unclear and unexpected. – RomanPerekhrest Mar 19 '17 at 18:58
  • The most efficient approach is to not need any comparisons at all. If you need to do this repeatedly, pulling the log into a database or splitting it into smaller periods of time will quickly pay back the effort with shorter processing times (a rough sketch of the splitting idea follows below). – tripleee Oct 08 '18 at 05:25
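
A rough sketch of that splitting idea, assuming the short MM/DD date in the third field marks each record's day and that lines without one (continuation lines, blanks) belong to the most recent date seen:

awk '
$3 ~ /^[0-9][0-9]\/[0-9][0-9]$/ { d = $3; sub("/", "-", d) }  # remember the current day
d { print > ("log_" d ".txt") }                               # append the line to that day's file
' fileName.txt

Extracting a range afterwards is then just a matter of concatenating the per-day files.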

4 Answers


If your log file is in time order and you just want to extract one or two days, this might work for you:

awk '$3=="03/09"{s=1} s; $3=="03/11"{exit}' log_file

It will start printing at the first instance of 03/09 and exit at the first instance of 03/11. If the next day might not be present in the file, you can change the exit condition to `$3 > "03/10"` to make it more robust against missing dates.

The early exit speeds things up for ranges near the beginning of the file, but not for later days, since awk still has to scan everything before the range.

Also, your multi-line records may cause accidental matches; to guard against that you need to define a better record structure or fall back to costly regex matches.

Note that the line that triggers the exit is intentionally kept as the last line of the extract, so that you can inspect it for a false positive match.
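
If the full-date EndeavourDriver lines could themselves sit at a range boundary, a small variation (a sketch, not from the answer itself) checks both timestamp formats, assuming the full date, when present, is always in the second field:

awk '
$3 == "03/09" || $2 ~ /^03\/09\/[0-9][0-9][0-9][0-9]$/ { s = 1 }  # range starts on either format
s                                                                 # print while inside the range
$3 == "03/11" || $2 ~ /^03\/11\/[0-9][0-9][0-9][0-9]$/ { exit }   # stop at the first line past it
' log_file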

karakfa
  • Some lines use `$2` with the year. You can save the semi-colon by putting `s` at the end. Consider `awk '$3 == "03/11" || $2 == "03/11/2017" { exit } $3 == "03/09" || $2 == "03/09/2017" { s = 1 } s' log_file` – Adam Katz Mar 24 '17 at 18:00
  • I think those lines are continuation of the previous lines. – karakfa Mar 24 '17 at 18:25
  • I don't think so; look at the final block of text in the question – Adam Katz Mar 24 '17 at 19:04
  • I'm almost sure, but we have to agree to disagree until OP comments on this topic. – karakfa Mar 24 '17 at 19:17
0

I think you should reformat the way the logs are output so that they're in a consistent format (i.e. the timestamp is always in the first column); then your awk would work.

Otherwise, though a bit clunky, you could find the line numbers of the first and last occurrences of the dates of interest and then use sed to select that range, as sketched below.
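
A sketch of that approach, reusing fileName.txt and the dates from the question; grep -n reports line numbers, and sed's q command stops reading as soon as the range has been printed:

start=$(grep -an -m 1 '03/09' fileName.txt | cut -d: -f1)       # first line containing 03/09
end=$(grep -an '03/10' fileName.txt | tail -n 1 | cut -d: -f1)  # last line containing 03/10
sed -n "${start},${end}p;${end}q" fileName.txt                  # print the range, then quit

Finding the last end-date line still scans the whole file once, so this mainly saves the time sed would otherwise spend reading past the end of the range.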

dataflow

Try this awk solution:

cat time.awk
{
    # Standard records: short date in field 3, time in field 4
    if ($4 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $3 >= "03/09" && $3 <= "03/10")
        print $0
    # EndeavourDriver records: full date in field 2, time in field 3;
    # print the record together with its continuation line
    else if ($3 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $2 >= "03/09/2017" && $2 <= "03/10/2017") {
        print $0
        getline n
        print n
    }
    # Everything else (blank lines, out-of-range dates) becomes a blank line
    else
        print ""
}

Input file:

cat f
087B0037 08AD0056 03/09 02:40:40 [MMS:Main,INF] MMS state changed 
087B0037 096100BE 03/09 02:40:41 [Navigation,INF] CDDClient Initialize...
EndeavourDriver: 03/09/2017 02:40:42 :
00400004 047B0012 EndeavourDriver: 71 [SDIO87871]:

087B0037 0BE10002 03/10 06:40:40 [NNS:NNS,INF] Initializing NNS thread id 0x0BE10002...
087B0037 08AD0056 03/10 06:40:40 Initialized state: BITServer
087B0037 08AD0056 04/10 06:40:40 Initialized state: BITServer

Processing:

awk -f time.awk f
087B0037 08AD0056 03/09 02:40:40 [MMS:Main,INF] MMS state changed 
087B0037 096100BE 03/09 02:40:41 [Navigation,INF] CDDClient Initialize...
EndeavourDriver: 03/09/2017 02:40:42 :
00400004 047B0012 EndeavourDriver: 71 [SDIO87871]:

087B0037 0BE10002 03/10 06:40:40 [NNS:NNS,INF] Initializing NNS thread id 0x0BE10002...
087B0037 08AD0056 03/10 06:40:40 Initialized state: BITServer
VIPIN KUMAR

Have you tried limiting the number of matches, and using fgrep? That may improve processing time dramatically:

fgrep -a -m 1 -A 1000000 "03/09" fileName.txt | fgrep -a -B 1000000 "03/10"

Using fgrep instead of grep can also speed things up on its own, since fixed-string matching skips regex processing entirely.
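
For reference, the same pipeline in the modern spelling (fgrep is a deprecated alias for grep -F, i.e. fixed-string matching); it carries the same built-in assumption that no more than 1000000 lines fall inside the range:

grep -aF -m 1 -A 1000000 "03/09" fileName.txt | grep -aF -B 1000000 "03/10"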

dataflow