Well it has finally happened. My Google-fu has failed me. Please help...

I have a batch file that goes through a directory and gets information from comic archives (.cbz files).

It generates a CSV file with the title, number of pages, resolution of the last page, size of the archive, and name of the artist.

This all works fine except for the resolution. I can get the resolution without a problem, but extracting the last page only works if the files inside the archive are named a specific way: files are named Page 000 upward, and I count the number of files and subtract 1. If the naming deviates (first page is Page 801 and last is Page 868), it fails to extract the page because I am telling it to extract Page 067 instead of Page 868.

So I figured that if I just get the actual name of the last page, I am golden.

I am trying to grep the last filename in a zip file by using:

7z l filename | grep -o -P Page\s[0-9]{3}\..*(?!Page\s[0-9]{3}\..*)

But that gives me all the filenames.

Here is the output I am trying to grep:

7-Zip [64] 9.38 beta  Copyright (c) 1999-2014 Igor Pavlov  2015-01-03

Listing archive: Christian Knockers {Pages 0801-0868} [Dark Lord].cbz

--
Path = Christian Knockers {Pages 0801-0868} [Dark Lord].cbz
Type = zip
Physical Size = 224551692

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2020-11-19 15:51:25 ....A      3589432      3589432  Page 801.png
2020-11-19 16:09:29 ....A      3455981      3455981  Page 802.png
2020-11-26 14:48:47 ....A      3017353      3017353  Page 803.png
2020-11-26 15:02:27 ....A      3627637      3627637  Page 804.png
2020-11-26 15:13:05 ....A      3212321      3212321  Page 805.png
<snip>
2021-03-19 15:37:49 ....A      3106721      3106721  Page 864.png
2021-03-19 15:37:19 ....A      2619460      2619460  Page 865.png
2021-03-19 15:37:21 ....A      3063014      3063014  Page 866.png
2021-03-19 15:36:38 ....A      2423233      2423233  Page 867.png
2021-03-19 15:36:41 ....A      2908774      2908774  Page 868.png
------------------- ----- ------------ ------------  ------------------------
2021-03-19 15:38:54          224542422    224542422  68 files

Kernel  Time =     0.015 =   18%
User    Time =     0.000 =    0%
Process Time =     0.015 =   18%    Virtual  Memory =      3 MB
Global  Time =     0.084 =  100%    Physical Memory =      7 MB

I am getting better and better at regex, but the only groups I have used are capturing groups. Everything I googled keeps pointing to negative lookahead, but I am not having any luck with it.

Any help is appreciated!

Jack P.

1 Answer

Use

grep -zoP '(?s)Page\s[0-9]{3}\.\w+(?!.*Page\s[0-9]{3}\.\w+)' file

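-z makes grep treat the input as NUL-separated instead of newline-separated, so the whole listing (which contains no NUL bytes) is read as a single line. (?s) lets the dot match line breaks as well, so the lookahead can scan across the remaining lines.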
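
To see the command in action, here is a minimal sketch that runs it against a shortened listing. The `listing` variable is a hypothetical stand-in for the real `7z l` output, and the sketch assumes GNU grep built with PCRE support (`-P`). One detail worth knowing: with `-z`, grep also ends each match with a NUL byte instead of a newline, so the sketch strips it with `tr` before storing the result.

```shell
# Hypothetical, shortened stand-in for the output of `7z l archive.cbz`.
listing='2020-11-19 15:51:25 ....A      3589432      3589432  Page 801.png
2021-03-19 15:36:38 ....A      2423233      2423233  Page 867.png
2021-03-19 15:36:41 ....A      2908774      2908774  Page 868.png'

# -z reads the whole listing as one record; the negative lookahead rejects
# every page name that is followed, anywhere later, by another page name.
# tr removes the NUL byte that grep -z appends to each match.
last_page=$(printf '%s\n' "$listing" \
  | grep -zoP '(?s)Page\s[0-9]{3}\.\w+(?!.*Page\s[0-9]{3}\.\w+)' \
  | tr -d '\0')

echo "$last_page"
```

In the real script, the `printf` stage would be replaced by the `7z l` call on the archive.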

EXPLANATION

--------------------------------------------------------------------------------
  Page                     'Page'
--------------------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
  [0-9]{3}                 any character of: '0' to '9' (3 times)
--------------------------------------------------------------------------------
  \.                       '.'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    .*                       any character (0 or more times matching the most amount possible)
--------------------------------------------------------------------------------
    Page                     'Page'
--------------------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
    [0-9]{3}                 any character of: '0' to '9' (3 times)
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of look-ahead
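
A simpler alternative, not the lookahead approach explained above, is to print every page name and keep only the final one. The sketch below is an assumption-laden illustration: `listing` is a hypothetical stand-in for the `7z l` output, the pattern uses `-E` instead of `-P`, and the `sort` step assumes the three-digit page numbers sort correctly as text, which also covers archives whose entries are not stored in page order.

```shell
# Hypothetical stand-in for `7z l archive.cbz` output, deliberately out of order.
listing='2021-03-19 15:36:38 ....A      2423233      2423233  Page 867.png
2021-03-19 15:36:41 ....A      2908774      2908774  Page 868.png
2020-11-19 15:51:25 ....A      3589432      3589432  Page 801.png'

# -o prints each page name on its own line; sort orders them by name,
# and tail keeps only the last (highest-numbered) one.
last_page=$(printf '%s\n' "$listing" \
  | grep -oE 'Page [0-9]{3}\.[[:alnum:]_]+' \
  | sort \
  | tail -n 1)

echo "$last_page"
```

This avoids lookaheads entirely, at the cost of two extra processes in the pipeline.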
Ryszard Czech