I am trying to redact dollar amounts in some PDFs using C#. Below are the patterns I have tried:

@"/ (\d)(?= (?:\d{ 3})+(?:\.|$))| (\.\d\d ?)\d *$/ g"
@"(?<=each)(((\d*[,|.]\d{2,3}))*)"
@"(?<=each)(((\d*[,|.]\d{2,3}))*)"
@"\d+\.\d{2}"

Here are some test cases that it needs to match

76,249.25
131,588.00
7.09
21.27
420.42
54.77
32.848
3,056.12
0.009
0.01
32.85
2,948.59
$99,249.25
$9.0000
$1,800.0000
$1,000,000

Here are some test cases that it should not target

666-257-6443
F1A 5G9
Bolt, Locating, M8 x 1.25 x 30 L
Precision Washer, 304 SS, 0.63 OD x 0.31
Flat Washer 300 Series SS; Pack of 50
U-SSFAN 0.63-L6.00-F0.75-B0.64-T0.38-SC5.62
U-CLBUM 0.63-D0.88-L0.875
U-WSSS 0.38-D0.88-T0.125
U-BGHK 6002ZZ - H1.50
U-SSCS 0.38-B0.38
6412K42
Std Dowel, 3/8" x 1-1/2" Lg, Steel
2019.07.05
2092-002.0180
SHCMG 0.25-L1.00
280160717

Please note that the C# portion interfaces with iText 7 pdfSweep.

Guid g = Guid.NewGuid(); // new Guid() would always yield the all-zero GUID

CompositeCleanupStrategy strategy = new CompositeCleanupStrategy();

string guid = g.ToString();
string input = @"C:\Users\JM\Documents\pdftest\61882 _280011434 (1).pdf";
string output = @"C:\Users\JM\Documents\pdftest\61882 _2800011434 (1) x2" + guid + ".pdf";

// Whole-line amounts only: optional $, comma-grouped digits, optional decimals
string regex = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";

strategy.Add(new RegexBasedCleanupStrategy(regex));

PdfDocument pdf = new PdfDocument(new PdfReader(input), new PdfWriter(output));
PdfAutoSweep autoSweep = new PdfAutoSweep(strategy);
autoSweep.CleanUp(pdf);
pdf.Close();

Please share your wisdom

  • The problem is that there is no exact definition of what separates a money amount from other numbers. Comparing the listed non-targets with the target examples, a couple would be indistinguishable to any regex. A regex won't cut this. Without an example document, further analysis isn't really possible. –  Jul 24 '19 at 13:14
  • The list of things you don't want to match contains numbers in the format x.xx, which are also things you do want to match; you need another, more complex rule to determine which is the case. – Alex K. Jul 24 '19 at 13:14
  • Correct me if I am wrong: you want to match lines that contain only one figure, prefixed with `$` or not, using commas as thousands separators and a dot for decimals. – palvarez Jul 24 '19 at 13:15

1 Answer

You may use

^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$

See the regex demo

If you need to support any currency char, not just $, replace \$ with \p{Sc}.

Details

  • ^ - start of string
  • \$? - an optional dollar symbol
  • [0-9]{1,3} - one to three digits
  • (?:,[0-9]{3})* - any 0 or more repetitions of a comma and then three digits
  • (?:\.[0-9]+)? - an optional sequence of a dot and then any 1 or more digits
  • $ - end of string.

C# check for a match:

if (Regex.IsMatch(str, @"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$")) 
{
    // there is a match
}
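
As a quick sanity check, the pattern above can be run against a few of the sample strings from the question. This is only a minimal sketch: the class name `MoneyPatternCheck` and its members are made up for illustration, and it tests isolated strings rather than text extracted from a PDF. It also includes the `\p{Sc}` variant mentioned earlier.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to illustrate the pattern from the answer.
public static class MoneyPatternCheck
{
    // Pattern from the answer: anchored, so it accepts only whole strings.
    public const string Pattern = @"^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";

    // Variant with \p{Sc} (any Unicode currency symbol) instead of a literal dollar sign.
    public const string AnyCurrencyPattern = @"^\p{Sc}?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";

    public static bool IsAmount(string s) => Regex.IsMatch(s, Pattern);

    public static bool IsAnyCurrencyAmount(string s) => Regex.IsMatch(s, AnyCurrencyPattern);

    public static void Main()
    {
        // A subset of the samples listed in the question.
        string[] targets = { "76,249.25", "7.09", "0.009", "$9.0000", "$1,000,000" };
        string[] nonTargets = { "666-257-6443", "2019.07.05", "6412K42", "280160717", "U-SSCS 0.38-B0.38" };

        foreach (var s in targets)
            Console.WriteLine($"{s,-20} {IsAmount(s)}");
        foreach (var s in nonTargets)
            Console.WriteLine($"{s,-20} {IsAmount(s)}");

        // Euro sign: rejected by the dollar-only pattern, accepted by the \p{Sc} variant.
        Console.WriteLine(IsAmount("\u20AC1,000.50"));
        Console.WriteLine(IsAnyCurrencyAmount("\u20AC1,000.50"));
    }
}
```

Every string in the first group should be accepted and every string in the second rejected. An amount written without thousands separators, such as 1234.56, is rejected as well, because `[0-9]{1,3}` allows at most three digits before the first comma group.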

pdfSweep notice:

Apply the fix from [this answer](https://stackoverflow.com/a/52704442/3832970). The point is that the line breaks are lost when pdfSweep parses the text. Once that is fixed, the regex you need is

@"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$"

where (?m) makes ^ and $ match at the start and end of each line, and \r? is required because, in .NET regex, $ only matches before LF, not before CRLF.
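
The effect of `\r?` can be seen on a plain string with CRLF line endings. The following is a sketch under assumptions: the sample text is invented, the class and method names are hypothetical, and it uses only `System.Text.RegularExpressions`, so it says nothing about how pdfSweep itself delivers line breaks.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class comparing the pattern with and without \r?.
public static class MultilineCheck
{
    public const string WithoutCr = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$";
    public const string WithCr = @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$";

    // Returns every match, with a trailing CR (consumed by \r?) trimmed off.
    public static string[] FindAmounts(string text, string pattern) =>
        Regex.Matches(text, pattern)
             .Cast<Match>()
             .Select(m => m.Value.TrimEnd('\r'))
             .ToArray();

    public static void Main()
    {
        // Invented sample text with Windows-style (CRLF) line endings.
        string text = "Flat Washer\r\n$1,800.0000\r\n2019.07.05\r\n7.09";

        // Without \r?, only the last line matches: $ cannot match before the CR.
        Console.WriteLine(string.Join(" | ", FindAmounts(text, WithoutCr)));

        // With \r?, the amount in the middle of the text is found as well.
        Console.WriteLine(string.Join(" | ", FindAmounts(text, WithCr)));
    }
}
```

Note that with `\r?` the raw match value includes the carriage return, which is why the sketch trims it before printing.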

Wiktor Stribiżew
  • You can replace `[0-9]` by `\d` – palvarez Jul 24 '19 at 13:26
  • @palvarez I [don't think it is a good idea](https://stackoverflow.com/questions/16621738) since this is string validation. – Wiktor Stribiżew Jul 24 '19 at 13:31
  • Interesting, awesome! Thanks for the link ;) – palvarez Jul 24 '19 at 13:34
  • Thank you for your fast reply! The regex does seem to work in C#, but not with iText 7 pdfSweep. – SUPERNOVICE Jul 24 '19 at 13:40
  • @SUPERNOVICE There is no mention of anything like pdfSweep in the question. What are you doing? Please add the relevant code to the question. – Wiktor Stribiżew Jul 24 '19 at 13:45
  • @SUPERNOVICE If you need to match whole lines in a multiline document add `(?m)` at the start of the pattern, like `@"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?$"`. – Wiktor Stribiżew Jul 24 '19 at 13:55
  • Updated to include the C# portion. Sorry, I should have mentioned the pdfSweep part earlier, but my assumption was that the same regex would work. – SUPERNOVICE Jul 24 '19 at 14:15
  • @SUPERNOVICE Try `@"(? – Wiktor Stribiżew Jul 24 '19 at 14:16
  • Thank you, but the new regex doesn't work for most cases in the PDF file. – SUPERNOVICE Jul 24 '19 at 14:25
  • @SUPERNOVICE Then use the previous one, but [use the fix from this answer](https://stackoverflow.com/a/52704442/3832970). The point is that the line breaks are lost. And use `@"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$"` – Wiktor Stribiżew Jul 24 '19 at 15:20
  • Sorry but the @"(?m)^\$?[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?\r?$" does not work at all in the pdf file. – SUPERNOVICE Jul 25 '19 at 11:47
  • @SUPERNOVICE Did you apply the fix from the question I linked to? – Wiktor Stribiżew Jul 25 '19 at 12:25
  • Not sure how to apply that. I am not using itext the same way. I think the problem is in the regex compatibility with pdfSweep. It probably can't do anything complex. This is the only pdf I could find on regex and pdfsweep. https://itextpdf.com/sites/default/files/2018-10/pdfSweep-whitepaper.pdf – SUPERNOVICE Jul 25 '19 at 12:49
  • The only other thing I can think of is to manually feed the PdfCleanUpLocation, but I am not sure how to get the location of the string if I use a C# regex. The closest thing was regex = @"$?\d?\d+\.\d{2}"; – SUPERNOVICE Jul 25 '19 at 13:18
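
On the `\d` versus `[0-9]` exchange in the comments above: in .NET, `\d` matches any Unicode decimal digit unless `RegexOptions.ECMAScript` is set, whereas `[0-9]` matches ASCII digits only. A minimal sketch of the difference follows; the class and method names are hypothetical, and the Arabic-Indic sample is chosen purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class contrasting \d with [0-9] in .NET.
public static class DigitClassCheck
{
    // \d with default options: any Unicode decimal digit.
    public static bool MatchesBackslashD(string s) => Regex.IsMatch(s, @"^\d+$");

    // Explicit ASCII range.
    public static bool MatchesAsciiRange(string s) => Regex.IsMatch(s, @"^[0-9]+$");

    // \d under ECMAScript-compliant behavior: equivalent to [0-9].
    public static bool MatchesEcmaScriptD(string s) =>
        Regex.IsMatch(s, @"^\d+$", RegexOptions.ECMAScript);

    public static void Main()
    {
        // Arabic-Indic digits one, two, three (U+0661 to U+0663).
        string arabicIndic = "\u0661\u0662\u0663";

        Console.WriteLine(MatchesBackslashD(arabicIndic));   // True
        Console.WriteLine(MatchesAsciiRange(arabicIndic));   // False
        Console.WriteLine(MatchesEcmaScriptD(arabicIndic));  // False
        Console.WriteLine(MatchesAsciiRange("123"));         // True
    }
}
```

For validating or redacting amounts printed with ASCII digits, the explicit `[0-9]` range keeps the pattern from accepting digits from other scripts.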