I am just picking up and learning Python. For work I go through a lot of PDFs, so I found a tool, PDFMiner, that converts a directory of them to text files. I then wrote the code below to tell me whether a PDF is an approved claim or a denied claim. What I don't understand is how to say: find the string that starts with "Tracking Identification Number...", take the 18 characters after it, and put them into an array.

import os
import glob
import csv
def check(filename):
    if 'DELIVERY NOTIFICATION' in open(filename).read():
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
    elif 'Dear Customer:' in open(filename).read():
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

def iterate():

    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        filename = infile
        check(filename)


iterate()

Any help would be appreciated. This is what the text file looks like:

Shipper Number............................577140Pickup Date....................................06/27/17
Number of Parcels........................1Weight.............................................1 LBS
Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000
Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B
WE HAVE BEEN UNABLE TO PROVIDE SATISFACTORY PROOF OF DELIVERY FOR THE ABOVE
SHIPMENT.  WE APOLOGIZE FOR THE INCONVENIENCE THIS CAUSES.
NPT8AEQ:000A0000LDI 07
----------------Page (1) Break----------------

Update: Many helpful answers. Here is the route I took, and it is working quite nicely, if I do say so myself. This is going to save tons of time! Here is the entire code for any future viewers.

import csv
import glob
import os

arrayDenied = []
arrayApproved = []

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        check(infile)

def check(filename):
    # Read the file once and reuse the text for every check
    with open(filename, 'rt') as file_contents:
        myText = file_contents.read()
    if 'DELIVERY NOTIFICATION' in myText:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        print("Denied: " + myNumber)
        arrayDenied.append(myNumber)
    elif 'Dear Customer:' in myText:
        print("This claim was Approved")

        startTrackingNum = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[startTrackingNum : startTrackingNum+18]

        startClaimNumberIndex = myText.index("Claim Number ") + len("Claim Number ")
        myClaimNumber = myText[startClaimNumberIndex : startClaimNumberIndex+11]

        arrayApproved.append(myNumber + " - " + myClaimNumber)
    else:
        print("I don't know if this is approved or denied")

iterate()
with open('Approved.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayApproved:
        writer.writerow([val])
with open('Denied.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayDenied:
        writer.writerow([val])
print(arrayDenied) 
print(arrayApproved)

Update: Added the rest of my finished code. It writes the lists to CSV files, where I run a few =LEFT() formulas and such, and boom, I have 1000 tracking numbers in a matter of minutes. This is why programming is great.
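
One possible refinement, sketched here under assumptions: rather than joining the tracking number and claim number into a single string and splitting them again in the spreadsheet, each value could go into its own CSV column. The helper name `write_rows`, the column headers, and the sample claim number are invented for illustration, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

def write_rows(output, header, rows):
    """Write a header line, then one CSV row per tuple in rows."""
    writer = csv.writer(output, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)

# Hypothetical data: (tracking number, claim number) pairs.
approved = [("1Z000000YW00000000", "CLAIM000001")]

# The buffer stands in for open('Approved.csv', 'w', newline='').
buffer = io.StringIO()
write_rows(buffer, ["tracking_number", "claim_number"], approved)
print(buffer.getvalue())
```

With separate columns, the spreadsheet step may not need any text formulas at all.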

  • Are the dots really in the file? Is the tracking number always 18 characters starting with 1Z? – pault Feb 07 '18 at 15:38
  • Yes, I have thousands of PDFs to go through, and typically I copy and paste them into an Excel sheet, so I am trying to automate this painful process. The approval PDFs are a little different, but yes, essentially they are all structured the same. – Bluestreak22 Feb 07 '18 at 15:42
  • https://www.computerhope.com/issues/ch001721.htm – Georgy Feb 07 '18 at 15:42
  • pault, I added a little bit of code at the bottom, along with an issue I'm having with it. Does anything look off? – Bluestreak22 Feb 07 '18 at 16:29
  • It does, the syntax is off. Please see my answer and let me know if this solves the issue. – FatihAkici Feb 07 '18 at 16:32
  • @Bluestreak22 Also, you should in general avoid manually opening files with calls such as `open(filename).read()`. You can open the file once with `with open()`, and do your `if` check and all the rest of the operations in it. I cover that in the answer. – FatihAkici Feb 07 '18 at 16:53

3 Answers

2

If your goal is just to find the "Tracking Identification Number..." string and the 18 characters that follow it, you can find the index of that string, add its length to reach the point where it ends, and slice from that point to 18 characters further on.

# Read the text file into memory:
with open(filename, 'rt') as txt_file:
    myText = txt_file.read()
    if 'DELIVERY NOTIFICATION' in myText:
        # Find the desired string and get the subsequent 18 characters:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        arrayDenied.append(myNumber)

You can also change the append line to something like arrayDenied.append(filename + ' ' + myNumber), so that each entry records which file the number came from.
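
As a minimal sketch of the same find-and-slice idea, the helper below uses `str.find`, which returns -1 when the marker is missing, instead of `str.index`, which raises `ValueError`. The function name `extract_after` is hypothetical, and the sample line is copied from the question.

```python
def extract_after(text, marker, length):
    """Return the `length` characters following `marker`, or None if absent."""
    pos = text.find(marker)  # -1 when the marker does not occur
    if pos == -1:
        return None
    start = pos + len(marker)
    return text[start:start + length]

marker = "Tracking Identification Number..."
sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000\n")

found = extract_after(sample, marker, 18)
missing = extract_after("no marker here", marker, 18)
print(found)    # the 18 characters after the marker
print(missing)  # None, because the marker is absent
```

Returning None lets the caller skip or log a file that lacks the marker rather than stopping the whole run on one odd document.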

FatihAkici
  • I just took a swing at this. I am getting a traceback error saying that Tracking Identification Number... is not in the list, which I would assume is because it's not reading the text file right, or maybe because there is a string bunched up against it without a space in the original text file? – Bluestreak22 Feb 07 '18 at 17:01
  • Actually all I did was remove .splitlines() and boom it worked :) – Bluestreak22 Feb 07 '18 at 17:02
  • @Bluestreak22 Oh awesome, that is right! Glad it worked! :) Edited the answer accordingly. – FatihAkici Feb 07 '18 at 17:03
  • Could you maybe explain what the index of the string is? To me an index would be a value in an array but to my knowledge a string or text file is not an array? – Bluestreak22 Feb 07 '18 at 17:04
  • index is the starting location of a substring in a string. Say your string is `myText = "helloabc1234hello"`, then `start=myText.index("abc")` gives you 5 because it starts at 5th index of `myText`. Then you add length of `abc` to reach where it ends. That index is where `1234` starts, which you are interested in, hence you do `myText[start : start+4]` to get those 4 characters. – FatihAkici Feb 07 '18 at 17:07
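
The exchange above can be illustrated with a short sketch. It reuses the `helloabc1234hello` example from the comment and then shows what is presumably the reason `.splitlines()` caused the "not in list" error: a list's `index` looks for an element equal to the whole argument, not for a substring. The sample lines are made up.

```python
my_text = "helloabc1234hello"

start = my_text.index("abc")   # position where "abc" begins
end = start + len("abc")       # position just past "abc"
print(start, my_text[end:end + 4])

# After splitlines() the data is a list of lines, and list.index only
# finds an element equal to the whole argument, never a substring.
lines = "first line\nxxabc1234\n".splitlines()
try:
    lines.index("abc")
    found_in_list = True
except ValueError as err:
    found_in_list = False
    print("list.index failed:", err)
```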
1

Regular expressions are the way to go for your task. Here is a way to modify your code to search for the pattern.

import re
pattern = r"(?<=Tracking Identification Number)(?:(\.+))[A-Z-a-z0-9]{18}"

def check(filename):
    # Open the file once and close it automatically
    with open(filename, 'r') as f:
        file_contents = f.read()
    if 'DELIVERY NOTIFICATION' in file_contents:
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
        matches = re.finditer(pattern, file_contents)
        for match in matches:
            print("Tracking Number = %s" % match.group().strip("."))
    elif 'Dear Customer:' in file_contents:
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

Explanation:

r"(?<=Tracking Identification Number)(?:(\.+))[A-Z-a-z0-9]{18}"

  • (?<=Tracking Identification Number) is a lookbehind: the match must be immediately preceded by the string "Tracking Identification Number"
  • (?:(\.+)) matches one or more dots (.) (we strip these out after)
  • [A-Z-a-z0-9]{18} matches 18 characters, each a capital or lowercase letter or a digit (the stray - inside the class also lets a literal hyphen through)

More on Regex.
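
A somewhat simpler variant, offered as a sketch rather than a replacement for the pattern above: put the 18 characters in a capturing group so that no dots need stripping afterwards. It assumes the number is always exactly 18 letters or digits following a run of dots; the name `TRACKING_RE` is made up, and the sample line is copied from the question.

```python
import re

# Assumption: exactly 18 letters or digits follow the run of dots.
TRACKING_RE = re.compile(r"Tracking Identification Number\.+([A-Za-z0-9]{18})")

sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000")

match = TRACKING_RE.search(sample)
# group(1) is only the part in parentheses, so the dots are excluded.
tracking_number = match.group(1) if match else None
print(tracking_number)
```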

pault
  • I am not gonna lie whenever I see something like this "(?:(\.+))[A-Z-a-z0-9]{18" I get the heeby jeebies and think like holy crap lol I will try this out though along with other answers just to know two ways of doing something. – Bluestreak22 Feb 07 '18 at 16:43
  • @Bluestreak22 I am by no means an expert in regex, but I find this site [regex101.com](http://www.regex101.com) to be extremely useful in testing patterns. Paste your text in there, select your programming language, and try to make your own pattern. – pault Feb 07 '18 at 16:44
0

I think this solves your issue; just turn it into a function.

import re

string = 'Tracking Identification Number...1Z000000YW00000000'

no_dots = re.sub(r'\.', '', string)  # Removes all dots from the string

# Captures everything after "Tracking Identification Number" (anchored at the start of the string)
matchObj = re.search(r'^Tracking Identification Number(.*)', no_dots)

# re.search returns None when nothing matches
if matchObj:
    print(matchObj.group(1))
else:
    print("No match!")

If you want to read the documentation it is here: https://docs.python.org/3/library/re.html#re.search

Setti7
  • What if there were extra stuff after the tracking number as in `s = 'Tracking Identification Number...1Z000000YW00000000...Extra Stuff'` – pault Feb 07 '18 at 16:13
  • @pault The file he showed has a line break at the end of that number, so it should stop there. – Setti7 Feb 07 '18 at 16:15
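
To make the two concerns in this thread concrete, here is a small sketch run against the sample line from the question and the trailing-text example from the comment. It suggests that the `^` anchor finds nothing when the label sits mid-line, and that `(.*)` keeps whatever follows the number. The variable names and the bounded pattern are illustrative assumptions.

```python
import re

line = ("Shipper Invoice Number..............30057010"
        "Tracking Identification Number...1Z000000YW00000000")
trailing = "Tracking Identification Number...1Z000000YW00000000...Extra Stuff"

greedy = r"^Tracking Identification Number(.*)"
bounded = r"Tracking Identification Number\.*([A-Za-z0-9]{18})"

# The ^ anchor fails here because the label is in the middle of the line.
anchored_result = re.search(greedy, line.replace(".", ""))
print(anchored_result)

# On the trailing-text example, (.*) runs to the end of the string.
greedy_result = re.search(greedy, trailing.replace(".", "")).group(1)
print(greedy_result)

# A fixed-length group stops after 18 characters in both cases.
bounded_results = [re.search(bounded, text).group(1) for text in (line, trailing)]
print(bounded_results)
```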