Locating and extracting a string from multiple text files in Python

Question

I am just starting to learn Python. For work I go through a lot of PDFs, so I found a tool, PDFMiner, that converts a directory of PDFs to text files. I then wrote the code below to tell me whether each PDF is an approved claim or a denied claim. What I don't understand is how to find the string that starts with "Tracking Identification Number...", take the 18 characters after it, and put them into an array.

import os
import glob
import csv
def check(filename):
    if 'DELIVERY NOTIFICATION' in open(filename).read():
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
    elif 'Dear Customer:' in open(filename).read():
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

def iterate():

    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        filename = infile
        check(filename)


iterate()

Any help would be appreciated. This is what the text file looks like:

Shipper Number............................577140Pickup Date....................................06/27/17
Number of Parcels........................1Weight.............................................1 LBS
Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000
Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B
WE HAVE BEEN UNABLE TO PROVIDE SATISFACTORY PROOF OF DELIVERY FOR THE ABOVE
SHIPMENT.  WE APOLOGIZE FOR THE INCONVENIENCE THIS CAUSES.
NPT8AEQ:000A0000LDI 07
----------------Page (1) Break----------------

Update: Many helpful answers. Here is the route I took, and it is working quite nicely if I do say so myself. This is going to save tons of time! Here is the entire code for any future viewers.

import os
import glob
import csv

arrayDenied = []
arrayApproved = []

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        check(infile)

def check(filename):
    # Read each file once and reuse the text for every check.
    with open(filename, 'rt') as file_contents:
        myText = file_contents.read()
        if 'DELIVERY NOTIFICATION' in myText:
            start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
            myNumber = myText[start : start+18]
            print("Denied: " + myNumber)
            arrayDenied.append(myNumber)
        elif 'Dear Customer:' in myText:
            print("This claim was Approved")

            startTrackingNum = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
            myNumber = myText[startTrackingNum : startTrackingNum+18]

            startClaimNumberIndex = myText.index("Claim Number ") + len("Claim Number ")
            myClaimNumber = myText[startClaimNumberIndex : startClaimNumberIndex+11]

            arrayApproved.append(myNumber + " - " + myClaimNumber)
        else:
            print("I don't know if this is approved or denied")

iterate()
with open('Approved.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayApproved:
        writer.writerow([val])
with open('Denied.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayDenied:
        writer.writerow([val])
print(arrayDenied)
print(arrayApproved)

Update: Added the rest of my finished code. It writes the lists to a CSV file, where I run some =LEFT() formulas and such, and boom, I have 1000 tracking numbers in a matter of minutes. This is why programming is great.
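
A possible refinement, sketched under assumptions: writing the tracking number and the claim number as separate CSV columns might make the spreadsheet-side =LEFT() step unnecessary. The helper name write_rows, the column names, and the sample claim number below are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

def write_rows(handle, header, rows):
    """Write a header line, then one CSV row per tuple in rows."""
    writer = csv.writer(handle, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)

# Hypothetical sample data: (tracking number, claim number) pairs.
approved = [("1Z000000YW00000000", "C0000000001")]

# An in-memory buffer stands in for open('Approved.csv', 'w', newline='').
buffer = io.StringIO()
write_rows(buffer, ["tracking_number", "claim_number"], approved)
print(buffer.getvalue())
```

With this layout, arrayApproved would hold tuples rather than strings joined with " - ".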

Are the dots really in the file? Is the tracking number always 18 characters starting with 1Z? – pault Feb 07 '18 at 15:38
Yes. I have thousands of PDFs to go through, and typically I copy and paste them into an Excel sheet, so I am trying to automate this painful process. The approval PDFs are a little different, but essentially they are all structured the same. – Bluestreak22 Feb 07 '18 at 15:42
@pault I added a little bit of code to the bottom, along with an issue I'm having with it. Does anything look off? – Bluestreak22 Feb 07 '18 at 16:29
It does, the syntax is off. Please see my answer and let me know if this solves the issue. – FatihAkici Feb 07 '18 at 16:32
@Bluestreak22 also you should in general avoid manually opening files such as `open(filename).read()`. You can open the file once with `with open()`, and do your `if` check and all the rest of the operations in it. I cover that in the answer. – FatihAkici Feb 07 '18 at 16:53

FatihAkici · Accepted Answer · 2018-02-07T19:47:47.550

2

If your goal is just to find the "Tracking Identification Number..." string and the 18 characters that follow it, you can find the index of that string, move to where it ends, and slice from that point through the next 18 characters.

# Read the text file into memory:
with open(filename, 'rt') as txt_file:
    myText = txt_file.read()
    if 'DELIVERY NOTIFICATION' in myText:
        # Find the desired string and get the subsequent 18 characters:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        arrayDenied.append(myNumber)

You can also modify the append line into arrayDenied.append(filename + ' ' + myNumber) or things like that.
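
As a minimal sketch of the same idea in reusable form: the helper name extract_after is hypothetical, and it uses str.find rather than str.index so that a file without the marker yields None instead of raising ValueError. The sample text is copied from the question.

```python
def extract_after(text, marker, length):
    """Return the `length` characters that follow `marker`, or None if absent."""
    position = text.find(marker)  # str.find returns -1 instead of raising
    if position == -1:
        return None
    start = position + len(marker)
    return text[start:start + length]

# Sample line taken from the text file shown in the question.
sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000")
marker = "Tracking Identification Number..."

print(extract_after(sample, marker, 18))
print(extract_after("no marker here", marker, 18))
```

Returning None leaves it to the caller to decide whether a missing marker should be skipped or reported.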

edited Feb 07 '18 at 19:47

answered Feb 07 '18 at 16:32

FatihAkici


I just took a swing at this. I am getting a traceback error saying that Tracking Identification Number... is not in the list, which I would assume is because it's not reading the text file right, or maybe because there is a string bunched up against it without a space in the original text file? – Bluestreak22 Feb 07 '18 at 17:01
Actually all I did was remove .splitlines() and boom it worked :) – Bluestreak22 Feb 07 '18 at 17:02
@Bluestreak22 Oh awesome, that is right! Glad it worked! :) Edited the answer accordingly. – FatihAkici Feb 07 '18 at 17:03
Could you maybe explain what the index of the string is? To me an index would be a value in an array but to my knowledge a string or text file is not an array? – Bluestreak22 Feb 07 '18 at 17:04
index is the starting location of a substring in a string. Say your string is `myText = "helloabc1234hello"`, then `start=myText.index("abc")` gives you 5 because it starts at 5th index of `myText`. Then you add length of `abc` to reach where it ends. That index is where `1234` starts, which you are interested in, hence you do `myText[start : start+4]` to get those 4 characters. – FatihAkici Feb 07 '18 at 17:07
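
The example from the last comment can be tried directly. A string is a sequence of characters, so positions and slices work on it much as they do on a list. The names first_char, end_of_marker, and digits are illustrative only.

```python
myText = "helloabc1234hello"

# Individual characters are addressed by position, counting from 0.
first_char = myText[0]

# index() returns the position where the substring begins.
start = myText.index("abc")

# Skip past "abc" itself, then slice out the next 4 characters.
end_of_marker = start + len("abc")
digits = myText[end_of_marker:end_of_marker + 4]

print(first_char, start, digits)
```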

pault · Answer 2 · 2018-02-07T16:45:44.017

Regular expressions are the way to go for your task. Here is a way to modify your code to search for the pattern.

import re
pattern = r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"

def check(filename):
    # Read the file once and reuse its contents for every check.
    with open(filename, 'r') as f:
        file_contents = f.read()
    if 'DELIVERY NOTIFICATION' in file_contents:
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
        matches = re.finditer(pattern, file_contents)
        for match in matches:
            print("Tracking Number = %s" % match.group().strip("."))
    elif 'Dear Customer:' in file_contents:
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

Explanation:

r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"

(?<=Tracking Identification Number) is a lookbehind: the match must be immediately preceded by the string "Tracking Identification Number"
(?:(\.+)) matches one or more dots (.) (we strip these out after)
[A-Za-z0-9]{18} matches exactly 18 (capital or lowercase) letters or digits
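
A variation on this pattern, sketched under assumptions: putting the 18 characters in a capturing group lets re.findall return the tracking numbers directly, with no dots to strip afterwards. The names TRACKING_RE and find_tracking_numbers are hypothetical, and the sample text is adapted from the question.

```python
import re

# Assumed layout: the label, a run of dots, then 18 letters or digits.
TRACKING_RE = re.compile(r"Tracking Identification Number\.+([A-Za-z0-9]{18})")

sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000\n"
          "Merchandise..................................1 S NIKE")

def find_tracking_numbers(text):
    """Return every tracking number found in text, in order of appearance."""
    # With a single capturing group, findall returns just the group contents.
    return TRACKING_RE.findall(text)

print(find_tracking_numbers(sample))
```

An empty list means the label was not found, so no exception handling is needed for files that lack a tracking number.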
