0

I have about 1000 txt files in the format below. The identifier is separated from the data by ':' . However the data it self can contains ':' I am working on a script that can allow me to transpose and import the data into an excel, csv, microsoft acess or mssql format. in a structured way. Given that the identifiers will be the same for all the txt files.

Author: sharkespare
Content-Length: 15200098
Content-Type: application/pdf
Creation-Date: 2015-02-05T07:27:34Z
Last-Modified: 2015-02-05T07:28:38Z
Last-Save-Date: 2015-02-05T07:28:38Z
created: Thu Feb 05 08:27:34 CET 2015
creator: Williams spear
date: 2015-02-05T07:27:34Z
dc:creator: Library of congress
John Kings
  • 49
  • 5

3 Answers3

0

You can loop thought the string until you encounter the ":" symbol and than look at the next letter if it is space( " " ) than the data you want is separated from the whole string

Georges Badra
  • 55
  • 1
  • 2
  • 10
  • I can use sed or awk and find to clean the data. The main issue is how to transpose the data.. and auto import the data "files" into a csv format or database – John Kings Sep 25 '15 at 12:48
0

You do not state what technique is preferred, but since python and Excel are tagged I assume you are OK using both. The following is pretty easy to implement in python. I am assuming the file format is static.

Create a header line string, e.g. "Author, Content-Length, ...". Open an output file and write the header to the file.

Loop through the directory of input files. For each file read the lines in the file and split of the identifier (you know the names of identifiers so a split or string replace should work, you could split by colon, but then you will have to take special care of the dc:creator identifier). Append the isolated data to a string for that file, append a ','. When done with the file append that string to the output file. When done with all files the output file will be a csv that can be opened with Excel.

Some pointers that might help with the implementation:

How can I iterate over files in a given directory?

How do I read a file line-by-line into a list?

Writing to CSV with Python adds blank lines

Community
  • 1
  • 1
ikkjo
  • 535
  • 7
  • 13
0

What about:

Open txt file and split by ": " (with space after the colon), put it into a list.

text=[]
with open('read_file.txt','r') as f:
    for line in f:
        for word in line.split(": "):
            text.append(word)

Now every even element is an identifier and the next odd element ist the corresponding data. Rebuild the list putting every identifier and its data together:

sorted = [text[i:i+2] for i in range(0, len(text), 2)]

Finally write everything into a csv file in colums using zip:

import csv
with open('save_file.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(zip(*sorted))

If you don't want to transpose it, just replace zip(*sorted) with sorted. After saving the text from the first file, loop over the other ones and add only the data to the csv file. You can achieve that by opening with the 'a' parameter, allowing you to append stuff at the end of the file, and using [zip(*sorted)[1]]:

files = ["file1.txt", "file2.txt", "file3.txt", etc]
for each in  files:
    text=[]
    with open(str(each),'r') as f:
        for line in f:
            for word in line.split(": "):
                text.append(word)

    sorted = [lst[i:i+2] for i in range(0, len(lst), 2)]

    with open('save_file.csv', 'a') as f:
        writer = csv.writer(f)
        writer.writerows([zip(*sorted)[1]])
a.smiet
  • 1,447
  • 1
  • 16
  • 35