
I've tried many approaches based on great Stack Overflow ideas, including:

How to write header row with csv.DictWriter?

Writing a Python list of lists to a csv file

csv.DictWriter -- TypeError: __init__() takes at least 3 arguments (4 given)

Python: tuple indices must be integers, not str when selecting from mysql table

https://docs.python.org/2/library/csv.html

python csv write only certain fieldnames, not all

Python 2.6 Text Processing and

Why is DictWriter not Writing all rows in my Dictreader instance?

I tried mapping reader and writer fieldnames and special header parameters.

I built a second layer test from some great multi-column SO articles:

The code follows:

import csv
import re

t = re.compile(r'<\*(.*?)\*>')   # capture the text between <* and *>
headers = ['a', 'b', 'd', 'g']   # only the four columns to write out
with open('in2.csv', 'rb') as csvfile:
    with open('out2.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile)
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['d'] = re.findall(t, row['d'])  # note: findall returns a list
            print(row['a'], row['b'], row['d'], row['g'])
            writer.writerow(row)

input data is:

a, b, c, d, e, f, g, h 

<* number 1 *>, <* number 2 *>, <* number 3 *>, <* number 4 *>, ...<* number 8 *> 

<* number 2 *>, <* number 3 *>, <* number 4 *>, ...<* number 8 *>, <* number 9 *> 

output data is:

['a', 'b', 'd', 'g' ] 

('<* number 1 *>', '<* number 2 *>', ' number 4 ', <* number 7 *>) 

('<* number 2 *>', '<* number 3 *>', ' number 5 ', <* number 8 *>) 

exactly as desired.

But when I use a rougher data set that has words with blanks, double quotes, and mixes of upper- and lower-case letters, the printing works at the row level, but the writing does not work entirely.

By "entirely", I mean I have been able (I know I'm in epic-fail mode here) to write one row of the challenging data, but not, in that instance, a header and multiple rows. Pretty lame that I can't overcome this hurdle with all the talented articles I've read.

All four columns fail with either a KeyError or with a "TypeError: tuple indices must be integers, not str".

I'm obviously not understanding how to grasp what Python needs to make this happen.

The high level is: read in text files with seven observations / columns. Use only four columns to write out; perform the regex on one column. Make sure to write out each newly formed row, not the original row.
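As a minimal sketch of that high-level flow (in-memory data and made-up column values purely for illustration; joining findall's list into one string is my assumption about the desired output shape):

```python
import csv
import re
from io import StringIO

# hypothetical in-memory stand-ins for the input and output files
src = StringIO("a,b,c,d,e,f,g,h\n"
               "A1,B1,C1,x <*keep*> y <*me*> z,E1,F1,G1,H1\n")
dst = StringIO()

tag = re.compile(r'<\*(.*?)\*>')   # capture the text between <* and *>
headers = ['a', 'b', 'd', 'g']     # only the four columns to write out

reader = csv.DictReader(src)
writer = csv.DictWriter(dst, headers, extrasaction='ignore')
writer.writeheader()
for row in reader:
    # findall returns a list; join it so DictWriter gets a single string
    row['d'] = ":".join(tag.findall(row['d']))
    writer.writerow(row)

print(dst.getvalue())
```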

I may need a more friendly type of global temp table to read the row into, update the row, then write the row out to a file.

Maybe I'm asking too much of Python architecture to coordinate a DictReader and a DictWriter to read in data, filter to four columns, update the fourth column with a regex, then write out the file with the updated four tuples.

At this juncture, I don't have the time to investigate a parser. I would like to eventually, in more detail, since parsers seem handy across Python releases (2.7 now, 3.x later).

Again, apologize for the complexity of the approach and my lack of understanding of the underpinnings of Python. In R language, the parallel of my shortcomings would be understanding coding at the S4 level, not just the S3 level.

Here is data that is closer to what fails. Sorry, but I needed to show how the headers are set up, how the incoming file rows are formatted (individual double quotes plus quotes around the entire row), and how the date is formatted but not quoted:

    stuff_type|stuff_date|stuff_text
""cool stuff"|01-25-2015|""the text stuff <*to test*> to find a way to extract all text that is <*included in special tags*> less than star and greater than star"""
""cool stuff"|05-13-2014|""the text stuff <*to test a second*> to find a way to extract all text that is <*included in extra special tags*> less than star and greater than star"""
""great big stuff"|12-7-2014|"the text stuff <*to test a third*> to find a way to extract all text that is <*included in very special tags*> less than star and greater than star"""
""nice stuff"|2-22-2013|""the text stuff <*to test a fourth ,*> to find a way to extract all text that is <*included in doubly special tags*> less than star and greater than star"""

stuff_type,stuff_date,stuff_text
cool stuff,1/25/2015,the text stuff <*to test*> to find a way to extract all text that is <*included in special tags*> less than star and greater than star
cool stuff,5/13/2014,the text stuff <*to test a second*> to find a way to extract all text that is <*included in extra special tags*> less than star and greater than star
great big stuff,12/7/2014,the text stuff <*to test a third*> to find a way to extract all text that is <*included in very special tags*> less than star and greater than star
nice stuff,2/22/2013,the text stuff <*to test a fourth *> to find a way to extract all text that is <*included in really special tags*> less or greater than star

I plan to retest this, but a Spyder update made my Python console crash this morning. Ugghh. With vanilla Python, the test data above fails with the following code. No need to reach the write step; it can't even print here. I may need csv.QUOTE_NONE in the dialect.

import csv
import re

t = re.compile(r'<\*(.*?)\*>')   # capture the text between <* and *>
headers = ['stuff_type', 'stuff_date', 'stuff_text']
with open('C:/Temp/in3.csv', 'rb') as csvfile:
    with open('C:/Temp/out3.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile)
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['stuff_text'] = re.findall(t, row['stuff_text'])
            print(row['stuff_type'], row['stuff_date'], row['stuff_text'])
            writer.writerow(row)

Error:

can't paste the snipping-tool image in here ... sorry

KeyError: 'stuff_text'

OK: it might be in the quoting and separation of columns. The data above without quotes printed without a KeyError and now writes to the file correctly. I may have to clean the quote characters out of the file before I pull out text with the regex. Any thoughts would be appreciated.
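For what it's worth, that cleanup idea sketched as code (a made-up one-row sample; it assumes every double-quote character can simply be dropped before the csv module sees the line):

```python
import csv
import re
from io import StringIO

# hypothetical sample mimicking the doubled/uneven quoting above
raw = ('stuff_type|stuff_date|stuff_text\n'
       '""cool stuff"|01-25-2015|""text <*to test*> mid <*in tags*> end"""\n')

# strip every double quote first, then let DictReader split on the pipe
cleaned = StringIO(raw.replace('"', ''))
reader = csv.DictReader(cleaned, delimiter='|')
row = next(reader)

t = re.compile(r'<\*(.*?)\*>')
print(t.findall(row['stuff_text']))
```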

Good question, @Andrea Corbellini.

The code above generates the following output if I've manually removed the quotes:

stuff_type,stuff_date,stuff_text
cool stuff,1/25/2015,"['to test', 'included in special tags']"
cool stuff,5/13/2014,"['to test a second', 'included in extra special tags']"
great big stuff,12/7/2014,"['to test a third', 'included in very special tags']"
nice stuff,2/22/2013,"['to test a fourth ', 'included in really special tags']"

which is exactly what I want in regards to output. So, thanks for your "lazy" question; I'm the lazy one who should have put this second output in as a follow-up.

Again, without removing the multiple sets of quotation marks, I get KeyError: 'stuff_type'. I apologize: I have attempted to insert a screen capture of the Python error, but haven't yet figured out how to do that on SO. I used the Images section above, but that seems to point to a file that is uploaded to SO, not inserted.

With @monkut's excellent input below on using ":".join, things are literally getting better:

['stuff_type', 'stuff_date', 'stuff_text']
('cool stuff', '1/25/2015', 'to test:included in special tags')
('cool stuff', '5/13/2014', 'to test a second:included in extra special tags')
('great big stuff', '12/7/2014', 'to test a third:included in very special tags')
('nice stuff', '2/22/2013', 'to test a fourth :included in really special tags')
    
import csv
import re

t = re.compile(r'<\*(.*?)\*>')   # capture the text between <* and *>
headers = ['stuff_type', 'stuff_date', 'stuff_text']
csv.register_dialect('piper', delimiter='|', quoting=csv.QUOTE_NONE)
with open('C:/Python/in3.txt', 'rb') as csvfile:
    with open('C:/Python/out5.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile, dialect='piper')
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            # join findall's list so DictWriter gets a single string
            row['stuff_text'] = ":".join(re.findall(t, row['stuff_text']))
            print(row['stuff_type'], row['stuff_date'], row['stuff_text'])
            writer.writerow(row)

Error path follows:

runfile('C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py', wdir='C:/Python')
['stuff_type', 'stuff_date', 'stuff_text']
('""cool stuff"', '01-25-2015', 'to test')
Traceback (most recent call last):

  File "<ipython-input-3-832ce30e0de3>", line 1, in <module>
    runfile('C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py', wdir='C:/Python')

  File "C:\Users\Methody\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "C:\Users\Methody\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py", line 20, in <module>
    row['stuff_text'] = ":".join(re.findall(t, row['stuff_text']))

  File "C:\Users\Methody\Anaconda\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)

TypeError: expected string or buffer

I'll have to find a stronger way to clean up and remove the quotes before running the regex findall. Probably something like row = row.replace('"', '').
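Sketching that cleanup step (str.replace rather than a hypothetical string.remove; strings are immutable, so replace returns a new string):

```python
import re

t = re.compile(r'<\*(.*?)\*>')
# hypothetical raw line in the style of the failing pipe-delimited data
line = '""cool stuff"|01-25-2015|""stuff <*to test*> and <*in special tags*> end"""'
line = line.replace('"', '')   # drop every double-quote character
fields = line.split('|')       # split the pipe-delimited row
print(":".join(t.findall(fields[2])))
```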

  • Please forgive my laziness: I have read just 10% of your question. Could you just show the code together with the input that produces the issue, the expected output and the actual output? – Andrea Corbellini Feb 04 '16 at 17:14
  • your KeyError suggests that you don't have a 'stuff_text' column in input, in3.csv , file. – monkut Feb 06 '16 at 22:07
  • @monkut: understand what you are saying, however there is a column header called 'stuff_text'. But, you're more right than you think ... the rows of data coming in are strings that are not decomposed into data elements until they are read into the dict. If I eliminate the quotes on the strings, this works. Will be testing row.replace quotes with blanks before running the regex. I'd better not fight the immutability of the strings. You're right in that the header is not aligned to the dict data element if the quotes are still in the picture. – MethodyM Feb 07 '16 at 00:03
  • @monkut, don't hesitate to correct my attempt at discussing Python internals; please don't assume my terminology about data elements is correct--I was tempted to use a Python-specific term tuples in a theoretical relational algebra, C. J. Date, sort of sense. Data elements was an attempt at describing DictReader and DictWriter JSON. – MethodyM Feb 07 '16 at 00:06
  • here, your TypeError results from giving a compiled regex object to findall, try, t.findall(row['stuff_text']). – monkut Feb 07 '16 at 03:10
  • @monkut your row['d'] = ":".join(re.findall(t, row['d'])) works with the quote-stripped csv file. The t.findall(row['stuff_text']) fails on both the ugly quoted txt file and the cleaned-up csv file. I'm pulling out hair trying to find a way to do in Python what I know I can do easily in R: remove quotes from a large corpus. Of course R at older versions doesn't perform past the billion-row crazy matrix limit for dataFrames. Your extension of the regex with ":".join works fine, just not with a quoted, ugly, irregular file. – MethodyM Feb 08 '16 at 05:57
  • R, for example, uses the stringr package. The idea is that a massive replace-all runs much faster than reading line by line to replace characters. I may be in a situation where I want one regex to clean out all quotes first, then another regex to do the findall and extract the data between the special characters, which will work once the quotes are eliminated from the input lines or rows. Maybe a two-step that writes to different files. Not the best Python. – MethodyM Feb 08 '16 at 06:34
  • Python should be able to do this in memory before writing to files--this being any independent data clean up then data extraction functions. – MethodyM Feb 08 '16 at 06:35
  • R not so good with stringr for rough data. Back to Python – MethodyM Feb 08 '16 at 19:44

1 Answer


I think the problem is that findall returns a list, which may be screwing things up, since DictWriter wants a single string value.

row['d'] = re.findall(t, row['d'])

You can use .join to turn the results to a single string value:

row['d'] = ":".join(re.findall(t, row['d']))

Where, here values are joined with, ":". As you mention, though, you may need to clean the values a bit more...

You mentioned there was a problem with using the compiled regex object. Here's an example of how the compiled regex object is used:

import re
t = re.compile(r'<\*(.*?)\*>')
text= ('''cool stuff,1/25/2015,the text stuff <*to test*> to find a way to extract all text that'''
       ''' is <*included in special tags*> less than star and greater than star''')
result = t.findall(text)

This should return the following into result:

['to test', 'included in special tags']
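And joining that list, as suggested above, collapses it to one writable string:

```python
import re

t = re.compile(r'<\*(.*?)\*>')
result = t.findall('the text stuff <*to test*> to find a way to extract all text'
                   ' that is <*included in special tags*> less than star and greater than star')
print(":".join(result))   # to test:included in special tags
```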

monkut
  • When I pull the quotes out, it works ... that was cheating on my part: I used an Excel csv without quotes versus a Notepad text file with lots of quotes, which is typically what I deal with. Appreciate you, monkut, taking the time to think about this. I think you're right in some aspects: re.findall requires cleaner data than I imagined to return a good list item. – MethodyM Feb 04 '16 at 18:10
  • Without quotes, here are the results with the ":".join added; please see the output formatted in the question above. – MethodyM Feb 06 '16 at 21:56