I've tried many approaches based on good Stack Overflow ideas, per:
How to write header row with csv.DictWriter?
Writing a Python list of lists to a csv file
csv.DictWriter -- TypeError: __init__() takes at least 3 arguments (4 given)
Python: tuple indices must be integers, not str when selecting from mysql table
https://docs.python.org/2/library/csv.html
python csv write only certain fieldnames, not all
Python 2.6 Text Processing, and
Why is DictWriter not Writing all rows in my Dictreader instance?
I tried mapping reader and writer fieldnames and special header parameters.
I built a second layer test from some great multi-column SO articles:
The code follows:

import csv
import re

# Raw string so the backslashes reach the regex engine untouched.
t = re.compile(r'<\*(.*?)\*>')
headers = ['a', 'b', 'd', 'g']

with open('in2.csv', 'rb') as csvfile:
    with open('out2.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile)
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['d'] = re.findall(t, row['d'])
            print(row['a'], row['b'], row['d'], row['g'])
            writer.writerow(row)
The input data is:
a, b, c, d, e, f, g, h
<* number 1 *>, <* number 2 *>, <* number 3 *>, <* number 4 *>, ...<* number 8 *>
<* number 2 *>, <* number 3 *>, <* number 4 *>, ...<* number 8 *>, <* number 9 *>
The output data is:
['a', 'b', 'd', 'g']
('<* number 1 *>', '<* number 2 *>', ' number 4 ', '<* number 7 *>')
('<* number 2 *>', '<* number 3 *>', ' number 5 ', '<* number 8 *>')
exactly as desired.
But when I use a rougher data set that has words with blanks, double quotes, and a mix of upper- and lowercase letters, the printing works at the row level, but the writing does not work entirely.
By "entirely", I mean I have managed (I know I'm in epic fail mode here) to write one row of the challenging data, but not, in that instance, a header plus multiple rows. Pretty lame that I can't overcome this hurdle with all the talented articles that I've read.
All four columns fail with either a KeyError or with "TypeError: tuple indices must be integers, not str".
I'm obviously not grasping what Python needs to make this happen.
The high-level goal is: read in text files with seven observations/columns; write out only four of the columns; perform the regex on one column; and make sure each newly formed row, not the original row, is written out.
I may need a friendlier kind of global temp table to read each row into, update it, then write it out to a file.
Maybe I'm asking too much of the Python architecture to coordinate a DictReader and a DictWriter: read in data, filter to four columns, update the fourth column with a regex, then write out the file with the updated four-tuples.
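In case coordinating DictReader and DictWriter is the stumbling block, the same pipeline can be sketched with plain csv.reader and column indices instead. This is a minimal sketch, not your exact setup: io.StringIO stands in for the in2.csv/out2.csv files, and the a..h column layout is assumed from the sample above.

```python
import csv
import io
import re

t = re.compile(r'<\*(.*?)\*>')

# In-memory stand-in for in2.csv (assumed layout: columns a..h).
sample = (
    "a,b,c,d,e,f,g,h\n"
    "<* number 1 *>,<* number 2 *>,<* number 3 *>,<* number 4 *>,"
    "<* number 5 *>,<* number 6 *>,<* number 7 *>,<* number 8 *>\n"
)

wanted = ['a', 'b', 'd', 'g']          # the four columns to keep
out = io.StringIO()

reader = csv.reader(io.StringIO(sample))
writer = csv.writer(out)

header = next(reader)                  # first row is the header
keep = [header.index(name) for name in wanted]
d_col = header.index('d')
writer.writerow(wanted)

for row in reader:
    # Update column d in place, then write only the four wanted columns.
    row[d_col] = ":".join(t.findall(row[d_col]))
    writer.writerow([row[i] for i in keep])
```

With indices there is no fieldname mapping to coordinate; the trade-off is that the header lookup has to be done by hand.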
At this juncture, I don't have the time to investigate a parser. I would like to eventually, in more detail, since parsers seem handy across Python releases (2.7 now, 3.x later).
Again, I apologize for the complexity of the approach and my lack of understanding of the underpinnings of Python. In the R language, the parallel of my shortcomings would be understanding coding at the S4 level, not just the S3 level.
Here is data closer to what fails. Sorry: I needed to show how the headers are set up, how the incoming file rows are formatted (individual double quotes, with quotes around the entire row), and how the date is formatted but not quoted:
stuff_type|stuff_date|stuff_text
""cool stuff"|01-25-2015|""the text stuff <*to test*> to find a way to extract all text that is <*included in special tags*> less than star and greater than star"""
""cool stuff"|05-13-2014|""the text stuff <*to test a second*> to find a way to extract all text that is <*included in extra special tags*> less than star and greater than star"""
""great big stuff"|12-7-2014|"the text stuff <*to test a third*> to find a way to extract all text that is <*included in very special tags*> less than star and greater than star"""
""nice stuff"|2-22-2013|""the text stuff <*to test a fourth ,*> to find a way to extract all text that is <*included in doubly special tags*> less than star and greater than star"""
stuff_type,stuff_date,stuff_text
cool stuff,1/25/2015,the text stuff <*to test*> to find a way to extract all text that is <*included in special tags*> less than star and greater than star
cool stuff,5/13/2014,the text stuff <*to test a second*> to find a way to extract all text that is <*included in extra special tags*> less than star and greater than star
great big stuff,12/7/2014,the text stuff <*to test a third*> to find a way to extract all text that is <*included in very special tags*> less than star and greater than star
nice stuff,2/22/2013,the text stuff <*to test a fourth *> to find a way to extract all text that is <*included in really special tags*> less or greater than star
I plan to retest this, but a Spyder update crashed my Python console this morning. Ugghh. With vanilla Python, the test data above fails with the following code... no need to reach the write step; it can't even print here. I may need csv.QUOTE_NONE in the dialect.
import csv
import re

# Raw string so the backslashes reach the regex engine untouched.
t = re.compile(r'<\*(.*?)\*>')
headers = ['stuff_type', 'stuff_date', 'stuff_text']

with open('C:/Temp/in3.csv', 'rb') as csvfile:
    with open('C:/Temp/out3.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile)
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['stuff_text'] = re.findall(t, row['stuff_text'])
            print(row['stuff_type'], row['stuff_date'], row['stuff_text'])
            writer.writerow(row)
Error (I can't paste the snipping-tool image in here... sorry):
KeyError: 'stuff_text'
OK, it might be in the quoting and separation of columns: the data above without quotes prints without a KeyError and now writes to the file correctly. I may have to clean the quote characters out of the file before I pull text out with the regex. Any thoughts would be appreciated.
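One way to do that cleanup is to strip the quotes before csv ever sees a line, by wrapping the input in a generator. A minimal sketch with in-memory data (the messy pipe-delimited line below is an assumption modeled on the sample data above, and io.StringIO stands in for the file):

```python
import csv
import io
import re

t = re.compile(r'<\*(.*?)\*>')

# In-memory stand-in for the messy pipe-delimited file, stray quotes and all.
messy = (
    'stuff_type|stuff_date|stuff_text\n'
    '""cool stuff"|01-25-2015|""the text stuff <*to test*> to find '
    '<*included in special tags*> text"""\n'
)

# Strip every double-quote character from each line before csv sees it.
cleaned = (line.replace('"', '') for line in io.StringIO(messy))

reader = csv.DictReader(cleaned, delimiter='|')
rows = []
for row in reader:
    row['stuff_text'] = ":".join(t.findall(row['stuff_text']))
    rows.append(row)
```

Because DictReader accepts any iterable of lines, not just a file object, the generator slots in without changing the rest of the loop.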
Good question, @Andrea Corbellini.
The code above generates the following output if I manually remove the quotes:
stuff_type,stuff_date,stuff_text
cool stuff,1/25/2015,"['to test', 'included in special tags']"
cool stuff,5/13/2014,"['to test a second', 'included in extra special tags']"
great big stuff,12/7/2014,"['to test a third', 'included in very special tags']"
nice stuff,2/22/2013,"['to test a fourth ', 'included in really special tags']"
which is what I want in terms of output. So, thanks for your "lazy" question; I'm the lazy one who should have posted this second output as a follow-up.
Again, without removing the multiple sets of quotation marks, I get KeyError: 'stuff_type'. I apologize: I have attempted to insert a screen capture of the Python error, but haven't yet figured out how to do that on SO. I used the Images section above, but that seems to point to a file that is uploaded to SO, not inserted.
With @monkut's excellent input below on using ":".join, things are literally getting better:
['stuff_type', 'stuff_date', 'stuff_text']
('cool stuff', '1/25/2015', 'to test:included in special tags')
('cool stuff', '5/13/2014', 'to test a second:included in extra special tags')
('great big stuff', '12/7/2014', 'to test a third:included in very special tags')
('nice stuff', '2/22/2013', 'to test a fourth :included in really special tags')
import csv
import re

# Raw string so the backslashes reach the regex engine untouched.
t = re.compile(r'<\*(.*?)\*>')
headers = ['stuff_type', 'stuff_date', 'stuff_text']
csv.register_dialect('piper', delimiter='|', quoting=csv.QUOTE_NONE)

with open('C:/Python/in3.txt', 'rb') as csvfile:
    with open('C:/Python/out5.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile, dialect='piper')
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['stuff_text'] = ":".join(re.findall(t, row['stuff_text']))
            print(row['stuff_type'], row['stuff_date'], row['stuff_text'])
            writer.writerow(row)
The error path follows:
runfile('C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py', wdir='C:/Python')
['stuff_type', 'stuff_date', 'stuff_text']
('""cool stuff"', '01-25-2015', 'to test')
Traceback (most recent call last):
File "<ipython-input-3-832ce30e0de3>", line 1, in <module>
runfile('C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py', wdir='C:/Python')
File "C:\Users\Methody\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Users\Methody\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py", line 20, in <module>
row['stuff_text'] = ":".join(re.findall(t, row['stuff_text']))
File "C:\Users\Methody\Anaconda\lib\re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
I'll have to find a stronger way to clean up and remove the quotes before running the regex findall. Probably something like row['stuff_text'].replace('"', '').
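A sketch of that cleanup as a small helper (the name extract_tags is mine, not from any library): it strips the quote characters and also guards against the non-string values that trigger "TypeError: expected string or buffer", since with QUOTE_NONE the stray quotes can shift the delimiters and leave DictReader handing back None (missing column) or a list (overflow under restkey).

```python
import re

t = re.compile(r'<\*(.*?)\*>')

def extract_tags(value):
    """Strip quote characters, then pull out everything between <* and *>.

    Guards against the None / list values DictReader can produce when
    stray quotes shift the column boundaries.
    """
    if not isinstance(value, str):   # use basestring on Python 2
        return ''
    return ":".join(t.findall(value.replace('"', '')))
```

In the loop, row['stuff_text'] = extract_tags(row['stuff_text']) then replaces the bare findall call, so a malformed row yields an empty string instead of a traceback.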