
Objective

I am trying to automatically generate an EDA report for each column in my dataframe, starting with value_counts().

Problem

The problem is that my function doesn't return anything. So while it prints to the console, it never writes that same output to my text file. Until now I was using this function just to generate syntax and then running it line by line in my IDE to look at all the variables, but that is not a very programmatic solution.

Notes

Once this is working, I am going to add some syntax for graphs and the output of df.describe(), but for now I can't even get the basics of what I want.

The output doesn't have to be .txt, but I thought that would be easiest while getting this to work.

I tried

import pandas as pd

def EDA(df, name):

    df.name = name  # name == string version of df
    print('#', df.name)
    for val in df.columns:
        print('# ', val, '\n', df[val].value_counts(dropna=False), '\n', sep='')
        print(df[val].value_counts(dropna=False))

path = 'Data/nameofmyfile.csv'

# name of df
activeWD = pd.read_csv(path, skiprows=6)

f = open('Output/outtext.txt', 'a+', encoding='utf-8')
f.write(EDA(activeWD, 'activeWD'))
f.close()

Also tried

  1. various versions of replacing print with return (a sketch of the return-based approach I'm after is shown below, after these attempts):

    def EDA(df, name):

        df.name = name  # name == string version of df
        print('#', df.name)
        for val in df.columns:
            print('# ', val, '\n', df[val].value_counts(dropna=False), '\n', sep='')
            return(df[val].value_counts(dropna=False))
    
  2. running the file from the Anaconda prompt:

    Python Syntax\newdataEDA.5.py >> Output.outtext.txt

which results in the following codec error:

(base) C:\Users\auracoll\Analytic Projects\IDL Attrition>Python Syntax\newdatanewlife11.5.py >> Output.outtext.txt
sys:1: DtypeWarning: Columns (3,16,39,40,41,42,49) have mixed types. Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
  File "Syntax\newdatanewlife11.5.py", line 46, in <module>
    EDA(activeWD, name='activeWD')
  File "Syntax\newdatanewlife11.5.py", line 38, in EDA
    print(df[col].value_counts(dropna=False))
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 382-385: character maps to <undefined>

I tried encoding='utf-8' and encoding='ISO-8859-1', neither of which resolves this problem.

  3. saving intermediary variables, which come back as NoneType.

    testvar = for val in df.columns: df[val].value_counts(dropna=False)

When I do this, testvar is a NoneType object of the builtins module.
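For reference, here is a minimal sketch of the return-based version the commenters describe and that I'm after: it builds the whole report as a single string so the caller can write it to a file (the paths are the same placeholders as in my code above).

import pandas as pd

def EDA(df, name):
    # Build the report as one string instead of printing it,
    # so f.write() receives text rather than None.
    lines = ['# ' + name]
    for val in df.columns:
        lines.append('# ' + str(val))
        lines.append(df[val].value_counts(dropna=False).to_string())
        lines.append('')
    return '\n'.join(lines)

activeWD = pd.read_csv('Data/nameofmyfile.csv', skiprows=6)

with open('Output/outtext.txt', 'a+', encoding='utf-8') as f:
    f.write(EDA(activeWD, 'activeWD'))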

Andrew
  • something like `df['column'].value_counts().to_frame().reset_index().to_csv(...)`? It's a bit long but should work. – Alex Nov 14 '18 at 16:55
  • Would that work for multiple columns? In the past I've used apply to create a single df of value_counts by column, but the output isn't very tidy: each var gets a new column and each set of values gets unique rows, so it creates a diagonal pattern that is hard to read. – Andrew Nov 14 '18 at 16:58
  • Same problem. If I try to save this as a var, `for val in df.columns: df[val].value_counts().to_frame().reset_index()`, it saves as NoneType. That is the same problem the above code has. – Andrew Nov 14 '18 at 17:04
  • I would consider adding intermediate steps to make sure your outputs are working as you think. For one thing, though you said you've tried it with `return`, your current code is trying to write nothing to the file, because your `EDA(activeWD, 'activeWD')` has no return, and will therefore return `None`. I would say to change those `print`s to a `return`, then assign a variable like `x=EDA(activeWD, 'activeWD')`, print that, and if it looks right, try to write it to file – G. Anderson Nov 14 '18 at 17:11
  • More info on the extra steps needed to redirect a `print` to a file: [How to redirect 'print' output to a file using python?](https://stackoverflow.com/questions/7152762/how-to-redirect-print-output-to-a-file-using-python) – G. Anderson Nov 14 '18 at 17:12
  • @G.Anderson I did read that link and several others, which is why I included that I have already tried running it from cmd. I have tried making the intermediary variables, so I do know that the problem is no return. However, none of the solutions I've read on this site or elsewhere work for me. – Andrew Nov 14 '18 at 17:15
  • It may be helpful to provide exactly what returns you've tried in your question, since you stated that you know that's the problem. – G. Anderson Nov 14 '18 at 17:23
  • OK, I gave an example of what I tried with return() instead of print(). In addition to replacing the print calls, I tried calling return() directly afterward with the same content inside the parentheses, and with various combinations of smaller chunks of the function code. – Andrew Nov 14 '18 at 18:16

1 Answer


Here's a command-line solution, although you can certainly print to a file using pure python, as your commenters suggested. I'm posting this because you mentioned you already tried using your command prompt and weren't able to get your output to print to a file. So, edit your script, `filename.py`, as follows...

import pandas as pd

df = pd.DataFrame({'Pet':['Cat','Dog','Dog','Dog','Fish'],
                   'Color':['Blue','Blue','Red','Orange','Orange'],
                   'Name':['Henry','Bob','Mary','Doggo','Henry']})

def EDA(df, name):
    df.name = name
    print('#{}\n'.format(df.name))
    for col in df.columns:
        print('#{}\n'.format(col))
        print(df[col].value_counts(dropna=False))
        print('\n')

if __name__=='__main__':
    EDA(df, name='test')

Then you should be able to run `python filename.py > output.txt` in your terminal.
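If you'd rather do it in pure python, a minimal sketch of the same idea is to redirect the function's prints straight into a file handle (the output filename here is just an example):

from contextlib import redirect_stdout

# Send everything EDA() prints to a file instead of the console.
with open('output.txt', 'w', encoding='utf-8') as f:
    with redirect_stdout(f):
        EDA(df, name='test')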

EDIT

For posterity's sake, OP's issue was not with how they were printing to file; instead, their csv contained uncommon characters that the default Windows console encoding (cp1252) couldn't represent when stdout was redirected. The solution involved setting Python's I/O encoding to UTF-8 before running the code, as shown here: python 3.2 UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 9629: character maps to <undefined>

chcp 65001
set PYTHONIOENCODING=utf-8
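
Alternatively, on Python 3.7+ you can reconfigure stdout from inside the script instead; a minimal sketch:

import sys

# Force stdout to UTF-8 so redirected output doesn't fall back to cp1252 (Python 3.7+).
sys.stdout.reconfigure(encoding='utf-8')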
Dascienz
  • This isn't working for me due to a codec error. I should have specified earlier that when I tried this type of solution I received a codec error; the question has been updated. – Andrew Nov 14 '18 at 18:14
  • @Andrew, that error is due to how you're importing your `pandas.DataFrame`, not how you're writing to file. There are mixed `dtypes` within your columns. Please try reading in your `pandas.DataFrame` with an encoding argument as follows: `activeWD = pd.read_csv(path, skiprows=6, encoding='ISO-8859-1')`. – Dascienz Nov 14 '18 at 18:36
  • Hmm, it imports fine in the IDE... Similar error, now it says `UnicodeEncodeError: 'charmap' codec can't encode character '\x95' in position 457: character maps to <undefined>` – Andrew Nov 14 '18 at 19:03
  • Test the code yourself on a different dataset to see if it works; that way you'll be able to narrow it down to the dataset you're trying to work with. It works on the example `pandas.DataFrame` that I wrote out in my answer, but you should be wary of mixed data in the set you're currently working on. Maybe try `encoding='utf-8'` for dealing with unicode characters. ALSO, it's good practice to look at the column values that your code is failing on to better understand the issue. – Dascienz Nov 14 '18 at 19:07
  • The codec is struggling because my dataset contains Chinese lettering. I typically use ISO-8859-1 for this because it works. When I run a python session in the terminal (the one that throws the codec error), I can piece this code together and it prints a box with a [?] rather than throwing an error. I'm not sure what the way around that is, or if there is one, which is why I was trying the f.write() solution, but that doesn't work either. – Andrew Nov 14 '18 at 19:30
  • Yes, there's a way around that. You should be able to remedy this issue by importing your csv with the proper encoding argument. You've definitely tried all different encodings? Try `encoding='utf-8'`, `encoding='gbk'`, or even `encoding='gb2312'` in your `read_csv` line. – Dascienz Nov 14 '18 at 19:36
  • `'gbk' codec can't decode byte 0xa2 in position 20: illegal multibyte sequence`. latin1, utf-8, and ISO-8859-1 throw the error we've already mentioned: \x95 – Andrew Nov 14 '18 at 19:44
  • `encoding='gb2312'`? I've got my fingers crossed for one of these as I have no idea what characters are in your columns, haha. You could also try `utf-16` and `utf-32`, but if `gb2312` doesn't work I doubt those would. – Dascienz Nov 14 '18 at 19:53
  • Ugh, none of those. I started moving through the ones listed here: https://docs.python.org/3/library/codecs.html#standard-encodings with similar results. When I open the file with open(path), it shows encoding=cp1252, but that does not work in the script. When I open it in Notepad, it shows a utf-8 encoding. This dataset is somewhat tricky... some of the column names have quotes; characters like [, (, or -; spaces; etc., and the values are from a global database covering 30 different countries. I started with ISO-8859-1 (after utf-8 didn't work) because that has worked in the past. – Andrew Nov 14 '18 at 20:21
  • Is there any chance you could post the dataset somewhere, if it's not sensitive information? I'd like to take a personal look and see if there's any other low hanging fruit to try out. – Dascienz Nov 14 '18 at 21:53
  • Unfortunately it is all confidential data. I could probably post a few examples of the Chinese lettering and some of the field names. I used chardet, which gave a 100% probability that it is utf-8 (I disagree). I am also trying to implement something discussed here: https://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console/32176732#32176732, but so far nothing has worked, and I have win_unicode_console enabled. – Andrew Nov 15 '18 at 13:21
  • Maybe post a toy dataframe which contains the characters your program is failing on? – Dascienz Nov 15 '18 at 13:38
  • OK. It is hard for me to figure out which specific characters are failing, but I will do this and update the question... a lot of new info is in this comment section. – Andrew Nov 15 '18 at 13:42
  • SOLVED. I used the solution here https://stackoverflow.com/a/28041598/7747975, which worked. Even though the console was reading utf-8, stdout was writing in cp1252. I couldn't make an example df, because when I used f.read() to look at the position of the characters throwing an error, they were just normal characters like "e" or " Wo". – Andrew Nov 15 '18 at 15:01
  • Very happy to hear you solved this issue. Good luck on your project! – Dascienz Nov 15 '18 at 15:35