2

I am trying to perform several operations in one program at same time. I have a data-frame that has Dates of which I have no clue of start and end and I want to find:

  1. Total number of days the data-set has
  2. Total number of hours
  3. Median of the Count
  4. Write a separate output for median per day/date.
  5. If possible Median-of-Median in most possible simple way.

Input: Few rows from the a large file of GB size

2004-01-05,16:00:00,17:00:00,Mon,10766,656
2004-01-05,17:00:00,18:00:00,Mon,12223,670
2004-01-05,18:00:00,19:00:00,Mon,12646,710
2004-01-05,19:00:00,20:00:00,Mon,19269,778
2004-01-05,20:00:00,21:00:00,Mon,20504,792
2004-01-05,21:00:00,22:00:00,Mon,16553,783
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735

The start and end date are NOT known.

Edit: Expected Output:1 will have only one row of results

days,hours,median,median-of-median
2,17262,13,17398

Median-of-Median is the median value of median column from output 2

Expected Output:2, will have medians of every date which are to used to find median-of-median

date,median
2004-01-05,17534
2004-01-06,17262

Code:

import pandas as pd 
from datetime import datetime

df = pd.read_csv('one_hour.csv')
df.columns = ['date', 'startTime', 'endTime', 'day', 'count', 'unique']

date_count = df.count(['date'])
all_median = df.median(['count'])
all_hours = df.count(['startTime'])
med_med = df.groupby(['date','count']).median()

print date_count
print all_median
print all_hours

stats = ['date_count', 'all_median', 'all_hours', 'median-of-median']
stats.to_csv('stats_all.csv', index=False)

med_med.to_csv('med_day.csv', index=False, header=False)

Obviously the code does not give the result as it is supposed to.

The error is shown below.

Error:

Traceback (most recent call last):
  File "day_median.py", line 8, in <module>
    all_median = df.median(['count'])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 5310, in stat_func
    numeric_only=numeric_only)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4760, in _reduce
    axis = self._get_axis_number(axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 308, in _get_axis_number
    axis = self._AXIS_ALIASES.get(axis, axis)
TypeError: unhashable type: 'list'
Sitz Blogz
  • 927
  • 5
  • 23
  • 46

1 Answers1

3

IIUC maybe help change:

date_count = df.count(['date'])
all_median = df.median(['count'])
all_hours = df.count(['startTime'])

to:

date_count = df['date'].count()
all_median = df['count'].median()
all_hours = df['startTime'].count()

print (date_count)
print (all_median)
print (all_hours)
13
17262.0
13

if need count statistics from columns date, count and startTime.

EDIT by comment:

If need count unique values of column use nunique:

date_count = df['date'].nunique()
print (date_count)
2

DataFrame stats:

cols = ['date_count', 'all_median', 'all_hours']
stats = pd.DataFrame([[date_count, all_median, all_hours]], columns = cols)
print (stats)
   date_count  all_median  all_hours
0           2     17262.0         13
jezrael
  • 629,482
  • 62
  • 918
  • 895
  • Thank you so much .. But If you see in the input the number of days are 2 as it has only 2 dates and hours are 13 – Sitz Blogz Jul 15 '16 at 21:32
  • 1
    Hmmm, can you add desired output in `print (date_count) print (all_median) print (all_hours)` ? – jezrael Jul 15 '16 at 21:34
  • I will put these output in a csv, including the new headers. – Sitz Blogz Jul 15 '16 at 21:36
  • 1
    Please see edit. Another question: Do you need create `DataFrame` from all variables? – jezrael Jul 15 '16 at 21:40
  • Please can you check the question again. I just now included the expected output – Sitz Blogz Jul 15 '16 at 21:46
  • 1
    Ok, give me a few minutes. – jezrael Jul 15 '16 at 21:50
  • What is median of median? Do you need it in `stats`? Because in `stats = ['date_count', 'all_median', 'all_hours']` is missing. – jezrael Jul 15 '16 at 21:57
  • I was trying to find median of median but I could not find the simplest way to do that. Hence did not include in print and csv. – Sitz Blogz Jul 15 '16 at 21:59
  • 1
    I dont understand median of median. can you explain more? – jezrael Jul 15 '16 at 22:09
  • Idea of [median of median] (http://stackoverflow.com/questions/9489061/understanding-median-of-medians-algorithm) In our case to make it more simple what we need to do is find median value of `count` of every date and write all those medians in a separate csv and later find the median value of those medians from new csv and write back to stats. so we can either find medians of all counts with respect to dates and then find the stats. – Sitz Blogz Jul 15 '16 at 22:15
  • And do you need median of median of colum ``count` or `unique`? – jezrael Jul 15 '16 at 22:23
  • Actually both but when one works the other is just the replica. I have also made edit in expected outputs of how it will be with given input. Please check once. – Sitz Blogz Jul 15 '16 at 22:25
  • Unfortunately i cant help you - I dont know algo for finding true median. – jezrael Jul 15 '16 at 22:38
  • We do not really have to follow the true Median algorithm. We can follow the simple technique that I explained. First find all the medians with respect to date and then find the median value of those medians. The algorithm implementation in literal manner is complex. We need simplified version here. – Sitz Blogz Jul 15 '16 at 22:41
  • How can we round off the float value of the median? I did try `.median().round()` but that gives an error. – Sitz Blogz Jul 15 '16 at 22:48
  • 1
    Use `all_median = round(df['count'].median())` – jezrael Jul 15 '16 at 22:50
  • 1
    Maybe check [this](http://stackoverflow.com/questions/56820/round-in-python-doesnt-seem-to-be-rounding-properly). – jezrael Jul 15 '16 at 22:55