
I'm using Pandas to automate analysis of a variety of different 3rd-party reports. Most are in CSV format.

Assuming only correct files are loaded into the program, I need to:

  • identify the origin of the report (3rd party), based on
    • schema
    • predictable column values
  • store historical reports of the same origin,
  • return the origin, and maybe a few other details

I only need to manage 10 reports in the beginning, but I imagine it could grow to identifying upwards of several hundred, a scale that a flat file and some dictionaries couldn't handle well. Still, why reinvent the wheel, ...
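For illustration, here's a minimal sketch of the dictionary-based matching I have in mind (the vendor names, column sets, and value checks are invented):

    from typing import Optional

    import pandas as pd

    # Hypothetical registry: each known 3rd-party report is keyed by its expected
    # column set plus a "predictable value" check on one of its columns.
    KNOWN_SCHEMAS = {
        "vendor_a": {
            "columns": {"invoice_id", "amount", "billing_date"},
            "check": lambda df: df["invoice_id"].str.startswith("VA-").all(),
        },
        "vendor_b": {
            "columns": {"order_no", "total", "shipped_on", "carrier"},
            "check": lambda df: df["carrier"].isin(["UPS", "FedEx"]).all(),
        },
    }

    def identify_origin(df: pd.DataFrame) -> Optional[str]:
        """Return the registered origin whose columns and value check both match."""
        for origin, spec in KNOWN_SCHEMAS.items():
            if set(df.columns) == spec["columns"] and spec["check"](df):
                return origin
        return None

    # e.g. origin = identify_origin(pd.read_csv("some_report.csv"))

This works while everything fits in a couple of dictionaries, which is exactly why I doubt it will scale past a handful of report types without a proper registry.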

Are there packages to register/identify schemas for a Pandas data analysis workflow?

xtian

1 Answer


I've taken a first pass at a solution, which I'll offer as an answer. I've implemented a class-based approach with defaultdict. Here's the basic outline (a sketch of the structure follows the list):

  • A class (OOP) structure to register, handle, and access schemas in my scripts:
    • Report(object)
    • ChildReport(Report)
  • A 'vividict', or multi-dimensional dictionary structure, built on Python's defaultdict to hold the collection of reports:
    • client_reports['date']['type'] = ChildReport(self)
  • A ReportsManager(object) class that initializes the vividict and collects the methods for accessing and managing the collection, one instance per client.
  • Python's pickle module to store each client's ReportsManager object.
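A minimal sketch of that outline (the column names and method details are illustrative, not my exact implementation):

    import pickle
    from collections import defaultdict

    import pandas as pd

    def vividict():
        """Arbitrarily nested defaultdict, so reports[date][type] 'just works'."""
        return defaultdict(vividict)

    class Report(object):
        """Base class: holds a parsed report plus the schema that identifies it."""
        expected_columns = set()

        def __init__(self, df):
            self.df = df

        @classmethod
        def matches(cls, df):
            return set(df.columns) == cls.expected_columns

    class ChildReport(Report):
        # Hypothetical schema for one 3rd-party origin.
        expected_columns = {"invoice_id", "amount", "billing_date"}

    class ReportsManager(object):
        """One instance per client: owns the nested dict of historical reports."""
        def __init__(self, client):
            self.client = client
            self.reports = vividict()

        def add(self, date, report_type, report):
            self.reports[date][report_type] = report

        def save(self, path):
            with open(path, "wb") as fh:
                pickle.dump(self, fh)

        @staticmethod
        def load(path):
            with open(path, "rb") as fh:
                return pickle.load(fh)

    # e.g.
    # mgr = ReportsManager("client_a")
    # mgr.add("2015-06-01", "billing", ChildReport(pd.read_csv("report.csv")))
    # mgr.save("client_a.pkl")

One caveat I'm aware of: the pickled ReportsManager can only be loaded where these class definitions (and the module-level vividict function) are importable.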

I have a few doubts about how I structured the defaultdict with the ReportsManager class. It's a start.
