
I'm using Pandas to automate analysis of a variety of different 3rd-party reports. Most are in CSV format.

Assuming only correct files are loaded into the program, I need to:

  • identify the origin of the report (3rd party), based on
    • schema
    • predictable column values
  • store historical reports of the same origin,
  • return the origin, and maybe a few other details

I only need to manage 10 reports in the beginning, but I imagine it could grow to identifying upwards of several hundred, a scale that a flat file and some dictionaries couldn't handle well. Still, why reinvent the wheel, ...
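For illustration, here's a minimal sketch of the dictionary-based matching I have in mind (the vendor names, column sets, and value checks are invented):

    from typing import Optional

    import pandas as pd

    # Hypothetical registry: each known 3rd-party report is keyed by its expected
    # column set plus a "predictable value" check on one of its columns.
    KNOWN_SCHEMAS = {
        "vendor_a": {
            "columns": {"invoice_id", "amount", "billing_date"},
            "check": lambda df: df["invoice_id"].str.startswith("VA-").all(),
        },
        "vendor_b": {
            "columns": {"order_no", "total", "shipped_on", "carrier"},
            "check": lambda df: df["carrier"].isin(["UPS", "FedEx"]).all(),
        },
    }

    def identify_origin(df: pd.DataFrame) -> Optional[str]:
        """Return the registered origin whose columns and value check both match."""
        for origin, spec in KNOWN_SCHEMAS.items():
            if set(df.columns) == spec["columns"] and spec["check"](df):
                return origin
        return None

    # e.g. origin = identify_origin(pd.read_csv("some_report.csv"))

This works while everything fits in a couple of dictionaries, which is exactly why I doubt it will scale past a handful of report types without a proper registry.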

Are there packages to register/identify schemas for a Pandas data analysis workflow?

xtian

1 Answer


I've taken a first pass at a solution, which I'll offer as an answer. I've implemented a class-based approach with defaultdict. Here's the basic outline (a sketch of the structure follows the list):

  • A class (OOP) structure to register, handle, and access schemas in my scripts:
    • Report(object)
    • ChildReport(Report)
  • A 'vividict', or multi-dimensional dictionary structure, built on Python's defaultdict to hold the collection of reports:
    • client_reports['date']['type'] = ChildReport(self)
  • A ReportsManager(object) class that initializes the vividict and collects the methods for accessing and managing the collection, one instance per client.
  • Python's pickle module to store each client's ReportsManager object.
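A minimal sketch of that outline (the column names and method details are illustrative, not my exact implementation):

    import pickle
    from collections import defaultdict

    import pandas as pd

    def vividict():
        """Arbitrarily nested defaultdict, so reports[date][type] 'just works'."""
        return defaultdict(vividict)

    class Report(object):
        """Base class: holds a parsed report plus the schema that identifies it."""
        expected_columns = set()

        def __init__(self, df):
            self.df = df

        @classmethod
        def matches(cls, df):
            return set(df.columns) == cls.expected_columns

    class ChildReport(Report):
        # Hypothetical schema for one 3rd-party origin.
        expected_columns = {"invoice_id", "amount", "billing_date"}

    class ReportsManager(object):
        """One instance per client: owns the nested dict of historical reports."""
        def __init__(self, client):
            self.client = client
            self.reports = vividict()

        def add(self, date, report_type, report):
            self.reports[date][report_type] = report

        def save(self, path):
            with open(path, "wb") as fh:
                pickle.dump(self, fh)

        @staticmethod
        def load(path):
            with open(path, "rb") as fh:
                return pickle.load(fh)

    # e.g.
    # mgr = ReportsManager("client_a")
    # mgr.add("2015-06-01", "billing", ChildReport(pd.read_csv("report.csv")))
    # mgr.save("client_a.pkl")

One caveat I'm aware of: the pickled ReportsManager can only be loaded where these class definitions (and the module-level vividict function) are importable.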

I have a few doubts about how I structured the defaultdict with the ReportsManager class. It's a start.
