
Is there a way to hint about a pandas DataFrame's schema "statically" so that we can get code completion, static type checking, and just general predictability during coding?

I wouldn't mind duplicating the schema info in code and in a type annotation for this to work.

So maybe something roughly like mypy's comment-style type annotations:

df = pd.DataFrame({'a': [1.0, 2.4, 4.5], 'B': [1, 2, 3]})  # pd.schema: ('a': np.dtype(float)), ('B': np.dtype(int))

(or better yet have the schema specified in some external JSON file or such)

Then you could imagine things like `df.` auto-completing during coding to df.a or df.B, or mypy (and any other static code analyzer) being able to infer the type of df.B[0], and so on.

Although I'm hopeful, I'm guessing this isn't really possible (or desired...). If so, what would be a good standard for writing reusable code that returns DataFrames with specific columns? So imagine there's a function get_data() -> pd.DataFrame that returns data with columns that are known in advance - how would you make this transparent to a user of this function? Is there anything smarter / more standardized than just spelling it out in the function's docstring?
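
For concreteness, a minimal sketch of the situation (get_data and its schema are made up for illustration):

    import pandas as pd

    def get_data() -> pd.DataFrame:
        """Return data with columns 'a' (float) and 'B' (int) - known in advance."""
        return pd.DataFrame({'a': [1.0, 2.4, 4.5], 'B': [1, 2, 3]})

    df = get_data()
    x = df.B[0]  # a static checker only sees Any here; I'd like it to know it's an int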

stav
  • So you want to select the columns based on the datatype? – BENY Apr 22 '19 at 00:27
  • No, I want to statically annotate the types of the columns – stav Apr 22 '19 at 17:44
  • There seems to be some related work in progress in mypy: https://github.com/pandas-dev/pandas/issues/26792 - also see a workaround of sorts in https://stackoverflow.com/questions/46412821/type-checking-pandas-dataframes – Rafael Jan 11 '20 at 17:36
  • Related to this, is there a way to type a single Series that mypy will understand? Something like `pd.Series[str]`, `pd.Series[int]`, etc. – BallpointBen Feb 17 '21 at 22:53
  • [dataenforce](https://github.com/CedricFR/dataenforce) wraps the DataFrame to allow exactly the kind of type hinting you describe. – above_c_level Mar 15 '21 at 12:31

1 Answer


This may be something you already know, but a reliable way to get the auto-completion you are after is to develop code "live" in Jupyter notebooks, which is very common in data science work. In your case it might be appropriate to instantiate a version of the DataFrame with the types you are looking for at the top of the notebook; Jupyter will then provide autocompletion for the columns and their types as you code. This has a big advantage over a static IDE in terms of knowing what is in scope, because the DataFrame is actually loaded into memory while you are developing.
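
For example, a minimal sketch of that workflow, reusing the column names from your question:

    # First notebook cell: create an empty frame with the expected schema,
    # so the kernel has a live object for Jupyter to introspect.
    import pandas as pd

    df = pd.DataFrame({'a': pd.Series(dtype='float64'),
                       'B': pd.Series(dtype='int64')})

    # In later cells, typing `df.` + TAB now completes to df.a / df.B,
    # and df.B.dtype correctly reports int64.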

Per above_c_level's comment, dataenforce looks promising for its integration with pytest (i.e. testing after the code is written), but unless there is some fancy IDE integration I don't think it can match Jupyter's "live" knowledge of the object.
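
For reference, a usage sketch based on dataenforce's README (I haven't tested this myself, so treat the exact API as an assumption):

    import pandas as pd
    from dataenforce import Dataset, validate

    @validate  # raises at runtime if the argument doesn't match the annotation
    def process(data: Dataset["a": float, "B": int]) -> None:
        ...

    process(pd.DataFrame({'a': [1.0, 2.4, 4.5], 'B': [1, 2, 3]}))

Note this is a runtime check rather than the static analysis you asked about, which is why it pairs naturally with a test suite.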

DaveB