Is there a contiguous group labelling capability in Pandas

Question

I have been puzzling over this problem for some time now. I was wandering if there is some "Pandas" like way to get there. I have a simple DataFrame with two columns PivotHigh and PivotLow, representing high values and low values. I need to "connect the dots" in pairs from the lowest low to the highest high. Sounds simple but there is some complications:

Rows do not clearly alternate, there will be some overlaps.
There will be runs in a column, when such occurs the highest/lowest value of the run must be selected

In the image I have annotated the "runs" in yellow, and the values (max/min) that should be selected in green

I have been trying something along the lines of:

def find_swings(df_pivots, 
column_names={"HighPivots":"HighPivots","LowPivots":"LowPivots"}):
"""
    Given a Dataframe of PivotHigh and PivotLow will return a Dataframe describing the pivots joined together.

    Parameters:
        df_pivots (DataFrame): Contains pivot information
            PivotHigh   - A sereis of pivot highs
            PivotLow    - A series of pivot lows

        column_names (dictionary): Specify the column names for the relevant columns

    Returns:
        DataFrame:
            StartBar        - Start Date of the Bar
            EndBar          - End Date of the Bar
            StartValue      - Starting value of the swing
            EndValue        - Ending value of the swing
            Diff            - EndValue - StartValue
            SwingType       - "Up" | "Down"
            N_Bars          - Number of Bars
            Gradient        - (EndValue-StartValue)/N_Bars
            Angle           - Angle of the Swing in Radians
            Magnitude       - Magnitude of the Swing: Sqrt(Diff^2 + N_Bars^2)
            
            (Next Phase)
            ClosePriceArea  - Close Price Area: The unsigned area enclosed by the swing
                                Lower means tight fit on trend
"""
df_ret = pd.DataFrame()
df = df.reindex()
start_index = 0
current_pivot = "PivotLow" if df["PivotLow"].index.min() <= df["PivotHigh"].index.min() else "PivotHigh"
alternate_pivot = "PivotHigh" if current_pivot == "PivotLow" else "PivotLow"

while(True):
    swing_end = df_t.loc[df_t[alternate_pivot].notnull()].index[0]
    swing_start = df.iloc[df.iloc[:swing_end]]
    df_trimmed = df_ret[swing_start:swing_end]
    if (current_pivot == "PivotLow"):
        df_ret["StartDate"] = df_trimmed.loc[df_trimmed[current_pivot].idxmin(axis=1)]["Date"]
        #....
    else: 
        df_ret["StartDate"] = df_trimmed.loc[df_trimmed[current_pivot].idxmax(axis=1)]["Date"]
        #....

But from there things seem to snowball out of control.

Is there a way to introduce a new sequential label column on the "groups", then I will be able to perform some max/min aggregation on them?

Or any other general advice/approaches?

Thanks

Data:

Date,PivotHigh,PivotLow
2020-01-01 02:45:00,,7210.59
2020-01-01 04:30:00,,7195.01
2020-01-01 06:30:00,,7204.53
2020-01-01 07:15:00,7209.15,
2020-01-01 11:15:00,7215.0,
2020-01-01 11:30:00,,7198.99
2020-01-01 14:30:00,,7200.19
2020-01-01 15:30:00,7238.14,
2020-01-01 17:15:00,,7221.18
2020-01-01 18:15:00,,7219.01
2020-01-01 20:30:00,,7209.6
2020-01-01 21:00:00,7222.6,
2020-01-01 21:45:00,,7210.73
2020-01-02 00:45:00,7186.18,7179.77
2020-01-02 06:30:00,,7112.27
2020-01-02 10:30:00,7156.65,7139.93
2020-01-02 14:45:00,7137.23,
2020-01-02 19:45:00,6976.17,6965.0
2020-01-02 22:45:00,,6954.34
2020-01-03 04:30:00,7281.24,
2020-01-03 07:15:00,7220.0,
2020-01-03 10:15:00,7361.7,7316.31
2020-01-03 12:15:00,7339.31,7329.66
2020-01-03 13:45:00,7369.61,7319.0
2020-01-03 16:15:00,7401.2,
2020-01-03 16:45:00,,7358.18
2020-01-03 20:00:00,7349.44,
2020-01-03 21:15:00,,7319.45
2020-01-03 23:30:00,,7315.0
2020-01-04 00:00:00,7339.99,
2020-01-04 03:00:00,7332.89,
2020-01-04 05:30:00,7359.99,
2020-01-04 07:00:00,,7341.89
2020-01-04 07:15:00,7354.98,
2020-01-04 10:15:00,7357.39,7343.95
2020-01-04 11:45:00,,7319.19
2020-01-04 15:15:00,7317.58,
2020-01-04 15:30:00,,7304.6
2020-01-04 18:00:00,,7328.74
2020-01-04 18:30:00,7398.0,
2020-01-04 21:30:00,7358.51,7341.03
2020-01-05 02:00:00,7483.28,7441.4
2020-01-05 03:45:00,,7460.0
2020-01-05 04:00:00,7488.87,
2020-01-05 06:00:00,7478.24,
2020-01-05 07:45:00,,7462.22
2020-01-05 08:00:00,7475.0,
2020-01-05 13:00:00,7460.03,7446.36
2020-01-05 16:30:00,7479.52,
2020-01-05 17:45:00,,7460.0

can you share enough rows from your input as text not image, you can use [this QA](https://stackoverflow.com/questions/52413246/how-to-provide-a-reproducible-copy-of-your-dataframe-with-to-clipboard) to help how to do it — Ben.T, Jul 31 '20 at 11:10

Quang Hoang · Accepted Answer · 2020-07-31T11:15:32.513

1

Let's try cumsum on isna to identify the consecutive non-nan blocks, then use groupby().ngroup() to identify the ordinary number of each block:

new_df = pd.DataFrame()
for col, op in zip(df.columns, [np.max, np.min]):
    s = df[col]

    blocks = s.isna().cumsum()
    non_na_blocks = s.dropna().groupby(blocks).ngroup()
    new_df[col] = s.dropna().groupby(non_na_blocks).apply(op)

Output (for first two blocks):

   PivotHigh  PivotLow
0    7215.00   7195.01
1    7238.14   7198.99

edited Jul 31 '20 at 11:15

answered Jul 31 '20 at 11:10

Quang Hoang

117,517
10
34
52

Unbelievably creative solution! This looks really promising. However it breaks when the blocks are overlapped. If you look at my 4 yellow annotation on the right hand column you see the PivotHigh splits the pivot low block. The sequence when translated to your approach should look as such: ``` 0....correct as is 1.... correct as is 2. 7222 7209 (still correct) 3. 7186 7210 4. 7156 7112 ``` Not too sure how to approach that issue. Perhaps somehow splitting the block where an overlap takes place by inserting a null entry? – Slappy Jul 31 '20 at 13:18
1

With regards to my above issue on the block overlaps. I tried this and it worked, but I have some serious reservations about it: df_pivots.loc[(df_pivots["PivotHigh"] + df_pivots["PivotLow"]).notnull(),"PivotLow"] = np.NaN I think it needs to be a lot smarter than that by looking back at the last assignment. This works when the conflict is PivotLow. Not sure how it reacts when the conflicted block is PivotHigh – Slappy Jul 31 '20 at 13:34

Is there a contiguous group labelling capability in Pandas

1 Answers1