Home > Net >  Pandas longest final stretch within each group
Pandas longest final stretch within each group

Time:09-11

I have the following Pandas dataframe:

import pandas as pd

    df = pd.DataFrame(
    
        [
            ("bird", '2022-01',"Falconiformes"),
            ("bird", '2022-02',"Falconiformes"),
            ("bird", '2022-03',"Falconiformes"),
            ("bird", '2022-04',"Falconiformes"),
            ("bird", '2022-05',"Falconiformes"),
            ("bird", '2022-06',"Falconiformes"),
            ("bird", '2022-07',"Falconiformes"),
            ("bird", '2022-08',"Falconiformes"),
            ("bird", '2022-09',"Psittaciformes"),
            ("bird", '2022-10',"Psittaciformes"),
            ("bird", '2022-11',"Psittaciformes"),
            ("bird", '2022-12',"Psittaciformes"),
            ("mammal", '2022-01',"Falconiformes"),
            ("mammal", '2022-02',"Falconiformes"),
            ("mammal",'2022-03',"Falconiformes"),
            ("mammal", '2022-04',"Falconiformes"),
            ("mammal",'2022-05',"Falconiformes"),
            ("mammal", '2022-06',"Psittaciformes"),
            ("mammal", '2022-07',"Falconiformes"),
            ("mammal", '2022-08',"Falconiformes"),
            ("mammal", '2022-09',"Falconiformes"),
            ("mammal", '2022-10',"Falconiformes"),
            ("mammal", '2022-11',"Falconiformes"),
            ("mammal", '2022-12',"Falconiformes"),
        
    
        ],
    
        columns=("animal", "date", "attribute"),
    )

Now it's getting complicated. For each type of animal I want the count of the latest consecutive sequence of values within that group.

The result should like

result = pd.DataFrame(
    [   ("bird", 'Psittaciformes' ,4),
        ("mammal", 'Falconiformes' ,6),
    ],
    columns=("animal", "attribute", "count"),
)
result

I think it could be programmed with itergroup or something like that. What I'm looking for is a oneliner. It should be possible, is it?

CodePudding user response:

You can use groupby.agg with a custom function to compute the count:

(df.groupby('animal', as_index=False)
   .agg(attribute=('attribute', 'last'),
        count=('attribute', lambda s: s.eq(s.iloc[-1])[::-1].cummin().sum())
       )
)

output:

   animal       attribute  count
0    bird  Psittaciformes      4
1  mammal   Falconiformes      6

Function:

s.eq(s.iloc[-1])   # identify values equal to last one
[::-1]             # inverse Series
.cummin()          # set all values False after the first False
.sum()             # count the True

CodePudding user response:

Another solution:

df_out = df.groupby("animal", as_index=False).apply(
    lambda x: x.groupby((x.attribute != x.attribute.shift()).cumsum())
    .agg(
        animal=("animal", "first"),
        attribute=("attribute", "first"),
        count=("animal", "count"),
    )
    .iloc[-1]
)

print(df_out)

Prints:

   animal       attribute  count
0    bird  Psittaciformes      4
1  mammal   Falconiformes      6
  • Related