I have the following Pandas dataframe:
import pandas as pd
df = pd.DataFrame(
[
("bird", '2022-01',"Falconiformes"),
("bird", '2022-02',"Falconiformes"),
("bird", '2022-03',"Falconiformes"),
("bird", '2022-04',"Falconiformes"),
("bird", '2022-05',"Falconiformes"),
("bird", '2022-06',"Falconiformes"),
("bird", '2022-07',"Falconiformes"),
("bird", '2022-08',"Falconiformes"),
("bird", '2022-09',"Psittaciformes"),
("bird", '2022-10',"Psittaciformes"),
("bird", '2022-11',"Psittaciformes"),
("bird", '2022-12',"Psittaciformes"),
("mammal", '2022-01',"Falconiformes"),
("mammal", '2022-02',"Falconiformes"),
("mammal",'2022-03',"Falconiformes"),
("mammal", '2022-04',"Falconiformes"),
("mammal",'2022-05',"Falconiformes"),
("mammal", '2022-06',"Psittaciformes"),
("mammal", '2022-07',"Falconiformes"),
("mammal", '2022-08',"Falconiformes"),
("mammal", '2022-09',"Falconiformes"),
("mammal", '2022-10',"Falconiformes"),
("mammal", '2022-11',"Falconiformes"),
("mammal", '2022-12',"Falconiformes"),
],
columns=("animal", "date", "attribute"),
)
Now it's getting complicated. For each type of animal I want the count of the latest consecutive sequence of values within that group.
The result should like
result = pd.DataFrame(
[ ("bird", 'Psittaciformes' ,4),
("mammal", 'Falconiformes' ,6),
],
columns=("animal", "attribute", "count"),
)
result
I think it could be programmed with itergroup or something like that. What I'm looking for is a oneliner. It should be possible, is it?
CodePudding user response:
You can use groupby.agg
with a custom function to compute the count
:
(df.groupby('animal', as_index=False)
.agg(attribute=('attribute', 'last'),
count=('attribute', lambda s: s.eq(s.iloc[-1])[::-1].cummin().sum())
)
)
output:
animal attribute count
0 bird Psittaciformes 4
1 mammal Falconiformes 6
Function:
s.eq(s.iloc[-1]) # identify values equal to last one
[::-1] # inverse Series
.cummin() # set all values False after the first False
.sum() # count the True
CodePudding user response:
Another solution:
df_out = df.groupby("animal", as_index=False).apply(
lambda x: x.groupby((x.attribute != x.attribute.shift()).cumsum())
.agg(
animal=("animal", "first"),
attribute=("attribute", "first"),
count=("animal", "count"),
)
.iloc[-1]
)
print(df_out)
Prints:
animal attribute count
0 bird Psittaciformes 4
1 mammal Falconiformes 6