I got the following table in pandas:
x | y |
---|---|
1 | 1 |
2 | 3 |
2 | 5 |
2 | 4 |
1 | 4 |
1 | 5 |
I would like to see the pattern of variable x
, so I would like to see the pattern goes.
In the table you would see x=1 then x=2 for three times and then go back to x=1 twice.
x | # occurance | first y value | last y value |
---|---|---|---|
1 | 1 | 1 | 1 |
2 | 3 | 3 | 4 |
1 | 2 | 4 | 5 |
I initially tried groupby
but using groupby would group all x together, and that is not the desired way.
For convenience, I should provide the content of the dataframe.
data = {'x': [1, 2, 2, 2, 1, 1],
'y': [1, 3, 5, 4, 4, 5]}
df = pd.DataFrame(data)
Thanks in advance.
CodePudding user response:
This is an island problem. We need to first group the x values into islands by taking the index
and subtracting the cumcount
of that x
group:
group = df.index - df.groupby('x').cumcount()
Output:
0 0
1 1
2 1
3 1
4 3
5 3
dtype: int64
We can now group by that variable, adding the appropriate list of functions for the desired outputs:
df.groupby(group).agg({'x':[('x', 'first'), ('# occurrence', 'size')], 'y':[('first y', 'first'), ('last y', 'last')]}).reset_index(drop=True)
Output:
x y
x # occurrence first y last y
0 1 1 1 1
1 2 3 3 4
2 1 2 4 5
Dependent on your version of pandas (>=0.25), you could also use a dict to specify the aggregations:
df.groupby(group).agg(**{ 'x' :('x', 'first'), '# occurrence':('x', 'count'), 'first y':('y', 'first'), 'last y':('y', 'last')})
Output:
x # occurrence first y last y
0 1 1 1 1
1 2 3 3 4
3 1 2 4 5
CodePudding user response:
Use:
g = (df['x'].shift(1, fill_value=df['x'].iloc[0])!=df['x']).cumsum()
from collections import Counter
df.groupby(g).agg({'x': [('x', lambda x: x.iloc[0]), ('# occurance', lambda x: list(Counter(x).values())[0])], 'y': [('first y value', lambda x: x.iloc[0]), ('last y value', lambda x: x.iloc[-1])]}).droplevel(level=0, axis=1)
Output:
x # occurance first y value last y value
x
0 1 1 1 1
1 2 3 3 4
2 1 2 4 5
CodePudding user response:
Create consecutive groups by compare shifted values with not equal and cumulative sum and then use named aggregation:
df = (df.groupby(df['x'].ne(df['x'].shift()).cumsum())
.agg(**{'x':('x','first'),
'# occurance':('x','size'),
'first y value':('y','first'),
'last y value':('y','last')})
.reset_index(drop=True))
print (df)
x # occurance first y value last y value
0 1 1 1 1
1 2 3 3 4
2 1 2 4 5