I have the following dataframe
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3],
'time': [1,2,3,4,1,2,3,4,1,2,3,4],
'cat': ['a', 'a', 'b', 'c',
'a', 'b', 'b', 'b',
'c', 'b', 'c', 'b']
})
I want to calculate, how many times the cat
changes from one time
to the next, by id
So:
- for
id == 1
,cat
changesfrom a to a
1 time,from a to b
1 time andfrom b to c
1 times - for
id == 2
,cat
changesfrom a to b
1 time, andfrom b to b 2
times - for
id == 3
,cat
changesfrom c to b 2
times, andfrom b to c 1
time
Any ideas how I could compute that ?
Ideally the output should look something like:
pd.DataFrame({'id': [1,2,3],
'a to a': [1,0,0],
'a to b': [1,1,0],
'a to c': [0,0,0],
'b to a': [0,0,0],
'b to b': [0,2,0],
'b to c': [1,0,1],
'c to a': [0,0,0],
'c to b': [0,0,2],
'c to c': [0,0,0]
})
CodePudding user response:
Similar to @Anky we will use shift
within a group to create a label for the current and next value. Then we just need a crosstab
. Since the .str.cat
will keep NaN
from the shift, and crosstab ignores them we can ensure we only count within group transitions.
import pandas as pd
s = foo['cat'].str.cat(' to ' foo.groupby('id')['cat'].shift(-1))
pd.crosstab(foo['id'], s)
cat a to a a to b b to b b to c c to b
id
1 1 1 0 1 0
2 0 1 2 0 0
3 0 0 0 1 2
CodePudding user response:
Probably this might be another way:
g = foo.groupby(["id"])
s = (g['cat'].shift().where(g['time'].diff().ge(1)).fillna('')).add(foo['cat'])
t = foo[['id']].assign(k=s)
out = t[t['k'].str.len()>1].groupby("id")['k'].value_counts().unstack(fill_value=0)
print(out.rename_axis(None,axis=1))
aa ab bb bc cb
id
1 1 1 0 1 0
2 0 1 2 0 0
3 0 0 0 1 2
CodePudding user response:
You can use collections.Counter
and itertools.pairwise
:
from collections import Counter
from itertools import pairwise
foo.groupby('id')['cat'].apply(lambda x: Counter(pairwise(x))).unstack(level=0).fillna(0)
output:
id 1 2 3
(a, a) 1.0 0.0 0.0
(a, b) 1.0 1.0 0.0
(b, c) 1.0 0.0 1.0
(b, b) 0.0 2.0 0.0
(c, b) 0.0 0.0 2.0
NB. pairwise
requires python ≥ 3.10, for versions below use the recipe from the documentation.
CodePudding user response:
We can use pandas.Series.shift
to shift the values, then pandas.DataFrame.groupby
to group the distinct values, and count.
# Have the next category be stored in next_cat
foo['next_cat'] = foo['cat'].shift(-1)
# If the use condition is to ensure that each ID's next category is
# of the same ID, we use this instead:
foo['next_cat'] = foo.groupby(['id'])['cat'].shift(-1)
# Only groupby for values that are not NaN in next_cat (ignore the last row)
# Then group by category, next category, and the ID, and count these values.
foo.loc[foo['next_cat'] == foo['next_cat']].groupby(['cat', 'next_cat', 'id']).count()
This outputs:
time
cat next_cat id
a a 1 1
b 1 1
2 1
b b 2 2
c 1 1
2 1
3 1
c a 1 1
b 3 2
We can then drop the index and pivot in order to achieve your ideal shape using pandas.DataFrame.pivot_table
:
# This time around we're storing the data into foo.
foo = foo.loc[foo['next_cat'] == foo['next_cat']].groupby(['cat', 'next_cat', 'id']).count()
# Reset the index so we can pivot using these columns.
foo = foo.reset_index()
foo.pivot_table(columns=foo['cat'] " to " foo['next_cat'], values=['time'], index=['id']).fillna(0).astype(int)
This outputs:
time
a to a a to b b to b b to c c to a c to b
id
1 1 1 0 1 1 0
2 0 1 2 1 0 0
3 0 0 0 1 0 2