Find how many switches happen at the columns of a pandas column by group-CodePudding

I have the following dataframe

import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3],
                    'time': [1,2,3,4,1,2,3,4,1,2,3,4],
                    'cat': ['a', 'a', 'b', 'c', 
                            'a', 'b', 'b', 'b', 
                            'c', 'b', 'c', 'b']
})

I want to calculate, how many times the cat changes from one time to the next, by id

So:

for id == 1, cat changes from a to a 1 time, from a to b 1 time and from b to c 1 times
for id == 2, cat changes from a to b 1 time, and from b to b 2 times
for id == 3, cat changes from c to b 2 times, and from b to c 1 time

Any ideas how I could compute that ?

Ideally the output should look something like:

pd.DataFrame({'id': [1,2,3],
              'a to a': [1,0,0],
              'a to b': [1,1,0],
              'a to c': [0,0,0],
              'b to a': [0,0,0],
'b to b': [0,2,0],
'b to c': [1,0,1],
'c to a': [0,0,0],
'c to b': [0,0,2],
'c to c': [0,0,0]
    })

CodePudding user response：

Similar to @Anky we will use shift within a group to create a label for the current and next value. Then we just need a crosstab. Since the .str.cat will keep NaN from the shift, and crosstab ignores them we can ensure we only count within group transitions.

import pandas as pd

s = foo['cat'].str.cat(' to '   foo.groupby('id')['cat'].shift(-1))
pd.crosstab(foo['id'], s)

cat  a to a  a to b  b to b  b to c  c to b
id                                         
1         1       1       0       1       0
2         0       1       2       0       0
3         0       0       0       1       2

CodePudding user response：

Probably this might be another way:

g = foo.groupby(["id"])
s = (g['cat'].shift().where(g['time'].diff().ge(1)).fillna('')).add(foo['cat'])

t = foo[['id']].assign(k=s)
out = t[t['k'].str.len()>1].groupby("id")['k'].value_counts().unstack(fill_value=0)

print(out.rename_axis(None,axis=1))

    aa  ab  bb  bc  cb
id                    
1    1   1   0   1   0
2    0   1   2   0   0
3    0   0   0   1   2

CodePudding user response：

You can use collections.Counter and itertools.pairwise:

from collections import Counter
from itertools import pairwise

foo.groupby('id')['cat'].apply(lambda x: Counter(pairwise(x))).unstack(level=0).fillna(0)

output:

id        1    2    3
(a, a)  1.0  0.0  0.0
(a, b)  1.0  1.0  0.0
(b, c)  1.0  0.0  1.0
(b, b)  0.0  2.0  0.0
(c, b)  0.0  0.0  2.0

NB. pairwise requires python ≥ 3.10, for versions below use the recipe from the documentation.

CodePudding user response：

We can use pandas.Series.shift to shift the values, then pandas.DataFrame.groupby to group the distinct values, and count.

# Have the next category be stored in next_cat
foo['next_cat'] = foo['cat'].shift(-1)

# If the use condition is to ensure that each ID's next category is
# of the same ID, we use this instead:
foo['next_cat'] = foo.groupby(['id'])['cat'].shift(-1)

# Only groupby for values that are not NaN in next_cat (ignore the last row)
# Then group by category, next category, and the ID, and count these values.
foo.loc[foo['next_cat'] == foo['next_cat']].groupby(['cat', 'next_cat', 'id']).count()

This outputs:

                 time
cat next_cat id
a   a        1      1
    b        1      1
             2      1
b   b        2      2
    c        1      1
             2      1
             3      1
c   a        1      1
    b        3      2

We can then drop the index and pivot in order to achieve your ideal shape using pandas.DataFrame.pivot_table:

# This time around we're storing the data into foo.
foo = foo.loc[foo['next_cat'] == foo['next_cat']].groupby(['cat', 'next_cat', 'id']).count()

# Reset the index so we can pivot using these columns.
foo = foo.reset_index()
foo.pivot_table(columns=foo['cat']   " to "   foo['next_cat'], values=['time'], index=['id']).fillna(0).astype(int)

This outputs:

     time
   a to a a to b b to b b to c c to a c to b
id
1       1      1      0      1      1      0
2       0      1      2      1      0      0
3       0      0      0      1      0      2