Create matrix with count of consecutive strings

I have a pandas DataFrame with two columns of strings, as follows:

word 1   word 2
cat      dog
dog      mouse
mouse    dog
dog      dog
dog      mouse
mouse     ...

What I would like to do in Python is build a matrix that counts how many times each word (column) follows another (row), like this:

       cat   dog   mouse
cat     0     1      0
dog     0     1      2
mouse   0     1      0

So far I have tried tokenization (which may not be the best approach) and computing a correlation matrix (but apparently pandas.DataFrame.corr does not work with strings).

Do you have any idea on how to proceed? Thanks.

CodePudding user response:

You can use pandas.crosstab to count the pairs, then reindex with the full list of words so that every word appears as both a row and a column:

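import numpy as np
import pandas as pd

# all distinct words found in either column, sorted
idx = np.unique(df.values.flatten())

# count each (word 1, word 2) pair, then add missing rows/columns as zeros
(pd.crosstab(df['word 1'], df['word 2'])
   .reindex(index=idx, columns=idx, fill_value=0)
)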

output:

word 2  ...  cat  dog  mouse
word 1                      
...       0    0    0      0
cat       0    0    1      0
dog       0    0    1      2
mouse     1    0    1      0

NB. "..." shows up as a word here because it appears as a value in your example data.
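
If you want to try this end to end, here is a minimal self-contained sketch. It assumes the six rows from the question are the whole dataset and keeps "..." as a literal string, since the real value behind it is unknown. The names counts, words and square are only illustrative, and the last step (dropping the "..." label) is an optional clean-up that is not part of the snippet above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; "..." is kept as a literal string
# (assumption: these six rows are the whole dataset).
df = pd.DataFrame({
    "word 1": ["cat", "dog", "mouse", "dog", "dog", "mouse"],
    "word 2": ["dog", "mouse", "dog", "dog", "mouse", "..."],
})

# All distinct words across both columns, sorted.
idx = np.unique(df.to_numpy().ravel())

# Count each (word 1, word 2) pair, then make the table square.
counts = (
    pd.crosstab(df["word 1"], df["word 2"])
      .reindex(index=idx, columns=idx, fill_value=0)
)

# Optional clean-up (illustrative): drop the "..." placeholder from both axes.
words = [w for w in idx if w != "..."]
square = counts.loc[words, words]
print(square)
```

Rows are the first word and columns are the word that follows it, so square.loc["dog", "mouse"] is the number of times "mouse" follows "dog".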