Create matrix with count of consecutive strings

Time:11-17

I have a pandas DataFrame with two columns of strings, as follows:

word 1   word 2
cat      dog
dog      mouse
mouse    dog
dog      dog
dog      mouse
mouse     ...

What I would like to do in Python is build a matrix that counts how many times one word follows another, like this:

       cat   dog   mouse
cat     0     1      0
dog     0     1      2
mouse   0     1      0

What I have tried so far is tokenization (which may not be the best approach) and computing a correlation matrix (but apparently pandas.DataFrame.corr does not work with strings).

Do you have any idea how to proceed? Thanks.

CodePudding user response:

You can use pandas.crosstab to count the pairs, then reindex so that every word appears as both a row and a column:

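import numpy as np
import pandas as pd

# all distinct words appearing in either column
idx = np.unique(df.values.flatten())

# count (word 1, word 2) pairs, then fill in missing rows/columns with 0
(pd.crosstab(df['word 1'], df['word 2'])
   .reindex(index=idx, columns=idx, fill_value=0)
)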

output:

word 2  ...  cat  dog  mouse
word 1                      
...       0    0    0      0
cat       0    0    1      0
dog       0    0    1      2
mouse     1    0    1      0

NB. ... appears here as a word because it is present in your example data.
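
For reference, here is a self-contained sketch of the same approach that can be run as-is. The DataFrame is rebuilt from the five complete rows shown in the question; the final row containing ... is left out on the assumption that it only stands for elided data. The variable name counts is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question. The trailing "..." row is omitted
# (assumption: it is only a placeholder for data that was cut off).
df = pd.DataFrame({
    "word 1": ["cat", "dog", "mouse", "dog", "dog"],
    "word 2": ["dog", "mouse", "dog", "dog", "mouse"],
})

# Every distinct word found in either column, sorted alphabetically.
idx = np.unique(df.to_numpy().ravel())

# Count each (word 1, word 2) pair, then reindex so the matrix is square
# and words that never occur in one of the columns still get a row/column.
counts = (
    pd.crosstab(df["word 1"], df["word 2"])
      .reindex(index=idx, columns=idx, fill_value=0)
)

print(counts)
```

With only these five rows, the result should match the matrix requested in the question: rows are the first word, columns are the word that follows it, so the dog row reads 0, 1, 2.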
