I have a pandas DataFrame with two columns of strings, as follows:
word 1  word 2
cat     dog
dog     mouse
mouse   dog
dog     dog
dog     mouse
mouse   ...
In Python, I would like to build a matrix that counts how many times one word follows another, like this:
       cat  dog  mouse
cat      0    1      0
dog      0    1      2
mouse    0    1      0
What I have tried so far is tokenization (which may not be the best approach) and computing a correlation matrix (but apparently pandas.DataFrame.corr does not work with strings).
Do you have any idea on how to proceed? Thanks.
CodePudding user response:
You can use pandas.crosstab and reindex to ensure all combinations are present:
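import numpy as np
import pandas as pd

# all distinct words appearing in either column
idx = np.unique(df.values.flatten())
# count the (word 1, word 2) pairs, then add any missing rows/columns filled with 0
(pd.crosstab(df['word 1'], df['word 2'])
   .reindex(index=idx, columns=idx, fill_value=0)
)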
output:
word 2  ...  cat  dog  mouse
word 1
...       0    0    0      0
cat       0    0    1      0
dog       0    0    1      2
mouse     1    0    1      0
NB. "..." appears here as a word because it is present as a value in your example data.
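For reference, here is a minimal self-contained sketch of the same crosstab + reindex approach. The DataFrame is rebuilt by hand from the rows shown in the question, and the row containing the "..." placeholder is left out; that omission is an assumption, not something the question specifies. The names df, idx and counts are only illustrative.

```python
import numpy as np
import pandas as pd

# Example rows copied from the question; the "mouse ..." row is
# deliberately omitted (assumption) so that "..." is not counted as a word.
df = pd.DataFrame({
    "word 1": ["cat", "dog", "mouse", "dog", "dog"],
    "word 2": ["dog", "mouse", "dog", "dog", "mouse"],
})

# Sorted array of every distinct word found in either column
idx = np.unique(df.to_numpy().ravel())

# Rows: first word, columns: following word.
# reindex adds words missing from one side (here the "cat" column) as zeros.
counts = (
    pd.crosstab(df["word 1"], df["word 2"])
    .reindex(index=idx, columns=idx, fill_value=0)
)
print(counts)
```

With these five rows, counts should match the matrix requested in the question: counts.loc["dog", "mouse"] is 2 and the "cat" column is all zeros.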