I have a DataFrame as below, and I wish to detect the repeated words, whether they are written as one word or split into two:
Table A:
Cat     Comments
Stat A  power down due to electric shock
Stat A  powerdown because short circuit
Stat A  top 10 on re work
Stat A  top10 on rework
I wish to get the output as below:
Repeated words = ['powerdown', 'top10', 'on', 'rework']
Does anyone have any ideas?
CodePudding user response:
I assume that having the words in a dataframe column is not really relevant for the problem at hand. I will therefore transfer them into a list, and then search for repeats.
import pandas as pd
df = pd.DataFrame({"Comments": ["power down due to electric shock", "powerdown because short circuit", "top 10 on re work", "top10 on rework"]})
words = df['Comments'].to_list()
This leads to
['power down due to electric shock',
'powerdown because short circuit',
'top 10 on re work',
'top10 on rework']
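Now we create a new list in which every comment's word list is extended with the concatenation of each pair of adjacent words, so that "top 10" and "top10" are treated as equal:
newa = []
for s in words:
    a = s.split()
    # range() is evaluated once, so only the original adjacent pairs are joined
    for i in range(len(a)-1):
        w = a[i] + a[i+1]
        a.append(w)
    newa.append(a)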
which yields:
[['power',
'down',
'due',
'to',
'electric',
'shock',
'powerdown',
'downdue',
'dueto',
'toelectric',
'electricshock'],...
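The same step can also be written without index arithmetic. This is only a sketch of an equivalent formulation; the function name tokens_with_joined_pairs is made up for illustration and is not part of the code above:

```python
def tokens_with_joined_pairs(sentence):
    """Return the words of a sentence followed by every adjacent pair glued together."""
    tokens = sentence.split()
    # zip(tokens, tokens[1:]) pairs each word with its right-hand neighbour
    pairs = [left + right for left, right in zip(tokens, tokens[1:])]
    return tokens + pairs


print(tokens_with_joined_pairs("top 10 on re work"))
```

Applying it to every comment, for example with [tokens_with_joined_pairs(s) for s in words], should give the same nested list as the loop.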
Finally, we flatten the list and use Counter to find the words that occur more than once:
from collections import Counter
from itertools import chain
wordList = list(chain(*newa))
wordCount = Counter(wordList)
[w for w,c in wordCount.most_common() if c>1]
leading to
['powerdown', 'on', 'top10', 'rework']
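One thing to be aware of: Counter also counts a word that appears twice inside a single comment. If "repeated" should mean "appears in at least two different comments", a possible variation is to de-duplicate each comment before counting. The sketch below assumes that interpretation; repeated_across_comments is a hypothetical helper name:

```python
from collections import Counter


def repeated_across_comments(comments):
    """Return words (or joined adjacent pairs) found in more than one comment."""
    counts = Counter()
    for comment in comments:
        tokens = comment.split()
        joined = [left + right for left, right in zip(tokens, tokens[1:])]
        # dict.fromkeys drops duplicates within this comment but keeps the order
        counts.update(list(dict.fromkeys(tokens + joined)))
    return [word for word, count in counts.items() if count > 1]


comments = [
    "power down due to electric shock",
    "powerdown because short circuit",
    "top 10 on re work",
    "top10 on rework",
]
print(repeated_across_comments(comments))
```

For the four comments in the question this gives the same words as before, because no word is repeated within a single comment.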
CodePudding user response:
Let's try:
words = df['Comments'].str.split(' ').explode()
biwords = words + words.groupby(level=0).shift(-1)
(pd.concat([words.groupby(level=0).apply(pd.Series.drop_duplicates), # remove duplicate words within a comment
biwords.groupby(level=0).apply(pd.Series.drop_duplicates)]) # remove duplicate bi-words within a comment
.dropna() # remove NaN created by shifting
.to_frame().join(df[['Cat']]) # join with original Cat
.loc[lambda x: x.duplicated(keep=False)] # select the duplicated `Comments` within `Cat`
.groupby('Cat')['Comments'].unique() # select the unique values within each `Cat`
)
Output:
Cat
Stat A [powerdown, on, top10, rework]
Name: Comments, dtype: object
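If the pandas chain is hard to follow, the same per-Cat grouping can be sketched in plain Python. This is not the code above, just an illustration of the idea under the assumption that the input is available as (Cat, Comments) pairs; both function names are made up:

```python
from collections import Counter, defaultdict


def repeated_words(comments):
    """Words or joined adjacent pairs that occur in more than one comment."""
    counts = Counter()
    for comment in comments:
        tokens = comment.split()
        joined = [left + right for left, right in zip(tokens, tokens[1:])]
        # count every word at most once per comment
        counts.update(list(dict.fromkeys(tokens + joined)))
    return [word for word, count in counts.items() if count > 1]


def repeated_words_by_category(rows):
    """Group (category, comment) pairs by category, then look for repeats in each group."""
    grouped = defaultdict(list)
    for category, comment in rows:
        grouped[category].append(comment)
    return {category: repeated_words(group) for category, group in grouped.items()}


rows = [
    ("Stat A", "power down due to electric shock"),
    ("Stat A", "powerdown because short circuit"),
    ("Stat A", "top 10 on re work"),
    ("Stat A", "top10 on rework"),
]
print(repeated_words_by_category(rows))
```

With a DataFrame, the rows could be built with zip(df['Cat'], df['Comments']).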