Python: How to match words in split and non-split form?

I have a DataFrame as below, and I wish to detect the repeated words, whether they are written split ("power down") or unsplit ("powerdown"):

Table A:

Cat       Comments
Stat A    power down due to electric shock
Stat A    powerdown because short circuit
Stat A    top 10 on re work
Stat A    top10 on rework

I wish to get the output as below:

Repeated words= ['Powerdown', 'top10','on','rework']

Anyone have ideas?

CodePudding user response：

I assume that having the words in a dataframe column is not really relevant for the problem at hand. I will therefore transfer them into a list, and then search for repeats.

import pandas as pd

df = pd.DataFrame({"Comments": ["power down due to electric shock", "powerdown because short circuit", "top 10 on re work", "top10 on rework"]})
words = df['Comments'].to_list()

This leads to

['power down due to electric shock',
 'powerdown because short circuit',
 'top 10 on re work',
 'top10 on rework']

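Now we create a new list that accounts for the fact that "top 10" and "top10" should be treated as equal. For each comment, we append the concatenation of every pair of adjacent words to its list of words: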

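newa = []
for s in words:
    a = s.split()
    # range() is evaluated once, so appending inside the loop is safe
    for i in range(len(a) - 1):
        w = a[i] + a[i + 1]  # join each pair of adjacent words
        a.append(w)
    newa.append(a)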

which yields:

[['power',
  'down',
  'due',
  'to',
  'electric',
  'shock',
  'powerdown',
  'downdue',
  'dueto',
  'toelectric',
  'electricshock'],...

Finally we flatten the list and use Counter to find words which occur more than once:

from collections import Counter
from itertools import chain
wordList = list(chain(*newa))
wordCount = Counter(wordList)
[w for w,c in wordCount.most_common() if c>1]

leading to

['powerdown', 'on', 'top10', 'rework']
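
The steps above can also be packaged into a single function. This is only a minimal sketch, not part of the original answer: the name `repeated_words` is made up for illustration, and it differs from the code above in two ways. It counts each token at most once per comment (so a word repeated inside a single comment is not reported), and it returns the result sorted alphabetically.

```python
from collections import Counter


def repeated_words(comments):
    """Return tokens (single words, or two adjacent words joined)
    that occur in more than one comment. Hypothetical helper."""
    counts = Counter()
    for comment in comments:
        tokens = comment.split()
        # Join neighbouring words so "power down" can match "powerdown".
        joined = [a + b for a, b in zip(tokens, tokens[1:])]
        # A set counts each token at most once per comment.
        counts.update(set(tokens + joined))
    return sorted(w for w, c in counts.items() if c > 1)


comments = [
    "power down due to electric shock",
    "powerdown because short circuit",
    "top 10 on re work",
    "top10 on rework",
]
result = repeated_words(comments)
print(result)  # ['on', 'powerdown', 'rework', 'top10']
```

Like the original approach, this only joins two adjacent words, so a term split into three pieces would not be matched.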

CodePudding user response：

Let's try the following, assuming `df` has both the `Cat` and `Comments` columns from the question:

words = df['Comments'].str.split(' ').explode()

biwords = words + words.groupby(level=0).shift(-1)

(pd.concat([words.groupby(level=0).apply(pd.Series.drop_duplicates),     # remove duplicate words within a comment
            biwords.groupby(level=0).apply(pd.Series.drop_duplicates)])  # remove duplicate bi-words within a comment
   .dropna()                                 # remove NaN created by shifting
   .to_frame().join(df[['Cat']])             # join with original Cat
   .loc[lambda x: x.duplicated(keep=False)]  # select the duplicated `Comments` within `Cat`
   .groupby('Cat')['Comments'].unique()      # select the unique values within each `Cat`
)

Output:

Cat
Stat A    [powerdown, on, top10, rework]
Name: Comments, dtype: object
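
For comparison, the per-category logic of the chain above can be sketched in plain Python, which may be easier to step through. This is an illustrative sketch, not the answer's method: `repeated_by_category` and `rows` are hypothetical names, and the rows are assumed to be `(Cat, Comments)` pairs such as those produced by `df[['Cat', 'Comments']].itertuples(index=False)`.

```python
from collections import Counter, defaultdict


def repeated_by_category(rows):
    """Map each category to the tokens found in more than one of its comments.
    Hypothetical helper; rows is an iterable of (category, comment) pairs."""
    counts = defaultdict(Counter)
    for cat, comment in rows:
        tokens = comment.split()
        # Bi-words: each word joined with the word that follows it.
        joined = [a + b for a, b in zip(tokens, tokens[1:])]
        # Deduplicate within a comment before counting across comments.
        counts[cat].update(set(tokens + joined))
    return {
        cat: sorted(w for w, c in counter.items() if c > 1)
        for cat, counter in counts.items()
    }


rows = [
    ("Stat A", "power down due to electric shock"),
    ("Stat A", "powerdown because short circuit"),
    ("Stat A", "top 10 on re work"),
    ("Stat A", "top10 on rework"),
]
result = repeated_by_category(rows)
print(result)  # {'Stat A': ['on', 'powerdown', 'rework', 'top10']}
```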