Replace the duplicated sentences with the word "same"


I would like to replace the repeated comments with the word "same", keep the first (original) one, and renumber the ID as shown below. However, some comments do not match exactly, such as the last three.

import pandas as pd

data = {'Key': ['111', '111', '111', '222*1', '222*2', '333*1', '333*2', '333*3', '444', '444', '444'],
        'id': ['', '', '', '1', '2', '1', '2', '3', '', '', ''],
        'comment': ['wrong sentence', 'wrong sentence', 'wrong sentence', 'M', 'M', 'F', 'F', 'F',
                    'wrong sentence used in the topic', 'wrong sentence used', 'wrong sentence use']}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

The input:

[image: the DataFrame printed above]

Desired output:

[image: the expected DataFrame]

CodePudding user response:

# boolean mask marking comments that contain 'wrong sentence'
ind = df['comment'].str.contains('wrong sentence')

def my_func(x):
    # only act on groups with more than one row whose first comment matches the mask
    if len(x['comment'].values[0]) > 1 and len(x) > 1 and ind[x.index[0]]:
        df.loc[x.index[1:], 'comment'] = 'same'
        df.loc[x.index, 'id'] = range(1, len(x) + 1)

df.groupby('Key').apply(my_func)

print(df)

Output

      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same

Here, str.contains is used to match 'wrong sentence'; the result is a boolean mask.

groupby is applied on the 'Key' column and each group is passed to the user-defined function my_func, which checks that the first comment is longer than one character, that the group has more than one row, and that the first row matches the mask.

loc is then used to overwrite the values.
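
For reference, a minimal self-contained sketch of the same idea wrapped in a reusable function (the helper name mark_duplicates and the phrase parameter are only illustrative, not part of the original answer):

import pandas as pd

def mark_duplicates(df, phrase='wrong sentence'):
    # mask of rows whose comment contains the repeated phrase
    mask = df['comment'].str.contains(phrase)

    def per_group(group):
        # only touch groups with more than one row whose first comment matches
        if len(group) > 1 and mask[group.index[0]]:
            df.loc[group.index[1:], 'comment'] = 'same'
            df.loc[group.index, 'id'] = range(1, len(group) + 1)

    df.groupby('Key').apply(per_group)
    return df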

CodePudding user response:

Use:

# boolean mask: rows whose comment contains the substring
m = df['comment'].str.contains('wrong sentence')
# create consecutive groups only for the matched rows and build a counter per group
counter = df.groupby((~m).cumsum().where(m)).cumcount().add(1)

# assign the counter only to matched rows
df.loc[m, 'id'] = counter[m]

# assign 'same' to duplicates, i.e. matched rows where the counter is greater than 1
df.loc[counter.gt(1) & m, 'comment'] = 'same'
print(df)
      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same
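
If it helps to see how the counter is built, the small sketch below prints the intermediate mask, the run labels and the counter side by side (the column names match/group/counter are only for display and are not part of the answer):

m = df['comment'].str.contains('wrong sentence')
groups = (~m).cumsum().where(m)          # one label per consecutive run of matched rows
counter = df.groupby(groups).cumcount().add(1)

print(pd.DataFrame({'comment': df['comment'],
                    'match': m,
                    'group': groups,
                    'counter': counter}))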

If you also need to test duplicates per Key group:

m = df['comment'].str.contains('wrong sentence')
counter = df.groupby(['Key',(~m).cumsum().where(m)]).cumcount().add(1)

df.loc[m, 'id'] = counter[m]
df.loc[counter.gt(1) & m, 'comment'] = 'same'
print(df)
      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same
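
To see where the two versions differ, consider a hypothetical case where two different Keys carry the repeated sentence back to back; without 'Key' in the groupby the counter keeps running across the boundary, while with 'Key' it restarts (the data below is made up for illustration):

tmp = pd.DataFrame({'Key': ['111', '111', '555', '555'],
                    'id': ['', '', '', ''],
                    'comment': ['wrong sentence', 'wrong sentence',
                                'wrong sentence', 'wrong sentence']})

m2 = tmp['comment'].str.contains('wrong sentence')

# counter without 'Key': 1, 2, 3, 4 - the second Key would be marked 'same' from its first row
print(tmp.groupby((~m2).cumsum().where(m2)).cumcount().add(1))

# counter per Key: 1, 2, 1, 2 - each Key keeps its own original comment
print(tmp.groupby(['Key', (~m2).cumsum().where(m2)]).cumcount().add(1))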