Comparing different rows inside a data frame and removing unnecessary rows: df.apply and index-CodePudding

I am comparing values in different rows within a data frame. If a row from the data frame matches any other row(s) by at least 75%, I will keep the original one and remove the others.

The original data frame looks like this:

ID	Text
11213	I am going to the cinema
23213	Black is my favourite colour
35455	I am going to the cinema with you
421323	My friends think I am a good guy

The desired output would be:

ID	Text
11213	I am going to the cinema
23213	Black is my favourite colour
421323	My friends think I am a good guy

For that, I am looping over a data frame using df.apply twice to find out which rows match. My plan is to get indices of the rows that match and remove them. However, after iterating through the data frame using df.apply, the indices do not match which produces an error.

So far I have been able to come up with this:

def compare(text):
    match = df.apply(lambda row: (fuzz.partial_ratio(row['text'], text) >= 75), axis=1)
    return [i for i, x in enumerate(matching) if x]


df = pd.read_csv("test.csv", dtype=str, lineterminator='\n')

result = df.apply(lambda row: compare(row['text']).reset_index(), axis=1)
result.to_csv(f'output.csv', index=False)

Which enables me to get indices of the rows that match (though for some reason the index is 2 difference to the original data frame), but I am not sure how to proceed and delete the ones that satisfy the criterion while keeping the original rows only.

My goal is to keep the original indices and find a better way to remove rows where row['text'] are 75% similar.

CodePudding user response：

I suggest using a comparison df:

df_compare = pd.DataFrame(
            df['Text'].apply(lambda row: max([fuzz.partial_ratio(row, x), i] 
                            for i, x in enumerate(df['Text']) if x != row)
                        ).to_list(),
                        columns=['ratio', 'closest']
            )

This gives you:

   ratio  closest
0    100        2
1     38        0
2    100        0
3     42        2

Then you can filter your initial df using df_compare:

df[~((df_compare['ratio']>=75) & (df_compare['closest'] < df_compare.index))]

Output:

       ID                              Text
0   11213          I am going to the cinema
1   23213      Black is my favourite colour
3  421323  My friends think I am a good guy

Edit:

Computing a full comparison matrix would give you the same result:

df_compare = pd.DataFrame(
            df['Text'].apply(lambda row: [fuzz.partial_ratio(x, row)
                            for x in df['Text']]
                        ).to_list()
            )

for i in df_compare.index:
    for j in df_compare.columns[i:]:
        df_compare.iloc[i,j] = 0

This gives the following:

     0   1   2  3
0    0   0   0  0
1   38   0   0  0
2  100  33   0  0
3   42  36  41  0

You can finally filter with:

df[df_compare.max(axis=1)<75]