Compare two columns of lists in pandas dataframe to get number of elements added and deleted as 2 ad-CodePudding

I have a dataframe with 2 columns of lists. I would like to have 2 additional columns with the number of elements added and deleted when comparing the column of "word2" with the column of "word1".

word1	word2
[apple, orange]	[tomato, cucumber]
[banana, kiwi]	[kiwi, mango]

My expected dataframe is as below. How can I accomplish this efficiently?

word1	word2	added	deleted
[apple, orange]	[tomato, cucumber]	2	2
[banana, kiwi]	[kiwi, mango]	1	1

CodePudding user response：

One simple way to do this is with set difference operations in Python and then apply that to each row.

In the example below, we get the difference between the two sets in the column. The difference between word2 and word1 tells you the rows that are added. The difference between word1 and word2 tells you what was deleted:

import pandas as pd

df = pd.DataFrame({
    "word1": [["apple", "orange"], ["banana", "kiwi"]],
    "word2": [["tomato", "cucumber"], ["kiwi", "mango"]]
})

df["added"] = df.apply(lambda row: len(set(row.word2) - set(row.word1)), axis=1)
df["deleted"] = df.apply(lambda row: len(set(row.word1) - set(row.word2)), axis=1)

df
#>              word1               word2  added  deleted
#> 0  [apple, orange]  [tomato, cucumber]      2        2
#> 1   [banana, kiwi]       [kiwi, mango]      1        1

You may also consider numpy set operations if performance is a major concern.