I have a dataframe with 2 columns of lists. I would like to have 2 additional columns with the number of elements added and deleted when comparing the column of "word2" with the column of "word1".
word1 | word2 |
---|---|
[apple, orange] | [tomato, cucumber] |
[banana, kiwi] | [kiwi, mango] |
My expected dataframe is as below. How can I accomplish this efficiently?
word1 | word2 | added | deleted |
---|---|---|---|
[apple, orange] | [tomato, cucumber] | 2 | 2 |
[banana, kiwi] | [kiwi, mango] | 1 | 1 |
CodePudding user response:
One simple way to do this is with set difference operations in Python and then apply that to each row.
In the example below, we get the difference between the two sets in the column. The difference between word2 and word1 tells you the rows that are added. The difference between word1 and word2 tells you what was deleted:
import pandas as pd
df = pd.DataFrame({
"word1": [["apple", "orange"], ["banana", "kiwi"]],
"word2": [["tomato", "cucumber"], ["kiwi", "mango"]]
})
df["added"] = df.apply(lambda row: len(set(row.word2) - set(row.word1)), axis=1)
df["deleted"] = df.apply(lambda row: len(set(row.word1) - set(row.word2)), axis=1)
df
#> word1 word2 added deleted
#> 0 [apple, orange] [tomato, cucumber] 2 2
#> 1 [banana, kiwi] [kiwi, mango] 1 1
You may also consider numpy set operations if performance is a major concern.