I am new to pandas and looking a way to find missing values in columns 'a' from column 'b'. How to get the following result?
MWE
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [[1,2,3],[10,20,30]],
'b': [[1,2],[20]]})
df['required'] = [[3],[10,30]]
df
a b required
0 [1, 2, 3] [1, 2] [3]
1 [10, 20, 30] [20] [10, 30]
The column 'c' should have all the elements from column 'a' that are not present in column 'b'.
I am looking for a vectorized way something like 'np.setdiff1d' but I could not find a way to do it. I can do it by using for loop but that would be too inefficient.
update
- there are no duplicates
- order does not matter
CodePudding user response:
You can use np.setdiff1d
, just inside of apply
:
df['c'] = df.apply(lambda row: np.setdiff1d(row['a'], row['b']), axis=1)
df
a b c
0 [1, 2, 3] [1, 2] [3]
1 [10, 20, 30] [20] [10, 30]
CodePudding user response:
I would use np.setdiff1d
too but because your data doesn't have duplicates I would set assume_unique = True
which speeds up computation.
From docs:
assume_unique : bool
If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.
df['c'] = df.apply(lambda x: np.setdiff1d(x['a'], x['b'], assume_unique=True), axis=1)
Alternatively with list comprehension (which should be fast too):
df['c'] = df.apply(lambda x: [e for e in x['a'] if e not in x['b']], axis=1)
CodePudding user response:
here is one way to do it, by making use of the sets.
df['diff']=df.apply(lambda x: list(set(x['a']).difference(set(x['b']))), axis=1)
df
a b required diff
0 [1, 2, 3] [1, 2] [3] [3]
1 [10, 20, 30] [20] [10, 30] [10, 30]
CodePudding user response:
I came up with this answer:
df['c'] = df['a'].apply(set) - df['b'].apply(set)
a b required c
0 [1, 2, 3] [1, 2] [3] {3}
1 [10, 20, 30] [20] [10, 30] {10, 30}
CodePudding user response:
# Make them sets, not lists:
df = df.applymap(set)
# Subtract a from b:
df['required'] = df.a.sub(df.b)
print(df)
Output:
a b required
0 {1, 2, 3} {1, 2} {3}
1 {10, 20, 30} {20} {10, 30}
Another approach using np.setdiff1d
would be to "vectorize" it. (It's kinda a fake vectorization I believe, but may be nice if you want to do this a lot)
setdiff2d = np.vectorize(np.setdiff1d, otypes=[list])
df['required'] = setdiff2d(df['a'], df['b'])
# Output
a b required
0 [1, 2, 3] [1, 2] [3]
1 [10, 20, 30] [20] [10, 30]