Home > Blockchain >  Pandas get values in one column that are not in another column
Pandas get values in one column that are not in another column

Time:08-27

I am new to pandas and looking a way to find missing values in columns 'a' from column 'b'. How to get the following result?

MWE

import numpy as np
import pandas as pd


df = pd.DataFrame({'a': [[1,2,3],[10,20,30]],
                  'b': [[1,2],[20]]})

df['required'] = [[3],[10,30]]

df

              a       b  required
0     [1, 2, 3]  [1, 2]       [3]
1  [10, 20, 30]    [20]  [10, 30]

The column 'c' should have all the elements from column 'a' that are not present in column 'b'.

I am looking for a vectorized way something like 'np.setdiff1d' but I could not find a way to do it. I can do it by using for loop but that would be too inefficient.

update

  • there are no duplicates
  • order does not matter

CodePudding user response:

You can use np.setdiff1d, just inside of apply:

df['c'] = df.apply(lambda row: np.setdiff1d(row['a'], row['b']), axis=1)

df
              a       b         c
0     [1, 2, 3]  [1, 2]       [3]
1  [10, 20, 30]    [20]  [10, 30]

CodePudding user response:

I would use np.setdiff1d too but because your data doesn't have duplicates I would set assume_unique = True which speeds up computation.

From docs:

assume_unique : bool

If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.
df['c'] = df.apply(lambda x: np.setdiff1d(x['a'], x['b'], assume_unique=True), axis=1)

Alternatively with list comprehension (which should be fast too):

df['c'] = df.apply(lambda x: [e for e in x['a'] if e not in x['b']], axis=1)

CodePudding user response:

here is one way to do it, by making use of the sets.

df['diff']=df.apply(lambda x: list(set(x['a']).difference(set(x['b']))), axis=1)
df

    a                b      required    diff
0   [1, 2, 3]       [1, 2]  [3]         [3]
1   [10, 20, 30]    [20]    [10, 30]    [10, 30]

CodePudding user response:

I came up with this answer:

df['c'] = df['a'].apply(set) - df['b'].apply(set)

              a       b  required         c
0     [1, 2, 3]  [1, 2]       [3]       {3}
1  [10, 20, 30]    [20]  [10, 30]  {10, 30}

CodePudding user response:

# Make them sets, not lists:
df = df.applymap(set)

# Subtract a from b:
df['required'] = df.a.sub(df.b)

print(df)

Output:

              a       b  required
0     {1, 2, 3}  {1, 2}       {3}
1  {10, 20, 30}    {20}  {10, 30}

Another approach using np.setdiff1d would be to "vectorize" it. (It's kinda a fake vectorization I believe, but may be nice if you want to do this a lot)

setdiff2d = np.vectorize(np.setdiff1d, otypes=[list])
df['required'] = setdiff2d(df['a'], df['b'])

# Output

              a       b  required
0     [1, 2, 3]  [1, 2]       [3]
1  [10, 20, 30]    [20]  [10, 30]
  • Related