I have two columns that I need to compare to third and to get data that is not same in first two columns.
import pandas as pd
from numpy import nan
df = pd.DataFrame ({'Want_value': ['a', 'c', 'c', 'c', 'v', 'b', nan, nan, nan, nan, nan, nan, nan], 'Unwanted_value': ['r', 't', 't', 'z', 't', nan, nan, nan, nan, nan, nan, nan, nan], 'new_all_data': ['r', 'z', 'a', 'c', 't', 'v', 'b', 'j', 'r', 't', 'v', 'a', 'k'], 'new_values': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]})
Looks like this...
Basically needs to be like this, 'Want_value'
'Unwanted_value'
and compare to column with name 'new_all_data'
, at the end I need values that are in 'new_all_data'
but there are not in 'Want_value'
and 'Unwanted_value'
. I hope this is clear.
The values in this minimal reproducible example that are in 'new_all_data'
and not in 'Want_value'
and 'Unwanted_value'
are j
and k
. These values needs to be in a new column that I put here as empty 'new_values'
.
Thanks in advance!
CodePudding user response:
You can use set operations:
current_values = set(df.Want_value) | set(df.Unwanted_value)
: union of the unique values ofWant_value
andUnwanted_value
columns.new_values = set(df.new_all_data) - current_values
: symmetric difference, i.e. get the values that arenew_all_data
but are not incurrent values
.
Then you can use DataFrame.where
to create the new_values
column, by getting only the values of new_all_data
column which are in new_values
set.
new_values = set(df.new_all_data) - (set(df.Want_value) | set(df.Unwanted_value))
df['new_values'] = df['new_all_data'].where(df['new_all_data'].isin(new_values))
Output:
>>> new_values
{'j', 'k'}
>>> df
Want_value Unwanted_value new_all_data new_values
0 a r r NaN
1 c t z NaN
2 c t a NaN
3 c z c NaN
4 v t t NaN
5 b NaN v NaN
6 NaN NaN b NaN
7 NaN NaN j j
8 NaN NaN r NaN
9 NaN NaN t NaN
10 NaN NaN v NaN
11 NaN NaN a NaN
12 NaN NaN k k