I have two dataframes, something like below:
df1:
date col1 col2 col3
15-5-2022 ABC 1 PQR
16-5-2022 BCD 2 ABC
17-5-2022 CDE 4 XYZ
df2:
date col1 col2 col3
5-4-2022 XYZ 1 ABC
6-4-2022 PQR 2 ABC
7-4-2022 BCD 4 PQR
My task is to get the total number of unique values that are in df2.col1 but not in df1.col1. The way I am doing this is by first creating a list of all unique col1 values from df1, then from df2, and then comparing the two lists to build a third list with whatever exists in the second list but not the first. Since I need the count of items in the final list, I call len on the third list. My code is like below:
list1 = df1.col1.unique()
list2 = df2.col1.unique()
list3 = [x for x in list2 if x not in list1]
num_list3 = len(list3)
This gets the task done, but it takes a very long time to run, probably because my dataframes are quite big. I was wondering if there is a smarter and more efficient way of doing this. I would appreciate any help.
CodePudding user response:
Use:
df2.loc[~df2['col1'].isin(df1['col1']), 'col1'].unique()
output: array(['XYZ', 'PQR'], dtype=object) — take len of this array (or use .nunique() instead of .unique()) to get the count.
Or, with sets:
set(df2['col1']) - set(df1['col1'])
output: {'PQR', 'XYZ'}
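Since the question asks for the count rather than the values themselves, a minimal sketch using the set approach on the question's sample data (variable name `num` is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['ABC', 'BCD', 'CDE']})
df2 = pd.DataFrame({'col1': ['XYZ', 'PQR', 'BCD']})

# count of unique values in df2.col1 that are absent from df1.col1
num = len(set(df2['col1']) - set(df1['col1']))
print(num)  # 2  ('XYZ' and 'PQR')
```

Set difference is O(n) on average, so this avoids the quadratic cost of the original list comprehension's `x not in list1` membership test over a NumPy array.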
CodePudding user response:
I encountered a similar but more difficult problem: comparing the unique combinations of columns and getting the differences between df1 and df2. Here I post the two solutions in case you need them.
Solution key idea: use a trick of concat, groupby, or merge.
If you want the joint (mutual) part of the two dataframes: get the uniques of df1 and df2, call them u1 and u2, concat u1 and u2, then use groupby to count the number of occurrences. If the count is more than 1, the value appears in both u1 and u2; if it is 1, it appears in only one of the two.
If you want the uniques that belong to df1 only or df2 only, use the merge trick with the option indicator=True.
Data set for replication:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'id': ['a','a','b','b','c','d','e'],
                    'val': [1,2,1,3,4,5,6]})
df2 = pd.DataFrame({'id': ['a','a','b','b','c','c','e','f','f','d'],
                    'val': [1,2,1,3,4,5,5,7,8,9]})
Problem 1: a single column comparison
# Getting unique in each dataframe and concat
u1 = pd.DataFrame(df1['id'].unique(), columns=['id'])
u2 = pd.DataFrame(df2['id'].unique(), columns=['id'])
u = pd.concat([u1,u2])
# Groupby, count the number of occurrence with function `size`:
u.groupby('id').size().reset_index()
# You can do the rest by your choice if you want the joint or the not joint part
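The "rest" mentioned in the comment above might look like the following sketch, continuing with the answer's sample data (the names `counts`, `both`, and `only_one` are illustrative, not part of the original answer):

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a','a','b','b','c','d','e'],
                    'val': [1,2,1,3,4,5,6]})
df2 = pd.DataFrame({'id': ['a','a','b','b','c','c','e','f','f','d'],
                    'val': [1,2,1,3,4,5,5,7,8,9]})

# Unique ids from each frame, then concat
u1 = pd.DataFrame(df1['id'].unique(), columns=['id'])
u2 = pd.DataFrame(df2['id'].unique(), columns=['id'])
u = pd.concat([u1, u2])

# Count occurrences: 2 means the id is in both frames, 1 means only one
counts = u.groupby('id').size().reset_index(name='n')
both = counts.loc[counts['n'] == 2, 'id'].tolist()
only_one = counts.loc[counts['n'] == 1, 'id'].tolist()
print(both)      # ['a', 'b', 'c', 'd', 'e']
print(only_one)  # ['f']
```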
Problem 2: multiple columns combination
# Getting unique combination of `id` and `val` by using trick of `size()` in `groupby`:
u1 = df1.groupby(['id', 'val']).size().reset_index().drop(columns=0)
u2 = df2.groupby(['id', 'val']).size().reset_index().drop(columns=0)
u = pd.concat([u1,u2])
# Groupby on both columns, count the number of occurrences with function `size`:
u.groupby(['id', 'val']).size().reset_index()
# You can do the rest by your choice if you want the joint or the not joint part
If you need the uniques of df1 or df2 only:
# This will tell you the uniques combinations belong to df1 or df2 or both:
pd.merge(u1, u2, how='outer', on=['id', 'val'], indicator=True)
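Putting the merge trick together, a sketch that counts the (id, val) combinations present only in df2, merging on both columns so whole combinations are compared (the names `merged` and `df2_only` are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a','a','b','b','c','d','e'],
                    'val': [1,2,1,3,4,5,6]})
df2 = pd.DataFrame({'id': ['a','a','b','b','c','c','e','f','f','d'],
                    'val': [1,2,1,3,4,5,5,7,8,9]})

# Unique (id, val) combinations from each frame
u1 = df1.groupby(['id', 'val']).size().reset_index().drop(columns=0)
u2 = df2.groupby(['id', 'val']).size().reset_index().drop(columns=0)

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
merged = pd.merge(u1, u2, how='outer', on=['id', 'val'], indicator=True)
df2_only = merged.loc[merged['_merge'] == 'right_only', ['id', 'val']]
print(len(df2_only))  # 5
```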
This should speed up your old code. Hope this helps!