How to compare value counts of two dataframes?

Time:03-16

I am loading multiple CSVs into two dataframes, df1 and df2, and first checking the value counts of a column on both dataframes. That part was easy. Sample code:

df1 = pd.DataFrame()
for i in full_list:
  cf = read_csv_dataframe(i)  # custom method which reads a CSV into a dataframe
  df1 = pd.concat([df1, cf])


print(df1['location'].value_counts())

This results in:

Location1     644668      
Location2     616490      
Location3     283440      

The same for df2:

Location1    640500
Location2    500000
Location3    100000

Now I need to compare the df2 count for each location with the df1 count for the same location. If the difference between them is more than 10% of the df1 count, in either direction, I want to save that location to a new list, dataframe, or anything else. This should continue until all locations have been compared. I tried looping and a few other things but didn't get the result I wanted. Expected final result:

Location2   500000
Location3   100000

CodePudding user response:

You can filter your series based on the condition you described. This outputs the counts from df2 for the locations whose difference from df1 is outside the 10% threshold.

loc1 = df1['location'].value_counts()
loc2 = df2['location'].value_counts()

loc2[abs(loc2-loc1) > (loc1 * .1)]

It assumes the same locations exist in both dataframes.
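If the two dataframes might not contain exactly the same locations, one possible workaround is to align the df2 counts to the df1 index before comparing. The sketch below uses the counts from the question as literal Series; treating a location that is missing from df2 as a count of 0 is an assumption, not something the question specifies.

```python
import pandas as pd

# Counts copied from the question; in practice these would come from
# df1['location'].value_counts() and df2['location'].value_counts().
loc1 = pd.Series({"Location1": 644668, "Location2": 616490, "Location3": 283440})
loc2 = pd.Series({"Location1": 640500, "Location2": 500000, "Location3": 100000})

# Align loc2 to loc1's index. A location absent from df2 is counted as 0
# (an assumption made for this sketch).
loc2_aligned = loc2.reindex(loc1.index, fill_value=0)

# Keep the df2 counts whose absolute difference from df1 exceeds 10% of df1.
outside = loc2_aligned[(loc2_aligned - loc1).abs() > loc1 * 0.1]
print(outside)
```

With the question's numbers this keeps Location2 and Location3, which matches the expected result.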

CodePudding user response:

import pandas as pd

df1 = pd.DataFrame([{"location":"Location 1", "value": 100},{"location":"Location 2", "value": 100},{"location":"Location 3", "value": 100}])
df2 = pd.DataFrame([{"location":"Location 1", "value": 110},{"location":"Location 2", "value": 80},{"location":"Location 3", "value": 105}])


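# Join the two frames on location; the value columns become value_x (df1) and value_y (df2)
df_new = df1.merge(right=df2, on="location")

# Flag rows where the difference is at least 10% of the df1 value
df_new["bool"] = (df_new["value_x"] - df_new["value_y"]).abs() / df_new["value_x"] >= 0.1

print(df_new)

# Keep only the flagged rows
df_new = df_new[df_new["bool"]]

print(df_new.drop("bool", axis=1))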

Output:

     location  value_x  value_y   bool
0  Location 1      100      110   True
1  Location 2      100       80   True
2  Location 3      100      105  False
     location  value_x  value_y
0  Location 1      100      110
1  Location 2      100       80