How to compare value counts of two dataframes?

I am loading multiple CSV files into two dataframes, df1 and df2, and first checking the value counts of a column in each. That part is easy. Sample code:

import pandas as pd

df1 = pd.DataFrame()
for i in full_list:
    cf = read_csv_dataframe(i)  # custom method which reads a CSV into a dataframe
    df1 = pd.concat([df1, cf])


print(df1['location'].value_counts())

This results in:

Location1    644668
Location2    616490
Location3    283440

The same check on df2 gives:

Location1    640500
Location2    500000
Location3    100000

Now I need to compare the df2 count for each location with the df1 count for the same location. The difference between them should stay within 10% of the df1 count: if df1 - df2 differs by more than 10% of df1 in either direction, the location should be saved to a new list, dataframe, or similar. This should continue until all locations have been compared. I tried looping and a few other approaches but didn't get the result I wanted. Expected final result:

Location2    500000
Location3    100000

CodePudding user response：

You can filter the series with a boolean mask built from the condition you described. This outputs the counts from loc2 (the df2 counts) that fall outside the 10% threshold.

loc1 = df1['location'].value_counts()
loc2 = df2['location'].value_counts()

loc2[abs(loc2-loc1) > (loc1 * .1)]

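It assumes the same locations exist in both dataframes. A location that appears in only one of them produces NaN in the subtraction, so the comparison is False and that location is silently left out of the result.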
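If the two dataframes might not share every location, one option is to align the counts explicitly before comparing. The following is only a sketch: it uses the counts from the question, adds a hypothetical Location4 that exists only in df1, and assumes a location missing from df2 should be treated as a count of 0. Locations that exist only in df2 are dropped by the reindex, so this would need adjusting if those matter.

```python
import pandas as pd

# Counts from the question; Location4 is a hypothetical label present only in df1
loc1 = pd.Series({"Location1": 644668, "Location2": 616490,
                  "Location3": 283440, "Location4": 1000})
loc2 = pd.Series({"Location1": 640500, "Location2": 500000,
                  "Location3": 100000})

# Align loc2 to loc1's labels; assumption: a location missing from df2 counts as 0
loc2_aligned = loc2.reindex(loc1.index, fill_value=0)

# True where the absolute difference exceeds 10% of the df1 count
outside = (loc2_aligned - loc1).abs() > loc1 * 0.1

result = loc2_aligned[outside]
print(result)
```

With these numbers, Location1 differs by 4168, which is under 10% of 644668, so it is not flagged. Location2 and Location3 are flagged, matching the expected result in the question, and Location4 is flagged because its df2 count is treated as 0.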

CodePudding user response：

import pandas as pd

df1 = pd.DataFrame([{"location":"Location 1", "value": 100},{"location":"Location 2", "value": 100},{"location":"Location 3", "value": 100}])
df2 = pd.DataFrame([{"location":"Location 1", "value": 110},{"location":"Location 2", "value": 80},{"location":"Location 3", "value": 105}])


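# Join the two frames on location; the shared "value" column becomes value_x (df1) and value_y (df2)
df_new = df1.merge(right=df2, on="location")

# Flag rows whose difference is at least 10% of the df1 value
df_new["bool"] = (df_new["value_x"] - df_new["value_y"]).abs() / df_new["value_x"] >= 0.1

print(df_new)

# Keep only the flagged rows
df_new = df_new[df_new["bool"]]

print(df_new.drop("bool", axis=1))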

Output:

     location  value_x  value_y   bool
0  Location 1      100      110   True
1  Location 2      100       80   True
2  Location 3      100      105  False
     location  value_x  value_y
0  Location 1      100      110
1  Location 2      100       80
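To connect this merge approach back to the value counts in the question, the counts can first be turned into small dataframes and then merged. This is a rough sketch, not a drop-in solution: the two dataframes below are tiny hypothetical stand-ins for the real df1 and df2, the column names count_df1 and count_df2 are made up for illustration, and it uses a strict "greater than 10%" comparison.

```python
import pandas as pd

# Hypothetical stand-ins for the df1 and df2 built from the CSV files
df1 = pd.DataFrame({"location": ["Location1"] * 10 + ["Location2"] * 10})
df2 = pd.DataFrame({"location": ["Location1"] * 10 + ["Location2"] * 7})

# Turn each value_counts() Series into a two-column frame: location, count
counts1 = (df1["location"].value_counts()
           .rename_axis("location").reset_index(name="count_df1"))
counts2 = (df2["location"].value_counts()
           .rename_axis("location").reset_index(name="count_df2"))

# Inner merge keeps only locations present in both frames
merged = counts1.merge(counts2, on="location")

# Relative difference with respect to the df1 count
rel_diff = (merged["count_df1"] - merged["count_df2"]).abs() / merged["count_df1"]

# Keep locations whose counts differ by more than 10%
flagged = merged[rel_diff > 0.1]
print(flagged)
```

Here Location1 has the same count in both frames and is not flagged, while Location2 drops from 10 to 7 (a 30% difference) and is kept. Passing how="outer" to merge would keep locations that exist in only one frame, at the cost of NaN counts that need handling.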