I am loading multiple CSVs into two dataframes, df1 and df2, and first check the value counts of a column on both dataframes, which I can do easily. Sample code:
import pandas as pd
df1 = pd.DataFrame()
for i in full_list:
    cf = read_csv_dataframe(i)  # custom method which reads a csv into a dataframe
    df1 = pd.concat([df1, cf])
print(df1['location'].value_counts())
This results in the following for df1:
Location1 644668
Location2 616490
Location3 283440
and for df2:
Location1 640500
Location2 500000
Location3 100000
Now I need to compare the df2 count for each location with the df1 count for the same location. The two should be within 10% of each other: if the difference df1 - df2 is more than 10% of the df1 count, in either direction, I want to save that location to a new list, dataframe, or anything else. This should continue until all locations have been compared. I tried looping and a few other things but didn't get the result I wanted. Expected final result:
Location2 500000
Location3 100000
CodePudding user response:
You can filter your series based on the condition you described. This will output the counts from df2 that differ from the df1 counts by more than the 10% threshold.
loc1 = df1['location'].value_counts()
loc2 = df2['location'].value_counts()
loc2[abs(loc2-loc1) > (loc1 * .1)]
It assumes the same locations exist in both dataframes.
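If the two dataframes might not share every location, one way to relax that assumption is to align the counts before comparing. This is only a minimal sketch: the counts from the question are hard-coded as Series standing in for the `value_counts()` results, `Location4` is a hypothetical location that exists only in df1, and treating a location missing from df2 as a count of 0 is an assumption the question does not confirm.

```python
import pandas as pd

# Counts from the question, standing in for df1['location'].value_counts()
# and df2['location'].value_counts(). Location4 is hypothetical.
loc1 = pd.Series({"Location1": 644668, "Location2": 616490,
                  "Location3": 283440, "Location4": 1000})
loc2 = pd.Series({"Location1": 640500, "Location2": 500000,
                  "Location3": 100000})

# Align loc2 to loc1's index. Assumption: a location missing from df2 counts as 0.
loc2_aligned = loc2.reindex(loc1.index, fill_value=0)

# Keep the df2 counts that differ from the df1 count by more than 10% of the df1 count.
outside = loc2_aligned[(loc2_aligned - loc1).abs() > loc1 * 0.1]
print(outside)
```

Note that `reindex` keeps only the locations present in df1, so a location that appears only in df2 would be dropped silently and would need separate handling.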
CodePudding user response:
import pandas as pd
df1 = pd.DataFrame([{"location":"Location 1", "value": 100},{"location":"Location 2", "value": 100},{"location":"Location 3", "value": 100}])
df2 = pd.DataFrame([{"location":"Location 1", "value": 110},{"location":"Location 2", "value": 80},{"location":"Location 3", "value": 105}])
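# Join on location; the overlapping "value" columns become value_x (df1) and value_y (df2)
df_new = df1.merge(right=df2, on="location")
# Flag rows whose values differ by at least 10% of the df1 value
df_new["bool"] = df_new.apply(lambda row: abs(row.value_x - row.value_y) / row.value_x >= 0.1, axis=1)
print(df_new)
# Keep only the flagged rows
df_new = df_new[df_new["bool"]]
print(df_new.drop("bool", axis=1))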
Output:
     location  value_x  value_y   bool
0  Location 1      100      110   True
1  Location 2      100       80   True
2  Location 3      100      105  False

     location  value_x  value_y
0  Location 1      100      110
1  Location 2      100       80
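The same filter can probably be written without a row-wise `apply`, since pandas can compute the relative difference on whole columns at once. A minimal sketch using the same sample data as above; the `_df1`/`_df2` suffixes and the names `merged`, `rel_diff`, and `result` are illustrative choices, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"location": ["Location 1", "Location 2", "Location 3"],
                    "value": [100, 100, 100]})
df2 = pd.DataFrame({"location": ["Location 1", "Location 2", "Location 3"],
                    "value": [110, 80, 105]})

# Inner join on location; the suffixes make it explicit which frame each column came from.
merged = df1.merge(df2, on="location", suffixes=("_df1", "_df2"))

# Relative difference with respect to the df1 value, computed column-wise.
rel_diff = (merged["value_df1"] - merged["value_df2"]).abs() / merged["value_df1"]

# Keep rows that differ by at least 10%.
result = merged[rel_diff >= 0.1]
print(result)
```

Because `merge` defaults to an inner join, locations present in only one dataframe are left out of `merged`; passing `how="outer"` would keep them, with NaN in the missing column.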