If I have two dataframes - df1 (data for current day), and df2 (data for previous day).
Both dataframes have 40 columns, and all columns are object data type.
how do I compare Top 3 value_counts for both dataframes, ideally so that the result is side by side, like the following;
df1 df2
Column a Value count 1 Value count 1
Value count 2 Value count 2
Value count 3 Value count 3
Column b Value count 1 Value count 1
Value count 2 Value count 2
Value count 3 Value count 3
The main idea is to check for data anomalies between the data for the two days.
I only know that for each column per dataframe, I must do something like this -
df1.Column.value_counts().head(3)
But this doesn't show combined results as I want. Please help!
CodePudding user response:
You can compare if same columns names in both DataFrames - first use lambda function with Series.value_counts
, top3 and create default index for both DataFrames and then join them with concat
and for expected order add DataFrame.stack
:
np.random.seed(2022)
df1 = pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c')
df2 = pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c')
df11 = df1.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))
df22 = df2.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))
df = pd.concat([df11, df22], axis=1, keys=('df1','df2')).stack().sort_index(level=1)
print (df)
df1 df2
0 c0 7 8
1 c0 6 8
2 c0 6 6
0 c1 8 9
1 c1 7 7
2 c1 7 7
0 c2 9 7
1 c2 7 7
2 c2 7 6
0 c3 9 7
1 c3 7 7
2 c3 7 6
0 c4 11 14
1 c4 7 8
2 c4 7 7
Or use DataFrame.compare
:
df = (df11.compare(df22,keep_equal=True)
.rename(columns={'self':'df1','other':'df2'})
.stack(0)
.sort_index(level=1))
print (df)
df1 df2
0 c0 7 8
1 c0 6 8
2 c0 6 6
0 c1 8 9
1 c1 7 7
2 c1 7 7
0 c2 9 7
1 c2 7 7
2 c2 7 6
0 c3 9 7
1 c3 7 7
2 c3 7 6
0 c4 11 14
1 c4 7 8
2 c4 7 7
EDIT: For add categories use f-strings
for join indices and values of Series in list comprehension:
np.random.seed(2022)
df1 = 'Cat1' pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c').astype(str)
df2 = 'Cat2' pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c').astype(str)
df11 = df1.apply(lambda x: [f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df22 = df2.apply(lambda x:[f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df = pd.concat([df11, df22], axis=1, keys=('df1','df2')).stack().sort_index(level=1)
print (df)
df1 df2
0 c0 Cat18 - 7 Cat29 - 8
1 c0 Cat11 - 6 Cat24 - 8
2 c0 Cat19 - 6 Cat23 - 6
0 c1 Cat17 - 8 Cat24 - 9
1 c1 Cat10 - 7 Cat26 - 7
2 c1 Cat14 - 7 Cat20 - 7
0 c2 Cat13 - 9 Cat28 - 7
1 c2 Cat11 - 7 Cat25 - 7
2 c2 Cat19 - 7 Cat26 - 6
0 c3 Cat15 - 9 Cat20 - 7
1 c3 Cat18 - 7 Cat24 - 7
2 c3 Cat13 - 7 Cat27 - 6
0 c4 Cat12 - 11 Cat25 - 14
1 c4 Cat13 - 7 Cat20 - 8
2 c4 Cat15 - 7 Cat26 - 7