Home > front end >  Align value counts of two dataframes side by side
Align value counts of two dataframes side by side

Time:07-08

If I have two dataframes - df1 (data for current day), and df2 (data for previous day).

Both dataframes have 40 columns, and all columns are object data type.

how do I compare Top 3 value_counts for both dataframes, ideally so that the result is side by side, like the following;

                  df1                  df2

Column a          Value count 1       Value count 1
                  Value count 2       Value count 2
                  Value count 3       Value count 3   


Column b         Value count 1        Value count 1
                 Value count 2        Value count 2 
                 Value count 3        Value count 3

The main idea is to check for data anomalies between the data for the two days.

I only know that for each column per dataframe, I must do something like this -

df1.Column.value_counts().head(3)

But this doesn't show combined results as I want. Please help!

CodePudding user response:

You can compare if same columns names in both DataFrames - first use lambda function with Series.value_counts, top3 and create default index for both DataFrames and then join them with concat and for expected order add DataFrame.stack:

np.random.seed(2022)
df1 = pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c')
df2 = pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c')

df11 = df1.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))
df22 = df2.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))


df = pd.concat([df11, df22], axis=1, keys=('df1','df2')).stack().sort_index(level=1)
print (df)
      df1  df2
0 c0    7    8
1 c0    6    8
2 c0    6    6
0 c1    8    9
1 c1    7    7
2 c1    7    7
0 c2    9    7
1 c2    7    7
2 c2    7    6
0 c3    9    7
1 c3    7    7
2 c3    7    6
0 c4   11   14
1 c4    7    8
2 c4    7    7

Or use DataFrame.compare:

df = (df11.compare(df22,keep_equal=True)
          .rename(columns={'self':'df1','other':'df2'})
          .stack(0)
          .sort_index(level=1))
print (df)

      df1  df2
0 c0    7    8
1 c0    6    8
2 c0    6    6
0 c1    8    9
1 c1    7    7
2 c1    7    7
0 c2    9    7
1 c2    7    7
2 c2    7    6
0 c3    9    7
1 c3    7    7
2 c3    7    6
0 c4   11   14
1 c4    7    8
2 c4    7    7

EDIT: For add categories use f-strings for join indices and values of Series in list comprehension:

np.random.seed(2022)
df1 = 'Cat1'   pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c').astype(str)
df2 = 'Cat2'   pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c').astype(str)

df11 = df1.apply(lambda x: [f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df22 = df2.apply(lambda x:[f'{a} - {b}' for a, b in x.value_counts().head(3).items()])

df = pd.concat([df11, df22], axis=1, keys=('df1','df2')).stack().sort_index(level=1)
print (df)
             df1         df2
0 c0   Cat18 - 7   Cat29 - 8
1 c0   Cat11 - 6   Cat24 - 8
2 c0   Cat19 - 6   Cat23 - 6
0 c1   Cat17 - 8   Cat24 - 9
1 c1   Cat10 - 7   Cat26 - 7
2 c1   Cat14 - 7   Cat20 - 7
0 c2   Cat13 - 9   Cat28 - 7
1 c2   Cat11 - 7   Cat25 - 7
2 c2   Cat19 - 7   Cat26 - 6
0 c3   Cat15 - 9   Cat20 - 7
1 c3   Cat18 - 7   Cat24 - 7
2 c3   Cat13 - 7   Cat27 - 6
0 c4  Cat12 - 11  Cat25 - 14
1 c4   Cat13 - 7   Cat20 - 8
2 c4   Cat15 - 7   Cat26 - 7
  • Related