Home > Back-end >  How to get the closest the column value by the bin Multiindex?
How to get the closest the column value by the bin Multiindex?

Time:12-23

I have a DataFrame (df_value_bin) with Multiindex which is the binned values, like this:

import pandas as pd
import numpy as np

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 3), columns=['a', 'b', 'value'])

a_bins = np.arange(-3, 4, 1)
b_bins = np.arange(-2, 4, 2)

df['a_bins'] = pd.cut(df['a'], bins=a_bins)
df['b_bins'] = pd.cut(df['b'], bins=b_bins)
df_value_bin = df.groupby(['a_bins','b_bins']).agg({'value':'mean'})

Here's the quickview of df_value_bin:

                     value
a_bins   b_bins           
(-3, -2] (-2, 0] -0.417606
         (0, 2]  -0.267035
(-2, -1] (-2, 0] -0.296727
         (0, 2]  -0.112280
(-1, 0]  (-2, 0]  0.459780
         (0, 2]   0.131588
(0, 1]   (-2, 0]  0.110268
         (0, 2]   0.287755
(1, 2]   (-2, 0]  0.254337
         (0, 2]  -0.627460
(2, 3]   (-2, 0] -0.075165
         (0, 2]  -0.589709

Then, I want to get the closest value of df_value_bin when giving some a and b.

Let's say a=1.5 and b=-1, then we should get value=0.254337.

Attempt 1

I can generate the boolean mask for a_bins and b_bins:

a_test = 1.5
b_test = -1

boolean_a = df_value_bin.index.get_level_values('a_bins').categories.contains(a_test)
boolean_b = df_value_bin.index.get_level_values('b_bins').categories.contains(b_test)

print(boolean_a, boolean_b) # Output: [False False False False  True False] [ True False]

However, I have no idea of using the masks to select the row ...

Attempt 2

I can get the index directly:

index_a = np.digitize(a_test, a_bins, right=True)
index_b = np.digitize(b_test, b_bins, right=True)

print(index_a, index_b) # Output: 5 1

Again, I don't know how to use the index to select the row directly.

Notes

It seems the second method should be quicker as it uses np.digitize(). If you have any idea to complete it or other better methods, please feel free to answer!

CodePudding user response:

You can just index with numbers in this case:

df_value_bin.loc[(1.5, -1)]

Output (ignore value, which is generated randomly, look at the Name):

value    0.047439
Name: ((1, 2], (-2, 0]), dtype: float64
  • Related