I have a DataFrame (df_value_bin
) with Multiindex which is the binned values, like this:
import pandas as pd
import numpy as np
np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 3), columns=['a', 'b', 'value'])
a_bins = np.arange(-3, 4, 1)
b_bins = np.arange(-2, 4, 2)
df['a_bins'] = pd.cut(df['a'], bins=a_bins)
df['b_bins'] = pd.cut(df['b'], bins=b_bins)
df_value_bin = df.groupby(['a_bins','b_bins']).agg({'value':'mean'})
Here's the quickview of df_value_bin
:
value
a_bins b_bins
(-3, -2] (-2, 0] -0.417606
(0, 2] -0.267035
(-2, -1] (-2, 0] -0.296727
(0, 2] -0.112280
(-1, 0] (-2, 0] 0.459780
(0, 2] 0.131588
(0, 1] (-2, 0] 0.110268
(0, 2] 0.287755
(1, 2] (-2, 0] 0.254337
(0, 2] -0.627460
(2, 3] (-2, 0] -0.075165
(0, 2] -0.589709
Then, I want to get the closest value
of df_value_bin
when giving some a
and b
.
Let's say a=1.5 and b=-1, then we should get value=0.254337.
Attempt 1
I can generate the boolean mask for a_bins
and b_bins
:
a_test = 1.5
b_test = -1
boolean_a = df_value_bin.index.get_level_values('a_bins').categories.contains(a_test)
boolean_b = df_value_bin.index.get_level_values('b_bins').categories.contains(b_test)
print(boolean_a, boolean_b) # Output: [False False False False True False] [ True False]
However, I have no idea of using the masks to select the row ...
Attempt 2
I can get the index directly:
index_a = np.digitize(a_test, a_bins, right=True)
index_b = np.digitize(b_test, b_bins, right=True)
print(index_a, index_b) # Output: 5 1
Again, I don't know how to use the index to select the row directly.
Notes
It seems the second method should be quicker as it uses np.digitize()
.
If you have any idea to complete it or other better methods, please feel free to answer!
CodePudding user response:
You can just index with numbers in this case:
df_value_bin.loc[(1.5, -1)]
Output (ignore value, which is generated randomly, look at the Name
):
value 0.047439
Name: ((1, 2], (-2, 0]), dtype: float64