I'm trying to filter through a DataFrame by running a filter on each row, and comparing the X value to the corresponding value in a separate Series. (For context, I'm getting the lower border of a curve using data_y_mins
and only want to keep values beneath that border)
data
is a DataFrame with a structure similar to the following (but with other columns included too which I've left out for simplicity):
X Y
2.0 155.3
4.0 120.0
2.0 80.2
5.0 307.0
4.0 66.3
I'm getting the minimum Y
for each X
and storing it in a Series data_y_mins
:
data_y_mins = data.groupby("X")["Y"].min()
What I'd like to do is keep all rows in data
where the Y value is less than the minimum value for the corresponding X in data_y_mins
.
i.e. in the example above, as the minimum Y for X=2.0 would be 80.2, we wouldn't keep the row with values X=2.0, Y=155.3
To run this filter over all rows individually I'm trying a combination of .loc
and a lambda
:
data.loc[lambda row: row['Y'] <= data_y_mins.get(rows['X'])]
but I keep getting errors saying
ValueError: Can only compare identically-labeled Series objects
I've tried combinations involving .apply
instead, but I get key errors with that approach. I get the feeling the issue is with the data_y_mins.get(rows['X'])
part of the code above, where Pandas doesn't like running a query on a separate filter in order to use that value to filter the current DataFrame.
I've also tried converting the series to a DataFrame again but haven't had much luck with that approach either:
data_y_mins = pd.DataFrame(
{
"X": data_y_mins.index,
"Y": data_y_mins.values,
}
)
Is there something I'm missing or is there a better approach to this? I've considered iterrows
for this (if it were Arrays/Lists in Python/JS I would have mapped over them) but that feels too expensive for a pretty sizeable DataFrame
CodePudding user response:
I think you're looking for groupby
transform('min')
filtering:
out = df[df['Y'] <= df.groupby('X')['Y'].transform('min')]
or since you're filtering the minimum values of "Y", just use groupby
idxmin
:
out = df.loc[df.groupby('X')['Y'].idxmin()]
Output:
X Y
2 2.0 80.2
3 5.0 307.0
4 4.0 66.3
CodePudding user response:
Can't you create a data_y_mins_index and data_y_mins_values
from data_y_mins
in the same dataframe from Y and X
?
Then you can filter like:
data[data['Y']<=data['y_min_value']]