Home > OS >  Filtering rows on a Pandas DataFrame based on values from other Series
Filtering rows on a Pandas DataFrame based on values from other Series

Time:03-31

I'm trying to filter through a DataFrame by running a filter on each row, and comparing the X value to the corresponding value in a separate Series. (For context, I'm getting the lower border of a curve using data_y_mins and only want to keep values beneath that border)

data is a DataFrame with a structure similar to the following (but with other columns included too which I've left out for simplicity):

X    Y  
2.0  155.3       
4.0  120.0       
2.0  80.2       
5.0  307.0       
4.0  66.3

I'm getting the minimum Y for each X and storing it in a Series data_y_mins:

data_y_mins = data.groupby("X")["Y"].min()

What I'd like to do is keep all rows in data where the Y value is less than the minimum value for the corresponding X in data_y_mins. i.e. in the example above, as the minimum Y for X=2.0 would be 80.2, we wouldn't keep the row with values X=2.0, Y=155.3

To run this filter over all rows individually I'm trying a combination of .loc and a lambda:

data.loc[lambda row: row['Y'] <= data_y_mins.get(rows['X'])]

but I keep getting errors saying

ValueError: Can only compare identically-labeled Series objects

I've tried combinations involving .apply instead, but I get key errors with that approach. I get the feeling the issue is with the data_y_mins.get(rows['X']) part of the code above, where Pandas doesn't like running a query on a separate filter in order to use that value to filter the current DataFrame.

I've also tried converting the series to a DataFrame again but haven't had much luck with that approach either:

data_y_mins = pd.DataFrame(
        {
            "X": data_y_mins.index,
            "Y": data_y_mins.values,
        }
    )

Is there something I'm missing or is there a better approach to this? I've considered iterrows for this (if it were Arrays/Lists in Python/JS I would have mapped over them) but that feels too expensive for a pretty sizeable DataFrame

CodePudding user response:

I think you're looking for groupby transform('min') filtering:

out = df[df['Y'] <= df.groupby('X')['Y'].transform('min')]

or since you're filtering the minimum values of "Y", just use groupby idxmin:

out = df.loc[df.groupby('X')['Y'].idxmin()]

Output:

     X      Y
2  2.0   80.2
3  5.0  307.0
4  4.0   66.3

CodePudding user response:

Can't you create a data_y_mins_index and data_y_mins_values from data_y_mins in the same dataframe from Y and X? Then you can filter like:

data[data['Y']<=data['y_min_value']]
  • Related