Filtering by a range of data between the maximum and minimum value of a pandas column-CodePudding

I have a dataset in which I need to filter by the values of one of the columns.

I'll try to explain what I need with an example. Suppose we have the following dataset, whose columns may contain NaN values.

In [11]: df
Out[11]:
date        A     B     C
2012-11-29  0     0     NaN
2012-11-30  1     1     NaN
2012-12-01  2     2     2
2012-12-02  NaN   3     3
2012-12-03  4     4     4
2012-12-04  5     5     NaN
2012-12-05  6     6     6
2012-12-06  7     7     7
2012-12-07  8     8     NaN
2012-12-08  9     9     NaN

I need to filter a dataframe to get data between the maximum and minimum values of a column C.

That is, at the output, I should get the following data set, while all the values of NaN within this interval should be unchanged.

The result of filtering should be the following:

date        A     B     C
2012-12-01  2     2     2
2012-12-02  NaN   3     3
2012-12-03  4     4     4
2012-12-04  5     5     NaN
2012-12-05  6     6     6
2012-12-06  7     7     7

How can I do that? I tried this kind of construct but it didn't give any results:

interval_1 = pd.DataFrame(pd.date_range(df['C'].min(), df['C'].max()))

CodePudding user response：

You can use idxmin/idxmax and slicing:

df.loc[df['C'].idxmin():df['C'].idxmax()]

output:

         date    A  B    C
2  2012-12-01  2.0  2  2.0
3  2012-12-02  NaN  3  3.0
4  2012-12-03  4.0  4  4.0
5  2012-12-04  5.0  5  NaN
6  2012-12-05  6.0  6  6.0
7  2012-12-06  7.0  7  7.0

CodePudding user response：

import pandas as pd
import numpy as np

columns = ['date','A','B','C']
data = [    
['2012-11-29',  0 ,    0 ,    np.nan],
['2012-11-30',  1 ,   1  ,   np.nan],
['2012-12-01',  2  ,   2 ,    2],
['2012-12-02',  np.nan,   3 ,    3],
['2012-12-03',  4  ,   4 ,    4],
['2012-12-04',  5  ,   5 ,    np.nan],
['2012-12-05',  6  ,   6 ,    6],
['2012-12-06',  7  ,   7 ,    7],
['2012-12-07' , 8  ,   8 ,    np.nan],
['2012-12-08' , 9  ,   9 ,    np.nan]]


df = pd.DataFrame(data=data, columns=columns)


minVal = df['C'].min()
maxVal = df['C'].max()

df_filter = df[((df['A'] >= minVal) | (df['B'] >= minVal)) & ((df['A'] <= maxVal) | (df['B'] <= maxVal))]

Output:

print(df_filter)
         date    A  B    C
2  2012-12-01  2.0  2  2.0
3  2012-12-02  NaN  3  3.0
4  2012-12-03  4.0  4  4.0
5  2012-12-04  5.0  5  NaN
6  2012-12-05  6.0  6  6.0
7  2012-12-06  7.0  7  7.0

CodePudding user response：

You could first get min and max values of C, and then filter the whole df by a and b column values:

min_c = df.c.min()
max_c = df.c.max()
df = df[(df.a >= min_c) & (df.a <= max_c) & (df.b >= min_c) & (df.b <= max_c)]