I have a dataset in which I need to filter by the values of one of the columns.
I'll try to explain what I need with an example. Suppose we have the following dataset, whose columns may contain NaN
values.
In [11]: df
Out[11]:
date A B C
2012-11-29 0 0 NaN
2012-11-30 1 1 NaN
2012-12-01 2 2 2
2012-12-02 NaN 3 3
2012-12-03 4 4 4
2012-12-04 5 5 NaN
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 NaN
2012-12-08 9 9 NaN
I need to filter a dataframe to get data between the maximum and minimum values of a column C
.
That is, at the output, I should get the following data set, while all the values of NaN
within this interval should be unchanged.
The result of filtering should be the following:
date A B C
2012-12-01 2 2 2
2012-12-02 NaN 3 3
2012-12-03 4 4 4
2012-12-04 5 5 NaN
2012-12-05 6 6 6
2012-12-06 7 7 7
How can I do that? I tried this kind of construct but it didn't give any results:
interval_1 = pd.DataFrame(pd.date_range(df['C'].min(), df['C'].max()))
CodePudding user response:
You can use idxmin
/idxmax
and slicing:
df.loc[df['C'].idxmin():df['C'].idxmax()]
output:
date A B C
2 2012-12-01 2.0 2 2.0
3 2012-12-02 NaN 3 3.0
4 2012-12-03 4.0 4 4.0
5 2012-12-04 5.0 5 NaN
6 2012-12-05 6.0 6 6.0
7 2012-12-06 7.0 7 7.0
CodePudding user response:
import pandas as pd
import numpy as np
columns = ['date','A','B','C']
data = [
['2012-11-29', 0 , 0 , np.nan],
['2012-11-30', 1 , 1 , np.nan],
['2012-12-01', 2 , 2 , 2],
['2012-12-02', np.nan, 3 , 3],
['2012-12-03', 4 , 4 , 4],
['2012-12-04', 5 , 5 , np.nan],
['2012-12-05', 6 , 6 , 6],
['2012-12-06', 7 , 7 , 7],
['2012-12-07' , 8 , 8 , np.nan],
['2012-12-08' , 9 , 9 , np.nan]]
df = pd.DataFrame(data=data, columns=columns)
minVal = df['C'].min()
maxVal = df['C'].max()
df_filter = df[((df['A'] >= minVal) | (df['B'] >= minVal)) & ((df['A'] <= maxVal) | (df['B'] <= maxVal))]
Output:
print(df_filter)
date A B C
2 2012-12-01 2.0 2 2.0
3 2012-12-02 NaN 3 3.0
4 2012-12-03 4.0 4 4.0
5 2012-12-04 5.0 5 NaN
6 2012-12-05 6.0 6 6.0
7 2012-12-06 7.0 7 7.0
CodePudding user response:
You could first get min and max values of C, and then filter the whole df by a and b column values:
min_c = df.c.min()
max_c = df.c.max()
df = df[(df.a >= min_c) & (df.a <= max_c) & (df.b >= min_c) & (df.b <= max_c)]