Drop columns if all of their values in a specific date range are NaNs using Pandas


Given a data sample as follows:

          date  value1    value2    value3
0   2021-10-12   1.015  1.115668  1.015000
1   2021-10-13     NaN  1.104622  1.030225
2   2021-10-14     NaN  1.093685       NaN
3   2021-10-15   1.015  1.082857       NaN
4   2021-10-16   1.015  1.072135  1.077284
5   2021-10-29   1.015  1.061520  1.093443
6   2021-10-30   1.015  1.051010  1.109845
7   2021-10-31   1.015       NaN  1.126493
8    2021-11-1   1.015       NaN       NaN
9    2021-11-2   1.015  1.020100       NaN
10   2021-11-3     NaN  1.010000       NaN
11  2021-11-30   1.015  1.000000       NaN

Let's say I want to drop the columns whose values are all NaNs in November 2021, i.e. in the range 2021-11-01 to 2021-11-30 (including both the start and end dates).

Under this requirement, value3 will be dropped, since all of its values in 2021-11 are NaNs. The other columns have some NaNs in 2021-11, but not all, so they will be kept.

How could I achieve that in Pandas? Thanks.
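
For anyone following along, here is a minimal sketch to rebuild the sample frame from the table above (dates kept as strings, exactly as shown):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2021-10-12', '2021-10-13', '2021-10-14', '2021-10-15', '2021-10-16',
             '2021-10-29', '2021-10-30', '2021-10-31', '2021-11-1', '2021-11-2',
             '2021-11-3', '2021-11-30'],
    'value1': [1.015, np.nan, np.nan, 1.015, 1.015, 1.015,
               1.015, 1.015, 1.015, 1.015, np.nan, 1.015],
    'value2': [1.115668, 1.104622, 1.093685, 1.082857, 1.072135, 1.061520,
               1.051010, np.nan, np.nan, 1.020100, 1.010000, 1.000000],
    'value3': [1.015000, 1.030225, np.nan, np.nan, 1.077284, 1.093443,
               1.109845, 1.126493, np.nan, np.nan, np.nan, np.nan],
})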

EDIT:

df['date'] = pd.to_datetime(df['date'])  
mask = (df['date'] >= '2021-11-01') & (df['date'] <= '2021-11-30')
df.loc[mask]

Out:

         date  value1  value2  value3
8  2021-11-01   1.015     NaN     NaN
9  2021-11-02   1.015  1.0201     NaN
10 2021-11-03     NaN  1.0100     NaN
11 2021-11-30   1.015  1.0000     NaN
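
One way to finish this masked selection (a sketch added for illustration; the accepted approach follows in the answer below) is to drop the all-NaN columns from the November slice and reindex the full frame with the surviving columns:

# columns that still contain at least one value within the masked (November 2021) rows
keep = df.loc[mask].dropna(axis=1, how='all').columns
df = df[keep]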

CodePudding user response:

You can filter the rows for November 2021 and test whether every value in each column is NaN:

df['date'] = pd.to_datetime(df['date'])

# keep only the columns that are not entirely NaN in November 2021
df = df.loc[:, ~df[df['date'].dt.to_period('M') == pd.Period('2021-11')].isna().all()]

Or:

df['date'] = pd.to_datetime(df['date'])

# equivalently, keep the columns that have at least one non-NaN value in November 2021
df = df.loc[:, df[df['date'].dt.to_period('M') == pd.Period('2021-11')].notna().any()]
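
Both variants build the same boolean Series over the columns: ~isna().all() and notna().any() are logically equivalent. A small sketch of the intermediate mask for the sample data (note the date column is never NaN, so it always survives):

nov = df[df['date'].dt.to_period('M') == pd.Period('2021-11')]
print (nov.notna().any())
date       True
value1     True
value2     True
value3    False
dtype: bool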

EDIT: If you prefer to reuse the explicit date-range mask from the question:

# boolean row mask for the inclusive date range, as in the question's edit
mask = (df['date'] >= '2021-11-01') & (df['date'] <= '2021-11-30')
df = df.loc[:, df.loc[mask].notna().any()]

Out:

         date  value1    value2
0  2021-10-12   1.015  1.115668
1  2021-10-13     NaN  1.104622
2  2021-10-14     NaN  1.093685
3  2021-10-15   1.015  1.082857
4  2021-10-16   1.015  1.072135
5  2021-10-29   1.015  1.061520
6  2021-10-30   1.015  1.051010
7  2021-10-31   1.015       NaN
8  2021-11-01   1.015       NaN
9  2021-11-02   1.015  1.020100
10 2021-11-03     NaN  1.010000
11 2021-11-30   1.015  1.000000
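
If you prefer, the same row mask can also be built with Series.between, which includes both endpoints by default; this is a sketch equivalent to the mask-based selection above:

mask = df['date'].between('2021-11-01', '2021-11-30')
df = df.loc[:, df.loc[mask].notna().any()]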

EDIT: If you need to manually exclude some columns from the check (for example, keep value4 even though it is all NaN), set their entries in the mask to False:

# add an all-NaN helper column that should nevertheless be kept
df = df.assign(value4 = np.nan)
print (df)
          date  value1    value2    value3  value4
0   2021-10-12   1.015  1.115668  1.015000     NaN
1   2021-10-13     NaN  1.104622  1.030225     NaN
2   2021-10-14     NaN  1.093685       NaN     NaN
3   2021-10-15   1.015  1.082857       NaN     NaN
4   2021-10-16   1.015  1.072135  1.077284     NaN
5   2021-10-29   1.015  1.061520  1.093443     NaN
6   2021-10-30   1.015  1.051010  1.109845     NaN
7   2021-10-31   1.015       NaN  1.126493     NaN
8    2021-11-1   1.015       NaN       NaN     NaN
9    2021-11-2   1.015  1.020100       NaN     NaN
10   2021-11-3     NaN  1.010000       NaN     NaN
11  2021-11-30   1.015  1.000000       NaN     NaN

df['date'] = pd.to_datetime(df['date'])

# True for columns that are entirely NaN in November 2021
m = df[df['date'].dt.to_period('M') == pd.Period('2021-11')].isna().all()
# exclude value4 from the check so it is kept even though it is all NaN
m.loc['value4'] = False

print (m)
date      False
value1    False
value2    False
value3     True
value4    False
dtype: bool

# invert the mask: keep only columns not flagged as all-NaN in November 2021
df = df.loc[:, ~m]
print (df)
         date  value1    value2  value4
0  2021-10-12   1.015  1.115668     NaN
1  2021-10-13     NaN  1.104622     NaN
2  2021-10-14     NaN  1.093685     NaN
3  2021-10-15   1.015  1.082857     NaN
4  2021-10-16   1.015  1.072135     NaN
5  2021-10-29   1.015  1.061520     NaN
6  2021-10-30   1.015  1.051010     NaN
7  2021-10-31   1.015       NaN     NaN
8  2021-11-01   1.015       NaN     NaN
9  2021-11-02   1.015  1.020100     NaN
10 2021-11-03     NaN  1.010000     NaN
11 2021-11-30   1.015  1.000000     NaN
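
The same idea generalizes to several protected columns: clear their entries in the boolean mask before selecting. A sketch, where protect is just an example list of column names:

protect = ['value4']  # columns to keep even if they are all NaN in November 2021
m = df[df['date'].dt.to_period('M') == pd.Period('2021-11')].isna().all()
m[m.index.isin(protect)] = False   # isin ignores names that are not present
df = df.loc[:, ~m]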