I have a dataframe that looks like:
| Datetime | Rainfall | Flow |
| ----------------- | ------------ | --------- |
| 3/19/2018 12:05 | 1 | 5.85 |
| 3/19/2018 12:10 | 2 | 4.47 |
| 3/19/2018 12:15 | 0 | (BLANK) |
| 3/19/2018 12:20 | 0 | 2.62 |
...
| 3/19/2018 13:00 | 1 | 5.85 |
...
It is time-series data on a 5-minute interval for rainfall and flow, and my objective is to convert it to hourly data. The data has blanks in either the flow or rainfall column, and if there is a blank in either column I want to delete all the rows for that hour (I only want to keep hours with a full hour's worth of data).
For example, in the table above I would delete all of the data for 12:00 - 12:55.
So far I have gotten as far as converting the data to hourly, but realized I likely need to delete the hours containing blanks before resampling to 1H:
rain_hourly = rain.set_index('Date & Time').resample('1h').sum()
flow_hourly = flow.set_index('Date & Time').resample('1h').mean()
df_hourly = rain_hourly.merge(flow_hourly, how='left', on='Date & Time')
Any help is greatly appreciated!
CodePudding user response:
You can do something like this:
import numpy as np
df_hourly['Flow'] = df_hourly['Flow'].replace('', np.nan)
df_hourly = df_hourly.loc[df_hourly['Flow'].notna(), :].copy()
Using replace
should do the trick, I think. To handle both columns at once, just pass a list of columns instead of a single one.
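A minimal sketch of the idea above, on hypothetical data (the column names are taken from the question, the blank is assumed to be an empty string):

```python
import numpy as np
import pandas as pd

# Hypothetical 5-minute data with blank strings in both columns
df = pd.DataFrame({
    "Rainfall": [1, "", 0, 2],
    "Flow": [5.85, 4.47, "", 2.62],
})

# Replace blanks with NaN in both columns at once, then keep only
# the rows where neither column is NaN
df = df.replace("", np.nan)
df = df.loc[df[["Rainfall", "Flow"]].notna().all(axis=1)].copy()
print(df)
```

Note that this drops the individual blank rows only; dropping the whole surrounding hour needs the grouping approach from the other answers.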
CodePudding user response:
To remove the whole hour:
Ensure that the datetime column is the proper dtype with something like
pd.to_datetime()
, and that missing values are formatted correctly: .replace(<whatblanklookslike>, pd.NA)
- Make new columns for date and hour:
df['hour'] = df['Datetime'].dt.hour
df['date'] = df['Datetime'].dt.date
- Group elements by date and hour:
grouped = df.groupby(['date', 'hour'])
- Filter the groups; we only want groups in which no value is NaN:
df = grouped.filter(lambda x: x.notna().all().all())
Simplified example:
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5., pd.NA, 1., 2., 9.]})
grouped = df.groupby('A')
df = grouped.filter(lambda x: x.notna().all().all())
print(df)
Output:
     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0
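Putting the steps above together end to end, on hypothetical 5-minute data with the column names assumed from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical 5-minute series covering 12:00-13:55; one reading in the
# 12:00 hour is missing, so that whole hour should be dropped
idx = pd.date_range("2018-03-19 12:00", periods=24, freq="5min")
df = pd.DataFrame({"Datetime": idx, "Rainfall": 1, "Flow": 2.0})
df.loc[3, "Flow"] = np.nan  # a gap inside the 12:00 hour

# New columns for date and hour
df["date"] = df["Datetime"].dt.date
df["hour"] = df["Datetime"].dt.hour

# Keep only (date, hour) groups with no missing values
clean = df.groupby(["date", "hour"]).filter(lambda g: g.notna().all().all())

# Resample the surviving hours: sum rainfall, average flow
hourly = (clean.set_index("Datetime")[["Rainfall", "Flow"]]
          .resample("1h").agg({"Rainfall": "sum", "Flow": "mean"})
          .dropna())
print(hourly)
```

Only the complete 13:00 hour survives the filter and gets resampled.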
CodePudding user response:
Starting dataframe, which contains a blank:
                 Date  Rainfall     Flow
0 2022-04-26 12:00:00         0        0
1 2022-04-26 12:05:00         1  (BLANK)
2 2022-04-26 12:10:00         2        2
3 2022-04-26 12:15:00         3        3
4 2022-04-26 12:20:00         4        4
First I created a column that holds the hour, and then a temp column that concatenates the rainfall and flow columns to check whether either of them contains a blank (there are many ways to do this). Then I wrote a function to check for blanks, and finally dropped all rows belonging to an hour that contains a blank in either column.
df["hour"] = df["Date"].dt.hour
df["day"] = df["Date"].dt.day
df["res"] = df["Rainfall"].astype(str) + df["Flow"].astype(str)
def blank(row):
    if "(BLANK)" in row:
        return "YES"
    else:
        return "NO"
df["blank"] = df["res"].apply(blank)
to_del0 = set()
to_del1 = set()
for i, k in df.iterrows():
    if df.at[i, "blank"] == "YES":
        to_del0.add(df.at[i, "hour"])
        to_del1.add(df.at[i, "day"])
df.drop(df.index[(df["hour"].isin(to_del0)) & (df["day"].isin(to_del1))], inplace=True)
df = df[["Date", "Rainfall", "Flow"]]
df.head()
The result:
                  Date  Rainfall Flow
12 2022-04-26 13:00:00        12   12
13 2022-04-26 13:05:00        13   13
14 2022-04-26 13:10:00        14   14
15 2022-04-26 13:15:00        15   15
16 2022-04-26 13:20:00        16   16
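As an aside, the same "drop every hour that contains a blank" idea can be done without the temp columns or the iterrows loop, using a grouped transform; this is a vectorized sketch on hypothetical data mirroring the example above:

```python
import pandas as pd

# Hypothetical data: a blank Flow reading at 12:05
df = pd.DataFrame({
    "Date": pd.date_range("2022-04-26 12:00", periods=24, freq="5min"),
    "Rainfall": range(24),
    "Flow": [str(i) for i in range(24)],
})
df.loc[1, "Flow"] = "(BLANK)"

# Flag blank rows, then mark every (date, hour) group that contains
# at least one flagged row, and drop those groups in one step
blank = df["Flow"].eq("(BLANK)") | df["Rainfall"].astype(str).eq("(BLANK)")
bad_hour = blank.groupby([df["Date"].dt.date, df["Date"].dt.hour]).transform("max")
df = df[~bad_hour]
print(df.head())
```

This avoids the day/hour set bookkeeping entirely, because the group keys are used only inside the transform.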