Home > OS >  Exclude df rows where a dates field: time/seconds are between a specific period
Exclude df rows where a dates field: time/seconds are between a specific period

Time:10-12

Morning All,

I have a very large df but need to strip out data NOT between 8.30am AEST to 5pm UTC.

# Dates are dd/mm/yyyy
df ={ 'rfq_create_date_time': ['01/10/2021  00:00:00 AM',
                                '02/10/2021  01:01:01 AM',
                                '03/10/2021  05:00:00 AM',
                                '04/10/2021  10:15:15 AM',
                                '05/10/2021  01:01:01 PM',
                               '21/10/2021  10:29:29 PM',
                               '22/10/2021  10:30:00 PM',
                               '23/10/2021  10:30:01 PM',],
        'Other_Field': ['A', 'B', 'C','D','E','F','G','H',],
         }
df = pd.DataFrame.from_dict(df)
print(df)

Required df:

      rfq_create_date_time Other_Field
2  03/10/2021  05:00:00 AM           C
3  04/10/2021  10:15:15 AM           D
4  05/10/2021  01:01:01 PM           E
5  21/10/2021  10:29:29 PM           F
6  22/10/2021  10:30:00 PM           G

First issue: I tried the between_time function but I don't want the date to be the index. This was added as I was getting TypeError: Index must be DatetimeIndex

df.index = pd.to_datetime(df['rfq_create_date_time'])

Second issue: I just want to do counts of before and after but am getting TypeError: bad operand type for unary ~: 'str' when assigning mask = ~

# Count the number of rows excluded
dfUTC_05_To_2230 = ((df['rfq_create_date_time'].between_time('5:00', '22:30')))
print(dfUTC_05_To_2230)
Total_UTC_Removed = np.sum(dfUTC_05_To_2230)
print(" Total records filtered out due to exclusion of RFQ's from UTC 0500 to UTC 2230 "   str(Total_UTC_Removed), end='\n')
# Mask to exclude these rows
mask = ~((df['rfq_create_date_time'].between_time('5:00', '22:30')))
Total_Rows_After_Mask = df.shape[0]
Difference = Total_Rows_Db - Total_UTC_Removed - Total_Rows_After_Mask
print("Total records in df after exclusion of RFQ's from UTC 0500 to UTC 2230 "   str(Total_Rows_After_Mask), end='\n')
print("Difference after exclusion of RFQ's from UTC 0500 to UTC 2230 "   str(Difference), end='\n')

CodePudding user response:

To use between_time, as you've probably realised, the date/time needs to be the index of the dataframe.

When the date/time is a column in the dataframe you can use 'standard' filtering.

from datetime import time
import pandas as pd


# Dates are dd/mm/yyyy
data = {
    "rfq_create_date_time": [
        "01/10/2021  00:00:00 AM",
        "02/10/2021  01:01:01 AM",
        "03/10/2021  05:00:00 AM",
        "04/10/2021  10:15:15 AM",
        "05/10/2021  01:01:01 PM",
        "21/10/2021  10:29:29 PM",
        "22/10/2021  10:30:00 PM",
        "23/10/2021  10:30:01 PM",
    ],
    "Other_Field": [
        "A",
        "B",
        "C",
        "D",
        "E",
        "F",
        "G",
        "H",
    ],
}
df = pd.DataFrame.from_dict(data)

df["rfq_create_date_time"] = pd.to_datetime(df["rfq_create_date_time"])

mask = (df["rfq_create_date_time"].dt.time >= time(5, 0)) & (
    df["rfq_create_date_time"].dt.time <= time(23, 30)
)

df_filtered = df[~mask]

print(df_filtered)

print(
f"""There were {df.shape[0]} records in the original data,
and after filtering there are {df_filtered.shape[0]} records left."""
)
rfq_create_date_time Other_Field
10/01/2021 00:00:00 A
10/02/2021 01:01:01 B
  • Related