Morning All,
I have a very large df but need to strip out data NOT between 8.30am AEST to 5pm UTC.
# Dates are dd/mm/yyyy
df ={ 'rfq_create_date_time': ['01/10/2021 00:00:00 AM',
'02/10/2021 01:01:01 AM',
'03/10/2021 05:00:00 AM',
'04/10/2021 10:15:15 AM',
'05/10/2021 01:01:01 PM',
'21/10/2021 10:29:29 PM',
'22/10/2021 10:30:00 PM',
'23/10/2021 10:30:01 PM',],
'Other_Field': ['A', 'B', 'C','D','E','F','G','H',],
}
df = pd.DataFrame.from_dict(df)
print(df)
Required df:
rfq_create_date_time Other_Field
2 03/10/2021 05:00:00 AM C
3 04/10/2021 10:15:15 AM D
4 05/10/2021 01:01:01 PM E
5 21/10/2021 10:29:29 PM F
6 22/10/2021 10:30:00 PM G
First issue: I tried the between_time
function but I don't want the date to be the index. This was added as I was getting TypeError: Index must be DatetimeIndex
df.index = pd.to_datetime(df['rfq_create_date_time'])
Second issue: I just want to do counts of before and after but am getting TypeError: bad operand type for unary ~: 'str'
when assigning mask = ~
# Count the number of rows excluded
dfUTC_05_To_2230 = ((df['rfq_create_date_time'].between_time('5:00', '22:30')))
print(dfUTC_05_To_2230)
Total_UTC_Removed = np.sum(dfUTC_05_To_2230)
print(" Total records filtered out due to exclusion of RFQ's from UTC 0500 to UTC 2230 " str(Total_UTC_Removed), end='\n')
# Mask to exclude these rows
mask = ~((df['rfq_create_date_time'].between_time('5:00', '22:30')))
Total_Rows_After_Mask = df.shape[0]
Difference = Total_Rows_Db - Total_UTC_Removed - Total_Rows_After_Mask
print("Total records in df after exclusion of RFQ's from UTC 0500 to UTC 2230 " str(Total_Rows_After_Mask), end='\n')
print("Difference after exclusion of RFQ's from UTC 0500 to UTC 2230 " str(Difference), end='\n')
CodePudding user response:
To use between_time
, as you've probably realised, the date/time needs to be the index of the dataframe.
When the date/time is a column in the dataframe you can use 'standard' filtering.
from datetime import time
import pandas as pd
# Dates are dd/mm/yyyy
data = {
"rfq_create_date_time": [
"01/10/2021 00:00:00 AM",
"02/10/2021 01:01:01 AM",
"03/10/2021 05:00:00 AM",
"04/10/2021 10:15:15 AM",
"05/10/2021 01:01:01 PM",
"21/10/2021 10:29:29 PM",
"22/10/2021 10:30:00 PM",
"23/10/2021 10:30:01 PM",
],
"Other_Field": [
"A",
"B",
"C",
"D",
"E",
"F",
"G",
"H",
],
}
df = pd.DataFrame.from_dict(data)
df["rfq_create_date_time"] = pd.to_datetime(df["rfq_create_date_time"])
mask = (df["rfq_create_date_time"].dt.time >= time(5, 0)) & (
df["rfq_create_date_time"].dt.time <= time(23, 30)
)
df_filtered = df[~mask]
print(df_filtered)
print(
f"""There were {df.shape[0]} records in the original data,
and after filtering there are {df_filtered.shape[0]} records left."""
)
rfq_create_date_time | Other_Field |
---|---|
10/01/2021 00:00:00 | A |
10/02/2021 01:01:01 | B |