I have a data frame that consists of a series of dates:
Year Month Day Hour
2020 12 3 22
2021 1 1 0
2021 1 1 1
2021 1 1 2
...
2021 1 1 23
2021 1 2 1
2021 1 2 3
...
I would like to return all rows for dates that have information for all 24 hours in the day. In the above example, I would only want to return the rows:
2021 1 1 0
2021 1 1 1
2021 1 1 2
...
2021 1 1 23
My data set is very long. I would appreciate any assistance. Thank you.
CodePudding user response:
import pandas as pd
import random as rd
# generate dummy data
sz = 40000
df = pd.DataFrame()
df['Y'] = [rd.randint(2020, 2021) for _ in range(sz)]
df['M'] = [rd.randint(1, 12) for _ in range(sz)]
df['D'] = [rd.randint(1, 31) for _ in range(sz)]
df['H'] = [rd.randint(0, 23) for _ in range(sz)]
# make an ethalon hour sequence
h24 = [i for i in range(24)]
# group and check if we have 24 hours in the group
# if NaN then no 24 hours here - drop, explode the rest
df = df.groupby(by=['Y', 'M', 'D']).apply(lambda x: None if x.value_counts().size != 24 else h24). \
dropna(how='any').explode().reset_index().rename(columns={0: "H"})
print(df)
Prints:
Y M D H
0 2020 1 3 0
1 2020 1 3 1
2 2020 1 3 2
3 2020 1 3 3
4 2020 1 3 4
... ... .. .. ..
1363 2021 12 11 19
1364 2021 12 11 20
1365 2021 12 11 21
1366 2021 12 11 22
1367 2021 12 11 23
[1368 rows x 4 columns]