Home > Software design >  Python - Select all rows where there is data for each hour in a day
Python - Select all rows where there is data for each hour in a day

Time:10-04

I have a data frame that consists of a series of dates:

Year          Month       Day      Hour

2020          12          3           22

2021          1           1           0

2021          1           1           1

2021          1           1           2

...
2021          1           1           23

2021          1           2           1

2021          1           2           3
...

I would like to return all rows for dates that have information for all 24 hours in the day. In the above example, I would only want to return the rows:

2021          1           1           0

2021          1           1           1

2021          1           1           2

...
2021          1           1           23

My data set is very long. I would appreciate any assistance. Thank you.

CodePudding user response:

import pandas as pd
import random as rd

# generate dummy data
sz = 40000
df = pd.DataFrame()
df['Y'] = [rd.randint(2020, 2021) for _ in range(sz)]
df['M'] = [rd.randint(1, 12) for _ in range(sz)]
df['D'] = [rd.randint(1, 31) for _ in range(sz)]
df['H'] = [rd.randint(0, 23) for _ in range(sz)]

# make an ethalon hour sequence
h24 = [i for i in range(24)]

# group and check if we have 24 hours in the group
# if NaN then no 24 hours here - drop, explode the rest
df = df.groupby(by=['Y', 'M', 'D']).apply(lambda x: None if x.value_counts().size != 24 else h24). \
    dropna(how='any').explode().reset_index().rename(columns={0: "H"})
print(df)

Prints:

         Y   M   D   H
0     2020   1   3   0
1     2020   1   3   1
2     2020   1   3   2
3     2020   1   3   3
4     2020   1   3   4
...    ...  ..  ..  ..
1363  2021  12  11  19
1364  2021  12  11  20
1365  2021  12  11  21
1366  2021  12  11  22
1367  2021  12  11  23

[1368 rows x 4 columns]
  • Related