Home > Enterprise >  Aggregate hourly data and count duplicates in pandas python
Aggregate hourly data and count duplicates in pandas python

Time:01-27

I have a dataframe df that looks like this:

time Hit/Miss
2016-09-29 08:00:00 FN
2016-09-29 08:30:00 FN
2016-09-29 09:45:00 TP
2016-10-05 14:00:00 FP

"time" is imported straight from a csv file without being set as index, and "Hit/Miss" contains three categorical values FN, TP, and FP.

Now, I'd like to group the data by hour, day, month and count how many FN, TP, FP's occurred during the hour. As you can see from the brief example dataset, now every hour has Hit/Miss values.

This is what I have so far after grouping by day, which does give me the desired output:

df['time'] = pd.to_datetime(df['time'])
df['year'] = df['time'].dt.year
df['month'] = df['time'].dt.month
df['day'] = df['time'].dt.day

HM_DD = pd.crosstab([df['year'],df['month'],df['day']], df['Hit/Miss'], dropna=True)
HM_MM = pd.crosstab([df['year'],df['month']], df['Hit/Miss'], dropna=True)
HM_YYMM = pd.crosstab(df['year'], df['Hit/Miss'], dropna=True)

HM_DD = HM_DD.reset_index()

And the output looks like the snip shot desired output here .

It does show what I need to see, however, I feel like there's a better way to do it. After all, I need to visualize the data by day and year (extra qn if you have time: What would be the best way/chart to visualize it?), so I'd like to keep the time intact for better maneuver later on, not like how I separated them into year, month, day. I have tried setting time as index and resample(), but that didn't work for me.

Would appreciate help!

CodePudding user response:

You can use offsets to aggregate your data:

HM_DD = pd.crosstab(df['time'].dt.normalize(), df['Hit/Miss'])
HM_MM = pd.crosstab(df['time'].dt.date pd.offsets.MonthBegin(-1), df['Hit/Miss'])
HM_YYMM = pd.crosstab(df['time'].dt.date pd.offsets.YearBegin(-1), df['Hit/Miss'])

Output:

>>> HM_DD.reset_index().rename_axis(columns=None)
        time  FN  FP  TP
0 2016-09-29   2   0   1
1 2016-10-05   0   1   0

>>> HM_MM.reset_index().rename_axis(columns=None)
        time  FN  FP  TP
0 2016-09-01   2   0   1
1 2016-10-01   0   1   0

>>> HM_YYMM.reset_index().rename_axis(columns=None)
        time  FN  FP  TP
0 2016-01-01   2   1   1
  • Related