Pandas Dataframe GroupBy, How to get the value that everything is grouped by?

Time:10-21

My apologies for the title, I can't think of a better one. I have a CSV file that I am reading into a dataframe. This CSV tracks every time a machine was turned on and logs that time. I am converting the time to a timestamp, and then using df.groupby() to count all the occurrences within each hour. The data looks like this within the dataframe:

                         Machines Used per Hour  Revenue per Hour
Timestamp                                                    
2021-08-22 06:00:00                       4             14.00
2021-08-22 08:00:00                       1              4.25
2021-08-22 09:00:00                       8             32.75
2021-08-22 10:00:00                      14             63.75

The issue I am running into is that I can grab the data for Machines Used per Hour and Revenue per Hour, but I cannot grab the timestamp. I'd like to push the timestamp into my database as it is displayed in the dataframe, but it is not an actual column and I can't find a way to grab it by itself. Here is my code:

 df = wr.s3.read_csv(path=[f's3://{csvList[i].bucket_name}/{csvList[i].key}'])
 df['Timestamp'] = pd.to_datetime(df['Timestamp'])
 hourlyMachineUseCount = (df.groupby(df['Timestamp'].dt.floor('h'))['Machine Name'].count()) # Sorts by day, counting amount of machines used per hour
 totalHourlyRevenue = (df.groupby(df['Timestamp'].dt.floor('h'))['Total Revenue'].sum()) # Gives back the total revenue per hour per day 
 hours = (df.groupby(df['Timestamp'].dt.floor('h'))['Timestamp']) 

 machineAndRevenuePerHourDF = pd.DataFrame()
 machineAndRevenuePerHourDF['Machines Used per Hour'] = hourlyMachineUseCount
 machineAndRevenuePerHourDF['Revenue per Hour'] = totalHourlyRevenue
 machineAndRevenuePerHourDF['Timestamp'] = hours

The hours variable is where I am currently stuck. The way it is set up now, my new dataframe's Timestamp column shows each occurrence within an hour:

   2021-08-22 06:07:21
   2021-08-22 06:37:41
   2021-08-22 06:39:45
   2021-08-22 06:41:28

I want it to look like 2021-08-22 06:00:00

CodePudding user response:

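When you perform a DataFrame.groupby() and aggregate, the result uses the by argument as its index (here, Timestamp floored to the hour). That is why the hour shows up on the left of your output but is not an actual column.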

You can use DataFrame.reset_index() after your groupby to reset the index to the default one. The old index will be turned back into a column.
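
As a minimal sketch of that idea: the sample rows below are made up to stand in for the CSV from the question, and the variable names are only illustrative. It shows the floored hour sitting in the index after the groupby, then being moved into a regular column with reset_index().

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV in the question.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-08-22 06:07:21",
        "2021-08-22 06:37:41",
        "2021-08-22 08:15:00",
    ]),
    "Machine Name": ["A", "B", "A"],
    "Total Revenue": [3.50, 4.00, 4.25],
})

# Group by the timestamp floored to the hour and count rows per hour.
hourly_counts = df.groupby(df["Timestamp"].dt.floor("h"))["Machine Name"].count()

# The floored hour is the index of the result, not a column.
hours_from_index = hourly_counts.index

# reset_index() moves the index into a normal "Timestamp" column;
# name= sets the column name for the counted values.
hourly_counts_df = hourly_counts.reset_index(name="Machines Used per Hour")
print(hourly_counts_df)
```

If you only need the hour values themselves, reading them from the index (hours_from_index above) avoids building an extra column.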

CodePudding user response:

Instead of calculating separate variables and building a new DataFrame you can make use of the agg method:

df = df.groupby(df['Timestamp'].dt.floor('h')).agg(
    machines_used_per_hour=('Machine Name', 'count'),
    revenue_per_hour=('Total Revenue', 'sum')
).reset_index()
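
If you would rather keep the original column names with spaces, named aggregation also accepts them through dictionary unpacking. This is a sketch on made-up sample rows (the data and the hourly name are assumptions, not taken from the question's CSV):

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV in the question.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-08-22 06:07:21",
        "2021-08-22 06:37:41",
        "2021-08-22 08:15:00",
    ]),
    "Machine Name": ["A", "B", "A"],
    "Total Revenue": [3.50, 4.00, 4.25],
})

# Keyword names cannot contain spaces, so pass them via ** unpacking.
hourly = df.groupby(df["Timestamp"].dt.floor("h")).agg(**{
    "Machines Used per Hour": ("Machine Name", "count"),
    "Revenue per Hour": ("Total Revenue", "sum"),
}).reset_index()

# "Timestamp" is now an ordinary column holding the floored hour.
print(hourly)
```

After reset_index(), each row carries its hour in the Timestamp column, so it can be written to the database alongside the two aggregates.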