My apologies for the title; I can't think of a better one. I have a CSV file that I am reading into a DataFrame. This CSV logs every time a machine was turned on. I am converting the time to a timestamp and then using df.groupby() to count all the occurrences within each hour. The data looks like this within the DataFrame:
                     Machines Used per Hour  Revenue per Hour
Timestamp
2021-08-22 06:00:00                       4             14.00
2021-08-22 08:00:00                       1              4.25
2021-08-22 09:00:00                       8             32.75
2021-08-22 10:00:00                      14             63.75
The issue I am running into is that I can grab the data for Machines Used per Hour and Revenue per Hour, but I cannot grab the timestamp. I'd like to push the timestamp into my database as it is displayed in the DataFrame, but it is not an actual column and I can't find a way to grab it by itself. Here is my code:
df = wr.s3.read_csv(path=[f's3://{csvList[i].bucket_name}/{csvList[i].key}'])
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
hourlyMachineUseCount = (df.groupby(df['Timestamp'].dt.floor('h'))['Machine Name'].count()) # Sorts by day, counting amount of machines used per hour
totalHourlyRevenue = (df.groupby(df['Timestamp'].dt.floor('h'))['Total Revenue'].sum()) # Gives back the total revenue per hour per day
hours = (df.groupby(df['Timestamp'].dt.floor('h'))['Timestamp'])
machineAndRevenuePerHourDF = pd.DataFrame()
machineAndRevenuePerHourDF['Machines Used per Hour'] = hourlyMachineUseCount
machineAndRevenuePerHourDF['Revenue per Hour'] = totalHourlyRevenue
machineAndRevenuePerHourDF['Timestamp'] = hours
The hours variable is where I am currently stuck. The way it is set up now, the new DataFrame's Timestamp column shows each individual occurrence within the hour:
2021-08-22 06:07:21
2021-08-22 06:37:41
2021-08-22 06:39:45
2021-08-22 06:41:28
I want it to look like 2021-08-22 06:00:00
CodePudding user response:
When you perform a DataFrame.groupby(), the resulting DataFrame has the by argument as its index (here, Timestamp). You can call DataFrame.reset_index() after your groupby to reset the index to the default one; the old index is turned back into a column.
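As a minimal sketch of this, assuming a few made-up rows in place of the CSV from the question (the sample values and the names hourly_counts and hourly_counts_df are illustrative only):

```python
import pandas as pd

# Hypothetical sample rows standing in for the CSV in the question
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-08-22 06:07:21",
        "2021-08-22 06:37:41",
        "2021-08-22 08:15:00",
    ]),
    "Machine Name": ["A", "B", "A"],
    "Total Revenue": [3.50, 4.00, 4.25],
})

# The result is a Series whose index holds the floored (hourly) timestamps
hourly_counts = df.groupby(df["Timestamp"].dt.floor("h"))["Machine Name"].count()

# reset_index() moves that index into a regular "Timestamp" column;
# name= labels the column that holds the counts
hourly_counts_df = hourly_counts.reset_index(name="Machines Used per Hour")
print(hourly_counts_df)
```

After this, hourly_counts_df["Timestamp"] can be read like any other column, so it can be written to the database alongside the counts.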
CodePudding user response:
Instead of calculating separate variables and building a new DataFrame, you can use the agg method:
df = df.groupby(df['Timestamp'].dt.floor('h')).agg(
    machines_used_per_hour=('Machine Name', 'count'),
    revenue_per_hour=('Total Revenue', 'sum')
).reset_index()
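If the original column labels with spaces are wanted, one option is to pass the named aggregations as an unpacked dictionary, since keyword arguments cannot contain spaces. This is a sketch on made-up sample rows; the data and the name hourly are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample rows standing in for the CSV in the question
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-08-22 06:07:21",
        "2021-08-22 06:37:41",
        "2021-08-22 08:15:00",
    ]),
    "Machine Name": ["A", "B", "A"],
    "Total Revenue": [3.50, 4.00, 4.25],
})

# Unpacking a dict lets the output column names contain spaces
hourly = df.groupby(df["Timestamp"].dt.floor("h")).agg(**{
    "Machines Used per Hour": ("Machine Name", "count"),
    "Revenue per Hour": ("Total Revenue", "sum"),
}).reset_index()  # turn the hourly index back into a "Timestamp" column

print(hourly)
```

Each row of hourly then carries the floored timestamp, the count, and the revenue sum as ordinary columns.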