I don't think my title conveys my question well, but I struggled to phrase it. I have a set of CSV files containing column names and values. My current code works exactly as I want: it groups the data by time and gives me a count of uses per hour and revenue per hour. Now I want to refine this. My CSVs have a column called Machine Name. Each value in this column is unique, but they share the same naming scheme: Dryer #39, Dryer #38, Washer #1, Washer #12, and so on. I want a count of dryers and washers used per hour; I don't care which number washer or dryer it was, just whether it was a washer or a dryer.
Here is my code.
    for i in range(1):  # len(csvList)
        df = wr.s3.read_csv(path=[f's3://{csvList[i].bucket_name}/{csvList[i].key}'])
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df = df.groupby(df['Timestamp'].dt.floor('h')).agg(
            machines_used_per_hour=('Machine Name', 'count'),
            revenue_per_hour=('Total Revenue', 'sum')
        ).reset_index()  # Reset the index for the timestamp column
        for _, row in df.iterrows():
            dbInsert = """INSERT INTO `store-machine-use` (store_id, timestamp, machines_used_per_hour, revenue_per_hour, notes) VALUES (%s, %s, %s, %s, %s)"""
            values = (int(storeNumberList[i]), str(row['Timestamp']), int(row['machines_used_per_hour']), int(row['revenue_per_hour']), '')
            cursor.execute(dbInsert, values)
            cnx.commit()
This data enters the database and looks like:
store_id, Timestamp, machines_used_per_hour, revenue_per_hour, notes
10, 2021-08-22 06:00:00, 4, 14, Test
I want to get an individual count of the types of machines used every hour, in the case of my example it would look like:
store_id, Timestamp, machines_used_per_hour, revenue_per_hour, washers_per_hour, dryers_per_hour, notes
10, 2021-08-22 06:00:00, 4, 14, 1, 3, Test
CodePudding user response:
You could use pd.Series.str.startswith and then sum in the aggregation (note the .str accessor, which is required when applying string methods to a Series):

    df['is_dryer'] = df['Machine Name'].str.startswith('Dryer')
    df['is_washer'] = df['Machine Name'].str.startswith('Washer')
    df = df.groupby(df['Timestamp'].dt.floor('h')).agg(
        machines_used_per_hour=('Machine Name', 'count'),
        revenue_per_hour=('Total Revenue', 'sum'),
        washers_per_hour=('is_washer', 'sum'),
        dryers_per_hour=('is_dryer', 'sum')
    ).reset_index()  # Reset the index for the timestamp column
note that if you need more complex pattern matching for determining which machine belongs to which category, you can use regexes with pd.Series.str.match
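For instance, a minimal sketch of the str.match approach, assuming machine names shaped like those in the question:

```python
import pandas as pd

df = pd.DataFrame({"Machine Name": ["Dryer #39", "Washer #1", "Washer #12"]})

# str.match anchors the regex at the start of the string, so each flag
# is True only when the name begins with the given machine type.
df["is_dryer"] = df["Machine Name"].str.match(r"Dryer #\d+")
df["is_washer"] = df["Machine Name"].str.match(r"Washer #\d+")
```

The resulting boolean columns can be summed in the aggregation exactly like the startswith versions above.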
example
for instance with some fake data, if I have:
    dataframe = pd.DataFrame(
        {"machine": ["Dryer #1", "Dryer #2", "Washer #43", "Washer #89", "Washer #33"],
         "aggregation_key": [1, 2, 1, 2, 2]}
    )
after creating the boolean columns with
dataframe["is_dryer"] = dataframe.machine.str.startswith("Dryer")
dataframe["is_washer"] = dataframe.machine.str.startswith("Washer")
dataframe will be
machine aggregation_key is_dryer is_washer
0 Dryer #1 1 True False
1 Dryer #2 2 True False
2 Washer #43 1 False True
3 Washer #89 2 False True
4 Washer #33 2 False True
and then aggregation gives you what you want:
    dataframe.groupby(dataframe["aggregation_key"]).agg(
        washers_per_hour=('is_washer', 'sum'),
        dryers_per_hour=('is_dryer', 'sum')
    ).reset_index()
result will be
aggregation_key washers_per_hour dryers_per_hour
0 1 1 1
1 2 2 1
CodePudding user response:
You can use a regex to strip the common machine-number suffix, creating a Machine Type series ("Dryer" or "Washer") which you can then aggregate on. The + quantifier is important so that multi-digit numbers like #12 are removed entirely:

    df['Machine Type'] = df['Machine Name'].str.replace(' #[0-9]+', '', regex=True)
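A quick sanity check of the replacement (a minimal sketch, assuming machine names like those in the question):

```python
import pandas as pd

names = pd.Series(["Dryer #39", "Washer #1", "Washer #12"])

# The + quantifier makes the pattern consume the whole number,
# so "Washer #12" becomes "Washer" rather than "Washer2".
types = names.str.replace(r" #[0-9]+", "", regex=True)
print(types.tolist())
```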
You can then group on the Machine Type together with the hour, giving one row per hour per type:

    df = df.groupby([df['Timestamp'].dt.floor('h'), 'Machine Type']).agg(
        machines_used_per_hour=('Machine Type', 'count'),
        revenue_per_hour=('Total Revenue', 'sum')
    ).reset_index()
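If you want the wide shape from the question (separate washer and dryer columns rather than one row per type), the per-type counts can be pivoted into columns; a minimal sketch with made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2021-08-22 06:10", "2021-08-22 06:20",
                                 "2021-08-22 06:40", "2021-08-22 06:55"]),
    "Machine Name": ["Dryer #39", "Dryer #38", "Washer #1", "Dryer #12"],
    "Total Revenue": [3, 4, 2, 5],
})
df["Machine Type"] = df["Machine Name"].str.replace(r" #[0-9]+", "", regex=True)

# Count machines of each type per hour, then pivot the types into columns.
counts = (df.groupby([df["Timestamp"].dt.floor("h"), "Machine Type"])
            .size()
            .unstack(fill_value=0)
            .add_suffix("s_per_hour")  # Dryer -> Dryers_per_hour, etc.
            .reset_index())
print(counts)
```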