Pandas Dataframe Getting a count of semi-unique values from columns in a CSV


I don't think my title accurately conveys my question, but I struggled to phrase it. I have a set of CSV files containing column names and values. My current code works exactly as I want it to: it groups the data by time and gets me a count of uses per hour and revenue per hour. However, I now want to refine this. In my CSV there is a column called Machine Name. Each value in this column is unique, but they share the same naming scheme: they can be Dryer #39, Dryer #38, Washer #1, or Washer #12. What I want is a count of dryers and washers used per hour; I do not care which number washer or dryer it was, just that it was a washer or a dryer.

Here is my code.

    for i in range(1):  # range(len(csvList)) once this is tested
        df = wr.s3.read_csv(path=[f's3://{csvList[i].bucket_name}/{csvList[i].key}'])
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df = df.groupby(df['Timestamp'].dt.floor('h')).agg(
            machines_used_per_hour=('Machine Name', 'count'),
            revenue_per_hour=('Total Revenue', 'sum')
        ).reset_index()  # move the Timestamp index back to a column

        for _, row in df.iterrows():
            dbInsert = """INSERT INTO `store-machine-use`
                (store_id, timestamp, machines_used_per_hour, revenue_per_hour, notes)
                VALUES (%s, %s, %s, %s, %s)"""
            values = (int(storeNumberList[i]), str(row['Timestamp']),
                      int(row['machines_used_per_hour']),
                      int(row['revenue_per_hour']), '')
            cursor.execute(dbInsert, values)
            cnx.commit()
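As a side note, executing and committing once per row is slow for larger dataframes. A sketch of batching the inserts with cursor.executemany and a single commit (demonstrated here with sqlite3 and a stand-in table so it is self-contained; the same pattern applies to the MySQL cursor above):

```python
import sqlite3

import pandas as pd

# Stand-in for the aggregated dataframe produced above
df = pd.DataFrame({
    'Timestamp': ['2021-08-22 06:00:00', '2021-08-22 07:00:00'],
    'machines_used_per_hour': [4, 6],
    'revenue_per_hour': [14, 21],
})

cnx = sqlite3.connect(':memory:')
cursor = cnx.cursor()
cursor.execute("""CREATE TABLE store_machine_use
    (store_id INTEGER, timestamp TEXT,
     machines_used_per_hour INTEGER, revenue_per_hour INTEGER, notes TEXT)""")

store_id = 10  # stand-in for storeNumberList[i]
rows = [(store_id, ts, int(m), int(r), '')
        for ts, m, r in zip(df['Timestamp'],
                            df['machines_used_per_hour'],
                            df['revenue_per_hour'])]

# One round trip for all rows, one commit for the whole batch
cursor.executemany(
    "INSERT INTO store_machine_use VALUES (?, ?, ?, ?, ?)", rows)
cnx.commit()
```

With the MySQL connector the only differences are the `%s` placeholders and the real table name.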

This data enters the database and looks like:

store_id, Timestamp, machines_used_per_hour, revenue_per_hour, notes
10, 2021-08-22 06:00:00, 4, 14, Test

I want to get an individual count of the types of machines used every hour, in the case of my example it would look like:

store_id, Timestamp, machines_used_per_hour, revenue_per_hour, washers_per_hour, dryers_per_hour, notes
10, 2021-08-22 06:00:00, 4, 14, 1, 3, Test

CodePudding user response:

You could use pd.Series.str.startswith and then sum in the aggregation:

df['is_dryer'] = df['Machine Name'].str.startswith('Dryer')
df['is_washer'] = df['Machine Name'].str.startswith('Washer')

df = df.groupby(df['Timestamp'].dt.floor('h')).agg( 
    machines_used_per_hour=('Machine Name', 'count'),
    revenue_per_hour=('Total Revenue', 'sum'),
    washers_per_hour=('is_washer', 'sum'),
    dryers_per_hour=('is_dryer', 'sum')
).reset_index() # Reset the index for the timestamp column 

Note that if you need more complex pattern matching to determine which machine belongs to which category, you can use regexes with pd.Series.str.match.
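For instance, a sketch of the regex route with some hypothetical messier names (the names and patterns here are made up for illustration). str.match anchors at the start of the string, so a name with a prefix needs str.contains instead:

```python
import pandas as pd

# Hypothetical messier machine names
names = pd.Series(['Dryer #39', 'dryer#7', 'Washer #12', 'Stack Washer #3'])

# (?i) makes the pattern case-insensitive; \s* allows a missing space
is_dryer = names.str.match(r'(?i)dryer\s*#\d+')
# str.contains instead of str.match, so 'Stack Washer #3' also counts
is_washer = names.str.contains(r'(?i)washer\s*#\d+')

print(is_dryer.tolist())   # [True, True, False, False]
print(is_washer.tolist())  # [False, False, True, True]
```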

Example

For instance, with some fake data, if I have:

dataframe = pd.DataFrame(
    {"machine": ["Dryer #1", "Dryer #2", "Washer #43", "Washer #89", "Washer #33"],
    "aggregation_key": [1, 2, 1, 2, 2]}
)

After creating the boolean columns with

dataframe["is_dryer"] = dataframe.machine.str.startswith("Dryer")
dataframe["is_washer"] = dataframe.machine.str.startswith("Washer")

dataframe will be

      machine  aggregation_key  is_dryer  is_washer
0    Dryer #1                1      True      False
1    Dryer #2                2      True      False
2  Washer #43                1     False       True
3  Washer #89                2     False       True
4  Washer #33                2     False       True

and then aggregation gives you what you want:

dataframe.groupby(dataframe["aggregation_key"]).agg( 
    washers_per_hour=('is_washer', 'sum'),
    dryers_per_hour=('is_dryer', 'sum')
).reset_index()

result will be

   aggregation_key  washers_per_hour  dryers_per_hour
0                1                 1                1
1                2                 2                1

CodePudding user response:

You can use a regex to strip the common machine-number suffix (the + matches multi-digit numbers like #12 in full), creating a Machine Type series you can then aggregate on:

df['Machine Type'] = df['Machine Name'].str.replace(' #[0-9]+', '', regex=True)

You can then group on both the hour and the Machine Type, so each type gets its own count:

    df = df.groupby([df['Timestamp'].dt.floor('h'), 'Machine Type']).agg(
        machines_used_per_hour=('Machine Type', 'count'),
        revenue_per_hour=('Total Revenue', 'sum')
    ).reset_index()
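To turn that Machine Type column into one count column per type per hour (matching the wide table in the question), one option is groupby + size + unstack. A sketch with made-up data following the question's schema:

```python
import pandas as pd

# Made-up sample rows matching the question's columns
df = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2021-08-22 06:10', '2021-08-22 06:20',
                                 '2021-08-22 06:40', '2021-08-22 06:55']),
    'Machine Name': ['Dryer #39', 'Dryer #38', 'Washer #1', 'Dryer #2'],
})
df['Machine Type'] = df['Machine Name'].str.replace(' #[0-9]+', '', regex=True)

# Count each machine type per hour, then spread the types into columns
counts = (df.groupby([df['Timestamp'].dt.floor('h'), 'Machine Type'])
            .size()
            .unstack(fill_value=0)
            .reset_index())
print(counts)  # one row per hour, with a Dryer column and a Washer column
```

Here all four rows fall in the 06:00 hour, so the result is a single row with Dryer = 3 and Washer = 1.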