How to combine multiple rows into a single row with many columns in pandas using an id (clustering m-CodePudding

Situation:

all_task_usage_10_19 is the file which consists of 29229472 rows × 20 columns. There are multiple rows with the same ID inside the column machine_ID with different values in other columns.

Columns:

'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'

2. clustering code

I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas

3. Output

Output displayed using: with option_context as it allows to better visualise the content

My Aim:

I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.

Something like this.

CodePudding user response：

Maybe a Multi-Index is what you're looking for?

df.set_index(['machine_ID', df.index])

Note that by default set_index returns a new dataframe, and does not change the original. To change the original (and return None) you can pass an argument inplace=True.

Example:

df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index])  # not in-place
df.set_index(['machine_ID', df.index], inplace=True)  # in-place

For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:

CodePudding user response：

The below code worked for me:

all_task_usage_10_19.groupby('machine_ID'[['start_time_of_the_measurement_period','end_time_of_the_measurement_period','job_ID', 'task_index','mean_CPU_usage_rate', 'canonical_memory_usage',
    'assigned_memory_usage', 'unmapped_page_cache_memory_usage',           'total_page_cache_memory_usage', 'maximum_memory_usage',
    'mean_disk_I/O_time', 'mean_local_disk_space_used','maximum_CPU_usage',
    'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
    'memory_accesses_per_instruction_(MAI)', 'sample_portion',
    'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()