Home > Back-end >  Grouping the data then converting timestamp column to row
Grouping the data then converting timestamp column to row

Time:07-01

I have gone through Convert columns into rows with Pandas and Merge timestamp column in pandas , the goal is to first group data by ID and then convert start_time column into an entity in the process column

Given

  start_time   time   process  ID
    14:05      14:16  A       1
    14:05      14:34  B       1
    14:05      15:00  C       1
    14:05      15:10  D       1
    14:12      14:19  A       2
    14:12      14:54  B       2

Goal

time   process    ID
14:05  start_time 1    (Previously it was in separate column)
14:16  A          1
14:34  B          1
15:00  C          1
15:10  D          1
14:12  start_time 2
14:19  A          2
14:54  B          2


df.groupby('ID').melt(df.columns.difference(['start_time']), value_name='time')

Note:start_time value in each ID remains the same.

CodePudding user response:

You can treat your data as 2 separate DataFrames and recombine them like so:

# Extract start_times and clean up to match column names
start_times = (
    df[['start_time', 'ID']]
    .drop_duplicates()
    .rename(columns={'start_time': 'time'})
    .assign(process='start_time')
)

# combine data vertically
out = (
    pd.concat([start_times, df.drop(columns='start_time')])
    .sort_values(['ID', 'time'])
    .reset_index(drop=True)
)

print(out)
    time  ID     process
0  14:05   1  start_time
1  14:16   1           A
2  14:34   1           B
3  15:00   1           C
4  15:10   1           D
5  14:12   2  start_time
6  14:19   2           A
7  14:54   2           B

CodePudding user response:

You could use:

cols = df.columns.difference(['start_time', 'process']).to_list()

# identify first row per group
mask = ~df.duplicated('ID')
# melt first row per group
df2 = (df[mask]
       .drop(columns=['process', 'time'])
       .melt(cols, var_name='process', value_name='time')
      )
# concatenate with original dataframe and reorder
out = (pd.concat([df2, df])
         .sort_values(by='ID', kind='stable')
         [['time', 'process'] cols]
         #.reset_index(drop=True) # optional
       )

output:

    time     process  ID
0  14:05  start_time   1
0  14:16           A   1
1  14:34           B   1
2  15:00           C   1
3  15:10           D   1
1  14:12  start_time   2
4  14:19           A   2
5  14:54           B   2
  • Related