Home > Software engineering >  split pandas dataframe row into multiple rows by tasks
split pandas dataframe row into multiple rows by tasks

Time:10-17

I have a pandas dataframe and would like to split each row containing multiple tasks into a new row.

The dataframe columns are:
start_t_i = start time of a task.
end_t_i = end time of a task.
weight_i = cost of a task.

For example, suppose I have the following dataframe df1:

task_name start_t1 end_t1 weight_1 start_t_2 end_t2 weight_2.. start_t_k end_t_k weight_k
john 5 7 1 9 10 9
sally 3 4 1 8 11 7 19 21 1
tom 1 2 3

I would like to transform it into the following df2:

task_name start_t end_t weight
john 5 7 1
john 9 10 9
sally 3 4 1
sally 8 11 7
sally 19 21 1
tom 1 2 3

so far I managed to transform df1 manually into df2 by assuming each person has only a maximum of two tasks. my question is how can I get a df such as df2 from df1 when there are up to k tasks for each person.

CodePudding user response:

If we assume that every task has a start_t, end_t and weight_t columns then we can use:

cols = list(df.columns[1:])
task_col = df.columns[0]
dfs = []
for i in range(0, len(cols), 3):
    subset_cols = cols[i:i 3]
    rename_cols = {subset_cols[0]: 'start_t', subset_cols[1]:'end_t', subset_cols[2]:'weight'}
    dfs.append(df[[task_col]   subset_cols].rename(columns=rename_cols))

transformed_df = pd.concat(dfs, ignore_index=True).sort_values(by='task_name').dropna(subset=['start_t', 'end_t', 'weight']).reset_index(drop=True)

OUTPUT:

  task_name  start_t  end_t  weight
0      john      5.0    7.0     1.0
1      john      9.0   10.0     9.0
2     sally      3.0    4.0     1.0
3     sally      8.0   11.0     7.0
4     sally     19.0   21.0     1.0
5       tom      1.0    2.0     3.0

CodePudding user response:

You can melt the columns, extract the valid column names and pivot:

df = df.replace("", np.NaN).melt("task_name").dropna()
df["variable"] = df["variable"].str.extract("(^[A-Za-z] )_")

print (df.assign(count=df.groupby(["task_name", "variable"]).cumcount())
         .pivot(["count", "task_name"], "variable", "value")
         .reset_index(0, drop=True).sort_index())

variable    end  start  weight
task_name                     
john        7.0    5.0     1.0
john       10.0    9.0     9.0
sally       4.0    3.0     1.0
sally      11.0    8.0     7.0
sally      21.0   19.0     1.0
tom         2.0    1.0     3.0
  • Related