I have a inital dataframe:
df = pd.DataFrame({
'job': ['job1', None, None, 'job3', None, None, 'job4', None, None, None, 'job5', None, None, None, 'job6', None, None, None, None],
'name': ['n_j1', None, None, 'n_j3', None, None, 'n_j4', None, None, None, 'nj5', None, None, None, 'nj6', None, None, None, None],
'schedule': ['01', None, None, '06', None, None, '09', None, None, None, None, None, None, None, None, None, None, None, None],
'task_type': ['START', 'TA', 'END', 'START', 'TB', 'END', 'START', 'TB', 'TB', 'END', 'START', 'TA', 'TA', 'END', 'START', 'TA', 'TA', 'TA', 'END'],
'task': [None, 'task12', None, None, 'task31', None, None, 'task18', 'nt6', None, None, 'task3', 'task7', None, None, 'task19', 'task6', 'task88', None],
'task_name': [None, 'name_t12', None, None, 'name_t31', None, None, 'name_t18', 'n_t5', None, None, 'name_t33', 'n7', None, None, 'name_t19', 'n6', 'n88', None]
})
job name schedule task_type task task_name
0 job1 n_j1 01 START None None
1 None None None TA task12 name_t12
2 None None None END None None
3 job3 n_j3 06 START None None
4 None None None TB task31 name_t31
5 None None None END None None
6 job4 n_j4 09 START None None
7 None None None TB task18 name_t18
8 None None None TB nt6 n_t5
9 None None None END None None
10 job5 nj5 None START None None
11 None None None TA task3 name_t33
12 None None None TA task7 n7
13 None None None END None None
14 job6 nj6 None START None None
15 None None None TA task19 name_t19
16 None None None TA task6 n6
17 None None None TA task88 n88
18 None None None END None None
My expected result is:
job name schedule types tasks n_names
0 job1 n_j1 01 [TA] [task12] [name_t12]
1 job3 n_j3 06 [TB] [task31] [name_t31]
2 job4 n_j4 09 [TB, TB] [task18, nt6] [name_t18, n_t5]
3 job5 nj5 None [TA, TA] [task3, task7] [name_t33, n7]
4 job6 nj6 None [TA, TA, TA] [task19, task6, task88] [name_t19, n6, n88]
grouped by a job
with listed its tasks, where tasks are defined by task_type
between START and END
What I came up with so far:
handling None:
df[['job', 'name', 'schedule']] = df[['job', 'name', 'schedule']].fillna(method='ffill')
filtering rows for desired task types:
df[df['task_type'].isin(['TA', 'TB'])]
grouping the data:
df.groupby(['job', 'name', 'schedule']).apply(lambda x: [list(x['task_type']), list(x['task']), list(x['task_name'])]).apply(pd.Series).reset_index(names=['job', 'name', 'schedule'])
But the results is not what expect:
job name schedule 0 1 2
0 job1 n_j1 01 [TA] [task12] [name_t12]
1 job3 n_j3 06 [TB] [task31] [name_t31]
2 job4 n_j4 09 [TB, TB] [task18, nt6] [name_t18, n_t5]
3 job5 nj5 09 [TA, TA] [task3, task7] [name_t33, n7]
4 job6 nj6 09 [TA, TA, TA] [task19, task6, task88] [name_t19, n6, n88]
My question is how to get rid of schedule
values for job5
, job6
and name the lists more friendly.
There should by more elegant way of doing this, any help would be appreciated.
CodePudding user response:
Try:
x = df.groupby(df["job"].ffill()).agg(
lambda x: [v for v in x if v and v not in {"START", "END"}]
)
x["job"] = x["job"].str[0]
x["name"] = x["name"].str[0]
x["schedule"] = x["schedule"].str[0]
print(x.reset_index(drop=True))
Prints:
job name schedule task_type task task_name
0 job1 n_j1 01 [TA] [task12] [name_t12]
1 job3 n_j3 06 [TB] [task31] [name_t31]
2 job4 n_j4 09 [TB, TB] [task18, nt6] [name_t18, n_t5]
3 job5 nj5 NaN [TA, TA] [task3, task7] [name_t33, n7]
4 job6 nj6 NaN [TA, TA, TA] [task19, task6, task88] [name_t19, n6, n88]
CodePudding user response:
Here you go:
Do the following steps on your initial df
:
df[['job', 'name']] = df[['job', 'name']].fillna(method='ffill')
df = df.groupby(['job', 'name']).agg(lambda x: x.tolist()).applymap(lambda x: [i for i in x if i not in [None, 'START', 'END']])
# replace empty lists with None on schedule column
df['schedule'] = df['schedule'].apply(lambda x: None if x == [] else x)