EDIT: have edited the post to clarify
So I'm managing a portfolio of project. Each project got its own Excel sheet and I try to track time, cost etc. Problem is that the data look like this:
time cost section error other
0 1 500 lift E1
1 2 100 lift E2
2 1 400 wait E3
3 4 200 lift D4 E4
4 5 500 lift E5
5 2 322 crash E3
6 3 500 lift E1
7 3 200 wait D5 E2
when I want it to look like this, because I want to distinguish between first lift attempt and the second one:
time cost section error other
0 1 500 lift E1
1 2 100 lift E2
2 1 400 wait E3
3 4 200 lift (2) D4 E4
4 5 500 lift (2) E5
5 2 322 crash E3
6 3 500 lift (3) E1
7 3 200 wait (2) D5 E2
Each excel file got between 300-500 lines and some 50 columns so this is just a simple representation.
So I made some code, and honestly what surprises me the most is that me banging rocks together over here actually produces a result. I can’t help wondering though if there is a more beautiful way of doing this.
My code:
df = pd.DataFrame({
'time': ['1', '2', '1', '4', '5', '2', '3', '3'],
'cost': ['500', '100', '400', '200', '500', '322','500','200'],
'section': ['lift', 'lift', 'wait', 'lift', 'lift', 'crash', 'lift','wait'],
'error': ['', '', '', 'D4', '','','','D5'],
'other': ['E1', 'E2', 'E3', 'E4', 'E5','E3','E1','E2']}
)
print(df)
# Check if value in 'section' is identical to value in "cell" above
df['match'] = df.section.eq(df.section.shift())
# If true write section value to new column
df.loc[df['match'] == False, 'match2'] = df['section']
# Check if the value have duplicate further up
df.loc[df['match'] == False, 'match3'] = df.groupby(['match2']).cumcount() 1
# Add an extension to the value based on number of entries
df.loc[df['match3'] == 1, 'match4'] = df['section']
df.loc[df['match3'] == 2, 'match4'] = df['section'] ' (2)'
df.loc[df['match3'] == 3, 'match4'] = df['section'] ' (3)'
df.loc[df['match3'] == 4, 'match4'] = df['section'] ' (4)'
df.loc[df['match3'] == 5, 'match4'] = df['section'] ' (5)'
df.loc[df['match3'] == 6, 'match4'] = df['section'] ' (6)'
df['match4'] = df['match4'].ffill()
df['section'] = df['match4']
df.drop(['match', 'match2', 'match3', 'match4'], inplace=True, axis=1 )
print(df)
Output:
time cost section error other
0 1 500 lift E1
1 2 100 lift E2
2 1 400 wait E3
3 4 200 lift D4 E4
4 5 500 lift E5
5 2 322 crash E3
6 3 500 lift E1
7 3 200 wait D5 E2
time cost section error other
0 1 500 lift E1
1 2 100 lift E2
2 1 400 wait E3
3 4 200 lift (2) D4 E4
4 5 500 lift (2) E5
5 2 322 crash E3
6 3 500 lift (3) E1
7 3 200 wait (2) D5 E2
CodePudding user response:
IIUC, you want to label the stretches of "lift":
# get rows with "lift"
m = df['section'].eq('lift')
n = (df.loc[m, 'section'] # for each "lift"
.groupby((~m).cumsum()) # group by successive "lift"
.ngroup().add(1) # enumerate group
)
# keep only enumeration > 1
n = n[n.gt(1)]
# add as string between parentheses
df.loc[n.index, 'section'] = ' (' n.astype(str) ')'
output:
time cost section error other
0 1 500 lift E1
1 2 100 lift E2
2 1 400 wait E3
3 4 200 lift (2) D4 E4
4 5 500 lift (2) E5
5 2 322 crash E3
6 3 500 lift (3) E1
generalization to all groups:
# groups of successive values
group = df['section'].ne(df['section'].shift()).cumsum()
# get the stretch number per group
n = (group.groupby(df['section'])
.apply(lambda g: pd.Series(pd.factorize(g)[0] 1,
index=g.index)
)
)
# same as above
n = n[n.gt(1)]
df.loc[n.index, 'section'] = ' (' n.astype(str) ')'
print(df)
output:
time cost section error other
0 1 500 lift E1
1 2 100 lift E2
2 1 400 wait E3
3 4 200 lift (2) D4 E4
4 5 500 lift (2) E5
5 2 322 crash E3
6 3 500 lift (3) E1
7 3 200 wait (2) D5 E2