Read dataframe sequentially and change value if duplicate found above-CodePudding

EDIT: have edited the post to clarify

So I'm managing a portfolio of project. Each project got its own Excel sheet and I try to track time, cost etc. Problem is that the data look like this:

  time cost section error other
0    1  500    lift          E1
1    2  100    lift          E2
2    1  400    wait          E3
3    4  200    lift    D4    E4
4    5  500    lift          E5
5    2  322   crash          E3
6    3  500    lift          E1
7    3  200    wait    D5    E2

when I want it to look like this, because I want to distinguish between first lift attempt and the second one:

  time cost   section error other
0    1  500      lift          E1
1    2  100      lift          E2
2    1  400      wait          E3
3    4  200  lift (2)    D4    E4
4    5  500  lift (2)          E5
5    2  322     crash          E3
6    3  500  lift (3)          E1
7    3  200  wait (2)    D5    E2

Each excel file got between 300-500 lines and some 50 columns so this is just a simple representation.

So I made some code, and honestly what surprises me the most is that me banging rocks together over here actually produces a result. I can’t help wondering though if there is a more beautiful way of doing this.

My code:

df = pd.DataFrame({
    'time': ['1', '2', '1', '4', '5', '2', '3', '3'],
    'cost': ['500', '100', '400', '200', '500', '322','500','200'],
    'section': ['lift', 'lift', 'wait', 'lift', 'lift', 'crash', 'lift','wait'],
    'error': ['', '', '', 'D4', '','','','D5'],
    'other': ['E1', 'E2', 'E3', 'E4', 'E5','E3','E1','E2']}
    )
print(df)
# Check if value in 'section' is identical to value in "cell" above
df['match'] = df.section.eq(df.section.shift())
# If true write section value to new column
df.loc[df['match'] == False, 'match2'] = df['section']
# Check if the value have duplicate further up
df.loc[df['match'] == False, 'match3'] = df.groupby(['match2']).cumcount() 1

# Add an extension to the value based on number of entries
df.loc[df['match3'] == 1, 'match4'] = df['section']
df.loc[df['match3'] == 2, 'match4'] = df['section']   ' (2)'
df.loc[df['match3'] == 3, 'match4'] = df['section']   ' (3)'
df.loc[df['match3'] == 4, 'match4'] = df['section']   ' (4)'
df.loc[df['match3'] == 5, 'match4'] = df['section']   ' (5)'
df.loc[df['match3'] == 6, 'match4'] = df['section']   ' (6)'

df['match4'] = df['match4'].ffill()
df['section'] = df['match4']
df.drop(['match', 'match2', 'match3', 'match4'], inplace=True, axis=1 )
print(df)

Output:


  time cost section error other
0    1  500    lift          E1
1    2  100    lift          E2
2    1  400    wait          E3
3    4  200    lift    D4    E4
4    5  500    lift          E5
5    2  322   crash          E3
6    3  500    lift          E1
7    3  200    wait    D5    E2
  time cost   section error other
0    1  500      lift          E1
1    2  100      lift          E2
2    1  400      wait          E3
3    4  200  lift (2)    D4    E4
4    5  500  lift (2)          E5
5    2  322     crash          E3
6    3  500  lift (3)          E1
7    3  200  wait (2)    D5    E2

CodePudding user response：

IIUC, you want to label the stretches of "lift":

# get rows with "lift"
m = df['section'].eq('lift')


n = (df.loc[m, 'section']       # for each "lift"
       .groupby((~m).cumsum())  # group by successive "lift"
       .ngroup().add(1)         # enumerate group
    )

# keep only enumeration > 1
n = n[n.gt(1)]

# add as string between parentheses
df.loc[n.index, 'section']  = ' (' n.astype(str) ')'

output:

  time cost   section error other
0    1  500      lift          E1
1    2  100      lift          E2
2    1  400      wait          E3
3    4  200  lift (2)    D4    E4
4    5  500  lift (2)          E5
5    2  322     crash          E3
6    3  500  lift (3)          E1

generalization to all groups:

# groups of successive values
group = df['section'].ne(df['section'].shift()).cumsum()

# get the stretch number per group
n = (group.groupby(df['section'])
          .apply(lambda g: pd.Series(pd.factorize(g)[0] 1,
                                     index=g.index)
                )
    )

# same as above
n = n[n.gt(1)]
df.loc[n.index, 'section']  = ' (' n.astype(str) ')'

print(df)

output:

  time cost   section error other
0    1  500      lift          E1
1    2  100      lift          E2
2    1  400      wait          E3
3    4  200  lift (2)    D4    E4
4    5  500  lift (2)          E5
5    2  322     crash          E3
6    3  500  lift (3)          E1
7    3  200  wait (2)    D5    E2