How to identify second occurance of a value in a column then for that same row but different column-CodePudding

I have a dataframe that I am working with and I would like to identify the second instance of a particular value in a column, and then map a value in a different column for that same row. An example might be easiest:

col1     col2     
 'a'      's'
 'b'      's'
 'f'      's'
 'g'      's'
 'b'      's'
 'l'      's'

The above is a short example of the dataframe(s) that I am working with, so I would like to identify the second occurance of 'b' and for that row map col2 to 'p'.

So the output would be:

col1     col2     
 'a'      's'
 'b'      's'
 'f'      's'
 'g'      's'
 'b'      'p'
 'l'      's'

I am quite new to dataframes so no existing code to work with I'm afraid. I should mention that there IS NO INDEXING for this dataframe, so iloc cannot be used, otherwise it would be trivial.

CodePudding user response：

Use boolean indexing with two masks (equality to "b" and (&) duplicated to identify the second "b"):

# is the row a "b"?
m1 = df['col1'].eq('b')

# is the value duplicated? (i.e. a second "b")
m2 = df['col1'].duplicated()

# if both match, then update
df.loc[m1&m2, 'col2'] = 'p'

Output:

  col1 col2
0    a    s
1    b    s
2    f    s
3    g    s
4    b    p
5    l    s

Intermediates:

  col1 col2     m1     m2  m1&m2
0    a    s  False  False  False
1    b    s   True  False  False
2    f    s  False  False  False
3    g    s  False  False  False
4    b    p   True   True   True
5    l    s  False  False  False

CodePudding user response：

Use Series.duplicated chained by & for bitwise AND with compare value by Series.eq and pass to DataFrame.loc - here all duplicated values without first are converted to p:

df.loc[df['col1'].eq('b') & df['col1'].duplicated(), 'col2'] = 'p'

Or chain GroupBy.cumcount for compare second value by compare 1 (python count from 0):

m =  df['col1'].eq('b')
df.loc[m & df[m].groupby('col1').cumcount().eq(1), 'col2'] = 'p'

Difference of solutions are if 3 or more values b:

m =  df['col1'].eq('b')
df.loc[m & df[m].groupby('col1').cumcount().eq(1), 'col3'] = 'p'

df.loc[df['col1'].eq('b') & df['col1'].duplicated(), 'col4'] = 'p'

print (df)
  col1 col2 col3 col4
0    a    s  NaN  NaN
1    b    s  NaN  NaN
2    f    s  NaN  NaN
3    g    s  NaN  NaN
4    b    s    p    p
5    l    s  NaN  NaN
6    b    s  NaN    p