Home > Software design >  How to identify second occurance of a value in a column then for that same row but different column
How to identify second occurance of a value in a column then for that same row but different column

Time:01-27

I have a dataframe that I am working with and I would like to identify the second instance of a particular value in a column, and then map a value in a different column for that same row. An example might be easiest:

col1     col2     
 'a'      's'
 'b'      's'
 'f'      's'
 'g'      's'
 'b'      's'
 'l'      's'

The above is a short example of the dataframe(s) that I am working with, so I would like to identify the second occurance of 'b' and for that row map col2 to 'p'.

So the output would be:

col1     col2     
 'a'      's'
 'b'      's'
 'f'      's'
 'g'      's'
 'b'      'p'
 'l'      's'

I am quite new to dataframes so no existing code to work with I'm afraid. I should mention that there IS NO INDEXING for this dataframe, so iloc cannot be used, otherwise it would be trivial.

CodePudding user response:

Use boolean indexing with two masks (equality to "b" and (&) duplicated to identify the second "b"):

# is the row a "b"?
m1 = df['col1'].eq('b')

# is the value duplicated? (i.e. a second "b")
m2 = df['col1'].duplicated()

# if both match, then update
df.loc[m1&m2, 'col2'] = 'p'

Output:

  col1 col2
0    a    s
1    b    s
2    f    s
3    g    s
4    b    p
5    l    s

Intermediates:

  col1 col2     m1     m2  m1&m2
0    a    s  False  False  False
1    b    s   True  False  False
2    f    s  False  False  False
3    g    s  False  False  False
4    b    p   True   True   True
5    l    s  False  False  False

CodePudding user response:

Use Series.duplicated chained by & for bitwise AND with compare value by Series.eq and pass to DataFrame.loc - here all duplicated values without first are converted to p:

df.loc[df['col1'].eq('b') & df['col1'].duplicated(), 'col2'] = 'p'

Or chain GroupBy.cumcount for compare second value by compare 1 (python count from 0):

m =  df['col1'].eq('b')
df.loc[m & df[m].groupby('col1').cumcount().eq(1), 'col2'] = 'p'

Difference of solutions are if 3 or more values b:

m =  df['col1'].eq('b')
df.loc[m & df[m].groupby('col1').cumcount().eq(1), 'col3'] = 'p'

df.loc[df['col1'].eq('b') & df['col1'].duplicated(), 'col4'] = 'p'

print (df)
  col1 col2 col3 col4
0    a    s  NaN  NaN
1    b    s  NaN  NaN
2    f    s  NaN  NaN
3    g    s  NaN  NaN
4    b    s    p    p
5    l    s  NaN  NaN
6    b    s  NaN    p
  • Related