Pandas setting column values to adjacent ones for duplicates found in a key column


Assuming we have the following table:

+--------+------+------+------+------+------+------+------+------+
| COL_ID | COLB | COLC | COLD | COLE | COLF | COLG | COLH | COLI |
+--------+------+------+------+------+------+------+------+------+
| aa1    |    1 |    1 |      |      |      |      |      |      |
| aa1    |    2 |    1 |      |      |      |      |      |      |
| aa2    |    3 |    1 |      |      |      |      |      |      |
| ab3    |    6 |    2 |      |      |      |      |      |      |
| ab3    |    5 |    2 |      |      |      |      |      |      |
| ab3    |    7 |    1 |      |      |      |      |      |      |
| ab3    |    1 |    1 |      |      |      |      |      |      |
+--------+------+------+------+------+------+------+------+------+

When duplicates are found in COL_ID, how can we collapse them into a single row, moving the values of each duplicate row into the adjacent empty columns?

+--------+------+------+------+------+------+------+------+------+
| COL_ID | COLB | COLC | COLD | COLE | COLF | COLG | COLH | COLI |
+--------+------+------+------+------+------+------+------+------+
| aa1    |    1 |    1 |    2 |    1 |      |      |      |      |
| aa2    |    3 |    1 |      |      |      |      |      |      |
| ab3    |    6 |    2 |    5 |    2 |    7 |    1 |    1 |    1 |
+--------+------+------+------+------+------+------+------+------+

Here is the sample code to generate this table:

import pandas as pd
import numpy as np

# the empty columns COLD..COLI start out as NaN placeholders
my_dic = {'COL_ID': ['aa1', 'aa1', 'aa2', 'ab3', 'ab3', 'ab3', 'ab3'],
          'COLB': [1, 2, 3, 6, 5, 7, 1],
          'COLC': [1, 1, 1, 2, 2, 1, 1],
          'COLD': [np.nan] * 7,
          'COLE': [np.nan] * 7,
          'COLF': [np.nan] * 7,
          'COLG': [np.nan] * 7,
          'COLH': [np.nan] * 7,
          'COLI': [np.nan] * 7}

dp = pd.DataFrame(my_dic)

CodePudding user response:

First idea: apply a lambda function per group. With COL_ID moved to the index by DataFrame.set_index, reshape each group with DataFrame.stack (which also drops the NaN cells), renumber the remaining values with Series.reset_index so they become positional columns, then reshape back with Series.unstack and convert to the nullable Int64 dtype so the integers keep their missing values:

dp = (dp.set_index('COL_ID')
        .groupby('COL_ID')
        .apply(lambda x: x.stack().reset_index(drop=True))
        .unstack()
        .astype('Int64')
        .add_prefix('Col')
        .reset_index())
print (dp)
  COL_ID  Col0  Col1  Col2  Col3  Col4  Col5  Col6  Col7
0    aa1     1     1     2     1  <NA>  <NA>  <NA>  <NA>
1    aa2     3     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
2    ab3     6     2     5     2     7     1     1     1
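
If the column names from the expected output are preferred over the generic Col0, Col1, ..., the positions can be renamed afterwards. A minimal sketch, continuing from the reshaped dp above and assuming the eight positions map onto COLB..COLI in order (the name list and the variable renamed are illustrative, not part of the original answer):

# rename positional columns Col0..Col7 to COLB..COLI (assumed mapping)
new_names = ['COLB', 'COLC', 'COLD', 'COLE', 'COLF', 'COLG', 'COLH', 'COLI']
renamed = dp.rename(columns=dict(zip(dp.columns[1:], new_names)))
print (renamed)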

Another idea: stack the non-key columns into one Series, build a per-group position counter with GroupBy.cumcount, attach it as a second index level, and unstack on it:

s = dp.set_index('COL_ID').stack().droplevel(1)
counter = s.groupby('COL_ID').cumcount().to_numpy()
s.index = [s.index, counter]

dp = s.unstack().astype('Int64').add_prefix('Col').reset_index()
print (dp)
  COL_ID  Col0  Col1  Col2  Col3  Col4  Col5  Col6  Col7
0    aa1     1     1     2     1  <NA>  <NA>  <NA>  <NA>
1    aa2     3     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
2    ab3     6     2     5     2     7     1     1     1
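
Both variants give the same reshaped frame. As a quick, self-contained sanity check (a sketch that keeps the two results in separate variables, here hypothetically res1 and res2, instead of reassigning dp twice):

import pandas as pd
import numpy as np

# rebuild the sample frame from the question
dp = pd.DataFrame({'COL_ID': ['aa1', 'aa1', 'aa2', 'ab3', 'ab3', 'ab3', 'ab3'],
                   'COLB': [1, 2, 3, 6, 5, 7, 1],
                   'COLC': [1, 1, 1, 2, 2, 1, 1],
                   'COLD': [np.nan] * 7,
                   'COLE': [np.nan] * 7,
                   'COLF': [np.nan] * 7,
                   'COLG': [np.nan] * 7,
                   'COLH': [np.nan] * 7,
                   'COLI': [np.nan] * 7})

# first approach: groupby + per-group stack
res1 = (dp.set_index('COL_ID')
          .groupby('COL_ID')
          .apply(lambda x: x.stack().reset_index(drop=True))
          .unstack()
          .astype('Int64')
          .add_prefix('Col')
          .reset_index())

# second approach: one stack + cumcount as the new column position
s = dp.set_index('COL_ID').stack().droplevel(1)
s.index = [s.index, s.groupby('COL_ID').cumcount().to_numpy()]
res2 = s.unstack().astype('Int64').add_prefix('Col').reset_index()

print (res1.equals(res2))  # expected: True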