Assuming we have the following table:
--------- ------ ------ ------ ------ ------ ------ ------ ------
| COL_ID | ColB | COLC | COLD | COLE | COLF | COLG | COLH | COLI |
--------- ------ ------ ------ ------ ------ ------ ------ ------
| aa1 | 1 | 1 | | | | | | |
| aa1 | 2 | 1 | | | | | | |
| aa2 | 3 | 1 | | | | | | |
| ab3 | 6 | 2 | | | | | | |
| ab3 | 5 | 2 | | | | | | |
| ab3 | 7 | 1 | | | | | | |
| ab3 | 1 | 1 | | | | | | |
--------- ------ ------ ------ ------ ------ ------ ------ ------
How can we assign the values of duplicates in the adjacent column if a duplicate is found?
--------- ------ ------ ------ ------ ------ ------ ------ ------
| COL_ID | ColB | COLC | COLD | COLE | COLF | COLG | COLH | COLI |
--------- ------ ------ ------ ------ ------ ------ ------ ------
| aa1 | 1 | 1 | 1 | 1 | 2 | 1 | | |
| aa2 | 3 | 1 | | | | | | |
| ab3 | 6 | 2 | 5 | 2 | 7 | 1 | 1 | 1 |
--------- ------ ------ ------ ------ ------ ------ ------ ------
Here is the sample code to generate this table
import pandas as pd
import numpy as np
my_dic = {'COL_ID': ['aa1', 'aa1', 'aa2', 'ab3','ab3','ab3','ab3'],
'COLB': [1,2,3,6,5,7,1],
'COLC': [1,1,1,2,2,1,1],
'COLD':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'COLF':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'COLG':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'COLH':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'COLI:':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}
dp = pd.DataFrame(my_dic)
CodePudding user response:
First idea is use lambda function for all columns without COL_ID
converted to index by DataFrame.set_index
with DataFrame.stack
, create default index by Series.reset_index
for new columns names, last reshape by Series.unstack
with converting to integers with missing values by Int64
:
dp = (dp.set_index('COL_ID')
.groupby('COL_ID')
.apply(lambda x: x.stack().reset_index(drop=True))
.unstack()
.astype('Int64')
.add_prefix('Col')
.reset_index())
print (dp)
COL_ID Col0 Col1 Col2 Col3 Col4 Col5 Col6 Col7
0 aa1 1 1 2 1 <NA> <NA> <NA> <NA>
1 aa2 3 1 <NA> <NA> <NA> <NA> <NA> <NA>
2 ab3 6 2 5 2 7 1 1 1
Another idea:
s = dp.set_index('COL_ID').stack().droplevel(1)
counter = s.groupby('COL_ID').cumcount().to_numpy()
s.index = [s.index, counter]
dp = s.unstack().astype('Int64').add_prefix('Col').reset_index()
print (dp)
COL_ID Col0 Col1 Col2 Col3 Col4 Col5 Col6 Col7
0 aa1 1 1 2 1 <NA> <NA> <NA> <NA>
1 aa2 3 1 <NA> <NA> <NA> <NA> <NA> <NA>
2 ab3 6 2 5 2 7 1 1 1