Home > Software engineering >  Identifying Duplicates and arranging corresponding data in their perspective columns using python
Identifying Duplicates and arranging corresponding data in their perspective columns using python

Time:11-06

I am having a CSV file that contains multiple duplicate entries. What I'am trying to do is that, I need to gather those duplicate fields and arrange their corresponding fields.

My table:

Column A Column B Column C Column D
1004. 1
1004. 3
1004. 2

What I need:

Column A Column B Column C Column D
1004. 1 3 2

How do I solve this in Python.

I tried Identifying the duplicates and dunno what to do next.

import pandas

df = pandas.read_csv(csv_file, names=fields, index_col=False)

df = df[df.duplicated([column1], keep=False)]

df.to_csv(csv_file2, index=False)

CodePudding user response:

you can use a lambda function. First, we group by Column A:

df=pd.DataFrame(data={'col_a':[1004,1004,1004,1005],'col_b':[1,3,2,2],'col_c':['','','',''],'col_d':['','','','']})

df=df.replace(r'^\s*$',np.nan,regex=True) #replace empty cells with nan
dfx=df.groupby('col_a').agg(list)
print(dfx)

col_a   col_b       col_c           col_d
1004    [1, 3, 2]   [nan, nan, nan] [nan, nan, nan]
1005    [2]         [nan]           [nan]

if you have a several columns, you can replace the nan values ​​according to the index number of the values ​​in the list:

dfx['col_c']=dfx['col_b'].apply(lambda x: x[1] if len(x) > 1 else np.nan) #get the second value from col_b
dfx['col_d']=dfx['col_b'].apply(lambda x: x[2] if len(x) > 2 else np.nan) #get the third value from col_b
dfx['col_b']=dfx['col_b'].apply(lambda x: x[0] if len(x) > 0 else np.nan) #finally replace col_b with its first value (col_b must be replaced last) 

Note: if we replace col_b first, we will lose the list. This is why we are replacing col_b last.

if there are many columns we can use a for loop:

loop_cols=[*dfx.columns[1:]] # get the list of columns except col_b.

list_index=1
list_lenght=1

for i in loop_cols:
    dfx[i]=dfx['col_b'].apply(lambda x: x[list_index] if len(x) > list_lenght else np.nan)
    list_index =1
    list_lenght =1

#finally, replace col_b.
dfx['col_b']=dfx['col_b'].apply(lambda x: x[0] if len(x) > 0 else np.nan)

output:

col_a   col_b   col_c   col_d
1004    1       3.0     2.0
1005    2       

  • Related