I am having a CSV file that contains multiple duplicate entries. What I'am trying to do is that, I need to gather those duplicate fields and arrange their corresponding fields.
My table:
Column A | Column B | Column C | Column D |
---|---|---|---|
1004. | 1 | ||
1004. | 3 | ||
1004. | 2 |
What I need:
Column A | Column B | Column C | Column D |
---|---|---|---|
1004. | 1 | 3 | 2 |
How do I solve this in Python.
I tried Identifying the duplicates and dunno what to do next.
import pandas
df = pandas.read_csv(csv_file, names=fields, index_col=False)
df = df[df.duplicated([column1], keep=False)]
df.to_csv(csv_file2, index=False)
CodePudding user response:
you can use a lambda function. First, we group by Column A:
df=pd.DataFrame(data={'col_a':[1004,1004,1004,1005],'col_b':[1,3,2,2],'col_c':['','','',''],'col_d':['','','','']})
df=df.replace(r'^\s*$',np.nan,regex=True) #replace empty cells with nan
dfx=df.groupby('col_a').agg(list)
print(dfx)
col_a col_b col_c col_d
1004 [1, 3, 2] [nan, nan, nan] [nan, nan, nan]
1005 [2] [nan] [nan]
if you have a several columns, you can replace the nan values according to the index number of the values in the list:
dfx['col_c']=dfx['col_b'].apply(lambda x: x[1] if len(x) > 1 else np.nan) #get the second value from col_b
dfx['col_d']=dfx['col_b'].apply(lambda x: x[2] if len(x) > 2 else np.nan) #get the third value from col_b
dfx['col_b']=dfx['col_b'].apply(lambda x: x[0] if len(x) > 0 else np.nan) #finally replace col_b with its first value (col_b must be replaced last)
Note: if we replace col_b first, we will lose the list. This is why we are replacing col_b last.
if there are many columns we can use a for loop:
loop_cols=[*dfx.columns[1:]] # get the list of columns except col_b.
list_index=1
list_lenght=1
for i in loop_cols:
dfx[i]=dfx['col_b'].apply(lambda x: x[list_index] if len(x) > list_lenght else np.nan)
list_index =1
list_lenght =1
#finally, replace col_b.
dfx['col_b']=dfx['col_b'].apply(lambda x: x[0] if len(x) > 0 else np.nan)
output:
col_a col_b col_c col_d
1004 1 3.0 2.0
1005 2