I have a pandas dataframe with nested lists as values in a column as follows:
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'single_proj_name': [['jsfk'],['fhjk'],['ERRW'],['SJBAK']],
                          'single_item_list': [['ABC_123'],['DEF123'],['FAS324'],['HSJD123']],
                          'single_id':[[1234],[5678],[91011],[121314]],
                          'multi_proj_name':[['AAA','VVVV','SASD'],['QEWWQ','SFA','JKKK','fhjk'],['ERRW','TTTT'],['SJBAK','YYYY']],
                          'multi_item_list':[[['XYZAV','ADS23','ABC_123'],['XYZAV','ADS23','ABC_123']],['XYZAV','DEF123','ABC_123','SAJKF'],['QWER12','FAS324'],[['JFAJKA','HSJD123'],['JFAJKA','HSJD123']]],
                          'multi_id':[[[2167,2147,29481],[2167,2147,29481]],[2313,57567,2321,7898],[1123,8775],[[5237,43512],[5237,43512]]]})
As you can see above, in some columns, the same list is repeated twice or more.
So, I would like to remove the duplicated list and only retain one copy of the list.
I was trying something like the below:
for i, (single, multi_item, multi_id) in enumerate(zip(sample_df['single_item_list'],sample_df['multi_item_list'],sample_df['multi_id'])):
    if (any(isinstance(i, list) for i in multi_item)) == False:
        for j, item_list in enumerate(multi_item):
            if single[0] in item_list:
                pos = item_list.index(single[0])
                sample_df.at[i,'multi_item_list'] = [item_list]
                sample_df.at[i,'multi_id'] = [multi_id[j]]
    else:
        print("under nested list")
        for j, item_list in enumerate(zip(multi_item,multi_id)):
            if single[0] in multi_item[j]:
                pos = multi_item[j].index(single[0])
                sample_df.at[i,'multi_item_list'][j] = single[0]
                sample_df.at[i,'multi_id'][j] = multi_id[j][pos]
            else:
                sample_df.at[i,'multi_item_list'][j] = np.nan
                sample_df.at[i,'multi_id'][j] = np.nan
But this assigns NaN to the whole value in the column. I only want to remove the duplicated list within the nested list.
I expect my output to look like the following:
CodePudding user response:
In this data, removing duplicates appears to be equivalent to keeping only the first element of any list of lists, while flat lists are kept as they are. If that holds, you can solve it as follows:
def get_first_list(x):
    if isinstance(x[0], list):
        return [x[0]]
    return x

for c in ['multi_item_list', 'multi_id']:
    sample_df[c] = sample_df[c].apply(get_first_list)
Result:
single_proj_name single_item_list single_id multi_proj_name multi_item_list multi_id
0 [jsfk] [ABC_123] [1234] [AAA, VVVV, SASD] [[XYZAV, ADS23, ABC_123]] [[2167, 2147, 29481]]
1 [fhjk] [DEF123] [5678] [QEWWQ, SFA, JKKK, fhjk] [XYZAV, DEF123, ABC_123, SAJKF] [2313, 57567, 2321, 7898]
2 [ERRW] [FAS324] [91011] [ERRW, TTTT] [QWER12, FAS324] [1123, 8775]
3 [SJBAK] [HSJD123] [121314] [SJBAK, YYYY] [[JFAJKA, HSJD123]] [[5237, 43512]]
To handle cells that may contain more than one unique sublist, the get_first_list function can be adjusted to:
def get_first_list(x):
    if isinstance(x[0], list):
        new_x = []
        for i in x:
            if i not in new_x:
                new_x.append(i)
        return new_x
    return x
This will keep the order of the sublists while removing any sublist duplicates.
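As a quick check of the order-preserving version, here is a minimal sketch on a made-up cell value. The `cell` data and the name `dedupe_sublists` are illustrative and not from the question. It also guards against empty lists, which `x[0]` would otherwise fail on:

```python
def dedupe_sublists(x):
    """Drop repeated sublists from a list of lists, keeping first-seen order.

    Flat lists and empty lists are returned unchanged.
    """
    if not x or not isinstance(x[0], list):
        return x
    unique = []
    for sub in x:
        if sub not in unique:  # list equality compares element by element
            unique.append(sub)
    return unique


# Hypothetical cell: two distinct sublists, one of them repeated.
cell = [['B', 'C'], ['A', 'D'], ['B', 'C']]
print(dedupe_sublists(cell))             # [['B', 'C'], ['A', 'D']]
print(dedupe_sublists(['X', 'Y', 'X']))  # flat list is left untouched
```

The membership test is a linear scan, so this is quadratic in the number of sublists per cell, which should be fine for short lists like the ones in the question.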
CodePudding user response:
A shorter version using the np.unique function:
cols = ['multi_item_list', 'multi_id']
sample_df[cols] = sample_df[cols].apply(lambda x: [np.unique(a, axis=0) if isinstance(a[0], list) else a for a in x.values])
In [382]: sample_df
Out[382]:
single_proj_name single_item_list single_id multi_proj_name \
0 [jsfk] [ABC_123] [1234] [AAA, VVVV, SASD]
1 [fhjk] [DEF123] [5678] [QEWWQ, SFA, JKKK, fhjk]
2 [ERRW] [FAS324] [91011] [ERRW, TTTT]
3 [SJBAK] [HSJD123] [121314] [SJBAK, YYYY]
multi_item_list multi_id
0 [[XYZAV, ADS23, ABC_123]] [[2167, 2147, 29481]]
1 [XYZAV, DEF123, ABC_123, SAJKF] [2313, 57567, 2321, 7898]
2 [QWER12, FAS324] [1123, 8775]
3 [[JFAJKA, HSJD123]] [[5237, 43512]]
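One caveat worth checking before relying on this: `np.unique` with `axis=0` returns a NumPy array with its rows sorted, so the original sublist order is not kept and the cell no longer holds a plain list. It also needs the sublists in a cell to have equal length so that they form a 2-D array. A small sketch on made-up data (the `cell` value is illustrative); `.tolist()` converts back if plain lists are needed:

```python
import numpy as np

# Hypothetical cell: the duplicate row is [3, 4], and [1, 2] comes second.
cell = [[3, 4], [1, 2], [3, 4]]

unique_rows = np.unique(cell, axis=0)
print(type(unique_rows).__name__)  # ndarray
print(unique_rows.tolist())        # [[1, 2], [3, 4]] (sorted, not first-seen order)
```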