Drop duplicate lists within a nested list value in a column

Time:01-19

I have a pandas dataframe with nested lists as values in a column as follows:

import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'single_proj_name': [['jsfk'],['fhjk'],['ERRW'],['SJBAK']],
                              'single_item_list': [['ABC_123'],['DEF123'],['FAS324'],['HSJD123']],
                              'single_id':[[1234],[5678],[91011],[121314]],
                              'multi_proj_name':[['AAA','VVVV','SASD'],['QEWWQ','SFA','JKKK','fhjk'],['ERRW','TTTT'],['SJBAK','YYYY']],
                              'multi_item_list':[[['XYZAV','ADS23','ABC_123'],['XYZAV','ADS23','ABC_123']],['XYZAV','DEF123','ABC_123','SAJKF'],['QWER12','FAS324'],[['JFAJKA','HSJD123'],['JFAJKA','HSJD123']]],
                              'multi_id':[[[2167,2147,29481],[2167,2147,29481]],[2313,57567,2321,7898],[1123,8775],[[5237,43512],[5237,43512]]]})

As you can see above, in some columns, the same list is repeated twice or more.

So, I would like to remove the duplicated list and only retain one copy of the list.

I was trying something like the below:

for i, (single, multi_item, multi_id) in enumerate(zip(sample_df['single_item_list'],sample_df['multi_item_list'],sample_df['multi_id'])):
    if (any(isinstance(i, list) for i in multi_item)) == False:
        for j, item_list in enumerate(multi_item):
            if single[0] in item_list:
                pos = item_list.index(single[0])
                sample_df.at[i,'multi_item_list'] = [item_list]
                sample_df.at[i,'multi_id'] = [multi_id[j]]
    else:
        print("under nested list")
        for j, item_list in enumerate(zip(multi_item,multi_id)):
            if single[0] in multi_item[j]:
                pos = multi_item[j].index(single[0])
                sample_df.at[i,'multi_item_list'][j] = single[0]
                sample_df.at[i,'multi_id'][j] = multi_id[j][pos]
            else:
                sample_df.at[i,'multi_item_list'][j] = np.nan
                sample_df.at[i,'multi_id'][j] = np.nan

But this assigns NA to the whole column value. I expect to remove that specific list (within a nested list).

I expect my output to look like this:

[image of the expected output; not included]

CodePudding user response:

In the data, removing duplicates appears to be equivalent to keeping only the first element of any list of lists, while flat lists are kept as they are. If that holds, you can solve it as follows:

def get_first_list(x):
    if isinstance(x[0], list):
        return [x[0]]
    return x

for c in ['multi_item_list', 'multi_id']:
    sample_df[c] = sample_df[c].apply(get_first_list)

Result:

  single_proj_name single_item_list single_id           multi_proj_name                  multi_item_list                   multi_id
0           [jsfk]        [ABC_123]    [1234]         [AAA, VVVV, SASD]        [[XYZAV, ADS23, ABC_123]]      [[2167, 2147, 29481]]
1           [fhjk]         [DEF123]    [5678]  [QEWWQ, SFA, JKKK, fhjk]  [XYZAV, DEF123, ABC_123, SAJKF]  [2313, 57567, 2321, 7898]
2           [ERRW]         [FAS324]   [91011]              [ERRW, TTTT]                 [QWER12, FAS324]               [1123, 8775]
3          [SJBAK]        [HSJD123]  [121314]             [SJBAK, YYYY]              [[JFAJKA, HSJD123]]            [[5237, 43512]]

To handle the case where a value can contain more than one unique sublist, the get_first_list function can be adjusted to:

def get_first_list(x):
    if isinstance(x[0], list):
        new_x = []
        for i in x:
            if i not in new_x:
                new_x.append(i)
        return new_x
    return x

This will keep the order of the sublists while removing any sublist duplicates.
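As a self-contained illustration of the same order-preserving idea, here is a minimal sketch. The helper name dedupe_sublists and the demo_df data are made up for this example. It also guards against empty lists (where x[0] would raise an IndexError) and assumes the sublist elements are hashable, such as strings or integers:

```python
import pandas as pd


def dedupe_sublists(value):
    """Drop repeated sublists from a list of lists, keeping first occurrences.

    Flat lists and empty lists are returned unchanged.
    Hypothetical helper; assumes sublist elements are hashable.
    """
    # Guard against empty lists, where value[0] would raise IndexError.
    if not value or not isinstance(value[0], list):
        return value
    seen = set()
    unique = []
    for sub in value:
        key = tuple(sub)  # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            unique.append(sub)
    return unique


# Illustrative data only, shaped like the multi_id column in the question.
demo_df = pd.DataFrame({
    "multi_id": [
        [[2167, 2147, 29481], [2167, 2147, 29481]],  # duplicated sublist
        [2313, 57567, 2321, 7898],                   # flat list
        [[5, 6], [1, 2], [5, 6]],                    # two unique sublists
        [],                                          # empty list
    ]
})
demo_df["multi_id"] = demo_df["multi_id"].apply(dedupe_sublists)
print(demo_df["multi_id"].tolist())
```

Using a set of tuples avoids rescanning the result list for every sublist, which may matter when a cell holds many sublists. For short lists like the ones in the question, the plain `not in` check is just as good.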

CodePudding user response:

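A shorter version uses the np.unique function with axis=0. Note that it returns the unique sublists as a sorted NumPy array rather than a list, so the original order of the sublists is not necessarily preserved: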

cols = ['multi_item_list', 'multi_id']
sample_df[cols] = sample_df[cols].apply(lambda x: [np.unique(a, axis=0) if isinstance(a[0], list) else a for a in x.values])

In [382]: sample_df
Out[382]: 
  single_proj_name single_item_list single_id           multi_proj_name  \
0           [jsfk]        [ABC_123]    [1234]         [AAA, VVVV, SASD]   
1           [fhjk]         [DEF123]    [5678]  [QEWWQ, SFA, JKKK, fhjk]   
2           [ERRW]         [FAS324]   [91011]              [ERRW, TTTT]   
3          [SJBAK]        [HSJD123]  [121314]             [SJBAK, YYYY]   

                   multi_item_list                   multi_id  
0        [[XYZAV, ADS23, ABC_123]]      [[2167, 2147, 29481]]  
1  [XYZAV, DEF123, ABC_123, SAJKF]  [2313, 57567, 2321, 7898]  
2                 [QWER12, FAS324]               [1123, 8775]  
3              [[JFAJKA, HSJD123]]            [[5237, 43512]]  
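To see what np.unique does to a single nested value, the small sketch below may help. The nested list is made-up data, and the example assumes all sublists have the same length so that they form a 2-D array:

```python
import numpy as np

# Illustrative data only: two unique sublists, one of them repeated.
nested = [[5, 6], [1, 2], [5, 6]]

# axis=0 treats each sublist as a row and removes duplicate rows.
unique_rows = np.unique(nested, axis=0)

print(type(unique_rows).__name__)  # ndarray
print(unique_rows.tolist())        # [[1, 2], [5, 6]]
```

The rows come back sorted, so [1, 2] now precedes [5, 6]. If plain Python lists are needed in the dataframe, calling .tolist() on the result converts the array back.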