Home > Software engineering >  Pandas pick an item from list column and search in other list column
Pandas pick an item from list column and search in other list column

Time:01-17

I have a dataframe like as shown below

sample_df = pd.DataFrame({'single_proj_name': [['jsfk'],['fhjk'],['ERRW'],['SJBAK']],
                          'single_item_list': [['ABC_123'],['DEF123'],['FAS324'],['HSJD123']],
                          'single_id':[[1234],[5678],[91011],[121314]],
                          'multi_proj_name':[['AAA','VVVV','SASD'],['QEWWQ','SFA','JKKK','fhjk'],['ERRW','TTTT'],['SJBAK','YYYY']],
                          'multi_item_list':[[['XYZAV','ADS23','ABC_123'],['ABC_123','ADC_123']],['XYZAV','DEF123','ABC_123','SAJKF'],['QWER12','FAS324'],['JFAJKA','HSJD123']],
                          'multi_id':[[[2167,2147,29481],[5432,1234]],[2313,57567,2321,7898],[1123,8775],[5237,43512]]})

I would like to do the below

a) Pick the value from single_item_list for each row

b) search that value in multi_item_list column of the same row

c) If match found, keep only that value in multi_item_list and remove all other non-matching values from multi_item_list

d) Based on the position of the match item, look for corresponding value in multi_id list and keep only that item. Remove all other position items from the list

So, I tried the below but it doesn't work

def func(df):
    return list(set(sample_df['single_item_list']) - set(sample_df['multi_item_list']))

sample_df['col3'] = sample_df.apply(func, axis = 1)

I expect my output to be like as below

enter image description here

CodePudding user response:

You can explode the two lists in parallel (pandas 1.3 ), and filter:

(sample_df
  .explode(['multi_item_list', 'multi_id'])
  .loc[lambda d: d['single_item_list'].str[0].eq(d['multi_item_list'])]
)

If you want your lists back:

(sample_df
  .explode(['multi_item_list', 'multi_id'])
  .loc[lambda d: d['single_item_list'].str[0].eq(d['multi_item_list'])]
  .assign(multi_item_list=lambda d: d['multi_item_list'].apply(lambda x: [x]),
          multi_id=lambda d: d['multi_id'].apply(lambda x: [x]),
         )
)

Output:

  single_proj_name single_item_list single_id           multi_proj_name multi_item_list multi_id
0           [jsfk]        [ABC_123]    [1234]         [AAA, VVVV, SASD]       [ABC_123]  [29481]
1           [fhjk]         [DEF123]    [5678]  [QEWWQ, SFA, JKKK, fhjk]        [DEF123]  [57567]
2           [ERRW]         [FAS324]   [91011]              [ERRW, TTTT]        [FAS324]   [8775]
3          [SJBAK]        [HSJD123]  [121314]             [SJBAK, YYYY]       [HSJD123]  [43512]

alternative using a list comprehension:

sample_df[['multi_item_list', 'multi_id']] = \
pd.DataFrame([next(([[x], [c[i]]] for i,x in enumerate(b) if x==a[0]), None)
              for a, b, c in
              zip(sample_df['single_item_list'],
                  sample_df['multi_item_list'],
                  sample_df['multi_id'])],
              columns=['multi_item_list', 'multi_id'])
  • Related