Pandas filter list of list values in a dataframe column-CodePudding

I have a dataframe like as below

sample_df = pd.DataFrame({'single_proj_name': [['jsfk'],['fhjk'],['ERRW'],['SJBAK']],
                              'single_item_list': [['ABC_123'],['DEF123'],['FAS324'],['HSJD123']],
                              'single_id':[[1234],[5678],[91011],[121314]],
                              'multi_proj_name':[['AAA','VVVV','SASD'],['QEWWQ','SFA','JKKK','fhjk'],['ERRW','TTTT'],['SJBAK','YYYY']],
                              'multi_item_list':[[['XYZAV','ADS23','ABC_123'],['ABC_123','ADC_123']],['XYZAV','DEF123','ABC_123','SAJKF'],['QWER12','FAS324'],['JFAJKA','HSJD123']],
                              'multi_id':[[[2167,2147,29481],[5432,1234]],[2313,57567,2321,7898],[1123,8775],[5237,43512]]})

I would like to do the below

a) Pick the value from single_item_list for each row

b) search that value in multi_item_list column of the same row. Please note that it could be list of lists for some of the rows

c) If match found, keep only that matched values in multi_item_list and remove all other non-matching values from multi_item_list

d) Based on the position of the match item, look for corresponding value in multi_id list and keep only that item. Remove all other position items from the list

So, I tried the below but it doesn't work

for a, b, c in zip(sample_df['single_item_list'],sample_df['multi_item_list'],sample_df['multi_id']):
    for i, x in enumerate(b):
        print(x)
        print(a[0])
        if a[0] in x:
            print(x.index(a[0]))
            pos = x.index(a[0])
            print(c[pos-1])

I expect my output to be like as below

CodePudding user response：

Your problem is that you have columns of list. So my advice would be to use this method instad:

import pandas as pd
sample_df = pd.DataFrame({'single_proj_name': [['jsfk'],['fhjk'],['ERRW'],['SJBAK']],
                              'single_item_list': [['ABC_123'],['DEF123'],['FAS324'],['HSJD123']],
                              'single_id':[[1234],[5678],[91011],[121314]],
                              'multi_proj_name':[['AAA','VVVV','SASD'],['QEWWQ','SFA','JKKK','fhjk'],['ERRW','TTTT'],['SJBAK','YYYY']],
                              'multi_item_list':[[['XYZAV','ADS23','ABC_123'],['ABC_123','ADC_123']],['XYZAV','DEF123','ABC_123','SAJKF'],['QWER12','FAS324'],['JFAJKA','HSJD123']],
                              'multi_id':[[[2167,2147,29481],[5432,1234]],[2313,57567,2321,7898],[1123,8775],[5237,43512]]})

for i in range(len(sample_df)):
    single_item = sample_df.iloc[i]['single_item_list'][0]
    multi_item_list = sample_df.iloc[i]['multi_item_list']
    multi_id_list = sample_df.iloc[i]['multi_id']
    for j, sublist in enumerate(multi_item_list):
        for k, item in enumerate(sublist):
            if single_item == item:
                flag = 1
                sample_df.at[i, 'multi_item_list'] = [item]
                sample_df.at[i, 'multi_id'] = multi_id_list[j][k]

which returs:

 single_proj_name single_item_list single_id           multi_proj_name  \
0           [jsfk]        [ABC_123]    [1234]         [AAA, VVVV, SASD]   
1           [fhjk]         [DEF123]    [5678]  [QEWWQ, SFA, JKKK, fhjk]   
2           [ERRW]         [FAS324]   [91011]              [ERRW, TTTT]   
3          [SJBAK]        [HSJD123]  [121314]             [SJBAK, YYYY]   

                   multi_item_list                   multi_id  
0                        [ABC_123]                       5432  
1  [XYZAV, DEF123, ABC_123, SAJKF]  [2313, 57567, 2321, 7898]  
2                 [QWER12, FAS324]               [1123, 8775]  
3                [JFAJKA, HSJD123]              [5237, 43512]

The nested loop will iterate through the sublists within the multi_item_list and check each item within those sublists to match the single_item. When it finds the match, it updates the multi_item_list and multi_id columns.

CodePudding user response：

The code you've provided uses a zip() function to iterate over the 'single_item_list', 'multi_item_list', and 'multi_id' columns of the DataFrame simultaneously.

For each iteration, it uses a nested for loop to iterate over the sublists in the 'multi_item_list' column. It checks if the first element of the 'single_item_list' is present in the current sublist, using the in operator. If it is present, it finds the index of the matching element in the sublist using the index() method, and assigns it to the variable pos. Then it prints the value in the corresponding index of the 'multi_id' column.

This code will work correctly, but it's only printing the matched value in multi_id column, it's not updating the multi_item_list and multi_id columns of the DataFrame. In order to update the DataFrame with the matched values, you will have to use the .iloc method to update the Dataframe. e.g: sample_df.iloc[i,j] = new_val

for i, (single, multi_item, multi_id) in enumerate(zip(sample_df['single_item_list'],sample_df['multi_item_list'],sample_df['multi_id'])):
for j, item_list in enumerate(multi_item):
    if single[0] in item_list:
        pos = item_list.index(single[0])
        sample_df.at[i,'multi_item_list'] = [item_list]
        sample_df.at[i,'multi_id'] = [multi_id[j]]

print(sample_df)

This will print the updated DataFrame with the filtered values in the 'multi_item_list' and 'multi_id' columns. Please note that the print(sample_df) should be placed after the for loop to make sure the table is printed after the updates.

This code iterates over the 'single_item_list', 'multi_item_list', and 'multi_id' columns of the DataFrame simultaneously. In each iteration, it uses a nested for loop to iterate over the sublists in the 'multi_item_list' column. It checks if the first element of the 'single_item_list' is present in the current sublist, using the in operator. If it is present, it finds the index of the matching element in the sublist using the index() method, and assigns it to the variable pos. Then it updates the 'multi_item_list' and 'multi_id' columns of the DataFrame at the current index with the matched value using the at method.

Please note that this code will remove the non-matching items from the 'multi_item_list' and 'multi_id' columns, if there is no matching item it will keep the original values.