How to compare each element of a delimited string in a pandas data frame column with a python l

I have a data frame with a column of delimited strings that has to be compared with a list. If any element of the delimited string also appears in the list, that row should be kept.

For example

import pandas as pd

test_lst = [20, 45, 35]
data = pd.DataFrame({'colA': [1, 2, 3],
                     'colB': ['20,45,50,60', '22,70,35', '10,90,100']})

The output should contain the first row, because the elements 20 and 45 appear both in the list and in the delimited text in colB.

Likewise, 35 intersects in row 2, so the expected result is:

colA	colB
1	20,45,50,60
2	22,70,35

What I have tried is

test_lst = [20, 45, 35]
data["colC"]= data['colB'].str.split(',')
data

# data["colC"].apply(lambda x: set(x).intersection(test_lst))
print(data[data['colC'].apply(lambda x: set(x).intersection(test_lst)).astype(bool)])
data

This does not give the required result.

Any help is appreciated.

CodePudding user response：

The problem with your code is that test_lst is a list of integers, while each row of colC is a list of strings. set(x).intersection(test_lst) therefore compares strings with integers, and since '20' != 20 the intersection is always an empty set, which is falsy, so every row is filtered out. If you convert test_lst to a list/set of strings, it works as expected.
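The type mismatch is easy to check on a single row. A minimal sketch, using the first colB value from the question (the variable names mixed and same_type are only for illustration):

```python
test_lst = [20, 45, 35]
row = '20,45,50,60'.split(',')   # str.split yields strings: ['20', '45', '50', '60']

# Strings compared with integers: nothing is equal, so the result is empty
mixed = set(row).intersection(test_lst)

# Strings compared with strings: the common elements are found
same_type = set(row).intersection(map(str, test_lst))

print(mixed)      # set()
print(same_type)  # {'20', '45'} (set ordering may vary)
```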

Since you're not actually interested in the intersection itself but only in whether one exists, I recommend using set.isdisjoint on each row, which avoids building the intersection set.

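# Convert the integers to strings once so they compare equal to the split values
test_list_strings = set(map(str, test_lst))
# True for every row that shares at least one element with the list
msk = data['colB'].str.split(',').apply(lambda row: not test_list_strings.isdisjoint(row))
out = data[msk]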

You could write the second line as a list comprehension as well:

msk = [not test_list_strings.isdisjoint(row) for row in data['colB'].str.split(',')]

Output:

   colA         colB
0     1  20,45,50,60
1     2     22,70,35
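For reuse, the same idea can be wrapped in a small helper. This is only a sketch: the function name filter_rows_with_any and its sep parameter are made up for illustration, and the strip() call is an added assumption to tolerate values written like '20, 45', which the question's data does not contain.

```python
import pandas as pd

def filter_rows_with_any(df, column, values, sep=','):
    """Keep rows whose delimited string in `column` shares an element with `values`."""
    # Compare as strings, since splitting a string produces strings
    wanted = {str(v) for v in values}
    mask = [
        # isdisjoint accepts any iterable; strip() tolerates spaces around items
        not wanted.isdisjoint(part.strip() for part in cell.split(sep))
        for cell in df[column]
    ]
    return df[mask]

data = pd.DataFrame({'colA': [1, 2, 3],
                     'colB': ['20,45,50,60', '22,70,35', '10,90,100']})
out = filter_rows_with_any(data, 'colB', [20, 45, 35])
print(out)
```

With the question's data this keeps the rows where colA is 1 and 2.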

CodePudding user response：

This might not be the best approach, but it works.

import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 3],
                   'colB': ['20,45,50,60', '22,70,35', '10,90,100']})

test_lst = [20, 45, 35]

def match_element(row):
    # Convert the delimited string to integers so they can be compared with test_lst
    row_elements = [int(n) for n in row.split(',')]
    # True if at least one element of the row is in test_lst
    return any(value in test_lst for value in row_elements)

# Build a boolean mask and keep only the matching rows
mask = df['colB'].apply(match_element)
df = df[mask]

Output:

	colA	colB
0	1	20,45,50,60
1	2	22,70,35
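If you would rather avoid a Python-level function per row, one possible alternative (not taken from either answer above) is to split and explode the column, test membership with isin, and collapse back to one boolean per original row. A sketch, assuming every element in colB parses as an integer:

```python
import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 3],
                   'colB': ['20,45,50,60', '22,70,35', '10,90,100']})
test_lst = [20, 45, 35]

# One row per element; the original row label is repeated in the index
exploded = df['colB'].str.split(',').explode()

# Convert to int so the values compare equal to the integers in test_lst
hits = exploded.astype(int).isin(test_lst)

# A row matches if any of its elements matched
mask = hits.groupby(level=0).any()
result = df[mask]
print(result)
```

This relies on the index labels of df being unique, since groupby(level=0) uses them to reassemble the rows.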