I have a DataFrame with a column of delimited strings that has to be compared with a list. If the elements of the delimited string and the elements of the list intersect, the row should be kept.
For example
test_lst = [20, 45, 35]
data = pd.DataFrame({'colA': [1, 2, 3],
'colB': ['20,45,50,60', '22,70,35', '10,90,100']})
should produce the output below, because 20 and 45 appear both in the list and in the delimited text of the first row.
Likewise, 35 intersects in the second row.
colA | colB |
---|---|
1 | 20,45,50,60 |
2 | 22,70,35 |
What I have tried is
test_lst = [20, 45, 35]
data["colC"]= data['colB'].str.split(',')
data
# data["colC"].apply(lambda x: set(x).intersection(test_lst))
print(data[data['colC'].apply(lambda x: set(x).intersection(test_lst)).astype(bool)])
data
This does not give the required result.
Any help is appreciated.
CodePudding user response:
The problem with your code is that test_lst is a list of integers, while each row of data['colC'] is a list of strings, so set(x).intersection(test_lst) compares integers with strings. That intersection is always empty, so the mask is False for every row. If you convert test_lst to a list or set of strings, it works as expected.
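The type mismatch can be reproduced without pandas. The following is a minimal sketch using the first row from the question; the names row_items, mixed and as_strs are illustrative and not part of the original code.

```python
# Values as they appear after splitting the first row of colB.
row_items = "20,45,50,60".split(",")   # ['20', '45', '50', '60']
test_lst = [20, 45, 35]

# A string never compares equal to an integer, so nothing matches.
mixed = set(row_items).intersection(test_lst)

# Converting test_lst to strings makes the element types agree.
as_strs = set(row_items).intersection(map(str, test_lst))

print(mixed)            # set()
print(sorted(as_strs))  # ['20', '45']
```

An empty set is falsy, which is why .astype(bool) in the question turns every row into False.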
Since you're not interested in the intersection itself but only in whether one exists, I recommend using set.isdisjoint in a loop.
test_list_strings = set(map(str, test_lst))
msk = data['colB'].str.split(',').apply(lambda row: not test_list_strings.isdisjoint(row))
out = data[msk]
You could write the second line as a list comprehension as well:
msk = [not test_list_strings.isdisjoint(row) for row in data['colB'].str.split(',')]
Output:
colA colB
0 1 20,45,50,60
1 2 22,70,35
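As a possible alternative that stays within pandas methods, the split values can be exploded to one element per row and tested with isin. This is only a sketch of a different technique than the isdisjoint approach above; the names exploded and wanted are illustrative, and it has not been benchmarked against the other versions.

```python
import pandas as pd

test_lst = [20, 45, 35]
data = pd.DataFrame({'colA': [1, 2, 3],
                     'colB': ['20,45,50,60', '22,70,35', '10,90,100']})

# One row per delimited element; the original index label is repeated.
exploded = data['colB'].str.split(',').explode()

# Compare as strings, then collapse back to one boolean per original row.
wanted = [str(n) for n in test_lst]
msk = exploded.isin(wanted).groupby(level=0).any()

out = data[msk]
print(out)
```

Grouping on the index level works here because explode keeps the original row label for every element it produces.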
CodePudding user response:
This might not be the best approach, but it works.
import pandas as pd
df = pd.DataFrame({'colA': [1, 2, 3],
'colB': ['20,45,50,60', '22,70,35', '10,90,100']})
def match_element(row):
    row_elements = [int(n) for n in row.split(',')]
    test_lst = [20, 45, 35]
    if [value for value in row_elements if value in test_lst]:
        return True
    else:
        return False
mask = df['colB'].apply(lambda row: match_element(row))
df = df[mask]
Output:
index | colA | colB |
---|---|---|
0 | 1 | 20,45,50,60 |
1 | 2 | 22,70,35 |
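
The same idea can be written more compactly with any(), which stops at the first match instead of building a full list. This is a sketch under the assumption that every element of colB parses as an integer; test_set and filtered are illustrative names that do not appear in the answer above.

```python
import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 3],
                   'colB': ['20,45,50,60', '22,70,35', '10,90,100']})

# Defined once outside the function; a set gives fast membership checks.
test_set = {20, 45, 35}

def match_element(cell):
    # True as soon as one element of the delimited string is in test_set.
    return any(int(n) in test_set for n in cell.split(','))

mask = df['colB'].apply(match_element)
filtered = df[mask]
print(filtered)
```

Passing match_element directly to apply also avoids the extra lambda wrapper.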