comparing 2 tuples in a dataframe-CodePudding

Based on the following dataframe:

import json
import numpy as np
import pandas as pd
test_list = ['purple', 'red', 'yellow']
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': [['red','blue'], ['white'], ['blue','yellow']]})
df['colors_new'] = df.colors.map(tuple)

I am trying to generate a new column where if at least one of the elements in test_list is in each row then I mark that row as true

df['found'] = any((True for x in test_list if x in df['colors_new']))
df

in the example above row 0 and 2 should be true, because red is in row 0 and yellow is in row 2

which will be the most efficient and correct way as I am currently getting the wrong results

I think the closest that I am able to get a right response is

df['found'] = ['red' in x for x in df['colors_new']]

But using this doesnt work when I have multiple items (test_list = ['purple', 'red', 'yellow'])

CodePudding user response：

Use sets with isdisjoint if performance is important:

s = set(test_list)
df['colors_new'] = ~df.colors.map(s.isdisjoint)

Or:

s = set(test_list)
df['colors_new'] = df['colors'].map(s.intersection).astype(bool)

print (df)

   numbers          colors  colors_new
0        1     [red, blue]        True
1        2         [white]       False
2        3  [blue, yellow]        True

Performance in test data, best test in real, because depends of length of DataFrame, length of tested list, number of matched values:

df['colors_new'] = df.colors.map(tuple)

#3k rows
df = pd.concat([df] * 1000, ignore_index=True)

test_list = ['purple', 'red', 'yellow']

s = set(test_list)

In [46]: %timeit df['colors_new'] = ~df.colors.map(s.isdisjoint)
707 µs ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [47]: %timeit df['colors_new'] = df['colors'].map(s.intersection).astype(bool)
1.38 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [50]: %timeit df['found'] = df['colors_new'].apply(lambda x: len(s.intersection(x))>0)
1.68 ms ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [51]: %timeit df['found'] = df['colors_new'].explode().isin(test_list).groupby(level=0).max()
4.66 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [52]: %timeit df['found'] = df['colors_new'].apply(lambda x: bool(max([1 if y in test_list else 0 for y in x])))
2.91 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [54]: %timeit df["colors_map"] = df[['colors','colors_new']].apply(lambda x:any([x2 in test_list for x1 in x for x2 in x1]), axis=1)
26.1 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response：

using explode

df['found'] = df['colors_new'].explode().isin(test_list).groupby(level=0).max()

output:

   numbers          colors      colors_new  found
0        1     [red, blue]     (red, blue)   True
1        2         [white]        (white,)  False
2        3  [blue, yellow]  (blue, yellow)   True

using python sets

You could use a set and set.intersection, if the intersection is not empty, there are common values.

Sets operations are faster than classical loops.

test_list = set(test_list)
df['found'] = df['colors_new'].apply(lambda x: len(test_list.intersection(x))>0)

output:

   numbers          colors      colors_new  found
0        1     [red, blue]     (red, blue)   True
1        2         [white]        (white,)  False
2        3  [blue, yellow]  (blue, yellow)   True

NB. as a bonus you can use the same approach to get the found elements

df['found elements'] = df['colors_new'].apply(test_list.intersection)

output:

   numbers          colors      colors_new  found found elements
0        1     [red, blue]     (red, blue)   True          {red}
1        2         [white]        (white,)  False             {}
2        3  [blue, yellow]  (blue, yellow)   True       {yellow}

CodePudding user response：

You could use the lambda function to get what you want:

import json
import numpy as np
import pandas as pd
test_list = ['purple', 'red', 'yellow']
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': [['red','blue'], ['white'], ['blue','yellow']]})
df['colors_new'] = df.colors.map(tuple)

df['found'] = df['colors_new'].apply(lambda x: bool(max([1 if y in test_list else 0 for y in x])))

CodePudding user response：

You can use list comprehensions as well:

df["colors_map"] = df[['colors','colors_new']].apply(lambda x:any([x2 in test_list for x1 in x for x2 in x1]), axis=1)

If you have a lot of colors colums you are checking (not just 2):

df["colors_map"] = df[[x for x in df.columns if "colors" in x]].apply(lambda x:any([x2 in test_list for x1 in x for x2 in x1]), axis=1)