I have a dataframe with two columns that each contain lists. I want to determine the overlap between the lists in the two columns.

For example:

import pandas as pd

df = pd.DataFrame({'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
                   'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']]})

        one         two
    0   [a, b, c]   [b, c, d]
    1   [d, e, f]   [f, g, h]
    2   [h, i, j]   [l, m, n]

Ultimately, I want it to look like:

        one         two             overlap
    0   [a, b, c]   [b, c, d]       [b, c]
    1   [d, e, f]   [f, g, h]       [f]
    2   [h, i, j]   [l, m, n]       []

CodePudding user response：

There is no efficient vectorized way to do this; the fastest approach is a list comprehension with set intersection:

df['overlap'] = [list(set(a)&set(b)) for a,b in zip(df['one'], df['two'])]

Output:

         one        two overlap
0  [a, b, c]  [b, c, d]  [b, c]
1  [d, e, f]  [f, g, h]     [f]
2  [h, i, j]  [l, m, n]      []
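
One caveat with the set-based version: sets do not guarantee iteration order, so a row could come back as [c, b] rather than [b, c]. If the order of the first list matters, a variant along these lines may help. This is only a sketch; `ordered_overlap` is a helper name made up for illustration, and unlike the set version it keeps duplicates that appear in the first list.

```python
import pandas as pd

df = pd.DataFrame({'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
                   'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']]})

def ordered_overlap(a, b):
    """Return the items of a that also occur in b, in a's order (hypothetical helper)."""
    lookup = set(b)  # build the set once per row so membership tests stay cheap
    return [x for x in a if x in lookup]

df['overlap'] = [ordered_overlap(a, b) for a, b in zip(df['one'], df['two'])]
print(df)
```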

CodePudding user response：

Here is a way using applymap to convert your lists to sets and using set.intersection to find the overlap:

df.join(df.applymap(set).apply(lambda x: set.intersection(*x),axis=1).map(list).rename('overlap'))
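
As far as I know, `DataFrame.applymap` was deprecated in pandas 2.1 in favour of `DataFrame.map`, so the one-liner above may emit a warning on recent versions. Below is a sketch of the same idea that picks whichever method is available. The `hasattr` check and the names `elementwise`, `as_sets` and `result` are my own additions, and `sorted` is used instead of `list` only to make the output order predictable.

```python
import pandas as pd

df = pd.DataFrame({'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
                   'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']]})

# Assumption: newer pandas exposes DataFrame.map; older versions only have applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# Convert every cell to a set, intersect the sets in each row, then turn them back into lists.
as_sets = elementwise(set)
overlap = as_sets.apply(lambda row: set.intersection(*row), axis=1).map(sorted)

# join returns a new frame; df itself is left unchanged.
result = df.join(overlap.rename('overlap'))
print(result)
```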

CodePudding user response：

Using `pandas`

The pandas way of doing this could look like this:

f = lambda row: list(set(row['one']).intersection(row['two']))
df['overlap'] = df.apply(f, axis=1)
print(df)

         one        two overlap
0  [a, b, c]  [b, c, d]  [b, c]
1  [d, e, f]  [f, g, h]     [f]
2  [h, i, j]  [l, m, n]      []

The apply function goes row by row (axis=1) and finds the set.intersection() between the list in column one and column two. Then it returns the result as a list.

Apply methods are not the fastest, but they are quite readable in my opinion. Since your question doesn't mention speed as a criterion, this shouldn't be an issue.

Additionally, you could use either of the two expressions as your lambda function, as both do the same task -

#Option 1:
f = lambda x: list(set(x['one']) & set(x['two']))

#Option 2:
f = lambda x: list(set(x['one']).intersection(x['two']))

Using `Numpy`

You can also use NumPy's np.intersect1d, mapped over the two Series. Note that it returns sorted NumPy arrays of unique values rather than Python lists.

import numpy as np
import pandas as pd

df['overlap'] = pd.Series(map(np.intersect1d, df['one'], df['two']))
print(df)

         one        two overlap
0  [a, b, c]  [b, c, d]  [b, c]
1  [d, e, f]  [f, g, h]     [f]
2  [h, i, j]  [l, m, n]      []
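
Two things to watch with the np.intersect1d approach: each cell ends up holding a NumPy array, and a Series built without an index gets the default 0..n-1 labels, so assigning it to a frame with a different index aligns on labels and can leave NaN. A sketch that handles both, assuming a frame with a non-default index (the labels 10, 20, 30 are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is not 0..n-1
df = pd.DataFrame({'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
                   'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']]},
                  index=[10, 20, 30])

# .tolist() converts each NumPy result back to a plain Python list
overlaps = [np.intersect1d(a, b).tolist() for a, b in zip(df['one'], df['two'])]

# Reuse df.index so the assignment lines up row by row
df['overlap'] = pd.Series(overlaps, index=df.index)
print(df)
```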

Benchmarks

Here are some benchmarks for reference (the timings below are listed in the same order as the statements):

%timeit [list(set(a)&set(b)) for a,b in zip(df['one'], df['two'])]        #list comprehension
%timeit df.apply(lambda x: list(set(x['one']).intersection(x['two'])),1)  #apply 1
%timeit df.apply(lambda x: list(set(x['one']) & set(x['two'])),1)         #apply 2
%timeit pd.Series(map(np.intersect1d, df['one'], df['two']))              #numpy intersect1d

6.99 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
167 µs ± 830 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
166 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
84.1 µs ± 270 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
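
Timings like these depend heavily on the number of rows and the length of the lists, so it may be worth repeating the comparison on data closer to your own. Here is a rough sketch using the standard timeit module outside IPython. The frame `big`, its size and the random letters are made up for illustration, and the numbers it prints will differ from machine to machine.

```python
import random
import timeit

import numpy as np
import pandas as pd

random.seed(0)
letters = list('abcdefghijklmnop')
n_rows = 1000  # hypothetical size; adjust to match your data

# Each cell holds three distinct random letters
big = pd.DataFrame({
    'one': [random.sample(letters, 3) for _ in range(n_rows)],
    'two': [random.sample(letters, 3) for _ in range(n_rows)],
})

candidates = {
    'list comprehension': lambda: [list(set(a) & set(b))
                                   for a, b in zip(big['one'], big['two'])],
    'apply': lambda: big.apply(lambda x: list(set(x['one']) & set(x['two'])), axis=1),
    'intersect1d': lambda: [np.intersect1d(a, b)
                            for a, b in zip(big['one'], big['two'])],
}

# Best of 3 repeats, 3 calls per repeat, reported as seconds per call
timings = {name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
           for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```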
