I have 2 DataFrames, and each ones has a column of lists.
I need to check each list from df2
to see if there is a list in df1
that contains all its elements.
ie. df2
row 1 has a list of [1,5]
. I need to check against all the lists of df1
to see if ANY of them contain both 1 and 5.
For any lists from df2
that had a full match, they get marked 'clean'
in the df2['bucket']
column.
My problem: With larger data sets, this is currently very slow. My hope is that someone can suggest a faster approach.
import pandas as pd
list1 = [
['a', [1, 5, 10, 14, 15]],
['l', [7, 13, 25, 46, 50]],
['o', [2, 4, 6, 19, 36]],
['d', [1, 19, 24, 26, 29]]]
df1 = pd.DataFrame(list1, columns =['Fruits', 'Values'])
list2 = [
['a', [1, 5]],
['r', [2, 4, 5]],
['e', [2, 9]],
['f', [2, 6, 36]],
['w', [2, 6, 37]],
['a', [24, 29]],
['q', [1, 14, 15]]]
df2 = pd.DataFrame(list2, columns =['Fruits', 'Values'])
df1['bucket'] = ""
list1_row_count = df1.shape[0]
list2_row_count = df2.shape[0]
for x in range(list2_row_count):
break_out = False
for _ in range(list1_row_count):
if break_out == True : break
df1_list = df1['Values'].iloc[_]
df2_list = df2['Values'].iloc[x]
check = all(item in df1_list for item in df2_list)
if check is True:
df2.loc[x, 'bucket'] = 'clean'
break_out = True
else :
df2.loc[x, 'bucket'] = 'mixed'
print(df1)
print('')
print(df2)
CodePudding user response:
You can try use to_numpy()
, which is quite fast:
def process(a, b):
def compare(a, i):
idx = 0
while idx < a.shape[0]:
if all(x in a[idx] for x in i): return 'clean'
idx =1
return 'mixed'
return [compare(a, i) for i in b]
df2['bucket'] = process(df1.Values.to_numpy(), df2.Values.to_numpy())
Fruits Values bucket
0 a [1, 5] clean
1 r [2, 4, 5] mixed
2 e [2, 9] mixed
3 f [2, 6, 36] clean
4 w [2, 6, 37] mixed
5 a [24, 29] clean
6 q [1, 14, 15] clean
CodePudding user response:
@AloneTogether's solution should already significantly cut the runtime of your code. I think you could use set operations to further improve runtime.
Basically, you could convert each sublist to a set and use set.issubset
iteratively to check if any list in df1['Value']
is a subset of any list in df2['Value']
:
def process(a, b):
def compare(a, i):
for s in a:
if i.issubset(s):
return 'clean'
return 'mixed'
return [compare(a, i) for i in b]
df2['bucket'] = process(map(set, df1['Values']), map(set, df2['Values']))
Runtime (on the sample you've provided):
Yours:
7.08 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@AloneTogether:
273 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Mine:
216 µs ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)