Home > Net >  Checking one DataFrame's Column of Lists Against Another DataFrame's Column of Lists
Checking one DataFrame's Column of Lists Against Another DataFrame's Column of Lists

Time:02-14

I have 2 DataFrames, and each ones has a column of lists.

I need to check each list from df2 to see if there is a list in df1 that contains all its elements.

ie. df2 row 1 has a list of [1,5]. I need to check against all the lists of df1 to see if ANY of them contain both 1 and 5.

For any lists from df2 that had a full match, they get marked 'clean' in the df2['bucket'] column.

My problem: With larger data sets, this is currently very slow. My hope is that someone can suggest a faster approach.

import pandas as pd

list1 = [
    ['a', [1, 5, 10, 14, 15]],
    ['l',  [7, 13, 25, 46, 50]],
    ['o',  [2, 4, 6, 19, 36]],
    ['d',  [1, 19, 24, 26, 29]]]

df1 = pd.DataFrame(list1, columns =['Fruits', 'Values']) 

list2 = [
    ['a', [1, 5]],
    ['r',  [2, 4, 5]],
    ['e',  [2, 9]],
    ['f',  [2, 6, 36]],
    ['w',  [2, 6, 37]],
    ['a',  [24, 29]],
    ['q',  [1, 14, 15]]]

df2 = pd.DataFrame(list2, columns =['Fruits', 'Values']) 

df1['bucket'] = ""


list1_row_count = df1.shape[0]
list2_row_count = df2.shape[0]

for x in range(list2_row_count):
    break_out = False
    for _ in range(list1_row_count):
        if break_out == True : break
        df1_list = df1['Values'].iloc[_]
        df2_list = df2['Values'].iloc[x]
        check =  all(item in df1_list for item in df2_list)
        

        if check is True:
            df2.loc[x, 'bucket'] = 'clean'
            break_out = True
        else :
            df2.loc[x, 'bucket'] = 'mixed'


print(df1)
print('')
print(df2)

CodePudding user response:

You can try use to_numpy(), which is quite fast:

def process(a, b):
  def compare(a, i):
    idx = 0
    while idx < a.shape[0]:
      if all(x in a[idx] for x in i): return 'clean'  
      idx  =1
    return 'mixed'
  return [compare(a, i) for i in b]

df2['bucket'] = process(df1.Values.to_numpy(), df2.Values.to_numpy())
  Fruits       Values bucket
0      a       [1, 5]  clean
1      r    [2, 4, 5]  mixed
2      e       [2, 9]  mixed
3      f   [2, 6, 36]  clean
4      w   [2, 6, 37]  mixed
5      a     [24, 29]  clean
6      q  [1, 14, 15]  clean

CodePudding user response:

@AloneTogether's solution should already significantly cut the runtime of your code. I think you could use set operations to further improve runtime.

Basically, you could convert each sublist to a set and use set.issubset iteratively to check if any list in df1['Value'] is a subset of any list in df2['Value']:

def process(a, b):
    def compare(a, i):
        for s in a:
            if i.issubset(s):
                return 'clean'
        return 'mixed'
    return [compare(a, i) for i in b]

df2['bucket'] = process(map(set, df1['Values']), map(set, df2['Values']))

Runtime (on the sample you've provided):

Yours:

7.08 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@AloneTogether:

273 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Mine:

216 µs ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • Related