Home > Software design >  pandas dataframe iterate over cell value that is a list and compare each element to other cell
pandas dataframe iterate over cell value that is a list and compare each element to other cell

Time:11-23

I have a dataframe with 2 columns - a tuple and a list:

df = t        l
    (1,2) [1,2,3,4,5,6]
    (0,5) [1,4,9]
    (0,4) [9,11]

I want to add a new column of "how many elements from l are in the range of t. So for example, here if will be:

df =counter  t       l
      2    (1,2) [1,2,3,4,5,6]
      2    (0,5) [1,4,9]
      0    (0,4) [9,11]

What is the best way to do so?

CodePudding user response:

Use list comprehension with generator and sum:

df['counter'] = [sum(a <= i <= b for i in y) for (a, b), y in df[['t','l']].to_numpy()]

A bit faster solution with set.intersection is:

df['counter'] = [len(set(range(a, b 1)).intersection(y)) 
                 for (a, b), y in df[['t','l']].to_numpy()]

print (df)
        t                   l  counter
0  (1, 2)  [1, 2, 3, 4, 5, 6]        2
1  (0, 5)           [1, 4, 9]        2
2  (0, 4)             [9, 11]        0

Performance in test data:

#30k rows
df = pd.concat([df] * 10000, ignore_index=True)

In [67]: %timeit [sum(a <= i <= b for i in y) for (a, b), y in df[['t','l']].to_numpy()]
65.3 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [68]: %timeit [len(set(range(a, b 1)).intersection(y)) for (a, b), y in df[['t','l']].to_numpy()]
60.7 ms ± 520 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • Related