Loop to delete rows based on condition Pandas


I have large data samples (1.6 million rows each) from which I want to delete all rows that do not fit certain conditions.

I have over 1400 different conditions that are tested to decide whether they should be applied; once applied, I use the following code to delete the rows (with a random example of a data sample):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1,100,size=(1600000, 13)), columns=list('ABCDEFGHIJKLM'))

cols = ['A','B','C','D','E','F','G','H','I','J','K','L','M']


df['Conditions'] = df[(df[cols] >= 30) & (df[cols] <= 50)].count(axis=1)
df = df[(df["Conditions"] >= 2) & (df["Conditions"] <= 6)]

So, for this example loop, values between 30 and 50 should occur at least 2 but at most 6 times per row (all conditions are similar, just with different values). My problem is that this takes a very long time, and since I have 1200 different data samples I'd like to find any way to speed up the process. Do you have any suggestions for a method to increase the speed? I've also tried df.drop, but I found this approach faster. I appreciate all suggestions.

CodePudding user response:

I just realized that you used count, which is slower because indexing the DataFrame with the mask copies your data first. I'd suggest you use sum on the boolean mask instead:

mask = ((df[cols] >= 30) & (df[cols] <= 50)).sum(axis=1)
df = df[mask.between(2,6)]

This takes about 400 ms on my system, whereas your approach takes about 1 s (with the suggestion from my comment applied; without it, about 2 s).
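
If you want to reproduce the comparison on your own machine, here is a minimal timing sketch; the helper names with_count and with_sum are only illustrative, not from the original post:

import timeit

import numpy as np
import pandas as pd

# Same shape and value range as in the question
df = pd.DataFrame(np.random.randint(1, 100, size=(1600000, 13)),
                  columns=list('ABCDEFGHIJKLM'))
cols = list('ABCDEFGHIJKLM')

def with_count(data):
    # Original approach: mask the DataFrame first (copies data), then count per row
    conditions = data[(data[cols] >= 30) & (data[cols] <= 50)].count(axis=1)
    return data[(conditions >= 2) & (conditions <= 6)]

def with_sum(data):
    # Suggested approach: sum the boolean mask directly, without the intermediate copy
    mask = ((data[cols] >= 30) & (data[cols] <= 50)).sum(axis=1)
    return data[mask.between(2, 6)]

print("count:", timeit.timeit(lambda: with_count(df), number=5))
print("sum:  ", timeit.timeit(lambda: with_sum(df), number=5))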

CodePudding user response:

My idea is to:

  1. Generate each condition as a boolean list.
  2. Concatenate them into an array named cond (each condition creates a row).
  3. Compute the "final" criterion for each source row, i.e. for each column of cond: keep the source rows which meet all "partial" criteria. This way you have a single list (True or False for each row).
  4. Get the rows meeting this (final) criterion, using boolean indexing.

Let's generate a somewhat smaller source DataFrame, with a seeded generator, to get repeatable results:

import numpy as np
import pandas as pd

nRows = 30    # Number of rows
cols = ['A','B','C','D','E','F','G','H','I','J','K','L','M']  # Column names
rng = np.random.default_rng(0)  # Seed the generator for repeatable results
# Generate the source DataFrame
df = pd.DataFrame(rng.integers(1, 100, (nRows, len(cols))), columns=cols)

Then assume that you have just 2 "partial" criteria for each row:

  • between 30 and 50: the number of elements of a row in this range must be between 3 and 5 (your original criterion),
  • between 50 and 70: the number of elements in this range must also be between 3 and 5.

To express this in code, write:

cond = np.vstack((
        ((df[cols] >=30) & (df[cols] <= 50)).sum(axis=1).between(3,5),
        ((df[cols] >=50) & (df[cols] <= 70)).sum(axis=1).between(3,5)
    )).all(axis=0)

You can add here your other criteria.

Steps:

  • Each line inside vstack computes one partial criterion (a boolean vector).
  • vstack stacks them vertically, so that we have as many rows as the number of partial criteria and as many columns as source rows.
  • all(axis=0) converts this array into a single boolean vector (one element per source row), True only where all partial criteria are met.

Then, to get the final result, run:

result = df[cond]

The result is:

     A   B   C   D   E   F   G   H   I   J   K   L   M
3   42  40   3   1  13   1  67  53  65  26  61  76  38
4   46  99  80  98  38  68  95  65  84  69  70  39  87
12  60  94   2  46  83  76  41  50  42  53  23  78   8
20  80  59  85  55  76  81   7  56  46  29  45  41  49
25  49  49  97  69  77   1  31  98  27  51  86  64  88

The speed gain of this solution is that you perform selection from the source array only once.

Your original solution (as I suppose) narrows down the initial DataFrame and creates a new DataFrame each time (over 1000 times), which is why such code runs slowly.

My solution should work faster.
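
Since the question mentions over 1400 different conditions, here is a minimal sketch of how all partial criteria could be stacked in one go, continuing from the df and cols defined above; the list named criteria and its (low, high, min_count, max_count) layout are only illustrative assumptions, not part of the original post:

# Hypothetical list of partial criteria: (low, high, min_count, max_count).
# In the real use case this list would hold all ~1400 conditions.
criteria = [
    (30, 50, 3, 5),
    (50, 70, 3, 5),
    # ... further conditions ...
]

# One boolean vector per criterion, stacked; a row survives only if all criteria hold
cond = np.vstack([
    ((df[cols] >= lo) & (df[cols] <= hi)).sum(axis=1).between(cmin, cmax)
    for (lo, hi, cmin, cmax) in criteria
]).all(axis=0)

result = df[cond]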
