Home > Back-end >  How ignore specific range of rows in a dataframe
How ignore specific range of rows in a dataframe

Time:11-17

I have a dataframe with 1000000 rows and I want to ignore 8000 rows in first 40000 rows and then ignore 80000 rows in next 40000 rows. How can I achieve this ?

As an example: Drop 1 to 8000, 40001 to 48000, 80001 to 88000 rows and so on.

CodePudding user response:

Approach

  • Adapted Numpy slicing function: Dynamically create slice indices np.r an answer that uses a mask rather than np.r_ so can be done dynamically
  • Two solutions
    • Looping solution (to illustrate method)
    • Vectorized solution (for performance) using
      • [numpy.ma.masked_where(https://numpy.org/doc/stable/reference/generated/numpy.ma.masked_where.html) to generate mask array and
      • numpy.ma.getmask to get mask
  • Note: Dataframes are 0-index, so first 40,000 rows has indexing 0 to 39,999 rather than 1 to 40,000

Looping Solution

def drop_rows(df, chunksize, separation):
    '''
        Drops rows in dataframe
        
        rows 0               to    chunksizse
        rows 1*separation    to    1*separation   chunksize
        rows 2*separation    to    2*separation   chunksizse
                 ...
    '''
    # Create mask which is True for rows we want to drop
    n = len(df.index)                     # number of rows in dataframe
    mask = np.zeros(n, dtype=bool)
    for start in np.arange(0, n, separation):
        stop = start   chunksize
        mask[start:stop] = True
    
    return df.drop(df[mask].index)             # drop rows based upon indexes where mask is true

Vectorized Solution

def drop_rows_vect(df, chunksize, separation):
    '''
        Drops rows in dataframe
        
        rows 0               to    chunksizse
        rows 1*separation    to    1*separation   chunksize
        rows 2*separation    to    2*separation   chunksizse
                 ...
    '''
    # Create mask which is True for rows we want to drop
    mask = np.ma.getmaskarray(np.ma.masked_where(df.index % separation < chunksize, df.index))
    
    return df.drop(df[mask].index)   # drop rows based upon indexes where mask is true

Test

Create random dataframe with two columns

data = np.random.randint(100, size=(40, 2))
df = pd.DataFrame(data, columns = ['A', 'B'])

Drop rows using the two methods

# Drop first 8 rows in chunks with a separation of 10
df_looping = drop_rows(df, chunksize = 8, separation = 10)
df_vect = drop_rows_vect(df, chunksize = 8, separation = 10)

Two methods produces the same result

print(df_looping.equals(df_vect )
# Output: True

Show result

print(df_vect)
# Output

    A   B
2   69  48
3   61  45
4   15  29
7   30  42
8   54  46
9   22  78
  • Related