I have a dataframe with 1000000 rows and I want to ignore 8000 rows in first 40000 rows and then ignore 80000 rows in next 40000 rows. How can I achieve this ?
As an example: Drop 1 to 8000, 40001 to 48000, 80001 to 88000 rows and so on.
CodePudding user response:
Approach
- Adapted Numpy slicing function: Dynamically create slice indices np.r an answer that uses a mask rather than np.r_ so can be done dynamically
- Two solutions
- Looping solution (to illustrate method)
- Vectorized solution (for performance) using
- [numpy.ma.masked_where(https://numpy.org/doc/stable/reference/generated/numpy.ma.masked_where.html) to generate mask array and
- numpy.ma.getmask to get mask
- Note: Dataframes are 0-index, so first 40,000 rows has indexing 0 to 39,999 rather than 1 to 40,000
Looping Solution
def drop_rows(df, chunksize, separation):
'''
Drops rows in dataframe
rows 0 to chunksizse
rows 1*separation to 1*separation chunksize
rows 2*separation to 2*separation chunksizse
...
'''
# Create mask which is True for rows we want to drop
n = len(df.index) # number of rows in dataframe
mask = np.zeros(n, dtype=bool)
for start in np.arange(0, n, separation):
stop = start chunksize
mask[start:stop] = True
return df.drop(df[mask].index) # drop rows based upon indexes where mask is true
Vectorized Solution
def drop_rows_vect(df, chunksize, separation):
'''
Drops rows in dataframe
rows 0 to chunksizse
rows 1*separation to 1*separation chunksize
rows 2*separation to 2*separation chunksizse
...
'''
# Create mask which is True for rows we want to drop
mask = np.ma.getmaskarray(np.ma.masked_where(df.index % separation < chunksize, df.index))
return df.drop(df[mask].index) # drop rows based upon indexes where mask is true
Test
Create random dataframe with two columns
data = np.random.randint(100, size=(40, 2))
df = pd.DataFrame(data, columns = ['A', 'B'])
Drop rows using the two methods
# Drop first 8 rows in chunks with a separation of 10
df_looping = drop_rows(df, chunksize = 8, separation = 10)
df_vect = drop_rows_vect(df, chunksize = 8, separation = 10)
Two methods produces the same result
print(df_looping.equals(df_vect )
# Output: True
Show result
print(df_vect)
# Output
A B
2 69 48
3 61 45
4 15 29
7 30 42
8 54 46
9 22 78