I have a long data frame with approx 150k rows and two columns (A and B). I'm writing a code that in column "B" for a row "i", if the next 6 rows in column A == 1 then, B=1.
So far I managed to write a working code (see below). However, it is very slow to run, about 5min. Do you have any suggestions about how to make the code more efficient? Thank you!
df[j]['B'] = 0
for i in np.arange(0,len(df[j]['A'])-1, dtype=np.float64):
try:
if (df[j]['A'][i 1] == 1) & (df[j]['A'][i 2]==1) & (df[j]['A'][i 3]==1) & (df[j]['A'][i 4]==1) & (df[j]['A'][i 5]==1) & (df[j]['A'][i 6]==1) == True:
df[j]['B'][i] = 1
except:
pass
I've tried to remove the try/except block but still no improvement, actually it is 1 second slower.
for i in np.arange(0,len(WT_df[j]['ID10'])-7, dtype=np.float64):
if (WT_df[j]['ID10'][i 1] == 1) & (WT_df[j]['ID10'][i 2]==1) & (WT_df[j]['ID10'][i 3]==1) & (WT_df[j]['ID10'][i 4]==1) & (WT_df[j]['ID10'][i 5]==1) & (WT_df[j]['ID10'][i 6]==1) == True:
WT_df[j]['ID_on'][i] = 1
Best regards, Luis
CodePudding user response:
Pandas direct indexing is slow (at least in loops). You can vectorize the loop to speed it up. Here is an untested example:
column = WT_df[j]['ID10'].values
size = column.size-7
result = np.ones(size, dtype=bool)
for k in range(1, 7):
result &= column[k:size k] == 1
WT_df[j].loc[np.where(result)[0], 'ID_on'] = 1