I have a dataframe as below
id col1 col2
1 1 1
2 0 NaN
3 0 NaN
4 0 0
5 1 1
6 0 NaN
7 1 1
8 0 0
Column 2 (col2) follows a pattern (1 followed by 0). It will have 1 (like row number 1) and it will end with 0 (like row number 4). It may or may not have NaN between the rows..
For example
9 1 1
10 0 0
is valid.. As the 9th row has 1 followed by 0 in the 10th row.
As i mentioned before , the expectation is - 1 should be followed by 0. However, there are some places where 1 occurs inbetween 1 and 0.
For example:
Row number 5 has 1. However, row number 7 is also 1. In such a case i need to change both column 1 and column 2 as 0.
id col1 col2
1 1 1
2 0 NaN
3 0 NaN
4 0 0
5 1 1
6 0 NaN
7 0 0
8 0 0
I have an iterative solution in mind.
Iterate through each row. Whenever col2 is 1 , set a flag to True During the iteration, if flag is set , and col2 is 1 , then update col1 and col2 to 0.
But it will be very slow as i have over million rows in the dataframe. Is there any way we can achieve this without iterative solution.
CodePudding user response:
You can do something like this.
# Get previous non-null col2 value.
df.loc[~df.col2.isna(), 'pv'] = df.col2.dropna().shift()
# If current col2 and previous values are both 1, change them to 0
df.loc[(df.col2 == 1) & (df.pv == 1), ['col1','col2']] = 0
df = df.drop('pv', axis=1)
CodePudding user response:
Well, what do you think about this solution?
dropped_df = df.dropna()
rolling_df = dropped_df.rolling(window=2).sum()
indexes = rolling_df[rolling_df.col2>1].index
df['col1'].loc[indexes] = 0
df['col2'].loc[indexes] = 0
Also I tried your idea:
ind = 0
for i in range(0, len(df)):
if (ind == 1) and (df['col2'].loc[i] == 1):
df['col1'].loc[i] = 0
df['col2'].loc[i] = 0
ind = 0
elif df['col2'].loc[i] == 1:
ind = 1
elif df['col2'].loc[i] == 0:
ind = 0
And for your example, your idea works much faster then both, mine first solution and the solution in another answer.
Mine first solution:
CPU times: user 7.81 ms, sys: 2.11 ms, total: 9.92 ms
Wall time: 11.8 ms
Loop solution:
CPU times: user 2.23 ms, sys: 1.44 ms, total: 3.67 ms
Wall time: 5.53 ms
Solution from the second answer:
CPU times: user 7.4 ms, sys: 1.22 ms, total: 8.62 ms
Wall time: 11.2 ms
Though, I guess, it's not a good idea to make any conclusions with one run and small data. You could try all of these solutions.