Home > Back-end >  Update inbetween rows when the value repeats - Pandas
Update inbetween rows when the value repeats - Pandas

Time:07-13

I have a dataframe as below

id    col1    col2
1      1       1

2      0       NaN

3      0       NaN

4      0       0

5      1       1

6      0       NaN

7      1       1

8      0       0

Column 2 (col2) follows a pattern (1 followed by 0). It will have 1 (like row number 1) and it will end with 0 (like row number 4). It may or may not have NaN between the rows..

For example

9      1       1

10     0       0

is valid.. As the 9th row has 1 followed by 0 in the 10th row.

As i mentioned before , the expectation is - 1 should be followed by 0. However, there are some places where 1 occurs inbetween 1 and 0.

For example:

Row number 5 has 1. However, row number 7 is also 1. In such a case i need to change both column 1 and column 2 as 0.

   id    col1    col2
    1      1       1
    
    2      0       NaN

    3      0       NaN

    4      0       0

    5      1       1

    6      0       NaN

    7      0       0

    8      0       0

I have an iterative solution in mind.

Iterate through each row. Whenever col2 is 1 , set a flag to True During the iteration, if flag is set , and col2 is 1 , then update col1 and col2 to 0.

But it will be very slow as i have over million rows in the dataframe. Is there any way we can achieve this without iterative solution.

CodePudding user response:

You can do something like this.

# Get previous non-null col2 value.
df.loc[~df.col2.isna(), 'pv'] =  df.col2.dropna().shift()

# If current col2 and previous values are both 1, change them to 0
df.loc[(df.col2 == 1) & (df.pv == 1), ['col1','col2']] = 0

df = df.drop('pv', axis=1)

CodePudding user response:

Well, what do you think about this solution?

dropped_df = df.dropna()
rolling_df = dropped_df.rolling(window=2).sum()
indexes = rolling_df[rolling_df.col2>1].index
df['col1'].loc[indexes] = 0
df['col2'].loc[indexes] = 0

Also I tried your idea:

ind = 0
for i in range(0, len(df)):
    if (ind == 1) and (df['col2'].loc[i] == 1):
        df['col1'].loc[i] = 0
        df['col2'].loc[i] = 0
        ind = 0
    elif df['col2'].loc[i] == 1:
        ind = 1
    elif df['col2'].loc[i] == 0:
        ind = 0

And for your example, your idea works much faster then both, mine first solution and the solution in another answer.

Mine first solution:

CPU times: user 7.81 ms, sys: 2.11 ms, total: 9.92 ms
Wall time: 11.8 ms

Loop solution:

CPU times: user 2.23 ms, sys: 1.44 ms, total: 3.67 ms
Wall time: 5.53 ms

Solution from the second answer:

CPU times: user 7.4 ms, sys: 1.22 ms, total: 8.62 ms
Wall time: 11.2 ms

Though, I guess, it's not a good idea to make any conclusions with one run and small data. You could try all of these solutions.

  • Related