Find the number of rows a value is away from a specific row value in a Python DataFrame

I have the following dataframe, df, and I would like to add the 'distance' column to it, such that:

date	active	distance
01/09/2022	1	0
02/09/2022	0	1
05/09/2022	0	2
06/09/2022	0	3
07/09/2022	0	4
08/09/2022	1	0
09/09/2022	0	1

Here, the distance is how far away each row is from the previous value of '1' in the active column, with the distance being the number of business days. I have tried using the following:

df['distance'] = np.where(
    df['active'] == 1, 0, df['distance'].shift(1, fill_value=0).astype(int) + 1
)

But it seems that Python does not like me referencing a column while I am defining it. I also tried to define a function for this, but I am unsure how to do so with .shift(), which seems necessary to take the previous value and add to it.

Other variations of the above code do not seem to work since Python really wants to concatenate the shift and the 1 instead of just summing them together.

Any help would be very much appreciated.

CodePudding user response：

Your column can be defined entirely from the "active" column. Your formula is equivalent to:

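# Row position of every row: 0, 1, 2, ...
count_up = pd.Series(np.arange(len(df)), index=df.index)
# Keep the position only on active rows (boolean mask), carry it forward, subtract
distance = count_up - count_up.where(df['active'].eq(1)).ffill()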
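To see why this works, it can help to look at the intermediate values. The sketch below rebuilds the sample frame by hand (the frame construction and the name last_active_pos are mine, not from the question) and assumes the first row is active, so the forward fill leaves no NaN behind and the cast to int is safe:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "date": ["01/09/2022", "02/09/2022", "05/09/2022", "06/09/2022",
             "07/09/2022", "08/09/2022", "09/09/2022"],
    "active": [1, 0, 0, 0, 0, 1, 0],
})

# Position of every row: 0, 1, 2, 3, 4, 5, 6
count_up = pd.Series(np.arange(len(df)), index=df.index)

# Position of the most recent active row.
# where() leaves 0, NaN, NaN, NaN, NaN, 5, NaN; ffill() turns that into 0, 0, 0, 0, 0, 5, 5
last_active_pos = count_up.where(df["active"].eq(1)).ffill()

# Rows since the last active row: 0, 1, 2, 3, 4, 0, 1
# (assumption: the first row is active, otherwise NaN remains and astype(int) raises)
df["distance"] = (count_up - last_active_pos).astype(int)
print(df)
```

If the frame may start with inactive rows, dropping the astype(int) keeps those rows as NaN instead of raising.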

CodePudding user response：

There are surely many approaches that all give the same result. Here are three of them:

# -------------------------------------------
# Using a loop over values in column 'active':
d = []
c = -1
for i in df['active']:
    c += 1       # one more row since the last active row
    if i:
        c = 0    # reset on an active row
    d.append(c)
df["distance"] = d
print(df)

# -------------------------------------------
# Using a function  
c = -1
def f(i):
    global c
    # reset on an active row, otherwise count one more row
    if i:
        c = 0
    else:
        c += 1
    return c
# ----------------------
#     along with a list comprehension:
df['distance'] = [ f(i) for i in df['active'] ]
print(df)
#     along with pandas apply() function: 
df['distance'] = df['active'].apply(f)
print(df)

Below is one of them, including full code with data:

import pandas as pd
import numpy  as np
df_print = """\
date     active
01/09/2022   1  
02/09/2022   0  
05/09/2022   0  
06/09/2022   0  
07/09/2022   0  
08/09/2022   1  
09/09/2022   0"""
open('df_print', 'w').write(df_print)
df = pd.read_table('df_print', sep=r'\s\s\s*' ) # index_col = 0)
print(df)

distance =  []
counter   = -1
for index, row in df.iterrows():
    if row['active']:
        counter = 0
        distance.append(counter)
        continue
    counter += 1
    distance.append(counter)
df["distance"] = distance
print(df)

gives:

         date  active
0  01/09/2022       1
1  02/09/2022       0
2  05/09/2022       0
3  06/09/2022       0
4  07/09/2022       0
5  08/09/2022       1
6  09/09/2022       0
         date  active  distance
0  01/09/2022       1         0
1  02/09/2022       0         1
2  05/09/2022       0         2
3  06/09/2022       0         3
4  07/09/2022       0         4
5  08/09/2022       1         0
6  09/09/2022       0         1
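The function variant above keeps its counter in a module-level global, so the result depends on what was called before. As a possible alternative, here is a minimal sketch of the same counting logic written as a generator, which keeps the counter local. The name rows_since_active is hypothetical, not something from pandas:

```python
import pandas as pd

def rows_since_active(flags):
    """Yield, for each flag, the number of rows since the last truthy flag."""
    counter = -1  # same starting value as the loops above
    for flag in flags:
        # reset on an active row, otherwise count one more row
        counter = 0 if flag else counter + 1
        yield counter

# Sample 'active' column copied from the question
df = pd.DataFrame({"active": [1, 0, 0, 0, 0, 1, 0]})
df["distance"] = list(rows_since_active(df["active"]))
print(df)
```

Like the loops above, this counts from -1, so rows before the first active row would be numbered 0, 1, 2, ... as well.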

CodePudding user response：

Use cumsum to mark the active groups.

g = (df['active'] == 1).cumsum()
# assign() returns a new frame, so bind the result back to df
df = df.assign(distance=g.groupby(g).transform(lambda x: range(len(x))))
print(df)

Result

         date  active  distance
0  01/09/2022       1         0
1  02/09/2022       0         1
2  05/09/2022       0         2
3  06/09/2022       0         3
4  07/09/2022       0         4
5  08/09/2022       1         0
6  09/09/2022       0         1
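The group labels produced by cumsum are worth a look, especially when the data does not start with an active row. The sketch below uses a made-up 'active' series (not the data from the question) that begins with two inactive rows. Those rows end up in group 0, and one possible choice is to mask them, since there is no earlier 1 to measure from:

```python
import pandas as pd

# Hypothetical data: the first two rows come before any active row
active = pd.Series([0, 0, 1, 0, 0, 1, 0], name="active")

# Running count of active rows seen so far: 0, 0, 1, 1, 1, 2, 2
g = active.eq(1).cumsum()

# Position of each row inside its group: 0, 1, 0, 1, 2, 0, 1
distance = active.groupby(g).cumcount()

# Group 0 has no preceding active row, so mark those rows as missing.
# The nullable "Int64" dtype keeps the remaining values as integers.
distance = distance.where(g > 0).astype("Int64")
print(pd.DataFrame({"active": active, "group": g, "distance": distance}))
```

Whether the leading rows should be missing or simply counted from 0 depends on what the distance is used for, so treat the mask as optional.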

CodePudding user response：

Create groups by comparing the column with 1 and taking Series.cumsum, then number the rows within each group with GroupBy.cumcount:

df['distance'] = df.groupby(df['active'].eq(1).cumsum()).cumcount()
print(df)
         date  active  distance
0  01/09/2022       1         0
1  02/09/2022       0         1
2  05/09/2022       0         2
3  06/09/2022       0         3
4  07/09/2022       0         4
5  08/09/2022       1         0
6  09/09/2022       0         1
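All of the answers count rows, which matches the question only because every row in the sample is exactly one business day after the previous one. If business days could be missing from the data, a sketch along these lines might be closer to what was asked. It assumes the dates are in day/month/year format, that the first row is active, and that a plain Monday to Friday week without holidays is acceptable. The name last_active is mine:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "date": ["01/09/2022", "02/09/2022", "05/09/2022", "06/09/2022",
             "07/09/2022", "08/09/2022", "09/09/2022"],
    "active": [1, 0, 0, 0, 0, 1, 0],
})

# Assumption: dates are day/month/year
dates = pd.to_datetime(df["date"], format="%d/%m/%Y")

# Date of the most recent active row, carried forward onto the inactive rows
# (assumption: the first row is active, so no NaT is left over)
last_active = dates.where(df["active"].eq(1)).ffill()

# Business days from the last active date up to, but not including, each row's date
df["distance"] = np.busday_count(
    last_active.to_numpy().astype("datetime64[D]"),
    dates.to_numpy().astype("datetime64[D]"),
)
print(df)
```

On the sample data this gives the same column as the row-counting answers; the two would only differ if a business day were absent from the frame.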