I have the following dataframe, df, and I would like to add the 'distance' column to it, such that:
date | active | distance |
---|---|---|
01/09/2022 | 1 | 0 |
02/09/2022 | 0 | 1 |
05/09/2022 | 0 | 2 |
06/09/2022 | 0 | 3 |
07/09/2022 | 0 | 4 |
08/09/2022 | 1 | 0 |
09/09/2022 | 0 | 1 |
Here, the distance is how far away each row is from the previous value of '1' in the active column, with the distance being the number of business days. I have tried using the following:
df['distance'] = np.where(
df['active'] == 1, 0, df['distance'].shift(1,fill_value=0).astype(int) 1
)
But it seems that Python does not like me referencing a column as I am defining it. I tried to also define a function to run this but unsure how to do so using .shift()
as this command seems necessary in order to use to take the previous value and add to it.
Other variations of the above code do not seem to work since Python really wants to concatenate the shift and the 1 instead of just summing them together.
Any help would be very much appreciated.
CodePudding user response:
your column can be entirely defined from the "active" column. your formula is the same as:
count_up = pd.Series(np.arange(len(df)), index=df.index)
distance = count_up - count_up.where(df.active).ffill()
CodePudding user response:
There are sure myriads of approaches all getting the same result. Here are three of them:
# -------------------------------------------
# Using a loop over values in column 'active':
d=[];c=-1
for i in df['active']:
c =1
if i: c = 0
d.append(c)
df["distance"] = d
print(df)
# -------------------------------------------
# Using a function
c = -1
def f(i):
global c
if i: c=0
else: c =1;
return c
# ----------------------
# along with a list comprehension:
df['distance'] = [ f(i) for i in df['active'] ]
print(df)
# along with pandas apply() function:
df['distance'] = df['active'].apply(f)
print(df)
Below one of them including full code with data:
import pandas as pd
import numpy as np
df_print = """\
date active
01/09/2022 1
02/09/2022 0
05/09/2022 0
06/09/2022 0
07/09/2022 0
08/09/2022 1
09/09/2022 0"""
open('df_print', 'w').write(df_print)
df = pd.read_table('df_print', sep=r'\s\s\s*' ) # index_col = 0)
print(df)
distance = []
counter = -1
for index, row in df.iterrows():
if row['active']:
counter = 0
distance.append(counter)
continue
counter =1
distance.append(counter)
df["distance"] = distance
print(df)
gives:
date active
0 01/09/2022 1
1 02/09/2022 0
2 05/09/2022 0
3 06/09/2022 0
4 07/09/2022 0
5 08/09/2022 1
6 09/09/2022 0
date active distance
0 01/09/2022 1 0
1 02/09/2022 0 1
2 05/09/2022 0 2
3 06/09/2022 0 3
4 07/09/2022 0 4
5 08/09/2022 1 0
6 09/09/2022 0 1
CodePudding user response:
Use cumsum
to mark the active groups.
g = (df['active']==1).cumsum()
df.assign(distance=g.groupby(g).transform(lambda x: range(len(x))))
print(df)
Result
date active distance
0 01/09/2022 1 0
1 02/09/2022 0 1
2 05/09/2022 0 2
3 06/09/2022 0 3
4 07/09/2022 0 4
5 08/09/2022 1 0
6 09/09/2022 0 1
CodePudding user response:
Create groups by compare 1
with Series.cumsum
and cumulative count them by GroupBy.cumcount
:
df['distance'] = df.groupby(df['active'].eq(1).cumsum()).cumcount()
print (df)
date active distance
0 01/09/2022 1 0
1 02/09/2022 0 1
2 05/09/2022 0 2
3 06/09/2022 0 3
4 07/09/2022 0 4
5 08/09/2022 1 0
6 09/09/2022 0 1