I want to find, for each column of my dataframe, the row in which the value 1 occurs last. Given this last row index, I want to calculate the "recency" of the occurrence. Like so:
>> df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
>> df
   a  b  c  d
0  0  1  1  0
1  0  1  0  0
2  1  1  0  0
3  0  1  0  0
4  0  1  1  0
Desired result:
>> calculate_recency_vector(df)
[3,1,1,None]
The desired result shows, for each column, "how many rows ago" the value 1 appeared for the last time. E.g. for column a, the value 1 appears last in the 3rd-last row, hence the recency of 3 in the result vector. Any ideas how to implement this?
Edit: to avoid confusion, I changed the desired output for the last column from 0 to None. This column has no recency because the value 1 does not occur at all.
Edit II: Thanks for the great answers! I have to calculate this recency vector approx. 150k times on dataframes shaped (42,250). A more efficient solution would be much appreciated.
CodePudding user response:
With this example dataframe, you can define a function as follows:
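import pandas as pd

def calculate_recency_vector(df: pd.DataFrame, condition: int) -> list:
    recency_vector = []
    for col in df.columns:
        values = df[col].to_list()
        # Position of the last match; None if the value never occurs.
        # (Starting from 0 would wrongly report None for a match in row 0 only.)
        last = None
        for i, y in enumerate(values):
            if y == condition:
                last = i
        # Rows counted back from the end of the column
        recency = None if last is None else len(values) - last
        recency_vector.append(recency)
    return recency_vector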
Calling the function returns this:
calculate_recency_vector(df, 1)
[3, 1, 1, None]
CodePudding user response:
One direct approach is to loop over each column in the DataFrame and, within that loop, iterate over each row of the column. For each row, check whether the value is 1. If it is, update a variable to store len(df[column])-index. After the inner loop finishes, the stored value is the recency for that column. If 1 never appears in the column, the recency is None.
import pandas

def calculate_recency_vector(df):
    recency_vector = []
    for column in df:
        last_occurrence = None  # stays None if 1 never appears
        # enumerate gives the row position regardless of the index labels;
        # Series.iteritems() was removed in pandas 2.0
        for index, value in enumerate(df[column]):
            if value == 1:
                last_occurrence = len(df[column]) - index
        recency_vector.append(last_occurrence)
    return recency_vector

df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
print(calculate_recency_vector(df))
CodePudding user response:
This:
df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda x : ([df.shape[0] - i for i ,v in x.items() if v==1] or [None])[-1], axis=0)
produces the desired output as a pd.Series. The only difference is that the result is float and None is replaced by pandas NaN. You can then select the entry for the column you need.
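Regarding the efficiency concern in Edit II, a vectorized NumPy version might help, since it avoids Python-level loops over rows. The following is only a minimal sketch and has not been benchmarked here. The function name calculate_recency_vector_np and the condition parameter are made up for illustration, and the sketch assumes the dataframe has at least one row.

```python
import numpy as np
import pandas as pd


def calculate_recency_vector_np(df: pd.DataFrame, condition: int = 1) -> list:
    """Hypothetical helper: rows counted back from the end to the last match."""
    mask = df.to_numpy() == condition   # boolean array, shape (rows, cols)
    # Reverse the rows so the first True per column marks the last occurrence.
    # argmax returns the position of the first True (and 0 if there is none).
    recency = mask[::-1].argmax(axis=0) + 1
    found = mask.any(axis=0)            # columns where the value occurs at all
    return [int(r) if f else None for r, f in zip(recency, found)]


df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]})
print(calculate_recency_vector_np(df))  # [3, 1, 1, None]
```

The separate any() check is needed because argmax alone cannot tell "first row matches" apart from "no row matches".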