Pandas get position of last value based on condition for each column (efficiently)

Time:12-26

For each column of my dataframe, I want to find the row in which the value 1 occurs last. Given this last row index, I want to calculate the "recency" of the occurrence. Like so:

>> df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
>> df
   a  b  c  d
0  0  1  1  0
1  0  1  0  0
2  1  1  0  0
3  0  1  0  0
4  0  1  1  0

Desired result:

>> calculate_recency_vector(df)
[3,1,1,None]

The desired result shows, for each column, "how many rows ago" the value 1 last appeared. E.g. in column a the value 1 last appears in the third-to-last row, hence the recency of 3 in the result vector. Any ideas how to implement this?

Edit: to avoid confusion, I changed the desired output for the last column from 0 to None. This column has no recency because the value 1 does not occur at all.

Edit II: Thanks for the great answers! I have to calculate this recency vector approx. 150k times on dataframes shaped (42,250). A more efficient solution would be much appreciated.

CodePudding user response:

With this example dataframe, you can define a function as follows:

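import pandas as pd

def calculate_recency_vector(df: pd.DataFrame, condition: int) -> list:
    recency_vector = []

    for col in df.columns:
        values = df[col].to_list()
        # Position of the last match; stays None if the value never occurs.
        # (Starting at 0 would wrongly report None for a match in the first row.)
        last = None
        for i, y in enumerate(values):
            if y == condition:
                last = i

        # Count rows from the bottom, so a match in the final row has recency 1.
        recency = None if last is None else len(values) - last
        recency_vector.append(recency)

    return recency_vector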

Running the function, it will return this:

calculate_recency_vector(df, 1)
[3, 1, 1, None]

CodePudding user response:

One direct way to implement this function is to loop over each column in the DataFrame and, within that loop, use another loop to iterate over each row in the column. For each row, check whether the value is 1. If it is, update a variable to store len(df[column]) - index. After the inner loop finishes, append the stored value as the recency for that column. If 1 never appears in the column, the stored value stays None.

import pandas

def calculate_recency_vector(df):
    recency_vector = []
    for column in df:
        last_occurrence = None
        # Series.iteritems() was removed in pandas 2.0; enumerate yields row
        # positions, so this also works when the index is not 0..n-1.
        for index, value in enumerate(df[column]):
            if value == 1:
                last_occurrence = len(df[column]) - index
        recency_vector.append(last_occurrence)
    return recency_vector


df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
print(calculate_recency_vector(df))

CodePudding user response:

This

df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda x : ([df.shape[0] - i for i ,v in x.items() if v==1] or [None])[-1], axis=0)

produces the desired output as a pd.Series, with the only difference that the values are floats and None is replaced by pandas NaN. You can then select the column you need from the result.
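
Since the question asks for something faster than Python-level loops, here is a minimal vectorized sketch using NumPy. The function name recency_vector_vectorized is made up for illustration, and the sketch assumes the dataframe has at least one row. The idea is to reverse the rows of the boolean mask and use argmax, which returns the position of the first True from the bottom; columns without any match are detected separately with any().

```python
import numpy as np
import pandas as pd

def recency_vector_vectorized(df: pd.DataFrame, condition=1) -> list:
    # Hypothetical helper name; assumes df has at least one row.
    mask = df.to_numpy() == condition          # boolean array, shape (n_rows, n_cols)
    # On the row-reversed mask, argmax gives the 0-based distance of the
    # last match from the bottom row; +1 turns it into "rows ago".
    recency = mask[::-1].argmax(axis=0) + 1
    # argmax returns 0 for all-False columns, so track those explicitly.
    has_match = mask.any(axis=0)
    return [int(r) if ok else None for r, ok in zip(recency, has_match)]

df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]})
print(recency_vector_vectorized(df))  # [3, 1, 1, None]
```

This works on row positions rather than index labels, so it does not depend on the dataframe having a default 0..n-1 index. Whether it is fast enough for 150k calls on (42, 250) frames would need to be measured on your data.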
