Optimizing a function to replace a row with the previous row, given a condition, in Pandas

I have a relatively large dataframe (~24000 rows and 15 columns) which has 2D coordinate data of rat movements, outputted by a neural network (DeepLabCut).

As part of this output data, each label has a p-value score (the likelihood column) that measures how certain the neural network was when applying that label. I'm trying to filter out low-quality predictions by copying the previous row into place whenever a low p-value is encountered, which assumes that the rat remained still for that frame.

Here's my function thus far:

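def checkPVals(DataFrame, CutOff):
    # Assumes integer column labels, where every label divisible by 3 is a p-value column.
    for Cols in DataFrame.columns.values:
        if Cols % 3 == 0:
            for Vals in DataFrame.index.values:
                if float(DataFrame.loc[Vals, Cols]) < CutOff:
                    # Row 0 has no previous row to copy from, so it is left alone.
                    if Vals != 0:
                        # Note: .loc label slices include both endpoints.
                        PreviousRow = DataFrame.loc[Vals - 1, Cols - 3:Cols]
                        DataFrame.loc[Vals, Cols - 3:Cols] = PreviousRow
    return DataFrame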

Here is a sample of the input data frame:

pd.DataFrame(data={
    "x":[1, 2, 3, 4],
    "y":[5, 4, 3, 2],
    "likelihood":[1, 1, 0.3, 1]
    })

Here is a sample of the desired output:

   x  y  likelihood
0  1  5         1.0
1  2  4         1.0
2  2  4         1.0
3  4  2         1.0

The idea is that row index 2 is replaced with the values from row index 1, so that when the inter-frame Euclidean distance between these coordinates is calculated, the distance is 0, implying the label (rat) has not moved.
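To make the distance claim concrete, here is a minimal sketch of an inter-frame Euclidean distance check on the desired output. The names `cleaned`, `step` and `dist` are only illustrative, and using `diff()` with `np.hypot` is one possible way to do it, not necessarily how the real pipeline computes it.

```python
import numpy as np
import pandas as pd

# Coordinates from the desired output above (row 2 is a copy of row 1).
cleaned = pd.DataFrame({"x": [1, 2, 2, 4], "y": [5, 4, 4, 2]})

# Frame-to-frame change in each coordinate; the first row has no predecessor, so it is NaN.
step = cleaned[["x", "y"]].diff()

# Euclidean distance travelled between consecutive frames.
dist = np.hypot(step["x"], step["y"])
print(dist)
```

The distance at index 2 comes out as 0 because that row is identical to the one before it.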

Clearly, my current implementation is very inefficient. I was looking at iterrows(), but that converts each row of my data into a Series and messes with it. My other thought was to convert the p-value columns into NumPy arrays, iterate through those, take the indices of the p-values below the threshold, and then swap each of those rows for the previous one in an iterative manner. However, I feel like that'll take just as long.

Any help is very much appreciated. Thank you!

CodePudding user response:

I'm pretty sure I understood what you are attempting to do. If you could update your question to have a sample output that's paired with your sample input, that would be greatly beneficial.

If I understood correctly, you should be using a vectorized approach instead of explicit looping (this will massively speed up your data wrangling). Essentially you can mask the rows of the dataframe depending on whether or not the "likelihood" column is above a certain value. Once you mask the low likelihoods away (i.e. replace those values with NaN), you can simply forward fill the entire dataframe to fill in the "bad" rows with the previous row's values.

import pandas as pd

df = pd.DataFrame(data={
    "x":[1, 2, 3, 4],
    "y":[5, 4, 3, 2],
    "likelihood":[1, 1, 0.3, 1]
})

cutoff = 0.5
new_df = df.mask(df["likelihood"] < cutoff).ffill()

print(new_df)
     x    y  likelihood
0  1.0  5.0         1.0
1  2.0  4.0         1.0
2  2.0  4.0         1.0
3  4.0  2.0         1.0
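
Your real data has several labels per frame (15 columns), so masking the whole frame on a single likelihood column would not be enough there. Below is a rough sketch of a per-label version. It assumes the columns come in consecutive (x, y, likelihood) triples, which is my reading of the `Cols % 3` logic in your function. The function name `fill_low_likelihood` and the `nose_*` / `tail_*` column names are made up for illustration.

```python
import pandas as pd


def fill_low_likelihood(df, cutoff):
    """Replace low-likelihood (x, y, likelihood) triples with the previous row's values.

    Assumption: columns are consecutive triples and the third column of each
    triple holds the likelihood. The input frame is not modified.
    """
    out = df.astype(float)  # float copy, because NaN marks the rows to fill
    for start in range(0, out.shape[1], 3):
        cols = out.columns[start:start + 3]
        bad = out[cols[2]] < cutoff
        if len(bad):
            bad.iloc[0] = False  # row 0 has no previous row, so keep it as is
        # Blank out the bad rows of this triple, then forward fill them.
        out[cols] = out[cols].mask(bad).ffill()
    return out


frames = pd.DataFrame({
    "nose_x": [1, 2, 3, 4], "nose_y": [5, 4, 3, 2], "nose_likelihood": [1, 1, 0.3, 1],
    "tail_x": [9, 8, 7, 6], "tail_y": [0, 1, 2, 3], "tail_likelihood": [0.2, 1, 1, 0.1],
})
cleaned = fill_low_likelihood(frames, cutoff=0.5)
print(cleaned)
```

Two things to be aware of. Masking introduces NaN, so integer columns come back as floats (as in the output above). Also, a plain `mask(...).ffill()` leaves NaN in the first row if that row is below the cutoff, because there is nothing to fill from. The sketch keeps row 0 unchanged instead, which matches the `Vals != 0` check in your loop.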