Home > Software design >  How to manipulate Pandas DataFrame to access previous rows and set a value in a batch?
How to manipulate Pandas DataFrame to access previous rows and set a value in a batch?

Time:10-08

I'm a bit new to python but not to programming in general. I'm a bit lost when it comes to array manipulations. I am in a situation where I have a DataFrame of n rows and 2 columns. I need to create a new column but I'm not sure how. Here is an example :

Date        | Amount   || IsGood  <---- Need to add this column
--------------------------------
2021-05-01  | 1.52     || ?
2021-05-02  | 2.19     || ?
2021-05-03  | 1.38     || ?
...
2021-09-25  | 5.39     || ?
2021-09-26  | 8.36     || ?
2021-09-27  | 5.19     || ?
2021-09-28  | 6.31     || ?
2021-09-29  | 6.19     || ?
2021-09-30  | 5.78     || ?

The IsGood column isn't there in the original DataFrame and I need to add it. The math is to take the current row and its previous four to calculate and average then verify if more than a threashold (here is 3). I would like to avoid doing for loops on every row as I know in python there is a better way to achieve this.

What I've found so far is something similar to this but it doesn't check the previous 5 rows, only if the the value is higher than the threashold:

threashold = 3
mean_size = 5

df.loc[
    (
        # I can check the current value but that's not exactly
        # the result I need
        df["Amount"] > threashold 
        
        # I would need something that takes the last 5 rows
        # and add them together then do the average
        # Something like this:
  
        # df["Amount"].sum(-mean_size)/mean_size > threashold
        
    ), 
    "IsGood"] = 1

I can't manage to tweak it to lookback at the previous rows. The array manipulations in python aren't intuitive for me. Is the only solution to go for manual loops?

Thanks!

CodePudding user response:

You simply treat the new column as if it has always existed. Sample Code:

df["IsGood"]= df["Amount"] > threashold

This will create a new column in your existing dataframe and return False or True. If you want to replace True/False, you can do so by: Sample Code:

df["IsGood"].replace(False,1,inplace=True)

Then inspect your df, using head() or tail

df.head()

CodePudding user response:

Thanks to users Tom and Anna Pas I managed to came up with a solution to my problem. I'll post it here in case someone else is in the same boat as I was.

threashold = 3
mean_size = 5

# With the rolling function I can lookback 5 days prior and sums it all.
# Then I only need to do the average of it and check if it's higher than
# the threashold
df["IsGood"] = df["Amount"].rolling(mean_size).mean() > threashold

# Then I replace True/False with 1 and 0 because in my specific case
# I need an integer
df["IsGood"].replace([True, False], [1, 0], inplace=True)

Thanks again everyone!

  • Related