Pandas assign value based on next row(s)

Time:11-03

Consider this simple pandas DataFrame with columns 'record', 'start', and 'param'. Several rows can share the same 'record' value, and all rows with a given 'record' value have the same 'start' value. The 'param' value, however, can differ between rows with the same 'record' and 'start' combination:

import pandas as pd

df = pd.DataFrame({'record':[1,2,3,4,4,5,6,7,7,7,8], 'start':[0,5,7,13,13,19,27,38,38,38,54], 'param':['t','t','t','u','v','t','t','t','u','v','t']})

I'd like to make a column 'end' that takes the value of 'start' in the row with the next unique value of 'record'. The values of column 'end' should be:

[5,7,13,19,19,27,38,54,54,54,NaN]

I'm able to do this using a for loop, but I know this is not preferred when using pandas:

max_end = 100
for idx, row in df.iterrows():
    try:
        n = 1
        next_row = df.iloc[idx + n]
        # skip over rows that belong to the same record (same start)
        while next_row['start'] == row['start']:
            n = n + 1
            next_row = df.iloc[idx + n]
        end = next_row['start']
    except IndexError:  # ran past the last row
        end = max_end
    df.at[idx, 'end'] = end

Is there an easy way to achieve this without a for loop?

CodePudding user response:

I have no doubt there is a smarter solution but here is mine.

df1['end'] = df1.drop_duplicates(subset = ['record', 'start'])['start'].shift(-1).reindex(index = df1.index, method = 'ffill')

-=EDIT=- Added subset into drop_duplicates to account for question amendment
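To make the one-liner easier to follow, here is a sketch that splits it into its three steps on the sample data from the question. The intermediate names `firsts` and `next_start` are made up for illustration, and the sketch assumes the default sorted integer index, since `reindex` with `method='ffill'` needs a monotonic index.

```python
import pandas as pd

df1 = pd.DataFrame({
    "record": [1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 8],
    "start": [0, 5, 7, 13, 13, 19, 27, 38, 38, 38, 54],
    "param": ["t", "t", "t", "u", "v", "t", "t", "t", "u", "v", "t"],
})

# 1. Keep the first row of each (record, start) pair; the original
#    index labels (0, 1, 2, 3, 5, 6, 7, 10) are preserved.
firsts = df1.drop_duplicates(subset=["record", "start"])["start"]

# 2. Shift up by one so each kept row holds the next record's start.
#    The last kept row has no successor and becomes NaN.
next_start = firsts.shift(-1)

# 3. Reindex to the full index; labels that were dropped (4, 8, 9)
#    take the value of the nearest earlier kept label.
df1["end"] = next_start.reindex(df1.index, method="ffill")

print(df1["end"].tolist())
# [5.0, 7.0, 13.0, 19.0, 19.0, 27.0, 38.0, 54.0, 54.0, 54.0, nan]
```

The last row stays NaN because the `method` argument of `reindex` only fills labels that are new to the index; it does not fill NaN values that were already present.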

CodePudding user response:

This solution is equivalent to @Quixotic22's, although more explicit.

import pandas as pd

df = pd.DataFrame({
    'record':[1,2,3,4,4,5,6,7,7,7,8],
    'start':[0,5,7,13,13,19,27,38,38,38,54],
    'param':['t','t','t','u','v','t','t','t','u','v','t']
})
max_end = 100

df["end"] = float("nan")  # create new float column with missing values
loc = df["record"].shift(1) != df["record"]  # True on the first row of each record

df.loc[loc, "end"] = df.loc[loc, "start"].shift(-1)  # assign the next record's start
df["end"] = df["end"].ffill()  # fill remaining missing values
df.loc[df.index[-1], "end"] = max_end  # override last value

df
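As a possible alternative that neither answer above uses, the same result can be sketched with a lookup from each record to the next record's start, applied with `Series.map`. The name `first_start` is made up for illustration. The sketch assumes rows are ordered so that records appear in the desired sequence, and that a record value never reappears after a different record.

```python
import pandas as pd

df = pd.DataFrame({
    "record": [1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 8],
    "start": [0, 5, 7, 13, 13, 19, 27, 38, 38, 38, 54],
    "param": ["t", "t", "t", "u", "v", "t", "t", "t", "u", "v", "t"],
})

# One start value per record, indexed by record, in order of appearance.
first_start = df.drop_duplicates("record").set_index("record")["start"]

# Shift so each record points at the following record's start,
# then broadcast that value back to every row of the record.
df["end"] = df["record"].map(first_start.shift(-1))

print(df["end"].tolist())
# [5.0, 7.0, 13.0, 19.0, 19.0, 27.0, 38.0, 54.0, 54.0, 54.0, nan]
```

If a sentinel such as `max_end` is wanted instead of NaN, `df["end"] = df["end"].fillna(max_end)` replaces it for every row of the last record, not only the final row.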