How to compare repeated values of the same column of a dataframe in python-CodePudding

I'm using this dataset to solve a problem ->https://www.kaggle.com/datasets/harlfoxem/housesalesprediction.

I used this code to see repeated values in the column "id": pd.concat(g for _, g in df.groupby("id") if len(g) > 1) and the output is this:

output

what i'm want to do is to compare repeated values, e.g: id 1000102 appear two times, that means it was sold more than once in the range. What I would like to do is take these two entries and compare the sale value in the "price" column to find the biggest variances, but with all the entries. For example sorting from highest to lowest variation.

CodePudding user response：

In groups by 'id': from the LAST element subtract the FIRST one. This will be the diff value that corresponds to the 'id' of the group.

diffs = df_2.groupby('id')['price'].agg(lambda x: x.iloc[-1] - x.iloc[0] )
diffs.name = 'diff'

You can aggregate with any function here! For example: lambda x: x.max() - x.min().

If you'd like to join the diffs Series as a column to the df:

df.iloc[:, :3].merge(diffs, left_on='id', right_index=True)

Other

This question provides a better version of leaving only groups with more than 1 element:

df_2 = df.groupby('id').filter(lambda x: len(x) > 1)