I'm using this dataset to solve a problem ->https://www.kaggle.com/datasets/harlfoxem/housesalesprediction.
I used this code to see repeated values in the column "id": pd.concat(g for _, g in df.groupby("id") if len(g) > 1)
and the output is this:
what i'm want to do is to compare repeated values, e.g: id 1000102 appear two times, that means it was sold more than once in the range. What I would like to do is take these two entries and compare the sale value in the "price" column to find the biggest variances, but with all the entries. For example sorting from highest to lowest variation.
CodePudding user response:
In groups by 'id'
: from the LAST element subtract the FIRST one. This will be the diff
value that corresponds to the 'id'
of the group.
diffs = df_2.groupby('id')['price'].agg(lambda x: x.iloc[-1] - x.iloc[0] )
diffs.name = 'diff'
You can aggregate with any function here! For example: lambda x: x.max() - x.min()
.
If you'd like to join the diffs
Series as a column to the df
:
df.iloc[:, :3].merge(diffs, left_on='id', right_index=True)
Other
This question provides a better version of leaving only groups with more than 1 element:
df_2 = df.groupby('id').filter(lambda x: len(x) > 1)