I have a pandas dataframe into which I would like to include a new column ('colors'), that contains a list of all colors (column 'color') of an item in that year previous to that row (i.e. grouped by the columns 'year' and 'item' and only including the rows above).
Suppose my df looks like this:
id item year color
0 shirt 2021 yellow
1 shoes 2022 pink
2 shirt 2021 green
3 shirt 2021 black
My goal would be:
id item year color colors
0 shirt 2021 yellow []
1 shoes 2022 pink []
2 shirt 2021 green [yellow]
3 shirt 2021 black [yellow, green]
So far I have played around with code like this
self.df['colors'] = self.df.groupby(by = ['year', 'item'], group_keys = False)['color'].apply(list())
or
self.df['colors'] = self.df.groupby(by = ['year', 'item'], group_keys = False)['color'].apply(lambda x : list(x.shift())
But I ran into errors around re-indexing etc., so I would be glad if some of you experts could help me here.
CodePudding user response:
Here is one way you could do it:
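import itertools
# Accumulate the colors within each (item, year) group as space-joined strings.
acc = df.groupby(['item', 'year'])['color'].transform(lambda x: list(itertools.accumulate(x, '{} {}'.format)))
# Shift within the same groups, so each row only sees earlier rows of its own group.
df['colors'] = acc.groupby([df['item'], df['year']]).shift()
print(df)
id item year color colors
0 0 shirt 2021 yellow NaN
1 1 shoes 2022 pink NaN
2 2 shirt 2021 green yellow
3 3 shirt 2021 black yellow green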
If you need them to be stored as lists, just add df['colors'] = df['colors'].str.split(). Replacing the NaN values with empty lists is also possible if you want that.
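As an alternative, the column can be built directly as lists, which avoids the space-joined strings (those would be split incorrectly for a multi-word color such as "light blue"). This is only a minimal sketch and not the method above: it assumes that the row order of df is what defines "previous", the sample frame is reconstructed from the question, and the name seen is purely illustrative.

```python
from collections import defaultdict

import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "item": ["shirt", "shoes", "shirt", "shirt"],
    "year": [2021, 2022, 2021, 2021],
    "color": ["yellow", "pink", "green", "black"],
})

# Colors seen so far for each (item, year) pair (illustrative helper name).
seen = defaultdict(list)
colors = []

# Walk the rows top to bottom; each row gets a copy of what its group
# has seen so far, and only then is its own color recorded.
for item, year, color in zip(df["item"], df["year"], df["color"]):
    key = (item, year)
    colors.append(list(seen[key]))  # copy, so later appends do not leak in
    seen[key].append(color)

# Store the lists as an object column aligned with the original index.
df["colors"] = pd.Series(colors, index=df.index, dtype=object)
print(df)
```

With this approach the first row of every group gets an empty list rather than NaN, so no separate fill step is needed. It is a plain Python loop, so on very large frames it may be slower than a groupby-based solution.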