Insert list of strings as a column in a dataframe

Time:11-29

I have a pandas dataframe to which I would like to add a new column ('colors') containing a list of all colors (column 'color') that the same item had in the same year in earlier rows (i.e. grouped by the columns 'year' and 'item', and including only the rows above the current one).

Suppose my df looks like this:

id  item   year  color
0   shirt  2021  yellow
1   shoes  2022  pink
2   shirt  2021  green
3   shirt  2021  black
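
For anyone who wants to reproduce this, the sample frame above could be built roughly as follows. This is only a sketch: treating 'id' as a regular column (rather than the index) is an assumption, based on how the frame is printed later in the thread.

```python
import pandas as pd

# Sample data copied from the table above; 'id' is kept as a normal column
# (an assumption, it could just as well be the index).
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "item": ["shirt", "shoes", "shirt", "shirt"],
    "year": [2021, 2022, 2021, 2021],
    "color": ["yellow", "pink", "green", "black"],
})

print(df)
```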

My goal would be:

id  item   year  color   colors
0   shirt  2021  yellow  []
1   shoes  2022  pink    []
2   shirt  2021  green   [yellow]
3   shirt  2021  black   [yellow, green]

So far I have played around with code like this

self.df['colors'] = self.df.groupby(by = ['year', 'item'], group_keys = False)['color'].apply(list())

or

self.df['colors'] = self.df.groupby(by = ['year', 'item'], group_keys = False)['color'].apply(lambda x : list(x.shift()))

But I ran into errors around re-indexing etc., so I would be glad if some of you experts could help me here.

Answer:

Here is one way you could do it:

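import itertools

keys = ['item', 'year']
# Running, space-separated string of colors within each (item, year) group.
running = df.groupby(keys)['color'].transform(lambda x: list(itertools.accumulate(x, '{} {}'.format)))
# Shift within each group, so a row only sees earlier rows of its own group.
df['colors'] = running.groupby([df[k] for k in keys]).shift()
print(df)
   id   item  year   color        colors
0   0  shirt  2021  yellow           NaN
1   1  shoes  2022    pink           NaN
2   2  shirt  2021   green        yellow
3   3  shirt  2021   black  yellow green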

If you need them stored as lists, just add df['colors'] = df['colors'].str.split(). Replacing the NaN values with empty lists is also possible if you want that.
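
Putting the pieces together, a self-contained sketch of the list version might look like the following. It rebuilds the sample frame from the question, shifts within each (item, year) group so that a row only sees earlier rows of its own group, and turns missing values into empty lists. The names `keys`, `running` and `previous` are just illustrative, and the list comprehension is one possible way to do the split and the NaN replacement in a single step.

```python
import itertools

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "item": ["shirt", "shoes", "shirt", "shirt"],
    "year": [2021, 2022, 2021, 2021],
    "color": ["yellow", "pink", "green", "black"],
})

keys = ["item", "year"]

# Running, space-separated string of colors within each (item, year) group.
running = df.groupby(keys)["color"].transform(
    lambda x: list(itertools.accumulate(x, "{} {}".format))
)

# Shift within each group so a row only sees earlier rows of its own group.
previous = running.groupby([df[k] for k in keys]).shift()

# Split the strings into lists; rows without an earlier color get [].
df["colors"] = [s.split() if isinstance(s, str) else [] for s in previous]

print(df["colors"].tolist())
# [[], [], ['yellow'], ['yellow', 'green']]
```

One caveat with this approach: because the colors are joined and split on whitespace, a color name that itself contains a space (say, "dark blue") would end up as two list entries.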
