How to have a column full of lists in pandas-CodePudding

I have a pandas dataframe as follows

fruit_name | color
----------- -------
apple      | red
banana     | yellow
orange     | orange

Using this dataframe, I want to create a new dataframe as follows

name  |  fruit
------ -------
Ben   |  ['apple', 'banana', 'orange']
Alex  |  ['apple', 'banana', 'orange']
Jude  |  ['apple', 'banana', 'orange']
May   |  ['apple', 'banana', 'orange']
...

For this, I tried the following

new_df = pd.DataFrame(columns=['name', 'fruit']
new_df['name'] = students['name']
new_df['fruit'] = fruits['fruits_name'].to_list()

I get the following error message

ValueError: Length of values (3) does not match length of index (10)

I want my final result as follows

name  |  fruit
------ -------
Ben   |  apple
Ben   |  banana
Ben   |  orange
Alex  |  apple
Alex  |  banana
Alex  |  orange
...

I thought I would use the explode function once I have a column of lists. But I am stuck getting the dataframe with the column of lists.

CodePudding user response：

What you are looking for is actually called a cross merge, available from Pandas 1.2.0 :

new_df = students[['name']].merge(fruits[['fruit_name']], how='cross')

Output:

    name fruit_name
0    Ben      apple
1    Ben     banana
2    Ben     orange
3   Alex      apple
4   Alex     banana
5   Alex     orange
6   Jude      apple
7   Jude     banana
8   Jude     orange
9    May      apple
10   May     banana
11   May     orange

Note if you're stuck with earlier version, use the same key to merge:

new_df = (students[['name']].assign(key=1)
   .merge(fruits[['fruit_name']].assign(key=1), on='key')
   .drop('key', axis=1)
)

CodePudding user response：

The reason you are getting the error ValueError: Length of values (3) does not match length of index (10) is because you are trying to insert the list of fruits as a column. To make it a nested list you can use np.repeat or as below:

fruit_list = ['apple', 'banana', 'orange']
names = ['Ben', 'Alex', 'Jude', 'May']

fruit_df = pd.DataFrame(columns=['name', 'fruit'])
fruit_df['name'] = names
fruit_df['fruit'] = [fruit_list for _ in range(len(names))] # make a nested fruit list for every name

that gives:

>>> fruit_df.head(2)
  name                    fruit
0   Ben  [apple, banana, orange]
1  Alex  [apple, banana, orange]

as you are asking about. and then you can

>>> fruit_df.explode('fruit') 
   name   fruit
0   Ben   apple
0   Ben  banana
0   Ben  orange
1  Alex   apple
1  Alex  banana
1  Alex  orange
2  Jude   apple
2  Jude  banana
2  Jude  orange
3   May   apple
3   May  banana
3   May  orange