I have a pandas dataframe as follows
fruit_name | color
----------- -------
apple | red
banana | yellow
orange | orange
Using this dataframe, I want to create a new dataframe as follows
name | fruit
------ -------
Ben | ['apple', 'banana', 'orange']
Alex | ['apple', 'banana', 'orange']
Jude | ['apple', 'banana', 'orange']
May | ['apple', 'banana', 'orange']
...
For this, I tried the following
new_df = pd.DataFrame(columns=['name', 'fruit']
new_df['name'] = students['name']
new_df['fruit'] = fruits['fruits_name'].to_list()
I get the following error message
ValueError: Length of values (3) does not match length of index (10)
I want my final result as follows
name | fruit
------ -------
Ben | apple
Ben | banana
Ben | orange
Alex | apple
Alex | banana
Alex | orange
...
I thought I would use the explode function once I have a column of lists. But I am stuck getting the dataframe with the column of lists.
CodePudding user response:
What you are looking for is actually called a cross merge, available from Pandas 1.2.0 :
new_df = students[['name']].merge(fruits[['fruit_name']], how='cross')
Output:
name fruit_name
0 Ben apple
1 Ben banana
2 Ben orange
3 Alex apple
4 Alex banana
5 Alex orange
6 Jude apple
7 Jude banana
8 Jude orange
9 May apple
10 May banana
11 May orange
Note if you're stuck with earlier version, use the same key to merge:
new_df = (students[['name']].assign(key=1)
.merge(fruits[['fruit_name']].assign(key=1), on='key')
.drop('key', axis=1)
)
CodePudding user response:
The reason you are getting the error ValueError: Length of values (3) does not match length of index (10)
is because you are trying to insert the list of fruits as a column. To make it a nested list you can use np.repeat or as below:
fruit_list = ['apple', 'banana', 'orange']
names = ['Ben', 'Alex', 'Jude', 'May']
fruit_df = pd.DataFrame(columns=['name', 'fruit'])
fruit_df['name'] = names
fruit_df['fruit'] = [fruit_list for _ in range(len(names))] # make a nested fruit list for every name
that gives:
>>> fruit_df.head(2)
name fruit
0 Ben [apple, banana, orange]
1 Alex [apple, banana, orange]
as you are asking about. and then you can
>>> fruit_df.explode('fruit')
name fruit
0 Ben apple
0 Ben banana
0 Ben orange
1 Alex apple
1 Alex banana
1 Alex orange
2 Jude apple
2 Jude banana
2 Jude orange
3 May apple
3 May banana
3 May orange