I want to expand the columns of the following (toy example) pandas DataFrame,
df = pd.DataFrame({'col1': ["A", "A", "A", "B", "B", "B"],
'col2': [1, 7, 3, 2, 9, 4],
'col3': [3, -1, 0, 5, -2, -3],})
col1 col2 col3
0 A 1 3
1 A 7 -1
2 A 3 0
3 B 2 5
4 B 9 -2
5 B 4 -3
such that it will become row-wise,
col1 col2_1 col2_2 col2_3 col3_1 col3_2 col3_3
0 A 1 7 3 3 -1 0
1 B 2 9 4 5 -2 -3
I know that I shall use groupby('col1')
but do not know how to achieve the desired DataFrame.
Note: The number of elements in each group when we perform groupby('col1')
are all equal (in this case we have three A's and three B's)
Edit: I managed to do it by the following code, but it is not efficient,
import pandas as pd
from functools import partial
def func(x, exclude_list):
for col in x.columns:
if col in exclude_list:
continue
for i, value in enumerate(x[col].values):
x[f'{col}_{i 1}'] = value
return x
df = pd.DataFrame({'col1': ["A", "A", "A", "B", "B", "B"],
'col2': [1, 7, 3, 2, 9, 4],
'col3': [3, -1, 0, 5, -2, -3],})
exclude_list = ['col1']
columns_to_expand = ['col2', 'col3']
func2 = partial(func, exclude_list=exclude_list)
df2 = df.groupby(exclude_list).apply(func2)
df2.drop(columns_to_expand, axis=1, inplace=True)
df3 = df2.groupby(exclude_list).tail(1).reset_index()
df3.drop('index', axis=1, inplace=True)
print(df3)
which results in,
col1 col2_1 col2_2 col2_3 col3_1 col3_2 col3_3
0 A 1 7 3 3 -1 0
1 B 2 9 4 5 -2 -3
Edit2: This code, based on ouroboros1
answer works efficiently,
df_pivot = None
for col in columns_to_expand:
df['index'] = [f'{col}_{i}' for i in range(1,4)]*len(np.unique(df[exclude_list].values))
if df_pivot is None:
df_pivot = df.pivot(index=exclude_list, values=col, columns='index').reset_index(drop=False)
else:
df_pivot = df_pivot.merge(df.pivot(index=exclude_list, values=col, columns='index').reset_index(drop=False))
CodePudding user response:
Solution for any number of columns:
def f(group):
result = (group.set_axis(range(1, len(group) 1))
.T
.stack())
result.index = (result.index
.to_flat_index()
.map('{0[0]}_{0[1]}'.format))
return result
Specify the columns (as a list) here:
chosen_cols = ['col2', 'col3']
df.groupby('col1')[chosen_cols].apply(f).reset_index()
Result:
col1 col2_1 col2_2 col2_3 col3_1 col3_2 col3_3
0 A 1 7 3 3 -1 0
1 B 2 9 4 5 -2 -3
Explanation:
f
accepts a sub-dataframe: a group (where 'col1'
is the same),
and converts this group into a Series.
When returned, this Series will become the row.
For example:
Let's say that that f
receives a group
where col1
is 'A'
:
col2 col3
0 1 3
1 7 -1
2 3 0
It resets the index (because you want the numbering to start from 1):
group.set_axis(range(1, len(group) 1))
col2 col3
1 1 3
2 7 -1
3 3 0
After .T.stack()
, the dataframe becomes a Series with a MultiIndex:
col2 1 1
2 7
3 3
col3 1 3
2 -1
3 0
dtype: int64
This MultiIndex is flattened (into tuples) and the tuple elements are merged together with a '_' in between them:
col2_1 1
col2_2 7
col2_3 3
col3_1 3
col3_2 -1
col3_3 0
dtype: int64
That's it! After that, this Series is returned from f
and becomes a row:
col1 col2_1 col2_2 col2_3 col3_1 col3_2 col3_3
0 A 1 7 3 3 -1 0
Previous solution (it works only for 1 column, so don't use it):
3 steps (using .groupby
):
- Aggregate all elements of each group into a list:
temp = df.groupby('col1')['col2'].agg(list)
col1
A [1, 7, 3]
B [2, 9, 4]
Name: col2, dtype: object
- Expand these lists into new columns:
result = pd.DataFrame(temp.tolist(), index=temp.index)
0 1 2
col1
A 1 7 3
B 2 9 4
- Rename the columns and reset index:
result.columns = range(1, len(result.columns) 1)
result = result.add_prefix('col2_').reset_index()
col1 col2_1 col2_2 col2_3
0 A 1 7 3
1 B 2 9 4
CodePudding user response:
Update: the question has been updated to expand multiple columns row-wise. This requires some refactoring of the initial answers that were tailored to the initial question, which only required the operation to take place on one column (col2
). Note that the current refactored answers also work perfectly fine on a single column. However, since they are a little verbose for that situation, I'm keeping the original answers for just 1 column at the end.
Answers for expanding multiple columns row-wise
You could use df.pivot
for this:
import pandas as pd
df = pd.DataFrame({'col1': ["A", "A", "A", "B", "B", "B"],
'col2': [1, 7, 3, 2, 9, 4],
'col3': [3, -1, 0, 5, -2, -3],})
cols = ['col2','col3']
# val count per unique val in col1: N.B. expecting all to have same count!
vals_unique_col1 = df.col1.value_counts()[0] 1 # 3 1 (use in `range()`)
len_unique_col1 = len(df.col1.unique()) # 2
# create temp cols [1,2,3] and store in new col
df['my_index'] = [i for i in range(1,vals_unique_col1)]*len_unique_col1
df_pivot = df.pivot(index='col1',values=cols,columns='my_index')\
.reset_index(drop=False)
# customize df cols by joining MultiIndex columns
df_pivot.columns = ['_'.join(str(i) for i in x) for x in df_pivot.columns]
df_pivot.rename(columns={'col1_':'col1'}, inplace=True)
print(df_pivot)
col1 col2_1 col2_2 col2_3 col3_1 col3_2 col3_3
0 A 1 7 3 3 -1 0
1 B 2 9 4 5 -2 -3
2 alternative solutions based on df.groupby
could like this:
- Groupby solution 1
import pandas as pd
df = pd.DataFrame({'col1': ["A", "A", "A", "B", "B", "B"],
'col2': [1, 7, 3, 2, 9, 4],
'col3': [3, -1, 0, 5, -2, -3],})
cols = ['col2','col3']
df_groupby = df.groupby('col1')[cols].agg(list)\
.apply(pd.Series.explode, axis=1).reset_index(drop=False)
# same as in `pivot` method, this will be 3
len_cols = df.col1.value_counts()[0]
# rename cols
df_groupby.columns=[f'{col}_{(idx-1)%len_cols 1}' if col != 'col1' else col
for idx, col in enumerate(df_groupby.columns)]
- Groupby solution 2
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ["A", "A", "A", "B", "B", "B"],
'col2': [1, 7, 3, 2, 9, 4],
'col3': [3, -1, 0, 5, -2, -3],})
cols = ['col2','col3']
agg_lists = df.groupby('col1')[cols].agg(list)
dfs = [pd.DataFrame(agg_lists[col].tolist(), index=agg_lists.index)
for col in agg_lists.columns]
df_groupby = pd.concat(dfs, axis=1)
len_cols = df.col1.value_counts()[0]
cols_rep = np.repeat(cols,len_cols)
df_groupby.columns = [f'{col}_{str(i 1)}' for col, i
in zip(cols_rep, df_groupby.columns)]
df_groupby.reset_index(drop=False, inplace=True)
(Original) answers for expanding single column row-wise
You could use df.pivot
for this:
import pandas as pd
df = pd.DataFrame({'col1': ["A", "A", "A", "B", "B", "B"],
'col2': [1, 7, 3, 2, 9, 4]})
# add col with prospective col names (`col1_1,*_2,*_3`)
# and multiply by len unique values in `df.col1`
df['index'] = [f'col2_{i}' for i in range(1,4)]*len(df.col1.unique())
df_pivot = df.pivot(index='col1',values='col2',columns='index')\
.reset_index(drop=False)
print(df_pivot)
index col1 col2_1 col2_2 col2_3
0 A 1 7 3
1 B 2 9 4
Alternative solution based on df.groupby
could like this:
import pandas as pd
df = pd.DataFrame({'col1': ["A", "A", "A", "B", "B", "B"], \
'col2': [1, 7, 3, 2, 9, 4]})
# create lists of values in `col2` per group in `col1`,
# then expand into multiple cols with `apply(pd.Series), finally reset index
df_groupby = df.groupby('col1').agg(list)['col2']\
.apply(pd.Series).reset_index(drop=False)
# overwrite new cols (`0,1,2`) with desired col names `col2_1, etc.`
df_groupby.columns=[f'col2_{col 1}' if col != 'col1' else col
for col in list(df_groupby.columns)]
print(df_groupby)
col1 col2_1 col2_2 col2_3
0 A 1 7 3
1 B 2 9 4