So I know there are a lot of posts on how to split a single column of tuples into several columns. But I haven't seen any on how to split several columns of tuples at the same time. I've tried doing a nested apply:
spl.apply(lambda col: spl[col].apply(pd.Series))
but it doesn't work.
spl = pd.DataFrame({('a', 'b') : [(i, i 1) for i in range(3)],
('a', 'c'): [(2*i, 2*i 1) for i in range(3)],
('e', 'b'): [(2*i, 2*i 1) for i in range(3)]})
>>> spl
a e
b c b
0 (0, 1) (0, 1) (0, 1)
1 (1, 2) (2, 3) (2, 3)
2 (2, 3) (4, 5) (4, 5)
The target is:
a e
b0 b1 c0 c1 b0 b1
0 0 1 0 1 0 1
1 1 2 2 3 2 3
2 2 3 4 5 4 5
CodePudding user response:
Here's a solution:
new_df = pd.concat([pd.DataFrame(spl[c].tolist()).add_prefix(c[-1]) for c in spl], axis=1)
new_df.columns = pd.MultiIndex.from_arrays([np.repeat(spl.columns.get_level_values(0), 2), new_df.columns])
Output:
>>> new_df
a e
b0 b1 c0 c1 b0 b1
0 0 1 0 1 0 1
1 1 2 2 3 2 3
2 2 3 4 5 4 5
One-big-liner :)
new_df = pd.concat([pd.DataFrame(spl[c].tolist()).add_prefix(c[-1]) for c in spl], axis=1).pipe(lambda x: x.set_axis(pd.MultiIndex.from_arrays([np.repeat(spl.columns.get_level_values(0), 2), x.columns]), axis=1))
CodePudding user response:
We could use unstack
DataFrame constructor pivot
. We can't pivot a Series object, so we reset_index
first.
To fix duplicate column names, we could reformat column names by creating a MultiIndex object by combining original column name level 1 with the tuple index:
tmp = spl.unstack()
out = (pd.pivot(pd.DataFrame(tmp.tolist(), index=tmp.index).reset_index(),
['level_2'], ['level_0','level_1'], [0,1])
.rename_axis(columns=[None, None, None], index=[None]))
out.columns = pd.MultiIndex.from_arrays((out.columns.get_level_values(1),
out.columns.get_level_values(2)
out.columns.get_level_values(0).astype(str)))
out = out.sort_index(axis=1)
Output:
a e
b0 b1 c0 c1 b0 b1
0 0 1 0 1 0 1
1 1 2 2 3 2 3
2 2 3 4 5 4 5
As @sammywemmy suggested in the comments to this question, rebuilding the DataFrame from vanilla Python is indeed easier than the mess above.
from itertools import chain, product
out = pd.DataFrame([list(chain.from_iterable(lst)) for lst in spl.to_numpy().tolist()],
columns=pd.MultiIndex.from_tuples((i, j str(k)) for (i,j), k in
product(spl.columns, [0,1])))