I have a dataframe with two columns of lists:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7']]})
>>> df
    A         B         C
0  x1  [v1, v2]  [c1, c2]
1  x2  [v3, v4]  [c3, c4]
2  x3      [v6]  [c5, c6]
3  x4  [v7, v8]      [c7]
I would like to explode columns B and C, so the output looks like this:
>>> df_exploded
    A   B   C
0  x1  v1  c1
1  x1  v2  c2
2  x2  v3  c3
3  x2  v4  c4
4  x3  v6  c5
5  x3  v6  c6
6  x4  v7  c7
7  x4  v8  c7
My current solution is to first separate out the rows where the lists in columns B and C have the same length and run df.explode(["B", "C"]) on them, and for the remaining rows run df.explode("B") followed by df.explode("C").
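Roughly, that approach looks like the sketch below. The names same_len, part_equal and part_rest are only for illustration, and it assumes pandas 1.3 or later for the multi-column explode:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x1', 'x2', 'x3', 'x4'],
    'B': [['v1', 'v2'], ['v3', 'v4'], ['v6'], ['v7', 'v8']],
    'C': [['c1', 'c2'], ['c3', 'c4'], ['c5', 'c6'], ['c7']],
})

# Rows whose lists in B and C have the same length
same_len = df['B'].str.len() == df['C'].str.len()

# Equal-length rows can be exploded on both columns at once
part_equal = df[same_len].explode(['B', 'C'])

# Remaining rows are exploded one column at a time
# (fine here because one of the two lists has a single element)
part_rest = df[~same_len].explode('B').explode('C')

# Put the pieces back in the original row order and renumber
df_exploded = (pd.concat([part_equal, part_rest])
               .sort_index(kind='mergesort')
               .reset_index(drop=True))
```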
I am wondering if there's a better solution. Thanks in advance!
CodePudding user response:
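Use itertools.zip_longest to pair up the elements of B and C row by row (the shorter list is padded with None), then forward-fill the gaps within each original row:
import itertools
# pair B and C element-wise per row; zip_longest pads the shorter list with None
df1 = (df.apply(lambda x: list(itertools.zip_longest(x['B'], x['C'])), axis=1)
         .explode()                                          # one (B, C) tuple per row
         .apply(lambda x: pd.Series(x, index=['B', 'C']))    # split tuples into columns
         .groupby(level=0).ffill())                          # fill None from the same original row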
df1
    B   C
0  v1  c1
0  v2  c2
1  v3  c3
1  v4  c4
2  v6  c5
2  v6  c6
3  v7  c7
3  v8  c7
Join df1 back onto column A to get the desired output:
df[['A']].join(df1)
Output:
    A   B   C
0  x1  v1  c1
0  x1  v2  c2
1  x2  v3  c3
1  x2  v4  c4
2  x3  v6  c5
2  x3  v6  c6
3  x4  v7  c7
3  x4  v8  c7
If you want a fresh 0 to 7 index, call reset_index(drop=True) on the result.
CodePudding user response:
Not with explode alone. Passing both columns, df.explode(["B", "C"]) (multi-column explode, available since pandas 1.3), requires the lists in each row to have the same length. For rows 2 and 3 of your dataframe it raises ValueError: columns must have matching element counts. If you first pad the shorter list in each row by repeating its last element, so that B and C match in length everywhere, a single call is enough:
df_exploded = df.explode(["B", "C"], ignore_index=True)
Once the lists are padded, this gives the expected output:
    A   B   C
0  x1  v1  c1
1  x1  v2  c2
2  x2  v3  c3
3  x2  v4  c4
4  x3  v6  c5
5  x3  v6  c6
6  x4  v7  c7
7  x4  v8  c7
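A minimal sketch of one way to make a single explode call work on this data, assuming pandas 1.3 or later: pad the shorter list in each row by repeating its last element so the lengths match. pad_last and df_padded are hypothetical names used only for this example; pad_last is not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x1', 'x2', 'x3', 'x4'],
    'B': [['v1', 'v2'], ['v3', 'v4'], ['v6'], ['v7', 'v8']],
    'C': [['c1', 'c2'], ['c3', 'c4'], ['c5', 'c6'], ['c7']],
})

def pad_last(values, n):
    """Hypothetical helper: extend `values` to length n by repeating its last element."""
    fill = values[-1] if values else None  # empty lists are padded with None
    return list(values) + [fill] * (n - len(values))

# Target length per row: the longer of the two lists
lengths = [max(len(b), len(c)) for b, c in zip(df['B'], df['C'])]

# Replace B and C with padded copies so every row has matching element counts
df_padded = df.assign(
    B=[pad_last(b, n) for b, n in zip(df['B'], lengths)],
    C=[pad_last(c, n) for c, n in zip(df['C'], lengths)],
)

# Multi-column explode now succeeds; ignore_index=True renumbers rows 0..7
df_exploded = df_padded.explode(['B', 'C'], ignore_index=True)
```

Padding with the last element reproduces what the forward fill in the zip_longest answer does; if you would rather keep the gaps as missing values, pad with None instead.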