So my question is similar to this
Desired output example
I have been playing around with explode()
, reset_index()
, drop()
and more but have not been able to get anything to give the correct output yet.
One thing I tried was this
df = df.explode("C1").reset_index().drop("index",1).explode("C4").reset_index().drop("index",1)
But output was wrong
CodePudding user response:
It seems that the exploded columns and the non-exploded columns need to be separated. Since we can't hide them in the index as we normally do (given C2
) contains lists (which are unhashable) we must separate the DataFrame then rejoin.
# Convert to single series to explode
cols = ['C1', 'C4']
new_df = df[cols].stack().explode().to_frame()
# Enumerate groups then unstack
new_df = new_df.set_index(
new_df.groupby(level=[0, 1]).cumcount(),
append=True
).unstack(1).groupby(level=0).ffill()
# Join Back Unaffected columns
new_df = new_df.droplevel(0, axis=1).droplevel(1, axis=0).join(
df[df.columns.symmetric_difference(cols)]
)
# Re order columns and reset index
new_df = new_df.reindex(df.columns, axis=1).reset_index(drop=True)
new_df
:
C1 C2 C3 C4
0 A [1] s1 123
1 B [1] s1 123
2 C [2] s2 321
3 D [3] s3 777
4 E [3] s3 111
5 F [4] s4 145
We stack
to get all values into a single series then explode
together and convert back to_frame
cols = ['C1', 'C4']
new_df = df[cols].stack().explode().to_frame()
new_df
0
0 C1 A
C1 B
C4 123
1 C1 C
C4 321
2 C1 D
C1 E
C4 777
C4 111
3 C1 F
C4 145
We can create a new index by enumerating groups with groupby cumcount
set_index
and unstacking
:
new_df = new_df.set_index(
new_df.groupby(level=[0, 1]).cumcount(),
append=True
).unstack(1)
0
C1 C4
0 0 A 123
1 B NaN
1 0 C 321
2 0 D 777
1 E 111
3 0 F 145
We can then groupby ffill
within index groups:
new_df = new_df.groupby(level=0).ffill()
new_df
:
0
C1 C4
0 0 A 123
1 B 123
1 0 C 321
2 0 D 777
1 E 111
3 0 F 145
We can then join
back the unaffected columns to the DataFrame and reindex
to reorder them the way the initially appeared also droplevel
to remove unneeded index levels, lastly reset_index
:
# Join Back Unaffected columns
new_df = new_df.droplevel(0, axis=1).droplevel(1, axis=0).join(
df[df.columns.symmetric_difference(cols)]
)
# Re order columns and reset index
new_df = new_df.reindex(df.columns, axis=1).reset_index(drop=True)
new_df
:
C1 C2 C3 C4
0 A [1] s1 123
1 B [1] s1 123
2 C [2] s2 321
3 D [3] s3 777
4 E [3] s3 111
5 F [4] s4 145
CodePudding user response:
df=df.explode('C4').assign(C1=df['C1'].str.join(',').str.split(',')).explode('C1')#Explode to expand dataframe
m=df.duplicated(subset='C1', keep=False)#loc select the duplicated
df.loc[m,'C4']=df.loc[m,'C4'].shift(1)#Introduce Nan
df.dropna().drop_duplicates(subset='C1', keep='last')#clean dataframe
Output
C1 C2 C3 C4
0 A [1] s1 123
0 B [1] s1 123
1 C [2] s2 321
2 D [3] s3 777
2 E [3] s3 111
3 F [4] s4 145