How to explode pd.Series and concat to column in Pandas without duplicating other column values?-CodePudding

The issue starts with how my excel file is read into Pandas.

	id	name	application	new_applications
0	451124577	ABC123	app.1	[app.10, app.20]
1			app.2
2	845151514	DEF456	app.3	[app.30, app.40]
3			app.4

When reading the file into Pandas there is a hierarchy but none of the cells are grouped or set as a multi-index. Unfortunately our system requires this format so any changes I make have to maintain that format.

So far this is what I have:

df1 =df[['id', 'name', 'application', 'new_applications']]

df1 = df.set_index(['id', 'name', 'application']).explode('new_applications')

Outcome:

id	name	application
451124577	ABC123	app.1
	ABC123	app.2
	ABC123	app.10
	ABC123	app.20
845151514	DEF456	app.3
	DEF456	app.4
	DEF456	app.30
	DEF456	app.40

Also tried:

df1 =df[['id', 'name', 'application', 'new_applications']]

temp_df = df['new_applications'].rename('application').explode()
df2 = pd.merge(df1, temp, left_index=True, right_index=True)

# and

df3 = pd.concat([df1, temp], axis=1)

# and

df10 = df1.set_index('id').join(temp.set_index('id'), rsuffix='_r')
# or
df10 = df1.join(temp.set_index('id'), on='id', rsuffix='_r')

Outcome Needed

id	name	application
451124577	ABC123	app.1
		app.2
		app.10
		app.20
845151514	DEF456	app.3
		app.4
		app.30
		app.40

When I reset_index() I can't figure out a way to get replace all the duplicate values in columns ['id', 'name'] without dropping the entire row.

Nevertheless, I inevitably end up with the same issue.

Any help would be awesome!

CodePudding user response：

First you can groupby the cumsum of id and name columns, then concat value in new_applications columns to application column in each group.

df_ = (df.groupby([df['id'].notna().cumsum(), df['name'].notna().cumsum()]).apply(lambda group: pd.concat([group, pd.DataFrame({'application': group['new_applications'].iloc[0]})])))

print(df_)

                    id    name application  new_applications
id name
1  1    0  451124577.0  ABC123       app.1  [app.10, app.20]
        1          NaN     NaN       app.2               NaN
        0          NaN     NaN      app.10               NaN
        1          NaN     NaN      app.20               NaN
2  2    2  845151514.0  DEF456       app.3  [app.30, app.40]
        3          NaN     NaN       app.4               NaN
        0          NaN     NaN      app.30               NaN
        1          NaN     NaN      app.40               NaN

Then drop the multilevel index, reset the index and drop the new_applications column.

df_.index = df_.index.droplevel()
df_ = df_.reset_index(drop=True).drop('new_applications', axis=1)

print(df_)

            id    name application
0  451124577.0  ABC123       app.1
1          NaN     NaN       app.2
2          NaN     NaN      app.10
3          NaN     NaN      app.20
4  845151514.0  DEF456       app.3
5          NaN     NaN       app.4
6          NaN     NaN      app.30
7          NaN     NaN      app.40