The issue starts with how my excel file is read into Pandas.
id | name | application | new_applications | |
---|---|---|---|---|
0 | 451124577 | ABC123 | app.1 | [app.10, app.20] |
1 | app.2 | |||
2 | 845151514 | DEF456 | app.3 | [app.30, app.40] |
3 | app.4 |
When reading the file into Pandas there is a hierarchy but none of the cells are grouped or set as a multi-index. Unfortunately our system requires this format so any changes I make have to maintain that format.
So far this is what I have:
df1 =df[['id', 'name', 'application', 'new_applications']]
df1 = df.set_index(['id', 'name', 'application']).explode('new_applications')
Outcome:
id | name | application |
---|---|---|
451124577 | ABC123 | app.1 |
ABC123 | app.2 | |
ABC123 | app.10 | |
ABC123 | app.20 | |
845151514 | DEF456 | app.3 |
DEF456 | app.4 | |
DEF456 | app.30 | |
DEF456 | app.40 |
Also tried:
df1 =df[['id', 'name', 'application', 'new_applications']]
temp_df = df['new_applications'].rename('application').explode()
df2 = pd.merge(df1, temp, left_index=True, right_index=True)
# and
df3 = pd.concat([df1, temp], axis=1)
# and
df10 = df1.set_index('id').join(temp.set_index('id'), rsuffix='_r')
# or
df10 = df1.join(temp.set_index('id'), on='id', rsuffix='_r')
Outcome Needed
id | name | application |
---|---|---|
451124577 | ABC123 | app.1 |
app.2 | ||
app.10 | ||
app.20 | ||
845151514 | DEF456 | app.3 |
app.4 | ||
app.30 | ||
app.40 |
When I reset_index()
I can't figure out a way to get replace all the duplicate values in columns ['id', 'name']
without dropping the entire row.
Nevertheless, I inevitably end up with the same issue.
Any help would be awesome!
CodePudding user response:
First you can groupby the cumsum of id
and name
columns, then concat value in new_applications
columns to application
column in each group.
df_ = (df.groupby([df['id'].notna().cumsum(), df['name'].notna().cumsum()]).apply(lambda group: pd.concat([group, pd.DataFrame({'application': group['new_applications'].iloc[0]})])))
print(df_)
id name application new_applications
id name
1 1 0 451124577.0 ABC123 app.1 [app.10, app.20]
1 NaN NaN app.2 NaN
0 NaN NaN app.10 NaN
1 NaN NaN app.20 NaN
2 2 2 845151514.0 DEF456 app.3 [app.30, app.40]
3 NaN NaN app.4 NaN
0 NaN NaN app.30 NaN
1 NaN NaN app.40 NaN
Then drop the multilevel index, reset the index and drop the new_applications
column.
df_.index = df_.index.droplevel()
df_ = df_.reset_index(drop=True).drop('new_applications', axis=1)
print(df_)
id name application
0 451124577.0 ABC123 app.1
1 NaN NaN app.2
2 NaN NaN app.10
3 NaN NaN app.20
4 845151514.0 DEF456 app.3
5 NaN NaN app.4
6 NaN NaN app.30
7 NaN NaN app.40