I have a sheet I need to transform from data being stored horizontally to vertically, i.e. from:
Company ID | DoB | Name | DoB | Name | DoB | Name |
---|---|---|---|---|---|---|
ID 1 | DoB 1 | Name 1 | DoB 2 | Name 2 | DoB 3 | Name 3 |
ID 2 | DoB 4 | Name 4 | DoB 5 | Name 5 |
To:
Company ID | DoB | Name |
---|---|---|
ID 1 | DoB 1 | Name 1 |
ID 1 | DoB 2 | Name 2 |
ID 1 | DoB 3 | Name 3 |
ID 2 | DoB 4 | Name 4 |
ID 2 | DoB 5 | Name 5 |
The data is structured such that the DoB/name entities always occur periodically on rows as illustrated above. The number of entities stored horizontally on each row can vary from none to 16.
How would one go about solving this in Python/Pandas (or something else)? This is a one-time thing, so performance is not really an issue.
Thankful for any help!
CodePudding user response:
You can use:
out = (df
.set_index('Company ID')
.pipe(lambda d: d.set_axis(
pd.MultiIndex
.from_arrays([d.columns,
d.groupby(level=0, axis=1).cumcount().add(1)]),
axis=1)
)
.stack()
.reset_index().drop(columns='level_1')
)
Alternative:
out = pd.concat([g.stack().droplevel(1).rename(k)
for k,g in df.set_index('Company ID').groupby(level=0, axis=1)],
axis=1).reset_index()
output:
Company ID DoB Name
0 ID 1 DoB 1 Name 1
1 ID 1 DoB 2 Name 2
2 ID 1 DoB 3 Name 3
3 ID 2 DoB 4 Name 4
4 ID 2 DoB 5 Name 5
CodePudding user response:
I would use stack
in a for loop here:
res = pd.DataFrame()
for col in df.columns.drop('Company ID').unique():
new_df = ( df.set_index('Company ID')[col].stack().reset_index().
rename({0:col}, axis=1)[['Company ID', col]] )
res = pd.concat([res, new_df], axis=1)
# drop duplicated columns coming from stack
res = res.loc[:, ~res.columns.duplicated()]
Output:
Company ID DoB Name
0 ID 1 DoB 1 Name 1
1 ID 1 DoB 2 Name 2
2 ID 1 DoB 3 Name 3
3 ID 2 DoB 4 Name 4
4 ID 2 DoB 5 Name 5
CodePudding user response:
One option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index = 'Company ID',
names_to = '.value',
names_pattern= '(. )',
sort_by_appearance=True)
.dropna()
)
Company ID DoB Name
0 ID 1 DoB 1 Name 1
1 ID 1 DoB 2 Name 2
2 ID 1 DoB 3 Name 3
3 ID 2 DoB 4 Name 4
4 ID 2 DoB 5 Name 5