Columns are the description of the data and the rows keep the values. However, in some columns there are multiple values (tabular form on website). Rows of those tabular get merged in one cell and are separated by hashtags. Since they are only part of the tabular they refer to other columns with values in cells also separated by hashtags.
Column Name: solution_id | type labour | labour_unit | est_labour_quantity | est_labour_costs | est_labour_total_costs
10 | WorkA#WorkB | Person#Person | 2.0#2.0 | 300.0#300.0. | 600.0#600.0
11 | WorkC#WorkD | Person#Person | 3.0#2.0 | 300.0#300.0. | 900.0#600.0
My questions are twofold:
What would be a good way to transform the data to work on it more efficiently, e.g. create as many as new columns as there are entries in one cell. So e.g. separate it like e.g.
Column Name: solution_id | type labour_1 | labour_unit_1 | est_labour_quantity_1 | est_labour_costs_1 | est_labour_total_costs_1 | type labour_2 | labour_unit_2 | est_labour_quantity_2 | est_labour_costs_2 | est_labour_total_costs_2 10 | WorkA | Person. | 2.0. | 300.0. | 600.0. | WorkB | Person | 2.0 | 300.0 | 600.0 11 | WorkC | Person. | 3.0. | 300.0. | 900.0. | WorkD | Person | 2.0 | 300.0 | 600.0
This makes it more readable but it doubles the amount of columns and I have some cells with up to 5 entries, so it would be x5 more columns. What I also don't like so much about the idea is that the new column names are not really meaningful and it will be hard to interpret them.
How can I make this separation in pandas, so that I have WorkA and then the associated values, and then Work B etc...
If there is another better way to work with this tabular form, maybe bring it all in one cell? Please let me know!
CodePudding user response:
Use:
#unpivot by melt
df = df.melt('solution_id')
#create lists by split #
df['value'] = df['value'].str.split('#')
#repeat rows by value column
df = df.explode('value')
#counter for new columns names
df['g'] = df.groupby(['solution_id','variable']).cumcount().add(1)
#pivoting and sorting MultiIndex
df = (df.pivot('solution_id',['variable', 'g'], 'value')
.sort_index(level=1, axis=1, sort_remaining=False))
#flatten MultiIndex
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
type_labour_1 labour_unit_1 est_labour_quantity_1 \
solution_id
10 WorkA Person 2.0
11 WorkC Person 3.0
est_labour_costs_1 est_labour_total_costs_1 type_labour_2 \
solution_id
10 300.0 600.0 WorkB
11 300.0 900.0 WorkD
labour_unit_2 est_labour_quantity_2 est_labour_costs_2 \
solution_id
10 Person 2.0 300.0.
11 Person 2.0 300.0.
est_labour_total_costs_2
solution_id
10 600.0
11 600.0
CodePudding user response:
You can split your strings, explode and reshape:
df2 = (df
.set_index('solution_id')
.apply(lambda c: c.str.split('#'))
.explode(list(df.columns[1:]))
.assign(idx=lambda d: d.groupby(level=0).cumcount().add(1))
.set_index('idx', append=True)
.unstack('idx')
.sort_index(axis=1, level='idx', sort_remaining=False)
)
df2.columns = [f'{a}_{b}' for a,b in df2.columns]
output:
type labour_1 labour_unit_1 est_labour_quantity_1 est_labour_costs_1 est_labour_total_costs_1 type labour_2 labour_unit_2 est_labour_quantity_2 est_labour_costs_2 est_labour_total_costs_2
solution_id
10 WorkA Person 2.0 300.0 600.0 WorkB Person 2.0 300.0. 600.0
11 WorkC Person 3.0 300.0 900.0 WorkD Person 2.0 300.0. 600.0
Or, shorter code using the same initial split followed by slicing and concatenation:
df2=(df
.set_index('solution_id')
.apply(lambda c: c.str.split('#'))
)
pd.concat([df2.apply(lambda c: c.str[i]).add_suffix(f'_{i 1}')
for i in range(len(df2.iat[0,0]))], axis=1)