Dealing with multiple values in Pandas Dataframe Cell-CodePudding

Columns are the description of the data and the rows keep the values. However, in some columns there are multiple values (tabular form on website). Rows of those tabular get merged in one cell and are separated by hashtags. Since they are only part of the tabular they refer to other columns with values in cells also separated by hashtags.

Column Name: solution_id | type labour | labour_unit    | est_labour_quantity | est_labour_costs | est_labour_total_costs
             10          | WorkA#WorkB | Person#Person  | 2.0#2.0             | 300.0#300.0.     | 600.0#600.0     
             11          | WorkC#WorkD | Person#Person  | 3.0#2.0             | 300.0#300.0.     | 900.0#600.0

My questions are twofold:

What would be a good way to transform the data to work on it more efficiently, e.g. create as many as new columns as there are entries in one cell. So e.g. separate it like e.g.

Column Name: solution_id | type labour_1 | labour_unit_1    | est_labour_quantity_1 | est_labour_costs_1 | est_labour_total_costs_1 | type labour_2 | labour_unit_2    | est_labour_quantity_2 | est_labour_costs_2 | est_labour_total_costs_2
             10          | WorkA         | Person.          | 2.0.                  | 300.0.             | 600.0.                   | WorkB         | Person           | 2.0                   | 300.0              | 600.0     
             11          | WorkC         | Person.          | 3.0.                  | 300.0.             | 900.0.                   | WorkD         | Person           | 2.0                   | 300.0              | 600.0

This makes it more readable but it doubles the amount of columns and I have some cells with up to 5 entries, so it would be x5 more columns. What I also don't like so much about the idea is that the new column names are not really meaningful and it will be hard to interpret them.

How can I make this separation in pandas, so that I have WorkA and then the associated values, and then Work B etc...

If there is another better way to work with this tabular form, maybe bring it all in one cell? Please let me know!

CodePudding user response：

Use:

#unpivot by melt
df = df.melt('solution_id')
#create lists by split #
df['value'] = df['value'].str.split('#')
#repeat rows by value column
df = df.explode('value')
#counter for new columns names
df['g'] = df.groupby(['solution_id','variable']).cumcount().add(1)

#pivoting and sorting MultiIndex
df = (df.pivot('solution_id',['variable', 'g'], 'value')
         .sort_index(level=1, axis=1, sort_remaining=False))

#flatten MultiIndex
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')

print (df)
            type_labour_1 labour_unit_1 est_labour_quantity_1  \
solution_id                                                     
10                  WorkA        Person                   2.0   
11                  WorkC        Person                   3.0   

            est_labour_costs_1 est_labour_total_costs_1 type_labour_2  \
solution_id                                                             
10                       300.0                    600.0         WorkB   
11                       300.0                    900.0         WorkD   

            labour_unit_2 est_labour_quantity_2 est_labour_costs_2  \
solution_id                                                          
10                 Person                   2.0             300.0.   
11                 Person                   2.0             300.0.   

            est_labour_total_costs_2  
solution_id                           
10                             600.0  
11                             600.0

CodePudding user response：

You can split your strings, explode and reshape:

df2 = (df
 .set_index('solution_id')
 .apply(lambda c: c.str.split('#'))
 .explode(list(df.columns[1:]))
 .assign(idx=lambda d: d.groupby(level=0).cumcount().add(1))
 .set_index('idx', append=True)
 .unstack('idx')
 .sort_index(axis=1, level='idx', sort_remaining=False)
)

df2.columns = [f'{a}_{b}' for a,b in df2.columns]

output:

            type labour_1 labour_unit_1 est_labour_quantity_1 est_labour_costs_1 est_labour_total_costs_1 type labour_2 labour_unit_2 est_labour_quantity_2 est_labour_costs_2 est_labour_total_costs_2
solution_id                                                                                                                                                                                            
10                  WorkA        Person                   2.0              300.0                    600.0         WorkB        Person                   2.0             300.0.                    600.0
11                  WorkC        Person                   3.0              300.0                    900.0         WorkD        Person                   2.0             300.0.                    600.0

Or, shorter code using the same initial split followed by slicing and concatenation:

df2=(df
 .set_index('solution_id')
 .apply(lambda c: c.str.split('#'))
)
pd.concat([df2.apply(lambda c: c.str[i]).add_suffix(f'_{i 1}')
           for i in range(len(df2.iat[0,0]))], axis=1)