Pandas: Three listed columns to wide format

How can I expand a set of comma-separated list columns into wide format, using the first column's values as headers for the other columns?

For example:

x = pd.DataFrame({'id':[11,998,3923], 'count':[7,7,7],
  'attributes':['VIS,TEMP,MIN','MIN,VIS,TEMP','MIN,VIS'],
  'attribute_values':['0,4,2','2,3,0','0,9'],
  'attribute_years':['2000,2001,2002','2001,2002,2003','2008,2009']})

(Edit: note that attributes could be out of order or missing.)

index	id	count	attributes	attribute_values	attribute_years
0	11	7	VIS,TEMP,MIN	0,4,2	2000,2001,2002
1	998	7	MIN,VIS,TEMP	2,3,0	2001,2002,2003
2	3923	7	MIN,VIS	0,9	2008,2009

In this case, the values in the attributes column should be used to name new columns that hold the corresponding entries from the attribute_values and attribute_years columns.

Ideal output:

index	id	count	attribute_values_VIS	attribute_values_TEMP	attribute_values_MIN	attribute_years_VIS	attribute_years_TEMP	attribute_years_MIN
0	11	7	0	4	2	2000	2001	2002
1	998	7	3	0	2	2002	2003	2001
2	3923	7	9	NaN	0	2009	NaN	2008

CodePudding user response：

Solution

Split the strings in the attribute-like columns on the delimiter , to turn them into lists, then explode the lists into individual rows. Next, pivot with columns=attributes to reshape, and finally flatten the resulting MultiIndex columns by mapping '_'.join over them.

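# keep the identifier columns out of the way while reshaping
y = x.set_index(['id', 'count'])
# split every remaining column on ',' and explode them together (needs pandas >= 1.3)
y = y.apply(lambda s: s.str.split(',')).explode([*y])
# one column per (original column, attribute) pair
y = y.pivot(columns='attributes')
# flatten ('attribute_values', 'VIS') into 'attribute_values_VIS'
y.columns = y.columns.map('_'.join)
y = y.reset_index()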

>>> y
     id  count attribute_values_MIN attribute_values_TEMP attribute_values_VIS attribute_years_MIN attribute_years_TEMP attribute_years_VIS
0    11      7                    2                     4                    0                2002                 2001                2000
1   998      7                    2                     0                    3                2001                 2003                2002
2  3923      7                    0                   NaN                    9                2008                  NaN                2009
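
Note that str.split produces strings, so the cells in the result above are still text (object dtype) rather than numbers, and the columns come out in alphabetical order. Here is a minimal sketch of one way to handle both, assuming numeric columns are wanted and that the VIS, TEMP, MIN order from the question's ideal output is the desired one. The `order` list is an illustrative assumption, not part of the original answer:

```python
import pandas as pd

x = pd.DataFrame({
    'id': [11, 998, 3923],
    'count': [7, 7, 7],
    'attributes': ['VIS,TEMP,MIN', 'MIN,VIS,TEMP', 'MIN,VIS'],
    'attribute_values': ['0,4,2', '2,3,0', '0,9'],
    'attribute_years': ['2000,2001,2002', '2001,2002,2003', '2008,2009'],
})

# Same reshaping steps as in the answer (multi-column explode needs pandas >= 1.3).
y = x.set_index(['id', 'count'])
y = y.apply(lambda s: s.str.split(',')).explode([*y])
y = y.pivot(columns='attributes')
y.columns = y.columns.map('_'.join)

# Convert the string cells to numbers before restoring id/count as columns.
# Columns that contain NaN end up as float, the others as int.
y = y.apply(pd.to_numeric).reset_index()

# Assumed column order, copied from the question's ideal output.
order = ['VIS', 'TEMP', 'MIN']
cols = ['id', 'count'] + [
    f'{prefix}_{attr}'
    for prefix in ('attribute_values', 'attribute_years')
    for attr in order
]
y = y[cols]
print(y)
```

If some attribute never occurs in the data, the corresponding name in `cols` would be missing and the final selection would raise a KeyError, so the list may need adjusting for real data.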

CodePudding user response：

As far as I know, pandas has no single built-in function that splits strings and creates new columns from them. However, you can easily write a small routine yourself. Take your example (note that the dictionary keys need to be quoted as strings):

import pandas as pd

x = pd.DataFrame({'id': [1,2,3],
'attributes': ['VIS,TEMP,MIN','MIN,VIS,TEMP','MIN,VIS'],
'attribute_values': ['0,4,2','2,3,0','0,9'],
'attribute_years': ['2000,2001,2002','2001,2002,2003','2008,2009']})

You can just loop over the rows, split the strings in the columns x['attributes'], x['attribute_values'], and x['attribute_years'], and build a new dictionary for each row. This assumes that the three lists in a row always have the same length and that ',' is the separator. After collecting the dictionaries with the new key-value pairs, you can build a new pandas.DataFrame from them and join it to the original if you like:

data = []
for i, row in x.iterrows():
    # extract data
    att = row['attributes'].split(',')
    val = list(map(int, row['attribute_values'].split(',')))
    yrs = list(map(int, row['attribute_years'].split(',')))
    # create new dictionaries
    row_new = {f'attribute_values_{a}': v for a, v in zip(att, val)}
    row_new.update({f'attribute_years_{a}': y for a, y in zip(att, yrs)})
    # concatenate dictionaries and append to list
    data.append(row_new)
# create table from list of rows
pd.DataFrame(data, index=x.index)

output:

index	attribute_values_VIS	attribute_values_TEMP	attribute_values_MIN	attribute_years_VIS	attribute_years_TEMP	attribute_years_MIN
0	0	4.0	2	2000	2001.0	2002
1	3	0.0	2	2002	2003.0	2001
2	9	NaN	0	2009	NaN	2008
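
To attach the new columns to the original identifiers, as the explanation above suggests, something like the following should work. It is only a sketch under the same assumptions as the loop above. The names `wide` and `result` and the explicit length check are additions for illustration:

```python
import pandas as pd

x = pd.DataFrame({
    'id': [1, 2, 3],
    'attributes': ['VIS,TEMP,MIN', 'MIN,VIS,TEMP', 'MIN,VIS'],
    'attribute_values': ['0,4,2', '2,3,0', '0,9'],
    'attribute_years': ['2000,2001,2002', '2001,2002,2003', '2008,2009'],
})

data = []
for idx, row in x.iterrows():
    att = row['attributes'].split(',')
    val = row['attribute_values'].split(',')
    yrs = row['attribute_years'].split(',')
    # Added guard: zip() would silently drop entries if the lengths differed.
    if not (len(att) == len(val) == len(yrs)):
        raise ValueError(f'row {idx}: list lengths differ')
    row_new = {f'attribute_values_{a}': int(v) for a, v in zip(att, val)}
    row_new.update({f'attribute_years_{a}': int(y) for a, y in zip(att, yrs)})
    data.append(row_new)

# Wide table aligned with the original index, then joined back onto 'id'.
wide = pd.DataFrame(data, index=x.index)
result = x[['id']].join(wide)
print(result)
```

Because `wide` reuses x.index, the join lines the rows up by position in the original frame, so it does not matter that the attributes appear in a different order in each row.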