How to expand a set of columns using the first column's values as headers for the other columns?
For example:
x = pd.DataFrame({'id':[11,998,3923], 'count':[7,7,7],
'attributes':['VIS,TEMP,MIN','MIN,VIS,TEMP','MIN,VIS'],
'attribute_values':['0,4,2','2,3,0','0,9'],
'attribute_years':['2000,2001,2002','2001,2002,2003','2008,2009']})
(Edit: note that attributes could be out of order or missing.)
index | id | count | attributes | attribute_values | attribute_years |
---|---|---|---|---|---|
0 | 11 | 7 | VIS,TEMP,MIN | 0,4,2 | 2000,2001,2002 |
1 | 998 | 7 | MIN,VIS,TEMP | 2,3,0 | 2001,2002,2003 |
2 | 3923 | 7 | MIN,VIS | 0,9 | 2008,2009 |
In this case, the values in the attributes column should be used to name new columns built from the attribute_values and attribute_years columns.
Ideal output:
index | id | count | attribute_values_VIS | attribute_values_TEMP | attribute_values_MIN | attribute_years_VIS | attribute_years_TEMP | attribute_years_MIN |
---|---|---|---|---|---|---|---|---|
0 | 11 | 7 | 0 | 4 | 2 | 2000 | 2001 | 2002 |
1 | 998 | 7 | 3 | 0 | 2 | 2002 | 2003 | 2001 |
2 | 3923 | 7 | 9 | NaN | 0 | 2009 | NaN | 2008 |
CodePudding user response:
Solution
Split the strings in the attribute-like columns on the ',' delimiter to convert them into lists, then explode to turn the lists into individual rows, then pivot with columns='attributes' to reshape, and finally flatten the MultiIndex columns by mapping '_'.join over them:
y = x.set_index(['id', 'count'])
y = y.apply(lambda s: s.str.split(',')).explode([*y])
y = y.pivot(columns='attributes')
y.columns = y.columns.map('_'.join)
y = y.reset_index()
>>> y
id count attribute_values_MIN attribute_values_TEMP attribute_values_VIS attribute_years_MIN attribute_years_TEMP attribute_years_VIS
0 11 7 2 4 0 2002 2001 2000
1 998 7 2 0 3 2001 2003 2002
2 3923 7 0 NaN 9 2008 NaN 2009
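One detail the steps above leave open is that the pieces produced by str.split are still strings. The following is a minimal sketch of the same split, explode, pivot pipeline with an extra numeric conversion before the pivot. The intermediate names long_form and wide are made up for illustration, and the multi-column explode assumes pandas 1.3 or later.

```python
import pandas as pd

x = pd.DataFrame({
    'id': [11, 998, 3923],
    'count': [7, 7, 7],
    'attributes': ['VIS,TEMP,MIN', 'MIN,VIS,TEMP', 'MIN,VIS'],
    'attribute_values': ['0,4,2', '2,3,0', '0,9'],
    'attribute_years': ['2000,2001,2002', '2001,2002,2003', '2008,2009'],
})

# Step 1: split every string column into lists and explode them together,
# giving one row per (id, attribute) pair.
long_form = x.set_index(['id', 'count'])
long_form = long_form.apply(lambda s: s.str.split(',')).explode(list(long_form.columns))

# Step 2 (an addition to the answer): the exploded pieces are strings,
# so cast the two data columns to numbers before reshaping.
long_form = long_form.astype({'attribute_values': int, 'attribute_years': int})

# Step 3: pivot so that each attribute becomes a column, then flatten the
# two-level column labels into names such as 'attribute_values_VIS'.
wide = long_form.reset_index().pivot(index=['id', 'count'], columns='attributes')
wide.columns = wide.columns.map('_'.join)
wide = wide.reset_index()

print(wide)
```

Attributes missing for a row (TEMP for id 3923 here) come out as NaN, so the affected columns cannot stay plain integers after the pivot.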
CodePudding user response:
I don't think pandas has a native function that splits strings and creates columns named after another column's values in one step. However, you can easily write a small function of your own. Taking your example (note that you forgot to quote the dictionary keys):
import pandas as pd
x = pd.DataFrame({'id': [1,2,3],
'attributes': ['VIS,TEMP,MIN','MIN,VIS,TEMP','MIN,VIS'],
'attribute_values': ['0,4,2','2,3,0','0,9'],
'attribute_years': ['2000,2001,2002','2001,2002,2003','2008,2009']})
You can loop over the rows, split the strings in the columns x['attributes'], x['attribute_values'], and x['attribute_years'] (assuming that the lists within a row always have the same length and that ',' is the separator), and build a new dictionary for each row. After collecting all dictionaries with the new key-value pairs, you can build a new pandas.DataFrame from them and join it to the original if you like:
data = []
for i, row in x.iterrows():
    # extract data
    att = row['attributes'].split(',')
    val = list(map(int, row['attribute_values'].split(',')))
    yrs = list(map(int, row['attribute_years'].split(',')))
    # create new dictionaries
    row_new = {f'attribute_values_{a}': v for a, v in zip(att, val)}
    row_new.update({f'attribute_years_{a}': y for a, y in zip(att, yrs)})
    # concatenate dictionaries and append to list
    data.append(row_new)
# create table from list of rows
pd.DataFrame(data, index=x.index)
output:
index | attribute_values_VIS | attribute_values_TEMP | attribute_values_MIN | attribute_years_VIS | attribute_years_TEMP | attribute_years_MIN |
---|---|---|---|---|---|---|
0 | 0 | 4.0 | 2 | 2000 | 2001.0 | 2002 |
1 | 3 | 0.0 | 2 | 2002 | 2003.0 | 2001 |
2 | 9 | NaN | 0 | 2009 | NaN | 2008 |
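To attach the expanded columns to the original table, the loop can be wrapped in a small function. This is only a sketch: the name expand_attributes and its parameters are hypothetical, it uses the frame from the question (including the count column), and it assumes every value can be parsed as an integer.

```python
import pandas as pd

def expand_attributes(df, key='attributes',
                      value_cols=('attribute_values', 'attribute_years'), sep=','):
    """Hypothetical helper: one new column per (value column, attribute) pair."""
    rows = []
    for _, row in df.iterrows():
        names = row[key].split(sep)
        new_row = {}
        for col in value_cols:
            parts = row[col].split(sep)
            # zip pairs each attribute name with the value at the same position
            new_row.update({f'{col}_{name}': int(part)
                            for name, part in zip(names, parts)})
        rows.append(new_row)
    expanded = pd.DataFrame(rows, index=df.index)
    # drop the comma-separated source columns and attach the expanded ones
    return df.drop(columns=[key, *value_cols]).join(expanded)

x = pd.DataFrame({
    'id': [11, 998, 3923],
    'count': [7, 7, 7],
    'attributes': ['VIS,TEMP,MIN', 'MIN,VIS,TEMP', 'MIN,VIS'],
    'attribute_values': ['0,4,2', '2,3,0', '0,9'],
    'attribute_years': ['2000,2001,2002', '2001,2002,2003', '2008,2009'],
})

result = expand_attributes(x)
print(result)
```

Because the new columns are keyed by attribute name rather than by position, rows whose attributes are out of order or missing still line up, with NaN where an attribute is absent.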