Pandas pivoting/stacking/reshaping from string in rows

This data seems designed to be analytics-unfriendly.

I would like to convert a dataset in the form of:

pd.Series(["key1=value1 key2=value2 key3=value3", "key1=value11 key2=value22 key3=value33", "key1=value111,key2=value222,key3=value333"])

#0          key1=value1 key2=value2 key3=value3
#1       key1=value11 key2=value22 key3=value33
#2    key1=value111,key2=value222,key3=value333
#dtype: object

With the expected output:

pd.DataFrame.from_dict({"key1":["value1", "value11", "value111"], "key2":["value2", "value22", "value222"], "key3":["value3", "value33", "value333"]})

#       key1      key2      key3
#0    value1    value2    value3
#1   value11   value22   value33
#2  value111  value222  value333

The challenge, of course, is that both the variable names and the values have to be parsed from the string. I would also like to keep the index unchanged.

CodePudding user response：

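You could do the entire transformation in plain Python, which should be faster and easier than reshaping inside pandas. Given an input Series s (passing index=s.index keeps the original index, as the question asks):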

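import re
import pandas as pd

# split each row on whitespace or commas, then split each pair on '='
pd.DataFrame([dict(e.split('=') for e in re.split(r"[\s,]", ent)) for ent in s], index=s.index)
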
       key1      key2      key3
0    value1    value2    value3
1   value11   value22   value33
2  value111  value222  value333
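
For comparison, a pandas-only route is also possible. The sketch below is a different technique from the comprehension above: it uses Series.str.extractall with a regular expression containing named groups, then unstacks the keys into columns. It assumes keys and values contain no whitespace, commas, or "=", and that no key repeats within a row (unstack raises on duplicate entries). The string index labels are made up purely to show that the index survives.

```python
import pandas as pd

# Same strings as in the question; the custom index is an illustrative assumption.
s = pd.Series(
    [
        "key1=value1 key2=value2 key3=value3",
        "key1=value11 key2=value22 key3=value33",
        "key1=value111,key2=value222,key3=value333",
    ],
    index=["a", "b", "c"],
)

# One output row per key=value match, indexed by (original index, match number).
pairs = s.str.extractall(r"(?P<key>[^\s,=]+)=(?P<value>[^\s,=]+)")

# Drop the match counter, then move the keys from rows to columns.
wide = (
    pairs.droplevel("match")
    .set_index("key", append=True)["value"]
    .unstack("key")
)
wide.columns.name = None  # remove the leftover "key" label on the columns

print(wide)
```

Whether this beats the plain-Python version on speed depends on the data; its main appeal is that it stays inside pandas and keeps the index without extra bookkeeping.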

CodePudding user response：

Convert the Series to a dictionary, using a defaultdict(list) to collect all the values for each key. The data uses two delimiters between key/value pairs: a space in some records and a comma in others.

from collections import defaultdict

import pandas as pd

data = pd.Series(["key1=value1 key2=value2 key3=value3", "key1=value11 key2=value22 key3=value33", "key1=value111,key2=value222,key3=value333"])

data_dict = defaultdict(list)
for record in data:
    my_lista = []
    my_listb = []
    # a record separates its key=value pairs with either commas or spaces
    if ',' in record:
        my_listb = record.split(',')
    else:
        my_lista = record.split()
    my_list = my_lista + my_listb
    for item in my_list:
        key, value = item.split("=")
        data_dict[key].append(value)

#a. convert the dictionary to a dataframe
data_df=pd.DataFrame(data_dict.items())
print(data_df)

Output:

      0                            1
0  key1  [value1, value11, value111]
1  key2  [value2, value22, value222]
2  key3  [value3, value33, value333]
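
The frame above has one row per key with a list of values, which is not yet the wide layout the question asks for. A minimal sketch of one way to get there, assuming every record contains every key exactly once so that all the lists have the same length: pass the dictionary itself to pd.DataFrame instead of its items(), and reuse the Series index. The re.split call here is a swapped-in shortcut for the if/else on delimiters, not part of the answer as posted.

```python
import re
from collections import defaultdict

import pandas as pd

data = pd.Series([
    "key1=value1 key2=value2 key3=value3",
    "key1=value11 key2=value22 key3=value33",
    "key1=value111,key2=value222,key3=value333",
])

data_dict = defaultdict(list)
for record in data:
    # split on runs of whitespace or commas, whichever the record uses
    for item in re.split(r"[\s,]+", record.strip()):
        key, value = item.split("=", 1)
        data_dict[key].append(value)

# each dict key becomes a column; index=data.index keeps the original index
data_df = pd.DataFrame(data_dict, index=data.index)
print(data_df)
```

If some records were missing a key, the lists would end up with different lengths and the constructor would raise, so building one dict per record (as in the first answer) is the safer choice for ragged data.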