This data seems designed to be analytics-unfriendly.
I would like to convert a dataset in the form of:
pd.Series(["key1=value1 key2=value2 key3=value3", "key1=value11 key2=value22 key3=value33", "key1=value111,key2=value222,key3=value333"])
#0 key1=value1 key2=value2 key3=value3
#1 key1=value11 key2=value22 key3=value33
#2 key1=value111,key2=value222,key3=value333
#dtype: object
With the expected output:
pd.DataFrame.from_dict({"key1":["value1", "value11", "value111"], "key2":["value2", "value22", "value222"], "key3":["value3", "value33", "value333"]})
# key1 key2 key3
#0 value1 value2 value3
#1 value11 value22 value33
#2 value111 value222 value333
The challenge, of course, is that both the variable names and the values have to be parsed from the strings. I would also like to keep the index unchanged.
CodePudding user response:
You could do the entire transformation in plain Python, which should be faster and easier. Given an input Series s:
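import re
import pandas as pd
# Split each record on whitespace or commas, split each pair on "=", and reuse the original index
pd.DataFrame([dict(e.split('=') for e in re.split(r"[\s,]", ent)) for ent in s], index=s.index)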
key1 key2 key3
0 value1 value2 value3
1 value11 value22 value33
2 value111 value222 value333
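The one-liner above can be unpacked into explicit steps. What follows is a minimal sketch rather than the answerer's code: the helper name parse_kv_series is hypothetical, and it assumes every token contains at least one "=". It passes index=s.index so that a non-default index is carried over to the result.

```python
import re

import pandas as pd


def parse_kv_series(s: pd.Series) -> pd.DataFrame:
    """Parse 'key=value' pairs separated by whitespace or commas into columns."""
    rows = []
    for record in s:
        # Split on runs of whitespace/commas and drop any empty tokens.
        tokens = [t for t in re.split(r"[\s,]+", record) if t]
        # Split each token on the first "=" only, so a value may itself contain "=".
        rows.append(dict(t.split("=", 1) for t in tokens))
    # Reuse the input index so the row labels stay unchanged.
    return pd.DataFrame(rows, index=s.index)


# Example with a non-default index to show that the labels are kept.
s = pd.Series(
    [
        "key1=value1 key2=value2 key3=value3",
        "key1=value11 key2=value22 key3=value33",
        "key1=value111,key2=value222,key3=value333",
    ],
    index=["a", "b", "c"],
)
df = parse_kv_series(s)
print(df)
```

Building one dict per row also means a record with a missing key would simply produce NaN in that column instead of raising.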
CodePudding user response:
Convert the Series to a dictionary, using a defaultdict(list) to collect all the values for each key. The data uses two delimiters between key/value pairs: a space in some records and a comma in others.
from collections import defaultdict
import pandas as pd
data = pd.Series(["key1=value1 key2=value2 key3=value3", "key1=value11 key2=value22 key3=value33", "key1=value111,key2=value222,key3=value333"])
data_dict = defaultdict(list)
for record in data:
    # A record is delimited either by commas or by whitespace
    if ',' in record:
        my_list = record.split(',')
    else:
        my_list = record.split()
    for item in my_list:
        # Each item looks like "key=value"
        key, value = item.split("=")
        data_dict[key].append(value)
#a. convert the dictionary to a dataframe
data_df=pd.DataFrame(data_dict.items())
print(data_df)
Output:
0 1
0 key1 [value1, value11, value111]
1 key2 [value2, value22, value222]
2 key3 [value3, value33, value333]
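The frame above has one row per key, with the values held in lists, which differs from the layout the question asks for. As a hedged sketch (not part of the original answer), passing the dictionary itself to pd.DataFrame, rather than data_dict.items(), gives one column per key. The index=data.index argument is added on the assumption that the original row labels should be kept, and every record is assumed to contain the same keys so that all lists have equal length.

```python
from collections import defaultdict

import pandas as pd

data = pd.Series(
    [
        "key1=value1 key2=value2 key3=value3",
        "key1=value11 key2=value22 key3=value33",
        "key1=value111,key2=value222,key3=value333",
    ]
)

# Collect the values for each key, in record order.
data_dict = defaultdict(list)
for record in data:
    # A record is delimited either by commas or by whitespace.
    items = record.split(",") if "," in record else record.split()
    for item in items:
        key, value = item.split("=")
        data_dict[key].append(value)

# Passing the dict directly makes each key a column; reuse the Series index.
wide_df = pd.DataFrame(data_dict, index=data.index)
print(wide_df)
```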