How to sort values according to their parameter name and create a table in Python?-CodePudding

I have a huge list of strings (similar to strs given below, but much larger). The time stamps are given for each column.
I'd like to efficiently convert it to a table format (numpy array or pandas dataframe or ...) according to the one below.

strs = ['time', 'stamp1', 'a', '1',    'b', '2',    'c', '3', 
        'time', 'stamp2', 'a', '11',   'b', '22',                'd', '4', 
        'time', 'stamp3', 'a', '111',  'b', '222',  'c', '333', 
        'time', 'stamp4', 'a', '1111', 'b', '2222', 'c', '3333', 'd', '444']

time	a	b	c	d
stamp1	1	2	3
stamp2	11	22		4
stamp3	111	222	333
stamp4	1111	2222	3333	444

CodePudding user response：

I would first process this list as a dictionary and then turn the dictionary into a pandas DataFrame. But first you'll need to fill in the missing values because the lists will need to be the same size to create the dataframe. I did this by splitting the strs list into sub lists based on the appearance of the time string. I then also got the column names by slicing over the list every other value starting at 0 and then passing it to a set to get only the unique values.

I then looped over the list of sublists that represent our rows and if there wasn't a value representative of the column in the list I added it and gave it a NaN value. Once the list of sublists had all the values for every column. I then looped over it and assigned the values to the dict to create the dataframe. Once the dictionary was created just passing it to from_dict will create the columns based on key, value pairs in the dictionary.

The only issue is the columns aren't in order so I reordered them.

strs = ['time', 'stamp1', 'a', '1',    'b', '2',    'c', '3', 
        'time', 'stamp2', 'a', '11',   'b', '22',                'd', '4', 
        'time', 'stamp3', 'a', '111',  'b', '222',  'c', '333', 
        'time', 'stamp4', 'a', '1111', 'b', '2222', 'c', '3333', 'd', '444']

# splitting the lists by the 'time' string since thats the start of a new row
col_names = set(strs[::2])
Lsub = []
L2 = []
for e in strs:
    if e == 'time':
        if Lsub: 
            L2.append(Lsub)
        Lsub = [e]
    else:
        Lsub.append(e)
L2.append(Lsub)

#fill in missing values
for sublist in L2:
    for col in col_names:
        if col not in sublist:
            sublist.extend([col, np.nan])


# create dictionary to assign values too
df_dict = {k: [] for k in col_names}

for x in L2:
    for i, y in enumerate(x):
        if i == 0:
            continue
        if x[i-1] in col_names:
            df_dict[x[i-1]].append(y)


df = pd.DataFrame.from_dict(df_dict)
cols = ['time']   sorted([x for x in col_names if x != 'time'])
df = df[cols]
print(df)

Which gives this output:

     time     a     b     c    d
0  stamp1     1     2     3  NaN
1  stamp2    11    22   NaN    4
2  stamp3   111   222   333  NaN
3  stamp4  1111  2222  3333  444

CodePudding user response：

You could do:

import pandas as pd

records = []
record = {strs[0]: strs[1]}
for key, value in zip(strs[2::2], strs[3::2]):
    if key == "time":
        records.append(record)
        record = {key: value}
    else:
        record[key] = value
else:
    records.append(record)
    
table = pd.DataFrame(records)

Result:

     time     a     b     c    d
0  stamp1     1     2     3  NaN
1  stamp2    11    22   NaN    4
2  stamp3   111   222   333  NaN
3  stamp4  1111  2222  3333  444