I have a huge list of strings (similar to strs given below, but much larger). The time stamps are given for each column.
I'd like to efficiently convert it to a table format (numpy array or pandas dataframe or ...) according to the one below.
strs = ['time', 'stamp1', 'a', '1', 'b', '2', 'c', '3',
'time', 'stamp2', 'a', '11', 'b', '22', 'd', '4',
'time', 'stamp3', 'a', '111', 'b', '222', 'c', '333',
'time', 'stamp4', 'a', '1111', 'b', '2222', 'c', '3333', 'd', '444']
time | a | b | c | d |
---|---|---|---|---|
stamp1 | 1 | 2 | 3 | |
stamp2 | 11 | 22 | 4 | |
stamp3 | 111 | 222 | 333 | |
stamp4 | 1111 | 2222 | 3333 | 444 |
CodePudding user response:
I would first process this list as a dictionary and then turn the dictionary into a pandas
DataFrame. But first you'll need to fill in the missing values because the lists will need to be the same size to create the dataframe. I did this by splitting the strs
list into sub lists based on the appearance of the time
string. I then also got the column names by slicing over the list every other value starting at 0 and then passing it to a set
to get only the unique values.
I then looped over the list of sublists that represent our rows and if there wasn't a value representative of the column in the list I added it and gave it a NaN
value. Once the list of sublists had all the values for every column. I then looped over it and assigned the values to the dict
to create the dataframe. Once the dictionary was created just passing it to from_dict
will create the columns based on key, value pairs in the dictionary.
The only issue is the columns aren't in order so I reordered them.
strs = ['time', 'stamp1', 'a', '1', 'b', '2', 'c', '3',
'time', 'stamp2', 'a', '11', 'b', '22', 'd', '4',
'time', 'stamp3', 'a', '111', 'b', '222', 'c', '333',
'time', 'stamp4', 'a', '1111', 'b', '2222', 'c', '3333', 'd', '444']
# splitting the lists by the 'time' string since thats the start of a new row
col_names = set(strs[::2])
Lsub = []
L2 = []
for e in strs:
if e == 'time':
if Lsub:
L2.append(Lsub)
Lsub = [e]
else:
Lsub.append(e)
L2.append(Lsub)
#fill in missing values
for sublist in L2:
for col in col_names:
if col not in sublist:
sublist.extend([col, np.nan])
# create dictionary to assign values too
df_dict = {k: [] for k in col_names}
for x in L2:
for i, y in enumerate(x):
if i == 0:
continue
if x[i-1] in col_names:
df_dict[x[i-1]].append(y)
df = pd.DataFrame.from_dict(df_dict)
cols = ['time'] sorted([x for x in col_names if x != 'time'])
df = df[cols]
print(df)
Which gives this output:
time a b c d
0 stamp1 1 2 3 NaN
1 stamp2 11 22 NaN 4
2 stamp3 111 222 333 NaN
3 stamp4 1111 2222 3333 444
CodePudding user response:
You could do:
import pandas as pd
records = []
record = {strs[0]: strs[1]}
for key, value in zip(strs[2::2], strs[3::2]):
if key == "time":
records.append(record)
record = {key: value}
else:
record[key] = value
else:
records.append(record)
table = pd.DataFrame(records)
Result:
time a b c d
0 stamp1 1 2 3 NaN
1 stamp2 11 22 NaN 4
2 stamp3 111 222 333 NaN
3 stamp4 1111 2222 3333 444