I have a nested list (I called it a dictionary, but it has no key-value pairs). I am trying to split it into a dataframe with separate columns, and I do not need to preserve the original structure. The intention is to turn each visible row into an actual row in a dataframe with the columns named word, start_time, and end_time. I have tried to flatten it with flatdict, but since there are no named keys it does not work.
Here is an example of the nested structure, stored in the variable word_timestamps:
[[['hello', 3.06, 3.32]],
[['hi', 4.2, 4.32],
['can', 4.54, 4.62],
['i', 4.66, 4.7],
['please', 4.74, 4.86],
['speak', 4.9, 5.04],
['to', 5.06, 5.14],
['ashley', 5.2, 5.56]],
[['yeah', 6.84, 6.94],
['may', 7.04, 7.12],
['i', 7.12, 7.12],
['ask', 7.18, 7.28],
["who's", 7.36, 7.46],
['calling', 7.54, 7.86]]]
I can view individual "rows" of this successfully using word_timestamps[0], which returns:
[['hello', 3.06, 3.32]]
Or I can access a single word using word_timestamps[0][0][0], which returns 'hello'.
How do I flatten this structure and get rid of the nesting entirely?
Edit: added everything below.
I used [value for sublist in word_timestamps for value in sublist], which returned the same result as the answer below. The full code used was:
df_word_timestamps = pd.DataFrame([value for sublist in word_timestamps for value in sublist], columns=["word", "from", "to"])
Which results in:
         word    from      to
0       hello    3.06    3.32
1          hi    4.20    4.32
2         can    4.54    4.62
3           i    4.66    4.70
4      please    4.74    4.86
...       ...     ...     ...
1179    right  399.98  400.08
1180  bye-bye  400.64  400.86
1181   thanks  401.70  401.92
1182      bye  402.02  402.16
1183      bye  402.88  403.04
The reason I did this was so I could join a second dataframe on the matching start/stop times. The second dataframe contains the person that spoke these words. Together I can create a labeled transcript.
Answer:
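You're basically "adding" the sublists together. Calling sum() with an empty list as the start value concatenates them, which removes one level of nesting.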
word_timestamps = [[['hello', 3.06, 3.32]],
[['hi', 4.2, 4.32],
['can', 4.54, 4.62],
['i', 4.66, 4.7],
['please', 4.74, 4.86],
['speak', 4.9, 5.04],
['to', 5.06, 5.14],
['ashley', 5.2, 5.56]],
[['yeah', 6.84, 6.94],
['may', 7.04, 7.12],
['i', 7.12, 7.12],
['ask', 7.18, 7.28],
["who's", 7.36, 7.46],
['calling', 7.54, 7.86]]]
# Start from an empty list and concatenate each sublist onto it
combine = sum(word_timestamps, [])
print(combine)
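As a possible alternative, here is a minimal sketch using itertools.chain.from_iterable from the standard library. It removes the same single level of nesting as sum(word_timestamps, []), but without building a new intermediate list at every addition, which may matter if there are many sublists. The data is a shortened copy of the sample from the question, and the variable name rows is only illustrative.

```python
from itertools import chain

# Shortened copy of the sample data from the question
word_timestamps = [
    [['hello', 3.06, 3.32]],
    [['hi', 4.2, 4.32], ['can', 4.54, 4.62]],
    [['yeah', 6.84, 6.94]],
]

# chain.from_iterable yields the items of each sublist in turn,
# so one level of nesting disappears
rows = list(chain.from_iterable(word_timestamps))
print(rows)
```

The resulting rows list has one [word, start, end] entry per word, so it can be passed to pd.DataFrame with a columns argument in the same way as the list comprehension in the question.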