Python Split Nested Dictionary with No Key Value Pairs

Time:11-17

I have a nested dictionary with no key-value pairs (strictly speaking, it is a list of lists). I want to split it into a dataframe with one row per visible row and three columns named word, start_time, and end_time; the original structure does not need to be preserved. I tried flattening it with flatdict, but that does not work because there are no named keys.

Here is an example of the nested dictionary stored in the variable word_timestamps.

[[['hello', 3.06, 3.32]],
 [['hi', 4.2, 4.32],
  ['can', 4.54, 4.62],
  ['i', 4.66, 4.7],
  ['please', 4.74, 4.86],
  ['speak', 4.9, 5.04],
  ['to', 5.06, 5.14],
  ['ashley', 5.2, 5.56]],
 [['yeah', 6.84, 6.94],
  ['may', 7.04, 7.12],
  ['i', 7.12, 7.12],
  ['ask', 7.18, 7.28],
  ["who's", 7.36, 7.46],
  ['calling', 7.54, 7.86]]]

I can view individual "rows" of this successfully using this format word_timestamps[0]. This returns:

[['hello', 3.06, 3.32]]

Or I can access a single word using word_timestamps[0][0][0] which returns 'hello'.

How do I flatten the dictionary and get rid of the entire structure?

Edit: added everything below.

I used [value for sublist in word_timestamps for value in sublist] which returned the same answer as below. The full code used was:

import pandas as pd
df_word_timestamps = pd.DataFrame([value for sublist in word_timestamps for value in sublist], columns=["word", "from", "to"])

Which results in:

         word    from      to
0       hello    3.06    3.32
1          hi    4.20    4.32
2         can    4.54    4.62
3           i    4.66    4.70
4      please    4.74    4.86
...       ...     ...     ...
1179    right  399.98  400.08
1180  bye-bye  400.64  400.86
1181   thanks  401.70  401.92
1182      bye  402.02  402.16
1183      bye  402.88  403.04

The reason I did this was so I could join a second dataframe on the matching start/stop times. The second dataframe contains the person that spoke these words. Together I can create a labeled transcript.
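
For what it's worth, here is a rough sketch of how that join might look. The second dataframe is not shown in the question, so the `speakers` table below, its column names (`start`, `end`, `speaker`), and its values are all made up for illustration. It assumes each speaker row covers a time segment and that a word belongs to the segment that started most recently before the word began, which `pd.merge_asof` can look up.

```python
import pandas as pd

# Shortened copy of the nested list from the question.
word_timestamps = [
    [["hello", 3.06, 3.32]],
    [["hi", 4.2, 4.32], ["can", 4.54, 4.62]],
    [["yeah", 6.84, 6.94]],
]

# Flatten one level and build the word-level dataframe.
words = pd.DataFrame(
    [row for group in word_timestamps for row in group],
    columns=["word", "from", "to"],
)

# Hypothetical speaker segments; the real second dataframe may look different.
speakers = pd.DataFrame({
    "start": [3.0, 4.0, 6.5],
    "end": [3.5, 5.6, 7.9],
    "speaker": ["A", "B", "A"],
})

# For each word, take the last segment whose start is <= the word's start.
# merge_asof requires both frames to be sorted on their join keys.
labeled = pd.merge_asof(
    words.sort_values("from"),
    speakers.sort_values("start"),
    left_on="from",
    right_on="start",
    direction="backward",
)

print(labeled[["speaker", "word", "from", "to"]])
```

If the start/stop times in both dataframes match exactly, a plain `words.merge(...)` on those columns would probably be enough; the nearest-match lookup is only needed when the speaker rows span several words.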

CodePudding user response:

You're basically "adding" the sublists together: calling sum() with an empty list as the start value concatenates them into one flat list.

word_timestamps = [[['hello', 3.06, 3.32]],
 [['hi', 4.2, 4.32],
  ['can', 4.54, 4.62],
  ['i', 4.66, 4.7],
  ['please', 4.74, 4.86],
  ['speak', 4.9, 5.04],
  ['to', 5.06, 5.14],
  ['ashley', 5.2, 5.56]],
 [['yeah', 6.84, 6.94],
  ['may', 7.04, 7.12],
  ['i', 7.12, 7.12],
  ['ask', 7.18, 7.28],
  ["who's", 7.36, 7.46],
  ['calling', 7.54, 7.86]]]

combine = sum(word_timestamps, [])
print(combine)
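
As a possible alternative, `itertools.chain.from_iterable` from the standard library flattens one level in the same way. With `sum(..., [])` every addition builds a new list, so for a long transcript chaining may be the cheaper option. This is only a sketch using a shortened copy of the data above; the variable name `flat` is made up here.

```python
from itertools import chain

# Shortened copy of the nested list from the question.
word_timestamps = [
    [["hello", 3.06, 3.32]],
    [["hi", 4.2, 4.32], ["can", 4.54, 4.62]],
    [["yeah", 6.84, 6.94], ["may", 7.04, 7.12]],
]

# Remove one level of nesting: each inner [word, start, end] row is kept as-is.
flat = list(chain.from_iterable(word_timestamps))

for word, start, end in flat:
    print(word, start, end)
```

The resulting list of rows can be passed straight to `pd.DataFrame(flat, columns=[...])`, just like the list comprehension in the question.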