Hello there. I am trying to create tokens with some features and arrange them in a JSON-like format, using the following example text:
words = ['The study of aviation safety report in the aviation industry usually relies',
'The experimental results show that compared with traditional',
'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']
The output I want looks like this:
{"sentence": [
{
"indexSentence": 0,
"tokens": [{
"indexWord": 1,
"word": "The",
"len": 3
},
{ "indexWord": 2,
"word": "study",
"len": 5},
{"indexWord": 3,
"word": "of",
"len": 2
},
{"indexWord": 4,
"word": "aviation",
"len": 8},
...
]
},
{
"indexSentence" : 1,
"tokens" : [{
...
}]
},
....
]}
I tried the following code, with no success:
t_d = {len(i):i for i in words}
[{'Lon' : len(t_d[i]),
'tex' : t_d[i],
'Sub' : [{'index' : j,
'token': [{
'word':['word: ' j for i,j in enumerate(str(t_d[i]).split(' '))]
}],
'lenTo' : len(str(t_d[i]).split(' '))
}
],
'Sub1':[{'index' : j}]
} for j,i in enumerate(t_d)]
CodePudding user response:
The solution below assumes that tokenization splits each sentence on whitespace using str.split. It should work just as well with any other tokenizer function used in place of split.
from collections import defaultdict

words = ['The study of aviation safety report in the aviation industry usually relies',
         'The experimental results show that compared with traditional',
         'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']

# Missing keys start out as empty lists, so we can append right away.
sentence = defaultdict(list)
for idx, text in enumerate(words):
    # One entry per sentence; tokens come from splitting on whitespace.
    struct = {"indexSentence": idx,
              "tokens": [{"indexWord": idx_w, "word": w, "len": len(w)}
                         for idx_w, w in enumerate(text.split())]}
    sentence['sentence'].append(struct)

dict(sentence)
>>
{'sentence': [{'indexSentence': 0,
'tokens': [{'indexWord': 0, 'word': 'The', 'len': 3},
{'indexWord': 1, 'word': 'study', 'len': 5},
{'indexWord': 2, 'word': 'of', 'len': 2},
{'indexWord': 3, 'word': 'aviation', 'len': 8},
{'indexWord': 4, 'word': 'safety', 'len': 6},
{'indexWord': 5, 'word': 'report', 'len': 6},
{'indexWord': 6, 'word': 'in', 'len': 2},
{'indexWord': 7, 'word': 'the', 'len': 3},
{'indexWord': 8, 'word': 'aviation', 'len': 8},
{'indexWord': 9, 'word': 'industry', 'len': 8},
{'indexWord': 10, 'word': 'usually', 'len': 7},
{'indexWord': 11, 'word': 'relies', 'len': 6}]},
{'indexSentence': 1,
'tokens': [{'indexWord': 0, 'word': 'The', 'len': 3},
...
}
You can leverage defaultdict so that the list under 'sentence' is created automatically on first access, and then append each sentence structure to it. To mimic a JSON structure, you can turn it back into a plain dict.
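As a rough sketch of the same idea without defaultdict, the function below takes the tokenizer as a parameter and serializes the result with json.dumps. The names build_sentences, regex_tokenize, and the start parameter are hypothetical and not part of the answer above; start=1 is only there to reproduce the 1-based indexWord shown in the question, and regex_tokenize is just one possible alternative to str.split.

```python
import json
import re


def build_sentences(texts, tokenize=str.split, start=0):
    """Return {"sentence": [...]} with one entry per input text.

    tokenize: any callable mapping a string to an iterable of tokens
              (hypothetical parameter; defaults to whitespace splitting).
    start:    first value of indexWord (hypothetical parameter).
    """
    return {
        "sentence": [
            {
                "indexSentence": s_idx,
                "tokens": [
                    {"indexWord": w_idx, "word": w, "len": len(w)}
                    for w_idx, w in enumerate(tokenize(text), start=start)
                ],
            }
            for s_idx, text in enumerate(texts)
        ]
    }


def regex_tokenize(text):
    # Example alternative tokenizer: keep runs of word characters only,
    # so punctuation such as ":" and "-" is dropped.
    return re.findall(r"\w+", text)


words = ['The study of aviation safety report in the aviation industry usually relies',
         'The experimental results show that compared with traditional',
         'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']

result = build_sentences(words, start=1)
result_regex = build_sentences(words, tokenize=regex_tokenize)

# json.dumps turns the nested dicts and lists into an actual JSON string.
print(json.dumps(result, indent=2))
```

Since the structure holds only dicts, lists, strings, and integers, json.dumps can serialize it directly; a defaultdict would serialize too, as it is a dict subclass.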