Tokenizing text with features in a specific format

Time:11-04

Hello, I am trying to create tokens with some features and arrange them in a JSON-like format, using the following example text:

words = ['The study of aviation safety report in the aviation industry usually relies', 
         'The experimental results show that compared with traditional',
         'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']

The desired output is:

{"sentence": [
    {
        "indexSentence": 0,
        "tokens": [
            {"indexWord": 1, "word": "The", "len": 3},
            {"indexWord": 2, "word": "study", "len": 5},
            {"indexWord": 3, "word": "of", "len": 2},
            {"indexWord": 4, "word": "aviation", "len": 8},
            ...
        ]
    },
    {
        "indexSentence": 1,
        "tokens": [{
            ...
        }]
    },
    ...
]}

I tried the following code, with no success:

t_d = {len(i):i for i in words}

[{'Lon' : len(t_d[i]),
  'tex' : t_d[i], 
  'Sub' : [{'index' : j,
            'token': [{
                      'word':['word: ' + j for i,j in enumerate(str(t_d[i]).split(' '))] 
                      }],
            'lenTo' : len(str(t_d[i]).split(' '))
           }
          ],
  'Sub1':[{'index' : j}]
 } for j,i in enumerate(t_d)]

Answer:

The solution below assumes that tokenization means splitting each sentence on whitespace with str.split. It should also work with any other tokenizer function that returns a list of strings.

from collections import defaultdict

words = ['The study of aviation safety report in the aviation industry usually relies', 
         'The experimental results show that compared with traditional',
         'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']

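# defaultdict(list) creates the 'sentence' list the first time it is accessed
sentence = defaultdict(list)

for idx, text in enumerate(words):
    # One entry per sentence; tokens come from splitting on whitespace.
    # enumerate starts at 0; pass start=1 for the 1-based indexWord shown in the question.
    struct = {"indexSentence": idx,
              "tokens": [{"indexWord": idx_w,
                          "word": w,
                          "len": len(w)} for idx_w, w in enumerate(text.split())]}
    sentence['sentence'].append(struct)

# Convert back to a plain dict for display or serialization
dict(sentence)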

>>
{'sentence': [{'indexSentence': 0,
   'tokens': [{'indexWord': 0, 'word': 'The', 'len': 3},
    {'indexWord': 1, 'word': 'study', 'len': 5},
    {'indexWord': 2, 'word': 'of', 'len': 2},
    {'indexWord': 3, 'word': 'aviation', 'len': 8},
    {'indexWord': 4, 'word': 'safety', 'len': 6},
    {'indexWord': 5, 'word': 'report', 'len': 6},
    {'indexWord': 6, 'word': 'in', 'len': 2},
    {'indexWord': 7, 'word': 'the', 'len': 3},
    {'indexWord': 8, 'word': 'aviation', 'len': 8},
    {'indexWord': 9, 'word': 'industry', 'len': 8},
    {'indexWord': 10, 'word': 'usually', 'len': 7},
    {'indexWord': 11, 'word': 'relies', 'len': 6}]},
  {'indexSentence': 1,
   'tokens': [{'indexWord': 0, 'word': 'The', 'len': 3},
...
}

defaultdict(list) creates the list under the 'sentence' key the first time that key is accessed, so you can append each sentence structure to it directly. To get a plain structure that maps onto JSON, turn it back into a dict.
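If you need actual JSON text, 1-based word indices as in the question, or a different tokenizer, the same idea can be wrapped in a function. This is only a sketch: build_structure, its tokenize and start parameters, and the word_chars tokenizer are names made up for illustration, not part of the original answer.

```python
import json
import re


def build_structure(sentences, tokenize=str.split, start=1):
    """Return {"sentence": [...]} with index, word and length for every token."""
    return {
        "sentence": [
            {
                "indexSentence": s_idx,
                "tokens": [
                    {"indexWord": w_idx, "word": w, "len": len(w)}
                    # start=1 gives the 1-based indexWord shown in the question
                    for w_idx, w in enumerate(tokenize(s), start=start)
                ],
            }
            for s_idx, s in enumerate(sentences)
        ]
    }


def word_chars(text):
    """Hypothetical alternative tokenizer: keep only runs of word characters."""
    return re.findall(r"\w+", text)


words = ['The study of aviation safety report in the aviation industry usually relies',
         'The experimental results show that compared with traditional',
         'Heterogeneous Aviation Safety Cases: Integrating the Formal and the Non-formal']

result = build_structure(words)                       # whitespace tokens, 1-based
alt = build_structure(words, tokenize=word_chars)     # punctuation dropped

# json.dumps turns the nested dict/list structure into a JSON string
print(json.dumps(result, indent=2))
```

With str.split, punctuation stays attached to the token ("Cases:" is one token of length 6), while the regex tokenizer drops it and also splits "Non-formal" into two tokens. Which behavior is right depends on what you want to count as a word.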
