How to group sentences from a text file into one structure?

From this text data: https://drive.google.com/file/d/1p34ChEAC9R7HnkyllnpCLCYrIevP4u8T/view?usp=sharing

I want to create a structure in this form:

{
  'tokens': ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'Ribet'],
  'tag': ['O', 'B', 'B', 'I', 'O', 'O', 'B', 'O', 'B', 'I', 'I', 'B']
}
{
  'tokens': ['@HaloBCA', 'Saya', 'mencoba', 'mengakses', 'menu', 'm-BCA', 'saya', 'namun', 'saya', 'mendapat', 'respons', 'Fasilitas', 'Mobile', 'Banking', 'terblokir', 'bagimana', 'sih', 'padahal', 'saya', 'baru', 'coba', 'akses', 'lo'],
  'tag': ['B', 'O', 'O', 'B', 'B', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
}

This is what I've tried, using a dictionary:

f = open("a_testdata.txt", "r")
dicts = {}
tokens = []
tags = []

for line in f:
  if len(line.strip()) != 0:
    fields = line.split('\t')
    text = fields[0]
    tag = fields[1].strip()
    tokens.append(text)
    tags.append(tag)
    dicts['token'] = tokens
    dicts['tag'] = tags
  else:
    tokens = []
    tags = []

for key, value in dicts.items():
  print(key, value)

This only outputs the last sentence:

token ['@HaloBCA', 'Saya', 'mencoba', 'mengakses', 'menu', 'm-BCA', 'saya', 'namun', 'saya', 'mendapat', 'respons', 'Fasilitas', 'Mobile', 'Banking', 'terblokir', 'bagimana', 'sih', 'padahal', 'saya', 'baru', 'coba', 'akses', 'lo']
tag ['B', 'O', 'O', 'B', 'B', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

My question is: how can I group those sentences (each sentence is separated from the next by an empty line, see the text file) into one structure, if a single dictionary is not possible? And, if possible, how could I do this with a DataFrame?

CodePudding user response：

You'll need a list of dictionaries, since a dictionary can't hold duplicate keys.
Before resetting the token/tag lists on a blank line, save the current sentence to the output.
Corner case: if the file does not end with a blank line, the last sentence is never saved inside the loop, so it has to be appended after the loop.

output = []
tokens = []
tags = []

with open("a_testdata.txt", "r") as f:
  for line in f:
    if line.strip():
      # Each non-blank line is "token<TAB>tag".
      fields = line.split('\t')
      tokens.append(fields[0])
      tags.append(fields[1].strip())
    elif tokens:
      # A blank line ends the current sentence; repeated blank lines are skipped.
      output.append({'token': tokens, 'tag': tags})
      tokens = []
      tags = []

# Save the last sentence if the file does not end with a blank line.
if tokens:
  output.append({'token': tokens, 'tag': tags})

for item in output:
  for key, value in item.items():
    print(key, value)
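
The question also asks about a DataFrame, which the loop above does not cover. One possible layout, sketched here under assumptions, is a long table with one row per token plus a column that records which sentence the token belongs to. The names `to_rows` and `sentence_id` are hypothetical, and the sample is a shortened copy of the sentences in the question.

```python
def to_rows(sentences):
    """Flatten a list of {'token': [...], 'tag': [...]} dicts into one row per token."""
    rows = []
    for sentence_id, sentence in enumerate(sentences):
        for token, tag in zip(sentence['token'], sentence['tag']):
            rows.append({'sentence_id': sentence_id, 'token': token, 'tag': tag})
    return rows


# Shortened sample taken from the sentences in the question.
sentences = [
    {'token': ['Setelah', 'melalui', 'proses'], 'tag': ['O', 'B', 'B']},
    {'token': ['@HaloBCA', 'Saya'], 'tag': ['B', 'O']},
]

rows = to_rows(sentences)
for row in rows:
    print(row)
```

Since the pandas `DataFrame` constructor accepts a list of dictionaries, `pd.DataFrame(rows)` should give a table with the columns `sentence_id`, `token` and `tag`, and grouping by `sentence_id` recovers the individual sentences.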

CodePudding user response：

A dictionary can't have duplicate keys, so you have to either merge all the tokens and tags together or use a list of dictionaries.

dicts = []
tokens = []
tags = []

with open("a_testdata.txt", "r") as f:
    for line in f:
        if line.strip():
            fields = line.split('\t')
            tokens.append(fields[0])
            tags.append(fields[1].strip())
        elif tokens:
            # Blank line: close the current sentence.
            dicts.append({'token': tokens, 'tag': tags})
            tokens = []
            tags = []

# The last sentence has no blank line after it if the file ends without one.
if tokens:
    dicts.append({'token': tokens, 'tag': tags})

for d in dicts:
    print(d)

Warning: if the file does not end with an empty line, the loop never reaches the else block for the last sentence, so that sentence is not appended inside the loop. It has to be appended once more after the loop ends.
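
One way to sidestep the trailing-line corner case entirely is to let `itertools.groupby` split the lines into runs of blank and non-blank lines. This is only a sketch, not the approach used in the answers above: the function name `group_sentences` is hypothetical, it assumes every non-blank line has the form token, tab, tag, and the sample is shortened in-memory data that deliberately ends without a blank line.

```python
from itertools import groupby


def group_sentences(lines):
    """Group "token<TAB>tag" lines into sentences separated by blank lines."""
    sentences = []
    # groupby yields runs of consecutive lines that share the same key:
    # True for non-blank lines, False for blank separator lines.
    for has_text, block in groupby(lines, key=lambda line: bool(line.strip())):
        if not has_text:
            continue  # skip the separator run, however many blank lines it has
        tokens, tags = [], []
        for line in block:
            token, tag = line.rstrip('\n').split('\t')[:2]
            tokens.append(token)
            tags.append(tag.strip())
        sentences.append({'token': tokens, 'tag': tags})
    return sentences


# Shortened sample; note there is no blank line after the last sentence.
sample = [
    "Setelah\tO\n",
    "melalui\tB\n",
    "\n",
    "@HaloBCA\tB\n",
    "Saya\tO",
]
result = group_sentences(sample)
print(result)
```

Because an open file object iterates over its lines, the same function can be called as `group_sentences(f)` inside a `with open(...) as f:` block.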