From this text data: https://drive.google.com/file/d/1p34ChEAC9R7HnkyllnpCLCYrIevP4u8T/view?usp=sharing
I want to create a structure in this form:
{
'tokens': ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'Ribet'],
'tag': ['O', 'B', 'B', 'I', 'O', 'O', 'B', 'O', 'B', 'I', 'I', 'B']
}
{
'tokens': ['@HaloBCA', 'Saya', 'mencoba', 'mengakses', 'menu', 'm-BCA', 'saya', 'namun', 'saya', 'mendapat', 'respons', 'Fasilitas', 'Mobile', 'Banking', 'terblokir', 'bagimana', 'sih', 'padahal', 'saya', 'baru', 'coba', 'akses', 'lo'],
'tag': ['B', 'O', 'O', 'B', 'B', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
}
This is what I've tried, using a dictionary:
f = open("a_testdata.txt", "r")
dicts = {}
tokens = []
tags = []
for line in f:
    if len(line.strip()) != 0:
        fields = line.split('\t')
        text = fields[0]
        tag = fields[1].strip()
        tokens.append(text)
        tags.append(tag)
        dicts['token'] = tokens
        dicts['tag'] = tags
    else:
        tokens = []
        tags = []
for key, value in dicts.items():
    print(key, value)
This only outputs the last sentence:
token ['@HaloBCA', 'Saya', 'mencoba', 'mengakses', 'menu', 'm-BCA', 'saya', 'namun', 'saya', 'mendapat', 'respons', 'Fasilitas', 'Mobile', 'Banking', 'terblokir', 'bagimana', 'sih', 'padahal', 'saya', 'baru', 'coba', 'akses', 'lo']
tag ['B', 'O', 'O', 'B', 'B', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
My question is: how can I group those sentences (each sentence is separated from the next by an empty line, see the text file) into one structure, if a dictionary is not possible? And if it is possible, how can I use a DataFrame?
CodePudding user response:
- You'll need a list of dictionaries, since keys can't be duplicated
- Before resetting the token/tag lists, you need to save them to the output and then reset dicts as well
- Corner case: if the file doesn't end with a blank line, the last sentence's data is still pending when the loop finishes, so it has to be added to the list after the loop
f = open("a_testdata.txt", "r")
output = []
dicts = {}
tokens = []
tags = []
for line in f:
    if len(line.strip()) != 0:
        fields = line.split('\t')
        text = fields[0]
        tag = fields[1].strip()
        tokens.append(text)
        tags.append(tag)
    else:
        # Blank line: the current sentence is complete, so save it and reset
        dicts['token'] = tokens
        dicts['tag'] = tags
        output.append(dicts)
        dicts = {}
        tokens = []
        tags = []
f.close()
# Corner case: no blank line after the last sentence, so its tokens are still pending
if tokens:
    dicts['token'] = tokens
    dicts['tag'] = tags
    output.append(dicts)
for item in output:
    for key, value in item.items():
        print(key, value)
CodePudding user response:
A dictionary can't have duplicate keys, so you have to either merge all the tokens and tags together or use a list of dictionaries.
f = open("a_testdata.txt", "r")
dicts = []
tokens = []
tags = []
for line in f:
    if len(line.strip()) != 0:
        fields = line.split('\t')
        text = fields[0]
        tag = fields[1].strip()
        tokens.append(text)
        tags.append(tag)
    else:
        dicts.append({'token': tokens, 'tags': tags})
        tokens = []
        tags = []
# print(dicts)
for d in dicts:
    print(d)
Warning: in some cases the loop never reaches the else block for the last dictionary. This happens when the file does not end with an empty line, so the last sentence is never appended.
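One way to cover that case is to check for leftover tokens once the loop has finished. The following is only a minimal sketch: the function name read_sentences and the sample string are made up for illustration, and it assumes every non-blank line has the form token, tab, tag.

```python
def read_sentences(lines):
    """Group tab-separated token/tag lines into one dict per sentence."""
    sentences = []
    tokens, tags = [], []
    for line in lines:
        if line.strip():
            # Non-blank line: "<token>\t<tag>" (assumed layout)
            token, tag = line.rstrip("\n").split("\t")[:2]
            tokens.append(token)
            tags.append(tag.strip())
        elif tokens:
            # Blank line closes the current sentence; repeated blanks are skipped
            sentences.append({"token": tokens, "tag": tags})
            tokens, tags = [], []
    if tokens:
        # The last sentence had no blank line after it
        sentences.append({"token": tokens, "tag": tags})
    return sentences


# Hypothetical sample in the same layout as the file, with no trailing blank line
sample = "Setelah\tO\nmelalui\tB\n\n@HaloBCA\tB\nSaya\tO"
for sentence in read_sentences(sample.splitlines()):
    print(sentence)
```

With the real file, the open file object can be passed in directly, for example read_sentences(f) inside a with open("a_testdata.txt") as f: block. As for the DataFrame part of the question, pandas.DataFrame accepts a list of dictionaries, so passing the returned list to it should give one row per sentence, with the token and tag columns each holding a list.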