I have a txt file that looks like this:
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O
I'm trying to build tuples from this file, which I will later turn from words into features. I want a list of lists that looks like this:
[(EU, NNP, B-NP, B-ORG), (rejects, VBZ, B-VP, O), (German, JJ, B-NP, B-MISC), (call, NN, I-NP, O).....
(Peter, NNP, B-NP, B-PER), (Blackburn, NNP, I-NP, I-PER),
(BRUSSELS, NNP, B-NP, B-LOC), (1996-08-22, CD, I-NP, O)]
Each blank line indicates that a sentence is over and should be added to the list at the current index. After the blank line, we should move on to the next index of the list, until all sentences have been added.
# function to read data; returns a list of tuples, each tuple represents a token and contains word, POS tag, chunk tag, and NER tag
import csv

def read_data(filename) -> list:
    data = []
    sentences = []
    with open(filename) as load_file:
        reader = csv.reader(load_file, delimiter=" ")  # read
        for row in reader:
            if(len(tuple(row)) != 0):
                data.append(tuple(row))
        sentences.append(data)
    return sentences
I have a function like this; however, it returns this:
('EU', 'NNP', 'B-NP', 'B-ORG'),
('rejects', 'VBZ', 'B-VP', 'O'),
('German', 'JJ', 'B-NP', 'B-MISC'),
('call', 'NN', 'I-NP', 'O'),
('to', 'TO', 'B-VP', 'O'),
('boycott', 'VB', 'I-VP', 'O'),
('British', 'JJ', 'B-NP', 'B-MISC'),
('lamb', 'NN', 'I-NP', 'O'),
('.', '.', 'O', 'O'),
('Peter', 'NNP', 'B-NP', 'B-PER'),
('Blackburn', 'NNP', 'I-NP', 'I-PER'),
('BRUSSELS', 'NNP', 'B-NP', 'B-LOC'),
('1996-08-22', 'CD', 'I-NP', 'O'),
How can I solve this problem? I tried using two different lists and adding them together, but I could not find a way.
CodePudding user response:
I think the whole problem is that you show this expected result
[(EU, NNP, B-NP, B-ORG), (rejects, VBZ, B-VP, O), (German, JJ, B-NP, B-MISC), (call, NN, I-NP, O).....
(Peter, NNP, B-NP, B-PER), (Blackburn, NNP, I-NP, I-PER),
(BRUSSELS, NNP, B-NP, B-LOC), (1996-08-22, CD, I-NP, O)]
but I think you actually expect
[
    [(EU, NNP, B-NP, B-ORG), (rejects, VBZ, B-VP, O), (German, JJ, B-NP, B-MISC), (call, NN, I-NP, O).....],
    [(Peter, NNP, B-NP, B-PER), (Blackburn, NNP, I-NP, I-PER)],
    [(BRUSSELS, NNP, B-NP, B-LOC), (1996-08-22, CD, I-NP, O)],
]
and this needs
for row in reader:
    if row:
        data.append(tuple(row))
    else:
        sentences.append(data)
        data = []
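One detail of that loop is worth spelling out: `data = []` binds the name to a new list instead of emptying the old one, so the sentence already stored in `sentences` stays intact. A minimal sketch of the difference, using made-up two-field tokens (the names `rebound` and `cleared` are only for illustration):

```python
# Rebinding: the list already stored in `sentences` is left untouched.
sentences = []
data = [("EU", "NNP")]
sentences.append(data)
data = []                      # new, separate list object
data.append(("Peter", "NNP"))
rebound = sentences            # still holds only the first sentence

# Clearing in place: `sentences` holds the very same object, so it changes too.
sentences = []
data = [("EU", "NNP")]
sentences.append(data)
data.clear()                   # empties the list that was just appended
data.append(("Peter", "NNP"))
cleared = sentences            # the stored "sentence" was overwritten

print(rebound)
print(cleared)
```

So if you ever replace `data = []` with `data.clear()`, every entry in `sentences` would end up pointing at the same list.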
At the end you may also need to add the last data, because there is no empty line after it
if data:
    sentences.append(data)
Full working example.
I use io only to simulate a file in memory so everyone can copy and run it. But you should use open() without text.
text = '''EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O'''

import csv
import io

data = []
sentences = []

#with open(filename) as load_file:
with io.StringIO(text) as load_file:
    reader = csv.reader(load_file, delimiter=" ")  # read
    for row in reader:
        if row:
            data.append(tuple(row))
        else:
            sentences.append(data)
            data = []

# add last data because there is no empty line after these data
if data:
    sentences.append(data)

print(sentences)
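The same loop can be folded back into the `read_data` function from the question. This is only a sketch under a few assumptions of mine: `quoting=csv.QUOTE_NONE` is added in case the file contains a literal `"` token (otherwise `csv` would treat it as a quote character), and `elif data:` is used instead of `else:` so that two blank lines in a row do not add an empty sentence.

```python
import csv


def read_data(filename) -> list:
    """Return a list of sentences; each sentence is a list of token tuples."""
    sentences = []
    data = []  # tokens of the sentence currently being read
    with open(filename, newline="") as load_file:
        # Assumption: QUOTE_NONE keeps a literal '"' token from being
        # interpreted as a CSV quote character.
        reader = csv.reader(load_file, delimiter=" ", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row:
                data.append(tuple(row))
            elif data:
                # blank line: the current sentence is complete
                sentences.append(data)
                data = []
    # the file may not end with a blank line, so keep the last sentence too
    if data:
        sentences.append(data)
    return sentences
```

Then `sentences = read_data("your_file.txt")` (the file name is a placeholder) should give one inner list per sentence, so `sentences[0]` is the first sentence and `sentences[0][0]` is its first token tuple.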