Create a list of lists of tuples from reading a txt file

I have a txt file that looks like this:

    EU NNP B-NP B-ORG
    rejects VBZ B-VP O
    German JJ B-NP B-MISC
    call NN I-NP O
    to TO B-VP O
    boycott VB I-VP O
    British JJ B-NP B-MISC
    lamb NN I-NP O
    . . O O
    
    Peter NNP B-NP B-PER
    Blackburn NNP I-NP I-PER

    BRUSSELS NNP B-NP B-LOC
    1996-08-22 CD I-NP O

I'm trying to build tuples from this file, which I will later convert from words to features. I want a list of lists that looks like this:

[(EU, NNP,B-NP, B-ORG),(rejects, VBZ, B-VP, O),(German, JJ, B-NP, B-MISC),(call, NN, I-NP, O).....
 (Peter, NNP, B-NP, B-PER),(Blackburn, NNP, I-NP, I-PER),
 (BRUSSELS, NNP, B-NP, B-LOC),(1996-08-22, CD, I-NP, O)

Each blank line indicates that a sentence is over: the tokens read so far should be added to the list at the current index, and after the blank line we should move on to the next index of the list, until all sentences have been added.

# Read the data and return a list of tuples; each tuple represents a token and contains the word, POS tag, chunk tag, and NER tag
import csv
def read_data(filename) -> list:
  data = []
  sentences = []
  with open(filename) as load_file:
    reader = csv.reader(load_file, delimiter=" ")   # read
   
    for row in reader:
      if(len(tuple(row)) != 0):
        data.append(tuple(row))
     
  sentences.append(data)
  return sentences

I have a function like this, but it returns this:

('EU', 'NNP', 'B-NP', 'B-ORG'),
  ('rejects', 'VBZ', 'B-VP', 'O'),
  ('German', 'JJ', 'B-NP', 'B-MISC'),
  ('call', 'NN', 'I-NP', 'O'),
  ('to', 'TO', 'B-VP', 'O'),
  ('boycott', 'VB', 'I-VP', 'O'),
  ('British', 'JJ', 'B-NP', 'B-MISC'),
  ('lamb', 'NN', 'I-NP', 'O'),
  ('.', '.', 'O', 'O'),
  ('Peter', 'NNP', 'B-NP', 'B-PER'),
  ('Blackburn', 'NNP', 'I-NP', 'I-PER'),
  ('BRUSSELS', 'NNP', 'B-NP', 'B-LOC'),
  ('1996-08-22', 'CD', 'I-NP', 'O'),

How can I solve this problem? I tried using two different lists and adding them together, but I could not find a way.

CodePudding user response：

I think the whole problem is that you show this as the expected result:

[(EU, NNP,B-NP, B-ORG),(rejects, VBZ, B-VP, O),(German, JJ, B-NP, B-MISC),(call, NN, I-NP, O).....
 (Peter, NNP, B-NP, B-PER),(Blackburn, NNP, I-NP, I-PER),
 (BRUSSELS, NNP, B-NP, B-LOC),(1996-08-22, CD, I-NP, O)

but I think you actually expect a list of lists, with one inner list per sentence:

[
 [(EU, NNP,B-NP, B-ORG),(rejects, VBZ, B-VP, O),(German, JJ, B-NP, B-MISC),(call, NN, I-NP, O).....], 
 [(Peter, NNP, B-NP, B-PER),(Blackburn, NNP, I-NP, I-PER)],
 [(BRUSSELS, NNP, B-NP, B-LOC),(1996-08-22, CD, I-NP, O)],
]
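
To make the difference concrete, here is a small sketch of the nested shape. The values are copied from the sample data, with the first sentence shortened to two tokens for brevity. Each inner list is one sentence, so you can index by sentence first and then by token:

```python
# Nested shape: one inner list per sentence, one tuple per token.
sentences = [
    [("EU", "NNP", "B-NP", "B-ORG"), ("rejects", "VBZ", "B-VP", "O")],
    [("Peter", "NNP", "B-NP", "B-PER"), ("Blackburn", "NNP", "I-NP", "I-PER")],
    [("BRUSSELS", "NNP", "B-NP", "B-LOC"), ("1996-08-22", "CD", "I-NP", "O")],
]

second_sentence = sentences[1]            # all tokens of the second sentence
word, pos, chunk, ner = sentences[1][0]   # unpack the first token of that sentence

print(len(sentences))  # 3
print(word, ner)       # Peter B-PER
```

With the flat list from the question, the sentence boundaries are lost and only token-level indexing is possible.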

and for this you need to close the current sentence whenever you reach an empty row:

    for row in reader:
        if row:
            data.append(tuple(row))
        else:
            sentences.append(data)
            data = []

At the end you may also need to add the last data, because there is no empty line after the final sentence:

    if data:
        sentences.append(data)
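
As an alternative sketch (not what the code above does), the same grouping on empty rows can be written with itertools.groupby, which avoids the separate step for the last sentence. The helper name group_rows and the shortened sample string are only assumptions for this demo:

```python
import csv
import io
from itertools import groupby

# Small in-memory sample standing in for the real file (assumption for the demo).
sample = "EU NNP B-NP B-ORG\nrejects VBZ B-VP O\n\nPeter NNP B-NP B-PER\n"

def group_rows(rows):
    """Group consecutive non-empty rows into sentences; empty rows act as separators."""
    return [
        [tuple(row) for row in group]
        for has_tokens, group in groupby(rows, key=bool)
        if has_tokens  # skip the runs of empty rows themselves
    ]

with io.StringIO(sample) as load_file:
    sentences = group_rows(csv.reader(load_file, delimiter=" "))

print(sentences)
```

One difference in behaviour: several consecutive empty rows are treated as a single separator here, so no empty sentence lists are produced.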

Full working example.

I use io.StringIO only to simulate a file in memory so everyone can copy and run it. With a real file you should use open(filename) instead and drop the text variable.

text = '''EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O'''

import csv
import io

data = []
sentences = []

#with open(filename) as load_file:
with io.StringIO(text) as load_file:    
    reader = csv.reader(load_file, delimiter=" ")   # read
   
    for row in reader:
        if row:
            data.append(tuple(row))
        else:
            sentences.append(data)
            data = []

    # add last data because there is no empty line after these data
    if data:
        sentences.append(data)

print(sentences)
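
One caveat, stated as an assumption about your real file: if its lines are indented, or the separator lines contain spaces as in the listing in the question (which may just be a formatting artifact of the post), csv.reader with delimiter=" " yields rows containing empty strings, and a whitespace-only line is then no longer an empty row. A minimal sketch that uses str.split() instead of csv, wrapped in a function with the same name as in the question:

```python
def read_data(lines):
    """Return a list of sentences; each sentence is a list of token tuples."""
    sentences = []
    data = []
    for line in lines:
        fields = line.split()  # splits on any whitespace, ignores leading/trailing spaces
        if fields:
            data.append(tuple(fields))
        elif data:             # blank line: close the current sentence, if there is one
            sentences.append(data)
            data = []
    if data:                   # last sentence when the input has no trailing blank line
        sentences.append(data)
    return sentences

# Works on any iterable of lines, for example an open file object.
sample = [
    "    Peter NNP B-NP B-PER\n",
    "    Blackburn NNP I-NP I-PER\n",
    "    \n",
    "    BRUSSELS NNP B-NP B-LOC\n",
]
print(read_data(sample))
```

With a real file this would be called as read_data(load_file) inside the with open(filename) block. Unlike the question's version, it takes the lines rather than a filename, which keeps it easy to test.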