Reading a text file into lists, based on the spaces in the file

So I have this txt file:

Haiku
5 *
7 *
5 *

Limerick
8 A
8 A
5 B
5 B
8 A

And I want to write a function that returns something like this:

[['Haiku', '5', '*', '7', '*', '5', '*'], ['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8' ,'A']]

I've tried this:

small_pf = open('datasets/poetry_forms_small.txt')

lst = []

for line in small_pf:
    lst.append(line.strip())
    
small_pf.close()

print(lst)

At the end I end up with this:

['Haiku', '5 *', '7 *', '5 *', '', 'Limerick', '8 A', '8 A', '5 B', '5 B', '8 A']

My problem is that this is one big list, and the elements of the list are attached together, like '5 *' or '8 A'. I honestly don't know where to start, and that's why I need some guidance on how to solve those two problems. Any help would be greatly appreciated.

CodePudding user response：

When you reach an empty line, don't add it: save the temporary list you've been filling, start a new one, and continue.

lst = []
with open('test.txt') as small_pf:
    tmp_list = []
    for line in small_pf:
        line = line.rstrip("\n")
        if line == "":
            lst.append(tmp_list)
            tmp_list = []
        else:
            tmp_list.extend(line.split())

    if tmp_list:  # add last one
        lst.append(tmp_list)

print(lst)
# [['Haiku', '5', '*', '7', '*', '5', '*'],
#  ['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8', 'A']]
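
Since the question asks for a function, the loop above can be wrapped in one. This is only a minimal sketch: the name `read_sections` is hypothetical, and it takes any iterable of lines (an open file or a list of strings) so it can be tried without a file on disk. It also treats whitespace-only lines as blank, which is an assumption beyond what the question states.

```python
def read_sections(lines):
    """Return one list of whitespace-separated tokens per blank-line-separated block."""
    sections = []
    current = []
    for line in lines:
        tokens = line.split()
        if tokens:
            current.extend(tokens)
        elif current:
            # a blank (or whitespace-only) line closes the current block
            sections.append(current)
            current = []
    if current:  # last block, when the input does not end with a blank line
        sections.append(current)
    return sections


# Sample lines mirroring the file from the question
sample = ["Haiku\n", "5 *\n", "7 *\n", "5 *\n", "\n",
          "Limerick\n", "8 A\n", "8 A\n", "5 B\n", "5 B\n", "8 A\n"]
print(read_sections(sample))
```

With a real file, the open file object can be passed directly, for example `with open('datasets/poetry_forms_small.txt') as f: result = read_sections(f)`.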

CodePudding user response：

First split the file into sections on blank lines (\n\n), then split each section on any whitespace (newlines or spaces).

lst = [section.split() for section in small_pf.read().split('\n\n')]

Result:

[['Haiku', '5', '*', '7', '*', '5', '*'],
 ['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8', 'A']]
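
The one-liner assumes `small_pf` is an already open file. The sketch below is a self-contained variant under two assumptions: `io.StringIO` stands in for the file purely for demonstration, and the hypothetical helper `split_sections` skips empty sections, which `split('\n\n')` produces when the text ends with a blank line or contains several blank lines in a row.

```python
import io

# Stand-in for the file contents; note the extra blank line at the end
text = "Haiku\n5 *\n7 *\n5 *\n\nLimerick\n8 A\n8 A\n5 B\n5 B\n8 A\n\n"


def split_sections(fileobj):
    """Split on blank lines, then split each non-empty section on whitespace."""
    return [
        section.split()
        for section in fileobj.read().split("\n\n")
        if section.strip()  # drop empty sections
    ]


result = split_sections(io.StringIO(text))
print(result)
```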

CodePudding user response：

Solution without using extra modules

small_pf = small_pf.readlines()
result = []
tempList = []
for line in small_pf:
  if line.strip() == "":
    # a blank line ends the current section
    if tempList:
      result.append(tempList)
      tempList = []
  else:
    tempList.extend(line.split())
if tempList:  # the last section has no blank line after it
  result.append(tempList)
print(result)

Solution with module

You can use regex to solve your problem:

import re
small_pf = small_pf.read()
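# raw strings avoid invalid-escape warnings; strip() drops the trailing newline
# so the last list does not end with an empty string
[re.split(r"\s+", x) for x in re.split(r"\n\n", small_pf.strip())]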

Output

[['Haiku', '5', '*', '7', '*', '5', '*'],
 ['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8', 'A']]
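
A regex split becomes more useful when the "blank" lines may contain stray spaces or tabs, which a plain split on two newlines would miss. The following is a sketch under that assumption; `split_sections_re` is a hypothetical name, and the pattern `\n\s*\n` also collapses several blank lines in a row into one separator.

```python
import re


def split_sections_re(text):
    """Split text into blocks at blank or whitespace-only lines, then tokenize each block."""
    text = text.strip()
    if not text:
        return []
    # \n\s*\n matches a line break, any whitespace (including further newlines), and a line break
    return [block.split() for block in re.split(r"\n\s*\n", text)]


# The separator line here contains a space and a tab instead of being truly empty
messy = "Haiku\n5 *\n \t\nLimerick\n8 A\n"
print(split_sections_re(messy))
```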

CodePudding user response：

This approach assumes that every non-empty line starts either with a decimal digit or with some other character. A line that starts with a non-digit begins a new list, with the whole line (stripped of trailing whitespace) as its first element. A line that starts with a digit is stripped of trailing whitespace and broken into its space-separated parts, which are appended to the most recently created list.

lst = []
with open("blankpaper.txt") as f:
    for line in f:
        # ignore empty lines 
        if line.rstrip() == '':
            continue
        if not line[0].isdecimal():
            new_list = [line.rstrip()]
            lst.append(new_list)
            continue
        new_list.extend(line.split())

print(lst)

Output

[['Haiku', '5', '*', '7', '*', '5', '*'], ['Limerick', '8', 'A', '8', 'A', '5', 'B', '5', 'B', '8', 'A']]

I hope this helps. If there are any questions, please let me know.