Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a DataFrame. For now, I'm just trying to get the relevant data into a list and to understand how line iteration and readline() work in Python.

This is what the text looks like:

Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number 
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python

This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:

a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)

Now I want to do the below:

a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the
            #    line that includes "Subject:". Ignore the "Subject:" line, stop the
            #    "Full text:" subloop, and add the concatenated full text to the list.
            # 2. Continue the for loop within which all of this sits.
            pass  # placeholder: this is the part I don't know how to write

As a Python beginner, I'm spinning my wheels searching Google on this topic. Any pointers would be much appreciated.

CodePudding user response：

If you want to stick with your for-loop, you're probably going to need something like this:

titles = []
texts = []
subjects = []

with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass

(A couple of things: when you write list = [], you rebind the name list and shadow the built-in list class, which can cause you problems later in the same scope. Python doesn't reserve names like list and set, but it's worth treating them as if they were keywords, just to save yourself the headache. Also, the startswith method is a little more precise than in here, given your description of the data: it only matches when the marker begins the line, so a "Title:" that happens to appear in the middle of an article's text isn't mistaken for a new title.)
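To make both points in that note concrete, here is a small sketch. The function name shadowing_demo and the sample line are made up for illustration; they are not part of the question's data.

```python
import builtins


def shadowing_demo():
    """Show what happens once the name `list` is rebound to a list instance."""
    list = []  # inside this function, `list` now hides the built-in class
    try:
        list("abc")  # calls the empty list instance, which is not callable
    except TypeError as exc:
        message = str(exc)
    else:
        message = ""
    # The real class is still reachable through the builtins module.
    return message, builtins.list("abc")


message, chars = shadowing_demo()
print(message)
print(chars)

# `in` matches the marker anywhere in the line; startswith only at the beginning.
tricky_line = "Full text: the heading Title: appears mid-sentence here\n"
found_anywhere = "Title:" in tricky_line
found_at_start = tricky_line.startswith("Title:")
print(found_anywhere, found_at_start)
```

With the `in` test, that last line would have been collected as a title even though it belongs to an article's full text.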

Alternatively, you could drive the file's iterator by hand (i = iter(f), and then next(i)), which would let you use a more classic while-loop for the whole thing. The cost is that next(i) raises StopIteration at the end of the file, so you either have to catch it or pass a default, as in next(i, None). For myself, I would stick with the state-machine approach above, and just make it robust enough to deal with the edge cases you can reasonably expect.
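For comparison, the iterator-based alternative might look roughly like the sketch below. It is one possible shape, not the only one: the function name parse_articles and the in-memory SAMPLE (wrapped in io.StringIO in place of the real file) are assumptions for the sake of a runnable example, and unlike the loop above it strips the trailing newline from each stored value. Passing None as the default to next() sidesteps StopIteration entirely.

```python
import io

# Stand-in for sample.txt so the sketch runs without a file on disk.
SAMPLE = """\
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
"""


def parse_articles(lines):
    """Collect titles, full texts and subjects with an explicit subloop."""
    it = iter(lines)
    titles, texts, subjects = [], [], []
    line = next(it, None)  # None marks end of input instead of StopIteration
    while line is not None:
        if line.startswith("Title:"):
            titles.append(line.rstrip("\n"))
        elif line.startswith("Full text:"):
            parts = [line]
            line = next(it, None)
            # Subloop: keep consuming lines until the "Subject:" line or EOF.
            while line is not None and not line.startswith("Subject:"):
                parts.append(line)
                line = next(it, None)
            texts.append("".join(parts).rstrip("\n"))
            continue  # `line` already holds the "Subject:" line (or None)
        elif line.startswith("Subject:"):
            subjects.append(line.rstrip("\n"))
        line = next(it, None)
    return titles, texts, subjects


titles, texts, subjects = parse_articles(io.StringIO(SAMPLE))
print(titles)
print(subjects)
```

The same function should work on a real file object, since open() returns something you can pass to iter() in the same way. Note that the subloop only stops at "Subject:", as the question asks, so an article with no subject line would swallow the next article's title.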

CodePudding user response：

As your goal is to construct a DataFrame, here is a re + numpy + pandas solution:

import re
import pandas as pd
import numpy as np

# read the whole file into memory
with open('sample.txt', encoding="utf8") as f:
    text = f.read()


keys = ['Subject', 'Title', 'Full text']

regex = '(?:^|\n)(%s): ' % '|'.join(keys)

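# split the text on the keys; the capture group keeps each key in the result
chunks = re.split(regex, text)[1:]
# chunks alternates key, value, key, value, ...; reshape it to
# (articles, keys per article, 2) so each article becomes one dict of key/value pairs
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])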

Output:

                      Title                                                                                                                                               Full text Subject
0       title of an article  unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three..  Python
1  title of another article                                                                               again unfortunately the full text of each article,\nis on numerous lines.  Python
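The reshape step relies on every article having exactly the same three fields. If that can't be guaranteed, a variant that uses only re and starts a new record at each "Title" key may be easier to reason about. The sketch below is one way to do that; the names split_records, KEYS, PATTERN and the shortened SAMPLE text are assumptions made for illustration.

```python
import re

# Shortened stand-in for the contents of sample.txt.
SAMPLE = """\
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines.
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
"""

KEYS = ["Title", "Full text", "Subject"]
# One capture group, so re.split keeps the matched key in its output.
PATTERN = re.compile(r"^(%s): " % "|".join(map(re.escape, KEYS)), flags=re.MULTILINE)


def split_records(text):
    """Return one dict per article, opening a new dict at every 'Title' key."""
    chunks = PATTERN.split(text)[1:]  # drop whatever precedes the first key
    records = []
    # chunks alternates key, value, key, value, ...
    for key, value in zip(chunks[0::2], chunks[1::2]):
        if key == "Title" or not records:
            records.append({})
        records[-1][key] = value.strip()
    return records


records = split_records(SAMPLE)
for record in records:
    print(record)
```

A list of dicts like this can be passed straight to pd.DataFrame(records); a field missing from one article then shows up as NaN in that row rather than breaking the reshape.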