How to read specific lines of .txt file?

I am trying to extract information from a text file and store each "paragraph". By paragraph I mean the date (always the first thing on the line) together with whatever description is associated with that date (the text right after that date, but before the next date). The .txt file looks like this:

September 2013. **I NEED THE DATA THAT WOULD BE WRITTEN HERE STORED WITH ITS DATE HOWEVER 
WHEN ANOTHER DATE SHOWS UP IT NEEDS TO BE SEPARATED
September 2013. blah blah balh this is an example blah blaha blah I need the information hereblah blah balh this is an example blah blaha blah I need the information here
blah blah balh this is an example blah blaha blah I need the information here
August 2013. blah blah balh this is an example blah blaha blah I need the information here
August 2013.blah blah balh this is an example blah blaha blah I need the information here
blah blah balh this is an example blah blaha blah I need the information hereblah blah balh this is an example blah blaha blah I need the information hereblah blah balh this is an example blah blaha blah I need the information here
June 2013. blah blah balh this is an example blah blaha blah I need the information hereeeeee

The number of lines that follow a date is not fixed.

I am able to find every line starting with a date using

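# Assumed definition; the original post does not show how months is built.
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

with open("test.txt", encoding="utf8") as infile:  # "input" would shadow the built-in
    for line in infile:
        for month in months:
            if month in line:
                print(line)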

but this outputs

May 2014. only the first line is taken in and not the rest of the paragraph

April 2013. only the first line is taken in and not the rest of the paragraph

December 2013. only the first line is taken in and not the rest of the paragraph

November 2012. only the first line is taken in and not the rest of the paragraph

CodePudding user response：

If the file you read fits in memory, the best option is usually to read the complete file and then operate on it.

If you might have huge files (100MB and more), you might want to read in chunks:

https://stackoverflow.com/a/519653/562769

However, this means that you need to write more complex logic to deal with those chunks.
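
Most of that extra logic is about lines that straddle two chunks. A minimal sketch of one way to handle it, assuming a text-mode file object and "\n" line endings; iter_lines_chunked and chunk_size are illustrative names, not part of any library:

```python
import io


def iter_lines_chunked(fp, chunk_size=64 * 1024):
    """Yield complete lines from fp, reading at most chunk_size characters at a time."""
    leftover = ""
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        pieces = (leftover + chunk).split("\n")
        # The last piece may be the start of a line that continues
        # in the next chunk, so hold it back instead of yielding it.
        leftover = pieces.pop()
        yield from pieces
    if leftover:
        # The file did not end with a newline; emit the final partial line.
        yield leftover


# Small in-memory demo with a deliberately tiny chunk size.
demo = io.StringIO("ab\ncd\nef")
lines = list(iter_lines_chunked(demo, chunk_size=2))
```

The generator never holds more than one chunk plus one partial line in memory, so it only helps if individual lines stay reasonably small.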

Reading by lines doesn't make sense if your lines can become arbitrarily large. To the OS and the file system, a line is not a meaningful unit: a newline character is just one character in a bigger file, like any other character.

Regarding the line matching, you could do something like this:

with open("file.txt") as fp:
    data = fp.read()

# Split on the newline character "\n" (not "/n").
for line in data.split("\n"):
    if matches(line):
        operate(line)

Here, matches is a function that checks whether your date condition is met, and operate does whatever you want to do with the line.

The matches function could use several if-elif statements or regular expressions (the re module). Using split, startswith, or a "pattern" in "haystack" test might also be useful.
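
As a minimal sketch of the regular-expression option, matches could look like the following. It assumes that a date line starts with a full English month name, a space, a four-digit year, and a period, as in the question's sample; MONTHS and DATE_RE are illustrative names.

```python
import re

# Assumption: dates use full English month names, as in the sample text.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

# Anchored at the start of the line: "<Month> <4-digit year>."
DATE_RE = re.compile(r"^(?:%s) \d{4}\." % "|".join(MONTHS))


def matches(line):
    """Return True if the line starts with a date such as 'August 2013.'."""
    return DATE_RE.match(line) is not None
```

Anchoring the pattern at the start of the line means that a month name appearing later in a description does not count as a new date.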

CodePudding user response：

This will work, assuming every line begins with a month and year separated by a space. However, your example text has lines that do not begin with a month/year, which makes me wonder whether you expect those lines to be rejected.

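with open('filename.txt', 'r') as f:
    data = f.readlines()

for line in data:
    words = line.strip().split(' ')
    # First two words are taken as the date (e.g. "September 2013.")
    date = ' '.join(words[0:2])
    # Everything after them is the description
    desc = ' '.join(words[2:])
    print(f'{date} | {desc}\n')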
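
To keep continuation lines with the date they belong to, one possible approach is to start a new entry whenever a line begins with a date and to append every other line to the most recent entry. A minimal sketch, assuming dates look like "September 2013." at the start of a line; group_by_date and DATE_RE are illustrative names, not taken from the posts above.

```python
import re

# Assumption: a date is a full English month name, a space, a 4-digit year,
# and a period, at the very start of a line (as in the question's sample).
DATE_RE = re.compile(
    r"^((?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December) \d{4})\.\s*(.*)$"
)


def group_by_date(lines):
    """Return a list of (date, description) tuples, one per dated paragraph."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        m = DATE_RE.match(line)
        if m:
            # A date starts a new entry; keep the rest of the line as text.
            entries.append((m.group(1), [m.group(2)]))
        elif entries:
            # No date: this line continues the most recent entry.
            entries[-1][1].append(line)
        # Lines before the first date are dropped.
    return [(date, " ".join(parts).strip()) for date, parts in entries]


sample = [
    "September 2013. first part\n",
    "continued here\n",
    "August 2013.second entry\n",
    "\n",
    "June 2013. third\n",
]
result = group_by_date(sample)
```

A list of tuples is used instead of a dict because the same date can appear more than once in the file. With a real file, the call would be group_by_date(f) inside a with open(...) block, since a file object iterates over its lines.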