Use text lines in one file as filename for others

My script reads two files as input: articles.txt and article-titles.txt

articles.txt contains articles that are delimited with ">>>" (without the quotes), while article-titles.txt contains a list of titles, one per line. The last title may or may not end with a \n

articles.txt:

This is article.txt. This is article.txt. This is article.txt.
This is article.txt. This is article.txt. This is article.txt.
This is article.txt.This is article.txt. This is article.txt. This is article.txt.
>>>

This is article.txt.This is article.txt. This is article.txt.
This is article.txt. This is article.txt. This is article.txt.
This is article.txt. This is article.txt. This is article.txt.
This is article.txt. This is article.txt. This is article.txt.
>>>

This is article.txt. This is article.txt. This is article.txt. This is article.txt.
This is article.txt. This is article.txt. This is article.txt.This is article.txt.

article-title.txt:

This is the filename of the first article
This is the filename of the second article
This is the filename of the third article

My script should split the articles in articles.txt into separate text files and name each file after the corresponding line in article-title.txt. Each space character in the filename should be replaced with a dash "-", and filenames should end with .txt

Therefore a successful execution of the script should produce three files (or however many are required), one of which will be named: This-is-the-filename-of-the-first-article.txt

At the moment my script outputs a single file:

with open("inputfile.txt", "r") as f1, open("inputfile-title.txt", "r") as f2:
    buff = []
    i = 1
    for line1, line2 in zip(f1, f2):
        x = 0
        if line1.strip():
           buff.append(line1)
        if line1.strip() == ">>>":
           data = f2.readlines()
           output = open('%s.txt' % data[x].replace('\n', ''),'w')
           output.write(''.join(buff))
           output.close()
           x =1
           print("This is x:", x)
           print("This is data:", data)
           buff = [] #buffer reset

CodePudding user response：

The immediate flaw is that you read in all the article names the first time you see a delimiter, so any further attempt to read from the same file handle returns nothing. See also "Why can't I call read() twice on an open file?"
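As a small illustration of that point, the sketch below uses io.StringIO as a stand-in for an open text file (an assumption made so the example runs without any files on disk). The names handle, first_read, second_read and after_seek are made up for the demonstration.

```python
import io

# io.StringIO stands in for an open text file; for this purpose a real
# file handle opened with open(...) behaves the same way.
handle = io.StringIO("first title\nsecond title\nthird title\n")

first_read = handle.readlines()   # consumes everything up to the end of the file
second_read = handle.readlines()  # the position is already at the end: empty list

handle.seek(0)                    # rewinding makes the content readable again
after_seek = handle.readlines()

print(len(first_read), len(second_read), len(after_seek))
```

Reading the titles once, before the loop, avoids the problem without needing seek() at all.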

For efficiency and elegance, I would also refactor to simply read and write one line at a time.

with open("articles.txt", "r") as text, open("article-title.txt", "r") as titles:
    for line in titles:
        filename = line.rstrip('\n').replace(' ', '-') + '.txt'
        with open(filename, 'w') as article:
            for line in text:
                if line.strip() == '>>>':
                    break
                article.write(line)

This will obviously not work correctly if the number of file names is less than the number of sections in the input file. Conversely, if there are too many file names, the excess ones will not be used. Perhaps a better design would be to inline the file names into the data, or perhaps devise a mechanism for generating fallback file names if there are not enough of them in the input.
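One possible shape for the fallback idea is sketched below. The function name filenames, the "untitled-article" prefix and the sample data are all hypothetical; the point is only that chaining an endless counter after the real titles means zip() never runs out of names before it runs out of sections.

```python
from itertools import chain, count


def filenames(titles, fallback_prefix="untitled-article"):
    """Return an iterator of filenames: real titles first, then numbered fallbacks."""
    named = (
        title.strip().replace(" ", "-") + ".txt"
        for title in titles
        if title.strip()  # skip blank lines, e.g. a trailing newline
    )
    fallbacks = (f"{fallback_prefix}-{n}.txt" for n in count(1))
    return chain(named, fallbacks)


# Two titles but three sections: the third section gets a generated name.
names = filenames(["First article\n", "Second article\n"])
sections = ["text one", "text two", "text three"]
pairs = list(zip(names, sections))
print(pairs)
```

Whether silently inventing a name is better than stopping with an error depends on how the output is used, so treat this as one option rather than a recommendation.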

Demo: https://ideone.com/hYnOzP

CodePudding user response：

First read your titles from your article-titles.txt file into a list.

Then, as you read through your articles.txt file, you can pull (or pop if you want) your titles off the list.

The below simply yields titles as it runs through the articles file.

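def get_titles():
    # Open the titles file once and yield one stripped title per non-blank line.
    with open('article-titles.txt', 'r') as f:
        for line in f:
            if line.strip():
                yield line.strip()

titles = get_titles()  # a single generator shared by every next() call below
idx = 0
with open('articles.txt', 'r') as f:
    articles = [
        {'title': next(titles), 'lines': []},
    ]
    for line in f.readlines():
        if line.strip() == ">>>":
            # next() raises StopIteration if there are more articles than titles.
            articles.append(
                {'title': next(titles), 'lines': []}
            )
            idx += 1
        else:
            articles[idx]['lines'].append(line.strip())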

CodePudding user response：

You should separate your parsing logic from your writing logic. First, I'd read the article contents and parse them. Second, I'd read the article titles. Last, I'd use the collected information to write out the content.

One possible way of doing it:

with open("articles.txt") as articles_fp:
    articles = [article.strip() for article in articles_fp.read().split(">>>")]

with open("article-title.txt") as article_title_fp:
    article_filenames = [
        "-".join(title_line.strip().split()) + ".txt"
        for title_line in article_title_fp
        if title_line.strip()
    ]

for article_filename, article in zip(article_filenames, articles):
    with open(article_filename, "w") as article_fp:
        article_fp.write(article)

CodePudding user response：

How your code behaves will depend on the inputs, so it is difficult to exactly replicate your situation. However, there are a few potential issues.

You are currently iterating over both files line by line at the same time. However, since their lines don't line up 1:1, you run into problems (f1 still has lines left when f2 is done). You don't interact with f2 except through readlines(), which grabs all lines in the file starting at the current (moving) position in the file, which is not what you want. Add some print statements to your code, and you will see how it is behaving.

Instead, I suggest iterating through the two files separately; splitting each file in its own way is clearer. Here is one option. Because you know f1 is delimited by >>>, you can read in the entire file first (if it is not too large), then use str.split() on this delimiter. You can then directly match each individual line of f2 to the split output of f1.
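A minimal sketch of that suggestion, written against strings rather than files so it can be tried directly. The function name pair_titles_with_articles and the sample text are made up for illustration, and raising an error on a count mismatch is an assumption about the desired behaviour, not something the question specifies.

```python
def pair_titles_with_articles(articles_text, titles_text, delimiter=">>>"):
    """Split the article text on the delimiter and pair each piece with a title."""
    # Split the whole text at once, then drop surrounding whitespace and empty pieces.
    sections = [part.strip() for part in articles_text.split(delimiter)]
    sections = [part for part in sections if part]

    # One title per non-blank line; a missing final newline makes no difference.
    titles = [line.strip() for line in titles_text.splitlines() if line.strip()]

    if len(sections) != len(titles):
        raise ValueError(
            f"{len(sections)} articles but {len(titles)} titles"
        )
    return list(zip(titles, sections))


articles_text = "one\nstill one\n>>>\n\ntwo\n>>>\n\nthree\n"
titles_text = "First title\nSecond title\nThird title"
result = pair_titles_with_articles(articles_text, titles_text)
print(result)
```

In a script, the two strings would come from f1.read() and f2.read(). Note that this splits on every occurrence of the delimiter, so an article whose body contains ">>>" would be cut in two.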

CodePudding user response：

I think you overcomplicated it a bit. This should work:

with open("articles.txt", "r") as f1, open("article-title.txt", "r") as f2:
    files = f2.read().split("\n")
    data = f1.read().split(">>>\n\n")

for file, datum in zip(files, data):
    file_name = file.replace(" ", "-")
    with open(f"{file_name}.txt", "w") as f:
        f.write(datum)