My script reads two files as input: articles.txt and article-title.txt.
articles.txt contains articles that are delimited with ">>>" (without the quotes), while article-title.txt contains a list of titles delimited with "\n" (without the quotes). The last title may or may not be followed by a \n.
articles.txt:
This is article.txt. This is article.txt. This is article.txt.
This is article.txt. This is article.txt. This is article.txt.
This is article.txt.This is article.txt. This is article.txt. This is article.txt.
>>>
This is article.txt.This is article.txt. This is article.txt.
This is article.txt. This is article.txt. This is article.txt.
This is article.txt. This is article.txt. This is article.txt.
This is article.txt. This is article.txt. This is article.txt.
>>>
This is article.txt. This is article.txt. This is article.txt. This is article.txt.
This is article.txt. This is article.txt. This is article.txt.This is article.txt.
article-title.txt:
This is the filename of the first article
This is the filename of the second article
This is the filename of the third article
My script should split the articles in articles.txt into separate text files and name each file according to the corresponding line in article-title.txt. Each space character in the filename should be replaced with a dash "-", and filenames should end with .txt.
Therefore, a successful execution of the script should produce three files (or however many files are required), one of which will be named: This-is-the-filename-of-the-first-article.txt
At the moment my script outputs a single file:
with open("inputfile.txt", "r") as f1, open("inputfile-title.txt", "r") as f2:
    buff = []
    i = 1
    for line1, line2 in zip(f1, f2):
        x = 0
        if line1.strip():
            buff.append(line1)
        if line1.strip() == ">>>":
            data = f2.readlines()
            output = open('%s.txt' % data[x].replace('\n', ''),'w')
            output.write(''.join(buff))
            output.close()
            x += 1
            print("This is x:", x)
            print("This is data:", data)
            buff = [] #buffer reset
CodePudding user response:
The immediate flaw is that you read in all the article names the first time you see a delimiter, so any further attempt to read from the same file handle returns nothing. See also "Why can't I call read() twice on an open file?"
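The exhausted file handle can be reproduced in isolation. This is a minimal sketch, assuming an in-memory io.StringIO as a stand-in for the real titles file; the three sample titles are made up for illustration.

```python
import io

# Stand-in for the titles file (hypothetical contents).
titles = io.StringIO("first\nsecond\nthird\n")

first_read = titles.readlines()   # consumes everything up to the end of the file
second_read = titles.readlines()  # the position is already at the end, so nothing is left

print(first_read)
print(second_read)

# Rewinding makes the contents readable again.
titles.seek(0)
after_seek = titles.readlines()
print(after_seek)
```

Rewinding with seek(0) would work around the symptom, but reading the titles once, outside the loop, is the simpler fix.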
For efficiency and elegance, I would also refactor to simply read and write one line at a time.
with open("articles.txt", "r") as text, open("article-title.txt", "r") as titles:
    for title in titles:
        # One output file per title; spaces become dashes.
        filename = title.rstrip('\n').replace(' ', '-') + '.txt'
        with open(filename, 'w') as article:
            # Continue reading where the previous article stopped.
            for line in text:
                if line.strip() == '>>>':
                    break
                article.write(line)
This will obviously not work correctly if the number of file names is less than the number of sections in the input file: the remaining sections are never written. Conversely, if there are too many file names, each excess name produces an empty file. Perhaps a better design would be to inline the file names into the data, or perhaps devise a mechanism for generating fallback file names if there are not enough of them in the input.
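One way the fallback idea could look is sketched below. The function name name_sections and the untitled-N.txt naming scheme are assumptions made for illustration, not something from the question. It works on any iterables of lines, so it can be tried without touching the disk.

```python
from itertools import chain, count


def name_sections(lines, titles, delimiter=">>>"):
    """Return (filename, text) pairs, one per delimited section.

    Titles are used in order; once they run out, hypothetical fallback
    names of the form untitled-N.txt are generated instead.
    """
    # Real titles first, then an endless supply of fallback names.
    real = (t.strip().replace(" ", "-") + ".txt" for t in titles if t.strip())
    fallback = (f"untitled-{n}.txt" for n in count(1))
    names = chain(real, fallback)

    # Collect the sections first, so a name is only paired with
    # a section that actually exists.
    sections, current = [], []
    for line in lines:
        if line.strip() == delimiter:
            sections.append("".join(current))
            current = []
        else:
            current.append(line)
    if current:
        sections.append("".join(current))

    return list(zip(names, sections))


pairs = name_sections(
    ["a\n", ">>>\n", "b\n", ">>>\n", "c\n"],
    ["First title\n"],
)
print(pairs)
```

Each pair can then be written with open(filename, "w"). Surplus titles are simply left unused, so no empty files are created.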
Demo: https://ideone.com/hYnOzP
CodePudding user response:
First read your titles from your article-title.txt file. Then, as you read through your articles.txt file, you can pull (or pop, if you want) your titles one at a time.
The code below simply yields titles as it runs through the articles file.
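def get_titles():
    # Generator: yields one stripped title per line of the titles file.
    with open('article-title.txt', 'r') as f:
        for line in f:
            yield line.strip()

titles = get_titles()
idx = 0
with open('articles.txt', 'r') as f:
    # next(titles) raises StopIteration if the titles run out.
    articles = [
        {'title': next(titles), 'lines': []},
    ]
    for line in f:
        if line.strip() == ">>>":
            # Delimiter: start a new article with the next title.
            articles.append(
                {'title': next(titles), 'lines': []}
            )
            idx += 1
        else:
            articles[idx]['lines'].append(line.strip())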
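The code above only collects the articles in memory; it does not write any files yet. A possible final step is sketched here, assuming the list-of-dicts shape built above. The helper name write_articles and its out_dir parameter are made up for illustration.

```python
from pathlib import Path


def write_articles(articles, out_dir="."):
    """Write each {'title': ..., 'lines': [...]} dict to its own .txt file."""
    out_dir = Path(out_dir)
    written = []
    for article in articles:
        # Spaces in the title become dashes, as the question requires.
        filename = article["title"].replace(" ", "-") + ".txt"
        path = out_dir / filename
        # The lines were stripped while parsing, so newlines are re-added here.
        path.write_text("\n".join(article["lines"]) + "\n")
        written.append(path)
    return written
```

Calling write_articles(articles) after the parsing loop would create the files in the current directory and return their paths.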
CodePudding user response:
You should separate your parsing logic from your writing logic. First, I'd read the article contents and parse them. Second, I'd read the article titles. Last, I'd use the collected information to write out the content.
One possible way of doing it:
# 1. Read and split the articles on the delimiter.
with open("articles.txt") as articles_fp:
    articles = [article.strip() for article in articles_fp.read().split(">>>")]

# 2. Turn each non-empty title line into a dash-separated file name.
with open("article-title.txt") as article_title_fp:
    article_filenames = [
        "-".join(title_line.strip().split()) + ".txt"
        for title_line in article_title_fp
        if title_line.strip()
    ]

# 3. Write each article to its file.
for article_filename, article in zip(article_filenames, articles):
    with open(article_filename, "w") as article_fp:
        article_fp.write(article)
CodePudding user response:
How your code behaves will depend on the inputs, so it is difficult to exactly replicate your situation. However, there are a few potential issues.
You are currently iterating over both files in lockstep, one line from each per step. Since their lines don't correspond 1:1, you run into problems: zip() stops as soon as f2 runs out of lines, while f1 still has lines left to read. The only other thing you do with f2 is call readlines(), which grabs every remaining line starting from the current (moving) position in the file, which is not what you want. Add some print statements to your code, and you will see how it behaves.
Instead, I suggest reading the two files separately and splitting each in its own way, which is clearer. Here is one option: because you know f1 is delimited by >>>, read in the entire file first (if it is not too large), then call str.split() on that delimiter. You can then directly match each individual line of f2 to the split output of f1.
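A rough sketch of that approach follows, working on plain strings so the splitting can be checked before any files are written. The function name pair_titles_with_articles is hypothetical, and the sketch assumes that ">>>" never appears inside an article body.

```python
def pair_titles_with_articles(articles_text, titles_text, delimiter=">>>"):
    """Split both inputs and return (filename, article) pairs."""
    # Split the whole articles text on the delimiter and drop empty pieces
    # (for example, the one after a trailing delimiter).
    articles = [part.strip() for part in articles_text.split(delimiter) if part.strip()]
    # One title per non-empty line; a missing final newline does not matter.
    titles = [line.strip() for line in titles_text.splitlines() if line.strip()]
    if len(articles) != len(titles):
        raise ValueError(f"{len(articles)} articles but {len(titles)} titles")
    filenames = [title.replace(" ", "-") + ".txt" for title in titles]
    return list(zip(filenames, articles))
```

The two arguments would come from f1.read() and f2.read(). Raising on a count mismatch is a design choice of this sketch; silently truncating, as zip() does, is the alternative.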
CodePudding user response:
I think you overcomplicated it a bit; this should work:
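with open("articles.txt", "r") as f1, open("article-title.txt", "r") as f2:
    # splitlines() avoids a trailing empty title when the file ends with "\n".
    files = f2.read().splitlines()
    # Split on the bare delimiter; surrounding newlines are stripped below.
    data = f1.read().split(">>>")
    for file, datum in zip(files, data):
        file_name = file.replace(" ", "-")
        with open(f"{file_name}.txt", "w") as f:
            f.write(datum.strip() + "\n")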