Python Question - How to extract text between {textblock}{/textblock} of a .txt file?

I want to extract the text between {textblock_content} and {/textblock_content}.

With the script below, only the first line of introtext.txt is extracted and written to a newly created text file. I don't know why the script does not also extract the other lines of introtext.txt.

f = open("introtext.txt")
r = open("textcontent.txt", "w")
for l in f.readlines():
    if "{textblock_content}" in l:
        pos_text_begin = l.find("{textblock_content}") + 19
        pos_text_end = l.find("{/textblock_content}")
        text = l[pos_text_begin:pos_text_end]
        r.write(text)

f.close()
r.close()

How can I solve this problem?

CodePudding user response：

Your code actually works fine, as long as the begin and end tags are on the same line. But I don't think that is what you want: it can't read multiple blocks on one line, and it can't read a block that starts on one line and ends on another.
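
To illustrate that limitation, here is a minimal sketch that applies the same per-line logic as the loop body in the question. The helper name extract_from_line and the sample strings are made up for this example:

```python
BEGIN = "{textblock_content}"
END = "{/textblock_content}"


def extract_from_line(line: str) -> str:
    """Per-line extraction, same logic as the loop body in the question."""
    start = line.find(BEGIN) + len(BEGIN)
    end = line.find(END)
    return line[start:end]


# Both tags on one line: the content is extracted as expected.
same_line = "{textblock_content}hello{/textblock_content}\n"
print(repr(extract_from_line(same_line)))  # 'hello'

# Two blocks on one line: find() only ever sees the first pair of tags.
two_blocks = "{textblock_content}a{/textblock_content}{textblock_content}b{/textblock_content}\n"
print(repr(extract_from_line(two_blocks)))  # 'a'

# Closing tag on a later line: find() returns -1 for END, so the slice
# just drops the last character (the newline), and the following lines
# are skipped because they contain no opening tag.
split_line = "{textblock_content}first part\n"
print(repr(extract_from_line(split_line)))  # 'first part'
```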

First of all, take a look at the object returned by the open function. You can use its read method to access the whole text at once. Also take a look at the with statement, which makes working with files easier and safer. To rewrite your code so that it reads everything between {textblock_content} and {/textblock_content}, we can write something like this:

def get_all_tags_content(
    text: str,
    tag_begin: str = "{textblock_content}",
    tag_end: str = "{/textblock_content}"
) -> list[str]:

    useful_text = text
    ans = []

    # Heavy loop, needs some optimization:
    # it runs in O(len(text) ** 2) in the worst case; we can do better
    while tag_begin in useful_text:
        useful_text = useful_text.split(tag_begin, 1)[1]
        if tag_end not in useful_text:
            break
        block_content, useful_text = useful_text.split(tag_end, 1)
        ans.append(block_content)
    return ans


with open("introtext.txt", "r") as f:
    with open("textcontent.txt", "w") as r:
        # Write one extracted block per line instead of the list's repr
        r.write("\n".join(get_all_tags_content(f.read())))

This function could be written more efficiently so that it copes with really big files. In this implementation, the remaining text is copied every time a block is found; that is unnecessary and slows the program down. (Imagine millions of lines, each containing {textblock_content}"hello world"{/textblock_content}: for every line, the whole remaining text is copied before the program continues.) A plain loop over the text avoids the copying. Try to solve it yourself.
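
One possible way to avoid the copying, as a rough sketch rather than the definitive solution: instead of splitting the text, scan it with str.find and a moving offset, so only the block contents themselves are sliced out. The name iter_blocks and the sample string are assumptions made for this example:

```python
def iter_blocks(
    text: str,
    tag_begin: str = "{textblock_content}",
    tag_end: str = "{/textblock_content}",
):
    """Yield the content of each block, scanning with str.find and an offset."""
    pos = 0
    while True:
        start = text.find(tag_begin, pos)
        if start == -1:
            return  # no more opening tags
        start += len(tag_begin)
        end = text.find(tag_end, start)
        if end == -1:
            return  # unterminated block: ignore it, like the version above
        yield text[start:end]
        pos = end + len(tag_end)  # continue after the closing tag


sample = (
    "x{textblock_content}one{/textblock_content}\n"
    "y{textblock_content}two\nlines{/textblock_content}"
)
print(list(iter_blocks(sample)))  # ['one', 'two\nlines']
```

Because it is a generator, the blocks can also be written to the output file one at a time instead of being collected in a list first.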

CodePudding user response：

When you call file.readlines(), the file pointer moves to the end of the file. Any further call returns an empty list. If you change your code to something like one of the snippets below, it should work properly:

f = open("introtext.txt")
r = open("textcontent.txt", "w")
f_lines = f.readlines()
for l in f_lines:
    if "{textblock_content}" in l:
        pos_text_begin = l.find("{textblock_content}") + 19
        pos_text_end = l.find("{/textblock_content}")
        text = l[pos_text_begin:pos_text_end]
        r.write(text)

f.close()
r.close()
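
The file-pointer behaviour of readlines() can be checked with a small sketch. Here io.StringIO stands in for an opened text file so that the example is self-contained; the sample content is made up:

```python
import io

# io.StringIO behaves like a text file opened for reading.
f = io.StringIO("line 1\nline 2\n")

first = f.readlines()   # reads everything; the pointer is now at the end
second = f.readlines()  # nothing left to read

print(first)   # ['line 1\n', 'line 2\n']
print(second)  # []

f.seek(0)               # rewind to the beginning of the file
third = f.readlines()   # the lines are available again
print(third == first)   # True
```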

You can also implement it with a with context manager, as in the snippet below:

with open("textcontent.txt", "w") as r:
    with open("introtext.txt") as f:
        for line in f:
            if "{textblock_content}" in line:
                # 19 is the length of the opening tag
                pos_text_begin = line.find("{textblock_content}") + 19
                pos_text_end = line.find("{/textblock_content}")
                text = line[pos_text_begin:pos_text_end]
                r.write(text)