Extracting part of string according to tags

I have a large string and would like to extract parts of it. You can see an example below. I want to extract the values of the "text" keys. In this example there are three of them, namely "text": "{{{VAR_2}}}", "text": "Tell me. How old are you?", "text": "{{{VAR_0}}}".

Each {{{VAR_n}}} placeholder should be replaced by the text inside the matching <VAR_n> block, so the result should be the following list of strings: ["What is your age?", "Tell me. How old are you?", "How old are you?"]. The number of "text" entries varies. If there are none, the result should be None.

How can this be solved in an elegant way?

<VAR_0>
How old are you?
</VAR_0>
<VAR_2>
What is your age?
</VAR_2>

(set:$ageask to 'true')
(set:$storybeat_01 to 1)
(set:$inputAge to -1)

<audio stopall>

<audio src="sound.wav" autoplay loop volume="100%">

<onSuccess($user_age==0,$inputAge:=age)>[[age selector]]
<onSuccess(5>age OR 100<age,$inputAge:=age)>[[impossible age]]
<onSuccess(90<age,$inputAge:=age)>[[very old]]

<interaction type="AgeNumberQuestionInteraction">
{
    "BaseClassConfig QuestionInteractionInterface": {
        "QuestionInteractionInterfaceConfig": {
            "maxQuestionRepetitionCount": 1
        },
        "QuestionSpeakingInteraction": {
            "SpeakingInteractionConfig": {
                "speech": [
                    {
                        "emotion": "HAPPY",
                        "text": "{{{VAR_2}}}"
                    },
                    {
                        "emotion": "HAPPY",
                        "text": "Tell me. How old are you?"
                    },
                    {
                        "emotion": "HAPPY",
                        "text": "{{{VAR_0}}}"
                    }
                ]
            }
        }
    }
}
</interaction>

CodePudding user response:

You can use a regular expression pattern to search for the "text" entries.

Assuming html_doc holds the document from the question:

import re

for text in re.findall(r'"text": ".*?"', html_doc):
    print(text)

Prints:

"text": "{{{VAR_2}}}"
"text": "Tell me. How old are you?"
"text": "{{{VAR_0}}}"

Or, using BeautifulSoup to restrict the search to the <interaction> tag:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find("interaction")

for text in re.findall(r'"text": ".*?"', tag.text):
    print(text)
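
Both loops above print the whole "text": "..." pair, and neither covers the "return None if nothing is found" part of the question. A minimal sketch of one way to handle both, using a capture group; the function name extract_text_values and the shortened sample string are made up for illustration, and the pattern assumes the values contain no escaped double quotes:

```python
import re

def extract_text_values(doc):
    """Return the values of all "text" keys in doc, or None if there are none."""
    # The capture group keeps only what sits between the quotes after "text":
    values = re.findall(r'"text":\s*"(.*?)"', doc)
    # An empty list is falsy, so this yields None when nothing matched
    return values or None

sample = '''
{"emotion": "HAPPY", "text": "{{{VAR_2}}}"},
{"emotion": "HAPPY", "text": "Tell me. How old are you?"}
'''
print(extract_text_values(sample))          # ['{{{VAR_2}}}', 'Tell me. How old are you?']
print(extract_text_values("no tags here"))  # None
```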

EDIT: to get the variable names inside the {{{...}}} placeholders:

for text in re.findall(r"\{\{\{(.*?)\}\}\}", html_doc):
    print(text)
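
To go from the placeholder names to the list the question asks for, the names still have to be looked up in the <VAR_n> blocks. A rough regex-only sketch of that step follows. The helper name resolve_placeholders and the trimmed sample are hypothetical, and it assumes every variable is written as <VAR_n>...</VAR_n> with a numeric suffix:

```python
import re

def resolve_placeholders(doc):
    """Return the "text" values with {{{VAR_n}}} replaced, or None if there are none."""
    # Map each variable name to the stripped text between <VAR_n> and </VAR_n>
    variables = {
        name: body.strip()
        for name, body in re.findall(r"<(VAR_\d+)>(.*?)</\1>", doc, flags=re.DOTALL)
    }
    resolved = []
    for value in re.findall(r'"text":\s*"(.*?)"', doc):
        match = re.fullmatch(r"\{\{\{(\w+)\}\}\}", value)
        if match:
            # Keep the raw placeholder when no matching block exists
            resolved.append(variables.get(match.group(1), value))
        else:
            resolved.append(value)
    return resolved or None

sample = '''<VAR_0>
How old are you?
</VAR_0>
<VAR_2>
What is your age?
</VAR_2>
"text": "{{{VAR_2}}}"
"text": "Tell me. How old are you?"
"text": "{{{VAR_0}}}"
'''
print(resolve_placeholders(sample))
# ['What is your age?', 'Tell me. How old are you?', 'How old are you?']
```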

CodePudding user response:

Here is a BeautifulSoup / JSON solution, which should be more resistant to changes in the HTML layout. Regular expressions are a fragile choice for HTML pages, because small changes in the markup can silently break the pattern.

import json

import bs4

# html.parser lowercases tag names, so <VAR_0> is found as <var_0>
with open('x.html') as f:
    page = bs4.BeautifulSoup(f.read(), 'html.parser')

# The body of the <interaction> tag is plain JSON
data = json.loads(page.find('interaction').text)

results = []
lst = data["BaseClassConfig QuestionInteractionInterface"]["QuestionSpeakingInteraction"]["SpeakingInteractionConfig"]["speech"]
for item in lst:
    if item['text'].startswith('{{{'):
        # Placeholder: look up the matching <VAR_n> tag and use its text
        results.append(page.find(item['text'].strip('{}').lower()).text.strip())
    else:
        results.append(item['text'])
print(results)

Output:

['What is your age?', 'Tell me. How old are you?', 'How old are you?']
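
The key path in the code above is hard-coded, so it breaks if the nesting changes or a key is missing. One possible variation is to walk the parsed JSON recursively and collect every "text" value wherever it sits. This is only a sketch: collect_text and texts_from_json are illustrative names, and the input is assumed to be the JSON body of the interaction tag as a string, for example what page.find('interaction').text returns:

```python
import json

def collect_text(node):
    """Recursively gather every string stored under a "text" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "text" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_text(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_text(item))
    return found

def texts_from_json(json_text):
    # Returns None instead of an empty list, as the question requires
    return collect_text(json.loads(json_text)) or None

sample = '''
{"Outer": {"speech": [
    {"emotion": "HAPPY", "text": "{{{VAR_2}}}"},
    {"emotion": "HAPPY", "text": "Tell me. How old are you?"}
]}}
'''
print(texts_from_json(sample))         # ['{{{VAR_2}}}', 'Tell me. How old are you?']
print(texts_from_json('{"speech": []}'))  # None
```

The placeholders returned here can then be resolved against the VAR tags exactly as in the loop above.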