I have a big string and would like to extract parts of it; an example is shown below. I would like to extract the values of the "text" tags. In the example there are three of them: "text": "{{{VAR_2}}}", "text": "Tell me. How old are you?", and "text": "{{{VAR_0}}}".
This should result in the following list of strings: ["What is your age?", "Tell me. How old are you?", "How old are you?"]. The number of text tags varies. If no text tag is present, the result should be None.
How can this be solved in an elegant way?
<VAR_0>
How old are you?
</VAR_0>
<VAR_2>
What is your age?
</VAR_2>
(set:$ageask to 'true')
(set:$storybeat_01 to 1)
(set:$inputAge to -1)
<audio stopall>
<audio src="sound.wav" autoplay loop volume="100%">
<onSuccess($user_age==0,$inputAge:=age)>[[age selector]]
<onSuccess(5>age OR 100<age,$inputAge:=age)>[[impossible age]]
<onSuccess(90<age,$inputAge:=age)>[[very old]]
<interaction type="AgeNumberQuestionInteraction">
{
"BaseClassConfig QuestionInteractionInterface": {
"QuestionInteractionInterfaceConfig": {
"maxQuestionRepetitionCount": 1
},
"QuestionSpeakingInteraction": {
"SpeakingInteractionConfig": {
"speech": [
{
"emotion": "HAPPY",
"text": "{{{VAR_2}}}"
},
{
"emotion": "HAPPY",
"text": "Tell me. How old are you?"
},
{
"emotion": "HAPPY",
"text": "{{{VAR_0}}}"
}
]
}
}
}
}
</interaction>
CodePudding user response:
You can use a regular expression to search for the "text" entries, where html_doc is your HTML above:
import re
for text in re.findall('"text": ".*?"', html_doc):
    print(text)
Prints:
"text": "{{{VAR_2}}}"
"text": "Tell me. How old are you?"
"text": "{{{VAR_0}}}"
Or, using BeautifulSoup to restrict the search to the interaction tag:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find("interaction")
for text in re.findall('"text": ".*?"', tag.text):
    print(text)
EDIT: to get the VAR names inside the {{{...}}} placeholders:
for text in re.findall(r"\{\{\{(.*?)\}\}\}", html_doc):
    print(text)
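The snippets above print the whole "text": "..." match or only the placeholder names. A minimal sketch of how the same regex idea could produce the list asked for in the question, using only the standard library: the function name extract_texts and the shortened sample string are assumptions made for illustration, and the pattern does not handle escaped quotes inside a text value.

```python
import re

def extract_texts(doc):
    """Return the resolved "text" values found in doc, or None if there are none."""
    # Capture only the value between the quotes that follow "text":
    values = re.findall(r'"text":\s*"(.*?)"', doc)
    if not values:
        return None

    resolved = []
    for value in values:
        placeholder = re.fullmatch(r"\{\{\{(\w+)\}\}\}", value)
        if placeholder:
            name = placeholder.group(1)
            # Look up the matching <NAME>...</NAME> block; DOTALL lets . span newlines
            block = re.search(rf"<{name}>(.*?)</{name}>", doc, re.DOTALL)
            # Keep the raw placeholder when no matching block exists
            value = block.group(1).strip() if block else value
        resolved.append(value)
    return resolved


# Shortened stand-in for the document in the question (hypothetical sample)
sample = '''<VAR_0>
How old are you?
</VAR_0>
<VAR_2>
What is your age?
</VAR_2>
"speech": [
{"emotion": "HAPPY", "text": "{{{VAR_2}}}"},
{"emotion": "HAPPY", "text": "Tell me. How old are you?"},
{"emotion": "HAPPY", "text": "{{{VAR_0}}}"}
]'''

print(extract_texts(sample))
print(extract_texts("no tags here"))
```

On the sample this prints the three resolved strings, and None for the input without any text tag.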
CodePudding user response:
Here is a BeautifulSoup / JSON solution, which should be more resistant to changes in the HTML layout. Regular expressions are a poor choice for parsing HTML pages, because small changes in the markup can break the pattern.
import json

import bs4

# Parse the page; x.html holds the document from the question
with open('x.html') as f:
    page = bs4.BeautifulSoup(f.read(), 'html.parser')

# The <interaction> element contains the JSON configuration
data = json.loads(page.find('interaction').text)

results = []
lst = data["BaseClassConfig QuestionInteractionInterface"]["QuestionSpeakingInteraction"]["SpeakingInteractionConfig"]["speech"]
for item in lst:
    if item['text'].startswith('{{{'):
        # Placeholder such as {{{VAR_2}}}: look up the <var_2> tag (html.parser lowercases tag names)
        results.append(page.find(item['text'].strip('{}').lower()).text.strip())
    else:
        results.append(item['text'])
print(results)
Output:
['What is your age?', 'Tell me. How old are you?', 'How old are you?']
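The code above relies on a fixed key path into the JSON and does not cover the case where no text tag is present. A possible variation, sketched under assumptions: collect_texts and resolve_texts are hypothetical helper names, data stands for the result of json.loads on the interaction block, and variables is a hand-built dict standing in for the page.find lookups used above.

```python
import json

def collect_texts(node):
    """Recursively gather every string stored under a "text" key, in document order."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "text" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_texts(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_texts(item))
    return found


def resolve_texts(data, variables):
    """Replace {{{NAME}}} placeholders via variables; return None if no text exists."""
    texts = collect_texts(data)
    if not texts:
        return None
    # Unknown placeholders are left unchanged
    return [variables.get(t.strip("{}"), t) if t.startswith("{{{") else t
            for t in texts]


# Shortened stand-in for the JSON in the question (hypothetical sample)
data = json.loads('''{"speech": [
  {"emotion": "HAPPY", "text": "{{{VAR_2}}}"},
  {"emotion": "HAPPY", "text": "Tell me. How old are you?"},
  {"emotion": "HAPPY", "text": "{{{VAR_0}}}"}
]}''')
variables = {"VAR_0": "How old are you?", "VAR_2": "What is your age?"}

print(resolve_texts(data, variables))
print(resolve_texts({"speech": []}, variables))
```

Walking the structure recursively means the nesting of the configuration can change without the lookup breaking, at the cost of also picking up any other "text" keys that might appear elsewhere in the JSON.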