How to find and replace script elements with re.compile

I can't find every <script>.*</script> match in the HTML with re.compile.

Here is an example from the re documentation:

import re
result = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
print("type: {}, len: {}, values: {}".format(type(result), len(result), result))
print("type: {}, len: {}, values: {}".format(type(result[0]), len(result[0]), result[0]))

Here is the output:

type: <class 'list'>, len: 2, values: [('width', '20'), ('height', '10')]
type: <class 'tuple'>, len: 2, values: ('width', '20')

Here is my example for testing:

import re
string = (
    'which <script> prefix foot suffix </script> '
    'or <script> prefix hand suffix </script> fell fastest'
)

result = re.findall(r'<script>', string)
print("type: {}, len: {}, values: {}\n".format(type(result), len(result), result))

result = re.findall(r'</script>', string)
print("type: {}, len: {}, values: {}\n".format(type(result), len(result), result))

result = re.findall(r'<script>.*</script>', string)
print("type: {}, len: {}, values: {}\n".format(type(result), len(result), result))

Here is the output:

type: <class 'list'>, len: 2, values: ['<script>', '<script>']

type: <class 'list'>, len: 2, values: ['</script>', '</script>']

type: <class 'list'>, len: 1, values: ['<script> prefix foot suffix </script> or <script> prefix hand suffix </script>']

Question 1:

I want to find all the text between <script> and </script>, including the tags themselves.

Here is the expected output. The result should be a list containing 2 items with the values below:

<script> prefix foot suffix </script>
<script> prefix hand suffix </script>

Question 2:

Then I want to replace each match that contains "foot" with an empty string and return the final HTML:

<script> prefix foot suffix </script> ----> ""

I have tried several patterns without success. How can I do this?

CodePudding user response:

Your pattern matches from the very first <script> tag to the last </script> tag, with everything in between, because .* is greedy. To fix this, make it lazy by adding a ? after it:

result = re.findall(r'<script>.*?</script>', string)

The *? quantifier matches the previous token zero or more times, but as few times as possible, expanding only as needed (lazy).
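
To make the difference concrete, here is a small sketch that runs both patterns against the test string from the question. The variable names greedy and lazy are only illustrative.

```python
import re

# Same test string as in the question.
string = (
    'which <script> prefix foot suffix </script> '
    'or <script> prefix hand suffix </script> fell fastest'
)

# Greedy: .* runs on to the last </script>, so there is a single match.
greedy = re.findall(r'<script>.*</script>', string)

# Lazy: .*? stops at the first </script> after each <script>.
lazy = re.findall(r'<script>.*?</script>', string)

print(len(greedy))  # 1
print(lazy)
# ['<script> prefix foot suffix </script>', '<script> prefix hand suffix </script>']
```

Note that . does not match a newline by default, so for scripts that span several lines you would probably also need the re.DOTALL flag.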

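The result variable will then hold a list of the two strings. As for your question 2, you can loop through this list and remove every match that contains "foot" from the original string. Each match is a literal substring of the HTML, so str.replace is enough here:

html = string
for r in result:
    if "foot" in r:
        # r is the full <script>...</script> block, so drop it from the HTML
        html = html.replace(r, "")
print(html)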
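
The loop above can also be collapsed into a single pass by giving re.sub a function as the replacement. The sketch below is one possible way to do it, not the only one: SCRIPT_RE, drop_if_foot and final_html are hypothetical names, and the [^>]* part plus the re.DOTALL and re.IGNORECASE flags are assumptions about what real HTML might contain (tag attributes, multi-line scripts, upper-case tags) rather than something the test string needs.

```python
import re

# Same test string as in the question.
string = (
    'which <script> prefix foot suffix </script> '
    'or <script> prefix hand suffix </script> fell fastest'
)

# [^>]* tolerates attributes such as <script type="text/javascript">.
# re.DOTALL lets . match newlines; re.IGNORECASE also accepts <SCRIPT>.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script>', re.DOTALL | re.IGNORECASE)


def drop_if_foot(match):
    """Return "" for blocks containing "foot", otherwise keep the block."""
    block = match.group(0)
    return "" if "foot" in block else block


# re.sub calls drop_if_foot once per match and splices in its return value.
final_html = SCRIPT_RE.sub(drop_if_foot, string)
print(final_html)
# which  or <script> prefix hand suffix </script> fell fastest
```

The two spaces after "which" are the ones that surrounded the removed block. Regular expressions are fragile on arbitrary HTML, so for anything beyond simple input like this an HTML parser is likely the safer choice.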