I'm scraping a page with BeautifulSoup and need to get a link that is inside a script tag, so I use this:
soup.find(string=re.compile("https://link9876.net/index.php"))
This returns the following string:
"var link = [];
link[0] = 'https://link1225.com/x/xxxxxx';
link[1] = 'https://link9876.net/index.php?xxxxxxxxx';
link[2] = 'https://link1356.com/index.php?xxxxxxxxx';
..."
(The position and number of elements in the array change every time.)
But I only want to get the "https://link9876.net/index.php" link. What is the best way to do this?
CodePudding user response:
You could just use another regular expression on the returned string to extract the links you need, for example:
import re
script_text = """var link = [];
link[0] = 'https://link1225.com/x/xxxxxx';
link[1] = 'https://link9876.net/index.php?xxxxxxxx1';
link[2] = 'https://link9876.net/index.php?xxxxxxxx2';
link[3] = 'https://link9876.net/index.php?xxxx3xxx';
link[4] = 'https://link1356.com/index.php?xxxxx4xxx';
link[5] = 'https://link1356.com/index.php?xxxxx4xxx';
..."""
# Capture everything between the quotes that starts with the wanted prefix
for link in re.findall(r"'(https://link9876\.net/index\.php.*?)'", script_text):
    print(link)
This would give you:
https://link9876.net/index.php?xxxxxxxx1
https://link9876.net/index.php?xxxxxxxx2
https://link9876.net/index.php?xxxx3xxx
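If you only need one link, or want to reuse this for other prefixes, the same idea can be wrapped in a small helper. This is just a sketch: the names extract_links and prefix are made up for illustration, and script_text stands in for whatever string soup.find(...) returns in the question. It assumes each link is wrapped in single or double quotes, as in the sample.

```python
import re

def extract_links(script_text, prefix):
    """Return every quoted URL in script_text that starts with prefix."""
    # re.escape keeps the dots in the prefix literal instead of "any character".
    pattern = re.compile("['\"](" + re.escape(prefix) + "[^'\"]*)['\"]")
    return pattern.findall(script_text)

# Sample text copied from the question; in practice this would be the
# string returned by soup.find(string=re.compile(...)).
script_text = """var link = [];
link[0] = 'https://link1225.com/x/xxxxxx';
link[1] = 'https://link9876.net/index.php?xxxxxxxxx';
link[2] = 'https://link1356.com/index.php?xxxxxxxxx';
"""

links = extract_links(script_text, "https://link9876.net/index.php")
# Take the first match, or None if the script contains no such link.
first_link = links[0] if links else None
print(first_link)
```

Since the position of the link in the array changes, matching on the URL prefix avoids depending on the index, and the None fallback covers pages where the link is missing.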