BeautifulSoup - Extract text from <head> and </head>

Time:12-20

I'm trying to extract text from the source code of a page. Between the <head> and </head> tags, I'd like to extract this:

<script src="/_next/static/d5fgdrSQl/_buildM.js" defer=""></script>

d5fgdrSQl is a dynamic value, and I need to scrape precisely this key every day.

My script starts like this, but I don't know how to do that.

from bs4 import BeautifulSoup
import re
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36'}
urlweb = 'https://www.thenameofthewebiste.com'

r1 = requests.get(urlweb,headers=headers)

s1 = BeautifulSoup(r1.text, 'html.parser')
TAG = s1.find_all('_buildM')

print(TAG)

CodePudding user response:

You need to pass a User-Agent header; then you can use an attribute = value CSS selector with the ends-with operator ($=) to target the script src.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.paris-turf.com/', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
print(soup.select_one('[src$="_buildManifest.js"]')['src'].split('/')[3])
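The same selector can be tried offline on an inline snippet. This is a minimal sketch, not the answerer's code: the sample HTML and the helper name extract_build_id are assumptions for illustration, and html.parser is used so it runs without lxml installed. It also guards against the case where no matching script tag is found, where indexing the result of select_one directly would raise a TypeError.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the real page may differ.
SAMPLE_HTML = """
<html><head>
<script src="/_next/static/d5fgdrSQl/_buildManifest.js" defer=""></script>
</head><body></body></html>
"""

def extract_build_id(html):
    """Return the path segment preceding _buildManifest.js, or None (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # [src$="..."] matches elements whose src attribute ends with the given string.
    tag = soup.select_one('script[src$="_buildManifest.js"]')
    if tag is None:
        return None
    # "/_next/static/<id>/_buildManifest.js" -> the second-to-last segment is the id.
    return tag["src"].split("/")[-2]

print(extract_build_id(SAMPLE_HTML))
```

Taking the second-to-last segment instead of a fixed index keeps the lookup working if the prefix in front of the key gains or loses a path component.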

CodePudding user response:

To select the key dynamically, you can use the re module:

from bs4 import BeautifulSoup
import re

doc ='''
<script src="/_next/static/85565454878CDDS7785/_builddata.js" defer=""></script>
'''
      
soup = BeautifulSoup(doc,'lxml')
script= soup.find("script")
s=script['src']
d=re.search(r'(\d+\w+\d+)',s).group(1)
print(d)

Output

85565454878CDDS7785
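A pattern that requires the key to begin and end with digits would not match a key such as d5fgdrSQl from the question. One possible alternative, sketched here under the assumption that the key is always the path segment right after /_next/static/, is to anchor the regex on that prefix. The names BUILD_ID_RE and build_id_from_src are hypothetical.

```python
import re

# Assumed layout: "/_next/static/<key>/<file>.js"; capture everything up to the next "/".
BUILD_ID_RE = re.compile(r"/_next/static/([^/]+)/")

def build_id_from_src(src):
    """Return the key from a script src path, or None if the path does not match."""
    match = BUILD_ID_RE.search(src)
    return match.group(1) if match else None

print(build_id_from_src("/_next/static/85565454878CDDS7785/_builddata.js"))
print(build_id_from_src("/_next/static/d5fgdrSQl/_buildM.js"))
```

Returning None on a failed match avoids the AttributeError raised by calling .group(1) on the result of re.search when nothing matches.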