I'm trying to grab the content of a meta tag from a website. Here is the code:
import requests
from bs4 import BeautifulSoup
url = "https://discord.com/invite/midjourney"
result = requests.get(url=url)
soup = BeautifulSoup(result.content, 'html5lib')
target = soup.find("meta", property="og:description")
print(target)
This returns:
<meta content="Discord is the easiest way to communicate over voice, video, and text. Chat, hang out, and stay close with your friends and communities." property="og:description"/>
However, looking at the page source, the content is different and it includes the number of members. The number of members is what I'm after.
<meta property="og:description" content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,472,611 members" />
Is there some type of script dynamically changing the meta content? Any ideas on how to get at the actual data behind the meta tag?
CodePudding user response:
Try sending a browser-like User-Agent header with the request:
import requests
from bs4 import BeautifulSoup
url = "https://discord.com/invite/midjourney"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers) # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
#1. extract all meta tags from the page, return list of tags
print(soup.select('meta'))
[<meta charset="utf-8"/>,
<meta content="width=device-width, initial-scale=1.0, maximum-scale=3.0" name="viewport"/>,
<meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="description"/>,
<meta content="summary_large_image" name="twitter:card"/>,
<meta content="@discord" name="twitter:site"/>,
<meta content="Join the Midjourney Discord Server!" name="twitter:title"/>,
<meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="twitter:description"/>,
<meta content="Join the Midjourney Discord Server!" property="og:title"/>,
<meta content="https://discord.com/invite/midjourney" property="og:url"/>,
<meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" property="og:description"/>,
<meta content="Discord" property="og:site_name"/>,
<meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" property="og:image"/>,
<meta content="image/jpeg" property="og:image:type"/>,
<meta content="512" property="og:image:width"/>,
<meta content="512" property="og:image:height"/>,
<meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" name="twitter:image"/>]
#2. extract all content of the meta tags, return list of text
content_only = [i.get('content') for i in soup.select('meta') if i.get('content')]
print(content_only)
['width=device-width, initial-scale=1.0, maximum-scale=3.0',
'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
'summary_large_image',
'@discord',
'Join the Midjourney Discord Server!',
'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
'Join the Midjourney Discord Server!',
'https://discord.com/invite/midjourney',
'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
'Discord',
'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512',
'image/jpeg',
'512',
'512',
'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512']
#3. extract the members data that you need
members_content_only = list(set([i.get('content') for i in soup.select('meta') if i.get('content') and 'members' in i.get('content')]))
print(members_content_only)
['The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members']
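The matching string still has to be reduced to a number. A minimal sketch, assuming the description keeps the "| N members" suffix shown above; `parse_member_count` is a hypothetical helper name, not part of requests or BeautifulSoup:

```python
import re

def parse_member_count(description):
    """Return the member count from a string like '... | 2,473,729 members', or None."""
    # Match a run of digits and thousands separators directly before "members".
    match = re.search(r"(\d[\d,]*)\s+members", description, flags=re.IGNORECASE)
    if match is None:
        return None
    # Drop the thousands separators before converting to int.
    return int(match.group(1).replace(",", ""))

sample = ("The official server for Midjourney, a text-to-image AI where your "
          "imagination is the only limit. | 2,473,729 members")
print(parse_member_count(sample))  # 2473729
```

Returning None instead of raising keeps the caller in control if the page layout changes and the suffix disappears.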
CodePudding user response:
There is indeed JavaScript underneath. I found a different method to extract this using Selenium and bs4.
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import requests
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.support.ui import WebDriverWait
options = FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
url = "https://discord.com/invite/midjourney"
driver.get(url)
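# Wait up to 15 seconds until the member count text has been rendered
WebDriverWait(driver, 15).until(lambda d: "Members" in d.page_source)
page = driver.page_source
driver.quit()  # close the browser once the HTML has been captured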
html = bs(page, 'html.parser') #print(html)
for script in html(["script", "style"]):
    script.extract()
text = html.get_text()
lines = (line.strip() for line in text.splitlines())
text = '\n'.join(line for line in lines if line)
final_string = text.replace(",","")
start = final_string.find("Online") + 6
end = final_string.find("Members")-1
subs = final_string[start:end]
subs_final = int(subs)
print(subs_final)
Output:
2496142
This was a roundabout way to get what I wanted. Let me know if there are more efficient ways to do this.
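One possibly more direct route is Discord's invite endpoint, which as far as I know returns approximate counts as JSON when `with_counts=true` is passed. The sketch below assumes the endpoint is reachable without authentication and that the field is named `approximate_member_count`. The API version in the URL, the User-Agent string, and the helper names are my assumptions, so check them against the current Discord API documentation before relying on this:

```python
import json
import urllib.request

# Assumed API version; adjust if Discord's documentation says otherwise.
API_URL = "https://discord.com/api/v10/invites/{code}?with_counts=true"

def extract_member_count(payload):
    """Pull the approximate member count out of a decoded invite payload."""
    count = payload.get("approximate_member_count")
    return int(count) if count is not None else None

def fetch_member_count(invite_code, timeout=30):
    """Request the invite metadata and return the member count, or None."""
    request = urllib.request.Request(
        API_URL.format(code=invite_code),
        # Assumption: the default urllib User-Agent may be rejected.
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return extract_member_count(payload)

# Usage (performs a network request):
# print(fetch_member_count("midjourney"))
```

If this works it avoids launching a browser entirely, but the count is approximate, so it may not match the number rendered on the invite page exactly.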