Trying to scrape text from a site with BeautifulSoup4, but nothing happens at all

I want to scrape data from this website: https://playvalorant.com/en-us/news/game-updates/

from bs4 import BeautifulSoup
import requests

site_text = requests.get('https://playvalorant.com/en-us/news/game-updates/').text
soup = BeautifulSoup(site_text, 'lxml')
posts = soup.find_all('li', class_="ContentListing-module--contentListingItem--3GAoa")
for post in posts:
    post_title = post.find(
        'h3', class_="heading-05 bold ContentListingCard-module--title--1vIFy").text
    post_title = post_title.lower()
    if "patch notes" in post_title:
        patch_ver = post_title.replace('valorant patch notes ', '')
        print(f'Patch version: {patch_ver}')
        print("")

But when I run it, nothing happens at all.

What I want to do is check whether the h3 contains the text "patch notes" and, if so, work out which version it is and go to https://playvalorant.com/en-us/news/game-updates/valorant-patch-notes-(patch-number)-(patch-number)/. For example, if the text was "VALORANT Patch Notes 3213.07", then I want to go to https://playvalorant.com/en-us/news/game-updates/valorant-patch-notes-3213-07, and so on.

I'm getting ahead of myself, but the point is, how can I get the text from this website, and then print it out?

CodePudding user response：

The data you see is loaded via JavaScript, so BeautifulSoup doesn't see it. You can use the requests module to simulate the request the page makes:

import json
import requests

url = (
    "https://playvalorant.com/page-data/en-us/news/game-updates/page-data.json"
)
data = requests.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for a in data["result"]["pageContext"]["data"]["articles"]:
    if "Patch Notes" in a["title"]:
        patch_notes_url = "https://playvalorant.com" + a["url"]["url"]
        print("{:<30} {}".format(a["title"], patch_notes_url))

Prints:

VALORANT Patch Notes 4.04      https://playvalorant.com/news/game-updates/valorant-patch-notes-4-04/
VALORANT Patch Notes 4.03      https://playvalorant.com/news/game-updates/valorant-patch-notes-4-03/
VALORANT Patch Notes 4.02      https://playvalorant.com/news/game-updates/valorant-patch-notes-4-02/
VALORANT Patch Notes 4.01      https://playvalorant.com/news/game-updates/valorant-patch-notes-4-01/
VALORANT Patch Notes 4.0       https://playvalorant.com/news/game-updates/valorant-patch-notes-4-0/
VALORANT Patch Notes 3.12      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-12/
VALORANT Patch Notes 3.10      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-10/
VALORANT Patch Notes 3.09      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-09/
VALORANT Patch Notes 3.08      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-08/
VALORANT Patch Notes 3.07      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-07/
VALORANT Patch Notes 3.06      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-06/
VALORANT Patch Notes 3.05      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-05/
VALORANT Patch Notes 3.04      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-04/
VALORANT Patch Notes 3.03      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-03/
VALORANT Patch Notes 3.02      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-02/
VALORANT Patch Notes 3.01      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-01/

...and so on.
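
If you also want the version number on its own, or want to build the link from the title the way the question describes, something like the following might work. This is only a sketch: the helper names patch_version and patch_notes_url and the BASE_URL constant are made up for illustration, and the URL pattern is the one given in the question, so the url field from the JSON above is probably the safer source if the two ever disagree.

```python
import re

# Assumption: URL pattern copied from the question, not confirmed by the site.
BASE_URL = "https://playvalorant.com/en-us/news/game-updates/"


def patch_version(title):
    """Return the version from a title like 'VALORANT Patch Notes 4.04', or None."""
    match = re.search(r"patch notes\s+(\d+(?:\.\d+)*)", title, flags=re.IGNORECASE)
    return match.group(1) if match else None


def patch_notes_url(version):
    """Build the patch notes link by replacing dots in the version with hyphens."""
    slug = version.replace(".", "-")
    return f"{BASE_URL}valorant-patch-notes-{slug}/"
```

Inside the loop above you could call patch_version(a["title"]) and skip any article for which it returns None. For "VALORANT Patch Notes 4.04" it returns "4.04", and patch_notes_url("4.04") returns https://playvalorant.com/en-us/news/game-updates/valorant-patch-notes-4-04/.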

CodePudding user response：

Try lxml, which lets you use XPath to access the required HTML nodes easily.

from lxml import html
import requests

url = "https://playvalorant.com/en-us/news/game-updates/"

response = requests.get(url, stream=True)
tree = html.fromstring(response.content)

posts = tree.xpath('//section[contains(@class, "section light")]/div/ul')