How do I scrape 2 different data under the same class name?

Time:08-11

I'm trying to scrape the data from Steam that shows the most popular games by current player count: https://store.steampowered.com/stats/

The table on the website looks like this


CURRENT PLAYERS   PEAK TODAY   GAME

403,791           882,486      Counter-Strike: Global Offensive
313,691           614,086      Dota 2
248,095           511,676      Apex Legends
127,414           379,136      PUBG: BATTLEGROUNDS
94,817            174,926      Grand Theft Auto V
77,263            175,802      Lost Ark
76,397            102,653      Team Fortress 2
70,590            109,876      Rust
69,508            144,520      MONSTER HUNTER RISE
56,206            89,366       Wallpaper Engine

There are two numeric columns, and I want to scrape both of them.

However, both values have the same class name, "currentServers" (using CS:GO as an example):

<tr class="player_count_row" style="">
    <td align="right">
        <span class="currentServers">403,791</span>
    </td>
    <td align="right">
        <span class="currentServers">882,486</span>
    </td>
    <td width="20">&nbsp;</td>
    <td>
        <a class="gameLink" onmouseover="GameHover( this, event, 'global_hover', {&quot;type&quot;:&quot;app&quot;,&quot;id&quot;:730,&quot;public&quot;:1,&quot;v6&quot;:1} );" onmouseout="HideGameHover( this, event, 'global_hover' )" href="https://store.steampowered.com/app/730/CounterStrike_Global_Offensive/">Counter-Strike: Global Offensive</a>
    </td>
</tr>

This is my code. The variable player only picks up the first number and skips the second one.

import requests
from bs4 import BeautifulSoup

url = 'https://store.steampowered.com/stats/'

page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="detailStats")
row = results.find_all('tr', class_='player_count_row')

for data in row:
    player = data.find('span', class_='currentServers')
    name = data.find('a', class_='gameLink')
    print(player.text)
    print(name.text)
    print()

How can I identify these two values and scrape both of them?

CodePudding user response:

First find the specific table, then use find_all to get its rows from the tr tags, and read the column names from the first row.

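import requests
from bs4 import BeautifulSoup

res = requests.get("https://store.steampowered.com/stats/")
soup = BeautifulSoup(res.text, "lxml")  # the "lxml" parser needs the lxml package installed
table_data = soup.find_all("table")[1]  # second table on the page
rows = table_data.find_all("tr")

# column names come from the non-empty cells of the first row
cols = [i.get_text(strip=True) for i in rows[0].find_all("td") if i.get_text(strip=True) != ""]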

Some of the rows that follow are empty, so I iterate from a specific index and collect the details as a list of lists in lst1.

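main_rows = rows[2:]  # skip the first two rows
lst1 = []
for row in main_rows:
    # the first two cells hold the current and peak player counts
    lst = [i.get_text(strip=True) for i in row.find_all("td")[:2]]
    # the game name is the text of the row's link
    lst.append(row.find("a").get_text(strip=True))
    lst1.append(lst)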

Now use pandas to build a table from lst1 and cols:

import pandas as pd
df = pd.DataFrame(lst1, columns=cols)

Output:

Current Players Peak Today  Game
0   417,471 882,486 Counter-Strike: Global Offensive
1   320,785 614,086 Dota 2
.....
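
For comparison, the asker's original loop can be kept almost unchanged: find returns only the first match, while find_all returns every matching span in the row. The following is a minimal sketch, not a tested scraper for the live page. It runs against an inline sample modelled on the row shown in the question, and the class names (player_count_row, currentServers, gameLink) are taken from the asker's code, so they are assumptions about the real markup. The helper name parse_rows and the "#" href on the second sample row are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the row shown in the question. The class names
# come from the asker's code and may not match the live page.
SAMPLE_HTML = """
<table id="detailStats">
  <tr class="player_count_row">
    <td align="right"><span class="currentServers">403,791</span></td>
    <td align="right"><span class="currentServers">882,486</span></td>
    <td width="20">&nbsp;</td>
    <td><a class="gameLink" href="https://store.steampowered.com/app/730/CounterStrike_Global_Offensive/">Counter-Strike: Global Offensive</a></td>
  </tr>
  <tr class="player_count_row">
    <td align="right"><span class="currentServers">313,691</span></td>
    <td align="right"><span class="currentServers">614,086</span></td>
    <td width="20">&nbsp;</td>
    <td><a class="gameLink" href="#">Dota 2</a></td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return (current, peak, name) tuples, one per player_count_row."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr", class_="player_count_row"):
        # find() stops at the first match; find_all() returns every span.
        spans = row.find_all("span", class_="currentServers")
        link = row.find("a", class_="gameLink")
        if len(spans) < 2 or link is None:
            continue  # skip rows that do not have the expected shape
        current, peak = (s.get_text(strip=True) for s in spans[:2])
        results.append((current, peak, link.get_text(strip=True)))
    return results


for current, peak, name in parse_rows(SAMPLE_HTML):
    print(current, peak, name)
```

To use it on the real page, the text of the requests response would be passed to parse_rows in place of SAMPLE_HTML, assuming the class names still match.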

Method 2:

Using pandas, you can read the table data directly and drop the empty (NaN) columns to get the desired table.

import pandas as pd

lst = pd.read_html("https://store.steampowered.com/stats/")
df = lst[1]
df.drop(columns=[2, 4], inplace=True)  # drop the empty columns
df.rename(columns=df.iloc[0], inplace=True)  # use the first row as column names
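
If the count columns end up as strings such as "403,791", they may need converting before any arithmetic or sorting. A minimal sketch, assuming a DataFrame shaped like the output shown for the first method. The sample rows are copied from the table in the question, and the column names are assumed to be "Current Players", "Peak Today" and "Game".

```python
import pandas as pd

# Sample data shaped like the scraped table; values copied from the question.
df = pd.DataFrame(
    [
        ["403,791", "882,486", "Counter-Strike: Global Offensive"],
        ["313,691", "614,086", "Dota 2"],
    ],
    columns=["Current Players", "Peak Today", "Game"],
)

for col in ["Current Players", "Peak Today"]:
    # Remove the thousands separator, then convert the strings to integers.
    df[col] = df[col].str.replace(",", "", regex=False).astype(int)

print(df)
```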

CodePudding user response:

I don't know what your code outputs, but as far as I can see it gets the text as is: a string of two numbers separated by a comma. If I understand correctly that you want to separate the values, then just split player.text and you'll get a list of two values.

print(player.text + '  -->  ' + str(player.text.split(',')))

You will get something like the output below (the first column is what your code prints, and the second is the list). Now you can do whatever you want with that list of two values. Regards...

'''
...   ...
12,418  -->  ['12', '418']
12,326  -->  ['12', '326']
12,121  -->  ['12', '121']
12,113  -->  ['12', '113']
12,017  -->  ['12', '017']
...   ...
'''
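
Assuming the comma in these values is a thousands separator, as the table in the question suggests, the two parts returned by split are the digit groups of a single count, not two separate columns. If a numeric value is wanted, a small sketch like the following may be more useful. The helper name to_int is made up for illustration.

```python
def to_int(text):
    """Convert a count such as '403,791' to the integer 403791."""
    return int(text.strip().replace(",", ""))


sample = "403,791"  # value taken from the table in the question
print(sample.split(","))  # the two digit groups of one number
print(to_int(sample))     # the number itself
```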