I'm trying to scrape the Steam stats page, which lists the most popular games by current player count: https://store.steampowered.com/stats/
The table on the website looks like this:
CURRENT PLAYERS PEAK TODAY GAME
403,791 882,486 Counter-Strike: Global Offensive
313,691 614,086 Dota 2
248,095 511,676 Apex Legends
127,414 379,136 PUBG: BATTLEGROUNDS
94,817 174,926 Grand Theft Auto V
77,263 175,802 Lost Ark
76,397 102,653 Team Fortress 2
70,590 109,876 Rust
69,508 144,520 MONSTER HUNTER RISE
56,206 89,366 Wallpaper Engine
There are two numeric values in each row, and I want to scrape both of them.
However, both values have the same class name "currentServers" (CS:GO as an example):
<tr class="player_count_row" style="">
  <td align="right">
    <span class="currentServers">403,791</span>
  </td>
  <td align="right">
    <span class="currentServers">882,486</span>
  </td>
  <td width="20"> </td>
  <td>
    <a class="gameLink" onmouseover="GameHover( this, event, 'global_hover', {"type":"app","id":730,"public":1,"v6":1} );" onmouseout="HideGameHover( this, event, 'global_hover' )" href="https://store.steampowered.com/app/730/CounterStrike_Global_Offensive/">Counter-Strike: Global Offensive</a>
  </td>
</tr>
This is my code. The variable player only picks up the first number and skips the second one.
import requests
from bs4 import BeautifulSoup
url = 'https://store.steampowered.com/stats/'
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="detailStats")
row = results.find_all('tr', class_='player_count_row')
for data in row:
    player = data.find('span', class_='currentServers')
    name = data.find('a', class_='gameLink')
    print(player.text)
    print(name.text)
    print()
How can I identify these 2 data and scrape both of them?
CodePudding user response:
First find the specific table, then use find_all to get the rows from the tr tags, and read the column names from the first row:
import requests
from bs4 import BeautifulSoup
res=requests.get("https://store.steampowered.com/stats/")
soup=BeautifulSoup(res.text,"lxml")
table_data=soup.find_all("table")[1]
rows=table_data.find_all("tr")
cols=[i.get_text(strip=True) for i in rows[0].find_all("td") if i.get_text(strip=True)!=""]
Now for the data rows: a couple of the tags are empty, so I iterate from a specific index onward and collect the details as a list of lists in lst1:
main_rows=rows[2:]
lst1=[]
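for row in main_rows:
    # The first two td cells hold the current and peak player counts
    lst=[i.get_text(strip=True) for i in row.find_all("td")[:2]]
    # The game name is the text of the link in the same row
    lst.extend([row.find("a").get_text(strip=True)])
    lst1.append(lst)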
Now use pandas to create the table structure from lst1 and cols:
import pandas as pd
df=pd.DataFrame(lst1,columns=cols)
Output:
Current Players Peak Today Game
0 417,471 882,486 Counter-Strike: Global Offensive
1 320,785 614,086 Dota 2
.....
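The same idea can also be applied to the class names from the question: because both numbers share the class "currentServers", find_all returns both spans for a row, where find returns only the first. The following is a minimal sketch, not a tested scraper for the live page. SAMPLE_HTML is a hypothetical inline sample modeled on the question's snippet, and extract_rows is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the markup shown in the question;
# the live page may differ or change over time.
SAMPLE_HTML = """
<div id="detailStats">
<table>
  <tr class="player_count_row">
    <td align="right"><span class="currentServers">403,791</span></td>
    <td align="right"><span class="currentServers">882,486</span></td>
    <td width="20"></td>
    <td><a class="gameLink" href="https://store.steampowered.com/app/730/CounterStrike_Global_Offensive/">Counter-Strike: Global Offensive</a></td>
  </tr>
  <tr class="player_count_row">
    <td align="right"><span class="currentServers">313,691</span></td>
    <td align="right"><span class="currentServers">614,086</span></td>
    <td width="20"></td>
    <td><a class="gameLink">Dota 2</a></td>
  </tr>
</table>
</div>
"""


def extract_rows(html):
    """Return (current, peak, name) tuples, one per player_count_row."""
    soup = BeautifulSoup(html, "html.parser")
    results = soup.find(id="detailStats")
    rows = []
    for tr in results.find_all("tr", class_="player_count_row"):
        # find_all returns every matching span, so both numbers are available
        spans = tr.find_all("span", class_="currentServers")
        link = tr.find("a", class_="gameLink")
        if len(spans) < 2 or link is None:
            continue  # skip rows that do not match the expected layout
        current = spans[0].get_text(strip=True)
        peak = spans[1].get_text(strip=True)
        rows.append((current, peak, link.get_text(strip=True)))
    return rows


rows = extract_rows(SAMPLE_HTML)
for current, peak, name in rows:
    print(current, peak, name)
```

To try this against the real page, pass requests.get(url).text to extract_rows instead of the sample, assuming the page still uses these ids and class names.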
Method 2:
Using the pandas module, you can read the table data directly and drop the columns that contain only NaN values to get the table you need:
import pandas as pd
lst=pd.read_html("https://store.steampowered.com/stats/")
df=lst[1]
df.drop(columns=[2,4],inplace=True)
df.rename(columns=df.iloc[0],inplace=True)
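After the rename, the header text may still be present as the first data row, and the counts may still be strings with thousands separators. A rough sketch of one way to tidy that up follows. The frame raw is a hypothetical stand-in for what read_html returns after the drop (its exact shape on the live page is an assumption), and promote_header_and_convert is an illustrative helper name.

```python
import pandas as pd

# Hypothetical stand-in for the frame left after dropping the empty columns;
# the real result of read_html on the live page may look different.
raw = pd.DataFrame({
    0: ["Current Players", "403,791", "313,691"],
    1: ["Peak Today", "882,486", "614,086"],
    3: ["Game", "Counter-Strike: Global Offensive", "Dota 2"],
})


def promote_header_and_convert(frame):
    """Use the first row as column names and turn the counts into integers."""
    header = frame.iloc[0].tolist()
    body = frame.iloc[1:].reset_index(drop=True)
    body.columns = header
    for col in ("Current Players", "Peak Today"):
        # Remove the thousands separator before converting to int
        body[col] = body[col].str.replace(",", "", regex=False).astype(int)
    return body


clean = promote_header_and_convert(raw)
print(clean)
```

With integer columns, sorting and arithmetic (for example, the ratio of current to peak players) work as expected.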
CodePudding user response:
I don't know what your code's output is, but as far as I can see your code gets the text as it is - a string of two numbers separated by a comma character. Depending on what you want to do with it, and as I understand that you want to separate the values, then just split player.text and you'll get a list of two values.
print(player.text + ' --> ' + str(player.text.split(',')))
and you will get something like below (first column is your code's print out, and the second is the list). Now you can do whatever you want with that list of two values. Regards...
'''
... ...
12,418 --> ['12', '418']
12,326 --> ['12', '326']
12,121 --> ['12', '121']
12,113 --> ['12', '113']
12,017 --> ['12', '017']
... ...
'''
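Note that in the table from the question the comma in a value such as 403,791 appears to be a thousands separator, so each string is a single count. If one integer per value is wanted rather than the split pieces, a small sketch (parse_count is a hypothetical helper name) could look like this:

```python
def parse_count(text):
    """Convert a count such as '12,418' to the integer 12418."""
    return int(text.strip().replace(",", ""))


samples = ["12,418", "403,791", "882,486"]
counts = [parse_count(s) for s in samples]
print(counts)
```

In the question's loop this would be called as parse_count(player.text).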