I'm fairly new to web scraping, so I apologize if the answer to my problem is obvious. I made a web scraper that goes through the reviews of a Steam game (Civilization 6) and collects information such as hours spent on the game, whether the reviewer recommended it, how many products they own, and so on.
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

url = "https://steamcommunity.com/app/289070/reviews/?browsefilter=toprated&snr=1_5_100010_"

review_dict = {
    "found_helpful": [],
    "title": [],  # recommended or not
    "hours": [],
    "prods_in_account": [],
    "words_in_review": []
}

def data_scrapper():
    """
    Gets the reviews from the Steam page.
    """
    response = requests.get(url)
    soup = bs(response.content, "html.parser")
    card_div = soup.findAll("div",attrs={"class","apphub_Card modalContentLink interactable"})
    for cards in card_div:
        found_helpful = cards.find("div", attrs={"class": "found_helpful"})
        vote_header = cards.find("div", attrs={"class": "vote_header"})
        hours = cards.find("div", attrs={"class": "hours"})
        products = cards.find("div", attrs={"class": "apphub_CardContentMoreLink ellipsis"})
        words_in_review = cards.find("div", attrs={"class": "apphub_CardTextContent"})
        review_dict["found_helpful"].append(found_helpful)
        review_dict["title"].append(vote_header)
        review_dict["hours"].append(hours)
        review_dict["prods_in_account"].append(products)
        review_dict["words_in_review"].append(len(words_in_review))

data_scrapper()
review_df = pd.DataFrame.from_dict(review_dict)
review_df.to_csv("review.csv", sep=",")
My problem is that when I run my code I expect an organized CSV file, but instead I get this:
,found_helpful,title,hours,prods_in_account,words_in_review
0,"<div class=""found_helpful"">
3,398 people found this review helpful<br/>159 people found this review funny <div class=""review_award_aggregated tooltip"" data-tooltip-class=""review_reward_tooltip"" data-tooltip-html='<div class=""review_award_ctn_hover""> <div class=""review_award"" data-reaction=""6"" data-reactioncount=""5"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/6.png?v=5""/>
<span class=""review_award_count "">5</span>
</div>
<div class=""review_award"" data-reaction=""3"" data-reactioncount=""3"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/3.png?v=5""/>
<span class=""review_award_count "">3</span>
</div>
<div class=""review_award"" data-reaction=""5"" data-reactioncount=""2"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/5.png?v=5""/>
<span class=""review_award_count "">2</span>
</div>
<div class=""review_award"" data-reaction=""1"" data-reactioncount=""1"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/1.png?v=5""/>
<span class=""review_award_count hidden"">1</span>
</div>
<div class=""review_award"" data-reaction=""9"" data-reactioncount=""1"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/9.png?v=5""/>
<span class=""review_award_count hidden"">1</span>
</div>
<div class=""review_award"" data-reaction=""18"" data-reactioncount=""1"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/18.png?v=5""/>
<span class=""review_award_count hidden"">1</span>
</div>
<div class=""review_award"" data-reaction=""19"" data-reactioncount=""1"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/19.png?v=5""/>
<span class=""review_award_count hidden"">1</span>
</div>
</div>'><img class=""reward_btn_icon"" src=""https://community.akamai.steamstatic.com/public/shared/images//award_icon_blue.svg""/>14</div>
</div>","<div class=""vote_header"">
<div class=""reviewInfo"">
<div class=""thumb"">
<img height=""44"" src=""https://community.akamai.steamstatic.com/public/shared/images/userreviews/icon_thumbsDown.png?v=1"" width=""44""/>
</div>
<div class=""title"">Not Recommended</div>
<div class=""hours"">8,028.3 hrs on record</div>
</div>
<div style=""clear: left""></div>
</div>","<div class=""hours"">8,028.3 hrs on record</div>","<div class=""apphub_CardContentMoreLink ellipsis"">167 products in account</div>",38
I revised my function for extracting and appending the data, but I still get this messy file. Any clues as to what I'm doing wrong?
Answer:
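Make these changes to your existing code. find() returns a Tag object, so appending it directly writes the tag's full HTML to the CSV; calling .get_text() on each result keeps only the text. The last two lines then strip surrounding whitespace from every string column: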
for cards in card_div:
    found_helpful = cards.find("div", attrs={"class": "found_helpful"}).get_text()
    vote_header = cards.find("div", attrs={"class": "vote_header"}).get_text()
    hours = cards.find("div", attrs={"class": "hours"}).get_text()
    products = cards.find("div", attrs={"class": "apphub_CardContentMoreLink ellipsis"}).get_text()
    words_in_review = cards.find("div", attrs={"class": "apphub_CardTextContent"}).get_text()
    review_dict["found_helpful"].append(found_helpful)
    review_dict["title"].append(vote_header)
    review_dict["hours"].append(hours)
    review_dict["prods_in_account"].append(products)
    review_dict["words_in_review"].append(len(words_in_review))

review_df = pd.DataFrame.from_dict(review_dict)
cols = review_df.select_dtypes(['object']).columns
review_df[cols] = review_df[cols].apply(lambda x: x.str.strip())
OUTPUT:
                                       found_helpful                                   title                  hours         prods_in_account  words_in_review
0  1,266 people found this review helpful20 peopl...        Recommended\n456.9 hrs on record    456.9 hrs on record  536 products in account              770
1  1,127 people found this review helpful14 peopl...         Recommended\n92.1 hrs on record     92.1 hrs on record  135 products in account              574
2  853 people found this review helpful49 people ...      Recommended\n1,360.8 hrs on record  1,360.8 hrs on record   18 products in account              181
3  1,832 people found this review helpful18 peopl...        Recommended\n520.5 hrs on record    520.5 hrs on record  281 products in account             7114
4  3,370 people found this review helpful40 peopl...    Not Recommended\n415.7 hrs on record    415.7 hrs on record  102 products in account              853
5  5,724 people found this review helpful172 peop...    Not Recommended\n256.7 hrs on record    256.7 hrs on record  180 products in account             2072
6  393 people found this review helpful10 people ...         Recommended\n22.8 hrs on record     22.8 hrs on record   85 products in account              278
7  3,229 people found this review helpful62 peopl...     Not Recommended\n58.6 hrs on record     58.6 hrs on record  264 products in account              894
8  1,373 people found this review helpful22 peopl...    Not Recommended\n195.3 hrs on record    195.3 hrs on record   75 products in account              556
9  3,398 people found this review helpful159 peop...  Not Recommended\n8,028.8 hrs on record  8,028.8 hrs on record  167 products in account             8007
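The columns in the output above are still free-form strings, the title column also carries the hours line, and len() on a string counts characters rather than words. If numeric columns are wanted, one possible follow-up is sketched below using only the standard library. The helper names first_number, word_count and clean_review are made up for this illustration, and the sample row is copied from the last row of the output above.

```python
import re

# Matches the first number in a string, allowing thousands separators
# and an optional decimal part, e.g. "8,028.8" or "167".
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = _NUMBER.search(text or "")
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def word_count(text):
    """Count whitespace-separated words (len(text) would count characters)."""
    return len((text or "").split())


def clean_review(row):
    """Convert one row of scraped strings into plain values (hypothetical helper)."""
    lines = row["title"].strip().splitlines()
    first_line = lines[0].strip() if lines else ""
    return {
        "found_helpful": first_number(row["found_helpful"]),
        # "Not Recommended" contains "Recommended", so compare the whole first line.
        "recommended": first_line == "Recommended",
        "hours": first_number(row["hours"]),
        "prods_in_account": first_number(row["prods_in_account"]),
    }


# Sample values taken from the last row of the output above.
sample = {
    "found_helpful": "3,398 people found this review helpful159 people found this review funny",
    "title": "Not Recommended\n8,028.8 hrs on record",
    "hours": "8,028.8 hrs on record",
    "prods_in_account": "167 products in account",
}
cleaned = clean_review(sample)
print(cleaned)
```

With pandas, the same helpers could be applied column by column, for example review_df["hours"].map(first_number), before calling to_csv. For a true word count, word_count would have to be applied to the review text inside the loop in place of len().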