Beautiful Soup not providing a proper CSV file of scraped data

Time:10-20

I'm fairly new to web scraping, so I apologize if the answer to my problem is obvious. I made a web scraper that goes through the reviews of a Steam game (Civilization 6) and collects information such as the hours spent on the game, whether the reviewer recommended it, how many products they own, and so on.

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

url = "https://steamcommunity.com/app/289070/reviews/?browsefilter=toprated&snr=1_5_100010_"

review_dict = {
    "found_helpful": [],
    "title": [], #recommended or not
    "hours": [],
    "prods_in_account": [],
    "words_in_review": []
}

def data_scrapper():
    """
    Gets the reviews from the Steam page.
    """
    response = requests.get(url)
    soup = bs(response.content, "html.parser")
    card_div = soup.findAll("div",attrs={"class","apphub_Card modalContentLink interactable"})

    for cards in card_div:
        found_helpful = cards.find("div", attrs={"class": "found_helpful"})
        vote_header = cards.find("div", attrs={"class": "vote_header"})
        hours = cards.find("div", attrs={"class": "hours"})
        products = cards.find("div", attrs={"class": "apphub_CardContentMoreLink ellipsis"})
        words_in_review = cards.find("div", attrs={"class": "apphub_CardTextContent"})

    review_dict["found_helpful"].append(found_helpful)
    review_dict["title"].append(vote_header)
    review_dict["hours"].append(hours)
    review_dict["prods_in_account"].append(products)
    review_dict["words_in_review"].append(len(words_in_review))

data_scrapper()

review_df = pd.DataFrame.from_dict(review_dict)
review_df.to_csv("review.csv", sep=",")

When I run my code I expect an organized CSV file, but I get this instead:

,found_helpful,title,hours,prods_in_account,words_in_review
0,"<div class=""found_helpful"">
                3,398 people found this review helpful<br/>159 people found this review funny               <div class=""review_award_aggregated tooltip"" data-tooltip-class=""review_reward_tooltip"" data-tooltip-html='&lt;div class=""review_award_ctn_hover""&gt;             &lt;div class=""review_award"" data-reaction=""6"" data-reactioncount=""5""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/6.png?v=5""/&gt;
                    &lt;span class=""review_award_count ""&gt;5&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""3"" data-reactioncount=""3""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/3.png?v=5""/&gt;
                    &lt;span class=""review_award_count ""&gt;3&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""5"" data-reactioncount=""2""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/5.png?v=5""/&gt;
                    &lt;span class=""review_award_count ""&gt;2&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""1"" data-reactioncount=""1""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/1.png?v=5""/&gt;
                    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""9"" data-reactioncount=""1""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/9.png?v=5""/&gt;
                    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""18"" data-reactioncount=""1""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/18.png?v=5""/&gt;
                    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
                &lt;/div&gt;
                                &lt;div class=""review_award"" data-reaction=""19"" data-reactioncount=""1""&gt;
                    &lt;img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/19.png?v=5""/&gt;
                    &lt;span class=""review_award_count hidden""&gt;1&lt;/span&gt;
                &lt;/div&gt;
                &lt;/div&gt;'><img class=""reward_btn_icon"" src=""https://community.akamai.steamstatic.com/public/shared/images//award_icon_blue.svg""/>14</div>
</div>","<div class=""vote_header"">
<div class=""reviewInfo"">
<div class=""thumb"">
<img height=""44"" src=""https://community.akamai.steamstatic.com/public/shared/images/userreviews/icon_thumbsDown.png?v=1"" width=""44""/>
</div>
<div class=""title"">Not Recommended</div>
<div class=""hours"">8,028.3 hrs on record</div>
</div>
<div style=""clear: left""></div>
</div>","<div class=""hours"">8,028.3 hrs on record</div>","<div class=""apphub_CardContentMoreLink ellipsis"">167 products in account</div>",38

I revised my function for extracting and appending the data, but I still get this weird file. Any clues as to what I'm doing wrong?

Answer:

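Two things need to change in your existing code. Call .get_text() on each tag so that its text is stored rather than the whole Tag object (which is why raw HTML ends up in the CSV), and move the append calls inside the for loop so that every review is recorded instead of only the last one. After building the DataFrame, strip the surrounding whitespace from the string columns: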

for cards in card_div:
    found_helpful = cards.find("div", attrs={"class": "found_helpful"}).get_text()
    vote_header = cards.find("div", attrs={"class": "vote_header"}).get_text()
    hours = cards.find("div", attrs={"class": "hours"}).get_text()
    products = cards.find("div", attrs={"class": "apphub_CardContentMoreLink ellipsis"}).get_text()
    words_in_review = cards.find("div", attrs={"class": "apphub_CardTextContent"}).get_text()

    review_dict["found_helpful"].append(found_helpful)
    review_dict["title"].append(vote_header)
    review_dict["hours"].append(hours)
    review_dict["prods_in_account"].append(products)
    review_dict["words_in_review"].append(len(words_in_review))

review_df = pd.DataFrame.from_dict(review_dict)
cols = review_df.select_dtypes(['object']).columns
review_df[cols] = review_df[cols].apply(lambda x: x.str.strip())

OUTPUT:

                                       found_helpful                                   title                  hours         prods_in_account  words_in_review
0  1,266 people found this review helpful20 peopl...        Recommended\n456.9 hrs on record    456.9 hrs on record  536 products in account              770
1  1,127 people found this review helpful14 peopl...         Recommended\n92.1 hrs on record     92.1 hrs on record  135 products in account              574
2  853 people found this review helpful49 people ...      Recommended\n1,360.8 hrs on record  1,360.8 hrs on record   18 products in account              181
3  1,832 people found this review helpful18 peopl...        Recommended\n520.5 hrs on record    520.5 hrs on record  281 products in account             7114
4  3,370 people found this review helpful40 peopl...    Not Recommended\n415.7 hrs on record    415.7 hrs on record  102 products in account              853
5  5,724 people found this review helpful172 peop...    Not Recommended\n256.7 hrs on record    256.7 hrs on record  180 products in account             2072
6  393 people found this review helpful10 people ...         Recommended\n22.8 hrs on record     22.8 hrs on record   85 products in account              278
7  3,229 people found this review helpful62 peopl...     Not Recommended\n58.6 hrs on record     58.6 hrs on record  264 products in account              894
8  1,373 people found this review helpful22 peopl...    Not Recommended\n195.3 hrs on record    195.3 hrs on record   75 products in account              556
9  3,398 people found this review helpful159 peop...  Not Recommended\n8,028.8 hrs on record  8,028.8 hrs on record  167 products in account             8007
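
The columns above are still free-form strings: title repeats the hours, found_helpful runs two sentences together, and words_in_review is a character count because len() is applied to a string. If numeric columns are wanted, one possible follow-up is sketched below. It uses only the standard library, the helper names first_number and clean_row are made up for this illustration, and the review text passed in is a stand-in because the real review body is not shown in the output.

```python
import re


def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators such as the comma in "8,028.8".
    return float(match.group().replace(",", ""))


def clean_row(found_helpful, title, hours, products, review_text):
    """Turn one row of scraped strings into plain values (hypothetical helper)."""
    return {
        "found_helpful": int(first_number(found_helpful) or 0),
        # The vote_header text holds the verdict and the hours; keep only the first line.
        "recommended": title.strip().split("\n")[0].strip() == "Recommended",
        "hours": first_number(hours),
        "prods_in_account": int(first_number(products) or 0),
        # split() counts whitespace-separated words; len() on a string counts characters.
        "words_in_review": len(review_text.split()),
    }


# Strings copied from row 9 of the output above, except the review text.
row = clean_row(
    "3,398 people found this review helpful159 people found this review funny",
    "Not Recommended\n8,028.8 hrs on record",
    "8,028.8 hrs on record",
    "167 products in account",
    "placeholder review text",  # stand-in for the real review body
)
print(row)
```

Two smaller adjustments may also help, depending on what you want in the file: passing index=False to to_csv omits the unnamed leading index column, and calling get_text(separator=" ", strip=True) inserts a space where the <br/> was, so "helpful" and the following count no longer run together.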