Web scraping website only pulling one specific data set instead of multiple


I am creating a program which pulls the seasonal shows from MAL and displays the title, rating, etc., using BeautifulSoup and Requests, then writes them to a CSV. All of that works fine. The only problem I have is that it only pulls one show's data, even though they are all under a parent category.

This is my code:


            # BeautifulSoup (bs4) is a package I will be using to parse the HTML data
            # which I will be retrieving from a website. Parsing here means converting
            # the data from one format to another so it can be structured and read.
            from bs4 import BeautifulSoup
            
            
            # requests is an HTTP library which allows me to send requests to websites and
            # retrieve data using Python. This is helpful as the website is written in a
            # different language, so it lets me retrieve what I want and read it as well.
            import requests
            #import writer
            
            
            from csv import writer
            # defining the website from which I will be retrieving my data
            
            url = "https://myanimelist.net/anime/season"
            
            
            # requesting data using 'requests' and gaining access as well.
            # have to check the response before moving forward to ensure there is no problem retrieving data.
            
            page = requests.get(url)
            #print(page)
            #<Response [200]> response was "200" meaning "Successful responses"
            
            
            
            soup = BeautifulSoup(page.content, 'html.parser')
            # for this to identify the HTML and determine what we will be retrieving for
            # each item on the page, we have to find the parent category which contains
            # all the info we need to make our data categories.
            lists = soup.find_all('div', class_="js-seasonal-anime-list-key-1")
            # we add _ after class to make class_ because without the underscore Python
            # treats it as the reserved 'class' keyword, when really it refers to a CSS class
            
            # this creates (and automatically closes) a csv file, using 'w' to allow writing
            with open('shows.csv', 'w', encoding='utf8', newline='') as f:
            
            
            # will write onto our file 'f'
                writing = writer(f)
            # header row to organize the chart
                header = ['Title', 'Show Rating', 'Members', 'Release Date']
                
            #use our writer to write a row in file
                writing.writerow(header)    
                
            # must create a loop to find each title separately, as there are a lot that will come up
                for list in lists:
                # identify and find the classes which include the title of the shows, show
                # ratings, members watching, and episodes
                # added .text.replace in order to get rid of the '\n' spacing which was in the HTML
                    title = list.find('a', class_="link-title").text.replace('\n', '')
                    rating = list.find('div', class_="score").text.replace('\n', '')
                    members = list.find('div', class_="scormem-item member").text.replace('\n', '')
                    release_date = list.find('span', class_="item").text.replace('\n', '')
                   
                # testing for errors and making sure locations are correct to withdraw/request the data
                    info = [title, rating, members, release_date]
                    writing.writerow(info)

I have tried turning the list.find (line 58) into list.find_all. That does display all the data sets; however, I can no longer use .text to organize it.
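
From what I can tell, find_all returns a ResultSet (a list of elements) rather than a single element, and a ResultSet has no .text attribute of its own, so .text would have to be called on each element separately. Roughly like this (a sketch using the same soup and lists as above):

            # find_all returns a ResultSet, so .text must be called per element
            for show in lists:
                titles = show.find_all('a', class_="link-title")
                # titles.text would raise AttributeError: ResultSet has no attribute 'text'
                for t in titles:
                    print(t.text.replace('\n', ''))  # .text works on each individual element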

CodePudding user response:

To get all the anime into a list, you can use the following example: create a list all_data outside the loop, and inside the for-loop use .append() to add each show's data to it. As the last step, save this list to CSV using the .writerows() method:

import csv
import requests
from bs4 import BeautifulSoup

url = "https://myanimelist.net/anime/season"

page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
# select every element that carries a data-genre attribute (one element per show card)
lists = soup.select("[data-genre]")

all_data = []
for lst in lists:
    title = lst.find("a", class_="link-title").text.replace("\n", "")
    rating = lst.find("div", class_="score").text.replace("\n", "")
    members = lst.find("div", class_="scormem-item member").text.replace(
        "\n", ""
    )
    release_date = lst.find("span", class_="item").text.replace("\n", "")

    all_data.append(
        [title.strip(), rating.strip(), members.strip(), release_date.strip()]
    )

with open("data.csv", "w") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["Title", "Show Rating", "Members", "Release Date"])
    writer.writerows(all_data)

print(*all_data, sep="\n")

Prints:

['Chainsaw Man', '8.83', '894K', 'Oct 12, 2022']
['Spy x Family Part 2', '8.54', '533K', 'Oct 1, 2022']
['Mob Psycho 100 III', '8.65', '406K', 'Oct 6, 2022']
['Boku no Hero Academia 6th Season', '8.29', '368K', 'Oct 1, 2022']
['Blue Lock', '8.25', '319K', 'Oct 9, 2022']
['Bleach: Sennen Kessen-hen', '9.10', '313K', 'Oct 11, 2022']
['Kage no Jitsuryokusha ni Naritakute!', '7.82', '235K', 'Oct 5, 2022']

...

and saves data.csv.
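
Note that .find() returns None when a field is missing on a card (for example, a show with no score listed yet), and calling .text on None raises AttributeError. If that becomes a problem, a small defensive helper could be used instead; a minimal sketch (the get_text helper name is my own, not from BeautifulSoup):

def get_text(parent, name, cls, default="N/A"):
    # return the stripped text of the first matching tag, or a default if the tag is absent
    tag = parent.find(name, class_=cls)
    return tag.text.replace("\n", "").strip() if tag else default

# e.g. inside the loop: rating = get_text(lst, "div", "score")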
