Beautiful Soup to Scrape Data from Static Webpages


I am trying to scrape values from tables on multiple static webpages. The data is verb conjugation data for Korean verbs, from https://koreanverb.app/

My Python script uses Beautiful Soup. The goal is to grab all conjugations from multiple URL inputs and output the data to a CSV file.

Conjugations are stored on each page in a table with class "table-responsive", in table rows with class "conjugation-row". There are multiple "conjugation-row" rows per page, but my script is somehow only grabbing the first one.

Why isn't the loop grabbing all the tr elements with class "conjugation-row"? I would appreciate a solution that grabs every tr with class "conjugation-row". I tried using job_elements = results.find("tr", class_="conjugation-row"), but I get the following error:

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

# create csv file
outfile = open("scrape.csv","w",newline='')
writer = csv.writer(outfile)

# define URLs
urls = ['https://koreanverb.app/?search=하다', 
        'https://koreanverb.app/?search=먹다',
        'https://koreanverb.app/?search=마시다']

# define dataframe columns
df = pd.DataFrame(columns=['conjugation_name','conjugation_korean'])

# loop to get data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    results = soup.find("div", class_="table-responsive")
    job_elements = results.find("tr", class_="conjugation-row")
    conjugation_name = job_elements.find("td", class_="conjugation-name")
    conjugation_korean = conjugation_name.find_next_sibling("td")
    conjugation_name_text = conjugation_name.text
    conjugation_korean_text = conjugation_korean.text 
        
    # append row to data (DataFrame.append was removed in pandas 2.0; use pd.concat)
    df2 = pd.DataFrame([[conjugation_name_text,conjugation_korean_text]],columns=['conjugation_name','conjugation_korean'])
    df = pd.concat([df, df2], ignore_index=True)

# save to csv
df.to_csv('scrape.csv')
outfile.close()
print('Export to CSV Complete')

CodePudding user response:

Get all the job_elements using find_all(), since find() only returns the first match, then iterate over them in a for loop like below.

job_elements = results.find_all("tr", class_="conjugation-row")
for job_element in job_elements:
    conjugation_name = job_element.find("td", class_="conjugation-name")
    conjugation_korean = conjugation_name.find_next_sibling("td")
    conjugation_name_text = conjugation_name.text
    conjugation_korean_text = conjugation_korean.text

    # append row to data (DataFrame.append was removed in pandas 2.0; use pd.concat)
    df2 = pd.DataFrame([[conjugation_name_text,conjugation_korean_text]],columns=['conjugation_name','conjugation_korean'])
    df = pd.concat([df, df2], ignore_index=True)

The AttributeError happens when you call find() on a ResultSet, the list-like object that find_all() returns, instead of on a single element.
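To see the distinction, here is a minimal sketch using a made-up HTML fragment (the class names mirror the ones on the site): find_all() returns a ResultSet you must iterate over, while find() returns one Tag you can call methods on directly.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr class="conjugation-row"><td class="conjugation-name">declarative present</td><td>해</td></tr>
  <tr class="conjugation-row"><td class="conjugation-name">declarative past</td><td>했어</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr", class_="conjugation-row")  # ResultSet (list-like): iterate over it
print(type(rows).__name__)  # ResultSet
print(len(rows))            # 2

row = soup.find("tr", class_="conjugation-row")       # single Tag: first match only
print(row.td.text)          # declarative present
```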

CodePudding user response:

Using find_all gets the correct td elements directly, and find_next then gets the following unclassified td. Also, I don't think pandas is really necessary for a task this small.

import requests
from bs4 import BeautifulSoup as BS

urls = ['https://koreanverb.app/?search=하다',
        'https://koreanverb.app/?search=먹다',
        'https://koreanverb.app/?search=마시다']
CSV = 'scrape.csv'

with open(CSV, 'w', encoding='utf-8') as csv:
    print('conjugation_name, conjugation_korean', file=csv)
    
with requests.Session() as session:
    for url in urls:
        r = session.get(url)
        r.raise_for_status()
        soup = BS(r.text, 'lxml')
        td = soup.find_all('td', class_='conjugation-name')
        with open(CSV, 'a', encoding='utf-8') as csv:
            for _td in td:
                print(f'{_td.text}, {_td.find_next().text}', file=csv)
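Note that find_next() and find_next_sibling() both work here, because the "conjugation-name" cell contains no child tags: the next element in document order is also the next sibling. A minimal sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = '<tr><td class="conjugation-name">declarative present informal low</td><td>해</td></tr>'
soup = BeautifulSoup(html, "html.parser")
name_td = soup.find("td", class_="conjugation-name")

# find_next() returns the next element in document order;
# find_next_sibling() returns the next tag at the same nesting level.
print(name_td.find_next().text)          # 해
print(name_td.find_next_sibling().text)  # 해
```

If the name cell ever contained nested tags, find_next() would descend into them first, so find_next_sibling() is the safer choice for "the td after this one".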