I am trying to scrape values from a table on multiple static webpages: the verb conjugation data for Korean verbs here: https://koreanverb.app/
My Python script uses Beautiful Soup. The goal is to grab all conjugations from multiple URL inputs and output the data to a CSV file.
Conjugations are stored on each page in a table with class "table-responsive", under table rows with class "conjugation-row". There are multiple "conjugation-row" table rows on each page, but my script is somehow only grabbing the first one.
Why isn't the for loop grabbing all the tr elements with class "conjugation-row"? I would appreciate a solution that grabs every tr with class "conjugation-row". I tried using job_elements = results.find("tr", class_="conjugation-row"), but I get the following error:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

# create csv file
outfile = open("scrape.csv", "w", newline='')
writer = csv.writer(outfile)

# define URLs
urls = ['https://koreanverb.app/?search=하다',
        'https://koreanverb.app/?search=먹다',
        'https://koreanverb.app/?search=마시다']

# define dataframe columns
df = pd.DataFrame(columns=['conjugation_name', 'conjugation_korean'])

# loop to get data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    results = soup.find("div", class_="table-responsive")
    job_elements = results.find("tr", class_="conjugation-row")
    conjugation_name = job_elements.find("td", class_="conjugation-name")
    conjugation_korean = conjugation_name.find_next_sibling("td")
    conjugation_name_text = conjugation_name.text
    conjugation_korean_text = conjugation_korean.text
    # append element to data
    df2 = pd.DataFrame([[conjugation_name_text, conjugation_korean_text]], columns=['conjugation_name', 'conjugation_korean'])
    df = df.append(df2)

# save to csv
df.to_csv('scrape.csv')
outfile.close()
print('Export to CSV Complete')
CodePudding user response:
Get all the job_elements using find_all(), since find() only returns the first occurrence, and iterate over them in a for loop like below.
job_elements = results.find_all("tr", class_="conjugation-row")
for job_element in job_elements:
    conjugation_name = job_element.find("td", class_="conjugation-name")
    conjugation_korean = conjugation_name.find_next_sibling("td")
    conjugation_name_text = conjugation_name.text
    conjugation_korean_text = conjugation_korean.text
    # append element to data
    df2 = pd.DataFrame([[conjugation_name_text, conjugation_korean_text]], columns=['conjugation_name', 'conjugation_korean'])
    df = df.append(df2)
The error comes from calling find() on a ResultSet (a list of elements) rather than on a single element.
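For reference, here is a minimal sketch of how the fixed loop might look in the context of your full script. It keeps the URLs and class names from the question, but collects per-row DataFrames in a list and concatenates them once at the end with pd.concat, since DataFrame.append is deprecated (and removed in pandas 2.x):

import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = ['https://koreanverb.app/?search=하다',
        'https://koreanverb.app/?search=먹다',
        'https://koreanverb.app/?search=마시다']

frames = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    results = soup.find("div", class_="table-responsive")
    # find_all returns every matching row, not just the first
    for job_element in results.find_all("tr", class_="conjugation-row"):
        conjugation_name = job_element.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        frames.append(pd.DataFrame(
            [[conjugation_name.text, conjugation_korean.text]],
            columns=['conjugation_name', 'conjugation_korean']))

# concatenate once at the end instead of appending inside the loop
df = pd.concat(frames, ignore_index=True)
df.to_csv('scrape.csv', index=False)
print('Export to CSV Complete')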
CodePudding user response:
Using find_all gets the correct td elements, and then find_next gets the following, unclassified td. Also, I don't think pandas is really necessary for this triviality.
import requests
from bs4 import BeautifulSoup as BS

urls = ['https://koreanverb.app/?search=하다',
        'https://koreanverb.app/?search=먹다',
        'https://koreanverb.app/?search=마시다']

CSV = 'scrape.csv'

with open(CSV, 'w') as csv:
    print('conjugation_name, conjugation_korean', file=csv)

with requests.Session() as session:
    for url in urls:
        r = session.get(url)
        r.raise_for_status()
        soup = BS(r.text, 'lxml')
        td = soup.find_all('td', class_='conjugation-name')
        with open(CSV, 'a') as csv:
            for _td in td:
                print(f'{_td.text}, {_td.find_next().text}', file=csv)
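If any cell text could contain a comma or quote, the comma-joined print above would produce a malformed CSV; a variant of the same scraping logic that lets the standard csv module handle quoting (and writes UTF-8 explicitly) might look like this:

import csv
import requests
from bs4 import BeautifulSoup as BS

urls = ['https://koreanverb.app/?search=하다',
        'https://koreanverb.app/?search=먹다',
        'https://koreanverb.app/?search=마시다']

with open('scrape.csv', 'w', newline='', encoding='utf-8') as f, requests.Session() as session:
    writer = csv.writer(f)
    writer.writerow(['conjugation_name', 'conjugation_korean'])
    for url in urls:
        r = session.get(url)
        r.raise_for_status()
        soup = BS(r.text, 'lxml')
        # each classified td is followed by the td holding the Korean form
        for td in soup.find_all('td', class_='conjugation-name'):
            writer.writerow([td.text, td.find_next().text])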