Attempting to retrieve text from <td></td> tags using BeautifulSoup

Time:11-03

So I'm using BeautifulSoup to scrape the link in the code. The artist names and the links come out fine, but I'm not sure how to access the nationality in that second tag.

Here's the code:

import requests
import csv
from bs4 import BeautifulSoup

def findName():
  page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')

  soup = BeautifulSoup(page.text, 'html.parser')

  last_links = soup.find(class_='AlphaNav')
  last_links.decompose()

  f = csv.writer(open('h-artist_lastname.csv', 'w')) # Create a file to write
  f.writerow(['Last Name, First Name', 'Nationality', 'Link'])

  artist_name_list = soup.find(class_='BodyText') 
  artist_name_list_items = artist_name_list.find_all('a') 
  artist_nationality_list_items = artist_name_list.find_all('td')

  print(artist_nationality_list_items)

  for artist_name in artist_name_list_items:        
    names = artist_name.contents[0]
    #nationalities = artist_nationality_list_items.contents[0]  
    links = 'https://web.archive.org' + artist_name.get('href')

    #print(nationalities)

    f.writerow([names, links])

findName()

If I uncomment the line in the for loop, I get a runtime error which I expect. The print statement gives me this value for artist_nationality_list_items:

<td><a href="/web/20121007172915/http://www.nga.gov/cgi-bin/tsearch?artistid=32727">Babbitt, Platt D.</a></td>, <td>American, died 1879</td>, ..... <- follows this pattern for every artist

Basically, I want the part with 'American, died 1879'.

CodePudding user response:

You can use select, which accepts CSS selectors, with :nth-child() to pick the second <td> in each <tr> instead of using find_all, so this:

artist_nationality_list_items = artist_name_list.find_all('td')

becomes:

artist_nationality_list_items = artist_name_list.select('td:nth-child(2)')
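To try the selector without hitting the archived page, here is a minimal sketch that runs on a small inline snippet. The first row copies the pattern shown in the question; the second row and the helper name second_cells are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's output; the second row is hypothetical.
SAMPLE_HTML = """
<div class="BodyText">
<table>
<tr><td><a href="/web/20121007172915/http://www.nga.gov/cgi-bin/tsearch?artistid=32727">Babbitt, Platt D.</a></td><td>American, died 1879</td></tr>
<tr><td><a href="/web/example?artistid=0">Example, Artist</a></td><td>French, 1800 - 1850</td></tr>
</table>
</div>
"""


def second_cells(html):
    """Return the text of every <td> that is the second child of its row."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find(class_='BodyText')
    # :nth-child(2) counts element children only, so whitespace between tags is ignored
    return [td.get_text(strip=True) for td in body.select('td:nth-child(2)')]


print(second_cells(SAMPLE_HTML))
```

Note that select returns the cells, not their text, so you still need get_text() or contents[0] on each item.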

CodePudding user response:

You can still work with contents, but don't get bogged down in all the separate lists. Select your target more specifically and collect all the information in one pass:

soup.select('div.BodyText tr')

What happens?

You're treating artist_nationality_list_items (a list) like a single element; that won't work.
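A small sketch of the failure, assuming a one-row snippet shaped like the output in the question (SAMPLE_HTML and the variable names here are illustrative only): asking the result of find_all for .contents raises AttributeError, while indexing into it first works.

```python
from bs4 import BeautifulSoup

# One row copied from the pattern shown in the question.
SAMPLE_HTML = (
    '<table><tr>'
    '<td><a href="/web/20121007172915/http://www.nga.gov/cgi-bin/tsearch?artistid=32727">'
    'Babbitt, Platt D.</a></td>'
    '<td>American, died 1879</td>'
    '</tr></table>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
cells = soup.find_all('td')  # a ResultSet, i.e. a list of Tag objects

try:
    cells.contents  # the list itself has no .contents attribute
    error_raised = False
except AttributeError:
    error_raised = True

# Index into the list first, then use .contents on the single Tag
second_cell_text = cells[1].contents[0]
print(error_raised, second_cell_text)
```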

How to fix?

To get the right result from artist_nationality_list_items you also have to index into it while you iterate, like this (works, but a bad idea):

for i, artist_name in enumerate(artist_name_list_items):
    names = artist_name.contents[0]
    # find_all('td') returns two cells per row (name, nationality),
    # so the nationality that belongs to link i sits at index 2*i + 1
    nationalities = artist_nationality_list_items[2 * i + 1].contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')
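If you do want to keep the two flat lists, one way to avoid computing an index by hand is to slice out every second cell and zip it with the links. This is only a sketch: it assumes every row has exactly two <td> cells (name, then nationality), as in the output shown in the question, and the helper name pair_names_and_nationalities and the second sample row are hypothetical.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's output; the second row is hypothetical.
SAMPLE_HTML = """
<div class="BodyText">
<table>
<tr><td><a href="/web/20121007172915/http://www.nga.gov/cgi-bin/tsearch?artistid=32727">Babbitt, Platt D.</a></td><td>American, died 1879</td></tr>
<tr><td><a href="/web/example?artistid=0">Example, Artist</a></td><td>French, 1800 - 1850</td></tr>
</table>
</div>
"""


def pair_names_and_nationalities(html):
    """Pair each artist link with the nationality cell from the same row."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find(class_='BodyText')
    anchors = body.find_all('a')
    cells = body.find_all('td')
    # cells alternates name, nationality, name, ... so [1::2] keeps the nationalities
    nationality_cells = cells[1::2]
    return [
        (a.get_text(strip=True), td.get_text(strip=True))
        for a, td in zip(anchors, nationality_cells)
    ]


for name, nationality in pair_names_and_nationalities(SAMPLE_HTML):
    print(name, '|', nationality)
```

This breaks as soon as a row has a different number of cells, which is why iterating row by row is the safer choice.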

Example (alternative and much leaner approach)

import requests
import csv
from bs4 import BeautifulSoup

def findName():
    page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')

    soup = BeautifulSoup(page.text, 'html.parser')

    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()

    f = csv.writer(open('h-artist_lastname.csv', 'w')) # Create a file to write
    f.writerow(['Last Name, First Name', 'Nationality', 'Link'])
    
    for row in soup.select('div.BodyText tr'):

        names = row.contents[0].text
        nationalities = row.contents[1].text
        links = 'https://web.archive.org' + row.a.get('href')

        print([names,nationalities,links])

        f.writerow([names,nationalities,links])

findName()
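For testing offline, here is a sketch of the same row-by-row idea on an inline snippet, writing to an in-memory buffer instead of a file. A few things here are assumptions rather than part of the answer: the sample markup (the second row is made up), the helper name rows_to_csv, and the use of row.find_all('td') in place of row.contents, which I chose because contents also includes whitespace text nodes when the markup has line breaks between the cells.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup modelled on the question's output, with line breaks between cells.
SAMPLE_HTML = """
<div class="BodyText">
<table>
<tr>
  <td><a href="/web/20121007172915/http://www.nga.gov/cgi-bin/tsearch?artistid=32727">Babbitt, Platt D.</a></td>
  <td>American, died 1879</td>
</tr>
<tr>
  <td><a href="/web/example?artistid=0">Example, Artist</a></td>
  <td>French, 1800 - 1850</td>
</tr>
</table>
</div>
"""


def rows_to_csv(html, out, base='https://web.archive.org'):
    """Write one CSV line per table row: name, nationality, absolute link."""
    soup = BeautifulSoup(html, 'html.parser')
    writer = csv.writer(out)
    writer.writerow(['Last Name, First Name', 'Nationality', 'Link'])
    for row in soup.select('div.BodyText tr'):
        cells = row.find_all('td')
        if len(cells) < 2 or row.a is None:
            continue  # skip rows that do not match the name/nationality pattern
        writer.writerow([
            cells[0].get_text(strip=True),
            cells[1].get_text(strip=True),
            base + row.a.get('href'),
        ])


buffer = io.StringIO()
rows_to_csv(SAMPLE_HTML, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open('h-artist_lastname.csv', 'w', newline='') inside a with block, so the file is closed when you are done.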