bs4 soup.select() vs. soup.find()

I am trying to scrape the text of some elements in a table using requests and BeautifulSoup, specifically the country names and the 2-letter country codes from this website.

Here is my code, which I have progressively walked back:

import requests
import bs4

res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)

for i in range(3):
    row = soup.find(f'#row{i} td')
    print(row) # printing to check progress for now

I had hoped to go row by row and walk the tags to get the strings this way (over range(249)). However, soup.find() doesn't appear to work and just prints blank lists. soup.select(), however, works fine:

for i in range(3):
    row = soup.select(f'#row{i} td')
    print(row)

Why does soup.find() not work as expected here?

CodePudding user response：

find() expects its first argument to be the name of the tag you're searching for; it doesn't accept CSS selectors.

So you'll need:

row = soup.find('tr', { 'id': f"row{i}" })

To get the tr with the desired ID.

Then, to get the 2-letter country code, search for the first a with the title ISO 3166-1 alpha-2 code and get its .text:

iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text

For the full name there is no class name to search for, so I'd take the td at index 2 and, inside it, the span at index 2, which contains the country name:

name = row.find_all('td')[2].find_all('span')[2].text

Putting it all together gives:

import requests
import bs4

res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

for i in range(3):
    row = soup.find('tr', { 'id': f"row{i}" })

    iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
    name = row.find_all('td')[2].find_all('span')[2].text

    print(name, iso)

Which outputs:

Afghanistan  AF
Åland Islands  AX
Albania  AL
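
To see the difference without hitting the network, here is a minimal sketch against a made-up HTML fragment. The markup is a hypothetical stand-in and not the real page, so the cell layout is an assumption. It shows that find() treats its first argument as a tag name, so a CSS selector string matches nothing and gives None, while select() parses the same string as a selector.

```python
import bs4

# Hypothetical stand-in markup; the real page's structure may differ.
html = """
<table id="countriesTable">
  <tbody>
    <tr id="row0"><td>1</td><td>Afghanistan</td><td>AF</td></tr>
    <tr id="row1"><td>2</td><td>Albania</td><td>AL</td></tr>
  </tbody>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# find() looks for a tag literally named "#row0 td", which does not exist.
via_find = soup.find("#row0 td")

# select() interprets the same string as a CSS selector and returns a list.
via_select = soup.select("#row0 td")

# find() with a tag name plus an attribute filter reaches the same row.
row = soup.find("tr", {"id": "row0"})
cells = [td.get_text(strip=True) for td in row.find_all("td")]

print(via_find)         # None
print(len(via_select))  # 3
print(cells)            # ['1', 'Afghanistan', 'AF']
```

Because find() returns None when nothing matches, chaining .text or .find() onto a missing row raises AttributeError, so it may be worth checking the result before using it.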

CodePudding user response：

find_all() and select() return a list of matching elements, whereas find() and select_one() return only a single element.

import requests
import bs4
import pandas as pd

res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

data = []
for row in soup.select('.tablesorter.mark > tbody tr'):
    name = row.find("span", class_="sortkey").text
    country_code = row.select_one('td:nth-child(4)').text.replace('\n', '').strip()

    data.append({
        'name': name,
        'country_code': country_code})

df = pd.DataFrame(data)
print(df)

Output:

                 name    country_code
0          afghanistan           AF
1        aland-islands           AX
2              albania           AL
3              algeria           DZ
4       american-samoa           AS
..                 ...          ...
244  wallis-and-futuna           WF
245     western-sahara           EH
246              yemen           YE
247             zambia           ZM
248           zimbabwe           ZW

[249 rows x 2 columns]
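
The single-element versus list distinction can be checked on a tiny, made-up fragment. This is only a sketch; the list markup below is hypothetical and unrelated to the real page.

```python
import bs4

# Hypothetical markup for illustration only.
html = "<ul><li>AF</li><li>AX</li><li>AL</li></ul>"
soup = bs4.BeautifulSoup(html, "html.parser")

first_li = soup.find("li")                  # a single Tag, or None if no match
all_li = soup.find_all("li")                # list-like ResultSet of Tags
first_css = soup.select_one("li")           # a single Tag, or None if no match
all_css = soup.select("li")                 # list of Tags
third = soup.select_one("li:nth-child(3)")  # CSS pseudo-classes work here

print(first_li.text)                 # AF
print([li.text for li in all_li])    # ['AF', 'AX', 'AL']
print(first_css.text, len(all_css))  # AF 3
print(third.text)                    # AL

# When nothing matches, the single-element calls give None
# and the list calls give an empty list.
print(soup.find("td"))               # None
print(len(soup.select("td")))        # 0
```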

CodePudding user response：

While .find() returns only the first occurrence of an element, .select() and .find_all() give you a ResultSet you can iterate over.

To reach your goal, select the rows and iterate over them. In this case I used .stripped_strings to extract the text from the elements, stored it in a list and picked the values by index:

# Option 1: select the rows whose id starts with "row"
for row in soup.select('#countriesTable tr[id^="row"]'):
    row = list(row.stripped_strings)
    print(row[2], row[3])

# Option 2: select every row in the table body
for row in soup.select('#countriesTable tbody tr'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
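
As a rough illustration of what .stripped_strings yields, here is a sketch on a single made-up row. The markup is hypothetical, so the indices that hold the name and code on the real page may differ from the ones here.

```python
import bs4

# Hypothetical row; the real page's cells may be nested differently.
html = """
<table id="countriesTable"><tbody>
  <tr id="row0">
    <td> 1 </td>
    <td><span class="flag"></span> <a href="#">Afghanistan</a> </td>
    <td>
      <a title="ISO 3166-1 alpha-2 code">AF</a>
    </td>
  </tr>
</tbody></table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Attribute selector: any tr whose id starts with "row".
row = soup.select_one('#countriesTable tr[id^="row"]')

# stripped_strings skips whitespace-only text nodes and strips the rest.
strings = list(row.stripped_strings)
print(strings)  # ['1', 'Afghanistan', 'AF']
```

Picking values by position is compact but depends on every row having the same number of non-empty text nodes, so an empty cell would shift the indices.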

An alternative, and in my opinion the best way to scrape tables, is pandas.read_html(), which parses the HTML with lxml or BeautifulSoup under the hood and does most of the work for you:

import pandas as pd
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,:]

or, to get only the two specific columns:

pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,[1,2]]

	Name	ISO 2
0	Afghanistan	AF
1	Åland Islands	AX
2	Albania	AL
3	Algeria	DZ
4	American Samoa	AS
5	Andorra	AD

...