Python - Beautiful Soup - How do I extract a single piece of text out of a tag

First of all, I should mention that I'm a complete beginner with Python and web crawling. I'm trying to implement a crawler for coinmarketcap.com with BeautifulSoup.

The DOM tree for the name of the coin looks like this:

<h2 class="sc-1q9q90x-0 jCInrl h1" color="text">Polygon<small class="nameSymbol">MATIC</small></h2>

My code to extract the name looks like this:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_name(url):
    # Resolve the relative link from the table against the listing page
    start_url = "https://coinmarketcap.com/all/views/all/"
    url = urljoin(start_url, url)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # .text returns the text of the <h2> and of every tag nested inside it
    name = soup.find('h2', class_="sc-1q9q90x-0 jCInrl h1").text
    print(name)

url = "https://coinmarketcap.com/all/views/all/"
website = requests.get(url)
results = BeautifulSoup(website.text, "html.parser")
counter = 0
table = results.find('tbody')
for row in table.find_all('tr'):
    found_coins = []
    if counter == 10:
        break
    else:
        try:
            url = row.find("a", class_="cmc-link").attrs["href"]
            name = get_name(url)
            counter += 1  # count only rows that were processed successfully
        except AttributeError:
            # Raised when find() returns None, i.e. the link or the <h2> is missing
            continue

(Edit: all of the code is shown now.)

The output of the function looks like this:

BitcoinBTC
EthereumETH
Binance CoinBNB
TetherUSDT
SolanaSOL
CardanoADA
XRPXRP
PolkadotDOT
USD CoinUSDC
DogecoinDOGE

As you can see, the text of the h2 tag gets combined with the text of the small tag.

How can I extract only the first piece of text out of the h2 tag?

I appreciate your help, thanks in advance!

CodePudding user response：

At the moment you are getting the text of the entire h2 element, including its children. Once you have the h2 element, use find again to get the small element inside it and output its text.

For example:

h2 = soup.find('h2', class_="sc-1q9q90x-0 jCInrl h1")
name = h2.find('small').text
print(name)

Since you want only the text of the h2 element itself and not that of any child elements, take its first child node instead:

h2 = soup.find('h2', class_="sc-1q9q90x-0 jCInrl h1")
name = h2.contents[0]
print(name)
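
To compare the two snippets above without requesting the site, here is a minimal sketch that parses the `<h2>` markup from the question directly. It assumes the bare text comes first inside the tag, as it does in that markup. The variable `own_text` and the `find(string=True, recursive=False)` variant are additions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Markup copied from the question, so no network request is needed.
html = (
    '<h2 class="sc-1q9q90x-0 jCInrl h1" color="text">'
    'Polygon<small class="nameSymbol">MATIC</small></h2>'
)
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

full_text = h2.text                 # text of the tag and all of its descendants
symbol = h2.find("small").text      # text of the nested <small> tag only
name = str(h2.contents[0]).strip()  # first child node, here the bare text

# Illustrative alternative: the first string that is a direct child of <h2>.
own_text = h2.find(string=True, recursive=False)

print(full_text)  # PolygonMATIC
print(symbol)     # MATIC
print(name)       # Polygon
print(own_text)   # Polygon
```

`contents[0]` depends on the text being the first child; if the site ever put the `<small>` tag first, the direct-child string lookup would still return the name.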

CodePudding user response：

You can do it this way.

Select the <h2> tag and get the strings inside it with .stripped_strings. This returns a generator, so wrap it in list() to get a list.
The list then holds two values, and you can pick whichever string you need.

Here is the full code.

from bs4 import BeautifulSoup

s = """<h2 color="text">Polygon<small>MATIC</small></h2>"""
soup = BeautifulSoup(s, 'html.parser')  # built-in parser; 'xml' needs lxml installed
h = soup.find('h2')

print(list(h.stripped_strings))

['Polygon', 'MATIC']
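
Because `.stripped_strings` yields the strings in document order, the coin name alone can be taken with `next()` instead of building the whole list. A small sketch, assuming the name is always the first string inside the `<h2>`; the `None` default and the tuple unpacking are additions for illustration.

```python
from bs4 import BeautifulSoup

s = '<h2 color="text">Polygon<small>MATIC</small></h2>'
soup = BeautifulSoup(s, "html.parser")
h = soup.find("h2")

# stripped_strings is a generator; next() pulls only the first string.
# The None default avoids StopIteration if the tag holds no text at all.
name = next(h.stripped_strings, None)
print(name)  # Polygon

# When exactly two strings are expected, both can be unpacked at once.
coin_name, coin_symbol = h.stripped_strings
print(coin_name, coin_symbol)  # Polygon MATIC
```

The unpacking form raises `ValueError` if the tag does not contain exactly two strings, which can be a useful signal that the page layout has changed.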

CodePudding user response：

You really should be using the CoinMarketCap API, which is free. Create an account, generate a key, then:

import requests

url = "https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest"

headers = {
  "Accepts": "application/json",
  "X-CMC_PRO_API_KEY": "YOUR_KEY_HERE",
}

result = requests.get(url, headers=headers).json()
for coin in result["data"]:
    name = coin["name"]
    symbol = coin["symbol"]
    price = coin["quote"]["USD"]["price"]
    print(f"{name}: 1 {symbol} = {price:0.2f} USD")

The result is:

Bitcoin: 1 BTC = 60420.34 USD
Ethereum: 1 ETH = 4234.89 USD
Binance Coin: 1 BNB = 581.39 USD
Tether: 1 USDT = 1.00 USD
Solana: 1 SOL = 218.57 USD
Cardano: 1 ADA = 1.88 USD
...