Python webscraping - Grabbing contents of <strong> and images

Time:04-10

I gave myself a side project to do during this semester with content we aren't covering at all. I'm trying to take content from a site (url in code) that I can put into either a text or SQL file to use (preferably text at the moment). After about an hour I figured out how to grab the <strong> element and print it to console, but trying to save it to a file leaves it empty. Trying print.(x).text and adding .text to my find_all both gave me errors.

So that leaves me with some other questions. If I'm trying to grab classes, it should be the same process I'm using now just swapping with the class name correct? There are also images on the pages; they aren't necessary, but I would like to use them if possible. Those have different class names, and the site lays everything out as a list rather than in a table, so would I be able to get say the second image without having to get the others?

import requests
from bs4 import BeautifulSoup

url = "https://www.db.yugioh-card.com/yugiohdb/card_search.action?ope=1&sess=1&pid=11101000&rp=99999"
response = requests.get(url)
soup = BeautifulSoup(response.content,  "html.parser")

card_name = soup.find_all('strong')
print(card_name)

lob_2002 = open("Legend of Blue Eyes 2002 - Cards Names", "w")
lob_2002.write

Apologies if not everything makes sense. I tried to explain my thoughts without being too lengthy.

CodePudding user response:

soup.find_all() returns a list (a list-like ResultSet), not a single tag. It could be empty, hold one element, or hold 128 as in your example. You've successfully got a list of strong elements, but to extract the value from each tag you need to iterate over the list and call .text on each one, like this:

names = [name.text for name in card_name]
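To see the list behaviour without hitting the real site, here is a minimal sketch on a made-up HTML snippet. The markup and the card names are placeholders, not the actual page structure; get_text(strip=True) is used as a whitespace-trimming alternative to .text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<ul>
  <li><strong> Blue-Eyes White Dragon </strong></li>
  <li><strong>Dark Magician</strong></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all("strong")  # list-like ResultSet of Tag objects

# Call get_text() on each Tag, not on the ResultSet itself.
# strip=True trims the whitespace around the text.
names = [tag.get_text(strip=True) for tag in tags]
print(names)
```

Calling .text on the result of find_all() fails because the attribute belongs to each individual tag, which is why the comprehension is needed.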

Now names is a list of strings and you can save it. Because it's a list, you need to either convert it to a string and write that, or iterate again and write each item on its own line, or use a format such as JSON, for example with json.dumps() (or json.dump() to write straight to a file object).

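import json  # needed for option 3

# option 1: write the list's string representation
lob_2002.write(str(names))

# option 2: write one name per line
for name in names:
    lob_2002.write(name + "\n")

# option 3: write the list as a JSON array
lob_2002.write(json.dumps(names))

# whichever option you pick, close the file so the data is flushed to disk
lob_2002.close()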
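As for why the file in the question stayed empty: lob_2002.write without parentheses only refers to the method and never calls it, so nothing is written. A rough sketch of the saving step on its own, using a with block so the file is closed automatically; the file names and the sample list are assumptions for illustration.

```python
import json
from pathlib import Path

names = ["Blue-Eyes White Dragon", "Dark Magician"]  # placeholder data

# Plain text, one name per line. Leaving the with block closes the file,
# which flushes everything to disk.
txt_path = Path("card_names.txt")  # hypothetical file name
with txt_path.open("w", encoding="utf-8") as f:
    for name in names:
        f.write(name + "\n")

# The same data as JSON, which can be loaded back into a list later.
json_path = Path("card_names.json")  # hypothetical file name
with json_path.open("w", encoding="utf-8") as f:
    json.dump(names, f, ensure_ascii=False, indent=2)

print(txt_path.read_text(encoding="utf-8"))
print(json.loads(json_path.read_text(encoding="utf-8")))
```

The text file is easier to read by eye; the JSON file is easier to load back into Python unchanged.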

UPD: As for other questions.

If I'm trying to grab classes, it should be the same process I'm using now just swapping with the class name correct?

Yes. Use find() for the single (first) matching element or find_all() for every match, or take a look at select() and select_one(), which accept CSS selectors. XPath could also be used for this, but BeautifulSoup does not support it, while other libraries such as lxml do.
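A small sketch of looking elements up by class, again on invented markup. The class names card, card_name and note are hypothetical and will differ on the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page uses its own class names.
html = """
<div class="card"><span class="card_name">Dark Magician</span></div>
<div class="card"><span class="card_name">Celtic Guardian</span></div>
<div class="other"><span class="note">not a card</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because class is a Python keyword.
first = soup.find("span", class_="card_name")          # first match only
all_names = soup.find_all("span", class_="card_name")  # every match

# The same lookups expressed as CSS selectors.
via_css = soup.select("div.card span.card_name")
one = soup.select_one("span.card_name")

print(first.text)
print([tag.text for tag in all_names])
```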

... so would I be able to get say the second image without having to get the others?

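You can use find_all("img")[1], which finds all images and then picks a specific one. For the first image, find("img") would work. I expected soup.select("img:nth-of-type(2)") to work too, but instead of returning only the second image it returned many of them. That looks like how the selector is defined rather than a bug: :nth-of-type(2) matches every img that is the second img among its own siblings, so each parent element holding two or more images contributes a match.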
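The difference between indexing and :nth-of-type(2) shows up on a small document. This is only a sketch: the list layout and image file names are made up, and limit=2 is an optional way to stop the search early.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two list items with two images each.
html = """
<ul>
  <li><img src="a1.png"><img src="a2.png"></li>
  <li><img src="b1.png"><img src="b2.png"></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Second image in document order.
second = soup.find_all("img")[1]

# limit=2 stops searching after two matches, so later images are never collected.
second_limited = soup.find_all("img", limit=2)[1]

# :nth-of-type(2) is evaluated per parent: the second img inside each <li>.
nth = soup.select("img:nth-of-type(2)")

print(second["src"])                # a2.png
print([img["src"] for img in nth])  # ['a2.png', 'b2.png']
```

If only one specific image is wanted, indexing the result of find_all() or select() is usually the most predictable option.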

Moreover, for a really specific DOM element, consider filtering on both class and id, like soup.find("span", {"class": "some_class", "id": "some_id"})
