BeautifulSoup deleting first half of HTML?

Time:10-26

I'm practicing with BeautifulSoup and HTML requests in general for the first time. The goal of the programme is to load a webpage and its HTML, then search through the webpage (in this case a recipe) to get a substring of its ingredients. I've managed to get it working with the following code:

import requests

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"

result = requests.get(url)
myHTML = result.text
index1 = myHTML.find("recipeIngredient")
index2 = myHTML.find("recipeInstructions")
ingredients = myHTML[index1:index2]

But when I try and use BeautifulSoup here:

import requests
from bs4 import BeautifulSoup

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"

result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find(text = "recipeIngredient")
print(ingredients)

I understand that the code above (even if I could get it working) would produce a different output of just ["recipeIngredient"], but that's all I'm focused on for now whilst I get to grips with BS. Instead, the code above just outputs None. I printed "doc" to the terminal and it only output what appears to be the second half of the HTML (or at least not all of it), whereas the text file does contain all of the HTML. I assume that's where the problem lies, but I'm not sure how to fix it.

Thank you.

CodePudding user response:

You need to use:

class_="recipe__ingredients"

For example:

import requests
from bs4 import BeautifulSoup

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"

doc = (
    BeautifulSoup(requests.get(url).text, "html.parser")
    .find(class_="recipe__ingredients")
)

ingredients = "\n".join(
    ingredient.getText() for ingredient in doc.find_all("li")
)

print(ingredients)

Output:

1 large onion , chopped
4 large garlic cloves
thumb-sized piece of ginger
2 tbsp rapeseed oil
4 small skinless chicken breasts, cut into chunks
2 tbsp tikka spice powder
1 tsp cayenne pepper
400g can chopped tomatoes
40g ground almonds
200g spinach
3 tbsp fat-free natural yogurt
½ small bunch of coriander , chopped
brown basmati rice , to serve
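
One caveat with the approach above: if the class is not present in the page, .find() returns None and the .find_all() call on it fails with an AttributeError. A minimal sketch of a variant using a CSS selector is shown below. The HTML snippet is made up for illustration and is not the real page markup; select() simply returns an empty list when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the downloaded page (an assumption,
# not the real markup of the recipe site).
html = """
<section class="recipe__ingredients">
  <ul>
    <li>2 tbsp rapeseed oil</li>
    <li>200g <a href="#">spinach</a></li>
  </ul>
</section>
"""

doc = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
items = doc.select(".recipe__ingredients li")

# get_text(" ", strip=True) joins nested text pieces with a single space.
ingredients = [li.get_text(" ", strip=True) for li in items]

# A selector that matches nothing yields an empty list rather than None.
missing = doc.select(".no-such-class li")

print("\n".join(ingredients))
print(missing)
```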

CodePudding user response:

It outputs None because find(text="recipeIngredient") looks for a place where the content within HTML tags is exactly 'recipeIngredient', which does not exist (there is no such text in the HTML content; that string is an attribute of an HTML tag).
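
The difference can be seen on a small scale with the sketch below. The HTML snippet and the itemprop attribute are assumptions made up for illustration, not the real page markup. A string search only matches a text node whose whole content equals the search string, while an attribute search finds the tags that carry the value.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration only; not the markup of the real page.
html = """
<ul>
  <li itemprop="recipeIngredient">2 tbsp rapeseed oil</li>
  <li itemprop="recipeIngredient">200g spinach</li>
</ul>
"""

doc = BeautifulSoup(html, "html.parser")

# string= (called text= in older versions) compares against text nodes.
# No text node is exactly "recipeIngredient", so this returns None.
by_string = doc.find(string="recipeIngredient")

# Searching on the attribute finds the tags that carry that value.
by_attr = doc.find_all(attrs={"itemprop": "recipeIngredient"})
texts = [tag.get_text() for tag in by_attr]

print(by_string)
print(texts)
```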

What you actually want to do with bs4 is find the specific tags and/or attributes that hold the data you want. For example, as @baduker points out, the ingredients in the HTML are within the tag whose class attribute is "recipe__ingredients".

The string 'recipeIngredient' that you pull out in that first block of code actually comes from within a <script> tag in the HTML, which holds the ingredients in JSON format.

from bs4 import BeautifulSoup
import requests
import json

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"

result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find('script', {'type':'application/ld+json'}).text
jsonData = json.loads(ingredients)

print(jsonData['recipeIngredient'])
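
A page can carry more than one JSON-LD script, and the payload may be a single object or a list of objects, so a slightly more defensive variant may help. The following is only a sketch under those assumptions: the helper name find_recipe_ingredients and the sample HTML are hypothetical and do not reflect the real page.

```python
import json

from bs4 import BeautifulSoup


def find_recipe_ingredients(html):
    """Return the first 'recipeIngredient' value found in a JSON-LD script, else None."""
    doc = BeautifulSoup(html, "html.parser")
    for script in doc.find_all("script", {"type": "application/ld+json"}):
        raw = script.string
        if not raw:
            continue  # empty script tag
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip payloads that are not valid JSON
        # The payload may be one object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and "recipeIngredient" in item:
                return item["recipeIngredient"]
    return None


# Hypothetical sample with two JSON-LD blocks; only the second is a recipe.
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "WebSite", "name": "Example"}</script>
<script type="application/ld+json">
{"@type": "Recipe", "recipeIngredient": ["200g spinach", "40g ground almonds"]}
</script>
</head><body></body></html>
"""

print(find_recipe_ingredients(sample_html))
```

With the real page, the same helper could be called on result.text from the code above.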
