I am trying to scrape overall product details like brand, ingredient and flavour

Time:11-18

Can anyone please help me scrape the Flavour and Brand details as key-value pairs using BeautifulSoup? I am new to this.

The desired output would be:

Flavour - Green Apple

Brand - Carabau

The HTML looks like this:

<tr class="a-spacing-small">
<td class="a-span3">
    <span class="a-size-base a-text-bold">Flavour</span>
</td>

<td class="a-span9">
    <span class="a-size-base">Green Apple</span>
</td>
</tr>
<tr class="a-spacing-small">
<td class="a-span3">
    <span class="a-size-base a-text-bold">Brand</span>
</td>

<td class="a-span9">
    <span class="a-size-base">Carabau</span>
</td>
</tr>

CodePudding user response:

I have stored the sample data in a variable named html. You can call find() on the respective tag to get the exact data; alternatively, you can use find_next().

html="""<tr class="a-spacing-small">
<td class="a-span3">
    <span class="a-size-base a-text-bold">Flavour</span>
</td>

<td class="a-span9">
    <span class="a-size-base">Green Apple</span>
</td>
</tr>"""

Code:

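from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
details = {}  # not named "dict", which would shadow the built-in type

# label: the tag that follows the first "a-span3" cell
key = soup.find("td", class_="a-span3").find_next().text

# value: the span inside the first "a-span9" cell
value = soup.find("td", class_="a-span9").find("span", class_="a-size-base").text
print(key + " - " + value)
details[key] = value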

Output:

Flavour - Green Apple
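
The snippet above handles a single row. As a rough sketch of how the find_next() idea could cover both rows from the question and build the dictionary, something like the following might work. It assumes each row is closed with </tr> and wrapped in a <table> (neither is shown in the question), and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; the <table> wrapper and the
# closing </tr> tags are assumptions added so the rows are well-formed.
html = """
<table>
<tr class="a-spacing-small">
<td class="a-span3"><span class="a-size-base a-text-bold">Flavour</span></td>
<td class="a-span9"><span class="a-size-base">Green Apple</span></td>
</tr>
<tr class="a-spacing-small">
<td class="a-span3"><span class="a-size-base a-text-bold">Brand</span></td>
<td class="a-span9"><span class="a-size-base">Carabau</span></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
details = {}

# Every label cell has class "a-span3"; the matching value is the next
# "a-span9" cell that follows it in document order.
for label_cell in soup.find_all("td", class_="a-span3"):
    value_cell = label_cell.find_next("td", class_="a-span9")
    if value_cell is None:
        continue  # label without a value cell, skip it
    details[label_cell.get_text(strip=True)] = value_cell.get_text(strip=True)

for key, value in details.items():
    print(f"{key} - {value}")
```

Passing a tag name and class to find_next() keeps the lookup tied to the value cell rather than to whatever tag happens to come next.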

CodePudding user response:

You can do it like this.

Select the <tr> and use .stripped_strings to iterate over the whitespace-stripped strings inside it.

Note: If you have multiple <tr> elements, use .find_all() to select each of them and do the same.

from bs4 import BeautifulSoup

s = """
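<tr class="a-spacing-small">
<td class="a-span3">
    <span class="a-size-base a-text-bold">Flavour</span>
</td>
<td class="a-span9">
    <span class="a-size-base">Green Apple</span>
</td>
</tr>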
"""

soup = BeautifulSoup(s, 'lxml')
tr = soup.find('tr')
print(list(tr.stripped_strings))
# ['Flavour', 'Green Apple']
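
For the multi-row case in the note, a minimal sketch could loop over .find_all('tr') and unpack the two strings of each row into a dictionary. This assumes every row holds exactly one label and one value and is properly closed with </tr>; the sample markup and the names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Two well-formed rows modelled on the question's HTML (assumed shape).
rows_html = """
<table>
<tr class="a-spacing-small">
<td class="a-span3"><span class="a-size-base a-text-bold">Flavour</span></td>
<td class="a-span9"><span class="a-size-base">Green Apple</span></td>
</tr>
<tr class="a-spacing-small">
<td class="a-span3"><span class="a-size-base a-text-bold">Brand</span></td>
<td class="a-span9"><span class="a-size-base">Carabau</span></td>
</tr>
</table>
"""

soup = BeautifulSoup(rows_html, "html.parser")
details = {}

for tr in soup.find_all("tr"):
    # stripped_strings is a generator, so materialise it before unpacking
    strings = list(tr.stripped_strings)
    if len(strings) != 2:
        continue  # skip rows that are not a simple label/value pair
    key, value = strings
    details[key] = value

print(details)
```

The length check is there because .stripped_strings returns every string in the row, so a row with extra text would otherwise break the unpacking.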

CodePudding user response:

There's actually no need for .stripped_strings, as mentioned by Ram. You can call a specific CSS selector directly, which is safer because it grabs data only from the elements you target rather than from anything else inside the row. Also, .stripped_strings on its own doesn't create the dictionary key-value pair you wanted.

You're looking for this:

# ...

data = []

for result in soup.select('tr'):
    # CSS selector for flavour detail
    flavor_name = result.select_one('.a-span9 .a-size-base').text
    
    # appends to list() as a dict() -> key-value pair
    data.append({
        "flavour": flavor_name
    })

print(data)

# [{'flavour': 'Green Apple'}]

Full code example (returns a key-value pair):

from bs4 import BeautifulSoup

html = '''
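<tr class="a-spacing-small">
<td class="a-span3">
    <span class="a-size-base a-text-bold">Flavour</span>
</td>

<td class="a-span9">
    <span class="a-size-base">Green Apple</span>
</td>
</tr>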
'''

soup = BeautifulSoup(html, 'html.parser')

# temp list()
data = []

for result in soup.select('tr'):
    # flavor = soup.select_one('.a-text-bold').text  # returns just Flavour word
    flavor_name = result.select_one('.a-span9 .a-size-base').text
    
    data.append({
        "flavour": flavor_name
    })

print(data)

# [{'flavour': 'Green Apple'}]

Access created data:

for flavour in data:
    print(flavour["flavour"])

# Green Apple
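
The examples above hard-code the "flavour" key. If you want the key to come from the page as well (Flavour, Brand and so on), one possible sketch is to select both the label and the value with CSS selectors in each row. The selectors are taken from the class names in the question; the <table> wrapper, the closing </tr> tags and the variable names are assumptions:

```python
from bs4 import BeautifulSoup

# Sample markup based on the question, made well-formed for the example.
html = """
<table>
<tr class="a-spacing-small">
<td class="a-span3"><span class="a-size-base a-text-bold">Flavour</span></td>
<td class="a-span9"><span class="a-size-base">Green Apple</span></td>
</tr>
<tr class="a-spacing-small">
<td class="a-span3"><span class="a-size-base a-text-bold">Brand</span></td>
<td class="a-span9"><span class="a-size-base">Carabau</span></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
details = {}

for row in soup.select("tr.a-spacing-small"):
    label = row.select_one(".a-span3 .a-text-bold")  # e.g. "Flavour"
    value = row.select_one(".a-span9 .a-size-base")  # e.g. "Green Apple"
    if label is None or value is None:
        continue  # row does not match the expected label/value layout
    details[label.get_text(strip=True)] = value.get_text(strip=True)

print(details)
```

Checking for None before reading the text avoids an AttributeError on rows that do not follow this layout.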

P.S. There's a dedicated web scraping blog of mine. If you need to parse search engines, have a try using SerpApi.

Disclaimer, I work for SerpApi.
