How do I extract the first element of the bs4.element.Tag?

Time:10-31

I want to extract the number that appears before "opiniones". I can find the span that contains it, but I cannot retrieve the number itself.

Code example:

list_rest = []
for res_name, res_stats in zip(top_rest, top_rest_info):
    dataframe = {}
    dataframe["pos"] = res_name.find('a').contents[0]
    dataframe["name"] = res_name.find('a').contents[-1]
    dataframe["number_of_reviews"] = res_stats.find("span", attrs={"class": "NoCoR"})
    list_rest.append(dataframe)

Output:

[{'pos': 'La Gourmesa',
  'name': 'La Gourmesa',
  'number_of_reviews': <span class="NoCoR">3<!-- --> opiniones</span>},
 {'pos': '1',
  'name': 'Parrilla Urbana División del Norte',
  'number_of_reviews': <span class="NoCoR">486<!-- --> opiniones</span>},
 {'pos': '2',
  'name': 'La Mansion Marriott Reforma',
  'number_of_reviews': <span class="NoCoR">730<!-- --> opiniones</span>},
 {'pos': '3',
  'name': 'Restaurante Condimento Emporio Reforma',
  'number_of_reviews': <span class="NoCoR">283<!-- --> opiniones</span>},
 {'pos': '4',
  'name': "Porfirio's Coapa",
  'number_of_reviews': <span class="NoCoR">468<!-- --> opiniones</span>}]

How do I extract just the number for number_of_reviews?

CodePudding user response:

Here I use a small HTML snippet as an example. You can use get_text() (or the .text property) to extract the text from the tag, then split it on spaces and take the first field.

html = """<span class='NoCoR'>3<!-- --> opiniones</span>
<span class='NoCoR'>486<!-- --> opiniones</span>
<span class='NoCoR'>730<!-- --> opiniones</span>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

main_data = soup.find_all("span", attrs={"class": "NoCoR"})
for data in main_data:
    print(data.get_text().split(" ")[0])

Output:

3
486
730

For your code it should work like this:

dataframe["number_of_reviews"] = res_stats.find("span", attrs={"class": "NoCoR"}).get_text().split(" ")[0]
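Two things can still go wrong with the line above: find() returns None when a block has no matching span, and the result is a string rather than a number. The following is a minimal sketch of one way to guard against both. The helper name review_count and the sample HTML (including the div wrappers) are made up for illustration and are not part of the original page.

```python
from bs4 import BeautifulSoup


def review_count(container):
    """Return the review count as an int, or None if it cannot be read.

    `review_count` is a hypothetical helper, not part of BeautifulSoup.
    """
    span = container.find("span", attrs={"class": "NoCoR"})
    if span is None:
        # find() returns None when no matching tag exists.
        return None
    first = span.get_text().split(" ")[0]
    # Only convert when the first field consists purely of digits.
    return int(first) if first.isdigit() else None


# Illustrative HTML: one block with a review span, one without.
html = (
    "<div><span class='NoCoR'>486<!-- --> opiniones</span></div>"
    "<div>no reviews yet</div>"
)
soup = BeautifulSoup(html, "html.parser")
counts = [review_count(div) for div in soup.find_all("div")]
print(counts)  # [486, None]
```

Note that a count written with a thousands separator would fail the isdigit() check and come back as None, so the sketch would need adjusting if the site formats large numbers that way.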

CodePudding user response:

Your code already uses .contents to get the position and the name, so why not use it to grab the number from the tag too?

Solution

The children of a tag are available in a list called .contents, so picking the first one should solve your issue. Append .contents[0] to your line of code:

res_stats.find("span", attrs={"class": "NoCoR"}).contents[0]

Example with several spans

from bs4 import BeautifulSoup
html = '''<span class='NoCoR'>3<!-- --> opiniones</span><span class='NoCoR'>486<!-- --> opiniones</span><span class='NoCoR'>730<!-- --> opiniones</span><span class='NoCoR'>283<!-- --> opiniones</span><span class='NoCoR'>468<!-- --> opiniones</span>'''

soup = BeautifulSoup(html, 'html.parser')

for opinion in soup.select('span.NoCoR'):
    print(opinion.contents[0])

Output

3
486
730
283
468
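Why .contents[0] gives only the number: the empty HTML comment inside the span splits its content into three children, namely the number, the comment, and the trailing " opiniones" text. The short sketch below inspects those children for a single sample span. The variable names are illustrative, and it assumes the html.parser backend used above.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup(
    "<span class='NoCoR'>730<!-- --> opiniones</span>", "html.parser"
)
span = soup.find("span", attrs={"class": "NoCoR"})

# The comment node separates the number from the rest of the text.
children = span.contents
for child in children:
    print(type(child).__name__, repr(str(child)))

first = children[0]
# NavigableString is a str subclass, so int() accepts it directly.
number = int(first)
print(number)  # 730
```

If the page ever drops the comment, the number and the word would likely end up in a single text node, and .contents[0] would return both together. The get_text().split(" ")[0] approach does not depend on the comment being there.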