How to get each text and href from bs4.element.Tag in Python?

Time:11-23

I am trying to parse some data from a URL: https://apptopia.com/store-insights/top-charts/google-play/comics/united-states

I was able to extract text and href from the bs4.element.Tag. However, the outputs are concatenated.

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://apptopia.com/store-insights/top-charts/google-play/comics/united-states").read()
soup = BeautifulSoup(html, 'xml')

app_info_lst = soup.find_all("div", {"class": "media-object app-link-block"})
###############################################################
# print first element in this tag:
print(app_info_lst[0])
>>><div  href="https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence"><div ><img alt=""  src="https://d1nxzqpcg2bym0.cloudfront.net/google_play/com.naver.linewebtoon/5739f736-9f84-11e9-9bdb-4f6f6db47610/64x64"/></div><div ><p ><a href="https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence" title="WEBTOON">WEBTOON</a></p><p ><a  href="/publishers/google_play/2457079" title="WEBTOON ENTERTAINMENT">WEBTOON ENTERTAINMENT</a></p></div></div>
###############################################################
# My actual output:
print(app_info_lst[0].get_text(strip=True))
>>>'WEBTOONWEBTOON ENTERTAINMENT'
     
print(app_info_lst[0].get('href'))
>>>'https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence'

However, my expected output is:

print(app_info_lst[0].get_text(strip=True))
>>>['WEBTOON', 'WEBTOON ENTERTAINMENT']
print(app_info_lst[0].get('href'))
>>>['https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence', '/publishers/google_play/2457079']

How can I do this? Any piece of advice/help is appreciated! Thanks!

Answer:

To build two separate lists, one for the links and one for the texts, you can use list comprehensions over the <a> tags.

Links:

[x['href'] for x in soup.select('table a')]

['https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence', '/publishers/google_play/2457079', 'https://apptopia.com/google-play/app/com.progdigy.cdisplay/intelligence', '/publishers/google_play/1643949',...]

Texts:

[x.text for x in soup.select('table a')]

['WEBTOON','WEBTOON ENTERTAINMENT','CDisplayEx Comic Reader','Progdigy Software',...]

In my opinion, a list of dicts is much better, because it keeps each href paired with its text:

[{'href':x['href'],'title':x.text} for x in soup.select('table a')]

[{'href': 'https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence', 'title': 'WEBTOON'}, {'href': '/publishers/google_play/2457079', 'title': 'WEBTOON ENTERTAINMENT'}, {'href': 'https://apptopia.com/google-play/app/com.progdigy.cdisplay/intelligence',  'title': 'CDisplayEx Comic Reader'}, {'href': '/publishers/google_play/1643949', 'title': 'Progdigy Software'}, {'href': 'https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence',  'title': 'WEBTOON'},...]
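The comprehensions above return one flat list for the whole page. If you want the result grouped per app block, as in the expected output of the question, one option is to loop over the blocks and collect the <a> tags inside each one. This is only a minimal sketch: the HTML is a trimmed copy of the first element printed in the question, the class attribute on the outer div is an assumption restored from the find_all() call, and the names apps, texts and hrefs are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first block printed in the question. The class
# attribute is an assumption, restored from the find_all() call there.
html = """
<div class="media-object app-link-block" href="https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence">
  <div>
    <p><a href="https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence" title="WEBTOON">WEBTOON</a></p>
    <p><a href="/publishers/google_play/2457079" title="WEBTOON ENTERTAINMENT">WEBTOON ENTERTAINMENT</a></p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

apps = []
# One entry per app block, so app name and publisher stay together.
for block in soup.select("div.media-object.app-link-block"):
    links = block.find_all("a", href=True)
    apps.append({
        "texts": [a.get_text(strip=True) for a in links],
        "hrefs": [a["href"] for a in links],
    })

print(apps[0]["texts"])
# ['WEBTOON', 'WEBTOON ENTERTAINMENT']
print(apps[0]["hrefs"])
# ['https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence', '/publishers/google_play/2457079']
```

Calling get_text() on the outer div joins every string inside it, which is why the question got 'WEBTOONWEBTOON ENTERTAINMENT'. Reading each <a> separately avoids that. If you only need the texts, list(block.stripped_strings) should give the same list for this block.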