I am trying to parse some data from a URL: https://apptopia.com/store-insights/top-charts/google-play/comics/united-states
I was able to extract text and href from the bs4.element.Tag. However, the outputs are concatenated.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://apptopia.com/store-insights/top-charts/google-play/comics/united-states").read()
soup = BeautifulSoup(html, 'html.parser')  # the page is HTML, so use an HTML parser rather than 'xml'
app_info_lst = soup.find_all("div", {"class": "media-object app-link-block"})
###############################################################
# print first element in this tag:
print(app_info_lst[0])
>>><div href="https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence"><div ><img alt="" src="https://d1nxzqpcg2bym0.cloudfront.net/google_play/com.naver.linewebtoon/5739f736-9f84-11e9-9bdb-4f6f6db47610/64x64"/></div><div ><p ><a href="https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence" title="WEBTOON">WEBTOON</a></p><p ><a href="/publishers/google_play/2457079" title="WEBTOON ENTERTAINMENT">WEBTOON ENTERTAINMENT</a></p></div></div>
###############################################################
# My actual output:
print(app_info_lst[0].get_text(strip=True))
>>>'WEBTOONWEBTOON ENTERTAINMENT'
print(app_info_lst[0].get('href'))
>>>'https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence'
However, my expected output is:
print(app_info_lst[0].get_text(strip=True))
>>>['WEBTOON', 'WEBTOON ENTERTAINMENT']
print(app_info_lst[0].get('href'))
>>>['https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence', '/publishers/google_play/2457079']
How can I do this? Any piece of advice/help is appreciated! Thanks!
CodePudding user response:
To generate two lists with this information, you can use list comprehensions.
Links:
[x['href'] for x in soup.select('table a')]
['https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence', '/publishers/google_play/2457079', 'https://apptopia.com/google-play/app/com.progdigy.cdisplay/intelligence', '/publishers/google_play/1643949',...]
Texts:
[x.text for x in soup.select('table a')]
['WEBTOON','WEBTOON ENTERTAINMENT','CDisplayEx Comic Reader','Progdigy Software',...]
In my opinion, a list of dicts is much better:
[{'href':x['href'],'title':x.text} for x in soup.select('table a')]
[{'href': 'https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence', 'title': 'WEBTOON'}, {'href': '/publishers/google_play/2457079', 'title': 'WEBTOON ENTERTAINMENT'}, {'href': 'https://apptopia.com/google-play/app/com.progdigy.cdisplay/intelligence', 'title': 'CDisplayEx Comic Reader'}, {'href': '/publishers/google_play/1643949', 'title': 'Progdigy Software'}, {'href': 'https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence', 'title': 'WEBTOON'},...]
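If you want the two lists per element, as in the question's expected output, rather than for the whole table, the same comprehension can be run on a single block. The following is only a minimal sketch: the markup is a trimmed copy of the first element printed in the question (the img tag is left out), and the class names on the outer div are assumed from the question's find_all call, since they do not appear in the printed output.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first element from the question.
# The class attribute is an assumption taken from the find_all() call.
html = """
<div class="media-object app-link-block" href="https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence">
  <div>
    <p><a href="https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence" title="WEBTOON">WEBTOON</a></p>
    <p><a href="/publishers/google_play/2457079" title="WEBTOON ENTERTAINMENT">WEBTOON ENTERTAINMENT</a></p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Pick one block, then collect every <a> inside it.
block = soup.find("div", class_="app-link-block")
anchors = block.find_all("a")

# One list entry per <a>, instead of one concatenated string.
texts = [a.get_text(strip=True) for a in anchors]
links = [a["href"] for a in anchors]

print(texts)  # ['WEBTOON', 'WEBTOON ENTERTAINMENT']
print(links)  # ['https://apptopia.com/google-play/app/com.naver.linewebtoon/intelligence', '/publishers/google_play/2457079']
```

get_text() on the outer div joins the text of all descendants, which is why the original output was concatenated. Iterating over the inner a tags keeps each value separate. To do this for every app, loop over app_info_lst and apply the same two comprehensions to each element.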