As the title says, I'm trying to create a dictionary of the form
article name: link
I use BS4 to dive into the HTML and pull out what I need. Since each link has a different class (the number at the end changes), I loop over a range to collect the first five:
import requests
from bs4 import BeautifulSoup as BS

data = requests.get("https://www.marketingdive.com")
soup = BS(data.content, 'html5lib')

top_story = []
for i in range(6):
    items = soup.find("a", {"class": f"analytics t-dash-top-{i}"})
    # print(items.get('href'))
    top_story.append(items)
print(top_story)
The end result is the following:
[None, <a href="/news/youtube-shorts-revenue-sharing-creator-economy-TikTok/632272/">
YouTube brings revenue sharing to Shorts as battle for creator talent intensifies
</a>, <a href="/news/Walmart-TikTok-Snapchat-Gen-Z-retail-commerce-ads/632191/">
Walmart weds data to popular apps like TikTok in latest ad play
</a>, <a href="/news/retail-media-global-ad-spend-groupm/632269/">
Retail media makes up 11% of global ad spend, GroupM says
</a>, <a href="/news/mike-hard-lemonade-gen-z-pto/632267/">
Mike’s Hard Lemonade pays consumers to take PTO
</a>, <a href="/news/samsung-nbcuniversal-tonight-show-metaverse-fortnite/632194/">
Samsung, NBCUniversal bring Rockefeller Center to the metaverse
</a>]
I have tried splitting the strings and extracting only the href (as per the docs), and I've tried other solutions from here, but I'm at a loss. The only thing I can think of is that I've missed a step somewhere. Any answers or comments pointing out where to fix this would be appreciated.
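(For context: the entries in top_story are bs4 Tag objects, not strings, so there is no need to split anything; attributes can be read with dict-style indexing. A minimal sketch using a stand-in tag instead of the live page, with the leading None skipped:)

```python
from bs4 import BeautifulSoup

# Stand-in for the scraped results above: a list of bs4 Tag objects (or None).
html = '<a href="/news/example/632272/">Example headline</a>'
top_story = [None, BeautifulSoup(html, "html.parser").a]

# Tags support ["href"] for attributes and get_text() for the link text.
articles = {
    tag.get_text(strip=True): tag["href"]
    for tag in top_story
    if tag is not None  # skip slots where find() matched nothing
}
print(articles)  # {'Example headline': '/news/example/632272/'}
```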
CodePudding user response:
from bs4 import BeautifulSoup
import requests
from pprint import pp
from urllib.parse import urljoin

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    # Match every class starting with "analytics t-dash-top" in one pass,
    # then map each link's text to an absolute URL.
    goal = {x.get_text(strip=True): urljoin(url, x['href'])
            for x in soup.select('a[class^="analytics t-dash-top"]')}
    pp(goal)

main('https://www.marketingdive.com/')
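If you'd rather keep the explicit loop from the question, the same dictionary can be built by guarding against missing slots; a sketch, assuming the numbered classes run from t-dash-top-1 upward (the None in the question's output suggests t-dash-top-0 doesn't exist on the page):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def top_stories(soup, base_url, n=5):
    """Collect {title: absolute URL} from the numbered top-story classes."""
    stories = {}
    for i in range(1, n + 1):  # start at 1: t-dash-top-0 matched nothing
        item = soup.find("a", {"class": f"analytics t-dash-top-{i}"})
        if item is not None:  # skip slots that don't exist
            stories[item.get_text(strip=True)] = urljoin(base_url, item["href"])
    return stories
```

Call it with the soup from the question, e.g. top_stories(BS(requests.get(url).content, 'html5lib'), url). Note that find() with a multi-class string like "analytics t-dash-top-1" only matches when the tag's class attribute is exactly that string, which is why the CSS-selector answer above is more robust.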