Need to turn a list of links obtained with BS4 into a dict, but I get this information from the scra

Time:09-23

As the title mentions, I'm trying to create a dictionary mapping article name to link. I use BS4 to dive into the HTML and obtain the stuff I need (since the class is different every time, I'm using a range to get the first five and looping through):

import requests
from bs4 import BeautifulSoup as BS

data = requests.get("https://www.marketingdive.com")
soup = BS(data.content, 'html5lib')
top_story = []

for i in range(6):
    # find() returns the matching <a> tag, or None when nothing matches
    items = soup.find("a", {"class": f"analytics t-dash-top-{i}"})
    #print(items.get('href'))
    top_story.append(items)

print(top_story)

The end result is the following:

[None, <a  href="/news/youtube-shorts-revenue-sharing-creator-economy-TikTok/632272/">
                                                    YouTube brings revenue sharing to Shorts as battle for creator talent intensifies
                                                </a>, <a  href="/news/Walmart-TikTok-Snapchat-Gen-Z-retail-commerce-ads/632191/">
                                                    Walmart weds data to popular apps like TikTok in latest ad play
                                                </a>, <a  href="/news/retail-media-global-ad-spend-groupm/632269/">
                                                    Retail media makes up 11% of global ad spend, GroupM says
                                                </a>, <a  href="/news/mike-hard-lemonade-gen-z-pto/632267/">
                                                    Mike’s Hard Lemonade pays consumers to take PTO
                                                </a>, <a  href="/news/samsung-nbcuniversal-tonight-show-metaverse-fortnite/632194/">
                                                    Samsung, NBCUniversal bring Rockefeller Center to the metaverse
                                                </a>]

I have tried splitting the strings and obtaining only the href (as per the docs), and I've tried other solutions on here, but I'm at a loss. The only thing I can think of is that I have missed a step somewhere. Any answers or comments on where I can fix this would be appreciated.
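For what it's worth, the pieces needed are already in the loop: `get_text()` for the article name and the `href` attribute for the link, plus a guard for the `None` that `find()` returns when a class index has no match. A minimal sketch of that step, run against a hard-coded fragment of the markup the scrape returned so it works offline:

```python
from bs4 import BeautifulSoup

# One anchor copied from the scraped output above, used as a stand-in
# for the live page so the Tag -> dict step can be shown in isolation.
html = '''
<a class="analytics t-dash-top-1" href="/news/youtube-shorts-revenue-sharing-creator-economy-TikTok/632272/">
    YouTube brings revenue sharing to Shorts as battle for creator talent intensifies
</a>
'''
soup = BeautifulSoup(html, "html.parser")

top_stories = {}
for i in range(6):
    item = soup.find("a", {"class": f"analytics t-dash-top-{i}"})
    if item is not None:  # skip indices with no matching tag (the None entries)
        # get_text(strip=True) trims the surrounding whitespace;
        # item["href"] is the (site-relative) link
        top_stories[item.get_text(strip=True)] = item["href"]

print(top_stories)
```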

CodePudding user response:

from bs4 import BeautifulSoup
import requests
from pprint import pp
from urllib.parse import urljoin


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    # Select every <a> whose class attribute starts with "analytics t-dash-top"
    # (so no None entries), then build {article name: absolute link}.
    goal = {x.get_text(strip=True): urljoin(url, x['href'])
            for x in soup.select('a[class^="analytics t-dash-top"]')}
    pp(goal)


main('https://www.marketingdive.com/')
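The `urljoin` call is what turns the site-relative hrefs into full URLs; a quick stdlib-only illustration with one of the links from the question:

```python
from urllib.parse import urljoin

base = "https://www.marketingdive.com/"
href = "/news/retail-media-global-ad-spend-groupm/632269/"

# A relative href is resolved against the page URL...
print(urljoin(base, href))
# -> https://www.marketingdive.com/news/retail-media-global-ad-spend-groupm/632269/

# ...while an already-absolute href is returned unchanged.
print(urljoin(base, "https://example.com/x"))
# -> https://example.com/x
```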