Different output from scraping RSS using requests.get and BeautifulSoup

Time:10-16

I want to scrape data from this feed: https://news.ycombinator.com/rss. Each item in the feed contains markup of the form <link>the URL</link>. However, with the code below, the parsed item shows <link/>the URL instead, and the 'link' key in the JSON file is empty.

import json

import requests
from bs4 import BeautifulSoup  # the 'html5lib' parser package must be installed


def rss(x):
    r = requests.get(x)
    s = BeautifulSoup(r.content, features='html5lib')
    the_list = []
    for i in s.find_all('item'):
        title = i.find('title').text
        link = i.find('link').text  # comes back empty: this is the problem
        date = i.find('pubdate').text

        article = {
            'title': title,
            'link': link,
            'date': date
        }

        the_list.append(article)
    with open('the_list.json', 'w') as f:
        json.dump(the_list, f)


rss('https://news.ycombinator.com/rss')

Answer:

What happens?

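As you already noticed, the <link> element is not well formed in the soup. html5lib parses the feed as HTML, where <link> is a void element that cannot have content. The tag is therefore closed immediately (<link/>), the stray </link> is dropped, and the URL ends up as a text node right after the tag, so the text attribute of the element is empty.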
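The behaviour can be reproduced without any network access. The following is a minimal sketch, not the asker's code: the one-item SAMPLE string and its URL are made up for illustration, and it assumes bs4's built-in 'html.parser', which treats <link> as a void element in the same way, so that html5lib does not need to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item snippet standing in for the real RSS feed.
SAMPLE = "<item><title>Example</title><link>https://example.com/a</link></item>"

# Assumption: 'html.parser' is used here only to avoid an extra dependency.
soup = BeautifulSoup(SAMPLE, "html.parser")
link = soup.find("link")

link_text = link.text           # text inside the tag (nothing, it was closed at once)
after_link = link.next_sibling  # the node that follows the tag (the URL)

print(repr(link_text))
print(repr(str(after_link)))
```

The first value is empty and the second holds the URL, which matches the <link/>the URL output described in the question.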

The good news is that there is a simple fix.

How to fix?

Read the URL from the next_sibling of the <link> element instead of from its text:

i.find('link').next_sibling
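Put together, the fix might look like the sketch below. It is an illustration under assumptions rather than the original code: parse_items and SAMPLE_FEED are hypothetical names, the two-item feed is invented so the example runs offline, and 'html.parser' is used as the default parser instead of html5lib.

```python
import json

from bs4 import BeautifulSoup, NavigableString

# Hypothetical two-item feed used instead of fetching the real one.
SAMPLE_FEED = """
<rss><channel>
  <item>
    <title>First post</title>
    <link>https://example.com/a</link>
    <pubDate>Thu, 14 Oct 2021 13:31:43 +0000</pubDate>
  </item>
  <item>
    <title>Second post</title>
    <link>https://example.com/b</link>
    <pubDate>Thu, 14 Oct 2021 14:48:59 +0000</pubDate>
  </item>
</channel></rss>
"""


def parse_items(markup, parser="html.parser"):
    """Return title/link/date dicts for every <item> in the markup."""
    soup = BeautifulSoup(markup, parser)
    items = []
    for item in soup.find_all("item"):
        link_tag = item.find("link")
        # The URL sits right after the <link/> tag, not inside it.
        sibling = link_tag.next_sibling if link_tag else None
        link = str(sibling).strip() if isinstance(sibling, NavigableString) else ""
        items.append({
            "title": item.find("title").text,
            "link": link,
            # HTML parsers lowercase tag names, hence 'pubdate'.
            "date": item.find("pubdate").text,
        })
    return items


print(json.dumps(parse_items(SAMPLE_FEED), indent=2))
```

For the live feed, the same function could be called with the response body, for example parse_items(requests.get(url).content, 'html5lib'). The strip() call only guards against whitespace around the URL in pretty-printed feeds.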

Output

[{"title": "Gitlab from YC to IPO", "link": "https://blog.ycombinator.com/gitlab-from-yc-to-ipo/", "date": "Thu, 14 Oct 2021 13:31:43 +0000"},
 {"title": "Apple Joins Blender Development Fund", "link": "https://www.blender.org/press/apple-joins-blender-development-fund/", "date": "Thu, 14 Oct 2021 14:48:59 +0000"},
 {"title": "Sunset Geometry (2016)", "link": "https://www.shapeoperator.com/2016/12/12/sunset-geometry/", "date": "Thu, 14 Oct 2021 14:29:08 +0000"},
 {"title": "iPhone Macro: A Big Day for Small Things", "link": "https://lux.camera/iphone-macro-camera-a-big-day-for-small-things/", "date": "Mon, 11 Oct 2021 10:22:06 +0000"},
 {"title": "Michelin Airless", "link": "https://www.michelin.com/en/innovation/vision-concept/airless/", "date": "Thu, 14 Oct 2021 14:36:58 +0000"},
 {"title": "Release (YC W20) Is Hiring \u2013 Product Marketing Manager", "link": "https://releasehub.com/company#hire", "date": "Thu, 14 Oct 2021 17:00:15 +0000"},
 {"title": "Global Climate Report \u2013 September 2021", "link": "https://www.ncdc.noaa.gov/sotc/global/202109", "date": "Thu, 14 Oct 2021 14:49:59 +0000"},
 {"title": "Esbuild \u2013 An extremely fast JavaScript bundler", "link": "https://esbuild.github.io/", "date": "Thu, 14 Oct 2021 05:07:27 +0000"},
 {"title": "Small Language Models Are Also Few-Shot Learners", "link": "https://aclanthology.org/2021.naacl-main.185/", "date": "Tue, 12 Oct 2021 09:59:34 +0000"},
 {"title": "Who was Aleph Null? (2013)", "link": "http://bit-player.org/2013/who-was-aleph-null", "date": "Mon, 11 Oct 2021 08:35:29 +0000"},
 {"title": "Hands-On Rust: Effective Learning Through 2D Game Development and Play", "link": "https://pragprog.com/titles/hwrust/hands-on-rust/", "date": "Thu, 14 Oct 2021 07:59:24 +0000"},
 {"title": "Ask HN: What's the Point of Life?", "link": "https://news.ycombinator.com/item?id=28866558", "date": "Thu, 14 Oct 2021 16:38:15 +0000"},
 {"title": "What I wish I knew when learning F#", "link": "https://danielbachler.de/2020/12/23/what-i-wish-i-knew-when-learning-fsharp.html", "date": "Thu, 14 Oct 2021 12:07:40 +0000"},
 {"title": "Investing in Startups by Passing the Series 65", "link": "https://www.natecation.com/accredited-investor-investing-startups-series-65/", "date": "Wed, 13 Oct 2021 17:57:25 +0000"},
 {"title": "OpenBSD 7.0", "link": "https://www.openbsd.org/70.html", "date": "Thu, 14 Oct 2021 10:24:21 +0000"},
 {"title": "Countries are gathering in an effort to stop a biodiversity collapse", "link": "https://www.nytimes.com/2021/10/14/climate/un-biodiversity-conference-climate-change.html", "date": "Thu, 14 Oct 2021 13:32:00 +0000"},
 {"title": "Alden Global Capital, the secretive hedge fund gutting newsrooms", "link": "https://www.theatlantic.com/magazine/archive/2021/11/alden-global-capital-killing-americas-newspapers/620171/", "date": "Thu, 14 Oct 2021 15:17:06 +0000"},
 {"title": "Child suicides in Japan hit record high", "link": "https://www3.nhk.or.jp/nhkworld/en/news/20211013_19/", "date": "Thu, 14 Oct 2021 08:52:39 +0000"},
 {"title": "Every search bar looks like a URL bar to users", "link": "https://shkspr.mobi/blog/2021/10/every-search-bar-looks-like-a-url-bar-to-users/", "date": "Thu, 14 Oct 2021 13:27:58 +0000"},
 {"title": "Psychonetics: A nerd's toolset to work with mind and perception", "link": "http://deconcentration-of-attention.com/psychonetics.html", "date": "Tue, 12 Oct 2021 11:28:43 +0000"},
 {"title": "FB seals off some internal message boards to prevent leaking, immediately leaked", "link": "https://www.businessinsider.com/facebook-whistleblower-leaks-restricts-staff-access-message-boards-elections-safety-2021-10", "date": "Thu, 14 Oct 2021 11:09:08 +0000"},
 {"title": "Working around expired root certificates", "link": "https://scotthelme.co.uk/should-clients-care-about-the-expiration-of-a-root-certificate/", "date": "Mon, 11 Oct 2021 21:27:27 +0000"},
 {"title": "An unprecedented wave of online bank fraud is hitting Britain", "link": "https://www.reuters.com/world/uk/welcome-britain-bank-scam-capital-world-2021-10-14/", "date": "Thu, 14 Oct 2021 09:57:39 +0000"},
 {"title": "Interoperable Serendipity", "link": "https://noeldemartin.com/blog/interoperable-serendipity", "date": "Wed, 13 Oct 2021 12:02:37 +0000"},
 {"title": "Instagram took down post with figure from paper showing male advantage in sports", "link": "https://twitter.com/SwipeWright/status/1448064426670583814", "date": "Thu, 14 Oct 2021 16:36:30 +0000"},
 {"title": "IoT hacking and rickrolling my high school district", "link": "https://whitehoodhacker.net/posts/2021-10-04-the-big-rick", "date": "Tue, 12 Oct 2021 19:38:06 +0000"},
 {"title": "Boeing says certain 787 parts improperly manufactured", "link": "https://www.reuters.com/business/aerospace-defense/boeing-deals-with-new-defect-787-dreamliner-wsj-2021-10-14/", "date": "Thu, 14 Oct 2021 13:26:46 +0000"},
 {"title": "Practice Problems for Hardware Engineers", "link": "https://arxiv.org/abs/2110.06526", "date": "Thu, 14 Oct 2021 03:48:24 +0000"},
 {"title": "Interface ergonomics: automation isn't just about time saved", "link": "https://macoy.me/blog/programming/InterfaceFriction", "date": "Wed, 13 Oct 2021 01:05:52 +0000"},
 {"title": "Syncthing \u2013 a continuous file synchronization program", "link": "https://syncthing.net/", "date": "Thu, 14 Oct 2021 01:23:19 +0000"}]