I want to scrape the data from the source of this link: https://news.ycombinator.com/rss. The feed contains markup like <link>the URL</link>. However, when I use the code below, the printed output for the link is <link/>the URL, and the 'link' key in the JSON file has no content.
import requests
import bs4
from bs4 import BeautifulSoup
import json
import html5lib

def rss(x):
    r = requests.get(x)
    s = BeautifulSoup(r.content, features='html5lib')
    the_list = []
    for i in s.find_all('item'):
        title = i.find('title').text
        link = i.find('link').text
        date = i.find('pubdate').text
        article = {
            'title' : title,
            'link' : link,
            'date' : date
        }
        the_list.append(article)
    with open('the_list.json','w') as f:
        json.dump(the_list,f)

rss('https://news.ycombinator.com/rss')
CodePudding user response:
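What happens?
As you already noticed, the <link> element is not well formed in the soup. html5lib is an HTML parser, and in HTML <link> is a void element that cannot have content, so the soup ends up with an empty <link/> followed by the URL as a separate text node, and the stray </link> is dropped. That is why you won't get any text out of it with the text attribute.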
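To see this for yourself, here is a small sketch using a made-up one-item feed (SAMPLE and the example.com URL are placeholders, not taken from the real feed). It assumes bs4 and html5lib are installed and only prints how the parser represents the link element.

```python
from bs4 import BeautifulSoup  # html5lib must be installed for this parser

# Made-up one-item feed; the example.com URL is a placeholder.
SAMPLE = (
    "<rss><channel><item>"
    "<title>Example</title>"
    "<link>https://example.com/a</link>"
    "</item></channel></rss>"
)

item = BeautifulSoup(SAMPLE, features="html5lib").find("item")
link = item.find("link")

print(repr(link.text))          # text inside the element itself
print(repr(link.next_sibling))  # node that directly follows the element
```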
The good news is that there is a solution.
How to fix?
Select the text with the next_sibling of the <link> element:
i.find('link').next_sibling
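Dropped into the loop from the question, the next_sibling approach might look like the sketch below. The names parse_items and SAMPLE_FEED are hypothetical, and the feed is a made-up string with placeholder URLs so the code can be tried without a network request. The strip() call and the NavigableString check are precautions I added for feeds with whitespace or empty link elements; they are not part of the original one-liner.

```python
from bs4 import BeautifulSoup, NavigableString  # html5lib must be installed


def parse_items(feed_text):
    """Return a list of {'title', 'link', 'date'} dicts from RSS text."""
    soup = BeautifulSoup(feed_text, features="html5lib")
    articles = []
    for item in soup.find_all("item"):
        link_tag = item.find("link")
        # The URL is the text node right after the emptied <link/> element.
        sibling = link_tag.next_sibling if link_tag is not None else None
        url = sibling.strip() if isinstance(sibling, NavigableString) else ""
        articles.append({
            "title": item.find("title").text,
            "link": url,
            # html5lib lowercases tag names, hence 'pubdate'.
            "date": item.find("pubdate").text,
        })
    return articles


# Hypothetical sample feed; titles and URLs are placeholders.
SAMPLE_FEED = """<rss><channel>
<item>
<title>First</title>
<link>https://example.com/a</link>
<pubDate>Thu, 14 Oct 2021 13:31:43 +0000</pubDate>
</item>
<item>
<title>Second</title>
<link>https://example.com/b</link>
<pubDate>Thu, 14 Oct 2021 14:48:59 +0000</pubDate>
</item>
</channel></rss>"""

for article in parse_items(SAMPLE_FEED):
    print(article)
```

To use it on the live feed, pass requests.get(url).content to parse_items and dump the returned list with json.dump as in the question.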
Output
[{"title": "Gitlab from YC to IPO", "link": "https://blog.ycombinator.com/gitlab-from-yc-to-ipo/", "date": "Thu, 14 Oct 2021 13:31:43 +0000"},
 {"title": "Apple Joins Blender Development Fund", "link": "https://www.blender.org/press/apple-joins-blender-development-fund/", "date": "Thu, 14 Oct 2021 14:48:59 +0000"},
 {"title": "Sunset Geometry (2016)", "link": "https://www.shapeoperator.com/2016/12/12/sunset-geometry/", "date": "Thu, 14 Oct 2021 14:29:08 +0000"},
 {"title": "iPhone Macro: A Big Day for Small Things", "link": "https://lux.camera/iphone-macro-camera-a-big-day-for-small-things/", "date": "Mon, 11 Oct 2021 10:22:06 +0000"},
 {"title": "Michelin Airless", "link": "https://www.michelin.com/en/innovation/vision-concept/airless/", "date": "Thu, 14 Oct 2021 14:36:58 +0000"},
 {"title": "Release (YC W20) Is Hiring \u2013 Product Marketing Manager", "link": "https://releasehub.com/company#hire", "date": "Thu, 14 Oct 2021 17:00:15 +0000"},
 {"title": "Global Climate Report \u2013 September 2021", "link": "https://www.ncdc.noaa.gov/sotc/global/202109", "date": "Thu, 14 Oct 2021 14:49:59 +0000"},
 {"title": "Esbuild \u2013 An extremely fast JavaScript bundler", "link": "https://esbuild.github.io/", "date": "Thu, 14 Oct 2021 05:07:27 +0000"},
 {"title": "Small Language Models Are Also Few-Shot Learners", "link": "https://aclanthology.org/2021.naacl-main.185/", "date": "Tue, 12 Oct 2021 09:59:34 +0000"},
 {"title": "Who was Aleph Null? (2013)", "link": "http://bit-player.org/2013/who-was-aleph-null", "date": "Mon, 11 Oct 2021 08:35:29 +0000"},
 {"title": "Hands-On Rust: Effective Learning Through 2D Game Development and Play", "link": "https://pragprog.com/titles/hwrust/hands-on-rust/", "date": "Thu, 14 Oct 2021 07:59:24 +0000"},
 {"title": "Ask HN: What's the Point of Life?", "link": "https://news.ycombinator.com/item?id=28866558", "date": "Thu, 14 Oct 2021 16:38:15 +0000"},
 {"title": "What I wish I knew when learning F#", "link": "https://danielbachler.de/2020/12/23/what-i-wish-i-knew-when-learning-fsharp.html", "date": "Thu, 14 Oct 2021 12:07:40 +0000"},
 {"title": "Investing in Startups by Passing the Series 65", "link": "https://www.natecation.com/accredited-investor-investing-startups-series-65/", "date": "Wed, 13 Oct 2021 17:57:25 +0000"},
 {"title": "OpenBSD 7.0", "link": "https://www.openbsd.org/70.html", "date": "Thu, 14 Oct 2021 10:24:21 +0000"},
 {"title": "Countries are gathering in an effort to stop a biodiversity collapse", "link": "https://www.nytimes.com/2021/10/14/climate/un-biodiversity-conference-climate-change.html", "date": "Thu, 14 Oct 2021 13:32:00 +0000"},
 {"title": "Alden Global Capital, the secretive hedge fund gutting newsrooms", "link": "https://www.theatlantic.com/magazine/archive/2021/11/alden-global-capital-killing-americas-newspapers/620171/", "date": "Thu, 14 Oct 2021 15:17:06 +0000"},
 {"title": "Child suicides in Japan hit record high", "link": "https://www3.nhk.or.jp/nhkworld/en/news/20211013_19/", "date": "Thu, 14 Oct 2021 08:52:39 +0000"},
 {"title": "Every search bar looks like a URL bar to users", "link": "https://shkspr.mobi/blog/2021/10/every-search-bar-looks-like-a-url-bar-to-users/", "date": "Thu, 14 Oct 2021 13:27:58 +0000"},
 {"title": "Psychonetics: A nerd's toolset to work with mind and perception", "link": "http://deconcentration-of-attention.com/psychonetics.html", "date": "Tue, 12 Oct 2021 11:28:43 +0000"},
 {"title": "FB seals off some internal message boards to prevent leaking, immediately leaked", "link": "https://www.businessinsider.com/facebook-whistleblower-leaks-restricts-staff-access-message-boards-elections-safety-2021-10", "date": "Thu, 14 Oct 2021 11:09:08 +0000"},
 {"title": "Working around expired root certificates", "link": "https://scotthelme.co.uk/should-clients-care-about-the-expiration-of-a-root-certificate/", "date": "Mon, 11 Oct 2021 21:27:27 +0000"},
 {"title": "An unprecedented wave of online bank fraud is hitting Britain", "link": "https://www.reuters.com/world/uk/welcome-britain-bank-scam-capital-world-2021-10-14/", "date": "Thu, 14 Oct 2021 09:57:39 +0000"},
 {"title": "Interoperable Serendipity", "link": "https://noeldemartin.com/blog/interoperable-serendipity", "date": "Wed, 13 Oct 2021 12:02:37 +0000"},
 {"title": "Instagram took down post with figure from paper showing male advantage in sports", "link": "https://twitter.com/SwipeWright/status/1448064426670583814", "date": "Thu, 14 Oct 2021 16:36:30 +0000"},
 {"title": "IoT hacking and rickrolling my high school district", "link": "https://whitehoodhacker.net/posts/2021-10-04-the-big-rick", "date": "Tue, 12 Oct 2021 19:38:06 +0000"},
 {"title": "Boeing says certain 787 parts improperly manufactured", "link": "https://www.reuters.com/business/aerospace-defense/boeing-deals-with-new-defect-787-dreamliner-wsj-2021-10-14/", "date": "Thu, 14 Oct 2021 13:26:46 +0000"},
 {"title": "Practice Problems for Hardware Engineers", "link": "https://arxiv.org/abs/2110.06526", "date": "Thu, 14 Oct 2021 03:48:24 +0000"},
 {"title": "Interface ergonomics: automation isn't just about time saved", "link": "https://macoy.me/blog/programming/InterfaceFriction", "date": "Wed, 13 Oct 2021 01:05:52 +0000"},
 {"title": "Syncthing \u2013 a continuous file synchronization program", "link": "https://syncthing.net/", "date": "Thu, 14 Oct 2021 01:23:19 +0000"}]