I'm building a .NET 5 application to scrape RSS feeds, and I would like to avoid custom string-parsing logic. Instead, I would like to deserialize the XML directly into C# objects. I've done this once before, using xsd.exe to generate a schema file and then a .cs file from it. However, that's not working this time. Here's what I am trying to scrape:
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
<channel>
<item>
<title>Fire kills four newborn babies at children's hospital in India</title>
<link>http://news.sky.com/story/india-fire-kills-four-newborn-babies-at-childrens-hospital-in-madhya-pradesh-12464344</link>
<description>Four newborn babies have died after a fire broke out at a children's hospital in India, officials said.</description>
<pubDate>Tue, 09 Nov 2021 07:51:00 +0000</pubDate>
<guid>http://news.sky.com/story/india-fire-kills-four-newborn-babies-at-childrens-hospital-in-madhya-pradesh-12464344</guid>
<enclosure url="https://e3.365dm.com/21/11/70x70/skynews-india-fire-childrens-hospital_5577072.jpg?20211109081515" length="0" type="image/jpeg" />
<media:description type="html">A man carries a child out from the Kamla Nehru Children’s Hospital after a fire in the newborn care unit of the hospital killed four infants, in Bhopal, India, Monday, Nov. 8, 2021. There were 40 children in total in the unit, out of which 36 have been rescued, said Medical Education Minister Vishwas Sarang. (AP Photo) </media:description>
<media:thumbnail url="https://e3.365dm.com/21/11/70x70/skynews-india-fire-childrens-hospital_5577072.jpg?20211109081515" width="70" height="70" />
<media:content type="image/jpeg" url="https://e3.365dm.com/21/11/70x70/skynews-india-fire-childrens-hospital_5577072.jpg?20211109081515" />
...
</item>
</channel>
</rss>
So far I've tried xsd.exe and this online tool: https://xmltocsharp.azurewebsites.net/. Both have trouble with the <description> and <media:description> tags: each tries to create a second "description" element inside the item, which fails:
- xsd.exe fails on execution and does not produce classes unless I remove one of the two tags.
- The online tool produces classes, but those fail when I try to instantiate XmlSerializer with them.
I can see that there are two description tags, but one of them is defined within the media namespace. As far as xsd.exe and .NET are concerned, both tags seem to be mapped to the same property, which is clearly an issue. Is this invalid XML, or is there some limitation in those tools that prevents a successful mapping? Is there any workaround other than string parsing?
CodePudding user response:
The XML is valid. The problem is that you have to provide a schema definition for the "media" namespace to xsd.exe. The Media RSS Specification is the complete description of that namespace. Unfortunately, I could not find a published XSD file for it, but it is possible to generate one from the XML you have provided. I am using Visual Studio for this (open the file in Visual Studio, then select "XML" > "Create Schema" from the menu); other tools may be able to do the same. Visual Studio will probably not generate the full schema as described in the specification, only what it can detect in the XML. Because the sample uses two namespaces, you also need a separate schema file for the "media" namespace, which the main schema imports. Here is what I was able to generate from your example:
file rss.xsd
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:import namespace="http://search.yahoo.com/mrss/" schemaLocation="media.xsd" />
<xs:element name="rss">
<xs:complexType>
<xs:sequence>
<xs:element name="channel">
<xs:complexType>
<xs:sequence>
<xs:element name="item" maxOccurs="unbounded">
<xs:complexType mixed="true">
<xs:sequence>
<xs:element name="title" type="xs:string" />
<xs:element name="link" type="xs:string" />
<xs:element name="description" type="xs:string" />
<xs:element name="pubDate" type="xs:string" />
<xs:element name="guid" type="xs:string" />
<xs:element name="enclosure">
<xs:complexType>
<xs:attribute name="url" type="xs:string" use="required" />
<xs:attribute name="length" type="xs:long" use="required" />
<xs:attribute name="type" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
<xs:element ref="media:description" />
<xs:element ref="media:thumbnail" />
<xs:element ref="media:content" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="version" type="xs:decimal" use="required" />
</xs:complexType>
</xs:element>
</xs:schema>
file media.xsd
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="http://search.yahoo.com/mrss/">
<xs:element name="description">
<xs:complexType>
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="type" type="xs:string" use="required" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
<xs:element name="thumbnail">
<xs:complexType>
<xs:attribute name="url" type="xs:string" use="required" />
<xs:attribute name="width" type="xs:unsignedInt" use="required" />
<xs:attribute name="height" type="xs:unsignedInt" use="required" />
</xs:complexType>
</xs:element>
<xs:element name="content">
<xs:complexType>
<xs:attribute name="type" type="xs:string" use="required" />
<xs:attribute name="url" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
</xs:schema>
You can extend the XSD files if required; the Media RSS Specification mentioned above describes the full namespace. Now run xsd.exe with both schema files:
c:\temp>xsd.exe media.xsd rss.xsd /c
This will generate the C# classes.
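To illustrate why the two description tags do not have to collide, here is a minimal hand-written sketch of how XmlSerializer can tell them apart through the Namespace property of XmlElementAttribute. The class names (Rss, Channel, Item, MediaDescription, FeedReader) are hypothetical and chosen only for this example; they are not the names xsd.exe generates, and only a few of the item's elements are mapped. The embedded XML is a shortened stand-in for your feed.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical hand-written classes for illustration only;
// the classes produced by xsd.exe will use different names.
[XmlRoot("rss")]
public class Rss
{
    [XmlAttribute("version")]
    public string Version { get; set; }

    [XmlElement("channel")]
    public Channel Channel { get; set; }
}

public class Channel
{
    // Repeated <item> elements are collected into one array.
    [XmlElement("item")]
    public Item[] Items { get; set; }
}

public class Item
{
    [XmlElement("title")]
    public string Title { get; set; }

    // Plain <description>: local name "description", no namespace.
    [XmlElement("description")]
    public string Description { get; set; }

    // <media:description>: same local name, but in the Media RSS namespace,
    // so the serializer treats it as a different element.
    [XmlElement("description", Namespace = "http://search.yahoo.com/mrss/")]
    public MediaDescription MediaDescription { get; set; }
}

public class MediaDescription
{
    [XmlAttribute("type")]
    public string Type { get; set; }

    // The element's text content.
    [XmlText]
    public string Text { get; set; }
}

public static class FeedReader
{
    // Deserializes an RSS document held in a string.
    public static Rss Parse(string xml)
    {
        var serializer = new XmlSerializer(typeof(Rss));
        using var reader = new StringReader(xml);
        return (Rss)serializer.Deserialize(reader);
    }
}

public static class Program
{
    public static void Main()
    {
        // Shortened sample, not the real feed.
        const string xml = @"<rss xmlns:media=""http://search.yahoo.com/mrss/"" version=""2.0"">
  <channel>
    <item>
      <title>Example title</title>
      <description>Plain description</description>
      <media:description type=""html"">Media description</media:description>
    </item>
  </channel>
</rss>";

        Item first = FeedReader.Parse(xml).Channel.Items[0];
        Console.WriteLine(first.Description);
        Console.WriteLine(first.MediaDescription.Text);
        Console.WriteLine(first.MediaDescription.Type);
    }
}
```

Elements of the item that are not mapped here (link, pubDate, enclosure and so on) are skipped by XmlSerializer by default, so the same sketch should also read the full feed; add properties for them as needed. The classes generated by xsd.exe from the two schema files should express the same namespace distinction through equivalent attributes.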