I am trying to parse this bit of RSS:
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
<channel>
<title>
Signal RSS - full
</title>
<link>
https://www.mystery.com
</link>
<description>
null
</description>
<pubDate>
Wed, 09 Mar 2022 14:07:31 GMT
</pubDate>
<lastBuildDate>
Wed, 09 Mar 2022 14:07:31 GMT
</lastBuildDate>
<item>
<guid isPermaLink="false">
someid
</guid>
<description>
-- other text
</description>
<text>
BC-AT&amp;T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market
</text>
<content medium="document" expression="custom" type="text/vnd.IPTC.NewsML" lang="EN" url="https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e" />
</item>
</channel>
</rss>
Fairly standard, right?
Using the example from https://kyleburton.github.io/clj-xpath/site/, I modified it into this:
(ns clj-xpath-examples.core
  (:require
    [clojure.string :as string]
    [clojure.pprint :as pp])
  (:use
    clj-xpath.core))

;; Clojure strings need double quotes; single quotes do not delimit a string.
(def input (slurp ".pathToXml.xml"))
(xml->doc input)
This gives me an error I cannot understand:
; IllegalAccessException class clojure.lang.Reflector cannot access class com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl (in module java.xml) because module java.xml does not export com.sun.org.apache.xerces.internal.jaxp to unnamed module @689eb690 jdk.internal.reflect.Reflection.newIllegalAccessException (Reflection.java:392)
Where am I going wrong? If I could use XPath for this, it would make my solution much neater.
CodePudding user response:
Here is one way to do it:
(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [clojure.walk :as walk]
    [tupelo.forest :as forest]
    [tupelo.parse.xml :as xml]
    [tupelo.string :as str]))
(def xml-str
(str/quotes->double "
<rss xmlns:media='http://search.yahoo.com/mrss/' version='2.0'>
<channel>
<title>
Signal RSS - full
</title>
<link>
https://www.mystery.com
</link>
<description>
null
</description>
<pubDate>
Wed, 09 Mar 2022 14:07:31 GMT
</pubDate>
<lastBuildDate>
Wed, 09 Mar 2022 14:07:31 GMT
</lastBuildDate>
<item>
<guid isPermaLink='false'>
someid
</guid>
<description>
-- other text
</description>
<text>
BC-AT&amp;T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market
</text>
<content medium='document' expression='custom' type='text/vnd.IPTC.NewsML' lang='EN' url='https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e' />
</item>
</channel>
</rss> "))
With this unit test:
(dotest
  (let [enlive-raw  (xml/parse xml-str)
        enlive-nice (walk/postwalk (fn [item]
                                     (if (string? item)
                                       (str/trim item)
                                       item))
                                   enlive-raw)]
    (is= enlive-nice
      {:attrs   {:version "2.0" :xmlns:media "http://search.yahoo.com/mrss/"}
       :content [{:attrs   {}
                  :content [{:attrs {} :content ["Signal RSS - full"] :tag :title}
                            {:attrs {} :content ["https://www.mystery.com"] :tag :link}
                            {:attrs {} :content ["null"] :tag :description}
                            {:attrs   {}
                             :content ["Wed, 09 Mar 2022 14:07:31 GMT"]
                             :tag     :pubDate}
                            {:attrs   {}
                             :content ["Wed, 09 Mar 2022 14:07:31 GMT"]
                             :tag     :lastBuildDate}
                            {:attrs   {}
                             :content [{:attrs {:isPermaLink "false"} :content ["someid"] :tag :guid}
                                       {:attrs {} :content ["-- other text"] :tag :description}
                                       {:attrs   {}
                                        :content ["BC-AT&T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market"]
                                        :tag     :text}
                                       {:attrs   {:expression "custom"
                                                  :lang       "EN"
                                                  :medium     "document"
                                                  :type       "text/vnd.IPTC.NewsML"
                                                  :url        "https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e"}
                                        :content []
                                        :tag     :content}]
                             :tag     :item}]
                  :tag     :channel}]
       :tag     :rss})))
Build using my favorite template project.
P.S. You may also be interested in the Tupelo Forest library:
(forest/enlive->hiccup enlive-nice) =>
[:rss
{:version "2.0", :xmlns:media "http://search.yahoo.com/mrss/"}
[:channel
[:title "Signal RSS - full"]
[:link "https://www.mystery.com"]
[:description "null"]
[:pubDate "Wed, 09 Mar 2022 14:07:31 GMT"]
[:lastBuildDate "Wed, 09 Mar 2022 14:07:31 GMT"]
[:item
[:guid {:isPermaLink "false"} "someid"]
[:description "-- other text"]
[:text
"BC-AT&T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market"]
[:content
{:expression "custom",
:lang "EN",
:medium "document",
:type "text/vnd.IPTC.NewsML",
:url
"https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e"}]]]]
CodePudding user response:
First, you can use Clojure's built-in clojure.xml/parse to get the RSS information into a data structure.
repl> (require '[clojure.xml :as xml])
nil
repl> (require '[clojure.java.io :as io])
nil
repl> (xml/parse (io/file "search.yahoo.co.xml"))
{:tag :rss, :attrs {:version "2.0", :xmlns:media "http://search....
repl> (clojure.pprint/pprint *1)
{:tag :rss,
:attrs {:version "2.0", :xmlns:media "http://search.yahoo.com/mrss/"},
:content
[{:tag :channel,
:attrs nil,
:content
[{:tag :title,
:attrs nil,
:content ["\n Signal RSS - full\n "]}
{:tag :link,
:attrs nil,
:content ["\n https://www.mystery.com\n "]}
{:tag :description,
...
What you do with the data structure depends on what you want to extract from the RSS. The structure returned by clojure.xml/parse is a tree. If you want "all the links" without regard to the tree, the core function xml-seq turns the tree into a sequence of nodes, which is then amenable to processing with map or filter.
repl> (xml/parse (io/file "search.yahoo.co.xml"))
repl> (def s (xml-seq *1))
#'rssp.core/s
repl> (count s)
20
repl> (map :tag s)
(:rss :channel :title nil :link nil :description nil :pubDate ...
repl> (filter #(= :link (:tag %)) s)
({:tag :link, :attrs nil, :content ["\n https://www.mystery.com\n "]})
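The same xml-seq approach can be wrapped in small helpers that also trim the surrounding whitespace and read attributes. This is only a sketch: sample-rss is a shortened, hypothetical stand-in for the feed in the question (including its URL), and texts-of and attrs-of are names made up for illustration.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.string :as string])
(import '(java.io ByteArrayInputStream))

;; Hypothetical, shortened stand-in for the feed in the question.
(def sample-rss
  (str "<rss version=\"2.0\"><channel>"
       "<link>\n https://www.mystery.com\n </link>"
       "<item><guid isPermaLink=\"false\">\n someid\n </guid>"
       "<content lang=\"EN\" url=\"https://api.com/doc\"/></item>"
       "</channel></rss>"))

;; clojure.xml/parse accepts an InputStream, so wrap the string's bytes.
(def nodes
  (xml-seq (xml/parse (ByteArrayInputStream. (.getBytes sample-rss "UTF-8")))))

(defn texts-of
  "Trimmed text content of every element with the given tag (illustrative helper)."
  [tag nodes]
  (for [n nodes :when (= tag (:tag n))]
    (string/trim (apply str (filter string? (:content n))))))

(defn attrs-of
  "Value of attribute k on every element with the given tag (illustrative helper)."
  [tag k nodes]
  (for [n nodes :when (= tag (:tag n))]
    (get-in n [:attrs k])))

(texts-of :link nodes)          ;; => ("https://www.mystery.com")
(attrs-of :content :url nodes)  ;; => ("https://api.com/doc")
```

Text nodes in the sequence are plain strings, so (:tag n) returns nil for them and they drop out of the filter without any special casing.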
If you want to drive a turtle around the tree, looking up and down or left and right at each node, then check out the built-in function clojure.zip/xml-zip.
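As a rough illustration of the turtle idea, the sketch below builds a zipper over a parsed document and steps down, right, and up. It uses only clojure.zip functions that ship with Clojure; sample-rss is a shortened, hypothetical stand-in for the feed in the question, and parse-str and loc-text are helper names invented here.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip]
         '[clojure.string :as string])
(import '(java.io ByteArrayInputStream))

;; Hypothetical, shortened stand-in for the feed in the question.
(def sample-rss
  (str "<rss version=\"2.0\"><channel>"
       "<title>\n Signal RSS - full\n </title>"
       "<link>\n https://www.mystery.com\n </link>"
       "</channel></rss>"))

(defn parse-str
  "Parse an XML string with clojure.xml/parse (illustrative helper)."
  [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn loc-text
  "Trimmed text of the element at a zipper location (illustrative helper)."
  [loc]
  (string/trim (apply str (filter string? (:content (zip/node loc))))))

;; The zipper starts at the root element, rss.
(def root-loc (zip/xml-zip (parse-str sample-rss)))

;; down = first child, right = next sibling, up = parent.
(def title-loc (-> root-loc zip/down zip/down)) ; rss -> channel -> title
(def link-loc  (zip/right title-loc))           ; title -> link

(loc-text title-loc)               ;; => "Signal RSS - full"
(loc-text link-loc)                ;; => "https://www.mystery.com"
(:tag (zip/node (zip/up link-loc))) ;; => :channel
```

Each location keeps its path back to the root, which is what makes relative moves such as zip/up and zip/right possible without rescanning the tree.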
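Documentation for xml-seq and the rest of clojure.core can be found at https://clojure.github.io/clojure/clojure.core-api.html; the clojure.xml and clojure.zip namespaces have their own API pages on the same site.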
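Since the question asks for XPath specifically, here is a minimal sketch that calls the JDK's own javax.xml.xpath classes through type-hinted interop instead of going through clj-xpath. The error in the question mentions clojure.lang.Reflector, so the assumption here is that keeping the calls typed (and therefore non-reflective) sidesteps that code path; whether it resolves the exact error depends on the Clojure and JDK versions in use. sample-rss is a shortened, hypothetical stand-in for the feed, and parse-doc and xpath-string are illustrative names.

```clojure
(import '(java.io StringReader)
        '(javax.xml.parsers DocumentBuilder DocumentBuilderFactory)
        '(javax.xml.xpath XPath XPathFactory)
        '(org.w3c.dom Document)
        '(org.xml.sax InputSource))

;; Hypothetical, shortened stand-in for the feed in the question.
(def sample-rss
  (str "<rss version=\"2.0\"><channel>"
       "<title>\n Signal RSS - full\n </title>"
       "<item><text>\n BC-AT&amp;T-Discovery\n </text>"
       "<content lang=\"EN\" url=\"https://api.com/doc\"/></item>"
       "</channel></rss>"))

(defn parse-doc
  "Parse an XML string into a DOM Document (illustrative helper).
  The type hints let the compiler resolve each call without reflection."
  ^Document [^String xml]
  (let [^DocumentBuilderFactory factory (DocumentBuilderFactory/newInstance)
        ^DocumentBuilder builder (.newDocumentBuilder factory)]
    (.parse builder (InputSource. (StringReader. xml)))))

(defn xpath-string
  "Evaluate an XPath expression against doc and return the string result
  (illustrative helper)."
  ^String [^Document doc ^String expr]
  (let [^XPath xp (.newXPath (XPathFactory/newInstance))]
    (.evaluate xp expr doc)))

(def doc (parse-doc sample-rss))

;; normalize-space trims and collapses the whitespace around the text.
(xpath-string doc "normalize-space(/rss/channel/title)")
;; => "Signal RSS - full"
(xpath-string doc "normalize-space(/rss/channel/item/text)")
;; => "BC-AT&T-Discovery"
(xpath-string doc "string(/rss/channel/item/content/@url)")
;; => "https://api.com/doc"
```

The unprefixed element names work here because none of the elements in the feed live in a namespace; a feed that actually used media: elements would need a namespace-aware parser and a NamespaceContext on the XPath object.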