Parsing an RSS feed in Clojure with XPath

Time:03-12

I am trying to parse this bit of RSS:

<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        <title>
            Signal RSS - full
        </title>
        <link>
            https://www.mystery.com
        </link>
        <description>
            null
        </description>
        <pubDate>
            Wed, 09 Mar 2022 14:07:31 GMT
        </pubDate>
        <lastBuildDate>
            Wed, 09 Mar 2022 14:07:31 GMT
        </lastBuildDate>
        <item>
            <guid isPermaLink="false">
                someid
            </guid>
            <description>
                -- other text
            </description>
            <text>
                BC-AT&amp;T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market
            </text>
            <content medium="document" expression="custom" type="text/vnd.IPTC.NewsML" lang="EN" url="https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e" />
        </item>
    </channel>
</rss>

Fairly standard, right?

Using the example from https://kyleburton.github.io/clj-xpath/site/, I modified it into this:

(ns clj-xpath-examples.core
  (:require
   [clojure.string :as string]
   [clojure.pprint :as pp])
  (:use
   clj-xpath.core))

;; Clojure string literals need double quotes; single quotes do not delimit strings.
(def input (slurp ".pathToXml.xml"))

(xml->doc input)

which gives me this error I cannot understand:

; IllegalAccessException class clojure.lang.Reflector cannot access class com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl (in module java.xml) because module java.xml does not export com.sun.org.apache.xerces.internal.jaxp to unnamed module @689eb690  jdk.internal.reflect.Reflection.newIllegalAccessException (Reflection.java:392)

Where am I going wrong? If I can use XPath for this, it would make my solution much neater.

CodePudding user response:

Here is one way to do it:

(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [clojure.walk :as walk]
    [tupelo.forest :as forest]
    [tupelo.parse.xml :as xml]
    [tupelo.string :as str]
    ))

(def xml-str
  (str/quotes->double "
      <rss xmlns:media='http://search.yahoo.com/mrss/' version='2.0'>
        <channel>
            <title>
                Signal RSS - full
            </title>
            <link>
                https://www.mystery.com
            </link>
            <description>
                null
            </description>
            <pubDate>
                Wed, 09 Mar 2022 14:07:31 GMT
            </pubDate>
            <lastBuildDate>
                Wed, 09 Mar 2022 14:07:31 GMT
            </lastBuildDate>
            <item>
                <guid isPermaLink='false'>
                    someid
                </guid>
                <description>
                    -- other text
                </description>
                <text>
                    BC-AT&amp;T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market
                </text>
                <content medium='document' expression='custom' type='text/vnd.IPTC.NewsML' lang='EN' url='https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e' />
            </item>
        </channel>
    </rss> "))

with unit test:

(dotest
  (let [enlive-raw  (xml/parse xml-str)
        enlive-nice (walk/postwalk (fn [item]
                                     (if (string? item)
                                       (str/trim item)
                                       item))
                      enlive-raw)]
    (is= enlive-nice
      {:attrs   {:version "2.0" :xmlns:media "http://search.yahoo.com/mrss/"}
       :content [{:attrs   {}
                  :content [{:attrs {} :content ["Signal RSS - full"] :tag :title}
                            {:attrs {} :content ["https://www.mystery.com"] :tag :link}
                            {:attrs {} :content ["null"] :tag :description}
                            {:attrs   {}
                             :content ["Wed, 09 Mar 2022 14:07:31 GMT"]
                             :tag     :pubDate}
                            {:attrs   {}
                             :content ["Wed, 09 Mar 2022 14:07:31 GMT"]
                             :tag     :lastBuildDate}
                            {:attrs   {}
                             :content [{:attrs {:isPermaLink "false"} :content ["someid"] :tag :guid}
                                       {:attrs {} :content ["-- other text"] :tag :description}
                                       {:attrs   {}
                                        :content ["BC-AT&T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market"]
                                        :tag     :text}
                                       {:attrs   {:expression "custom"
                                                  :lang       "EN"
                                                  :medium     "document"
                                                  :type       "text/vnd.IPTC.NewsML"
                                                  :url        "https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e"}
                                        :content []
                                        :tag     :content}]
                             :tag     :item}]
                  :tag     :channel}]
       :tag :rss})))

Built using my favorite template project.


P.S. You may also be interested in the Tupelo Forest library:

(forest/enlive->hiccup enlive-nice) => 
[:rss
 {:version "2.0", :xmlns:media "http://search.yahoo.com/mrss/"}
 [:channel
  [:title "Signal RSS - full"]
  [:link "https://www.mystery.com"]
  [:description "null"]
  [:pubDate "Wed, 09 Mar 2022 14:07:31 GMT"]
  [:lastBuildDate "Wed, 09 Mar 2022 14:07:31 GMT"]
  [:item
   [:guid {:isPermaLink "false"} "someid"]
   [:description "-- other text"]
   [:text
    "BC-AT&T-Discovery-Start-Mega-Bond-Sale-in-Test-of-Uneasy-Market"]
   [:content
    {:expression "custom",
     :lang "EN",
     :medium "document",
     :type "text/vnd.IPTC.NewsML",
     :url
     "https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e"}]]]]

CodePudding user response:

First, you can use Clojure's built-in clojure.xml/parse to get the RSS information into a data structure.

repl> (require '[clojure.xml :as xml])
nil

repl> (require '[clojure.java.io :as io])
nil

repl> (xml/parse (io/file "search.yahoo.co.xml"))
{:tag :rss, :attrs {:version "2.0", :xmlns:media "http://search....

repl> (clojure.pprint/pprint *1)
{:tag :rss,
 :attrs {:version "2.0", :xmlns:media "http://search.yahoo.com/mrss/"},
 :content
 [{:tag :channel,
   :attrs nil,
   :content
   [{:tag :title,
     :attrs nil,
     :content ["\n            Signal RSS - full\n        "]}
    {:tag :link,
     :attrs nil,
     :content ["\n            https://www.mystery.com\n        "]}
    {:tag :description,
...

What you do with the data structure depends on what you want to extract from the RSS. The structure returned by clojure.xml/parse is a tree. If you want "all the links" without regard to their position in the tree, the core function xml-seq turns the tree into a sequence of nodes, which you can then process with map or filter.

repl> (xml/parse (io/file "search.yahoo.co.xml"))

repl> (def s (xml-seq *1))
#'rssp.core/s

repl> (count s)
20

repl> (map :tag s)
(:rss :channel :title nil :link nil :description nil :pubDate ...

repl> (filter #(= :link (:tag %)) s)
({:tag :link, :attrs nil, :content ["\n            https://www.mystery.com\n        "]})
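
The same xml-seq approach can be wrapped into small helpers that also trim the surrounding whitespace and read attributes. This is only a sketch: parse-str, texts-for and attr-values are hypothetical helper names (not part of clojure.xml), and rss-str is a cut-down copy of the feed from the question, inlined so the snippet stands alone.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.string :as string])

;; Cut-down copy of the feed from the question (assumption: inlined for the example).
(def rss-str
  (str "<rss version='2.0'><channel>"
       "<title> Signal RSS - full </title>"
       "<link> https://www.mystery.com </link>"
       "<item>"
       "<guid isPermaLink='false'> someid </guid>"
       "<content medium='document' url='https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e' />"
       "</item>"
       "</channel></rss>"))

(defn parse-str
  "Hypothetical helper: parse an XML string with clojure.xml/parse."
  [s]
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn texts-for
  "Hypothetical helper: trimmed text of every element with the given tag."
  [root tag]
  (->> (xml-seq root)
       (filter #(= tag (:tag %)))
       ;; keep only the string children, then strip the padding whitespace
       (map #(string/trim (apply str (filter string? (:content %)))))))

(defn attr-values
  "Hypothetical helper: value of attribute attr on every element with the given tag."
  [root tag attr]
  (->> (xml-seq root)
       (filter #(= tag (:tag %)))
       (map #(get-in % [:attrs attr]))))

(def root (parse-str rss-str))

(texts-for root :link)          ; the trimmed link text
(attr-values root :content :url) ; the url attribute of each <content> element
```

Because xml-seq flattens the whole tree, these helpers match a tag wherever it occurs; in the full feed, :description would match both the channel-level and the item-level element.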

If you want to drive a turtle around the tree, looking up and down or left and right at each node, then check out the built-in function clojure.zip/xml-zip.
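
As a rough illustration of that kind of navigation, the sketch below builds a zipper with clojure.zip/xml-zip and moves down, right and up through it. The feed is again a cut-down inline copy of the one in the question, and find-loc is a hypothetical helper written for this example, not a clojure.zip function.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])

;; Cut-down copy of the feed from the question (assumption: inlined for the example).
(def rss-str
  (str "<rss version='2.0'><channel>"
       "<title>Signal RSS - full</title>"
       "<link>https://www.mystery.com</link>"
       "<item>"
       "<guid isPermaLink='false'>someid</guid>"
       "<content medium='document' url='https://api.com/syndication/newsml/v12/news/R8FRGG3/a715dac7-5282-4422-be8e' />"
       "</item>"
       "</channel></rss>"))

;; A zipper location ("loc") pointing at the root <rss> element.
(def root-loc
  (zip/xml-zip
    (xml/parse (java.io.ByteArrayInputStream. (.getBytes rss-str "UTF-8")))))

;; down twice: rss -> channel -> title (the first child at each level)
(def title-loc (-> root-loc zip/down zip/down))

;; right: the next sibling of <title> is <link>
(def link-loc (zip/right title-loc))

;; up: back to the parent <channel>
(def channel-loc (zip/up link-loc))

(defn find-loc
  "Hypothetical helper: first loc, in depth-first order, whose node has the given tag."
  [loc tag]
  (->> (iterate zip/next loc)
       (take-while (complement zip/end?))
       (filter #(= tag (:tag (zip/node %))))
       first))

;; Walk to the <content> element and read its url attribute.
(def content-url
  (get-in (zip/node (find-loc root-loc :content)) [:attrs :url]))
```

zip/node returns the plain map (or string) at a location, so once the walk has reached the element of interest the usual map functions apply.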

Documentation for all these functions can be found at https://clojure.github.io/clojure/clojure.core-api.html
