Home > Blockchain >  Extract data from XML document using t-sql
Extract data from XML document using t-sql

Time:04-30

I have been trying to extract data from the following xml doc using t-sql on sql server 2019.

XML:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://www.URL1.com/1</loc>
    <image:image>
      <image:loc>https://www.URL1.com/11</image:loc>
    </image:image>
    <image:image>
      <image:loc>https://www.URL1.com/12</image:loc>
    </image:image>
    <image:image>
      <image:loc>https://www.URL1.com/13</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://www.URL1.com/2</loc>
    <image:image>
      <image:loc>https://www.URL1.com/21</image:loc>
    </image:image>
    <image:image>
      <image:loc>https://www.URL1.com/22</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://www.URL1.com/3</loc>
    <image:image>
      <image:loc>https://www.URL1.com/32</image:loc>
    </image:image>
  </url>
</urlset>

I would like to extract data out of the xml document into a SQL Server table. My desired output as below

Desired output:

 ------------------------ ------------------------- 
|          Loc           |        ImageLoc         |
 ------------------------ ------------------------- 
| https://www.URL1.com/1 | https://www.URL1.com/11 |
| https://www.URL1.com/1 | https://www.URL1.com/12 |
| https://www.URL1.com/1 | https://www.URL1.com/13 |
| https://www.URL1.com/2 | https://www.URL1.com/21 |
| https://www.URL1.com/2 | https://www.URL1.com/22 |
| https://www.URL1.com/3 | https://www.URL1.com/32 |
 ------------------------ ------------------------- 

My Attempts have been failed so far miserably. I have tried many thing but the only thing that allowed me to get even the Loc element was the following, I have tried using OUTER APPLY/CROSS APPLY to het the ImageLoc with no luck.

My Attempt:

DECLARE @xml XML
SELECT @xml = BulkColumn
FROM OPENROWSET(BULK 'M:\Files\MyXML.xml', SINGLE_BLOB) x

SELECT
    t.c.value('(text())[1]', 'VARCHAR(max)') URLs
,   t2.i.value('(text())[1]', 'VARCHAR(max)') URLs

FROM @xml.nodes('*:urlset/*:url/*:loc') t(c)
OUTER APPLY  @xml.nodes('*:urlset/*:url/*:loc/*:image/*:loc') t2(i)

Could you please help? Thanks in advance

CodePudding user response:

This answer was posted by lptr in the comments as just a link to a fiddle. As the OP has said that it answers their question, and lptr doesn't wish/respond to posting answers, I have migrated it to the answer section.

Here they use the * wildcard rather than defining the namespace to get the values from the XML:

dbfiddle.uk/...

SELECT
    t.c.value('(*:loc/text())[1]', 'VARCHAR(max)') URLs
,   t2.i.value('(text())[1]', 'VARCHAR(max)') URLs

FROM @xml.nodes('*:urlset/*:url') t(c)
OUTER APPLY  t.c.nodes('*:image/*:loc') t2(i);

CodePudding user response:

You need to define your namespace in your SQL as well. This can be done with putting WITH XMLNAMESPACES at the start your query and defining it there. Then you can define the image namespace and prefix it in your references and return the values from the nodes:

WITH XMLNAMESPACES ('xyz' AS image)
SELECT u.i.value('(../loc/text())[1]','varchar(500)') AS loc,
       u.i.value('(image:loc/text())[1]','varchar(500)') AS loc
FROM @xml.nodes('urlset/url/image:image') u(i);

db<>fiddle

  • Related