Home > Software engineering >  SQL Server Parse XML Column to table format - IRS 990
SQL Server Parse XML Column to table format - IRS 990

Time:02-20

I have three columns in the table with the third one being the valid XML type. A sample of the XML is listed below which includes several tags. This XML is downloaded from the IRS forms 990. I am trying to get a handful of tags to return in a tabular format. I’ve done this query below and it runs but it doesn’t return any values. I’m probably doing something wrong with the path but I can’t seem to figure it out.

Here is my table structure:

CREATE TABLE dbo.XMLFilesTable
(
Id INT IDENTITY PRIMARY KEY,
FileName VARCHAR(100),
XMLData XML,
LoadedDateTime DATETIME
)

GO

Here is the xml from the xml column:

<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" returnVersion="2019v5.1">
  <ReturnData documentCnt="7">
    <IRS990 documentId="IRS990">
      <PrincipalOfficerNm>Fr Francis Pizzarelli</PrincipalOfficerNm>
      <USAddress>
        <AddressLine1Txt>One High Street</AddressLine1Txt>
        <CityNm>Port Jefferson</CityNm>
        <StateAbbreviationCd>NY</StateAbbreviationCd>
        <ZIPCd>11777</ZIPCd>
      </USAddress>
      <GrossReceiptsAmt>6463631</GrossReceiptsAmt>
      <GroupReturnForAffiliatesInd>false</GroupReturnForAffiliatesInd>
      <Organization501c3Ind>X</Organization501c3Ind>
      <TypeOfOrganizationCorpInd>X</TypeOfOrganizationCorpInd>
      <FormationYr>1980</FormationYr>
      <LegalDomicileStateCd>NY</LegalDomicileStateCd>
      <ActivityOrMissionDesc>To provide help to individuals who can not find help in the form of housing, counseling, and other support so that they can eventually live productive independent lives. Hope House provides hope, care and compassion to nearly 2,000 individuals in need each month.</ActivityOrMissionDesc>
      <VotingMembersGoverningBodyCnt>8</VotingMembersGoverningBodyCnt>
      <VotingMembersIndependentCnt>7</VotingMembersIndependentCnt>
      <TotalEmployeeCnt>122</TotalEmployeeCnt>
      <TotalGrossUBIAmt>0</TotalGrossUBIAmt>
      <NetUnrelatedBusTxblIncmAmt>0</NetUnrelatedBusTxblIncmAmt>
      <PYContributionsGrantsAmt>3307653</PYContributionsGrantsAmt>
      <CYContributionsGrantsAmt>4963545</CYContributionsGrantsAmt>
      <CYProgramServiceRevenueAmt>0</CYProgramServiceRevenueAmt>
      <PYInvestmentIncomeAmt>25158</PYInvestmentIncomeAmt>
      <CYInvestmentIncomeAmt>122678</CYInvestmentIncomeAmt>
      <PYOtherRevenueAmt>1302778</PYOtherRevenueAmt>
      <CYOtherRevenueAmt>1016131</CYOtherRevenueAmt>
      <PYTotalRevenueAmt>4635589</PYTotalRevenueAmt>
      <CYTotalRevenueAmt>6102354</CYTotalRevenueAmt>
      <CYGrantsAndSimilarPaidAmt>0</CYGrantsAndSimilarPaidAmt>
      <CYBenefitsPaidToMembersAmt>0</CYBenefitsPaidToMembersAmt>
      <PYSalariesCompEmpBnftPaidAmt>3294184</PYSalariesCompEmpBnftPaidAmt>
      <CYSalariesCompEmpBnftPaidAmt>3352675</CYSalariesCompEmpBnftPaidAmt>
      <CYTotalProfFndrsngExpnsAmt>0</CYTotalProfFndrsngExpnsAmt>
      <CYTotalFundraisingExpenseAmt>501828</CYTotalFundraisingExpenseAmt>
      <PYOtherExpensesAmt>1793710</PYOtherExpensesAmt>

    

I tried the following code and it runs but returns nothing:

DECLARE @XML AS XML, @hDoc AS INT, @SQL NVARCHAR (MAX), @rootxmlns VARCHAR(100)

SELECT @XML = XMLData FROM XMLFilesTable
SET @rootxmlns = '<Return xmlns="http://www.irs.gov/efile"/'

EXEC sp_xml_preparedocument @hDoc OUTPUT, @XML


SELECT PYContributionsGrantsAmt, CYContributionsGrantsAmt, PYInvestmentIncomeAmt
FROM OPENXML(@hDoc, 'Return/ReturnData')
WITH
(
PYContributionsGrantsAmt [BIGINT] 'PYContributionsGrantsAmt',
CYContributionsGrantsAmt [BIGINT] 'CYContributionsGrantsAmt',
PYInvestmentIncomeAmt [BIGINT] 'PYInvestmentIncomeAmt'
)

EXEC sp_xml_removedocument @hDoc
GO

My thoughts are that the path after the open xml is not right.

Any thoughts on this?

CodePudding user response:

There are several issues in the posted code...

  1. SET @rootxmlns = '<Return xmlns="http://www.irs.gov/efile"/' is not a self-closing tag and is also attempting to define a default namespace which is problematic with OPENXML.

  2. EXEC sp_xml_preparedocument @hDoc OUTPUT, @XML doesn't include the @rootxmlns parameter, so the efile namespace isn't used for querying.

  3. OPENXML(@hDoc, 'Return/ReturnData') doesn't include the IRS990 element in the XPath.

Correcting for these issues a result can be obtained with the following:

DECLARE @XML AS XML, @hDoc AS INT, @SQL NVARCHAR (MAX), @rootxmlns VARCHAR(100)

SELECT @XML = XMLData FROM XMLFilesTable
SET @rootxmlns = '<Return xmlns:example="http://www.irs.gov/efile"/>'

EXEC sp_xml_preparedocument @hDoc OUTPUT, @XML, @rootxmlns


SELECT PYContributionsGrantsAmt, CYContributionsGrantsAmt, PYInvestmentIncomeAmt
FROM OPENXML(@hDoc, 'example:Return/example:ReturnData/example:IRS990')
WITH
(
PYContributionsGrantsAmt [BIGINT] 'example:PYContributionsGrantsAmt',
CYContributionsGrantsAmt [BIGINT] 'example:CYContributionsGrantsAmt',
PYInvestmentIncomeAmt [BIGINT] 'example:PYInvestmentIncomeAmt'
)

EXEC sp_xml_removedocument @hDoc

Which yields the result:

PYContributionsGrantsAmt CYContributionsGrantsAmt PYInvestmentIncomeAmt
3307653 4963545 25158

Compare the above SQL code with a nodes() and values() implementation which not only allows you to specify the default namespace using with xmlnamspaces but can also be used in a set-based fashion to shred the XML for multiple rows in the XMLFilesTable table...

with xmlnamespaces (default 'http://www.irs.gov/efile')
select
  IRS990.value('(PYContributionsGrantsAmt/text())[1]', 'bigint') as PYContributionsGrantsAmt,
  IRS990.value('(CYContributionsGrantsAmt/text())[1]', 'bigint') as CYContributionsGrantsAmt,
  IRS990.value('(PYInvestmentIncomeAmt/text())[1]', 'bigint') as PYInvestmentIncomeAmt
from dbo.XMLFilesTable
cross apply XMLData.nodes('/Return/ReturnData/IRS990') ReturnData(IRS990);

CodePudding user response:

Microsoft proprietary OPENXML and its companions sp_xml_preparedocument and sp_xml_removedocument are kept just for backward compatibility with the obsolete SQL Server 2000. Their use is diminished just to very few fringe cases.

Starting from SQL Server 2005 onwards, it is strongly recommended to re-write your SQL and switch it to XQuery.

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, XMLData XML);
INSERT INTO @tbl (XMLData) VALUES
('<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" returnVersion="2019v5.1">
    <ReturnData documentCnt="7">
        <IRS990 documentId="IRS990">
            <PrincipalOfficerNm>Fr Francis Pizzarelli</PrincipalOfficerNm>
            <USAddress>
                <AddressLine1Txt>One High Street</AddressLine1Txt>
                <CityNm>Port Jefferson</CityNm>
                <StateAbbreviationCd>NY</StateAbbreviationCd>
                <ZIPCd>11777</ZIPCd>
            </USAddress>
            <GrossReceiptsAmt>6463631</GrossReceiptsAmt>
            <GroupReturnForAffiliatesInd>false</GroupReturnForAffiliatesInd>
            <Organization501c3Ind>X</Organization501c3Ind>
            <TypeOfOrganizationCorpInd>X</TypeOfOrganizationCorpInd>
            <FormationYr>1980</FormationYr>
            <LegalDomicileStateCd>NY</LegalDomicileStateCd>
            <ActivityOrMissionDesc>To provide help to individuals who can not find help in the form of housing, counseling, and other support so that they can eventually live productive independent lives. Hope House provides hope, care and compassion to nearly 2,000 individuals in need each month.</ActivityOrMissionDesc>
            <VotingMembersGoverningBodyCnt>8</VotingMembersGoverningBodyCnt>
            <VotingMembersIndependentCnt>7</VotingMembersIndependentCnt>
            <TotalEmployeeCnt>122</TotalEmployeeCnt>
            <TotalGrossUBIAmt>0</TotalGrossUBIAmt>
            <NetUnrelatedBusTxblIncmAmt>0</NetUnrelatedBusTxblIncmAmt>
            <PYContributionsGrantsAmt>3307653</PYContributionsGrantsAmt>
            <CYContributionsGrantsAmt>4963545</CYContributionsGrantsAmt>
            <CYProgramServiceRevenueAmt>0</CYProgramServiceRevenueAmt>
            <PYInvestmentIncomeAmt>25158</PYInvestmentIncomeAmt>
            <CYInvestmentIncomeAmt>122678</CYInvestmentIncomeAmt>
            <PYOtherRevenueAmt>1302778</PYOtherRevenueAmt>
            <CYOtherRevenueAmt>1016131</CYOtherRevenueAmt>
            <PYTotalRevenueAmt>4635589</PYTotalRevenueAmt>
            <CYTotalRevenueAmt>6102354</CYTotalRevenueAmt>
            <CYGrantsAndSimilarPaidAmt>0</CYGrantsAndSimilarPaidAmt>
            <CYBenefitsPaidToMembersAmt>0</CYBenefitsPaidToMembersAmt>
            <PYSalariesCompEmpBnftPaidAmt>3294184</PYSalariesCompEmpBnftPaidAmt>
            <CYSalariesCompEmpBnftPaidAmt>3352675</CYSalariesCompEmpBnftPaidAmt>
            <CYTotalProfFndrsngExpnsAmt>0</CYTotalProfFndrsngExpnsAmt>
            <CYTotalFundraisingExpenseAmt>501828</CYTotalFundraisingExpenseAmt>
            <PYOtherExpensesAmt>1793710</PYOtherExpensesAmt>
        </IRS990>
    </ReturnData>
</Return>');

WITH XMLNAMESPACES (DEFAULT 'http://www.irs.gov/efile')
SELECT ID
    , PYContributionsGrantsAmt = c.value('(PYContributionsGrantsAmt/text())[1]', 'BIGINT')
    , CYContributionsGrantsAmt  = c.value('(CYContributionsGrantsAmt/text())[1]', 'BIGINT')
    , PYInvestmentIncomeAmt  = c.value('(PYInvestmentIncomeAmt /text())[1]', 'BIGINT')
FROM @tbl
    CROSS APPLY XMLData.nodes('/Return/ReturnData/IRS990') AS t(c);
  • Related