Home > database >  XQuery: removing duplicates from returned results
XQuery: removing duplicates from returned results

Time:06-29

I have a large dataset that contains duplicates records. I've been trying to use XQuery to return results minus the duplicates. I have been reading around and thought fn:duplicate-values would allow me to get what I want...and it still might but I haven't quite achieved what I want yet.

---Sample xml data---

<data>
        <Details Level="2">
            <Record SectionNumber="0">
                <Field Name="GroupNameLINKID1" FieldName="GroupName ({VW_PEOPLE.LINK_ID})">
                    <FormattedValue>76.00</FormattedValue>
                    <Value>76.00</Value>
                </Field>
                <Field Name="BRIEFBIO1" FieldName="{VW_PEOPLE.BRIEF_BIO}">
                    <FormattedValue>1833 - 1883</FormattedValue>
                    <Value>1833 - 1883</Value>
                </Field>
                <Field Name="BIRTHDATE1" FieldName="{VW_PEOPLE.BIRTH_DATE}">
                    <FormattedValue>1833</FormattedValue>
                    <Value>1833</Value>
                </Field>
                <Field Name="DEATHDATE1" FieldName="{VW_PEOPLE.DEATH_DATE}">
                    <FormattedValue>1883</FormattedValue>
                    <Value>1883</Value>
                </Field>
            </Record>
        </Details>
        <Details Level="2">
            <Record SectionNumber="0">
                <Field Name="GroupNameLINKID1" FieldName="GroupName ({VW_PEOPLE.LINK_ID})">
                    <FormattedValue>76.00</FormattedValue>
                    <Value>76.00</Value>
                </Field>
                <Field Name="BRIEFBIO1" FieldName="{VW_PEOPLE.BRIEF_BIO}">
                    <FormattedValue>1833 - 1883</FormattedValue>
                    <Value>1833 - 1883</Value>
                </Field>
                <Field Name="BIRTHDATE1" FieldName="{VW_PEOPLE.BIRTH_DATE}">
                    <FormattedValue>1833</FormattedValue>
                    <Value>1833</Value>
                </Field>
                <Field Name="DEATHDATE1" FieldName="{VW_PEOPLE.DEATH_DATE}">
                    <FormattedValue>1883</FormattedValue>
                    <Value>1883</Value>
                </Field>
            </Record>
        </Details>
        <Details Level="2">
            <Record SectionNumber="0">
                <Field Name="GroupNameLINKID1" FieldName="GroupName ({VW_PEOPLE.LINK_ID})">
                    <FormattedValue>3,383.00</FormattedValue>
                    <Value>3,383.00</Value>
                </Field>
                <Field Name="BRIEFBIO1" FieldName="{VW_PEOPLE.BRIEF_BIO}">
                    <FormattedValue>1819 - 1894</FormattedValue>
                    <Value>1819 - 1894</Value>
                </Field>
                <Field Name="BIRTHDATE1" FieldName="{VW_PEOPLE.BIRTH_DATE}">
                    <FormattedValue>1819</FormattedValue>
                    <Value>1819</Value>
                </Field>
                <Field Name="DEATHDATE1" FieldName="{VW_PEOPLE.DEATH_DATE}">
                    <FormattedValue>1894</FormattedValue>
                    <Value>1894</Value>
                </Field>
            </Record>
        </Details>
</data>

You will see in each record a 'GroupNameLINKID1'. The first 2 records are the same. What I would like is to remove the duplicate record. Here is my XQuery:

<report>
    {

        let $item := doc("People_list_test_edited.xml")/data/Details
        let $id := $item/Record/Field[@Name='GroupNameLINKID1']/FormattedValue
         for $unique-id in distinct-values($id)
        let $briefbio := $item/Record/Field[@Name='BRIEFBIO1']/FormattedValue
        let $birth := $item/Record/Field[@Name='BIRTHDATE1']/FormattedValue
        let $death := $item/Record/Field[@Name='DEATHDATE1']/FormattedValue

 return
      <object><id>{$unique-id}</id></object>

       }
       </report>

My current output looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<report>
   <object>
      <id>76.00</id>
   </object>
   <object>
      <id>3,383.00</id>
   </object>
</report>

What I want is:

<?xml version="1.0" encoding="UTF-8"?>
<report>
   <object>
      <id>76.00</id>
      <briefbio>1833 - 1883</briefbio>
      <birth>1833</birth>
      <death>1883</death>
   </object>
   <object>
      <id>3,383.00</id>
      <briefbio>1819 - 1894</briefbio>
      <birth>1819</birth>
      <death>1894</death>
   </object>
</report>

Whenever I have tried adding additional 'where' or 'for' statements I either get returned results with all the duplicate records included or I get a huge list of repeated briefbio's, birth, and death dates under each $unique-id.

Can anyone steer me in the right direction? Many thanks in advance! All the best, Rob

CodePudding user response:

Here is how to do with with XQuery 3 and grouping:

declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";

declare option output:method 'xml';
declare option output:indent 'yes';


<report>
{
for $detail in data/Details
group by $gn := $detail/Record/Field[@Name='GroupNameLINKID1']/FormattedValue
return
  <object>
    <id>{$gn}</id>
    <briefbio>{$detail[1]/Record/Field[@Name='BRIEFBIO1']/FormattedValue/data()}</briefbio>
  </object>
}
</report>

I have not spelled out all fields but from briefbio it should be clear how to fill in the other.

  • Related