Home > Software design >  XML to postgres using python
XML to postgres using python

Time:04-05

I have one XML file that has data from the American Registry of Internet numbers which I need to use to do a lookup on the CIDR blocks. The amount of data is around 60GB.

The structure is not so complex (could be mapped to perhaps 4 tables in a relational model)

<asn>
    <ref>https://whois.arin.net/rest/v1/asn/AS0</ref>
    <pocLinks/>
    <endAsNumber>0</endAsNumber>
    <handle>AS0</handle>
    <name>IANA-RSVD-0</name>
    <orgHandle>IANA</orgHandle>
    <comment>
        <line number="0">Reserved - May be used to identify non-routed networks</line>
    </comment>
    <registrationDate>2002-09-13T15:46:16-04:00</registrationDate>
    <startAsNumber>0</startAsNumber>
    <updateDate>2002-09-13T15:46:16-04:00</updateDate>
</asn>



<org>
    <ref>https://whois.arin.net/rest/v1/org/C07182099</ref>
    <city>BEVERLY HILLS</city>
    <iso3166-1>
        <code2>US</code2>
        <code3>USA</code3>
        <name>United States</name>
        <e164>1</e164>
    </iso3166-1>
    <handle>C07182099</handle>
    <customer>Y</customer>
    <name>NICHOLAS R. NIKOLOV</name>
    <pocLinks/>
    <parentOrgHandle>CC-3517</parentOrgHandle>
    <postalCode>90210</postalCode>
    <registrationDate>2018-10-26T03:59:46-04:00</registrationDate>
    <iso3166-2>CA</iso3166-2>
    <streetAddress>
        <line number="0">436 N BEDFORD DR</line>
    </streetAddress>
    <updateDate>2018-10-26T03:59:46.821-04:00</updateDate>
</org>



<net>
    <ref>https://whois.arin.net/rest/v1/net/NET-64-179-77-160-1</ref>
    <endAddress>64.179.77.167</endAddress>
    <handle>NET-64-179-77-160-1</handle>
    <name>CHOICE1SWIPP771609</name>
    <netBlocks>
        <netBlock>
            <cidrLenth>29</cidrLenth>
            <endAddress>064.179.077.167</endAddress>
            <type>S</type>
            <startAddress>064.179.077.160</startAddress>
        </netBlock>
    </netBlocks>
    <pocLinks/>
    <orgHandle>C00477198</orgHandle>
    <parentNetHandle>NET-64-179-0-0-1</parentNetHandle>
    <registrationDate>2003-02-24T12:20:58-05:00</registrationDate>
    <startAddress>64.179.77.160</startAddress>
    <updateDate>2003-02-24T12:20:58-05:00</updateDate>
    <version>4</version>
</net>  



<poc>
    <ref>https://whois.arin.net/rest/v1/poc/FLOYD11-ARIN</ref>
    <city>Alexandria</city>
    <companyName>Aireco8Alexandria</companyName>
    <iso3166-1>
        <code2>US</code2>
        <code3>USA</code3>
        <name>United States</name>
        <e164>1</e164>
    </iso3166-1>
    <handle>FLOYD11-ARIN</handle>
    <isRoleAccount>Y</isRoleAccount>
    <emails>
        <email>[email protected]</email>
    </emails>
    <lastName>Floyd</lastName>
    <phones>
        <phone>
            <number>
                <phoneNumber> 1-703-941-0660</phoneNumber>
                <phoneType>O</phoneType>
                <pocHandle>FLOYD11-ARIN</pocHandle>
            </number>
            <type>
                <description>Office</description>
                <code>O</code>
            </type>
        </phone>
    </phones>
    <postalCode>22312</postalCode>
    <registrationDate>2006-04-04T19:27:40-04:00</registrationDate>
    <iso3166-2>VA</iso3166-2>
    <streetAddress>
        <line number="0">5712-F Gen Washington Dr</line>
    </streetAddress>
    <updateDate>2006-04-04T19:27:40-04:00</updateDate>
</poc>

There are 4 different types of root elements, namely - POC , asn , org and net, and multiple data points for each type

The data needs to be pushed to 4 different tables in an RDS Postgres instance and here is how I intend to do it:

  1. divide XML file based on the root elements
  2. parse each type differently
  3. load to df and do df.to_sql

Is there a faster way to achieve this?

CodePudding user response:

A potential solution.

Postgres has a jsonb store that handles unstructured data. You could use the xmltodict module via xmltodict.parse() and convert the XML to a python dict.

Than you can convert this to JSON and dump into the database.
You can query levels/depth in 2 ways:

SELECT data FROM table WHERE data->'asn'->>'name' = 'IANA-RSVD-0'
SELECT data FROM table WHERE data @> '{"asn":{"name":"IANA-RSVD-0"}}'

The latter query will take advantage of the JSONB_PATH_OPS index.

https://www.postgresql.org/docs/11/functions-json.html

  • Related