I have one XML file that has data from the American Registry of Internet numbers which I need to use to do a lookup on the CIDR blocks. The amount of data is around 60GB.
The structure is not so complex (could be mapped to perhaps 4 tables in a relational model)
<asn>
<ref>https://whois.arin.net/rest/v1/asn/AS0</ref>
<pocLinks/>
<endAsNumber>0</endAsNumber>
<handle>AS0</handle>
<name>IANA-RSVD-0</name>
<orgHandle>IANA</orgHandle>
<comment>
<line number="0">Reserved - May be used to identify non-routed networks</line>
</comment>
<registrationDate>2002-09-13T15:46:16-04:00</registrationDate>
<startAsNumber>0</startAsNumber>
<updateDate>2002-09-13T15:46:16-04:00</updateDate>
</asn>
<org>
<ref>https://whois.arin.net/rest/v1/org/C07182099</ref>
<city>BEVERLY HILLS</city>
<iso3166-1>
<code2>US</code2>
<code3>USA</code3>
<name>United States</name>
<e164>1</e164>
</iso3166-1>
<handle>C07182099</handle>
<customer>Y</customer>
<name>NICHOLAS R. NIKOLOV</name>
<pocLinks/>
<parentOrgHandle>CC-3517</parentOrgHandle>
<postalCode>90210</postalCode>
<registrationDate>2018-10-26T03:59:46-04:00</registrationDate>
<iso3166-2>CA</iso3166-2>
<streetAddress>
<line number="0">436 N BEDFORD DR</line>
</streetAddress>
<updateDate>2018-10-26T03:59:46.821-04:00</updateDate>
</org>
<net>
<ref>https://whois.arin.net/rest/v1/net/NET-64-179-77-160-1</ref>
<endAddress>64.179.77.167</endAddress>
<handle>NET-64-179-77-160-1</handle>
<name>CHOICE1SWIPP771609</name>
<netBlocks>
<netBlock>
<cidrLenth>29</cidrLenth>
<endAddress>064.179.077.167</endAddress>
<type>S</type>
<startAddress>064.179.077.160</startAddress>
</netBlock>
</netBlocks>
<pocLinks/>
<orgHandle>C00477198</orgHandle>
<parentNetHandle>NET-64-179-0-0-1</parentNetHandle>
<registrationDate>2003-02-24T12:20:58-05:00</registrationDate>
<startAddress>64.179.77.160</startAddress>
<updateDate>2003-02-24T12:20:58-05:00</updateDate>
<version>4</version>
</net>
<poc>
<ref>https://whois.arin.net/rest/v1/poc/FLOYD11-ARIN</ref>
<city>Alexandria</city>
<companyName>Aireco8Alexandria</companyName>
<iso3166-1>
<code2>US</code2>
<code3>USA</code3>
<name>United States</name>
<e164>1</e164>
</iso3166-1>
<handle>FLOYD11-ARIN</handle>
<isRoleAccount>Y</isRoleAccount>
<emails>
<email>[email protected]</email>
</emails>
<lastName>Floyd</lastName>
<phones>
<phone>
<number>
<phoneNumber> 1-703-941-0660</phoneNumber>
<phoneType>O</phoneType>
<pocHandle>FLOYD11-ARIN</pocHandle>
</number>
<type>
<description>Office</description>
<code>O</code>
</type>
</phone>
</phones>
<postalCode>22312</postalCode>
<registrationDate>2006-04-04T19:27:40-04:00</registrationDate>
<iso3166-2>VA</iso3166-2>
<streetAddress>
<line number="0">5712-F Gen Washington Dr</line>
</streetAddress>
<updateDate>2006-04-04T19:27:40-04:00</updateDate>
</poc>
There are 4 different types of root elements, namely - POC , asn , org and net, and multiple data points for each type
The data needs to be pushed to 4 different tables in an RDS Postgres instance and here is how I intend to do it:
- divide XML file based on the root elements
- parse each type differently
- load to df and do
df.to_sql
Is there a faster way to achieve this?
CodePudding user response:
A potential solution.
Postgres has a jsonb store that handles unstructured data. You could use the xmltodict
module via xmltodict.parse()
and convert the XML to a python dict.
Than you can convert this to JSON and dump into the database.
You can query levels/depth in 2 ways:
SELECT data FROM table WHERE data->'asn'->>'name' = 'IANA-RSVD-0'
SELECT data FROM table WHERE data @> '{"asn":{"name":"IANA-RSVD-0"}}'
The latter query will take advantage of the JSONB_PATH_OPS index.