Flatten XML data as a pandas dataframe-CodePudding

How can I convert this XML file at this address into a pandas dataframe? I have downloaded the XML as a file and called it '058com.xml' and run the code below, though the last column of the resulting dataframe is a mess of data arranged as multiple OrderedDict. The XML structure seems complex and is beyond my knowledge.

json_normalize documentation left me confused. How can I improve the code to fully flatten the XML ?

import pandas as pd
import xmltodict

rawdata = '058com.xml'

with open(rawdata) as fd:
    doc = xmltodict.parse(fd.read(), encoding='ISO-8859-1', process_namespaces=False)

pd.json_normalize(doc['Election']['Departement']['Communes']['Commune'])

Ideally the dataframe should look like ID's, names for geographic entities and vote results and names of election candidates.

The final dataframe should contain a lot of columns when fully flatten and is expected to be very close of the CSV below. I pasted the headers and the first line in the form of a .csv (semi-colon separated) as a resentative sample of what the dataframe should look like

Code du département;Libellé du département;Code de la commune;Libellé de la commune;Etat saisie;Inscrits;Abstentions;% Abs/Ins;Votants;% Vot/Ins;Blancs;% Blancs/Ins;% Blancs/Vot;Nuls;% Nuls/Ins;% Nuls/Vot;Exprimés;% Exp/Ins;% Exp/Vot;N°Panneau;Sexe;Nom;Prénom;Voix;% Voix/Ins;% Voix/Exp
01;Ain;001;L'Abergement-Clémenciat;Complet;645;108;16,74;537;83,26;16;2,48;2,98;1;0,16;0,19;520;80,62;96,83;1;F;ARTHAUD;Nathalie;3;0,47;0,58;2;M;ROUSSEL;Fabien;6;0,93;1,15;3;M;MACRON;Emmanuel;150;23,26;28,85;4;M;LASSALLE;Jean;18;2,79;3,46;5;F;LE PEN;Marine;149;23,10;28,65;6;M;ZEMMOUR;Éric;43;6,67;8,27;7;M;MÉLENCHON;Jean-Luc;66;10,23;12,69;8;F;HIDALGO;Anne;5;0,78;0,96;9;M;JADOT;Yannick;30;4,65;5,77;10;F;PÉCRESSE;Valérie;26;4,03;5,00;11;M;POUTOU;Philippe;3;0,47;0,58;12;M;DUPONT-AIGNAN;Nicolas;21;3,26;4,04

CodePudding user response：

I tried this:

import pandas as pd
import xmltodict

rawdata = '058com.xml'

with open(rawdata) as fd:
    doc = xmltodict.parse(fd.read(), encoding='ISO-8859-1', process_namespaces=False)

df = pd.json_normalize(doc['Election']['Departement']['Communes']['Commune'])


col_length_df = len(df.columns)
all_columns = list(df.columns[:-1])   list(df.iloc[0, len(df.columns)-1][0].keys())

new_df = df.reindex(columns = all_columns)

new_df.astype({"RapportExprime": str, "RapportInscrit": str}).dtypes

for index, rows in new_df.iterrows():
    new_df.iloc[index, col_length_df-1:] = list(df.iloc[index, len(df.columns)-1][0].values())

Since the last row of df is an ordered dictionary, the code uses its keys to add empty columns, along with original columns of df, to new_df. Finally, it loops over rows of df and new_df to fill the empty columns of new_df.

The above code gives us:

    CodSubCom           LibSubCom Tours.Tour.NumTour Tours.Tour.Mentions.Inscrits.Nombre Tours.Tour.Mentions.Abstentions.Nombre  ... PrenomPsn CivilitePsn NbVoix RapportExprime RapportInscrit
0         001               Achun                  1                                 105                                     24  ...  Nathalie         Mme      0           0,00           0,00
1         002       Alligny-Cosne                  1                                 696                                    133  ...  Nathalie         Mme      3           0,54           0,43
2         003   Alligny-en-Morvan                  1                                 533                                    123  ...  Nathalie         Mme      5           1,25           0,94
3         004               Alluy                  1                                 263                                     48  ...  Nathalie         Mme      1           0,48           0,38
4         005               Amazy                  1                                 188                                     51  ...  Nathalie         Mme      2           1,53           1,06
..        ...                 ...                ...                                 ...                                    ...  ...       ...         ...    ...            ...            ...
304       309        Villapourçon                  1                                 327                                     70  ...  Nathalie         Mme      1           0,40           0,31
305       310     Villiers-le-Sec                  1                                  34                                      4  ...  Nathalie         Mme      0           0,00           0,00
306       311         Ville-Langy                  1                                 203                                     46  ...  Nathalie         Mme      1           0,64           0,49
307       312  Villiers-sur-Yonne                  1                                 263                                     60  ...  Nathalie         Mme      0           0,00           0,00
308       313         Vitry-Laché                  1                                  87                                     13  ...  Nathalie         Mme      1           1,37           1,15

Finally, new_df.columns is:

Index(['CodSubCom', 'LibSubCom', 'Tours.Tour.NumTour',
       'Tours.Tour.Mentions.Inscrits.Nombre',
       'Tours.Tour.Mentions.Abstentions.Nombre',
       'Tours.Tour.Mentions.Abstentions.RapportInscrit',
       'Tours.Tour.Mentions.Votants.Nombre',
       'Tours.Tour.Mentions.Votants.RapportInscrit',
       'Tours.Tour.Mentions.Blancs.Nombre',
       'Tours.Tour.Mentions.Blancs.RapportInscrit',
       'Tours.Tour.Mentions.Blancs.RapportVotant',
       'Tours.Tour.Mentions.Nuls.Nombre',
       'Tours.Tour.Mentions.Nuls.RapportInscrit',
       'Tours.Tour.Mentions.Nuls.RapportVotant',
       'Tours.Tour.Mentions.Exprimes.Nombre',
       'Tours.Tour.Mentions.Exprimes.RapportInscrit',
       'Tours.Tour.Mentions.Exprimes.RapportVotant', 'NumPanneauCand',
       'NomPsn', 'PrenomPsn', 'CivilitePsn', 'NbVoix', 'RapportExprime',
       'RapportInscrit'],
      dtype='object')

Total number of columns in new_df: 24

CodePudding user response：

Since the URL really contains two data sections under each <Tour>, specifically <Mentions> (which appear to be aggregate vote data) and <Candidats> (which are granular person-level data) (pardon my French), consider building two separate data frames using the new IO method, pandas.read_xml, which supports XSLT 1.0 (via the third-party lxml package). No migration to dictionaries for JSON handling.

As a special purpose language written in XML, XSLT can transform your nested structure to flatter format for migration to data frame. Specifically, each stylesheet drills down to the most granular node and then by the ancestor axis pulls higher level information as sibling columns.

Mentions (save as .xsl, a special .xml file or embed as string in Python)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  
  <xsl:template match="/">
    <Tours>
      <xsl:apply-templates select="descendant::Tour/Mentions"/>
    </Tours>
  </xsl:template>
  
  <xsl:template match="Mentions/*">
    <Mention>
      <xsl:copy-of select="ancestor::Election/Scrutin/*"/>
      <xsl:copy-of select="ancestor::Departement/*[name()!='Communes']"/>
      <xsl:copy-of select="ancestor::Commune/*[name()!='Tours']"/>
      <xsl:copy-of select="ancestor::Tour/NumTour"/>
      <Mention><xsl:value-of select="name()"/></Mention>
      <xsl:copy-of select="*"/>
    </Mention>
  </xsl:template>
  
</xsl:stylesheet>

Python (read directly from URL)

url = (
    "https://www.resultats-elections.interieur.gouv.fr/telechargements/" 
    "PR2022/resultatsT1/027/058/058com.xml"
)

mentions_df = pd.read_xml(url, stylesheet=mentions_xsl)

Output

                Type  Annee  CodReg  CodReg3Car                   LibReg  CodDpt  CodMinDpt  CodDpt3Car  LibDpt  CodSubCom    LibSubCom  NumTour      Mention  Nombre RapportInscrit RapportVotant
0     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1     Inscrits     105           None          None
1     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1  Abstentions      24          22,86          None
2     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1      Votants      81          77,14          None
3     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1       Blancs       2           1,90          2,47
4     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1         Nuls       0           0,00          0,00
             ...    ...     ...         ...                      ...     ...        ...         ...     ...        ...          ...      ...          ...     ...            ...           ...
1849  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1  Abstentions      13          14,94          None
1850  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1      Votants      74          85,06          None
1851  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1       Blancs       1           1,15          1,35
1852  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1         Nuls       0           0,00          0,00
1853  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1     Exprimes      73          83,91         98,65

[1854 rows x 16 columns]

Candidats (save as .xsl, a special .xml file or embed as string in Python)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  
  <xsl:template match="/">
    <Candidats>
      <xsl:apply-templates select="descendant::Tour/Resultats/Candidats"/>
    </Candidats>
  </xsl:template>
  
  <xsl:template match="Candidat">
    <xsl:copy>
      <xsl:copy-of select="ancestor::Election/Scrutin/*"/>
      <xsl:copy-of select="ancestor::Departement/*[name()!='Communes']"/>
      <xsl:copy-of select="ancestor::Commune/*[name()!='Tours']"/>
      <xsl:copy-of select="ancestor::Tour/NumTour"/>
      <xsl:copy-of select="*"/>
    </xsl:copy>
  </xsl:template>
  
</xsl:stylesheet>

Python (read directly from URL)

url = (
    "https://www.resultats-elections.interieur.gouv.fr/telechargements/" 
    "PR2022/resultatsT1/027/058/058com.xml"
)

candidats_df = pd.read_xml(url, stylesheet=candidats_xsl)

Output

                Type  Annee  CodReg  CodReg3Car                   LibReg  CodDpt  CodMinDpt  CodDpt3Car  LibDpt  CodSubCom    LibSubCom  NumTour  NumPanneauCand         NomPsn PrenomPsn CivilitePsn  NbVoix RapportExprime RapportInscrit
0     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               1        ARTHAUD  Nathalie         Mme       0           0,00           0,00
1     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               2        ROUSSEL    Fabien          M.       3           3,80           2,86
2     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               3         MACRON  Emmanuel          M.      14          17,72          13,33
3     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               4       LASSALLE      Jean          M.       2           2,53           1,90
4     Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre          1        Achun        1               5         LE PEN    Marine         Mme      28          35,44          26,67
             ...    ...     ...         ...                      ...     ...        ...         ...     ...        ...          ...      ...             ...            ...       ...         ...     ...            ...            ...
3703  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1               8        HIDALGO      Anne         Mme       0           0,00           0,00
3704  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1               9          JADOT   Yannick          M.       4           5,48           4,60
3705  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              10       PÉCRESSE   Valérie         Mme       6           8,22           6,90
3706  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              11         POUTOU  Philippe          M.       1           1,37           1,15
3707  Présidentielle   2022      27          27  Bourgogne-Franche-Comté      58         58          58  Nièvre        313  Vitry-Laché        1              12  DUPONT-AIGNAN   Nicolas          M.       4           5,48           4,60

[3708 rows x 19 columns]

You can join resulting data frames using their shared Communes nodes: <CodSubCom> and <LibSubCom> but may have to pivot_table on the aggregate data for a one-to-many merge. Below demonstrates with Nombre aggregate:

mentions_candidats_df = (
    candidats_df.merge(
        mentions_df.pivot_table(
            index=["CodSubCom", "LibSubCom"],
            columns="Mention",
            values="Nombre",
            aggfunc="max"
        ).reset_index(),
        on=["CodSubCom", "LibSubCom"]
    )
)

mentions_candidats_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3708 entries, 0 to 3707
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Type            3708 non-null   object
 1   Annee           3708 non-null   int64 
 2   CodReg          3708 non-null   int64 
 3   CodReg3Car      3708 non-null   int64 
 4   LibReg          3708 non-null   object
 5   CodDpt          3708 non-null   int64 
 6   CodMinDpt       3708 non-null   int64 
 7   CodDpt3Car      3708 non-null   int64 
 8   LibDpt          3708 non-null   object
 9   CodSubCom       3708 non-null   int64 
 10  LibSubCom       3708 non-null   object
 11  NumTour         3708 non-null   int64 
 12  NumPanneauCand  3708 non-null   int64 
 13  NomPsn          3708 non-null   object
 14  PrenomPsn       3708 non-null   object
 15  CivilitePsn     3708 non-null   object
 16  NbVoix          3708 non-null   int64 
 17  RapportExprime  3708 non-null   object
 18  RapportInscrit  3708 non-null   object
 19  Abstentions     3708 non-null   int64 
 20  Blancs          3708 non-null   int64 
 21  Exprimes        3708 non-null   int64 
 22  Inscrits        3708 non-null   int64 
 23  Nuls            3708 non-null   int64 
 24  Votants         3708 non-null   int64 
dtypes: int64(16), object(9)
memory usage: 753.2  KB

In forthcoming pandas 1.5, read_xml will support dtypes to allow conversion after XSLT transformation in this case.