Get data from XML using Python

Time:12-10

I have this XML file:

<cv xmlns="http://hr.joinvision.com/xml/3_0">
  <personalInformation>
    <firstname>Bernard</firstname>
    <lastname>Henry</lastname>
    <gender>
      <code>m</code>
      <name>Monsieur</name>
    </gender>
    <title>Exec. MBA</title>
    <isced>
      <code>5A</code>
      <name>Maîtrise universitaire ou équivalent</name>
    </isced>
    <address/>
    <email>[email protected]</email>
    <phoneNumber>0617135919</phoneNumber>
  </personalInformation>
  <work>
    <phase>
      <dateFrom>1981-01-01 01:00</dateFrom>
      <dateTo>1984-01-01 01:00</dateTo>
      <dateFromFuzzy>1981</dateFromFuzzy>
      <dateToFuzzy>1984</dateToFuzzy>
      <duration>36</duration>
      <current>false</current>
      <operationArea>
        <code>ass</code>
        <name>Assistance</name>
        <weight>1.0</weight>
      </operationArea>
      <operationArea>
        <code>pm</code>
        <name>Direction de Projet</name>
        <weight>1.0</weight>
      </operationArea>
      <operationArea>
        <code>secr</code>
        <name>Bureau du Secrétaire</name>
        <weight>1.0</weight>
      </operationArea>
      <operationArea>
        <code>offi</code>
        <name>Gestion du Bureau</name>
        <weight>1.0</weight>
      </operationArea>
      <operationArea>
        <code>sale</code>
        <name>Ventes en Général</name>
        <weight>1.0</weight>
      </operationArea>
      <operationArea>
        <code>retai</code>
        <name>Vente au Détail</name>
        <weight>0.7</weight>
      </operationArea>
      <topic>"TV-Vidéo"</topic>
      <comments>Adjoint chef régional des ventes "TV-Vidéo"(CA de 60 M€)
1980-81
Attaché commercial (CA de 6 M€)</comments>
      <plainText>* Société PHILIPS
1981-84
Adjoint chef régional des ventes "TV-Vidéo"(CA de 60 M€)
1980-81
Attaché commercial (CA de 6 M€)</plainText>
      <company>PHILIPS</company>
      <function>Adjoint chef</function>
      <position>
        <code>mgmtl</code>
        <name>Chef d'équipe</name>
      </position>
      <project>false</project>
    </phase>
    <phase>
      <dateFrom>1978-01-01 01:00</dateFrom>
      <dateTo>2021-12-01 01:00</dateTo>
      <dateFromFuzzy>1978</dateFromFuzzy>
      <dateToFuzzy>2021-12</dateToFuzzy>
      <duration>516</duration>
      <current>true</current>
      <location>
        <postcode>40210</postcode>
        <city>Düsseldorf</city>
        <country>
          <code>DE</code>
          <name>Allemagne</name>
        </country>
        <state>Nordrhein-Westfalen</state>
      </location>
      <operationArea>
        <code>mark</code>
        <name>Marketing</name>
        <weight>1.0</weight>
      </operationArea>
      <industry>
        <code>72</code>
        <name>Recherche développement scientifique</name>
        <weight>1.0</weight>
      </industry>
      <comments>79
Chargé de développer, en Allemagne, des activités publi-promotionnelles en faveur de vins et alcools français</comments>
      <plainText>* SOPEXA Deutschland - Düsseldorf
1978-79
Chargé de développer, en Allemagne, des activités publi-promotionnelles en faveur de vins et alcools français</plainText>
      <company>SOPEXA Deutschland</company>
      <position>
        <code>ma</code>
        <name>Employé</name>
      </position>
      <project>false</project>
    </phase>
    <phase>
      <dateFrom>1976-01-01 01:00</dateFrom>
      <dateTo>1978-01-01 01:00</dateTo>
      <dateFromFuzzy>1976</dateFromFuzzy>
      <dateToFuzzy>1978</dateToFuzzy>
      <duration>24</duration>
      <current>false</current>
      <location>
        <country>
          <code>FR</code>
          <name>France</name>
        </country>
      </location>
      <location>
        <city>New York</city>
        <country>
          <code>US</code>
          <name>États-Unis</name>
        </country>
      </location>
      <operationArea>
        <code>retai</code>
        <name>Vente au Détail</name>
        <weight>1.0</weight>
      </operationArea>
      <operationArea>
        <code>adve</code>
        <name>Publicité</name>
        <weight>0.7</weight>
      </operationArea>
      <operationArea>
        <code>mark</code>
        <name>Marketing</name>
        <weight>0.7</weight>
      </operationArea>
      <operationArea>
        <code>ware</code>
        <name>Entreposage / Logistique des Marchandises</name>
        <weight>0.7</weight>
      </operationArea>
      <comments>* FOOD AND WINES FROM FRANCE Inc. (SOPEXA USA) - New York
Responsable local de la mise en place de campagnes publi-promotionnelles (pour vins et fromages français) auprès de la distribution sur plusieurs régions des USA</comments>
      <plainText>* FOOD AND WINES FROM FRANCE Inc. (SOPEXA USA) - New York
1976-78
Responsable local de la mise en place de campagnes publi-promotionnelles (pour vins et fromages français) auprès de la distribution sur plusieurs régions des USA</plainText>
      <company>FRANCE Inc.</company>
      <function>Responsable</function>
      <position>
        <code>ma</code>
        <name>Employé</name>
      </position>
      <project>false</project>
    </phase>
    <phase>
      <dateFrom>2020-01-01 01:00</dateFrom>
      <dateFromFuzzy>2020-01</dateFromFuzzy>
      <subphase>false</subphase>
      <current>false</current>
      <comments>negociation-bhfc.f r - [email protected] - 0617135919</comments>
      <plainText>Janvier 2020 - negociation-bhfc.f r - [email protected] - 0617135919</plainText>
      <company>negociation-bhfc.f r</company>
      <position>
        <code>ma</code>
        <name>Employé</name>
      </position>
      <project>false</project>
    </phase>
  </work>
  <education>
    <phase>
      <dateFrom>1984-01-01 01:00</dateFrom>
      <dateFromFuzzy>1984</dateFromFuzzy>
      <duration>12</duration>
      <current>false</current>
      <comments>Sciences de l'Education</comments>
      <plainText>1984 Sciences de l'Education</plainText>
      <completed>true</completed>
    </phase>
    <phase>
      <dateFrom>1976-01-01 01:00</dateFrom>
      <dateFromFuzzy>1976</dateFromFuzzy>
      <duration>12</duration>
      <current>false</current>
      <operationArea>
        <code>elec</code>
        <name>Ingénierie Électrique/Électronique</name>
        <weight>1.0</weight>
      </operationArea>
      <comments>Diplôme H.E.C.
Langues: Anglais courant - Allemand</comments>
      <plainText>1976 Diplôme H.E.C. Langues: Anglais courant - Allemand</plainText>
      <isced>
        <code>2</code>
        <name>Enseignement secondaire (premier cycle)</name>
      </isced>
      <educationType>
        <code>ed68</code>
        <name>Diplôme</name>
      </educationType>
      <completed>true</completed>
    </phase>
    <phase>
      <dateFrom>1984-01-01 01:00</dateFrom>
      <dateTo>2020-01-01 01:00</dateTo>
      <dateFromFuzzy>1984</dateFromFuzzy>
      <dateToFuzzy>2020</dateToFuzzy>
      <duration>432</duration>
      <current>false</current>
      <location>
        <postcode>75001</postcode>
        <city>PARIS, HEC, PONTS Paris</city>
        <country>
          <code>FR</code>
          <name>France</name>
        </country>
        <state>Île-de-France</state>
      </location>
      <skill>
        <code>masch</code>
        <name>Construction de Machines</name>
        <weight>1.0</weight>
      </skill>
      <skill>
        <code>o</code>
        <name>Oracle</name>
        <weight>0.9</weight>
      </skill>
      <skill>
        <code>lawye</code>
        <name>Avocat</name>
        <weight>0.8</weight>
      </skill>
      <operationArea>
        <code>sale</code>
        <name>Ventes en Général</name>
        <weight>1.0</weight>
      </operationArea>
      <operationArea>
        <code>cc</code>
        <name>Helpdesk/Support Technique</name>
        <weight>0.7</weight>
      </operationArea>
      <operationArea>
        <code>retai</code>
        <name>Vente au Détail</name>
        <weight>0.5</weight>
      </operationArea>
      <operationArea>
        <code>advoc</code>
        <name>Carrière Judiciaire</name>
        <weight>0.4</weight>
      </operationArea>
      <operationArea>
        <code>pm</code>
        <name>Direction de Projet</name>
        <weight>1.0</weight>
      </operationArea>
      <operationArea>
        <code>cust</code>
        <name>Service à la Clientèle</name>
        <weight>1.0</weight>
      </operationArea>
      <topic>" Responsable des achats "</topic>
      <comments>Interventions en:
Négociation raisonnée/résolution de conflit . Négociation commerciale
Formation de formateurs/ingénierie pédagogique . Management
* Exemples de mission:
Formation à la négociation et la résolution de conflits (en français ou en anglais) :
* de managers, ingénieurs, chefs de projet (CNES, ONERA, INGEROP, RAZEL, DGAC)
* de vendeurs (ORACLE, VEOLIA, CASINO, BNPP)
* d'acheteurs (SG, SANOFI, LAFARGE Ciment, CASINO, C-DISCOUNT)
* d'avocats (FIDAL 2017-2018 et ETELIO 2019)
* d'étudiants (CENTRALE PARIS, HEC, PONTS Paris Tech, MINES Paris Tech, École)
POLYTECHNIQUE, Corps des MINES, Corps des PONTS
Interventions
dans des projets de formation de dimension internationale et interculturelle (en français ou anglais) pour acheteurs (LAFARGE, TOTAL) ou managers (SAINT GOBAIN)
Executive MBA HEC : workshop sur la négociation raisonnée (2018)
MBA spécialisé " Responsable des achats " de l'Institut LEONARD DE VINCI (classé 7ème)
MBA-Achats de France en 2016): formation de ces futurs acheteurs à la négociation raisonnée
Interventions à l'étranger : au Canada pour LAFARGE (acheteurs - 2001), au Gabon pour
VEOLIA (acheteurs - 2010) et au Maroc pour JESA/OCP (acheteurs - 2017-20) et SAHAM (2018)</comments>
      <plainText>1984-2020
 * Interventions en:
. Négociation raisonnée/résolution de conflit . Négociation commerciale
. Formation de formateurs/ingénierie pédagogique . Management
 * Exemples de mission:
Formation à la négociation et la résolution de conflits (en français ou en anglais) :
* de managers, ingénieurs, chefs de projet (CNES, ONERA, INGEROP, RAZEL, DGAC)
* de vendeurs (ORACLE, VEOLIA, CASINO, BNPP)
* d'acheteurs (SG, SANOFI, LAFARGE Ciment, CASINO, C-DISCOUNT)
* d'avocats (FIDAL 2017-2018 et ETELIO 2019)
* d'étudiants (CENTRALE PARIS, HEC, PONTS Paris Tech, MINES Paris Tech, École
POLYTECHNIQUE, Corps des MINES, Corps des PONTS)
Interventions
dans des projets de formation de dimension internationale et interculturelle (en français ou anglais) pour acheteurs (LAFARGE, TOTAL) ou managers (SAINT GOBAIN)
Executive MBA HEC : workshop sur la négociation raisonnée (2018)
MBA spécialisé " Responsable des achats " de l'Institut LEONARD DE VINCI (classé 7ème
MBA-Achats de France en 2016): formation de ces futurs acheteurs à la négociation raisonnée
Interventions à l'étranger : au Canada pour LAFARGE (acheteurs - 2001), au Gabon pour
VEOLIA (acheteurs - 2010) et au Maroc pour JESA/OCP (acheteurs - 2017-20) et SAHAM (2018)
Bernard HENRY Formation Conseil</plainText>
      <isced>
        <code>3</code>
        <name>Enseignement secondaire (deuxième cycle)</name>
      </isced>
      <educationType>
        <code>ed55</code>
        <name>École secondaire compréhensive</name>
      </educationType>
      <schoolname>École POLYTECHNIQUE</schoolname>
      <graduation>Executive MBA</graduation>
      <graduation>Executive MBA HEC</graduation>
      <graduation>MBA</graduation>
      <completed>true</completed>
    </phase>
  </education>
  <publications/>
  <additionalInformation>
    <language>
      <code>DE</code>
      <name>Allemand</name>
      <level>
        <code>C1</code>
        <name>Courant</name>
      </level>
    </language>
    <language>
      <code>FR</code>
      <name>Français</name>
    </language>
    <language>
      <code>EN</code>
      <name>Anglais</name>
    </language>
    <interests>DE CONSULTANT-FORMATEUR - 1984-2020
 * Interventions en:
. Négociation raisonnée/résolution de conflit . Négociation commerciale
. Formation de formateurs/ingénierie pédagogique . Management
 * Exemples de mission:
Formation à la négociation et la résolution de conflits (en français ou en anglais) :
* de managers, ingénieurs, chefs de projet (CNES, ONERA, INGEROP, RAZEL, DGAC)
* de vendeurs (ORACLE, VEOLIA, CASINO, BNPP)
* d'acheteurs (SG, SANOFI, LAFARGE Ciment, CASINO, C-DISCOUNT)
* d'avocats (FIDAL 2017-2018 et ETELIO 2019)
* d'étudiants (CENTRALE PARIS, HEC, PONTS Paris Tech, MINES Paris Tech, École
POLYTECHNIQUE, Corps des MINES, Corps des PONTS)
Interventions
dans des projets de formation de dimension internationale et interculturelle (en français ou anglais) pour acheteurs (LAFARGE, TOTAL) ou managers (SAINT GOBAIN)
Executive MBA HEC : workshop sur la négociation raisonnée (2018)
MBA spécialisé " Responsable des achats " de l'Institut LEONARD DE VINCI (classé 7ème
MBA-Achats de France en 2016): formation de ces futurs acheteurs à la négociation raisonnée
Interventions à l'étranger : au Canada pour LAFARGE (acheteurs - 2001), au Gabon pour
VEOLIA (acheteurs - 2010) et au Maroc pour JESA/OCP (acheteurs - 2017-20) et SAHAM (2018)
Bernard HENRY Formation Conseil
Janvier 2020 - negociation-bhfc.f r - [email protected] - 0617135919</interests>
  </additionalInformation>
  <objectives/>
  <statistics>
    <codeSummary>
      <code>72</code>
      <name>Recherche développement scientifique</name>
      <weight>100.0</weight>
      <duration>528</duration>
      <domain>NACE</domain>
    </codeSummary>
    <codeSummary>
      <code>o</code>
      <name>Oracle</name>
      <weight>67.0</weight>
      <duration>433</duration>
      <domain>Skill</domain>
    </codeSummary>
    <codeSummary>
      <code>lawye</code>
      <name>Avocat</name>
      <weight>59.0</weight>
      <duration>433</duration>
      <domain>Skill</domain>
    </codeSummary>
    <codeSummary>
      <code>masch</code>
      <name>Construction de Machines</name>
      <weight>74.0</weight>
      <duration>433</duration>
      <domain>Skill</domain>
    </codeSummary>
    <codeSummary>
      <code>ma</code>
      <name>Employé</name>
      <weight>100.0</weight>
      <duration>552</duration>
      <domain>Position</domain>
    </codeSummary>
    <codeSummary>
      <code>cc</code>
      <name>Helpdesk/Support Technique</name>
      <weight>26.0</weight>
      <duration>433</duration>
      <domain>OperationArea</domain>
    </codeSummary>
    <codeSummary>
      <code>mark</code>
      <name>Marketing</name>
      <weight>100.0</weight>
      <duration>552</duration>
      <domain>OperationArea</domain>
    </codeSummary>
    <codeSummary>
      <code>cust</code>
      <name>Service à la Clientèle</name>
      <weight>37.0</weight>
      <duration>433</duration>
      <domain>OperationArea</domain>
    </codeSummary>
    <codeSummary>
      <code>advoc</code>
      <name>Carrière Judiciaire</name>
      <weight>15.0</weight>
      <duration>433</duration>
      <domain>OperationArea</domain>
    </codeSummary>
    <codeSummary>
      <code>pm</code>
      <name>Direction de Projet</name>
      <weight>37.0</weight>
      <duration>469</duration>
      <domain>OperationArea</domain>
    </codeSummary>
    <codeSummary>
      <code>retai</code>
      <name>Vente au Détail</name>
      <weight>19.0</weight>
      <duration>494</duration>
      <domain>OperationArea</domain>
    </codeSummary>
    <codeSummary>
      <code>sale</code>
      <name>Ventes en Général</name>
      <weight>37.0</weight>
      <duration>469</duration>
      <domain>OperationArea</domain>
    </codeSummary>
    <codeSummary>
      <code>DE</code>
      <name>Allemagne</name>
      <weight>100.0</weight>
      <duration>528</duration>
      <domain>ISOCountry</domain>
    </codeSummary>
    <codeSummary>
      <code>FR</code>
      <name>France</name>
      <weight>74.0</weight>
      <duration>458</duration>
      <domain>ISOCountry</domain>
    </codeSummary>
  </statistics>
</cv> 

I want to extract the data from it, so I'm using:

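from lxml import etree as et  # assumed: tree.xpath() below exists in lxml, not in xml.etree

def personal_infos(xml):
    tree = et.fromstring(xml)
    # Read the default namespace from the root element
    namespace = tree.xpath('namespace-uri(.)')
    firstname = tree.find(".//{%s}firstname" % namespace).text
    lastname = tree.find(".//{%s}lastname" % namespace).text
    email = tree.find(".//{%s}email" % namespace).text
    # Qualify the path: a bare ".//code" only works because <gender> holds the first <code> in the file
    gender = tree.find(".//{%s}gender/{%s}code" % (namespace, namespace)).text
    phone = tree.find(".//{%s}phoneNumber" % namespace).text
    personal_infos = {"first_name": firstname, "last_name": lastname, "gender": gender, "email": email, "phone_number": phone}
    print(personal_infos)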

I'm wondering how to write a loop that collects everything: <work> contains many <phase> elements, so I can't reuse the same code.

Example of what I need:

I need to extract all the fields (dateFrom, dateTo, current, operationArea, ...) from <work>.

I tried this:

import io
import xml.etree.ElementTree as ET

def work_infos(xml):
    f = io.StringIO(xml)
    tree = ET.parse(f)
    root = tree.getroot()
    for elem in root.iter():
        for p in tree.findall('.//work//phase'):
            df = tree.find('.//dateFrom')
            dateTo = tree.find('.//dateTo')

It returns nothing.

I'm new to Python, so any kind of guidance would be helpful. Thanks.

CodePudding user response:

ElementTree can do this too, but I am more familiar with BeautifulSoup. Both handle XML data in a similar way.
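
For comparison, here is a minimal sketch of the same loop using the standard library's xml.etree.ElementTree. It assumes the default namespace shown on the <cv> root in the question. The names NS, child_text and work_phases, and the shortened SAMPLE string, are made up for illustration. The attempt in the question most likely finds nothing because its paths omit the namespace, and because it calls tree.find inside the loop instead of searching from each phase.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the <cv> root element in the question.
NS = {"cv": "http://hr.joinvision.com/xml/3_0"}

# Shortened stand-in for the real file (illustrative only).
SAMPLE = """<cv xmlns="http://hr.joinvision.com/xml/3_0">
  <work>
    <phase>
      <dateFrom>1981-01-01 01:00</dateFrom>
      <dateTo>1984-01-01 01:00</dateTo>
      <current>false</current>
    </phase>
    <phase>
      <dateFrom>2020-01-01 01:00</dateFrom>
      <current>false</current>
    </phase>
  </work>
  <education>
    <phase>
      <dateFrom>1984-01-01 01:00</dateFrom>
    </phase>
  </education>
</cv>"""


def child_text(parent, tag):
    """Return the stripped text of a direct child, or None if it is missing."""
    value = parent.findtext(f"cv:{tag}", default=None, namespaces=NS)
    return value.strip() if value is not None else None


def work_phases(xml):
    """Collect one dict per <phase> found directly under <work>."""
    root = ET.fromstring(xml)
    rows = []
    # Every step of the path carries the namespace prefix; <education> phases are skipped.
    for phase in root.findall("cv:work/cv:phase", NS):
        rows.append({
            "dateFrom": child_text(phase, "dateFrom"),
            "dateTo": child_text(phase, "dateTo"),
            "current": child_text(phase, "current"),
        })
    return rows


if __name__ == "__main__":
    for row in work_phases(SAMPLE):
        print(row)
```

A missing child such as dateTo comes back as None rather than raising an error, so phases with incomplete data still produce a row.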

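With BeautifulSoup, first call find_all('phase') to get a list of the phase elements. Called on the whole document, as in the code below, it matches every phase, including those under education, which is why the output has seven rows; call it on xml_doc.find('work') instead to restrict it to work. Then iterate over the list and read the values one by one, using .text.strip() to get each text node without leading or trailing whitespace. Because the 'lxml' parser treats the input as HTML and lowercases tag names, the tags are looked up in lowercase (e.g. 'datefrom'). Store the values for each phase in a dict, append it to a list, and finally convert the list of dicts to a DataFrame with pd.DataFrame.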

from bs4 import BeautifulSoup
import pandas as pd

def find_tag(phase, tag_name):
    # Return the stripped text of the first matching tag, or None if the tag is absent
    tag = phase.find(tag_name)
    return tag.text.strip() if tag is not None else None

# The 'lxml' (HTML) parser lowercases tag names, hence 'datefrom' rather than 'dateFrom'
xml_doc = BeautifulSoup('<your_xml>', 'lxml')
phases = xml_doc.find_all('phase')
phase_list = []
for phase in phases:
    phase_dict = {}
    phase_dict['datefrom'] = find_tag(phase, 'datefrom')
    phase_dict['dateto'] = find_tag(phase, 'dateto')
    phase_dict['datefromfuzzy'] = find_tag(phase, 'datefromfuzzy')
    phase_dict['datetofuzzy'] = find_tag(phase, 'datetofuzzy')
    phase_dict['duration'] = find_tag(phase, 'duration')

    # continue to extract next value node here
    # ...

    phase_list.append(phase_dict)

df = pd.DataFrame(phase_list)
print(df)

Output:

           datefrom            dateto datefromfuzzy datetofuzzy duration
0  1981-01-01 01:00  1984-01-01 01:00          1981        1984       36
1  1978-01-01 01:00  2021-12-01 01:00          1978     2021-12      516
2  1976-01-01 01:00  1978-01-01 01:00          1976        1978       24
3  2020-01-01 01:00              None       2020-01        None     None
4  1984-01-01 01:00              None          1984        None       12
5  1976-01-01 01:00              None          1976        None       12
6  1984-01-01 01:00  2020-01-01 01:00          1984        2020      432
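
The question also asks for the operation areas, which repeat within a phase and so do not fit a single scalar column. One possible way to handle them, sketched here with the standard library's ElementTree rather than BeautifulSoup, is to collect the name of every operationArea into a list per phase. The NS mapping, the function name phases_with_areas and the trimmed SAMPLE are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the <cv> root element in the question.
NS = {"cv": "http://hr.joinvision.com/xml/3_0"}

# Trimmed stand-in for the real file (illustrative only).
SAMPLE = """<cv xmlns="http://hr.joinvision.com/xml/3_0">
  <work>
    <phase>
      <dateFrom>1981-01-01 01:00</dateFrom>
      <operationArea>
        <code>ass</code>
        <name>Assistance</name>
        <weight>1.0</weight>
      </operationArea>
      <operationArea>
        <code>pm</code>
        <name>Direction de Projet</name>
        <weight>1.0</weight>
      </operationArea>
    </phase>
    <phase>
      <dateFrom>2020-01-01 01:00</dateFrom>
    </phase>
  </work>
</cv>"""


def phases_with_areas(xml):
    """Return one dict per work phase, with all operation area names in a list."""
    root = ET.fromstring(xml)
    rows = []
    for phase in root.findall("cv:work/cv:phase", NS):
        # findall returns every direct <operationArea> child, or an empty list if there are none.
        areas = [
            area.findtext("cv:name", default="", namespaces=NS).strip()
            for area in phase.findall("cv:operationArea", NS)
        ]
        rows.append({
            "dateFrom": phase.findtext("cv:dateFrom", default=None, namespaces=NS),
            "operationAreas": areas,
        })
    return rows


if __name__ == "__main__":
    for row in phases_with_areas(SAMPLE):
        print(row)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as phase_list; the operationAreas column then holds a Python list per row.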