How to create a data frame from multiple xml files containing same structure?

Time:12-31

I have more than 1000 XML files that probably share the same structure, and I want to build a database from the data in all of them. Until yesterday I had never seen an XML file. With the help of Google, I used R packages to load a single XML file in RStudio, but when I try to convert it into a data frame I get an error.

This is what the files look like. File A:

<?xml-stylesheet type='text/xsl' href='anzctrTransform.xsl'?>
<ANZCTR_Trial requestNumber="42">
  <stage>Registered</stage>
  <submitdate>19/07/2005</submitdate>
  <approvaldate>19/07/2005</approvaldate>
  <dateLastUpdated>14/12/2010</dateLastUpdated>
  <actrnumber>ACTRN12605000026628</actrnumber>
  <trial_identification>
    <studytitle>Phase II study of fixed dose rate Gemcitabine-Oxaliplatin Integrated with concomitant 5FU and 3-D Conformal Radiotherapy for the treatment of localised pancreatic cancer: GOFURTGO</studytitle>
    <scientifictitle>Phase II study of fixed dose rate Gemcitabine-Oxaliplatin Integrated with concomitant 5FU and 3-D Conformal Radiotherapy for the treatment of localised pancreatic cancer: GOFURTGO</scientifictitle>
    <utrn />
    <trialacronym>GOFURTGO</trialacronym>
    <secondaryid>GOFURTGO</secondaryid>
  </trial_identification>
  <conditions>
    <healthcondition>Locally advanced or locally recurrent inoperable pancreatic cancer not previously treated with chemotherapy or radiotherapy.</healthcondition>
    <conditioncode>
      <conditioncode1>Cancer</conditioncode1>
      <conditioncode2>Pancreatic</conditioncode2>
    </conditioncode>
  </conditions>
  <interventions>
    <interventions>All patients enrolled in the study will receive the same treatment consisting of all of the following:
a) 1 cycle of chemotherapy: the cycle is 28 days (gemcitabine on days 1 and 15 and oxaliplatin on days 2 and 16, followed by:
b)radiotherpay plus continuous 5FU infusion: 5FU is given continuously (7 days a week for 6 weeks), radiotherpay is given 5 days a week (Mon-Fri) for 6 weeks followed by:
c) 3 cycles of chemotherapy: each cycle is 28 days (gemcitabine on days 1 and 15 and oxaliplatin on days 2 and 16</interventions>
    <comparator>This is a single group trial</comparator>
    <control>Uncontrolled</control>
    <interventioncode>Treatment: Other</interventioncode>
  </interventions>
  <outcomes>
    <primaryOutcome>
      <outcome>The primary objective is to determine the proportions of patients starting and finishing greater than or equal to 80% of the planned dose on time for each component of the treatment.</outcome>
      <timepoint>The outcome will be measured once all patients have enrolled and have completeed the study treatment.</timepoint>
    </primaryOutcome>
    <secondaryOutcome>
      <outcome>Adverse events</outcome>
      <timepoint>Assessed at the end of ecah treatment cycle, and at end of treatment.</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Objective tumour response rates</outcome>
      <timepoint>Before and after radiotherapy, at the end of treatment, and then as clinically indicated.</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Time to progression</outcome>
      <timepoint>Before and after radiotherapy, at the end of treatment, and then as clinically indicated.</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>CA 19-9 response rates</outcome>
      <timepoint>Before and after radiotherapy, at the end of treatment, and then 2 monthly during follow up.</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Health-related quality of life.</outcome>
      <timepoint>Before and after radiotherapy, at the end of treatment, and then 2 monthly until progression/disease recurrence.</timepoint>
    </secondaryOutcome>
  </outcomes>
  <eligibility>
    <inclusivecriteria>Patient must have histologically/cytologically proven adenocarcinoma of the pancreas located in the head or the body of the pancreas (primary) or in the pancreatic bed (locally recurrent).Locoregional disease must be confirmed by dual phase CT (arterial and portal phases) without distant metastases (confirmed by CT of the chest, abdomen and pelvis).Patients must be assessed by a surgeon and considered inoperable.Performance status must be ECOG grade 0, 1 or 2.</inclusivecriteria>
    <inclusiveminage>0</inclusiveminage>
    <inclusiveminagetype>Not stated</inclusiveminagetype>
    <inclusivemaxage>0</inclusivemaxage>
    <inclusivemaxagetype>Not stated</inclusivemaxagetype>
    <inclusivegender>Both males and females</inclusivegender>
    <healthyvolunteer>No</healthyvolunteer>
    <exclusivecriteria>1.Histological types other than pancreatic ductal adenocarcinoma
2. Metastatic disease.
3. Tumours of the tail of pancreas
4. Major co-morbid illnesses that, in the opinion of the investigator, would jeopardise the likely completion of the treatment program
5. Patients with peripheral sensory neuropathy with functional impairment.
6. Derangement of LFTs consistent with hepatic cellular dysfunction (ALT and/or AST &gt;3 times upper limit of normal), or a bilirubin &gt;3 times upper limit of normal. Patients with LFTs consistent with hepatic obstruction that is relieved (eg. by stenting, bypass) are eligible, provided the bilirubin has fallen to &lt;3 times upper limit of normal.
7. Patients with significant loss of bodyweight, who, at the investigator’s discretion, is deemed   not suitable for this study (eg.&gt;15% weight loss since surgery or diagnosis)
8. Treatment with a drug within the last 30 days that has not received regulatory approval at the time of study entry.
9. Treatment with any previous cytotoxic chemotherapy for this malignancy. Previous hormonal manipulation (including HRT) is allowed.
10. Previous abdominal radiotherapy
11. A previous history of malignancy other than non-melanomatous skin cancers, in –situ carcinoma, or patients who are disease–free from non-pancreatic tumours treated definitively more than 5 years ago.
12. Pregnant or lactating women, or women of childbearing potential not using adequate contraception.</exclusivecriteria>
  </eligibility>
  <trial_design>
    <studytype>Interventional</studytype>
    <purpose>Treatment</purpose>
    <allocation>Non-randomised trial</allocation>
    <concealment>Paper enrolment through the AGITG Coordinating Centre, NHMRC Clinical Trials Centre</concealment>
    <sequence>n/a</sequence>
    <masking>Open (masking not used)</masking>
    <assignment>Single group</assignment>
    <designfeatures />
    <endpoint>Safety</endpoint>
    <statisticalmethods />
    <masking1 />
    <masking2 />
    <masking3 />
    <masking4 />
    <patientregistry />
    <followup />
    <followuptype />
    <purposeobs />
    <duration />
    <selection />
    <timing />
  </trial_design>
  <recruitment>
    <phase>Phase 2</phase>
    <anticipatedstartdate>13/04/2005</anticipatedstartdate>
    <actualstartdate />
    <anticipatedenddate />
    <actualenddate />
    <samplesize>45</samplesize>
    <actualsamplesize />
    <currentsamplesize />
    <recruitmentstatus>Completed</recruitmentstatus>
    <anticipatedlastvisitdate />
    <actuallastvisitdate />
    <dataanalysis />
    <withdrawnreason />
    <withdrawnreasonother />
    <recruitmentcountry>Australia</recruitmentcountry>
    <recruitmentstate />
  </recruitment>
  <sponsorship>
    <primarysponsortype>Other Collaborative groups</primarysponsortype>
    <primarysponsorname>AGITG</primarysponsorname>
    <primarysponsoraddress>92-94 Parramatta Rd, Camperdown NSW 2050</primarysponsoraddress>
    <primarysponsorcountry>Australia</primarysponsorcountry>
    <fundingsource>
      <fundingtype>Commercial sector/Industry</fundingtype>
      <fundingname>Sanofi-Aventis</fundingname>
      <fundingaddress>Sanofi-Aventis Group 
Talavera Corporate Centre 
Building D 
12-24 Talavera Road 
Macquarie Park NSW 2113</fundingaddress>
      <fundingcountry>Australia</fundingcountry>
    </fundingsource>
    <fundingsource>
      <fundingtype>Other Collaborative groups</fundingtype>
      <fundingname>AGITG</fundingname>
      <fundingaddress>NHMRC Clinical Trials Centre
University of Sydney
Locked Bag 77
CAMPERDOWN NSW 1450</fundingaddress>
      <fundingcountry>Australia</fundingcountry>
    </fundingsource>
    <fundingsource>
      <fundingtype>University</fundingtype>
      <fundingname>CTC</fundingname>
      <fundingaddress>NHMRC Clinical Trials Centre
University of Sydney
Locked Bag 77
CAMPERDOWN NSW 1450</fundingaddress>
      <fundingcountry>Australia</fundingcountry>
    </fundingsource>
    <secondarysponsor>
      <sponsortype>Other Collaborative groups</sponsortype>
      <sponsorname>AGITG</sponsorname>
      <sponsoraddress>NHMRC Clinical Trials Centre
University of Sydney
Locked Bag 77
CAMPERDOWN NSW 1450</sponsoraddress>
      <sponsorcountry>Australia</sponsorcountry>
    </secondarysponsor>
  </sponsorship>
  <ethicsAndSummary>
    <summary />
    <trialwebsite />
    <publication />
    <ethicsreview>Approved</ethicsreview>
    <publicnotes />
    <ethicscommitee>
      <ethicname>University of Sydney</ethicname>
      <ethicaddress>Human Research Ethics Committee
Main Quad
University of Sydney NSW 2006</ethicaddress>
      <ethicapprovaldate />
      <hrec>11-2004/5/7779</hrec>
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Prince of Wales Hospital</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Border Medical Oncology</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>St. George Hospital</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Newcastle Mater</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Alfred Hospital</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Nepean Hospital</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
    <ethicscommitee>
      <ethicname>Royal Adelaide Hospital</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
  </ethicsAndSummary>
  <attachment />
  <contacts>
    <contact>
      <title />
      <name>Dr David Goldstein</name>
      <address>Department of Medical Oncology
Prince of Wales Hospital
High Street
Randwick NSW 2031</address>
      <phone> 61 2 93822577</phone>
      <fax> 61 2 93822578</fax>
      <email>[email protected]</email>
      <country>Australia</country>
      <type>Scientific Queries</type>
    </contact>
    <contact>
      <title />
      <name>Dr David Goldstein</name>
      <address>Department of Medical Oncology
Prince of Wales Hospital
High Street
Randwick NSW 2031</address>
      <phone> 61 2 93822577</phone>
      <fax> 61 2 93822578</fax>
      <email>[email protected]</email>
      <country>Australia</country>
      <type>Public Queries</type>
    </contact>
    <contact>
      <title />
      <name />
      <address />
      <phone />
      <fax />
      <email />
      <country />
      <type>Principal Investigator</type>
    </contact>
  </contacts>
</ANZCTR_Trial>

File B.

<?xml-stylesheet type='text/xsl' href='anzctrTransform.xsl'?>
<ANZCTR_Trial requestNumber="6">
  <stage>Registered</stage>
  <submitdate>08/07/2005</submitdate>
  <approvaldate>08/07/2005</approvaldate>
  <dateLastUpdated>24/06/2010</dateLastUpdated>
  <actrnumber>ACTRN12605000003673</actrnumber>
  <trial_identification>
    <studytitle>Bisphosphonate and Anastrozole trial - Bone Maintenance Algorithm Assessment</studytitle>
    <scientifictitle>Maintaining skeletal health in postmenopausal women with surgically resected Stage I-IIIa hormone-receptor positive breast cancer who are receiving anastrozole, through the use of alendronate as determined by the Osteoporosis Australia Bone Maintenance Algorithm</scientifictitle>
    <utrn />
    <trialacronym>BATMAN</trialacronym>
    <secondaryid>Andrew Love Cancer Centre: ALCC 04.02</secondaryid>
  </trial_identification>
  <conditions>
    <healthcondition>Breast Cancer</healthcondition>
    <conditioncode>
      <conditioncode1>Cancer</conditioncode1>
      <conditioncode2>Breast</conditioncode2>
    </conditioncode>
  </conditions>
  <interventions>
    <interventions>This trial aims to assess the utility, through DEXA scans and biochemical markers of bone turnover, of a strategy of monitoring and intervention with oral alendronate in postmenopausal women with hormone-receptor positive breast cancer receiving five years of adjuvant anastrozole. It specifically addressed the issues of osteopaenic and osteoporotic women in this setting and will test three years versus five years of alendronate use.</interventions>
    <comparator>Five years of treatment with 70mg oral alendronate once weekly</comparator>
    <control>Active</control>
    <interventioncode>Treatment: Drugs</interventioncode>
  </interventions>
  <outcomes>
    <primaryOutcome>
      <outcome>Changes in lumbar vertebra and femoral neck BMD T-score after 5 years of anastrozole treatment</outcome>
      <timepoint>After 5 years of anastrozole treatment</timepoint>
    </primaryOutcome>
    <secondaryOutcome>
      <outcome>Percent change in the lumbar vertebrae</outcome>
      <timepoint>Annually for 5 years</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Biochemical markers</outcome>
      <timepoint>6 months after commencing alendronate</timepoint>
    </secondaryOutcome>
    <secondaryOutcome>
      <outcome>Evaluate the Osteoporosis Australia strategy for bone protection for this patient group.</outcome>
      <timepoint>At 5 years</timepoint>
    </secondaryOutcome>
  </outcomes>
  <eligibility>
    <inclusivecriteria>Postmenopausal women- Adequately diagnosed and treated Stage I-IIIa early breast cancer- Oestrogen receptor and/or progesterone receptor positive breast cancer- Anastrozole is clinically indicated to be the best adjuvant strategy</inclusivecriteria>
    <inclusiveminage>18</inclusiveminage>
    <inclusiveminagetype>Years</inclusiveminagetype>
    <inclusivemaxage>0</inclusivemaxage>
    <inclusivemaxagetype>Not stated</inclusivemaxagetype>
    <inclusivegender>Females</inclusivegender>
    <healthyvolunteer>No</healthyvolunteer>
    <exclusivecriteria>Clinical or radiological evidence of distant spread- prior treatment with bisphosphonates within the past 12 months</exclusivecriteria>
  </eligibility>
  <trial_design>
    <studytype>Interventional</studytype>
    <purpose>Prevention</purpose>
    <allocation>Randomised controlled trial</allocation>
    <concealment>central randomisation via fax and phone</concealment>
    <sequence>Computer generated stratified blocks</sequence>
    <masking>Open (masking not used)</masking>
    <assignment>Parallel</assignment>
    <designfeatures />
    <endpoint>Efficacy</endpoint>
    <statisticalmethods />
    <masking1 />
    <masking2 />
    <masking3 />
    <masking4 />
    <patientregistry />
    <followup />
    <followuptype />
    <purposeobs />
    <duration />
    <selection />
    <timing />
  </trial_design>
  <recruitment>
    <phase>Phase 3</phase>
    <anticipatedstartdate>05/07/2005</anticipatedstartdate>
    <actualstartdate />
    <anticipatedenddate />
    <actualenddate />
    <samplesize>300</samplesize>
    <actualsamplesize />
    <currentsamplesize />
    <recruitmentstatus>Active, not recruiting</recruitmentstatus>
    <anticipatedlastvisitdate />
    <actuallastvisitdate />
    <dataanalysis />
    <withdrawnreason />
    <withdrawnreasonother />
    <recruitmentcountry>Australia</recruitmentcountry>
    <recruitmentstate />
  </recruitment>
  <sponsorship>
    <primarysponsortype>Hospital</primarysponsortype>
    <primarysponsorname>Barwon Health</primarysponsorname>
    <primarysponsoraddress>272-322 Ryrie Street, Geelong, Victoria 3220</primarysponsoraddress>
    <primarysponsorcountry>Australia</primarysponsorcountry>
    <fundingsource>
      <fundingtype>Commercial sector/Industry</fundingtype>
      <fundingname>Astra Zeneca</fundingname>
      <fundingaddress>P.O Box 131, North Ryde PBC NSW 1670</fundingaddress>
      <fundingcountry>Australia</fundingcountry>
    </fundingsource>
    <secondarysponsor>
      <sponsortype>None</sponsortype>
      <sponsorname>Nil</sponsorname>
      <sponsoraddress>Nil</sponsoraddress>
      <sponsorcountry />
    </secondarysponsor>
  </sponsorship>
  <ethicsAndSummary>
    <summary />
    <trialwebsite />
    <publication />
    <ethicsreview>Approved</ethicsreview>
    <publicnotes />
    <ethicscommitee>
      <ethicname>Barwon Health</ethicname>
      <ethicaddress />
      <ethicapprovaldate />
      <hrec />
      <ethicsubmitdate />
      <ethiccountry>Australia</ethiccountry>
    </ethicscommitee>
  </ethicsAndSummary>
  <attachment />
  <contacts>
    <contact>
      <title />
      <name>Associate Professor Richard Bell</name>
      <address>Andrew Love Cancer Centre
The Geelong Hospital
70 Swanston Street
Geelong VIC 3220</address>
      <phone> 61 3 52267855</phone>
      <fax> 61 3 52465168</fax>
      <email>[email protected]</email>
      <country>Australia</country>
      <type>Scientific Queries</type>
    </contact>
    <contact>
      <title />
      <name>Ms Elaine Yeow</name>
      <address>Andrew Love Cancer Centre
The Geelong Hospital
70 Swanston Street
Geelong VIC 3220</address>
      <phone> 61 3 52267858</phone>
      <fax> 61 3 52465168</fax>
      <email>[email protected]</email>
      <country>Australia</country>
      <type>Public Queries</type>
    </contact>
    <contact>
      <title />
      <name />
      <address />
      <phone />
      <fax />
      <email />
      <country />
      <type>Principal Investigator</type>
    </contact>
  </contacts>
</ANZCTR_Trial>

Here is my code:

library(XML)
library(xml2)
x =  read_xml("ACTRN12605000026628.xml")
print(x)

Trial 1.

x_df = as.data.frame(x)
Error in as.data.frame.default(x) : 
  cannot coerce class ‘c("xml_document", "xml_node")’ to a data.frame

Trial 2.

 xmlToList(x)
Error in UseMethod("xmlSApply") : 
  no applicable method for 'xmlSApply' applied to an object of class "c('xml_document', 'xml_node')"

Trial 3.

xmlToDataFrame(x)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘xmlToDataFrame’ for signature ‘"xml_document", "missing", "missing", "missing", "missing"’

I need help understanding why these errors occur and how the data from multiple files can be converted to a data frame or table in R.

CodePudding user response:

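You cannot convert an XML document directly to a data frame. The errors occur because read_xml() comes from xml2 and returns an xml_document object, while xmlToList() and xmlToDataFrame() come from the XML package and expect documents parsed by that package (for example with xmlParse()); as.data.frame() has no method for xml2 objects either. Instead, fetch the tags and the data inside them, then build the data frame from those.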
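To make the class mismatch concrete, here is a minimal sketch that uses only xml2. The inline XML string is a made-up miniature of the files in the question, and trial_list is just an illustrative name. It shows the classes that read_xml() returns and uses as_list(), which is xml2's own way to turn a node into a nested list.

```r
library(xml2)

# Made-up miniature of one trial file, inlined so the sketch runs without files on disk
doc <- read_xml("<ANZCTR_Trial requestNumber='42'>
  <stage>Registered</stage>
  <actrnumber>ACTRN12605000026628</actrnumber>
</ANZCTR_Trial>")

# xml2 objects carry their own classes; the XML package's converters
# have no methods for them, hence the errors in the question
class(doc)

# Select the root element and convert its children to a nested list
root <- xml_find_first(doc, "/ANZCTR_Trial")
trial_list <- as_list(root)

# Each child element becomes a list holding its text
trial_list$stage[[1]]
```

The point is only that the parser and the converter should come from the same package; mixing read_xml() with xmlToList() is what triggers the "no applicable method" errors.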

Here's the code that will do the trick:

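library(xml2)

# Parse the file; df holds the XML document here, not a data frame
df <- read_xml("1.xml")

# Select the root element
records <- xml_find_all(df, "//ANZCTR_Trial")
records

# Names and trimmed text of the root's direct children
nodenames <- xml_name(xml_children(records))
nodevalues <- trimws(xml_text(xml_children(records)))

# Build a one-row data frame with one column per child tag
trial_df <- as.data.frame(t(nodevalues))
colnames(trial_df) <- nodenames

write.csv(x = trial_df, file = 'trialData.csv')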

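records contains the parent tag together with all the tags and data inside it. In your case the parent tag is ANZCTR_Trial in both of the files that you shared in the question.

nodenames holds the names of the tags directly under the parent tag, and nodevalues holds their data. For a tag that has children of its own, such as contacts, xml_text() returns the text of all its descendants joined into one string.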
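The steps above give one row for one file. For a whole folder, the same steps can be wrapped in a function and applied to every file, as in this sketch. The helper read_trial(), the source_file column, and the two miniature files written to a temporary folder are all assumptions made for illustration; with real data, files would point at the folder holding the 1000+ XML files.

```r
library(xml2)

# Hypothetical helper: one trial file -> one-row data frame,
# with one column per direct child of the root element
read_trial <- function(path) {
  fields <- xml_children(read_xml(path))
  row <- as.data.frame(t(trimws(xml_text(fields))), stringsAsFactors = FALSE)
  names(row) <- xml_name(fields)
  row$source_file <- basename(path)  # remember where each row came from
  row
}

# Two made-up miniature files stand in for the real folder
trial_dir <- tempfile("trials")
dir.create(trial_dir)
writeLines(
  "<ANZCTR_Trial><stage>Registered</stage><actrnumber>ACTRN12605000026628</actrnumber></ANZCTR_Trial>",
  file.path(trial_dir, "a.xml")
)
writeLines(
  "<ANZCTR_Trial><stage>Registered</stage><actrnumber>ACTRN12605000003673</actrnumber></ANZCTR_Trial>",
  file.path(trial_dir, "b.xml")
)

# Read every XML file in the folder and stack the rows
files <- list.files(trial_dir, pattern = "\\.xml$", full.names = TRUE)
trials <- do.call(rbind, lapply(files, read_trial))
trials
```

rbind() requires every file to produce the same column names. That holds for the top-level tags of the two files shown in the question, but if some files turn out to have extra or missing top-level tags, the rows would need to be aligned before binding.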

To fetch data from nested tags, that is, tags inside tags (for example phone and fax inside contacts), change the XPath in the xml_find_all() call:

records <- xml_find_all(df, "//contacts")  # change the XPath to match the element you need
records

Everything else remains the same.
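Because contacts holds several contact elements, it may be more useful to build a separate table with one row per contact rather than one wide row. The following is a sketch of that idea, not part of the original answer: the inline XML is a cut-down, made-up version of File B, and contact_df is an illustrative name. It relies on xml_find_first() returning exactly one result per input node, which keeps the columns aligned even when a tag is empty.

```r
library(xml2)

# Cut-down, made-up version of File B with only a few contact fields
doc <- read_xml("<ANZCTR_Trial>
  <actrnumber>ACTRN12605000003673</actrnumber>
  <contacts>
    <contact><name>Associate Professor Richard Bell</name><phone> 61 3 52267855</phone><type>Scientific Queries</type></contact>
    <contact><name>Ms Elaine Yeow</name><phone> 61 3 52267858</phone><type>Public Queries</type></contact>
    <contact><name /><phone /><type>Principal Investigator</type></contact>
  </contacts>
</ANZCTR_Trial>")

# One node per contact
contacts <- xml_find_all(doc, "//contacts/contact")

# One row per contact; the trial number is repeated so the rows can be
# linked back to the trial they belong to
contact_df <- data.frame(
  actrnumber = xml_text(xml_find_first(doc, "//actrnumber")),
  name  = trimws(xml_text(xml_find_first(contacts, "./name"))),
  phone = trimws(xml_text(xml_find_first(contacts, "./phone"))),
  type  = trimws(xml_text(xml_find_first(contacts, "./type"))),
  stringsAsFactors = FALSE
)
contact_df
```

The same pattern should carry over to the other repeated blocks in these files, such as secondaryOutcome, fundingsource, and ethicscommitee, by changing the XPath and the field names.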
