Parse XML file in pandas

I have this XML file. It's called "LogReg.xml" and contains information about a logistic regression. I am interested in the names of the features and their coefficients, as explained in more detail below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
    <Header>
        <Application name="JPMML-SkLearn" version="1.6.35"/>
        <Timestamp>2022-02-15T09:44:54Z</Timestamp>
    </Header>
    <MiningBuildTask>
        <Extension name="repr">PMMLPipeline(steps=[('classifier', LogisticRegression())])</Extension>
    </MiningBuildTask>
    <DataDictionary>
        <DataField name="Target" optype="categorical" dataType="integer">
            <Value value="0"/>
            <Value value="1"/>
        </DataField>
        <DataField name="const" optype="continuous" dataType="double"/>
        <DataField name="grade" optype="continuous" dataType="double"/>
        <DataField name="emp_length" optype="continuous" dataType="double"/>
        <DataField name="dti" optype="continuous" dataType="double"/>
        <DataField name="Orig_FicoScore" optype="continuous" dataType="double"/>
        <DataField name="inq_last_6mths" optype="continuous" dataType="double"/>
        <DataField name="acc_open_past_24mths" optype="continuous" dataType="double"/>
        <DataField name="mort_acc" optype="continuous" dataType="double"/>
        <DataField name="mths_since_recent_bc" optype="continuous" dataType="double"/>
        <DataField name="num_rev_tl_bal_gt_0" optype="continuous" dataType="double"/>
        <DataField name="percent_bc_gt_75" optype="continuous" dataType="double"/>
    </DataDictionary>
    <RegressionModel functionName="classification" algorithmName="sklearn.linear_model._logistic.LogisticRegression" normalizationMethod="logit">
        <MiningSchema>
            <MiningField name="Target" usageType="target"/>
            <MiningField name="const"/>
            <MiningField name="grade"/>
            <MiningField name="emp_length"/>
            <MiningField name="dti"/>
            <MiningField name="Orig_FicoScore"/>
            <MiningField name="inq_last_6mths"/>
            <MiningField name="acc_open_past_24mths"/>
            <MiningField name="mort_acc"/>
            <MiningField name="mths_since_recent_bc"/>
            <MiningField name="num_rev_tl_bal_gt_0"/>
            <MiningField name="percent_bc_gt_75"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
            <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
        </Output>
        <RegressionTable intercept="0.8064694059338298" targetCategory="1">
            <NumericPredictor name="const" coefficient="0.8013433785974717"/>
            <NumericPredictor name="grade" coefficient="0.9010481046582982"/>
            <NumericPredictor name="emp_length" coefficient="0.9460686056314133"/>
            <NumericPredictor name="dti" coefficient="0.5117062988491518"/>
            <NumericPredictor name="Orig_FicoScore" coefficient="0.07944303372859234"/>
            <NumericPredictor name="inq_last_6mths" coefficient="0.20516234445402765"/>
            <NumericPredictor name="acc_open_past_24mths" coefficient="0.4852503249658917"/>
            <NumericPredictor name="mort_acc" coefficient="0.6673203078463711"/>
            <NumericPredictor name="mths_since_recent_bc" coefficient="0.1962158305958366"/>
            <NumericPredictor name="num_rev_tl_bal_gt_0" coefficient="0.12964661294856686"/>
            <NumericPredictor name="percent_bc_gt_75" coefficient="0.04534570018290847"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
    </RegressionModel>
</PMML>

I have parsed it using this code:

import pandas as pd
from lxml import objectify

path = 'LogReg.xml'

# objectify.parse accepts a file path directly
parsed = objectify.parse(path)
root = parsed.getroot()

data = []

for elt in root.RegressionModel.RegressionTable:
    el_data = {}
    # collect each child's tag and text content
    for child in elt.getchildren():
        el_data[child.tag] = child.text
    data.append(el_data)

perf = pd.DataFrame(data)

I am interested in parsing this bit:

    <RegressionTable intercept="0.8064694059338298" targetCategory="1">
        <NumericPredictor name="const" coefficient="0.8013433785974717"/>
        <NumericPredictor name="grade" coefficient="0.9010481046582982"/>
        <NumericPredictor name="emp_length" coefficient="0.9460686056314133"/>
        <NumericPredictor name="dti" coefficient="0.5117062988491518"/>
        <NumericPredictor name="Orig_FicoScore" coefficient="0.07944303372859234"/>
        <NumericPredictor name="inq_last_6mths" coefficient="0.20516234445402765"/>
        <NumericPredictor name="acc_open_past_24mths" coefficient="0.4852503249658917"/>
        <NumericPredictor name="mort_acc" coefficient="0.6673203078463711"/>
        <NumericPredictor name="mths_since_recent_bc" coefficient="0.1962158305958366"/>
        <NumericPredictor name="num_rev_tl_bal_gt_0" coefficient="0.12964661294856686"/>
        <NumericPredictor name="percent_bc_gt_75" coefficient="0.04534570018290847"/>
    </RegressionTable>

so that I can build the following dictionary:

myDict = {
"const" : 0.8013433785974717,
"grade" : 0.9010481046582982,
"emp_length" : 0.9460686056314133,
"dti" : 0.5117062988491518,
"Orig_FicoScore" : 0.07944303372859234,
"inq_last_6mths" : 0.20516234445402765,
"acc_open_past_24mths" : 0.4852503249658917,
"mort_acc" : 0.6673203078463711,
"mths_since_recent_bc" : 0.1962158305958366,
"num_rev_tl_bal_gt_0" : 0.12964661294856686,
"percent_bc_gt_75" : 0.04534570018290847
}

Basically, each key in the dictionary is a feature name and each value is that feature's logistic regression coefficient.

Please can anyone help me with the code?

CodePudding user response:

I'm not sure you need pandas for this, but you do need to handle the namespaces in your xml.
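
To see why the namespace matters, here is a minimal sketch using the standard library's xml.etree.ElementTree on a trimmed inline copy of the file (the `sample` string and the `pm` prefix are made up for illustration). Because the root element declares a default namespace, every element's tag is really `{http://www.dmg.org/PMML-4_4}Name`, so a search for the bare tag name finds nothing:

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative copy of LogReg.xml (only two predictors kept)
sample = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
    <RegressionModel functionName="classification">
        <RegressionTable intercept="0.8064694059338298" targetCategory="1">
            <NumericPredictor name="const" coefficient="0.8013433785974717"/>
            <NumericPredictor name="grade" coefficient="0.9010481046582982"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
    </RegressionModel>
</PMML>"""

root = ET.fromstring(sample)

# The default namespace is folded into every tag name
print(root.tag)  # {http://www.dmg.org/PMML-4_4}PMML

# A bare tag name does not match namespaced elements
without_ns = root.findall(".//NumericPredictor")

# Mapping a prefix (any name works) to the namespace URI makes the search match
ns = {"pm": "http://www.dmg.org/PMML-4_4"}
with_ns = root.findall(".//pm:NumericPredictor", ns)

print(len(without_ns), len(with_ns))  # 0 2
```

The same idea applies to lxml's xpath(): the prefix is arbitrary, but it has to map to the exact URI declared in the file.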

Try something along these lines:

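from lxml import etree

root = etree.parse('LogReg.xml').getroot()

# map a prefix of your choice to the document's default namespace
ns = {'xx': 'http://www.dmg.org/PMML-4_4'}

# you could collapse the next two into one line, but I believe it's clearer this way
rt = root.xpath('//xx:RegressionTable[.//xx:NumericPredictor]', namespaces=ns)[0]
predictors = rt.xpath('./xx:NumericPredictor', namespaces=ns)

# attribute values are strings, so convert each coefficient to float
myDict = {p.attrib['name']: float(p.attrib['coefficient']) for p in predictors}
print(myDict)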

The output should be your expected output.
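
The namespace URI contains the PMML version, so a hard-coded URI only matches 4.4 files. If lxml is not available, or the PMML version may differ between files, one variation is to match the tags in any namespace. The following is a sketch under those assumptions, using only the standard library (the `{*}` wildcard needs Python 3.8 or later). The helper name `read_coefficients` is hypothetical, and a small stand-in file is written to a temporary directory so that the example runs on its own:

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_coefficients(path):
    """Return {feature name: coefficient} from the first non-empty RegressionTable."""
    root = ET.parse(path).getroot()
    # '{*}' matches the tag in any namespace, or in none
    for table in root.iterfind(".//{*}RegressionTable"):
        predictors = table.findall("{*}NumericPredictor")
        if predictors:
            # attribute values are strings, so convert the coefficient to float
            return {p.get("name"): float(p.get("coefficient")) for p in predictors}
    return {}


# Illustrative stand-in for LogReg.xml (only two predictors kept)
sample = """<?xml version="1.0" encoding="UTF-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
    <RegressionModel functionName="classification">
        <RegressionTable intercept="0.8064694059338298" targetCategory="1">
            <NumericPredictor name="const" coefficient="0.8013433785974717"/>
            <NumericPredictor name="grade" coefficient="0.9010481046582982"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
    </RegressionModel>
</PMML>
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "LogReg.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(sample)
    coefficients = read_coefficients(path)

print(coefficients)
```

With the real file, calling read_coefficients('LogReg.xml') should give the full dictionary. If a DataFrame is still wanted, the dictionary can be passed on to pandas afterwards.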