Checking for well-formed XML in a dataframe column

Time:11-27

I have a dataframe with XML in a string column. Before I can process it further, I need to check that the XML is well formed. My current strategy is to use a UDF, but I get an error as a result.

Code:

from lxml import etree
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

def wellformedness(xml):
    wellformed = True
    try:
        doc = etree.fromstring(xml)
    except:
        wellformed = False
    return wellformed

udf_wellformedness = F.udf(wellformedness, BooleanType())
df.withColumn('Wellformed', udf_wellformedness('MyColumn'))
df2 = df.filter(~df["Wellformed"])

Error: AnalysisException: Cannot resolve column name "Wellformed" among (MyColumn, MyColumn2, MyColumn3);

What am I doing wrong? And can this be done more efficiently?

Answer:

withColumn does not modify df in place. It returns a new dataframe, and your code discards that result, so the filter on the next line cannot find "Wellformed". Assign the result back to a variable:

df = df.withColumn('Wellformed', udf_wellformedness('MyColumn'))

If you must validate the XML with a custom Python function, a UDF is the way to go.
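
On the efficiency question, one option that may help is a pandas UDF, which hands the Python function whole batches of rows instead of one value at a time. The following is a minimal sketch, not a measured improvement: the parsing still runs in Python, so any gain depends on your data. It assumes Spark 3.x with pandas and pyarrow installed. The names is_well_formed and add_wellformed_column are hypothetical. To stay dependency-free, the checker uses the standard library's xml.etree.ElementTree in place of lxml, and it catches only parse errors instead of every exception.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml):
    """Return True if `xml` parses as well-formed XML, False otherwise."""
    if xml is None:
        # Assumption: null cells are reported as not well formed.
        return False
    try:
        ET.fromstring(xml)
    except ET.ParseError:
        # Only parse failures count; unrelated errors still surface.
        return False
    return True


def add_wellformed_column(df, column="MyColumn"):
    """Hypothetical helper: add a boolean 'Wellformed' column to `df`."""
    # Imported lazily so the checker above can be used without Spark.
    import pandas as pd
    from pyspark.sql import functions as F

    @F.pandas_udf("boolean")
    def wellformed_udf(xml_col: pd.Series) -> pd.Series:
        # Spark passes each batch of rows as a single pandas Series.
        return xml_col.map(is_well_formed).astype(bool)

    # withColumn returns a new dataframe, so the result is returned
    # to the caller rather than discarded.
    return df.withColumn("Wellformed", wellformed_udf(column))
```

With this in place, df = add_wellformed_column(df) followed by df.filter(~df["Wellformed"]) would select the rows that failed the check.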
