I have a dataframe with XML in a string column. Before I can process it further, I need to check that the XML is well formed. My current strategy is to use a UDF, but I get an error.
Code:
from lxml import etree
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

def wellformedness(xml):
    wellformed = True
    try:
        doc = etree.fromstring(xml)
    except:
        wellformed = False
    return wellformed

udf_wellformedness = F.udf(wellformedness, BooleanType())
df.withColumn('Wellformed', udf_wellformedness('MyColumn'))
df2 = df.filter(~df["Wellformed"])
Error: AnalysisException: Cannot resolve column name "Wellformed" among (MyColumn, MyColumn2, MyColumn3);
What am I doing wrong? And can this be done more efficiently?
Answer:
withColumn does not modify df in place. It returns a new dataframe with the extra column, and your code discards that result, so df still has no Wellformed column when filter runs. Assign the result back to df:
df = df.withColumn('Wellformed', udf_wellformedness('MyColumn'))
If you must validate the XML with a custom Python function, a UDF is the way to go.
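Putting the fix together, here is a minimal sketch of the whole flow. The names is_wellformed and split_by_wellformedness are made up for illustration, and treating null values as not well formed is an assumption you may want to change. The checker encodes the string to UTF-8 bytes first, because lxml raises ValueError when given a str that carries an XML encoding declaration.

```python
from lxml import etree


def is_wellformed(xml):
    """Return True if `xml` parses as well-formed XML, else False."""
    if xml is None:
        # Null column values reach the UDF as None.
        # Assumption: treat them as not well formed.
        return False
    try:
        # Parse bytes rather than str: lxml rejects a str that
        # contains an XML encoding declaration.
        etree.fromstring(xml.encode("utf-8"))
    except (etree.XMLSyntaxError, ValueError):
        return False
    return True


def split_by_wellformedness(df, column="MyColumn"):
    """Hypothetical helper: return (well-formed rows, malformed rows)."""
    # pyspark is imported here so that is_wellformed can be tried
    # on plain strings without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    check = F.udf(is_wellformed, BooleanType())
    # withColumn returns a new dataframe, so keep the result.
    flagged = df.withColumn("Wellformed", check(F.col(column)))
    good = flagged.filter(F.col("Wellformed"))
    bad = flagged.filter(~F.col("Wellformed"))
    return good, bad
```

With the dataframe from the question, good, bad = split_by_wellformedness(df, "MyColumn") would give you the malformed rows in bad, which corresponds to df2 in the original code.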