Efficiently clean HTML entities from a complex JSON file in PySpark

Time:10-26

I am analysing some JSON data in Palantir Foundry using PySpark. The source is a 30MB uploaded JSON file containing four elements, one of which holds a table of some 60 columns and 20,000 rows. Some of the columns in this table are strings that contain HTML entities representing Unicode characters (the other columns are numeric or boolean). I want to clean these strings by replacing the HTML entities with the corresponding characters.

I realise that I can apply html.unescape(my_str) in a UDF to the string columns once all the JSON data has been converted into dataframes. However, this sounds inefficient.
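To make the cleaning step concrete, here is a minimal sketch of what html.unescape does when applied to every string in a parsed JSON structure, using only the standard library. It is an illustration rather than a Foundry-specific solution: the helper name unescape_strings and the sample element and column names ("table", "name", "count", "open") are invented for the example, and it assumes the parsed file fits in memory on a single machine. The function walks the parsed structure rather than the raw file text, because unescaping an entity such as &quot; in the raw text would insert a bare quote character and could break the JSON.

```python
import html
import json


def unescape_strings(value):
    """Recursively replace HTML entities in every string of a parsed JSON value."""
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, list):
        return [unescape_strings(item) for item in value]
    if isinstance(value, dict):
        return {key: unescape_strings(item) for key, item in value.items()}
    # Numbers, booleans and None pass through unchanged.
    return value


# Hypothetical miniature of the uploaded file; the names below are invented.
raw = '{"table": [{"name": "Caf&eacute; &amp; Bar", "count": 3, "open": true}]}'
cleaned = unescape_strings(json.loads(raw))
print(cleaned["table"][0]["name"])  # Café & Bar
```

The same html.unescape call is what a UDF would apply to each value of a string column; the difference in this sketch is only that the cleaning happens once, on the parsed JSON, before any dataframe is built.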
