How to replace characters which are not alphanumeric in pyspark sql?

Time:02-15

This is my code:

%spark.pyspark

jdbc_write(spark, spark.sql("""
    SELECT 
        Global_Order_Number__c 
        , Infozeile__c
    FROM STAG.SF_CASE_TRANS
"""), JDBC_URLS['xyz_tera_utf8'], "DEV_STAG.SF_CASE", "abc", "1234")

I want to remove every character in the Infozeile__c field that is not a-z, A-Z, or 0-9.

Is there any function that can do this?

CodePudding user response:

Apply regexp_replace() to the column in your query:

regexp_replace(Infozeile__c, '[^a-zA-Z0-9]', '') as Infozeile__c

The regex [^a-zA-Z0-9] is a negated character class, meaning any character not in the ranges given. The replacement is a blank, effectively deleting the matched character.
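The behaviour of that pattern is easy to verify outside Spark, since the same negated character class works in Python's re module (the sample string below is made up for illustration):

```python
import re

# Same negated character class as in the Spark SQL expression:
# every character outside a-z, A-Z, 0-9 is deleted.
pattern = re.compile(r'[^a-zA-Z0-9]')

value = 'Order #123 / Müller-Straße'
cleaned = pattern.sub('', value)
print(cleaned)  # Order123MllerStrae
```

Note that non-ASCII letters such as ü and ß are also stripped, because the class only allows the ASCII ranges a-z, A-Z, and 0-9.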

If you're expecting lots of characters to be replaced like this, it would be a bit more efficient to add a +, which means "one or more", so whole runs of undesirable characters are removed at a time.

regexp_replace(Infozeile__c, '[^a-zA-Z0-9]+', '')
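You can see the difference with Python's re.subn, which returns the replacement count alongside the result (the input string is made up for illustration); both patterns produce the same output, but the + version performs one substitution per run instead of one per character:

```python
import re

value = 'foo!!!bar???baz'

# Per-character replacement: one substitution per offending character.
single = re.subn(r'[^a-zA-Z0-9]', '', value)
# Greedy replacement: each run of offending characters is handled in one go.
grouped = re.subn(r'[^a-zA-Z0-9]+', '', value)

print(single)   # ('foobarbaz', 6)
print(grouped)  # ('foobarbaz', 2)
```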