I want to anonymize or replace almost all columns in a pyspark dataframe except a few ones.
I know its possible to do something like:
anonymized_df = employee_df.withColumn("EMPLOYEENUMBER", col("EMPLOYEENUMBER"))\
.withColumn("NAME1", lit(""))\
.withColumn("TELEPHONE", lit(""))\
.withColumn("ELECTRONICMAILADDRESS", lit(""))
However, doing this for all columns is a tedious process. I would rather want to do something along the lines of this:
anonymized_df = employee_df.withColumn("EMPLOYEENUMBER", col("EMPLOYEENUMBER"))\
.withcolumn("*", lit("")) # all other columns replace
This does however not seem to work. Is there other work arounds that achieve this?
I guess one solution would be to could create a list of column names and do something along the lines of:
col_list = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
for col in col_list:
employee_df= employee_df.withColumn(col, lit("")))
Other suggestions would be of much help.
CodePudding user response:
You can use select
. syntax-wise it won't be much different but it will only create 1 snapshot.
keep_cols = ['a', 'b', 'c']
empty_cols = ['d', 'e', 'f'] # or list(set(df.columns) - set(keep_cols))
df = df.select(*keep_cols, *[lit('').alias(x) for x in empty_cols])