I have a dataframe like this:
item_A item_B item_C
x z y
z x y
y x z
z y x
where all values are a string and I only know the value of x
and y
but i need to get the value of z
. The problem is that z
is not always in the same column. I want to add a column only with the value of z
. I tried concatenating the columns and extract the others strings that I know but I don't know how to keep z
(with a main_pattern = r'x|y'
?)
Here is what I tried but isn't working
pattern_full = r'((' main_pattern '),)'
df = df.withColumn("vale_z", regexp_extract("columns_concatenated", pattern_full, 1)
CodePudding user response:
We can create an array with all 3 columns and use array_except()
to remove the known values which leaves us with the "Z"
.
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3'])
# ---- ---- ----
# |col1|col2|col3|
# ---- ---- ----
# | x| z| y|
# | z| y| x|
# | x| y| y|
# | x| z| z|
# ---- ---- ----
data_sdf. \
withColumn('col_array', func.array(func.col('col1'), func.col('col2'), func.col('col3'))). \
withColumn('z_val_arr', func.array_except('col_array', func.array(func.lit('x'), func.lit('y')))). \
withColumn('z_val', func.col('z_val_arr')[0]). \
show(truncate=False)
# ---- ---- ---- --------- --------- -----
# |col1|col2|col3|col_array|z_val_arr|z_val|
# ---- ---- ---- --------- --------- -----
# |x |z |y |[x, z, y]|[z] |z |
# |z |y |x |[z, y, x]|[z] |z |
# |x |y |y |[x, y, y]|[] |null |
# |x |z |z |[x, z, z]|[z] |z |
# ---- ---- ---- --------- --------- -----
CodePudding user response:
d1 = [['x', 'z', 'y'], ['z', 'x', 'y'], ['y', 'x', 'z'], ['z', 'y', 'x'], ['u', 'v', 'w']]
df1 = spark.createDataFrame(d1, ['item_A', 'item_B', 'item_C'])
df1.withColumn('columns_concatenated', concat(*df1.columns))\
.withColumn('find_z', regexp_extract(col('columns_concatenated'), '(z)', 1))\
.show(10, False)
------ ------ ------ -------------------- ------
|item_A|item_B|item_C|columns_concatenated|find_z|
------ ------ ------ -------------------- ------
|x |z |y |xzy |z |
|z |x |y |zxy |z |
|y |x |z |yxz |z |
|z |y |x |zyx |z |
|u |v |w |uvw | |
------ ------ ------ -------------------- ------