Home > Software design >  How to extract a value that i don´t know from every row of a Pyspark dataframe
How to extract a value that i don´t know from every row of a Pyspark dataframe

Time:07-22

I have a dataframe like this:

item_A  item_B item_C
  x       z      y
  z       x      y
  y       x      z
  z       y      x 

where all values are a string and I only know the value of x and y but i need to get the value of z. The problem is that z is not always in the same column. I want to add a column only with the value of z. I tried concatenating the columns and extract the others strings that I know but I don't know how to keep z (with a main_pattern = r'x|y' ?)

Here is what I tried but isn't working

pattern_full = r'(('  main_pattern '),)'
df = df.withColumn("vale_z", regexp_extract("columns_concatenated", pattern_full, 1)

CodePudding user response:

We can create an array with all 3 columns and use array_except() to remove the known values which leaves us with the "Z".

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3'])

#  ---- ---- ---- 
# |col1|col2|col3|
#  ---- ---- ---- 
# |   x|   z|   y|
# |   z|   y|   x|
# |   x|   y|   y|
# |   x|   z|   z|
#  ---- ---- ---- 

data_sdf. \
    withColumn('col_array', func.array(func.col('col1'), func.col('col2'), func.col('col3'))). \
    withColumn('z_val_arr', func.array_except('col_array', func.array(func.lit('x'), func.lit('y')))). \
    withColumn('z_val', func.col('z_val_arr')[0]). \
    show(truncate=False)

#  ---- ---- ---- --------- --------- ----- 
# |col1|col2|col3|col_array|z_val_arr|z_val|
#  ---- ---- ---- --------- --------- ----- 
# |x   |z   |y   |[x, z, y]|[z]      |z    |
# |z   |y   |x   |[z, y, x]|[z]      |z    |
# |x   |y   |y   |[x, y, y]|[]       |null |
# |x   |z   |z   |[x, z, z]|[z]      |z    |
#  ---- ---- ---- --------- --------- ----- 

CodePudding user response:

d1 = [['x', 'z', 'y'], ['z', 'x', 'y'], ['y', 'x', 'z'], ['z', 'y', 'x'], ['u', 'v', 'w']]
df1 = spark.createDataFrame(d1, ['item_A', 'item_B', 'item_C'])

df1.withColumn('columns_concatenated', concat(*df1.columns))\
    .withColumn('find_z', regexp_extract(col('columns_concatenated'), '(z)', 1))\
    .show(10, False)
 ------ ------ ------ -------------------- ------ 
|item_A|item_B|item_C|columns_concatenated|find_z|
 ------ ------ ------ -------------------- ------ 
|x     |z     |y     |xzy                 |z     |
|z     |x     |y     |zxy                 |z     |
|y     |x     |z     |yxz                 |z     |
|z     |y     |x     |zyx                 |z     |
|u     |v     |w     |uvw                 |      |
 ------ ------ ------ -------------------- ------ 
  • Related