Randomly select x (x is fixed) values in a column and replace them with 0 in a PySpark dataframe


I want to randomly replace the values of 2 customers in the table with 0. The table can have any number of rows; its primary key is cust.

How do I do this in PySpark?

SAMPLE:

Input

| cust | value |
|------|-------|
| c1   | 2     |
| c2   | 4     |
| c4   | 6     |
| c3   | 8     |
| c5   | 10    |
| c6   | 12    |

Output

| cust | value |
|------|-------|
| c1   | 2     |
| c2   | 0     |
| c4   | 6     |
| c3   | 8     |
| c5   | 0     |
| c6   | 12    |

CodePudding user response:

You can use a trick: add an index column with row_number, randomly pick N index values, and set the value to 0 for the matching rows.

import random
import pyspark.sql.functions as F
from pyspark.sql.window import Window


def replace_n_values_with_zeroes(df, n=2, col_name='value', seed=42):
  
  # retrieve number of rows
  n_rows = df.count()
  
  # take n random values from idx
  random.seed(seed)
  idx_rows = random.sample(list(range(1, n_rows + 1)), n)

  # define an index column; ordering the window by a constant gives an
  # arbitrary row order, and Spark will warn that all data moves to one partition
  w = Window.orderBy(F.lit(1))
  df = df.withColumn('idx', F.row_number().over(w))
  
  # replace some values with zeroes
  df = df.withColumn(col_name, F.when(F.col('idx').isin(idx_rows), 0)
                                .otherwise(F.col(col_name)))
  df = df.drop('idx')
  
  return df

Example

df = spark.createDataFrame([
  ["c1", 2],
  ["c2", 4],
  ["c4", 6],
  ["c3", 8],
  ["c5", 10],
  ["c6", 12]
], ["cust", "value"])

replace_n_values_with_zeroes(df).show()

+----+-----+
|cust|value|
+----+-----+
|  c1|    0|
|  c2|    4|
|  c4|    6|
|  c3|    8|
|  c5|   10|
|  c6|    0|
+----+-----+
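
If you'd rather avoid the global row_number window, a similar effect can be achieved by ordering on F.rand() with a fixed seed and collecting the first n primary keys. This is a minimal sketch, not the answerer's method; the function name and the key_col parameter are just for illustration:

import pyspark.sql.functions as F


def replace_n_values_with_zeroes_alt(df, n=2, col_name='value', key_col='cust', seed=42):

    # pick n random primary-key values by ordering on rand() with a fixed seed
    random_keys = [row[key_col]
                   for row in df.orderBy(F.rand(seed)).limit(n).select(key_col).collect()]

    # zero out the value column for the selected keys
    return df.withColumn(col_name,
                         F.when(F.col(key_col).isin(random_keys), F.lit(0))
                          .otherwise(F.col(col_name)))

Calling replace_n_values_with_zeroes_alt(df).show() on the same sample dataframe should likewise zero out the value column for 2 randomly chosen cust keys (which keys depend on the seed).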