I want to randomly set the value of 2 customers in the table to 0. The table can have any number of rows, and the primary key of the table is cust.
How do I do this in PySpark?
SAMPLE:
Input
| cust | value |
|------|-------|
| c1   | 2     |
| c2   | 4     |
| c4   | 6     |
| c3   | 8     |
| c5   | 10    |
| c6   | 12    |
Output
| cust | value |
|------|-------|
| c1   | 2     |
| c2   | 0     |
| c4   | 6     |
| c3   | 8     |
| c5   | 0     |
| c6   | 12    |
CodePudding user response:
You can use a trick: define an index column with row_number(), then replace the values in N randomly chosen rows based on this new column.
import random

import pyspark.sql.functions as F
from pyspark.sql.window import Window

def replace_n_values_with_zeroes(df, n=2, col_name='value', seed=42):
    # retrieve the number of rows
    n_rows = df.count()
    # draw n distinct row indices at random (reproducible via the seed)
    random.seed(seed)
    idx_rows = random.sample(list(range(1, n_rows + 1)), n)
    # define an index column; a window with no partitionBy moves all rows
    # to a single partition (Spark logs a warning, fine for small tables)
    w = Window.orderBy(F.lit(1))
    df = df.withColumn('idx', F.row_number().over(w))
    # replace the values of the sampled rows with zeroes
    df = df.withColumn(col_name, F.when(F.col('idx').isin(idx_rows), 0)
                                  .otherwise(F.col(col_name)))
    df = df.drop('idx')
    return df
Example
df = spark.createDataFrame([
    ["c1", 2],
    ["c2", 4],
    ["c4", 6],
    ["c3", 8],
    ["c5", 10],
    ["c6", 12]
], ["cust", "value"])
replace_n_values_with_zeroes(df).show()
+----+-----+
|cust|value|
+----+-----+
|  c1|    0|
|  c2|    4|
|  c4|    6|
|  c3|    8|
|  c5|   10|
|  c6|    0|
+----+-----+
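A variant worth considering (a sketch of my own, not part of the answer above): since the table has a primary key, you can sample n keys directly with a seeded rand() ordering instead of building a global row_number index, which avoids funneling every row through a single partition. The function name replace_n_values_with_zeroes_v2 is hypothetical; the calls themselves are standard pyspark.sql.functions.

import pyspark.sql.functions as F

def replace_n_values_with_zeroes_v2(df, n=2, key_col='cust',
                                    col_name='value', seed=42):
    # order by a seeded random column and keep the first n primary keys;
    # no global window, so the table never collapses into one partition
    sampled_keys = [row[key_col]
                    for row in df.select(key_col)
                                 .orderBy(F.rand(seed))
                                 .limit(n)
                                 .collect()]
    # zero out the value for the sampled keys, leave all other rows untouched
    return df.withColumn(
        col_name,
        F.when(F.col(key_col).isin(sampled_keys), 0)
         .otherwise(F.col(col_name))
    )

One caveat: F.rand(seed) is only reproducible for a fixed partitioning of the input, so treat the seed here as best-effort rather than a hard guarantee.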