Pyspark: Create a set from a random item function

I'm new to PySpark and would like some pointers on generating a set of items based on random selections from a given list. The random choices need to be appended to a collection but must be unique, so in the plain-Python implementation I used a set, filled inside a while loop:

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

my_set = set()
while len(my_set) < n + 1:  # n being the number of items desired
    my_set.add(id_generator())

(Credit to https://stackoverflow.com/a/2257449/8840174 for the id_generator syntax)

What I'd like to do is take advantage of Spark's distributed compute and complete the above much more quickly.

Process-wise I'm thinking something like this needs to happen: hold the set on the driver node, and distribute the id_generator() work out to the available workers until there are n unique items in my set. It doesn't seem like there is an equivalent of random.choice in PySpark, so maybe I need to use the UDF decorator to register the function in PySpark?

pyspark.sql.functions.rand only draws from a uniform distribution on [0, 1), not a random choice from a list of items: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.rand.html
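Maybe rand() could still be adapted by mapping the float onto a position in the character list, something like this untested sketch using Spark SQL's substring/concat (the size of 6 and the column names here are just my assumptions):

import string
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

chars = string.ascii_uppercase + string.digits  # 36 candidate characters
size = 6

# One random character: rand() is uniform on [0, 1), so
# floor(rand() * 36) + 1 is a 1-based index into the chars literal.
one_char = f"substring('{chars}', cast(floor(rand() * {len(chars)}) + 1 as int), 1)"

# Concatenate six independent random characters into one id per row.
df = spark.range(10).withColumn(
    "sample", F.expr("concat(" + ", ".join([one_char] * size) + ")")
)
df.show(truncate=False)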

from pyspark.sql.functions import udf

@udf
def id_generator():
    import string
    import random

    def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
        return ''.join(random.choice(chars) for _ in range(size))

    return id_generator()

Something like the above UDF? Although I'm still not clear on how, or whether, sets work in Spark.

https://stackoverflow.com/a/61777594/8840174

The above is sort of the right idea, though I don't know that collecting the value from a single-row Spark DataFrame is a good idea for millions of iterations.

The code works fine in straight Python, but I'd like to cut it down from several hours if possible. (I need to generate several random columns based on various rules/lists of values to create a dataset from scratch.)

*I know that id_generator() with a size of 6 has some 2,176,782,336 combinations (http://mathcentral.uregina.ca/QQ/database/QQ.09.00/churilla1.html), so the chance of duplicates is not huge, but even without the set() requirement I'm still struggling with the best way to append random choices from a list to another list in PySpark.
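As a rough back-of-the-envelope check (a birthday-problem approximation, not an exact figure), the expected number of duplicates when drawing n ids from N = 36**6 possibilities is about n**2 / (2 * N):

# Birthday-problem approximation: expected duplicates when drawing n ids
# uniformly at random from N possible 6-character ids.
N = 36 ** 6       # 2,176,782,336 possible ids
n = 10 ** 6       # number of ids drawn

print(round(n ** 2 / (2 * N)))  # ~230 expected duplicates out of a million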

Thanks for any input!

Edit: This looks promising: Random numbers generation in PySpark

CodePudding user response:

Whether Spark is the best way to go really depends on your use case; that said, you could do this by applying a UDF of your function to a generated DataFrame and dropping duplicates. The drawback of this approach is that, because duplicates are dropped, it is harder to hit an exact number of data points.

Note 1: I've slightly adjusted your function to use random.choices.

Note 2: If running on multiple nodes, you might need to make sure each node uses a different seed for random.

import string
import random
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

SIZE = 10 ** 6

spark = SparkSession.builder.getOrCreate()

@udf(StringType())
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choices(chars, k=size))

df = spark.range(SIZE)

df = df.withColumn('sample', id_generator()).drop('id')

print(f'Count: {df.count()}')
print(f'Unique count: {df.dropDuplicates().count()}')

df.show(5)

Which gives:

Count: 1000000                                                                  
Unique count: 999783                                                            
+------+
|sample|
+------+
|QTOVIM|
|NEH0SY|
|DJW5Q3|
|WMEKRF|
|OQ09N9|
+------+
only showing top 5 rows
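
If you do need exactly n unique values, one possible workaround is to oversample slightly, deduplicate, and then take exactly n rows. This is only a rough sketch reusing the spark session and id_generator UDF from above, not something I've benchmarked:

# Rough sketch: oversample by a small margin, drop duplicates, then keep
# exactly n rows. With 36**6 possible ids, a 1% margin is far more than
# the handful of collisions you'd expect at this scale.
n = 10 ** 6
margin = 0.01

unique_ids = (
    spark.range(int(n * (1 + margin)))
         .withColumn('sample', id_generator())
         .drop('id')
         .dropDuplicates()
         .limit(n)
)

print(f'Unique count: {unique_ids.count()}')  # should be exactly n

In the very unlikely event that more rows collide than the margin covers, you would come up slightly short and could generate a small top-up batch in a second pass.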