I'm trying to use the faker package to generate fake dates of birth in pyspark.
My code is as below:
from faker import *
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import udf
from datetime import *
fake = Faker("en_GB")
fake.seed_locale("en_GB", 0)
df = spark.createDataFrame([
    Row(BIRTH_DT = datetime(2000, 1, 1, 12, 0)),
    Row(BIRTH_DT = datetime(2000, 2, 1, 12, 0)),
    Row(BIRTH_DT = datetime(2000, 3, 1, 12, 0))
])
class anonymise:
    def BIRTH_DT():
        def BirthDt_values():
            return fake.date_of_birth(datetime.tzinfo == None)
        BirthDt_udf = udf(BirthDt_values, TimestampType())
        return BirthDt_udf()
df = df \
    .withColumn("BIRTH_DT", anonymise.BIRTH_DT())
df.display()
However, I'm getting this error:
PythonException: 'TypeError: tzinfo argument must be None or of a tzinfo subclass, not type 'bool''
I don't understand why it thinks my parameter value is a boolean. I must be passing it incorrectly, but I can't figure out what should be done. Any help would be appreciated!
Thanks,
Carolina
Answer:
Solved! The UDF's return type should be DateType(), not TimestampType(), because it's a date of birth and not a timestamp.
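For anyone still puzzled by the 'bool' in the error message, here is a minimal sketch using only the standard library. It shows that the expression `datetime.tzinfo == None` is a comparison, so it evaluates to a bool before it is ever passed anywhere. It assumes (I have not traced this in the Faker source) that `fake.date_of_birth` takes `tzinfo` as its first positional parameter and hands it on to something like `datetime.now`, which would explain the message. The names `tz_arg` and `error_message` are made up for illustration.

```python
from datetime import datetime

# After `from datetime import *`, the name `datetime` is the class itself, so
# `datetime.tzinfo` is a class attribute. Comparing it with None is an ordinary
# comparison and produces a bool, not a keyword argument.
tz_arg = (datetime.tzinfo == None)  # written as in the question on purpose
print(type(tz_arg).__name__, tz_arg)

# Passing that bool where a tzinfo object is expected raises a TypeError.
# Assumption: Faker does something equivalent with its `tzinfo` argument.
try:
    datetime.now(tz_arg)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

If that reading is right, calling `fake.date_of_birth()` with no argument, or with the keyword form `tzinfo=None`, should avoid the TypeError, and `DateType()` is then the matching return type for the UDF.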