Adding a column of fake data to a dataframe in pyspark: Unsupported literal type class

Time:11-25

I'm trying to add an extra column of fake data to my dataset. Take this one as an example (it doesn't matter what the dataframe is; I just need a new column with unique fake names, and this is a dummy to play with):

from faker import Faker

faker = Faker("en_GB")

profiles = [faker.profile() for i in range(0, 100)]
profiles = spark.createDataFrame(profiles)

And I'm trying to add a new column of first names with one name per row. At the moment, I'm doing this (I know this won't do what I want it to but I can't figure out what else to do):

profiles = profiles.withColumn('first_name', lit([faker.first_name()] for _ in 'name'))

However, I keep getting this error:

java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [[Robin], [Margaret], [Robin], [Victor]]

I'd like to keep it to one line as that's what I need for further analyses.

I think I understand why I'm getting the error but I'm not sure what to do about it... Any ideas appreciated!

CodePudding user response:

Try something like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from faker import Faker

faker = Faker("en_GB")

spark = SparkSession.builder.getOrCreate()
profiles = [faker.profile() for i in range(0, 100)]
profiles = spark.createDataFrame(profiles)
# Pre-generate one name per row on the driver, then index into the list.
fake_names = [faker.first_name() for _ in range(profiles.count())]
# Caveat: monotonically_increasing_id() only yields 0..n-1 when the
# dataframe has a single partition; for multiple partitions, use the
# row_number() variant below.
profiles = profiles.withColumn(
    "first_name", F.udf(lambda x: fake_names[x])(F.monotonically_increasing_id())
)

The fake names have to be generated outside the dataframe. If you use:

profiles.withColumn("first_name", F.lit(faker.first_name()))

you'll get the same fake name on every row, because faker.first_name() is called once on the driver and that single value is wrapped in a literal.

Edit:

row_number example:

from pyspark.sql.window import Window

fake_names = [faker.first_name() for _ in range(profiles.count())]
window = Window.orderBy("name")  # or any other unique column; name appears unique here
profiles = profiles.withColumn(
    # row_number() starts at 1, hence the x - 1
    "first_name", F.udf(lambda x: fake_names[x - 1])(F.row_number().over(window))
)

CodePudding user response:

Is this what you want?

from faker import Faker

faker = Faker("en_GB")

profiles = [[faker.profile(), faker.first_name()] for i in range(0, 100)]
profiles = spark.createDataFrame(profiles, ["profile", "first_name"])

profiles.show()