I'm trying to convert a list into a dataframe in pyspark so that I can then join it onto a larger dataframe as a column. The data in the list are randomly generated names, like so:
from faker import Faker
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql.types import *
faker = Faker("en_GB")
list1 = [faker.first_name() for _ in range(0, 100)]
firstname = sc.parallelize([list1])
schema = StructType([
StructField('FirstName', StringType(), True)
])
df = spark.createDataFrame(firstname, schema)
display(df)
But I'm getting this error:
PythonException: 'ValueError: Length of object (100) does not match with length of fields (1)'.
Any ideas on what's causing this and how to fix it would be appreciated!
Many thanks,
Carolina
CodePudding user response:
You're getting a ValueError because you're passing parallelize a single-element list whose one element is a list of 100 names, instead of a list of 100 elements where each element is a list containing one name. If, for instance, Faker.first_name() returns 'John', then 'Henry', then 'Jade', and so on, your [list1] argument contains [['John', 'Henry', 'Jade', ...]]. When you pass such a list to the createDataFrame method, it tries to create a dataframe with one row having 100 columns. As your schema defines only one column, it fails.
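To make the mismatch concrete, here is a minimal sketch of the failing shape with three names instead of 100 (the names are illustrative):

rows = [['John', 'Henry', 'Jade']]  # one element: a list of three names
rdd = sc.parallelize(rows)          # an RDD with a single three-value record
# spark.createDataFrame(rdd, schema) sees 3 values in that one row but only
# 1 field in the schema, hence the length-mismatch ValueError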
The solution here is either to create the dataframe directly from list1, as in PApostol's answer, or to change how you build list1 so that you have a list of 100 lists containing one name each, instead of one list of 100 names:
from faker import Faker
from pyspark.sql.types import StructType, StructField, StringType

faker = Faker("en_GB")
# each element is now a one-name list, so every record matches the one-field schema
list1 = [[faker.first_name()] for _ in range(0, 100)]
firstname = sc.parallelize(list1)
schema = StructType([
    StructField('FirstName', StringType(), True)
])
df = spark.createDataFrame(firstname, schema)
display(df)
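Since the end goal is to join these names onto a larger dataframe as a column, one common pattern is to give both dataframes a consecutive row index and join on it. A minimal sketch, assuming a hypothetical larger dataframe big_df with exactly 100 rows (the name-to-row pairing is arbitrary, which is fine for random data):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# row_number over monotonically_increasing_id yields consecutive indices 1..n
w = Window.orderBy(F.monotonically_increasing_id())
names_idx = df.withColumn('row_idx', F.row_number().over(w))
big_idx = big_df.withColumn('row_idx', F.row_number().over(w))  # big_df is hypothetical

joined = big_idx.join(names_idx, on='row_idx').drop('row_idx')

Note that an unpartitioned window pulls all rows onto a single partition, which is harmless at 100 rows but worth knowing for larger data.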
CodePudding user response:
This is probably because pyspark tries to create a dataframe with 100 columns (the length of firstname) but you're only providing one column in your schema. Try without parallelize, wrapping each name in a tuple so each record matches the one-field schema:

list1 = [(faker.first_name(),) for _ in range(0, 100)]
df = spark.createDataFrame(list1, schema)
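Alternatively, a sketch that avoids the tuple-wrapping by passing an atomic type instead of a StructType (the toDF rename to 'FirstName' is an assumption about the desired column name):

from pyspark.sql.types import StringType

names = [faker.first_name() for _ in range(0, 100)]
df = spark.createDataFrame(names, StringType()).toDF('FirstName')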
or, if you do want to parallelize, distribute the flat list and wrap each name in a Row:

from pyspark.sql import Row
list1 = [faker.first_name() for _ in range(0, 100)]
firstname = sc.parallelize(list1)  # one name per record, not a single nested list
firstname_row = firstname.map(lambda name: Row(name))
df = spark.createDataFrame(firstname_row, schema)
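Either way, a quick sanity check (expected shapes only):

df.printSchema()  # expect a single string column, FirstName
df.count()        # expect 100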