How to convert a column from string to array in PySpark


I have a dataframe converted from an inherited dataset which looks like the following:

data = [("[]","2000","M",False),
            ("[{'username':'aabb','points':'200','active':'true'}, {'username':'bbaa22','points':'0','active':'false'}]","1999","F",True),
            ("[{'username':'topuser','points':'855','active':'false'}]","1974","M",True),
            ("[]","2005","F",False),
            ("[{'username':'myprimary','points':'1050','active':'true'}, {'username':'mylurk','points':'100','active':'true'}, {'username':'closedlurk','points':'50','active':'false'}]","1992","M",True)
            ]

columns=["user","dob_year","gender","member"]
df=spark.createDataFrame(data,columns)
+--------------------+--------+------+------+
|                user|dob_year|gender|member|
+--------------------+--------+------+------+
|                  []|    2000|     M| false|
|[{'username':'aab...|    1999|     F|  true|
|[{'username':'top...|    1974|     M|  true|
|                  []|    2005|     F| false|
|[{'username':'myp...|    1992|     M|  true|
+--------------------+--------+------+------+

I need to extract some of the elements from the user column, so I attempted to use the PySpark explode function.

from pyspark.sql.functions import explode

df2 = df.select(explode(df.user), df.dob_year)

When I attempt this, I'm met with the following error:

AnalysisException: cannot resolve 'explode(user)' due to data type mismatch: input to function explode should be array or map type, not string;

When I run df.printSchema(), I see that the user column is a string rather than the array type I need.
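
For reference, printSchema() reports something like this:

df.printSchema()
# root
#  |-- user: string (nullable = true)
#  |-- dob_year: string (nullable = true)
#  |-- gender: string (nullable = true)
#  |-- member: boolean (nullable = true)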

I also attempted to cast the strings in the column to arrays by creating a UDF:

import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

df2 = (df.withColumn("user", str2list_udf(f.col("user")))
         .withColumn("user", df.user.cast(ArrayType(StringType()))))

When I do that, I'm met with the following error:

AnalysisException: cannot resolve 'user' due to data type mismatch: cannot cast string to array;

How can the data in this column be cast or converted into an array so that explode can be used and the individual keys parsed out into their own columns (for example, separate columns for username, points, and active)? If explode isn't the best method, is there another route I should follow?

If it's helpful, here is the UDF that I created:

from pyspark.sql.functions import udf

def str2list(x):
    if x == '[]':
        return list()
    else:
        return list(x)

str2list_udf = udf(lambda x: str2list(x))

Thank you

CodePudding user response:

Convert the stringified arrays into arrays of structs using from_json, then explode and flatten the result with inline:

from pyspark.sql import functions as F
from pyspark.sql.types import StructField, ArrayType, StringType, StructType

# Schema of the JSON strings stored in the user column: an array of structs
user_schema = ArrayType(
    StructType([
        StructField("username", StringType(), True),
        StructField("points", StringType(), True),
        StructField("active", StringType(), True)
    ])
)

# Parse the strings into arrays of structs, then inline() explodes the array
# and flattens each struct's fields into top-level columns
df1 = (df.withColumn("user", F.from_json("user", user_schema))
       .selectExpr("inline(user)", "dob_year", "gender", "member")
       )

df1.show()
# +----------+------+------+--------+------+------+
# |  username|points|active|dob_year|gender|member|
# +----------+------+------+--------+------+------+
# |      aabb|   200|  true|    1999|     F|  true|
# |    bbaa22|     0| false|    1999|     F|  true|
# |   topuser|   855| false|    1974|     M|  true|
# | myprimary|  1050|  true|    1992|     M|  true|
# |    mylurk|   100|  true|    1992|     M|  true|
# |closedlurk|    50| false|    1992|     M|  true|
# +----------+------+------+--------+------+------+
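
If you'd rather use explode, or want to keep the rows whose user array is empty (inline above drops them), something along these lines should also work. This is just a sketch reusing the same user_schema; explode_outer emits a single null entry for an empty array, and the cast on points is optional:

df2 = (df.withColumn("user", F.from_json("user", user_schema))
         .withColumn("u", F.explode_outer("user"))
         .select(
             F.col("u.username").alias("username"),
             F.col("u.points").cast("int").alias("points"),  # optional: points as int
             F.col("u.active").alias("active"),
             "dob_year", "gender", "member",
         ))

The two rows with "[]" then appear once each with null username/points/active instead of disappearing from the result.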