Get the first two words from a struct in PySpark data frame


data=[(("James","Bond"),["Java","C#"],{'hair':'black','eye':'brown'}),
      (("Ann","Varsa"),[".NET","Python"],{'hair':'brown','eye':'black'}),
      (("Tom Cruise",""),["Python","Scala"],{'hair':'red','eye':'grey'}),
      (("Tom Brand",None),["Perl","Ruby"],{'hair':'black','eye':'blue'})]
schema = ['n','ln','p']
df = spark.createDataFrame(data,schema=schema)

+-----------------+---------------+--------------------+
|                n|             ln|                   p|
+-----------------+---------------+--------------------+
|    {James, Bond}|     [Java, C#]|{eye -> brown, ha...|
|     {Ann, Varsa}| [.NET, Python]|{eye -> black, ha...|
|   {Tom Cruise, }|[Python, Scala]|{eye -> grey, hai...|
|{Tom Brand, null}|   [Perl, Ruby]|{eye -> blue, hai...|
+-----------------+---------------+--------------------+

name = df.select('n')

I tried the filter method to get the first and last names into separate columns, but it didn't work.

The desired output:

first|last
-----|------
James|Bond
Tom  |Cruise
Tom  |Brand

CodePudding user response:

The column n is of struct data type. You can convert the struct column to an array, then join all of the array's elements into one string using a space " " as the delimiter. Splitting that string on the same delimiter then gives you the first and second words.

from pyspark.sql import functions as F

# Expand the struct's fields into an array, then join them with a space
col_joined = F.array_join(F.array("n.*"), " ")
df = df.select(
    F.split(col_joined, " ")[0].alias("first"),
    F.split(col_joined, " ")[1].alias("last"),
)
df.show()
# +-----+------+
# |first|  last|
# +-----+------+
# |James|  Bond|
# |  Ann| Varsa|
# |  Tom|Cruise|
# |  Tom| Brand|
# +-----+------+