data = [(("James", "Bond"), ["Java", "C#"], {'hair': 'black', 'eye': 'brown'}),
        (("Ann", "Varsa"), [".NET", "Python"], {'hair': 'brown', 'eye': 'black'}),
        (("Tom Cruise", ""), ["Python", "Scala"], {'hair': 'red', 'eye': 'grey'}),
        (("Tom Brand", None), ["Perl", "Ruby"], {'hair': 'black', 'eye': 'blue'})]
schema = ['n', 'ln', 'p']
df = spark.createDataFrame(data, schema=schema)
+-----------------+---------------+--------------------+
|                n|             ln|                   p|
+-----------------+---------------+--------------------+
|    {James, Bond}|     [Java, C#]|{eye -> brown, ha...|
|     {Ann, Varsa}| [.NET, Python]|{eye -> black, ha...|
|   {Tom Cruise, }|[Python, Scala]|{eye -> grey, hai...|
|{Tom Brand, null}|    [Perl, Ruby]|{eye -> blue, hai...|
+-----------------+---------------+--------------------+
name = df.select('n')
I tried the filter method to get the first and last names into separate columns, but it didn't work.
The desired output:
first|last
-----+------
James|Bond
Tom  |Cruise
Tom  |Brand
CodePudding user response:
The column n is of struct data type. You can convert the struct column to an array, then join the array's elements using a space " " as the delimiter. Then you can take the first and second words after you split this column using the same delimiter.
from pyspark.sql import functions as F

# Expand the struct's fields into an array and join them with a space.
# array_join skips null elements, so {Tom Brand, null} becomes "Tom Brand".
col_joined = F.array_join(F.array("n.*"), " ")

# Split the joined string on the same delimiter and keep the first two words.
df = df.select(
    F.split(col_joined, " ")[0].alias("first"),
    F.split(col_joined, " ")[1].alias("last"),
)
df.show()
# ----- ------
# |first| last|
# ----- ------
# |James| Bond|
# | Ann| Varsa|
# | Tom|Cruise|
# | Tom| Brand|
# ----- ------
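If you want to sanity-check the join-then-split logic without a Spark session, here is a plain-Python sketch that mimics what array_join (null elements skipped) and split do to each name struct. The helper name first_last is mine for illustration, not part of any Spark API.

```python
def first_last(name_struct):
    # Mimic F.array_join(F.array("n.*"), " "): drop None fields,
    # join the remaining fields with a single space.
    joined = " ".join(part for part in name_struct if part is not None)
    # Mimic F.split(col, " ")[0] and [1]: split on the space delimiter.
    words = joined.split(" ")
    first = words[0] if len(words) > 0 else None
    last = words[1] if len(words) > 1 else None
    return first, last

rows = [("James", "Bond"), ("Ann", "Varsa"), ("Tom Cruise", ""), ("Tom Brand", None)]
for r in rows:
    print(first_last(r))
```

Note how ("Tom Cruise", "") still splits correctly: the joined string "Tom Cruise " splits into ["Tom", "Cruise", ""], so the first two words are "Tom" and "Cruise", matching the desired output.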