Create a new column in Spark dataframe that is a list of other column values


I have a dataframe called 'df', structured as follows:

ID   name   lv1    lv2
abb  name1  40.34  21.56
bab  name2  21.30  67.45
bba  name3  32.45  45.44

In Pandas, I can use the following code to create a new column containing a list of the lv1 and lv2 values:

cols = ['lv1', 'lv2']
df['new_col'] = df[cols].values.tolist()

Due to memory issues with the size of the data, I am now using Databricks instead (which I have never used before) and need to replicate the above. I've successfully created a Spark dataframe by mounting the location of my data and then loading it:

file_location = 'dbfs:/mnt/<mountname>/filename.csv'
file_type = "csv"
   
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# Wrap the chained calls in parentheses so the multi-line expression parses
df = (spark.read.format(file_type)
      .option("inferSchema", infer_schema)
      .option("header", first_row_is_header)
      .option("sep", delimiter)
      .load(file_location))

display(df)

This loads the data; however, I'm stuck on how to complete the next step. I've found a function called struct in Spark, but I can't seem to find the corresponding function in PySpark. Any suggestions?

CodePudding user response:

It's probably the array function that you're looking for.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('abb', 'name1', 40.34, 21.56),
     ('bab', 'name2', 21.30, 67.45),
     ('bba', 'name3', 32.45, 45.44)],
    ['ID', 'name', 'lv1', 'lv2'])

# array() collects the lv1 and lv2 values into a single ArrayType column
df = df.withColumn('new_col', F.array('lv1', 'lv2'))

df.show()
# +---+-----+-----+-----+--------------+
# | ID| name|  lv1|  lv2|       new_col|
# +---+-----+-----+-----+--------------+
# |abb|name1|40.34|21.56|[40.34, 21.56]|
# |bab|name2| 21.3|67.45| [21.3, 67.45]|
# |bba|name3|32.45|45.44|[32.45, 45.44]|
# +---+-----+-----+-----+--------------+
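
One thing worth flagging: since the CSV above was read with inferSchema set to "false", lv1 and lv2 will be string columns in the real dataframe, so array would produce an array of strings. A minimal sketch of casting them first, assuming they are meant to be doubles:

from pyspark.sql import functions as F

# With inferSchema="false" every CSV column loads as a string, so cast the
# numeric columns to double before collecting them into the array
df = df.withColumn(
    'new_col',
    F.array(F.col('lv1').cast('double'), F.col('lv2').cast('double'))
)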
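
As for struct: it does exist in PySpark, as pyspark.sql.functions.struct. The difference is that it builds a StructType column (a row of named fields) rather than an ArrayType column. A quick sketch, using a hypothetical column name 'new_struct':

from pyspark.sql import functions as F

# struct() packs the columns into a single StructType column with named
# fields lv1 and lv2, instead of a positional array
df = df.withColumn('new_struct', F.struct('lv1', 'lv2'))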