I have a dataframe called 'df' structured as follows:

| ID  | name  | lv1   | lv2   |
|-----|-------|-------|-------|
| abb | name1 | 40.34 | 21.56 |
| bab | name2 | 21.30 | 67.45 |
| bba | name3 | 32.45 | 45.44 |
In Pandas, I can use the following code to create a new column containing a list of the lv1 and lv2 values:
cols = ['lv1', 'lv2']
df['new_col'] = df[cols].values.tolist()
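For completeness, a minimal self-contained version of the Pandas step (the DataFrame construction is just a reconstruction of the sample table above):

import pandas as pd

# Reconstruction of the sample data from the table above (assumed).
df = pd.DataFrame({
    'ID': ['abb', 'bab', 'bba'],
    'name': ['name1', 'name2', 'name3'],
    'lv1': [40.34, 21.30, 32.45],
    'lv2': [21.56, 67.45, 45.44],
})

cols = ['lv1', 'lv2']
df['new_col'] = df[cols].values.tolist()
# new_col now holds, e.g., [40.34, 21.56] for the first row.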
Due to memory issues caused by the size of the data, I am now using Databricks instead (which I have never used before) and need to replicate the above. I've successfully created a Spark dataframe by mounting the location of my data and then loading it:
file_location = 'dbfs:/mnt/<mountname>/filename.csv'
file_type = "csv"
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","
df = (spark.read.format(file_type)
      .option("inferSchema", infer_schema)
      .option("header", first_row_is_header)
      .option("sep", delimiter)
      .load(file_location))
display(df)
This loads the data; however, I'm stuck on how to complete the necessary next step. I've found a function called struct in Spark, but I can't seem to find the corresponding function in PySpark. Any suggestions?
CodePudding user response:
It's probably the array function that you're looking for:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('abb', 'name1', 40.34, 21.56),
('bab', 'name2', 21.30, 67.45),
('bba', 'name3', 32.45, 45.44)],
['ID', 'name', 'lv1', 'lv2'])
df = df.withColumn('new_col', F.array('lv1', 'lv2'))
df.show()
# +---+-----+-----+-----+--------------+
# | ID| name|  lv1|  lv2|       new_col|
# +---+-----+-----+-----+--------------+
# |abb|name1|40.34|21.56|[40.34, 21.56]|
# |bab|name2| 21.3|67.45| [21.3, 67.45]|
# |bba|name3|32.45|45.44|[32.45, 45.44]|
# +---+-----+-----+-----+--------------+
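One caveat for the CSV loaded in the question (a sketch, assuming the read options shown there): with inferSchema set to "false", lv1 and lv2 are read as strings, so you may want to cast them to doubles before, or while, building the array.

from pyspark.sql import functions as F

# df is assumed to be the DataFrame read from the mounted CSV above.
# Cast the string columns to double, then combine them into an array column.
df = df.withColumn(
    'new_col',
    F.array(F.col('lv1').cast('double'), F.col('lv2').cast('double'))
)

display(df)

For what it's worth, pyspark.sql.functions also has a struct function, but array is the closer match to a Pandas-style list of values.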