I am very new to PySpark. I have a DataFrame with two columns, and each column holds an array of strings. How can I connect each element of the array in the first column to the value at the same position in the array in the other column?
If I convert the DataFrame to a pandas DataFrame in Databricks, the code below works, but it does not keep the arrays in the correct format.
import numpy as np

for item in list_x:
    # Start every row with a placeholder for this material part
    df_head[item] = "x"
    value = df_head['materialText'].values
    headName = df_head['materialTextPart'].values
    value_list = []
    for k in range(len(df_head)):
        # Skip rows where materialText is missing (NaN is a float)
        if isinstance(value[k], float):
            continue
        value_array = value[k][0:].split(',')
        headName_array = headName[k][1:-2].split(',')
        for m in range(len(headName_array)):
            if (headName_array[m] == item) or (headName_array[m] == ' ' + item) or (headName_array[m] == ' ' + item.replace('s', '')):
                columnName = item
                columnValue = df_head.loc[k, columnName]
                if columnValue == 'x':
                    df_head.loc[k, columnName] = value_array[m]
                else:
                    # Append further matches for the same part, comma-separated
                    df_head.loc[k, columnName] = df_head.loc[k, columnName] + ',' + value_array[m]
    # Rows that never matched keep the placeholder; turn it into NaN
    df_head[item] = df_head[item].replace('x', np.nan)
Example of columns:
materialTextPart | materialText |
---|---|
["Fabric:", "Wall bracket:", "Top rail/ Bottom rail:"] | ["100 % polyester (100% recycled), PET plastic", "Steel, Polycarbonate/ABS plastic, Powder coating", "Aluminium, Powder coating"] |
["Ticking:", "Filling:", "Ticking, underside:", "Comfort filling:", "Ticking:"] | ["100 % polyester (100% recycled)", "100 % polyester", "100% polypropylene", "Polyurethane foam 28 kg/cu.m.", "100% polyester"] |
CodePudding user response:
As I mentioned in my comment, you can use zip_with to combine the two arrays element by element:
from pyspark.sql.functions import concat, zip_with
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

df = spark.createDataFrame(
    data=[
        (
            ["Fabric:", "Wall bracket:", "Top rail/ Bottom rail:"],
            ["100 % polyester (100% recycled), PET plastic", "Steel, Polycarbonate/ABS plastic, Powder coating", "Aluminium, Powder coating"],
        )
    ],
    schema=StructType([
        StructField("xs", ArrayType(StringType())),
        StructField("ys", ArrayType(StringType())),
    ]),
)

# Pair the elements of xs and ys by position and concatenate each pair
df.select(zip_with("xs", "ys", lambda x, y: concat(x, y)).alias("Array_Elements_Concat")).show(truncate=False)
Output:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Array_Elements_Concat |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Fabric:100 % polyester (100% recycled), PET plastic, Wall bracket:Steel, Polycarbonate/ABS plastic, Powder coating, Top rail/ Bottom rail:Aluminium, Powder coating]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
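If zip_with is new to you, a plain-Python model of the pairing may help. This is only a sketch of the semantics and not Spark code: `zip_with_model` and `concat_or_none` are hypothetical helper names made up for illustration. The padding with `None` reflects Spark's documented behaviour of extending the shorter array with nulls before the function is applied, and `concat` returning null when either input is null.

```python
from itertools import zip_longest


def concat_or_none(x, y):
    """Mimic Spark's concat on strings: the result is null if either input is null."""
    if x is None or y is None:
        return None
    return x + y


def zip_with_model(xs, ys, fn):
    """Hypothetical stand-in for zip_with: apply fn to elements at the same position.

    The shorter list is padded with None so both lists have the same length.
    """
    return [fn(x, y) for x, y in zip_longest(xs, ys, fillvalue=None)]


parts = ["Fabric:", "Wall bracket:", "Top rail/ Bottom rail:"]
texts = [
    "100 % polyester (100% recycled), PET plastic",
    "Steel, Polycarbonate/ABS plastic, Powder coating",
    "Aluminium, Powder coating",
]

# Each part name is joined with the material text at the same index
result = zip_with_model(parts, texts, concat_or_none)
for element in result:
    print(element)
```

Because the pairing is purely positional, a row whose two arrays differ in length would end with null elements rather than raising an error, so it may be worth checking the array sizes first.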