pyspark - connect two columns' array elements


I am very new to PySpark. I have a DataFrame with two columns, and each column holds strings in an array format. How can I connect each element of the array in the first column to the value at the same position in the array of the other column?

If I convert the DataFrame to a pandas DataFrame in Databricks, the code below works, but it does not keep the arrays in the correct format.


import numpy as np

for item in list_x:
    df_head[item] = "x"                       # placeholder so the column exists
    value = df_head['materialText'].values
    headName = df_head['materialTextPart'].values

    for k in range(len(df_head)):
        # Skip rows where materialText is missing (NaN comes back as a float)
        if isinstance(value[k], float):
            continue

        # Split the bracketed strings into individual entries
        value_array = value[k].split(',')
        headName_array = headName[k][1:-2].split(',')

        for m in range(len(headName_array)):
            name = headName_array[m]
            if name == item or name == ' ' + item or name == ' ' + item.replace('s', ''):
                columnName = item
                columnValue = df_head.loc[k, columnName]

                if columnValue == 'x':
                    df_head.loc[k, columnName] = value_array[m]
                else:
                    df_head.loc[k, columnName] = df_head.loc[k, columnName] + ',' + value_array[m]

    df_head[item] = df_head[item].replace('x', np.nan)

Example rows (materialTextPart and materialText):

materialTextPart: ["Fabric:", "Wall bracket:", "Top rail/ Bottom rail:"]
materialText:     ["100 % polyester (100% recycled), PET plastic", "Steel, Polycarbonate/ABS plastic, Powder coating", "Aluminium, Powder coating"]

materialTextPart: ["Ticking:", "Filling:", "Ticking, underside:", "Comfort filling:", "Ticking:"]
materialText:     ["100 % polyester (100% recycled)", "100 % polyester", "100% polypropylene", "Polyurethane foam 28 kg/cu.m.", "100% polyester"]

CodePudding user response:

As I mentioned in my comment -

from pyspark.sql.functions import *
from pyspark.sql.types import *

df = spark.createDataFrame( data = [
                                      (["Fabric:", "Wall bracket:", "Top rail/ Bottom rail:"],
                                       ["100 % polyester (100% recycled), PET plastic", "Steel, Polycarbonate/ABS plastic, Powder coating", "Aluminium, Powder coating"]
                                      )
                                    ],
                            schema = StructType([StructField("xs", ArrayType(StringType())), StructField("ys", ArrayType(StringType()))])
                          )

# zip_with pairs the elements at the same index and concatenates each pair
df.select(zip_with("xs", "ys", lambda x, y: concat(x, y)).alias("Array_Elements_Concat")).show(truncate=False)

Output

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Array_Elements_Concat                                                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Fabric:100 % polyester (100% recycled), PET plastic, Wall bracket:Steel, Polycarbonate/ABS plastic, Powder coating, Top rail/ Bottom rail:Aluminium, Powder coating]  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
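
Your original loop looks like it is ultimately trying to look material values up by their part name. If that is the goal, a map column may be handier than a concatenated array. The sketch below is only an assumption about that goal, not part of the answer above: it reuses the same df, and the names df_map, materials and fabric (and the key "Fabric:" taken from the sample data) are just illustrative.

from pyspark.sql.functions import col, map_from_arrays

# Build a map column: part name -> material text, pairing elements by position
df_map = df.select(map_from_arrays("xs", "ys").alias("materials"))

# Look a single part up by its key
df_map.select(col("materials").getItem("Fabric:").alias("fabric")).show(truncate=False)

Note that duplicate part names (e.g. "Ticking:" appearing twice in the second sample row) would need de-duplication first, since Spark rejects duplicate map keys by default.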