DataFrame struct field type to array of fields, except the last field, in PySpark

Time:11-03

I have a spark dataframe with the following schema:

root
  |-- stat_chiamate: struct (nullable = true)
  |    |-- chiamate_ricevute: struct (nullable = true)
  |    |    |-- h_0: string (nullable = true)
  |    |    |-- h_1: string (nullable = true)
  |    |    |-- h_10: string (nullable = true)
  |    |    |-- h_11: string (nullable = true)
  |    |    |-- h_12: string (nullable = true)
  |    |    |-- h_13: string (nullable = true)
  |    |    |-- h_14: string (nullable = true)
  |    |    |-- h_15: string (nullable = true)
  |    |    |-- h_16: string (nullable = true)
  |    |    |-- h_17: string (nullable = true)
  |    |    |-- h_18: string (nullable = true)
  |    |    |-- h_19: string (nullable = true)
  |    |    |-- h_2: string (nullable = true)
  |    |    |-- h_20: string (nullable = true)
  |    |    |-- h_21: string (nullable = true)
  |    |    |-- h_22: string (nullable = true)
  |    |    |-- h_23: string (nullable = true)
  |    |    |-- h_3: string (nullable = true)
  |    |    |-- h_4: string (nullable = true)
  |    |    |-- h_5: string (nullable = true)
  |    |    |-- h_6: string (nullable = true)
  |    |    |-- h_7: string (nullable = true)
  |    |    |-- h_8: string (nullable = true)
  |    |    |-- h_9: string (nullable = true)
  |    |    |-- n_totale: string (nullable = true)

I want a dataframe like:

root
  |-- stat_chiamate: struct (nullable = true)
  |    |-- chiamate_ricevute: array
  |    |    |-- element: string

where chiamate_ricevute is a list of the field values. For example, given:

h_0= 0
h_1= 1
h_2= 2
.
.
.
h_23=23
n_totale=412

I want:

[0,1,2....,23]  <-- I don't want n_totale values

In my code I use df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()[:-1], but that only gives me the field names. How can I use them?

df = df.select(F.array(*[
    field
    for field in df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()
    if field.startswith("h_")
]).alias("CIRCO"))

CodePudding user response:

You could use the schema of the dataframe, and in particular the schema of your struct to extract all the field names but n_totale and then wrap them into an array.

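from pyspark.sql import functions as f

# Assumes chiamate_ricevute is the first top-level column of df.
fields = ['chiamate_ricevute.' + field.name for field in df.schema[0].dataType
                if field.name != 'n_totale']
# Wrap every remaining field into a single array column.
result = df.select(f.array(fields).alias("chiamate_ricevute"))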
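
One thing to watch: the schema in the question lists the fields in lexicographic order (h_0, h_1, h_10, ...), so an array built straight from the schema would not be in hour order. A minimal sketch of one way to handle that, assuming the hourly fields are always named h_<number>; the helper name hourly_field_paths and the shortened names list are made up for illustration:

```python
import re

def hourly_field_paths(field_names, prefix):
    """Return fully qualified paths of the h_<number> fields, sorted by hour."""
    hourly = []
    for name in field_names:
        match = re.fullmatch(r"h_(\d+)", name)
        if match:  # skips n_totale and anything else that is not h_<number>
            hourly.append((int(match.group(1)), name))
    # Sorting the (hour, name) pairs puts h_2 before h_10.
    return [f"{prefix}.{name}" for _, name in sorted(hourly)]

# A shortened sample of field names, in the order the schema reports them.
names = ["h_0", "h_1", "h_10", "h_11", "h_2", "h_23", "h_3", "n_totale"]
paths = hourly_field_paths(names, "stat_chiamate.chiamate_ricevute")
print(paths)
```

With a real dataframe the names would come from df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames(), and the resulting paths could then be passed to f.array(*paths).alias("chiamate_ricevute") inside df.select.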