Access names within pyspark columns


I need some help accessing names within nested columns. I have, for example, the following schema:

root
 |-- id_1: string (nullable = true)
 |-- array_1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id_2: string (nullable = true)
 |    |    |-- post: struct (nullable = true)
 |    |    |    |-- value: double (nullable = true)

By using

cols = df.columns

I get a list of all column names at root level,

cols = ['id_1', 'array_1', ...]

However, I would like to access the names nested inside e.g. 'array_1'. Using

df.array_1.columns

simply returns

Column<b'array_1[columns]'>

and not the names. Is there any way to access the names within arrays? The same issue arises with structs. This would make it easier for me to loop and write functions. If it is possible to avoid extra modules, that would be beneficial.

Thanks

CodePudding user response:

You can use the DataFrame's schema to look up column names, via the StructType and StructField APIs. For example, in Scala Spark (adapt this code to your needs):

import org.apache.spark.sql.types._
import spark.implicits._  // provided by the spark-shell session; needed for toDF

case class A(a: Int, b: String)
val df = Seq(("a", Array(A(1, "asd"))), ("b", Array(A(2, "dsa")))).toDF("str_col", "arr_col")

println(df.schema)
// StructType(StructField(str_col,StringType,true), StructField(arr_col,ArrayType(StructType(StructField(a,IntegerType,false), StructField(b,StringType,true)),true),true))

val fields = df.schema.fields

println(fields(0).name)
// str_col

// An array column's dataType is an ArrayType; its elementType is the element struct
println(fields(1).dataType.asInstanceOf[ArrayType].elementType)
// StructType(StructField(a,IntegerType,false), StructField(b,StringType,true))
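Since the question is about PySpark, here is a rough Python equivalent of the same approach. This is a minimal sketch: the SparkSession variable `spark` and the toy data are assumptions made to match the schema in the question, not part of the original code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data matching the schema from the question (values are made up).
df = spark.createDataFrame(
    [("x", [("y", (1.0,))])],
    "id_1 string, array_1 array<struct<id_2:string,post:struct<value:double>>>",
)

fields = df.schema.fields

print(fields[0].name)
# id_1

# For an array column, elementType is the type of each element;
# when the elements are structs, that StructType exposes its field names.
element_type = fields[1].dataType.elementType
print([f.name for f in element_type.fields])
# ['id_2', 'post']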
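To make looping easier, a small recursive helper can flatten every nested name into a dotted path. This is only a sketch (the hypothetical `nested_names` function handles structs and arrays of structs, but not map types), and it needs nothing beyond pyspark itself:

from pyspark.sql.types import ArrayType, StructType

def nested_names(schema, prefix=""):
    """Collect dotted paths for every field, descending into
    structs and arrays of structs."""
    names = []
    for field in schema.fields:
        path = prefix + field.name
        names.append(path)
        dtype = field.dataType
        # Unwrap (possibly nested) arrays down to their element type.
        while isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            names.extend(nested_names(dtype, path + "."))
    return names

print(nested_names(df.schema))
# ['id_1', 'array_1', 'array_1.id_2', 'array_1.post', 'array_1.post.value']

The dotted paths can then be passed back into df.select(...); a path like 'array_1.id_2' resolves over the array, producing an array of that field's values.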