I need some help accessing names within columns. I have, for example, the following schema:
root
|-- id_1: string (nullable = true)
|-- array_1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id_2: string (nullable = true)
| | |-- post: struct (nullable = true)
| | | |-- value: double (nullable = true)
By using
cols = df.columns
I get a list of all names at root level,
cols = [id_1, array_1, ...]
However, I would like to access the names within e.g. 'array_1'. Using
df.array_1.columns
simply returns
Column<b'array_1[columns]'>
and no names. Is there any way to access names within arrays? The same issue arises with structs. This would make it easier to loop over columns and write functions. If it is possible to avoid extra modules, that would be a bonus.
Thanks
CodePudding user response:
You can use the schema of the DataFrame to look up column names, via the StructType and StructField APIs. Example Scala Spark code (adapt it to your needs):
import org.apache.spark.sql.types._
// In a compiled application you also need `import spark.implicits._`
// for .toDF; the spark-shell imports it automatically.
case class A(a: Int, b: String)
val df = Seq(("a", Array(A(1, "asd"))), ("b", Array(A(2, "dsa")))).toDF("str_col", "arr_col")
df.schema
> res19: org.apache.spark.sql.types.StructType = StructType(StructField(str_col,StringType,true), StructField(arr_col,ArrayType(StructType(StructField(a,IntegerType,false), StructField(b,StringType,true)),true),true))
val fields = df.schema.fields
fields(0).name
> res22: String = str_col
// For an array column, unwrap the ArrayType to reach the element struct:
fields(1).dataType.asInstanceOf[ArrayType].elementType
> res23: org.apache.spark.sql.types.DataType = StructType(StructField(a,IntegerType,false), StructField(b,StringType,true))