How to order nested list with pyspark


How can I order the books list in this dataframe using PySpark?

root
 |-- AUTHORID: integer
 |-- NAME: string 
 |-- BOOK_LIST: array 
 |    |-- BOOK_ID: integer 
 |    |-- BOOK_NAME: string 

Update

In my case I have a dataframe that is nested at multiple levels:

root
  |-- AUTHOR_ID: integer (nullable = false)
  |-- NAME: string (nullable = true)
  |-- Books: array (nullable = false)
  |    |-- element: struct (containsNull = false)
  |    |    |-- BOOK_ID: integer (nullable = false)
  |    |    |-- Chapters: array (nullable = true) 
  |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |-- NAME: string (nullable = true)
  |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)

How can I sort the chapters by name?

CodePudding user response:

Create the dataframe and use sort:

    from pyspark.sql.functions import asc

    df.sort(asc("BOOK_ID")).collect()

Please add sample JSON data if you are looking for a code example.

CodePudding user response:

If you want to order by BOOK_ID, and BOOK_ID is a unique field, you can use array_sort:

import pyspark.sql.functions as F

df = df.withColumn('BOOK_LIST', F.array_sort('BOOK_LIST'))

Note that array_sort compares struct elements field by field in order, so in this example it sorts primarily by BOOK_ID, the first field of the struct.
