How to order the books list in this dataframe using PySpark?
root
|-- AUTHORID: integer
|-- NAME: string
|-- BOOK_LIST: array
| |-- element: struct
| | |-- BOOK_ID: integer
| | |-- BOOK_NAME: string
Update
In my case, I have a dataframe that is nested on multiple levels:
root
|-- AUTHOR_ID: integer (nullable = false)
|-- NAME: string (nullable = true)
|-- Books: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- BOOK_ID: integer (nullable = false)
| | |-- Chapters: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- NAME: string (nullable = true)
| | | | |-- NUMBER_PAGES: integer (nullable = true)
How can I sort the chapters by name?
CodePudding user response:
Create the dataframe and use sort:
from pyspark.sql.functions import asc

# sorts the rows of the dataframe; assumes BOOK_ID is a top-level column
df.sort(asc("BOOK_ID")).collect()
Please add sample JSON data if you are looking for a code example.
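For reference, a minimal runnable sketch with made-up sample data (the authors and books below are hypothetical) matching the question's first schema. Note that sort orders the rows of the dataframe; it does not reorder the elements inside each BOOK_LIST array:
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows shaped like the question's first schema
data = [
    (2, "Bob", [(4, "Fourth"), (3, "Third")]),
    (1, "Alice", [(2, "Second"), (1, "First")]),
]
schema = "AUTHORID int, NAME string, BOOK_LIST array<struct<BOOK_ID int, BOOK_NAME string>>"
df = spark.createDataFrame(data, schema)

# sort() orders the rows (here by AUTHORID), leaving each
# BOOK_LIST array in its original order
df.sort(asc("AUTHORID")).show(truncate=False)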
CodePudding user response:
If you want to order by BOOK_ID, and BOOK_ID is a unique field, you can use array_sort.
import pyspark.sql.functions as F

# sorts the elements of the BOOK_LIST array within each row
df = df.withColumn('BOOK_LIST', F.array_sort('BOOK_LIST'))
Note that array_sort compares the structs field by field, in declaration order, so in this example it effectively orders the array by BOOK_ID (the first field) only.
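For the nested schema in the update, one option (a sketch, assuming Spark 3.1+ for transform and Column.withField in the Python API) is to rebuild each Books element with its Chapters array sorted. Because NAME is the first field of the chapter struct, the default array_sort ordering sorts chapters by NAME, with NUMBER_PAGES as a tiebreaker:
import pyspark.sql.functions as F

# For every book struct, replace its Chapters field with a sorted copy;
# array_sort compares structs field by field, so NAME is compared first
df = df.withColumn(
    'Books',
    F.transform('Books', lambda b: b.withField('Chapters', F.array_sort(b['Chapters'])))
)
If you ever need to sort by a field that is not first in the struct, array_sort also accepts a custom comparator function (available in Spark SQL since 3.0, and in the PySpark API from 3.4).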