AWS Glue does not give coherent result for pyspark - orderBy

Time:03-08

When running PySpark locally, I get correct results, with each list ordered by BOOK_ID. But when I deploy the same code as an AWS Glue job, the books do not seem to be ordered.

root
 |-- AUTHOR_ID: integer
 |-- NAME: string
 |-- BOOK_LIST: array
 |    |-- BOOK_ID: integer
 |    |-- BOOK_NAME: string

    from pyspark.sql import functions as F
    
    result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
              .orderBy(F.col("BOOK_ID").desc())
              .groupBy("AUTHOR_ID", "NAME")
              .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")))
              )

Note: I'm using PySpark 3.2.1 and Glue 2.0.

Any suggestions, please?

CodePudding user response:

Supposition

Although I managed to run the job on Glue 3.0, which supports Spark 3.1, the orderBy still gives the wrong result.

CodePudding user response:

I'm trying to simplify the issue, so work with me:

Let's create a sample DataFrame:

# `spark` is assumed to be an existing SparkSession (e.g. the pyspark shell)
from pyspark.sql import functions as F

df = spark.createDataFrame([
    {"book_id": 1, "author_id": 1, "name": "David", "book_name": "Kill Bill"},
    {"book_id": 2, "author_id": 2, "name": "Roman", "book_name": "Dying is Hard"},
    {"book_id": 3, "author_id": 3, "name": "Moshe", "book_name": "Apache Kafka The Easy Way"},
    {"book_id": 4, "author_id": 1, "name": "David", "book_name": "Pyspark Is Awesome"},
    {"book_id": 5, "author_id": 2, "name": "Roman", "book_name": "Playing a Piano"},
    {"book_id": 6, "author_id": 3, "name": "Moshe", "book_name": "Awesome Scala"},
])

Now, doing this:

(
    df
    .groupBy("author_id", "name")
    .agg(
        F.collect_list(F.struct("book_id", "book_name")).alias("data"),
        # Sum of book_id per author, used only as a sort key for the rows
        F.sum("book_id").alias("sorted_key"),
    )
    .orderBy(F.col("sorted_key").desc())
    .drop("sorted_key")
    .show(10, False)
)

I'm getting exactly what you are apparently asking for:

+---------+-----+----------------------------------------------------+
|author_id|name |data                                                |
+---------+-----+----------------------------------------------------+
|3        |Moshe|[{3, Apache Kafka The Easy Way}, {6, Awesome Scala}]|
|2        |Roman|[{2, Dying is Hard}, {5, Playing a Piano}]          |
|1        |David|[{1, Kill Bill}, {4, Pyspark Is Awesome}]           |
+---------+-----+----------------------------------------------------+
