When I run PySpark locally, I get correct results, with each list ordered by BOOK_ID. But when I deploy the job to AWS Glue, the books do not seem to be ordered.
root
 |-- AUTHOR_ID: integer
 |-- NAME: string
 |-- BOOK_LIST: array
 |    |-- BOOK_ID: integer
 |    |-- BOOK_NAME: string
from pyspark.sql import functions as F

result = (
    df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
    # Rows are sorted before grouping, in the hope that collect_list keeps that order
    .orderBy(F.col("BOOK_ID").desc())
    .groupBy("AUTHOR_ID", "NAME")
    .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
)
Note: I'm using PySpark 3.2.1 and Glue 2.0.
Any suggestions, please?
CodePudding user response:
Supposition
Although I managed to run the job on Glue 3.0, which supports Spark 3.1, the orderBy still gives the wrong result.
CodePudding user response:
I'm trying to simplify the issue, so bear with me.
Let's create a sample DataFrame:
df = spark.createDataFrame([
    {"book_id": 1, "author_id": 1, "name": "David", "book_name": "Kill Bill"},
    {"book_id": 2, "author_id": 2, "name": "Roman", "book_name": "Dying is Hard"},
    {"book_id": 3, "author_id": 3, "name": "Moshe", "book_name": "Apache Kafka The Easy Way"},
    {"book_id": 4, "author_id": 1, "name": "David", "book_name": "Pyspark Is Awesome"},
    {"book_id": 5, "author_id": 2, "name": "Roman", "book_name": "Playing a Piano"},
    {"book_id": 6, "author_id": 3, "name": "Moshe", "book_name": "Awesome Scala"},
])
Now run this:
(
    df
    .groupBy("author_id", "name")
    .agg(
        F.collect_list(F.struct("book_id", "book_name")).alias("data"),
        F.sum("book_id").alias("sorted_key"),
    )
    .orderBy(F.col("sorted_key").desc())
    .drop("sorted_key")
    .show(10, False)
)
I get what I believe you are asking for:
+---------+-----+----------------------------------------------------+
|author_id|name |data                                                |
+---------+-----+----------------------------------------------------+
|3        |Moshe|[{3, Apache Kafka The Easy Way}, {6, Awesome Scala}]|
|2        |Roman|[{2, Dying is Hard}, {5, Playing a Piano}]          |
|1        |David|[{1, Kill Bill}, {4, Pyspark Is Awesome}]           |
+---------+-----+----------------------------------------------------+