How to order dynamic frame data?

This code is used in an AWS Glue job:

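# Assumed imports, matching the w and f aliases used below
from pyspark.sql import Window as w
from pyspark.sql import functions as f


def get_latest_records(data_frame, record_keys, key):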
    columns = data_frame.columns

    window_spec = w.partitionBy(*record_keys).orderBy(f.desc(key))

    output_data_frame = data_frame.withColumn("row_num", f.row_number().over(window_spec)). \
        filter(f.col("row_num") == 1). \
        drop(f.col("row_num")). \
        select(columns)

    return output_data_frame

I want to order the dynamic frame data by a column called "name" and, if two names are equal, by the "key" column. How can I do this? Also, can you explain what drop does in output_data_frame?

CodePudding user response:

Assuming record_keys is an iterable containing the single value "name", this should already do it.

You should call the function like this:

output_df = get_latest_records(
    data_frame=input_df,  # The name of the dataframe you want to process
    record_keys=["name"],
    key="key",
)

What's going on?

The code partitions the data by the "name" column and, within each partition, sorts the rows in descending order by "key". It then adds a new column, row_num, using the row_number() function, which numbers the rows in each partition sequentially, starting at 1.
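
To make the numbering step concrete, here is a rough plain-Python imitation of the window logic. It does not use Spark, the sample rows are made up, and number_rows is a hypothetical helper, so treat it as an illustration of the idea rather than of how Spark executes it.

```python
from itertools import groupby
from operator import itemgetter

# Made-up sample rows standing in for the dataframe
rows = [
    {"name": "alice", "key": 1},
    {"name": "bob", "key": 5},
    {"name": "alice", "key": 3},
    {"name": "bob", "key": 2},
]


def number_rows(rows, partition_col, order_col):
    """Imitate row_number() over partitionBy(partition_col).orderBy(desc(order_col))."""
    numbered = []
    # groupby only groups adjacent items, so sort by the partition column first
    by_partition = sorted(rows, key=itemgetter(partition_col))
    for _, group in groupby(by_partition, key=itemgetter(partition_col)):
        # Within one partition, order by order_col descending
        ordered = sorted(group, key=itemgetter(order_col), reverse=True)
        # Number the rows of this partition starting at 1
        for row_num, row in enumerate(ordered, start=1):
            numbered.append({**row, "row_num": row_num})
    return numbered


numbered = number_rows(rows, "name", "key")
for row in numbered:
    print(row)
```

Each name gets its own numbering, and the row with the highest "key" in each group receives row_num 1.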

By filtering on row_num with the value 1, you keep only the first row of each partition, which is the datapoint you're looking for: the row with the highest "key" for each name. The artificial column row_num is then dropped because it is not needed outside the function, and select(columns) returns the columns of the original dataframe.
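
Note that this keeps a single row per name. If the goal is instead to keep every row and simply sort by "name", with ties broken by "key", a minimal sketch could look like the one below. It assumes the input is a Spark DataFrame (a Glue DynamicFrame would first need to be converted with toDF()), and order_by_name_then_key is a hypothetical helper name, not part of the original job.

```python
def order_by_name_then_key(data_frame):
    """Sort all rows by "name", then by "key" where names are equal.

    Assumes data_frame is a Spark DataFrame; orderBy sorts ascending by default.
    """
    return data_frame.orderBy("name", "key")
```

You would call it as output_df = order_by_name_then_key(input_df), where input_df is the dataframe you want to sort.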