this code used in aws glue job:
def get_latest_records(data_frame, record_keys, key):
columns = data_frame.columns
window_spec = w.partitionBy(*record_keys).orderBy(f.desc(key))
output_data_frame = data_frame.withColumn("row_num", f.row_number().over(window_spec)). \
filter(f.col("row_num") == 1). \
drop(f.col("row_num")). \
select(columns)
return data_frame
I want to order the dynamic frame data according a column called "name" then if two names are equal, order by the "key" column. How to do this? Also, can you explain what (drop) does in the output_data_frame?
CodePudding user response:
Assuming the value for record_keys
is an iterable with the single value "name"
this should already do it.
You should call the function like this:
output_df = get_latest_records(
data_frame=input_df, # The name of the dataframe you want to process
record_keys=["name"],
key="key",
)
What's going on?
The code partitions the data by the "name" column that way and within each partition sorts it in descending order by "key". Within each partition it also adds a new column row_num
using the row_number()
function that sequentially adds the row number to each row in the partition.
By filtering on row_num
with the value 1 you get the first row from the partition which is the datapoint you're looking for. Then the artificial column row_num
is dropped again because you don't need it on the outside and you return all the columns from the original dataframe.