I am currently running an AWS Glue job that converts CSVs to Parquet files. The source and target of the data is an S3 bucket and this all works fine. However, I would like to include information from the S3 path within the Parquet file.
I have looked at the transforms in AWS Glue Studio's visual interface but can't find anything. I've also searched through the awsglue and pyspark Python libraries but can't find anything related to collecting the path/directory structure with glob or regex.
Any help appreciated.
CodePudding user response:
So it turns out AWS Glue/PySpark does have this feature, but it requires a little data wrangling and use of the scripting feature in AWS Glue jobs.
You can use the input_file_name function to get the full file path. This can be mapped to a column like so:
from awsglue.dynamicframe import DynamicFrame  # needed to convert the DataFrame back

# Add the full s3:// path of each row's source file as a column, then convert back to a DynamicFrame
ApplyMapping_node2 = ApplyMapping_node1.toDF().withColumn("path", input_file_name())
ApplyMapping_node3 = DynamicFrame.fromDF(ApplyMapping_node2, glueContext, "ApplyMapping_node3")
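For context, here is a minimal sketch of how that can fit into a complete Glue script; the bucket paths, node names and transformation_ctx values below are placeholders, not from the original job:

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source CSVs from S3 (placeholder bucket/prefix)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-source-bucket/csv/"], "recurse": True},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="source",
)

# Tag each row with the S3 path of the file it came from
with_path = DynamicFrame.fromDF(
    source.toDF().withColumn("path", input_file_name()),
    glueContext,
    "with_path",
)

# Write the result out as Parquet (placeholder target bucket)
glueContext.write_dynamic_frame.from_options(
    frame=with_path,
    connection_type="s3",
    connection_options={"path": "s3://my-target-bucket/parquet/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()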
However, if you need to split the path to get a specific file name, you can do something like this:
# Add the full path, extract just the file name with a UDF, then convert back to a DynamicFrame
ApplyMapping_node2 = ApplyMapping_node1.toDF().withColumn("path", input_file_name())
ApplyMapping_node3 = ApplyMapping_node2.withColumn("split_path", split_path_UDF(ApplyMapping_node2["path"]))
ApplyMapping_node4 = DynamicFrame.fromDF(ApplyMapping_node3, glueContext, "ApplyMapping_node4")
Where the split_path function is set up as a UDF, like so:
from pyspark.sql.functions import input_file_name, udf
from pyspark.sql.types import StringType

# Return just the file name (the last segment of the S3 path)
def split_path(path):
    return path.split('/')[-1]

split_path_UDF = udf(split_path, StringType())
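As an aside, if all you need is the last segment of the path, you can probably skip the UDF (and its Python serialization overhead) and use Spark's built-in functions instead. A sketch, assuming a Spark version with element_at (2.4+) and using an illustrative column name:

from pyspark.sql.functions import element_at, input_file_name, split

df = ApplyMapping_node1.toDF().withColumn("path", input_file_name())
# Split the path on "/" and take the last element, i.e. the file name
df = df.withColumn("file_name", element_at(split("path", "/"), -1))

Built-in functions stay inside the JVM, so they tend to be faster than a Python UDF on large datasets, but either approach gives the same result here.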