I am currently running an AWS Glue job that converts CSVs to Parquet files. The source and target of the data is an S3 bucket and this all works fine. However, I would like to include information from the S3 path within the Parquet file.
I have looked at the transforms in AWS Glue Studio's visual interface but can't find anything. I've also searched through the awsglue and pyspark Python libraries but can't find anything related to collecting the path/directory structure with glob or regex.
Any help appreciated.
CodePudding user response:
So it turns out AWS Glue/PySpark does have this feature, but it requires a little data wrangling and use of the scripting feature in AWS Glue jobs.
You can use the input_file_name function to get the full file path. This can be mapped to a column like so:
from awsglue.dynamicframe import DynamicFrame  # needed to convert the DataFrame back

# Add the full s3:// path of each row's source file as a column, then convert back to a DynamicFrame
ApplyMapping_node2 = ApplyMapping_node1.toDF().withColumn("path", input_file_name())
ApplyMapping_node3 = DynamicFrame.fromDF(ApplyMapping_node2, glueContext, "ApplyMapping_node3")
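For context, here is a minimal sketch of how that can fit into a complete Glue script; the bucket paths, node names and transformation_ctx values below are placeholders, not from the original job:

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source CSVs from S3 (placeholder bucket/prefix)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-source-bucket/csv/"], "recurse": True},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="source",
)

# Tag each row with the S3 path of the file it came from
with_path = DynamicFrame.fromDF(
    source.toDF().withColumn("path", input_file_name()),
    glueContext,
    "with_path",
)

# Write the result out as Parquet (placeholder target bucket)
glueContext.write_dynamic_frame.from_options(
    frame=with_path,
    connection_type="s3",
    connection_options={"path": "s3://my-target-bucket/parquet/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()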
However, if you need to split the path to get a specific file name, you can do something like this:
# Add the full path, extract just the file name with a UDF, then convert back to a DynamicFrame
ApplyMapping_node2 = ApplyMapping_node1.toDF().withColumn("path", input_file_name())
ApplyMapping_node3 = ApplyMapping_node2.withColumn("split_path", split_path_UDF(ApplyMapping_node2["path"]))
ApplyMapping_node4 = DynamicFrame.fromDF(ApplyMapping_node3, glueContext, "ApplyMapping_node4")
Where the split_path function is set up as a UDF, like so:
from pyspark.sql.functions import input_file_name, udf
from pyspark.sql.types import StringType

# Return just the file name (the last segment of the S3 path)
def split_path(path):
    return path.split('/')[-1]

split_path_UDF = udf(split_path, StringType())
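As an aside, if all you need is the last segment of the path, you can probably skip the UDF (and its Python serialization overhead) and use Spark's built-in functions instead. A sketch, assuming a Spark version with element_at (2.4+) and using an illustrative column name:

from pyspark.sql.functions import element_at, input_file_name, split

df = ApplyMapping_node1.toDF().withColumn("path", input_file_name())
# Split the path on "/" and take the last element, i.e. the file name
df = df.withColumn("file_name", element_at(split("path", "/"), -1))

Built-in functions stay inside the JVM, so they tend to be faster than a Python UDF on large datasets, but either approach gives the same result here.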