How to pick the substring from a filename and load that substring into a table column in PySpark?-CodePudding

We are reading files from ADLS through this kind of command:

relative_path = "ContainerName/"
input_file_name = "ss_old_data_Amazon_Blast_202211093837474.csv"

How can we fetch a substring from a file name that we are getting for processing.

For example:
1st filename: vm_Path_Accenture_Complex_Union_202211027373.csv
2nd filename: vm_path_Google_is_a_good_company_20221109473.csv
3rd filename: ss_old_data_Amazon_Blast_202211093837474.csv

We need to pick specific substring from the filename and remove _ from substring and make it upper case. I don't need to pick the date format (numbers) or .csv - just the company name. vm_path will be there at the beginning for most of the files, and sometimes ss_old_data. We need to remove this part too.

Expected output:
1st filename should become: ACCENTURECOMPLEXUNION
2nd filename should become: GOOGLEISAGOODCOMPANY
3rd filename should become: AMAZONBLAST

How can we achieve it in PySpark?

CodePudding user response：

One approach uses two chained calls to str.replace, followed by a call to str.upper:

df["output"] = df["filename"].str.replace(r'^(?:vm_path|ss_old_data)_|_\d \.csv$', '', regex=True, flags=re.I)
                             .str.replace('_', '', regex=False)
                             .str.upper()

Here is a regex demo showing that the initial replacement logic is working.

CodePudding user response：

Since you store your file name in a variable, you can reuse it and save it as a literal (lit). But before saving you will have to do some modifications to the literal.

ss_old_data_Amazon_Blast_202211093837474.csv <- input
_Amazon_Blast <- after applying regexp_extract with pattern '(?i)(vm_path|ss_old_data)(. )_'
AmazonBlast <- after applying translate(..., '_', '')
AMAZONBLAST <- after applying upper

from pyspark.sql import functions as F

input_file_name ="ss_old_data_Amazon_Blast_202211093837474.csv"

df = spark.read.csv(input_file_name)
c = F.regexp_extract(F.lit(input_file_name), '(?i)(vm_path|ss_old_data)(. )_', 2)
df = df.withColumn('file_name', F.upper(F.translate(c, '_', '')))
                   
df.show()
#  ---------- ----------- 
# |       _c0|  file_name|
#  ---------- ----------- 
# |csv_line_1|AMAZONBLAST|
# |csv_line_2|AMAZONBLAST|
#  ---------- -----------