We are reading files from ADLS through this kind of command:
relative_path = "ContainerName/"
input_file_name = "ss_old_data_Amazon_Blast_202211093837474.csv"
How can we fetch a substring from a file name that we are getting for processing.
For example:
1st filename: vm_Path_Accenture_Complex_Union_202211027373.csv
2nd filename: vm_path_Google_is_a_good_company_20221109473.csv
3rd filename: ss_old_data_Amazon_Blast_202211093837474.csv
We need to pick specific substring from the filename and remove _
from substring and make it upper case. I don't need to pick the date format (numbers) or .csv
- just the company name.
vm_path
will be there at the beginning for most of the files, and sometimes ss_old_data
. We need to remove this part too.
Expected output:
1st filename should become: ACCENTURECOMPLEXUNION
2nd filename should become: GOOGLEISAGOODCOMPANY
3rd filename should become: AMAZONBLAST
How can we achieve it in PySpark?
CodePudding user response:
One approach uses two chained calls to str.replace
, followed by a call to str.upper
:
df["output"] = df["filename"].str.replace(r'^(?:vm_path|ss_old_data)_|_\d \.csv$', '', regex=True, flags=re.I)
.str.replace('_', '', regex=False)
.str.upper()
Here is a regex demo showing that the initial replacement logic is working.
CodePudding user response:
Since you store your file name in a variable, you can reuse it and save it as a literal (lit
). But before saving you will have to do some modifications to the literal.
ss_old_data_Amazon_Blast_202211093837474.csv
<- input
_Amazon_Blast
<- after applying regexp_extract
with pattern '(?i)(vm_path|ss_old_data)(. )_'
AmazonBlast
<- after applying translate(..., '_', '')
AMAZONBLAST
<- after applying upper
from pyspark.sql import functions as F
input_file_name ="ss_old_data_Amazon_Blast_202211093837474.csv"
df = spark.read.csv(input_file_name)
c = F.regexp_extract(F.lit(input_file_name), '(?i)(vm_path|ss_old_data)(. )_', 2)
df = df.withColumn('file_name', F.upper(F.translate(c, '_', '')))
df.show()
# ---------- -----------
# | _c0| file_name|
# ---------- -----------
# |csv_line_1|AMAZONBLAST|
# |csv_line_2|AMAZONBLAST|
# ---------- -----------