Home > Software design >  How to solve this Pyspark Code Block using Regexp
How to solve this Pyspark Code Block using Regexp

Time:01-09

I have this CSV file

enter image description here

but when I am running my notebook regex shows some error

from pyspark.sql.functions import regexp_replace

path="dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)

dff.show(truncate=False)
#dffs_headers = dff.dtypes

for i in dffs_headers:
  columnLabel = i[0]
  print(columnLabel)
  newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
  
  dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$','')).drop(newColumnLabel)
  
  if columnLabel != newColumnLabel:
    dff = dff.drop(columnLabel)
    dff.show(truncate=False)
    

As and a result I am getting this

enter image description here

Can anyone improvise this code, it will be a great help.

Expected output is

|��123456��,��Version2��,��All questions have been answered accurately and the guidance in the questionnaire was understood and followed��,��2010-12-16 00:01:48.020000000��|

But I am getting

��Id��,��Version��,��Questionnaire��,��Date��

Second column is showing Truncated value

CodePudding user response:

You will need to import the libraries you want to use first, to use them. The below code in a cell before the regexp_replace call should fix this issue

from pyspark.sql.functions import regexp_replace

CodePudding user response:

This is working asnwer

from pyspark.sql.functions import regexp_replace

path="dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)

#dffs_headers = dff.dtypes

for i in dffs_headers:
  columnLabel = i[0]
  newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
  
  dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$',''))
  
  if columnLabel != newColumnLabel:
    dff = dff.drop(columnLabel)
  dff.show(truncate=False)
    
  • Related