I am reading a JSON file that has some date columns. The issue is that some of the date columns contain dates written in Arabic/Urdu (Arabic-Indic) digits:
٠٤-٢٥-٢٠٢١
I want to convert them to English dates in yyyy-mm-dd format.
How can I achieve this in PySpark?
CodePudding user response:
You can convert Arabic-Indic digits to ASCII digits by casting each piece to decimal:
from pyspark.sql.functions import split, concat_ws, col

df = spark.createDataFrame([('٠٤-٢٥-٢٠٢١',)], ['arabic'])
df.withColumn('split', split('arabic', '-')) \
  .withColumn('date', concat_ws('-',
      col('split')[2].cast('decimal'),
      col('split')[0].cast('decimal'),
      col('split')[1].cast('decimal'))) \
  .drop('split').show()
+----------+---------+
|    arabic|     date|
+----------+---------+
|٠٤-٢٥-٢٠٢١|2021-4-25|
+----------+---------+
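Note that casting to decimal drops leading zeros (the output above is 2021-4-25, not 2021-04-25). As a sketch of an alternative, the digit mapping itself is a one-to-one character translation, shown here in plain Python (the function name arabic_date_to_iso is illustrative, not from the original post):

```python
# Arabic-Indic digits ٠١٢٣٤٥٦٧٨٩ map one-to-one onto ASCII 0-9,
# so a character-level translation table converts them directly.
TRANS = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def arabic_date_to_iso(s: str) -> str:
    """Translate the digits, then reorder MM-DD-YYYY into YYYY-MM-DD."""
    mm, dd, yyyy = s.translate(TRANS).split("-")
    return f"{yyyy}-{mm}-{dd}"

print(arabic_date_to_iso("٠٤-٢٥-٢٠٢١"))  # 2021-04-25
```

The same idea should work column-wise in Spark without a UDF via pyspark.sql.functions.translate('arabic', '٠١٢٣٤٥٦٧٨٩', '0123456789'), followed by to_date(..., 'MM-dd-yyyy') to get a proper date type.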
CodePudding user response:
Finally, I decided to use a pandas_udf together with Python's unidecode library:
from pyspark.sql.types import StringType
from pyspark.sql.functions import pandas_udf
import pyspark.sql.functions as F
from unidecode import unidecode
import pandas as pd

def unidecode_(val):
    # Transliterate Arabic-Indic digits to ASCII; nulls pass through unchanged
    if val:
        return unidecode(val)

@pandas_udf(StringType())
def a_to_n(col: pd.Series) -> pd.Series:
    return col.apply(unidecode_)

df = df_json.withColumn('checkin_date', a_to_n(F.col("checkin_date")))
This gives me the desired result.
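One caveat: unidecode only transliterates the digits, so the strings are still in MM-dd-yyyy order and must be reordered to reach yyyy-mm-dd. A minimal pandas sketch of that last step, assuming the transliterated column holds MM-dd-yyyy strings (the sample values here are made up):

```python
import pandas as pd

# After transliteration the values look like "04-25-2021"; parse them with
# an explicit format and re-emit them as ISO yyyy-mm-dd strings.
s = pd.Series(["04-25-2021", "12-01-2020"])
iso = pd.to_datetime(s, format="%m-%d-%Y").dt.strftime("%Y-%m-%d")
print(iso.tolist())  # ['2021-04-25', '2020-12-01']
```

In Spark itself the equivalent would be to_date(F.col("checkin_date"), 'MM-dd-yyyy'), optionally followed by date_format(..., 'yyyy-MM-dd') if a string is needed.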