How to parse datetime that is coming in Arabic text (٠٤-٢٥-٢٠٢١) to English dates in Pyspark

Time:09-17

I am reading a JSON file that has some date columns. The issue is that some of the date columns contain dates written with Arabic/Urdu digits:

٠٤-٢٥-٢٠٢١

I want to convert it to an English date in yyyy-MM-dd format. How can I achieve this in PySpark?

CodePudding user response:

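You can convert Arabic digits to English digits by casting each part of the date to decimal. Note that the cast drops leading zeros, so the month and day are not zero-padded.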

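from pyspark.sql.functions import col, concat_ws, split

df = spark.createDataFrame([('٠٤-٢٥-٢٠٢١',)], ['arabic'])

# Split MM-dd-yyyy on '-', cast each part to decimal, and reassemble as year-month-day.
df.withColumn('split', split('arabic', '-')) \
  .withColumn('date', concat_ws('-', col('split')[2].cast('decimal'), col('split')[0].cast('decimal'), col('split')[1].cast('decimal'))) \
  .drop('split').show()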

+----------+---------+
|    arabic|     date|
+----------+---------+
|٠٤-٢٥-٢٠٢١|2021-4-25|
+----------+---------+
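The output above reads 2021-4-25 rather than 2021-04-25, because casting to a number discards the leading zero. The same split-and-convert idea can be sketched in plain Python with zero-padding restored. This is only a sketch: the helper name `mdy_to_iso` is hypothetical, and it assumes every value is in MM-dd-yyyy order, as the sample value suggests. In Spark itself, padding each part with `lpad` before `concat_ws` should have a similar effect.

```python
from datetime import date


def mdy_to_iso(value):
    """Convert an 'MM-dd-yyyy' string (digits from any script) to 'yyyy-MM-dd'.

    Hypothetical helper; assumes the input is always month-day-year.
    """
    if not value:
        return None
    # Split into the three parts, mirroring split('arabic', '-') in the Spark snippet.
    month, day, year = value.strip().split("-")
    # int() accepts any Unicode decimal digit, including Arabic-Indic ones,
    # so no explicit digit mapping is needed here.
    # date(...) also validates the values, and isoformat() zero-pads them.
    return date(int(year), int(month), int(day)).isoformat()


# The sample value from the question, written with escapes to avoid
# right-to-left rendering issues in editors.
sample = "\u0660\u0664-\u0662\u0665-\u0662\u0660\u0662\u0661"
print(mdy_to_iso(sample))  # → 2021-04-25
```

A function like this could be wrapped in a UDF if the built-in column functions are not enough, at the cost of the usual UDF overhead.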

CodePudding user response:

In the end, I decided to use pandas_udf with Python's unidecode library:

from pyspark.sql.types import StringType
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from unidecode import unidecode
import pandas as pd

def unidecode_(val):
    if val:
        return unidecode(val)


@pandas_udf(StringType())
def a_to_n(col):
    return pd.Series(col.apply(unidecode_))

df = df_json.withColumn('checkin_date', a_to_n(F.col("checkin_date")))

This gives me the desired result.
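One thing to keep in mind is that unidecode transliterates every non-ASCII character, not only digits, so it may also alter other text in the column. If only the digits should change, a standard-library alternative can be sketched with `unicodedata.decimal`. The function name `digits_to_ascii` is hypothetical; it is meant as a possible drop-in for `unidecode_` above, not as the original poster's method.

```python
import unicodedata


def digits_to_ascii(val):
    """Replace every Unicode decimal digit with its ASCII equivalent.

    Hypothetical stand-in for unidecode_; all non-digit characters are kept as-is.
    """
    if not val:
        return None
    converted = []
    for ch in val:
        # decimal() returns the digit's numeric value, or the default (None)
        # when the character is not a decimal digit.
        digit = unicodedata.decimal(ch, None)
        converted.append(ch if digit is None else str(digit))
    return "".join(converted)


# Arabic-Indic sample from the question, written with escapes.
print(digits_to_ascii("\u0660\u0664-\u0662\u0665-\u0662\u0660\u0662\u0661"))  # → 04-25-2021
```

The result is still in MM-dd-yyyy order, so a further step such as `F.to_date` with the pattern 'MM-dd-yyyy' would be needed to get an actual date column.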
