Spark: how to read a CSV with ";" as delimiter and "," as decimal separator?

Time:04-07

I have a problem with a CSV file which contains decimal values like "7,27431439586819e-05":

spark.read.option("header", "true")\
    .option("delimiter", ";")\
    .option("locale", "fr-FR")\
    .option("inferSchema", "true")\
    .csv("file.csv").toPandas()

The comma doesn't seem to be parsed as a standard decimal separator, and I can't find an option to specify the symbol (.option('decimal', ',') doesn't exist, and .option('locale', 'fr-FR') doesn't work).

Do you have any idea? I also tried re.sub("[^0-9]", ".") and then realized the values use scientific notation ('e-'), so that doesn't work either.

CodePudding user response:

Try using PySpark's regexp_replace() function to replace "," with ".", then cast the column to DoubleType().

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Build a small demo DataFrame with comma-decimal strings
df = pd.DataFrame({"Name": ['a', 'b', 'c'],
                   "Measures": ["7,27431439586819e-05", "15,4689439586819e-01", "-2,97431439586819e02"]})
dfs = spark.createDataFrame(df)

# Replace the decimal comma with a dot, then cast to double
dfs_transformed = dfs.withColumn('Measures', F.regexp_replace('Measures', ',', '.').cast(DoubleType()))

dfs_transformed.show()

And you should get correctly typed values:

+----+-------------------+
|Name|           Measures|
+----+-------------------+
|   a|7.27431439586819E-5|
|   b|   1.54689439586819|
|   c|  -297.431439586819|
+----+-------------------+