I have a problem with a CSV file which contains decimal values like "7,27431439586819e-05":
(spark.read.option("header", "true")
    .option("delimiter", ";")
    .option("locale", "fr-FR")  # doesn't work...
    .option("inferSchema", "true")
    .csv("file.csv").toPandas())
The comma doesn't seem to be a standard decimal separator, and I can't find an option to specify the symbol (.option('decimal', ',') doesn't exist, .option('locale', 'fr-FR') doesn't work).
Do you have any idea? I also tried re.sub("[^0-9]", ".") and then realized the values use scientific notation ('e-'), so that doesn't work either.
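For illustration, here is what that substitution actually produces (a minimal sketch using one of the values above):
import re

value = "7,27431439586819e-05"
# Replacing every non-digit also clobbers the 'e' and the '-'
# of the exponent, so the result is not a parseable number:
print(re.sub("[^0-9]", ".", value))  # 7.27431439586819..05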
CodePudding user response:
Try using the regexp_replace() function from pyspark.sql.functions to replace "," with ".", then cast the column to DoubleType().
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

# Sample data reproducing the comma-as-decimal-separator format
df = pd.DataFrame({"Name": ['a', 'b', 'c'],
                   "Measures": ["7,27431439586819e-05", "15,4689439586819e-01", "-2,97431439586819e02"]})
dfs = spark.createDataFrame(df)

# Swap the decimal comma for a dot, then cast the string to a double
dfs_transformed = dfs.withColumn('Measures', F.regexp_replace('Measures', ',', '.').cast(DoubleType()))
dfs_transformed.show()
And you should get correctly typed values:
+----+-------------------+
|Name|           Measures|
+----+-------------------+
|   a|7.27431439586819E-5|
|   b|   1.54689439586819|
|   c|  -297.431439586819|
+----+-------------------+
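Applied to your original file, the same idea would look something like this (a sketch; "file.csv" and the column name "Measures" are placeholders for your actual file and column):
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

# Read everything as strings (skip inferSchema), then fix the separator
df = (spark.read.option("header", "true")
      .option("delimiter", ";")
      .csv("file.csv")
      .withColumn("Measures", F.regexp_replace("Measures", ",", ".").cast(DoubleType()))
      .toPandas())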