Change DataType of multiple columns with pyspark


I'm trying to change the datatype of multiple columns (100 columns) with PySpark. I'm looking for a loop or something similar that can help me change all 100 columns. Any help will be appreciated. This is the syntax that helped me change 3 columns:

from pyspark.sql.types import IntegerType

dfcontract2 = dfcontract \
  .withColumn("Offre durable",
              dfcontract["Offre durable"]
              .cast(IntegerType())) \
  .withColumn("Offre non durable",
              dfcontract["Offre non durable"]
              .cast(IntegerType())) \
  .withColumn("Total",
              dfcontract["Total"]
              .cast(IntegerType()))

dfcontract2.printSchema()

CodePudding user response:

You can use a list comprehension.

from pyspark.sql import functions as func

list_of_cols_to_update = ['col2', 'col3', 'col4']  # specify the columns that need casting

# cast only the listed columns to int; pass the remaining columns through unchanged
data_sdf. \
    select(*[func.col(k).cast('int').alias(k) if k in list_of_cols_to_update else k for k in data_sdf.columns])

Let's print the list from the comprehension to see how it looks:

print([func.col(k).cast('int').alias(k) if k in list_of_cols_to_update else k for k in data_sdf.columns])
# ['col1', Column<'CAST(col2 AS INT) AS `col2`'>, Column<'CAST(col3 AS INT) AS `col3`'>, Column<'CAST(col4 AS INT) AS `col4`'>]

The list_of_cols_to_update list can be generated with a list comprehension as well:

list_of_cols_to_update = ['col' + str(i) for i in range(2, 5)]

print(list_of_cols_to_update)
# ['col2', 'col3', 'col4']
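
Putting it together, here is a minimal, self-contained sketch of this approach; the SparkSession setup and the toy data_sdf values are assumptions for illustration only:

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: every column starts out as a string
data_sdf = spark.createDataFrame(
    [("a", "1", "2", "3")],
    ["col1", "col2", "col3", "col4"],
)

list_of_cols_to_update = ['col' + str(i) for i in range(2, 5)]

# cast the listed columns to int, keep the rest as they are
casted_sdf = data_sdf.select(
    *[func.col(k).cast('int').alias(k) if k in list_of_cols_to_update else k
      for k in data_sdf.columns]
)

casted_sdf.printSchema()
# root
#  |-- col1: string (nullable = true)
#  |-- col2: integer (nullable = true)
#  |-- col3: integer (nullable = true)
#  |-- col4: integer (nullable = true)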

CodePudding user response:

You can employ reduce in conjunction with withColumn to do this:

DataFrame - Reduce

from functools import reduce

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# map every column name to its current data type
schema = {col: col_type for col, col_type in sparkDF.dtypes}

# keep only the columns whose type should be converted
cast_cols = [col for col, col_type in schema.items() if col_type in ["bigint"]]

# fold withColumn over the list, casting one column per step
sparkDF = reduce(
    lambda df, x: df.withColumn(x, F.col(x).cast(IntegerType())),
    cast_cols,
    sparkDF,
)
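
To sanity-check the result, here is a minimal, runnable sketch of the same reduce pattern; the SparkSession and the sample sparkDF are assumptions for illustration only:

from functools import reduce

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: Python ints come out as bigint (long) columns
sparkDF = spark.createDataFrame(
    [("contract_1", 1, 2)],
    ["id", "Offre durable", "Total"],
)

# pick every column whose current type is bigint
cast_cols = [col for col, col_type in sparkDF.dtypes if col_type == "bigint"]

# fold withColumn over the selected columns, casting each one to int
sparkDF = reduce(
    lambda df, x: df.withColumn(x, F.col(x).cast(IntegerType())),
    cast_cols,
    sparkDF,
)

sparkDF.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- Offre durable: integer (nullable = true)
#  |-- Total: integer (nullable = true)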