Apache Spark Dataframe - Get length of each column

Question: In an Apache Spark DataFrame, using Python, how can we get the data type and length of each column? I'm using the latest version of Python.

Using a pandas DataFrame, I do it as follows:

import pandas as pd

df = pd.read_csv(r'C:\TestFolder\myFile1.csv', low_memory=False)

# Print the maximum string length in each column
for col in df:
    print(col, '->', df[col].str.len().max())

CodePudding user response:

PySpark also has a describe method similar to pandas, which you can use in this case:

sparkDF.describe()
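A minimal usage sketch (assuming a DataFrame named sparkDF already exists): describe() returns a new DataFrame of summary statistics, so you would typically call .show() to display it:

# describe() returns a summary DataFrame with count, mean, stddev, min and max;
# call .show() to print it to the console.
sparkDF.describe().show()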

CodePudding user response:

In Spark, you would need to aggregate. This returns a result similar to your pandas version:

df.groupBy().agg(*[F.max(F.length(c)).alias(c) for c in df.columns]).show(vertical=True)

Full test:

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({'col1': ['a','b'], 'col2': ['x','ab']})

# pandas: maximum string length per column
for col in df:
    print(col, '->', df[col].str.len().max())
# col1 -> 1
# col2 -> 2

# Convert to a Spark DataFrame and aggregate the max length of each column
df = spark.createDataFrame(df)

df.groupBy().agg(*[F.max(F.length(c)).alias(c) for c in df.columns]).show(vertical=True)
# -RECORD 0---
#  col1 | 1
#  col2 | 2
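Since the question also asks for the data type of each column, a short sketch (assuming the same Spark DataFrame df as above): the dtypes attribute, or printSchema(), reports each column's type:

# List of (column, type) pairs, e.g. [('col1', 'string'), ('col2', 'string')]
print(df.dtypes)

# Or print the schema as a tree
df.printSchema()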