Question: In an Apache Spark DataFrame, using Python, how can we get the data type and length of each column? I'm using the latest version of Python.
Using a pandas DataFrame, I do it as follows:
import pandas as pd

df = pd.read_csv(r'C:\TestFolder\myFile1.csv', low_memory=False)
for col in df:
    print(col, '->', df[col].str.len().max())
CodePudding user response:
PySpark also has a describe() method, similar to pandas, which you can use in this case:
sparkDF.describe()
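For reference, here is a minimal runnable sketch of what that looks like, with dtypes added to cover the data-type part of the question (the two-column sample DataFrame is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sparkDF = spark.createDataFrame([('a', 'x'), ('b', 'ab')], ['col1', 'col2'])

# Summary statistics per column (count, mean, stddev, min, max)
sparkDF.describe().show()

# Data type of each column as (name, type) pairs
print(sparkDF.dtypes)
# [('col1', 'string'), ('col2', 'string')]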
CodePudding user response:
In Spark, you would need to aggregate. This returns a result similar to your pandas version:
df.groupBy().agg(*[F.max(F.length(c)).alias(c) for c in df.columns]).show(vertical=True)
Full test:
import pandas as pd
from pyspark.sql import functions as F
df = pd.DataFrame({'col1': ['a','b'], 'col2': ['x','ab']})
for col in df:
    print(col, '->', df[col].str.len().max())
# col1 -> 1
# col2 -> 2
# 'spark' is the SparkSession, available by default in the pyspark shell or a notebook
df = spark.createDataFrame(df)
df.groupBy().agg(*[F.max(F.length(c)).alias(c) for c in df.columns]).show(vertical=True)
# -RECORD 0---
# col1 | 1
# col2 | 2
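As a side note, the empty groupBy() is just a global aggregation, so the same expression can also be written with df.agg(...) directly (PySpark documents DataFrame.agg as shorthand for groupBy().agg), and df.dtypes covers the data-type part of the question:

df.agg(*[F.max(F.length(c)).alias(c) for c in df.columns]).show(vertical=True)
# same result as the groupBy() version above

print(df.dtypes)
# [('col1', 'string'), ('col2', 'string')]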