How can I verify the schema (number and name of the columns) of a Dataframe in pyspark?


I have to read a CSV file and verify the name and number of columns of the resulting dataframe. There must be at least 3 columns, and they have to be named 'id', 'name', and 'phone'. Having more columns than that is fine, but the dataframe always needs to contain at least those 3 columns with those exact names. Otherwise, the program should fail.

For example, these are correct:

+-----+-----+-----+   +-----+-----+-----+-----+
|   id| name|phone|   |   id| name|phone|unit |
+-----+-----+-----+   +-----+-----+-----+-----+
|3940A|jhon |1345 |   |3940A|jhon |1345 | 222 |
|2BB56|mike | 492 |   |2BB56|mike | 492 | 333 |
|3(401|jose |2938 |   |3(401|jose |2938 | 444 |
+-----+-----+-----+   +-----+-----+-----+-----+

Incorrect:

+-----+-----+-----+   +-----+-----+
|  sku| nomb|phone|   |  sku| name|
+-----+-----+-----+   +-----+-----+
|3940A|jhon |1345 |   |3940A|jhon |
|2BB56|mike | 492 |   |2BB56|mike |
|3(401|jose |2938 |   |3(401|jose |
+-----+-----+-----+   +-----+-----+

CodePudding user response:

A simple Python if-else statement should do the job:

mandatory_cols = ["id", "name", "phone"]

# every mandatory column must appear among the dataframe's columns
if all(c in df.columns for c in mandatory_cols):
    ...  # your logic
else:
    raise ValueError("missing columns!")
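
To tie this back to the original task, here is a minimal sketch of reading the CSV and failing with a message that names exactly which required columns are absent; the file path "people.csv" is just a placeholder for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "people.csv" is a placeholder path, not a real file from the question
df = spark.read.csv("people.csv", header=True)

mandatory_cols = {"id", "name", "phone"}
missing = mandatory_cols - set(df.columns)
if missing:
    # report exactly which required columns are absent
    raise ValueError(f"missing columns: {sorted(missing)}")

Using a set difference instead of a bare boolean check costs nothing extra and makes the failure message much easier to act on.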

CodePudding user response:

Here's an example of how to check whether the required columns exist in your dataframe:

from pyspark.sql import Row, SparkSession

# create (or reuse) a SparkSession; the original snippet assumed one already existed
spark = SparkSession.builder.getOrCreate()


def check_columns_exist(cols):
    if 'id' in cols and 'name' in cols and 'phone' in cols:
        print("All required columns are present")
    else:
        print("Does not have all the required columns")


data = [Row(id="3940A", name="john", phone="1345", unit=222),
        Row(id="2BB56", name="mike", phone="492", unit=333)]
df = spark.createDataFrame(data)
check_columns_exist(df.columns)

data1 = [Row(id="3940A", name="john", unit=222),
         Row(id="2BB56", name="mike", unit=333)]
df1 = spark.createDataFrame(data1)
check_columns_exist(df1.columns)

RESULT:

All required columns are present
Does not have all the required columns
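
Note that printing only reports the problem; since the question says the program should fail when a required column is missing, one option is a variant of the function above that raises instead:

def check_columns_exist(cols):
    required = {"id", "name", "phone"}
    missing = required - set(cols)
    if missing:
        # fail hard, as the question requires
        raise ValueError(f"Does not have all the required columns: {sorted(missing)}")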