I have to read a CSV file and verify the names and the number of columns in the dataframe. The minimum number of columns is 3, and they have to be 'id', 'name', and 'phone'. Having more columns than that is not a problem, but the dataframe always needs at least those 3 columns with those exact names. Otherwise, the program should fail.
For example:
Correct:
+-----+-----+-----+    +-----+-----+-----+-----+
|   id| name|phone|    |   id| name|phone| unit|
+-----+-----+-----+    +-----+-----+-----+-----+
|3940A| jhon| 1345|    |3940A| jhon| 1345|  222|
|2BB56| mike|  492|    |2BB56| mike|  492|  333|
|3(401| jose| 2938|    |3(401| jose| 2938|  444|
+-----+-----+-----+    +-----+-----+-----+-----+
Incorrect:
+-----+-----+-----+    +-----+-----+
|  sku| nomb|phone|    |  sku| name|
+-----+-----+-----+    +-----+-----+
|3940A| jhon| 1345|    |3940A| jhon|
|2BB56| mike|  492|    |2BB56| mike|
|3(401| jose| 2938|    |3(401| jose|
+-----+-----+-----+    +-----+-----+
CodePudding user response:
A simple Python if-else statement should do the job:
mandatory_cols = ["id", "name", "phone"]
if all(c in df.columns for c in mandatory_cols):
    ...  # your logic
else:
    raise ValueError("missing columns!")
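If you are loading the CSV with PySpark, the same check can run right after the read, so a bad file fails fast and the error names exactly which columns are missing. A minimal sketch; the file path "people.csv" and the header option are assumptions about your input, not something the question specifies:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path; header=True assumes the first CSV row holds column names.
df = spark.read.csv("people.csv", header=True)

mandatory_cols = ["id", "name", "phone"]
missing = [c for c in mandatory_cols if c not in df.columns]
if missing:
    # Fail the program, as required, and report the offending columns.
    raise ValueError(f"missing columns: {missing}")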
CodePudding user response:
Here's an example of how to check whether the columns exist in your dataframe:
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def check_columns_exist(cols):
    # Report whether all three mandatory columns are present.
    if 'id' in cols and 'name' in cols and 'phone' in cols:
        print("All required columns are present")
    else:
        print("Does not have all the required columns")

# Dataframe with an extra 'unit' column -- still valid.
data = [Row(id="3940A", name="john", phone="1345", unit=222),
        Row(id="2BB56", name="mike", phone="492", unit=333)]
df = spark.createDataFrame(data)
check_columns_exist(df.columns)

# Dataframe missing the 'phone' column -- should be rejected.
data1 = [Row(id="3940A", name="john", unit=222),
         Row(id="2BB56", name="mike", unit=333)]
df1 = spark.createDataFrame(data1)
check_columns_exist(df1.columns)
RESULT:
All required columns are present
Does not have all the required columns
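Note that printing only reports the problem, while the question asks for the program to fail. A small variant of the same check that raises instead; choosing ValueError here is an assumption, since the question does not prescribe an exception type:
def require_columns(cols):
    required = {'id', 'name', 'phone'}
    missing = required - set(cols)
    if missing:
        # Stop the program when any mandatory column is absent.
        raise ValueError(f"Does not have all the required columns: {sorted(missing)}")

require_columns(df.columns)   # passes silently
require_columns(df1.columns)  # raises ValueError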