Home > Enterprise >  How to make dictionary of column names in PySpark?
How to make dictionary of column names in PySpark?

Time:11-03

I am receiving files and for some files columns are named differently. For example:

  1. In file 1, column names are: "studentID" , "ADDRESS", "Phone_number".
  2. In file 2, column names are: "Common_ID", "Common_Address", "Mobile_number".
  3. In file 3, column names are: "S_StudentID", "S_ADDRESS", "HOME_MOBILE".

I want to pass a dictionary after loading the file data into dataframes and in that dictionary I want to pass values like:

StudentId -> STUDENT_ID
Common_ID -> STUDENT_ID
S_StudentID -> STUDENT_ID

ADDRESS -> S_ADDRESS
Common_Address -> S_ADDRESS
S_ADDRESS  -> S_ADDRESS

The reason for doing this because in my next dataframe I am reading column names like "STUDENT_ID", "S_ADDRESS" and if it will not find "S_ADDRESS", "STUDENT_ID" names in the dataframe, it will throw error for files whose names are not standardized. I want to run my dataframe and get values from those files after renaming in the above DF and one question when in run the new df will it pick the column name form dictionary having data in it.

CodePudding user response:

You can have the dictionary as you want and use toDF with a list comprehension in order to rename the columns.

Input dataframe and column names:

from pyspark.sql import functions as F

df = spark.createDataFrame([], 'Common_ID string, ADDRESS string, COL3 string')
print(df.columns)
# ['Common_ID', 'ADDRESS', 'COL3']

Dictionary and toDF:

dict_cols = {
    'StudentId':    'STUDENT_ID',
    'Common_ID':    'STUDENT_ID',
    'S_StudentID':  'STUDENT_ID',
    'ADDRESS':        'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS':      'S_ADDRESS'
}
df = df.toDF(*[dict_cols.get(c, c) for c in df.columns])

Resultant column names:

print(df.columns)
# ['STUDENT_ID', 'S_ADDRESS', 'COL3']

CodePudding user response:

Use dict and list comprehension. An easier way and which would work even if some of the columns are not in the list is

 df.toDF(*[dict_cols[x]  if x in dict_cols else x for x in  df.columns ]).show()

 ---------- --------- ---- 
|STUDENT_ID|S_ADDRESS|COL3|
 ---------- --------- ---- 
 ---------- --------- ---- 
  • Related