Renaming the duplicate column name or performing select operation on it in PySpark

Time:09-17


Code:

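from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the notebook's session if one exists

# Sample rows; note that the schema below uses the name 'a' twice
pdf = [(1,'a',4,'a',4.1,'d'),(2,'b',3,'b',3.2,'c'),(3,'c',2,'c',2.3,'b'),(1,'d',1,'d',1.4,'a')]
df15 = spark.createDataFrame(pdf, ('x','y','z','a','b','a'))
df15.show(2)

# Selecting by attribute fails: 'a' is ambiguous
try:
    df15.select(df15.a).show(2)
except Exception:
    print("failed")

df15.columns

# Selecting by position fails too: columns[3] is still just the string 'a'
try:
    df15.select(df15.columns[3]).show(2)
except Exception:
    print("failed")

# Both of these act on every column named 'a'
df15.withColumnRenamed('a', 'b_id').show(2)
df15.drop('a').show(2)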

Output:

+---+---+---+---+---+---+
|  x|  y|  z|  a|  b|  a|
+---+---+---+---+---+---+
|  1|  a|  4|  a|4.1|  d|
|  2|  b|  3|  b|3.2|  c|
+---+---+---+---+---+---+
only showing top 2 rows

failed
failed
+---+---+---+----+---+----+
|  x|  y|  z|b_id|  b|b_id|
+---+---+---+----+---+----+
|  1|  a|  4|   a|4.1|   d|
|  2|  b|  3|   b|3.2|   c|
+---+---+---+----+---+----+
only showing top 2 rows

+---+---+---+---+
|  x|  y|  z|  b|
+---+---+---+---+
|  1|  a|  4|4.1|
|  2|  b|  3|3.2|
+---+---+---+---+
only showing top 2 rows

How can I rename just one of the duplicate columns, or select it?

  • select fails on a duplicated column name
  • withColumnRenamed and drop apply to both columns that share the name

CodePudding user response:

You could define a list of new column names and rename all of the DataFrame's columns at once with toDF, which assigns names by position, then drop whichever column you don't need.

# One name per column, in order; the second 'a' gets its own unique name
new_cols = ['x','y','z','b_id','b','b_id_to_drop']
df = df15.toDF(*new_cols)
df = df.drop('b_id_to_drop')
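
If you'd rather not type the new name list by hand, it can be generated from the existing column names. The following is only a minimal sketch: make_unique is a hypothetical helper (not part of PySpark), and the "_1", "_2" suffix scheme is an assumption, so adapt it to whatever naming you prefer.

```python
from collections import Counter


def make_unique(names):
    """Return a copy of `names` in which repeated names get a numeric suffix.

    Hypothetical helper for illustration; not a PySpark function.
    """
    seen = Counter()
    taken = set(names)  # names already in use, so suffixes never collide
    unique = []
    for name in names:
        seen[name] += 1
        if seen[name] == 1:
            # First occurrence keeps its original name
            unique.append(name)
            continue
        # Later occurrences get the first free "<name>_<n>" suffix
        n = seen[name] - 1
        candidate = f"{name}_{n}"
        while candidate in taken:
            n += 1
            candidate = f"{name}_{n}"
        taken.add(candidate)
        unique.append(candidate)
    return unique


cols = ['x', 'y', 'z', 'a', 'b', 'a']  # same names as df15.columns in the question
new_cols = make_unique(cols)
print(new_cols)  # ['x', 'y', 'z', 'a', 'b', 'a_1']

# With the DataFrame from the question this would be applied as:
#   df = df15.toDF(*make_unique(df15.columns))
#   df.select('a_1').show(2)
```

Once every column has a distinct name, select, withColumnRenamed and drop can each target a single column.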