Renaming a duplicate column name or performing a select operation on it in PySpark

Code:

pdf=[(1,'a',4,'a',4.1,'d'),(2,'b',3,'b',3.2,'c'),(3,'c',2,'c',2.3,'b'),(1,'d',1,'d',1.4,'a')]
df15 = spark.createDataFrame(pdf, ('x','y','z','a','b','a') )
df15.show(2)

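# Selecting by attribute fails because the reference 'a' is ambiguous
try:
    df15.select(df15.a).show(2)
except Exception:
    print("failed")

df15.columns

# Selecting by a name taken from the column list fails for the same reason
try:
    df15.select(df15.columns[3]).show(2)
except Exception:
    print("failed")

# Both of these affect every column named 'a'
df15.withColumnRenamed('a', 'b_id').show(2)
df15.drop('a').show(2)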

Output:

+---+---+---+---+---+---+
|  x|  y|  z|  a|  b|  a|
+---+---+---+---+---+---+
|  1|  a|  4|  a|4.1|  d|
|  2|  b|  3|  b|3.2|  c|
+---+---+---+---+---+---+
only showing top 2 rows

failed
failed
+---+---+---+----+---+----+
|  x|  y|  z|b_id|  b|b_id|
+---+---+---+----+---+----+
|  1|  a|  4|   a|4.1|   d|
|  2|  b|  3|   b|3.2|   c|
+---+---+---+----+---+----+
only showing top 2 rows

+---+---+---+---+
|  x|  y|  z|  b|
+---+---+---+---+
|  1|  a|  4|4.1|
|  2|  b|  3|3.2|
+---+---+---+---+
only showing top 2 rows

How can I rename one of the duplicate columns, or run a select on it?

select fails on a duplicated column name.
withColumnRenamed and drop apply the change to both columns that share the name.

CodePudding user response:

You could define a list of new column names and rename all of the DataFrame's columns at once with toDF, then drop whichever column you don't need.

# toDF assigns names by position, so each duplicate gets its own name
new_cols = ['x','y','z','b_id','b','b_id_to_drop']
df = df15.toDF(*new_cols)
df = df.drop('b_id_to_drop')
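
If the DataFrame has many columns, writing new_cols out by hand gets tedious. Here is a minimal sketch of a helper that builds the list for you by appending a numeric suffix to any repeated name. Note that make_unique and with_unique_columns are hypothetical names of my own, not part of PySpark; the only Spark pieces assumed are df.columns and df.toDF, the same ones already used above.

```python
def make_unique(names):
    """Return a copy of names in which every repeated name gets a numeric suffix."""
    used = set()
    result = []
    for name in names:
        candidate = name
        suffix = 1
        # Keep bumping the suffix until the candidate clashes with nothing emitted so far.
        while candidate in used:
            candidate = f"{name}_{suffix}"
            suffix += 1
        used.add(candidate)
        result.append(candidate)
    return result


def with_unique_columns(df):
    """Rename columns by position so that no two share a name (hypothetical helper)."""
    return df.toDF(*make_unique(df.columns))


print(make_unique(['x', 'y', 'z', 'a', 'b', 'a']))
# ['x', 'y', 'z', 'a', 'b', 'a_1']
```

With the DataFrame from the question, with_unique_columns(df15) should give you the columns x, y, z, a, b, a_1, after which select('a_1') or drop('a_1') targets only the second of the two columns.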