Here's what I did in pandas:
df = df.loc[:, ~df.columns.duplicated()]
How do I do this in PySpark?
I found this, but it takes a lot more code.
CodePudding user response:
You need to make the column names unique and then create a new dataframe with unique columns:
df = spark.createDataFrame([[1, 2, 3, 4, 5, 6]], schema=["A", "B", "B", "C", "C", "C"])
#+---+---+---+---+---+---+
#|  A|  B|  B|  C|  C|  C|
#+---+---+---+---+---+---+
#|  1|  2|  3|  4|  5|  6|
#+---+---+---+---+---+---+
result = list()
for c in df.columns:
    while c in result:
        c = c + "_"  # suffix repeated names: B -> B_, C -> C_, C -> C__
    result.append(c)
#result = ['A', 'B', 'B_', 'C', 'C_', 'C__']

df_unique = spark.createDataFrame(df.rdd, result) \
    .select(*set(df.columns))
# set() keeps only the original unique names, but its iteration order is
# arbitrary, which is why the columns come back as A, C, B here.
#+---+---+---+
#|  A|  C|  B|
#+---+---+---+
#|  1|  4|  2|
#+---+---+---+
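If the round-trip through df.rdd is undesirable, the same rename-then-select idea can be written with toDF, which renames columns positionally. This is a sketch, not part of the original answer; dict.fromkeys is used instead of set so the first occurrence of each name is kept in the original column order:
unique_names = []
for c in df.columns:
    while c in unique_names:
        c = c + "_"
    unique_names.append(c)

# dict.fromkeys deduplicates while preserving first-occurrence order
df_unique = df.toDF(*unique_names).select(*dict.fromkeys(df.columns))
df_unique.show()
#+---+---+---+
#|  A|  B|  C|
#+---+---+---+
#|  1|  2|  4|
#+---+---+---+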
CodePudding user response:
Having had a similar problem, I have been using the function below to drop duplicated columns from a dataframe and return a new one.
Note that it keeps the first occurrence of each duplicated column:
from pyspark.sql import DataFrame

def drop_dup_cols(df: DataFrame) -> DataFrame:
    """
    Input:  a DataFrame, possibly with duplicated column names
    Output: a new DataFrame with only the unique columns (first occurrence kept)
    """
    newcols = []
    dupcols = []
    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)
    # rename every column to its position so duplicates can be dropped unambiguously
    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))
    return df.toDF(*newcols)
It takes a DF and returns the same data without the duplicated columns.
To demonstrate (using @werner's sample DF):
df = spark.createDataFrame([[1, 2, 3, 4, 5, 6]], schema=["A", "B", "B", "C", "C", "C"])
>>> df.show()
+---+---+---+---+---+---+
|  A|  B|  B|  C|  C|  C|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|
+---+---+---+---+---+---+
>>> drop_dup_cols(df).show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  2|  4|
+---+---+---+
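For reference, the same keep-first logic can be condensed into a few lines. This is a sketch under the same assumptions, and drop_dup_cols_compact is my own name for it:
def drop_dup_cols_compact(df: DataFrame) -> DataFrame:
    # positions of the first occurrence of each column name
    keep = [i for i, c in enumerate(df.columns) if c not in df.columns[:i]]
    # rename positionally, select the kept positions, restore the original names
    renamed = df.toDF(*[str(i) for i in range(len(df.columns))])
    return renamed.select(*[str(i) for i in keep]).toDF(*dict.fromkeys(df.columns))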
CodePudding user response:
You may have to select the required columns manually, or build a new DataFrame with renamed columns. This is because Spark DataFrames are immutable.
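A minimal sketch of the manual-selection route, assuming the duplicate names come from a join (df1, df2, and the column names here are hypothetical):
# df1 and df2 both have "id" and "value" columns; joining them would
# otherwise produce duplicated column names in the result.
joined = df1.join(df2, df1["id"] == df2["id"])
deduped = joined.select(df1["id"], df1["value"], df2["value"].alias("value_2"))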