Here's what I did in pandas:
df = df.loc[:, ~df.columns.duplicated()]
How do I do this in PySpark?
I found this, but it takes a lot more code.
CodePudding user response:
You need to make the column names unique and then create a new dataframe with unique columns:
df = spark.createDataFrame([[1, 2, 3, 4, 5, 6]], schema=["A", "B", "B", "C", "C", "C"])
#+---+---+---+---+---+---+
#|  A|  B|  B|  C|  C|  C|
#+---+---+---+---+---+---+
#|  1|  2|  3|  4|  5|  6|
#+---+---+---+---+---+---+
result = list()
for c in df.columns:
    while c in result:
        c = c + "_"  # suffix repeated names: B -> B_, C -> C_, C -> C__
    result.append(c)
#result = ['A', 'B', 'B_', 'C', 'C_', 'C__']

df_unique = spark.createDataFrame(df.rdd, result) \
    .select(*set(df.columns))
# set() keeps only the original unique names, but its iteration order is
# arbitrary, which is why the columns come back as A, C, B here.
#+---+---+---+
#|  A|  C|  B|
#+---+---+---+
#|  1|  4|  2|
#+---+---+---+
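If the round-trip through df.rdd is undesirable, the same rename-then-select idea can be written with toDF, which renames columns positionally. This is a sketch, not part of the original answer; dict.fromkeys is used instead of set so the first occurrence of each name is kept in the original column order:
unique_names = []
for c in df.columns:
    while c in unique_names:
        c = c + "_"
    unique_names.append(c)

# dict.fromkeys deduplicates while preserving first-occurrence order
df_unique = df.toDF(*unique_names).select(*dict.fromkeys(df.columns))
df_unique.show()
#+---+---+---+
#|  A|  B|  C|
#+---+---+---+
#|  1|  2|  4|
#+---+---+---+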
CodePudding user response:
Having had a similar problem, I have been using the function below to drop duplicated columns from a dataframe and return a new one.
Note that it keeps the first occurrence of each duplicated column:
from pyspark.sql import DataFrame

def drop_dup_cols(df: DataFrame) -> DataFrame:
    """
    Input:  a DataFrame, possibly with duplicated column names
    Output: a new DataFrame with only the unique columns (first occurrence kept)
    """
    newcols = []
    dupcols = []
    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)
    # rename every column to its position so duplicates can be dropped unambiguously
    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))
    return df.toDF(*newcols)
It takes a DF and returns the same data without the duplicated columns.
To demonstrate (using @werner's sample DF):
df = spark.createDataFrame([[1, 2, 3, 4, 5, 6]], schema=["A", "B", "B", "C", "C", "C"])
>>> df.show()
+---+---+---+---+---+---+
|  A|  B|  B|  C|  C|  C|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|
+---+---+---+---+---+---+
>>> drop_dup_cols(df).show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  2|  4|
+---+---+---+
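For reference, the same keep-first logic can be condensed into a few lines. This is a sketch under the same assumptions, and drop_dup_cols_compact is my own name for it:
def drop_dup_cols_compact(df: DataFrame) -> DataFrame:
    # positions of the first occurrence of each column name
    keep = [i for i, c in enumerate(df.columns) if c not in df.columns[:i]]
    # rename positionally, select the kept positions, restore the original names
    renamed = df.toDF(*[str(i) for i in range(len(df.columns))])
    return renamed.select(*[str(i) for i in keep]).toDF(*dict.fromkeys(df.columns))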
CodePudding user response:
You may have to select the required columns manually, or build a new DataFrame with renamed columns. This is because Spark DataFrames are immutable.
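A minimal sketch of the manual-selection route, assuming the duplicate names come from a join (df1, df2, and the column names here are hypothetical):
# df1 and df2 both have "id" and "value" columns; joining them would
# otherwise produce duplicated column names in the result.
joined = df1.join(df2, df1["id"] == df2["id"])
deduped = joined.select(df1["id"], df1["value"], df2["value"].alias("value_2"))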