remove rows when values from two columns are the same combination in pyspark


I have a Spark dataframe like the one below (just an example; my real data has millions of rows):

import pandas as pd

df = pd.DataFrame({'ZIP1': ['50069', '50069', '50704', '50704', '52403', '52403'],
                   'ZIP2': ['50704', '52403', '50069', '52403', '50069', '50704'],
                   'STATE': ['IA', 'IA', 'IA', 'IA', 'IA', 'IA'],
                   'REGION': ['MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST']})
sdf = spark.createDataFrame(df)

    ZIP1    ZIP2    STATE   REGION
0   50069   50704   IA  MIDWEST
1   50069   52403   IA  MIDWEST
2   50704   50069   IA  MIDWEST
3   50704   52403   IA  MIDWEST
4   52403   50069   IA  MIDWEST
5   52403   50704   IA  MIDWEST

If the zip codes in the ZIP1 and ZIP2 columns form the same combination, I need to remove one of the rows. For example, rows 0 and 2 contain the same pair of zip codes, just in reversed order, so I need to remove either row 0 or row 2. Likewise, remove either row 1 or row 4, and so on.

Does anyone know how to achieve this in PySpark? A PySpark solution is needed; solutions in both PySpark and plain Python would be a plus. Thanks!

CodePudding user response:

Pandas solution

import numpy as np

cols = ['ZIP1', 'ZIP2']
df[cols] = np.sort(df[cols], axis=1)
df = df.drop_duplicates(subset=cols)
Result
    ZIP1   ZIP2 STATE   REGION
0  50069  50704    IA  MIDWEST
1  50069  52403    IA  MIDWEST
3  50704  52403    IA  MIDWEST

Pyspark solution

from pyspark.sql import functions as F

cols = ['ZIP1', 'ZIP2']
sdf.withColumn('arr', F.array_sort(F.array(*cols))).drop_duplicates(['arr']).drop('arr')
Result
+-----+-----+-----+-------+
| ZIP1| ZIP2|STATE| REGION|
+-----+-----+-----+-------+
|50069|50704|   IA|MIDWEST|
|50069|52403|   IA|MIDWEST|
|50704|52403|   IA|MIDWEST|
+-----+-----+-----+-------+

In both solutions the main idea is the same: sort the values of columns ZIP1 and ZIP2 within each row, then drop duplicates based on the sorted values.

CodePudding user response:

from pyspark.sql.functions import array, array_sort

new = (sdf.withColumn('grouper', array_sort(array('ZIP1', 'ZIP2')))  # create grouper column
          .drop_duplicates(['grouper'])                              # use grouper to drop duplicates
          .drop('grouper'))                                          # drop grouper column
new.show()

CodePudding user response:

This is one way it could be done in pandas:

df = df.iloc[df[['ZIP1', 'ZIP2']].apply(set, axis=1).drop_duplicates().index]
# df.iloc[df.apply(set, axis=1).drop_duplicates().index]
# also works if you want to include all columns

Output:

    ZIP1   ZIP2 STATE   REGION
0  50069  50704    IA  MIDWEST
1  50069  52403    IA  MIDWEST
3  50704  52403    IA  MIDWEST

CodePudding user response:

  1. Create a new DataFrame DF1 with ZIP1 and ZIP2 in ascending order within each row, i.e. ZIP1 always smaller than or equal to ZIP2.
  2. Create a new DataFrame DF2 by calling .distinct() on DF1.
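Since the question also asked for a plain-Python version as a plus, the two steps above can be sketched in pandas (a minimal sketch using the example frame from the question; the `df1`/`df2` names mirror the DF1/DF2 above):

```python
import pandas as pd

df = pd.DataFrame({'ZIP1': ['50069', '50069', '50704', '50704', '52403', '52403'],
                   'ZIP2': ['50704', '52403', '50069', '52403', '50069', '50704'],
                   'STATE': ['IA'] * 6,
                   'REGION': ['MIDWEST'] * 6})

# Step 1: order each pair so that ZIP1 <= ZIP2 on every row.
df1 = df.copy()
swap = df1['ZIP1'] > df1['ZIP2']
df1.loc[swap, ['ZIP1', 'ZIP2']] = df1.loc[swap, ['ZIP2', 'ZIP1']].values

# Step 2: reversed pairs are now identical rows, so "distinct" is a plain drop_duplicates.
df2 = df1.drop_duplicates()
print(df2)
```

In PySpark, step 1 would correspond to building the two columns with `F.least('ZIP1', 'ZIP2')` and `F.greatest('ZIP1', 'ZIP2')` before calling `.distinct()`.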