remove rows when values from two columns are the same combination in pyspark


I have a Spark dataframe like the one below (just an example; my real data has millions of rows):

import pandas as pd

df = pd.DataFrame({'ZIP1': ['50069', '50069', '50704', '50704', '52403', '52403'],
                   'ZIP2': ['50704', '52403', '50069', '52403', '50069', '50704'],
                   'STATE': ['IA', 'IA', 'IA', 'IA', 'IA', 'IA'],
                   'REGION': ['MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST']})
sdf = spark.createDataFrame(df)

    ZIP1    ZIP2    STATE   REGION
0   50069   50704   IA  MIDWEST
1   50069   52403   IA  MIDWEST
2   50704   50069   IA  MIDWEST
3   50704   52403   IA  MIDWEST
4   52403   50069   IA  MIDWEST
5   52403   50704   IA  MIDWEST

If the zip codes in the ZIP1 and ZIP2 columns form the same combination, I need to remove one of the rows. For example, rows 0 and 2 contain the same pair of zip codes, just in reversed order, so I need to remove either row 0 or row 2. Likewise, remove either row 1 or row 4, and so on.

Does anyone know how to achieve this in PySpark? A PySpark solution is needed; solutions in both PySpark and plain Python would be a plus. Thanks!

CodePudding user response:

Pandas solution

import numpy as np

cols = ['ZIP1', 'ZIP2']
df[cols] = np.sort(df[cols], axis=1)
df = df.drop_duplicates(subset=cols)
Result
    ZIP1   ZIP2 STATE   REGION
0  50069  50704    IA  MIDWEST
1  50069  52403    IA  MIDWEST
3  50704  52403    IA  MIDWEST

Pyspark solution

from pyspark.sql import functions as F

cols = ['ZIP1', 'ZIP2']
sdf.withColumn('arr', F.array_sort(F.array(*cols))).drop_duplicates(['arr']).drop('arr')
Result
+-----+-----+-----+-------+
| ZIP1| ZIP2|STATE| REGION|
+-----+-----+-----+-------+
|50069|50704|   IA|MIDWEST|
|50069|52403|   IA|MIDWEST|
|50704|52403|   IA|MIDWEST|
+-----+-----+-----+-------+

In both solutions the main idea is the same: sort the values of columns ZIP1 and ZIP2 within each row, then drop duplicates based on the sorted values.

CodePudding user response:

from pyspark.sql.functions import array, array_sort

new = (sdf.withColumn('grouper', array_sort(array('ZIP1', 'ZIP2')))  # create grouper column
          .drop_duplicates(['grouper'])                              # use grouper to drop duplicates
          .drop('grouper'))                                          # drop grouper column
new.show()

CodePudding user response:

This is one way it could be done in pandas:

df = df.iloc[df[['ZIP1', 'ZIP2']].apply(set, axis=1).drop_duplicates().index]
# df.iloc[df.apply(set, axis=1).drop_duplicates().index]
# also works if you want to include all columns

Output:

    ZIP1   ZIP2 STATE   REGION
0  50069  50704    IA  MIDWEST
1  50069  52403    IA  MIDWEST
3  50704  52403    IA  MIDWEST

CodePudding user response:

  1. Create a new DataFrame DF1 with ZIP1 and ZIP2 in ascending order within each row, i.e. ZIP1 always smaller than or equal to ZIP2.
  2. Create a new DataFrame DF2 by calling .distinct() on DF1.
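Since the question also asked for a plain-Python version as a plus, the two steps above can be sketched in pandas (a minimal sketch using the example frame from the question; the `df1`/`df2` names mirror the DF1/DF2 above):

```python
import pandas as pd

df = pd.DataFrame({'ZIP1': ['50069', '50069', '50704', '50704', '52403', '52403'],
                   'ZIP2': ['50704', '52403', '50069', '52403', '50069', '50704'],
                   'STATE': ['IA'] * 6,
                   'REGION': ['MIDWEST'] * 6})

# Step 1: order each pair so that ZIP1 <= ZIP2 on every row.
df1 = df.copy()
swap = df1['ZIP1'] > df1['ZIP2']
df1.loc[swap, ['ZIP1', 'ZIP2']] = df1.loc[swap, ['ZIP2', 'ZIP1']].values

# Step 2: reversed pairs are now identical rows, so "distinct" is a plain drop_duplicates.
df2 = df1.drop_duplicates()
print(df2)
```

In PySpark, step 1 would correspond to building the two columns with `F.least('ZIP1', 'ZIP2')` and `F.greatest('ZIP1', 'ZIP2')` before calling `.distinct()`.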