I have a Spark dataframe like the one below (just an example; my real data has millions of rows):
import pandas as pd

df = pd.DataFrame({'ZIP1': ['50069', '50069', '50704', '50704', '52403', '52403'],
                   'ZIP2': ['50704', '52403', '50069', '52403', '50069', '50704'],
                   'STATE': ['IA', 'IA', 'IA', 'IA', 'IA', 'IA'],
                   'REGION': ['MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST']})
sdf = spark.createDataFrame(df)
ZIP1 ZIP2 STATE REGION
0 50069 50704 IA MIDWEST
1 50069 52403 IA MIDWEST
2 50704 50069 IA MIDWEST
3 50704 52403 IA MIDWEST
4 52403 50069 IA MIDWEST
5 52403 50704 IA MIDWEST
If two zipcodes from the ZIP1 and ZIP2 columns are the same combination, I need to remove one of the rows. For example, rows 0 and 2 contain the same combination of zipcodes, just in reversed order, so I need to remove either row 0 or row 2. Likewise, remove either row 1 or row 4, and so on.
Does anyone know how to achieve this in PySpark? A PySpark solution is needed; if someone can provide solutions in both PySpark and plain Python, that is a plus. Thanks!
CodePudding user response:
Pandas solution
import numpy as np

cols = ['ZIP1', 'ZIP2']
df[cols] = np.sort(df[cols], axis=1)  # sort the zip pair within each row
df.drop_duplicates(subset=cols)
Result
ZIP1 ZIP2 STATE REGION
0 50069 50704 IA MIDWEST
1 50069 52403 IA MIDWEST
3 50704 52403 IA MIDWEST
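Note that np.sort here overwrites the ZIP1/ZIP2 values in place, so a surviving row may end up with its zips swapped relative to the original data. If the original column order must be preserved, a variant sketch (starting from the unmodified df) deduplicates on a temporary sorted key instead:
key = pd.DataFrame(np.sort(df[cols], axis=1), index=df.index)
df_out = df[~key.duplicated()]  # keep the first row of each unordered zip pair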
Pyspark solution
from pyspark.sql import functions as F

cols = ['ZIP1', 'ZIP2']
sdf.withColumn('arr', F.array_sort(F.array(*cols))).drop_duplicates(['arr']).drop('arr')
Result
+-----+-----+-----+-------+
| ZIP1| ZIP2|STATE| REGION|
+-----+-----+-----+-------+
|50069|50704|   IA|MIDWEST|
|50069|52403|   IA|MIDWEST|
|50704|52403|   IA|MIDWEST|
+-----+-----+-----+-------+
In both solutions the main idea is to sort the values in the ZIP1 and ZIP2 columns, then drop the duplicates based on the sorted values.
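Since the question asks for a plain-Python solution as a plus, here is a minimal sketch of the same idea without pandas or Spark, keying each row by its unordered zip pair (the rows are copied from the example data):
seen = set()
deduped = []
for row in [('50069', '50704', 'IA', 'MIDWEST'),
            ('50069', '52403', 'IA', 'MIDWEST'),
            ('50704', '50069', 'IA', 'MIDWEST'),
            ('50704', '52403', 'IA', 'MIDWEST'),
            ('52403', '50069', 'IA', 'MIDWEST'),
            ('52403', '50704', 'IA', 'MIDWEST')]:
    key = frozenset(row[:2])  # unordered (ZIP1, ZIP2) pair
    if key not in seen:       # keep only the first row seen for each pair
        seen.add(key)
        deduped.append(row)
print(deduped)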
CodePudding user response:
from pyspark.sql.functions import array, array_sort

new = (sdf.withColumn('grouper', array_sort(array('ZIP1', 'ZIP2')))  # create grouper column
          .drop_duplicates(['grouper'])                              # use grouper to drop duplicates
          .drop('grouper'))                                          # drop grouper column
new.show()
CodePudding user response:
This is one way it could be done in pandas:
df = df.loc[df[['ZIP1', 'ZIP2']].apply(frozenset, axis=1).drop_duplicates().index]
# frozenset (not set) is used so the zip pairs are hashable for drop_duplicates.
# df.loc[df.apply(frozenset, axis=1).drop_duplicates().index]
# also works if you want to include all columns~
Output:
ZIP1 ZIP2 STATE REGION
0 50069 50704 IA MIDWEST
1 50069 52403 IA MIDWEST
3 50704 52403 IA MIDWEST
CodePudding user response:
- Create a new DF1 with zip1 and zip2 in ascending order, i.e. zip1 always smaller than or equal to zip2 in each row.
- Create a new DF2 with .distinct() from DF1. A minimal sketch of both steps is shown below.
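A minimal sketch of those two steps, assuming the sdf DataFrame from the question (F.least/F.greatest are one way to put each pair in ascending order):
from pyspark.sql import functions as F

# Step 1: DF1 with the zip pair in ascending order in each row
df1 = sdf.select(F.least('ZIP1', 'ZIP2').alias('ZIP1'),
                 F.greatest('ZIP1', 'ZIP2').alias('ZIP2'),
                 'STATE', 'REGION')
# Step 2: DF2 with exact duplicate rows removed
df2 = df1.distinct()
df2.show()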