I'm working with a huge PySpark dataset and I need to remove some rows based on a set of conditions. For example, my source df looks like this:
>>> BUSINESS_INFO
business_model| classification| region | max_position
______________________________________________________________
0 food delivery | popular | South | 2
1 sports | success | West | 3
2 fitness | popular | Southeast | 1
3 videos | famous | East | 4
Now I have to use the data above to determine which rows to remove from this second data frame:
>>> APP_RANK_AT_THE_STORE
business_model| classification| region | date | rank
_____________________________________________________________________
0 food delivery | popular | South | 2021-05-01 | 1
1 food delivery | popular | South | 2021-05-01 | 2
2 food delivery | popular | South | 2021-05-01 | 3
3 sports | success | West | 2021-05-02 | 1
4 sports | success | West | 2021-05-02 | 2
5 sports | success | West | 2021-05-02 | 3
6 videos | famous | East | 2021-05-03 | 1
7 videos | famous | East | 2021-05-03 | 2
8 videos | famous | East | 2021-05-03 | 3
9 videos | famous | East | 2021-05-03 | 4
10 videos | famous | East | 2021-05-03 | 5
11 fitness | popular | Southeast | 2021-05-04 | 1
12 fitness | popular | Southeast | 2021-05-04 | 2
13 fitness | popular | Southeast | 2021-05-04 | 3
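For reference, here is a minimal sketch of how these two frames could be created (the SparkSession variable spark and the names df1/df2 are just placeholders I'm using here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# BUSINESS_INFO: the allowed maximum rank per (business_model, classification, region)
df1 = spark.createDataFrame(
    [
        ('food delivery', 'popular', 'South', 2),
        ('sports', 'success', 'West', 3),
        ('fitness', 'popular', 'Southeast', 1),
        ('videos', 'famous', 'East', 4),
    ],
    ['business_model', 'classification', 'region', 'max_position'],
)

# APP_RANK_AT_THE_STORE: the large frame that has to be filtered
df2 = spark.createDataFrame(
    [
        ('food delivery', 'popular', 'South', '2021-05-01', 1),
        ('food delivery', 'popular', 'South', '2021-05-01', 2),
        ('food delivery', 'popular', 'South', '2021-05-01', 3),
        ('sports', 'success', 'West', '2021-05-02', 1),
        ('sports', 'success', 'West', '2021-05-02', 2),
        ('sports', 'success', 'West', '2021-05-02', 3),
        ('videos', 'famous', 'East', '2021-05-03', 1),
        ('videos', 'famous', 'East', '2021-05-03', 2),
        ('videos', 'famous', 'East', '2021-05-03', 3),
        ('videos', 'famous', 'East', '2021-05-03', 4),
        ('videos', 'famous', 'East', '2021-05-03', 5),
        ('fitness', 'popular', 'Southeast', '2021-05-04', 1),
        ('fitness', 'popular', 'Southeast', '2021-05-04', 2),
        ('fitness', 'popular', 'Southeast', '2021-05-04', 3),
    ],
    ['business_model', 'classification', 'region', 'date', 'rank'],
)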
Basically, I have to remove every row from the second data frame that doesn't fit the criteria set in the BUSINESS_INFO data frame. For example: I can only keep the first 2 records where the business model is food delivery, the classification is popular, and the region is South, because max_position is 2 for that combination. Everything else would be dropped, and the result would look like this:
>>> FINAL_DF
business_model| classification| region | date | rank
_____________________________________________________________________
0 food delivery | popular | South | 2021-05-01 | 1
1 food delivery | popular | South | 2021-05-01 | 2
2 sports | success | West | 2021-05-02 | 1
3 sports | success | West | 2021-05-02 | 2
4 sports | success | West | 2021-05-02 | 3
5 videos | famous | East | 2021-05-03 | 1
6 videos | famous | East | 2021-05-03 | 2
7 videos | famous | East | 2021-05-03 | 3
8 videos | famous | East | 2021-05-03 | 4
9 fitness | popular | Southeast | 2021-05-04 | 1
Is there any way to do that? Thank you!
Answer:
You can do a normal join (the default join type in PySpark is an inner join) and use a >= condition on the rank. With df1 as BUSINESS_INFO and df2 as APP_RANK_AT_THE_STORE, it looks like so:
# df1 = BUSINESS_INFO (one row per group, with max_position)
# df2 = APP_RANK_AT_THE_STORE (the frame to filter)
(df1
    .join(df2, on=[
        df1['business_model'] == df2['business_model'],
        df1['classification'] == df2['classification'],
        df1['region'] == df2['region'],
        df1['max_position'] >= df2['rank'],   # keep only ranks up to the allowed maximum
    ])
    # drop the duplicated join columns and the helper column
    .drop(df2['business_model'])
    .drop(df2['classification'])
    .drop(df2['region'])
    .drop(df1['max_position'])
    .show()
)
# Output
# +--------------+--------------+---------+----------+----+
# |business_model|classification|   region|      date|rank|
# +--------------+--------------+---------+----------+----+
# |        sports|       success|     West|2021-05-02|   1|
# |        sports|       success|     West|2021-05-02|   2|
# |        sports|       success|     West|2021-05-02|   3|
# |        videos|        famous|     East|2021-05-03|   1|
# |        videos|        famous|     East|2021-05-03|   2|
# |        videos|        famous|     East|2021-05-03|   3|
# |        videos|        famous|     East|2021-05-03|   4|
# | food delivery|       popular|    South|2021-05-01|   1|
# | food delivery|       popular|    South|2021-05-01|   2|
# |       fitness|       popular|Southeast|2021-05-04|   1|
# +--------------+--------------+---------+----------+----+
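If you only need the columns of APP_RANK_AT_THE_STORE in the result (so there is nothing to drop afterwards), a left-semi join is an alternative worth considering, and since BUSINESS_INFO is tiny compared with the main frame, a broadcast hint can save a shuffle. A sketch under the same df1/df2 assumptions as above:

from pyspark.sql import functions as F

final_df = df2.join(
    F.broadcast(df1),  # hint: BUSINESS_INFO is small enough to ship to every executor
    on=[
        df2['business_model'] == df1['business_model'],
        df2['classification'] == df1['classification'],
        df2['region'] == df1['region'],
        df2['rank'] <= df1['max_position'],
    ],
    how='left_semi',  # keeps only matching df2 rows, and only df2's columns
)

final_df.orderBy('date', 'rank').show()

The broadcast hint is optional; Spark may broadcast a small frame on its own, but being explicit makes the intent clear when the other side is huge.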