So I have this data frame hotel_weather_top_ten
:
---------- ----- -------- ----------
| id|month|abs_diff|row_number|
------------- ----- -------- ----------
| 996432412678| 8| 9.0| 1|
| 970662608897| 8| 9.0| 2|
| 730144440321| 8| 9.0| 3|
| 824633720837| 8| 9.0| 4|
| 953482739712| 8| 8.0| 5|
|1013612281861| 8| 8.0| 6|
|1288490188802| 8| 8.0| 7|
| 429496729601| 8| 8.0| 8|
| 429496729604| 8| 8.0| 9|
|1262720385026| 8| 8.0| 10|
| 146028888072| 9| 19.0| 1|
|1571958030336| 9| 19.0| 2|
| 146028888068| 9| 16.0| 3|
| 25769803779| 9| 15.0| 4|
|1099511627777| 9| 13.0| 5|
| 206158430212| 9| 13.0| 6|
|1262720385024| 9| 13.0| 7|
... and so on
Basically it just shows top 10 abs_diff
value for every month in every hotel.
And now I wanna join it with hotel_weather_names
so I have hotels names:
------------- --------------------
| id| name|
------------- --------------------
| 858993459206| 2608 E Malone Ave|
|1400159338497| 345 Chestnut St|
| 463856467968| 45400 Park Ave|
| 481036337152| 1118 Government St|
|1709396983810|165 Saint Emanuel St|
| 790273982465|1237 Us Highway 4...|
| 231928233986| Maslinica Bay|
| 592705486849| 1223 Radford Blvd|
| 515396075520| 3750 Meridian St|
| 352187318275| 1407 E Rusk St|
|1202590842884| 2030 Formosa Rd|
| 747324309509| 602 E 4th St|
| 695784701955|10821 Caribbean Blvd|
| 644245094400|1725 Long Beach Blvd|
|1176821039105| 15902 S Western Ave|
|1365799600130|3201 E Pacific Co...|
| 309237645313| 8541 S Hampton Rd|
| 678604832769| 221 Shultz Rd|
| 755914244097|1320 Harrisburg Pike|
| 523986010113| 615 Clark Rd|
------------- --------------------
only showing top 20 rows
But somehow when I do the join the result is not what I am expecting. e.g.:
| id|month|abs_diff|row_number| name|
------------ ----- -------- ---------- -----------------
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
------------ ----- -------- ---------- -----------------
only showing top 20 rows
So there is no top 10 anymore. How do I keep the order and just add names accordingly?
This is how I am joining them: result = hotel_weather_top_ten.join(hotel_weather_names, on='id', how='inner')
Thank you beforehand!
CodePudding user response:
You can do that with the following.
from pyspark.sql import functions
hotel_weather_top_ten = (hotel_weather_top_ten
.withColumnRenamed("id", "id1")
.withColumn("order", functions.monotonically_increasing_id()))
hotel_weather_top_ten = (hotel_weather_top_ten
.join(hotel_weather_names,
on=hotel_weather_top_ten["id1"] == hotel_weather_names["id"],
how="inner")
.orderBy("order")
.select("id","month","abs_diff","row_number","name")
CodePudding user response:
The output you get, could be due to presence of duplicate id
and name
mapping in hotel_weather_names
. Applying a hotel_weather_names.distinct()
could be one potential solution.