I have to provide the names of the top 10 users who provided the most tips from the Yelp dataset using the tip
and user
tables. When I run tip.printSchema()
I get:
root
|-- business_id: string (nullable = true)
|-- compliment_count: long (nullable = true)
|-- date: string (nullable = true)
|-- text: string (nullable = true)
|-- user_id: string (nullable = true)
And the user.printSchema()
returns:
root
|-- average_stars: double (nullable = true)
|-- compliment_cool: long (nullable = true)
|-- compliment_cute: long (nullable = true)
|-- compliment_funny: long (nullable = true)
|-- compliment_hot: long (nullable = true)
|-- compliment_list: long (nullable = true)
|-- compliment_more: long (nullable = true)
|-- compliment_note: long (nullable = true)
|-- compliment_photos: long (nullable = true)
|-- compliment_plain: long (nullable = true)
|-- compliment_profile: long (nullable = true)
|-- compliment_writer: long (nullable = true)
|-- cool: long (nullable = true)
|-- elite: string (nullable = true)
|-- fans: long (nullable = true)
|-- friends: string (nullable = true)
|-- funny: long (nullable = true)
|-- name: string (nullable = true)
|-- review_count: long (nullable = true)
|-- useful: long (nullable = true)
|-- user_id: string (nullable = true)
|-- yelping_since: string (nullable = true)
So I need to count which user_id
's had the most tips from the tip
table and join that with the name
's in the user
table. I have tried this query to get it started:
query = """
SELECT tip.user_id, count(*), user.name
FROM tip
GROUP BY user_id
LIMIT 10
LEFT JOIN user
ON tip.user_id = user.name
"""
spark.sql(query).show()
because I want to get just the top ten user_id's from tip and match them with their names from user. This returns an error however:
Py4JJavaError: An error occurred while calling o24.sql.
: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'LEFT' expecting <EOF>(line 6, pos 4)
== SQL ==
SELECT user_id, count(*)
FROM tip
GROUP BY user_id
LIMIT 10
LEFT JOIN user
----^^^
ON tip.user_id = user.name
Here are some sample rows from the tip
table:
[Row(business_id='VaKXUpmWTTWDKbpJ3aQdMw', compliment_count=0, date='2014-03-27 03:51:24', text='Great for watching games, ufc, and whatever else tickles yer fancy', user_id='UPw5DWs_b-e2JRBS-t37Ag'),
Row(business_id='OPiPeoJiv92rENwbq76orA', compliment_count=0, date='2013-05-25 06:00:56', text='Happy Hour 2-4 daily with 1/2 price drinks and slushes AND after 8 half price shakes. They actually have a peanut butter and bacon shake.', user_id='Ocha4kZBHb4JK0lOWvE0sg'),
Row(business_id='5KheTjYPu1HcQzQFtm4_vw', compliment_count=0, date='2011-12-26 01:46:17', text='Good chips and salsa. Loud at times. Good service. Bathrooms AWFUL. So that tanks my view on this place.', user_id='jRyO2V1pA4CdVVqCIOPc1Q')]
And some samples from the user
table:
[Row(average_stars=4.03, compliment_cool=1, compliment_cute=0, compliment_funny=1, compliment_hot=2, compliment_list=0, compliment_more=0, compliment_note=1, compliment_photos=0, compliment_plain=1, compliment_profile=0, compliment_writer=2, cool=25, elite='2015,2016,2017', fans=5, friends='c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g, ajcnq75Z5xxkvUSmmJ1bCg, BSMAmp2-wMzCkhTfq9ToNg, jka10dk9ygX76hJG0gfPZQ, dut0e4xvme7QSlesOycHQA, l4l5lBnK356zBua7B-UJ6Q, 0HicMOOs-M_gl2eO-zES4Q, _uI57wL2fLyftrcSFpfSGQ, T4_Qd0YWbC3co6WSMw4vxg, iBRoLWPtWmsI1kdbE9ORSA, xjrUcid6Ymq0DoTJELkYyw, GqadWVzJ6At-vgLzK_SKgA, DvB13VJBmSnbFXBVBsKmDA, vRP9nQkYTeNioDjtxZlVhg, gT0A1iN3eeQ8EMAjJhwQtw, 6yCWjFPtp_AD4x93WAwmnw, 1dKzpNnib-JlViKv8_Gt5g, 3Bv4_JxHXq-gVLOxYMQX0Q, ikQyfu1iViYh8T0us7wiFQ, f1GGltNaB7K5DR1jf3dOmg, tgeFUChlh7v8bZFVl2-hjQ, -9-9oyXlqsMG2he5xIWdLQ, Adj9fBPVJad8vSs-mIP7gw, Ce49RY8CKXVsTifxRYFTsw, M1_7TLi8CbdA89nFLlH4iw, wFsNv-hqbW_F5-IRqfBN6g, 0Q1L7zXHocaUZ2gsG2XJeg, cBFgmOCBdhYa0xoFEAzp_g, VrD_AgiFvzqtlR15vir3SQ, cpE-7HK514Sr5vpSen9CEQ, F1UYelhPFB-zIKlt0ygIZg, CQAL1hvsLMCzuJf9AglsXw, 1KnY1wr15WfEWIRLB9IS6g, QWFQ-kXBiLbid-lm5Jr3dQ, nymT8liFugCrM16lTy0ZfQ, qj69bdd885heDvUPCyHd2Q, DySCZZcgbdrlHgEovk5y9w, lZMJIDuvhT9Dy4KyquLXyA, b_9Gn7wS93AoPZPR0dIJqQ, N07g1IaLh0_6sUjtiSRe4w, YdfPX_7DxSnKvvdCJ57iOw, 8GYryZPD22W7WgQ8kvMkEQ, cpQmAgOWatghp14h1pn1dQ, EnchhymLYMqftCRjqvVWmw, -JdfKhFktE7Zs9BMDFcPeQ, uWhC9eof98zPkvsalgaqJw, eyTlNDDaiPatfe6mheIZ0g, VfHq0o73aKsODvfAhwAQtg, kvD5tICngLAaQDujSFWupA, dXacwEhqi9-3_XT6JeH0Og, NfU0zDaTMEQ4-X9dbQWd9A, cTHWBdjSKbctSUIvWsgFxw, 3IEtCbSDF5t7RkZ20T6s9A, HJJXTrp6UybYyPdQ9DA0JA, JaXogQFVjzGRAeBvzamBHg, NUonfKkjS1iVqnNITtgXZg, D5vaJAYp0sOrGfsj9qvsMA, H27Ecbwwu4FGAlLgICourw, S8HrLmMiE4u8FWYWkNEoTw, Io36Y3xWQcIX9rYvPcYfXQ, J5mcqh8KxYpqjaLBNlwcig, -nTB3_08g06fD0GT8AtDBQ, wMpFA46lihK8oFns_5p65A, RZGFJHeomGJCWp3xcL3ejA, ZoQSzzXoSP1RxOD4Amv9Bg, qzM0EB0SkuuGIFv0adjQAQ, HuM6vvuveken-fPZ7d4olA, H3oukHpGpn9n_mJwSDSQyQ, PkmsJsQ8FIZe8eh8c_u96g, wSByVbwME4MzgkJaFyfvNg, YEVqknoDmrHAoUbHX0nPnA, li3vsK1XAPmeJYAUTYflHQ, MKc8yXi0glbPYt0Qb4PECw, fQPH6W9fksi27gkuUPnFaA, amrCMrDsoRetYFg2kwwdFA, 84dVQ6n6r2ezNaTuc7RkKA, yW9QjWY0olv5-uRKv3t_Kw, 5XJDj7c3eoidfQ3jW18Zgw, txSc6a6pIDctvwyBeu7Aqg, HFbbDCyyqP9xPkUlcxeIdg, hTUv5oh2do6Z3OppPuuiJA, gSqonG9J4fNM-fl_fE71AA, pd9mgTFpBTg5F9x-MsczNg, j3VE22V2GcHiH8UZxfFLfw, NYXlMW-T-3V4Jqr4r-i0Wg, btxgAZedxX8IWhMifA7Xkg, -Hp5mPLiRJNFnyeX5Ygzag, P6-DwVg6-t2JuQwIUEk0iQ, OI2TvxYvZrAodBG_RF53Xw, bHxf_VPKmZur1Bier-6A2A, Et_Sb39cVm81_Xe9HDM8ZQ, 5HwGl2UyYbaRq8aD6YC-fA, ZK228WMcCKLo5thcjD7rdw, iTf8wojwfm0NOi7dOiz3Nw, btYRxQYNJjpecflNHtFH0A, Kgo42FzpW_dXFgDKoewbtg, MNk_1Q_dqOY3xxHZKeO8VQ, AlwD504T9k0m5lkg3k5g6Q', funny=17, name='Rashmi', review_count=95, useful=84, user_id='l6BmjZMeQD3rDxWUbiAiow', yelping_since='2013-10-08 23:11:33'),
Row(average_stars=3.71, compliment_cool=0, compliment_cute=0, compliment_funny=0, compliment_hot=0, compliment_list=0, compliment_more=0, compliment_note=1, compliment_photos=0, compliment_plain=0, compliment_profile=0, compliment_writer=0, cool=10, elite='', fans=0, friends='4N-HU_T32hLENLntsNKNBg, pSY2vwWLgWfGVAAiKQzMng, lwhksSpgIyeYZor_HlN93w, v281gE-nk3jozr_5hWFAug, V7XFwm0baX37HRIduHmrXw, haSh72Q0MsQZUpWPeVgp0Q, bLbSNkLggFnqwNNzzq-Ijw, EKnCY7Tfxts4dexPrz63OQ, u_wqt9RshdZsoj8ikLqoEQ, J8FKQM1yvbwoKcuWRNh1yw, S9tC7Bp2sOLF_nwH-ksiIg, vNGs6_DP7ZbtPwX7finVIQ, 1IQ_d1RuMj8iIpcF2CDohA, WJqVJqhh7vNX51xLHEMEoQ, 5OllWsrKJsYo3XQK6siRKA', funny=8, name='David', review_count=16, useful=28, user_id='bc8C_eETBWL0olvFSJJd0w', yelping_since='2013-10-04 00:16:10')]
CodePudding user response:
The SQL logic first finds the top 10 users with the most tips from the tip
table and joins it with the user
table. Left or right side of a SQL join can be another SQL expression.
Example
The examples creates dataframes with minimal information needed to illustrate query working.
tip = spark.createDataFrame([("UPw5DWs_b-e2JRBS-t37Ag",), ("UPw5DWs_b-e2JRBS-t37Ag",), ("another_user_id",)], ("user_id",))
user = spark.createDataFrame([("UPw5DWs_b-e2JRBS-t37Ag", "u1", ), ("another_user_id", "u2",),], ("user_id", "name"))
tip.registerTempTable("tip")
user.registerTempTable("user")
sql = """
SELECT u.*
FROM user u
JOIN (SELECT user_id,
Count(user_id) AS tip_count
FROM tip
GROUP BY user_id
ORDER BY tip_count DESC
LIMIT 10) t
ON t.user_id = u.user_id
"""
spark.sql(sql).show(200, False)
Output
---------------------- ----
|user_id |name|
---------------------- ----
|UPw5DWs_b-e2JRBS-t37Ag|u1 |
|another_user_id |u2 |
---------------------- ----