Home > front end >  PySpark SQL Joining Tables
PySpark SQL Joining Tables

Time:11-18

I have to provide the names of the top 10 users who provided the most tips from the Yelp dataset using the tip and user tables. When I run tip.printSchema() I get:

root
 |-- business_id: string (nullable = true)
 |-- compliment_count: long (nullable = true)
 |-- date: string (nullable = true)
 |-- text: string (nullable = true)
 |-- user_id: string (nullable = true)

And the user.printSchema() returns:

root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- fans: long (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)

So I need to count which user_id's had the most tips from the tip table and join that with the name's in the user table. I have tried this query to get it started:

query = """
SELECT tip.user_id, count(*), user.name
    FROM tip 
    GROUP BY user_id 
    LIMIT 10
    LEFT JOIN user
        ON tip.user_id = user.name
"""
spark.sql(query).show()

because I want to get just the top ten user_id's from tip and match them with their names from user. This returns an error however:

Py4JJavaError: An error occurred while calling o24.sql.
: org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input 'LEFT' expecting <EOF>(line 6, pos 4)

== SQL ==

SELECT user_id, count(*) 
    FROM tip 
    GROUP BY user_id 
    LIMIT 10
    LEFT JOIN user
----^^^
        ON tip.user_id = user.name

Here are some sample rows from the tip table:

[Row(business_id='VaKXUpmWTTWDKbpJ3aQdMw', compliment_count=0, date='2014-03-27 03:51:24', text='Great for watching games, ufc, and whatever else tickles yer fancy', user_id='UPw5DWs_b-e2JRBS-t37Ag'),
 Row(business_id='OPiPeoJiv92rENwbq76orA', compliment_count=0, date='2013-05-25 06:00:56', text='Happy Hour 2-4 daily with 1/2 price drinks and slushes AND after 8 half price shakes.  They actually have a peanut butter and bacon shake.', user_id='Ocha4kZBHb4JK0lOWvE0sg'),
 Row(business_id='5KheTjYPu1HcQzQFtm4_vw', compliment_count=0, date='2011-12-26 01:46:17', text='Good chips and salsa. Loud at times. Good service. Bathrooms AWFUL. So that tanks my view on this place.', user_id='jRyO2V1pA4CdVVqCIOPc1Q')]

And some samples from the user table:

[Row(average_stars=4.03, compliment_cool=1, compliment_cute=0, compliment_funny=1, compliment_hot=2, compliment_list=0, compliment_more=0, compliment_note=1, compliment_photos=0, compliment_plain=1, compliment_profile=0, compliment_writer=2, cool=25, elite='2015,2016,2017', fans=5, friends='c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g, ajcnq75Z5xxkvUSmmJ1bCg, BSMAmp2-wMzCkhTfq9ToNg, jka10dk9ygX76hJG0gfPZQ, dut0e4xvme7QSlesOycHQA, l4l5lBnK356zBua7B-UJ6Q, 0HicMOOs-M_gl2eO-zES4Q, _uI57wL2fLyftrcSFpfSGQ, T4_Qd0YWbC3co6WSMw4vxg, iBRoLWPtWmsI1kdbE9ORSA, xjrUcid6Ymq0DoTJELkYyw, GqadWVzJ6At-vgLzK_SKgA, DvB13VJBmSnbFXBVBsKmDA, vRP9nQkYTeNioDjtxZlVhg, gT0A1iN3eeQ8EMAjJhwQtw, 6yCWjFPtp_AD4x93WAwmnw, 1dKzpNnib-JlViKv8_Gt5g, 3Bv4_JxHXq-gVLOxYMQX0Q, ikQyfu1iViYh8T0us7wiFQ, f1GGltNaB7K5DR1jf3dOmg, tgeFUChlh7v8bZFVl2-hjQ, -9-9oyXlqsMG2he5xIWdLQ, Adj9fBPVJad8vSs-mIP7gw, Ce49RY8CKXVsTifxRYFTsw, M1_7TLi8CbdA89nFLlH4iw, wFsNv-hqbW_F5-IRqfBN6g, 0Q1L7zXHocaUZ2gsG2XJeg, cBFgmOCBdhYa0xoFEAzp_g, VrD_AgiFvzqtlR15vir3SQ, cpE-7HK514Sr5vpSen9CEQ, F1UYelhPFB-zIKlt0ygIZg, CQAL1hvsLMCzuJf9AglsXw, 1KnY1wr15WfEWIRLB9IS6g, QWFQ-kXBiLbid-lm5Jr3dQ, nymT8liFugCrM16lTy0ZfQ, qj69bdd885heDvUPCyHd2Q, DySCZZcgbdrlHgEovk5y9w, lZMJIDuvhT9Dy4KyquLXyA, b_9Gn7wS93AoPZPR0dIJqQ, N07g1IaLh0_6sUjtiSRe4w, YdfPX_7DxSnKvvdCJ57iOw, 8GYryZPD22W7WgQ8kvMkEQ, cpQmAgOWatghp14h1pn1dQ, EnchhymLYMqftCRjqvVWmw, -JdfKhFktE7Zs9BMDFcPeQ, uWhC9eof98zPkvsalgaqJw, eyTlNDDaiPatfe6mheIZ0g, VfHq0o73aKsODvfAhwAQtg, kvD5tICngLAaQDujSFWupA, dXacwEhqi9-3_XT6JeH0Og, NfU0zDaTMEQ4-X9dbQWd9A, cTHWBdjSKbctSUIvWsgFxw, 3IEtCbSDF5t7RkZ20T6s9A, HJJXTrp6UybYyPdQ9DA0JA, JaXogQFVjzGRAeBvzamBHg, NUonfKkjS1iVqnNITtgXZg, D5vaJAYp0sOrGfsj9qvsMA, H27Ecbwwu4FGAlLgICourw, S8HrLmMiE4u8FWYWkNEoTw, Io36Y3xWQcIX9rYvPcYfXQ, J5mcqh8KxYpqjaLBNlwcig, -nTB3_08g06fD0GT8AtDBQ, wMpFA46lihK8oFns_5p65A, RZGFJHeomGJCWp3xcL3ejA, ZoQSzzXoSP1RxOD4Amv9Bg, qzM0EB0SkuuGIFv0adjQAQ, HuM6vvuveken-fPZ7d4olA, H3oukHpGpn9n_mJwSDSQyQ, PkmsJsQ8FIZe8eh8c_u96g, wSByVbwME4MzgkJaFyfvNg, YEVqknoDmrHAoUbHX0nPnA, li3vsK1XAPmeJYAUTYflHQ, MKc8yXi0glbPYt0Qb4PECw, fQPH6W9fksi27gkuUPnFaA, amrCMrDsoRetYFg2kwwdFA, 84dVQ6n6r2ezNaTuc7RkKA, yW9QjWY0olv5-uRKv3t_Kw, 5XJDj7c3eoidfQ3jW18Zgw, txSc6a6pIDctvwyBeu7Aqg, HFbbDCyyqP9xPkUlcxeIdg, hTUv5oh2do6Z3OppPuuiJA, gSqonG9J4fNM-fl_fE71AA, pd9mgTFpBTg5F9x-MsczNg, j3VE22V2GcHiH8UZxfFLfw, NYXlMW-T-3V4Jqr4r-i0Wg, btxgAZedxX8IWhMifA7Xkg, -Hp5mPLiRJNFnyeX5Ygzag, P6-DwVg6-t2JuQwIUEk0iQ, OI2TvxYvZrAodBG_RF53Xw, bHxf_VPKmZur1Bier-6A2A, Et_Sb39cVm81_Xe9HDM8ZQ, 5HwGl2UyYbaRq8aD6YC-fA, ZK228WMcCKLo5thcjD7rdw, iTf8wojwfm0NOi7dOiz3Nw, btYRxQYNJjpecflNHtFH0A, Kgo42FzpW_dXFgDKoewbtg, MNk_1Q_dqOY3xxHZKeO8VQ, AlwD504T9k0m5lkg3k5g6Q', funny=17, name='Rashmi', review_count=95, useful=84, user_id='l6BmjZMeQD3rDxWUbiAiow', yelping_since='2013-10-08 23:11:33'),
 Row(average_stars=3.71, compliment_cool=0, compliment_cute=0, compliment_funny=0, compliment_hot=0, compliment_list=0, compliment_more=0, compliment_note=1, compliment_photos=0, compliment_plain=0, compliment_profile=0, compliment_writer=0, cool=10, elite='', fans=0, friends='4N-HU_T32hLENLntsNKNBg, pSY2vwWLgWfGVAAiKQzMng, lwhksSpgIyeYZor_HlN93w, v281gE-nk3jozr_5hWFAug, V7XFwm0baX37HRIduHmrXw, haSh72Q0MsQZUpWPeVgp0Q, bLbSNkLggFnqwNNzzq-Ijw, EKnCY7Tfxts4dexPrz63OQ, u_wqt9RshdZsoj8ikLqoEQ, J8FKQM1yvbwoKcuWRNh1yw, S9tC7Bp2sOLF_nwH-ksiIg, vNGs6_DP7ZbtPwX7finVIQ, 1IQ_d1RuMj8iIpcF2CDohA, WJqVJqhh7vNX51xLHEMEoQ, 5OllWsrKJsYo3XQK6siRKA', funny=8, name='David', review_count=16, useful=28, user_id='bc8C_eETBWL0olvFSJJd0w', yelping_since='2013-10-04 00:16:10')]

CodePudding user response:

The SQL logic first finds the top 10 users with the most tips from the tip table and joins it with the user table. Left or right side of a SQL join can be another SQL expression.

Example

The examples creates dataframes with minimal information needed to illustrate query working.


tip = spark.createDataFrame([("UPw5DWs_b-e2JRBS-t37Ag",), ("UPw5DWs_b-e2JRBS-t37Ag",), ("another_user_id",)], ("user_id",))

user = spark.createDataFrame([("UPw5DWs_b-e2JRBS-t37Ag", "u1", ), ("another_user_id", "u2",),], ("user_id", "name"))

tip.registerTempTable("tip")
user.registerTempTable("user")

sql = """
 SELECT u.*
FROM   user u
       JOIN (SELECT user_id,
                    Count(user_id) AS tip_count
             FROM   tip
             GROUP  BY user_id
             ORDER  BY tip_count DESC
             LIMIT  10) t
         ON t.user_id = u.user_id  
"""

spark.sql(sql).show(200, False)

Output

 ---------------------- ---- 
|user_id               |name|
 ---------------------- ---- 
|UPw5DWs_b-e2JRBS-t37Ag|u1  |
|another_user_id       |u2  |
 ---------------------- ---- 
  • Related