Join two tables and concat columns using Pyspark (databricks)


I have two tables in my database, and I need to perform a left outer join on them with the condition table1.id = table2.id; the source columns should also match.

Below are my two source tables.

Table 1 :

    source   id         type
    eu2      10000162   N4
    sus      10000162   M1
    pda      10000162   XM

Table 2 :

    source   id         code1      code2
    eu2      10000162   CDNG_GRP   PROB_CD
    sus      10000162   AANV       NW
    pda      10000162   PM2        VLPD

Expected output :

    source   id         type   concat
    eu2      10000162   N4     CDNG_GRP-PROB_CD
    sus      10000162   M1     AANV-NW
    pda      10000162   XM     PM2-VLPD

I want this result in a DataFrame.

Thanks in advance !

CodePudding user response:

Spark joins always return a DataFrame (unless you explicitly convert the result, e.g. by collecting it).

Try this, assuming your tables are already Spark DataFrames:

left_join = table1.join(table2, table1.id == table2.id, "leftouter")
left_join.show()

CodePudding user response:

To get the desired result, you need to perform join on the source and id columns.

import pyspark.sql.functions as F
...
df = df1.join(df2, on=['id', 'source'], how='left') \
    .withColumn('concat', F.concat('code1', F.lit('-'), 'code2')) \
    .select(['source', 'id', 'type', 'concat'])
df.show(truncate=False)