Union in loop Pyspark

Time:01-28

I have two dataframes:

data1 = [{'text': 'We traveled a long way to several beautiful houses to see the cats.', 'lang': 'eng'},
{'text': 'قطعنا شوطا طويلا إلى عدة منازل جميلة لرؤية القطط.', 'lang': 'arb'},
{'text': 'Wir reisten einen langen Weg zu mehreren schönen Häusern, um die Katzen zu sehen.', 'lang': 'deu'},
{'text': 'Nous avons parcouru un long chemin vers plusieurs belles maisons pour voir les chats.', 'lang': 'fra'}]
sdf1 = spark.createDataFrame(data1)

data2 = [{'text': 'Przebyliśmy długą drogę do kilku pięknych domów, aby zobaczyć koty.', 'lang': 'pol'},
{'text': 'Mēs ceļojām garu ceļu uz vairākām skaistām mājām, lai redzētu kaķus.', 'lang': 'lav'},
{'text': 'Kedileri görmek için birkaç güzel eve uzun bir yol kat ettik.', 'lang': 'tur'}]
sdf2 = spark.createDataFrame(data2)

I want to add only specific language rows from sdf2 to the first dataframe. I do it with a loop:

langs = ['pol', 'tur']
for lang in langs:
    sdf_l = sdf2.where(F.col('lang') == lang)
    sdf_final = sdf1.union(sdf_l)

But it only appends the rows for the last language in langs.

CodePudding user response:

There is no need for a loop here. Filter sdf2 first, and then union the result with sdf1.

import pyspark.sql.functions as F

...
langs = ['pol', 'tur']
sdf_final = sdf1.union(sdf2.filter(F.col('lang').isin(langs)))

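If you do want to use a loop, reassign the result of each union back to sdf1 so the rows accumulate. In your version, every iteration unions the original sdf1 with only the current language and overwrites sdf_final, so only the last language remains.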

for lang in langs:
    # keep only the rows for this language
    sdf_l = sdf2.where(F.col('lang') == lang)
    # reassign so each iteration builds on the previous result
    sdf1 = sdf1.union(sdf_l)
sdf1.show(truncate=False)
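
The accumulating loop can also be written as a fold. The following is only a sketch: union_all is a hypothetical helper name, not part of PySpark, and it assumes nothing beyond each object having a .union() method. So that it runs without a Spark session, the demo uses Python sets as a stand-in. Note that a set union drops duplicates, whereas DataFrame.union keeps them (it behaves like UNION ALL in SQL).

```python
from functools import reduce


def union_all(base, parts):
    """Fold `parts` into `base` by calling .union() repeatedly (hypothetical helper)."""
    # reduce carries the accumulated result into the next step, which is
    # what the original loop missed by restarting from sdf1 every time.
    return reduce(lambda acc, part: acc.union(part), parts, base)


# Stand-in demo: Python sets also expose .union().
# With PySpark, the call would look something like:
#   union_all(sdf1, [sdf2.where(F.col('lang') == lang) for lang in langs])
base = {("row-a", "eng"), ("row-b", "deu")}
parts = [{("row-c", "pol")}, {("row-d", "tur")}]
result = union_all(base, parts)
print(sorted(lang for _, lang in result))  # ['deu', 'eng', 'pol', 'tur']
```

For this particular question the isin filter above is still the simpler choice. A fold like this is mainly useful when the pieces come from different dataframes.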