Union in loop PySpark

I have two dataframes:

data1 = [{'text': 'We traveled a long way to several beautiful houses to see the cats.', 'lang': 'eng'},
{'text': 'قطعنا شوطا طويلا إلى عدة منازل جميلة لرؤية القطط.', 'lang': 'arb'},
{'text': 'Wir reisten einen langen Weg zu mehreren schönen Häusern, um die Katzen zu sehen.', 'lang': 'deu'},
{'text': 'Nous avons parcouru un long chemin vers plusieurs belles maisons pour voir les chats.', 'lang': 'fra'}]
sdf1 = spark.createDataFrame(data1)

data2 = [{'text': 'Przebyliśmy długą drogę do kilku pięknych domów, aby zobaczyć koty.', 'lang': 'pol'},
{'text': 'Mēs ceļojām garu ceļu uz vairākām skaistām mājām, lai redzētu kaķus.', 'lang': 'lav'},
{'text': 'Kedileri görmek için birkaç güzel eve uzun bir yol kat ettik.', 'lang': 'tur'}]
sdf2 = spark.createDataFrame(data2)

I want to add only specific language rows from sdf2 to the first dataframe. I do it with a loop:

langs = ['pol', 'tur']
for lang in langs:
    sdf_l = sdf2.where(F.col('lang') == lang)
    sdf_final = sdf1.union(sdf_l)

But it only appends the rows for the last language in langs.

CodePudding user response:

There is no need for a loop here. Filter sdf2 first, and then union the result with sdf1.

import pyspark.sql.functions as F

...
langs = ['pol', 'tur']
sdf_final = sdf1.union(sdf2.filter(F.col('lang').isin(langs)))

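If you do want to use a loop, keep a running result and union into it on every iteration. Your loop rebuilds sdf_final from the unchanged sdf1 each time, so only the last language survives.

# Start from sdf1 and add one language per iteration
sdf_final = sdf1
for lang in langs:
    sdf_l = sdf2.where(F.col('lang') == lang)
    # Union into the running result, not into the original sdf1
    sdf_final = sdf_final.union(sdf_l)
sdf_final.show(truncate=False)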
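
To see why the loop in the question keeps only the last language, here is a minimal sketch of the same two patterns with no Spark involved. It rests on assumptions: plain Python lists stand in for the dataframes, `+` stands in for union, and the names `base`, `extra`, `overwritten` and `accumulated` are made up for illustration.

```python
# Plain Python lists stand in for DataFrames; `+` plays the role of union().
base = ["eng", "arb", "deu", "fra"]   # stand-in for sdf1
extra = ["pol", "lav", "tur"]         # stand-in for sdf2
langs = ["pol", "tur"]

# Pattern from the question: every pass starts again from the unchanged
# `base`, so the result of the previous pass is thrown away.
for lang in langs:
    overwritten = base + [x for x in extra if x == lang]

# Accumulating pattern: every pass builds on the previous result.
accumulated = list(base)
for lang in langs:
    accumulated = accumulated + [x for x in extra if x == lang]

print(overwritten)   # ['eng', 'arb', 'deu', 'fra', 'tur']
print(accumulated)   # ['eng', 'arb', 'deu', 'fra', 'pol', 'tur']
```

Spark dataframes behave the same way in this respect: union returns a new dataframe and leaves sdf1 unchanged, so the result has to be carried from one iteration to the next.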