I have two dataframes:
data1 = [{'text': 'We traveled a long way to several beautiful houses to see the cats.', 'lang': 'eng'},
{'text': 'قطعنا شوطا طويلا إلى عدة منازل جميلة لرؤية القطط.', 'lang': 'arb'},
{'text': 'Wir reisten einen langen Weg zu mehreren schönen Häusern, um die Katzen zu sehen.', 'lang': 'deu'},
{'text': 'Nous avons parcouru un long chemin vers plusieurs belles maisons pour voir les chats.', 'lang': 'fra'}]
sdf1 = spark.createDataFrame(data1)
data2 = [{'text': 'Przebyliśmy długą drogę do kilku pięknych domów, aby zobaczyć koty.', 'lang': 'pol'},
{'text': 'Mēs ceļojām garu ceļu uz vairākām skaistām mājām, lai redzētu kaķus.', 'lang': 'lav'},
{'text': 'Kedileri görmek için birkaç güzel eve uzun bir yol kat ettik.', 'lang': 'tur'}]
sdf2 = spark.createDataFrame(data2)
I want to add only specific language rows from sdf2 to the first dataframe. I do it with a loop:
langs = ['pol', 'tur']
for lang in langs:
    sdf_l = sdf2.where(F.col('lang') == lang)
    sdf_final = sdf1.union(sdf_l)
But it only appends the rows for the last language in langs.
CodePudding user response:
There is no need to use a loop here. Filter sdf2 first, and then union the result with sdf1.
import pyspark.sql.functions as F
...
langs = ['pol', 'tur']
sdf_final = sdf1.union(sdf2.filter(F.col('lang').isin(langs)))
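If you prefer to use a loop, keep the filtered rows in a temporary variable and reassign sdf1 on each iteration, so that every union builds on the previous result. Your version unions with the original sdf1 every time and overwrites sdf_final, which is why only the last language survives.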
for lang in langs:
    sdf_1 = sdf2.where(F.col('lang') == lang)
    sdf1 = sdf1.union(sdf_1)
sdf1.show(truncate=False)
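To see why the loop in the question drops rows, it may help to reproduce the same pattern without Spark. The sketch below is only an analogy: plain Python lists stand in for the DataFrames and `+` stands in for `union()`, on the assumption that what matters here is that both return a new object and leave their operands unchanged. The names `base`, `extra`, `picked`, `broken` and `fixed` are made up for the illustration.

```python
# Plain lists stand in for DataFrames; `+` plays the role of union():
# both return a new object and leave their operands unchanged.
base = ["eng", "arb", "deu", "fra"]   # stands in for sdf1
extra = ["pol", "lav", "tur"]         # stands in for sdf2
langs = ["pol", "tur"]

# Pattern from the question: every pass starts again from the untouched base,
# so the result of the previous pass is thrown away.
for lang in langs:
    picked = [x for x in extra if x == lang]
    broken = base + picked            # overwritten on each iteration

# Pattern from the answer: carry the accumulated result into the next pass.
fixed = list(base)
for lang in langs:
    picked = [x for x in extra if x == lang]
    fixed = fixed + picked

print(broken)  # ['eng', 'arb', 'deu', 'fra', 'tur']
print(fixed)   # ['eng', 'arb', 'deu', 'fra', 'pol', 'tur']
```

The first loop ends with only 'tur' appended, matching the behaviour described in the question; the second keeps both languages because each iteration extends the previous result.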