I am teaching myself Scala (so as to use it with Apache Spark) and wanted to know if there is some way to chain a series of transformations on a Spark DataFrame. E.g. let's assume we have a list of transformations
l: List[(String, String)] = List(("field1", "nonEmpty"), ("field2", "notNull"))
and a Spark DataFrame df, such that the desired result would be:
df.filter(df("field1") =!= "").filter(df("field2").isNotNull)
I was thinking perhaps this could be done using function composition or list folding or something, but I really don't know how. Any help would be greatly appreciated.
Thanks!
CodePudding user response:
Yes, it is perfectly possible. But it depends on what you really want. Spark provides Pipelines, which let you compose your transformations and create a pipeline that can be serialized. You can create your own custom transformers; here is an example. You can wrap your "filter" stages in custom transformations that you will be able to reuse later, for example in Spark Structured Streaming.
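A minimal sketch of that idea, assuming Spark MLlib's Transformer API and the df from the question (the class name FilterTransformer and the wiring around it are illustrative, not part of the original answer):

import org.apache.spark.ml.{Pipeline, Transformer}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{Column, DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Wraps a single filter condition as a reusable pipeline stage.
class FilterTransformer(condition: Column) extends Transformer {
  override val uid: String = Identifiable.randomUID("filterTransformer")

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.filter(condition).toDF()

  // Filtering rows leaves the schema unchanged.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): FilterTransformer =
    new FilterTransformer(condition)
}

// Compose the two filters from the question into a pipeline.
val pipeline = new Pipeline().setStages(Array(
  new FilterTransformer(col("field1") =!= ""),
  new FilterTransformer(col("field2").isNotNull)
))
val filtered = pipeline.fit(df).transform(df)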
Another option is to use Spark Datasets and the transform API. That seems more functional and elegant.
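For instance, a small sketch of what the transform approach could look like, again assuming the df from the question (the helper function names are made up for illustration):

import org.apache.spark.sql.DataFrame

// Each helper takes a DataFrame and returns a filtered DataFrame.
def nonEmptyField1(df: DataFrame): DataFrame = df.filter(df("field1") =!= "")
def notNullField2(df: DataFrame): DataFrame = df.filter(df("field2").isNotNull)

// transform threads the DataFrame through each function, so calls chain cleanly.
val filtered = df
  .transform(nonEmptyField1)
  .transform(notNullField2)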
Scala gives you a lot of freedom to build your own API, but take a look at these approaches first.
CodePudding user response:
Yes, you can fold over an existing DataFrame. You could keep all the filter conditions in a list and not bother with other intermediary types:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val df: DataFrame = ???  // your starting DataFrame

// Each element is a filter condition (a Column).
val columns = List(
  col("1") =!= "",
  col("2").isNotNull,
  col("3") > 10
)

// Thread the DataFrame through the fold, applying one filter per condition.
val filtered = columns.foldLeft(df)((acc, cond) => acc.filter(cond))
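And if you want to start from the List[(String, String)] in the question, one possibility is to translate each (field, rule) pair into a Column first and then fold as above. A hedged sketch (the pattern match only covers the two rule names from the question and would throw a MatchError on anything else):

import org.apache.spark.sql.functions.col

val l: List[(String, String)] = List(("field1", "nonEmpty"), ("field2", "notNull"))

// Translate each (field, rule) pair into a filter condition.
val conditions = l.map {
  case (field, "nonEmpty") => col(field) =!= ""
  case (field, "notNull")  => col(field).isNotNull
}

val filtered = conditions.foldLeft(df)((acc, cond) => acc.filter(cond))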