Avoid loading an empty dataframe into a table


I am creating a process in Spark Scala within an ETL that checks for events that occurred during the ETL run. I start with an empty DataFrame, and if events occur this DataFrame is filled with information (a DataFrame is immutable, so it can't be filled in place; it can only be combined, e.g. unioned, with other DataFrames of the same structure). At the end of the process the generated DataFrame is loaded into a table, but it can happen that the DataFrame ends up empty because no event occurred, and I don't want to load an empty DataFrame because that makes no sense. So I'm wondering if there is an elegant way to load the DataFrame into the table only if it is not empty, without using an if condition. Thanks!!

CodePudding user response:

I recommend creating the DataFrame anyway. If you don't create it with the same schema, even when it's empty, operations/transformations on the DataFrame could fail because they refer to columns that may not be present.

To handle this, you should always create the DataFrame with the same schema, meaning the same column names and data types, regardless of whether the data exists yet. You can populate it with data later.
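A minimal sketch of that approach, assuming a hypothetical event schema (event_id, event_ts, payload); substitute your real column names and types:

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("etl-events").getOrCreate()

// Hypothetical schema for the event DataFrame.
val eventSchema = StructType(Seq(
  StructField("event_id", StringType, nullable = false),
  StructField("event_ts", TimestampType, nullable = true),
  StructField("payload", StringType, nullable = true)
))

// Empty DataFrame that already carries the full schema.
var eventsDf: DataFrame =
  spark.createDataFrame(spark.sparkContext.emptyRDD[Row], eventSchema)

// As events occur, append batches with the same structure:
// eventsDf = eventsDf.union(newEventsBatch)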

If you still want to do it your way, I can point out a few options for Spark 2.1.0 and above:

df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty

These are essentially equivalent: each one fetches at most a single row. I don't recommend df.count > 0, because count scans the whole DataFrame (linear in the number of rows), and you would still need a check like df != null beforehand.
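For example, a guard using the first check might look like this (a sketch; eventsDf and events_table are hypothetical names from the setup above):

// Write only when at least one row exists; head(1) fetches at most one row.
if (eventsDf.head(1).nonEmpty) {
  eventsDf.write.mode("append").saveAsTable("events_table")
}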

A much better solution would be:

df.rdd.isEmpty

Or since Spark 2.4.0 there is also Dataset.isEmpty.
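With Spark 2.4.0 and above, the same guard becomes (again a sketch with the same hypothetical names):

// Dataset.isEmpty is available since Spark 2.4.0.
if (!eventsDf.isEmpty) {
  eventsDf.write.mode("append").saveAsTable("events_table")
}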

As you can see, whichever option you choose, there is a check you need to perform somewhere, so you can't really get rid of the conditional: the requirement itself is an "if" ("load only if the DataFrame is not empty").
