Performance of Pyspark pipeline in python classes


I have a PySpark pipeline running on Databricks. The pipeline is basically a sequence of functions that read/create tables, join, transform, etc. (i.e. common Spark operations). For example, it could look something like this:

def read_table():
    ...

def perform_tansforms():
    ...

def perform_further_transforms():
    ...

def run_pipeline():
    read_table()
    perform_tansforms()
    perform_further_transforms()

Now, to structure the code better, I encapsulated the constants and functions of the pipeline into a class with static methods and a run method, like below:

class CustomPipeline():

    class_variable_1 = "some_variable"
    class_variable_2 = "another_variable"

    @staticmethod
    def read_table():
        ...

    @staticmethod
    def perform_tansforms():
        ...

    @staticmethod
    def perform_further_transforms():
        ...

    @staticmethod
    def run():
        CustomPipeline.read_table()
        CustomPipeline.perform_tansforms()
        CustomPipeline.perform_further_transforms()

Now, this may be a stupid question, but conceptually, can this affect the performance of the pipeline in any way? For example, could encapsulating the parts of the pipeline in a class add extra overhead in the communication between the Python interpreter and the JVM running Spark?

Any help is appreciated, thanks. Also, comment if any other detail is needed.

CodePudding user response:

Not directly, no; it doesn't matter.

I suppose it could matter if, for example, your class performed heavy initialization for every step regardless of which step was actually executed. But I don't see that here.
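For illustration, here is a minimal sketch of that kind of anti-pattern; the class, method names, and table name are hypothetical and not from the question. The problem is eager work at class-definition time, not the class structure itself:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class EagerPipeline:
    # Anti-pattern: this collect() runs when the class is defined, so the
    # full lookup table is pulled to the driver as soon as the class is
    # loaded, even if only step_two() is ever called.
    lookup_rows = spark.read.table("some_lookup_table").collect()

    @staticmethod
    def step_one(df):
        ...  # would use lookup_rows

    @staticmethod
    def step_two(df):
        # Never touches lookup_rows, but the load above was already paid for.
        return df.filter("amount > 0")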

This isn't different on Spark or Databricks.
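As a sanity check, you can confirm this yourself; the following is a sketch (not from the original answer) showing that the same transformation wrapped in a static method produces the same physical plan as the plain function:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Plain function
def transform(df):
    return df.withColumn("double_id", F.col("id") * 2)

# Same logic wrapped in a static method
class Pipeline:
    @staticmethod
    def transform(df):
        return df.withColumn("double_id", F.col("id") * 2)

# Both print identical physical plans; the class wrapper only changes how
# the Python driver code is organised, not what Spark executes.
transform(df).explain()
Pipeline.transform(df).explain()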
