use singleton logic within a classmethod

I am currently using this piece of code:

class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        with Spark() as spark:
            return cls(spark)

My object obviously depends on the Spark object (another object that I created; if you need to see its code I can add it, but I do not think it is required for my current issue).

It can be used in two different ways, resulting in the same behavior:

fs = FileSystem.without_spark()

# OR

with Spark() as spark:
    fs = FileSystem(spark)

My problem is that, even though FileSystem is a singleton, using the class method without_spark makes me enter (__enter__) the Spark context manager, which leads to a connection to the Spark cluster, which takes a lot of time. How can I make the first execution of without_spark perform the connection, but have subsequent calls only return the already-created instance?
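For context, the exact Singleton metaclass I use is not important here; a typical implementation caches the first instance per class and returns it on every later call, roughly like this (a minimal sketch):

class Singleton(type):
    """Cache the first instance of each class and return it on later calls."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Only the first call runs __init__; later calls get the cached instance.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]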

The expected behavior would be something like this:

    @classmethod
    def without_spark(cls):
        if not cls.exists:  # I do not know how to persist this information in the class
            with Spark() as spark:
                return cls(spark)
        else:
            return cls()

CodePudding user response:

I think you are looking for something like

import contextlib

class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""

    spark = None

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        if cls.spark is None:
            # First call: create the Spark object and remember it on the class.
            cm = cls.spark = Spark()
        else:
            # Later calls: wrap the already-created Spark object so it can still
            # be used in a with-statement without reconnecting.
            cm = contextlib.nullcontext(cls.spark)

        with cm as s:
            return cls(s)

The first time without_spark is called, a new instance of Spark is created and used as a context manager. Subsequent calls reuse the same Spark instance and use a null context manager.
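For example (a rough sketch, assuming the Singleton metaclass caches the FileSystem instance and that entering the Spark context manager is what performs the connection):

fs1 = FileSystem.without_spark()  # first call: Spark() is created and entered (connects)
fs2 = FileSystem.without_spark()  # later calls: cls.spark is reused via nullcontext
assert fs1 is fs2                 # the Singleton metaclass returns the same FileSystem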


I believe your approach will work as well; you just need to initialize exists to be False, then set it to True the first (and every, really) time you call the class method.

class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""

    exists = False

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        if not cls.exists:
            cls.exists = True
            with Spark() as spark:
                return cls(spark)
        else:
            # Works because the Singleton metaclass returns the already-created
            # instance without calling __init__ again, so no spark argument is needed.
            return cls()

CodePudding user response:

Can't you make the constructor argument optional and instantiate Spark lazily, e.g. in a property (or functools.cached_property):

from functools import cached_property

class FileSystem(metaclass=Singleton):
    def __init__(self, spark=None):
        self._spark = spark

    @cached_property
    def spark(self):
        if self._spark is None:
            # Lazily create the Spark object the first time it is needed.
            self._spark = Spark()
        return self._spark

    @cached_property
    def path(self):
        return self.spark._jvm.org.apache.hadoop.fs.Path

    @cached_property
    def fs(self):
        with self.spark:
            return self.spark._jvm.org.apache.hadoop.fs.FileSystem.get(
                self.spark._jsc.hadoopConfiguration()
            )
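
Usage would then look something like this (a sketch, assuming that entering the Spark context manager is what establishes the connection):

fs_manager = FileSystem()        # no Spark connection yet
hadoop_fs = fs_manager.fs        # first access: Spark() is created lazily and entered
hadoop_fs_again = fs_manager.fs  # cached_property: same object, no new connection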