I am currently using this piece of code:
class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        with Spark() as spark:
            return cls(spark)
My object obviously depends on the Spark object (another object that I created; I can add its code if needed, but I do not think it is required for my current issue).
It can be used in 2 different ways, resulting in the same behavior:
fs = FileSystem.without_spark()
# OR
with Spark() as spark:
    fs = FileSystem(spark)
My problem is that, even though FileSystem is a singleton, using the class method without_spark makes me enter (__enter__) the context manager of Spark, which leads to a connection to the Spark cluster, and that takes a lot of time. How can I make it so that the first execution of without_spark makes the connection, but subsequent calls only return the already created instance?
The expected behavior would be something like this:
@classmethod
def without_spark(cls):
    if not cls.exists:  # I do not know how to persist this information in the class
        with Spark() as spark:
            return cls(spark)
    else:
        return cls()
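For context, Singleton here can be assumed to be a standard instance-caching metaclass, roughly like this minimal sketch (the real one may differ; the key property is that later calls return the cached instance without re-running __init__):
class Singleton(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Build the instance on the first call; afterwards, ignore the
        # arguments and return the cached instance without calling __init__.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]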
CodePudding user response:
I think you are looking for something like
import contextlib

class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""

    spark = None

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        if cls.spark is None:
            cm = cls.spark = Spark()
        else:
            cm = contextlib.nullcontext(cls.spark)
        with cm as s:
            return cls(s)
The first time without_spark is called, a new instance of Spark is created and used as a context manager. Subsequent calls reuse the same Spark instance, wrapped in a null context manager.
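A quick usage sketch (assuming Spark.__enter__ returns the Spark object itself):
fs1 = FileSystem.without_spark()  # creates Spark and connects to the cluster
fs2 = FileSystem.without_spark()  # wraps the cached cls.spark in nullcontext; no new connection
assert fs1 is fs2                 # the Singleton metaclass returns the same instance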
I believe your approach will work as well; you just need to initialize exists to False, then set it to True the first (and every, really) time you call the class method.
class FileSystem(metaclass=Singleton):
    """File System manager based on Spark"""

    exists = False

    def __init__(self, spark):
        self._path = spark._jvm.org.apache.hadoop.fs.Path
        self._fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
            spark._jsc.hadoopConfiguration()
        )

    @classmethod
    def without_spark(cls):
        if not cls.exists:
            cls.exists = True
            with Spark() as spark:
                return cls(spark)
        else:
            return cls()
CodePudding user response:
Can't you make the constructor argument optional and initialize the Spark lazily, e.g. in a property (or functools.cached_property)?
from functools import cached_property

class FileSystem(metaclass=Singleton):
    def __init__(self, spark=None):
        self._spark = spark

    @cached_property
    def spark(self):
        # Lazily create the Spark object on first access and remember it.
        # (An assignment expression cannot target an attribute, so a plain
        # assignment is used here.)
        if self._spark is None:
            self._spark = Spark()
        return self._spark

    @cached_property
    def path(self):
        return self.spark._jvm.org.apache.hadoop.fs.Path

    @cached_property
    def fs(self):
        with self.spark:
            return self.spark._jvm.org.apache.hadoop.fs.FileSystem.get(
                self.spark._jsc.hadoopConfiguration()
            )
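With this version nothing touches Spark until one of the lazy attributes is first accessed, e.g.:
fs = FileSystem()  # no Spark object created yet
fs.path            # first access creates Spark() and caches the Path class
fs.path            # cached_property now returns the cached value; no extra work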