I'm working on an ML project and decided to use classes to organize my code. However, I'm not sure my approach is optimal. I'd appreciate it if you could share best practices and how you would approach a similar challenge:
Let's concentrate on the preprocessing module, where I created a Preprocessor class.
This class has 3 methods for data manipulation, each taking a dataframe as input and adding a feature. The output of each method can be the input of another.
I also have a 4th, wrapper method that chains these 3 methods and creates the final output:
def wrapper(self):
    output = self.method_1(self.df)
    output = self.method_2(output)
    output = self.method_3(output)
    return output
When I want to use the class, I create an instance with df and just call the wrapper method on it. This feels unnatural and makes me think there is a better way of doing it.
from A_class import A_class  # import the class itself; a bare module is not callable
instance = A_class(df)
output = instance.wrapper()
Answer:
Classes are great if you need to keep track of or modify the internal state of an object. But they're not magical things that keep your code organized just by existing. If all you have is a preprocessing pipeline that takes some data and runs it through methods in a straight line, regular functions will often be less cumbersome.
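To make the distinction concrete, here is a minimal sketch of a case where a class does earn its place: a preprocessing step that has to remember something it learned from the training data. The `MeanCenterer` name and its `fit`/`transform` methods are hypothetical, chosen only for illustration, and plain lists stand in for dataframe columns so the snippet runs on its own.

```python
from statistics import fmean


class MeanCenterer:
    """Hypothetical stateful step: remembers the training mean."""

    def __init__(self):
        self.mean_ = None  # internal state, set by fit()

    def fit(self, values):
        # Learn the mean once, from the training data only.
        self.mean_ = fmean(values)
        return self

    def transform(self, values):
        if self.mean_ is None:
            raise RuntimeError("call fit() before transform()")
        # Reuse the stored mean, so new data is shifted by the training mean.
        return [v - self.mean_ for v in values]


centerer = MeanCenterer().fit([1.0, 2.0, 3.0])
train_centered = centerer.transform([1.0, 2.0, 3.0])
test_centered = centerer.transform([4.0])
```

Here the object carries state (`mean_`) from one call to the next, which is the situation classes are good at. The pipeline in the question has no such state.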
With the context you've given, I'd probably do something like this:
pipelines.py
def preprocess_data_xyz(data):
    """
    Takes a dataframe of nature XYZ and returns it after
    running it through the necessary preprocessing steps.
    """
    step_1 = func_1(data)
    step_2 = func_2(step_1)
    step_3 = func_3(step_2)
    return step_3


def func_1(data):
    """Does X to data."""
    pass

# etc ...
analysis.py
import pandas as pd
from pipelines import preprocess_data_xyz
data_xyz = pd.DataFrame( ... )
preprocessed_data_xyz = preprocess_data_xyz(data=data_xyz)
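If the number of steps grows, the same straight-line chain can also be written as a list of functions applied in order. This is only a sketch of one option, not something the code above requires: `run_pipeline` and the two toy steps are hypothetical names, and plain lists are used instead of dataframes so the snippet is self-contained.

```python
from functools import reduce


def run_pipeline(data, steps):
    """Feed data through each function in steps, left to right."""
    return reduce(lambda current, step: step(current), steps, data)


# Toy steps (hypothetical); each takes the data and returns a new value.
def drop_missing(values):
    return [v for v in values if v is not None]


def double(values):
    return [v * 2 for v in values]


result = run_pipeline([1, None, 3], steps=[drop_missing, double])
```

Because `run_pipeline` only calls each step on the previous result, it should work the same way with functions that take and return a dataframe.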
Choosing better variable and function names is also a major component of organizing your code: you should replace func_1 with a name that describes what it does to the data (something like add_numerical_column, parse_datetime_column, etc.). Likewise for the data_xyz variable.
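As an illustration of the naming advice, the pipeline might look like this once the placeholders are replaced. This is a sketch under assumptions: the column names (`price`, `quantity`, `ordered_at`), the `orders` naming, and the bodies of the two step functions are invented for the example. Only the two function names come from the suggestions above.

```python
import pandas as pd


def add_numerical_column(orders):
    """Return a copy with a 'total' column (price * quantity)."""
    return orders.assign(total=orders["price"] * orders["quantity"])


def parse_datetime_column(orders):
    """Return a copy with 'ordered_at' converted to datetime."""
    return orders.assign(ordered_at=pd.to_datetime(orders["ordered_at"]))


def preprocess_orders(orders):
    """Run the raw orders dataframe through every preprocessing step."""
    with_total = add_numerical_column(orders)
    with_dates = parse_datetime_column(with_total)
    return with_dates


# Hypothetical input data.
raw_orders = pd.DataFrame(
    {
        "price": [2.5, 4.0],
        "quantity": [2, 3],
        "ordered_at": ["2024-01-05", "2024-02-10"],
    }
)
preprocessed_orders = preprocess_orders(raw_orders)
```

Each step returns a new dataframe via `assign` rather than modifying its input, so `raw_orders` stays unchanged and the steps can be reordered or tested one at a time.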