I'm testing some code that manipulates data using pandas, and I want to avoid writing out the data during tests.
Let's say my code in a file named module.py is this:
import pandas as pd
import dask.dataframe as dd
def do_stuff() -> None:
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]})
    another_df = df.pivot_table(values='a', index='b')
    yet_another_df = another_df.groupby('b').sum()
    another_df.to_csv('data.csv')
    yet_another_df.to_csv('more_data.csv')
I want to intercept all of these individual "to_csv" calls, so that the tests don't write out any data when they run.
My first thought was to try something like this with pytest:
import pandas as pd

import module

class NonWritingDataFrame(pd.DataFrame):
    def to_csv(self, *args, **kwargs):
        pass

def test_do_stuff_returns_nothing(monkeypatch):
    # replace the DataFrame class that module sees with the no-op subclass
    monkeypatch.setattr(module.pd, 'DataFrame', NonWritingDataFrame)
    actual = module.do_stuff()
    assert actual is None
But sadly this doesn't work: it might for the first df variable (I'm not actually sure if it does), but another_df and yet_another_df are returned by other pandas methods rather than constructed in module, so they are normal pandas DataFrames and not my special NonWritingDataFrame objects.
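To see this concretely, here's a quick check (a sketch; it assumes pandas' default subclassing behaviour, where operations return plain DataFrames unless _constructor is overridden on the subclass):

import pandas as pd

class NonWritingDataFrame(pd.DataFrame):
    def to_csv(self, *args, **kwargs):
        pass

df = NonWritingDataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]})
print(type(df))                                      # NonWritingDataFrame
print(type(df.pivot_table(values='a', index='b')))   # plain pandas DataFrame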
My question is, is there a neat way to replace all pandas DataFrame "to_csv" calls, regardless of the method used to define the DataFrame?
CodePudding user response:
If you want to replace all references to to_csv(), one option is to take advantage of Python's import mechanism. A module is only imported once; subsequent imports reuse the cached reference. So modifying to_csv() before you import module will give you the desired result --
import pandas

# patch the class itself so every DataFrame instance gets the no-op version
pandas.DataFrame.to_csv = lambda *args, **kwargs: print("monkeypatched")

import module
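If you'd rather keep the patch scoped to individual tests, the same idea works with pytest's monkeypatch fixture: patch to_csv on the DataFrame class and it is restored automatically after each test. A minimal sketch (the test name and no-op replacement are just illustrative):

import pandas as pd

import module

def test_do_stuff_writes_nothing(monkeypatch):
    # every DataFrame, however it was produced, now hits the no-op to_csv
    monkeypatch.setattr(pd.DataFrame, 'to_csv', lambda *args, **kwargs: None)
    assert module.do_stuff() is None

Because the patch is applied to the class attribute and the lookup happens at call time, every DataFrame created inside do_stuff picks it up, regardless of where it was constructed.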