Select different dataset when testing | Separate test from production


This question is partly about how to test external dependencies (a.k.a. integration tests) and partly about how to implement that in Python, specifically for SQL on BigQuery. So answers that only address 'this is how you should do integration tests' are very welcome.

In my project I have two different datasets:

'project_1.production.table_1'
'project_1.development.table_1'

When running my tests I would like to query the development environment instead. But how do I separate this properly from my production code? I don't want to clutter my production code with test (set-up) code.

Production code looks like:

from google.cloud import bigquery
from pandas import DataFrame

# Method of a class that holds a BigQuery client as self.client.
def find_data(self, variable_x: str) -> DataFrame:
    query = '''
    SELECT *
    FROM `project_1.production.table_1`
    WHERE foo = @variable_x
    '''

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # The parameter name must match the @variable_x placeholder.
            bigquery.ScalarQueryParameter(
                name='variable_x', type_="STRING", value=variable_x
            )
        ]
    )

    df = self.client.query(
        query=query, job_config=job_config).to_dataframe()
    return df

Solution 1 : Environment variables for the dataset

The python-dotenv module can be used to differentiate production from development, as I already do for some parts of my code. The problem is that BigQuery does not allow the dataset to be parameterized (to prevent SQL injection, I think). See the running parameterized queries docs.

From the docs

Parameters cannot be used as substitutes for identifiers, column names, table names, or other parts of the query.

So supplying the dataset name through a query parameter backed by an environment variable is not possible.
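For illustration (this sketch is not from the original code), here is roughly what Solution 1 would look like if identifiers could be parameterized; the parameter names here are hypothetical, and BigQuery rejects the job instead of substituting the parameter:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical attempt: supply the table name as a query parameter.
query = '''
SELECT *
FROM @data_table  -- rejected: parameters cannot stand in for identifiers
WHERE foo = @variable_x
'''
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter(
            name="data_table", type_="STRING",
            value="project_1.development.table_1"
        ),
        bigquery.ScalarQueryParameter(
            name="variable_x", type_="STRING", value="bar"
        ),
    ]
)
# BigQuery returns an error for this job rather than querying the table.
client.query(query=query, job_config=job_config)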

Solution 2 : Environment variable for flow control

I could add an if production == True check and select the dataset accordingly. However, this puts test/debug code into my production code, which I would like to avoid as much as possible.

from os import getenv

from dotenv import load_dotenv
from pandas import DataFrame

def find_data(variable_x: str) -> DataFrame:
    load_dotenv()
    # Environment variables are strings, so compare against the string "True".
    if getenv("PRODUCTION") == "True":
        ...  # execute the query on project_1.production.table_1
    else:
        ...  # execute the query on project_1.development.table_1
    job_config = ...  # snip
    df = ...  # snip
    return df

Solution 3 : Mimic function in testcode

Make a copy of the production code and set up the test code so that the development dataset is called.

This leads to code duplication (one copy in the production code and one in the test code). The duplication means the two copies can drift apart should the implementation of the function change over time. So I don't think this solution is 'Embracing Change'.

Solution 4 : Skip testing this function

Perhaps this function does not need to be called at all in my test code. I could take a snippet of the result of this query and use it as a 'data injection' into the tests that depend on this result. However, that means I need to adjust my architecture a bit.
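As a rough sketch of this idea (assuming pytest; my_module, run_pipeline, and the snapshot contents are hypothetical), the test injects a canned result in place of find_data so BigQuery is never called:

import pandas as pd

# Hypothetical snapshot of what the production query returns,
# captured once and stored alongside the test code.
SNAPSHOT = pd.DataFrame({"foo": ["bar"], "value": [42]})

def test_pipeline_with_injected_data(monkeypatch):
    import my_module  # hypothetical module that defines find_data()

    # Replace the real query function with the snapshot for this test.
    monkeypatch.setattr(my_module, "find_data", lambda variable_x: SNAPSHOT)

    # run_pipeline is a hypothetical caller that uses find_data() internally.
    result = my_module.run_pipeline("bar")
    assert result is not None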

None of the above solutions satisfies me completely. I wonder if there is another way to solve this issue, or if one of the above solutions is acceptable?

CodePudding user response:

It looks like string formatting (sometimes referred to as string interpolation) might be enough to get you where you want. You could replace the first part of your function with the following code:

from os import getenv

query = '''
    SELECT *
    FROM `{table}`
    WHERE foo = @variable_x
    '''.format(table=getenv("DATA_TABLE"))

This works because the query is just a string, and you can do whatever you want with it before you pass it to the BigQuery library. Python's str.format lets us substitute values inside a string, which is exactly what we need (see this article for a more in-depth explanation of str.format).

Important security note: it is generally bad security practice to manipulate SQL queries as plain strings (as we are doing here), but since you control the application's environment variables it should be safe in this particular case.
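As a usage sketch (assuming pytest; DATA_TABLE is the environment variable from the snippet above), a test can point the query at the development dataset without touching the production code:

def test_find_data_against_development(monkeypatch):
    # Override DATA_TABLE for this test only; pytest restores it afterwards.
    monkeypatch.setenv("DATA_TABLE", "project_1.development.table_1")
    df = find_data("bar")  # call find_data however it is exposed in your code
    assert df is not None

In production, DATA_TABLE would simply be set to project_1.production.table_1 in the environment or the .env file.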
