Home > OS >  Is there an efficient way to add a csv file with many columns to a Django model?
Is there an efficient way to add a csv file with many columns to a Django model?

Time:04-08

I have a data frame with over 150 columns that I want to add to my Django database. Everything I have seen online gives an example only using a couple columns and requires that you list each field that you want to use. Is there a way to create a model that inherits from the data frame columns and does this process more efficiently?

CodePudding user response:

Use a Json field to save data as json (https://docs.djangoproject.com/en/4.0/ref/models/fields/#django.db.models.JSONField) or save at non-relational database like Mongo

These are the best ways to work with dynamic data

CodePudding user response:

JSONField is great, but it might be better to have a stricter DB schema with fixed columns, especially if you intend to use the Django queryset features (filtering, etc.) on column data. In that case, you have to create, one way or another, a model with 150 fields. Given a dataframe, you could generate once and for all the field lines of your model:

for name in df.columns:
    print("    %s = models.FloatField('%s')" % (name, name))

Then, just copy-paste the result to your Django model definition, for example:

class YourModel(models.Model):
    a = models.FloatField('a')
    b = models.FloatField('b')
    c = models.FloatField('c')
    ...

Now, let's look at the data import:

If the dataframe is trusted, and the columns are named with the same syntax as the field names of your model (OK, for example, when just importing the dataframe used for model creation), you could do:

YourModel.objects.bulk_create([
    YourModel(**row) for _, row in df.iterrows()
])

row is essentially a dict-like object so that, if the columns are "a", "b" and "c", row could be similar to something like {'a': 10, 'b': 11, 'c': 12} which makes the YourModel call look like YourModel(a=10, b=11, c=12).

If the dataframe cannot be trusted, for example if users can somehow "upload dataframes" to make the DB grow, each new input dataframe should be validated:

  1. (Important; to prevent named parameter injection in YourModel.) Validate that the column names are what you expect. If you do not want to hard code the expected column names, check that each name exists in the Django model, for example by adding this before the code above:

    valid_names = [f.name for f in YourModel._meta.get_fields()]
    if not all(name in valid_names for name in df.columns):
        raise ...
    
  2. Validate the row data.

In a real world, I recommend doing 1. even if you trust the data. This can save time later by catching unintentional mistakes if the code used to generate the dataframes changes.

  • Related