Python, using DVC: How does it work? Does it keep all the data file versions? Can it lead to extra c

Time:10-13

I am considering learning to use DVC (https://dvc.org/), but before that I have some questions about using DVC with the cloud:

  1. Does DVC save all the different versions of the dataset?
  2. Does DVC support all data file formats (csv, feather)?
  3. Can using DVC with the cloud lead to extra costs, since it increases the frequency of communication with the cloud?
  4. Can using DVC with the cloud lead to extra costs, since it saves many versions of the data files?
  5. Are there limitations or disadvantages of the tool when working with large data files (100 GB)?

CodePudding user response:

Does DVC save all the different versions of the dataset?

Yes, it works at the file level: each version of a file is saved as a whole. More details are in the related question "By how much can i approx. reduce disk volume by using dvc?"

You can, however, control which versions to keep.

Does DVC support all data file formats (csv, feather)?

Yes, it's format-agnostic, so it doesn't matter which format you use. It also means that it doesn't do anything specific for CSV: it won't try to compress the file or calculate a diff in some smart way.
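To illustrate what "file level" and "no smart diff" mean in practice, here is a minimal toy sketch of a content-addressed store. It is not DVC's actual code or cache layout; `store_version` and `cache_size` are hypothetical helpers made up for this example. The point is only that a one-row change produces a second full copy, while unchanged content is stored once.

```python
import hashlib
import tempfile
from pathlib import Path


def store_version(data: bytes, cache_dir: Path) -> str:
    """Store data as a whole file keyed by its MD5 digest (toy model, not DVC itself)."""
    digest = hashlib.md5(data).hexdigest()
    target = cache_dir / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return digest


def cache_size(cache_dir: Path) -> int:
    """Total number of bytes held in the toy cache."""
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    v1 = b"id,value\n" + b"1,10\n" * 1000
    v2 = v1 + b"2,20\n"  # the same file with one extra row
    key1 = store_version(v1, cache)
    key2 = store_version(v2, cache)
    key1_again = store_version(v1, cache)  # unchanged content: no new copy
    total = cache_size(cache)  # both versions are stored in full
```

Because the store never looks inside the bytes, the same code would behave identically for a feather file or any other format.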

Can using DVC with the cloud lead to extra costs, since it increases the frequency of communication with the cloud?

I would not worry about communication costs (unless you move millions or billions of files). But saving multiple versions of a file means paying for the storage of each of those versions.
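As a rough back-of-the-envelope sketch of that storage cost: since every version is kept as a whole file, the bill scales with the sum of all version sizes. The price below is a hypothetical placeholder, not a quote from any provider, and `monthly_storage_cost` is a made-up helper.

```python
def monthly_storage_cost(version_sizes_gb, price_per_gb_month):
    """Every stored version is billed in full, because whole files are kept."""
    return sum(version_sizes_gb) * price_per_gb_month


PRICE = 0.02  # hypothetical USD per GB-month; check your provider's real pricing

one_version = monthly_storage_cost([100], PRICE)
five_versions = monthly_storage_cost([100] * 5, PRICE)  # five 100 GB versions
```

Under these assumptions, keeping five versions of a 100 GB file costs five times as much as keeping one, which is why it is worth deciding which versions to keep.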

Are there limitations or disadvantages of the tool when working with large data files (100 GB)?

It has the additional cost of calculating the file hash (MD5), which it uses as a key in its storage. For a large file, that takes some extra time. Still, uploading those files to the cloud and downloading them back should be more expensive.

I haven't run benchmarks, but I can imagine that there are tools, such as s5cmd, that specialize in optimizing data transfer speeds in such cases. DVC doesn't do any tricks for this at the moment.
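To get a feel for the hashing cost, here is a minimal sketch of hashing a file in chunks with Python's standard `hashlib`, so that a very large file never has to fit in memory. This is an illustration of the general technique, not DVC's own implementation; `file_md5` and the chunk size are assumptions for the example.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the MD5 hex digest of a file, reading it chunk by chunk."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


# Write a small sample file (a bit over 3 MiB) standing in for a large dataset
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (3 * 1024 * 1024 + 7))
    sample_path = tmp.name

digest = file_md5(sample_path)
os.remove(sample_path)
```

The whole file has to be read once to compute the digest, so the time grows linearly with file size; wrapping the call in a timer on a real 100 GB file would show what that costs on your hardware.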
