Kedro - Getting path to item in the data catalog

I'm training an NLP model using spaCy. I have all the preprocessing steps written as a pipeline, and now I need to do the training. According to spaCy's documentation, I need to run the following command:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

The files config.cfg, train.spacy and dev.spacy are all registered in my data catalog. I want to run this command with something similar to the following code:

import subprocess
import sys


def train_spacy_nlp_model(
    config_filepath: str,
    train_filepath: str,
    dev_filepath: str,
    output_dir: str,
) -> None:
    # Pass the arguments as a list (no shell=True) so that paths containing
    # spaces are not split; sys.executable is the running interpreter.
    cmd = [
        sys.executable, "-m", "spacy",
        "train", config_filepath,
        "--output", output_dir,
        "--paths.train", train_filepath,
        "--paths.dev", dev_filepath,
    ]

    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError("Spacy training failed")

But I have no idea how to retrieve the file path of an item in my data catalog. Is there a way to pass this information to my nodes when creating the pipeline?

CodePudding user response：

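The inputs your function expects are strings, but data catalog entries are Kedro datasets. When a node takes a catalog entry as input, it receives the loaded data, not the path of the file behind it.

The two are different things. Store the paths as part of your config and pass them to the node from there, and that should get your project started.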
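One way to follow this suggestion, sketched under assumptions: keep the paths in the project's parameters file and hand them to the node as a `params:` input. The parameter key `spacy_training`, its sub-keys and the helper `build_train_command` are hypothetical names chosen for illustration. The Kedro import sits inside `create_pipeline` so the command-building part can be tried without Kedro installed.

```python
import subprocess
import sys
from typing import Dict, List


def build_train_command(paths: Dict[str, str]) -> List[str]:
    """Build the `spacy train` argument list from a dict of paths."""
    return [
        sys.executable, "-m", "spacy", "train", paths["config"],
        "--output", paths["output_dir"],
        "--paths.train", paths["train"],
        "--paths.dev", paths["dev"],
    ]


def train_spacy_nlp_model(paths: Dict[str, str]) -> None:
    """Node function: receives the parameter dict and runs the training."""
    result = subprocess.run(build_train_command(paths))
    if result.returncode != 0:
        raise RuntimeError("Spacy training failed")


def create_pipeline():
    # Imported lazily so the functions above work without Kedro installed.
    from kedro.pipeline import Pipeline, node

    # Assumes parameters.yml contains (hypothetical key names):
    #   spacy_training:
    #     config: config.cfg
    #     train: ./train.spacy
    #     dev: ./dev.spacy
    #     output_dir: ./output
    return Pipeline([
        node(
            train_spacy_nlp_model,
            inputs="params:spacy_training",
            outputs=None,
            name="train_spacy",
        ),
    ])
```

Because the paths arrive as plain parameters, Kedro does not see that this node depends on the nodes that write train.spacy and dev.spacy, so the run order would have to be ensured some other way.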

CodePudding user response：

This is probably not the most elegant solution, but it works for me, so I'll use it until I find a better one. The solution was to have my DataSet implementation return the path together with the loaded object. I doubt this would generalize to other datasets, such as SQL queries, but since I know I'm dealing with a file here, it works fine. Here is my implementation:

from kedro.io import AbstractDataSet
from spacy.tokens import DocBin
from dataclasses import dataclass
from typing import Union
from pathlib import Path


@dataclass
class DocBinModel:
    filepath: Path
    docbin: DocBin


class SpacyDocBinDataSet(AbstractDataSet):
    def __init__(self, filepath, save_args=None, load_args=None):
        self._filepath = filepath
        self._save_args = save_args or {}
        self._load_args = load_args or {}

    def _describe(self):
        return dict(
            filepath=self._filepath,
            save_args=self._save_args,
            load_args=self._load_args,
        )

    def _load(self):
        # Read the serialized DocBin and return it with the path it came from
        docbin = DocBin().from_disk(self._filepath)
        return DocBinModel(Path(self._filepath), docbin)

    def _save(self, data: Union[DocBin, DocBinModel]):
        # Accept either a bare DocBin or the wrapper returned by _load
        if isinstance(data, DocBinModel):
            data = data.docbin
        data.to_disk(self._filepath)

    def _exists(self):
        return Path(self._filepath).exists()
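
A rough sketch of how a training node might use the path that this dataset returns. `PathCarrier` is a hypothetical stand-in for `DocBinModel` so the example runs without spaCy, and `spacy_train_command` is an illustrative helper. `config_filepath` and `output_dir` are assumed to be supplied some other way, for example as parameters.

```python
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class PathCarrier:
    """Hypothetical stand-in for DocBinModel: only the filepath is used here."""
    filepath: Path


def spacy_train_command(train, dev, config_filepath: str, output_dir: str) -> List[str]:
    """Build the `spacy train` argument list from objects exposing `.filepath`."""
    return [
        sys.executable, "-m", "spacy", "train", config_filepath,
        "--output", output_dir,
        "--paths.train", str(train.filepath),
        "--paths.dev", str(dev.filepath),
    ]


cmd = spacy_train_command(
    PathCarrier(Path("train.spacy")),
    PathCarrier(Path("dev.spacy")),
    "config.cfg",
    "./output",
)
print(cmd[1:])
```

In the pipeline, the node's inputs would be the catalog names of the two `SpacyDocBinDataSet` entries, so Kedro still sees the dependency on the preprocessing nodes that produce them.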