Home > Back-end >  How Can I Use Generics to Write a Reusuable TSV Parser in TypeScript?
How Can I Use Generics to Write a Reusuable TSV Parser in TypeScript?

Time:01-08

I am trying to implement a function parseTSV() that parses a TSV file and returns an array of objects but I am having a hard time figuring out how to type the function.

This is what I have so far:

export const parseTsv = <T>(tsvFilepath: string) => {
    const tsv = fs.readFileSync(tsvFilepath, 'utf8');
    const lines = tsv.split('\n');
    const result: T[] = [];
    const columns = lines[0].split('\t'); // How can I constrain columns to be a key of T?

    for (let i = 2; i < lines.length; i  ) {
        const obj = {} as T;
        const currentline = lines[i].split("\t");

        for (let j = 0; j < columns.length; j  ) {
            obj[columns[j]] = currentline[j]; /* ERROR: Element implicitly has an 'any' type because expression of type 'string' can't be used to index type 'unknown'.
  No index signature with a parameter of type 'string' was found on type 'unknown'. */
        }

        result.push(obj);
    }

    return result;
}

I wrote the function with intending for the caller to define <T> which is the interface of the object parsed from the TSV file (i.e., the columns in the TSV file).

CodePudding user response:

The call signature

declare const parseTsv: <T>(tsvFilepath: string) => T[]

cannot be safely implemented. It claims to take a tsvFilepath string and produce an array of values of type T supplied by the caller at design time. But the static type system is erased when TS is compiled to JS, and T is just a design-time type, not a runtime value. No information about T is available to the function that runs, and so if it does work it would be a coincidence; you'd be relying on the developer to verify that the type being returned will be correct.

Consider the following TypeScript code:

interface Foo {
    bar: string;
    baz: string;
}
const foos: Foo[] = parseTsv<Foo>("somePath");
foos.map(x => x.bar.toUpperCase());

interface EvilFoo {
    bar: number;
    baz: string;
}
const evils: EvilFoo[] = parseTsv<EvilFoo>("somePath");
evils.map(x => x.bar.toFixed());

At runtime that will almost certainly be

const foos = parseTsv("somePath");
foos.map(x => x.bar.toUpperCase());
const evils = parseTsv("somePath");
evils.map(x => x.bar.toFixed());

where parseTsv("somePath") is run twice, with the same input, and I think it's safe to assume that they will return the same output. Nothing about Foo or EvilFoo is involved. The chance that the file at "somePath" represents both an array of Foo objects and an array of EvilFoo objects is quite low (an object with both a string and a number at the bar property is not possible), so it is very likely that at least one of those map() methods will try to dereference undefined and you'll get a runtime error.

Indeed your implementation of parseTsv() can't even produce a type with non-string properties, while T is completely unconstrained. But even if we constrained T to only have string properties, the function doesn't know what keys it can expect. The only safe way to give your parseTsv() a call signature would be something like the non-generic:

declare const parseTsv: (tsvFilepath: string) => {[k: string]: string | undefined}[]

which says that the output will be an array of some objects whose properties are either string or undefined.


If you want to keep your implementation the same at runtime and your typing the same, then I wouldn't worry much about type safety in the implementation; just use type assertions or the any type to avoid warnings:

export const parseTsv = <T extends Record<keyof T, string>>(
    tsvFilepath: string) => {
    const tsv = fs.readFileSync(tsvFilepath, 'utf8');
    const lines = tsv.split('\n');
    const result: T[] = [];
    const columns = lines[0].split('\t');
    for (let i = 1; i < lines.length; i  ) {
        const obj = {} as any; // just use any
        const currentline = lines[i].split("\t");
        for (let j = 0; j < columns.length; j  ) {
            obj[columns[j]] = currentline[j];
        }
        result.push(obj);
    }
    return result;
}

If you want something more type safe, you'll need to change your call signature and the implementation. One possible approach could be:

const parseTsv = <K extends string>(tsvFilepath: string, ...keys: K[]) => {
    const tsv = fs.readFileSync(tsvFilepath, 'utf8');
    const lines = tsv.split('\n');
    const result: { [P in K]: string }[] = [];
    if (!lines.length) return result;
    const columns = lines[0].split('\t') as K[];
    for (const key of keys) {
        if (!columns.includes(key)) throw new Error(
            "TSV file is missing expected header named \""  
            key   "\" in line 1"
        );
    }
    for (let i = 1; i < lines.length; i  ) {
        const obj = {} as { [P in K]: string };
        const currentline = lines[i].split("\t");
        for (let j = 0; j < columns.length; j  ) {
            const key = columns[j];
            const val = currentline[j];
            if (typeof val !== "string") throw new Error(
                "TSV file is missing expected value for \""  
                key   "\" in line "   (i   1)
            );
            obj[key] = currentline[j];
        }
        result.push(obj);
    }
    return result;
}

Here you are passing in the list of keys you expect to exist on your objects. The function is generic only in the type K of these keys. The output type is { [P in K]: string }[], meaning an array of elements whose keys are K and whose values are strings.

I used some type assertions in the implementation to prevent errors (such as as K[] and as {[P in K]: string}) which means I've taken it upon myself to guarantee type safety. I've attempted to do that by having the implementation throw errors if the keys don't match or if a line is missing an entry. The only way this check is possible is because, at runtime, the keys array of type K[] exists (while K itself only exists at design time).

Anyway, let's test it out with a dummy file system:

const fs = {
    readFileSync(path: string, encoding: string) {
        switch (path) {
            case "badFile":
                return "bar\tbaz\nabc\nghi\tjkl"
            default:
                return "bar\tbaz\nabc\tdef\nghi\tjkl";
        }
    }
}

First for the happy cases:

interface Foo {
    bar: string;
    baz: string;
}
const objects: Foo[] = parseTsv("somePath", "bar", "baz");
console.log(objects);
//  [{ "bar": "abc", "baz": "def" }, { "bar": "ghi", "baz": "jkl" }] 
objects.map(x => x.bar.toUpperCase())

That works because there are "bar" and "baz" keys in the TSV. You can be somewhat confident that any operation you perform assuming objects is of type Foo[] will work. Now for the sad cases:

interface Other {
    other: string;
    key: string;
}
try {
    const others = parseTsv("somePath", "other", "key"); // RUNTIME ERROR!
    // const others: {other: string; key: string; }[]
    others.map(x => x.other.toUpperCase())
} catch (e) {
    console.log(e); // TSV file is missing expected header named "other" in line 1
}

try {
    const oops: Foo[] = parseTsv("badFile", "bar", "baz");
    oops.map(x => x.bar.toUpperCase())
} catch (e) {
    console.log(e); // TSV file is missing expected value for "baz" in line 2
}

Those fail because the file at "somePath" does not have "other" and "key" headers, and because the file at "badFile" is bad, both of which are caught at runtime. The others.map() and oops.map() lines are never reached. There's no problem assuming that others is of type Other[] and that oops is of type Foo[], since the only way for that code to run would be is if those constraints were met. So it's reasonably type safe.

Playground link to code

  • Related