How to implement ctype.h functionality in JavaScript?

Time:04-16

What is the general approach for translating glibc ctype.h into JavaScript? I think I could do it, but I can't find the tables and bit shifting that implement these functions in the C source code. What are the best techniques here?

isalnum(c)
isalpha(c)
iscntrl(c)
isdigit(c)
islower(c)
isgraph(c)
isprint(c)
ispunct(c)
isspace(c)
isupper(c)
isxdigit(c)
isblank(c)

They appear to use a variety of techniques to generate these functions, perhaps depending on architecture. But what is the gist of what needs to be done to translate this manually to JavaScript? They seem to use tables of some sort, but I can't find those in the source either.

Looking at the OpenBSD ctype.h source gets me a little closer, but I'm still missing the tables, and I'm not sure whether that's the most optimized approach to use for JavaScript. For example:

__only_inline int isdigit(int _c)
{
  return (_c == -1 ? 0 : ((_ctype_ + 1)[(unsigned char)_c] & _N));
}

I don't see where they get the & _N from, or what (_ctype_ + 1)[index] means or where it comes from.

They define _ctype_ as:

extern const char *_ctype_;

But not being an expert in C, I am not sure how to interpret the extern part, or where I can find this table implementation. Maybe it's ctype_.c, but I'm not sure what to make of it yet.

CodePudding user response:

Speaking on the OpenBSD implementation:

What you have found is a lookup table full of bitmasks, with a size that matches the range of an unsigned char (0-255), which encompasses the ASCII table (0-127).

The upper range (128-255) is padded with zeroes.

Note there is a preceding byte, which gives everything an offset of 1.

The individual masks are defined at the top of ctype.h:

#define _U  0x01
#define _L  0x02
#define _N  0x04
#define _S  0x08
#define _P  0x10
#define _C  0x20
#define _X  0x40
#define _B  0x80

Note their binary representations are each a single, distinct bit.

The isXXX functions receive their argument as an int, which, after an initial test for EOF (-1), is cast to unsigned char. The resulting value (0-255) is used to index the lookup table, retrieving a bitmask that describes the qualities of that character.
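As a rough illustration of that lookup path, here is a minimal JavaScript sketch of isdigit alone. The names EOF and ctype are assumptions made for this example, and the table is filled in only for the digits; it mirrors the shape of the OpenBSD code (one leading byte, then 256 entries) rather than copying it.

```javascript
const EOF = -1;
const _N = 0x04; // "numeric" bit, as in the defines above

// 1 leading byte + 256 entries: slot 0 is the extra byte,
// slots 1..256 hold the masks for character codes 0..255.
const ctype = new Uint8Array(257);
for (let code = 0x30; code <= 0x39; code++) {
  ctype[code + 1] = _N; // '0'..'9'
}

// Takes an integer character code, like the C function takes an int.
function isdigit(c) {
  if (c === EOF) return 0;
  // (c & 0xff) plays the role of the (unsigned char) cast;
  // the + 1 skips the leading byte, like (_ctype_ + 1)[...].
  return ctype[(c & 0xff) + 1] & _N;
}

console.log(isdigit("7".charCodeAt(0))); // 4 (the _N bit, so truthy)
console.log(isdigit("x".charCodeAt(0))); // 0
console.log(isdigit(EOF));               // 0
```

As in C, the result is the masked bit (nonzero or 0), not a boolean.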

For example, the first 32 characters of the ASCII table are control characters, denoted here with the mask _C.

LF, or line-feed, is both a control and a space character. Its decimal value is 10 (0xA).

Finding this index in the lookup table, we see the mask is formed by bitwise ORing _C with _S, as in _C|_S.
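The LF entry can be worked through by hand. This small JavaScript sketch uses the mask values quoted above; lfEntry is just an illustrative name.

```javascript
// Mask values taken from the ctype.h defines quoted above.
const _N = 0x04; // numeric
const _S = 0x08; // space
const _C = 0x20; // control

// Table entry for LF (index 10): control and space bits both set.
const lfEntry = _C | _S;

console.log(lfEntry.toString(2).padStart(8, "0")); // "00101000"
console.log((lfEntry & _S) !== 0); // true:  LF is a space character
console.log((lfEntry & _C) !== 0); // true:  LF is a control character
console.log((lfEntry & _N) !== 0); // false: LF is not a digit
```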

The isXXX functions test for these qualities with a bitwise AND (&) and the appropriate bitmask.

For example, isalnum tests for Uppercase, Lowercase, and Numeric with the mask (_U|_L|_N).


The same approach can be taken in JavaScript. Roughly:

const _U = 0x01;
const _L = 0x02;
const _N = 0x04;
/* ... */
const _X = 0x40;

const lookup = {
    'A': _U | _X,
    /* ... */
    '0': _N,
    /* ... */
    'e': _L | _X,
    'p': _L,
};

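// Characters missing from the table have no bits set.
const get = c => lookup[c] ?? 0;
// Like the C versions, these return a nonzero number or 0, not a boolean.
const isupper = c => get(c) & _U;
const isxdigit = c => get(c) & (_N | _X);

for (const letter of "Apple0")
    console.log(isupper(letter), isxdigit(letter));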
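For a fuller version, one option is to build the whole table programmatically instead of writing every entry out. The sketch below is a reconstruction of the "C" locale classification from the ASCII ranges, not a copy of the OpenBSD table. The helpers mark, bits, and test are hypothetical names, and the functions take a one-character string and return a boolean, which differs from the C interface.

```javascript
const _U = 0x01, _L = 0x02, _N = 0x04, _S = 0x08;
const _P = 0x10, _C = 0x20, _X = 0x40, _B = 0x80;

// One byte per character code; 128-255 are left as 0.
const table = new Uint8Array(256);

// OR `mask` into every entry from `from` to `to`, inclusive.
const mark = (from, to, mask) => {
  for (let i = from; i <= to; i++) table[i] |= mask;
};

mark(0x00, 0x1f, _C);      // control characters
mark(0x7f, 0x7f, _C);      // DEL
mark(0x09, 0x0d, _S);      // \t \n \v \f \r
mark(0x20, 0x20, _S | _B); // space
mark(0x21, 0x2f, _P);      // ! through /
mark(0x3a, 0x40, _P);      // : through @
mark(0x5b, 0x60, _P);      // [ through `
mark(0x7b, 0x7e, _P);      // { through ~
mark(0x30, 0x39, _N);      // 0-9
mark(0x41, 0x5a, _U);      // A-Z
mark(0x61, 0x7a, _L);      // a-z
mark(0x41, 0x46, _X);      // A-F
mark(0x61, 0x66, _X);      // a-f

// Mask for the first character of `ch`; 0 for anything outside 0-255.
const bits = ch => {
  const code = ch.charCodeAt(0);
  return code < 256 ? table[code] : 0;
};

// Build a predicate that tests for any bit in `mask`.
const test = mask => ch => (bits(ch) & mask) !== 0;

const isalnum  = test(_U | _L | _N);
const isalpha  = test(_U | _L);
const iscntrl  = test(_C);
const isdigit  = test(_N);
const islower  = test(_L);
const isgraph  = test(_P | _U | _L | _N);
const isprint  = test(_P | _U | _L | _N | _B);
const ispunct  = test(_P);
const isspace  = test(_S);
const isupper  = test(_U);
const isxdigit = test(_N | _X);
const isblank  = ch => ch === " " || ch === "\t";

console.log(isxdigit("f"), isxdigit("g")); // true false
console.log(isspace("\n"), ispunct("!"));  // true true
console.log(isprint(" "), isgraph(" "));   // true false
```

A Uint8Array indexed by character code keeps the lookup close to the C version and avoids property lookups on a plain object.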

Note that everything here is more or less in regard to the "C" locale. Switching locales changes the behaviour of these functions (man 7 locale).
