The idea is to use a regex pattern for tokenization and deterministic tagging. A classifier (e.g. an LSTM) then fills in the tags for the ambiguous tokens.
The goal is to define a set of token classes that should work across most languages:
kwfl: flow keyword. if, for, return, try, except
kwop: operator keyword. Used like operator. in, is, select, new, echo
kwmo: modifier keyword. pub, private, static, final, volatile
kwde: declare variable, class, function
kwim: import keyword. import, from, #include (?), use
id: indentation. space/tab at beginning of line
ws: whitespace. space, tab
nl: new-line.
brop: opening brackets
brcl: closing brackets
sy: syntax features. :, ::, ->, =>, >>>, also <> in types
pu: punctuation.
co: comments (inline/multiline/single line)
nu: number. dec, int, scientific, hex, bin, percent.
st: string.
bo: boolean literals.
li: other literal. null, None, undefined, built in constant values
opbi: binary operator. Other binary operators
opun: unary operator. &ref, !not, X', x++, --x
opas: assignment operators. =, <-, +=,
opmo: modifier operators. references, pointers etc
pa: parameter. a variable defined together with a function.
ty: type keyword. int, f64, void
tyco: type keyword constructor.
cl: class. Non-primitive defined, also traits.
clco: class constructor. class name used as a function
mo: module/namespace.
fnme: method. A function on an object instance
fnas: associated/static method/function. On module or class
fnfr: standalone function.
fnto: function tear-off.
an: annotation. @Override, #[ allow() ], @property, Rust lifetimes
va: variable or similar user-defined identifier.
at: attribute. a variable/constant on some object or module.
uk: unknown.
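
The first stage (regex tokenization plus deterministic tagging) could look roughly like the sketch below. It is only an illustration: the rule set is a tiny, made-up subset of the classes above, the names `RULES`, `TOKEN_RE` and `tag_tokens` are hypothetical, and marking every identifier as `uk` for the classifier is an assumption about how "ambiguous" tokens are represented.

```python
import re

# Hypothetical, heavily reduced rule set: (class code, regex).
# Order matters: the first alternative that matches at a position wins.
RULES = [
    ("nl", r"\n"),
    ("id", r"^[ \t]+"),                      # indentation: only at the start of a line
    ("ws", r"[ \t]+"),
    ("co", r"#[^\n]*"),                      # assumption: '#' line comments only
    ("st", r'"[^"\n]*"|' r"'[^'\n]*'"),      # simple strings without escapes
    ("nu", r"\d+(?:\.\d+)?"),
    ("kwfl", r"\b(?:if|for|return|try|except)\b"),
    ("bo", r"\b(?:True|False|true|false)\b"),
    ("sy", r"::|->|=>|:"),
    ("opas", r"\+=|<-|=(?!=)"),
    ("opbi", r"==|!=|<=|>=|[+\-*/<>]"),
    ("brop", r"[(\[{]"),
    ("brcl", r"[)\]}]"),
    ("pu", r"[,;.]"),
    ("uk", r"[A-Za-z_]\w*|."),               # everything else is left for the classifier
]

# One combined pattern with a named group per class; MULTILINE lets '^'
# match after every newline so that 'id' is found on each line.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{tag}>{pattern})" for tag, pattern in RULES),
    re.MULTILINE,
)


def tag_tokens(source):
    """Split source into (class code, text) pairs using the rules above."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]


if __name__ == "__main__":
    tokens = tag_tokens("if x == 1:\n    return y  # done\n")
    for tag, text in tokens:
        print(f"{tag:5} {text!r}")
    # Tokens still tagged 'uk' are the ones a classifier would have to resolve
    # (for example into va, pa, fnfr or cl).
    print([text for tag, text in tokens if tag == "uk"])  # ['x', 'y']
```

Because the last rule matches any remaining character, joining the token texts reproduces the input exactly, which keeps the tokenizer lossless for rendering.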
- ✅ LSTM Tagger 24-12-07
- ✅ Render HTML preview 25-01-19
- ✅ NDJSON dataset 25-08-30
- ✅ Cleanup labels, linting 25-09-03
- ✅ Optuna, settle for a good LSTM model 25-09-20
- ❓ Balance dataset split criterion?
- ❓ Lightweight inference program.
- ❓ Reset indentation: avoid unnecessary indentation of all lines
- ❓ RNN variant comparison
- ❓ Feature based classifier
- ❓ data augmentation
- ❓ token LM
- ❓ character level LM -> "end to end" model
- ❓ try to catch code fragments in text?
- ❓ language classifier?
- ❓ highlighting inside strings?
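
For the "Reset indentation" item, one possible approach (an assumption, not something the project has settled on) is to strip the leading whitespace shared by all lines before tokenizing. The helper name `reset_indentation` is hypothetical; `textwrap.dedent` is from the Python standard library.

```python
import textwrap


def reset_indentation(snippet: str) -> str:
    """Remove the leading whitespace that is common to all lines of a snippet."""
    # textwrap.dedent compares whitespace literally, so a snippet that mixes
    # tabs and spaces on different lines may not be dedented at all.
    return textwrap.dedent(snippet)


if __name__ == "__main__":
    nested = "    if x:\n        return y\n"
    print(reset_indentation(nested))
```

Relative indentation inside the snippet is preserved; only the shared margin is removed.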