Incheonkirin/README.md

Mingi Jeong

ML/LLM Engineer — retrieval, LLM serving, and open-source library internals

Previously 5.5 years on the search team at 42Maru — Korean hybrid retrieval (BM25 + dense, learning-to-rank, hard negatives), MRC (machine reading comprehension), RAG, and open-source LLM fine-tuning.

LinkedIn Email


I work on retrieval and LLM systems where real data, much of it Korean, exposes quiet failures deep in the stack: embedding losses, RoPE caches, continuous batching. Tracing those to their source is where my open-source work comes from.

🔧 Upstream contributions

The main testbed is search_system, a Korean insurance-clause retrieval lab with nori BM25 + BGE-M3 hybrid retrieval, real-query failure cases, analyzer probes, and production-style traces. The bugs it surfaces are usually small, but they matter in production, and they share one pattern:

Data that is valid on one side of a representation boundary silently breaks the other — NFD Hangul vs. the analyzer, stop strings vs. byte-fragment tokens, a literal </tool_call> vs. the tool-call parser, bf16 logits vs. a float32 loss. Korean hits these boundaries constantly; English-only test suites never do.
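One of these boundaries, a literal </tool_call> inside a JSON string argument, can be reproduced with a toy parser. This is a minimal sketch, not vLLM's Hermes parser: `OUTPUT`, `parse_naive`, and `parse_json_aware` are hypothetical names, and the regex is only an assumption about how a tag-delimited parser might be written.

```python
import json
import re

# Hypothetical model output: the JSON string argument itself contains "</tool_call>".
OUTPUT = (
    '<tool_call>{"name": "echo", "arguments": '
    '{"text": "close with </tool_call> please"}}'
    "</tool_call>"
)


def parse_naive(text):
    """Toy regex parser: cuts the body at the first closing tag it sees."""
    calls = []
    for body in re.findall(r"<tool_call>(.*?)</tool_call>", text, flags=re.DOTALL):
        try:
            calls.append(json.loads(body))
        except json.JSONDecodeError:
            pass  # truncated body is invalid JSON, so the call is silently dropped
    return calls


def parse_json_aware(text):
    """Let the JSON decoder, not the closing tag, decide where the object ends."""
    decoder = json.JSONDecoder()
    open_tag = "<tool_call>"
    calls, pos = [], 0
    while (start := text.find(open_tag, pos)) != -1:
        start += len(open_tag)
        while start < len(text) and text[start].isspace():
            start += 1  # raw_decode does not skip leading whitespace
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start
            continue
        calls.append(obj)
        pos = end
    return calls


print(parse_naive(OUTPUT))       # the tool call is lost
print(parse_json_aware(OUTPUT))  # the tool call survives, closing tag and all
```

The design point is that the tag is valid data on the JSON side of the boundary, so the component that understands JSON strings should be the one to find the end of the payload.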

Recent fixes have landed upstream in sentence-transformers, transformers, Elasticsearch, MLflow, and LlamaIndex: embedding-loss correctness, dynamic RoPE cache resets, continuous-batching output snapshots, nori analyzer docs, MLflow logging, and CJK text-splitter recursion.

Retrieval training and embedding losses

  • sentence-transformers #3800 — bf16/fp16 training crash across six learning-to-rank losses. (merged)
  • sentence-transformers #3817 — on multi-GPU gather_across_devices, gathered positives in GISTEmbedLoss/CachedGISTEmbedLoss were masked as false negatives, so the cross-entropy target collapsed to -inf and the training signal silently vanished on rank > 0. Surfaced with a Korean polarity probe; it also covered a regression the earlier in-batch-negative fix (#3453) had left in the GIST losses. (merged)
  • sentence-transformers #3816 — avoid materializing the full non-FAISS hard-negative mining similarity matrix. (merged)
  • sentence-transformers #3812 — MPS support for cached-loss RandContext. (merged)
  • sentence-transformers #3821 — hard-negative mining's relative-margin threshold depended on the sign of the positive score and inverted when that score was negative; made it sign-independent (#3819). (merged)
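The sign dependence in the relative-margin item can be shown with plain arithmetic. This is a sketch under assumptions: both formulas are my reading of the description above, not code copied from sentence-transformers, and the function names are hypothetical.

```python
def threshold_multiplicative(pos_score: float, relative_margin: float) -> float:
    """Assumed original form: scale the positive score by (1 - margin)."""
    return pos_score * (1.0 - relative_margin)


def threshold_sign_independent(pos_score: float, relative_margin: float) -> float:
    """Assumed fixed form: always move the threshold below the positive score."""
    return pos_score - relative_margin * abs(pos_score)


margin = 0.25
for pos in (0.5, -0.5):
    old = threshold_multiplicative(pos, margin)
    new = threshold_sign_independent(pos, margin)
    # A candidate negative is kept only if its score is at or below the threshold.
    print(f"pos={pos:+.3f}  multiplicative={old:+.3f}  sign-independent={new:+.3f}")
```

For a positive score both forms agree. For a negative positive-score, the multiplicative threshold lands above the positive itself, so candidates more similar than the positive would pass the filter.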

LLM serving and model internals

  • huggingface/transformers #46530 — StopStringCriteria misses CJK stop strings on byte-level tokenizers (#46519). (merged)
  • huggingface/transformers #46624 — dynamic RoPE never reset inv_freq on the layer_type=None path (it wrote max_seq_len_cached to a stray None_… attribute), so a long sequence followed by a short one kept the scaled frequencies. (merged)
  • huggingface/transformers #46670 — continuous batching's output conversion mutated the active request state and returned live aliases of the growing token/logprob buffers; made it a snapshot. (merged)
  • run-llama/llama_index #21900 — RecursionError in text splitters when a single CJK/emoji token exceeds chunk_size. (merged)
  • huggingface/transformers #46643 — TopHLogitsWarper was built without min_tokens_to_keep, so with peaked logits and beam sampling top-h could keep a single token while the other warpers kept the beam-safe minimum. (open)
  • vllm-project/vllm #45168 — Hermes tool parser drops tool calls when a literal </tool_call> appears inside a JSON string argument (#45167). (open)
  • NAVER hcx-vllm-plugin #5 — reported the same parser-boundary bug class for literal <|im_end|> inside JSON string arguments. (open issue)
  • vllm-project/vllm #45162 — collect_env.py aborted with an AssertionError on non-Linux platforms. (open)
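The alias-versus-snapshot distinction from the continuous-batching item can be sketched in a few lines. `RequestState` and its two methods are hypothetical stand-ins for illustration, not the transformers classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RequestState:
    """Hypothetical stand-in for an in-flight generation request."""
    generated_tokens: List[int] = field(default_factory=list)
    logprobs: List[float] = field(default_factory=list)

    def to_output_alias(self) -> Dict[str, list]:
        # Returns the live buffers: the caller's view keeps changing.
        return {"tokens": self.generated_tokens, "logprobs": self.logprobs}

    def to_output_snapshot(self) -> Dict[str, list]:
        # Returns copies: the caller sees the state at conversion time.
        return {"tokens": list(self.generated_tokens), "logprobs": list(self.logprobs)}


state = RequestState()
state.generated_tokens.append(11)
state.logprobs.append(-0.5)

aliased = state.to_output_alias()
snapshot = state.to_output_snapshot()

# The scheduler keeps decoding after the output was handed out.
state.generated_tokens.append(22)
state.logprobs.append(-0.25)

print(aliased["tokens"])   # grows along with the request
print(snapshot["tokens"])  # frozen at the moment of conversion
```

A shallow copy is enough here because the buffers hold immutable ints and floats; nested mutable payloads would need a deeper copy.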

Search analyzers and query normalization

  • elastic/elasticsearch #151157 — documented that nori's default XPN stop tag silently deletes meaning-bearing Korean prefixes, so 비급여 (non-covered) analyzes to 급여 (covered), from issue #151094. (merged)
  • apache/lucene #16242 — new HangulCompositionCharFilter for analysis-nori: NFD-form Hangul was silently unanalyzable as Korean (#16241). (open)
  • elastic/elasticsearch #151008 — wildcard queries: re-escape operator characters produced by the normalizer. (open)
  • explosion/spaCy #13974 — Korean tokenizer collapsed whitespace runs, breaking doc.text round-trips and offsets. (open)
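The NFD Hangul problem from the Lucene item is visible with the Python standard library alone. This sketch uses full NFC normalization as a stand-in for a Hangul composition step; whether the proposed char filter composes only Hangul or more is not stated here, and `compose_hangul` is a hypothetical helper name.

```python
import unicodedata


def compose_hangul(text: str) -> str:
    """Stand-in for a composition char filter: NFC recomposes conjoining jamo."""
    return unicodedata.normalize("NFC", text)


nfc = "비급여"                              # precomposed syllables
nfd = unicodedata.normalize("NFD", nfc)    # same text as conjoining jamo

print(len(nfc), len(nfd))                  # 3 7
print(nfc == nfd)                          # False: different code points
print("급여" in nfd)                       # False: a precomposed query cannot match
print("급여" in compose_hangul(nfd))       # True once the text is recomposed
```

Both strings render identically, which is why the mismatch is easy to miss until a query or analyzer that expects precomposed syllables sees the decomposed form.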

Production tooling, tracing, and vector search

  • facebookresearch/faiss #5272 — diagnosed that musllinux wheels were dropped during the move to official PyPI wheels (*-musllinux_* remained in the cibuildwheel skip list) and outlined the restore path; upstream shipped the fix in faiss-cpu 1.14.3 via #5299. (resolved upstream)
  • mlflow #23957 — restored dataset expectation/tag logging in genai.evaluate(scorers=[]). (merged)
  • mlflow #23818 — OpenTelemetry retriever-span reassembly on ingest. (open)
  • ragas #2759 — make VertexAI imports optional so import ragas does not fail without Vertex dependencies. (open)
  • BentoML #5632 / #5633 — proxy-client configurability and monitoring-log span metadata. (open)

🏢 Enterprise NLP/QA at 42Maru (press)

Closed-source enterprise systems I worked on at 42Maru, with the research and engineering teams: Korean search quality, semantic QA, retrieval behavior, and OCR/NLP pipelines for real customer workflows.

  • AI ship-sales design-support system — Daewoo Shipbuilding (DSME): semantic QA over ~100K historical records for shipowners' pre-contract technical inquiries. press
  • AML / trade-based transaction detection — Hana Bank: OCR-NLP over cross-border remittance invoices. press

📊 Public artifacts from 42Maru — NIA AI Hub

Government-published Korean NLP artifacts from 42Maru projects I worked on: five AI Hub releases across news MRC, national-archives LLM instruction data, finance/legal MRC, numeric reasoning MRC, and table QA. ~2.3M labeled QA pairs plus a ~300M-token corpus.

news MRC · national-archives LLM corpus · finance/legal MRC · numeric-reasoning MRC · table QA


🧭 Repo map


🧰 Stack

Python PyTorch Transformers sentence-transformers vLLM MLflow Elasticsearch / Lucene Hybrid Retrieval / RAG

Pinned

  1. Incheonkirin.github.io (Public)

    Personal site: portfolio and notes.

    TypeScript