
Mingi Jeong

ML/LLM Engineer — retrieval, LLM serving, and open-source library internals

Previously 5.5 years on the search team at 42Maru — Korean hybrid retrieval (BM25 + dense, learning-to-rank, hard negatives), MRC (machine reading comprehension), RAG, and open-source LLM fine-tuning.

LinkedIn · Email


I work on retrieval and LLM systems where real data, much of it Korean, exposes quiet failures deep in the stack: embedding losses, RoPE caches, continuous batching. Tracing those to their source is where my open-source work comes from.

🔧 Upstream contributions

The main testbed is search_system — a Korean insurance-clause retrieval lab with nori BM25 + BGE-M3 hybrid retrieval, real-query failure cases, analyzer probes, and production-style traces. The pattern is usually small, but it matters in production:

Data that is valid on one side of a representation boundary silently breaks the other — NFD Hangul vs. the analyzer, stop strings vs. byte-fragment tokens, a literal </tool_call> vs. the tool-call parser, bf16 logits vs. a float32 loss. Korean hits these boundaries constantly; English-only test suites never do.

Recent fixes have landed upstream in sentence-transformers, transformers, Elasticsearch, MLflow, and LlamaIndex: embedding-loss correctness, dynamic RoPE cache resets, continuous-batching output snapshots, nori analyzer docs, MLflow logging, and CJK text-splitter recursion.

Retrieval training and embedding losses

  • sentence-transformers #3800 — bf16/fp16 training crash across six learning-to-rank losses. (merged)
  • sentence-transformers #3817 — on multi-GPU gather_across_devices, gathered positives in GISTEmbedLoss/CachedGISTEmbedLoss were masked as false negatives, so the cross-entropy target collapsed to -inf and the training signal silently vanished on rank > 0. Surfaced with a Korean polarity probe; it also covered a regression the earlier in-batch-negative fix (#3453) had left in the GIST losses. (merged)
  • sentence-transformers #3816 — avoid materializing the full non-FAISS hard-negative mining similarity matrix. (merged)
  • sentence-transformers #3812 — MPS support for cached-loss RandContext. (merged)
  • sentence-transformers #3821 — hard-negative mining's relative-margin threshold was sign-dependent and inverted on negative positive-scores; made it sign-independent (#3819). (merged)
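The sign problem in the relative-margin item above can be shown in a few lines. This is a minimal sketch of the bug class, not the sentence-transformers implementation: both function names are hypothetical, and the multiplicative form is an assumption about what a sign-dependent threshold looks like.

```python
def naive_max_negative_score(pos_score: float, relative_margin: float) -> float:
    """Sign-dependent threshold (hypothetical): scales the positive score.

    For pos_score > 0 the threshold sits below the positive score, as intended.
    For pos_score < 0 the same multiplication moves it ABOVE the positive score.
    """
    return pos_score * (1.0 - relative_margin)


def sign_independent_max_negative_score(pos_score: float, relative_margin: float) -> float:
    """Sign-independent threshold (hypothetical): always subtracts a margin
    proportional to the magnitude of the positive score."""
    return pos_score - relative_margin * abs(pos_score)


for pos in (0.8, -0.5):
    naive = naive_max_negative_score(pos, 0.1)
    fixed = sign_independent_max_negative_score(pos, 0.1)
    print(f"pos={pos:+.2f}  naive={naive:+.3f}  sign_independent={fixed:+.3f}")
```

With a positive score of 0.8 both forms agree. With a positive score of -0.5 the naive form allows negatives that score higher than the positive, which is the inversion described above.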

LLM serving and model internals

  • huggingface/transformers #46530 — StopStringCriteria misses CJK stop strings on byte-level tokenizers (#46519). (merged)
  • huggingface/transformers #46624 — dynamic RoPE never reset inv_freq on the layer_type=None path (it wrote max_seq_len_cached to a stray None_… attribute), so a long sequence followed by a short one kept the scaled frequencies. (merged)
  • huggingface/transformers #46670 — continuous batching's output conversion mutated the active request state and returned live aliases of the growing token/logprob buffers; made it a snapshot. (merged)
  • run-llama/llama_index #21900 — RecursionError in text splitters when a single CJK/emoji token exceeds chunk_size. (merged)
  • huggingface/transformers #46643 — TopHLogitsWarper was built without min_tokens_to_keep, so with peaked logits and beam sampling top-h could keep a single token while the other warpers kept the beam-safe minimum. (open)
  • vllm-project/vllm #45168 — Hermes tool parser drops tool calls when a literal </tool_call> appears inside a JSON string argument (#45167). (open)
  • NAVER hcx-vllm-plugin #5 — reported the same parser-boundary bug class for literal <|im_end|> inside JSON string arguments. (open issue)
  • vllm-project/vllm #45162 — collect_env.py aborted with an AssertionError on non-Linux platforms. (open)
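The parser-boundary bug class from the tool-call items above is easy to reproduce in isolation. The sketch below is not vLLM's Hermes parser: `parse_with_regex` and `parse_with_json_decoder` are hypothetical helpers, and the tag format is assumed to be `<tool_call>{json}</tool_call>`. It shows how a delimiter search that ignores JSON string boundaries drops a call, and how letting the JSON decoder find the end of the object avoids that.

```python
import json
import re

OPEN_TAG = "<tool_call>"


def parse_with_regex(text: str) -> list:
    """Naive approach: the non-greedy match ends at the first closing tag,
    even when that tag is a literal inside a JSON string."""
    calls = []
    for body in re.findall(r"<tool_call>(.*?)</tool_call>", text, flags=re.DOTALL):
        try:
            calls.append(json.loads(body))
        except json.JSONDecodeError:
            pass  # truncated JSON: the tool call is silently dropped
    return calls


def parse_with_json_decoder(text: str) -> list:
    """String-aware approach: after each opening tag, let the JSON decoder
    decide where the object ends instead of searching for the closing tag."""
    decoder = json.JSONDecoder()
    calls = []
    pos = 0
    while True:
        start = text.find(OPEN_TAG, pos)
        if start == -1:
            break
        body_start = start + len(OPEN_TAG)
        # raw_decode does not skip leading whitespace, so do it here.
        while body_start < len(text) and text[body_start].isspace():
            body_start += 1
        try:
            obj, end = decoder.raw_decode(text, body_start)
        except json.JSONDecodeError:
            pos = body_start  # malformed body: move on to the next opening tag
            continue
        calls.append(obj)
        pos = end
    return calls


sample = (
    "<tool_call>\n"
    '{"name": "write_file", "arguments": {"content": "close with </tool_call> here"}}\n'
    "</tool_call>"
)

print(len(parse_with_regex(sample)))         # the call is lost
print(len(parse_with_json_decoder(sample)))  # the call survives
```

The same idea applies to any end-of-turn marker that can legally appear inside an argument string: the delimiter is only meaningful outside the JSON value.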

Search analyzers and query normalization

  • elastic/elasticsearch #151157 — documented that nori's default XPN stop tag silently deletes meaning-bearing Korean prefixes, so 비급여 (non-covered) analyzes to 급여 (covered), from issue #151094. (merged)
  • apache/lucene #16242 — new HangulCompositionCharFilter for analysis-nori: NFD-form Hangul was silently unanalyzable as Korean (#16241). (open)
  • elastic/elasticsearch #151008 — wildcard queries: re-escape operator characters produced by the normalizer. (open)
  • explosion/spaCy #13974 — Korean tokenizer collapsed whitespace runs, breaking doc.text round-trips and offsets. (open)
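The NFD issue behind the Lucene item above can be seen with the Python standard library alone. This is an illustration of the normalization boundary, not the Java char filter proposed upstream: `is_precomposed_hangul` is a hypothetical helper, and the assumption is that a Korean analyzer's dictionary is keyed on precomposed (NFC) syllables.

```python
import unicodedata


def is_precomposed_hangul(ch: str) -> bool:
    """True if ch is in the Hangul Syllables block (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


nfc = "비급여"                              # three precomposed syllables
nfd = unicodedata.normalize("NFD", nfc)     # same text, decomposed into conjoining jamo

print(len(nfc), len(nfd))                   # the two forms differ in code-point count
print(nfc == nfd)                           # visually identical, not equal as strings
print(all(is_precomposed_hangul(c) for c in nfc))
print(any(is_precomposed_hangul(c) for c in nfd))

# Composing before analysis restores the form a syllable-keyed dictionary expects.
recomposed = unicodedata.normalize("NFC", nfd)
print(recomposed == nfc)
```

NFD text typically arrives from file names and copy-paste on some platforms, so the two forms look identical on screen while only one of them matches the dictionary.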

Production tooling, tracing, and vector search

  • facebookresearch/faiss #5272 — diagnosed that musllinux wheels were dropped during the move to official PyPI wheels (*-musllinux_* remained in the cibuildwheel skip list) and outlined the restore path; upstream shipped the fix in faiss-cpu 1.14.3 via #5299. (resolved upstream)
  • mlflow #23957 — restored dataset expectation/tag logging in genai.evaluate(scorers=[]). (merged)
  • mlflow #23818 — OpenTelemetry retriever-span reassembly on ingest. (open)
  • ragas #2759 — make VertexAI imports optional so import ragas does not fail without Vertex dependencies. (open)
  • BentoML #5632 / #5633 — proxy-client configurability and monitoring-log span metadata. (open)

🏢 Enterprise NLP/QA at 42Maru (press)

Closed-source enterprise systems I worked on at 42Maru, with the research and engineering teams: Korean search quality, semantic QA, retrieval behavior, and OCR/NLP pipelines for real customer workflows.

  • AI ship-sales design-support system — Daewoo Shipbuilding (DSME): semantic QA over ~100K historical records for shipowners' pre-contract technical inquiries. press
  • AML / trade-based transaction detection — Hana Bank: OCR-NLP over cross-border remittance invoices. press

📊 Public artifacts from 42Maru — NIA AI Hub

Government-published Korean NLP artifacts from 42Maru projects I worked on: five AI Hub releases across news MRC, national-archives LLM instruction data, finance/legal MRC, numeric reasoning MRC, and table QA. ~2.3M labeled QA pairs plus a ~300M-token corpus.

news MRC · national-archives LLM corpus · finance/legal MRC · numeric-reasoning MRC · table QA


🧭 Repo map


🧰 Stack

Python · PyTorch · Transformers · sentence-transformers · vLLM · MLflow · Elasticsearch / Lucene · Hybrid Retrieval / RAG
