Add multimodal embedding & rerank support#66
Conversation
It would be best to create a multimodal Embedding class in llama_embedding.py, or enhance the existing Embedding class, to manage mctx. There is no need to add extra memory usage to Llama. Remember to release the memory once the new mctx is no longer needed.
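A minimal sketch of the suggested lifetime pattern, with a hypothetical MultimodalEmbedding wrapper (the mtmd context is simulated by plain callables here; real code would call the mtmd init/free bindings):

```python
class MultimodalEmbedding:
    """Hypothetical wrapper that owns the mctx lifetime.

    acquire_ctx/release_ctx stand in for the real mtmd
    init/free calls; the names are illustrative only.
    """

    def __init__(self, acquire_ctx, release_ctx):
        self._release_ctx = release_ctx
        self._mctx = acquire_ctx()

    def close(self):
        # Release the mtmd context exactly once
        if self._mctx is not None:
            self._release_ctx(self._mctx)
            self._mctx = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

Used as a context manager, the mctx is guaranteed to be released even if embedding raises, which is the "release memory after use" point above.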
Actually I already am, by the way. Here is my usage.
Currently there is indeed no multimodal class, analogous to Llama or the sampler classes, that abstracts the mtmd_cpp API. The heavyweight and complex llama_chat_format implementations based on llava 1.5 are indeed difficult to maintain.
Contents can be images or audio; local disk paths, network URLs, and bytes/bytearray instances are supported, but there is no video support yet. I also thought create_completion was too complex, so I will create an alternate function instead (to avoid a breaking change).
Hi @roj234, this PR can keep being adapted and optimized. I need to refactor the batch decode and eval parts first: the old execution logic has alignment issues that leave the KV cache out of sync after the first round with new models. On top of the changes in ggml-org/llama.cpp@2b6dfe8, I will simply refactor following llama.cpp's current, newer approach. This will interfere somewhat with the Embedding part, but it should be worth it.
OK. My planned changes are: besides the added LlamaEmbedding.embed_multimodal function, create a function along the lines of Llama.create_multimodal_chat_completion that can directly handle the image/audio objects in a request, or video objects in the future. (I looked at Qwen VL's code; its video implementation uses ffmpeg to slice a video into an n-FPS image sequence, though new approaches may emerge, in which case it will depend on how the mtmd library implements them.)
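The request-side handling described above could start with splitting media parts out of OpenAI-style chat messages; this sketch assumes content parts shaped like {"type": "text"}, {"type": "image_url"}, and {"type": "input_audio"} (a future "video" type would slot in the same way), and the function name is hypothetical:

```python
def split_media_parts(messages):
    """Separate text from media parts in OpenAI-style chat messages.

    Returns (texts, media) where media keeps the original part dicts
    so a later stage can load and embed them.
    """
    texts, media = [], []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            # Plain-string content has no media
            texts.append(content)
            continue
        for part in content or []:
            if part.get("type") == "text":
                texts.append(part["text"])
            elif part.get("type") in ("image_url", "input_audio"):
                media.append(part)
    return texts, media
```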
Hi, I have finished refactoring the batch, generate, and eval parts; you can follow the new usage. Logits are no longer decided inside the method: the batch function only handles assembly. Release the CPU-side logits, then have the outer loop dynamically update the scores according to _logits_all. By default only the last logits are computed now, which relieves the pressure of computing them on every step.
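The "only the last logits by default" idea, simulated with plain tuples (llama.cpp's llama_batch carries a per-token logits flag; the function name and shapes here are illustrative, not the PR's actual API):

```python
def assemble_batch(tokens, logits_all=False):
    """Build (token, pos, want_logits) triples for one sequence.

    By default only the final token requests logits, mirroring the
    refactor where the batch method only assembles and the outer
    loop updates scores according to _logits_all.
    """
    n = len(tokens)
    return [
        (tok, pos, logits_all or pos == n - 1)
        for pos, tok in enumerate(tokens)
    ]
```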
(cherry picked from commit 4ba212f)
Hi, at this point I think MTMDChatHandler has been fully refactored and extracted. The day before yesterday I added audio and promoted the handler to a media processing workflow (for omni or image/audio models). Alternatively, you could derive an MTMD_Embed_ChatHandler from it, add an embedding method interface, and implement the algorithm in llama_embedding.py; it does not need to be overly complex.
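The suggested subclassing could look like this sketch; MTMDChatHandler here is a stand-in for the PR's refactored handler, and the embed hook, its signature, and the injected function are all assumptions:

```python
class MTMDChatHandler:
    """Stand-in for the PR's refactored handler."""

    def handle_media(self, media):
        # Placeholder for the real media processing workflow
        return list(media)


class MTMDEmbedChatHandler(MTMDChatHandler):
    """Hypothetical subclass adding an embedding entry point,
    with the actual algorithm living in llama_embedding.py."""

    def __init__(self, embed_fn):
        # embed_fn would be supplied by llama_embedding.py
        self._embed_fn = embed_fn

    def embed(self, media):
        chunks = self.handle_media(media)
        return [self._embed_fn(c) for c in chunks]
```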
It works, but it duplicates functionality: llama_chat_format already implements multimodal support, yet it does not support embedding models like Qwen-VL-Embedding.
This code heavily references llama-server's C++ code (ServerTokens).
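llama-server's ServerTokens interleaves plain text tokens with media chunks in one position space; a rough Python analogue of that idea (not the PR's actual class, and the position accounting is simplified) might be:

```python
class ServerTokensLike:
    """Rough analogue of llama-server's ServerTokens: a flat
    position space whose entries are either text token ids or
    media chunks."""

    def __init__(self):
        self._items = []  # ("token", id) or ("media", chunk)

    def add_token(self, token_id):
        self._items.append(("token", token_id))

    def add_media(self, chunk):
        self._items.append(("media", chunk))

    def n_pos(self):
        # A text token occupies one position; a media chunk may
        # occupy several (e.g. one per image patch embedding).
        return sum(
            1 if kind == "token" else len(chunk)
            for kind, chunk in self._items
        )
```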