Add multimodal embedding & rerank support#66
Conversation
It would be best to create a multimodal Embedding class in llama_embedding.py, or enhance the existing Embedding class, to manage mctx. There is no need to add extra memory usage to Llama. Remember to release the memory once the new mctx is no longer needed.
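A minimal sketch of the suggested lifetime pattern, with a hypothetical MultimodalEmbedding wrapper (the mtmd context is simulated by plain callables here; real code would call the mtmd init/free bindings):

```python
class MultimodalEmbedding:
    """Hypothetical wrapper that owns the mctx lifetime.

    acquire_ctx/release_ctx stand in for the real mtmd
    init/free calls; the names are illustrative only.
    """

    def __init__(self, acquire_ctx, release_ctx):
        self._release_ctx = release_ctx
        self._mctx = acquire_ctx()

    def close(self):
        # Release the mtmd context exactly once
        if self._mctx is not None:
            self._release_ctx(self._mctx)
            self._mctx = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

Used as a context manager, the mctx is guaranteed to be released even if embedding raises, which is the "release memory after use" point above.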
Actually I already am, by the way. Here is my usage.
Currently there is indeed no multimodal class, analogous to Llama or the sampler classes, that abstracts the mtmd_cpp API. The heavyweight and complex llama_chat_format implementations based on llava 1.5 are indeed difficult to maintain.
Contents can be images or audio; local disk paths, network URLs, and bytes/bytearray instances are supported, but there is no video support yet. I also thought create_completion was too complex, so I will create an alternate function instead (to avoid a breaking change).
Hi @roj234, this PR can keep being adapted and optimized. I need to refactor the batch decode and eval parts first: the old execution logic has alignment issues that leave the KV cache out of sync after the first round with new models. On top of the changes in ggml-org/llama.cpp@2b6dfe8, I will simply refactor following llama.cpp's current, newer approach. This will interfere somewhat with the Embedding part, but it should be worth it.
OK. My planned changes are: besides the added LlamaEmbedding.embed_multimodal function, create a function along the lines of Llama.create_multimodal_chat_completion that can directly handle the image/audio objects in a request, or video objects in the future. (I looked at Qwen VL's code; its video implementation uses ffmpeg to slice a video into an n-FPS image sequence, though new approaches may emerge, in which case it will depend on how the mtmd library implements them.)
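The request-side handling described above could start with splitting media parts out of OpenAI-style chat messages; this sketch assumes content parts shaped like {"type": "text"}, {"type": "image_url"}, and {"type": "input_audio"} (a future "video" type would slot in the same way), and the function name is hypothetical:

```python
def split_media_parts(messages):
    """Separate text from media parts in OpenAI-style chat messages.

    Returns (texts, media) where media keeps the original part dicts
    so a later stage can load and embed them.
    """
    texts, media = [], []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            # Plain-string content has no media
            texts.append(content)
            continue
        for part in content or []:
            if part.get("type") == "text":
                texts.append(part["text"])
            elif part.get("type") in ("image_url", "input_audio"):
                media.append(part)
    return texts, media
```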
Hi, I have finished refactoring the batch, generate, and eval parts; you can follow the new usage. Logits are no longer decided inside the method: the batch function only handles assembly. Release the CPU-side logits, then have the outer loop dynamically update the scores according to _logits_all. By default only the last logits are computed now, which relieves the pressure of computing them on every step.
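The "only the last logits by default" idea, simulated with plain tuples (llama.cpp's llama_batch carries a per-token logits flag; the function name and shapes here are illustrative, not the PR's actual API):

```python
def assemble_batch(tokens, logits_all=False):
    """Build (token, pos, want_logits) triples for one sequence.

    By default only the final token requests logits, mirroring the
    refactor where the batch method only assembles and the outer
    loop updates scores according to _logits_all.
    """
    n = len(tokens)
    return [
        (tok, pos, logits_all or pos == n - 1)
        for pos, tok in enumerate(tokens)
    ]
```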
(cherry picked from commit 4ba212f)
Hi, at this point I think MTMDChatHandler has been fully refactored and extracted. The day before yesterday I added audio and promoted the handler to a media processing workflow (for omni or image/audio models). Alternatively, you could derive an MTMD_Embed_ChatHandler from it, add an embedding method interface, and implement the algorithm in llama_embedding.py; it does not need to be overly complex.
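The suggested subclassing could look like this sketch; MTMDChatHandler here is a stand-in for the PR's refactored handler, and the embed hook, its signature, and the injected function are all assumptions:

```python
class MTMDChatHandler:
    """Stand-in for the PR's refactored handler."""

    def handle_media(self, media):
        # Placeholder for the real media processing workflow
        return list(media)


class MTMDEmbedChatHandler(MTMDChatHandler):
    """Hypothetical subclass adding an embedding entry point,
    with the actual algorithm living in llama_embedding.py."""

    def __init__(self, embed_fn):
        # embed_fn would be supplied by llama_embedding.py
        self._embed_fn = embed_fn

    def embed(self, media):
        chunks = self.handle_media(media)
        return [self._embed_fn(c) for c in chunks]
```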
It works, but it duplicates functionality: llama_chat_format already implements multimodal support, yet it does not support embedding models like Qwen-VL-Embedding.
This code heavily references llama-server's C++ code (ServerTokens).
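llama-server's ServerTokens interleaves plain text tokens with media chunks in one position space; a rough Python analogue of that idea (not the PR's actual class, and the position accounting is simplified) might be:

```python
class ServerTokensLike:
    """Rough analogue of llama-server's ServerTokens: a flat
    position space whose entries are either text token ids or
    media chunks."""

    def __init__(self):
        self._items = []  # ("token", id) or ("media", chunk)

    def add_token(self, token_id):
        self._items.append(("token", token_id))

    def add_media(self, chunk):
        self._items.append(("media", chunk))

    def n_pos(self):
        # A text token occupies one position; a media chunk may
        # occupy several (e.g. one per image patch embedding).
        return sum(
            1 if kind == "token" else len(chunk)
            for kind, chunk in self._items
        )
```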