Something wrong with the tokenize function.

The ggml models converted from "YeungNLP/bloomz-396m-zh" or "WangZeJun/bloom-396m-chat" appear to lack tokens for some characters, such as "焙" or "擀". Without a corresponding token, these characters cannot be displayed in the generated output. In the sample below, the prompt is also tokenized into only three tokens and stops at "烘". The official Python implementation of the model does not have this problem.

Sample output (note the "�" characters):
```
main: prompt: '面包的烘焙制作流程'
main: number of tokens in prompt = 3
 24765 -> '面包'
   373 -> '的'
 28967 -> '烘'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


面包(24765)的(373)烘(28967)�(1165)�(237)技巧(16012)：(1038)
(189)1(20).(17) (210)面(1157)条(1996)要(853)煮熟(43916)，(355)否则(14458)容易(7305)粘(14494)。(420) 
(2813)2(21).(17) 应(23830)使用(2527)烤(15337)箱(8226)而不是(12285)微波(30656)炉(16613)加热(25228)面团(44449)。
(672)3(22).(17) 用(16647)冷水(33637)淋(15735)湿(10556)面团(44449)以防止(31473)黏(19639)在一起(10919)。
(672)4(23).(17) 在(3612)预(3119)热(4291)至(1546)摄氏(39868)175(13634)度(1423)时(1018)开始(3590)烘(28967)�(1165)�(237)，(355)直到(8326)底部(26609)变得
(13044)金(1539)黄色(21313)并(1437)散(4711)发出(13801)香味(32740)即可(10134)享用(42892)</s>(2) [end of text]


main: mem per token =  4944640 bytes
main:     load time =   558.57 ms
main:   sample time =   516.50 ms
main:  predict time =  3674.82 ms / 52.50 ms per token
main:    total time =  4945.50 ms

```
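One possible explanation, not confirmed in this report, is that "焙" is not missing from the vocabulary but is split across two byte-level tokens (1165 and 237 in the log), each holding only part of its UTF-8 encoding. Printing each token on its own would then produce "�" twice. The Python sketch below illustrates that effect with plain byte strings. The split point and the helper names `decode_per_token` and `decode_buffered` are assumptions made for illustration; they are not taken from the model's vocabulary or from the ggml code.

```python
import codecs

def decode_per_token(pieces):
    # Decode every token's bytes in isolation, as a naive per-token print would.
    # Incomplete UTF-8 sequences become U+FFFD, which is displayed as "�".
    return "".join(p.decode("utf-8", errors="replace") for p in pieces)

def decode_buffered(pieces):
    # Keep incomplete byte sequences in a buffer until the following token
    # completes them, then emit the full character.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = [decoder.decode(p) for p in pieces]
    out.append(decoder.decode(b"", final=True))
    return "".join(out)

raw = "焙".encode("utf-8")      # three bytes in UTF-8
pieces = [raw[:2], raw[2:]]     # hypothetical split across two tokens

print(decode_per_token(pieces))
print(decode_buffered(pieces))
```

If this is what happens, the generated text itself would be correct and only the per-token printing would be affected. It would not by itself explain why the prompt tokenization stops at "烘".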