Something wrong with the tokenize function. #30

Description

@samsha1971

The ggml model converted from "YeungNLP/bloomz-396m-zh" or "WangZeJun/bloom-396m-chat" appears to be missing some tokens. Characters such as "焙" or "擀" have no corresponding token, so generated text that contains them cannot be displayed. The same models show no such problem when run through the official Python implementation.

Sample output (note the "�" characters, and that only 3 prompt tokens are recognized, stopping at "烘"):

main: prompt: '面包的烘焙制作流程'
main: number of tokens in prompt = 3
 24765 -> '面包'
   373 -> '的'
 28967 -> '烘'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


面包(24765)的(373)烘(28967)�(1165)�(237)技巧(16012):(1038)
(189)1(20).(17) (210)面(1157)条(1996)要(853)煮熟(43916),(355)否则(14458)容易(7305)粘(14494)。(420) 
(2813)2(21).(17) 应(23830)使用(2527)烤(15337)箱(8226)而不是(12285)微波(30656)炉(16613)加热(25228)面团(44449)。
(672)3(22).(17) 用(16647)冷水(33637)淋(15735)湿(10556)面团(44449)以防止(31473)黏(19639)在一起(10919)。
(672)4(23).(17) 在(3612)预(3119)热(4291)至(1546)摄氏(39868)175(13634)度(1423)时(1018)开始(3590)烘(28967)�(1165)�(237),(355)直到(8326)底部(26609)变得
(13044)金(1539)黄色(21313)并(1437)散(4711)发出(13801)香味(32740)即可(10134)享用(42892)</s>(2) [end of text]


main: mem per token =  4944640 bytes
main:     load time =   558.57 ms
main:   sample time =   516.50 ms
main:  predict time =  3674.82 ms / 52.50 ms per token
main:    total time =  4945.50 ms
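
One possible explanation, not confirmed in this report: BLOOM uses a byte-level BPE tokenizer, so a character like "焙" may be stored as two or more tokens that each hold only part of its UTF-8 byte sequence. Printing each token on its own would then emit invalid UTF-8, which the terminal shows as "�". The Python sketch below illustrates that effect and one way around it, by buffering bytes until a full character is available. The split of "焙" into a 2-byte piece and a 1-byte piece is a hypothetical stand-in for tokens 1165 and 237; the real byte content of those tokens is not known from the log.

```python
import codecs

def stream_decode(token_bytes):
    """Yield text from byte-level token pieces, holding back incomplete UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for piece in token_bytes:
        text = decoder.decode(piece)  # returns "" while a character is incomplete
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush any dangling bytes at the end
    if tail:
        yield tail

# Hypothetical pieces: "烘" as one token, "焙" split across two tokens.
bei = "焙".encode("utf-8")
pieces = ["烘".encode("utf-8"), bei[:2], bei[2:]]

# Decoding each token independently produces replacement characters.
naive = "".join(p.decode("utf-8", errors="replace") for p in pieces)

# Decoding incrementally reassembles the character.
fixed = "".join(stream_decode(pieces))

print(repr(naive))
print(repr(fixed))
```

If this is the cause, the output side needs the same kind of buffering when printing tokens. The prompt side looks like a separate symptom: the prompt is cut off after "烘", which suggests the tokenizer cannot match "焙" as a whole string against the vocabulary.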
