Skip to content

Add CIFF2Jsonl binary#41

Open
JMMackenzie wants to merge 3 commits intomasterfrom
ciff2jsonl
Open

Add CIFF2Jsonl binary#41
JMMackenzie wants to merge 3 commits intomasterfrom
ciff2jsonl

Conversation

@JMMackenzie
Copy link
Member

This adds a program that can convert a CIFF blob into a jsonl file that follows the standard "learned sparse" format.

We probably should write some tests before merging. Also, not too sure how canonical my rust is these days, so it might be good to review carefully.

@JMMackenzie JMMackenzie requested review from amallia and elshize August 6, 2025 00:40
@elshize
Copy link
Member

elshize commented Aug 7, 2025

Sorry for the delay, I'll do my best to find some time this weekend 🤞

Copy link
Member

@elshize elshize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just some minor comments.

// Map from an internal integer docid to a string "collection" docid.
let mut docid_map: HashMap<u32, String> = HashMap::new();

// 1. Map the postings back into a forward index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have each of 1-3 be a separate function? I think 1 could return forward index, 2 returns docid_map and 3 takes both to write, something like that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, agreed. I'll fix this up.

let progress = ProgressBar::new(u64::from(header.num_postings_lists));
progress.set_style(pb_style());
progress.set_draw_delta(10);
for _ in 0..header.num_postings_lists {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you have an iterator and by-one increment in the loop, you can import indicatif::ProgressIterator and do this:

for _ in (0..header.num_postings_lists).progress(progress);

This means you skip progress.inc(1) and progress.finish() -- it's done for you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note you can do this for all three loops.

let mut docid_map: HashMap<u32, String> = HashMap::new();

// 1. Map the postings back into a forward index
let progress = ProgressBar::new(u64::from(header.num_postings_lists));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could write a module-private function fn progress_bar(count: u64) -> ProgressBar because we use the same kind of progress 3 times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments