@jannalulu
Contributor

@jannalulu jannalulu commented Oct 22, 2025

The current LongBench dataset (THUDM/LongBench) requires an explicit trust_remote_code=True to load, which is no longer supported in datasets>=4.0. This PR switches the dataset to Xnhyacinth/LongBench, which stores the data as *.parquet files. Part of fixing issue #3171.

Should the version numbers also be incremented? I'm keeping this separate from PR #3359 because it is a different feature.

@jannalulu jannalulu requested a review from baberabb as a code owner October 22, 2025 19:18
@baberabb
Contributor

Hi! Thanks for the PR! Just to confirm, the two datasets are equivalent?

@jannalulu
Contributor Author

jannalulu commented Oct 27, 2025

They should be equivalent: Xnhyacinth/LongBench has the same number of samples as THUDM/LongBench, and the splits and subsets are set up the same way. Let me run single-doc.
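The kind of check described above (same sample count, same layout) could be sketched roughly as follows. This is an illustrative sketch, not the script used for this PR: the helper names `summarize`, `compare_subsets`, and `load_subset` are hypothetical, and the comparison only looks at row counts and field names, not row contents.

```python
def summarize(rows):
    """Return (row_count, sorted field names) for an iterable of dict rows."""
    count = 0
    fields = set()
    for row in rows:
        count += 1
        fields.update(row.keys())
    return count, sorted(fields)


def compare_subsets(old_rows, new_rows):
    """Compare two subsets by row count and field names only (not content)."""
    old_count, old_fields = summarize(old_rows)
    new_count, new_fields = summarize(new_rows)
    return {
        "same_count": old_count == new_count,
        "same_fields": old_fields == new_fields,
        "counts": (old_count, new_count),
    }


def load_subset(repo_id, subset, split="test"):
    """Hypothetical loader: needs network access and the `datasets` package."""
    # Imported lazily so the helpers above can be used offline.
    from datasets import load_dataset

    return load_dataset(repo_id, subset, split=split)
```

With network access, something like `summarize(load_subset("Xnhyacinth/LongBench", "2wikimqa"))` would give the numbers for the parquet copy. The THUDM/LongBench side would presumably have to be loaded in an environment with datasets<4.0 and trust_remote_code=True, since that is the limitation this PR works around.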

@jannalulu
Contributor Author

jannalulu commented Oct 27, 2025

Results from meta-llama/Llama-3.1-8B-Instruct with batch_size=2 and --apply_chat_template. They seem similar to PR #3273.

| Tasks | Version | Filter | n-shot | Metric | Value ± Stderr |
|---|---|---|---|---|---|
| longbench_2wikimqa | 4 | none | 0 | qa_f1_score | 0.5079 ± 0.0327 |
| longbench_dureader | 4 | none | 0 | rouge_zh_score | 0.3279 ± 0.0130 |
| longbench_hotpotqa | 4 | none | 0 | qa_f1_score | 0.5821 ± 0.0308 |
| longbench_musique | 4 | none | 0 | qa_f1_score | 0.3166 ± 0.0300 |

@baberabb
Contributor

Great! Which one should I merge first, #3359 or this one?

@jannalulu
Contributor Author

jannalulu commented Oct 27, 2025

Merge this one first, I think, because it changes the dataset, formatting, etc., while #3359 creates groups and increments the version number. You could also merge this one and then I'll pull it into the other PR.
