fix trust_remote_code=True for longbench #3361

jannalulu · 2025-10-22T19:18:17Z

The current LongBench dataset (THUDM/LongBench) needs an explicit trust_remote_code=True to load, which is no longer supported in datasets>=4.0. This PR switches the dataset to Xnhyacinth/LongBench, which stores the data as *.parquet files. Part of fixing issue #3171.
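A minimal sketch of what loading the parquet-backed repository could look like without trust_remote_code. The helper names are hypothetical, and the subset name "hotpotqa" and split name "test" are assumptions carried over from THUDM/LongBench rather than something this PR confirms.

```python
# Sketch only: helper names are hypothetical; the "test" split and the
# subset names are assumed to match THUDM/LongBench.


def longbench_load_kwargs(subset: str, repo: str = "Xnhyacinth/LongBench") -> dict:
    """Build keyword arguments for datasets.load_dataset.

    No trust_remote_code entry is included: a parquet-backed repository
    is read directly and does not need a loading script.
    """
    return {"path": repo, "name": subset, "split": "test"}


def load_longbench_subset(subset: str):
    """Load one subset; needs the `datasets` package and network access."""
    # Imported lazily so the kwargs helper above has no third-party dependency.
    from datasets import load_dataset

    return load_dataset(**longbench_load_kwargs(subset))
```

Calling load_longbench_subset("hotpotqa") would download that subset from the Hub, so it is left out of the block above.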

Should the version numbers also be incremented? I'm keeping this separate from PR #3359 because it is a different feature.

baberabb · 2025-10-27T18:14:33Z

Hi! Thanks for the PR! Just to confirm, the two datasets are equivalent?

jannalulu · 2025-10-27T19:31:27Z

They should be equivalent: Xnhyacinth/LongBench has the same number of samples as THUDM/LongBench, and the splits and subsets are set up the same way. Let me run single-doc
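One way such a sample-count check could be scripted is sketched below. The function name, the "subset/split" key format, and any counts passed to it are illustrative assumptions, not values taken from either dataset.

```python
from typing import List, Mapping


def compare_split_sizes(old: Mapping[str, int], new: Mapping[str, int]) -> List[str]:
    """Return human-readable differences between two row-count tables.

    Keys are "subset/split" strings and values are row counts.
    An empty list means every subset/split pair has the same size.
    """
    diffs = []
    for key in sorted(set(old) | set(new)):
        if key not in new:
            diffs.append(f"{key}: missing from new dataset")
        elif key not in old:
            diffs.append(f"{key}: missing from old dataset")
        elif old[key] != new[key]:
            diffs.append(f"{key}: {old[key]} rows vs {new[key]} rows")
    return diffs
```

The counts themselves could be gathered by calling len() on each loaded split. Matching counts only show the two repositories have the same shape, not that the rows are identical, which is why an evaluation run is still a useful second check.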

jannalulu · 2025-10-27T20:35:11Z

Results from meta-llama/Llama-3.1-8B-Instruct with batch_size=2 and --apply_chat_template. They seem similar to the results in PR #3273:

Tasks	Version	Filter	Metric	Value
longbench_2wikimqa	4	none	qa_f1_score	0.5079 ± 0.0327
longbench_dureader	4	none	rouge_zh_score	0.3279 ± 0.0130
longbench_hotpotqa	4	none	qa_f1_score	0.5821 ± 0.0308
longbench_musique	4	none	qa_f1_score	0.3166 ± 0.0300

baberabb · 2025-10-27T20:53:33Z

Great! Which one should I merge first, #3359 or this one?

jannalulu · 2025-10-27T20:55:19Z

Merge this one first, I think, because it changes the dataset and formatting, while #3359 creates groups and increments the version number. You could also merge this one and then I'll pull it into the other PR.

jannalulu added 2 commits October 22, 2025 18:37

update dataset

16712bd

edit fields

060f351

jannalulu requested a review from baberabb as a code owner October 22, 2025 19:18

jannalulu mentioned this pull request Oct 22, 2025

Longbench group fix #3359

Open

jannalulu added 2 commits October 22, 2025 23:32

fix line-endings

2d1ccea

standardize spaces

24b19a1

jannalulu force-pushed the longbench-dataset branch from 99e2780 to 24b19a1 Compare October 23, 2025 23:12

pacify tests

7615673
