Abstract: Recently, researchers in the field of math word problem (MWP) solving have reported performance metrics for various large language models (LLMs) on benchmark datasets, with some models ...
GSM8K-V is a purely visual multi-image mathematical reasoning benchmark that systematically maps each GSM8K math word problem into its visual counterpart to enable a clean, within-item comparison ...
The first episode delivers one of the series’ strongest openers. Low narrative stakes due to the refusal to kill the main characters. Essential move toward answering definitive narrative questions.