In the Minerva and Llemma papers, sympy is used to check equivalence of predicted and gold answers, e.g. ensuring $1/\sqrt{3}$ and $\sqrt{3}/3$ are treated the same. From the Minerva paper:
After applying this normalization function, we checked whether the formatted target and prediction strings are SymPy-equivalent. SymPy equivalence is determined by parsing the answers via sympy.parsing.latex.parse_latex and then checking whether subtracting the two resulting SymPy objects and applying sympy.simplify gives zero. We set a timeout of 5s when calling sympy.simplify, and labeled strings as nonequivalent if this timeout was exceeded.
For MATH problems, SymPy equivalence improved overall accuracy by around 1%. See Table 6 for MATH accuracies with exact string match only vs. SymPy equivalence.
Although the difference for Minerva & OpenAI models was only around 1%, would it make sense to add sympy to the MATH metric, both for correctness and for potentially uncovering larger variation among open models?
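For concreteness, here is a minimal sketch of the check described above. The helper name `sympy_equivalent` is hypothetical; the papers parse LaTeX via `sympy.parsing.latex.parse_latex` (which requires the antlr4 runtime), so this sketch uses `sympy.sympify` on plain expression strings instead to stay dependency-free:

```python
# Sketch of Minerva-style SymPy equivalence checking (helper name is
# hypothetical). The papers parse LaTeX with sympy.parsing.latex.parse_latex;
# sympify is substituted here so the example runs without the antlr4 runtime.
from sympy import simplify, sympify


def sympy_equivalent(pred: str, gold: str) -> bool:
    """Return True if pred - gold simplifies to zero.

    Minerva additionally wraps the simplify call in a 5 s timeout and
    labels the pair nonequivalent on timeout; that guard is omitted here.
    """
    try:
        diff = sympify(pred) - sympify(gold)
        return simplify(diff) == 0
    except Exception:
        # Unparseable answers fall back to nonequivalent.
        return False


# The example from this issue: 1/sqrt(3) and sqrt(3)/3 are the same value,
# so they should count as a match even though the strings differ.
print(sympy_equivalent("1/sqrt(3)", "sqrt(3)/3"))  # True
print(sympy_equivalent("1/2", "1/3"))              # False
```

An exact-string-match metric would score the first pair as wrong; the SymPy check recovers it, which is where the roughly 1% accuracy gain on MATH comes from.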