Right now our unit tests and accuracy assessment tests are intermingled. Some tests check whether the program functions at all, while others are closer to measures of real-world performance. It would be really useful to separate these: UP should have a test suite that every new change must pass 100% before it can be merged, plus a separate, GSR-based set that measures changes in expected performance.
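As a rough illustration of the split, here is a minimal Python sketch. Everything in it is hypothetical: `gate_passes`, `accuracy`, `toy_predict`, and the sample data are made up for the example, and it assumes the GSR-based set can be treated as a list of (input, expected output) pairs. The point is only that the merge gate returns a hard pass/fail, while the accuracy set returns a score that gets compared against a baseline.

```python
from typing import Callable, Iterable, Tuple


def gate_passes(checks: Iterable[Callable[[], bool]]) -> bool:
    """Merge gate: True only if every functional check succeeds."""
    return all(check() for check in checks)


def accuracy(predict: Callable[[str], str],
             reference: Iterable[Tuple[str, str]]) -> float:
    """Fraction of reference pairs for which predict(input) == expected."""
    pairs = list(reference)
    if not pairs:
        raise ValueError("reference set is empty")
    correct = sum(1 for text, expected in pairs if predict(text) == expected)
    return correct / len(pairs)


# Hypothetical stand-in for the program under test.
def toy_predict(text: str) -> str:
    return text.upper()


# Functional checks: must all pass before a merge.
checks = [
    lambda: toy_predict("a") == "A",
    lambda: isinstance(toy_predict(""), str),
]

# Hypothetical reference pairs (assumed shape of a GSR-based set).
# One pair is deliberately "wrong" for the toy program, so the score is below 1.
reference = [("a", "A"), ("b", "B"), ("c", "x"), ("d", "D")]

baseline = 0.5  # assumed score recorded from the previous version

if __name__ == "__main__":
    print("merge gate passed:", gate_passes(checks))
    score = accuracy(toy_predict, reference)
    print("accuracy:", score, "change vs. baseline:", score - baseline)
```

With this separation, a drop in the accuracy score is something to review and discuss, but it does not by itself block a merge the way a failing gate check does.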