Model evaluation harness for standardized benchmarking with semantic similarity, exact match, and custom metrics.
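As a rough illustration of how such a harness might combine built-in and custom metrics, here is a minimal sketch. All names in it (`Metric`, `exact_match`, `token_f1`, `evaluate`) are hypothetical and not taken from this project. A semantic similarity metric would presumably plug in as another function with the same signature, backed by an embedding model; it is left out here to keep the example dependency-free.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical metric signature: (prediction, reference) -> score in [0, 1].
Metric = Callable[[str, str], float]


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings match after trimming and case-folding."""
    return float(prediction.strip().casefold() == reference.strip().casefold())


def token_f1(prediction: str, reference: str) -> float:
    """Example custom metric: F1 over whitespace-separated tokens."""
    pred = prediction.casefold().split()
    ref = reference.casefold().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    predictions: List[str],
    references: List[str],
    metrics: Dict[str, Metric],
) -> Dict[str, float]:
    """Average each metric over aligned prediction/reference pairs."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        raise ValueError("at least one example is required")
    return {
        name: sum(fn(p, r) for p, r in zip(predictions, references)) / len(predictions)
        for name, fn in metrics.items()
    }


scores = evaluate(
    predictions=["Paris", "the cat sat"],
    references=["paris", "the cat slept"],
    metrics={"exact_match": exact_match, "token_f1": token_f1},
)
```

Keeping every metric behind one plain function signature means a custom metric can be registered by adding an entry to the `metrics` dictionary, without changing the evaluation loop.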