This year, Arthur, a machine learning monitoring startup, has benefited from the increased interest in generative AI and has been developing tools to help businesses work more effectively with LLMs. Today, it is releasing Arthur Bench, an open-source application that assists users in identifying the optimal LLM for a given set of data.
Adam Wenchel, CEO and co-founder of Arthur, says the company has seen strong interest in generative AI and LLMs, and as a result has been devoting considerable time and effort to building products around them.
Although it has been less than a year since the release of ChatGPT, he says businesses still lack a systematic way to compare the effectiveness of one tool against another, which is why the company created Arthur Bench.
Wenchel told TechCrunch that Arthur Bench solves one of the most common issues he hears from customers, which is determining which model is ideal for a particular application given the available options.
It comes with a suite of tools for testing performance in a systematic manner, but its true value lies in measuring how various LLMs perform against the types of prompts your users would actually submit for your particular application. "You could conceivably test 100 distinct prompts and then compare how two different LLMs, such as Anthropic's and OpenAI's, stack up on the types of prompts your users are likely to use," Wenchel told TechCrunch. He adds that you can do this at scale, enabling a more informed decision about which model is optimal for your specific use case.
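The workflow Wenchel describes, running the same prompt set through multiple models and comparing the scored results, can be sketched in a few lines of Python. This is a hypothetical illustration, not Arthur Bench's actual API: the model callables and the exact-match scorer are stand-ins for real LLM clients and more sophisticated scoring methods.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for real LLM clients (e.g., Anthropic's or OpenAI's APIs).
def model_a(prompt: str) -> str:
    return "Paris" if "capital of France" in prompt else "unsure"

def model_b(prompt: str) -> str:
    return "unsure"

def benchmark(models: Dict[str, Callable[[str], str]],
              prompts: List[str],
              expected: List[str]) -> Dict[str, float]:
    """Score each model by exact-match accuracy over the same prompt set."""
    scores = {}
    for name, model in models.items():
        hits = sum(model(p) == e for p, e in zip(prompts, expected))
        scores[name] = hits / len(prompts)
    return scores

# Run both candidate models against prompts representative of your application.
prompts = ["What is the capital of France?", "What is 2 + 2?"]
expected = ["Paris", "4"]
results = benchmark({"model_a": model_a, "model_b": model_b}, prompts, expected)
print(results)
```

In practice, a tool like Arthur Bench would replace the stub models with live API calls and the exact-match check with task-appropriate scoring, but the comparison loop is the core idea.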
Arthur Bench launches today as an open-source utility. There will also be a SaaS version for customers who don't want to deal with the complexity of administering the open-source version, or who have larger testing needs and are willing to pay. For now, Wenchel stated, the company is focused on the open-source project.
The new tool follows the May release of Arthur Shield, a kind of LLM firewall designed to detect hallucinations in models while protecting against toxic content and leaks of private data.