Gen AI Testing: A Case for QA as a Service
by Robert Bell

Building and maintaining a successful Generative AI (Gen AI) application brings new challenges. After an exciting Proof of Concept (POC) demonstration, teams are too often caught off guard by the new demands for quality assurance and monitoring once the application is in production. While traditional applications tend to reach a steady state of accuracy once deployed, Gen AI applications do not: their accuracy can change quickly and dramatically.
IT operations for Gen AI is known as GenOps. It is distinct from traditional operations because of the quality assurance testing needed after a Gen AI application is deployed, and it brings a new set of testing metrics that are essential to success. These metrics have been developed to prevent production failures, which can be very costly. (We'll dig into them more in a bit.) For example, an erroneous response during the first rollout of Google's Gen AI application Gemini, formerly known as Bard, led to a $100 billion loss in market value.
Quality assurance for a Gen AI application is not performed at a single point in the application’s life cycle, but rather must be an ongoing process with metrics that impact the release cycles.
To achieve a good QA operation with Gen AI, three steps are needed:
1) Grant Gen AI models the same status as the application, with their own distinct life cycles.
2) Establish metrics for the model, and use those metrics to define SLAs and KPIs that can be tracked.
3) Establish QA as a Service to track the metrics that are critical to managing the models within a Gen AI application.
Let’s dig into each of these a bit more to see how we can avoid costly and brand-damaging mistakes with Gen AI applications.
1. Gen AI Models should be monitored distinctly
Gen AI applications chain together knowledge models, as seen in the diagram below. These models experience drift (also known as degradation) over time. Drift causes errors, and those errors accumulate through the chain, which can result in a significant loss of accuracy in a short period.
It is important that the model be managed with GenOps to support its life cycle. The model requires testing during development and monitoring once deployed, so that the need to retrain or retire it can be detected quickly.
The goals of monitoring are to detect degradation and to determine where it is occurring: in the model (general knowledge), in the embeddings (private knowledge), or in the prompts (the caveats around the request).
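As a minimal sketch of what this monitoring can look like in practice, the check below re-runs a fixed evaluation set on a schedule and flags drift when the recent mean score falls too far below the baseline established at deployment. The scores, window sizes, and tolerance are hypothetical placeholders; real deployments would tune these per application.

```python
from statistics import mean

def detect_drift(baseline_scores, recent_scores, tolerance=0.05):
    """Flag drift when the recent mean evaluation score falls more than
    `tolerance` below the baseline mean established at deployment."""
    drop = mean(baseline_scores) - mean(recent_scores)
    return drop > tolerance

# Hypothetical accuracy scores from re-running the same eval set.
baseline = [0.92, 0.90, 0.91, 0.93]  # collected at deployment
recent = [0.84, 0.82, 0.85, 0.83]    # collected this week
print(detect_drift(baseline, recent))  # True: accuracy has degraded
```

Running the same check separately against model-only, embeddings-only, and prompt-variation eval sets is one way to localize where the degradation is occurring.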
2. Establish Gen AI Metrics to be Monitored by GenOps
New metrics have been established to guide the monitoring and maintenance of Gen AI applications.
Well-defined metrics fall into categories such as model quality, system quality, and business impact.
Along with these metrics, thresholds need to be established and adapted over time to determine when a model needs to be retrained or retired.
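A sketch of how such thresholds might drive the retrain-or-retire decision is below. The specific metric names and threshold values are assumptions for illustration, not recommendations; each application would define its own.

```python
def model_action(metrics, thresholds):
    """Map monitored metric values to a model life-cycle action.
    Thresholds are hypothetical and should be tuned per application."""
    if metrics["accuracy"] < thresholds["retire_accuracy"]:
        return "retire"
    if (metrics["accuracy"] < thresholds["retrain_accuracy"]
            or metrics["p95_latency_ms"] > thresholds["max_latency_ms"]):
        return "retrain"
    return "keep"

thresholds = {"retire_accuracy": 0.70,
              "retrain_accuracy": 0.85,
              "max_latency_ms": 2000}
print(model_action({"accuracy": 0.82, "p95_latency_ms": 1500}, thresholds))
# "retrain": accuracy is above the retire floor but below the retrain bar
```

Note that latency appears alongside accuracy here, since, as discussed below, performance metrics matter as much as quality scores once a model is in production.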
While accuracy and safety are the scores that take the headlines, once in production many Gen AI applications struggle with latency and cost. Model drift impacts not only accuracy but also the performance of the models.
These same metrics are proving extremely useful for comprehensively assessing newer models against the models in production, indicating the point at which a new model should be deployed.
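One simple sketch of such a comparison: promote a candidate model only if it improves quality by a meaningful margin without regressing cost beyond a budget. The metric names and cut-offs are illustrative assumptions.

```python
def should_promote(prod, candidate, min_gain=0.02, max_cost_increase=0.10):
    """Promote the candidate only if it beats the production model's
    accuracy by at least `min_gain` while keeping cost per request
    within `max_cost_increase` (10%) of the production model's."""
    gain = candidate["accuracy"] - prod["accuracy"]
    cost_ratio = candidate["cost_per_req"] / prod["cost_per_req"]
    return gain >= min_gain and cost_ratio <= 1 + max_cost_increase

prod = {"accuracy": 0.86, "cost_per_req": 0.004}
cand = {"accuracy": 0.90, "cost_per_req": 0.0042}
print(should_promote(prod, cand))  # True: +0.04 accuracy, +5% cost
```

Evaluating both models on the same test data set keeps the comparison apples-to-apples.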
3. Establish GenOps QA as a Service (QAaaS)
A GenOps QAaaS solution includes a mix of accelerators and staffing to test and monitor Gen AI models. Implementing quality assurance “as a service” allows the operation to be run against a service-level agreement or KPI.
Metrics are critical to managing the models within a Gen AI application. Depending on the metrics applied to an application, accelerators can be implemented to support the process. These accelerators include large test data sets, such as OpenOrca, created in cooperation with leaders in the Gen AI field, along with other tools that assist with testing the models of a Gen AI application.
Because LLM responses are non-deterministic, a fully automated test struggles with requests such as “Generate an image that makes me laugh.” A well-defined QAaaS process that combines accelerators and staffing, creating and tracking the right metrics through the entire life cycle, lets Gen AI applications be built with more predictable outcomes.
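For text outputs, one common way automated tests cope with non-determinism is to assert properties of the response rather than an exact expected string. The sketch below, with hypothetical constraints and sample responses, shows the idea: two differently worded but equally valid outputs both pass.

```python
def check_response(response, required_terms=(), banned_terms=(), max_words=200):
    """Property-based check for a non-deterministic response: verify
    constraints rather than comparing against one exact string."""
    text = response.lower()
    return (all(t in text for t in required_terms)
            and not any(t in text for t in banned_terms)
            and len(response.split()) <= max_words)

# Two different but equally valid model outputs both pass the check.
a = "Our refund policy allows returns within 30 days of purchase."
b = "You may return items for a refund up to 30 days after buying."
print(check_response(a, required_terms=("refund", "30 days")))  # True
print(check_response(b, required_terms=("refund", "30 days")))  # True
```

Checks like this cover the automatable portion; requests that hinge on subjective judgment (such as “makes me laugh”) remain the province of the staffing side of QAaaS.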
Transform Your Gen AI Practices with GenOps
The successful deployment and maintenance of Gen AI applications require a paradigm shift in quality assurance and monitoring practices. Unlike traditional applications, Gen AI models experience rapid and unpredictable changes in accuracy, necessitating continuous oversight through GenOps.
By treating Gen AI models as distinct entities with dedicated life cycles, establishing and tracking specific metrics, and implementing QA as a Service (QAaaS), organizations can mitigate the risks of model degradation and ensure consistent performance. These steps are crucial to avoid costly errors and maintain the reliability and effectiveness of Gen AI applications in production.
About the Author
Robert Bell is a Principal Solutions Architect for Evergreen, a division of Insight Global. He lives in Atlanta with his wife and enjoys building Cloud and Gen AI practices within large enterprise organizations.