The rapid development of generative AI has brought features and products to market in relatively short timeframes. Teams launching products with generative AI capabilities should aim to ensure high quality, safe, fair, and equitable user experiences in accordance with the AI Principles.
A responsible approach to generative applications should provide plans for accomplishing the following:
Content policies are just one step in preventing harm to users. It is also important to consider goals and guiding principles for quality, safety, fairness, and inclusion.
Teams should devise strategies for responding to queries in sensitive verticals such as medical information to help provide high quality user experiences. Responsible strategies include providing multiple points of view, deferring topics without scientific evidence, or only providing factual information with attribution.
The goal of AI safety measures is to prevent or contain actions that can lead to harm, intentionally or unintentionally. Without appropriate mitigations, generative models may output unsafe content that may violate content policies or cause discomfort to users. Consider providing explanations to users if an output was blocked or the model was unable to generate an acceptable output.
Ensure diversity within a response and across multiple responses for the same question. For example, a response to a question about famous musicians should not only include names or images of people of the same gender identity or skin tone. Teams should strive to provide content for different communities when requested. Examine training data for diversity and representation across multiple identities, cultures, and demographics. Consider how outputs over multiple queries are representative of diversity in groups, without perpetuating common stereotypes (e.g., responses to “best jobs for women” compared to “best jobs for men” should not contain traditionally stereotyped content, such as “nurse” appearing under “best jobs for women,” but “doctor” appearing under “best jobs for men”).
The following steps are recommended when building applications with LLMs (via PaLM API Safety guidance):
In one example of safety features, the PaLM API includes adjustable safety settings that block content with adjustable probabilities of being unsafe across six categories: derogatory, toxic, sexual, violent, dangerous, and medical. These settings allow developers to determine what is appropriate for their use cases, but also has built-in protections against core harms, such as content that endangers child safety, which are always blocked and cannot be adjusted.
Fine-tuning a model can teach it how to answer based on an application’s requirements. Example prompts and answers are used to teach a model how to better support new use cases, address types of harms, or utilize different strategies desired by the product in the reply.
For instance, consider:
Additional methods of harms prevention may include using trained classifiers to label each prompt with potential harms or adversarial signals. Moreover, you may implement safeguards against deliberate misuse by limiting the volume of user queries submitted by a single user in a given time period, or try to protect against possible prompt injection.
Similar to input safeguards, guardrails can be placed on outputs. Content moderation guardrails, such as classifiers, can be used to detect policy violative content. If signals determine the output to be harmful, the application can provide an error or empty response, provide a pre-scripted output, or rank multiple outputs from the same prompt for safety.
Generative AI products should be rigorously evaluated to ensure they align with safety policies and guiding principles prior to launch. To create a baseline for evaluation and measure improvement over time, metrics should be defined for each salient content quality dimension. After metrics are defined, a separate risks analysis can determine the performance targets for launch, taking into account loss patterns, how likely they are to be encountered, and the impact of harms.
Examples of metrics to consider:
Safety benchmarks: design safety metrics that reflect the ways your application could be unsafe in the context of how it is likely to get used, then test how well your application performs on the metrics using evaluation datasets.
Violation rate: Given a balanced adversarial dataset (across applicable harms and use cases), the number of violative outputs, usually measured by interrater reliability.
Blank response rate: Given a balanced set of prompts that a product intends to provide a response for, number of blank responses (i.e., when the product is unable to provide a safe output regardless of the input or output being blocked).
Diversity: Given a set of prompts, the diversity along dimensions of identity attributes represented in outputs.
Fairness (for quality of service): Given a set of prompts that contain counterfactuals of a sensitive attribute, ability to provide the same quality of service.
Adversarial testing involves proactively trying to “break” your application. The goal is to identify points of weakness so that you can take steps to remedy them.
Adversarial testing is a method for systematically evaluating an ML model with the intent of learning how it behaves when provided with malicious or inadvertently harmful input:
An input is malicious when the input is clearly designed to produce an unsafe or harmful output – for example, asking a text generation model to generate a hateful rant about a particular religion. An input is inadvertently harmful when the input itself may be innocuous, but produces harmful output – for example, asking a text generation model to describe a person of a particular ethnicity and receiving a racist output. Adversarial testing has two primary objectives: to help teams systematically improve models and products by exposing current failure patterns, and guide mitigation pathways, and to inform product decisions by assessing alignment to safety product policies and by measuring risks that may not be fully mitigated.
Adversarial testing follows a workflow that is similar to standard model evaluation:
What distinguishes an adversarial test from a standard evaluation is the composition of the data used for testing. For adversarial tests, select test data that is most likely to elicit problematic output from the model. This means probing the model’s behavior for all the types of harms that are possible, including rare or unusual examples and edge cases that are relevant to safety policies. It should also include diversity in the different dimensions of a sentence such as structure, meaning and length.
The quiz is available on Wooclap. However, you can already view the questions.