Claude 2 Evaluations, Longer LLM Context Windows and the ConceptARC Benchmark
Research Paper Reviews
Model Card and Evaluations for Claude Models
This report includes the model card for Claude 2, along with the results of a range of safety, alignment, and capabilities evaluations. It discusses evaluations run on Claude 1.3, Claude 2, and Claude Instant 1.1, collectively referred to as the “Claude models”. A non-deployed “Helpful Only” 1.3 model is also compared, in order to show how the honesty and harmlessness interventions affect behavior and evaluations.
Human feedback is one of the most important and meaningful evaluation signals for language models, so they used human preference data to calculate per-task Elo scores across different versions of Claude. For this report, they collected data on some common tasks: following detailed instructions (helpfulness); providing accurate, factual information (honesty); and a red-teaming task (harmlessness), which asked crowd workers to roleplay adversarial scenarios and trick the AI systems into generating harmful content. Results showed Claude 2 improving on both helpfulness and honesty over Claude 1.3, while scoring similarly on harmlessness. (Figure 1)
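As a rough illustration of how per-task Elo scores can be derived from pairwise preference data: the model card does not publish Anthropic's exact fitting procedure, so the update rule and K-factor below are assumptions, not their method.

```python
# Minimal sketch: estimating per-task Elo scores from pairwise human
# preference data. The update rule and K-factor are assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, comparisons: list, k: float = 4.0) -> dict:
    """comparisons: list of (model_a, model_b, winner) tuples from crowd workers."""
    for a, b, winner in comparisons:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = 1.0 if winner == a else 0.0
        ratings[a] += k * (s_a - e_a)
        ratings[b] += k * ((1.0 - s_a) - (1.0 - e_a))
    return ratings

# Example: helpfulness comparisons between two Claude versions.
ratings = {"claude-1.3": 0.0, "claude-2": 0.0}
prefs = [("claude-2", "claude-1.3", "claude-2"),
         ("claude-2", "claude-1.3", "claude-2"),
         ("claude-2", "claude-1.3", "claude-1.3")]
print(update_elo(ratings, prefs))
```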
They tested the effectiveness of the Claude models’ Constitution by evaluating them on the Bias Benchmark for QA (BBQ) and TruthfulQA. BBQ measures models’ propensity to exhibit stereotyped biases across nine social dimensions, using both ambiguous and disambiguated contexts for each dimension; it reports accuracy alongside a bias score, where 0 indicates no bias, 1 indicates that all answers align with a negative stereotype, and -1 indicates that all answers go against it. TruthfulQA determines whether models output accurate and truthful responses in an adversarial setting; here, open-ended responses from the Claude models are fed back as questions to evaluate self-consistency and truthfulness. Both the helpfulness and honesty interventions improved performance. (BBQ - Figures 2-3, TruthfulQA - Figure 4)
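For reference, a sketch of how the BBQ bias score is typically computed, following the definitions in the BBQ paper (Parrish et al., 2022); this is a simplified reading of that paper, not the model card's exact evaluation code.

```python
# Sketch of the BBQ bias score. A score of 0 means no measured bias,
# +1 means every non-"unknown" answer aligned with the negative stereotype,
# and -1 means every such answer went against it.

def bias_score_disambiguated(n_biased: int, n_non_unknown: int) -> float:
    """Bias score in disambiguated contexts."""
    return 2.0 * n_biased / n_non_unknown - 1.0

def bias_score_ambiguous(n_biased: int, n_non_unknown: int, accuracy: float) -> float:
    """Bias score in ambiguous contexts, scaled by how often the model
    fails to answer "unknown" (i.e., by 1 - accuracy)."""
    return (1.0 - accuracy) * bias_score_disambiguated(n_biased, n_non_unknown)

print(bias_score_ambiguous(n_biased=30, n_non_unknown=50, accuracy=0.75))
```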
For internal evaluation of the Claude models, harmfulness was gauged using a set of 328 prompts that include examples from the team’s red-teaming work and various AI model “jailbreaks” that have been discussed online. Among the prompts evaluated, Claude 2 gave a response judged more harmful than “I can’t help you with that” in four cases. On manual inspection, three of those responses did not seem harmful; in the remaining case, the model was disrupted by the jailbreak attempt in about half of its sampled responses. (Figure 5)
Earlier this year, Claude’s context window was expanded from 9K to 100K tokens. Claude 2 has been trained with a further expanded context window of 200K tokens, corresponding to roughly 150,000 words. To demonstrate that Claude uses the full context, they measured the loss at each token position, averaged over 1,000 long documents. (Figure 8)
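A minimal sketch of what measuring per-position loss over long documents might look like with a Hugging Face-style causal LM; the model name, context length, and averaging details here are placeholders, not Anthropic's internal setup.

```python
# Sketch: average next-token loss at each context position over many long
# documents, to check whether a model benefits from tokens far back in its
# window. "gpt2" is a placeholder model.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

def per_position_loss(documents, max_len=1024):
    totals = torch.zeros(max_len - 1)
    counts = torch.zeros(max_len - 1)
    for text in documents:
        ids = tok(text, return_tensors="pt").input_ids[:, :max_len]
        with torch.no_grad():
            logits = model(ids).logits
        # Loss for predicting token t+1 from positions <= t.
        losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
        n = losses.shape[0]
        totals[:n] += losses
        counts[:n] += 1
    return totals / counts.clamp(min=1)  # mean loss at each position
```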
Finally, the Claude models were tested on several standard benchmarks, including Codex HumanEval for Python function synthesis, GSM8k for grade-school math problem solving, MMLU for multidisciplinary Q&A, QuALITY for Q&A on very long stories (up to ∼10k tokens), ARC-Challenge for science questions, TriviaQA for reading comprehension, and RACE-H for high-school-level reading comprehension and reasoning. Claude 2 outperformed the other models in 5 of the 7 evaluations, with 71.2% on Codex pass@1 being its lowest score and 91.0% on ARC-Challenge its highest.
Claude 2 was also put through three standardized tests: the Graduate Record Exam (GRE), the Multistate Bar Exam (MBE), and the United States Medical Licensing Examination (USMLE).
On the GRE, they evaluated the Verbal Reasoning and Quantitative Reasoning sections 5-shot at temperature T = 1 with chain-of-thought, and evaluated the Analytical Writing section 2-shot at temperature T = 1; Claude 2 scored in the 95th, 42nd, and 91st percentiles respectively. For the MBE, they evaluated its multiple-choice questions 5-shot without chain-of-thought, getting 76.5% (153/200). Finally, the USMLE consists of three Steps that are separate exams; they evaluated each Step 5-shot without chain-of-thought, transcribing X-rays and removing images when necessary, and omitting the non-multiple-choice section. Claude 2 scored in the range of 63-69%, a passing grade, as approximately 60% is typically required of examinees.
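The exact exam prompts are not public; the sketch below only illustrates what "5-shot with chain-of-thought at temperature T = 1" means in practice. The exemplar content and the generate() call are assumptions, not the model card's templates.

```python
# Generic sketch of 5-shot chain-of-thought prompting; exemplars and the
# sampling call are placeholders.
FEW_SHOT_EXEMPLARS = [
    # (question, worked reasoning ending in a final answer) x 5
    ("Q: If 3x + 2 = 11, what is x?",
     "Let's think step by step. 3x = 11 - 2 = 9, so x = 3. Answer: 3"),
    # ...four more exemplars would follow in a real 5-shot prompt...
]

def build_prompt(question: str) -> str:
    shots = "\n\n".join(f"{q}\n{a}" for q, a in FEW_SHOT_EXEMPLARS)
    return f"{shots}\n\nQ: {question}\nLet's think step by step."

# response = model.generate(build_prompt(test_question), temperature=1.0)
```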
Claude 2 is an improvement over its previous versions, having made progress in harmlessness, robustness, and honesty. However, it is still a work in progress: it still generates confabulations, exhibits bias, and can be jailbroken.
My Notes:
Reinforcement Learning from Human Feedback (RLHF)
HHH - helpfulness, honesty, harmlessness
The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain
In this paper we describe ConceptARC, an in-depth evaluation benchmark for the Abstraction and Reasoning Corpus (ARC), a collection of few-shot abstraction and analogy problems developed by Chollet.
At the heart of human intelligence is the ability to form and abstract concepts, which enables humans to understand and create internal models of the world, to make sense of new information (often via analogy), and to decide how to behave in novel situations. In AI, research on concept formation and abstraction often uses idealized domains that capture some of the essential aspects of abstraction in the real world. In such domains one can assume prior knowledge without requiring the open-ended knowledge involved in real-world language and imagery.
In creating ARC tasks, Chollet assumed objectness, numerosity, basic geometry and topology as priors—the only knowledge required to solve these tasks. Chollet created a 1,000-task corpus. 800 tasks were made public and used as a challenge on Kaggle. The remaining 200 tasks were kept as a “hidden” test set for evaluating AI systems; 100 of these were used to evaluate submissions to the Kaggle challenge. The first-place program in the Kaggle challenge solved 21% of the 100 hidden tasks; an ensemble of the first- and second-place programs solved about 31%. One limitation on ARC’s usefulness is that it might be too challenging. Many of the tasks in Chollet’s corpus were difficult even for humans, and the corpus as a whole might be sufficiently difficult that it does not reveal real progress on machine acquisition of core knowledge.
We propose a systematic concept-based evaluation method, in which test examples are designed to instantiate variations on chosen concepts. If a system performs well on examples varying in complexity and degree of abstraction, that provides strong evidence that the system has understood the concept in a generalizable way.
When we created ConceptARC, we chose 16 concepts. Each concept is central to one or more tasks in Chollet’s published ARC “training” and “evaluation” sets. For each concept, we created 10 new ARC tasks that are different instantiations of the concept, termed the concept group for a given concept. Each of our tasks has 3 different test inputs. We constructed the tasks in the ConceptARC benchmark manually. As an example, this figure shows three tasks from ConceptARC that are variations on the concept Sameness. (Figure 2, page 5)
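For concreteness, ARC-format tasks (including ConceptARC's) are typically distributed as JSON objects with a few training demonstrations plus test grids; the field names below follow the public ARC JSON format, while the toy grids and the per-concept scoring helper are my own illustration.

```python
# Sketch of the ARC task JSON structure used by ConceptARC: a few
# input/output demonstration grids and (in ConceptARC) three test inputs.
# The toy transformation here simply mirrors each row.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 2], [0, 2]], "output": [[2, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
        # ...two more test inputs per ConceptARC task...
    ],
}

def concept_accuracy(predictions, solutions):
    """Fraction of the 10 tasks x 3 test inputs solved for one concept group."""
    correct = sum(p == s for p, s in zip(predictions, solutions))
    return correct / len(solutions)
```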
The human participants exhibited over 90% average accuracy on 11 of the 16 concepts, and over 80% accuracy on each of the remaining 5 concepts. In contrast, the first-place Kaggle program never scored above 80% accuracy on any concept, and for 11 out of 16 concepts, its accuracy was below 60%. The second-place Kaggle program’s accuracy never reached 60% and was below 50% on 15 out of 16 concepts. GPT-4, whose performance on this domain was impressive given that it was not designed or trained for such tasks, had accuracy below 30% on 15 out of 16 concepts (it scored 33% accuracy on one concept). (Page 7 table 1)
In analyzing a sample of errors made by the human participants in our study, we found that many were obvious careless mistakes or ‘near misses’, in which it was clear that the person grasped the underlying concept but made an error in applying it. The errors made by the winning ARC-Kaggle programs and by GPT-4 were typically less interpretable.
The purpose in designing a benchmark with these attributes is threefold: first, to promote the development of AI systems that grasp generalizable core concepts; second, to fairly evaluate systems that are claimed to have such abilities; and third, to provide an evaluation set that is not so difficult that it masks real progress in developing such systems.
My Notes:
For example, 1(a) requires spatial notions of extending a line diagonally from an object to a boundary; 1(b) requires parsing connected sets of pixels into objects and recognizing shapes across different rotations and symmetries; and 1(c) requires notions of counting and comparisons among quantities. (Page 4, figure 1)
Lost in the Middle: How Language Models Use Long Contexts
We find that language model performance is often highest when relevant information occurs at the beginning or end of the input context, and degrades significantly when models must access relevant information in the middle of long contexts. Furthermore, performance substantially decreases as the input context grows longer.
Language models are generally implemented with Transformers, which scale poorly to long sequences (e.g., since self-attention complexity is quadratic with the input sequence length). As a result, models are typically trained with relatively small context windows. Recent improvements in hardware and algorithms have resulted in larger context windows, but it remains unclear how these models make use of their input contexts when performing downstream tasks.
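To make the scaling claim concrete, here is the standard scaled dot-product attention formula; the quadratic cost comes from the n × n score matrix (this is the textbook formulation, not anything specific to the models studied in the paper).

```latex
% Scaled dot-product self-attention over n tokens with model dimension d:
%   Q, K, V \in \mathbb{R}^{n \times d}; the score matrix QK^\top is n x n,
%   so compute and memory scale as O(n^2 d) and O(n^2) respectively.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
```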
We investigate this with a variety of state-of-the-art open models, such as MPT-30B-Instruct and LongChat-13B (16K), and closed models, OpenAI’s GPT-3.5-Turbo and Anthropic’s Claude, in settings that require accessing and using information within an input context, experimenting with multi-document question answering and key-value retrieval.
In the multi-document experiments, the model inputs are (i) a question to answer and (ii) k documents (e.g., passages from Wikipedia), where exactly one document contains the answer to the question and the k − 1 “distractor” documents do not; input contexts contain 10, 20, or 30 documents. As the position of the relevant information is changed, we see a distinctive U-shaped curve in model performance, indicating that models are significantly better at identifying and using relevant information that occurs at the very beginning or very end of contexts.
For example, GPT-3.5-Turbo’s multi-document QA performance can drop by more than 20%—at its nadir, performance in 20- and 30-document settings is lower than performance without any input documents (i.e., closed-book performance; 56.1%).
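A sketch of how the multi-document QA inputs can be assembled, with the answer-bearing document placed at a controllable position among distractors; the instruction wording and placeholder passages are assumptions, not the paper's exact template.

```python
# Sketch: build a multi-document QA prompt with the gold (answer-bearing)
# document at a chosen position among k-1 distractors.
def build_qa_prompt(question, gold_doc, distractors, gold_position):
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)  # 0 = start, len(distractors) = end
    numbered = "\n\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(docs))
    return ("Write a concise answer to the question using only the provided "
            f"search results.\n\n{numbered}\n\nQuestion: {question}\nAnswer:")

# Sweep the gold document across all positions to probe the U-shaped curve
# in a 20-document setting.
gold = "Hamlet is a tragedy written by William Shakespeare around 1600."
distractors = [f"Distractor passage {i} about an unrelated topic." for i in range(19)]
prompts = [build_qa_prompt("Who wrote Hamlet?", gold, distractors, pos)
           for pos in range(len(distractors) + 1)]
```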
In our synthetic key-value retrieval task, the inputs are (i) a string-serialized JSON object with k key-value pairs, where each of the keys and values is a unique, randomly-generated UUID, and (ii) a particular key within that JSON object; the goal is to return the value associated with the specified key, with input contexts containing 75, 140, or 300 key-value pairs. The results show largely similar trends; in particular, we see the U-shaped performance curve again, with model performance lowest when the queried key-value pair sits in the middle of the input context.
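A sketch of generating this synthetic task; the prompt wording is an assumption, but the structure (a serialized JSON object of random UUID pairs plus one queried key whose position can be varied) matches the description above.

```python
# Sketch of the synthetic key-value retrieval task.
import json
import uuid

def make_kv_task(k: int = 75, gold_index: int = 0):
    keys = [str(uuid.uuid4()) for _ in range(k)]
    values = [str(uuid.uuid4()) for _ in range(k)]
    pairs = dict(zip(keys, values))   # dicts preserve insertion order
    query_key = keys[gold_index]      # position of the relevant pair in context
    prompt = ("Extract the value corresponding to the specified key in the "
              f"JSON object below.\n\n{json.dumps(pairs)}\n\n"
              f"Key: {query_key}\nValue:")
    return prompt, pairs[query_key]

prompt, expected = make_kv_task(k=75, gold_index=37)  # gold pair near the middle
```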
To better understand these results, we perform preliminary investigations into the role of model architecture (decoder-only vs. encoder-decoder), query-aware contextualization, and instruction fine-tuning.
Comparing the encoder-decoder models Flan-UL2 and Flan-T5-XXL with decoder-only models, we can speculate that encoder-decoder models make better use of their context windows because their bidirectional encoder processes each document in the context of future documents, potentially improving the estimation of relative importance between documents.
Decoder-only models cannot attend to query tokens while processing the documents, since the query appears only at the end of the prompt and they can attend only to prior tokens. Encoder-decoder models, on the other hand, use a bidirectional encoder to contextualize input contexts and are more robust to changes in the position of relevant information. We find that query-aware contextualization dramatically improves performance on the key-value retrieval task, while minimally affecting multi-document QA performance: it only improves performance when the relevant information is located at the very beginning, and slightly decreases performance in other settings.
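Query-aware contextualization simply places the query (the question or key) both before and after the data, so even a decoder-only model processes the documents with the query already in context. A minimal sketch; the wording is illustrative, not the paper's exact template.

```python
# Sketch of query-aware contextualization: repeat the query before the
# documents/JSON as well as after them.
def query_aware_prompt(query: str, context: str) -> str:
    return (f"Question/Key: {query}\n\n"
            f"{context}\n\n"
            f"Question/Key: {query}\nAnswer:")
```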
All evaluated models are instruction-tuned: they undergo supervised fine-tuning on a dataset of instructions and responses. During this process, the task specification/instruction is placed at the beginning of the input context, which may bias models toward the start of the input context. We conclude with a practical case study of open-domain question answering, finding that the performance of language model readers saturates far before retriever recall.
My Notes:
OpenAI’s GPT-4 has a maximum context window of 32K tokens; Claude has a context window of 100K tokens; OpenAI’s GPT-3.5-Turbo has a context window of 16K tokens; MPT-30B has a context window of 8K tokens; and LongChat-7B has a context window of 16K tokens.