(The following guest essay explains technical challenges associated with developing an AI Large Language Model suitable for use by the intelligence community. The authors are chief executive officer and chief data scientist of Kingfisher Systems.)
In a September 26 interview with Bloomberg, CIA Chief Technology Officer Nand Mulchandani indicated that the Central Intelligence Agency may seek to build a Large Language Model (LLM) for use by the Intelligence Community (IC), because the market-leading, commercially available LLMs are unsuitable for IC needs. Doing so will require three resources: computing, data, and talent.
At LLM scale, all three resources are costly. While the US government (USG) can easily meet the financial demands of Mulchandani’s proposal, it is poorly positioned to meet the data and talent requirements on its own, and a project that cannot meet them is likely to fail. The right mix for success is the private sector for data and personnel and the government for money and scaling compute.
Artificial Intelligence (AI) technology is rapidly maturing. While research into adapting LLMs to military and intelligence applications is work worthy of the Department of Defense (DoD) and IC research funds, research projects alone will not maintain parity with near-peer adversaries, particularly China. In the LLM space, broad deployment of current technology leads to data advantages that can be used to train the next generation of models, providing durable advantages to entities that can most rapidly iterate. The iteration must be with production systems, not demonstration projects.
The USG should partner with firm(s) that can provide clean data at scale and top AI talent, while providing the funding and infrastructure required to deploy a solution and bring it to the maturity level necessary for intelligence applications.
Kingfisher Systems, Inc.
Kingfisher can meet both data and talent requirements now. Because an LLM is a probability model over text sequences, its training data should consist of diverse, high-quality writing. Acquiring and working with closed sources would significantly slow development and deployment, and the more structured nature of closed-source text makes it ill-suited to form the bulk of a training corpus. Thus, Kingfisher Systems alone, with a data corpus extending back over 15 years, can likely meet the Open Source data requirements to train an IC LLM.
Despite the hype around LLMs, the underlying idea is simple: given a sequence of tokens (words or parts of words), predict the next token. Because this prediction is stochastic, many forward passes can be conducted, generating different possible responses. It was not surprising that, with a large enough model and training corpus, much of human knowledge, culture, arts, and social conventions could be encoded in the probability distribution. What was surprising was that algorithmic and hardware advances allowed these models to produce long pieces of text that were useful, even in professional settings and for expert users. The main algorithmic advance was the Transformer architecture, described by Google researchers in 2017. The main hardware advances were driven largely by one company, Nvidia.
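The next-token idea can be illustrated with a toy sketch. The tiny hand-written bigram table below stands in for the learned probability distribution (a real LLM learns billions of parameters over a large vocabulary); the point is only to show how repeated stochastic predictions yield different possible responses.

```python
import random

# Toy "language model": for each token, a probability distribution
# over possible next tokens. Entirely illustrative.
MODEL = {
    "the": {"agency": 0.5, "analyst": 0.3, "report": 0.2},
    "agency": {"reported": 0.6, "assessed": 0.4},
    "analyst": {"wrote": 0.7, "assessed": 0.3},
    "report": {"stated": 1.0},
}

def sample_next(token, rng):
    """One stochastic forward pass: sample the next token."""
    dist = MODEL.get(token)
    if not dist:
        return None
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs, k=1)[0]

def generate(prompt_token, max_len=5, seed=0):
    """Repeatedly predict the next token to extend the sequence."""
    rng = random.Random(seed)
    out = [prompt_token]
    for _ in range(max_len):
        nxt = sample_next(out[-1], rng)
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)
```

Calling `generate("the")` with different seeds produces different continuations, which is exactly why one prompt can yield many candidate responses.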
Since the next-token predictions are only as good as the underlying data, the data is the key asset. For intelligence applications, where the generated text must meet stringent quality and accuracy standards, the training corpus must consist primarily of professionally edited text. There are only a few sources of such text at scale: books, academic journals, and journalism.
The long lag between events and publication makes books unsuitable as a primary source of training data. Training and fine-tuning an IC model will therefore require very large collections of news and technical documents. To get a sense of the required scale, consider that the Llama 2 models from Meta were trained on a two-trillion-token corpus. Kingfisher’s holdings suggest that about 800 million distinct open source news articles would be required to achieve a similar corpus size without augmentation from lower-quality sources. Numbers for technical documents would be similar.
While the true extent of the open source data holdings within the USG is not publicly known, those holdings are unlikely to be adequate as the primary source of training data for an LLM sophisticated enough to meet the Government’s needs. Training future generations of an IC LLM would require a continuous stream of data at a similar scale.
Algorithms, hardware, and data alone are not sufficient to create a modern AI software product such as ChatGPT. The other necessary element is the prompt. Input from the user is combined with instructions provided by the system developer that increase the likelihood of desired answers; instructions to respond only with known information are common here. Careful prompting can enable an LLM to transfer the probability structure it learned from sequences that appear in the training data to sequences that occur only in test or production. This transfer is known as Out of Distribution Generalization (OOD Generalization).
By definition, OOD Generalization requires reasoning over inputs that are not found in the training data. For example, an LLM trained prior to October 2023 might have general knowledge of Middle East politics, but without information that became available only after training, such a model could only restate conventional wisdom when asked about specific objectives or likely next actions for each side in the current conflict. The solution to this problem is to include the most recent information in the prompt. Typical prompts contain some, or all, of the following:
- General statements about how the model should behave, known as the system prompt;
- Additional information that might be useful to the model, such as summaries of recent events that are likely to be relevant to the interests of the user;
- A question or directive to which the model should respond; and
- Optionally, the start of the desired response, which guides the model toward the desired form.
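The four components above can be sketched as a simple template-assembly function. The section labels and layout here are illustrative only; real deployments use whatever chat or prompt template the specific model was trained with.

```python
def build_prompt(system, background, question, response_start=""):
    """Assemble the four prompt components into a single string.

    system         -- general behavior instructions (the system prompt)
    background     -- additional information, e.g. recent-event summaries
    question       -- the question or directive to answer
    response_start -- optional seed text guiding the response's form
    """
    parts = [
        f"[SYSTEM]\n{system}",
        f"[BACKGROUND]\n{background}",
        f"[QUESTION]\n{question}",
    ]
    prompt = "\n\n".join(parts)
    if response_start:
        prompt += f"\n\n[RESPONSE]\n{response_start}"
    return prompt
```

For example, `build_prompt("Respond only with known information.", "Summary of recent events...", "What are the likely next actions?", "Assessment:")` yields a prompt whose final tokens steer the model toward an analytic assessment.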
Using the prompt, the model sequentially predicts the next tokens. The token sequence used in each prediction is known as the context window, and for technical reasons its length is limited: for Llama 2 it is 4096 tokens, or about 3100 words. Tight control over the model inputs is necessary to ensure that model behavior is aligned with organizational values and objectives, and in an IC context, alignment requirements are particularly stringent.
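The fixed context window forces a budgeting decision: which background material fits alongside the question? A minimal sketch, assuming a crude word-count token estimate (real systems use the model's own tokenizer), might look like this:

```python
WORDS_PER_TOKEN = 3100 / 4096  # rough ratio quoted above for Llama 2

def estimate_tokens(text):
    """Crude token estimate from word count. A real system would use
    the model's own tokenizer rather than this approximation."""
    return int(len(text.split()) / WORDS_PER_TOKEN) + 1

def fit_context(fixed_parts, candidate_snippets, limit=4096):
    """Greedily add background snippets until the token budget is spent.

    fixed_parts        -- prompt pieces that must be included (question, etc.)
    candidate_snippets -- background texts, most relevant first
    """
    used = sum(estimate_tokens(p) for p in fixed_parts)
    kept = []
    for snippet in candidate_snippets:
        cost = estimate_tokens(snippet)
        if used + cost > limit:
            break
        kept.append(snippet)
        used += cost
    return kept
```

Ordering candidates by relevance before calling `fit_context` matters, since whatever does not fit is simply invisible to the model.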
A popular approach to building AI-enabled personal assistants is to couple an LLM with a traditional search engine. Output from the search engine is used to populate the context window, and the LLM is then relied upon to read the search results and perform the analytic task. While recent advances have enabled the length limit to be increased to roughly book-length text, the LLM can still struggle to assess the veracity of the provided information if the raw data is of uneven quality. This situation is analogous to another poorly solved problem, fake news detection.
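One partial mitigation is to screen sources before their text ever reaches the context window, rather than asking the LLM to judge veracity itself. The sketch below assumes hypothetical `search`, `source_score`, and `llm` callables standing in for whatever search API, source-quality model, and LLM endpoint a deployment actually uses.

```python
def retrieve_and_prompt(query, search, source_score, llm,
                        min_score=0.7, max_snippets=5):
    """Couple a search engine with an LLM, filtering by source quality.

    search       -- callable returning hits as {"source": ..., "text": ...}
    source_score -- callable mapping a source name to a reliability score
    llm          -- callable taking a prompt string and returning text
    (All three are placeholders, not real APIs.)
    """
    hits = search(query)
    # Drop low-reliability sources before they reach the context window,
    # since the LLM itself struggles to assess veracity of uneven data.
    vetted = [h for h in hits if source_score(h["source"]) >= min_score]
    context = "\n\n".join(h["text"] for h in vetted[:max_snippets])
    prompt = (f"Using only the sources below, answer the question.\n\n"
              f"{context}\n\nQuestion: {query}\nAnswer:")
    return llm(prompt)
```

Pushing the quality judgment into a separate scoring step keeps the hardest part of the fake-news problem out of the generation step, though it does not solve it.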
Simply scaling Llama 2 or another open LLM will not meet the operational needs of the IC, and development should not proceed along this path.
Despite having nearly unlimited financial resources, the government will be limited by its ability to spend that money effectively.
Training data can only be collected at the rate it appears. Kingfisher Systems’ collection statistics and the above estimate of the number of articles required suggest that the timeline could be as much as several years, depending on how effectively archival news data can be obtained.
Algorithmic sophistication will also be required to build an IC model. Because of the caliber of talent required, the government would likely need to compete with the private sector to fill AI scientist positions.
Kingfisher Systems Position
Kingfisher is uniquely positioned to help. We recognized that LLMs had reached production-ready levels late last year, and we were able to internally demonstrate that commodity LLMs could reason like intelligence analysts when provided with the right data. Our long experience with Open Source exploitation provided us with two key advantages. First, we have the right data, organized and at the necessary scale. Second, we have already developed the techniques needed to select the right data to place in the context window.
Because we have already built a working system, we can extend it to generate training data for continuous fine-tuning of models, and we are building that capability now. Once operational, a feedback loop will rapidly widen the performance gap between our system and systems built from commodity parts.