Are you AI ready? Key considerations in Generative AI solution design

The AI industry continues to be an exciting and rapidly evolving area of technology, especially as more companies embrace Generative AI. While the initial enthusiasm around the technology is starting to settle, it is becoming clear that Generative AI is not a simple, one-size-fits-all answer to data science challenges, and the market is saturated with ‘expert AI solutions’ that simply do not work. Over the last year I have seen these failings first-hand: customers acquire off-the-shelf, industry-specific generative AI products that do not integrate with their technology or data, and organisations rush to implement RAG solutions in every area of the business without first asking whether the data is of good enough quality. Careful consideration of key design elements is crucial to ensure a strong return on investment. In this post, I’ll explore essential prerequisites, generative AI solution design decisions, and post-deployment monitoring strategies that can help lead to a successful AI implementation.

Pre-requisites

Looking back at a traditional data science project life-cycle, the initial phases are still relevant, and the pre-requisites come down to business understanding and data governance:

  • Do we have well-defined use cases that have been assessed to determine whether traditional ML models or generative AI models are the appropriate approach?

  • Do we have up-to-date data that can be indexed and vectorised so generative models can produce higher-accuracy results? Where business units hold multiple versions of the same document, or data with conflicting statements, the result is very noisy; generative AI solutions should not be expected to understand business logic without explicit instruction.

  • How do we secure sensitive data used within generative AI solutions (e.g. GDPR data or data subject to regulatory audit)?

  • Are there any biases within the data that could also be reflected in RAG solution responses?

Microsoft Purview and Fabric are great resources within the Azure ecosystem, with AI features that can assist in setting up effective data governance and data platform solutions, helping ensure data is of good quality and safe to use within RAG solutions.

Generative AI solution design

A Retrieval Augmented Generation (RAG) solution is one in which generative AI models are augmented with additional, often proprietary, data. Based on a user’s question or request, relevant information is retrieved and inserted into the model prompt so the LLM can generate a relevant response. A typical RAG solution starts by indexing the relevant data. During this process the text documents are chunked, or split, to reduce semantic confusion when used by LLMs and to keep within the bounds of model input limits. After chunking, a numerical representation of each chunk is obtained by applying an embedding model and storing the result in a vector store. Embedding models create a vector representation (an array of numbers) of the text, so semantically similar texts are numerically closer together than dissimilar texts, and distance metrics (e.g. cosine similarity) can be used to retrieve semantically relevant text. Once the data is embedded and stored, the retrieval and generation stages can run: a user’s input triggers relevant chunks to be retrieved from the store, and an LLM produces an answer from a prompt that combines the user question with the retrieved data. At each stage of this workflow, different approaches and strategies can be used to optimise results (minimal code sketches of several of these stages follow the list below):

  • How should the data be split/chunked? There are many approaches to splitting text (e.g. by sentence, paragraph, page, HTML section or semantic meaning). A common generic approach is a recursive text splitter, which recursively splits the document on common separators such as new lines and full stops until each chunk is an appropriate size (see the indexing sketch after this list). Factors influencing the choice of text splitter include the shape and density of the documents, the nature of user queries, and any chunk-size guidelines for the LLM’s optimal performance.

  • Which embedding model should be used? The Massive Text Embedding Benchmark (MTEB) leaderboard on Hugging Face is a good place to start when deciding which embedding model to use: it provides useful metrics on model size and on retrieval and summarisation performance, although the results are self-reported and not necessarily assessed on data relevant to your use case. In practice, performance comes down to accuracy and latency, and finding the right balance requires iterative experimentation and active evaluation on your own use case and data.

  • Which LLM should be used for our use case? LLM researchers have used reference datasets to assess LLMs and provide benchmarks such as GLUE and MMLU. These benchmarks assess a model’s performance across a variety of topics, such as history, politics, geography, disinformation and copyright infringement. Moving beyond academic evaluation, informative model hubs and LLM leaderboards are a great place to start as they provide metrics on quality, context window size, pricing, throughput, latency, and more. Domain-specific leaderboards, such as LegalBench, are also emerging to provide insights on more practical applications, as opposed to general knowledge. One of the most important considerations will be your use case, as this determines what type of model is best suited. For example, encoder-only models (e.g. BERT) perform well for sentiment analysis or named entity recognition; encoder-decoder models (e.g. BART) are useful for translation, summarisation and question answering; decoder-only models (e.g. GPT) perform well on a variety of tasks but excel at text generation (see here for a talk explaining how transformers work and the model architecture variations in more detail). Additional considerations include the need for customisation or domain-specific LLMs, accuracy requirements, inference speed, scalability, cloud vs on-premise deployment, budget and ethical assessment. As illustrated by Hoffmann et al. (2022), bigger is not always better: some large models have been over-parametrised and under-trained, and there are many emerging examples of superior results being achieved with smaller LLMs.

  • How will chat history be effectively managed? Depending on the intended use case, there may be a pre-defined number of conversational exchanges, or a token budget, that can be stored. Alternatively, the LLM itself can be used to summarise the conversation so far, with the summary stored as memory (the retrieval sketch after this list keeps a simple fixed-size window of exchanges).

  • What retrieval strategy should be employed? LangChain provides a variety of retrievers, such as the vectorstore retriever, multi-vector retriever, contextual compression, multi-query retriever, long-context reorder, parent document retriever, self-query, time-weighted vectorstore and ensemble retriever. Measuring retrieval performance through metrics such as query density drift, ranking metrics and user feedback is also valuable for continual improvement.

  • What LLM orchestration tooling will be used? LangChain and LangGraph are probably the most comprehensive options in terms of functionality for simplifying interactions with LLMs, and their large community backing means new features arrive quickly, but there are alternatives. The flip side of that pace is deprecation, which needs to be considered frequently for solutions running in production. Chains are core components of these applications as they facilitate weaving together numerous functionalities. For simple use cases LangChain’s default chain type uses the stuff method, which simply puts all the data into the prompt passed to the LLM. This makes a single call to the LLM, but there is an input context length to be aware of, so it will not be effective for larger documents. An alternative chain type is map reduce, which passes each chunk to the LLM and then uses a further LLM call to combine all the responses into a final answer (see the map-reduce sketch after this list); this allows for parallel processing, but the independence of the calls can also be a hindrance. The refine method is similar to map reduce but works iteratively, typically producing longer answers at the cost of higher latency. Map rerank passes each chunk to the LLM and asks for a relevance score, and the highest-scoring result is used to generate the final response. For linear use cases, where each prompt has a single input and a single output, LangChain’s sequential chains can be used. Where an input needs to be routed to one of several chains depending on the request, a router chain can be used (e.g. an educational tutor assistant that routes student questions to different chains based on topic: maths, science, history, etc.). However, for complicated workflows where better control over deterministic output is required, agentic flows have become the preferred development path, and here LangGraph is extremely powerful.

  • How complex is the LLM chain? Does self-reflective RAG and/or the use of agents need to be considered? Self-reflection is a prompting approach used to improve the quality and success rate of agents. It involves prompting the LLM to ‘think’ and ‘reflect’ on whether a user question needs re-writing for better retrieval, when to discard irrelevant documents, when to retry retrieval, and so on. The process of ‘thinking’ before instinctively acting promotes better results. LangGraph is a library within the LangChain ecosystem designed to aid the development of complex multi-agent LLM applications. Each node on the graph represents an LLM agent, usually with its own set of function tools, and the edges are the communication channels between these agents (a skeleton of such a graph is sketched after this list). The structure allows for controllable and manageable workflows where agents are responsible for specific tasks, and it provides flexibility, scalability and fault tolerance. During our internal Elastacloud hackathon this summer our team created an effective resourcing tool using LangGraph, in which a supervisor agent was able to allocate resources to projects based on information from several other agents: a customer agent, resource agent, allocation agent and project agent.

  • How will security of data access be managed via the user interface? AI applications are extremely powerful but also expose organisations and individuals to new threat vectors (e.g. prompt injection, data poisoning, model denial of service, model theft and excessive agency), which require novel security measures. Good security hygiene (e.g. enabling MFA, applying least privilege, keeping software patched and up-to-date, and protecting data) is a great start, but using LLMs in red-teaming practices can rapidly build a holistic understanding of how models can be misused and of what insights should be fed back into the model.
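
To make the indexing stage described above concrete, here is a minimal sketch in Python using LangChain: it splits a document with a recursive text splitter, embeds the chunks with an Azure OpenAI embedding deployment, and stores the vectors in a local FAISS index. The file name, deployment name and chunk sizes are illustrative assumptions, and exact import paths vary between LangChain releases.

```python
# Minimal indexing sketch (assumes langchain-openai, langchain-community,
# langchain-text-splitters and faiss-cpu are installed, and that Azure OpenAI
# credentials are supplied via the usual environment variables).
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS

raw_text = open("policy_handbook.txt").read()  # illustrative source document

# Recursively split on paragraphs, then sentences, then words, until each
# chunk fits the target size; the overlap preserves context across boundaries.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_text(raw_text)

# Embed each chunk and index the vectors. FAISS keeps everything in memory;
# a managed store such as Azure AI Search would typically be used in production.
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-large")  # placeholder deployment name
vector_store = FAISS.from_texts(chunks, embeddings)

# Semantically similar chunks sit close together in vector space, so a
# similarity search returns the most relevant passages for a query.
relevant = vector_store.similarity_search("What is the parental leave policy?", k=4)
```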
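
Building on that index, the sketch below wires retrieval and generation together and keeps a simple fixed-size window of chat history, as mentioned in the chat-history bullet. The answer() helper, deployment name and window size are illustrative assumptions rather than a prescribed pattern.

```python
# Retrieval-and-generation sketch with a windowed chat history. Reuses
# vector_store from the indexing sketch above.
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage

llm = AzureChatOpenAI(azure_deployment="gpt-4o", temperature=0)  # placeholder deployment name
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the context below.\n\nContext:\n{context}"),
    MessagesPlaceholder("history"),
    ("human", "{question}"),
])

history: list = []  # grows with the conversation; only a window is sent to the model

def answer(question: str, window: int = 6) -> str:
    # Retrieve relevant chunks and stuff them into the prompt alongside the
    # last `window` messages of history.
    docs = retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in docs)
    messages = prompt.invoke({
        "context": context,
        "history": history[-window:],
        "question": question,
    })
    response = llm.invoke(messages)
    history.extend([HumanMessage(question), AIMessage(response.content)])
    return response.content

print(answer("How much parental leave am I entitled to?"))
```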
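
The map-reduce pattern discussed in the orchestration bullet can be sketched with LangChain's load_summarize_chain helper: each chunk is summarised independently (“map”) and a further LLM call combines the partial summaries (“reduce”). This helper lives in the classic langchain package and is gradually being superseded by LCEL and LangGraph equivalents, so treat it as illustrative rather than the recommended implementation.

```python
# Map-reduce sketch: summarise chunks in parallel, then combine the results.
from langchain.chains.summarize import load_summarize_chain
from langchain_core.documents import Document

docs = [Document(page_content=c) for c in chunks]  # chunks from the indexing sketch

# chain_type could equally be "stuff" or "refine", trading latency for quality.
chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = chain.invoke({"input_documents": docs})["output_text"]
print(summary)
```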
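
Finally, a skeleton of a self-reflective RAG flow in LangGraph: retrieve, grade the retrieved chunks, and either generate an answer or rewrite the question and retry. The grading prompt is deliberately naive and the node logic is simplified; it reuses the retriever and llm objects from the earlier sketches and shows the shape of the pattern, not a production design.

```python
# Self-reflective RAG skeleton using LangGraph. LangGraph's default recursion
# limit will stop a runaway rewrite loop, but a real flow should cap retries.
from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END

class RAGState(TypedDict):
    question: str
    documents: List[str]
    answer: str

def retrieve(state: RAGState) -> dict:
    docs = retriever.invoke(state["question"])
    return {"documents": [d.page_content for d in docs]}

def grade(state: RAGState) -> str:
    # Ask the LLM whether the retrieved chunks answer the question; a
    # structured-output grader would be more robust in practice.
    verdict = llm.invoke(
        f"Do these passages answer the question '{state['question']}'? Reply yes or no.\n\n"
        + "\n\n".join(state["documents"])
    ).content.lower()
    return "generate" if "yes" in verdict else "rewrite"

def rewrite(state: RAGState) -> dict:
    better = llm.invoke(f"Rewrite this question to improve retrieval: {state['question']}")
    return {"question": better.content}

def generate(state: RAGState) -> dict:
    reply = llm.invoke(
        f"Answer '{state['question']}' using only the following context:\n\n"
        + "\n\n".join(state["documents"])
    )
    return {"answer": reply.content}

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("rewrite", rewrite)
graph.add_node("generate", generate)
graph.add_edge(START, "retrieve")
graph.add_conditional_edges("retrieve", grade, {"generate": "generate", "rewrite": "rewrite"})
graph.add_edge("rewrite", "retrieve")
graph.add_edge("generate", END)

app = graph.compile()
result = app.invoke({"question": "What is the parental leave policy?"})
print(result["answer"])
```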

Monitoring and Evaluation

  • How will RAG solution results be evaluated? The first port of call when considering how well a RAG solution is performing should be the data going into each step of the sequence and what is coming out. As with most data science solutions, garbage in leads to garbage out, hence the need to assess input prompts and the quality of the augmented data. Creating example question-and-answer pairs for evaluation by hand is, however, time consuming and not scalable. Beyond this initial manual investigation, more holistic evaluations across many solution interactions can be performed using LLM-assisted evaluation. QAGenerateChain functionality, for example, can use an LLM to scale the development of QnA pairs, and QAEvalChain allows an LLM to grade responses as correct or incorrect (see the sketch after this list). There are other evaluation tools, such as Lamini or Azure ML Model Evaluation, which can provide metrics on relevance, retrieval and groundedness, as well as a range of metrics on risk and safety. Good evaluation processes are quantitative, point to what can be improved, and are scalable and automated.

  • How will the RAG solution be monitored and maintained? In addition to verifying that users receive high-quality responses that are also safe, it is important to consider the overall performance of the AI application in terms of latency and load. With an understanding of the number of users, peak usage times and interaction patterns, benchmarks and simulations can be used to estimate performance. Client metrics may include the number of virtual users, requests per second, response time, latency and the number of failed requests. Broader LLM metrics to monitor may include the number of prompt tokens sent per minute, the number of generated tokens per minute, time to first token and time between tokens (a simple client-side probe is sketched after this list). A range of service metrics for Azure OpenAI, App Service and API Management are also useful to monitor.
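
As a sketch of the LLM-assisted evaluation mentioned above, the snippet below generates question/answer pairs from a sample of the indexed chunks, runs each question through the answer() helper from the earlier retrieval sketch, and has an LLM grade the predictions. QAGenerateChain and QAEvalChain live in the classic langchain package; their output shapes and import paths vary between releases, so the parsing here is defensive and illustrative.

```python
# LLM-assisted evaluation sketch: generate QnA pairs, predict, then grade.
from langchain.evaluation.qa import QAGenerateChain, QAEvalChain

# Generate example QnA pairs from a sample of chunks (chunks and llm come from
# the earlier sketches). Newer releases nest each pair under "qa_pairs".
gen_chain = QAGenerateChain.from_llm(llm)
generated = gen_chain.apply_and_parse([{"doc": c} for c in chunks[:10]])
examples = [item.get("qa_pairs", item) for item in generated]

# Run each generated question through the RAG solution and grade the answers.
predictions = [{"result": answer(ex["query"])} for ex in examples]
eval_chain = QAEvalChain.from_llm(llm)
graded = eval_chain.evaluate(examples, predictions)

for ex, verdict in zip(examples, graded):
    # The grading key is "results" in recent releases ("text" in older ones).
    print(ex["query"], "->", verdict.get("results", verdict.get("text")))
```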
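
For the client-side latency metrics listed above, a simple probe can stream a response and record time to first token, average time between tokens and total generation time. This is a rough building block for load testing that assumes the llm object from the earlier sketches; it is not a substitute for service-level monitoring in Azure.

```python
# Client-side latency probe: stream one response and record timing metrics.
import time

def probe(question: str) -> dict:
    start = time.perf_counter()
    first_token = None
    arrival_times = []
    for chunk in llm.stream(question):  # yields partial message chunks
        now = time.perf_counter()
        if first_token is None:
            first_token = now - start
        arrival_times.append(now)
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return {
        "time_to_first_token_s": first_token,
        "avg_time_between_tokens_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "total_time_s": time.perf_counter() - start,
    }

print(probe("Summarise the parental leave policy."))
```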

Conclusion

There are a variety of quick-start guides enabling individuals and organisations to get started rapidly with RAG solution development, but when initial experiments don’t deliver the expected results, careful consideration of the key design elements explored here is crucial to achieving the desired outcomes and a good return on investment.

References

Pre-Requisites

Prepare your data for AI with Microsoft Fabric

Leveraging Microsoft Purview for AI Governance and Privacy

Generative AI Solution Design

Integrated data chunking and embedding in Azure AI search

Build a retrieval augmented generation app

Build a RAG solution using Azure AI Search

Create a search index in the Azure portal

Text Splitters

Factors influencing how to split text

How to choose the best embedding model for your LLM application

Embedding Models

Retrievers: How to guides

Improve RAG pipelines with these 3 indexing methods

LLM Performance leaderboard

Choosing the right LLM for your needs

LLM alignment Survey

Choosing the correct LLM model from Hugging Face Hub

Selecting the right LLM for your use case

Choosing the right LLM for your Business

Choosing the best LLM model: A strategic guide for your organisation’s needs

ProLLM: LLM Benchmarks for real-world use cases

Building LLM Applications: Search and Retrieval

LangGraph from LangChain explained in simple terms

Self RAG: Learning to retrieve, generate and critique through self-reflection

Self-Reflecting AI agents using LangChain

Reflection Agents

OWASP: Top 10 for LLMs and Generative AI apps

Microsoft AI Red Team building a future of safer AI

Best practices of security for GenAI applications in Azure

Monitoring and Evaluation

Optimising document retrieval with LangChain: A comprehensive guide to efficient document retrieval

A comprehensive guide to using chains in LangChain

LangChain: LLM App Evaluation

LangChain Evaluation Framework

How a Fortune 500 slashed hallucination to create 94.7% accurate LLM agents for SQL

Evaluating RAG applications with AzureML Model Evaluation

Evaluate your RAG QnA apps with Azure ML PromptFlow

Evaluation and monitoring metrics for generative AI

Load Testing RAG based Generative AI Applications
