Where is AI going in 2026?
In 2024, RAG was all the rage. Retrieval-augmented generation provides a way to add relevant context to a query. In practice, text is turned into embeddings: numeric representations of words or tokens in which things that commonly occur together end up close to one another, and those vectors are stored in a database. At query time we compare the query's embedding against the stored ones and return the closest matches. Those results are then appended to your query, and the LLM can give you a more specific response.
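To make that flow concrete, here is a minimal sketch of the retrieval step in plain Python. The bag-of-words `embed` function, the tiny vocabulary, and the sample documents are toy stand-ins for a real embedding model and vector database; only the shape of the flow (embed, compare, take the top matches, prepend them to the prompt) is the point.

```python
import math

# Toy stand-in for a real embedding model: in production you would call an
# embedding API and store the resulting vectors in a vector database.
VOCAB = ["refund", "shipping", "invoice", "password", "delivery", "reset"]

def embed(text: str) -> list[float]:
    """Count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# A miniature "knowledge base": each document is stored with its vector.
docs = [
    "Your refund is issued 5-7 days after we receive the return.",
    "Shipping and delivery usually take 3 business days.",
    "You can reset your password from the account settings page.",
]
index = [(d, embed(d)) for d in docs]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

query = "When will my refund arrive?"
context = retrieve(query)
# The retrieved snippets are prepended to the prompt before calling the LLM.
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
print(prompt)
```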
Prompt engineering came out of this. It existed before, but ironically, being able to supplement your prompt made what you actually put in the prompt matter a lot more. Zero-shot means you give no examples of the kind of output you are looking for; few-shot means you give a few. By giving these examples we give the retrieved context a structure to slot into, and we get even better answers.
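As a quick illustration, compare the two prompt styles side by side. The reviews and labels below are made up; the only difference is whether worked examples precede the real question.

```python
# Zero-shot: just the task, no examples.
zero_shot = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

# Few-shot: a handful of worked examples that show the model the exact
# format and labels we want back, followed by the real question.
few_shot = (
    "Classify the sentiment of each review as positive or negative.\n\n"
    "Review: Arrived early and works perfectly.\nSentiment: positive\n\n"
    "Review: The screen cracked on the first drop.\nSentiment: negative\n\n"
    "Review: The battery died after two days.\nSentiment:"
)
```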
Because LLMs respond best to well-structured, context-rich prompts, we continued to iterate on this premise. What if the LLM could go and fetch the right context at any given point? And thus tools were introduced: if retrieving the context requires taking an action, we let a tool abstract that action.
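In practice, a "tool" is just a function plus a machine-readable description the model can read. Here is a hedged sketch with a hypothetical `lookup_order_status` tool, described in the JSON-schema style most chat-completion APIs use for tool calling; names and fields are illustrative.

```python
# The tool itself: a plain function that performs the retrieval action.
def lookup_order_status(order_id: str) -> str:
    """Hypothetical lookup; a real version would query an orders database."""
    fake_orders = {"A-1001": "shipped", "A-1002": "processing"}
    return fake_orders.get(order_id, "unknown order")

# A machine-readable description the LLM sees, so it knows when to call the
# tool and what arguments to supply.
lookup_order_status_spec = {
    "name": "lookup_order_status",
    "description": "Return the current status of a customer order.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, e.g. A-1001",
            },
        },
        "required": ["order_id"],
    },
}
```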
Enter 2025. An LLM that can call tools is called an agent. Agents help us by solving the last mile problem. In delivery, the last mile is getting a package from the distribution center to the purchaser's front door. In LLMs, the last mile is taking the information the LLM returns and using it to solve the problem the user is actually facing. That last mile is typically what a person would do, e.g. clicking through some buttons based on the information from the LLM. Agents can do that for us.
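Under the hood, an agent is little more than a loop: the model either proposes a tool call or gives a final answer, and we keep executing tool calls and feeding the results back until it finishes. The `call_llm` stub below is scripted so the sketch runs on its own; a real agent would call a model API there, and the order data is made up.

```python
import json

# Stubbed "LLM": this scripted version first asks for a tool call,
# then produces a final answer once it has seen a tool result.
def call_llm(messages: list[dict]) -> dict:
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "lookup_order_status", "arguments": {"order_id": "A-1001"}}
    return {"answer": "Your order A-1001 has shipped."}

def lookup_order_status(order_id: str) -> str:
    return {"A-1001": "shipped"}.get(order_id, "unknown")

TOOLS = {"lookup_order_status": lookup_order_status}

def run_agent(user_message: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if "answer" in reply:              # model is done: hand back its answer
            return reply["answer"]
        tool_fn = TOOLS[reply["tool"]]     # model wants to act: run the tool
        result = tool_fn(**reply["arguments"])
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    return "Stopped without a final answer."

print(run_agent("Where is my order A-1001?"))
```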
MCP came out of this. The Model Context Protocol is a standard that lets agents interact with tools in a consistent way. Traditional software is designed for human input, and humans provide input slowly and expensively (the cost of rendering a full interface versus loading a schema) compared to agents. By giving agents a standardized way to interact with tools and pull context, they can do things for us faster.
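A minimal MCP server can be very small. The sketch below assumes the official Python SDK (the `mcp` package) and its FastMCP helper, with the same hypothetical order-status tool as above; treat the exact API surface as an assumption and check the SDK documentation before relying on it.

```python
# Sketch of an MCP server exposing one tool over the standard protocol.
# Assumes the official `mcp` Python SDK and its FastMCP helper.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orders")

@mcp.tool()
def lookup_order_status(order_id: str) -> str:
    """Return the current status of a customer order (hypothetical data)."""
    return {"A-1001": "shipped", "A-1002": "processing"}.get(order_id, "unknown")

if __name__ == "__main__":
    # Runs the server so any MCP-capable agent can connect to it.
    mcp.run()
```

Once such a server is running, any MCP-aware client can discover and call `lookup_order_status` without bespoke glue code, which is exactly the standardization benefit described above.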
Agents can use this standardized approach, or something like it, to execute work on our behalf, so we are managing outcomes, not processes. This frees us up to worry about whether we are getting the right outcomes and to be more strategic about the goals we set for these agents. Where a human can take years to build aptitude at a task, agents, through reinforcement learning and virtualized environments, can approach the same aptitude in days or hours.
This makes sense. A model trained on 30B tokens has read the equivalent of 281,250 books (which works out to roughly 107,000 tokens, or about 80,000 words, per book). On many standardized tests, these LLMs match or beat the top benchmark scores. It's no surprise that this influences the direction they will head in 2026.
So far, we have primarily focused on text: context and tools in text. We are vaguely aware that LLMs and agents can work in other modalities, but that capability is underused. Most production use cases are text based, on both input and output. Yet the case for additional modalities is strong: most premier models already accept image, audio, and video inputs. The reason these use cases have not been adopted is that they are harder to monitor and guardrail; in other words, the risks currently outweigh the benefits. Next year the scales tip toward the benefits. The premier model providers have been testing and updating their multimodal capabilities all year, and as we have previously pointed out, a year is a long time for an LLM. So it seems realistic that next year we will see image, audio, and video generation that is production ready, and that we can also guardrail and monitor effectively.