Your attention will make Agent TARS better and better ❤️
In March, we open-sourced an early technical preview of Agent TARS. Agent TARS's tagline is "an open-source multimodal AI agent offering seamless integration with various real-world tools". After its release, it received support from the community and gained a degree of traction.
* You can find all these past moments on our official Twitter.
After a period of iteration, two things happened. On one hand, the Seed multimodal models were steadily enhanced with the successive releases of UI-TARS 1.5 and Doubao 1.5 VL; on the other hand, we received a great deal of feedback from the open-source community. At the same time, the existing architecture was struggling to support the project's long-term development: for example, the Agent and UI were not decoupled, which made evaluation and standalone use difficult.
After a period of architectural design and iteration, we bring you the Beta version of Agent TARS. We will first introduce Agent TARS CLI, a Multimodal AI Agent tool designed to be "available anytime, anywhere".
Before introducing the new version, we'll share some of our understanding of Agent TARS's design principles, which will help you understand this release. In the Agent TARS core team's view, a robust Agent system needs to do three things well:
"Building agents that run for a long time" has always been one of Agent TARS's long-term goals. Take the following example:
The Agent completed the task after about twenty rounds. In multimodal tasks especially, the context can easily overflow without careful Context Engineering. In Agent TARS, Context Engineering is reflected in the following aspects:
In Agent TARS, the main components of the Memory for each Agent Loop are as follows:
These contents are "dynamically" assembled into a request on each Agent Loop, and are constrained by the model's Context Window. As of today, typical model context windows are as follows:
Taking a 128k context as an example, and assuming that for Research-type tasks each tool call's Tool Result averages 5,000 tokens, then even if we ignore the System Prompt, we can deduce that without any processing the Agent would overflow at round 26, which is clearly insufficient for long-running operation.
At the same time, for multimodal GUI Agents, and accounting for how differently model services count image tokens, we assume that a single high-detail image can cost up to 5,000 tokens[2]. To solve this problem, Agent TARS internally adopts a dynamic optimization strategy of "applying different sliding windows to content of different modalities" and budgets the result against the model's Context Window.
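To make this concrete, here is a minimal sketch of what per-modality sliding windows could look like. The type names, window sizes, and token estimates are illustrative assumptions, not Agent TARS's actual implementation:

```typescript
// A minimal sketch of per-modality sliding windows; not the actual Agent TARS
// implementation. The type names, window sizes, and token estimates are assumptions.
type Modality = 'text' | 'image';

interface MemoryItem {
  modality: Modality;
  estimatedTokens: number; // e.g. ~5,000 for a high-detail screenshot
  content: unknown;
}

interface WindowPolicy {
  maxImages: number;     // keep only the most recent screenshots
  maxTextItems: number;  // keep a longer window of textual tool results
  contextBudget: number; // hard cap derived from the model's Context Window
}

function applySlidingWindows(history: MemoryItem[], policy: WindowPolicy): MemoryItem[] {
  // 1. Apply a per-modality window, keeping the newest items of each modality.
  const kept: MemoryItem[] = [];
  let images = 0;
  let texts = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const item = history[i];
    if (item.modality === 'image' && images < policy.maxImages) {
      kept.unshift(item);
      images++;
    } else if (item.modality === 'text' && texts < policy.maxTextItems) {
      kept.unshift(item);
      texts++;
    }
  }
  // 2. Enforce the overall budget, dropping the oldest remaining items first.
  let total = kept.reduce((sum, item) => sum + item.estimatedTokens, 0);
  while (total > policy.contextBudget && kept.length > 0) {
    total -= kept.shift()!.estimatedTokens;
  }
  return kept;
}
```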
Since the early stages of Agent TARS, we have been building MCP. You can find our early practices in MCP Brings a New Paradigm to Layered AI Application Development.
In Agent TARS Beta, we still use a similar architecture internally, but with slight differences, which stem from new challenges we encountered. We found that while MCP effectively solves the problem of "separating Agent developers from tool developers", it also introduces some chaos into Context Engineering:
As the figure above shows, if we adopt the standard "separation" approach, in which the Agent fully trusts the Tool Definitions and Tool Results obtained from MCP Servers through the MCP Client, we may face the following problems:
```
400 This model's maximum context length is 128000 tokens. However, your messages resulted in 138773 tokens. Please reduce the length of the messages.
```
Yes, the second problem is the core issue discussed here. In our MCP Browser practice especially, we found that tools like `browser_get_html` and `browser_get_text` would cause the Agent to fail on many websites. To address this, we had to either discard these tools or replace them with better implementations, such as a `browser_get_markdown` built on Readability with pagination support.
Based on the above, we can roughly draw the following inference: the more an Agent needs fine-grained Context Engineering control, the less it may need MCP's silent Prompt injection behavior. Even with MCP integration, Agent developers still need to perform much of the same fine-grained control they would apply to ordinary Function Tools. From this perspective, MCP's value for production-level Agents should be as a standardized Tool distribution protocol, rather than a way of freely expanding Tools through configuration such as `mcpServers`.
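As an illustration of this kind of fine-grained control, the sketch below filters and wraps tools obtained from an MCP Server before exposing them to the model. The `ToolDefinition` shape and the truncation threshold are assumptions for illustration, not the real MCP SDK types:

```typescript
// A sketch of curating MCP tools instead of trusting them wholesale. The
// ToolDefinition shape and the truncation threshold are illustrative assumptions.
interface ToolDefinition {
  name: string;
  description: string;
  execute: (args: Record<string, unknown>) => Promise<string>;
}

// Tools we deliberately drop because their results tend to blow up the context window.
const BLOCKED_TOOLS = new Set(['browser_get_html', 'browser_get_text']);

function curateMcpTools(remoteTools: ToolDefinition[]): ToolDefinition[] {
  return remoteTools
    .filter((tool) => !BLOCKED_TOOLS.has(tool.name))
    .map((tool) => ({
      ...tool,
      // Wrap execution so every Tool Result passes through our own Context
      // Engineering (truncation here; pagination or markdown conversion in practice).
      execute: async (args: Record<string, unknown>) => {
        const raw = await tool.execute(args);
        return raw.length > 20_000 ? `${raw.slice(0, 20_000)}\n[truncated]` : raw;
      },
    }));
}
```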
We have many more practices and opinions about MCP, so please stay tuned for our future updates.
Today, the community hosts a variety of MCP Registries of uneven quality. In the long term, the MCP ecosystem should have a standard Benchmark suite that provides clear reference scores for each tool (such as model compatibility, context compression rate, and performance) to help Agent developers make better choices.
Is doing the above enough? Clearly not. In Agent TARS's upcoming plans, we are promoting a multi-level Memory design:
Level | Definition |
---|---|
L0 (Permanent) | Permanent memory, preserved across all Session Runs, such as the user's initial input messages and the Agent's Answers |
L1 (Run) | Memory effective only in the current Session Run, such as the Plan for the current Run |
L2 (Loop) | Memory effective only in the current Run Loop, such as Tool Calls, Tool Results, and environmental input (screenshots) |
L3 (Ephemeral) | Temporary memory, such as streaming message chunks and the Agent's one-off status |
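As a sketch of how this stratification could be modeled in code, here is one possibility; the enum and interfaces are illustrative, not Agent TARS's actual types:

```typescript
// One possible way to model the memory levels from the table above.
// The enum and interfaces are illustrative, not Agent TARS's actual types.
enum MemoryLevel {
  Permanent = 'L0', // preserved across all Session Runs (user inputs, Agent Answers)
  Run = 'L1',       // lives for the current Session Run (e.g. the current Plan)
  Loop = 'L2',      // lives for the current Run Loop (Tool Calls, Tool Results, screenshots)
  Ephemeral = 'L3', // streaming chunks and one-off status, never persisted
}

interface MemoryEntry {
  level: MemoryLevel;
  createdAt: number;
  payload: unknown;
}

// When a Run Loop ends, only L0 and L1 entries survive into the next loop.
function pruneAfterLoop(entries: MemoryEntry[]): MemoryEntry[] {
  return entries.filter(
    (entry) => entry.level === MemoryLevel.Permanent || entry.level === MemoryLevel.Run,
  );
}
```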
Next, building on this stratification, we will adopt further strategies to compress the Context at each level:
Additionally, on top of the current Chat Completion protocol, we plan to extend the Agent to also support the Responses API, leveraging LLM Serving's Image Cache to further improve performance in multimodal reasoning scenarios.
"Observability and evaluability are issues that all Agent Frameworks face and need." Here, we mainly explain some differences in Agent TARS's construction and our evolutionary direction.
In Agent TARS, the main challenge in long-step task scenarios is that the Agent's internal details become increasingly difficult to observe, which poses a huge challenge to the stability of continuous framework iteration. We therefore need a mechanism for observing how the Agent runs. In the Agent TARS Kernel, many environmental factors are involved while the Agent is running, such as:
To address this, we introduced a design pattern that saves the environment the Agent depends on as a Snapshot at runtime, and then replays the Agent from that Snapshot to verify that the Agent's Context, Run Loop state, and final Response remain deterministic. The resulting Snapshot framework works roughly as follows:
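The sketch below only illustrates the replay idea; the `AgentSnapshot` class, the `RecordedLoop` shape, and the sample data are hypothetical names for illustration, not the actual Agent TARS Snapshot API:

```typescript
// A minimal sketch of snapshot-based replay. `AgentSnapshot` and `RecordedLoop`
// are hypothetical names for illustration, not the real Agent TARS API.
interface RecordedLoop {
  llmResponse: string;   // model output captured during the original run
  toolResults: string[]; // environment inputs (tool results, screenshots) captured during the run
}

class AgentSnapshot {
  constructor(private readonly loops: RecordedLoop[]) {}

  // Replay feeds the recorded environment back into the loop runner so that the
  // Context, Run Loop state, and final Response are fully deterministic.
  async replay(
    runLoop: (recorded: RecordedLoop, context: string[]) => Promise<string>,
  ): Promise<string> {
    const context: string[] = [];
    let finalResponse = '';
    for (const recorded of this.loops) {
      finalResponse = await runLoop(recorded, context);
      context.push(finalResponse, ...recorded.toolResults);
    }
    return finalResponse;
  }
}

// Usage: replaying the snapshot must reproduce the recorded final answer.
const snapshot = new AgentSnapshot([
  { llmResponse: 'call web_search("agent tars")', toolResults: ['...search results...'] },
  { llmResponse: 'Final answer: ...', toolResults: [] },
]);

snapshot
  .replay(async (recorded) => recorded.llmResponse)
  .then((finalResponse) => {
    console.assert(finalResponse === 'Final answer: ...', 'replay should be deterministic');
  });
```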
The Snapshot framework already powers Agent TARS's continuous integration, driving the tests of @multimodal/agent, @mcp-agent/core, and @magent-tars/core, and has so far helped us catch more than ten issues during Beta development:
In the Alpha, Agent TARS was an Electron application, so we could only evaluate it manually, which was inefficient. Starting from Beta, Agent TARS adopts a new architecture that layers the Agent and the UI, and its Headless running mode makes automated evaluation possible. Referencing OpenAI's simple-evals, we implemented a browsecomp evaluation for Agent TARS through cross-process calls between Python and TypeScript:
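As a rough sketch of that cross-process setup, the TypeScript side might hand each candidate answer to a Python grader over stdin; the grader script path and the JSON message shape below are assumptions for illustration:

```typescript
// A hedged sketch of a TypeScript-to-Python cross-process call for grading.
// The grader script path and the JSON message shape are illustrative assumptions.
import { spawn } from 'node:child_process';

function gradeWithPython(question: string, candidateAnswer: string): Promise<string> {
  return new Promise((resolve, reject) => {
    // Hypothetical Python grader wrapping a simple-evals style browsecomp check.
    const grader = spawn('python3', ['evals/browsecomp_grader.py']);
    let verdict = '';
    grader.stdout.on('data', (chunk) => (verdict += chunk.toString()));
    grader.on('error', reject);
    grader.on('close', (code) =>
      code === 0 ? resolve(verdict.trim()) : reject(new Error(`grader exited with code ${code}`)),
    );
    // Send one JSON line over stdin and close the pipe.
    grader.stdin.write(JSON.stringify({ question, candidateAnswer }) + '\n');
    grader.stdin.end();
  });
}

// Usage: run the Agent headlessly to obtain `candidateAnswer`, then
// await gradeWithPython(question, candidateAnswer); // e.g. "CORRECT" / "INCORRECT"
```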
In addition, we are also building some basic evaluation sets for MCP Tools, which are currently used to evaluate the cross-model compatibility performance of MCP Agent within Agent TARS. Once these capabilities are fully improved, we will introduce a complete Benchmark solution in the official version.
"Agent applications" is a key direction that Agent TARS has been focusing on and designing. A good Agent solution should make it easy to build applications.
To help you quickly understand what the Event Stream is, let's demonstrate with a real example. When you start the Agent TARS CLI locally, you can trigger a task execution using `curl`:
You'll see the Agent's Response being output like a Stream:
This includes Agent status, Tool call details, final Agent replies, environment information, and more. Yes, this design makes the entire Agent running process completely visible, allowing you to easily build your own Agent UI based on this Event Stream:
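For example, a custom renderer is essentially a fold over this stream; the event type names in the sketch below are illustrative, not the exact AgentEventStream schema:

```typescript
// A sketch of rendering a UI from the Agent Event Stream. The event type names
// below are illustrative assumptions, not the exact AgentEventStream schema.
type AgentEvent =
  | { type: 'agent_status'; status: string }
  | { type: 'tool_call'; name: string; args: Record<string, unknown> }
  | { type: 'tool_result'; name: string; content: string }
  | { type: 'assistant_message'; content: string };

function renderEvent(event: AgentEvent): string {
  switch (event.type) {
    case 'agent_status':
      return `⏳ ${event.status}`;
    case 'tool_call':
      return `🔧 ${event.name}(${JSON.stringify(event.args)})`;
    case 'tool_result':
      return `📄 ${event.name} → ${event.content.slice(0, 80)}…`;
    case 'assistant_message':
      return `🤖 ${event.content}`;
  }
}

// Any UI (terminal, Web UI, or a custom renderer) is just a fold over the stream.
async function renderStream(events: AsyncIterable<AgentEvent>): Promise<void> {
  for await (const event of events) {
    console.log(renderEvent(event));
  }
}
```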
The benefit goes beyond having the Agent and UI follow a shared protocol: if you don't like the UI, you can replace the UI implementation entirely. Yes, this is also one of Agent TARS's future visions, enabling the community to define different Agent TARS UI implementations.
"In a sense, Agent UI is just a Replay of Agent Event Stream"
This stems from Agent TARS's Kernel being built on Event Stream itself. You'll be able to experience the elegance of these data structures through Agent TARS's SDK:
You'll get the following output:
This design keeps the architecture of Agent TARS Server and Agent TARS Web UI simple enough: you only need to focus on Session management and implement a Renderer dedicated to AgentEventStream, which we call `<EventStreamRenderer />`.
AG-UI is a cutting-edge protocol that aims to standardize the connection between frontend applications and AI Agents through an open protocol. Does this sound similar to the Agent Event Stream mentioned above? Yes, indeed. When AG-UI was released, we took note of it and studied it thoroughly. We found many valuable aspects in the AG-UI Protocol, such as State Management Events, which are based on the JSON Patch format (RFC 6902) and enable more natural START-CONTENT-END incremental state updates.
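For reference, a JSON Patch (RFC 6902) state delta of the kind such events carry looks roughly like this; the state shape is made up for illustration and the AG-UI event envelope is omitted:

```typescript
// An illustrative RFC 6902 (JSON Patch) state delta. The state shape is made up,
// and the AG-UI event envelope that would wrap the delta is omitted here.
const previousState = { plan: { steps: ['search the web'] }, status: 'running' };

// Each incremental update is a list of patch operations against the previous state;
// any JSON Patch library (e.g. fast-json-patch) can apply it to `previousState`.
const stateDelta = [
  { op: 'add', path: '/plan/steps/-', value: 'summarize findings' },
  { op: 'replace', path: '/status', value: 'completed' },
];
```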
Here, we need to explain some differences between the Agent Event Stream in Agent TARS and the AG-UI Protocol:
Scenario | Agent Event Stream | AG-UI Protocol |
---|---|---|
Build UI | YES | YES |
Build Context | YES | NO |
Yes, the Agent Event Stream is also used internally in Agent TARS to build Context, which is exactly the Context Engineering mentioned earlier. Doesn't it all connect? As for the details of the Agent Event Stream, we will write a separate blog post about it in the future.
That covers the key insights behind building Agent TARS Beta; we hope they help you understand this release. Next, we will formally introduce the capability changes that Agent TARS Beta brings.
Starting from Beta, Agent TARS's application form has evolved from an Electron App to a CLI, bringing a new Web UI based on a new architecture:
This stems from the CLI's advantages, without losing core capabilities:

* `@agent-tars/cli` has already iterated through 35 versions, far exceeding the early Electron app's 9 versions, allowing us to fix user-reported issues faster.

For usage guidelines of `@agent-tars/cli`, please visit Quick Start.
In the early preview version, Agent TARS adopted the same DOM Extraction approach as browser-use, detecting interactive element sequences through DOM analysis. The LLM would reason and output the element number to operate next, completing the operation process:
In Agent TARS Beta, we introduce a visual control solution based on UI-TARS. The logic of operating the Browser is much closer to how humans understand a screen: the VLM first looks at the screen, then thinks and outputs the specific actions to perform (such as clicking or dragging), ultimately completing browser control tasks.
Let me show you this difference with a very simple CAPTCHA task:
First, the DOM-based approach: since the underlying LLM cannot see the screen, the operation path is very convoluted and the task ultimately fails:

Then, Visual Grounding: since the model can see the screen, the VLM performs visual reasoning and outputs Click and Type Actions, completing the task quickly:
In the final released API, we provide three operation modes:
Term | Introduction |
---|---|
dom | Operation based on DOM analysis (Browser Use), which was the approach in early versions of Agent TARS. |
visual-grounding | Operation based on the GUI Agent (UI-TARS / Doubao 1.5 VL), without DOM-related tools. |
hybrid | Operation combining both visual-grounding and dom. |
For how to use and more details, please visit Browser Operation.
In Agent TARS's early preview version, we only supported Claude 3.7, yet we received numerous requests from the community for compatibility with various models. We opened a Discussion about Model Compatibility #377, but due to the architecture limitations at the time, we couldn't quickly deliver good model compatibility.
Finally, starting from Beta, we completely rewrote the Model Provider layer. After hands-on testing, we can finally tell you that we've achieved much better model compatibility: Agent TARS now runs on Model Providers such as Volcengine, Anthropic, and OpenAI.
Currently, the compatibility list for typical models is as follows:
Model Provider | Model | Text | Vision | Tool Call & MCP | Visual Grounding |
---|---|---|---|---|---|
volcengine | Seed1.5-VL | ✔️ | ✔️ | ✔️ | ✔️ |
anthropic | claude-3.7-sonnet | ✔️ | ✔️ | ✔️ | 🚧 |
openai | gpt-4o | ✔️ | ✔️ | ✔️ | 🚧 |
In the new version of Agent TARS, we built the entire architecture on Streaming, significantly improving the interactive experience for complex tasks:
Thanks to the Agent Event Stream introduced in the new version of Agent TARS, Web UI can be developed completely independently and interact with Agent TARS Server through protocols. Ultimately, we bring a clean Web UI:
This design was first introduced with the UI-TARS 1.5 release. Starting from Beta, Agent TARS supports GUI Grounding, and the process includes "real-time mouse tracking":
Agent TARS Web UI saves Replays locally by default, and also supports configuring `share.provider` to upload them to your own server.
The new version of Agent TARS not only brings initial support for multimodal input in the Web UI; more notably, the Web UI ships with general-purpose multimodal content renderers that pick an appropriate UI for each piece of content without being coupled to specific Tools or MCP Servers:
Based on the insights above, we completely rewrote Agent TARS for Beta, bringing a brand-new multi-layered architecture built on an Event Stream-driven Agent Kernel.
Overall, the main components of Agent TARS are as follows:
Architecture of Agent TARS Beta
From a technical architecture perspective, Agent TARS Beta has evolved Agent TARS from an Electron App to an Agent ecosystem.
In addition to the examples shown in the feature introduction above, here are some examples from our internal developers. Although these capabilities are not officially supported by Agent TARS, they have become possible thanks to continuous improvements in model capability and the evolution of Context Engineering:
Agent TARS does not yet officially support Coding or Artifact Preview, but it does support multimodal input. You can write code using the File tool and preview it with the Browser:
Similarly, some developers have even achieved nearly professional UI output through Agent TARS's autonomous multi-round iterations:
With just one model, Doubao 1.5 VL, we completed writing a game, then "played" it ourselves, and finally "beat the computer" we programmed:
For Agent TARS, this is a very valuable beginning. In the earliest UI-TARS-desktop, the GUI Agent was the only first-class citizen; now, the GUI Agent is no longer an isolated entity and plays its part within an integrated environment!
Here is another example of a direction Agent TARS has been focusing on, namely how to generate multimodal content:
The release of Agent TARS Beta is just the beginning. In fact, we have only released the first, default version of Agent TARS; a version with dynamic planning and reasoning is currently in internal testing. Agent TARS is still developing at a rapid pace. Although this release brings many new features, there is still much to improve, such as the developer documentation and the presentation of deliverables. We will keep pushing out subsequent releases as quickly as possible, so please stay tuned for our updates.
In the future, we hope Agent TARS will truly become an Agent development tool that everyone can use anytime, anywhere. We welcome you to try Agent TARS and share your thoughts with us. Thank you to all the community members who have supported us ❤️