
Browser Operation

In Agent TARS, we support three browser operation modes based on DOM, VLM, and a combination of DOM + VLM:

| Mode | Description |
| --- | --- |
| `dom` | Operates based on DOM analysis using Browser Use; the default mode in earlier versions of Agent TARS. |
| `visual-grounding` | Operates with GUI agents (UI-TARS / Doubao 1.5 VL), without any DOM-related tools. |
| `hybrid` | Combines the tools of both `visual-grounding` and `dom`. |

* All three modes include basic navigation tools (`navigate` and `tab`).

Next, I’ll explain the differences between the three modes, using a very simple captcha task as an example:

Open https://2captcha.com/demo/normal and pass it

DOM

Activation Method

// agent-tars.config.ts
import { defineConfig } from '@agent-tars/interface';

export default defineConfig({
  browser: {
    control: 'dom',
  },
});

Testing Result

Using the unified test prompt mentioned earlier:

Open https://2captcha.com/demo/normal and pass it

Performance is shown below. Because the LLM cannot see the screen, the operation path is highly convoluted and the task ultimately fails:


Working Principle

The DOM-based method works by analyzing the DOM and identifying interactive elements on the page:

Build DOM Tree And Highlight


[1]<img></img>
[2]<button>Captcha solver</button>
[3]<a>Entry job</a>
[4]<a>API</a>
[5]<a>Proxy</a>
[6]<a>Software</a>
[7]<a>Blog</a>
[8]<a>Sign up</a>
[9]<a>Log in</a>
// ... 

As a result, this method does not rely on vision. It can work even for models that do not support vision (e.g., DeepSeek).
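
A minimal sketch of this indexing step is shown below. This is a conceptual illustration, not Browser Use's actual implementation; the selector and field names are assumptions. It walks the page, collects visible interactive elements, and serializes them with the `[n]` indices the LLM references in its actions:

// Conceptual sketch, not the actual Browser Use implementation.
// Runs in the page context (e.g., via page.evaluate in Playwright/Puppeteer).
const INTERACTIVE_SELECTOR =
  'a, button, input, select, textarea, [role="button"], [onclick]';

interface IndexedElement {
  index: number; // the [n] label the LLM references in its actions
  tag: string;   // element tag name, e.g. "button"
  text: string;  // trimmed visible text, truncated for the prompt
}

function buildDomIndex(): IndexedElement[] {
  return Array.from(document.querySelectorAll<HTMLElement>(INTERACTIVE_SELECTOR))
    .filter((el) => el.offsetParent !== null) // skip invisible elements
    .map((el, i) => ({
      index: i + 1,
      tag: el.tagName.toLowerCase(),
      text: el.innerText.trim().slice(0, 80),
    }));
}

// Serialize for the LLM prompt, e.g. "[2]<button>Captcha solver</button>"
const serialized = buildDomIndex()
  .map((e) => `[${e.index}]<${e.tag}>${e.text}</${e.tag}>`)
  .join('\n');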

Visual Grounding

Activation Method

// agent-tars.config.ts
import { defineConfig } from '@agent-tars/interface';

export default defineConfig({
  browser: {
    control: 'visual-grounding',
  },
});

Testing Result

Using the unified test prompt mentioned earlier:

Open https://2captcha.com/demo/normal and pass it

The agent can see the page and directly perform click and input actions, completing the task quickly:

Working Principle

Essentially, the model performs grounding and returns specific coordinates and content for interaction, which the browser operator then executes. The example task above produced three model outputs, parsed as follows:

click(point='<point>383 502</point>')  # 1. Activate input field
type(content='W9H5K')                  # 2. Enter captcha code
click(point='<point>339 607</point>')  # 3. Click Check button to complete

Details of this parsing process and cross-model compatibility are more complex than they appear and are out of scope for this document.
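
As a rough illustration of what that parsing involves, here is a simplified sketch. The action grammar and function names are assumptions; the real parser handles more action types and model-specific output formats:

// Simplified sketch of parsing a grounding action string.
type Action =
  | { kind: 'click'; x: number; y: number }
  | { kind: 'type'; content: string };

function parseAction(raw: string): Action | null {
  const click = raw.match(/^click\(point='<point>(\d+) (\d+)<\/point>'\)$/);
  if (click) return { kind: 'click', x: Number(click[1]), y: Number(click[2]) };
  const typed = raw.match(/^type\(content='([^']*)'\)$/);
  if (typed) return { kind: 'type', content: typed[1] };
  return null; // unknown action: a real parser would raise or fall back
}

parseAction("click(point='<point>383 502</point>')");
// => { kind: 'click', x: 383, y: 502 }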

Comparison with UI-TARS-desktop

If you’ve used UI-TARS-desktop, you can think of this mode as the UI-TARS-desktop browser operator combined with navigation and information-extraction tools.

Hybrid

Activation Method

// agent-tars.config.ts
import { defineConfig } from '@agent-tars/interface';

export default defineConfig({
  browser: {
    control: 'hybrid',
  },
});

Testing Result

The performance of the hybrid mode is consistent with Visual Grounding.

Working Principle

The Hybrid mode merges the action spaces of the DOM and Visual Grounding methods and relies on prompt engineering to guide tool selection; the final choice of tool is made by the model itself. Since Visual Grounding already includes tools for information extraction, Hybrid's actual performance closely matches that of Visual Grounding in most cases.

However, in certain scenarios, Hybrid can attempt the lighter DOM method first. If it fails, Visual Grounding can act as a fallback:

Hybrid Browser Control in Agent TARS

Thus, theoretically, Hybrid mode offers better fault tolerance and adaptability.
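
Conceptually, the fallback path can be sketched like this. The helper names are hypothetical; in practice the model itself chooses among the merged tool set via prompting rather than through explicit try/catch logic:

// Hypothetical sketch of DOM-first execution with a visual fallback.
// In Agent TARS the model selects tools itself; this only illustrates the idea.
declare function runDomAction(task: string): Promise<void>;
declare function runVisualGroundingAction(task: string): Promise<void>;

async function hybridStep(task: string): Promise<void> {
  try {
    await runDomAction(task); // try the lighter DOM path first
  } catch {
    await runVisualGroundingAction(task); // fall back to screenshot-based grounding
  }
}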

Comparison

| Comparison Dimension | DOM | Visual Grounding |
| --- | --- | --- |
| Principle | Parses the DOM structure via JavaScript to identify interactive elements | Analyzes screenshots with visual models to understand the visual layout and elements |
| Visual Understanding Ability | Limited; unable to interpret aesthetics and visual design | Can understand visual layout and user experience |
| Dynamic Content Handling | Restricted; limited ability to process Canvas and complex CSS-rendered content | Flexible; capable of handling various visual presentations |
| Cross-Framework Compatibility | Dependent on DOM structure | Framework-independent; can analyze any webpage that can be screenshotted |
| Real-Time Capability | Good; can access real-time page updates | Moderate; requires screenshot capture and model processing time |

Model Compatibility

The model compatibility overview is as follows:

| Model Provider | Model | Text | Vision | Tool Call & MCP | Visual Grounding |
| --- | --- | --- | --- | --- | --- |
| volcengine | Seed1.5-VL | ✔️ | ✔️ | ✔️ | ✔️ |
| anthropic | claude-3.7-sonnet | ✔️ | ✔️ | ✔️ | 🚧 |
| openai | gpt-4o | ✔️ | ✔️ | ✔️ | 🚧 |

This table summarizes the capabilities that prominent models support across the different browser operation modes.
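
For example, pairing a model that supports visual grounding (such as Seed1.5-VL on volcengine) with the visual-grounding mode might look like the sketch below. The model field names are assumptions; check the configuration reference for your version:

// agent-tars.config.ts
import { defineConfig } from '@agent-tars/interface';

export default defineConfig({
  // Assumed model fields; consult your version's configuration reference.
  model: {
    provider: 'volcengine',
    id: 'ep-...', // placeholder for your Seed1.5-VL endpoint ID
    apiKey: process.env.ARK_API_KEY,
  },
  browser: {
    control: 'visual-grounding',
  },
});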