UI-TARS is a native GUI agent model designed to interact with graphical user interfaces (GUIs) by emulating human-like perception, reasoning, and action. Unlike traditional modular frameworks, UI-TARS integrates the key components of perception, reasoning, grounding, and memory into a single vision-language model, enabling end-to-end task automation without predefined workflows or manual rules. This unified design lets the model process multimodal inputs such as text, images, and interactions, build a coherent understanding of an interface, and respond accurately to dynamic changes in real time. UI-TARS operates across desktop, mobile, and web environments, using a standardized action framework to execute complex, multi-step tasks through advanced reasoning and planning; a minimal sketch of such a framework follows below. Trained on a combination of large-scale annotated and synthetic datasets, it achieves strong generalization and robustness, making it well suited to automated GUI interaction.
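
To make the idea of a standardized, cross-platform action framework concrete, here is a minimal Python sketch. The action names, fields, and the `Backend` interface are illustrative assumptions for this example only, not the actual UI-TARS action schema or API.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple


class Backend(Protocol):
    """Platform-specific executor (e.g. a desktop, mobile, or web driver).
    Hypothetical interface for illustration; not part of UI-TARS itself."""
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...
    def scroll(self, x: int, y: int, dy: int) -> None: ...
    def press(self, *keys: str) -> None: ...


@dataclass
class GUIAction:
    """One step in a unified action space, emitted after reasoning over a screenshot."""
    kind: str                                   # e.g. "click", "type", "scroll", "hotkey", "finished"
    target: Optional[Tuple[int, int]] = None    # pixel coordinates grounded from the image
    text: Optional[str] = None                  # payload for "type"
    keys: Tuple[str, ...] = ()                  # payload for "hotkey"
    scroll_dy: int = 0                          # payload for "scroll"


def execute(action: GUIAction, backend: Backend) -> None:
    """Dispatch a platform-agnostic action to a concrete backend."""
    if action.kind == "click" and action.target:
        backend.click(*action.target)
    elif action.kind == "type" and action.text is not None:
        backend.type_text(action.text)
    elif action.kind == "scroll" and action.target:
        backend.scroll(*action.target, action.scroll_dy)
    elif action.kind == "hotkey":
        backend.press(*action.keys)
    elif action.kind == "finished":
        return  # task complete; the agent stops emitting actions
```

Because every platform implements the same small backend interface, the same model outputs can drive desktop, mobile, and web environments without per-platform workflow rules.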