Generative Visual Code Mobile World Models

*Equal contribution    †Corresponding authors

1Trillion Labs    2KAIST AI

Demo videos showing gWorld predicting next GUI states from various mobile applications.

Abstract

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at both train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines built around numerous external models.

We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation.

We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework that automatically synthesizes code-based training data. In extensive evaluation across 4 in-distribution and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models up to 50.25× larger.

Key Results

Pareto Frontier

Average Instruction Accuracy (IAcc.) across all six benchmarks. gWorld 8B and 32B achieve a new Pareto frontier with respect to model size (log10 scale). Notably, even extremely large models (e.g., Llama 4 402B) do not reach this frontier.

Key Findings:

  • gWorld outperforms 8 frontier open-weight models up to 50.25× larger
  • The code-based approach virtually eliminates structural errors (<1% Render Fail)
  • +45.7% and +27.1% IAcc. gains over the Qwen3 VL 8B and 32B base models
  • Scaling training data yields predictable gains following a power law

Data Scaling Law

Data scaling laws for mobile world modeling at 8B. Performance follows a power law (R² ≥ 0.94), indicating predictable and non-saturating gains with more training data.
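Such a fit can be checked with an ordinary least-squares regression in log-log space. Below is a minimal sketch using hypothetical (training set size, IAcc.) pairs, since the underlying measurements are not listed on this page:

```python
import numpy as np

# Hypothetical (training set size, IAcc. %) pairs -- illustration only,
# not the paper's measured points.
n = np.array([1e3, 1e4, 1e5, 1e6])
iacc = np.array([30.0, 42.0, 58.0, 74.0])

# A power law IAcc(N) = a * N^b is linear in log-log space:
# log(IAcc) = log(a) + b * log(N).
b, log_a = np.polyfit(np.log(n), np.log(iacc), 1)

# Goodness of fit, analogous to the R² >= 0.94 reported above.
resid = np.log(iacc) - (log_a + b * np.log(n))
r2 = 1.0 - resid.var() / np.log(iacc).var()
print(f"IAcc(N) ≈ {np.exp(log_a):.2f} · N^{b:.3f}  (R² = {r2:.3f})")
```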

Method

Data generation pipeline

Schematic diagram of our data generation pipeline. We construct VLM world modeling data via three steps: (1) Repurposing offline policy trajectories into transition triplets; (2) Cross-modal relabeling of the ground-truth next state from pixels to renderable web code; and (3) Synthesizing reasoning traces using look-ahead access to the target state.

Three-Step Data Generation:

  1. Repurposing Policy Trajectories: Convert offline policy trajectories {(S_t, A_t)} into world modeling triplets {(S_t, A_t, S_{t+1})} (see the sketch after this list)
  2. Synthetic Cross-modal Re-labeling: Convert next-state supervision from pixels to renderable web code using a frontier VLM
  3. Reasoning Data with Look-ahead: Generate reasoning traces R_t with access to the ground-truth next state
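As a concrete illustration of step (1), here is a minimal sketch of the triplet construction; the Step type is a hypothetical stand-in, as the paper's actual trajectory schema is not specified on this page:

```python
from dataclasses import dataclass

# Hypothetical trajectory schema, for illustration only.
@dataclass
class Step:
    screenshot: bytes  # S_t: the GUI state as pixels
    action: str        # A_t: e.g., "tap(540, 1200)"

def to_triplets(trajectory: list[Step]) -> list[tuple[bytes, str, bytes]]:
    """Slide over consecutive steps to form (S_t, A_t, S_{t+1}) triplets."""
    return [
        (trajectory[t].screenshot, trajectory[t].action, trajectory[t + 1].screenshot)
        for t in range(len(trajectory) - 1)
    ]
```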

Qualitative Comparison

Our model generates renderable web code to ensure pixel-perfect text and structurally accurate layouts. In contrast, image-generation baselines frequently produce illegible text and distorted layouts.
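To make the code-as-state idea concrete, here is a minimal sketch of rendering a predicted next state to pixels with a headless browser. Playwright and the example HTML string are illustrative assumptions; the paper's actual rendering stack is not specified on this page.

```python
from playwright.sync_api import sync_playwright

# Placeholder for the VLM's predicted next-state web code.
predicted_html = "<html><body><h1>Inbox</h1><p>3 unread</p></body></html>"

with sync_playwright() as p:
    browser = p.chromium.launch()
    # Portrait, phone-like viewport so the render resembles a mobile screen.
    page = browser.new_page(viewport={"width": 412, "height": 915})
    page.set_content(predicted_html)        # load the generated code
    page.screenshot(path="next_state.png")  # S_{t+1} as pixels
    browser.close()
```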

MWMBENCH: Comprehensive Benchmark

We introduce Mobile World Model Bench (MWMBENCH), a comprehensive benchmark for evaluating world modeling in mobile GUI environments.

Visual World Modeling

Evaluation in the native visual modality, preserving rich GUI details

Real-world Action Space

Actions are expressed in screen-coordinate space, directly compatible with on-device execution (see the sketch after this section)

ID + OOD Evaluation

4 in-distribution + 2 out-of-distribution benchmarks

Datasets: AitW, GUIOdyssey, AndroidControl, AMEX (ID) | AndroidWorld, KApps (OOD)
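The real-world action space means a predicted action can be replayed on a device without translation. A hedged sketch, assuming a simple dict-style action format (the benchmark's actual schema may differ) and using standard ADB input commands:

```python
import subprocess

def execute(action: dict) -> None:
    """Replay a coordinate-space action on a connected Android device via ADB."""
    if action["kind"] == "tap":
        subprocess.run(
            ["adb", "shell", "input", "tap", str(action["x"]), str(action["y"])],
            check=True,
        )
    elif action["kind"] == "type":
        subprocess.run(["adb", "shell", "input", "text", action["text"]], check=True)

execute({"kind": "tap", "x": 540, "y": 1200})  # taps pixel (540, 1200)
```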

Main Results

| Model | Size | AitW (ID) | GUIO (ID) | AC (ID) | AMEX (ID) | AW (OOD) | KApps (OOD) | Avg IAcc. |
|---|---|---|---|---|---|---|---|---|
| Qwen-Image-Edit | 20B | 15.4 | 13.0 | 11.7 | 10.9 | 13.8 | 15.7 | 13.4 |
| Emu3.5 | 34B | 23.4 | 25.8 | 27.7 | 21.7 | 29.1 | 26.8 | 25.8 |
| Llama 4 | 402B-A17B | 47.2 | 55.8 | 58.6 | 58.3 | 54.3 | 59.9 | 55.7 |
| Qwen3 VL | 8B | 21.5 | 28.2 | 31.1 | 33.7 | 30.8 | 30.1 | 29.2 |
| Qwen3 VL | 32B | 46.8 | 52.0 | 53.2 | 56.9 | 53.4 | 52.5 | 52.5 |
| GLM-4.6V | 106B | 60.9 | 68.2 | 74.2 | 69.5 | 74.1 | 57.4 | 67.4 |
| gWorld (Ours) | 8B | 68.8 | 77.2 | 78.4 | 82.6 | 75.0 | 67.4 | 74.9 |
| gWorld (Ours) | 32B | 71.7 | 81.5 | 82.9 | 86.1 | 79.9 | 75.7 | 79.6 |

Main mobile world modeling results. gWorld 8B and 32B establish a new Pareto frontier, consistently outperforming significantly larger models.

BibTeX

@article{trillion2026gworld,
  title={Generative Visual Code Mobile World Models},
  author={Koh, Woosung and Han, Sungjun and Lee, Segyu and Yun, Se-young and Shin, Jamin},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}