Trillion Labs

Generative Visual Code Mobile World Models

Woosung Koh*1,2 Sungjun Han*1 Segyu Lee1,2 Se-Young Yun†2 Jamin Shin†1

*Equal contribution    †Corresponding authors

1Trillion Labs    2KAIST AI


Demo videos showing gWorld predicting next GUI states from mobile applications.

Abstract

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines built around numerous external models.

We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation.
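To make the paradigm concrete, the loop below sketches one world-model step under this code-generation view. The function names, the action schema, and the HTML stub are illustrative assumptions, not the paper's implementation; a real system would prompt the fine-tuned VLM and rasterize its output, e.g. with a headless browser.

```python
# Minimal sketch of one code-as-world-model step. The "VLM" and the
# renderer are stand-ins (assumptions), not gWorld's implementation.

def predict_next_state_code(screenshot_png: bytes, action: dict) -> str:
    """Stand-in for the VLM: given the current GUI screenshot and an
    action, return the predicted next GUI state as renderable web code."""
    x, y = action["coordinates"]
    # A real system would prompt the fine-tuned VLM here; we return a
    # fixed HTML stub so the control flow is runnable.
    return f"<html><body><p>state after click at ({x}, {y})</p></body></html>"

def render_to_pixels(html: str) -> bytes:
    """Stand-in renderer: a real pipeline would rasterize the HTML into
    an image of the next GUI state (e.g. with a headless browser)."""
    return html.encode("utf-8")  # placeholder for rendered pixels

# One step: pixels + action in, web code out, pixels re-rendered.
action = {"type": "click", "coordinates": (802, 394)}
code = predict_next_state_code(b"<current screenshot bytes>", action)
pixels = render_to_pixels(code)
```

Because the model emits markup rather than raw pixels, any text in the predicted state is typeset exactly by the rendering engine, which is the property the code-based approach exploits.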

We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework that automatically synthesizes code-based training data. In extensive evaluation across 4 in-distribution and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models up to 50.25× larger.

Key Results

Pareto Frontier

Average Instruction Accuracy (IAcc.) across all six benchmarks. gWorld 8B and 32B achieve a new Pareto frontier in terms of model size (log10 scaled). Notably, extremely large models (e.g., Llama 4 402B) do not reach this Pareto frontier.

Key Findings:

  • gWorld outperforms 8 frontier open-weight models up to 50.25× larger
  • Code-based approach virtually eliminates structural errors (<1% Render Fail)
  • +45.7% and +27.1% gain in IAcc. over base models Qwen3 VL 8B, 32B
  • Scaling training data yields predictable gains following a power law

Data Scaling Law

Data scaling laws for mobile world modeling at 8B. Performance follows a power law (R² ≥ 0.94), indicating predictable and non-saturating gains with more training data.
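A power-law claim of this kind can be checked with an ordinary linear fit in log-log space. The sketch below uses synthetic points (not the paper's measurements) generated from y = 2·n^0.3, so the recovered exponent and R² are exact by construction.

```python
import math

# Fit y = a * n**b by linear regression in log-log space and report R^2.
# The data below is synthetic, generated from y = 2 * n**0.3; it is NOT
# the paper's measured scaling data.
def fit_power_law(ns, ys):
    xs = [math.log(n) for n in ns]
    zs = [math.log(y) for y in ys]
    k = len(xs)
    mx, mz = sum(xs) / k, sum(zs) / k
    b = (sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
         / sum((x - mx) ** 2 for x in xs))
    log_a = mz - b * mx
    pred = [log_a + b * x for x in xs]          # fitted log-values
    ss_res = sum((z - p) ** 2 for z, p in zip(zs, pred))
    ss_tot = sum((z - mz) ** 2 for z in zs)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot

ns = [1e3, 1e4, 1e5, 1e6]                       # training-set sizes
ys = [2 * n ** 0.3 for n in ns]                 # synthetic performance
a, b, r2 = fit_power_law(ns, ys)                # recovers a=2, b=0.3
```

On real measurements the fit is of course inexact; the paper's reported R² ≥ 0.94 corresponds to this log-log goodness-of-fit.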

Method

Data generation pipeline

Schematic diagram of our data generation pipeline. We construct VLM world modeling data via three steps: (1) Repurposing offline policy trajectories into transition triplets; (2) Cross-modal relabeling of the ground-truth next state from pixels to renderable web code; and (3) Synthesizing reasoning traces using look-ahead access to the target state.

Three-Step Data Generation:

  1. Repurposing Policy Trajectories: Convert offline policy trajectories {S_t, A_t} into world modeling data {S_t, A_t, S_{t+1}}
  2. Synthetic Cross-modal Re-labeling: Convert next-state supervision from pixels to renderable web code using a frontier VLM
  3. Reasoning Data with Look-ahead: Generate reasoning traces R_t with access to the ground-truth next state
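Step (1) above is mechanical and can be sketched directly. The (state, action) pair layout and the dict field names are illustrative assumptions, not the paper's data schema; steps (2) and (3) additionally require frontier-VLM calls and are omitted.

```python
# Sketch of step (1): repurposing an offline policy trajectory of
# (state, action) pairs into world-modeling triplets (S_t, A_t, S_{t+1}).
# The tuple/dict layout is an illustrative assumption, not the paper's schema.

def to_transition_triplets(trajectory):
    """trajectory: list of (state, action) pairs in execution order."""
    triplets = []
    for t in range(len(trajectory) - 1):
        state_t, action_t = trajectory[t]
        next_state, _ = trajectory[t + 1]       # S_{t+1} from the next step
        triplets.append({"state": state_t,
                         "action": action_t,
                         "next_state": next_state})
    return triplets

# A 3-step trajectory yields 2 supervised transitions.
traj = [("s0", "tap"), ("s1", "scroll"), ("s2", "type")]
triplets = to_transition_triplets(traj)
```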

Qualitative Comparison

Qualitative Examples 1–5

Action: click at coordinates (802, 394).

Qualitative Example 6

Action: click at coordinates (913, 143).

Qualitative Example 7 (AndroidWorld)

Action: click at coordinates (756, 685).

Qualitative Example 8 (AndroidWorld)

Action: click at coordinates (819, 549).

Qualitative Example 9

Action: TAP at coordinates (857, 421).

MWMBench: Comprehensive Benchmark

We introduce Mobile World Model Bench (MWMBench), a comprehensive benchmark for evaluating world modeling in mobile GUI environments.

Visual World Modeling

Evaluation in the native visual modality, preserving rich GUI detail

Real-world Action Space

Actions in coordinate space, directly compatible with mobile execution

ID + OOD Evaluation

4 in-distribution + 2 out-of-distribution benchmarks

Datasets: AitW, GUIOdyssey, AndroidControl, AMEX (ID) | AndroidWorld, KApps (OOD)

Main Results

Instruction Accuracy (IAcc., %, ↑)

(ID: AitW, GUIO, AC, AMEX · OOD: AW, KApps)

Model         | Size      | AitW | GUIO | AC   | AMEX | AW   | KApps | Avg
--------------|-----------|------|------|------|------|------|-------|-----
Qwen-I-E      | 20B       | 15.4 | 13.0 | 11.7 | 10.9 | 13.8 | 15.7  | 13.4
Emu3.5        | 34B       | 23.4 | 25.8 | 27.7 | 21.7 | 29.1 | 26.8  | 25.8
Llama 4       | 109B-A17B | 47.6 | 53.1 | 50.7 | 49.0 | 51.0 | 48.4  | 50.0
Llama 4       | 402B-A17B | 47.2 | 55.8 | 58.6 | 58.3 | 54.3 | 59.9  | 55.7
Qwen3 VL      | 8B        | 21.5 | 28.2 | 31.1 | 33.7 | 30.8 | 30.1  | 29.2
Qwen3 VL      | 32B       | 46.8 | 52.0 | 53.2 | 56.9 | 53.4 | 52.5  | 52.5
Qwen3 VL      | 235B-A22B | 36.1 | 54.7 | 51.9 | 51.2 | 51.1 | 64.2  | 51.5
GLM-4.6V      | 106B      | 60.9 | 68.2 | 74.2 | 69.5 | 74.1 | 57.4  | 67.4
gWorld (Ours) | 8B        | 68.8 | 77.2 | 78.4 | 82.6 | 75.0 | 67.4  | 74.9
gWorld (Ours) | 32B       | 71.7 | 81.5 | 82.9 | 86.1 | 79.9 | 75.7  | 79.6

Render Fail (%, ↓)

(ID: AitW, GUIO, AC, AMEX · OOD: AW, KApps)

Model         | Size      | AitW | GUIO | AC   | AMEX | AW   | KApps | Avg
--------------|-----------|------|------|------|------|------|-------|-----
Qwen-I-E      | 20B       | –    | –    | –    | –    | –    | –     | –
Emu3.5        | 34B       | –    | –    | –    | –    | –    | –     | –
Llama 4       | 109B-A17B | 4.4  | 1.2  | 1.0  | 0.6  | 2.9  | 1.8   | 2.0
Llama 4       | 402B-A17B | 9.4  | 7.8  | 8.6  | 12.6 | 14.4 | 2.2   | 9.2
Qwen3 VL      | 8B        | 33.8 | 51.4 | 42.8 | 31.6 | 42.3 | 38.8  | 40.1
Qwen3 VL      | 32B       | 11.6 | 16.0 | 13.4 | 3.8  | 13.1 | 8.1   | 11.0
Qwen3 VL      | 235B-A22B | 40.0 | 27.2 | 34.2 | 30.0 | 30.0 | 15.4  | 29.5
GLM-4.6V      | 106B      | 2.4  | 3.8  | 1.4  | 1.2  | 1.9  | 4.4   | 2.5
gWorld (Ours) | 8B        | 0.8  | 1.2  | 2.6  | 0.8  | 2.3  | 0.8   | 1.4
gWorld (Ours) | 32B       | 0.6  | 0.8  | 0.8  | 0.4  | 0.4  | 0.6   | 0.6

Similarity (%, ↑)

(ID: AitW, GUIO, AC, AMEX · OOD: AW, KApps)

Model         | Size      | AitW | GUIO | AC   | AMEX | AW   | KApps | Avg
--------------|-----------|------|------|------|------|------|-------|-----
Qwen-I-E      | 20B       | 60.1 | 63.8 | 63.8 | 64.4 | 67.9 | 71.0  | 65.2
Emu3.5        | 34B       | 68.7 | 68.8 | 68.6 | 71.6 | 74.2 | 71.2  | 70.5
Llama 4       | 109B-A17B | 57.9 | 62.3 | 61.4 | 66.9 | 61.7 | 57.3  | 61.2
Llama 4       | 402B-A17B | 58.9 | 64.0 | 63.1 | 68.1 | 61.8 | 58.6  | 62.4
Qwen3 VL      | 8B        | 49.9 | 48.3 | 53.4 | 59.2 | 49.9 | 50.5  | 51.8
Qwen3 VL      | 32B       | 59.0 | 62.7 | 64.1 | 70.0 | 61.2 | 62.8  | 63.3
Qwen3 VL      | 235B-A22B | 62.9 | 69.7 | 68.8 | 71.7 | 65.2 | 67.3  | 67.6
GLM-4.6V      | 106B      | 64.7 | 72.5 | 71.4 | 73.2 | 72.2 | 63.7  | 69.6
gWorld (Ours) | 8B        | 66.3 | 73.3 | 72.8 | 74.3 | 69.2 | 66.1  | 70.3
gWorld (Ours) | 32B       | 67.3 | 73.7 | 74.2 | 75.4 | 71.6 | 66.2  | 71.4

Main mobile world modeling results across three metrics. gWorld 8B and 32B establish a new Pareto frontier in IAcc., achieving near-zero render failures while maintaining competitive visual similarity.

BibTeX

@misc{koh2026generativevisualcodemobile,
      title={Generative Visual Code Mobile World Models},
      author={Woosung Koh and Sungjun Han and Segyu Lee and Se-Young Yun and Jamin Shin},
      year={2026},
      eprint={2602.01576},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.01576},
}