Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs struggle with precise text rendering, forcing them to rely on slow, complex pipelines that depend on numerous external models.
We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This paradigm combines the strengths of both approaches: the VLM retains its linguistic priors for precise text rendering, while its pre-training on structured web code enables high-fidelity visual generation.
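To make the paradigm concrete, the following is a minimal illustrative sketch, not the paper's implementation: a hypothetical `query_vlm` helper stands in for the model's prediction step, and Playwright is used as one possible way to render the predicted web code into pixels.

```python
# Illustrative sketch of "visual world modeling via renderable code generation":
# the model predicts the next GUI state as HTML/CSS, and a browser renders it
# to pixels. `query_vlm` is a hypothetical stand-in for an actual VLM call.
from playwright.sync_api import sync_playwright


def query_vlm(prompt: str, image_path: str) -> str:
    """Placeholder for a real VLM inference call; returns fixed HTML here."""
    return "<html><body><h1>Settings</h1><p>Wi-Fi: On</p></body></html>"


def predict_next_state(current_screenshot: str, action: str) -> str:
    """Ask the VLM to predict the next GUI state as a self-contained HTML document."""
    prompt = (
        "Given this mobile GUI screenshot and the user action "
        f"'{action}', output the next screen as a single HTML document."
    )
    return query_vlm(prompt, image_path=current_screenshot)


def render_to_pixels(html: str, out_path: str = "next_state.png") -> str:
    """Render the predicted web code into an image of the next GUI state."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 393, "height": 852})  # phone-like viewport
        page.set_content(html)
        page.screenshot(path=out_path, full_page=True)
        browser.close()
    return out_path


if __name__ == "__main__":
    html = predict_next_state("current_screen.png", "tap the Settings icon")
    print(render_to_pixels(html))
```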
We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework that automatically synthesizes code-based training data. In extensive evaluations across 4 in-distribution and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models that are over 50.25× larger.