
DiT360 is a framework for high-quality panoramic image generation, leveraging both perspective and panoramic data in a hybrid training scheme. It adopts a two-level strategy—image-level cross-domain guidance and token-level hybrid supervision—to enhance perceptual realism and geometric fidelity.

Clone the repo first:
git clone https://github.com/Insta360-Research-Team/DiT360.git
cd DiT360
(Optional) Create a fresh conda env:
conda create -n dit360 python=3.12
conda activate dit360
Install the necessary packages (torch >= 2):
# pytorch (select correct CUDA version, we test our code on torch==2.6.0 and torchvision==0.21.0)
pip install torch==2.6.0 torchvision==0.21.0
# other dependencies
pip install -r requirements.txt
We have uploaded the dataset to Hugging Face. For more details, please visit Insta360-Research/Matterport3D_polished.
For a quick start, you can try:
from datasets import load_dataset
ds = load_dataset("Insta360-Research/Matterport3D_polished")
# check the data
print(ds["train"][0])
If you encounter any issues, please refer to the official Hugging Face documentation.
For quick use, you can simply run:
python inference.py
⚠️ Note: The model was trained only on images with a resolution of 1024 × 2048, so other input sizes may produce abnormal results. In addition, without any optimization the inference process requires approximately 37 GB of GPU memory, so please be aware of the memory usage.
We provide a training pipeline based on Insta360-Research/Matterport3D_polished, along with corresponding launch scripts. You can start training with a single command:
bash train.sh
After training is completed, you will find a checkpoint file saved under the output directory, typically like:
model_saved/lightning_logs/version_x/checkpoints/vsclip_epoch=xxx.ckpt/checkpoint/mp_rank_00_model_states.pt
You can extract the LoRA weights from the full .pt checkpoint by running:
python get_lora_weights.py <path_to_your_pt_file> <output_dir>
If you don’t specify output_dir, the extracted weights will be saved by default to:
lora_output/
After that, you can directly use your trained LoRA in the inference script.
Simply replace the default model path "fenghora/DiT360-Panorama-Image-Generation" in inference.py with your output directory (e.g., "lora_output"), and then run:
python inference.py
Mix training aims to leverage both panoramic images and perspective images to improve the model’s generalization across different viewpoints.
You need to prepare two .jsonl files, one for the panoramic images and one for the perspective images.
Each line in a .jsonl file should represent a single training sample with the following format:
{"image": "path/to/image.jpg", "caption": "a description of the scene", "mask": "path/to/mask.png"}
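As a minimal sketch of this format, the snippet below writes a few samples to a .jsonl file and reads them back (the file paths and `write_jsonl` helper are hypothetical, for illustration only):

```python
import json
import os
import tempfile

# Hypothetical helper: write training samples as one JSON object per line,
# matching the {"image", "caption", "mask"} format described above.
def write_jsonl(samples, path):
    with open(path, "w") as f:
        for s in samples:
            f.write(json.dumps(s) + "\n")

samples = [
    {"image": "pano/0001.jpg", "caption": "an indoor panorama", "mask": "masks/0001.png"},
    {"image": "persp/0002.jpg", "caption": "a perspective crop", "mask": "masks/0002.png"},
]
path = os.path.join(tempfile.gettempdir(), "pano_train.jsonl")
write_jsonl(samples, path)

# Read back and verify that every line parses to the expected keys.
with open(path) as f:
    for line in f:
        sample = json.loads(line)
        assert set(sample) == {"image", "caption", "mask"}
```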
The mask is a PNG (or similar) image used to specify which regions should be supervised during training:
White pixels (255, 255, 255) indicate areas that are supervised; black pixels (0, 0, 0) indicate areas that are ignored.
Specifically, for panoramic images the mask is typically an all-white image (meaning the entire image is supervised), while for perspective images the mask corresponds to the valid projected area derived from the panoramic-to-perspective mapping. The perspective images and their corresponding masks can be generated from panoramic images using an equirectangular-to-perspective projection.
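The two mask types above can be sketched as follows. This is a minimal illustration assuming single-channel masks; the rectangular region is a hypothetical placeholder for the real valid area produced by the equirectangular-to-perspective projection:

```python
import numpy as np

# Panoramic sample: an all-white mask, so every pixel is supervised.
H, W = 1024, 2048
pano_mask = np.full((H, W), 255, dtype=np.uint8)

# Perspective sample: start fully ignored (black), then mark the valid
# projected region white. The fixed rectangle below is only a placeholder
# for the actual projection footprint.
persp_mask = np.zeros((512, 512), dtype=np.uint8)
persp_mask[64:448, 64:448] = 255

# Write each array out as a PNG (e.g. with Pillow) and reference it in the
# "mask" field of the corresponding .jsonl entry.
```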
We highly recommend using the excellent open-source library below for this purpose:
This library provides high-quality conversions between panoramic and perspective views, making it easy to generate consistent training data for mixed-view learning.
To start training, please refer to the provided scripts:
train_mix_staged_lora_dynamic.sh and train_mix_staged_lora_dynamic.py.
We treat both inpainting and outpainting as image completion tasks, where the key lies in how the mask is defined. A simple example is already provided in our codebase.
For a quick start, you can simply run:
python editing.py
In our implementation, regions with a mask value of 1 correspond to the parts preserved from the source image. Therefore, in our example, you can invert the mask as follows for inpainting:
mask = 1 - mask # for inpainting
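As a minimal illustration of this convention (assuming a binary float mask), inverting the mask swaps the preserved and generated regions:

```python
import numpy as np

# Binary mask: 1 = preserved from the source image, 0 = regions to generate.
mask = np.zeros((8, 8), dtype=np.float32)
mask[2:6, 2:6] = 1.0           # outpainting: keep the centre, generate the border

inpaint_mask = 1 - mask        # inpainting: keep the border, generate the centre
```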
This part is built upon Personalize Anything.
We appreciate the open-source contributions of the following projects:
@misc{dit360,
  title={DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training},
  author={Haoran Feng and Dizhe Zhang and Xiangtai Li and Bo Du and Lu Qi},
  year={2025},
  eprint={2510.11712},
  archivePrefix={arXiv},
}
If you find our dataset useful, please include a citation for Matterport3D:
@article{Matterport3D,
  title={Matterport3D: Learning from RGB-D Data in Indoor Environments},
  author={Chang, Angel and Dai, Angela and Funkhouser, Thomas and Halber, Maciej and Niessner, Matthias and Savva, Manolis and Song, Shuran and Zeng, Andy and Zhang, Yinda},
  journal={International Conference on 3D Vision (3DV)},
  year={2017}
}
If you find our inpainting & outpainting useful, please include a citation for Personalize Anything:
@article{feng2025personalize,
  title={Personalize Anything for Free with Diffusion Transformer},
  author={Feng, Haoran and Huang, Zehuan and Li, Lin and Lv, Hairong and Sheng, Lu},
  journal={arXiv preprint arXiv:2503.12590},
  year={2025}
}