A Unified Image-Dense Annotation Generation Model for Underwater Scenes

Huazhong University of Science & Technology

Abstract

Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for comprehensively understanding underwater scenes. Nevertheless, high-quality, large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant cost of data collection. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify text-to-image and text-to-dense-annotation generation within a single model. We introduce an Implicit Layout Sharing mechanism (ILS) and a cross-modal interaction method, Time Adaptive Normalization (TAN), to jointly optimize the consistency between images and dense annotations. We synthesize a large underwater dataset with TIDE to validate the effectiveness of our method on underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of densely annotated underwater data. Our method may also offer new perspectives on alleviating data scarcity in other fields.

Overview


In this work, to address the scarcity of large-scale, high-quality underwater datasets with dense annotations, we explore a novel framework that simultaneously generates images and multiple dense annotations solely from text conditions. By leveraging the implicit layout information and cross-modal features inherent in text-to-image models, we develop an interaction mechanism that aligns images with dense annotations. Experimental results show that this interaction mechanism achieves consistency comparable to, or even better than, that of controllable generation methods requiring strong guidance conditions (such as depth maps).
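To make the cross-modal interaction concrete, below is a minimal, hypothetical sketch of a Time Adaptive Normalization step in plain NumPy. The function name, the choice of instance normalization, the way scale/shift statistics are read from the annotation branch, and the linear timestep gate are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def time_adaptive_norm(img_feat, ann_feat, t, T=1000, eps=1e-5):
    """Hypothetical sketch of time-adaptive normalization (not the paper's code).

    img_feat: (C, H, W) image-branch features to be modulated.
    ann_feat: (C, H, W) annotation-branch features supplying the condition.
    t:        current diffusion timestep (T = fully noisy, 0 = clean).
    """
    # Instance-normalize the image features over spatial dimensions.
    mu = img_feat.mean(axis=(1, 2), keepdims=True)
    sigma = img_feat.std(axis=(1, 2), keepdims=True)
    normed = (img_feat - mu) / (sigma + eps)

    # Derive a per-channel scale and shift from the annotation branch
    # (a learned projection in practice; simple statistics here).
    gamma = ann_feat.mean(axis=(1, 2), keepdims=True)
    beta = ann_feat.std(axis=(1, 2), keepdims=True)

    # Gate the cross-modal modulation by the timestep: at very noisy steps
    # (t close to T) the annotation branch contributes little, and its
    # influence grows as denoising proceeds.
    alpha = 1.0 - t / T
    return (1.0 + alpha * gamma) * normed + alpha * beta
```

The timestep gate is the "time adaptive" part of this sketch: the same normalization layer behaves differently across the denoising trajectory, which is one plausible way to keep the two branches aligned as the image forms.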

Pipeline


Experimental Results

Depth Estimation


Semantic Segmentation


BibTeX


@inproceedings{lin2025tide,
  title={A Unified Image-Dense Annotation Generation Model for Underwater Scenes},
  author={Lin, Hongkai and Liang, Dingkang and Qi, Zhenghao and Bai, Xiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}