• A Guide to the Core Journals of China (Peking University)
  • China Science and Technology Core Journals
  • Chinese Science Citation Database (CSCD)
  • Chinese Science and Technology Papers and Citations Database (CSTPCD)
  • Chinese Science Abstracts Database (CSAD)
  • China Academic Journals, Web edition (CNKI)
  • Chinese Scientific and Technical Journals Database (VIP)
  • Wanfang Data Knowledge Service Platform
  • Chaoxing Journal Domain Publishing Platform
  • National Science and Technology Academic Journals Open Platform
  • Elsevier Abstract and Citation Database (Scopus)
  • Japan Science and Technology Agency Database (JST)

SDNet: A self-supervised bird recognition method based on large language models and diffusion models for improving long-term bird monitoring

  • Abstract: The collection and annotation of large-scale bird datasets are resource-intensive and time-consuming, which significantly limits the scalability and accuracy of biodiversity monitoring systems. While self-supervised learning (SSL) has emerged as a promising way to leverage unannotated data, current SSL methods face two critical challenges in bird species recognition: (1) long-tailed data distributions, which lead to poor performance on underrepresented species; and (2) domain shift introduced by the data augmentation strategies used to mitigate class imbalance. Here we present SDNet, a novel SSL-based bird recognition framework that integrates diffusion models with large language models (LLMs) to overcome these limitations. SDNet prompts LLMs with species taxonomy, morphological attributes, and habitat information to generate semantically rich textual descriptions of tail-class species; these natural language priors capture fine-grained visual characteristics such as plumage patterns, body proportions, and distinctive markings. A conditional diffusion model then synthesizes new bird image samples from these descriptions, using cross-attention to fuse the textual embeddings with intermediate visual feature representations during the denoising process, so that the generated images preserve species-specific morphological details while remaining photorealistic. Additionally, SDNet adopts a Swin Transformer as its feature extraction backbone: its hierarchical window-based attention and shifted windowing scheme enable multi-scale local feature extraction, which proves particularly effective at capturing fine-grained discriminative patterns (such as beak shape and feather texture), while its consistent feature representations across synthetic and original images mitigate domain shift between the two data sources.
SDNet is validated on both a self-constructed dataset (Bird_BXS) and a publicly available benchmark (Birds_25), and demonstrates substantial improvements over conventional SSL approaches. Our results indicate that the synergistic integration of LLMs, diffusion models, and the Swin Transformer architecture contributes significantly to recognition accuracy, particularly for rare and morphologically similar species. These findings highlight SDNet's potential to address fundamental limitations of existing SSL methods in avian recognition tasks and to establish a new paradigm for efficient self-supervised learning in large-scale ornithological vision applications.
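The cross-attention fusion of textual embeddings with visual features during denoising can be sketched as follows. This is a minimal, self-contained illustration of the general mechanism, not the paper's implementation: the shapes, the projection matrices `Wq`/`Wk`/`Wv`, and the residual connection are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_feats, text_emb, Wq, Wk, Wv):
    """Fuse LLM text embeddings into intermediate visual features.

    img_feats: (N, d)   flattened spatial features inside the denoiser
    text_emb:  (T, d_t) token embeddings of the generated description
    Wq, Wk, Wv: hypothetical learned projections (here, fixed matrices)
    """
    Q = img_feats @ Wq                    # queries come from image features
    K = text_emb @ Wk                     # keys come from text tokens
    V = text_emb @ Wv                     # values come from text tokens
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (N, T) attention weights
    return img_feats + attn @ V           # residual fusion of text into image

# Toy usage with random features (dimensions chosen only for illustration)
rng = np.random.default_rng(0)
img = rng.standard_normal((16, 8))        # 16 spatial positions, dim 8
txt = rng.standard_normal((5, 6))         # 5 text tokens, dim 6
Wq = rng.standard_normal((8, 8))
Wk = rng.standard_normal((6, 8))
Wv = rng.standard_normal((6, 8))
fused = cross_attention(img, txt, Wq, Wk, Wv)  # same shape as img: (16, 8)
```

Because the queries come from the image and the keys/values from the text, each spatial position attends over the description's tokens, which is how the conditioning steers the denoiser toward the described morphology.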

     
