1. Zhou W, Lin K, Zheng Z, Chen D, Su T, Hu H. DRTN: Dual Relation Transformer Network with feature erasure and contrastive learning for multi-label image classification. Neural Netw 2025; 187:107309. [PMID: 40048756 DOI: 10.1016/j.neunet.2025.107309] [Received: 09/01/2024] [Revised: 12/14/2024] [Accepted: 02/21/2025] [Indexed: 04/29/2025]
Abstract
The objective of the multi-label image classification (MLIC) task is to simultaneously identify multiple objects present in an image. Several researchers directly flatten 2D feature maps into 1D grid feature sequences and utilize a Transformer encoder to capture the correlations of grid features to learn object relationships. Although these Transformer-based methods obtain promising results, they lose spatial information. In addition, current attention-based models often focus only on salient feature regions, but ignore other potentially useful features that contribute to the MLIC task. To tackle these problems, we present a novel Dual Relation Transformer Network (DRTN) for the MLIC task, which can be trained in an end-to-end manner. Concretely, to compensate for the loss of spatial information of grid features resulting from the flattening operation, we adopt a grid aggregation scheme to generate pseudo-region features, which does not require additional expensive annotations to train an object detector. Then, a new dual relation enhancement (DRE) module is proposed to capture correlations between objects using two different visual features, thereby complementing the advantages provided by both grid and pseudo-region features. After that, we design a new feature enhancement and erasure (FEE) module to learn discriminative features and mine additional potentially valuable features. By using an attention mechanism to discover the most salient feature regions and removing them with a region-level erasure strategy, our FEE module is able to mine other potentially useful features from the remaining parts. Further, we devise a novel contrastive learning (CL) module to encourage the foregrounds of salient and potential features to be closer, while pushing their foregrounds further away from background features. This compels our model to learn discriminative and valuable features more comprehensively. Extensive experiments demonstrate that the DRTN method surpasses current MLIC models on three challenging benchmarks, i.e., the MS-COCO 2014, PASCAL VOC 2007, and NUS-WIDE datasets.
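The region-level erasure and the pull/push contrastive objective described above can be illustrated with a minimal PyTorch-style sketch. All tensor shapes, function names, and the InfoNCE-style loss are assumptions made for illustration; this is not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def erase_salient_regions(grid_feats, attn, erase_ratio=0.1):
        # grid_feats: (B, N, D) flattened grid features; attn: (B, N) attention scores.
        # Zero out the most salient grid cells so a second pass must rely on the
        # remaining, potentially useful regions.
        B, N, _ = grid_feats.shape
        k = max(1, int(N * erase_ratio))
        topk = attn.topk(k, dim=1).indices                 # indices of the most salient cells
        mask = torch.ones(B, N, device=grid_feats.device)
        mask.scatter_(1, topk, 0.0)                        # region-level erasure
        return grid_feats * mask.unsqueeze(-1)

    def fg_bg_contrastive(fg_salient, fg_potential, bg, tau=0.07):
        # Pull the two foreground views together; push both away from the background.
        z1, z2, zb = [F.normalize(x, dim=-1) for x in (fg_salient, fg_potential, bg)]
        pos = (z1 * z2).sum(-1) / tau
        neg1 = (z1 * zb).sum(-1) / tau
        neg2 = (z2 * zb).sum(-1) / tau
        logits = torch.stack([pos, neg1, neg2], dim=1)     # the positive pair sits at index 0
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, labels)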
Affiliation(s)
- Wei Zhou
- School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
- Kang Lin
- School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
- Zhijie Zheng
- School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
- Dihu Chen
- School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
- Tao Su
- School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
- Haifeng Hu
- School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, Guangdong, China
2. Li H, Lo JTY. A review on the use of top-view surveillance videos for pedestrian detection, tracking and behavior recognition across public spaces. Accid Anal Prev 2025; 215:107986. [PMID: 40081266 DOI: 10.1016/j.aap.2025.107986] [Received: 09/15/2024] [Revised: 01/03/2025] [Accepted: 02/05/2025] [Indexed: 03/15/2025]
Abstract
Top-view surveillance cameras have been considered as a way to maintain an unobstructed view while protecting privacy in public buildings such as stations and transport hubs. This study aims to provide a comprehensive review on recent developments and challenges related to the use of top-view surveillance videos in public places. The techniques using top-view images in pedestrian detection, tracking and behavior recognition are reviewed, specifically focusing on their influence on crowd control and safety management. The setup of top-view cameras and the characteristics of several available datasets are introduced. The methodologies, fields of view, extracted features, regions of interest, color spaces, and datasets used in the key literature are consolidated. This study contributes by identifying key advantages of top-view cameras, such as their ability to reduce occlusions and preserve privacy, while also addressing limitations, including restricted field of view and the challenges of adapting algorithms to this unique perspective. We highlight knowledge gaps in leveraging top-view cameras for transport hubs, such as the need for advanced algorithms and the lack of standardized datasets for dynamic crowd scenarios. Through this review, we aim to provide actionable insights for improving crowd management and safety measures in public buildings, especially transport hubs.
Affiliation(s)
- Hongliu Li
- Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, PR China
- Jacqueline Tsz Yin Lo
- Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, PR China
3. Gao G, Lv Z, Zhang Y, Qin AK. Advertising or adversarial? AdvSign: Artistic advertising sign camouflage for target physical attacking to object detector. Neural Netw 2025; 186:107271. [PMID: 40010291 DOI: 10.1016/j.neunet.2025.107271] [Received: 07/20/2024] [Revised: 01/07/2025] [Accepted: 02/11/2025] [Indexed: 02/28/2025]
Abstract
Deep learning models are often vulnerable to adversarial attacks in both digital and physical environments. Particularly challenging are physical attacks that involve subtle, unobtrusive modifications to objects, such as patch-sticking or light-shooting, designed to maliciously alter the model's output when the scene is captured and fed into the model. Developing physical adversarial attacks that are robust, flexible, inconspicuous, and difficult to trace remains a significant challenge. To address this issue, we propose an artistic camouflage named Adversarial Advertising Sign (AdvSign) for the object detection task, especially in autonomous driving scenarios. Artistic patterns, such as brand logos and advertisement signs, generally have a high tolerance for visual incongruity and are widespread yet unobtrusive. We design these patterns into advertising signs that can be attached to various mobile carriers, such as carry-bags and vehicle stickers, to create adversarial camouflage with strong untraceability. This method is particularly effective at misleading self-driving cars, for instance, causing them to misidentify these signs as 'stop' signs. Our approach combines a trainable adversarial patch with various signs of artistic patterns to create advertising patches. By leveraging the diversity and flexibility of these patterns, we draw attention away from the conspicuous adversarial elements, enhancing the effectiveness and subtlety of our attacks. We then use the CARLA autonomous-driving simulator to place these synthesized patches onto 3D flat surfaces in different traffic scenes, rendering 2D composite scene images from various perspectives. These varied scene images are then input into the target detector for adversarial training, resulting in the final trained adversarial patch. In particular, we introduce a novel loss with artistic pattern constraints, designed to differentially adjust pixels within and outside the advertising sign during training. Extensive experiments in both simulated (composite scene images with AdvSign) and real-world (printed AdvSign images) environments demonstrate the effectiveness of AdvSign in executing physical attacks on state-of-the-art object detectors, such as YOLOv5. Our training strategy, leveraging diverse scene images and varied artistic transformations to adversarial patches, enables seamless integration with multiple patterns. This enhances attack effectiveness across various physical settings and allows easy adaptation to new environments and artistic patterns.
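One training iteration of the patch optimization described above might look roughly like the sketch below. The detector interface, the mask conventions, and the loss weights are assumptions; the CARLA rendering pipeline and the paper's exact pattern-constrained loss are not reproduced here.

    import torch

    def advsign_step(patch, art_pattern, logo_mask, scenes, sign_masks, detector, opt,
                     lam_logo=10.0, lam_free=0.1):
        # patch, art_pattern: (3, H, W); logo_mask: (1, H, W) marking the artistic logo region;
        # scenes: (B, 3, H, W); sign_masks: (B, 1, H, W) with 1 where the sign appears in each view.
        composited = scenes * (1 - sign_masks) + patch.unsqueeze(0) * sign_masks
        scores = detector(composited)          # assumed to return per-image target-class confidences, (B,)
        attack_loss = scores.mean()            # push the detector toward the attacker's target output
        dist = (patch - art_pattern).abs()
        # constrain pixels differently inside and outside the artistic region of the sign
        pattern_loss = lam_logo * (dist * logo_mask).mean() + lam_free * (dist * (1 - logo_mask)).mean()
        loss = attack_loss + pattern_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        return float(loss)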
Affiliation(s)
- Guangyu Gao
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Zhuocheng Lv
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Yan Zhang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- A K Qin
- Department of Computing Technologies, Swinburne University of Technology, Hawthorn, VIC 3122, Australia
4. Xu X, Wang C, Yi Q, Ye J, Kong X, Ashraf SQ, Dearn KD, Hajiyavand AM. MedBin: A lightweight End-to-End model-based method for medical waste management. Waste Manag 2025; 200:114742. [PMID: 40088805 DOI: 10.1016/j.wasman.2025.114742] [Received: 11/05/2024] [Revised: 03/04/2025] [Accepted: 03/07/2025] [Indexed: 03/17/2025]
Abstract
The surge in medical waste has highlighted the urgent need for cost-effective and advanced management solutions. In this paper, a novel medical waste management approach, "MedBin," is proposed for automated sorting, reusing, and recycling. A comprehensive medical waste dataset, "MedBin-Dataset," is established, comprising 2,119 original images spanning 36 categories, with samples captured in various backgrounds. The lightweight "MedBin-Net" model is introduced to enable detection and instance segmentation of medical waste, enhancing waste recognition capabilities. Experimental results demonstrate the effectiveness of the proposed approach, achieving an average precision of 0.91, recall of 0.97, and F1-score of 0.94 across all categories with just 2.51 M parameters (2.51 million parameters), 5.20 G FLOPs (5.20 billion floating-point operations), and 0.60 ms inference time. Additionally, the proposed method includes a World Health Organization (WHO) Guideline-Based Classifier that categorizes detected waste into 5 types, each with a corresponding disposal method, following WHO medical waste classification standards. The proposed method, along with the dedicated dataset, offers a promising solution that supports sustainable medical waste management and other related applications. To access the MedBin-Dataset samples, please visit https://universe.roboflow.com/uob-ylti8/medbin_dataset. The source code for MedBin-Net can be found at https://github.com/Wayne3918/MedbinNet.
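As a quick sanity check, the reported F1-score follows directly from the stated precision and recall:

    precision, recall = 0.91, 0.97
    f1 = 2 * precision * recall / (precision + recall)
    print(round(f1, 2))  # 0.94, matching the F1-score reported above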
Affiliation(s)
- Xiazhen Xu
- Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Chenyang Wang
- Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Qiufeng Yi
- Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Jiaqi Ye
- Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Xiangfei Kong
- Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Shazad Q Ashraf
- Queen Elizabeth Hospital, Mindelsohn Way, Birmingham B15 2GW, UK
- Karl D Dearn
- Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
- Amir M Hajiyavand
- Department of Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham B15 2TT, UK
5. Tamin O, Moung EG, Dargham JA, Karim SAA, Ibrahim AO, Adam N, Osman HA. RGB and RGNIR image dataset for machine learning in plastic waste detection. Data Brief 2025; 60:111524. [PMID: 40275976 PMCID: PMC12020901 DOI: 10.1016/j.dib.2025.111524] [Received: 10/28/2024] [Revised: 03/14/2025] [Accepted: 03/24/2025] [Indexed: 04/26/2025]
Abstract
The increasing volume of plastic waste is an environmental issue that demands effective sorting methods for different types of plastic. While spectral imaging offers a promising solution, it has several drawbacks, such as complexity, high cost, and limited spatial resolution. Machine learning has emerged as a potential solution for plastic waste detection due to its ability to analyse and interpret large volumes of data using algorithms. However, developing an efficient machine learning model requires a comprehensive dataset with information on the size, shape, colour, texture, and other features of plastic waste. Moreover, incorporating near-infrared (NIR) spectral data into machine learning models can reveal crucial information about plastic waste composition and structure that remains invisible in standard RGB images. Despite this potential, no publicly available dataset currently combines RGB with NIR spectral information for plastic waste detection. To address this research gap, we introduce a comprehensive dataset of plastic waste images captured onshore using both standard RGB and RGNIR (red, green, near-infrared) channels. Each of the two colour-space datasets includes 405 images taken along riverbanks and beaches. Both datasets underwent further pre-processing to ensure proper labelling and annotations to prepare them for training machine learning models. In total, 1,344 plastic waste objects have been annotated. The proposed dataset offers a unique resource for researchers to train machine learning models for plastic waste detection. While there are existing datasets on plastic waste, the proposed dataset sets itself apart by offering unique spectral information in the near-infrared region. It is hoped that these datasets will contribute to the advancement of the field of plastic waste detection and encourage further research in this area.
Affiliation(s)
- Owen Tamin
- Faculty of Science and Natural Resources, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu, 88400, Sabah, Malaysia
- Ervin Gubin Moung
- Data Technologies and Applications (DaTA) Research Group, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu, 88400, Sabah, Malaysia
- Faculty of Computing and Informatics, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu, 88400, Sabah, Malaysia
- Jamal Ahmad Dargham
- Faculty of Engineering, Universiti Malaysia Sabah, Kota Kinabalu, 88400, Sabah, Malaysia
- Samsul Ariffin Abdul Karim
- Institute of Strategic Industrial Decision Modelling (ISIDM), School of Quantitative Sciences, UUM College of Arts and Sciences, Universiti Utara Malaysia, 06010 Sintok, Kedah Darul Aman, Malaysia
- Ashraf Osman Ibrahim
- Department of Computer and Information Sciences, Universiti Teknologi PETRONAS, 32610 Seri Iskandar, Malaysia
- Positive Computing Research Center, Emerging & Digital Technologies Institute, Universiti Teknologi PETRONAS, 32610 Seri Iskandar, Malaysia
- Nada Adam
- Department of Computer Science, The Applied College, Northern Border University, Arar 73213, Saudi Arabia
- Hadia Abdelgader Osman
- Department of Computer Science, The Applied College, Northern Border University, Arar 73213, Saudi Arabia
6. Guru D, N S. Banana bunch image and video dataset for variety classification and grading. Data Brief 2025; 60:111478. [PMID: 40231149 PMCID: PMC11994905 DOI: 10.1016/j.dib.2025.111478] [Received: 01/15/2025] [Revised: 02/26/2025] [Accepted: 03/13/2025] [Indexed: 04/16/2025]
Abstract
Banana, a major commercial fruit crop, holds high nutritional value and is widely consumed [4,8,10]. The global banana market, valued at USD 140.83 billion in 2024, is projected to reach USD 147.74 billion by 2030. Accurate variety identification and quality grading are crucial for marketing, pricing, and operational efficiency in food processing industries [9]. As wholesalers and food processing industries process bananas in bunches (not at the individual fruit level), our bunch-level dataset offers a more accurate assessment by capturing bunch-level characteristics, which are vital for grading. Existing datasets, such as [1,6], focus on individual bananas or have limited bunch-level data, highlighting the lack of large-scale bunch datasets. This dataset fills the gap by providing bunch-level images and videos of three widely consumed banana varieties (Elakki-bale, Pachbale, and Rasbale) from Mysuru, South Karnataka, India, serving as a valuable resource for food processing industries. Our dataset supports training machine learning models for bunch-level variety classification and grading of bananas and serves as a resource for research and education.
Affiliation(s)
- D.S. Guru
- Department of Studies in Computer Science, University of Mysore, Manasagangotri, Mysuru, Karnataka, 570006, India
- Saritha N
- Department of Studies in Computer Science, University of Mysore, Manasagangotri, Mysuru, Karnataka, 570006, India
7. Ebert N, Stricker D, Wasenmüller O. Enhancing robustness and generalization in microbiological few-shot detection through synthetic data generation and contrastive learning. Comput Biol Med 2025; 191:110141. [PMID: 40253923 DOI: 10.1016/j.compbiomed.2025.110141] [Received: 08/02/2024] [Revised: 02/25/2025] [Accepted: 04/03/2025] [Indexed: 04/22/2025]
Abstract
In many medical and pharmaceutical processes, continuous hygiene monitoring is crucial, often involving the manual detection of microorganisms in agar dishes by qualified personnel. Although deep learning methods hold promise for automating this task, they frequently encounter a shortage of sufficient training data, a prevalent challenge in colony detection. To overcome this limitation, we propose a novel pipeline that combines generative data augmentation with few-shot detection. Our approach aims to significantly enhance detection performance, even with (very) limited training data. A main component of our method is a diffusion-based generator model that inpaints synthetic bacterial colonies onto real agar plate backgrounds. This data augmentation technique enhances the diversity of training data, allowing for effective model training with only 25 real images. Our method outperforms common training-techniques, demonstrating a +0.45 mAP improvement compared to training from scratch, and a +0.15 mAP advantage over the current SOTA in synthetic data augmentation. Additionally, we integrate a decoupled feature classification strategy, where class-agnostic detection is followed by lightweight classification via a feed-forward network, making it possible to detect and classify colonies with minimal examples. This approach achieves an AP50 score of 0.7 in a few-shot scenario on the AGAR dataset. Our method also demonstrates robustness to various image corruptions, such as noise and blur, proving its applicability in real-world scenarios. By reducing the need for large labeled datasets, our pipeline offers a scalable, efficient solution for colony detection in hygiene monitoring and biomedical research, with potential for broader applications in fields where rapid detection of new colony types is required.
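The decoupled strategy mentioned above (class-agnostic box detection followed by a lightweight feed-forward classifier on pooled box features) could look roughly like the sketch below. Feature dimensions, pooling size, and module names are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_align

    class ColonyClassifierHead(nn.Module):
        # Lightweight head that assigns species labels to class-agnostic detections.
        def __init__(self, feat_dim=256, num_classes=5):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(feat_dim * 7 * 7, 256), nn.ReLU(),
                nn.Linear(256, num_classes),
            )

        def forward(self, feature_map, boxes):
            # feature_map: (1, C, H, W); boxes: (N, 4) already in feature-map coordinates.
            batch_idx = torch.zeros(len(boxes), 1, device=boxes.device, dtype=boxes.dtype)
            rois = torch.cat([batch_idx, boxes], dim=1)          # prepend the batch index
            crops = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1.0)
            return self.head(crops.flatten(1))                   # (N, num_classes) logits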
Affiliation(s)
- Nikolas Ebert
- Research and Transfer Center CeMOS, Technical University of Applied Sciences Mannheim, Mannheim, 68163, Germany; Department of Computer Science, University of Kaiserslautern-Landau (RPTU), Kaiserslautern, 67663, Germany
- Didier Stricker
- Department of Computer Science, University of Kaiserslautern-Landau (RPTU), Kaiserslautern, 67663, Germany
- Oliver Wasenmüller
- Research and Transfer Center CeMOS, Technical University of Applied Sciences Mannheim, Mannheim, 68163, Germany
8. Deng K, Wen Q, Yang F, Ouyang H, Shi Z, Shuai S, Wu Z. OS-DETR: End-to-end brain tumor detection framework based on orthogonal channel shuffle networks. PLoS One 2025; 20:e0320757. [PMID: 40359502 DOI: 10.1371/journal.pone.0320757] [Received: 01/03/2024] [Accepted: 02/21/2025] [Indexed: 05/15/2025]
Abstract
OrthoNets use the Gram-Schmidt process to achieve orthogonality among filters but do not impose constraints on the internal orthogonality of individual filters. To reduce the risk of overfitting, especially in scenarios with limited data such as medical images, this study explores an enhanced network that ensures the internal orthogonality within individual filters, named the Orthogonal Channel Shuffle Network (OSNet). This network is integrated into the Detection Transformer (DETR) framework for brain tumor detection, resulting in the OS-DETR. To further optimize model performance, this study also incorporates deformable attention mechanisms and an Intersection over Union strategy that emphasizes the internal region influence of bounding boxes and the corner distance disparity. Experimental results on the Br35H brain tumor dataset demonstrate the significant advantages of OS-DETR over mainstream object detection frameworks. Specifically, OS-DETR achieves a Precision of 95.0%, Recall of 94.2%, mAP@50 of 95.7%, and mAP@50:95 of 74.2%. The code implementation and experimental results are available at https://github.com/dkx2077/OS-DETR.git.
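The Gram-Schmidt process referred to above is easy to illustrate. In the sketch below, filters are flattened and orthonormalized against one another (the OrthoNets setting), and the same routine is then applied to the channel slices within a single filter, which is one plausible reading of the "internal orthogonality" idea; the granularity and shapes are assumptions, not the paper's exact construction.

    import numpy as np

    def gram_schmidt(vectors, eps=1e-8):
        # Orthonormalize a set of row vectors, dropping near-dependent ones.
        basis = []
        for v in vectors:
            w = v - sum(np.dot(v, b) * b for b in basis)
            norm = np.linalg.norm(w)
            if norm > eps:
                basis.append(w / norm)
        return np.stack(basis)

    # Orthogonality among filters: treat each filter as one flattened vector.
    filters = np.random.randn(8, 16 * 3 * 3)        # 8 filters with 16 input channels, 3x3 kernels
    ortho_filters = gram_schmidt(filters)

    # Internal orthogonality: orthogonalize the channel slices within a single filter.
    one_filter = np.random.randn(16, 3 * 3)         # 16 channel slices of one filter
    ortho_channels = gram_schmidt(one_filter)       # at most 9 slices can remain orthonormal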
Affiliation(s)
- Kaixin Deng
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
- Quan Wen
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
- Fan Yang
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
- Hang Ouyang
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
- Zhuohang Shi
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
- Shiyu Shuai
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
- Zhaowang Wu
- College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
9. Zhao P, Wang X, Yu S, Dong X, Li B, Wang H, Chen G. An open paradigm dataset for intelligent monitoring of underground drilling operations in coal mines. Sci Data 2025; 12:780. [PMID: 40355463 PMCID: PMC12069595 DOI: 10.1038/s41597-025-05118-1] [Received: 07/23/2024] [Accepted: 05/01/2025] [Indexed: 05/14/2025]
Abstract
The underground drilling environment in coal mines is critical and prone to accidents, with common accident types including rib spalling, roof falling, and others. High-quality datasets are essential for developing and validating artificial intelligence (AI) algorithms in the field of coal mine safety monitoring and automation. Currently, there is no comprehensive benchmark dataset for coal mine industrial scenarios, limiting the research progress of AI algorithms in this industry. For the first time, this study constructed a benchmark dataset (DsDPM 66) specifically for underground coal mine drilling operations, containing 105,096 images obtained from surveillance videos of multiple drilling operation scenes. The dataset has been manually annotated to support computer vision tasks such as object detection and pose estimation. In addition, this study conducted extensive benchmarking experiments on this dataset, applying various advanced AI algorithms including but not limited to YOLOv8 and DETR. The results indicate that the proposed dataset highlights areas for improvement in algorithmic models and fills the data gap in the coal mining industry, providing a valuable resource for developing coal mine safety monitoring.
Affiliation(s)
- Pengzhen Zhao
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
- Xichao Wang
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
- Shuainan Yu
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
- Xiangqing Dong
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
- Baojiang Li
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
- Haiyan Wang
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
- Guochu Chen
- School of Electrical Engineering, Shanghai DianJi University, Shanghai, 201306, China
- Intelligent Decision and Control Technology Institute, Shanghai DianJi University, Shanghai, 201306, China
10. Ruarte G, Bujia G, Care D, Ison MJ, Kamienkowski JE. Integrating Bayesian and neural networks models for eye movement prediction in hybrid search. Sci Rep 2025; 15:16482. [PMID: 40355508 PMCID: PMC12069626 DOI: 10.1038/s41598-025-00272-3] [Received: 01/20/2025] [Accepted: 04/28/2025] [Indexed: 05/14/2025]
Abstract
Visual search is crucial in daily human interaction with the environment. Hybrid search extends this by requiring observers to find any item from a given set. Recently, a few models were proposed to simulate human eye movements in visual search tasks within natural scenes, but none were implemented for Hybrid search under similar conditions. We present an enhanced neural network Entropy Limit Minimization (nnELM) model, grounded in a Bayesian framework and signal detection theory, and the Hybrid Search Eye Movements (HSEM) Dataset, containing thousands of human eye movements during hybrid tasks. A key Hybrid search challenge is that participants have to look for different objects at the same time. To address this, we developed several strategies involving the posterior probability distributions after each fixation. Adjusting peripheral visibility improved early-stage efficiency, aligning it with human behavior. Limiting the model's memory reduced success in longer searches, mirroring human performance. We validated these improvements by comparing our model with a held-out set within the HSEM and with other models in a separate visual search benchmark. Overall, the new nnELM model not only handles Hybrid search in natural scenes but also closely replicates human behavior, advancing our understanding of search processes while maintaining interpretability.
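The Bayesian ingredient such models build on can be illustrated with a toy posterior update: after a fixation that does not find the target, each location is down-weighted by the probability that a target there would have been detected. The grid size and the visibility profile below are invented for illustration; the nnELM likelihoods and its entropy-minimization fixation rule are considerably more elaborate.

    import numpy as np

    def update_posterior(prior, visibility):
        # prior: probability that the target is at each cell before the fixation.
        # visibility: probability of detecting the target at each cell given the current
        # fixation (high at the fovea, decaying with eccentricity). When nothing is detected,
        # every cell is re-weighted by the chance that a target there would have been missed.
        posterior = prior * (1.0 - visibility)
        return posterior / posterior.sum()

    prior = np.full(64, 1 / 64)                      # uniform belief over an 8x8 grid
    visibility = np.exp(-np.arange(64) / 8.0)        # hypothetical eccentricity fall-off
    posterior = update_posterior(prior, visibility)  # belief after one unsuccessful fixation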
Affiliation(s)
- Gonzalo Ruarte
- Laboratorio de Inteligencia Artificial Aplicada (LIAA), Instituto de Ciencias de la Computación (ICC), CONICET - Universidad de Buenos Aires, Buenos Aires, Argentina
- Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
- Gaston Bujia
- Laboratorio de Inteligencia Artificial Aplicada (LIAA), Instituto de Ciencias de la Computación (ICC), CONICET - Universidad de Buenos Aires, Buenos Aires, Argentina
- Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
- Damián Care
- Laboratorio de Inteligencia Artificial Aplicada (LIAA), Instituto de Ciencias de la Computación (ICC), CONICET - Universidad de Buenos Aires, Buenos Aires, Argentina
- Juan Esteban Kamienkowski
- Laboratorio de Inteligencia Artificial Aplicada (LIAA), Instituto de Ciencias de la Computación (ICC), CONICET - Universidad de Buenos Aires, Buenos Aires, Argentina
- Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
- Maestría en Explotación de Datos y Descubrimiento del Conocimiento, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
11. Tu J, Liu X, Huang Z, Hao Y, Hong R, Wang M. Cross-Modal Hashing via Diverse Instances Matching. IEEE Trans Image Process 2025; 34:2737-2749. [PMID: 40266858 DOI: 10.1109/tip.2025.3561659] [Indexed: 04/25/2025]
Abstract
Cross-modal hashing is a highly effective technique for searching relevant data across different modalities, owing to its low storage costs and fast similarity retrieval capability. While significant progress has been achieved in this area, prior investigations predominantly concentrate on a one-to-one feature alignment approach, where a singular feature is derived for similarity retrieval. However, the singular feature in these methods fails to adequately capture the varied multi-instance information inherent in the original data across disparate modalities. Consequently, the conventional one-to-one methodology is plagued by a semantic mismatch issue, as the rigid one-to-one alignment inhibits effective multi-instance matching. To address this issue, we propose a novel Diverse Instances Matching for Cross-modal Hashing (DIMCH), which explores the relevance between multiple instances in different modalities using a multi-instance learning algorithm. Specifically, we design a novel diverse instances learning module to extract a multi-feature set, which enables our model to capture detailed multi-instance semantics. To evaluate the similarity between two multi-feature sets, we adopt the smooth chamfer distance function, which enables our model to incorporate the conventional similarity retrieval structure. Moreover, to sufficiently exploit the supervised information from the semantic label, we adopt the weight cosine triplet loss as the objective function, which incorporates the multilevel similarity among the multi-labels into the training procedure and enables the model to mine the multi-label correlation effectively. Extensive experiments demonstrate that our diverse hashing embedding method achieves state-of-the-art performance in supervised cross-modal hashing retrieval tasks.
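The smooth chamfer distance mentioned above softly matches every element of one embedding set to its best counterparts in the other via a log-sum-exp over cosine similarities. The sketch below uses one common formulation; the scaling constant and normalization are assumptions and may differ from the paper's exact definition.

    import torch
    import torch.nn.functional as F

    def smooth_chamfer_similarity(X, Y, alpha=16.0):
        # X: (n, d) and Y: (m, d) are the multi-feature sets of two items.
        X, Y = F.normalize(X, dim=-1), F.normalize(Y, dim=-1)
        sim = X @ Y.t()                                      # (n, m) pairwise cosine similarities
        x_to_y = torch.logsumexp(alpha * sim, dim=1).mean()  # soft best match in Y for each x
        y_to_x = torch.logsumexp(alpha * sim, dim=0).mean()  # soft best match in X for each y
        return (x_to_y + y_to_x) / (2 * alpha)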
12. Abo-Zahhad MM, Abo-Zahhad M. Real time intelligent garbage monitoring and efficient collection using Yolov8 and Yolov5 deep learning models for environmental sustainability. Sci Rep 2025; 15:16024. [PMID: 40341180 PMCID: PMC12062267 DOI: 10.1038/s41598-025-99885-x] [Received: 01/07/2025] [Accepted: 04/23/2025] [Indexed: 05/10/2025]
Abstract
Effective waste management is currently one of the most influential factors in enhancing the quality of life. Increased garbage production has been identified as a significant problem for many cities worldwide and a crucial issue for countries experiencing rapid urban population growth. According to the World Bank Organization, global waste production is projected to increase from 2.01 billion tonnes in 2018 to 3.4 billion tonnes by 2050 (Kaza et al. in What a Waste 2.0: A Global Snapshot of Solid Waste Management to 2050, The World Bank Group, Washington, DC, USA, 2018). In many cities, growing waste is the primary driver of environmental pollution. Nationally, governments have initiated several programs to improve cleanliness by developing systems that alert businesses when it's time to empty the bins. Current research proposes an enhanced, accurate, real-time object detection system to address the problem of trash accumulating around containers. This system involves numerous trash cans scattered across the city, each equipped with a low-cost device that measures the amount of trash inside. When a certain threshold is reached, the device sends a message with a unique identifier, prompting the appropriate authorities to take action. The system also triggers alerts if individuals throw trash bags outside the container or if the bin overflows, sending a message with a unique identifier to the authorities. Additionally, this paper addresses the need for efficient garbage classification while reducing computing costs to improve resource utilization. Two-stage lightweight deep learning models based on YOLOv5 and YOLOv8 are adopted to significantly decrease the number of parameters and processes, thereby reducing hardware requirements. In this study, trash is first classified into primary categories, which are further subdivided. The primary categories include full trash containers, trash bags, trash outside containers, and wet trash containers. YOLOv5 is particularly effective for classifying small objects, achieving high accuracy in identifying and categorizing different types of waste products on hardware without GPU capabilities. Each main class is further subdivided using YOLOv8 to facilitate recycling. A comparative study of YOLOv8, YOLOv5, and EfficientNet models on public and newly constructed garbage datasets shows that YOLOv8 and YOLOv5 have good accuracy for most classes, with the full-trash bin class achieving the highest accuracy and the wet trash container class the lowest compared to the EfficientNet model. The results demonstrate that the system effectively addresses the reliability issues of previously proposed systems, including detecting whether a trash bin is full, identifying trash outside the bin, and ensuring proper communication with authorities for necessary actions. Further research is recommended to enhance garbage management and collection, considering target occlusion, CPU and GPU hardware optimization, and robotic integration with the proposed system.
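The bin-level alerting described above reduces to a simple threshold check per sensor reading. The threshold value, identifiers, and message format below are placeholders, not the paper's actual firmware or protocol.

    FILL_THRESHOLD = 0.8  # fraction of bin capacity that triggers a collection request

    def check_bin(bin_id, fill_level, overflow_detected, litter_outside):
        # Returns the alert messages (tagged with the bin's unique identifier)
        # that should be forwarded to the authorities for one reading.
        alerts = []
        if fill_level >= FILL_THRESHOLD:
            alerts.append(f"bin {bin_id}: fill level {fill_level:.0%}, collection required")
        if overflow_detected:
            alerts.append(f"bin {bin_id}: overflow detected")
        if litter_outside:
            alerts.append(f"bin {bin_id}: trash detected outside the container")
        return alerts

    print(check_bin("BIN-0042", 0.85, overflow_detected=False, litter_outside=True))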
Affiliation(s)
- Mohammed M Abo-Zahhad
- Department of Electrical Engineering, Faculty of Engineering, Sohag University, Sohag, New Sohag City, Egypt
- Mohammed Abo-Zahhad
- Department of Electronics and Communications Engineering, Egypt-Japan University of Science and Technology (E-JUST), New Borg El-Arab City, Alexandria, 21934, Egypt
- Department of Electrical and Electronics Engineering, Assiut University, Assiut, 71515, Egypt
13. Wu M, Sharapov J, Anderson M, Lu Y, Wu Y. Quantifying dislocation-type defects in post irradiation examination via transfer learning. Sci Rep 2025; 15:15889. [PMID: 40335501 PMCID: PMC12059087 DOI: 10.1038/s41598-025-00238-5] [Received: 12/12/2024] [Accepted: 04/25/2025] [Indexed: 05/09/2025]
Abstract
The quantitative analysis of dislocation-type defects in irradiated materials is critical to materials characterization in the nuclear energy industry. The conventional approach of an instrument scientist manually identifying any dislocation defects is both time-consuming and subjective, thereby potentially introducing inconsistencies in the quantification. This work approaches dislocation-type defect identification and segmentation using a standard open-source computer vision model, YOLO11, that leverages transfer learning to create a highly effective dislocation defect quantification tool while using only a minimal number of annotated micrographs for training. This model demonstrates the ability to segment both dislocation lines and loops concurrently in micrographs with high pixel noise levels and on two alloys not represented in the training set. Inference of dislocation defects using transmission electron microscopy on three different irradiated alloys relevant to the nuclear energy industry is examined in this work, with widely varying pixel noise levels and with completely unrelated composition and dislocation formations, for practical post irradiation examination analysis. Code and models are available at https://github.com/idaholab/PANDA.
Affiliation(s)
- Michael Wu
- Idaho National Laboratory, Idaho Falls, ID, USA
- Yu Lu
- Boise State University, Boise, ID, USA
- Center for Advanced Energy Studies, Idaho Falls, ID, USA
- Yaqiao Wu
- Boise State University, Boise, ID, USA
- Center for Advanced Energy Studies, Idaho Falls, ID, USA
14. Pawar P, McManus B, Anthony T, Yang J, Kerwin T, Stavrinos D. Artificial intelligence automated solution for hazard annotation and eye tracking in a simulated environment. Accid Anal Prev 2025; 218:108075. [PMID: 40339543 DOI: 10.1016/j.aap.2025.108075] [Received: 12/13/2024] [Accepted: 04/27/2025] [Indexed: 05/10/2025]
Abstract
High-fidelity simulators and sensors are commonly used in research to create immersive environments for studying real-world problems. This setup records detailed data, generating large datasets. In driving research, a full-scale car model repurposed as a driving simulator allows human subjects to navigate realistic driving scenarios. Data from these experiments are collected in raw form, requiring extensive manual annotation of roadway elements such as hazards and distractions. This process is often time-consuming, labor-intensive, and repetitive, causing delays in research progress. This paper proposes an AI-driven solution to automate these tasks, enabling researchers to focus on analysis and advance their studies efficiently. The solution builds on previous driving behavior research using a high-fidelity full-cab simulator equipped with gaze-tracking cameras. It extends the capabilities of the earlier system described in Pawar's (2021) "Hazard Detection in Driving Simulation using Deep Learning", which performed only hazard detection. The enhanced system now integrates both hazard annotation and gaze-tracking data. By combining vehicle handling parameters with drivers' visual attention data, the proposed method provides a unified, detailed view of participants' driving behavior across various simulated scenarios. This approach streamlines data analysis, accelerates research timelines, and enhances understanding of driving behavior.
Affiliation(s)
- Piyush Pawar
- Institute for Social Science Research, University of Alabama, 306 Paul W. Bryant Drive East, Tuscaloosa, AL 35401, USA
- Benjamin McManus
- Institute for Social Science Research, University of Alabama, 306 Paul W. Bryant Drive East, Tuscaloosa, AL 35401, USA
- Thomas Anthony
- Analytical AI, 1500, 1st Ave. N, Birmingham, AL 35022, USA
- Jingzhen Yang
- Department of Pediatrics, College of Medicine, The Ohio State University, Center for Injury Research and Policy, Abigail Wexner Research Institute at Nationwide Children's Hospital, 700 Children's Dr. RBIII-WB5403 Columbus, OH 43205, USA
- Thomas Kerwin
- Ohio State University, Driving Simulation Laboratory, The Ohio State University, 1305 Kinnear Road, Suite 194, Columbus OH 43212, USA
- Despina Stavrinos
- Institute for Social Science Research, University of Alabama, 306 Paul W. Bryant Drive East, Tuscaloosa, AL 35401, USA
15. Zheng S, Wu Z, Xu Y, He C, Wei Z. Detector With Classifier 2: An End-to-End Multi-Stream Feature Aggregation Network for Fine-Grained Object Detection in Remote Sensing Images. IEEE Trans Image Process 2025; 34:2707-2720. [PMID: 40305241 DOI: 10.1109/tip.2025.3563708] [Indexed: 05/02/2025]
Abstract
Fine-grained object detection (FGOD) fundamentally comprises two primary tasks: object detection and fine-grained classification. In natural scenes, most FGOD methods benefit from higher instance resolution and less environmental variation, attributes more commonly associated with the latter task. In this paper, we propose a unified paradigm named Detector with Classifier2 (DC2), which provides a holistic paradigm by explicitly considering the end-to-end integration of object detection and fine-grained classification tasks, rather than prioritizing one aspect. Initially, our detection sub-network is restricted to only determining whether the proposal is a coarse category and does not delve into the specific sub-categories. Moreover, in order to reduce redundant pixel-level calculation, we propose an instance-level feature enhancement (IFE) module to model the semantic similarities among proposals, which poses great potential for locating more instances in remote sensing images (RSIs). After obtaining the coarse detection predictions, we further construct a classification sub-network, which is built on top of the former branch to determine the specific sub-categories of the aforementioned predictions. Importantly, the detection network is performed on the complete image, while the classification network conducts secondary modeling for the detected regions. These operations can be denoted as the global contextual information and local intrinsic cues extractions for each instance. Therefore, we propose a multi-stream feature aggregation (MSFA) module to integrate global-stream semantic information and local-stream discriminative cues. Our whole DC2 network follows an end-to-end learning fashion, which effectively excavates the internal correlation between detection and fine-grained classification networks. We evaluate the performance of our DC2 network on two benchmarks, the SAT-MTB and HRSC2016 datasets. Importantly, our method achieves new state-of-the-art results compared with recent works (approximately 7% mAP gains on SAT-MTB) and improves the baseline by a significant margin (43.2% vs. 36.7%) without any complicated post-processing strategies. Source codes of the proposed methods are available at https://github.com/zhengshangdong/DC2.
16. Gao C, Ajith S, Peelen MV. Object representations drive emotion schemas across a large and diverse set of daily-life scenes. Commun Biol 2025; 8:697. [PMID: 40325234 PMCID: PMC12053605 DOI: 10.1038/s42003-025-08145-1] [Received: 01/21/2025] [Accepted: 04/29/2025] [Indexed: 05/07/2025]
Abstract
The rapid emotional evaluation of objects and events is essential in daily life. While visual scenes reliably evoke emotions, it remains unclear whether emotion schemas evoked by daily-life scenes depend on object processing systems or are extracted independently. To explore this, we collected emotion ratings for 4913 daily-life scenes from 300 participants, and predicted these ratings from representations in deep neural networks and functional magnetic resonance imaging (fMRI) activity patterns in visual cortex. AlexNet, an object-based model, outperformed EmoNet, an emotion-based model, in predicting emotion ratings for daily-life scenes, while EmoNet excelled for explicitly evocative scenes. Emotion information was processed hierarchically within the object recognition system, consistent with the visual cortex's organization. Activity patterns in the lateral occipital complex (LOC), an object-selective region, reliably predicted emotion ratings and outperformed other visual regions. These findings suggest that the emotional evaluation of daily-life scenes is mediated by visual object processing, with additional mechanisms engaged when object content is uninformative.
Affiliation(s)
- Chuanji Gao
- School of Psychology, Nanjing Normal University, Nanjing, China
- Susan Ajith
- Department of Medicine, Justus-Liebig-Universität Gießen, Gießen, Germany
- Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Marius V Peelen
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, the Netherlands
17. Haupt M, Garrett DD, Cichy RM. Healthy aging delays and dedifferentiates high-level visual representations. Curr Biol 2025; 35:2112-2127.e6. [PMID: 40239656 DOI: 10.1016/j.cub.2025.03.062] [Received: 10/04/2024] [Revised: 01/23/2025] [Accepted: 03/25/2025] [Indexed: 04/18/2025]
Abstract
Healthy aging impacts visual information processing with consequences for subsequent high-level cognition and everyday behavior, but the underlying neural changes in visual representations remain unknown. Here, we investigate the nature of representations underlying object recognition in older compared to younger adults by tracking them in time using electroencephalography (EEG), across space using functional magnetic resonance imaging (fMRI), and by probing their behavioral relevance using similarity judgments. Applying a multivariate analysis framework to combine experimental assessments, four key findings about how brain aging impacts object recognition emerge. First, aging selectively delays the formation of object representations, profoundly changing the chronometry of visual processing. Second, the delay in the formation of object representations emerges in high-level rather than low- and mid-level ventral visual cortex, supporting the theory that brain areas developing last deteriorate first. Third, aging reduces content selectivity in the high-level ventral visual cortex, indicating age-related neural dedifferentiation as the mechanism of representational change. Finally, we demonstrate that the identified representations of the aging brain are behaviorally relevant, ascertaining ecological relevance. Together, our results reveal the impact of healthy aging on the visual brain.
Affiliation(s)
- Marleen Haupt
- Department of Education and Psychology, Freie Universität Berlin, Habelschwerdter Allee 45, Berlin 14195, Germany; Center for Lifespan Psychology, Max Planck Institute for Human Development, Lentzallee 94, Berlin 14195, Germany
- Douglas D Garrett
- Max Planck UCL Centre for Computational Psychiatry and Ageing Research, 10-12 Russell Square, London WC1B 5EH, UK
- Radoslaw M Cichy
- Department of Education and Psychology, Freie Universität Berlin, Habelschwerdter Allee 45, Berlin 14195, Germany; Berlin School of Mind and Brain, Faculty of Philosophy, Humboldt-Universität zu Berlin, Luisenstraße 56, Berlin 10117, Germany; Bernstein Center for Computational Neuroscience Berlin, Humboldt-Universität zu Berlin, Philippstraße 13, Berlin 10115, Germany
18. Toosi A, Harsini S, Divband G, Bénard F, Uribe CF, Oviedo F, Dodhia R, Weeks WB, Lavista Ferres JM, Rahmim A. Computer-Aided Detection (CADe) of Small Metastatic Prostate Cancer Lesions on 3D PSMA PET Volumes Using Multi-Angle Maximum Intensity Projections. Cancers (Basel) 2025; 17:1563. [PMID: 40361490 DOI: 10.3390/cancers17091563] [Received: 03/27/2025] [Revised: 04/28/2025] [Accepted: 04/29/2025] [Indexed: 05/15/2025]
Abstract
OBJECTIVES: We aimed to develop and evaluate a novel computer-aided detection (CADe) approach for identifying small metastatic biochemically recurrent (BCR) prostate cancer (PCa) lesions on PSMA-PET images, utilizing multi-angle Maximum Intensity Projections (MA-MIPs) and state-of-the-art (SOTA) object detection algorithms. METHODS: We fine-tuned and evaluated 16 SOTA object detection algorithms (selected across four main categories of model types) applied to MA-MIPs as extracted from rotated 3D PSMA-PET volumes. Predicted 2D bounding boxes were back-projected to the original 3D space using the Ordered Subset Expectation Maximization (OSEM) algorithm. A fine-tuned Medical Segment-Anything Model (MedSAM) was then also used to segment the identified lesions within the bounding boxes. RESULTS: The proposed method achieved a high detection performance for this difficult task, with the FreeAnchor model reaching an F1-score of 0.69 and a recall of 0.74. It outperformed several 3D methods in efficiency while maintaining comparable accuracy. Strong recall rates were observed for clinically relevant areas, such as local relapses (0.82) and bone metastases (0.80). CONCLUSION: Our fully automated CADe tool shows promise in assisting physicians as a "second reader" for detecting small metastatic BCR PCa lesions on PSMA-PET images. By leveraging the strength and computational efficiency of 2D models while preserving 3D spatial information of the PSMA-PET volume, the proposed approach has the potential to improve detectability and reduce workload in cancer diagnosis and management.
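Extracting multi-angle maximum intensity projections from a 3D PET volume is conceptually simple; a minimal sketch is shown below. The number of angles, rotation axis, and interpolation order are assumptions, and the OSEM-based back-projection of detected 2D boxes into the volume is not shown.

    import numpy as np
    from scipy.ndimage import rotate

    def multi_angle_mips(volume, n_angles=12):
        # volume: 3D PET array ordered (z, y, x). Rotate about the axial (z) axis and take
        # a maximum intensity projection at each angle, producing 2D views for a 2D detector.
        mips = []
        for angle in np.linspace(0, 180, n_angles, endpoint=False):
            rotated = rotate(volume, angle, axes=(1, 2), reshape=False, order=1)
            mips.append(rotated.max(axis=2))   # project along x after rotation
        return np.stack(mips)                  # (n_angles, z, y)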
Affiliation(s)
- Amirhosein Toosi
- Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada
- Department of Radiology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
- AI for Good Lab, Microsoft Corporation, Redmond, WA 98052, USA
- François Bénard
- Department of Radiology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
- BC Cancer, Vancouver, BC V5Z 1L3, Canada
- Carlos F Uribe
- Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada
- Department of Radiology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
- BC Cancer, Vancouver, BC V5Z 1L3, Canada
- Felipe Oviedo
- AI for Good Lab, Microsoft Corporation, Redmond, WA 98052, USA
- Rahul Dodhia
- AI for Good Lab, Microsoft Corporation, Redmond, WA 98052, USA
- William B Weeks
- AI for Good Lab, Microsoft Corporation, Redmond, WA 98052, USA
- Arman Rahmim
- Department of Integrative Oncology, BC Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada
- Department of Radiology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
- Department of Physics and Astronomy, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
19. Demircioğlu A, Bos D, Quinsten AS, Umutlu L, Bruder O, Forsting M, Nassenstein K. Detecting the left atrial appendage in CT localizers using deep learning. Sci Rep 2025; 15:15333. [PMID: 40316718 PMCID: PMC12048584 DOI: 10.1038/s41598-025-99701-6] [Received: 01/14/2025] [Accepted: 04/22/2025] [Indexed: 05/04/2025]
Abstract
Patients with cardioembolic stroke often undergo CT of the left atrial appendage (LAA), for example, to determine whether thrombi are present in the LAA. To guide the imaging process, technologists first perform a localizer scan, which is a preliminary image used to identify the region of interest. However, the lack of well-defined landmarks makes accurate delimitation of the LAA in localizers difficult and often requires whole-heart scans, increasing radiation exposure and cancer risk. This study aims to automate LAA delimitation in CT localizers using deep learning. Four commonly used deep networks (VariFocalNet, Cascade-R-CNN, Task-aligned One-stage Object Detection Network, YOLO v11) were trained to predict the LAA boundaries on a cohort of 1253 localizers, collected retrospectively from a single center. The best-performing network in terms of delimitation accuracy was then evaluated on an internal test cohort of 368 patients, and on an external test cohort of 309 patients. The VariFocalNet performed best, achieving LAA delimitations with high accuracy (97.8% and 96.8%; Dice coefficients: 90.4% and 90.0%) and near-perfect clinical utility (99.8% and 99.3%). Compared to whole-heart scanning, the network-based delimitation reduced the radiation exposure by more than 50% (5.33 ± 6.42 mSv vs. 11.35 ± 8.17 mSv in the internal cohort, 4.39 ± 4.23 mSv vs. 10.09 ± 8.0 mSv in the external cohort). This study demonstrates that a deep learning network can accurately delimit the LAA in the localizer, leading to more accurate CT scans of the LAA, thereby significantly reducing radiation exposure to the patient compared to whole-heart scanning.
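For reference, the Dice coefficient used above to quantify how well the predicted delimitation overlaps the reference can be computed as follows (a generic implementation, not the study's code):

    import numpy as np

    def dice_coefficient(pred_mask, ref_mask, eps=1e-8):
        # Dice overlap between predicted and reference binary delimitation masks.
        pred, ref = pred_mask.astype(bool), ref_mask.astype(bool)
        intersection = np.logical_and(pred, ref).sum()
        return float(2.0 * intersection / (pred.sum() + ref.sum() + eps))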
Affiliation(s)
- Aydin Demircioğlu
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
- Denise Bos
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
- Anton S Quinsten
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
- Lale Umutlu
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
- Oliver Bruder
- Department of Cardiology and Angiology, Contilia Heart and Vascular Center, Elisabeth-Krankenhaus Essen, Klara-Kopp-Weg 1, 45138, Essen, Germany
- Faculty of Medicine, Ruhr University Bochum, 44801, Bochum, Germany
- Michael Forsting
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
- Kai Nassenstein
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany
20. Yang L, He L, Hu D, Liu Y, Peng Y, Chen H, Zhou M. Variational Transformer: A Framework Beyond the Tradeoff Between Accuracy and Diversity for Image Captioning. IEEE Trans Neural Netw Learn Syst 2025; 36:9500-9511. [PMID: 39374280 DOI: 10.1109/tnnls.2024.3440872] [Indexed: 10/09/2024]
Abstract
Accuracy and diversity represent two critical quantifiable performance metrics in the generation of natural and semantically accurate captions. While efforts are made to enhance one of them, the other suffers due to the inherent conflicting and complex relationship between them. In this study, we demonstrate that the suboptimal accuracy levels derived from human annotations are unsuitable for machine-generated captions. To boost diversity while maintaining high accuracy, we propose an innovative variational transformer (VaT) framework. By integrating "invisible information prior (IIP)" and "auto-selectable Gaussian mixture model (AGMM)," we enable its encoder to learn precise linguistic information and object relationships in various scenes, thus ensuring high accuracy. By incorporating the "range-median reward (RMR)" baseline into it, we preserve a wider range of candidates with higher rewards during the reinforcement-learning-based training process, thereby guaranteeing outstanding diversity. Experimental results indicate that our method achieves simultaneous improvements in accuracy and diversity by up to 1.1% and 4.8%, respectively, over the state-of-the-art. Furthermore, our approach achieves performance closest to human annotations in semantic retrieval, scoring 50.3 versus the human score of 50.6. Thus, the method can be readily put into industrial use.
Collapse
|
21
|
Tang X, Ye S, Shi Y, Hu T, Peng Q, You X. Filter Pruning Based on Information Capacity and Independence. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:8401-8413. [PMID: 39231052 DOI: 10.1109/tnnls.2024.3415068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/06/2024]
Abstract
Filter pruning has gained widespread adoption for the purpose of compressing and speeding up convolutional neural networks (CNNs). However, the existing approaches are still far from practical applications due to biased filter selection and heavy computation cost. This article introduces a new filter pruning method that selects filters in an interpretable, multiperspective, and lightweight manner. Specifically, we evaluate the contributions of filters from both individual and overall perspectives. For the amount of information contained in each filter, a new metric called information capacity is proposed. Inspired by information theory, we utilize the interpretable entropy to measure the information capacity and develop a feature-guided approximation process. For correlations among filters, another metric called information independence is designed. Since the aforementioned metrics are evaluated in a simple but effective way, we can identify and prune the least important filters with less computation cost. We conduct comprehensive experiments on benchmark datasets employing various widely used CNN architectures to evaluate the performance of our method. For instance, on ILSVRC-2012, our method outperforms state-of-the-art methods by reducing floating-point operations (FLOPs) by 77.4% and parameters by 69.3% for ResNet-50 with only a minor accuracy decrease of 2.64%.
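To make the two metrics concrete, the toy sketch below (hypothetical, not the authors' implementation) scores each filter of one layer by an entropy-style information capacity of its activation map and a correlation-based information independence term, then prunes the lowest-scoring filters; the histogram binning and scoring details are assumptions.

```python
import numpy as np

def filter_scores(feat, bins=32):
    """feat: (C, H, W) activation maps of one layer for one image."""
    C = feat.shape[0]
    capacity = np.empty(C)
    for c in range(C):
        hist, _ = np.histogram(feat[c], bins=bins)
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        capacity[c] = -(p * np.log(p)).sum()            # entropy of the activation map
    flat = feat.reshape(C, -1)
    corr = np.abs(np.corrcoef(flat))                     # pairwise |correlation| between filters
    independence = 1.0 - (corr.sum(1) - 1.0) / (C - 1)   # low mean correlation = more independent
    return capacity, independence

feat = np.random.rand(8, 16, 16)
cap, ind = filter_scores(feat)
keep = np.argsort(cap + ind)[2:]                          # prune the 2 lowest-scoring filters
```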
Collapse
|
22
|
Liu C, Li B, Shi M, Chen X, Ye Q, Ji X. Explicit Margin Equilibrium for Few-Shot Object Detection. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:8072-8084. [PMID: 38980785 DOI: 10.1109/tnnls.2024.3422216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/11/2024]
Abstract
Under low data regimes, few-shot object detection (FSOD) transfers related knowledge from base classes with sufficient annotations to novel classes with limited samples in a two-step paradigm, including base training and balanced fine-tuning. In base training, the learned embedding space needs to be dispersed with large class margins to facilitate novel class accommodation and avoid feature aliasing, whereas in balanced fine-tuning it should concentrate properly with small margins to represent novel classes precisely. Although this discrimination-representation dilemma has stimulated substantial progress, the equilibrium of class margins within the embedding space remains underexplored. In this study, we propose a class margin optimization scheme, termed explicit margin equilibrium (EME), by explicitly leveraging the quantified relationship between base and novel classes. EME first maximizes base-class margins to reserve adequate space to prepare for novel class adaptation. During fine-tuning, it quantifies the interclass semantic relationships by calculating equilibrium coefficients based on the assumption that novel instances can be represented by linear combinations of base-class prototypes. EME finally reweights the margin loss using the equilibrium coefficients to adapt base knowledge for novel instance learning with the help of instance disturbance (ID) augmentation. As a plug-and-play module, EME can also be applied to few-shot classification. Consistent performance gains upon various baseline methods and benchmarks validate the generality and efficacy of EME. The code is available at github.com/Bohao-Lee/EME.
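A minimal sketch of the equilibrium-coefficient idea, under the stated assumption that a novel prototype is a linear combination of base-class prototypes: solve a least-squares problem for the coefficients and reuse them to reweight a per-class margin loss. Everything below (shapes, names, the least-squares solver) is illustrative rather than the paper's implementation.

```python
import numpy as np

def equilibrium_coefficients(base_prototypes, novel_prototype):
    """base_prototypes: (K, D); novel_prototype: (D,). Returns (K,) coefficients."""
    coeffs, *_ = np.linalg.lstsq(base_prototypes.T, novel_prototype, rcond=None)
    return coeffs

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 16))          # 5 base-class prototypes
novel = 0.6 * base[1] + 0.4 * base[3]    # a novel prototype close to two base classes
w = equilibrium_coefficients(base, novel)
print(np.round(w, 2))                    # large weights on the related base classes
```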
Collapse
|
23
|
Song Y, Liu Z, Li G, Xie J, Wu Q, Zeng D, Xu L, Zhang T, Wang J. EMS: A Large-Scale Eye Movement Dataset, Benchmark, and New Model for Schizophrenia Recognition. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9451-9462. [PMID: 39178070 DOI: 10.1109/tnnls.2024.3441928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/25/2024]
Abstract
Schizophrenia (SZ) is a common and disabling mental illness, and most patients exhibit cognitive deficits. Eye-tracking technology has been increasingly used to characterize these cognitive deficits because of its reasonable time and economic costs. However, there is no large-scale, publicly available eye movement dataset and benchmark for SZ recognition. To address these issues, we release a large-scale Eye Movement dataset for SZ recognition (EMS), which consists of eye movement data from 104 schizophrenics and 104 healthy controls (HCs) based on the free-viewing paradigm with 100 stimuli. We also conduct the first comprehensive benchmark, which has long been absent in this field, to compare 13 related psychosis recognition methods using six metrics. In addition, we propose a novel mean-shift-based network (MSNet) for eye movement-based SZ recognition, which elaborately combines the mean shift algorithm with convolution to extract the cluster center as the subject feature. In MSNet, a stimulus feature branch (SFB) is first adopted to enhance each stimulus feature with similar information from all stimulus features, and then a cluster center branch (CCB) is utilized to generate the cluster center as the subject feature and update it with the mean shift vector. The performance of MSNet is superior to that of prior contenders; thus, it can serve as a powerful baseline to advance subsequent studies. To pave the way for research in this field, the EMS dataset, the benchmark results, and the code of MSNet are publicly available at https://github.com/YingjieSong1/EMS.
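The cluster-center extraction can be illustrated with a plain mean shift step over per-stimulus features; the sketch below is a toy version of that idea (the bandwidth, kernel, and variable names are assumptions, and MSNet combines this with convolutional feature learning rather than operating on raw features).

```python
import numpy as np

def mean_shift_center(features, bandwidth=1.0, iters=10):
    """features: (N, D) per-stimulus features of one subject; returns one cluster center."""
    center = features.mean(axis=0)
    for _ in range(iters):
        d2 = ((features - center) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))           # Gaussian kernel weights
        center = (w[:, None] * features).sum(axis=0) / w.sum()
    return center

feats = np.random.randn(100, 32)                          # 100 stimuli, 32-D features
subject_feature = mean_shift_center(feats)                # subject-level representation
```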
Collapse
|
24
|
Xu M, Dai N, Jiang L, Fu Y, Deng X, Li S. Recruiting Teacher IF Modality for Nephropathy Diagnosis: A Customized Distillation Method With Attention-Based Diffusion Network. IEEE TRANSACTIONS ON MEDICAL IMAGING 2025; 44:2028-2040. [PMID: 40030767 DOI: 10.1109/tmi.2024.3524544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
The joint use of multiple modalities for medical image processing has been widely studied in recent years. The fusion of information from different modalities has demonstrated performance improvements for many medical tasks. For nephropathy diagnosis, immunofluorescence (IF) is one of the most widely used multi-modality medical imaging techniques due to its ease of acquisition and its effectiveness for certain nephropathies. However, existing methods mainly assume that different modalities have an equal effect on the diagnosis task, failing to exploit multi-modality knowledge in detail. To address this limitation, this paper proposes a novel customized multi-teacher knowledge distillation framework to transfer knowledge from the trained single-modality teacher networks to a multi-modality student network. Specifically, a new attention-based diffusion network is developed for IF-based diagnosis, considering global, local, and modality attention. Besides, a teacher recruitment module and a diffusion-aware distillation loss are developed to learn to select the effective teacher networks based on the medical priors of the input IF sequence. The experimental results on the test and external datasets show that the proposed method achieves better nephropathy diagnosis performance and generalizability than state-of-the-art methods.
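A hedged sketch of the distillation signal described above: the student's prediction is pulled toward a per-sample weighted mixture of teacher outputs, with the weights playing the role of teacher-recruitment scores. The loss form, temperature, and names below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights, T=2.0):
    """student_logits: (B, C); teacher_logits_list: list of (B, C); weights: (B, n_teachers)."""
    mixed = torch.zeros_like(student_logits)
    for i, t in enumerate(teacher_logits_list):
        mixed = mixed + weights[:, i:i + 1] * F.softmax(t / T, dim=1)   # weighted teacher mixture
    return F.kl_div((student_logits / T).log_softmax(dim=1), mixed,
                    reduction="batchmean") * T * T

s = torch.randn(4, 3)                                    # student logits for 4 samples, 3 classes
teachers = [torch.randn(4, 3), torch.randn(4, 3)]        # two single-modality teachers
w = torch.softmax(torch.randn(4, 2), dim=1)              # per-sample "recruitment" weights
loss = multi_teacher_kd_loss(s, teachers, w)
```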
Collapse
|
25
|
Zhou H, Yang R, Zhang Y, Duan H, Huang Y, Hu R, Li X, Zheng Y. UniHead: Unifying Multi-Perception for Detection Heads. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9565-9576. [PMID: 38905097 DOI: 10.1109/tnnls.2024.3412947] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/23/2024]
Abstract
The detection head constitutes a pivotal component within object detectors, tasked with executing both classification and localization functions. Regrettably, the commonly used parallel head often lacks omni perceptual capabilities, such as deformation perception (DP), global perception (GP), and cross-task perception (CTP). Despite numerous methods attempting to enhance these abilities from a single aspect, achieving a comprehensive and unified solution remains a significant challenge. In response to this challenge, we develop an innovative detection head, termed UniHead, to unify three perceptual abilities simultaneously. More precisely, our approach: 1) introduces DP, enabling the model to adaptively sample object features; 2) proposes a dual-axial aggregation transformer (DAT) to adeptly model long-range dependencies, thereby achieving GP; and 3) devises a cross-task interaction transformer (CIT) that facilitates interaction between the classification and localization branches, thus aligning the two tasks. As a plug-and-play method, the proposed UniHead can be conveniently integrated with existing detectors. Extensive experiments on the COCO dataset demonstrate that our UniHead can bring significant improvements to many detectors. For instance, the UniHead can obtain +2.7 AP gains in RetinaNet, +2.9 AP gains in FreeAnchor, and +2.1 AP gains in GFL. The code is available at https://github.com/zht8506/UniHead.
Collapse
|
26
|
Bao J, Zhang J, Zhang C, Bao L. DCTCNet: Sequency discrete cosine transform convolution network for visual recognition. Neural Netw 2025; 185:107143. [PMID: 39847941 DOI: 10.1016/j.neunet.2025.107143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2024] [Revised: 01/01/2025] [Accepted: 01/09/2025] [Indexed: 01/25/2025]
Abstract
The discrete cosine transform (DCT) has been widely used in computer vision tasks due to its high compression ratio and high-quality visual presentation. However, conventional DCT is usually affected by the size of the transform region and suffers from blocking effects. Therefore, eliminating blocking effects so that the transform can efficiently serve vision tasks is significant and challenging. In this paper, we introduce All Phase Sequency DCT (APSeDCT) into convolutional networks to extract multi-frequency information from deep features. Because APSeDCT is equivalent to a convolution operation, we construct a corresponding convolution module called APSeDCT Convolution (APSeDCTConv) that has great transferability similar to vanilla convolution. Then we propose an augmented convolutional operator called MultiConv with APSeDCTConv. By replacing the last three bottleneck blocks of ResNet with MultiConv, our approach not only reduces the computational costs and the number of parameters, but also exhibits great performance in classification, object detection and instance segmentation tasks. Extensive experiments show that APSeDCTConv augmentation leads to consistent performance improvements in image classification on ImageNet across various models and scales, including ResNet, Res2Net and ResNeXt, and achieves 0.5%-1.1% and 0.4%-0.7% AP improvements for object detection and instance segmentation, respectively, on the COCO benchmark compared to the baseline.
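The sense in which a DCT can act as a convolution is easy to illustrate: the 2-D DCT basis images can be used as fixed convolution kernels whose responses are the block's frequency coefficients. The sketch below builds such a basis; the all-phase sequency variant used in the paper adds further structure that is not reproduced here.

```python
import numpy as np

def dct_basis(n=4):
    """Returns (n*n, n, n) 2-D DCT-II basis filters (orthonormal)."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)                           # DC row normalization
    return np.stack([np.outer(c[u], c[v]) for u in range(n) for v in range(n)])

filters = dct_basis(4)                                   # 16 fixed 4x4 "frequency" kernels
patch = np.random.rand(4, 4)
coeffs = (filters * patch).sum(axis=(1, 2))              # per-filter responses = DCT coefficients
```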
Collapse
Affiliation(s)
- Jiayong Bao
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
| | - Jiangshe Zhang
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China.
| | - Chunxia Zhang
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
| | - Lili Bao
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
| |
Collapse
|
27
|
Zhang K, Zhu D, Min X, Zhai G. Unified Approach to Mesh Saliency: Evaluating Textured and Non-Textured Meshes Through VR and Multifunctional Prediction. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2025; 31:3151-3160. [PMID: 40063447 DOI: 10.1109/tvcg.2025.3549550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/26/2025]
Abstract
Mesh saliency aims to empower artificial intelligence with strong adaptability to highlight regions that naturally attract visual attention. Existing advances primarily emphasize the crucial role of geometric shapes in determining mesh saliency, but it remains challenging to flexibly sense the unique visual appeal brought by the realism of complex texture patterns. To investigate the interaction between geometric shapes and texture features in visual perception, we establish a comprehensive mesh saliency dataset, capturing saliency distributions for identical 3D models under both non-textured and textured conditions. Additionally, we propose a unified saliency prediction model applicable to various mesh types, providing valuable insights for both detailed modeling and realistic rendering applications. This model effectively analyzes the geometric structure of the mesh while seamlessly incorporating texture features into the topological framework, ensuring coherence throughout appearance-enhanced modeling. Through extensive theoretical and empirical validation, our approach not only enhances performance across different mesh types, but also demonstrates the model's scalability and generalizability, particularly through cross-validation of various visual features.
Collapse
|
28
|
Luo X, Duan Z, Qin A, Tian Z, Xie T, Zhang T, Tang YY. Layer-Wise Mutual Information Meta-Learning Network for Few-Shot Segmentation. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9684-9698. [PMID: 39255186 DOI: 10.1109/tnnls.2024.3438771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/12/2024]
Abstract
The goal of few-shot segmentation (FSS) is to segment unlabeled images belonging to previously unseen classes using only a limited number of labeled images. The main objective is to transfer label information effectively from support images to query images. In this study, we introduce a novel meta-learning framework called layer-wise mutual information (LayerMI), which enhances the propagation of label information by maximizing the mutual information (MI) between support and query features at each layer. Our approach utilizes a LayerMI Block based on information-theoretic co-clustering. This block performs online co-clustering on the joint probability distribution obtained from each layer, generating a target-specific attention map. The LayerMI Block can be seamlessly integrated into the meta-learning framework and applied to all convolutional neural network (CNN) layers without altering the training objectives. Notably, the LayerMI Block not only maximizes MI between support and query features but also facilitates internal clustering within the image. Extensive experiments demonstrate that LayerMI significantly enhances the performance of the baseline and achieves competitive performance compared with state-of-the-art methods on three challenging benchmarks: PASCAL-5^i, COCO-20^i, and FSS-1000.
Collapse
|
29
|
Tan B, Xiao Y, Li S, Tong X, Yan T, Cao Z, Tianyi Zhou J. Language-Guided 3-D Action Feature Learning Without Ground-Truth Sample Class Label. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9356-9369. [PMID: 38865228 DOI: 10.1109/tnnls.2024.3409613] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2024]
Abstract
This work makes the first research effort to leverage point cloud sequence-based Self-supervised 3-D Action Feature Learning (S3AFL) under cross-modal weak supervision from text. We intend to fill the huge performance gap between point cloud sequence-based and 3-D skeleton-based methods. The key intuition derives from the observation that skeleton-based methods actually hold high-level knowledge of the human pose, which directs attention to the body's joint-aware local parts. Inspired by this, we propose to introduce weak textual supervision of high-level semantics into the point cloud sequence-based paradigm. Given an RGB-point cloud pair sequence acquired via an RGB-D camera, a text sequence is first generated from the RGB component using a pretrained image captioning model, as auxiliary weak supervision. Then, S3AFL runs in a cross- and intra-modality contrastive learning (CL) manner. To resist missing and redundant semantics in the text, feature learning is conducted in a multistage way with semantic refinement. Essentially, text is required only for training. To facilitate the features' representation power on fine-grained actions, a multirank max-pooling (MR-MP) scheme is also proposed for the point set network to better maintain discriminative clues. Experiments verify that the weak textual supervision can improve performance by up to 10.8%, 10.4%, and 8.0% on NTU RGB+D 60, NTU RGB+D 120, and N-UCLA, respectively. The performance gap between point cloud sequence-based and skeleton-based methods has been remarkably narrowed. The idea of transferring weak textual supervision to S3AFL can also be applied to skeleton-based methods, with strong generality. The source code is available at https://github.com/tangent-T/W3AMT.
Collapse
|
30
|
Shen X, Chen Y, Liu W, Zheng Y, Sun QS, Pan S. Graph Convolutional Multi-Label Hashing for Cross-Modal Retrieval. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:7997-8009. [PMID: 39028597 DOI: 10.1109/tnnls.2024.3421583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/21/2024]
Abstract
Cross-modal hashing encodes different modalities of multimodal data into a low-dimensional Hamming space for fast cross-modal retrieval. In multi-label cross-modal retrieval, multimodal data are often annotated with multiple labels, and some labels, e.g., "ocean" and "cloud," often co-occur. However, existing cross-modal hashing methods overlook label dependency, which is crucial for improving performance. To fill this gap, this article proposes graph convolutional multi-label hashing (GCMLH) for effective multi-label cross-modal retrieval. Specifically, GCMLH first generates a word embedding for each label and develops a label encoder to learn highly correlated label embeddings via a graph convolutional network (GCN). In addition, GCMLH develops a feature encoder for each modality and a feature fusion module to generate highly semantic features via the GCN. GCMLH uses a teacher-student learning scheme to transfer knowledge from the teacher modules, i.e., the label encoder and feature fusion module, to the student module, i.e., the feature encoder, such that the learned hash codes can well exploit multi-label dependency and multimodal semantic structure. Extensive empirical results on several benchmarks demonstrate the superiority of the proposed method over existing state-of-the-art methods.
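For readers unfamiliar with the GCN side, the sketch below shows a single symmetric-normalized GCN propagation step over a label co-occurrence graph, which is the standard operation behind learning correlated label embeddings; the adjacency construction and dimensions are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def gcn_layer(A, X, W):
    """A: (L, L) label adjacency; X: (L, D) label embeddings; W: (D, D_out) weights."""
    A_hat = A + np.eye(A.shape[0])                        # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)   # ReLU activation

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # e.g. "ocean" and "cloud" co-occur
X = np.random.randn(3, 8)                                       # word embeddings of 3 labels
W = np.random.randn(8, 8)
label_embeddings = gcn_layer(A, X, W)                            # correlation-aware label embeddings
```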
Collapse
|
31
|
Shao Y, Guo D, Cui Y, Wang Z, Zhang L, Zhang J. Graph Attention Network for Context-Aware Visual Tracking. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9474-9487. [PMID: 39321013 DOI: 10.1109/tnnls.2024.3442290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/27/2024]
Abstract
Siamese-network-based trackers formulate general object tracking as a similarity matching task between a template and a search region. Using convolutional feature cross correlation (Xcorr) for similarity matching, a large number of Siamese trackers have been proposed and have achieved great success. However, due to the predefined size of the target feature, these trackers suffer from either retaining much background information or losing important foreground information. Moreover, the global matching between the target and search region also largely neglects the part-level structural information and the contextual information of the target. To tackle the aforementioned obstacles, in this article, we propose a simple context-aware Siamese graph attention network, which establishes part-to-part correspondence between the Siamese branches with a complete bipartite graph. The object information from the template is propagated to the search region via a graph attention mechanism. With such a design, a target-aware template input replaces the prefixed template region and can adaptively fit the size and aspect ratio variations of different objects. Based on this, we further construct a context-aware feature matching mechanism to embed both the target and the contextual information in the search region. Experiments on challenging benchmarks including GOT-10k, TrackingNet, LaSOT, VOT2020, and OTB-100 demonstrate that the proposed SiamGAT* outperforms many state-of-the-art trackers and achieves leading performance. Code is available at: https://git.io/SiamGAT.
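A toy sketch of the part-to-part propagation idea over a complete bipartite graph between template and search features is given below; it is illustrative only (the released SiamGAT code uses learned projections and a different fusion), and all tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def bipartite_graph_attention(template, search):
    """template: (Nt, D) template-part features; search: (Ns, D) search-region features."""
    scores = search @ template.t()                         # (Ns, Nt) pairwise affinities
    attn = F.softmax(scores / template.shape[1] ** 0.5, dim=1)
    propagated = attn @ template                           # target info propagated to search nodes
    return torch.cat([search, propagated], dim=1)          # fused features for downstream heads

t = torch.randn(16, 64)    # 4x4 template grid, 64-D features
s = torch.randn(625, 64)   # 25x25 search grid
fused = bipartite_graph_attention(t, s)
```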
Collapse
|
32
|
Chen Y, Xiao Z, Pan Y, Zhao L, Dai H, Wu Z, Li C, Zhang T, Li C, Zhu D, Liu T, Jiang X. Mask-Guided Vision Transformer for Few-Shot Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9636-9647. [PMID: 38976473 DOI: 10.1109/tnnls.2024.3418527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/10/2024]
Abstract
Learning with little data is challenging but often inevitable in various application scenarios where the labeled data are limited and costly. Recently, few-shot learning (FSL) gained increasing attention because of its generalizability of prior knowledge to new tasks that contain only a few samples. However, for data-intensive models such as vision transformer (ViT), current fine-tuning-based FSL approaches are inefficient in knowledge generalization and, thus, degenerate the downstream task performances. In this article, we propose a novel mask-guided ViT (MG-ViT) to achieve an effective and efficient FSL on the ViT model. The key idea is to apply a mask on image patches to screen out the task-irrelevant ones and to guide the ViT focusing on task-relevant and discriminative patches during FSL. Particularly, MG-ViT only introduces an additional mask operation and a residual connection, enabling the inheritance of parameters from pretrained ViT without any other cost. To optimally select representative few-shot samples, we also include an active learning-based sample selection method to further improve the generalizability of MG-ViT-based FSL. We evaluate the proposed MG-ViT on classification, object detection, and segmentation tasks using gradient-weighted class activation mapping (Grad-CAM) to generate masks. The experimental results show that the MG-ViT model significantly improves the performance and efficiency compared with general fine-tuning-based ViT and ResNet models, providing novel insights and a concrete approach toward generalizing data-intensive and large-scale deep learning models for FSL.
Collapse
|
33
|
Yang A, Miech A, Sivic J, Laptev I, Schmid C. Learning to Answer Visual Questions From Web Videos. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2025; 47:3202-3218. [PMID: 35533174 DOI: 10.1109/tpami.2022.3173208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting and show excellent results, in particular for rare answers. Furthermore, our method achieves competitive results on MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our VideoQA dataset generation approach generalizes to another source of web video and text data. We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations. Code, datasets and trained models are available on our project webpage (https://antoyang.github.io/just-ask.html).
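The contrastive objective between the video-question transformer and the answer transformer is essentially an in-batch cross-entropy over similarity scores. The sketch below shows that standard form under assumed embedding sizes; it is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def vqa_contrastive_loss(vq_emb, ans_emb, temperature=0.07):
    """vq_emb, ans_emb: (B, D) L2-normalized embeddings from the two transformers.
    Each video-question pair is pulled toward its own answer and pushed from the others."""
    logits = vq_emb @ ans_emb.t() / temperature
    targets = torch.arange(vq_emb.shape[0])
    return F.cross_entropy(logits, targets)

vq = F.normalize(torch.randn(8, 256), dim=1)
ans = F.normalize(torch.randn(8, 256), dim=1)
loss = vqa_contrastive_loss(vq, ans)
```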
Collapse
|
34
|
Zu K, Zhang H, Zhang L, Lu J, Xu C, Chen H, Zheng Y. EMBANet: A flexible efficient multi-branch attention network. Neural Netw 2025; 185:107248. [PMID: 39951863 DOI: 10.1016/j.neunet.2025.107248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2024] [Revised: 01/09/2025] [Accepted: 02/02/2025] [Indexed: 02/17/2025]
Abstract
Recent advances in the design of convolutional neural networks have shown that performance can be enhanced by improving the ability to represent multi-scale features. However, most existing methods either focus on designing more sophisticated attention modules, which leads to higher computational costs, or fail to effectively establish long-range channel dependencies, or neglect the extraction and utilization of structural information. This work introduces a novel module, the Multi-Branch Concatenation (MBC), designed to process input tensors and extract multi-scale feature maps. The MBC module introduces new degrees of freedom (DoF) in the design of attention networks by allowing for flexible adjustments to the types of transformation operators and the number of branches. This study considers two key transformation operators: multiplexing and splitting, both of which facilitate a more granular representation of multi-scale features and enhance the receptive field range. By integrating the MBC with an attention module, a Multi-Branch Attention (MBA) module is developed to capture channel-wise interactions within feature maps, thereby establishing long-range channel dependencies. Replacing the 3x3 convolutions in the bottleneck blocks of ResNet with the proposed MBA yields a new block, the Efficient Multi-Branch Attention (EMBA), which can be seamlessly integrated into state-of-the-art backbone CNN models. Furthermore, a new backbone network, named EMBANet, is constructed by stacking EMBA blocks. The proposed EMBANet has been thoroughly evaluated across various computer vision tasks, including classification, detection, and segmentation, consistently demonstrating superior performance compared to popular backbones.
Collapse
Affiliation(s)
- Keke Zu
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China; Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China.
| | - Hu Zhang
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China.
| | - Lei Zhang
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China.
| | - Jian Lu
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China.
| | - Chen Xu
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China.
| | - Hongyang Chen
- Research Center for Graph Computing, Zhejiang Lab, Hangzhou, China.
| | - Yu Zheng
- JD Intelligent Cities Research, Beijing, China.
| |
Collapse
|
35
|
Dang Z, Luo M, Wang J, Jia C, Han H, Wan H, Dai G, Chang X, Wang J. Disentangled Noisy Correspondence Learning. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2025; 34:2602-2615. [PMID: 40257891 DOI: 10.1109/tip.2025.3559457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/23/2025]
Abstract
Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises as MEI is not shared across modalities, thus aligning it in training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce DisNCL, a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning, to adaptively balance the extraction of modality-invariant information (MII) and MEI with certifiable optimal cross-modal disentanglement efficacy. DisNCL then enhances similarity predictions in modality-invariant subspace, thereby greatly boosting similarity-based alleviation strategy for noisy correspondences. Furthermore, DisNCL introduces soft matching targets to model noisy many-to-many relationships inherent in multi-modal inputs for noise-robust and accurate cross-modal alignment. Extensive experiments confirm DisNCL's efficacy by 2% average recall improvement. Mutual information estimation and visualization results show that DisNCL learns meaningful MII/MEI subspaces, validating our theoretical analyses.
Collapse
|
36
|
Devillers B, Maytie L, VanRullen R. Semi-Supervised Multimodal Representation Learning Through a Global Workspace. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:7843-7857. [PMID: 38954575 DOI: 10.1109/tnnls.2024.3416701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/04/2024]
Abstract
Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations or to translate signals from one domain to another (as in image captioning or text-to-image generation). However, current approaches mainly rely on brute-force supervised training over large multimodal datasets. In contrast, humans (and other animals) can learn useful multimodal representations from only sparse experience with matched cross-modal data. Here, we evaluate the capabilities of a neural network architecture inspired by the cognitive notion of a "global workspace" (GW): a shared representation for two (or more) input modalities. Each modality is processed by a specialized system (pretrained on unimodal data and subsequently frozen). The corresponding latent representations are then encoded to and decoded from a single shared workspace. Importantly, this architecture is amenable to self-supervised training via cycle-consistency: encoding-decoding sequences should approximate the identity function. For various pairings of vision-language modalities and across two datasets of varying complexity, we show that such an architecture can be trained to align and translate between two modalities with very little need for matched data (from four to seven times less than a fully supervised approach). The GW representation can be used advantageously for downstream classification and cross-modal retrieval tasks and for robust transfer learning. Ablation studies reveal that both the shared workspace and the self-supervised cycle-consistency training are critical to the system's performance.
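The cycle-consistency signal can be sketched in a few lines: encode a frozen unimodal latent into the shared workspace, decode it back, and penalize the round-trip error so that encoding followed by decoding approximates the identity. The linear encoder/decoder and sizes below are placeholders for illustration only, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(encode, decode, latent):
    """encode/decode: modality-specific maps to/from the global workspace; latent: (B, D)."""
    workspace = encode(latent)                 # project into the shared workspace
    return F.mse_loss(decode(workspace), latent)

enc = torch.nn.Linear(64, 32)                  # toy stand-ins for the workspace encoders/decoders
dec = torch.nn.Linear(32, 64)
z = torch.randn(16, 64)                        # frozen unimodal latent representations
loss = cycle_consistency_loss(enc, dec, z)
```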
Collapse
|
37
|
Boumendil A, Bechkit W, Benatchba K. On-Device Deep Learning: Survey on Techniques Improving Energy Efficiency of DNNs. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:7806-7821. [PMID: 39046860 DOI: 10.1109/tnnls.2024.3430028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/27/2024]
Abstract
Providing high-quality predictions is no longer the sole goal for neural networks. As we live in an increasingly interconnected world, these models need to match the constraints of resource-limited devices powering the Internet of Things (IoT) and embedded systems. Moreover, in the era of climate change, reducing the carbon footprint of neural networks is a critical step for green artificial intelligence, which is no longer an aspiration but a major need. Enhancing the energy efficiency of neural networks, in both training and inference phases, became a predominant research topic in the field. Training optimization has grown in interest recently but remains challenging, as it involves changes in the learning procedure that can impact the prediction quality significantly. This article presents a study on the most popular techniques aiming to reduce the energy consumption of neural networks' training. We first propose a classification of the methods before discussing and comparing the different categories. In addition, we outline some energy measurement techniques. We discuss the limitations identified during our study as well as some interesting directions, such as neuromorphic and reservoir computing (RC).
Collapse
|
38
|
Liu Y, Chen X, Zuo S. A deep learning-driven method for safe and effective ERCP cannulation. Int J Comput Assist Radiol Surg 2025; 20:913-922. [PMID: 39920403 DOI: 10.1007/s11548-025-03329-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2024] [Accepted: 01/25/2025] [Indexed: 02/09/2025]
Abstract
PURPOSE In recent years, the detection of the duodenal papilla and surgical cannula has become a critical task in computer-assisted endoscopic retrograde cholangiopancreatography (ERCP) cannulation operations. The complex surgical anatomy, coupled with the small size of the duodenal papillary orifice and its high similarity to the background, poses significant challenges to effective computer-assisted cannulation. To address these challenges, we present a deep learning-driven graphical user interface (GUI) to assist ERCP cannulation. METHODS Considering the characteristics of the ERCP scenario, we propose a deep learning method for duodenal papilla and surgical cannula detection, utilizing four swin transformer decoupled heads (4STDH). Four different prediction heads are employed to detect objects of different sizes. Subsequently, we integrate the swin transformer module to identify attention regions to explore prediction potential deeply. Moreover, we decouple the classification and regression networks, significantly improving the model's accuracy and robustness through the separation prediction. Simultaneously, we introduce a dataset on papilla and cannula (DPAC), consisting of 1840 annotated endoscopic images, which will be publicly available. We integrated 4STDH and several state-of-the-art methods into the GUI and compared them. RESULTS On the DPAC dataset, 4STDH outperforms state-of-the-art methods with an mAP of 93.2% and superior generalization performance. Additionally, the GUI provides real-time positions of the papilla and cannula, along with the planar distance and direction required for the cannula to reach the cannulation position. CONCLUSION We validate the GUI's performance in human gastrointestinal endoscopic videos, showing deep learning's potential to enhance the safety and efficiency of clinical ERCP cannulation.
Collapse
Affiliation(s)
- Yuying Liu
- Key Laboratory of Mechanism Theory and Equipment Design of Ministry of Education, Tianjin University, Tianjin, 300354, China
| | - Xin Chen
- Department of Gastroenterology, Tianjin Medical University General Hospital, Tianjin, 300052, China
| | - Siyang Zuo
- Key Laboratory of Mechanism Theory and Equipment Design of Ministry of Education, Tianjin University, Tianjin, 300354, China.
| |
Collapse
|
39
|
Liu W, Duinkharjav B, Sun Q, Zhang SQ. FovealNet: Advancing AI-Driven Gaze Tracking Solutions for Efficient Foveated Rendering in Virtual Reality. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2025; 31:3183-3193. [PMID: 40067704 DOI: 10.1109/tvcg.2025.3549577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/14/2025]
Abstract
Leveraging real-time eye tracking, foveated rendering optimizes hardware efficiency and enhances visual quality in virtual reality (VR). This approach leverages eye-tracking techniques to determine where the user is looking, allowing the system to render high-resolution graphics only in the foveal region (the small area of the retina where visual acuity is highest), while the peripheral view is rendered at lower resolution. However, modern deep learning-based gaze-tracking solutions often exhibit a long-tail distribution of tracking errors, which can degrade user experience and reduce the benefits of foveated rendering by causing misalignment and decreased visual quality. This paper introduces FovealNet, an advanced AI-driven gaze tracking framework designed to optimize system performance by strategically enhancing gaze tracking accuracy. To further reduce the implementation cost of the gaze tracking algorithm, FovealNet employs an event-based cropping method that eliminates over 64.8% of irrelevant pixels from the input image. Additionally, it incorporates a simple yet effective token-pruning strategy that dynamically removes tokens on the fly without compromising tracking accuracy. Finally, to support different runtime rendering configurations, we propose a system performance-aware multi-resolution training strategy, allowing the gaze tracking DNN to adapt and optimize overall system performance more effectively. Evaluation results demonstrate that FovealNet achieves at least a 1.42× speedup compared to previous methods and a 13% increase in perceptual quality for foveated output. The code is available at https://github.com/wl3181/FovealNet.
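A hedged sketch of what an event-based crop might look like is given below: only the bounding box of pixels that changed between consecutive frames is kept, so static background pixels never reach the gaze-tracking network. The thresholding and margin are assumptions; FovealNet's actual cropping method may differ substantially.

```python
import numpy as np

def event_crop(prev_frame, frame, threshold=0.05, margin=4):
    """prev_frame, frame: (H, W) grayscale images in [0, 1]; returns a cropped view of frame."""
    changed = np.abs(frame - prev_frame) > threshold
    if not changed.any():
        return frame                                      # fall back to the full frame
    ys, xs = np.nonzero(changed)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin, frame.shape[0])
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin, frame.shape[1])
    return frame[y0:y1, x0:x1]

prev = np.random.rand(128, 128)
curr = prev.copy()
curr[40:60, 50:80] += 0.2                                 # simulated eye movement
crop = event_crop(prev, curr)                             # only the moving region is retained
```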
Collapse
|
40
|
Noh S, Lee MS, Lee BD. Automated radiography assessment of ankle joint instability using deep learning. Sci Rep 2025; 15:15012. [PMID: 40301608 PMCID: PMC12041484 DOI: 10.1038/s41598-025-99620-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2025] [Accepted: 04/21/2025] [Indexed: 05/01/2025] Open
Abstract
This study developed and evaluated a deep learning (DL)-based system for automatically measuring talar tilt and anterior talar translation on weight-bearing ankle radiographs, which are key parameters in diagnosing ankle joint instability. The system was trained and tested using a dataset comprising of 1,452 anteroposterior radiographs (mean age ± standard deviation [SD]: 43.70 ± 22.60 years; age range: 6-87 years; males: 733, females: 719) and 2,984 lateral radiographs (mean age ± SD: 44.37 ± 22.72 years; age range: 6-92 years; male: 1,533, female: 1,451) from a total of 4,000 patients, provided by the National Information Society Agency. Patients who underwent joint fusion, bone grafting, or joint replacement were excluded. Statistical analyses, including correlation coefficient analysis and Bland-Altman plots, were conducted to assess the agreement and consistency between the DL-calculated and clinician-assessed measurements. The system demonstrated high accuracy, with strong correlations for talar tilt (Pearson correlation coefficient [r] = 0.798 (p < .001); intraclass correlation coefficient [ICC] = 0.797 [95% CI 0.74, 0.82]; concordance correlation coefficient [CCC] = 0.796 [95% CI 0.69, 0.85]; mean absolute error [MAE] = 1.088° [95% CI 0.06°, 1.14°]; mean square error [MSE] = 1.780° [95% CI 1.69°, 2.73°]; root mean square error [RMSE] = 1.374° [95% CI 1.31°, 1.44°]; 95% limit of agreement [LoA], 2.0° to - 2.3°) and anterior talar translation (r = .862 (p < .001); ICC = 0.861 [95% CI 0.84, 0.89]; CCC = 0.861 [95% CI 0.86, 0.89]; MAE = 0.468 mm [95% CI 0.42 mm, 0.51 mm]; MSE = 0.551 mm [95% CI 0.49 mm, 0.61 mm]; RMSE = 0.742 mm [95% CI 0.69 mm, 0.79 mm]; 95% LoA, 1.5 mm to - 1.3 mm). These results demonstrate the system's capability to provide objective and reproducible measurements, supporting clinical interpretation of ankle instability in routine radiographic practice.
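The agreement statistics quoted above (MAE, RMSE, and Bland-Altman 95% limits of agreement) are straightforward to compute from paired measurements; the sketch below shows the standard formulas on simulated data and is not tied to the study's dataset.

```python
import numpy as np

def agreement_stats(model_vals, reference_vals):
    """Returns (MAE, RMSE, Bland-Altman 95% limits of agreement) for paired measurements."""
    d = model_vals - reference_vals
    mae = np.abs(d).mean()
    rmse = np.sqrt((d ** 2).mean())
    loa = (d.mean() - 1.96 * d.std(ddof=1), d.mean() + 1.96 * d.std(ddof=1))
    return mae, rmse, loa

rng = np.random.default_rng(1)
ref = rng.normal(8.0, 3.0, 200)                 # e.g. clinician-measured talar tilt in degrees
pred = ref + rng.normal(0.0, 1.0, 200)          # model measurements with ~1 degree of noise
print(agreement_stats(pred, ref))
```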
Collapse
Affiliation(s)
- Seungha Noh
- Department of Computer Science, Graduate School, Kyonggi University, Suwon-si, Republic of Korea
| | - Mu Sook Lee
- Department of Radiology, Keimyung University Dongsan Hospital, Daegu, Republic of Korea
| | - Byoung-Dai Lee
- Division of AI and Computer Engineering, Kyonggi University, Suwon-si, Gyeonggi-do, 16227, Republic of Korea.
| |
Collapse
|
41
|
Zheng X, Xu Q, Zheng S, Zhao L, Liu D, Zhang L. Multiscale deformed attention networks for white blood cell detection. Sci Rep 2025; 15:14591. [PMID: 40287499 PMCID: PMC12033354 DOI: 10.1038/s41598-025-99165-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2024] [Accepted: 04/17/2025] [Indexed: 04/29/2025] Open
Abstract
White blood cell (WBC) detection is pivotal in medical diagnostics, crucial for diagnosing infections, inflammations, and certain cancers. Traditional WBC detection methods are labor-intensive and time-consuming. Convolutional Neural Networks (CNNs) are widely used for cell detection due to their strong feature extraction capability. However, they struggle with global information and long-distance dependencies in WBC images. Transformers, on the other hand, excel at modeling long-range dependencies, which improves their performance in vision tasks. To tackle the large foreground-background differences in WBC images, this paper introduces a novel WBC detection method, named the Multi-Scale Cross-Deformation Attention Fusion Network (MCDAF-Net), which combines CNNs and Transformers. The Attention Multi-scale Sensing Module (AMSM) is designed to localize WBCs more accurately by fusing features at different scales and enhancing feature representation through a self-attention mechanism. The Cross-Deformation Convolution Module (CDCM) reduces feature correlation, aiding the model in capturing diverse aspects and patterns in images, thereby improving generalization. MCDAF-Net outperforms other models on public datasets (LISC, BCCD, and WBCDD), demonstrating its superiority in WBC detection. Our code and pretrained models: https://github.com/xqq777/MCDAF-Net .
Collapse
Affiliation(s)
- Xin Zheng
- School of Computer and Information, Anqing Normal University, Anqing, 246133, China.
- The University Key Laboratory of Intelligent Perception and Computing of Anhui Province, School of Computer and Information, Anqing Normal University, Anqing, 246133, China.
| | - Qiqi Xu
- School of Computer and Information, Anqing Normal University, Anqing, 246133, China
| | - Shiyi Zheng
- School of Computer and Information, Anqing Normal University, Anqing, 246133, China
| | - Luxian Zhao
- School of Computer and Information, Anqing Normal University, Anqing, 246133, China
| | - Deyang Liu
- School of Computer and Information, Anqing Normal University, Anqing, 246133, China
- The University Key Laboratory of Intelligent Perception and Computing of Anhui Province, School of Computer and Information, Anqing Normal University, Anqing, 246133, China
| | - Liangliang Zhang
- School of Computer and Information, Anqing Normal University, Anqing, 246133, China
- The University Key Laboratory of Intelligent Perception and Computing of Anhui Province, School of Computer and Information, Anqing Normal University, Anqing, 246133, China
| |
Collapse
|
42
|
Dong X, Zhang C, Wang P, Chen D, Tu GJ, Zhao S, Xiang T. A Novel Dual-Network Approach for Real-Time Liveweight Estimation in Precision Livestock Management. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2025:e2417682. [PMID: 40285549 DOI: 10.1002/advs.202417682] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/28/2024] [Revised: 04/03/2025] [Indexed: 04/29/2025]
Abstract
The increasing demand for automation in livestock farming scenarios highlights the need for effective noncontact measurement methods. Current methods typically either require fixed postures and specific positions of the target animals or impose high computational demands, making them difficult to implement in practical situations. In this study, a novel dual-network framework is presented that extracts accurate contour information instead of segmented images from unconstrained pigs and then directly employs this information to obtain precise liveweight estimates. The experimental results demonstrate that the developed framework achieves high accuracy, providing liveweight estimates with an R² value of 0.993. When contour information is used directly to estimate the liveweight, the real-time performance of the framework can reach 1131.6 FPS. This achievement sets a new benchmark for accuracy and efficiency in non-contact liveweight estimation. Moreover, the framework holds significant practical value, equipping farmers with a robust and scalable tool for precision livestock management in dynamic, real-world farming environments. Additionally, the Liveweight and Instance Segmentation Annotation of Pigs dataset is introduced as a comprehensive resource designed to support further advancements and validation in this field.
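The second stage, regressing liveweight directly from contour information, can be illustrated with simple contour descriptors; the sketch below uses the shoelace formula for the enclosed area plus the perimeter and a placeholder linear model. The descriptor choice and coefficients are assumptions, not the paper's network.

```python
import numpy as np

def contour_descriptors(contour):
    """contour: (N, 2) ordered (x, y) points of a closed outline.
    Returns (area, perimeter); area via the shoelace formula."""
    x, y = contour[:, 0], contour[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    perimeter = np.sqrt((np.diff(contour, axis=0, append=contour[:1]) ** 2).sum(axis=1)).sum()
    return area, perimeter

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
contour = np.stack([80 * np.cos(theta), 50 * np.sin(theta)], axis=1)   # ellipse-like outline
area, perimeter = contour_descriptors(contour)
weight_kg = 0.004 * area + 0.05 * perimeter                            # placeholder linear regressor
```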
Collapse
Affiliation(s)
- Ximing Dong
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, 430070, China
| | - Caiming Zhang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, 430070, China
| | - Peiyuan Wang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, 430070, China
| | - Dexuan Chen
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, 430070, China
| | - Gang Jun Tu
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, 430070, China
| | - Shuhong Zhao
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, 430070, China
| | - Tao Xiang
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, 430070, China
| |
Collapse
|
43
|
Nakajima T, Maeda K, Togo R, Ogawa T, Haseyama M. Enhancing Adversarial Defense via Brain Activity Integration Without Adversarial Examples. SENSORS (BASEL, SWITZERLAND) 2025; 25:2736. [PMID: 40363174 DOI: 10.3390/s25092736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/26/2025] [Revised: 04/20/2025] [Accepted: 04/23/2025] [Indexed: 05/15/2025]
Abstract
Adversarial attacks on large-scale vision-language foundation models, such as the contrastive language-image pretraining (CLIP) model, can significantly degrade performance across various tasks by generating adversarial examples that are indistinguishable from the original images to human perception. Although adversarial training methods, which train models with adversarial examples, have been proposed to defend against such attacks, they typically require prior knowledge of the attack. These methods also lead to a trade-off between robustness to adversarial examples and accuracy for clean images. To address these challenges, we propose an adversarial defense method based on human brain activity data by hypothesizing that such adversarial examples are not misrecognized by humans. The proposed method employs an encoder that integrates the features of brain activity and augmented images from the original images. Then, by maximizing the similarity between features predicted by the encoder and the original visual features, we obtain features with the visual invariance of the human brain and the diversity of data augmentation. Consequently, we construct a model that is robust against adversarial attacks and maintains accuracy for clean images. Unlike existing methods, the proposed method is not trained on any specific adversarial attack information; thus, it is robust against unknown attacks. Extensive experiments demonstrate that the proposed method significantly enhances robustness to adversarial attacks on the CLIP model without degrading accuracy for clean images. The primary contribution of this study is that the performance trade-off can be overcome using brain activity data.
Collapse
Affiliation(s)
- Tasuku Nakajima
- Graduate School of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
| | - Keisuke Maeda
- Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
| | - Ren Togo
- Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
| | - Takahiro Ogawa
- Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
| | - Miki Haseyama
- Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
| |
Collapse
|
44
|
Kumar A, Mudgal A. Surrogate safety assessment in heterogeneous traffic environment prevailing in developing countries: a systematic literature review. Int J Inj Contr Saf Promot 2025:1-19. [PMID: 40279179 DOI: 10.1080/17457300.2025.2494209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2024] [Revised: 04/03/2025] [Accepted: 04/13/2025] [Indexed: 04/27/2025]
Abstract
Surrogate safety measures (SSMs) are widely used for proactive road safety assessments, reducing reliance on crash data. Despite their potential utility amid escalating road fatalities and lack of good quality crash data in developing countries, SSMs have been predominantly applied in developed countries, where traffic streams are homogeneous, and strict lane discipline is followed. In contrast, traffic in many developing countries (e.g. China and India) is characterized by vehicular heterogeneity and multi-vehicle interactions due to non-lane-based movements. This paper provides a systematic review of 102 peer-reviewed studies in developing countries focusing on vehicular conflicts in traffic streams with heterogeneous vehicle composition and disorderly movement. This review highlights the salient features and challenges associated with SSMs-based safety assessment in developing countries and outlines potential directions for future research. It examines data collection techniques, sample sizes, and the suitability of various conflict indicators for non-lane-based traffic. Additionally, the impact of vehicular heterogeneity on conflict modeling is analyzed. A detailed discussion of conflict segregation methodologies, threshold selection techniques, and modeling frameworks is provided. This review will likely assist in developing more efficient conflict-based safety assessment techniques in heterogeneous traffic, contributing to improved road safety in developing countries.
Collapse
Affiliation(s)
- Ashutosh Kumar
- Indian Institute of Technology (BHU) Varanasi, Uttar Pradesh, India
| | - Abhisek Mudgal
- Indian Institute of Technology (BHU) Varanasi, Uttar Pradesh, India
| |
Collapse
|
45
|
Gwon MG, Um GM, Cheong WS, Kim W. Eigenpose: Occlusion-Robust 3D Human Mesh Reconstruction. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2025; 34:2379-2391. [PMID: 40238616 DOI: 10.1109/tip.2025.3559788] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/18/2025]
Abstract
A new approach for occlusion-robust 3D human mesh reconstruction from a single image is introduced in this paper. Since occlusion has emerged as a major problem to be resolved in this field, there have been meaningful efforts to deal with various types of occlusions (e.g., person-to-person occlusion, person-to-object occlusion, self-occlusion, etc.). Although many recent studies have shown remarkable progress, previous regression-based methods still have limitations in handling occlusion problems due to the lack of appearance information. To address this problem, we propose a novel method for human mesh reconstruction based on pose-relevant subspace analysis. Specifically, we first generate a set of eigenvectors, so-called eigenposes, by conducting the singular value decomposition (SVD) of the pose matrix, which contains diverse poses sampled from the training set. These eigenposes are then linearly combined to construct a target body pose according to fusing coefficients, which are learned through the proposed network. Such a global combination of principal body postures (i.e., eigenposes) greatly helps to cope with partial ambiguities caused by occlusions. Furthermore, we also propose to exploit a joint injection module that efficiently incorporates the spatial information of visible joints into the encoded feature during the estimation process of fusing coefficients. Experimental results on benchmark datasets demonstrate the ability of the proposed method to robustly reconstruct the human mesh under various occlusions occurring in real-world scenarios. The code and model are publicly available at: https://github.com/DCVL-3D/Eigenpose_release.
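The eigenpose construction itself is a few lines of linear algebra: take the SVD of a matrix of training poses, keep the leading right singular vectors as eigenposes, and form a target pose as their linear combination. The sketch below is illustrative only; the pose dimensionality, the number of eigenposes kept, and the random coefficients standing in for the network's predicted fusing coefficients are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
pose_matrix = rng.normal(size=(1000, 72))      # 1000 training poses, 72 pose parameters each
_, _, vt = np.linalg.svd(pose_matrix, full_matrices=False)
eigenposes = vt[:32]                            # keep the 32 leading eigenposes, shape (32, 72)

coefficients = rng.normal(size=32)              # in the paper these are predicted by a network
reconstructed_pose = coefficients @ eigenposes  # (72,) target body pose as a linear combination
```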
Collapse
|
46
|
Xu X, Xi L, Zhu J, Feng C, Zhou P, Liu K, Shang Z, Shao Z. Intelligent Diagnosis of Cervical Lymph Node Metastasis Using a CNN Model. J Dent Res 2025:220345251322508. [PMID: 40271993 DOI: 10.1177/00220345251322508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/25/2025] Open
Abstract
Lymph node (LN) metastasis is a prevalent cause of recurrence in oral squamous cell carcinoma (OSCC), yet accurately identifying metastatic LNs (LNs+) remains challenging. This prospective clinical study aims to test the effectiveness of our convolutional neural network (CNN) model for identifying OSCC cervical LN+ on contrast-enhanced computed tomography (CECT) in clinical practice. A CNN model was developed and trained on a dataset of 8,380 CECT images from previous OSCC patients. It was then prospectively validated on 17,777 preoperative CECT images from 354 OSCC patients between October 17, 2023, and August 31, 2024. The model's predicted LN results were provided to the surgical team without influencing surgical or treatment plans. During surgery, the predicted LN+ were identified and sent for separate pathological examination. The accuracy of the model's predictions was compared with that of human experts and verified against pathology reports. The capacity of the model to assist radiologists in LN+ diagnosis was also assessed. The CNN model was trained over 40 epochs and successfully validated after each epoch. Compared with human experts (2 radiologists, 2 surgeons, and 2 students), the CNN model achieved higher sensitivity (81.89% vs. 81.48%, 46.91%, 50.62%), specificity (99.31% vs. 99.15%, 98.36%, 96.27%), LN+ accuracy (76.19% vs. 75.43%, P = 0.854; 40.64%, P < 0.001; 37.44%, P < 0.001), and clinical accuracy (86.16% vs. 83%, 61%, 56%). With the model's assistance, the radiologists surpassed both their earlier predictions made without the model's support and the model's performance alone. The CNN model demonstrated an accuracy comparable to that of radiologists in identifying, locating, and predicting cervical LN+ in OSCC patients and has the potential to assist radiologists in making more accurate diagnoses.
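For readers unfamiliar with the reported metrics, the short sketch below shows how sensitivity, specificity, and accuracy are computed from a 2x2 confusion matrix; the counts are invented for illustration and are unrelated to the study's data.

```python
# Illustrative computation of the reported metrics from a confusion matrix.
# The counts below are invented for the example, not taken from the study.

tp, fn, tn, fp = 81, 18, 578, 4

sensitivity = tp / (tp + fn)                 # recall for metastatic nodes (LN+)
specificity = tn / (tn + fp)                 # true-negative rate for benign nodes
accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall agreement with pathology

print(f"sensitivity={sensitivity:.2%}, specificity={specificity:.2%}, accuracy={accuracy:.2%}")
```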
Collapse
Affiliation(s)
- X Xu
- The State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Day Surgery Center, School & Hospital of Stomatology, Wuhan University, Wuhan, China
| | - L Xi
- School of Computer Science, Wuhan University, Wuhan, China
| | - J Zhu
- The State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Department of Geriatric Dentistry, School & Hospital of Stomatology, Wuhan University, Wuhan, China
| | - C Feng
- The State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
| | - P Zhou
- Department of Radiology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
| | - K Liu
- The State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Department of Oral and Maxillofacial Head Neck Surgery, School & Hospital of Stomatology, Wuhan University, Wuhan, China
| | - Z Shang
- The State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
| | - Z Shao
- The State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Day Surgery Center, School & Hospital of Stomatology, Wuhan University, Wuhan, China
| |
Collapse
|
47
|
Zhang S, Wang Q, Liu J, Xiong H. ALPS: An Auto-Labeling and Pre-Training Scheme for Remote Sensing Segmentation With Segment Anything Model. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2025; 34:2408-2420. [PMID: 40184288 DOI: 10.1109/tip.2025.3556344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/06/2025]
Abstract
In the fast-growing field of Remote Sensing (RS) image analysis, the gap between massive unlabeled datasets and the ability to fully exploit them for advanced RS analytics presents a significant challenge. To fill this gap, our work introduces an innovative auto-labeling framework named ALPS (Automatic Labeling for Pre-training in Segmentation), which leverages the Segment Anything Model (SAM) to predict precise pseudo-labels for RS images without requiring prior annotations or additional prompts. The proposed pipeline significantly reduces the labor and resources traditionally required to annotate RS datasets. By constructing two comprehensive pseudo-labeled RS datasets via ALPS for pre-training, our approach improves downstream performance across various benchmarks, including iSAID and ISPRS Potsdam. Experiments demonstrate the effectiveness of our framework and its ability to generalize well across multiple tasks even when extensively annotated datasets are scarce, offering a scalable solution to automatic segmentation and annotation challenges in the field. The pipeline is also flexible and can be applied to medical image segmentation, notably improving performance. Note that ALPS uses a pre-trained SAM to semi-automatically annotate RS images without additional manual annotation: although each component in the pipeline is well established, integrating clustering algorithms with SAM and a novel pseudo-label alignment significantly enhances RS segmentation and serves as an off-the-shelf tool for preparing pre-training data. Our source code is available at: https://github.com/StriveZs/ALPS.
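The sketch below illustrates only the SAM mask-generation step that a pipeline of this kind builds on, using the public segment-anything API; the checkpoint path, the input tile, and the way masks are turned into pseudo-classes are placeholders, not the authors' ALPS implementation (which aligns pseudo-labels via clustering, omitted here).

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Sketch of automatic mask generation with a pre-trained SAM, the step an
# auto-labeling pipeline like ALPS builds on. Checkpoint path and input file
# are assumed local files; the pseudo-class assignment is a placeholder.

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("rs_tile.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)   # list of dicts with 'segmentation', 'area', ...

# A pseudo-label map could then be assembled by giving each mask a class id,
# e.g., from clustering per-mask features (omitted in this sketch).
pseudo_label = np.zeros(image.shape[:2], dtype=np.int32)
for idx, m in enumerate(sorted(masks, key=lambda d: d["area"], reverse=True), start=1):
    pseudo_label[m["segmentation"]] = idx
```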
Collapse
|
48
|
Le HH, Nguyen DMH, Bhatti OS, Kopácsi L, Ngo TP, Nguyen BT, Barz M, Sonntag D. I-MPN: inductive message passing network for efficient human-in-the-loop annotation of mobile eye tracking data. Sci Rep 2025; 15:14192. [PMID: 40268979 PMCID: PMC12019404 DOI: 10.1038/s41598-025-94593-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2024] [Accepted: 03/14/2025] [Indexed: 04/25/2025] Open
Abstract
Comprehending how humans process visual information in dynamic settings is crucial for psychology and for designing user-centered interactions. While mobile eye-tracking systems combining egocentric video and gaze signals can offer valuable insights, manual analysis of these recordings is time-intensive. In this work, we present a novel human-centered learning algorithm designed for automated object recognition in mobile eye-tracking settings. Our approach seamlessly integrates an object detector with a spatial relation-aware inductive message-passing network (I-MPN), harnessing node profile information and capturing object correlations. These mechanisms enable us to learn embedding functions that generalize to new object viewing angles, facilitating rapid adaptation and efficient reasoning in dynamic contexts as users navigate their environment. In experiments on three distinct video sequences, our interactive method shows significant performance improvements over fixed training/testing algorithms, even when trained on considerably smaller annotated samples collected through user feedback. Furthermore, we demonstrate exceptional efficiency in the data annotation process and surpass prior interactive methods that use complete object detectors, combine detectors with convolutional networks, or employ interactive video segmentation.
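To make the message-passing terminology concrete, here is a generic single-layer sketch of the kind of neighbor aggregation an inductive MPN performs over detected objects; it is not the I-MPN architecture itself, and the feature sizes, edge list, and GRU-based node update are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

# Generic message-passing layer over a graph of detected objects, included
# only to illustrate neighbor aggregation; feature sizes and the GRU update
# are illustrative assumptions, not the I-MPN architecture from the paper.

class MessagePassingLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.message = nn.Linear(2 * in_dim, out_dim)
        self.update = nn.GRUCell(out_dim, in_dim)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim); edge_index: (2, num_edges) with [src, dst] rows
        src, dst = edge_index
        msgs = self.message(torch.cat([x[src], x[dst]], dim=-1))   # per-edge messages
        agg = torch.zeros(x.size(0), msgs.size(-1), device=x.device)
        agg.index_add_(0, dst, msgs)                               # sum messages per node
        return self.update(agg, x)                                 # gated node update

x = torch.randn(5, 16)                                  # 5 detected objects, 16-d profiles
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
layer = MessagePassingLayer(16, 16)
print(layer(x, edge_index).shape)                       # torch.Size([5, 16])
```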
Collapse
Affiliation(s)
- Hoang H Le
- Interactive Machine Learning Department, German Research Center for Artificial Intelligence (DFKI), 66123, Saarbrücken, Germany.
- Mathematics and Computer Science Department, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam.
- Quy Nhon AI Research and Development Center, FPT Software, Quy Nhon, Vietnam.
| | - Duy M H Nguyen
- Interactive Machine Learning Department, German Research Center for Artificial Intelligence (DFKI), 66123, Saarbrücken, Germany.
- Max Planck Research School for Intelligent Systems (IMPRS-IS), 70569, Stuttgart, Germany.
- Machine Learning and Simulation Science Department, University of Stuttgart, 70569, Stuttgart, Germany.
| | - Omair Shahzad Bhatti
- Interactive Machine Learning Department, German Research Center for Artificial Intelligence (DFKI), 66123, Saarbrücken, Germany
| | - László Kopácsi
- Interactive Machine Learning Department, German Research Center for Artificial Intelligence (DFKI), 66123, Saarbrücken, Germany
| | - Thinh P Ngo
- Mathematics and Computer Science Department, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
| | - Binh T Nguyen
- Mathematics and Computer Science Department, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
| | - Michael Barz
- Interactive Machine Learning Department, German Research Center for Artificial Intelligence (DFKI), 66123, Saarbrücken, Germany
- Applied Artificial Intelligence Department, University of Oldenburg, 26129, Oldenburg, Germany
| | - Daniel Sonntag
- Interactive Machine Learning Department, German Research Center for Artificial Intelligence (DFKI), 66123, Saarbrücken, Germany
- Applied Artificial Intelligence Department, University of Oldenburg, 26129, Oldenburg, Germany
| |
Collapse
|
49
|
Yang Y, Li W, Liu R, Wu C, Ren J, Shi Y, Ge S. HindwingLib: A library of leaf beetle hindwings generated by Stable Diffusion and ControlNet. Sci Data 2025; 12:680. [PMID: 40268959 PMCID: PMC12019569 DOI: 10.1038/s41597-025-05010-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2024] [Accepted: 04/15/2025] [Indexed: 04/25/2025] Open
Abstract
The utilization of datasets of beetle hindwings is prevalent in research on the morphology and evolution of beetles, serving as a valuable tool for understanding evolutionary processes and functional adaptations under specific environmental conditions. However, collecting hindwing images of beetles poses several challenges, including limited sample availability, complex sample preparation procedures, and restricted public accessibility. Recently, a machine learning technique called Stable Diffusion has been developed to generate diverse images from a pretrained model guided by prompts. In this study, we introduce an approach that uses Stable Diffusion and ControlNet to generate beetle hindwing images, along with the results obtained from applying it to a diverse set of 200 leaf beetle hindwings. To demonstrate the fidelity of the synthetic hindwing images, we conducted a comparative analysis of three key metrics for evaluating image fidelity: the Structural Similarity Index (SSIM), Inception Score (IS), and Fréchet Inception Distance (FID). The results demonstrated strong alignment between the actual data and the synthetic images, confirming their high fidelity. This novel library of leaf beetle hindwings not only offers morphological images for use in machine learning but also showcases the broad applicability of the proposed methodology.
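The snippet below is a generic sketch of prompt- and edge-conditioned generation with Stable Diffusion and ControlNet via the public diffusers API; the checkpoints, the outline image, and the prompt are placeholders and do not reproduce the fine-tuned models or conditioning actually used to build HindwingLib.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Generic sketch of edge-conditioned image generation with Stable Diffusion +
# ControlNet. Checkpoints, prompt, and input file are placeholder assumptions,
# not the models or conditioning used for HindwingLib.

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# An edge map of an existing hindwing photo serves as the structural condition.
wing = cv2.imread("hindwing_outline.png", cv2.IMREAD_GRAYSCALE)   # assumed local file
edges = Image.fromarray(np.stack([cv2.Canny(wing, 100, 200)] * 3, axis=-1))

result = pipe(
    "a leaf beetle hindwing, dorsal view, white background",
    image=edges,
    num_inference_steps=30,
).images[0]
result.save("synthetic_hindwing.png")
```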
Collapse
Affiliation(s)
- Yi Yang
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing, 100101, China
- Department of Scientific Research, Beijing Planetarium, Xizhimenwai Street, Beijing, 100044, China
| | - WenJie Li
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - RuiZe Liu
- Center for Materials Science and Optoelectronics Engineering, University of Chinese Academy of Sciences, Beijing, 100049, China
| | - ChengZhe Wu
- Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, 100094, China
| | - Jing Ren
- State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, 100101, China
| | - YiShi Shi
- Center for Materials Science and Optoelectronics Engineering, University of Chinese Academy of Sciences, Beijing, 100049, China.
- Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, 100094, China.
| | - SiQin Ge
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing, 100101, China.
- University of Chinese Academy of Sciences, Beijing, 100049, China.
| |
Collapse
|
50
|
Liu X, Liu Y, Sui H, Qin C, Che Y, Guo Z. Anomaly detection in cropland monitoring using multiple view vision transformer. Sci Rep 2025; 15:14147. [PMID: 40269174 PMCID: PMC12019310 DOI: 10.1038/s41598-025-98405-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2024] [Accepted: 04/11/2025] [Indexed: 04/25/2025] Open
Abstract
In recent times, the importance of low-altitude security, especially in agricultural surveillance, has grown markedly. This paper puts forward a novel Internet of Drones framework tailored for low-altitude operations. Anomaly detection, which is pivotal for ensuring the integrity of the entire system, poses a substantial challenge: anomalies can range from unpredictable weather patterns over farmland to unauthorized intrusions. To address this, a comprehensive deep learning pipeline is proposed that deploys a vision transformer model featuring a unique attention mechanism. The pipeline includes the collection of a large set of normal and abnormal farmland images, followed by preprocessing to standardize the data. Anomaly detection is then carried out, and the model's performance is evaluated using sensitivity (92.8%), specificity (93.1%), accuracy (93.5%), and F1 score (94.1%). Comparative analysis with state-of-the-art algorithms shows the superiority of the proposed model. Future work will explore integrating data from thermal, infrared, or LIDAR sensors, enhancing the interpretability of the vision transformer model, and optimizing the deep learning pipeline to reduce computational complexity.
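As a minimal sketch of framing anomaly detection as binary image classification with a vision transformer backbone, the snippet below fine-tunes a standard ViT from timm on a dummy batch; the backbone name, input size, and two-class head are assumptions, and the paper's custom attention mechanism is not reproduced here.

```python
import timm
import torch
import torch.nn as nn

# Minimal sketch: anomaly detection as binary classification with a standard
# ViT backbone from timm. Backbone, input size, and labels are illustrative
# assumptions; the paper's custom attention mechanism is not reproduced.

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.randn(4, 3, 224, 224)     # dummy batch of farmland image crops
labels = torch.tensor([0, 1, 0, 0])      # 0 = normal, 1 = anomalous

logits = model(images)                   # (4, 2)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(float(loss))
```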
Collapse
Affiliation(s)
- Xuesong Liu
- James Watt School of Engineering, University of Glasgow, Glasgow, G12 8QQ, UK.
| | - Yansong Liu
- School of Intelligence Engineering, Shandong Management University, Jinan, 250357, China
| | - He Sui
- College of Aeronautical Engineering, Civil Aviation University of China, Tianjin, 300300, China
| | - Chuan Qin
- Department of Infrastructure Engineering, University of Melbourne, Melbourne, VIC, 3010, Australia
| | - Yuanxi Che
- Department of Computer Science, Xidian University, Xi'an, 710126, China
| | - Zhaobo Guo
- James Watt School of Engineering, University of Glasgow, Glasgow, G12 8QQ, UK
| |
Collapse
|