Abstract:
Accurate detection of small objects in complex road environments is essential for ensuring the safety, reliability, and robustness of autonomous driving systems. Under adverse conditions such as low illumination and fog, the performance of conventional vision-based perception systems degrades significantly: images captured in such environments often exhibit reduced contrast, blurred textures, occluded details, and indistinct object boundaries, caused by insufficient lighting, light scattering by fog droplets, and atmospheric attenuation. These degradations increase the likelihood of missed and false detections, posing substantial risks in urban traffic scenarios where vulnerable road users, including pedestrians and cyclists, frequently appear. To address these challenges, this study proposes a visual-feature-guided small-object detection framework with systematic enhancements in three areas: training data construction, network architecture design, and adaptive sample allocation.

Firstly, to overcome the scarcity of low-light foggy training data, a depth-aware physical model of atmospheric scattering is developed on top of the clear-weather KITTI dataset. The model simulates light scattering and attenuation under low-light foggy conditions by incorporating scene depth, fog density, and illumination intensity. A low-illumination rendering strategy is introduced, and the realism of the generated images is evaluated with the AGGD metric, enabling the creation of diverse and realistic nighttime foggy images. This data augmentation substantially improves the model's generalization capability under extreme weather conditions.

Secondly, in network design, a Multi-Layer Channel Fusion Module (MLCFM) is introduced within the YOLOv11 framework.
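The fog-rendering step described above can be sketched with the standard atmospheric scattering formulation, I(x) = J(x)t(x) + A(1 - t(x)) with transmission t(x) = exp(-beta * d(x)). This is a minimal illustration of the general technique, not the paper's calibrated model; the `beta` and `airlight` values are assumptions chosen for demonstration.

```python
import numpy as np

def render_foggy_lowlight(image, depth, beta=0.08, airlight=0.35):
    """Render a clear image as low-light fog via the standard
    atmospheric scattering model:
        I(x) = J(x) * t(x) + A * (1 - t(x)),  t(x) = exp(-beta * d(x))
    image:    clear-weather RGB in [0, 1], shape (H, W, 3)
    depth:    per-pixel scene depth in meters, shape (H, W)
    beta:     fog density coefficient (illustrative value)
    airlight: atmospheric light A; a low value mimics night-time fog
    """
    t = np.exp(-beta * depth)[..., None]  # transmission map, (H, W, 1)
    return image * t + airlight * (1.0 - t)

# Distant pixels fade toward the dim airlight; nearby ones keep contrast.
img = np.full((2, 2, 3), 0.8)
d = np.array([[5.0, 50.0], [5.0, 50.0]])
out = render_foggy_lowlight(img, d)
```

Because transmission decays exponentially with depth, far-away small objects lose most of their signal, which is exactly the regime the augmented training data targets.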
By splitting, reorganizing, and adaptively weighting feature channels across different levels, MLCFM preserves low-level texture details while enhancing high-level semantic discrimination, both of which are essential for small-object detection. In addition, a semantics-importance-driven dynamic multi-scale fusion structure adjusts fusion weights according to the semantic contribution of features at each scale. This mechanism strengthens the detection of small objects, such as pedestrians and cyclists, while maintaining global contextual information for larger objects, such as vehicles, thereby improving sensitivity to small objects without compromising overall scene understanding.

Finally, to address the difficulty of distinguishing targets from complex backgrounds and the imbalance between positive and negative samples in foggy scenes, an Adaptive Training Sample Selection (ATSS) strategy is introduced. ATSS dynamically determines positive and negative sample assignments from the spatial distribution and statistical characteristics of candidate bounding boxes, improving the model's attention to hard samples and reducing training instability under challenging conditions.

Extensive experiments, including joint testing and ablation studies on a self-constructed low-light foggy dataset and the original KITTI dataset, demonstrate the effectiveness of the proposed approach. Detection accuracy for the Car, Cyclist, and Pedestrian categories improves by 2.2, 11.8, and 7.8 percentage points, respectively, with an overall mean average precision (mAP@0.5) gain of 7.3 percentage points. Visualization results further show that the enhanced network produces clearer and more precise bounding boxes, substantially reducing missed and false detections.

In summary, this study presents a systematic small-object detection framework that introduces innovations in training data generation, feature-aware network design, and adaptive sample allocation.
The proposed method effectively improves small-object detection performance under low-light foggy conditions, providing critical support for the safety and reliability of autonomous driving perception systems in complex environments.
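To make the sample-assignment idea concrete, the core ATSS rule can be sketched for a single feature level: take the k anchors whose centers lie closest to the ground-truth center, then set an adaptive IoU threshold equal to the mean plus standard deviation of their IoUs. This is a simplified single-level illustration of the general ATSS scheme, not the paper's full multi-level implementation; the function and variable names are ours.

```python
import numpy as np

def iou(boxes, gt):
    """IoU between each box in (N, 4) and one GT box, format (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, 0], gt[0]); y1 = np.maximum(boxes[:, 1], gt[1])
    x2 = np.minimum(boxes[:, 2], gt[2]); y2 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    b = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (a + b - inter)

def atss_assign(gt_box, anchors, k=9):
    """Simplified single-level ATSS: select the k anchors whose centers are
    nearest the GT center, then keep as positives those whose IoU exceeds
    the adaptive threshold mean(IoU) + std(IoU) over the candidates."""
    gt_c = np.array([(gt_box[0] + gt_box[2]) / 2, (gt_box[1] + gt_box[3]) / 2])
    a_c = np.stack([(anchors[:, 0] + anchors[:, 2]) / 2,
                    (anchors[:, 1] + anchors[:, 3]) / 2], axis=1)
    dist = np.linalg.norm(a_c - gt_c, axis=1)
    cand = np.argsort(dist)[:k]               # nearest-k candidate anchors
    ious = iou(anchors[cand], gt_box)
    thr = ious.mean() + ious.std()            # adaptive per-object threshold
    return cand[ious >= thr]                  # indices of positive anchors

# Only the well-overlapping anchor survives the adaptive threshold.
gt = (0.0, 0.0, 10.0, 10.0)
anchors = np.array([[0, 0, 10, 10], [20, 20, 30, 30], [5, 5, 15, 15]], dtype=float)
positives = atss_assign(gt, anchors, k=3)
```

Because the threshold adapts per object, small objects whose candidate IoUs are uniformly low can still receive positive samples, which is consistent with the improved Pedestrian and Cyclist results reported above.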