Research Direction

Download research details:

Boosting Diffusion Guidance via Learning Degradation-Aware Models for Blind Super Resolution
WACV 2025
Shao Hao Lu, Ren Wang, Ching-Chun Huang, Wei-Chen Chiu
Abstract
   Existing blind diffusion-based super-resolution methods often struggle to strike a balance between fidelity and realism, mostly synthesizing high-quality details that deviate significantly from the ground truth (i.e., good realism but unsatisfactory fidelity). To this end, we propose to leverage the strength of a recent state-of-the-art non-blind diffusion-based method, which produces high-quality images while well maintaining fidelity, as the basis of our framework, with a specific extension designed to support unknown and potentially nonlinear degradation kernels. In particular, we learn an encoder to estimate a degradation representation from the input low-resolution images, together with two deep models trained to approximate the degradation and restoration processes, both conditioned on the estimated degradation representation. Moreover, we introduce an input-perturbation technique to improve the model's robustness against errors in the degradation approximation. All these efforts result in a blind super-resolution method that not only produces high-resolution images of high quality and good fidelity, but also has the flexibility to be integrated with any off-the-shelf diffusion model without finetuning or retraining. We conduct extensive experiments to verify the effectiveness of our proposed method, showing superior performance in comparison to various baselines.
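To make the idea above more concrete, below is a minimal sketch (in PyTorch, with hypothetical module names such as DegradationEncoder and DegradationNet, not the authors' code) of how a degradation-aware guidance step could combine an estimated degradation embedding, a learned degradation model, and input perturbation:

```python
# Minimal, hypothetical sketch of degradation-aware diffusion guidance.
# Module/function names are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DegradationEncoder(nn.Module):
    """Estimates a degradation embedding from the low-resolution input."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, dim))

    def forward(self, lr):
        return self.net(lr)

class DegradationNet(nn.Module):
    """Approximates the unknown HR -> LR degradation, conditioned on the embedding."""
    def __init__(self, dim=64, scale=4):
        super().__init__()
        self.scale = scale
        self.film = nn.Linear(dim, 6)               # per-channel scale/shift from the embedding
        self.blur = nn.Conv2d(3, 3, 5, padding=2)   # stands in for a learned blur

    def forward(self, hr, emb):
        gamma, beta = self.film(emb).chunk(2, dim=1)
        x = hr * gamma[..., None, None] + beta[..., None, None]
        return F.avg_pool2d(self.blur(x), self.scale)  # downscale to the LR size

def guided_step(x0_hat, lr, encoder, degrader, step_size=1.0, sigma=0.01):
    """One data-consistency guidance step: nudge the denoised estimate x0_hat toward
    agreement with the observed LR image under the learned degradation. The additive
    noise on x0_hat mimics the input perturbation used for robustness."""
    emb = encoder(lr)
    x = (x0_hat + sigma * torch.randn_like(x0_hat)).detach().requires_grad_(True)
    loss = F.mse_loss(degrader(x, emb), lr)
    grad, = torch.autograd.grad(loss, x)
    return x0_hat - step_size * grad
```

Such a step could, in principle, be inserted into any off-the-shelf diffusion sampler, which is in line with the plug-in flexibility the abstract emphasizes.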
Method
Blind Super Resolution with Reference Images and Implicit Degradation Representation
ACCV 2024
Huu-Phu Do, Po-Chih Hu, Hao-Chien Hsueh, Che-Kai Liu, Vu-Hoang Tran, Ching-Chun Huang
Abstract
   Previous studies in blind super-resolution (BSR) have primarily concentrated on estimating degradation kernels directly from low-resolution (LR) inputs to enhance super-resolution. However, these degradation kernels, which model the transition from a high-resolution (HR) image to its LR version, should account for not only the degradation process but also the downscaling factor. Applying the same degradation kernel across varying super-resolution scales may therefore be impractical. Our research acknowledges degradation kernels and scaling factors as pivotal elements for the BSR task and introduces a novel strategy that utilizes HR images as references to establish scale-aware degradation kernels. By employing content-irrelevant HR reference images alongside the target LR image, our model adaptively discerns the degradation process. This process is then applied to generate additional LR-HR pairs through down-sampling the HR reference images, which are key to improving the SR performance. Our reference-based training procedure is applicable to well-trained blind SR models and zero-shot blind SR methods, consistently outperforming previous methods in both scenarios. This dual consideration of blur kernels and scaling factors, coupled with the use of a reference image, contributes to the effectiveness of our approach in blind super-resolution tasks.
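As an illustration of the reference-based idea, here is a minimal sketch (hypothetical functions estimate_degradation and make_pseudo_pairs, not the authors' implementation) of generating extra LR-HR pairs by degrading HR reference images with a kernel estimated from the target LR image:

```python
# Hypothetical sketch: build pseudo LR-HR pairs from HR reference images using a
# degradation estimated from the target LR image. Function names are illustrative.
import torch
import torch.nn.functional as F

def estimate_degradation(lr_target):
    """Placeholder for the implicit degradation estimator; here it simply returns a
    fixed 3x3 blur kernel per RGB channel (an assumption for illustration)."""
    k = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
    return k.view(1, 1, 3, 3).repeat(3, 1, 1, 1)    # depthwise kernel, shape [3, 1, 3, 3]

def make_pseudo_pairs(hr_refs, lr_target, scale=4):
    """Degrade content-irrelevant HR references with the estimated kernel and a
    scale-aware downsampling, yielding extra (LR, HR) pairs for adaptation."""
    kernel = estimate_degradation(lr_target)
    pairs = []
    for hr in hr_refs:                               # each hr: [3, H, W]
        x = F.conv2d(hr.unsqueeze(0), kernel, padding=1, groups=3)   # estimated blur
        lr = F.interpolate(x, scale_factor=1.0 / scale, mode="bicubic",
                           align_corners=False)      # downscale by the target SR factor
        pairs.append((lr.squeeze(0), hr))
    return pairs
```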
Method
RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network
September 2024
Yong-Lin Cai, Van-Tin Luu, Yu-Heng Shih, Jun Chuan Chiew, Wei-Chen Chiu, Yi-Ting Chen, Vu-Hoang Tran, Po-Yen Chang and Ching-Chun Huang
Abstract
   This paper presents a groundbreaking approach - the first online automatic geometric calibration method for radar and camera systems. Given the significant data sparsity and measurement uncertainty in radar height data, achieving automatic calibration during system operation has long been a challenge. To address the sparsity issue, we propose a Dual-Perspective representation that gathers features from both frontal and bird’s-eye views. The frontal view contains rich but sensitive height information, whereas the bird’s-eye view provides robust features against height uncertainty. We thereby propose a novel Selective Fusion Mechanism to identify and fuse reliable features from both perspectives, reducing the effect of height uncertainty. Moreover, for each view, we incorporate a Multi-Modal Cross-Attention Mechanism to explicitly find location correspondences through cross-modal matching. During the training phase, we also design a Noise-Resistant Matcher to provide better supervision and enhance the robustness of the matching mechanism against sparsity and height uncertainty. Our experimental results, tested on the nuScenes dataset, demonstrate that our method outperforms existing state-of-the-art methods designed for LiDAR-camera auto-calibration, establishing a benchmark for future research.
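A minimal sketch of what a selective fusion between the two views could look like is given below; the module name SelectiveFusion, the per-pixel gating, and the assumption that both feature maps share the same spatial size are illustrative choices, not the paper's exact design:

```python
# Hypothetical sketch of a gated fusion between frontal-view and bird's-eye-view
# features; names, shapes, and the gating scheme are illustrative assumptions.
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    """Predicts a gate from both views and blends them, so that height-sensitive
    frontal features can be down-weighted where they appear unreliable. Both
    feature maps are assumed to have the same spatial size here."""
    def __init__(self, channels=128):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, frontal_feat, bev_feat):
        g = self.gate(torch.cat([frontal_feat, bev_feat], dim=1))
        return g * frontal_feat + (1 - g) * bev_feat
```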
Method
DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration
European Conference on Computer Vision (ECCV) 2024
Meng-Cheng Shih, Tsai-Ling Huang, Yu-Heng Shih, Hong-Han Shuai, Hsuan-Tung Liu, Yi-Ren Yeh, and Ching-Chun Huang
Abstract
   Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail-Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model's interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.
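For intuition, the following sketch shows one simple way local-structure matching between two signatures could be scored (hypothetical function local_match_score over generic patch tokens; the paper's Detail-Semantics Integrator and matcher are more elaborate):

```python
# Hypothetical sketch of scoring a signature pair by matching local patch tokens;
# the real matcher and the Detail-Semantics Integrator are more involved.
import torch
import torch.nn.functional as F

def local_match_score(tokens_a, tokens_b):
    """tokens_a: [N, C], tokens_b: [M, C] patch features from the two signatures.
    Returns a symmetric best-match similarity; higher suggests a genuine pair."""
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    sim = a @ b.t()                                   # [N, M] cosine similarities
    forward_match = sim.max(dim=1).values.mean()      # best match in B for each token in A
    backward_match = sim.max(dim=0).values.mean()     # best match in A for each token in B
    return 0.5 * (forward_match + backward_match)
```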
Method
MMM-TVR: Many-to-Many Matching for Text-Video Retrieval in a Novel Probabilistic Basic Concept Space
July 2024
Beatrice Adelaide Wibisono, Ching-Chun Huang, Hong-Han Shuai
Abstract
   Existing text-video retrieval (TVR) methods focus on modeling cross-modal interactions between text and video pairs through one-to-one mappings, likely due to incomplete pair labels or the difficulty of labeling all possible mappings with soft weights. Consequently, these methods overlook the complex many-to-many relationships in natural text-video pairings, reducing retrieval precision. To address these limitations, we introduce MMM-TVR, a robust framework for multiple concept representation, concept set-to-set similarity measurement, and pseudo-text-video pair discovery. Specifically, MMM-TVR employs a novel semi-supervised contrastive learning framework coupled with a Probabilistic Basic Concept Space (PBCS) constructed via Gaussian Mixture Models (GMM). With PBCS, we represent semantic concepts in a structured, probabilistic manner, allowing for more flexible representation. Unlike conventional methods, MMM-TVR utilizes ground-truth text-video pairs and effectively identifies and leverages semantically relevant pseudo pairs, thereby reducing false negative associations. These designs significantly enhance our model’s robustness and generalization capability, particularly in recognizing and handling inherent uncertainties and ambiguities in semantic matching. Extensive experiments show that our method performs better than state-of-the-art methods across four benchmark datasets: MSR-VTT, MSVD, ActivityNet, and DiDeMo. Moreover, our method proves effective in handling out-of-domain retrieval and scenarios involving multiple correspondences.
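The following sketch illustrates, under simplifying assumptions, how text and video could be compared through soft assignments to a shared set of basic concepts (hypothetical functions concept_distribution and pair_similarity; the paper builds its concept space with GMMs rather than the softmax surrogate used here):

```python
# Hypothetical sketch: compare text and video through soft assignments to shared
# "basic concepts"; the softmax assignment here replaces the paper's GMM-based space.
import torch
import torch.nn.functional as F

def concept_distribution(embeddings, concept_means, temperature=0.07):
    """embeddings: [N, D] token/frame features, concept_means: [K, D].
    Returns a single [K] probability vector aggregated over the N inputs."""
    logits = F.normalize(embeddings, dim=-1) @ F.normalize(concept_means, dim=-1).t()
    probs = F.softmax(logits / temperature, dim=-1)   # soft concept assignment per token
    return probs.mean(dim=0)                          # set-level concept distribution

def pair_similarity(text_emb, video_emb, concept_means):
    """Set-to-set similarity between one caption and one video in concept space."""
    p_text = concept_distribution(text_emb, concept_means)
    p_video = concept_distribution(video_emb, concept_means)
    return F.cosine_similarity(p_text, p_video, dim=0)
```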
Method
Result
Make an Omelette with Breaking Eggs: Zero-Shot Learning for Novel Attribute Synthesis
NeurIPS 2022
Yu-Hsuan Li*, Tzu-Yin Chao*, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu
Abstract
   Most of the existing algorithms for zero-shot classification problems typically rely on the attribute-based semantic relations among categories to realize the classification of novel categories without observing any of their instances. However, training the zero-shot classification models still requires attribute labeling for each class (or even instance) in the training dataset, which is also expensive. To this end, in this paper, we bring up a new problem scenario: ``Can we derive zero-shot learning for novel attribute detectors/classifiers and use them to automatically annotate the dataset for labeling efficiency?'' Basically, given only a small set of detectors that are learned to recognize some manually annotated attributes (i.e., the seen attributes), we aim to synthesize the detectors of novel attributes in a zero-shot learning manner. Our proposed method, Zero-Shot Learning for Attributes (ZSLA), which is the first of its kind to the best of our knowledge, tackles this new research problem by applying set operations to first decompose the seen attributes into their basic attributes and then recombine these basic attributes into the novel ones. Extensive experiments are conducted to verify the capacity of our synthesized detectors for accurately capturing the semantics of the novel attributes and show their superior performance in terms of detection and localization compared to other baseline approaches. Moreover, we demonstrate the application of automatic annotation using our synthesized detectors on the Caltech-UCSD Birds-200-2011 dataset. Various generalized zero-shot classification algorithms trained upon the dataset re-annotated by ZSLA show comparable performance with those trained with the manual ground-truth annotations.
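As a rough illustration of the decompose-and-recombine idea, the sketch below uses learned "intersection" and "union" operators over detector embeddings (the class name SetOps and the specific operator parameterization are assumptions, not the paper's definitions):

```python
# Hypothetical sketch of decompose-and-recombine over attribute-detector embeddings;
# the learned operators below are illustrative stand-ins for the paper's set operations.
import torch
import torch.nn as nn

class SetOps(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.intersect = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.union = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, seen_a, seen_b, seen_c, seen_d):
        # Decompose: two seen attributes sharing a color yield a basic "color" embedding,
        # two sharing a part yield a basic "part" embedding; recombine them into a
        # detector embedding for an attribute never annotated in training.
        color = self.intersect(torch.cat([seen_a, seen_b], dim=-1))
        part = self.intersect(torch.cat([seen_c, seen_d], dim=-1))
        return self.union(torch.cat([color, part], dim=-1))
```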
Method
Result
IZero-DNAS: Initialization-Independent Zero-Cost Differentiable Neural Architecture Search for Object Detection
Abstract
   Neural Architecture Search (NAS), which aims to find the optimal network structure under given hardware constraints, is typically a complex and time-consuming challenge. Recent research has focused on Zero-Cost Proxy methods, which can rank architectures without requiring training. However, most of these methods have demonstrated applicability primarily in small models and simpler tasks, such as classification. When applied to more complex tasks, they tend to produce unstable results in our experiments. Additionally, traditional differentiable-based and sampling-based NAS algorithms face difficulties when searching for large models due to GPU memory constraints and long training processes. In this paper, we introduce IZero-DNAS, a novel approach that pretrains the supernet for a few epochs and then combines the Zero-Cost Proxy with a differentiable-based method to search for the optimal network structure for object detection. Furthermore, we analyze the temperature settings of the Gumbel-Softmax used during the pretraining process of the supernet. Our experiments demonstrate that a lower temperature can reduce the interdependencies between candidate blocks in the supernet, thereby resulting in higher scores for zero-cost metrics. The results indicate that our proposed IZero-DNAS is capable of searching for superior network structures compared to manually designed counterparts such as Scaled-YOLOv4 and other NAS approaches for object detection.
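To illustrate the two ingredients named above, a minimal sketch follows: a Gumbel-Softmax mixed block whose temperature controls how hard the architecture choice is, and a simple SNIP-style zero-cost proxy (the names MixedBlock and zero_cost_score are hypothetical; the paper's proxy and search space may differ):

```python
# Hypothetical sketch of the two ingredients: a Gumbel-Softmax mixed block (low
# temperature -> near one-hot choices) and a SNIP-style zero-cost proxy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedBlock(nn.Module):
    """Blends candidate operations with Gumbel-Softmax weights over architecture
    parameters alpha; lowering the temperature sharpens the choice and reduces
    interdependence between candidates."""
    def __init__(self, channels=16):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity()])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x, temperature=0.5):
        weights = F.gumbel_softmax(self.alpha, tau=temperature, hard=False)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

def zero_cost_score(model, images, labels):
    """A SNIP-like proxy for any model that outputs class logits: sum of absolute
    loss gradients w.r.t. the weights after a single backward pass."""
    model.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    return sum(p.grad.abs().sum().item() for p in model.parameters() if p.grad is not None)
```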
Adaptive Model-Based Transfer Learning for Dynamic HVAC Control
Abstract
   In this paper, our objective is to automatically adjust the air handling unit (AHU) setpoint in the heating, ventilation, and air conditioning (HVAC) system to consistently maintain indoor temperatures at the desired level specified by the user. However, a significant challenge often arises in acquiring a substantial amount of high-quality sensor data from buildings. Consequently, numerous studies concentrate on investigating transfer learning and employ simulation software to generate training data. We introduce an adaptive model-based transfer learning approach for dynamic HVAC control. Unlike conventional pretraining methods, our approach allows the agent to directly control the source domain under conditions identical to the target domain. This avoids the necessity for extensive knowledge of the target domain to establish schedules for data generation and prevents the risk of generating samples that may never occur in the target domain. Furthermore, this method grants the agent considerable flexibility to learn within the source domain. At the HVAC control method level, we propose a physics rule embedding method and a setpoint selection strategy, both aimed at addressing potential issues in existing HVAC control methods. We embed the positive and negative relationships between inputs and outputs into the model training process, guiding the model towards convergence in the correct direction, with this constraint maintaining its efficacy over time. To mitigate the effects of drastic setpoint changes, our decision strategy takes into account the long-term impact of setpoints on the environment, resulting in smoother control. Lastly, to expedite and enhance the stability of intelligent HVAC control system deployment in new buildings, we streamline the process by eliminating the need for constructing a virtual source domain. Instead, we adopt a concept similar to the proposed transfer learning method to experiment with knowledge transfer between real buildings.
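One way the physics rule embedding could be realized is sketched below as a monotonicity penalty on a learned dynamics model (the function name monotonicity_penalty, the sign convention, and the finite-difference probe are assumptions for illustration):

```python
# Hypothetical sketch of "physics rule embedding" as a monotonicity penalty on a
# learned dynamics model; the sign convention and probe size are assumptions.
import torch

def monotonicity_penalty(dynamics_model, state, setpoint, delta=0.5):
    """Penalize predictions that violate a known sign relationship, e.g. raising the
    AHU setpoint should not lower the predicted indoor temperature."""
    temp_base = dynamics_model(state, setpoint)
    temp_perturbed = dynamics_model(state, setpoint + delta)
    # A violation occurs when the perturbed prediction falls below the base prediction.
    return torch.relu(temp_base - temp_perturbed).mean()
```

During training, such a penalty would be added with some weight to the model's prediction loss, so that the learned dynamics keep the intended input-output direction over time.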
Loss reweight to select data by data importance for 3D object detection
Abstract
   3D object detection remains a highly challenging task. This is due to the characteristics of LiDAR data, including the sparsity of point clouds, significant variations in point cloud density, and the ease with which objects can be occluded, resulting in incomplete point clouds. To accurately classify objects and determine their precise positions, models need to be more robust. Previous approaches focus on additional model design, more powerful feature extractors, and the ability to obtain finer features. We instead introduce a new issue and propose different solutions in the field of 3D object detection, focusing on sample processing to effectively improve performance without altering the model architecture. We found that even within the same category, point cloud variations can be extremely significant. Adding these samples to the training process does not necessarily enhance overall performance; on the contrary, it can lead to model confusion and reduced generalization. In this paper, we first examine the samples to identify additional information, including the disperse ratio and the model's reflection of sample variations in terms of classification loss and regression loss, and then use our Important Sample Selection (ISS) module to predict each sample's importance to the model. By adjusting the loss function, we enable the model to focus on learning from more informative samples. Finally, we evaluate our method on the KITTI dataset, a well-known 3D detection benchmark, training 3D object detectors with our module. Experimental results demonstrate that, by adopting our proposed learning approach, the model can improve overall detection performance without significantly increasing the number of parameters.
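A minimal sketch of the loss-reweighting idea is shown below (ImportancePredictor and its two input cues are hypothetical stand-ins for the ISS module described above):

```python
# Hypothetical sketch of importance-based loss reweighting; ImportancePredictor is an
# illustrative stand-in for the ISS module, using two simple per-sample cues.
import torch
import torch.nn as nn

class ImportancePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

    def forward(self, disperse_ratio, per_sample_loss):
        # Cues: how dispersed the object's points are, and how the model currently
        # reacts to the sample (its detached loss value).
        cues = torch.stack([disperse_ratio, per_sample_loss.detach()], dim=-1)
        return self.net(cues).squeeze(-1)              # importance weight in (0, 1)

def reweighted_loss(per_sample_loss, disperse_ratio, predictor):
    weights = predictor(disperse_ratio, per_sample_loss)
    return (weights * per_sample_loss).mean()          # informative samples count more
```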
Towards Understanding Cross Resolution Feature Matching for Surveillance Face Recognition
ACM MM 2022
Chiawei Kuo, Yi-Ting Tsai, Hong-Han Shuai, Yi-Ren Yeh, and Ching-Chun Huang
Abstract
   Cross-resolution face recognition (CRFR) in an open-set setting is a practical application for surveillance scenarios where low-resolution (LR) probe faces captured via surveillance cameras require being matched to a watchlist of high-resolution (HR) galleries. Although CRFR is of clear practical use, it sees a performance drop of more than 10% compared to that of high-resolution face recognition protocols. The challenges of CRFR are multifold, including the domain gap induced by the HR and LR images, the pose/texture variations, etc. To this end, this work systematically discusses possible issues and their solutions that affect the accuracy of CRFR. First, we explore the effect of resolution changes and conclude that resolution matching is the key for CRFR. Even simply downscaling the HR faces to match the LR ones brings a performance gain. Next, to further boost the accuracy of matching cross-resolution faces, we found that a well-designed super-resolution network, which can (a) represent the images continuously, is (b) suitable for real-world degradation kernels, (c) adaptive to different input resolutions, and (d) guided by an identity-preserved loss, is necessary to upsample the LR faces with discriminative enhancement. Here, the proposed identity-preserved loss plays the role of reconciling the objective discrepancy of super-resolution between human perception and machine recognition. Finally, we emphasize that removing the pose variations is an essential step before matching faces for recognition in the super-resolved feature space. Our method is evaluated on benchmark datasets, including SCface, cross-resolution LFW, and QMUL-Tinyface. The results show that the proposed method outperforms the SOTA methods by a clear margin and narrows the performance gap compared to the high-resolution face recognition protocol.
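The identity-preserved loss mentioned above could, in spirit, look like the sketch below, which adds a cosine distance between face-recognition embeddings to a pixel reconstruction term (the function name and the L1-plus-cosine combination are assumptions, not the paper's exact formulation):

```python
# Hypothetical sketch of an identity-preserved super-resolution loss: pixel
# reconstruction plus cosine distance between face-recognition embeddings.
import torch
import torch.nn.functional as F

def identity_preserved_loss(sr_image, hr_image, face_encoder, weight=0.1):
    """face_encoder is any fixed face-recognition backbone returning one embedding
    per image; the L1 + cosine combination and the weight are assumptions."""
    pixel_term = F.l1_loss(sr_image, hr_image)
    with torch.no_grad():
        target_id = F.normalize(face_encoder(hr_image), dim=-1)
    sr_id = F.normalize(face_encoder(sr_image), dim=-1)
    identity_term = 1.0 - (sr_id * target_id).sum(dim=-1).mean()
    return pixel_term + weight * identity_term
```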
MENTOR: Multilingual Text Detection Toward Learning by Analogy
Hsin-Ju Lin, Tsu-Chun Chung, Ching-Chun Hsiao, Pin-Yu Chen, Wei-Chen Chiu, Ching-Chun Huang
Abstract
   Text detection is frequently used in vision-based mobile robots when they need to interpret texts in their surroundings to perform a given task. For instance, delivery robots in multilingual cities need to be capable of doing multilingual text detection so that the robots can read traffic signs and road markings. Moreover, the target languages change from region to region, implying the need to efficiently re-train the models to recognize novel languages. However, collecting and labeling training data for novel languages are cumbersome, and the efforts to re-train an existing/trained text detector are considerable. Even worse, such a routine would repeat whenever a novel language appears. This motivates us to propose a new problem setting for tackling the aforementioned challenges in a more efficient way: “We ask for a generalizable multilingual text detection framework to detect and identify both seen and unseen language regions inside scene images without the requirement of collecting supervised training data for unseen languages as well as model re-training”. To this end, we propose “MENTOR”, the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection. During the training phase, we leverage the “zero-cost” synthesized printed texts and the available training/seen languages to learn the meta-mapping from printed texts to language-specific kernel weights. Meanwhile, dynamic convolution networks guided by the language-specific kernel are trained to realize a detection-by-feature-matching scheme. In the inference phase, “zero-cost” printed texts are synthesized given a new target language. By utilizing the learned meta-mapping and the matching network, our “MENTOR” can freely identify the text regions of the new language. Experiments show our model can achieve comparable results with supervised methods for seen languages and outperform other methods in detecting unseen languages.
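For illustration, the sketch below shows how printed-text exemplar features for a target language could be turned into a dynamic convolution kernel and matched against scene features (LanguageKernelMapper and the 1x1 kernel choice are simplifying assumptions, not MENTOR's actual architecture):

```python
# Hypothetical sketch: map printed-text exemplar features of a target language to a
# dynamic 1x1 convolution kernel and score scene features against it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageKernelMapper(nn.Module):
    """Turns pooled printed-text features into language-specific kernel weights;
    the 1x1 kernel and the mean pooling are simplifying assumptions."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.to_kernel = nn.Linear(feat_dim, feat_dim)

    def forward(self, printed_text_feats):            # [N, C] exemplar features
        descriptor = printed_text_feats.mean(dim=0)   # language descriptor
        return self.to_kernel(descriptor).view(1, -1, 1, 1)

def language_response(scene_feats, kernel):
    """scene_feats: [B, C, H, W]; returns a per-pixel score map for the language."""
    return torch.sigmoid(F.conv2d(scene_feats, kernel))
```

Because the kernel is produced from synthesized printed text alone, a new language can, in this scheme, be handled at inference time without collecting labeled scene images or re-training the detector.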