Research Direction

Download research details:

Boosting Diffusion Guidance via Learning Degradation-Aware Models for Blind Super Resolution
WACV 2025
Shao Hao Lu, Ren Wang, Ching-Chun Huang, Wei-Chen Chiu
Abstract
   Existing blind diffusion-based super-resolution methods often struggle to strike a balance between fidelity and realism, mostly synthesizing high-quality details that deviate significantly from the ground truth (i.e., good realism but unsatisfactory fidelity). To this end, we propose to leverage the strength of a recent state-of-the-art non-blind diffusion-based method, which produces high-quality images while well maintaining fidelity, as the basis of our framework, with a specific extension designed to support unknown and potentially nonlinear degradation kernels. In particular, we learn an encoder to estimate a degradation representation from the input low-resolution images, together with two deep models trained to approximate the degradation and restoration processes, both conditioned on the estimated degradation representation. Moreover, we introduce an input-perturbation technique to improve the model's robustness against errors in the degradation approximation. All these efforts result in a blind super-resolution method that not only produces high-resolution images of high quality and good fidelity, but also has the flexibility to be integrated with any off-the-shelf diffusion model without finetuning or retraining. We conduct extensive experiments to verify the effectiveness of our proposed method, showing superior performance in comparison to various baselines.
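To make the idea above more concrete, below is a minimal sketch (in PyTorch, with hypothetical module names such as DegradationEncoder and DegradationNet, not the authors' code) of how a degradation-aware guidance step could combine an estimated degradation embedding, a learned degradation model, and input perturbation:

```python
# Minimal, hypothetical sketch of degradation-aware diffusion guidance.
# Module/function names are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DegradationEncoder(nn.Module):
    """Estimates a degradation embedding from the low-resolution input."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, dim))

    def forward(self, lr):
        return self.net(lr)

class DegradationNet(nn.Module):
    """Approximates the unknown HR -> LR degradation, conditioned on the embedding."""
    def __init__(self, dim=64, scale=4):
        super().__init__()
        self.scale = scale
        self.film = nn.Linear(dim, 6)               # per-channel scale/shift from the embedding
        self.blur = nn.Conv2d(3, 3, 5, padding=2)   # stands in for a learned blur

    def forward(self, hr, emb):
        gamma, beta = self.film(emb).chunk(2, dim=1)
        x = hr * gamma[..., None, None] + beta[..., None, None]
        return F.avg_pool2d(self.blur(x), self.scale)  # downscale to the LR size

def guided_step(x0_hat, lr, encoder, degrader, step_size=1.0, sigma=0.01):
    """One data-consistency guidance step: nudge the denoised estimate x0_hat toward
    agreement with the observed LR image under the learned degradation. The additive
    noise on x0_hat mimics the input perturbation used for robustness."""
    emb = encoder(lr)
    x = (x0_hat + sigma * torch.randn_like(x0_hat)).detach().requires_grad_(True)
    loss = F.mse_loss(degrader(x, emb), lr)
    grad, = torch.autograd.grad(loss, x)
    return x0_hat - step_size * grad
```

Such a step could, in principle, be inserted into any off-the-shelf diffusion sampler, which is in line with the plug-in flexibility the abstract emphasizes.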
Method
Blind Super Resolution with Reference Images and Implicit Degradation Representation
ACCV 2024
Huu-Phu Do, Po-Chih Hu, Hao-Chien Hsueh, Che-Kai Liu, Vu-Hoang Tran, Ching-Chun Huang
Abstract
   Previous studies in blind super-resolution (BSR) have primarily concentrated on estimating degradation kernels directly from low-resolution (LR) inputs to enhance super-resolution. However, these degradation kernels, which model the transition from a high-resolution (HR) image to its LR version, should account for not only the degradation process but also the downscaling factor. Applying the same degradation kernel across varying super-resolution scales may therefore be impractical. Our research acknowledges degradation kernels and scaling factors as pivotal elements for the BSR task and introduces a novel strategy that utilizes HR images as references to establish scale-aware degradation kernels. By employing content-irrelevant HR reference images alongside the target LR image, our model adaptively discerns the degradation process. This process is then applied to generate additional LR-HR pairs through down-sampling the HR reference images, which are key to improving the SR performance. Our reference-based training procedure is applicable to well-trained blind SR models and zero-shot blind SR methods, consistently outperforming previous methods in both scenarios. This dual consideration of blur kernels and scaling factors, coupled with the use of a reference image, contributes to the effectiveness of our approach in blind super-resolution tasks.
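As an illustration of the reference-based idea, here is a minimal sketch (hypothetical functions estimate_degradation and make_pseudo_pairs, not the authors' implementation) of generating extra LR-HR pairs by degrading HR reference images with a kernel estimated from the target LR image:

```python
# Hypothetical sketch: build pseudo LR-HR pairs from HR reference images using a
# degradation estimated from the target LR image. Function names are illustrative.
import torch
import torch.nn.functional as F

def estimate_degradation(lr_target):
    """Placeholder for the implicit degradation estimator; here it simply returns a
    fixed 3x3 blur kernel per RGB channel (an assumption for illustration)."""
    k = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
    return k.view(1, 1, 3, 3).repeat(3, 1, 1, 1)    # depthwise kernel, shape [3, 1, 3, 3]

def make_pseudo_pairs(hr_refs, lr_target, scale=4):
    """Degrade content-irrelevant HR references with the estimated kernel and a
    scale-aware downsampling, yielding extra (LR, HR) pairs for adaptation."""
    kernel = estimate_degradation(lr_target)
    pairs = []
    for hr in hr_refs:                               # each hr: [3, H, W]
        x = F.conv2d(hr.unsqueeze(0), kernel, padding=1, groups=3)   # estimated blur
        lr = F.interpolate(x, scale_factor=1.0 / scale, mode="bicubic",
                           align_corners=False)      # downscale by the target SR factor
        pairs.append((lr.squeeze(0), hr))
    return pairs
```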
Method
RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network
September 2024
Yong-Lin Cai, Van-Tin Luu, Yu-Heng Shih, Jun Chuan Chiew, Wei-Chen Chiu, Yi-Ting Chen, Vu-Hoang Tran, Po-Yen Chang and Ching-Chun Huang
Abstract
   This paper presents a groundbreaking approach - the first online automatic geometric calibration method for radar and camera systems. Given the significant data sparsity and measurement uncertainty in radar height data, achieving automatic calibration during system operation has long been a challenge. To address the sparsity issue, we propose a Dual-Perspective representation that gathers features from both frontal and bird’s-eye views. The frontal view contains rich but sensitive height information, whereas the bird’s-eye view provides robust features against height uncertainty. We thereby propose a novel Selective Fusion Mechanism to identify and fuse reliable features from both perspectives, reducing the effect of height uncertainty. Moreover, for each view, we incorporate a Multi-Modal Cross-Attention Mechanism to explicitly find location correspondences through cross-modal matching. During the training phase, we also design a Noise-Resistant Matcher to provide better supervision and enhance the robustness of the matching mechanism against sparsity and height uncertainty. Our experimental results, tested on the nuScenes dataset, demonstrate that our method outperforms existing state-of-the-art methods designed for LiDAR-camera auto-calibration, establishing a benchmark for future research.
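A minimal sketch of what a selective fusion between the two views could look like is given below; the module name SelectiveFusion, the per-pixel gating, and the assumption that both feature maps share the same spatial size are illustrative choices, not the paper's exact design:

```python
# Hypothetical sketch of a gated fusion between frontal-view and bird's-eye-view
# features; names, shapes, and the gating scheme are illustrative assumptions.
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    """Predicts a gate from both views and blends them, so that height-sensitive
    frontal features can be down-weighted where they appear unreliable. Both
    feature maps are assumed to have the same spatial size here."""
    def __init__(self, channels=128):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, frontal_feat, bev_feat):
        g = self.gate(torch.cat([frontal_feat, bev_feat], dim=1))
        return g * frontal_feat + (1 - g) * bev_feat
```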
Method
DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration
European Conference on Computer Vision (ECCV) 2024
Meng-Cheng Shih, Tsai-Ling Huang, Yu-Heng Shih, Hong-Han Shuai, Hsuan-Tung Liu, Yi-Ren Yeh, and Ching-Chun Huang
Abstract
   Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail-Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model's interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.
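For intuition, the following sketch shows one simple way local-structure matching between two signatures could be scored (hypothetical function local_match_score over generic patch tokens; the paper's Detail-Semantics Integrator and matcher are more elaborate):

```python
# Hypothetical sketch of scoring a signature pair by matching local patch tokens;
# the real matcher and the Detail-Semantics Integrator are more involved.
import torch
import torch.nn.functional as F

def local_match_score(tokens_a, tokens_b):
    """tokens_a: [N, C], tokens_b: [M, C] patch features from the two signatures.
    Returns a symmetric best-match similarity; higher suggests a genuine pair."""
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    sim = a @ b.t()                                   # [N, M] cosine similarities
    forward_match = sim.max(dim=1).values.mean()      # best match in B for each token in A
    backward_match = sim.max(dim=0).values.mean()     # best match in A for each token in B
    return 0.5 * (forward_match + backward_match)
```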
Method
MMM-TVR: Many-to-Many Matching for Text-Video Retrieval in a Novel Probabilistic Basic Concept Space
July 2024
Beatrice Adelaide Wibisono, Ching-Chun Huang, Hong-Han Shuai
Abstract
   Existing text-video retrieval (TVR) methods focus on modeling cross-modal interactions between text and video pairs through one-to-one mappings, likely due to incomplete pair labels or the difficulty of labeling all possible mappings with soft weights. Consequently, these methods overlook the complex many-to-many relationships in natural text-video pairings, reducing retrieval precision. To address these limitations, we introduce MMM-TVR, a robust framework for multiple concept representation, concept set-to-set similarity measurement, and pseudo-text-video pair discovery. Specifically, MMM-TVR employs a novel semi-supervised contrastive learning framework coupled with a Probabilistic Basic Concept Space (PBCS) constructed via Gaussian Mixture Models (GMM). With PBCS, we represent semantic concepts in a structured, probabilistic manner, allowing for more flexible representation. Unlike conventional methods, MMM-TVR utilizes ground-truth text-video pairs and effectively identifies and leverages semantically relevant pseudo pairs, thereby reducing false negative associations. These designs significantly enhance our model’s robustness and generalization capability, particularly in recognizing and handling inherent uncertainties and ambiguities in semantic matching. Extensive experiments show that our method performs better than state-of-the-art methods across four benchmark datasets: MSR-VTT, MSVD, ActivityNet, and DiDeMo. Moreover, our method proves effective in handling out-of-domain retrieval and scenarios involving multiple correspondences.
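The following sketch illustrates, under simplifying assumptions, how text and video could be compared through soft assignments to a shared set of basic concepts (hypothetical functions concept_distribution and pair_similarity; the paper builds its concept space with GMMs rather than the softmax surrogate used here):

```python
# Hypothetical sketch: compare text and video through soft assignments to shared
# "basic concepts"; the softmax assignment here replaces the paper's GMM-based space.
import torch
import torch.nn.functional as F

def concept_distribution(embeddings, concept_means, temperature=0.07):
    """embeddings: [N, D] token/frame features, concept_means: [K, D].
    Returns a single [K] probability vector aggregated over the N inputs."""
    logits = F.normalize(embeddings, dim=-1) @ F.normalize(concept_means, dim=-1).t()
    probs = F.softmax(logits / temperature, dim=-1)   # soft concept assignment per token
    return probs.mean(dim=0)                          # set-level concept distribution

def pair_similarity(text_emb, video_emb, concept_means):
    """Set-to-set similarity between one caption and one video in concept space."""
    p_text = concept_distribution(text_emb, concept_means)
    p_video = concept_distribution(video_emb, concept_means)
    return F.cosine_similarity(p_text, p_video, dim=0)
```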
Method
Result
Make an Omelette with Breaking Eggs: Zero-Shot Learning for Novel Attribute Synthesis
NeurIPS 2022
Yu-Hsuan Li*, Tzu-Yin Chao*, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu
Abstract
   Most of the existing algorithms for zero-shot classification problems typically rely on the attribute-based semantic relations among categories to realize the classification of novel categories without observing any of their instances. However, training the zero-shot classification models still requires attribute labeling for each class (or even instance) in the training dataset, which is also expensive. To this end, in this paper, we bring up a new problem scenario: ``Can we derive zero-shot learning for novel attribute detectors/classifiers and use them to automatically annotate the dataset for labeling efficiency?'' Basically, given only a small set of detectors that are learned to recognize some manually annotated attributes (i.e., the seen attributes), we aim to synthesize the detectors of novel attributes in a zero-shot learning manner. Our proposed method, Zero-Shot Learning for Attributes (ZSLA), which is the first of its kind to the best of our knowledge, tackles this new research problem by applying set operations to first decompose the seen attributes into their basic attributes and then recombine these basic attributes into the novel ones. Extensive experiments are conducted to verify the capacity of our synthesized detectors for accurately capturing the semantics of the novel attributes and show their superior performance in terms of detection and localization compared to other baseline approaches. Moreover, we demonstrate the application of automatic annotation using our synthesized detectors on the Caltech-UCSD Birds-200-2011 dataset. Various generalized zero-shot classification algorithms trained upon the dataset re-annotated by ZSLA show comparable performance with those trained with the manual ground-truth annotations.
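As a rough illustration of the decompose-and-recombine idea, the sketch below uses learned "intersection" and "union" operators over detector embeddings (the class name SetOps and the specific operator parameterization are assumptions, not the paper's definitions):

```python
# Hypothetical sketch of decompose-and-recombine over attribute-detector embeddings;
# the learned operators below are illustrative stand-ins for the paper's set operations.
import torch
import torch.nn as nn

class SetOps(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.intersect = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.union = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, seen_a, seen_b, seen_c, seen_d):
        # Decompose: two seen attributes sharing a color yield a basic "color" embedding,
        # two sharing a part yield a basic "part" embedding; recombine them into a
        # detector embedding for an attribute never annotated in training.
        color = self.intersect(torch.cat([seen_a, seen_b], dim=-1))
        part = self.intersect(torch.cat([seen_c, seen_d], dim=-1))
        return self.union(torch.cat([color, part], dim=-1))
```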
Method
Result
IZero-DNAS: Initialization-Independent Zero-Cost Differentiable Neural Architecture Search for Object Detection
Abstract
   Neural Architecture Search (NAS), which aims to find the optimal network structure under given hardware constraints, is typically a complex and time-consuming challenge. Recent research has focused on Zero-Cost Proxy methods, which can rank architectures without requiring training. However, most of these methods have demonstrated applicability primarily in small models and simpler tasks, such as classification. When applied to more complex tasks, they tend to produce unstable results in our experiments. Additionally, traditional differentiable-based and sampling-based NAS algorithms face difficulties when searching for large models due to GPU memory constraints and long training processes. In this paper, we introduce IZero-DNAS, a novel approach that pretrains the supernet for a few epochs and then combines the Zero-Cost Proxy with a differentiable-based method to search for the optimal network structure for object detection. Furthermore, we analyze the temperature settings of the Gumbel-Softmax used during the pretraining process of the supernet. Our experiments demonstrate that a lower temperature can reduce the interdependencies between candidate blocks in the supernet, thereby resulting in higher scores for zero-cost metrics. The results indicate that our proposed IZero-DNAS is capable of searching for superior network structures compared to manually designed counterparts such as Scaled-YOLOv4 and other NAS approaches for object detection.
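To illustrate the two ingredients named above, a minimal sketch follows: a Gumbel-Softmax mixed block whose temperature controls how hard the architecture choice is, and a simple SNIP-style zero-cost proxy (the names MixedBlock and zero_cost_score are hypothetical; the paper's proxy and search space may differ):

```python
# Hypothetical sketch of the two ingredients: a Gumbel-Softmax mixed block (low
# temperature -> near one-hot choices) and a SNIP-style zero-cost proxy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedBlock(nn.Module):
    """Blends candidate operations with Gumbel-Softmax weights over architecture
    parameters alpha; lowering the temperature sharpens the choice and reduces
    interdependence between candidates."""
    def __init__(self, channels=16):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity()])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x, temperature=0.5):
        weights = F.gumbel_softmax(self.alpha, tau=temperature, hard=False)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

def zero_cost_score(model, images, labels):
    """A SNIP-like proxy for any model that outputs class logits: sum of absolute
    loss gradients w.r.t. the weights after a single backward pass."""
    model.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    return sum(p.grad.abs().sum().item() for p in model.parameters() if p.grad is not None)
```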
Adaptive Model-Based Transfer Learning for Dynamic HVAC Control
Abstract
   In this paper, our objective is to automatically adjust the air handling unit (AHU) setpoint in the heating, ventilation, and air conditioning (HVAC) system to consistently maintain indoor temperatures at the desired level specified by the user. However, a significant challenge often arises in acquiring a substantial amount of high-quality sensor data from buildings. Consequently, numerous studies concentrate on investigating transfer learning and employ simulation software to generate training data. We introduce an adaptive model-based transfer learning approach for dynamic HVAC control. Unlike conventional pretraining methods, our approach allows the agent to directly control the source domain under conditions identical to the target domain. This avoids the necessity for extensive knowledge of the target domain to establish schedules for data generation and prevents the risk of generating samples that may never occur in the target domain. Furthermore, this method grants the agent considerable flexibility to learn within the source domain. At the HVAC control method level, we propose a physics rule embedding method and a setpoint selection strategy, both aimed at addressing potential issues in existing HVAC control methods. We embed the positive and negative relationships between inputs and outputs into the model training process, guiding the model towards convergence in the correct direction, with this constraint maintaining its efficacy over time. To mitigate the effects of drastic setpoint changes, our decision strategy takes into account the long-term impact of setpoints on the environment, resulting in smoother control. Lastly, to expedite and enhance the stability of intelligent HVAC control system deployment in new buildings, we streamline the process by eliminating the need for constructing a virtual source domain. Instead, we adopt a concept similar to the proposed transfer learning method to experiment with knowledge transfer between real buildings.
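One way the physics rule embedding could be realized is sketched below as a monotonicity penalty on a learned dynamics model (the function name monotonicity_penalty, the sign convention, and the finite-difference probe are assumptions for illustration):

```python
# Hypothetical sketch of "physics rule embedding" as a monotonicity penalty on a
# learned dynamics model; the sign convention and probe size are assumptions.
import torch

def monotonicity_penalty(dynamics_model, state, setpoint, delta=0.5):
    """Penalize predictions that violate a known sign relationship, e.g. raising the
    AHU setpoint should not lower the predicted indoor temperature."""
    temp_base = dynamics_model(state, setpoint)
    temp_perturbed = dynamics_model(state, setpoint + delta)
    # A violation occurs when the perturbed prediction falls below the base prediction.
    return torch.relu(temp_base - temp_perturbed).mean()
```

During training, such a penalty would be added with some weight to the model's prediction loss, so that the learned dynamics keep the intended input-output direction over time.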
Loss reweight to select data by data importance for 3D object detection
Abstract
   3D object detection remains a highly challenging task. This is due to the characteristics of LiDAR data, including the sparsity of point clouds, significant variations in point cloud density, and the ease with which objects can be occluded, resulting in incomplete point clouds. To accurately classify objects and determine their precise positions, models need to be more robust. Previous approaches focus on additional model design, more powerful feature extractors, and the ability to obtain finer features. We instead introduce a new issue and propose different solutions in the field of 3D object detection, focusing on sample processing to effectively improve performance without altering the model architecture. We found that even within the same category, point cloud variations can be extremely significant. Adding these samples to the training process does not necessarily enhance overall performance; on the contrary, it can lead to model confusion and reduced generalization. In this paper, we first examine the samples to identify additional information, including the disperse ratio and the model's reflection of sample variations in terms of classification loss and regression loss, and then use our Important Sample Selection (ISS) module to predict each sample's importance to the model. By adjusting the loss function, we enable the model to focus on learning from more informative samples. Finally, we evaluate our method on the KITTI dataset, a well-known 3D detection benchmark, training 3D object detectors with our module. Experimental results demonstrate that, by adopting our proposed learning approach, the model can improve overall detection performance without significantly increasing the number of parameters.
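A minimal sketch of the loss-reweighting idea is shown below (ImportancePredictor and its two input cues are hypothetical stand-ins for the ISS module described above):

```python
# Hypothetical sketch of importance-based loss reweighting; ImportancePredictor is an
# illustrative stand-in for the ISS module, using two simple per-sample cues.
import torch
import torch.nn as nn

class ImportancePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

    def forward(self, disperse_ratio, per_sample_loss):
        # Cues: how dispersed the object's points are, and how the model currently
        # reacts to the sample (its detached loss value).
        cues = torch.stack([disperse_ratio, per_sample_loss.detach()], dim=-1)
        return self.net(cues).squeeze(-1)              # importance weight in (0, 1)

def reweighted_loss(per_sample_loss, disperse_ratio, predictor):
    weights = predictor(disperse_ratio, per_sample_loss)
    return (weights * per_sample_loss).mean()          # informative samples count more
```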
Towards Understanding Cross Resolution Feature Matching for Surveillance Face Recognition
ACM MM 2022
Chiawei Kuo, Yi-Ting Tsai, Hong-Han Shuai, Yi-Ren Yeh, and Ching-Chun Huang
Abstract
   Cross-resolution face recognition (CRFR) in an open-set setting is a practical application for surveillance scenarios where low-resolution (LR) probe faces captured via surveillance cameras require being matched to a watchlist of high-resolution (HR) galleries. Although CRFR is of clear practical use, it sees a performance drop of more than 10% compared to that of high-resolution face recognition protocols. The challenges of CRFR are multifold, including the domain gap induced by the HR and LR images, the pose/texture variations, etc. To this end, this work systematically discusses possible issues and their solutions that affect the accuracy of CRFR. First, we explore the effect of resolution changes and conclude that resolution matching is the key for CRFR. Even simply downscaling the HR faces to match the LR ones brings a performance gain. Next, to further boost the accuracy of matching cross-resolution faces, we found that a well-designed super-resolution network, which can (a) represent the images continuously, is (b) suitable for real-world degradation kernels, (c) adaptive to different input resolutions, and (d) guided by an identity-preserved loss, is necessary to upsample the LR faces with discriminative enhancement. Here, the proposed identity-preserved loss plays the role of reconciling the objective discrepancy of super-resolution between human perception and machine recognition. Finally, we emphasize that removing the pose variations is an essential step before matching faces for recognition in the super-resolved feature space. Our method is evaluated on benchmark datasets, including SCface, cross-resolution LFW, and QMUL-Tinyface. The results show that the proposed method outperforms the SOTA methods by a clear margin and narrows the performance gap compared to the high-resolution face recognition protocol.
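The identity-preserved loss mentioned above could, in spirit, look like the sketch below, which adds a cosine distance between face-recognition embeddings to a pixel reconstruction term (the function name and the L1-plus-cosine combination are assumptions, not the paper's exact formulation):

```python
# Hypothetical sketch of an identity-preserved super-resolution loss: pixel
# reconstruction plus cosine distance between face-recognition embeddings.
import torch
import torch.nn.functional as F

def identity_preserved_loss(sr_image, hr_image, face_encoder, weight=0.1):
    """face_encoder is any fixed face-recognition backbone returning one embedding
    per image; the L1 + cosine combination and the weight are assumptions."""
    pixel_term = F.l1_loss(sr_image, hr_image)
    with torch.no_grad():
        target_id = F.normalize(face_encoder(hr_image), dim=-1)
    sr_id = F.normalize(face_encoder(sr_image), dim=-1)
    identity_term = 1.0 - (sr_id * target_id).sum(dim=-1).mean()
    return pixel_term + weight * identity_term
```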
MENTOR: Multilingual Text Detection Toward Learning by Analogy
Hsin-Ju Lin, Tsu-Chun Chung, Ching-Chun Hsiao, Pin-Yu Chen, Wei-Chen Chiu, Ching-Chun Huang
Abstract
   Text detection is frequently used in vision-based mobile robots when they need to interpret texts in their surroundings to perform a given task. For instance, delivery robots in multilingual cities need to be capable of doing multilingual text detection so that the robots can read traffic signs and road markings. Moreover, the target languages change from region to region, implying the need to efficiently re-train the models to recognize novel languages. However, collecting and labeling training data for novel languages are cumbersome, and the efforts to re-train an existing/trained text detector are considerable. Even worse, such a routine would repeat whenever a novel language appears. This motivates us to propose a new problem setting for tackling the aforementioned challenges in a more efficient way: “We ask for a generalizable multilingual text detection framework to detect and identify both seen and unseen language regions inside scene images without the requirement of collecting supervised training data for unseen languages as well as model re-training”. To this end, we propose “MENTOR”, the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection. During the training phase, we leverage the “zero-cost” synthesized printed texts and the available training/seen languages to learn the meta-mapping from printed texts to language-specific kernel weights. Meanwhile, dynamic convolution networks guided by the language-specific kernel are trained to realize a detection-by-feature-matching scheme. In the inference phase, “zero-cost” printed texts are synthesized given a new target language. By utilizing the learned meta-mapping and the matching network, our “MENTOR” can freely identify the text regions of the new language. Experiments show our model can achieve comparable results with supervised methods for seen languages and outperform other methods in detecting unseen languages.
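For illustration, the sketch below shows how printed-text exemplar features for a target language could be turned into a dynamic convolution kernel and matched against scene features (LanguageKernelMapper and the 1x1 kernel choice are simplifying assumptions, not MENTOR's actual architecture):

```python
# Hypothetical sketch: map printed-text exemplar features of a target language to a
# dynamic 1x1 convolution kernel and score scene features against it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageKernelMapper(nn.Module):
    """Turns pooled printed-text features into language-specific kernel weights;
    the 1x1 kernel and the mean pooling are simplifying assumptions."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.to_kernel = nn.Linear(feat_dim, feat_dim)

    def forward(self, printed_text_feats):            # [N, C] exemplar features
        descriptor = printed_text_feats.mean(dim=0)   # language descriptor
        return self.to_kernel(descriptor).view(1, -1, 1, 1)

def language_response(scene_feats, kernel):
    """scene_feats: [B, C, H, W]; returns a per-pixel score map for the language."""
    return torch.sigmoid(F.conv2d(scene_feats, kernel))
```

Because the kernel is produced from synthesized printed text alone, a new language can, in this scheme, be handled at inference time without collecting labeled scene images or re-training the detector.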