Byeongho Heo
Naver Corporation
Deep learning, Algorithm, Machine learning, Benchmark (computing), Dimension (vector space), Artificial intelligence, Code (cryptography), Generalization, Pattern recognition, Distillation, Object detection, Computer vision, Computer science, Source code, Object (computer science), Feature (computer vision), Contextual image classification, Convolutional neural network, Segmentation, Robustness (computer science)
Publications 26
#1 Hwanjun Song, H-Index: 7
#2 Deqing Sun, H-Index: 31
Last: Ming-Hsuan Yang, H-Index: 126
(8 authors in total)
Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transforme...
#1 Junho Cho (Systems Research Institute), H-Index: 1
#2 Sangdoo Yun, H-Index: 21
Last: Jin Young Choi (Systems Research Institute), H-Index: 31
(5 authors in total)
Scene text removal is a challenging task that aims to erase wild text regions that include text strokes and their ambiguous boundaries, such as embossing, shade, or flare. The challenging issues raised in the wild are not completely addressed by the existing methods. To address these issues, we propose a new loss function for blending two tasks in a new network structure that depicts wild text regions in a soft mask and selectively inpaints them into a sensible background. The proposed loss func...
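The abstract describes representing wild text regions as a soft mask and selectively inpainting them. A common way to apply a soft mask, assumed here for illustration and not taken from the paper (the abstract is truncated before the details), is per-pixel alpha compositing: output = m · inpainted + (1 − m) · original, with m in [0, 1]. The function name `composite_with_soft_mask` and the grayscale list-of-rows image format are hypothetical.

```python
from typing import List

Image = List[List[float]]  # hypothetical format: grayscale image as rows of intensities


def composite_with_soft_mask(original: Image, inpainted: Image, mask: Image) -> Image:
    """Blend per pixel: mask=1 takes the inpainted background, mask=0 keeps the original pixel."""
    out: Image = []
    for row_o, row_i, row_m in zip(original, inpainted, mask):
        out_row = []
        for o, i, m in zip(row_o, row_i, row_m):
            if not 0.0 <= m <= 1.0:
                raise ValueError("soft mask values must lie in [0, 1]")
            out_row.append(m * i + (1.0 - m) * o)
        out.append(out_row)
    return out


original = [[1.0, 1.0], [0.2, 0.2]]
inpainted = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.5], [0.0, 0.0]]  # full text, ambiguous boundary, clean background
print(composite_with_soft_mask(original, inpainted, mask))  # [[0.0, 0.5], [0.2, 0.2]]
```

Fractional mask values let ambiguous boundaries (embossing, shade) be only partly replaced instead of forcing a hard text/background decision.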
Jan 1, 2021 in CVPR (Computer Vision and Pattern Recognition)
Designing an efficient model within the limited computational cost is challenging. We argue the accuracy of a lightweight model has been further limited by the design convention: a stage-wise configuration of the channel dimensions, which looks like a piecewise linear function of the network stage. In this paper, we study an effective channel dimension configuration towards better performance than the convention. To this end, we empirically study how to design a single layer properly by analyzin...
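The abstract contrasts the conventional stage-wise channel configuration with alternatives. The sketch below shows the convention (one width per stage, so the width only jumps at stage boundaries) next to a per-block linear ramp. The widths, block counts, and the linear alternative itself are illustrative assumptions; the abstract is truncated before stating the configuration the paper arrives at.

```python
from typing import List, Sequence


def stagewise_channels(stage_widths: Sequence[int], blocks_per_stage: Sequence[int]) -> List[int]:
    """Design convention: every block in a stage shares that stage's width."""
    channels: List[int] = []
    for width, n_blocks in zip(stage_widths, blocks_per_stage):
        channels.extend([width] * n_blocks)
    return channels


def linear_channels(start: int, end: int, num_blocks: int) -> List[int]:
    """Alternative for comparison (assumed): width grows by a constant step at every block."""
    if num_blocks == 1:
        return [start]
    step = (end - start) / (num_blocks - 1)
    return [round(start + i * step) for i in range(num_blocks)]


print(stagewise_channels([16, 24, 40], [1, 2, 3]))  # [16, 24, 24, 40, 40, 40]
print(linear_channels(16, 40, 6))                    # [16, 21, 26, 30, 35, 40]
```

Both schedules start and end at the same widths; they differ only in how capacity is distributed across the intermediate blocks.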
Jan 1, 2021 in AAAI (National Conference on Artificial Intelligence)
#1 Mingi Ji (KAIST), H-Index: 4
#2 Byeongho Heo (Naver Corporation), H-Index: 11
Last: Sungrae Park (Naver Corporation), H-Index: 12
(3 authors in total)
Knowledge distillation extracts general knowledge from a pre-trained teacher network and provides guidance to a target student network. Most studies manually tie intermediate features of the teacher and student, and transfer knowledge through pre-defined links. However, manual selection often constructs ineffective links that limit the improvement from the distillation. There has been an attempt to address the problem, but it is still challenging to identify effective links under practical scena...
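The abstract describes the conventional setup in which intermediate teacher and student features are tied through manually chosen links. The sketch below shows that baseline only, not the paper's proposed method: each link pairs a teacher layer with a student layer, and the distillation term is the mean squared error summed over the links. The layer names are hypothetical, and linked features are assumed to already share a dimension (practical implementations often insert a learned projection on the student side).

```python
from typing import Dict, List, Sequence, Tuple

Feature = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("linked features must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def linked_feature_loss(teacher_feats: Dict[str, Feature],
                        student_feats: Dict[str, Feature],
                        links: Sequence[Tuple[str, str]]) -> float:
    """Sum of per-link MSE over manually chosen (teacher_layer, student_layer) pairs."""
    return sum(mse(teacher_feats[t], student_feats[s]) for t, s in links)


teacher = {"t_block1": [1.0, 2.0], "t_block2": [0.0, 4.0]}
student = {"s_block1": [1.0, 1.0], "s_block2": [0.0, 2.0]}
links = [("t_block1", "s_block1"), ("t_block2", "s_block2")]  # hand-picked links
print(linked_feature_loss(teacher, student, links))  # 2.5
```

Because `links` is fixed by hand, a poorly chosen pair contributes a loss term that may not help the student, which is the limitation the abstract points to.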
May 3, 2021 in ICLR (International Conference on Learning Representations)
#1 Byeongho Heo, H-Index: 11
#2 Sanghyuk Chun, H-Index: 13
Last: Jung-Woo Ha (SNU: Seoul National University), H-Index: 20
(8 authors in total)
Normalization techniques, such as batch normalization (BN), are a boon for modern deep learning. They let weights converge more quickly with often better generalization performances. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional int...
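The claim that scale invariance automatically reduces the effective step size can be checked on a toy example. For a loss that depends only on the direction of w, the gradient is orthogonal to w and scales as 1/‖w‖. A gradient-descent step therefore never shrinks the norm (‖w − ηg‖² = ‖w‖² + η²‖g‖²), and the angular step on w/‖w‖ is roughly η/‖w‖², so it decays as the norm grows. The toy loss 1 − (w·a)/‖w‖ below is an illustrative assumption standing in for a layer followed by normalization.

```python
import math
from typing import List, Tuple

Vector = List[float]


def norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def loss(w: Vector, a: Vector) -> float:
    """Scale-invariant toy loss: depends only on the direction of w (a is a unit vector)."""
    return 1.0 - dot(w, a) / norm(w)


def grad(w: Vector, a: Vector) -> Vector:
    """Analytic gradient -(a - cos * u) / ||w||: orthogonal to w, proportional to 1/||w||."""
    n = norm(w)
    u = [x / n for x in w]
    cos = dot(u, a)
    return [-(ai - cos * ui) / n for ai, ui in zip(a, u)]


def gradient_descent(w: Vector, a: Vector, lr: float, steps: int) -> Tuple[Vector, List[float]]:
    """Plain GD; returns the final weights and the weight norm after every step."""
    norms = [norm(w)]
    for _ in range(steps):
        g = grad(w, a)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        norms.append(norm(w))
    return w, norms


a = [1.0, 0.0]
w0 = [0.0, 1.0]
w_final, norms = gradient_descent(w0, a, lr=0.5, steps=20)
print(loss(w0, a), loss(w_final, a))  # loss decreases
print(norms[0], norms[-1])            # while the weight norm grows
```

The growing norm is exactly the mechanism the abstract describes: the same learning rate produces ever smaller directional updates.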
Mar 30, 2021 in ICCV (International Conference on Computer Vision)
#1 Byeongho Heo (Naver Corporation), H-Index: 11
#2 Sangdoo Yun (Naver Corporation), H-Index: 21
Last: Seong Joon Oh (Naver Corporation), H-Index: 22
(6 authors in total)
Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks, serving as an alternative architecture to the existing convolutional neural networks (CNNs). Although the transformer-based architecture has been innovative for computer vision modeling, design conventions toward an effective architecture have been less studied. From the successful design principles of CNN, we investigate the role of spatial dimension conversion and its ...
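The spatial dimension conversion the abstract refers to is the CNN habit of shrinking the spatial grid while widening channels from stage to stage, whereas a plain ViT keeps the same token grid throughout. The shape bookkeeping below contrasts the two; the halving/doubling factors, the 16×16×96 starting grid, and the (tokens)² × channels cost estimate are illustrative assumptions rather than the paper's exact design.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels) of a token grid


def stage_shapes(height: int, width: int, channels: int,
                 num_stages: int, downsample: bool) -> List[Shape]:
    """Token-grid shape per stage; downsampling halves H and W and doubles C (assumed factors)."""
    shapes = [(height, width, channels)]
    for _ in range(num_stages - 1):
        if downsample:
            height, width, channels = height // 2, width // 2, channels * 2
        shapes.append((height, width, channels))
    return shapes


def attention_cost(shape: Shape) -> int:
    """Rough self-attention cost estimate: (number of tokens)^2 * channels."""
    h, w, c = shape
    tokens = h * w
    return tokens * tokens * c


for downsample in (False, True):
    shapes = stage_shapes(16, 16, 96, num_stages=3, downsample=downsample)
    print(downsample, shapes, [attention_cost(s) for s in shapes])
```

Because attention cost grows with the square of the token count, reducing the grid in later stages frees budget that can go into wider channels.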
#1 Kyuewang Lee (SNU: Seoul National University), H-Index: 4
#2 Hyung Jin Chang (University of Birmingham), H-Index: 21
Last: Jin Young Choi (SNU: Seoul National University), H-Index: 31
(6 authors in total)
To tackle problems arising from unexpected camera motions in unmanned aerial vehicles (UAVs), we propose a three-mode ensemble tracker where each mode specializes in distinctive situations. The proposed ensemble tracker is composed of appearance-based tracking mode, homography-based tracking mode, and momentum-based tracking mode. The appearance-based tracking mode tracks a moving object well when the UAV is nearly stopped, whereas the homography-based tracking mode shows good tracking performan...
#1 Youngmin Ro, H-Index: 2
#2 Jongwon Choi (CAU: Chung-Ang University), H-Index: 15
Last: Jin Young Choi, H-Index: 31
(4 authors in total)
Image retrieval is a challenging problem that requires learning features general enough to identify untrained classes, even with very few training samples per class. In this article, to obtain more generalized features when learning on retrieval data sets, we propose a novel fine-tuning method for pretrained deep networks. In the retrieval task, we discovered a phenomenon in which the loss reduction in fine-tuning deep networks stagnates, even while weights are largely updated. To escape fro...
ImageNet has been arguably the most popular image classification benchmark, but it is also the one with a significant level of label noise. Recent studies have shown that many samples contain multiple classes, despite being assumed to be a single-label benchmark. They have thus proposed to turn ImageNet evaluation into a multi-label task, with exhaustive multi-label annotations per image. However, they have not fixed the training set, presumably because of a formidable annotation cost. We argue ...
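Turning ImageNet evaluation into a multi-label task changes what counts as a correct top-1 prediction. A minimal sketch of one common scoring rule, assumed here for illustration: a prediction is correct if it matches any of the image's annotated labels, and images whose label set is empty are skipped. The function names and toy data are hypothetical.

```python
from typing import Optional, Sequence, Set


def single_label_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Standard top-1 accuracy against exactly one label per image."""
    return sum(p == y for p, y in zip(preds, labels)) / len(preds)


def multi_label_accuracy(preds: Sequence[int], label_sets: Sequence[Set[int]]) -> Optional[float]:
    """Top-1 prediction counts as correct if it is any of the image's valid labels.

    Images with an empty label set are skipped (an assumption of this sketch).
    """
    scored = [(p, s) for p, s in zip(preds, label_sets) if s]
    if not scored:
        return None
    return sum(p in s for p, s in scored) / len(scored)


preds = [3, 5, 7]
print(single_label_accuracy(preds, [3, 2, 1]))          # 1 of 3 correct
print(multi_label_accuracy(preds, [{3}, {2, 5}, {1}]))  # 2 of 3 correct
```

The second image shows the effect: predicting class 5 is penalized under a single label but accepted once the image's second object is annotated.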
#1 Sangdoo Yun, H-Index: 21
#2 Seong Joon Oh, H-Index: 22
Last: Jinhyung Kim, H-Index: 12
(5 authors in total)
State-of-the-art video action classifiers often suffer from overfitting. They tend to be biased towards specific objects and scene cues rather than the foreground action content, leading to sub-optimal generalization performance. Recent data augmentation strategies have been reported to address the overfitting problems in static image classifiers. Despite their effectiveness on static image classifiers, data augmentation has rarely been studied for videos. For the first time in the field, we...
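The abstract is truncated before describing the proposed method, so the following is not presented as the paper's approach. It is a minimal sketch of one straightforward way to carry a static-image augmentation over to video, assuming CutMix as the image technique: the same spatial box from a second clip is pasted into every frame, and the label weight follows the pasted area. The function name, the clip[t][y][x] list layout, and the fixed box are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]
Clip = List[Frame]  # hypothetical layout: clip[t][y][x]


def cutmix_clip(clip_a: Clip, clip_b: Clip, box: Tuple[int, int, int, int]) -> Tuple[Clip, float]:
    """Paste the same spatial box from clip_b into every frame of clip_a.

    box = (y0, y1, x0, x1), half-open. Returns the mixed clip and lam, the weight
    for clip_a's label (the fraction of pixels per frame still coming from clip_a).
    """
    y0, y1, x0, x1 = box
    if len(clip_a) != len(clip_b):
        raise ValueError("clips must have the same number of frames")
    mixed: Clip = []
    for frame_a, frame_b in zip(clip_a, clip_b):
        new_frame = [row[:] for row in frame_a]  # copy so clip_a is not modified
        for y in range(y0, y1):
            for x in range(x0, x1):
                new_frame[y][x] = frame_b[y][x]
        mixed.append(new_frame)
    height, width = len(clip_a[0]), len(clip_a[0][0])
    lam = 1.0 - (y1 - y0) * (x1 - x0) / (height * width)
    return mixed, lam


clip_a = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]  # 2 frames of 4x4
clip_b = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
mixed, lam = cutmix_clip(clip_a, clip_b, box=(0, 2, 0, 2))
print(lam)  # 0.75; target = lam * label_a + (1 - lam) * label_b
```

Reusing one box across all frames keeps the pasted region temporally consistent; sampling a different box per frame would be another design choice.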