

Texts in Computer Science



Wilhelm Burger • Mark J. Burge



Digital Image Processing An Algorithmic Introduction Third Edition



Texts in Computer Science

Series Editors:
David Gries, Department of Computer Science, Cornell University, Ithaca, NY, USA
Orit Hazzan, Faculty of Education in Technology and Science, Technion – Israel Institute of Technology, Haifa, Israel



Titles in this series now included in the Thomson Reuters Book Citation Index! ‘Texts in Computer Science’ (TCS) delivers high-quality instructional content for undergraduates and graduates in all areas of computing and information science, with a strong emphasis on core foundational and theoretical material but inclusive of some prominent applications-related content. TCS books should be reasonably self-contained and aim to provide students with modern and clear accounts of topics ranging across the computing curriculum. As a result, the books are ideal for semester courses or for individual self-study in cases where people need to expand their knowledge. All texts are authored by established experts in their fields, reviewed internally and by the series editors, and provide numerous examples, problems, and other pedagogical tools; many contain fully worked solutions. The TCS series comprises high-quality, self-contained books that have broad and comprehensive coverage and are generally in hardback format and sometimes contain color. For undergraduate textbooks that are likely to be more brief and modular in their approach, require only black and white, and are under 275 pages, Springer offers the flexibly designed Undergraduate Topics in Computer Science series, to which we refer potential authors.



Wilhelm Burger • Mark J. Burge



Digital Image Processing An Algorithmic Introduction Third Edition






Wilhelm Burger University of Applied Sciences Upper Austria Hagenberg, Austria



Mark J. Burge Federal Bureau of Investigation Quantico, VA, USA



ISSN 1868-0941          ISSN 1868-095X (electronic)
Texts in Computer Science
ISBN 978-3-031-05743-4          ISBN 978-3-031-05744-1 (eBook)
https://doi.org/10.1007/978-3-031-05744-1

1st & 2nd edition: © Springer-Verlag London 2008, 2016
3rd edition: © Springer Nature Switzerland AG 2022
This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply 2022

All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland



Preface



This book provides a modern, self-contained introduction to digital image processing. We designed it to be used both by learners desiring a firm foundation on which to build and by practitioners in search of detailed analysis and transparent implementations of the most important techniques. This is the third English edition of the original German-language book, which has been widely used by:

• Scientists and engineers who use image processing as a tool and wish to develop a deeper understanding and create custom solutions to imaging problems in their field.
• IT professionals in search of a self-study course featuring easily adaptable code and completely worked-out examples, enabling them to be productive right away.
• Faculty and students desiring an example-rich introductory textbook suitable for an advanced undergraduate or graduate-level course that features exercises, projects, and examples honed over many years of teaching this material.

While we concentrate on practical applications and concrete implementations, we do so without glossing over the important formal details and mathematics necessary for a deeper understanding of the algorithms. In preparing this text, we started from the premise that simply creating a recipe book of imaging solutions would not provide the deeper understanding needed to apply these techniques to novel problems. Instead, our solutions are developed stepwise from three different perspectives: in mathematical form, as abstract pseudocode algorithms, and as complete Java programs. We use a common notation to intertwine all three perspectives, providing multiple, but intimately linked, views of each problem and its solution.

Prerequisites

Instead of presenting digital image processing as a mathematical discipline, or strictly as a signal processing topic, we present it from a practitioner's and programmer's perspective, with a view toward replacing many of the formalisms commonly used in other texts with constructs more readily understandable by our audience. To take full advantage of the programming components of this book, a knowledge of basic data structures and object-oriented programming, ideally in Java, is beneficial.






We selected Java for a number of reasons: it is the first programming language learned by students in a wide variety of engineering curricula, and professionals with knowledge of a related language, especially C# or C++, will find the programming examples easy to follow and extend.

The software in this book is designed to work with ImageJ, a widely used, programmer-extensible imaging system developed, maintained, and distributed by the National Institutes of Health (NIH) and available at https://rsb.info.nih.gov/ij/. ImageJ is implemented completely in Java and therefore runs on all major platforms. It is popular because its "plugin"-based architecture enables it to be easily extended. While all examples run in ImageJ, they have been specifically designed to be easily ported to other environments and programming languages.
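To give a flavor of this plugin mechanism, the following is a minimal sketch of an ImageJ plugin that inverts an 8-bit grayscale image. It is only illustrative and not one of the book's examples; the class name Invert_Demo is made up for this sketch, while the PlugInFilter interface, the DOES_8G flag, and the ImageProcessor accessors are part of the standard ImageJ API.

import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;

// Illustrative sketch (not from the book): inverts an 8-bit grayscale image.
public class Invert_Demo implements PlugInFilter {

    @Override
    public int setup(String arg, ImagePlus im) {
        return DOES_8G;   // declare that this plugin accepts 8-bit grayscale images
    }

    @Override
    public void run(ImageProcessor ip) {
        int w = ip.getWidth();
        int h = ip.getHeight();
        for (int v = 0; v < h; v++) {
            for (int u = 0; u < w; u++) {
                int p = ip.get(u, v);     // read pixel value (0..255)
                ip.set(u, v, 255 - p);    // write the inverted value
            }
        }
    }
}

Compiled and placed in ImageJ's plugins folder, a class like this (its name must contain an underscore) appears as a new entry in the Plugins menu; this is the same extension mechanism used by the example plugins accompanying this book.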



Use in Research and Development

This book has been especially designed for use as a textbook and as such features exercises and carefully constructed examples that supplement our detailed presentation of the fundamental concepts and techniques. As both practitioners and developers, we know that the details required to successfully understand, apply, and extend classical techniques are often difficult to find, and for this reason we have been very careful to provide the missing details, many gleaned over years of practical application. While this should make the text particularly valuable to those in research and development, it is not designed as a comprehensive, fully cited scientific research text. On the contrary, we have carefully vetted our citations so that they can be obtained from easily accessible sources. While we have only briefly discussed the fundamentals of, or entirely omitted, topics such as hierarchical methods, wavelets, or eigenimages because of space limitations, other topics have been left out deliberately, including advanced issues such as object recognition, image understanding, and three-dimensional (3D) computer vision. So, while most techniques described in this book could be called "blind and dumb", it is our experience that straightforward, technically clean implementations of these foundational methods are essential to the success of any higher-level, perhaps truly "intelligent", techniques built on top of them. If you are only in search of a programming handbook for ImageJ or Java, there are certainly better sources. While the book includes many code examples, programming in and of itself is not our main focus. Instead, Java serves as just one important element for describing each technique in a precise and immediately testable way.

Classroom Use

Whether it is called signal processing, image processing, or media computation, the manipulation of digital images has been an integral part of most computer science and engineering curricula for many years. Today, with the omnipresence of all-digital workflows, it has become an integral part of the required skill set for professionals in many diverse disciplines. The topic has meanwhile migrated into the early stages of many curricula, where it is often a key foundation course. This trend uncovered a problem in that many of the texts relied on as standards in the older graduate-level courses were not appropriate for beginners. The texts were usually too formal for novices, and at the same time did not provide detailed coverage of many of the most popular methods used in actual practice. The result was that educators had a difficult time selecting a single textbook or even finding a compact collection of literature to recommend to their students. Faced with this dilemma ourselves, we wrote this book in the sincere hope of filling this gap.

The contents of the following chapters can be presented in either a one- or two-semester sequence. Where feasible, we have added supporting material in order to make each chapter as independent as possible, providing instructors with maximum flexibility when designing the course. Chapters 18–20 offer a compact introduction to the use of spectral techniques in image processing and are essentially independent of the other material in the text. Depending on the goals of the instructor and the curriculum, they can be covered in as much detail as required or completely omitted. The following road map shows a possible partitioning of topics for a two-semester syllabus.

Road Map for a 2-Semester Syllabus

Semester 1: Chapters 1–8 (Digital Images; Histograms and Image Statistics; Point Operations; Filters; Edges and Contours; Corner Detection; Morphological Filters; Regions in Binary Images), Chapter 13 (Color Images), and Chapters 21–23 (Geometric Operations; Pixel Interpolation; Image Matching and Registration).

Semester 2: Chapters 9–12 (Automatic Thresholding; Fitting Straight Lines; Fitting Circles and Ellipses; Detecting Geometric Primitives), Chapters 14–20 (Colorimetric Color Spaces; Filters for Color Images; Edge Detection in Color Images; Edge-Preserving Smoothing Filters; Introduction to Spectral Techniques; The Discrete Fourier Transform in 2D; The Discrete Cosine Transform (DCT)), and Chapters 24–26 (Non-Rigid Image Matching; Scale-Invariant Local Features (SIFT); Maximally Stable Extremal Regions (MSER)).



Addendum to the 3rd Edition

This third (and presumably final) edition offers carefully revised contents and adds some new topics that round off the original material.






In particular, there are new chapters on circle and ellipse fitting, geometric primitive detection, and local feature extraction. The appendix has been greatly expanded and contains numerous additions that should be helpful for concrete implementations as well as for reference. This includes small numerical examples, which are often helpful for understanding a problem and allow a quick check of one's own calculations. Special attention was paid to the concise description of all methods in the form of easily understandable algorithms, not least to make the descriptions as independent as possible from concrete programming languages. In fact, we think that the consistent algorithmic description, which was again significantly improved in this edition, is a distinctive feature of this book. On the other hand, we have relegated the concrete program examples somewhat to the background without sacrificing the most essential elements, especially since virtually all examples are also available online. In fact, most algorithms are implemented one-to-one in the associated imagingbook software library, so readers can always check the code in case of any ambiguities in the mathematical notation. All newly added concepts and associated demo programs have been implemented in this library, which will be continuously maintained and extended in the future.

Unfortunately, this third edition had to be prepared without the helping hand of the second author; the blame for any errors that were either newly added or overlooked in the existing text therefore lies solely with me (i.e., the first author). Readers will hopefully be merciful in view of the fact that there was no native speaker at work this time.

Online Resources and Contact

Visit the website for this book, www.imagingbook.com, to download supplementary materials, including the complete (and free) Java source code for all examples and the underlying software library, full-size test images, useful references, and other supplements. The imagingbook software suite is published under the 2-Clause BSD License (see https://opensource.org/licenses/BSD-2-Clause). Comments, questions, and corrections are welcome and should be addressed to [email protected]

Exercises and Solutions

Almost every chapter of this book contains a set of sample exercises, mainly to support instructors in preparing their own assignments. Most of these tasks are easy to solve after studying the corresponding chapter, while some others may require more elaborate reasoning or experimental work. We assume that instructors know best how to select and adapt individual assignments to fit the level and interest of their students. This is the main reason why we have abstained from publishing explicit solutions in the past.



However, we are happy to answer any personal request if an exercise is unclear or seems to elude a simple solution.

Thank You!

This book would not have been possible without the understanding and support of our families. Our thanks go to Wayne Rasband at NIH for developing ImageJ and for his truly outstanding support of the community, and to all our readers of the previous editions who provided valuable input, suggestions for improvement, and encouragement. The use of open-source software for such a project always carries an element of risk, since its long-term acceptance and continuity are difficult to assess. In retrospect, choosing ImageJ as the software basis for this work was a good decision, and we would consider ourselves happy if our books have perhaps contributed to the success of the ImageJ project itself. Finally, we owe a debt of gratitude to the professionals at Springer, particularly to Wayne Wheeler and his team, who again patiently accompanied the production of this edition.



W.B. and M.J.B. Spring 2022






Contents



Part I  Images and Pixels

1  Digital Images
  1.1  Programming with Images
  1.2  Image Analysis and Computer Vision
  1.3  Types of Digital Images
  1.4  Image Acquisition
    1.4.1  The Pinhole Camera Model
    1.4.2  The "Thin" Lens
    1.4.3  Going Digital
    1.4.4  Image Size and Resolution
    1.4.5  Image Coordinate System
    1.4.6  Pixel Values
  1.5  Image File Formats
    1.5.1  Raster Versus Vector Data
    1.5.2  Tagged Image File Format (TIFF)
    1.5.3  Graphics Interchange Format (GIF)
    1.5.4  Portable Network Graphics (PNG)
    1.5.5  JPEG
    1.5.6  Legacy File Formats
    1.5.7  Bits and Bytes
  1.6  Software for Digital Imaging
  1.7  ImageJ
    1.7.1  Key Features
    1.7.2  Interactive Tools
    1.7.3  Working With ImageJ and Java
  1.8  Exercises

2  Histograms and Image Statistics
  2.1  What is a Histogram?
  2.2  Interpreting Histograms
    2.2.1  Image Acquisition
    2.2.2  Image Defects
  2.3  Calculating Histograms
  2.4  Histograms of Images With More Than 8 Bits
    2.4.1  Binning
    2.4.2  Example
    2.4.3  Implementation
  2.5  Histograms of Color Images
    2.5.1  Intensity Histograms
    2.5.2  Individual Color Channel Histograms
    2.5.3  Combined Color Histograms
  2.6  The Cumulative Histogram
  2.7  Statistical Information from the Histogram
    2.7.1  Mean and Variance
    2.7.2  Median
  2.8  Block Statistics
    2.8.1  Integral Images
    2.8.2  Mean Intensity
    2.8.3  Variance
    2.8.4  Practical Calculation of Integral Images
  2.9  Exercises

3  Point Operations
  3.1  Modifying Image Intensity
    3.1.1  Contrast and Brightness
    3.1.2  Limiting Values by Clamping
    3.1.3  Inverting Images
    3.1.4  Thresholding Operation
  3.2  Point Operations and Histograms
  3.3  Automatic Contrast Adjustment
  3.4  Modified Auto-Contrast Operation
  3.5  Histogram Equalization
  3.6  Histogram Specification
    3.6.1  Frequencies and Probabilities
    3.6.2  Principle of Histogram Specification
    3.6.3  Adjusting to a Piecewise Linear Distribution
    3.6.4  Adjusting to a Given Histogram (Histogram Matching)
    3.6.5  Examples
  3.7  Gamma Correction
    3.7.1  Why "Gamma"?
    3.7.2  Mathematical Definition
    3.7.3  Real Gamma Values
    3.7.4  Applications of Gamma Correction
    3.7.5  Implementation
    3.7.6  Modified Gamma Correction
  3.8  Point Operations in ImageJ
    3.8.1  Point Operations with Lookup Tables
    3.8.2  Arithmetic Operations
    3.8.3  Point Operations Involving Multiple Images
    3.8.4  Methods for Point Operations on Two Images
    3.8.5  ImageJ Plugins Involving Multiple Images
  3.9  Exercises

Part II  Filters, Edges and Corners

4  Filters
  4.1  What is a Filter?
  4.2  Linear Filters
    4.2.1  The Filter Kernel
    4.2.2  Applying the Filter
    4.2.3  Implementing Filter Operations
    4.2.4  Filter Plugin Examples
    4.2.5  Integer Coefficients
    4.2.6  Filters of Arbitrary Size
    4.2.7  Types of Linear Filters
  4.3  Formal Properties of Linear Filters
    4.3.1  Linear Convolution
    4.3.2  Formal Properties of Linear Convolution
    4.3.3  Separability of Linear Filters
    4.3.4  Impulse Response of a Filter
  4.4  Nonlinear Filters
    4.4.1  Minimum and Maximum Filters
    4.4.2  Median Filter
    4.4.3  Weighted Median Filter
    4.4.4  Other Nonlinear Filters
  4.5  Implementing Filters
    4.5.1  Efficiency of Filter Programs
    4.5.2  Handling Image Borders
    4.5.3  Debugging Filter Programs
  4.6  Filter Operations in ImageJ
    4.6.1  Linear Filters
    4.6.2  Gaussian Filters
    4.6.3  Nonlinear Filters
  4.7  Exercises

5  Edges and Contours
  5.1  What Makes an Edge?
  5.2  Gradient-Based Edge Detection
    5.2.1  Partial Derivatives and the Gradient
    5.2.2  Derivative Filters
  5.3  Simple Edge Operators
    5.3.1  Prewitt and Sobel Edge Operators
    5.3.2  Roberts Edge Operator
    5.3.3  Compass Operators
    5.3.4  Edge Operators in ImageJ
  5.4  Other Edge Operators
    5.4.1  Edge Detection Based on Second Derivatives
    5.4.2  Edges at Different Scales
    5.4.3  From Edges to Contours
  5.5  Canny Edge Operator
    5.5.1  Preprocessing
    5.5.2  Edge Localization
    5.5.3  Edge Tracing and Hysteresis Thresholding
    5.5.4  Additional Information
    5.5.5  Implementation
  5.6  Edge Sharpening
    5.6.1  Edge Sharpening with the Laplacian Filter
    5.6.2  Unsharp Masking
  5.7  Exercises

6  Corner Detection
  6.1  Points of Interest
  6.2  Harris Corner Detector
    6.2.1  The Local Structure Matrix
    6.2.2  Significance of the Local Structure Matrix
    6.2.3  Corner Response Function (CRF)
    6.2.4  Selecting Corner Points
    6.2.5  Examples
  6.3  Alternative Formulations
    6.3.1  Shi-Tomasi Corner Score
    6.3.2  MOPS Corner Score
  6.4  Basic Implementation
    6.4.1  Summary
  6.5  Sub-Pixel Corner Positions
    6.5.1  Position Interpolation by Second-Order Taylor Expansion
    6.5.2  Sub-Pixel Positioning Example
  6.6  Exercises

Part III  Binary Images

7  Morphological Filters
  7.1  Shrink and Let Grow
    7.1.1  Pixel Neighborhoods
  7.2  Basic Morphological Operations
    7.2.1  The Structuring Element
    7.2.2  Point Sets
    7.2.3  Dilation
    7.2.4  Erosion
    7.2.5  Formal Properties of Dilation and Erosion
    7.2.6  Designing Morphological Filters
    7.2.7  Application Example: Outline
  7.3  Composite Morphological Operations
    7.3.1  Opening
    7.3.2  Closing
    7.3.3  Properties of Opening and Closing
  7.4  Thinning (Skeletonization)
    7.4.1  Basic Algorithm
    7.4.2  Fast Thinning Algorithm
    7.4.3  Java Implementation
    7.4.4  Built-in Morphological Operations in ImageJ
  7.5  Grayscale Morphology
    7.5.1  Structuring Elements
    7.5.2  Dilation and Erosion
    7.5.3  Grayscale Opening and Closing
  7.6  Exercises

8  Regions in Binary Images
  8.1  Finding Connected Image Regions
    8.1.1  Region Labeling by Flood Filling
    8.1.2  Sequential Region Segmentation
    8.1.3  Region Labeling – Summary
  8.2  Region Contours
    8.2.1  Outer and Inner Contours
    8.2.2  Combining Region Labeling and Contour Detection
    8.2.3  Java Implementation
  8.3  Representing Image Regions
    8.3.1  Matrix Representation
    8.3.2  Run Length Encoding
    8.3.3  Chain Codes
  8.4  Properties of Binary Regions
    8.4.1  Shape Features
    8.4.2  Geometric Features
  8.5  Statistical Shape Properties
    8.5.1  Centroid
    8.5.2  Moments
    8.5.3  Central Moments
    8.5.4  Normalized Central Moments
    8.5.5  Java Implementation
  8.6  Moment-Based Geometric Properties
    8.6.1  Orientation
    8.6.2  Region Eccentricity
    8.6.3  Equivalent Ellipse
    8.6.4  Bounding Box Aligned to the Major Axis
    8.6.5  Invariant Region Moments
  8.7  Projections
  8.8  Topological Region Properties
  8.9  Java Implementation
  8.10  Exercises

9  Automatic Thresholding
  9.1  Global Histogram-Based Thresholding
    9.1.1  Image Statistics from the Histogram
    9.1.2  Simple Threshold Selection
    9.1.3  Iterative Threshold Selection (Isodata Algorithm)
    9.1.4  Otsu's Method
    9.1.5  Maximum Entropy Thresholding
    9.1.6  Minimum Error Thresholding
  9.2  Local Adaptive Thresholding
    9.2.1  Bernsen's Method
    9.2.2  Niblack's Method
  9.3  Java Implementation
    9.3.1  Global Thresholding Methods
    9.3.2  Adaptive Thresholding
  9.4  Summary and Further Reading
  9.5  Exercises

Part IV  Geometric Primitives

10  Fitting Straight Lines
  10.1  Straight Line Equations
    10.1.1  Slope-Intercept Form
    10.1.2  Parametric (Point-Vector) Form
    10.1.3  Algebraic Form
    10.1.4  Hessian Normal Form
  10.2  Fitting Lines to Point Sets
    10.2.1  Linear Regression
    10.2.2  Orthogonal Regression
  10.3  Example: Contour Segmentation
  10.4  Java Implementation
  10.5  Exercises

11  Fitting Circles and Ellipses
  11.1  Fitting Circles
    11.1.1  Circle Equations
    11.1.2  Algebraic Circle Fits
    11.1.3  Geometric Circle Fitting
  11.2  Fitting Ellipses
    11.2.1  Algebraic Ellipse Fitting
    11.2.2  Geometric Ellipse Fitting
    11.2.3  Orthogonal Distance Approximations
  11.3  Java Implementation
    11.3.1  Circle Fitting
    11.3.2  Ellipse Fitting

12  Detecting Geometric Primitives
  12.1  Random Sample Consensus (RANSAC)
    12.1.1  How Many Random Draws Are Needed?
    12.1.2  RANSAC Line Detection Algorithm
    12.1.3  Detecting Multiple Lines
    12.1.4  RANSAC Circle Detection
    12.1.5  RANSAC Ellipse Detection
    12.1.6  RANSAC Extensions and Applications
  12.2  The Hough Transform
    12.2.1  Parameter Space
    12.2.2  Accumulator Map
    12.2.3  A Better Line Representation
    12.2.4  Hough Algorithm
    12.2.5  Hough Transform Extensions
    12.2.6  Hough Transform for Circles and Arcs
    12.2.7  Hough Transform for Ellipses
  12.3  Java Implementation
  12.4  Exercises

Part V  Color

13  Color Images
  13.1  RGB Color Images
    13.1.1  Structure of Color Images
    13.1.2  Color Images in ImageJ
  13.2  Color Spaces and Color Conversion
    13.2.1  Conversion to Grayscale
    13.2.2  Desaturating RGB Color Images
    13.2.3  HSV/HSB and HLS Color Spaces
    13.2.4  TV Component Color Spaces: YUV, YIQ, and YCbCr
    13.2.5  Color Spaces for Printing: CMY and CMYK
  13.3  Statistics of Color Images
    13.3.1  How Many Different Colors Are in an Image?
    13.3.2  Color Histograms
  13.4  Color Quantization
    13.4.1  Scalar Color Quantization
    13.4.2  Vector Quantization
    13.4.3  Java Implementation
  13.5  Exercises

14  Colorimetric Color Spaces
  14.1  CIE Color Spaces
    14.1.1  CIE XYZ Color Space
    14.1.2  CIE x, y Chromaticity
    14.1.3  Standard Illuminants
    14.1.4  Gamut
    14.1.5  Variants of the CIE Color Space
  14.2  CIELAB Color Space
    14.2.1  CIEXYZ → CIELAB Conversion
    14.2.2  CIELAB → CIEXYZ Conversion
  14.3  CIELUV Color Space
    14.3.1  CIEXYZ → CIELUV Conversion
    14.3.2  CIELUV → CIEXYZ Conversion
    14.3.3  Measuring Color Differences
  14.4  Standard RGB (sRGB)
    14.4.1  Linear vs. Nonlinear Color Components
    14.4.2  CIEXYZ → sRGB Conversion
    14.4.3  sRGB → CIEXYZ Conversion
    14.4.4  Calculations with Nonlinear sRGB Values
  14.5  Adobe RGB
  14.6  Chromatic Adaptation
    14.6.1  XYZ Scaling
    14.6.2  Bradford Color Adaptation
  14.7  Colorimetric Support in Java
    14.7.1  Profile Connection Space (PCS)
    14.7.2  Color-Related Java Classes
    14.7.3  Implementation of the CIELAB Color Space (Example)
    14.7.4  ICC Profiles
  14.8  Exercises

15  Filters for Color Images
  15.1  Linear Filters
    15.1.1  Monochromatic Application of Linear Filters
    15.1.2  Color Space Matters
    15.1.3  Linear Filtering with Circular Values
  15.2  Nonlinear Color Filters
    15.2.1  Scalar Median Filter
    15.2.2  Vector Median Filter
    15.2.3  Sharpening Vector Median Filter
  15.3  Java Implementation
  15.4  Further Reading
  15.5  Exercises

16  Edge Detection in Color Images
  16.1  Monochromatic Techniques
  16.2  Edges in Vector-Valued Images
    16.2.1  Multi-Dimensional Gradients
    16.2.2  The Jacobian Matrix
    16.2.3  Squared Local Contrast
    16.2.4  Color Edge Magnitude
    16.2.5  Color Edge Orientation
    16.2.6  Grayscale Gradients Revisited
  16.3  Canny Edge Detector for Color Images
  16.4  Other Color Edge Operators
  16.5  Java Implementation
  16.6  Exercises

17  Edge-Preserving Smoothing Filters
  17.1  Kuwahara-Type Filters
    17.1.1  Application to Color Images
  17.2  Bilateral Filter
    17.2.1  Domain Filter
    17.2.2  Range Filter
    17.2.3  Bilateral Filter: General Idea
    17.2.4  Bilateral Filter with Gaussian Kernels
    17.2.5  Application to Color Images
    17.2.6  Efficient Implementation by x/y Separation
    17.2.7  Further Reading
  17.3  Anisotropic Diffusion Filters
    17.3.1  Homogeneous Diffusion and the Heat Equation
    17.3.2  The Perona-Malik Filter
    17.3.3  Perona-Malik Filter for Color Images
    17.3.4  Geometry Preserving Anisotropic Diffusion
    17.3.5  Tschumperlé-Deriche Algorithm
  17.4  Java Implementation
  17.5  Exercises

Part VI  Spectral Techniques

18  Introduction to Spectral Methods
  18.1  The Fourier Transform
    18.1.1  Sine and Cosine Functions
    18.1.2  Fourier Series Representation of Periodic Functions
    18.1.3  Fourier Integral
    18.1.4  Fourier Spectrum and Transformation
    18.1.5  Fourier Transform Pairs
    18.1.6  Important Properties of the Fourier Transform
  18.2  Working with Discrete Signals
    18.2.1  Sampling
    18.2.2  Discrete and Periodic Functions
  18.3  The Discrete Fourier Transform (DFT)
    18.3.1  Definition of the DFT
    18.3.2  Discrete Basis Functions
    18.3.3  Aliasing Again!
    18.3.4  Units in Signal and Frequency Space
    18.3.5  Power Spectrum
  18.4  Implementing the DFT
    18.4.1  Direct Implementation
    18.4.2  Fast Fourier Transform (FFT)
  18.5  Exercises

19  The Discrete Fourier Transform in 2D
  19.1  Definition of the 2D DFT
    19.1.1  2D Basis Functions
    19.1.2  Implementing the 2D DFT
  19.2  Visualizing the 2D Fourier Transform
    19.2.1  Range of Spectral Values
    19.2.2  Centered Representation of the DFT Spectrum
  19.3  Frequencies and Orientation in 2D
    19.3.1  Effective Frequency
    19.3.2  Frequency Limits and Aliasing in 2D
    19.3.3  Orientation
    19.3.4  Normalizing the Geometry of the 2D Spectrum
    19.3.5  Effects of Periodicity
    19.3.6  Windowing
    19.3.7  Common Windowing Functions
  19.4  2D Fourier Transform Examples
  19.5  Applications of the DFT
    19.5.1  Linear Filter Operations in Frequency Space
    19.5.2  Linear Convolution and Correlation
    19.5.3  Inverse Filters
  19.6  Exercises

20  The Discrete Cosine Transform (DCT)
  20.1  One-Dimensional DCT
    20.1.1  DCT Basis Functions
    20.1.2  Implementing the 1D DCT
  20.2  Two-Dimensional DCT
    20.2.1  Examples
    20.2.2  Separability
  20.3  Java Implementation
  20.4  Other Spectral Transforms
  20.5  Exercises

Part VII  Image Transformations

21  Geometric Operations
  21.1  Coordinate Transformations in 2D
    21.1.1  Linear Coordinate Transformations
    21.1.2  Homogeneous Coordinates
    21.1.3  Affine (Three-Point) Mapping
    21.1.4  Projective (Four-Point) Mapping
    21.1.5  Bilinear Mapping
    21.1.6  Log-Polar Mapping
    21.1.7  Other Nonlinear Transformations
    21.1.8  Piecewise Image Transformations
  21.2  Resampling the Image
    21.2.1  Source-to-Target Mapping
    21.2.2  Target-to-Source Mapping
  21.3  Java Implementation
    21.3.1  Geometric Transformations
    21.3.2  Image Transformations
    21.3.3  Examples
  21.4  Exercises

22  Pixel Interpolation
  22.1  Interpolation in 1D: Simple Methods
    22.1.1  Nearest-Neighbor Interpolation
    22.1.2  Linear Interpolation
    22.1.3  Ideal Low-Pass Filter
  22.2  Interpolation by Convolution
  22.3  Cubic Interpolation
  22.4  Spline Interpolation
    22.4.1  Catmull-Rom Interpolation
    22.4.2  Cubic B-spline Approximation
    22.4.3  Mitchell-Netravali Approximation
    22.4.4  Lanczos Interpolation
  22.5  Interpolation in 2D
    22.5.1  Nearest-Neighbor Interpolation in 2D
    22.5.2  Bilinear Interpolation
    22.5.3  Bicubic and Spline Interpolation in 2D
    22.5.4  Lanczos Interpolation in 2D
    22.5.5  Examples and Discussion
  22.6  Aliasing
    22.6.1  Sampling the Interpolated Image
    22.6.2  Space-Variant Low-Pass Filtering
  22.7  Java Implementation
  22.8  Exercises

Part VIII  Image Matching

23  Image Matching and Registration
  23.1  Template Matching in Intensity Images
    23.1.1  Distance between Image Patterns
    23.1.2  Matching Under Rotation and Scaling
    23.1.3  Java Implementation
  23.2  Matching Binary Images
    23.2.1  Direct Comparison of Binary Images
    23.2.2  The Distance Transform
    23.2.3  Chamfer Matching
    23.2.4  Java Implementation
  23.3  Exercises

24  Non-Rigid Image Matching
  24.1  The Lucas-Kanade Technique
    24.1.1  Registration in 1D
    24.1.2  Extension to Multi-Dimensional Functions
  24.2  The Lucas-Kanade Algorithm
    24.2.1  Summary of the Algorithm
  24.3  Inverse Compositional Algorithm
  24.4  Linear Transformation Parameters
    24.4.1  Pure Translation
    24.4.2  Affine Transformation
    24.4.3  Projective Transformation
    24.4.4  Concatenating Linear Transformations
    24.4.5  Coordinate Frames
  24.5  Example
  24.6  Java Implementation
  24.7  Exercises

Part IX  Local Features

25  Scale-Invariant Feature Transform (SIFT)
  25.1  Interest Points at Multiple Scales
    25.1.1  The LoG Filter
    25.1.2  Gaussian Scale Space
    25.1.3  LoG/DoG Scale Space
    25.1.4  Hierarchical Scale Space
    25.1.5  Scale Space Structure in SIFT
  25.2  Key Point Selection and Refinement
    25.2.1  Local Extrema Detection
    25.2.2  Position Refinement
    25.2.3  Suppressing Responses to Edge-Like Structures
  25.3  Creating Local Descriptors
    25.3.1  Finding Dominant Orientations
    25.3.2  SIFT Descriptor Construction
  25.4  SIFT Algorithm Summary
  25.5  Matching SIFT Features
    25.5.1  Feature Distance and Match Quality
    25.5.2  Examples
  25.6  Efficient Feature Matching
  25.7  Java Implementation
    25.7.1  SIFT Feature Extraction
    25.7.2  SIFT Feature Matching
  25.8  Exercises

26  Maximally Stable Extremal Regions (MSER)
  26.1  Threshold Sets and Extremal Regions
  26.2  Building the Component Tree
    26.2.1  Component Tree Algorithms
    26.2.2  Component Tree Algorithm 1: Global Immersion
    26.2.3  Component Tree Algorithm 2: Local Flooding
    26.2.4  Component Tree Examples
  26.3  Extracting MSERs from the Component Tree
    26.3.1  Component Size Variation (Growth Rate)
    26.3.2  Maximally Stable Components
    26.3.3  Constraints on Component Size and Diversity
    26.3.4  MSER Feature Statistics and Equivalent Ellipse
    26.3.5  Additional Constraints
    26.3.6  Detecting Dark and Bright Blobs
    26.3.7  MSER Examples
  26.4  Matching MSERs
  26.5  Local Affine Frames
  26.6  Summary

Appendix

A  Mathematical Symbols and Notation
  A.1  Symbols
  A.2  Sets
    A.2.1  Basic Set Symbols and Operators
    A.2.2  Destructive Set Operators
    A.2.3  Relations, Mappings and Functions
  A.3  Sequences
    A.3.1  Adding and Removing Elements
    A.3.2  "Stack"-Type Sequences
    A.3.3  "Queue"-Type Sequences
    A.3.4  Sorting Sequences
  A.4  Tuples and Objects
    A.4.1  Type Definition and Instantiation
    A.4.2  Accessing Object Components
    A.4.3  Duplication
  A.5  Complex Numbers

B  Linear Algebra
  B.1  Vectors and Matrices
    B.1.1  Column and Row Vectors
    B.1.2  Extracting Submatrices and Vectors
    B.1.3  Length (Norm) of a Vector
  B.2  Matrix Multiplication
    B.2.1  Scalar Multiplication
    B.2.2  Product of Two Matrices
    B.2.3  Matrix-Vector Products
  B.3  Vector Products
    B.3.1  Dot (Scalar) Product
    B.3.2  Outer Product
    B.3.3  Cross Product
  B.4  Trace and Determinant of a Square Matrix
  B.5  Eigenvalues and Eigenvectors
    B.5.1  Calculating Eigenvalues
    B.5.2  Generalized Symmetric Eigenproblems
  B.6  Homogeneous Coordinates
  B.7  Basic Matrix-Vector Operations with the Apache Commons Math Library
    B.7.1  Vectors and Matrices
    B.7.2  Matrix-Vector Multiplication
    B.7.3  Vector Products
    B.7.4  Inverse of a Square Matrix
    B.7.5  Eigenvalues and Eigenvectors
  B.8  Solving Systems of Linear Equations
    B.8.1  Exact Solutions
    B.8.2  Over-Determined System (Least-Squares Solutions)
    B.8.3  Solving Homogeneous Linear Systems

C  Nonlinear Least Squares
  C.1  Nonlinear Least-Squares Fitting
  C.2  Solution Methods
    C.2.1  Implementation With Apache Commons Math
    C.2.2  Example 1: One-Dimensional Curve Fitting
  C.3  Multi-Dimensional NLS Problems
    C.3.1  Example 2: Geometric Circle Fitting
    C.3.2  Numerical Estimation of Partial Derivatives

D  Elements from Calculus
  D.1  Scalar and Vector Fields
    D.1.1  The Jacobian Matrix
    D.1.2  Gradients
    D.1.3  Maximum Gradient Direction
    D.1.4  Divergence of a Vector Field
    D.1.5  The Laplacian Operator
    D.1.6  The Hessian Matrix
  D.2  Taylor Series Expansion
    D.2.1  Single-Variable Functions
    D.2.2  Multi-Variable Functions
    D.2.3  Finding Function Extrema by 2nd-Order Taylor Expansion
  D.3  Estimating Derivatives of Discrete Functions
    D.3.1  First-Order Derivatives
    D.3.2  Second-Order Derivatives
    D.3.3  Alternative Formulations



837 837 837 838 839 839 839 840 841 841 842



E



F



844 847 847 848 848



Sub-Pixel Maximum Finding . . . . . . . . . . . . . . . . . . . . . E.1 Second-Order Interpolation in 1D . . . . . . . . . . . . . . . . E.2 Subpixel Interpolation in 2D . . . . . . . . . . . . . . . . . . . . E.2.1 Quadratic Functions in 2D . . . . . . . . . . . . . . . . E.2.2 Method A: Second-Order Taylor Interpolation E.2.3 Method B: Least-Squares Quadratic Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . E.2.4 Quartic Interpolation . . . . . . . . . . . . . . . . . . . . .



849 849 851 852 853



Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F.1 Straight Lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F.1.1 Conversions Between Different Line Equations F.1.2 Intersections of Algebraic Lines . . . . . . . . . . . . F.1.3 Intersections of Lines in Hessian Normal Form F.1.4 Numeric Line Fitting Examples . . . . . . . . . . . . F.2 Circles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .



861 861 861 863 863 863 865



855 857



XXIII



F.2.1 Circle Equations and Conversions . . . . . . . . . . F.2.2 Circle From 3 Points . . . . . . . . . . . . . . . . . . . . . Ellipses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F.3.1 Ellipse Equations . . . . . . . . . . . . . . . . . . . . . . . . F.3.2 Converting Between Algebraic and Geometric Parameters . . . . . . . . . . . . . . . . . . . . F.3.3 Ellipse From 5 Points . . . . . . . . . . . . . . . . . . . .



865 866 866 866



Statistical Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . G.1 Mean, Variance, and Covariance . . . . . . . . . . . . . . . . . G.1.1 Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G.1.2 Variance and Covariance . . . . . . . . . . . . . . . . . . G.1.3 Biased vs. Unbiased Variance . . . . . . . . . . . . . . G.2 The Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . G.2.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G.2.2 Practical Calculation . . . . . . . . . . . . . . . . . . . . . G.3 Mahalanobis Distance . . . . . . . . . . . . . . . . . . . . . . . . . . G.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G.3.2 Relation to the Euclidean Distance . . . . . . . . . G.3.3 Numerical Considerations . . . . . . . . . . . . . . . . . G.3.4 Pre-Mapping Data For Efficient Mahalanobis Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G.4 The Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . G.4.1 Maximum Likelihood Estimation . . . . . . . . . . G.4.2 Gaussian Mixtures . . . . . . . . . . . . . . . . . . . . . . . G.4.3 Creating Gaussian Noise . . . . . . . . . . . . . . . . . .



871 871 871 871 872 872 873 874 874 874 875 875



Gaussian Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H.1 Cascading Gaussian Filters . . . . . . . . . . . . . . . . . . . . . . H.2 Gaussian Filters and Scale Space . . . . . . . . . . . . . . . . H.3 Effects of Gaussian Filtering in the Frequency Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H.4 LoG-Approximation by Difference of Gaussians (DoG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .



883 883 883



Contents



F.3



G



H



XXIV



867 869



876 878 878 880 880



884 885



I



Writing ImageJ Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . I.1 ImageJ Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I.1.1 Program Structure . . . . . . . . . . . . . . . . . . . . . . . I.1.2 A First Example: Inverting an Image . . . . . . . I.1.3 Plugin My_Inverter_A (PlugInFilter) . . . . I.1.4 Plugin My_Inverter_B (PlugIn) . . . . . . . . . . . I.1.5 When To Use PlugIn Or PlugInFilter? . . . I.1.6 Executing ImageJ “Commands” . . . . . . . . . . . I.1.7 ImageJ’s Command Recorder . . . . . . . . . . . . .



887 887 887 888 888 890 891 892 893



J



Java Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J.1 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J.1.1 Integer Division . . . . . . . . . . . . . . . . . . . . . . . . . J.1.2 Modulus Operator . . . . . . . . . . . . . . . . . . . . . . . J.1.3 Mathematical Functions in Class Math . . . . . . J.1.4 Numerical Rounding . . . . . . . . . . . . . . . . . . . . . J.1.5 Inverse Tangent Function . . . . . . . . . . . . . . . . .



895 895 895 896 897 898 899



J.2



J.1.6 Unsigned Byte Data . . . . . . . . . . . . . . . . . . . . . J.1.7 Classes Float and Double . . . . . . . . . . . . . . . . J.1.8 Testing Floating-Point Values Against Zero . Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J.2.1 Creating Arrays . . . . . . . . . . . . . . . . . . . . . . . . . J.2.2 Array Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J.2.3 Accessing Array Elements . . . . . . . . . . . . . . . . . J.2.4 2D Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J.2.5 Arrays of Objects . . . . . . . . . . . . . . . . . . . . . . . . J.2.6 Searching for Minimum and Maximum Values J.2.7 Sorting Arrays . . . . . . . . . . . . . . . . . . . . . . . . . .



899 Contents 900 901 901 901 902 902 903 905 906 906



References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 909 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 925



XXV



Part I



Images and Pixels



1 Digital Images



For a long time, using a computer to manipulate a digital image (i.e., digital image processing) was something performed by only a relatively small group of specialists who had access to expensive equipment. Usually this combination of specialists and equipment was only to be found in research labs, and so the field of digital image processing has its roots in the academic realm. Now, however, the combination of a powerful computer on every desktop and the fact that nearly everyone has some type of device for digital image acquisition, be it their cell phone camera, digital camera, or scanner, has resulted in a plethora of digital images and, with that, for many digital image processing has become as common as word processing. It was not that many years ago that digitizing a photo and saving it to a file on a computer was a time-consuming task. This is perhaps difficult to imagine given today's powerful hardware and operating system level support for all types of digital media, but it is always sobering to remember that "personal" computers in the early 1990s were not powerful enough to even load into main memory a single image from a typical digital camera of today. Now powerful hardware and software packages have made it possible for amateurs to manipulate digital images and videos just as easily as professionals. All of these developments have resulted in a large community that works productively with digital images while having only a basic understanding of the underlying mechanics. For the typical consumer merely wanting to create a digital archive of vacation photos, a deeper understanding is not required, just as a deep understanding of the combustion engine is unnecessary to successfully drive a car. Today, IT professionals must be more than simply familiar with digital image processing. They are expected to be able to knowledgeably manipulate images and related digital media, which are an increasingly important part of the workflow not only of those involved in medicine and media but all industries. In the same way, software engineers and computer scientists are increasingly confronted with developing programs, databases, and related systems that must correctly deal with digital images.









The simple lack of practical experience with this type of material, combined with an often unclear understanding of its basic foundations and a tendency to underestimate its difficulties, frequently leads to inefficient solutions, costly errors, and personal frustration.



1.1 Programming with Images Even though the term “image processing” is often used interchangeably with that of “image editing”, we introduce the following more precise definitions. Digital image editing, or as it is sometimes referred to, digital imaging, is the manipulation of digital images using an existing software application such as Adobe Photoshop® or Corel Paint® . Digital image processing, on the other hand, is the conception, design, development, and enhancement of digital imaging programs. Modern programming environments, with their extensive APIs (application programming interfaces), make practically every aspect of computing, be it networking, databases, graphics, sound, or imaging, easily available to nonspecialists. The possibility of developing a program that can reach into an image and manipulate the individual elements at its very core is fascinating and seductive. You will discover that with the right knowledge, an image becomes ultimately no more than a simple array of values, that with the right tools you can manipulate in any way imaginable. “Computer graphics”, in contrast to digital image processing, concentrates on the synthesis of digital images from geometrical descriptions such as three-dimensional (3D) object models [91, 106, 280]. While graphics professionals today tend to be interested in topics such as realism and, especially in terms of computer games, rendering speed, the field does draw on a number of methods that originate in image processing, such as image transformation (morphing), reconstruction of 3D models from image data, and specialized techniques such as image-based and nonphotorealistic rendering [206,281]. Similarly, image processing makes use of a number of ideas that have their origin in computational geometry and computer graphics, such as volumetric (voxel) models in medical image processing. The two fields perhaps work closest when it comes to digital postproduction of film and video and the creation of special effects [289]. This book provides a thorough grounding in the effective processing of not only images but also sequences of images; that is, videos.



1.2 Image Analysis and Computer Vision






Often it appears at first glance that a given image-processing task will have a simple solution, especially when it is something that is easily accomplished by our own visual system. Yet in practice it turns out that developing reliable, robust, and timely solutions is difficult or simply impossible. This is especially true when the problem involves image analysis; that is, where the ultimate goal is not to enhance or otherwise alter the appearance of an image but instead to extract



meaningful information about its contents—be it distinguishing an object from its background, following a street on a map, or finding the bar code on a milk carton. Tasks such as these often turn out to be much more difficult to accomplish than we would expect. We expect technology to improve on what we can do by ourselves. Be it as simple as a lever to lift more weight or binoculars to see farther or as complex as an airplane to move us across continents—science has created so much that improves on, sometimes by unbelievable factors, what our biological systems are able to perform. So, it is perhaps humbling to discover that today's technology is nowhere near as capable, when it comes to image analysis, as our own visual system. While it is possible that this will always remain true, do not let this discourage you. Instead consider it a challenge to develop creative solutions. Using the tools, techniques, and fundamental knowledge available today, it is possible not only to solve many problems but to create robust, reliable, and fast applications. While image analysis is not the main subject of this book, it often naturally intersects with image processing and we will explore this intersection in detail in these situations: finding simple curves (Ch. 12), segmenting image regions (Ch. 8), and comparing images (Ch. 23). In these cases, we present solutions that work directly on the pixel data in a bottom-up way without recourse to domain-specific knowledge (i.e., blind solutions). In this way, our solutions essentially embody the distinction between image processing, pattern recognition, and computer vision, respectively. While these two disciplines are firmly grounded in, and rely heavily on, image processing, their ultimate goals are much more lofty. Pattern recognition is primarily a mathematical discipline and has been responsible for techniques such as clustering, hidden Markov models (HMMs), decision trees, and principal component analysis (PCA), which are used to discover patterns in data and signals. Methods from pattern recognition have been applied extensively to problems arising in computer vision and image analysis. A good example of their successful application is optical character recognition (OCR), where robust, highly accurate turnkey solutions are available for recognizing scanned text. Pattern recognition methods are truly universal and have been successfully applied not only to images but also speech and audio signals, text documents, stock trades, and finding trends in large databases, where it is often called data mining. Dimensionality reduction, statistical, and syntactical methods play important roles in pattern recognition (see, e.g., [76, 192, 259]). Computer vision tackles the problem of engineering artificial visual systems capable of somehow comprehending and interpreting our real, 3D world. Popular topics in this field include scene understanding, object recognition, motion interpretation (tracking), autonomous navigation, and the robotic manipulation of objects in a scene. Since computer vision has its roots in artificial intelligence (AI), many AI methods were originally developed to either tackle or represent a problem in computer vision (see, e.g., [63, Ch. 13]).









The fields still have much in common today, especially in terms of adaptive methods and machine learning. Further literature on computer vision includes [20, 96, 131, 242, 251, 264]. Ultimately you will find image processing to be both intellectually challenging and professionally rewarding, as the field is ripe with problems that were originally thought to be relatively simple to solve but have to this day refused to give up their secrets. With the background and techniques presented in this text, you will not only be able to develop complete image-processing solutions but will also have the prerequisite knowledge to tackle unsolved problems and the real possibility of expanding the horizons of science: for while image processing by itself may not change the world, it is likely to be the foundation that supports marvels of the future.



1.3 Types of Digital Images Digital images are the central theme of this book, and unlike just a few years ago, this term is now so commonly used that there is really no reason to explain it further. Yet this book is not about all types of digital images, instead it focuses on images that are made up of picture elements, more commonly known as pixels, arranged in a regular rectangular grid. Every day, people work with a large variety of digital raster images such as color photographs of people and landscapes, grayscale scans of printed documents, building plans, faxed documents, screenshots, medical images such as x-rays and ultrasounds, and a multitude of others (see Fig. 1.1 for examples). Despite all the different sources for these images, they are all, as a rule, ultimately represented as rectangular ordered arrays of image elements.



1.4 Image Acquisition The process by which a scene becomes a digital image is varied and complicated, and, in most cases, the images you work with will already be in digital form, so we only outline here the essential stages in the process. As most image acquisition methods are essentially variations on the classical optical camera, we will begin by examining it in more detail. 1.4.1 The Pinhole Camera Model






The pinhole camera is one of the simplest camera models and has been in use since the 13th century, when it was known as the “Camera Obscura”. While pinhole cameras have no practical use today except to hobbyists, they are a useful model for understanding the essential optical components of a simple camera. The pinhole camera consists of a closed box with a small opening on the front side through which light enters, forming an image on the opposing wall. The light forms a smaller, inverted image of the scene (Fig. 1.2).









Fig. 1.1 Examples of digital images. Natural landscape (a), synthetically generated scene (b), poster graphic (c), computer screenshot (d), black and white illustration (e), barcode (f), fingerprint (g), x-ray (h), microscope slide (i), satellite image (j), radar image (k), astronomical object (l).



Perspective Projection

The geometric properties of the pinhole camera are very simple. The optical axis runs through the pinhole perpendicular to the image plane. We assume a visible object, in our illustration the cactus, located at a horizontal distance Z from the pinhole and vertical distance Y from the optical axis. The height of the projection y is determined by two parameters: the fixed depth of the camera box f and the distance Z to the object from the origin of the coordinate system. By comparison we see that

x = −f · X/Z   and   y = −f · Y/Z.   (1.1)
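As a purely illustrative sketch (not taken from any library used in this book), Eqn. (1.1) translates directly into a few lines of Java. The class and parameter names below are hypothetical and chosen only for this example:

    // Minimal sketch of the pinhole projection in Eqn. (1.1).
    // (X, Y, Z) is a scene point in the camera coordinate system,
    // with Z > 0 in front of the pinhole; f is the depth of the box ("focal length").
    public class PinholeProjection {
        public static double[] project(double X, double Y, double Z, double f) {
            double x = -f * X / Z;   // projected x coordinate (Eqn. 1.1)
            double y = -f * Y / Z;   // projected y coordinate (Eqn. 1.1)
            return new double[] {x, y};
        }

        public static void main(String[] args) {
            double[] p = project(1.0, 2.0, 5.0, 0.05);   // arbitrary example values
            System.out.println("x = " + p[0] + ", y = " + p[1]);
        }
    }

The negative sign of the results reflects the flipped (rotated) image discussed below.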






Fig. 1.2 Geometry of the pinhole camera. The pinhole opening serves as the origin (O) of the 3D coordinate system (X, Y, Z) for the objects in the scene. The optical axis, which runs through the opening, is the Z axis of this coordinate system. A separate 2D coordinate system (x, y) describes the projection points on the image plane. The distance f ("focal length") between the opening and the image plane determines the scale of the projection.






The scale of the resulting image changes in proportion to the depth of the box (i.e., the distance f) in a way similar to how the focal length does in an everyday camera. For a fixed image, a small f (i.e., short focal length) results in a small image and a large viewing angle, just as occurs when a wide-angle lens is used, while increasing the "focal length" f results in a larger image and a smaller viewing angle, just as occurs when a telephoto lens is used. The negative sign in Eqn. (1.1) means that the projected image is flipped in the horizontal and vertical directions and rotated by 180°. Equation (1.1) describes what is commonly known today as the perspective transformation.1 Important properties of this theoretical model are that straight lines in 3D space always appear straight in 2D projections and that circles appear as ellipses.

1.4.2 The "Thin" Lens

While the simple geometry of the pinhole camera makes it useful for understanding its basic principles, it is never really used in practice. One of the problems with the pinhole camera is that it requires a very small opening to produce a sharp image. This in turn reduces the amount of light passed through and thus leads to extremely long exposure times. In reality, glass lenses or systems of optical lenses are used whose optical properties are greatly superior in many aspects but of course are also much more complex. Instead we can make our model more realistic, without unduly increasing its complexity, by replacing the pinhole with a "thin lens" as in Fig. 1.3. In this model, the lens is assumed to be symmetric and infinitely thin, such that all light rays passing through it cross through a virtual plane in the middle of the lens. The resulting image geometry is the same as that of the pinhole camera.






1 It is hard to imagine today that the rules of perspective geometry, while known to the ancient mathematicians, were only rediscovered in 1430 by the Renaissance painter Brunelleschi.



Fig. 1.3 Thin lens projection model.



This model is not sufficiently complex to encompass the physical details of actual lens systems, such as geometrical distortions and the distinct refraction properties of different colors. So, while this simple model suffices for our purposes (i.e., understanding the mechanics of image acquisition), much more detailed models that incorporate these additional complexities can be found in the literature (see, e.g., [148]).

1.4.3 Going Digital

What is projected on the image plane of our camera is essentially a two-dimensional (2D), time-dependent, continuous distribution of light energy. In order to convert this image into a digital image on our computer, the following three main steps are necessary:
1. The continuous light distribution must be spatially sampled.
2. This resulting function must then be sampled in time to create a single (still) image.
3. Finally, the resulting values must be quantized to a finite range of integers (or floating-point values) such that they can be represented by digital numbers.

Step 1: Spatial Sampling
The spatial sampling of an image (i.e., the conversion of the continuous signal to its discrete representation) depends on the geometry of the sensor elements of the acquisition device (e.g., a digital or video camera). The individual sensor elements are arranged in ordered rows, almost always at right angles to each other, along the sensor plane (Fig. 1.4). Other types of image sensors, which include hexagonal elements and circular sensor structures, can be found in specialized products.

Step 2: Temporal Sampling
Temporal sampling is carried out by measuring at regular intervals the amount of light incident on each individual sensor element. The CCD2 in a digital camera does this by triggering the charging process and then measuring the amount of electrical charge that has built up during the specified amount of time that the CCD was illuminated.



2 Charge-coupled device.






Fig. 1.4 The geometry of the sensor elements is directly responsible for the spatial sampling of the continuous image. In the simplest case, a plane of sensor elements is arranged in an evenly spaced grid, and each element measures the amount of light that falls on it.






Step 3: Quantization of Pixel Values
In order to store and process the image values on the computer they are commonly converted to an integer scale (e.g., 256 = 2^8 or 4096 = 2^12). Occasionally floating-point values are used in professional applications, such as medical imaging. Conversion is carried out using an analog to digital converter, which is typically embedded directly in the sensor electronics so that conversion occurs at image capture or is performed by special interface hardware.

Images as Discrete Functions
The result of these three stages is a description of the image in the form of a 2D, ordered matrix of integers (Fig. 1.5). Stated a bit more formally, a digital image I is a 2D function that maps from the domain of integer coordinates N × N to a range of possible pixel values P such that

I(u, v) ∈ P and u, v ∈ N.
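As a minimal illustration (using nothing but a plain Java array, a sketch rather than the representation used later in this book), such a discrete image can be created and accessed as follows, with the row index v first and the column index u second:

    // A digital image as a plain 2D array of integer pixel values.
    public class ArrayImageDemo {
        public static void main(String[] args) {
            int M = 640, N = 480;          // image width (columns) and height (rows)
            int[][] I = new int[N][M];     // pixel values, all initialized to 0
            I[10][27] = 128;               // write the value of I(u = 27, v = 10)
            System.out.println(I[10][27]); // read it back -> 128
        }
    }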



Now we are ready to transfer the image to our computer so that we can save, compress, and otherwise manipulate it into the file format of our choice. At this point, it is no longer important to us how the image originated since it is now a simple 2D array of numerical data. Before moving on, we need a few more important definitions. 1.4.4 Image Size and Resolution






In the following, we assume rectangular images, and while that is a relatively safe assumption, exceptions do exist. The size of an image is determined directly from the width M (number of columns) and the height N (number of rows) of the image matrix I. The resolution of an image specifies the spatial dimensions of the image in the real world and is given as the number of image elements per measurement; for example, dots per inch (dpi) or lines per inch (lpi) for print production, or in pixels per kilometer for satellite images. In most cases, the resolution of an image is the same in the horizontal and vertical directions, which means that the



Fig. 1.5 The transformation of a continuous grayscale image F(x, y) to a discrete digital image I(u, v) (left), image detail (below).



image elements are square. Note that this is not always the case as, for example, the image sensors of most current video cameras have non-square pixels! The spatial resolution of an image may not be relevant in many basic image processing steps, such as point operations or filters. Precise resolution information is, however, important in cases where geometrical elements such as circles need to be drawn on an image or when distances within an image need to be measured. For these reasons, most image formats and software systems designed for professional applications rely on precise information about image resolution.

1.4.5 Image Coordinate System

In order to know which position on the image corresponds to which image element, we need to impose a coordinate system. Contrary to normal mathematical conventions, in image processing the coordinate system is usually flipped in the vertical direction; that is, the y-coordinate runs from top to bottom and the origin lies in the upper left corner (Fig. 1.6). While this system has no practical or theoretical advantage, and in fact may be a bit confusing in the context of geometrical transformations, it is used almost without exception in imaging software systems. The system supposedly has its roots in the original design of television broadcast systems, where the picture rows are numbered along the vertical deflection of the electron beam, which moves from the top to the bottom of the screen. We start the numbering of rows and columns at zero for practical reasons, since in Java array indexing also begins at zero.

1.4.6 Pixel Values

The information within an image element depends on the data type used to represent it. Pixel values are practically always binary words of length k so that a pixel can represent any of 2^k different values. The value k is called the bit depth (or just "depth") of the image.
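As a small illustration of this coordinate convention and of an 8-bit value range 0, . . . , 255, the following sketch inverts a grayscale image by visiting every pixel (u, v), with u running from 0 to M−1 and v from 0 to N−1. It assumes that ImageJ's ImageProcessor class is available (see Appendix I on writing ImageJ plugins), whose getPixel() and putPixel() methods use exactly these (u, v) coordinates:

    import ij.process.ImageProcessor;

    public class InvertExample {
        // Inverts an 8-bit grayscale image, origin (0, 0) in the upper left corner.
        static void invert(ImageProcessor ip) {
            int M = ip.getWidth();     // number of columns
            int N = ip.getHeight();    // number of rows
            for (int v = 0; v < N; v++) {        // rows, top to bottom
                for (int u = 0; u < M; u++) {    // columns, left to right
                    int p = ip.getPixel(u, v);   // value in [0, 255] for k = 8
                    ip.putPixel(u, v, 255 - p);
                }
            }
        }
    }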



Fig. 1.6 Image coordinates. In digital image processing, it is common to use a coordinate system where the origin (u = 0, v = 0) lies in the upper left corner. The coordinates u, v represent the columns and the rows of the image, respectively. For an image with dimensions M × N, the maximum column number is umax = M−1 and the maximum row number is vmax = N−1.



Table 1.1 Bit depths of common image types and typical application domains.

Grayscale (Intensity Images):
Chan.  Bits/Pix.  Range             Use
1      1          [0, 1]            Binary image: document, illustration, fax
1      8          [0, 255]          Universal: photo, scan, print
1      12         [0, 4095]         High quality: photo, scan, print
1      14         [0, 16383]        Professional: photo, scan, print
1      16         [0, 65535]        Highest quality: medicine, astronomy

Color Images:
Chan.  Bits/Pix.  Range             Use
3      24         [0, 255]^3        RGB, universal: photo, scan, print
3      36         [0, 4095]^3       RGB, high quality: photo, scan, print
3      42         [0, 16383]^3      RGB, professional: photo, scan, print
4      32         [0, 255]^4        CMYK, digital prepress

Special Images:
Chan.  Bits/Pix.  Range             Use
1      16         [−32768, 32767]   Integer values pos./neg., increased range
1      32         ±3.4·10^38        Floating-point values: medicine, astronomy
1      64         ±1.8·10^308       Floating-point values: internal processing



The exact bit-level layout of an individual pixel depends on the kind of image; for example, binary, grayscale, or RGB3 color. The properties of some common image types are summarized below (also see Table 1.1).

Grayscale Images (Intensity Images)
The image data in a grayscale image consist of a single channel that represents the intensity, brightness, or density of the image. In most cases, only positive values make sense, as the numbers represent the intensity of light energy or density of film and thus cannot be negative, so typically whole integers in the range 0, . . . , 2^k−1 are used. For example, a typical grayscale image uses k = 8 bits (1 byte) per pixel and intensity values in the range 0, . . . , 255, where the value 0 represents the minimum brightness (black) and 255 the maximum brightness (white). For many professional photography and print applications, as well as in medicine and astronomy, 8 bits per pixel is not sufficient.






3 Red, green, blue.



Image depths of 12, 14, and even 16 bits are often encountered in these domains. Note that bit depth usually refers to the number of bits used to represent one color component, not the number of bits needed to represent an entire color pixel. For example, an RGB-encoded color image with an 8-bit depth would require 8 bits for each channel for a total of 24 bits, while the same image with a 12-bit depth would require a total of 36 bits.

Binary Images
Binary images are a special type of intensity image where pixels can only take on one of two values, black or white. These values are typically encoded using a single bit (0/1) per pixel. Binary images are often used for representing line graphics, archiving documents, encoding fax transmissions, and of course in electronic printing.

Color Images
Most color images are based on the primary colors red, green, and blue (RGB), typically making use of 8 bits for each color component. In these color images, each pixel requires 3 × 8 = 24 bits to encode all three components, and the range of each individual color component is [0, 255]. As with intensity images, color images with 30, 36, and 42 bits per pixel are commonly used in professional applications. Finally, while most color images contain three components, images with four or more color components are common in most prepress applications, typically based on the subtractive CMYK (Cyan-Magenta-Yellow-Black) color model (see Ch. 13). Indexed or palette images constitute a very special class of color image. The difference between an indexed image and a true color image is the number of different colors (fewer for an indexed image) that can be used in a particular image. In an indexed image, the pixel values are only indices (with a maximum of 8 bits) onto a specific table of selected full-color values (see Sec. 13.1.1).

Special Images
Special images are required if none of the above standard formats is sufficient for representing the image values. Two common examples of special images are those with negative values and those with floating-point values. Images with negative values arise during image-processing steps, such as filtering for edge detection (see Sec. 5.2.2), and images with floating-point values are often found in medical, biological, or astronomical applications, where extended numerical range and precision are required. These special formats are mostly application specific and thus may be difficult to use with standard image-processing tools.
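To make the 3 × 8 = 24-bit RGB layout described above concrete, the following sketch packs three 8-bit components into a single Java int and unpacks them again. The bit positions chosen here (red in the upper of the three bytes) follow the common packing convention also used by Java's own ARGB pixel format; the class name is a hypothetical example:

    public class RgbPacking {
        public static void main(String[] args) {
            int r = 200, g = 100, b = 50;          // 8-bit components, each in [0, 255]
            int rgb = (r << 16) | (g << 8) | b;    // pack into the lower 24 bits of an int

            int r2 = (rgb >> 16) & 0xFF;           // unpack red
            int g2 = (rgb >> 8) & 0xFF;            // unpack green
            int b2 = rgb & 0xFF;                   // unpack blue
            System.out.println(r2 + ", " + g2 + ", " + b2);   // prints 200, 100, 50
        }
    }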



1.5 Image File Formats

While in this book we almost always consider image data as being already in the form of a 2D array—ready to be accessed by a program—in practice image data must first be loaded into memory from a file.









Files provide the essential mechanism for storing, archiving, and exchanging image data, and the choice of the correct file format is an important decision. In the early days of digital image processing (i.e., before around 1985), most software developers created a new custom file format for almost every new application they developed. Today there exists a wide range of standardized file formats, and developers can almost always find at least one existing format that is suitable for their application. Using standardized file formats vastly increases the ease with which images can be exchanged and the likelihood that the images will be readable by other software in the long term. Yet for many projects the selection of the right file format is not always simple, and compromises must be made. The following sub-sections outline a few of the typical criteria that need to be considered when selecting an appropriate file format.

1.5.1 Raster Versus Vector Data

In the following, we will deal exclusively with file formats for storing raster images; that is, images that contain pixel values arranged in a regular matrix using discrete coordinates. In contrast, vector graphics represent geometric objects using continuous coordinates, which are only rasterized once they need to be displayed on a physical device such as a monitor or printer. A number of standardized file formats exist for vector images, such as the ANSI/ISO standard format CGM (Computer Graphics Metafile) and SVG (Scalable Vector Graphics),4 as well as proprietary formats such as DXF (Drawing Exchange Format from AutoDesk), AI (Adobe Illustrator), PICT (QuickDraw Graphics Metafile from Apple), and WMF/EMF (Windows Metafile and Enhanced Metafile from Microsoft). Most of these formats can contain both vector data and raster images in the same file. The PS (PostScript) and EPS (Encapsulated PostScript) formats from Adobe as well as the PDF (Portable Document Format) also offer this possibility, although they are typically used for printer output and archival purposes.5

1.5.2 Tagged Image File Format (TIFF)

This is a widely used and flexible file format designed to meet the professional needs of diverse fields. It was originally developed by Aldus and later extended by Microsoft and currently Adobe. The format supports a range of grayscale, indexed, and true color images, but also special image types with large-depth integer and floating-point elements. A TIFF file can contain a number of images with different properties. The TIFF specification provides a range of different compression methods (LZW, ZIP, CCITT, and JPEG) and color spaces, so that it is possible, for example, to store a number of variations of an image in different sizes and representations together in a single TIFF file.






4 https://www.w3.org/TR/SVG2/
5 Special variations of PS, EPS, and PDF files are also used as (editable) exchange formats for raster and vector data; for example, both Adobe's Photoshop (Photoshop-EPS) and Illustrator (AI).









Fig. 1.7 Structure of a TIFF file. A TIFF file consists of a header and a linked list of image objects, three in this example. Each image object consists of a list of “tags” with their corresponding entries followed by a pointer to the actual image data. This structure makes TIFF very flexible and easy to extend.



The flexibility of TIFF has made it an almost universal exchange format that is widely used in archiving documents, scientific applications, digital photography, and digital video production. The strength of this image format lies within its architecture (Fig. 1.7), which enables new image types and information blocks to be created by defining new "tags". In this flexibility also lies the weakness of the format, namely that proprietary tags are not always supported and so the "unsupported tag" error is sometimes still encountered when loading TIFF files. ImageJ also reads only a few uncompressed variations of TIFF formats,6 and bear in mind that most popular Web browsers currently do not support TIFF either.

1.5.3 Graphics Interchange Format (GIF)

The Graphics Interchange Format (GIF) was originally designed by CompuServe in 1986 to efficiently encode the rich line graphics used in their dial-up Bulletin Board System (BBS). It has since grown into one of the most widely used formats for representing images on the Web. This popularity is largely due to its early support for indexed color at multiple bit depths, LZW7 compression, interlaced image loading, and ability to encode simple animations by storing a number of images in a single file for later sequential display. GIF is essentially an indexed image file format designed for color and grayscale images with a maximum depth of 8 bits and consequently it does not support true color images. It offers efficient support for encoding palettes containing from 2 to 256 colors, one of which can be marked for transparency. GIF supports color tables in the range of 2, . . . , 256, enabling pixels to be encoded using fewer bits. As an example, the pixels of an image using 16 unique colors require only 4 bits to store the 16 possible color values 0, . . . , 15.



6 The ImageIO plugin offers support for a wider range of TIFF formats.
7 Lempel-Ziv-Welch









This means that instead of storing each pixel using 1 byte, as done in other bitmap formats, GIF can encode two 4-bit pixels into each 8-bit byte. This results in a 50% storage reduction over the standard 8-bit indexed color bitmap format. The GIF file format is designed to efficiently encode "flat" or "iconic" images consisting of large areas of the same color. It uses lossy color quantization as well as lossless LZW compression to efficiently encode large areas of the same color. Despite the popularity of the format, when developing new software, the PNG8 format, presented in the next sub-section, should be preferred, as it outperforms GIF by almost every metric.

1.5.4 Portable Network Graphics (PNG)

PNG (pronounced "ping") was originally developed as a replacement for the GIF file format when licensing issues9 arose because of its use of LZW compression. It was designed as a universal image format especially for use on the Internet, and, as such, PNG supports three different types of images:
• true color images (with up to 3 × 16 bits/pixel),
• grayscale images (with up to 16 bits/pixel),
• indexed color images (with up to 256 colors).
Additionally, PNG includes an alpha channel for transparency with a maximum depth of 16 bits. In comparison, the transparency channel of a GIF image is only a single bit deep. While the format only supports a single image per file, it is exceptional in that it allows images of up to 2^30 × 2^30 pixels. The format supports lossless compression by means of a variation of PKZIP (Phil Katz's ZIP). No lossy compression is available, as PNG was not designed as a replacement for JPEG. Ultimately, the PNG format meets or exceeds the capabilities of the GIF format in every way except GIF's ability to include multiple images in a single file to create simple animations. Currently, PNG should be considered the format of choice for representing uncompressed, lossless, true color images for use on the Web.

1.5.5 JPEG

The classic JPEG standard defines a compression method for continuous grayscale and color images, such as those that would arise from nature photography. The format was developed by the Joint Photographic Experts Group (JPEG) with the goal of achieving an average data reduction of a factor of 1:16 and was established in 1990 as ISO Standard IS-10918. In practice, JPEG achieves, depending on the application, compression rates in the order of 1 bit per pixel (i.e., a compression factor of around 1:25) when compressing 24-bit color images to an acceptable quality for viewing. The JPEG standard supports images with up to 256 color components, which includes CMYK images (see Sec. 13.2.5).






8 Portable network graphics
9 Unisys's U.S. LZW Patent No. 4,558,302 expired on June 20, 2003.



Although the original JPEG-1 standard is still the most widely used image file format, JPEG today is actually a collection of various image compression and coding technologies under continuous development.10 The modular design of the original JPEG coding standard [186] allows for variations of the "baseline" algorithm; for example, there exists an uncompressed version, though it is not often used. In the case of RGB images, the core of the algorithm consists of three main steps:
1. Color conversion and down sampling: A color transformation from RGB into the Y Cb Cr space (see Sec. 13.2.4) is used to separate the actual color components from the brightness Y component. Since the human visual system is less sensitive to rapid changes in color, it is possible to compress the color components more, resulting in a significant data reduction, without a subjective loss in image quality.
2. Cosine transform and quantization in frequency space: The image is divided up into a regular grid of 8 × 8 pixel blocks, and for each independent block, the frequency spectrum is computed using the discrete cosine transformation (see Ch. 20). Next, the 64 spectral coefficients of each block are quantized using a quantization table. The size of this table largely determines the eventual compression ratio, and therefore the visual quality, of the image. In general, the high frequency coefficients, which are essential for the "sharpness" of the image, are reduced most during this step. During decompression these high frequency values will be approximated by computed values.
3. Lossless compression: Finally, the quantized spectral components data stream is again compressed using a lossless method, such as arithmetic or Huffman encoding, in order to remove the last remaining redundancy in the data stream.
The JPEG compression method combines a number of different compression methods and its complexity should not be underestimated. Implementing even the baseline version is nontrivial, so application support for JPEG increased sharply once the Independent JPEG Group (IJG)11 made available a reference implementation of the JPEG algorithm in 1991. Drawbacks of the JPEG compression algorithm include its limitation to 8-bit images, its poor performance on non-photographic images such as line art (for which it was not designed), its handling of abrupt transitions within an image, and the striking artifacts caused by the 8 × 8 pixel blocks at high compression rates. Figure 1.9 shows the results of compressing a section of a grayscale image using different quality factors (Photoshop QJPG = 10, 5, 1).

JPEG File Interchange Format (JFIF)
Despite common usage, JPEG is not a file format; it is "only" a method of compressing image data. The actual JPEG standard only specifies the JPEG codec (compressor and decompressor) and by design leaves the wrapping, or file format, undefined.12



10 https://jpeg.org
11 https://www.ijg.org
12 To be exact, the JPEG standard only defines how to compress the individual components and the structure of the JPEG stream.






Fig. 1.8 JPEG compression of an RGB image. Using a color space transformation, the color components Cb, Cr are separated from the Y luminance component and subjected to a higher rate of compression. Each of the three components is then run independently through the JPEG compression pipeline and merged into a single JPEG data stream. Decompression follows the same stages in reverse order.



What is normally referred to as a JPEG file is almost always an instance of a "JPEG File Interchange Format" (JFIF) file, originally developed by Eric Hamilton and the IJG. JFIF specifies a file format based on the JPEG standard by defining the remaining necessary elements of a file format. The JPEG standard leaves some parts of the codec undefined for generality, and in these cases JFIF makes a specific choice. As an example, in step 1 of the JPEG codec, the specific color space used in the color transformation is not part of the JPEG standard, so it is specified by the JFIF standard. As such, the use of different compression ratios for color and luminance is a practical implementation decision specified by JFIF and is not a part of the actual JPEG encoder.

Exchangeable Image File Format (EXIF)
The Exchangeable Image File Format (EXIF) is a variant of the JPEG (JFIF) format designed for storing image data originating on digital cameras, and to that end it supports storing metadata such as the type of camera, date and time, photographic parameters such as aperture and exposure time, as well as geographical (GPS) data. EXIF was developed by the Japan Electronics and Information Technology Industries Association (JEITA) as a part of the DCF13 guidelines and is used today by practically all manufacturers as the standard format for storing digital images on memory cards. Internally, EXIF uses TIFF to store the metadata information and JPEG to encode a thumbnail preview image. The file structure is designed so that it can be processed by existing JPEG/JFIF readers without a problem.

JPEG-2000
JPEG-2000, which is specified by an ISO-ITU standard ("Coding of Still Pictures"),14 was designed to overcome some of the better-known weaknesses of the traditional JPEG codec. Among the improvements made in JPEG-2000 are the use of larger, 64 × 64 pixel blocks and replacement of the discrete cosine transform by the wavelet transform.






13 Design Rule for Camera File System
14 www.jpeg.org/JPEG2000.htm






Fig. 1.9 Artifacts arising from JPEG compression. A section of the original image (a) and the results of JPEG compression at different quality factors: QJPG = 10 (b), QJPG = 5 (c), and QJPG = 1 (d). In parentheses are the resulting file sizes for the complete (dimensions 274 × 274) image: (a) original (75.08 kB), (b) QJPG = 10 (11.40 kB), (c) QJPG = 5 (7.24 kB), (d) QJPG = 1 (5.52 kB).



These and other improvements enable it to achieve significantly higher compression ratios than JPEG—up to 0.25 bits per pixel on RGB color images. Despite these advantages, JPEG-2000 is supported by only a few image-processing applications and Web browsers.15
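As a side note, the quality/size trade-off illustrated in Fig. 1.9 can also be explored directly from Java using the standard javax.imageio API. The following minimal sketch (file names are arbitrary examples; the input is assumed to be an RGB image without an alpha channel) writes a JPEG with an explicitly chosen quality factor:

    import java.awt.image.BufferedImage;
    import java.io.File;
    import javax.imageio.IIOImage;
    import javax.imageio.ImageIO;
    import javax.imageio.ImageWriteParam;
    import javax.imageio.ImageWriter;
    import javax.imageio.stream.ImageOutputStream;

    public class JpegQualityDemo {
        public static void main(String[] args) throws Exception {
            BufferedImage img = ImageIO.read(new File("input.png")); // any readable RGB image

            ImageWriter writer = ImageIO.getImageWritersByFormatName("jpg").next();
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionQuality(0.5f);   // 0.0 = smallest file, 1.0 = best quality

            try (ImageOutputStream out = ImageIO.createImageOutputStream(new File("output.jpg"))) {
                writer.setOutput(out);
                writer.write(null, new IIOImage(img, null, null), param);
            }
            writer.dispose();
        }
    }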



15 At this time, ImageJ does not offer native JPEG-2000 support.









1.5.6 Legacy File Formats

For most practical applications, one of the following file formats is sufficient: TIFF as a universal format supporting a wide variety of uncompressed images and JPEG/JFIF for digital color photos when storage size is a concern, and there is either PNG or GIF for when an image is destined for use on the Web. In addition, there exist countless other file formats, such as those encountered in legacy applications or in special application areas where they are traditionally used.

Windows Bitmap (BMP)
The Windows Bitmap (BMP) format is a simple, and under Windows widely used, file format supporting grayscale, indexed, and true color images. It also supports binary images, but not in an efficient manner, since each pixel is stored using an entire byte. Optionally, the format supports simple lossless, run-length-based compression. While BMP offers storage for a similar range of image types as TIFF, it is a much less flexible format.

Portable Bitmap Format (PBM)
The Portable Bitmap Format (PBM) family16 consists of a series of very simple file formats that are exceptional in that they can be optionally saved in a human-readable text format that can be easily read in a program or simply edited using a text editor. A simple PGM image is shown in Fig. 1.10. The characters P2 in the first line indicate that the image is a PGM ("plain") file stored in human-readable format. The next line shows how comments can be inserted directly into the file by beginning the line with the # character. Line three gives the image's dimensions, in this case width 17 and height 7, and line four defines the maximum pixel value, in this case 255. The remaining lines give the actual pixel values. This format makes it easy to create and store image data without any explicit imaging API, since it requires only basic text I/O that is available in any programming environment. In addition, the format supports a much more machine-optimized "raw" output mode in which pixel values are stored as bytes. PBM is widely used under Unix and supports the following formats: PBM (portable bitmap) for binary bitmaps, PGM (portable graymap) for grayscale images, and PNM (portable any map) for color images. PGM images can be opened by ImageJ.

Other Image File Formats
• RGB, a simple format from Silicon Graphics.
• RAS (Sun Raster Format), a simple format from Sun Microsystems.
• TGA (Truevision Targa File Format), the first 24-bit file format for PCs. It supports numerous image types with 8- to 32-bit depths and is still used in medicine and biology.
• XBM/XPM (X-Windows Bitmap/Pixmap), a group of ASCII-encoded formats used in the X-Windows system and similar to PBM/PGM.






16 http://netpbm.sourceforge.net



P2
# oie.pgm
17 7
255
0 13 13 13 13 13 13 13 0 0 0 0 0  0  0  0  0
0 13  0  0  0  0  0 13 0 7 7 0 0 81 81 81 81
0 13  0  7  7  7  0 13 0 7 7 0 0 81  0  0  0
0 13  0  7  0  7  0 13 0 7 7 0 0 81 81 81  0
0 13  0  7  7  7  0 13 0 7 7 0 0 81  0  0  0
0 13  0  0  0  0  0 13 0 7 7 0 0 81 81 81 81
0 13 13 13 13 13 13 13 0 0 0 0 0  0  0  0  0

Fig. 1.10 Example of a PGM file in human-readable text format (top) and the corresponding grayscale image (below).
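Because a "plain" PGM file is ordinary text, it can indeed be produced with nothing more than basic text output. The following sketch (the file name and the tiny test image are arbitrary examples, not taken from the book's library) stores a small grayscale array in the same P2 format as Fig. 1.10:

    import java.io.PrintWriter;

    public class WritePgm {
        public static void main(String[] args) throws Exception {
            int[][] I = {                        // a tiny 4 x 3 test image (values 0..255)
                { 0,  64, 128, 255},
                {32,  96, 160, 224},
                {64, 128, 192, 255}
            };
            int N = I.length, M = I[0].length;   // height (rows), width (columns)

            try (PrintWriter out = new PrintWriter("tiny.pgm")) {
                out.println("P2");               // "plain" PGM magic number
                out.println("# tiny.pgm");       // comment line
                out.println(M + " " + N);        // width and height
                out.println(255);                // maximum pixel value
                for (int v = 0; v < N; v++) {
                    StringBuilder row = new StringBuilder();
                    for (int u = 0; u < M; u++) {
                        row.append(I[v][u]).append(' ');
                    }
                    out.println(row.toString().trim());
                }
            }
        }
    }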



1.5.7 Bits and Bytes

Today, opening, reading, and writing image files is mostly carried out by means of existing software libraries. Yet sometimes you still need to deal with the structure and contents of an image file at the byte level, for instance when you need to read an unsupported file format or when you receive a file where the format of the data is unknown.

Big Endian and Little Endian
In the standard model of a computer, a file consists of a simple sequence of 8-bit bytes, and a byte is the smallest entry that can be read or written to a file. In contrast, the image elements as they are stored in memory are usually larger than a byte; for example, a 32-bit int value (= 4 bytes) is used for an RGB color pixel. The problem is that storing the four individual bytes that make up the image data can be done in different ways. In order to correctly recreate the original color pixel, we must naturally know the order in which bytes in the file are arranged. Consider, for example, a 32-bit int number z with the binary and hexadecimal values17

z = 00010010 00110100 01010110 01111000B ≡ 12345678H,   (1.2)



then 00010010B ≡ 12H is the value of the most significant byte (MSB) and 01111000B ≡ 78H the least significant byte (LSB). When the individual bytes in the file are arranged in order from MSB to LSB as they are saved, we call the ordering "big endian"; when they are arranged in the opposite direction, "little endian". Thus the 32-bit value z from Eqn. (1.2) could be stored in one of the following two modes:

Ordering        Byte Sequence    1    2    3    4
big endian      MSB → LSB        12H  34H  56H  78H
little endian   LSB → MSB        78H  56H  34H  12H
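The effect of the two orderings can be demonstrated with java.nio.ByteBuffer, which lets the byte order be selected explicitly. The following sketch interprets the byte sequence of Eqn. (1.2) both ways:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class ByteOrderDemo {
        public static void main(String[] args) {
            byte[] bytes = {0x12, 0x34, 0x56, 0x78};   // file contents: 12H 34H 56H 78H

            int big = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getInt();
            int little = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();

            System.out.println(Integer.toHexString(big));      // prints 12345678
            System.out.println(Integer.toHexString(little));   // prints 78563412
        }
    }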



Even though correctly ordering the bytes should essentially be the responsibility of the operating and file systems, in practice it actually depends on the architecture of the processor.18



17 The decimal value of z is 305419896.
18 At least the ordering of the bits within a byte is almost universally uniform.






Table 1.2 Signatures of various image file formats. Most image file formats can be identified by inspecting the first bytes of the file. These byte sequences, or signatures, are listed in hexadecimal (0x..) form and as ASCII text (□ indicates a nonprintable character).

Format               Signature       ASCII
PNG                  0x89504e47      □PNG
JPEG/JFIF            0xffd8ffe0      □□□□
TIFF (little endian) 0x49492a00      II*□
TIFF (big endian)    0x4d4d002a      MM□*
BMP                  0x424d          BM
GIF                  0x4749463839    GIF89
Photoshop            0x38425053      8BPS
PS/EPS               0x25215053      %!PS



Processors from the Intel family (e.g., x86, Pentium) are traditionally little endian, and processors from other manufacturers (e.g., IBM, MIPS, Motorola, Sun) are big endian.19 Big endian is also called network byte ordering since in the IP protocol the data bytes are arranged in MSB to LSB order during transmission.

To correctly interpret image data with multi-byte pixel values, it is necessary to know the byte ordering used when creating it. In most cases, this is fixed and defined by the file format, but in some file formats, for example TIFF, it is variable and depends on a parameter given in the file header (see Table 1.2).

File Headers and Signatures
Practically all image file formats contain a data header consisting of important information about the layout of the image data that follows. Values such as the size of the image and the encoding of the pixels are usually present in the file header to make it easier for programmers to allocate the correct amount of memory for the image. The size and structure of this header are usually fixed, but in some formats, such as TIFF, the header can contain pointers to additional subheaders.

In order to interpret the information in the header, it is necessary to know the file type. In many cases, this can be determined by the file name extension (e.g., .jpg or .tif), but since these extensions are not standardized and can be changed at any time by the user, they are not a reliable way of determining the file type. Instead, many file types can be identified by their embedded "signature", which is often the first 2 bytes of the file. Signatures from a number of popular image formats are given in Table 1.2. A PNG file, for example, always begins with the 4-byte sequence 0x89, 0x50, 0x4e, 0x47, that is, the "magic number" 0x89 followed by the ASCII sequence "PNG". Sometimes the signature not only identifies the type of image file but also contains information about its encoding; for instance, in TIFF the first two characters are either II for "Intel" or MM for "Motorola" and indicate the byte ordering (little endian or big endian, respectively) of the image data in the file.
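As a rough illustration (not the book's code), the following sketch reads the first four bytes of a file and compares them against two of the signatures from Table 1.2; the file name some.png is only a placeholder:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SignatureCheck {
  public static void main(String[] args) throws IOException {
    byte[] head = new byte[4];
    try (InputStream in = Files.newInputStream(Paths.get("some.png"))) {
      int n = in.read(head);       // read up to the first 4 bytes
      if (n < 4) return;           // file too short to identify
    }
    if ((head[0] & 0xFF) == 0x89 && head[1] == 'P' && head[2] == 'N' && head[3] == 'G') {
      System.out.println("PNG signature found");
    } else if (head[0] == 'I' && head[1] == 'I') {
      System.out.println("TIFF, little endian (Intel) byte order");
    } else if (head[0] == 'M' && head[1] == 'M') {
      System.out.println("TIFF, big endian (Motorola) byte order");
    } else {
      System.out.println("unknown or unchecked format");
    }
  }
}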



19 In Java, this problem does not arise since internally all implementations of the Java Virtual Machine use big endian ordering.



1.6 Software for Digital Imaging






Traditionally, software for digital imaging has been targeted at either manipulating or processing images, aimed at either practitioners and designers or software programmers, with quite different requirements. Software packages for manipulating images, such as Adobe Photoshop, Corel Paint, and others, usually offer a convenient user interface and a large number of readily available functions and tools for working with images interactively. Sometimes it is possible to extend the standard functionality by writing scripts or adding self-programmed components. For example, Adobe provides a special API for programming Photoshop20 "plugins" in C++, though this is a nontrivial task and certainly too complex for non-programmers.

In contrast to the aforementioned category of tools, digital image processing software primarily aims at the requirements of algorithm and software developers, scientists, and engineers working with images, where interactivity and ease of use are not the main concerns. Instead, these environments mostly offer comprehensive and well-documented software libraries that facilitate the implementation of new image-processing algorithms, prototypes, and working applications. Popular examples are OpenCV,21 MATLAB,22 GIMP,23 or ImageMagick,24 among many others. In addition to the support for conventional programming (typically with C/C++), many of these systems provide dedicated scripting languages or visual programming aids that can be used to construct even highly complex processes in a convenient and safe fashion.

In practice, image manipulation and image processing are of course closely related. Although Photoshop, for example, is aimed at image manipulation by nonprogrammers, the software itself implements many traditional image-processing algorithms. The same is true for many Web applications using server-side image processing, such as those based on ImageMagick. Thus image processing is really at the base of any image manipulation software and certainly not an entirely different category.



1.7 ImageJ

ImageJ, the software that is used for this book, is a combination of both worlds discussed in the previous section. It offers a set of ready-made tools for viewing and interactive manipulation of images but can also be extended easily by writing new software components in a "real" programming language. ImageJ is implemented entirely in Java and is thus largely platform-independent, running without modification under Windows, MacOS, or Linux. Java's dynamic execution model allows new modules ("plugins") to be written as independent pieces of Java code that can be compiled, loaded, and executed "on the fly" in the running system without the need to even restart ImageJ.

20 https://www.adobe.com/products/photoshop.html
21 https://opencv.org
22 https://mathworks.com
23 https://gimp.org
24 https://imagemagick.org



[Photo] Wayne Rasband (right), creator of ImageJ, at the 1st ImageJ Conference 2006. Picture courtesy of Marc Seil, CRP Henri Tudor, Luxembourg.



This quick turnaround makes ImageJ an ideal platform for developing and testing new image-processing techniques and algorithms. Since Java has become extremely popular as a first programming language in many engineering curricula, it is usually quite easy for students to get started in ImageJ without having to spend much time learning another programming language. Also, ImageJ is freely available, so students, instructors, and practitioners can install and use the software legally and without license charges on any computer. ImageJ is thus an ideal platform for education and self-training in digital image processing but is also in regular use for serious research and application development at many laboratories around the world, particularly in biological and medical imaging.

ImageJ was (and still is) developed by Wayne Rasband [217] at the U.S. National Institutes of Health (NIH), originally as a substitute for its predecessor, NIH-Image, which was only available for the Apple Macintosh platform. The current version of ImageJ, updates, documentation, the complete source code, test images, and a continuously growing collection of third-party plugins can be downloaded from the ImageJ website.25 Installation is simple, with detailed instructions available online, in Werner Bailer's programming tutorial [17], and in the authors' ImageJ Short Reference [48].

In addition to ImageJ itself there are several popular software projects that build on or extend ImageJ. This includes in particular Fiji26 ("Fiji Is Just ImageJ"), which offers a consistent collection of numerous plugins, simple installation on various platforms, and excellent documentation. All programming examples (plugins) shown in this book should also execute in Fiji without any modifications. Another important development is ImgLib2, which is a generic Java API for representing and processing n-dimensional images in a consistent fashion. ImgLib2 also provides the underlying data model for ImageJ2,27 which is a complete reimplementation of ImageJ.

1.7.1 Key Features

As a pure Java application, ImageJ should run on any computer for which a current Java runtime environment (JRE) exists. ImageJ comes with its own Java runtime, so Java need not be installed separately on the computer. Under the usual restrictions, ImageJ can be run as a Java "applet" within a Web browser, though it is mostly used as a stand-alone application. It is sometimes also used on the server side in the context of Java-based Web applications (see [17] for details). In summary, the key features of ImageJ are:

25 https://rsb.info.nih.gov/ij/
26 https://fiji.sc
27 https://imagej.net/ImageJ2. To avoid confusion, the "classic" ImageJ platform is sometimes referred to as "ImageJ1" or simply "IJ1".



• A set of ready-to-use, interactive tools for creating, visualizing, editing, processing, analyzing, loading, and storing images, with support for several common file formats. ImageJ also provides "deep" 16-bit integer images, 32-bit floating-point images, and image sequences ("stacks").
• A simple plugin mechanism for extending the core functionality of ImageJ by writing (usually small) pieces of Java code. All coding examples shown in this book are based on such plugins.
• A macro language and the corresponding interpreter, which make it easy to implement larger processing blocks by combining existing functions without any knowledge of Java. Macros are not discussed in this book, but details can be found in ImageJ's online documentation.28

1.7.2 Interactive Tools

When ImageJ starts up, it first opens its main window (see Fig. 1.11), which includes the following menu entries:
• File: for opening, saving, and creating new images.
• Edit: for editing and drawing in images.
• Image: for modifying and converting images, geometric operations.
• Process: for image processing, including point operations, filters, and arithmetic operations between multiple images.
• Analyze: for statistical measurements on image data, histograms, and special display formats.
• Plugin: for editing, compiling, executing, and managing user-defined plugins.

The current version of ImageJ can open images in several common formats, including TIFF (uncompressed only), JPEG, GIF, PNG, and BMP, as well as the formats DICOM29 and FITS,30 which are popular in medical and astronomical image processing, respectively. As is common in most image-editing programs, all interactive operations are applied to the currently active image, i.e., the image most recently selected by the user. ImageJ provides a simple (single-step) "undo" mechanism for most operations, which can also revert modifications effected by user-defined plugins.

1.7.3 Working With ImageJ and Java

In the following chapters, we mostly use concrete plugins and Java code to describe algorithms and data structures. This not only makes these examples immediately applicable, but they should also help in acquiring additional skills for using ImageJ in a step-by-step fashion. Instructions on how to set up an ImageJ environment and how to start writing custom plugins can be found in Appendix I. To keep the text compact, we often describe only the run() method of a particular plugin and additional class and method definitions if they are relevant in the given context.

28 http://rsb.info.nih.gov/ij/developer/macro/macros.html
29 DICOM: Digital Imaging and Communications in Medicine
30 FITS: Flexible Image Transport System



Fig. 1.11 Information displayed in ImageJ's main window is extremely helpful for debugging image-processing operations. The current cursor position is displayed in pixel coordinates unless the associated image is spatially calibrated. The way pixel values are displayed depends on the image type; in the case of a color image (as shown here) integer RGB component values are shown. [The figure's annotations label the menus and status area: open, save and create images; edit images, drawing operations; modify and convert images, geometric operations; point and arithmetic operations, filters; statistical measurements, histograms, plotting; manage, execute plugins; position and value display.]



The complete source code for these examples can of course be downloaded from the book's supporting website.31 The complete and most current API reference for ImageJ, including source code, tutorials, and many example plugins, can be found on the official ImageJ website. Another great source for any serious plugin programming is the tutorial by Werner Bailer [17].

While this book does not require extensive Java skills from its readers, some elementary knowledge is essential for understanding or extending the given examples. There is a large number of introductory textbooks on Java, such as [13, 36, 80, 86, 234] and many others. For readers with programming experience who have not worked with Java before, we particularly recommend some of the tutorials on Oracle's Java website.32 See also Appendix J for additional hints and a compilation of specific Java topics that cause frequent problems.

31 https://www.imagingbook.com
32 https://docs.oracle.com/javase/



1.8 Exercises

Exercise 1.1. Determine the actual physical measurement in millimeters of an image with 1400 rectangular pixels and a resolution of 72 dpi.

Exercise 1.2. A camera with a focal length of f = 50 mm is used to take a photo of a vertical column that is 12 m high and is 95 m away from the camera. Determine its height in the image in mm (a) and the number of pixels (b) assuming the camera has a resolution of 4000 dpi.

Exercise 1.3. The image sensor of a particular digital camera contains 2016 × 3024 pixels. The geometry of this sensor is identical to that of a traditional 35 mm camera (with an image size of 24 × 36 mm) except that it is 1.6 times smaller. Compute the resolution of this digital sensor in dpi.

Exercise 1.4. Assume the camera geometry described in Exercise 1.3 combined with a lens with focal length f = 50 mm. What amount of blurring (in pixels) would be caused by a uniform, 0.1° horizontal turn of the camera during exposure? Recompute this for f = 300 mm. Consider whether the extent of the blurring also depends on the distance of the object.

Exercise 1.5. Determine the number of bytes necessary to store an uncompressed binary image of size 4000 × 3000 pixels.

Exercise 1.6. Determine the number of bytes necessary to store an uncompressed RGB color image of size 640 × 480 pixels using 8, 10, 12, and 14 bits per color channel.

Exercise 1.7. Given a black and white television with a resolution of 625 × 512 8-bit pixels and a frame rate of 25 images per second: (a) How many different images can this device ultimately display, and how long would you have to watch it (assuming no sleeping) in order to see every possible image at least once? (b) Perform the same calculation for a color television with 3 × 8 bits per pixel.

Exercise 1.8. Show that the projection of a 3D straight line in a pinhole camera (assuming perspective projection as defined in Eqn. (1.1)) is again a straight line in the resulting 2D image.

Exercise 1.9. Using Fig. 1.10 as a model, use a text editor to create a PGM file, disk.pgm, containing an image of a bright circle. Open your image with ImageJ and then try to find other programs that can open and display the image.



Exercise 1.10. Install the current version of ImageJ on your computer and make yourself familiar with the built-in commands (open, convert, edit, and save images).

Exercise 1.11. Write a new ImageJ plugin that reflects a grayscale image horizontally (or vertically) using My_Inverter.java (Prog. I.1) as a template. Test your new plugin with appropriate images of different sizes (odd, even, extremely small) and inspect the results carefully.

Exercise 1.12. The run() method of plugin Inverter_Plugin_A (see Prog. I.1) iterates over all pixels of the given image. Find out in which order the pixels are visited: along the (horizontal) lines or along the (vertical) columns? Make a drawing to illustrate this process.

Exercise 1.13. Create an ImageJ plugin for 8-bit grayscale images of arbitrary size that paints a white frame (with pixel value 255) 10 pixels wide into the image (without increasing its size). Make sure this plugin also works for very small images.

Exercise 1.14. Create a plugin for 8-bit grayscale images that calculates the sum of all pixel values and prints the result (with IJ.log()). Use a variable of type int or long for accumulating the pixel values. What is the maximum image size for which we can be certain that the result of summing with an int variable is correct?

Exercise 1.15. Create a plugin for 8-bit grayscale images that calculates and prints the minimum and maximum pixel values in the current image (with IJ.log()). Compare your output to the results obtained with Analyze ▷ Measure.

Exercise 1.16. Write a new ImageJ plugin that shifts an 8-bit grayscale image horizontally and circularly until the original state is reached again. To display the modified image after each shift, a reference to the corresponding ImagePlus object is required (ImageProcessor has no display methods). The ImagePlus object is only accessible to the plugin's setup() method, which is automatically called before the run() method. Modify the definition in Prog. I.1 (p. 889) to keep a reference and to redraw the ImagePlus object as follows:

public class XY_Plugin implements PlugInFilter {
  ImagePlus im; // new variable!

  public int setup(String args, ImagePlus im) {
    this.im = im; // reference to the associated ImagePlus object
    return DOES_8G;
  }

  public void run(ImageProcessor ip) {
    // ... modify ip
    im.updateAndDraw(); // redraw the associated ImagePlus object
    // ...
  }
}



2 Histograms and Image Statistics



Histograms are used to depict image statistics in an easily interpreted visual format. With a histogram, it is easy to determine certain types of problems in an image; for example, it is simple to conclude whether an image is properly exposed by visual inspection of its histogram. In fact, histograms are so useful that modern digital cameras often provide a real-time histogram overlay on the viewfinder (Fig. 2.1) to help prevent taking poorly exposed pictures. It is important to catch errors like this at the image capture stage because poor exposure results in a permanent loss of information, which cannot be recovered later using image-processing techniques. In addition to their usefulness during image capture, histograms are also used later to improve the visual appearance of an image and as a "forensic" tool for determining what type of processing has previously been applied to an image. The final part of this chapter shows how to calculate simple image statistics from the original image, its histogram, or the so-called integral image.

Fig. 2.1 Digital SLR camera back display showing the associated RGB histograms.






Fig. 2.2 An 8-bit grayscale image and a histogram depicting the frequency distribution of its 256 intensity values.



2.1 What is a Histogram?

Histograms in general are frequency distributions, and histograms of images describe the frequency of the intensity values that occur in an image. This concept can be easily explained by considering an old-fashioned grayscale image like the one shown in Fig. 2.2. The histogram h for a grayscale image I with intensity values in the range I(u, v) ∈ [0, K−1] holds exactly K entries, where K = 2^8 = 256 for a typical 8-bit grayscale image. Each single histogram entry is defined as

    h(i) = the number of pixels in I with the intensity value i,

for all 0 ≤ i < K. More formally stated,1

$$h(i) = \operatorname{card}\{(u, v) \mid I(u, v) = i\}. \tag{2.1}$$

Therefore, h(0) is the number of pixels with the value 0, h(1) the number of pixels with the value 1, and so forth. Finally, h(255) is the number of all white pixels with the maximum intensity value 255 = K−1. The result of the histogram computation is a 1D vector h of length K. Figure 2.3 gives an example for an image with K = 16 possible intensity values.

Fig. 2.3 Histogram vector for an image with K = 16 possible intensity values. The indices of the vector elements i = 0, . . . , 15 represent intensity values. The value of 10 at index 2 means that the image contains 10 pixels of intensity value 2. [The plotted histogram corresponds to the vector h = (0, 2, 10, 0, 0, 0, 5, 7, 3, 9, 1, 6, 3, 6, 3, 2) for i = 0, . . . , 15.]



Since the histogram encodes no information about where each of its individual entries originated in the image, it contains no information about the spatial arrangement of pixels in the image. This is intentional, since the main function of a histogram is to provide statistical information (e.g., the distribution of intensity values) in a compact form. Is it possible to reconstruct an image using only its histogram? That is, can a histogram be somehow "inverted"? Given the loss of spatial information, in all but the most trivial cases, the answer is no. As an example, consider the wide variety of images you could construct using the same number of pixels of a specific value. These images would appear different but have exactly the same histogram (Fig. 2.4).

1 card{. . .} denotes the number of elements ("cardinality") in a set (see also Sec. A.1 in the Appendix).

Fig. 2.4 Three very different images with identical histograms.



2.2 Interpreting Histograms

A histogram depicts problems that originate during image acquisition, such as those involving contrast and dynamic range, as well as artifacts resulting from image-processing steps that were applied to the image. Histograms are often used to determine if an image is making effective use of its intensity range (Fig. 2.5) by examining the size and uniformity of the histogram's distribution.

Fig. 2.5 Effective intensity range. The graph depicts the frequencies of pixel values linearly (black bars) and logarithmically (gray bars). The logarithmic form makes even relatively low occurrences, which can be very important in the image, readily apparent. [The figure marks the effective intensity range between a_low and a_high.]



2.2.1 Image Acquisition

Histograms make typical exposure problems readily apparent. As an example, a histogram where a large section of the intensity range at one end is largely unused while the other end is crowded with high-value peaks (Fig. 2.6) is representative of an improperly exposed image.

Fig. 2.6 Exposure errors are readily apparent in histograms. Underexposed (a), properly exposed (b), and overexposed (c) photographs.

Fig. 2.7 How changes in contrast affect the histogram: low contrast (a), normal contrast (b), high contrast (c).



Contrast
Contrast is understood as the range of intensity values effectively used within a given image, that is, the difference between the image's maximum and minimum pixel values. A full-contrast image makes effective use of the entire range of available intensity values from a = amin, . . . , amax with amin = 0, amax = K−1 (black to white). Using this definition, image contrast can be easily read directly from the histogram. Figure 2.7 illustrates how varying the contrast of an image affects its histogram.

Dynamic Range
The dynamic range of an image is, in principle, understood as the number of distinct pixel values in an image. In the ideal case, the dynamic range encompasses all K usable pixel values, in which case the value range is completely utilized. When an image has an available range of contrast a = alow, . . . , ahigh, with

    amin < alow  and  ahigh < amax,

then the maximum possible dynamic range is achieved when all the intensity values lying in this range are utilized, i.e., appear in the image (see Fig. 2.8).



Fig. 2.8 How changes in dynamic range affect the histogram: high dynamic range (a), low dynamic range with 64 intensity values (b), extremely low dynamic range with only 6 intensity values (c).



While the contrast of an image can be increased by transforming its existing values so that they utilize more of the underlying value range available, the dynamic range of an image can only be increased by introducing artificial (that is, not originating with the image sensor) values using methods such as interpolation (see Ch. 22). An image with a high dynamic range is desirable because it will suffer less image-quality degradation during image processing and compression. Since it is not possible to increase dynamic range after image acquisition in a practical way, professional cameras and scanners work at depths of more than 8 bits, often 12–14 bits per channel, in order to provide high dynamic range at the acquisition stage. While most output devices, such as monitors and printers, are unable to actually reproduce more than 256 different shades, a high dynamic range is always beneficial for subsequent image processing or archiving.

2.2.2 Image Defects

Histograms can be used to detect a wide range of image defects that originate either during image acquisition or as the result of later image processing. Since histograms always depend on the visual characteristics of the scene captured in the image, no single "ideal" histogram exists. While a given histogram may be optimal for a specific scene, it may be entirely unacceptable for another. As an example, the ideal histogram for an astronomical image would likely be very different from that of a good landscape or portrait photo. Nevertheless, there are some general rules; for example, when taking a landscape image with a digital camera, you can expect the histogram to have evenly distributed intensity values and no isolated spikes.

Saturation
Ideally the contrast range of a sensor, such as that used in a camera, should be greater than the range of the intensity of the light that it receives from a scene.



Fig. 2.9 Effect of image capture errors on histograms: saturation of high intensities (a), histogram gaps caused by a slight increase in contrast (b), and histogram spikes resulting from a reduction in contrast (c).



In such a case, the resulting histogram will be smooth at both ends because the light received from the very bright and the very dark parts of the scene will be less than the light received from the other parts of the scene. Unfortunately, this ideal is often not the case in reality, and illumination outside of the sensor's contrast range, arising for example from glossy highlights and especially dark parts of the scene, cannot be captured and is lost. The result is a histogram that is saturated at one or both ends of its range. The illumination values lying outside of the sensor's range are mapped to its minimum or maximum values and appear on the histogram as significant spikes at the tail ends. This typically occurs in an under- or overexposed image and is generally not avoidable when the inherent contrast range of the scene exceeds the range of the system's sensor (see Fig. 2.9 (a)).

Spikes and Gaps
As discussed already, the intensity value distribution for an unprocessed image is generally smooth; that is, it is unlikely that isolated spikes (except for possible saturation effects at the tails) or gaps will appear in its histogram. It is also unlikely that the count of any given intensity value will differ greatly from that of its neighbors (i.e., it is locally smooth). While artifacts like these are observed very rarely in original images, they will often be present after an image has been manipulated, for instance, by changing its contrast. Increasing the contrast (see Ch. 3) causes the histogram lines to separate from each other and, due to the discrete values, gaps are created in the histogram (Fig. 2.9 (b)). Decreasing the contrast leads, again because of the discrete values, to the merging of values that were previously distinct. This results in increases in the corresponding histogram entries and ultimately leads to highly visible spikes in the histogram (see Fig. 2.9 (c)).2

Effects of Image Compression
Image compression also changes an image in ways that are immediately evident in its histogram. As an example, during GIF compression, an image's dynamic range is reduced to only a few intensities or colors, resulting in an obvious line structure in the histogram that cannot be removed by subsequent processing (see Fig. 2.10).

2 Unfortunately, these types of errors are also caused by the internal contrast "optimization" routines of some image-capture devices, especially consumer-type scanners.



Fig. 2.10 Color quantization effects resulting from GIF conversion. The original image converted to a 256 color GIF image (left). Original histogram (a) and the histogram after GIF conversion (b). When the RGB image is scaled by 50%, some of the lost colors are recreated by interpolation, but the results of the GIF conversion remain clearly visible in the histogram (c).

Fig. 2.11 Effects of JPEG compression. The original image (a) contained only two different gray values, as its histogram (b) makes readily apparent. JPEG compression, a poor choice for this type of image, results in numerous additional gray values, which are visible in both the resulting image (c) and its histogram (d). In both histograms, the linear frequency (black bars) and the logarithmic frequency (gray bars) are shown.



Generally, a histogram can quickly reveal whether an image has ever been subjected to color quantization, such as occurs during conversion to a GIF image, even if the image has subsequently been converted to a full-color format such as TIFF or JPEG. Figure 2.11 illustrates what occurs when a simple line graphic with only two gray values (128, 255) is subjected to a compression method such as JPEG, which is not designed for line graphics but instead for natural photographs. The histogram of the resulting image clearly shows that it now contains a large number of gray values that were not present in the original image, resulting in a poor-quality image3 that appears dirty, fuzzy, and blurred.

3 Using JPEG compression on images like this, for which it was not designed, is one of the most egregious of imaging errors. JPEG is designed for photographs of natural scenes with smooth color transitions, and using it to compress iconic images with large areas of the same color results in strong visual artifacts (see, e.g., Fig. 1.9 on p. 19).



Prog. 2.1 ImageJ plugin for computing the histogram of an 8-bit grayscale image. The setup() method returns DOES_8G + NO_CHANGES, which indicates that this plugin requires an 8-bit grayscale image and will not alter it (line 9). In Java, all elements of a newly instantiated numeric array are automatically initialized to zero (line 13).



 1  import ij.IJ;
 2  import ij.ImagePlus;
 3  import ij.plugin.filter.PlugInFilter;
 4  import ij.process.ImageProcessor;
 5
 6  public class Compute_Histogram implements PlugInFilter {
 7
 8    public int setup(String arg, ImagePlus img) {
 9      return DOES_8G + NO_CHANGES;
10    }
11
12    public void run(ImageProcessor ip) {
13      int[] h = new int[256];    // histogram array
14      int w = ip.getWidth();
15      int hgt = ip.getHeight();  // image height
16
17      for (int v = 0; v < hgt; v++) {
18        for (int u = 0; u < w; u++) {
19          int i = ip.getPixel(u, v);
20          h[i] = h[i] + 1;
21        }
22      }
23      // ... histogram h can now be used
24    }
25  }



2.3 Calculating Histograms

Computing the histogram of an 8-bit grayscale image containing intensity values between 0 and 255 is a simple task. All we need is a set of 256 counters, one for each possible intensity value. First, all counters are initialized to zero. Then we iterate through the image I, determining the pixel value p at each location (u, v), and incrementing the corresponding counter by one. At the end, each counter will contain the number of pixels in the image that have the corresponding intensity value.

An image with K possible intensity values requires exactly K counter variables; for example, since an 8-bit grayscale image can contain at most 256 different intensity values, we require 256 counters. While individual counters make sense conceptually, an actual implementation would not use K individual variables to represent the counters but instead would use an array with K entries (int[256] in Java). In this example, the actual implementation as an array is straightforward. Since the intensity values begin at zero (like arrays in Java) and are all positive, they can be used directly as the indices i = 0, . . . , K−1 of the histogram array. Program 2.1 contains the complete Java source code for computing a histogram within the run() method of an ImageJ plugin.

At the start of Prog. 2.1, the array h of type int[] is created (line 13) and its elements are automatically initialized4 to 0. It makes no difference, at least in terms of the final result, whether the array is traversed in row or column order, as long as all pixels in the image are visited exactly once. In contrast to Prog. I.1, in this example we traverse the array in the standard row-first order such that the outer for loop iterates over the vertical coordinates v and the inner loop over the horizontal coordinates u.5

Once the histogram has been calculated, it is available for further processing steps or for being displayed. Of course, histogram computation is already implemented in ImageJ and is available via the method getHistogram() for objects of the class ImageProcessor. If we use this built-in method, the run() method of Prog. 2.1 can be simplified to

public void run(ImageProcessor ip) {
  int[] h = ip.getHistogram(); // built-in ImageJ method
  // ... histogram h can now be used
}

4 In Java, arrays of primitives such as int, double are initialized at creation to 0 in the case of integer types or 0.0 for floating-point types, while arrays of objects are initialized to null.
5 In this way, image elements are traversed in exactly the same way that they are laid out in computer memory, resulting in more efficient memory access and with it the possibility of increased performance, especially when dealing with larger images (see also Appendix J).



2.4 Histograms of Images With More Than 8 Bits

Normally histograms are computed in order to visualize the image's distribution on the screen. This presents no problem when dealing with images having 2^8 = 256 entries, but when an image uses a larger range of values, for instance 16- and 32-bit or floating-point images (see Table 1.1), then the growing number of necessary histogram entries makes this no longer practical.

2.4.1 Binning

Since it is not possible to represent each intensity value with its own entry in the histogram, we will instead let a given entry in the histogram represent a range of intensity values. This technique is often referred to as "binning" since you can visualize it as collecting a range of pixel values in a container such as a bin or bucket. In a binned histogram of size B, each bin h(j) contains the number of image elements having values within the interval [a_j, a_{j+1}), and therefore (analogous to Eqn. (2.1))

$$h(j) = \operatorname{card}\{(u, v) \mid a_j \leq I(u, v) < a_{j+1}\}, \tag{2.2}$$

for 0 ≤ j < B. Typically the range of possible values is divided into bins of equal size k_B = K/B, such that the starting value of interval j is

$$a_j = j \cdot \frac{K}{B} = j \cdot k_B.$$

2.4.2 Example

In order to create a typical histogram containing B = 256 entries from a 14-bit image, one would divide the original value range 0, . . . , 2^14 − 1 into 256 equal intervals, each of length k_B = 2^14/256 = 64, such that a_0 = 0, a_1 = 64, a_2 = 128, . . . , a_255 = 16320, and a_256 = a_B = 2^14 = 16384 = K. This gives the following association between pixel values and histogram bins h(0), . . . , h(255):

    0, . . . , 63        → h(0),
    64, . . . , 127      → h(1),
    128, . . . , 191     → h(2),
        ...
    16320, . . . , 16383 → h(255).



2.4.3 Implementation

If, as in the previous example, the value range 0, . . . , K−1 is divided into intervals of equal length k_B = K/B, there is naturally no need to use a mapping table to find a_j, since for a given pixel value a = I(u, v) the correct histogram bin j is easily computed. In this case, it is enough to simply divide the pixel value I(u, v) by the interval length k_B; that is,

$$\frac{I(u, v)}{k_B} = \frac{I(u, v)}{K/B} = \frac{I(u, v) \cdot B}{K}. \tag{2.3}$$

As an index to the appropriate histogram bin h(j), we require an integer value

$$j = \left\lfloor \frac{I(u, v) \cdot B}{K} \right\rfloor, \tag{2.4}$$

where ⌊·⌋ denotes the floor operator.6 A Java method for computing histograms by "linear binning" is given in Prog. 2.2. Note that all the computations from Eqn. (2.4) are done with integer numbers without using any floating-point operations. Also there is no need to explicitly call the floor function, because the expression a * B / K in line 11 uses integer division and in Java the fractional result of such an operation is truncated, which is equivalent to applying the floor function (assuming positive arguments).7 The binning method can also be applied, in a similar way, to floating-point images.

2.5 Histograms of Color Images

When referring to histograms of color images, typically what is meant is a histogram of the image intensity (luminance) or of the individual color channels. Both of these variants are supported by practically every image-processing application and are used to objectively appraise the image quality, especially directly after image acquisition.

6 ⌊x⌋ rounds x down to the next whole number (see Appendix A).
7 See also Sec. J.1.1 of the Appendix.



 1  int[] binnedHistogram(ImageProcessor ip) {
 2    int K = 256;           // number of intensity values
 3    int B = 32;            // size of histogram, must be defined
 4    int[] H = new int[B];  // histogram array
 5    int w = ip.getWidth();
 6    int h = ip.getHeight();
 7
 8    for (int v = 0; v < h; v++) {
 9      for (int u = 0; u < w; u++) {
10        int a = ip.getPixel(u, v);
11        int i = a * B / K;  // integer operations only!
12        H[i] = H[i] + 1;
13      }
14    }
15    // return binned histogram
16    return H;
17  }



Prog. 2.2 Histogram computation using "binning" (Java method). Example of computing a histogram with B = 32 bins for an 8-bit grayscale image with K = 256 intensity levels. The method binnedHistogram() returns the histogram of the image object ip passed to it as an int array of size B.



2.5.1 Intensity Histograms

The intensity or luminance histogram hLum of a color image is nothing more than the histogram of the corresponding grayscale image, so naturally all aspects of the preceding discussion also apply to this type of histogram. The grayscale image is obtained by computing the luminance of the individual channels of the color image. When computing the luminance, it is not sufficient to simply average the values of each color channel; instead, a weighted sum that takes into account color perception theory should be computed. This process is explained in detail in Chapter 13.

2.5.2 Individual Color Channel Histograms

Even though the luminance histogram takes into account all color channels, image errors appearing in single channels can remain undiscovered. For example, the luminance histogram may appear clean even when one of the color channels is oversaturated. In RGB images, the blue channel contributes only a small amount to the total brightness and so is especially sensitive to this problem.

Component histograms supply additional information about the intensity distribution within the individual color channels. When computing component histograms, each color channel is considered a separate intensity image and each histogram is computed independently of the other channels. Figure 2.12 shows the luminance histogram hLum and the three component histograms hR, hG, and hB of a typical RGB color image. Notice that saturation problems in all three channels (red in the upper intensity region, green and blue in the lower regions) are obvious in the component histograms but not in the luminance histogram. In this case it is striking, and not at all atypical, that the three component histograms appear completely different from the corresponding luminance histogram hLum (see Fig. 2.12 (b)).
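As a simple illustration (not from the book), the following sketch computes the three component histograms of an RGB image; it assumes ImageJ's ColorProcessor, whose pixel array holds packed int RGB values:

import ij.process.ColorProcessor;

public class ComponentHistograms {
  static int[][] componentHistograms(ColorProcessor cp) {
    int[] hR = new int[256], hG = new int[256], hB = new int[256];
    int[] pixels = (int[]) cp.getPixels();  // packed 0xRRGGBB values
    for (int c : pixels) {
      hR[(c >> 16) & 0xFF]++;  // red component
      hG[(c >>  8) & 0xFF]++;  // green component
      hB[c & 0xFF]++;          // blue component
    }
    return new int[][] {hR, hG, hB};
  }
}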



Fig. 2.12 Histograms of an RGB color image: original image (a), luminance histogram hLum (b), RGB color components as intensity images (c–e), and the associated component histograms hR, hG, hB (f–h). The fact that all three color channels have saturation problems is only apparent in the individual component histograms. The spike in the distribution resulting from this is found in the middle of the luminance histogram (b).



2.5.3 Combined Color Histograms

Luminance histograms and component histograms both provide useful information about the lighting, contrast, dynamic range, and saturation effects relative to the individual color components. It is important to remember that they provide no information about the distribution of the actual colors in the image because they are based on the individual color channels and not the combination of the individual channels that forms the color of an individual pixel. Consider, for example, when hR, the component histogram for the red channel, contains the entry hR(200) = 24. Then it is only known that the image has 24 pixels that have a red intensity value of 200. The entry does not tell us anything about the green and blue values of those pixels, which could be any valid value (∗), that is, (r, g, b) = (200, ∗, ∗).

Suppose further that the three component histograms included the following entries:

    hR(50) = 100,    hG(50) = 100,    hB(50) = 100.

Could we conclude from this that the image contains 100 pixels with the color combination (r, g, b) = (50, 50, 50), or that this color occurs at all? In general, no, because there is no way of ascertaining from these data if there exists a pixel in the image in which all three components have the value 50. The only thing we could really say is that the color value (50, 50, 50) can occur at most 100 times in this image.

So, although conventional (intensity or component) histograms of color images depict important properties, they do not really provide any useful information about the composition of the actual colors in an image. In fact, a collection of color images can have very similar component histograms and still contain entirely different colors. This leads to the interesting topic of the combined histogram, which uses statistical information about the combined color components in an attempt to determine if two images are roughly similar in their color composition. Features computed from this type of histogram often form the foundation of color-based image retrieval methods. We will return to this topic in Chapter 13, where we will explore color images in greater detail.



2.6 The Cumulative Histogram

The cumulative histogram, which is derived from the ordinary histogram, is useful when performing certain image operations involving histograms; for instance, histogram equalization (see Sec. 3.5). The cumulative histogram H is defined as

$$H(i) = \sum_{j=0}^{i} h(j) \qquad \text{for } 0 \leq i < K. \tag{2.5}$$

A particular value H(i) is thus the sum of all histogram values h(j), with j ≤ i. Alternatively, we can define H recursively (as implemented in Prog. 3.2 on p. 58):

$$H(i) = \begin{cases} h(0) & \text{for } i = 0,\\ H(i-1) + h(i) & \text{for } 0 < i < K. \end{cases} \tag{2.6}$$

The cumulative histogram H(i) is a monotonically increasing function with the maximum value

$$H(K-1) = \sum_{j=0}^{K-1} h(j) = M \cdot N, \tag{2.7}$$

that is, the total number of pixels in an image of width M and height N. Figure 2.13 shows a concrete example of a cumulative histogram.

The cumulative histogram is useful not primarily for viewing but as a simple and powerful tool for capturing statistical information from an image. In particular, we will use it in the next chapter to compute the parameters for several common point operations (see Sec. 3.4–3.6).
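A direct translation of the recurrence in Eqn. (2.6) into Java might look as follows (a minimal sketch, not the book's Prog. 3.2):

static int[] cumulativeHistogram(int[] h) {
  int K = h.length;
  int[] H = new int[K];
  H[0] = h[0];
  for (int i = 1; i < K; i++) {
    H[i] = H[i - 1] + h[i];   // H(i) = H(i-1) + h(i)
  }
  return H;
}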



2.7 Statistical Information from the Histogram

Some common statistical parameters of the image can be conveniently calculated directly from its histogram. For example, the minimum and maximum pixel value of an image I can be obtained by simply finding the smallest and largest histogram index with a nonzero value, that is,

$$\min(I) = \min\{i \mid h(i) > 0\}, \qquad \max(I) = \max\{i \mid h(i) > 0\}. \tag{2.8}$$

Fig. 2.13 The ordinary histogram h(i) and its associated cumulative histogram H(i).



If we assume that the histogram is already available, the advantage is that the calculation does not include the entire image but only the relatively small set of histogram elements (typically 256).

2.7.1 Mean and Variance

The mean intensity value µ of an image I (of size M × N) can be calculated as

$$\mu = \frac{1}{MN} \cdot \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} I(u, v) = \frac{1}{MN} \cdot \sum_{i=0}^{K-1} h(i) \cdot i, \tag{2.9}$$

i.e., either directly from the pixel values I(u, v) or indirectly from the histogram h (of size K), where MN = Σ_i h(i) is the total number of pixels. Analogously we can also calculate the variance of the pixel values straight from the histogram as

$$\sigma^2 = \frac{1}{MN} \cdot \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} \bigl(I(u, v) - \mu\bigr)^2 = \frac{1}{MN} \cdot \sum_{i=0}^{K-1} (i - \mu)^2 \cdot h(i). \tag{2.10}$$

As we see in the right parts of Eqns. (2.9) and (2.10), there is no need to access the original pixel values once the histogram is available. The formulation of the variance in Eqn. (2.10) assumes that the arithmetic mean µ has already been determined. This is not necessary though, since the mean and the variance can be calculated together in a single iteration over the image pixels or the associated histogram in the form

$$\mu = \frac{1}{MN} \cdot A \tag{2.11}$$

$$\sigma^2 = \frac{1}{MN} \cdot \Bigl(B - \frac{1}{MN} \cdot A^2\Bigr), \tag{2.12}$$



with the quantities

$$A = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} I(u, v) = \sum_{i=0}^{K-1} i \cdot h(i), \tag{2.13}$$

$$B = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} I^2(u, v) = \sum_{i=0}^{K-1} i^2 \cdot h(i). \tag{2.14}$$



The above formulation has the additional numerical advantage that all summations can be performed with integer values, in contrast to Eqn. (2.10), which requires the summation of floating-point values.

2.7.2 Median

The median m of an image is defined as the smallest pixel value that is greater than or equal to one half of all pixel values, i.e., it lies "in the middle" of the pixel values.8 The median can also be easily calculated from the image's histogram. To determine the median of an image I from the associated histogram h, it is sufficient to find the index i that separates the histogram into two halves, such that the sums of the histogram entries to the left and the right of i are approximately equal. In other words, i is the smallest index where the sum of the histogram entries below (and including) i corresponds to at least half of the image size, that is,

$$\operatorname{median}(I) = \min\Bigl\{ i \;\Big|\; \sum_{j=0}^{i} h(j) \geq \frac{MN}{2} \Bigr\}. \tag{2.15}$$

Since Σ_{j=0}^{i} h(j) = H(i) (see Eqn. (2.5)), the median can be expressed even more simply as

$$\operatorname{median}(I) = \min\bigl\{ i \mid H(i) \geq \tfrac{MN}{2} \bigr\}, \tag{2.16}$$

given the cumulative histogram H.
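The following sketch (not from the book) illustrates Eqns. (2.11)–(2.16): mean and variance are obtained from the single-pass sums A and B, and the median from the running histogram sum; it assumes a complete histogram h whose entries sum to the total number of image pixels:

static double[] meanAndVariance(int[] h) {
  long A = 0, B = 0, N = 0;
  for (int i = 0; i < h.length; i++) {
    A += (long) i * h[i];        // A = sum of i * h(i)
    B += (long) i * i * h[i];    // B = sum of i^2 * h(i)
    N += h[i];                   // total number of pixels (M*N)
  }
  double mu = (double) A / N;                        // Eqn. (2.11)
  double sigma2 = ((double) B - (double) A * A / N) / N;  // Eqn. (2.12)
  return new double[] {mu, sigma2};
}

static int median(int[] h) {
  long N = 0;
  for (int hi : h) N += hi;      // total pixel count
  long sum = 0;
  for (int i = 0; i < h.length; i++) {
    sum += h[i];                 // running sum = cumulative histogram H(i)
    if (2 * sum >= N) return i;  // smallest i with H(i) >= MN/2, Eqn. (2.16)
  }
  return h.length - 1;           // only reached for an empty histogram
}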



2.8 Block Statistics

2.8.1 Integral Images

Integral images (also known as summed area tables [68]) provide a simple way for quickly calculating elementary statistics of arbitrary rectangular sub-images. They have found use in several interesting applications, such as fast filtering, adaptive thresholding, image matching, local feature extraction, face detection, and stereo reconstruction [26, 165, 278]. Given a scalar-valued (grayscale) image I : M × N → R, the associated first-order integral image is defined as

$$\Sigma_1(u, v) = \sum_{i=0}^{u} \sum_{j=0}^{v} I(i, j). \tag{2.17}$$

8 See Sec. 4.4.2 for an alternative definition of the median.



Fig. 2.14 Block-based calculations with integral images. Only four samples from the integral image Σ1 are required to calculate the sum of the pixels inside the (green) rectangle R = ⟨a, b⟩, defined by the corner coordinates a = (ua, va) and b = (ub, vb).



Thus a value in Σ1 is the sum of all pixel values in the original image I located to the left and above the given position (u, v), inclusively. The integral image can be calculated efficiently with a single pass over the image I by using the recurrence relation

$$\Sigma_1(u, v) = \begin{cases} 0 & \text{for } u < 0 \text{ or } v < 0,\\ \Sigma_1(u{-}1, v) + \Sigma_1(u, v{-}1) - \Sigma_1(u{-}1, v{-}1) + I(u, v) & \text{for } u, v \geq 0, \end{cases} \tag{2.18}$$

for positions u = 0, . . . , M−1 and v = 0, . . . , N−1 (see Alg. 2.1). Suppose now that we wanted to calculate the sum of the pixel values in a given rectangular region R, defined by the corner positions a = (ua, va), b = (ub, vb), that is, the first-order block sum

$$S_1(R) = \sum_{i=u_a}^{u_b} \sum_{j=v_a}^{v_b} I(i, j), \tag{2.19}$$

from the integral image Σ1. As shown in Fig. 2.14, the quantity Σ1(ua−1, va−1) corresponds to the pixel sum within rectangle A, and Σ1(ub, vb) is the pixel sum over all four rectangles A, B, C, and R, that is,

$$\begin{aligned}
\Sigma_1(u_a{-}1, v_a{-}1) &= S_1(A),\\
\Sigma_1(u_b, v_a{-}1) &= S_1(A) + S_1(B),\\
\Sigma_1(u_a{-}1, v_b) &= S_1(A) + S_1(C),\\
\Sigma_1(u_b, v_b) &= S_1(A) + S_1(B) + S_1(C) + S_1(R).
\end{aligned} \tag{2.20}$$

Thus S1(R) can be calculated as

$$\begin{aligned}
S_1(R) &= \underbrace{S_1(A)+S_1(B)+S_1(C)+S_1(R)}_{\Sigma_1(u_b, v_b)} + \underbrace{S_1(A)}_{\Sigma_1(u_a-1, v_a-1)} - \underbrace{[S_1(A)+S_1(B)]}_{\Sigma_1(u_b, v_a-1)} - \underbrace{[S_1(A)+S_1(C)]}_{\Sigma_1(u_a-1, v_b)}\\
&= \Sigma_1(u_b, v_b) + \Sigma_1(u_a{-}1, v_a{-}1) - \Sigma_1(u_b, v_a{-}1) - \Sigma_1(u_a{-}1, v_b),
\end{aligned} \tag{2.21}$$



that is, by taking only four samples from the integral image Σ1 .
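A minimal Java sketch of this scheme (not the imagingbook implementation; the array layout and method names are made up) builds Σ1 with the recurrence of Eqn. (2.18) and evaluates S1(R) with the four lookups of Eqn. (2.21):

static long[][] integralImage(int[][] I, int M, int N) {  // I[u][v], size M x N
  long[][] S = new long[M][N];
  for (int v = 0; v < N; v++) {
    for (int u = 0; u < M; u++) {
      long left  = (u > 0) ? S[u - 1][v] : 0;             // Σ1(u-1, v)
      long above = (v > 0) ? S[u][v - 1] : 0;             // Σ1(u, v-1)
      long diag  = (u > 0 && v > 0) ? S[u - 1][v - 1] : 0; // Σ1(u-1, v-1)
      S[u][v] = left + above - diag + I[u][v];
    }
  }
  return S;
}

static long blockSum(long[][] S, int ua, int va, int ub, int vb) {
  long sA = (ua > 0 && va > 0) ? S[ua - 1][va - 1] : 0;
  long sB = (va > 0) ? S[ub][va - 1] : 0;
  long sC = (ua > 0) ? S[ua - 1][vb] : 0;
  return S[ub][vb] + sA - sB - sC;   // Eqn. (2.21)
}

Note the use of long accumulators, anticipating the overflow issue discussed in Sec. 2.8.4.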



2.8.2 Mean Intensity

Given the rectangle's size N_R and the sum of the pixel values S1(R), the average intensity value (mean) inside the rectangle R can now easily be found as

$$\mu_R = \frac{1}{N_R} \cdot S_1(R), \tag{2.22}$$

with S1(R) as defined in Eqn. (2.21) and the region size

$$N_R = |R| = (u_b - u_a + 1) \cdot (v_b - v_a + 1). \tag{2.23}$$



2.8.3 Variance

Calculating the variance inside a rectangular region R requires the summation of squared intensity values, that is, tabulating

$$\Sigma_2(u, v) = \sum_{i=0}^{u} \sum_{j=0}^{v} I^2(i, j), \tag{2.24}$$

which can be performed analogously to Eqn. (2.18) in the form

$$\Sigma_2(u, v) = \begin{cases} 0 & \text{for } u < 0 \text{ or } v < 0,\\ \Sigma_2(u{-}1, v) + \Sigma_2(u, v{-}1) - \Sigma_2(u{-}1, v{-}1) + I^2(u, v) & \text{for } u, v \geq 0. \end{cases} \tag{2.25}$$

As in Eqns. (2.19)–(2.21), the sum of the squared values inside a given rectangle R (i.e., the second-order block sum) can be obtained as

$$\begin{aligned}
S_2(R) &= \sum_{i=u_a}^{u_b} \sum_{j=v_a}^{v_b} I^2(i, j)\\
&= \Sigma_2(u_b, v_b) + \Sigma_2(u_a{-}1, v_a{-}1) - \Sigma_2(u_b, v_a{-}1) - \Sigma_2(u_a{-}1, v_b).
\end{aligned} \tag{2.26}$$

From this, the variance inside the rectangular region R is finally calculated as

$$\sigma_R^2 = \frac{1}{N_R} \Bigl( S_2(R) - \frac{1}{N_R} \cdot \bigl(S_1(R)\bigr)^2 \Bigr), \tag{2.27}$$

with N_R as defined in Eqn. (2.23). In addition, certain higher-order statistics can be efficiently calculated with summation tables in a similar fashion.

2.8.4 Practical Calculation of Integral Images

Algorithm 2.1 shows how Σ1 and Σ2 can be calculated in a single iteration over the original image I. Note that the accumulated values in the integral images Σ1, Σ2 tend to become quite large. Even with pictures of medium size and 8-bit intensity values, the range of 32-bit integers is quickly exhausted (particularly when calculating Σ2). The use of 64-bit integers (type long in Java) or larger is recommended to avoid arithmetic overflow. A basic implementation of integral images is available as part of the imagingbook library.9

9 Class imagingbook.lib.image.IntegralImage



Alg. 2.1 Joint calculation of the integral images Σ1 and Σ2 for a scalar-valued image I.

 1: IntegralImage(I)
    Input: I, a scalar-valued input image with I(u, v) ∈ R.
    Returns the first and second order integral images of I.
 2:   (M, N) ← size(I)
 3:   Create maps Σ1, Σ2 : M × N → R
      Process the first image line (v = 0):
 4:   Σ1(0, 0) ← I(0, 0)
 5:   Σ2(0, 0) ← I²(0, 0)
 6:   for u ← 1, . . . , M−1 do
 7:     Σ1(u, 0) ← Σ1(u−1, 0) + I(u, 0)
 8:     Σ2(u, 0) ← Σ2(u−1, 0) + I²(u, 0)
      Process the remaining image lines (v > 0):
 9:   for v ← 1, . . . , N−1 do
10:     Σ1(0, v) ← Σ1(0, v−1) + I(0, v)
11:     Σ2(0, v) ← Σ2(0, v−1) + I²(0, v)
12:     for u ← 1, . . . , M−1 do
13:       Σ1(u, v) ← Σ1(u−1, v) + Σ1(u, v−1) − Σ1(u−1, v−1) + I(u, v)
14:       Σ2(u, v) ← Σ2(u−1, v) + Σ2(u, v−1) − Σ2(u−1, v−1) + I²(u, v)
15:   return (Σ1, Σ2)



2.9 Exercises

Exercise 2.1. In Prog. 2.2, B and K are constants. Consider if there would be an advantage to computing the value of B/K outside of the loop, and explain your reasoning.

Exercise 2.2. Develop an ImageJ plugin that computes the cumulative histogram of an 8-bit grayscale image and displays it as a new image, similar to H(i) in Fig. 2.13. Hint: Use the ImageProcessor method int[] getHistogram() to retrieve the original image's histogram values and then compute the cumulative histogram "in place" according to Eqn. (2.6). Create a new (blank) image of appropriate size (e.g., 256 × 150) and draw the scaled histogram data as black vertical bars such that the maximum entry spans the full height of the image. Program 2.3 shows how this plugin could be set up and how a new image is created and displayed.

Exercise 2.3. Develop a technique for nonlinear binning that uses a table of interval limits a_j (Eqn. (2.2)).

Exercise 2.4. Develop an ImageJ plugin that uses the Java methods Math.random() or Random.nextInt(int n) to create an image with random pixel values that are uniformly distributed in the range [0, 255]. Analyze the image's histogram to determine how equally distributed the pixel values truly are.



Prog. 2.3 Creating and displaying a new image (ImageJ plugin). First, we create a ByteProcessor object (hip, line 20) that is subsequently filled. At this point, hip has no screen representation and is thus not visible. Then, an associated ImagePlus object is created (line 33) and displayed by applying the show() method (line 34). Notice how the reference to the original image is retrieved inside the setup() method (line 10) and used to compose the new image's title (lines 30 and 33). If hip is changed after calling show(), then the method updateAndDraw() could be used to redisplay the associated image again.

 1  import ij.ImagePlus;
 2  import ij.plugin.filter.PlugInFilter;
 3  import ij.process.ByteProcessor;
 4  import ij.process.ImageProcessor;
 5
 6  public class Create_New_Image implements PlugInFilter {
 7    ImagePlus im;
 8
 9    public int setup(String arg, ImagePlus im) {
10      this.im = im;
11      return DOES_8G + NO_CHANGES;
12    }
13
14    public void run(ImageProcessor ip) {
15      // obtain the histogram of ip:
16      int[] hist = ip.getHistogram();
17      int K = hist.length;
18
19      // create the histogram image:
20      ImageProcessor hip = new ByteProcessor(K, 100);
21      hip.setValue(255);  // white = 255
22      hip.fill();
23
24      // draw the histogram values as black bars in hip here,
25      // for example, using hip.putPixel(u, v, 0)
26      // ...
27
28      // compose a nice title:
29      String imTitle = im.getShortTitle();
30      String histTitle = "Histogram of " + imTitle;
31
32      // display the histogram image:
33      ImagePlus him = new ImagePlus(histTitle, hip);
34      him.show();
35    }
36  }



double Random.nextGaussian() to produce normally-distributed random numbers (with µ = 0 and σ = 1) and scale them appropriately to pixel values. Analyze the resulting image histogram to see if it shows a Gaussian distribution too. Exercise 2.6. Implement the calculation of the arithmetic mean µ and the variance σ 2 of a given grayscale image from its histogram h (see Sec. 2.7.1). Compare your results to those returned by ImageJ’s Analyze ▷ Measure tool (they should match exactly). Exercise 2.7. Implement the first-order integral image (Σ1 ) calculation described in Eqn. (2.18) and calculate the sum of pixel values S1 (R) inside a given rectangle R using Eqn. (2.21). Verify numerically that the results are the same as with the naive formulation in Eqn. (2.19). Exercise 2.8. Values of integral images tend to become quite large. Assume that 32-bit signed integers (int) are used to calculate the integral of the squared pixel values, that is, Σ2 (see Eqn. (2.24)), for an 8-bit grayscale image. What is the maximum image size that is guaranteed not to cause an arithmetic overflow? Perform the same analysis for 64-bit signed integers (long).






Exercise 2.9. Calculate the integral image Σ1 for a given image I, convert it to a floating-point image (FloatProcessor) and display the result. You will realize that integral images are without any apparent structure and they all look more or less the same. Come up with an efficient method for reconstructing the original image I from Σ1.



3 Point Operations



Point operations perform a modification of the pixel values without changing the size, geometry, or local structure of the image. Each new pixel value b = I′(u, v) depends exclusively on the previous value a = I(u, v) at the same position and is thus independent from any other pixel value, in particular from any of its neighboring pixels. The original pixel values a are mapped to the new values b by some given function f, i.e.,

    b = f(I(u, v))   or   b = f(a).    (3.1)

If, as in this case, the function f() is independent of the image coordinates (i.e., the same throughout the image), the operation is called "global" or "homogeneous". Typical examples of homogeneous point operations include, among others:

• modifying image brightness or contrast,
• applying arbitrary intensity transformations ("curves"),
• inverting images,
• quantizing (or "posterizing") images,
• global thresholding,
• gamma correction,
• color transformations, etc.



We will look at some of these techniques in more detail in the following. In contrast to Eqn. (3.1), the mapping g() for a nonhomogeneous point operation would also take into account the current image coordinate (u, v), that is,

    b = g(I(u, v), u, v)   or   b = g(a, u, v).    (3.2)

A typical nonhomogeneous operation is the local adjustment of contrast or brightness used, for example, to compensate for uneven lighting during image acquisition.



If the result depends on more than one pixel value, the operation is called a “filter”, as described in Chapter 4.
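As a minimal illustration of Eqn. (3.1) (our own sketch, not from the book), a homogeneous point operation simply applies the same function f to every pixel of an ImageProcessor; the class name and the particular mapping (raising brightness by 10 units, cf. Eqn. (3.3)) are arbitrary examples:

    import ij.process.ImageProcessor;

    public class PointOperationSketch {
      // applies the homogeneous mapping b = f(a) to every pixel of ip
      static void applyPointOperation(ImageProcessor ip) {
        int w = ip.getWidth();
        int h = ip.getHeight();
        for (int v = 0; v < h; v++) {
          for (int u = 0; u < w; u++) {
            int a = ip.get(u, v);   // original pixel value a
            int b = f(a);           // new pixel value b = f(a)
            ip.set(u, v, b);
          }
        }
      }

      // example mapping f(a): raise brightness by 10 units
      static int f(int a) {
        return Math.min(a + 10, 255);   // clamp to the 8-bit maximum
      }
    }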









3.1 Modifying Image Intensity

3.1.1 Contrast and Brightness

Let us start with a simple example. Increasing the image's contrast by 50% (i.e., by the factor 1.5) or raising the brightness by 10 units can be expressed by the mappings

    fcontr(a) = a · 1.5   or   fbright(a) = a + 10,    (3.3)

respectively. The first operation is implemented as an ImageJ plugin by the code shown in Prog. 3.1, which can easily be adapted to perform any other type of point operation. Rounding to the nearest integer values is accomplished by simply adding 0.5 before the truncation effected by the (int) typecast in line 8 (this only works for positive values). Also note the use of the more efficient image processor methods get() and set() (instead of getPixel() and putPixel()) in this example.

Prog. 3.1 Point operation to increase the contrast by 50% (ImageJ plugin). Note that in line 8 the result of the multiplication of the integer pixel value by the constant 1.5 (implicitly of type double) is of type double. Thus an explicit type cast (int) is required to assign the value to the int variable b. 0.5 is added in line 8 to round to the nearest integer values.



     1  public void run(ImageProcessor ip) {
     2    int w = ip.getWidth();
     3    int h = ip.getHeight();
     4
     5    for (int v = 0; v < h; v++) {
     6      for (int u = 0; u < w; u++) {
     7        int a = ip.get(u, v);
     8        int b = (int) (a * 1.5 + 0.5);
     9        if (b > 255)
    10          b = 255;   // clamp to the maximum value (amax)
    11        ip.set(u, v, b);
    12      }
    13    }
    14  }



3.1.2 Limiting Values by Clamping

When implementing arithmetic operations on pixel values, we must keep in mind that the calculated results must not exceed the admissible range of pixel values for the given image type (e.g., [0, 255] in the case of 8-bit grayscale images). This is commonly called "clamping" and can be expressed in the form

    b = min(max(f(a), amin), amax) =  amin   for f(a) < amin,
                                      amax   for f(a) > amax,
                                      f(a)   otherwise.          (3.4)

For this purpose, line 10 of Prog. 3.1 contains the statement

    if (b > 255) b = 255;

which limits the result to the maximum value 255. Similarly, one may also want to limit the results to the minimum value (0) to avoid negative pixel values (which cannot be represented by this type of 8-bit image), for example, by the statement

    if (b < 0) b = 0;



The above statement is not needed in Prog. 3.1 because the intermediate results can never be negative in this particular operation.
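Where clamping is needed at both ends, a small helper such as the following keeps the point-operation code tidy. This is our own sketch (the method name clamp is not part of the book's or ImageJ's code), intended to be added to the plugin class:

    // clamps x to the range [lo, hi], implementing Eqn. (3.4)
    static int clamp(int x, int lo, int hi) {
      if (x < lo) return lo;
      if (x > hi) return hi;
      return x;
    }

    // example usage inside the pixel loop of Prog. 3.1:
    // ip.set(u, v, clamp(b, 0, 255));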



3.1.3 Inverting Images

Inverting an intensity image is a simple point operation that reverses the ordering of pixel values (by multiplying by −1) and adds a constant value to map the result to the admissible range again. Thus for a pixel value a = I(u, v) in the range [0, amax], the corresponding point operation is

    finv(a) = −a + amax = amax − a.    (3.5)

The inversion of an 8-bit grayscale image with amax = 255 was the task of our first plugin example in Sec. I.1.2 (Prog. I.1). Note that in this case no clamping is required at all because the function always maps to the original range of values. In ImageJ, this operation is performed by the method invert() (for objects of type ImageProcessor) and is also available through the Edit ▷ Invert menu. Obviously, inverting an image mirrors its histogram, as shown in Fig. 3.5 (c).

3.1.4 Thresholding Operation

Thresholding an image is a special type of quantization that separates the pixel values into two classes, depending upon a given threshold value q that is usually constant. The threshold operation maps all pixels to one of two fixed intensity values a0 or a1, that is,

    fthreshold(a) = a0   for a < q,
                    a1   for a ≥ q,    (3.6)

with 0 < q ≤ amax. A common application is binarizing an intensity image with the values a0 = 0 and a1 = 1.

ImageJ does provide a special image type (BinaryProcessor) for binary images, but these are actually implemented as 8-bit intensity images (just like ordinary intensity images) using the values 0 and 255. ImageJ also provides the ImageProcessor method threshold(int level), with level ≡ q, to perform this operation, which can also be invoked through the Image ▷ Adjust ▷ Threshold menu (see Fig. 3.1 for an example). Thresholding affects the histogram by separating the distribution into two entries at positions a0 and a1, as illustrated in Fig. 3.2.
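In code, both operations reduce to single ImageJ calls; a brief sketch (inside a plugin's run() method, for an 8-bit image; the threshold value 128 is an arbitrary example):

    // inversion (Eqn. (3.5)), equivalent to Edit > Invert:
    ip.invert();

    // thresholding (Eqn. (3.6)) with q = 128, equivalent to Image > Adjust > Threshold:
    ip.threshold(128);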



3.2 Point Operations and Histograms

We have already seen that the effects of a point operation on the image's histogram are quite easy to predict in some cases. For example, increasing the brightness of an image by a constant value shifts the entire histogram to the right, raising the contrast widens it, and inverting the image flips the histogram. Although this appears rather simple, it may be useful to look a bit more closely at the relationship between point operations and the resulting changes in the histogram.






Fig. 3.1 Threshold operation: original image (a) and corresponding histogram (c); result after thresholding with ath = 128, a0 = 0, a1 = 255 (b) and corresponding histogram (d); ImageJ's interactive Threshold menu (e).



Fig. 3.2 Effects of thresholding upon the histogram. The threshold value is ath . The original distribution (a) is split and merged into two isolated entries at a0 and a1 in the resulting histogram (b).






As the illustration in Fig. 3.3 shows, every entry (bar) at some position i in the histogram maps to a set (of size h(i)) containing all image pixels whose values are exactly i. If a particular histogram line is shifted as a result of some point operation, then of course all pixels in the corresponding set are equally modified and vice versa. So what happens when a point operation (e.g., reducing image contrast) causes two previously separated histogram lines to fall together at the same position i? The answer is that the corresponding pixel sets are merged and the new common histogram entry is the sum of the two (or more) contributing entries (i.e., the size of the combined set). At this point, the elements in the merged set are no longer distinguishable (or separable), so this operation may have (perhaps unintentionally) caused an irreversible reduction of dynamic range and thus a permanent loss of information in that image.

Of course this is only true for ordinary histograms with an entry for every single intensity value. If binning is used (see Sec. 2.4.1), each histogram entry maps to pixels within a certain range of values.

3.3 Automatic Contrast Adjustment

Automatic contrast adjustment (auto-contrast) is a point operation whose task is to modify the pixels such that the available range of values is fully covered. This is done by mapping the current darkest and brightest pixels to the minimum and maximum intensity values, respectively, and linearly distributing the intermediate values.






Fig. 3.3 Histogram entries represent sets of pixels of the same value. If a histogram line is moved as a result of some point operation, then all pixels in the corresponding set are equally modified (a). If, due to this operation, two histogram lines h(a1), h(a2) coincide on the same index, the two corresponding pixel sets merge and the contained pixels become indiscernible (b).



(b)



Fig. 3.4 Auto-contrast operation as defined in Eqn. (3.7). Original pixel values a in the range [alo, ahi] are mapped linearly to the target range [amin, amax].






Let us assume that alo and ahi are the lowest and highest pixel values found in the current image, whose full intensity range is [amin, amax]. To stretch the image to the full intensity range (see Fig. 3.4), we first map the smallest pixel value alo to zero, subsequently increase the contrast by the factor (amax − amin)/(ahi − alo), and finally shift to the target range by adding amin. The mapping for the auto-contrast operation is thus defined as

    fac(a) = amin + (a − alo) · (amax − amin)/(ahi − alo),    (3.7)

provided that ahi ≠ alo; that is, the image contains at least two different pixel values. For an 8-bit image with amin = 0 and amax = 255, the function in Eqn. (3.7) simplifies to

    fac(a) = (a − alo) · 255/(ahi − alo).    (3.8)

The target range [amin, amax] need not be the maximum available range of values but can be any interval to which the image should be mapped. Of course the method can also be used to reduce the image contrast to a smaller range. Figure 3.5 (b) shows the effects of an auto-contrast operation on the corresponding histogram, where the linear stretching of the intensity range results in regularly spaced gaps in the new distribution.
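For an 8-bit image, Eqn. (3.8) could be implemented roughly as follows. This is a sketch (not the book's reference code) assuming amin = 0 and amax = 255; it determines alo and ahi from the histogram and applies the mapping via a lookup table:

    public void run(ImageProcessor ip) {
      int[] h = ip.getHistogram();
      // find the lowest (alo) and highest (ahi) occurring pixel values:
      int alo = 0;
      while (alo < 255 && h[alo] == 0) alo++;
      int ahi = 255;
      while (ahi > 0 && h[ahi] == 0) ahi--;
      if (ahi == alo) return;     // only one pixel value, nothing to stretch

      int[] F = new int[256];     // lookup table for f_ac (Eqn. (3.8))
      for (int a = 0; a < 256; a++) {
        int b = (int) Math.round((a - alo) * 255.0 / (ahi - alo));
        F[a] = Math.min(Math.max(b, 0), 255);   // clamp (Eqn. (3.4))
      }
      ip.applyTable(F);
    }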






Fig. 3.5 Effects of auto-contrast and inversion operations on the resulting histograms. Original image (a), result of auto-contrast operation (b), and inversion (c). The histogram entries are shown both linearly (black bars) and logarithmically (gray bars).



3.4 Modified Auto-Contrast Operation

In practice, the mapping in Eqn. (3.7) could be strongly influenced by only a few extreme (low or high) pixel values, which may not be representative of the main image content. This can be avoided to a large extent by "saturating" a fixed percentage (plo, phi) of pixels at the upper and lower ends of the target intensity range. To accomplish this, we determine two limiting values a′lo, a′hi such that a predefined quantile plo of all pixel values in the image I are smaller than a′lo and another quantile phi of the values are greater than a′hi (see Fig. 3.6).

Fig. 3.6 Modified auto-contrast operation (Eqn. (3.11)). Predefined quantiles (plo, phi) of the pixel population—shown as yellow areas at the left and right ends of the histogram h(i)—are "saturated" (i.e., mapped to the extreme values of the target range). The intermediate values (a = a′lo, ..., a′hi) are mapped linearly to the interval amin, ..., amax.






The values a′lo, a′hi depend on the image content and can be easily obtained from the image's cumulative histogram H (see Sec. 2.6):

    a′lo = min { i | H(i) ≥ M·N·plo },          (3.9)
    a′hi = max { i | H(i) ≤ M·N·(1−phi) },      (3.10)

where 0 ≤ plo, phi ≤ 1, plo + phi ≤ 1, and M·N is the number of pixels in the image. All pixel values outside (and including) a′lo and a′hi are mapped to the extreme values amin and amax, respectively, and intermediate values are mapped linearly to the interval [amin, amax].



Using this formulation, the mapping to minimum and maximum intensities does not depend on singular extreme pixels only but can be based on a representative set of pixels. The mapping for the modified auto-contrast operation can thus be defined as

    fmac(a) = amin                                                    for a ≤ a′lo,
              amin + (a − a′lo) · (amax − amin)/(a′hi − a′lo)         for a′lo < a < a′hi,
              amax                                                    for a ≥ a′hi.        (3.11)



Usually the same value is taken for both upper and lower quantiles (i.e., plo = phi = p), with p = 0.005, . . . , 0.015 (0.5, . . . , 1.5 %) being common values. For example, the auto-contrast operation in Adobe Photoshop saturates 0.5 % (p = 0.005) of all pixels at both ends of the intensity range. Auto-contrast is a frequently used point operation and thus available in practically any image-processing software. ImageJ implements the modified auto-contrast operation as part of the Brightness/Contrast and Image ▷ Adjust menus (Auto button), shown in Fig. 3.7.
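The limits a′lo and a′hi of Eqns. (3.9)–(3.10) are easily obtained from the (non-normalized) cumulative histogram. The following is our own sketch (method name and exact loop form are not from the book):

    // returns {a'_lo, a'_hi} for histogram h and saturation quantiles plo, phi
    static int[] getQuantileLimits(int[] h, double plo, double phi) {
      int K = h.length;
      long MN = 0;                       // total number of pixels M*N
      for (int i = 0; i < K; i++) MN += h[i];

      long[] H = new long[K];            // cumulative histogram (Eqn. (2.6))
      H[0] = h[0];
      for (int i = 1; i < K; i++) H[i] = H[i - 1] + h[i];

      int aLo = 0;                       // smallest i with H(i) >= M*N*plo  (Eqn. (3.9))
      while (aLo < K - 1 && H[aLo] < MN * plo) aLo++;

      int aHi = K - 1;                   // largest i with H(i) <= M*N*(1-phi)  (Eqn. (3.10))
      while (aHi > 0 && H[aHi] > MN * (1 - phi)) aHi--;

      return new int[] {aLo, aHi};
    }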



Fig. 3.7 ImageJ’s Brightness/Contrast tool (left) and Window/Level tool (right) can be invoked through the Image ▷ Adjust menu. The Auto button displays the result of a modified auto-contrast operation. Apply must be hit to actually modify the image.



3.5 Histogram Equalization

A frequent task is to adjust two different images in such a way that their resulting intensity distributions are similar, for example, to use them in a print publication or to make them easier to compare. The goal of histogram equalization is to find and apply a point operation such that the histogram of the modified image approximates a uniform distribution (see Fig. 3.8). Since the histogram is a discrete distribution and homogeneous point operations can only shift and merge (but never split) histogram entries, we can only obtain an approximate solution in general. In particular, there is no way to eliminate or decrease individual peaks in a histogram, and a truly uniform distribution is thus impossible to reach. Based on point operations, we can thus modify the image only to the extent that the resulting histogram is approximately uniform.






Fig. 3.8 Histogram equalization. The idea is to find and apply a point operation to the image (with original histogram h) such that the histogram heq of the modified image approximates a uniform distribution (top). The cumulative target histogram Heq must thus be approximately wedge-shaped (bottom).






Fig. 3.9 Histogram equalization on the cumulative histogram. A suitable point operation b ← feq(a) shifts each histogram line from its original position a to b (left or right) such that the resulting cumulative histogram Heq is approximately linear.






The question is how good this approximation can be and exactly which point operation (which clearly depends on the image content) we must apply to achieve this goal. We may get a first idea by observing that the cumulative histogram (Sec. 2.6) of a uniformly distributed image is a linear ramp (wedge), as shown in Fig. 3.8. So we can reformulate the goal as finding a point operation that shifts the histogram lines such that the resulting cumulative histogram is approximately linear, as illustrated in Fig. 3.9. The desired point operation feq() is simply obtained from the cumulative histogram H of the original image (for a derivation, see, e.g., [108, p. 173]) as

    feq(a) = ⌊ H(a) · (K−1)/(M·N) ⌋,    (3.12)

for an image of size M × N with pixel values a in the range [0, K−1]. The resulting function feq(a) in Eqn. (3.12) is monotonically increasing, because H(a) is monotonic and K, M, N are all positive constants. In the (unusual) case where an image is already uniformly distributed, linear histogram equalization should not modify that image any further. Also, repeated applications of linear histogram equalization should not make any changes to the image after the first time. Both requirements are fulfilled by the formulation in Eqn. (3.12). Program 3.2 lists the Java code for a sample implementation of linear histogram equalization. An example demonstrating the effects on the image and the histograms is shown in Fig. 3.10.






Fig. 3.10 Linear histogram equalization example. Original image I (a) and modified image I ′ (b), corresponding histograms h, h′ (c, d), and cumulative histograms H, H′ (e, f). The resulting cumulative histogram H′ (f) approximates a uniformly distributed image. Notice that new peaks are created in the resulting histogram h′ (d) by merging original histogram cells, particularly in the lower and upper intensity ranges.






Notice that for "inactive" pixel values i (i.e., pixel values that do not appear in the image, with h(i) = 0), the corresponding entries in the cumulative histogram H(i) are either zero or identical to the neighboring entry H(i − 1). Consequently a contiguous range of zero values in the histogram h(i) corresponds to a constant (i.e., flat) range in the cumulative histogram H(i), and the function feq(a) maps all "inactive" intensity values within such a range to the next lower "active" value. This effect is not relevant, however, since the image contains no such pixels anyway. Nevertheless, a linear histogram equalization may (and typically will) cause histogram lines to merge and consequently lead to a loss of dynamic range (see also Sec. 3.2).

This or a similar form of linear histogram equalization is implemented in almost any image-processing software. In ImageJ it can be invoked interactively through the Process ▷ Enhance Contrast menu (option Equalize). To avoid extreme contrast effects, the histogram equalization in ImageJ by default cumulates the square root of the histogram entries using a modified cumulative histogram of the form

    H̃(i) = Σ_{j=0..i} √h(j).    (3.13)



The “classic” linear approach (see Eqn. (2.5)) is used when simultaneously keeping the Alt key pressed.
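A sketch of this square-root variant is shown below; it is essentially Prog. 3.2 with the cumulation step of Eqn. (3.13). This is our own code, not ImageJ's actual implementation, and it normalizes by H̃(K−1) instead of M·N (an assumption analogous to Eqn. (3.12), where H(K−1) = M·N):

    public void run(ImageProcessor ip) {
      int K = 256;
      int[] h = ip.getHistogram();

      double[] Ht = new double[K];           // modified cumulative histogram (Eqn. (3.13))
      Ht[0] = Math.sqrt(h[0]);
      for (int j = 1; j < K; j++) {
        Ht[j] = Ht[j - 1] + Math.sqrt(h[j]);
      }

      int[] F = new int[K];                  // mapping analogous to Eqn. (3.12)
      for (int a = 0; a < K; a++) {
        F[a] = (int) Math.floor(Ht[a] * (K - 1) / Ht[K - 1]);
      }
      ip.applyTable(F);
    }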






Prog. 3.2 Linear histogram equalization (ImageJ plugin). First the histogram of the image ip is obtained using the standard ImageJ method ip.getHistogram() in line 7. In line 9, the cumulative histogram is computed "in place" based on the recursive definition in Eqn. (2.6). The int division in line 16 implicitly performs the required floor (⌊ ⌋) operation by truncation.



     1  public void run(ImageProcessor ip) {
     2    int M = ip.getWidth();
     3    int N = ip.getHeight();
     4    int K = 256; // number of intensity values
     5
     6    // compute the cumulative histogram:
     7    int[] H = ip.getHistogram();
     8    for (int j = 1; j < H.length; j++) {
     9      H[j] = H[j - 1] + H[j];
    10    }
    11
    12    // equalize the image:
    13    for (int v = 0; v < N; v++) {
    14      for (int u = 0; u < M; u++) {
    15        int a = ip.get(u, v);
    16        int b = H[a] * (K - 1) / (M * N);   // see Eqn. (3.12)
    17        ip.set(u, v, b);
    18      }
    19    }
    20  }



3.6 Histogram Specification

Although widely implemented, the goal of linear histogram equalization—a uniform distribution of intensity values (as described in the previous section)—appears rather ad hoc, since good images virtually never show such a distribution. In most real images, the distribution of the pixel values is not even remotely uniform but is usually more similar, if at all, to perhaps a Gaussian distribution. The images produced by linear equalization thus usually appear quite unnatural, which renders the technique practically useless.

Histogram specification is a more general technique that modifies the image to match an arbitrary intensity distribution, including the histogram of a given image. This is particularly useful, for example, for adjusting a set of images taken by different cameras or under varying exposure or lighting conditions to give a similar impression in print production or when displayed. Similar to histogram equalization, this process relies on the alignment of the cumulative histograms by applying a homogeneous point operation. To be independent of the image size (i.e., the number of pixels), we first define normalized distributions, which we use in place of the original histograms.

3.6.1 Frequencies and Probabilities

The value in each histogram cell describes the observed frequency of the corresponding intensity value, i.e., the histogram is a discrete frequency distribution. For a given image I of size M × N, the sum of all histogram entries h(i) equals the number of image pixels,

    Σᵢ h(i) = M·N.    (3.14)

The associated normalized histogram,

    p(i) = h(i)/(M·N),   for 0 ≤ i < K,    (3.15)



Alg. 3.1 Calculation of the cumulative distribution function (cdf) P(i) from a given histogram h of length K. See Prog. 3.3 (p. 64) for the corresponding Java implementation.

    1: Cdf(h)
       Input: h, a histogram of size K. Returns the cumulative distribution function P for the histogram h.
    2:   K ← Size(h)
    3:   n ← Σ_{i=0..K−1} h(i)
    4:   P ← new map [0, K−1] ↦ ℝ
    5:   c ← 0
    6:   for i ← 0, ..., K−1 do          ▷ cumulate histogram values
    7:     c ← c + h(i)
    8:     P(i) ← c/n                    ▷ P(i) ∈ [0, 1]
    9:   return P.



is usually interpreted as the probability distribution or probability density function (pdf) of a random process, where p(i) is the probability for the occurrence of the pixel value i. The cumulative probability of i being any possible value is 1, and the distribution p must thus satisfy

    Σ_{i=0..K−1} p(i) = 1.    (3.16)



The statistical counterpart to the cumulative histogram H (Eqn. (2.5)) is the discrete distribution function P() (also called the cumulative distribution function or cdf),

    P(i) = H(i)/H(K−1) = H(i)/(M·N) = Σ_{j=0..i} h(j)/(M·N) = Σ_{j=0..i} p(j),    (3.17)

for i = 0, ..., K−1. The computation of the cdf from a given histogram h is outlined in Alg. 3.1. The resulting function P(i) is (as the cumulative histogram) monotonically increasing and, in particular,

    P(0) = p(0)   and   P(K−1) = Σ_{i=0..K−1} p(i) = 1.    (3.18)
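A direct Java transcription of Alg. 3.1 might look as follows. This is our own sketch of the Cdf() helper that the later listings rely on, not the book's printed version:

    // cumulative distribution function P(i) for histogram h (Alg. 3.1, Eqn. (3.17))
    static double[] Cdf(int[] h) {
      int K = h.length;
      long n = 0;                    // n = sum of all histogram entries (= M*N)
      for (int i = 0; i < K; i++) {
        n += h[i];
      }
      double[] P = new double[K];
      long c = 0;                    // running sum (cumulate histogram values)
      for (int i = 0; i < K; i++) {
        c += h[i];
        P[i] = (double) c / n;       // P(i) in [0, 1]
      }
      return P;
    }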



This statistical formulation implicitly treats the generation of images as a random process whose exact properties are mostly unknown. However, the process is usually assumed to be homogeneous (independent of the image position); that is, each pixel value is the result of a "random experiment" on a single random variable i. The observed frequency distribution given by the histogram h(i) serves as a (coarse) estimate of the probability distribution p(i) of this random variable.

3.6.2 Principle of Histogram Specification

The goal of histogram specification is to modify a given image IA by some point operation such that its distribution function PA matches a reference distribution PR as closely as possible.



Statistical modeling of the image generation process has a long tradition (see, e.g., [150, Ch. 2]).






Fig. 3.11 Principle of histogram specification. Given is the reference distribution PR (left) and the distribution function for the original image PA (right). The result is the mapping fhs: a → a′ for a point operation, which replaces each pixel a in the original image IA by a modified value a′. The process has two main steps: (A) For each pixel value a, determine b = PA(a) from the right distribution function. (B) a′ is then found by inverting the left distribution function as a′ = PR⁻¹(b). In summary, the result is fhs(a) = a′ = PR⁻¹(PA(a)).






We thus look for a mapping

    a′ = fhs(a)    (3.19)

to convert the original image IA by a point operation to a new image IA′ with pixel values a′, such that its distribution function PA′ matches PR, that is,

    PA′(i) ≈ PR(i),   for 0 ≤ i < K.    (3.20)

As illustrated in Fig. 3.11, the desired mapping fhs is found by combining the two distribution functions PR and PA (see [108, p. 180] for details). For a given pixel value a in the original image, we obtain the new pixel value a′ as

    a′ = PR⁻¹(PA(a)) = PR⁻¹(b)    (3.21)

and thus the mapping fhs (Eqn. (3.19)) is defined as

    fhs(a) = PR⁻¹(PA(a)),   for 0 ≤ a < K.    (3.22)



This of course assumes that PR(i) is invertible, that is, that the function PR⁻¹(b) exists for b ∈ [0, 1].

3.6.3 Adjusting to a Piecewise Linear Distribution

If the reference distribution PR is given as a continuous, invertible function, then the mapping fhs can be obtained from Eqn. (3.22) without any difficulty. In practice, it is convenient to specify the (synthetic) reference distribution as a piecewise linear function PL(i); that is, as a sequence of n+1 coordinate pairs

    L = ⟨a0, P0⟩, ..., ⟨an, Pn⟩,

each consisting of an intensity value ak and the corresponding cumulative probability Pk. We assert that 0 ≤ ak < K, ak < ak+1, and 0 ≤ Pk < 1. Also, the two endpoints ⟨a0, P0⟩ and ⟨an, Pn⟩ are fixed at

    (0, P0)   and   (K−1, 1),

respectively. To be invertible, the function must also be strictly monotonic, that is, Pk < Pk+1 for 0 ≤ k < n. Figure 3.12 shows an example of such a function, which is specified by n = 5 variable points (P0, ..., P4) and a fixed end point P5 and thus consists of n linear segments. The reference distribution can of course be specified at an arbitrary accuracy by inserting additional control points.






Fig. 3.12 Piecewise linear reference distribution. The function PL (i) is specified by n = 5 control points ⟨0, P0 ⟩, ⟨a1 , P1 ⟩, . . . , ⟨a4 , P4 ⟩, with ak < ak+1 and Pk < Pk+1 . The final point P5 is fixed at ⟨K −1, 1⟩.






The intermediate values of PL(a) are obtained by linear interpolation between the control points as

    PL(a) = Pr + (a − ar) · (Pr+1 − Pr)/(ar+1 − ar)   for 0 ≤ a < K−1,
            1                                          for a = K−1,        (3.23)

where r = max{j ∈ [0, n−1] | aj ≤ a} is the index of the line segment ⟨ar, Pr⟩ → ⟨ar+1, Pr+1⟩ overlapping position a. For instance, in the example in Fig. 3.12, point a lies within the segment that starts at point ⟨a2, P2⟩, i.e., r = 2.

For the histogram specification according to Eqn. (3.22), we also need the inverse distribution function PL⁻¹(b) for b ∈ [0, 1]. As we see from the example in Fig. 3.12, the function PL(a) is in general not invertible for values b < PL(0). We can fix this problem by mapping all values b < PL(0) to zero and thus obtain a "semi-inverse" of the reference distribution in Eqn. (3.23), i.e.,

    PL⁻¹(b) = 0                                             for 0 ≤ b < PL(0),
              as + (b − Ps) · (as+1 − as)/(Ps+1 − Ps)       for PL(0) ≤ b < 1,
              K−1                                           for b ≥ 1.            (3.24)

Here s = max{j ∈ {0, ..., n−1} | Pj ≤ b} is the index of the line segment ⟨as, Ps⟩ → ⟨as+1, Ps+1⟩ that overlaps position b. The required mapping fhs for adapting a given image with intensity distribution PA is finally found, analogous to Eqn. (3.22), as

    fhs(a) = PL⁻¹(PA(a)),   for 0 ≤ a < K.    (3.25)

The whole process of computing the pixel mapping for a given image (histogram) and a piecewise linear target distribution is summarized in Alg. 3.2. A real example is shown in Fig. 3.14 (Sec. 3.6.5).

3.6.4 Adjusting to a Given Histogram (Histogram Matching)

If we want to adjust one image to the histogram of another image, the reference distribution function PR(i) is not continuous and thus, in general, cannot be inverted (as required by Eqn. (3.22)). For example, if the reference distribution contains zero entries (i.e., pixel






Alg. 3.2 Histogram specification using a piecewise linear reference distribution. Given is the histogram h of the original image and a piecewise linear reference distribution function, specified as a sequence of n+1 control points L = (⟨ai, Pi⟩). The discrete mapping f for the corresponding point operation is returned.



    1: MatchPiecewiseLinearHistogram(h, L)
       Input: h, original histogram; L, reference distribution function, given as a sequence of n+1 control points L = (⟨a0, P0⟩, ..., ⟨an, Pn⟩), with 0 ≤ ak < K, 0 ≤ Pk ≤ 1, and Pk < Pk+1. Returns a discrete mapping f: a → b to be applied to the values of the original image.
    2:   n ← |L| − 1
    3:   K ← |h|
    4:   P ← Cdf(h)                                          ▷ cdf for h (see Alg. 3.1)
    5:   f ← new map [0, K−1] ↦ ℝ                             ▷ discrete map f
    6:   for a ← 0, ..., K−1 do
    7:     b ← P(a)
    8:     if (b ≤ P0) then
    9:       a′ ← 0
   10:     else if (b ≥ 1) then
   11:       a′ ← K−1
   12:     else
   13:       s ← n−1
   14:       while (s ≥ 0) ∧ (Ps > b) do                      ▷ find line segment in L
   15:         s ← s−1
   16:       a′ ← as + (b−Ps) · (as+1 − as)/(Ps+1 − Ps)        ▷ see Eqn. (3.24)
   17:     f(a) ← a′
   18:   return f.
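Alg. 3.2 translates into Java quite directly. The following is our own sketch (method name, parameter layout and the Cdf() helper shown earlier are assumptions, not the book's printed code); the arrays a[] and P[] hold the n+1 control points ⟨ak, Pk⟩ with a[n] = K−1 and P[n] = 1:

    static int[] matchPiecewiseLinearHistogram(int[] h, int[] a, double[] P) {
      int n = a.length - 1;              // number of line segments
      int K = h.length;
      double[] PA = Cdf(h);              // cdf of the original image (Alg. 3.1)
      int[] f = new int[K];              // discrete mapping f

      for (int i = 0; i < K; i++) {
        double b = PA[i];
        double aa;
        if (b <= P[0]) {
          aa = 0;
        }
        else if (b >= 1) {
          aa = K - 1;
        }
        else {
          int s = n - 1;
          while (s >= 0 && P[s] > b) {   // find the line segment containing b
            s--;
          }
          aa = a[s] + (b - P[s]) * (a[s+1] - a[s]) / (P[s+1] - P[s]);  // Eqn. (3.24)
        }
        f[i] = (int) Math.round(aa);
      }
      return f;
    }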



Fig. 3.13 Discrete histogram specification. The reference distribution PR (left) is "filled" layer by layer from bottom to top and from right to left. For every possible intensity value a (starting from a = 0), the associated probability pA(a) is added as a horizontal bar to a stack accumulated "under" the reference distribution PR. The bar with thickness pA(a) is drawn from right to left down to the position a′, where the reference distribution PR is reached. The function fhs() must map a to a′.






values k with probability p(k) = 0), the corresponding cumulative distribution function P (just like the cumulative histogram) has intervals of constant value on which no inverse function value can be determined. In the following, we describe a simple method for histogram matching that works with discrete reference distributions. The principal idea is graphically illustrated in Fig. 3.13. The mapping fhs is not obtained by inverting but by “filling in” the reference distribution function PR (i). For each possible pixel value a, starting with a = 0, the corresponding probability pA (a) is stacked layer by layer “under” the reference distribution PR . The thickness of each horizontal bar for a equals the corresponding probability pA (a). The bar for a particular intensity value a with thickness pA (a) runs from right to left, down to position a′ , where it hits the reference distribution PR . This position a′ corresponds to the new pixel value to which a should be mapped.



Alg. 3.3 Histogram matching. Given are two histograms: the histogram hA of the target image IA and a reference histogram hR, both of size K. The result is a discrete mapping f which, when applied to the target image, produces a new image with a distribution function similar to the reference histogram.

    1: MatchHistograms(hA, hR)
       Input: hA, histogram of the target image IA; hR, reference histogram (the same size as hA). Returns a discrete mapping f to be applied to the values of a target image.
    2:   K ← Size(hA)
    3:   PA ← Cdf(hA)                   ▷ cdf for hA (Alg. 3.1)
    4:   PR ← Cdf(hR)                   ▷ cdf for hR (Alg. 3.1)
    5:   f ← new map [0, K−1] ↦ ℝ       ▷ map f
    6:   for a ← 0, ..., K−1 do
    7:     j ← K−1
    8:     do
    9:       f(a) ← j
   10:       j ← j−1
   11:     while (j ≥ 0) ∧ (PA(a) ≤ PR(j))
   12:   return f



Since the sum of all probabilities pA and the maximum of the distribution function PR are both 1 (i.e., Σᵢ pA(i) = maxᵢ PR(i) = 1), all horizontal bars will exactly fit under the function PR. One may also notice in Fig. 3.13 that the distribution value resulting at a′ is identical to the cumulated probability PA(a). Given some intensity value a, it is therefore sufficient to find the minimum value a′, where the reference distribution PR(a′) is greater than or equal to the cumulative probability PA(a), that is,

    fhs(a) = min { j | (0 ≤ j < K) ∧ (PA(a) ≤ PR(j)) }.    (3.26)



This results in a very simple procedure, which is summarized in Alg. 3.3. The corresponding Java implementation in Prog. 3.3 consists of the method matchHistograms(), which accepts the original histogram (hA) and the reference histogram (hR) and returns the mapping (f) to modify the pixel values of the target image IA. Due to the use of normalized distribution functions, the size of the associated images is not relevant. The following code fragment demonstrates the use of the matchHistograms() method from Program 3.3:

    ImageProcessor IA = ...              // target image IA (to be modified)
    ImageProcessor IR = ...              // reference image IR

    int[] ha = IA.getHistogram();        // get histogram for IA
    int[] hr = IR.getHistogram();        // get histogram for IR

    int[] f = matchHistograms(ha, hr);   // mapping f
    IA.applyTable(f);                    // modify the target image IA



The original image IA is modified in the last line by applying the mapping (lookup table) f with the method applyTable() (see also p. 75).

3.6.5 Examples

Adjusting to a Piecewise Linear Reference Distribution

The first example in Fig. 3.14 shows the results of histogram specification for a continuous, piecewise linear reference distribution.






Prog. 3.3 Histogram matching (Java implementation of Alg. 3.3). The method matchHistograms() computes the mapping f from the target histogram hA and the reference histogram hR (see Eqn. (3.26)). The method Cdf() computes the cumulative distribution function (cdf) for a given histogram (Eqn. (3.17)).



    int[] matchHistograms (int[] hA, int[] hR) {
      // hA ... histogram hA of the target image IA (to be modified)
      // hR ... reference histogram hR
      // returns the mapping f to be applied to image IA
      int K = hA.length;
      double[] PA = Cdf(hA);    // get CDF of histogram hA
      double[] PR = Cdf(hR);    // get CDF of histogram hR
      int[] f = new int[K];     // mapping f is a simple int-array

      // compute mapping f:
      for (int a = 0; a < K; a++) {
        int j = K - 1;
        do {
          f[a] = j;
          j--;
        } while (j >= 0 && PA[a] <= PR[j]);
      }
      return f;
    }

3.7 Gamma Correction

Controlled by a single continuous parameter (γ), the power function fγ(a) = a^γ (Eqn. (3.27)) can thus "imitate" both logarithmic and exponential types of functions. Within the interval [0, 1], the function is continuous and strictly monotonic, and also very simple to invert as

    a = fγ⁻¹(b) = b^(1/γ),    (3.28)

since b^(1/γ) = (a^γ)^(1/γ) = a^1 = a. The inverse of the exponential function fγ⁻¹(b) is thus again an exponential function,

    fγ⁻¹(b) = fγ̄(b) = f1/γ(b),    (3.29)



with the parameter γ̄ = 1/γ.

3.7.3 Real Gamma Values

The actual gamma values of individual devices are usually specified by the manufacturers based on real measurements. For example, common gamma values for CRT monitors are in the range 1.8 to 2.8, with 2.4 as a typical value. Most LCD monitors are internally adjusted to similar values. Digital video and still cameras also emulate the transfer characteristics of analog film and photographic cameras by making internal corrections to give the resulting images an accustomed "look". In TV receivers, gamma values are standardized with 2.2 for analog NTSC and 2.8 for the PAL system (these values are theoretical; results of actual measurements are around 2.35). A gamma value of 1/2.2 ≈ 0.45 is the norm for cameras in NTSC as well as the EBU (European Broadcast Union) standards. The current international standard ITU-R BT.709 (International Telecommunications Union, ITU) calls for uniform gamma values of 2.5 in receivers and 1/1.956 ≈ 0.51 for cameras [92, 144].



Fig. 3.20 Principle of gamma correction. To compensate the output signal S produced by a camera with nominal gamma value γc, a gamma correction is applied with γ̄c = 1/γc. The corrected signal S′ is proportional to the received light intensity L.



The ITU 709 standard is based on a slightly modified version of the gamma correction (see Sec. 3.7.6). Computers usually allow adjustment of the gamma value applied to the video output signals to adapt to a wide range of different monitors. Note, however, that the power function fγ() is only a coarse approximation to the actual transfer characteristics of any device, which may also not be the same for different color channels. Thus significant deviations may occur in practice, despite the careful choice of gamma settings. Critical applications, such as prepress or high-end photography, usually require additional calibration efforts based on exactly measured device profiles (see Sec. 14.7.4).

3.7.4 Applications of Gamma Correction

Let us first look at the simple example illustrated in Fig. 3.20. Assume that we use a digital camera with a nominal gamma value γc, meaning that its output signal S relates to the incident light intensity L as

    S = L^γc.    (3.30)

To compensate the transfer characteristic of this camera (i.e., to obtain a measurement S′ that is proportional to the original light intensity L), the camera signal S is subject to a gamma correction with the inverse of the camera's gamma value γ̄c = 1/γc and thus

    S′ = fγ̄c(S) = S^(1/γc).    (3.31)

The resulting signal

    S′ = S^(1/γc) = (L^γc)^(1/γc) = L^(γc · 1/γc) = L^1

is obviously proportional (in theory even identical) to the original light intensity L. Although this example is quite simplistic, it still demonstrates the general rule, which holds for output devices as well:

The transfer characteristic of an input or output device with specified gamma value γ is compensated for by a gamma correction with γ̄ = 1/γ.
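As a small worked example (the numbers are our own, chosen only for illustration): with γc = 2.0, a relative light intensity L = 0.25 produces the camera signal S = L^2 = 0.0625; applying the correction with γ̄c = 1/2 gives S′ = S^0.5 = 0.25 = L, i.e., the original intensity is recovered exactly.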






In the above we have implicitly assumed that all values are strictly in the range [0, 1], which usually is not the case in practice. When working with digital images, we have to deal with discrete pixel values, for example, in the range [0, 255] for 8-bit images.



Fig. 3.21 Gamma correction in the digital imaging work flow. Images are processed and stored in a "linear" intensity space, where gamma correction is used to compensate for the transfer characteristic of each input and output device. (The gamma values shown are examples only.)



In general, performing a gamma correction

    b ← fgc(a, γ),

on a pixel value a ∈ [0, amax] and a gamma value γ > 0 requires the following three steps:

1. Scale a linearly to â ∈ [0, 1].
2. Apply the gamma correction function to â: b̂ ← â^γ.
3. Scale b̂ ∈ [0, 1] linearly back to b ∈ [0, amax].

Formulated in a more compact way, the corrected pixel value b is obtained from the original value a as

    b ← (a/amax)^γ · amax.    (3.32)



Figure 3.21 illustrates the typical role of gamma correction in the digital work flow with two input (camera, scanner) and two output devices (monitor, printer), each with its individual gamma value. The central idea is to correct all images to be processed and stored in a device-independent, standardized intensity space.

3.7.5 Implementation

Program 3.4 shows the implementation of gamma correction as an ImageJ plugin for 8-bit grayscale images. The mapping fgc(a, γ) is computed as a lookup table (Fgc), which is then applied to the image using the method applyTable() to perform the actual point operation (see also Sec. 3.8.1).

3.7.6 Modified Gamma Correction

A subtle problem with the simple power function fγ(a) = a^γ (Eqn. (3.27)) appears if we take a closer look at the slope of this function, expressed by its first derivative,






Prog. 3.4 Implementation of simple gamma correction (run() method of an ImageJ plugin). The corrected intensity values b are only computed once and stored in the lookup table F (line 15). The gamma value GAMMA is constant. The actual point operation is performed by calling the ImageJ method applyTable(F) on the image object ip (line 18).



     1  public void run(ImageProcessor ip) {
     2    // works for 8-bit images only
     3    int K = 256;
     4    int aMax = K - 1;
     5    double GAMMA = 2.8;
     6
     7    // create and fill the lookup table:
     8    int[] F = new int[K];
     9
    10    for (int a = 0; a < K; a++) {
    11      double aa = (double) a / aMax;      // scale to [0, 1]
    12      double bb = Math.pow(aa, GAMMA);    // power function
    13      // scale back to [0, 255]:
    14      int b = (int) Math.round(bb * aMax);
    15      F[a] = b;
    16    }
    17
    18    ip.applyTable(F);   // modify the image
    19  }



    f′γ(a) = γ · a^(γ−1),    (3.33)

which for a = 0 has the values

    f′γ(0) = 0   for γ > 1,
             1   for γ = 1,
             ∞   for γ < 1.

The tangent to the function at the origin is thus horizontal (γ > 1), diagonal (γ = 1), or vertical (γ < 1), with no intermediate values. For γ < 1, this causes extremely high amplification of small intensity values and thus increased noise in dark image regions. Theoretically, this also means that the power function is generally not invertible at the origin.

A common solution to this problem is to replace the lower part (0 ≤ a ≤ a0) of the power function by a linear segment with constant slope and to continue with the ordinary power function for a > a0. The resulting modified gamma correction function,

    f̄γ,a0(a) = s · a                 for 0 ≤ a ≤ a0,
               (1 + d) · a^γ − d     for a0 < a ≤ 1,        (3.34)

    with  s = γ / (a0·(γ−1) + a0^(1−γ))   and   d = 1/(a0^γ·(γ−1) + 1) − 1,    (3.35)






thus consists of a linear section (for 0 ≤ a ≤ a0 ) and a nonlinear section (for a0 < a ≤ 1) that connect smoothly at the transition point a = a0 . The linear slope s and the parameter d are determined by the requirement that the two function segments must have identical values as well as identical slopes (first derivatives) at a = a0 to produce a continuous function. The function in Eqn. (3.34) is thus fully specified by the two parameters a0 and γ. Figure 3.22 shows two examples of the modified gamma correction f¯γ,a0 () with values γ = 0.5 and γ = 2.0, respectively. In both cases, the transition point is at a0 = 0.2. For comparison, the figure also



Fig. 3.22 Modified gamma correction. The mapping f̄γ,a0(a) consists of a linear segment with fixed slope s between a = 0 and a = a0, followed by a power function with parameter γ (Eqn. (3.34)). The dashed lines show the ordinary power functions for the same gamma values. (a) γ = 0.5, a0 = 0.2; (b) γ = 2.0, a0 = 0.2.



Table 3.1 Gamma correction parameters for the ITU and sRGB standards based on the modified mapping in Eqns. (3.34) and (3.35).

    Standard        Nominal gamma value γ    a0          s       d       Effective gamma value γeff
    ITU-R BT.709    1/2.222 ≈ 0.450          0.018       4.50    0.099   1/1.956 ≈ 0.511
    sRGB            1/2.400 ≈ 0.417          0.0031308   12.92   0.055   1/2.200 ≈ 0.455



shows the ordinary gamma correction fγ(a) for the same gamma values (dashed lines), whose slope at the origin is ∞ (Fig. 3.22 (a)) and zero (Fig. 3.22 (b)), respectively.

Gamma Correction in Common Imaging Standards

The modified gamma correction is part of several modern imaging standards. In practice, however, the values of a0 are considerably smaller than the ones used for the illustrative examples in Fig. 3.22, and γ is chosen to obtain a good overall match to the desired correction function. For example, the ITU-BT.709 specification [144] mentioned in Sec. 3.7.3 specifies the parameters

    γ = 1/2.222 ≈ 0.45   and   a0 = 0.018,    (3.36)

with the corresponding slope and offset values s = 4.50681 and d = 0.0991499, respectively (Eqn. (3.35)). The resulting correction function f̄ITU(a) has a nominal gamma value of 0.45, which corresponds to the effective gamma value γeff = 1/1.956 ≈ 0.511. The gamma correction in the sRGB standard [253] is specified on the same basis (with different parameters; see Sec. 14.4).

Figure 3.23 shows the actual correction functions for the ITU and sRGB standards, respectively, each in comparison with the equivalent ordinary gamma correction. The ITU function (Fig. 3.23 (a)) with γ = 0.45 and a0 = 0.018 corresponds to an ordinary gamma correction with effective gamma value γeff = 0.511 (dashed line). The curves for sRGB (Fig. 3.23 (b)) differ only by the parameters γ and a0, as summarized in Table 3.1.

Inverse Modified Gamma Correction

To invert the modified gamma correction of the form b = f̄γ,a0(a) (Eqn. (3.34)), we need the inverse of the function f̄γ,a0(), which is again defined in two parts,






Fig. 3.23 Gamma correction functions specified by the ITU-R BT.709 (a) and sRGB (b) standards. The continuous plot shows the modified gamma correction with the nominal γ values and transition points a0: (a) γ ≈ 0.450, a0 = 0.018; (b) γ ≈ 0.417, a0 = 0.0031308.

    f̄γ,a0⁻¹(b) = b/s                        for 0 ≤ b ≤ s·a0,
                 ((b + d)/(1 + d))^(1/γ)     for s·a0 < b ≤ 1.        (3.37)

s and d are the quantities defined in Eqn. (3.35) and thus

    a = f̄γ,a0⁻¹(f̄γ,a0(a))   for a ∈ [0, 1],    (3.38)



with the same value γ being used in both functions. The inverse gamma correction function is required in particular for transforming between different color spaces if nonlinear (i.e., gamma-corrected) component values are involved (see also Sec. 14.2).
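The modified gamma correction and its inverse translate into a few lines of Java; the following is our own sketch (method names are not from the book), operating on normalized values a, b ∈ [0, 1]:

    // modified gamma correction for a in [0, 1]  (Eqns. (3.34)-(3.35))
    static double fgcMod(double a, double gamma, double a0) {
      double s = gamma / (a0 * (gamma - 1) + Math.pow(a0, 1 - gamma));   // Eqn. (3.35)
      double d = 1.0 / (Math.pow(a0, gamma) * (gamma - 1) + 1) - 1;      // Eqn. (3.35)
      if (a <= a0)
        return s * a;                               // linear segment
      else
        return (1 + d) * Math.pow(a, gamma) - d;    // power segment
    }

    // inverse mapping (Eqn. (3.37))
    static double fgcModInv(double b, double gamma, double a0) {
      double s = gamma / (a0 * (gamma - 1) + Math.pow(a0, 1 - gamma));
      double d = 1.0 / (Math.pow(a0, gamma) * (gamma - 1) + 1) - 1;
      if (b <= s * a0)
        return b / s;
      else
        return Math.pow((b + d) / (1 + d), 1 / gamma);
    }

With γ = 1/2.222 and a0 = 0.018 this reproduces the ITU-R BT.709 values s ≈ 4.5068 and d ≈ 0.0991 listed in Table 3.1.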



3.8 Point Operations in ImageJ

Several important types of point operations are already implemented in ImageJ, so there is no need to program every operation manually (as shown in Prog. 3.4). In particular, it is possible in ImageJ to apply point operations efficiently by using tabulated functions, to use built-in standard functions for point operations on single images, and to apply arithmetic operations on pairs of images. These issues are described briefly in the remaining parts of this section.

3.8.1 Point Operations with Lookup Tables

Some point operations require complex computations for each pixel, and the processing of large images may be quite time-consuming. If the point operation is homogeneous (i.e., independent of the pixel coordinates), the value of the mapping can be precomputed for every possible pixel value and stored in a lookup table, which may then be applied very efficiently to the image. A lookup table F represents a discrete mapping (function f) from the original to the new pixel values,

    F : [0, K−1] → [0, K−1].    (3.39)






For a point operation specified by a particular pixel mapping a′ = f(a), the table F is initialized with the values

    F(a) ← f(a),   for 0 ≤ a < K.    (3.40)

Table 3.2 ImageJ methods for arithmetic point operations applicable to objects of type ImageProcessor.

    abs()                 I(u, v) ← |I(u, v)|
    add(int p)            I(u, v) ← I(u, v) + p
    gamma(double g)       I(u, v) ← (I(u, v)/255)^g · 255
    invert()              I(u, v) ← 255 − I(u, v)
    log()                 I(u, v) ← log10(I(u, v))
    max(double s)         I(u, v) ← max(I(u, v), s)
    min(double s)         I(u, v) ← min(I(u, v), s)
    multiply(double s)    I(u, v) ← round(I(u, v) · s)
    sqr()                 I(u, v) ← I(u, v)²
    sqrt()                I(u, v) ← √I(u, v)



Thus the K table elements of F need only be computed once, where typically K = 256. Performing the actual point operation only requires a simple (and quick) table lookup in F at each pixel, that is,

    I′(u, v) ← F(I(u, v)),    (3.41)

which is much more efficient than any individual function call. ImageJ provides the method

    void applyTable(int[] F)

for objects of type ImageProcessor, which requires a lookup table F as a 1D int array of size K (see Prog. 3.4 for an example). The advantage of this approach is obvious: for an 8-bit image, for example, the mapping is evaluated only 256 times (independent of the image size) and not a million times or more as in the case of a large image. The use of lookup tables for implementing point operations thus always makes sense if the number of image pixels (M × N) is greater than the number of possible pixel values K (which is usually the case).

3.8.2 Arithmetic Operations

ImageJ implements a set of common arithmetic operations as methods for the class ImageProcessor, which are summarized in Table 3.2. In the following example, the image is multiplied by a scalar constant (1.5) to increase its contrast:

    ImageProcessor ip = ...   // some image
    ip.multiply(1.5);



The image ip is destructively modified by all of these methods, with the results being limited (clamped) to the minimum and maximum pixel values, respectively.

3.8.3 Point Operations Involving Multiple Images

Point operations may involve more than one image at once, with arithmetic operations on the pixels of pairs of images being a special






but important case. For example, we can express the pointwise addition of two images I1 and I2 (of identical size) to create a new image I′ as

    I′(u, v) ← I1(u, v) + I2(u, v)    (3.42)

for all positions (u, v). In general, any function f(a1, a2, ..., an) over n pixel values ai may be defined to perform pointwise combinations of n images, that is,

    I′(u, v) ← f(I1(u, v), I2(u, v), ..., In(u, v)).    (3.43)

Of course, most arithmetic operations on multiple images can also be implemented as successive binary operations on pairs of images.

3.8.4 Methods for Point Operations on Two Images

ImageJ supplies a single method for implementing arithmetic operations on pairs of images,

    copyBits(ImageProcessor ip2, int u, int v, int mode),

which applies the binary operation specified by the transfer mode parameter mode to all pixel pairs taken from the source image ip2 and the target image (the image on which this method is invoked) and stores the result in the target image. u, v are the coordinates where the source image is inserted into the target image (most often u = v = 0). The following code segment demonstrates the addition of two images:

    ImageProcessor ip1 = ...   // target image (I1)
    ImageProcessor ip2 = ...   // source image (I2)
    ...
    ip1.copyBits(ip2, 0, 0, Blitter.ADD);   // I1 ← I1 + I2
    // ip1 holds the result, ip2 is unchanged
    ...



In this operation, the target image ip1 is destructively modified, while the source image ip2 remains unchanged. The constant ADD is one of several arithmetic transfer modes defined by the Blitter interface (see Table 3.3). In addition, Blitter defines (bitwise) logical operations, such as OR and AND. For arithmetic operations, the copyBits() method limits the results to the admissible range of pixel values (of the target image). Note that (except for target images of type FloatProcessor) the results are not rounded but truncated to integer values.

3.8.5 ImageJ Plugins Involving Multiple Images

ImageJ provides two types of plugin: a generic plugin (PlugIn), which can be run without any open image, and plugins of type PlugInFilter, which apply to a single image. In the latter case, the currently active image is passed as an object of type ImageProcessor (or any of its subclasses) to the plugin's run() method. If two or more images I1, I2, ..., Ik are to be combined by a plugin program, only a single image I1 can be passed directly to the plugin's run() method, but not the additional images I2, ..., Ik.






See also Sec. I.1 of the Appendix.



Table 3.3 Blitter transfer mode constants for method ImageProcessor.copyBits(). For example, ip1.copyBits(ip2, 0, 0, Blitter.ADD) modifies ip1 by adding the pixel values from ip2.

    ADD          I1(u, v) ← I1(u, v) + I2(u, v)
    AVERAGE      I1(u, v) ← (I1(u, v) + I2(u, v)) / 2
    COPY         I1(u, v) ← I2(u, v)
    DIFFERENCE   I1(u, v) ← |I1(u, v) − I2(u, v)|
    DIVIDE       I1(u, v) ← I1(u, v) / I2(u, v)
    MAX          I1(u, v) ← max(I1(u, v), I2(u, v))
    MIN          I1(u, v) ← min(I1(u, v), I2(u, v))
    MULTIPLY     I1(u, v) ← I1(u, v) · I2(u, v)
    SUBTRACT     I1(u, v) ← I1(u, v) − I2(u, v)



The usual solution is to make the plugin open a dialog window to let the user select the remaining images interactively. This is demonstrated in the following example plugin for transparently blending two images.

Example: Linear Blending

Linear blending is a simple method for continuously mixing two images, IBG and IFG. The background image IBG is covered by the foreground image IFG, whose transparency is controlled by the value α in the form

    I′(u, v) = α · IBG(u, v) + (1−α) · IFG(u, v),    (3.44)



with 0 ≤ α ≤ 1. For α = 0, the foreground image IFG is nontransparent (opaque) and thus entirely hides the background image IBG . Conversely, the image IFG is fully transparent for α = 1 and only IBG is visible. All α values between 0 and 1 result in a weighted sum of the corresponding pixel values taken from IBG and IFG (Eqn. (3.44)). Figure 3.24 shows the results of linear blending for different α values. The Java code for the corresponding implementation (as an ImageJ plugin) is listed in Prog. 3.5. The background image (ipBG) is passed directly to the plugin’s run() method. The second (foreground) image and the α value are specified interactively by creating an instance of the ImageJ class GenericDialog, which allows the simple implementation of dialog windows with various types of input fields.



3.9 Exercises

Exercise 3.1. Implement the auto-contrast operation as defined in Eqns. (3.9)–(3.11) as an ImageJ plugin for an 8-bit grayscale image. Set the quantile p of pixels to be saturated at both ends of the intensity range (0 and 255) to p = plo = phi = 1%.

Exercise 3.2. Modify the histogram equalization plugin in Prog. 3.2 to use a lookup table (Sec. 3.8.1) for computing the point operation.

Exercise 3.3. Implement the histogram equalization as defined in Eqn. (3.12), but use the modified cumulative histogram defined in






Prog. 3.5 Linear blending (ImageJ plugin). One image is transparently blended with another (user-selected) image. The plugin is applied to the (currently active) background image, and the foreground image must also be open when the plugin is started. The background image (ipBG), which is passed to the plugin's run() method, is multiplied with α (line 24). The foreground image (ipFG, selected via the dialog) is first duplicated (line 22) and then multiplied with (1−α) (line 23). Thus the original foreground image is not modified. The final result is obtained by adding the two weighted images (line 25). To select the foreground image, a list of currently open images and image titles is obtained (lines 31–33). Then a dialog object (of type GenericDialog) is created and opened for specifying the foreground image (imFG) and the α value (lines 37–47). See Fig. 3.24 for an example.



     1  import ij.ImagePlus;
     2  import ij.gui.GenericDialog;
     3  import ij.plugin.filter.PlugInFilter;
     4  import ij.process.Blitter;
     5  import ij.process.ImageProcessor;
     6  import imagingbook.lib.ij.IjUtils;
     7
     8  public class Linear_Blending implements PlugInFilter {
     9    static double alpha = 0.5;   // transparency of foreground image
    10    ImagePlus imFG = null;       // foreground image (to be selected)
    11
    12    public int setup(String arg, ImagePlus im) {
    13      return DOES_8G;
    14    }
    15
    16    public void run(ImageProcessor ipBG) {   // ipBG = IBG
    17      if (!runDialog()) {
    18        return;
    19      }
    20      ImageProcessor ipFG =                  // ipFG = IFG
    21          imFG.getProcessor().convertToByte(false);
    22      ipFG = ipFG.duplicate();
    23      ipFG.multiply(1 - alpha);              // IFG ← IFG · (1 − α)
    24      ipBG.multiply(alpha);                  // IBG ← IBG · α
    25      ipBG.copyBits(ipFG, 0, 0, Blitter.ADD);   // IBG ← IBG + IFG
    26    }
    27
    28    boolean runDialog() {
    29      // get list of open images and their titles:
    30      ImagePlus[] images = IjUtils.getOpenImages(true);
    31      String[] titles = new String[images.length];
    32      for (int i = 0; i < images.length; i++) {
    33        titles[i] = images[i].getShortTitle();
    34      }
    35
    36      // create the dialog and show:
    37      GenericDialog gd = new GenericDialog("Linear Blending");
    38      gd.addChoice("Foreground image:", titles, titles[0]);
    39      gd.addNumericField("Alpha value [0..1]:", alpha, 2);
    40      gd.showDialog();
    41
    42      if (gd.wasCanceled()) {
    43        return false;
    44      }
    45      else {
    46        imFG = images[gd.getNextChoiceIndex()];
    47        alpha = gd.getNextNumber();
    48        return true;
    49      }
    50    }
    51  }



Eqn. (3.13), cumulating the square root of the histogram entries. Compare the results to the standard (linear) approach by plotting the resulting histograms and cumulative histograms as shown in Fig. 3.10.






Exercise 3.4. Show formally that (a) a linear histogram equalization (Eqn. (3.12)) does not change an image that already has a uniform intensity distribution and (b) that any repeated application of histogram equalization to the same image causes no more changes.



Fig. 3.24 Linear blending example. Foreground image IFG (a) and background image IBG (e); blended images for transparency values α = 0.25, 0.50, and 0.75 (b–d) and dialog window (f) produced by GenericDialog (see Prog. 3.5).






Exercise 3.5. Show that the linear histogram equalization (Sec. 3.5) is only a special case of histogram specification (Sec. 3.6).

Exercise 3.6. Implement the histogram specification using a piecewise linear reference distribution function, as described in Sec. 3.6.3. Define a new object class with all necessary instance variables to represent the distribution function and implement the required function PL(i) (Eqn. (3.23)) and its inverse PL⁻¹(b) (Eqn. (3.24)) as methods of this class.

Exercise 3.7. To adjust multiple images by histogram specification (see Sec. 3.6.4), one could either use one typical image as the reference









(as in Fig. 3.17) or compute an “average” reference histogram from a set of images. Implement the second approach and discuss its possible advantages (or disadvantages).

Exercise 3.8. Implement the modified gamma correction (see Eqn. (3.34)) as an ImageJ plugin with variable values for γ and a0, using a lookup table as shown in Prog. 3.4.

Exercise 3.9. Show that the modified gamma correction function f̄γ,a0(a), with the parameters defined in Eqns. (3.34)–(3.35), is C1-continuous (i.e., both the function itself and its first derivative are continuous).

Exercise 3.10 (Fading image borders). The task is to apply an arithmetic operation on the pixel values, such that the image is smoothly “attenuated” (i.e., faded) to black (zero) toward its borders, as shown in Fig. 3.26. The new image I′ shall be obtained from the given image I in the form

I′(u, v) ← I(u, v) · w(u) · w(v),    (3.45)



where w(x) is a continuous, one-dimensional function, applied in horizontal and vertical direction. As shown in Fig. 3.25, w(x) should be composed of three segments:

w(x) = { “cosine-shaped”   for 0 ≤ x < D,
         1                 for D ≤ x < W−D,    (3.46)
         “cosine-shaped”   for W−D ≤ x ≤ W.

Here, W refers to the image size (i.e., width or height, respectively). Length D should be set to a suitable percentage of W (e.g., 10 %).

Fig. 3.25 Window function w(x) for fading image borders, composed of three segments. The left and right segments (of length D) are truncated cosine functions (see Eqn. (3.47)). The center segment should be constant with value 1.



• Develop the complete expression for the function w(x) in Eqn. (3.46) based on a truncated cosine function. For example, for D = 1, the left segment of the function in Fig. 3.25 can be written in the form

  w(x) = 0.5 · (1 − cos(π·x)).    (3.47)

  (See the sketch after this exercise for one possible form of w(x).)



• Write a program (plugin) that attenuates grayscale images as in Eqn. (3.45) (see Fig. 3.26 (a–c)).
• Test your program on suitable images of your choice and validate the results. Check if the pixels along the boundary really contain zero values.









Fig. 3.26 Border fade example (Exercise 3.10). Original image (a), fade to black (b), fade to white (c), with D = 10% of the image size. White test image of size 400 × 400 after fading to black with D = 25% (d) and its horizontal profile (e).






• In ImageJ, create a white test image (with File ▷ New ▷ Image...), apply your plugin and plot the horizontal, vertical and diagonal line profiles using Analyze ▷ Plot Profile (see Fig. 3.26 (d, e)).
• What changes are required if we want to have white borders instead, as shown in Fig. 3.26 (c)? Implement that version too.
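One possible window function w(x) for Exercise 3.10 could be sketched as below; this is only an illustrative sketch (the helper name fadeWeight and the mirrored right-hand segment are assumptions, not part of the exercise text), with the cosine segments of Eqn. (3.47) scaled to the fade length D:

// Sketch of w(x) per Eqn. (3.46); W is the image width or height,
// D the fade length (e.g., D = 0.1 * W). The right segment is mirrored
// so that the weight is exactly 0 at the last pixel (x = W-1).
static double fadeWeight(int x, int W, int D) {
  if (x < D) {                 // left cosine-shaped segment
    return 0.5 * (1 - Math.cos(Math.PI * x / D));
  }
  else if (x < W - D) {        // constant center segment
    return 1.0;
  }
  else {                       // right cosine-shaped segment
    return 0.5 * (1 - Math.cos(Math.PI * (W - 1 - x) / D));
  }
}

Applied as in Eqn. (3.45), the faded image would be I′(u, v) = I(u, v) · fadeWeight(u, M, D) · fadeWeight(v, N, D).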






Part II



Filters, Edges and Corners



4 Filters



The essential property of point operations (discussed in the previous chapter) is that each new pixel value only depends on the original pixel at the same position. The capabilities of point operations are limited, however. For example, they cannot accomplish the task of sharpening or smoothing an image (Fig. 4.1). This is what filters can do. They are similar to point operations in the sense that they also produce a 1:1 mapping of the image coordinates, that is, the geometry of the image does not change.

Fig. 4.1 No point operation can blur or sharpen an image. This is an example of what filters can do. Like point operations, filters do not modify the geometry of an image.



4.1 What is a Filter?

The main difference between filters and point operations is that filters generally use more than one pixel from the source image for computing each new pixel value. Let us first take a closer look at the task of smoothing an image. Images look sharp primarily at places where the local intensity rises or drops sharply (i.e., where the difference between neighboring pixels is large). On the other hand, we perceive an image as blurred or fuzzy where the local intensity function is smooth. A first idea for smoothing an image could thus be to simply replace every pixel by the average of its neighboring pixels. To determine the new pixel value in the smoothed image I′(u, v), we use the









original pixel I(u, v) = p0 at the same position plus its eight neighboring pixels p1, p2, . . . , p8 to compute the arithmetic mean of these nine values,

I′(u, v) ← (p0 + p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8) / 9.    (4.1)

Expressed in relative image coordinates this is

I′(u, v) ← (1/9) · [ I(u−1, v−1) + I(u, v−1) + I(u+1, v−1)
                   + I(u−1, v)   + I(u, v)   + I(u+1, v)
                   + I(u−1, v+1) + I(u, v+1) + I(u+1, v+1) ],    (4.2)

which we can write more compactly in the form

I′(u, v) ← (1/9) · Σ_{j=−1}^{1} Σ_{i=−1}^{1} I(u+i, v+j).    (4.3)



This simple local averaging already exhibits all the important elements of a typical filter. In particular, it is a so-called linear filter, which is a very important class of filters. But how are filters defined in general? First they differ from point operations mainly by using not a single source pixel but a set of them for computing each resulting pixel. The coordinates of the source pixels are fixed relative to the current image position (u, v) and usually form a contiguous region, as illustrated in Fig. 4.2.

Fig. 4.2 Principal filter operation. Each new pixel value I′(u, v) is calculated as a function of the pixel values within a specified region of source pixels Ru,v in the original image I.






The size of the filter region is an important parameter of the filter because it specifies how many original pixels contribute to each resulting pixel value and thus determines the spatial extent (support) of the filter. For example, the smoothing filter in Eqn. (4.2) uses a 3 × 3 region of support that is centered at the current coordinate (u, v). Similar filters with larger support, such as 5 × 5, 7 × 7, or even 21 × 21 pixels, would obviously have stronger smoothing effects. The shape of the filter region is not necessarily quadratic or even rectangular. In fact, a circular (disk-shaped) region would be preferred to obtain an isotropic blur effect (i.e., one that is the same in all image directions). Another option is to assign different weights to the pixels in the support region, such as to give stronger emphasis to pixels that are closer to the center of the region. Furthermore, the support region of a filter does not need to be contiguous and may



not even contain the original pixel itself (imagine a ring-shaped filter region, for example). Theoretically the filter region could even be of infinite size. It is probably confusing to have so many options—a more systematic method is needed for specifying and applying filters in a targeted manner. The traditional and proven classification into linear and nonlinear filters is based on the mathematical properties of the filter function; that is, whether the result is computed from the source pixels by a linear or a nonlinear expression. In the following, we discuss both classes of filters and show several practical examples.



4.2 Linear Filters

Linear filters are denoted that way because they combine the pixel values in the support region in a linear fashion, that is, as a weighted summation. The local averaging process discussed in the beginning (Eqn. (4.3)) is a special example, where all nine pixels in the 3 × 3 support region are added with identical weights (1/9). With the same mechanism, a multitude of filters with different properties can be defined by simply modifying the distribution of the individual weights.

4.2.1 The Filter Kernel

For any linear filter, the size and shape of the support region, as well as the individual pixel weights, are specified by the “filter kernel” or “filter matrix” H(i, j). The size of the kernel H equals the size of the filter region, and every element H(i, j) specifies the weight of the corresponding pixel in the summation. For the 3 × 3 smoothing filter in Eqn. (4.3), the filter kernel is

      | 1/9 1/9 1/9 |         | 1 1 1 |
  H = | 1/9 1/9 1/9 | = 1/9 · | 1 1 1 |,    (4.4)
      | 1/9 1/9 1/9 |         | 1 1 1 |

because each of the nine pixels contributes one-ninth of its value to the result. In principle, the filter kernel H(i, j) is, just like the image itself, a discrete, 2D, real-valued function, H : Z × Z → R. The filter has its own coordinate system with the origin—often referred to as the “hot spot”—mostly (but not necessarily) located at the center. Thus, filter coordinates are generally positive and negative (Fig. 4.3). The filter function is of infinite extent and considered zero outside the region defined by the matrix H.

4.2.2 Applying the Filter

For a linear filter, the result is unambiguously and completely specified by the coefficients of the filter kernel. Applying the filter to an image is a simple process that is illustrated in Fig. 4.4. The following steps are performed at each image position (u, v):

1. The filter kernel H is moved over the original image I such that its origin H(0, 0) coincides with the current image position (u, v).






Fig. 4.3 Filter matrix H and its coordinate system, with (0, 0) marking the “hot spot”; i is the horizontal (column) index, j is the vertical (row) index.



Fig. 4.4 Linear filter operation. The filter kernel H is placed with its origin at position (u, v) on the image I. Each filter coefficient H(i, j) is multiplied with the corresponding image pixel I(u+i, v+j), the results are added, and the final sum is inserted as the new pixel value I ′ (u, v).






2. All filter coefficients H(i, j) are multiplied with the corresponding image element I(u+i, v+j), and the results are added up.
3. Finally, the resulting sum is stored at the current position in the new image I′(u, v).

Described formally, the pixel values of the new image I′(u, v) are calculated as

I′(u, v) = Σ_{(i,j)∈H} I(u+i, v+j) · H(i, j),    (4.5)

where H is the set of coordinates covered by the filter H. For example, for a typical 3 × 3 filter with centered origin, this is

I′(u, v) = Σ_{i=−1}^{1} Σ_{j=−1}^{1} I(u+i, v+j) · H(i, j),    (4.6)

for all image coordinates (u, v). Not quite for all coordinates, to be exact. There is an obvious problem at the image borders where the filter reaches outside the image and finds no corresponding pixel values to use in computing a result. For the moment, we ignore this border problem, but we will attend to it again in Sec. 4.5.2.

4.2.3 Implementing Filter Operations
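Before turning to the concrete ImageJ plugins below, Eqn. (4.5) can be transcribed almost literally into plain Java. The following is only a sketch under simple assumptions (array-based images, odd-sized square kernel, borders simply left unchanged), not a listing from the book:

// Sketch: apply a linear filter with kernel H (size (2K+1) x (2K+1))
// to a grayscale image I of size M x N, following Eqn. (4.5).
// Border pixels are left unchanged in this sketch.
static float[][] applyLinearFilter(float[][] I, float[][] H) {
  int M = I.length, N = I[0].length;
  int K = H.length / 2;                  // kernel "radius"
  float[][] J = new float[M][N];
  for (int u = 0; u < M; u++)            // start from a copy of I
    J[u] = I[u].clone();
  for (int u = K; u < M - K; u++) {
    for (int v = K; v < N - K; v++) {
      float sum = 0;
      for (int i = -K; i <= K; i++) {
        for (int j = -K; j <= K; j++) {
          sum += I[u + i][v + j] * H[i + K][j + K];
        }
      }
      J[u][v] = sum;                     // new pixel value I'(u, v)
    }
  }
  return J;
}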






Now that we understand the principal operation of a filter (Fig. 4.4) and know that the borders need special attention, we go ahead and program a simple linear filter in ImageJ. But before we do this, we



Fig. 4.5 Practical implementation of in-place filter operations. Version A: The result of the filter operation is first stored in an intermediate image and subsequently copied back to the original image (a). Version B: The original image is first copied to an intermediate image that serves as the source for the filter operation. The results are placed in the original image (b).



may want to consider one more detail. In a point operation (e.g., in Progs. 3.1 and 3.2), each new pixel value depends only on the corresponding pixel value in the original image, and it was thus no problem simply to store the results back to the same image—the computation is done “in place” without the need for any intermediate storage. In-place computation is generally not possible for a filter, since any original pixel contributes to more than one resulting pixel and thus may not be modified before all operations are complete. We therefore require additional storage space for the resulting image, which subsequently could be copied back to the source image again (if desired). Thus the complete filter operation can be implemented in two different ways (see Fig. 4.5):

A. The result of the filter computation is initially stored in a new image whose content is eventually copied back to the original image.
B. The original image is first copied to an intermediate image that serves as the source for the actual filter operation. The result replaces the pixels in the original image.

The same amount of storage is required for both versions, and thus neither of them offers a particular advantage. In the following examples, we generally use version B.

4.2.4 Filter Plugin Examples

The following examples demonstrate the implementation of two very basic filters that are nevertheless often used in practice.

Simple 3 × 3 Averaging Filter (“Box” Filter)

Program 4.1 shows the ImageJ code for a simple 3 × 3 smoothing filter based on local averaging (Eqn. (4.4)), which is often called a “box” filter because of its box-like shape. No explicit filter kernel is required in this case, since all filter coefficients are identical (1/9). Also, no clamping (see Sec. 3.1.2) of the results is needed because the sum of the filter coefficients is 1 and thus no pixel values outside the admissible range can be created.






Prog. 4.1 3 × 3 averaging “box” filter (Filter_Box_3x3). First (in line 10) a duplicate (copy) of the original image is created, which is used as the source image in the subsequent filter computation (line 18). In line 23, the resulting value is placed in the original image. Notice that the border pixels remain unchanged because they are not reached by the iteration over (u, v).



 1  import ij.ImagePlus;
 2  import ij.plugin.filter.PlugInFilter;
 3  import ij.process.ImageProcessor;
 4
 5  public class Filter_Box_3x3 implements PlugInFilter {
 6    ...
 7    public void run(ImageProcessor ip) {
 8      int M = ip.getWidth();
 9      int N = ip.getHeight();
10      ImageProcessor copy = ip.duplicate();
11
12      for (int u = 1; u <= M - 2; u++) {
13        for (int v = 1; v <= N - 2; v++) {
14          // sum the 3x3 neighborhood taken from the copy:
15          int sum = 0;
16          for (int i = -1; i <= 1; i++) {
17            for (int j = -1; j <= 1; j++) {
18              int p = copy.getPixel(u + i, v + j);
19              sum = sum + p;
20            }
21          }
22          int q = (int) (sum / 9.0);
23          ip.putPixel(u, v, q);
24        }
25      }
26    }
27  }

Depth-first version (using a stack):

void floodFill(int[][] I, int u, int v, int label) {
  Deque<PntInt> S = new LinkedList<>();    // stack S
  S.push(PntInt.from(u, v));
  while (!S.isEmpty()) {
    PntInt p = S.pop();                    // get the next point to process
    int x = p.x;
    int y = p.y;
    if ((x >= 0) && (x < width) && (y >= 0) && (y < height)
        && I[x][y] == 1) {
      I[x][y] = label;
      S.push(PntInt.from(x+1, y));
      S.push(PntInt.from(x, y+1));
      S.push(PntInt.from(x, y-1));
      S.push(PntInt.from(x-1, y));
      if (NT == 8) {
        S.push(PntInt.from(x+1, y+1));
        S.push(PntInt.from(x-1, y+1));
        S.push(PntInt.from(x+1, y-1));
        S.push(PntInt.from(x-1, y-1));
      }
    }
  }
}



Prog. 8.1 Java implementation of iterative flood filling (depth-first and breadth-first variants). Array I represents the label image, initially containing only 0’s (background) and 1’s (foreground). Instances of class PntInt (defined by the imagingbook library) represent individual pixel coordinates. The depth-first version uses a stack (instance of class LinkedList), with access methods push(), pop(), and isEmpty(). The breadth-first version also uses LinkedList to represent a queue, with access methods addLast() and removeFirst() to implement the functions enqueue() and dequeue(), respectively (see Alg. 8.2).



Breadth-first version (using a queue):

void floodFill(int[][] I, int u, int v, int label) {
  Queue<PntInt> Q = new LinkedList<>();    // queue Q
  Q.addLast(PntInt.from(u, v));
  while (!Q.isEmpty()) {
    PntInt p = Q.removeFirst();            // get the next point to process
    int x = p.x;
    int y = p.y;
    if ((x >= 0) && (x < width) && (y >= 0) && (y < height)
        && I[x][y] == 1) {
      I[x][y] = label;
      Q.addLast(PntInt.from(x+1, y));
      Q.addLast(PntInt.from(x, y+1));
      Q.addLast(PntInt.from(x, y-1));
      Q.addLast(PntInt.from(x-1, y));
      if (NT == 8) {
        Q.addLast(PntInt.from(x+1, y+1));
        Q.addLast(PntInt.from(x-1, y+1));
        Q.addLast(PntInt.from(x+1, y-1));
        Q.addLast(PntInt.from(x-1, y-1));
      }
    }
  }
}
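In a complete labeling run, floodFill() is driven by an outer scan over all pixels. The following lines are only a sketch of such a driver (assuming the fields I, width, height of Prog. 8.1 and region labels starting at 2), not a listing from the book:

// Sketch: assign consecutive labels 2, 3, ... to all connected
// foreground regions (0 = background, 1 = still unlabeled foreground).
int label = 2;
for (int v = 0; v < height; v++) {
  for (int u = 0; u < width; u++) {
    if (I[u][v] == 1) {          // found a seed pixel of a new region
      floodFill(I, u, v, label);
      label = label + 1;
    }
  }
}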



as instances of type LinkedList.² Both containers are specified to hold instances of type PntInt³ only. Figure 8.2 illustrates the progress of the region marking in both variants within an example region, where the start point (i.e., seed point), which would normally lie on a contour edge, has been placed arbitrarily within the region in order to better illustrate the process.

² Class LinkedList is part of Java’s collections framework, implementing the Deque and Queue interfaces.
³ Class Pnt2d.PntInt is defined by the imagingbook library (as a general substitute for java.awt.Point and similar).






Fig. 8.2 Iterative flood filling: depth-first vs. breadth-first approach. The starting point, marked + in the top two images (a), was arbitrarily chosen. Intermediate results of the flood fill process after 1000 (a), 5000 (b), and 10,000 (c) marked pixels are shown. The image size is 250 × 242 pixels.









It is clearly visible that the depth-first method first explores one direction (in this case horizontally to the left) completely (that is, until it reaches the edge of the region) and only then examines the remaining directions. In contrast, the markings of the breadth-first method proceed outward, layer by layer, equally in all directions. Due to the way exploration takes place, the memory requirement of the breadth-first variant is generally much lower than that of the depth-first variant. For example, when flood



filling the region in Fig. 8.2 (using the implementation given in Prog. 8.1), the stack in the depth-first variant grows to a maximum of 28,822 elements, while the queue used by the breadth-first variant never exceeds a maximum of 438 nodes.

8.1.2 Sequential Region Segmentation

Sequential region marking is a classical, nonrecursive technique that is known in the literature as “region labeling”. The algorithm consists of two steps: (1) preliminary labeling of the image regions and (2) resolving situations where more than one label occurs (i.e., has been assigned in the previous step) within the same connected region. Even though this algorithm is relatively complex, especially its second stage, its moderate memory requirements make it a good choice under limited memory conditions. However, this is not a major issue on modern computers and thus, in terms of overall efficiency, sequential labeling offers no clear advantage over the simpler methods described earlier. The sequential technique is nevertheless interesting (not only from a historic perspective) and inspiring. The complete process is summarized in Alg. 8.3, with the following main steps.

Step 1: Initial Labeling

In the first stage of region labeling, the image is traversed from top left to bottom right sequentially to assign a preliminary label to every foreground pixel. Depending on the definition of neighborhood (either 4- or 8-connected) used, the following neighbor positions of each pixel must be examined (× marks the current pixel at the position p = (u, v)):

           n1                       n3 n2 n1
  N4 =  n2  ×  n0     or    N8 =  n4  ×  n0      (8.2)
           n3                       n5 n6 n7



When using the 4-connected neighborhood N4, only the two neighbors n1 and n2 (i.e., the pixels at positions (u, v−1) and (u−1, v)) need to be considered, but when using the 8-connected neighborhood N8, four neighbors n1, . . . , n4 must be examined. In the following examples (Figs. 8.3–8.5), we use an 8-connected neighborhood and a very simple test image (Fig. 8.3 (a)) to demonstrate the sequential region labeling process.

Propagating Region Labels

Again we assume that, in the image, the value I(p) = 0 represents background pixels and the value I(p) = 1 represents foreground pixels. We will also consider neighboring pixels that lie outside of the image matrix (e.g., on the array borders) to be part of the background. The neighborhood N(p) is slid over the image horizontally and then vertically, starting from the top left corner. When the current image element I(p) is a foreground pixel, it is either assigned a new region number or, in the case where one of its previously examined neighbors in N(p) was a foreground pixel, it takes on the region number of the neighbor. In this way, existing region numbers






Alg. 8.3 Sequential region segmentation. The binary input image I uses values I(p) = 0 for background pixels and I(p) = 1 for foreground (region) pixels. The input image is modified to hold the resulting integer labels ≥ 2.

1:  SequentialLabeling(I)
    Input: I, an integer-valued binary image with initial values 0 = background, 1 = foreground.
    Global: NT ∈ {4, 8} (neighborhood type).
    Returns nothing but modifies the image I.

    Step 1 – Assign initial labels:
2:    (M, N) ← size(I)
3:    label ← 2                         ▷ value of the next label to be assigned
4:    C ← {}                            ▷ empty set of label collisions
5:    for v ← 0, . . . , N−1 do         ▷ order of pixels is important!
6:      for u ← 0, . . . , M−1 do
7:        p ← (u, v)
8:        if I(p) = 1 then              ▷ I(p) is a foreground pixel
9:          N ← GetNeighborhood(I, p)   ▷ see Eqn. (8.2)
10:         a ← max(N)
11:         if a < 2 then               ▷ a is not an assigned label
12:           I(p) ← label              ▷ new region, new label
13:           label ← label + 1         ▷ advance label counter
14:         else                        ▷ a ≥ 2 is an assigned label
15:           I(p) ← a                  ▷ propagate label a
16:           for all b ∈ N | b ≥ 2 ∧ a ≠ b do
17:             c ← ⟨a, b⟩              ▷ collision of labels a, b
18:             C ← C ∪ {c}             ▷ register collision c
    Remark: The image I now contains labels 0, 2, . . . , label−1.

    Step 2 – Resolve label collisions:
19:   Create a partitioning of the label set (sequence of 1-element sets):
      R ← ({2}, {3}, {4}, . . . , {label−1})    ▷ R(k) ≡ {k}
20:   for all collisions ⟨a, b⟩ in C do
21:     A ← the set in R that contains a
22:     B ← the set in R that contains b
23:     if A ≠ B then                   ▷ a, b are in different sets
24:       A ← A ∪ B                     ▷ add elements of B to A
25:       B ← {}                        ▷ purge set B
    Remark: All equivalent labels (i.e., all labels of pixels in the same connected component) are now contained in the same subset of R and all subsets of R are disjoint.

    Step 3: Relabel the image:
26:   for all p ∈ M × N do
27:     if I(p) > 1 then                ▷ this is a labeled foreground pixel
28:       A ← the set in R that contains a = I(p)
          Choose a representative label in A, e.g., its minimum:
29:       I(p) ← min(A)                 ▷ replace the image label
30:   return

31: function GetNeighborhood(I, p)
    Returns a list of relevant neighborhood values for position p = (u, v) (see Eqn. (8.2)).
32:   if NT = 4 then
33:     N ← (I(u, v−1), I(u−1, v))
34:   else                              ▷ NT = 8
35:     N ← (I(u+1, v−1), I(u, v−1), I(u−1, v−1), I(u−1, v))
36:   return N



Fig. 8.3 Sequential region labeling—label propagation. Original image (a), with values 0 = background and 1 = foreground. The first foreground pixel (marked 1) is found in (b): all neighbors are background pixels (marked 0), and the pixel is assigned the first label (2). In the next step (c), there is exactly one neighbor pixel marked with the label 2, so this value is propagated. In (d) there are two neighboring pixels, and they have differing labels (2 and 5); one of these values is propagated, and the collision (2, 5) is registered. Panel labels: (a) original image, (b) background neighbors only (new label 2 is assigned), (c) exactly one neighbor label (neighbor label is propagated), (d) two different neighbor labels (one of the labels, 2, is propagated).



propagate in the image from the left to the right and from the top to the bottom (see Fig. 8.3 (b–c)).

Label Collisions

In the case where two or more neighbors have labels belonging to different regions, a label collision has occurred; that is, pixels within a single connected region have different labels. For example, in a U-shaped region, the pixels in the left and right arms are at first assigned different labels since it is not immediately apparent that they are actually part of a single region. The two labels will propagate down independently from each other until they eventually collide in the lower part of the “U” (see Fig. 8.3 (d)). When two labels a, b collide, then we know that they are actually “equivalent”; that is, they are contained in the same image region. These collisions are registered but otherwise not dealt with during









Fig. 8.4 Sequential region labeling— intermediate result after Step 1. Detected label collisions are marked by circles in (a). The nodes of the undirected graph (b) correspond to the pixel labels L = {2, 3, 4, 5, 6, 7}, and its edges correspond to the set of collisions C = {⟨2, 4⟩, ⟨2, 5⟩, ⟨2, 6⟩}.



the first step. Once all collisions have been registered, they are then resolved in the second step of the algorithm. The number of collisions depends on the content of the image. There can be only a few or very many collisions, and the exact number is only known at the end of the first step, once the whole image has been traversed. For this reason, collision management should make use of dynamic data structures, such as lists or hash tables. Upon the completion of the first step, all the original foreground pixels have been provisionally marked, and all the collisions between labels within the same regions have been registered for subsequent processing. The example in Fig. 8.4 illustrates the state upon completion of step 1: all foreground pixels have been assigned preliminary labels (see Fig. 8.4 (a)), and the following collisions (depicted by circles) between label pairs ⟨2, 4⟩, ⟨2, 5⟩, and ⟨2, 6⟩ have been registered. The labels L = {2, 3, 4, 5, 6, 7} and collisions C = {⟨2, 4⟩, ⟨2, 5⟩, ⟨2, 6⟩} correspond to the nodes and edges of an undirected graph (see Fig. 8.4 (b)).






Step 2: Resolving Label Collisions

The task in the second step is to resolve the label collisions that arose in the first step in order to merge the corresponding “partial” regions. This process is nontrivial since it is possible for two regions with different labels to be connected transitively. For example, a pair of equivalences ⟨a, b⟩, ⟨b, c⟩ implies that the labels a, c are equivalent as well, i.e., ⟨a, b⟩, ⟨b, c⟩ ⇒ ⟨a, c⟩. Such connections may exist via a single intermediate region (b) or, more generally, through a (possibly long) sequence of regions. In fact, this problem is identical to the problem of finding the connected components of a graph [66], where the labels L determined in step 1 constitute the “nodes” of the graph and the registered collisions C make up its “edges” (Fig. 8.4 (b)). Once all the distinct labels within a single region have been collected, the labels of all the pixels in the region are updated so they carry the same label (e.g., choosing the smallest label in the multi-label region), as depicted in Fig. 8.5. Figure 8.6 shows the complete segmentation with some region statistics that can be easily calculated from the labeling data.
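Step 2 amounts to merging equivalence classes of labels. A compact way to do this, equivalent in effect to the set-partitioning formulation of Alg. 8.3 but not taken from the book, is a simple union-find structure over the label values; the sketch below assumes labels 2 … maxLabel−1, collisions stored as int[]{a, b} pairs (java.util.List), and the label image I indexed as I[u][v]:

// Sketch: resolve label collisions (step 2) and relabel the image
// (step 3) with a minimal union-find; the smaller label always wins,
// so each region ends up with its smallest label, as in Alg. 8.3.
static void resolveCollisions(int[][] I, List<int[]> collisions, int maxLabel) {
  int[] parent = new int[maxLabel];
  for (int k = 2; k < maxLabel; k++) {
    parent[k] = k;                       // every label starts as its own root
  }
  for (int[] c : collisions) {           // merge the two equivalence classes
    int ra = find(parent, c[0]);
    int rb = find(parent, c[1]);
    if (ra != rb) {
      parent[Math.max(ra, rb)] = Math.min(ra, rb);
    }
  }
  for (int u = 0; u < I.length; u++) {   // relabel all foreground pixels
    for (int v = 0; v < I[u].length; v++) {
      if (I[u][v] > 1) {
        I[u][v] = find(parent, I[u][v]);
      }
    }
  }
}

static int find(int[] parent, int k) {   // follow parent links to the root
  while (parent[k] != k) {
    k = parent[k];
  }
  return k;
}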






Fig. 8.5 Sequential region labeling—final result after step 2. All equivalent labels have been replaced by the smallest label within that region.



Fig. 8.6 Example of a complete region labeling. The pixels within each region have been colored according to the consecutive label values 2, 3, . . . , 10 they were assigned. The corresponding region statistics are shown in the table (total image size is 1212 × 836).



Label   Area (pixels)   Bounding Box (left, top, right, bottom)   Centroid (xc, yc)
  2        14978         ( 887,  21, 1144, 399)                   (1049.7, 242.8)
  3        36156         (  40,  37,  438, 419)                   ( 261.9, 209.5)
  4        25904         ( 464, 126,  841, 382)                   ( 680.6, 240.6)
  5         2024         ( 387, 281,  442, 341)                   ( 414.2, 310.6)
  6         2293         ( 244, 367,  342, 506)                   ( 294.4, 439.0)
  7         4394         ( 406, 400,  507, 512)                   ( 454.1, 457.3)
  8        29777         ( 510, 416,  883, 765)                   ( 704.9, 583.9)
  9        20724         ( 833, 497, 1168, 759)                   (1016.0, 624.1)
 10        16566         (  82, 558,  411, 821)                   ( 208.7, 661.6)



8.1.3 Region Labeling – Summary

In this section, we have described a selection of algorithms for finding and labeling connected regions in images. We discovered that the elegant idea of labeling individual regions using a simple recursive flood-filling method (Sec. 8.1.1) was not useful because of practical limitations on the depth of recursion and the high memory costs associated with it. We also saw that classical sequential region labeling (Sec. 8.1.2) is relatively complex and offers no real advantage over iterative implementations of the depth-first and breadth-first methods. In practice, the iterative breadth-first method is generally the best choice for large and complex images. In the following section we present a modern and efficient algorithm that performs region labeling and also delineates the regions’ contours. Since contours are required in many applications, this combined approach is highly practical.






8.2 Region Contours

Once the regions in a binary image have been found, the next step is often to find the contours (that is, the outlines) of the regions. Like so many other tasks in image processing, at first glance this appears to be an easy one: simply follow along the edge of the region. We will see that, in actuality, describing this apparently simple process algorithmically requires careful thought, which has made contour finding one of the classic problems in image analysis.

8.2.1 Outer and Inner Contours

As we discussed in Sec. 7.2.7, the pixels along the edge of a binary region (i.e., its border) can be identified using simple morphological operations and difference images. It must be stressed, however, that this process only marks the pixels along the contour, which is useful, for instance, for display purposes. In this section, we will go one step further and develop an algorithm for obtaining an ordered sequence of border pixel coordinates for describing a region’s contour. Note that connected image regions contain exactly one outer contour, yet, due to holes, they can contain arbitrarily many inner contours. Within such holes, smaller regions may be found, which will again have their own outer contours, and in turn these regions may themselves contain further holes with even smaller regions, and so on in a recursive manner (see Fig. 8.7). An additional complication arises when regions are connected by parts that taper down to the width of a single pixel. In such cases, the contour can run through the same pixel more than once and from different directions (see Fig. 8.8). Therefore, when tracing a contour from a start point xs, returning to the start point is not a sufficient condition for terminating the contour-tracing process. Other factors, such as the current direction along which contour points are being traversed, must be taken into account.



Fig. 8.7 Binary image with outer and inner contours. The outer contour lies along the outside of the foreground region (dark). The inner contour surrounds the space within the region, which may contain further regions (holes), and so on.






One apparently simple way of determining a contour is to proceed in analogy to the two-stage process presented in Sec. 8.1, that is, to first identify the connected regions in the image and second, for each region, proceed around it, starting from a pixel selected from its border. In the same way, an internal contour can be found by starting at a border pixel of a region’s hole. A wide range of algorithms based on first finding the regions and then following along their contours have been published, including [226], [206, pp. 142–148], and [242,



Fig. 8.8 The path along a contour as an ordered sequence of pixel coordinates with a given start point xs. Individual pixels may occur (be visited) more than once within the path, and a region consisting of a single isolated pixel will also have a contour (bottom right).



p. 296]. As a modern alternative, we outline the following combined algorithm which, in contrast to the aforementioned classical methods, combines contour finding and region labeling in a single process.

8.2.2 Combining Region Labeling and Contour Detection

This method, based on [56], combines the concepts of sequential region labeling (Sec. 8.1) and traditional contour tracing into a single algorithm able to perform both tasks simultaneously during a single pass through the image. It identifies and labels regions and at the same time traces both their inner and outer contours. The algorithm does not require any complicated data structures and is relatively efficient when compared to other methods with similar capabilities. The key steps of this method are described here and illustrated in Fig. 8.9:

1. As in the sequential region labeling (Alg. 8.3), the binary image I is traversed from the top left to the bottom right. Such a traversal ensures that all pixels in the image are eventually examined and assigned an appropriate label.

2. At a given position in the image, the following cases may occur:

Case A: The transition from a background pixel to a previously unmarked foreground pixel means that this pixel lies on the outer edge of a new region. A new label is assigned and the associated outer contour is traversed and marked by calling the method TraceContour() (see Alg. 8.4 and Fig. 8.9 (a)). Furthermore, all background pixels directly bordering the region are marked with the special label −1.

Case B: The transition from a foreground pixel B to an unmarked background pixel means that this pixel lies on an inner contour (Fig. 8.9 (b)). Starting from B, the inner contour is traversed and its pixels are marked with labels from the surrounding region (Fig. 8.9 (c)). Also, all bordering background pixels are again assigned the special label value −1.









Fig. 8.9 Combined region labeling and contour following (after [56]). The image in (a) is traversed from the top left to the lower right, one row at a time. In (b), the first foreground pixel A on the outer edge of the region is found. Starting from point A, the pixels on the edge along the outer contour are visited and labeled until A is reached again (c). Labels picked up at the outer contour are propagated along the image line inside the region (d). In (e), B was found as the first point on the inner contour. Now the inner contour is traversed in clockwise direction, marking the contour pixels until point B is reached again (f). The same tracing process is used as in step (c), with the inside of the region always lying to the right of the contour path. In (g) a previously marked point C on an inner contour is detected. Its label is again propagated along the image line inside the region. The final result is shown in (h).









Alg. 8.4 Combined contour tracing and region labeling (Part 1). Given a binary image I, the application of RegionContourLabeling(I) returns a set of contours and an array containing region labels for all pixels in the image. When a new point on either an outer or inner contour is found, then an ordered list of the contour’s points is constructed by calling procedure TraceContour() (line 23 and line 31). TraceContour() itself is described in Alg. 8.5.

1:  RegionContourLabeling(I)
    Input: I, a binary image with 0 = background, 1 = foreground.
    Returns sequences of outer and inner contours and a map of region labels.
2:    (M, N) ← size(I)
3:    Cout ← ( )                       ▷ empty list of outer contours
4:    Cin ← ( )                        ▷ empty list of inner contours
5:    L ← new map M × N → Z            ▷ the label map L
6:    for all p ∈ M × N do
7:      L(p) ← 0                       ▷ initialize L to zero
8:    r ← 2                            ▷ next region number to be assigned
9:    for v ← 0, . . . , N−1 do        ▷ scan the image top to bottom
10:     label ← 0                      ▷ reset the current label
11:     for u ← 0, . . . , M−1 do      ▷ process line v
12:       p ← (u, v)
13:       if I(p) > 0 then             ▷ I(p) is a foreground pixel
14:         if (label ≠ 0) then        ▷ continue existing region
15:           L(p) ← label
16:         else                       ▷ label = 0
17:           label ← L(p)
18:           if (label = 0) then      ▷ found a new region
19:             label ← r              ▷ assign region number r
20:             r ← r + 1              ▷ advance region number
21:             xs ← p                 ▷ start point
22:             ds ← 0                 ▷ start direction
                Trace the outer contour starting at xs:
23:             c ← TraceContour(xs, ds, label, I, L)
24:             Cout ← Cout ⊞ (c)      ▷ add new contour c
25:             L(p) ← label
26:       else                         ▷ I(p) is a background pixel
27:         if (label ≠ 0) then
28:           if (L(p) = 0) then       ▷ hit new inner contour
29:             xs ← p − (1, 0)        ▷ one pixel to the left of p
30:             ds ← 2 if NT = 4, 1 otherwise
                Trace the inner contour starting at xs:
31:             c ← TraceContour(xs, ds, label, I, L)
32:             Cin ← Cin ⊞ (c)        ▷ add new contour c
33:           label ← 0                ▷ reset current label
34:   return (Cout, Cin, L)            ▷▷ continued in Alg. 8.5



Case C: When a foreground pixel does not lie on a contour, then the neighboring pixel to the left has already been labeled (Fig. 8.9 (d)) and this label is propagated to the current pixel.

The complete segmentation and contour tracing process is summarized in Algs. 8.4–8.5. Procedure RegionContourLabeling() traverses the image line by line and calls procedure TraceContour() whenever a new inner or outer contour must be traced. The labels of the image elements along the contour, as well as the neighboring foreground pixels, are stored in the “label map” L (a rectangular array of the






Alg. 8.5 Combined contour finding and region labeling (Part 2). Starting from xs, procedure TraceContour() traces along the contour in the direction ds. During this process, all contour points as well as neighboring background points are marked in the label array L. Given a contour point x, TraceContour() uses FindNextContourPoint() to determine the next point on the contour (line 11). The auxiliary function Delta() calculates the shift to the next position from the search direction d.



1:  TraceContour(xs, ds, label, I, L)
    Input: xs, start position; ds, initial search direction; label, the label assigned to this contour; I, the binary input image; L, label map.
    Global: NT ∈ {4, 8} (neighborhood type).
    Returns a new outer or inner contour (sequence of points) starting at xs.
2:    c ← ( )                                        ▷ empty contour
3:    (x, d) ← FindNextContourPoint(xs, ds, I, L)    ▷ x = (u, v)
4:    c ← c ⊞ (x)                                    ▷ add point x to contour c
5:    xt ← x                                         ▷ orig. successor of start point xs
6:    home ← (xs = xt)                               ▷ true if single-pixel contour
7:    while (¬home) do
8:      L(x) ← label
9:      xp ← x                                       ▷ keep previous position
10:     dn ← (d + 6) mod 8                           ▷ next search direction
11:     (x, d) ← FindNextContourPoint(x, dn, I, L)
12:     home ← (xp ≡ xs ∧ x ≡ xt)                    ▷ back at start position
13:     if (¬home) then
14:       c ← c ⊞ (x)                                ▷ add point x to contour c
15:   return c                                       ▷ return contour

16: FindNextContourPoint(x0, d0, I, L)
    Input: x0, initial position; d0, initial search direction; I, binary input image; L, the label map.
    Returns the next point on the contour and the direction leading to it, or the initial position and direction if no such point was found.
17:   step ← 2 if NT = 4, 1 otherwise
18:   d ← d0
19:   i ← 0
20:   done ← false
21:   while i < 7 ∧ ¬done do                         ▷ search in up to 7 directions
22:     x ← x0 + Delta(d)                            ▷ see below
23:     if I(x) = 0 then                             ▷ I(x) is a background pixel
24:       L(x) ← −1                                  ▷ mark background as visited (−1)
25:       d ← (d + step) mod 8
26:     else                                         ▷ found a non-background pixel at x
27:       done ← true
28:     i ← i + step
29:   return (x, d) if done, (x0, d0) otherwise (no successor found)

30: Delta(d) := (∆x, ∆y), with

      d    0  1  2  3  4  5  6  7
      ∆x   1  1  0 −1 −1 −1  0  1
      ∆y   0  1  1  1  0 −1 −1 −1



same size as the image) by procedure FindNextContourPoint() in Algorithm 8.5.

8.2.3 Java Implementation






The Java implementation of the combined region labeling and contour tracing algorithm can be found online in class RegionContour-



Prog. 8.2 Combined region segmentation and contour tracing example (ImageJ plugin). The binary input image (ip) is initially converted to an 8-bit image (I). Then a segmenter is constructed and applied to the image. All connected components (binary regions) are retrieved from the segmenter by getRegions(). Each region carries an outer contour and a (possibly empty) set of inner contours, obtained by getOuterContour() and getInnerContours(), respectively.

import java.util.List;
import ij.IJ;
...
import imagingbook.pub.geometry.basic.NeighborhoodType2D;
import imagingbook.pub.geometry.basic.Pnt2d;
import imagingbook.pub.regions.BinaryRegion;
import imagingbook.pub.regions.Contour;
import imagingbook.pub.regions.segment.RegionContourSegmentation;

public class Region_Contours_Demo implements PlugInFilter {

  static NeighborhoodType NT = NeighborhoodType.N8;

  public int setup(String arg, ImagePlus im) {
    return DOES_8G + NO_CHANGES;
  }

  public void run(ImageProcessor ip) {
    // Make sure we have a proper byte image:
    ByteProcessor I = ip.convertToByteProcessor();

    // Create the region segmenter / contour tracer:
    RegionContourSegmentation seg = new RegionContourSegmentation(I, NT);

    // Get a list of detected regions (sorted by size):
    List<BinaryRegion> regions = seg.getRegions(true);

    // Get the largest region:
    BinaryRegion Rmax = regions.get(0);

    // Iterate over all detected regions:
    for (BinaryRegion R : regions) {
      // Get the outer contour of R:
      Contour oc = R.getOuterContour();
      // Get all inner contours of R:
      List<Contour> ic = R.getInnerContours();
    }
    ...
  }
}



Labeling⁴ (for details see Sec. 8.9). It almost exactly follows Algs. 8.4–8.5, only the image I and the associated label map L are initially padded (i.e., enlarged) by a surrounding layer of background pixels. This simplifies the process of tracing the outer region contours, since no special treatment is needed at the image borders. Program 8.2 shows a minimal example of its usage within the run() method of an ImageJ plugin.

Examples

This combined algorithm for region marking and contour following is particularly well suited for processing large binary images since it is efficient and has only modest memory requirements. Figure 8.10 shows a synthetic test image that illustrates a number of special situations, such as isolated pixels and thin sections, which the algorithm



⁴ Package imagingbook.pub.regions.






Fig. 8.10 Combined contour and region marking. Original image, with foreground pixels marked green (a); located contours with black lines for outer and white lines for inner contours (b). Contour polygons pass through the pixel centers. Outer contours of single-pixel regions (e.g., in the upper-right of (b)) are marked by a single dot.






Fig. 8.11 Example of a complex contour (see original image in Fig. 7.11). Outer contours are marked in black and inner contours in white.



must deal with correctly when following the contours. In the resulting plot, outer contours are shown as black polygon lines running through the centers of the contour pixels, and inner contours are drawn white. Contours of single-pixel regions are marked by small circles filled with the corresponding color. Figure 8.11 shows the results for a larger section taken from a real image (see Fig. 7.11).



8.3 Representing Image Regions

8.3.1 Matrix Representation






A natural representation for images is a matrix (i.e., a two-dimensional array) in which elements represent the intensity or the color at a corresponding position in the image. This representation lends itself, in most programming languages, to a simple and elegant mapping onto



two-dimensional arrays, which makes possible a very natural way to work with raster images. One possible disadvantage with this representation is that it does not depend on the content of the image. In other words, it makes no difference whether the image contains only a pair of lines or is of a complex scene because the amount of memory required is constant and depends only on the dimensions of the image. Regions in an image can be represented using a logical mask in which the area within the region is assigned the value true and the area without the value false (Fig. 8.12). Since these values can be represented by a single bit, such a matrix is often referred to as a “bitmap”.⁵

Fig. 8.12 Use of a binary mask to specify a region of an image: original image (a), logical (bit) mask (b), and masked image (c).






8.3.2 Run Length Encoding

In run length encoding (RLE), sequences of adjacent foreground pixels can be represented compactly as “runs”. A run, or contiguous block, is a maximal-length sequence of adjacent pixels of the same type within either a row or a column. Runs of arbitrary length can be encoded compactly using three integers,

Runᵢ = ⟨rowᵢ, columnᵢ, lengthᵢ⟩,

as illustrated in Fig. 8.13. When representing a sequence of runs within the same row, the number of the row is redundant and can be left out. Also, in some applications, it is more useful to record the coordinate of the end column instead of the length of the run. Since the RLE representation can be easily implemented and efficiently computed, it has long been used as a simple lossless compression method. It forms the foundation for fax transmission and can be found in a number of other important codecs, including TIFF, GIF, and JPEG. In addition, RLE provides precomputed information about the image that can be used directly when computing certain properties of the image (for example, statistical moments; see Sec. 8.5.2).

8.3.3 Chain Codes

Regions can be represented not only using their interiors but also by their contours. Chain codes, which are often referred to as Freeman



⁵ Java does not provide a genuine 1-bit data type. Even single variables of type boolean are represented internally (depending on the Java virtual machine) with 8 or even 32 bits.






Fig. 8.13 Run length encoding in row direction. A run of pixels can be represented by its starting point (1, 2) and its length (6).



Fig. 8.14 Chain codes with 4- and 8-connected neighborhoods. To compute a chain code, begin traversing the contour from a given starting point xs. Encode the relative position between adjacent contour points using the directional code for either 4-connected (a) or 8-connected (b) neighborhoods. The length of the resulting path, calculated as the sum of the individual segments, can be used to approximate the true length of the contour.



Example data from Fig. 8.13: the runs of the sample bitmap are ⟨1, 2, 6⟩, ⟨3, 4, 4⟩, ⟨4, 1, 3⟩, ⟨4, 5, 3⟩, ⟨5, 0, 9⟩ in ⟨row, column, length⟩ form. Example data from Fig. 8.14: (a) 4-chain code 3223222322303303...111 with length = 28; (b) 8-chain code 54544546767...222 with length = 16 + 6·√2 ≈ 24.5.



codes [98], are a classical method of contour encoding. In this encoding, the contour beginning at a given start point xs is represented by the sequence of directional changes it describes on the discrete image grid (Fig. 8.14).

Absolute Chain Code

For a closed contour of a region R, described by the sequence of points cR = (x0, x1, . . . , xM−1) with xi = ⟨ui, vi⟩, we create the elements of its chain code sequence c′R = (c′0, c′1, . . . , c′M−1) with

c′i = Code(u′, v′),    (8.3)

where

(u′, v′) = { (ui+1 − ui, vi+1 − vi)   for 0 ≤ i < M−1,
             (u0 − ui, v0 − vi)       for i = M−1,    (8.4)

and Code(u′, v′) being defined (assuming an 8-connected neighborhood) by the following table:

  u′              1  1  0 −1 −1 −1  0  1
  v′              0  1  1  1  0 −1 −1 −1
  Code(u′, v′)    0  1  2  3  4  5  6  7
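Eqns. (8.3)–(8.4) and the table above translate directly into Java; the following is only a sketch (the point representation as int[]{u, v} and the method names are assumptions, not part of the imagingbook API):

// Sketch: absolute 8-chain code of a closed contour given as an
// ordered array of points p[i] = {u_i, v_i}, per Eqns. (8.3)-(8.4).
static int[] chainCode8(int[][] p) {
  int M = p.length;
  int[] code = new int[M];
  for (int i = 0; i < M; i++) {
    int j = (i + 1) % M;                  // wrap around to the start point
    int du = p[j][0] - p[i][0];
    int dv = p[j][1] - p[i][1];
    code[i] = dirCode(du, dv);
  }
  return code;
}

// Direction code for a single step (du, dv), per the table above:
static int dirCode(int du, int dv) {
  int[][] delta = {{1,0},{1,1},{0,1},{-1,1},{-1,0},{-1,-1},{0,-1},{1,-1}};
  for (int d = 0; d < 8; d++) {
    if (delta[d][0] == du && delta[d][1] == dv)
      return d;
  }
  throw new IllegalArgumentException("points are not 8-connected");
}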



Chain codes are compact since instead of storing the absolute coordinates for every point on the contour, only that of the starting point



is recorded. The remaining points are encoded relative to the starting point by indicating in which of the eight possible directions the next point lies. Since only 3 bits are required to encode these eight directions, the values can be stored using a smaller numeric type.

Differential Chain Code

Directly comparing two regions represented using chain codes is difficult since the description depends on the selected starting point xs, and for instance simply rotating the region by 90° results in a completely different chain code. When using a differential chain code, the situation improves slightly. Instead of encoding the difference in the position of the next contour point, the change in the direction along the discrete contour is encoded. A given absolute chain code c′R = (c′0, c′1, . . . , c′M−1) can be converted element by element to a differential chain code c″R = (c″0, c″1, . . . , c″M−1), with⁶

c″i = { (c′i+1 − c′i) mod 8   for 0 ≤ i < M−1,
        (c′0 − c′i) mod 8     for i = M−1,    (8.5)

again under the assumption of an 8-connected neighborhood. The element c″i thus describes the change in direction (curvature) of the contour between two successive segments c′i and c′i+1 of the original chain code c′R. For the contour in Fig. 8.14 (b), for example, the result is

c′R = (5, 4, 5, 4, 4, 5, 4, 6, 7, 6, 7, . . . , 2, 2, 2),
c″R = (7, 1, 7, 0, 1, 7, 2, 1, 7, 1, 1, . . . , 0, 0, 3).

Given the start position xs and the (absolute) initial direction c0, the original contour can be unambiguously reconstructed from the differential chain code.

Shape Numbers

While the differential chain code remains the same when a region is rotated by 90°, the encoding is still dependent on the selected starting point. If we want to determine the similarity of two contours of the same length M using their differential chain codes c″1, c″2, we must first ensure that the same start point was used when computing the codes. A method that is often used [20, 108] is to interpret the elements c″i in the differential chain code as the digits of a number to the base b (b = 8 for an 8-connected contour or b = 4 for a 4-connected contour) with the numeric value

Val(c″R) = c″0·b⁰ + c″1·b¹ + . . . + c″M−1·b^(M−1) = Σ_{i=0}^{M−1} c″i·bⁱ.    (8.6)

Then the sequence c″R is shifted circularly until the numeric value of the corresponding number reaches a maximum. We use the expression c″R ▷ k to denote the sequence c″R being circularly shifted by k positions to the right.⁷ For example, for k = 2, this is

c″R     = (0, 1, 3, 2, . . . , 5, 3, 7, 4),
c″R ▷ 2 = (7, 4, 0, 1, 3, 2, . . . , 5, 3),

and

kmax = argmax_{0 ≤ k < M} Val(c″R ▷ k).

The correspondingly shifted sequence c″R ▷ kmax is independent of the chosen start point and can thus serve as a comparable description (“shape number”) of the contour. In practice, whether Val(c″1) > Val(c″2) is determined by comparing the lexicographic ordering between the sequences c″1 and c″2, so that the arithmetic values need not be computed at all. Unfortunately, comparisons based on chain codes are generally not very useful for determining the similarity between regions, simply because rotations at arbitrary angles (≠ 90°) have too great of an impact (change) on a region’s code. In addition, chain codes are not capable of handling changes in size (scaling) or other distortions. Section 8.4 presents a number of tools that are more appropriate in these types of cases.

Fourier Shape Descriptors

An elegant approach to describing contours are so-called Fourier shape descriptors, which interpret the two-dimensional contour C = (x0, x1, . . . , xM−1) with xk = (uk, vk) as a sequence of values in the complex plane, where⁸

zk = (uk + i·vk) ∈ C.    (8.9)

⁶ See Sec. J.1.2 of the Appendix for a description of the ‘mod’ operator.
⁷ That is, (c″R ▷ k)(i) = c″R((i − k) mod M).



From this sequence, one obtains (using a suitable method of interpolation in the case of an 8-connected contour) a discrete, one-dimensional periodic function f(s) ∈ C with a constant sampling interval over s, the path length around the contour. The coefficients of the 1D Fourier spectrum (see Sec. 18.3) of this function f(s) provide a shape description of the contour in frequency space, where the lower spectral coefficients deliver a gross description of the shape. This technique is described in considerable detail in [49, Ch. 26].
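As a rough illustration of the idea only (ignoring the interpolation/resampling step mentioned above, and not the implementation referenced in [49]), the first Fourier coefficients of a contour could be obtained by a direct DFT over the complex-valued point sequence:

// Sketch: compute the first nCoeff DFT coefficients of a contour,
// treating each point (u_k, v_k) as the complex number u_k + i*v_k.
// Coefficients are returned as {re, im} pairs; no resampling is done.
static double[][] fourierDescriptors(int[][] pts, int nCoeff) {
  int M = pts.length;
  double[][] G = new double[nCoeff][2];
  for (int m = 0; m < nCoeff; m++) {
    double re = 0, im = 0;
    for (int k = 0; k < M; k++) {
      double phi = -2 * Math.PI * m * k / M;
      double u = pts[k][0], v = pts[k][1];
      // accumulate (u + i*v) * e^{i*phi}:
      re += u * Math.cos(phi) - v * Math.sin(phi);
      im += u * Math.sin(phi) + v * Math.cos(phi);
    }
    G[m][0] = re / M;
    G[m][1] = im / M;
  }
  return G;
}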



8.4 Properties of Binary Regions

Imagine that you have to describe the contents of a digital image to another person over the telephone. One possibility would be to



216



i denotes the imaginary unit.



call out the value of each pixel in some agreed upon order. A much simpler way of course would be to describe the image on the basis of its properties—for example, “a red rectangle on a blue background”, or at an even higher level such as “a sunset at the beach with two dogs playing in the sand”. While using such a description is simple and natural for us, it is not (yet) possible for a computer to generate these types of descriptions without human intervention. For computers, it is of course simpler to calculate the mathematical properties of an image or region and to use these as the basis for further classification. Using features to classify, be they images or other items, is a fundamental part of the field of pattern recognition, a research area with many applications in image processing and computer vision [76, 192, 259].

8.4.1 Shape Features

The comparison and classification of binary regions is widely used, for example, in optical character recognition (OCR) and for automating processes ranging from blood cell counting to quality control inspection of manufactured products on assembly lines. The analysis of binary regions turns out to be one of the simpler tasks for which many efficient algorithms have been developed and used to implement reliable applications that are in use every day. By a feature of a region, we mean a specific numerical or qualitative measure that is computable from the values and coordinates of the pixels that make up the region. As an example, one of the simplest features is its size or area; that is, the number of pixels that make up a region. In order to describe a region in a compact form, different features are often combined into a feature vector. This vector is then used as a sort of “signature” for the region that can be used for classification or comparison with other regions. The best features are those that are simple to calculate and are not easily influenced (robust) by irrelevant changes, particularly translation, rotation, and scaling.

8.4.2 Geometric Features

A region R of a binary image can be interpreted as a two-dimensional distribution of foreground points pi = (ui, vi) on the discrete plane Z², that is, as a set

R = {x0, . . . , xN−1} = {(u0, v0), . . . , (uN−1, vN−1)}.

Most geometric properties are defined in such a way that a region is considered to be a set of pixels that, in contrast to the definition in Sec. 8.1, does not necessarily have to be connected.

Perimeter

The perimeter (or circumference) of a region R is defined as the length of its outer contour, where R must be connected. As illustrated in Fig. 8.14, the type of neighborhood relation must be taken into account for this calculation. When using a 4-neighborhood, the






measured length of the contour (except when that length is 1) will be larger than its actual length. In the case of an 8-neighborhood, a good approximation is reached by weighting the horizontal and vertical segments with 1 and the diagonal segments with √2. Given an 8-connected chain code c′_R = (c′_0, c′_1, ..., c′_{M−1}), the perimeter of the region is arrived at by

Perimeter(R) = Σ_{i=0..M−1} length(c′_i),    (8.10)

with

length(c) = 1   for c = 0, 2, 4, 6,
          = √2  for c = 1, 3, 5, 7.    (8.11)

However, with this conventional method of calculation, the real perimeter P(R) is systematically overestimated. As a simple remedy, an empirical correction factor of 0.95 works satisfactorily even for relatively small regions, that is,

P(R) ≈ 0.95 · Perimeter(R).    (8.12)
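As a small illustration of Eqns. (8.10)–(8.12), the following minimal Java sketch computes the corrected perimeter estimate directly from an 8-chain code, assumed here to be given as a plain int array of directions 0–7 (this representation and the method name are assumptions made for illustration only, not part of the imagingbook API):

// Corrected perimeter estimate (Eqns. (8.10)-(8.12)) from an 8-chain code.
// Even directions (0,2,4,6) are axis-parallel steps of length 1,
// odd directions (1,3,5,7) are diagonal steps of length sqrt(2).
static double perimeter(int[] chainCode) {
  double sum = 0;
  for (int c : chainCode) {
    sum += (c % 2 == 0) ? 1.0 : Math.sqrt(2);
  }
  return 0.95 * sum;    // empirical correction factor (Eqn. (8.12))
}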



Area

The area of a binary region R can be found by simply counting the image pixels that make up the region, that is,

A(R) = N = |R|.    (8.13)

The area of a connected region without holes can also be approximated from its closed contour, defined by M coordinate points (x_0, x_1, ..., x_{M−1}), where x_i = (u_i, v_i), using the Gaussian area formula for polygons:

A(R) ≈ (1/2) · Σ_{i=0..M−1} ( u_i·v_{(i+1) mod M} − v_i·u_{(i+1) mod M} ).    (8.14)

When the contour is already encoded as a chain code c′_R = (c′_0, c′_1, ..., c′_{M−1}), then the region's area can be computed (trivially) with Eqn. (8.14) by expanding c′_R into a sequence of contour points from an arbitrary starting point (e.g., (0, 0)). However, the area can also be calculated directly from the chain code representation without expanding the contour [295] (see also Exercise 8.11). While simple region properties such as area and perimeter are not influenced (except for quantization errors) by translation and rotation of the region, they are definitely affected by changes in size; for example, when the object to which the region corresponds is imaged from different distances. However, as will be described, it is possible to specify combined features that are invariant to translation, rotation, and scaling as well.
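The Gaussian area formula of Eqn. (8.14) translates directly into code. The following sketch assumes the closed contour is given as two parallel coordinate arrays u[] and v[] (a hypothetical representation chosen only to keep the example self-contained):

// Area of a closed polygon from its M contour points (Eqn. (8.14)).
static double polygonArea(int[] u, int[] v) {
  int M = u.length;
  double sum = 0;
  for (int i = 0; i < M; i++) {
    int j = (i + 1) % M;                 // successor index (mod M)
    sum += u[i] * v[j] - v[i] * u[j];
  }
  return 0.5 * sum;   // signed area; take Math.abs() if the contour orientation is unknown
}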



Compactness and Roundness

Compactness is understood as the relation between a region's area and its perimeter. We can use the fact that a region's perimeter P increases linearly with the enlargement factor while the area A increases quadratically to see that, for a particular shape, the ratio A/P² should be the same at any scale. This ratio can thus be used as a feature that is invariant under translation, rotation, and scaling. When applied to a circular region of any diameter, this ratio has a value of 1/(4π), so by normalizing it against a filled circle we create a feature that is sensitive to the roundness or circularity of a region,

Circularity(R) = 4π · A(R) / P²(R),    (8.15)

which results in a maximum value of 1 for a perfectly round region R and a value in the range [0, 1] for all other shapes (Fig. 8.15). If an absolute value for a region's roundness is required, the corrected perimeter estimate (Eqn. (8.12)) should be employed. Figure 8.15 shows the circularity values of different regions as computed with the formulation in Eqn. (8.15).

Fig. 8.15 Circularity values for different shapes. Shown are the corresponding estimates for Circularity(R) as defined in Eqn. (8.15); corrected values calculated with Eqn. (8.12) are shown in parentheses: (a) 0.904 (1.001), (b) 0.607 (0.672), (c) 0.078 (0.086).
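Given the area (pixel count) and the corrected perimeter estimate, Eqn. (8.15) is a one-liner in Java; the method name below is again only illustrative:

// Circularity (Eqn. (8.15)) from region area A and (corrected) perimeter P.
static double circularity(double area, double perimeter) {
  return (4 * Math.PI * area) / (perimeter * perimeter);
}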



Bounding Box

The bounding box of a region R is the minimal axis-parallel rectangle that encloses all points of R,

BoundingBox(R) = ⟨u_min, u_max, v_min, v_max⟩,    (8.16)

where u_min, u_max and v_min, v_max are the minimal and maximal coordinate values of all points (u_i, v_i) ∈ R in the x and y directions, respectively (Fig. 8.16 (a)).

Convex Hull

The convex hull is the smallest convex polygon that contains all points of the region R. A physical analogy is a board in which nails stick out in correspondence to each of the points in the region. If you were to place an elastic band around all the nails, then, when you release it, it will contract into a convex hull around the nails (see Figs. 8.16 (b) and 8.21 (c)). Given N contour points, the convex hull can be computed in time O(N log V), where V is the number of vertices in the polygon of the resulting convex hull [23].



Fig. 8.16 Example bounding box (a) and convex hull (b) of a binary image region.



The convex hull is useful, for example, for determining the convexity or the density of a region. The convexity of a region is defined as the ratio of the length of the convex hull and the original perimeter of the region. Density is then defined as the ratio between the area of the region and the area of its convex hull. The diameter, on the other hand, is the maximal distance between any two points on the convex hull.



8.5 Statistical Shape Properties

When computing statistical shape properties, we consider a region R to be a collection of coordinate points distributed within a two-dimensional space. Since statistical properties can be computed for point distributions that do not form a connected region, they can be applied before segmentation. An important concept in this context is the central moments of the region's point distribution, which measure characteristic properties with respect to its midpoint or centroid.

8.5.1 Centroid

The centroid or center of gravity of a connected region can be easily visualized. Imagine drawing the region on a piece of cardboard or tin and then cutting it out and attempting to balance it on the tip of your finger. The location on the region where you must place your finger in order for the region to balance is the centroid of the region (assuming you did not imagine a region where the centroid lies outside of the region or within a hole in the region, which is of course possible). The centroid x_c = (x_c, y_c) of a binary (not necessarily connected) region is the arithmetic mean of the contained point coordinates u_i = (u_i, v_i), that is,

x_c = (1/|R|) · Σ_{u_i ∈ R} u_i    (8.17)

or

x_c = (1/|R|) · Σ_{(u_i,v_i)} u_i   and   y_c = (1/|R|) · Σ_{(u_i,v_i)} v_i.    (8.18)



8.5.2 Moments

The formulation of the region's centroid in Eqn. (8.18) is only a special case of the more general statistical concept of a moment. Specifically, the expression

m_pq(R) = Σ_{(u,v) ∈ R} I(u,v)·u^p·v^q    (8.19)

describes the (ordinary) moment of order p, q for a discrete (image) function I(u, v) ∈ R, e.g., a grayscale image. All the following definitions are also generally applicable to regions in grayscale images. The moments of connected binary regions can also be calculated directly from the coordinates of the contour points [240, p. 148]. In the special case of a binary image I(u, v) ∈ {0, 1}, only the foreground pixels with I(u, v) = 1 in the region R need to be considered, and therefore Eqn. (8.19) reduces to

m_pq(R) = Σ_{(u,v) ∈ R} u^p·v^q.    (8.20)

In this way, the area of a binary region can be expressed as its zero-order moment, i.e.,

A(R) = |R| = Σ_{(u,v)} 1 = Σ_{(u,v)} u⁰·v⁰ = m_00(R),    (8.21)

and similarly the centroid x_c (Eqn. (8.18)) can be written as

x_c = (1/|R|) · Σ_{(u,v)} u¹·v⁰ = m_10(R) / m_00(R),
y_c = (1/|R|) · Σ_{(u,v)} u⁰·v¹ = m_01(R) / m_00(R).    (8.22)

These moments thus represent concrete physical properties of a region. Specifically, the area m_00 is an important basis for characterizing regions, and the centroid (x_c, y_c) permits the reliable and (within a fraction of a pixel) exact calculation of a region's position.

8.5.3 Central Moments

To compute position-independent (translation-invariant) region features, the region's centroid, which can be determined precisely in any situation, can be used as a reference point. In other words, we can shift the origin of the coordinate system to the region's centroid x_c = (x_c, y_c) to obtain the central moments of order p, q:

µ_pq(R) = Σ_{(u,v) ∈ R} I(u,v)·(u−x_c)^p·(v−y_c)^q.    (8.23)

For a binary image (with I(u, v) = 1 within the region R), Eqn. (8.23) simplifies to

µ_pq(R) = Σ_{(u,v) ∈ R} (u−x_c)^p·(v−y_c)^q.    (8.24)



Note that, for binary images, µ10 and µ01 are always zero.






8.5.4 Normalized Central Moments

Central moment values of course depend on the absolute size of the region, since the value depends directly on the distance of all region points to its centroid. So, if a 2D shape is uniformly scaled by some factor s ∈ R, its central moments multiply by the factor

s^(p+q+2).    (8.25)

Thus size-invariant "normalized" moments are obtained by scaling with the reciprocal of the area A = µ_00 = m_00 raised to the required power, in the form

η_pq(R) = µ_pq(R) · 1/A^((p+q+2)/2) = µ_pq(R) · 1/µ_00(R)^((p+q+2)/2),    (8.26)

for (p+q) ≥ 2 [148, p. 529].

8.5.5 Java Implementation

Program 8.3 gives a direct (unoptimized) Java implementation for computing the ordinary, central, and normalized central moments for binary regions. This implementation is only meant to illustrate the calculation, i.e., much more efficient and numerically more accurate implementations are possible (see, e.g., [153]).

Prog. 8.3 Direct calculation of binary region moments in Java. The methods ordinaryMoment(), centralMoment(), and normalizedCentralMoment() calculate the ordinary moments m_pq, central moments µ_pq, and normalized central moments µ̂_pq, respectively, for a BinaryRegion instance R.



// Ordinary moments m_{p,q} (Eqn. (8.20)):
double ordinaryMoment(BinaryRegion R, int p, int q) {
  double mpq = 0.0;
  for (Pnt2d pnt : R) {
    mpq += Math.pow(pnt.getX(), p) * Math.pow(pnt.getY(), q);
  }
  return mpq;
}

// Central moments µ_{p,q} (Eqn. (8.24)):
double centralMoment(BinaryRegion R, int p, int q) {
  double A = ordinaryMoment(R, 0, 0);        // region area
  double xc = ordinaryMoment(R, 1, 0) / A;   // centroid
  double yc = ordinaryMoment(R, 0, 1) / A;
  double mupq = 0.0;
  for (Pnt2d pnt : R) {
    mupq += Math.pow(pnt.getX() - xc, p) * Math.pow(pnt.getY() - yc, q);
  }
  return mupq;
}

// Normalized central moments µ̂_{p,q} (Eqn. (8.26)):
double normalizedCentralMoment(BinaryRegion R, int p, int q) {
  double A = ordinaryMoment(R, 0, 0);
  double scale = 1.0 / Math.pow(A, 0.5 * (p + q + 2));
  return centralMoment(R, p, q) * scale;
}

Class BinaryRegion implements the Iterable interface, thus a simple for-each loop can be used to visit all points contained in a region.



8.6 Moment-Based Geometric Properties

While normalized moments can be directly applied for classifying regions, further interesting and geometrically relevant features can be elegantly derived from statistical region moments.



8.6.1 Orientation

Orientation describes the direction of the major axis, that is, the axis that runs through the centroid and along the widest part of the region (Fig. 8.18 (a)). Since rotating the region around the major axis requires less effort (smaller moment of inertia) than spinning it around any other axis, it is sometimes referred to as the major axis of rotation. As an example, when you hold a pencil between your hands and twist it around its major axis (that is, around the lead), the pencil exhibits the least mass inertia (Fig. 8.17). As long as a region has any orientation at all (i.e., µ11(R) ≠ 0), the direction θ_R of the major axis can be found directly from the central moments µ_pq as

tan(2θ_R) = 2·µ11(R) / (µ20(R) − µ02(R)),    (8.27)

and thus the corresponding angle is

θ_R = 0.5 · tan⁻¹( 2·µ11(R) / (µ20(R) − µ02(R)) )    (8.28)
    = 0.5 · ArcTan( µ20(R) − µ02(R), 2·µ11(R) ).

The resulting angle θ_R is in the range [−π/2, π/2]. Orientation measurements based on region moments are generally very accurate.
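A direct way to evaluate Eqn. (8.28) in Java is via Math.atan2(), which corresponds to the ArcTan(x, y) notation used above. A minimal sketch, assuming the central moments have already been computed, e.g., with the methods of Prog. 8.3:

// Region orientation theta_R (Eqn. (8.28)) from the central moments.
// Note that Math.atan2(y, x) implements ArcTan(x, y).
static double orientation(double mu20, double mu02, double mu11) {
  return 0.5 * Math.atan2(2 * mu11, mu20 - mu02);
}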



Fig. 8.17 Major axis of a region. Rotating an elongated region R, interpreted as a physical body, around its major axis requires less effort (least moment of inertia) than rotating it around any other axis.



Direct Calculation of Orientation Vectors

Often we do not seek the orientation angle itself but rather the corresponding orientation vector. Of course, if the angle θ is already known (from Eqn. (8.28)), the unit orientation vector is simply

x̂_θ = (x̂_θ, ŷ_θ)^⊺ = (cos(θ), sin(θ))^⊺.    (8.29)

(See Sec. A.1 of the Appendix for the definition of the inverse tangent function ArcTan(x, y) and Sec. J.1.5 for the corresponding Java method Math.atan2(y,x).)



However, the vector x̂_θ can also be obtained directly from the region's moments, without the use of trigonometric functions. Since tan(α) = sin(α)/cos(α), we can rewrite Eqn. (8.27) as

tan(2θ) = 2µ11 / (µ20 − µ02) = a/b = sin(2θ)/cos(2θ),    (8.30)

with

a = 2µ11   and   b = µ20 − µ02.    (8.31)

Note that this does not give us the catheti a, b directly but only their ratio a/b. Because sin²(α) + cos²(α) = 1, the sine/cosine values of 2θ are found as the elements of the normalized vector (a, b), that is,

sin(2θ) = a / √(a²+b²)   and   cos(2θ) = b / √(a²+b²),    (8.32)

respectively. The half-angle relation tan(α/2) = sin(α)/(1 + cos(α)) yields the tangent of the actual angle θ as

tan(θ) = sin(2θ) / (1 + cos(2θ)) = a/√(a²+b²) · 1/(1 + b/√(a²+b²)) = a / (√(a²+b²) + b).    (8.33)

Thus the (un-normalized) orientation vector is

x_θ = (x_θ, y_θ)^⊺ = (√(a²+b²) + b, a)^⊺,    (8.34)

with a, b as defined in Eqn. (8.31), and the unit direction vector is obtained by normalizing x_θ, i.e.,

x̂_θ = x_θ / ∥x_θ∥ = (1 / √(2·(a² + b² + b·√(a²+b²)))) · (√(a²+b²) + b, a)^⊺,    (8.35)

with a, b as defined in Eqn. (8.31). As usual, care must be taken to avoid a zero denominator in Eqn. (8.35), which occurs when the region's orientation is undefined. Of course, as an alternative to Eqn. (8.28), the angle θ_R could also be calculated from the direction vector x_θ in Eqn. (8.34) as

θ_R = tan⁻¹( a / (√(a²+b²) + b) ) ≡ ArcTan( √(a²+b²) + b, a ).    (8.36)
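The trigonometry-free formulation of Eqns. (8.31)–(8.35) is easily coded. The following sketch (method name and the choice to return null for undefined orientation are assumptions of this example) returns the unit orientation vector as a two-element array:

// Unit orientation vector (Eqn. (8.35)) computed directly from the
// central moments, without trigonometric functions.
static double[] orientationVector(double mu20, double mu02, double mu11) {
  double a = 2 * mu11;                           // Eqn. (8.31)
  double b = mu20 - mu02;
  double xTheta = Math.sqrt(a * a + b * b) + b;  // Eqn. (8.34)
  double yTheta = a;
  double d = Math.hypot(xTheta, yTheta);
  if (d < 1e-12) {
    return null;                                 // orientation is undefined
  }
  return new double[] { xTheta / d, yTheta / d };
}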



8.6.2 Region Eccentricity






Similar to the region orientation, moments can also be used to determine the “elongatedness” or eccentricity of a region. A naive approach for calculating the eccentricity could be to rotate the region until we can fit a bounding box (or enclosing ellipse) with a maximum aspect ratio. Of course this process would be computationally intensive, simply because of the many rotations required. If we know the orientation of the region (Eqn. (8.28)), then we may fit a bounding



Fig. 8.18 Region orientation and eccentricity. The major axis of the region extends through its center of gravity x_c at the orientation θ. Note that angles are in the range [−π/2, +π/2] and increment in the clockwise direction because the y axis of the image coordinate system points downward (in this example, θ ≈ −0.759 ≈ −43.5°). The eccentricity of the region is defined as the ratio between the lengths of the major axis (r_a) and the minor axis (r_b) of the "equivalent" ellipse.



box that is parallel to the region's major axis. In general, the proportions of the region's bounding box are not a good eccentricity measure anyway, because they do not consider the distribution of pixels inside the box. Based on region moments, highly accurate and stable measures can be obtained without any iterative search or optimization. Also, moment-based methods do not require knowledge of the boundary length (as required for computing the circularity feature in Sec. 8.4.2), and they can also handle nonconnected regions or point clouds. Several different formulations of region eccentricity can be found in the literature [20, 148, 150] (see also Exercise 8.16). The most common definition is

Ecc(R) = λ0 / λ1,    (8.37)

which is simply the ratio of the eigenvalues (with λ0 ≥ λ1; see Sec. B.5 of the Appendix)

λ0 = ( µ20 + µ02 + √((µ20−µ02)² + 4·µ11²) ) / 2,
λ1 = ( µ20 + µ02 − √((µ20−µ02)² + 4·µ11²) ) / 2,    (8.38)

of the symmetric 2 × 2 "scatter" matrix

S = (s_xx s_xy; s_xy s_yy) = (µ20 µ11; µ11 µ02),    (8.39)

composed of the region's central moments µ20, µ02, and µ11 (see Eqn. (8.23)). The values of Ecc are in the range [1, ∞), where Ecc = 1 corresponds to a circular disk and elongated regions have values > 1. (The same scatter matrix is also used for orthogonal line fitting, which is a very similar problem (see Sec. 10.2.2). The elements of the scatter matrix are proportional to the elements of the covariance matrix Σ = (1/|R|)·S; see Sec. G.2 of the Appendix.)
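The eigenvalues of Eqn. (8.38) are available in closed form, so Eqn. (8.37) requires only a few lines of Java. In this sketch, returning Double.POSITIVE_INFINITY for degenerate (line-like) regions with λ1 = 0 is an assumption of the example, not something prescribed by the text:

// Eccentricity (Eqn. (8.37)) as the ratio of the eigenvalues (Eqn. (8.38))
// of the 2x2 scatter matrix built from the central moments.
static double eccentricity(double mu20, double mu02, double mu11) {
  double root = Math.sqrt(sqr(mu20 - mu02) + 4 * sqr(mu11));
  double lambda0 = 0.5 * (mu20 + mu02 + root);
  double lambda1 = 0.5 * (mu20 + mu02 - root);
  return (lambda1 > 0) ? lambda0 / lambda1 : Double.POSITIVE_INFINITY;
}

static double sqr(double x) { return x * x; }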



Fig. 8.19 Orientation and eccentricity examples. The orientation θ (Eqn. (8.28)) is displayed for each connected region as a vector with the length proportional to the region's eccentricity value Ecc(R) (Eqn. (8.37)). Also shown are the ellipses (see Eqn. (8.40)) corresponding to the orientation and eccentricity parameters.



8.6.3 Equivalent Ellipse

The value returned by Ecc(R) is invariant to the region's orientation and size, that is, this quantity has the important property of being rotation and scale invariant. However, the values λ0, λ1 contain relevant information about the spatial structure of the region. Geometrically, the eigenvalues λ0, λ1 of the scatter matrix (see Eqn. (8.38)) directly relate to the proportions of the "equivalent" ellipse, positioned at the region's centroid (x_c, y_c) and oriented at θ = θ_R. In particular, the lengths of the major and minor axes are

r_a = 2·(λ0/|R|)^(1/2)   and   r_b = 2·(λ1/|R|)^(1/2),    (8.40)

respectively. Given the axes' lengths r_a, r_b and the region's centroid (x_c, y_c), the parametric equation of this ellipse is (see also Sec. F.3 of the Appendix)

x(t) = (x_c, y_c)^⊺ + (cos(θ) −sin(θ); sin(θ) cos(θ)) · (r_a·cos(t), r_b·sin(t))^⊺    (8.41)
     = ( x_c + cos(θ)·r_a·cos(t) − sin(θ)·r_b·sin(t),  y_c + sin(θ)·r_a·cos(t) + cos(θ)·r_b·sin(t) )^⊺,    (8.42)

for 0 ≤ t < 2π. If entirely filled, the region described by this ellipse would have the same central moments as the original region R. Figure 8.19 shows a set of binary regions with overlaid orientation and eccentricity properties.
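For drawing the equivalent ellipse, Eqn. (8.42) can simply be sampled at n values of the parameter t. The sketch below returns the sample points as an n×2 array; the actual plotting (e.g., with ImageJ overlays) is left out, and the method name is only illustrative:

// Sample n points on the equivalent ellipse (Eqn. (8.42)) with center
// (xc, yc), axis lengths ra, rb and orientation theta.
static double[][] ellipsePoints(double xc, double yc,
                                double ra, double rb, double theta, int n) {
  double[][] pts = new double[n][2];
  double cosT = Math.cos(theta), sinT = Math.sin(theta);
  for (int i = 0; i < n; i++) {
    double t = 2 * Math.PI * i / n;
    double x = ra * Math.cos(t), y = rb * Math.sin(t);
    pts[i][0] = xc + cosT * x - sinT * y;
    pts[i][1] = yc + sinT * x + cosT * y;
  }
  return pts;
}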



8.6.4 Bounding Box Aligned to the Major Axis

While the ordinary, x/y axis-aligned bounding box (see Sec. 8.4.2) is of little practical use (because it is sensitive to rotation), it may be interesting to see how to find a region's bounding box that is aligned with its major axis, as defined in Sec. 8.6.1. Given a region's orientation angle θ_R or its unit direction vector x̂_θ (see Eqn. (8.35)),

e_a = x̂_θ = (x̂_θ, ŷ_θ)^⊺ = (cos(θ_R), sin(θ_R))^⊺    (8.43)

is the unit vector parallel to the region's major axis; thus

e_b = e_a^⊥ = (ŷ_θ, −x̂_θ)^⊺    (8.44)

is the unit vector orthogonal to e_a (with x^⊥ = perp(x) = (0 1; −1 0)·x). The bounding box can now be determined as follows (see Fig. 8.20):

1. Project each region point x_i = (x_i, y_i)^⊺ onto the vector e_a (parallel to the region's major axis) by calculating the dot product (see Sec. B.3.1, Eqn. (B.28) of the Appendix)

   a_i = x_i^⊺ · e_a    (8.45)

   and keep the minimum and maximum values, i.e.,

   a_min = min_{x_i ∈ R} x_i^⊺·e_a,   a_max = max_{x_i ∈ R} x_i^⊺·e_a.    (8.46)

   (If the region's outer contour is available, it is sufficient to consider the contour points only.)

2. Analogously, project each region point x_i onto the orthogonal axis (specified by the vector e_b) by

   b_i = x_i^⊺ · e_b    (8.47)

   and keep the minimum and maximum values, that is,

   b_min = min_{x_i ∈ R} x_i^⊺·e_b,   b_max = max_{x_i ∈ R} x_i^⊺·e_b.    (8.48)

   Note that steps 1 and 2 can be performed in a single iteration over all region points.

3. Finally, from the resulting quantities a_min, a_max, b_min, b_max, calculate the four corner points p_0, ..., p_3 of the bounding box as

   p_0 = a_min·e_a + b_min·e_b,
   p_1 = a_min·e_a + b_max·e_b,
   p_2 = a_max·e_a + b_max·e_b,
   p_3 = a_max·e_a + b_min·e_b.    (8.49)

The complete calculation is summarized in Alg. 8.6; a typical example is shown in Fig. 8.21 (d).

8.6.5 Invariant Region Moments

Normalized central moments are not affected by the translation or uniform scaling of a region (i.e., the values are invariant), but in general rotating the image will change these values.



Fig. 8.20 Calculation of a region's major axis-aligned bounding box. The unit vector e_a is parallel to the region's major axis (oriented at angle θ); e_b is perpendicular to e_a. The projection of a region point x_i onto the lines defined by e_a and e_b yields the lengths a_i and b_i, respectively (measured from the coordinate origin). The resulting quantities a_min, a_max, b_min, b_max define the corner points p_0, ..., p_3 of the axis-aligned bounding box. Note that the position of the region's centroid (x_c) is not required in this calculation.

Fig. 8.21 Geometric region properties. Original binary image (a), centroid and orientation vector (length determined by the region's eccentricity) of the major axis (b), convex hull (c), and major axis-aligned bounding box (d).



Hu's Invariant Moments

A classic (and therefore often quoted) solution to this problem is a clever combination of simpler features known as "Hu's moments" [134]. They are calculated from low-order normalized central moments η_{p,q} (see Eqn. (8.26)) as defined in Eqn. (8.50). (For better readability, the region arguments (R) are omitted in Eqn. (8.50); with the region argument included, e.g., the first line would read ϕ1(R) = η20(R) + η02(R).)



Alg. 8.6 Calculation of the major axis-aligned bounding box for a binary region R. Orientation vectors are obtained directly from central region moments, i.e., without the explicit calculation of eigenvalues. If the region's contour is available, it is sufficient (in line 23) to iterate over the contour points only.

1: MajorAxisAlignedBoundingBox(R)
   Input: R = {x_i}, a binary region containing points x_i = (x_i, y_i). Returns the four corner points of the region's axis-aligned bounding box or nil if the region's orientation is undefined.
2:   Σx ← 0, Σy ← 0
3:   for all (x_i, y_i) ∈ R do
4:       Σx ← Σx + x_i
5:       Σy ← Σy + y_i
6:   (x_c, y_c) ← (1/|R|) · (Σx, Σy)              ▷ centroid x_c = (x_c, y_c)
7:   µ20 ← 0, µ02 ← 0, µ11 ← 0
8:   for all (x_i, y_i) ∈ R do                    ▷ central moments µ20, µ02, µ11
9:       µ20 ← µ20 + (x_i − x_c)²
10:      µ02 ← µ02 + (y_i − y_c)²
11:      µ11 ← µ11 + (x_i − x_c)·(y_i − y_c)
12:  A ← 2·µ11                                    ▷ see Eqn. (8.31)
13:  B ← µ20 − µ02
14:  (x_θ, y_θ) ← (B + √(A²+B²), A)               ▷ orient. vector (Eqn. (8.34))
15:  d ← √(x_θ² + y_θ²)
16:  if d < ϵ then                                ▷ denominator is near zero
17:      return nil                               ▷ orientation θ is undefined
18:  (x̂_θ, ŷ_θ) ← (1/d)·(x_θ, y_θ)               ▷ unit orient. vector x̂_θ (Eqn. (8.35))
19:  e_a ← (x̂_θ, ŷ_θ)^⊺                          ▷ unit vector parallel to region's major axis
20:  e_b ← (ŷ_θ, −x̂_θ)^⊺                         ▷ unit vector perpendicular to major axis
21:  a_min ← ∞, a_max ← −∞
22:  b_min ← ∞, b_max ← −∞
23:  for all x_i ∈ R do                           ▷ opt. use only outer contour of R
24:      a_i ← x_i^⊺·e_a                          ▷ project x_i onto e_a (Eqn. (8.45))
25:      a_min ← min(a_min, a_i)
26:      a_max ← max(a_max, a_i)
27:      b_i ← x_i^⊺·e_b                          ▷ project x_i onto e_b (Eqn. (8.47))
28:      b_min ← min(b_min, b_i)
29:      b_max ← max(b_max, b_i)
30:  p_0 ← a_min·e_a + b_min·e_b                  ▷ corners of the bounding box
31:  p_1 ← a_min·e_a + b_max·e_b
32:  p_2 ← a_max·e_a + b_max·e_b
33:  p_3 ← a_max·e_a + b_min·e_b
34:  return (p_0, p_1, p_2, p_3)



ϕ1 = η20 + η02,
ϕ2 = (η20 − η02)² + 4·η11²,
ϕ3 = (η30 − 3·η12)² + (3·η21 − η03)²,
ϕ4 = (η30 + η12)² + (η21 + η03)²,
ϕ5 = (η30 − 3·η12)·(η30 + η12)·[(η30 + η12)² − 3·(η21 + η03)²] + (3·η21 − η03)·(η21 + η03)·[3·(η30 + η12)² − (η21 + η03)²],
ϕ6 = (η20 − η02)·[(η30 + η12)² − (η21 + η03)²] + 4·η11·(η30 + η12)·(η21 + η03),
ϕ7 = (3·η21 − η03)·(η30 + η12)·[(η30 + η12)² − 3·(η21 + η03)²] + (3·η12 − η30)·(η21 + η03)·[3·(η30 + η12)² − (η21 + η03)²].    (8.50)

In practice, the logarithm of these quantities (that is, log(ϕk)) is used since the raw values may have a very large range. These features are also known as moment invariants since they are invariant under translation, rotation, and scaling. While defined here for binary images, they are also applicable to parts of grayscale images; examples can be found in [108, p. 517].

Flusser's Invariant Moments

It was shown in [88, 89] that Hu's moments, as listed in Eqn. (8.50), are at least partially redundant and incomplete. Based on so-called complex moments c_pq ∈ C, Flusser devised an improved set of 11 rotation and scale-invariant features ψ1, ..., ψ11 (see Eqn. (8.54)). For grayscale images (with I(u, v) ∈ R), the complex moments of order p, q are defined as

c_{p,q}(R) = Σ_{(u,v)∈R} I(u,v)·[(u−x_c) + i·(v−y_c)]^p · [(u−x_c) − i·(v−y_c)]^q,    (8.51)

with (x_c, y_c) being the centroid of R (i denotes the imaginary unit). In the case of binary images (with I(u, v) ∈ {0, 1}), Eqn. (8.51) reduces to

c_{p,q}(R) = Σ_{(u,v)∈R} [(u−x_c) + i·(v−y_c)]^p · [(u−x_c) − i·(v−y_c)]^q.    (8.52)



Analogous to Eqn. (8.26), the complex moments can be scale-normalized in the form

ĉ_{p,q}(R) = c_{p,q}(R) / A^((p+q+2)/2),    (8.53)

with A = |R| being the area of the region [90, p. 29]. Finally, the derived rotation and scale invariant region moments of 2nd to 4th order are

ψ1 = Re(ĉ1,1),            ψ2 = Re(ĉ2,1 · ĉ1,2),
ψ3 = Re(ĉ2,0 · ĉ1,2²),    ψ4 = Im(ĉ2,0 · ĉ1,2²),
ψ5 = Re(ĉ3,0 · ĉ1,2³),    ψ6 = Im(ĉ3,0 · ĉ1,2³),
ψ7 = Re(ĉ2,2),            ψ8 = Re(ĉ3,1 · ĉ1,2²),
ψ9 = Im(ĉ3,1 · ĉ1,2²),    ψ10 = Re(ĉ4,0 · ĉ1,2⁴),
ψ11 = Im(ĉ4,0 · ĉ1,2⁴).    (8.54)
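Java has no built-in complex type, so the complex moments of Eqns. (8.52)–(8.53) can be accumulated with separate real and imaginary parts. The following sketch assumes the binary region is given by parallel coordinate arrays u[], v[]; the representation and method names are assumptions made only for this illustration:

// Normalized complex moment c^_{p,q} (Eqns. (8.52)-(8.53)) of a binary
// region given by its pixel coordinates (u[i], v[i]). Returns {Re, Im}.
static double[] normalizedComplexMoment(int[] u, int[] v, int p, int q) {
  int N = u.length;
  double xc = 0, yc = 0;
  for (int i = 0; i < N; i++) { xc += u[i]; yc += v[i]; }
  xc /= N;  yc /= N;
  double re = 0, im = 0;
  for (int i = 0; i < N; i++) {
    double x = u[i] - xc, y = v[i] - yc;
    double[] zp = complexPow(x,  y, p);   // (x + i*y)^p
    double[] zq = complexPow(x, -y, q);   // (x - i*y)^q
    re += zp[0] * zq[0] - zp[1] * zq[1];
    im += zp[0] * zq[1] + zp[1] * zq[0];
  }
  double scale = Math.pow(N, 0.5 * (p + q + 2));   // A = |R| = N (Eqn. (8.53))
  return new double[] { re / scale, im / scale };
}

// Integer power of a complex number (re + i*im), by repeated multiplication.
static double[] complexPow(double re, double im, int k) {
  double r = 1, s = 0;
  for (int i = 0; i < k; i++) {
    double rNew = r * re - s * im;
    s = r * im + s * re;
    r = rNew;
  }
  return new double[] { r, s };
}

For example, ψ1 of Eqn. (8.54) would then simply be normalizedComplexMoment(u, v, 1, 1)[0].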



Table 8.1 lists the normalized Flusser moments for five binary shapes taken from the Kimia dataset [157]. (In Eqn. (8.54), the use of Re(·) for the quantities ψ1, ψ2, ψ7, which are real-valued per se, is redundant.)

Table 8.1 Binary shapes and associated normalized Flusser moments ψ1, ..., ψ11 for the five reference shapes (shape images not reproduced here). Notice that the magnitude of the moments varies by a large factor.

        Shape 1        Shape 2        Shape 3        Shape 4        Shape 5
ψ1      0.3730017575   0.2545476083   0.2154034257   0.2124041195   0.3600613700
ψ2      0.0012699373   0.0004247053   0.0002068089   0.0001089652   0.0017187073
ψ3      0.0004041515   0.0000644829   0.0000274491   0.0000014248  -0.0003853999
ψ4      0.0000097827  -0.0000076547   0.0000071688  -0.0000022103  -0.0001944121
ψ5      0.0000012672   0.0000002327   0.0000000637   0.0000000083  -0.0000078073
ψ6      0.0000001090  -0.0000000483   0.0000000041   0.0000000153  -0.0000061997
ψ7      0.2687922057   0.1289708408   0.0814034374   0.0712567626   0.2340886626
ψ8      0.0003192443   0.0000414818   0.0000134036   0.0000003020  -0.0002878997
ψ9      0.0000053208  -0.0000032541   0.0000030880  -0.0000008365  -0.0001628669
ψ10     0.0000103461   0.0000000091   0.0000000019  -0.0000000003   0.0000001922
ψ11     0.0000000120  -0.0000000020   0.0000000008  -0.0000000000   0.0000003015

Table 8.2 Inter-class (Euclidean) distances dE(A, B) between normalized shape feature vectors for the five reference shapes (see Eqn. (8.55)). Off-diagonal values should be consistently large to allow good shape discrimination.

          Shape 1  Shape 2  Shape 3  Shape 4  Shape 5
Shape 1   0.000    0.183    0.245    0.255    0.037
Shape 2   0.183    0.000    0.062    0.071    0.149
Shape 3   0.245    0.062    0.000    0.011    0.210
Shape 4   0.255    0.071    0.011    0.000    0.220
Shape 5   0.037    0.149    0.210    0.220    0.000

Shape Matching with Region Moments

One obvious use of invariant region moments is shape matching and classification. Given two binary shapes A and B, with associated moment ("feature") vectors

fA = (ψ1(A), ..., ψ11(A))   and   fB = (ψ1(B), ..., ψ11(B)),



respectively, one approach could be to simply measure the difference between shapes by the Euclidean distance of these vectors in the form

dE(A, B) = ∥fA − fB∥ = [ Σ_{i=1..11} |ψi(A) − ψi(B)|² ]^(1/2).    (8.55)

Concrete distances between the five sample shapes are listed in Table 8.2. Since the moment vectors are rotation and scale invariant, shape comparisons should remain unaffected by such transformations. Note, however, that the magnitude of the individual moments varies over a very large range. Thus, if the Euclidean distance is used as we have just suggested, the comparison (matching) of shapes is typically dominated by a few moments (or even a single moment) of relatively large magnitude, while the small-valued moments play virtually no role in the distance calculation. This is because the Euclidean distance treats the multi-dimensional feature space uniformly along all dimensions. As a consequence, moment-based shape discrimination with the ordinary Euclidean distance is typically not very selective. A simple solution is to replace Eqn. (8.55) by a weighted distance measure of the form

d′E(A, B) = [ Σ_{i=1..11} wi·|ψi(A) − ψi(B)|² ]^(1/2),    (8.56)

with fixed weights w1, ..., w11 ≥ 0 assigned to each moment feature to compensate for the differences in magnitude.



Although the invariance property holds perfectly for continuous shapes, rotating and scaling discrete binary images may significantly affect the associated region moments.
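A sketch of the weighted distance in Eqn. (8.56); with all weights set to 1 it reduces to the ordinary Euclidean distance of Eqn. (8.55). How the weights are chosen (e.g., from the spread of each feature over a training set) is an assumption left to the application and not specified here:

// Weighted Euclidean distance (Eqn. (8.56)) between two feature vectors.
static double weightedDistance(double[] fA, double[] fB, double[] w) {
  double sum = 0;
  for (int i = 0; i < fA.length; i++) {
    double d = fA[i] - fB[i];
    sum += w[i] * d * d;
  }
  return Math.sqrt(sum);
}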






A more elegant approach is to use the Mahalanobis distance [30, 179] for comparing the moment vectors, which accounts for the statistical distribution of each vector component and avoids large-magnitude components dominating the smaller ones. In this case, the distance calculation becomes

dM(A, B) = [ (fA − fB)^⊺ · Σ⁻¹ · (fA − fB) ]^(1/2),    (8.57)

where Σ is the 11 × 11 covariance matrix for the moment vectors f. Note that the expression under the root in Eqn. (8.57) is the dot product of a row vector and a column vector, that is, the result is a non-negative scalar value. The Mahalanobis distance can be viewed as a special form of the weighted Euclidean distance (Eqn. (8.56)), where the weights are determined by the variability of the individual vector components. See Sec. G.3 of the Appendix and Exercise 8.15 for additional details.

8.7 Projections

Image projections are 1D representations of the image contents, usually calculated parallel to the coordinate axes. In this case, the horizontal and vertical projections of a scalar-valued image I(u, v) of size M × N are defined as

Phor(v) = Σ_{u=0..M−1} I(u, v)   for 0 ≤ v < N,    (8.58)
Pver(u) = Σ_{v=0..N−1} I(u, v)   for 0 ≤ u < M.    (8.59)



232



The horizontal projection Phor (v0 ) (Eqn. (8.58)) is the sum of the pixel values in the image row v0 and has length N corresponding to the height of the image. On the other hand, a vertical projection Pver of length M is the sum of all the values in the image column u0 (Eqn. (8.59)). In the case of a binary image with I(u, v) ∈ 0, 1, the projection contains the count of the foreground pixels in the corresponding image row or column. Program 8.4 gives a direct implementation of the projection calculations as the run() method for an ImageJ plugin, where projections in both directions are computed during a single traversal of the image. Projections in the direction of the coordinate axis are often utilized to quickly analyze the structure of an image and isolate its component parts; for example, in document images it is used to separate graphic elements from text blocks as well as to isolate individual lines (see the example in Fig. 8.22). In practice, especially to account for document skew, projections are often computed along the major axis of an image region Eqn. (8.28). When the projection vectors of a region are computed in reference to the centroid of the region along the major axis, the result is a rotation-invariant vector description (often referred to as a “signature”) of the region.



Prog. 8.4 Calculation of horizontal and vertical projections (ImageJ plugin). The run() method calculates the projections in x and y directions simultaneously in a single iteration over the image I. The projections are represented by the one-dimensional int arrays pHor and pVer.

public void run(ImageProcessor I) {
  int M = I.getWidth();
  int N = I.getHeight();
  int[] pHor = new int[N];    // = Phor(v)
  int[] pVer = new int[M];    // = Pver(u)
  for (int v = 0; v < N; v++) {
    for (int u = 0; u < M; u++) {
      int p = I.getPixel(u, v);
      pHor[v] += p;
      pVer[u] += p;
    }
  }
  // use projections pHor, pVer now ...
}



Fig. 8.22 Horizontal (Phor(v)) and vertical (Pver(u)) projections of a binary image.



8.8 Topological Region Properties

Topological features do not describe the shape of a region in continuous terms; instead, they capture its structural properties. Topological properties are typically invariant even under strong image transformations. The convexity of a region, which can be calculated from the convex hull (Sec. 8.4.2), is also a topological property. A simple and robust topological feature is the number of holes NL(R) in a region. This feature is easily determined while finding the inner contours of a region, as described in Sec. 8.2.2. A useful topological feature that can be derived directly from the number of holes is the so-called Euler number NE, which is the difference between the number of connected regions NR and the number of their holes NL, that is,

NE(R) = NR(R) − NL(R).    (8.60)



Fig. 8.23 Visual identification markers composed of recursively nested regions [28].



In the case of a single connected region this is simply 1−NL . For a picture of the number “8”, for example, NE = 1 − 2 = −1 and for the letter “D” we get NE = 1−1 = 0. Topological features are often used in combination with numerical features for classification. A classic example of this combination is OCR (optical character recognition) [47]. Figure 8.23 shows an interesting use of topological structures for coding optical markers used in augmented reality applications [28].21 The recursive nesting of outer and inner regions is equivalent to a tree structure that allows fast and unique identification of a larger number of known patterns (see also Exercise 8.20).



8.9 Java Implementation Most algorithms described in this chapter are implemented as part of the imagingbook library.22 The key classes related to region and countour segmentation are BinaryRegion and Contour, the abstract class BinaryRegionSegmentation and its concrete sub-classes • • • • •



RecursiveSegmentation (Alg. 8.2), DepthFirstSegmentation (Alg. 8.2), BreadthFirstSegmentation (Alg. 8.2), SequentialSegmentation (Alg. 8.3), RegionContourSegmentation (Algs. 8.4–8.5).



Program 8.2 (p. 211) shows a typical usage example for these classes. Additional details can be found in the online code and documentation.



8.10 Exercises Exercise 8.1. Manually simulate the execution of both variations (depth-first and breadth-first) of the flood-fill algorithm using the image in Fig. 8.24 and starting at position (5, 1). Exercise 8.2. The implementation of the flood-fill algorithm in Prog. 8.1 places all the neighboring pixels of each visited pixel into either the stack or the queue without ensuring they are foreground pixels and that they lie within the image boundaries. The number of items 21 22



234



https://github.com/mkalten/reacTIVision Package imagingbook.pub.regions



0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7



0 0 0 0 0 0 0 0



0 0 1 0 1 0 1 0



0 0 1 0 1 0 1 0



0 0 1 0 1 0 0 0



0 0 1 1 1 1 0 0



0 1 1 0 1 1 0 0



0 1 1 1 1 1 1 0



0 0 0 0 1 1 0 0



0 0 0 0 1 1 1 0



0 1 1 0 1 1 0 0



0 1 0 0 1 1 0 0



0 0 0 0 1 1 0 0



0 1 1 1 1 1 0 0



0 0 0 0 0 0 0 0



0 Background 1 Foreground



8.10 Exercises Fig. 8.24 Binary image for Exercise 8.1.



in the stack or the queue can be reduced by ignoring (not inserting) those neighboring pixels that do not meet the two conditions given. Modify the depth-first and breadth-first variants given in Prog. 8.1 accordingly and compare the new running times. Exercise 8.3. Implement an ImageJ plugin that encodes a given binary image by run length encoding (Sec. 8.3.2) and stores it in a file. Develop a second plugin that reads the file and reconstructs the image. Exercise 8.4. Estimate the amount of memory required to represent a contour with 1000 points in the following ways: (a) as a sequence of coordinate points stored as pairs of int values; (b) as an 8-chain code using Java byte elements, and (c) as an 8-chain code using only 3 bits per element. Exercise 8.5. Implement a Java class for describing a binary image region using chain codes. It is up to you, whether you want to use an absolute or differential chain code. The implementation should be able to encode closed contours as chain codes and also reconstruct the contours given a chain code. Exercise 8.6. The Graham Scan method [111] is an efficient algorithm for calculating the convex hull of a 2D point set (of size n), with time complexity O(n · log(n)).23 Implement this algorithm and show that it is sufficient to consider only the outer contour points of a region to calculate its convex hull. Exercise 8.7. While computing the convex hull of a region, the maximal diameter (maximum distance between two arbitrary points) can also be simply found. Devise an alternative method for computing this feature without using the convex hull. Determine the running time of your algorithm in terms of the number of points in the region. Exercise 8.8. Implement an algorithm for comparing contours using their shape numbers Eqn. (8.6). For this purpose, develop a metric for measuring the distance between two normalized chain codes. Describe if, and under which conditions, the results will be reliable. Exercise 8.9. Sketch the contour equivalent to the absolute chain code sequence c′R = (6, 7, 7, 1, 2, 0, 2, 3, 5, 4, 4). (a) Choose an arbitrary starting point and determine if the resulting contour is closed. (b) Find the associated differential chain code c′′R (Eqn. (8.5)). 23



See also http://en.wikipedia.org/wiki/Graham_scan.



235



8 Regions in Binary Images



Exercise 8.10. Calculate (under assumed 8-neighborhood) the shape number of base b = 8 (see Eqn. (8.6)) for the differential chain code c′′R = (1, 0, 2, 1, 6, 2, 1, 2, 7, 0, 2) and all possible circular shifts of this code. Which shift yields the maximum arithmetic value? Exercise 8.11. Using Eqn. (8.14) as the basis, develop and implement an algorithm that computes the area of a region from its 8chain-encoded contour (see also [295], [149, Sec. 19.5]). Exercise 8.12. Modify Alg. 8.4 such that the outer and inner contours are not returned as individual lists (Co , Ci ) but as a composite tree structure. An outer contour thus represents a region that may contain zero, one, or more inner contours (i.e., holes). Each inner contour may again contain other regions (i.e., outer contours), and so on. Exercise 8.13. Sketch an example binary region where the centroid does not lie inside the region itself. Exercise 8.14. Implement the binary region moment features proposed by Hu (Eqn. (8.50)) and/or Flusser (Eqn. (8.54)) and verify that they are invariant under image scaling and rotation. Use the test image in Fig. 8.2524 (or create your own), which contains rotated and mirrored instances of the reference shapes, in addition to other (unknown) shapes. Exercise 8.15. Implement the Mahalanobis distance calculation, as defined in Eqn. (8.57), for measuring the similarity between shape moment vectors. A. Compute the covariance matrix25 (Σ) for the Flusser shape features ψ1 , . . . , ψ11 (see Eqn. (8.54)) of the reference images in Table 8.1. Calculate the inter-class Mahalanobis distances for the reference shapes, analogous to the example in Table 8.2. B. Extend your analysis to a larger set of 500–1000 shapes (e.g., from the Kimia dataset [157], which contains more than 20 000 binary shape images). Calculate the normalized moment features and the covariance matrix Σ for the entire image set. Calculate the inter-class distance matrices for (a) the Euclidean and (b) the Mahalanobis distance. Display the distance matrices as grayscale images (FloatProcessor) and interpret them. Exercise 8.16. There are alternative definitions for the eccentricity of a region Eqn. (8.37); for example [150, p. 394], 2



Ecc2 (R) =



[µ20 (R) − µ02 (R)] + 4·µ211 (R) [µ20 (R) + µ02 (R)]



2



.



(8.61)



Implement this version as well as the one in Eqn. (8.37) and contrast the results using suitably designed regions. Determine the numeric range of these quantities and test if they are really rotation and scaleinvariant. 24 25



236



Images are available on the book’s website. See Sec. G.3 of the Appendix.



8.10 Exercises Fig. 8.25 Test image for moment-based shape matching. Reference shapes (top) and test image (bottom) composed of rotated and/or scaled shapes from the Kimia database and additional (unclassified) shapes.



Exercise 8.17. Write an ImageJ plugin that (a) finds (labels) all regions in a binary image, (b) computes the orientation and eccentricity for each region, and (c) shows the results as a direction vector and the equivalent ellipse on top of each region (as exemplified in Fig. 8.19). Hint: Use Eqn. (8.42) to develop a method for drawing ellipses at arbitrary orientations (not available in ImageJ). Exercise 8.18. The Java method in Prog. 8.4 computes an image’s horizontal and vertical projections. The scheme described in Sec. 8.6.4 and illustrated in Fig. 8.20 can be used to calculate projections along arbitrary directions θ. Develop and implement such a process and display the resulting projections. Exercise 8.19. Text recognition (OCR) methods are likely to fail if the document image is not perfectly axis-aligned. One method for estimating the skew angle of a text document is to perform binary segmentation and connected components analysis (see Fig. 8.26): • Smear the original binary image by applying a disk-shaped morphological dilation with a specified radius (see Sec. 7.2.3). The aim is to close the gaps between neighboring glyphs without closing the space between adjacent text lines (Fig. 8.26 (b)) • Apply region segmentation to the resulting image and calculate the orientation θ(R) and the eccentricity E(R) of each region R (see Secs. 8.6.1 and 8.6.2). Ignore all regions that are either too small or not sufficiently elongated. • Estimate the global skew angle by averaging the regions’ orientations θi . Note that since angles are circular quantities they cannot be averaged in the usual way (see Eqn. (15.15) on p. 459 for how to calculate the mean of a circular quantity). Consider using the eccentricity as a weight for the contribution of the associated region to the global average.



237



8 Regions in Binary Images Fig. 8.26 Document skew estimation example (see Exercise 8.19). Original binary image (a); result of applying a disk-shaped morphological dilation with radius 3.0 (b); region orientation vectors (c); histogram of the orientation angle θ over ±180◦ (d) and ±45◦ (e). The real skew angle in this scan is approximately 1.1◦ .



(a)



(b)



(d)



(c)



(e)



• Obviously, this scheme is sensitive to outliers, that is, against angles that deviate strongly from the average orientation. Try to improve this estimate (i.e., make it more robust and accurate) by iteratively removing angles that are “too far” from the average orientation and then recalculating the result. Exercise 8.20. Draw the tree structure, defined by the recursive nesting of outer and inner regions, for each of the markers shown in Fig. 8.23. Based on this graph structure, suggest an algorithm for matching pairs of markers or, alternatively, for retrieving the bestmatching marker from a database of markers. Exercise 8.21. A first step in “automatically” solving a jigsaw puzzle could be to locate the sharp corners of each puzzle piece, as shown in Fig. 8.27. This could be approached as follows:



238



• Apply binary region segmentation. Ignore regions that are below a certain size (small clutter).



8.10 Exercises Fig. 8.27 Clean jigsaw puzzle pieces. Corners are (manually) marked on the first puzzle piece.



• Traverse the outer contour of each region and measure the local curvature: Use a sliding straight line segment that connects two contour points that are D steps apart and calculate the maximum distance (from that line segment) of the points in between. • Corner points must be located at convex points of the contour. • Set a threshold and mark the best corner candidates. • One may also make use of the fact that each piece should have exactly four corners that are evenly spread around the contour (i.e., not too close to each other). Describe your algorithm and parameters carefully and try to come up with a working implementation.



239



9 Automatic Thresholding



Although techniques based on binary image regions have been used for a very long time, they still play a major role in many practical image processing applications today because of their simplicity and efficiency. To obtain a binary image, the first and perhaps most critical step is to convert the initial grayscale (or color) image to a binary image, in most cases by performing some form of thresholding operation, as described in Sec. 3.1.4. Anyone who has ever tried to convert a scanned document image to a readable binary image has experienced how sensitively the result depends on the proper choice of the threshold value. This chapter deals with finding the best threshold automatically only from the information contained in the image, i.e., in an “unsupervised” fashion. This may be a single, “global” threshold that is applied to the whole image or different thresholds for different parts of the image. In the latter case we talk about “adaptive” thresholding, which is particularly useful when the image exhibits a varying background due to uneven lighting, exposure, or viewing conditions. Automatic thresholding is a traditional area of research that had its peak in the 1980s and 1990s but is still very active. Numerous methods have been developed for this task, ranging from simple ad-hoc solutions to complex algorithms with firm theoretical foundations, as documented in several reviews and evaluation studies [105,204,229,241,263]. Binarization of images is also considered a “segmentation” technique and thus often categorized under this term. In the following, we describe some representative and popular techniques in greater detail, starting in Sec. 9.1 with global thresholding methods and continuing with adaptive methods in Sec. 9.2.



9.1 Global Histogram-Based Thresholding

Given a grayscale image I, the task is to find a single "optimal" threshold value for binarizing this image. Applying a particular threshold q is equivalent to classifying each pixel as being either part of the background or the foreground. Thus the set of all image pixels is partitioned into two disjoint sets C0 and C1, where C0 contains all elements with values in [0, 1, ..., q] and C1 collects the remaining elements with values in [q+1, ..., K−1], that is,

p ∈ C0  if I(p) ≤ q  (background),
p ∈ C1  if I(p) > q  (foreground).    (9.1)

Fig. 9.1 Test images used for subsequent thresholding experiments. Detail from a manuscript by Johannes Kepler (a), document with fingerprint (b), ARToolkit marker (c), synthetic two-level Gaussian mixture image (d). Results of thresholding with the fixed threshold value q = 128 (e–h). Histograms of the original images (i–l) with intensity values from 0 (left) to 255 (right).
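Applying a given threshold q as in Eqn. (9.1) to an 8-bit ImageJ image is straightforward. In this minimal sketch (written in the style of Prog. 8.4), marking foreground pixels with 255 and background pixels with 0 is an assumption of the example, not a requirement of the text:

// Binarize an 8-bit image with threshold q (Eqn. (9.1)):
// values <= q become background (0), values > q become foreground (255).
void applyThreshold(ImageProcessor ip, int q) {
  for (int v = 0; v < ip.getHeight(); v++) {
    for (int u = 0; u < ip.getWidth(); u++) {
      ip.set(u, v, (ip.get(u, v) <= q) ? 0 : 255);
    }
  }
}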






Of course, the meaning of background and foreground may differ from one application to another. For example, the aforementioned scheme is quite natural for astronomical or thermal images, where the relevant “foreground” pixels are bright and the background is dark. Conversely, in document analysis, for example, the objects of interest are usually the dark letters or artwork printed on a bright background. This should not be confusing and of course one can always invert the image to adapt to this scheme, so there is no loss of generality here. Figure 9.1 shows several test images used in this chapter and the result of thresholding with a fixed threshold value. The synthetic image in Fig. 9.1 (d) is the mixture of two Gaussian random distributions N0 , N1 for the background and foreground, respectively, with µ0 = 80, µ1 = 170, σ0 = σ1 = 20. The corresponding histograms of the test images are shown in Fig. 9.1(i–l). Note that all histograms are normalized to constant area (not to maximum values, as usual), with intensity values ranging from 0 (left) to 255 (right). The key question is how to find a suitable (or even “optimal”) threshold value for binarizing the image. As the name implies, histogram-based methods calculate the threshold primarily from the information contained in the image’s histogram, without inspecting the actual image pixels. Other methods process individual pixels for finding the threshold and there are also hybrid methods that rely both on the histogram and the local image content. Histogram-based



techniques are usually simple and efficient, because they operate on a small set of data (256 values in case of an 8-bit histogram); they can be grouped into two main categories: shape-based and statistical methods.

Shape-based methods analyze the structure of the histogram's distribution, for example by trying to locate peaks, valleys, and other "shape" features. Usually the histogram is first smoothed to eliminate narrow peaks and gaps. While shape-based methods were quite popular early on, they are usually not as robust as their statistical counterparts or at least do not seem to offer any distinct advantages. A classic representative of this category is the "triangle" (or "chord") algorithm described in [294]. References to numerous other shape-based methods can be found in [241].

Statistical methods, as their name suggests, rely on statistical information derived from the image's histogram (which of course is a statistic itself), such as the mean, variance, or entropy. In the next section, we discuss a few elementary parameters that can be obtained from the histogram, followed by a description of concrete algorithms that use this information. Again there are a vast number of similar methods and we have selected four representative algorithms to be described in more detail: (a) iterative threshold selection by Ridler and Calvard [221], (b) Otsu's clustering method [203], (c) the minimum error method by Kittler and Illingworth [138], and (d) the maximum entropy thresholding method by Kapur, Sahoo, and Wong [156].

9.1.1 Image Statistics from the Histogram

As described in Chapter 2, Sec. 2.7, several statistical quantities, such as the arithmetic mean, variance, and median, can be calculated directly from the histogram, without reverting to the original image data. If we threshold the image at level q (0 ≤ q < K), the set of pixels is partitioned into the disjoint subsets C0, C1, corresponding to the background and the foreground. The number of pixels assigned to each subset is

n0(q) = |C0| = Σ_{g=0..q} h(g)   and   n1(q) = |C1| = Σ_{g=q+1..K−1} h(g),    (9.2)

respectively. Also, because all pixels are assigned to either the background set C0 or the foreground set C1,

n0(q) + n1(q) = |C0| + |C1| = |C0 ∪ C1| = MN.    (9.3)

For any threshold q, the mean values of the associated partitions C0, C1 can be calculated from the image histogram as

µ0(q) = (1/n0(q)) · Σ_{g=0..q} g·h(g),    (9.4)
µ1(q) = (1/n1(q)) · Σ_{g=q+1..K−1} g·h(g),    (9.5)

and these quantities relate to the image's overall mean µI (Eqn. (2.9)) by

µI = (1/MN) · [ n0(q)·µ0(q) + n1(q)·µ1(q) ] = µ0(K−1).    (9.6)

Analogously, the variances of the background and foreground partitions can be extracted from the histogram as

σ0²(q) = (1/n0(q)) · Σ_{g=0..q} (g − µ0(q))²·h(g),
σ1²(q) = (1/n1(q)) · Σ_{g=q+1..K−1} (g − µ1(q))²·h(g).    (9.7)

(Of course, as in Eqn. (2.12), this calculation can also be performed in a single iteration and without knowing µ0(q), µ1(q) in advance.) The overall variance σI² for the whole image is identical to the variance of the background for q = K−1,

σI² = (1/MN) · Σ_{g=0..K−1} (g − µI)²·h(g) = σ0²(K−1),    (9.8)

that is, for all pixels being assigned to the background partition. Note that, unlike the simple relation of the means given in Eqn. (9.6),

σI² ≠ (1/MN) · [ n0(q)·σ0²(q) + n1(q)·σ1²(q) ]    (9.9)



in general (see also Eqn. (9.20)). We will use these basic relations in the discussion of histogram-based threshold selection algorithms in the following and add more specific ones as we go along. (Note that µ0(q), µ1(q) are meant to be functions over q, and thus µ0(K−1) in Eqn. (9.6) denotes the mean of the pixel values in partition C0 for the threshold q = K−1; likewise, σ0²(q) and σ1²(q) in Eqn. (9.7) are also functions over q.)

9.1.2 Simple Threshold Selection

Clearly, the choice of the threshold value should not be fixed but somehow based on the content of the image. In the simplest case, we could use the mean of all image pixels,

q ← mean(I) = µI,    (9.10)

as the threshold value q, or the median (see Sec. 2.7.2),

q ← median(I) = mI,    (9.11)

or, alternatively, the average of the minimum and the maximum (mid-range value), that is,

q ← (max(I) + min(I)) / 2.    (9.12)

Like the image mean µI (see Eqn. (2.9)), all these quantities can be obtained directly from the histogram h. Thresholding at the median segments the image into approximately equal-sized background and foreground sets, that is, |C0| ≈ |C1|, which assumes that the "interesting" (foreground) pixels cover about half of the image. This may be appropriate for certain images, but completely wrong for others. For example, a scanned text image will typically contain a lot more white than black pixels, so using the median threshold would probably be unsatisfactory in this case. If the approximate fraction p (0 < p < 1) of expected background pixels is known in advance, the threshold could be set to that quantile instead. In this case, q is simply chosen as

q ← min { i | Σ_{j=0..i} h(j) ≥ MN·p },    (9.13)

where MN is the total number of pixels. We see that the median is only a special case of a quantile measure, with p = 0.5. This simple thresholding method is summarized in Alg. 9.1.

Alg. 9.1 Quantile thresholding. The optimal threshold value q ∈ [0, K−2] is returned, or −1 if no valid threshold was found. Note the test in line 9 to check if the foreground is empty or not (the background is always non-empty by definition).

1: QuantileThreshold(h, p)
   Input: h : [0, K−1] → N, a grayscale histogram; p, the proportion of expected background pixels (0 < p < 1). Returns the optimal threshold value or −1 if no threshold is found.
2:   K ← size(h)                          ▷ number of intensity levels
3:   MN ← Σ_{i=0..K−1} h(i)               ▷ number of image pixels
4:   i ← 0
5:   c ← h(0)
6:   while (i < K) ∧ (c < MN·p) do        ▷ quantile calc. (Eqn. (9.13))
7:       i ← i+1
8:       c ← c + h(i)
9:   if c < MN then                       ▷ foreground is non-empty
10:      q ← i
11:  else                                 ▷ foreground is empty, all pixels are background
12:      q ← −1
13:  return q

For the mid-range technique (Eqn. (9.12)), the limiting intensity values min(I) and max(I) can be found by searching for the smallest and largest non-zero entries, respectively, in the histogram h. The mid-range threshold segments the image at 50 % (or any other percentile) of the contrast range. In this case, nothing can be said in general about the relative sizes of the resulting background and foreground partitions. Because a single extreme pixel value (outlier) may change the contrast range dramatically, this approach is not very robust. Here too it is advantageous to define the contrast range by specifying pixel quantiles, analogous to the calculation of the quantities a′low and a′high in the modified auto-contrast function (see Sec. 3.4).
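The simple threshold choices of Eqns. (9.10) and (9.13) operate on the histogram alone. The following Java sketch mirrors Alg. 9.1 and additionally shows the mean threshold; the method names and the int[] histogram representation are assumptions of this example (it is not the imagingbook API):

// p-quantile threshold from a histogram h (Alg. 9.1 / Eqn. (9.13)).
static int quantileThreshold(int[] h, double p) {
  int K = h.length;
  long MN = 0;
  for (int g = 0; g < K; g++) MN += h[g];   // total pixel count
  long c = h[0];
  int i = 0;
  while (i < K - 1 && c < MN * p) {         // quantile calculation
    i++;
    c += h[i];
  }
  return (c < MN) ? i : -1;                 // -1 if the foreground would be empty
}

// Mean threshold (Eqn. (9.10)), also computed from the histogram only.
static int meanThreshold(int[] h) {
  long MN = 0, sum = 0;
  for (int g = 0; g < h.length; g++) {
    MN += h[g];
    sum += (long) g * h[g];
  }
  return (int) (sum / MN);                  // overall mean intensity
}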



Fig. 9.2 Results from various simple thresholding schemes. Mean (a–d), median (e–h), and mid-range (i–l) threshold, as defined in Eqns. (9.10)–(9.12). Resulting threshold values: (a) q = 158, (b) q = 144, (c) q = 158, (d) q = 84, (e) q = 179, (f) q = 161, (g) q = 165, (h) q = 81, (i) q = 115, (j) q = 128, (k) q = 128, (l) q = 120.



In the pathological (but nevertheless possible) case that all pixels in the image have the same intensity g, all the aforementioned methods will return the threshold q = g, which assigns all pixels to the background partition and leaves the foreground empty. Algorithms should try to detect this situation, because thresholding a uniform image obviously makes no sense. Results obtained with these simple thresholding techniques are shown in Fig. 9.2. Despite the obvious limitations, even a simple automatic threshold selection (such as the quantile technique in Alg. 9.1) will typically yield more reliable results than the use of a fixed threshold.

9.1.3 Iterative Threshold Selection (Isodata Algorithm)






This classic iterative algorithm for finding an optimal threshold is attributed to Ridler and Calvard [221] and was related to Isodata clustering by Velasco [275]. It is thus sometimes referred to as the “isodata” or “intermeans” algorithm. Like in many other global thresholding schemes it is assumed that the image’s histogram is a mixture of two separate distributions, one for the intensities of the background pixels and the other for the foreground pixels. In this case, the two distributions are assumed to be Gaussian with approximately identical spreads (variances). The algorithm starts by making an initial guess for the threshold, for example, by taking the mean or the median of the whole image. This splits the set of pixels into a background and a foreground set, both of which should be non-empty. Next, the means of both sets are calculated and the threshold is repositioned to their average, that is,



centered between the two means. The means are then re-calculated for the resulting background and foreground sets, and so on, until the threshold does not change any longer. In practice, it takes only a few iterations for the threshold to converge.

This procedure is summarized in Alg. 9.2. The initial threshold is set to the overall mean (line 3). For each threshold q, separate mean values µ0, µ1 are computed for the corresponding background and foreground partitions. The threshold is repeatedly set to the average of the two means until no more change occurs. The clause in line 7 tests if either the background or the foreground partition is empty, which will happen, for example, if the image contains only a single intensity value. In this case, no valid threshold exists and the procedure returns −1. The functions Count(h, a, b) and Mean(h, a, b) in lines 15–16 return the number of pixels and the mean, respectively, of the image pixels with intensity values in the range [a, b]. Both can be computed directly from the histogram h without inspecting the image itself.

Alg. 9.2 "Isodata" threshold selection.

1:  IsodataThreshold(h)
    Input: h : [0, K−1] ↦ N, a grayscale histogram.
    Returns the optimal threshold value or −1 if no threshold is found.
2:  K ← size(h)                                ▷ number of intensity levels
3:  q ← Mean(h, 0, K−1)                        ▷ initial threshold (overall mean)
4:  do                                         ▷ repeat until no more change
5:      n0 ← Count(h, 0, q)                    ▷ background population
6:      n1 ← Count(h, q+1, K−1)                ▷ foreground population
7:      if (n0 = 0) ∨ (n1 = 0) then            ▷ background or foreground is empty
8:          return −1
9:      µ0 ← Mean(h, 0, q)                     ▷ background mean
10:     µ1 ← Mean(h, q+1, K−1)                 ▷ foreground mean
11:     q′ ← q                                 ▷ keep previous threshold
12:     q ← ⌊(µ0 + µ1)/2⌋                      ▷ calculate the new threshold
13: while q ≠ q′
14: return q

15: Count(h, a, b) := Σ_{g=a}^{b} h(g)
16: Mean(h, a, b) := ( Σ_{g=a}^{b} g·h(g) ) / ( Σ_{g=a}^{b} h(g) )

The performance of this algorithm can be easily improved by using precalculated tables µ0(q), µ1(q) for the background and foreground means, respectively. The modified, table-based version of the iterative threshold selection procedure is shown in Alg. 9.3. It requires two passes over the histogram to initialize the tables µ0, µ1 and only a small, constant number of computations for each iteration in its main loop. Note that the image's overall mean µI, used as the initial guess for the threshold q (Alg. 9.3, line 4), need not be calculated separately but can be obtained as µI = µ0(K−1), given that threshold q = K−1 assigns all image pixels to the background. The time complexity of this algorithm is thus O(K), that is, linear w.r.t. the size of the histogram.
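The basic (non-tabulated) variant of Alg. 9.2 translates almost directly into Java. The sketch below is our own rendering, not the book's implementation; it works purely on the histogram, using helper functions that correspond to Count() and Mean().

    /** Iterative ("isodata") threshold selection on a grayscale histogram (see Alg. 9.2). */
    static int isodataThreshold(int[] h) {
        int K = h.length;
        int q = (int) mean(h, 0, K - 1);            // initial threshold = overall mean
        int qPrev;
        do {
            long n0 = count(h, 0, q);               // background population
            long n1 = count(h, q + 1, K - 1);       // foreground population
            if (n0 == 0 || n1 == 0)
                return -1;                          // one of the partitions is empty
            double mu0 = mean(h, 0, q);             // background mean
            double mu1 = mean(h, q + 1, K - 1);     // foreground mean
            qPrev = q;
            q = (int) Math.floor((mu0 + mu1) / 2);  // reposition threshold between the means
        } while (q != qPrev);
        return q;
    }

    static long count(int[] h, int a, int b) {
        long n = 0;
        for (int g = a; g <= b; g++) n += h[g];
        return n;
    }

    static double mean(int[] h, int a, int b) {
        long n = 0, s = 0;
        for (int g = a; g <= b; g++) { n += h[g]; s += (long) g * h[g]; }
        return (n > 0) ? (double) s / n : -1;
    }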



Alg. 9.3 Fast version of "isodata" threshold selection. Precalculated tables are used for the foreground and background means µ0 and µ1, respectively.

1:  FastIsodataThreshold(h)
    Input: h : [0, K−1] ↦ N, a grayscale histogram.
    Returns the optimal threshold value or −1 if no threshold is found.
2:  K ← size(h)                                ▷ number of intensity levels
3:  (µ0, µ1) ← MakeMeanTables(h)
4:  q ← ⌊µ0(K−1)⌋                              ▷ take the overall mean µI as initial threshold
5:  do                                         ▷ repeat until no more change
6:      if (µ0(q) < 0) ∨ (µ1(q) < 0) then
7:          return −1                          ▷ background or foreground is empty
8:      q′ ← q                                 ▷ keep previous threshold
9:      q ← ⌊(µ0(q) + µ1(q))/2⌋                ▷ calculate the new threshold
10: while q ≠ q′
11: return q

12: MakeMeanTables(h)
13: K ← size(h)
14: Create maps µ0, µ1 : [0, K−1] ↦ R
15: n0 ← 0, s0 ← 0
16: for q ← 0, . . . , K−1 do                  ▷ tabulate background means µ0(q)
17:     n0 ← n0 + h(q)
18:     s0 ← s0 + q·h(q)
19:     µ0(q) ← s0/n0 if n0 > 0, −1 otherwise
20: n1 ← 0, s1 ← 0
21: µ1(K−1) ← 0
22: for q ← K−2, . . . , 0 do                  ▷ tabulate foreground means µ1(q)
23:     n1 ← n1 + h(q+1)
24:     s1 ← s1 + (q+1)·h(q+1)
25:     µ1(q) ← s1/n1 if n1 > 0, −1 otherwise
26: return (µ0, µ1)

Fig. 9.3 Results of thresholding with the isodata algorithm: binarized images and the corresponding optimal threshold values (a) q = 128, (b) q = 125, (c) q = 94, (d) q = 90.



Figure 9.3 shows the results of thresholding with the isodata algorithm applied to the test images in Fig. 9.1.

9.1.4 Otsu's Method






The method proposed by Otsu [170, 203] also assumes that the original image contains pixels from two classes, whose intensity distributions are unknown. The goal is to find a threshold q such that the resulting background and foreground distributions are maximally separated, which means that they are (a) each as narrow as possible (have minimal variances) and (b) their centers (means) are most distant from each other.



For a given threshold q, the variances of the corresponding background and foreground partitions can be calculated straight from the image's histogram (see Eqn. (9.7)). The combined width of the two distributions is measured by the within-class variance

    σw²(q) = P0(q)·σ0²(q) + P1(q)·σ1²(q)                            (9.14)
           = (1/MN) · [ n0(q)·σ0²(q) + n1(q)·σ1²(q) ],              (9.15)

where

    P0(q) = Σ_{i=0}^{q} p(i) = (1/MN) · Σ_{i=0}^{q} h(i) = n0(q)/MN,            (9.16)
    P1(q) = Σ_{i=q+1}^{K−1} p(i) = (1/MN) · Σ_{i=q+1}^{K−1} h(i) = n1(q)/MN     (9.17)

are the class probabilities for C0, C1, respectively. Thus the within-class variance in Eqn. (9.15) is simply the sum of the individual variances weighted by the corresponding class probabilities or "populations". Analogously, the between-class variance,

    σb²(q) = P0(q)·(µ0(q) − µI)² + P1(q)·(µ1(q) − µI)²                          (9.18)
           = (1/MN) · [ n0(q)·(µ0(q) − µI)² + n1(q)·(µ1(q) − µI)² ],            (9.19)

measures the distances between the cluster means µ0, µ1 and the overall mean µI. The total image variance σI² is the sum of the within-class variance and the between-class variance, that is,

    σI² = σw²(q) + σb²(q),                                          (9.20)



for q = 0, . . . , K−1. Since σI² is constant for a given image, the threshold q can be found by either minimizing the within-class variance σw² or maximizing the between-class variance σb². The natural choice is to maximize σb², because it only relies on first-order statistics (i.e., the within-class means µ0, µ1). Since the overall mean µI can be expressed as the weighted sum of the partition means µ0 and µ1 (Eqn. (9.6)), we can simplify Eqn. (9.19) to

    σb²(q) = P0(q) · P1(q) · [µ0(q) − µ1(q)]²                       (9.21)
           = (1/(MN)²) · n0(q) · n1(q) · [µ0(q) − µ1(q)]² .         (9.22)

The optimal threshold is finally found by maximizing the expression for the between-class variance in Eqn. (9.22) with respect to q, thereby minimizing the within-class variance in Eqn. (9.15). Noting that σb²(q) only depends on the means (and not on the variances) of the two partitions for a given threshold q allows for a very efficient implementation, as outlined in Alg. 9.4. The algorithm assumes a grayscale image with a total of N pixels and K intensity levels. As in Alg. 9.3, precalculated tables µ0(q), µ1(q) are used for the background and foreground means for all possible threshold values q = 0, . . . , K−1.
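The essence of this optimization is compactly expressed in Java below. This is a sketch of our own (names and structure are not taken from the book's library); it maximizes n0·n1·(µ0 − µ1)², i.e., Eqn. (9.22) without the constant 1/(MN)² factor, and uses double arithmetic to avoid the integer-overflow issue discussed further down.

    /** Otsu's threshold on a grayscale histogram (see Alg. 9.4 and Eqn. (9.22)). */
    static int otsuThreshold(int[] h) {
        int K = h.length;
        long N = 0;                 // total pixel count
        double sumAll = 0;          // sum of g * h(g) over all bins
        for (int g = 0; g < K; g++) {
            N += h[g];
            sumAll += (double) g * h[g];
        }
        long n0 = 0;                // background population for current q
        double sum0 = 0;            // sum of g * h(g) for g <= q
        double sbMax = -1;
        int qMax = -1;
        for (int q = 0; q < K - 1; q++) {
            n0 += h[q];
            sum0 += (double) q * h[q];
            long n1 = N - n0;
            if (n0 > 0 && n1 > 0) {
                double mu0 = sum0 / n0;                     // background mean
                double mu1 = (sumAll - sum0) / n1;          // foreground mean
                double sb = (double) n0 * n1 * (mu0 - mu1) * (mu0 - mu1);  // ∝ σb²(q)
                if (sb > sbMax) {
                    sbMax = sb;
                    qMax = q;
                }
            }
        }
        return qMax;
    }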






Alg. 9.4 Finding the optimal threshold using Otsu's method [203]. Initially (outside the for-loop), threshold q is assumed to be −1, which corresponds to the background class being empty (n0 = 0) and all pixels assigned to the foreground class (n1 = N). The for-loop (lines 8–15) examines each possible threshold q = 0, . . . , K−2. The constant factor 1/(MN)² has been omitted in line 12, since it is not relevant for the optimization. The optimal threshold value is returned, or −1 if no valid threshold was found. The function MakeMeanTables() is defined in Alg. 9.3.

1:  OtsuThreshold(h)
    Input: h : [0, K−1] ↦ N, a grayscale histogram.
    Returns the optimal threshold value or −1 if no threshold is found.
2:  K ← size(h)                                ▷ number of intensity levels
3:  N ← Σ_i h(i)
4:  (µ0, µ1) ← MakeMeanTables(h)               ▷ see Alg. 9.3
5:  σb²max ← 0
6:  qmax ← −1
7:  n0 ← 0
8:  for q ← 0, . . . , K−2 do                  ▷ examine all possible threshold values q
9:      n0 ← n0 + h(q)
10:     n1 ← N − n0
11:     if (n0 > 0) ∧ (n1 > 0) then
12:         σb² ← n0 · n1 · [µ0(q) − µ1(q)]²   ▷ see Eqn. (9.22)
13:         if σb² > σb²max then               ▷ maximize σb²
14:             σb²max ← σb²
15:             qmax ← q
16: return qmax



Possible threshold values are q = 0, . . . , K−2 (with q = K−1, all pixels are assigned to the background). Initially (before entering the main for-loop in line 8), q = −1; at this point, the set of background pixels (≤ q) is empty and all pixels are classified as foreground (n0 = 0 and n1 = N). Each possible threshold value is examined inside the body of the for-loop. As long as either of the two classes is empty (n0(q) = 0 or n1(q) = 0),³ the resulting between-class variance σb²(q) is zero. The threshold that yields the maximum between-class variance (σb²max) is returned, or −1 if no valid threshold could be found. This occurs when all image pixels have the same intensity, that is, when all pixels are in either the background or the foreground class. Note that in line 12 of Alg. 9.4, the factor 1/N² is constant (independent of q) and can thus be ignored in the optimization. However, care must be taken at this point because the computation of σb² may produce intermediate values that exceed the range of typical (32-bit) integer variables, even for medium-size images. Variables of type long should be used, or the computation should be performed with floating-point values.

The absolute "goodness" of the final thresholding by qmax could be measured as the ratio

    η = σb²(qmax) / σI²  ∈ [0, 1]                                   (9.23)

(see Eqn. (9.8)), which is invariant under linear changes of contrast and brightness [203]. Greater values of η indicate better thresholding. Results of automatic threshold selection with Otsu's method are shown in Fig. 9.4, where qmax denotes the optimal threshold and η is the corresponding "goodness" estimate, as defined in Eqn. (9.23). The graph underneath each image shows the original histogram (gray) overlaid with the variance within the background σ0² (green), the variance within the foreground σ1² (blue), and the between-class variance σb² (red) for varying threshold values q. The dashed vertical line marks the position of the optimal threshold qmax. Due to the pre-calculation of the mean values, Otsu's method requires only three passes over the histogram and is thus very fast (O(K)), contrary to some accounts in the literature. The method is frequently quoted and performs well in comparison to other approaches [241], despite its long history and its simplicity. In general, the results are very similar to the ones produced by the iterative threshold selection ("isodata") algorithm described in Sec. 9.1.3.

³ This is the case if the image contains no pixels with values I(p) ≤ q or I(p) > q, that is, the histogram h is empty either below or above the index q.



Fig. 9.4 Results of thresholding with Otsu's method. Calculated threshold values and resulting binary images (a–d): (a) qmax = 128, η = 0.84; (b) qmax = 124, η = 0.77; (c) qmax = 94, η = 0.62; (d) qmax = 92, η = 0.53. Graphs in (e–h) show the corresponding within-background variance σ0² (green), the within-foreground variance σ1² (blue), and the between-class variance σb² (red), for varying threshold values q = 0, . . . , 255. The optimal threshold qmax (dashed vertical line) is positioned at the maximum of σb². The value η denotes the "goodness" estimate for the thresholding, as defined in Eqn. (9.23).



9.1.5 Maximum Entropy Thresholding

Entropy is an important concept in information theory and particularly in data compression. It is a statistical measure that quantifies the average amount of information contained in the "messages" generated by a stochastic data source [118, 120]. For example, the MN pixels in an image I can be interpreted as a message of MN symbols, each taken independently from a finite alphabet of K (e.g., 256) different intensity values. Every pixel is assumed to be statistically independent of the others. Knowing the probability of each intensity value g to occur, entropy measures how likely it is to observe a particular image or, in other words, how much we should be surprised to see such an image.

Before going into further details, we briefly review the notion of probabilities in the context of images and histograms (see also Sec. 3.6.1). For modeling the image generation as a random process, we first need to define an "alphabet", that is, a set of symbols

    Z = {0, 1, . . . , K−1},                                        (9.24)



which in this case is simply the set of possible intensity values g = 0, . . . , K −1, together with the probability p(g) that a particular intensity value g occurs. These probabilities are supposed to be






known in advance, which is why they are called a priori (or prior) probabilities. The vector of probabilities, (p(0), p(1), . . . , p(K−1)), is a probability distribution or probability density function (pdf). In practice, the a priori probabilities are usually unknown, but they can be estimated by observing how often the intensity values actually occur in one or more images, assuming that these are representative instances of the images typically produced by that source. An estimate p(g) of the image's probability density function is obtained by normalizing its histogram h in the form

    p(g) ≈ p(g) = h(g)/MN,                                          (9.25)

for 0 ≤ g < K, such that 0 ≤ p(g) ≤ 1 and Σ_{g=0}^{K−1} p(g) = 1. The associated cumulative distribution function (cdf) is

    P(g) = Σ_{i=0}^{g} h(i)/MN = Σ_{i=0}^{g} p(i),                  (9.26)

where P(0) = p(0) and P(K−1) = 1. This is simply the normalized cumulative histogram.⁴

Entropy of Images

Given an estimate of its intensity probability distribution p(g), the entropy of an image is defined as⁵

    H(Z) = Σ_{g∈Z} p(g) · logb(1/p(g)) = − Σ_{g∈Z} p(g) · logb(p(g)),          (9.27)

where g = I(p) and logb(x) denotes the logarithm of x to the base b. If b = 2, the entropy (or "information content") is measured in bits, but proportional results are obtained with any other logarithm (such as ln or log10). Note that the value of H() is never negative, because the probabilities p() are in [0, 1] and thus the terms logb[p()] are negative or zero for any b. Some other properties of the entropy are also quite intuitive. For example, if all probabilities p(g) are zero except for one intensity g′, then the entropy H(I) is zero, indicating that there is no uncertainty (or "surprise") in the messages produced by the corresponding data source. The (rather boring) images generated by this source will contain nothing but pixels of intensity g′, since all other intensities are impossible. Conversely, the entropy is a maximum if all K intensities have the same probability (uniform distribution),

    p(g) = 1/K,  for 0 ≤ g < K,                                     (9.28)

and therefore (from Eqn. (9.27)) the entropy in this case is

    H(Z) = − Σ_{i=0}^{K−1} (1/K) · logb(1/K) = (1/K) · Σ_{i=0}^{K−1} logb(K)    (9.29)
         = (1/K) · K · logb(K) = logb(K).                           (9.30)

⁴ See also Sec. 2.6.
⁵ Note the subtle difference in notation for the cumulative histogram H and the entropy H.



p(0) p(1) P0 (q) , P0 (q) ,



0,



0,



...,



p(q) P0 (q) ,



...,



0,



0,



0,



p(q+1) p(q+2) P1 (q) , P1 (q) ,



...,



0



),



...,



p(K−1) P1 (q)



),



(9.31)



with the associated cumulated probabilities (see Eqn. (9.26)) P0 (q) =



q X



p(i) = P(q),



(9.32)



i=0



P1 (q) =



K −1 X



p(i) = 1 − P(q).



(9.33)



i=q+1



Note that P0 (q) + P1 (q) = 1, since the background and foreground partitions are disjoint. The entropies within each partition are defined as H0 (q) = −



q X p(i) p(i)  · log , P (q) P 0 0 (q) i=0



K −1 X



p(i) p(i)  · log , P (q) P1 (q) i=q+1 1



H1 (q) = −



(9.34)



(9.35)



and the overall entropy for the threshold q is H01 (q) = H0 (q) + H1 (q).



(9.36)



This expression is to be maximized over q, also called the “information between the classes” C0 and C1 . To allow for an efficient computation, the expression for H0 (q) in Eqn. (9.34) can be rearranged to



253



9 Automatic Thresholding



H0 (q) = − =− =−



q X   p(i)  · log p(i) − log P0 (q) P (q) i=0 0



(9.37)



q X    1 · p(i)· log p(i) − log P0 (q) P0 (q) i=0



(9.38)



q q X X   1 1 · p(i)· log p(i) + · p(i) · log P0 (q) P0 (q) i=0 P0 (q) i=0 | {z } | {z } | {z } S0 (q)



 1 =− ·S (q) + log P0 (q) . P0 (q) 0



=P0 (q)



indep. of i



(9.39)



Similarly, H1 (q) in Eqn. (9.35) becomes K −1 X



  p(i)  · log p(i) − log P1 (q) P (q) i=q+1 1  1 =− · S1 (q) + log 1−P0 (q) . 1−P0 (q)



H1 (q) = −



(9.40) (9.41)



Given the estimated probability distribution p(i), the cumulative probability P0 and the sums S0 , S1 (see Eqns. (9.39)–(9.41)) can be calculated efficiently using the recurrence relations ( p(0) for q = 0, P0 (q) = P0 (q−1) + p(q) for 0 < q < K, (  p(0)· log p(0) for q = 0,  S0 (q) = (9.42) S0 (q−1) + p(q)· log p(q) for 0 < q < K, ( 0  for q = K −1, S1 (q) = S1 (q+1) + p(q+1)· log p(q+1) for 0 ≤ q < K −1.



The complete procedure is summarized in Alg. 9.5, where the quantities S0 (q), S1 (q) are obtained from the precalculated tables S0 , S1 , respectively. The algorithm performs three passes over the histogram of length K (two for filling the tables S0 , S1 and one in the main loop), so its time complexity is O(K), like the algorithms described before. Results obtained with this technique are shown in Fig. 9.5. The technique described in this section is simple and efficient, because it again relies entirely on the image’s histogram. More advanced entropy-based thresholding techniques exist that, among other improvements, take into account the spatial structure of the original image. An extensive review of entropy-based methods can be found in [55]. 9.1.6 Minimum Error Thresholding



254
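The following Java sketch (our own, not the book's implementation) combines the table construction of Eqn. (9.42) with the entropy expressions of Eqns. (9.39) and (9.41), mirroring the structure of Alg. 9.5:

    /** Maximum-entropy threshold (Kapur et al.), following Alg. 9.5 and Eqns. (9.39)-(9.42). */
    static int maxEntropyThreshold(int[] h) {
        int K = h.length;
        long N = 0;
        for (int g = 0; g < K; g++) N += h[g];
        double[] p = new double[K];                 // normalized histogram (pdf)
        for (int g = 0; g < K; g++) p[g] = (double) h[g] / N;

        // S0(q) = sum_{i<=q} p(i)*log(p(i)),  S1(q) = sum_{i>q} p(i)*log(p(i))
        double[] S0 = new double[K];
        double[] S1 = new double[K];
        double s = 0;
        for (int i = 0; i < K; i++) {
            if (p[i] > 0) s += p[i] * Math.log(p[i]);
            S0[i] = s;
        }
        s = 0;
        for (int i = K - 1; i >= 0; i--) {
            S1[i] = s;
            if (p[i] > 0) s += p[i] * Math.log(p[i]);
        }

        double P0 = 0, Hmax = Double.NEGATIVE_INFINITY;
        int qMax = -1;
        for (int q = 0; q < K - 1; q++) {
            P0 += p[q];
            double P1 = 1 - P0;
            double H0 = (P0 > 0) ? -S0[q] / P0 + Math.log(P0) : 0;   // background entropy
            double H1 = (P1 > 0) ? -S1[q] / P1 + Math.log(P1) : 0;   // foreground entropy
            double H01 = H0 + H1;                                    // overall entropy for q
            if (H01 > Hmax) {
                Hmax = H01;
                qMax = q;
            }
        }
        return qMax;
    }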



Alg. 9.5 Maximum entropy threshold selection [156]. Initially (outside the for-loop), the threshold q is assumed to be −1, which corresponds to the background class being empty (n0 = 0) and all pixels assigned to the foreground class (n1 = MN). The for-loop (lines 8–16) examines each possible threshold q = 0, . . . , K−2. The optimal threshold value (0, . . . , K−2) is returned, or −1 if no valid threshold was found.

1:  MaximumEntropyThreshold(h)
    Input: h : [0, K−1] ↦ N, a grayscale histogram.
    Returns the optimal threshold value or −1 if no threshold is found.
2:  K ← size(h)                                ▷ number of intensity levels
3:  p ← Normalize(h)                           ▷ normalize histogram
4:  (S0, S1) ← MakeTables(p, K)                ▷ tables for S0(q), S1(q)
5:  P0 ← 0                                     ▷ P0 ∈ [0, 1]
6:  qmax ← −1
7:  Hmax ← −∞                                  ▷ maximum joint entropy
8:  for q ← 0, . . . , K−2 do                  ▷ check all possible threshold values q
9:      P0 ← P0 + p(q)
10:     P1 ← 1 − P0                            ▷ P1 ∈ [0, 1]
11:     H0 ← −(1/P0)·S0(q) + log(P0) if P0 > 0, 0 otherwise      ▷ BG entropy
12:     H1 ← −(1/P1)·S1(q) + log(P1) if P1 > 0, 0 otherwise      ▷ FG entropy
13:     H01 ← H0 + H1                          ▷ overall entropy for q
14:     if H01 > Hmax then                     ▷ maximize H01(q)
15:         Hmax ← H01
16:         qmax ← q
17: return qmax

18: MakeTables(p, K)
19: Create maps S0, S1 : [0, K−1] ↦ R
20: s0 ← 0                                     ▷ initialize table S0
21: for i ← 0, . . . , K−1 do
22:     if p(i) > 0 then
23:         s0 ← s0 + p(i)·log(p(i))
24:     S0(i) ← s0
25: s1 ← 0                                     ▷ initialize table S1
26: for i ← K−1, . . . , 0 do
27:     S1(i) ← s1
28:     if p(i) > 0 then
29:         s1 ← s1 + p(i)·log(p(i))
30: return (S0, S1)

Fig. 9.5 Thresholding with the maximum-entropy method. Calculated threshold values q and resulting binary images (a–d): (a) q = 133, (b) q = 139, (c) q = 118, (d) q = 126. Graphs in (e–h) show the background entropy H0(q) (green), foreground entropy H1(q) (blue), and overall entropy H01(q) = H0(q) + H1(q) (red), for varying threshold values q. The optimal threshold qmax is found at the maximum of H01 (dashed vertical line).

9.1.6 Minimum Error Thresholding

The goal of minimum error thresholding is to optimally fit a combination (mixture) of Gaussian distributions to the image's histogram. Before we proceed, we briefly look at some additional concepts from statistics. Note, however, that the following material is only intended as a superficial outline to explain the elementary concepts. For a



solid grounding of these and related topics, readers are referred to the excellent texts available on statistical pattern recognition, such as [30, 76].

Bayesian Decision Making

The assumption is again that the image pixels originate from one of two classes, C0 and C1, or background and foreground, respectively. Both classes generate random intensity values following unknown statistical distributions. Typically, these are modeled as Gaussian distributions with unknown parameters µ and σ², as will be described. The task is to decide for each pixel value x to which of the two classes it most likely belongs. Bayesian reasoning is a classic technique for making such decisions in a probabilistic context.

The probability that a certain intensity value x originates from a background pixel is denoted p(x | C0). This is called a "conditional probability".⁶ It tells us how likely it is to observe the gray value x when a pixel is a member of the background class C0. Analogously, p(x | C1) is the conditional probability of observing the value x when a pixel is known to be of the foreground class C1. For the moment, let us assume that the conditional probability functions p(x | C0) and p(x | C1) are known. Our problem is reversed though, namely to decide which class a pixel most likely belongs to, given that its intensity is x. This means that we are actually interested in the conditional probabilities

    p(C0 | x)   and   p(C1 | x),                                    (9.43)

also called a posteriori (or posterior) probabilities. If we knew these, we could simply select the class with the higher probability in the form

    C = C0 if p(C0 | x) > p(C1 | x),  C1 otherwise.                 (9.44)

Bayes' theorem provides a method for estimating these posterior probabilities, that is,

    p(Cj | x) = p(x | Cj) · p(Cj) / p(x),                           (9.45)

where p(Cj) is the prior probability of class Cj. While, in theory, the prior probabilities are also unknown, they can be easily estimated from the image histogram (see also Sec. 9.1.5). Finally, p(x) in Eqn. (9.45) is the overall probability of observing the intensity value x, which is typically estimated from its relative frequency in one or more images.⁷

⁶ In general, p(A | B) denotes the (conditional) probability of observing the event A in a given situation B. It is usually read as "the probability of A, given B".
⁷ p(x) is also called the "evidence" for the event x.



Note that for a particular intensity x, the corresponding evidence p(x) only scales the posterior probabilities and is thus not relevant for the classification itself. Consequently, we can reformulate the binary decision rule in Eqn. (9.44) to

    C = C0 if p(x | C0) · p(C0) > p(x | C1) · p(C1),  C1 otherwise.     (9.46)

This is called Bayes' decision rule. It minimizes the probability of making a classification error if the involved probabilities are known and is also called the "minimum error" criterion.

Gaussian Probability Distributions

If the probability distributions p(x | Cj) are modeled as Gaussian distributions⁸ N(x | µj, σj²), where µj, σj² denote the mean and variance of class Cj, we can rewrite the scaled posterior probabilities in Eqn. (9.46) as

    p(x | Cj) · p(Cj) = (1/√(2π·σj²)) · exp(−(x−µj)²/(2σj²)) · p(Cj).   (9.47)

As long as the ordering between the resulting class scores remains unchanged, these quantities can be scaled or transformed arbitrarily. In particular, it is common to use the logarithm of the above expression to avoid repeated multiplications of small numbers. For example, applying the natural logarithm⁹ to both sides of Eqn. (9.47) yields

    ln(p(x | Cj) · p(Cj)) = ln(p(x | Cj)) + ln(p(Cj))                              (9.48)
        = ln(1/√(2π·σj²)) + ln(exp(−(x−µj)²/(2σj²))) + ln(p(Cj))                   (9.49)
        = −(1/2)·ln(2π) − (1/2)·ln(σj²) − (x−µj)²/(2σj²) + ln(p(Cj))               (9.50)
        = −(1/2)·[ ln(2π) + (x−µj)²/σj² + ln(σj²) − 2·ln(p(Cj)) ].                 (9.51)

Since ln(2π) in Eqn. (9.51) is constant, it can be ignored for the classification decision, as can the factor 1/2 at the front. Thus, to find the class Cj that maximizes p(x | Cj) · p(Cj) for a given intensity value x, it is sufficient to maximize the quantity

    −[ (x−µj)²/σj² + 2·( ln(σj) − ln(p(Cj)) ) ]                                    (9.52)

or, alternatively, to minimize

    εj(x) = (x−µj)²/σj² + 2·( ln(σj) − ln(p(Cj)) ).                                (9.53)

⁸ See also Sec. G.4 of the Appendix.
⁹ Any logarithm could be used, but the natural logarithm (base e) complements the exponential function of the Gaussian.



The quantity εj(x) can be viewed as a measure of the potential error involved in classifying the observed value x as being of class Cj. To obtain the decision associated with the minimum risk, we can modify the binary decision rule in Eqn. (9.46) to

    C = C0 if ε0(x) ≤ ε1(x),  C1 otherwise.                         (9.54)

Remember that this rule tells us how to correctly classify the observed intensity value x as being either of the background class C0 or the foreground class C1, assuming that the underlying distributions are really Gaussian and their parameters are well estimated.

Goodness of Classification

If we apply a threshold q, all pixel values g ≤ q are implicitly classified as C0 (background) and all g > q as C1 (foreground). The goodness of this classification by q over all MN image pixels I(p) can be quantified with the criterion function

    e(q) = (1/MN) · Σ_p { ε0(I(p)) for I(p) ≤ q;  ε1(I(p)) for I(p) > q }          (9.55)
         = (1/MN) · Σ_{g=0}^{q} h(g)·ε0(g) + (1/MN) · Σ_{g=q+1}^{K−1} h(g)·ε1(g)   (9.56)
         = Σ_{g=0}^{q} p(g)·ε0(g) + Σ_{g=q+1}^{K−1} p(g)·ε1(g),                    (9.57)

with the normalized frequencies p(g) = h(g)/MN and function εj(g) as defined in Eqn. (9.53). By substituting εj(g) from Eqn. (9.53) and some mathematical gymnastics, e(q) can be written as

    e(q) = 1 + P0(q)·ln(σ0²(q)) + P1(q)·ln(σ1²(q)) − 2·[ P0(q)·ln(P0(q)) + P1(q)·ln(P1(q)) ].    (9.58)



The remaining task is to find the threshold q that minimizes e(q) (where the constant 1 in Eqn. (9.58) can be omitted, of course). For each possible threshold q, we only need to estimate (from the image's histogram, as in Eqn. (9.31)) the "prior" probabilities P0(q), P1(q) and the corresponding within-class variances σ0²(q), σ1²(q). The prior probabilities for the background and foreground classes are estimated as

    P0(q) ≈ Σ_{g=0}^{q} p(g) = (1/MN) · Σ_{g=0}^{q} h(g) = n0(q)/MN,               (9.59)
    P1(q) ≈ Σ_{g=q+1}^{K−1} p(g) = (1/MN) · Σ_{g=q+1}^{K−1} h(g) = n1(q)/MN,       (9.60)

where n0(q) = Σ_{i=0}^{q} h(i), n1(q) = Σ_{i=q+1}^{K−1} h(i), and MN = n0(q) + n1(q) is the total number of image pixels. Estimates for background



and foreground variances σ0²(q) and σ1²(q), as defined in Eqn. (9.7), can be calculated efficiently by expressing them in the form

    σ0²(q) ≈ (1/n0(q)) · [ Σ_{g=0}^{q} h(g)·g² − (1/n0(q)) · ( Σ_{g=0}^{q} h(g)·g )² ]
           = (1/n0(q)) · [ B0(q) − (1/n0(q))·A0²(q) ],                              (9.61)

    σ1²(q) ≈ (1/n1(q)) · [ Σ_{g=q+1}^{K−1} h(g)·g² − (1/n1(q)) · ( Σ_{g=q+1}^{K−1} h(g)·g )² ]
           = (1/n1(q)) · [ B1(q) − (1/n1(q))·A1²(q) ],                              (9.62)

with the quantities

    A0(q) = Σ_{g=0}^{q} h(g)·g,        B0(q) = Σ_{g=0}^{q} h(g)·g²,
    A1(q) = Σ_{g=q+1}^{K−1} h(g)·g,    B1(q) = Σ_{g=q+1}^{K−1} h(g)·g².             (9.63)



Furthermore, the quantities σ0²(q), σ1²(q) can be tabulated for every possible q in only two passes over the histogram, using the recurrence relations

    A0(q) = 0                                  for q = 0,
            A0(q−1) + h(q)·q                   for 1 ≤ q ≤ K−1,     (9.64)

    B0(q) = 0                                  for q = 0,
            B0(q−1) + h(q)·q²                  for 1 ≤ q ≤ K−1,     (9.65)

    A1(q) = 0                                  for q = K−1,
            A1(q+1) + h(q+1)·(q+1)             for 0 ≤ q ≤ K−2,     (9.66)

    B1(q) = 0                                  for q = K−1,
            B1(q+1) + h(q+1)·(q+1)²            for 0 ≤ q ≤ K−2.     (9.67)

The complete minimum-error threshold calculation is summarized in Alg. 9.6. First, the tables S0, S1 are set up and initialized with the values of σ0²(q), σ1²(q), respectively, for 0 ≤ q < K, following the recursive scheme in Eqns. (9.64)–(9.67). Subsequently, the error value e(q) is calculated for every possible threshold value q to find the global minimum. Again e(q) can only be calculated for those values of q for which both resulting partitions are non-empty (i.e., with n0(q), n1(q) > 0). Note that, in lines 28 and 37 of Alg. 9.6, a small constant (1/12) is added to the variance to avoid zero values when the corresponding class population is homogeneous (i.e., only contains a single intensity value).¹⁰ This ensures that the algorithm works properly on images with only two distinct gray values.

¹⁰ This is explained by the fact that each histogram bin h(i) represents intensities in the continuous range [i ± 0.5] and the variance of uniformly distributed values in the unit interval is 1/12.

Alg. 9.6 Minimum error threshold selection based on a Gaussian mixture model (after [138]). Tables S0, S1 are initialized with values σ0²(q) and σ1²(q), respectively (see Eqns. (9.61)–(9.62)), for all possible threshold values q = 0, . . . , K−1. n is the number of image pixels. Initially (outside the for-loop), the threshold q is assumed to be −1, which corresponds to the background class being empty (n0 = 0) and all pixels assigned to the foreground class (n1 = n). The for-loop (lines 8–16) examines each possible threshold q = 0, . . . , K−2. The optimal threshold is returned, or −1 if no valid threshold was found.

1:  MinimumErrorThreshold(h)
    Input: h : [0, K−1] ↦ N, a grayscale histogram.
    Returns the optimal threshold value or −1 if no threshold is found.
2:  K ← size(h)
3:  (S0, S1) ← MakeSigmaTables(h, K)
4:  n ← Σ_i h(i)
5:  n0 ← 0
6:  qmin ← −1
7:  emin ← ∞
8:  for q ← 0, . . . , K−2 do                  ▷ evaluate all possible thresholds q
9:      n0 ← n0 + h(q)                         ▷ background population
10:     n1 ← n − n0                            ▷ foreground population
11:     if (n0 > 0) ∧ (n1 > 0) then
12:         P0 ← n0/n                          ▷ prior probability of C0
13:         P1 ← n1/n                          ▷ prior probability of C1
14:         e ← P0·ln(S0(q)) + P1·ln(S1(q)) − 2·(P0·ln(P0) + P1·ln(P1))    ▷ Eqn. (9.58)
15:         if e < emin then                   ▷ minimize error for q
16:             emin ← e
17:             qmin ← q
18: return qmin

19: MakeSigmaTables(h, K)
20: Create maps S0, S1 : [0, K−1] ↦ R
21: n0 ← 0
22: A0 ← 0
23: B0 ← 0
24: for q ← 0, . . . , K−1 do                  ▷ tabulate σ0²(q)
25:     n0 ← n0 + h(q)
26:     A0 ← A0 + h(q)·q                       ▷ Eqn. (9.64)
27:     B0 ← B0 + h(q)·q²                      ▷ Eqn. (9.65)
28:     S0(q) ← 1/12 + (B0 − A0²/n0)/n0 if n0 > 0, 0 otherwise     ▷ Eqn. (9.61)
29: n1 ← 0
30: A1 ← 0
31: B1 ← 0
32: S1(K−1) ← 0
33: for q ← K−2, . . . , 0 do                  ▷ tabulate σ1²(q)
34:     n1 ← n1 + h(q+1)
35:     A1 ← A1 + h(q+1)·(q+1)                 ▷ Eqn. (9.66)
36:     B1 ← B1 + h(q+1)·(q+1)²                ▷ Eqn. (9.67)
37:     S1(q) ← 1/12 + (B1 − A1²/n1)/n1 if n1 > 0, 0 otherwise     ▷ Eqn. (9.62)
38: return (S0, S1)
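A compact Java sketch of the same idea is shown below. It is our own formulation (not the book's implementation) and deviates slightly from Alg. 9.6 in that the foreground sums A1, B1 are obtained by subtracting the running background sums from precomputed totals, which is mathematically equivalent to the backward recurrences of Eqns. (9.66)–(9.67).

    /** Minimum-error threshold, following Alg. 9.6 and Eqn. (9.58). */
    static int minimumErrorThreshold(int[] h) {
        int K = h.length;
        long n = 0;
        for (int g = 0; g < K; g++) n += h[g];

        double A = 0, B = 0;                    // totals of g*h(g) and g^2*h(g)
        for (int g = 0; g < K; g++) { A += (double) g * h[g]; B += (double) g * g * h[g]; }

        long n0 = 0;                            // background population (g <= q)
        double A0 = 0, B0 = 0;                  // background running sums
        double eMin = Double.POSITIVE_INFINITY;
        int qMin = -1;
        for (int q = 0; q < K - 1; q++) {
            n0 += h[q];
            A0 += (double) q * h[q];
            B0 += (double) q * q * h[q];
            long n1 = n - n0;
            if (n0 > 0 && n1 > 0) {
                double A1 = A - A0, B1 = B - B0;
                double s0 = 1.0 / 12 + (B0 - A0 * A0 / n0) / n0;   // sigma0^2(q), Eqn. (9.61)
                double s1 = 1.0 / 12 + (B1 - A1 * A1 / n1) / n1;   // sigma1^2(q), Eqn. (9.62)
                double P0 = (double) n0 / n, P1 = (double) n1 / n;
                double e = P0 * Math.log(s0) + P1 * Math.log(s1)
                         - 2 * (P0 * Math.log(P0) + P1 * Math.log(P1));   // Eqn. (9.58)
                if (e < eMin) { eMin = e; qMin = q; }
            }
        }
        return qMin;
    }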



The algorithm computes the optimal threshold by performing three passes over the histogram (two for initializing the tables and one for finding the minimum); it thus has the same time complexity of O(K) as the algorithms described before. Figure 9.6 shows the results of minimum-error thresholding on our set of test images. It also shows the fitted pair of Gaussian distributions for the background and the foreground pixels, respectively, for



the optimal threshold as well as the graphs of the error function e(q), which is minimized over all threshold values q. Obviously the error function is quite flat in certain cases, indicating that similar scores are obtained for a wide range of threshold values and the optimal threshold is not very distinct. We can also see that the estimate is quite accurate in case of the synthetic test image in Fig. 9.6 (d), which is actually generated as a mixture of two Gaussians (with parameters µ0 = 80, µ1 = 170 and σ0 = σ1 = 20). Note that the histograms in Fig. 9.6 have been properly normalized (to constant area) to illustrate the curves of the Gaussians, that is, properly scaled by their prior probabilities (P0, P1), while the original histograms are scaled with respect to their maximum values.



Fig. 9.6 Results from minimum-error thresholding. Calculated threshold values q and resulting binary images (a–d): (a) q = 161, (b) q = 50, (c) q = 43, (d) q = 140. The green and blue graphs in (e–h) show the fitted Gaussian background and foreground distributions N0 = (µ0, σ0) and N1 = (µ1, σ1), respectively. The red graph corresponds to the error quantity e(q) for varying threshold values q = 0, . . . , 255 (see Eqn. (9.58)). The optimal threshold qmin is located at the minimum of e(q) (dashed vertical line). The estimated parameters of the background/foreground Gaussians are listed at the bottom of the figure (for the synthetic image (d), for example, µ0 = 80.12, σ0 = 19.98, µ1 = 171.93, σ1 = 17.80).



A minor theoretical problem with the minimum error technique is that the parameters of the Gaussian distributions are always estimated from truncated samples. This means that, for any threshold q, only the intensity values smaller than q are used to estimate the parameters of the background distribution, and only the intensities greater than q contribute to the foreground parameters. In practice, this problem is of minor importance, since the distributions are typically not strictly Gaussian either.



9.2 Local Adaptive Thresholding

In many situations, a fixed threshold is not appropriate to classify the pixels in the entire image, for example, when confronted with stained backgrounds or uneven lighting or exposure. Figure 9.7 shows a typical, unevenly exposed document image and the results obtained with some global thresholding methods described in the previous sections. Instead of using a single threshold value for the whole image, adaptive thresholding specifies a varying threshold value Q(p) for each image position that is used to classify the corresponding pixel I(p) in the same way as described in Eqn. (9.1) for a global threshold.



Fig. 9.7 Global thresholding methods fail under uneven lighting or exposure: (a) original image; results from global thresholding with (b) Otsu's method, (c) maximum entropy, and (d) minimum error.



The following approaches differ only with regard to how the threshold "surface" Q is derived from the input image.

9.2.1 Bernsen's Method

The method proposed by Bernsen [29] specifies a dynamic threshold for each image position p, based on the minimum and maximum intensity found in a local neighborhood R(p). If

    Imin(p) = min_{n∈R(p)} I(n)   and   Imax(p) = max_{n∈R(p)} I(n)     (9.68)

are the minimum and maximum intensity values within a fixed-size neighborhood R, centered at position p, the space-varying threshold is simply calculated as the mid-range value

    Q(p) = (Imin(p) + Imax(p)) / 2 .                                    (9.69)

This is done as long as the local contrast

    c(p) = Imax(p) − Imin(p)                                            (9.70)



is above some predefined limit cmin . If c(p) < cmin , the pixels in the corresponding image region are assumed to belong to a single class and are (by default) assigned to the background. The whole process is summarized in Alg. 9.7. Note that the meaning of “background” in terms of intensity levels depends on the application. For example, in astronomy, the image background is usually darker than the objects of interest. In typical OCR applications, however, the background (paper) is brighter than the foreground objects (print). The main function provides a control parameter bg to select the proper default threshold q¯, which is set to K in case of a dark background (bg = dark) and to 0 for a bright background (bg = bright). The support region R may be square or circular, typically with a radius r = 15. The choice of the minimum contrast limit cmin depends on the type of imagery and the noise level (cmin = 15 is a suitable value to start with). Figure 9.8 shows the results of Bernsen’s method on the uneven test image used in Fig. 9.7 for different settings of the region’s radius r. Due to the nonlinear min- and max-operation, the resulting threshold surface is not smooth. The minimum contrast is set to cmin = 15, which is too low to avoid thresholding low-contrast noise visible along the left image margin. By increasing the minimum contrast cmin , more neighborhoods are considered “flat” and thus ignored, that is, classified as background. This is demonstrated in



Alg. 9.7 Adaptive thresholding using local contrast (after Bernsen [29]). The argument to bg should be set to dark if the image background is darker than the structures of interest, and to bright if the background is brighter than the objects.

1:  BernsenThreshold(I, r, cmin, bg)
    Input: I, grayscale image of size M × N; r, radius of support region;
    cmin, minimum contrast; bg ∈ {dark, bright}, background type.
    Returns a map Q with an individual threshold value for each image position.
2:  (M, N) ← size(I)
3:  Q ← new map M × N ↦ R
4:  q̄ ← K if bg = dark, 0 if bg = bright       ▷ low contrast default threshold
5:  for all p ∈ M × N do
6:      R ← {n ∈ Z² | ‖n − p‖ ≤ r}              ▷ circular neighborhood of p
7:      Imin ← min_{n∈R} I(n)
8:      Imax ← max_{n∈R} I(n)
9:      c ← Imax − Imin                          ▷ local contrast
10:     Q(p) ← (Imin + Imax)/2 if c ≥ cmin, q̄ otherwise
11: return Q

Fig. 9.8 Adaptive thresholding using Bernsen's method. Original image (a), local minimum (b), and maximum (c). The center row shows the binarized images for different settings of r: (d) r = 7, (e) r = 15, (f) r = 30. The corresponding curves in (g–i) show the original intensity (gray), local minimum (green), maximum (red), and the actual threshold (blue) along the horizontal line marked in (a–c). The region radius r is 15 pixels, the minimum contrast cmin is 15 intensity units.



Fig. 9.9. While larger values of cmin effectively eliminate low-contrast noise, relevant structures are also lost, which illustrates the difficulty of finding a suitable global value for cmin . Additional examples, using the test images previously used for global thresholding, are shown in Fig. 9.10.



Fig. 9.9 Adaptive thresholding using Bernsen's method with different settings of cmin (minimum local contrast): (a) cmin = 15, (b) cmin = 30, (c) cmin = 60. Binarized images (top row) and threshold surface Q(p) (bottom row). Black areas in the threshold functions indicate that the local contrast is below cmin; the corresponding pixels are classified as background (white in this case).



What Alg. 9.7 describes formally can be implemented quite efficiently, noting that the calculation of local minima and maxima over a sliding window (lines 6–8) corresponds to a simple nonlinear filter operation (see Sec. 4.4). To perform these calculations, we can use a minimum and maximum filter with radius r, as provided by virtually every image processing environment. For example, the Java implementation of the Bernsen thresholder in Prog. 9.1 uses ImageJ's built-in RankFilters class for this purpose. The complete implementation can be found on the book's website (see Sec. 9.3 for additional details on the corresponding API).

9.2.2 Niblack's Method

In this approach, originally presented in [196, Sec. 5.1], the threshold Q(p) is varied across the image as a function of the local intensity average µR(p) and standard deviation¹¹ σR(p) in the form

    Q(p) = µR(p) + κ · σR(p).                                       (9.71)



Thus the local threshold Q(p) is determined by adding a constant portion (κ ≥ 0) of the local standard deviation σR to the local mean µR. µR and σR are calculated over a square support region R centered at p. The size (radius) of the averaging region R should be as large as possible, at least larger than the size of the structures to be detected, but small enough to capture the variations (unevenness) of the background. A size of 31 × 31 pixels (or radius r = 15) is suggested in [196] and κ = 0.18, though the latter does not seem to be critical. One problem is that, for small values of σR (as obtained in "flat" image regions of approximately constant intensity), the threshold will be close to the local average, which makes the segmentation quite sensitive to low-amplitude noise ("ghosting").

¹¹ The standard deviation σ is the square root of the variance σ².



Prog. 9.1 Bernsen's thresholder (ImageJ plugin implementation of Alg. 9.7). Note the use of ImageJ's RankFilters class for calculating the local minimum (Imin) and maximum (Imax) maps inside the getThreshold() method. The resulting threshold surface Q(p) is returned as an 8-bit image of type ByteProcessor.

    package imagingbook.pub.threshold.adaptive;

    import ij.plugin.filter.RankFilters;
    import ij.process.ByteProcessor;
    import imagingbook.pub.threshold.BackgroundMode;

    public class BernsenThresholder extends AdaptiveThresholder {

      public static class Parameters {
        public int radius = 15;
        public int cmin = 15;
        public BackgroundMode bgMode = BackgroundMode.DARK;
      }

      private final Parameters params;

      public BernsenThresholder() {
        this.params = new Parameters();
      }

      public BernsenThresholder(Parameters params) {
        this.params = params;
      }

      public ByteProcessor getThreshold(ByteProcessor I) {
        int M = I.getWidth();
        int N = I.getHeight();
        ByteProcessor Imin = (ByteProcessor) I.duplicate();
        ByteProcessor Imax = (ByteProcessor) I.duplicate();

        RankFilters rf = new RankFilters();
        rf.rank(Imin, params.radius, RankFilters.MIN);   // Imin(p)
        rf.rank(Imax, params.radius, RankFilters.MAX);   // Imax(p)

        int q = (params.bgMode == BackgroundMode.DARK) ? 256 : 0;
        ByteProcessor Q = new ByteProcessor(M, N);       // Q(p)

        for (int v = 0; v < N; v++) {
          for (int u = 0; u < M; u++) {
            int gMin = Imin.get(u, v);
            int gMax = Imax.get(u, v);
            int c = gMax - gMin;                         // local contrast
            if (c >= params.cmin)
              Q.set(u, v, (gMin + gMax) / 2);
            else
              Q.set(u, v, q);
          }
        }
        return Q;
      }
    }



A simple improvement is to secure a minimum distance from the mean by adding a constant offset d, that is, replacing Eqn. (9.71) by

    Q(p) = µR(p) + κ · σR(p) + d,                                   (9.72)

with d ≥ 0, in the range 2, . . . , 20 for typical 8-bit images. The original formulation (Eqn. (9.71)) is aimed at situations where the foreground structures are brighter than the background (Fig. 9.11(a)) but does not work if the images are set up the other way round (Fig. 9.11(b)).



Fig. 9.10 Additional examples for Bernsen's method. Original images (a–d), local minimum Imin (e–h), maximum Imax (i–l), and threshold map Q (m–p); results after thresholding the images (q–t). Settings are r = 15, cmin = 15. A bright background is assumed for all images (bg = bright), except for image (d).



In the case that the structures of interest are darker than the background (as, e.g., in typical OCR applications), one could either work with inverted images or modify the calculation of the threshold to

    Q(p) = µR(p) + (κ·σR(p) + d)   if bg = dark,
           µR(p) − (κ·σR(p) + d)   if bg = bright.                  (9.73)






The modified procedure is detailed in Alg. 9.8. The example in Fig. 9.12 shows results obtained with this method on an image with a bright background containing dark structures, for κ = 0.3 and varying settings of d.



Fig. 9.11 Illustration of adaptive thresholding based on average local intensity. The illustration shows a line profile as typically found in document imaging: (a) dark background (bg = dark), (b) bright background (bg = bright). The space-variant threshold Q (dotted blue line) is chosen as the local average µR (dashed green line) offset by a multiple of the local intensity variation σR. The offset is chosen to be positive for images with a dark background and bright structures (a) and negative if the background is brighter than the contained structures of interest (b).

Note that setting d = 0 (see Fig. 9.12(d, g)) corresponds to Niblack's original method. For these examples, a circular window of radius r = 15 was used to compute the local mean µR(p) and variance σR²(p). Additional examples are shown in Fig. 9.13. Note that the selected radius r is obviously too small for the structures in the images in Fig. 9.13(c, d), which are thus not segmented cleanly. Better results can be expected with a larger radius. With the intent to improve upon Niblack's method, particularly for thresholding deteriorated text images, Sauvola and Pietikäinen [233] proposed setting the threshold to

    Q(p) = µR(p) · [ 1 − κ·(σR(p)/σmax − 1) ]   if bg = dark,
           µR(p) · [ 1 + κ·(σR(p)/σmax − 1) ]   if bg = bright,     (9.74)



with κ = 0.5 and σmax = 128 (the "dynamic range of the standard deviation" for 8-bit images) as suggested parameter values.



Fig. 9.12 Adaptive thresholding using Niblack's method (with r = 15, κ = 0.3). Original image (a), local mean µR (b), and standard deviation σR (c). The result for d = 0 in (d) corresponds to Niblack's original formulation. Increasing the value of d reduces the amount of clutter in regions with low variance (e: d = 5, f: d = 10). The curves in (g–i) show the local intensity (gray), mean (green), variance (red), and the actual threshold (blue) along the horizontal line marked in (a–c).






Alg. 9.8 Adaptive thresholding using local mean and variance (modified version of Niblack's method [196]). The argument to bg should be dark if the image background is darker than the structures of interest, bright if the background is brighter than the objects.

1:  NiblackThreshold(I, r, κ, d, bg)
    Input: I, grayscale image of size M × N; r, radius of support region;
    κ, variance control parameter; d, minimum offset; bg ∈ {dark, bright},
    background type.
    Returns a map with an individual threshold value for each image position.
2:  (M, N) ← size(I)
3:  Create map Q : M × N ↦ R
4:  for all p ∈ M × N do
5:      (µ, σ²) ← GetLocalMeanAndVariance(I, p, r)
6:      σ ← √σ²                                ▷ local std. deviation σR
7:      Q(p) ← µ + (κ·σ + d) if bg = dark, µ − (κ·σ + d) if bg = bright    ▷ Eqn. (9.73)
8:  return Q

9:  GetLocalMeanAndVariance(I, p, r)
    Returns the local mean and variance of the image pixels in a circular
    region with radius r, centered at position p.
10: R ← {n ∈ Z² | ‖n − p‖ ≤ r}                 ▷ circular neighborhood of p
11: n ← 0
12: A ← 0
13: B ← 0
14: for all n ∈ R do
15:     n ← n + 1
16:     A ← A + I(n)
17:     B ← B + (I(n))²
18: µ ← (1/n)·A
19: σ² ← (1/n)·(B − (1/n)·A²)
20: return (µ, σ²)



In this approach, the offset between the threshold and the local average not only depends on the local variation σR (as in Eqn. (9.71)), but also on the magnitude of the local mean µR! Thus, changes in absolute brightness lead to modified relative threshold values, even when the image contrast remains constant. Though this technique is frequently referenced in the literature, it appears questionable whether this behavior is generally desirable.

Calculating Local Mean and Variance

Algorithm 9.8 shows the principal operation of Niblack's method and also illustrates how to efficiently calculate the local average and variance. Given the image I and the averaging region R, we can use the shortcut suggested in Eqn. (2.12) to obtain these quantities as

    µR = (1/n)·A    and    σR² = (1/n)·(B − (1/n)·A²),              (9.75)

with

    A = Σ_{(i,j)∈R} I(i,j),    B = Σ_{(i,j)∈R} I²(i,j),    n = |R|.     (9.76)



Procedure GetLocalMeanAndVariance() in Alg. 9.8 shows this calculation in full detail.
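A direct Java transcription of this running-sum scheme is given below. It is a sketch with our own naming (not the book's library code); for simplicity the circular region is clipped at the image borders, i.e., outside pixels are simply skipped (the second border-handling strategy discussed below).

    // requires: import ij.process.ByteProcessor;

    /**
     * Local mean and variance of I inside a circular region of radius r centered
     * at (u, v), using the running sums A, B of Eqns. (9.75)-(9.76).
     */
    static double[] getLocalMeanAndVariance(ByteProcessor I, int u, int v, int r) {
        int M = I.getWidth(), N = I.getHeight();
        long A = 0, B = 0;
        int n = 0;
        for (int j = -r; j <= r; j++) {
            for (int i = -r; i <= r; i++) {
                if (i * i + j * j > r * r) continue;       // outside the circular region
                int x = u + i, y = v + j;
                if (x < 0 || x >= M || y < 0 || y >= N) continue;   // clip at borders
                int g = I.get(x, y);
                A += g;
                B += (long) g * g;
                n++;
            }
        }
        double mu = (double) A / n;
        double sigma2 = ((double) B - (double) A * A / n) / n;      // Eqn. (9.75)
        return new double[] { mu, sigma2 };
    }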



Fig. 9.13 Additional examples for thresholding with Niblack's method using a disk-shaped support region of radius r = 15. Original images (a–d), local mean µR (e–h), standard deviation σR (i–l), and threshold Q (m–p); results after thresholding the images (q–t). The background is assumed to be brighter than the structures of interest, except for image (d), which has a dark background. Settings are κ = 0.3, d = 5.



When computing the local average and variance, attention must be paid to the situation at the image borders, as illustrated in Fig. 9.14. Two approaches are frequently used. In the first approach (following the common practice for implementing filter operations), all outside pixel values are replaced by the closest inside pixel, which is always a border pixel. Thus the border pixel values are effectively replicated outside the image boundaries and thus these pixels have a strong influence on the local results. The second approach is to perform the calculation of the average and variance on only those image pixels that are actually covered by the support region. In this






case, the number of pixels (n) is reduced at the image borders to about 1/4 of the full region size.

Although the calculation of the local mean and variance outlined in function GetLocalMeanAndVariance() (see Alg. 9.8) is definitely more efficient than a brute-force approach, additional optimizations are possible. Most image processing environments have suitable routines already built in. With ImageJ, for example, we can again use the RankFilters class (as with the min- and max-filters in the Bernsen approach, see Sec. 9.2.1). Instead of performing the computation for each pixel individually, the following ImageJ code segment uses predefined filters to compute two separate images Imean (µR) and Ivar (σR²) containing the local mean and variance values, respectively, with a disk-shaped support region of radius 15:

    ByteProcessor I;        // original image I
    int radius = 15;
    FloatProcessor Imean = I.convertToFloatProcessor();
    FloatProcessor Ivar  = (FloatProcessor) Imean.duplicate();
    RankFilters rf = new RankFilters();
    rf.rank(Imean, radius, RankFilters.MEAN);       // µR(p)
    rf.rank(Ivar,  radius, RankFilters.VARIANCE);   // σR²(p)
    ...



See Sec. 9.3 and the online code for additional implementation details. Note that the filter methods implemented in RankFilters perform replication of border pixels as the border handling strategy, as discussed earlier.

Fig. 9.14 Calculating local statistics at image boundaries. The illustration shows a disk-shaped support region with radius r, placed at the image border. Pixel values outside the image can be replaced ("filled-in") by the closest border pixel, as is common in many filter operations. Alternatively, the calculation of the local statistics can be confined to include only those pixels inside the image that are actually covered by the support region. At any border pixel, the number of covered elements (n) is still more than ≈ 1/4 of the full region size. In this particular case, the circular region covers a maximum of n = 69 pixels when fully embedded and n = 22 when positioned at an image corner.



270



0 1 2 3



u



r 0 1 2 3



v



Local Average and Variance with Gaussian Kernels

The purpose of taking the local average is to smooth the image to obtain an estimate of the varying background intensity. In case of a square or circular region, this is equivalent to convolving the image with a box- or disk-shaped kernel, respectively. Kernels of this type, however, are not well suited for image smoothing, because they create strong ringing and truncation effects, as demonstrated in Fig. 9.15. Moreover, convolution with a box-shaped (rectangular) kernel is a non-isotropic operation, that is, the results are orientation-dependent. From this perspective alone it seems appropriate to consider other smoothing kernels, Gaussian kernels in particular.



Fig. 9.15 Local average µR (a–c) and variance σR (d–f) obtained with different smoothing kernels: 31 × 31 box filter (a, d), disk filter with radius r = 15 (b, e), Gaussian kernel with σ = 0.6 · 15 = 9.0 (c, f). Both the box and the disk filter show strong truncation effects (ringing); the box filter is also highly non-isotropic. All images are contrast-enhanced for better visibility.



Using a Gaussian kernel HσG for smoothing is equivalent to calculating a weighted average of the corresponding image pixels, with the weights being the coefficients of the kernel. Thus calculating this weighted local average can be expressed by

    µG(p) = (1/ΣHσG) · (I ∗ HσG)(p),                                (9.77)

where ΣHσG is the sum of the coefficients in the kernel HσG and ∗ denotes the linear convolution operator.¹² Analogously, there is also a weighted variance σG² which can be calculated jointly with the local average µG (as in Eqn. (9.75)) in the form

    µG(p)  = (1/ΣHσG) · AG(p),                                      (9.78)
    σG²(p) = (1/ΣHσG) · [ BG(p) − (1/ΣHσG)·AG²(p) ],                (9.79)

with AG = I ∗ HσG and BG = I² ∗ HσG. Thus all we need is two filter operations, one applied to the original image (I ∗ HσG) and another applied to the squared image (I² ∗ HσG), using the same 2D Gaussian kernel HσG (or any other suitable smoothing kernel). If the kernel HσG is normalized (i.e., ΣHσG = 1), Eqns. (9.78)–(9.79) reduce to

    µG(p)  = AG(p),                                                 (9.80)
    σG²(p) = BG(p) − AG²(p),                                        (9.81)



with AG , BG as defined above. This suggests a very simple process for calculating the local average and variance by Gaussian filtering, as summarized in Alg. 9.9. The width (standard deviation σ) of the Gaussian kernel is set to 0.6 times the radius r of the corresponding disk filter to produce a similar effect as Alg. 9.8. The Gaussian 12



12 See Sec. 4.3.1.



Alg. 9.9 Adaptive thresholding using Gaussian averaging (extended from Alg. 9.8). Parameters are the original image I, the radius r of the Gaussian kernel, variance control κ, and minimum offset d. The argument to bg should be dark if the image background is darker than the structures of interest, bright if the background is brighter than the objects. The procedure MakeGaussianKernel2D(σ) creates a discrete, normalized 2D Gaussian kernel with standard deviation σ.

 1: AdaptiveThresholdGauss(I, r, κ, d, bg)
    Input: I, intensity image of size M × N; r, support region radius; κ, variance control parameter; d, minimum offset; bg ∈ {dark, bright}, background type. Returns a map Q of local thresholds for the grayscale image I.
 2:   (M, N) ← size(I)
 3:   Create maps A, B, Q : M × N → R
 4:   for all p ∈ M × N do
 5:     A(p) ← I(p)
 6:     B(p) ← (I(p))²
 7:   HσG ← MakeGaussianKernel2D(0.6 · r)
 8:   A ← A ∗ HσG                         ▷ filter the original image with HσG
 9:   B ← B ∗ HσG                         ▷ filter the squared image with HσG
10:   for all p ∈ M × N do
11:     µG ← A(p)                         ▷ Eqn. (9.80)
12:     σG ← √(B(p) − A²(p))              ▷ Eqn. (9.81)
13:     Q(p) ← µG + κ·σG + d  if bg = dark,   µG − κ·σG + d  if bg = bright   ▷ Eqn. (9.73)
14:   return Q

15: MakeGaussianKernel2D(σ)
    Returns a discrete 2D Gaussian kernel H with std. deviation σ, sized sufficiently large to avoid truncation effects.
16:   k ← max(1, ⌈3.5 · σ⌉)               ▷ size the kernel sufficiently large
17:   Create map H : [−k, k]² → R
18:   s ← 0
19:   for x ← −k, . . . , k do
20:     for y ← −k, . . . , k do
21:       H(x, y) ← exp(−(x²+y²)/(2·σ²))  ▷ 2D Gaussian
22:       s ← s + H(x, y)
23:   for x ← −k, . . . , k do
24:     for y ← −k, . . . , k do
25:       H(x, y) ← (1/s)·H(x, y)         ▷ normalize H
26:   return H
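As a quick illustration of this two-filter process (Eqns. (9.80)–(9.81)), the following fragment uses ImageJ's built-in GaussianBlur class. It is only a sketch under these assumptions and not the reference implementation of Alg. 9.9 (that follows in Progs. 9.2–9.3):

ByteProcessor I;                                  // original image (assumed given)
double r = 15, sigma = 0.6 * r;                   // kernel width as in Alg. 9.9
FloatProcessor A = I.convertToFloatProcessor();   // A = I
FloatProcessor B = I.convertToFloatProcessor();
B.sqr();                                          // B = I^2
GaussianBlur gb = new GaussianBlur();             // ij.plugin.filter.GaussianBlur
gb.blurGaussian(A, sigma, sigma, 0.002);          // A <- A * H
gb.blurGaussian(B, sigma, sigma, 0.002);          // B <- B * H
// local mean:      muG(u,v)    = A.getf(u,v)                        (Eqn. 9.80)
// local std. dev.: sigmaG(u,v) = sqrt(B.getf(u,v) - A.getf(u,v)^2)  (Eqn. 9.81)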



The Gaussian approach has two advantages: First, compared to the box or disk kernels, the Gaussian kernel makes a much superior low-pass filter. Second, the 2D Gaussian is (unlike the circular disk kernel) separable in the x- and y-direction, which permits a very efficient implementation of the 2D filter using only a pair of 1D convolutions (see Sec. 4.6.1).

Implementation Details
For practical calculation, A, B can be represented as (floating-point) images, and most modern image-processing environments provide efficient implementations of Gaussian filters with large-size kernels. Programs 9.2–9.3 show a complete implementation of Niblack's thresholder using Gaussian smoothing kernels. An instance of the filter class GaussianFilterSeparable13 is used for Gaussian filtering (see Prog. 9.3).






13 Class GaussianFilterSeparable is provided by the imagingbook library (package imagingbook.lib.filter.linear).



Prog. 9.2 Niblack's thresholder using Gaussian smoothing kernels (implementation of Alg. 9.9, Part 1).

import ij.plugin.filter.RankFilters;
import ij.process.ByteProcessor;
import ij.process.FloatProcessor;
import imagingbook.lib.filter.GenericFilter;
import imagingbook.lib.filter.linear.GaussianFilterSeparable;
import imagingbook.pub.threshold.BackgroundMode;

public abstract class NiblackThresholder extends AdaptiveThresholder {

  // parameters for this thresholder
  public static class Parameters {
    public int radius = 15;
    public double kappa = 0.30;
    public int dMin = 5;
    public BackgroundMode bgMode = BackgroundMode.DARK;
  }

  private final Parameters params;   // parameter object

  protected FloatProcessor Imean;    // = µG(p)
  protected FloatProcessor Isigma;   // = σG(p)

  // method to be implemented by real sub-classes:
  protected abstract void makeMeanAndVariance(ByteProcessor I, int radius);

  @Override
  public ByteProcessor getThreshold(ByteProcessor I) {
    int M = I.getWidth();
    int N = I.getHeight();
    makeMeanAndVariance(I, params.radius);
    ByteProcessor Q = new ByteProcessor(M, N);
    final double kappa = params.kappa;
    final int dMin = params.dMin;
    final boolean darkBg = (params.bgMode == BackgroundMode.DARK);

    for (int v = 0; v < N; v++) {
      for (int u = 0; u < M; u++) {
        double sigma = Isigma.getf(u, v);
        double mu = Imean.getf(u, v);
        double diff = kappa * sigma + dMin;
        int q = (int) Math.rint((darkBg) ? mu + diff : mu - diff);
        if (q < 0) q = 0;
        if (q > 255) q = 255;
        Q.set(u, v, q);
      }
    }
    return Q;
  }

  // continues in Prog. 9.3



9.3 Java Implementation
All thresholding methods described in this chapter have been implemented as part of the imagingbook library that is available with full source code at the book's website. The top class in this






Prog. 9.3 Niblack's thresholder using Gaussian smoothing kernels (implementation of Alg. 9.9, Part 2): Gaussian smoothing. Gauss is implemented as an inner sub-class of NiblackThresholder. The floating-point images A, B correspond to the maps A (filtered original image) and B (filtered squared image) in Alg. 9.9. An instance of the imagingbook class GaussianFilterSeparable is created in line 74 and subsequently used to filter both images (lines 75–76).



57 // continued from Prog. 9.2
58
59 public static class Gauss extends NiblackThresholder {
60
61   @Override  // required by super-class 'NiblackThresholder'
62   protected void makeMeanAndVariance(ByteProcessor I, int r) {
63     int M = I.getWidth();
64     int N = I.getHeight();
65
66     Imean  = new FloatProcessor(M, N);
67     Isigma = new FloatProcessor(M, N);
68
69     FloatProcessor A = I.convertToFloatProcessor();  // = I
70     FloatProcessor B = I.convertToFloatProcessor();
71     B.sqr();                                         // = I^2
72
73     double sigma = r * 0.6;
74     GenericFilter gf = new GaussianFilterSeparable(sigma);
75     gf.applyTo(A);   // A <- A * HσG
76     gf.applyTo(B);   // B <- B * HσG
77
78     for (int v = 0; v < N; v++) {
79       for (int u = 0; u < M; u++) {
80         float a = A.getf(u, v);
81         float b = B.getf(u, v);
82         float sigmaG =
83             (float) Math.sqrt(b - a * a);  // Eqn. (9.81)
84         Imean.setf(u, v, a);               // = µG(p)
85         Isigma.setf(u, v, sigmaG);         // = σG(p)
86       }
87     }
88   }
89 } // end of inner class NiblackThresholder.Gauss
90 } // end of class NiblackThresholder



library14 is Thresholder, with the sub-classes GlobalThresholder and AdaptiveThresholder for the methods described in Secs. 9.1 and 9.2, respectively. Class Thresholder itself is abstract and only defines a set of (non-public) utility methods for histogram analysis.

9.3.1 Global Thresholding Methods
The thresholding methods covered in Sec. 9.1 are implemented by the following classes:



• MeanThresholder, MedianThresholder (Sec. 9.1.2),
• QuantileThresholder (Alg. 9.1),
• IsodataThresholder (Alg. 9.2–9.3),
• OtsuThresholder (Alg. 9.4),
• MaxEntropyThresholder (Alg. 9.5), and
• MinErrorThresholder (Alg. 9.6).



These are sub-classes of the (abstract) class GlobalThresholder. The following example shows the typical use for a given ByteProcessor object ip:

GlobalThresholder thr = new IsodataThresholder();
int q = thr.getThreshold(ip);
if (q > 0)
  ip.threshold(q);
else ...



14 Package imagingbook.pub.threshold



Here threshold() is ImageJ's built-in method defined by class ImageProcessor.

9.3.2 Adaptive Thresholding
The algorithms described in Sec. 9.2 are implemented by the following classes:
• BernsenThresholder (Alg. 9.7),
• NiblackThresholder (Alg. 9.8, multiple versions), and
• SauvolaThresholder (Eqn. (9.74)).
These are sub-classes of the (abstract) class AdaptiveThresholder. The following example demonstrates the typical use of these methods for a given ByteProcessor object I:

AdaptiveThresholder thr = new BernsenThresholder();
ByteProcessor Q = thr.getThreshold(I);
thr.threshold(I, Q);



The 2D threshold surface is represented by the image Q; the method threshold(I, Q) is defined by class AdaptiveThresholder. Alternatively, the same operation can be performed without making Q explicit, as demonstrated by the following code segment:

// Create and set up a parameter object:
Parameters params = new BernsenThresholder.Parameters();
params.radius = 15;
params.cmin = 15;
params.bgMode = BackgroundMode.DARK;
// Create the thresholder:
AdaptiveThresholder thr = new BernsenThresholder(params);
// Perform the threshold operation:
thr.threshold(I);



This example also shows how to specify a parameter object (params) for the instantiation of the thresholder.



9.4 Summary and Further Reading
The intention of this chapter was to give an overview of established methods for automatic image thresholding. A vast body of relevant literature exists, and thus only a fraction of the proposed techniques could be discussed here. For additional approaches and references, several excellent surveys are available, including [105, 204, 229, 263] and [241]. Given the obvious limitations of global techniques, adaptive thresholding methods have received continued interest and are still a focus of ongoing research. Another popular approach is to calculate an adaptive threshold through image decomposition. In this case, the image is partitioned into (possibly overlapping) tiles, an "optimal" threshold is calculated for each tile and the adaptive threshold is






obtained by interpolation between adjacent tiles. Another interesting idea, proposed in [292], is to specify a "threshold surface" by sampling the image at specific points that exhibit a high gradient, under the assumption that these points lie at transitions between background and foreground. From these irregularly spaced point samples, a smooth "potential surface" that passes through the sample points is interpolated by solving a Laplacian difference equation. This is accomplished with the so-called "successive over-relaxation" method, which requires about N scans over an image of size N × N to converge, so its time complexity is an expensive O(N³). A more efficient approach was proposed in [33], which uses a hierarchical, multi-scale algorithm for interpolating the threshold surface. Similarly, a quad-tree representation was used for this purpose in [58]. Another interesting concept is "kriging" [201], which was originally developed for interpolating 2D geological data [214, Sec. 3.7.4].
In the case of color images, simple thresholding is often applied individually to each color channel and the results are subsequently merged using a suitable logical operation (see Exercise 9.4). Transformation to a non-RGB color space (such as HSV or CIELAB) might be helpful for this purpose (see Sec. 13.2.3 and Ch. 14). For a binarization method aimed specifically at vector-valued images, see [181], for example. Since thresholding can be viewed as a specific form of segmentation, color segmentation methods [59, 65, 104, 245] are also relevant for binarizing color images.



9.5 Exercises

Exercise 9.1. Define a procedure for estimating the minimum and maximum pixel value of an image from its histogram. Threshold the image at the resulting mid-range value (see Eqn. (9.12)). Can anything be said about the size of the resulting partitions?

Exercise 9.2. Define a procedure for estimating the median of an image from its histogram. Threshold the image at the resulting median value (see Eqn. (9.11)) and verify that the foreground and background partitions are of approximately equal size.

Exercise 9.3. The algorithms described in this chapter assume 8-bit grayscale input images (of type ByteProcessor in ImageJ). Adapt the current implementations to work with 16-bit integer images (of type ShortProcessor). Images of this type may contain pixel values in the range [0, 2¹⁶−1] and the getHistogram() method returns the histogram as an integer array of length 65536.
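As a starting point for Exercise 9.2, the median can be read off the cumulative histogram. The following is only a sketch; the helper name is hypothetical and not part of the imagingbook API:

static int medianFromHistogram(int[] h, int totalPixels) {
  int count = 0;
  for (int i = 0; i < h.length; i++) {
    count += h[i];                   // cumulative pixel count up to level i
    if (count >= totalPixels / 2)    // first level where the count reaches n/2
      return i;
  }
  return h.length - 1;
}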






Exercise 9.4. Implement simple thresholding for RGB color images by thresholding each (scalar-valued) color channel individually and then merging the results by performing a pixel-wise AND operation. Compare the results to those obtained by thresholding the corresponding grayscale (luminance) images.
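A possible starting point for this exercise is sketched below (an illustration only; the per-channel thresholds and all names are assumptions, and channel extraction by bit masking is just one of several options):

ColorProcessor cp;                  // input RGB image (assumed given)
int qR = 128, qG = 128, qB = 128;   // per-channel thresholds (e.g., from a global method)
int M = cp.getWidth(), N = cp.getHeight();
ByteProcessor bin = new ByteProcessor(M, N);
for (int v = 0; v < N; v++) {
  for (int u = 0; u < M; u++) {
    int c = cp.get(u, v);           // packed 0xRRGGBB pixel value
    int r = (c >> 16) & 0xFF, g = (c >> 8) & 0xFF, b = c & 0xFF;
    boolean fg = (r > qR) && (g > qG) && (b > qB);  // AND of the three channel decisions
    bin.set(u, v, fg ? 255 : 0);
  }
}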



Exercise 9.5. Re-implement the Bernsen and/or Niblack thresholder (classes BernsenThresholder and NiblackThresholder) using integral images (see Sec. 2.8.1) for efficiently calculating the required local mean and variance of the input image over a rectangular support region R.






Part IV



Geometric Primitives



10 Fitting Straight Lines



The straight line is a simple yet important concept in the context of image geometry. Images of virtually any man-made structure contain lines while, on the other hand, straight lines are quite rare in natural images. In this chapter we first look at how lines can be described mathematically and which parameterizations are best suited for our purpose: fitting a line to a given set of 2D data points under some optimality criterion. The material presented here is also helpful for some of the following chapters, Ch. 12 in particular. Some auxiliary materials concerning straight lines and numerical examples can be found in Ch. F of the Appendix.



10.1 Straight Line Equations

10.1.1 Slope-Intercept Form
The most common ("classic") 2D line representation, familiar to every high school student, is

    y = k·x + d,    (10.1)



with the two parameters k and d. Geometrically, parameter k specifies the slope of the line and d stands for the axis intercept, i.e., the y-position where the line crosses the vertical axis (see Fig. 10.1). For some fixed values k, d, the line consists of all points (x, y) that satisfy Eqn. (10.1). It should be noted that the formulation in Eqn. (10.1) treats the y-coordinate as a linear function of the (free) x-coordinate, so the x and y coordinates do not have equal roles. While this is quite appropriate in many other areas (e.g., in statistics), there is generally no such functional relation between x/y in the context of image geometry. Moreover, the slope-intercept representation fails to represent all possible lines in the 2D plane. In particular, the slope of any vertical line is infinite and, since it runs parallel to the y-axis, its intercept is undefined as well. Even the parameters of lines that are only approximately vertical may be quite harmful in numerical






Fig. 10.1 Line in slope-intercept representation y = kx + d. Parameter k specifies the slope of line L and d is the intercept on the y-axis. Note that strictly vertical lines cannot be described with finite parameters.






calculations and thus a great number of potential lines are ruled out by this representation. We will therefore not use it again (except for didactical reasons in Ch. 12). What we should remember, though, is the fact that only two parameters are sufficient to specify a straight line in 2D.

10.1.2 Parametric (Point-Vector) Form
Given some point s = (xs, ys) on the line and a direction vector v = (xv, yv) ≠ 0, each point x = (x, y) on the (infinite) line L = ⟨s, v⟩ is specified by the equation

    x = s + t·v,    (10.2)



for some t ∈ R. This line representation has four parameters (xs, ys, xv, yv), which could be reduced to three by making v a unit vector and only specifying its angle. Obviously Eqn. (10.2) can describe any line in 2D, i.e., it is not subject to the inherent limitations of the slope-intercept form. Equivalently, one could specify two points x0, x1 on the line, such that v = x1 − x0, i.e.,

    x = x0 + t·(x1 − x0),    (10.3)



again with four parameters. This is particularly useful for defining line segments connecting two points (with t ∈ [0, 1]), which is the standard convention for drawing lines in most 2D graphics APIs.

10.1.3 Algebraic Form
The "general" or "algebraic" line equation is of the form

    A·x + B·y + C = 0,    (10.4)



with parameters A, B, C ∈ R, under the constraint that at least one of A, B is nonzero (i.e., A²+B² > 0). Otherwise, i.e., if A = B = 0, the set of points (x, y) described by Eqn. (10.4) is either empty (if C ≠ 0) or covers the entire 2D plane (if C = 0). Since the expression in Eqn. (10.4) is a (first-order) polynomial in x, y, it is also called an algebraic equation. It can describe any possible line in the 2D plane with finite parameters, including vertical (B = 0) and horizontal (A = 0) lines. Parameters A, B specify the orientation of the line and C its position. In particular, the vector

    n = (A, B)⊺    (10.5)



is perpendicular (normal) to the associated line, while the vector n′ = (−B, A)⊺ is parallel to the line. Note that the slope-intercept form y = kx + d (Eqn. (10.1)) can be written as

    k·x − y + d = 0,    (10.6)



which is equivalent to an algebraic line with parameters (A, B, C) = (k, −1, d), i.e., with parameter B = −1 being fixed. Since B = 0 for strictly vertical lines, such lines cannot be represented by the slope-intercept form. Concrete formulas for converting between the various types of line representations are given in Sec. F.1.1 of the Appendix.

Normalizing Algebraic Line Parameters
The three parameters of an algebraic line are not unique: if some line is specified by parameters

    q = (A, B, C)⊺,    (10.7)



then any scaled parameter vector

    q′ = s·q = (sA, sB, sC)⊺    (10.8)



describes exactly the same line (for s ≠ 0). The line parameters can be normalized, for example, by applying the constraint

    ∥n∥² = A² + B² = 1,    (10.9)

i.e., by scaling the parameters q to

    q̂ = (Â, B̂, Ĉ)⊺ = (1/∥n∥)·q = 1/√(A²+B²) · (A, B, C)⊺.    (10.10)



As a result, the normal vector n̂ = (Â, B̂)⊺ of the normalized line has unit length (i.e., ∥n̂∥ = 1).

Distance to an Algebraic Line
Given an (unnormalized) algebraic line L with parameters A, B, C, a point x = (x, y) is exactly on this line (i.e., its distance to L is zero) if Eqn. (10.4) is satisfied. Obviously, the left-hand side of Eqn. (10.4) will evaluate to some nonzero residual, that is,

    A·x + B·y + C = R ≠ 0,    (10.11)



if point x is somewhere off the line. Conveniently, the residual R directly relates to the minimum (orthogonal) distance of point x to the line L, which is

    dist(x, L) = (1/∥n∥)·R = (A·x + B·y + C) / √(A²+B²).    (10.12)



The result may be positive or negative. For a line L̂ with normalized parameters Â, B̂, Ĉ (see Eqn. (10.10)), √(Â²+B̂²) = 1 and thus Eqn. (10.12) reduces to

    dist(x, L̂) = Â·x + B̂·y + Ĉ.    (10.13)
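As a small illustration (a sketch only, not the imagingbook AlgebraicLine API introduced in Sec. 10.4), the signed distance of Eqn. (10.12) translates directly into code:

static double distToLine(double A, double B, double C, double x, double y) {
  return (A * x + B * y + C) / Math.sqrt(A * A + B * B);   // Eqn. (10.12)
}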






10.1.4 Hessian Normal Form
Even with the constraint on A, B (see Eqn. (10.9)), a normalized algebraic line requires three parameters, although a 2D line has only two degrees of freedom. The Hessian normal form1 representation of a line removes this redundancy by replacing Â, B̂ of the normalized algebraic definition by cos(φ) and sin(φ), respectively:2

    cos(φ)·x + sin(φ)·y + Ĉ = 0.    (10.14)



Angle φ is the orientation of the normal vector n = (A, B)⊺, i.e.,

    φ = ∠n = ArcTan(A, B).    (10.15)



By Eqn. (10.13), the distance of such a line to the coordinate origin (x = y = 0) is −Ĉ = r, which leads to the usual notation

    cos(φ)·x + sin(φ)·y − r = 0.    (10.16)



This general line representation needs only two parameters (r, φ) and has a very intuitive geometric interpretation, as illustrated in Fig. 10.2.

Fig. 10.2 Line in Hessian normal form. Parameter r specifies the orthogonal distance of the line from the coordinate origin; φ is the orientation of the line's normal vector.






The distance of a point x = (x, y) to a Hessian line is (analogous to Eqn. (10.13))

    dist(x, L) = cos(φ)·x + sin(φ)·y − r.    (10.17)



Note that the representation is still ambiguous, since the parameters (r, φ) and (−r, φ + π) specify the same line. This multiplicity can be avoided by restricting r to positive values or constraining the angle φ to the range ±π/2, for example.3



10.2 Fitting Lines to Point Sets
The primary topic of this chapter is how to fit a straight line to a set of n observed points X = (x0, . . . , xn−1), i.e., to find the parameters of a line that (somehow) minimizes the distances to the points xi = (xi, yi).






1 Also called Hesse normal form in some sources.
2 Since [cos(φ)]² + [sin(φ)]² = 1.
3 See Sec. 12.2.3 and Appendix Sec. F.1 for additional details.



10.2.1 Linear Regression
A common and seemingly similar problem is known in statistics as linear regression, where the task is to fit a linear model to a given set of n measurements (xi, yi). In short, the idea is that the observed quantities xi and yi are linearly related in the form

    yi ≈ fk,d(xi) = k·xi + d,    (10.18)



with parameters k and d (see Eqn. (10.1)). The mismatch (or residual) ri between a measurement yi and the associated value "predicted" by the linear model is expressed by their difference, i.e.,

    ri = yi − fk,d(xi).    (10.19)



Fitting then means to find the optimal model parameters k, d that result in the smallest overall differences, for example, by minimizing the sum of the squared residuals, i.e.,

    F(k, d) = Σ_{i=0}^{n−1} ri² = Σ_{i=0}^{n−1} [fk,d(xi) − yi]².    (10.20)



Squaring the quantities ri is one way to make sure that positive and negative differences contribute equally to the overall error and do not mutually cancel out.4 The formulation in Eqn. (10.20) is called ordinary least-squares regression and its solution is straightforward to calculate in closed form. The optimal line parameters can be found as the solution of the system of linear equations5

    [ Σxi²  Σxi ]   [k]   [ Σxi·yi ]
    [ Σxi   n   ] · [d] = [ Σyi    ]    (10.21)

(with n = Σ1 = |X|), which is

    [k]            1            [ Σxi·Σyi − n·Σxi·yi    ]
    [d] = ------------------- · [ Σxi·Σxi·yi − Σxi²·Σyi ].    (10.22)
           (Σxi)² − n·Σxi²
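A direct transcription of Eqn. (10.22) might look as follows (a sketch with a hypothetical helper name, not part of the imagingbook API):

static double[] fitLineRegression(double[] x, double[] y) {
  int n = x.length;
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  for (int i = 0; i < n; i++) {
    sx += x[i];  sy += y[i];
    sxx += x[i] * x[i];  sxy += x[i] * y[i];
  }
  double den = sx * sx - n * sxx;          // (Σxi)² − n·Σxi²
  double k = (sx * sy - n * sxy) / den;    // slope, Eqn. (10.22)
  double d = (sx * sxy - sxx * sy) / den;  // intercept, Eqn. (10.22)
  return new double[] {k, d};
}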



Geometrically, ordinary regression minimizes the vertical distances (see Eqn. (10.19)) between the sample points and the line defined by parameters k, d, as illustrated in Fig. 10.3. Unfortunately, this model is not suitable for fitting arbitrary lines in the 2D plane, since the variables x and y do not take equal roles and, as in the intimately related slope-intercept form of the line equation (see Sec. 10.1.1), proper vertical lines are not feasible. Since minimization is performed on strictly vertical distances, the optimal fit depends on the orientation of the point set w.r.t. the x/y axes (see Fig. 10.4). If the point set is shifted and rotated, we expect the best-fitting line to shift and rotate by the same amount, i.e., remain in the same relative position to the data points. Thus optimal geometric fits should be at least translation- and rotation-invariant, but the latter is not satisfied by the ordinary regression model.

4 Another option is to take the sum of the absolute values |ri|, which makes optimization a bit more difficult.
5 Summation operators Σ are used as an abbreviation for Σ_{i=0}^{n−1}.






Fig. 10.4 Linear regression vs. orthogonal fitting under rotation. Lines are fitted to the same random points, rotated in 20° steps. Lines found by linear regression (minimizing vertical distances) clearly depend on the relative orientation to the coordinate frame. In contrast, lines obtained by orthogonal fitting always remain in the same position w.r.t. the point set. The difference is most pronounced when point sets are vertically oriented. Note that all lines pass through the centroid of the point set.






Fig. 10.3 Ordinary linear regression. The optimal fit is obtained by minimizing the squared differences (distances in y-direction) between the observed quantities yi and values f (xi ) predicted by the linear model at the associated positions xi .






10.2.2 Orthogonal Regression
To obtain a more solid geometric fit, a clearly better choice is to minimize the orthogonal distances between the data points and the line, since this will make the optimal fit orientation-independent (see Fig. 10.5). It is also a good idea to switch to the more general algebraic line representation (see Sec. 10.1.3) to avoid the singularity problem associated with vertical lines. Thus the line model is A·x + B·y + C = 0 with parameters q = (A, B, C)⊺, and we also assume that the parameters are normalized to A²+B² = 1, such that

    di = A·xi + B·yi + C    (10.23)



is the (signed) orthogonal distance of point (xi, yi) to the line (see Eqn. (10.13)). The objective function to be minimized is

    F(q) = Σ_{i=0}^{n−1} di² = Σ_{i=0}^{n−1} [A·xi + B·yi + C]².    (10.24)






Fig. 10.5 Orthogonal regression minimizes the perpendicular distance of the line to the given sample data xi . The result is independent of the coordinate frame. Thus the position of the line relative to the point set remains unchanged under rigid transformations.






Solution by Eigendecomposition (Alg. 10.1)
The parameters for the minimum least-squares fit are found as the solution of the linear system

    [ Σxi²    Σxi·yi  Σxi ]   [A]       [A]
    [ Σxi·yi  Σyi²    Σyi ] · [B] = µ · [B],    (10.25)
    [ Σxi     Σyi     n   ]   [C]       [0]

for some µ ≠ 0. Parameter C depends linearly on A, B and can be isolated (from the last row) as

    C = −(1/n)·(A·Σxi + B·Σyi) = −A·x̄ − B·ȳ,    (10.26)

which reduces Eqn. (10.25) to the 2 × 2 system

    [ Σxi² − (1/n)·(Σxi)²      Σxi·yi − (1/n)·Σxi·Σyi ]   [A]       [A]
    [ Σxi·yi − (1/n)·Σxi·Σyi   Σyi² − (1/n)·(Σyi)²    ] · [B] = λ · [B].    (10.27)



This is a classic eigenvalue problem of the form A·x = λ·x, which can be solved with standard numerical techniques.6 The matrix in Eqn. (10.27) is symmetric and has two orthogonal eigenvectors e0, e1, with associated eigenvalues λ0, λ1. The eigenvector associated with the larger eigenvalue points in the direction of the line, the other (for the smaller eigenvalue) is perpendicular to the line. Since the vector (A, B) is also perpendicular to the line, the second eigenvector is the solution, i.e.,

    (A, B)⊺ = ek,    (10.28)



with k = argmin_i {λi}. Once A, B are known, C can be recovered by Eqn. (10.26). Note that the resulting line always passes through the centroid x̄ = (x̄, ȳ) of the point set. The optimal fit may not be unique, for example, if the data points are the vertices of a regular polygon. In this case, any line passing through the center exhibits the same sum of squared point distances, regardless of its orientation. The 2 × 2 matrix in Eqn. (10.27) is also called the scatter matrix

    S = [ sxx  sxy ]
        [ sxy  syy ]    (10.29)

6 See Sec. B.5 of the Appendix.






Alg. 10.1 Orthogonal least-squares line fitting based on eigendecomposition. The 2 × 2 scatter matrix S is built from the data points in lines 3–13. Parameters A, B are found by calculating the eigendecomposition of S (see Sec. B.5 of the Appendix) in lines 14–16.

 1: OrthogonalLineFitEigen(X)
    Input: X = (x0, . . . , xn−1), a collection of n ≥ 2 observed 2D sample points xi = (xi, yi). Returns the parameters (A, B, C) for the best-fitting line.
 2:   n ← |X|
 3:   (Sx, Sy, Sxx, Syy, Sxy) ← (0, 0, 0, 0, 0)
 4:   for all (x, y) ∈ X do
 5:     Sx ← Sx + x
 6:     Sy ← Sy + y
 7:     Sxx ← Sxx + x²
 8:     Syy ← Syy + y²
 9:     Sxy ← Sxy + x·y
10:   sxx ← Sxx − Sx²/n
11:   syy ← Syy − Sy²/n
12:   sxy ← Sxy − Sx·Sy/n
13:   S ← [ sxx sxy ; sxy syy ]     ▷ scatter matrix
14:   (λ, E) ← eigen(S)             ▷ λ = (λ0, λ1), E = (e0 | e1)
15:   k ← argmin_i λ(i)             ▷ index of smallest eigenvalue
16:   (A, B)⊺ ← E(∗, k)             ▷ kth eigenvector
17:   C ← −(A·Sx + B·Sy)/n
18:   q ← (A, B, C)                 ▷ A²+B² = 1
19:   return q



For the data X = {(xi, yi)}, the elements of the scatter matrix S are7

    sxx = Σ(xi − x̄)²         = Σxi² − (1/n)·(Σxi)²,
    syy = Σ(yi − ȳ)²         = Σyi² − (1/n)·(Σyi)²,    (10.30)
    sxy = Σ(xi − x̄)·(yi − ȳ) = Σxi·yi − (1/n)·Σxi·Σyi,



where x̄ = (1/n)·Σxi and ȳ = (1/n)·Σyi are the means of the x and y data components, respectively. Note that the method for calculating S suggested in Eqn. (10.27) and Eqn. (10.30) does not require precalculation of the sample means x̄, ȳ and thus can be done in a single loop with reduced round-off errors. The complete procedure is summarized in Alg. 10.1. Alternatively, the scatter matrix can also be defined as

    S = X⊺·X,  with  X = [ x0−x̄     y0−ȳ   ]
                         [   ⋮        ⋮     ]    (10.31)
                         [ xn−1−x̄   yn−1−ȳ ]

It should be mentioned that the eigenvalue solution works even if the scatter matrix S is singular, which happens when all data points are collinear, i.e., when the line is a perfect fit.

Solution by Singular-Value Decomposition (Alg. 10.2)
Alternatively, the least-squares problem in Eqn. (10.24) can be solved without building the 2 × 2 scatter matrix S but directly from the centered data matrix X (see Eqn. (10.31)).






The quantities sxx , syy , sxy are identical to the second-order central moments µ20 , µ02 , µ11 (see Sec. 8.5.3) of the data set.



Alg. 10.2 Orthogonal least-squares line fitting based on singular-value decomposition. The scatter matrix is not constructed; instead, singular-value decomposition (see Sec. B.8 of the Appendix) is applied directly to the centered data matrix X (line 8). The parameters A, B are found as the column vector k of matrix V corresponding to the smallest singular value σk (line 11).

 1: OrthogonalLineFitSvd(X)
    Input: X = (x0, . . . , xn−1), a collection of n ≥ 2 observed 2D sample points xi = (xi, yi). Returns the parameters (A, B, C) for the best-fitting line.
 2:   n ← |X|
 3:   (Sx, Sy) ← (0, 0)
 4:   for all (x, y) ∈ X do
 5:     Sx ← Sx + x,  Sy ← Sy + y
 6:   x̄ ← Sx/n,  ȳ ← Sy/n                          ▷ centroid (x̄, ȳ)
 7:   X ← [ x0−x̄ y0−ȳ ; … ; xn−1−x̄ yn−1−ȳ ]        ▷ n × 2 centered data matrix
 8:   (U, Σ, V) ← SVD(X)                            ▷ singular-value decomp. of X
 9:   (σ0, σ1) ← diag(Σ)                            ▷ singular values
10:   k ← argmin_i σ(i)                             ▷ index of smallest singular value
11:   (A, B)⊺ ← V(∗, k)                             ▷ kth column vector of V
12:   C ← −(A·x̄ + B·ȳ)
13:   q ← (A, B, C)                                 ▷ A²+B² = 1
14:   return q



The singular-value decomposition of the (non-square) matrix X in the form

    X = U·Σ·V⊺    (10.32)

yields a unitary 2 × 2 matrix V, whose column vectors correspond to the eigenvectors of X⊺·X = S. The solution for (A, B)⊺ is the column vector of V associated with the smallest diagonal (singular) value in Σ. Since the vectors of V are orthonormal, no additional normalization of (A, B)⊺ is necessary. The resulting algorithm, which is summarized in Alg. 10.2, should generally provide better numerical stability than Alg. 10.1.

Solution by Constrained Homogeneous System (Alg. 10.3)
This (third) approach to orthogonal line fitting, described in [100, Sec. 6.7], works without data centering and other point statistics. The objective function in Eqn. (10.24) can be expressed as a system of n linear equations,

    [ 1  x0    y0   ]   [C]   [0]
    [ ⋮    ⋮     ⋮   ] · [A] ≈ [⋮]      or      X·a ≈ 0    (10.33)
    [ 1  xn−1  yn−1 ]   [B]   [0]



(with A, B, C deliberately permuted). This is an overdetermined, homogeneous8 system which one may be tempted to solve directly with the usual least-squares methods, i.e., by minimizing ∥X·a − 0∥² = ∥X·a∥². This, however, would not enforce the necessary constraint



The system is called homogeneous because of the zero vector on the right-hand side, unlike ordinary linear systems of the form Ax = b (see Sec. B.8.3 of the Appendix).






A²+B² = 1. For this purpose, matrix X is first factored by QR decomposition as

    X = Q·R,    (10.34)

where Q is an orthogonal matrix (i.e., Q⁻¹ = Q⊺) and R is an upper triangular matrix.9 If some vector a is a solution to X·a = 0, it is also a solution to Q·R·a = 0, and therefore (since Q⁻¹·Q = I)

    Q⁻¹·Q·R·a = Q⁻¹·0    or    R·a = 0,    (10.35)



which reduces the problem in Eqn. (10.33) to

    [ r00  r01  r02 ]   [C]   [0]
    [ 0    r11  r12 ] · [A] ≈ [0].    (10.36)
    [ 0    0    r22 ]   [B]   [0]



Since the constraint applies to only two of the three unknowns, and since the last two rows of Eqn. (10.36) are independent of C (due to the clever arrangement of X), it is sufficient to solve the homogeneous sub-system

    R′·a′ = [ r11  r12 ] · [A] ≈ [0],    (10.37)
            [ 0    r22 ]   [B]   [0]

subject to the nonlinear constraint A²+B² = 1. Equation (10.37) has no non-trivial (i.e., non-zero) exact solution, but the least-squares solution for ∥a′∥ = 1 is obtained by singular-value decomposition of R′,

    R′ = U·Σ·V⊺,    (10.38)

as the column vector vk of V associated with the smallest singular value σk of Σ, i.e.,

    (A, B)⊺ = vk,  with  k = argmin_i {σi ∈ diag(Σ)}.    (10.39)



Once A, B are found, the remaining third parameter is calculated (by inserting into the first line of Eqn. (10.36)) as10

    C = −(1/r00)·(A·r01 + B·r02),    (10.40)

where r00 ≠ 0 is guaranteed by the QR decomposition. The complete procedure is summarized in Alg. 10.3. A special situation to be considered is n = 2, which makes matrix R non-square, of size 2 × 3 (in Eqn. (10.36)). This can be easily fixed by substituting 0 for the missing element R2,2 (see Alg. 10.3, line 6).
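For illustration only, here is a compact sketch of this QR/SVD approach using Apache Commons Math (an assumption; the book's own implementation is the imagingbook class OrthogonalLineFitHomogeneous introduced in Sec. 10.4):

import org.apache.commons.math3.linear.*;

// Sketch of the homogeneous line fit (Alg. 10.3); pts[i] = {xi, yi}.
static double[] fitLineHomogeneous(double[][] pts) {
  int n = pts.length;
  double[][] Xd = new double[n][3];
  for (int i = 0; i < n; i++) {
    Xd[i][0] = 1;  Xd[i][1] = pts[i][0];  Xd[i][2] = pts[i][1];
  }
  RealMatrix R = new QRDecomposition(new Array2DRowRealMatrix(Xd)).getR();
  double r00 = R.getEntry(0, 0), r01 = R.getEntry(0, 1), r02 = R.getEntry(0, 2);
  double r11 = R.getEntry(1, 1), r12 = R.getEntry(1, 2);
  double r22 = (n == 2) ? 0 : R.getEntry(2, 2);          // special case n = 2
  RealMatrix Rp = new Array2DRowRealMatrix(new double[][] {{r11, r12}, {0, r22}});
  SingularValueDecomposition svd = new SingularValueDecomposition(Rp);
  double[] sv = svd.getSingularValues();
  int k = (sv[0] <= sv[1]) ? 0 : 1;                       // index of smallest singular value
  RealMatrix V = svd.getV();
  double A = V.getEntry(0, k), B = V.getEntry(1, k);
  double C = -(A * r01 + B * r02) / r00;                  // Eqn. (10.40)
  return new double[] {A, B, C};
}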



10.3 Example: Contour Segmentation
Figures 10.7 and 10.6 show the application of line fitting for contour segmentation. Input data (point sequences) are obtained by thresholding the image, applying binary region segmentation, and extracting the outer region contours (see Sec. 8.2).






9 See also Sec. B.8.2.
10 Note the similarity to Eqn. (10.26).



Alg. 10.3 Orthogonal line fitting by homogeneous least-squares solution. The algorithm operates directly on the (non-centered) point data and does not calculate any scatter statistics. The optimal line parameters are found by QR decomposition of the data matrix X (line 4), followed by a constrained least-squares solution of a 2 × 2 homogeneous system using singular-value decomposition (line 8).

 1: OrthogonalLineFitHomogeneous(X)
    Input: X = (x0, . . . , xn−1), a collection of n ≥ 2 observed 2D sample points xi = (xi, yi). Returns the parameters (A, B, C) for the best-fitting line.
 2:   n ← |X|
 3:   X ← [ 1 x0 y0 ; … ; 1 xn−1 yn−1 ]              ▷ see Eqn. (10.33)
 4:   (Q, R) ← QRD(X)                                 ▷ QR decomp. of X (Eqn. (10.34))
 5:   r11 ← R1,1,  r12 ← R1,2
 6:   r22 ← 0 if n = 2,  R2,2 otherwise               ▷ special case n = 2
 7:   R′ ← [ r11 r12 ; 0 r22 ]                        ▷ lower-right 2 × 2 submatrix of R, Eqn. (10.37)
 8:   (U, Σ, V) ← SVD(R′)                             ▷ singular-value decomp. of R′
 9:   σ ← diag(Σ)                                     ▷ singular values σ = (σ0, σ1)
10:   k ← argmin_i σ(i)                               ▷ index of smallest singular value
11:   (A, B)⊺ ← V(∗, k)                               ▷ kth column vector of V
12:   C ← −(1/R0,0) · (A·R0,1 + B·R0,2)               ▷ Eqn. (10.40)
13:   q ← (A, B, C)                                   ▷ A²+B² = 1
14:   return q



The goal is to split the contour into a sequence of straight segments, i.e., to find maximal groups of successive contour points that form straight lines. One way to perform contour segmentation is by the following coarse steps:
1. Move along the contour to find an initial (next) straight start segment from a small number of successive contour points. Stop if none can be found.
2. Add more contour points to the start segment as long as it remains sufficiently straight. Refit the line for every new point added.
3. Remove some points (which caused the termination of the line) from the end of the segment, refit and collect the line.
4. Continue with Step 1.
A basic implementation of this process is given in Alg. 10.4. The input to function SegmentContour() is a sequence X = (xi) of successive contour points. The list of collected line segments (S) is returned. The main steps of the algorithm (as listed above) are carried out by function GetNextSegment(). Line fitting and error calculation (summarized in function FitLine()) are done repeatedly by invoking function OrthogonalLineFitEigen() (see Alg. 10.1). Since lines must be re-fitted whenever a contour point is added or removed, an "incremental" line fitting method could improve the overall efficiency of this process (see Exercise 10.3). Note that the implementation in Alg. 10.4 is only rudimentary and does not address various details (and possible pitfalls) that must be considered in practice.






Fig. 10.6 Contour segmentation examples (Alg. 10.4). Region segmentation was applied to the binary images in (a, b) to extract the associated (4-connected) contours. The enlarged views in (c–f) show each extracted line and the contributing contour points in the same color. Note that the algorithm excludes points near corners to improve line fitting. Intersection points (circled) of successive line segments can be used to determine corner positions accurately even when intersections are quite "flat" (e, f).






For example, contours may be open (as assumed here) or closed, in which case all point indices must be treated modulo n. Also, the start position may lie inside a good straight line segment, which should not be broken up.



10.4 Java Implementation
The imagingbook library defines several straight line representations and associated conversion methods, including the classes11






Package imagingbook.pub.geometry.line.



Alg. 10.4 Incremental contour segmentation. The algorithm tries to locate a sequence of straight line segments from an ordered sequence of contour points (X). The contour may be open-ended or closed. The main work is done by function GetNextSegment(), which first collects a short sequence of contour points, starting at the current point index cur (line 17). This point sequence is shifted along the contour until a sufficiently straight segment is found (by line fitting, lines 21–25). Then the segment is extended by adding additional contour points for as long as it remains sufficiently straight (lines 28–34). This is stopped when either the next point is too far off the line or the overall fitting error (average line-to-point distance) becomes too large. Once the end of a segment is found, the last few points are removed again and a final line fit is performed (lines 35–37). Function FitLine() performs an orthogonal line fit on the point sequence Q and calculates the associated average error ε (lines 38–41). Any of the orthogonal line fitting algorithms (Algs. 10.1–10.3) may be used, but OrthogonalLineFitEigen() can be easily adapted for incremental line fitting.

 1: SegmentContour(X)
    Input: X = (x0, . . . , xn−1), a collection of n contour points. Returns a list of straight line segments.
 2:   a ← 20            ▷ no. of points on the initial segment (global)
 3:   b ← 6             ▷ no. of points removed at end of segment (global)
 4:   dmax ← 2.0        ▷ max. single point-to-line distance (global)
 5:   εmax ← 0.3        ▷ max. average point-to-line distance (global)
 6:   n ← |X|           ▷ no. of contour points (global)
 7:   cur ← 0           ▷ current point index (global)
 8:   S ← ()            ▷ list of line segments
 9:   do
10:     L ← GetNextSegment()
11:     if L ≠ nil then
12:       add(S, L)
13:   while L ≠ nil ∧ cur < n
14:   return S

15: GetNextSegment()
16:   Q ← ()                            ▷ point queue for current segment
17:   for i ← 1, . . . , a do           ▷ create start segment from a points
18:     add(Q, X(cur))
19:     cur ← cur + 1
20:   (L, ε) ← FitLine(Q)
      Shift the segment until sufficiently straight:
21:   while ε > εmax ∧ cur < n do
22:     removeFirst(Q)                  ▷ remove first point of Q
23:     add(Q, X(cur))                  ▷ add point to end of Q
24:     cur ← cur + 1
25:     (L, ε) ← FitLine(Q)
26:   if cur ≥ n then
27:     return nil                      ▷ no straight start segment found
      Extend the segment as long as possible:
28:   while ε ≤ εmax ∧ cur < n do
29:     x ← X(cur)                      ▷ next candidate point
30:     if |dist(L, x)| > dmax then     ▷ x is too far off the line
31:       break
32:     Q ← add(Q, x)
33:     cur ← cur + 1
34:     (L, ε) ← FitLine(Q)
      Trim segment (remove last b points from Q):
35:   for i ← 1, . . . , b do removeLast(Q)
36:   (L, ε) ← FitLine(Q)               ▷ refit the line
37:   return L

38: FitLine(Q)                          ▷ Q = (xi), sequence of points
39:   q ← OrthogonalLineFitEigen(Q)     ▷ see Alg. 10.1
40:   L ← new Line(q)                   ▷ q = (A, B, C)
41:   ε ← (1/|Q|) · Σ_{x∈Q} [dist(L, x)]²   ▷ see Eqn. (10.13)
42:   return (L, ε)






Fig. 10.7 Incremental contour segmentation on an ARToolkit marker. Straight lines are fitted to the outer contour of the largest region in the binary image (c, d). Accurate corner positions are obtained by calculating the intersection points of successive line segments.






• SlopeInterceptLine (Sec. 10.1.1),
• ParametricLine (Sec. 10.1.2),
• AlgebraicLine (Sec. 10.1.3),
• HessianLine (Sec. 10.1.4).
Among these, class AlgebraicLine is the most universal and flexible, providing methods for calculating the point-to-line distance, finding the closest point on the line, and the intersection with another algebraic line. The imagingbook library also provides sample implementations of all line fitting algorithms described in this chapter. The following classes implement the common interface LineFit:12
• LinearRegressionFit (Sec. 10.2.1),
• OrthogonalLineFitEigen (Alg. 10.1),
• OrthogonalLineFitSvd (Alg. 10.2),
• OrthogonalLineFitHomogeneous (Alg. 10.3).
A basic usage example is shown in Prog. 10.1. There is also an incremental line fitter (class LineFitIncremental), which allows adding and removing points while refitting the line dynamically.



12 Package imagingbook.pub.geometry.fitting.line



Prog. 10.1 Orthogonal line fitting example using class OrthogonalLineFitEigen. See Fig. F.1 (Appendix p. 864) for a visualization of the result.

import imagingbook.pub.geometry.basic.Pnt2d;
import imagingbook.pub.geometry.basic.PntUtils;
import imagingbook.pub.geometry.fitting.line.LineFit;
import imagingbook.pub.geometry.fitting.line.OrthogonalLineFitEigen;
import imagingbook.pub.geometry.line.AlgebraicLine;
...
double[][] X = {{1, 8}, {4, 5}, {4, 7}, {6, 4}, {9, 4}};
Pnt2d[] pts = PntUtils.fromDoubleArray(X);

LineFit fit = new OrthogonalLineFitEigen(pts);
AlgebraicLine line = fit.getLine();

double meanError = fit.getOrthogonalError(pts) / X.length;



10.5 Exercises

Exercise 10.1. Graphics APIs typically provide methods for drawing straight lines between two specified points but none for drawing algebraic lines. Devise a "brute-force" method to draw an algebraic line L = ⟨A, B, C⟩ by simply scanning the image and marking every pixel that is close enough to the line (see Eqns. (10.12)–(10.13)). How can the width of the plotted line be adjusted?

Exercise 10.2. Come up with an algorithm for plotting an algebraic line L with standard line-drawing commands (between two points). First, create algebraic lines for the four lines La, . . . , Ld representing the image boundary and calculate their intersections with line L (see Fig. 10.8). Then determine which (if any) of the intersection points are inside the visible part of the image. If L passes through the visible image rectangle, draw the resulting line segment (e.g., using the drawLine() method of ImageJ's ImageProcessor class). See Sec. F.1.2 of the Appendix for how to calculate algebraic line intersections.
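For Exercise 10.1, a brute-force sketch could look like this (the line width handling and marking color are assumptions):

void drawAlgebraicLine(ImageProcessor ip, double A, double B, double C, double width) {
  double norm = Math.sqrt(A * A + B * B);
  for (int v = 0; v < ip.getHeight(); v++) {
    for (int u = 0; u < ip.getWidth(); u++) {
      double d = Math.abs(A * u + B * v + C) / norm;  // orthogonal distance, Eqn. (10.12)
      if (d <= width / 2)
        ip.putPixel(u, v, 0);                         // mark pixel (black assumed)
    }
  }
}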






Fig. 10.8 Drawing the visible part of an algebraic line L by calculating the intersections with the image rectangle (defined by lines La , Lb , Lc , Ld ).






Exercise 10.3. Line fitting is often done repeatedly on point sets that differ only by a single point. In Alg. 10.4, for example, line segments are grown by adding contour points one by one and re-fitting the complete point set after each addition. Devise an "incremental" line fitting procedure based on Alg. 10.1 that allows individual points to be added to and removed from a line fit. For this purpose, modify function OrthogonalLineFitEigen() such that the summation variables Sx, . . . , Sxy are retained between individual fits and updated when a point is added or removed.
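One way to approach this exercise is to keep the five sums of Alg. 10.1 as state; the sketch below is an assumption-laden illustration, not the imagingbook class LineFitIncremental:

class IncrementalSums {
  double Sx, Sy, Sxx, Syy, Sxy;   // running sums of Alg. 10.1 (lines 2-9)
  int n;
  void add(double x, double y)    { Sx += x; Sy += y; Sxx += x*x; Syy += y*y; Sxy += x*y; n++; }
  void remove(double x, double y) { Sx -= x; Sy -= y; Sxx -= x*x; Syy -= y*y; Sxy -= x*y; n--; }
  // scatter matrix elements (Alg. 10.1, lines 10-12) follow directly from the current sums:
  double sxx() { return Sxx - Sx * Sx / n; }
  double syy() { return Syy - Sy * Sy / n; }
  double sxy() { return Sxy - Sx * Sy / n; }
}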






11 Fitting Circles and Ellipses



11.1 Fitting Circles
This chapter addresses the problem of fitting circles to a given set of data points in 2D, which arises in numerous applications in engineering, medicine, physics, remote sensing, archeology, etc. Despite the classic nature of this problem, reliable and efficient algorithms have only been developed in the last few decades and are still a topic of active research. As with straight lines (see Ch. 10), the goal is to find an optimal orthogonal fit of the point set in the least-squares sense. We distinguish two main types of fits: geometric fits minimize Euclidean orthogonal distances, while algebraic fits minimize some other (typically algebraic) expression as an approximation. In a statistical sense, geometric fits are considered the "gold standard", but they require the solution of nonlinear problems for which only iterative methods exist. On the other hand, algebraic fits are efficient to calculate in closed form (i.e., without any iterative steps) and their results often come sufficiently close to those from geometric fits even under noisy conditions, particularly if the sample points are well-distributed over the full circle. Thus in many situations an algebraic fit may be entirely sufficient or can at least provide a good starting solution for a subsequent geometric fit. Circle fits have been studied extensively and a great number of different methods exist (see [61] for an excellent overview). The aim of this section is to give a practical summary and outline concrete algorithms for selected methods that are in common use.

11.1.1 Circle Equations
There are various ways to define a circle mathematically. Typically we think of a circle as being defined by its center xc = (xc, yc) and radius r, with all circle points x = (x, y) placed at distance r from the center, i.e., ∥x − xc∥ = r or, equivalently,

    (x−xc)² + (y−yc)² = r²    or    (x−xc)²/r² + (y−yc)²/r² = 1.    (11.1)






Fig. 11.1 Basic circle fitting geometry. The "model" circle is represented by its center xc and radius r. For a given sample point xi, ri is the Euclidean distance to the circle center and di is the smallest distance to a point on the circle. x̆i denotes the circle point closest to xi.






This implicit circle equation in Cartesian coordinates is typically used for geometric fitting. The parametric form yields explicit circle coordinates as a function of a single scalar variable (angle) t, that is,

    (x, y)⊺ = (xc, yc)⊺ + r·(cos(t), sin(t))⊺,    (11.2)

for t = 0, . . . , 2π. This formulation makes it easy to draw circles as a sequence of straight line segments, for example. Finally, there are algebraic equations for the circle, which come in the form

    x² + y² + B·x + C·y + D = 0    (11.3)



or similar. While this refers to the same (x, y) coordinates, the algebraic parameters B, C, D have no direct geometric interpretation. Note that Eqn. (11.3) is just a special case of a 2D conic section. An obvious measure for evaluating the fit of a given circle with parameters p = (xc, yc, r) to a set of data points xi is

    F(p) = Σ_{xi∈X} (∥xi−xc∥ − r)² = Σ_{xi∈X} (ri − r)² = Σ_{xi∈X} di²,    (11.4)



i.e., the sum of the squared differences di between the fixed circle radius r and the individual point-to-center distances ri = ∥xi−xc∥ (see Fig. 11.1). Finding the circle parameters that minimize F(p) results in a geometric fit, as discussed in Sec. 11.1.3.

11.1.2 Algebraic Circle Fits
Geometric fits are desirable for various reasons (see [1]). Unfortunately, minimizing the expression in Eqn. (11.4) turns out to be a nonlinear least-squares problem, which is hard to handle analytically and generally not possible to solve in closed form. Algebraic fits ease this problem by using a slightly modified objective function,

    FA(p) = Σ_{xi∈X} (∥xi−xc∥² − r²)² = Σ_{xi∈X} (ri² − r²)²,    (11.5)



which is different from Eqn. (11.4) only by a pair of additional squares. The quantity (ri² − r²) is sometimes called the algebraic distance1 between point xi and the circle with radius r.






An interesting geometric interpretation of the quantity (ri2 − r2 ) is that it is proportional to the difference of areas of the original circle (with radius r) and another circle with radius ri .



Obviously, the algebraic distance is zero if xi is exactly on the circle and small only if xi is close to the circle. It is positive for points outside the circle and negative inside.

Algebraic Circle Fit 1: Kåsa's Method
The technique attributed to Kåsa [167] is one of the oldest algebraic circle fitting methods and considered a classic.2 It is certainly not among the most accurate methods but nevertheless interesting for its simplicity and efficiency. To obtain an objective function whose derivatives are linear w.r.t. its parameters, function FA() in Eqn. (11.5) is reformulated to

    FA(p) = Σ (∥xi−xc∥² − r²)² = Σ (ri² − r²)²                           (11.6)
          = Σ [ (√((xi−xc)² + (yi−yc)²))² − r² ]²                        (11.7)
          = Σ [ (xi−xc)² + (yi−yc)² − r² ]²                              (11.8)
          = Σ [ xi² − 2·xc·xi + xc² + yi² − 2·yc·yi + yc² − r² ]²        (11.9)
          = Σ [ xi² + yi² − 2·xc·xi − 2·yc·yi + (xc² + yc² − r²) ]².     (11.10)



Letting zi = xi² + yi² and changing the parameters to

    B = −2·xc,    C = −2·yc,    D = xc² + yc² − r²,    (11.11)

yields a new objective function

    FK(q) = Σ_i (zi + B·xi + C·yi + D)²,    (11.12)



with parameters q = (B, C, D). Note the analogy to the algebraic circle equation in Eqn. (11.3). Differentiating FK(q) with respect to the parameters B, C, D gives

    ∂F/∂B = 2·Σ(B·xi² + C·xi·yi + D·xi + xi·zi) = 2·(B·Σxi² + C·Σxi·yi + D·Σxi + Σxi·zi),    (11.13)
    ∂F/∂C = 2·Σ(B·xi·yi + C·yi² + D·yi + yi·zi) = 2·(B·Σxi·yi + C·Σyi² + D·Σyi + Σyi·zi),    (11.14)
    ∂F/∂D = 2·Σ(B·xi + C·yi + D + zi) = 2·(B·Σxi + C·Σyi + D·Σ1 + Σzi),                      (11.15)

which are all linear in B, C, D (xi, yi, zi being constant). At a minimum of function FK(q) all first derivatives must be zero, i.e., rewriting the above equations in compact matrix notation,

    (∂F/∂B)   [ Σxi²    Σxi·yi  Σxi ]   [B]   [ Σxi·zi ]   [0]
    (∂F/∂C) = [ Σxi·yi  Σyi²    Σyi ] · [C] + [ Σyi·zi ] = [0]    (11.16)
    (∂F/∂D)   [ Σxi     Σyi     Σ1  ]   [D]   [ Σzi    ]   [0]



2 Similar methods have been published by several other authors (see [61, Sec. 5.1]).



Alg. 11.1 Kåsa's algebraic circle fit algorithm. It is based on the exact solution of a linear 3 × 3 system.

 1: FitCircleKasa(X)
    Input: X = (x0, . . . , xn−1), a collection of n ≥ 3 observed 2D sample points xi = (xi, yi). Returns the geometric parameters (xc, yc, r) for the best-fitting circle or nil if no solution can be found.
 2:   n ← |X|
 3:   (sx, sy, sz, sxx, syy, sxy, sxz, syz) ← (0, . . . , 0)
 4:   for i ← 0, . . . , n−1 do
 5:     (x, y, z) ← (xi, yi, xi²+yi²)
 6:     sx ← sx + x,   sy ← sy + y,   sz ← sz + z,
        sxx ← sxx + x²,  syy ← syy + y²,  sxy ← sxy + x·y,
        sxz ← sxz + x·z,  syz ← syz + y·z
 7:   M ← [ sxx sxy sx ; sxy syy sy ; sx sy n ],   b ← (−sxz, −syz, −sz)⊺
 8:   if M is singular then               ▷ various ways to test, e.g., det(M) = 0
 9:     return nil                        ▷ no solution
10:   else
11:     q = (B, C, D)⊺ ← solve M·q = b    ▷ or q ← M⁻¹·b
12:     xc ← −0.5 · B
13:     yc ← −0.5 · C
14:     r ← 0.5 · √(B² + C² − 4·D)
15:     return (xc, yc, r)                ▷ = p_opt
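A compact sketch of this procedure, here using Apache Commons Math to solve the 3 × 3 system, is shown below; it is an illustration under these assumptions and not the imagingbook implementation:

import org.apache.commons.math3.linear.*;

// Sketch of the Kåsa circle fit (Alg. 11.1); pts[i] = {xi, yi}.
static double[] fitCircleKasa(double[][] pts) {
  double sx = 0, sy = 0, sz = 0, sxx = 0, syy = 0, sxy = 0, sxz = 0, syz = 0;
  int n = pts.length;
  for (double[] p : pts) {
    double x = p[0], y = p[1], z = x * x + y * y;
    sx += x;  sy += y;  sz += z;
    sxx += x * x;  syy += y * y;  sxy += x * y;
    sxz += x * z;  syz += y * z;
  }
  RealMatrix M = new Array2DRowRealMatrix(new double[][] {
      {sxx, sxy, sx}, {sxy, syy, sy}, {sx, sy, n}});
  RealVector b = new ArrayRealVector(new double[] {-sxz, -syz, -sz});
  DecompositionSolver solver = new LUDecomposition(M).getSolver();
  if (!solver.isNonSingular())
    return null;                                       // no solution (M is singular)
  RealVector q = solver.solve(b);                      // q = (B, C, D), Eqn. (11.17)
  double B = q.getEntry(0), C = q.getEntry(1), D = q.getEntry(2);
  double xc = -0.5 * B, yc = -0.5 * C;
  double r = 0.5 * Math.sqrt(B * B + C * C - 4 * D);   // Eqn. (11.18)
  return new double[] {xc, yc, r};
}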



or, in linear equation form,

    [ Σxi²    Σxi·yi  Σxi ]   [B]   [ −Σxi·zi ]
    [ Σxi·yi  Σyi²    Σyi ] · [C] = [ −Σyi·zi ]      or      M·q = b.    (11.17)
    [ Σxi     Σyi     n   ]   [D]   [ −Σzi    ]



The solution to this linear 3 × 3 system is the vector of algebraic circle parameters q = (B, C, D)⊺. Although the 3 × 3 system in Eqn. (11.17) could be solved in closed form, it is always advisable to use stable numerical techniques (e.g., LU decomposition). If matrix M has full rank (3), the system has an exact solution, from which the geometric circle parameters are extracted as

    xc = −B/2,    yc = −C/2,    r = √(B² + C² − 4·D) / 2.    (11.18)

The complete procedure is summarized in Alg. 11.1. Note that the same solution is obtained if the scatter sums (Σxi, . . .) in Eqn. (11.17) are replaced by the corresponding means or "moments"

    x̄ = (1/n)·Σxi,    xx = (1/n)·Σxi²,    xy = (1/n)·Σxi·yi,    . . . ,

i.e., with both sides of Eqn. (11.17) being scaled by 1/n. In this case the system to be solved is

    [ xx  xy  x̄ ]   [B]   [ −xz ]
    [ xy  yy  ȳ ] · [C] = [ −yz ],    (11.19)
    [ x̄   ȳ   1 ]   [D]   [ −z̄  ]



while the remaining calculations are the same. This formulation is often seen in the literature but has no real computational advantages. An alternative, potentially more stable implementation of Kåsa's method using the Moore-Penrose pseudoinverse is described in [61, Sec. 5.2]. In general, Kåsa's method works well if points are sampled along the full circle, in which case the results are usually close to those of a geometric fit. However, as demonstrated in Fig. 11.2 (p. 306), the method tends to underestimate the circle radius when sample points cover only parts of the circle or when measurement noise is increased.

Algebraic Circle Fit 2: Pratt's Method
Obviously, the parameters of the algebraic circle equation in Eqn. (11.3) can be scaled by any nonzero factor without changing the associated circle. The underlying circle equation for Pratt's method [213] is

    A·(x² + y²) + B·x + C·y + D = 0,    (11.20)



i.e., all parameters are scaled by some factor A. Note that Eqn. (11.3) is only a special case of the general circle equation in Eqn. (11.20), with A = 1. Moreover, Eqn. (11.20) describes a circle only if A ≠ 0, otherwise it is the equation of a straight line (see Sec. 10.1.3). The parameters A, B, C, D in Eqn. (11.20) are related to the geometric circle parameters by

    xc = −B/(2A),    yc = −C/(2A),    r² = (B² + C² − 4·A·D)/(4·A²).    (11.21)



Conversely, given the geometric circle parameters xc, yc, r, the associated algebraic parameters are, e.g.,3

    A = 1/(2r),    B = −2·A·xc,    C = −2·A·yc,    D = (B² + C² − 1)/(4·A).    (11.22)
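These conversions (Eqns. (11.21)–(11.22)) are straightforward to code; the following helpers are hypothetical sketches, not part of the imagingbook API:

static double[] toGeometric(double A, double B, double C, double D) {
  double xc = -B / (2 * A), yc = -C / (2 * A);                 // Eqn. (11.21)
  double r = Math.sqrt(B * B + C * C - 4 * A * D) / (2 * Math.abs(A));
  return new double[] {xc, yc, r};
}
static double[] toAlgebraic(double xc, double yc, double r) {
  double A = 1 / (2 * r), B = -2 * A * xc, C = -2 * A * yc;    // Eqn. (11.22)
  double D = (B * B + C * C - 1) / (4 * A);
  return new double[] {A, B, C, D};
}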



Objective Function
To fit a circle with the parametrization in Eqn. (11.20), the objective function to be minimized is

    FP(q) = 1/(B²+C²−4AD) · Σ_{i=0}^{n−1} (A·zi + B·xi + C·yi + D)²,    (11.23)

with zi = xi² + yi² as before. Since B² + C² − 4AD > 0 whenever q = (A, B, C, D) represents a circle, the same objective can be expressed by the simplified function

    F′P(q) = Σ_{i=0}^{n−1} (A·zi + B·xi + C·yi + D)² → min,    (11.24)

subject to the constraint

    B² + C² − 4AD = 1.    (11.25)

3 As noted before, any set of scaled circle parameters (sA, sB, sC, sD), with s ≠ 0, is equivalent, i.e., describes the same circle.



Under this constraint, Eqn. (11.20) can describe any circle or line in the 2D plane [61, Sec. 3.2]. Thus the Pratt algorithm can handle a straight line (from collinear sample points) as a special case of a circle with infinite radius. The simplified function in Eqn. (11.24) can be written as

    F′P(q) = ∥X·q∥² = q⊺·(X⊺·X)·q,    (11.26)

with the parameter vector q = (A, B, C, D)⊺ and the data matrix

    X = [ z0    x0    y0    1 ]
        [  ⋮     ⋮     ⋮    ⋮ ]    (11.27)
        [ zn−1  xn−1  yn−1  1 ]

The constraint in Eqn. (11.25) can be written as

    q⊺·C·q = 1,  with  C = [  0  0  0 −2 ]
                           [  0  1  0  0 ]    (11.28)
                           [  0  0  1  0 ]
                           [ −2  0  0  0 ]



Now the task of minimizing F′P(q) can be reduced to the eigenvalue problem

    C⁻¹·X⊺·X·q = λ·q,    (11.29)

i.e., the solution q_opt is an eigenvector of the matrix C⁻¹·X⊺·X and λ is one of its eigenvalues (see [61, Sec. 5.5] for details). In particular, F′P is minimized by the smallest non-negative eigenvalue λ ≥ 0.4 To avoid numerical instabilities when X⊺·X is poorly conditioned (i.e., close to singular), the implementation described below (proposed in [61]) is based on singular-value decomposition of the n × 4 data matrix X itself, i.e., without actually building the 4 × 4 matrix X⊺·X.

Data Centering
To improve numerical stability it is generally recommended to use centered data instead of working with the original point coordinates. For this purpose, procedure FitCirclePratt() in Alg. 11.2 accepts a reference point xr = (xr, yr), which is typically set to the centroid (x̄) of the point set X but could be any other point (e.g., the image center). Initially, all points xi ∈ X are transformed to centered coordinates by

    ẋi ← xi − xr.    (11.30)



The circle fit is then performed on the centered data, with the ˙ B, ˙ C, ˙ D). ˙ initial solution q˙ = (A, To obtain the circle parameters (A, B, C, D) for the original (non-centered) point data, the initial parameters q˙ must be re-adjusted to 4



302







As noted in [61], C−1 ·X ·X has only real eigenvalues (three positive and one negative).



Alg. 11.2  Pratt algebraic circle fit using singular-value decomposition. If exactly three sample points are supplied, this is handled as a special case by adding an additional row of zeros to X (line 5). This makes X singular but guarantees that the V matrix produced by the SVD (line 9) is 4 × 4. Data centering (i.e., using the centroid of X as the reference point xr) is recommended to improve numerical stability. The algorithm can handle collinear data (i.e., points on a straight line), in which case parameter A = q(0) ≈ 0. Note that the inverse of the constraint matrix C (required in line 18) is simply C⁻¹ = [[0,0,0,−0.5], [0,1,0,0], [0,0,1,0], [−0.5,0,0,0]].

 1: FitCirclePratt(X, xr)
    Input: X = (x0, …, xn−1), a collection of n ≥ 3 observed 2D sample points xi = (xi, yi); xr = (xr, yr), a reference point (e.g., the centroid of X). Returns the algebraic parameters (A, B, C, D) of the best-fitting circle.
 2:   n ← |X|
 3:   let ẋi = (xi − xr),  ẏi = (yi − yr),  żi = ẋi² + ẏi²
 4:   if n = 3 then
 5:     X ← 4 × 4 matrix with rows (żi, ẋi, ẏi, 1), i = 0, 1, 2, plus one row of zeros
 6:   else
 7:     X ← n × 4 matrix with rows (żi, ẋi, ẏi, 1), i = 0, …, n−1
 8:   C ← [[0,0,0,−2], [0,1,0,0], [0,0,1,0], [−2,0,0,0]]          ▷ constraint matrix, Eqn. (11.28)
 9:   (U, Σ, V) ← SVD(X)                                          ▷ compact SVD of X
10:   σ ← diag(Σ)                                                 ▷ singular values σ = (σ0, …, σ3)
11:   k ← argmin_i σ(i)                                           ▷ index of smallest singular value
12:   σmin ← σ(k)                                                 ▷ smallest singular value
13:   σmax ← max(σ)                                               ▷ largest singular value
14:   if σmin/σmax < ε then                                       ▷ X⊺·X is ill-conditioned
15:     q̇ ← V(∗, k)                                              ▷ kth column vector of V
16:   else                                                        ▷ X⊺·X is non-singular
17:     Y ← V·Σ
18:     Z ← Y⊺·C⁻¹·Y                                              ▷ Z is symmetric (4 × 4)
19:     (λ, E) ← eigen(Z)                                         ▷ λ = (λ0, …, λ3), E = (e0 | … | e3)
20:     l ← argmin_i {λi > 0}                                     ▷ index of smallest positive eigenvalue
21:     q̇ ← V·Σ⁻¹·el                                             ▷ Σ⁻¹ = diag(1/σ0, …, 1/σ3)
22:   q ← [[1,0,0,0], [−2xr,1,0,0], [−2yr,0,1,0], [xr²+yr²,−xr,−yr,1]]·q̇   ▷ re-adjust for data centering, see Eqn. (11.31)
23:   return q                                                    ▷ q = (A, B, C, D)
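For experimentation, the following Java sketch implements the essential steps of Alg. 11.2 using the Apache Commons Math library (the same library used by the solver listings in the Appendix). The class and method names are ours and not part of any published API; input validation and the degenerate case of no positive eigenvalue are not handled.

```java
import org.apache.commons.math3.linear.*;

public class PrattCircleFitSketch {
  // Pratt fit via SVD (cf. Alg. 11.2); pts[i] = {xi, yi}, (xr, yr) = reference point.
  // Returns the algebraic circle parameters (A, B, C, D).
  public static double[] fit(double[][] pts, double xr, double yr) {
    int n = pts.length;
    double[][] xd = new double[Math.max(n, 4)][4];   // data matrix X, zero-padded if n = 3
    for (int i = 0; i < n; i++) {
      double dx = pts[i][0] - xr, dy = pts[i][1] - yr;
      xd[i][0] = dx * dx + dy * dy; xd[i][1] = dx; xd[i][2] = dy; xd[i][3] = 1;
    }
    RealMatrix X = MatrixUtils.createRealMatrix(xd);
    SingularValueDecomposition svd = new SingularValueDecomposition(X);
    RealMatrix V = svd.getV();
    double[] sv = svd.getSingularValues();           // sorted in descending order
    double[] qd;                                     // centered solution q-dot
    if (sv[3] / sv[0] < 1e-10) {                     // X (and X'X) effectively singular
      qd = V.getColumn(3);                           // right singular vector of smallest sigma
    } else {
      RealMatrix Ci = MatrixUtils.createRealMatrix(new double[][] {   // C^-1, Eqn. (11.28)
          {0, 0, 0, -0.5}, {0, 1, 0, 0}, {0, 0, 1, 0}, {-0.5, 0, 0, 0}});
      RealMatrix Sig = MatrixUtils.createRealDiagonalMatrix(sv);
      RealMatrix Z = V.multiply(Sig).transpose().multiply(Ci).multiply(V.multiply(Sig));
      EigenDecomposition ed = new EigenDecomposition(Z);
      int l = -1; double lmin = Double.POSITIVE_INFINITY;
      for (int i = 0; i < 4; i++) {                  // smallest positive eigenvalue (assumed to exist)
        double lam = ed.getRealEigenvalue(i);
        if (lam > 0 && lam < lmin) { lmin = lam; l = i; }
      }
      RealMatrix SigInv = MatrixUtils.createRealDiagonalMatrix(
          new double[] {1 / sv[0], 1 / sv[1], 1 / sv[2], 1 / sv[3]});
      qd = V.multiply(SigInv).operate(ed.getEigenvector(l)).toArray();
    }
    double A = qd[0], B = qd[1], C = qd[2], D = qd[3];
    return new double[] {                            // re-adjust for centering, Eqn. (11.31)
        A, B - 2 * A * xr, C - 2 * A * yr,
        A * (xr * xr + yr * yr) - B * xr - C * yr + D};
  }
}
```

The zero-padding for n = 3 mirrors line 5 of Alg. 11.2: it keeps V at size 4 × 4 and forces the singular branch, which yields the exact 3-point fit.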



For the original (non-centered) data, the re-adjusted parameter vector is

    q = (A, B, C, D)⊺ = ( Ȧ,  Ḃ − 2·Ȧ·xr,  Ċ − 2·Ȧ·yr,  Ȧ·(xr² + yr²) − Ḃ·xr − Ċ·yr + Ḋ )⊺.    (11.31)

Obviously, if no data centering is used (i.e., xr = 0), then q = q̇.

Algorithm

The complete procedure for the Pratt circle fit is listed in Alg. 11.2. Instead of actually calculating C⁻¹·X⊺·X, a singular-value decomposition (SVD) of X is obtained in the form

    X = U·Σ·V⊺,    (11.32)

where Σ is a 4 × 4 diagonal matrix composed of the non-negative, real singular values of X, matrix U is of size n × 4, and V is a 4 × 4 orthonormal matrix (this definition of the SVD, with Σ and V being square matrices, is sometimes called "thin SVD"; see, e.g., [107, Sec. 2.4.3] for other definitions), i.e.,

    X = U (n × 4) · diag(σ0, σ1, σ2, σ3) · V⊺ (4 × 4).    (11.33)

The columns of V (i.e., the rows of V⊺) are called the (right) singular vectors of X. If the smallest singular value σk is very small (compared to the largest singular value σmax), X is considered ill-conditioned and X⊺·X is (close to) singular. Fortunately, a solution exists even in this case: it is obtained as the associated column vector k of matrix V (see Alg. 11.2, line 15). This requires that the data matrix X has n ≥ 4 rows, since the size of the square matrix V is min(n, 4). To make the algorithm work for n = 3 sample points, a dummy row of zeros is added to X to bring it to size 4 × 4 (see Alg. 11.2, line 5). This makes X singular, but the SVD gives the right solution, which is an exact (3-point) circle fit.

In the non-singular case, the matrix Y = V·Σ is formed to find the eigenvalues of the symmetric matrix

    Z = Y⊺·C⁻¹·Y    (11.34)

(without data centering it may happen, in rare cases, that matrix Z has complex eigenvalues, which renders the result useless). Now the eigenpair ⟨λl, el⟩ with the smallest positive eigenvalue λl is selected, from which the circle's parameter vector q̇ = (Ȧ, Ḃ, Ċ, Ḋ)⊺ is obtained by solving

    Σ·V⊺·q̇ = el.    (11.35)

Since V⁻¹ = V⊺ and the inverse of the diagonal matrix Σ is simply Σ⁻¹ = diag(1/σ0, …, 1/σ3), q̇ can be calculated directly as

    q̇ = V·Σ⁻¹·el,    (11.36)

as done in Alg. 11.2, line 21 (note that the eigenvector el may be arbitrarily scaled, thus the constraint in Eqn. (11.25) does not hold without normalizing A, B, C, D). Finally, the initial parameter vector q̇ is re-adjusted to q to compensate for data centering by xr, as described in Eqn. (11.31) (see Alg. 11.2, line 22). If the resulting parameter A = q(0) is sufficiently different from zero (i.e., |A| ≥ ε), the geometric circle parameters xc, yc, r are found by the relations defined in Eqn. (11.21). Otherwise division by A would produce infinite circle parameters, i.e., the sample points are collinear. In this case the result is not useless but a straight line of the form

    B·x + C·y + D = 0,    (11.37)

obtained directly from the circle parameters B, C, D.
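The conversion from algebraic to geometric circle parameters per Eqn. (11.21), including the collinearity check on |A|, can be written as a small Java helper. Method and parameter names below are illustrative only and not taken from the book's code listings.

```java
/**
 * Converts algebraic circle parameters q = (A, B, C, D) to geometric parameters
 * (xc, yc, r) according to Eqn. (11.21). Returns null if |A| < eps, i.e., the
 * points are (nearly) collinear and q describes the line B*x + C*y + D = 0
 * (Eqn. (11.37)) instead of a circle.
 */
static double[] toGeometricCircle(double[] q, double eps) {
  double A = q[0], B = q[1], C = q[2], D = q[3];
  if (Math.abs(A) < eps) {
    return null;                                   // collinear case: straight line
  }
  double xc = -B / (2 * A);
  double yc = -C / (2 * A);
  double r  = Math.sqrt(B * B + C * C - 4 * A * D) / (2 * Math.abs(A));
  return new double[] {xc, yc, r};
}
```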



Algebraic Circle Fits Compared

The algebraic circle fit methods described in this section share a number of positive characteristics. Both the Kåsa and the Pratt fit are invariant to translation, rotation, and similarity transformations in general [61]. They work non-iteratively and find circles straightaway if the sample points are actually on a circle. Both are easy to implement and allow efficient calculation. However, there are two reasons why the Pratt fit is the preferred method of choice. First, the Kåsa fit fails on collinear data, for which the Pratt fit delivers a useful result (a straight line). Second, the Kåsa fit shows a significant bias toward smaller circles if the data points are sampled within a limited section of the circle. This effect is shown in Fig. 11.2, where n = 100 sample points were taken from a 45° arc and perturbed by uncorrelated additive Gaussian noise with increasing σ. While the circles estimated by the Kåsa fit become smaller and smaller with increasing noise levels, the Pratt fit produces consistent results under the same conditions and beyond. Note that the circle from the Pratt fit in Fig. 11.2 (f), although larger than the real circle (green), is very close to the result of the geometric fit (see Table 11.1), which is the best such an algorithm can achieve.

While the Kåsa method has been shown here mainly for its simplicity and widespread use, the Pratt fit is clearly superior in every respect. Under reasonable conditions it produces results that are very close to those obtained by geometric fits (as described in the following section) and thus sufficiently accurate for most applications. Note that quite a few other algebraic circle fitting methods have been proposed in the literature, e.g., the method by Taubin [258], which is very similar to the Pratt fit. Also, successful attempts have been made to reduce (or even eliminate) the typical bias exhibited by algebraic fit methods (see [61] for a comprehensive analysis and additional details).

11.1.3 Geometric Circle Fitting

While algebraic circle fits are entirely sufficient in many applications, their theoretical shortcoming is that they do not minimize Euclidean distance but some other quantity, the algebraic distance. Geometric fitting is understood as the task of finding the optimal circle parameters that actually minimize the (squared) Euclidean distances between the data points and the circle contour. Results are usually very similar to algebraic fits but not exactly the same. All geometric fitting algorithms are iterative and thus in general more costly to calculate than algebraic fits. They are nevertheless useful for high-precision measurements or simply to estimate the goodness of algebraic fits.

The problem can be approached in two different ways, minimizing either one or two residual quantities per sample point. The "distance-based" approach measures a single scalar-valued quantity, namely the orthogonal distance of each sample point xi to the circle. In contrast, the "coordinate-based" approach treats both x/y coordinates of a sample point and the associated closest circle point as individual variables. Both versions operate on "natural" geometric circle parameters p = (xc, yc, r), as described in the following sections.
Fig. 11.2  Algebraic circle fits at different noise levels. Results obtained with the Kåsa and Pratt method are shown in the left and right column, respectively. Points (n = 100, blue dots) were sampled on the true circle (green) and additive Gaussian noise of increasing σ applied to their x/y coordinates. The fitted circle is drawn in red. Clearly the Pratt method shows superior performance even at relatively high noise levels. Note that in (f) the result of the algebraic Pratt fit is almost identical to that of the geometric fit (i.e., with minimum RMS distance error). Panels: (a, b) σ = 0.5; (c, d) σ = 2.0; (e, f) σ = 5.0.

The distance-based method is usually preferred because it is simpler, requires less memory, and typically converges more quickly. However, in difficult situations, e.g., when starting from a far-off initial solution, the coordinate-based method may turn out to be more stable. Note that neither of these methods is guaranteed to always converge.
"Distance-Based" Geometric Circle Fitting

The task of geometric circle fitting is to find the optimal circle parameters p = (xc, yc, r) for a given set of sample points X = {(xi, yi)} by minimizing the function (see Eqn. (11.4))

    F(p) = Σ_{xi∈X} ( √((xi−xc)² + (yi−yc)²) − r )² = Σ_{xi∈X} (ri − r)².    (11.38)

Thus the goal is to minimize the sum of the squared differences between the individual center-to-point distances ri and the circle radius r, while the circle's center and radius are unknown. Unfortunately, unlike with algebraic fits, this problem is nonlinear (in its parameters) and its solution cannot be found in closed, non-iterative form.

Nonlinear Least Squares (NLS)

Nonlinear least-squares problems like the one in Eqn. (11.38) can be solved with iterative methods, such as the Gauss-Newton or Levenberg-Marquardt method. Section C.2 of the Appendix defines the abstract method

    popt ← SolveNLS(V, J, z, p0)    (11.39)

for solving this type of problem. It accepts a value function V(p) and a Jacobian function J(p), which return a vector of predicted model values and the Jacobian matrix, respectively, for a given parameter point p. z is a vector of target values to be approximated (set to zero in this case) and p0 is the initial parameter vector (i.e., some initial "guess"). Upon successful completion, function SolveNLS() returns the parameter vector popt that minimizes

    ‖V(p) − z‖² = Σ_{i=0..n−1} [vi(p) − zi]²,    (11.40)

or simply ‖V(p)‖² = Σ [vi(p)]² when z = 0.

Model Functions (Values and Jacobian)

To solve the concrete problem in Eqn. (11.38), the n components of the value function

    V(p) = (v0(p), …, vn−1(p))⊺    (11.41)

are defined as[8]

    vi(p) = ri − r = √((xi−xc)² + (yi−yc)²) − r,    (11.42)

i.e., the (signed) point-to-circle distances di (see Fig. 11.1). Since

    F(p) = Σ [ri − r]² = Σ [vi(p)]²    (11.43)

is to be minimized, the target value for all vi(p) is zero, and thus the required target vector (see Eqn. (11.39)) is

    z = 0 = (0, …, 0)⊺.    (11.44)

The Jacobian function J(p) returns a matrix of the first-order partial derivatives of the model functions vi(p) w.r.t. each of the 3 circle parameters xc, yc, r, which are

    ∂vi(p)/∂xc = (xc − xi)/ri,    ∂vi(p)/∂yc = (yc − yi)/ri,    ∂vi(p)/∂r = −1.    (11.45)

The matrix to be returned by function J(p) is of size n × 3, i.e.,

    J(p) = [ J0(p); …; Ji(p); …; Jn−1(p) ],   with rows  Ji(p) = ( ∂vi(p)/∂xc, ∂vi(p)/∂yc, ∂vi(p)/∂r ),    (11.46)

where (accounting for the case ri = 0, i.e., point xi being at the circle center)

    Ji(p) = ( (xc−xi)/ri, (yc−yi)/ri, −1 )   if ri ≠ 0,
    Ji(p) = ( 0, 0, −1 )                     otherwise.    (11.47)

Functions V(p) and J(p) are evaluated for every parameter point p visited by the optimizer, i.e., at least once in every iteration. The complete "distance-based" geometric circle fitting procedure is summarized in Alg. 11.3. A concrete Java implementation of function SolveNLS(), based on the Levenberg-Marquardt solver provided by the Apache Commons Math library, is listed in Prog. C.1 (p. 829); see Sec. C.3.1 and Prog. C.3 (p. 834) for a related usage example.

[8] Note that the functions V(p) and J(p) not only depend on the circle parameters p but also on the "observed" points X = (xi), so strictly speaking the notation should be V(p|X) and J(p|X), for example. However, since X is constant in this context, it is not explicitly specified.
Alg. 11.3  Geometric circle fit – Version 1 ("distance-based"). The initial circle parameters p0 are typically obtained by some algebraic (e.g., Pratt) fit. Optimization is performed on "natural" circle parameters (xc, yc, r). An implementation of the nonlinear least-squares optimization procedure SolveNLS() (invoked in line 4) is described in Sec. C.2 of the Appendix. The value function V(p) and the Jacobian function J(p) (defined in lines 6 and 13, respectively) are called in each iteration of the least-squares optimizer on the current parameter vector p. Note that functions V(p) and J(p) calculate the same quantities (dx, dy, ri) for each sample point. This can be avoided in real implementations by combining V, J into a single function (see also Sec. C.3.1).

 1: GeometricCircleFitDist(X, p0)
    Input: X = (x0, …, xn−1), a collection of n ≥ 3 observed 2D sample points xi = (xi, yi); p0 = (xc, yc, r), initial circle parameters. Returns the parameters of the circle that minimizes the sum of the squared distances to the points in X.
 2:   n ← |X|                              ▷ X, n are global
 3:   z ← 0_n                              ▷ target vector z = (0, …, 0)
 4:   popt ← SolveNLS(V, J, z, p0)         ▷ run the LS solver
 5:   return popt                          ▷ = (xc, yc, r)

 6: V(p)                                   ▷ value function, to be called by SolveNLS()
    Returns the n-vector v = (v0, …, vn−1) of function values v(i) = vi(p) for the current parameter "point" p.
 7:   (xc, yc, r) ← p                      ▷ current circle parameters
 8:   v ← new vector ∈ Rⁿ
 9:   for i ← 0, …, n−1 do
10:     (xi, yi) ← X(i)
11:     v(i) ← √((xi−xc)² + (yi−yc)²) − r  ▷ Eqn. (11.42)
12:   return v

13: J(p)                                   ▷ Jacobian function, to be called by SolveNLS()
    Returns the n × 3 Jacobian matrix J for the current parameter "point" p (see Eqn. (11.46)).
14:   (xc, yc, r) ← p                      ▷ current circle parameters
15:   J ← new matrix ∈ R^(n × 3)
16:   for i ← 0, …, n−1 do
17:     (xi, yi) ← X(i)
18:     dx ← xc − xi,  dy ← yc − yi
19:     ri ← √(dx² + dy²)
20:     J(i, ∗) ← (dx/ri, dy/ri, −1) if ri > 0, otherwise (0, 0, −1)   ▷ ith row of J
21:   return J
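The same fit can be set up directly on the Levenberg-Marquardt solver of the Apache Commons Math library, which is also what the SolveNLS() implementation in the Appendix builds on. The following is only a minimal sketch with names of our own choosing (it is not the listing in Prog. C.1) and without any error handling or convergence checking beyond the iteration limits.

```java
import org.apache.commons.math3.fitting.leastsquares.*;
import org.apache.commons.math3.linear.*;
import org.apache.commons.math3.util.Pair;

public class DistanceBasedCircleFitSketch {
  // Fits a circle (xc, yc, r) to the points pts[i] = {xi, yi}, starting from p0,
  // by minimizing the signed point-to-circle distances (Eqn. (11.42)).
  public static double[] fit(double[][] pts, double[] p0) {
    final int n = pts.length;
    MultivariateJacobianFunction model = point -> {
      double xc = point.getEntry(0), yc = point.getEntry(1), r = point.getEntry(2);
      RealVector vals = new ArrayRealVector(n);
      RealMatrix jac = new Array2DRowRealMatrix(n, 3);
      for (int i = 0; i < n; i++) {
        double dx = pts[i][0] - xc, dy = pts[i][1] - yc;
        double ri = Math.hypot(dx, dy);
        vals.setEntry(i, ri - r);                 // vi(p), Eqn. (11.42)
        if (ri > 0) {                             // Jacobian row, Eqn. (11.47)
          jac.setEntry(i, 0, -dx / ri);           // d vi / d xc = (xc - xi)/ri
          jac.setEntry(i, 1, -dy / ri);           // d vi / d yc = (yc - yi)/ri
        }
        jac.setEntry(i, 2, -1);                   // d vi / d r
      }
      return new Pair<>(vals, jac);
    };
    LeastSquaresProblem problem = new LeastSquaresBuilder()
        .start(p0)
        .model(model)
        .target(new double[n])                    // target vector z = 0
        .maxEvaluations(1000)
        .maxIterations(1000)
        .build();
    LeastSquaresOptimizer.Optimum opt =
        new LevenbergMarquardtOptimizer().optimize(problem);
    return opt.getPoint().toArray();              // optimized (xc, yc, r)
  }
}
```

A typical call would pass the result of an algebraic (e.g., Pratt) fit as the start vector p0.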



"Coordinate-Based" Geometric Circle Fitting

Coordinate-based geometric fitting is very similar to the distance-based approach. Both try to minimize the sum of the squared Euclidean distances between each data point xi and the associated closest point on the circle (see also Fig. 11.1 on p. 298), denoted

    x̆i(p) = xc + (xi − xc)·r/ri,    (11.48)

with the circle center xc = (xc, yc) and the center-to-point distance

    ri = ‖xi − xc‖ = √((xi−xc)² + (yi−yc)²).    (11.49)

The objective function to be minimized is

    F(p) = Σ_{i=0..n−1} ‖x̆i(p) − xi‖²,    (11.50)

for observations xi (the fixed sample points) and the values x̆i (the associated closest circle points) yielded by the circle model. The problem is again nonlinear since the expression in Eqn. (11.48) involves products of parameters.

While this looks similar to the "distance-based" formulation in Eqn. (11.38), the difference is that x̆i, xi are not scalar quantities but 2D vectors, and thus the value functions

    vi(p) = ( x̆i(p), y̆i(p) )⊺    (11.51)

are vector-valued as well. By rewriting Eqn. (11.50) as

    F(p) = Σ_{i=0..n−1} ‖x̆i(p) − xi‖²
         = Σ_{i=0..n−1} ( [x̆i(p) − xi]² + [y̆i(p) − yi]² )
         = Σ_{i=0..n−1} [x̆i(p) − xi]² + Σ_{i=0..n−1} [y̆i(p) − yi]²,    (11.52)

we see that this can be resolved by simply splitting vi(p) into pairs of scalar-valued functions,

    vi,x(p) = x̆i(p) = xc + (xi − xc)·r/ri,
    vi,y(p) = y̆i(p) = yc + (yi − yc)·r/ri,    (11.53)

and arranging the composite value function V(p) and the associated target vector z in the form[12]

    V(p) = ( x̆0(p), y̆0(p), …, x̆i(p), y̆i(p), …, x̆n−1(p), y̆n−1(p) )⊺,
    z    = ( x0, y0, …, xi, yi, …, xn−1, yn−1 )⊺,    (11.54)

with both vectors of length m = 2n. Note that, unlike in the distance-based formulation, the target vector z is not zero here but holds the x/y coordinates of the data points xi. The same row arrangement is used to mount the 2n × 3 Jacobian matrix

    J(p) = [ J2i,0  J2i,1  J2i,2 ; J2i+1,0  J2i+1,1  J2i+1,2 ]_{i=0..n−1}
         = [ ∂x̆i(p)/∂xc  ∂x̆i(p)/∂yc  ∂x̆i(p)/∂r ; ∂y̆i(p)/∂xc  ∂y̆i(p)/∂yc  ∂y̆i(p)/∂r ]_{i=0..n−1},    (11.55)

with the (nontrivial) partial derivatives (obtained with a little help from Mathematica)

    J2i,0   = ∂x̆i/∂xc = 1 + (r/ri)·((xi−xc)²/ri² − 1),
    J2i,1   = ∂x̆i/∂yc = r·(xi−xc)·(yi−yc)/ri³,
    J2i,2   = ∂x̆i/∂r  = (xi−xc)/ri,
    J2i+1,0 = ∂y̆i/∂xc = r·(xi−xc)·(yi−yc)/ri³,
    J2i+1,1 = ∂y̆i/∂yc = 1 + (r/ri)·((yi−yc)²/ri² − 1),
    J2i+1,2 = ∂y̆i/∂r  = (yi−yc)/ri,    (11.56)

where i = 0, …, n−1 and ri = ‖xi − xc‖ (see Eqn. (11.49)). The complete coordinate-based fitting procedure is summarized in Alg. 11.4. A concrete Java implementation for this algorithm is listed in Prog. C.3; see also Sec. C.3 of the Appendix, which provides a small numeric example, and Prog. C.1 (p. 829) for a concrete implementation of function SolveNLS() using Levenberg-Marquardt optimization.

[12] The ordering of elements in V(p) and z is irrelevant; instead of alternating x/y values we could have also grouped all x values first, followed by all y values. In fact, any "flattening" arrangement will do as long as associated values are placed in the same row.
Alg. 11.4  Geometric circle fit – Version 2 ("coordinate-based"). Optimization is performed on "natural" circle parameters (xc, yc, r). Each of the n sample points in X contributes two quantities to be minimized, one for each x/y coordinate. Thus, compared to "distance-based" Version 1 (Alg. 11.3), the value vector returned by V(p) is of length 2n and the Jacobian matrix returned by J(p) is of size 2n × 3. See Prog. C.3 (p. 834) for a concrete Java implementation of this algorithm.

 1: GeometricCircleFitCoord(X, p0)
    Input: X = (x0, …, xn−1), a collection of n ≥ 3 observed 2D sample points xi = (xi, yi); p0 = (xc, yc, r), initial circle parameters. Returns the parameters of the circle that minimizes the sum of the squared distances to the points in X.
 2:   n ← |X|                                   ▷ X, n are global
 3:   z ← (x0, y0, …, xi, yi, …, xn−1, yn−1)⊺    ▷ "target" vector (length 2n)
 4:   popt ← SolveNLS(V, J, z, p0)               ▷ run the LS solver
 5:   return popt                                ▷ = (xc, yc, r)

 6: V(p)                                         ▷ value function, to be called by SolveNLS()
    Returns the 2n-vector v = (v0, …, v2n−1) of model values for the current parameter "point" p = (xc, yc, r) (see Eqn. (11.54)).
 7:   v ← new vector ∈ R^(2n)
 8:   for i ← 0, …, n−1 do
 9:     (xi, yi) ← X(i)
10:     dx ← xi − xc
11:     dy ← yi − yc
12:     ri ← √(dx² + dy²)
13:     v(2i)   ← xc + dx·r/ri
14:     v(2i+1) ← yc + dy·r/ri
15:   return v

16: J(p)                                         ▷ Jacobian function, to be called by SolveNLS()
    Builds and returns the 2n × 3 Jacobian matrix J for the current parameter "point" p = (xc, yc, r) (see Eqn. (11.56)).
17:   J ← new matrix ∈ R^(2n × 3)
18:   for i ← 0, …, n−1 do
19:     (xi, yi) ← X(i)
20:     dx ← xi − xc
21:     dy ← yi − yc
22:     ri ← √(dx² + dy²)
23:     J(2i, ∗)   ← ( 1 + (r/ri)·(dx²/ri² − 1),  (r/ri)·dx·dy/ri²,  dx/ri )   ▷ = ∂x̆i/∂p
24:     J(2i+1, ∗) ← ( (r/ri)·dx·dy/ri²,  1 + (r/ri)·(dy²/ri² − 1),  dy/ri )   ▷ = ∂y̆i/∂p
25:   return J
Geometric Circle Fitting Examples

The examples in this section were calculated with the distance-based method (Sec. 11.1.3). Figure 11.3 shows the process of geometric circle fitting on two examples with a small number of sample points. Also visualized is the surface of the objective function F(p) for varying center positions xc/yc, while the circle radius r is kept at a fixed value (the optimal radius). Only three sample points are specified in Fig. 11.3 (a) and both the algebraic and the geometric fit arrive at the same (exact) result, as one would expect. In Fig. 11.3 (b), the algebraic (Pratt) fit is somewhat off the geometric optimum, but the suggested center point is in the same low-value basin of the objective function. The objective function F is usually "well behaved" (i.e., smooth), thus minima are relatively easy to locate. Note that in both cases (a, b) other local minima (function basins) exist in the vicinity of the sample points, where a geometric fit will get trapped if the initial guess is sufficiently close.

Fig. 11.3  Geometric circle fit and topography of the objective function F(p). The contour plots show the values of F(p) for varying xc/yc positions and a fixed circle radius r (set to the optimal radius). The dashed magenta line indicates the result of the algebraic fit, the solid blue line marks the geometric fit. In (a) only 3 sample points are given, i.e., the circle fit is exact and the results from both fits coincide, with the circle center at the function's minimum. In (b) the algebraic fit (for 6 sample points) is marginally off but the projected center is still very close to the geometric optimum. Panels: (a) 3 points, (b) 6 points.
Figure 11.4 compares the performance of the distance-based and coordinate-based algorithms (Alg. 11.3 and 11.4, respectively). In Fig. 11.4 (a, b) the starting solution (red circle) is the estimate obtained by an algebraic Kåsa fit. Under such typical conditions, both algorithms perform similarly, with the distance-based algorithm converging slightly faster, if at all. In more challenging situations, as shown in Fig. 11.4 (c, d), the distance-based version has a tendency to overshoot in the parameter adaptation steps, while the coordinate-based version shows a more conservative behavior.

Additional examples of geometric circle fits are shown in Fig. 11.5. Again the data points (blue dots) are manually placed. Two different methods of algebraic fitting were used to estimate the initial circle parameters (red circles). Intermediate parameter points evaluated in each step of the geometric fitter are shown as gray circles, final results of the geometric fits as blue circles. It can be seen that the results of the (algebraic) Pratt fit are very close to the final geometric fit in all situations. However, even if the initial estimate is poor (as in Fig. 11.5 (c)), the geometric fit only requires a few iterations to converge.

Table 11.1 lists the results for the examples in Fig. 11.2, with the outcome of the Pratt fit used as the initial guess for the geometric fit. Although the mean squared error

    E(p|X) = (1/|X|)·Σ_{xi∈X} (ri − r)² = (1/|X|)·Σ_{xi∈X} (‖xi − xc‖ − r)²    (11.57)

is always reduced by the geometric fit, the improvements over the Pratt fit are only marginal. Note that (due to sample points being randomly placed) the residual errors for the Pratt and geometric fits are actually smaller than the errors for the original ("real") circle.

Algebraic circle fits, and the Pratt algorithm in particular, are quick, robust, and do not require an initial guess. Other modern algebraic methods, such as the algorithm by Taubin [258] and the "hyperaccurate" algebraic fit by Al-Sharadqah and Chernov [6, 155], yield similarly accurate results. Under non-extreme conditions it may be fully sufficient to apply an algebraic fit only. However, a geometric fit usually takes only a few iterations to converge (the result of the first iteration already being very close to the final optimum in most cases) and may well be worth the additional effort.
Fig. 11.4  Distance-based vs. coordinate-based geometric circle fit. Both algorithms find the same results and show similar convergence behavior (a, b). In extreme situations, i.e., with initial parameters being far from the optimal solution, the distance-based algorithm tends to overshoot (c) while the coordinate-based algorithm still performs unaffectedly (d). Left column: distance-based algorithm (a, c); right column: coordinate-based algorithm (b, d). Shown are the start circle, intermediate steps, and the final geometric fit.

Table 11.1  Error improvement by geometric circle fitting for the examples in Fig. 11.2. σ is the noise level, E denotes the mean squared error (Eqn. (11.57)).

           σ = 0.5, Fig. 11.2(a, b)      σ = 2.0, Fig. 11.2(c, d)      σ = 5.0, Fig. 11.2(e, f)
           xc     yc     r      E        xc     yc     r      E        xc     yc     r      E
  Real     200.0  190.0  150.0  0.231    200.0  190.0  150.0  3.693    200.0  190.0  150.0  23.01
  Kåsa     202.6  191.3  147.2  0.233    231.6  203.7  116.8  4.708    288.8  227.7   60.5  53.91
  Pratt    200.0  190.2  149.9  0.229    199.0  190.4  150.9  3.660    192.6  189.3  157.4  22.77
  Geom.    200.0  190.2  149.9  0.229    199.0  190.4  150.8  3.659    192.5  189.3  157.2  22.72

11.2 Fitting Ellipses

The task of fitting ellipses to given point sets is of equal importance and has similar applications as circle fitting. Since 3D circles map to ellipses in perspective images, photographs of man-made structures are more likely to show ellipses than circles. Fitting ellipses is slightly more difficult due to the additional parameters and less stringent constraints, associated with the risk of ending up with other conic shapes. Nevertheless, techniques for ellipse and circle fitting are quite similar, again with a choice of algebraic and geometric approximation methods.

Fig. 11.5  Geometric circle fits starting from different algebraic fits. Initial start parameters (red circles) for the geometric fit were obtained by the Kåsa method (left column) and Pratt's method (right column), respectively. Blue circles represent the results from the geometric fit, gray circles are the intermediate steps taken by the (distance-based) optimization algorithm. Panels (a)–(f).
11.2.1 Algebraic Ellipse Fitting

An arbitrary (i.e., shifted and oriented) ellipse can be defined as the set of points (x, y) that satisfy the algebraic equation (cf. Eqns. (11.3, 11.20))

    A·x² + B·xy + C·y² + D·x + E·y + F = 0,    (11.58)

with parameters q = (A, B, C, D, E, F) (see Sec. F.3.1 of the Appendix for other ellipse representations and formulas for converting between different parametrizations). Equation (11.58) actually describes a general 2D conic, i.e., a circle, ellipse, parabola, or hyperbola, if at least one of A, B, C is nonzero. It represents an ellipse only if its discriminant is negative, that is, B² − 4AC < 0 (see Sec. 11.2.1).

Algebraic Method by Fitzgibbon

The classic, direct ellipse-fitting method described in the following was proposed by Fitzgibbon et al. in [85]. The set of ellipse points (x, y) is defined by the implicit function in Eqn. (11.58), with parameters q = (A, B, C, D, E, F). The trivial solution q = (0, …, 0) needs to be avoided, and not all curves described by Eqn. (11.58) are ellipses. Thus some constraint is required on q, e.g., 4AC − B² > 0 (among other possibilities). Since any scaled parameter vector q′ = s·q (for s ≠ 0) represents the exact same curve as q, this inequality constraint can be modified to

    4AC − B² = 1,    (11.59)

which is much simpler to incorporate in a linear least-squares formulation. Given a 2D point (x, y), the expression on the left-hand side of Eqn. (11.58), i.e.,

    f(x, y) = A·x² + B·xy + C·y² + D·x + E·y + F,    (11.60)

measures its algebraic distance from the ellipse. If (x, y) is exactly on the ellipse, then f(x, y) = 0; otherwise f(x, y) may be positive or negative. Given a set of sample points X = (x0, …, xn−1), the fitting problem can be expressed as minimizing the sum of the squared algebraic point distances

    F(q) = Σ_{(xi,yi)∈X} [f(xi, yi)]² → min    (11.61)

or, analogous to Pratt's circle fit method (see Eqn. (11.26)),

    F(q) = ‖X·q‖² → min,    (11.62)

where

    X = [ x0²    x0·y0      y0²    x0    y0    1 ]
        [  ⋮       ⋮          ⋮     ⋮     ⋮    ⋮ ]
        [ xn−1²  xn−1·yn−1  yn−1²  xn−1  yn−1  1 ]    (11.63)

is an n × 6 "design" matrix assembled from the point coordinates xi = (xi, yi). The ellipse constraint (Eqn. (11.59)) expressed in matrix form is

    q⊺·C·q = 1,   with   C = [ 0  0  2  0  0  0 ]
                             [ 0 −1  0  0  0  0 ]
                             [ 2  0  0  0  0  0 ]
                             [ 0  0  0  0  0  0 ]
                             [ 0  0  0  0  0  0 ]
                             [ 0  0  0  0  0  0 ].    (11.64)

As in Pratt's circle fit method (see Eqn. (11.29)), the optimization can be reduced to the generalized eigenvalue problem (a generalized eigenproblem is of the form A·x = λ·B·x, compared to an ordinary eigenproblem A·x = λ·x, which has no matrix on the right-hand side; see Sec. B.5 of the Appendix)

    (X⊺·X)·q = S·q = λ·C·q,    (11.65)

where S = X⊺·X (of size 6 × 6) is called the "scatter" matrix. One may be tempted to solve this as an ordinary eigenvalue problem C⁻¹·S·q = λ·q (which is possible in Pratt's circle fit), but here C is singular, i.e., C⁻¹ does not exist. A viable alternative is to reformulate Eqn. (11.65) as another (ordinary) eigenvalue problem,

    S⁻¹·C·q = µ·q    (11.66)

(with µ = 1/λ), which is possible since S is symmetric and (typically) non-singular. The solution q_opt would then be the eigenvector ek of S⁻¹·C associated with its largest (positive) eigenvalue µk. Although this matrix is ill-conditioned (typical determinant values are det(S⁻¹·C) < 10⁻⁸⁰, with vanishing singular values, i.e., condition number ∞), eigenvalue decomposition is typically possible. In the case of an exact fit (i.e., when all sample points are perfectly on the ellipse), however, S⁻¹·C becomes fully singular and the method fails.

Solution by Generalized Symmetric Eigendecomposition

The generalized eigenvalue problem in Eqn. (11.65) can be solved with dedicated algorithms (such as the QZ-algorithm [107]), which unfortunately are not widely implemented in numerical libraries. However, the fact that both matrices S and C are symmetric (6 × 6) allows for another approach. By some rearrangement of Eqn. (11.66) the problem can be transformed to

    C·q = µ·S·q,    (11.67)

which is a generalized symmetric eigenvalue problem of the standard form A·x = λ·B·x, where A, B are real symmetric and B is positive definite. Systems of this type can be solved with conventional methods. In particular, function GeneralSymmetricEigen(A, B), defined in Sec. B.5.2 of the Appendix (see Alg. B.2), is based on standard Cholesky decomposition. It is used in Alg. 11.5, which is a summary of what we call Fitzgibbon's "original" ellipse fitting algorithm.

Numerically Improved Fitzgibbon Fit

As noted in [116], Fitzgibbon's original algorithm is numerically brittle, because matrix S is singular if the fit is exact and nearly singular otherwise. This section describes a numerically improved version of this algorithm, proposed by Halíř and Flusser in [116]. It achieves numerical stability mainly by partitioning matrices into smaller blocks and thereby avoiding singularities.
Alg. 11.5  Algebraic ellipse fitting algorithm by Fitzgibbon et al. [85]. The best fitting ellipse is found by solving a symmetric generalized eigenvalue problem (see Eqn. (11.67)). Note that the constraint matrix C is singular and the scatter matrix S = X⊺·X is typically ill-conditioned, which makes the algorithm numerically sensitive.

 1: FitEllipseFitzgibbonOriginal(X)
    Input: X = (x0, …, xn−1), a collection of n ≥ 5 2D sample points xi = (xi, yi). Returns the algebraic ellipse parameters q = (A, B, C, D, E, F) for the optimal fit.
 2:   n ← |X|
 3:   X ← n × 6 matrix with rows (xi², xi·yi, yi², xi, yi, 1), i = 0, …, n−1    ▷ design matrix X, Eqn. (11.63)
 4:   S ← X⊺·X                                                                   ▷ scatter matrix S
 5:   C ← [[0,0,2,0,0,0], [0,−1,0,0,0,0], [2,0,0,0,0,0], [0,0,0,0,0,0], [0,0,0,0,0,0], [0,0,0,0,0,0]]   ▷ constraint matrix C, Eqn. (11.64)
      Solve the generalized symmetric eigenvalue problem (Eqn. (11.67)):
 6:   (λ, V) ← GeneralSymmetricEigen(C, S)       ▷ see Alg. B.2 (p. 819)
 7:   k ← argmax_i {λi ∈ λ}                      ▷ index of largest eigenvalue
 8:   q ← V(∗, k)                                ▷ kth eigenvector
 9:   return q/‖q‖                               ▷ normalize q
First, the columns of the design matrix X (Eqn. (11.63)) are split into a quadratic part X1 and a linear part X2, i.e.,

    X = ( X1 | X2 )    (11.68)

with

    X1 = [ x0²    x0·y0      y0²   ]          X2 = [ x0    y0    1 ]
         [  ⋮       ⋮          ⋮   ]    and       [  ⋮     ⋮    ⋮ ]
         [ xn−1²  xn−1·yn−1  yn−1² ]               [ xn−1  yn−1  1 ].    (11.69)

Similarly, the 6 × 6 scatter matrix S = X⊺·X is split into 3 × 3 submatrices

    S = [ S1   S2 ],   with   S1 = X1⊺·X1,  S2 = X1⊺·X2,  S3 = X2⊺·X2.    (11.70)
        [ S2⊺  S3 ]

The constraint matrix C is partitioned into

    C = [ C1  0 ],   with   C1 = [ 0  0  2 ]
        [ 0   0 ]                [ 0 −1  0 ]
                                 [ 2  0  0 ],    (11.71)

i.e., the constraint matrix is only 3 × 3 and non-singular, and the ellipse constraint (Eqn. (11.64)) becomes

    q1⊺·C1·q1 = 1.    (11.72)

Finally, the parameter vector q is split into two parts,

    q = ( q1 | q2 )⊺,   with   q1 = (A, B, C)⊺,  q2 = (D, E, F)⊺    (11.73)

(note that q, q1, q2 are column vectors). With this setup, the optimization problem elegantly turns into the problem of finding the appropriate eigenvector of the 3 × 3 "reduced scatter matrix"

    Z = C1⁻¹·(S1 − S2·S3⁻¹·S2⊺).    (11.74)

Following the original algorithm, one would now search for λk, the smallest non-negative eigenvalue of Z (of the three eigenvalues of matrix Z, typically two are negative and one is positive), whose associated eigenvector ek is the first-half parameter vector q1, i.e.,

    q1 = ek   and   q2 = −S3⁻¹·S2⊺·q1,    (11.75)

respectively. However, this may fail when Z is close to singular, which ironically happens when the fit is exact. In this case, all eigenvalues may become negative and even the maximum eigenvalue may not be the correct one. The clever remedy proposed in [116] is to instead check the ellipse constraint in Eqn. (11.72) for all of Z's eigenvectors ei. Since ei⊺·C1·ei is positive for exactly one eigenvector of Z, the solution ek is easy to find. This results in a very stable and efficient algorithm, which is summarized in Alg. 11.6. It also shows how to center the data set X by aligning its centroid with the coordinate origin, which further improves numerical stability. Note that this requires a subsequent adjustment of the resulting parameter vector q2, which is performed in line 15 of Alg. 11.6. This step is unnecessary if non-centered data are used for building X1 and X2.

Taubin's Algebraic Method

The algorithm by Taubin [258] is among the oldest and reportedly most accurate non-iterative ellipse fitting methods. Its structure is very similar to Fitzgibbon's approach (see Sec. 11.2.1) but a different constraint formulation is used, with coefficients that are not constant but derived from the statistics of the point set. As in Alg. 11.5 the optimal fit is found by solving a generalized eigenvalue problem.

To reduce the effects of numerical roundoff errors, point coordinates are initially centered by

    ẋi = xi − xr,    (11.76)

w.r.t. some fixed reference point xr = (xr, yr). This could be, for example, the image center or the centroid of the sample points, i.e.,

    xr = (xr, yr)⊺ = (1/n)·Σ_{i=0..n−1} xi = x̄.    (11.77)

The resulting design matrix is

    X = [ ξ0   ]   [ ẋ0²    ẋ0·ẏ0      ẏ0²    ẋ0    ẏ0    1 ]
        [ ⋮    ] = [  ⋮       ⋮          ⋮     ⋮     ⋮    ⋮ ]
        [ ξn−1 ]   [ ẋn−1²  ẋn−1·ẏn−1  ẏn−1²  ẋn−1  ẏn−1  1 ],    (11.78)

with row vectors ξi = (ẋi², ẋi·ẏi, ẏi², ẋi, ẏi, 1). From this a normalized scatter matrix is constructed as given in Eqn. (11.79) below.
Alg. 11.6  Numerically improved algebraic ellipse fitting algorithm by Halíř and Flusser [116] (cf. Alg. 11.5). The data set X is initially centered at a reference point (xr, yr) (line 3), which is eventually compensated for (in line 15). S1, S2, S3 are 3 × 3 partitions of the full scatter matrix, and C1 is the reduced constraint matrix (which is non-singular and has a simple inverse). A standard eigensolver is used to find the eigenvectors of the symmetric matrix Z, which has only one eigenvector ek for which ek⊺·C1·ek is positive (line 12). Note that the eigenvalues of Z are not used. The complete parameter vector q is obtained by stacking q1 and q2, followed by a re-adjustment for data centering (line 15). This algorithm handles exact ellipse fits to 5 points.

 1: FitEllipseFitzgibbonStable(X, xr)
    Input: X = (x0, …, xn−1), a collection of n ≥ 6 2D sample points xi = (xi, yi); xr = (xr, yr), a reference point (e.g., the centroid of X). Returns a vector q = (A, B, C, D, E, F) of algebraic parameters for the optimal ellipse.
 2:   n ← |X|
 3:   let ẋi = (xi − xr),  ẏi = (yi − yr)                                 ▷ centered data
 4:   X1 ← n × 3 matrix with rows (ẋi², ẋi·ẏi, ẏi²),  X2 ← n × 3 matrix with rows (ẋi, ẏi, 1)
 5:   S1 ← X1⊺·X1
 6:   S2 ← X1⊺·X2
 7:   S3 ← X2⊺·X2
 8:   C1 ← [[0,0,2], [0,−1,0], [2,0,0]]                                    ▷ reduced constraint matrix (Eqn. (11.71))
 9:   T ← S3⁻¹·S2⊺
10:   Z ← C1⁻¹·(S1 − S2·T)                                                 ▷ Z ∈ R³ˣ³, see Eqn. (11.74)
11:   (λ, E) ← eigen(Z)                                                    ▷ eigenvectors E = (e0 | e1 | e2)
12:   k ← argmax_{i=0,1,2} (ei⊺·C1·ei)                                     ▷ check Eqn. (11.72) on each ei
13:   q1 ← ek                                                              ▷ kth column vector of E
14:   q2 ← −T·q1
15:   q ← [[1,0,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0],
           [−2xr,−yr,0,1,0,0], [0,−xr,−2yr,0,1,0],
           [xr², xr·yr, yr², −xr, −yr, 1]] · (q1 | q2)                     ▷ re-adjust for data centering
16:   return q/‖q‖                                                         ▷ normalize q
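As a compact illustration of Alg. 11.6, the following Java sketch performs the Halíř-Flusser fit with the Apache Commons Math library. For brevity it omits the optional data centering (i.e., it assumes xr = (0, 0)), it assumes that Z has real eigenvalues, and the class and method names are our own, not part of any published API.

```java
import org.apache.commons.math3.linear.*;

public class EllipseFitHalirSketch {
  // Halíř-Flusser ellipse fit (cf. Alg. 11.6), without data centering.
  // pts[i] = {xi, yi}; returns q = (A, B, C, D, E, F) or null if no ellipse solution exists.
  public static double[] fit(double[][] pts) {
    int n = pts.length;
    double[][] d1 = new double[n][3], d2 = new double[n][3];
    for (int i = 0; i < n; i++) {
      double x = pts[i][0], y = pts[i][1];
      d1[i][0] = x * x; d1[i][1] = x * y; d1[i][2] = y * y;   // quadratic part X1
      d2[i][0] = x;     d2[i][1] = y;     d2[i][2] = 1;       // linear part X2
    }
    RealMatrix X1 = MatrixUtils.createRealMatrix(d1);
    RealMatrix X2 = MatrixUtils.createRealMatrix(d2);
    RealMatrix S1 = X1.transpose().multiply(X1);
    RealMatrix S2 = X1.transpose().multiply(X2);
    RealMatrix S3 = X2.transpose().multiply(X2);
    RealMatrix T = new LUDecomposition(S3).getSolver().getInverse()
        .multiply(S2.transpose());                            // T = S3^-1 * S2'
    RealMatrix C1inv = MatrixUtils.createRealMatrix(new double[][] {
        {0, 0, 0.5}, {0, -1, 0}, {0.5, 0, 0}});               // C1^-1 (Eqn. (11.71))
    RealMatrix Z = C1inv.multiply(S1.subtract(S2.multiply(T)));   // Eqn. (11.74)
    EigenDecomposition ed = new EigenDecomposition(Z);
    double[] q1 = null;
    for (int i = 0; i < 3; i++) {                             // pick eigenvector with e'C1e = 4AC - B^2 > 0
      double[] e = ed.getEigenvector(i).toArray();
      if (4 * e[0] * e[2] - e[1] * e[1] > 0) { q1 = e; break; }
    }
    if (q1 == null) return null;                              // degenerate input
    RealVector q2 = T.operate(new ArrayRealVector(q1)).mapMultiply(-1);   // q2 = -T*q1
    return new double[] {q1[0], q1[1], q1[2],
        q2.getEntry(0), q2.getEntry(1), q2.getEntry(2)};
  }
}
```

With centered input data, the resulting q would additionally have to be re-adjusted as in line 15 of Alg. 11.6.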



    S = (1/n)·Σ_{i=0..n−1} ξi⊺·ξi    (11.79)

      = (1/n)·Σ_{i=0..n−1} [ ẋi⁴     ẋi³ẏi   ẋi²ẏi²  ẋi³    ẋi²ẏi  ẋi²   ]
                           [ ẋi³ẏi   ẋi²ẏi²  ẋiẏi³   ẋi²ẏi  ẋiẏi²  ẋiẏi  ]
                           [ ẋi²ẏi²  ẋiẏi³   ẏi⁴     ẋiẏi²  ẏi³    ẏi²   ]
                           [ ẋi³     ẋi²ẏi   ẋiẏi²   ẋi²    ẋiẏi   ẋi    ]
                           [ ẋi²ẏi   ẋiẏi²   ẏi³     ẋiẏi   ẏi²    ẏi    ]
                           [ ẋi²     ẋiẏi    ẏi²     ẋi     ẏi     1     ]

      = [ σ4,0  σ3,1  σ2,2  σ3,0  σ2,1  σ2,0 ]
        [ σ3,1  σ2,2  σ1,3  σ2,1  σ1,2  σ1,1 ]
        [ σ2,2  σ1,3  σ0,4  σ1,2  σ0,3  σ0,2 ]
        [ σ3,0  σ2,1  σ1,2  σ2,0  σ1,1  σ1,0 ]
        [ σ2,1  σ1,2  σ0,3  σ1,1  σ0,2  σ0,1 ]
        [ σ2,0  σ1,1  σ0,2  σ1,0  σ0,1  1    ],    (11.80)

where

    σp,q = (1/n)·Σ_{i=0..n−1} ẋi^p·ẏi^q = (1/n)·Σ_{i=0..n−1} (xi − xr)^p·(yi − yr)^q.

Matrix S is symmetric and aggregates the covariances of the point data set X. Note that if (and only if) the reference point xr is the centroid of X, then σ1,0 = (1/n)·Σẋ = 0 and σ0,1 = (1/n)·Σẏ = 0, and

    S = [ σ4,0  σ3,1  σ2,2  σ3,0  σ2,1  σ2,0 ]
        [ σ3,1  σ2,2  σ1,3  σ2,1  σ1,2  σ1,1 ]
        [ σ2,2  σ1,3  σ0,4  σ1,2  σ0,3  σ0,2 ]
        [ σ3,0  σ2,1  σ1,2  σ2,0  σ1,1  0    ]
        [ σ2,1  σ1,2  σ0,3  σ1,1  σ0,2  0    ]
        [ σ2,0  σ1,1  σ0,2  0     0     1    ].    (11.81)
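The normalized moments σp,q needed above (and again in Eqn. (11.87)) are straightforward to compute. The following small helper is only an illustration; the method name and signature are not taken from the book's code.

```java
// Normalized moment sigma_{p,q} of the sample points pts[i] = {xi, yi},
// centered at the reference point (xr, yr), as used in Eqns. (11.80) and (11.87).
static double sigma(double[][] pts, double xr, double yr, int p, int q) {
  double sum = 0;
  for (double[] pt : pts) {
    sum += Math.pow(pt[0] - xr, p) * Math.pow(pt[1] - yr, q);
  }
  return sum / pts.length;
}
```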



Jacobian and Constraint Matrix

The interesting aspect of Taubin's method is that it incorporates statistics of the sample data in the formulation of the optimization constraint. The vectors ξi = ξ(ẋi, ẏi) in Eqn. (11.78) can be interpreted as the outcome of a vector-valued function ξ: R² → R⁶, with

    ξ(x, y) = (x², xy, y², x, y, 1) = (f0(x, y), …, f5(x, y)),    (11.82)

composed of 6 two-dimensional, scalar-valued functions fj: R² → R,

    f0(x, y) = x²,   f1(x, y) = xy,   f2(x, y) = y²,
    f3(x, y) = x,    f4(x, y) = y,    f5(x, y) = 1.

The gradient associated with each component function fj() is a vector-valued function

    ∇fj(x) = ( ∂fj(x)/∂x, ∂fj(x)/∂y )⊺    (11.83)

of the first partial derivatives of fj (see Sec. D.1.2 of the Appendix), that is,

    ∇f0(x, y) = (2x, 0)⊺,   ∇f1(x, y) = (y, x)⊺,   ∇f2(x, y) = (0, 2y)⊺,
    ∇f3(x, y) = (1, 0)⊺,    ∇f4(x, y) = (0, 1)⊺,   ∇f5(x, y) = (0, 0)⊺.

By combining the gradients of all 6 component functions we obtain the 6 × 2 Jacobian matrix for the multi-valued function ξ(x, y) as

    Jξ(x) = [ [∇f0(x)]⊺ ]   [ 2x  0  ]
            [ [∇f1(x)]⊺ ]   [ y   x  ]
            [ [∇f2(x)]⊺ ] = [ 0   2y ]
            [ [∇f3(x)]⊺ ]   [ 1   0  ]
            [ [∇f4(x)]⊺ ]   [ 0   1  ]
            [ [∇f5(x)]⊺ ]   [ 0   0  ].    (11.84)

Each sample point ẋi = (ẋi, ẏi) contributes a 6 × 6 constraint matrix ci, which is obtained from the associated Jacobian matrix as

    ci = [Jξ(ẋi)]·[Jξ(ẋi)]⊺    (11.85)

       = [ 4ẋi²    2ẋiẏi      0      2ẋi  0    0 ]
         [ 2ẋiẏi   ẋi²+ẏi²    2ẋiẏi  ẏi   ẋi   0 ]
         [ 0       2ẋiẏi      4ẏi²   0    2ẏi  0 ]
         [ 2ẋi     ẏi         0      1    0    0 ]
         [ 0       ẋi         2ẏi    0    1    0 ]
         [ 0       0          0      0    0    0 ],    (11.86)

and the complete constraint matrix C is calculated as the mean of the matrices ci, i.e.,

    C = (1/n)·Σ_{i=0..n−1} ci
      = [ 4σ2,0   2σ1,1       0      2σ1,0  0      0 ]
        [ 2σ1,1   σ2,0+σ0,2   2σ1,1  σ0,1   σ1,0   0 ]
        [ 0       2σ1,1       4σ0,2  0      2σ0,1  0 ]
        [ 2σ1,0   σ0,1        0      1      0      0 ]
        [ 0       σ1,0        2σ0,1  0      1      0 ]
        [ 0       0           0      0      0      0 ].    (11.87)

Again, if the sample data are centered at xr = x̄, then σ1,0 = σ0,1 = 0 and, in this case,
    C = [ 4σ2,0   2σ1,1       0      0  0  0 ]
        [ 2σ1,1   σ2,0+σ0,2   2σ1,1  0  0  0 ]
        [ 0       2σ1,1       4σ0,2  0  0  0 ]
        [ 0       0           0      1  0  0 ]
        [ 0       0           0      0  1  0 ]
        [ 0       0           0      0  0  0 ].    (11.88)

The quantities σp,q are defined as in Eqn. (11.79) and can be extracted straight from matrix S (Eqn. (11.81)). The optimal ellipse parameters are then found by solving the generalized eigenvalue problem

    S·q = λ·C·q    (11.89)

and taking as q_opt the eigenvector of the smallest non-negative eigenvalue. As in Eqn. (11.65), both matrices S and C are symmetric, but C is not positive definite, as required for a symmetric generalized eigenvalue problem. However, since S is positive definite, this can again be transformed to a symmetric problem,

    C·q = µ·S·q,    (11.90)

where µ = 1/λ. In this case the solution is the eigenvector associated with the largest eigenvalue µ. As in Alg. 11.6, the resulting parameter vector q_opt needs to be re-adjusted to compensate for the shifted sample data. The complete ellipse fitting procedure is summarized in Alg. 11.7 (Version 1).

Alg. 11.7  Taubin's algorithm (Version 1). The solution is found by solving a generalized symmetric eigenvalue problem (see Alg. B.2), where S is positive definite but C is not.

 1: FitEllipseTaubin1(X, xr)
    Input: X = (x0, …, xn−1), a set of n ≥ 6 2D sample points xi = (xi, yi); xr = (xr, yr), a reference point (e.g., the centroid of X). Returns a vector q = (A, B, C, D, E, F) of algebraic parameters for the optimal ellipse.
 2:   n ← |X|
 3:   let ẋi = (xi − xr),  ẏi = (yi − yr)                                ▷ centered data
 4:   X ← n × 6 matrix with rows (ẋi², ẋi·ẏi, ẏi², ẋi, ẏi, 1)             ▷ design matrix
 5:   S ← (1/n)·(X⊺·X)                                                    ▷ covariance matrix (Eqn. (11.79))
 6:   C ← [[4a, 2c, 0, 2d, 0, 0], [2c, a+b, 2c, e, d, 0], [0, 2c, 4b, 0, 2e, 0],
           [2d, e, 0, 1, 0, 0], [0, d, 2e, 0, 1, 0], [0, 0, 0, 0, 0, 0]],
      with  a = S(0,5) = σ2,0,  b = S(2,5) = σ0,2,  c = S(1,5) = σ1,1,
            d = S(3,5) = σ1,0,  e = S(4,5) = σ0,1                         ▷ constraint matrix, Eqn. (11.87)
      Solve the generalized symmetric eigenvalue problem C·q = λ·S·q:
 7:   (λ, V) ← GeneralSymmetricEigen(C, S)                                ▷ see Alg. B.2 (p. 819)
 8:   k ← argmax_i (λi)                                                   ▷ find largest eigenvalue λk ∈ λ
 9:   q_opt ← V(∗, k)                                                     ▷ get eigenvector vk
10:   q ← M·q_opt                                                         ▷ re-adjust for data centering, with M the 6 × 6 matrix of Alg. 11.6, line 15
11:   return q/‖q‖                                                        ▷ normalize q
11 Fitting Circles and Ellipses



Alternative Implementation (Version 2) An alternative method for solving the original problem Eqn. (11.89) is to reduce the design and constraint matrices S and C, respectively, to size 5 × 5, to make C non-singular [60]. First, S is decomposed to ! S 1 s5 , (11.91) S= s⊺5 1 where vector s5 contains the first five elements of the rightmost column of S (the last 2 elements of s5 being zero if point data are centered). The derived matrix20 ⊺



P = S1 − s5 ·s5



(11.92)



is symmetric and positive definite. Similarly, the constraint matrix C is decomposed (by stripping the zero parts) to ! Q 05 C= (11.93) 0⊺5 0 where Q is of size 5 × 5. Both P and Q are positive definite and thus the generalized (symmetric) eigenvalue problem can be solved as P·q = λ·Q·q,



(11.94)



the optimal solution q 0 being the eigenvector with the largest eigenvalue λk . The complete (6-element) parameter vector q is obtained by extending q 0 with the scalar value F , i.e., ⊺







q = (q 0 | F ) ,



with







F = −s5 ·q 0 .



(11.95)



See Alg. 11.8 for a summary of this procedure. Note that the parameters returned by the Taubin fit do not necessarily describe an ellipse, but a general conic, e.g., a parabola or hyperbola. Thus the Taubin algorithm can be used to perform general conic fitting. Parameters can be checked for being elliptic by testing 4AC −B 2 > 0 (see Eqn. (11.59)). Comparison and Other Algebraic Methods The methods by Fitzgibbon and Taubin perform similarly well when sample points are placed along a large portion of the ellipse perimeter, even under large amounts of measurement noise. Under extreme conditions, i.e., when sample points cover only a small section of the ellipse, results are critically affected by the noise level. A comparison of the Fitzgibbon and Taubin methods under such conditions is shown in Fig. 11.6. For this qualitative experiment, ellipse points were sampled within a small section of the curve under increasing Gaussian x/y positioning noise. While both methods show a similar bias, Taubin’s method is clearly more stable at higher noise levels. Note that under the given conditions the original ellipse typically does not coincide with the minimum least-squares orthogonal distance ellipse obtained from a geometric fit (as described in the following section). Thus the goodness of algebraic fits should rather be measured against a geometric fit to the same sample points rather than against the underlying (yet unknown) “real” ellipse. 20



322







s5 ·s5 is the outer vector product, i.e., evaluates to a 5 × 5 matrix.
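The conic-type test mentioned above amounts to a single expression; as a hedged illustration (helper name chosen freely), it could be written as:

```java
// Returns true if the algebraic conic parameters q = (A, B, C, D, E, F)
// describe an ellipse, i.e., if 4AC - B^2 > 0 (see Eqn. (11.59)).
static boolean isEllipse(double[] q) {
  return 4 * q[0] * q[2] - q[1] * q[1] > 0;
}
```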



1: FitEllipseTaubin2(X, xr ) Input: X = (x0 , . . . , xn−1 ), a collection of n ≥ 6 2D sample points xi = (xi , yi ); xr = (xr , yr ), a reference point (e.g., the centroid of X). Returns a vector q = (A, B, C, D, E, F ) of algebraic parameters for the optimal ellipse. 2: n ← |X| 3:



let x˙ i = (xi −xr ), 2 x˙ 0



.. .



5:



X←



6:



S1 ← S(0:4, 0:4)



7:



s5 ← S(0:4, 5)



8:



P ← S1 − s5 ·s5



9:



Q←



4:



S←



y˙ i = (yi −yr )



x˙ 0 y˙ 0



x˙ 2 n−1



.. .



2 y˙ 0



x˙ 0



.. .



2 x˙ n−1 y˙ n−1 y˙ n−1



y˙ 0



.. .



.. .



▷ design matrix



▷ cov. matrix (Eqn. (11.79)) ▷ 5 × 5 submatrix of S







,



with



⊺ ▷ outer product s5 ·s5 (5 × 5)  a = S0,5 = σ2,0   b = S2,5 = σ0,2 c = S1,5 = σ1,1 = σ1,0 e = S4,5 = σ0,1 )



  d = S3,5



10: 11: 12:



Solve generalized symm. eigenvalue problem P·q = λ·Q·q: (λ, V) ← GeneralSymmetricEigen(P, Q) ▷ see Alg. B.2 k ← argmini {λi ∈ λ | λi ≥ 0} ▷ idx. of min. non-neg. eigenval. q 0 ← V(∗, k) ▷ get eigenvector v k



13:



F ← −s5 ·q 0



14:



⊺ ⊺ q opt ← (q 0 | F ) 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 q ← −2xr −yr 0 1 0 0 −xr −2yr 0 1 2 2 xr xr yr yr −xr −yr



15:



16:







  



return



Alg. 11.8 Taubin’s algorithm (Version 2). Compared to Alg. 11.7, the scatter matrix S1 (and P derived from it) and the constraint matrix Q are reduced to size 5 × 5. Since both matrices are positive definite, the numerical stability of the generalized eigenvalue solution is improved. This algorithm handles exact ellipse fits to 5 points. See Eqns. (11.91)– (11.95) for details.



▷ right col. vector of S (shortened)







2c 0 2d 0 2c a+b 2c e d 0 2c 4b 0 2e 2d e 0 1 0 0 d 2e 0 1



▷ centered data







..  .



xn−1 y˙ n−1 1



⊺ 1 ·(X ·X) n



 4a



1



11.2 Fitting Ellipses



1 ·q ∥q∥



▷ dot product (F ∈ R) 0 0 0 0 0 1







  · q opt







(



re-adjust for data centering



▷ normalize q



11.2.2 Geometric Ellipse Fitting The problem of geometric ellipse fitting is to find ellipse parameters that minimize the sum of the squared Euclidean distances between the ellipse and the given point set. The problem is not quite as simple as in the circle case, since the distance between a point (xi , yi ) and its closest point on the ellipse cannot be calculated in closed form. Consequently, all geometric methods for fitting ellipses are iterative. As in the circle case, an initial guess for the ellipse parameters is usually obtained from an algebraic fit. In this section, we first look at the sub-problem of finding the closest ellipse point and then use this result for geometric ellipse fitting, again based on nonlinear least-squares optimization. As for circle fitting (see Sec. 11.1.3), we distinguish between distance-based and coordinate-based ellipse fitting methods. Finding the Closest Ellipse Point (Euclidean Distance) Geometric ellipse fitting is usually performed in geometric parameter space. The algorithm described below was proposed by Eberly [79]



323



11 Fitting Circles and Ellipses



Fitzgibbon



Taubin



Fig. 11.6 Algebraic ellipse fits at different noise levels. Results obtained with the methods by Fitzgibbon (left) and Taubin (right). Points (n = 100, blue dots) were sampled on the true ellipse (green) and additive Gaussian noise with increasing σ applied to their x/y coordinates. The fitted ellipse is drawn in red.



(c)



σ = 0.1



(d)



(e)



σ = 0.5



(f)



(g)



σ = 2.0 true ellipse



(h) algebraic fit



and can be traced back to an earlier scheme by Hart [122]. A detailed account and analysis can be found in [62].



324



Given some point xi located inside or outside the ellipse, the task ˘ i on the ellipse curve that is closest to xi in is to find the point x ˘ i the projection (or contact) terms of Euclidean distance. We call x point to xi (see Fig. 11.7). Calculations are performed in Cartesian



xi



y



11.2 Fitting Ellipses



v



di ˘i x rb xc



ui



ra



rb



di







θ



˘i u



x



O



O



(a) original coordinate frame



ra (b) canonical coordinate frame



u



coordinates, i.e., with “geometric” ellipse parameters21 p = (xc , yc , ra , rb , θ).



Fig. 11.7 Closest ellipse point calculation in canonical coordinate space. Points in the original coordinate frame (a) are transformed to canonical coordinates (b) by aligning the ellipse to the coordinate axis and shifting its center (xc ) to the origin O. All points are mapped to the first quadrant, where all calculations are per˘ i (of formed. The vector ui → u length di ) is perpendicular to the tangent, i.e., parallel to the ˘ i. gradient at point at u



(11.96)



For the given purpose, the equation of a general, i.e., shifted and rotated ellipse (see Eqn. (F.23)) is rather unwieldy. The problem is greatly simplified by converting (x, y) coordinates to a "canonical" coordinate frame (u, v), with the ellipse aligned to the coordinate axes and centered at the origin (see Fig. 11.7 (b)), which reduces the geometric ellipse equation to

    f(u, v) = u²/ra² + v²/rb² − 1 = 0.        (11.97)

The transformation from original x to canonical u coordinates is²²

    u = (u, v)⊺ = R⁻¹·(x − xc),        (11.98)

where xc = (xc, yc)⊺ denotes the ellipse center and

    R = [cos(θ), −sin(θ); sin(θ), cos(θ)]        (11.99)

is a standard rotation matrix (with R⁻¹ = R⊺). The reverse transformation, i.e., from u back to x coordinates, is

    x = R·u + xc.        (11.100)

Since things are symmetric, calculations can be confined to the first quadrant of the ellipse, to which all points are temporarily mapped (by switching to positive-only coordinates).

The essential condition for the projection point ŭi is that the vector ui − ŭi is normal to the ellipse tangent at the contact point ŭi (see Fig. 11.7 (b)), i.e., parallel to the gradient of f() at point ŭi, which is

    ∇f(ŭi, v̆i) = (∂f(ŭi, v̆i)/∂u, ∂f(ŭi, v̆i)/∂v)⊺ = 2·(ŭi/ra², v̆i/rb²)⊺ = 2·di.        (11.101)

Thus the vector ui − ŭi (connecting points ui and ŭi) must be some multiple (t) of the gradient vector di, i.e.,

    ui − ŭi = t·di    or    (ui, vi)⊺ − (ŭi, v̆i)⊺ = t·(ŭi/ra², v̆i/rb²)⊺,        (11.102)

for some t ∈ ℝ, and therefore²³

    ŭi = s(t) = (sx(t), sy(t))⊺ = ( ra²·ui/(t + ra²), rb²·vi/(t + rb²) )⊺.        (11.103)

Function s(t) is shown in Fig. 11.8 for a specific ellipse and target point ui. Since the contact point ŭi must also be on the ellipse, we substitute u, v in Eqn. (11.97) by sx(t), sy(t), respectively, to obtain the additional constraint

    g(t) = [sx(t)]²/ra² + [sy(t)]²/rb² − 1 = ra²·ui²/(t + ra²)² + rb²·vi²/(t + rb²)² − 1 = 0.        (11.104)

The scalar function g(t) is nonlinear and solving Eqn. (11.104) means to find a root of g(t), e.g., by Newton's method. For this we also need its first derivative, which is

    g′(t) = 2·ra²·ui²/(t + ra²)³ + 2·rb²·vi²/(t + rb²)³.        (11.105)

As Fig. 11.8 shows, there is more than one solution (there are two intersection points between function s(t) and the ellipse). Thus the Newton iteration should start at a point close to ŭi.²⁴ The final value t can then be plugged into Eqn. (11.103) to calculate the contact point ŭi. The corresponding point x̆i is then found by transforming ŭi back from the canonical to the original coordinate frame. See Alg. 11.9 for a summary of the complete procedure. Some examples are shown in Fig. 11.9.

"Distance-Based" Geometric Fitting

Analogous to geometric circle fitting, the task is to find ellipse parameters p = (ra, rb, xc, yc, θ) that minimize the sum of the squared (orthogonal) distances di of the data points X = {xi} from the ellipse, i.e., to minimize

    F(p) = Σ_{xi ∈ X} di² = Σ_{xi ∈ X} ∥x̆i − xi∥²,        (11.106)

where x̆i is the point closest to xi on the ellipse defined by the parameters p. Since x̆i depends (nonlinearly) on the ellipse parameters p, minimizing F(p) is a nonlinear least-squares problem. Again we rely on the abstract function SolveNLS(V, J, z, p0)²⁵ for nonlinear least-squares optimization.

²¹ See Sec. F.3.1 of the Appendix.
²² In transformed (canonical) coordinates, u corresponds to the original data point x, and ŭ corresponds to the closest ellipse point x̆.
²³ t is negative for points ui inside the ellipse and positive for points outside. Function s(t) can also be written in implicit form: rb²·u·(vi − v) − ra²·v·(ui − u) = 0.
²⁴ t0 = max(ra·ui − ra², rb·vi − rb²) is recommended as a starting value in [62].
²⁵ See Sec. C.2 of the Appendix.



Fig. 11.8 Finding the closest ellipse point (in canonical coordinates u, v). Given some target point ui, its closest point ŭi on the ellipse is found at the intersection of the function s(t) (Eqn. (11.103)) and the ellipse curve f(u) = 0 (Eqn. (11.97)). There are two solutions (blue dots). The vector ui − ŭi is normal to the ellipse tangent at point ŭi. In fact, the vector from ui to any point u̇ on curve s(t) is parallel to the gradient of the algebraic distance f() at point u̇ (i.e., normal to the iso-contour). Ellipse parameters: ra = 150, rb = 80.

Fig. 11.9 Orthogonal point projection examples (closest ellipse points).



In the distance-based approach we try to minimize n quantities, i.e., the orthogonal point distances

    di = ∥x̆i(p) − xi∥,        (11.107)

for i = 0, ..., n−1. Thus the value function passed to SolveNLS() is

    V(p) = (v0(p), ..., vn−1(p))⊺ = (d0, ..., dn−1)⊺,        (11.108)

and the associated target vector is z = 0n = (0, ..., 0)⊺. The Jacobian matrix returned by function J(p) is of size n × 5, with rows²⁶

    Ji(p) = ( ∂di(p)/∂ra, ∂di(p)/∂rb, ∂di(p)/∂xc, ∂di(p)/∂yc, ∂di(p)/∂θ ),        (11.109)

where

    ∂di(p)/∂ra = −gi·ŭi²/ra³,        ∂di(p)/∂rb = −gi·v̆i²/rb³,        (11.110)
    ∂di(p)/∂xc = (x̆i − xi)/di,        ∂di(p)/∂yc = (y̆i − yi)/di,        (11.111)
    ∂di(p)/∂θ = [ (yi − y̆i)·(xc − x̆i) − (xi − x̆i)·(yc − y̆i) ] / di,        (11.112)

and

    gi = sgn( (ui − ŭi)·ŭi/ra² + (vi − v̆i)·v̆i/rb² ) · ( (ŭi/ra²)² + (v̆i/rb²)² )^(−1/2).        (11.113)

²⁶ See [5, Sec. A.1] for a derivation of the Jacobian.



Alg. 11.9 Finding the closest point on an ellipse. Given is an ellipse, with radii ra, rb, center xc, yc and orientation θ, and a single 2D point xi outside or inside the ellipse. To find the ellipse point (x̆i) closest to xi, the ellipse is assumed to be axis-aligned and centered at the origin. For this purpose, xi is transformed to this coordinate frame (line 4) and then mapped to point u = (u, v) in the first quadrant (line 5). The only special case to consider is when u is very close to the ellipse center (line 7). Otherwise t is calculated by finding the root of function g(t) using Newton's method (lines 12-20). This typically takes only 4-5 iterations. The resulting ellipse point ŭ is finally mapped back to the original quadrant and coordinate system (line 23). The orthogonal distance d is calculated in u/v-space (line 22) but not used. If only the minimum distance is needed, the calculation could be stopped at this point and d returned (instead of x̆i).

1:  ClosestEllipsePoint(p, xi)
    Input: p = (xc, yc, ra, rb, θ), geometric ellipse parameters; xi, point on the 2D plane.
    Constants: kmax = 100 (max. iteration count), ϵ = 10⁻⁶ (min. step size).
    Returns x̆i, the ellipse point closest to xi.
2:    xc ← (xc, yc)⊺
3:    R ← [cos(θ), −sin(θ); sin(θ), cos(θ)]                        ▷ rotation matrix
4:    (u′, v′)⊺ ← R⊺·(xi − xc)                                     ▷ xi in canonical coordinates
5:    (u, v)⊺ ← (|u′|, |v′|)⊺                                      ▷ u′ mapped to 1st quadrant
6:    if (u + v) ≤ ϵ then                                          ▷ (u, v) is near the ellipse's center
7:      (ŭ, v̆)⊺ ← (0, rb)⊺
8:    else
9:      t ← max(ra·u − ra², rb·v − rb²)                            ▷ t0, as suggested in [62]
10:     ğ ← ∞                                                      ▷ previous value of g
11:     k ← 0
12:     do                                                         ▷ find root of g(t) by Newton's method
13:       k ← k + 1
14:       g ← (ra·u)²/(t + ra²)² + (rb·v)²/(t + rb²)² − 1          ▷ g(t), Eqn. (11.104)
15:       g′ ← 2·( (ra·u)²/(t + ra²)³ + (rb·v)²/(t + rb²)³ )       ▷ g′(t), Eqn. (11.105)
16:       δt ← g/g′                                                ▷ step size
17:       t ← t + δt                                               ▷ Newton step
18:       δg ← g − ğ                                               ▷ change of g w.r.t. prev. value
19:       ğ ← g
20:     while (|δt| > ϵ) ∧ (|δg| > ϵ) ∧ (k < kmax)
21:     (ŭ, v̆)⊺ ← ( ra²·u/(t + ra²), rb²·v/(t + rb²) )⊺            ▷ projection point ŭ, Eqn. (11.103)
22:   d ← √( (u − ŭ)² + (v − v̆)² )                                 ▷ orthogonal distance ∥u − ŭ∥
23:   x̆i ← R·( ŭ·sgn(u′), v̆·sgn(v′) )⊺ + xc                        ▷ map ŭ back to x/y coordinates
24:   return x̆i



Algorithm 11.10 gives a summary of this optimization procedure. Again we use function SolveNLS() to represent a nonlinear least-squares solver (typ. a Levenberg-Marquardt solver), as described in Sec. C.2 of the Appendix. Function V(p) returns, for a specific (ellipse) parameter point p, a vector of distance values (one for each sample point in X) and J(p) returns the associated Jacobian matrix. Both functions are assumed to be called by SolveNLS().

"Coordinate-Based" Geometric Fitting



The problem structure is the same as for the coordinate-based geometric circle fitting procedure in Sec. 11.1.3.



Alg. 11.10 Geometric ellipse fitting ("distance-based").

1:  FitEllipseGeometricDist(X, p0)
    Input: X = (xi), a collection of n ≥ 6 2D sample points xi = (xi, yi); p0, initial geometric ellipse parameters (typ. obtained by algebraic fitting).
    Returns the geometric parameters for the optimal ellipse.
2:    n ← |X|                                          ▷ X, n are global
3:    z ← 0n                                           ▷ "target" vector (length n)
4:    popt ← SolveNLS(V, J, z, p0)                     ▷ run the NLS solver
5:    return popt                                      ▷ ellipse param. popt = (xc, yc, ra, rb, θ)

6:  V(p)                                               ▷ value function, to be called by SolveNLS()
    Returns the n-vector v = (v0, ..., vn−1) of model values for the current parameter "point" p.
7:    v ← new vector ∈ ℝⁿ                              ▷ vector of "model" values
8:    for i ← 0, ..., n−1 do
9:      xi ← X(i)
10:     x̆i ← ClosestEllipsePoint(p, xi)                ▷ see Alg. 11.9
11:     di ← ∥x̆i − xi∥                                 ▷ distance of xi from ellipse
12:     v(i) ← di
13:   return v                                         ▷ v = (v0, v1, ..., vn−1)

14: J(p)                                               ▷ Jacobian function, to be called by SolveNLS()
    Builds and returns the n × 5 Jacobian matrix J for the current parameter "point" p = (xc, yc, ra, rb, θ) (see Eqns. (11.109)-(11.113)).
15:   R ← [cos(θ), −sin(θ); sin(θ), cos(θ)]
16:   J ← new matrix ∈ ℝⁿˣ⁵
17:   for i ← 0, ..., n−1 do
18:     xi = (x, y)⊺ ← X(i)                            ▷ = (xi, yi)⊺
19:     x̆i = (x̆, y̆)⊺ ← ClosestEllipsePoint(p, xi)      ▷ = (x̆i, y̆i)⊺
20:     d ← ∥x̆i − xi∥                                  ▷ distance of xi from ellipse
21:     u = (u, v)⊺ ← R⊺·(xi − xc)                     ▷ = (ui, vi)⊺
22:     ŭ = (ŭ, v̆)⊺ ← R⊺·(x̆i − xc)                     ▷ = (ŭi, v̆i)⊺
23:     g ← sgn( (u−ŭ)·ŭ/ra² + (v−v̆)·v̆/rb² ) · ( (ŭ/ra²)² + (v̆/rb²)² )^(−1/2)    ▷ Eqn. (11.113)
24:     ∂ra ← −g·ŭ²/ra³
25:     ∂rb ← −g·v̆²/rb³
26:     ∂xc ← (x̆ − x)/d
27:     ∂yc ← (y̆ − y)/d
28:     ∂θ ← ( (y−y̆)·(xc−x̆) − (x−x̆)·(yc−y̆) ) / d
29:     J(i, ∗) ← (∂ra, ∂rb, ∂xc, ∂yc, ∂θ)             ▷ ith row of J
30:   return J
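A compact Java sketch of one such Jacobian row (Eqns. (11.110)-(11.113)) is given below. It is illustrative only (not the imagingbook implementation); the already computed closest ellipse point (e.g., obtained with Alg. 11.9 or the earlier sketch) is passed in as {xbi, ybi}.

/**
 * One row of the n x 5 Jacobian of the distance-based fit: the partial
 * derivatives of d_i w.r.t. (ra, rb, xc, yc, theta).
 */
static double[] jacobianRowDist(double ra, double rb, double xc, double yc,
        double theta, double xi, double yi, double xbi, double ybi) {
    double d = Math.hypot(xbi - xi, ybi - yi);       // orthogonal distance d_i

    double cos = Math.cos(theta), sin = Math.sin(theta);
    // canonical coordinates of the data point and the closest ellipse point
    double u  =  cos * (xi - xc)  + sin * (yi - yc);
    double v  = -sin * (xi - xc)  + cos * (yi - yc);
    double ub =  cos * (xbi - xc) + sin * (ybi - yc);
    double vb = -sin * (xbi - xc) + cos * (ybi - yc);

    // g_i, Eqn. (11.113)
    double s = Math.signum((u - ub) * ub / (ra * ra) + (v - vb) * vb / (rb * rb));
    double g = s / Math.hypot(ub / (ra * ra), vb / (rb * rb));

    return new double[] {
        -g * ub * ub / (ra * ra * ra),                                   // ∂d/∂ra, Eqn. (11.110)
        -g * vb * vb / (rb * rb * rb),                                   // ∂d/∂rb
        (xbi - xi) / d,                                                  // ∂d/∂xc, Eqn. (11.111)
        (ybi - yi) / d,                                                  // ∂d/∂yc
        ((yi - ybi) * (xc - xbi) - (xi - xbi) * (yc - ybi)) / d          // ∂d/∂θ, Eqn. (11.112)
    };
}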



The closest point coordinates x̆i = (x̆i, y̆i) are the current "model" values and the coordinates of the data points xi = (xi, yi) are the observed "target" values. Given n data points, the resulting vectors are of length m = 2n, i.e. (analogous to Eqn. (11.54)),

    V(p) = (v0, v1, ..., v2i, v2i+1, ..., v2n−2, v2n−1)⊺ = (x̆0(p), y̆0(p), ..., x̆i(p), y̆i(p), ..., x̆n−1(p), y̆n−1(p))⊺,
    z    = (z0, z1, ..., z2i, z2i+1, ..., z2n−2, z2n−1)⊺ = (x0, y0, ..., xi, yi, ..., xn−1, yn−1)⊺.        (11.114)



Accordingly, the Jacobian matrix to be returned by function J(p) is of size 2n × 5, with one pair of rows for each sample point and one column for each of the five ellipse parameters. For each sample point xi, the Jacobian matrix holds the first derivatives of the x̆/y̆-coordinates of the closest ellipse point x̆i, which depend on the ellipse parameters p. The derivatives quantify how much the closest point coordinates change when the corresponding ellipse parameter is varied, which is important information for the least-squares optimizer for adapting the parameters. Thus each of the n sample points contributes a 2 × 5 block of partial derivatives,

    Ji(p) = ∂x̆i(p)/∂p = [ ∂x̆i/∂ra, ∂x̆i/∂rb, ∂x̆i/∂xc, ∂x̆i/∂yc, ∂x̆i/∂θ;
                          ∂y̆i/∂ra, ∂y̆i/∂rb, ∂y̆i/∂xc, ∂y̆i/∂yc, ∂y̆i/∂θ ],        (11.115)

evaluated for the current ellipse parameters p. The complete Jacobian matrix has 2n rows (one pair of rows for each sample point xi) and 5 columns (one for each parameter), i.e., the 2 × 5 blocks of Eqn. (11.115) are stacked vertically:

    J(p) = [ J0(p); ...; Ji(p); ...; Jn−1(p) ].        (11.116)



Determining the partial derivatives in Eqn. (11.115) is not trivial, since the closest ellipse point is only implicitly known. The results below are taken from [5, Sec. 3.2, A.1], which also provides a detailed derivation. Given a target point xi and its closest ellipse point x̆i (calculated by Alg. 11.9), with canonical coordinates²⁷

    ui = (ui, vi)⊺ = R⊺·(xi − xc),    ŭi = (ŭi, v̆i)⊺ = R⊺·(x̆i − xc),        (11.117)

respectively, the associated Jacobian submatrix (Eqn. (11.115)) is

    Ji(p) = −R·Q⁻¹·(T·U + V) + W,        (11.118)

where p = (xc, yc, ra, rb, θ) and

    Q = [0, 0; vi−v̆i, ŭi−ui] · [1/ra², 0; 0, 1/rb²] + [ŭi/ra², v̆i/rb²; v̆i/rb², −ŭi/ra²],        (11.119)
    T = [0, 0; −v̆i/rb², ŭi/ra²],        (11.120)
    U = [0, 0, −cos(θ), −sin(θ), vi; 0, 0, sin(θ), −cos(θ), −ui],        (11.121)
    V = [1, 0, 0; 0, vi−v̆i, ŭi−ui] · [−ŭi²/ra³, −v̆i²/rb³, 0, 0, 0; −2ŭi/ra³, 0, 0, 0, 0; 0, −2v̆i/rb³, 0, 0, 0],        (11.122)
    W = [0, 0, 1, 0, yc−y̆i; 0, 0, 0, 1, x̆i−xc].        (11.123)

²⁷ See Eqn. (11.99) for the definition of R.
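The following Java fragment assembles the 2 × 5 block Ji(p) of Eqn. (11.118) from the matrices defined above, using plain arrays and an explicit 2 × 2 inverse. It is a sketch for illustration only (the method and helper names are made up, not taken from the imagingbook library); the closest ellipse point is again passed in as {xbi, ybi}.

/** Returns the 2 x 5 Jacobian block J_i(p) of Eqn. (11.118) for one sample point. */
static double[][] jacobianBlockCoord(double ra, double rb, double xc, double yc,
        double theta, double xi, double yi, double xbi, double ybi) {
    double cos = Math.cos(theta), sin = Math.sin(theta);
    // canonical coordinates of the data point and its closest ellipse point (Eqn. (11.117))
    double u  =  cos * (xi - xc)  + sin * (yi - yc);
    double v  = -sin * (xi - xc)  + cos * (yi - yc);
    double ub =  cos * (xbi - xc) + sin * (ybi - yc);
    double vb = -sin * (xbi - xc) + cos * (ybi - yc);
    double ra2 = ra*ra, rb2 = rb*rb, ra3 = ra2*ra, rb3 = rb2*rb;

    // Q (Eqn. (11.119)), written out, and its explicit 2 x 2 inverse
    double[][] Q = {{ ub/ra2, vb/rb2 },
                    { (v - vb)/ra2 + vb/rb2, (ub - u)/rb2 - ub/ra2 }};
    double det = Q[0][0]*Q[1][1] - Q[0][1]*Q[1][0];
    double[][] Qinv = {{  Q[1][1]/det, -Q[0][1]/det },
                       { -Q[1][0]/det,  Q[0][0]/det }};

    double[][] T = {{ 0, 0 }, { -vb/rb2, ub/ra2 }};                     // Eqn. (11.120)
    double[][] U = {{ 0, 0, -cos, -sin,  v },                           // Eqn. (11.121)
                    { 0, 0,  sin, -cos, -u }};
    double[][] V = {{ -ub*ub/ra3, -vb*vb/rb3, 0, 0, 0 },                // Eqn. (11.122), multiplied out
                    { -2*(v - vb)*ub/ra3, -2*(ub - u)*vb/rb3, 0, 0, 0 }};
    double[][] W = {{ 0, 0, 1, 0, yc - ybi },                           // Eqn. (11.123)
                    { 0, 0, 0, 1, xbi - xc }};
    double[][] R = {{ cos, -sin }, { sin, cos }};

    double[][] S = add(mul(T, U), V);            // T·U + V  (2 x 5)
    double[][] B = mul(R, mul(Qinv, S));         // R·Q⁻¹·(T·U + V)
    double[][] J = new double[2][5];
    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 5; c++)
            J[r][c] = -B[r][c] + W[r][c];        // Eqn. (11.118)
    return J;
}

static double[][] mul(double[][] A, double[][] B) {     // plain matrix product
    double[][] C = new double[A.length][B[0].length];
    for (int i = 0; i < A.length; i++)
        for (int k = 0; k < B.length; k++)
            for (int j = 0; j < B[0].length; j++)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

static double[][] add(double[][] A, double[][] B) {     // element-wise sum
    double[][] C = new double[A.length][A[0].length];
    for (int i = 0; i < A.length; i++)
        for (int j = 0; j < A[0].length; j++)
            C[i][j] = A[i][j] + B[i][j];
    return C;
}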



The complete geometric ellipse fitting procedure is summarized in Alg. 11.11. It requires an "initial guess" for the ellipse parameters p0, which is usually obtained by an algebraic ellipse (or circle) fit to the same point set X. The actual work is done by function SolveNLS(), performing iterative nonlinear least-squares optimization.

Geometric Ellipse Fitting Examples

In many practical situations, particularly if sample points are dispersed over a large part of the ellipse and the amount of positioning noise is not excessive, solutions delivered by algebraic fits are already close to the optimal (geometric) results. In this case, geometric fitting procedures will typically only need a few iterations to converge to the refined result. A more difficult example is shown in Fig. 11.10, where the data points only cover a small portion of the ellipse and the initial estimate obtained from the algebraic fit is poor (see also Fig. 11.6). Figure 11.10 shows the intermediate steps for both the distance-based and coordinate-based geometric fitting algorithms described above. In the first example (Fig. 11.10 (a, b)), both algorithms can be seen to converge at similarly slow rates, with the distance-based algorithm being somewhat faster (10 vs. 15 iterations). In the second example (Fig. 11.10 (c, d)), the initial solution was manually specified, such that all ellipse parameters are quite far off their real values and only a few data points are near the ellipse contour. Again the distance-based algorithm shows significantly faster convergence but, on its path to the optimal solution, makes some "bold jumps" in parameter space, as documented by the intermediate ellipse positions, orientations, and radii in Fig. 11.10 (c).²⁸ The coordinate-based algorithm takes significantly longer but also follows a clearly more conservative path in parameter space (Fig. 11.10 (d)).

11.2.3 Orthogonal Distance Approximations

Quick calculation of point-to-ellipse distances is essential in many applications, particularly in (iterative) geometric fitting, where these distances are possibly computed many times. As noted in Sec. 11.2.2, the exact (Euclidean) minimum distance of a point x = (x, y)⊺ to an ellipse E = ⟨A, B, C, D, E, F⟩ (planar quadric) cannot be calculated in closed form but only iteratively (except for a circle). Various methods for calculating approximate orthogonal point-to-ellipse distances have been proposed (see [227] for a good overview), two of which are described in the following.

²⁸ When implemented with a "vanilla" Levenberg-Marquardt optimizer, the distance-based algorithm may overshoot and intermediate ellipse radii may become negative.



Alg. 11.11 Geometric ellipse fit ("coordinate-based").

1:  FitEllipseGeometricCoord(X, p0)
    Input: X = (xi), a collection of n ≥ 6 2D sample points xi = (xi, yi); p0, initial geometric ellipse parameters (typ. obtained by algebraic fitting).
    Returns the geometric parameters for the optimal ellipse.
2:    n ← |X|                                                   ▷ X, n are global
3:    z ← (x0, y0, ..., xi, yi, ..., xn−1, yn−1)⊺                ▷ "target" vector (length 2n)
4:    popt ← SolveNLS(V, J, z, p0)                               ▷ run the NLS solver
5:    return popt                                                ▷ ellipse param. popt = (ra, rb, xc, yc, θ)

6:  V(p)                                                         ▷ value function, to be called by SolveNLS()
    Returns the 2n-vector v = (v0, ..., v2n−1) of model values for the current parameter "point" p.
7:    v ← new vector ∈ ℝ²ⁿ                                       ▷ vector of "model" values
8:    for i ← 0, ..., n−1 do
9:      xi ← X(i)
10:     (x̆i, y̆i) ← ClosestEllipsePoint(p, xi)                    ▷ see Alg. 11.9
11:     v(2i) ← x̆i
12:     v(2i+1) ← y̆i
13:   return v                                                   ▷ v = (x̆0, y̆0, ..., x̆n−1, y̆n−1)

14: J(p)                                                         ▷ Jacobian function, to be called by SolveNLS()
    Builds and returns the 2n × 5 Jacobian matrix J for the current parameter "point" p (see Eqn. (11.116)).
15:   J ← new matrix ∈ ℝ²ⁿˣ⁵                                     ▷ Jacobian matrix of size 2n × 5
16:   for i ← 0, ..., n−1 do
17:     xi ← X(i)
18:     Ji ← GetJacobian(p, xi)                                  ▷ = Ji(p), see Eqn. (11.115)
19:     J(2i, ∗) ← Ji(0, ∗)                                      ▷ copy row 0 of Ji
20:     J(2i+1, ∗) ← Ji(1, ∗)                                    ▷ copy row 1 of Ji
21:   return J

22: GetJacobian(p, xi)
    Builds and returns the 2 × 5 Jacobian Ji(p) for ellipse parameters p = (xc, yc, ra, rb, θ) and a specific sample point xi (see Eqn. (11.115)).
23:   R ← [cos(θ), −sin(θ); sin(θ), cos(θ)]
24:   x̆i = (x̆, y̆)⊺ ← ClosestEllipsePoint(p, xi)
25:   u = (u, v)⊺ ← R⊺·(xi − xc)
26:   ŭ = (ŭ, v̆)⊺ ← R⊺·(x̆i − xc)
27:   Q ← [0, 0; v−v̆, ŭ−u]·[1/ra², 0; 0, 1/rb²] + [ŭ/ra², v̆/rb²; v̆/rb², −ŭ/ra²]                        ▷ Eqn. (11.119)
28:   T ← [0, 0; −v̆/rb², ŭ/ra²]                                                                         ▷ Eqn. (11.120)
29:   U ← [0, 0, −cos(θ), −sin(θ), v; 0, 0, sin(θ), −cos(θ), −u]                                        ▷ Eqn. (11.121)
30:   V ← [1, 0, 0; 0, v−v̆, ŭ−u]·[−ŭ²/ra³, −v̆²/rb³, 0, 0, 0; −2ŭ/ra³, 0, 0, 0, 0; 0, −2v̆/rb³, 0, 0, 0]  ▷ Eqn. (11.122)
31:   W ← [0, 0, 1, 0, yc−y̆; 0, 0, 0, 1, x̆−xc]                                                          ▷ Eqn. (11.123)
32:   Ji ← −R·Q⁻¹·(T·U + V) + W                                                                         ▷ 2 × 5 matrix
33:   return Ji



Fig. 11.10 Distance-based vs. coordinate-based geometric ellipse fit. Both algorithms find the same results, the distance-based algorithm converging clearly faster than the coordinate-based algorithm. Red ellipses correspond to the initial parameters, blue ellipses are the results of geometric fitting. An algebraic (Fitzgibbon) fit was used to estimate the initial ellipse parameters in (a, b). In (c, d), the starting ellipse was specified manually to simulate a poor initial fit. Panels: (a, c) distance-based algorithm, (b, d) coordinate-based algorithm; numbered outlines mark the intermediate steps.



Sampson Distance

The algebraic distance

    dA(x) = A·x² + B·x·y + C·y² + D·x + E·y + F        (11.124)

is zero if point x is exactly on the ellipse. Unfortunately, the algebraic distance, although easy to calculate, is not useful as a distance measure by itself. It grows quadratically with increasing distance from the ellipse, since dA() is a second-order polynomial in x, y. It is also strongly biased in the direction of the major axis (see Fig. 11.11 (a)). While taking the square root of the algebraic distance (i.e., √|dA(x)|) gives a quantity that is naturally closer to being linear, it is unable to remove this bias (Fig. 11.11 (b)).

The rate of change of dA(x) at a particular position x = (x, y) corresponds to the magnitude (i.e., the norm) of the gradient of dA() at that point. The gradient itself (i.e., the vector of first partial derivatives) is

    ∇dA(x) = ( ∂dA(x)/∂x, ∂dA(x)/∂y )⊺ = ( 2·A·x + B·y + D, 2·C·y + B·x + E )⊺,        (11.125)

and its magnitude is

    ∥∇dA(x)∥ = √( (2·A·x + B·y + D)² + (2·C·y + B·x + E)² ).        (11.126)

Fig. 11.11 Algebraic distance. Contour plots of the algebraic distance dA(x) (a) and the square root of its magnitude √|dA(x)| (b), for a given ellipse (red line). Both functions are zero on the ellipse itself. To be useful as a distance function, all isolines should be equidistant.



The so-called "Sampson distance" (introduced by Sampson in [230])²⁹ normalizes the algebraic distance in Eqn. (11.124) by simply scaling it with the norm of its gradient:

    dS(x) = dA(x) / ∥∇dA(x)∥.        (11.127)
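Since dS is obtained directly from the algebraic conic parameters, it takes only a few lines of code. The Java method below is a sketch for illustration (not the imagingbook API):

/**
 * Sampson distance (Eqn. (11.127)) of point (x, y) from the conic
 * A x^2 + B x y + C y^2 + D x + E y + F = 0.
 */
static double sampsonDistance(double A, double B, double C,
                              double D, double E, double F,
                              double x, double y) {
    double dA = A*x*x + B*x*y + C*y*y + D*x + E*y + F;   // algebraic distance, Eqn. (11.124)
    double gx = 2*A*x + B*y + D;                         // gradient, Eqn. (11.125)
    double gy = 2*C*y + B*x + E;
    return dA / Math.hypot(gx, gy);                      // Eqn. (11.127)
}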



The resulting distance function is approximately linear w.r.t. the Euclidean point distance and can be calculated non-iteratively and very efficiently. The function evaluates to zero for locations on the curve itself, but becomes infinite at the ellipse center, where the gradient norm vanishes. The Sampson distance is widely used for conic fitting, because it is simple and efficient to calculate and provides a good approximation to the orthogonal (Euclidean) distance in the vicinity of the curve. For example, an iterative ellipse fitting algorithm based on the Sampson distance (as a substitute for the geometric distance) is described in [255]. However, despite its popularity, it has several drawbacks that should be considered in fitting applications. First, estimates become quite inaccurate at greater distances, which may not be a great problem since such points are probably outliers anyway. Also note that the distance estimate is far from being "isotropic", with peculiar anomalies along the major axis (as can be seen in Fig. 11.12 (a)). Second, the Sampson distance increases rapidly inside the ellipse and is even infinite at its center, which will strongly affect any fit that includes inner points.

Confocal Conic Distance Approximation

An interesting approximation to the orthogonal Euclidean distance, proposed by Rosin in [227], is based on the intersection of the ellipse with the confocal hyperbola that passes through the given point xi.

²⁹ Also called gradient-weighted or approximated maximum likelihood (AML) distance.



Fig. 11.12 Sampson distance vs. confocal conic distance. The Sampson distance (a) is approximately linearized w.r.t. the algebraic distance, as indicated by the (almost) equidistant contour lines, at least outside the ellipse. Inside the ellipse, the magnitude of the Sampson distance increases non-linearly and is infinite at the ellipse center. The confocal conic distance (b) is very close to the (true) Euclidean distance, indicated by equidistant isolines outside and inside the ellipse. Blue circles indicate the estimated distance of sample points to the ellipse. Estimates are correct if the associated circles touch the (red) ellipse.

For simplicity, all calculations are performed in canonical coordinates (u, v), i.e., with the ellipse being axis-aligned and centered at the origin, as shown in Fig. 11.7. All points xi in the original coordinate frame are transformed to canonical coordinates

    ui = R⁻¹·(xi − xc),        (11.128)



with the ellipse center xc and the rotation matrix R, as defined in Eqn. (11.99). Again the discussion can be confined to the first quadrant of the ellipse, since the contact point ŭi is always in the same quadrant as the target point ui, i.e., all coordinates are positive. The ellipse equation in the canonical frame is (cf. Eqn. (11.97))

    u²/ra² + v²/rb² = 1,        (11.129)

and the focal points F1, F2 of this ellipse are located on the u-axis at distance ±fe from the origin (see Fig. 11.13), where

    fe² = ra² − rb²        (11.130)

is the squared focal radius. Analogously, the equation of a hyperbola in canonical space is

    u²/sa² − v²/sb² = 1,        (11.131)

with radii sa, sb and focal radius given by

    fh² = sa² + sb².        (11.132)

If an ellipse and a hyperbola share the same focal points, i.e., fe = fh, they are said to be confocal. In this case, the two curves are orthogonal at their intersections, i.e., their tangents are normal to each other at all intersection points. A hyperbola that is confocal with the ellipse specified by ra, rb and also passes through a given point ui = (ui, vi) must satisfy

    ui²/sa² − vi²/sb² = ui²/sa² − vi²/(fe² − sa²) = 1.        (11.133)



Fig. 11.13 Point-to-ellipse distance estimate by confocal conic (canonical coordinates). The confocal hyperbola is constructed to share the same focal points (F1, F2) as the ellipse and to pass through the target point ui. The curves are orthogonal at any intersection point (ŭi), which can be calculated in closed form. The closest point distance is estimated as the Euclidean distance between ui and ŭi. Ellipse/hyperbola parameters: ra = 150, rb = 80, sa ≈ 182.7, sb ≈ 106.2, fe = fh ≈ 126.9.



The single unknown sa² is found by solving the quadratic equation

    (sa²)² − (ui² + vi² + fe²)·(sa²) + ui²·fe² = 0,        (11.134)

with the outcome (the smaller of the two solutions)

    sa² = [ ui² + vi² + fe² − √( (ui² + vi² + fe²)² − 4·ui²·fe² ) ] / 2.        (11.135)

By Eqn. (11.132), sb² = fe² − sa², and the intersection ŭi of the ellipse and hyperbola is found by jointly solving Eqns. (11.129, 11.131) as³⁰

    ŭi = 1/√(ra²·sb² + rb²·sa²) · ( ra·√(sa²·(rb² + sb²)), rb·√(sb²·(ra² − sa²)) )⊺.        (11.136)

Finally, the distance of ui to the ellipse is simply estimated as the Euclidean distance between ui and the intersection point ŭi, i.e.,

    dC(ui) = ∥ui − ŭi∥ = √( (ui − ŭi)² + (vi − v̆i)² ).        (11.137)

The complete situation is illustrated in Fig. 11.13 for a single point ui. Algorithm 11.12 gives a summary of the calculations involved. Function ClosestEllipsePointApprox(p, xi) returns the (approximated) ellipse point x̆i closest to the given point xi. The approximate distance could then be calculated in the original coordinate space as d = ∥xi − x̆i∥. Alternatively, if point x̆i itself is not required, the distance may be calculated directly in u/v-space, as suggested by Eqn. (11.137) (see Alg. 11.12, line 12). Note that the confocal conic distance works flawlessly for points inside and outside the ellipse and approaches the true (Euclidean) orthogonal point distance as points move further away from the ellipse. It can be calculated efficiently in closed form and is thus an excellent substitute for the orthogonal distance if performance is critical, e.g., in ellipse detection by RANSAC (see Sec. 12.1.5).

³⁰ There are actually four solutions but only the one in the first quadrant (with positive-only coordinates) is relevant.



Fig. 11.14 Confocal conic distance approximation. The given ellipse (red) specifies two focal points F1, F2 (not shown), which are shared by the hyperbolae passing through each of the data points (blue dots). The uniformly spaced contours of the distance function confirm the high quality of this approximation, which also holds for the interior of the ellipse.



Alg. 11.12 Confocal conic distance approximation. Given is an ellipse with radii ra, rb, center xc, yc and orientation θ, and a single 2D point xi outside or inside the ellipse. The transformation of point xi to the canonical u, v coordinate frame and back to original coordinates is the same as in Alg. 11.9. The (approximate) closest ellipse point is found in closed form (i.e., without iterations) as the intersection of the ellipse and the confocal hyperbola passing through the transformed point u. All calculations are performed in the first quadrant. The distance estimate d is calculated in u/v-space (line 12) but not used. If only the minimum distance is needed, the calculation could be stopped at this point and d returned (instead of x̆i).

1:  ClosestEllipsePointApprox(p, xi)
    Input: p = (ra, rb, xc, yc, θ), geometric ellipse parameters; xi, point on the 2D plane.
    Returns x̆i, the estimated ellipse point closest to xi.
2:    xc ← (xc, yc)⊺
3:    R ← [cos(θ), −sin(θ); sin(θ), cos(θ)]                               ▷ rotation matrix
4:    (u′, v′)⊺ ← R⊺·(xi − xc)                                            ▷ xi in canonical coordinates
5:    (u, v)⊺ ← (|u′|, |v′|)⊺                                             ▷ u′ mapped to 1st quadrant
6:    fe² ← ra² − rb²
7:    sa² ← [ u² + v² + fe² − √((u² + v² + fe²)² − 4·u²·fe²) ] / 2        ▷ Eqn. (11.135)
8:    sb² ← fe² − sa²                                                     ▷ Eqn. (11.132)
9:    c ← 1/√(ra²·sb² + rb²·sa²)
10:   ŭ ← c·ra·√(sa²·(rb² + sb²))                                         ▷ Eqn. (11.136)
11:   v̆ ← c·rb·√(sb²·(ra² − sa²))
12:   d ← √((u − ŭ)² + (v − v̆)²)                                          ▷ distance estimate ∥u − ŭ∥
13:   x̆i ← R·(ŭ·sgn(u′), v̆·sgn(v′))⊺ + xc                                 ▷ map ŭ back to x/y coordinates
14:   return x̆i
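A direct Java transcription of Alg. 11.12 takes only a few lines; again this is an illustrative sketch whose names are not taken from the imagingbook library.

/**
 * Approximates the closest ellipse point for (x, y) by the confocal conic
 * method (Alg. 11.12). Ellipse: radii ra >= rb, center (xc, yc), orientation theta.
 */
static double[] closestEllipsePointApprox(double ra, double rb,
        double xc, double yc, double theta, double x, double y) {
    double cos = Math.cos(theta), sin = Math.sin(theta);
    // canonical coordinates, mapped to the first quadrant
    double u1 =  cos * (x - xc) + sin * (y - yc);
    double v1 = -sin * (x - xc) + cos * (y - yc);
    double u = Math.abs(u1), v = Math.abs(v1);

    double fe2 = ra * ra - rb * rb;                              // Eqn. (11.130)
    double q = u * u + v * v + fe2;
    double sa2 = (q - Math.sqrt(q * q - 4 * u * u * fe2)) / 2;   // Eqn. (11.135)
    double sb2 = fe2 - sa2;                                      // Eqn. (11.132)

    double c = 1.0 / Math.sqrt(ra * ra * sb2 + rb * rb * sa2);
    double ub = c * ra * Math.sqrt(sa2 * (rb * rb + sb2));       // Eqn. (11.136)
    double vb = c * rb * Math.sqrt(sb2 * (ra * ra - sa2));

    // map back to the original quadrant and coordinate frame
    double us = Math.copySign(ub, u1), vs = Math.copySign(vb, v1);
    return new double[] { cos * us - sin * vs + xc, sin * us + cos * vs + yc };
}

The distance estimate dC is then simply the Euclidean distance between (x, y) and the returned point.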



11.3 Java Implementation

The imagingbook library provides one-to-one implementations for most of the definitions and algorithms described in this chapter. All circle representations implement the common interface Circle:³¹

GeometricCircle (parameters xc, yc, r, see Eqn. (11.1)),
AlgebraicCircle (parameters B, C, D, see Eqn. (11.3)).

³¹ Package imagingbook.pub.geometry.circle



Similarly, all ellipse representations implement the interface Ellipse:³²

GeometricEllipse (parameters xc, yc, ra, rb, θ, see Eqn. (11.96)),
AlgebraicEllipse (parameters A, ..., F, see Eqn. (11.58)).

Methods for converting between the different representations are defined for all classes. For example, the static method GeometricEllipse.from(AlgebraicEllipse ae) returns a new instance of GeometricEllipse from a given object of type AlgebraicEllipse. Analogous methods are available for converting the circle representations.

11.3.1 Circle Fitting

Algebraic and geometric circle fitters are defined as individual Java classes implementing the common interfaces CircleFitAlgebraic or CircleFitGeometric, respectively.³³

Algebraic Circle Fits:
CircleFitKasa (Alg. 11.1),
CircleFitPratt (Alg. 11.2),
CircleFitTaubin (see [258]),
CircleFitHyper (see [6]).

Geometric Circle Fits:
CircleFitGeometricDist (distance-based),
CircleFitGeometricCoord (coordinate-based).

An exact circle fit from three given 2D points can be obtained with class CircleFit3Points (see Sec. F.2.2 of the Appendix).

11.3.2 Ellipse Fitting

Algebraic and geometric ellipse fitters implement the common interfaces EllipseFitAlgebraic or EllipseFitGeometric, respectively.³⁴

Algebraic Ellipse Fits:
EllipseFitFitzgibbon (Alg. 11.5),
EllipseFitFitzgibbonStable (Alg. 11.6),
EllipseFitTaubin1 (Alg. 11.7),
EllipseFitTaubin2 (Alg. 11.8).

Geometric Ellipse Fits:
EllipseGeometricFitDist (distance-based, Alg. 11.10),
EllipseGeometricFitCoord (coordinate-based, Alg. 11.11).

In addition, class EllipseFit5Points is defined to perform an exact ellipse fit from five given 2D points (see Sec. F.3.3 of the Appendix).

³² Package imagingbook.pub.geometry.ellipse
³³ Package imagingbook.pub.geometry.fitting.circle
³⁴ Package imagingbook.pub.geometry.fitting.ellipse



Example

The circle fitting example in Prog. 11.1 illustrates the basic use of the classes described in this section. Given an array of 2D points (of type Pnt2d), the program first performs an algebraic circle fit using an instance of class CircleFitPratt (fitA). This yields the parameters (q) to instantiate an AlgebraicCircle object (ca1), which is immediately converted to a GeometricCircle (cg1). This circle is then used as the starting solution for a geometric circle fit of type CircleFitGeometricDist (fitG), from which the refined circle (cg2) is obtained directly. Additional examples (ImageJ plugins) can be found in the online source code.

Prog. 11.1 Algebraic and geometric circle fitting example using classes CircleFitPratt and CircleFitGeometricDist.

import imagingbook.pub.geometry.basic.Pnt2d;
import imagingbook.pub.geometry.circle.AlgebraicCircle;
import imagingbook.pub.geometry.circle.GeometricCircle;
import imagingbook.pub.geometry.fitting...CircleFitAlgebraic;
import imagingbook.pub.geometry.fitting...CircleFitGeometric;
...
Pnt2d[] pts = ... ;    // array of 2D points

// first perform an algebraic (Pratt) circle fit:
CircleFitAlgebraic fitA = new CircleFitPratt(pts);
double[] q = fitA.getParameters();
AlgebraicCircle ca1 = new AlgebraicCircle(q);
GeometricCircle cg1 = GeometricCircle.from(ca1);

// use cg1 as the initial circle for a geometric fit:
CircleFitGeometric fitG = new CircleFitGeometricDist(pts, cg1);
GeometricCircle cg2 = fitG.getCircle();
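For ellipse fitting, the usage pattern is analogous. The following fragment is a sketch only, not taken from the book or the library documentation; the constructor and accessor names (in particular getEllipse()) are assumptions modeled on the circle example in Prog. 11.1 and may differ in the actual imagingbook API.

Pnt2d[] pts = ... ;    // array of 2D sample points (n >= 6)

// algebraic (Fitzgibbon) fit to obtain an initial ellipse (assumed API):
EllipseFitAlgebraic fitA = new EllipseFitFitzgibbonStable(pts);
AlgebraicEllipse ae = new AlgebraicEllipse(fitA.getParameters());
GeometricEllipse ge0 = GeometricEllipse.from(ae);

// refine by a geometric (coordinate-based) fit, starting from ge0 (assumed API):
EllipseFitGeometric fitG = new EllipseGeometricFitCoord(pts, ge0);
GeometricEllipse ge = fitG.getEllipse();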



12 Detecting Geometric Primitives



Many geometrical shapes, such as lines, circles, and ellipses, can be readily described by simple equations with only a few parameters. Since such geometric "primitives" often occur as part of man-made objects, they are especially useful features for the analysis of these types of images (see Fig. 12.1). In Chapters 10 and 11 the task was to fit a geometric model (e.g., a line, circle, or ellipse) to a given set of 2D data points under the important assumption that all points actually belong to that shape. What we did not account for is the situation that the data set contains points which are not part of the shape and thus should not be included in the fit. Such points are called "outliers" and, unfortunately, this is the typical situation in reality. In fact, outliers can often occur in relatively large numbers, as illustrated by the example in Fig. 12.2. Including one or more outliers in the fitting process will, depending on their distance to the actual shape, give distorted results. This is particularly pronounced if least-squares fitting is used, since the impact of any data point is proportional to its squared distance from the model. Thus, if a remote outlier is falsely included in the fit, the results may become useless despite the support from many "good" data points.

Fig. 12.1 Simple geometrical shapes such as sections of lines, circles, and ellipses are often found on man-made objects.

Fig. 12.2 Multiple incomplete lines within a "sea of outliers".

Another issue to be considered is that the data may contain multiple instantiations of the sought model. In this case, points that represent outliers w.r.t. one instance are probably inliers for another instance embedded in the same data. There are basically two strategies for making the fitting process more "robust" in the presence of outliers. One approach is to replace the squared error term by a different cost function that limits the influence of distant sample data, e.g., so-called M-estimators (see, e.g., [96, Sec. 10.4]). Unfortunately, this makes the fitting problem nonlinear and thus requires iterative methods. The second method is to somehow identify outliers and ignore them in the fitting process. However, finding the real outliers is not trivial. One may be tempted to simply discard those data points that are furthest away from the model, i.e., yield the largest error terms. While this idea appears initially attractive, one can never be sure if such a decision is correct, since the model itself may be distorted by picking the wrong "inliers". What makes the situation complicated is that this is really a "chicken and egg" problem: if the model were correct, outliers would be easy to identify; conversely, if we knew the outliers, fitting the model would be straightforward. Obviously, neither of the two can be solved without solving the other first.

This chapter presents two classic methods that attempt to solve this dilemma in completely different ways. The first approach, called Random Sample Consensus (RANSAC), relies on randomly picking a minimal set of data points from which to instantiate the given model, and then collecting all other points that are in agreement with this model. This is repeated multiple times and the model with the highest support is chosen (see Sec. 12.1). In the second approach, known as the Hough transform, every data point "votes" for all model instances that could potentially have produced that point, using a discrete parameter space and voting array (see Sec. 12.2). Solutions are chosen as the model instances that received the maximum number of votes. While the principal concepts are amazingly simple (though not necessarily trivial to understand), both methods may appear a bit wasteful and perhaps even "brutal" at first sight. Nevertheless, if properly designed and tuned they are surprisingly effective in many practical applications.



12.1 Random Sample Consensus (RANSAC)

The method described in this section is attributed to Fischler and Bolles [83]. It was originally conceived for estimating relative camera positions from perspective images (which is still a very important application) but is indeed a general technique for estimating model parameters from observed data. The principle is simple. Given is a set of observations (or measurements) of which some are assumed to be the result of a known model, but neither the identity of the associated data points nor the parameters of the model are known. The goal is to find a maximal subset of "good" points that support the given model and to determine its parameters by (typ. least-squares) fitting. This is accomplished by the following iterative process, with X being the original data set:

1. Randomly draw k items D from X, with k just large enough to fit the given model.
2. Instantiate the model from the sample data in subset D.
3. Evaluate how many other measurements in X are "in agreement" with the instantiated model (i.e., find the number of inliers).
4. Keep the solution with the highest score and repeat from Step 1 for a predefined number of iterations (M).
5. Calculate the final model by fitting to the associated inliers.

In the concrete case of line detection, k = 2, since two points are necessary and sufficient to instantiate a straight line in 2D (see Fig. 12.3). Analogously, k = 3 points are required for detecting circles, k = 5 points for arbitrary ellipses, etc. Of course, care must be taken not to select the same data point twice.



Fig. 12.3 RANSAC scheme for straight line detection. Two points (red) are randomly selected from the data set to establish an initial line. Then all other points are checked to see if they are sufficiently close (≤ δ) to the line. Each try is scored by the associated number of "inliers" (green). Panels: (a) 7 inliers, (b) 19 inliers.



12.1.1 How Many Random Draws Are Needed?

So how many random draws (M) are needed, and why not simply examine all possible subsets? The number of combinations to draw k from n = |X| items (with ordering being irrelevant and any element picked only once) is

    comb(n, k) = C(n, k) = n! / (k!·(n−k)!),        (12.1)

which, for k = 2 (in the case of straight lines), reduces to

    comb(n, 2) = n·(n−1)·(n−2)! / (2!·(n−2)!) = n·(n−1)/2! = (n² − n)/2,        (12.2)

i.e., the number of unique pairs grows quadratically, at a rate of ≈ 0.5·n². While this does not sound particularly dramatic, the resulting numbers are likely too big to examine all possible point pairs exhaustively. For example, for a data set of size n = 1000 this makes exactly 499,500 point combinations; for n = 2000 we already get about 2 million pairs, etc.

The interesting aspect of RANSAC is that it may work successfully with a greatly reduced number of trials. The actual number depends on the statistics of the data set X, i.e., the proportion of inliers vs. outliers. Assuming that the data set contains only a single line, the likelihood of randomly picking two points on that line is quite large if X contains only a small number of outliers. The larger the proportion of outliers, the smaller is the probability of picking a pair of inliers.

Let w ∈ [0, 1] be the proportion of "good" data points in X and p the desired probability of successfully finding a hit after M trials. When drawing a random pair from X, the probability of getting two good points in a row (assuming that two consecutive picks are independent events) is w·w = w², and thus the probability of picking at least one outlier is 1 − w². After m independent random draws, the probability of never picking a pair of inliers (i.e., picking m bad pairs) is

    (1 − w²)·(1 − w²)· ... ·(1 − w²)  [m times]  = (1 − w²)^m = z,        (12.3)

and thus the probability of picking at least one pair of good line points is p = 1 − z. If this "confidence level" p is specified, then the required minimum number of tries can be estimated as

    M = ⌈ log(1 − p) / log(1 − w^k) ⌉.        (12.4)
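The estimate of Eqn. (12.4) is a one-liner in code; for example (an illustrative Java sketch, not library code):

/** Required number of RANSAC draws (Eqn. (12.4)) for inlier ratio w,
 *  sample size k and confidence level p. */
static int requiredDraws(double w, int k, double p) {
    return (int) Math.ceil(Math.log(1 - p) / Math.log(1 - Math.pow(w, k)));
}

// e.g., requiredDraws(0.5, 2, 0.99) yields 17, requiredDraws(0.1, 2, 0.99) yields 459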



Thus, for a fixed pick size k (k = 2 for lines) and confidence level p (a typical value is p = 0.99), the number of required random picks only depends on the proportion of inliers w and is independent of the total number of samples!

Fig. 12.4 Probability of picking at least one pair (k = 2) of "good" line points after m trials for different inlier ratios w = 0.1, ..., 0.5. For example, with 50% inliers (w = 0.5) it takes only m = 20 trials to achieve nearly 100% confidence. On the other hand, with only 10% inliers (w = 0.1) no more than about 64% confidence is reached after m = 100 trials.

Figure 12.4 shows the relationship between the number of trials (m) and the associated success rate p for different inlier ratios w and k = 2. For example, if 50% of the data points are inliers (i.e., w = 0.5), the chances of picking at least one pair of good line points are almost 100% after only 20 tries. Even with only 20% inliers (i.e., 80% outliers) the probability of success is roughly 95% after 100 tries. This is much less than the effort needed for exhaustive search (see Eqn. (12.2)), which also depends on the absolute size of the data set (n).

Figure 12.5 shows the minimum number of required tries (M) as a function of the inlier ratio w and the pick size k, for the confidence level fixed at p = 0.99. For k = 2 (i.e., for detecting lines), note the sharp increase of the number of tries for inlier ratios below 0.1. For example, for k = 2 and w = 0.1, about M = 460 random draws are required. This grows to almost 2000 draws for w = 0.05 and about 7500 for w = 0.025, which indicates that the method becomes ineffective when the proportion of inliers is too small. However, as Fig. 12.5 shows, this gets a lot worse for larger pick sizes, e.g., k = 3 for detecting circles and k = 5 for ellipses.

Fig. 12.5 Number of required random tries (M) for pick sizes k = 2, ..., 5 and varying inlier ratio w, for the confidence level fixed at p = 0.99 (see Eqn. (12.4)). For k = 2 (i.e., picking a pair of samples), there is a strong increase of required tries for w < 0.1, i.e., if there are more than 90% outliers. The tolerable proportion of outliers is reduced with increasing pick size k.






12.1.2 RANSAC Line Detection Algorithm

Let us return to 2D line detection as an illustrative application of the RANSAC method. Only two sample points are required to instantiate an initial line model, and (orthogonal) least-squares fitting can be done very effectively, as described in Sec. 10.2.2. And of course we continue to use the algebraic line model Ax + By + C = 0 (with parameters A, B, C) in everything that follows. The degree of "consensus" in the data set X = {xi} for a specific line L is determined from the orthogonal point-to-line distances di = dist(xi, L), which are easy to calculate (see Eqn. (10.12)). A point xi ∈ X is considered an inlier (i.e., "in support of" line L) if it is within some fixed distance δ from the line L, i.e.,¹

    |dist(xi, L)| ≤ δ.        (12.5)

The distance threshold δ should correspond to the amount of noise in the data but is difficult to determine analytically² and is usually found by experimentation. The union of the qualified points, i.e.,

    S = {xi ∈ X | |dist(xi, L)| ≤ δ},        (12.6)

is the consensus set for line L and the associated consensus score is simply the size of this set, i.e., s = |S|. The goal is to find the line with the maximum score. It is also a good idea to specify a minimum support score smin to avoid cluttered results by rejecting lines with small support. Again the value of smin depends on the situation and is difficult to determine a priori.

The basic RANSAC scheme for line detection is summarized in Alg. 12.1. Procedure RansacFindBestLine() tries to detect the strongest line in the data set X by performing M random draws, with distance threshold δ and minimum point count smin. Note that the initial line fit (Alg. 12.1, line 8) includes only two points and is trivial to calculate,³ while any of the orthogonal line fitting algorithms described in Sec. 10.2.2 can be used for the final line fit over the complete consensus set (line 16). It should also be noted that only the inlier count |S| is required for the consensus score (line 9), not the set S itself.

12.1.3 Detecting Multiple Lines

Algorithm 12.1 only detects a single line, but often more than one instance of the model is contained in the data set. A spontaneous idea could be to use the same algorithm but to keep a sorted list of the strongest lines detected during the M random draws. Unfortunately, this does not work well at all, since any line with a significant number of points will be detected over and over again, and thus many strong lines with similar consensus sets will appear in the list. One simple remedy is to remove the associated inliers Sj from the data set X after a line Lj has been detected and run the following detection on the reduced data set X′ = X \ Sj. This is repeated until no more acceptable lines (with sufficient support) can be found, as summarized in Alg. 12.2. Its main loop invokes function RansacFindBestLine() from Alg. 12.1 on progressively smaller data sets and terminates when no more lines with sufficient consensus score (smin) are found.

¹ The absolute value is needed in Eqn. (12.6) since dist(xi, L) is signed.
² See, e.g., [124, Sec. 10.4].
³ See Eqn. (F.9) in the Appendix (p. 863).



Alg. 12.1 RANSAC for detecting a single line. In each iteration of the main loop (lines 3-11) two non-identical sample points x0, x1 are randomly picked from the data set X to instantiate line L (line 8). The line's consensus set S is found by function CollectInliers(), which gathers all data points whose distance from L is less than δ (lines 18-26). All lines with scores below smin are ignored and the "strongest" line Lmax (and the associated score smax) is stored (line 11). Finally, if any acceptable line was found, its parameters are re-estimated using all points in the associated consensus set (line 16) and returned.

1:  RansacFindBestLine(X, M, δ, smin)
    Input: X = (x0, ..., xn−1), a collection of n 2D sample points xi = (xi, yi); M, the number of iterations; δ, the max. point-to-line distance for inliers (see Fig. 12.3); smin, the min. "consensus" (number of supporting inliers).
    Returns the line L = ⟨A, B, C⟩ for the algebraic line with the maximum support (or nil if no such line could be found) and the set of associated inliers S.
2:    (smax, Lmax) ← (−1, nil)                        ▷ best score/line so far
3:    for i ← 1, ..., M do
4:      do                                            ▷ randomly pick two different points
5:        x0 ← RandomPick(X)
6:        x1 ← RandomPick(X)
7:      while (x0 = x1)
8:      L ← FitLine(x0, x1)                           ▷ see Eqn. (F.9), p. 863
9:      S ← CollectInliers(X, L, δ)                   ▷ count inliers only
10:     if |S| ≥ smin ∧ |S| > smax then
11:       (smax, Lmax) ← (|S|, L)
12:   if Lmax = nil then
13:     return (nil, nil)                             ▷ no acceptable line found
14:   else
15:     S ← CollectInliers(X, Lmax, δ)
16:     L′ ← FitLine(S)                               ▷ see Algs. 10.1-10.3
17:     return (L′, S)

18: CollectInliers(X, L, δ)                           ▷ collect all points close to line L
19:   let L = ⟨A, B, C⟩
20:   a ← √(A² + B²)
21:   S ← {}
22:   for all xi = (xi, yi) ∈ X do
23:     di ← (1/a)·(A·xi + B·yi + C)                  ▷ see Eqn. (10.12)
24:     if |di| ≤ δ then
25:       S ← S ∪ {xi}
26:   return S
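The following self-contained Java sketch implements the scheme of Alg. 12.1 for the single strongest line. It is illustrative only and not the imagingbook implementation; the final re-fit uses a closed-form orthogonal (total least-squares) line fit over the consensus set, standing in for the fitting algorithms of Chapter 10.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RansacLineDetector {
    private static final Random RND = new Random();

    /** Returns the algebraic line parameters (A, B, C) of the strongest line,
     *  or null if no line with at least sMin inliers was found. */
    public static double[] findBestLine(double[][] pts, int M, double delta, int sMin) {
        double[] bestLine = null;
        int bestScore = -1;
        for (int i = 0; i < M; i++) {
            int i0 = RND.nextInt(pts.length), i1;
            do { i1 = RND.nextInt(pts.length); } while (i1 == i0);   // two different points
            double[] L = lineThrough(pts[i0], pts[i1]);
            int score = collectInliers(pts, L, delta).size();
            if (score >= sMin && score > bestScore) {
                bestScore = score;
                bestLine = L;
            }
        }
        if (bestLine == null) return null;
        return fitLineOrthogonal(collectInliers(pts, bestLine, delta));  // final re-fit
    }

    /** Algebraic line A x + B y + C = 0 through two points. */
    static double[] lineThrough(double[] p, double[] q) {
        double A = q[1] - p[1], B = p[0] - q[0];
        double C = -(A * p[0] + B * p[1]);
        return new double[] { A, B, C };
    }

    /** All points whose orthogonal distance from line L is at most delta. */
    static List<double[]> collectInliers(double[][] pts, double[] L, double delta) {
        List<double[]> S = new ArrayList<>();
        double a = Math.hypot(L[0], L[1]);
        for (double[] p : pts) {
            double d = (L[0] * p[0] + L[1] * p[1] + L[2]) / a;
            if (Math.abs(d) <= delta) S.add(p);
        }
        return S;
    }

    /** Orthogonal (total least-squares) line fit: the line normal is the
     *  eigenvector of the scatter matrix for the smallest eigenvalue. */
    static double[] fitLineOrthogonal(List<double[]> S) {
        double mx = 0, my = 0;
        for (double[] p : S) { mx += p[0]; my += p[1]; }
        mx /= S.size(); my /= S.size();
        double sxx = 0, syy = 0, sxy = 0;
        for (double[] p : S) {
            double dx = p[0] - mx, dy = p[1] - my;
            sxx += dx * dx; syy += dy * dy; sxy += dx * dy;
        }
        double phi = 0.5 * Math.atan2(2 * sxy, sxx - syy);   // direction of the line
        double A = -Math.sin(phi), B = Math.cos(phi);        // unit normal vector
        double C = -(A * mx + B * my);
        return new double[] { A, B, C };
    }
}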



Alg. 12.2 RANSAC for detecting multiple lines. Procedure RansacFindBestLine() (see Alg. 12.1) is called repeatedly until no more acceptable line is found. In each iteration the points inside the line's consensus set S are removed from the data set X (line 6).

1:  RansacFindMultipleLines(X, M, δ, smin)
    Input: see Alg. 12.1. Returns the set of detected lines (ℒ).
2:    ℒ ← {}                                             ▷ set of detected lines
3:    (L, S) ← RansacFindBestLine(X, M, δ, smin)         ▷ see Alg. 12.1
4:    while L ≠ nil do
5:      ℒ ← ℒ ∪ {L}                                      ▷ collect line L
6:      X ← X \ S                                        ▷ remove all inliers from X
7:      (L, S) ← RansacFindBestLine(X, M, δ, smin)       ▷ S ⊆ X
8:    return ℒ



The example in Fig. 12.6 shows the detection of the four strongest lines in a binary test image. One complete RANSAC run is performed for each line. After a line has been detected, all inlier points are deleted from the data set, such that the same line will not be picked up twice. It may not hurt to remove additional points (at distance greater than δ) if lines are relatively wide (as in this example).

Fig. 12.6 RANSAC multi-line example. Complete view (left column), enlarged details (right column). Each line is detected in a separate RANSAC run (see Alg. 12.2). After a line is detected, the associated inliers are erased from the data set. This process continues until no more acceptable lines are found. Settings: minimum inlier count smin = 200, max. point distance δ = 2.0, RANSAC iterations M = 1000. Image size: 360 × 240, total number of points: 10219. Panels: (a) L0 (536 inliers), (b) L1 (412 inliers), (c) L2 (389 inliers), (d) L3 (290 inliers).



12.1.4 RANSAC Circle Detection

RANSAC circle detection follows the same principal scheme as line detection. In this case, k = 3 different points are required to instantiate the circle model (see Fig. 12.7).



Fig. 12.7 Geometry for RANSAC circle detection. Three points (red) are picked at random to instantiate the initial circle C = ⟨xc, yc, r⟩. All points xi with distance ri ∈ [r−δ, r+δ] from xc are considered inliers (green).



As shown in Fig. 12.5, this requires a significantly (ca. 10 times) larger number of random tries for a given inlier ratio and confidence level than with k = 2 (for lines). If the circle radius is known a priori, then only two initial points are needed (see Exercise 12.3).

Algorithm 12.3 gives a summary of the RANSAC circle detection process. As in Alg. 12.2, multiple circles are detected by performing a full RANSAC run (M iterations) for each circle and subsequently removing all inliers from the data set. In each RANSAC run (procedure RansacFindBestCircle()), points x0, x1, x2 are picked randomly from the data set X, which is repeated as long as any two points are identical (Alg. 12.3, lines 12-16).⁴ Note that the ordering of the points is irrelevant. The initial circle passing through the three sample points is calculated in closed form by function FitCircle3(), defined in Alg. 12.4.⁵ This function returns 'nil' if the three points happen to be collinear, i.e., no finite circle exists. Given a circle instance C = ⟨xc, yc, r⟩, the associated consensus set S contains all points xi ∈ X for which

    |dist(xi, C)| = |ri − r| ≤ δ,        (12.7)

with ri = ∥xi − xc∥ = √((xi − xc)² + (yi − yc)²) (see Fig. 12.7). The circle with the highest score is then re-fit to its consensus set by function FitCircleN() in line 26. This function is not explicitly defined here but could be replaced by any of the algebraic circle fits described in Sec. 11.1.2. Results are typically highly accurate since all involved points are inliers and thus close to the circle. Alternatively, a geometric circle fit may be used (see Sec. 11.1.3), with the initial circle as a starting solution, which is even more accurate, of course.

A typical example of RANSAC-based circle detection is shown in Fig. 12.8. The synthetic test image contains two actual circles with inlier ratio w ≈ 0.1, i.e., about 90% of the data points are outliers.

⁴ The chance of randomly picking two identical points is small but finite. Alternatively, one could omit this test and rely on the initial 3-point circle fit to fail (i.e., function FitCircle3() returning 'nil') if any two points are the same.
⁵ See also Sec. F.2.2 of the Appendix.



Alg. 12.3 RANSAC circle detection (Part 1). Procedure RansacFindBestCircle() is called repeatedly until no more acceptable circle is found. In each iteration the points inside the circle's consensus set S are removed from the data set X (line 6).

1:  RansacFindMultipleCircles(X, M, δ, smin)
    Input: X = (x0, ..., xn−1), a collection of n 2D sample points xi = (xi, yi); M, the number of iterations; δ, the max. point-to-circle distance for inliers; smin, the min. "consensus" (number of supporting inliers).
    Returns the set of detected circles (𝒞).
2:    𝒞 ← {}                                              ▷ set of detected circles
3:    (C, S) ← RansacFindBestCircle(X, M, δ, smin)        ▷ see line 9
4:    while C ≠ nil do
5:      𝒞 ← 𝒞 ∪ {C}                                       ▷ collect this circle
6:      X ← X \ S                                         ▷ remove all inliers from X
7:      (C, S) ← RansacFindBestCircle(X, M, δ, smin)      ▷ S ⊆ X
8:    return 𝒞

9:  RansacFindBestCircle(X, M, δ, smin)
    Input: see above. Returns the circle C = ⟨xc, yc, r⟩ with the maximum support (or nil if no such circle could be found) and the set of associated inliers S.
10:   (Cmax, smax) ← (nil, −1)                            ▷ strongest circle/score so far
11:   for i ← 1, ..., M do
12:     do                                                ▷ randomly pick 3 different points
13:       x0 ← RandomPick(X)
14:       x1 ← RandomPick(X)
15:       x2 ← RandomPick(X)
16:     while (x0 = x1 ∨ x0 = x2 ∨ x1 = x2)
17:     C ← FitCircle3(x0, x1, x2)                        ▷ see Alg. 12.4
18:     if C ≠ nil then                                   ▷ x0, x1, x2 are not collinear
19:       S ← CollectInliers(X, C, δ)                     ▷ count inliers only
20:       if |S| ≥ smin ∧ |S| > smax then
21:         (Cmax, smax) ← (C, |S|)
22:   if Cmax = nil then
23:     return (nil, ∅)                                   ▷ no acceptable circle found
24:   else
25:     S ← CollectInliers(X, Cmax, δ)
26:     C′ ← FitCircleN(S)                                ▷ algebraic circle fit, see Sec. 11.1.2
27:     return (C′, S)

28: CollectInliers(X, C, δ)                               ▷ collect all points close to circle C
29:   let C = ⟨xc, yc, r⟩
30:   S ← {}
31:   for all xi ∈ X do
32:     ri ← ∥xi − xc∥                                    ▷ distance of xi from circle center
33:     if |ri − r| ≤ δ then
34:       S ← S ∪ {xi}
35:   return S



According to Fig. 12.5, approximately 4000 independent random tries are required for the sample size k = 3 to achieve p = 0.99 detection probability. Only M = 1000 tries were used in this experiment. The minimum number of inliers was set to smin = 100, which resulted in a third (phantom) circle being detected (with 106 inliers).



Alg. 12.4 RANSAC circle detection (Part 2). Closed-form, exact circle fit to 3 points. 'nil' is returned if the specified points x0, x1, x2 are collinear.

1:  FitCircle3(x0, x1, x2)
    Input: x0, x1, x2, three points xi = (xi, yi) ∈ ℝ².
    Returns a circle that exactly fits the three input points or nil if the points are collinear.
2:    A ← x0·(y1 − y2) − y0·(x1 − x2) + x1·y2 − x2·y1
3:    if |A| < ε then                                     ▷ zero A?
4:      C ← nil                                           ▷ points x0, x1, x2 are collinear
5:    else
6:      R0 ← x0² + y0²
7:      R1 ← x1² + y1²
8:      R2 ← x2² + y2²
9:      B ← R0·(y2 − y1) + R1·(y0 − y2) + R2·(y1 − y0)
10:     C ← R0·(x1 − x2) + R1·(x2 − x0) + R2·(x0 − x1)
11:     D ← R0·(x2·y1 − x1·y2) + R1·(x0·y2 − x2·y0) + R2·(x1·y0 − x0·y1)
12:     xc ← −B/(2A)
13:     yc ← −C/(2A)
14:     r ← √(B² + C² − 4·A·D) / |2A|
15:     C ← ⟨xc, yc, r⟩
16:   return C

Fig. 12.8 RANSAC circle detection example. Original point set with ca. 90% outliers (a) and detected circles (b-d). After a circle was detected, the associated inliers were erased from the data set. Note that the initial circle fits (red) are very close to the final fits (blue) and thus only partially visible. Settings: minimum inlier count smin = 100, max. point distance δ = 2.0, RANSAC iterations M = 1000. Image size: 200 × 200, total number of points: 1996. Panels: (a) original, (b) circle 1 (284 inliers), (c) circle 2 (181 inliers), (d) circle 3 (106 inliers).
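A direct Java version of the 3-point circle fit of Alg. 12.4 (again just an illustrative sketch):

/** Exact circle through three points; returns {xc, yc, r} or null if the
 *  points are (nearly) collinear. */
static double[] fitCircle3(double[] p0, double[] p1, double[] p2) {
    final double EPS = 1e-12;
    double A = p0[0] * (p1[1] - p2[1]) - p0[1] * (p1[0] - p2[0])
             + p1[0] * p2[1] - p2[0] * p1[1];
    if (Math.abs(A) < EPS)
        return null;                                   // collinear points
    double r0 = p0[0] * p0[0] + p0[1] * p0[1];
    double r1 = p1[0] * p1[0] + p1[1] * p1[1];
    double r2 = p2[0] * p2[0] + p2[1] * p2[1];
    double B = r0 * (p2[1] - p1[1]) + r1 * (p0[1] - p2[1]) + r2 * (p1[1] - p0[1]);
    double C = r0 * (p1[0] - p2[0]) + r1 * (p2[0] - p0[0]) + r2 * (p0[0] - p1[0]);
    double D = r0 * (p2[0] * p1[1] - p1[0] * p2[1])
             + r1 * (p0[0] * p2[1] - p2[0] * p0[1])
             + r2 * (p1[0] * p0[1] - p0[0] * p1[1]);
    double xc = -B / (2 * A);
    double yc = -C / (2 * A);
    double r  = Math.sqrt(B * B + C * C - 4 * A * D) / (2 * Math.abs(A));
    return new double[] { xc, yc, r };
}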



351



12 Detecting Geometric Primitives



12.1.5 RANDSAC Ellipse Detection Five sample points are required to uniquely specify an ellipse, which means that the number of random tries needed for a given confidence level is significantly higher than for finding circles. For example, according to Eqn. (12.4), about ca. 875 tries are needed to achieve a confidence of p = 0.99 when w = 35% of the points are inliers. However, this soars to 460,000 tries if the inlier ratio is reduced to 10% (see Fig. 12.5 for k = 5)! Algorithm The RANSAC algorithm for ellipses is almost identical to the one for circles (see Alg. 12.3) and needs not to be repeated in full detail. The only differences are the initial ellipse instantiation from five randomly sampled points and the final ellipse fit to the set of collected inliers. Function FitEllipse5() in Alg. 12.5 shows one approach for calculating the initial 5-point fit by solving a linear system of equations. 6 Note that (unless the five random points are collinear) there is always a solution, but the resulting parameters may not describe an ellipse but some other quadratic curve. This happens frequently and in this case the random pick must be discarded. Alternatively, one could apply an algebraic ellipse fit (see Sec. 11.2.1) to the five sample points. For example, the stable version of the Fitzgibbon fit (Alg. 11.6) and the optimized Taubin fit (Alg. 11.8) are good candidates, since both can handle 5-point ellipse fits. However, the resulting algebraic fits are only useful if they are exact, i.e., if the residual errors are zero. Given an initial ellipse, the set of inliers must be determined. Because all points in the (typically large) data set must be tested, this requires the efficient calculation of many point-to-ellipse distances, as discussed in Sections 11.2.2 and 11.2.3. Since the true geometric distance (i.e., the Euclidean distance to the closest ellipse point, see p. 323) can only be calculated iteratively, approximate distances are typically used. The Sampson distance (see p. 333) is a popular choice, since it can be calculated efficiently and directly from the algebraic ellipse parameters. It yields good distance estimates only for points close to the ellipse, but this is where inliers are typically found anyways. A superb alternative is the confocal conic distance (see p. 334), which works accurately for points at any position, including the inside of the ellipse. Its calculation is similarly efficient as the Sampson distance but requires geometric ellipse parameters (xc , yc , ra , rb , θ), i.e., a one-time conversion of the ellipse.7 Examples The point set used for the example shown in Fig. 12.9 contains two ellipses, each covering about 35% of the total points, i.e., w ≈ 0.35. 1000 iterations were used in each RANSAC loop and the associated inliers were removed from the point set for each detected ellipse. 6 7






See also Sec. F.3.3 of the Appendix. See Sec. F.3.2 of the Appendix for the conversion between algebraic and geometric ellipse parameters.



1: FitEllipse5(x0, x1, x2, x3, x4)
Input: x0, . . . , x4, five points xi = (xi, yi) ∈ R². Returns an ellipse that exactly fits the five input points, or nil if no ellipse fit is possible.



















M ← matrix with rows (xi²−yi², xi·yi, xi, yi, 1), for i = 0, . . . , 4;   b ← (−y0², . . . , −y4²)ᵀ
E ← nil
if M is not singular then                ▷ e.g., if det(M) ≠ 0
    (A, B, D, E, F) ← solve(M·q = b)     ▷ Sec. B.8.1 (p. 823)
    C ← 1 − A
    if (4·A·C − B²) > 0 then             ▷ check if this is an ellipse
        E ← ⟨A, B, C, D, E, F⟩           ▷ new algebraic ellipse
return E
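For concreteness, the following plain-Java sketch mirrors FitEllipse5(): it sets up the 5 × 5 system, solves it with a small Gaussian-elimination routine (standing in for the solve() call of Alg. 12.5), and applies the ellipse test 4·A·C − B² > 0. All names are illustrative; this is not the imagingbook implementation.

  public class FitEllipse5Demo {

      // returns {A,B,C,D,E,F} or null if the points do not define an ellipse
      static double[] fitEllipse5(double[][] pts) {   // pts = 5 points {x, y}
          double[][] M = new double[5][5];
          double[] b = new double[5];
          for (int i = 0; i < 5; i++) {
              double x = pts[i][0], y = pts[i][1];
              M[i][0] = x * x - y * y;   // coefficient of A
              M[i][1] = x * y;           // coefficient of B
              M[i][2] = x;               // coefficient of D
              M[i][3] = y;               // coefficient of E
              M[i][4] = 1;               // coefficient of F
              b[i] = -y * y;
          }
          double[] q = solve(M, b);      // q = (A, B, D, E, F)
          if (q == null)
              return null;               // singular system (e.g., collinear points)
          double A = q[0], B = q[1], D = q[2], E = q[3], F = q[4];
          double C = 1 - A;
          if (4 * A * C - B * B <= 0)
              return null;               // fitted conic is not an ellipse
          return new double[] {A, B, C, D, E, F};
      }

      // Gaussian elimination with partial pivoting (returns null if singular)
      static double[] solve(double[][] M, double[] b) {
          int n = b.length;
          for (int k = 0; k < n; k++) {
              int piv = k;
              for (int i = k + 1; i < n; i++)
                  if (Math.abs(M[i][k]) > Math.abs(M[piv][k])) piv = i;
              if (Math.abs(M[piv][k]) < 1e-12) return null;
              double[] tr = M[k]; M[k] = M[piv]; M[piv] = tr;
              double tb = b[k]; b[k] = b[piv]; b[piv] = tb;
              for (int i = k + 1; i < n; i++) {
                  double f = M[i][k] / M[k][k];
                  for (int j = k; j < n; j++) M[i][j] -= f * M[k][j];
                  b[i] -= f * b[k];
              }
          }
          double[] x = new double[n];
          for (int i = n - 1; i >= 0; i--) {
              double s = b[i];
              for (int j = i + 1; j < n; j++) s -= M[i][j] * x[j];
              x[i] = s / M[i][i];
          }
          return x;
      }
  }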



[Fig. 12.9 panels: (a) original, (b) Ellipse 1 (242 inliers), (c) Ellipse 2 (186 inliers)]



Alg. 12.5 Exact ellipse fit to 5 points. Function FitEllipse5() calculates and returns the parameters A, . . . , F of the algebraic ellipse passing through the five specified points x0, . . . , x4. ‘nil’ is returned if two or more points are duplicates or the fitted curve is not an ellipse.



Fig. 12.9 RANSAC ellipse detection (example 1). “Best” three ellipses (low-noise data). Original point set with ca. 35% inliers (a), the first 3 detected ellipses (b–d). After an ellipse was detected, the associated inliers were erased from the data set. Settings: minimum inlier count smin = 15, max. point distance δ = 2.0, RANSAC iterations M = 1000. Image size: 360 × 240, total number of points: 604.



[Fig. 12.9 panels: (d) Ellipse 3 (18 inliers); legend: random draw, initial fit (5 pts), final fit, inliers]



In the second example (Fig. 12.10) the inlier ratio is reduced to about 10%, which means that a significantly larger number of random tries is required in theory (460,000 for p = 0.99). Only M = 100,000 tries were actually performed in each RANSAC detection run, not least to demonstrate possible failure modes. The left column of Fig. 12.10 shows a typical sequence of successful detection results. The strongest and second-strongest ellipses are detected first, followed by a number of “ghost” ellipses with clearly smaller but still considerable inlier counts. In the other example (right column of Fig. 12.10) the second-largest ellipse remains undetected, since another strong ellipse is found first which “consumes” a large section of the actual ellipse (Fig. 12.10(d, f)). This could be avoided (with high probability) by increasing the number of random tries to the prescribed value. Note, however, that even with M = 500,000 iterations the number of tries only amounts to about 1.36·10⁻⁹ of the size of the complete search space






Fig. 12.10 RANSAC ellipse detection (example 2). “Best” five ellipses (high-noise data). Good result (a, c, e, g, i) and not so good result (b, d, f, h, j). For each detected ellipse, the associated inliers are removed from the point set. Results in the left column are as desired. Since only two “real” ellipses are contained in the data, the consensus sets for ellipses 3–5 (e, g, i) are small. In the right column, detection of the second ellipse fails in (d), leaving many inliers unremoved, only to be picked up by the following ellipse (f). The remaining two ellipses (h, j) are again incidental with low support. Settings: minimum inlier count smin = 50, max. point distance δ = 2.0, RANSAC iterations M = 100,000. Image size: 360 × 240, total number of points: 2099.



[Fig. 12.10 panels — Good (left column): (a) 238, (c) 160, (e) 106, (g) 94, (i) 86 inliers; Not so good (right column): (b) 266, (d) 136, (f) 133, (h) 93, (j) 88 inliers; legend: random draw, initial fit (5 pts), final fit, inliers]



of ca. 0.338·10¹⁵ combinations, i.e., only about 1 in 1,000 million possible 5-point picks is analyzed for detection.

12.1.6 RANSAC Extensions and Applications

RANSAC is a very general technique and is therefore used in many applications where the parameters of a known (or assumed) model must be determined from observed data containing a large proportion of outliers. If the approximate inlier ratio is known, the probability of successful detection can easily be estimated. However, given that only a fixed number of random tries (iterations) is performed in a typical RANSAC process, there is never a 100% guarantee of success, i.e., there is always a slight chance that some existing instance remains undetected. One disadvantage of RANSAC is that the required iteration count grows quickly with the number of data samples needed to instantiate the model (see Fig. 12.5), which makes RANSAC inefficient if the inlier ratio is low. Many modifications and improvements of the basic RANSAC method have been proposed to overcome its inherent limitations. As a consequence, some modern and specialized implementations of RANSAC have become considerably more complex than the basic algorithm described here (see, e.g., [22] for a good overview).
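The basic procedure described in this section can be summarized by the following generic Java skeleton. The Model/ModelFitter interfaces and all names are hypothetical placeholders (not library classes), and the final re-fit to the collected inliers is omitted for brevity:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;
  import java.util.Random;

  interface Model<P> {
      double distance(P point);                 // point-to-model distance
  }

  interface ModelFitter<P, M extends Model<P>> {
      int sampleSize();                         // k = points needed per draw
      M fit(List<P> points);                    // may return null (degenerate draw)
  }

  class Ransac<P, M extends Model<P>> {
      private final Random rnd = new Random();

      M detect(List<P> data, ModelFitter<P, M> fitter,
               int iterations, double maxDist, int minInliers) {
          M best = null;
          int bestCount = 0;
          for (int m = 0; m < iterations; m++) {
              List<P> draw = randomDraw(data, fitter.sampleSize());
              M candidate = fitter.fit(draw);
              if (candidate == null)
                  continue;                     // e.g., degenerate sample
              int count = 0;
              for (P p : data)
                  if (candidate.distance(p) <= maxDist) count++;
              if (count >= minInliers && count > bestCount) {
                  best = candidate;
                  bestCount = count;
              }
          }
          return best;                          // final re-fit to inliers omitted
      }

      private List<P> randomDraw(List<P> data, int k) {
          List<P> copy = new ArrayList<>(data);
          Collections.shuffle(copy, rnd);
          return copy.subList(0, k);
      }
  }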



12.2 The Hough Transform

The method proposed by Paul Hough, originally published as a US patent [133] and often referred to as the “Hough transform” (HT), is a general approach to localizing any shape that can be defined parametrically within a distribution of points [76, 139]. An interesting account of the Hough transform's genesis can be found in [123]. The Hough transform is perhaps most often used for detecting straight line segments in edge maps. A line segment in 2D can be described with two real-valued parameters using the classic slope-intercept form
y = k·x + d,     (12.8)
where k is the slope and d the intercept, that is, the height at which the line intercepts the y axis (Fig. 12.11). A line segment that passes through two given edge points p1 = (x1, y1) and p2 = (x2, y2) must satisfy the conditions

Fig. 12.11 Two points, p1 and p2, lie on the same line when y1 = k·x1 + d and y2 = k·x2 + d for a particular pair of parameters k and d.









Fig. 12.12 A set of lines passing through an image point. For all possible lines Lj passing through the point p0 = (x0, y0), the equation y0 = kj·x0 + dj holds for appropriate values of the parameters kj, dj.



y1 = k·x1 + d   and   y2 = k·x2 + d,     (12.9)



for k, d ∈ R. The goal is to find values of k and d such that as many edge points as possible lie on the line they describe; in other words, the line that fits the most edge points. But how can we determine the number of edge points that lie on a given line? One possibility is to exhaustively “draw” every possible line into the image while counting the number of points that lie exactly on each of them. Even though the discrete nature of pixel images (with only a finite number of different lines) makes this approach possible in theory, generating such a large number of lines is infeasible in practice.

12.2.1 Parameter Space

The Hough transform approaches the problem from another direction. It examines all the possible lines that run through a single given point in the image. Every line Lj = ⟨kj, dj⟩ that runs through a point p0 = (x0, y0) must satisfy the condition
Lj : y0 = kj·x0 + dj



(12.10)



for suitable values kj , dj . Equation 12.10 is underdetermined and the possible solutions for kj , dj correspond to an infinite set of lines passing through the given point p0 (Fig. 12.12). Note that for a given kj , the solution for dj in Eqn. (12.10) is dj = −x0 · kj + y0 ,



(12.11)



which is another equation for a line, where now kj , dj are the variables and x0 , y0 are the constant parameters of the equation. The solution set {(kj , dj )} of Eqn. (12.11) describes the parameters of all possible lines Lj passing through the image point p0 = (x0 , y0 ). For an arbitrary image point pi = (xi , yi ), Eqn. (12.11) describes the line Mi : d = −xi · k + yi (12.12)






with the parameters −xi , yi in the so-called parameter or Hough space, spanned by the coordinates k, d. The relationship between (x, y) image space and (k, d) parameter space can be summarized as follows:



[Fig. 12.13: (a) x/y image space, (b) k/d parameter space]



Image Space (x, y)                 Parameter Space (k, d)
Point  pi = (xi, yi)        ↔      Line  Mi : d = −xi·k + yi
Line   Lj : y = kj·x + dj   ↔      Point qj = (kj, dj)



Fig. 12.13 Relationship between image space and parameter space. The parameter values for all possible lines passing through the image point pi = (xi, yi) in image space (a) lie on a single line Mi in parameter space (b). This means that each point qj = (kj, dj) in parameter space corresponds to a single line Lj in image space. The intersection of the two lines M1, M2 at the point q12 = (k12, d12) in parameter space indicates that a line L12 through the two points p1 and p2 exists in the image space.



Each image point pi and its associated line bundle correspond to exactly one line Mi in parameter space. Therefore we are interested in those places in the parameter space where lines intersect. The example in Fig. 12.13 illustrates how the lines M1 and M2 intersect at the position q12 = (k12, d12) in the parameter space, which means (k12, d12) are the parameters of the line in image space that runs through both image points p1 and p2. The more lines Mi that intersect at a single point in the parameter space, the more image-space points lie on the corresponding line in the image! In general, we can state: If N lines intersect at position (k′, d′) in parameter space, then N image points lie on the corresponding line y = k′x + d′ in image space.

12.2.2 Accumulator Map

Finding the dominant lines in the image can now be reformulated as finding all the locations in parameter space where a significant number of lines intersect. This is basically the goal of the HT. In order to compute the HT, we must first decide on a discrete representation of the continuous parameter space by selecting an appropriate step size for the k and d axes. Once we have selected step sizes for the coordinates, we can represent the space naturally using a 2D array. Since the array will be used to keep track of the number of times parameter-space lines intersect, it is called an “accumulator” array. Each parameter-space line is painted into the accumulator array and the cells through which it passes are incremented, so that ultimately each cell accumulates the total number of lines that intersect at that cell (Fig. 12.14).

12.2.3 A Better Line Representation

The line representation in Eqn. (12.8) is not used in practice because for vertical lines the slope is infinite, that is, k = ∞. A more practical






Fig. 12.14 The accumulator map is a discrete representation of the parameter space (k, d). For each image point found (a), a discrete line in the parameter space (b) is drawn. This operation is performed additively so that the values of the array through which the line passes are incremented by 1. The value at each cell of the accumulator array is the number of parameter space lines that intersect it (in this case 2). [panels: (a) image space, (b) accumulator map]



Fig. 12.15 Representation of lines in 2D. In the common k, d representation (a), vertical lines pose a problem because k = ∞. The Hessian normal form (b) avoids this problem by representing a line by its angle θ and distance r from the origin.



[Fig. 12.15 panels: (a) y = k·x + d (vertical line: k = ∞, d undefined), (b) x·cos(θ) + y·sin(θ) = r]



representation is the Hessian normal form (HNF) for representing lines,
x·cos(θ) + y·sin(θ) = r,     (12.13)
which does not exhibit such singularities and also provides a natural linear quantization for its parameters, the angle θ and the radius r (Fig. 12.15). With the HNF representation, the parameter space is defined by the coordinates θ, r, and a point p = (x, y) in image space corresponds to the relation
r(θ) = x·cos(θ) + y·sin(θ),     (12.14)
for angles in the range 0 ≤ θ < π (see Fig. 12.16). Thus, for a given image point p, the associated radius r is simply a function of the angle θ. If we use the center of the image (of size M × N),
xr = (xr, yr)ᵀ = 0.5·(M, N)ᵀ,     (12.15)
as the reference point for the x/y image coordinates, then it is possible to limit the range of the radius to half the diagonal of the image, that is,






The Hessian normal form is a normalized version of the general (“algebraic”) line equation Ax + By + C = 0, with A = cos(θ), B = sin(θ), and C = −r. See also Sec. 10.1.4 and Sec. F.1.1 of the Appendix.



−rmax ≤ r(θ) ≤ rmax,   with   rmax = 0.5·√(M² + N²).     (12.16)



We can see that the function r(θ) in Eqn. (12.14) is the sum of a cosine and a sine function of θ, each being weighted by the x and y coordinates of the image point (assumed to be constant for the moment). The result is again a sinusoidal function whose magnitude and phase depend only on the weights (coefficients) x, y. Thus, with the Hessian parametrization θ/r, an image point (x, y) does not create a straight line in the accumulator map A(i, j) but a unique sinusoidal curve, as shown in Fig. 12.16. Again, each image point adds a curve to the accumulator, and each resulting cluster point corresponds to a dominant line in the image with a proportional number of points on it.



[Fig. 12.16: (a) image space (x/y) with lines a, . . . , d, (b) parameter space (θ/r)]



12.2.4 Hough Algorithm






Fig. 12.16 Image space and parameter space using the HNF representation. The image (a) of size M × N contains four lines a, . . . , d. Each point on an image line creates a sinusoidal curve in the θ/r parameter space (b), and the corresponding line parameters are indicated by the clearly visible cluster points in the accumulator map. The reference point xr for the x/y coordinates lies at the center of the image. The line angles θi are in the range [0, π) and the associated radii ri are in [−rmax, rmax] (the length rmax is half of the image diagonal). For example, the angle θa of line a is approximately π/3, with the (positive) radius ra ≈ 0.4·rmax. Note that, with this parametrization, line c is at angle θc ≈ 2π/3 and negative radius rc ≈ −0.4·rmax.



The fundamental Hough algorithm using the HNF line representation (Eqn. (12.13)) is given in Alg. 12.6. The first stage creates a 2D accumulator array and then iterates over the specified point set X to fill it. The resulting increments are
dθ = π/m   and   dr = 2·rmax/n = √(M² + N²)/n     (12.17)
for the angle θ and the radius r, respectively, so that the n radial cells cover the full range −rmax, . . . , rmax. The discrete indices of the accumulator cells are denoted i and j, with j0 = n ÷ 2 as the center index (for r = 0). For each input point xk = (xk, yk), a sinusoidal curve is added to the accumulator map by stepping over the discrete angles θi = θ0, . . . , θm−1, calculating the corresponding radius






Note that, in Fig. 12.16(a), the positive direction of the y-coordinate runs upwards (unlike our usual convention for image coordinates) to stay in line with the previous illustrations (and high school geometry). In practice, the consequences are minor: only the rotation angle runs in the opposite direction and thus the accumulator image in Fig. 12.16(b) was mirrored horizontally for proper display. The frequent (and expensive) calculation of cos(θi) and sin(θi) in Eqn. (12.18) and Alg. 12.6 (line 13) can be easily avoided by initially tabulating






rk,i = (xk − xr)·cos(θi) + (yk − yr)·sin(θi)     (12.18)
(see Eqn. (12.14)) and its discrete index
j = j0 + round(rk,i / dr),     (12.19)



and subsequently incrementing the accumulator cell A(i, j) by one (see Alg. 12.6, lines 9–15). Conversely, for a given accumulator position (i, j), the associated line parameters θ and r can be calculated as
θ = i·dθ   and   r = (j − j0)·dr.     (12.20)



In the second stage of Alg. 12.6, the accumulator array is searched for local peaks above the specified minimum score smin. For each detected peak, a line object is created of the form
L = ⟨θ, r, xr, s⟩,     (12.21)



consisting of the angle θ, the radius r, the reference point xr, and the corresponding accumulator score s. The set L of detected lines is returned.

Figure 12.17 shows the result of applying the Hough transform to a noisy binary image with four embedded lines. These lines show up clearly as cluster points in the corresponding accumulator map in Fig. 12.17(b). Figure 12.17(c) shows the reconstruction of these lines from the extracted parameters. In this example, the resolution of the discrete parameter space is set to 256 × 256.

Processing the Accumulator Array

The reliable detection and precise localization of peaks in the accumulator map A(i, j) is not a trivial task. As can readily be seen in Fig. 12.17(b), even in the case where the lines in the image are geometrically “straight”, the parameter-space curves associated with them do not intersect at exactly one point in the accumulator array; rather, their intersection points are distributed within a small area. This is primarily caused by the rounding errors introduced by the discrete coordinate grid used for the accumulator array. Since the maximum points are really maximum areas in the accumulator array, simply traversing the array and returning the positions of its largest values is not sufficient. Since this is a critical step in the algorithm, we examine two different approaches below (see Fig. 12.18).

Approach A: Thresholding

First the accumulator is thresholded by setting all accumulator values A(i, j) < ta to 0. The resulting scatterings of points, or point clouds, are first coalesced into regions (Fig. 12.18(b)), using a technique such as a morphological closing operation (see Sec.






the function values for all m possible angles θi = θ0, . . . , θm−1, which should yield a significant performance gain. The same test image is used with the RANSAC algorithm in Sec. 12.1.2. Note that drawing a straight line given in Hessian normal form is not really a trivial task (see Exercises 12.1–12.2 for details).
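Following the tabulation idea mentioned above, the accumulator-filling stage can be sketched in plain Java as follows. The names are illustrative only; the radial bin width is chosen such that the n bins cover the full range −rmax, . . . , rmax:

  // Fill a Hough accumulator A[m][n] for a set of 2D points, using
  // precomputed cos/sin tables to avoid repeated trigonometric calls.
  // pts[k] = {x, y}; M, N = image width/height; m, n = accumulator size.
  static int[][] fillAccumulator(double[][] pts, int M, int N, int m, int n) {
      double xr = 0.5 * M, yr = 0.5 * N;               // reference point (image center)
      double dTheta = Math.PI / m;                     // angular step size
      double rMax = 0.5 * Math.sqrt((double) (M * M + N * N));
      double dR = 2 * rMax / n;                        // n radial bins cover -rMax ... +rMax
      int j0 = n / 2;                                  // accumulator index for r = 0

      double[] cosT = new double[m], sinT = new double[m];
      for (int i = 0; i < m; i++) {                    // tabulate cos/sin once
          cosT[i] = Math.cos(dTheta * i);
          sinT[i] = Math.sin(dTheta * i);
      }

      int[][] A = new int[m][n];
      for (double[] p : pts) {
          double x = p[0] - xr, y = p[1] - yr;         // shift to reference point
          for (int i = 0; i < m; i++) {
              double r = x * cosT[i] + y * sinT[i];    // Eqn. (12.14)
              int j = j0 + (int) Math.round(r / dR);   // Eqn. (12.19)
              if (j >= 0 && j < n)
                  A[i][j]++;                           // increment accumulator cell
          }
      }
      return A;
  }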



1: HoughTransformLines(X, M, N, m, n, smin)
   Input: X = (x0, . . .), a collection of 2D points; M, N, width and height of the image plane; m, n, angular/radial accumulator steps; smin, minimum line score (points on the line). Returns a sequence L of detected lines.

   Step 1 – set up and fill the Hough accumulator:
2:  xr = (xr, yr) ← 0.5·(M, N)               ▷ reference point xr (image center)
3:  dθ ← π/m                                  ▷ angular step size
4:  dr ← √(M² + N²)/n                         ▷ radial step size (= 2·rmax/n)
5:  j0 ← n ÷ 2                                ▷ map index for r = 0
6:  Create map A : [0, m−1] × [0, n−1] ↦ Z    ▷ accumulator
7:  for all accumulator cells (i, j) do
8:      A(i, j) ← 0                           ▷ initialize accumulator
9:  for all xk = (xk, yk) ∈ X do
10:     (x, y) ← (xk − xr, yk − yr)           ▷ shift to reference point
11:     for i ← 0, . . . , m−1 do             ▷ angular coordinate i
12:         θ ← dθ · i                        ▷ angle, 0 ≤ θ < π
13:         r ← x·cos(θ) + y·sin(θ)           ▷ see Eqn. (12.14)
14:         j ← j0 + round(r/dr)              ▷ radial coordinate j
15:         A(i, j) ← A(i, j) + 1             ▷ increment A(i, j)

    Step 2 – extract the most dominant lines:
16: L ← ()                                    ▷ start with empty sequence of lines
17: for all accumulator cells (i, j) do       ▷ collect local maxima
18:     if (A(i, j) ≥ smin) ∧ IsLocalMax(A, i, j) then
19:         θ ← i · dθ                        ▷ angle θ
20:         r ← (j − j0) · dr                 ▷ radius r
21:         s ← A(i, j)                       ▷ score for this line
22:         L ← ⟨θ, r, xr, s⟩                 ▷ create a new line L
23:         L ← L ∪ {L}                       ▷ add line L to sequence L
24: return L



Alg. 12.6 Hough algorithm for detecting straight lines. The algorithm returns a sorted list of straight lines of the form L = ⟨θ, r, xr, s⟩ for the point set X. The resolution of the discrete Hough accumulator map (and thus the step size for the angle and radius) is specified by the parameters m and n, respectively. smin is the minimum line score, i.e., the minimum number of points on any detected line. The function IsLocalMax() used in line 18 is the same as in Alg. 6.1 (see p. 153).



7.3.2). Next the remaining regions must be localized, for instance using the region-finding technique from Sec. 8.1, and then each region's centroid (see Sec. 8.5) can be utilized as the (noninteger) coordinates of the potential image-space line. Often the sum of the accumulator's values within a region is used as a measure of the strength (number of image points) of the line it represents.

Approach B: Nonmaximum Suppression

In this method, local maxima in the accumulator array are found by suppressing nonmaximal values. This is carried out by determining, for every accumulator cell A(i, j), whether its value is higher than the values of all of its neighboring cells. If this is the case, the value remains unchanged; otherwise it is set to 0 (Fig. 12.18(c)). The (integer) coordinates of the remaining peaks are potential line parameters, and their respective heights correlate with the strength of the image-space line they represent. This method can be used in conjunction with a threshold operation to reduce the number of



Nonmaximum suppression is also used in Sec. 6.2.4 for isolating corner points.






Fig. 12.17 Hough transform for straight lines. The dimensions of the original image (a) are 360 × 240 pixels, so the maximal radius (measured from the image center) is rmax ≈ 216. For the parameter space (b), a step size of 256 is used for both the angle θ = 0, . . . , π (horizontal axis) and the radius r = −rmax, . . . , rmax (vertical axis). The four (dark) clusters in (b) surround the maximum values in the accumulator array, and their parameters correspond to the four lines in the original image. Intensities are shown inverted in all images to improve legibility.






candidate points that must be considered. The result for Fig. 12.17(a) is shown in Fig. 12.18(d).
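Approach B can be sketched in plain Java as follows (illustrative names only). Note that this simple version ignores the circular treatment of the θ axis discussed under “Mind Those Vertical Lines!” below:

  // Non-maximum suppression on a Hough accumulator A[m][n].
  // A cell survives only if its value is >= smin and strictly greater than
  // all of its existing 8-neighbors; all other cells are set to 0.
  static int[][] suppressNonMaxima(int[][] A, int smin) {
      int m = A.length, n = A[0].length;
      int[][] out = new int[m][n];
      for (int i = 0; i < m; i++) {
          for (int j = 0; j < n; j++) {
              int v = A[i][j];
              if (v < smin) continue;
              boolean isMax = true;
              for (int di = -1; di <= 1 && isMax; di++) {
                  for (int dj = -1; dj <= 1; dj++) {
                      if (di == 0 && dj == 0) continue;
                      int ii = i + di, jj = j + dj;
                      if (ii < 0 || ii >= m || jj < 0 || jj >= n) continue;
                      if (A[ii][jj] >= v) { isMax = false; break; }
                  }
              }
              if (isMax) out[i][j] = v;   // keep local maximum
          }
      }
      return out;
  }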






Mind Those Vertical Lines!

Special consideration should be given to vertical lines (once more!) when processing the contents of the accumulator map. The parameter pairs for these lines lie near θ = 0 and θ = π, at the left and right borders of the accumulator map, respectively (see Fig. 12.16(b)). Thus, to locate peak clusters in this part of the parameter space, the horizontal coordinate along the θ axis must be treated circularly, that is, modulo m. However, as can be seen in Fig. 12.16(b), the sinusoidal traces in the parameter space do not continue smoothly at the transition θ = π → 0 but are vertically mirrored! Evaluating such neighborhoods near the borders of the parameter space thus requires special treatment of the vertical (r) accumulator coordinate. This particular situation is illustrated in Fig. 12.19. The test image in Fig. 12.19(a) contains a perfectly horizontal line L1 (with θ = π/2) and a vertical line L2 (with θ = 0). Figure 12.19(b) shows the associated accumulator space, this time extended over the complete period 0, . . . , 2π (i.e., twice as large in the θ direction as Fig. 12.16(b)). We see that the content in both halves of the accumulator is exactly the same, but mirrored in the vertical direction! The accumulator peak corresponding to the horizontal line L1 appears at π/2 and again at 3π/2. This peak is easy to detect in either half of the accumulator, since it is at a safe distance from any boundaries. The vertical line L2, however, shows up at three places in the extended accumulator, at θ = 0, π, and again at 2π, i.e., exactly at the accumulator boundaries! This peak is difficult to detect if only half of the accumulator is analyzed, due to the mirroring effect. While









Fig. 12.18 Finding local maximum values in the accumulator array. Original distribution of the values in the Hough accumulator (a). Approach A: Threshold operation using 50% of the maximum value (b). The remaining regions represent the four dominant lines in the image, and the coordinates of their centroids are a good approximation to the line parameters. Approach B: Using non-maximum suppression results in a large number of local maxima (c) that must then be reduced using a threshold operation (d).






Fig. 12.19 Detection of vertical lines is critical. Test image (a) and extended Hough accumulator (b). The two halves of the extended accumulator hold the same content but are vertically mirrored. Accumulator peaks for the vertical line L2 show up in three different places (red circles): at θ = 0, π and 2π. If only one half of the accumulator is analyzed, these peaks are difficult to locate.






it is not necessary to actually calculate the extended accumulator for proper peak detection (both halves are redundant), great care must be taken to select the right neighborhoods near the left and right borders (i.e., for θ = 0 and θ = π). Otherwise vertical lines are likely to get lost.

12.2.5 Hough Transform Extensions

So far, we have presented the Hough transform only in its most basic formulation. The following is a list of some of the more common methods of improving and refining the method.






Modified Accumulation

The purpose of the accumulator map is to locate the intersections of multiple 2D curves. Due to the discrete nature of the image and accumulator coordinates, rounding errors usually cause the parameter curves not to intersect in a single accumulator cell, even when the associated image lines are exactly straight. A common remedy is, for a given angle θ = i·dθ (Alg. 12.6), to increment not only the main accumulator cell A(i, j) but also the neighboring cells A(i, j−1) and A(i, j+1), possibly with different weights. This makes the Hough transform more tolerant of inaccurate point coordinates and rounding errors.

Considering Edge Strength and Orientation

Until now, the raw data for the Hough transform was typically an edge map that was interpreted as a binary image with ones at potential edge points. Yet edge maps contain additional information, such as the edge strength E(u, v) and local edge orientation Φ(u, v) (see Sec. 5.3), which can be used to improve the results of the HT. The edge strength E(u, v) is especially easy to take into consideration. Instead of incrementing visited accumulator cells by 1, add the strength of the respective edge, that is,
A(i, j) ← A(i, j) + E(u, v).



(12.22)



In this way, strong edge points will contribute more to the accumulated values than weak ones (see also Exercise 12.9). The local edge orientation Φ(u, v) is also useful for limiting the range of possible orientation angles for the line at (u, v). The angle Φ(u, v) can be used to increase the efficiency of the algorithm by reducing the number of accumulator cells to be considered along the θ axis. Since this also reduces the number of irrelevant “votes” in the accumulator, it increases the overall sensitivity of the Hough transform (see, e.g., [147, p. 483]).

Bias Compensation

Since the value of a cell in the Hough accumulator represents the number of image points falling on a line, longer lines naturally have higher values than shorter lines. This may seem like an obvious point to make, but consider when the image only contains a small section of a “long” line. For instance, if a line only passes through the corner of an image, then the cells representing it in the accumulator array will naturally have lower values than those of a “shorter” line that lies entirely within the image (Fig. 12.20). It follows that if we only search the accumulator array for maximal values, we are likely to completely miss short line segments. One way to compensate for this inherent bias is to compute for each accumulator entry A(i, j) the maximum number of image points Amax(i, j) possible for a line with the corresponding parameters and then normalize the result, for example, in the form
A(i, j) ← A(i, j) / max(1, Amax(i, j)).     (12.23)






Fig. 12.20 Hough transform bias problem. When an image represents only a finite section of an object, then those lines nearer the center (smaller r values) will have higher values than those farther away (larger r values). As an example, the maximum value of the accumulator for line a will be higher than that of line b.






The normalization map Amax(i, j) can be determined analytically (by calculating the intersecting length of each line) or by simulation; for example, by computing the Hough transform of an image with the same dimensions in which all pixels are edge pixels, or by using a random image in which the pixels are uniformly distributed.

Line Endpoints

Our simple version of the Hough transform determines the parameters of the lines in the image but not their endpoints. These could be found in a subsequent step by determining which image points belong to any detected line (e.g., by applying a threshold to the perpendicular distance between the ideal line, defined by its parameters, and the actual image points). An alternative solution is to calculate the extreme points of the line during the computation of the accumulator array. For this, every cell of the accumulator array is supplemented with four additional coordinates,
A(i, j) = (a, umin, vmin, umax, vmax),



(12.24)



where component a denotes the original accumulator value and umin, vmin, umax, vmax are the coordinates of the line's bounding box. After the additional coordinates are initialized, they are updated simultaneously with the positions along the parameter trace for every image point (u, v). After completion of the process, the accumulator cell (i, j) contains the bounding box of all image points that contributed to it. When finding the maximum values in the second stage, care should be taken so that the merged cells contain the correct endpoints (see also Exercise 12.7).

Hierarchical Hough Transform

The accuracy of the results increases with the size of the parameter space used; for example, a step size of 256 along the θ axis is equivalent to searching for lines at angular increments of π/256 ≈ 0.7°. While increasing the number of accumulator cells provides a finer result, bear in mind that it also increases the computation time and especially the amount of memory required. Instead of increasing the resolution of the entire parameter space, the idea of the hierarchical HT is to gradually “zoom in” and refine the parameter space. First, the regions containing the most important lines are found using a relatively low-resolution parameter space,






and then the parameter spaces of those regions are recursively passed to the HT and examined at a higher resolution. In this way, a relatively exact determination of the parameters can be achieved using a (comparatively) limited parameter space.

Line Intersections

It may be useful in certain applications not to find the lines themselves but their intersections, for example, for precisely locating the corner points of a polygon-shaped object. The Hough transform delivers the parameters of the recovered lines in Hessian normal form (that is, as tuples Lk = ⟨θk, rk⟩). To compute the point of intersection x12 = (x12, y12)ᵀ of two lines L1 = ⟨θ1, r1⟩ and L2 = ⟨θ2, r2⟩ we need to solve the system of linear equations
x12·cos(θ1) + y12·sin(θ1) = r1,
x12·cos(θ2) + y12·sin(θ2) = r2,



(12.25)



for the unknowns x12, y12. The solution is
(x12, y12)ᵀ = 1/(cos(θ1)·sin(θ2) − cos(θ2)·sin(θ1)) · (r1·sin(θ2) − r2·sin(θ1), r2·cos(θ1) − r1·cos(θ2))ᵀ
            = 1/sin(θ2 − θ1) · (r1·sin(θ2) − r2·sin(θ1), r2·cos(θ1) − r1·cos(θ2))ᵀ,     (12.26)
for sin(θ2 − θ1) ≠ 0. Obviously x12 is undefined (no intersection point exists) if the lines L1, L2 are parallel to each other (i.e., if θ1 ≡ θ2). Figure 12.21 shows an illustrative example using ARToolkit markers. After automatic thresholding (see Ch. 9) the straight line segments along the outer boundary of the largest binary region are analyzed with the Hough transform. Subsequently, the corners of the marker are calculated precisely as the intersection points of the involved line segments.

12.2.6 Hough Transform for Circles and Arcs

Since lines in 2D have two degrees of freedom, they can be completely specified using two real-valued parameters. In a similar fashion, representing a circle in 2D requires three parameters, for example C = ⟨xc, yc, r⟩, where xc, yc are the coordinates of the center and r is the radius of the circle (Fig. 12.22). A point p = (x, y) lies exactly on the circle C if the condition
(x − xc)² + (y − yc)² = r²     (12.27)
holds. Therefore the Hough transform for circles requires a 3D parameter space A(i, j, k) to find the position and radius of circles (and circular arcs) in an image. Unlike the HT for lines, there does not exist a simple functional dependency between the coordinates in parameter space; so how can we find every parameter combination (xc, yc, r)






Used for augmented reality applications (see https://en.wikipedia.org/ wiki/ARToolKit).
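Eqn. (12.26) translates directly into Java; a minimal sketch (the method name is illustrative, and the coordinates are relative to the same reference point xr used by the line parameters):

  // Intersection of two lines given in Hessian normal form (theta, r),
  // following Eqn. (12.26). Returns {x12, y12} or null for parallel lines.
  static double[] intersectHnfLines(double theta1, double r1,
                                    double theta2, double r2) {
      double det = Math.sin(theta2 - theta1);
      if (Math.abs(det) < 1e-12)
          return null;                          // lines are (nearly) parallel
      double x = (r1 * Math.sin(theta2) - r2 * Math.sin(theta1)) / det;
      double y = (r2 * Math.cos(theta1) - r1 * Math.cos(theta2)) / det;
      return new double[] {x, y};               // relative to the reference point xr
  }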









Fig. 12.21 Hough transform used for precise calculation of corner points. Original image showing a typical ARToolkit marker (a), result after automatic thresholding (b). The outer contour pixels of the largest binary region (c) are used as input points to the Hough transform. Hough accumulator map (d), detected lines and marked intersection points (e).









Fig. 12.22 Representation of circles and ellipses in 2D. A circle (a) requires three parameters (e.g., xc , yc , r). An arbitrary ellipse (b) takes five parameters (e.g., xc , yc , ra , rb , θ).






that satisfies Eqn. (12.27) for a given image point (x, y)? A “brute force” approach is to exhaustively test all cells of the parameter space to see if the relation in Eqn. (12.27) holds, which is of course computationally quite expensive.









Fig. 12.23 Hough transform for circles. The illustration depicts a single slice of the 3D accumulator array A(i, j, k) at a given circle radius rk . The center points of all the circles running through a given image point p1 = (x1 , y1 ) form a circle C1 with a radius of rk centered around p1 , just as the center points of the circles that pass through p2 and p3 lie on the circles C2 , C3 . The cells along the edges of the three circles C1 , C2 , C3 of radius rk are traversed and their values in the accumulator array incremented. The cell in the accumulator array contains a value of 3 where the circles intersect at the true center of the image circle C.






Fig. 12.24 3D parameter space for circles. For each image point p = (x, y), the cells lying on a cone (with its axis at (x, y) and varying radius rk) in the 3D accumulator A(i, j, k) are traversed and incremented. The size of the discrete accumulator is set to 100 × 100 × 30. Candidate center points are found where many of the 3D surfaces intersect.



[Fig. 12.24 data: 3D parameter space xc, yc = 0, . . . , 100, r = 10, . . . , 30; image points p1 = (30, 50), p2 = (50, 50), p3 = (40, 40), p4 = (80, 20)]



If we examine Fig. 12.23, we can see that a better idea might be to make use of the fact that the coordinates of the center points also form a circle in Hough space. It is therefore not necessary to scan the entire 3D parameter space for each image point. Instead we need only increase the cell values along the edge of the appropriate circle on each r plane of the accumulator array. To do this, we may adapt any of the standard algorithms for generating circles. For example, the integer math version of the classic Bresenham algorithm [41] is a good candidate. Figure 12.24 shows the spatial structure of the 3D parameter space for circles. For a given image point p = (x, y), at each plane along the r axis (for rk = rmin , . . . , rmax ), a circle centered at (x, y) with the radius rk is “drawn”, ultimately creating a 3D cone-shaped surface in the parameter space. The coordinates of the dominant



circles can be found by searching the accumulator space for the cells with the highest values, that is, the cells where the most cones intersect. Just as in the HT for lines, the bias problem (see Sec. 12.2.5) also occurs in the circle HT. Sections of circles (i.e., arcs) can be found in a similar way, in which case the maximum value possible for a given cell is proportional to the arc length.

12.2.7 Hough Transform for Ellipses

In a perspective image, most circular objects originating in our real, 3D world will actually appear in 2D images as ellipses, except in the case where the object lies on the optical axis and is observed from the front. For this reason, perfectly circular structures seldom occur in photographs. While the Hough transform can still be used to find ellipses, the larger parameter space required makes it substantially more expensive. A general ellipse in 2D has five degrees of freedom and therefore requires five parameters to represent it, for example,
E = ⟨xc, yc, ra, rb, θ⟩,



(12.28)



where xc, yc are the coordinates of the center point, ra, rb are the two radii, and θ is the orientation of the principal axis (Fig. 12.22). In order to find ellipses of any size, position, and orientation using the Hough transform, a 5D parameter space with a suitable resolution in each dimension is required. A simple calculation illustrates the enormous expense of representing this space: using a resolution of only 128 = 2⁷ steps in every dimension results in 2³⁵ accumulator cells, and implementing these using 4-byte int values thus requires 2³⁷ bytes (128 gigabytes) of memory. Moreover, the amount of processing required for filling and evaluating such a huge parameter space makes this method unattractive for real applications. An interesting alternative in this case is the generalized Hough transform, which in principle can be used for detecting any arbitrary 2D shape [20, 139]. Using the generalized Hough transform, the shape of the sought-after contour is first encoded point by point in a table, and the associated parameter space is then related to the position xc, yc, scale s, and orientation θ of the shape. This requires a 4D space, which is smaller than that of the Hough method for ellipses described earlier.



12.3 Java Implementation

The Hough transform for straight lines and RANSAC detectors for lines, circles, and ellipses are implemented in the imagingbook library, with the main classes HoughTransformLines, RansacLineDetector,



See Sec. 11.2 and Sec. F.3.1 of the Appendix for other ellipse definitions. Packages imagingbook.pub.hough and imagingbook.pub.ransac. This implementation is without bias compensation (see Sec. 12.2.5).






Prog. 12.1 Minimal usage example of class HoughTransformLines (ImageJ plugin). An 8-bit input image (ip) with zero background and non-zero foreground pixels is assumed. First, a parameter object is created and configured; nAng (= m) and nRad (= n) specify the number of discrete angular and radial steps in the Hough accumulator map. Then an instance of HoughTransformLines is created for the input image ip; the accumulator map is calculated in this step. Finally, getLines() is called to retrieve an array of the 5 strongest detected lines, with at least 50 image points each. Unless empty, this sequence is subsequently listed.






import ij.IJ;
import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;
import imagingbook.pub.hough.HoughTransformLines;
import imagingbook.pub.hough.HoughTransformLines.Parameters;
import imagingbook.pub.hough.lines.HoughLine;

public class Hough_Lines_Minimal implements PlugInFilter {

  public int setup(String arg, ImagePlus im) {
    return DOES_8G + NO_CHANGES;
  }

  public void run(ImageProcessor ip) {
    Parameters params = new Parameters();
    params.nAng = 256;  // = m
    params.nRad = 256;  // = n
    // compute the Hough Transform:
    HoughTransformLines ht = new HoughTransformLines(ip, params);
    // retrieve the 5 strongest lines with min. 50 accumulator votes
    HoughLine[] lines = ht.getLines(50, 5);
    if (lines.length > 0) {
      IJ.log("Lines found:");
      for (HoughLine L : lines) {
        IJ.log(L.toString());  // list the resulting lines
      }
    }
    else
      IJ.log("No lines found!");
  }
}



RansacCircleDetector, RansacEllipseDetector. A minimal usage example for Hough line detection is shown in Prog. 12.1. Interested readers should consult the online source code for other examples and implementation details.



12.4 Exercises

Exercise 12.1. Drawing a straight line given in Hessian normal form (HNF) is not directly possible because typical graphics environments can only draw lines between two specified end points. An HNF line L = ⟨θ, r⟩, specified relative to a reference point xr = (xr, yr), can be drawn into an image I in several ways (implement both versions):
• Version 1: Iterate over all image points (u, v): if Eqn. (12.18), that is,
r = (u−xr)·cos(θ) + (v−yr)·sin(θ),






For example, with drawLine(x1, y1, x2, y2) in ImageJ.



(12.29)



is satisfied for position (u, v), then mark the pixel I(u, v). Of course, this “brute force” method will only show those (few) line pixels whose positions satisfy the line equation exactly. To obtain a more “tolerant” drawing method, we first reformulate Eqn. (12.29) to
(u−xr)·cos(θ) + (v−yr)·sin(θ) − r = d.



(12.30)



Obviously, Eqn. (12.29) is satisfied only if exactly d = 0 in Eqn. (12.30). If, however, Eqn. (12.29) is not satisfied, then the magnitude of d ̸= 0 equals the distance of the point (u, v) from the line. Note that d itself may be positive or negative, depending on which side of the line (u, v) is located. This suggests the following version. • Version 2: Define a constant w > 0. Iterate over all image positions (u, v); whenever the inequality |(u−xr )· cos(θ) + (v−yr )· sin(θ) − r| ≤ w



(12.31)



is satisfied for position (u, v), mark the pixel I(u, v). For example, with w = 1 all line points should show. What is the geometric meaning of w?

Exercise 12.2. Develop a “less brutal” method (compared to Exercise 12.1) for drawing a straight line L = ⟨θ, r⟩ in Hessian normal form (HNF). First, set up the HNF equations for the four border lines of the image, A, B, C, D. Now determine the intersection points of the given line L with each border line A, . . . , D and use the built-in drawLine() method or a similar routine to draw L by connecting the intersection points. Consider which special situations may appear and how they could be handled.

Exercise 12.3. Reformulate and implement RANSAC circle detection (see Sec. 12.1.4) for the case that the circle radius is known. Only two random points are required to instantiate the circle model, but note that this circle is not unique!

Exercise 12.4. Devise and implement a RANSAC method for simultaneously detecting a pair of concentric circles whose radii are related by a fixed (known) ratio.

Exercise 12.5. Implement (or extend) the Hough transform for straight lines by including measures against the bias problem, as discussed in Sec. 12.2.5 (Eqn. (12.23)).

Exercise 12.6. The bias problem (see Sec. 12.2.5) is often mentioned in connection with the Hough transform. Does the RANSAC method have a similar issue? If yes, consider what remedies may be available and how they could be implemented.

Exercise 12.7. Implement (or extend) the Hough transform for finding lines that takes into account line endpoints, as described in Sec. 12.2.5 (Eqn. (12.24)).






Exercise 12.8. Calculate the pairwise intersection points of all detected lines (see Eqns. (12.25)–(12.26)) and show the results graphically.

Exercise 12.9. Extend the Hough transform for straight lines so that updating the accumulator map takes into account the intensity (edge magnitude) of the current pixel, as described in Eqn. (12.22).

Exercise 12.10. Implement a hierarchical version of the Hough transform for straight lines (see p. 365) capable of accurately determining line parameters.

Exercise 12.11. Implement the Hough transform for finding circles and circular arcs with varying radii. Make use of a fast algorithm for drawing circles in the accumulator array, as described in Sec. 12.2.6.

Exercise 12.12. Assuming that “good” sample points are more likely to be found in areas of dense point population, could this be used to accelerate RANSAC? One idea is to estimate the local point density and use it to increase the likelihood of points being selected from high-density regions. For example, a simple Gaussian filter (see Sec. 4.2.7) could be used to produce a discrete density distribution (“heat map”), as illustrated in Fig. 12.25. Consider how this could be used to control the random selection process.



Fig. 12.25 Local point density map (Exercise 12.12). The “heat map” in (b) was obtained by applying a Gaussian filter (σ = 7.5) to the inverted image (a). It is displayed using a pseudocolor lookup-table.



(a) original






(b) point density map



Part V



Color



13 Color Images



Color images are involved in every aspect of our lives, where they play an important role in everyday activities such as television, photography, and printing. Color perception is a fascinating and complicated phenomenon that has occupied the interests of scientists, psychologists, philosophers, and artists for hundreds of years [238, 246]. In this chapter, we focus on those technical aspects of color that are most important for working with digital color images. Our emphasis will be on understanding the various representations of color and correctly utilizing them when programming. Additional color-related issues, such as colorimetric color spaces, color quantization, and color filters, are covered in subsequent chapters.



13.1 RGB Color Images

The RGB color schema encodes colors as combinations of the three primary colors: red, green, and blue (R, G, B). This scheme is widely used for transmission, representation, and storage of color images on both analog devices such as television sets and digital devices such as computers, digital cameras, and scanners. For this reason, many image-processing and graphics programs use the RGB schema as their internal representation for color images, and most language libraries, including Java's imaging APIs, use it as their standard image representation. RGB is an additive color system, which means that all colors start with black and are created by adding the primary colors. You can think of color formation in this system as occurring in a dark room where you can overlay three beams of light (one red, one green, and one blue) on a sheet of white paper. To create different colors, you would modify the intensity of each of these beams independently. The distinct intensity of each primary color beam controls the shade and brightness of the resulting color. The colors gray and white are created by mixing the three primary color beams at the same intensity. A similar operation occurs on the screen of a color television or






Fig. 13.1 Representation of the RGB color space as a 3D unit cube. The primary colors red (R), green (G), and blue (B) form the coordinate system. The “pure” red color (R), green (G), blue (B), cyan (C), magenta (M), and yellow (Y) lie on the vertices of the color cube. All the shades of gray, of which K is an example, lie on the diagonal between black S and white W.






Pt.   Color      R     G     B
S     Black      0.00  0.00  0.00
R     Red        1.00  0.00  0.00
Y     Yellow     1.00  1.00  0.00
G     Green      0.00  1.00  0.00
C     Cyan       0.00  1.00  1.00
B     Blue       0.00  0.00  1.00
M     Magenta    1.00  0.00  1.00
W     White      1.00  1.00  1.00
K     50% Gray   0.50  0.50  0.50
R75   75% Red    0.75  0.00  0.00
R50   50% Red    0.50  0.00  0.00
R25   25% Red    0.25  0.00  0.00
P     Pink       1.00  0.50  0.50



CRT-based computer monitor, where tiny, close-lying dots of red, green, and blue phosphor are simultaneously excited by a stream of electrons to distinct energy levels (intensities), creating a seemingly continuous color image. The RGB color space can be visualized as a 3D unit cube in which the three primary colors form the coordinate axes. The RGB values are positive and lie in the range [0, Cmax]; for most digital images, Cmax = 255. Every possible color Ci corresponds to a point within the RGB color cube of the form Ci = (Ri, Gi, Bi), where 0 ≤ Ri, Gi, Bi ≤ Cmax. RGB values are often normalized to the interval [0, 1] so that the resulting color space forms a unit cube (see Fig. 13.1). The point S = (0, 0, 0) corresponds to the color black, W = (1, 1, 1) corresponds to the color white, and all the points lying on the diagonal between S and W are shades of gray created from equal color components R = G = B. Figure 13.2 shows a color test image and its corresponding RGB color components, displayed here as intensity images. We will refer to this image in a number of examples that follow in this chapter. RGB is a very simple color system and, as demonstrated in Sec. 13.2, a basic knowledge of it is often sufficient for processing color images or transforming them into other color spaces. At this point, we will not be able to determine what color a particular RGB pixel corresponds to in the real world, or even what the primary colors red, green, and blue truly mean in a physical (i.e., colorimetric) sense. For now we rely on our intuitive understanding of color and will address colorimetry and color spaces later in the context of the CIE color system (see Ch. 14).

13.1.1 Structure of Color Images

Color images are represented in the same way as grayscale images, by using an array of pixels in which different models are used to order the






Cathode ray tube



Fig. 13.2 A color image and its corresponding RGB channels. The fruits depicted are mainly yellow and red and therefore have high values in the R and G channels. In these regions, the B content is correspondingly lower (represented here by darker gray values) except for the bright highlights on the apple, where the color changes gradually to white. The tabletop in the foreground is purple and therefore displays correspondingly higher values in its B channel.






individual color components. In the next sections we will examine the difference between true color images, which utilize colors uniformly selected from the entire color space, and so-called paletted or indexed images, in which only a select set of distinct colors is used. Deciding which type of image to use depends on the requirements of the application.

True Color Images

A pixel in a true color image can represent any color in its color space, as long as it falls within the (discrete) range of its individual color components. True color images are appropriate when the image contains many colors with subtle differences, as occurs in digital photography and photo-realistic computer graphics. Next we look at two methods of ordering the color components in true color images: component ordering and packed ordering.

Component Ordering

In component ordering (also referred to as planar ordering) the color components are laid out in separate arrays of identical dimensions. In this case, the color image
Icomp = (IR, IG, IB)



(13.1)



can be thought of as a vector of related intensity images IR , IG , and IB (see Fig. 13.3), and the RGB values of the color image I at






Fig. 13.3 RGB component ordering. The three color components are laid out in separate arrays IR, IG, IB of the same size.






Fig. 13.4 RGB packed ordering. The three color components R, G, and B are packed into a single array element.






position (u, v) are obtained by accessing the three component images in the form
(R, G, B)ᵀ = (IR(u, v), IG(u, v), IB(u, v))ᵀ.     (13.2)

Packed Ordering

In packed ordering, the component values that represent the color of a particular pixel are packed together into a single element of the image array (see Fig. 13.4). The RGB values of a packed image Ipack at location (u, v) are obtained by accessing the individual components of the color pixel as
(R, G, B)ᵀ = (red(Ipack(u, v)), grn(Ipack(u, v)), blu(Ipack(u, v)))ᵀ.     (13.3)
The access functions red(), grn(), blu() depend on the specific implementation used for encoding the color pixels.
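As a simple illustration of Eqn. (13.3) (not the book's actual implementation; the concrete ImageJ mechanics follow in Sec. 13.1.2), the access functions for a 32-bit packed pixel with the layout of Fig. 13.6 could be written in Java as:

  // Hypothetical access functions for a packed 0xAARRGGBB pixel value.
  static int red(int c) { return (c >> 16) & 0xff; }  // bits 16-23
  static int grn(int c) { return (c >>  8) & 0xff; }  // bits 8-15
  static int blu(int c) { return  c        & 0xff; }  // bits 0-7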






Indexed Images

Indexed images permit only a limited number of distinct colors and therefore are used mostly for illustrations and graphics that contain large regions of the same color. Often these types of images are stored






Fig. 13.5 RGB indexed image. The image array Iidx itself does not contain any color component values. Instead, each cell contains an index q ∈ [0, Q−1] into the associated color table (“palette”) P. The actual color value is specified by the table entry Pq = (Rq, Gq, Bq).






in indexed GIF or PNG files for use on the Web. In these indexed images, the pixel array does not contain color or brightness data but instead consists of integer numbers q that are used to index into a color table or “palette”
P : [0, Q−1] ↦ [0, K−1]³.



(13.4)



Here Q denotes the size of the color table, equal to the maximum number of distinct image colors (typically some power of two, e.g., Q = 2, . . . , 256), and K is the number of RGB component values (e.g., K = 256). Table P contains a specific color vector P(q) = (Rq, Gq, Bq)ᵀ for every color index q = 0, . . . , Q−1 (see Fig. 13.5). The RGB component values of an indexed image Iidx at position (u, v) are obtained as
(R, G, B)ᵀ = (Pr(q), Pg(q), Pb(q))ᵀ = P(q),     (13.5)
with the index q = Iidx(u, v). To allow proper reconstruction, the color table P must of course be stored and/or transmitted along with the indexed image. During the transformation from a true color image to an indexed image (e.g., from a JPEG image to a GIF image), the problem of optimal color reduction, or color quantization, arises. Color quantization is the process of determining an optimal color table and then mapping the original colors to it. This process is described in greater detail in Sec. 13.4.

13.1.2 Color Images in ImageJ

ImageJ provides two simple types of color images:
• RGB full-color images (24-bit “RGB color”),
• Indexed images (“8-bit color”).

RGB True Color Images

RGB color images in ImageJ use a packed order (see Sec. 13.1.1), where each color pixel is represented by a 32-bit int value. As Fig.






13.6 illustrates, 8 bits are used to represent each of the RGB components, which limits the range of the individual components to 0–255. The remaining 8 bits are reserved for the transparency, or alpha (α), component. This is also the usual ordering in Java for RGB color images.



Fig. 13.6 Structure of a packed RGB color pixel in Java. Within a 32-bit int, 8 bits are allocated, in the following order, for each of the color components R, G, B, and the transparency value α (unused in ImageJ).



[Fig. 13.6 bit layout: α bits 31–24, R bits 23–16, G bits 15–8, B bits 7–0]



Accessing RGB Component Values

RGB color images are represented by an array of pixels, the elements of which are standard Java ints. To disassemble the packed int value into the three color components, you apply the appropriate bitwise shifting and masking operations. In the following example, we assume that the image processor ip (of type ColorProcessor) contains an RGB color image:

ColorProcessor ip = ...
int c = ip.getPixel(u, v);       // a packed RGB color pixel
int r = (c & 0xff0000) >> 16;    // red component
int g = (c & 0x00ff00) >> 8;     // green component
int b = (c & 0x0000ff);          // blue component



In this example, each of the RGB components of the packed pixel c is isolated using a bitwise AND operation (&) with an appropriate bit mask (following convention, bit masks are given in hexadecimal4 notation), and afterwards the extracted bits are shifted right by 16 (for R) or 8 (for G) bit positions (see Fig. 13.7).

Fig. 13.7 Decomposition of a 32-bit RGB color pixel using bit operations. The R component (bits 16–23) of the RGB pixel C (above) is isolated using a bitwise AND operation (&) together with a bit mask M = 0xff0000. All bits except the R component are set to the value 0, while the bit pattern within the R component remains unchanged. This bit pattern is subsequently shifted 16 positions to the right (>>), so that the R component is moved into the lowest 8 bits and its value lies in the range of 0, . . . , 255. During the shift operation, zeros are filled in from the left.






The “assembly” of an RGB pixel from separate R, G, and B values works in the opposite direction, using the bitwise OR operator (|) and shifting bits to the left (<<). For example, to brighten a packed pixel by modifying its components and re-inserting it into the pixel array:

int c = pixels[i];               // a packed RGB color pixel
int r = (c & 0xff0000) >> 16;
int g = (c & 0x00ff00) >> 8;
int b = (c & 0x0000ff);
// modify colors:
r = r + 10; if (r > 255) r = 255;
g = g + 10; if (g > 255) g = 255;
b = b + 10; if (b > 255) b = 255;
// reassemble color pixel and insert into pixel array:
pixels[i] = ((r & 0xff) << 16) | ((g & 0xff) << 8) | (b & 0xff);

The following fragment (from the RGB → HLS conversion in Prog. 13.8) shows how the hue component H is obtained from the largest component value cHi and the component range cRng:

if (cRng > 0) {    // this is a color pixel!
  float r = (float)(cHi - R) / cRng;
  float g = (float)(cHi - G) / cRng;
  float b = (float)(cHi - B) / cRng;
  float h;
  if (R == cHi)            // R is largest component
    h = b - g;
  else if (G == cHi)       // G is largest component
    h = r - b + 2.0f;
  else                     // B is largest component
    h = g - r + 4.0f;
  if (h < 0)
    h = h + 6;
  H = h / 6;
}
return new float[] {H, L, S};



Java Implementation

Currently there is no method in either the standard Java API or ImageJ for converting color values between RGB and HLS. Program 13.8 gives one possible implementation of the RGB → HLS conversion that follows the definitions in Eqns. (13.23)–(13.24). The HLS → RGB conversion is shown in Prog. 13.9.

HSV and HLS Color Spaces Compared

Despite the obvious similarity between the two color spaces, as Fig. 13.16 illustrates, substantial differences in the V/L and S components do exist. The essential difference between the HSV and HLS spaces is the ordering of the colors that lie between the white point W and the “pure” colors (R, G, B, Y, C, M), which consist of at most two primary colors, at least one of which is completely saturated.
The difference in how colors are distributed in RGB, HSV, and HLS space is readily apparent in Fig. 13.17. The starting point was a distribution of 1331 (11 × 11 × 11) color tuples obtained by uniformly sampling the RGB space at an interval of 0.1 in each dimension. We can see clearly that in HSV space the maximally saturated colors (s = 1) form circular rings with increasing density toward the upper plane of the cylinder. In HLS space, however, the color samples are



Prog. 13.9 HLS → RGB conversion (Java implementation).



Fig. 13.16 HSV and HLS components compared. Saturation (top row) and intensity (bottom row). In the color saturation difference image SHSV −SHLS (top), light areas correspond to positive values and dark areas to negative values. Saturation in the HLS representation, especially in the brightest sections of the image, is notably higher, resulting in negative values in the difference image. For the intensity (value and luminance, respectively) in general, VHSV ≥ LHLS and therefore the difference VHSV −LHLS (bottom) is always positive. The hue component H (not shown) is identical in both representations.






float[] HLStoRGB (float[] HLS) {
  float H = HLS[0], L = HLS[1], S = HLS[2];   // H, L, S in [0, 1]
  float r = 0, g = 0, b = 0;
  if (L <= 0)             // black
    r = g = b = 0;
  else if (L >= 1)        // white
    r = g = b = 1;
  else {
    float hh = (6 * H) % 6;   // = H'
    int   c1 = (int) hh;
    float c2 = hh - c1;
    float d = (L <= 0.5f) ? (S * L) : (S * (1 - L));
    // ...
  }
  return new float[] {r, g, b};
}

Prog. 13.12 Quantization of a 3 × 8-bit RGB color pixel to 8 bits by 3:3:2 packing:

int C = ...;                       // a packed 32-bit RGB pixel
int R = (C & 0x00ff0000) >> 16;
int G = (C & 0x0000ff00) >> 8;
int B = (C & 0x000000ff);
// 3:3:2 uniform color quantization
byte RGB = (byte) ((R & 0xE0) | ((G & 0xE0) >> 3) | ((B & 0xC0) >> 6));
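Applied to every pixel of a packed RGB image, this scheme yields at most 256 distinct colors. A minimal sketch (array and method names are illustrative only, not from the book's implementation):

static byte[] quantize332(int[] pixels) {
  byte[] q = new byte[pixels.length];
  for (int i = 0; i < pixels.length; i++) {
    int C = pixels[i];
    int R = (C & 0x00ff0000) >> 16;
    int G = (C & 0x0000ff00) >> 8;
    int B = (C & 0x000000ff);
    q[i] = (byte) ((R & 0xE0) | ((G & 0xE0) >> 3) | ((B & 0xC0) >> 6));
  }
  return q;
}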



Such uniform (scalar) quantization of the color components is generally not optimal, because the interesting colors may not be sampled with sufficient density while at the same time colors are represented that do not appear in the image at all.



13.4.2 Vector Quantization

Vector quantization does not treat the individual color components separately as does scalar quantization; instead, each color vector ci = (Ri, Gi, Bi) or pixel in the image is treated as a single entity. Starting from a set of original color vectors C = {c0, . . . , cm−1}, the task of vector quantization is (a) to find a set of n representative color vectors C′ = {c′0, . . . , c′n−1} and (b) to replace each original color ci by one of the new colors c′j ∈ C′, where n is usually predetermined (n < m) and the resulting deviation from the original image shall be minimal. This is a combinatorial optimization problem in a rather large search space, which usually makes it impossible to determine a (globally) optimal solution in adequate time. Thus all of the following methods only compute a “local” optimum at best.

Populosity Algorithm

The populosity algorithm15 [125] selects the n most frequent colors in the image as the representative set of color vectors C′. Being very easy to implement, this procedure is quite popular. The method described in Sec. 13.3.1, based on sorting the image pixels, can be used to determine the n most frequent image colors. Each original pixel ci is then replaced by the closest representative color vector in C′, that is, the quantized color vector with the smallest distance in the 3D color space. The algorithm performs sufficiently well only as long as the original image colors are not widely scattered through the color space. Some improvement is possible by grouping similar colors into larger cells first (by scalar quantization). However, a less frequent (but possibly important) color may get lost whenever it is not sufficiently similar to any of the n most frequent colors.

Median-Cut Algorithm

The median-cut algorithm [125] is considered a classical method for color quantization that is implemented in many applications (including ImageJ). As in the populosity method, a color histogram is first calculated for the original image, traditionally with a reduced number of histogram cells (such as 32 × 32 × 32) for efficiency reasons.16 The initial histogram volume is then recursively split into smaller boxes until the desired number of representative colors is reached. In each recursive step, the color box representing the largest number of pixels is selected for splitting. A box is always split across the longest of its three axes at the median point, such that half of the contained pixels remain in each of the resulting subboxes (Fig. 13.26).



15 Sometimes also called “popularity” algorithm.
16 This corresponds to a scalar prequantization on the color components, which leads to additional quantization errors and thus produces suboptimal results. This step seems unnecessary on modern computers and should be avoided.



Fig. 13.26 Median-cut algorithm. The RGB color space is recursively split into smaller cubes along one of the color axes (1st, 2nd, and 3rd cut shown).



The result of this recursive splitting process is a partitioning of the color space into a set of disjoint boxes, with each box ideally containing the same number of image pixels. In the last “assignment” step, the mean (i.e., average) color is calculated for each color cube, and all image pixels it contains are replaced by that color. Alternatively, each image color is replaced by the closest (in terms of Euclidean distance) box center color. Though this is more expensive to calculate, results are usually better. The advantage of this method is that color regions of high pixel density are split into many smaller cells, thus reducing the overall quantization error. In color regions of low density, however, relatively large cubes and thus large color deviations may occur for individual pixels. The median-cut method is described in detail in Algorithms 13.1–13.3 and a corresponding Java implementation can be found in the source code section of this book’s website (see Sec. 13.4.3).

Octree Algorithm

Similar to the median-cut algorithm, this method is also based on partitioning the 3D color space into cells of varying size. The octree algorithm [101] utilizes a hierarchical structure, where each cube in color space may contain eight subcubes. This partitioning is represented by a tree structure (octree) with a cube at each node that may again link to up to eight further nodes. Thus each node corresponds to a subrange of the color space that reduces to a single color point at a certain tree depth d (e.g., d = 8 for a 3 × 8-bit RGB color image). When an image is processed, the corresponding quantization tree, which is initially empty, is created dynamically by evaluating all pixels in a sequence. Each pixel’s color tuple is inserted into the quantization tree, while at the same time the number of nodes is limited to a predefined value K (typically 256). When a new color ci is inserted and the tree does not contain this color, one of the following situations can occur:
1. If the number of nodes is less than K and no suitable node for the color ci exists yet, then a new node is created for ci.
2. Otherwise, i.e., if the tree has K nodes already and color ci is not yet represented, existing nodes at the maximum tree depth (which represent similar colors) are merged into a common node.
A key advantage of the iterative octree method is that the number of color nodes remains limited to K in any step and thus the amount of required storage is small. The final replacement of the image



Alg. 13.1 Median-cut color quantization (Part 1). The input image I is quantized to up to K representative colors and a new, quantized image is returned. The main work is done in procedure FindReferenceColors(), which iteratively partitions the color space into increasingly smaller boxes. It returns a set of representative colors (Cq) that are subsequently used by procedure QuantizeImage() to quantize the original image I. Note that (unlike in most common implementations) no prequantization is applied to the original image colors. For the definitions of data structures (“object” types) ColNode and ColorBox see Alg. 13.3, lines 26–27.



1:  QuantizeMedianCut(I, K)
    I, a color image; K, the max. number of quantized colors.
    Returns a new quantized image with at most K colors.
2:    Call ← GetAllColors(I)                 ▷ Call = {cn0, cn1, . . .}
3:    if |Call| ≤ K then
4:      return I                             ▷ not enough colors, nothing to quantize
5:    else
6:      Cq ← FindReferenceColors(Call, K)
7:      I′ ← QuantizeImage(I, Cq)            ▷ see Alg. 13.3
8:      return I′

9:  FindReferenceColors(Call, K)
    Returns a set of up to K representative colors from the original colors
    Call = {cn0, cn1, . . .} (a set of ColorNode instances, see Alg. 13.3, line 26).
      Create a ColorBox instance for level 0 that contains all image colors Call
      and make it the initial element in the set of color boxes B:
10:   cb0 ← CreateColorBox(Call, 0)          ▷ see Alg. 13.2
11:   B ← {cb0}                              ▷ initial set of color boxes
12:   k ← 1, done ← false
13:   while (k < K) ∧ ¬done do
14:     cb ← FindBoxToSplit(B)               ▷ see Alg. 13.2
15:     if cb ≠ nil then
16:       (cb1, cb2) ← SplitBox(cb)          ▷ see Alg. 13.2
17:       B ← B \ {cb}                       ▷ remove cb from B
18:       B ← B ∪ {cb1, cb2}                 ▷ add cb1, cb2 to B
19:       k ← k + 1
20:     else                                 ▷ no more boxes to split
21:       done ← true
      Collect the average colors of all color boxes in B:
22:   Cq ← {AverageColor(cbj) | cbj ∈ B}     ▷ see Alg. 13.3
23:   return Cq                              ▷ Cq = {c0, . . . , cK−1}

24: GetAllColors(I)
    Returns the distinct colors in image I as a set of ColorNode instances
    (see definition in Alg. 13.3, line 26).
25:   C ← ()                                 ▷ empty sequence of colors
26:   for all p ∈ dom(I) do
27:     c ← I(p)                             ▷ c = (R, G, B)
28:     C ← C ⊞ (c)                          ▷ collect all image colors
29:   sort(C)                                ▷ sort C such that same colors are adjacent
30:   P ← |C|                                ▷ P = number of pixels in I
31:   Call ← {}                              ▷ empty set of ColNode instances
32:   n ← 1                                  ▷ run-length counter
33:   for i ← 0, . . . , P−1 do
34:     if (i < P−1) ∧ (C(i) = C(i+1)) then
35:       n ← n + 1
36:     else                                 ▷ color transition, i.e., end of run
37:       (R, G, B) ← C(i)
38:       cn ← new ColNode(R, G, B, n)       ▷ = ⟨red, grn, blu, cnt⟩
39:       Call ← Call ∪ {cn}
40:       n ← 1
41:   return Call



Alg. 13.2 Median-cut color quantization (Part 2). Function FindBoxToSplit(B) searches the set of ColorBox instances B for the next color box to split. If such a box is found, it is split into two new color boxes by function SplitBox() at the median along its longest dimension (red, green, or blue). Function medianR(C) returns the median of the red component values for the colors in C, analogously medianG, medianB for the green and blue dimensions (see line 12). Function CreateColorBox() creates and initializes a new ColorBox instance. Variables Rmin, . . . , Bmax are trimmed to the box boundaries of the colors contained in the associated color set C.

1:  FindBoxToSplit(B)
    Searches the set B of ColorBox instances for a box to split and returns this
    box, or nil if no splittable box can be found.
      Find the set of color boxes that can be split (i.e., contain at least 2 different colors):
2:    Bs ← {cb ∈ B | card(colors(cb)) ≥ 2}
3:    if Bs = {} then                        ▷ no splittable box was found
4:      return nil
5:    else
        Select a splittable box with minimum level:
6:      cb′ ← argmin_{cb ∈ Bs} level(cb)
7:      return cb′

8:  SplitBox(cb)
    Splits ColorBox cb at the median plane that is perpendicular to its longest
    dimension and returns a pair of new color boxes.
9:    m ← level(cb)
10:   C ← colors(cb)                         ▷ set of ColorNode instances in cb
11:   d ← GetMaxBoxDimension(cb)             ▷ d ∈ {R, G, B}, see Alg. 13.3
      From all colors in C determine the median of the color distribution along
      dimension d and split C into C1, C2:
12:   C1 ← {cn ∈ C | red(cn) ≤ medianR(C)}   if d = R
           {cn ∈ C | grn(cn) ≤ medianG(C)}   if d = G
           {cn ∈ C | blu(cn) ≤ medianB(C)}   if d = B
13:   C2 ← C \ C1                            ▷ C = C1 ∪ C2
14:   b1 ← CreateColorBox(C1, m+1)           ▷ see below
15:   b2 ← CreateColorBox(C2, m+1)
16:   return (b1, b2)

17: CreateColorBox(C, m)
    Creates and returns a new ColorBox for the ColorNode instances in
    C = {cn0, cn1, . . .} (see type definitions in Alg. 13.3, lines 26–27) at split-level m.
      Find the RGB extrema of all colors in C:
18:   Rmin, Gmin, Bmin ← +∞
19:   Rmax, Gmax, Bmax ← −∞
20:   for all cn ∈ C do
21:     (R, G, B) ← (red(cn), grn(cn), blu(cn))
22:     Rmin ← min(Rmin, R)
23:     Rmax ← max(Rmax, R)
24:     Gmin ← min(Gmin, G)
25:     Gmax ← max(Gmax, G)
26:     Bmin ← min(Bmin, B)
27:     Bmax ← max(Bmax, B)
28:   cb ← new ColorBox(C, m, Rmin, Rmax, Gmin, Gmax, Bmin, Bmax)
29:   return cb



Alg. 13.3 Median-cut color quantization (Part 3). Auxiliary functions and type definitions.



C ← colors(b) ▷ the set of ColorNode instances in box b n←0 ΣR ← 0, ΣG ← 0, ΣB ← 0 for all cn ∈ C do k ← cnt(c) n←n+k ΣR ← ΣR + k·red(cn) ΣG ← ΣG + k·grn(cn) ΣB ← ΣB + k·blu(cn)



11: 12:



cavg ← n1 ·(ΣR , ΣG , ΣB ) return cavg



13: GetMaxBoxDimension(b) Returns the largest color dimension d ∈ {R, G, B} of ColorBox instance b. 14: dR ← rmax(b) − rmin(b) 15: dG ← gmax(b) − gmin(b) 16: dB ← bmax(b) − bmin(b) 17: 18: 19:



dmax  ← max(dR , dG , dB ) R if dmax = dR d ← G if dmax = dG B if d max = dB return d



20: QuantizeImage(I , Cq ) Returns a new image with color pixels from I replaced by their closest representative colors in Cq (by Euclidean distance).



24:



(M, N ) ← size(I ) I ′ ← duplicate(I ) ▷ create a new image for all p ∈ M × N do Find the quantization color in Cq that is “closest” to the current pixel color (e.g., using the Euclidean distance in RGB space): I ′ (p) ← argmin ∥I (p) − c∥



25:



return I ′



21: 22: 23:



c∈Cq



Object type definitions (see also Appendix Sec. A.4): 26: ColNode := ⟨red, grn, blu, cnt⟩ Represents a single RGB color with an additional count-field. red, grn, blu: RGB component values; cnt, the number of occurrences of this color. 27: ColorBox := ⟨colors, level, rmin , rmax , gmin , gmax , bmin , bmax ⟩ Represents a cube-shaped section of the RGB color space. colors is the set of image colors represented by this color box, level denotes the split-level, and rmin , . . . , bmax are the boundaries of the color cube in RGB space.



416






Fig. 13.27 Color distribution after application of the median-cut (a) and octree (b) algorithms. In both cases, the set of 226,321 colors in the original image (see Fig. 13.25 (a)) was reduced to 256 representative colors.






pixels by the quantized color vectors can also be performed easily and efficiently with the octree structure because only up to eight comparisons (one at each tree layer) are necessary to locate the best-matching color for each pixel. Again a better (though more costly) alternative is to set up a map of representative colors (the average node colors), and replace each original color by the “closest” (in terms of Euclidean distance) reference color.
Figure 13.27 shows the resulting color distributions in RGB space after applying the median-cut and octree algorithms. In both cases, the original image (Fig. 13.25(a)) is quantized to 256 colors. Notice in particular the dense placement of quantized colors in certain regions of the green hues. For both algorithms and the (scalar) 3:3:2 quantization, the resulting distances between the original pixels and the quantized colors are shown in Fig. 13.28. The greatest errors naturally result from 3:3:2 quantization, because this method does not consider the contents of the image at all. Compared with the median-cut method, the overall error for the octree algorithm is smaller, although the latter creates several large deviations, particularly inside the colored foreground regions and the forest region in the background. In general, however, the octree algorithm does not offer significant advantages in terms of image quality over the simpler median-cut algorithm. Another example is shown in Fig. 13.29, contrasting the difference between Euclidean distance-based color assignment and “quick” (i.e., tree-based) color substitution.

Other Methods for Vector Quantization

A suitable set of representative color vectors can usually be determined without inspecting all pixels in the original image. It is often sufficient to use only 10% of randomly selected pixels to obtain a high probability that none of the important colors is lost. In addition to the color quantization methods described already, several other procedures and refined algorithms have been proposed. This includes statistical and clustering methods, such as the classical k-means algorithm, but also the use of neural networks and genetic algorithms. A good overview can be found in [248].
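As an illustration of the Euclidean-distance assignment step (cf. QuantizeImage() in Alg. 13.3), a minimal Java sketch could look as follows; the array layout and method name are illustrative only, not from the book's implementation:

// c and the entries of refColors are int[]{R, G, B} triplets
static int[] closestColor(int[] c, int[][] refColors) {
  int[] best = refColors[0];
  long bestD2 = Long.MAX_VALUE;
  for (int[] cq : refColors) {
    long dR = c[0] - cq[0], dG = c[1] - cq[1], dB = c[2] - cq[2];
    long d2 = dR * dR + dG * dG + dB * dB;   // squared Euclidean distance
    if (d2 < bestD2) {
      bestD2 = d2;
      best = cq;
    }
  }
  return best;
}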



Fig. 13.28 Quantization errors. Original image (a), distance between original and quantized color pixels for scalar 3:3:2 quantization (b), median-cut (c), and octree (d) algorithms.



Fig. 13.29 Quantization example with K = 16 colors. Original image (a); median-cut quantization with Euclidean distance assignment (b); octree quantization with Euclidean distance (c) and “quick” (i.e., tree-based) assignment (d). Note the additional quantization noise introduced by tree-based assignment, caused by colors that are near the outer boundaries of a color box.



13.4.3 Java Implementation



The Java implementation17 of the algorithms described in this chapter consists of a common interface ColorQuantizer and the concrete classes MedianCutQuantizer and OctreeQuantizer. Program 13.13 shows a simple ImageJ plugin that employs the class MedianCutQuantizer for quantizing an RGB full-color image to an indexed image.
The choice of data structures for the representation of color sets and the implementation of the associated set operations are essential to achieve good performance. The data structures used in this implementation are illustrated in Fig. 13.30. Initially, the set of all unique colors contained in the original image (ip of type ColorProcessor) is extracted by method getAllColors(ip). The result is an array allColors of ColorNode instances (representing the set Call). Each ColorNode holds the associated color (red, grn, blu) and its frequency (cnt) in the image. So this is essentially a color histogram. The final (K) reference colors are collected in colormap (Cq), which is a set of ColorBox instances. Each ColorBox specifies a contiguous range of image colors in array allColors, bounded by the indexes lower and upper. Note that color boxes never overlap and each original color is contained in exactly one color box. Thus the color boxes in colormap (B in Alg. 13.1) form a partitioning of allColors.
To split a specific color box along a color dimension d ∈ {R, G, B}, the colors within the corresponding segment of allColors are sorted by their red, green, or blue component, respectively. In Java, this is quite easy to implement using the standard Arrays.sort() method and a specific Comparator object for each color dimension. Finally, method quantize() creates a new image by replacing each pixel in ip by the closest color in colormap.

13.5 Exercises

Exercise 13.1. Create an ImageJ plugin that rotates the individual components of an RGB color image, that is, R → G → B → R.

Exercise 13.2. Pseudocolors are sometimes used for displaying grayscale images (e.g., for viewing medical images with high dynamic range). Create an ImageJ plugin for converting 8-bit grayscale images to an indexed image with 256 colors, simulating the hues of glowing iron (from dark red to yellow and white).

Exercise 13.3. Create an ImageJ plugin that shows the color table of an 8-bit indexed image as a new image with 16 × 16 rectangular color fields. Mark all unused color table entries in a suitable way. Look at Prog. 13.3 as a starting point.



13.5 Exercises Exercise 13.1. Create an ImageJ plugin that rotates the individual components of an RGB color image, that is, R→G→B→R. Exercise 13.2. Pseudocolors are sometimes used for displaying grayscale images (i.e., for viewing medical images with high dynamic range). Create an ImageJ plugin for converting 8-bit grayscale images to an indexed image with 256 colors, simulating the hues of glowing iron (from dark red to yellow and white). Exercise 13.3. Create an ImageJ plugin that shows the color table of an 8-bit indexed image as a new image with 16 × 16 rectangular color fields. Mark all unused color table entries in a suitable way. Look at Prog. 13.3 as a starting point. 17



17 See package imagingbook.pub.color.quantize. ImageJ also provides a (less flexible) native implementation (class ij.process.MedianCut).



Prog. 13.13 Median-cut color quantization (ImageJ plugin). This example uses class MedianCutQuantizer (imagingbook library) to quantize a full-color RGB image to an indexed color image (of type ByteProcessor) with 16 colors. The quantized image is finally displayed, the original image remains unchanged.






Fig. 13.30 Data structures used in the implementation of the median-cut quantization algorithm (class MedianCutQuantizer). The input image (ip) is a full-color RGB image. Initially, all unique image colors are extracted into an array (allColors) of ColorNode instances. Each such node holds the original color components and the associated frequency (cnt). The original colors are partitioned into nonoverlapping ColorBox objects, contained in set colormap. Each color box specifies a contiguous segment of colors in array allColors. To split a color box, the colors in the associated segment are first sorted (along the longest box dimension) and then separated into two new boxes. This effects very efficient set partitioning. Finally, method quantize() creates a new image by substituting every original color by the closest reference color in colormap.
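The per-dimension sorting mentioned above can be expressed with a java.util.Comparator; the following is only a sketch using a simplified ColorNode stub, not the imagingbook implementation:

import java.util.Arrays;
import java.util.Comparator;

class ColorNode {                // simplified stub with the fields named above
  int red, grn, blu, cnt;
}

class SortSketch {
  // Sort the segment [lower, upper] of allColors by the red component
  // (analogous comparators are used for the green and blue dimensions):
  static void sortByRed(ColorNode[] allColors, int lower, int upper) {
    Arrays.sort(allColors, lower, upper + 1,
        Comparator.comparingInt((ColorNode cn) -> cn.red));
  }
}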






import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ByteProcessor;
import ij.process.ColorProcessor;
import ij.process.ImageProcessor;
import imagingbook.pub.color.quantize.ColorQuantizer;
import imagingbook.pub.color.quantize.MedianCutQuantizer;

public class Median_Cut_Quantization implements PlugInFilter {
  static int NCOLORS = 16;

  public int setup(String arg, ImagePlus imp) {
    return DOES_RGB + NO_CHANGES;
  }

  public void run(ImageProcessor ip) {
    ColorProcessor cp = (ColorProcessor) ip;
    // create a quantizer:
    ColorQuantizer q = new MedianCutQuantizer(cp, NCOLORS);
    // quantize cp to an indexed color image:
    ByteProcessor qip = q.quantize(cp);
    (new ImagePlus("Quantized Index Color Image", qip)).show();
  }
}



Exercise 13.4. Show that a “desaturated” RGB pixel produced in the form (R, G, B) → (Y, Y, Y ), where Y is the equivalent luminance value (see Eqn. (13.10)), has the luminance Y as well.






Exercise 13.5. Extend the ImageJ plugin for desaturating color images in Prog. 13.5 such that the image is only modified inside the user-selected region of interest (ROI).






Fig. 13.31 Selective desaturation example. Original image with selected reference color cref = (250, 92, 150) (a), desaturated image (b). Gaussian saturation function f (H) (see Eqn. (13.57)) with reference hue Href = 0.9388 and σ = 0.1 (c).






Exercise 13.6. Write an ImageJ plugin that selectively desaturates an RGB image, preserving colors with a hue close to a given reference color cref = (Rref, Gref, Bref), with (HSV) hue Href (see the example in Fig. 13.31). Transform the image to HSV and modify the colors (cf. Eqn. (13.32)) in the form

    (Hdesat, Sdesat, Vdesat)⊺ = (H, f(H)·S, V)⊺,    (13.55)

where f(H) is a smooth saturation function, for example, a Gaussian function of the form

    f(H) = e^(−(H − Href)² / (2·σ²)) = gσ(H − Href),    (13.56)

with center Href and variance σ² (see Fig. 13.31(c)). Recall that the H component is circular in [0, 1). To obtain a continuous and periodic saturation function we note that H′ = H − Href is in the range [−1, 1] and reformulate f(H) as

    f(H) = gσ(H′ + 1)    for H′ < −0.5,
           gσ(H′)        for −0.5 ≤ H′ ≤ 0.5,
           gσ(H′ − 1)    for H′ > 0.5.    (13.57)



Verify the values of the function f(H); check in particular that it is 1 for the reference color! What would be a good (synthetic) color image for validating the saturation function? Use ImageJ’s color picker (pipette) tool to specify the reference color cref interactively.18

18 The current color pick is returned by the ImageJ method Toolbar.getForegroundColor().



Exercise 13.7. Calculate (analogous to Eqns. (13.44)–(13.45)) the complete transformation matrices for converting from (linear) RGB colors to YCbCr for the ITU-BT.709 (HDTV) standard with the coefficients wR = 0.2126, wB = 0.0722, and wG = 0.7152.

Exercise 13.8. Determining the similarity between images of different sizes is a frequent problem (e.g., in the context of image databases). Color statistics are commonly used for this purpose because they facilitate a coarse classification of images, such as landscape images, portraits, etc. However, 2D color histograms (as described in Sec. 13.3.2) are usually too large and thus cumbersome to use for this purpose. A simple idea could be to split the 2D histograms or even the full RGB histogram into K regions (bins) and to combine the corresponding entries into a K-dimensional feature vector, which could be used for a coarse comparison. Develop a concept for such a procedure, and also discuss the possible problems.

Exercise 13.9. Write a program (plugin) that generates a sequence of colors with constant hue and saturation but different brightness (value) in HSV space. Transform these colors to RGB and draw them into a new image. Verify (visually) if the hue really remains constant.

Exercise 13.10. When applying any type of filter in HSV or HLS color space one must keep in mind that the hue component H is circular in [0, 1) and thus shows a discontinuity at the 1 → 0 (360° → 0°) transition. For example, a linear filter would not take into account that H = 0.0 and H = 1.0 refer to the same hue (red) and thus cannot be applied directly to the H component. One solution is to filter the cosine and sine values of the H component (which really is an angle) instead, and to compose the filtered hue array from the filtered cos/sin values (see Sec. 15.1.3 for details). Based on this idea, implement a variable-sized linear Gaussian filter (see Sec. 4.2.7) for the HSV color space.

Exercise 13.11. Simplify the 3:3:2 quantization given in Prog. 13.12 such that only a single bit mask/shift step is performed for each color component.

Exercise 13.12. The median-cut algorithm for color quantization (Sec. 13.4.2) is implemented in the Independent JPEG Group’s19 libjpeg open source software with the following modification: the choice of the cube to be split next depends alternately on (a) the number of contained image pixels and (b) the cube’s geometric volume. Consider the possible motives and discuss examples where this approach may offer an improvement over the original algorithm.

Exercise 13.13. Implement a median-cut quantizer that allows specific colors (e.g., black and white) to be “forced” into the final color table. The existing implementation in the imagingbook library (class MedianCutQuantizer) may serve as a good starting point.



19 www.ijg.org.



Exercise 13.14. The signal-to-noise ratio (SNR) is a common measure for quantifying the loss of image quality introduced by color quantization. It is defined as the ratio between the average signal energy Psignal and the average noise energy Pnoise. For example, given a full-color RGB color image I and the associated quantized image I′ (both of size M × N), this ratio could be calculated as

    SNR(I, I′) = Psignal / Pnoise
               = [ Σ(u=0..M−1) Σ(v=0..N−1) ∥I(u, v)∥² ] / [ Σ(u=0..M−1) Σ(v=0..N−1) ∥I(u, v) − I′(u, v)∥² ].    (13.58)

Thus all deviations between the original and the quantized image are considered “noise”. The signal-to-noise ratio is usually specified on a logarithmic scale with the unit decibel (dB), that is,

    SNRlog(I, I′) = 10 · log10(SNR(I, I′)) [dB].    (13.59)



Implement the calculation of the SNR, as defined in Eqns. (13.58)– (13.59), for color images and compare the results for the median-cut and the octree algorithms for the same number of target colors.






14 Colorimetric Color Spaces



In any application that requires precise, reproducible, and deviceindependent presentation of colors, the use of calibrated color systems is an absolute necessity. For example, color calibration is routinely used throughout the digital print work flow but also in digital film production, professional photography, image databases, etc. One may have experienced how difficult it is, for example, to render a good photograph on a color laser printer, and even the color reproduction on monitors largely depends on the particular manufacturer and computer system. All the color spaces described in the previous chapter (see Sec. 13.2) somehow relate to the physical properties of some media device, such as the specific colors of the phosphor coatings inside a CRT tube or the colors of the inks used for printing. To make colors appear similar or even identical on different media modalities, we need a representation that is independent of how a particular device reproduces these colors. Color systems that describe colors in a measurable, device-independent fashion are called colorimetric or calibrated, and the field of color science is traditionally concerned with the properties and application of these color systems (see, e.g., [290] or [243] for an overview). While several colorimetric standards exist, we focus on the most widely used CIE systems in the remaining part of this section.



14.1 CIE Color Spaces

The XYZ color system, developed by the CIE (Commission Internationale de l’Éclairage)1 in the 1920s and standardized in 1931, is the foundation of most colorimetric color systems that are in use today [219, p. 22].



1 International Commission on Illumination (www.cie.co.at).






Fig. 14.1 The XYZ color space is defined by the three imaginary primary colors X, Y , Z, where the Y dimension corresponds to the perceived luminance. All visible colors are contained inside an open, cone-shaped volume that originates at the black point S (a), where E denotes the axis of neutral (gray) colors. The RGB color space maps to the XYZ space as a linearly distorted cube (b). See also Fig. 14.5(a).



14.1.1 CIE XYZ Color Space

The CIE XYZ color scheme was developed after extensive measurements of human visual perception under controlled conditions. It is based on three imaginary primary colors X, Y, Z, which are chosen such that all visible colors can be described as a summation of positive-only components, where the Y component corresponds to the perceived lightness or luminosity of a color. All visible colors lie inside a 3D cone-shaped region (Fig. 14.1(a)) which, interestingly enough, does not include the primary colors themselves.






Some common color spaces, and the RGB color space in particular, conveniently relate to XYZ space by a linear coordinate transformation, as described in Sec. 14.4. Thus, as shown in Fig. 14.1(b), the RGB color space is embedded in the XYZ space as a distorted cube, and therefore straight lines in RGB space map to straight lines in XYZ again. The CIE XYZ scheme is (similar to the RGB color space) nonlinear with respect to human visual perception, that is, a particular fixed distance in XYZ is not perceived as a uniform color change throughout the entire color space. The XYZ coordinates of the RGB color cube (based on the primary colors defined by ITU-R BT.709) are listed in Table 14.1.

14.1.2 CIE x, y Chromaticity

As mentioned, the luminance in XYZ color space increases along the Y axis, starting at the black point S located at the coordinate origin (X = Y = Z = 0). The color hue is independent of the luminance and thus independent of the Y value. To describe the corresponding “pure” color hues and saturation in a convenient manner, the CIE system also defines the three chromaticity values

    x = X/(X+Y+Z),    y = Y/(X+Y+Z),    z = Z/(X+Y+Z),    (14.1)

where (obviously) x + y + z = 1 and thus one of the three values (e.g., z) is redundant.



Table 14.1 Coordinates of the RGB color cube in CIE XYZ space. The X, Y, Z values refer to standard (ITU-R BT.709) primaries and white point D65 (see Table 14.2); x, y denote the corresponding CIE chromaticity coordinates.

Pt.  Color    R     G     B     X       Y       Z       x       y
S    Black    0.00  0.00  0.00  0.0000  0.0000  0.0000  0.3127  0.3290
R    Red      1.00  0.00  0.00  0.4125  0.2127  0.0193  0.6400  0.3300
Y    Yellow   1.00  1.00  0.00  0.7700  0.9278  0.1385  0.4193  0.5052
G    Green    0.00  1.00  0.00  0.3576  0.7152  0.1192  0.3000  0.6000
C    Cyan     0.00  1.00  1.00  0.5380  0.7873  1.0694  0.2247  0.3288
B    Blue     0.00  0.00  1.00  0.1804  0.0722  0.9502  0.1500  0.0600
M    Magenta  1.00  0.00  1.00  0.5929  0.2848  0.9696  0.3209  0.1542
W    White    1.00  1.00  1.00  0.9505  1.0000  1.0888  0.3127  0.3290



Equation (14.1) describes a central projection from X, Y, Z coordinates onto the 3D plane

    X + Y + Z = 1,    (14.2)



with the origin S as the projection center (Fig. 14.2). Thus, for an arbitrary XYZ color point A = (Xa, Ya, Za), the corresponding chromaticity coordinates a = (xa, ya, za) are found by intersecting the line SA with the X + Y + Z = 1 plane (Fig. 14.2(a)). The final x, y coordinates are the result of projecting these intersection points onto the X/Y-plane (Fig. 14.2(b)) by simply dropping the Z component za. The result is the well-known horseshoe-shaped CIE x, y chromaticity diagram, which is shown in Fig. 14.2(c). Any x, y point in this diagram defines the hue and saturation of a particular color, but only the colors inside the horseshoe curve are potentially visible. Obviously an infinite number of X, Y, Z colors (with different luminance values) project to the same x, y, z chromaticity values, and the XYZ color coordinates thus cannot be uniquely reconstructed from given chromaticity values. Additional information is required. For example, it is common to specify the visible colors of the CIE system in the form Yxy, where Y is the original luminance component of the XYZ color. Given a pair of chromaticity values x, y (with y > 0) and an arbitrary Y value, the missing X, Z coordinates are obtained (using the definitions in Eqn. (14.1)) as

    X = x · Y/y,    Z = z · Y/y = (1 − x − y) · Y/y.    (14.3)
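A minimal Java sketch of Eqns. (14.1) and (14.3), with illustrative method names (assuming X + Y + Z > 0 and y > 0):

static double[] xyzToXy(double X, double Y, double Z) {
  double s = X + Y + Z;                    // assumed to be > 0
  return new double[] {X / s, Y / s};      // chromaticity (x, y)
}

static double[] xyYToXyz(double x, double y, double Y) {
  double X = x * Y / y;                    // assumes y > 0
  double Z = (1 - x - y) * Y / y;
  return new double[] {X, Y, Z};
}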



The CIE diagram not only yields an intuitive layout of color hues but exhibits some remarkable formal properties. The xy values along the outer horseshoe boundary correspond to monochromatic (“spectrally pure”), maximally saturated colors with wavelengths ranging from below 400 nm (purple) up to 780 nm (red). Thus the position of any color inside the xy diagram can be specified with respect to any of the primary colors at the boundary, except for the points on the connecting line (“purple line”) between 380 and 780 nm, whose purple hues do not correspond to primary colors but can only be generated by mixing other colors. The saturation of colors falls off continuously toward the “neutral point” (E) at the center of the horseshoe, with x = y = 1/3 (or X = Y = Z = 1, respectively) and zero saturation. All other colorless (i.e., gray) values also map to the neutral point, just as any set of colors



Fig. 14.2 CIE x, y chromaticity diagram. For an arbitrary XYZ color point A = (Xa, Ya, Za), the chromaticity values a = (xa, ya, za) are obtained by a central projection onto the 3D plane X + Y + Z = 1 (a). The corner points of the RGB cube map to a triangle, and its white point W maps to the (colorless) neutral point N. The intersection points are then projected onto the X/Y plane (b) by simply dropping the Z component, which produces the familiar CIE chromaticity diagram shown in (c). The CIE diagram contains all visible color tones (hues and saturations) but no luminance information, with wavelengths in the range 380–780 nanometers. A particular color space is specified by at least three primary colors (tristimulus values; e.g., R, G, B), which define a triangle (linear hull) containing all representable colors.






with the same hue but different brightness corresponds to a single x, y point. All possible composite colors lie inside the convex hull specified by the coordinates of the primary colors of the CIE diagram and, in particular, complementary colors are located on straight lines that run diagonally through the white point.

14.1.3 Standard Illuminants



A central goal of colorimetry is the quantitative measurement of colors in physical reality, which strongly depends on the color properties of the illumination. The CIE system specifies a number of standard illuminants for a variety of real and hypothetical light sources, each specified by a spectral radiant power distribution and the “correlated color temperature” (expressed in degrees Kelvin) [290, Sec. 3.3.3]. The following daylight (D) illuminants are particularly important for the design of digital color spaces (see Table 14.2):



D50 emulates the spectrum of natural (direct) sunlight with an equivalent color temperature of approximately 5000° K. D50 is the recommended illuminant for viewing reflective images, such as paper prints. In practice, D50 lighting is commonly implemented with fluorescent lamps using multiple phosphors to approximate the specified color spectrum.

D65 has a correlated color temperature of approximately 6500° K and is designed to emulate the average (indirect) daylight observed under an overcast sky on the northern hemisphere. D65 is also used as the reference white for emittive devices, such as display screens.

The standard illuminants serve to specify the ambient viewing light but also to define the reference white points in various color spaces in the CIE color system. For example, the sRGB standard (see Sec. 14.4) refers to D65 as the media white point and D50 as the ambient viewing illuminant. In addition, the CIE system also specifies the range of admissible viewing angles (commonly at ±2°).



Table 14.2 CIE color parameters for the standard illuminants D50 and D65. N denotes the absolute neutral point in CIE XYZ space.

       °K     X        Y        Z        x       y
D50    5000   0.96429  1.00000  0.82510  0.3457  0.3585
D65    6500   0.95045  1.00000  1.08905  0.3127  0.3290
N             1.00000  1.00000  1.00000  0.3333  0.3333



14.1.4 Gamut

The set of all colors that can be handled by a certain media device or can be represented by a particular color space is called “gamut”. This is usually a contiguous region in the 3D CIE XYZ color space or, reduced to the representable color hues and ignoring the luminance component, a convex region in the 2D CIE chromaticity diagram. Figure 14.3 illustrates some typical gamut regions inside the CIE diagram.
The gamut of an output device mainly depends on the technology employed. For example, ordinary color monitors are typically not capable of displaying all colors of the gamut covered by the corresponding color space (usually sRGB). Conversely, it is also possible that devices would reproduce certain colors that cannot be represented in the utilized color space. Significant deviations exist, for example, between the RGB color space and the gamuts associated with CMYK-based printers. Also, media devices with very large gamuts exist, as demonstrated by the laser display system in Fig. 14.3. Representing such large gamuts and, in particular, transforming between different color representations requires adequately sized color spaces, such as the Adobe-RGB color space or CIELAB (described in Sec. 14.2), which covers the entire visible portion of the CIE diagram.

14.1.5 Variants of the CIE Color Space

The original CIEXYZ color space and the derived xy chromaticity diagram have the disadvantage that color differences are not perceived equally in different regions of the color space. For example,



Fig. 14.3 Gamut regions for different color spaces and output devices (CIELAB, Adobe RGB, sRGB, CMYK, and a laser display) inside the CIE diagram.



large color changes are perceived in the magenta region for a given shift in XYZ while the change is relatively small in the green region for the same coordinate distance. Several variants of the CIE color space have been developed for different purposes, primarily with the goal of creating perceptually uniform color representations without sacrificing the formal qualities of the CIE reference system. Popular CIE-derived color spaces include CIE YUV, YU′ V′ , YCbCr , and particularly CIELAB and CIELUV, which are described in the following sections. In addition, CIE-compliant specifications exist for most common color spaces (see Sec. 13.2), which allow more or less dependable conversions between almost any pair of color spaces.



14.2 CIELAB Color Space

The CIELAB color model (specified by CIE in 1976) was developed with the goal of linearizing the representation with respect to human color perception and at the same time creating a more intuitive color system. Since then, CIELAB2 has become a popular and widely used color model, particularly for high-quality photographic applications. It is used, for example, inside Adobe Photoshop as the standard model for converting between different color spaces. The dimensions in this color space are the luminosity L∗ and the two color components a∗, b∗, which specify the color hue and saturation along the green-red and blue-yellow axes, respectively. All three components are relative values and refer to the specified reference white point Cref = (Xref, Yref, Zref). In addition, a nonlinear correction function (similar to the modified gamma correction described in Sec. 3.7.6) is applied to all three components, as will be detailed further.

14.2.1 CIEXYZ → CIELAB Conversion

Several specifications for converting to and from CIELAB space exist, which differ marginally and for very small L values only.



2 Often CIELAB is simply referred to as the “Lab” color space.



Fig. 14.4 CIELAB components L∗, a∗, and b∗ shown as grayscale images. The contrast of the a∗ and b∗ images has been increased by 40% for better viewing.



The current specification for converting between CIEXYZ and CIELAB colors is defined by ISO Standard 13655 [142] as follows:

    L∗ = 116 · Y′ − 16,    (14.4)
    a∗ = 500 · (X′ − Y′),    (14.5)
    b∗ = 200 · (Y′ − Z′),    (14.6)

with

    X′ = f1(X/Xref),    Y′ = f1(Y/Yref),    Z′ = f1(Z/Zref),    (14.7)

    f1(c) = c^(1/3)           for c > ϵ,
            κ·c + 16/116      for c ≤ ϵ,    (14.8)

    ϵ = (6/29)³ = 216/24389 ≈ 0.008856,    (14.9)
    κ = (1/116)·(29/3)³ = 841/108 ≈ 7.787.    (14.10)



For the conversion in Eqn. (14.7), D65 is usually specified as the reference white point Cref = (Xref, Yref, Zref), that is, Xref = 0.95047, Yref = 1.0, and Zref = 1.08883 (see Table 14.2). The L∗ values are positive and typically in the range [0, 100] (often scaled to [0, 255]), but may theoretically be greater. Values for a∗ and b∗ are in the range [−127, +127]. Figure 14.4 shows the separation of a color image into the corresponding CIELAB components. Table 14.3 lists the relation between CIELAB and XYZ coordinates for selected RGB colors. The given R′G′B′ values are (nonlinear) sRGB coordinates with D65 as the reference white point.3 Figure 14.5(c) shows the transformation of the RGB color cube into the CIELAB color space.

14.2.2 CIELAB → CIEXYZ Conversion

The reverse transformation from CIELAB space to CIEXYZ coordinates is defined as follows:

    X = Xref · f2(L′ + a∗/500),    (14.11)
    Y = Yref · f2(L′),    (14.12)
    Z = Zref · f2(L′ − b∗/200),    (14.13)

with



    L′ = (L∗ + 16)/116    (14.14)

and

    f2(c) = c³                  for c³ > ϵ,
            (c − 16/116)/κ      for c³ ≤ ϵ,    (14.15)

and ϵ, κ as defined in Eqns. (14.9)–(14.10). The complete Java code for the CIELAB → XYZ conversion and the implementation of the associated ColorSpace class can be found in Progs. 14.1 and 14.2 (see pp. 446–447).

3 Note that sRGB colors in Java are specified with respect to white point D50, which explains certain numerical deviations (see Sec. 14.7).

Table 14.3 CIELAB coordinates for selected color points in sRGB. The sRGB components R′, G′, B′ are nonlinear (i.e., gamma-corrected); white point is D65 (see Table 14.2).

Pt.  Color     R′    G′    B′    X65     Y65     Z65     L∗      a∗       b∗
S    Black     0.00  0.00  0.00  0.0000  0.0000  0.0000    0.00    0.00     0.00
R    Red       1.00  0.00  0.00  0.4125  0.2127  0.0193   53.24   80.09    67.20
Y    Yellow    1.00  1.00  0.00  0.7700  0.9278  0.1385   97.14  −21.55    94.48
G    Green     0.00  1.00  0.00  0.3576  0.7152  0.1192   87.74  −86.18    83.18
C    Cyan      0.00  1.00  1.00  0.5380  0.7873  1.0694   91.11  −48.09   −14.13
B    Blue      0.00  0.00  1.00  0.1804  0.0722  0.9502   32.30   79.19  −107.86
M    Magenta   1.00  0.00  1.00  0.5929  0.2848  0.9696   60.32   98.24   −60.83
W    White     1.00  1.00  1.00  0.9505  1.0000  1.0888  100.00    0.00     0.00
K    50% Gray  0.50  0.50  0.50  0.2034  0.2140  0.2330   53.39    0.00     0.00
R75  75% Red   0.75  0.00  0.00  0.2155  0.1111  0.0101   39.77   64.51    54.13
R50  50% Red   0.50  0.00  0.00  0.0883  0.0455  0.0041   25.42   47.91    37.91
R25  25% Red   0.25  0.00  0.00  0.0210  0.0108  0.0010    9.66   29.68    15.24
P    Pink      1.00  0.50  0.50  0.5276  0.3812  0.2482   68.11   48.39    22.83
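For illustration only, a minimal Java sketch of the forward XYZ → CIELAB mapping of Eqns. (14.4)–(14.10), assuming the D65 reference white; it is not the implementation of Progs. 14.1 and 14.2:

static final double EPS = 216.0 / 24389;    // ~0.008856, Eqn. (14.9)
static final double KAPPA = 841.0 / 108;    // ~7.787, Eqn. (14.10)
static final double XREF = 0.95047, YREF = 1.0, ZREF = 1.08883;   // D65

static double f1(double c) {
  return (c > EPS) ? Math.cbrt(c) : KAPPA * c + 16.0 / 116;
}

static double[] xyzToLab(double X, double Y, double Z) {
  double xp = f1(X / XREF), yp = f1(Y / YREF), zp = f1(Z / ZREF);
  return new double[] {116 * yp - 16, 500 * (xp - yp), 200 * (yp - zp)};
}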



14.3 CIELUV Color Space 14.3.1 CIEXYZ → CIELUV Conversion The CIELUV component values L∗ , u∗ , v∗ are calculated from given X, Y , Z color coordinates as follows: L∗ = 116 · Y ′ − 16, ∗















u = 13 · L · v = 13 · L ·



(14.16)







(u − u′ref ), ′ (v ′ − vref ),



(14.17) (14.18)



with Y ′ as defined in Eqn. (14.7) (i.e., identical to CIELAB) and u′ = fu (X, Y, Z),



u′ref = fu (Xref , Yref , Zref ),



v ′ = fv (X, Y, Z),



′ vref = fv (Xref , Yref , Zref ),



with the correction functions ( fu (X, Y, Z) = fv (X, Y, Z) =



(



(14.19)



0



for X = 0, for X > 0,



(14.20)



0



for Y = 0, for Y > 0.



(14.21)



4X X+15Y +3Z 9Y X+15Y +3Z



Note that checking for zero X or Y in Eqns. (14.20)–(14.21) is not part of the original definitions but is essential in any real implementation to avoid divisions by zero.4 4



432



Remember though that floating-point values (double, float) should never be strictly tested against zero but compared to a sufficiently small (epsilon) quantity (see Sec. J.1.8 of the Appendix).



Linear RGB



sRGB



(a)



XYZ



(b)



(c)



CIELAB



(d)



(e)



CIELUV



(f)



Fig. 14.5 Transformation of the RGB color cube to the XYZ, CIELAB, and CIELUV color space. The left column shows the color cube in linear RGB space, the right column in nonlinear sRGB space. Both RGB volumes were uniformly subdivided into 10 × 10 × 10 cubes of equal size. In both cases, the transformation to XYZ space (a, b) yields a distorted cube with straight edges and planar faces. Due to the linear transformation from RGB to XYZ, the subdivision of the RGB cube remains uniform (a). However, the nonlinear transformation (due to gamma correction) from sRGB to XYZ makes the tesselation strongly nonuniform in XYZ space (b). Since CIELAB uses gamma correction as well, the transformation of the linear RGB cube in (c) appears much less uniform than the nonlinear sRGB cube in (d), although this appears to be the other way round in CIELUV (e, f). Note that the RGB/sRGB color cube maps to a non-convex volume in both the CIELAB and the CIELUV space.



14.3.2 CIELUV → CIEXYZ Conversion

The reverse mapping from L∗, u∗, v∗ components to X, Y, Z coordinates is defined as follows:

    Y = Yref · f2((L∗ + 16)/116),    (14.22)

with f2() as defined in Eqn. (14.15), and

    X = Y · 9u′/(4v′),    Z = Y · (12 − 3u′ − 20v′)/(4v′),    (14.23)

with

    (u′, v′) = (u′ref, v′ref)                            for L∗ = 0,
               (u′ref, v′ref) + 1/(13·L∗) · (u∗, v∗)     for L∗ > 0,    (14.24)

and u′ref, v′ref as in Eqn. (14.19).5

5 No explicit check for zero denominators is required in Eqn. (14.23) since v′ can be assumed to be greater than zero.

Table 14.4 CIELUV coordinates for selected color points in sRGB. Reference white point is D65. The L∗ values are identical to CIELAB (see Table 14.3).

Pt.  Color     R′    G′    B′    X65     Y65     Z65     L∗      u∗       v∗
S    Black     0.00  0.00  0.00  0.0000  0.0000  0.0000    0.00    0.00     0.00
R    Red       1.00  0.00  0.00  0.4125  0.2127  0.0193   53.24  175.01    37.75
Y    Yellow    1.00  1.00  0.00  0.7700  0.9278  0.1385   97.14    7.70   106.78
G    Green     0.00  1.00  0.00  0.3576  0.7152  0.1192   87.74  −83.08   107.39
C    Cyan      0.00  1.00  1.00  0.5380  0.7873  1.0694   91.11  −70.48   −15.20
B    Blue      0.00  0.00  1.00  0.1804  0.0722  0.9502   32.30   −9.40  −130.34
M    Magenta   1.00  0.00  1.00  0.5929  0.2848  0.9696   60.32   84.07  −108.68
W    White     1.00  1.00  1.00  0.9505  1.0000  1.0888  100.00    0.00     0.00
K    50% Gray  0.50  0.50  0.50  0.2034  0.2140  0.2330   53.39    0.00     0.00
R75  75% Red   0.75  0.00  0.00  0.2155  0.1111  0.0101   39.77  130.73    28.20
R50  50% Red   0.50  0.00  0.00  0.0883  0.0455  0.0041   25.42   83.56    18.02
R25  25% Red   0.25  0.00  0.00  0.0210  0.0108  0.0010    9.66   31.74     6.85
P    Pink      1.00  0.50  0.50  0.5276  0.3812  0.2482   68.11   92.15    19.88



14.3.3 Measuring Color Differences

Due to its high uniformity with respect to human color perception, the CIELAB color space is a particularly good choice for determining the difference between colors (the same holds for the CIELUV space) [113, p. 57]. The difference between two color points c1 = (L∗1, a∗1, b∗1) and c2 = (L∗2, a∗2, b∗2) can be found by simply calculating the Euclidean distance in CIELAB or CIELUV space, for example,

    dist(c1, c2) = ∥c1 − c2∥    (14.25)
                 = √((L∗1 − L∗2)² + (a∗1 − a∗2)² + (b∗1 − b∗2)²).    (14.26)
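A minimal Java sketch of this color distance (Eqns. (14.25)–(14.26)) for CIELAB triplets, with an illustrative method name:

static double colorDistance(double[] lab1, double[] lab2) {
  double dL = lab1[0] - lab2[0];
  double da = lab1[1] - lab2[1];
  double db = lab1[2] - lab2[2];
  return Math.sqrt(dL * dL + da * da + db * db);   // Euclidean distance
}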



14.4 Standard RGB (sRGB)

CIE-based color spaces such as CIELAB (and CIELUV) are device-independent and have a gamut sufficiently large to represent virtually all visible colors in the CIEXYZ system. However, in many computer-based, display-oriented applications, such as computer graphics or multimedia, the direct use of CIE-based color spaces may be too cumbersome or inefficient.
sRGB (“standard RGB” [141]) was developed (jointly by Hewlett-Packard and Microsoft) with the goal of creating a precisely specified color space for these applications, based on standardized mappings with respect to the colorimetric CIEXYZ color space. This includes precise specifications of the three primary colors, the white reference






point, ambient lighting conditions, and gamma values. Interestingly, the sRGB color specification is the same as the one specified many years before for the European PAL/SECAM television standards. Compared to CIELAB, sRGB exhibits a relatively small gamut (see Fig. 14.3), which, however, includes most colors that can be reproduced by current computer and video monitors. Although sRGB was not designed as a universal color space, its CIE-based specification at least permits more or less exact conversions to and from other color spaces.
Several standard image formats, including EXIF (JPEG) and PNG, are based on sRGB color data, which makes sRGB the de facto standard for digital still cameras, color printers, and other imaging devices at the consumer level [128]. sRGB is used as a relatively dependable archive format for digital images, particularly in less demanding applications that do not require (or allow) explicit color management [254]. Thus, in practice, working with any RGB color data almost always means dealing with sRGB. It is thus no coincidence that sRGB is also the common color scheme in Java and is extensively supported by the Java standard API (see Sec. 14.7 for details).
Table 14.5 lists the key parameters of the sRGB color space (i.e., the XYZ coordinates for the primary colors R, G, B and the white point W (D65)), which are defined according to ITU-R BT.709 [144] (see Tables 14.1 and 14.2). Together, these values permit the unambiguous mapping of all other colors in the CIE diagram.



Table 14.5 sRGB tristimulus values R, G, B with reference to the white point W = D65.

Pt.  R    G    B    X65       Y65       Z65       x65     y65
R    1.0  0.0  0.0  0.412453  0.212671  0.019334  0.6400  0.3300
G    0.0  1.0  0.0  0.357580  0.715160  0.119193  0.3000  0.6000
B    0.0  0.0  1.0  0.180423  0.072169  0.950227  0.1500  0.0600
W    1.0  1.0  1.0  0.950456  1.000000  1.088754  0.3127  0.3290



14.4.1 Linear vs. Nonlinear Color Components sRGB is a nonlinear color space with respect to the XYZ coordinate system, and it is important to carefully distinguish between the linear and nonlinear RGB component values. The nonlinear values (denoted R′ , G′ , B ′ ) represent the actual color tuples, the data values read from an image file or received from a digital camera. These values are pre-corrected with a fixed Gamma (≈ 2.2) such that they can be easily viewed on a common color monitor without any additional conversion. The corresponding linear components (denoted R, G, B) relate to the CIEXYZ color space by a linear mapping and can thus be computed from X, Y, Z coordinates and vice versa by simple matrix multiplication, that is,         R X X R G = MRGB · Y  and  Y  = M−1 ·G , (14.27) RGB B Z Z B with



  M_RGB = [  3.240479  −1.537150  −0.498535
            −0.969256   1.875992   0.041556
             0.055648  −0.204043   1.057311 ],                (14.28)

  M_RGB^(−1) = [  0.412453  0.357580  0.180423
                  0.212671  0.715160  0.072169
                  0.019334  0.119193  0.950227 ].             (14.29)
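These two matrices can be applied to color vectors by plain matrix-vector multiplication. The following minimal Java sketch is not part of the book's code; the class and method names are chosen for illustration only, and it simply evaluates Eqns. (14.28) and (14.29) for a single color:

    // Minimal sketch: converting between linear RGB and CIEXYZ (D65)
    // with the matrices of Eqns. (14.28) and (14.29).
    public class LinearRgbXyz {

      // M_RGB (Eqn. 14.28): maps (X,Y,Z) to linear (R,G,B)
      static final double[][] M_RGB = {
        { 3.240479, -1.537150, -0.498535},
        {-0.969256,  1.875992,  0.041556},
        { 0.055648, -0.204043,  1.057311}};

      // M_RGB^-1 (Eqn. 14.29): maps linear (R,G,B) to (X,Y,Z)
      static final double[][] M_RGB_INV = {
        {0.412453, 0.357580, 0.180423},
        {0.212671, 0.715160, 0.072169},
        {0.019334, 0.119193, 0.950227}};

      static double[] multiply(double[][] M, double[] v) {
        double[] result = new double[3];
        for (int i = 0; i < 3; i++) {
          result[i] = M[i][0]*v[0] + M[i][1]*v[1] + M[i][2]*v[2];
        }
        return result;
      }

      public static void main(String[] args) {
        double[] xyz = {0.412453, 0.212671, 0.019334}; // XYZ of the R primary (Table 14.5)
        double[] rgb = multiply(M_RGB, xyz);           // should come out close to (1, 0, 0)
        System.out.printf("linear RGB = (%.4f, %.4f, %.4f)%n", rgb[0], rgb[1], rgb[2]);
        double[] back = multiply(M_RGB_INV, rgb);      // back to XYZ
        System.out.printf("XYZ = (%.4f, %.4f, %.4f)%n", back[0], back[1], back[2]);
      }
    }

Running the sketch on the XYZ coordinates of the red primary should (up to rounding) reproduce the unit vector (1, 0, 0), which is a quick consistency check of the two matrices.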



Notice that the column vectors of M_RGB^(−1) (Eqn. (14.29)) are the coordinates of the three primary colors R, G, B (tristimulus values) in XYZ space (cf. Table 14.5), and thus

  R = M_RGB^(−1) · (1, 0, 0)^T,   G = M_RGB^(−1) · (0, 1, 0)^T,   B = M_RGB^(−1) · (0, 0, 1)^T.   (14.30)

14.4.2 CIEXYZ → sRGB Conversion



To transform a given XYZ color to sRGB (Fig. 14.6), we first compute the linear R, G, B values by multiplying the (X, Y, Z) coordinate vector with the matrix M_RGB (Eqn. (14.28)),

  (R, G, B)^T = M_RGB · (X, Y, Z)^T.   (14.31)

Subsequently, a modified gamma correction (see Sec. 3.7.6) with γ = 2.4 (which corresponds to an effective gamma value of ca. 2.2) is applied to the linear R, G, B values,

  R′ = f1(R),   G′ = f1(G),   B′ = f1(B),   (14.32)

with

  f1(c) = 12.92·c                  for c ≤ 0.0031308,
          1.055·c^(1/2.4) − 0.055  for c > 0.0031308.   (14.33)

Fig. 14.6: Color transformation from CIEXYZ to sRGB. The (X, Y, Z) coordinates are first converted to linear (R, G, B) values by the linear mapping M_RGB; gamma correction f_γ() then yields the nonlinear sRGB components (R′, G′, B′).

The resulting sRGB components R′, G′, B′ are limited to the interval [0, 1] (see Table 14.6). To obtain discrete numbers, the R′, G′, B′ values are finally scaled linearly to the 8-bit integer range [0, 255].

14.4.3 sRGB → CIEXYZ Conversion

To calculate the reverse transformation from sRGB to XYZ, the given (nonlinear) R′G′B′ values (in the range [0, 1]) are first linearized by inverting the gamma correction (Eqn. (14.33)), that is,

  R = f2(R′),   G = f2(G′),   B = f2(B′),   (14.34)

with

  f2(c′) = c′/12.92                  for c′ ≤ 0.04045,
           ((c′ + 0.055)/1.055)^2.4  for c′ > 0.04045.   (14.35)

Subsequently, the linearized (R, G, B) vector is transformed to XYZ coordinates by multiplication with the inverse of the matrix M_RGB (see Eqn. (14.29)), i.e.,

  (X, Y, Z)^T = M_RGB^(−1) · (R, G, B)^T.   (14.36)

Table 14.6: CIEXYZ coordinates for selected sRGB colors. The table lists the nonlinear R′, G′, B′ components, the linearized R, G, B values, and the corresponding X, Y, Z coordinates (for white point D65). The linear and nonlinear RGB values are identical for the extremal points of the RGB color cube S, ..., W (top rows) because the gamma correction does not affect 0 and 1 component values. However, intermediate colors K, ..., P (lower rows) may exhibit large differences between the nonlinear and linear components (e.g., compare the R′ and R values for R25).

  Pt.  Color     R′    G′    B′    R       G       B       X65     Y65     Z65
  S    Black     0.00  0.00  0.00  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
  R    Red       1.00  0.00  0.00  1.0000  0.0000  0.0000  0.4125  0.2127  0.0193
  Y    Yellow    1.00  1.00  0.00  1.0000  1.0000  0.0000  0.7700  0.9278  0.1385
  G    Green     0.00  1.00  0.00  0.0000  1.0000  0.0000  0.3576  0.7152  0.1192
  C    Cyan      0.00  1.00  1.00  0.0000  1.0000  1.0000  0.5380  0.7873  1.0694
  B    Blue      0.00  0.00  1.00  0.0000  0.0000  1.0000  0.1804  0.0722  0.9502
  M    Magenta   1.00  0.00  1.00  1.0000  0.0000  1.0000  0.5929  0.2848  0.9696
  W    White     1.00  1.00  1.00  1.0000  1.0000  1.0000  0.9505  1.0000  1.0888
  K    50% Gray  0.50  0.50  0.50  0.2140  0.2140  0.2140  0.2034  0.2140  0.2330
  R75  75% Red   0.75  0.00  0.00  0.5225  0.0000  0.0000  0.2155  0.1111  0.0101
  R50  50% Red   0.50  0.00  0.00  0.2140  0.0000  0.0000  0.0883  0.0455  0.0041
  R25  25% Red   0.25  0.00  0.00  0.0509  0.0000  0.0000  0.0210  0.0108  0.0010
  P    Pink      1.00  0.50  0.50  1.0000  0.2140  0.2140  0.5276  0.3812  0.2482
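The two gamma functions f1 and f2 of Eqns. (14.33) and (14.35) are easily implemented directly. The following compact Java sketch is illustrative only (it is not the book's sRgbUtil class; names are assumptions) and reproduces the round trip for one table entry:

    // Minimal sketch of the sRGB gamma correction (Eqns. (14.33) and (14.35)).
    public class SRgbGamma {

      // f1: linear component c in [0,1] -> nonlinear sRGB component (Eqn. 14.33)
      static double f1(double c) {
        return (c <= 0.0031308) ?
            12.92 * c :
            1.055 * Math.pow(c, 1.0 / 2.4) - 0.055;
      }

      // f2: nonlinear sRGB component c' in [0,1] -> linear component (Eqn. 14.35)
      static double f2(double cp) {
        return (cp <= 0.04045) ?
            cp / 12.92 :
            Math.pow((cp + 0.055) / 1.055, 2.4);
      }

      public static void main(String[] args) {
        double c = 0.2140;                   // linear value of 50% gray (Table 14.6)
        double cp = f1(c);                   // should be close to 0.50
        System.out.printf("f1(%.4f) = %.4f%n", c, cp);
        System.out.printf("f2(f1(c)) = %.4f%n", f2(cp));  // round trip, ~0.2140
      }
    }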



14.4.4 Calculations with Nonlinear sRGB Values

Due to the wide use of sRGB in digital photography, graphics, multimedia, Internet imaging, etc., there is a high probability that a given image is encoded in sRGB colors. If, for example, a JPEG image is opened with ImageJ or Java, the pixel values in the resulting data array are media-oriented (i.e., nonlinear R′, G′, B′ components of the sRGB color space). Unfortunately, this fact is often overlooked by programmers, with the consequence that colors are incorrectly manipulated and reproduced. As a general rule, any arithmetic operation on color values should always be performed on the linearized R, G, B components, which are obtained from the nonlinear R′, G′, B′ values through the inverse gamma function f2 (Eqn. (14.35)) and converted back again with f1 (Eqn. (14.33)).

Example: color to grayscale conversion

The principle of converting RGB colors to grayscale values by computing a weighted sum of the color components was described already in Sec. 13.2.1, where we simply ignored the issue of possible nonlinearities. As one may have guessed, however, the variables R, G, B, and Y in Eqn. (13.9) on p. 388,

  Y = 0.2125·R + 0.7154·G + 0.0721·B,   (14.37)

implicitly refer to linear color and brightness values, respectively, and not to the raw sRGB values! Based on Eqn. (14.37), the correct grayscale conversion from raw (nonlinear) sRGB components R′, G′, B′ is

  Y′ = f1( 0.2125·f2(R′) + 0.7154·f2(G′) + 0.0721·f2(B′) ),   (14.38)



with f1() and f2() as defined in Eqns. (14.33) and (14.35), respectively. The result (Y′) is again a nonlinear, sRGB-compatible gray value; that is, the sRGB color vector (Y′, Y′, Y′) should give the same perceived luminance as the original color (R′, G′, B′).

Note that setting the components of an sRGB color pixel to three arbitrary but identical values Y′,

  (R′, G′, B′) ← (Y′, Y′, Y′),

always creates a gray (colorless) pixel, despite the nonlinearities of the sRGB space. This is due to the fact that the gamma correction (Eqns. (14.33) and (14.35)) applies evenly to all three color components, and thus any three identical values map to a (linearized) color on the straight gray line between the black point S and the white point W in XYZ space (cf. Fig. 14.1(b)).

For many applications, however, the following approximation to the exact grayscale conversion in Eqn. (14.38) is sufficient. It works without linearizing the sRGB values (i.e., directly on the nonlinear R′, G′, B′ components) by computing the linear combination

  Y′ ≈ w′_R·R′ + w′_G·G′ + w′_B·B′,   (14.39)

with a slightly different set of weights, for example, w′_R = 0.309, w′_G = 0.609, w′_B = 0.082, as proposed in [212]. The resulting quantity Y′ is sometimes called luma (as opposed to the luminance Y obtained by Eqn. (14.37)).
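The difference between the exact conversion (Eqn. (14.38)) and the luma approximation (Eqn. (14.39)) is easy to try out. The following Java sketch is illustrative only (class and method names are assumptions, not the book's library):

    // Comparing the exact grayscale conversion of Eqn. (14.38) with the
    // luma approximation of Eqn. (14.39) for a single sRGB pixel.
    public class GrayConversionDemo {

      static double gammaFwd(double c) {   // f1, Eqn. (14.33)
        return (c <= 0.0031308) ? 12.92 * c : 1.055 * Math.pow(c, 1 / 2.4) - 0.055;
      }

      static double gammaInv(double cp) {  // f2, Eqn. (14.35)
        return (cp <= 0.04045) ? cp / 12.92 : Math.pow((cp + 0.055) / 1.055, 2.4);
      }

      // exact, sRGB-compatible gray value Y' (Eqn. 14.38)
      static double exactGray(double rp, double gp, double bp) {
        double y = 0.2125 * gammaInv(rp) + 0.7154 * gammaInv(gp) + 0.0721 * gammaInv(bp);
        return gammaFwd(y);
      }

      // luma approximation (Eqn. 14.39), computed directly on nonlinear components
      static double luma(double rp, double gp, double bp) {
        return 0.309 * rp + 0.609 * gp + 0.082 * bp;
      }

      public static void main(String[] args) {
        double rp = 1.0, gp = 0.5, bp = 0.5;   // "pink" (nonlinear sRGB)
        System.out.printf("exact Y' = %.4f%n", exactGray(rp, gp, bp));
        System.out.printf("luma  Y' = %.4f%n", luma(rp, gp, bp));
      }
    }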



14.5 Adobe RGB

A distinct weakness of sRGB is its relatively small gamut, which is limited to the range of colors reproducible by ordinary color monitors. This causes problems, for example, in printing, where larger gamuts are needed, particularly in the green regions. The "Adobe RGB (1998)" [2] color space, developed by Adobe as their own standard, is based on the same general concept as sRGB but exhibits a significantly larger gamut (Fig. 14.3), which extends its use particularly to print applications. Figure 14.7 shows the noticeable difference between the sRGB and Adobe RGB gamuts in 3D CIEXYZ color space. The neutral point of Adobe RGB corresponds to the D65 standard (with x = 0.3127, y = 0.3290), and the gamma value is 2.199 (compared with 2.4 for sRGB) for the forward correction and 1/2.199 for the inverse correction. The associated file specification provides for a number of different codings (8- to 16-bit integer and 32-bit floating point) for the color components. Adobe RGB is frequently used in professional photography as an alternative to the CIELAB color space and for picture archive applications.



Fig. 14.7: Gamuts of sRGB (a) and Adobe RGB (b) shown in CIELAB color space. The volume of the sRGB gamut (a) is significantly smaller than the Adobe RGB gamut (b), particularly in the green color region. The tessellation corresponds to a uniform subdivision of the original RGB cubes (in the respective color spaces).



14.6 Chromatic Adaptation

The human eye has the capability to interpret colors as being constant under varying viewing conditions and illumination in particular. A white sheet of paper appears white to us in bright daylight as well as under fluorescent lighting, although the spectral composition of the light that enters the eye is completely different in both situations. The CIE color system takes into account the color temperature of the ambient lighting because the exact interpretation of XYZ color values also requires knowledge of the corresponding reference white point. For example, a color value (X, Y, Z) specified with respect to the D50 reference white point is generally perceived differently when reproduced by a D65-based media device, although the absolute (i.e., measured) color is the same. Thus the actual meaning of XYZ values cannot be known without knowing the corresponding white point. This is known as relative colorimetry.

If colors are specified with respect to different white points, for example W1 = (X_W1, Y_W1, Z_W1) and W2 = (X_W2, Y_W2, Z_W2), they can be related by first applying a so-called chromatic adaptation transformation (CAT) [136, Ch. 34] in XYZ color space. This transformation determines, for given color coordinates (X1, Y1, Z1) and the associated white point W1, the new color coordinates (X2, Y2, Z2) relative to another white point W2.

14.6.1 XYZ Scaling

The simplest chromatic adaptation method is XYZ scaling, where the individual color coordinates are individually multiplied by the ratios of the corresponding white point coordinates, that is,

  X2 = X1 · (X̂2/X̂1),   Y2 = Y1 · (Ŷ2/Ŷ1),   Z2 = Z1 · (Ẑ2/Ẑ1).   (14.40)

For example, for converting colors (X65, Y65, Z65) related to the white point D65 = (X̂65, Ŷ65, Ẑ65) to the corresponding colors for white point D50 = (X̂50, Ŷ50, Ẑ50) (see Table 14.2), the concrete scaling is



  X50 = X65 · (X̂50/X̂65) = X65 · (0.964296/0.950456) = X65 · 1.01456,
  Y50 = Y65 · (Ŷ50/Ŷ65) = Y65 · (1.000000/1.000000) = Y65,
  Z50 = Z65 · (Ẑ50/Ẑ65) = Z65 · (0.825105/1.088754) = Z65 · 0.757843.   (14.41)



This simple form of scaling color coordinates in XYZ space is usually not considered a good color adaptation model and is thus not recommended for high-quality applications.

14.6.2 Bradford Color Adaptation

The most common chromatic adaptation models are based on scaling the color coordinates not directly in XYZ but in a "virtual" R*G*B* color space obtained from the XYZ values by a linear transformation

  (R*, G*, B*)^T = M_CAT · (X, Y, Z)^T,   (14.42)

where M_CAT is a 3 × 3 transformation matrix (defined in Eqn. (14.45)). After appropriate scaling, the R*G*B* coordinates are transformed back to XYZ, so the complete adaptation transform from color coordinates X1, Y1, Z1 (w.r.t. white point W1 = (X_W1, Y_W1, Z_W1)) to the new color coordinates X2, Y2, Z2 (w.r.t. white point W2 = (X_W2, Y_W2, Z_W2)) takes the form

  (X2, Y2, Z2)^T = M_CAT^(−1) · diag( R*_W2/R*_W1, G*_W2/G*_W1, B*_W2/B*_W1 ) · M_CAT · (X1, Y1, Z1)^T,   (14.43)

where the diagonal elements are the (constant) ratios of the R*G*B* values of the white points W2 and W1, respectively, i.e.,

  (R*_W1, G*_W1, B*_W1)^T = M_CAT · (X_W1, Y_W1, Z_W1)^T,   (R*_W2, G*_W2, B*_W2)^T = M_CAT · (X_W2, Y_W2, Z_W2)^T.   (14.44)

The "Bradford" model [136, p. 590] specifies for Eqn. (14.43) the particular transformation matrix

  M_CAT = [  0.8951   0.2664  −0.1614
            −0.7502   1.7135   0.0367
             0.0389  −0.0685   1.0296 ].   (14.45)
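Equation (14.43) can be evaluated numerically once the two white points are fixed. The following Java sketch is illustrative only (it is not the book's BradfordAdaptation class; names are assumptions) and composes the complete adaptation matrix from M_CAT and the white points of Table 14.2:

    // Composing the Bradford adaptation matrix of Eqn. (14.43)
    // for two given white points, using M_CAT from Eqn. (14.45).
    public class BradfordDemo {

      static final double[][] M_CAT = {
        { 0.8951,  0.2664, -0.1614},
        {-0.7502,  1.7135,  0.0367},
        { 0.0389, -0.0685,  1.0296}};

      static double[] mul(double[][] A, double[] v) {
        double[] r = new double[3];
        for (int i = 0; i < 3; i++)
          r[i] = A[i][0]*v[0] + A[i][1]*v[1] + A[i][2]*v[2];
        return r;
      }

      static double[][] mul(double[][] A, double[][] B) {
        double[][] C = new double[3][3];
        for (int i = 0; i < 3; i++)
          for (int j = 0; j < 3; j++)
            for (int k = 0; k < 3; k++)
              C[i][j] += A[i][k] * B[k][j];
        return C;
      }

      static double[][] inverse(double[][] A) {   // 3x3 inverse via the adjugate
        double det =
            A[0][0]*(A[1][1]*A[2][2] - A[1][2]*A[2][1])
          - A[0][1]*(A[1][0]*A[2][2] - A[1][2]*A[2][0])
          + A[0][2]*(A[1][0]*A[2][1] - A[1][1]*A[2][0]);
        double[][] inv = new double[3][3];
        for (int i = 0; i < 3; i++)
          for (int j = 0; j < 3; j++) {
            // cyclic cofactor formula (sign is handled implicitly)
            int r1 = (j + 1) % 3, r2 = (j + 2) % 3;
            int c1 = (i + 1) % 3, c2 = (i + 2) % 3;
            inv[i][j] = (A[r1][c1]*A[r2][c2] - A[r1][c2]*A[r2][c1]) / det;
          }
        return inv;
      }

      // Eqn. (14.43): M = M_CAT^-1 · diag(W2*/W1*) · M_CAT
      static double[][] adaptationMatrix(double[] W1, double[] W2) {
        double[] rgbW1 = mul(M_CAT, W1);
        double[] rgbW2 = mul(M_CAT, W2);
        double[][] D = new double[3][3];
        for (int i = 0; i < 3; i++)
          D[i][i] = rgbW2[i] / rgbW1[i];
        return mul(inverse(M_CAT), mul(D, M_CAT));
      }

      public static void main(String[] args) {
        double[] D65 = {0.950456, 1.000000, 1.088754};   // Table 14.2
        double[] D50 = {0.964296, 1.000000, 0.825105};
        double[][] M = adaptationMatrix(D65, D50);       // should match Eqn. (14.46)
        for (double[] row : M)
          System.out.printf("%9.6f %9.6f %9.6f%n", row[0], row[1], row[2]);
      }
    }

The printed matrix should agree (up to rounding) with the D65 → D50 transformation given in Eqn. (14.46) below.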



Inserting the matrix M_CAT in Eqn. (14.43) gives the complete chromatic adaptation. For example, the resulting transformation for converting from D65-based to D50-based colors (i.e., W1 = D65, W2 = D50, as listed in Table 14.2) is

  (X50, Y50, Z50)^T = M_50|65 · (X65, Y65, Z65)^T
                    = [  1.047884  0.022928  −0.050149
                         0.029603  0.990437  −0.017059
                        −0.009235  0.015042   0.752085 ] · (X65, Y65, Z65)^T,   (14.46)

and conversely, from D50-based to D65-based colors (i.e., W1 = D50, W2 = D65),

  (X65, Y65, Z65)^T = M_65|50 · (X50, Y50, Z50)^T = M_50|65^(−1) · (X50, Y50, Z50)^T
                    = [  0.955513  −0.023079  0.063190
                        −0.028348   1.009992  0.021019
                         0.012300  −0.020484  1.329993 ] · (X50, Y50, Z50)^T.   (14.47)

Figure 14.8 illustrates the effects of adaptation from the D65 white point to D50 in the CIE x, y chromaticity diagram. A short list of corresponding color coordinates is given in Table 14.7. The Bradford model is a widely used chromatic adaptation scheme, but several similar procedures have been proposed (see also Exercise 14.1). Generally speaking, chromatic adaptation and related problems have a long history in color engineering and are still active fields of scientific research [290, Sec. 5.12].

Fig. 14.8: Bradford chromatic adaptation from white point D65 to D50, shown in the CIE x, y chromaticity diagram. The solid triangle represents the original RGB gamut for white point D65, with the primaries (R, G, B) located at the corner points. The dashed triangle is the corresponding gamut after chromatic adaptation to white point D50 (see Table 14.8 for concrete x/y coordinates).

Table 14.7: Bradford chromatic adaptation from white point D65 to D50 for selected sRGB colors. The XYZ coordinates X65, Y65, Z65 relate to the original white point D65 (W1). X50, Y50, Z50 are the corresponding coordinates for the new white point D50 (W2), obtained with the Bradford adaptation according to Eqn. (14.46).

  Pt.  Color     R′    G′   B′   X65     Y65     Z65     X50     Y50     Z50
  S    Black     0.00  0.0  0.0  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
  R    Red       1.00  0.0  0.0  0.4125  0.2127  0.0193  0.4361  0.2225  0.0139
  Y    Yellow    1.00  1.0  0.0  0.7700  0.9278  0.1385  0.8212  0.9394  0.1110
  G    Green     0.00  1.0  0.0  0.3576  0.7152  0.1192  0.3851  0.7169  0.0971
  C    Cyan      0.00  1.0  1.0  0.5380  0.7873  1.0694  0.5282  0.7775  0.8112
  B    Blue      0.00  0.0  1.0  0.1804  0.0722  0.9502  0.1431  0.0606  0.7141
  M    Magenta   1.00  0.0  1.0  0.5929  0.2848  0.9696  0.5792  0.2831  0.7280
  W    White     1.00  1.0  1.0  0.9505  1.0000  1.0888  0.9643  1.0000  0.8251
  K    50% Gray  0.50  0.5  0.5  0.2034  0.2140  0.2330  0.2064  0.2140  0.1766
  R75  75% Red   0.75  0.0  0.0  0.2155  0.1111  0.0101  0.2279  0.1163  0.0073
  R50  50% Red   0.50  0.0  0.0  0.0883  0.0455  0.0041  0.0933  0.0476  0.0030
  R25  25% Red   0.25  0.0  0.0  0.0210  0.0108  0.0010  0.0222  0.0113  0.0007
  P    Pink      1.00  0.5  0.5  0.5276  0.3812  0.2482  0.5492  0.3889  0.1876



Fig. 14.9: XYZ-based color conversion in Java. ColorSpace objects implement the methods fromCIEXYZ() and toCIEXYZ() to convert color vectors from and to the CIEXYZ color space, respectively. Colorimetric transformations between color spaces (e.g., CS_sRGB, CS_LINEAR_RGB, or a custom Lab_ColorSpace) can be accomplished as a two-step process via the XYZ space. For example, to convert from sRGB to CIELAB, the sRGB color is first converted to XYZ and subsequently from XYZ to CIELAB. Notice that Java's standard XYZ color space (the profile connection space) is based on the D50 white point, while most common color spaces refer to D65.



14.7 Colorimetric Support in Java

sRGB is the standard color space in Java; i.e., the components of color objects and RGB color images are gamma-corrected, nonlinear R′, G′, B′ values (see Fig. 14.6). The R′, G′, B′ values are related to the linear R, G, B values by a modified gamma correction, as specified by the sRGB standard (see Eqns. (14.33) and (14.35)).

14.7.1 Profile Connection Space (PCS)

The Java API (AWT) provides classes for representing color objects and color spaces, together with a rich set of corresponding methods. Java's color system is designed after the "color management architecture" of the ICC (International Color Consortium, www.color.org), which uses a CIEXYZ-based device-independent color space called the "profile connection space" (PCS) [140, 143]. The PCS color space is used as the intermediate reference for converting colors between different color spaces. The ICC standard defines device profiles (see Sec. 14.7.4) that specify the transforms to convert between a device's color space and the PCS. The advantage of this approach is that for any given device only a single color transformation (profile) must be specified to convert between device-specific colors and the unified, colorimetric profile connection space. Every ColorSpace class (or subclass) provides the methods fromCIEXYZ() and toCIEXYZ() to convert device color values to XYZ coordinates in the standardized PCS. Figure 14.9 illustrates the principal application of ColorSpace objects for converting colors between different color spaces in Java, using the XYZ space as a common "hub".

In contrast to the sRGB specification, the ICC specifies D50 (and not D65) as the illuminant white point for its default PCS color space (see Table 14.2). The reason is that the ICC standard was developed primarily for color management in photography, graphics, and printing, where D50 is normally used as the reflective media white point. The Java methods fromCIEXYZ() and toCIEXYZ() thus take and return X, Y, Z color coordinates that are relative to the D50 white point. The resulting coordinates for the primary colors (listed in Table 14.8) are different from the ones given for white point D65 (see Table 14.5)! This is a frequent cause of confusion, since the sRGB component values are D65-based (as specified by the sRGB standard) but Java's XYZ values are relative to D50.

Table 14.8: Color coordinates for the sRGB primaries and the white point in Java's default XYZ color space. The white point W is equal to D50.

  Pt.   R    G    B    X50       Y50       Z50       x50     y50
  R     1.0  0.0  0.0  0.436108  0.222517  0.013931  0.6484  0.3309
  G     0.0  1.0  0.0  0.385120  0.716873  0.097099  0.3212  0.5978
  B     0.0  0.0  1.0  0.143064  0.060610  0.714075  0.1559  0.0660
  W     1.0  1.0  1.0  0.964296  1.000000  0.825106  0.3457  0.3585



Chromatic adaptation (see Sec. 14.6) is used to convert between XYZ color coordinates that are measured with respect to different white points. The ICC specification [140] recommends a linear chromatic adaptation based on the Bradford model to convert between the D65-related XYZ coordinates (X65, Y65, Z65) and D50-related values (X50, Y50, Z50). This is also implemented by the Java API.

The complete mapping between the linearized sRGB color values (R, G, B) and the D50-based (X50, Y50, Z50) coordinates can be expressed as a linear transformation composed of the RGB → XYZ65 transformation by matrix M_RGB (Eqns. (14.28) and (14.29)) and the chromatic adaptation transformation XYZ65 → XYZ50 defined by matrix M_50|65 (Eqn. (14.46)),

  (X50, Y50, Z50)^T = M_50|65 · M_RGB^(−1) · (R, G, B)^T = (M_RGB · M_65|50)^(−1) · (R, G, B)^T
                    = [ 0.436131  0.385147  0.143033
                        0.222527  0.716878  0.060600
                        0.013926  0.097080  0.713871 ] · (R, G, B)^T,   (14.48)

and, in the reverse direction,

  (R, G, B)^T = M_RGB · M_65|50 · (X50, Y50, Z50)^T
              = [  3.133660  −1.617140  −0.490588
                  −0.978808   1.916280   0.033444
                   0.071979  −0.229051   1.405840 ] · (X50, Y50, Z50)^T.   (14.49)

Equations (14.48) and (14.49) are the transformations implemented by the methods toCIEXYZ() and fromCIEXYZ(), respectively, for Java's default sRGB ColorSpace class. Of course, these methods must also perform the necessary gamma correction between the linear R, G, B components and the actual (nonlinear) sRGB values R′, G′, B′. Figure 14.10 illustrates the complete transformation from D50-based PCS coordinates to nonlinear sRGB values.
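The D50 behavior is easy to verify with the standard API. The following small test is a sketch only; the expected output values are approximate, since the built-in sRGB profile may introduce small rounding deviations:

    import java.awt.color.ColorSpace;

    // Java's sRGB color space returns D50-relative XYZ coordinates,
    // as in Eqn. (14.48) and Table 14.8.
    public class PcsWhitePointTest {
      public static void main(String[] args) {
        ColorSpace srgb = ColorSpace.getInstance(ColorSpace.CS_sRGB);
        float[] red = {1.0f, 0.0f, 0.0f};
        float[] xyz = srgb.toCIEXYZ(red);
        // expected to be roughly (0.436, 0.223, 0.014), i.e., the D50-based
        // values of Table 14.8, not the D65-based values of Table 14.5
        System.out.printf("XYZ(red) = (%.4f, %.4f, %.4f)%n", xyz[0], xyz[1], xyz[2]);
      }
    }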



Fig. 14.10: Transformation from D50-based CIEXYZ coordinates (X50, Y50, Z50) in Java's Profile Connection Space (PCS) to nonlinear sRGB values (R′, G′, B′). The first step is chromatic adaptation from D50 to D65 (by M_65|50), followed by mapping the CIEXYZ coordinates to linear RGB values (by M_RGB). Finally, gamma correction is applied individually to all three color components.



14.7.2 Color-Related Java Classes

The Java standard API offers extensive support for working with colors and color images. The most important classes contained in the Java AWT package are:

• Color: defines individual color objects.
• ColorSpace: specifies entire color spaces.
• ColorModel: describes the structure of color images, e.g., full-color images or indexed-color images (see Prog. 13.3 on p. 385).

Class Color

An object of class java.awt.Color describes a particular color point in the associated color space, specifying the number and type of the color components. Color objects are primarily used for graphic operations, such as specifying the color for drawing or filling graphic objects. Unless the color space is explicitly specified, new Color objects are created as sRGB colors. The arguments passed to the Color constructor methods may be either float components in the range [0, 1] or integers in the range [0, 255], as demonstrated by the following example:

    Color pink = new Color(1.0f, 0.5f, 0.5f);
    Color blue = new Color(0, 0, 255);

Note that in both cases the arguments are interpreted as nonlinear sRGB values (R′, G′, B′). Other constructor methods exist for class Color that also accept alpha (transparency) values. In addition, the Color class offers two useful static methods, RGBtoHSB() and HSBtoRGB(), for converting between sRGB and HSV colors (the HSV color space is referred to as "HSB", for hue, saturation, brightness, in the Java API; see also Sec. 13.2.3).

Class ColorSpace

An object of type java.awt.color.ColorSpace represents an entire color space, such as sRGB or CMYK. Every subclass of ColorSpace (which itself is an abstract class) provides methods for converting its native colors to the CIEXYZ and sRGB color spaces and vice versa, such that conversions between arbitrary color spaces can easily be performed (through Java's XYZ-based profile connection space).

In the following example, we first create an instance of the default sRGB color space by invoking the static method ColorSpace.getInstance() and subsequently convert an sRGB color (R′, G′, B′) to the corresponding (X, Y, Z) coordinates in Java's (D50-based) profile connection space:

    // create an sRGB color space object:
    ColorSpace sRGBcsp = ColorSpace.getInstance(ColorSpace.CS_sRGB);
    float[] pink_RGB = new float[] {1.0f, 0.5f, 0.5f};
    // convert from sRGB to XYZ:
    float[] pink_XYZ = sRGBcsp.toCIEXYZ(pink_RGB);



Notice that color vectors are represented as float[] arrays for color conversions with ColorSpace objects. If required, the method getComponents() can be used to convert Color objects to float[] arrays. In summary, the types of color spaces that can be created with the ColorSpace.getInstance() method include:

• CS_sRGB: the standard (D65-based) RGB color space with nonlinear R′, G′, B′ components, as specified in [141].
• CS_LINEAR_RGB: color space with linear R, G, B components (i.e., no gamma correction applied).
• CS_GRAY: single-component color space with linear grayscale values.
• CS_PYCC: Kodak's Photo YCC color space.
• CS_CIEXYZ: the default XYZ profile connection space (based on the D50 white point).

Other color spaces can be implemented by creating additional implementations (subclasses) of ColorSpace, as demonstrated for CIELAB in the example in Sec. 14.7.3.

14.7.3 Implementation of the CIELAB Color Space (Example)

In the following, we show a complete implementation of the CIELAB color space, which is not available in the current Java API, based on the specification given in Sec. 14.2. For this purpose, we define a subclass of ColorSpace (defined in the package java.awt.color) named LabColorSpace, which implements the required methods toCIEXYZ() and fromCIEXYZ() for converting to and from Java's default profile connection space, respectively, and toRGB() and fromRGB() for converting between CIELAB and sRGB (Progs. 14.1 and 14.2). These conversions are performed in two steps via XYZ coordinates, where care must be taken regarding the right choice of the associated white point (CIELAB is based on D65 and Java's XYZ space on D50). The following examples demonstrate the principal use of the new LabColorSpace class (this class, the analogous LuvColorSpace implementation of the CIELUV color space, and associated auxiliary classes are found in package imagingbook.pub.color.image):

    // Conversion examples: sRGB to LAB to XYZ:
    ColorSpace labCs = new LabColorSpace();
    float[] cyan_sRGB = {0.0f, 1.0f, 1.0f};        // some sRGB color
    float[] cyan_LAB  = labCs.fromRGB(cyan_sRGB);  // sRGB -> LAB
    float[] cyan_XYZ  = labCs.toCIEXYZ(cyan_LAB);  // LAB -> XYZ (D50)



Prog. 14.1: Java implementation of the CIELAB color space as a subclass of ColorSpace (Part 1). The conversion from D50-based profile connection space XYZ coordinates to CIELAB (Eqn. (14.6)) and back is implemented by the required methods fromCIEXYZ() and toCIEXYZ(), respectively. The auxiliary methods fromCIEXYZ65() and toCIEXYZ65() convert D65-based XYZ coordinates. Chromatic adaptation between D50 and D65 is performed by the objects catD65toD50 and catD50toD65 of type ChromaticAdaptation. The gamma correction functions f1 (Eqn. (14.8)) and f2 (Eqn. (14.15)) are implemented by the methods f1() and f2(), respectively (see Prog. 14.2).

    package imagingbook.pub.color.image;

    import static imagingbook.pub.color.image.Illuminant.D50;
    import static imagingbook.pub.color.image.Illuminant.D65;

    import java.awt.color.ColorSpace;

    public class LabColorSpace extends ColorSpace {

      // D65 reference white point and chromatic adaptation objects:
      static final double Xref = D65.X;  // 0.950456
      static final double Yref = D65.Y;  // 1.000000
      static final double Zref = D65.Z;  // 1.088754

      static final ChromaticAdaptation catD65toD50 = new BradfordAdaptation(D65, D50);
      static final ChromaticAdaptation catD50toD65 = new BradfordAdaptation(D50, D65);

      // the only constructor:
      public LabColorSpace() {
        super(TYPE_Lab, 3);   // color space type and no. of components
      }

      // XYZ (Profile Connection Space, D50) -> CIELab conversion:
      public float[] fromCIEXYZ(float[] XYZ50) {
        float[] XYZ65 = catD50toD65.apply(XYZ50);
        return fromCIEXYZ65(XYZ65);
      }

      // XYZ (D65) -> CIELab conversion (Eqns. (14.6)-(14.10)):
      public float[] fromCIEXYZ65(float[] XYZ65) {
        double xx = f1(XYZ65[0] / Xref);
        double yy = f1(XYZ65[1] / Yref);
        double zz = f1(XYZ65[2] / Zref);
        float L = (float)(116.0 * yy - 16.0);
        float a = (float)(500.0 * (xx - yy));
        float b = (float)(200.0 * (yy - zz));
        return new float[] {L, a, b};
      }

      // CIELab -> XYZ (Profile Connection Space, D50) conversion:
      public float[] toCIEXYZ(float[] Lab) {
        float[] XYZ65 = toCIEXYZ65(Lab);
        return catD65toD50.apply(XYZ65);
      }

      // CIELab -> XYZ (D65) conversion (Eqns. (14.13)-(14.15)):
      public float[] toCIEXYZ65(float[] Lab) {
        double ll = (Lab[0] + 16.0) / 116.0;
        float Y65 = (float) (Yref * f2(ll));
        float X65 = (float) (Xref * f2(ll + Lab[1] / 500.0));
        float Z65 = (float) (Zref * f2(ll - Lab[2] / 200.0));
        return new float[] {X65, Y65, Z65};
      }

      // continued in Prog. 14.2



Prog. 14.2: Java implementation of the CIELAB color space as a subclass of ColorSpace (Part 2). The methods fromRGB() and toRGB() perform direct conversion between CIELAB and sRGB via D65-based XYZ coordinates, i.e., without conversion to Java's Profile Connection Space. Gamma correction (for mapping between linear RGB and sRGB component values) is implemented by the methods gammaFwd() and gammaInv() in class sRgbUtil (not shown). The methods f1() and f2() implement the forward and inverse gamma correction of CIELAB components (see Eqns. (14.6) and (14.13)).

      // continued from Prog. 14.1

      // sRGB -> CIELab conversion:
      public float[] fromRGB(float[] srgb) {
        // get linear rgb components:
        double r = sRgbUtil.gammaInv(srgb[0]);
        double g = sRgbUtil.gammaInv(srgb[1]);
        double b = sRgbUtil.gammaInv(srgb[2]);
        // convert to XYZ (D65-based, Eqn. (14.29)):
        float X = (float) (0.412453*r + 0.357580*g + 0.180423*b);
        float Y = (float) (0.212671*r + 0.715160*g + 0.072169*b);
        float Z = (float) (0.019334*r + 0.119193*g + 0.950227*b);
        float[] XYZ65 = new float[] {X, Y, Z};
        return fromCIEXYZ65(XYZ65);   // see Prog. 14.1
      }

      // CIELab -> sRGB conversion:
      public float[] toRGB(float[] Lab) {
        float[] XYZ65 = toCIEXYZ65(Lab);   // see Prog. 14.1
        double X = XYZ65[0];
        double Y = XYZ65[1];
        double Z = XYZ65[2];
        // XYZ -> RGB (linear components, Eqn. (14.28)):
        double r = ( 3.240479*X + -1.537150*Y + -0.498535*Z);
        double g = (-0.969256*X +  1.875992*Y +  0.041556*Z);
        double b = ( 0.055648*X + -0.204043*Y +  1.057311*Z);
        // RGB -> sRGB (nonlinear components):
        float rr = (float) sRgbUtil.gammaFwd(r);
        float gg = (float) sRgbUtil.gammaFwd(g);
        float bb = (float) sRgbUtil.gammaFwd(b);
        return new float[] {rr, gg, bb};
      }

      static final double epsilon = 216.0 / 24389;   // Eqn. (14.9)
      static final double kappa   = 841.0 / 108;     // Eqn. (14.10)

      // Forward gamma correction for L* (Eqn. (14.8)):
      double f1 (double c) {
        if (c > epsilon)   // 0.008856
          return Math.cbrt(c);
        else
          return (kappa * c) + (16.0 / 116);
      }

      // Inverse gamma correction for L* (Eqn. (14.15)):
      double f2 (double c) {
        double c3 = c * c * c;
        if (c3 > epsilon)
          return c3;
        else
          return (c - 16.0 / 116) / kappa;
      }

    } // end of class LabColorSpace

14.7.4 ICC Profiles

Even with the most precise specification, a standard color space may not be sufficient to accurately describe the transfer characteristics of some input or output device. ICC profiles are standardized descriptions of individual device transfer properties that warrant that an image or graphics can be reproduced accurately on different media. The contents and format of ICC profile files are specified in [140], which is identical to ISO standard 15076 [143]. Profiles are thus a key element in the process of digital color management [279].

The Java graphics API supports the use of ICC profiles mainly through the classes ICC_ColorSpace and ICC_Profile, which allow application designers to create various standard profiles and read ICC profiles from data files. Assume, for example, that an image was recorded with a calibrated scanner and shall be displayed accurately on a monitor. For this purpose, we need the ICC profiles for the scanner and the monitor, which are often supplied by the manufacturers as .icc data files (ICC profile files may also come with the extensions .icm or .pf, as in the Java distribution). For standard color spaces, the associated ICC profiles are often available as part of the computer installation, such as CIERGB.icc or NTSC1953.icc. With these profiles, a color space object can be specified that converts the image data produced by the scanner into corresponding CIEXYZ or sRGB values, as illustrated by the following example:

    // load the scanner's ICC profile and create a corresponding color space:
    ICC_ColorSpace scannerCs =
        new ICC_ColorSpace(ICC_ProfileRGB.getInstance("scanner.icc"));
    // specify a device-specific color:
    float[] deviceColor = {0.77f, 0.13f, 0.89f};
    // convert to sRGB:
    float[] RGBColor = scannerCs.toRGB(deviceColor);
    // convert to (D50-based) XYZ:
    float[] XYZColor = scannerCs.toCIEXYZ(deviceColor);



Similarly, we can calculate the accurate color values to be sent to the monitor by creating a suitable color space object from this device’s ICC profile.
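A sketch of this monitor-side step could look as follows (illustrative only; the profile file name "monitor.icc" is a placeholder, and the example XYZ values are just some D50-based coordinates):

    import java.awt.color.ICC_ColorSpace;
    import java.awt.color.ICC_Profile;
    import java.io.IOException;

    // Sending a colorimetrically defined color to a calibrated monitor.
    public class MonitorColorDemo {
      public static void main(String[] args) throws IOException {
        ICC_ColorSpace monitorCs =
            new ICC_ColorSpace(ICC_Profile.getInstance("monitor.icc"));
        float[] XYZColor = {0.4361f, 0.2225f, 0.0139f};   // some D50-based XYZ color
        // map from the (D50-based) PCS to the monitor's device color space:
        float[] monitorColor = monitorCs.fromCIEXYZ(XYZColor);
        System.out.println(java.util.Arrays.toString(monitorColor));
      }
    }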



14.8 Exercises

Exercise 14.1. For chromatic adaptation (defined in Eqn. (14.43)), transformation matrices other than the Bradford model (Eqn. (14.45)) have been proposed, for example [254],

  M_CAT^(2) = [  1.2694  −0.0988  −0.1706
                −0.8364   1.8006   0.0357
                 0.0297  −0.0315   1.0018 ]   (14.50)

or

  M_CAT^(3) = [  0.7982   0.3389  −0.1371
                −0.5918   1.5512   0.0406
                 0.0008  −0.0239   0.9753 ].   (14.51)

Derive the complete chromatic adaptation transformations M_50|65 and M_65|50 for converting between D65 and D50 colors, analogous to Eqns. (14.46) and (14.47), for each of the above transformation matrices.

Exercise 14.2. Implement the conversion of an sRGB color image to a colorless (grayscale) sRGB image using the three methods in Eqn. (14.37) (incorrectly applying the standard weights to the nonlinear R′G′B′ components), Eqn. (14.38) (exact computation), and Eqn. (14.39) (approximation using nonlinear components and modified weights). Compare the results by computing difference images, and also determine the total errors.

Exercise 14.3. Write a program to evaluate the errors that are introduced by using nonlinear instead of linear color components for grayscale conversion. To do this, compute the difference between the Y values obtained with the linear variant (Eqn. (14.38)) and the nonlinear variant (Eqn. (14.39) with w′_R = 0.309, w′_G = 0.609, w′_B = 0.082) for all possible 2^24 RGB colors. Let your program return the maximum gray value difference and the sum of the absolute differences for all colors.

Exercise 14.4. Determine the virtual primaries R*, G*, B* obtained by Bradford adaptation (Eqn. (14.42)), with M_CAT as defined in Eqn. (14.45). What are the resulting coordinates in the xy chromaticity diagram? Are the primaries inside the visible color range?






15 Filters for Color Images



Color images are everywhere and filtering them is such a common task that it does not seem to require much attention at all. In this chapter, we describe how classical linear and nonlinear filters, which we covered before in the context of grayscale images (see Ch. 4), can be either used directly or adapted for the processing of color images. Often color images are treated as stacks of intensity images and existing monochromatic filters are simply applied independently to the individual color channels. While this is straightforward and performs satisfactorily in many situations, it does not take into account the vector-valued nature of color pixels as samples taken in a specific, multi-dimensional color space. As we show in this chapter, the outcome of filter operations depends strongly on the working color space and the variations between different color spaces may be substantial. Although this may not be apparent in many situations, it should be of concern if high-quality color imaging is an issue.



15.1 Linear Filters

Linear filters are important in many applications, such as smoothing, noise removal, interpolation for geometric transformations, decimation in scale-space transformations, image compression, reconstruction and edge enhancement. The general properties of linear filters and their use on scalar-valued grayscale images are detailed in Sec. 4.6.1. For color images, it is common practice to apply these monochromatic filters separately to each color channel, thereby treating the image as a stack of scalar-valued images. As we describe in the following section, this approach is simple as well as efficient, since existing implementations for grayscale images can be reused without any modification. However, the outcome depends strongly on the choice of the color space in which the filter operation is performed. For example, it makes a great difference if the channels of an RGB image contain linear or nonlinear component values. This topic is discussed in more detail in Sec. 15.1.2.






15.1.1 Monochromatic Application of Linear Filters

Given a discrete scalar (grayscale) image with elements I(p) ∈ R, the application of a linear filter can be expressed as a linear 2D convolution (see Sec. 4.3.1)

  Ī(p) = (I ∗ H)(p) = Σ_{q∈H} I(p−q)·H(q),   (15.1)

where H denotes a discrete filter kernel defined over the (usually rectangular) region H = dom(H). For a vector-valued image I with K components, the individual picture elements are vectors, that is,

  I(p) = ( I_0(p), I_1(p), ..., I_{K−1}(p) )^T,   (15.2)

with I(p) ∈ R^K and I_k(p) ∈ R. In this case, the linear filter operation can be generalized to

  Ī(p) = (I ∗ H)(p) = Σ_{q∈H} I(p−q)·H(q),   (15.3)

with the same scalar-valued filter kernel H as in Eqn. (15.1). Thus the kth component of the resulting pixels,

  Ī_k(p) = Σ_{q∈H} I_k(p−q)·H(q) = (I_k ∗ H)(p),   (15.4)

is simply the result of scalar convolution (Eqn. (15.1)) applied to the kth component plane I_k. In the case of an RGB color image (with K = 3 components), the filter kernel H is applied separately to the scalar-valued R, G, B planes I_0 = I_R, I_1 = I_G, and I_2 = I_B, i.e.,

  Ī(p) = ( Ī_R(p), Ī_G(p), Ī_B(p) )^T = ( (I_R ∗ H)(p), (I_G ∗ H)(p), (I_B ∗ H)(p) )^T.   (15.5)

Figure 15.1 illustrates how linear filters for color images are typically implemented by individually filtering the three scalar-valued color components.



Fig. 15.1: Monochromatic application of a linear filter. The filter, specified by the kernel H, is applied separately to each of the scalar-valued color channels I_R, I_G, I_B. Combining the filtered component channels Ī_R, Ī_G, Ī_B produces the filtered color image Ī.
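A minimal Java sketch of this channel-wise scheme (Eqn. (15.5)) is shown below. It is illustrative only and not the book's implementation: the image is assumed to be a float array I[k][v][u] with one plane per channel, borders are handled by clamping coordinates, and the loop computes the correlation form, which is identical to Eqn. (15.1) for symmetric kernels.

    // Channel-wise application of the same 2D kernel to each color plane.
    public class MonochromaticFilter {

      static float[][][] filter(float[][][] I, float[][] H) {
        int K = I.length, height = I[0].length, width = I[0][0].length;
        int hc = H.length / 2, wc = H[0].length / 2;     // kernel hot spot
        float[][][] out = new float[K][height][width];
        for (int k = 0; k < K; k++) {                    // each channel separately
          for (int v = 0; v < height; v++) {
            for (int u = 0; u < width; u++) {
              double sum = 0;
              for (int j = 0; j < H.length; j++) {
                for (int i = 0; i < H[0].length; i++) {
                  int vv = Math.min(Math.max(v + j - hc, 0), height - 1);
                  int uu = Math.min(Math.max(u + i - wc, 0), width - 1);
                  sum += I[k][vv][uu] * H[j][i];
                }
              }
              out[k][v][u] = (float) sum;
            }
          }
        }
        return out;
      }

      public static void main(String[] args) {
        float[][] box = {{1f/9, 1f/9, 1f/9}, {1f/9, 1f/9, 1f/9}, {1f/9, 1f/9, 1f/9}};
        float[][][] rgb = new float[3][8][8];            // small dummy RGB image
        rgb[0][4][4] = 1;                                // single red pixel
        float[][][] smoothed = filter(rgb, box);
        System.out.println(smoothed[0][4][4]);           // ~0.111 (spread by the box filter)
      }
    }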



Linear smoothing filters

Smoothing filters are a particular class of linear filters that are found in many applications and are characterized by positive-only filter coefficients. Let

  C_p = {c_0, ..., c_{n−1}} = { I(p−q) | q ∈ H }   (15.6)

denote the set of color pixels in image I that are covered by the kernel H when placed (i.e., its hot spot anchored) at some position p. With arbitrary kernel coefficients H(q) ∈ R, the resulting color pixel c̄ = Ī(p) in the filtered image is a linear combination of the original color vectors c_i, that is,

  c̄ = w_0·c_0 + w_1·c_1 + ... + w_{n−1}·c_{n−1} = Σ_{i=0..n−1} w_i·c_i,   (15.7)

where w_i is the coefficient (weight) in H that is applied to pixel c_i. If the kernel is normalized (i.e., ΣH = Σw_i = 1), the result is an affine combination of the original colors. Moreover, if H is normalized and its elements are all positive (as in the case of a typical smoothing filter), any resulting color c̄ is a convex combination of the contributing colors c_0, ..., c_{n−1}. Geometrically this means that the mixed color c̄ is contained within the convex hull of the color points c_0, ..., c_{n−1}, as illustrated in Fig. 15.2. In the special case that only two original colors c_0, c_1 are involved, the resulting color point c̄ is located on the straight line segment connecting c_0 and c_1 (Fig. 15.2(b)); the convex hull of two points c_0, c_1 consists exactly of this line segment.

Fig. 15.2: Convex linear color mixtures. The result of a convex combination (mixture) of color vectors C = {c_0, ..., c_{n−1}} is always inside the convex hull of C (a). In the special case of only two initial colors (c_0, c_1), any mixed color c̄ is located on the straight line segment connecting c_0 and c_1 (b).



Fig. 15.3: Linear smoothing filter at a color edge. Discrete filter kernel with positive-only elements and support region H (a). Filter kernel positioned over a region of constant color c_0 and over a color step edge c_0/c_1, respectively (b). If the (normalized) filter kernel of extent H is completely embedded in a region of constant color (c_0), the result of filtering is exactly that same color. At a step edge between two colors c_0, c_1, one part of the kernel (H_0) covers pixels of color c_0 and the remaining part (H_1) covers pixels of color c_1. In this case, the result is a linear mixture of the colors c_0, c_1, as illustrated in Fig. 15.2(b).



Filter Response to a Color Step Edge

Assume, as a special case, that the original RGB image I contains a step edge separating two regions of constant colors c_0 and c_1, respectively, as illustrated in Fig. 15.3(b). If a normalized smoothing kernel H is placed at some position p where it is fully supported by pixels of the same color c_0, the (trivial) response of the filter is

  c̄ = Σ_{q∈H} c_0·H(q) = c_0 · Σ_{q∈H} H(q) = c_0·1 = c_0.   (15.8)

As expected, the result at this position is the original color c_0. If, on the other hand, the filter kernel is placed at some position on a color edge, i.e., at the transition between two colors c_0, c_1 (see again Fig. 15.3(b)), a subset of its coefficients (H_0) is supported by pixels of color c_0, while the other coefficients (H_1) overlap with pixels of color c_1. Since H_0 ∪ H_1 = H and the kernel H is normalized, the resulting color is

  c̄ = Σ_{q∈H_0} c_0·H(q) + Σ_{q∈H_1} c_1·H(q)                       (15.9)
    = c_0 · Σ_{q∈H_0} H(q) + c_1 · Σ_{q∈H_1} H(q) = c_0·(1−α) + c_1·α   (15.10)
    = c_0 + α·(c_1 − c_0),                                           (15.11)

for some α = Σ_{q∈H_1} H(q) ∈ [0, 1]. As we see, the resulting color c̄ is a linear "blend" of the original colors c_0 and c_1. Thus, at a step edge between two colors c_0, c_1, the intermediate colors produced by a (normalized) smoothing filter are located on the straight line between the two original color coordinates. Note that this relationship between linear filtering and linear color mixtures is independent of the particular color space in which the filter operation is performed.

15.1.2 Color Space Matters






Fig. 15.4: Intermediate colors produced by linear interpolation between yellow and blue, performed in different color spaces (sRGB, linear RGB, CIELUV, CIELAB). The 3D plot shows the resulting colors in linear RGB space.

Fig. 15.5: Linear filter operation performed in a "foreign" color space. The original RGB image I_RGB is first transformed to CIELAB (by T), where the linear filter H is applied separately to the three channels L*, a*, b*. The filtered RGB image Ī_RGB is obtained by transforming back from CIELAB to RGB (by T^(−1)).

Since a linear filter always yields a convex linear mixture of the involved colors, it should make a difference in which color space the filter operation is performed. For example, Fig. 15.4 shows the intermediate colors produced by a smoothing filter being applied to the



same blue/yellow step edge but in different color spaces: sRGB, linear RGB, CIELUV, and CIELAB. As we see, the differences between the various color spaces are substantial. To obtain dependable and standardized results, it might be reasonable to first transform the input image to a particular operating color space, perform the required filter operation, and finally transform the result back to the original color space, as illustrated in Fig. 15.5.

Obviously, a linear filter implies certain "metric" properties of the underlying color space. If we assume that a certain color space S_A has this property, then this is also true for any color space S_B that is related to S_A by a linear transformation, such as CIEXYZ and linear RGB space (see Sec. 14.4.1). However, many color spaces used in practice (sRGB in particular) are related to these reference color spaces by highly nonlinear mappings, and thus significant deviations can be expected.

Preservation of Brightness (Luminance)

Apart from the intermediate colors produced by interpolation, another important (and easily measurable) aspect is the resulting change of brightness or luminance across the filter region. In particular, it should generally hold that the luminance of the filtered color image is identical to the result of filtering only the (scalar)






luminance channel of the original image with the same kernel H. Thus, if Lum(I) denotes the luminance of the original color image and Lum(I ∗ H) is the luminance of the filtered image, it should hold that

  Lum(I ∗ H) = Lum(I) ∗ H.   (15.12)



This is only possible if Lum(·) is linearly related to the components of the associated color space, which is mostly not the case. From Eqn. (15.12) we also see that, when filtering a step edge with colors c_0 and c_1, the resulting brightness should also change monotonically from Lum(c_0) to Lum(c_1) and, in particular, none of the intermediate brightness values should fall outside this range.

Fig. 15.6: Gaussian smoothing performed in different color spaces. Synthetic color image (a) and corresponding luminance image (b). The test image contains a horizontal bar with reduced color saturation but the same luminance as its surround, i.e., it is invisible in the luminance image. Gaussian filter applied in different color spaces: sRGB (c), linear RGB (d), CIELUV (e), and CIELAB (f). The bottom row (g–j) shows the corresponding luminance (Y) images. Note the dark bands in the sRGB result, particularly along the color boundaries between regions B–E, C–D, and D–E, which stand out clearly in the corresponding luminance image (g). Filtering in linear RGB space (d, h) gives good results between highly saturated colors, but subjectively too high luminance in unsaturated regions, which is apparent around the gray markers. Results with the CIELUV (e, i) and CIELAB color spaces (f, j) appear most consistent as far as the preservation of luminance is concerned.



Figure 15.6 shows the results of filtering a synthetic test image with a normalized Gaussian kernel (of radius σ = 3) in different color spaces. (Note that in the paper edition the relevant color changes may not be clearly visible due to the color conversions for printing (to CMYK), which introduce similar artifacts; this applies particularly to the highly saturated parts of the color test image in Fig. 15.6(a).) Differences are most notable at the red–blue and green–magenta transitions, with particularly large deviations in the sRGB space. The corresponding luminance values Y (calculated from linear RGB components as in Eqn. (13.33)) are shown in Fig. 15.6(g–j). Again conspicuous is the result for sRGB (Fig. 15.6(c, g)), which exhibits transitions at the red–blue, magenta–blue, and magenta–green edges, where the resulting brightness drops below the original brightness of both contributing colors. Thus Eqn. (15.12) is not satisfied



in this case. On the other hand, filtering in linear RGB space has the tendency to produce too high brightness values, as can be seen at the black–white markers in Fig. 15.6(d, h).

Out-of-Gamut Colors

If we apply a linear filter in RGB or sRGB space, the resulting intermediate colors are always valid RGB colors again and contained in the original RGB gamut volume. However, transformed to CIELUV or CIELAB, the set of possible RGB or sRGB colors forms a nonconvex shape (see Ch. 14, Fig. 14.5), such that linearly interpolated colors may fall outside the RGB gamut volume. Particularly critical (in both CIELUV and CIELAB) are the red–white, red–yellow, and red–magenta transitions, as well as yellow–green in CIELAB, where the resulting distances from the gamut surface can be quite large (see Fig. 15.7). During back-transformation to the original color space, such "out-of-gamut" colors must receive special treatment, since simple clipping of the affected components may cause unacceptable color distortions [191].

Fig. 15.7: Out-of-gamut colors produced by linear interpolation between red (255, 0, 0) and yellow (255, 255, 0) in "foreign" color spaces. The graphs in (a) show the (linear) R, G, B component values and the luminance Y (gray curves) resulting from a linear filter between red and yellow performed in different color spaces (sRGB, RGB, CIELUV, CIELAB). The graphs show that the red component runs significantly outside the RGB gamut for both CIELUV and CIELAB. In (b) all pixels with any component outside the RGB gamut by more than 1% are marked white (for filtering in CIELAB).

Implications and Further Reading

Applying a linear filter to the individual component channels of a color image presumes a certain "linearity" of the underlying color space. Smoothing filters implicitly perform additive linear mixing and interpolation. Despite common practice (and as demonstrated by the results), there is no justification for performing a linear filter operation directly on gamma-mapped sRGB components. However, contrary to expectation, filtering in linear RGB does not yield better overall results either. In summary, both nonlinear sRGB and linear RGB color spaces are unsuitable for linear filtering if perceptually accurate results are desired.

Perceptually uniform color spaces, such as CIELUV and CIELAB, are good choices for linear filtering because of their metric properties, with CIELUV being perhaps slightly superior when it comes to interpolation over large color distances. When using CIELUV or CIELAB as intermediate color spaces for filtering RGB images, one must consider that out-of-gamut colors may be produced that must be handled properly. Thus none of the existing standard color spaces is universally suited or even "ideal" with respect to linear filtering.

The proper choice of the working color space is relevant not only to smoothing filters, but also to other types of filters, such as linear



interpolation filters for geometric image transformations, decimation filters used in multi-scale techniques, and also nonlinear filters that involve averaging colors or calculation of color distances, such as the vector median filter (see Sec. 15.2.2). While complex color space transformations in the context of filtering (e.g., sRGB ↔ CIELUV) are usually avoided for performance reasons, they should certainly be considered when high-quality results are important.

Although the issues related to color mixtures and interpolation have been investigated for some time (see, e.g., [171, 290]), their relevance to image filtering has not received much attention in the literature. Most image processing tools (including commercial software) apply linear filters directly to color images, without proper linearization or color space conversion. Lindbloom [171] was among the first to describe the problem of accurate color reproduction, particularly in the context of computer graphics and photo-realistic imagery. He also emphasized the relevance of perceptual uniformity for color processing and recommended the use of CIELUV as a suitable (albeit not perfect) processing space. Tomasi and Manduchi [260] suggested the use of the Euclidean distance in CIELAB space as "most natural" for bilateral filtering applied to color images (see also Sec. 17.2), and similar arguments are put forth in [130]. De Weijer [272] notes that the additional chromaticities introduced by linear smoothing are "visually unacceptable" and argues for the use of nonlinear operators as an alternative. Lukac et al. [178] mention "certain inaccuracies" and color artifacts related to the application of scalar filters and discuss the issue of choosing a proper distance metric for vector-based filters. The practical use of alternative color spaces for image filtering is described in [164, Ch. 5].

15.1.3 Linear Filtering with Circular Values

If any of the color components is a circular quantity, such as the hue component in the HSV and HLS color spaces (see Ch. 13, Sec. 13.2.3), linear filters cannot be applied directly without additional provisions. As described in the previous section, a linear filter effectively calculates a weighted average over the values inside the filter region. Since the hue component represents a revolving angle and exhibits a discontinuity at the 1 → 0 (i.e., 360° → 0°) transition, simply averaging this quantity is not admissible (see Fig. 15.8). However, correct interpolation of angular data is possible by utilizing the corresponding cosine and sine values, without any special treatment of discontinuities [84]. Given two angles α1, α2, the average angle α12 can be calculated as

  α12 = tan^(−1)( (sin(α1) + sin(α2)) / (cos(α1) + cos(α2)) )   (15.13)
      ≡ ArcTan( cos(α1) + cos(α2), sin(α1) + sin(α2) )          (15.14)

(see Appendix Sec. A.1 for the definition of the ArcTan() function), and, in general, multiple angular values α0, ..., α_{n−1} can be correctly averaged in the form



Fig. 15.8: Naive linear filtering in HSV color space. Original RGB color image I (a) and the associated HSV hue component I_h (b), with values in the range [0, 1). Hue component I_h ∗ H after direct application of a Gaussian blur filter H with σ = 3.0 (c). Reconstructed RGB image Ĩ after filtering all components in HSV space (d). Note the false colors introduced around the 0 → 1 discontinuity (near red) of the hue component.



  ᾱ = ArcTan( Σ_{i=0..n−1} cos(α_i), Σ_{i=0..n−1} sin(α_i) ).   (15.15)

Also, the calculation of a weighted average is possible in the same way, that is,

  ᾱ = ArcTan( Σ_{i=0..n−1} w_i·cos(α_i), Σ_{i=0..n−1} w_i·sin(α_i) ),   (15.16)

without any additional provisions (even the weights w_i need not be normalized). This approach can be used for linearly filtering circular data in general, as shown in the following example.

Filtering the Hue Component in HSV Color Space

To apply a linear filter H to the circular hue component I_h (with original values in [0, 1)) of an HSV or HLS image (see Ch. 13, Sec. 13.2.3), we first calculate the corresponding sine and cosine parts I_h^sin and I_h^cos by

  I_h^sin(p) = sin(2π·I_h(p)),   I_h^cos(p) = cos(2π·I_h(p)),   (15.17)

with resulting values in the range [−1, 1]. The sine and cosine components are then filtered individually, that is,

  Ī_h^sin = I_h^sin ∗ H,   Ī_h^cos = I_h^cos ∗ H.   (15.18)

Finally, the filtered hue component Ī_h is obtained in the form

  Ī_h(p) = ( ArcTan(Ī_h^cos(p), Ī_h^sin(p)) mod 2π ) / 2π,   (15.19)
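The three steps of Eqns. (15.17)–(15.19) translate directly into code. The following Java sketch is illustrative only (not the book's implementation); the hue plane is a float array with values in [0, 1), and convolve() stands in for any linear 2D filter such as the Gaussian used in Fig. 15.9. For the saturation-weighted variant of Alg. 15.1, the sine and cosine values would simply be multiplied by I_s(p) in the first loop.

    // Circular filtering of the HSV hue channel (Eqns. (15.17)-(15.19)).
    public class CircularHueFilter {

      static float[][] filterHue(float[][] Ih, float[][] H) {
        int h = Ih.length, w = Ih[0].length;
        float[][] sin = new float[h][w], cos = new float[h][w];
        for (int v = 0; v < h; v++) {
          for (int u = 0; u < w; u++) {
            double theta = 2 * Math.PI * Ih[v][u];            // Eqn. (15.17)
            sin[v][u] = (float) Math.sin(theta);
            cos[v][u] = (float) Math.cos(theta);
          }
        }
        float[][] sinF = convolve(sin, H);                    // Eqn. (15.18)
        float[][] cosF = convolve(cos, H);
        float[][] IhF = new float[h][w];
        for (int v = 0; v < h; v++) {
          for (int u = 0; u < w; u++) {
            double theta = Math.atan2(sinF[v][u], cosF[v][u]);          // ArcTan(cos, sin)
            IhF[v][u] = (float) (((theta + 2 * Math.PI) % (2 * Math.PI)) / (2 * Math.PI)); // Eqn. (15.19)
          }
        }
        return IhF;
      }

      // simple correlation with clamped borders (equivalent for symmetric kernels)
      static float[][] convolve(float[][] I, float[][] H) {
        int h = I.length, w = I[0].length, hc = H.length / 2, wc = H[0].length / 2;
        float[][] out = new float[h][w];
        for (int v = 0; v < h; v++)
          for (int u = 0; u < w; u++) {
            double sum = 0;
            for (int j = 0; j < H.length; j++)
              for (int i = 0; i < H[0].length; i++) {
                int vv = Math.min(Math.max(v + j - hc, 0), h - 1);
                int uu = Math.min(Math.max(u + i - wc, 0), w - 1);
                sum += I[vv][uu] * H[j][i];
              }
            out[v][u] = (float) sum;
          }
        return out;
      }
    }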



Fig. 15.9: Filtering the hue component of an HSV image by separating it into cosine and sine parts (see Fig. 15.8(a) for the original image). Cosine and sine parts I_h^cos, I_h^sin of the hue component before (a, b) and after (c, d) the application of a Gaussian blur filter with σ = 3.0. Smoothed hue component Ī_h after merging the filtered cosine and sine parts Ī_h^cos, Ī_h^sin (e). Reconstructed RGB image Ī after filtering all HSV components (f). It is apparent that the hard 0/1 hue transitions in (e) are in fact only gradual color changes around the red hues. The other HSV components (S, V, which are non-circular) were filtered in the usual way. The reconstructed RGB image (f) shows no false colors and all hues are correctly filtered.



with values again in the range [0, 1]. Figure 15.9 demonstrates the correct application of a Gaussian smoothing filter to the hue component of an HSV color image by separating it into cosine and sine parts. The other two HSV components (S, V) are non-circular and were filtered as usual. In contrast to the result in Fig. 15.8(d), no false colors are produced at the 0 → 1 boundary. In this context it is helpful to look at the distribution of the hue values, which are clustered around 0/1 in the sample image (see Fig. 15.10(a)). In Fig. 15.10(b) we can clearly see how naive filtering of the hue component produces new (false) colors in the middle of the histogram. This does not occur when the hue component is filtered correctly (see Fig. 15.10(c)).

Saturation-Weighted Filtering






The method just described does not take into account that in HSV (and HLS) the hue and saturation components are closely related. In particular, the hue angle may be very inaccurate (or even indeterminate) if the associated saturation value goes to zero. For example, the test image in Fig. 15.8 (a) contains a bright patch in the lower right-hand corner, where the saturation is low and the hue value is quite unstable, as seen in Fig. 15.9 (a, b). However, the circular filter



Alg. 15.1 Linear filtering in HSV color space. All component values of the original HSV image are in the range [0, 1]. The algorithm considers the circular nature of the hue component and uses the saturation component (in line 6) as a weight factor, as defined in Eqn. (15.20). The same filter kernel H is applied to all three color components (lines 9–12).

1:  HsvLinearFilter(I_hsv, H)
    Input: I_hsv = (Ih, Is, Iv), a HSV color image of size M × N, with all components in [0, 1];
    H, a 2D filter kernel. Returns a new (filtered) HSV color image of size M × N.
2:    (M, N) ← size(I_hsv)
3:    Create 2D maps Ih^sin, Ih^cos, Īh : M × N → R
      Split the hue channel into sine/cosine parts:
4:    for all p ∈ M × N do
5:        θ ← 2π · Ih(p)                ▷ hue angle θ ∈ [0, 2π]
6:        s ← Is(p)                     ▷ saturation s ∈ [0, 1]
7:        Ih^sin(p) ← s · sin(θ)        ▷ Ih^sin(p) ∈ [−1, 1]
8:        Ih^cos(p) ← s · cos(θ)        ▷ Ih^cos(p) ∈ [−1, 1]
      Filter all components with the same kernel:
9:    Īh^sin ← Ih^sin ∗ H
10:   Īh^cos ← Ih^cos ∗ H
11:   Īs ← Is ∗ H
12:   Īv ← Iv ∗ H
      Reassemble the filtered hue channel:
13:   for all p ∈ M × N do
14:       θ ← ArcTan(Īh^cos(p), Īh^sin(p))      ▷ θ ∈ [−π, π]
15:       Īh(p) ← (1/(2π)) · (θ mod 2π)         ▷ Īh(p) ∈ [0, 1]
16:   Ī_hsv ← (Īh, Īs, Īv)
17:   return Ī_hsv







Fig. 15.10 (panels a–d): histograms of hue values over the range [0, 1] for the original image, the naive filter, the circular filter, and the saturation-weighted filter.



defined in Eqns. (15.17)–(15.19) takes all color samples as equally significant. A simple solution is to use the saturation value Is(p) as a weight factor for the associated pixel [117], by modifying Eqn. (15.17) to

    I_h^{\sin}(p) = I_s(p)\cdot\sin(2\pi\cdot I_h(p)),  \qquad  I_h^{\cos}(p) = I_s(p)\cdot\cos(2\pi\cdot I_h(p)).    (15.20)

Fig. 15.10 Histogram of the HSV hue component before and after linear filtering. Original distribution of hue values Ih (a), showing that colors are clustered around the 0/1 discontinuity (red). Result after naive filtering of the hue component (b), after filtering separated cosine and sine parts (c), and after additional weighting with saturation values (d). The bottom row shows the isolated hue component (color angle) by the corresponding colors (saturation and value set to 100 %). Note the noisy spot in the lower right-hand corner of (a), where color saturation is low and hue angles are very unstable.

The remaining steps in Eqns. (15.18)–(15.19) are unchanged, and the complete process is summarized in Alg. 15.1. The results in Fig. 15.10 (d) show that, particularly in regions of low color saturation, more stable hue values can be expected. Note that no normalization of the weights is required because the calculation of the hue angles (with the ArcTan() function in Eqn. (15.19)) only considers the ratio of the resulting sine and cosine parts.
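To make the weighting step concrete, the following is a minimal Java sketch of the saturation-weighted hue decomposition and reassembly of Eqns. (15.17)–(15.20). It is not the book's imagingbook implementation; the helper convolveInPlace() is a hypothetical stand-in for any linear 2D convolution with the kernel H, and all arrays are assumed to hold values scaled to [0, 1].

// Sketch of Alg. 15.1: smooth the circular hue channel of an HSV image by
// filtering its saturation-weighted sine/cosine parts with the same kernel H.
public class HsvHueSmoothing {

  public static float[] smoothHue(float[] h, float[] s, int width, int height, float[][] H) {
    float[] sinPart = new float[h.length];
    float[] cosPart = new float[h.length];
    for (int i = 0; i < h.length; i++) {
      double theta = 2 * Math.PI * h[i];                 // hue angle in [0, 2π]
      sinPart[i] = (float) (s[i] * Math.sin(theta));     // Eqn. (15.20)
      cosPart[i] = (float) (s[i] * Math.cos(theta));
    }
    convolveInPlace(sinPart, width, height, H);          // Eqn. (15.18)
    convolveInPlace(cosPart, width, height, H);
    float[] hSmooth = new float[h.length];
    for (int i = 0; i < h.length; i++) {
      // ArcTan(cos, sin) is assumed to behave like atan2(sin, cos), θ ∈ [−π, π]
      double theta = Math.atan2(sinPart[i], cosPart[i]);
      if (theta < 0) theta += 2 * Math.PI;               // θ mod 2π
      hSmooth[i] = (float) (theta / (2 * Math.PI));      // Eqn. (15.19), back to [0, 1]
    }
    return hSmooth;
  }

  private static void convolveInPlace(float[] data, int width, int height, float[][] H) {
    // placeholder: any ordinary linear 2D filter (e.g., ImageJ's Convolver) may be used here
  }
}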



15.2 Nonlinear Color Filters

In many practical image processing applications, linear filters are of limited use and nonlinear filters, such as the median filter, are applied instead.^5 In particular, for effective noise removal, nonlinear filters are usually the better choice. However, as with linear filters, the techniques originally developed for scalar (grayscale) images do not transfer seamlessly to vector-based color data. One reason is that, unlike in scalar data, no natural ordering relation exists for multidimensional data. As a consequence, nonlinear filters of the scalar type are often applied separately to the individual color channels, and again one must be cautious about the intermediate colors being introduced by these types of filters. In the remainder of this section we describe the application of the classic (scalar) median filter to color images, a vector-based version of the median filter, and edge-preserving smoothing filters designed for color images. Additional filters for color images are presented in Chapter 17.

^5 See also Sec. 4.4.

15.2.1 Scalar Median Filter

Applying a median filter with support region H (e.g., a disk-shaped region) at some image position p means to select one pixel value that is the most representative of the pixels in H to replace the current center pixel (hot spot). In case of a median filter, the statistical median of the pixels in H is taken as that representative. Since we always select the value of one of the existing image pixels, the median filter does not introduce any new pixel values that were not contained in the original image.

If a median filter is applied independently to the components of a color image, each channel is treated as a scalar image, like a single grayscale image. In this case, with the support region H centered at some point p, the median for each color channel will typically originate from a different spatial position in H, as illustrated in Fig. 15.11. Thus the components of the resulting color vector are generally collected from more than one pixel in H; the color placed in the filtered image may therefore not match any of the original colors, and new colors may be generated that were not contained in the original image. Despite its obvious deficiencies, the scalar (monochromatic) median filter is used in many popular image processing environments (including Photoshop and ImageJ) as the standard median filter for color images.

15.2.2 Vector Median Filter

The scalar median filter is based on the concept of rank ordering, that is, it assumes that the underlying data can be ordered and sorted.



Fig. 15.11 Scalar median filter applied separately to color channels. With the filter region H centered at some point p, the median pixel value is generally found at different locations in the R, G, B channels of the original image. The components of the resulting RGB color vector are collected from spatially separated pixels. It thus may not match any of the colors in the original image.



However, no such natural ordering exists for data elements that are vectors. Although vectors can be sorted in many different ways, for example by length or lexicographically along their dimensions, it is usually impossible to define a useful greater-than relation between any pair of vectors. One can show, however, that the median of a sequence of n scalar values C = (c0, . . . , cn−1) can also be defined as the value cm selected from C, such that

    \sum_{i=0}^{n-1} |c_m - c_i| \;\le\; \sum_{i=0}^{n-1} |c_j - c_i|    (15.21)

holds for any value cj ∈ C. In other words, the median value cm = median(C) is the one for which the sum of the absolute differences to all other elements in the sequence C is smallest, i.e.,

    c_m = \operatorname*{argmin}_{c_j \in C} \sum_{i=0}^{n-1} |c_j - c_i|.    (15.22)
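As a quick illustration of Eqn. (15.22) (and of the check asked for in Exercise 15.1), the following short Java sketch computes the median of a scalar sequence both ways, by sorting and by minimizing the sum of absolute differences; the class and method names are ad hoc and not part of any library.

import java.util.Arrays;

// Illustrates Eqn. (15.22): the scalar median minimizes the sum of
// absolute differences to all other elements of the sequence.
public class MedianAsArgmin {

  static double medianBySorting(double[] c) {
    double[] sorted = c.clone();
    Arrays.sort(sorted);
    return sorted[sorted.length / 2];        // center element (odd-length sequence)
  }

  static double medianByArgmin(double[] c) {
    double best = c[0];
    double bestSum = Double.POSITIVE_INFINITY;
    for (double cj : c) {                    // candidate cj ∈ C
      double sum = 0;
      for (double ci : c) sum += Math.abs(cj - ci);
      if (sum < bestSum) { bestSum = sum; best = cj; }
    }
    return best;
  }

  public static void main(String[] args) {
    double[] c = {7, 1, 5, 9, 3};
    System.out.println(medianBySorting(c));  // 5.0
    System.out.println(medianByArgmin(c));   // 5.0
  }
}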



Based on this definition, the concept of the median can be easily extended from the scalar situation to multi-dimensional data. Given a sequence of vector-valued samples C = (c0, . . . , cn−1), with ci ∈ R^K, we define the vector median element as

    a_m = \operatorname*{argmin}_{c_j \in C} \sum_{i=0}^{n-1} \lVert c_j - c_i \rVert.    (15.23)

This is analogous to Eqn. (15.21), with the exception that the scalar difference |·| has been replaced by the vector norm ∥·∥ for measuring the distance between two points in the K-dimensional space.^6 We call

    D_L(c, C) = \sum_{c_i \in C} \lVert c - c_i \rVert_L    (15.24)

^6 K denotes the dimensionality of the samples ai (e.g., K = 3 for RGB colors).



the “aggregate distance” of the sample vector c with respect to all samples in C, for some vector norm L. Common choices for L are the standard L1, L2 and L∞ norms, that is,

    L_1:      \lVert c - c' \rVert_1      = \sum_{k=0}^{K-1} |c_k - c'_k|,                      (15.25)
    L_2:      \lVert c - c' \rVert_2      = \Big[\sum_{k=0}^{K-1} |c_k - c'_k|^2\Big]^{1/2},    (15.26)
    L_\infty: \lVert c - c' \rVert_\infty = \max_{0 \le k < K} |c_k - c'_k|.                    (15.27)

        ⋮
        if (Dctr − Dmin) > t · a then
            I′(p) ← cmin                 ▷ replace the center pixel
        else
            I′(p) ← I(p)                 ▷ keep the original center pixel
19:     return I′

20: TrimmedAggregateDistance(c, C, a)
    Returns the aggregate distance for color c to the a most similar colors in C = (c0, . . . , cn−1).
21:   n ← |C|                            ▷ size of C
22:   d ← new map [0, n−1] → R
23:   for i ← 0, . . . , n−1 do
24:       d(i) ← ∥c − C(i)∥_L            ▷ choose any distance norm L
25:   d′ ← Sort(d)                       ▷ d′(0) ≤ d′(1) ≤ . . . ≤ d′(n−1)
26:   D ← 0
27:   for i ← 1, . . . , a−1 do          ▷ d′(0) = 0, thus skipped
28:       D ← D + d′(i)
29:   return D



sequence, the sharpening vector median filter provides good noise removal while at the same time leaving edge structures intact. Typically, the aggregate distance of color c to the a closest neighborhood colors is found by first calculating the distances between c and all other samples in C, then sorting the result, and finally adding up only the a initial elements of the sorted distances (see procedure TrimmedAggregateDistance(c, C, a) in Alg. 15.3). Thus the sharpening median filter requires an additional sorting step over n ∝ r2 elements at each pixel, which adds to its time complexity. The parameter s in Alg. 15.3 specifies the fraction of region pixels included in the calculation of the median and thus controls the amount of sharpening. The number of incorporated pixels a is determined as a = round(n − s · (n − 2)) (see Alg. 15.3, line 7), so that a = n, . . . , 2 for s ∈ [0, 1]. With s = 0, all a = |C| = n pixels in the filter region are included in the median calculation and the filter behaves like the ordinary vector-median filter described in Alg. 15.2.



At maximum sharpening (i.e., with s = 1) the calculation of the aggregate distance includes only the single most similar color pixel in the neighborhood C. The calculation of the “trimmed aggregate distance” is shown in Alg. 15.3 (lines 20–29). The function TrimmedAggregateDistance(c, C, a) calculates the aggregate distance for a given vector (color sample) c over the a closest samples in the support region C. Initially (in line 24), the n distances D(i) between c and all colors in C are calculated, with D(i) = ∥c − C(i)∥L (see Eqns. (15.25)–(15.27)). These are subsequently sorted by increasing value (line 25) and the sum of the a smallest values is returned (Alg. 15.3, line 28). The effects of varying the sharpen parameter s are shown in Fig. 15.16, with a fixed filter radius r = 2.0 and threshold t = 0. For s = 0.0 (Fig. 15.16 (a)), the result is the same as that of the ordinary vector median filter (see Fig. 15.15 (b)). The value of the current center pixel is only replaced by a neighboring pixel value if the corresponding minimal (trimmed) aggregate distance Dmin is significantly smaller than the center pixel's aggregate distance Dctr. In Alg. 15.3, this is controlled by the threshold t. The center pixel is replaced only if the condition

    (D_{ctr} - D_{min}) > t \cdot a    (15.29)



holds, otherwise it remains unmodified. Note that the distance limit is proportional to a and thus t really specifies the minimum “average” pixel distance; it is independent of the filter radius r and the sharpening parameter s. Results for typical values of t (in the range 0, . . . , 10) are shown in Figs. 15.17–15.18. To illustrate the effect, the images in Fig. 15.18 only display those pixels that were not replaced by the filter, while all modified pixels are set to black. As one would expect, increasing the threshold t leads to fewer pixels being modified. Of course, the same thresholding scheme may also be used with the ordinary vector median filter (see Exercise 15.2).
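For readers who prefer code over pseudocode, here is a minimal Java sketch that mirrors the TrimmedAggregateDistance() procedure and the threshold test of Eqn. (15.29). It uses the L1 norm for concreteness; the class and method names are invented for this illustration and are not the book's library API.

import java.util.Arrays;

// Core of the sharpening vector median filter: trimmed aggregate distance
// (Alg. 15.3, lines 20-29) and the center-pixel replacement test of Eqn. (15.29).
public class TrimmedAggregateDistanceDemo {

  // L1 distance between two RGB samples
  static double dist(float[] a, float[] b) {
    return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]) + Math.abs(a[2] - b[2]);
  }

  // aggregate distance of c to its a most similar colors in C (c itself is an element of C)
  static double trimmedAggregateDistance(float[] c, float[][] C, int a) {
    double[] d = new double[C.length];
    for (int i = 0; i < C.length; i++)
      d[i] = dist(c, C[i]);
    Arrays.sort(d);                  // d[0] = 0 (distance of c to itself), skipped below
    double D = 0;
    for (int i = 1; i < a; i++)
      D += d[i];
    return D;
  }

  // decide whether the center color should be replaced by the best candidate
  static boolean replaceCenter(double Dctr, double Dmin, double t, int a) {
    return (Dctr - Dmin) > t * a;    // Eqn. (15.29)
  }
}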



15.3 Java Implementation

Implementations of the scalar and vector median filter as well as the sharpening vector median filter are available with full Java source code at the book's website.^7 The corresponding classes ScalarMedianFilter, VectorMedianFilter and VectorMedianFilterSharpen are based on the common super-class GenericFilter, which provides the method void applyTo(ImageProcessor ip), which greatly simplifies the use of these filters. The code segment in Prog. 15.1 demonstrates the use of the class VectorMedianFilter



^7 Package imagingbook.pub.color.filters



Fig. 15.16 Sharpening vector median filter with different sharpness values s: (a) s = 0.0, (b) s = 0.2, (c) s = 0.5, (d) s = 0.8. The filter radius is r = 2.0 and the corresponding filter mask contains n = 21 pixels. At each pixel, only the a = 21, 17, 12, 6 closest color samples (for sharpness s = 0.0, 0.2, 0.5, 0.8, respectively) are considered when calculating the local vector median.



(with radius 3.0 and L1-norm) for RGB color images in an ImageJ plugin. For the specific filters described in this chapter, the following constructors are provided:

ScalarMedianFilter(Parameters params)
Creates a scalar median filter, as described in Sec. 15.2.1, with parameter radius = 3.0 (default).

VectorMedianFilter(Parameters params)
Creates a vector median filter, as described in Sec. 15.2.2, with parameters radius = 3.0 (default), distanceNorm = NormType.L1 (default), L2, Lmax.



VectorMedianFilterSharpen(Parameters params)
Creates a sharpening vector median filter (see Sec. 15.2.3) with parameters radius = 3.0 (default), distanceNorm = NormType.L1 (default), L2, Lmax, sharpening factor sharpen = 0.5 (default), threshold = 0.0 (default).

The listed default values pertain to the parameterless constructors that are also available. See the online API documentation or the source code for additional details. Note that the created filter objects are generic and can be applied to both grayscale and color images without any modification.

Fig. 15.17 Sharpening vector median filter with different threshold values t = 0, 2, 5, 10 (panels (a)–(d)). The filter radius and sharpening factor are fixed at r = 2.0 and s = 0.0, respectively.
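As a hypothetical usage sketch (mirroring Prog. 15.1 and using the parameter names listed in the constructor description above), a sharpening vector median filter could be configured and applied inside an ImageJ plugin's run() method roughly as follows; the nested Parameters class is assumed to exist analogously to VectorMedianFilter.Parameters.

// fragment inside run(ImageProcessor ip); assumed import of
// imagingbook.pub.colorfilters.VectorMedianFilterSharpen
VectorMedianFilterSharpen.Parameters params = new VectorMedianFilterSharpen.Parameters();
params.radius = 2.0;        // filter radius r
params.sharpen = 0.5;       // sharpening factor s
params.threshold = 2.0;     // threshold t of Eqn. (15.29)
VectorMedianFilterSharpen filter = new VectorMedianFilterSharpen(params);
filter.applyTo(ip);         // destructively modifies the input image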



15.4 Further Reading

A good overview of different linear and nonlinear filtering techniques for color images can be found in [164]. In [210, Ch. 2], the authors give a concise treatment of color image filtering, including statistical noise models, vector ordering schemes, and different color similarity measures. Several variants of weighted median filters for color images and multi-channel data in general are described in [11, Sec. 2.4]. A very readable and instructive survey of important color issues in computer vision, such as color constancy, photometric invariance, and color feature extraction, can be found in [102]. A vector median filter operating in HSV color space is proposed in [273]. In addition to the techniques discussed in this chapter, most of the filters described in Chapter 17 can either be applied directly to color images or easily modified for this purpose.

Fig. 15.18 Sharpening vector median filter with different threshold values t = 0, 2, 5, 10 (panels (a)–(d); also see Fig. 15.17). Only the unmodified pixels are shown in color, while all modified pixels are set to black. The filter radius and sharpening factor are fixed at r = 2.0 and s = 0.0, respectively. Raising the threshold reduces the number of pixels being modified.



15.5 Exercises



Prog. 15.1 Color median filter using class VectorMedianFilter. In line 16, a suitable parameter object (with default values) is created, then modified and passed to the constructor of the filter (in line 20). The filter itself is applied to the input image, which is destructively modified (in line 21).

 1  import ij.ImagePlus;
 2  import ij.plugin.filter.PlugInFilter;
 3  import ij.process.ImageProcessor;
 4  import imagingbook.lib.math.VectorNorm.NormType;
 5  import imagingbook.lib.util.Enums;
 6  import imagingbook.pub.colorfilters.VectorMedianFilter;
 7  import imagingbook.pub.colorfilters.VectorMedianFilter.*;
 8
 9  public class MedianFilter_Color_Vector implements PlugInFilter {
10
11    public int setup(String arg, ImagePlus imp) {
12      return DOES_RGB;
13    }
14
15    public void run(ImageProcessor ip) {
16      Parameters params = new VectorMedianFilter.Parameters();
17      params.distanceNorm = NormType.L1;
18      params.radius = 3.0;
19
20      VectorMedianFilter filter = new VectorMedianFilter(params);
21      filter.applyTo(ip);
22    }
23  }

Exercise 15.1. Verify Eqn. (15.21) by showing (formally or experimentally) that the usual calculation of the scalar median (by sorting a sequence and selecting the center value) indeed gives the value with the smallest sum of differences from all other values in the same sequence. Is the result independent of the type of distance norm used?

Exercise 15.2. Modify the ordinary vector median filter described in Alg. 15.2 to incorporate a threshold t for deciding whether to modify the current center pixel or not, analogous to the approach taken in the sharpening vector median filter in Alg. 15.3.

Exercise 15.3. Implement a dedicated median filter (analogous to Alg. 15.1) for the HSV color space. The filter should process the color components independently but consider the circular nature of the hue component, as discussed in Sec. 15.1.3. Compare the results to the vector-median filter in Sec. 15.2.2.






16 Edge Detection in Color Images



Edge information is essential in many image analysis and computer vision applications and thus the ability to locate and characterize edges robustly and accurately is an important task. Basic techniques for edge detection in grayscale images are discussed in Chapter 5. Color images contain richer information than grayscale images and it appears natural to assume that edge detection methods based on color should outperform their monochromatic counterparts. For example, locating an edge between two image regions of different hue but similar brightness is difficult with an edge detector that only looks for changes in image intensity. In this chapter, we first look at the use of “ordinary” (i.e., monochromatic) edge detectors for color images and then discuss dedicated detectors that are specifically designed for color images.

Although the problem of color edge detection has been pursued for a long time (see [163, 298] for a good overview), most image processing texts do not treat this subject in much detail. One reason could be that, in practice, edge detection in color images is often accomplished by using “monochromatic” techniques on the intensity channel or the individual color components. We discuss these simple methods, which nevertheless give satisfactory results in many situations, in Sec. 16.1. Unfortunately, monochromatic techniques do not extend naturally to color images and other “multi-channel” data, since edge information in the different color channels may be ambiguous or even contradictory. For example, multiple edges running in different directions may coincide at a given image location, edge gradients may cancel out, or edges in different channels may be slightly displaced. In Sec. 16.2, we describe how local gradients can be calculated for edge detection by treating the color image as a 2D vector field. In Sec. 16.3, we show how the popular Canny edge detector, originally designed for monochromatic images, can be adapted for color images, and Sec. 16.4 goes on to look at other color edge operators. Implementations of the discussed algorithms are described in Sec. 16.5, with complete source code available on the book's website.






16.1 Monochromatic Techniques

Linear filters are the basis of most edge enhancement and edge detection operators for scalar-valued grayscale images, particularly the gradient filters described in Sec. 5.3. Again, it is quite common to apply these scalar filters separately to the individual color channels of RGB images. A popular example is the Sobel operator with the filter kernels

    H^S_x = \frac{1}{8}\cdot\begin{pmatrix} -1 & 0 & 1\\ -2 & 0 & 2\\ -1 & 0 & 1 \end{pmatrix}
    \quad\text{and}\quad
    H^S_y = \frac{1}{8}\cdot\begin{pmatrix} -1 & -2 & -1\\ 0 & 0 & 0\\ 1 & 2 & 1 \end{pmatrix}    (16.1)

for the x- and y-direction, respectively. Applied to a grayscale image I, with Ix = I ∗ H^S_x and Iy = I ∗ H^S_y, these filters give a reasonably good estimate of the local gradient vector,

    \nabla I(p) = \begin{pmatrix} I_x(p)\\ I_y(p) \end{pmatrix},    (16.2)

at position p = (u, v). The local edge strength of the grayscale image is then taken as

    E_{gray}(p) = \lVert \nabla I(p) \rVert = \sqrt{I_x^2(p) + I_y^2(p)},    (16.3)

and the corresponding edge orientation is calculated as

    \Phi(p) = \angle\nabla I(p) = \tan^{-1}\Big(\frac{I_y(p)}{I_x(p)}\Big).    (16.4)

The angle Φ(p) gives the direction of maximum intensity change on the 2D image surface at position p, which is the normal to the edge tangent. Analogously, to apply this technique to a color image I = (IR, IG, IB), each color plane is first filtered individually with the two gradient kernels given in Eqn. (16.1), i.e.,

    \nabla I_R = \begin{pmatrix} I_{R,x}\\ I_{R,y} \end{pmatrix} = \begin{pmatrix} I_R * H^S_x\\ I_R * H^S_y \end{pmatrix}, \quad
    \nabla I_G = \begin{pmatrix} I_{G,x}\\ I_{G,y} \end{pmatrix} = \begin{pmatrix} I_G * H^S_x\\ I_G * H^S_y \end{pmatrix}, \quad
    \nabla I_B = \begin{pmatrix} I_{B,x}\\ I_{B,y} \end{pmatrix} = \begin{pmatrix} I_B * H^S_x\\ I_B * H^S_y \end{pmatrix}.    (16.5)

The local edge strength is calculated separately for each color channel, which yields a vector

    E(p) = \begin{pmatrix} E_R(p)\\ E_G(p)\\ E_B(p) \end{pmatrix}
         = \begin{pmatrix} \lVert \nabla I_R(p) \rVert\\ \lVert \nabla I_G(p) \rVert\\ \lVert \nabla I_B(p) \rVert \end{pmatrix}    (16.6)
         = \begin{pmatrix} [I_{R,x}^2(p) + I_{R,y}^2(p)]^{1/2}\\ [I_{G,x}^2(p) + I_{G,y}^2(p)]^{1/2}\\ [I_{B,x}^2(p) + I_{B,y}^2(p)]^{1/2} \end{pmatrix}    (16.7)



for each image position p. These vectors may be combined to a new color image E = (ER, EG, EB), although such a “color edge image” has no particularly useful interpretation.^1 Finally, a scalar quantity of combined edge strength (C) over all color planes can be obtained, for example, by calculating the Euclidean (L2) norm of E as

    C_2(p) = \lVert E(p) \rVert_2 = \big[ E_R^2(p) + E_G^2(p) + E_B^2(p) \big]^{1/2}
           = \big[ I_{R,x}^2 + I_{R,y}^2 + I_{G,x}^2 + I_{G,y}^2 + I_{B,x}^2 + I_{B,y}^2 \big]^{1/2}    (16.8)

(coordinates p are omitted in the second line) or, using the L1 norm,

    C_1(p) = \lVert E(p) \rVert_1 = |E_R(p)| + |E_G(p)| + |E_B(p)|.    (16.9)

Another alternative for calculating a combined edge strength is to take the maximum magnitude of the RGB gradients (i.e., the L∞ norm),

    C_\infty(p) = \lVert E(p) \rVert_\infty = \max\big(|E_R(p)|, |E_G(p)|, |E_B(p)|\big).    (16.10)



An example using the synthetic test image from Ch. 15 is given in Fig. 16.1. It shows the edge magnitude of the corresponding grayscale image and the combined color edge magnitude calculated with the different norms defined in Eqns. (16.8)–(16.10). As far as edge orientation is concerned, there is no simple extension of the grayscale case. While edge orientation can easily be calculated for each individual color component (using Eqn. (16.4)), the gradients of the three color channels are generally different (or even contradictory) and there is no obvious way of combining them. A simple ad hoc approach is to choose, at each image position p, the gradient direction from the color channel of maximum edge strength, that is,

    \Phi_{col}(p) = \tan^{-1}\Big(\frac{I_{m,y}(p)}{I_{m,x}(p)}\Big),
    \quad\text{with}\quad m = \operatorname*{argmax}_{k = R,G,B} |E_k(p)|.    (16.11)



This simple (monochromatic) method for calculating edge strength and orientation in color images is summarized in Alg. 16.1 (see Sec. 16.5 for the corresponding Java implementation). Two more examples are shown in Fig. 16.2. For comparison, these figures also show the edge maps obtained by first converting the color image to a grayscale image and then applying the Sobel operator2 (see Fig. 16.2 (b)). The edge magnitude in all examples is normalized; it is shown inverted and contrast-enhanced to increase the visibility of low-contrast edges. As expected and apparent from the examples, even simple monochromatic techniques applied to color images perform better than edge detection on the corresponding grayscale images. In particular, edges between color regions of similar brightness are not detectable in this way, so using color information for 1



2



Such images are nevertheless produced by the “Find Edges” command in ImageJ and the filter of the same name in Photoshop (showing inverted components). See Sec. 5.3.1.



477



RGB color image



16 Edge Detection in Color Images Fig. 16.1 Color edge enhancement with monochromatic methods. Original color image (a) and corresponding grayscale image (b); edge magnitude from the grayscale image (c). Color edge magnitude calculated with different norms: L1 (d), L2 (e), and L∞ (f). The images in (c–f) are inverted for better viewing. Note that the grayscale image in (c) was calculated with the direct conversion method, i.e., from nonlinear sRGB components (see Eqn. (14.39), p. 438). With linear grayscale conversion (see Eqn. (14.37), p. 437), the desaturated bar at the center would exhibit no grayscale edges along its borders, since the luminance is the same inside and outside.



Luminance



(a)



(b)



sRGB



Y′



(d)



(c)



Egray



C1



(e)



(f)



C2



C∞



edge detection is generally more powerful than relying on intensity alone. Among the simple color techniques, the maximum channel edge strength C∞ (see Eqn. (16.10)) seems to give the most consistent results with the fewest edges getting lost. However, none of the monochromatic detection techniques can be expected to work reliably under these circumstances. While the threshold for binarizing the edge magnitude could be tuned manually to give more pleasing results on specific images, it is difficult in practice to achieve consistently good results over a wide range of images. Methods for determining the optimal edge threshold dynamically, that is, depending on the image content, have been proposed, typically based on the statistical variability of the color gradients. Additional details can be found in [103, 194, 216].



16.2 Edges in Vector-Valued Images 478



In the “monochromatic” scheme described in Sec. 16.1, the edge magnitude in each color channel is calculated separately and thus no use



16.2 Edges in Vector-Valued Images



(a)



Original image I



(b)



(c)



Gray edge Egray



(d)



(e)



Color edge C1 (Eqn. (16.9))



(f)



(g)



Color edge C2 (Eqn. (16.8))



(h)



(i)



Color edge C∞ (Eqn. (16.10))



(j)



Fig. 16.2 Example of color edge enhancement with monochromatic techniques (balloons image). Original color image (a, b), edge magnitude obtained from the grayscale image (c, d). Color edge magnitude calculated with the L1 norm (e, f), L2 norm (g, h), and L∞ (i, j). Differences between the grayscale edge detector (c, d) and the color-based detector (e–j) are particularly visible on the right side of the balloon image and the bottom of the tangerine image.



479



1: MonochromaticColorEdge(I) Input: I = (IR , IG , IB ), a RGB color image of size M × N . Returns a pair of maps (E2 , Φ) containing edge magnitude and orientation, respectively.



16 Edge Detection in Color Images Alg. 16.1 Monochromatic color edge operator. A pair of Sobeltype filter kernels (HxS , HyS ) is used to estimate the local x/y gradients of each component of the RGB input image I. Color edge magnitude is calculated as the L2 norm of the color gradient vector (see Eqn. (16.8)). The procedure returns a pair of maps, holding the edge magnitude E2 and the edge orientation Φ, respectively.



h −1



0 1 −2 0 2 −1 0 1 −1 −2 −1 0 0 0 1 2 1



i



2:



HxS :=



1 · 8



3:



HyS :=



1 · 8



4:



(M, N ) ← size(I)



5: 6: 7:



Apply gradient filters: IR,x ← IR ∗HxS , IR,y ← IR ∗HyS IG,x ← IG ∗HxS , IG,y ← IG ∗HyS IB,x ← IB ∗HxS , IB,y ← IB ∗HyS



8: 9: 10: 11: 12: 13:



h



▷ x/y gradient kernels



i



Calculate edge magnitude and orientation: E ← new map M × N → R Φ ← new map M × N → R for all p ∈ M ×N do rx ← IR,x (p), ry ← IR,y (p)) gx ← IG,x (p), gy ← IG,y (p)) bx ← IB,x (p), by ← IB,y (p)) e2R ← rx2 + ry2 e2G ← gx2 + gy2 e2B ← b2x + b2y



17: 18: 19: 20:



e2max ← e2R c x ← rx , c y ← ry if e2G > e2max then e2max ← e2G , cx ← gx ,



c y ← gy



21: 22:



if e2B > e2max then e2max ← e2B , cx ← bx ,



c y ← by



23: 24:



e2R



25:



▷ = ∇IR ▷ = ∇IG ▷ = ∇IB ▷ = ∥∇IR ∥2 ▷ = ∥∇IG ∥2 ▷ = ∥∇IB ∥2



14: 15: 16:



p



▷ edge magnitude ▷ edge orientation



e2G



▷ find max. gradient channel



e2B



E(p) ← + + Φ(p) ← ArcTan(cx , cy )



▷ edge magnitude (L2 norm) ▷ edge orientation



return (E, Φ).



is made of the potential coupling between color channels. Only in a subsequent step are the individual edge responses in the color channels combined, albeit in an ad hoc fashion. In other words, the color data are not treated as vectors, but merely as separate and unrelated scalar values. To obtain better insight into this problem it is helpful to treat the color image as a vector field, a standard construct in vector calculus [39, 252].3 A three-channel RGB color image I(p) = (IR (p), IG (p), IB (p)) can be modeled as a discrete 2D vector field, that is, a function whose coordinates p = (u, v) are 2D and whose values are 3D vectors. Similarly, a grayscale image can be described as a discrete scalar field, since its pixel values are only 1D. 16.2.1 Multi-Dimensional Gradients As noted in the previous section, the gradient of a scalar image I at a specific position p is defined as 3



480



See Sec. D.1 of the Appendix for general properties of vector fields.



∇I(p) =



∂I ∂x (p) ∂I ∂y (p)



!



(16.12) 16.2 Edges in



,



Vector-Valued Images



that is, the vector of the partial derivatives of the function I in the x- and y-direction, respectively.4 Obviously, the gradient of a scalar image is a 2D vector field. In the case of a color image I = (IR , IG , IB ), we can treat the three color channels as separate scalar images and obtain their gradients analogously as ∇IR (p) =



∂IR ∂x (p) ∂IR ∂y (p)



!



, ∇IG (p) =



∂IG ∂x (p) ∂IG ∂y (p)



!



, ∇IB (p) =



∂IB ∂x (p) ∂IB ∂y (p)



!



,



(16.13) which is the same as in Eqn. (16.5). Before we can take the next steps, we need to introduce a standard tool for the analysis of vector fields. 16.2.2 The Jacobian Matrix The Jacobian matrix5 Jf (x0 ) combines all first partial derivatives of a vector-valued function f (x) at a given position x0 , its row vectors being the gradients of f ’s scalar-valued component functions. In particular, for an RGB color image I, the Jacobian matrix is defined as    ∂I  ∂IR R (∇IR )⊺ (p) ∂x (p) ∂y (p)    ∂I   ⊺  =  G (p) ∂IG (p) = I x (p) I y (p) , JI (p) =  (∇I ) (p) G ∂x ∂y     ∂IB ∂IB (∇IB )⊺ (p) (16.14) ∂x (p) ∂y (p) with the 2D gradient vectors ∇IR , ∇IG , ∇IB , as defined in Eqn. (16.13). The transposed gradient vectors (∇IR )⊺ , (∇IG )⊺ , (∇IB )⊺ form the rows of the 3 × 2 Jacobian matrix JI . Thus the two 3D column vectors of this matrix,



I x (p) =



 ∂I



R



(p)







∂x  ∂I  ∂I G (p) = (p)  , ∂x ∂x ∂IB ∂x (p)



I y (p) =



 ∂I



R







∂y (p)  ∂I  ∂I G  (p) = ∂y (p),  ∂y ∂IB ∂y (p)



(16.15)



are the partial derivatives of the color components along the x- and y-axes, respectively. At a particular position p, the total amount of change over all three color channels in the horizontal direction can be quantified by the norm of the corresponding column vector ∥I x (p)∥. Analogously, ∥I y (p)∥ gives the total amount of change over all three color channels along the vertical axis. 4



5



Of course, images are discrete functions and the partial derivatives are estimated from finite differences (see Appendix Sec. D.3). See also Sec. D.1.1 of the Appendix.



481



16 Edge Detection in Color Images



16.2.3 Squared Local Contrast Now that we can quantify the combined color change along the horizontal and vertical axes at any position p, the next task is to calculate the direction of maximum change to find the angle of the edge normal, which we then use to derive the local edge strength. How can we calculate the gradient in some direction θ other than horizontal and vertical? For this purpose, we use the product of the unit vector oriented at angle θ,   cos(θ) eθ = , (16.16) sin(θ) and the Jacobian matrix JI (Eqn. (16.14)) in the form    cos(θ) (gradθ I)(p) = JI (p)·eθ = I x (p) I y (p) · sin(θ) = I x (p)· cos(θ) + I y (p)· sin(θ),



(16.17)



with the derivative vectors I x , I y as defined in Eqn. (16.15). The resulting 3D vector (gradθ I)(p) is called the directional gradient 6 of the color image I at position p, for any direction θ. By taking the squared norm of this vector, i.e., 2



Sθ (I, p) = ∥(gradθ I)(p)∥2



2 = I x (p)·cos(θ) + I y (p)·sin(θ) 2



(16.18)



= I 2x (p)· cos2 (θ) + 2·I x (p)·I y (p)· cos(θ)· sin(θ) + I 2y (p)· sin2 (θ),



we obtain what is called the squared local contrast of the vectorvalued image I at position p in direction θ. In particular, for a RGB image I = (IR , IG , IB ), the squared local contrast in Eqn. (16.18) is     IR,y (p)



IR,x (p)



2 Sθ (I, p) = IG,x (p) · cos(θ) + IG,y (p) · sin(θ) 2 (16.19) IB,x (p) IB,y (p)  2  2 2 = IR,x (p) + IG,x (p) + IB,x (p) · cos2 (θ)  2  2 2 + IR,y (p) + IG,y (p) + IB,y (p) · sin2 (θ) (16.20) + 2 · cos(θ) · sin(θ) ·  IR,x (p)·IR,y (p) + IG,x (p)·IG,y (p) + IB,x (p)·IB,y (p) .



If I is a scalar (i.e., grayscale) image, the squared local contrast reduces to    



2 Ix (p) ⊺ cos(θ) 2







Sθ (I, p) = (gradθ I)(p) = · (16.21) Iy (p) sin(θ) 2  2 = Ix (p)· cos(θ) + Iy (p)· sin(θ) . (16.22)



We will return to this result again later in Sec. 16.2.6. In p the following, we use the root of the squared local contrast, that is, Sθ (I, p), under the term local contrast. 6



482



See also Sec. D.1.2 of the Appendix (Eqn. (D.8)).



Grayscale image I



RGB color image I = (IR , IG , IB )



Fig. 16.3 Local image gradients and local contrast. In case of a scalar (grayscale) image I (a), the local gradient ∇I defines a single plane that is tangential to the image function I at position p = (u, v). In case of an RGB color image I = (IR , IG , IB ) (b), the local gradients ∇IR , ∇IG , ∇IB for each color channel define three tangent planes. The vertical axes in graphs (c, d) show the corresponding local contrast values √ Sθ (I, p) (see Eqns. (16.18) and (16.19)) for all possible directions θ = 0, . . . , 2π.



∇IR ∇I



∇IB ∇IG



(a)



(b)



p



p



Sθ (p)



(c)



16.2 Edges in Vector-Valued Images



Sθ (p)



(d)



Figure 16.3 illustrates the meaning of the squared local contrast in relation to the local image gradients. At a given image position p, the local gradient ∇I(p) in a grayscale image (Fig. 16.3 (a)) defines a single plane that is tangential to the image function I at position p. In case of a color image (Fig. 16.3 (b)), each color channel defines an individual tangent plane. In Fig. 16.3 (c, d) the local contrast values are shown as the height of cylindrical surfaces for all directions θ. For a grayscale image (Fig. 16.3 (c)), the local contrast changes linearly with the orientation θ, while the relation is quadratic for a color image (Fig. 16.3 (d)). To calculate the strength and orientation of edges we need to determine the direction of the maximum local contrast, which is described in the following. 16.2.4 Color Edge Magnitude The directions that maximize the squared local contrast Sθ (I, p) in Eqn. (16.18) can be found analytically as the roots of the first partial derivative of S with respect to the angle θ, as originally suggested by Di Zenzo [74], and the resulting quantity is called maximum local contrast. As shown in [69], the maximum local contrast can also be found from the Jacobian matrix JI (Eqn. (16.14)) as the largest eigenvalue of the (symmetric) 2 × 2 matrix ! ⊺ I x (p)  ⊺ M(p) = JI (p)· JI (p) = · I x (p) I y (p) (16.23) ⊺ I y (p)     ⊺ I 2x (p) I x (p)·I y (p) AC = = , (16.24) ⊺ CB I y (p)·I x (p) I 2y (p)



483



16 Edge Detection in Color Images



with the elements A = I 2x (p) = I x (p)·I x (p), ⊺



B = I 2y (p) = I y (p)·I y (p), ⊺



C=



⊺ I x (p)·I y (p)



=



(16.25)



⊺ I y (p)·I x (p).



Matrix M(p) could be considered the color equivalent to the local structure matrix used for corner detection on grayscale images (see Sec. 6.2.1). The two eigenvalues of M can be calculated in closed form as7 p  λ0 (p) = A + B + (A−B)2 + 4·C 2 /2, (16.26) p  λ1 (p) = A + B − (A−B)2 + 4·C 2 /2.



Since M is symmetric, the expression under the square root in Eqn. (16.26) is positive and thus all eigenvalues are real. In addition, A, B are both positive and therefore λ0 is always the larger of the two eigenvalues. Moreover, λ0 is equivalent to the maximum squared local contrast (Eqn. (16.18)), that is, λ0 (p) ≡ max Sθ (I, p),



(16.27)



0≤θ tlo ). Returns a binary edge image of size M × N . 2: 3: 4:



I¯R ← IR ∗ H G,σ I¯G ← IG ∗ H G,σ I¯B ← IB ∗ H G,σ



5: 6:



Hx∇ ← [−0.5 0 0.5 ] ⊺ Hy∇ ← [−0.5 0 0.5 ]



7: 8: 9:



Apply gradient filters: I¯R,x ← I¯R ∗ Hx∇ , I¯R,y ← I¯R ∗ Hy∇ I¯G,x ← I¯G ∗ Hx∇ , I¯G,y ← I¯G ∗ Hy∇ I¯B,x ← I¯B ∗ Hx∇ , I¯B,y ← I¯B ∗ Hy∇



▷ blur components w. Gaussian of width σ



▷ x gradient kernel ▷ y gradient kernel



10:



(M, N ) ← size(I)



11: 12:



Create scalar maps Emag , Enms , Ex , Ey : M × N → R Create binary map Ebin : M × N → {0, 1}



13: 14: 15: 16:



for all p ∈ M × N do rx ← IR,x (p), ry ← IR,y (p)) gx ← IG,x (p), gy ← IG,y (p)) bx ← IB,x (p), by ← IB,y (p))



▷ = ∇IR ▷ = ∇IG ▷ = ∇IB



17: 18: 19:



Calculate local structure matrix M for position p: A ← rx2 + gx2 + b2x , B ← ry2 + gy2 + b2y C ← rx ·ry + gx ·gy + bx ·by



20: 21: 22: 23: 24:



A C Get first eigenvalue/eigenvector of M = ( C B ): p 2 2 D ← (A−B) + 4·C  λ0 ← A+B + D / 2 ▷ √ Emag (p) ← λ0 ▷ Ex (p) ← A−B + D ▷ Ey (p) ← 2·C ▷



25: 26:



Enms (p) ← 0 Ebin (p) ← 0



Eqn. Eqn. Eqn. Eqn.



(16.26) (16.27) (16.28) (16.28)



▷ non-max suppressed gradient magn. ▷ final (binary) edge map



27: 28: 29: 30: 31: 32:



Perform directional non-maximum suppression: for u ← 1, . . . , M −2 do for v ← 1, . . . , N −2 do p ← (u, v) s ← GetOrientationSector(Ex (p), Ey (p)) ▷ Alg. 5.2 if IsLocalMax(Emag , p, s, tlo ) then ▷ Alg. 5.2 Enms (p) ← Emag (p) ▷ keep local maxima only



33: 34: 35: 36: 37:



Trace edges: for u ← 1, . . . , M −2 do for v ← 1, . . . , N −2 do p ← (u, v) if (Enms (p) ≥ thi ∧ Ebin (p) = 0) then TraceAndThreshold(Enms , Ebin , p, tlo )



38:



return Ebin .



▷ Alg. 5.2



Canny (grayscale)



Canny (color)



(a)



(b)



(c)



σ = 0.5



(d)



(e)



σ = 1.0



(f)



(g)



σ = 2.0



(h)



(i)



σ = 5.0



(i)



16.3 Canny Edge Detector for Color Images Fig. 16.6 Canny grayscale vs. color version. Results from the grayscale (left) and the color version (right) of the Canny operator for different values of σ (thi = 20%, tlo = 5% of max. edge magnitude).



491



16 Edge Detection in Color Images Fig. 16.7 Canny grayscale vs. color version. Results from the grayscale (left) and the color version (right) of the Canny operator for different threshold values thi , given in % of max. edge magnitude (tlo = 5%, σ = 2.0).



492



Canny (grayscale)



Canny (color)



(a)



(b)



(c)



thi = 10%



(d)



(e)



thi = 30%



(f)



(g)



thi = 50%



(h)



(i)



thi = 70%



(i)



σ = 2.0



σ = 5.0



(a)



(b)



(c)



(d)



(e)



(f)



16.4 Other Color Edge Operators Fig. 16.8 Scalar vs. vector-based color Canny operator. Results from the scalar Canny operator applied separately to each color channel (a, b). Channel edges are shown in corresponding colors, with mixed colors indicating that edge points were detected in multiple channels (e.g., yellow marks overlapping points from the red and the green channel). A black pixel indicates that an edge point was detected in all three color channels. Channel edges combined into a joint edge map (c, d). For comparison, the result of the vector-based color Canny operator (e, f). Common parameter settings are σ = 2.0 and 5.0, thi = 20%, tlo = 5% of max. edge magnitude.



16.4 Other Color Edge Operators The idea of using a vector field model in the context of color edge detection was first presented by Di Zenzo [74], who suggested finding the orientation of maximum change by maximizing S(p, θ) in Eqn. (16.18) over the angle θ. Later Cumani [69, 70] proposed directly using the eigenvalues and eigenvectors of the local structure matrix M (Eqn. (16.24)) for calculating edge strength and orientation. He also proposed using the zero-crossings of the second-order gradients along the direction of maximum contrast to precisely locate edges, which is a general problem with first-order techniques. Both Di Zenzo and Cumani used only the dominant eigenvalue, indicating the edge strength perpendicular to the edge (if an edge exists at all), and then discarded the smaller eigenvalue proportional to the edge strength in the perpendicular (i.e., tangential) direction. Real edges only exist where the larger eigenvalue is considerably greater than the smaller one. If both eigenvalues have similar values, this indicates that the local image surface exhibits change in all directions, which is not typically true at an edge but quite characteristic of flat, noisy regions and corners. One solution therefore is to use the difference between the eigenvalues, λ0 −λ1 , to quantify edge strength [232].



493



16 Edge Detection in Color Images



Several color versions of the Canny edge detector can be found in the literature, such as the one proposed by Kanade [163], which is very similar to the algorithm presented here. Other approaches of adapting the Canny detector for color images can be found in [104]. In addition to Canny’s scheme, other types of color edge detectors have been used successfully, including techniques based on vector order statistics and color difference vectors. Excellent surveys of the various color edge detection approaches can be found in [298] and [164, Ch. 6].



16.5 Java Implementation The following Java implementations of the algorithms described in this chapter can be found in the source code section14 of the book’s website. The common (abstract) super-class for all color edge detectors is ColorEdgeDetector, which mainly provides the following methods: FloatProcessor getEdgeMagnitude() Returns the resulting edge magnitude map E(p) as a FloatProcessor object. FloatProcessor getEdgeOrientation() Returns the resulting edge orientation map Φ(p) as a FloatProcessor object, with values in the range [−π, π]. The following edge detectors are defined as concrete sub-classes of ColorEdgeDetector: GrayscaleEdgeDetector: Implements an edge detector that uses only the intensity (brightness) of the supplied color image. MonochromaticEdgeDetector: Implements the monochromatic color edge detector described in Alg. 16.1. DiZenzoCumaniEdgeDetector: Implements the Di Zenzo-Cumani type color edge detector described in Alg. 16.2. CannyEdgeDetector: Implements the Canny edge detector for grayscale and color images described in Alg. 16.3. This class defines the additional methods ByteProcessor getEdgeBinary(), List getEdgeTraces(). Program 16.1 shows a complete example for the use of class CannyEdgeDetector inside an ImageJ plugin.



16.6 Exercises Exercise 16.1. Modify the monochromatic color edge operator described in Alg. 16.1 to use the L1 and L∞ norms, respectively, for calculating the local edge magnitude. Exercise 16.2. Verify Eqn. (16.30) by showing that the result is ˆ1 in Eqn. (16.28). identical to the normalized eigenvector e 14



494



Package imagingbook.pub.color.edge.



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39



import import import import import import



ij.ImagePlus; ij.plugin.filter.PlugInFilter; ij.process.ByteProcessor; ij.process.FloatProcessor; ij.process.ImageProcessor; imagingbook.pub.coloredge.CannyEdgeDetector;



import java.awt.Point; import java.util.List; public class Canny_Edge_Demo implements PlugInFilter { public int setup(String arg0, ImagePlus imp) { return DOES_ALL + NO_CHANGES; } public void run(ImageProcessor ip) { CannyEdgeDetector.Parameters params = new CannyEdgeDetector.Parameters();



16.6 Exercises Prog. 16.1 Use of the CannyEdgeDetector class in an ImageJ plugin. A parameter object (params) is created in line 20, subsequently configured (in lines 22–24) and finally used to construct a CannyEdgeDetector object in line 27. Note that edge detection is performed within the constructor method. Lines 29– 33 demonstrate how different types of edge detection results can be retrieved. The binary edge map eBin is displayed in line 35. As indicated in the setup() method (by returning DOES_ALL), this plugin works with any type of image.



params.gSigma = 3.0f; // σ of Gaussian params.hiThr = 20.0f; // 20% of max. edge magnitude params.loThr = 5.0f; // 5% of max. edge magnitude CannyEdgeDetector detector = new CannyEdgeDetector(ip, params); FloatProcessor eMag = detector.getEdgeMagnitude(); FloatProcessor eOrt = detector.getEdgeOrientation(); ByteProcessor eBin = detector.getEdgeBinary(); List edgeTraces = detector.getEdgeTraces(); (new ImagePlus("Canny Edges", eBin)).show();



}



}



// process edge detection results ...



495



17 Edge-Preserving Smoothing Filters



Noise reduction in images is a common objective in image processing, not only for producing pleasing results for human viewing but also to facilitate easier extraction of meaningful information in subsequent steps, for example, in segmentation or feature detection. Simple smoothing filters, such as the Gaussian filter1 and the filters discussed in Chapter 15 effectively perform low-pass filtering and thus remove high-frequency noise. However, they also tend to suppress high-rate intensity variations that are part of the original signal, thereby destroying image structures that are visually important. The filters described in this chapter are “edge preserving” in the sense that they change their smoothing behavior adaptively depending upon the local image structure. In general, maximum smoothing is performed over “flat” (uniform) image regions, while smoothing is reduced near or across edge-like structures, typically characterized by high intensity gradients. In the following, three classical types of edge preserving filters are presented, which are largely based on different strategies. The Kuwahara-type filters described in Sec. 17.1 partition the filter kernel into smaller sub-kernels and select the most “homogeneous” of the underlying image regions for calculating the filter’s result. In contrast, the bilateral filter in Sec. 17.2 uses the differences between pixel values to control how much each individual pixel in the filter region contributes to the local average. Pixels which are similar to the current center pixel contribute strongly, while highly different pixels add little to the result. Thus, in a sense, the bilateral filter is a non-homogeneous linear filter with a convolution kernel that is adaptively controlled by the local image content. Finally, the anisotropic diffusion filters in Sec. 17.3 iteratively smooth the image similar to the process of thermal diffusion, using the image gradient to block the local diffusion at edges and similar structures. It should be noted that all filters described in this chapter are nonlinear and can be applied to either grayscale or color images. 1



See Sec. 4.6.1.



© Springer Nature Switzerland AG 2022 W. Burger and M. J. Burge, Digital Image Processing, Texts in Computer Science, https://doi.org/10.1007/978-3-031-05744-1_17



497



17 Edge-Preserving Smoothing Filters



17.1 Kuwahara-Type Filters The filters described in this section are all based on a similar concept that has its early roots in the work of Kuwahara et al. [166]. Although many variations have been proposed by other authors, we summarize them here under the term “Kuwahara-type” to indicate their origin and algorithmic similarities. In principle, these filters work by calculating the mean and variance within neighboring image regions and selecting the mean value of the most “homogeneous” region, that is, the one with the smallest variance, to replace the original (center) pixel. For this purpose, the filter region R is divided into K partially overlapping subregions R0 , . . . , RK−1 . At every image position p, the mean µk and the variance σk2 of each subregion Rk are calculated from the corresponding pixel values in I as µk (I, p) =



1 X 1 · I(p+q) = ·S (I, p), |Rk | |Rk | 1,k



(17.1)



q∈Rk



2 1 X · Ip+q) − µk (I, p) |Rk | q∈Rk 1 1 = ·[S2,k (I, p) − ·S 2 (I, p)], |Rk | |Rk | 1,k



σk2 (I, p) =



(17.2) (17.3)



for k = 0, . . . , K −1, with2 S1,k (I, p) =



X



I(p + q),



(17.4)



q∈Rk



S2,k (I, p) =



X



I 2 (p + q).



(17.5)



q∈Rk



The mean (µ) of the subregion with the smallest variance (σ 2 ) is selected as the update value, that is, I ′ (p) ← µk′ (p),



with k ′ = argmin σk2 (I, p). 0≤k < K



(17.6)



The subregion structure originally proposed by Kuwahara et al. [166] is shown in Fig. 17.1 (a) for a 3 × 3 filter (r = 1). It uses four square subregions of size (r+1) × (r+1) that overlap at the center. In general, the size of the whole filter is (2r+1) × (2r+1). This particular filter process is summarized in Alg. 17.1. Note that this filter does not have a centered subregion, which means that the center pixel is always replaced by the mean of one of the neighboring regions, even if it had perfectly fit the surrounding values. Thus the filter always performs a spatial shift, which introduces jitter and banding artifacts in regions of smooth intensity change. This effect is reduced with the filter proposed by Tomita and Tsuji [261], which is similar but includes a fifth subregion at its center (Fig. 17.1 (b)). Filters of arbitrary size can be built by simply scaling the corresponding structure. In case of the Tomita-Tsuji filter, the side length of the subregions should be odd. 2



498



|Rk | denotes the size (number of pixels) of the subregion Rk .



• • • • • • • • •



• • • • • • • • •



R0



• • • • •



• • • • •



• • • • •



R0



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



R1



R1



• • • • •



• • • • •



• • • • •



• • • • • • • • • R2



(a) • • • • •



• • • • •



R2



• • • • •



• • • • •



17.1 Kuwahara-Type Filters



• • • • • • • • • R3



• • • • •



• • • • •



• • • • •



R3



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



R4



• • • • •



Fig. 17.1 Subregion structures for Kuwahara-type filters. The orginal Kuwahara-Hachimura filter (a) considers four square, overlapping subregions [166]. Tomita-Tsuji filter (b) with five subregions (r = 2). The current center pixel (red) is contained in all subregions.



(b)



Note that replacing a pixel value by the mean of a square neighborhood is equivalent to linear filtering with a simple box kernel, which is not an optimal smoothing operator. To reduce the artifacts caused by the square subregions, alternative filter structures have been proposed, such as the 5 × 5 Nagao-Matsuyama filter [193] shown in Fig. 17.2. • • • • •



• • • • •



• • • • •



R0



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



R1



R8



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



R2



R7



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



R3



R6



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



• • • • •



R4



R5



Fig. 17.2 Subregions for the 5 × 5 (r = 2) Nagao-Matsuyama filter [193]. Note that the centered subregion (R0 ) has a different size than the remaining subregions (R1 , . . . , R8 ).



If all subregions are of identical size |Rk | = n, the quantities 2 σk2 (I, p) · n = S2,k (I, p) − S1,k (I, p)/n



σk2 (I, p)



2



· n = S2,k (I, p) · n −



2 S1,k (I, p)



or



(17.7) (17.8)



can be used to quantify the amount of variation within each subregion. Both expressions require calculating one multiplication less for each pixel than the “real” variance σk2 in Eqn. (17.3). Moreover, if all subregions have the same shape (such as the filters in Fig. 17.1), additional optimizations are possible that substantially improve the performance. In this case, the local mean and variance need to be calculated only once over a fixed neighborhood for each image position. This type of filter can be efficiently implemented by using a set of pre-calculated maps for the local variance and mean values, as described in Alg. 17.2. As before, the parameter r specifies the radius of the composite filter, with subregions of size (r+1) × (r+1) and overall size (2r+1) × (2r+1). For example, r = 2 for the 5 × 5 filter shown in Fig. 17.1 (b). All these filters tend to generate banding artifacts in smooth image regions due to erratic spatial displacements, which become worse with increasing filter size. If a centered subregion is used (such as



499



17 Edge-Preserving Smoothing Filters Alg. 17.1 Simple KuwaharaHachimura filter.



1: KuwaharaFilter(I) Input: I, a grayscale image of size M × N . Returns a new (filtered) image of size M × N . 2: 3: 4: 5:



R0 R1 R2 R3











0 −1 0 ← { −1 −1 , −1 , ( 0 ), ( 0 )} 0 1 0 1 ← { −1 , −1 , ( 0 ), ( 0 )} ← {( 00 ), ( 10 ), ( 10 ), ( 11 )} ← {( −10 ), ( 00 ), ( −11 ), ( 10 )}



6: 7: 8: 9: 10: 11: 12: 13: 14: 15:



(M, N ) ← size(I) I ′ ← duplicate(I) for all p ∈ M × N do 2 σmin ←∞ for R ← R0 , . . . , R3 do (σ 2 , µ) ← EvalSubregion(I, R, p) 2 if σ 2 < σmin then 2 σmin ← σ 2 µmin ← µ I ′ (p) ← µmin



16:



return I ′



17: EvalSubregion(I, R, p) Returns the variance and mean of the grayscale image I for the subregion R positioned at p. 18: n ← |R| ▷ size or R 19: S1 ← 0, S2 ← 0 20: for all q ∈ R do 21: a ← I(p+q) 22: S1 ← S1 + a ▷ Eqn. (17.4) 23: S 2 ← S 2 + a2 ▷ Eqn. (17.5) 24: σ 2 ← (S2 − S12 /n)/n ▷ variance of subregion R, see Eqn. (17.1) 25: µ ← S1 /n ▷ mean of subregion R, see Eqn. (17.3) 26: return (σ 2 , µ)



R4 in Fig. 17.1 or R0 in Fig. 17.2), one could reduce this effect by applying a threshold (tσ ) to select any off-center subregion Rk only if its variance is significantly smaller than the variance of the center region R0 (see Alg. 17.2, line 14). 17.1.1 Application to Color Images While all of the aforementioned filters were originally designed for grayscale images, they are easily modified to work with color images. We only need to specify how to calculate the variance and mean for any subregion; the decision and replacement mechanisms then remain the same. Given a RGB color image I = (IR , IG , IB ), we can calculate the local mean and variance for each color channel and subregion Rk as    2  µk (IR , p) σk (IR , p) µk (I, p) = µk (IG , p), σ 2k (I, p) = σk2 (IG , p), (17.9) µk (IB , p) σk2 (IB , p) 500



with µk (), σk2 () as defined in Eqns. (17.1) and (17.3), respectively. Analogous to the grayscale case, each pixel is then replaced by the



1: FastKuwaharaFilter(I, r, tσ ) Input: I, a grayscale image of size M × N ; r, filter radius (r ≥ 1); tσ , variance threshold. Returns a new (filtered) image of size M × N. 2: 3: 4: 5: 6:



(M, N ) ← size(I) S ← new map M × N → R ▷ local var. S(p) = n·σ 2 (I, p) A ← new map M × N → R ▷ local mean A(p) = µ(I, p) dmin ← (r ÷ 2) − r ▷ subregions’ left/top position dmax ← dmin + r ▷ subregions’ right/bottom position



7: 8: 9: 10:



for all p ∈ M × N do (s, µ) ← EvalSquareSubregion(I, p, dmin , dmax ) S(p) ← s A(p) ← µ



11: 12:



n ← (r + 1)2 I ′ ← duplicate(I)



13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23:



for all p ∈ M × N do smin ← S(p) − tσ · n ▷ variance of center region µmin ← A(p) ▷ mean of center region (u, v) ← p for i ← dmin , . . . , dmax do for j ← dmin , . . . , dmax do if S(u+i, v+j) < smin then smin ← S(u+i, v+j) µmin ← A(u+i, v+j) I ′ (p) ← µmin return I ′



17.1 Kuwahara-Type Filters Alg. 17.2 Fast Kuwahara-type (TomitaTsuji) filter with variable size and fixed subregion structure. The filter uses five square subregions of size (r+1)×(r+1), with a composite filter of (2r+1)×(2r+1), as shown in Fig. 17.1 (b). The purpose of the variance threshold tσ is to reduce banding effects in smooth image regions (typically tσ = 5, . . . , 50 for 8-bit images).



▷ fixed subregion size



24: EvalSquareSubregion(I, p, dmin , dmax ) Returns the variance and mean of the grayscale image I for a square subregion positioned at p. 25: (u, v) ← p 26: S1 ← 0, S2 ← 0 27: for i ← dmin , . . . , dmax do 28: for j ← dmin , . . . , dmax do 29: a ← I(u+i, v+j) 30: S1 ← S1 + a ▷ Eqn. (17.4) 31: S 2 ← S 2 + a2 ▷ Eqn. (17.5) 32: s ← S2 − S12 /n ▷ subregion variance (s ≡ n · σ 2 ) 33: µ ← S1 /n ▷ subregion mean (µ) 34: return (s, µ)



average color in the subregion with the smallest variance, that is,

$$\boldsymbol{I}'(\boldsymbol{p}) \leftarrow \boldsymbol{\mu}_{k'}(\boldsymbol{I},\boldsymbol{p}), \quad\text{with}\quad k' = \arg\min_{0\le k < K} \sigma^2_{k,\mathrm{RGB}}(\boldsymbol{I},\boldsymbol{p}). \tag{17.10}$$

The overall variance σ²_{k,RGB}, used to determine k′ in Eqn. (17.10), can be defined in different ways, for example, as the sum of the variances in the individual color channels, i.e.,

$$\sigma^2_{k,\mathrm{RGB}}(\boldsymbol{I},\boldsymbol{p}) = \sigma^2_k(I_R,\boldsymbol{p}) + \sigma^2_k(I_G,\boldsymbol{p}) + \sigma^2_k(I_B,\boldsymbol{p}). \tag{17.11}$$



This is sometimes called the “total variance”. The resulting filter is summarized in Alg. 17.3 and color examples produced with this algorithm are shown in Figs. 17.3 and 17.4.






Alg. 17.3 Color version of the Kuwahara-type filter (adapted from Alg. 17.1). The algorithm uses the definition in Eqn. (17.11) for the total variance σ² in subregion R (see line 25). The vector µ (calculated in line 26) is the average color of the subregion.

1:  KuwaharaFilterColor(I)
    Input: I, an RGB image of size M × N.
    Returns a new (filtered) color image of size M × N.
2:      R0 ← {(−1,−1), (0,−1), (−1,0), (0,0)}
3:      R1 ← {(0,−1), (1,−1), (0,0), (1,0)}
4:      R2 ← {(0,0), (1,0), (0,1), (1,1)}
5:      R3 ← {(−1,0), (0,0), (−1,1), (0,1)}
6:      I′ ← duplicate(I)
7:      (M, N) ← size(I)
8:      for all p ∈ M × N do
9:          σ²min ← ∞
10:         for R ← R0, …, R3 do
11:             (σ², µ) ← EvalSubregion(I, R, p)
12:             if σ² < σ²min then
13:                 σ²min ← σ²
14:                 µmin ← µ
15:         I′(p) ← µmin
16:     return I′

17: EvalSubregion(I, R, p)
    Returns the total variance and the mean vector of the color image I for the subregion R positioned at p.
18:     n ← |R|                            ▷ size of R
19:     S1 ← 0, S2 ← 0                     ▷ S1, S2 ∈ R³
20:     for all q ∈ R do
21:         a ← I(p+q)                     ▷ a ∈ R³
22:         S1 ← S1 + a
23:         S2 ← S2 + a²                   ▷ a² = element-wise square of a
24:     S ← (1/n)·S2 − ((1/n)·S1)²         ▷ S = (σ²R, σ²G, σ²B)
25:     σ²RGB ← ΣS                         ▷ σ²RGB = σ²R + σ²G + σ²B, total variance in R
26:     µ ← (1/n)·S1                       ▷ µ ∈ R³, avg. color vector for subregion R
27:     return (σ²RGB, µ)
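A corresponding sketch for the color case, again only illustrative and not the book's implementation, ranks the four subregions by the total variance of Eqn. (17.11) and writes back the average color as in Eqn. (17.10):

```python
import numpy as np

def kuwahara_color_3x3(I):
    """Color Kuwahara filter sketch: the subregion with the smallest
    total variance (sum of per-channel variances, Eqn. (17.11))
    supplies the average color (Eqn. (17.10))."""
    regions = [
        [(-1, -1), (0, -1), (-1, 0), (0, 0)],
        [( 0, -1), (1, -1), ( 0, 0), (1, 0)],
        [( 0,  0), (1,  0), ( 0, 1), (1, 1)],
        [(-1,  0), (0,  0), (-1, 1), (0, 1)],
    ]
    I = np.asarray(I, dtype=float)            # shape (rows, cols, 3)
    rows, cols, _ = I.shape
    out = I.copy()
    for v in range(1, rows - 1):
        for u in range(1, cols - 1):
            best_var, best_mean = np.inf, I[v, u]
            for R in regions:
                vals = np.array([I[v + dv, u + du] for (du, dv) in R])  # (4, 3)
                total_var = vals.var(axis=0).sum()   # sum of channel variances
                if total_var < best_var:
                    best_var, best_mean = total_var, vals.mean(axis=0)
            out[v, u] = best_mean
    return out
```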



Alternatively [130], one could define the combined color variance as the norm of the color covariance matrix³ for the subregion Rk,

$$\boldsymbol{\Sigma}_k(\boldsymbol{I},\boldsymbol{p}) = \begin{pmatrix} \sigma_{k,RR} & \sigma_{k,RG} & \sigma_{k,RB}\\ \sigma_{k,GR} & \sigma_{k,GG} & \sigma_{k,GB}\\ \sigma_{k,BR} & \sigma_{k,BG} & \sigma_{k,BB} \end{pmatrix}, \quad\text{with} \tag{17.12}$$

$$\sigma_{k,a,b} = \frac{1}{|R_k|}\cdot\sum_{\boldsymbol{q}\in R_k} \bigl(I_a(\boldsymbol{p}+\boldsymbol{q}) - \mu_k(I_a,\boldsymbol{p})\bigr)\cdot\bigl(I_b(\boldsymbol{p}+\boldsymbol{q}) - \mu_k(I_b,\boldsymbol{p})\bigr), \tag{17.13}$$

for all possible color pairs (a, b) ∈ {R, G, B}². Note that σ_{k,a,a} = σ²_{k,a} and σ_{k,a,b} = σ_{k,b,a}, and thus the matrix Σk is symmetric and only 6 of its 9 entries need to be calculated. The (Frobenius) norm of this 3 × 3 color covariance matrix is defined as

$$\sigma^2_{k,\mathrm{RGB}} = \|\boldsymbol{\Sigma}_k(\boldsymbol{I},\boldsymbol{p})\|_2^2 = \sum_{a,b\,\in\,\{R,G,B\}} (\sigma_{k,a,b})^2. \tag{17.14}$$

³ See Sec. G.2 of the Appendix for details.






Fig. 17.3 Kuwahara-type (Tomita-Tsuji) filter—color example using the variance definition in Eqn. (17.11). The filter radius is varied from r = 1 (b) to r = 4 (e). Panels: (a) RGB test image with selected details; (b) r = 1 (3 × 3 filter); (c) r = 2 (5 × 5 filter); (d) r = 3 (7 × 7 filter); (e) r = 4 (9 × 9 filter).



Note that the total variance in Eqn. (17.11)—which is simpler to calculate than this norm—is equivalent to the trace of Σk . Since each pixel of the filtered image is calculated as the mean (i.e., a linear combination) of a set of original color pixels, the results depend on the color space used, as discussed in Sec. 15.1.2.
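This trace relationship is easy to verify numerically; the following throwaway check (random sample values, NumPy's covariance with population normalization) is only meant to illustrate the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
colors = rng.random((4, 3))                      # RGB samples of one 2x2 subregion

total_var = colors.var(axis=0).sum()             # Eqn. (17.11): sum of channel variances
Sigma = np.cov(colors, rowvar=False, bias=True)  # 3x3 color covariance matrix (Eqn. (17.12))
print(np.isclose(total_var, np.trace(Sigma)))    # True: total variance equals trace(Sigma)
```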






Fig. 17.4 Color versions of the Tomita-Tsuji filter (Fig. 17.1 (b)) and the Nagao-Matsuyama filter (Fig. 17.2). Both filters are of size 5 × 5 and use the variance definition in Eqn. (17.11). Results are visually similar, but in general the Nagao-Matsuyama filter is slightly less destructive on diagonal structures. Original image in Fig. 17.3 (a). Panels: (a) 5 × 5 Tomita-Tsuji filter (r = 2); (b) 5 × 5 Nagao-Matsuyama filter.



17.2 Bilateral Filter

Traditional linear smoothing filters operate by convolving the image with a kernel, whose coefficients act as weights for the corresponding image pixels and depend only on the spatial distance from the center coordinate. Pixels close to the filter center are typically given larger weights, while pixels at a greater distance carry smaller weights. Thus the convolution kernel effectively encodes the closeness of the underlying pixels in space. In the following, a filter whose weights depend only on the distance in the spatial domain is called a domain filter.

To make smoothing filters less destructive on edges, a typical strategy is to exclude individual pixels from the filter operation, or to reduce the weight of their contribution, if they are very dissimilar in value to the pixel found at the center position. This operation too can be formulated as a filter, but this time the kernel coefficients depend only upon the differences in pixel values or range. Such a filter is therefore called a range filter, as explained in more detail in Sec. 17.2.2. The idea of the bilateral filter, proposed by Tomasi and Manduchi in [260], is to combine both domain and range filtering into a common, edge-preserving smoothing filter.

17.2.1 Domain Filter

In an ordinary 2D linear (i.e., "convolution") filter,⁴

$$I'(u,v) = \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} I(u{+}i, v{+}j)\cdot H(i,j) \tag{17.15}$$

$$\phantom{I'(u,v)} = \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} I(i,j)\cdot H(i{-}u, j{-}v), \tag{17.16}$$

⁴ See also Eqn. (4.5) on page 88.



a new pixel value I′(u, v) is calculated as the weighted average of the original image pixels I in a certain neighborhood, with the weights specified by the elements of the filter kernel H.⁵ The weight assigned to each pixel only depends on its spatial position relative to the current center coordinate (u, v). In particular, H(0, 0) specifies the weight given to the center pixel I(u, v), and H(i, j) is the weight assigned to a pixel displaced by (i, j) from the center. Since only the spatial image coordinates are relevant, such a filter is called a domain filter, and we use

$$H_d : \mathbb{Z}\times\mathbb{Z} \to \mathbb{R} \tag{17.17}$$

to denote a domain kernel in the following. Obviously, ordinary filters as we know them are all domain filters.

17.2.2 Range Filter

Although the idea may appear strange at first, one could also apply a linear filter to the pixel values or range of an image in the form

$$I'_r(u,v) = \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} I(i,j)\cdot h_r\bigl(I(i,j)-I(u,v)\bigr). \tag{17.18}$$

At some filter position (u, v), the contribution of a neighboring pixel at relative position (i, j) is specified by the function

$$h_r : \mathbb{R} \to \mathbb{R}, \tag{17.19}$$

which is called the range kernel. In Eqn. (17.18), the scalar-valued argument passed to hr() is the difference between the pixel value I(i, j) and the value of the center pixel I(u, v). Note that hr() is a one-dimensional function⁶ over pixel values and the spatial position of the contributing pixels is irrelevant. In particular, this means that all included pixels with the same value contribute equally to the result I′r(u, v). Consequently, the application of a range filter has no spatial effect upon the image; in contrast to a domain filter, it can cause no blurring or sharpening. Instead, a range filter effectively performs a global point operation by remapping the intensity (or color) values. However, a global range filter by itself is of little use, since it combines pixels from the entire image and only changes the intensity or color map of the image, equivalent to a nonlinear, image-dependent point operation.

⁵ In Eqn. (17.16), the 2D functions I() and H() are assumed to be zero outside their domains of definition.
⁶ Thus the lower-case designation hr.

17.2.3 Bilateral Filter: General Idea

The key idea behind the bilateral filter is to combine domain filtering (Eqn. (17.16)) and range filtering (Eqn. (17.18)) in the form



$$I'(u,v) = \frac{1}{W_{u,v}}\cdot\sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} I(i,j)\cdot \underbrace{H_d(i{-}u, j{-}v)\cdot h_r\bigl(I(i,j)-I(u,v)\bigr)}_{w_{i,j}}, \tag{17.20}$$

where Hd, hr are the domain and range kernels, respectively, wi,j are the resulting composite weights, and the scalar quantity

$$W_{u,v} = \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} w_{i,j} = \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} H_d(i{-}u, j{-}v)\cdot h_r\bigl(I(i,j)-I(u,v)\bigr) \tag{17.21}$$



is the (position-dependent) sum of the weights wi,j, used to normalize the combined filter kernel. In this form, the scope of range filtering is constrained to the spatial neighborhood defined by the domain kernel Hd. At a given filter position (u, v), the weight wi,j assigned to each contributing pixel depends upon

• its spatial position relative to (u, v), and
• the similarity of its pixel value I(i, j) to the value at the center position, i.e., I(u, v).

In other words, the resulting pixel is the weighted average of pixels that are nearby and similar to the original pixel. In a flat image region, where most surrounding pixels have values similar to the center pixel, the bilateral filter acts as a conventional smoothing filter, controlled only by the domain kernel Hd. However, when placed near a step edge or on an intensity ridge, only those pixels are included in the smoothing process that are similar in value to the center pixel, thus avoiding blurring the edges. If the domain kernel Hd has a limited radius D, i.e., size (2D+1) × (2D+1), the bilateral filter defined in Eqn. (17.20) can be written as



$$I'(u,v) = \frac{\displaystyle\sum_{i=u-D}^{u+D}\sum_{j=v-D}^{v+D} I(i,j)\cdot H_d(i{-}u,j{-}v)\cdot h_r\bigl(I(i,j)-I(u,v)\bigr)}{\displaystyle\sum_{i=u-D}^{u+D}\sum_{j=v-D}^{v+D} H_d(i{-}u,j{-}v)\cdot h_r\bigl(I(i,j)-I(u,v)\bigr)}$$

$$\phantom{I'(u,v)} = \frac{\displaystyle\sum_{m=-D}^{D}\sum_{n=-D}^{D} I(u{+}m,v{+}n)\cdot H_d(m,n)\cdot h_r\bigl(I(u{+}m,v{+}n)-I(u,v)\bigr)}{\displaystyle\sum_{m=-D}^{D}\sum_{n=-D}^{D} H_d(m,n)\cdot h_r\bigl(I(u{+}m,v{+}n)-I(u,v)\bigr)} \tag{17.22}$$



(by substituting (i−u) → m and (j−v) → n). The effective, space-variant filter kernel for the image I at position (u, v) thus is

$$\bar{H}_{I,u,v}(i,j) = \frac{H_d(i,j)\cdot h_r\bigl(I(u{+}i,v{+}j)-I(u,v)\bigr)}{\displaystyle\sum_{m=-D}^{D}\sum_{n=-D}^{D} H_d(m,n)\cdot h_r\bigl(I(u{+}m,v{+}n)-I(u,v)\bigr)}, \tag{17.23}$$

for −D ≤ i, j ≤ D, and zero everywhere else. This quantity specifies the contribution of the original image pixels I(u+i, v+j) to the resulting new pixel value I′(u, v).

17.2.4 Bilateral Filter with Gaussian Kernels

A special (but common) case is the use of Gaussian kernels for both the domain and the range parts of the bilateral filter. The discrete 2D Gaussian domain kernel of width σd is defined as

$$H^{G}_{\sigma_d}(m,n) = \frac{1}{2\pi\sigma_d^2}\cdot\exp\Bigl(-\frac{m^2+n^2}{2\sigma_d^2}\Bigr), \tag{17.24}$$



for m, n ∈ Z. It has its maximum at the center (m = n = 0) and drops off smoothly and isotropically with increasing radius ρ = √(m²+n²). For ρ > 3.5·σd, the Gaussian is practically zero. The 2D Gaussian function can be x/y-separated into the product of two 1D Gaussians, allowing for quite efficient implementations.⁷ The constant factor 1/(2πσd²) can be neglected in the calculation, since the bilateral filter performs kernel normalization at each image position anyway (Eqn. (17.21)). Similarly, the corresponding range filter kernel is defined as a (continuous) 1D Gaussian of width σr, i.e.,

$$h^{G}_{\sigma_r}(x) = \frac{1}{\sqrt{2\pi}\,\sigma_r}\cdot\exp\Bigl(-\frac{x^2}{2\sigma_r^2}\Bigr), \tag{17.25}$$

for x ∈ R. Again the constant factor 1/(√(2π)·σr) may be ignored, and the resulting composite filter (Eqn. (17.20)) can thus be written as

$$I'(u,v) = \frac{1}{W_{u,v}}\cdot\sum_{i=u-D}^{u+D}\sum_{j=v-D}^{v+D} I(i,j)\cdot H^{G}_{\sigma_d}(i{-}u,j{-}v)\cdot h^{G}_{\sigma_r}\bigl(I(i,j)-I(u,v)\bigr) \tag{17.26}$$

$$= \frac{1}{W_{u,v}}\cdot\sum_{m=-D}^{D}\sum_{n=-D}^{D} I(u{+}m,v{+}n)\cdot H^{G}_{\sigma_d}(m,n)\cdot h^{G}_{\sigma_r}\bigl(I(u{+}m,v{+}n)-I(u,v)\bigr)$$

$$= \frac{1}{W_{u,v}}\cdot\sum_{m=-D}^{D}\sum_{n=-D}^{D} I(u{+}m,v{+}n)\cdot \exp\Bigl(-\frac{m^2+n^2}{2\sigma_d^2}\Bigr)\cdot \exp\Bigl(-\frac{(I(u{+}m,v{+}n)-I(u,v))^2}{2\sigma_r^2}\Bigr), \tag{17.27}$$

with D = ⌈3.5·σd⌉ and

$$W_{u,v} = \sum_{m=-D}^{D}\sum_{n=-D}^{D} \exp\Bigl(-\frac{m^2+n^2}{2\sigma_d^2}\Bigr)\cdot \exp\Bigl(-\frac{(I(u{+}m,v{+}n)-I(u,v))^2}{2\sigma_r^2}\Bigr). \tag{17.28}$$



For 8-bit grayscale images, with pixel values in the range [0, 255], the width of the range kernel is typically set to σr = 10, …, 50. The width of the domain kernel (σd) depends on the desired amount of spatial smoothing. Algorithm 17.4 gives a summary of the steps involved in bilateral filtering for grayscale images.

⁷ See also Sec. 4.3.3.



Alg. 17.4 Bilateral filter with Gaussian domain and range kernels (grayscale version). Of course, the domain kernel Hd(m, n) = wd, which depends only on relative coordinates m, n, can be precalculated as an array. The range kernel hr(a−b) = wr can also be pre-tabulated if pixel values are discrete.

1:  BilateralFilterGray(I, σd, σr)
    Input: I, a grayscale image of size M × N; σd, width of the 2D Gaussian domain kernel; σr, width of the 1D Gaussian range kernel.
    Returns a new filtered image of size M × N.
2:      (M, N) ← size(I)
3:      D ← ⌈3.5·σd⌉                       ▷ width of domain filter kernel
4:      I′ ← duplicate(I)
5:      for all (u, v) ∈ M × N do
6:          S ← 0                          ▷ sum of weighted pixel values
7:          W ← 0                          ▷ sum of weights
8:          a ← I(u, v)                    ▷ center pixel value
9:          for m ← −D, …, D do
10:             for n ← −D, …, D do
11:                 b ← I(u+m, v+n)        ▷ off-center pixel value
12:                 wd ← exp(−(m²+n²)/(2σd²))   ▷ domain coefficient
13:                 wr ← exp(−(a−b)²/(2σr²))    ▷ range coefficient
14:                 w ← wd·wr              ▷ composite coefficient
15:                 S ← S + w·b
16:                 W ← W + w
17:         I′(u, v) ← S/W
18:     return I′
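For experimentation, a direct (and deliberately unoptimized) NumPy transcription of Eqn. (17.27) might look as follows; the clipping-based border treatment and the function name are our own choices, not part of the book's code:

```python
import numpy as np

def bilateral_gray(I, sigma_d=2.0, sigma_r=50.0):
    """Brute-force Gaussian bilateral filter (grayscale), following
    Eqn. (17.27). O(D^2) work per pixel; borders handled by clipping."""
    I = np.asarray(I, dtype=float)
    rows, cols = I.shape
    D = int(np.ceil(3.5 * sigma_d))
    # domain kernel, precomputed once (it depends only on m, n)
    m, n = np.mgrid[-D:D + 1, -D:D + 1]
    Hd = np.exp(-(m**2 + n**2) / (2 * sigma_d**2))
    out = np.empty_like(I)
    for v in range(rows):
        for u in range(cols):
            v0, v1 = max(v - D, 0), min(v + D + 1, rows)
            u0, u1 = max(u - D, 0), min(u + D + 1, cols)
            patch = I[v0:v1, u0:u1]
            hd = Hd[(v0 - v + D):(v1 - v + D), (u0 - u + D):(u1 - u + D)]
            hr = np.exp(-(patch - I[v, u])**2 / (2 * sigma_r**2))  # range kernel
            w = hd * hr
            out[v, u] = np.sum(w * patch) / np.sum(w)
    return out
```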



Figures 17.5–17.9 show the effective, space-variant filter kernels (see Eqn. (17.23)) and the results of applying a bilateral filter with Gaussian domain and range kernels in different situations. Uniform noise was applied to the original images to demonstrate the filtering effect. One can see clearly how the range part makes the combined filter kernel adapt to the local image structure. Only those surrounding parts that have brightness values similar to the center pixel are included in the filter operation. The filter parameters were set to σd = 2.0 and σr = 50; the domain kernel is of size 15 × 15.



17.2.5 Application to Color Images



Linear smoothing filters are typically used on color images by separately applying the same filter to the individual color channels. As discussed in Sec. 15.1, this is legitimate if a suitable working color space is used to avoid the introduction of unnatural intensity and chromaticity values. Thus, for the domain-part of the bilateral filter, the same considerations apply as for any linear smoothing filter. However, as will be described, the bilateral filter as a whole cannot be implemented by filtering the color channels separately.






In the range part of the filter, the weight assigned to each contributing pixel depends on its difference to the value of the center pixel. Given a suitable distance measure dist(a, b) between two color vectors a, b, the bilateral filter in Eqn. (17.20) can be easily modified for a color image I to



Fig. 17.5 Bilateral filter response when positioned in a flat, noisy image region. Original image function (b), filtered image (c), effective impulse response (a) of the filter at the given position.

Fig. 17.6 Bilateral filter response when positioned on a linear ramp. Original image function (b), filtered image (c), effective impulse response (a) of the filter at the given position.



$$\boldsymbol{I}'(u,v) = \frac{1}{W_{u,v}}\cdot\sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} \boldsymbol{I}(i,j)\cdot H_d(i{-}u,j{-}v)\cdot h_r\bigl(\mathrm{dist}(\boldsymbol{I}(i,j),\boldsymbol{I}(u,v))\bigr), \tag{17.29}$$

with

$$W_{u,v} = \sum_{i,j} H_d(i{-}u,j{-}v)\cdot h_r\bigl(\mathrm{dist}(\boldsymbol{I}(i,j),\boldsymbol{I}(u,v))\bigr). \tag{17.30}$$



It is common to use one of the popular norms for measuring color distances, such as the L1 , L2 (Euclidean), or L∞ (maximum) norms, for example (assuming colors a, b have K components),






Fig. 17.7 Bilateral filter response when positioned to the left of a vertical step edge. Original image function (b), filtered image (c), effective impulse response (a) of the filter at the given position. Note how the effective kernel (a) is suppressed on the right (i.e., higher) side of the edge, which inhibits blurring across the edge.

Fig. 17.8 Bilateral filter response when positioned to the right of a vertical step edge. Original image function (b), filtered image (c), effective impulse response (a) of the filter at the given position. Note how the effective kernel (a) is suppressed on the left (i.e., lower) side of the edge, which inhibits blurring across the edge.



$$\mathrm{dist}_1(\boldsymbol{a},\boldsymbol{b}) := \|\boldsymbol{a}-\boldsymbol{b}\|_1 = \sum_{k=0}^{K-1}|a_k - b_k|, \tag{17.31}$$

$$\mathrm{dist}_2(\boldsymbol{a},\boldsymbol{b}) := \|\boldsymbol{a}-\boldsymbol{b}\|_2 = \Bigl(\sum_{k=0}^{K-1}|a_k - b_k|^2\Bigr)^{1/2}, \tag{17.32}$$

$$\mathrm{dist}_\infty(\boldsymbol{a},\boldsymbol{b}) := \|\boldsymbol{a}-\boldsymbol{b}\|_\infty = \max_{0\le k<K}|a_k - b_k|. \tag{17.33}$$
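In NumPy, the three distances can be evaluated in one line each, for example (the sample color vectors below are chosen arbitrarily for illustration):

```python
import numpy as np

a, b = np.array([120., 80., 40.]), np.array([100., 90., 60.])
d1   = np.abs(a - b).sum()        # L1 distance, Eqn. (17.31)
d2   = np.linalg.norm(a - b)      # L2 (Euclidean) distance, Eqn. (17.32)
dinf = np.abs(a - b).max()        # L-infinity (maximum) distance, Eqn. (17.33)
```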



Fig. 17.9 Bilateral filter response when positioned at a corner. Original image function (b), filtered image (c), effective impulse response (a) of the filter at the given position.

Fig. 17.10 Bilateral filter response when positioned on a vertical ridge. Original image function (b), filtered image (c), effective impulse response (a) of the filter at the given position.



Of course, in most color spaces none of these norms measures perceived color difference.⁸ However, the distance function itself is not really critical, since it only affects the relative weights assigned to the contributing color pixels. Regardless of the distance function used, the resulting chromaticities are linear, convex combinations of the

⁸ The CIELAB and CIELUV color spaces are designed to use the Euclidean distance (L2 norm) as a valid metric for color difference (see Sec. 14.1.5).



Alg. 17.5 Bilateral filter with Gaussian domain and range kernels (color version), analogous to Alg. 17.4. The function dist(a, b) in line 12 measures the distance between two colors a and b, e.g., as the L2 norm of the difference vector (see Eqns. (17.31)–(17.33) for other options).

1:  BilateralFilterColor(I, σd, σr)
    Input: I, a color image of size M × N; σd, width of the 2D Gaussian domain kernel; σr, width of the 1D Gaussian range kernel.
    Returns a new filtered color image of size M × N.
2:      (M, N) ← size(I)
3:      D ← ⌈3.5·σd⌉                       ▷ width of domain filter kernel
4:      I′ ← duplicate(I)
5:      for all (u, v) ∈ M × N do
6:          S ← 0                          ▷ S ∈ R^K, sum of weighted pixel vectors
7:          W ← 0                          ▷ sum of pixel weights (scalar)
8:          a ← I(u, v)                    ▷ a ∈ R³, center pixel vector
9:          for m ← −D, …, D do
10:             for n ← −D, …, D do
11:                 b ← I(u+m, v+n)        ▷ b ∈ R³, off-center pixel vector
12:                 d ← dist(a, b)         ▷ color distance, e.g., ∥a−b∥₂
13:                 wd ← exp(−(m²+n²)/(2σd²))   ▷ domain coefficient
14:                 wr ← exp(−d²/(2σr²))        ▷ range coefficient
15:                 w ← wd·wr              ▷ composite coefficient
16:                 S ← S + w·b            ▷ accumulate weighted color vector
17:                 W ← W + w              ▷ accumulate weight
18:         I′(u, v) ← (1/W)·S
19:     return I′







original colors in the filter region, and thus the choice of the working color space is more important (see Sec. 15.1). The process of bilateral filtering for color images (again using Gaussian kernels for the domain and the range filters) is summarized in Alg. 17.5. The examples in Fig. 17.11 were produced using sRGB as the color working space and the Euclidean distance (L2 norm) for measuring color distances.

17.2.6 Efficient Implementation by x/y Separation

The bilateral filter, if implemented in the way described in Algs. 17.4–17.5, is computationally expensive, with a time complexity of O(D²) for each pixel, where D denotes the radius of the filter. Some mild speedup is possible by tabulating the domain and range kernels, but the performance of the brute-force implementation is usually not acceptable for practical applications. In [209], a separable approximation of the bilateral filter is proposed that brings about a significant performance increase. In this implementation, a 1D bilateral filter is first applied in the horizontal direction only, which uses 1D domain and range kernels hd and hr, respectively, and produces the intermediate image I▷, that is (from Eqn. (17.22)),



$$I^{\triangleright}(u,v) = \frac{\displaystyle\sum_{m=-D}^{D} I(u{+}m,v)\cdot h_d(m)\cdot h_r\bigl(I(u{+}m,v)-I(u,v)\bigr)}{\displaystyle\sum_{m=-D}^{D} h_d(m)\cdot h_r\bigl(I(u{+}m,v)-I(u,v)\bigr)}. \tag{17.34}$$



Fig. 17.11 Bilateral filter—color example. A Gaussian kernel with σd = 2.0 (kernel size 15 × 15) is used for the domain part of the filter; the working color space is sRGB. The width of the range filter is varied: (a) σr = 10; (b) σr = 20; (c) σr = 50; (d) σr = 100.



In the second pass, the same filter is applied to the intermediate result I ▷ in the vertical direction to obtain the final result I ′ as







$$I'(u,v) = \frac{\displaystyle\sum_{n=-D}^{D} I^{\triangleright}(u,v{+}n)\cdot h_d(n)\cdot h_r\bigl(I^{\triangleright}(u,v{+}n)-I^{\triangleright}(u,v)\bigr)}{\displaystyle\sum_{n=-D}^{D} h_d(n)\cdot h_r\bigl(I^{\triangleright}(u,v{+}n)-I^{\triangleright}(u,v)\bigr)}, \tag{17.35}$$



for all (u, v), using the same 1D domain and range kernels hd and hr , respectively, as in Eqn. (17.34). For the horizontal part of the filter, the effective, space-variant (i.e., image-dependent) kernel at position (u, v) is



$$\bar{h}^{\triangleright}_{I,u,v}(i) = \frac{h_d(i)\cdot h_r\bigl(I(u{+}i,v)-I(u,v)\bigr)}{\displaystyle\sum_{m=-D}^{D} h_d(m)\cdot h_r\bigl(I(u{+}m,v)-I(u,v)\bigr)}, \tag{17.36}$$

for −D ≤ i ≤ D (zero otherwise). Analogously, the effective kernel for the vertical part of the filter is

$$\bar{h}^{\triangledown}_{I,u,v}(j) = \frac{h_d(j)\cdot h_r\bigl(I(u,v{+}j)-I(u,v)\bigr)}{\displaystyle\sum_{n=-D}^{D} h_d(n)\cdot h_r\bigl(I(u,v{+}n)-I(u,v)\bigr)}, \tag{17.37}$$

again for −D ≤ j ≤ D. For the combined filter, the effective 2D kernel at position (u, v) then is

$$\bar{H}_{I,u,v}(i,j) = \begin{cases} \bar{h}^{\triangleright}_{I,u,v}(i)\cdot \bar{h}^{\triangledown}_{I^{\triangleright},u,v}(j) & \text{for } -D \le i,j \le D,\\ 0 & \text{otherwise,} \end{cases} \tag{17.38}$$

where I is the original image and I▷ denotes the intermediate image, as defined in Eqn. (17.34). Alternatively, the vertical filter could be applied first, followed by the horizontal filter.

Algorithm 17.6 shows a direct implementation of the separable bilateral filter for grayscale images, using Gaussian kernels for both the domain and the range parts of the filter. Again, the extension to color images is straightforward (see Eqn. (17.30) and Exercise 17.3). As intended, the advantage of the separable filter is performance. For a given kernel radius D, the original (non-separable) filter requires O(D²) calculations for each pixel, while the separable version takes only O(D) steps. This means a substantial saving and speed increase, particularly for large filters.

Figure 17.12 shows the response of the 1D separable bilateral filter in various situations. The results produced by the separable filter are very similar to those obtained with the original filter in Figs. 17.5–17.9, partly because the local structures in these images are parallel to the coordinate axes. In general, the results are different, as demonstrated for a diagonal step edge in Fig. 17.13. The effective filter kernels are shown in Fig. 17.13 (g, h) for an anchor point positioned on the bright side of the edge. It can be seen that, while the kernel of the full filter in Fig. 17.13 (g) is orientation-insensitive, the upper part of the separable kernel is clearly truncated in Fig. 17.13 (h). But although the separable bilateral filter is sensitive to local structure orientation, it performs well and is usually a sufficient substitute for the non-separable version [209]. The color examples shown in Fig. 17.14 demonstrate the effects of 1D bilateral filtering in the x- and y-directions. Note that the results are not exactly the same if the filter is first applied in the x- or in the y-direction, but usually the differences are negligible.

17.2.7 Further Reading






A thorough analysis of the bilateral filter as well as its relationship to adaptive smoothing and nonlinear diffusion can be found in [21]



Fig. 17.12 Response of a separable bilateral filter in various situations. Effective kernel H̄_{I,u,v} (Eqn. (17.38)) at the center pixel (a–e), original image data (f–j), filtered image data (k–o). Settings are the same as in Figs. 17.5–17.9.

Fig. 17.13 Bilateral filter—full vs. separable version (columns: original image, full bilateral filter, separable version). Original image (a) and enlarged detail (d). Results of the full bilateral filter (b, e) and the separable version (c, f). The effective impulse response (for the center pixel positioned on the bright side of the step edge) for the full filter (g) and the separable version (h). Note how the upper part of the kernel in (h) is truncated along the horizontal axis, which shows that the separable filter is orientation-sensitive. In both cases, σd = 2.0, σr = 25.






Alg. 17.6 Separable bilateral filter with Gaussian kernels (grayscale version, adapted from Alg. 17.4). The input image is processed in two passes. In each pass, a 1D kernel is applied in the horizontal or vertical direction, respectively (see Eqns. (17.34)–(17.35)). Note that results of the separable filter are similar (but not identical) to the full (2D) bilateral filter in Alg. 17.4.

1:  BilateralFilterGraySeparable(I, σd, σr)
    Input: I, a grayscale image of size M × N; σd, width of the 2D Gaussian domain kernel; σr, width of the 1D Gaussian range kernel.
    Returns a new filtered image of size M × N.
2:      (M, N) ← size(I)
3:      D ← ⌈3.5·σd⌉                       ▷ width of domain filter kernel
4:      I▷ ← duplicate(I)
        Pass 1 (horizontal):
5:      for all (u, v) ∈ M × N do
6:          a ← I(u, v)
7:          S ← 0, W ← 0
8:          for m ← −D, …, D do
9:              b ← I(u+m, v)
10:             wd ← exp(−m²/(2σd²))       ▷ domain kernel coeff. hd(m)
11:             wr ← exp(−(a−b)²/(2σr²))   ▷ range kernel coeff. hr(b)
12:             w ← wd·wr                  ▷ composite filter coeff.
13:             S ← S + w·b
14:             W ← W + w
15:         I▷(u, v) ← S/W                 ▷ see Eqn. (17.34)
16:     I′ ← duplicate(I▷)
        Pass 2 (vertical):
17:     for all (u, v) ∈ M × N do
18:         a ← I▷(u, v)
19:         S ← 0, W ← 0
20:         for n ← −D, …, D do
21:             b ← I▷(u, v+n)
22:             wd ← exp(−n²/(2σd²))       ▷ domain kernel coeff. hd(n)
23:             wr ← exp(−(a−b)²/(2σr²))   ▷ range kernel coeff. hr(b)
24:             w ← wd·wr                  ▷ composite filter coeff.
25:             S ← S + w·b
26:             W ← W + w
27:         I′(u, v) ← S/W                 ▷ see Eqn. (17.35)
28:     return I′
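A compact NumPy sketch of the two-pass scheme of Eqns. (17.34)–(17.35) is given below; the helper structure, axis handling, and border clipping are our own simplifications and not the book's reference implementation:

```python
import numpy as np

def _bilateral_1d_pass(I, sigma_d, sigma_r, axis):
    """One 1D bilateral pass over a 2D grayscale array
    (Eqn. (17.34) along rows when axis=1, Eqn. (17.35) along columns when axis=0)."""
    D = int(np.ceil(3.5 * sigma_d))
    hd = np.exp(-np.arange(-D, D + 1)**2 / (2 * sigma_d**2))   # 1D domain kernel
    out = np.empty_like(I)
    src = np.moveaxis(I, axis, 0)        # view: filtering runs along axis 0
    dst = np.moveaxis(out, axis, 0)
    n = src.shape[0]
    for i in range(n):
        lo, hi = max(i - D, 0), min(i + D + 1, n)
        seg = src[lo:hi]                                   # (window, other axis)
        wd = hd[lo - i + D:hi - i + D][:, None]            # domain weights
        wr = np.exp(-(seg - src[i])**2 / (2 * sigma_r**2)) # range weights
        w = wd * wr
        dst[i] = (w * seg).sum(axis=0) / w.sum(axis=0)
    return out

def bilateral_separable(I, sigma_d=2.0, sigma_r=25.0):
    """x/y-separable bilateral filter: horizontal pass, then vertical pass."""
    I = np.asarray(I, dtype=float)
    tmp = _bilateral_1d_pass(I, sigma_d, sigma_r, axis=1)      # horizontal pass
    return _bilateral_1d_pass(tmp, sigma_d, sigma_r, axis=0)   # vertical pass
```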







and [81]. In addition to the simple separable implementation described, several other fast versions of the bilateral filter have been proposed. For example, the method described in [77] approximates the bilateral filter by filtering sub-sampled copies of the image with discrete intensity kernels and recombining the results using linear interpolation. An improved and theoretically well-grounded version of this method was presented in [205]. The fast technique proposed in [286] eliminates the redundant calculations performed in partly overlapping image regions, albeit being restricted to the use of box-shaped domain kernels. As demonstrated in [211, 291], real-time performance using arbitrarily shaped kernels can be obtained by decomposing the filter into a set of smaller spatial filters.






Fig. 17.14 Separable bilateral filter (color example). Original image (a), bilateral filter applied only in the x-direction (b) and only in the y-direction (c). Result of applying the full bilateral filter (d) and the separable bilateral filter applied in x/y order (e) and y/x order (f). Settings: σd = 2.0, σr = 50, L2 color distance.



17.3 Anisotropic Diffusion Filters

Diffusion is a concept adopted from physics that models the spatial propagation of particles or state properties within substances. In the real world, certain physical properties (such as temperature) tend to diffuse homogeneously through a physical body, that is, equally in all directions. The idea of viewing image smoothing as a diffusion process has a long history in image processing (see, e.g., [16, 162]). To smooth an image and, at the same time, preserve edges or other "interesting" image structures, the diffusion process must somehow be made locally non-homogeneous; otherwise the entire image would come out equally blurred. Typically, the dominant smoothing direction is chosen to be parallel to nearby image contours, while smoothing is inhibited in the perpendicular direction, that is, across the contours.

Since the pioneering work by Perona and Malik [208], anisotropic diffusion has seen continued interest in the image processing community, and research in this area is still strong today. The main elements of their approach are outlined in Sec. 17.3.2. While various other formulations have been proposed since, key contributions by Weickert [283, 284] and Tschumperlé [265, 268] unified them into a common framework and demonstrated their extension to color images. They also proposed to separate the actual smoothing process from the smoothing geometry in order to obtain better control of the local smoothing behavior. In Sec. 17.3.4 we give a brief introduction to the approach proposed by Tschumperlé and Deriche, as initially described in [265]. Beyond these selected examples, a vast literature exists on this topic, including excellent reviews [114, 283], textbook material [147, 231], and journal articles (see [7, 54, 64, 198, 232, 256], for example).






17.3.1 Homogeneous Diffusion and the Heat Equation

Assume that in a homogeneous, 3D volume some physical property (e.g., temperature) is specified by a continuous function f(x, t) at position x = (x, y, z) and time t. With the system left to itself, the local differences in the property f will equalize over time until a global equilibrium is reached. This diffusion process in 3D space (x, y, z) and time (t) can be expressed using a partial differential equation (PDE),

$$\frac{\partial f}{\partial t} = c\cdot(\nabla^2 f) = c\cdot\Bigl(\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2}\Bigr). \tag{17.39}$$

This is the so-called heat equation, where ∇²f denotes the Laplace operator⁹ applied to the scalar-valued function f, and c is a constant which describes the (thermal) conductivity or conductivity coefficient of the material. Since the conductivity is independent of position and orientation (c is constant), the resulting process is isotropic, that is, the heat spreads evenly in all directions. For simplicity, we assume c = 1. Since f is a multi-dimensional function in space and time, we make this fact a bit more transparent by attaching explicit space and time coordinates x and τ to Eqn. (17.39), that is,

$$\frac{\partial f}{\partial t}(\boldsymbol{x},\tau) = \frac{\partial^2 f}{\partial x^2}(\boldsymbol{x},\tau) + \frac{\partial^2 f}{\partial y^2}(\boldsymbol{x},\tau) + \frac{\partial^2 f}{\partial z^2}(\boldsymbol{x},\tau), \tag{17.40}$$

or, written more compactly,

$$f_t(\boldsymbol{x},\tau) = f_{xx}(\boldsymbol{x},\tau) + f_{yy}(\boldsymbol{x},\tau) + f_{zz}(\boldsymbol{x},\tau). \tag{17.41}$$

Diffusion in Images

A continuous, time-varying image I may be treated analogously to the function f(x, τ), with the local intensities taking on the role of the temperature values in Eqn. (17.41). In this 2D case, the isotropic diffusion equation can be written as¹⁰

$$\frac{\partial I}{\partial t} = \nabla^2 I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2} \qquad\text{or} \tag{17.42}$$

$$I_t(\boldsymbol{x},\tau) = I_{xx}(\boldsymbol{x},\tau) + I_{yy}(\boldsymbol{x},\tau), \tag{17.43}$$

with the partial derivatives It = ∂I/∂t, Ixx = ∂²I/∂x², and Iyy = ∂²I/∂y². As usual, for discrete images, these derivatives are approximated by finite differences.¹¹ Starting with the initial (typically noisy) image I⁽⁰⁾ = I, the solution to the differential equation in Eqn. (17.43) can be calculated iteratively in the form

⁹ Remember that ∇f denotes the gradient of the function f, which is a vector for any multi-dimensional function. The Laplace operator (or Laplacian) ∇²f corresponds to the divergence of the gradient of f, denoted div ∇f, which is a scalar value (see Secs. D.1.5 and D.1.4 of the Appendix). Other notations for the Laplacian are ∇·(∇f), (∇·∇)f, ∇·∇f, ∇²f, or ∆f.
¹⁰ Function arguments (x, τ) are omitted here for better readability.
¹¹ See Sec. D.3 of the Appendix.



$$I^{(n)}(\boldsymbol{p}) \leftarrow \begin{cases} I(\boldsymbol{p}) & \text{for } n = 0,\\ I^{(n-1)}(\boldsymbol{p}) + \alpha\cdot\bigl(\nabla^2 I^{(n-1)}\bigr)(\boldsymbol{p}) & \text{for } n > 0, \end{cases} \tag{17.44}$$

Fig. 17.15 Discrete isotropic diffusion. Blurred images and impulse response obtained after n iterations, with α = 0.20 (see Eqn. (17.44)). The size of the images is 50 × 50. The width of the equivalent Gaussian kernel (σn) grows with the square root of n (the number of iterations); shown are n = 0, 5, 10, 20, 40, 80, with σn ≈ 1.411, 1.996, 2.823, 3.992, 5.646 for the nonzero n. Impulse response plots are normalized to identical peak values.



for each image position p = (u, v), with n denoting the iteration number. This is called the "direct" solution method (there are other methods, but this is the simplest). The constant α in Eqn. (17.44) is the time increment, which controls the speed of the diffusion process. Its value should be in the range (0, 0.25] for the numerical scheme to be stable. At each iteration n, the variations in the image function are reduced and (depending on the boundary conditions) the image function should eventually flatten out to a constant plane as n approaches infinity. For a discrete image I, the Laplacian ∇²I in Eqn. (17.44) can be approximated by a linear 2D filter,

$$\nabla^2 I \approx I * H^L, \qquad\text{with}\quad H^L = \begin{pmatrix} 0 & 1 & 0\\ 1 & -4 & 1\\ 0 & 1 & 0 \end{pmatrix}, \tag{17.45}$$



as described earlier.¹² An essential property of isotropic diffusion is that it has the same effect as a Gaussian filter whose width grows with the elapsed time. For a discrete 2D image, in particular, the result obtained after n diffusion steps (Eqn. (17.44)) is the same as applying a linear filter to the original image I,

$$I^{(n)} \equiv I * H^{G}_{\sigma_n}, \tag{17.46}$$

with the normalized Gaussian kernel

$$H^{G}_{\sigma_n}(x,y) = \frac{1}{2\pi\sigma_n^2}\cdot\exp\Bigl(-\frac{x^2+y^2}{2\sigma_n^2}\Bigr) \tag{17.47}$$

of width σn = √(2t) = √(2n·α). The example in Fig. 17.15 illustrates this Gaussian smoothing effect obtained by discrete isotropic diffusion.

¹² See Sec. 5.6.1 and Appendix Sec. D.3.
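A few lines of Python illustrate the direct iteration of Eqn. (17.44); the use of scipy.ndimage.convolve with 'nearest' border handling is our own choice of boundary condition, not something prescribed by the text:

```python
import numpy as np
from scipy.ndimage import convolve

H_L = np.array([[0.,  1., 0.],
                [1., -4., 1.],
                [0.,  1., 0.]])          # discrete Laplacian kernel, Eqn. (17.45)

def isotropic_diffusion(I, alpha=0.20, n_iter=10):
    """Direct iteration of Eqn. (17.44); equivalent to Gaussian smoothing
    with sigma_n = sqrt(2 * n_iter * alpha)."""
    I = np.asarray(I, dtype=float).copy()
    for _ in range(n_iter):
        I += alpha * convolve(I, H_L, mode='nearest')
    return I
```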






17.3.2 The Perona-Malik Filter

Isotropic diffusion, as we have described, is a homogeneous operation that is independent of the underlying image content. Like any Gaussian filter, it effectively suppresses image noise but also tends to blur away sharp boundaries and detailed structures, a property that is often undesirable. The idea proposed in [208] is to make the conductivity coefficient variable and dependent on the local image structure. This is done by replacing the conductivity constant c in Eqn. (17.39), which can be written as

$$\frac{\partial I}{\partial t}(\boldsymbol{x},\tau) = c\cdot[\nabla^2 I](\boldsymbol{x},\tau), \tag{17.48}$$

by a function c(x, t) that varies over space x and time t, i.e.,

$$\frac{\partial I}{\partial t}(\boldsymbol{x},\tau) = c(\boldsymbol{x},\tau)\cdot[\nabla^2 I](\boldsymbol{x},\tau). \tag{17.49}$$

If the conductivity function c() is constant, then the equation reduces to the isotropic diffusion model in Eqn. (17.43). Different behaviors can be implemented by selecting a particular function c(). To achieve edge-preserving smoothing, the conductivity c() is chosen as a function of the magnitude of the local gradient vector ∇I, that is,

$$c(\boldsymbol{x},\tau) := g(d) = g\bigl(\|[\nabla I^{(\tau)}](\boldsymbol{x})\|\bigr). \tag{17.50}$$

To preserve edges, the function g(d) : R → [0, 1] should return high values in areas of low image gradient, enabling homogeneous regions to be smoothed, but return low values (and thus inhibit smoothing) where the local brightness changes rapidly. Commonly used conductivity functions g(d) are, for example [57, 208],

$$g_1(d) = \mathrm{e}^{-(d/\kappa)^2}, \qquad g_2(d) = \frac{1}{1+(d/\kappa)^2},$$
$$g_3(d) = \frac{1}{\sqrt{1+(d/\kappa)^2}}, \qquad g_4(d) = \begin{cases} \bigl(1-(d/2\kappa)^2\bigr)^2 & \text{for } d \le 2\kappa,\\ 0 & \text{otherwise,} \end{cases} \tag{17.51}$$



where κ > 0 is a constant that is either set manually (typically in the range [5, 50] for 8-bit images) or adjusted to the amount of image noise. Graphs of the four functions in Eqn. (17.51) are shown in Fig. 17.16 for selected values of κ. The Gaussian conductivity function g1 tends to promote high-contrast edges, whereas g2 and even more so g3 prefer wide, flat regions over smaller ones. Function g4, which corresponds to Tukey's biweight function known from robust statistics [231, p. 230], is strictly zero for any argument d > 2κ. The exact shape of the function g() does not appear to be critical; other functions with similar properties (e.g., with a linear cutoff) are sometimes used instead. As an approximate discretization of Eqn. (17.49), Perona and Malik [208] proposed the simple iterative scheme

$$I^{(n)}(\boldsymbol{p}) \leftarrow I^{(n-1)}(\boldsymbol{p}) + \alpha\cdot\sum_{i=0}^{3} g\bigl(|\delta_i(I^{(n-1)},\boldsymbol{p})|\bigr)\cdot \delta_i(I^{(n-1)},\boldsymbol{p}), \tag{17.52}$$



Fig. 17.16 Typical conductivity functions g1(), …, g4() for κ = 4, 10, 20, 30, 40 (see Eqn. (17.51)); panels (a) g1(d), (b) g2(d), (c) g3(d), (d) g4(d), each plotted for d = 0, …, 100 with function values in [0, 1]. If the magnitude of the local gradient d is small (near zero), smoothing amounts to a maximum (1.0), whereas diffusion is reduced where the gradient is high, for example, at or near edges. Smaller values of κ result in narrower curves, thereby restricting the smoothing operation to image areas with only small variations.



Fig. 17.17 Discrete lattice used for implementing diffusion filters in the Perona-Malik algorithm. The green element represents the center pixel at position (u, v) with value ac = I(u, v); the yellow elements are the four 4-neighboring values a0, …, a3, connected to the center by the differences δ0, …, δ3.



where I⁽⁰⁾ = I is the original image and

$$\delta_i(I,\boldsymbol{p}) = I(\boldsymbol{p}+\boldsymbol{d}_i) - I(\boldsymbol{p}) = a_i - a_c \tag{17.53}$$

denotes the difference between the pixel value I(p) = ac and its direct neighbor value ai (i = 0, …, 3, see Fig. 17.17), with

$$\boldsymbol{d}_0 = \begin{pmatrix}1\\0\end{pmatrix},\quad \boldsymbol{d}_1 = \begin{pmatrix}0\\-1\end{pmatrix},\quad \boldsymbol{d}_2 = \begin{pmatrix}-1\\0\end{pmatrix},\quad \boldsymbol{d}_3 = \begin{pmatrix}0\\1\end{pmatrix}. \tag{17.54}$$

The procedure for computing the Perona-Malik filter for scalar-valued images is summarized in Alg. 17.7. The examples in Fig. 17.18 demonstrate how this filter performs along a step edge in a noisy grayscale image compared to isotropic (i.e., Gaussian) filtering.

In summary, the principal operation of this filter is to inhibit smoothing in the direction of strong local gradient vectors. Wherever the local contrast (and thus the gradient) is small, diffusion occurs uniformly in all directions, effectively implementing a Gaussian smoothing filter. However, in locations of high gradients, smoothing is inhibited along the gradient direction and allowed only in the direction perpendicular to it. If viewed as a heat diffusion process, a high-gradient brightness edge in an image acts like an insulating






Alg. 17.7 Perona-Malik anisotropic diffusion filter for scalar (grayscale) images. The input image I is assumed to be real-valued (floating-point). The conductivity function g(d) : R → R can be any of the functions defined in Eqn. (17.51) or a similar function.

1:  PeronaMalikGray(I, α, g, T)
    Input: I, a scalar-valued image of size M × N, with I(u, v) ∈ R; α, update rate; g, the conductivity function (see Eqn. (17.51)); T, number of iterations.
    Returns nothing, the input image I is modified.
2:      (M, N) ← size(I)
3:      I′ ← duplicate(I)              ▷ intermediate image
4:      for n ← 1, …, T do             ▷ perform T iterations
5:          for all (u, v) ∈ M × N do
6:              ac ← I(u, v)           ▷ center pixel value
7:              a0 ← I(u+1, v)         ▷ neighboring pixel values
8:              a1 ← I(u, v−1)
9:              a2 ← I(u−1, v)
10:             a3 ← I(u, v+1)
11:             ∆ ← 0
12:             for i ← 0, …, 3 do     ▷ for all neighbors
13:                 δi ← ai − ac
14:                 ∆ ← ∆ + g(|δi|)·δi     ▷ see Eqn. (17.52)
15:             I′(u, v) ← I(u, v) + α·∆   ▷ diffusion
16:         copy(I′, I)                ▷ copy I′ back to input image I
17:     return
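A vectorized Python sketch of the same scheme is shown below; note that np.roll wraps around at the image borders, which differs slightly from the border handling implied by the pseudocode, and the function names are ours:

```python
import numpy as np

def g1(d, kappa=40.0):
    """Conductivity function g1 from Eqn. (17.51)."""
    return np.exp(-(d / kappa)**2)

def perona_malik_gray(I, alpha=0.20, kappa=40.0, T=10):
    """Perona-Malik diffusion (grayscale), vectorized sketch of Alg. 17.7."""
    I = np.asarray(I, dtype=float).copy()
    for _ in range(T):
        delta_sum = np.zeros_like(I)
        for s in ((0, -1), (0, 1), (-1, 0), (1, 0)):     # 4-neighbor shifts
            delta = np.roll(I, shift=s, axis=(0, 1)) - I            # a_i - a_c
            delta_sum += g1(np.abs(delta), kappa) * delta           # Eqn. (17.52)
        I += alpha * delta_sum
    return I
```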



layer between areas of different temperatures. While temperatures continuously level out in the homogeneous regions on either side of an edge, thermal energy does not diffuse across the edge itself.

Note that the Perona-Malik filter (as defined in Eqn. (17.49)) is formally considered a nonlinear filter but not an anisotropic diffusion filter, because the conductivity function g() is only a scalar and not a (directed) vector-valued function [283]. However, the (inexact) discretization used in Eqn. (17.52), where each lattice direction is attenuated individually, makes the filter appear to perform in an anisotropic fashion.

17.3.3 Perona-Malik Filter for Color Images

The original Perona-Malik filter is not explicitly designed for color images or vector-valued images in general. The simplest way to apply this filter to a color image is (as usual) to treat the color channels as a set of independent scalar images and filter them separately. Edges should be preserved, since they occur only where at least one of the color channels exhibits a strong variation. However, different filters are applied to the color channels and thus new chromaticities may be produced that were not contained in the original image. Nevertheless, the results obtained (see the examples in Fig. 17.19 (b–d)) are often satisfactory and the approach is frequently used because of its simplicity.

Color Diffusion Based on the Brightness Gradient






As an alternative to filtering each color channel separately, it has been proposed to use only the brightness (intensity) component to control the diffusion process of all color channels. Given a RGB color



Fig. 17.18 Isotropic vs. anisotropic diffusion applied to a noisy step edge. Original image, enlarged detail, and horizontal profile (a); results of isotropic (non-directed) diffusion (b–d); results of anisotropic (directed) diffusion (e–g), after n = 2, 5, 10 iterations, respectively (α = 0.20, κ = 40).



image I = (IR, IG, IB) and a brightness function β(a), the iterative scheme in Eqn. (17.52) could be modified to

$$\boldsymbol{I}^{(n)}(\boldsymbol{p}) \leftarrow \boldsymbol{I}^{(n-1)}(\boldsymbol{p}) + \alpha\cdot\sum_{i=0}^{3} g\bigl(|\beta_i'(\boldsymbol{I}^{(n-1)},\boldsymbol{p})|\bigr)\cdot \boldsymbol{\delta}_i(\boldsymbol{I}^{(n-1)},\boldsymbol{p}), \tag{17.55}$$

where

$$\beta_i'(\boldsymbol{I},\boldsymbol{p}) = \beta\bigl(\boldsymbol{I}(\boldsymbol{p}+\boldsymbol{d}_i)\bigr) - \beta\bigl(\boldsymbol{I}(\boldsymbol{p})\bigr) = \beta(\boldsymbol{a}_i)-\beta(\boldsymbol{a}_c) \tag{17.56}$$

is the local brightness difference in direction i, associated with the 2D offset vector di (see Eqn. (17.54)), and

$$\boldsymbol{\delta}_i(\boldsymbol{I},\boldsymbol{p}) = \boldsymbol{I}(\boldsymbol{p}+\boldsymbol{d}_i) - \boldsymbol{I}(\boldsymbol{p}) = \boldsymbol{a}_i - \boldsymbol{a}_c \tag{17.57}$$

is the local color difference vector w.r.t. the neighboring pixel in directions i = 0, …, 3 (see Fig. 17.17). Typical choices for the brightness function β(a) are the luminance Y (calculated as a weighted sum of the linear R, G, B components), luma Y′ (from nonlinear R′, G′, B′






Fig. 17.19 Anisotropic diffusion filter (color). Noisy test image (a). Anisotropic diffusion filter applied separately to individual color channels (b–d), diffusion controlled by the brightness gradient (e–g), diffusion controlled by the color gradient (h–j), after 2, 5, and 10 iterations, respectively (α = 0.20, κ = 40). With diffusion controlled by the brightness gradient, strong blurring occurs between regions of different color but similar brightness (e–g). The most consistent results are obtained by diffusion controlled by the color gradient (h–j). Filtering was performed in linear RGB color space.



components), or the lightness component (L) of the CIELAB and CIELUV color spaces (see Sec. 15.1 for a detailed discussion). Algorithm 17.7 can be easily adapted to implement this type of color filter, as shown in Alg. 17.8. An obvious disadvantage of this method is that it naturally blurs across color edges if the neighboring colors have similar brightness, as the examples in Fig. 17.19 (e–g) demonstrate. This limits its usefulness for practical applications.

Using the Color Gradient

A better option for controlling the diffusion process in all three color channels is to use the color gradient (see Sec. 16.2.1). As defined in Eqn. (16.17), the color gradient

$$(\operatorname{grad}_\theta \boldsymbol{I})(\boldsymbol{p}) = \boldsymbol{I}_x(\boldsymbol{p})\cdot\cos(\theta) + \boldsymbol{I}_y(\boldsymbol{p})\cdot\sin(\theta) \tag{17.58}$$

is a 3D vector, representing the combined variations of the color image I at position p in a given direction θ. The squared norm of this vector, Sθ(I, p) = ∥(grad_θ I)(p)∥², called the squared local contrast, is a scalar quantity useful for color edge detection. Along the horizontal and vertical directions of the discrete diffusion lattice (see Fig. 17.17), the angle θ is a multiple of π/2, and thus one of the cosine/sine terms in Eqn. (17.58) vanishes, that is,¹³

$$\|(\operatorname{grad}_{\theta_i} \boldsymbol{I})(\boldsymbol{p})\| = \|(\operatorname{grad}_{i\pi/2} \boldsymbol{I})(\boldsymbol{p})\| = \begin{cases} \|\boldsymbol{I}_x(\boldsymbol{p})\| & \text{for } i = 0, 2,\\ \|\boldsymbol{I}_y(\boldsymbol{p})\| & \text{for } i = 1, 3. \end{cases} \tag{17.59}$$



¹³ See Eqn. (16.15) for the definition of the partial derivative vectors Ix(p) and Iy(p).



Alg. 17.8 Anisotropic diffusion filter for color images based on the brightness gradient (see Ch. 16, Sec. 16.2.1). The conductivity function g(d) may be chosen from the functions defined in Eqn. (17.51), or any similar function. The brightness function β(a) : R^K → R (lines 11, 14) is a mapping from color vectors a to scalar brightness values (e.g., luminance).

1:  PeronaMalikBrightnessGradient(I, α, g, T)
    Input: I, a RGB color image of size M × N; α, update rate; g, the conductivity function (see Eqn. (17.51)); T, number of iterations.
    Returns nothing, the input image I is modified.
2:      (M, N) ← size(I)
3:      I′ ← duplicate(I)              ▷ intermediate image
4:      for n ← 1, …, T do             ▷ perform T iterations
5:          for all (u, v) ∈ M × N do
6:              ac ← I(u, v)           ▷ center pixel color
7:              a0 ← I(u+1, v)         ▷ neighboring pixel colors
8:              a1 ← I(u, v−1)
9:              a2 ← I(u−1, v)
10:             a3 ← I(u, v+1)
11:             bc ← β(ac)             ▷ brightness of center pixel
12:             ∆ ← 0                  ▷ resulting color vector
13:             for i ← 0, …, 3 do     ▷ for all neighbors
14:                 bi ← β(ai)         ▷ brightness of neighbor pixel
15:                 ∆ ← ∆ + g(|bi − bc|)·(ai − ac)
16:             I′(u, v) ← I(u, v) + α·∆   ▷ diffusion
17:         copy(I′, I)                ▷ copy I′ back to input image I
18:     return



Taking δi (Eqn. (17.57)) as an estimate for the horizontal and vertical derivatives Ix, Iy, the diffusion iteration (adapted from Eqn. (17.52)) thus becomes

$$\boldsymbol{I}^{(n)}(\boldsymbol{p}) \leftarrow \boldsymbol{I}^{(n-1)}(\boldsymbol{p}) + \alpha\cdot\sum_{i=0}^{3} g\bigl(\|\boldsymbol{\delta}_i(\boldsymbol{I}^{(n-1)},\boldsymbol{p})\|\bigr)\cdot \boldsymbol{\delta}_i(\boldsymbol{I}^{(n-1)},\boldsymbol{p})$$

$$= \boldsymbol{I}^{(n-1)}(\boldsymbol{p}) + \alpha\cdot\sum_{i=0}^{3} g\bigl(\|\boldsymbol{a}_i-\boldsymbol{a}_c\|\bigr)\cdot (\boldsymbol{a}_i-\boldsymbol{a}_c), \tag{17.60}$$



with the conductivity function g() chosen from one of the options in Eqn. (17.51). Note that this is almost identical to the formulation in Eqn. (17.52), except for the use of vector-valued images and the absolute values |·| being replaced by the vector norm ∥·∥. The diffusion process is coupled between all color channels, because the local diffusion strength depends on the combined color difference vectors. Thus, unlike in the brightness-governed diffusion scheme in Eqn. (17.55), opposing variations in different color channels do not cancel out, and edges between colors of similar brightness are preserved (see the examples in Fig. 17.19 (h–j)). The resulting process is summarized in Alg. 17.9. The algorithm assumes that the components of the color image I are real-valued. In practice, integer-valued images must be converted to floating point before this procedure can be applied, and integer results should be recovered by appropriate rounding.

Examples

Figure 17.20 shows the results of applying the Perona-Malik filter to a color image, using different modalities to control the diffusion






Alg. 17.9 Anisotropic diffusion filter for color images based on the color gradient (see Ch. 16, Sec. 16.2.1). This version is identical to Alg. 17.8 except for the use of the color gradient ∥ai − ac∥ instead of the brightness difference in line 13. The conductivity function g(d) may be chosen from the functions defined in Eqn. (17.51), or any similar function.

1:  PeronaMalikColorGradient(I, α, g, T)
    Input: I, a RGB color image of size M × N; α, update rate; g, the conductivity function (see Eqn. (17.51)); T, number of iterations.
    Returns nothing, the input image I is modified.
2:      (M, N) ← size(I)
3:      I′ ← duplicate(I)              ▷ intermediate image
4:      for n ← 1, …, T do             ▷ perform T iterations
5:          for all (u, v) ∈ M × N do
6:              ac ← I(u, v)           ▷ center pixel color
7:              a0 ← I(u+1, v)         ▷ neighboring pixel colors
8:              a1 ← I(u, v−1)
9:              a2 ← I(u−1, v)
10:             a3 ← I(u, v+1)
11:             ∆ ← 0                  ▷ resulting color vector
12:             for i ← 0, …, 3 do     ▷ for all neighbors
13:                 ∆ ← ∆ + g(∥ai − ac∥)·(ai − ac)   ▷ see Eqn. (17.60)
14:             I′(u, v) ← I(u, v) + α·∆   ▷ diffusion
15:         copy(I′, I)                ▷ copy I′ back to input image I
16:     return
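The color-gradient-controlled variant differs from the grayscale sketch given earlier only in that the conductivity is evaluated on the norm of the color difference vector, so all channels diffuse jointly. Again, this is a hedged sketch with wrap-around borders, not the book's implementation:

```python
import numpy as np

def perona_malik_color(I, alpha=0.20, kappa=40.0, T=10, g=None):
    """Color Perona-Malik diffusion driven by the color gradient
    (Eqn. (17.60) / Alg. 17.9). I has shape (rows, cols, 3), float values."""
    if g is None:
        g = lambda d: np.exp(-(d / kappa)**2)       # g1 from Eqn. (17.51)
    I = np.asarray(I, dtype=float).copy()
    for _ in range(T):
        delta_sum = np.zeros_like(I)
        for s in ((0, -1), (0, 1), (-1, 0), (1, 0)):          # 4-neighbor shifts
            delta = np.roll(I, shift=s, axis=(0, 1)) - I      # color difference a_i - a_c
            norm = np.linalg.norm(delta, axis=2, keepdims=True)
            delta_sum += g(norm) * delta                      # coupled over channels
        I += alpha * delta_sum
    return I
```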



process. In Fig. 17.20 (a) the scalar (grayscale) diffusion filter (described in Alg. 17.7) is applied separately to each color channel. In Fig. 17.20 (b) the diffusion process is coupled over all three color channels and controlled by the brightness gradient, as specified in Eqn. (17.55). Finally, in Fig. 17.20 (c) the color gradient is used to control the common diffusion process, as defined in Eqn. (17.60) and Alg. 17.9. In each case, T = 10 diffusion iterations were applied, with update rate α = 0.20, smoothness κ = 25, and conductivity function g1(d). The example demonstrates that, under otherwise equal conditions, edges and line structures are best preserved by the filter if the diffusion process is controlled by the color gradient.

17.3.4 Geometry Preserving Anisotropic Diffusion

Historically, the seminal publication by Perona and Malik [208] was followed by increased interest in the use of diffusion filters based on partial differential equations. Numerous different schemes were proposed, mainly with the aim to better adapt the diffusion process to the underlying image geometry.

Generalized Divergence-Based Formulation

Weickert [282, 283] generalized the divergence-based formulation of the Perona-Malik approach (see Eqn. (17.48)), that is,

$$\frac{\partial I}{\partial t} = \operatorname{div}\bigl(c\cdot\nabla I\bigr),$$

by replacing the time-varying, scalar diffusivity field c(x, τ) ∈ R by a diffusion tensor field D(x, τ) ∈ R^{2×2} in the form

$$\frac{\partial I}{\partial t} = \operatorname{div}\bigl(\mathbf{D}\cdot\nabla I\bigr). \tag{17.61}$$



Fig. 17.20 Perona-Malik color example. Scalar diffusion filter applied separately to each color channel (a); diffusion controlled by the local brightness gradient (b); diffusion controlled by the local color gradient (c). Common settings are T = 10, α = 0.20, g(d) = g1(d), κ = 25. Original image in Fig. 17.3 (a).



The time-varying tensor field D(x, τ) specifies a symmetric, positive-definite 2 × 2 matrix for each 2D image position x and time τ (i.e., D : R³ → R^{2×2} in the continuous case). Geometrically, D specifies an oriented, stretched ellipse which controls the local diffusion process. D may be independent of the image I but is typically derived from it. For example, the original Perona-Malik diffusion equation could be (trivially) written in the form

$$\frac{\partial I}{\partial t} = \operatorname{div}\bigl(\underbrace{(c\cdot\mathbf{I}_2)}_{\mathbf{D}}\cdot\nabla I\bigr) = \operatorname{div}\Bigl(\begin{pmatrix} c & 0\\ 0 & c \end{pmatrix}\cdot\nabla I\Bigr), \tag{17.62}$$

where c = g(∥∇I(x, t)∥) (see Eqn. (17.50)), and thus D is coupled to the image content. In Weickert's approach, D is constructed from the eigenvalues of the local "image structure tensor" [284], which we have encountered under different names in several places. This approach was also adapted to work with color images [285].

¹⁴ I₂ denotes the 2 × 2 identity matrix.

Trace-Based Formulation

Similar to the work of Weickert, the approach proposed by Tschumperlé and Deriche [265, 267] also pursues a geometry-oriented generalization of anisotropic diffusion. The approach is directly aimed






at vector-valued (color) images, but can also be applied to single-channel (scalar-valued) images. For a vector-valued image I = (I0, …, IK−1), the smoothing process is specified as

$$\frac{\partial I_k}{\partial t} = \operatorname{tr}\bigl(\mathbf{A}\cdot\mathbf{H}_{I_k}\bigr), \tag{17.63}$$

for each channel k, where H_{I_k} denotes the Hessian matrix of the scalar-valued image function of channel Ik, and A is a square (2 × 2 for 2D images) matrix that depends on the complete image I and adapts the smoothing process to the local image geometry. Note that A is the same for all image channels. Since the trace of the Hessian matrix¹⁵ is the Laplacian of the corresponding function (i.e., tr(H_I) = ∇²I), the diffusion equation for the Perona-Malik filter (Eqn. (17.48)) can be written as

$$\frac{\partial I}{\partial t} = c\cdot(\nabla^2 I) = \operatorname{div}(c\cdot\nabla I) = \operatorname{tr}\bigl(\underbrace{(c\cdot\mathbf{I}_2)}_{\mathbf{A}}\cdot\mathbf{H}_I\bigr) = \operatorname{tr}(c\cdot\mathbf{H}_I). \tag{17.64}$$

In this case, A = c·I₂, which merely applies the constant scalar factor c to the Hessian matrix H_I (and thus to the resulting Laplacian) that is derived from the local image (since c = g(∥∇I(p, t)∥)) and does not represent any geometric information.

17.3.5 Tschumperlé-Deriche Algorithm

This is different in the trace-based approach proposed by Tschumperlé and Deriche [265, 267]. Here the geometry matrix A (see Eqn. (17.63)), associated with a particular pixel position p, is defined as

$$\mathbf{A}(\boldsymbol{p}) = f_0(\lambda_0,\lambda_1)\cdot(\hat{\boldsymbol{x}}_1\cdot\hat{\boldsymbol{x}}_1^{\mathsf T}) + f_1(\lambda_0,\lambda_1)\cdot(\hat{\boldsymbol{x}}_0\cdot\hat{\boldsymbol{x}}_0^{\mathsf T}) \tag{17.65}$$

$$= f_0(\lambda_0,\lambda_1)\cdot\begin{pmatrix}\hat{x}_1^2 & \hat{x}_1\hat{y}_1\\ \hat{x}_1\hat{y}_1 & \hat{y}_1^2\end{pmatrix} + f_1(\lambda_0,\lambda_1)\cdot\begin{pmatrix}\hat{x}_0^2 & \hat{x}_0\hat{y}_0\\ \hat{x}_0\hat{y}_0 & \hat{y}_0^2\end{pmatrix}, \tag{17.66}$$

where λi are the eigenvalues and x̂i = (x̂i, ŷi)ᵀ the normalized eigenvectors of the (smoothed) 2 × 2 structure matrix

$$\mathbf{G}(\boldsymbol{p}) = \sum_{k} \nabla I_k(\boldsymbol{p})\cdot\nabla I_k^{\mathsf T}(\boldsymbol{p}), \tag{17.67}$$

with ∇Ik(p) denoting the local gradient vector in image channel Ik at position p. The functions f0(), f1(), defined in Eqn. (17.80), use the two eigenvalues to control the diffusion strength along the dominant direction of the contours and perpendicular to it, respectively. Since the resulting algorithm is more involved than most previous ones, we describe it in greater detail than usual. Given a vector-valued image I : M × N → R^K (i.e., with K components), the following steps are performed in each iteration of the algorithm:

¹⁵ See Sec. D.1.6 of the Appendix for details.



Step 1: Calculate Image Gradients

For each image position p and each component k = 0, …, K−1, calculate the gradient vector¹⁶

$$\nabla I_k(\boldsymbol{p}) = \begin{pmatrix} \frac{\partial I_k}{\partial x}(\boldsymbol{p})\\[3pt] \frac{\partial I_k}{\partial y}(\boldsymbol{p}) \end{pmatrix} = \begin{pmatrix} I_{k,x}(\boldsymbol{p})\\ I_{k,y}(\boldsymbol{p}) \end{pmatrix} = \begin{pmatrix} (I_k * H^{\nabla}_x)(\boldsymbol{p})\\ (I_k * H^{\nabla}_y)(\boldsymbol{p}) \end{pmatrix}. \tag{17.68}$$

The first derivatives of the gradient vector ∇Ik are estimated by convolving each component image Ik with the kernels

$$H^{\nabla}_x = \begin{pmatrix} -a & 0 & a\\ -b & 0 & b\\ -a & 0 & a \end{pmatrix} \quad\text{and}\quad H^{\nabla}_y = \begin{pmatrix} -a & -b & -a\\ 0 & 0 & 0\\ a & b & a \end{pmatrix}, \tag{17.69}$$

with a = (2−√2)/4 and b = (√2−1)/2 (such that 2a + b = 1/2).¹⁷ Next, all component gradients Ik,x, Ik,y (see Eqn. (17.68)) are smoothed by convolving with a 2D Gaussian kernel H^G_{σd} (with radius σd), i.e.,

$$\boldsymbol{D}_k = \begin{pmatrix} D_{k,x}\\ D_{k,y} \end{pmatrix} = \begin{pmatrix} I_{k,x} * H^{G}_{\sigma_d}\\ I_{k,y} * H^{G}_{\sigma_d} \end{pmatrix} = \begin{pmatrix} I_k * H^{\nabla}_x * H^{G}_{\sigma_d}\\ I_k * H^{\nabla}_y * H^{G}_{\sigma_d} \end{pmatrix}. \tag{17.70}$$

In practice, the additional smoothing step by H^G_{σd} is often skipped, since the 3 × 3 gradient kernels in Eqn. (17.69) effect some spatial smoothing already.

Step 2: Build the Local Structure Matrix

From the component gradients obtained in Step 1, the local variation or "structure" matrix is calculated as



$$\mathbf{M}(\boldsymbol{p}) = \begin{pmatrix} A(\boldsymbol{p}) & C(\boldsymbol{p})\\ C(\boldsymbol{p}) & B(\boldsymbol{p}) \end{pmatrix} = \sum_{k=0}^{K-1} \boldsymbol{D}_k(\boldsymbol{p})\cdot\boldsymbol{D}_k^{\mathsf T}(\boldsymbol{p}) \tag{17.71}$$

$$= \sum_{k=0}^{K-1} \begin{pmatrix} D_{k,x}^2(\boldsymbol{p}) & D_{k,x}(\boldsymbol{p})\cdot D_{k,y}(\boldsymbol{p})\\ D_{k,x}(\boldsymbol{p})\cdot D_{k,y}(\boldsymbol{p}) & D_{k,y}^2(\boldsymbol{p}) \end{pmatrix}, \tag{17.72}$$

¹⁶ Note that ∇Ik(p) in Eqn. (17.68) is a 2D, vector-valued function, that is, a dedicated vector is calculated for every image position p. For better readability, we omit the spatial coordinate (p) in the following and simply write ∇Ik instead of ∇Ik(p). Analogously, all related vectors and matrices defined below (including the vectors x̂0, x̂1 and the matrices M, M̃, A, and Hk) are also calculated for each image point p, without the spatial coordinate being explicitly noted.
¹⁷ Any other common set of x/y gradient kernels (e.g., Sobel masks) could be used instead, but these filters have better rotation invariance than their traditional counterparts. Similar kernels (with a = 3/32, b = 10/32) were proposed by Jähne [148, p. 353]. See also Sec. D.3 of the Appendix.

again for every image position p. The resulting matrix M, which is always symmetric and positive semidefinite, combines the variations from all component channels k. In particular, for a color image with K = 3 components (R, G, B) this is (with coordinates p omitted)



$$\mathbf{M}(\boldsymbol{p}) = \begin{pmatrix} A(\boldsymbol{p}) & C(\boldsymbol{p})\\ C(\boldsymbol{p}) & B(\boldsymbol{p}) \end{pmatrix}
= \begin{pmatrix} D_{R,x}^2 & D_{R,x}D_{R,y}\\ D_{R,x}D_{R,y} & D_{R,y}^2 \end{pmatrix}
+ \begin{pmatrix} D_{G,x}^2 & D_{G,x}D_{G,y}\\ D_{G,x}D_{G,y} & D_{G,y}^2 \end{pmatrix}
+ \begin{pmatrix} D_{B,x}^2 & D_{B,x}D_{B,y}\\ D_{B,x}D_{B,y} & D_{B,y}^2 \end{pmatrix}$$

$$= \begin{pmatrix} D_{R,x}^2 + D_{G,x}^2 + D_{B,x}^2 & D_{R,x}D_{R,y} + D_{G,x}D_{G,y} + D_{B,x}D_{B,y}\\ D_{R,x}D_{R,y} + D_{G,x}D_{G,y} + D_{B,x}D_{B,y} & D_{R,y}^2 + D_{G,y}^2 + D_{B,y}^2 \end{pmatrix}. \tag{17.73}$$



˜ = B ∗ HσG , B m



C˜ = C ∗ HσGm ,



such that the smoothed structure matrix at position p is   ˜ ˜ A(p) C(p) ˜ M(p) = ˜ . ˜ C(p) B(p)



(17.74)



(17.75)



Step 3: Image Update



For each image position p, we now calculate the two eigenvalues18 λ0, λ1 (with λ0 ≥ λ1) and the associated eigenvectors x0, x1 of the smoothed 2 × 2 structure matrix M̃(p). The first eigenvector x0 is normalized to
x̂0 = (x̂0, ŷ0)⊺ = (1/‖x0‖)·x0,   (17.76)
and, since x0 and x1 are orthogonal to each other, we can express x̂1 in terms of x̂0 as
x̂1 ≡ [ 0 −1 ; 1 0 ]·x̂0 = (−ŷ0, x̂0)⊺.   (17.77)
Note that the eigenvector x0 (corresponding to the greater eigenvalue λ0) points in the direction of maximum change at position p, i.e., perpendicular to a local edge, and x1 is parallel to the edge tangent. Thus smoothing should occur predominantly in the direction of x1 and be inhibited in the direction of x0.
From the eigenvalues λ0, λ1 and the associated normalized eigenvectors x̂0, x̂1 of M̃(p) we compose the "geometry matrix" A(p) in the form
A(p) = [ A0 A2 ; A2 A1 ] = f0(λ0, λ1)·(x̂1·x̂1⊺) + f1(λ0, λ1)·(x̂0·x̂0⊺)
     = c0·[ ŷ0²  −x̂0·ŷ0 ; −x̂0·ŷ0  x̂0² ] + c1·[ x̂0²  x̂0·ŷ0 ; x̂0·ŷ0  ŷ0² ]   (17.78)
     = [ c0·ŷ0² + c1·x̂0²   (c1−c0)·x̂0·ŷ0 ; (c1−c0)·x̂0·ŷ0   c0·x̂0² + c1·ŷ0² ],   (17.79)
using the conductivity coefficients
c0 = f0(λ0, λ1) = (1 + λ0 + λ1)^(−a0),   c1 = f1(λ0, λ1) = (1 + λ0 + λ1)^(−a1),   (17.80)
18 See Sec. B.5.1 of the Appendix.
with fixed parameters a0, a1 > 0 to control the non-isotropy of the filter. Parameter a0 specifies the amount of smoothing along contours, a1 controls the smoothing in the gradient direction (i.e., across contours). Small values of a0, a1 promote diffusion in the corresponding direction, while larger values inhibit smoothing. With a0 close to zero, diffusion is practically unconstrained along the tangent direction. Typical values are a0 = 0.5 and a1 = 0.9; results from other settings are shown in the examples.
Next, we calculate the local Hessian matrix for each position p and each image component k = 0, …, K−1,
Hk(p) = [ ∂²Ik/∂x² (p)   ∂²Ik/∂x∂y (p) ; ∂²Ik/∂x∂y (p)   ∂²Ik/∂y² (p) ] = [ h0 h2 ; h2 h1 ].   (17.81)
The elements of Hk(p) are the second-order derivatives of the component image Ik at position p, which are estimated from the discrete image data by convolution,19
∂²Ik/∂x² ≈ Ik ∗ H^∇_xx,   ∂²Ik/∂y² ≈ Ik ∗ H^∇_yy,   ∂²Ik/∂x∂y ≈ Ik ∗ H^∇_xy,   (17.82)
with the second-derivative kernels
H^∇_xx = [ 1 −2 1 ],   H^∇_yy = [ 1 ; −2 ; 1 ],   H^∇_xy = (1/4)·[ 1 0 −1 ; 0 0 0 ; −1 0 1 ].   (17.83)
Finally, each component image Ik is updated iteratively (as specified by the number of passes) by
Ik^(n)(p) ← Ik^(n−1)(p) + α^(n)·tr(A(p)·Hk(p))   (17.84)
          = Ik^(n−1)(p) + α^(n)·(A0·h0 + A1·h1 + 2·A2·h2)
          = Ik^(n−1)(p) + α^(n)·βk(p).
The scalar factor βk(p) = tr(A(p)·Hk(p)) = A0·h0 + A1·h1 + 2·A2·h2 denotes the local image velocity in channel k. Note that, although a separate Hessian matrix Hk(p) is calculated for each channel, the geometry matrix A(p) (see Eqn. (17.79)) is the same for all image channels. The image is thus smoothed along a common image geometry which considers the correlation between color channels, since A(p) is derived (via its eigenvalues) from the joint structure matrix M̃(p) (see Eqn. (17.75)) and therefore combines all K color components. In each iteration n, the factor α^(n) in Eqn. (17.84) is adjusted dynamically to the maximum current velocity βk in all channels in the form
α^(n) = dt / max_{k,p} |βk(p)| = dt / max_{k,p} |tr(A(p)·Hk(p))|,   (17.85)
where dt is the (constant) "time increment" parameter. Thus the time step α is kept small as long as the image gradients (vector field velocities) are large. As smoothing proceeds, image gradients are reduced and thus α typically increases over time.
19 See also Secs. D.1.6 and D.3.2 of the Appendix.
The steps outlined above are repeated for the specified number of iterations. The complete procedure is summarized in Alg. 17.10 and a corresponding Java implementation can be found on the book's website (see Sec. 17.4). Beyond this baseline algorithm, several variations and extensions of this filter exist, including the use of spatially-adaptive, oriented smoothing filters.20 This type of filter has also been used with good results for image inpainting [266], where diffusion is applied to fill out only selected (masked) parts of the image where the content is unknown or should be removed.
Examples
The example in Fig. 17.21 demonstrates the influence of image geometry and how the non-isotropy of the Tschumperlé-Deriche filter can be controlled by varying the diffusion parameters a0, a1 (see Eqn. (17.80)). Parameter a0, which specifies the diffusion in the direction of contours, is changed while a1 (controlling the diffusion in the gradient direction) is held constant. In Fig. 17.21 (a), smoothing along contours is modest and very small across edges with the default settings a0 = 0.5 and a1 = 0.9. With lower values of a0, increased blurring occurs in the direction of the contours, as shown in Figs. 17.21 (b, c).
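To make the per-pixel computation in Eqns. (17.76)–(17.84) concrete, the following is a minimal, self-contained Java sketch for a single image channel. It is not the imagingbook implementation: the smoothed structure-matrix entries a, c, b from Step 2 are assumed to be given, and boundary handling, the loop over channels, and the step-size adaptation of Eqn. (17.85) are omitted.

// Sketch: local diffusion velocity beta_k(p) = tr(A(p)*H_k(p)) for one channel I,
// given the smoothed structure-matrix entries a = A~(p), c = C~(p), b = B~(p).
static double velocity(double a, double c, double b, double[][] I,
                       int u, int v, double a0, double a1) {
  // eigenvalues of the symmetric 2x2 matrix [a c; c b]
  double tr = a + b, det = a * b - c * c;
  double s = Math.sqrt(Math.max(0, tr * tr / 4 - det));
  double lambda0 = tr / 2 + s, lambda1 = tr / 2 - s;    // lambda0 >= lambda1
  // normalized eigenvector (xh, yh) for lambda0, cf. Eqn. (17.76)
  double xh = c, yh = lambda0 - a;
  double n = Math.hypot(xh, yh);
  if (n > 1e-12) { xh /= n; yh /= n; } else { xh = 1; yh = 0; }
  // conductivity coefficients, Eqn. (17.80)
  double c0 = Math.pow(1 + lambda0 + lambda1, -a0);
  double c1 = Math.pow(1 + lambda0 + lambda1, -a1);
  // geometry matrix A = [A0 A2; A2 A1], Eqn. (17.79)
  double A0 = c0 * yh * yh + c1 * xh * xh;
  double A1 = c0 * xh * xh + c1 * yh * yh;
  double A2 = (c1 - c0) * xh * yh;
  // Hessian entries by finite differences (Alg. 17.10, lines 28-30)
  double h0 = I[u-1][v] - 2 * I[u][v] + I[u+1][v];
  double h1 = I[u][v-1] - 2 * I[u][v] + I[u][v+1];
  double h2 = (I[u-1][v-1] + I[u+1][v+1] - I[u-1][v+1] - I[u+1][v-1]) / 4;
  // velocity beta = tr(A*H) = A0*h0 + A1*h1 + 2*A2*h2, Eqn. (17.84)
  return A0 * h0 + A1 * h1 + 2 * A2 * h2;
}

Each channel value would then be updated by I′k(p) ← I′k(p) + α·β, with α chosen once per iteration according to Eqn. (17.85).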



17.4 Java Implementation
Implementations of all filters described in this chapter are available as part of the imagingbook library.21 The associated classes are subtypes of GenericFilter22 and can be applied to all types of scalar and vector-valued images. They are mostly implemented in two versions, -Scalar and -Vector, which differ in the way they process color images. Scalar-type filters treat each color component individually (as a scalar-valued image), while vector-type filters combine information from all color channels.
KuwaharaFilterScalar, -Vector: Kuwahara-type filter for grayscale and color images, as described in Sec. 17.1 (Alg. 17.2), with radius r (default 2) and variance threshold tsigma (denoted tσ in Alg. 17.2, defaults to 0.0). The size of the resulting filter is (2r + 1) × (2r + 1).
NagaoMatsuyamaFilterScalar, -Vector: Similar to the Kuwahara filter (see Sec. 17.1, Fig. 17.2). The only parameter is the variance threshold tsigma (default value is 0.0).
20 An implementation was released by the original authors as part of the "GREYC's Magic Image Converter" open-source framework, which is also available as a GIMP plugin (https://gmic.eu).
21 Package imagingbook.pub.edgepreservingfilters
22 Package imagingbook.lib.filters
Alg. 17.10 Tschumperlé-Deriche anisotropic diffusion filter for vector-valued (color) images. Typical settings are T = 5, …, 20, dt = 20, σd = 0, σm = 0.5, a0 = 0.5, a1 = 0.9, α0 = 0.5. The temporary maps Ik,x, Ik,y hold the current image gradients for each component k. Note that the gradients are recalculated at the beginning of each pass. Maps A, B, C contain the elements of the 2 × 2 smoothed structure matrix M̃. I′ is a temporary duplicate of the input image I used to hold the results of each filter pass. Function Eigen2x2() (called in line 21) returns the eigenvalues λ = (λ0, λ1) and eigenvectors E = (x0 | x1) of matrix M̃ (see Sec. B.5.1).

1:  TschumperleDericheFilter(I, T, dt, σd, σm, a0, a1, α0)
    Input: I = (I0, …, IK−1), a vector-valued image with K components of size M × N; T, number of iterations; dt, time increment; σd, width of the Gaussian kernel for smoothing the gradient; σm, width of the Gaussian kernel for smoothing the structure matrix; a0, a1, diffusion parameters along/across edges; α0, initial time step. Returns nothing, the input image I is modified.
2:  (M, N) ← size(I)
3:  Create maps Ik,x, Ik,y : M×N → R, for k = 0, …, K−1
4:  Create maps A, C, B : M×N → R                       ▷ elements of M̃
5:  I′ ← duplicate(I)                                   ▷ temp. image I′ = (I′0, …, I′K−1)
6:  α ← α0                                              ▷ initial time step
7:  for n ← 1, …, T do                                  ▷ perform T iterations
      Step 1 – Image gradients (Eqns. (17.68)–(17.69)):
8:      for k ← 0, …, K−1 do
9:        Ik,x ← Ik ∗ H^∇_x ∗ H^G_σd                    ▷ Eqns. (17.69, 17.70)
10:       Ik,y ← Ik ∗ H^∇_y ∗ H^G_σd
      Step 2 – Local structure matrix M̃:
11:     for all p ∈ M × N do
12:       A(p) ← Σk I²k,x(p)
13:       C(p) ← Σk Ik,x(p)·Ik,y(p)
14:       B(p) ← Σk I²k,y(p)
15:     A ← A ∗ H^G_σm                                  ▷ smooth A, B, C with Gaussian H^G_σm
16:     C ← C ∗ H^G_σm
17:     B ← B ∗ H^G_σm
      Step 3 – Update image:
18:     βmax ← −∞
19:     for all p = (u, v) ∈ M × N do
20:       M̃ ← [ A(p) C(p) ; C(p) B(p) ]                 ▷ = M̃(p)
21:       (λ, E) ← Eigen2x2(M̃)                          ▷ λ = (λ0, λ1), see Alg. B.1
22:       x0 ← E(∗, 0)                                  ▷ E = (x0 | x1)
23:       (x̂, ŷ)⊺ ← (1/‖x0‖)·x0                         ▷ normalize eigenvector x0
24:       c0 ← (1 + λ0 + λ1)^(−a0)                      ▷ Eqn. (17.80)
25:       c1 ← (1 + λ0 + λ1)^(−a1)
26:       A ← [ c0·ŷ² + c1·x̂²   (c1−c0)·x̂·ŷ ; (c1−c0)·x̂·ŷ   c0·x̂² + c1·ŷ² ]   ▷ Eqn. (17.79)
27:       for k = 0, …, K−1 do                          ▷ for each component k
28:         h0 ← Ik(u−1, v) − 2·Ik(u, v) + Ik(u+1, v)
29:         h1 ← Ik(u, v−1) − 2·Ik(u, v) + Ik(u, v+1)
30:         h2 ← (Ik(u−1, v−1) + Ik(u+1, v+1) − Ik(u−1, v+1) − Ik(u+1, v−1)) / 4
31:         H ← [ h0 h2 ; h2 h1 ]                       ▷ Hessian matrix Hk(p)
32:         β ← tr(A·H)                                 ▷ βk(p), Eqn. (17.84)
33:         βmax ← max(βmax, |β|)
34:         I′k(p) ← I′k(p) + α·β                       ▷ update temp. image I′
      Step 4 – Copy back and update α:
35:     copy(I′, I)                                     ▷ copy I′ back to input image I
36:     α ← dt/βmax                                     ▷ adapt time step α
37: return
Fig. 17.21 Tschumperlé-Deriche filter example. The non-isotropy of the filter can be adjusted by changing parameter a0, which controls the diffusion along contours (see Eqn. (17.80)): a0 = 0.50, 0.25, 0.00 (a–c). Parameter a1 = 0.90 (constant) controls the diffusion in the direction of the gradient (perpendicular to contours). Remaining settings are T = 20, dt = 20, σd = 0.5, σm = 0.5 (see the description of Alg. 17.10); original image in Fig. 17.3 (a). Panels: (a) a0 = 0.50, (b) a0 = 0.25, (c) a0 = 0.00.
BilateralFilterScalar, -Vector: Bilateral filter using Gaussian kernels, as described in Sec. 17.2 (see Algs. 17.4 and 17.5). Parameters sigmaD (σd , default 2.0) and sigmaR (σr , default 50.0) specify the widths of the domain and the range kernels, respectively. The type of norm for measuring color distances is specified by colorNormType (default is NormType.L2). BilateralFilterScalarSeparable, -VectorSeparable: x/y-separable versions of the bilateral filter (see Alg. 17.6). Parameters are the same as for the non-separable bilateral filters.



PeronaMalikFilterScalar, -Vector: Diffusion filters for grayscale and color images (see Algs. 17.7 and 17.9). The key parameters and their default values are iterations (T = 10), alpha (α = 0.2), kappa (κ = 25), smoothRegions (true), colorMode (SeparateChannels). Setting smoothRegions = true, function g^(2)_κ is used to control conductivity, otherwise g^(1)_κ (see Eqn. (17.51)). For filtering color images, three different color modes can be specified for diffusion control: SeparateChannels, BrightnessGradient, or ColorGradient. See Prog. 17.1 for an example of using this class in a simple ImageJ plugin.
1   import ij.ImagePlus;
2   import ij.plugin.filter.PlugInFilter;
3   import ij.process.ColorProcessor;
4   import ij.process.ImageProcessor;
5   import imagingbook.lib.filter.GenericFilter;
6   import imagingbook.pub... .PeronaMalikF.ColorMode;
7   import imagingbook.pub... .PeronaMalikF.Parameters;
8   import imagingbook.pub... .PeronaMalikFilterScalar;
9   import imagingbook.pub... .PeronaMalikFilterVector;
10
11  public class Perona_Malik_Demo implements PlugInFilter {
12
13    public int setup(String args, ImagePlus imp) {
14      return DOES_ALL + DOES_STACKS;
15    }
16
17    public void run(ImageProcessor ip) {
18      // create a parameter object, modify settings if needed:
19      Parameters params = new Parameters();
20      params.iterations = 20;
21      params.alpha = 0.15f;
22      params.kappa = 20.0f;
23
24      // create the actual filter:
25      GenericFilter filter = null;
26      if (ip instanceof ColorProcessor) {
27        params.colorMode = ColorMode.ColorGradient;
28        filter = new PeronaMalikFilterVector(params);
29      }
30      else {
31        filter = new PeronaMalikFilterScalar(params);
32      }
33
34      // apply the filter:
35      filter.applyTo(ip);
36    }
37  }
38

Prog. 17.1 Perona-Malik filter usage example (ImageJ plugin). Inside the run() method, a parameter object (instance of class PeronaMalikFilter.Parameters) is created in line 19. Individual parameters may then be modified, as shown in lines 20–22. This would typically be done by querying the user (e.g., with ImageJ's GenericDialog class). ColorMode (in line 27) is defined as an enumeration type, providing the options SeparateChannels (default), BrightnessGradient and ColorGradient. In line 28, a new instance of PeronaMalikFilterVector is created for processing color images using the color gradient. For a non-color image, a scalar version of the filter is created in line 31. Finally, in line 35, the filter is (destructively) applied to the input image, that is, ip is modified. Note that, as specified in the setup() method, this plugin works for any type of image and image stacks.
TschumperleDericheFilter: Anisotropic diffusion filter for all types of grayscale and color images, as described in Sec. 17.3.4 (Alg. 17.10). Parameters and default values are iterations (T = 20), dt (dt = 20), sigmaD (σd = 0.0), sigmaM (σm = 0.5), a0 (a0 = 0.25), a1 (a1 = 0.90), alpha0 (α0 = 0.5). Otherwise the usage of this class is analogous to the example in Prog. 17.1. All default values pertain to the parameterless constructors that are also available. Note that these filters are generic and can be applied to grayscale and color images without any modification.



17.5 Exercises
Exercise 17.1. Implement a pure range filter (Eqn. (17.18)) for grayscale images, using a 1D Gaussian kernel
h_r(x) = 1/(√(2π)·σ) · exp(−x²/(2σ²)).
Investigate the effects of this filter upon the image and its histogram for σ = 10, 20, and 25. Exercise 17.2. Modify the Kuwahara-type filter for color images in Alg. 17.3 to use the norm of the color covariance matrix (as defined in Eqn. (17.12)) for quantifying the amount of variation in each subregion. Estimate the number of additional calculations required for processing each image pixel. Implement the modified algorithm, compare the results and execution times. Exercise 17.3. Modify the separable bilateral filter algorithm (given in Alg. 17.6) to handle color images, using Alg. 17.5 as a starting point. Implement and test your algorithm, compare the results (see also Fig. 17.14) and execution times. Exercise 17.4. Verify (experimentally) that n iterations of the diffusion process defined in Eqn. (17.44) have the same effect as a Gaussian filter of width σn , as stated in Eqn. (17.47). To determine the impulse response of the resulting diffusion filter, use an “impulse” test image, that is, a black (zero-valued) image with a single bright pixel at the center.
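As a small starting point for Exercise 17.1 (a sketch only, not a complete solution), the 1D Gaussian kernel h_r(x) given above can be evaluated in Java as follows; a pure range filter would then weight contributing pixel values solely by h_r applied to their intensity difference from the center pixel.

// Gaussian range kernel h_r(x) = 1/(sqrt(2*pi)*sigma) * exp(-x^2 / (2*sigma^2))
static double hr(double x, double sigma) {
  return Math.exp(-(x * x) / (2 * sigma * sigma)) / (Math.sqrt(2 * Math.PI) * sigma);
}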



Part VI



Spectral Techniques



18 Introduction to Spectral Methods



The following three chapters deal with the representation and analysis of images in the frequency domain, based on the decomposition of image signals into sine and cosine functions using the well-known Fourier transform. Students often consider this a difficult topic, mainly because of its mathematical flavor and because its practical applications are not immediately obvious. Indeed, most common operations and methods in digital image processing can be sufficiently described in the original signal or image space without even mentioning spectral techniques. This is the reason why we pick up this topic relatively late in this text. While spectral techniques were often used to improve the efficiency of image-processing operations, this has become increasingly less important due to the high power of modern computers. There exist, however, some important effects, concepts, and techniques in digital image processing that are considerably easier to describe in the frequency domain or cannot otherwise be understood at all. The topic should therefore not be avoided altogether. Fourier analysis not only has a very elegant (perhaps not always sufficiently appreciated) mathematical theory but interestingly enough also complements some important concepts we have seen earlier, in particular linear filters and linear convolution (see Sec. 4.6.1). Equally important are applications of spectral techniques in many popular methods for image and video compression, and they provide valuable insight into the mechanisms of sampling (discretization) of continuous signals as well as the reconstruction and interpolation of discrete signals. In the following, we first give a basic introduction to the concepts of frequency and spectral decomposition that tries to be minimally formal and thus should be easily "digestible" even for readers without previous exposure to this topic. We start with the representation of 1D signals and will then extend the discussion to 2D signals (images) in the next chapter. Subsequently, Chapter 20 briefly explains the discrete cosine transform, a popular variant of the discrete Fourier transform that is frequently used in image compression.
18.1 The Fourier Transform
The concept of frequency and the decomposition of waveforms into elementary "harmonic" functions first arose in the context of music and sound. The idea of describing acoustic events in terms of "pure" sinusoidal functions does not seem unreasonable, considering that sine waves appear naturally in every form of oscillation (e.g., on a free-swinging pendulum).
18.1.1 Sine and Cosine Functions
The well-known cosine function,
f(x) = cos(x),   (18.1)
has the value 1 at the origin (i.e., cos(0) = 1) and performs exactly one full swing between the origin and the point x = 2π (Fig. 18.1 (a)). We say that the function is periodic with a cycle length (period) T = 2π, that is, cos(x) = cos(x+2π) = cos(x+4π) = . . . cos(x+k2π) . . . ,



(18.2)



for any k ∈ Z. The same is true for the corresponding sine function, except that its value is zero at the origin (since sin(0) = 0).
Fig. 18.1 Cosine and sine functions of different frequency. The expression cos(ωx) describes a cosine function with angular frequency ω at position x. The angular frequency ω of this periodic function corresponds to a cycle length (period) T = 2π/ω. For ω = 1, the period is T1 = 2π (a), and for ω = 3 it is T3 = 2π/3 ≈ 2.0944 (b). The same holds for the sine function sin(ωx). Panels: (a) cos(x), sin(x); (b) cos(3x), sin(3x).
Frequency and Amplitude
The number of oscillations of cos(x) over the distance T = 2π is one, and thus the value of the angular frequency is
ω = 2π/T = 1.   (18.3)
If we modify the cosine function in Eqn. (18.1) to f (x) = cos(3x) ,



(18.4)



we obtain a compressed cosine wave that oscillates three times faster than the original function cos(x) (see Fig. 18.1 (b)). The function cos(3x) performs three full cycles over a distance of 2π and thus has the angular frequency ω = 3 and period T = 2π/3. In general, the period T relates to the angular frequency ω as
T = 2π/ω,   (18.5)
for ω > 0. A sine or cosine function oscillates between peak values +1 and −1, and its amplitude is 1. Multiplying by a constant a ∈ R changes the peak values of the function to ±a and its amplitude to a. In general, the expressions
a·cos(ωx)   and   a·sin(ωx)
denote a cosine or sine function, respectively, with amplitude a and angular frequency ω, evaluated at position (or point in time) x. The relation between the angular frequency ω and the "common" frequency f is given by
f = 1/T = ω/(2π)   or   ω = 2πf,   (18.6)
respectively, where f is measured in cycles per length or time unit.1 In the following, we use either ω or f as appropriate, and the meaning should always be clear from the symbol used.
Phase
Shifting a cosine function along the x axis by some distance φ, i.e., cos(x) → cos(x − φ), changes the phase of the cosine wave, and φ denotes the phase angle of the resulting function. Thus a sine function is really just a cosine function shifted to the right2 by a quarter period (φ = 2π/4 = π/2), thus
sin(ωx) = cos(ωx − π/2).   (18.7)
If we take the cosine function as the reference, i.e., with phase φcos = 0, then the phase angle of the corresponding sine function is φsin = π2 = 90◦ . Cosine and sine functions are “orthogonal” in a sense and we can use this fact to create new “sinusoidal” functions with arbitrary frequency, phase, and amplitude. In particular, adding a cosine and a sine function with identical frequency ω and arbitrary amplitudes A and B, respectively, creates another sinusoid: A· cos(ωx) + B· sin(ωx) = C · cos(ωx − φ).



(18.8)



The amplitude C of the resulting function and its phase angle φ are defined only by the two original amplitudes A and B as
C = √(A² + B²)   and   φ = tan⁻¹(B/A).   (18.9)
Figure 18.2 (a) shows an example with amplitudes A = B = 0.5 and a resulting phase angle φ = π/4 = 45◦ . 1
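As a quick numerical check of Eqn. (18.9) for the values used in Fig. 18.2 (A = B = 0.5), for example in Java:

public class SinusoidDemo {
  public static void main(String[] args) {
    double A = 0.5, B = 0.5;          // amplitudes of the cosine and sine parts
    double C = Math.hypot(A, B);      // sqrt(A^2 + B^2), about 0.707, Eqn. (18.9)
    double phi = Math.atan2(B, A);    // phase angle, here pi/4 (45 degrees)
    System.out.println("C = " + C + ", phi = " + Math.toDegrees(phi) + " degrees");
  }
}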



1 For example, a temporal oscillation with frequency f = 1000 cycles/s (Hertz) has the period T = 1/1000 s and therefore the angular frequency ω = 2000π. The latter is a unitless quantity.
2 In general, f(x−d) is the original function f(x) shifted to the right by a distance d.
Fig. 18.2 Adding cosine and sine functions with identical frequencies, A·cos(ωx) + B·sin(ωx), with ω = 3 and A = B = 0.5. The result is a phase-shifted cosine function (dotted curve) with amplitude C = √(0.5² + 0.5²) ≈ 0.707 and phase angle φ = 45° (a). If the cosine and sine components are treated as orthogonal vectors (A, B) in 2-space, the amplitude and phase of the resulting sinusoid (C) can be easily determined by vector addition (b).
Complex-Valued Sine Functions—Euler’s Notation Figure 18.2 (b) depicts the contributing cosine and sine components of the new function as a pair of orthogonal vectors in 2-space, whose lengths correspond to the amplitudes A and B. Not coincidentally, this reminds us of the representation of real and imaginary components of complex numbers, z = a + i·b, in the 2D plane C, where i is the imaginary unit (i2 = −1). This association becomes even stronger if we look at Euler’s famous notation of complex numbers along the unit circle, z = eiθ = cos(θ) + i· sin(θ),



(18.10)



where e ≈ 2.71828 is the Euler number. If we take the expression eiθ as a function of the angle θ rotating around the unit circle, we obtain a “complex-valued sinusoid” whose real and imaginary parts correspond to a cosine and a sine function, respectively, i.e., Re(eiθ ) = cos(θ), Im(eiθ ) = sin(θ).



(18.11)



Since the quantity z = e^(iθ) is placed on the unit circle, the amplitude of the complex-valued sinusoid is |z| = r = 1. We can easily modify the amplitude of this function by multiplying it by some real value a ≥ 0, that is,
|a·e^(iθ)| = a·|e^(iθ)| = a.   (18.12)
Similarly, we can alter the phase of a complex-valued sinusoid by adding a phase angle φ in the function's exponent or, equivalently, by multiplying it by a complex-valued constant c = e^(iφ),
ei(θ+φ) = eiθ · eiφ .



(18.13)



In summary, multiplying by some real value affects only the amplitude of a sinusoid, while multiplying by some complex value c (with unit amplitude |c| = 1) modifies only the function's phase (without changing its amplitude). In general, of course, multiplying by some arbitrary complex value changes both the amplitude and the phase of the function.3 The complex notation makes it easy to combine orthogonal pairs of sine functions cos(ωx) and sin(ωx) with identical frequencies ω into a single expression, namely
e^(iθ) = e^(iωx) = cos(ωx) + i·sin(ωx).   (18.14)
We will make more use of this notation later (in Sec. 18.1.4) to explain the Fourier transform.
18.1.2 Fourier Series Representation of Periodic Functions
As we demonstrated in Eqn. (18.8), sinusoidal functions of arbitrary frequency, amplitude, and phase can be described as the sum of suitably weighted cosine and sine functions. One may wonder if nonsinusoidal functions can also be decomposed into a sum of cosine and sine functions. The answer is yes, of course. It was Fourier4 who first extended this idea to arbitrary functions and showed that (almost) any periodic function g(x) with a fundamental frequency ω0 can be described as a—possibly infinite—sum of "harmonic" sinusoids, that is,
g(x) = Σ_{k=0}^{∞} [ Ak·cos(k·ω0·x) + Bk·sin(k·ω0·x) ].   (18.15)
This is called a Fourier series, and the constant factors Ak, Bk are the Fourier coefficients of function g(). Notice that in Eqn. (18.15) the frequencies of the sine and cosine functions contributing to the Fourier series are integer multiples ("harmonics") of the fundamental frequency ω0, including the zero frequency for k = 0. The corresponding coefficients Ak and Bk, which are initially unknown, can be uniquely derived from the original function g(). This process is commonly referred to as Fourier analysis.
18.1.3 Fourier Integral
Fourier did not want to limit this concept to periodic functions and postulated that nonperiodic functions, too, could be described as sums of sine and cosine functions. While this proved to be true in principle, it generally requires—beyond multiples of the fundamental frequency (kω0)—infinitely many, densely spaced frequencies! The resulting decomposition,
g(x) = ∫_0^∞ [ Aω·cos(ω·x) + Bω·sin(ω·x) ] dω,   (18.16)
3 See also Sec. A.5 of the Appendix.
4 Jean-Baptiste Joseph de Fourier (1768–1830).
is called a Fourier integral and the coefficients Aω, Bω are again the weights for the corresponding cosine and sine functions with the (continuous) frequency ω. The Fourier integral is the basis of the Fourier spectrum and the Fourier transform, as will be described (for details, see, e.g., [43, Sec. 15.3]). In Eqn. (18.16), every coefficient Aω and Bω specifies the amplitude of the corresponding cosine or sine function, respectively. The coefficients thus define "how much of each frequency" contributes to a given function or signal g(x). But what are the proper values of these coefficients for a given function g(x), and can they be determined uniquely? The answer is yes again, and the "recipe" for calculating the coefficients is amazingly simple:
Aω = A(ω) = (1/π) · ∫_{−∞}^{∞} g(x)·cos(ω·x) dx,
Bω = B(ω) = (1/π) · ∫_{−∞}^{∞} g(x)·sin(ω·x) dx.   (18.17)
Since this representation of the function g(x) involves infinitely many densely spaced frequencies ω, the corresponding coefficients A(ω) and B(ω) are indeed continuous functions as well. They hold the continuous distribution of frequency components contained in the original signal, which is called a "spectrum". Thus the Fourier integral in Eqn. (18.16) describes the original function g(x) as a sum of infinitely many cosine and sine functions, with the corresponding Fourier coefficients contained in the functions A(ω) and B(ω). Note that the signal g(x) is uniquely and fully represented by the corresponding coefficient functions A(ω) and B(ω). We learn from Eqn. (18.17) how to calculate the spectrum of a given function g(x), and Eqn. (18.16) explains how to reconstruct the original function from its spectrum if it is ever needed.
18.1.4 Fourier Spectrum and Transformation
There is now only a small remaining step from the decomposition of a function g(x), as shown in Eqn. (18.17), to the "real" Fourier transform. In contrast to the Fourier integral, the Fourier transform treats both the original signal and the corresponding spectrum as complex-valued functions, which considerably simplifies the resulting notation. Based on the functions A(ω) and B(ω) defined in the Fourier integral (Eqn. (18.17)), the Fourier spectrum G(ω) of a function g(x) is defined as
G(ω) = √(π/2) · [ A(ω) − i·B(ω) ]
     = √(π/2) · [ (1/π)·∫_{−∞}^{∞} g(x)·cos(ωx) dx − i·(1/π)·∫_{−∞}^{∞} g(x)·sin(ωx) dx ]
     = (1/√(2π)) · ∫_{−∞}^{∞} g(x)·[ cos(ωx) − i·sin(ωx) ] dx,   (18.18)
with g(x), G(ω) ∈ C. Using Euler's notation of complex values (see Eqn. (18.14)) yields the continuous Fourier spectrum in Eqn. (18.18) in its common form:
G(ω) = (1/√(2π)) · ∫_{−∞}^{∞} g(x)·[ cos(ωx) − i·sin(ωx) ] dx   (18.19)
     = (1/√(2π)) · ∫_{−∞}^{∞} g(x)·e^(−iωx) dx.
The transition from the function g(x) to its Fourier spectrum G(ω) is called the (forward) Fourier transform5 (F). Conversely, the original function g(x) can be reconstructed completely from its Fourier spectrum G(ω) using the inverse Fourier transform6 (F⁻¹), defined as
g(x) = (1/√(2π)) · ∫_{−∞}^{∞} G(ω)·[ cos(ωx) + i·sin(ωx) ] dω   (18.20)
     = (1/√(2π)) · ∫_{−∞}^{∞} G(ω)·e^(iωx) dω.
In general, even if one of the involved functions (g(x) or G(ω)) is real-valued (which is usually the case for physical signals g(x)), the other function is complex-valued. One may also note that the forward transformation F (Eqn. (18.19)) and the inverse transformation F⁻¹ (Eqn. (18.20)) are almost completely symmetrical, the sign of the exponent being the only difference.7 The spectrum produced by the Fourier transform is a new representation of the signal in a space of frequencies. Apparently, this "frequency space" and the original "signal space" are dual and interchangeable mathematical representations.
18.1.5 Fourier Transform Pairs
The relationship between a function g(x) and its Fourier spectrum G(ω) is unique in both directions: the Fourier spectrum is uniquely defined for a given function, and for any Fourier spectrum there is only one matching signal. Thus the two functions g(x) and G(ω) form a so-called "transform pair", which we denote as
g(x) ⟷ G(ω).
Table 18.1 lists the transform pairs for some selected analytical functions, which are also shown graphically in Figs. 18.3 and 18.4. The Fourier spectrum of a cosine function cos(ω0x), for example, consists of two separate thin pulses arranged symmetrically at a distance ω0 from the origin (Fig. 18.3 (a, c)). Intuitively, this corresponds to our physical understanding of a spectrum (e.g., if we think of a pure monophonic sound in acoustics or the thin line produced by some extremely pure color in a light spectrum). Increasing the
5 Also called "direct" transformation.
6 Also called "backward" transformation.
7 Various definitions of the Fourier transform are in common use. They are contrasted mainly by the constant factors outside the integral and the signs of the exponents in the forward and inverse transforms, but all versions are equivalent in principle. The symmetric variant shown here uses the same factor (1/√(2π)) in the forward and inverse transforms.
Table 18.1 Fourier transforms of selected analytical functions; δ() denotes the "impulse" or Dirac function (see Sec. 18.2.1).
Cosine function with frequency ω0:   g(x) = cos(ω0·x),   G(ω) = √(π/2)·[ δ(ω+ω0) + δ(ω−ω0) ]   (Fig. 18.3 (a, c))
Sine function with frequency ω0:     g(x) = sin(ω0·x),   G(ω) = i·√(π/2)·[ δ(ω+ω0) − δ(ω−ω0) ]   (Fig. 18.3 (b, d))
Gaussian function of width σ:        g(x) = (1/σ)·e^(−x²/(2σ²)),   G(ω) = e^(−σ²ω²/2)   (Fig. 18.4 (a, b))
Rectangular pulse of width 2b:       g(x) = Πb(x) = { 1 for |x| ≤ b; 0 otherwise },   G(ω) = 2b·sin(bω)/(√(2π)·ω)   (Fig. 18.4 (c, d))
frequency ω0 would move the corresponding pulses in the spectrum away from the origin. Notice that the spectrum of the cosine function is real-valued, the imaginary part being zero. Of course, the same relation holds for the sine function (Fig. 18.3 (b, d)), with the only difference being that the pulses have different polarities and appear in the imaginary part of the spectrum. In this case, the real part of the spectrum G(ω) is zero. The Gaussian function is particularly interesting because its Fourier spectrum is also a Gaussian function (Fig. 18.4 (a, b))! It is one of the few examples where the function type in frequency space is the same as in signal space. With the Gaussian function, it is also clear to see that stretching a function in signal space corresponds to shortening its spectrum and vice versa. The Fourier transform of a rectangular pulse (Fig. 18.4 (c, d)) is the "Sinc" function8 of the form sinc(x) = sin(x)/x. With increasing frequencies, this function drops off rather slowly, which shows that the components contained in the original rectangular signal are spread out over a large frequency range. Thus a rectangular pulse function exhibits a very wide spectrum in general.
18.1.6 Important Properties of the Fourier Transform
Symmetry
The Fourier spectrum extends over positive and negative frequencies and could, in principle, be an arbitrary complex-valued function. However, in many situations, the spectrum is symmetric about its origin (see, e.g., [52, p. 178]). In particular, the Fourier transform of a real-valued signal g(x) ∈ R is a so-called Hermite function with the property
G(ω) = G*(−ω),   (18.21)
8 The full name of sinc(x) is "sine cardinal" function. There is also a normalized form, sinc(x) = sin(πx)/(πx), whose zeros are at integer positions (also called filtering or interpolation function [40, Ch. 4]). See also Fig. 22.3 (p. 641).
Fig. 18.3 Fourier transform pairs—cosine and sine functions. Each panel shows the signal g(x) and its spectrum G(ω):
(a) Cosine (ω0 = 3): g(x) = cos(3x), G(ω) = √(π/2)·[ δ(ω+3) + δ(ω−3) ]
(b) Sine (ω0 = 3): g(x) = sin(3x), G(ω) = i·√(π/2)·[ δ(ω+3) − δ(ω−3) ]
(c) Cosine (ω0 = 5): g(x) = cos(5x), G(ω) = √(π/2)·[ δ(ω+5) + δ(ω−5) ]
(d) Sine (ω0 = 5): g(x) = sin(5x), G(ω) = i·√(π/2)·[ δ(ω+5) − δ(ω−5) ]
Fig. 18.4 Fourier transform pairs—Gaussian functions and square pulses. Each panel shows the signal g(x) and its spectrum G(ω):
(a) Gaussian (σ = 1): g(x) = e^(−x²/2), G(ω) = e^(−ω²/2)
(b) Gaussian (σ = 3): g(x) = (1/3)·e^(−x²/(2·9)), G(ω) = e^(−9ω²/2)
(c) Pulse (b = 1): g(x) = Π1(x), G(ω) = 2·sin(ω)/(√(2π)·ω)
(d) Pulse (b = 2): g(x) = Π2(x), G(ω) = 4·sin(2ω)/(√(2π)·ω)
where G∗ denotes the complex conjugate of G.9



Linearity
The Fourier transform is also a linear operation such that multiplying the signal by a constant value a ∈ C scales the corresponding spectrum by the same amount,
a·g(x) ⟷ a·G(ω).   (18.22)
Linearity also means that the transform of the sum of two signals g(x) = g1(x) + g2(x) is identical to the sum of their individual transforms G1(ω) and G2(ω) and thus
g1(x) + g2(x) ⟷ G1(ω) + G2(ω).   (18.23)
Similarity
If the original function g(x) is scaled in space or time, the opposite effect appears in the corresponding Fourier spectrum. In particular, as observed on the Gaussian function in Fig. 18.4, stretching a signal by a factor s (i.e., g(x) → g(sx)) leads to a shortening of the Fourier spectrum:
g(sx) ⟷ (1/|s|)·G(ω/s).   (18.24)
Similarly, the signal is shortened if the corresponding spectrum is stretched.
Shift Property
If the original function g(x) is shifted by a distance d along its coordinate axis (i.e., g(x) → g(x−d)), then the Fourier spectrum multiplies by the complex value e^(−iωd) dependent on ω:
g(x−d) ⟷ e^(−iωd)·G(ω).   (18.25)
Since e^(−iωd) lies on the unit circle, the multiplication causes a phase shift on the spectral values (i.e., a redistribution between the real and imaginary components) without altering the magnitude |G(ω)|. Obviously, the amount (angle) of phase shift (ωd) is proportional to the angular frequency ω.
Convolution Property
From the image-processing point of view, the most interesting property of the Fourier transform is its relation to linear convolution (see Sec. 4.3.1). Let us assume that we have two functions g(x) and h(x) and their corresponding Fourier spectra G(ω) and H(ω), respectively. If the original functions are subject to linear convolution (i.e., g(x) ∗ h(x)), then the Fourier transform of the result equals the (pointwise) product of the individual Fourier transforms G(ω) and H(ω):
g(x) ∗ h(x) ⟷ G(ω)·H(ω).   (18.26)
Due to the duality of signal space and frequency space, the same also holds in the opposite direction; i.e., a pointwise multiplication of two signals is equivalent to convolving the corresponding spectra:
g(x)·h(x) ⟷ G(ω) ∗ H(ω).   (18.27)
9 See also Sec. A.5 of the Appendix.
A multiplication of the functions in one space (signal or frequency space) thus corresponds to a linear convolution of the Fourier spectra in the opposite space.



18.2 Working with Discrete Signals
The definition of the continuous Fourier transform in Sec. 18.1 is of little use for numerical computation on a computer. Neither can arbitrary continuous (and possibly infinite) functions be represented in practice, nor can the required integrals be computed. In reality, we must always deal with discrete signals, and we therefore need a new version of the Fourier transform that treats signals and spectra as finite data vectors—the "discrete" Fourier transform. Before continuing with this issue we want to use our existing wisdom to take a closer look at the process of discretizing signals in general.
18.2.1 Sampling
We first consider the question of how a continuous function can be converted to a discrete signal in the first place. This process is usually called "sampling" (i.e., taking samples of the continuous function at certain points in time (or in space), usually spaced at regular distances). To describe this step in a simple but formal way, we require an inconspicuous but nevertheless important piece from the mathematician's toolbox.
The Impulse Function δ(x)
We casually encountered the impulse function (also called the delta or Dirac function) earlier when we looked at the impulse response of linear filters (see Sec. 4.3.4) and in the Fourier transforms of the cosine and sine functions (Fig. 18.3). This function, which models a continuous "ideal" impulse, is unusual in several respects: its value is zero everywhere except at the origin, where it is nonzero (though undefined), but its integral is one, that is,
δ(x) = 0 for x ≠ 0   and   ∫_{−∞}^{∞} δ(x) dx = 1.   (18.28)
One could imagine δ(x) as a single pulse at position x = 0 that is infinitesimally narrow but still contains finite energy (1). Also remarkable is the impulse function's behavior under scaling along the time (or space) axis (i.e., δ(x) → δ(sx)), with
δ(sx) = (1/|s|)·δ(x),   (18.29)
for s ̸= 0. Despite the fact that δ(x) does not exist in physical reality and cannot be plotted (the corresponding plots in Fig. 18.3 are for illustration only), this function is a useful mathematical tool for describing the sampling process, as will be shown.



Sampling With The Impulse Function
Using the concept of the ideal impulse, the sampling process can be described in a straightforward and intuitive way.10 If a continuous function g(x) is multiplied with the impulse function δ(x), we obtain a new function
ḡ(x) = g(x)·δ(x) = { g(0) for x = 0;  0 otherwise }.   (18.30)
The resulting function ḡ(x) consists of a single pulse at position 0 whose height corresponds to the original function value g(0) (at position 0). Thus, by multiplying the function g(x) by the impulse function, we obtain a single discrete sample value of g(x) at position x = 0. If the impulse function δ(x) is shifted by a distance x0, we can sample g(x) at an arbitrary position x = x0,
ḡ(x) = g(x)·δ(x−x0) = { g(x0) for x = x0;  0 otherwise }.   (18.31)
Here δ(x−x0) is the impulse function shifted (to the right) by x0, i.e., the resulting function ḡ(x) is zero except at position x0, where it contains the original function value g(x0). This relationship is illustrated in Fig. 18.5 for the sampling position x0 = 3.
Fig. 18.5 Sampling with the impulse function. The continuous signal g(x) is sampled at position x0 = 3 by multiplying g(x) by a shifted impulse function δ(x−3).
To sample the function g(x) at more than one position simultaneously (e.g., at positions x1 and x2), we use two separately shifted versions of the impulse function, multiply g(x) by both of them, and simply add the resulting function values. In this particular case, we get
ḡ(x) = g(x)·δ(x−x1) + g(x)·δ(x−x2)   (18.32)
     = g(x)·[ δ(x−x1) + δ(x−x2) ]   (18.33)
     = { g(x1) for x = x1;  g(x2) for x = x2;  0 otherwise }.   (18.34)
From Eqn. (18.33), sampling a continuous function g(x) at N discrete positions xi = 0, 1, …, N−1 can thus be described as the sum of the N individual samples, that is,
10 The following description is deliberately somewhat superficial (in a mathematical sense). See, e.g., [52, 150] for more concise coverage of these topics.
Fig. 18.6 Sampling with the comb function. The original continuous signal g(x) is multiplied by the comb function III(x). The function value g(x) is transferred to the resulting function ḡ(x) only at integer positions x = xi ∈ Z and ignored at all non-integer positions.
ḡ(x) = g(x)·[ δ(x−0) + δ(x−1) + … + δ(x−(N−1)) ] = g(x) · Σ_{i=0}^{N−1} δ(x−i).   (18.35)
The Comb Function



The sum of shifted impulses, i.e., Σi δ(x−i) in Eqn. (18.35), is called a pulse sequence or pulse train. Extending this sequence to infinity in both directions, we obtain the "comb" or "Shah" function
III(x) = Σ_{i=−∞}^{∞} δ(x − i).   (18.36)
The process of discretizing a continuous function by taking samples at regular integer intervals can thus be written simply as g¯(x) = g(x) · III(x),



(18.37)



that is, as a pointwise multiplication of the original signal g(x) with the comb function III(x). As illustrated in Fig. 18.6, the function values of g(x) at integer positions xi ∈ Z are transferred to the discrete function ḡ(xi) and ignored at all non-integer positions. Of course, the sampling interval (i.e., the distance between adjacent samples) is not restricted to 1. To take samples at regular but arbitrary intervals τ, the sampling function III(x) is simply scaled along the time or space axis; that is,
ḡ(x) = g(x)·III(x/τ),  for τ > 0.   (18.38)
Spectral Effects of Sampling
Fig. 18.7 Comb function and its Fourier transform. Comb function IIIτ(x) for the sampling interval τ = 1 (a) and its Fourier transform III(ω/(2π)) (b). Comb function for τ = 3 (c) and its Fourier transform 3·III(3ω/(2π)), with pulses spaced at ω0 = 2π/3 (d). Note that the actual height of the δ-pulses is undefined and shown only for illustration.
Despite the elegant formulation made possible by the use of the comb function, one may still wonder why all this math is necessary to describe a process that appears intuitively to be so simple anyway. The
Fourier spectrum gives one answer to this question. Sampling a continuous function has massive—though predictable—effects upon the frequency spectrum of the resulting (discrete) signal. Using the comb function as a formal model for the sampling process makes it relatively easy to estimate and interpret those spectral effects. Similar to the Gaussian (see Sec. 18.1.5), the comb function features the special property that its Fourier transform
III(x) ⟷ III(ω/(2π))   (18.39)
is again a comb function (i.e., the same type of function). In general, the Fourier transform of a comb function scaled to an arbitrary sampling interval τ is
III(x/τ) ⟷ τ·III(τω/(2π)),   (18.40)
due to the similarity property of the Fourier transform (Eqn. (18.24)). Figure 18.7 shows two examples of the comb function IIIτ(x) with sampling intervals τ = 1 and τ = 3 and the corresponding Fourier transforms. Now, what happens to the Fourier spectrum during discretization, that is, when we multiply a function in signal space by the comb function III(x/τ)? We get the answer by recalling the convolution property of the Fourier transform (Eqn. (18.26)): the product of two functions in one space (signal or frequency space) corresponds to the linear convolution of the transformed functions in the opposite space, and thus
g(x)·III(x/τ) ⟷ G(ω) ∗ τ·III(τω/(2π)).   (18.41)
Fig. 18.8 Spectral effects of sampling. The spectrum G(ω) of the original continuous signal is assumed to be band-limited within the range ±ωmax (a). Sampling the signal at a rate (sampling frequency) ωs = ω1 causes the signal's spectrum G(ω) to be replicated at multiples of ω1 along the frequency (ω) axis (b). Obviously, the replicas in the spectrum do not overlap as long as ωs > 2ωmax. In (c), the sampling frequency ωs = ω2 is less than 2ωmax, so there is overlap between the replicas in the spectrum, and frequency components are mirrored at 2ωmax and superimpose the original spectrum. This effect is called "aliasing" because the original spectrum (and thus the original signal) cannot be reproduced from such a corrupted spectrum.
Aliasing
We already know that the Fourier spectrum of the sampling function is a comb function again and therefore consists of a sequence of regularly spaced pulses (Fig. 18.7). In addition, we know that convolving an arbitrary function with the impulse δ(x) returns the original function; that is, f (x) ∗δ(x) = f (x) (see Sec. 4.3.4). Convolving with a shifted pulse δ(x−d) also reproduces the original function f (x), though shifted by the same distance d: f (x) ∗ δ(x−d) = f (x−d).



(18.42)



As a consequence, the spectrum G(ω) of the original continuous signal becomes replicated in the Fourier spectrum Ḡ(ω) of a sampled signal at every pulse of the sampling function's spectrum; that is, infinitely many times (see Fig. 18.8 (a, b))! Thus the resulting Fourier spectrum is repetitive with a period 2π/τ, which corresponds to the sampling frequency ωs.
Aliasing and the sampling theorem
As long as the spectral replicas in Ḡ(ω) created by the sampling process do not overlap, the original spectrum G(ω)—and thus the original continuous function—can be reconstructed without loss from any isolated replica of G(ω) in the periodic spectrum Ḡ(ω). As we can see in Fig. 18.8, this requires that the frequencies contained in the original signal g(x) be within some upper limit ωmax; that is, the signal contains no components with frequencies greater than ωmax. The maximum allowed signal frequency ωmax depends upon the sampling frequency ωs used to discretize the signal, with the requirement
ωmax ≤ (1/2)·ωs   or   ωs ≥ 2·ωmax.   (18.43)
Discretizing a continuous signal g(x) with frequency components in the range 0 ≤ ω ≤ ωmax thus requires a sampling frequency ωs of at least twice the maximum signal frequency ωmax . If this condition is not met, the replicas in the spectrum of the sampled signal overlap



(Fig. 18.8 (c)) and the spectrum becomes corrupted. Consequently, 18.3 The Discrete the original signal cannot be recovered flawlessly from the sampled Fourier Transform signal’s spectrum. This effect is commonly called “aliasing”. (DFT) What we just said in simple terms is nothing but the essence of the famous “sampling theorem” formulated by Shannon and Nyquist (see, e.g., [52, p. 256]). It actually states that the sampling frequency must be at least twice the bandwidth 11 of the continuous signal to avoid aliasing effects. However, if we assume that a signal’s frequency range starts at zero, then bandwidth and maximum frequency are the same anyway. 18.2.2 Discrete and Periodic Functions Assume that we are given a continuous signal g(x) that is periodic with a period of length T . In this case, the corresponding Fourier spectrum G(ω) is a sequence of thin spectral lines equally spaced at a distance ω0 = 2π/T . As discussed in Sec. 18.1.2, the Fourier spectrum of a periodic function can be represented as a Fourier series and is therefore discrete. Conversely, if a continuous signal g(x) is sampled at regular intervals τ , then the corresponding Fourier spectrum becomes periodic with a period of length ωs = 2π/τ . Sampling in signal space thus leads to periodicity in frequency space and vice versa. Figure 18.9 illustrates this relationship and the transition from a continuous nonperiodic signal to a discrete periodic function, which can be represented as a finite vector of numbers and thus easily processed on a computer. Thus, in general, the Fourier spectrum of a continuous, nonperiodic signal g(x) is also continuous and nonperiodic (Fig. 18.9 (a, b)). However, if the signal g(x) is periodic, then the corresponding spectrum is discrete (Fig. 18.9 (c, d)). Conversely, a discrete—but not necessarily periodic—signal leads to a periodic spectrum (Fig. 18.9 (e, f)). Finally, if a signal is discrete and periodic with M samples per period, then its spectrum is also discrete and periodic with M values (Fig. 18.9 (g, h)). Note that the particular signals and spectra in Fig. 18.9 were chosen for illustration only and do not really correspond with each other.



18.3 The Discrete Fourier Transform (DFT) In the case of a discrete periodic signal, only a finite sequence of M sample values is required to completely represent either the signal 11



This may be surprising at first because it allows a signal with high frequency—but low bandwidth—to be sampled (and correctly reconstructed) at a relatively low sampling frequency, even well below the maximum signal frequency. This is possible because one can also use a filter with suitably low bandwidth for reconstructing the original signal. For example, it may be sufficient to strike (i.e., “sample”) a church bell (a low-bandwidth oscillatory system with small internal damping) to uniquely generate a sound wave of relatively high frequency.



555



Signal g(x)



18 Introduction to Spectral Methods



Spectrum G(ω)



g(x)



G(ω)



Fig. 18.9 Transition from continuous to discrete periodic functions. (Note that signals and spectra are not real but for illustration only).



x



(a) Continuous nonperiodic signal



ω



(b) Continuous nonperiodic spectrum



g(x)



G(ω)



x



t0 (c) Continuous periodic signal with period t0



ω



ω0 (d) Discrete nonperiodic spectrum with values spaced at ω0 = 2π/t0



G(ω)



g(x)



ω



x



ωs (f) Continuous periodic spectrum with period ωs = 2π/ts



ts (e) Discrete nonperiodic signal with samples spaced at ts



g(x)



G(ω)



ts



ωs



ω



x



t0 (g) Discrete periodic signal with samples spaced at ts and period t0 = M ·ts



ω0 (h) Discrete periodic spectrum with values spaced at ω0 = 2π/t0 and period ωs = 2π/ts = M ·ω0



g(u) itself or its Fourier spectrum G(m).12 This representation as finite vectors makes it straightforward to store and process signals and spectra on a computer. What we still need is a version of the Fourier transform applicable to discrete signals.
12 Notation: We use g(x), G(ω) for a continuous signal or spectrum, respectively, and g(u), G(m) for the discrete versions.
18.3.1 Definition of the DFT
The discrete Fourier transform is, just like its continuous counterpart, identical in both directions. For a discrete signal g(u) of length M (u = 0 … M−1), the forward transform (DFT) is defined as
G(m) = (1/√M) · Σ_{u=0}^{M−1} g(u)·[ cos(2π·mu/M) − i·sin(2π·mu/M) ]   (18.44)
     = (1/√M) · Σ_{u=0}^{M−1} g(u)·e^(−i·2π·mu/M),   (18.45)
for 0 ≤ m < M, and the inverse transform (DFT⁻¹) is13
g(u) = (1/√M) · Σ_{m=0}^{M−1} G(m)·[ cos(2π·mu/M) + i·sin(2π·mu/M) ]   (18.46)
     = (1/√M) · Σ_{m=0}^{M−1} G(m)·e^(i·2π·mu/M),   (18.47)
for 0 ≤ u < M. Note that both the signal g(u) and the discrete spectrum G(m) are complex-valued vectors of length M, that is,
g(u) = gRe(u) + i·gIm(u),   G(m) = GRe(m) + i·GIm(m),   (18.48)
for u, m = 0, …, M−1. A numerical example for a DFT with M = 10 is shown in Fig. 18.10. Converting Eqn. (18.44) from Euler's exponential notation (Eqn. (18.10)) we obtain the discrete Fourier spectrum in component notation as
G(m) = (1/√M) · Σ_{u=0}^{M−1} [ gRe(u) + i·gIm(u) ] · [ cos(2π·mu/M) − i·sin(2π·mu/M) ],   (18.49)
where C^M_m(u) = cos(2π·mu/M) and S^M_m(u) = sin(2π·mu/M) denote the discrete (cosine and sine) basis functions, respectively, as described in the next section. Applying the usual complex multiplication,14 we obtain the real and imaginary parts of the discrete Fourier spectrum as
GRe(m) = (1/√M) · Σ_{u=0}^{M−1} gRe(u)·C^M_m(u) + gIm(u)·S^M_m(u),   (18.50)
GIm(m) = (1/√M) · Σ_{u=0}^{M−1} gIm(u)·C^M_m(u) − gRe(u)·S^M_m(u),   (18.51)
13 Compare these definitions with the corresponding expressions for the continuous forward and inverse Fourier transforms in Eqns. (18.19) and (18.20), respectively.
14 See also Sec. A.5 of the Appendix.
Fig. 18.10 Complex-valued result of the DFT for a short (real-valued) signal of length M = 10 (example). Both the original signal g(u) and its spectrum G(m) are complex-valued vectors of length M; ∗ stands for values smaller than 10⁻¹⁵.
u   g(u): Re      Im        DFT ⟶ / ⟵ DFT⁻¹        m   G(m): Re       Im
0     1.0000   0.0000                               0     14.2302    0.0000
1     3.0000   0.0000                               1     −5.6745   −2.9198
2     5.0000   0.0000                               2    ∗ 0.0000  ∗ 0.0000
3     7.0000   0.0000                               3     −0.0176   −0.6893
4     9.0000   0.0000                               4    ∗ 0.0000  ∗ 0.0000
5     8.0000   0.0000                               5      0.3162    0.0000
6     6.0000   0.0000                               6    ∗ 0.0000  ∗ 0.0000
7     4.0000   0.0000                               7     −0.0176    0.6893
8     2.0000   0.0000                               8    ∗ 0.0000  ∗ 0.0000
9     0.0000   0.0000                               9     −5.6745    2.9198
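The spectrum shown in Fig. 18.10 can be reproduced with a direct implementation of Eqns. (18.44)–(18.45) (equivalently, Eqns. (18.50)–(18.51)). The following is a minimal O(M²) Java sketch, not the optimized library or FFT code used in practice:

// Direct DFT of a complex-valued signal of length M, following Eqn. (18.45).
// gRe, gIm hold the real/imaginary parts of g(u); the result {GRe, GIm} is G(m).
static double[][] dft(double[] gRe, double[] gIm) {
  int M = gRe.length;
  double[] GRe = new double[M], GIm = new double[M];
  double s = 1.0 / Math.sqrt(M);                 // common scale factor 1/sqrt(M)
  for (int m = 0; m < M; m++) {
    for (int u = 0; u < M; u++) {
      double phi = 2 * Math.PI * m * u / M;
      double cos = Math.cos(phi), sin = Math.sin(phi);
      GRe[m] += gRe[u] * cos + gIm[u] * sin;     // Eqn. (18.50)
      GIm[m] += gIm[u] * cos - gRe[u] * sin;     // Eqn. (18.51)
    }
    GRe[m] *= s;
    GIm[m] *= s;
  }
  return new double[][] {GRe, GIm};
}

Applied to the real-valued signal g = (1, 3, 5, 7, 9, 8, 6, 4, 2, 0) (with gIm = 0), this reproduces the values listed in Fig. 18.10, e.g., G(0) = 45/√10 ≈ 14.2302.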



for m = 0, …, M−1. Analogously, the inverse DFT in Eqn. (18.46) expands to
gRe(u) = (1/√M) · Σ_{m=0}^{M−1} GRe(m)·C^M_u(m) − GIm(m)·S^M_u(m),   (18.52)
gIm(u) = (1/√M) · Σ_{m=0}^{M−1} GIm(m)·C^M_u(m) + GRe(m)·S^M_u(m),   (18.53)
for u = 0, . . . , M −1. 18.3.2 Discrete Basis Functions The inverse DFT (Eqn. (18.46)) performs the decomposition of the discrete function g(u) into a finite sum of M discrete cosine and sine M functions (C M m , S m ) whose weights (or “amplitudes”) are determined by the DFT coefficients in G(m). Each of these 1D basis functions (first used in Eqn. (18.49)),  M mu CM (18.54) m (u) = C u (m) = cos 2π M ,  M M mu S m (u) = S u (m) = sin 2π M , (18.55)



is periodic with M and has a discrete frequency (wave number) m, which corresponds to the angular frequency ωm = 2 π ·



558



m . M



(18.56)



For example, Figs. 18.11 and 18.12 show the discrete basis functions (with integer ordinate values u ∈ Z) for the DFT of length M = 8 as well as their continuous counterparts (with ordinate values x ∈ R). For wave number m = 0, the cosine function C^M_0(u) (Eqn. (18.54)) has the constant value 1. The corresponding DFT coefficient G_Re(0), the real part of G(0), thus specifies the constant part of the signal, i.e., the average value of the signal g(u) in Eqn. (18.52). In contrast, the zero-frequency sine function S^M_0(u) is zero for any value of u and thus cannot contribute anything to the signal. The corresponding DFT coefficients G_Im(0) in Eqn. (18.52) and G_Re(0) in Eqn. (18.53) are therefore of no relevance. For a real-valued signal (i.e., g_Im(u) = 0 for all u), the coefficient G_Im(0) in the corresponding Fourier spectrum must also be zero.



[Fig. 18.11 Discrete basis functions C^M_m(u) and S^M_m(u) for the signal length M = 8 and wave numbers m = 0, ..., 3, i.e., C^8_m(u) = cos(2πmu/8) and S^8_m(u) = sin(2πmu/8), plotted over u = 0, ..., 8. Each plot shows both the discrete function (round dots) and the corresponding continuous function.]



As seen in Fig. 18.11, the wave number m = 1 relates to a cosine or sine function that performs exactly one full cycle over the signal length M = 8. Similarly, the wave numbers m = 2, ..., 7 correspond to 2, ..., 7 complete cycles over the signal length M (see Figs. 18.11 and 18.12).

[Fig. 18.12 Discrete basis functions (continued). Signal length M = 8 and wave numbers m = 4, ..., 7. Each plot shows the discrete function (round dots), the continuous function, and the aliased function. Notice that, for example, the discrete functions for m = 5 and m = 3 (Fig. 18.11) are identical because m = 4 is the maximum wave number that can be represented in a discrete spectrum of length M = 8.]

18.3.3 Aliasing Again!

A closer look at Figs. 18.11 and 18.12 reveals an interesting fact: the sampled (discrete) cosine and sine functions for m = 3 and m = 5 are identical, although their continuous counterparts are different! The same is true for the frequency pairs m = 2, 6 and m = 1, 7. What we



see here is another manifestation of the sampling theorem, which we had originally encountered (Sec. 18.2.1) in frequency space, now in signal space. Obviously, m = 4 is the maximum frequency component that can be represented by a discrete signal of length M = 8. Any discrete function with a higher frequency (m = 5, ..., 7 in this case) has an identical counterpart with a lower wave number and thus cannot be reconstructed from the sampled signal (see also Fig. 18.13)!
If a continuous signal is sampled at a regular distance τ, the corresponding Fourier spectrum is repeated at multiples of ωs = 2π/τ, as we have shown earlier (Fig. 18.8). In the discrete case, the spectrum is periodic with length M.

[Fig. 18.13 Aliasing in signal space. For the signal length M = 8, the discrete cosine and sine basis functions for the wave numbers m = 1, 9, 17, ... (round dots) are all identical. The sampling frequency itself corresponds to wave number m = 8. Each plot shows the discrete function, the continuous function, and the aliased function.]

Since the Fourier spectrum of a



real-valued signal is symmetric about the origin (Eqn. (18.21)), there is for every coefficient with wave number m an equal-sized duplicate with wave number −m. Thus the spectral components appear pairwise and mirrored at multiples of M; that is,

$$|G(m)| = |G(M-m)| = |G(M+m)| = |G(2M-m)| = |G(2M+m)| = \cdots = |G(kM-m)| = |G(kM+m)|, \qquad (18.57)$$

for all k ∈ Z. If the original continuous signal contains "energy" at the frequencies ω_m > ω_{M/2} (i.e., signal components with wave numbers m > M/2), then, according to the sampling theorem, the overlapping parts of the spectra are superimposed in the resulting periodic spectrum of the discrete signal.



18.3.4 Units in Signal and Frequency Space

The relation between the units in signal and frequency space and the interpretation of wave numbers m is a common cause of confusion. While the discrete signal and its spectrum are simple numerical vectors and units of measurement are irrelevant for computing the DFT itself, it is nevertheless important to understand how the coordinates in the spectrum relate to physical dimensions in the real world.
Clearly, every complex-valued spectral coefficient G(m) corresponds to one pair of cosine and sine functions with a particular frequency in signal space. Assume a continuous signal is sampled at M consecutive positions spaced at τ (an interval in time or distance in space). The wave number m = 1 then corresponds to the fundamental period of the discrete signal (which is now assumed to be periodic) with a period of length Mτ; that is, to the frequency

$$f_1 = \frac{1}{M\tau}. \qquad (18.58)$$

In general, the wave number m of a discrete spectrum relates to the physical frequency as

$$f_m = m \cdot \frac{1}{M\tau} = m \cdot f_1, \qquad (18.59)$$

for 0 ≤ m < M, which is equivalent to the angular frequency

$$\omega_m = 2\pi f_m = m \cdot \frac{2\pi}{M\tau} = m \cdot \omega_1. \qquad (18.60)$$

Obviously then, the sampling frequency fs = 1/τ = M·f1 corresponds to the wave number ms = M. As expected, the maximum nonaliased wave number in the spectrum is

$$m_{max} = \frac{M}{2} = \frac{m_s}{2}, \qquad (18.61)$$

that is, half the sampling frequency index ms.

Example 1: Time-Domain Signal
We assume for this example that g(u) is a signal in the time domain (e.g., a discrete sound signal) that contains M = 500 sample values taken at regular intervals τ = 1 ms = 10⁻³ s. Thus the sampling frequency is fs = 1/τ = 1000 Hertz (cycles per second) and the total duration (fundamental period) of the signal is Mτ = 0.5 s. The signal is implicitly periodic, and from Eqn. (18.58) we obtain its fundamental frequency as f1 = 1/(500·10⁻³) = 1/0.5 = 2 Hertz. The wave number m = 2 in this case corresponds to a real frequency f2 = 2f1 = 4 Hertz, f3 = 6 Hertz, etc. The maximum frequency that can be represented by this discrete signal without aliasing is fmax = (M/2)·f1 = 1/(2τ) = 500 Hertz, i.e., exactly half the sampling frequency fs.
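To make the unit conversions of Eqns. (18.58)–(18.60) concrete, here is a small Java sketch (our own code, not from the book; class and variable names are hypothetical) that computes the physical and angular frequencies for the settings of Example 1.

// Sketch (our own): wave number m -> physical frequency f_m and angular
// frequency omega_m for M samples taken at interval tau (Eqns. 18.58-18.60).
public class FrequencyUnits {
    public static void main(String[] args) {
        int M = 500;         // number of samples (Example 1)
        double tau = 1e-3;   // sampling interval in seconds (1 ms)
        double f1 = 1.0 / (M * tau);        // fundamental frequency: 2 Hz
        for (int m = 0; m <= 3; m++) {
            double fm = m * f1;             // physical frequency (Eqn. 18.59)
            double wm = 2 * Math.PI * fm;   // angular frequency (Eqn. 18.60)
            System.out.println("m=" + m + "  f=" + fm + " Hz  omega=" + wm + " rad/s");
        }
        double fmax = 1.0 / (2 * tau);      // max. non-aliased frequency: 500 Hz
        System.out.println("fmax = " + fmax + " Hz");
    }
}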



Example 2: Space-Domain Signal
Assume we have a 1D print pattern with a resolution (i.e., spatial sampling frequency) of 120 dots per cm, which equals approximately 300 dots per inch (dpi), and a total signal length of M = 1800 samples. This corresponds to a spatial sampling interval of τ = 1/120 cm ≈ 83 µm and a physical signal length of (1800/120) cm = 15 cm. The fundamental frequency of this signal (again implicitly assumed to be periodic) is f1 = 1/15, expressed in cycles per cm. The sampling frequency is fs = 120 cycles per cm and thus the maximum signal frequency is fmax = fs/2 = 60 cycles per cm. The maximum signal frequency specifies the finest structure (1/60 cm) that can be reproduced by this print raster.

18.3.5 Power Spectrum

The magnitude of the complex-valued Fourier spectrum,

$$|G(m)| = \sqrt{G_{\mathrm{Re}}^2(m) + G_{\mathrm{Im}}^2(m)}, \qquad (18.62)$$

is commonly called the "power spectrum" of a signal. It specifies the energy that individual frequency components in the spectrum contribute to the signal. The power spectrum is real-valued and positive and thus often used for graphically displaying the results of Fourier transforms (see also Sec. 19.2).
Since all phase information is lost in the power spectrum, the original signal cannot be reconstructed from the power spectrum alone. However, because of the missing phase information, the power spectrum is insensitive to shifts of the original signal and can thus be efficiently used for comparing signals. To be more precise, the power spectrum of a circularly shifted signal is identical to the power spectrum of the original signal. Thus, given a discrete periodic signal g1(u) of length M and a second signal g2(u) shifted by some offset d, such that

$$g_2(u) = g_1(u - d), \qquad (18.63)$$

the corresponding power spectra are the same, that is,

$$|G_2(m)| = |G_1(m)|, \qquad (18.64)$$

although in general the complex-valued spectra G1(m) and G2(m) are different. Furthermore, from the symmetry property of the Fourier spectrum, it follows that

$$|G(m)| = |G(-m)|, \qquad (18.65)$$

for real-valued signals g(u) ∈ R.
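To illustrate Eqn. (18.62), the following small Java sketch (our own code, not part of the imagingbook library) computes the power spectrum from separate arrays holding the real and imaginary parts of a discrete spectrum.

// Sketch (our own): power spectrum |G(m)| from the real and imaginary
// parts of a discrete Fourier spectrum (Eqn. 18.62).
double[] powerSpectrum(double[] GRe, double[] GIm) {
    int M = GRe.length;
    double[] P = new double[M];
    for (int m = 0; m < M; m++) {
        P[m] = Math.sqrt(GRe[m] * GRe[m] + GIm[m] * GIm[m]);
    }
    return P;
}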



18.4 Implementing the DFT

18.4.1 Direct Implementation

Based on the definitions in Eqns. (18.50) and (18.51) the DFT can be directly implemented, as shown in Prog. 18.1. The main method DFT() transforms a signal vector of arbitrary length M (not necessarily a power of 2). It requires roughly M² operations (multiplications and additions); that is, the time complexity of this DFT algorithm is O(M²).

Prog. 18.1 Direct implementation of the DFT based on the definition in Eqns. (18.50) and (18.51). The method DFT() returns a complex-valued vector with the same length as the complex-valued input (signal) vector g. This method implements both the forward and the inverse transforms, controlled by the Boolean parameter forward. The class Complex (bottom) defines the structure of the complex-valued vector elements.

Complex[] DFT(Complex[] g, boolean forward) {
  int M = g.length;
  double s = 1 / Math.sqrt(M);   // common scale factor
  Complex[] G = new Complex[M];
  for (int m = 0; m < M; m++) {
    double sumRe = 0;
    double sumIm = 0;
    double phim = 2 * Math.PI * m / M;
    for (int u = 0; u < M; u++) {
      double gRe = g[u].re;
      double gIm = g[u].im;
      double cosw = Math.cos(phim * u);
      double sinw = Math.sin(phim * u);
      if (!forward) {   // inverse transform
        sinw = -sinw;
      }
      // complex multiplication: [gRe + i·gIm]·[cos(ω) + i·sin(ω)]
      sumRe += gRe * cosw + gIm * sinw;
      sumIm += gIm * cosw - gRe * sinw;
    }
    G[m] = new Complex(s * sumRe, s * sumIm);
  }
  return G;
}

class Complex {
  double re, im;
  Complex(double re, double im) {   // constructor method
    this.re = re;
    this.im = im;
  }
}
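A minimal usage sketch (our own, assuming the Complex class and the DFT() method of Prog. 18.1 are accessible in the surrounding class): transform a short real-valued signal and transform it back.

// Usage sketch (our own): forward and inverse DFT of a short real-valued
// signal, using the Complex class and DFT() method of Prog. 18.1.
double[] signal = {1, 3, 5, 7, 9, 8, 6, 4, 2, 0};   // the example of Fig. 18.10
Complex[] g = new Complex[signal.length];
for (int u = 0; u < g.length; u++) {
    g[u] = new Complex(signal[u], 0);
}
Complex[] G  = DFT(g, true);    // forward transform
Complex[] g2 = DFT(G, false);   // inverse transform, recovers g (up to rounding)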



One way to improve the efficiency of the DFT algorithm is to use lookup tables for the sin and cos functions (which are relatively "expensive" to compute), since only function values for a set of M different angles φ_m are ever needed. The angles φ_m = 2π·m/M corresponding to m = 0, ..., M−1 are evenly distributed over the full 360° circle. Any integer multiple φ_m·u (for u ∈ Z) can only fall onto one of these angles again because

$$\varphi_m \cdot u = 2\pi \frac{m u}{M} \equiv 2\pi \frac{\overbrace{(m u \bmod M)}^{k}}{M} = \varphi_k, \quad \text{with } 0 \le k < M. \qquad (18.66)$$
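The following sketch (our own code; the class name DftTables is hypothetical) shows how such tables could be set up. By Eqn. (18.66), the table index k = (m·u) mod M selects the required cosine or sine value.

// Sketch (hypothetical class): precomputed cosine/sine tables for a DFT of
// length M. By Eqn. (18.66), cos(phi_m * u) = cosTab[(m*u) mod M], and
// likewise for sin.
class DftTables {
    final double[] cosTab, sinTab;

    DftTables(int M) {
        cosTab = new double[M];
        sinTab = new double[M];
        for (int k = 0; k < M; k++) {
            double phi = 2 * Math.PI * k / M;   // phi_k
            cosTab[k] = Math.cos(phi);
            sinTab[k] = Math.sin(phi);
        }
    }

    double cos(int m, int u) { return cosTab[(m * u) % cosTab.length]; }
    double sin(int m, int u) { return sinTab[(m * u) % sinTab.length]; }
}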



21 Geometric Operations

... 4 corresponding point pairs in 2D. Several approaches to "homography estimation" exist, including linear and (iterative) nonlinear methods. The simplest and most common is the direct linear transform (DLT) method [67, 124], which requires solving a system of 2n homogeneous linear equations, typically done by singular-value decomposition (SVD).⁶

⁶ See also Sec. B.8.2 of the Appendix.

Java Implementation
Projective transformations are implemented by class ProjectiveMapping2D of the imagingbook library.⁷

⁷ Package imagingbook.pub.geometry.mappings.linear.

Several constructors are provided, as well as a static method

    ProjectiveMapping2D fromPoints(Pnt2d[] P, Pnt2d[] Q),



which creates a projective mapping from arrays P, Q, each consisting of four or more 2D points. The resulting mapping is exact if four point pairs are specified, otherwise a minimum least-squares fit is calculated.⁸ See Prog. 21.2 (p. 631) for a concrete example.

21.1.5 Bilinear Mapping

Similar to the projective transformation (Eqn. (21.33)), the bilinear mapping function

$$T_x:\; x' = a_0 \cdot x + a_1 \cdot y + a_2 \cdot x \cdot y + a_3, \qquad T_y:\; y' = b_0 \cdot x + b_1 \cdot y + b_2 \cdot x \cdot y + b_3, \qquad (21.55)$$

is specified with four pairs of corresponding points and has eight parameters (a_0, ..., a_3, b_0, ..., b_3). Note that this transformation is nonlinear because of the mixed term x·y, i.e., it cannot be described by a linear transformation, even with homogeneous coordinates. In contrast to the projective transformation, straight lines are not preserved in general but map onto quadratic curves. Similarly, circles are not mapped to ellipses by a bilinear transform.
A bilinear mapping is uniquely specified by four corresponding pairs of 2D points (x_0, x'_0), ..., (x_3, x'_3). In the general case (i.e., a bilinear mapping between arbitrary quadrilaterals), the coefficients a_0, ..., a_3, b_0, ..., b_3 (Eqn. (21.55)) are found as the solution of two separate systems of equations, each with four unknowns:

$$\begin{bmatrix} x_0 & y_0 & x_0 y_0 & 1 \\ x_1 & y_1 & x_1 y_1 & 1 \\ x_2 & y_2 & x_2 y_2 & 1 \\ x_3 & y_3 & x_3 y_3 & 1 \end{bmatrix} \cdot \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} x'_0 \\ x'_1 \\ x'_2 \\ x'_3 \end{bmatrix} \quad \text{or} \quad \mathbf{M} \cdot a = x, \qquad (21.56)$$

$$\begin{bmatrix} x_0 & y_0 & x_0 y_0 & 1 \\ x_1 & y_1 & x_1 y_1 & 1 \\ x_2 & y_2 & x_2 y_2 & 1 \\ x_3 & y_3 & x_3 y_3 & 1 \end{bmatrix} \cdot \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} y'_0 \\ y'_1 \\ y'_2 \\ y'_3 \end{bmatrix} \quad \text{or} \quad \mathbf{M} \cdot b = y. \qquad (21.57)$$

These equations can again be solved using standard numerical methods. In the special case of bilinearly mapping the unit square S1 to an arbitrary quadrilateral Q = (x'_0, ..., x'_3), the parameters a_0, ..., a_3 and b_0, ..., b_3 are found directly as

$$a_0 = x'_1 - x'_0, \qquad b_0 = y'_1 - y'_0, \qquad (21.58)$$
$$a_1 = x'_3 - x'_0, \qquad b_1 = y'_3 - y'_0, \qquad (21.59)$$
$$a_2 = x'_0 - x'_1 + x'_2 - x'_3, \qquad b_2 = y'_0 - y'_1 + y'_2 - y'_3, \qquad (21.60)$$
$$a_3 = x'_0, \qquad b_3 = y'_0. \qquad (21.61)$$
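As a small illustration of Eqns. (21.58)–(21.61), the following Java sketch (our own code, not an imagingbook class; all names are ours) computes the bilinear coefficients for mapping the unit square to a given quadrilateral and applies Eqn. (21.55) to a point.

// Sketch (our own): bilinear mapping of the unit square to the quadrilateral
// with target corners (x'0,y'0),...,(x'3,y'3), using Eqns. (21.58)-(21.61).
class UnitSquareBilinear {
    final double a0, a1, a2, a3, b0, b1, b2, b3;

    UnitSquareBilinear(double[] xp, double[] yp) {   // xp, yp = target corner coordinates
        a0 = xp[1] - xp[0];
        a1 = xp[3] - xp[0];
        a2 = xp[0] - xp[1] + xp[2] - xp[3];
        a3 = xp[0];
        b0 = yp[1] - yp[0];
        b1 = yp[3] - yp[0];
        b2 = yp[0] - yp[1] + yp[2] - yp[3];
        b3 = yp[0];
    }

    // Eqn. (21.55), for a point (x, y) inside the unit square
    double[] applyTo(double x, double y) {
        return new double[] {
            a0 * x + a1 * y + a2 * x * y + a3,
            b0 * x + b1 * y + b2 * x * y + b3 };
    }
}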



Figure 21.6 shows results of the affine, projective, and bilinear transformations applied to a simple test pattern. The affine transformation (Fig. 21.6(b)) is specified by mapping to the triangle 1-2-3, while the four points of the quadrilateral 1-2-3-4 define the projective and the bilinear transforms (Fig. 21.6(c, d)).

⁸ Class Pnt2d is the common data structure for 2D points (with integer or floating-point coordinates) in the imagingbook library (not to be confused with java.awt.geom.Point2D).

[Fig. 21.6 Geometric transformations compared: original image (a), affine transformation of the triangle 1-2-3 (b), projective transformation (c), and bilinear transformation (d).]



21.1.6 Log-Polar Mapping

Log-polar mapping refers to the technique of transforming an image (or image patch) from Cartesian to polar coordinates, consisting of an angular component ω and a radial component ϱ, which is the logarithm of the Euclidean distance to a given reference point in image space. The basic form of the coordinate transformation is

$$T(x, y \mid x_c, y_c) = (\varrho, \omega), \qquad (21.62)$$

where (x, y) = x are continuous image coordinates, (x_c, y_c) = x_c is an arbitrary (but fixed) reference point, and

$$\varrho = \log_b(\lVert x - x_c \rVert + 1), \qquad (21.63)$$
$$\omega = \angle(x - x_c). \qquad (21.64)$$



The base of the logarithm (b > 1) controls the scale of the radial component ϱ. Note that changing the base of a logarithm only scales the result by a constant,⁹ thus any logarithm can be used with some additional scale factor. Typically the effective logarithm is quite "mild" (i.e., with b close to 1).
The log-polar transformation has been of considerable interest in the image sciences as a hypothetical model for the geometry of biological vision systems, particularly in primates [10, 237]. It is also of technical interest in the context of image registration, because this transformation (at least in principle) converts rotation in the image domain to a (circular) translation along the ω axis, and uniform image scaling to a translation along the ϱ axis (see p. 619 for details).

⁹ Since log_b(x) = log(x)/log(b) for any logarithm log().



[Fig. 21.7 Discrete log-polar mapping. In (a), Cartesian image coordinates (x, y) are measured relative to some reference point (xc, yc). The forward transformation T maps each point (x, y) of the image I to a continuous log-polar coordinate (ϱ, ω) in L. Colored patches in (a, b) show the relation between image areas and the associated cells in the log-polar map. Each vertical column in L corresponds to a concentric ring in I, and each horizontal row corresponds to an angular wedge. Panel (a): image I with maximum radius rmax and angle θ; panel (b): log-polar map L of size P × Q over the axes ϱ and ω.]



Figure 21.7 illustrates the key characteristics of the log-polar transformation. A circular image region with radius rmax and center point (xc, yc) is mapped to a rectangle of size P along the radial axis (ϱ) and Q along the angular axis (ω). In the following, we give two different definitions of the log-polar transformation and the associated inverse transformations, which facilitate the rendering of discrete images by target-to-source mapping (see Sec. 21.2.2). They differ only in the calculation of the radial component; the angular transformation is the same.

Log-Polar Mapping (Version 1)
This transformation is basically the same as defined in Eqn. (21.63), but the radial scale (i.e., the base of the logarithm) is derived from the specified maximum image radius rmax and the target ranges P, Q.

Forward Transformation
The forward transformation T: (x, y) ↦ (ϱ, ω) is

$$T(x, y \mid x_c, y_c, r_{max}):\quad \begin{cases} \varrho = P \cdot \dfrac{\log(r+1)}{\log(r_{max}+1)}, \\[4pt] \omega = Q \cdot \theta/(2\pi), \end{cases} \qquad (21.65)$$

with

$$r = \sqrt{(x-x_c)^2 + (y-y_c)^2}, \qquad (21.66)$$
$$\theta = \operatorname{ArcTan}(x-x_c,\, y-y_c) \bmod (2\pi). \qquad (21.67)$$



The radius r is simply the Euclidean distance of the image coordinate (x, y) from the reference point (xc , yc ), thus (r, θ) is the vector (x−xc , y−yc ) expressed in polar coordinates. The actual log-polar representation is obtained by applying a logarithm to the radius r and then scaling by P in Eqn. (21.65), such that the resulting ϱ is in [0, P ]. Thus the specified maximum image radius rmax maps to the radial position ϱ = P in the log-polar map (see Fig. 21.8). Analogously, the angle θ is scaled in Eqn. (21.67) to yield ω in the range [0, Q]. Note that any logarithm (i.e., with arbitrary base) can be used in Eqn. (21.65), including the natural logarithm (with base e), which is assumed here. 1 is added to the argument of log() to avoid negative



results. The 'mod' operation in Eqn. (21.67) ensures that the angle θ is strictly positive, i.e., in the range [0, 2π].¹⁰

¹⁰ The result of ArcTan(x, y) is in [−π, π] (see Sec. J.1.5). 'mod' denotes the modulus operator (see Sec. J.1.2).

Inverse Transformation
The inverse mapping from log-polar to image coordinates, that is, T⁻¹: (ϱ, ω) ↦ (x, y), is

$$T^{-1}(\varrho, \omega \mid x_c, y_c, r_{max}):\quad \begin{cases} x = x_c + r' \cdot \cos(\omega'), \\ y = y_c + r' \cdot \sin(\omega'), \end{cases} \qquad (21.68)$$

with

$$r' = e^{\varrho \cdot \log(r_{max}+1)/P} - 1, \qquad (21.69)$$
$$\omega' = \omega \cdot 2\pi / Q. \qquad (21.70)$$



Log-Polar Mapping (Version 2)
One problem with the formulation in Version 1 is that the logarithm function is very steep near the origin, i.e., the concentric rings near the reference point are very densely packed. Consequently, the center pixels are typically covered by multiple rings and get hugely enlarged (i.e., oversampled) in the log-polar map without adding any information (see Fig. 21.9(b)). This can be avoided by limiting the range of the radial mapping to [rmin, rmax], with the additional parameter rmin specifying the radius of the inner circle (which remains unmapped).

Forward Transformation
The modified forward transformation T: (x, y) ↦ (ϱ, ω) is

$$T(x, y \mid x_c, y_c, r_{max}, r_{min}):\quad \begin{cases} \varrho = P \cdot \dfrac{\log(r/r_{min})}{\log(r_{max}/r_{min})}, \\[4pt] \omega = Q \cdot \theta/(2\pi), \end{cases} \qquad (21.71)$$



with r ∈ [rmin, rmax] and θ as defined in Eqns. (21.66) and (21.67), respectively. Figure 21.8 (blue curve) shows the mapping r ↦ ϱ in Eqn. (21.71), for rmin = 6, rmax = 200, and P = 60.
In principle, arbitrary values for rmin and rmax can be chosen, provided that rmin < rmax. However, depending on the number of discrete subdivisions P, Q, the resulting sampling grid (in the image domain) may become quite non-homogeneous. The constraint

$$r_{min} = r_{max} \cdot e^{-2\pi \cdot (P-1)/Q} \qquad (21.72)$$



assures that distances between neighboring sampling points are similar in orthogonal directions [293], i.e., the enclosed patches are approximately square (see Fig. 21.9(c)).

Inverse Transformation
The inverse transformation T⁻¹: (ϱ, ω) ↦ (x, y) is

$$T^{-1}(\varrho, \omega \mid x_c, y_c, r_{max}, r_{min}):\quad \begin{cases} x = x_c + r' \cdot \cos(\omega'), \\ y = y_c + r' \cdot \sin(\omega'), \end{cases} \qquad (21.73)$$

with ω′ = ω·2π/Q (see Eqn. (21.70)) and

$$r' = e^{\varrho \cdot \log(r_{max}/r_{min})/P} \cdot r_{min}. \qquad (21.74)$$

Again the natural logarithm (base e) is assumed in all definitions, but any other logarithm may be used.

[Fig. 21.8 Radial mapping r ↦ ϱ for different log-polar transformations. Version 1 (red curve, Eqn. (21.65)) is defined over the range [0, rmax], Version 2 (blue curve, Eqn. (21.71)) over [rmin, rmax]. Both functions are set up to map rmax = 200 to P = 60.]

[Fig. 21.9 Comparison of log-polar mapping versions with identical parameters (P = 60, Q = 100, rmax = 160) and reference point. The original image I and the resulting sampling grid are shown on the left, the discrete log-polar map L is on the right. In Version 1 (a, b), the radial transformation starts right at the reference point (r = 0), thus pixels close to the center are strongly enlarged (i.e., over-sampled). In Version 2 (c, d), mapping starts at r = rmin (≈ 3.40, see Eqn. (21.72)). Note that the sampling points in (c) are uniformly spaced in orthogonal directions. To reduce aliasing effects the log-polar maps were rendered to quadruple size and subsequently reduced. Image source: NASA (Apollo 13).]
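The following Java sketch (our own code; the class name LogPolarV2 is hypothetical and distinct from the library classes LogPolarMapping1/2 described in Sec. 21.3) implements the Version 2 forward and inverse mappings of Eqns. (21.71), (21.73), and (21.74) directly.

// Sketch (our own): log-polar mapping, Version 2 (Eqns. 21.71, 21.73, 21.74).
class LogPolarV2 {
    final double xc, yc, rmin, rmax;
    final int P, Q;

    LogPolarV2(double xc, double yc, double rmin, double rmax, int P, int Q) {
        this.xc = xc; this.yc = yc;
        this.rmin = rmin; this.rmax = rmax;
        this.P = P; this.Q = Q;
    }

    // forward: image coordinates (x, y) -> log-polar (rho, omega)
    double[] forward(double x, double y) {
        double dx = x - xc, dy = y - yc;
        double r = Math.sqrt(dx * dx + dy * dy);
        double theta = Math.atan2(dy, dx);
        if (theta < 0) theta += 2 * Math.PI;               // theta in [0, 2*pi)
        double rho = P * Math.log(r / rmin) / Math.log(rmax / rmin);
        double omega = Q * theta / (2 * Math.PI);
        return new double[] {rho, omega};
    }

    // inverse: log-polar (rho, omega) -> image coordinates (x, y)
    double[] inverse(double rho, double omega) {
        double r = Math.exp(rho * Math.log(rmax / rmin) / P) * rmin;
        double w = omega * 2 * Math.PI / Q;
        return new double[] {xc + r * Math.cos(w), yc + r * Math.sin(w)};
    }
}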



Creating a Discrete Log-Polar Map
We assume that the discrete log-polar map L is of size P × Q, with P denoting the number of radial steps (p = 0, ..., P−1) and Q the number of angular steps (q = 0, ..., Q−1). From continuous log-polar coordinates (ϱ, ω) the discrete cell (i.e., pixel) coordinates (p, q) are simply obtained by truncation, i.e.,

$$(p, q) = (\lfloor \varrho \rfloor, \lfloor \omega \rfloor). \qquad (21.75)$$

Thus each column p of the log-polar map corresponds to one of P concentric rings in the image I, and each row q corresponds to one of Q angular wedges. To fill the map L with pixel values from the input image I, we apply the usual "target-to-source mapping" scheme for geometric transformations (see Sec. 21.2.2). This means that we simply scan over all positions (p, q) of the map L, apply the inverse transformation T⁻¹(p, q) to obtain the associated image position (x, y), and finally insert the interpolated pixel value into L. The complete process (using the Version 2 definitions) is detailed in Alg. 21.1. Note that some quantities are independent of p, q and can be pre-calculated as constants, e.g., the factors cr = P / log(rmax/rmin) and ca = Q/(2π), which occur in both the forward and inverse mapping (see Alg. 21.1, lines 3–4 and 11–12).
Figure 21.9 shows an example rendered with log-polar transformation Version 1 and Version 2 under identical parameter settings. In Version 1 (Fig. 21.9(a, b)), as mentioned before, image pixels close to the reference point are strongly enlarged (i.e., oversampled) and thus cover a large portion of the resulting log-polar map. Spacing between sample points is quite different in the radial and angular directions, which could be corrected (e.g., by increasing the number of radial subdivisions P). In contrast, the sample points in Version 2 (Fig. 21.9(c, d)) are spaced uniformly in orthogonal directions and, since pixels inside rmin are excluded, oversampling is not an issue. On the other hand, distances between sampling points generally become quite large with increasing radius, which results in undersampling and aliasing, unless appropriate countermeasures are taken (see Sec. 22.6). The log-polar maps in Fig. 21.9 were initially rendered with quadruple resolution (i.e., to size 4P × 4Q) and then reduced to their final size for reduced aliasing.

Effects of Rotation and Scaling
A fundamentally relevant property of the log-polar transformation is its response to image rotation and scaling. When the input image is rotated by some angle about the reference point, the corresponding log-polar map is shifted cyclically along the angular (vertical) axis by a distance proportional to the rotation angle, as illustrated in Fig. 21.10(a–c). When the image undergoes uniform scaling by some factor s, all distances r are multiplied by the same factor, but the logarithmic distance ϱ = log(r) is only affected by an additive constant log(s), since

$$\varrho' = \log(s \cdot r) = \log(s) + \log(r). \qquad (21.76)$$



Alg. 21.1 Calculation of a log-polar map (Version 2) for a source image I and a given reference point (xc, yc).

 1: CreateLogPolarMap(I, xc, yc, P, Q, rmax, rmin)
    Input: I, the input image; (xc, yc) ∈ R², the reference point; P, the number of radial steps; Q, the number of angular steps; rmax, rmin, the max./min. radius in the image domain. Returns a log-polar map (L) of size P × Q.
 2:   L ← new map P × Q ↦ R
 3:   cr ← P / log(rmax/rmin)                         ▷ constant
 4:   ca ← Q/(2π)                                     ▷ constant
 5:   for p ← 0, ..., P−1 do                          ▷ radial map index p
 6:     for q ← 0, ..., Q−1 do                        ▷ angular map index q
 7:       (x, y) ← MapInverse(p, q, xc, yc, rmin)
 8:       L(p, q) ← GetInterpolatedValue(I, x, y)     ▷ interpolate I
 9:   return L

10: MapInverse(ϱ, ω, xc, yc, rmin)                    ▷ = T⁻¹(ϱ, ω)
11:   r′ ← exp(ϱ/cr) · rmin                           ▷ see Eqn. (21.74)
12:   θ′ ← ω/ca                                       ▷ see Eqn. (21.70)
13:   x ← xc + r′ · cos(θ′)                           ▷ see Eqn. (21.73)
14:   y ← yc + r′ · sin(θ′)
15:   return (x, y)



Thus uniform scaling of the image simply shifts the corresponding log-polar map along the radial (horizontal) axis, as shown in Fig. 21.10(d–f). Image enlargement shifts the log-polar map to the right, reduction causes a shift to the left. These relations, which of course hold only if the reference point remains at the same image position, are useful for rotation- and scale-invariant pattern matching. They are also used for motion estimation and tracking in video sequences [262].

[Fig. 21.10 Log-polar effects of image rotation and uniform scaling. Rotating the image (a–c) corresponds to a translation of the log-polar map along the angular (vertical) axis. Uniform scaling in the image domain (d–f) effects a shift along the radial (horizontal) axis. Note that this only holds if the reference point remains unchanged.]

Java Implementation
Both described versions of the log-polar transformation are available in the imagingbook library as classes LogPolarMapping1 and LogPolarMapping2,¹¹ which implement the generic Mapping2D interface (see Sec. 21.3 for additional details).

¹¹ Package imagingbook.pub.geometry.mappings.nonlinear

21.1.7 Other Nonlinear Transformations

The bilinear transformation discussed in the previous section is only one example of a nonlinear mapping in 2D that cannot be expressed as a simple matrix-vector multiplication in homogeneous coordinates. Many other types of nonlinear deformations exist, for example, to implement various artistic effects for creative imaging. This type of image deformation is often called "image warping". Depending on the type of transformation used, the derivation of the inverse transformation function, which is required for the practical computation of the mapping using target-to-source mapping (see Sec. 21.2.2), is not always easy or may even be impossible. In the following three examples, we therefore look directly at the inverse maps

[Fig. 21.11 Various nonlinear image deformations: twirl (a, d), ripple (b, e), and sphere (c, f) transformations. The size of the original images is 400 × 400 pixels.]

$$x = T^{-1}(x') \qquad (21.77)$$



without really bothering about the corresponding forward transformations.

"Twirl" Transformation
The twirl mapping causes the image to be rotated around a given anchor point xc = (xc, yc) with a space-variant rotation angle, which has a fixed value α at the center xc and decreases linearly with the radial distance from the center. The image remains unchanged outside the limiting radius rmax. The associated (inverse) mapping from target points u = (u, v) to source points x = (x, y) is defined as

$$T^{-1}(u, v \mid x_c, y_c, \alpha):\quad \begin{cases} x = \begin{cases} x_c + r \cdot \cos(\beta) & \text{for } r \le r_{max}, \\ u & \text{otherwise}, \end{cases} \\[6pt] y = \begin{cases} y_c + r \cdot \sin(\beta) & \text{for } r \le r_{max}, \\ v & \text{otherwise}, \end{cases} \end{cases} \qquad (21.78)$$

with

$$r = \sqrt{d_x^2 + d_y^2}, \qquad \beta = \operatorname{ArcTan}(d_x, d_y) + \alpha \cdot \frac{r_{max} - r}{r_{max}}, \qquad d_x = u - x_c, \qquad d_y = v - y_c.$$



Figure 21.11(a, d) shows a twirl mapping with the anchor point xc placed at the image center. The limiting radius rmax is half the length of the image diagonal, and the rotation angle is α = 43° at the center. A Java implementation of this transformation is shown in Prog. 21.3 (see page 632).

"Ripple" Transformation
The ripple transformation causes a local wavelike displacement of the image along both the x and y directions. The parameters of this mapping function are the period lengths τx, τy ≠ 0 (in pixels) and the corresponding amplitude values ax, ay for the displacement in both directions:

$$T^{-1}(u, v \mid \tau_x, \tau_y, a_x, a_y):\quad \begin{cases} x = u + a_x \cdot \sin\!\Big(\dfrac{2\pi \cdot u}{\tau_x}\Big), \\[6pt] y = v + a_y \cdot \sin\!\Big(\dfrac{2\pi \cdot v}{\tau_y}\Big). \end{cases} \qquad (21.79)$$
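Following the pattern of the twirl plugin in Prog. 21.3 (later in this chapter), the ripple mapping of Eqn. (21.79) could be wrapped as an anonymous Mapping2D and applied with ImageMapper. The following is our own sketch; variable names and parameter values are ours (they match the example given next), not library code.

// Sketch (our own): ripple mapping of Eqn. (21.79) as an anonymous Mapping2D,
// analogous to the twirl example in Prog. 21.3.
final double tx = 120, ty = 250;   // period lengths tau_x, tau_y (pixels)
final double ax = 10, ay = 15;     // amplitudes a_x, a_y

Mapping2D rippleInv = new Mapping2D() {   // inverse (target-to-source) mapping
    @Override
    public Pnt2d applyTo(Pnt2d uv) {
        double u = uv.getX(), v = uv.getY();
        double x = u + ax * Math.sin(2 * Math.PI * u / tx);
        double y = v + ay * Math.sin(2 * Math.PI * v / ty);
        return Pnt2d.from(x, y);
    }
};
// new ImageMapper(rippleInv).map(ip);   // apply to an ImageProcessor ip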



An example of a ripple transformation with τx = 120, τy = 250, ax = 10, and ay = 15 is shown in Fig. 21.11(b, e).

Spherical Transformation
The spherical deformation imitates the effect of viewing the image through a transparent hemisphere or lens placed on top of the image. The parameters of this transformation are the position xc = (xc, yc) of the lens center, the radius of the lens rmax, and its refraction index ρ. The corresponding (inverse) mapping can be defined as

$$T^{-1}(u, v \mid x_c, y_c, r_{max}, \rho):\quad \begin{cases} x = u - \begin{cases} z \cdot \tan(\beta_x) & \text{for } r \le r_{max}, \\ 0 & \text{otherwise}, \end{cases} \\[6pt] y = v - \begin{cases} z \cdot \tan(\beta_y) & \text{for } r \le r_{max}, \\ 0 & \text{otherwise}, \end{cases} \end{cases} \qquad (21.80)$$

with

$$r = \sqrt{d_x^2 + d_y^2}, \quad z = \sqrt{r_{max}^2 - r^2}, \quad \beta_x = \Big(1 - \tfrac{1}{\rho}\Big) \cdot \sin^{-1}\!\Big(\tfrac{d_x}{\sqrt{d_x^2 + z^2}}\Big), \quad \beta_y = \Big(1 - \tfrac{1}{\rho}\Big) \cdot \sin^{-1}\!\Big(\tfrac{d_y}{\sqrt{d_y^2 + z^2}}\Big), \quad d_x = u - x_c, \quad d_y = v - y_c.$$



Figure 21.11(c, f) shows a spherical transformation with (xc, yc) positioned at the image center. The lens radius rmax is set to half the image width, and the refraction index ρ is 1.8. See Exercise 21.9 for additional examples of nonlinear geometric transformations.

21.1.8 Piecewise Image Transformations

All the geometric transformations discussed so far are global (i.e., the same mapping function is applied to all pixels in the given image). It is often necessary to deform an image such that a larger number n of original image points x0, ..., xn are precisely mapped onto a given set of target points x′0, ..., x′n. For n = 3, this problem can be solved with an affine mapping (see Sec. 21.1.3), and for n = 4 we could use a projective or bilinear mapping (see Secs. 21.1.4 and 21.1.5). A precise global mapping of n > 4 points requires a more complicated function T (e.g., a 2D nth-order polynomial or a spline function).
A good alternative is to use local or piecewise transformations, where the image is partitioned into disjoint patches that are transformed separately, applying an individual mapping function to each patch. In practice, it is common to partition the image into a mesh of triangles or quadrilaterals, as illustrated in Fig. 21.12.

[Fig. 21.12 Mesh partitionings for piecewise local deformation. Almost arbitrary image deformations can be implemented by partitioning the image plane into non-overlapping triangles Di, D′i (a) or quadrilaterals Qi, Q′i (b) and applying simple local transformations. Every patch in the resulting mesh is transformed separately with the required transformation parameters derived from the corresponding three or four corner points, respectively.]

For a triangular mesh partitioning (Fig. 21.12(a)), the transformation between each pair of triangles Di → D′i could be accomplished with an affine mapping, whose parameters must be computed individually for every patch. Similarly, the projective transformation would be suitable for mapping each patch in a mesh partitioning



composed of quadrilaterals Qi (Fig. 21.12 (b)). Since both types of transformations preserve the straightness of lines, we can be certain that no holes or overlaps will arise and the deformation will appear continuous between adjacent mesh patches (see Fig. 21.13 for an example). Local transformations of this type are frequently used; for example, to register aerial and satellite images or to undistort images for panoramic stitching. In computer graphics, similar techniques are used to map texture images onto polygonal 3D surfaces in the rendered 2D image. Another popular application of this technique is “morphing” [289], which performs a stepwise geometric transformation from one image to another while simultaneously blending their intensity (or color) values.12



¹² Image morphing has also been implemented in ImageJ as a plugin (iMorph) by Hajime Hirase (https://imagej.nih.gov/ij/plugins/morph.html).

21.2 Resampling the Image

In the discussion of geometric transformations, we have so far considered the 2D image coordinates as being continuous (i.e., real-valued).



[Fig. 21.13 Piecewise deformation example using a triangular mesh and local affine transformations. Original image and undistorted mesh (a), warped image (b), and enlarged details (c, d).]



In reality, the picture elements in digital images reside at discrete (i.e., integer-valued) coordinates, and thus transferring a discrete image into another discrete image without introducing significant losses in quality is a nontrivial subproblem in the implementation of geometric transformations.
Given a (discrete) original image I: Z×Z ↦ R and some (continuous) geometric transformation T: R² ↦ R², the aim is to create a discrete transformed image I′: Z×Z ↦ R.¹³ This can be accomplished in one of two ways, which differ by the mapping direction and are commonly referred to as source-to-target or target-to-source mapping, respectively.

¹³ Remark on notation: We use u = (u, v)ᵀ to denote discrete (integer) coordinates and x = (x, y)ᵀ for continuous (i.e., real-valued) coordinates.

21.2.1 Source-to-Target Mapping

In this approach, which appears quite natural at first sight, we calculate for every (discrete) pixel position u of the original (source) image I its transformed position

$$x = T(u) \qquad (21.81)$$

in the target image I′. Since the resulting position x is continuous, it is very unlikely that it will coincide with any raster point of I′, as illustrated in Fig. 21.14. Thus we have to decide which pixel of the target image I′ the original intensity or color value from I(u) should be stored in. One might perhaps even think of somehow distributing this value onto all neighboring pixels.

[Fig. 21.14 Source-to-target mapping. For each discrete pixel position u = (u, v) in the source image I, the corresponding (continuous) target position x = (x, y) is found by applying the geometric transformation T(u). In general, a target position x does not coincide with any discrete raster point. The source pixel value I(u) is subsequently transferred to one (or more) of the adjacent target pixels.]



It turns out that this is not such a good idea. The problem with the source-to-target method is that, depending on the geometric transformation T, some elements in the target image I′ may never be "hit" at all (i.e., never receive a source pixel value)! This happens, for example, when the image is enlarged (even slightly) by the geometric transformation. The resulting holes in the target image would be difficult to close in a subsequent processing step. Conversely, one would have to consider (e.g., when the image is shrunk) that a single element in the target image I′ may be hit by multiple source pixels and thus image content may get lost. In the light of all these complications, source-to-target mapping is not really our method of choice.

21.2.2 Target-to-Source Mapping

This method avoids most difficulties encountered in the source-to-target mapping by simply reversing the image generation process. The trick is to do everything in reverse. For every discrete pixel position u in the target image, we calculate the corresponding (continuous) source position

$$x = T^{-1}(u), \qquad (21.82)$$

using the inverse geometric transformation T⁻¹. Of course, the resulting coordinate x again does not fall onto a raster point in general (Fig. 21.15), but this time it is a lot easier to extract a suitable target value by interpolating over the source pixels in the neighborhood of x. The problem of interpolating among pixel values is discussed in detail in Chapter 22. While this is not a trivial task either, this method avoids most of the complications that come with source-to-target mapping. The major advantage of the target-to-source method is that all pixels in the target image I′ (and only these) are computed and filled exactly once, such that no holes or multiple hits can occur. This requires the inverse geometric transformation T⁻¹ to be available, which is no disadvantage in most cases since the forward transformation T itself is never really needed. Due to its simplicity, which is also demonstrated in Alg. 21.2, target-to-source mapping is the most common method for geometrically transforming 2D images.



[Fig. 21.15 Target-to-source mapping. For each discrete pixel position u = (u, v) in the target image I′, the corresponding (continuous) source position x = (x, y) is calculated by applying the inverse transformation T⁻¹(u). The pixel value assigned to I′(u) is obtained by interpolating the source pixels within some neighborhood of x.]

Alg. 21.2 Geometric image transformation by target-to-source mapping. Given are the original (source) image I, the target image I′, and the 2D coordinate transformation T. The function GetInterpolatedValue(I, x) returns the interpolated value of the source image I at the continuous position x ∈ R².

1: TransformImage(I, I′, T)
   Input: I, source image; I′, target image; T, geometric transformation R² ↦ R². The target image is modified.
2:   (M, N) ← size(I′)
3:   for all u ∈ M × N do              ▷ loop over all pixels of I′
4:     x ← T⁻¹(u)
5:     I′(u) ← GetInterpolatedValue(I, x)
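Before turning to the library's ImageMapper class, here is a bare-bones Java sketch (our own) of Alg. 21.2 for a single-channel image, using ImageJ's built-in interpolation via ImageProcessor.getInterpolatedValue(). The method mapInverse(), standing for T⁻¹, is an assumption of this sketch and must be supplied.

// Sketch (our own) of Alg. 21.2: target-to-source mapping for a single-channel
// image. mapInverse(u, v) is assumed to return the continuous source position
// T^-1(u, v) as a double[] {x, y}.
void transformImage(ImageProcessor source, ImageProcessor target) {
    for (int u = 0; u < target.getWidth(); u++) {
        for (int v = 0; v < target.getHeight(); v++) {
            double[] xy = mapInverse(u, v);                       // continuous source position
            double val = source.getInterpolatedValue(xy[0], xy[1]);
            target.setf(u, v, (float) val);
        }
    }
}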



In plain ImageJ, only a few geometric operations are provided as methods for the ImageProcessor class, such as rotation and flipping.¹⁴ Most transformations described in this chapter are available as part of the imagingbook library.¹⁵

¹⁴ Additional operations, including affine transformations, are available as plugin classes as part of the optional TransformJ package [185].
¹⁵ Package imagingbook.pub.geometry.mappings

21.3.1 Geometric Transformations

The imagingbook library provides a generic framework for geometric transformations in the continuous 2D plane, which is independent of any image structures. The common interface shared by all geometric (linear and non-linear) 2D transformation classes is Mapping2D (see Fig. 21.16). Implementations of Mapping2D are required to define just one method,

    Pnt2d applyTo(Pnt2d pnt),

which applies the transformation to the given 2D point and returns the transformed point. Mapping2D itself defines the default method

    Pnt2d[] applyTo(Pnt2d[] pnts),

which maps a whole array of points, and also two optional methods which implementations may or may not override:¹⁶

¹⁶ An exception is thrown if such a method is called but not defined for the specific mapping.

    Mapping2D getInverse()

returns the inverse of this mapping (if available) and

    double[][] getJacobian(Pnt2d pnt)



should return the 2 × 2 Jacobian matrix¹⁷ for this mapping, evaluated at point pnt.

Linear Mappings
All linear transformations are sub-classes of LinearMapping2D,¹⁸ as the type hierarchy in Fig. 21.16 illustrates. Every instance of type LinearMapping2D (or any sub-type) can be represented by a 3 × 3 (homogeneous) transformation matrix A. Instances of LinearMapping2D may be created in various ways, for example, directly from a given transformation matrix:

double[][] A = {{-1.230769, 2.076923, -1.769231},
                {-2.461538, 2.615385, -3.538462},
                {-0.307692, 0.230769,  1.000000}};
LinearMapping2D lm = new LinearMapping2D(A);

An arbitrary affine transformation only requires a 2 × 3 matrix to construct, for example,

double[][] B = {{2, -4, -3}, {3, 7, 2}};
LinearMapping2D m = new AffineMapping2D(B);

The effective (homogeneous) transformation matrix of m is

$$\begin{pmatrix} 2 & -4 & -3 \\ 3 & 7 & 2 \\ 0 & 0 & 1 \end{pmatrix}.$$

¹⁷ See Sec. D.1.1 of the Appendix.
¹⁸ Package imagingbook.pub.geometry.mappings.linear.



It is usually convenient to use one of the linear transformation subtypes, such as Translation2D and Rotation2D, e.g.,

LinearMapping2D mt = new Translation2D(-5.7, 3.1);
LinearMapping2D mr = new Rotation2D(Math.PI/3);

which creates a translation by (dx, dy) = (−5.7, 3.1) and a rotation by angle α = π/3. The 3 × 3 transformation matrix of a LinearMapping2D instance can be obtained with the method

    double[][] getTransformationMatrix().

A linear mapping can be chained (concatenated) with another by using method

    LinearMapping2D concat(LinearMapping2D other),

which creates and returns a new linear mapping. For example, given two linear mappings ma, mb with associated transformation matrices A, B, i.e.,

double[][] A, B;   // affine transformation matrices
LinearMapping2D ma = new AffineMapping2D(A);
LinearMapping2D mb = new AffineMapping2D(B);
LinearMapping2D mc = ma.concat(mb);
double[][] C = mc.getTransformationMatrix(); // C = B·A

the transformation matrix of the composite mapping mc is C = B·A. Note that if two projective transformations are concatenated, the resulting transformation is again projective. Analogously, the result of combining two affine transformations is another affine transformation.
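A small usage sketch (our own) to verify the concatenation order of the preceding example: applying the composite mapping mc to a point gives the same result as applying ma first and mb second.

// Usage sketch (our own): mc = ma.concat(mb) applies ma first, then mb,
// i.e., mc.applyTo(p) equals mb.applyTo(ma.applyTo(p)).
Pnt2d p  = Pnt2d.from(3.0, 7.0);
Pnt2d q1 = mb.applyTo(ma.applyTo(p));   // ma first, then mb
Pnt2d q2 = mc.applyTo(p);               // composite mapping with matrix C = B·A
// q1 and q2 should be (numerically) identical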



[Fig. 21.16 Type hierarchy for 2D mappings in the imagingbook library. Mapping2D is the common interface for all mapping classes. Shown sub-types: LinearMapping2D with ProjectiveMapping2D, AffineMapping2D, Rotation2D, Scaling2D, Shear2D, and Translation2D, plus the nonlinear BilinearMapping2D and LogPolarMapping classes.]



Nonlinear Mappings
The imagingbook library defines only a few nonlinear mappings as separate classes, including BilinearMapping and two versions of the log-polar transformation (LogPolarMapping1, LogPolarMapping2). Other nonlinear mappings are implemented as ImageJ plugins (e.g., see the "twirl" mapping in Prog. 21.3).

21.3.2 Image Transformations

The imagingbook library encapsulates all necessary steps for transforming images, such as image allocation, iteration over target pixels, and source pixel interpolation, to make these operations transparent, generic, and simple. The key class is ImageMapper,¹⁹ which applies a geometric transformation to a given image by target-to-source mapping (see Sec. 21.2.2). To instantiate ImageMapper, the inverse transformation (mapping target to source coordinates) must be specified, for example,

LinearMapping2D mr = new Rotation2D(Math.PI/3);
ImageMapper mapper = new ImageMapper(mr.getInverse());

Optionally, an interpolation method may be specified when creating a new ImageMapper (the default is Bicubic). The map() method is then used to apply the mapping to any²⁰ ImageProcessor instance, e.g.,

ImageProcessor ip = ... ;   // any image type
mapper.map(ip);             // transforms ip "in place"

In this case, the image ip is transformed "in place", i.e., source and target image are the same. It is also possible to specify a separate target image, which may be of different type and size, e.g.,

ImageProcessor source = ... ;
ImageProcessor target = source.duplicate();
mapper.map(source, target); // only target is modified

Here only the target image is modified (to the transformed image) while the source image is read-only and remains unchanged. Additional details can be found in the examples below.

¹⁹ Package imagingbook.lib.image
²⁰ Behind the scenes, unified access to different image types is provided by class ImageAccessor (package imagingbook.lib.image.access).



Prog. 21.1 Image rotation example using class Rotation2D inside an ImageJ plugin. The rotation angle α (alpha) is 15°. Rotation is performed about the coordinate origin (upper-left corner).

 1 import ij.ImagePlus;
 2 import ij.plugin.filter.PlugInFilter;
 3 import ij.process.ImageProcessor;
 4 import imagingbook.lib.image.ImageMapper;
 5 import imagingbook.pub.geometry.mappings.Mapping2D;
 6 import imagingbook.pub.geometry.mappings.linear.Rotation2D;
 7
 8 public class Transform_Rotate implements PlugInFilter {
 9
10   static double alpha = Math.toRadians(15.0);
11
12   @Override
13   public int setup(String arg, ImagePlus im) {
14     return DOES_ALL;   // works for all image types
15   }
16
17   @Override
18   public void run(ImageProcessor ip) {
19     Mapping2D mi =   // inverse mapping (target to source)
20         new Rotation2D(alpha).getInverse();
21     ImageMapper mapper = new ImageMapper(mi);
22     mapper.map(ip);
23   }
24 }



21.3.3 Examples

The following ImageJ plugins show simple usage examples of the classes in Sec. 21.3.1 for implementing geometric operations and pixel interpolation (see Ch. 22 for details). Note that these plugins can be applied to any type of image.

Example 1: Image Rotation
The example in Prog. 21.1 shows a plugin (Transform_Rotate) to rotate an image by 15° about the origin. It relies on class ImageMapper for transparently handling all necessary image allocation and pixel interpolation tasks. In line 20, a Rotation2D mapping (mi) is created as an instance of class Rotation2D, which is inverted by getInverse() for target-to-source mapping. Then the ImageMapper object (mapper) is set up, which transforms the image (in-place) by the invocation of map() in line 22. If not specified, the default interpolation method (bicubic) is used.

Example 2: Projective Mapping
The ImageJ plugin in Prog. 21.2 illustrates the implementation of a projective transformation. The forward mapping T is defined by two corresponding point sequences (quadrilaterals) P and Q, respectively. In a real application, these points would probably be specified interactively or given as the result of a mesh partitioning. The inverse mapping mi (representing the transformation T⁻¹) is created by first calling the static method fromPoints(P,Q) of class ProjectiveMapping2D, which returns the forward mapping (from P to Q), followed by getInverse() in line 38.²¹

²¹ Alternatively the inverse mapping could be obtained directly (i.e., without getInverse()) by interchanging source and target points P, Q.

Prog. 21.2 Projective mapping using class ProjectiveMapping2D inside an ImageJ plugin. The projective transformation is obtained from four point pairs, specified by Pnt2d sequences P, Q. Note that the inverse mapping could also be obtained directly (i.e., without getInverse()) by interchanging points P, Q in line 38.

 1 import ij.ImagePlus;
 2 import ij.plugin.filter.PlugInFilter;
 3 import ij.process.ImageProcessor;
 4 import imagingbook.lib.image.ImageMapper;
 5 import imagingbook.pub.geometry.basic.Pnt2d;
 6 import imagingbook.pub.geometry.mappings.Mapping2D;
 7 import imagingbook.pub.geometry.mappings.linear.ProjectiveMapping2D;
 8
 9 public class Transform_Projective implements PlugInFilter {
10
11   static Pnt2d[] P = {   // point sequence P
12     Pnt2d.from(0, 0),
13     Pnt2d.from(400, 0),
14     Pnt2d.from(400, 400),
15     Pnt2d.from(0, 400)};
16
17   static Pnt2d[] Q = {   // point sequence Q
18     Pnt2d.from(0, 60),
19     Pnt2d.from(400, 20),
20     Pnt2d.from(300, 400),
21     Pnt2d.from(30, 200)};
22
23   @Override
24   public int setup(String arg, ImagePlus im) {
25     return DOES_ALL + NO_CHANGES;   // works for all image types
26   }
27
28   @Override
29   public void run(ImageProcessor source) {
30     int W = source.getWidth();
31     int H = source.getHeight();
32
33     // create the target image:
34     ImageProcessor target = source.createProcessor(W, H);
35
36     // create the target-to-source mapping:
37     Mapping2D mi =
38         ProjectiveMapping2D.fromPoints(P,Q).getInverse();
39
40     // create a mapper instance:
41     ImageMapper mapper =
42         new ImageMapper(mi, InterpolationMethod.Bicubic);
43
44     // apply the mapper:
45     mapper.map(source, target);
46
47     // display the target image:
48     new ImagePlus("Target", target).show();
49   }
50 }

The inverse mapping is



applied to the input image by an ImageMapper object in line 45, this time by specifying an explicit target image and leaving the input image unmodified (NO_CHANGES).

Example 3: "Twirl" Transformation
Program 21.3 shows the implementation of an arbitrary (non-linear) mapping, using the "twirl" transformation as an example. In this case, no pre-defined mapping class is used. Instead, an anonymous (i.e., unnamed) sub-class of Mapping2D is defined with the single method applyTo() in lines 26–40, which implements the definitions given in Eqn. (21.78). This anonymous class is instantiated in line 24 and applied in line 43. Note that in this case the inverse transformation is defined directly, i.e., no additional inversion step is needed.

Prog. 21.3 "Twirl" transformation (see Eqn. (21.78)), implemented as an ImageJ plugin. An anonymous sub-class of Mapping2D is defined (line 24) with the single method applyTo() specifying the inverse geometric mapping (lines 26–40). The mapping is applied to the input image in line 43. The limiting radius rmax (rmax) is set to half the length of the image diagonal, xc (xc, yc) is the image center, and the center rotation angle α (alpha) is 43°.

 1 import ij.ImagePlus;
 2 import ij.plugin.filter.PlugInFilter;
 3 import ij.process.ImageProcessor;
 4 import imagingbook.lib.image.ImageMapper;
 5 import imagingbook.pub.geometry.basic.Pnt2d;
 6 import imagingbook.pub.geometry.basic.Pnt2d.PntDouble;
 7 import imagingbook.pub.geometry.mappings.Mapping2D;
 8
 9 public class Transform_Twirl implements PlugInFilter {
10
11   static double alpha = Math.toRadians(43.0);
12
13   @Override
14   public int setup(String arg, ImagePlus imp) {
15     return DOES_ALL;
16   }
17
18   @Override
19   public void run(ImageProcessor ip) {
20     double xc = 0.5 * ip.getWidth();
21     double yc = 0.5 * ip.getHeight();
22     double rmax = Math.sqrt(xc * xc + yc * yc);
23
24     Mapping2D imap = new Mapping2D() {   // anonymous class
25       @Override
26       public Pnt2d applyTo(Pnt2d uv) {
27         double dx = uv.getX() - xc;
28         double dy = uv.getY() - yc;
29         double r = Math.sqrt(dx * dx + dy * dy);
30         if (r < rmax) {
31           double beta = Math.atan2(dy, dx)
32               + alpha * (rmax - r) / rmax;
33           double x = xc + r * Math.cos(beta);
34           double y = yc + r * Math.sin(beta);
35           return Pnt2d.from(x, y);
36         }
37         else {
38           return uv;   // return the original point
39         }
40       }
41     };
42
43     new ImageMapper(imap).map(ip);
44   }
45 }



21.4 Exercises

Exercise 21.1. Show that a straight line y = kx + d in 2D is mapped to another straight line under a projective transformation, as defined in Eqn. (21.33).



Exercise 21.2. Show that parallel lines remain parallel under affine transformation (Eqn. (21.19)).



Exercise 21.3. Write a program that rotates an image about its center (see Eqns. (21.26)–(21.27)). Program 21.1 may be used as a template. This requires three successive steps: (a) translate the image such that its center coincides with the coordinate origin, (b) perform the actual rotation by angle α, and finally (c) shift the image back to its original center. Thus the resulting affine transformation matrix is

$$A = T_2 \cdot R_\alpha \cdot T_1, \qquad (21.83)$$

composed of the rotation Rα and translations T1, T2. The AffineMapping2D²² class allows transformations to be easily concatenated. For example, assuming that ip is the input image with center xc, yc, and alpha is the rotation angle, the above transformation can be defined and applied by the following steps:

AffineMapping2D mt1 = new Translation2D(-xc, -yc);
AffineMapping2D mr  = new Rotation2D(alpha);
AffineMapping2D mt2 = new Translation2D(xc, yc);
AffineMapping2D m   = mt1.concat(mr).concat(mt2);

AffineMapping2D mi = m.getInverse();   // invert
new ImageMapper(mi).map(ip);           // apply

²² Package imagingbook.pub.geometry.mappings



Exercise 21.4. The task is to up-scale an image of size M × N by some integer factor s ≥ 1, such that the resulting image is of size sM × sN. In other words, the new image contains M × N blocks of pixels, each of size s × s. Assume that the integer coordinates (u, v) are the pixel centers in the original image. What geometric transformation is required to map each pixel center to the exact center of the corresponding s × s pixel block? (Note that the mapping is not a pure scaling transformation!) Implement this up-scaling operation and extend it to non-integer scale factors.

Exercise 21.5. Implement an interactive program (plugin) that performs projective rectification (see Sec. 21.1.4) of a selected quadrilateral, as shown in Fig. 21.17.

[Fig. 21.17 Projective rectification example (see Exercise 21.5). Source image and user-defined 4-point selection (green polygon) (a); rectified output image (b).]

Make your program perform the following steps:
1. Let the user mark the source quad in the source image I as a polygon-shaped region of interest (ROI) with at least four points x0, ..., x3. In ImageJ this is easily done with the built-in polygon selection tool (see Prog. 21.4 for how to extract ROI points).
2. Create an output image I′ of fixed size (i.e., proportional to A4 or Letter paper size).
3. The target rectangle is defined by the four corners x′0, ..., x′3 of the output image. The source and target points are associated 1:1, that is, the four corresponding point pairs are ⟨x0, x′0⟩, ..., ⟨x3, x′3⟩.
4. From the four point pairs, create an instance of ProjectiveMapping2D, as demonstrated in Prog. 21.2.
5. Test the obtained mapping by applying it to the specified source points x0, ..., x3. Make sure they project exactly to the specified target points x′0, ..., x′3.



6. Apply the resulting projective mapping to the source image using ImageMapper's method map(source, target). Show the resulting output image.

Exercise 21.6. Design a nonlinear geometric transformation similar to the ripple transformation (see Eqn. (21.79)) that uses a sawtooth function instead of a sinusoid for the distortions in the horizontal and vertical directions. Use Prog. 21.3 as a template for your implementation.

Exercise 21.7. Verify that the log-polar mapping (Version 1) in Eqn. (21.65) is indeed inverted by the transformation given in Eqn. (21.68). Do the same for Version 2, defined in Eqns. (21.71) and (21.73), respectively.






Exercise 21.8. Investigate the properties of the log-polar transformation w.r.t. image rotation and scaling (as described on p. 619 and illustrated in Fig. 21.10) to associate corner points in transformed images. This could be useful for image registration. Use a corner detector (see Ch. 6) to detect distinct points to serve as reference points for extracting log-polar maps. Apply (cyclic!) correlation along the ω-axis to determine image rotation and ordinary correlation (see Ch. 23) along the ϱ-axis to find the optimal scale factor. Create suitable gray-scale test images to validate your results.

Prog. 21.4 Extraction of vertex points from a user-selected polygon ROI (ImageJ plugin). Two different versions are shown, using either integer coordinates or floating-point coordinates. Notice that the ROI is obtained from the associated ImagePlus instance (a reference to which is stored in the setup() method) and not from the ImageProcessor ip.

  import java.awt.Polygon;
  import ij.IJ;
  import ij.ImagePlus;
  import ij.gui.PolygonRoi;
  import ij.gui.Roi;
  import ij.plugin.filter.PlugInFilter;
  import ij.process.FloatPolygon;
  import ij.process.ImageProcessor;
  import imagingbook.pub.geometry.basic.Pnt2d;

  public class Get_Roi_Points implements PlugInFilter {
    ImagePlus im;

    @Override
    public int setup(String args, ImagePlus im) {
      this.im = im;    // keep a reference to im
      return DOES_ALL + ROI_REQUIRED;
    }

    @Override
    public void run(ImageProcessor source) {
      Roi roi = im.getRoi();
      if (!(roi instanceof PolygonRoi)) {
        IJ.error("Polygon selection required!");
        return;
      }

      IJ.log("ROI integer coordinates:");
      Polygon pgnI = roi.getPolygon();
      Pnt2d[] ptsI = new Pnt2d[pgnI.npoints];
      for (int i = 0; i < pgnI.npoints; i++) {
        ptsI[i] = Pnt2d.from(pgnI.xpoints[i], pgnI.ypoints[i]);
        IJ.log(i + ": " + ptsI[i].toString());
      }

      IJ.log("ROI float coordinates:");
      FloatPolygon pgnF = roi.getFloatPolygon();
      Pnt2d[] ptsF = new Pnt2d[pgnF.npoints];
      for (int i = 0; i < pgnF.npoints; i++) {
        ptsF[i] = Pnt2d.from(pgnF.xpoints[i], pgnF.ypoints[i]);
        IJ.log(i + ": " + ptsF[i].toString());
      }

      // ... use ROI points in ptsI (int) or ptsF (float)
    }
  }

Exercise 21.9. Implement one or more of the following nonlinear geometric transformations (see Fig. 21.18):

A. Radial wave transformation: This transformation simulates an omni-directional wave which originates from a fixed center point xc (see Fig. 21.18 (b)).






Fig. 21.18 Examples of nonlinear geometric transformations defined in Exercise 21.9. The reference point xc is always taken at the image center. (a) Original image; (b) radial wave (a = 10.0, τ = 38); (c) clover (a = 0.2, N = 8); (d) spiral (a = 0.01); (e) angular wave (a = 0.1, τ = 38); (f) tapestry (a = 5.0, τx = τy = 30).



The inverse transformation (applied to a target image point u = (u, v)) is

    T⁻¹(u | xc, a, τ) = xc                              for r = 0,
                        xc + ((r+δ)/r) · (u − xc)       for r > 0,        (21.84)

with r = ∥u − xc∥ and δ = a · sin(2πr/τ). Parameter a specifies the amplitude (strength) of the distortion and τ is the period (width) of the radial wave (in pixel units).

B. Clover transformation: This transformation distorts the image in the form of an N-leafed clover shape (see Fig. 21.18 (c)). The associated inverse transformation is the same as in Eqn. (21.84) but uses

    δ = a · r · cos(N·α),   with α = ∠(u − xc),        (21.85)

instead. Again r = ∥u − xc∥ is the radius of the target image point u from the designated center point xc. Parameter a specifies the amplitude of the distortion and N is the number of radial “leaves”.



C. Spiral transformation: This transformation (see Fig. 21.18 (d)) is similar to the twirl transformation in Eqn. (21.78), defined by the inverse transformation

    T⁻¹(u | xc, a) = xc + r · (cos(β), sin(β))ᵀ,        (21.86)

with β = ∠(u − xc) + a·r and r = ∥u − xc∥ denoting the distance between the target point u and the center point xc. The angle β increases linearly with r; parameter a specifies the “velocity” of the spiral.

D. Angular wave transformation: This is another variant of the twirl transformation in Eqn. (21.78). Its inverse transformation is the same as for the spiral mapping in Eqn. (21.86), but in this case

    β = ∠(u − xc) + a · sin(2πr/τ).        (21.87)

Thus the angle β is modified by a sine function with amplitude a (see Fig. 21.18 (e)).



E. Tapestry transformation: In this case the inverse transformation of a target point u = (u, v) is

    T⁻¹(u | xc, τx, τy, a) = u + a · ( sin((2π/τx)·(u − xc)),  sin((2π/τy)·(v − yc)) )ᵀ,        (21.88)

again with the center point xc = (xc, yc). Parameter a specifies the distortion’s amplitude and τx, τy are “wavelengths” (measured in pixel units) along the x and y axis, respectively (see Fig. 21.18 (f)).
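To illustrate how such inverse mappings plug into the usual target-to-source transformation scheme, the following sketch evaluates the radial-wave mapping of Eqn. (21.84) for a single target point. The method name and its standalone form are chosen freely for this illustration (they are not part of the imagingbook API); the arithmetic follows Eqn. (21.84) directly.

  // Illustrative sketch: inverse radial-wave mapping, Eqn. (21.84).
  // Returns the continuous source position for a given target point (u, v).
  static double[] radialWaveInverse(double u, double v,
                                    double xc, double yc, double a, double tau) {
    double dx = u - xc, dy = v - yc;
    double r = Math.sqrt(dx * dx + dy * dy);
    if (r == 0) {
      return new double[] {xc, yc};            // the center maps to itself
    }
    double delta = a * Math.sin(2 * Math.PI * r / tau);
    double s = (r + delta) / r;                // radial scale factor
    return new double[] {xc + s * dx, yc + s * dy};
  }

The returned continuous position is then sampled with one of the interpolation methods described in Ch. 22.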






22 Pixel Interpolation



Interpolation is the process of estimating the intermediate values of a sampled function or signal at continuous positions or the attempt to reconstruct the original continuous function from a set of discrete samples. In the context of geometric operations this task arises from the fact that discrete pixel positions in one image are generally not mapped to discrete raster positions in the other image under some continuous geometric transformation T (or T −1 , respectively). The concrete goal is to obtain an optimal estimate for the value of the image function I at some continuous position x = (x, y). This is what the abstract function GetInterpolatedValue(I, x), introduced in the previous chapter (see Alg. 21.2), is supposed to do. Ideally the interpolated image should preserve as much detail (i.e., sharpness) as possible without causing any visible damage, such as ringing, or moiré patterns.



22.1 Interpolation in 1D: Simple Methods

To illustrate the problem, we first attend to the 1D case (see Fig. 22.1). Several simple, ad-hoc methods exist for interpolating the values of a given, discrete function g : Z → R at arbitrary continuous positions x ∈ R.

22.1.1 Nearest-Neighbor Interpolation

The simplest of all interpolation methods is to round the continuous coordinate x to the closest integer u = round(x) = ⌊x + 0.5⌋ and use the associated sample g(u) as the interpolated value, i.e., the interpolated function is simply calculated as

    g̃(x) = g(round(x)).        (22.1)

An example of this so-called nearest-neighbor interpolation is shown in Fig. 22.2 (a).






Fig. 22.1 Interpolating a discrete function in 1D. Given the discrete sample values g(u) (a), the goal is to estimate the original function f(x) at continuous positions x ∈ R (b).

Fig. 22.2 Simple interpolation methods. Nearest-neighbor interpolation (a) simply selects the discrete sample g(u) closest to the continuous coordinate x as the interpolating value g̃(x). With linear interpolation (b), the result is a piecewise linear function connecting adjacent samples g(u) and g(u+1).



22.1.2 Linear Interpolation

Another simple method is linear interpolation. Here the estimated value is the sum of the two closest samples g(u) and g(u+1), with u = ⌊x⌋. The weight given to each sample is proportional to its proximity to the continuous position x, that is,

    g̃(x) = g(u) + (x−u)·(g(u+1) − g(u))                    (22.2)
         = g(u)·[1 − (x−u)] + g(u+1)·(x−u).                (22.3)



As shown in Fig. 22.2 (b), the result is a piecewise linear function made up of straight line segments between consecutive sample values.
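Both schemes translate almost directly into code. The following sketch is illustrative only (the names are chosen freely and are not part of the imagingbook library); it interpolates a discrete 1D signal g at a continuous position x according to Eqn. (22.1) and Eqn. (22.3), clamping positions outside the sampled range.

  // Minimal 1D interpolation sketch (illustrative only):
  static double nearestNeighbor(double[] g, double x) {
    int u = (int) Math.floor(x + 0.5);                 // round(x), Eqn. (22.1)
    u = Math.min(Math.max(u, 0), g.length - 1);        // clamp to valid sample index
    return g[u];
  }

  static double linear(double[] g, double x) {
    x = Math.min(Math.max(x, 0), g.length - 1);        // clamp to the sampled range
    int u = Math.min((int) Math.floor(x), g.length - 2);
    double t = x - u;                                  // distance to sample u, 0 <= t <= 1
    return g[u] * (1 - t) + g[u + 1] * t;              // Eqn. (22.3)
  }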






22.1.3 Ideal Low-Pass Filter

Obviously, the results of these simple interpolation methods do not approximate the original continuous function well (see Fig. 22.1). But how can we obtain a better approximation from the discrete samples only when the original function is unknown? This may appear hopeless at first, because the discrete samples g(u) could possibly originate from any continuous function f(x) passing through the discrete samples. We find an intuitive answer to this question (once again) by looking at the functions in the spectral domain. If the original (continuous) function f(x) was discretized in accordance with the sampling theorem (see Sec. 18.2.1), then f(x) must have been “well-behaved”, i.e., band limited: it could not contain any signal components with frequencies higher than half the sampling frequency ωs. This means that the reconstructed signal too only needs to consist of a limited set of frequencies and thus its trajectory between the discrete sample values is not completely arbitrary but naturally constrained (i.e., “smooth”). In this context, absolute units of measure are of no concern since in a digital signal all frequencies relate to the sampling frequency. In particular, if we take τs = 1 as the (unitless) sampling interval, the resulting sampling frequency is



    ωs = 2π·fs = 2π·(1/τs) = 2π        (22.4)

Fig. 22.3 Composition of the sinc (“sine cardinal”) function. The sinc() function is the product of a sine function (dashed, green) and a hyperbolic decay function (dashed, magenta). The normalized form (also called filtering or interpolation function [40, Ch. 4]) has value 1 at the origin and zeros at all integer positions.



and thus the maximum signal frequency is ωmax = ωs/2 = π. Let G(ω) be the Fourier spectrum¹ of the original function g(x), which becomes periodic when the signal is sampled (see Fig. 22.4). To extract the original frequency range [−ωmax, ωmax] from the periodic Fourier spectrum (i.e., suppress all other frequencies outside that range), we multiply the spectrum G(ω) with a square windowing function Π(ω) of width ±ωmax = ±π, i.e.,

    G̃(ω) = G(ω)·Π(ω) = G(ω) · { 1  for −π ≤ ω ≤ π,
                                 0  otherwise.        (22.5)

This is called an ideal low-pass filter, which cuts off all signal components with frequencies greater than π and keeps all lower-frequency components unchanged. In the signal domain, the operation in Eqn. (22.5) corresponds to a linear convolution (see Eqn. (18.27)) with the inverse Fourier transform of the spectral box window Π(ω), which is the so-called (normalized) sinc function,

    sinc(x) = sin(πx) / (πx),        (22.6)

shown in Fig. 22.3. This correspondence between convolution in the signal domain and simple multiplication in the frequency domain, which we already discussed in Sec. 18.1.6 (see Table 18.1), is illustrated in Fig. 22.4. So, at least in theory, sinc(x) is the ideal interpolation function for reconstructing a frequency-limited continuous signal. The interpolated value for the discrete function g(u) at some continuous position x0 is found by shifting the sinc function to x0 (such that its origin coincides with x0), multiplying with all discrete sample values g(u) (i.e., for all u ∈ Z), and summing the results. That is, g(u) and sinc(x) are simply convolved and the interpolated value at position x0 is

    g̃(x0) = (sinc ∗ g)(x0) = ∑_{u=−∞}^{∞} sinc(u−x0) · g(u),        (22.7)

¹ See Sec. 18.1.4, Fig. 18.8.






Fig. 22.4 Interpolation of a discrete signal—relation between signal and frequency space. The discrete signal g(u) in signal space (left) corresponds to the periodic Fourier spectrum G(ω) in frequency space (right). The spectrum G̃(ω) of the continuous signal is isolated from G(ω) by point-wise multiplication (×) with the square function Π(ω), which constitutes an ideal low-pass filter (right). In signal space (left), this operation corresponds to a linear convolution (∗) with the function sinc(x).



where ∗ is the linear convolution operator (see Sec. 4.3.1). If the discrete signal g(u) is finite with length N (as is usually the case), it is assumed to be periodic (i.e., g(u) = g(u+kN) for all k ∈ Z).² In this case, Eqn. (22.7) modifies to

    g̃(x0) = ∑_{u=−∞}^{∞} sinc(u−x0) · g(u mod N).        (22.8)



Fig. 22.5 Interpolation by convolving with the sinc function. The sinc function is shifted by aligning its origin with the interpolation points x0 = 4.4 (a) and x0 = 5 (b). The values of the shifted sinc function (dashed curve) at the integer positions are the weights (coefficients) for the corresponding sample values g(u). When the function is interpolated at some integer position, such as x0 = 5 (b), only the sample value g(x0 ) = g(5) is considered and weighted with 1, while all other samples coincide with the zeros of the sinc function and thus do not contribute to the result.






It may be surprising that the ideal interpolation of a discrete function g(u) apparently involves not only a few neighboring sample points around x0 but, in general, infinitely many values of g(u), whose weights decrease continuously with their distance from the interpolation point x0 (at rate |1/(π(u−x0))|), but never become zero. Figure 22.5 shows two examples for interpolating a discrete function g at positions x0 = 4.4 and x0 = 5. If the function is interpolated at some integer position, such as x0 = 5, the sample g(u) at u = x0 receives the weight 1, while all other samples coincide with the zeros of the sinc function and are thus ignored. Consequently, the resulting interpolation values are identical to the sample values g(u) at all discrete positions x = u.



If a continuous signal is properly frequency limited (by half the sampling frequency ωs/2), it can be exactly reconstructed from the discrete signal by interpolation with the sinc function, as Fig. 22.6 (a) demonstrates. Problems occur, however, around local high-frequency signal events, such as rapid transitions or pulses, as shown in Fig. 22.6 (b, c). In those situations, the sinc interpolation causes strong overshooting or “ringing” artifacts, which are perceived as visually disturbing. For practical applications, the sinc function is therefore not suitable as an interpolation kernel—not only because of its infinite extent (and the resulting non-computability).

² This assumption is explained by the fact that a discrete Fourier spectrum implicitly corresponds to a periodic signal (see also Sec. 18.2.2).






Fig. 22.6 Sinc interpolation applied to various signal types. The reconstructed function in (a) is identical to the continuous, band-limited original. The results for the step function (b) and the pulse function (c) show the strong ringing caused by sinc (ideal low-pass) interpolation.



A good interpolation function implements a low-pass filter that, on the one hand, introduces minimal blurring by maintaining the maximum signal bandwidth but, on the other hand, also delivers a good reconstruction at rapid signal transitions. In this regard, the sinc function is an extreme choice—it implements an ideal low-pass filter and thus preserves a maximum bandwidth and signal continuity but gives inferior results at signal transitions. At the opposite extreme, nearest-neighbor interpolation (see Fig. 22.2) can perfectly handle steps and pulses but generally fails to produce a continuous signal reconstruction between sample points. The design of an interpolation function thus always involves a trade-off, and the quality of the results often depends on the particular application and subjective judgment. In the following, we discuss some common interpolation functions that come close to this goal and are therefore frequently used in practice.



22.2 Interpolation by Convolution

As we saw earlier in the context of sinc interpolation (Eqn. (22.6)), the reconstruction of a continuous signal can be described as a linear convolution operation. In general, we can express interpolation as a convolution of the given discrete function g(u) with some continuous interpolation kernel w(x) as

    g̃(x0) = (w ∗ g)(x0) = ∑_{u=−∞}^{∞} w(u−x0) · g(u).        (22.9)



Obviously, sinc interpolation (see Eqn. (22.7)) is only a special case, with w(x) = sinc(x). Similarly, nearest-neighbor interpolation (see Eqn. (22.1), Fig. 22.2 (a)) can be expressed as a linear convolution with the kernel

    wnn(x) = { 1  for −0.5 ≤ x < 0.5,
               0  otherwise,        (22.10)

and, likewise, linear interpolation (see Eqn. (22.3), Fig. 22.2 (b)) with the kernel






    wlin(x) = { 1 − |x|  for |x| < 1,
                0        for |x| ≥ 1.        (22.11)

Fig. 22.7 Convolution kernels for the nearest-neighbor interpolation (a) and the linear interpolation (b).

Fig. 22.8 Interpolation examples: nearest-neighbor interpolation (a–c), linear interpolation (d–f).



Both kernels wnn(x) and wlin(x) are shown in Fig. 22.7, and interpolation results on various discrete functions are plotted in Fig. 22.8.



22.3 Cubic Interpolation

The sinc function is not a useful interpolation kernel in practice, because of its infinite extent and the ringing artifacts caused by its slowly decaying oscillations. Therefore several interpolation methods employ a truncated version of the sinc function or an approximation of it, thereby making the convolution kernel more compact and reducing the ringing. A frequently used approximation of a truncated sinc function is the so-called cubic interpolation, whose convolution kernel is defined as the piecewise cubic polynomial

    wcub(x, a) = { (−a+2)·|x|³ + (a−3)·|x|² + 1          for 0 ≤ |x| < 1,
                   −a·|x|³ + 5a·|x|² − 8a·|x| + 4a       for 1 ≤ |x| < 2,
                   0                                     for |x| ≥ 2.        (22.12)






Parameter a can be used to adjust the steepness of the spline function and thus the perceived “sharpness” of the interpolation (see Fig. 22.9 (a)). For the standard value a = 1, Eqn. (22.12) reduces to

    wcub(x) = { |x|³ − 2·|x|² + 1             for 0 ≤ |x| < 1,
                −|x|³ + 5·|x|² − 8·|x| + 4    for 1 ≤ |x| < 2,
                0                             for |x| ≥ 2.        (22.13)



Fig. 22.9 Cubic interpolation kernel. Function wcub(x, a) with control parameter a set to 0.25, 1.0, and 1.75, respectively (a). Cubic function wcub(x) = wcub(x, 1) and sinc function compared (b).



The comparison of the sinc function and the cubic interpolation kernel wcub(x) = wcub(x, 1) in Fig. 22.9 (b) shows that many high-value coefficients outside x = ±2 are truncated and thus relatively large errors can be expected. However, because of the compactness of the cubic function, this type of interpolation can be calculated very efficiently. Since wcub(x) = 0 for |x| ≥ 2, only four discrete values g(u) need to be accounted for in the convolution operation (see Eqn. (22.9)) at any continuous interpolation point x0 ∈ R, namely g(u0−1), g(u0), g(u0+1), g(u0+2), with u0 = ⌊x0⌋. This reduces the 1D cubic interpolation to the expression

    g̃(x0) = ∑_{u=⌊x0⌋−1}^{⌊x0⌋+2} wcub(u−x0) · g(u).        (22.14)



Figure 22.10 shows the results of cubic interpolation with different settings of the control parameter a. Notice that the cubic reconstruction obtained with the popular standard setting (a = 1) exhibits substantial overshooting at edges as well as strong ripple effects in the continuous parts of the signal (Fig. 22.10 (b)). With a = 0.5, the expression in Eqn. (22.12) corresponds to a Catmull-Rom spline [53] (see also Sec. 22.4), which produces significantly better results than the standard setup (with a = 1), particularly in smooth signal regions (see Fig. 22.12 (a–c)).
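Since the kernel has finite support, Eqn. (22.14) touches only the four samples surrounding x0. The following sketch is illustrative only (it is not the imagingbook implementation; the book's version appears later as Alg. 22.1 and Prog. 22.1): it evaluates wcub(x, a) and applies Eqn. (22.14) to a 1D signal, clamping border indices.

  // 1D cubic interpolation sketch, Eqns. (22.12) and (22.14).
  static double wCub(double x, double a) {
    x = Math.abs(x);
    if (x < 1)
      return (-a + 2) * x * x * x + (a - 3) * x * x + 1;
    else if (x < 2)
      return -a * x * x * x + 5 * a * x * x - 8 * a * x + 4 * a;
    else
      return 0;
  }

  static double cubicInterpolate(double[] g, double x0, double a) {
    int u0 = (int) Math.floor(x0);
    double sum = 0;
    for (int u = u0 - 1; u <= u0 + 2; u++) {                // the four contributing samples
      int ui = Math.min(Math.max(u, 0), g.length - 1);      // clamp at the signal borders
      sum = sum + wCub(u - x0, a) * g[ui];
    }
    return sum;
  }

With a = 1 this reproduces the standard kernel of Eqn. (22.13); with a = 0.5 it behaves like the Catmull-Rom spline discussed below.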



22.4 Spline Interpolation

The cubic interpolation kernel (Eqn. (22.12)) described in the previous section is a piecewise cubic polynomial function, also known as a cubic spline in computer graphics. In its general form, this function takes not only one but two control parameters (a, b) [188],

    wcs(x, a, b) = (1/6) · { (−6a − 9b + 12)·|x|³ + (6a + 12b − 18)·|x|² − 2b + 6                for 0 ≤ |x| < 1,
                             (−6a − b)·|x|³ + (30a + 6b)·|x|² + (−48a − 12b)·|x| + 24a + 8b      for 1 ≤ |x| < 2,
                             0                                                                    for |x| ≥ 2.        (22.15)



Fig. 22.10 Cubic interpolation examples. Parameter a in Eqn. (22.12) controls the amount of signal overshoot or perceived sharpness: a = 0.25 (a), standard setting a = 1 (b), a = 1.75 (c). The standard setting (b) leads to noticeable ripple effects in g̃1(x).



Equation (22.15) describes a family of smooth, C¹-continuous functions (i.e., with continuous first derivatives) with no visible discontinuities or sharp corners. For b = 0, the function wcs(x, a, b) specifies a one-parameter family of so-called cardinal splines equivalent to the cubic interpolation function wcub(x, a) in Eqn. (22.12),

    wcs(x, a, 0) = wcub(x, a),        (22.16)

and for the standard setting a = 1 (Eqn. (22.13)) in particular

    wcs(x, 1, 0) = wcub(x, 1) = wcub(x).        (22.17)



Figure 22.11 shows three additional examples of this function type that are important in the context of interpolation: Catmull-Rom splines, cubic B-splines, and the Mitchell-Netravali function. All three functions are briefly described in the following sections. The actual calculation of the interpolated signal follows exactly the same scheme as used for the cubic interpolation described in Eqn. (22.14).

22.4.1 Catmull-Rom Interpolation

With the control parameters set to a = 0.5 and b = 0, the function in Eqn. (22.15) is a Catmull-Rom spline [53], as already mentioned in Sec. 22.3:

    wcrm(x) = wcs(x, 0.5, 0) = (1/2) · { 3·|x|³ − 5·|x|² + 2            for 0 ≤ |x| < 1,
                                         −|x|³ + 5·|x|² − 8·|x| + 4     for 1 ≤ |x| < 2,
                                         0                              for |x| ≥ 2.        (22.18)



Examples of signals interpolated with this kernel are shown in Fig. 22.12 (a–c). The results are similar to ones produced by cubic interpolation (with a = 1, see Fig. 22.10) with regard to sharpness, but



Fig. 22.11 Examples of cubic spline functions as defined in Eqn. (22.15): Catmull-Rom spline wcs(x, 0.5, 0) (dotted line), cubic B-spline wcs(x, 0, 1) (dashed line), and Mitchell-Netravali function wcs(x, 1/3, 1/3) (solid line).



the Catmull-Rom reconstruction is clearly superior in smooth signal regions (e.g., compare Fig. 22.10 (b) vs. Fig. 22.12 (a)).

22.4.2 Cubic B-spline Approximation

With parameters set to a = 0 and b = 1, Eqn. (22.15) corresponds to a cubic B-spline function of the form

    wcbs(x) = wcs(x, 0, 1) = (1/6) · { 3·|x|³ − 6·|x|² + 4               for 0 ≤ |x| < 1,
                                       −|x|³ + 6·|x|² − 12·|x| + 8       for 1 ≤ |x| < 2,
                                       0                                 for |x| ≥ 2.        (22.19)



This function is positive everywhere and, when used as an interpolation kernel, causes a pure smoothing effect similar to a Gaussian smoothing filter (see Fig. 22.12 (d–f)). The B-spline function in Eqn. (22.19) is C²-continuous, that is, its first and second derivatives are continuous. Notice that—in contrast to all previously described interpolation methods—the reconstructed function does not pass through all discrete sample points. Thus, to be precise, the reconstruction with cubic B-splines is not called an interpolation but an approximation of the signal.

22.4.3 Mitchell-Netravali Approximation

The design of an optimal interpolation kernel is always a trade-off between high bandwidth (sharpness) and good transient response (low ringing). Catmull-Rom interpolation, for example, emphasizes high sharpness, whereas cubic B-spline interpolation blurs but creates no ringing. Based on empirical tests, Mitchell and Netravali [188] proposed a cubic interpolation kernel as described in Eqn. (22.15) with parameter settings a = 1/3 and b = 1/3, and the resulting interpolation function

    wmn(x) = wcs(x, 1/3, 1/3) = (1/18) · { 21·|x|³ − 36·|x|² + 16              for 0 ≤ |x| < 1,
                                           −7·|x|³ + 36·|x|² − 60·|x| + 32     for 1 ≤ |x| < 2,
                                           0                                   for |x| ≥ 2.        (22.20)
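All of these spline kernels are simply Eqn. (22.15) with different (a, b) settings, so a single routine covers them. The sketch below is illustrative only (it is not the imagingbook SplineInterpolator); it evaluates wcs(x, a, b) and can be plugged into the 1D convolution scheme of Eqn. (22.14).

  // Cardinal/cubic spline kernel wcs(x, a, b) of Eqn. (22.15).
  // (a, b) = (0.5, 0)   -> Catmull-Rom, Eqn. (22.18)
  // (a, b) = (0, 1)     -> cubic B-spline, Eqn. (22.19)
  // (a, b) = (1/3, 1/3) -> Mitchell-Netravali, Eqn. (22.20)
  static double wCs(double x, double a, double b) {
    x = Math.abs(x);
    if (x < 1)
      return ((-6*a - 9*b + 12) * x*x*x + (6*a + 12*b - 18) * x*x - 2*b + 6) / 6;
    else if (x < 2)
      return ((-6*a - b) * x*x*x + (30*a + 6*b) * x*x + (-48*a - 12*b) * x + 24*a + 8*b) / 6;
    else
      return 0;
  }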






Fig. 22.12 Cardinal spline reconstruction examples: Catmull-Rom interpolation (a), cubic B-spline approximation (b), and Mitchell-Netravali approximation (c).



This function is the weighted sum of a Catmull-Rom spline in Eqn. (22.18) and a cubic B-spline in Eqn. (22.19).³ The examples in Fig. 22.12 (c) show that this method is a good compromise, creating little overshoot, high edge sharpness, and good signal continuity in smooth regions. Since the resulting function does not pass through the original sample points, the Mitchell-Netravali method is again an approximation and not an interpolation.

22.4.4 Lanczos Interpolation

The Lanczos⁴ interpolation belongs to the family of “windowed sinc” methods. In contrast to the methods described in the previous sections, these do not use a polynomial (or other) approximation of the sinc function but the sinc function itself combined with a suitable window function ψ(x); that is, an interpolation kernel of the form

    w(x) = ψ(x) · sinc(x).        (22.21)

The particular window functions for the Lanczos interpolation are defined as

    ψLn(x) = { 1                        for |x| = 0,
               sin(πx/n) / (πx/n)       for 0 < |x| < n,
               0                        for |x| ≥ n,        (22.22)

where n ∈ N denotes the order of the filter [202, 269]. Notice that the window function ψLn(x) is a truncated sinc function itself, thus the kernel in Eqn. (22.21) is the product of two sinc functions! For the Lanczos filters of order n = 2, 3, which are the most commonly used in image processing, the corresponding window functions are






³ See also Exercise 22.1.
⁴ Cornelius Lanczos (1893–1974).



    ψL2(x) = { 1                        for |x| = 0,
               sin(πx/2) / (πx/2)       for 0 < |x| < 2,
               0                        for |x| ≥ 2,        (22.23)

    ψL3(x) = { 1                        for |x| = 0,
               sin(πx/3) / (πx/3)       for 0 < |x| < 3,
               0                        for |x| ≥ 3.        (22.24)



Both window functions are shown in Fig. 22.13. The actual interpolation kernels wL2 and wL3 are obtained as the product of the sinc function (see Eqn. (22.6)) and the associated window function (see Eqn. (22.22)), i.e.,

    wL2(x) = { 1                                       for |x| = 0,
               2 · sin(πx/2)·sin(πx) / (π²x²)          for 0 < |x| < 2,
               0                                       for |x| ≥ 2,        (22.25)

and

    wL3(x) = { 1                                       for |x| = 0,
               3 · sin(πx/3)·sin(πx) / (π²x²)          for 0 < |x| < 3,
               0                                       for |x| ≥ 3,        (22.26)

respectively. In general, for Lanczos interpolation of order n, we get

    wLn(x) = { 1                                       for |x| = 0,
               n · sin(πx/n)·sin(πx) / (π²x²)          for 0 < |x| < n,
               0                                       for |x| ≥ n.        (22.27)

Figure 22.13 (c, d) shows the resulting interpolation kernels together with the original sinc function. The function wL2(x) is quite similar to the Catmull-Rom kernel wcrm(x) (see Eqn. (22.18) and Fig. 22.11), so the results can be expected to be similar as well, as shown in Fig. 22.14 (a) (cf. Fig. 22.12 (a)). Notice, however, the relatively poor reconstruction in the smooth signal regions (Fig. 22.14 (a)) and the strong ringing introduced in the constant high-amplitude regions (center plot of Fig. 22.14 (a)). The “3-tap” kernel wL3(x) reduces these artifacts and produces steeper edges, at the cost of increased overshoot (Fig. 22.14 (b)). In summary, although Lanczos interpolators have seen revived interest and popularity in recent years, they do not seem to offer much (if any) advantage over other established methods, particularly the cubic, Catmull-Rom, or Mitchell-Netravali interpolations. While these are based on efficiently computable polynomial functions, Lanczos interpolation requires trigonometric functions which are relatively costly to compute, unless some form of tabulation is used.
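For reference, the order-n Lanczos kernel of Eqn. (22.27) is only a few lines of code apart from its special cases. The following sketch is illustrative (not the imagingbook LanczosInterpolator):

  // Lanczos kernel wLn(x) of order n, Eqn. (22.27).
  // In performance-critical code the values are typically tabulated once,
  // since the trigonometric calls are relatively costly.
  static double wLn(double x, int n) {
    x = Math.abs(x);
    if (x < 1e-12)
      return 1;                 // the |x| = 0 case
    if (x >= n)
      return 0;                 // outside the kernel's support
    double px = Math.PI * x;
    return n * Math.sin(px / n) * Math.sin(px) / (px * px);
  }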



22.5 Interpolation in 2D

So far we have only looked at interpolating (or reconstructing) 1D signals from discrete samples. Images are 2D signals but, as we shall see






Fig. 22.13 One-dimensional Lanczos interpolation kernels. Two-tap Lanczos window ψL2 and kernel wL2 (a). Three-tap window ψL3 and kernel wL3 (b). The sinc function is shown for comparison.

Fig. 22.14 Lanczos interpolation examples: Lanczos-2 (a), Lanczos-3 (b). Note the ringing in the flat (constant) regions caused by Lanczos-2 interpolation in the center graph of (a). The Lanczos-3 interpolator shows less ringing but produces steeper edges at the cost of increased overshoot.



in this section, the techniques for interpolating images are very similar and can be derived from the 1D approach. In particular, “ideal” (low-pass filter) interpolation requires a 2D sinc function defined as

    Wsinc(x, y) = sinc(x) · sinc(y) = (sin(πx)/(πx)) · (sin(πy)/(πy)),        (22.28)

which is shown in Fig. 22.15 (a). Just as in 1D, the 2D sinc function is not a practical interpolation function for various reasons. In the following, we look at some common interpolation methods for images, particularly the nearest-neighbor, bilinear, bicubic, and Lanczos interpolations, whose 1D versions were described in the previous sections.

Fig. 22.15 Two-dimensional interpolation kernels: 2D sinc kernel Wsinc(x, y) (a) and nearest-neighbor kernel Wnn(x, y) (b), for −2.5 ≤ x, y ≤ 2.5.



22.5.1 Nearest-Neighbor Interpolation in 2D

The position u = (u, v) of the pixel closest to a continuous interpolation point x = (x, y) ∈ R² is found by independently rounding the x and y coordinates to discrete values, i.e., Ĩ(x) = I(u), with

    u = (round(x), round(y))ᵀ = (⌊x + 0.5⌋, ⌊y + 0.5⌋)ᵀ.        (22.29)



As in the 1D case, the interpolation in 2D can be described as a linear convolution (linear filter), with the associated 2D kernel (analogous to Eqn. (22.10)) defined as

    Wnn(x, y) = { 1  for −0.5 ≤ x, y < 0.5,
                  0  otherwise.        (22.30)

This function is shown in Fig. 22.15 (b). Nearest-neighbor interpolation is known for its strong blocking effects (see Fig. 22.17 (b, d)) and thus is rarely used for geometric image operations. However, in some situations, this effect may be intended; for example, if an image is to be enlarged by replicating each pixel without any smoothing.

22.5.2 Bilinear Interpolation

The 2D counterpart to the linear interpolation in 1D (see Sec. 22.1) is the so-called bilinear interpolation,⁵ whose operation is illustrated in Fig. 22.16. For the given interpolation point x = (x, y), we first find the four closest (surrounding) pixel values,

    A = I(u, v),     B = I(u+1, v),     C = I(u, v+1),     D = I(u+1, v+1),        (22.31)



where u = ⌊x⌋ and v = ⌊y⌋. Then the pixel values A, B, C, D are interpolated in horizontal and subsequently in vertical direction. The intermediate values E, F are calculated from the horizontal distance a = (x−u) of the interpolation point (x, y) from the nearest raster coordinate u as

    E = A + (x−u)·(B−A) = A + a·(B−A),        (22.32)
    F = C + (x−u)·(D−C) = C + a·(D−C),        (22.33)

and the final interpolation value G is calculated from the vertical distance b = y−v as

    Ĩ(x, y) = G = E + (y−v)·(F−E) = E + b·(F−E)
             = (a−1)(b−1)·A + a(1−b)·B + (1−a)b·C + ab·D.        (22.34)

Expressed as a linear convolution filter, the corresponding 2D kernel Wbil(x, y) is the product of the two 1D kernels wlin(x) and wlin(y) (see Eqn. (22.11)), that is,



⁵ Not to be confused with the bilinear mapping (transformation) described in Sec. 21.1.5.






Fig. 22.16 Bilinear interpolation. For a given position (x, y), the interpolated value is computed from the values A, B, C, D of the four closest pixels in two steps (a). First the intermediate values E and F are calculated by linear interpolation in the horizontal direction between A, B and C, D, respectively, where a = x−u is the distance to the nearest pixel to the left of x. Subsequently, the intermediate values E, F are interpolated in the vertical direction, where b = y−v is the distance to the nearest pixel below y. The interpolated surface between four adjacent pixels is shown in (b).

Fig. 22.17 Image enlargement example. Original image (a), 8-times enlargement using nearest-neighbor interpolation (b) and bilinear interpolation (c). Enlarged detail (d, e). The blue grid lines mark the original pixel boundaries.



    Wbilin(x, y) = wlin(x) · wlin(y) = { 1 − |x| − |y| + |x·y|   for 0 ≤ |x|, |y| < 1,
                                         0                       otherwise.        (22.35)

Note that this function is nonlinear, not least because of the mixed term x·y, which gives the method its name. See Fig. 22.18 (a) for a plot of this 2D kernel.
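Written out, Eqns. (22.31)–(22.34) yield a very small routine. The sketch below is illustrative only (it uses a plain 2D array instead of the ImageProcessor/ImageAccessor classes used by the library code later in this chapter) and simply clamps coordinates at the image borders.

  // Bilinear interpolation of a scalar image at a continuous position (x, y),
  // following Eqns. (22.31)-(22.34). I[v][u] holds the pixel in column u, row v.
  static double bilinear(double[][] I, double x, double y) {
    int w = I[0].length, h = I.length;
    int u = Math.min(Math.max((int) Math.floor(x), 0), w - 2);
    int v = Math.min(Math.max((int) Math.floor(y), 0), h - 2);
    double a = x - u;                        // horizontal distance to pixel u
    double b = y - v;                        // vertical distance to pixel v
    double A = I[v][u],     B = I[v][u + 1];
    double C = I[v + 1][u], D = I[v + 1][u + 1];
    double E = A + a * (B - A);              // Eqn. (22.32)
    double F = C + a * (D - C);              // Eqn. (22.33)
    return E + b * (F - E);                  // Eqn. (22.34)
  }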






22.5.3 Bicubic and Spline Interpolation in 2D

The convolution kernel for the 2D cubic interpolation is also defined as the product of the corresponding 1D kernels (see Eqn. (22.13)),



    Wbic(x, y) = wcub(x) · wcub(y).        (22.36)

Fig. 22.18 Two-dimensional interpolation kernels: bilinear kernel Wbil(x, y) (a) and bicubic kernel Wbic(x, y) (b), for −2.5 ≤ x, y ≤ 2.5.

Fig. 22.19 Bicubic interpolation in two steps. The discrete image I (integer pixel positions correspond to grid-lines) is to be interpolated at some continuous position (x, y). In Step 1 (a), a 1D interpolation is performed in the horizontal direction with wcub(x) over four pixels I(ui, vj) in four lines. One intermediate result pj is computed for each line j. In Step 2 (b), the result Ĩ(x, y) is calculated by a single cubic interpolation in vertical direction, over the intermediate values p0, . . . , p3. In total, 16 + 4 = 20 interpolation steps are required.



The resulting kernel is plotted in Fig. 22.18 (b). Due to the decomposition into 1D kernels (Eqn. (22.14)), the calculation of the bicubic interpolation is separable in x, y and can be expressed as

    Ĩ(x, y) = ∑_{v=⌊y⌋−1}^{⌊y⌋+2} ∑_{u=⌊x⌋−1}^{⌊x⌋+2} Wbic(u−x, v−y) · I(u, v)        (22.37)

            = ∑_{j=0}^{3} wcub(vj−y) · ∑_{i=0}^{3} wcub(ui−x) · I(ui, vj),        (22.38)

where the inner sum in Eqn. (22.38) is the intermediate value pj, with ui = ⌊x⌋−1+i and vj = ⌊y⌋−1+j. Quantities pj (j = 0, . . . , 3) are the intermediate results of the cubic interpolation in x-direction in lines vj, as illustrated in Fig. 22.19 (a). Equation (22.38) describes a simple and efficient procedure for computing the bicubic interpolation using only a 1D kernel wcub(x). The interpolation is based on a 4 × 4 neighborhood of pixels and requires a total of 16 + 4 = 20 additions and multiplications. This method, which is summarized in Alg. 22.1, can be used to implement any x/y-separable 2D interpolation kernel of size 4 × 4, such as the 2D Catmull-Rom interpolation (Eqn. (22.18)) with

    Wcrm(x, y) = wcrm(x) · wcrm(y)        (22.39)



Fig. 22.20 Two-dimensional spline interpolation kernels: Catmull-Rom kernel Wcrm(x, y) (a), Mitchell-Netravali kernel Wmn(x, y) (b), for −2.5 ≤ x, y ≤ 2.5.

Alg. 22.1 Bicubic interpolation of image I at position (x, y). The 1D cubic function wcub(·) (Eqn. (22.12)) is used for the separate interpolation in the x and y directions, based on a 4 × 4 pixel neighborhood (see Fig. 22.19). See Prog. 22.1 for a straightforward implementation in Java.

 1: BicubicInterpolation(I, x)                     ▷ two-dimensional interpolation
    Input: I, original image; x = (x, y) ∈ R², continuous interpolation point.
    Returns the interpolated pixel value at position x.
 2:   q ← 0
 3:   for j ← 0, . . . , 3 do                      ▷ iterate over 4 lines
 4:     v ← ⌊y⌋ + j − 1
 5:     p ← 0
 6:     for i ← 0, . . . , 3 do                    ▷ iterate over 4 columns
 7:       u ← ⌊x⌋ + i − 1
 8:       p ← p + Cubic1D(x−u) · I(u, v)           ▷ see Eqn. (22.12)
 9:     q ← q + Cubic1D(y−v) · p
10:   return q

11: Cubic1D(x)                                     ▷ one-dimensional cubic interpolation
12:   z ← 0
13:   if |x| < 1 then                              ▷ see Eqn. (22.13)
14:     z ← |x|³ − 2·|x|² + 1
15:   else if |x| < 2 then
16:     z ← −|x|³ + 5·|x|² − 8·|x| + 4
17:   return z



or the Mitchell-Netravali interpolation (Eqn. (22.20)) with

    Wmn(x, y) = wmn(x) · wmn(y).        (22.40)

The corresponding 2D kernels are shown in Fig. 22.20. For interpolation with separable kernels of larger size see the general procedure in Alg. 22.2.

22.5.4 Lanczos Interpolation in 2D

The kernels for the 2D Lanczos interpolation are also x/y-separable into 1D kernels (see Eqns. (22.25) and (22.26), respectively), that is,

    WLn(x, y) = wLn(x) · wLn(y).        (22.41)



The resulting kernels for orders n = 2 and n = 3 are shown in Fig. 22.21. Because of the separability the 2D Lanczos interpolation can be computed, similar to the bicubic interpolation, separately in the x and y directions. Like the bicubic kernel, the 2-tap Lanczos kernel WL2 (Eqn. (22.25)) is zero outside the interval [−2, 2], and thus the procedure described in Eqn. (22.38) and Alg. 22.1 can be used with only a small modification (replace wcub by wL2 ).



Fig. 22.21 Two-dimensional Lanczos kernels for n = 2 and n = 3: kernels WL2(x, y) (a) and WL3(x, y) (b), with −3 ≤ x, y ≤ 3.



Compared to Eqn. (22.38), the larger Lanczos kernel WL3 (see Eqn. (22.26)) requires two additional pixel rows and columns. The calculation of the interpolated pixel value at position (x, y) thus has the form

    Ĩ(x, y) = ∑_{v=⌊y⌋−2}^{⌊y⌋+3} ∑_{u=⌊x⌋−2}^{⌊x⌋+3} WL3(u−x, v−y) · I(u, v)        (22.42)

            = ∑_{j=0}^{5} wL3(vj−y) · ∑_{i=0}^{5} wL3(ui−x) · I(ui, vj),        (22.43)

with ui = ⌊x⌋−2+i and vj = ⌊y⌋−2+j. Thus the L3 Lanczos interpolation in 2D uses a support region of 6 × 6 = 36 pixels from the original image, 20 pixels more than the bicubic interpolation. In general, the expression for a 2D Lanczos interpolator Ln of order n ≥ 1 is

    Ĩ(x, y) = ∑_{v=⌊y⌋−n+1}^{⌊y⌋+n} ∑_{u=⌊x⌋−n+1}^{⌊x⌋+n} WLn(u−x, v−y) · I(u, v)        (22.44)

            = ∑_{j=0}^{2n−1} wLn(vj−y) · ∑_{i=0}^{2n−1} wLn(ui−x) · I(ui, vj),        (22.45)



with ui = ⌊x⌋−n+1+i and vj = ⌊y⌋−n+1+j. The size of this interpolator’s support region is 2n × 2n pixels. How the expression in Eqn. (22.45) could be computed is shown in Alg. 22.2, which describes a general interpolation procedure that can be used with any separable interpolation kernel W(x, y) = wn(x)·wn(y) of extent ±n.

22.5.5 Examples and Discussion

Figures 22.22 and 22.23 compare the interpolation methods described in this section: nearest-neighbor, bilinear, bicubic, Catmull-Rom, cubic B-spline, Mitchell-Netravali, and Lanczos interpolation. In both figures, the original images are rotated counter-clockwise by 15°. A gray background is used to visualize the edge overshoot produced by some interpolators.






Alg. 22.2 General interpolation with a separable interpolation kernel W(x, y) = wn(x)·wn(y) of extent ±n (i.e., the 1D kernel wn(x) is zero for x ∉ [−n, n]). Note that procedure BicubicInterpolation in Alg. 22.1 is a special instance of this algorithm (with n = 2).

 1: SeparableInterpolation(I, x, w, n)
    Input: I, original image; x = (x, y) ∈ R², continuous position; w, a 1D interpolation
    kernel of extent ±n (n ≥ 1). Calculates and returns the interpolated intensity value
    for position x, using the composite 2D kernel W(x, y) = w(x)·w(y).
 2:   q ← 0
 3:   for j ← 0, . . . , 2n−1 do             ▷ iterate over 2n lines
 4:     v ← ⌊y⌋ − n + 1 + j                  ▷ = vj
 5:     p ← 0                                ▷ = pj
 6:     for i ← 0, . . . , 2n−1 do           ▷ iterate over 2n columns
 7:       u ← ⌊x⌋ − n + 1 + i                ▷ = ui
 8:       p ← p + w(x−u)·I(u, v)
 9:     q ← q + w(y−v)·p
10:   return q                               ▷ = Ĩ(x)



656



Fig. 22.22 Image interpolation methods compared (Example 1): (a) original, (b) nearest-neighbor, (c) bilinear, (d) bicubic, (e) Catmull-Rom, (f) cubic B-spline, (g) Mitchell-Netravali, (h) Lanczos-2, (i) Lanczos-3.



22.6 Aliasing

As we described in the main part of this chapter, the usual approach for implementing geometric image transformations can be summarized by the following three steps (see Fig. 22.24):
1. Each discrete coordinate u′ of the target image I′ is projected to the continuous source coordinate x = T⁻¹(u′) by the inverse geometric transformation T⁻¹.
2. The continuous intensity function Ĩ for the discrete source image I is reconstructed by local interpolation, using one of the methods described earlier.
3. The interpolated function is sampled at position x, and the associated value Ĩ(x) is transferred to the target pixel I′(u′).

22.6.1 Sampling the Interpolated Image

One problem not considered so far concerns the process of sampling the reconstructed, continuous image in step 3 above. The problem occurs when the geometric transformation T causes parts of the image to be contracted. In this case, the distance between adjacent sample points in the source image is locally increased by the inverse transformation T⁻¹. Now, widening the sampling distance implicitly






Fig. 22.23 Image interpolation methods compared (Example 2); panels (a)–(i) as in Fig. 22.22.



Fig. 22.24 Aliasing caused by geometric operations. If the geometric (forward) transformation T leads to a local contraction of the image (equivalent to enlargement by T −1 ), the sampling distance in the source image I is increased from τ (original) to τ ′ . This reduces the local sampling frequency and thus the maximum allowed signal frequency, which leads to aliasing of high-frequency content.






lowers the spatial sampling rate and thus the maximum permissi˜ This may quickly lead ble frequencies in the reconstructed image I. to the violation of the sampling criterion (see Sec. 18.2.1) and visible aliasing in the transformed image. This problem does not occur when the image is enlarged by the transformation T because in this case the sampling interval in the source image is shortened (corresponding to an increased sampling frequency) and no aliasing can occur. Note that this effect is largely unrelated to the interpolation method, as demonstrated by the examples in Fig. 22.25. The ef-



22.6 Aliasing Fig. 22.25 Aliasing caused by local image contraction. Aliasing is caused by a violation of the sampling criterion and is largely unaffected by the interpolation method used: complete transformed image (a), detail using nearest-neighbor interpolation (b), bilinear interpolation (c), and bicubic interpolation (d). (a)



(b)



(c)



(d)



I′



I



Interpolation



Filter



Sampling



1



2



3



Fig. 22.26 Low-pass filtering to avoid aliasing in geometric operations. After interpolation (Step 1), the reconstructed image function is subjected to low-pass filtering (Step 2) before being resampled (Step 3).



fect is most noticeable under nearest-neighbor interpolation in Fig. 22.25 (b), where the thin lines are simply not “hit” by the widened sampling raster and thus disappear in some places. Important image information is thereby lost. The bilinear and bicubic interpolation methods in Fig. 22.25 (c, d) have wider interpolation kernels but still cannot avoid the aliasing effect. The problem of course gets worse with increasing reduction factors. 22.6.2 Space-Variant Low-Pass Filtering One solution to the aliasing problem is to make sure that the interpolated image function is properly frequency-limited before it gets resampled. This can be accomplished with a suitable low-pass filter, as illustrated in Fig. 22.26. The cutoff frequency of the low-pass filter is determined by the amount of local scale change, which may—depending upon the type of transformation—be different in various parts of the image. In the



659



22 Pixel Interpolation



simplest case, the amount of scale change is the same throughout the image (e.g., under global scaling or affine transformations, where the same filter can be used everywhere in the image). In general, however, the low-pass filter is space-variant or nonhomogeneous, and the local filter parameters are determined by the transformation T and the current image position. If convolution filters are used for both interpolation and low-pass filtering, they could be combined into a common, space-variant reconstruction filter. Unfortunately, space-variant filtering is computationally expensive and thus is often avoided, even in professional applications (e.g., Adobe Photoshop). The technique is nevertheless used in certain applications, such as high-quality texture mapping in computer graphics [91, 126, 289]. Integral images, as described in Sec. 2.8.1, can be used to implement efficient space-variant smoothing filters.



22.7 Java Implementation Implementations of most interpolation methods described in this chapter are openly available as part of the imagingbook library.6 The following interpolators are available as implementations of the common PixelInterpolator interface: BicubicInterpolator, BilinearInterpolator, LanczosInterpolator, NearestNeighborInterpolator, SplineInterpolator. The key method provided by all these classes is getInterpolatedValue(ia, x, y), where ia is of type ScalarAccessor7 and x, y are the interpolation coordinates (of type double). The Catmull-Rom, cubic B-spline and Mitchell-Netravali interpolators are all implemented (with different parameters) by SplineInterpolator. Similarly, class LanczosInterpolator implements Lanczos-type interpolators of arbitrary order n. For illustration, the complete implementation of the class BicubicInterpolator is listed in Prog. 22.1. Program 22.2 shows a simple interpolation example based on class ImageAccessor, which allows the interpolation method and out-of-bounds strategy to be specified and runs on any type of scalar- or vector-valued image.



6 7



660



Package imagingbook.lib.interpolation. Class ScalarAccessor is part of the “image accessor” framework defined by the imagingbook library. It facilitates unified access to various scalarvalued image types. Analogously, class VectorAccessor handles vectorvalued images (e.g., RGB color images). Their common super-class is ImageAccessor, which can be configured how to handle image borders (see Sec. 4.5.2).



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44



package imagingbook.lib.interpolation;



22.8 Exercises



import imagingbook.lib.image.access.ScalarAccessor; public class BicubicInterpolator implements PixelInterpolator { private final double a; // sharpness value public BicubicInterpolator() { this(0.5); } public BicubicInterpolator(double a) { this.a = a; } @Override public float getInterpolatedValue(ScalarAccessor ia, double x, double y) { final int u0 = (int) Math.floor(x); final int v0 = (int) Math.floor(y); double q = 0; for (int j = 0; j 0 then d2 ← m1 + D(u, v−1) if u < M − 1 then d3 ← m2 + D(u+1, v−1) D(u, v) ← min(d0 , d1 , d2 , d3 )



20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32:



Pass 2 (R → L): for v ← N −1, . . . , 0 do ▷ bottom → top for u ← M −1, . . . , 0 do ▷ right → left if D(u, v) > 0 then d0 , d 1 , d 2 , d 3 ← ∞ ▷ neighbor distances if u < M −1 then ▷ check image borders d0 ← m1 + D(u+1, v) if v < N −1 then d1 ← m2 + D(u+1, v+1) if v < N −1 then d2 ← m1 + D(u, v+1) if u > 0 then d3 ← m2 + D(u−1, v+1) D(u, v) ← min(D(u, v), d0 , d1 , d2 , d3 )



33:



return D



23.2 Matching Binary Images Alg. 23.2 Chamfer algorithm for calculating the distance transform. From the binary image I, the distance transform D (Eqn. (23.18)) is calculated using a pair of distance masks, specified by values m1 , m2 (see Eqn. (23.21)), for the first and second pass (see Fig. 23.9). Notice that the image borders require special treatment.



Exercise 23.4), which include the exact distances to pixels in a larger neighborhood [37]. Furthermore, floating point-operations can be avoided by using distance masks with scaled integer values, such as the masks     4 3 4 · · · (1) (2) ME ′ =  3 × ·  and ME ′ =  · × 3  (23.24) · · · 4 3 4 for the Euclidean distance. Compared with the original masks (Eqn. (23.23)), the resulting distance values are scaled by about the factor 3. 681



23 Image Matching and Registration



Original image



Manhattan distance



Euclid. distance (approx.)



(a)



(b)



(c)



Fig. 23.10 Distance transform with the chamfer algorithm: original image with black foreground pixels (a), and results of distance transforms using the Manhattan distance (b) and the Euclidean distance (c). The brightness (scaled to maximum contrast) corresponds to the estimated distance to the nearest foreground pixel.



23.2.3 Chamfer Matching The chamfer algorithm offers an efficient way to approximate the distance transform for a binary image of arbitrary size. The next step is to use the distance transform for matching binary images. Chamfer matching (first described in [25]) uses the distance transform to localize the points of maximum agreement between a binary search image I and a binary reference image (template) R. Instead of counting the overlapping foreground pixels as in the direct approach (see Sec. 23.2.1), chamfer matching uses the accumulated values of the distance transform as the match score Q. At each position (r, s) of the template R, the distance values corresponding to all foreground pixels in R are accumulated, that is, X 1 Q(r, s) = · D(r+i, s+j) , (23.25) |FG(R)| (i,j)∈ FG(R)



682



where |FG(R)| denotes the number of foreground pixels in the reference image R. The complete procedure for computing the match score Q is summarized in Alg. 23.3. If at some position each foreground pixel in the template R coincides with a foreground pixel in the image I, the sum of the distance values is zero, which indicates a perfect match. The more foreground pixels of the template fall onto distance values greater than zero, the larger is the resulting score value Q (sum of



1: ChamferMatch(I, R)
   Input: I, binary search image; R, binary reference image.
   Returns a 2D map of match scores.
   Step 1 – Initialize:
2:    (MI, NI) ← size(I)
3:    (MR, NR) ← size(R)
4:    D ← DistanceTransform(I)                        ▷ Alg. 23.2
5:    Q ← new map of size (MI−MR+1) × (NI−NR+1) ↦ ℝ
   Step 2 – Calculate match function Q:
6:    for r ← 0, . . . , MI−MR do                      ▷ place R at (r, s)
7:       for s ← 0, . . . , NI−NR do
            Get match score for R placed at (r, s):
8:          q ← 0
9:          n ← 0                                      ▷ number of FG pixels in R
10:         for i ← 0, . . . , MR−1 do
11:            for j ← 0, . . . , NR−1 do
12:               if R(i, j) ≠ 0 then                  ▷ FG pixel in R
13:                  q ← q + D(r+i, s+j)
14:                  n ← n + 1
15:         Q(r, s) ← q/n
16:   return Q



Alg. 23.3 Chamfer matching (calculation of the match function). Given is a binary search image I and a binary reference image (template) R. In Step 1, the distance transform D is calculated for the image I, using the chamfer algorithm (Alg. 23.2). In Step 2, the distance values in D are accumulated over all foreground pixels in template R and for each template position (r, s). The resulting scores are stored in the 2D map Q, which is returned.



distances). The best match is found at the global minimum of Q, that is,

    xopt = (ropt, sopt) = argmin_(r,s) Q(r, s).                      (23.26)



The example in Fig. 23.11 demonstrates the difference between direct pixel comparison and chamfer matching using the binary image shown in Fig. 23.7. Obviously the match score produced by the chamfer method is considerably smoother and exhibits only a few distinct local maxima. This is of great advantage because it facilitates the detection of optimal match points using simple local search methods. Figure 23.12 shows another example with circles and squares. The circles have different diameters and the medium-sized circle is used as the template. As this example illustrates, chamfer matching is tolerant against small-scale changes between the search image and the template and even in this case yields a smooth score function with distinct peaks. While chamfer matching is not a “silver bullet”, it is efficient and works sufficiently well if the applications and conditions are suitable. It is most suited for matching line or edge images where the percentage of foreground pixels is small, such as for registering aerial images or aligning wide-baseline stereo images. The method tolerates deviations between the image and the template to a small extent but is of course not generally invariant under scaling, rotation, and deformation. The quality of the results deteriorates quickly when images contain random noise (“clutter”) or large foreground regions, because the method is based on minimizing the distances to foreground pixels. One way to reduce the probability of false matches is not to use a linear summation (as in Eqn. (23.25)) but add up the squared distances, that is,






Fig. 23.11 Direct pixel comparison vs. chamfer matching (see original images in Fig. 23.7). Unlike the results of the direct pixel comparison (a), the chamfer match score Q (b) is much smoother. It shows distinct peak values in places of high agreement that are easy to track down with local search methods. The match score Q (see Eqn. (23.25)) in (b) is shown inverted for easy comparison. The corresponding surface plots in (c, d) give an additional impression of the functions' behavior.






    Qrms(r, s) = [ 1/|FG(R)| · Σ_(i,j)∈FG(R) D²(r+i, s+j) ]^(1/2)    (23.27)



(“root mean square” of the distances) as the match score between the template R and the current subimage, as suggested in [37]. Also, hierarchical variants of the chamfer method have been proposed to reduce the search effort as well as to increase robustness [38].

23.2.4 Java Implementation

The calculation of the distance transform, as described in Alg. 23.2, is implemented by the class DistanceTransform (package imagingbook.pub.matching). Program 23.3 shows the complete code of class ChamferMatcher for comparing binary images with the distance transform, which is a direct implementation of Alg. 23.3. Additional examples (ImageJ plugins) can be found in the online code repository.
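The following short usage sketch shows how these classes fit together. The wrapper class and the helper findBestMatch() are hypothetical and only illustrate the calling sequence (constructor and getMatch() as shown in Prog. 23.3), including the search for the score minimum of Eqn. (23.26).

    import ij.process.ByteProcessor;
    import imagingbook.pub.matching.ChamferMatcher;
    import imagingbook.pub.matching.DistanceTransform.Norm;

    public class ChamferMatchDemo {
        // Returns the position (r, s) of the best match, i.e., the global
        // minimum of the score map Q (Eqn. (23.26)).
        static int[] findBestMatch(ByteProcessor I, ByteProcessor R) {
            ChamferMatcher matcher = new ChamferMatcher(I, Norm.L2);
            float[][] Q = matcher.getMatch(R);      // 2D map of match scores
            int rOpt = 0, sOpt = 0;
            for (int r = 0; r < Q.length; r++) {
                for (int s = 0; s < Q[r].length; s++) {
                    if (Q[r][s] < Q[rOpt][sOpt]) {
                        rOpt = r;
                        sOpt = s;
                    }
                }
            }
            return new int[] { rOpt, sOpt };
        }
    }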



23.3 Exercises

Exercise 23.1. Implement the chamfer-matching method (Alg. 23.2) for binary images using the Euclidean distance and the Manhattan distance.






Fig. 23.12 Chamfer matching under varying scales. Binary search image with three circles of different diameters and three identical squares (a). The medium-sized circle at the top is used as the template (b). The result from a direct pixel comparison (c, e) and the result from chamfer matching (d, f). Again the chamfer match produces a much smoother score, which is most notable in the 3D plots shown in the bottom row (e, f). Notice that the three circles and the squares produce high match scores with similar absolute values (f).



Exercise 23.2. Implement the exact Euclidean distance transform using a “brute-force” search for each closest foreground pixel (this may take a while to compute). Compare your results with the approximation obtained with the chamfer method (Alg. 23.2), and compute the maximum deviation (as percentage of the real distance).

Exercise 23.3. Implement the chamfer-matching technique using (a) the linear summation of distances (Eqn. (23.25)) and (b) the summation of squared distances (Eqn. (23.27)) for computing the match score. Select suitable test images to find out if version (b) is really more robust in terms of reducing the number of false matches.

Exercise 23.4. Modify the chamfer algorithm for computing the distance transform (Alg. 23.2) by replacing the 3 × 3 pixel Euclidean distance masks (Eqn. (23.23)) with the following masks of size 5 × 5:






Prog. 23.3 Java implementation of binary chamfer matching (Alg. 23.3). The distance transform of the binary search image I is calculated only once in the constructor method (by class DistanceTransform) and stored as a 2D float array. Parameter norm may be L1 or L2. Method getMatch(R) calculates the 2D score map Q (again as a float array) for the reference image R. Note that matching to another reference image R does not require the distance transform of I to be re-calculated.



    package imagingbook.pub.matching;

    import ij.process.ByteProcessor;
    import imagingbook.pub.matching.DistanceTransform.Norm;

    public class ChamferMatcher {
        private final int MI, NI;
        private final float[][] D;    // distance transform of I
        private int MR, NR;

        public ChamferMatcher(ByteProcessor I) {
            this(I, Norm.L2);
        }

        public ChamferMatcher(ByteProcessor I, Norm norm) {
            this.MI = I.getWidth();
            this.NI = I.getHeight();
            this.D = (new DistanceTransform(I, norm)).getDistanceMap();
        }

        public float[][] getMatch(ByteProcessor R) {
            this.MR = R.getWidth();
            this.NR = R.getHeight();
            float[][] Q = new float[MI - MR + 1][NI - NR + 1];
            // accumulate distance values over all foreground pixels of R
            // for every template position (r, s), as in Alg. 23.3 (Step 2)
            for (int r = 0; r <= MI - MR; r++) {
                for (int s = 0; s <= NI - NR; s++) {
                    float q = 0;
                    int n = 0;                        // number of FG pixels in R
                    for (int i = 0; i < MR; i++) {
                        for (int j = 0; j < NR; j++) {
                            if (R.get(i, j) != 0) {   // FG pixel in R
                                q = q + D[r + i][s + j];
                                n = n + 1;
                            }
                        }
                    }
                    Q[r][s] = q / n;
                }
            }
            return Q;
        }
    }






24 Non-Rigid Image Matching



given in Alg. 24.2 and concrete setups for various linear transformations are described in Sec. 24.4. Since the Jacobian matrix J (for the null parameter vector p = 0) and the Hessian matrix H are calculated only once during initialization, this algorithm executes significantly faster than the original Lucas-Kanade (forward-additive) algorithm, while offering similar convergence properties.

24.4 Linear Transformation Parameters

The use of linear transformations for the geometric mapping T is very common. In the following, we describe the detailed setups required by the Lucas-Kanade algorithm for various geometric transformations, such as pure translation as well as affine and projective transformations. This should help to reduce the chance of confusion about the content and structure of the involved vectors and matrices. For additional details and concrete implementations of these transformations, readers should consult the associated Java source code in the imagingbook library (package imagingbook.pub.geometry.mappings).

24.4.1 Pure Translation

In the case of pure 2D translation, we have n = 2 parameters (tx, ty) and the geometric transformation is (see Eqn. (24.15))

    x́ = Tp(x) = x + (tx, ty)ᵀ,                                       (24.34)

with parameter vector p = (p0, p1)ᵀ = (tx, ty)ᵀ and x = (x, y)ᵀ. Thus the component functions of the transformation (Eqn. (24.18)) are

    Tx,p(x) = x + tx,
    Ty,p(x) = y + ty,                                                 (24.35)

with the associated 2 × 2 Jacobian matrix

    Jp(x) = [ ∂Tx,p/∂tx (x)   ∂Tx,p/∂ty (x) ]   [ 1  0 ]
            [ ∂Ty,p/∂tx (x)   ∂Ty,p/∂ty (x) ] = [ 0  1 ].             (24.36)

Note that in this case Jp(x) = I2 (the 2 × 2 identity matrix) is constant, that is, independent of the position x and the parameters p. The 2D column vector δp (see Eqn. (24.25)) is calculated as

    δp = Σ_{u∈R} [∇I(Tp(u)) · Jp(u)]ᵀ · [R(u) − I(Tp(u))]             (24.37)
       = Σ_{u∈R} (Ix(ú), Iy(ú))ᵀ · D(u)                               (24.38)
       = ( Σ_u Ix(ú)·D(u), Σ_u Iy(ú)·D(u) )ᵀ
       = ( Σ_u S0(u)·D(u), Σ_u S1(u)·D(u) )ᵀ = (δ0, δ1)ᵀ,             (24.39)

with D(u) = R(u) − I(Tp(u)) and the steepest descent values s(u) = (S0(u), S1(u)) = ∇I(ú)·Jp(u),



where Ix, Iy denote the (estimated) first derivatives of the search image I in the x/y directions (see Sec. D.3 of the Appendix for how to estimate gradients of discrete images), and ú = Tp(u). Thus, in the case of pure translation, the steepest descent images S0(x) = Ix(x́) and S1(x) = Iy(x́) (see Eqn. (24.27)) are simply the x/y components of the interpolated gradient of I in the region of the shifted reference image. The associated Hessian matrix (Eqn. (24.28)) is calculated as

    H̄p = Σ_{u∈R} [∇I(Tp(u))·Jp(u)]ᵀ · [∇I(Tp(u))·Jp(u)]              (24.40)
       = Σ_{u∈R} [∇I(ú)·I2]ᵀ · [∇I(ú)·I2] = Σ_{u∈R} sᵀ(u)·s(u)        (24.41)
       = Σ_{u∈R} ∇Iᵀ(ú)·∇I(ú) = Σ_{u∈R} (Ix(ú), Iy(ú))ᵀ·(Ix(ú), Iy(ú))  (24.42)
       = Σ_{u∈R} [ Ix²(ú)        Ix(ú)·Iy(ú) ]
                  [ Ix(ú)·Iy(ú)  Iy²(ú)      ]                         (24.43)
       = [ Σ_u Ix²(ú)        Σ_u Ix(ú)·Iy(ú) ]   [ H00  H01 ]
         [ Σ_u Ix(ú)·Iy(ú)   Σ_u Iy²(ú)      ] = [ H10  H11 ],         (24.44)



again with ú = Tp(u). Since H̄p is symmetric (H01 = H10) and only of size 2 × 2, its inverse can easily be obtained in closed form, i.e.,

    H̄p⁻¹ = 1/(H00·H11 − H01²) · [  H11  −H01 ]
                                 [ −H01   H00 ].                       (24.45)

The resulting optimal parameter increment (see Eqn. (24.24)) is

    qopt = (t′x, t′y)ᵀ = H̄p⁻¹·δp = H̄p⁻¹·(δ0, δ1)ᵀ                    (24.46)
         = 1/(H00·H11 − H01²) · ( H11·δ0 − H01·δ1, H00·δ1 − H01·δ0 )ᵀ, (24.47)

with δ0, δ1 as defined in Eqn. (24.39). Alternatively, the same result could be obtained by solving Eqn. (24.30) for qopt.

24.4.2 Affine Transformation

An affine transformation in 2D can be expressed (for example) with homogeneous coordinates (see also Secs. 21.1.2 and 21.1.3) in the form

    Tp(x) = [ 1+a    b   tx ] · ( x, y, 1 )ᵀ,                          (24.48)
            [  c   1+d   ty ]

with n = 6 parameters p = (p0, . . . , p5)ᵀ = (a, b, c, d, tx, ty)ᵀ. This parameterization of the affine transformation implies that the null parameter vector (p = 0) corresponds to the identity transformation. The component functions of this transformation thus are



    Tx,p(x) = (1 + a)·x + b·y + tx,
    Ty,p(x) = c·x + (1 + d)·y + ty,                                    (24.49)

and the associated Jacobian matrix at some position x = (x, y) is

    Jp(x) = [ ∂Tx,p/∂a  ∂Tx,p/∂b  ∂Tx,p/∂c  ∂Tx,p/∂d  ∂Tx,p/∂tx  ∂Tx,p/∂ty ] (x)   (24.50)
            [ ∂Ty,p/∂a  ∂Ty,p/∂b  ∂Ty,p/∂c  ∂Ty,p/∂d  ∂Ty,p/∂tx  ∂Ty,p/∂ty ]

          = [ x  y  0  0  1  0 ]
            [ 0  0  x  y  0  1 ].                                                   (24.51)



Note that in this case, the Jacobian only depends on the position x, but not on the transformation parameters p. It can thus be pre-calculated once for all positions x of the reference image R. The 6-dimensional column vector δp (see Eqn. (24.25)) is obtained as

    δp = Σ_{u∈R} [∇I(Tp(u))·Jp(u)]ᵀ · [R(u) − I(Tp(u))]                (24.52)
       = Σ_{u∈R} [ (Ix(ú), Iy(ú)) · [ x  y  0  0  1  0 ] ]ᵀ · D(u)
                                    [ 0  0  x  y  0  1 ]               (24.53)
       = Σ_{u∈R} ( Ix(ú)·x, Ix(ú)·y, Iy(ú)·x, Iy(ú)·y, Ix(ú), Iy(ú) )ᵀ · D(u)
       = Σ_{u∈R} ( S0(u), S1(u), . . . , S5(u) )ᵀ · D(u)               (24.54)
       = ( Σ_u S0(u)·D(u), . . . , Σ_u S5(u)·D(u) )ᵀ,                  (24.55)

again with ú = Tp(u) and D(u) = R(u) − I(Tp(u)). The associated Hessian matrix (of size 6 × 6) is found as

    H̄p = Σ_{u∈R} [∇I(Tp(u))·Jp(u)]ᵀ · [∇I(Tp(u))·Jp(u)]               (24.56)
       = Σ_{u∈R} sᵀ(u)·s(u),                                            (24.57)

with s(u) = ( S0(u), . . . , S5(u) ) = ( Ix(ú)·x, Ix(ú)·y, Iy(ú)·x, Iy(ú)·y, Ix(ú), Iy(ú) ); that is, each element of H̄p is a sum of the form Σ_u Si(u)·Sj(u), for example H̄p(0, 0) = Σ_u Ix²(ú)·x² and H̄p(0, 1) = Σ_u Ix²(ú)·x·y.



Finally, the optimal parameter increment (see Eqn. (24.24)) can be calculated as

    qopt = ( a′, b′, c′, d′, t′x, t′y )ᵀ = H̄p⁻¹·δp                      (24.58)

or, equivalently, by solving H·qopt = δp (see Eqn. (24.30)). In any case, no closed-form solution is possible and some numerical method must be used (see Sec. B.8 of the Appendix).

24.4.3 Projective Transformation

A projective transformation (see also Sec. 21.1.4) can be expressed (for example) with homogeneous coordinates in the form

    Tp(x) = Mp·x = [ 1+a    b   tx ]   [ x ]
                   [  c   1+d   ty ] · [ y ],                            (24.59)
                   [  e    f     1 ]   [ 1 ]

with n = 8 parameters p = (p0, . . . , p7)ᵀ = (a, b, c, d, e, f, tx, ty)ᵀ. Again the null parameter vector corresponds to the identity transformation. In this case, the results need to be converted back to non-homogeneous coordinates (see Sec. 21.1.2) which, in Cartesian coordinates, yields the transformation's (nonlinear) component functions

    Tx,p(x) = [ (1+a)·x + b·y + tx ] / [ e·x + f·y + 1 ] = α/γ,          (24.60)
    Ty,p(x) = [ c·x + (1+d)·y + ty ] / [ e·x + f·y + 1 ] = β/γ,          (24.61)

with x = (x, y) and

    α = (1+a)·x + b·y + tx,                                               (24.62)
    β = c·x + (1+d)·y + ty,                                               (24.63)
    γ = e·x + f·y + 1.                                                    (24.64)
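As a small illustration of Eqns. (24.60)–(24.64), the projective warp can be coded directly. The following sketch is illustrative only; the method name and the assumed parameter layout p = {a, b, c, d, e, f, tx, ty} are not taken from the imagingbook library.

    // Projective warp Tp(x) per Eqns. (24.60)-(24.64);
    // p = {a, b, c, d, e, f, tx, ty} (hypothetical layout for this sketch).
    static double[] warpProjective(double x, double y, double[] p) {
        double a = p[0], b = p[1], c = p[2], d = p[3];
        double e = p[4], f = p[5], tx = p[6], ty = p[7];
        double alpha = (1 + a) * x + b * y + tx;    // Eqn. (24.62)
        double beta  = c * x + (1 + d) * y + ty;    // Eqn. (24.63)
        double gamma = e * x + f * y + 1;           // Eqn. (24.64)
        return new double[] { alpha / gamma, beta / gamma };
    }
    // With p = 0 (all parameters zero) this reduces to the identity mapping.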



In this case, the associated Jacobian matrix for position x = (x, y),

    Jp(x) = [ ∂Tx,p/∂a · · · ∂Tx,p/∂ty ] (x)
            [ ∂Ty,p/∂a · · · ∂Ty,p/∂ty ]

          = 1/γ · [ x  y  0  0  −x·α/γ  −y·α/γ  1  0 ]
                  [ 0  0  x  y  −x·β/γ  −y·β/γ  0  1 ],                  (24.65)

depends on both the position x and the transformation parameters p. The setup for the Hessian matrix H is analogous to Eqns. (24.56)–(24.58).

24.4.4 Concatenating Linear Transformations

The “inverse compositional” algorithm described in Sec. 24.3 requires the concatenation of geometric transformations (see Eqn. (24.32)). In particular, if Tp, Tq are linear transformations (in homogeneous coordinates, see Eqn. (24.59) and Sec. 21.1.3), with associated transformation matrices Ap and Aq (such that Tp(x) = Ap·x and Tq(x) = Aq·x, respectively), the matrix for the concatenated transformation,



Fig. 24.4 Matching coordinates (example). The center of the reference image R is aligned with the origin of the search image I (red square), which is taken as the absolute origin. Image samples (indicated by round dots) are assumed to be located at integer positions. In this example, the reference image R is of size MR = 11 and NR = 8, thus the center point is xc = ½·(MR−1, NR−1) = (5.0, 3.5). In the x/y coordinate frame of I (i.e., absolute coordinates), the four corners of R's bounding rectangle are A = (−5, −3.5), B = (5, −3.5), C = (5, 3.5) and D = (−5, 3.5). All geometric transformations refer to these reference points (cf. Figs. 24.2 and 24.3).



Tp′ (x) = (Tp ◦ Tq )(x) = Tq (Tp (x))



(24.66)



is simply the product of the original matrices, that is, Ap′ ·x = Aq ·Ap ·x.



(24.67)



The resulting parameter vector p′ for the composite transformation Tp′ can be simply extracted from the corresponding elements of the matrix Ap′ (see Eqns. (24.48, 24.59)).

24.4.5 Coordinate Frames

The described matching algorithms involve two images I, R of different size, as illustrated by the example in Fig. 24.4. Iteration is performed over integer positions of the reference image R, whose center is taken as the (continuous) reference point xc = (xc, yc). The corresponding pixel values in I are calculated by spatial interpolation.
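To make this coordinate convention concrete, the following tiny sketch (a hypothetical helper, not part of the imagingbook API) computes the centered position of a template sample (i, j) relative to the reference point xc of Fig. 24.4:

    // Position of template sample (i, j) relative to the center of R,
    // i.e., in the absolute coordinate frame of the search image I.
    static double[] centeredPosition(int i, int j, int MR, int NR) {
        double xc = 0.5 * (MR - 1);   // center of the reference image R
        double yc = 0.5 * (NR - 1);
        return new double[] { i - xc, j - yc };
    }
    // Example: for MR = 11, NR = 8 the corner (0, 0) maps to (-5.0, -3.5),
    // which corresponds to point A in Fig. 24.4.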



24.5 Example






Figure 24.5 shows an example for using the classic Lucas-Kanade (forward-additive) matcher. Initially, a rectangular region Q is selected in the search image I, marked by the green rectangle in Fig. 24.5 (a, b), which specifies the approximate position of the reference image. To create the (synthetic) reference image R, all four corners of the rectangle Q were perturbed by Gaussian noise (with σ = 2.5) in x and y direction. The resulting quadrilateral Q′ (red outline in Fig. 24.5 (a, b)) specifies the region in image I where the reference image R was extracted by transformation and interpolation (see Fig. 24.5 (d)). The matching process starts from the rectangle Q, which specifies the initial warp transformation Tinit , given by the green rectangle (Q), while the real (but unknown) transformation corresponds



Fig. 24.5 Lucas-Kanade (forward-additive) matcher with projective transformation. Original image I (a, b); the initial warp transformation Tinit is visualized by the green rectangle Q, which corresponds to the subimage shown in (c). The actual reference image R (d) has been extracted from the red quadrilateral Q′ (by transformation and interpolation). The blue circles mark the corners of the back-projected reference image under the changing transformation Tp. The radius of each circle is proportional to the registration error between the transformed reference image R and the currently overlapping part of the search image I. The steepest-descent images S0, . . . , S7 (one for each of the 8 parameters a, b, c, d, e, f, tx, ty of the projective transformation) for the first iteration are shown in (e). These images are of the same size as the reference image R.



to the red quadrilateral (Q′ ). Each iteration of the matcher updates the warp transformation T . The blue circles in Fig. 24.5(b) mark the corners of the back-projected reference frame under the changing transformation T ; the radius of the circles corresponds to the remaining registration error between the reference image R and the current subimage of I. Figure 24.5 (e) shows the steepest-descent images S0 , . . . , S7 (see Eqn. (24.27)) for the first iteration. Each of these images is of the same size as R and corresponds to one of the 8 parameters a, b, c, d, e, f, tx , ty of the projective warp transformation (see Eqn. (24.59)). The quantity Sk (u) indicates the optimal change of transformation parameter k in order to minimize the residual error for the particular pixel u. The actual change of parameter k is calculated by averaging over all positions u of the reference image R. The example demonstrates the robustness and fast convergence of the classic Lucas-Kanade matcher, which typically requires only






5–20 iterations. In this case, the matcher performed 7 iterations to converge (with convergence limit ϵ = 0.00001). In comparison, the inverse-compositional matcher typically requires more iterations and is less tolerant to deviations of the initial warp transformation, that is, it has a smaller convergence range than the forward-additive algorithm. (In fact, the inverse-compositional algorithm does not converge with this particular example.)



24.6 Java Implementation

The algorithms described in this chapter have been implemented in Java, with the source code available as part of the imagingbook library (package imagingbook.pub.lucaskanade) on the book's accompanying website. The (abstract) top-level class LucasKanadeMatcher has two concrete sub-classes, each implementing one of the algorithms described in this chapter: LucasKanadeForwardMatcher (Alg. 24.1) and LucasKanadeInverseMatcher (Alg. 24.2). Constructors of these classes accept a search image I, a reference image R, and optional parameters, e.g., to specify the maximum number of iterations. ImageJ plugins based on this functionality are available at the online repository.
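For readers who want to see the forward-additive update spelled out in code, the following is a minimal, self-contained sketch of a single iteration for the pure-translation case of Sec. 24.4.1. It is not the imagingbook implementation; the class and method names, the use of raw float[][] images, and the simple central-difference gradient and bilinear interpolation are all assumptions made for illustration.

    // One Lucas-Kanade (forward-additive) update step for pure translation.
    class TranslationMatcherSketch {

        // Given the current translation (tx, ty), returns the optimal
        // increment (tx', ty') according to Eqns. (24.37)-(24.47).
        static double[] updateStep(float[][] I, float[][] R, double tx, double ty) {
            int MR = R.length, NR = R[0].length;
            double h00 = 0, h01 = 0, h11 = 0;    // Hessian entries (Eqn. (24.44))
            double d0 = 0, d1 = 0;               // delta vector (Eqn. (24.39))
            for (int u = 0; u < MR; u++) {
                for (int v = 0; v < NR; v++) {
                    double x = u + tx, y = v + ty;   // warped position in I
                    // central-difference gradient of I at the warped position
                    double ix = 0.5 * (interpolate(I, x + 1, y) - interpolate(I, x - 1, y));
                    double iy = 0.5 * (interpolate(I, x, y + 1) - interpolate(I, x, y - 1));
                    double d = R[u][v] - interpolate(I, x, y);   // D(u) = R(u) - I(Tp(u))
                    d0 += ix * d;
                    d1 += iy * d;
                    h00 += ix * ix;
                    h01 += ix * iy;
                    h11 += iy * iy;
                }
            }
            double det = h00 * h11 - h01 * h01;  // closed-form inverse (Eqn. (24.45))
            return new double[] { (h11 * d0 - h01 * d1) / det,
                                  (h00 * d1 - h01 * d0) / det };
        }

        // Bilinear interpolation of I at continuous position (x, y),
        // clamped to the image bounds.
        static double interpolate(float[][] I, double x, double y) {
            int M = I.length, N = I[0].length;
            x = Math.max(0, Math.min(M - 1.001, x));
            y = Math.max(0, Math.min(N - 1.001, y));
            int x0 = (int) x, y0 = (int) y;
            double a = x - x0, b = y - y0;
            return (1 - a) * (1 - b) * I[x0][y0] + a * (1 - b) * I[x0 + 1][y0]
                 + (1 - a) * b * I[x0][y0 + 1] + a * b * I[x0 + 1][y0 + 1];
        }
    }

A caller would repeat updateStep(), adding the returned increment to (tx, ty), until the magnitude of the increment drops below a small threshold ϵ or a maximum iteration count is reached.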



24.7 Exercises

Exercise 24.1. Determine the general structure of the Hessian matrix for the projective transformation (see Sec. 24.4.3), analogous to the affine transformation in Eqns. (24.56)–(24.57).

Exercise 24.2. Create comparative statistics of the convergence properties of the classes LucasKanadeForwardMatcher and -InverseMatcher by evaluating the number of iterations required, including the percentage of failures. Use a test scenario with randomly distorted reference regions.

Exercise 24.3. It is sometimes suggested to refine the warp transformation step-by-step instead of using the full transformation for the whole matching process. For example, one could first match with a pure translation model, then—starting from the result of the first match—switch to an affine transformation model, and eventually apply a full projective transformation. Explore this idea and find out whether this can yield a more robust matching process.

Exercise 24.4. Adapt the 2D Lucas-Kanade method described in Sec. 24.2 for the registration of discrete 1D signals under shifting and scaling. Given is a search signal I(u), for u = 0, . . . , MI−1, and a reference signal R(u), for u = 0, . . . , MR−1. It is assumed that I contains a transformed version of R, which is specified by the mapping Tp(x) = s·x + t, with the two unknown parameters p = (s, t). A



practical application could be the registration of neighboring image lines under perspective distortion.

Exercise 24.5. Use the Lucas-Kanade matcher to design a tracker that follows a given reference patch through a sequence of N images. Hint: In ImageJ, an image sequence (AVI-video or multi-frame TIFF) can be imported as an ImageStack and simply processed frame-by-frame. Select the original reference patch in the first frame of the image sequence and use its position to calculate the initial warp transformation to find a match in the second image. Subsequently, take the match obtained in the second image as the initial transformation for the third image, etc. Consider two approaches: (a) use the initial patch as the reference image for all frames of the sequence or (b) extract a new reference image for each pair of frames.






Part IX



Local Features



25 Scale-Invariant Feature Transform (SIFT)



Many real applications require the localization of reference positions in one or more images, for example, for image alignment, removing distortions, object tracking, 3D reconstruction, etc. We have seen that corner points1 can be located quite reliably and independent of orientation. However, typical corner detectors only provide the position and strength of each candidate point, they do not provide any information about its characteristic or “identity” that could be used for matching. Another limitation is that most corner detectors only operate at a particular scale or resolution, since they are based on a rigid set of filters. This chapter describes the Scale-Invariant Feature Transform (SIFT) technique for local feature detection, which was originally proposed by D. Lowe [174] and has since become a “workhorse” method in the imaging industry. Its goal is to locate image features that can be identified robustly to facilitate matching in multiple images and image sequences as well as object recognition under different viewing conditions. SIFT employs the concept of “scale space” [173] to capture features at multiple scale levels or image resolutions, which not only increases the number of available features but also makes the method highly tolerant to scale changes. This makes it possible, for example, to track features on objects that move towards the camera and thereby change their scale continuously or to stitch together images taken with widely different zoom settings. Accelerated variants of the SIFT algorithm have been implemented by streamlining the scale space calculation and feature detection or the use of GPU hardware [26, 110, 247]. In principle, SIFT works like a multi-scale corner detector with sub-pixel positioning accuracy and a rotation-invariant feature descriptor attached to each candidate point. This (typically 128-dimensional) feature descriptor summarizes the distribution of the gradient directions in a spatial neighborhood around the corresponding feature point and can thus be used like a “fingerprint”. The main steps involved in the calculation of SIFT features are as follows: 1



See Chapter 6.






1. Extrema detection in a Laplacian-of-Gaussian (LoG) scale space to locate potential interest points.
2. Key point refinement by fitting a continuous model to determine precise location and scale.
3. Orientation assignment by the dominant orientation of the feature point from the directions of the surrounding image gradients.
4. Formation of the feature descriptor by normalizing the local gradient histogram.

These steps are all described in the remaining parts of this chapter. There are several reasons why we explain the SIFT technique here in such great detail. For one, it is by far the most complex algorithm that we have looked at so far; its individual steps are carefully designed and delicately interdependent, with numerous parameters that need to be considered. A good understanding of the inner workings and limitations is thus important for successful use as well as for analyzing problems if the results are not as expected.



25.1 Interest Points at Multiple Scales

The first step in detecting interest points is to find locations with stable features that can be localized under a wide range of viewing conditions and different scales. In the SIFT approach, interest point detection is based on Laplacian-of-Gaussian (LoG) filters, which respond primarily to distinct bright blobs surrounded by darker regions, or vice versa. Unlike the filters used in popular corner detectors, LoG filters are isotropic, i.e., insensitive to orientation. To locate interest points over multiple scales, a scale space representation of the input image is constructed by recursively smoothing the image with a sequence of small Gaussian filters. The difference between the images in adjacent scale layers is used to approximate the LoG filter at each scale. Interest points are finally selected by finding the local maxima in the 3D LoG scale space.

25.1.1 The LoG Filter

In this section, we first outline LoG filters and the basic construction of a Gaussian scale space, followed by a detailed description of the actual implementation and the parameters used in the SIFT approach. The LoG is a so-called center-surround operator, which responds most strongly to isolated local intensity peaks and to edge- and corner-like image structures. The corresponding filter kernel is based on the second derivative of the Gaussian function, as illustrated in Fig. 25.1 for the 1D case. The 1D Gaussian function of width σ is defined as

    Gσ(x) = 1/(√(2π)·σ) · e^(−x²/(2σ²)),                              (25.1)

and its first derivative is

    G′σ(x) = dGσ/dx (x) = −x/(√(2π)·σ³) · e^(−x²/(2σ²)).              (25.2)



Fig. 25.1 1D Gaussian function Gσ(x) with σ = 1 (orange), its first derivative G′σ(x) (green), and second derivative G″σ(x) (blue).



Analogously, the second derivative of the 1D Gaussian is

    G″σ(x) = d²Gσ/dx² (x) = (x² − σ²)/(√(2π)·σ⁵) · e^(−x²/(2σ²)).      (25.3)



The Laplacian (denoted ∇²) of a continuous 2D function f: ℝ² ↦ ℝ is defined as the sum of the second partial derivatives in the x and y directions. The outcome is another 2D function, which is traditionally written as

    ∇²f(x, y) = ∂²f/∂x² (x, y) + ∂²f/∂y² (x, y).                       (25.4)



Note that, unlike the gradient of a 2D function (see Sec. 5.2.1), the result of the Laplacian is not a vector but a scalar quantity. Its value is invariant to rotations of the coordinate system, that is, the Laplacian operator has the important property of being isotropic. By applying the Laplacian operator to a rotationally symmetric 2D Gaussian,

    Gσ(x, y) = 1/(2π·σ²) · e^(−(x²+y²)/(2σ²)),                         (25.5)



with identical widths σ = σx = σy in the x/y directions (as shown in Fig. 25.2(a)), we obtain the Laplacian-of-Gaussian (LoG) function

    Lσ(x, y) = ∇²Gσ(x, y) = ∂²Gσ/∂x² (x, y) + ∂²Gσ/∂y² (x, y)
             = (x²−σ²)/(2π·σ⁶) · e^(−(x²+y²)/(2σ²)) + (y²−σ²)/(2π·σ⁶) · e^(−(x²+y²)/(2σ²))
             = 1/(π·σ⁴) · ( (x²+y²)/(2σ²) − 1 ) · e^(−(x²+y²)/(2σ²)),   (25.6)



depicted in Fig. 25.2(b). The continuous LoG function in Eqn. (25.6) has the absolute value integral

    ∫_{−∞}^{∞} ∫_{−∞}^{∞} |Lσ(x, y)| dx dy = 4/(e·σ²),                 (25.7)

and zero average, that is,



    ∫_{−∞}^{∞} ∫_{−∞}^{∞} Lσ(x, y) dx dy = 0.                          (25.8)

Fig. 25.2 2D Gaussian and LoG. Gaussian function Gσ(x, y) with σ = 1 (a); the corresponding LoG function Lσ(x, y) in (b), and the inverted function (sometimes called the “Mexican hat” or “Sombrero” function) −Lσ(x, y) in (c). For illustration, all three functions are normalized to an absolute value of 1 at the origin.



When used as the kernel of a linear filter, the LoG responds maximally to circular spots that are darker than the surrounding background and have a radius of approximately σ. Blobs that are brighter than the surrounding background are enhanced by filtering with the negative LoG kernel −Lσ, which is sometimes referred to as the “Mexican hat” or “Sombrero” filter (see Fig. 25.2). Both types of blobs can be detected simultaneously by simply taking the absolute value of the filter response (see Fig. 25.3). Since the LoG function is based on derivatives, its magnitude strongly depends on the steepness of the Gaussian slope, which is controlled by σ. To obtain responses of comparable magnitude over multiple scales, a scale-normalized LoG kernel can be defined in the form [173]

    L̂σ(x, y) = σ²·∇²Gσ(x, y) = σ²·Lσ(x, y)                             (25.9)
              = 1/(π·σ²) · ( (x²+y²)/(2σ²) − 1 ) · e^(−(x²+y²)/(2σ²)).  (25.10)

Note that the integral of this function, i.e.,

    ∫_{−∞}^{∞} ∫_{−∞}^{∞} |L̂σ(x, y)| dx dy = 4/e,                      (25.11)



To produce a sufficiently accurate discrete LoG filter kernel, the support radius should be set to at least 4σ (kernel diameter ≥ 8σ). The LoG is often used as a model for early processes in biological vision systems [183], particularly to describe the center-surround response of receptive fields. In this model, an “on-center” cell is stimulated when the center of its receptive field is exposed to light, and is inhibited when light falls on its surround. Conversely, an “off-center” cell is stimulated by light falling on its surround. Thus filtering with the original LoG Lσ (Eqn. (25.6)) corresponds to the behavior of off -center cells, while the response to the negative LoG kernel −Lσ is that of an on-center cell.
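As a concrete illustration of the scale-normalized LoG of Eqn. (25.10) and of the kernel-size rule stated above (support radius of at least 4σ), the following sketch builds a discrete LoG kernel. It is not taken from the imagingbook library; the method name and the array layout are assumptions made for illustration.

    // Discrete scale-normalized LoG kernel (Eqn. (25.10)) with support radius 4*sigma.
    static float[][] makeLoGKernel(double sigma) {
        int r = (int) Math.ceil(4 * sigma);     // support radius >= 4*sigma
        int size = 2 * r + 1;
        float[][] K = new float[size][size];
        for (int x = -r; x <= r; x++) {
            for (int y = -r; y <= r; y++) {
                double q = (x * x + y * y) / (2 * sigma * sigma);
                // L^_sigma(x, y) = 1/(pi*sigma^2) * (q - 1) * exp(-q)
                K[x + r][y + r] =
                    (float) ((q - 1) * Math.exp(-q) / (Math.PI * sigma * sigma));
            }
        }
        return K;
    }
    // Filtering with the negated kernel -K enhances bright blobs of radius ~sigma
    // (cf. Fig. 25.3).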



Fig. 25.3 Filtering with the LoG kernel (with σ = 3). Original images (a). A linear filter with the LoG kernel Lσ(x, y) responds strongest to dark spots in a bright surround (b), while the inverted kernel −Lσ(x, y) responds strongest to bright spots in a dark surround (c). In (b, c), zero values are shown as medium gray, negative values are dark, positive values are bright. The absolute value of (b) or (c) combines the responses from both dark and bright spots (d).






Fig. 25.4 Scale-normalized LoG function. Cross section of the non-normalized LoG function Lσ(x, y) as defined in Eqn. (25.6) (a); scale-normalized LoG (b) as defined in Eqn. (25.10), for scales σ = 1.0 (orange), σ = √2 (green), and σ = 2.0 (blue). All three functions in (b) have the same absolute value integral (see Eqn. (25.11)).



is constant and thus (unlike Eqn. (25.7)) independent of the scale parameter σ (see Fig. 25.4).

LoG-Approximation by the Difference of Two Gaussians (DoG)

Although the LoG is “quasi-separable” [135, 276] and can thus be calculated efficiently, the most common method for implementing






LoG filters is by approximation with the difference of two Gaussians (DoG), i.e.,

    Lσ(x, y) ≈ Dσ,κ(x, y) = λ·[Gb(x, y) − Ga(x, y)].                   (25.12)



The two Gaussians Ga , Gb have widths σa = σ and σb = κ·σ, respectively, related by the constant factor κ > 1 (i.e., Gb is wider than Ga ). If the common scale factor λ is chosen properly (see Eqn. (25.13) below), the DOG function Dσ,κ () can be tuned to approximate the LoG function Lσ () with arbitrary precision, as κ approaches 1 (κ = 1 being excluded, of course). In practice, values of κ in the range 1.1, . . . , 1.3 yield sufficiently accurate results. The example in Fig. 25.5 (b) shows the cross-section of the 2D DoG function for κ = 21/3 ≈ 1.25992 in comparison to the LoG function.5 Fig. 25.5 Approximating the LoG by the difference of two Gaussians (DoG). Two Gaussians Ga , Gb , with σa = 1.0 and σb = σa ·κ (with κ = 21/3 ), respectively, and their difference Gb −Ga (a). In (b), the blue, dashed line shows the LoG function in comparison to the DoG (orange). The DoG is scaled to match the magnitude of the LoG function.



0.2 0.05



Ga -3



-2



LoG 1



-1



0.1



Gb -4



-0.1



3



x



-0.1



2



-2



2



DoG



4



x



-0.2



Gb −Ga



-0.3



(a) difference of 2 Gaussians



(b) LoG vs. DoG



The factor λ ∈ R in Eqn. (25.12) controls the magnitude of the DoG function; it depends on both the ratio κ and the scale parameter σ. To match the magnitude of the original LoG (Eqn. (25.6)) at the origin, it must be set to λ=



2 κ2 σ 2 ·(κ2 −1)



.



(25.13)



ˆ σ (see Eqn. (25.10)) can be Similarly, the scale-normalized LoG, L approximated by the DoG function Dσ,κ (Eqn. (25.12)) as 2 ˆ σ (x, y) = σ 2 Lσ (x, y) ≈ σ 2 ·λ ·Dσ,κ (x, y) = 2κ ·Dσ,κ (x, y), L | {z } κ2 −1 ˆ λ



(25.14)



ˆ = σ 2 ·λ = 2κ2 /(κ2 −1) being constant and with the modified factor λ therefore independent of the scale parameter σ. Thus, as pointed out in [175], with a fixed scale increment κ, the DoG already approximates the scale-normalized LoG up to a constant factor, and thus no additional scaling is required to compare the magnitudes of the DoG responses obtained at different scales.6 5



6



714



The factor κ = 21/3 originates from splitting the scale interval 2 (i.e., one scale octave) into 3 equal intervals, as described later on. Another factor mentioned frequently in the literature is 1.6, which, however, does not yield a satisfactory approximation. Possibly that value refers to the ratio of the variances σ22 /σ12 and not the ratio of the standard deviations σ2 /σ1 . See Sec. H.4 in the Appendix for additional details.



G(x, y, σ)



continuous Gaussian scale space



G = (G0 , . . . , GK −1 )



discrete Gaussian scale space with K levels



Gk



single level in a discrete Gaussian scale space



L = (L0 , . . . , LK −1 )



discrete LoG scale space with K levels



Lk



single level in a LoG scale space



D = (D0 , . . . , DK −1 )



discrete DoG scale space with K levels



Dk



single level in a DoG scale space



G = (G0 , . . . , GP −1 )



hierarchical Gaussian scale space with P octaves



Gp,q



single level in a hierarchical Gaussian scale space



D = (D0 , . . . , DP −1 )



hierarchical DoG scale space with P octaves



25.1 Interest Points at Multiple Scales Table 25.1 Scale space-related symbols used in this chapter.



Gp = (Gp,0 , . . . , Gp,Q−1 ) octave in a hier. Gaussian scale space with Q levels



Dp = (Dp,0 , . . . , Dp,Q−1 ) octave in a hierarchical DoG scale space with Q levels Dp,q



single level in a hierarchical DoG scale space



Nc (i, j, k)



3 × 3 × 3 neigborhood in DoG scale space



k = (p, q, u, v) k′ = (p, q, x, y)



discrete key point position in hierarchical scale space (p, q, u, v ∈ Z) continuous (refined) key point position (x, y ∈ R)



In the SIFT approach, the DoG is used as an approximation of the (scale-normalized) LoG filter at multiple scales, based on a Gaussian scale space representation of the input image that is described next. For a quick reference, Table 25.1 lists the most important scale spacerelated symbols used in this chapter. 25.1.2 Gaussian Scale Space The concept of scale space [172] is motivated by the observation that real-world scenes exhibit relevant image features over a large range of sizes and, depending on the particular viewing situation, at various different scales. To relate image structures at different and unknown sizes, it is useful to represent the images simultaneously at different scale levels. The scale space representation of an image adds scale as a third coordinate (in addition to the two image coordinates). Thus the scale space is a 3D structure, which can be navigated not only along the x/y positions but also across different scale levels. Continuous Gaussian Scale Space The scale-space representation of an image at a particular scale level is obtained by filtering the image with a kernel that is parameterized to the desired scale. Because of its unique properties [16, 87], the most common type of scale space is based on successive filtering with Gaussian kernels. Conceptually, given a continuous, 2D function F (x, y), its Gaussian scale space representation is a 3D function G(x, y, σ) = (F ∗ HσG )(x, y),



(25.15)



where HσG ≡ Gσ (x, y) is a 2D Gaussian kernel (see Eqn. (25.5)) with unit integral, and ∗ denotes the linear convolution over x, y. Note that σ ≥ 0 serves both as a continuous scale parameter and the width of the corresponding Gaussian filter kernel. A fully continuous Gaussian scale space G(x, y, σ) covers a 3D volume and represents the original function F (x, y) at varying scales



715



25 Scale-Invariant Feature Transform (SIFT)



σ. For σ = 0, the Gaussian kernel H0G has zero width, which makes it equivalent to an impulse or Dirac function δ(x, y).7 This is the neutral element of linear convolution, that is, G(x, y, 0) = (F ∗ H0G )(x, y) = (F ∗ δ)(x, y) = F (x, y).



(25.16)



Thus the base level G(x, y, 0) of the Gaussian scale space is identical to the input function F (x, y). In general (with σ > 0), the Gaussian kernel HσG acts as a low-pass filter with a cutoff frequency proportional to 1/σ (see Sec. H.3 in the Appendix), the maximum frequency (or bandwidth) of the original “signal” F (x, y) being potentially unlimited. Discrete Gaussian Scale Space This is different for a discrete input function I(u, v), whose bandwidth is implicitly limited to half the sampling frequency, as mandated by the sampling theorem to avoid aliasing.8 Thus, in the discrete case, the lowest level G(x, y, 0) of the Gaussian scale space is not accessible! To model the implicit bandwidth limitations of the sampling process, the discrete input image I is assumed to be pre-filtered (with respect to the underlying continuous signal) with a Gaussian kernel of width σs ≥ 0.5 [175], that is, G(u, v, σs ) ≡ I.



(25.17)



Thus the discrete input image I is implicitly considered the bottom level σs of the Gaussian scale space, and the lower levels (with σ < σs ) are not available. Any higher level σh > σs of the Gaussian scale space can be derived from the original image I by filtering with the Gaussian kernel Hσ¯G , that is, q G(u, v, σh ) = I ∗ Hσ¯G , with σ ¯ = σh2 − σ2s . (25.18) This is due to the fact that applying two Gaussian filters of widths σ1 and σ2 , one after the other, is equivalent to a single convolution with a Gaussian kernel of width σ12 , i.e.,9  I ∗ HσG1 ∗ HσG2 ≡ I ∗ HσG12 , (25.19) with



σ12 =



q



σ12 + σ22 .



(25.20)



We define the discrete Gaussian scale space representation of an image I as a vector of M images, one for each scale level m: G = (G0 , G1 , . . . , GM −1 ). 7 8 9



716



(25.21)



See Sec. 4.3.4. See Sec. 18.2.1. See Sec. H.1 in the Appendix for additional details on combining Gaussian filters.



Associated with each level Gm is its absolute scale σm > 0, and each 25.1 Interest Points at level Gm represents a blurred version of the original image, that Multiple Scales is, Gm (u, v) ≡ G(u, v, σm ) in the notation introduced in Eqn. (25.15). The scale ratio between adjacent scale levels, ∆σ =



σm+1 , σm



(25.22)



is pre-defined and constant. Usually, ∆σ is specified such that the absolute scale σm doubles with a given number of levels Q, called an octave. In this case, the resulting scale increment is ∆σ = 21/Q with (typically) Q = 3, . . . , 6. In addition, a base scale σ0 > σs is specified for the initial level G0 , with σs denoting the smoothing of the discrete image implied by the sampling process, as discussed already. Based on empirical results, a base scale of σ0 = 1.6 is recommended in [175] to achieve reliable interest point detection. Given Q and the base scale σ0 , the absolute scale at an arbitrary scale space level Gm is m/Q σm = σ0 · ∆m , σ = σ0 · 2



(25.23)



for m = 0, . . . , M −1. It follows from Eqn. (25.18) that each scale level Gm can be obtained directly from the discrete input image I by a convolution Gm = I ∗ Hσ¯Gm ,



(25.24)



with a Gaussian kernel Hσ¯Gm of width σ ¯m =



p



2 − σ2 = σm s



q



σ02 · 22m/Q − σ2s .



(25.25)



In particular, the initial scale space level G0 , (with the specified base scale σ0 > σs ) is obtained from the discrete input image I by linear filtering using a Gaussian kernel of width q σ ¯0 = σ02 − σ2s . (25.26)



Alternatively, using the relation σm = σm−1 ·∆σ (see Eqn. (25.22)), the scale levels G1 , . . . , GM −1 could be calculated recursively from the base level G0 in the form Gm = Gm−1 ∗ HσGm ′ ,



(25.27)



for m > 0, with a sequence of Gaussian kernels HσGm of width ′ ′ σm =



q



m/Q 2 − σ2 σm · m−1 = σ0 ·2



p



1 − 1/∆2σ .



(25.28)



Table 25.2 lists the resulting kernel widths for Q = 3 levels per octave and base scale σ0 = 1.6 over a scale range of 6 octaves. The value σ ¯m denotes the size of the Gaussian kernel required to compute the image at scale m from the discrete input image I (assumed to be ′ sampled with σs = 0.5). σm is the width of the Gaussian kernel to compute level m recursively from the previous level m−1. Apparently



717



25 Scale-Invariant Feature Transform (SIFT)



Table 25.2 Filter sizes required for calculating Gaussian scale levels Gm for the first 6 octaves. Each octave consists of Q = 3 levels, placed at increments of ∆σ along the scale coordinate. The discrete input image I is assumed to be pre-filtered with σs . Column σm denotes the absolute scale at level m, starting with the specified base offset scale σ0 . σ ¯m is the width of the Gaussian filter required to calculate level Gm directly from the input image ′ I. Values σm are the widths of the Gaussian kernels required to calculate level Gm from the previous level Gm−1 . Note that the width of the Gaussian kernels needed for recursive ′ filtering (σm ) grows at the same exponential rate as the size of the direct filter (¯ σm ).



(though perhaps unexpectedly), the kernel size required for recursive ′ filtering (by σm ) grows at the same (exponential) rate as the absolute kernel size σ ¯m .10 At scale level m = 16 and absolute scale σ16 = 1.6·216/3 ≈ 64.5, for example, the Gaussian filters required to compute G16 directly 2 from the input image I has the width σ ¯16 = (σ16 − σ2s )1/2 = (64.50802 2 1/2 − 0.5 ) ≈ 64.5, while the filter to blur incrementally from the ′ 2 2 1/2 previous scale level has the width σ16 = (σ16 − σ15 ) = (64.50802 2 1/2 − 51.1976 ) ≈ 39.2. Since recursive filtering also tends to accrue numerical inaccuracies, this approach does not offer a significant advantage in general. Fortunately, the growth of the Gaussian kernels can be kept small by spatially subsampling after each octave, as will be described in Sec. 25.1.4. m



σm



σ ¯m



′ σm



18 102.4000 102.3988 62.2908 17 81.2749 81.2734 49.4402 16 64.5080 64.5060 39.2408 15 51.2000 51.1976 31.1454 14 40.6375 40.6344 24.7201 13 32.2540 32.2501 19.6204 12 25.6000 25.5951 15.5727 11 20.3187 20.3126 12.3601 10 16.1270 16.1192 9.8102 9 12.8000 12.7902 7.7864 8 10.1594 10.1471 6.1800 7 8.0635 8.0480 4.9051 6 6.4000 6.3804 3.8932 5 5.0797 5.0550 3.0900 4 4.0317 4.0006 2.4525 3 3.2000 3.1607 1.9466 2 2.5398 2.4901 1.5450 1 2.0159 1.9529 1.2263 0 1.6000 1.5199 —



m . . . linear scale index σm . . . absolute scale at level m (Eqn. (25.23)) σ ¯m . . . relative scale at level m w.r.t. the original image (Eqn. (25.25)) ′ σm . . . relative scale at level m w.r.t. the previous level m−1 (Eqn. (25.28))



σs = 0.5 (sampling scale) σ0 = 1.6 (base scale) Q = 3 (levels per octave) ∆σ = 21/Q ≈ 1.256



The process of constructing a discrete Gaussian scale space, using the same parameters as listed in Table 25.2, is illustrated in Fig. 25.6. Again the input image I is assumed to be pre-filtered at σs = 0.5 due to sampling and the absolute scale of the first level G0 is set to σ0 = 1.6. The scale ratio between successive levels is fixed at ∆σ = 21/3 ≈ 1.25992, that is, each octave spans three discrete scale levels. As shown in this figure, each scale level Gm can be calculated either directly from the input image I by filtering with a Gaussian of width σ ¯m , or recursively from the previous level by filtering with ′ σm . 25.1.3 LoG/DoG Scale Space



718



Interest point detection in the SIFT approach is based on finding local maxima in the output of LoG filters over multiple scales. Analogous p 10 ′



The ratio of the kernel sizes σ ¯m /σm converges to 1−1/∆2σ (≈ 1.64 for Q = 3). It is thus practically constant for larger values of m.



G3 σ ¯3 = 3.1607



25.1 Interest Points at Multiple Scales σ3′ = 1.9466



σ3 = 3.2000 G2



∆σ



σ ¯2 = 2.4901



σ2′ = 1.5450



σ2 = 2.5398 G1 ∆σ σ ¯1 = 1.9529



σ1′ = 1.2263



σ1 = 2.0159 G0 ∆σ σ ¯0 = 1.5199 σ0 = 1.6000



Fig. 25.6 Gaussian scale space construction (first four levels). Parameters are the same as listed in Table 25.2. The discrete input image I is assumed to be (implicitly) pre-filtered with a Gaussian of width σs = 0.5; the scale of the initial level (base scale offset) is set to σ0 = 1.6. The discrete scale space levels G0 , G1 , . . . (at absolute scales σ0 , σ1 , . . .) are slices through the continuous scale space. Scale levels can either be calculated by filtering directly from the discrete image I with Gaussian kernels of width σ ¯0 , σ ¯1 , . . . (blue arrows) or, alternatively, by recursively filtering with σ1′ , σ2′ , . . . (green arrows).



I



σs = 0.5000



to the discrete Gaussian scale space described in Sec. 25.1.2, a LoG scale space representation of an image I can be defined as L = (L0 , L1 , . . . , LM −1 ),



(25.29)



ˆ σ (x, y) is a scalewith levels Lm = I ∗ HσLm , where HσLm (x, y) ≡ L m normalized LoG kernel of width σm (see Eqn. (25.10)). As demonstrated in Eqn. (25.12), the LoG kernel can be approximated by the the difference of two Gaussians whose widths differ by a certain ratio κ. Since pairs of adjacent scale layers in the Gaussian scale space are also separated by a fixed scale ratio, it is straightforward to construct a multi-scale DoG representation, D = (D0 , D1 , . . . , DM −2 )



(25.30)



from an existing Gaussian scale space G = (G0 , G1 , . . . , GM −1 ). The individual levels in the DoG scale space are defined as ˆ m+1 − Gm ) ≈ Lm , Dm = λ·(G



(25.31)



ˆ (defined in Eqn. (25.14)) for m = 0, . . . , M −2. The constant factor λ can be omitted in the above expression, as the relative width of the involved Gaussians, κ = ∆σ =



σm+1 = 21/Q , σm



(25.32) 719



25 Scale-Invariant Feature Transform (SIFT) Fig. 25.7 DoG scale-space construction. The differences of successive levels G0 , G1 , . . . of the Gaussian scale space (see Fig. 25.6) are used to approximate a LoG scale space. Each DoG-level Dm is calculated as the point-wise difference Gm+1 − Gm between Gaussian levels Gm+1 and Gm . The values in D0 , . . . , D3 are scale-normalized (see Eqn. (25.14)) and mapped to a uniform intensity range for better viewing.



G3



D3



G2



D2



G1



D1



G0



D0



σ3 = 3.2000



∆σ σ2 = 2.5398



∆σ σ1 = 2.0159



∆σ σ0 = 1.6000



I



σs = 0.5000



is simply the fixed scale ratio ∆σ between successive scale space levels. Note that the DoG approximation does not require any additional normalization to approximate a scale-normalized LoG representation (see Eqns. (25.10) and (25.14)). The process of calculating a DoG scale space from a discrete Gaussian scale space is illustrated in Fig. 25.7, using the same parameters as in Table 25.2 and Fig. 25.6. 25.1.4 Hierarchical Scale Space Despite the fact that 2D Gaussian filter kernels are separable into 1D kernels,11 the size of the required filter grows quickly with increasing scale, regardless if a direct or recursive approach is used (as shown in Table 25.2). However, each Gaussian filter operation reduces the bandwidth of the signal inversely proportional to the width of the kernel (see Sec. H.3 of the Appendix). If the image size is kept constant over all scales, the images become increasingly oversampled at higher scale levels. In other words, the sampling rate in a Gaussian scale space can be reduced with increasing scale without losing relevant signal information. Octaves and Sub-Sampling (Decimation) In particular, doubling the scale cuts the bandwidth by half, that is, the signal at scale level 2σ has only half the bandwidth of the signal 11



720



See also Sec. 4.3.3.



at level σ. An image signal at scale level 2σ of a Gaussian scale space 25.1 Interest Points at thus shows only half the bandwidth of the same image at scale level Multiple Scales σ. In a Gaussian scale space representation it is thus safe to downsample the image to half the sample rate after each octave without any loss of information. This suggests a very efficient, “pyramidstyle” approach for constructing a DoG scale space, as illustrated in Fig. 25.8.12 At the start (bottom) of each octave, the image is down-sampled to half the resolution, that is, each pixel in the new octave covers twice the distance of the pixels in the previous octave in every spatial direction. Within each octave, the same small Gaussian kernels can be used for successive filtering, since their relative widths (with respect to the original sampling lattice) also implicitly double at each octave. To describe these relations formally, we use G = (G0 , G1 , . . . , GP −1 )



(25.33)



to denote a hierarchical Gaussian scale space consisting of P octaves. Each octave  Gp = Gp,0 , Gp,1 , . . . , Gp,Q , (25.34)



consists of Q+1 scale levels Gp,q , where p ∈ [0, P −1] is the octave index and q ∈ [0, Q] is the level index within the containing octave Gp . In terms of absolute scale, a level Gp,q in the hierarchical Gaussian scale space corresponds to level Gm in the non-hierarchical Gaussian scale space (see Eqn. (25.21)) with index m = Q · p + q.



(25.35)



As follows from Eqn. (25.23), the absolute scale at level Gp,q thus is m/Q σp,q = σm = σ0 ·∆m σ = σ0 ·2



= σ0 ·2(Qp+q)/Q = σ0 ·2p+q/Q ,



(25.36)



where σ0 = σ0,0 denotes the predefined base scale offset (e.g., σ0 = 1.6 in Table 25.2). In particular, the absolute scale of the base level Gp,0 of any octave Gp is σp,0 = σ0 ·2p .



(25.37)



The decimated scale σ˙ p,q is the absolute scale σp,q (see Eqn. (25.36)), expressed in coordinate units of octave Gp , that is, σ˙ p,q = σ˙ q = σp,q ·2−p = σ0 ·2p+q/Q ·2−p = σ0 ·2q/Q .



(25.38)



Note that the decimated scale σ˙ p,q is independent of the octave index p and therefore σ˙ p,q ≡ σ˙ q , for any level index q. From the octave’s base level Gp,0 , the subsequent levels in the same octave can be calculated by concolving with relatively small 12



Successive reduction of image resolution by subsampling is the key concept of “image pyramid” methods [50].



721



25 Scale-Invariant Feature Transform (SIFT) Fig. 25.8 Hierarchical Gaussian scale space. Each octave extends over Q = scale steps. The base level Gp,0 of each octave p > 0 is obtained by 2:1 subsampling of the top level Gp−1,3 of the next-lower octave. At the transition between octaves, the resolution (image size) is cut in half in the x- and y-direction. The absolute scale at octave level Gp,q is σm , with m = Qp+q. Within each octave, the same set of Gaussian kernels (˜ σ1 , σ ˜2 , σ ˜3 ) is used to calculate the following levels from the octave’s base level Gp,0 .



G3,3 [σ12 ] G3,2 [σ11 ] G3,1 [σ10 ] G3,0 [σ9 ]



Octave G3



G2,3 [σ9 ] G2,2 [σ8 ] G2,1 [σ7 ] G2,0 [σ6 ]



Octave G2



σ ˜3 σ ˜2 σ ˜1



G1,3 [σ6 ] G1,2 [σ5 ] G1,1 [σ4 ] G1,0 [σ3 ]



Octave G1



Octave G0



σ ˜3 σ ˜2 σ ˜1



2:1 subsampling σ ˜3 σ ˜2 σ ˜1



σ ˜3 σ ˜2 σ ˜1



G0,3 [σ3 ] G0,2 [σ2 ] G0,1 [σ1 ] G0,0 [σ0 ]



I (with σs )



Gaussian kernels. The size of the kernel needed to calculate scalelevel Gp,q from the octave’s base level Gp,0 is obtained from the corresponding decimated scales (Eqn. (25.38)) as q q p 2 = 2 −σ σ ˜p,q = σ˙ p,q ˙ p,0 (σ0 ·2q/Q )2 −σ02 = σ0 · 22q/Q −1 , (25.39) for q ≥ 0. Note that σ ˜q is independent of the octave index p and thus the same filter kernels can be used at each octave. For example, with Q = 3 and σ0 = 1.6 (see Table 25.2) the resulting kernel widths are σ ˜1 = 1.2263,



σ ˜2 = 1.9725,



σ ˜3 = 2.7713.



Also note that, instead of filtering all scale levels Gp,q in an octave from the corresponding base level Gp,0 , we could calculate them recursively from the next-lower level Gp,q−1 . While this approach requires even smaller Gaussian kernels (and is thus more efficient), recursive filtering tends to accrue numerical inaccuracies. Nevertheless, the method is used frequently in scale-space implementations. Decimation Between Successive Octaves With M × N being the size of the original image I, every subsampling step between octaves cuts the image size by half, i.e., j k j k M N Mp+1 × Np+1 = 2p × 2p , (25.40)



for octaves with index p ≥ 0. The image size at some octave Gp is thus j k j k N0 0 Mp × N p = M (25.41) 2p × 2p . 722



Alg. 25.1 Building a hierarchical Gaussian scale space. The input image I is first blurred to scale σ0 by filtering with a Gaussian kernel of width σ̄0 (line 3). In each octave Gp, the scale levels Gp,q are calculated from the base level Gp,0 by filtering with a set of Gaussian filters of width σ̃1, . . . , σ̃Q (lines 13–14). The base level Gp,0 of each higher octave is obtained by subsampling the top level Gp−1,Q of the previous octave (line 6).

 1: BuildGaussianScaleSpace(I, σs, σ0, P, Q)
    Input: I, source image; σs, sampling scale; σ0, reference scale of the first
    octave; P, number of octaves; Q, number of scale steps per octave.
    Returns a hierarchical Gaussian scale space representation G of the image I.
 2:   σ̄0 ← (σ0² − σs²)^(1/2)                        ▷ scale to base of 1st octave, Eqn. (25.26)
 3:   Ginit ← I ∗ H^G_σ̄0                             ▷ apply 2D Gaussian filter of width σ̄0
 4:   G0 ← MakeGaussianOctave(Ginit, 0, Q, σ0)       ▷ create octave G0
 5:   for p ← 1, . . . , P−1 do                      ▷ octave index p
 6:     Gnext ← Decimate(Gp−1,Q)                     ▷ decimate top level of octave p−1
 7:     Gp ← MakeGaussianOctave(Gnext, p, Q, σ0)     ▷ create octave Gp
 8:   G ← (G0, . . . , GP−1)                         ▷ hierarchical Gaussian scale space G
 9:   return G

10: MakeGaussianOctave(Gbase, p, Q, σ0)
    Input: Gbase, octave base level; p, octave index; Q, number of levels per
    octave; σ0, reference scale.
11:   Gp,0 ← Gbase
12:   for q ← 1, . . . , Q do                        ▷ level index q
13:     σ̃q ← σ0 · (2^(2q/Q) − 1)^(1/2)              ▷ see Eqn. (25.39)
14:     Gp,q ← Gbase ∗ H^G_σ̃q                       ▷ apply 2D Gaussian filter with σ̃q
15:   Gp ← (Gp,0, . . . , Gp,Q)                      ▷ scale space octave Gp
16:   return Gp

17: Decimate(Gin)
    Input: Gin, Gaussian scale space level.
18:   (M, N) ← size(Gin)
19:   M′ ← ⌊M/2⌋, N′ ← ⌊N/2⌋                         ▷ decimated size
20:   Gout ← new map M′ × N′ ↦ R
21:   for all (u, v) ∈ M′ × N′ do
22:     Gout(u, v) ← Gin(2u, 2v)                     ▷ 2:1 subsampling
23:   return Gout                                    ▷ decimated scale level Gout

The base level Gp,0 of each octave Gp (with p > 0) is obtained by subsampling the top level Gp−1,Q of the next-lower octave Gp−1 as

    Gp,0 = Decimate(Gp−1,Q),    (25.42)

where Decimate(G) denotes the 2:1 subsampling operation, that is,

    Gp,0(u, v) ← Gp−1,Q(2u, 2v),    (25.43)

for each sample position (u, v) ∈ [0, Mp−1] × [0, Np−1]. No additional low-pass filtering is required prior to subsampling since the Gaussian smoothing performed in each octave also cuts the bandwidth by half. The main steps involved in constructing a hierarchical Gaussian scale space are summarized in Alg. 25.1.
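A minimal Java sketch of the 2:1 decimation step (Eqn. (25.43)) is shown below; it assumes a scale level is stored as a float[][] array indexed as g[v][u] and is only meant to make the index arithmetic concrete:

    // Take every second sample in x and y; output size is floor(M/2) x floor(N/2).
    static float[][] decimate(float[][] gIn) {
        int N = gIn.length, M = gIn[0].length;   // original size M x N
        int Mp = M / 2, Np = N / 2;              // decimated size
        float[][] gOut = new float[Np][Mp];
        for (int v = 0; v < Np; v++)
            for (int u = 0; u < Mp; u++)
                gOut[v][u] = gIn[2 * v][2 * u];  // 2:1 subsampling, Eqn. (25.43)
        return gOut;
    }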






In summary, the input image I is first blurred to scale σ0 by filtering with a Gaussian kernel of width σ̄0. Within each octave Gp, the scale levels Gp,q are calculated from the base level Gp,0 by filtering with a set of Gaussian filters of width σ̃q (q = 1, . . . , Q). Note that the values σ̃q and the corresponding Gaussian kernels H^G_σ̃q can be pre-calculated once since they are independent of the octave index p (see Alg. 25.1, lines 13–14). The base level Gp,0 of each higher octave Gp is obtained by decimating the top level Gp−1,Q of the previous octave Gp−1. Typical parameter settings are σs = 0.5, σ0 = 1.6, Q = 3, P = 4.

Spatial Positions in Hierarchical Scale Space

To properly associate the spatial positions of features detected in different octaves of the hierarchical scale space we define the function x0 ← AbsPos(xp, p), which maps the continuous position xp = (xp, yp) in the local coordinate system of octave p to the corresponding position x0 = (x0, y0) in the coordinate system of the original full-resolution image I (i.e., in octave p = 0). The function AbsPos() can be defined recursively by relating the positions in the next-lower octave, that is,

    AbsPos(xp, p) = { xp                    for p = 0,
                    { AbsPos(2·xp, p−1)     for p > 0,    (25.44)

which gives x0 = 2·x1 = 4·x2 = . . . = 2^p·xp, and thus

    x0 = AbsPos(xp, p) = 2^p · xp.    (25.45)

Conversely, given some point x0 in original image coordinates, the corresponding pixel position at octave p is

    xp = x0 · 2^(−p).    (25.46)
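As a small, hedged illustration of Eqns. (25.45)–(25.46), the following hypothetical helper converts a continuous octave-local position to full-resolution image coordinates:

    // Map a position (xp, yp) in octave p to original image coordinates
    // (Eqn. 25.45); dividing by 2^p instead gives the reverse mapping (Eqn. 25.46).
    static double[] absPos(double xp, double yp, int p) {
        double s = Math.pow(2, p);    // 2^p
        return new double[] { s * xp, s * yp };
    }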



Hierarchical LoG/DoG Scale Space

Analogous to the scheme shown in Fig. 25.7, a hierarchical DoG scale space representation is obtained by calculating the difference of adjacent Gaussian scale levels within each octave of the hierarchical scale space, i.e.,

    Dp,q = Gp,q+1 − Gp,q,    (25.47)

for level numbers q ∈ [0, Q−1]. Figure 25.9 shows the corresponding Gaussian and DoG scale levels for the previous example over a range of three octaves. To demonstrate the effects of subsampling, the same information is shown in Figs. 25.10–25.11, with all level images scaled to the same size. Figure 25.11 also shows the absolute values of the DoG response, which are effectively used for detecting interest points at different scale levels. Note how blob-like features stand out and disappear again as the scale varies from fine to coarse. Analogous results obtained from a different image are shown in Figs. 25.12 and 25.13.

25.1.5 Scale Space Structure in SIFT






In the SIFT approach, the absolute value of the DoG response is used to localize interest points at different scales. For this purpose, local maxima are detected in the 3D space spanned by the spatial x/y positions and the scale coordinate. To determine local maxima along the scale dimension over a full octave, two additional DoG levels,






Fig. 25.9 Hierarchical Gaussian and DoG scale space example, with P = Q = 3. Gaussian scale space levels Gp,q are shown in the left column, DoG levels Dp,q = Gp,q+1 − Gp,q in the right column. All images are shown at their real scale.






Dp,−1 and Dp,Q, and two additional Gaussian scale levels, Gp,−1 and Gp,Q+1, are required in each octave. In total, each octave Gp consists of Q+3 Gaussian scale levels Gp,q (q = −1, . . . , Q+1) and Q+2 DoG levels Dp,q (q = −1, . . . , Q), as shown in Fig. 25.14. For the base level G0,−1, the scale index is m = −1 and its absolute scale (see Eqns. (25.23) and (25.36)) is

    σ0,−1 = σ0 · 2^(−1/Q) = σ0 / Δσ.    (25.48)



Fig. 25.10 Hierarchical Gaussian scale space example (castle image). All images are scaled to the same size. Note that G1,0 is merely a sub-sampled copy of G0,3; analogously, G2,0 is sub-sampled from G1,3.






Thus, with the usual settings (i.e., σ0 = 1.6 and Q = 3), the absolute scale values for the six levels of the first octave are

    σ0,−1 = 1.2699,  σ0,0 = 1.6000,  σ0,1 = 2.0159,  σ0,2 = 2.5398,  σ0,3 = 3.2000,  σ0,4 = 4.0317.

The complete set of scale values for a SIFT scale space with four octaves (p = 0, . . . , 3) is listed in Table 25.3. To build the Gaussian part of the first scale space octave G0, the initial level G0,−1 is obtained by filtering the input image I with a Gaussian kernel of width

    σ̄0,−1 = (σ²0,−1 − σs²)^(1/2) = (1.2699² − 0.5²)^(1/2) ≈ 1.1673.    (25.49)



For the higher octaves (p > 0), the initial level (q = −1) is obtained by subsampling (decimating) level Q−1 of the next-lower octave Gp−1, that is,

    Gp,−1 ← Decimate(Gp−1,Q−1),    (25.50)

analogous to Eqn. (25.42).






Fig. 25.11 Hierarchical DoG scale space example (castle image). The three top rows show the positive and negative DoG values (zero is mapped to intermediate gray). The three bottom rows show the absolute values of the DoG results (zero is mapped to black, maximum values to white). All images are scaled to the size of the original image.






The remaining levels Gp,0, . . . , Gp,Q+1 of the octave are either calculated by incremental filtering (as illustrated in Fig. 25.6) or by filtering from the octave's initial level Gp,−1, with a Gaussian of width σ̃p,q (see Eqn. (25.39)). The advantage of the direct approach is that numerical errors do not accrue across the scale space; the disadvantage is that the kernels are up to 50 % larger than those needed for the incremental approach (σ̃0,4 = 3.8265 vs. σ′0,4 = 2.4525). Note that the inner levels Gp,q of all higher octaves









Fig. 25.12 Hierarchical Gaussian scale space example (stars image).






Table 25.3 Absolute and relative scale values for a SIFT scale space with four octaves. Each octave with index p = 0, . . . , 3 consists of 6 Gaussian scale layers Gp,q, with q = −1, . . . , 4. For each scale layer, m is the scale index and σp,q is the corresponding absolute scale. Within each octave p, σ̃p,q denotes the relative scale with respect to the octave's base layer Gp,−1. Each base layer Gp,−1 is obtained by subsampling (decimating) layer q = Q−1 = 2 in the previous octave, i.e., Gp,−1 = Decimate(Gp−1,Q−1), for p > 0. The base layer G0,−1 in the bottom octave is derived by Gaussian smoothing of the original image. Note that the relative scale values σ̃p,q = σ̃q are the same inside every octave (independent of p) and thus the same Gaussian filter kernels can be used for calculating all octaves.






    p    q    m    d    σp,q      σ̇q       σ̃q
    3    4   13    8    32.2540   4.0317   3.8265
    3    3   12    8    25.6000   3.2000   2.9372
    3    2   11    8    20.3187   2.5398   2.1996
    3    1   10    8    16.1270   2.0159   1.5656
    3    0    9    8    12.8000   1.6000   0.9733
    3   −1    8    8    10.1594   1.2699   0.0000
    2    4   10    4    16.1270   4.0317   3.8265
    2    3    9    4    12.8000   3.2000   2.9372
    2    2    8    4    10.1594   2.5398   2.1996
    2    1    7    4     8.0635   2.0159   1.5656
    2    0    6    4     6.4000   1.6000   0.9733
    2   −1    5    4     5.0797   1.2699   0.0000
    1    4    7    2     8.0635   4.0317   3.8265
    1    3    6    2     6.4000   3.2000   2.9372
    1    2    5    2     5.0797   2.5398   2.1996
    1    1    4    2     4.0317   2.0159   1.5656
    1    0    3    2     3.2000   1.6000   0.9733
    1   −1    2    2     2.5398   1.2699   0.0000
    0    4    4    1     4.0317   4.0317   3.8265
    0    3    3    1     3.2000   3.2000   2.9372
    0    2    2    1     2.5398   2.5398   2.1996
    0    1    1    1     2.0159   2.0159   1.5656
    0    0    0    1     1.6000   1.6000   0.9733
    0   −1   −1    1     1.2699   1.2699   0.0000

p . . . octave index; q . . . level index; m . . . linear scale index (m = Qp + q); d . . . decimation factor (d = 2^p); σp,q . . . absolute scale (Eqn. (25.36)); σ̇q . . . decimated scale (Eqn. (25.38)); σ̃q . . . relative decimated scale w.r.t. octave's base level Gp,−1 (Eqn. (25.39)). Settings: P = 4 (number of octaves), Q = 3 (levels per octave), σ0 = 1.6 (base scale).






Fig. 25.13 Hierarchical DoG scale space example (stars image). The three top rows show the positive and negative DoG values (zero is mapped to intermediate gray). The three bottom rows show the absolute values of the DoG results (zero is mapped to black, maximum values to white). All images are scaled to the size of the original image.






(i.e., p > 0, q ≥ 0) are calculated from the base level Gp,−1 , using the same set of kernels as for the first octave, as listed in Table 25.3. The complete process of building a SIFT scale space is summarized in Alg. 25.2.






Fig. 25.14 Scale space structure for SIFT with P = 3 octaves and Q = 3 levels per octave. To perform local maximum detection ("max") over the full octave, Q+2 DoG scale space levels (Dp,−1, . . . , Dp,Q) are required. The blue arrows indicate the decimation steps between successive Gaussian octaves. Since the DoG levels are obtained by subtracting pairs of Gaussian scale space levels, Q+3 such levels (Gp,−1, . . . , Gp,Q+1) are needed in each octave Gp. The two vertical axes on the left show the absolute scale (σ) and the discrete scale index (m), respectively. Note that the values along the scale axis are logarithmic with constant multiplicative scale increments Δσ = 2^(1/Q). The absolute scale of the input image (I) is assumed as σs = 0.5.






25.2 Key Point Selection and Refinement

Key points are identified in three steps: (1) detection of extremal points in the DoG scale space, (2) position refinement by local interpolation, and (3) elimination of edge responses. These steps are detailed in the following and summarized in Algs. 25.3–25.6.

25.2.1 Local Extrema Detection

In the first step, candidate interest points are detected as local extrema in the 3D DoG scale space that we described in the previous section. Extrema detection is performed independently within each octave p. For the sake of convenience we define the 3D scale space coordinate c = (u, v, q), composed of the spatial position (u, v) and the level index q, as well as the function

    Dp(c) = Dp,q(u, v)    (25.51)



as a short notation for selecting DoG values from a given octave p. Also, for collecting the DoG values in the 3D neighborhood around a scale space position c, we define the map Nc (i, j, k) = Dp (c + i·ex + j·ey + k·eσ ),



(25.52)



with i, j, k ∈ {−1, 0, 1} and the 3D unit vectors

    ex = (1, 0, 0)ᵀ,    ey = (0, 1, 0)ᵀ,    eσ = (0, 0, 1)ᵀ.    (25.53)



The neighborhood Nc includes the center element Dp (c) and its 26 immediate neighbors (see Fig. 25.15 (a)). These values are used to estimate the 3D gradient vector and the Hessian matrix for the 3D scale space position c, as will be described.



Alg. 25.2 Building a SIFT scale space. This procedure is an extension of Alg. 25.1 and takes the same parameters. The SIFT scale space (see Fig. 25.14) consists of two components: a hierarchical Gaussian scale space G = (G0, . . . , GP−1) with P octaves and a (derived) hierarchical DoG scale space D = (D0, . . . , DP−1). Each Gaussian octave Gp holds Q+3 levels (Gp,−1, . . . , Gp,Q+1). At each Gaussian octave, the lowest level Gp,−1 is obtained by decimating level Q−1 of the previous octave Gp−1 (line 7). Every DoG octave Dp contains Q+2 levels (Dp,−1, . . . , Dp,Q). A DoG level Dp,q is calculated as the pointwise difference of two adjacent Gaussian levels Gp,q+1 and Gp,q (line 23). Typical parameter settings are σs = 0.5, σ0 = 1.6, Q = 3, P = 4.

 1: BuildSiftScaleSpace(I, σs, σ0, P, Q)
    Input: I, source image; σs, sampling scale; σ0, reference scale of the first
    octave; P, number of octaves; Q, number of scale steps per octave.
    Returns a SIFT scale space representation ⟨G, D⟩ of the image I.
 2:   σinit ← σ0 · 2^(−1/Q)                          ▷ abs. scale at level (0, −1), Eqn. (25.48)
 3:   σ̄init ← (σinit² − σs²)^(1/2)                   ▷ relative scale w.r.t. σs, Eqn. (25.49)
 4:   Ginit ← I ∗ H^G_σ̄init                          ▷ 2D Gaussian filter with σ̄init
 5:   G0 ← MakeGaussianOctave(Ginit, 0, Q, σ0)       ▷ Gaussian octave 0
 6:   for p ← 1, . . . , P−1 do                      ▷ for octaves 1, . . . , P−1
 7:     Gnext ← Decimate(Gp−1,Q−1)                   ▷ see Alg. 25.1
 8:     Gp ← MakeGaussianOctave(Gnext, p, Q, σ0)     ▷ octave p
 9:   G ← (G0, . . . , GP−1)                         ▷ assemble the Gaussian scale space G
10:   for p ← 0, . . . , P−1 do
11:     Dp ← MakeDogOctave(Gp, p, Q)
12:   D ← (D0, . . . , DP−1)                         ▷ assemble the DoG scale space D
13:   return ⟨G, D⟩

14: MakeGaussianOctave(Gbase, p, Q, σ0)
    Input: Gbase, Gaussian base level; p, octave index; Q, scale steps per octave;
    σ0, reference scale. Returns a new Gaussian octave Gp with Q+3 levels.
15:   Gp,−1 ← Gbase                                  ▷ level q = −1
16:   for q ← 0, . . . , Q+1 do                      ▷ levels q = 0, . . . , Q+1
17:     σ̃q ← σ0 · (2^(2q/Q) − 2^(−2/Q))^(1/2)       ▷ rel. scale w.r.t. base level Gbase
18:     Gp,q ← Gbase ∗ H^G_σ̃q                       ▷ 2D Gaussian filter with σ̃q
19:   Gp ← (Gp,−1, . . . , Gp,Q+1)
20:   return Gp

21: MakeDogOctave(Gp, p, Q)
    Input: Gp, Gaussian octave; p, octave index; Q, scale steps per octave.
    Returns a new DoG octave Dp with Q+2 levels.
22:   for q ← −1, . . . , Q do                       ▷ levels q = −1, . . . , Q
23:     Dp,q ← Gp,q+1 − Gp,q                         ▷ difference of Gaussians, Eqn. (25.31)
24:   Dp ← (Dp,−1, Dp,0, . . . , Dp,Q)
25:   return Dp
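To make the DoG construction (MakeDogOctave) concrete, here is a small hedged Java sketch; it assumes the Gaussian levels of one octave are stored as a float[q][v][u] array of equal-sized levels in ascending scale order, which is not necessarily how the book's library organizes them:

    // Pointwise difference of adjacent Gaussian levels (Eqn. 25.47 / Alg. 25.2, line 23).
    static float[][][] makeDogOctave(float[][][] gauss) {
        int levels = gauss.length - 1;      // Q+2 DoG levels from Q+3 Gaussian levels
        float[][][] dog = new float[levels][][];
        for (int q = 0; q < levels; q++) {
            int N = gauss[q].length, M = gauss[q][0].length;
            dog[q] = new float[N][M];
            for (int v = 0; v < N; v++)
                for (int u = 0; u < M; u++)
                    dog[q][v][u] = gauss[q + 1][v][u] - gauss[q][v][u];
        }
        return dog;
    }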



A DoG scale space position c is accepted as a local extremum (minimum or maximum) if the associated value D(c) = Nc(0, 0, 0) is either negative and smaller than all neighboring values or positive and greater than all neighboring values. A minimum difference textrm ≥ 0 can be specified, indicating how much the center value must at least deviate from the surrounding values. The decision whether a given neighborhood Nc contains a local minimum or maximum can thus be expressed as (see procedure IsExtremum(Nc) in Alg. 25.5)

    IsLocalMin(Nc) := Nc(0,0,0) < 0  ∧  Nc(0,0,0) + textrm < min_{(i,j,k)≠(0,0,0)} Nc(i,j,k),    (25.54)

    IsLocalMax(Nc) := Nc(0,0,0) > 0  ∧  Nc(0,0,0) − textrm > max_{(i,j,k)≠(0,0,0)} Nc(i,j,k).    (25.55)
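A compact Java sketch of this test (Eqns. (25.54)–(25.55)) on a 3 × 3 × 3 neighborhood is shown below; the array layout n[k+1][j+1][i+1] with the candidate at n[1][1][1] is an assumption for illustration:

    static boolean isExtremum(float[][][] n, float tExtrm) {
        float c = n[1][1][1];
        boolean isMin = c < 0, isMax = c > 0;
        for (int k = 0; k < 3; k++)
            for (int j = 0; j < 3; j++)
                for (int i = 0; i < 3; i++) {
                    if (i == 1 && j == 1 && k == 1) continue;   // skip the center value
                    isMin &= (c + tExtrm) < n[k][j][i];         // Eqn. (25.54)
                    isMax &= (c - tExtrm) > n[k][j][i];         // Eqn. (25.55)
                }
        return isMin || isMax;
    }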






Fig. 25.15 Different 3D neighborhoods for detecting local extrema in the DoG scale space. The red cube represents the DoG value at the reference coordinate c = (u, v, q) at the spatial position (u, v) at scale level q (within some octave p). Full 3 × 3 × 3 neighborhood with 26 elements (a); other types of neighborhoods with 18 (b) or 10 (c) elements, respectively, are also commonly used. A local maximum/minimum is detected if the DoG value at the center is greater/smaller than all neighboring values (green cubes).






As illustrated in Fig. 25.15 (b–c), alternative 3D neighborhoods with 18 or 10 cells may be specified for extrema detection.

25.2.2 Position Refinement

Once a local extremum is detected in the DoG scale space, only its discrete 3D coordinates c = (u, v, q) are known, consisting of the spatial grid position (u, v) and the index (q) of the associated scale level. In the second step, a more accurate, continuous position for each candidate key point is estimated by fitting a quadratic function to the local neighborhood, as proposed in [45]. This is particularly important at the higher octaves of the scale space, where the spatial resolution becomes increasingly coarse due to successive decimation. Position refinement is based on a local second-order Taylor expansion of the discrete DoG function, which yields a continuous approximation function whose maximum or minimum can be found analytically. Additional details and illustrative examples are provided in Sec. D.2 of the Appendix.

At any extremal position c = (u, v, q) in octave p of the hierarchical DoG scale space D, the corresponding 3 × 3 × 3 neighborhood ND(c) is used to estimate the elements of the continuous 3D gradient, that is,

    ∇D(c) = (dx, dy, dσ)ᵀ ≈ (1/2)·(Dp(c+ex) − Dp(c−ex),  Dp(c+ey) − Dp(c−ey),  Dp(c+eσ) − Dp(c−eσ))ᵀ,    (25.56)

with Dp( ) as defined in Eqn. (25.51). Similarly, the 3 × 3 Hessian matrix for position c is obtained as

              ( dxx  dxy  dxσ )
    HD(c) =   ( dxy  dyy  dyσ ),    (25.57)
              ( dxσ  dyσ  dσσ )

with the required second-order derivatives estimated as

    dxx = Dp(c−ex) − 2·Dp(c) + Dp(c+ex),
    dyy = Dp(c−ey) − 2·Dp(c) + Dp(c+ey),
    dσσ = Dp(c−eσ) − 2·Dp(c) + Dp(c+eσ),
    dxy = [Dp(c+ex+ey) − Dp(c−ex+ey) − Dp(c+ex−ey) + Dp(c−ex−ey)] / 4,    (25.58)
    dxσ = [Dp(c+ex+eσ) − Dp(c−ex+eσ) − Dp(c+ex−eσ) + Dp(c−ex−eσ)] / 4,
    dyσ = [Dp(c+ey+eσ) − Dp(c−ey+eσ) − Dp(c+ey−eσ) + Dp(c−ey−eσ)] / 4.
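The finite-difference expressions above translate directly into code. The following hedged sketch assumes the same 3 × 3 × 3 neighborhood layout n[k][j][i] (k = scale, j = y, i = x) with the center at n[1][1][1]:

    static double[] gradient3d(float[][][] n) {           // Eqn. (25.56)
        double dx = 0.5 * (n[1][1][2] - n[1][1][0]);
        double dy = 0.5 * (n[1][2][1] - n[1][0][1]);
        double ds = 0.5 * (n[2][1][1] - n[0][1][1]);
        return new double[] { dx, dy, ds };
    }
    static double[][] hessian3d(float[][][] n) {          // Eqns. (25.57)-(25.58)
        double c = n[1][1][1];
        double dxx = n[1][1][0] - 2 * c + n[1][1][2];
        double dyy = n[1][0][1] - 2 * c + n[1][2][1];
        double dss = n[0][1][1] - 2 * c + n[2][1][1];
        double dxy = (n[1][2][2] - n[1][2][0] - n[1][0][2] + n[1][0][0]) / 4;
        double dxs = (n[2][1][2] - n[2][1][0] - n[0][1][2] + n[0][1][0]) / 4;
        double dys = (n[2][2][1] - n[2][0][1] - n[0][2][1] + n[0][0][1]) / 4;
        return new double[][] { { dxx, dxy, dxs }, { dxy, dyy, dys }, { dxs, dys, dss } };
    }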



See procedures Gradient(Nc) and Hessian(Nc) in Alg. 25.5 (p. 751) for additional details. From the gradient vector ∇D(c) and the Hessian matrix HD(c), the second-order Taylor expansion around point c is

    D̃c(x) = Dp(c) + ∇Dᵀ(c)·(x−c) + (1/2)·(x−c)ᵀ·HD(c)·(x−c),    (25.59)

for some continuous position x = (x, y, σ)ᵀ ∈ R³. The scalar-valued function D̃c(x), with c = (u, v, q)ᵀ, is a local, continuous approximation of the discrete DoG function Dp,q(u, v) at octave p, scale level q, and spatial position (u, v). This is a quadratic function with an extremum (maximum or minimum) at position

    x̆ = (x̆, y̆, σ̆)ᵀ = c + d = c − HD⁻¹(c)·∇D(c),    (25.60)

with d = x̆ − c = (x′, y′, σ′)ᵀ, under the assumption that the inverse of the Hessian matrix HD exists. By inserting the extremal position x̆ into Eqn. (25.59), the peak (minimum or maximum) value of the continuous approximation function D̃ is found as¹³

    Dpeak(c) = D̃c(x̆) = Dp(c) + (1/2)·∇Dᵀ(c)·(x̆ − c) = Dp(c) + (1/2)·∇Dᵀ(c)·d,    (25.61)



where d (see Eqn. (25.60)) denotes the 3D vector between the neighborhood's discrete center position c and the continuous extremal position x̆. A scale space location c is only retained as a candidate interest point if the estimated magnitude of the DoG exceeds a given threshold tpeak, that is, if

    |Dpeak(c)| > tpeak.    (25.62)
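A minimal sketch of one such refinement step (Eqns. (25.60)–(25.62)) is shown below; it reuses gradient3d() and hessian3d() from the earlier sketch and solves the 3 × 3 system with Cramer's rule, which is only one of several reasonable choices:

    // Returns the offset d = -H^{-1} * grad, or null if H is (near) singular
    // or the interpolated peak response is too weak (Eqn. 25.62).
    static double[] refineOffset(float[][][] n, double tPeak) {
        double[] g = gradient3d(n);
        double[][] h = hessian3d(n);
        double det = h[0][0]*(h[1][1]*h[2][2]-h[1][2]*h[2][1])
                   - h[0][1]*(h[1][0]*h[2][2]-h[1][2]*h[2][0])
                   + h[0][2]*(h[1][0]*h[2][1]-h[1][1]*h[2][0]);
        if (Math.abs(det) < 1e-12) return null;          // Hessian not invertible
        double[] d = new double[3];
        for (int col = 0; col < 3; col++) {              // Cramer's rule, column by column
            double[][] m = { h[0].clone(), h[1].clone(), h[2].clone() };
            for (int r = 0; r < 3; r++) m[r][col] = -g[r];
            double detCol = m[0][0]*(m[1][1]*m[2][2]-m[1][2]*m[2][1])
                          - m[0][1]*(m[1][0]*m[2][2]-m[1][2]*m[2][0])
                          + m[0][2]*(m[1][0]*m[2][1]-m[1][1]*m[2][0]);
            d[col] = detCol / det;
        }
        double dPeak = n[1][1][1] + 0.5 * (g[0]*d[0] + g[1]*d[1] + g[2]*d[2]);  // Eqn. (25.61)
        return (Math.abs(dPeak) > tPeak) ? d : null;
    }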



If the distance vector d between c and the estimated (continuous) peak position x̆ is greater than a predefined limit (typically 0.5) in any spatial direction, the center point c = (u, v, q)ᵀ is moved to one of the neighboring DoG cells by maximally ±1 unit steps along the u, v axes, that is,

    c′ = c + (min(1, max(−1, round(x′))),  min(1, max(−1, round(y′))),  0)ᵀ.    (25.63)

The q component of c is not modified in this scheme, i.e., the search continues at the original scale level.¹⁴ Based on the surrounding 3D neighborhood of this new point, a Taylor expansion (Eqn. (25.60)) is again performed to estimate a new peak location. This is repeated until either the peak location is inside the current DoG cell or the allowed number of repositioning steps nrefine is reached (typically nrefine is set to 4 or 5). If successful, the result of this step is a candidate feature point

    c̆ = (x̆, y̆, q̆) = c + (x′, y′, 0)ᵀ.    (25.64)

¹³ See Sec. D.2.3 (Eqn. (D.56)) of the Appendix for details.
¹⁴ This is handled differently in other SIFT implementations.



Notice that (in this implementation) the scale level q remains unchanged even if the 3D Taylor expansion indicates that the estimated peak is located at another scale level. See the associated procedure RefineKeyPosition() in Alg. 25.4 (p. 750) for a concise summary of these steps.

It should be mentioned that the original publication [175] is not particularly explicit about the aforementioned position refinement process and thus slightly different approaches are used in various open-source SIFT implementations. For example, the implementation in VLFeat¹⁵ [274] moves to one of the direct neighbors at the same scale level as described earlier, as long as |x′| or |y′| is greater than 0.6. AutoPano-SIFT¹⁶ by S. Nowozin calculates the length of the spatial displacement d = ∥(x′, y′)∥ and discards the current point if d > 2. Otherwise it moves by Δu = round(x′), Δv = round(y′) without limiting the displacement to ±1. The Open-Source SIFT Library¹⁷ [127] used in OpenCV also makes full moves in the spatial directions and, in addition, potentially also changes the scale level by Δq = round(σ′) in each iteration.

¹⁵ https://www.vlfeat.org/overview/sift.html
¹⁶ https://wiki.panotools.org/Autopano-sift-C
¹⁷ https://robwhess.github.io/opensift/

25.2.3 Suppressing Responses to Edge-Like Structures

In the previous step, candidate interest points were selected as those locations in the DoG scale space where the Taylor approximation had a local maximum and the extrapolated DoG value was above a given threshold (tpeak). However, the DoG filter also responds strongly to edge-like structures. At such positions, interest points cannot be located with sufficient stability and repeatability. To eliminate the responses near edges, Lowe suggests the use of the principal curvatures of the 2D DoG result along the spatial x/y axes, using the fact that the principal curvatures of a function are proportional to the eigenvalues of the function's Hessian matrix at a given point. For a particular grid point c = (u, v, q) in DoG scale space, with neighborhood ND (see Eqn. (25.52)), the 2 × 2 Hessian matrix for the spatial coordinates is

    Hxy(c) = ( dxx  dxy )
             ( dxy  dyy ),    (25.65)

with dxx, dyy, dxy as defined in Eqn. (25.58), i.e., these values can be taken from the existing 3 × 3 Hessian matrix HD(c) (see Eqn. (25.57)). The matrix Hxy(c) has two eigenvalues λ0, λ1, which we assume to be ordered such that λ0 ≥ λ1. If both eigenvalues for a point c are of similar magnitude, the function exhibits a high curvature along two orthogonal directions and in this case c is likely to be a good reference point that can be located reliably. In the optimal situation (e.g., near a corner), the ratio of the eigenvalues, ρ = λ0/λ1, is close to 1.






Otherwise, i.e., if the ratio ρ is high, it can be concluded that a single orientation dominates at this position, as is typically the case near an edge. To estimate the ratio ρ it is not necessary to calculate the eigenvalues themselves. Using the general relation between eigenvalues and the trace and determinant of a matrix,¹⁸ the sum and product of the eigenvalues λ0, λ1 can be calculated as

    λ0 + λ1 = tr(Hxy(c)) = dxx + dyy,    (25.66)
    λ0 · λ1 = det(Hxy(c)) = dxx·dyy − dxy².    (25.67)

Fig. 25.16 Limiting the ratio of principal curvatures (edge ratio) ρ0,1 by specifying amax. The quantity a (blue line) has a minimum when the eigenvalue ratio ρ0,1 = λ0/λ1 is one, that is, when the two eigenvalues λ0, λ1 are equal, indicating a corner-like event. Typically only one of the eigenvalues is dominant in the vicinity of image lines, such that the ρ0,1 and a values are significantly increased. In this example, the principal curvature ratio ρ0,1 is limited to ρmax = 5.0 by setting amax = (5+1)²/5 = 7.2 (dashed red line).



If the determinant det(Hxy ) is negative, the principal curvatures of the underlying 2D function have opposite signs and thus point c can be discarded as not being an extremum. Otherwise, if the signs of both eigenvalues λ0 , λ1 are the same, then the ratio ρ0,1 = λ0 /λ1



(25.68)



is positive (with λ0 = ρ0,1·λ1), and thus the expression

    a = [tr(Hxy(c))]² / det(Hxy(c)) = (λ0 + λ1)² / (λ0·λ1)    (25.69)
      = (ρ0,1·λ1 + λ1)² / (ρ0,1·λ1²) = λ1²·(ρ0,1 + 1)² / (ρ0,1·λ1²) = (ρ0,1 + 1)² / ρ0,1    (25.70)

depends only on the ratio ρ0,1. If the determinant of Hxy is positive, the quantity a has a minimum (4.0) at ρ0,1 = 1, i.e., if the two eigenvalues are equal (see Fig. 25.16). Note that the ratio a is the same for ρ0,1 = λ0/λ1 or ρ0,1 = λ1/λ0, since

    a = (ρ0,1 + 1)² / ρ0,1 = (1/ρ0,1 + 1)² / (1/ρ0,1).    (25.71)



To verify that the eigenvalue ratio ρ0,1 at a given position c is below a specified limit ρmax (making c a good candidate), it is thus sufficient to check the condition

    a ≤ amax,    with    amax = (ρmax + 1)² / ρmax,    (25.72)

without the need to actually calculate the individual eigenvalues λ0 and λ1.¹⁹

¹⁸ See Sec. B.5 of the Appendix.
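A short hedged Java sketch of this edge-response test (Eqns. (25.65)–(25.72)) could look as follows:

    // Accept a candidate only if det(Hxy) > 0 and the curvature ratio is below rhoMax.
    static boolean passesEdgeTest(double dxx, double dyy, double dxy, double rhoMax) {
        double trace = dxx + dyy;                            // lambda0 + lambda1
        double det = dxx * dyy - dxy * dxy;                  // lambda0 * lambda1
        if (det <= 0) return false;                          // curvatures of opposite sign
        double a = (trace * trace) / det;                    // Eqn. (25.69)
        double aMax = (rhoMax + 1) * (rhoMax + 1) / rhoMax;  // Eqn. (25.72)
        return a <= aMax;
    }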



Fig. 25.17 Rejection of edge-like features by controlling the max. curvature ratio ρmax. The size of the circles is proportional to the scale level at which the corresponding key point was detected, the color indicating the containing octave p (red = 0, green = 1, blue = 2, magenta = 3).



(Panels shown for ρmax = 3, 10, 20, 40.)



ρmax should be greater than 1 and is typically chosen to be in the range 3, . . . , 10 (ρmax = 10 is suggested in [175]). The resulting value amax in Eqn. (25.72) is constant and needs only be calculated once (see Alg. 25.3, line 2). Detection examples for varying values of ρmax are shown in Fig. 25.17. Note that considerably more candidates appear near edges as ρmax is raised from 3 to 40.

¹⁹ A similar trick is used in the Harris corner detection algorithm (see Chapter 6).

25.3 Creating Local Descriptors

For each local maximum detected in the hierarchical DoG scale space, a candidate key point is created, which is subsequently refined to a continuous position following the steps we have just described (see Eqns. (25.56)–(25.64)). Then, for each refined key point k′ = (p, q, x, y), one or more (up to four) local descriptors are calculated. Multiple descriptors may be created for a position if the local orientation is not unique. This process involves the following steps:

1. Find the dominant orientation(s) of the key point k′ from the distribution of the gradients at the corresponding Gaussian scale space level.
2. For each dominant orientation, create a separate SIFT descriptor at the key point k′.

25.3.1 Finding Dominant Orientations

Local Orientation From Gaussian Scale Space

Orientation vectors are obtained by sampling the gradient values of the hierarchical Gaussian scale space Gp,q(u, v) (see Eqn. (25.33)). For any lattice position (u, v) at octave p and scale level q, the local gradient is calculated as

    ∇p,q(u, v) = (dx, dy)ᵀ = 0.5·(Gp,q(u+1, v) − Gp,q(u−1, v),  Gp,q(u, v+1) − Gp,q(u, v−1))ᵀ.    (25.73)

From these gradient vectors, the gradient magnitude and orientation (i.e., polar coordinates) are found as²⁰



    Ep,q(u, v) = |∇p,q(u, v)| = (dx² + dy²)^(1/2),    (25.74)
    ϕp,q(u, v) = ∠∇p,q(u, v) = tan⁻¹(dy/dx).    (25.75)
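For illustration, a minimal Java sketch of Eqns. (25.73)–(25.75) at one lattice point of a Gaussian scale level stored as g[v][u] (an assumed layout) is:

    // Returns { magnitude, orientation } with the orientation mapped to [0, 2*pi).
    static double[] gradientPolar(float[][] g, int u, int v) {
        double dx = 0.5 * (g[v][u + 1] - g[v][u - 1]);
        double dy = 0.5 * (g[v + 1][u] - g[v - 1][u]);
        double mag = Math.hypot(dx, dy);
        double phi = Math.atan2(dy, dx);
        if (phi < 0) phi += 2 * Math.PI;
        return new double[] { mag, phi };
    }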



These scalar fields Ep,q and ϕp,q are typically pre-calculated for all relevant octaves/levels p, q of the Gaussian scale space G.

Orientation Histograms

To find the dominant orientations for a given key point, a histogram hϕ of the orientation angles is calculated for the gradient vectors collected from a square window around the key point center. Typically the histogram has norient = 36 bins, that is, the angular resolution is 10°. The orientation histogram is collected from a square region using an isotropic Gaussian weighting function whose width σw is proportional to the decimated scale σ̇q (see Eqn. (25.38)) of the key point's scale level q. Typically a Gaussian weighting function "with a σ that is 1.5 times that of the scale of the key point" [175] is used, that is,

    σw = 1.5·σ̇q = 1.5·σ0·2^(q/Q).    (25.76)



Note that σw is independent of the octave index p and thus the same weights are used in each octave. To calculate the orientation histogram, the Gaussian gradients around the given key point are collected from a square region of size 2rw × 2rw, with

    rw = ⌈2.5·σw⌉,    (25.77)

i.e., amply dimensioned to avoid numerical truncation effects. For the parameters listed in Table 25.3 (σ0 = 1.6, Q = 3), the values for σw (expressed in the octave's coordinate units) are

    q    0        1        2        3
    σw   1.6000   2.0159   2.5398   3.2000    (25.78)
    rw   4        5        6        7

²⁰ See also Sec. 16.1.



In Alg. 25.7, σw and rw of the Gaussian weighting function are calculated in lines 7 and 8, respectively. At each lattice point (u, v), the gradient vector ∇p,q(u, v) is calculated in octave p and level q of the Gaussian scale space G (Alg. 25.7, line 16). From this, the gradient magnitude Ep,q(u, v) and orientation ϕp,q(u, v) are obtained (lines 29–30). The corresponding Gaussian weight is calculated (in line 18) from the spatial distance between the grid point (u, v) and the interest point (x, y) as

    wG(u, v) = exp(−((u−x)² + (v−y)²) / (2·σw²)).    (25.79)



For the grid point (u, v), the quantity to be accumulated into the orientation histogram is z = Ep,q (u, v) · wG (u, v),



(25.80)



that is, the local gradient magnitude weighted by the Gaussian window function (Alg. 25.7, line 19). The orientation histogram hϕ consists of norient bins and thus the continuous bin number for the angle ϕ(u, v) is

    κϕ = norient · ϕ(u, v) / (2π)    (25.81)



(see Alg. 25.7, line 20). To collect the continuous orientations into a histogram with discrete bins, quantization must be performed. The simplest approach is to select the "nearest" bin (by rounding) and to add the associated quantity (denoted z) entirely to the selected bin. Alternatively, to reduce quantization effects, a common technique is to split the quantity z onto the two closest bins. Given the continuous bin value κϕ, the indexes of the two closest discrete bins are

    k0 = ⌊κϕ⌋ mod norient    and    k1 = (⌊κϕ⌋ + 1) mod norient,    (25.82)



respectively. The quantity z (Eqn. (25.80)) is then partitioned and accumulated into the neighboring bins k0, k1 of the orientation histogram hϕ in the form

    hϕ(k0) ← hϕ(k0) + (1−α)·z,
    hϕ(k1) ← hϕ(k1) + α·z,    (25.83)



with α = κϕ −⌊κϕ ⌋. This process is illustrated by the example in Fig. 25.18 (see also Alg. 25.7, lines 21–25).
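A compact Java sketch of this accumulation step (Eqns. (25.81)–(25.83)) is shown below; it assumes the histogram is a plain double[] array and phi lies in [0, 2π):

    static void accumulateOrientation(double[] hPhi, double phi, double z) {
        int nOrient = hPhi.length;
        double kappa = nOrient * phi / (2 * Math.PI);   // continuous bin position (Eqn. 25.81)
        int k0 = ((int) Math.floor(kappa)) % nOrient;
        int k1 = (k0 + 1) % nOrient;                    // circular neighbor (Eqn. 25.82)
        double alpha = kappa - Math.floor(kappa);
        hPhi[k0] += (1 - alpha) * z;                    // Eqn. (25.83)
        hPhi[k1] += alpha * z;
    }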






Fig. 25.18 Accumulating into multiple histogram bins by linear interpolation. Assume that some quantity z (blue bar) is to be added to the discrete histogram hϕ at the continuous position κϕ . The histogram bins adjacent to κϕ are k0 = ⌊κϕ ⌋ and k1 = ⌊κϕ ⌋+1. The fraction of z accumulated into bin k1 is z1 = z · α (red bar), with α = κϕ −k0 . Analogously, the quantity added to bin k0 is z0 = z · (1−α) (green bar).



Fig. 25.19 Orientation histogram example. Each of the 36 radial bars corresponds to one entry in the orientation histogram hϕ . The length (radius) of each radial bar with index k is proportional to the accumulated value in the corresponding bin hϕ (k) and its orientation is ϕk .



Fig. 25.20 Smoothing the orientation histogram (from Fig. 25.19) by repeatedly applying a circular low-pass filter with the 1D kernel H = (1/4)·(1, 2, 1), shown for n = 0, . . . , 3 smoothing iterations.



Smoothing the Orientation Histogram

Figure 25.19 shows a geometric rendering of the orientation histogram that explains the relevance of the cell indexes (discrete angles ϕk) and the accumulated quantities (z). Before calculating the dominant orientations, the raw orientation histogram hϕ is usually smoothed by applying a (circular) low-pass filter, typically a simple 3-tap Gaussian or box-type filter (see procedure SmoothCircular() in Alg. 25.7, lines 6–16).²¹ Stronger smoothing is achieved by applying the filter multiple times, as illustrated in Fig. 25.20. In practice, 2–3 smoothing iterations appear to be sufficient.

²¹ Histogram smoothing is not mentioned in the original SIFT publication [175] but used in most implementations.

Locating and Interpolating Orientation Peaks

After smoothing the orientation histogram, the next step is to detect the peak entries in hϕ. A bin k is considered a significant orientation peak if hϕ(k) is a local maximum and its value is not less than a certain fraction of the maximum histogram entry, that is, only if



    hϕ(k) > hϕ((k−1) mod norient)  ∧
    hϕ(k) > hϕ((k+1) mod norient)  ∧    (25.84)
    hϕ(k) > tdomor · max_i hϕ(i),



with tdomor = 0.8 as a typical limit. To achieve a finer angular resolution than provided by the orientation histogram bins (typically spaced at 10° steps) alone, a continuous peak orientation is calculated by quadratic interpolation of the neighboring histogram values. Given a discrete peak index k, the interpolated (continuous) peak position k̆ is obtained by fitting a quadratic function to the three successive histogram values hϕ(k−1), hϕ(k), hϕ(k+1) as²²

    k̆ = k + (hϕ(k−1) − hϕ(k+1)) / (2·(hϕ(k−1) − 2·hϕ(k) + hϕ(k+1))),    (25.85)

with all indexes taken modulo norient. From Eqn. (25.81), the (continuous) dominant orientation angle θ ∈ [0, 2π) is then obtained as

    θ = (k̆ mod norient) · 2π/norient.    (25.86)
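As a hedged sketch of Eqns. (25.85)–(25.86), the following hypothetical helper interpolates a detected peak bin k and returns the continuous orientation angle:

    static double dominantOrientation(double[] hPhi, int k) {
        int n = hPhi.length;
        double hm = hPhi[(k - 1 + n) % n], h0 = hPhi[k], hp = hPhi[(k + 1) % n];
        double kBreve = k + (hm - hp) / (2 * (hm - 2 * h0 + hp));   // Eqn. (25.85)
        return (((kBreve % n) + n) % n) * (2 * Math.PI / n);        // Eqn. (25.86), in [0, 2*pi)
    }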



In this way, the dominant orientation can be estimated with accuracy much beyond the coarse resolution of the orientation histogram. Note that, in some cases, multiple histogram peaks are obtained for a given key point (see proc. FindPeakOrientations() in Alg. 25.6, lines 18–31). In this event, individual SIFT descriptors are created for each dominant orientation at the same key point position (see Alg. 25.3, line 8). Figure 25.21 shows the orientation histograms for a set of detected key points in two different images after applying a varying number of smoothing steps. It also shows the interpolated dominant orientations θ calculated from the orientation histograms (Eqn. (25.86)), drawn as corresponding vectors.

25.3.2 SIFT Descriptor Construction

For each key point k′ = (p, q, x, y) and each dominant orientation θ, a corresponding SIFT descriptor is created by sampling the surrounding gradients at octave p and level q of the Gaussian scale space G.

Descriptor Geometry

The geometry underlying the calculation of SIFT descriptors is illustrated in Fig. 25.22. The descriptor combines the gradient orientation and magnitude from a square region of size wd × wd, which is centered at the (continuous) position (x, y) of the associated feature point and aligned with its dominant orientation θ. The side length of the descriptor is set to wd = 10·σ̇q, where σ̇q denotes the key point's decimated scale (radius of the inner circle); it depends on the key point's scale level q (see Table 25.4).






See Sec. E.1 of the Appendix for details.






Fig. 25.21 Orientation histograms and dominant orientations (examples). n = 0, . . . , 3 smoothing iterations were applied to the orientation histograms. The (interpolated) dominant orientations are shown as radial lines that emanate from each feature’s center point. The size of the histogram graphs is proportional to the absolute scale (σp,q , see Table 25.3) at which the corresponding key point was detected. The colors indicate the index of the containing scale space octave p (red = 0, green = 1, blue = 2, magenta = 3).






The region is partitioned into nspat × nspat sub-squares of identical size; typically nspat = 4 (see Table 25.5). The contribution of each gradient sample is attenuated by a circular Gaussian function of width σd = 0.25·wd (blue circle). The weights drop off radially and are practically zero at rd = 2.5·σd (green circle in Fig. 25.22). Thus only samples inside this zone need to be included for calculating the descriptor statistics.






Fig. 25.22 Geometry of a SIFT descriptor. The descriptor is calculated from a square support region that is centered at the key point's position (x, y), aligned to the key point's dominant orientation θ, and partitioned into nspat × nspat (4 × 4) subsquares. The radius of the inner (gray) circle corresponds to the feature point's decimated scale value (σ̇q). The blue circle displays the width (σd) of the Gaussian weighting function applied to the gradients; its value is practically zero outside the green circle (rd).



Table 25.4 SIFT descriptor dimensions for different scale levels q (for size factor sd = 10 and Q = 3 levels per octave). σ̇q is the key point's decimated scale, wd is the descriptor size, σd is the width of the Gaussian weighting function, and rd is the radius of the descriptor's support region. For Q = 3, only scale levels q = 0, 1, 2 are relevant. All lengths are expressed in the octave's (i.e., decimated) coordinate units.

    q    σ̇q       wd = sd·σ̇q   σd = 0.25·wd   rd = 2.5·σd
    3    3.2000    32.000        8.0000         20.0000
    2    2.5398    25.398        6.3495         15.8738
    1    2.0159    20.159        5.0398         12.5994
    0    1.6000    16.000        4.0000         10.0000
   −1    1.2699    12.699        3.1748          7.9369



To achieve rotation invariance, the descriptor region is aligned to the key point’s dominant orientation, as determined in the previous steps. To make the descriptor invariant to scale changes, its size wd (expressed in the grid coordinate units of octave p) is set proportional to the key point’s decimated scale σ˙ q (see Eqn. (25.38)), that is, wd = sd ·σ˙ q = sd ·σ0 ·2q/Q ,



(25.87)



where sd is a constant size factor. For sd = 10 (see Table 25.5), the descriptor size wd ranges from 16.0 (at level 0) to 25.4 (at level 2), as listed in Table 25.4. Note that the descriptor size wd only depends on the scale level index q and is independent of the octave index p. Thus the same descriptor geometry applies to all octaves of the scale space. The descriptor's spatial resolution is specified by the parameter nspat. Typically nspat = 4 (as shown in Fig. 25.22) and thus the total number of spatial bins is nspat × nspat = 16 (in this case). Each spatial descriptor bin relates to an area of size (wd/nspat) × (wd/nspat). For example, at scale level q = 0 of any octave, σ̇0 = 1.6 and the corresponding descriptor size is wd = sd·σ̇0 = 10·1.6 = 16.0 (see Table 25.4). In this case (illustrated in Fig. 25.23), the descriptor covers 16 × 16 gradient samples, as suggested in [175]. Figure 25.24 shows an example with M-shaped feature point markers aligned to the dominant orientation and scaled to the descriptor region width wd of the associated scale level.






Fig. 25.23 Geometry of the SIFT descriptor in relation to the discrete sample grid of the associated octave (level q = 0, parameter sd = 10). In this case, the decimated scale is σ̇0 = 1.6 and the width of the descriptor is wd = sd·σ̇0 = 10·1.6 = 16.0.






Fig. 25.24 Marked key points aligned to their dominant orientation. Note that multiple feature instances are inserted at key point positions with more than one dominant orientation. The size of the markers is proportional to the absolute scale (σp,q , see Table 25.3) at which the corresponding key point was detected. The colors indicate the index of the scale space containing octave p (red = 0, green = 1, blue = 2, magenta = 3).



Gradient Features

The actual SIFT descriptor is a feature vector obtained by histogramming the gradient orientations of the Gaussian scale level within the descriptor's spatial support region. This requires a 3D histogram h∇(i, j, k), with two spatial dimensions (i, j) for the nspat × nspat subregions and one additional dimension (k) for nangl gradient orientations. This histogram thus contains nspat × nspat × nangl bins. Figure 25.25 illustrates this structure for the typical setup, with nspat = 4 and nangl = 8 (see Table 25.5). In this arrangement, eight orientation bins k = 0, . . . , 7 are attached to each of the 16 spatial position bins (A1, . . . , D4), which makes a total of 128 histogram bins.

For a given key point k′ = (p, q, x, y), the histogram h∇ accumulates the orientations (angles) of the gradients at the Gaussian scale space level Gp,q within the support region around the (continuous) center coordinate (x, y). At each grid point (u, v) inside this region, the gradient vector ∇G is estimated (as described in Eqn. (25.73)), from which the gradient magnitude E(u, v) and orientation ϕ(u, v) are calculated (see Eqns. (25.74)–(25.75) and lines 27–31 in Alg. 25.7). For efficiency reasons, E(u, v) and ϕ(u, v) are typically pre-calculated for all relevant scale levels. Each gradient sample contributes to the gradient histogram h∇ a particular quantity z that depends on the gradient magnitude E and the distance of the sample point (u, v) from the key point's center (x, y). Again a Gaussian weighting function (of width σd) is used to attenuate samples with increasing spatial distance; thus the resulting accumulated quantity is






Fig. 25.25 SIFT descriptor structure for nspat = 4 and nangl = 8. Eight orientation bins k = 0, . . . , 7 are provided for each of the 16 spatial bins A1, . . . , D4. Thus the gradient histogram h∇ holds 128 cells that are arranged to a 1D feature vector (A1₀, A1₁, . . . , D4₆, D4₇), as shown in (b).






    z(u, v) = R(u, v)·wG(u, v) = R(u, v)·exp(−((u−x)² + (v−y)²) / (2σd²)).    (25.88)



The width σd of the Gaussian function wG ( ) is proportional to the side length of the descriptor region, with σd = 0.25·wd = 0.25·sd ·σ˙ q .



(25.89)



The weighting function drops off radially from the center and is practically zero at distance rd = 2.5·σd. Therefore, only gradient samples that are closer to the key point's center than rd (green circle in Fig. 25.22) need to be considered in the gradient histogram calculation (see Alg. 25.8, lines 7 and 17). For a given key point k′ = (p, q, x, y), sampling of the Gaussian gradients can thus be confined to the grid points (u, v) inside the square region bounded by x ± rd and y ± rd (see Alg. 25.8, lines 8–10 and 15–16). Each sample point (u, v) is then subjected to the affine transformation

    (u′)  =  (1/wd) · ( cos(−θ)  −sin(−θ) ) · (u−x)
    (v′)              ( sin(−θ)   cos(−θ) )   (v−y),    (25.90)

which performs a rotation by the dominant orientation θ and maps the original (i.e., rotated) square of size wd × wd to the unit square with coordinates u′, v′ ∈ [−0.5, +0.5] (see Fig. 25.23). To make feature vectors rotation invariant, the individual gradient orientations ϕ(u, v) are rotated by the dominant orientation, that is,

    ϕ′(u, v) = (ϕ(u, v) − θ) mod 2π,    (25.91)



with ϕ′ (u, v) ∈ [0, 2π), such that the relative orientation is preserved.



For each gradient sample with continuous coordinates (u′, v′, ϕ′), the corresponding quantity z(u, v) (Eqn. (25.88)) is accumulated into the 3D gradient histogram h∇. For a complete description of this step see procedure UpdateGradientHistogram() in Alg. 25.9. It first maps the coordinates (u′, v′, ϕ′) (see Eqn. (25.90)) to the continuous histogram position (i′, j′, k′) by

    i′ = nspat·u′ + 0.5·(nspat − 1),
    j′ = nspat·v′ + 0.5·(nspat − 1),    (25.92)
    k′ = ϕ′ · nangl / (2π),



such that i′, j′ ∈ [−0.5, nspat − 0.5] and k′ ∈ [0, nangl). Analogous to inserting into a continuous position of a 1D histogram by linear interpolation over two bins (see Fig. 25.18), the quantity z is distributed over eight neighboring histogram bins by tri-linear interpolation. The quantiles of z contributing to the individual histogram bins are determined by the distances of the coordinates (i′, j′, k′) from the discrete indexes (i, j, k) of the affected histogram bins. The indexes i, j, k are found as the set of possible combinations {i0, i1} × {j0, j1} × {k0, k1}, with

    i0 = ⌊i′⌋,               i1 = i0 + 1,
    j0 = ⌊j′⌋,               j1 = j0 + 1,    (25.93)
    k0 = ⌊k′⌋ mod nangl,     k1 = (k0 + 1) mod nangl,



and the corresponding quantiles (weights)

    α0 = ⌊i′⌋ + 1 − i′ = i1 − i′,    α1 = 1 − α0,
    β0 = ⌊j′⌋ + 1 − j′ = j1 − j′,    β1 = 1 − β0,    (25.94)
    γ0 = ⌊k′⌋ + 1 − k′,              γ1 = 1 − γ0.



The (eight) affected bins of the gradient histogram are finally updated as²³

    h∇(i0, j0, k0) ⊕← z·α0·β0·γ0,
    h∇(i1, j0, k0) ⊕← z·α1·β0·γ0,
    h∇(i0, j1, k0) ⊕← z·α0·β1·γ0,
         . . .
    h∇(i1, j1, k1) ⊕← z·α1·β1·γ1.    (25.95)
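The following hedged Java sketch shows the tri-linear accumulation of Eqns. (25.92)–(25.95) for one sample, assuming the histogram is stored as h[i][j][k] with dimensions nspat × nspat × nangl; note that the orientation index is wrapped circularly while out-of-range spatial bins are simply skipped:

    static void accumulateTrilinear(double[][][] h, double ii, double jj, double kk, double z) {
        int nSpat = h.length, nAngl = h[0][0].length;
        int i0 = (int) Math.floor(ii), j0 = (int) Math.floor(jj);
        int k0 = ((int) Math.floor(kk)) % nAngl;
        double a1 = ii - Math.floor(ii), b1 = jj - Math.floor(jj), g1 = kk - Math.floor(kk);
        double[] alpha = { 1 - a1, a1 }, beta = { 1 - b1, b1 }, gamma = { 1 - g1, g1 };
        for (int di = 0; di <= 1; di++)
            for (int dj = 0; dj <= 1; dj++)
                for (int dk = 0; dk <= 1; dk++) {
                    int i = i0 + di, j = j0 + dj, k = (k0 + dk) % nAngl;   // circular in k
                    if (i < 0 || i >= nSpat || j < 0 || j >= nSpat) continue;
                    h[i][j][k] += z * alpha[di] * beta[dj] * gamma[dk];    // Eqn. (25.95)
                }
    }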



Attention must be paid to the fact that the coordinate k represents an orientation and must therefore be treated in a circular manner, as illustrated in Fig. 25.26 (also see Alg. 25.9, lines 11–12). For each histogram bin, the range of contributing gradient samples covers half of each neighboring bin, that is, the support regions of neighboring bins overlap, as illustrated in Fig. 25.27.

Normalizing SIFT Descriptors

The elements of the gradient histogram h∇ are the raw material for the SIFT feature vectors fsift. The process of calculating the feature vectors from the gradient histogram is described in Alg. 25.10.

²³ ⊕← is the increment operator (x ⊕← a is equivalent to x ← x + a).






Fig. 25.26 3D structure of the gradient histogram, with nspat × nspat = 4 × 4 bins for the spatial dimensions (i, j) and nangl = 8 bins along the orientation axis (k). For the histogram to accumulate a quantity z into some continuous position (i′ , j ′ , k′ ), eight adjacent bins receive different quantiles of z that are determined by tri-linear interpolation (a). Note that the bins along the orientation axis ϕ are treated circularly; for example, bins at k = 0 are also considered adjacent to the bins at k = 7 (b).









Fig. 25.27 Overlapping support regions in the gradient field. Due to the tri-linear interpolation used in the histogram calculation, the spatial regions associated with the cells of the orientation histogram h∇ overlap. The shading of the circles indicates the weight wG assigned to each sample by the Gaussian weighting function, whose value depends on the distance of each sample from the key point’s center (see Eqn. (25.88)).






Initially, the 3D gradient histogram h∇ (which contains continuous values) of size nspat × nspat × nangl is flattened to a 1D vector f of length nfeat = nspat²·nangl (typ. 128), with

    f(m) = f((i·nspat + j)·nangl + k) = h∇(i, j, k),    (25.96)



for i, j = 0, . . . , nspat−1 and k = 0, . . . , nangl−1. The nfeat = |f| elements in f are thus arranged in the same order as shown in Fig. 25.25, with the orientation index k being the fastest moving and the spatial index i being the slowest (see Alg. 25.10, lines 3–8).²⁴ Changes in image contrast have a linear impact upon the gradient magnitude and thus also upon the values of the feature vector f. To eliminate these effects, the vector f is subsequently normalized to

    f(m) ← f(m) / ∥f∥,    (25.97)



for all m, such that f has unit norm (see Alg. 25.10, line 9). Since the gradient is calculated from local pixel differences, changes in absolute



746



Note that different ordering schemes for arranging the elements of the feature vector are used in various SIFT implementations. For successful matching, the ordering of the elements must be identical, of course.



brightness do not affect the gradient magnitude, unless saturation occurs. Such nonlinear illumination changes tend to produce peak gradient values, which are compensated for by clipping the values of f to a predefined maximum tfclip, that is,

    f(m) ← min(f(m), tfclip),    (25.98)

for m = 0, . . . , |f|−1, with typically tfclip = 0.2, as suggested in [175] (see Alg. 25.10, line 10). After this step, f is normalized once again, as in Eqn. (25.97). Finally, the real-valued feature vector f is converted to an integer vector by

    fsift(m) ← min(round(sf·f(m)), 255),    (25.99)



with sf being a predefined constant (typ. sf = 512). The elements of fsift are in the range [0, 255] and can thus be conveniently encoded and stored as an (unsigned) byte vector (see Alg. 25.10, line 12). The final SIFT descriptor for a given extremal point k′ = (p, q, x, y) is a tuple

    s = ⟨x′, y′, σ, θ, fsift⟩,    (25.100)



which contains the key point’s interpolated position (x′ , y ′ ) (in original image coordinates), the absolute scale σ, its dominant orientation θ, and the corresponding integer-valued gradient feature vector fsift (see Alg. 25.8, line 27). Remember that multiple SIFT descriptors may be produced for different dominant orientations at the same key point position. These will have the same position and scale values but different θ and fsift data.
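A hedged Java sketch of this post-processing chain (Eqns. (25.97)–(25.99)) might look as follows; the helper names are illustrative:

    static int[] normalizeDescriptor(double[] f, double tFclip, double sF) {
        normalize(f);                                           // Eqn. (25.97)
        for (int m = 0; m < f.length; m++) f[m] = Math.min(f[m], tFclip);   // Eqn. (25.98)
        normalize(f);                                           // re-normalize
        int[] fSift = new int[f.length];
        for (int m = 0; m < f.length; m++)
            fSift[m] = (int) Math.min(Math.round(sF * f[m]), 255);          // Eqn. (25.99)
        return fSift;
    }
    static void normalize(double[] f) {         // scale f to unit Euclidean norm
        double s = 0;
        for (double x : f) s += x * x;
        s = Math.sqrt(s);
        if (s > 0) for (int m = 0; m < f.length; m++) f[m] /= s;
    }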



25.4 SIFT Algorithm Summary

This section contains a collection of algorithms that summarizes the SIFT feature extraction process described in the previous sections of this chapter. Algorithm 25.3 shows the top-level procedure GetSiftFeatures(I), which returns a sequence of SIFT feature descriptors for the given image I. The remaining parts of Alg. 25.3 describe the key point detection as extrema of the DoG scale space. The refinement of key point positions is covered in Alg. 25.4. Algorithm 25.5 contains the procedures used for neighborhood operations, detecting local extrema, and the calculation of the gradient and Hessian matrix in 3D. Algorithm 25.6 covers the operations related to finding the dominant orientations at a given key point location, based on the orientation histogram that is calculated in Alg. 25.7. The final formation of the SIFT descriptors is described in Alg. 25.8, which is based on the procedures defined in Algs. 25.9 and 25.10. The global constants used throughout these algorithms are listed in Table 25.5, together with the corresponding Java identifiers in the associated source code (see Sec. 25.7).



Table 25.5 Predefined constants and default values used in the SIFT algorithms (Algs. 25.3–25.11).

Scale space parameters
    Symbol   Java id.   Value   Description
    Q        Q          3       scale steps (levels) per octave
    P        P          4       number of scale space octaves
    σs       sigmaS     0.5     sampling scale (nominal smoothing of the input image)
    σ0       sigma0     1.6     base scale of level 0 (base smoothing)

Key-point detection
    Symbol   Java id.   Value   Description
    norient  nOrient    36      number of orientation bins (angular resolution) used for calculating the dominant key point orientation
    nrefine  nRefine    5       max. number of iterations for repositioning a key point
    nsmooth  nSmooth    2       number of smoothing iterations applied to the orientation histogram
    ρmax     rhoMax     10.0    max. ratio of principal curvatures (3, . . . , 10)
    tdomor   tDomOr     0.8     min. value in orientation histogram for selecting dominant orientations (rel. to max. entry)
    textrm   tExtrm     0.0     min. difference w.r.t. any neighbor for extrema detection
    tmag     tMag       0.01    min. DoG magnitude for initial key point candidates
    tpeak    tPeak      0.01    min. DoG magnitude at interpolated peaks

Feature descriptor
    Symbol   Java id.   Value   Description
    nspat    nSpat      4       number of spatial descriptor bins along each x/y axis
    nangl    nAngl      8       number of angular descriptor bins
    sd       sDesc      10.0    spatial size factor of descriptor (relative to feature scale)
    sf       sFscale    512.0   scale factor for converting normalized feature values to byte values in [0, 255]
    tfclip   tFclip     0.2     max. value for clipping elements of normalized feature vectors

Feature matching
    Symbol   Java id.   Value   Description
    rmax     rMax       0.8     max. ratio of best and second-best matching feature distance



25.5 Matching SIFT Features

Most applications of SIFT features aim at locating corresponding interest points in two or more images of the same scene, for example, for matching stereo pairs, panorama stitching, or feature tracking. Other applications like self-localization or object recognition might use a large database of model descriptors and the task is to match these to the SIFT features detected in a new image or video sequence. All these applications require possibly large numbers of pairs of SIFT features to be compared reliably and efficiently.

25.5.1 Feature Distance and Match Quality

In a typical situation, two unordered sets of SIFT features S^(a), S^(b) are extracted independently from a pair of input images Ia, Ib, i.e.,

    S^(a) = {s_0^(a), . . . , s_{Na−1}^(a)}    and    S^(b) = {s_0^(b), . . . , s_{Nb−1}^(b)}.

The goal is to find matching descriptors in the two feature sets. The similarity between a given pair of descriptors si, sj is measured by the distance between the corresponding feature vectors fi, fj, i.e.,



dist(si , sj ) = ∥fi − fj ∥ ,



(25.101)



Alg. 25.3 SIFT feature extraction (Part 1). Top-level SIFT procedure. Global parameters: σs, σ0, tmag, Q, P (see Table 25.5).

 1: GetSiftFeatures(I)
    Input: I, the source image (scalar-valued). Returns the set of SIFT
    feature descriptors detected in I.
 2:   ⟨G, D⟩ ← BuildSiftScaleSpace(I, σs, σ0, P, Q)   ▷ Alg. 25.2
 3:   K ← GetKeyPoints(D)
 4:   S ← {}                                          ▷ empty set of SIFT descriptors
 5:   for all k′ ∈ K do                               ▷ k′ = (p, q, x, y)
 6:     A ← GetDominantOrientations(G, k′)            ▷ Alg. 25.6
 7:     for all θ ∈ A do
 8:       s ← MakeSiftDescriptor(G, k′, θ)            ▷ Alg. 25.8
 9:       S ← S ∪ {s}
10:   return S

11: GetKeyPoints(D)
    Input: D, DoG scale space (with P octaves, each containing Q levels).
    Returns a set of key points located in D.
12:   K ← {}                                          ▷ empty set of key points
13:   for p ← 0, . . . , P−1 do                       ▷ for all octaves p
14:     for q ← 0, . . . , Q−1 do                     ▷ for all scale levels q
15:       Dp,q ← GetScaleLevel(D, p, q)
16:       E ← FindExtrema(Dp,q)
17:       for all k ∈ E do                            ▷ k = (p, q, u, v)
18:         k′ ← RefineKeyPosition(D, k)              ▷ Alg. 25.4
19:         if k′ ≠ nil then                          ▷ k′ = (p, q, x, y)
20:           K ← K ∪ {k′}                            ▷ collect refined key point k′
21:   return K

22: FindExtrema(Dp,q)
    Input: Dp,q, DoG scale-space level. Returns a set of extremal points
    detected in Dp,q.
23:   (M, N) ← size(Dp,q)
24:   E ← {}                                          ▷ empty list of extrema
25:   for u ← 1, . . . , M−2 do
26:     for v ← 1, . . . , N−2 do
27:       if |Dp,q(u, v)| > tmag then                 ▷ parameter tmag
28:         k ← (p, q, u, v)
29:         Nc ← GetNeighborhood(D, k)                ▷ Alg. 25.5
30:         if IsExtremum(Nc) then                    ▷ Alg. 25.5
31:           E ← E ∪ {k}                             ▷ collect extremal point k
32:   return E



where ∥···∥ denotes some vector norm (typically Euclidean, alternatives are discussed below).25 Note that this distance is measured between individual points distributed in a high-dimensional (typically 128-dimensional) vector space that is only sparsely populated. Since there is always a best-matching counterpart for a given descriptor, matches may occur between unrelated features even if the correct feature is not contained in the target set. This is particularly critical if feature matching is used to determine whether two images show any correspondence at all.

25 See also Sec. B.1.3 of the Appendix.
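To make the role of the distance norm concrete, the following is a minimal, self-contained Java sketch that computes the L2, L1, and L∞ distances between two feature vectors. It is only an illustration under the assumption that feature vectors are stored as int arrays; the class and method names are not part of the implementation described in Sec. 25.7.

// Illustrative sketch (not the book's library code): distance norms
// between two SIFT feature vectors, assumed to be stored as int[128].
public class FeatureDistance {

    // Euclidean (L2) distance, cf. Eqn. (25.101).
    public static double distL2(int[] fa, int[] fb) {
        double sum = 0;
        for (int i = 0; i < fa.length; i++) {
            double d = fa[i] - fb[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // City-block (L1) distance.
    public static double distL1(int[] fa, int[] fb) {
        double sum = 0;
        for (int i = 0; i < fa.length; i++) {
            sum += Math.abs(fa[i] - fb[i]);
        }
        return sum;
    }

    // Maximum (L-infinity) distance.
    public static double distLinf(int[] fa, int[] fb) {
        double max = 0;
        for (int i = 0; i < fa.length; i++) {
            max = Math.max(max, Math.abs(fa[i] - fb[i]));
        }
        return max;
    }
}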



Alg. 25.4 SIFT feature extraction (Part 2). Position refinement. Global parameters: amax, nrefine, tpeak, ρmax (see Table 25.5).



1: RefineKeyPosition(D, k) Input: D, hierarchical DoG scale space; k = (p, q, u, v), extremal (candidate) position. Returns a refined key point k′ or nil if no proper key point could be localized at or near the extremal position k. (ρ



+1)2



20: 21: 22: 23: 24: 25: 26:



amax ← max ▷ see Eqn. (25.72) ρmax k′ ← nil ▷ refined key point done ← false n←1 ▷ number of repositioning steps while ¬done ∧ n ≤ nrefine ∧ IsInside(D, k) do Nc ← GetNeighborhood(D, k) ▷ Alg. 25.5 ⊺ ∇ = (dx , dx , dσ ) ← Gradient(Nc ) ▷ Alg. 25.5 ! dxx dxy dxσ HD = dxy dyy dyσ ← Hessian(Nc ) ▷ Alg. 25.5 dxσ dyσ dσσ if det(HD ) = 0 then ▷ HD is not invertible done ← true ▷ ignore this point and finish else ⊺ d = (x′ , y ′ , σ ′ ) ← −H−1 ▷ Eqn. (25.60) D ·∇ ′ if |x | < 0.5 ∧ |y ′ | < 0.5 then ▷ stay in same DoG cell done ← true ⊺ Dpeak ← Nc (0, 0, 0) + 12 · ∇ ·d ▷ Eqn. (25.61) if |Dpeak | >tpeak ∧  det(Hxy ) > 0 then d d Hxy ← xx xy ▷ get 2D Hessian from HD dxy dyy [tr(Hxy )]2 a← ▷ Eqn. (25.69) det(Hxy ) if a < amax then ▷ suppress edges, Eqn. (25.72) ⊺ k′ ← k + (0, 0, x′ , y ′ ) ▷ refined key point else ▷ move to neighboring DoG cell at same level q u′ ← min(1, max(−1, round(x′ ))) ▷ move by max. ±1 v ′ ← min(1, max(−1, round(y ′ ))) ▷ move by max. ±1 ⊺ k ← k + (0, 0, u′ , v ′ ) n←n+1



27:



return k′



2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19:



▷ k′ is either a refined key point position or nil



Obviously, significant matches should exhibit small feature distances but setting a fixed limit on the acceptable feature distance turns out to be inappropriate in practice, since some descriptors are more discriminative than others. The solution proposed in [175] is to compare the distance obtained for the best feature match to that of the second-best match. For a given reference descriptor sref ∈ S(a), the best match is defined as the descriptor s0 ∈ S(b) which has the smallest distance from sref, that is,

    s0 = argmin_{sj ∈ S(b)} dist(sref, sj),    (25.102)

and the associated “primary” distance is d0 = dist(sref, s0). Analogously, the second-best matching descriptor is

    s1 = argmin_{sj ∈ S(b)\{s0}} dist(sref, sj),    (25.103)



1: IsInside(D, k) Input: D, a DoG scale space; k = (p, q, u, v), a discrete scalespace coordinate. Returns true if k is inside D false otherwise. 2: (M, N ) ← size(GetScaleLevel(D, p, q)) 3: Q ← NumberOfLevels(D) 4: return (0 < u < M −1) ∧ (0 < v < N −1) ∧ (0 ≤ q < Q) 5: GetNeighborhood(D, k) Collects and returns the 3 × 3 × 3 neighborhood values around position k = (p, q, u, v) in the hierarchical DoG scale space D. 6: Nc ← new map {−1, 0, 1}3 7→ R 7: for all (i, j, k) ∈ {−1, 0, 1}3 do ▷ collect neighborhood values 8: Nc (i, j, k) ← Dp,q+k (u+i, v+j) 9: return Nc



Alg. 25.5 SIFT feature extraction (Part 3): Neighborhood operations. Global parameters: Q, textrm (see Table 25.5).



10: IsExtremum(Nc ) ▷ Nc is a 3 × 3 × 3 map Determines if the center of the 3D neighborhood Nc is either a local minimum or maximum by the threshold textrm ≥ 0. Returns a boolean value (i.e., true or false). 11: c ← Nc (0, 0, 0) ▷ center DoG value 12: isMin ← c < 0 ∧ (c+textrm ) < min Nc (i, j, k) ▷ s. Eqn. (25.54) (i,j,k)̸= (0,0,0)



13:



isMax ← c > 0 ∧ (c−textrm ) > max Nc (i, j, k)



▷ s. Eqn. (25.55)



(i,j,k)̸= (0,0,0)



14:



return isMin ∨ isMax



15: Gradient(Nc ) ▷ Nc is a 3 × 3 × 3 map Returns the estim. gradient vector (∇) for the 3D neighborhood Nc . 16: dx ← 0.5 · (Nc (1, 2, 1) − Nc (1, 0, 1)) 17: dy ← 0.5 · (Nc (1, 1, 2) − Nc (1, 1, 0)) ▷ see Eqn. (25.56) 18: dσ ← 0.5 · (Nc (2, 1, 1) − Nc (0, 1, 1)) ⊺ 19: ∇ ← (dx , dy , dσ ) 20: return ∇ 21: Hessian(Nc ) ▷ Nc is a 3 × 3 × 3 map Returns the estim. Hessian matrix (H) for the neighborhood Nc . 22: dxx ← Nc (−1, 0, 0) − 2·Nc (0, 0, 0) + Nc (1, 0, 0) ▷ see Eqn. (25.58) 23: dyy ← Nc (0, −1, 0) − 2·Nc (0, 0, 0) + Nc (0, 1, 0) 24: dσσ ← Nc (0, 0, −1) − 2·Nc (0, 0, 0) + Nc (0, 0, 1) 25: dxy ← [ Nc (1, 1, 0)−Nc (−1, 1, 0)−Nc (1, −1, 0)+Nc (−1, −1, 0) ] /4 26: dxσ ← [ Nc (1, 0, 1)−Nc (−1, 0, 1)−Nc (1, 0, −1)+Nc (−1, 0, −1) ] /4 27: dyσ ← [ Nc (0, 1, 1)−Nc (0, −1, 1)−Nc (0, 1, −1)+Nc (0, −1, −1) ] /4 ! dxx dxy dxσ 28: H ← dxy dyy dyσ dxσ dyσ dσσ 29: return H



with the associated distance d1 = dist(sref , s1 ) ≥ d0 . Reliable matches are expected to have a primary feature distance that is considerably smaller than the distance to any other feature in the target set. In the case of a weak or ambiguous match, on the other hand, it is likely that other matches exist at a distance similar to the primary feature s0 , including the second-best match s1 . Comparing the best and the second-best distances thus provides information about the likelihood of a false match. A good measure to evaluate the match quality is






Alg. 25.6 SIFT feature extraction (Part 4): Key point orientation assignment. Global parameters: nsmooth, tdomor (see Table 25.5).



1: GetDominantOrientations(G, k′ ) Input: G, hierarchical Gaussian scale space; k′ = (p, q, x, y), refined key point at octave p, scale level q and spatial position x, y (in octave’s coordinates). Returns a list of dominant orientations for the key point k′ . 2: 3: 4: 5:



hϕ ← GetOrientationHistogram(G, k′ ) SmoothCircular(hϕ , nsmooth ) A ← FindPeakOrientations(hϕ ) return A



▷ Alg. 25.7



6: SmoothCircular(x, niter ) Smooths the real-valued vector x = (x0 , . . . , xn−1 ) circularly using the 3-element kernel H = (h0 , h1 , h2 ), with h1 as the hot-spot. The filter operation is applied niter times and “in place”, i.e., the vector x is modified. 7: (h0 , h1 , h2 ) ← 14 · (1, 2, 1) ▷ 1D filter kernel 8: n ← size(x) 9: for i ← 1, . . . , niter do ▷ 1 smooting iteration 10: s ← x(0) 11: p ← x(n−1) 12: for j ← 0, . . . , n−2 do 13: c ← x(j) 14: x(j) ← h0 ·p + h1 ·x(j) + h2 ·x(j+1) 15: p←c 16: x(n−1) ← h0 ·p + h1 ·x(n−1) + h2 ·s 17: return 18: FindPeakOrientations(hϕ ) Returns a (possibly empty) collection of dominant directions (angles) obtained from the orientation histogram hϕ . 19: n ← size(hϕ ) 20: hmax ← max hϕ (i) 0≤i tdomor · hmax then ▷ only accept dominant peaks hp ← hϕ ((k−1) mod n) hn ← hϕ ((k+1) mod n) if (hc > hp ) ∧ (hc > hn ) then ▷ local max. at index k hp −hn ˘ k ← k + 2· h −2·h +h ▷ Eqn. (25.85) ( p c n) 2π ˘ θ ← k · n mod 2π ▷ Eqn. (25.86) A ← A ⊞ (θ) return A



the feature distance ratio, defined as rmatch (sref , s0 , s1 ) =



d0/d1 = dist(sref, s0)/dist(sref, s1),    (25.104)



such that rmatch ∈ [0, 1]. If distance d0 is small compared to the second-best distance d1 , then the value of rmatch is small as well, indicating a good match. Conversely, large values of rmatch indicate that the match is poor or possibly ambiguous. Matches are only accepted if they are sufficiently distinctive, for example, by enforcing the condition



1: GetOrientationHistogram(G, k′ ) Input: G, hierarchical Gaussian scale space; k′ = (p, q, x, y), refined key point at octave p, scale level q and relative position x, y.Returns the gradient orientation histogram for key point k′ . 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26:



Gp,q ← GetScaleLevel(G, p, q) (M, N ) ← size(Gp,q ) hϕ ← new map [0, norient −1] 7→ R for i ← 0, . . . , norient −1 do hϕ (i) ← 0



▷ new histogram hϕ ▷ initialize hϕ to zero



Alg. 25.7 SIFT feature extraction (Part 5): Calculation of the orientation histogram and gradients from Gaussian scale levels. Global parameters: norient (see Table 25.5).



σw ← 1.5 · σ0 · 2q/Q ▷ Eqn. (25.76) rw ← max(1, 2.5 · σw ) ▷ Eqn. (25.77) umin ← max(⌊x − rw ⌋, 1) umax ← min(⌈x + rw ⌉, M −2) vmin ← max(⌊y − rw ⌋, 1) vmax ← min(⌈y + rw ⌉, N −2) for u ← umin , . . . , umax do for v ← vmin , . . . , vmax do r2 ← (u−x)2 + (v−y)2 2 if r2 < rw then (E, ϕ) ← GetGradientPolar(Gp,q , u, v) ▷ see below  r2 wG ← exp − 2σ ▷ Gaussian weight 2 w z ← E · wG ▷ quantity to accumulate n n n κϕ ← orient ·ϕ ▷ κϕ ∈ [− orient , + orient ] 2π 2 2 α ← κϕ − ⌊κϕ ⌋ ▷ α ∈ [0, 1] k0 ← ⌊κϕ ⌋ mod norient ▷ lower bin index k1 ← (k0 + 1) mod norient ▷ upper bin index ⊕ (1−α) · z hϕ (k0 ) ← ▷ update bin k0 ⊕ α·z hϕ (k1 ) ← ▷ update bin k1 return hϕ



27: GetGradientPolar(Gp,q , u, v) Returns the gradient magnitude (E) and orientation (ϕ) at position (u, v) of the Gaussian scale level Gp,q .   Gp,q (u+1, v) − Gp,q (u−1, v) 28: ∇ ← 0.5· ▷ gradient at u, v Gp,q (u, v+1) − Gp,q (u, v−1) 29: E ← ∥∇∥ ▷ gradient magnitude 30: ϕ ← ∠∇ ▷ gradient orientation (−π ≤ ϕ ≤ π) 31: return (E, ϕ)



rmatch(sref, s0, s1) ≤ rmax,    (25.105)

where rmax ∈ [0, 1] is a predefined constant (see Table 25.5). The complete matching process, using the Euclidean distance norm and sequential search, is summarized in Alg. 25.11. Other common options for distance measurement are the L1 and L∞ norms.

25.5.2 Examples

The following examples were calculated on pairs of stereographic images taken at the beginning of the 20th century.26 From each of the

26 The images used in Figs. 25.28–25.31 are historic stereographs made publicly available by the Library of Congress (www.loc.gov).



Alg. 25.8 SIFT feature extraction (Part 6): Calculation of SIFT descriptors. Global parameters: Q, σ0, sd, nspat, nangl (see Table 25.5).



1: MakeSiftDescriptor(G, k′ , θ) Input: G, hierarchical Gaussian scale space; k′ = (p, q, x, y), refined key point; θ, dominant orientation. Returns a new SIFT descriptor for the key point k′ . 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14:



Gp,q ← GetScaleLevel(G, p, q) (M, N ) ← size(Gp,q ) σ˙ q ← σ0 · 2q/Q ▷ decimated scale at level q wd ← sd · σ˙ q ▷ descriptor size is prop. to key point scale σd ← 0.25 · wd ▷ width of Gaussian weighting function rd ← 2.5 · σd ▷ cutoff radius of weighting function umin ← max(⌊x−rd ⌋, 1) umax ← min(⌈x+rd ⌉, M −2) vmin ← max(⌊y−rd ⌋, 1) vmax ← min(⌈y+rd ⌉, N −2) h∇ ← new map nspat × nspat × nangl 7→ R ▷ gradient hist. h∇ for all (i, j, k) ∈ nspat × nspat × nangl do h∇ (i, j, k) ← 0 ▷ initialize h∇ to zero



15: 16: 17: 18:



for u ← umin , . . . , umax do for v ← vmin , . . . , vmax do r2 ← (u−x)2 + (v−y)2 if r2 <  rd2 then  ▷ map to canonical coord. frame:   1 u′ cos(−θ) − sin(−θ) u−x ← · · v−y v′ wd sin(−θ) cos(−θ) (E, ϕ) ← GetGradientPolar(Gp,q , u, v) ▷ Alg. 25.7 ϕ′ ← (ϕ − θ) mod 2π ▷ normalize gradient angle  r2 wG ← exp − 2σ ▷ Gaussian weight 2



19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29:



d



z ← E · wG ▷ quantity to accumulate UpdateGradientHistogram(h∇ , u′ , v ′ , ϕ′ , z) ▷ Alg. 25.9



fsift ← MakeFeatureVector(h∇ ) p+q/Q σ ← σ0 · 2   x′ x p ←2 · y y′ s ← ⟨x′ , y ′ , σ, θ, fsift ⟩ return s



▷ see Alg. 25.10 ▷ absolute scale, Eqn. (25.36) ▷ real position, Eqn. (25.45) ▷ create a new SIFT descriptor



two frames of a stereo picture, a sequence of (ca. 1000) SIFT descriptors (marked by blue rectangles) was extracted with identical parameter settings. Matching was done by enumerating all possible descriptor pairs from the left and the right image, calculating their (Euclidean) distance, and showing the 25 closest matches obtained from ca. 1000 detected key points in each frame. Only the best 25 matches are shown in the examples. Feature matches are numbered according to their goodness, that is, label “1” denotes the best-matching descriptor pair (with the smallest feature distance). Selected details from these results are shown in Fig. 25.29. Unless otherwise noted, all SIFT parameters are set to their default values (see Table 25.5).
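The selection of the 25 closest matches described above amounts to sorting all candidate matches by feature distance and keeping the head of the list. As a minimal Java sketch (the Match record and its field names are assumptions made only for this illustration, not part of the actual implementation):

// Illustrative sketch: keep the k matches with the smallest feature distance.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopMatches {

    record Match(int idxA, int idxB, double distance) {}

    // Returns the k matches with the smallest feature distance.
    public static List<Match> best(List<Match> all, int k) {
        List<Match> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingDouble(Match::distance));
        return sorted.subList(0, Math.min(k, sorted.size()));
    }
}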






Although the use of the Euclidean (L2 ) norm for measuring the distances between feature vectors in Eqn. (25.101) is suggested in [175], other norms have been considered [152, 207, 257] to improve the statistical robustness and noise resistance. In Fig. 25.30, matching results are shown using the L1 , L2 , and L∞ norms, respectively.



1: UpdateGradientHistogram(h∇ , u′ , v ′ , ϕ′ , z) Input: h∇ , gradient histogram of size nspat × nspat × nangl , with h∇ (i, j, k) ∈ R; u′ , v ′ ∈ [−0.5, 0.5], normalized spatial position; ϕ′ ∈ [0, 2π), normalized gradient orientation; z ∈ R, quantity to be accumulated into h∇ . Returns nothing but modifies the histogram h∇ . 2: 3: 4:



i′ ← nspat ·u′ + 0.5·(nspat −1) j ′ ← nspat ·v ′ + 0.5·(nspat −1) ϕ′ k′ ← nangl · 2π



5: 6: 7:



i0 ← ⌊i′ ⌋ i1 ← i0 +1 i ← (i0 , i1 )



▷ see Eqn. (25.93); i(0) = i0 , i(1) = i1



8: 9: 10:



j0 ← ⌊j ′ ⌋ j1 ← j0 +1 j ← (j0 , j1 )



▷ j(0) = j0 , j(1) = j1



▷ see Eqn. (25.92) ▷ −0.5 ≤ i′ , j ′ ≤ nspat −0.5 n n ▷ − angl ≤ k′ ≤ angl 2 2







11: 12: 13:



k0 ← ⌊k ⌋ mod nangl k1 ← (k0 +1) mod nangl k ← (k0 , k1 )



14: 15: 16:







α0 ← i1 − i α1 ← 1 − α0 A ← (α0 , α1 )



Alg. 25.9 SIFT feature extraction (Part 7): Updating the gradient descriptor histogram. The quantity z pertaining to the continuous position (u′, v′, ϕ′) is to be accumulated into the 3D histogram h∇ (u′, v′ are normalized spatial coordinates, ϕ′ is the orientation). The quantity z is distributed over up to eight neighboring histogram bins (see Fig. 25.26) by tri-linear interpolation. Note that the orientation coordinate ϕ′ receives special treatment because it is circular. Global parameters: nspat, nangl (see Table 25.5).



▷ k(0) = k0 , k(1) = k1 ▷ see Eqn. (25.94) ▷ A(0) = α0 , A(1) = α1







17: 18: 19:



β 0 ← j1 − j β1 ← 1 − β 0 B ← (β0 , β1 )



▷ B(0) = β0 , B(1) = β1



20: 21: 22:



γ0 ← 1 − (k′ − ⌊k′ ⌋) γ1 ← 1 − γ0 C ← (γ0 , γ1 )



▷ C (0) = γ0 , C (1) = γ1



23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34:



Distribute quantity z among (up to) 8 adjacent histogram bins: for all a ∈ {0, 1} do i ← i(a) if (0 ≤ i < nspat ) then wa ← A(a) for all b ∈ {0, 1} do j ← j(b) if (0 ≤ j < nspat ) then wb ← B(b) for all c ∈ {0, 1} do k ← k(c) wc ← C (c) ⊕ z·wa ·wb ·wc h∇ (i, j, k) ← ▷ see Eqn. (25.95)



35:



return



Note that the resulting sets of top-ranking matches are almost the same with different distance norms, but the ordering of the strongest matches does change. Figure 25.31 demonstrates the effectiveness of selecting feature matches based on the ratio between the distances to the best and the second-best match (see Eqns. (25.102)–(25.103)). Again the figure shows the 25 top-ranking matches based on the minimum (L2 ) feature distance. With the maximum distance ratio rmax set to 1.0, rejection is practically turned off with the result that several false or ambiguous






Alg. 25.10 SIFT feature extraction (Part 8): Converting the orientation histogram to a SIFT feature vector. Global parameters: nspat, nangl, tfclip, sf (see Table 25.5).



1: MakeSiftFeatureVector(h∇ ) Input: h∇ , gradient histogram of size nspat × nspat × nangl . Returns a vector of (8 bit) integers obtained from h∇ .  2  2: f ← new map 0, nspat ·nangl − 1 7→ R ▷ new 1D vector f 3: m←0 4: for i ← 0, . . . , nspat −1 do ▷ flatten h∇ into f 5: for j ← 0, . . . , nspat −1 do 6: for k ← 0, . . . , nangl −1 do 7: f (m) ← h∇ (i, j, k) 8: m←m+1 9: Normalize(f ) 10: ClipPeaks(f , tfclip ) 11: Normalize(f ) 12: fsift ← MapToBytes(f , sf ) 13: return fsift 14: Normalize(x) Scales vector x to unit norm. Returns nothing, but x is modified. 15: n ← |x| ▷ length of x



P



n−1



16:



s←



x(i)



i=0



17: 18: 19:



for i ← 0, . . . , n−1 do x(i) ← 1s · x(i) return



20: ClipPeaks(x, xmax ) Limits the elements of x to xmax . Returns nothing, but x is modified. 21: n ← |x| ▷ length of x 22: for i ← 0, . . . , n−1 do  23: x(i) ← min x(i), xmax 24: return 25: MapToBytes(x, s) Converts the real-valued vector x to an integer (unsigned byte) valued vector with elements in [0, 255], using scale factor s > 0. 26: n ← |x| ▷ length of x 27: xint ← new map [0, n−1] 7→ [0, 255] ▷ new byte vector 28: for i ← 0, . . . , n−1 do 29: a ← round (s · x(i))  ▷ a ∈ N0 30: xint (i) ← min a, 255 ▷ xint (i) ∈ [0, 255] 31: return xint



matches are among the top-ranking feature matches (Fig. 25.31 (a)). With rmax set to 0.8 and finally 0.5, the number of false matches is effectively reduced (Fig. 25.31 (b, c)).27

27 rmax = 0.8 is recommended in [175].

25.6 Efficient Feature Matching

The task of finding the best match based on the minimum distance in feature space is called “nearest-neighbor” search. If performed exhaustively, evaluating all possible matches between two descriptor sets S(a) and S(b) of size Na and Nb, respectively, requires Na·Nb feature distance calculations and comparisons. While this may be acceptable for small feature sets (with maybe up to 1000 descriptors each), this linear (brute-force) approach becomes prohibitively expensive for large feature sets with possibly millions of candidates, as required, for example, in the context of image database indexing or robot self-localization.
Although efficient methods for exact nearest-neighbor search based on tree structures exist, such as the k-d tree method [99], it has been shown that these methods lose their effectiveness with increasing dimensionality of the search space. In fact, no algorithms are known that significantly outperform exhaustive (linear) nearest-neighbor search in feature spaces that are more than about 10-dimensional [175]. SIFT feature vectors are 128-dimensional and therefore exact nearest-neighbor search is not a viable option for efficient matching between large descriptor sets.
The approach taken in [27, 175] abandons exact nearest-neighbor search in favor of finding an approximate solution with substantially reduced effort, based on ideas described in [14]. This so-called “best-bin-first” method uses a modified k-d algorithm, which searches neighboring feature space partitions in the order of their closest distance from the given feature vector. To limit the exploration to a small fraction of the feature space, the search is cut off after checking the first 200 candidates, which results in a substantial speedup without compromising the search results, particularly when combined with feature selection based on the ratio of primary and secondary distances (see Eqns. (25.104)–(25.105)). Additional details can be found in [27]. Approximate nearest-neighbor search in high-dimensional spaces is not only essential for practical SIFT matching in real time, but is a general problem with numerous applications in various disciplines and continued research. Open-source implementations of several different methods are available as software libraries.

Alg. 25.11 SIFT feature matching using Euclidean feature distance and linear search. The returned sequence of SIFT matches is sorted to ascending distance between corresponding feature pairs. Function Dist(sa, sb) demonstrates the calculation of the Euclidean (L2) feature distance, other options are the L1 and L∞ norms.

1: MatchDescriptors(S(a), S(b), rmax)
   Input: S(a), S(b), two sets of SIFT descriptors; rmax, max. ratio of best and second-best matching distance (s. Eqn. (25.105)). Returns a sorted list of matches mab = ⟨sa, sb, dab⟩, with sa ∈ S(a), sb ∈ S(b) and dab being the distance between sa, sb in feature space.
2:  M ← ()                              ▷ empty sequence of matches
3:  for all sa ∈ S(a) do
4:     d0 ← ∞, s0 ← nil                 ▷ best distance/feature
5:     d1 ← ∞                           ▷ second-best distance
6:     for all sb ∈ S(b) do
7:        d ← Dist(sa, sb)
8:        if d < d0 then                ▷ d is a new ‘best’ distance
9:           d1 ← d0                    ▷ cur. best becomes second-best
10:          s0 ← sb, d0 ← d            ▷ new best match
11:       else
12:          if d < d1 then             ▷ d is a new ‘second-best’ distance
13:             s1 ← sb, d1 ← d
14:    if (d1 < ∞) ∧ (d0/d1 ≤ rmax) then   ▷ Eqns. (25.104–25.105)
15:       m ← ⟨sa, s0, d0⟩              ▷ create new match
16:       M ⊞ (m)
17: Sort(M)                             ▷ sort M to ascending distance dab
18: return M

19: Dist(sa, sb)
    Input: descriptors sa = ⟨xa, ya, σa, θa, fa⟩, sb = ⟨xb, yb, σb, θb, fb⟩. Returns the Euclidean distance between feature vectors fa, fb.
20: n ← |fa|                            ▷ length of feature vectors, n = |fa| = |fb|
21: sum ← 0
22: for i ← 0, . . . , n−1 do
23:    sum ← sum + [fa(i) − fb(i)]²
24: return √sum

Fig. 25.28 SIFT feature matching examples on pairs of stereo images. Shown are the 25 best matches obtained with the L2 feature distance and rmax = 0.8.
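For readers who prefer plain code over pseudocode, the following Java sketch mirrors the linear (brute-force) matching strategy of Alg. 25.11, including the best/second-best ratio test of Eqn. (25.105). It is only an illustration: descriptors are represented here simply by their feature vectors (double[]), and the class, record, and method names are assumptions, not the API of the implementation described in Sec. 25.7.

// Illustrative sketch of brute-force descriptor matching with the ratio test.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LinearMatcher {

    public record Match(int idxA, int idxB, double distance) {}

    public static List<Match> match(List<double[]> SA, List<double[]> SB, double rMax) {
        List<Match> matches = new ArrayList<>();
        for (int a = 0; a < SA.size(); a++) {
            double d0 = Double.POSITIVE_INFINITY;   // best distance so far
            double d1 = Double.POSITIVE_INFINITY;   // second-best distance
            int best = -1;
            for (int b = 0; b < SB.size(); b++) {
                double d = distL2(SA.get(a), SB.get(b));
                if (d < d0) {          // new best match found
                    d1 = d0;
                    d0 = d;
                    best = b;
                } else if (d < d1) {   // new second-best distance
                    d1 = d;
                }
            }
            // accept only distinctive matches (ratio test, Eqn. (25.105))
            if (best >= 0 && d1 < Double.POSITIVE_INFINITY && d0 / d1 <= rMax) {
                matches.add(new Match(a, best, d0));
            }
        }
        matches.sort(Comparator.comparingDouble(Match::distance));  // ascending distance
        return matches;
    }

    private static double distL2(double[] fa, double[] fb) {
        double sum = 0;
        for (int i = 0; i < fa.length; i++) {
            double d = fa[i] - fb[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}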



Fig. 25.29 Stereo matching examples (enlarged details from Fig. 25.28). Panels show corresponding details in the left and right frames.



25.7 Java Implementation

A new and complete Java implementation of the SIFT method has been written from ground up to complement the algorithms described in this chapter. Space limitations do not permit a full listing here, but the entire implementation and additional examples can be found in the source code section of this book's website. Most Java methods are named and structured identically to the procedures listed in the algorithms for easy identification. Note, however, that this implementation is again written for instructional clarity and readability. The code is neither tuned for efficiency nor is it intended to be used in a production environment.

Fig. 25.30 Using different distance norms for feature matching: (a) L1 norm, (b) L2 norm, (c) L∞ norm. All other parameters are set to their default values (see Table 25.5).

25.7.1 SIFT Feature Extraction

The key class in this package28 is SiftDetector, which implements a SIFT detector for a given floating-point image. The following example illustrates its basic use for a given ImageProcessor ip:

...
FloatProcessor I = ip.convertToFloatProcessor();
SiftDetector sd = new SiftDetector(I);
List<SiftDescriptor> S = sd.getSiftFeatures();
... // process descriptor list S

28 Package imagingbook.pub.sift

Fig. 25.31 Rejection of weak or ambiguous matches by limiting the ratio of primary and secondary match distance rmax (see Eqns. (25.104)–(25.105)): (a) rmax = 1.0, (b) rmax = 0.8, (c) rmax = 0.5.

The initial work of setting up the required Gaussian and DoG scale space structures is done inside SiftDetector's constructor. Method getSiftFeatures() then performs the actual feature detection task and returns a sequence of SiftDescriptor objects (S) for the image I. Each extracted SiftDescriptor in S holds information about its image position (x, y), its absolute scale σ (scale) and its dominant orientation θ (orientation). It also contains an invariant, 128-element, int-type feature vector fsift (see Alg. 25.8).
The SIFT detector uses a large set of parameters that are set to their default values (see Table 25.5) if the simple constructor new SiftDetector(I) is used, as in the previous example. All parameters can be adjusted individually by passing a parameter object (of type SiftDetector.Parameters) to its constructor, as in the following example, which shows feature extraction from two images A, B using identical parameters:

...
FloatProcessor Ia = A.convertToFloatProcessor();
FloatProcessor Ib = B.convertToFloatProcessor();
...
SiftDetector.Parameters params = new SiftDetector.Parameters();
params.sigmaS = 0.5;   // modify individual parameters
params.sigma0 = 1.6;
...
SiftDetector sdA = new SiftDetector(Ia, params);
SiftDetector sdB = new SiftDetector(Ib, params);
List<SiftDescriptor> SA = sdA.getSiftFeatures();
List<SiftDescriptor> SB = sdB.getSiftFeatures();
... // process descriptor sets SA and SB



25.7.2 SIFT Feature Matching

Finding matching descriptors from a pair of SIFT descriptor sets SA, SB is accomplished by the class SiftMatcher. One descriptor set (SA) is considered the “reference” or “model” set and used to initialize a new SiftMatcher object, as shown in the following example. The actual matches are calculated by invoking method matchDescriptors(), implementing procedure MatchDescriptors() defined in Alg. 25.11. It takes the second descriptor set (SB) as the only argument. The following code segment continues from the previous example:

...
SiftMatcher.Parameters params = new SiftMatcher.Parameters();
// set matcher parameters here (see below)
SiftMatcher matcher = new SiftMatcher(SA, params);
List matches = matcher.matchDescriptors(SB);
... // process matches

As noted, certain parameters of class SiftMatcher can be set individually, for example,

params.norm = FeatureDistanceNorm.L1;  // L1, L2, or Linf
params.rMax = 0.8;   // rmax, max. ratio of best and second-best match
params.sort = true;  // set true if sorting of matches is desired






Method matchDescriptors() in this prototypical implementation performs an exhaustive search over all possible descriptor pairs in the two sets SA and SB. To implement efficient approximate nearestneighbor search (see Sec. 25.6), one would pre-calculate the required search tree structures for the model descriptor set (SA) once inside SiftMatcher’s constructor method. The same matcher object could then be reused to match against multiple descriptor sets without the



need to recalculate the search tree structure over and over again. This is particularly effective when the given model set is large.
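The following fragment sketches this usage pattern with the classes introduced above; the constructor and method calls are those shown in the previous examples, while the loop and the collection otherImages are only illustrative assumptions.

// Sketch: one matcher, initialized once with the model descriptor set SA,
// applied to the descriptor sets extracted from several other images.
SiftDetector.Parameters detParams = new SiftDetector.Parameters();
SiftMatcher.Parameters matchParams = new SiftMatcher.Parameters();
SiftMatcher matcher = new SiftMatcher(SA, matchParams);   // SA = model descriptor set
for (FloatProcessor Ib : otherImages) {                   // 'otherImages' assumed to exist
    SiftDetector sd = new SiftDetector(Ib, detParams);
    List<SiftDescriptor> SB = sd.getSiftFeatures();
    List matches = matcher.matchDescriptors(SB);          // matcher is reused as-is
    // ... process 'matches' for this image
}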



25.8 Exercises

Exercise 25.1. As claimed in Eqn. (25.12), the 2D LoG function Lσ(x, y) can be approximated by the DoG in the form Lσ(x, y) ≈ λ·(Gκσ(x, y) − Gσ(x, y)). Create a combined plot, similar to the one in Fig. 25.5 (b), showing the 1D cross sections of the LoG and DoG functions (with σ = 1.0 and y = 0). Compare both functions by varying the values of κ = 2.00, 1.25, 1.10, 1.05, and 1.01. How does the approximation change as κ approaches 1, and what happens if κ becomes exactly 1?

Exercise 25.2. Test the performance of the SIFT feature detection and matching on pairs of related images under (a) changes of image brightness and contrast, (b) image rotation, (c) scale changes, (d) adding (synthetic) noise. Choose (or shoot) your own test images, show the results in a suitable way and document the parameters used.

Exercise 25.3. Evaluate the SIFT mechanism for tracking features in video sequences. Search for a suitable video sequence with good features to track and process the images frame-by-frame.29 Then match the SIFT features detected in pairs of successive frames by connecting the best-matching features, as long as the “match quality” is above a predefined threshold. Visualize the resulting feature trajectories. Could other properties of the SIFT descriptors (such as position, scale, and dominant orientation) be used to improve tracking stability?

29 In ImageJ, choose a video short enough to fit into main memory and open it as an image stack.






26 Maximally Stable Extremal Regions (MSER)



Most feature detection schemes are based on local change or gradient information, for example, corner detection or SIFT features. In this sense MSER features are complementary since they rely on image regions, i.e., patches of connected pixels that exhibit some increased uniformity instead of change. The concept of connected regions (or “components”) was introduced and described extensively in the context of binary images in Chapter 8. A scalar-valued image (e.g., a grayscale image) can be easily converted to a binary image by thresholding, either using a fixed, global threshold value or adaptive thresholds that change across the image (see Chapter 9). Once the image is binarized, connected regions can be collected with any of the labeling methods described in Sec. 8.1, where every set of connected foreground pixels is assigned a unique label number. Such regions are called “extremal”, since the value of each contained grayscale pixel is less than or equal to the threshold, and regions cannot grow anywhere without adding pixels with values greater than the current threshold. However, the initial thresholding step is generally sensitive and sometimes, depending on the contents of the image, the resulting binary image may change dramatically when the threshold is changed by only a small amount.
The key idea of the MSER method is to locate connected regions that do not change their size much when the input image is binarized over some range of threshold values. Thus the task is to analyze the “growth history” of binary regions over varying threshold levels and to find out which regions are “stable”, i.e., show a sufficiently small increase in size, and which are not. Finally, stable components are classified as “maximally stable” if the rate of size change over varying thresholds is a local minimum. Though the basic idea appears quite simple, the real challenge is to devise efficient algorithms that can perform this process in reasonable time. Thus the actual MSER algorithm is not trivial and requires quite a few details to be considered. It consists of two main steps:






1. Building a component tree, i.e., a hierarchical structure of extremal regions extracted from the threshold sets at successive levels.
2. Scanning the component tree for maximally stable extremal regions by evaluating their growth history and calculating essential feature properties.
While these two steps are combined in some implementations, it makes the process easier to understand if they are treated separately. In particular, we present two different algorithms for building the component tree which yield identical results and can be used interchangeably.



26.1 Threshold Sets and Extremal Regions

Thresholding a scalar-valued image I at some brightness level q partitions the set of pixels into a sub-level or threshold set

    Tq = {p ∈ dom(I) | I(p) ≤ q}    (26.1)

of “background” pixels,1 and a complementary set

    T̄q = {p ∈ dom(I) | I(p) > q}    (26.2)

of “foreground” pixels. The sets Tq and T̄q do not overlap, that is, Tq ∩ T̄q = ∅, and their union includes all pixels in the image, so Tq ∪ T̄q = dom(I). For an integer-valued image with I(p) ∈ [0, K−1] the set T0 (for q = 0) consists of all pixels p with I(p) = 0, T1 is the set T0 plus all pixels with I(p) = 1, etc. In general,

    Tq = {p ∈ dom(I) | I(p) = 0}              for q = 0,
         {p ∈ dom(I) | I(p) = q} ∪ Tq−1       for q > 0.    (26.3)

Thus, if the threshold level is raised from q to a greater value q′ > q, the resulting threshold set Tq′ is never smaller than Tq, since any pixel in Tq is also contained in Tq′, that is, Tq ⊆ Tq′.
An intuitive analogy to thresholding is to view the image's brightness function as a 2½-D surface (landscape or topographic relief) being gradually filled with water. This analogy is often used in the context of watershed segmentation, which aims at separating regions in gray-level images along drainage divides separating adjacent basins (e.g., see [225, 277]). As illustrated in Fig. 26.1, this process can be viewed (and implemented, as we shall see) in two different ways. In Fig. 26.1 (a), the image surface is considered permeable (like a sieve) and the water level underneath it is raised uniformly until the entire surface is immersed. The alternative concept, shown in Fig. 26.1 (b), considers the surface non-permeable. Instead of raising the water level underneath, water is poured onto the surface itself, thereby flooding it gradually, filling basins one after the other until the whole landscape is fully submerged. These two schemes are the basis of the two component tree algorithms described in Sections 26.2.2 and 26.2.3, i.e., the “global immersion” algorithm and the “local flooding” algorithm, respectively.

1 dom(I) = [0, M−1] × [0, N−1] denotes the domain of the image, i.e., the set of all pixel coordinates inside the image boundaries.

Fig. 26.1 Filling metaphors: Global immersion (a) vs. local flooding (b).

A set of pixels R ⊆ dom(I) forms a region or connected component, if any pair of pixels in R can be linked by a sequence of mutually adjacent pixels that are also contained in R. A pixel p′ is adjacent to a given pixel p if it is contained in the neighborhood N of p, i.e.,

    p ∼ p′ ⇔ p′ ∈ N(p)    (26.4)



and vice versa.2 Thus, for R to be connected, there must, for every pair of terminal pixels p1 , pN ∈ R, exist a contiguous path (p1 , p2 , . . . , pN ),



(26.5)



with all pi ∈ R and pi ∼ pi+1. A region R is called extremal if it is a connected component whose surrounding (neighboring) pixels all have greater values than the pixels inside R itself, i.e.,

    ∀ p ∈ R, p′ ∉ R : p′ ∈ N(p) ⇒ I(p′) > I(p).



(26.6)



Any neighbor p′ of a pixel p ∈ R is either inside R, or outside R and has a pixel value greater than the maximum pixel value in R. Thus, reusing the immersion metaphor described above, if the region R is viewed as a basin being filled with water, there is no way that any of it could flow out into neighboring areas. The connected components of a threshold set Tq are always maximal, since all pixels with values below or equal to q are already contained in Tq and all outside pixels have values greater than q. Let {Rq,i } be the set of regions for the threshold set Tq . If a region Rq,i exists at a given threshold level q, there must also be a region Rq+1,j at the next threshold level q + 1 which contains all pixels of Rq,i , i.e., Rq,i ⊆ Rq+1,j .



(26.7)



Thus any region found at a given threshold level q has an associated “parent” region at level q+1 by which it is completely overlapped. Region Rq+1,j contains at least the same pixels as region Rq,i but may include additional pixels with value q+1. 2



Typically N is the 4-neighborhood N4 or the 8-neighborhood N8 (see Ch. 8).
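As a concrete illustration of Eqns. (26.1)–(26.3), the following minimal Java sketch computes the threshold set Tq of a grayscale image as a boolean mask (a pixel belongs to Tq iff its value is ≤ q). The class name and the plain int[][] image representation are assumptions made only for this example; they are not part of the MSER implementation discussed later.

// Illustrative sketch: threshold set T_q of a grayscale image I (Eqn. (26.1)).
// I is given as a 2D int array of size N rows x M columns, values in [0, K-1].
public class ThresholdSet {

    // Returns a boolean mask where mask[v][u] == true iff I(u,v) <= q.
    public static boolean[][] thresholdSet(int[][] I, int q) {
        int N = I.length, M = I[0].length;
        boolean[][] mask = new boolean[N][M];
        for (int v = 0; v < N; v++) {
            for (int u = 0; u < M; u++) {
                mask[v][u] = (I[v][u] <= q);
            }
        }
        return mask;
    }
}

Note that raising q can only switch mask entries from false to true, which is exactly the nesting property Tq ⊆ Tq′ for q ≤ q′.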









26.2 Building the Component Tree

Once a connected component comes into existence (“pops up”) at some threshold level q, it keeps growing with increasing threshold levels. On its “way up”, a region can never become smaller, move away from lower-level regions or disappear altogether. However, at some level a region may become adjacent to one or more other regions of the same threshold set. In this situation all “touching” regions are joined into a single region. The resulting structure is a tree of connected components which we call a component tree. When, with rising threshold level q, a component first appears, it becomes a leaf node of the tree, initially not connected to any other nodes.3 The associated component at the next higher level becomes the parent to that node. It may consist of the same pixels as the child component or add new pixels that are not part of any other region at this level. The joining of two or more components into one creates a new tree node, whose child nodes are the merged components. Finally, the threshold set TK−1 for the maximum brightness level (K−1) necessarily contains all pixels of the image and thus becomes the common root node joining all other components. The example in Fig. 26.2 shows a simple test image (which will be used repeatedly in the following) being “immersed” at different threshold levels q. The resulting extremal regions are marked in Fig. 26.2 (a–d), while Fig. 26.2 (e–h) shows a cross section to illustrate the process of growing and joining components as the threshold level is increased.



Fig. 26.2 Test image at various stages of immersion, i.e., at different threshold levels q = 60, 110, 160, 210 (a–d). Cross section along the marked (red) line (e–h).



26.2.1 Component Tree Algorithms

The following sections describe two different algorithms for building the component tree from a given grayscale image. The first method, which we refer to as “global immersion” (see Sec. 26.2.2), is similar to algorithms used for watershed segmentation and follows the process shown in Fig. 26.1 (a). It starts out by assigning an individual component to each pixel and builds the tree by successively joining components as the threshold is raised from low to high levels. The algorithm potentially runs in “quasi-linear” (i.e., almost linear) time w.r.t. the number of pixels, thanks to efficient, tree-based union finding [195, 277]. The second algorithm (see Sec. 26.2.3), originally proposed in [197], emulates the idea of “local flooding” depicted in Fig. 26.1 (b), where flooding may start at an arbitrary image position and level. The algorithm uses stack data structures for ordering pixels and components and is said to perform in true linear time. It is therefore used in most current MSER implementations and experiments suggest that it is indeed significantly faster than the global immersion algorithm. Note that the component trees produced by the two algorithms are exactly the same.

3 Actually the component starts out as the sub-tree of a forest, which (initially) has no common root.

Fig. 26.3 Test image at various stages of immersion. Original image (a), immersed by rising threshold levels q (b–e).

Component Tree Data Structures

The component tree algorithms in the following sections use the same basic data structures (“objects”), summarized in Table 26.1. Instances of type Pixel represent individual pixels, each composed of its x/y position (pos) and its grayscale value (val). Each pixel also records the direction (dir) of its next unprocessed neighbor (used by function GetNextNeighbor() in Alg. 26.3).






Table 26.1 Common data structures and constructor functions used in component tree algorithms. The additional MSER properties are not required for component tree construction; they are attached to components during MSER extraction.

Pixel := ⟨pos, val, dir⟩
  pos . . . the pixel's position
  val . . . the pixel's grayscale value
  dir . . . direction of next unprocessed neighbor
  Constructor: new Pixel(u, val) := ⟨u, val, 0⟩

Component := ⟨level, pxls, chdr, par, hgt⟩
  level . . . the component's grayscale level
  pxls  . . . the set of “local” (directly attached) pixels
  chdr  . . . the (possibly empty) set of child components
  par   . . . a reference to the parent component
  hgt   . . . the height of the underneath sub-tree
  Constructor: new Component(q) := ⟨q, ∅, ∅, nil, 0⟩
  Additional properties (used for MSER classification only):
  var    . . . the component's size variation
  stable . . . maximally stable flag (boolean)
  stats  . . . vector of coordinate statistics



Instances of Component represent connected components, i.e., the nodes of the component tree. Each component is comprised of a set of “local” pixels (pxls), the associated grayscale level (level), the set of child components (chdr), a reference to the parent component (par), and the height of the sub-tree underneath (hgt). If the parent reference is nil, the component is a root node, i.e., at the top of a sub-tree. The final component tree has only a single root but multiple sub-trees may exist while the tree is built. The size of a component is defined as the number of its local pixels plus the combined number of pixels belonging to its child components (recursively). The gray value of all local (i.e., directly attached) pixels is identical to the component's level. All other member pixels (with lower values) are held by the child components. Thus any pixel instance is never referenced by more than one component.

26.2.2 Component Tree Algorithm 1: Global Immersion

This algorithm is based on the approach described in [195, 277]. Its details are outlined in Algs. 26.1–26.2, with the main procedure BuildComponentTree() in Alg. 26.1. Its key steps are:
1. Collect components: Image pixels are processed in the order of increasing grayscale values. Initially, each pixel forms a separate component, i.e., a sub-tree of a forest.4 When components are found to be connected, the associated sub-trees are merged, thereby creating a preliminary component tree.
2. Build the final component tree: The final component tree T is built by linking the extremal components and removing all non-extremal components from the preliminary component tree.

4 A set of isolated trees is called a “forest”.
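Before turning to the listings, the two record types of Table 26.1 can be pictured as plain Java classes. The sketch below follows the field names of the table, but the class layout itself is an illustrative assumption and not the actual library code; the size() method implements the recursive size definition given above.

// Illustrative Java counterparts of the data structures in Table 26.1.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Pixel {
    final int x, y;        // pos: the pixel's position
    final int val;         // the pixel's grayscale value
    int dir = 0;           // direction of next unprocessed neighbor

    Pixel(int x, int y, int val) {
        this.x = x; this.y = y; this.val = val;
    }
}

class Component {
    final int level;                                  // the component's grayscale level
    final Set<Pixel> pxls = new HashSet<>();          // "local" (directly attached) pixels
    final List<Component> chdr = new ArrayList<>();   // child components
    Component par = null;                             // parent component (null = root)
    int hgt = 0;                                      // height of the sub-tree underneath

    Component(int level) {
        this.level = level;
    }

    // Size = local pixels plus the pixels of all children (recursively).
    int size() {
        int n = pxls.size();
        for (Component c : chdr) {
            n += c.size();
        }
        return n;
    }
}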



1: BuildComponentTree(I) Input: I, a grayscale image of size M × N . Returns a tree T = ⟨E, r⟩, where E is a set of extremal components (nodes) and r ist the root of the tree. 2: 3: 4:



T0 ← CollectComponents(I) T ← BuildTree(T0 ) return T



▷ see Alg. 26.2



Alg. 26.1 Component Tree Algorithm 1: Global Immersion (Part 1).



5: CollectComponents(I) Builds and returns the raw component tree T0 . 6: 7:



P ← new map M × N 7→ Pixel Q ← ()



▷ array of Pixels ▷ empty sequence of Pixels



8: 9: 10: 11: 12: 13:



C ← new map M × N 7→ Component for all u ∈ M × N do p ← new Pixel(u, I(u)) P (u) ← p Q ← Q ⊞ (p) C(u) ← nil



14:



sort(Q, val)



15:



r ← nil



16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27:



for i ← 0, . . . , |Q|−1 do ▷ process pixels in sorted order p ← Q(i) cp ← new Component(val(p)) ▷ see Alg. 26.3 pxls(cp ) ← {p} ▷ add pixel p to cp C (pos(p)) ← cp ▷ add cp to map C n ← GetNextNeighbor(p, P ) ▷ see Alg. 26.3 while n ̸= nil do cn ← F (pos(n)) ▷ get n’s component if cn ̸= nil then ▷ n was processed before rp ← FindRoot(cp ) ▷ find root for cp rn ← FindRoot(cn ) ▷ find root for cn if rp ̸= rn then ▷ cp , cn have different roots Join the two sub-trees rooted at rp and rn : if level(rp ) = level(rn ) ∧ hgt(rp ) < hgt(rn ) then Join(rp , rn ) ▷ rn becomes parent of rp r ← rn else Join(rn , rp ) ▷ rp becomes parent of rn r ← rp n ← GetNextNeighbor(p, P ) ▷ see Alg. 26.3 T0 ← ⟨C, r⟩ return T0



28: 29: 30: 31: 32: 33: 34: 35: 36:



▷ array of Components ▷ see Alg. 26.3



▷ sort pixels in Q by increasing val ▷ final root Component



37: FindRoot(c) Recursively searches (upwards) for the root of the sub-tree containing component c. 38:



return







c FindRoot(par(c))



if par(c) = nil otherwise



39: Join(r1 , r2 ) Joins sub-trees with roots r1 , r2 by making r2 parent of r1 . 40: par(r1 ) ← r2 ▷ make r2 parent of r1 41: chdr(r2 ) ← chdr(r2 ) + r1 ▷ add r1 to the children of r2 42: size(r2 ) ← size(r2 ) + size(r1 ) ▷ adjust the root’s size 43: hgt(r2 ) ← max(hgt(r2 ), hgt(r1 ) + 1) ▷ adjust the tree’s height






Alg. 26.2 Component Tree Algorithm 1: Global Immersion (Part 2).



1: BuildTree(T0 ) Input: T0 = ⟨C, r⟩, the raw component tree, with the set of components C and the root component r. Returns a tree of extremal components (with the same root as T0 ). 2: 3:



4: 5: 6:



Link and collect all extremal components (from C): E ← LinkExtremalComponents(r, r) ▷ see below for all c ∈ E do Merge the remaining non-extremal components into the associated (ancestor) extremal components: ReduceNonExtremalComponents(c) T ← ⟨E, r⟩ ▷ tree of extremal components return T



7: LinkExtremalComponents(c, e) Input: c, the current component; e the previous extremal ancestor component. Returns the set of collected extremal components. 8: 9: 10: 11: 12: 13: 14: 15: 16: 17:



E←∅ if IsExtremal(c) then ▷ see below E ← E ∪ {c} ▷ add c to extremal components if c ̸= e then par(c) ← e ▷ make e parent of c chdr(e) ← chdr(e) ∪ {c} ▷ add c to children of e e←c ▷ make c the prev. extremal comp. for all c′ ∈ chdr(c) do ▷ continue with children of c E ← E ∪ LinkExtremalComponents(c′ , e) return E ▷ the collected extremal components



18: ReduceNonExtremalComponents(c) Input: c, the current component. Recursively merges all nonextremal children of c. Returns nothing but c is modified. 19: for all c′ ∈ chdr(c) do 20: if ¬IsExtremal(c′ ) then 21: ReduceNonExtremalComponents(c′ ) 22: pxls(c) ← pxls(c) ∪ pxls(c′ ) ▷ add all child pixels to c 23: pxls(c′ ) ← ∅ ▷ clear child’s pixels 24: chdr(c) ← chdr(c) \ {c′ } ▷ remove c′ from children 25: IsExtremal(c) ▷ returns true iff component c is extremal 26: p ← par(c) 27: return p = nil ∨ level(p) > level(c)



Step 1: Collect Components

Initially, all image pixels are converted to Pixel instances (see Table 26.1) and collected into two containers (Alg. 26.1, lines 6–13):
• P is a 2D map (array) of Pixel objects with the same size as the original image.
• Q is a 1D sequence of Pixel objects which is sorted by increasing pixel value (line 14).






All pixels in sequence Q are processed one by one in the order of their pixel values (Alg. 26.1, lines 16–34). Pixels with the lowest value are processed first (in no specific order), followed by the pixels at the next higher level, etc. Each pixel p is attached to a new Component



Alg. 26.3 Retrieving the next unprocessed neighbor for a given pixel p. This function is used in both component tree algorithms.

1: GetNextNeighbor(p, P)
   Input: p, the current Pixel; P, a M × N map of Pixel instances. Returns the next valid 4-neighbor pixel of p or nil if there are no more neighbors to process. dir(p) is modified.
2:  d0 ← (1, 0)⊺, d1 ← (0, −1)⊺, d2 ← (−1, 0)⊺, d3 ← (0, 1)⊺
3:  (M, N) ← size(P)
4:  found ← false
5:  k ← dir(p)                            ▷ search direction k = 0, . . . , 4
6:  while k < 4 ∧ ¬found do
7:     u ← pos(p) + dk                    ▷ pos. in direction k
8:     found ← u ∈ [0, M−1] × [0, N−1]    ▷ is u inside the image?
9:     k ← k + 1
10: dir(p) ← k                            ▷ update p's search direction
11: return P(u) if found, nil otherwise



instance cp (see Table 26.1) which initially contains only that pixel and has no parent, i.e., is the root of an isolated sub-tree (lines 18– 19). Newly created components are registered in a (initially empty) 2D map C, indexed by the associated pixel position (Alg. 26.1, line 20). Subsequently, all neighboring pixels of p are inspected (Alg. 26.1, lines 22–34): • If the neighboring pixel n was processed before (i.e., C contains a component for n), the sub-trees containing p and n are joined at their roots, unless they belong to the same sub-tree already (Alg. 26.1, lines 24–33). • Otherwise (if n was not processed before) nothing needs to be done at this point, since n will be visited eventually (by the main loop in Alg. 26.1, line 16). When two sub-trees are joined (Alg. 26.1, lines 29 and 32), one subtree’s root becomes the parent of the other’s root, effected by procedure Join() (line 39). Function FindRoot(c) in line 37 locates the root of the sub-tree containing component c by recursively traversing the tree “upwards” until the root is found.5 The decision which root to make the parent depends on the height (hgt) of the involved subtrees, i.e., the maximum length of the path from the root to any of its leaf nodes (Alg. 26.1, line 28). The rule is that the root of the taller sub-tree becomes the parent of the smaller one.6 After the last pixel was processed, only one root component remains, which represents the single extremal region for the image’s maximum brightness level and contains all other components at lower levels.



5 The (repeated) search for the tree's root is usually avoided by maintaining a “shortcut” reference to the root (or at least to a node closer to the root) in each tree node.
6 This scheme, called “weighted quick-union” finding [239], avoids the buildup of highly unbalanced trees.
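To make the root-finding and joining operations (and the “shortcut” idea from the footnotes) concrete, here is a small, generic Java sketch of union-find with path compression and union by tree height. It operates on plain integer indices and is only an illustration of the technique, not the component-tree code itself.

// Illustrative union-find sketch (elements 0..n-1), with path compression
// in find() and union by height in join(); cf. "weighted quick-union".
public class UnionFind {
    private final int[] parent;
    private final int[] height;   // height of the sub-tree rooted at i

    public UnionFind(int n) {
        parent = new int[n];
        height = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;        // initially every element is its own root
        }
    }

    // Find the root of i, shortening the path on the way (path compression).
    public int find(int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];   // shortcut toward the root
            i = parent[i];
        }
        return i;
    }

    // Join the trees containing a and b; the taller tree's root becomes parent.
    public void join(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;                // already in the same tree
        if (height[ra] < height[rb]) {
            parent[ra] = rb;
        } else if (height[ra] > height[rb]) {
            parent[rb] = ra;
        } else {
            parent[rb] = ra;
            height[ra]++;                    // tree grew by one level
        }
    }
}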









Step 2: Build the Final Component Tree

The second step of this algorithm is to transform the preliminary component tree T0 (built by CollectComponents() in Alg. 26.1) into the final component tree T, which contains extremal regions only. This step is performed by function BuildTree() in Alg. 26.2 (lines 1–6). Starting from the root (r) of the preliminary component tree T0, all extremal components are located and linked by the recursive function LinkExtremalComponents() (lines 2, 7–17). A component c is extremal if its brightness level is smaller than its parent's level (see function IsExtremal(c) in Alg. 26.2, lines 25–27). Finally, the remaining non-extremal components are joined into the associated (ancestor) extremal components by procedure ReduceNonExtremalComponents(), which also works recursively (lines 4, 18–24).

26.2.3 Component Tree Algorithm 2: Local Flooding

This algorithm, which is due to Nistér and Stewénius [197], roughly follows the local flooding analogy outlined in Fig. 26.1 (b). Here the water level is not raised gradually, as in the global immersion metaphor, but the (brightness) surface is considered impermeable and the water is poured in at some arbitrary but fixed position. From there the water spreads out, flows “downhill” into nearby basins, fills them, “climbs” over separating ridges and so on, until the entire “landscape” is submerged. To emulate this process, the algorithm makes use of stack data structures which keep track of pixels and components (unfilled basins) that require further processing. Pixels are not processed in the order of their brightness values but according to their geometric neighborhood. The complete process is summarized in Algs. 26.4–26.5.
The algorithm uses the same Pixel and Component types as the global immersion algorithm (see Table 26.1). Initially, all Pixel instances are placed in a 2D array P which has the same size as the input image I (Alg. 26.4, line 2). A boolean array V (initialized to false) of the same size is used to mark visited pixel positions as “accessible” (line 3). Next, two (initially empty) collections are set up:
• B is a sorted sequence (list) of pixels that are encountered “on the way” but must be memorized for later processing (Alg. 26.4, line 7). The pixels contained in B are “boundary pixels” in the sense that they are adjacent to the current water front but water has either not entered the pixel or not all its neighbors have been explored yet. Inside B the pixels are kept ordered by increasing pixel value, i.e., a new pixel is always inserted at an appropriate sorted position. Pixels can only be removed from the head of sequence B, i.e., the pixel with the lowest brightness value is checked out first.7
• C is a stack which stores “unfinished” components (line 8). It is used in strict last-in/first-out mode, providing only push and pop operations for adding and removing elements, respectively.

7 The associated implementation uses Java's standard class PriorityQueue to represent B.
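As a concrete illustration of these two containers with Java's standard collections (following the hint in the footnote), B can be kept in a PriorityQueue ordered by pixel value and C in a Deque used as a stack. The fragment below reuses the illustrative Pixel and Component classes sketched earlier and shows only the container setup and typical operations, not the full flooding algorithm.

// Illustrative setup of the two containers used by the local-flooding scheme.
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.PriorityQueue;

public class FloodingContainers {
    public static void main(String[] args) {
        // B: boundary pixels, ordered by increasing gray value (val)
        PriorityQueue<Pixel> B = new PriorityQueue<>(Comparator.comparingInt(p -> p.val));
        // C: stack of "unfinished" components (last-in/first-out)
        Deque<Component> C = new ArrayDeque<>();

        B.add(new Pixel(0, 0, 17));      // insert a boundary pixel (kept sorted by val)
        Pixel q = B.poll();              // remove the pixel with the lowest gray value
        C.push(new Component(q.val));    // open a new component at that level
        Component c = C.pop();           // retrieve the topmost component again
    }
}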



1: BuildComponentTree(I) Input: I, a grayscale image of size M × N . Returns the component tree T = ⟨E, r⟩. 2: 3: 4: 5: 6:



Create and initialize an array of pixel data: P ← new map M × N 7→ Pixel V ← new map M × N 7→ B for all u ∈ M × N do P (u) ← new Pixel(u, I(u)) V (u) ← false



▷ array of pixels ▷ “visited” bitmap



Alg. 26.4 Component Tree Algorithm 2: Local Flooding.



▷ see Alg. 26.3 ▷ not yet visited



7: 8: 9: 10: 11: 12: 13:



B ← () ▷ sorted sequence of pixels (by level) C ← () ▷ stack of components E←∅ ▷ set of extremal components p ← P (0) ▷ start position p = (0, 0) V (pos(p)) ← true ▷ mark p as accessible c ← new Component(val(p)) ▷ see Alg. 26.3 push(c, C) ▷ push c onto stack C



14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26:



while p ̸= nil do n ← GetNextNeighbor(p, P ) ▷ see Alg. 26.3 while n ̸= nil do if ¬ V (pos(n)) then ▷ n is not accessible yet V (pos(n)) ← true ▷ mark n as accessible if val(n) ≥ val(p) then addSorted(n, B) ▷ insert n sorted by val else addSorted(p, B) ▷ insert p sorted by val c ← new Component(val(n)) ▷ see Alg. 26.3 push(c, C) ▷ push c onto stack C p←n ▷ make neighbor current n ← GetNextNeighbor(p, P ) ▷ see Alg. 26.3



27: 28: 29: 30: 31: 32:



c ← first(C) pxls(c) ← pxls(c) ∪ {p} ▷ add pixel p to component c q ← pop(B) ▷ pop next pixel off stack B if q ̸= nil ∧ val(q) > val(p) then E ← E ∪ ProcessStack(val(q), C) ▷ see Alg. 26.5 p←q ▷ make q the current point



33: 34:



r ← first(C) ▷ only the root component remains the stack E ← E ∪ ProcessStack(∞, C) ▷ process the root component



35: 36:



T ← ⟨E, r⟩ return T



▷ the component tree



Local flooding may start at any image position—the choice has no influence on the final result. In line 10 of Alg. 26.4, position (0, 0) is arbitrarily chosen as the start position and the associated pixel p = P (0) is marked as accessible in the boolean map V (line 11). Then, an empty component c is created with level val(p) and no parent (nil), which is added to the previously empty component stack C (Alg. 26.4, lines 12–13). The algorithm’s main loop (Alg. 26.4, lines 14–27) implements the iterative flooding process by adding and retrieving pixels to and from the pixel sequence B until it becomes empty. All neighbors n of any pixel p are visited exactly once, and every pixel keeps track of its processed neighbors. Function GetNextNeighbor(p, P )—called






Alg. 26.5 Component Tree Algorithm 2: Local Flooding (Part 2).

1: ProcessStack(v, C)
   Input: v, the new “target” value; C, the component stack. Returns a set of “emitted” (extremal) components, C is modified.
2:  E ← ∅                                    ▷ empty set of extremal components
3:  while C ≠ () ∧ v > level(first(C)) do
4:     c1 ← pop(C)
5:     E ← E ∪ {c1}                          ▷ add (“emit”) c1 to E
6:     if v < ∞ then
7:        if C = () ∨ v < level(first(C)) then
8:           c2 ← new Component(v)           ▷ see Alg. 26.3
9:        else
10:          c2 ← pop(C)
          Merge component c1 into component c2:
11:       chdr(c2) ← chdr(c2) + c1           ▷ add c1 to children of c2
12:       par(c1) ← c2                       ▷ c2 becomes parent of c1
13:       push(c2, C)                        ▷ push c2 onto component stack
14: return E                                 ▷ the set of “emitted” components



in Alg. 26.4, lines 15 and 26—returns the next unprocessed neighbor of p from the pixel array P (see Alg. 26.3 for its definition). If a neighbor n of the current pixel p was not visited before, there are two possible situations (Alg. 26.4, line 17):
A. The brightness level of neighbor n is not lower than the level of pixel p, i.e., val(n) ≥ val(p). In this case there is no need to handle the neighbor immediately, but n is added to the pixel sequence B (ordered by val) to be processed later (Alg. 26.4, line 20).
B. Otherwise, if val(n) < val(p), neighbor n must be handled immediately before continuing with the current pixel p. Thus p is moved back to the pixel sequence B, a new component with the neighbor's level val(n) is pushed onto the component stack C, and neighbor n becomes the current pixel (see Alg. 26.4, lines 20–25). In the local flooding analogy this means that the neighbor lies lower than the current level and thus water flows downward toward the neighbor.
Note that the inner while-loop in Alg. 26.4 (lines 16–26) does not visit all neighbors of the current pixel p at once, since p keeps changing inside the loop. Pixels may be set aside for later processing before all of their neighbors are processed. The pixel's dir property (initialized to 0 by the constructor in Table 26.1) records the “direction” where to find the next valid neighbor, skipping any out-of-bounds positions (see function GetNextNeighbor() in Alg. 26.3). When all its neighbors have been processed, the current pixel p is added to the component that is currently on the top of stack C (see Alg. 26.4, lines 27–28). Thus, in terms of the flooding metaphor, water finally occupies the current pixel. At the end of the main loop, the next pixel (q) is taken from the pixel sequence B (Alg. 26.4, line 29). The loop terminates when B is empty. Otherwise, since B is sorted, none of the remaining pixels in B can have a higher level than pixel q. Thus, if val(q) is greater than the current pixel's level val(p), no more unprocessed pixels exist



that could belong to the current top component in stack C. In this situation, ProcessStack() is called to take care of the component stack (see Alg. 26.4, line 31). Procedure ProcessStack() (see Alg. 26.5) removes components from stack C until the top component's level matches the new “target” level v = val(q). In every iteration, the first component (c1) is taken off the stack and “emitted”, i.e., added to the list of extremal components E (Alg. 26.5, lines 4–5). Then, unless C is empty, the next component on the stack is inspected (Alg. 26.5, line 7):
• If the target value v is smaller than the level of the next stack component, no component has been instantiated for level v yet. In this case, an “artificial” component c2 with level v is created (Alg. 26.5, line 8).
• Otherwise c2 is the next component popped off stack C (see Alg. 26.5, line 10).
Now (the previously emitted) component c1 is “merged into” c2 by making c2 the parent of c1, and pushing c2 (back) onto the component stack C (Alg. 26.5, lines 11–13). After the main loop has finished, ProcessStack() is called once more to retrieve the global root component r (covering all image pixels) which at this point is still on the component stack (Alg. 26.4, line 34).8 Now all extremal components are contained in set E, with all parent/child relations properly set up, and the component tree T = ⟨E, r⟩ is complete.

8 This is done differently in the original algorithm [197], where a dummy component with an out-of-range value is initially pushed onto the component stack.

26.2.4 Component Tree Examples

Figure 26.4 shows the component trees obtained from two simple test images as a 3D structure, where each blob represents an extremal region. The x/y position of each component marks its center position (in image coordinates), while the z coordinate corresponds to the associated brightness level. Connections between components represent parent-child relations. The top component (with the highest gray value) is the tree's root node. The blue nodes are the ones that are later selected as MSER components. The root component, which corresponds to the image's maximum pixel value and covers the entire image (and thus is always positioned at the image center), is marked red. The synthetic image in Fig. 26.4 (a) contains step changes only. Regions maintain a constant size over some extended range of brightness levels and thus the resulting component tree is very simple. The image in Fig. 26.4 (b) is the same image but blurred, which introduces intermediate gray levels and smooth changes. In this case, extremal regions grow continuously with increasing brightness levels, thereby adding additional nodes to the component tree. In practice, component trees tend to become quite large, as shown in Fig. 26.5 for a natural gray image of size 200 × 200 (only). The resulting tree contains 10,128 components, from which only a small






Fig. 26.4 Component tree examples for simple test images. Image with step changes only (a) and blurred version (b). Each dot represents a component, the links connect parent and child components. The vertical position indicates the component's gray level. Final MSER components are marked blue (see also Fig. 26.7), the root node is marked red, intermediate tree nodes are yellow.









Fig. 26.5 Component tree for a real image of size 200 × 200 with 10,128 components. Selected MSER components are marked blue (see also Fig. 26.8), the root node is marked red, intermediate tree nodes are yellow.



number is eventually selected as MSERs (marked blue). On larger images it is common to find trees with several million components.
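Before turning to the extraction step, it may help to fix a concrete data structure. The following minimal Java sketch shows one possible representation of a component-tree node as used in the examples below; it is illustrative only (the class and field names are our own and not those of the imagingbook implementation).

import java.util.ArrayList;
import java.util.List;

// Minimal component-tree node (illustrative sketch, not the imagingbook API).
class Component {
    final int level;                 // brightness level at which this component was emitted
    int size;                        // number of pixels in this component (including children)
    Component parent = null;         // parent component (null for the root)
    final List<Component> children = new ArrayList<>();
    double var = Double.POSITIVE_INFINITY;  // size variation, filled in later (Eqn. 26.12)
    boolean isStable = true;                // maximal-stability flag, set later (Sec. 26.3.2)

    Component(int level) {
        this.level = level;
    }

    void addChild(Component c) {     // establish a parent/child relation
        c.parent = this;
        this.children.add(c);
    }
}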



26.3 Extracting MSERs from the Component Tree

As described in the previous sections, the component tree is a hierarchy of the extremal regions found in the input image. The final step is to extract a set of maximally stable extremal regions (MSERs) from the component tree, i.e., a subset of its components. (A MSER instance is an extremal component with some additional properties attached, see Table 26.1.) The criteria for selecting a component as a MSER are [184]:
1. The area A = |c| of a component c must be sufficiently large but not too large, i.e., amin ≤ A ≤ amax. The size limits are defined



as amin = AI·αmin and amax = AI·αmax, relative to the image size AI = M·N, with parameters αmin and αmax.
2. The component's size must be "stable" over a given range ∆ of brightness levels, i.e., its size variation must be sufficiently small (see Sec. 26.3.1).
3. A MSER must be maximally stable, that is, its size variation is required to be a local minimum w.r.t. its parent and all child components (see Sec. 26.3.2).
4. Successive MSERs in the tree hierarchy should not be too similar to each other but "diverse", i.e., of significantly different size, to avoid close-by duplicates (see Sec. 26.3.3).
Some additional constraints, which are not part of the original algorithm, are formulated in Sec. 26.3.5. The complete MSER extraction process is summarized in Algs. 26.6–26.7.

26.3.1 Component Size Variation (Growth Rate)

The main criterion for selecting a region as maximally stable is that its growth rate over a given range of brightness levels is sufficiently small and a local minimum. The components along any upward path of the tree keep growing in size, i.e., any tree component is smaller than its parent. The smallest components are found at the leaf nodes and may consist of a single pixel only, while the largest component is the root node, which includes all image pixels. For every component on the path from a leaf node to the tree's root we can calculate the rate of growth w.r.t. its children and parent.
A component tree T = ⟨E, r⟩ consists of the set of extremal components E and the root component r ∈ E. A leaf component has no children and is the terminal node on a tree branch. The path from a leaf component c0 to the tree's root r = cn is a sequence of components

(c0, c1, . . . , cn),   (26.8)

where par(ci) = ci+1. The components along any such path are of increasing size (|ci| denotes the size, i.e., the total number of pixels, of component ci), i.e.,

|c0| ≤ |c1| ≤ · · · ≤ |cn|,   (26.9)

and (strictly) increasing brightness levels li = level(ci), i.e.,

l0 < l1 < · · · < ln.   (26.10)

Note that components are generally not inserted for every possible brightness value but only at levels where the size of the extremal region changes. In particular, the size of the region is zero (nonexistent) for any level l < l0 below the leaf component's level, remains constant between successive components ci and ci+1, and is infinite at and above the level of the root component cn. Thus, for a given path (c0, . . . , cn) with levels (l0, . . . , ln), the relationship between the brightness level l and the component size can be expressed as a function

size(l) := 0 for l < l0,   |ci| for li ≤ l < li+1,   ∞ for l ≥ ln.   (26.11)

Based on this function, the size variation for a given component c at level l = level(c) is defined as

var(c) = (size(l+∆) − size(l)) / size(l),   (26.12)



where ∆ is the brightness span parameter. (This formulation slightly deviates from the original algorithm [184], which takes into account size changes in both the up- and downward directions.) Thus var(c) is the rate of relative region growth for component c over ∆ consecutive brightness levels. Since the component size is always increasing on the way up, the values var(ci) in Eqn. (26.12) are in the range [0, ∞]. The variation is 0 if the component size is constant over ∆ brightness steps, and it is ∞ if level(c) is fewer than ∆ steps away from the root's level. Note that a particular component c may be on several paths from leaf nodes to the root. Since the formulation in Eqn. (26.12) only uses the "upward" difference for calculating a component's size variation, the result is independent of the path (i.e., child) by which c is reached.

Example

Figure 26.6 illustrates the calculation of the size variation on a simple component tree. In this example, brightness levels are assumed to be in the range [0, 31]. The nodes of the component tree (real components) are shown as yellow bars positioned at the associated brightness levels. Gray bars represent virtual components which have the same size as the next real component underneath but are not part of the component tree. The green, dotted line segments mark the path (c0, c1, . . . , c4) from leaf node c0 to the component tree's root node c4. The root component covers all pixels and its associated level (25) is the maximum pixel value found in the image.
For example, to calculate the size variation for component c0 at l0 = level(c0) = 3, its size is compared to the size of the (virtual) component at level l0′ = 3+∆ = 8. Since l0′ is below the level of the parent component c1 (at level 15), the size does not change w.r.t. c0 and thus the variation at c0 is 0. For component c1 (at level l1 = 15) the reference level is l1′ = 15+∆ = 20, which is above its parent component c2 but still below c3. Thus the variation of component c1 will be set to

var(c1) = (size(20) − size(15)) / size(15) = (|c2| − |c1|) / |c1|.

See procedure CalculateVariations() in Alg. 26.6 for a summary of this calculation.
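Expressed in Java, this upward variation calculation could look as follows. This is a sketch based on the illustrative Component class introduced earlier (fields level, size, parent, var), not the actual imagingbook code.

// Sets c.var = (|c'| - |c|) / |c|, where c' is the highest ancestor (possibly c itself)
// whose level is still below level(c) + delta, provided some ancestor reaches
// level(c) + delta; otherwise c.var remains infinite (cf. Eqn. 26.12 and Alg. 26.6).
static void calculateVariation(Component c, int delta) {
    c.var = Double.POSITIVE_INFINITY;
    if (c.parent == null)
        return;                            // the root has no defined variation
    int targetLevel = c.level + delta;
    Component cc = c;                      // last component below the target level
    Component p = c.parent;
    while (p != null && p.level < targetLevel) {
        cc = p;                            // step up to the parent
        p = p.parent;
    }
    if (p != null) {                       // some ancestor reaches the target level
        c.var = (double) (cc.size - c.size) / c.size;
    }
}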






Fig. 26.6 Calculation of component size variation. The illustration shows a simple component tree with root c4 and two leaf nodes c0 and c5. The vertical position of each component indicates the associated brightness level (in the range [0, 31]). The root level (25) is the maximum brightness found in the image. Yellow nodes are extremal components of the component tree while gray components are only "virtual". ∆ is the (constant) brightness span for calculating a component's size variation (growth rate).



26.3.2 Maximally Stable Components

A component c at level l is classified as maximally stable if its variation is smaller than the variation of the components at the neighboring levels l−1 and l+1 on the same path. This calculation is performed by procedure MarkMaximallyStable() in Alg. 26.6. Initially, all components are assumed to be maximally stable (Alg. 26.6, line 25). Then each component c is visited (line 26) and its variation is compared to the variation of its parent p = par(c), but only if their brightness levels are off by one, i.e., if

level(c) + 1 = level(p).   (26.13)

The component with the greater variation (the "loser") is marked as not maximally stable (see Alg. 26.6, lines 31–36). Note that nothing is changed when the variations of c and p are the same. If component c has any child component c′ with brightness level(c)−1 and smaller variation var(c′) < var(c), then c is marked as non-stable when c′ is visited. In other words, any component c can only "survive" as maximally stable if var(c) is a local minimum, i.e.,

min{var(c′) : c′ ∈ chdr(c)} > var(c) < var(p),   (26.14)

and c′, c, p have consecutive brightness levels.

26.3.3 Constraints on Component Size and Diversity

MSERs are "harvested" by traversing all branches of the component tree. The process starts at the root component, i.e., large components are evaluated before smaller components. To be acceptable as a MSER in the first place, a component's size |c| must lie within predefined limits, that is,

amin ≤ |c| ≤ amax.   (26.15)



Alg. 26.6 MSER extraction (Part 1).

1:  ExtractMSER(I, ∆, αmin, αmax, varmax, divmin)
    Input: I, a grayscale image of size M × N. Parameters: ∆, brightness span; αmin, αmax, min./max. relative component size; varmax, max. size variation; divmin, min. diversity. Returns a set of detected MSER components.
2:    ⟨E, r⟩ ← BuildComponentTree(I)             ▷ T = ⟨E, r⟩, see Sec. 26.2
3:    CalculateVariations(E, ∆)                  ▷ see below
4:    MarkMaximallyStable(E)                     ▷ see below
5:    UpdateStatistics(r)                        ▷ see Alg. 26.7
6:    A ← M·N                                    ▷ image size
7:    amin ← A·αmin                              ▷ min. component size (in pixels)
8:    amax ← A·αmax                              ▷ max. component size (in pixels)
9:    M ← GetMsers(r, amin, amax, varmax, divmin)
10:   return M

11: CalculateVariations(E, ∆)
    Calculates and sets the size variation var(c) of each component c in set E. Parameter ∆ is the brightness span for calculating size variations. Returns nothing but modifies the components in E.
12:   for all c ∈ E do                           ▷ no particular order
13:     var(c) ← ∞
14:     p ← par(c)
15:     if p ≠ nil then                          ▷ c is not root of component tree
16:       l+ ← level(c) + ∆                      ▷ target level l+
17:       c′ ← c
          Find nearest ancestor component p with level(p) ≥ l+:
18:       while p ≠ nil ∧ level(p) < l+ do
19:         c′ ← p                               ▷ step up to parent of c′
20:         p ← par(c′)
21:       if p ≠ nil then                        ▷ c′ not root, level(p) ≥ l+
22:         var(c) ← (|c′| − |c|) / |c|          ▷ set variation of c

23: MarkMaximallyStable(E)
    Checks all components c in set E and sets the property stable(c) if c is maximally stable. Returns nothing but modifies the components in E.
24:   for all c ∈ E do                           ▷ assume all components are max. stable
25:     stable(c) ← true
26:   for all c ∈ E do                           ▷ no particular order
27:     p ← par(c)
28:     if p = nil then                          ▷ c is the root component
29:       stable(c) ← false                      ▷ root is never max. stable
30:     else if level(c) + 1 = level(p) then
31:       vc ← var(c)
32:       vp ← var(p)
33:       if vc < vp then
34:         stable(p) ← false                    ▷ mark p as not stable
35:       else if vc > vp then
36:         stable(c) ← false                    ▷ mark c as not stable



with amin = αmin ·A, amax = A·αmax specified by the image size A = M ·N and parameters αmin , αmax (see Alg. 26.6, lines 6–8). To avoid similar-sized components being selected close to each other, a minimum size difference is required between successive



Alg. 26.7 MSER extraction (Part 2).

1:  UpdateStatistics(c)
    Recursively calculates and updates the component statistics in the sub-tree rooted at component c.
2:    m ← (0, 0, 0, 0, 0)                        ▷ m = (Σx, Σy, Σx², Σy², Σxy)
3:    for all (x, y) ∈ pxls(c) do                ▷ iterate over local points of c
4:      m ← m + (x, y, x², y², x·y)              ▷ update vector m
5:    for all c′ ∈ chdr(c) do                    ▷ recursively update children
6:      m′ ← UpdateStatistics(c′)
7:      m ← m + m′
8:    stats(c) ← m                               ▷ update stats properties of c
9:    return m

10: GetMsers(r, amin, amax, varmax, divmin)
    Collects and returns a set of MSERs from the component tree with root r.
11:   M ← ∅
12:   CollectMsers(r, ∞, M)
13:   return M

14: CollectMsers(c, a+, M)
    Recursively collects MSERs from the sub-tree rooted at component c. a+ is the size of the last collected MSER (c+). Parameters (defined above): amin, amax, varmax, divmin. Set M is modified.
15:   ⟨xc, yc, ra, rb, θ⟩ ← GetEllipse(c)         ▷ equiv. ellipse
16:   isStable ← stable(c)
17:   sizeOK ← (amin ≤ |c| ≤ amax)
18:   varOK ← (var(c) ≤ varmax)
19:   divOK ← ((a+ − |c|) / a+ ≥ divmin)         ▷ Eqn. (26.16)
20:   ellipseOK ← (ra·rb·π ≤ amax)               ▷ Eqn. (26.28)
21:   if isStable ∧ sizeOK ∧ varOK ∧ divOK ∧ ellipseOK then
22:     add(c, M)                                ▷ add MSER c to set M
23:     a+ ← |c|                                 ▷ set size of last collected MSER
24:   if |c| > amin then                         ▷ continue unless size is too small
25:     for all c′ ∈ chdr(c) do
26:       CollectMsers(c′, a+, M)                ▷ collect MSERs from child c′

27: GetEllipse(c)
    Calculates and returns the equivalent ellipse for component c.
28:   n ← |c|                                    ▷ size of component c
29:   (m10, m01, m20, m02, m11) ← stats(c)       ▷ (Σx, Σy, Σx², Σy², Σxy)
30:   xc ← m10/n                                 ▷ Eqn. (26.20)
31:   yc ← m01/n
32:   µ20 ← m20 − m10²/n                         ▷ Eqn. (26.23)
33:   µ02 ← m02 − m01²/n                         ▷ Eqn. (26.24)
34:   µ11 ← m11 − m10·m01/n                      ▷ Eqn. (26.25)
35:   Σ ← (µ20 µ11; µ11 µ02)
36:   (λ, E) ← Eigen2x2(Σ)                       ▷ λ = (λ0, λ1), see Alg. B.1, p. 817
37:   (dx, dy)⊺ ← E(∗, 0)                        ▷ E = (e0 | e1)
38:   ra ← 2·√(λ0/n)                             ▷ Eqn. (26.21)
39:   rb ← 2·√(λ1/n)
40:   θ ← ArcTan(dx, dy)
41:   return ⟨xc, yc, ra, rb, θ⟩                 ▷ new ellipse






MSER’s on the same branch. If some component c+ has been identified as a MSER, the next candidate c on the same tree branch must have a minimum “diversity” relative to c+ , defined as div(c) =



|c+ | − | c | . |c+ |



(26.16)



Since |c+| ≥ |c| and |c| > 0, the quantity div(c) is in the range [0, 1). If c+ and c have the same size, then div(c) = 0. Component c is a potential MSER candidate if div(c) ≥ divmin; otherwise c is considered too similar to the previous MSER c+ and skipped. Parameter divmin is in [0, 1): for example, with the (typical) setting divmin = 0.2, the size difference between successive MSERs must be greater than 20%. This classification is formulated in procedure GetMsers(), called in Alg. 26.6 (line 9), and the recursive procedure CollectMsers(c, a+, M), where c is the current component, a+ is the size of the most recently found MSER (along the path from the tree's root to c), and M is the set of collected MSERs (see Alg. 26.7, lines 14–26).

26.3.4 MSER Feature Statistics and Equivalent Ellipse

Since the set of associated pixels is known for every component, it can be used to calculate descriptive feature statistics for each MSER, including any of the binary region properties described in Sec. 8.4. The minimum requirement is knowledge about the position and size of a MSER. Typically, oriented ellipses are fitted to the binary component regions (see Sec. 8.6.3). For each component c with contained pixel coordinates Rc = {(x, y)}, a vector of ordinary moments (see also Sec. 8.5.2, Eqn. (8.20))

mc = (m10, m01, m20, m02, m11)⊺ = Σ(x,y)∈Rc (x, y, x², y², x·y)⊺   (26.17)

is calculated. In Alg. 26.7, this is implemented by the recursive function UpdateStatistics(c), which starts from the root of the component tree. Recall that the set of pixels Rc belonging to a component c consists of the "local" pixels pxls(c) and the pixels Rc′ of its child components c′, that is,

Rc = pxls(c) ∪ ⋃c′∈chdr(c) Rc′.   (26.18)

For each component c, the moments mc′ of all child components c′ ∈ chdr(c) are summed up and added to the local point statistics, i.e.,

mc = Σ(x,y)∈pxls(c) (x, y, x², y², x·y)⊺ + Σc′∈chdr(c) mc′   (26.19)
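A possible Java rendering of this recursive accumulation is sketched below. It assumes that the illustrative Component class is extended by a list localPixels of int[] coordinates and a double[] field stats; both names are hypothetical and not part of the imagingbook API.

// Accumulates the moment vector (Σx, Σy, Σx², Σy², Σxy) for component c,
// including the pixels of all child components (cf. Eqn. 26.19).
static double[] updateStatistics(Component c) {
    double[] m = new double[5];                  // (Σx, Σy, Σx², Σy², Σxy)
    for (int[] p : c.localPixels) {              // "local" points pxls(c)
        double x = p[0], y = p[1];
        m[0] += x;  m[1] += y;
        m[2] += x * x;  m[3] += y * y;  m[4] += x * y;
    }
    for (Component child : c.children) {         // add child statistics recursively
        double[] mc = updateStatistics(child);
        for (int i = 0; i < 5; i++)
            m[i] += mc[i];
    }
    c.stats = m;                                 // store the accumulated statistics
    return m;
}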



(see Alg. 26.7, lines 3–7 for details). As described in Sec. 8.6.3, the center of the associated ellipse is obtained as13

x̄ = (x̄, ȳ)⊺ = (1/|c|) · (m10, m01)⊺.   (26.20)

The lengths of the ellipse's major and minor axis are

ra = 2·(λ0/|c|)^(1/2)   and   rb = 2·(λ1/|c|)^(1/2),   (26.21)

respectively, λ0, λ1 being the eigenvalues14 of the component's second-order central moment matrix

µ = (µ20 µ11; µ11 µ02),   (26.22)

with15

µ20 = Σx² − (Σx)²/|c| = m20 − m10²/|c|,   (26.23)
µ02 = Σy² − (Σy)²/|c| = m02 − m01²/|c|,   (26.24)
µ11 = Σxy − Σx·Σy/|c| = m11 − m01·m10/|c|.   (26.25)

Finally, the orientation angle θ of the ellipse can be calculated from the first (i.e., the larger) eigenvector x0 = (x0, y0) of matrix µ as

θ = ∠x0 = ArcTan(x0, y0).   (26.26)
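As a concrete sketch (again using the hypothetical stats vector of the previous example rather than the imagingbook classes), the ellipse parameters can be computed from the accumulated moment sums in closed form; the orientation is obtained here via the direct central-moment formula mentioned below, which is equivalent to Eqn. (26.26):

// Returns {xc, yc, ra, rb, theta} for a component with n pixels and
// moment sums m = (Σx, Σy, Σx², Σy², Σxy). Illustrative sketch only.
static double[] equivalentEllipse(double[] m, int n) {
    double m10 = m[0], m01 = m[1], m20 = m[2], m02 = m[3], m11 = m[4];
    double xc = m10 / n;
    double yc = m01 / n;
    double mu20 = m20 - m10 * m10 / n;           // central moments (Eqns. 26.23-26.25)
    double mu02 = m02 - m01 * m01 / n;
    double mu11 = m11 - m10 * m01 / n;
    // eigenvalues of the symmetric 2x2 matrix (mu20 mu11; mu11 mu02)
    double tr = mu20 + mu02;
    double d = Math.sqrt((mu20 - mu02) * (mu20 - mu02) + 4 * mu11 * mu11);
    double lambda0 = 0.5 * (tr + d);             // lambda0 >= lambda1
    double lambda1 = 0.5 * (tr - d);
    double ra = 2 * Math.sqrt(lambda0 / n);      // major axis (Eqn. 26.21)
    double rb = 2 * Math.sqrt(lambda1 / n);      // minor axis
    // orientation of the major axis from the central moments
    double theta = 0.5 * Math.atan2(2 * mu11, mu20 - mu02);
    return new double[] {xc, yc, ra, rb, theta};
}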



Alternatively, the region's orientation could be calculated directly from the central moments µ20, µ02 and µ11 (i.e., the elements of matrix µ), as described in Sec. 8.6.1. Note that the covariance matrix Σ of the component's coordinates can be obtained by normalizing µ, that is,

Σ = (σ20 σ11; σ11 σ02) = (1/|c|) · (µ20 µ11; µ11 µ02).   (26.27)

Figures 26.7 and 26.8 show various detection examples on synthetic and real images, with MSER components marked by equivalent ellipses, as described above.

26.3.5 Additional Constraints

As formulated by the component size constraint in Eqn. (26.15), a MSER is only accepted if its area (i.e., the number of all contained pixels) is neither too small nor too large. If a component is a fairly compact blob, the area of its equivalent ellipse is similar to the region's area. However, if a component is highly non-compact, the size of the ellipse may be much greater, since its radii are determined by the second-order moments of its pixel coordinates and thus remote



13 |c| = |Rc| denotes the size (total number of pixels) of component c.
14 λ0 ≥ λ1, see also Sec. B.5 in the Appendix.
15 µ20, µ02, µ11 are the second-order central moments of the region coordinates (see Sec. 8.5.3, Eqn. (8.39)).






Fig. 26.7 MSER detection examples (synthetic images). Test image with step changes (a) and smoothed version (b). Detected MSERs are shown as ellipses with colors indicating the associated gray level l (see the color scale on top). See Fig. 26.4 for the complete component trees of these images. Note that extremal components (and thus MSERs) are always darker than their surroundings. Thus the bright blob in the upper-right corner creates neither an extremal region nor a MSER candidate.

Fig. 26.8 MSER detection example (real image). Original image of size 200 × 200 (a), detected MSER components (b). See Fig. 26.5 for the complete component tree of this image.

Fig. 26.9 Scattered components may produce exceedingly large ellipses. Images (a–d) show the four largest components and their equivalent ellipses. These provide a good fit when regions are relatively compact (a, d) but may become extremely large when regions are highly non-compact, as in (b, c). Note that, in (b, c), the colored patches at the top and bottom form single regions, connected by dark pixels along the left image border.



pixels have a disproportionately large impact. This effect is illustrated in Fig. 26.9, which shows the four largest components, each covering less than 25% of the image area. While the components in Fig. 26.9 (a, d) are quite compact, the equivalent ellipses in Fig. 26.9 (b, c) are much larger than the associated pixel sets. To avoid this problem, one could (in addition to Eqn. (26.15)) also constrain the size of the equivalent ellipse to

amin ≤ ae ≤ amax,   (26.28)

with the ellipse area

ae = ra·rb·π.   (26.29)

Fig. 26.10 Elimination of large, scattered regions by limiting ellipse size. MSERs detected with ellipse size unconstrained (a) and limited to amax (b). Image pixels contained in each MSER are tinted with the same color as the ellipse. In (a) several large ellipses appear which represent widely scattered (though connected) regions caused by the dark pixels around the image border. All ellipse centers are inside the image.



This (optional) size constraint is applied in Alg. 26.7 (line 20) and its effect is illustrated in Fig. 26.10. In addition, one could suppress highly non-compact MSERs, for example, by defining a component's compactness as the ratio of component size and ellipse area, i.e.,

γ = |c| / ae,   (26.30)



and enforcing a minimum compactness γ ≥ γmin. With γmin = 0.2, e.g., a MSER must cover at least 20% of the area of its equivalent ellipse.

26.3.6 Detecting Dark and Bright Blobs

Due to the original definition of the component tree, extracted MSER components are always darker than the surrounding image and thus the described algorithm is capable of detecting dark blobs only (see Fig. 26.8). We call this forward mode and M+ the resulting MSER set, i.e. (see Alg. 26.6 for the definition of ExtractMSER()),

M+ ← ExtractMSER(I, . . .).   (26.31)

Alternatively, bright blobs, which are equally important features, can be detected by simply running the MSER algorithm on the inverted image Ī, i.e., in inverse mode, producing the set

M− ← ExtractMSER(Ī, . . .).   (26.32)
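A minimal Java sketch of this dual-mode usage with ImageJ image data is shown below; the MserDetectorSketch interface and its detect() method are hypothetical placeholders and not the actual imagingbook MserDetector API.

import ij.process.ImageProcessor;
import java.util.HashSet;
import java.util.Set;

// Sketch of combining forward- and inverse-mode detection (cf. Eqn. 26.33 below).
class DualModeMserExample {
    interface MserDetectorSketch {
        Set<Component> detect(ImageProcessor ip);   // returns dark-blob MSERs
    }

    static Set<Component> detectBoth(MserDetectorSketch detector, ImageProcessor ip) {
        Set<Component> all = new HashSet<>();
        all.addAll(detector.detect(ip));            // M+ : dark blobs (forward mode)
        ImageProcessor inverted = ip.duplicate();   // work on a copy
        inverted.invert();                          // inverted image for inverse mode
        all.addAll(detector.detect(inverted));      // M- : bright blobs
        return all;                                 // M = M+ ∪ M-
    }
}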



Fig. 26.11 MSERs detected by applying the algorithm in forward mode (M+) and inverse mode (M−). Blobs in M+ (a, c, e) are darker while blobs in M− (b, d, f) are brighter than the surrounding image.



The examples in Fig. 26.11 show both sets for the previously used test images. The combined set containing both dark and bright MSERs is finally obtained by simply joining the individual results, i.e.,

M ← M+ ∪ M−.   (26.33)



26.3.7 MSER Examples






Examples of combined MSER sets (forward and inverse mode) for various outdoor images are shown in Fig. 26.12. Typically with this



type of image the MSER sets M+, M− are of similar size. As one may expect, distinct and sharply delineated blobs are found quite reliably, whereas detection in more blurry parts of the images tends to produce considerable clutter. All results were obtained with the default parameter settings listed in Table 26.2.

Table 26.2 MSER parameters and default values. Java ids are the associated variable names in the imagingbook reference implementation.

Param.   Java id.            Value    Description
∆        delta               5        brightness span for component stability
αmin     minRelCompSize      0.0001   min. relative component size (w.r.t. image size)
αmax     maxRelCompSize      0.25     max. relative component size (w.r.t. image size)
varmax   maxSizeVariation    0.25     max. component size variation
divmin   minDiversity        0.50     min. diversity (rel. size of consecutive MSERs)
γmin     minCompactness      0.20     min. compactness (component/ellipse size ratio)



The results from stereo image pairs in Fig. 26.13 should give a rough (qualitative) impression of the detector’s stability under small variations of camera view and exposure.



26.4 Matching MSERs

MSER features are used in a variety of applications, including wide-baseline stereo [184], image retrieval [199] and motion tracking [75]. In any of these applications it is essential that features extracted from different images can be associated in a reliable manner that is reasonably resistant against geometric transformations. In general, different views of a given planar patch in a 3D scene are related by projective transformations in 2D,17 but affine transformations are commonly used as a sufficient and simpler approximation. MSERs are unaffected by practically any continuous geometric transformation, and affine matching is aided by the properties of the equivalent ellipse.18
There are two main approaches for matching MSER features. The first approach is based on the shape of the extremal region itself or quantities derived from the shape, e.g., statistical moments19 or Fourier descriptors calculated from the shape's contour [168].20 In [184], a set of stable points along the region contour is used to establish initial feature correspondences. The second (more common) approach is to sample the original image to extract a fixed-size texture patch that is geometrically normalized and represents a pictorial "fingerprint" of the associated MSER. The basic scheme for extracting such "local affine frames" is outlined in Sec. 26.5. In this process, the component's equivalent ellipse

17 See Sec. 21.1.4.
18 Note that an oriented ellipse has only five degrees of freedom (x/y position, radii ra, rb and orientation θ), while a general affine transformation requires six parameters (see Sec. 21.1.1). Thus the ellipse parameters (obtained from first- and second-order moments) are not sufficient to derive a complete affine transformation. In particular, the amount of shearing cannot be determined from second-order moments alone.
19 See Sec. 8.5.2.
20 See also [49, Ch. 26].



Fig. 26.12 MSER detection on various outdoor images. The left column (a, c, e, g) shows the results of forward-mode (M+) processing only, with colors indicating the components' brightness level (l). The right column (b, d, f, h) shows both MSER sets (M+, M−) obtained by processing in forward and inverse mode, respectively.






Fig. 26.13 MSER detection on pairs of stereo images. The color of each ellipse marker (M+, forward mode only) indicates the maximum gray level (l) of the associated component. The slight differences in view and exposure between left and right images give a coarse impression of detection stability.






is geometrically mapped to a circle, with its dominant orientation aligned to the horizontal frame axis. In principle, this should allow matching of MSERs by direct correlation (i.e., pixel-by-pixel comparison) of their normalized texture patches. Note, however, that the ellipse's orientation has an implicit 180° ambiguity, or may even be undefined if the second-order moments of the underlying shape cancel out (i.e., µ20 − µ02 ≈ 0). Various schemes for resolving this orientation ambiguity are described in [199], e.g., aligning to the contour point with the maximum distance from the shape's center. Another idea to overcome the rotation problem is to use invariant texture moments21 calculated from the local affine frame images for matching MSERs [184]. Alternatively, the use of SIFT descriptors,22 again computed from the local affine frames and using the transformed MSER shape as a mask, was proposed in [94].
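As an illustration of such a direct comparison (a sketch only, not part of the original method), the following Java code scores two equally sized local affine frames by normalized cross-correlation and simply evaluates both the patch and its 180°-rotated version to cope with the orientation ambiguity:

// Normalized cross-correlation of two equally sized gray patches a, b.
static double ncc(float[][] a, float[][] b) {
    int w = a.length, h = a[0].length;
    double sa = 0, sb = 0;
    for (int u = 0; u < w; u++)
        for (int v = 0; v < h; v++) { sa += a[u][v]; sb += b[u][v]; }
    double ma = sa / (w * h), mb = sb / (w * h);
    double num = 0, da = 0, db = 0;
    for (int u = 0; u < w; u++)
        for (int v = 0; v < h; v++) {
            double x = a[u][v] - ma, y = b[u][v] - mb;
            num += x * y;  da += x * x;  db += y * y;
        }
    return (da == 0 || db == 0) ? 0 : num / Math.sqrt(da * db);
}

// Compare frame b directly and rotated by 180 degrees; keep the better score.
static double matchScore(float[][] a, float[][] b) {
    int w = b.length, h = b[0].length;
    float[][] rot = new float[w][h];
    for (int u = 0; u < w; u++)
        for (int v = 0; v < h; v++)
            rot[u][v] = b[w - 1 - u][h - 1 - v];   // 180-degree rotation
    return Math.max(ncc(a, b), ncc(a, rot));
}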



26.5 Local Affine Frames

A local affine frame is a geometrically normalized image (or texture) patch that is extracted from the original input image around a particular MSER component. The required geometric transformation is specified by the position and the shape of the MSER. The image is geometrically normalized such that its equivalent ellipse is mapped to a circle, which is achieved by a geometric transformation that diagonalizes the region's covariance matrix [184, Sec. 3]. The covariance matrix Σ of a component (see Eqn. (26.22)) is always real and symmetric.23 It can therefore be decomposed in the form

Σ = Q·Λ·Q⊺,   (26.34)

where Q is an orthogonal matrix (i.e., Q⊺ = Q−1) whose columns are the eigenvectors of Σ, and Λ = (Λij) is a diagonal matrix whose entries are the eigenvalues of Σ. This decomposition is obtained numerically, e.g., by eigendecomposition.24 The resulting matrix Q is used to specify the transformation from an elliptical to a circular patch [94, Sec. 3.1].
A local affine frame F is (typically) a square image of fixed size, e.g., (2m+1) × (2m+1). The frame's center pixel at position uc = (uc, vc)⊺ = (m, m)⊺ corresponds to the center of the equivalent ellipse. To calculate the complete frame, the original image I is resampled (and interpolated) for each frame pixel u ∈ [0, 2m]² at the associated continuous point

x = A·(u − uc) + x̄   with   A = (2s/m)·Q·Λ^(1/2),   (26.35)

where x̄ = (x̄, ȳ)⊺ is the component's centroid and Λ^(1/2) = (√Λij). The scale factor s specifies the size of the frame relative to the size

21 See Sec. 8.6.5.
22 See Sec. 25.3.
23 See Sec. G.3.4 of the Appendix.
24 See Sec. B.7.5 of the Appendix.






Fig. 26.14 Affine frame extraction. The shape's equivalent ellipse (blue) and its scaled version (magenta) are marked in the original image I (with scale factor s = 1.5). The target frame F is of size (2m+1) × (2m+1). The affine transformation maps the two ellipses (and the associated image content) to centered circles in frame F, with major axes parallel to the u-axis.



of the equivalent ellipse. With s = 1, the component's equivalent ellipse maps to a circle that exactly fits inside the local affine frame. Typically a larger area is extracted to capture more of the surrounding image context, particularly since there may be only small image variations within the MSER region itself. Figure 26.14 illustrates this transformation for s = 1.5.

Avoiding Reflections

Naturally, the mapping from the region's ellipse to the square local frame is not unique. While the ellipse's major axis is transformed to a horizontal line (through the frame's center), left and right are unspecified. If the region has no dominant axis (i.e., its ellipse is a circle), the orientation of the resulting image patch is arbitrary. In general, the transformation matrix A may involve a reflection, in which case the determinant of A is negative, i.e., det(A) < 0. For the purpose of matching local frames it may be helpful to avoid reflections. A reflective transformation matrix A can easily be fixed by applying another reflection, e.g., by multiplying with the matrix (1 0; 0 −1), which represents a reflection about the x-axis, i.e.,

A′ = A·(1 0; 0 −1)   if det(A) < 0,
A′ = A                otherwise.   (26.36)

The actual image transformation can be implemented by simple target-to-source mapping (see Sec. 21.2.2), using the original image I as the source and the frame F as the target, as summarized by Alg. 26.8. Alternatively, the geometric transformation in Eqn. (26.35) from local frame coordinates u = (u, v) to image points x = (x, y) can be written in homogeneous coordinates (see Sec. 21.1.2) as

x = M·u = T2 · A′ · T1 · u,   (26.37)

with T2 = (1 0 x̄; 0 1 ȳ; 0 0 1), the 2 × 2 matrix A′ embedded as (A′ 0; 0⊺ 1), and T1 = (1 0 −uc; 0 1 −vc; 0 0 1),



Alg. 26.8 Local affine frame extraction by geometric transformation. The component's equivalent ellipse is mapped to a circle. The transformation is implemented by target-to-source mapping (with source I and target F). A Gaussian pre-filter (HσG) with radius σ is applied to the image prior to re-sampling to avoid aliasing.

1:  ExtractLocalAffineFrame(I, c, m, s)
    Input: I, the original image; c, a MSER component; m, the size of the local affine frame; s, the scale factor. Returns the local affine frame F of size (2m+1) × (2m+1) for component c.
2:    F ← new map [0, 2m] × [0, 2m] ↦ R          ▷ local affine frame
3:    uc ← (m, m)⊺                               ▷ center of F
4:    ⟨x̄, ȳ, ra, rb, θ⟩ ← GetEllipse(c)           ▷ see Alg. 26.7
5:    x̄ ← (x̄, ȳ)
6:    ∂max ← ra/m                                ▷ max. scale change
7:    if ∂max > 1.5 then
8:      σ ← 0.5 · (∂max − 1)
9:      I ← I ∗ HσG                              ▷ apply Gaussian pre-filter
10:   for all u ∈ [0, 2m] × [0, 2m] do           ▷ frame coord. (discrete)
11:     x ← A′·(u − uc) + x̄                      ▷ image coord. (continuous)
12:     F(u) ← GetInterpolatedValue(I, x)        ▷ interpolate I at pos. x
13:   return F
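For illustration, the resampling loop of Alg. 26.8 could be coded in Java as sketched below, using plain bilinear interpolation on a float pixel array. The 2 × 2 matrix A′ (Eqn. (26.36)) and the centroid (x̄, ȳ) are assumed to be given, and the Gaussian pre-filtering step is omitted here; this is a sketch only, not the imagingbook implementation.

// Fills the (2m+1) x (2m+1) frame F by target-to-source mapping with bilinear
// interpolation. A is the 2x2 matrix {{a00,a01},{a10,a11}} corresponding to A'.
static float[][] extractFrame(float[][] I, double[][] A, double xBar, double yBar, int m) {
    int n = 2 * m + 1;
    float[][] F = new float[n][n];
    for (int u = 0; u < n; u++) {
        for (int v = 0; v < n; v++) {
            double du = u - m, dv = v - m;                  // u - uc
            double x = A[0][0] * du + A[0][1] * dv + xBar;  // continuous source position
            double y = A[1][0] * du + A[1][1] * dv + yBar;
            F[u][v] = interpolateBilinear(I, x, y);
        }
    }
    return F;
}

static float interpolateBilinear(float[][] I, double x, double y) {
    int w = I.length, h = I[0].length;
    int x0 = (int) Math.floor(x), y0 = (int) Math.floor(y);
    if (x0 < 0 || y0 < 0 || x0 + 1 >= w || y0 + 1 >= h)
        return 0;                                           // outside the image
    double a = x - x0, b = y - y0;
    return (float) ((1 - a) * (1 - b) * I[x0][y0] + a * (1 - b) * I[x0 + 1][y0]
                  + (1 - a) * b * I[x0][y0 + 1] + a * b * I[x0 + 1][y0 + 1]);
}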



with u = (u, v, 1)⊺ , x = (x, y, 1)⊺ , and translation matrices T1 , T2 . Since the affine 3 × 3 matrix M = T2 ·A′ ·T1 maps target to source coordinates, this is already the inverse transformation needed to render the local frame by “target-to-source” mapping.26 Examples Figure 26.15 shows an example with local affine frames extracted for a single shape under different affine transformations. The frames’ radius is m = 20 and s = 1.5, i.e., the image patch mapped to each frame is 1.5 times larger than the major axis of the equivalent ellipse, which itself is mapped to a circle. For rendering the local frames (of size 41 × 41), a simple Gaussian anti-aliasing filter was applied prior to re-sampling with bi-linear interpolation.27 Aliasing occurs when the sampling distance is increased by the geometric transformation, i.e., the sampling frequency is lowered. This may cause under-sampling of the original signal, unless its bandwidth is reduced accordingly. Since the inverse geometric transformation maps a circle (of radius m) to the equivalent ellipse, the maximum scale change (stretching) occurs in the direction of the ellipse’s major axis (of length ra ) by the factor ∂max = ra /m. If ∂max < 1, the sampling interval becomes smaller and thus the input signal is over-sampled, which is generally harmless. Otherwise, if ∂max > 1, the sampling theorem may be violated and the signal bandwidth should be reduced. For the examples in Fig. 26.15, an isotropic Gaussian pre-filter of radius σ = 0.5·(∂max −1) was used if ∂max > 1.5.



26.6 Summary MSER is an unusual feature detector and quite complementary to most other detectors, such as typical corner detectors or the SIFT 26 27



794



See Sec. 21.2.2. See Sec. 22.5.2.



26.6 Summary






detector. It has been shown to perform reliably under adverse conditions and compares favorably to other methods in difficult situations [187, 270]. Although the overall algorithm appears quite complex and is somewhat tricky to implement (particularly the component tree generation part), the MSER detector is surprisingly efficient. It ranks among the fastest methods available and is suitable for real-time applications, such as 3D stereo reconstruction and object tracking. Multi-scale and color versions of the MSER detector have been introduced [93]. It should also be mentioned that the MSER technique is not limited to 2D images, but extends seamlessly to N -dimensional data, particularly 3D volumetric data.



Fig. 26.15 Extraction of local affine frames. The top row (a–d) shows the MSER’s ellipse for various versions of the same shape, the bottom row (e–h) are the corresponding local affine frames (LAFs). Equivalent ellipses are blue, outer ellipses (scaled by s = 1.5) are magenta. In each LAF, the equivalent ellipse is mapped to a circle, with the major axis aligned to the x-axis. The original shape (a) and the rotated and scaled version in (b) produce almost identical LAFs (e, d). The mirrored and rotated shape in (c) results in a mirrored LAF (g). The shape in (d) is non-uniformly scaled to the extent that the major and minor axes of the equivalent ellipse are flipped, and thus its LAF is also rotated.



Java Implementation The imagingbook library28 provides a prototypical implementation of the MSER feature detector, which closely follows the above description. The core functionality is contained in class MserDetector, which detects blobs that are darker than the surrounding background. To find bright blobs within a darker background the detector is simply applied to the inverted image. The two algorithms for component tree generation (described in Sec. 26.2.1) are implemented by classes ComponentTreeGlobalImmersion and ComponentTreeLinearTime, respectively. In addition, various ImageJ demo programs are available in the associated plugins collection.



28 Package imagingbook.pub.mser






Appendix



Appendix A Mathematical Symbols and Notation



A.1 Symbols The following symbols are used in the main text primarily with the denotations given here. While some symbols may be used for purposes other than the ones listed, the meaning should always be clear in the particular context. (a0 , . . . , an−1 ) A vector or list, that is, a linear sequence of n elements of the same type. Unlike a set (see below), a list may contain the same element more than once. If used to denote a vector, then (a0 , . . . , an−1 ) is usually a row vector and (a0 , . . . , an−1 )⊺ is the corresponding (transposed) column vector.1 If used to represent a list,2 ( ) represents the empty list and (a) is a list with a single element a. |A| is the length of the sequence A, that is, the number of contained elements. A ⊞ Bdenotes the concatenation of lists A, B. A(i) or ai refers to the i-th element of A. A(i) ← x means that the i-th element of A is set to (i.e., replaced by) some value x (see also Sec. A.3). {a, b, c, d, . . .} A unordered set or collection of items. A set may contain a particular element not more than once, while a collection may contain duplicate values.3 { } or ∅ denote the empty set or collection. |A| = card(A) is the size or cardinality of the set or collection A. A ∪ B denotes the union and A ∩ B the intersection of sets A, B. x ∈ A means that the element x is contained in A (see also Sec. A.2). 1



2



3



In most programming environments, vectors are implemented as onedimensional arrays, with elements being referred to by position (index). Lists are usually implemented with dynamic data structures, such as linked lists. Java’s Collections framework provides numerous easy-touse list implementations. We use the same {. . .} notation for (mostly) collections and strict sets. The respective meaning should be clear from the context.












⟨A, B, C⟩ A tuple, that is, a fixed-size, ordered sequence of elements, each possibly of a different type (see also Sec. A.4).4 [a, b]



Numeric interval; x ∈ [a, b] means a ≤ x ≤ b. Similarly, x ∈ [a, b) says that a ≤ x < b.



|A|



Length (number of elements) of a sequence (see above) or size (cardinality) of a set A, that is, |A| ≡ card A.



|A|



Determinant of a matrix A (|A| ≡ det(A)).



|x|



Absolute value (magnitude) of a scalar or complex quantity x.



∥x∥



Norm of vector x. In general, ∥x∥n denotes the Ln norm of x, by default ∥x∥ = ∥x∥2 means the Euclidean (L2 ) norm.



⌈x⌉



“Ceil” of x, the smallest integer z ∈ Z greater than x ∈ R. For example, ⌈3.141⌉ = 4, ⌈−1.2⌉ = −1.



⌊x⌋



“Floor” of x, the largest integer z ∈ Z smaller than x ∈ R. For example, ⌊3.141⌋ = 3, ⌊−1.2⌋ = −2.



÷



Integer division operator: a÷b denotes the quotient of the two integers a, b. For example, 5 ÷ 3 = 1 and −13 ÷ 4 = −3 (equivalent to Java’s “/” operator in the case of integer operands).







Mirroring operator, used to denote x/y-reflected filter kernels, i.e., H ⋄ (i, j) = H(−i, −j) (see Sec. 4.3.1).







Linear convolution operator (see Sec. 4.3.1).







Linear cross correlation operator (see Sec. 23.1.1).







Outer vector product (see Sec. B.3.2).



×



Cross product (between vectors or complex quantities (see Sec. B.3.3).







Morphological dilation operator (see Sec. 7.2.3).







Morphological erosion operator (see Sec. 7.2.4).







Morphological opening operator (see Sec. 7.3.1).







Morphological closing operator (see Sec. 7.3.2).











“Similarity” relation used in the context of random variables and statistical distributions.







“Approximately equal” relation.







Equivalence relation.







Assignment operator: a ← expr means that expression expr is evaluated and the result assigned to variable a.



⊕ ←



⊕ a is equivalent to Incremental assignment operator: x ← x ← x + a. This is analogous to the ‘+=’ operator in Java and C (and many other programming languages).



4



800



Concatenation operator. Given two sequences or lists A = (a, b, c) and B = (d, e), A ⊞ B denotes the concatenation of A and B, with the result (a, b, c, d, e). Inserting a single element x at the end or front of the list A is written as A ⊞ (x) = (a, b, c, x) or (x) ⊞ A = (x, a, b, c), respectively.



Tuples are typically implemented as objects (in Java or C++) or structures (in C) with elements being referred to by name.



:=



Function definition operator (used in algorithms). For exam- A.1 Symbols ple, f (x) := x2 + 5 defines a function f () with the bound variable (formal function argument) x. ... “upto” (incrementing) iteration, used in loop constructs like for q ← 1, . . . , K (with q = 1, 2, . . . , K). ... “downto” (decrementing) iteration, for example, for q ← K, . . . , 1 (with q = K, K −1, . . . , 2, 1). ∧ Logical “and” operator. ∨ Logical “or” operator. ∅ The empty set, same as {}. ∂ Partial derivative operator (see Sec. 5.2.1). For example, ∂ ∂xi f denotes the first derivative of some n-dimensional function f (x0 , x1 , . . . , xn−1 ) : Rn 7→ R by variable xi . Analogous∂2 ly, ∂x 2 f is the second derivative (i.e., differentiating f twice i by variable xi ), etc. ∇ Gradient operator. The gradient of a n-dimensional function f (x0 , x1 , . . . , xn−1 ) : Rn 7→ R, denoted ∇f (also ∇f or grad f ), is the vector of its first partial derivatives (see also Sec. D.1.2). ∇2 Laplace operator (or Laplacian). The Laplacian of a n dimensional function f (x0 , x1 , . . . , xn−1 ) : Rn 7→ R, denoted ∇2f (or ∇f2 ), is the sum of its second partial derivatives (see Sec. D.1.5). 0n Zero vector of length n, i.e., 0 = (0, . . . , 0)⊺ . adj Adjugate of a square matrix, denoted adj(A); also called adjoint in older texts (see, e.g., [43, pp. 251, 260], [169, Sec. 3.3]). AND Bitwise “and” operation. Example: (0011b AND 1010b ) = 0010b (binary) and (3 AND 6) = 2 (decimal). ArcTan(x, y) Inverse tangent function. The result of ArcTan(x, y) is equivalent to arctan( xy ) = tan−1 ( xy ) but with two arguments and resulting angles in the range [−π, +π] (i.e., covering all four quadrants). ArcTan(x, y) is equivalent to the ArcTan[x,y ] function in Mathematica and the Math.atan2 (y, x) method in Java (note the reversed arguments!). ∠ The ‘angle’ of a 2D vector. If x = (x, y) then ∠x = ∠(x, y) = ArcTan(x, y). B The set of boolean values, i.e., {false, true}. C The set of complex numbers. card Size (cardinality) of a set or collection. card(A) = |A| (see also Sec. 2.1). det Determinant of a matrix (det(A) = |A|). DFT Discrete Fourier transform (see Sec. 18.3). diag Matrix diagonal. If A is a n×n matrix with diagonal elements Aii , diag(A) is the vector d = (d0 , . . . , dn−1 ), with di = Aii . Alternatively, given a vector d = (d0 , . . . , dn−1 ), B = diag(d) creates a n × n matrix with diagonal elements Bii = di and Bij = 0 for i ̸= j. dims Dimensions of a matrix (rows, columns); e.g., dims(A) = (2, 3) if A has 2 rows and 3 columns.



801



Appendix A Mathematical Symbols and Notation



802



dom



Domain of a function or mapping; e.g., the domain of an image I of size M × N is dom(I) = [0, M −1] × [0, N −1]. e Euler’s number (e ≈ 2.71828). e Unit vector. For example, ex = (1, 0)⊺ denotes the 2D unit vector in x-direction. eθ = (cos θ, sin θ)⊺ is the 2D unit vector oriented at angle θ and ex , ey , ez are the unit vectors along the coordinate axes in 3D. exp Exponential function: exp(x) = ex . F Continuous Fourier transform (see Sec. 18.1.4). false Boolean constant (false = ¬true). grad Gradient operator (see ∇). h Histogram of an image (see Sec. 2.1). H Cumulative histogram (see Sec. 2.6). H Hessian matrix (see Sec. D.1.6). hom Operator for converting Cartesian to homogeneous coordinates. x = hom(x) maps the Cartesian point x to a corresponding homogeneous point x; the reverse mapping is denoted x = hom−1 (x) (see Sec. B.6). i Imaginary unit (i2 = −1), see Sec. A.5. I Image with scalar pixel values (e.g., an intensity or grayscale image). I(u, v) ∈ R is the pixel value at position (u, v) I Vector-valued image, for example, a RGB color image with 3D color vectors I(u, v) ∈ R3 . In Identity matrix of size n × n. For example, I2 = ( 10 01 ) is the 2 × 2 identity matrix. J Jacobian matrix of a multi-dimensional function (see Sec. D.1.1). L1 , L2 , L∞ Common distance measures or norms (see Eqns. (15.25)– (15.27)). M × N Domain of pixel coordinates (u, v) for an image with M columns (width) and N rows (height); used as a shortcut notation for the set dom(I) = {0, . . . , M −1} × {0, . . . , N −1}. mod Modulus operator: (a mod b) is the remainder of the integer division a ÷ b (see Sec. J.1.2). µ Central moment (see Sec. 8.5.3); also used to denote the mean of a probability distribution. N The set of natural numbers; N = {1, 2, 3, . . .}, N0 = {0, 1, 2, . . .}. nil Null (“nothing”) constant, typically used in algorithms to denote an invalid quantity (analogous to null in Java). p Discrete probability density function (see Sec. 3.6.1). P Discrete probability distribution function or cumulative probability density (see Sec. 3.6.1). Q Quadrilateral, i.e., a polygon with four vertices (see Sec. 21.1.4). R The set of real numbers. R, G, B Red, green, blue color components.



rank



Rank of a matrix A, i.e., the number of linearly independent A.2 Sets rows or columns (r = rank(A)). round Rounding function: returns the integer closest to the scalar x ∈ R; round(x) ≡ ⌊x + 0.5⌋. σ Standard deviation (square root of the variance σ 2 ). S1 Unit square (see Sec. 21.1.4). sgn “Sign” or “signum” n function: sgn(x) =



size



1 for x > 0 0 for x = 0 −1 for x < 0



Dimensions of an image (width, height); e.g., size(I) = (640, 480) if I is 640 pixels wide and 480 pixels tall. Note that the ordering is (M, N ) = (columns, rows), which is reverse to dims(A) = (rows, columns) for the dimension of matrices. τ Interval in time or space. ⊺ Transpose of a vector (x⊺ ) or matrix (A⊺ ). tr Trace (sum of the diagonal elements) of a matrix, e.g., tr(A). true Boolean constant (true = ¬false). u = (u, v) Discrete 2D coordinate, with u, v ∈ Z. x = (x, y) Continuous 2D coordinate, with x, y ∈ R. XOR Exclusive OR operator. For logical (i.e., boolean) operands a, b, the expression (a XOR b) is true if either a or b is true but not both. The same applies to associated bits in binary bit patterns, e.g., (0011b XOR 1010b ) = 1001b . Z The set of integers.



A.2 Sets A set is an unordered collection of arbitrary elements. Any element may be contained only once, i.e., there are no duplicate elements. A.2.1 Basic Set Symbols and Operators {}, ∅ |A| ∀x . . . ∃x . . . ∪ ∩ S Ai Ti i Ai \ x∈A



Empty set (with zero elements). The size of the set A (equal to card(A)). “All” quantifier (for all x, . . . ). “Exists” quantifier (there is some x for which . . . ). Set union (e.g., A ∪ B). A ∪ {x} creates a new set which adds the element x to the set A. Set intersection (e.g., A ∩ B). Union of multiple sets Ai . Intersection of multiple sets Ai . Set difference: A \ B is the set of all elements that are contained in A but not in B. For example, A ← A\{x} removes element x from the set A. Boolean expression: x ∈ A is true if the element x is contained in set A, otherwise false.



803



Appendix A Mathematical Symbols and Notation



A.2.2 Destructive Set Operators For convenience, we define the following operators which add or remove a single item x by modifying the set A destructively: add(x, A) remove(x, A)



≡ ≡



A ← A ∪ {x}, A ← A \ {x}.



Set A remains unchanged when adding an element x that is already contained in A, and removal has no effect if x is not in A. A.2.3 Relations, Mappings and Functions A relation R between two sets A, B is a collection of ordered pairs (a, b) containing one item from each set, i.e., a ∈ A and b ∈ B. Items a, b are said to be related if (a, b) ∈ R. The sets A and B, representing all possible values for a and b. are usually referred to as the domain and codomain of R, respectively. The subset of actually contained domain values {a ∈ A | (a, b) ∈ R} is called the support of R and the subset of codomain values {b ∈ B | (a, b) ∈ R} is called the image 5 of R. A function F : A 7→ B is a relation between two sets A, B which pairs at most one range element B to any given element of A. Thus F relates any item a ∈ A to a unique item b ∈ B. In other words, if (a, b) ∈ F then F may contain no other pair (a, b′ ), such that b′ ̸= b. If the support of F is the whole of A, it is called a mapping 6 from A into B. For example, a scalar-valued image I can be viewed as a function over 2D coordinates (u, v) ∈ Z2 (the domain) to the set of real numbers R (the codomain), i.e., I : Z × Z 7→ R. More specifically, an 8-bit image of size M × N can be defined as a mapping I : [0, M −1] × [0, N −1] 7→ [0, 255], for which we sometimes use the short notation I : M × N 7→ [0, 255]. The expression F (a) retrieves the range value of the function F for the domain element a or nil if undefined, i.e., ( b if (a, b) ∈ F , F (a) = (A.1) nil otherwise. Analogously, the expression F (a) ← b



(A.2)



associates the range value b to the domain element a by adding the pair (a, b) to the relation F , under the assumption that (a, b) replaces any existing pair (a, ∗) in F to keep the mapping unique. 5



6



804



This is often called the range of R but this term is also commonly used for the codomain, so we avoid it here. The terms function and mapping are often used synonymously.



A.3 Sequences



A.3 Sequences



A sequence (also called a list or vector) is an ordered collection of items. A particular item may be contained more than once, i.e., duplicates are permitted. We use the following definitions for sequences: () |A|



Empty sequence (with zero elements). The length of the sequence A (i.e., the number of contained elements), |A| = size(A). A(i) The i-th element of sequence A. A(0) = A0 refers to the first element, A(n−1) = An−1 to the last (n = |A|). The result is nil if i ∈ / [0, n−1]. first(A) The first element of A or nil if A is empty (equiv. to A(0)). last(A) The last element of A or nil if A is empty (equiv. to A(|A| −1)). A ⊞ B Concatenation of two sequences A, B. If A = (a0 , . . . , am ) and B = (b0 , . . . , bn ) then A ⊞ B = (a0 , . . . , am , b0 , . . . , bn ). A.3.1 Adding and Removing Elements Based on the above notation, expression A ⊞ (x) creates a new sequence by adding a single item x to the end of A, while (x) ⊞ A adds item x to the front of A. Thus, given A = (a0 , . . . , am ), A ⊞ (x) = (a0 , . . . , am , x), (x) ⊞ A = (x, a0 , . . . , am ). The expression remove(A, i) modifies the sequence A by removing the element at its i-th position; e.g., if A = (a0 , a1 , a2 , . . . , am ), then remove(A, 1) removes element a1 , i.e., modifies sequence A to A = (a0 , a2 , . . . , am ). A remains unchanged if i < 0 or i ≥ |A|. Two special operators are defined for removing the first and last element of A, respectively: removeFirst(A) ≡ remove(A, 0), removeLast(A) ≡ remove(A, |A|−1). A.3.2 “Stack”-Type Sequences A stack (or last-in-first-out buffer ) is a special sequence which only allows elements to be added and removed at one end (i.e., the top of the stack). The traditional operators are push(x, S), which adds the element x to the front of S, and pop(S), which removes and returns the top element:7 push(x, S) ≡ S  ← (x) ⊞ S, x ← first(S), x ← pop(S) ≡ removeFirst(S). 7



It is irrelevant (and implementation-dependent) which end of the sequence to use for inserting and removing stack elements. In Java, stacks commonly model the Deque interface which defines the head (i.e., the first element) of the sequence as the top of the stack.



805



Appendix A Mathematical Symbols and Notation



Note that S(0) or first(S) can be used to access (“peek”) the top element of S without removing it from the stack. A.3.3 “Queue”-Type Sequences A queue (or first-in-first-out buffer ) is a sequence which allows the insertion of elements at one end and removal of items at the opposite end. We define the operators enqueue(x, Q) for inserting x at the beginning of queue Q and dequeue(Q) for removing the element at its end, i.e.,8 enqueue(x, Q) ≡ Q ← (x) ⊞ Q,  x ← last(Q), x ← dequeue(Q) ≡ Q ← removeLast(Q). A.3.4 Sorting Sequences Sometimes sequences need to be sorted according to some property. The expression sort(A, f ) sorts the sequence A destructively by applying the function f (x) ∈ R to each of its elements ai . The elements of the sorted sequence are in ascending order, i.e. f (ai ) ≤ f (ai+1 ). For example, to sort coordinate vectors x ∈ Rn by increasing distance from the origin we use f (x) = ∥x∥. To sort a sequence of real numbers in natural order we omit f and simply write sort(A).



A.4 Tuples and Objects A tuple is a fixed-size, ordered sequence of elements, each possibly of a different type. Tuples are useful as a generic (though primitive) representation of objects and similar structures that are common in programming. For example, a graph G is commonly defined as a tuple ⟨V, E⟩, where V is a set of vertices and E is a set of edges (ordered pairs of vertices). A.4.1 Type Definition and Instantiation We use the following tuple notation as a simple way to define “objects”.9 For example, to define a type (class) named T with 3 components10 named a, b, c we write T := ⟨a, b, c⟩. 8



9



10



806



In Java, queues can also be implemented with the Deque interface, but insertion is done at the tail (i.e., the last element) of the sequence and removal at the head. This model is extremely basic, there is no inheritance, method definitions or similar features. Called “fields” or “instance variables” in programming terminology.



To create an instance of class T, with concrete component values A.5 Complex Numbers a = x, b = y, c = z, we write t ← new T(x, y, z).



(A.3)



i.e., the component names and supplied values are associated by position. No other “constructors” are provided. A.4.2 Accessing Object Components To retrieve a component’s value from a given object obj we use the associated component name as a function, for example,11 x ← a(obj) retrieves the value of component a of object obj. Analogously, a component can be modified by a(obj) ← x′ . Thus, in programming terms, component names are also used for the associated getter and setter functions. A.4.3 Duplication Objects may be duplicated if needed by obj′ ← duplicate(obj). All components of the cloned instance obj′ have the same values as the original instance i.e., a(obj′ ) ≡ a(obj), etc. The implicit assumption is that such a clone is “deep”, i.e., the components can be modified without affecting the original components.



A.5 Complex Numbers Basic Relations z = a + i·b (with z, i ∈ C, a, b ∈ R, i2 = −1) s·z = s·a + i·s·b (for s ∈ R) p |z| = a2 + b2



|s·z| = s·|z|



11



z = a + i · b = |z| · (cos ψ + i · sin ψ) = |z| · ei·ψ (with ψ = ArcTan(a, b))   Re a + i·b = a Re ei·φ = cos φ   Im a + i·b = b Im ei·φ = sin φ



(A.4) (A.5) (A.6) (A.7) (A.8) (A.9) (A.10) (A.11)



This is analogous to the object dot-notation used in many common programming languages. In Java, for example, if obj is an object reference, then obj.a accesses its instance variable a. The statements x = obj.a and obj.a = x retrieve and set the variable’s value, respectively.



807



Appendix A Mathematical Symbols and Notation



ei·φ = cos φ + i · sin φ e



−i·φ



= cos φ − i · sin φ



cos(φ) = sin(φ) = ∗



1 2· 1 2i ·



e



i·φ



e



i·φ



+e



−e



z =a−i·b ∗



−i·φ



(A.12) (A.13)







−i·φ



(A.14)







(A.15)



(complex conjugate) 2







2



z·z = z ·z = |z| = a + b 0



2



0



z = (a + i·b) = (1 + i · 0) = 1



(A.16) (A.17) (A.18)



Arithmetic Operations z1 = (a1 + i·b1 ) = |z1 | ei·φ1



(A.19)



i·φ2



(A.20)



z2 = (a2 + i·b2 ) = |z2 | e



z1 +z2 = (a1 + a2 ) + i·(b1 + b2 ),



(A.21)



z1 · z2 = (a1 ·a2 − b1 ·b2 ) + i·(a1 ·b2 + b1 ·a2 )



(A.22)



= |z1 | · |z2 | · ei·(φ1 +φ2 ) z1 a ·a + b1 ·b2 a ·b − a1 ·b2 |z | = 1 22 + i· 2 12 = 1 · ei·(φ1 −φ2 ) 2 2 z2 a 2 + b2 a 2 + b2 |z2 |



808



(A.23)



(A.24)



Appendix B Linear Algebra



This part contains a compact set of elementary tools and concepts from algebra and calculus that are referenced in the main text. Many good textbooks (probably including some of your school books) are available on this subject, for example, [43,44,169,296]. For numerical aspects of linear algebra see [182,214]. Note that all vector and matrix indexes throughout this book are zero-based in agreement with the convention used in most modern programming languages.



B.1 Vectors and Matrices Here we describe the basic notation for vectors in two and three dimensions. Let     a0 b a= , b= 0 (B.1) a1 b1 denote vectors a, b in 2D, and analogously     a0 b0 a = a1  , b = b1  a2 b2



(B.2)



vectors in 3D (with ai , bi ∈ R). Vectors are used to describe 2D or 3D points (relative to the origin of the coordinate system) or the displacement between two arbitrary points in the corresponding space. We commonly use upper-case letters to denote a matrix, e.g.,   A0,0 A0,1 A = A1,0 A1,1  . (B.3) A2,0 A2,1



This matrix consists of 3 rows and 2 columns; in other words, A is of size (3, 2). We define the “dims” operator to return the dimensions of a given matrix as a list, e.g., dims(A) = (3, 2).



(B.4) 809



© Springer Nature Switzerland AG 2022 W. Burger and M. J. Burge, Digital Image Processing, Texts in Computer Science, https://doi.org/10.1007/978-3-031-05744-1



Appendix B Linear Algebra



The individual elements of A are referenced as Ai,j , where i is the row index (vertical coordinate) and j is the column index (horizontal coordinate).1 We use ordinary parentheses ( ) to reference individual elements of a vector or a matrix, analogous to discrete functions, e.g., a(i) ≡ ai



and



A(i, j) ≡ Ai,j .



(B.5)



The transpose of a matrix A, denoted A⊺ , is obtained be exchanging rows and columns, for example, 



⊺   A0,0 A0,1 A0,0 A1,0 A2,0 ⊺   A = A1,0 A1,1 = . A0,1 A1,1 A2,1 A2,0 A2,1



(B.6)



The inverse of a square matrix A is denoted A−1 , such that A·A−1 = I



and



A−1 ·A = I



(B.7)



(I is the identity matrix). Note that not every square matrix has an inverse. Calculation of the inverse can be performed in closed form up to the size (3, 3), see Eqn. (21.23) and Eqn. (24.45), e.g. In general, the use of standard numerical methods is recommended (see Sec. B.7.4). B.1.1 Column and Row Vectors For practical purposes, a vector can be considered a special case of a matrix, with only one column or row. In particular, a the mdimensional column vector   a0 . a =  ..  (B.8) am−1 corresponds to a matrix of size (m, 1), while its transpose a⊺ is a row vector and thus like a matrix of size (1, m). By default, and unless otherwise noted, any vector is implicitly assumed to be a column vector. Given some m × n matrix A = (Ai,j ), we refer to its ith row vector (0 ≤ i < m) of length n by A(i, ∗) = (Ai,0 , . . . Ai,n−1 ),



(B.9)



and its j th column vector (0 ≤ j < n) of length m is 



1



810



 A0,j ⊺ A(∗, j) = (A0,j , . . . Am−1,j ) =  ...  . Am−1,j



(B.10)



Note that the usual notation for matrix coordinates is (unlike image coordinates) vertical coordinate (i.e., row index) first!



B.1.2 Extracting Submatrices and Vectors To extract a rectangular submatrix from a given matrix A we adopt a similar notation as is used in common mathematical software. In particular, the expression A(r1 :r2 , c1 :c2 )



B.2 Matrix Multiplication



(B.11)



(with r1 ≤ r2 , c1 ≤ c2 ) specifies the submatrix of A extending over rows r1 , . . . , r2 and columns c1 , . . . , c2 . For example, with A as defined in Eqn. (B.3), A(1:2, 0:1) =







 A1,0 A1,1 , A2,0 A2,1



A(2:2, 0:1) = A(2, ∗) = A2,0 A2,1   A0,1 A(0:2, 1:1) = A(∗, 1) = A1,1 . A2,1



(B.12) 



(B.13) (B.14)



The same notation is used to extract specific parts of a vector. To refer to a single (scalar) element of a matrix or vector we simply write A(i, j)



a(i),



and



(B.15)



respectively. This may also be used to modify individual matrix or vector elements, e.g., A(2, 1) ← x. B.1.3 Length (Norm) of a Vector The length or Euclidean norm (L2 norm) of a vector a = (a1 , . . . , am−1 ), denoted ∥a∥, is defined as ∥a∥ =



X m−1 i=0



a2i



1/2 .



(B.16)



For example, the length of the 3D vector x = (x, y, z) is ∥x∥ =



p



x2 + y 2 + z 2 .



(B.17)



B.2 Matrix Multiplication B.2.1 Scalar Multiplication The product of a real-valued matrix and a scalar value s ∈ R is defined as   s·A0,0 · · · s·A0,n−1    .. .. .. s·A = A·s = s·Ai,j =  (B.18) . . . . s·Am−1,0 · · · s·Am−1,n−1



811



Appendix B Linear Algebra



B.2.2 Product of Two Matrices We say that a matrix is of size (m, n) if consists of m rows and n columns. Given two matrices A, B of size (m, n) and (p, q), respectively, the product A · B is only defined if n = p. Thus the number of columns (n) in A must always match the number of rows (p) in B. The result is a new matrix C of size (m, q), that is,     A0,0 . . . A0,n−1 B0,0 . . . B0,q−1    ..  .. .. .. .. C = A·B =  ... · .  . . . . |







 = with elements



|



Am−1,0 . . . Am−1,n−1 Bn−1,0 . . . Bn−1,q−1 {z } | {z } (m,n)



C0,0 .. .



Cm−1,0







. . . C0,q−1  .. .. , . . . . . Cm−1,q−1 {z }



(n,q)



(B.19)



(m,q)



Cij =



n−1 X



Ai,k ·Bk,j ,



(B.20)



k=0



for i = 0, . . . , m−1 and j = 0, . . . , q−1. Note that the matrix product is not commutative, that is, A·B ̸= B·A in general. B.2.3 Matrix-Vector Products The product A · x between a matrix A and a vector x is only a special case of the matrix-matrix multiplication given in Eqn. (B.19). In particular, if x = (x0 , . . . , xn−1 )⊺ is a n-dimensional column vector (i.e., a matrix of size (n, 1)), then the multiplication y = |{z} A · |{z} x |{z}



(m,1)



(B.21)



(m,n) (n,1)



is only defined if the matrix A is of size (m, n), for arbitrary m ≥ 1. The result y is a column vector of length m (equivalent to a matrix of size (m, 1)). For example (with m = 2, n = 3),       x ABC   A·x + B·y + C ·z y A·x = · = . (B.22) DEF D·x + E·y + F ·z z | {z } | {z } | {z } (2,3) (2,1) (3,1)



Here A operates on the column vector x “from the left”, that is, A·x is the left-sided matrix-vector product of A and x. Similarly, a right-sided multiplication of a row vector x⊺ of length m with a matrix of size (m, n) is performed as ⊺



812



x · |{z} B = |{z} z , |{z}



(1,m) (m,n)



(1,n)



(B.23)



the result z being a n-dimensional row vector; for example (again B.3 Vector Products with m = 2, n = 3),   ABC x ·B = x, y · = x·A+y·D, x·B+y·E, x·C +y·F . D E F | {z } | {z } | {z } (1,2) (1,3) (B.24) (2,3) ⊺











In general, if A·x is defined, then ⊺



⊺ ⊺



A·x = (x ·A )















(A·x) = x ·A .



and



(B.25)



Thus, any right-sided matrix-vector product A·x can also be calculated as a left-sided product x⊺ · A⊺ , by transposing both matrix A and vector x.



B.3 Vector Products Products between vectors are a common cause of confusion, mainly because the same symbol (·) is used to denote widely different operators. B.3.1 Dot (Scalar) Product The dot product (also called scalar or inner product) of two vectors a = (a0 , . . . , an−1 )⊺ , b = (b0 , . . . , bn−1 )⊺ of the same length n is defined as x=a·b=



n−1 X



ai ·bi .



(B.26)



i=0



Thus the result x is a scalar value (hence the name of this product). If we write this as the product of a row and a column vector, analogous to Eqn. (B.23), ⊺



x = |{z} a · |{z} b , |{z}



(B.27)



a · b = ∥a∥ · ∥b∥ · cos(α),



(B.28)



(1,1)



(1,n) (n,1)



we find that the result x is a matrix of size (1, 1), that is, a single scalar value. The dot product can be viewed as the projection of one vector onto the other, with the relation



where α is angle enclosed by the vectors a and b. As a consequence, the dot product is zero if the two vectors are orthogonal to each other. The dot product of a vector with itself gives the square of its norm (see Eqn. (B.16)), that is, a·a=



n−1 X i=0



2



a2i = ∥a∥ .



(B.29) 813



Appendix B Linear Algebra



B.3.2 Outer Product The outer product of two vectors a = (a0 , . . . , am−1 )⊺ , bn−1 )⊺ of length m and n, respectively, is defined as  a 0 b0 a0 b1 . . . a0 bn−1  a 1 b0 a1 b1 . . . a1 bn−1 ⊺  M=a⊗b=a·b = . .. .. ..  .. . . .



b = (b0 , . . . ,



am−1 b0 am−1 b1 . . . am−1 bn−1







  . (B.30) 



Thus the result is a matrix M with m rows and n columns, and elements Mij = ai · bj , for i = 0, . . . , m−1 and j = 1, . . . , n−1. Note ⊺ that a · b in Eqn. (B.30) denotes the ordinary (matrix) product of ⊺ the column vector a (of size m × 1) and the row vector b (of size 1 × n), as defined in Eqn. (B.19). The outer product is a special case of the Kronecker product (⊗) which generally operates on pairs of matrices. B.3.3 Cross Product The cross product (×) is generally defined for n-dimensional vectors, but is almost exclusively used in the 3D case, where the result is geometrically easy to understand. For a pair of 3D vectors, a = (a0 , a1 , a2 )⊺ and b = (b0 , b1 , b2 )⊺ , the cross product is defined as       a0 b0 a 1 b2 − a 2 b1 c = a × b = a1  × b1  = a2 b0 − a0 b2  . (B.31) a2 b2 a 0 b1 − a 1 b0 In the 3D case, the cross product is another 3D vector that is perpendicular to both original vectors a, b.2 The magnitude (length) of the vector c relates to the angle θ between a and b as ∥c∥ = ∥a × b∥ = ∥a∥ · ∥b∥ · sin(θ).



(B.32)



The quantity ∥a × b∥ corresponds to the area of the parallelogram spanned by the vectors a and b.



B.4 Trace and Determinant of a Square Matrix The trace of a n × n (square) matrix A = (ai,j ), denoted tr(A), is the (scalar-valued) sum of its diagonal elements, i.e., tr(A) =



n−1 X



ai,i = a0,0 + a1,1 + · · · + an−1,n−1 .



(B.33)



i=0



The determinant, denoted det(A) or |A|, is another important (scalar-valued) function of a square matrix. The meaning of the determinant does not open itself easily to intuition. One popular interpretation is to think of the determinant as the volume formed by column 2



814



For dimensions greater than three, the definition (and calculation) of the cross product is considerably more involved.



vectors of the matrix, or how much the volume of a n-dimensional B.5 Eigenvalues and hypercube is changed when its vertices are transformed by A. How- Eigenvectors ever, the fact that determinants may be negative has the (uncanny) consequence that the volume (or volume change) is signed. Another interpretation is that the determinant quantifies the degree of linear independence between the vectors of a matrix, which directly relates to its invertibility, i.e., A−1 only exists if det(A) ̸= 0. Other important facts about the determinant include det(I) = 1,



(B.34) ⊺



det(A) = det(A ), n



det(s·A) = s · det(A), det(A



−1



) = 1/ det(A),



det(A·B) = det(A) · det(B).



(B.35) (B.36) (B.37) (B.38)



The numerical calculation of determinants becomes increasingly involved with growing matrix size n, but fortunately this is rarely necessary in practice. For the frequent cases n = 2 and n = 3, determinants can be easily calculated as   a00 a01 a00 a01 = a00 ·a11 − a01 ·a10 , det = (B.39) a10 a11 a10 a11   a00 a01 a02 a a a a a a det a10 a11 a12  = a00 · 11 12 − a01 · 10 12 + a02 · 10 11 a21 a22 a20 a22 a20 a21 a20 a21 a22 = a00 a11 a22 + a01 a12 a20 + a02 a10 a21 − a02 a11 a20 − a01 a10 a22 − a00 a12 a21 . (B.40) respectively.



B.5 Eigenvalues and Eigenvectors This section gives an elementary introduction to eigenvectors and eigenvalues, which are mentioned at several places in the main text (see also [34, 76]). Given a square matrix A of size n × n, λ is an eigenvalue of A and x ∈ Rn its associated eigenvector if A·x = λ·x.



(B.41)



Eigenvalue and eigenvectors thus always come in pairs ⟨λj , xj ⟩, usually called eigenpairs. There is generally more than one eigenpair, in fact, Eqn. (B.41) may have up to n different solutions. Geometrically speaking, applying the matrix A to an eigenvector xk only changes the vector’s length (by scaling with the associated eigenvalue λk ), but not its orientation in space. If they exist, the eigenvalues of a matrix are unique, but the associated eigenvectors are not! This results from the fact that, if Eqn. (B.41) is satisfied for some vector x (and the associated eigenvalue λ), it also holds for a scaled vector sx, that is, A·sx = λ·sx ,



(B.42)



815



Appendix B Linear Algebra



for any nonzero s ∈ R. Thus, if x is an eigenvector of A, then sx is also an (equivalent) eigenvector. Note that the eigenvalues of a real-valued matrix may generally be complex. However (as an important special case), if A is real and symmetric, all its eigenvalues are guaranteed to be real. Two fundamental relationships exist between the eigenvalues of a n × n matrix A and its determinant and trace (see Sec. B.4), respectively. The trace of the matrix is equal to the sum of its eigenvalues and the determinant is equal to their product, i.e., tr(A) =



n−1 X



λi



and



det(A) =



i=0



n−1 Y



λi .



(B.43)



i=0



Example For the real-valued (non-symmetric) 2 × 2 matrix   3 −2 A = −4 1 , the two eigenvalues and their associated eigenvectors are     4 −2 λ0 = 5, x0 = s· −4 , and λ1 = −1, x1 = s· −4 ,



for any nonzero s ∈ R. The result can be easily verified by inserting pairs ⟨λ0 , x0 ⟩ and ⟨λ1 , x1 ⟩ into Eqn. (B.41). B.5.1 Calculating Eigenvalues Special Case: 2×2 Matrix For the special (but frequent) case of n = 2, the solution can be found in closed form (and without any software libraries). Given a 2×2 matrix   ab A= (B.44) cd its so-called “characteristic equation” is3  det A − λ·I2 = 0 .



(B.45)



This expands to a simple quadratic equation4 in λ, λ2 − (a + d)·λ + (a·d − b·c) = 0,



(B.46)



with two solutions λ0,1 =



a+d 2



±







a−d 2 2



+ b·c



1/2



=R±



which are the eigenvalues of matrix A, i.e. 3 4



816



p S 2 + b·c ,



(B.47)



For a n × n matrix, the characteristic equation is det(A−λ·In ) = 0. 2 This is in the standard √ form Ax + Bx + C = 0 with the two solutions 1 2 x1,2 = 2A · (−B ± B − 4AC).



1: Eigen2x2(A) Input: A = ( ac db ), a real-valued 2 × 2 matrix. Returns a vector ⊺ of eigenvalues λ = (λ0 , λ1 ) , with λ0 ≥ λ1 , and a matrix E whose column vectors are the associated eigenvectors x0 , x1 . If A has no real eigenvalues, nil is returned. 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33:



R ← (a + d)/2 S ← (a − d)/2 ϱ = S 2 + b·c if ϱ < 0 then return nil else √ T ← ϱ λ0 ← R + T λ1 ← R − T



▷ A has no real eigenvalues



▷ eigenvalue λ0 ▷ eigenvalue λ1



B.5 Eigenvalues and Eigenvectors Alg. B.1 Special closed-form calculation of real eigenvalues and eigenvectors for a 2 × 2 matrix. If matrix A has real eigenvalues, two items are returned: a vector of eigenvalues λ and a matrix E whose columns are the associated eigenvectors. Eigenpairs ⟨λi , ei ⟩ are ordered such that λ0 ≥ λ1 . The algorithm follows the scheme described in [35, Ch. 5].



if (a − d) > 0 then ⊺ x0 ← (S+T, c) ▷ eigenvector x0 ⊺ x1 ← (b, −(S+T )) ▷ eigenvector x1 else if (a − d) < 0 then ⊺ x0 ← (b, −(S−T )) ⊺ x1 ← (S−T, c) else ▷ (a − d) = 0 if |b| < |c| √ then ▷ |b| < |c| ⊺ x0 ← ( √ b·c, c) ⊺ x1 ← (− b·c, c) else if |b| > √ |c| then ▷ |b| > |c| ⊺ x0 ← (b, √ b·c) ⊺ x1 ← (b, − b·c) else ▷ |b| = |c| if |c| > 0 then ⊺ x0 ← (c, c) ⊺ x1 ← (−c, c) else ▷ b = c = 0, any vector is an eigenvector ⊺ x0 ← (0, 1) ▷ pick 2 orthogonal, nonzero vectors ⊺ x1 ← (1, 0) λ ← (λ0 , λ1 ) E ← (x0 | x1 ) ⊺ return (λ, E)



p S 2 + b·c, p λ1 = R − S 2 + b·c, λ0 = R +



(B.48)



with R = (a+d)/2 and S = (a−d)/2. Both eigenvalues are real-valued if the term under the square root is positive, i.e., if S 2 + b·c =



 a−d 2 2



+ b·c ≥ 0 .



(B.49)



If matrix A is symmetric (i.e., b = c), this condition is guaranteed (since b·c = b2 ≥ 0), and also λ0 ≥ λ1 . The calculation of the associated eigenvectors is a bit more involved (see [35, Ch. 5]) as summarized in B.1.5 5



See class Eigensolver2x2 of the imagingbook library for an implementation of this algorithm. This implementation is about five times faster than the general solution (based on Apache Commons Math).



817



Appendix B Linear Algebra



General Case: n × n Matrix For matrices larger than 2 × 2 closed-form solutions do not exist or are too cumbersome to calculate. Instead, reliable numerical software should be used for eigenvalue calculations. Throughout the main text we refer to the abstract function (λ, E) ← eigen(A), for obtaining the eigenvalues and eigenvectors of a n × n matrix A. Function eigen() is, analogous to Alg. B.1, supposed to return a pair of items: a vector of eigenvalues λ = (λ0 , . . . , λn−1 ) and a matrix E = (e0 | . . . | en−1 ) whose columns are the associated eigenvectors, i.e., ⟨λk , ek ⟩ form eigenpairs. A method called “eigendecomposition” is used to factor matrix A in the form ⊺



A = V·Λ·V ,



(B.50)



where Λ is a n × n diagonal matrix whose diagonal elements are the eigenvalues λk , and the columns of V are the corresponding eigenvectors ek , i.e., λk = Λ(k, k)



and



ek = V(∗, k),



(B.51)



for k = 0, . . . , n−1. Thus, given the eigendecomposition (Λ, V) of A, (λ, E) = eigen(A) = (diag(Λ), V).



(B.52)



In Apache Commons Math, this operation is implemented by class EigenDecomposition (see the example in Sec. B.7.5). B.5.2 Generalized Symmetric Eigenproblems A generalized eigenproblem has the form A·x = λ·B·x.



(B.53)



It differs from an ordinary eigenproblem (see Eqn. (B.41)) by including a second matrix (B) on the right-hand side of the equation. Although generalized eigenvalue problems with arbitrary matrices A, B can be solved with dedicated algorithms (such as the QZ-algorithm [107]), they are not implemented in many popular numerical libraries.6 A special but frequent situation is that the matrices A, B are real and symmetric and B is positive definite.7 In this case, the problem in Eqn. (B.53) turns into a generalized symmetric eigenproblem and can be solved with “conventional” tools. In particular, B can be decomposed (by Cholesky decomposition) to B = L·L⊺ , where L is a lower triangular matrix. The matrix 6



7



818



The Apache Commons Math library currently provides no such implementation. ⊺ A real matrix A is called positive definite if x ·A·x > 0, for any nonzero vector x. The determinant of a positive definite matrix is always positive, i.e., such a matrix is always nonsingular.



1: GeneralSymmetricEigen(A, B) Input: A, B, symmetric matrices of size n × n, B positive definite. Returns the vector of eigenvalues λ = (λ0 , ..., λn−1 ) and a matrix X = (x0 |...|xn−1 ) whose column vectors are the associated eigenvectors. ⊺ 2: L ← solve(B = L·L ) ▷ by Cholesky decomposition 3: Q ← solve(L·Q = C) ▷ e.g., by LU decomposition ⊺ ⊺ 4: Y ← solve(L·Y = Q ) ▷ alternatively Y = L−1 ·A·(L−1 ) 5: (λ, V) ← eigen(Y) ▷ λ = (λ0 , ..., λn−1 ), V = (v 0 | ... | v n−1 ) ⊺ 6: X ← (L−1 ) ·V ▷ X = (x0 | ... | xn−1 ) 7: return (λ, X)



Y = L−1 ·A·(L−1 )







B.6 Homogeneous Coordinates Alg. B.2 Solving a generalized symmetric eigenvalue problem by Cholesky decomposition. Function eigen(), to perform ordinary eigendecomposition (i.e., Mx = λx) by returning eigenvalues λ and eigenvectors V, is assumed to be available.



(B.54)



has the same eigenvalues as A and eigenvectors y k , from which the eigenvectors of A can be found as xk = (L−1 )⊺ ·y k [214, Sec. 11.0.5]. Thus, given matrices A and B of size n × n, solving the problem in Eqn. (B.53) involves the following steps: 1. Calculate L, such that L·L⊺ = B (by Cholesky decomposition). 2. Create matrix Y = L−1 ·A·(L−1 )⊺ . 3. Get the eigenvalues λ = (λ0 , . . . , λn−1 ) of Y and the matrix of eigenvectors V = (v 0 |...|v n−1 ). 4. Return eigenvalues λ and the matrix of transformed eigenvectors X = (L−1 )⊺ ·V. As suggested in [214, Sec. 11.0.5], matrix Y (in Step 2) can be efficiently calculated as follows: 2a. find Q by solving L·Q = C, then 2b. find Y by solving L·Y = Q⊺ , e.g., by using a LU decomposition solver (see Sec. B.8.1). The complete calculation is summarized in Alg. B.2.



B.6 Homogeneous Coordinates Homogeneous coordinates are an alternative representation of points in multi-dimensional space. They are commonly used in 2D and 3D geometry because they can greatly simplify the description of certain transformations. For example, affine and projective transformations become matrices with homogeneous coordinates and the composition of transformations can be performed by simple matrix multiplication.8 To convert a given n-dimensional Cartesian point x = (x0 , . . . , xn−1 )⊺ to homogeneous coordinates x, we use the notation9 hom(x) = x.



(B.55)



This operation increases the dimensionality of the original vector by one by inserting the additional element xn = 1, that is, 8 9



See Sec. 21.1.2. The operator hom( ) is introduced here for convenience and clarity.



819



   x0 x0 x0  .   .   ..   ..   ..  hom  .  =  = . xn−1  xn−1  xn−1 1 xn



(B.56)



hom−1 (x) = x.



(B.57)







Appendix B Linear Algebra











Note that the homogeneous representation of a Cartesian vector is not unique, but every multiple of the homogeneous vector is an equivalent representation of x. Thus any scaled homogeneous vector x′ = s·x (with s ∈ R, s ̸= 0) corresponds to the same Cartesian vector (see also Eqn. (B.61)). To convert a given homogeneous point x = (x0 , . . . , xn )⊺ back to Cartesian coordinates x we simply write



This operation can be easily derived as       x0 x0 x0  ..  1  .   .    hom−1 .  = · ..  =  .. , xn−1  xn xn−1 xn−1 xn



(B.58)



provided that xn ̸= 0. Two homogeneous points x1 , x2 are considered equivalent (≡), if they represent the same Cartesian point, that is, x1 ≡ x2 ⇔ hom−1 (x1 ) = hom−1 (x2 ).



(B.59)



It follows from Eqn. (B.58) that hom−1 (x) = hom−1 (s·x)



(B.60)



for any nonzero factor s ∈ R. Thus, as mentioned earlier, any scaled homogeneous point corresponds to the same Cartesian point, that is, x ≡ s·x.



(B.61)



For example, for the Cartesian point x = (3, 7, 2)⊺ , the homogeneous coordinates  3   −3   9   30  hom(x) = 72 ≡ −7 ≡ 21 ≡ 70 ... (B.62) −2 6 20 1



−1



3



10



are all equivalent. Homogeneous coordinates can be used for vector spaces of arbitrary dimension, including 2D coordinates.



B.7 Basic Matrix-Vector Operations with the Apache Commons Math Library It is recommended to use proven standard software, such as the Apache Commons Math 10 (ACM) library, for any non-trivial linear algebra calculation. This section describes a selection of common operations with corresponding examples. 10



820



https://commons.apache.org/proper/commons-math/



B.7.1 Vectors and Matrices The basic data structures for representing vectors and matrices are RealVector and RealMatrix, respectively. The following ACM examples show the conversion from and to simple Java arrays of element-type double:



B.7 Basic Matrix-Vector Operations with the Apache Commons Math Library



import org.apache.commons.math3.linear.MatrixUtils; import org.apache.commons.math3.linear.RealMatrix; import org.apache.commons.math3.linear.RealVector; // Data given as simple arrays: double[] xa = {1, 2, 3}; double[][] Aa = {{2, 0, 1}, {0, 2, 0}, {1, 0, 2}}; // Conversion to vectors and matrices: RealVector x = MatrixUtils.createRealVector(xa); RealMatrix A = MatrixUtils.createRealMatrix(Aa); // Get a particular matrix element Ai,j : int i, j; // specify row (i) and column (j) double aij = A.getEntry(i, j); // Set a particular matrix element to a new value: double value; A.setEntry(i, j, value); // Extract data to arrays again: double[] xb = x.toArray(); double[][] Ab = A.getData(); // Transpose the matrix A: RealMatrix At = A.transpose();



B.7.2 Matrix-Vector Multiplication The following examples show how to implement the various matrixvector products described in Sec. B.2.3. RealMatrix A = ...; // matrix A of size (m, n) RealMatrix B = ...; // matrix B of size (p, q), with p = n RealVector x = ...; // vector x of length n // Scalar multiplication C ← s · A: double s = ...; RealMatrix C = A.scalarMultiply(s); // Product of two matrices: C ← A · B: RealMatrix C = A.multiply(B); // C is of size (m, q) // Left-sided matrix-vector product: y ← A · x: RealVector y = A.operate(x); // Right-sided matrix-vector product: y ← x⊺ · A: RealVector y = A.preMultiply(x);



B.7.3 Vector Products The following code segments show the use of the ACM library for calculating various vector products described in Sec. B.3.



821



Appendix B Linear Algebra



RealVector a, b;



// vectors a, b (both of length n)



// Multiplication by a scalar c ← s · a: double s; RealVector c = a.mapMultiply(s); // Dot (scalar) product x ← a · b : double x = a.dotProduct(b); // Outer product M ← a ⊗ b: RealMatrix M = a.outerProduct(b);



B.7.4 Inverse of a Square Matrix The following example shows the inversion of a square matrix: RealMatrix A = ... ; // a square matrix RealMatrix Ai = MatrixUtils.inverse(A);



B.7.5 Eigenvalues and Eigenvectors The following code snippet calculates and lists the eigenvalues and eigenvectors of a square matrix A using class EigenDecomposition of the Apache Commons Math API: import org.apache.commons.math3.linear.EigenDecomposition; RealMatrix A = MatrixUtils.createRealMatrix(new double[][] {{2, 0, 1}, {0, 2, 0}, {1, 0, 2}}); EigenDecomposition ed = new EigenDecomposition(A); // Get all real eigenvalues: double[] lambda = ed.getRealEigenvalues();



// = (3, 2, 1)



// Get the associated eigenvectors: for (int i = 0; i < lambda.length; i++) { RealVector x = ed.getEigenvector(i); ... }



B.8 Solving Systems of Linear Equations



822



This section describes standard methods for solving systems of linear equations. Such systems appear widely and frequently in all sorts of engineering problems. Identifying them and knowing about standard solution methods is thus quite important and may save much time in any development process. In addition, the solution techniques presented here are very mature and numerically stable. Note that this section is supposed to give only a brief summary of the topic and practical implementations using the Apache Commons Math library. Further details and the underlying theory can be found in most linear algebra textbooks (e.g., [169, 214]). Systems of linear equations generally come in the form



      



A0,0 A1,0 A2,0 .. .



A0,1 A1,1 A2,1 .. .



Am−1,0 Am−1,1



     · · · A0,n−1 b0 x0   · · · A1,n−1    x 1   b1     b2  · · · A2,n−1   ·  ..  =  ,   .   ..  .. ..   . . .  xn−1 · · · Am−1,n−1 bm−1



B.8 Solving Systems of Linear Equations



(B.63)



or, in the standard notation, A · x = b,



(B.64)



where the (known) matrix A is of size (m, n), the unknown vector x is n-dimensional, and the (known) vector b is m-dimensional. Thus n corresponds to the number of unknowns and m to the number of equations. Each row i of the matrix A thus represents a single equation Ai,0 ·x0 + Ai,1 ·x1 + . . . + Ai,n−1 ·xn−1 = bi or



n−1 X



Ai,j ·xj = bi ,



(B.65) (B.66)



j=0



for i = 0, . . . , m−1. Depending on m and n, the following situations may occur: • If m = n (i.e., A is square) the number of unknowns matches the number of equations and the system typically (but not always, of course) has a unique solution (see Sec. B.8.1 below). • If m < n, we have more unknowns than equations. In this case no unique solution exists (but possibly infinitely many). • With m > n the system is said to be over-determined and thus not solvable in general. Nevertheless, this is a frequent case that is typically handled by calculating a minimum least squares solution (see Sec. B.8.2). B.8.1 Exact Solutions If the number of equations (m) is equal to the number of unknowns (n) and the resulting (square) matrix A is non-singular and of full rank, the system A·x = b can be expected to have a unique solution for x. Within algorithms, we use the notation x′ ← solve(A·x = b)



(B.67)



for finding the exact solution (x′ ) to such a linear system. For example, the system11 2 · x0 + 3 · x1 − 2 · x2 = 1, −x0 + 7 · x1 + 6 · x2 = −2, 4 · x0 − 3 · x1 − 5 · x2 = 1,



(B.68)



with 11



Example taken from the Apache Commons Math User Guide [9].



823



Appendix B Linear Algebra







 2 3 −2 A =  −1 7 6  , 4 −3 −5



  x0 x = x1  , x2







 1 b =  −2  , 1



(B.69)



has the unique solution x = (−0.3698, 0.1780, −0.6027)⊺ . The following code segment shows how this example is solved using class LUDecomposition of the ACM library (see Table B.1): import org.apache.commons.math3.linear.DecompositionSolver; import org.apache.commons.math3.linear.LUDecomposition; RealMatrix A = MatrixUtils.createRealMatrix(new double[][] {{ 2, 3, -2}, {-1, 7, 6}, { 4, -3, -5}}); RealVector b = MatrixUtils.createRealVector(new double[] {1, -2, 1}); DecompositionSolver solver = new LUDecomposition(A).getSolver(); RealVector x = solver.solve(b);



An exception is thrown if matrix A is non-square or singular. B.8.2 Over-Determined System (Least-Squares Solutions) If a system of linear equations has more equations than unknowns (i.e., m > n) it is over-determined and thus has no exact solution. In other words, there is no vector x that satisfies A·x = b or, equivalently, A · x − b = 0.



(B.70)



Instead, any x plugged into Eqn. (B.70) yields some non-zero “residual” vector ϵ, such that A · x − b = ϵ.



(B.71)



A “best” solution is commonly found by minimizing the squared norm of this residual, that is, by searching for x such that 2



2



∥A·x − b∥ = ∥ϵ∥ → min .



(B.72)



Several common matrix decompositions can be used for calculating the “least-squares solution” of an over-determined system of linear equations. As a simple example, we add a fourth line (m = 4) to the system in Eqns. (B.68) and (B.69) to       2 3 −2 1 x 0  −1 7 6   −2    A= x = x1  , b= (B.73)  4 −3 −5  ,  1, x2 2 −2 −1 0



824



without changing the number of unknowns (i.e., n = 3). The leastsquares solution to this over-determined system is (approximately) x = (−0.2339, 0.1157, −0.4942)⊺ . The following code segment shows this calculation using class SingularValueDecomposition of the ACM library:



import org.apache.commons.math3.linear.DecompositionSolver; import org.apache.commons.math3.linear.SingularValueDecomposition;



B.8 Solving Systems of Linear Equations



RealMatrix A = MatrixUtils.createRealMatrix(new double[][] {{ 2, 3, -2}, {-1, 7, 6}, { 4, -3, -5}, { 2, -2, -1}); RealVector b = MatrixUtils.createRealVector(new double[] {1, -2, 1, 0}); DecompositionSolver svd = new SingularValueDecomposition(A); RealVector x = svd.getSolver().solve(b);



Alternatively, an instance of QRDecomposition could be used for calculating the least-squares solution. If an exact solution exists (see Sec. B.8.1), it is the same as the least-squares solution (with zero residual, i.e., ϵ = 0). Table B.1 lists the decomposition methods implemented by the Apache Commons Math library for solving linear systems of equations. Similar methods are available in most common linear algebra packages. Method



Coefficient Matrix (A)



Solution Type



LUDecomposition



square



exact solution only



EigenDecomposition



square



exact solution only



CholeskyDecomposition



square, symmetric pos. definite exact solution only



QRDecomposition



any



least squares solution



SingularValueDecomposition any



least squares solution



Table B.1 Decomposition methods for solving linear systems of type A·x = b as implemented by the Apache Commons Math library [8].



B.8.3 Solving Homogeneous Linear Systems A special (though not rare) case is a system of linear equations of the form A · x = 0,



(B.74)



where A has m rows and n columns. This system of equations is called “homogeneous” because of the zero-vector on its right-hand side. It has an obvious (trivial) solution, x = 0, which is of no interest, but do other solutions exist? The solution space of the homogeneous system in Eqn. (B.74) is called the null space of matrix A, whose dimension d = n− rank(A) determines the number of linearly independent solutions of the system (see, e.g., [97, Sec. 2.4]). Thus the condition rank(A) < n



(B.75)



must hold for the system in Eqn. (B.74) to have at least one solution other than the trivial solution. If another solution exists, it is not unique: if x is a solution to Eqn. (B.74), then any scaled vector s·x (for s ∈ R) is a solution too. With ∥x∥ = 1 as an additional constraint, Eqn. (B.74) can be solved by singular-value decomposition (SVD) of A, which separates the original (m × n) matrix into the product of three matrices U, S, V of the form



825



Appendix B Linear Algebra







A = U·S·V ,



(B.76)



12



where U is a m × m unitary matrix, S is a m × n rectangular diagonal matrix with non-negative real values, and V is a n × n unitary matrix [44, 115, 214].13 The resulting structure is   s0 · · · 0 · · · 0 . . .  ..   ⊺  .. . . . ..   — v0 —     ..  0 · · · si · · · 0    .. . . ..   | | | .  ..    ⊺ . . .   . — v — A = u0 · · · ui · · · um−1  · . ·  i   0 · · · 0 · · · sn−1    . | | |   .   0 ··· 0 ··· 0  .  .  ⊺ . .  .. .. ..  — v n−1 — 0 ··· 0 ···



0



(B.77)



The diagonal elements of S, s0 , . . . , sn−1 , are called the singular values of the decomposed matrix A. Each singular value si = Si,i has an associated column vector ui in U (called a left singular vector of A) and a dedicated row vector v ⊺i in V⊺ (i.e., the i-th column vector of V), called a right singular vector. Thus U consists of the column vectors u0 , . . . , um−1 and V of the column vectors v 0 , . . . , v n−1 . If the condition in Eqn. (B.75) is satisfied, the non-trivial leastsquares solution to Eqn. (B.74) is found as the right singular vector v k associated with the smallest singular value sk = min(s0 , . . . , sn−1 ), that is, x = vk ,



with k = argmin si . 0≤i MaxIterations) { return null; // solver did not converge } else { return solution.getPoint(); //= popt }



C.2.1 Implementation With Apache Commons Math Program C.1 shows a sample Java implementation of the nonlinear least-squares solver function SolveNLS(), as defined in Eqn. (C.6). It is based on the Apache Commons Math implementation of the Levenberg-Marquardt method.1 C.2.2 Example 1: One-Dimensional Curve Fitting Let us look at a simple example to illustrate the concept and clarify the terminology. Suppose we want to fit a one-dimensional (decaying sine) function of type y = fp (x) = e−ax · sin(bx) + c,



(C.7)



with parameters p = (a, b, c), to a given set of measurements (see Eqn. (C.3) and Fig. C.1)2 {(xi , yi )} = {(3, 2.5), (6, 1.7), (8, 2.5), (8, 2.3), (15, 2.1)}. The vector of observed values (zi = yi ) thus is ⊺







z = (z0 , z1 , z2 , z3 , z4 ) = (2.5, 1.7, 2.5, 2.3, 2.1) . 1



2



Method solveNLS() is actually defined in the imagingbook library (class NonlinearLeastSquares). Note that there are (intentionally) two measurements with different yvalues but the same x-position 8, which is perfectly fine.



829



y



Appendix C Nonlinear Least Squares



3.0



2.5



Fig. C.1 One-dimensional curve fitting by nonlinear least-squares. Measurements (xi , yi ) (red dots) and fitted function fp (x) (Eqn. (C.7)), for the optimal parameter vector popt .



(xi , yi ) fp (x)



2.0



1.5



5



10



15



20



25



x



The individual value functions, predicting the outcome for each observation point xi , (see Eqn. (C.4)) are vi (p) = fp (xi ) = e



−a·xi



· sin(b · xi ) + c,



and thus the complete (vector-valued) value function is    e−a·3 · sin(b · 3) + c  v0 (p) e−a·6 · sin(b · 6) + c  v1 (p)        =  e−a·8 · sin(b · 8) + c  v (p) V (p) =  . 2    v3 (p)   e−a·8 · sin(b · 8) + c  v4 (p) e−a·15 · sin(b · 15) + c



(C.8)



(C.9)



Note that each function vi (p) only depends on the parameters p = (a, b, c) as free variables, while the associated measurement xi is treated as a constant. To set up the Jacobian matrix we need the first partial derivatives of each function vi (p) w.r.t. the parameters a, b, c, that is, ∂vi ∂a (p)



= −x·e



−a·x



· sin(b · x),



(C.10)



∂vi ∂b (p)



=



x·e



−a·x



· cos(b · x),



(C.11)



∂vi ∂c (p)



=



1.



(C.12)



Thus the complete n × 3 Jacobian matrix at some “point” p is  ∂v  ∂v0 ∂v0 0 ∂a (p) ∂b (p) ∂c (p) .. .. ..   J(p) =  (C.13)  . . . 



∂vn−1 ∂vn−1 ∂vn−1 ∂a (p) ∂b (p) ∂c (p) −a·x0 −x0 · e sin(b·x0 )



 1  .. .. ..  = . . .   . −a·xn−1 −a·xn−1 −xn−1 · e sin(b·xn−1 ) xn−1 · e cos(b·xn−1 ) 1 x0 · e



−a·x0



cos(b·x0 )



Running SolveNLS(V, J, z, p0 ) with initial parameters p0 = (0, 1, 2) returns the least-squares solution popt = (0.1150, 0.8624, 2.1302),



830



for which the Levenberg-Marquardt solver takes 10 iterations to converge. The residual fitting error is



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56



// import statements omitted void NonlinearCurveFit1D() { double[][] data = // {(xi , yi )} {{3, 2.5}, {6, 1.7}, {8, 2.5}, {8, 2.3}, {15, 2.1}}; int n = data.length; int k = 3; // number of parameters in p MultivariateVectorFunction V = // = V (p) new MultivariateVectorFunction() { @Override public double[] value(double[] p) { double[] V = new double[n]; double a = p[0]; double b = p[1]; double c = p[2]; for (int i = 0; i < n; i++) { double x = data[i][0]; V[i] = exp(-a * x) * sin(b * x) + c; } return V; } }; MultivariateMatrixFunction J = // = J(p) new MultivariateMatrixFunction() { @Override public double[][] value(double[] p) { double[][] J = new double[n][k]; double a = p[0]; double b = p[1]; for (int i = 0; i < n; i++) { double x = data[i][0]; J[i][0] = -exp(-a * x) * x * sin(b * x); J[i][1] = exp(-a * x) * x * cos(b * x); J[i][2] = 1; } return J; } };



}



C.2 Solution Methods



Prog. C.2 Implementation of the 1D fitting example with model function f (x) = e−ax · sin(bx) + c. Sample data {(xi , yi )} are defined in line 3. The value function V (p) is defined as an instance of MultivariateVectorFunction (line 9) and the Jacobian function J(p) is an instance of MultivariateMatrixFunction (line 25). The vector of observed values z is extracted from the data by the makeTarget() helper method (line 41) and the initial parameter vector p0 is initialized in line 43. Finally, method solveNLS() (see Prog. C.1) is called to obtain the optimal solution (line 44).



RealVector z = makeTarget(data); // = z RealVector p0 = // = p0 new ArrayRealVector(new double[] {0, 1, 2}); RealVector popt = solveNLS(V,J,z,p0); // see Prog. C.1 ... // use solution popt = (a, b, c)



RealVector makeTarget(double[][] data) { int n = data.length; RealVector target = new ArrayRealVector(n); for (int i = 0; i < n; i++) { target.setEntry(i, data[i][1]); // extract y-values } return target; }



∥R(popt )∥ = ∥V (popt )−z∥ = 0.1801. The fitted function f (x) (see Eqn. (C.7)) is plotted in Fig. C.1. A complete implementation of this example is shown in Prog. C.2. 831



Appendix C Nonlinear Least Squares



C.3 Multi-Dimensional NLS Problems The model function in the first nonlinear least squares example (see Eqn. (C.7)) is one-dimensional, i.e., fp (xi ) is a scalar value and the argument x is scalar-valued as well. In the general case of a vectorvalued function, f p (xi ) ∈ Rn , the argument xi and the outcome of the function are both vectors. To simplify things, we can treat each of the n output components as the result of a separate, scalar-valued function fp,j , that is, 



 fp,0 (xi ) .. . f p (xi ) =  .



(C.14)



fp,n−1 (xi )



C.3.1 Example 2: Geometric Circle Fitting In this example the task is to fit a set of n observed 2D points xi = (xi , yi ) to a circle, specified by k = 3 parameters p = (xc , yc , r). This problem (geometric circle fitting) is discussed at length in Sec. 11.1.3, which describes a “distance-based” and a “coordinate-based” approach. The coordinate-based approach tries to minimize the x/y deviations between the observed 2D points xi and the associated points “predicted” by the circle model (which are the on-circle points closest to xi ). Since the problem is 2-dimensional, the objective function has two components, i.e.,



f p (xi ) =







vi,x (p) vi,y (p)







=



xc + (xi −xc )· rr i yc + (yi −yc )· rr i



!



(C.15)



p with ri = (xi −xc )2 + (yi −yc )2 and circle parameters xc , yc , r. The associated value functions (analogous to Eqn. (C.8)) vi,x (p), vi,x (p) are are stacked in pairs for all n data points to build the composite value function (analogous to Eqn. (C.9)), i.e., 



    V (p) =    



832



v0,x (p) v0,y (p)



.. .



vi,x (p) vi,y (p)



.. .



vn−1,x (p) vn−1,y (p)







    .   



(C.16)



Thus the outcome of function V (p) is a vector of length 2n. Similarly, the size of the associated Jacobian matrix is 2n × 3 (since there are k = 3 parameters):







∂v0,x ∂xc ∂v0,y



(p)



  ∂xc (p)  .   ∂v ..  i,x  ∂xc (p) J(p) =  ∂vi,y  ∂x (p) c  ..   .  ∂vn−1,x  ∂xc (p) ∂vn−1,y ∂xc



(p)



∂v0,x ∂yc ∂v0,y ∂yc



(p) (p)



.. .



∂vi,x ∂yc ∂vi,y ∂yc



(p) (p)



.. .



∂vn−1,x ∂yc ∂vn−1,y ∂yc



(p) (p)



∂v0,x ∂r ∂v0,y



(p)







  ∂r  ..  .  ∂vi,x  ∂r (p)  . ∂vi,y  ∂r (p)  ..   . ∂vn−1,x  (p)  ∂r (p)



∂vn−1,y ∂r



C.3 Multi-Dimensional NLS Problems



(C.17)



(p)



See Eqn. (11.56) for the concrete partial derivatives and, in general, Sec. 11.1.3 for more details. Program C.3 shows a concrete implementation of this process, which follows the summary given in Alg. 11.4 (see p. 311). It resorts to method solveNLS(), called in line 53 and defined in Prog. C.1, for the actual nonlinear least-squares (Levenberg-Marquardt) optimization. The remaining parts of Prog. C.3 illustrate how to set up the “model” functions V (p) and J(p) to produce the value vector and Jacobian matrix, respectively, for the current parameter point p. The corresponding function objects V (line 8) and J (line 26) are passed to method solveNLS() (line 53) and invoked by the leastsquares solver for every visited parameter point p (i.e., at least once in every iteration). Note that the quantities calculated by the value() methods of the two function objects MultivariateVectorFunction V, MultivariateMatrixFunction J, respectively, are almost identical. Since the optimizer typically calls both methods jointly for the same parameter vector p, this means that the very same calculations are unnecessarily done twice. A more efficient solution in this case is to use a combined value and Jacobian function of type MultivariateJacobianFunction VJ, which specifies a single value() method that returns both the value vector and the Jacobian matrix as a pair, i.e., Pair value(RealVector point). In this way, calculations for parameters p must be performed only once within a single method.3 This issue is even more pronounced in the context of geometric ellipse fitting (see Sec. 11.2.2), where the calculation of the closest ellipse points (required for both the model values and the Jacaobian) is performed iteratively (and thus “expensive”). Numeric Example The example in Prog. C.3 fits a circle to n = 5 given data points 3



Interested readers should look at the corresponding Java code in the imagingbook library.



833



Appendix C Nonlinear Least Squares Prog. C.3 Geometric circle fitting example. Circles are specified by parameters p = (xc , yc , r). Fitting is performed by minimizing the squared Euclidean distances between a given set of sample points (data, line 3) and the associated closest points on the circle. The value function V (p) is defined as an instance of MultivariateVectorFunction (line 9) and the Jacobian function J(p) is an instance of MultivariateMatrixFunction (line 27). The vector of observed values z is extracted from the data by the makeTarget() method (line 50) and the initial parameter vector p0 is initialized in line 51. Finally, method solveNLS() is called to obtain the optimal solution (line 53).



1 // import statements omitted 2 void GeometricCircleFit() { 3 double[][] data = // 2D sample points 4 {{15, 9}, {68, 33}, {35, 69}, {17, 51}, {90, 54}}; 5 int n = data.length; 6 int k = 3; // number of parameters in p 7 8 MultivariateVectorFunction V = // = V (p) 9 new MultivariateVectorFunction() { 10 @Override 11 public double[] value(double[] p) { 12 double[] V = new double[2*n]; 13 double xc = p[0], yc = p[1], r = p[2]; 14 for (int i = 0; i < n; i++) { 15 double dx = data[i][0] - xc; 16 double dy = data[i][1] - yc; 17 double ri = hypot(dx, dy); 18 // closest circle point (see Eqn. (11.53)): 19 V[2*i] = xc + dx * r / ri; // = x ˘i 20 V[2*i+1] = yc + dy * r / ri; // = y˘i 21 } 22 return V; 23 } 24 }; 25 26 MultivariateMatrixFunction J = // = J(p) 27 new MultivariateMatrixFunction() { 28 @Override 29 public double[][] value(double[] p) { 30 double[][] J = new double[2*n][k]; 31 double xc = p[0], yc = p[1], r = p[2]; 32 for (int i = 0; i < n; i++) { 33 double dx = data[i][0] - xc; 34 double dy = data[i][1] - yc; 35 double ri = hypot(dx, dy); // ri 36 double ri2 = ri * ri; // ri2 37 double ri3 = ri2 * ri; // ri3 38 // partial derivatives (see Eqn. (11.56)): 39 J[2*i][0] = 1 + (r / ri) * (sqr(dx) / ri2 - 1); 40 J[2*i][1] = r * dx * dy / ri3; 41 J[2*i][2] = dx / ri; 42 J[2*i+1][0] = r * dx * dy / ri3; 43 J[2*i+1][1] = 1 + (r / ri) * (sqr(dy) / ri2 - 1); 44 J[2*i+1][2] = dy / ri; 45 } 46 return J; 47 } 48 }; 49 50 RealVector z = makeTarget(data); // see below 51 RealVector p0 = new ArrayRealVector(new double[] {45,40,30}); 52 53 RealVector popt = solveNLS(V,J,z,p0); // see Prog. C.1 54 ... // use solution popt = (xc , yc , r) 55 } 56 RealVector makeTarget(double[][] data) { 57 int n = data.length; 58 final double[] target = new double[2*n]; 59 for (int i = 0; i < n; i++) { 60 target[2*i] = data[i][0]; // xi 61 target[2*i+1] = data[i][1]; // yi 62 } 63 return MatrixUtils.createRealVector(target); 64 }



834



15  , 9



X = (x0 , . . . , x4 ) =



68  , 33



35  , 69



17  , 51



90  , 54



(C.18) C.3 Multi-Dimensional NLS Problems



as shown in Fig. C.2. The target vector z is obtained by “flattening” the point coordinates xi = (xi , yi ) to a 1D sequence, i.e., z = (x0 , y0 , . . . , xj , yj , . . . , xn−1 , yn−1 )







(C.19)







= (15, 9, 68, 33, 35, 69, 17, 51, 90, 54) , which is of length 2n = 10. The corresponding Jacobian matrix is of size 10 × 3. For the initial parameters p0 = (xc , yc , r) = (45, 40, 30), the associated model values are     V



24.137  18.442   73.700   31.265   35.220   (p0 ) =   68.361   17.077   50.970    73.646 48.912



and



0.641  0.348  0.894  −0.348  0.126 J(p0 ) =   −0.301  0.867  −0.339  0.944 0.181



0.348 0.664 −0.348 −0.142 −0.301 0.896 −0.339 0.136 0.181 0.420



−0.695 −0.719  0.957  −0.291   −0.326  , 0.945  −0.931   0.366  0.955 0.297



with the initial residual F (p0 ) = ∥V (p0 )−z∥ = 22.40 (see Eqn. (C.3)). The Levenberg-Marquardt solver takes 12 iterations to find the optimal circle parameters, popt = (xc , yc , r) = (44.959, 36.208, 35.613), with residual F (popt ) = ∥V (p0 )−z∥ = 18.92. This result is shown in Fig. C.2 (a). The red-dashed circle corresponds to the initial parameter vector p0 , the green circle is the final result (popt ). Only one (gray) intermediate circle is visible, since the solution is very close to the final optimum after the first iteration already. initial parameters (p0 )



0



interm. steps



1



optimal fit



4 3 2



1 0



(a) p0 = (45, 40, 30), 12 iterations



(b) p0 = (75, 75, 12), 15 iterations



Figure C.2 (b) shows a second example with the same data points but starting from initial parameters p0 = (75, 75, 12), i.e., far away from the final solution. Again intermediate solutions are shown as gray circles, labeled with the associated iteration number.



Fig. C.2 Geometric circle fitting examples. The blue dots represent the sample points listed in Eqn. (C.18), the red-dashed circle corresponds to the initial parameters p0 , the green circle is the result of nonlinearleast-squares fitting. Intermediate circles (corresponding to parameter points visited by the optimizer) are shown in gray with attached iteration numbers. The situation in (a) is the one described in the numeric example, with p0 = (45, 40, 30). The outcome in (b), with the initial circle placed far off the final solution at p0 = (75, 75, 12), is the same as in (a). The solver converged after 12 and 15 iterations, respectively. Frame size is 100 × 100.



835



Appendix C Nonlinear Least Squares



C.3.2 Numerical Estimation of Partial Derivatives When building a non-linear optimizer, getting the partial derivatives right is one of the hardest parts and results may be catastrophic if any of them are wrong. To verify the correctness of the Jacobian matrix it may be helpful to calculate derivatives numerically. This can be done by finite difference approximation, i.e., by varying one parameter by some small amount (while keeping all other parameters fixed) and measuring the resulting change of the value functions. For example, if we vary only the xc component of parameter vector p, i.e.,     xc xc +δ p =  yc  , and p ′ =  yc  r r and calculate the associated outcomes of function vi,x (), e.g., vi,x (p) = x ˘,



and



vi,x (p′ ) = x ˘′ ,



the partial derivative of vi,x () w.r.t. parameter xc at point p can be estimated as ∂vi,x x ˘′ −˘ x (p) ≈ . ∂xc δ



(C.20)



In this way all partial derivatives and thus the complete Jacobian (e.g., Eqn. (C.17)) can be estimated. The value δ is not critical, typically δ = 10−5 (or smaller) works fine. The numerical results are usually quite close to the real (i.e., analytic) derivatives. This is also a practical method for calculating the Jacobian in the first place if partial derivatives are too hard (or impossible) to find analytically.



836



Appendix D Elements from Calculus



This part outlines selected topics from calculus that may serve as a useful supplement to Chapters 5, 16, 17, 24, and 25, in particular.



D.1 Scalar and Vector Fields An RGB color image I(u, v) = (IR (u, v), IG (u, v), IB (u, v)) can be considered a 2D function whose values are 3D vectors. Mathematically, this is a special case of a vector-valued function f : Rn 7→ Rm ,   f0 (x) ..   f (x) = f (x0 , . . . , xn−1 ) =  (D.1) , . fm−1 (x) which is composed of m scalar-valued functions fi : Rn 7→ R, each being defined on the domain of n-dimensional vectors. A multi-variable, scalar-valued function f : Rn 7→ R is called a scalar field, while a vector-valued function f : Rn 7→ Rm is referred to as a vector field. D.1.1 The Jacobian Matrix Assuming that the function f (x) = (f0 (x), . . . , fm−1 (x))⊺ is differentiable, the so-called functional or Jacobian matrix at some specific point x˙ ∈ Rn is defined as 



 ˙ = Jf (x) 



∂f0 ˙ ∂x0 (x)



.. .



∂fm−1 ˙ ∂x0 (x)



··· .. . ···



 ∂f0 ˙ ∂xn−1 (x) .. .



∂fm−1 ˙ ∂xn−1 (x)



 . 



(D.2)



The Jacobian matrix is of size m×n and composed of the first derivatives of the m component functions f0 , . . . , fm−1 with respect to each of the n independent variables x0 , . . . , xn−1 . Thus each of its elements © Springer Nature Switzerland AG 2022 W. Burger and M. J. Burge, Digital Image Processing, Texts in Computer Science, https://doi.org/10.1007/978-3-031-05744-1



837



Appendix D Elements from Calculus



∂fi ˙ ∂xj (x)



quantifies how much the value of the scalar-valued component function fi (x) = fi (x0 , . . . , xn−1 ) changes when only variable xj is varied and all other variables remain fixed. Note that the matrix ˙ In general, Jf (x) is not constant but is different at each position x. the Jacobian matrix is neither square (unless m = n) nor symmetric. D.1.2 Gradients Gradient of a Scalar Field The gradient of a scalar field f : Rn 7→ R, with f (x) = f (x0 , ..., xn−1 ), at some given position x˙ ∈ Rn is defined as 



 ˙ = (grad f )(x) ˙ = (∇f )(x)



∂f ˙ ∂x0 (x)



.. .



∂f ˙ ∂xn−1 (x)







 .



(D.3)



This vector-valued function quantifies the amount of output change with respect to changing any of the input variables x0 , . . . , xn−1 at ˙ Thus the gradient of a scalar field is a vector field. position x. The directional gradient of a scalar field describes how the (scalar) function value changes when the coordinates are modified along a particular direction, specified by a unit vector e ∈ Rn . We denote the directional gradient as ∇e f and define ⊺



˙ = [(∇f )(x)] ˙ (∇e f )(x) ·e,



(D.4)



where · is the dot product (see Sec. B.3.1). The result is a scalar value that can be interpreted as the slope of the tangent on the ˙ along the din-dimensional surface of the scalar field at position x, rection specified by the n-dimensional unit vector e = (e0 , . . . , en−1 )⊺ . In the 2D case (n = 2), the gradient direction may also be specified by some angle θ, i.e., ˙ = (∇e f )(x), ˙ (∇θ f )(x)



(D.5)



with e = (cos(θ), sin(θ))⊺ . Gradient of a Vector Field To calculate the gradient of a vector field f : Rn 7→ Rm , we note that each row i in the m × n Jacobian matrix Jf (Eqn. (D.2)) is the transposed gradient vector of the corresponding component function fi , that is,   ˙ ⊺ [(∇f0 )(x)]   .. , ˙ = Jf (x) (D.6) .   ⊺ ˙ [(∇fm−1 )(x)] and thus the Jacobian matrix is equivalent to the gradient of the vector field f ,



838



˙ ≡ Jf (x). ˙ (grad f )(x)



(D.7)



Analogous to Eqn. (D.4), the directional gradient of the vector field D.1 Scalar and Vector is then defined as Fields ˙ ≡ Jf (x) ˙ · e, (grade f )(x)



(D.8)



where e ∈ Rn is again a unit vector specifying the gradient direction and · is the ordinary matrix-vector product. In this case the resulting gradient is a m-dimensional vector with one element for each component of function f . D.1.3 Maximum Gradient Direction ˙ (see In case of a scalar field f : Rn 7→ R, the gradient vector (∇f )(x) Eqn. (D.3)) is also the direction of the steepest ascent of f (x) at ˙ 1 In this case, the L2 norm (see Sec. B.1.3) of the gradient position x. ˙ vector, that is, ∥(∇f )(x)∥, corresponds to the maximum slope of f at point x˙ in any direction. In the case of a vector field f : Rn 7→ Rm , the direction of maximum slope cannot be obtained directly, since the gradient is not a vector but its m × n Jacobian matrix. In this case, the direction of maximum change in the function f is found as the eigenvector xk of the square (n × n) matrix ⊺



˙ Jf (x) ˙ M = Jf (x)·



(D.9)



that corresponds to its largest eigenvalue λk (see also Sec. B.5). D.1.4 Divergence of a Vector Field If a vector field maps to the same vector space (i.e., f : Rn 7→ Rn ), its divergence (div) is defined as ˙ = (divf )(x) =



∂f0 ˙ ∂x0 (x) n−1 X



+ ··· +



∂fi ˙ ∂xi (x)



∂fn−1 ˙ ∂xn−1 (x)



∈ R,



(D.10) (D.11)



i=0



˙ The result is a scalar value and thus (divf )(x) ˙ for some given point x. yields a scalar field Rn 7→ R. Note that, in this case, the Jacobian matrix of f (see Eqn. (D.2)) is square (of size n × n), and divf is equivalent to the trace of Jf , that is, ˙ ≡ tr(Jf (x)). ˙ (divf )(x)



(D.12)



D.1.5 The Laplacian Operator The Laplace operator (or simply Laplacian) of a scalar field f : Rn 7→ R is a linear differential operator, commonly denoted ∆ or ∇2 . The result of applying this operator to some scalar field f : Rn 7→ R generates another scalar field that consists of the sum of all unmixed second-order partial derivatives of f (if existent), that is, 1



˙ = 0, the direction of the If the gradient vector is zero, i.e., (∇f )(x) ˙ gradient is undefined at position x.



839



Appendix D Elements from Calculus



˙ = (∇2 f )(x)



∂2f ˙ (x) ∂x20



+ ··· +



∂2f ˙ (x) ∂x2n−1



=



n−1 X i=0



∂2f ˙ (x). ∂x2i



(D.13)



The result is a scalar value that is equivalent to the divergence (see Eqn. (D.11)) of the gradient (see Eqn. (D.3)) of the scalar field f , that is, ˙ = (div∇f )(x). ˙ (∇2 f )(x)



(D.14)



The Laplacian is also found as the trace of the function’s Hessian matrix Hf (see Sec. D.1.6). For a vector-field f : Rn 7→ Rm , the Laplacian at point x˙ is again a vector field Rn 7→ Rm ,   ˙ (∇2 f0 )(x)   .. m ˙ = (∇2 f )(x) (D.15) ∈R , . ˙ (∇2 fm−1 )(x)



that is obtained by applying the Laplacian to the individual (scalarvalued) component functions. D.1.6 The Hessian Matrix



The Hessian matrix of a n-variable, real-valued function (scalar field) f : Rn 7→ R is the n × n square matrix composed of its second-order partial derivatives (assuming they all exist), that is,   H0,0 H0,1 · · · H0,n−1  H H1,1 · · · H1,n−1   1,0  Hf =  . (D.16)  .. .. ..   .. . . . Hn−1,0 Hn−1,1 · · · Hn−1,n−1   ∂2f ∂2f ∂2f · · · 2 ∂x0 ∂x1 ∂x0 ∂xn−1   ∂x0 2 2  ∂2f  ∂ f  ∂x ∂x · · · ∂x ∂∂xf  ∂x21  . 0 n−1 1 1 = (D.17)  .. .. .. ..   . . . .   2 ∂2f ∂2f · · · ∂x∂2 f ∂x ∂x ∂x ∂x n−1



0



n−1



1



n−1



Since the order of differentiation does not matter (i.e., Hi,j = Hj,i ), Hf is symmetric. Note that the Hessian is a matrix of functions. To evaluate the Hessian at a particular point x˙ ∈ Rn , we write  ∂2f  2 ˙ ˙ (x) · · · ∂x ∂∂xf (x) ∂x20 n−1 0   .. .. .. , ˙ = Hf (x) (D.18) . . .   ∂2f ∂2f ˙ · · · ∂x2 (x) ˙ (x) ∂x ∂x n−1



0



n−1



which is a (scalar-valued) matrix of size n × n. As mentioned before, the trace of the Hessian matrix is the Laplacian ∇2 of function f , i.e., ∇2 f = tr(Hf ) = 840



n−1 X i=0



∂2f . ∂x2i



(D.19)



Example: Hessian of a 2D function Given a 2D, continuous function, e.g., a grayscale image or scalarvalued intensity function I(x, y), the corresponding Hessian matrix (of size 2 × 2) contains all second derivatives along the coordinates x, y, that is, !   ∂2I ∂2I I I ∂x2 ∂x∂y HI = ∂ 2 I ∂ 2 I = xx xy , (D.20) Iyx Iyy 2 ∂y∂x



D.2 Taylor Series Expansion



∂y



with Ixy = Iyx . The elements of HI are 2D, scalar-valued functions over x, y and thus scalar fields again. Evaluating the Hessian matrix at a particular point x˙ yields the values of the second partial derivatives of I at that position, i.e., !   ∂2I ∂2I ˙ ˙ ( x) ( x) ˙ Ixy (x) ˙ Ixx (x) 2 ∂x ∂x∂y ˙ = ∂2I HI (x) = , (D.21) ∂2I ˙ Iyy (x) ˙ Ixy (x) ˙ ˙ ∂y∂x (x) ∂y 2 (x) that is, a matrix with scalar-valued elements.



D.2 Taylor Series Expansion A Taylor series representation of a function is a sum of terms that are based on the derivatives of the function at a particular point. Taylor series expansion about some point can be used to estimate the function’s value in the vicinity of that point. D.2.1 Single-Variable Functions The Taylor series expansion (of degree d) of a single-variable function f : R 7→ R about some reference point a is 2



f (x) = f (a) + f ′ (a)·(x−a) + f ′′ (a)· (x−a) + ··· 2 · · · + f (d) (a)· (x−a) + Rd d! d



= f (a) + =



d X



d X



f (i) (a)· (x−a) + Rd i! i



(D.22) (D.23)



i=1



f (i) (a)· (x−a) + Rd , i! i



(D.24)



i=0



where Rd denotes a residual term.2 This means that if the value f (a) and the first d derivatives f ′ (a), f ′′ (a), . . . , f (d) (a) exist and are known at some position a, the value of f at another point x can be estimated (up to the residual Rd ) only from the values at point a, without actually evaluating f (x). Omitting the residual Rd , the result is a polynomial approximation function f˜(x) =



d X



f (i) (a)· (x−a) ≈ f (x), i! i



(D.25)



i=0



2



Note that f (0) = f , f (1) = f ′ , f (2) = f ′′ etc., and 1! = 1.



841



Appendix D Elements from Calculus



whose accuracy depends upon d and the distance x−a. For example, the frequently-used second-order (i.e., d = 2) Taylor approximation of a single-variable function f (x) is f˜(x) = f (a) + f ′ (a)·(x−a) +



f ′′ (a) 2 ·(x



− a).



(D.26)



D.2.2 Multi-Variable Functions In general, for a real-valued function of n variables, i.e., f : Rn 7→ R, the full Taylor series expansion about a reference point a = (a0 , . . . , an−1 )⊺ is f (x) = f (x0 , . . . , xn−1 ) = ∞ ∞ X X  ∂ i0 f (a) + ··· i0 · · · =



i0 =1 in−1 =1 ∞ ∞ X X



···



i1 =0



in =0







∂x0



∂ i0 i ∂x00



···



(D.27) ∂



in−1 i



n−1 ∂xn−1















in−1 in−1



∂xn−1



f (a)·



(x0 −a0 )i0 ···(xn−1 −an−1 ) i1 !···in !



f (a)·



(x0 −a0 )i0 ···(xn−1 −an−1 ) i0 !···in−1 !



in−1



in−1



.



In this equation, the term 



∂ i0 i ∂x00







···



in−1 in−1



∂xn−1







f (a)



(D.28)



is the value of the function f , after applying a sequence of n partial ∂i derivatives, at position a. The operator ∂x i denotes the i-th partial k derivative on the variable xk . To formulate Eqn. (D.27) in a more compact fashion, we define the index vector i = (i0 , i1 , . . . , in−1 ),



(D.29)



(with ik ∈ N0 and thus i ∈ Nn0 ), and the associated operations i! = i0 ! · i1 ! · . . . · in−1 !, i



i



i



n−1 xi = x00 · x11 · . . . · xn−1 ,



(D.30)



Σi = i0 + i1 + . . . + in−1 . As a shorthand notation for the combined partial derivative operator in Eqn. (D.28) we define Di :=







i0



i ∂x00







i1



i ∂x11



···







in−1 in−1



∂xn−1



=







i0 +i1 +...+in−1



in−1 i i ∂x00 ∂x11 ··· ∂xn−1



.



(D.31)



With these definitions, the full Taylor expansion of a multi-variable function about a point a, as given in Eqn. (D.27), can be elegantly written in the form f (x) =



X



i ∈ Nn 0



842



Dif (a)·



(x−a)i . i!



(D.32)



Note that Dif is again a n-dimensional function Rn 7→ R, and thus Dif (a) in Eqn. (D.32) is the scalar quantity obtained by evaluating the function Dif at the n-dimensional point a.



To obtain a Taylor approximation of order d, the sum of the D.2 Taylor Series indices i1 , . . . , in is limited to d, that is, the summation is constrained Expansion to index vectors i, with Σi ≤ d. The resulting formulation, f (x) ≈



X



(x−a)i , i!



Dif (a)·



i ∈ Nn 0 Σi≤d



(D.33)



is analogous to the 1D case in Eqn. (D.25). Example: Two-Variable (2D) Function This example demonstrates the second-order (d = 2) Taylor expansion of a 2D (n = 2) function f : R2 7→ R about a point a = (xa , ya ). By inserting into Eqn. (D.32), we get f (x, y) ≈



 i 1 x−xa D f (xa , ya )· · i! y−ya 2



X



i



i ∈ N0 Σi≤2



X



=



0≤i,j≤2 (i+j)≤2



∂ i+j f ∂xi ∂y j (xa , ya )



·



(x−xa )i ·(y−ya )j . i!·j !



(D.34)



(D.35)



Since d = 2, the six permissible index vectors i = (i, j), with Σi ≤ 2, are (0, 0), (1, 0), (0, 1), (1, 1), (2, 0), and (0, 2). Inserting into Eqn. (D.35), we obtain the corresponding Taylor approximation at position x˙ = (x, ˙ y) ˙ as f (x, y) ≈



∂0f ∂x0 ∂y 0 (xa , ya ) 1



·



(x−xa )0 ·(y−ya )0 1·1 1



+



∂ f ∂x1 ∂y 0 (xa , ya )



·



(x−xa ) ·(y−ya ) 1·1



+



∂1f ∂x0 ∂y 1 (xa , ya )



·



(x−xa )0 ·(y−ya )1 1·1



+



∂2f ∂x1 ∂y 1 (xa , ya )



·



(x−xa )1 ·(y−ya )1 1·1



+



∂2f ∂x2 ∂y 0 (xa , ya )



·



(x−xa )2 ·(y−ya )0 2·1



+ = f (xa , ya ) + +



(D.36)



0



(x−xa )0 ·(y−ya )2 ∂2f ∂x0 ∂y 2 (xa , ya ) · 1·2 ∂f ∂f ∂x (xa , ya )·(x−xa ) + ∂y (xa , ya )·(y−ya )



(D.37)



∂2f ∂x ∂y (xa , ya )·(x−xa )·(y−ya ) 2



2



+ 12 · ∂∂xf2 (xa , ya )·(x−xa )2 + 12 · ∂∂yf2 (xa , ya )·(y−ya )2 . It is assumed that the required derivatives of f exist, i.e., f is differentiable at point (xa , ya ) with respect to x and y at least up to second order. By slightly rearranging Eqn. (D.37) to ∂f f (x, y) ≈ f (xa , ya ) + ∂f ∂x (xa , ya ) · (x−xa ) + ∂y (xa , ya ) · (y−ya ) h 2 ∂2f + 12 · ∂∂xf2 (xa , ya )·(x−xa )2 + 2· ∂x ∂y (xa , ya )·(x−xa )·(y−ya ) i 2 + ∂∂yf2 (xa , ya )·(y−ya )2 (D.38)



we can write the Taylor expansion in matrix-vector form as



843



Appendix D Elements from Calculus



   x−xa ∂f ∂f ˜ f (x, y) ≈ f (x, y) = f (xa , ya ) + ∂x (xa , ya ), ∂y (xa , ya ) · y−ya  2  2     ⊺ ∂ f ∂ f h x−x x−xa i ∂x2 (xa , ya ) ∂x∂y (xa , ya ) a + 12 · · 2 · (D.39) 2 ∂ f y−ya y−ya (x , y ) ∂ f (x , y ) ∂x∂y



a



a



∂y 2



a



a



or, a bit more compactly, as



⊺ ⊺ f˜(x) = f (a) + ∇f,a ·(x−a) + 12 ·(x−a) · Hf,a ·(x−a).



(D.40)



⊺ Here ∇f,a denotes the (transposed) gradient vector of the function f (see Sec. D.1.2), and Hf,a is the 2 × 2 Hessian matrix of f (see Sec. D.1.6), both for the expansion point a, i.e.,  2    ∂ f ∂2f (a) (a) H00 H01 ∂x2 ∂x∂y . Hf,a = = 2 (D.41) 2 ∂ f H10 H11 (a) ∂ f (a) ∂x∂y



∂y 2



If the function f is discrete (e.g., a scalar-valued image I) the required partial derivatives at some grid point a = (ua , va )⊺ can be estimated from its 3 × 3 neighborhood, as discussed in Sec. D.3. Example: Three-Variable (3D) Function



For a 3D function f : R3 7→ R, the second-order Taylor expansion (d = 2) is analogous to the 2D case in Eqns. (D.39–D.40), except that here the positions x = (x, y, z)⊺ and a = (xa , ya , za )⊺ are 3D vectors. The associated (transposed) gradient vector is  ⊺ ∂f ∂f ∇f (a) = ∂x (a), ∂y (a), ∂f (D.42) ∂z (a) , and the Hessian, composed of all second-order partial derivatives at point a, is the 3 × 3 matrix  ∂2f  ∂2f ∂2f ∂x2 (a) ∂x∂y (a) ∂x∂z (a)  2  ∂ f  ∂2f ∂2f Hf,a =  ∂y (D.43) (a) (a) (a) . ∂y 2 ∂y ∂z  ∂x  ∂2f ∂2f ∂2f ∂z ∂x (a) ∂z ∂y (a) ∂z 2 (a) Note that the order of differentiation is not relevant since, for exam∂2 ∂2 ple, ∂x∂y = ∂y∂x , and therefore Hf is always symmetric. This can be easily generalized to the n-dimensional case, though things become considerably more involved for Taylor expansions of higher orders (d > 2). D.2.3 Finding Function Extrema by 2nd-Order Taylor Expansion Extremum of a Single-Variable Function The 2nd-order Taylor expansion of a single-variable function f () was given in Eqn. (D.26). Without any loss of generality we can place the expansion point at a = 0, which simplifies the Taylor approximation of the function to f˜(x) = f (0)+f ′ (0)·x+ f 844



(0) 2 2 ·x



′′



= c0 +c1 ·x+c2 ·x2 ,



(D.44)



with c0 = f (0), c1 = f ′ (0) and c2 = f ′′ (0)/2. This quadratic polyno- D.2 Taylor Series mial has a single extremum where its first derivative (i.e., the slope of Expansion the function), f˜′ (x) = c1 +2 c2 ·x, is zero, which happens at position x ˘=−



c1 f ′ (0) = − ′′ . 2 c2 f (0)



(D.45)



The function value at the extremal position x ˘ is (by substituting x ˘ in Eqn. (D.44)) c2 [f ′ (0)]2 f˜(˘ x) = c0 − 1 = f (0) − . 4 c2 2·f ′′ (0)



(D.46)



This is a maximum if c2 < 0 and a minimum if c2 > 0 (obviously no extremum exist if c2 = 0). For some arbitrary Taylor expansion point a, with c0 = f (a), c1 = f ′ (a) and c2 = f ′′ (a)/2, the extremum position (Eqn. (D.45)) becomes x ˘=a−



c1 f ′ (a) = a − ′′ , 2 c2 f (a)



(D.47)



and the associated minimum or maximum value (Eqn. (D.46)) is [f ′ (a)]2 f˜(˘ x) = f (a) − . 2·f ′′ (a)



(D.48)



Extremum of a Multi -Variable Function To estimate the extremum of a n-dimensional, scalar-valued function f (x) : Rn 7→ R, we again assume that the Taylor expansion is relative to position a = 0 = (0, . . . , 0)⊺ , which simplifies the remaining formulations. The second-order Taylor expansion (see Eqn. (D.40)) for this point can be written as ⊺ ⊺ f˜(x) = f (0) + ∇f,0 ·x + 12 ·x ·Hf,0 ·x,



(D.49)



with the gradient ∇f and the Hessian matrix Hf (both evaluated at position x = 0). The vector of the first derivatives (i.e., the gradient) of f˜ at some position x = (x0 , . . . , xn−1 )⊺ is  ˜  ∂f ∂x0 (x)     ..  = ∇f,0 + 1 · (x⊺ ·Hf,0 )⊺ + Hf,0 ·x . (D.50) f˜′ (x) =  . 2   ∂ f˜ (x) ∂x n−1



Since (x⊺ · Hf )⊺ = Hf⊺ · x, and Hf is symmetric (i.e., Hf = Hf⊺ ), this simplifies to f˜′ (x) = ∇f,0 + 12 ·[Hf,0 ·x + Hf,0 ·x] = ∇f,0 + Hf,0 ·x.



(D.51)



A local maximum or minimum is found where all first derivatives f˜′ vanish, so we need to solve ˘ = 0, ∇f,0 + Hf,0 · x



(D.52) 845



Appendix D Elements from Calculus



˘ . By multiplying both sides with for the unknown extremal position x −1 Hf,0 (assuming that the inverse exists), the solution is −1 ˘ = −Hf,0 x · ∇f,0 .



(D.53)



The extremal value of f˜ is found by replacing x in Eqn. (D.49) with ˘ (calculated in Eqn. (D.53)), i.e., the position x ⊺ ⊺ ˘ + 12 ·x ˘ ·Hf,0 ·x ˘ f˜(˘ x) = f (0) + ∇f,0 ·x



−1 −1 −1 = f (0) + ∇f,0 ·[−Hf,0 ·∇f,0 ] + 12 ·[−Hf,0 ·∇f,0 ] ·Hf,0 ·[−Hf,0 ·∇f,0 ] ⊺







−1 −1 = f (0) − ∇f,0 ·Hf,0 ·∇f,0 + 12 ·[Hf,0 ·∇f,0 ] ·∇f,0 ⊺







−1 −1 = f (0) − ∇f,0 ·Hf,0 ·∇f,0 + 12 ·∇f,0 ·[Hf,0 ] ·∇f,0 ⊺











−1 −1 = f (0) − [∇f,0 ·Hf,0 ·∇f,0 ] + 12 ·[∇f,0 ·Hf,0 ·∇f,0 ] ⊺







−1 = f (0) − 12 ·∇f,0 ·Hf,0 . ⊺



(D.54)



As noted, the above result is for the specific expansion point a = 0.3 For an arbitrary expansion point a, the extremum position is −1 ˘ = a − Hf,a x ·∇f,a ,



(D.55)



and the associated function value is ⊺ −1 f˜(˘ x) = f (a) − 12 ·∇f,a ·Hf,a ·∇f,a .



(D.56)



Classifying Local Extrema and Saddle Points All first derivatives of f˜() are zero at the “critical” (or “stationary”) ˘ , as found in Eqn. (D.55). This may be a either a local minipoint x mum or maximum, but it could also be a so-called saddle point. This ˘ , which can be tested with the use of the second derivatives at point x represent the local curvature along the associated axis or variable. Intuitively one may think that, if all second derivatives are neg˘ , it is a local maximum and, conversely, a local minative at point x imum if all second derivatives are positive. Otherwise (i.e., if the second derivatives have different signs) we have a saddle point. However, this is not a sufficient test. For a function of n ≥ 2 variables, the second-derivative condition generalizes to a test based on the n eigenvalues λi of the function’s ˘ . In particular, if the n Hessian matrix Hf at the critical point x ˘ is a local maximum and if they eigenvalues are all negative, then x ˘ is a local minimum. If some eigenvalues are are all positive, then x positive and some negative, then the point is a saddle point, i.e., ˘ is a relative maximum, all λi < 0 ⇒ x ˘ is a relative minimum, all λi > 0 ⇒ x ˘ is a saddle point. otherwise ⇒ x



(D.57)



In general, the eigenvalues λ0 , . . . , λn−1 of a n × n matrix A are related to its determinant det(A) by 3



846



Note the analogy of Eqns. (D.53, D.54) to the scalar case in Eqns. (D.45, D.46), respectively.



λ0 · λ1 · . . . · λn−1 = det(A).



(D.58) D.3 Estimating Derivatives of Discrete



In the special case of a two-variable (2D) function, the associated Functions Hessian matrix Hf is of size 2 × 2 and therefore has two distinct eigenvalues λ0 , λ1 , with 2 λ0 ·λ1 = det(Hf ) = H00 ·H11 − H01 .



(D.59)



Thus, if det(Hf ) < 0, the eigenvalues λ0 , λ1 are known to have opposite signs and we have a saddle point. If det(Hf ) > 0 then both eigenvalues have the same sign and we have either a relative maximum or minimum, which is easy to decide by checking the sign of either H00 or H11 . Thus, for a 2D function, the decision in Eqn. (D.57) can be reformulated as ( ˘ is a relative maximum, if H00 < 0, x λ0 ·λ1 > 0 ⇒ ˘ is a relative minimum, otherwise, x (D.60) ˘ is a saddle point. λ0 ·λ1 < 0 ⇒ x Note that Hf (and thus λ0 , λ1 ) refers to the Hessian matrix evaluated ˘ . In the (common) case of a 2D quadratic funcat the critical point x tion, the Hessian matrix is constant, i.e., needs not be calculated for specific points (see Sec. E.2.1).



D.3 Estimating Derivatives of Discrete Functions Images are typically discrete functions (i.e., I : Z2 7→ R) and thus not differentiable. The derivatives can nevertheless be estimated by calculating finite differences from the pixel values in a small (3 × 3) neighborhood, which can be expressed as a linear filter or convolution operation (∗). In the following we assume a discrete, real-valued function I(u, v) and the task is to estimate the function’s derivatives at integer positions (u, v) ∈ Z2 . D.3.1 First-order Derivatives The first-order derivatives in x- and y-direction are usually estimated from finite differences in the form4 ∂I ∂x (u, v) ∂I ∂y (u, v)



= Ix (u, v) ≈ 12 ·[I(u+1, v) − I(u−1, v)],



(D.61)



= Iy (u, v) ≈



(D.62)



1 2 ·[I(u, v+1)



− I(u, v−1)].



The calculations in Eqn. (D.62) can be conveniently expressed as a convolution (∗) of the image I with two one-dimensional kernels, i.e.,   −1   Ix ≈ I ∗ 12 · −1 0 1 and Iy ≈ I ∗ 12 · 0  . (D.63) 1 4



This form is called “central difference” approximation, because differences on both sides of the estimation point are considered. There are also single-sided (“forward” or “backward” difference) methods, which are generally less accurate.



847



Appendix D Elements from Calculus



D.3.2 Second-Order Derivatives Similarly, the second-order derivatives in x an y can be approximated as ∂2I ∂x2 (u, v) ∂2I ∂y 2 (u, v)



= Ixx (u, v) ≈ I(u−1, v) − 2·I(u, v) + I(u+1, v),



(D.64)



= Iyy (u, v) ≈ I(u, v−1) − 2·I(u, v) + I(u, v+1).



(D.65)



respectively, which again can be expressed as 1D convolutions,   1   Ixx ≈ I ∗ 1 −2 1 and Iyy ≈ I ∗ −2 , (D.66) 1 respectively. The second-order mixed derivative, Ixy = Iyx ) is typically estimated as Ixy (u, v) ≈ 1 4 ·[I(u−1, v−1)



∂2I ∂x∂y



2



∂ I = ∂y∂x (or



(D.67) + I(u+1, v+1) − I(u−1, v+1) − I(u+1, v−1)],



which can be expressed as a sequence of one-dimensional convolutions (with the first-order derivative kernels in Eqn. (D.63)) or a single twodimensional convolution,     −1 −1     1 1    −1 0 1 0 Ixy = Iyx ≈ 4 ·I ∗ ∗ = 4 ·I ∗ 0  ∗ −1 0 1   1 1 1 0 −1 = 14 ·I ∗  0 0 0 . (D.68) −1 0 1 D.3.3 Alternative Formulations



848



The above methods for estimating derivatives from finite differences are based on second-order Taylor expansion. Their accuracy depends on the smoothness of the underlying continuous function, i.e., the error incurred by ignoring high-order terms (see [214, Sec. 5.1] for a derivation). More accurate estimates are possible by using approximation functions of higher order, which requires the inclusion of more sample values. For example, the following third-order scheme estimates the first, second and mixed derivatives from discrete function samples in a 5 × 5 neighborhood [78]:    I x ≈ I ∗ Hx ⊺ 1 with Hx = Hy = 12 · 1 −8 0 8 −1 , (D.69) I y ≈ I ∗ Hy    Ixx ≈ I ∗ Hxx ⊺ 1 with Hxx = Hyy = 12 · −1 16 −30 16 −1 , (D.70) Iyy ≈ I ∗ Hyy   1 −8 0 8 −1 Ixy ≈ I ∗ Hxy ,  −8 64 0 −64 8    1  0 0 0 0 with Hxy = Hx ∗Hy = 144 · 0 . (D.71)  8 −64 0 64 −8  −1 8 0 −8 1



Appendix E Sub-Pixel Maximum Finding



E.1 Second-Order Interpolation in 1D Given a single-variable (1D), discrete function g : Z 7→ R, it is sometimes useful to locally fit a quadratic (parabolic) function, for example, for precisely locating a maximum or minimum position. For a single-variable quadratic function (i.e., second-order polynomial) f (x) = c0 + c1 ·x + c2 ·x2 = y,



(E.1)



with parameters c0 , c1 , c2 , to pass through a given set of three sample points pi = (xi , yi ), i = 0, 1, 2, means that the following three equations must be satisfied: c0 + c1 ·x0 + c2 ·x20 = y0 , c0 + c1 ·x1 + c2 ·x21 = y1 , c0 + c1 ·x2 + c2 ·x22 = y2 . Written in standard matrix form M·c = y, i.e.,       1 x0 x20 c0 y0 1 x1 x21  · c1  = y1 , 1 x2 x22 c2 y2



(E.2)



(E.3)



the unknown coefficient vector c = (c0 , c1 , c2 )⊺ is directly found as  −1   1 x0 x20 y0 c = M−1 ·y = 1 x1 x21  · y1  , 1 x2 x22 y2



(E.4)



(unless matrix M is non-singular when points pi are collinear) or by solving the linear system in Eqn. (E.3) directly (see Sec. B.8). Example Fitting a quadratic function to the three sample points p0 = (−2, 5),



p1 = (−1, 6),



p2 = (3, −10),



© Springer Nature Switzerland AG 2022 W. Burger and M. J. Burge, Digital Image Processing, Texts in Computer Science, https://doi.org/10.1007/978-3-031-05744-1



849



Appendix E Sub-Pixel Maximum Finding











p0 = (−2, 5) p1 = (−1, 6) p2 = (3, −10)



Fig. E.1 Fitting a continuous quadratic function to three given sample points.



-3



y



p1



p0



-2



5



f (x) 1



-1



2



x



3



-5



-10







p2



the system of equations to solve is (see Eqn. (E.3))  1 −2 4   c0   5  6 1 −1 1 · c1 = , c 1



3 9



with the solution  c0   1 −2 4 −1  c1 = 1 −1 1 · c 2



1



5 6 −10



3 9







=



1 20 ·



(E.5)



−10



2



 −12



30 2 −8 5 3 4 −5 1



 ·



5 6 −10







=







5 −2 −1







.



Thus c0 = 5, c1 = −2, c2 = −1, and the equation of the quadratic fitting function (see Eqn. (E.1)) is f (x) = 5 − 2x − x2 .



(E.6)



The result for this example is shown in Fig. E.1. Locating the Maximum A special situation is when the sample points are positioned at x0 = −1, x1 = 0, and x2 = +1. This is useful, for example, to estimate a continuous extremum position from discrete function values defined on a regular (unit) grid. Again the objective is to fit a quadratic function (as in Eqn. (E.1)) to pass through the points p0 = (−1, y0 ),



p1 = (0, y1 ),



p2 = (1, y2 ).



In this case, Eqn. (E.2) simplifies to y0 = c 0 − c 1 + c 2 ,



y1 = c0 ,



y2 = c0 + c1 + c2 ,



(E.7)



y0 − 2·y1 + y2 . 2



(E.8)



with the solution c 0 = y0 ,



c1 =



y2 − y0 , 2



c2 =



To estimate a local extremum position, we take the first derivative of the quadratic fitting function (Eqn. (E.1)), which is the linear function f ′ (x) = c1 + 2c2 · x, and find the position x ˘ of its (single) root by solving c1 + 2·c2 ·x = 0.



(E.9)



With c1 , c2 taken from Eqn. (E.8), this is



850



y2 − y0 +x ˘ · (y0 − 2·y1 + y2 ) = 0, 2



(E.10)



and thus the extremal position is x ˘ = −0.5 ·



E.2 Subpixel Interpolation in 2D



y2 − y0 . y0 − 2y1 + y2



(E.11)



The corresponding extremal value can then be found by evaluating function f () at position x ˘, i.e., y˘ = f (˘ x) = c0 + c1 ·˘ x + c2 ·˘ x2 ,



(E.12)



with c0 , c1 , c2 as defined in Eqn. (E.8). With the above scheme, we can interpolate any triplet of successive sample values centered around some position u, that is, p0 = (u−1, y0 ),



p1 = (u, y1 ),



p2 = (u+1, y2 ),



with arbitrary values y0 , y1 , y2 . In this case the estimated position of the extremum is simply (from Eqn. (E.11)) x ˘ = u − 0.5 ·



y2 − y0 . y0 − 2y1 + y2



(E.13)



Numeric Example Figure E.2 shows an example with sample points p0 = (−1, −2),



p1 = (0, 7),



p2 = (1, 6).



By Eqn. (E.11) the interpolated maximum is located at x ˘ = 0.4 and the corresponding function value is f (˘ x) = 7.8. 10



y



p1 p0 = (−1, −2) p1 = (0, 7) p2 = (1, 6)



f (x)



-1.5



-1.0 ●



p0











5



0.5



-0.5



x ˘







p2



1.0



1.5



x



Fig. E.2 Fitting a quadratic function to three reference points at positions xi = −1, 0, 1. The interpolation function has its maximum at position x ˘ = 0.4 (red dot).



-5 -10



E.2 Subpixel Interpolation in 2D This is a frequent task, particularly for estimating the position of local function extrema with sub-pixel accuracy. In the following, we assume that the task is to find a local maximum, but the technique is the same when searching for a minimum. In a typical situation we are given a 3 × 3 array of discrete sample values si , as shown below. The general idea is (analogous to the 1D case described in Sec. E.1) to fit a continuous 2D approximation function I˜ to these samples and then calculate the position and value of its local maximum. Typically, a 2D quadratic function is used, because it exhibits a single extremum that can be located in closed form. However, a quadratic function has too few degrees of freedom (6) to fit all 9 sample values. There are several ways to deal with this problem which are described below:



851



Appendix E Sub-Pixel Maximum Finding



1. Second-order Taylor interpolation: Estimate the first- and second-order derivatives (i.e., gradient and Hessian matrix) from the discrete sample values and use a second-order Taylor expansion as the approximation function. As shown in Sec. E.2.2, the resulting function passes only through the five on-axis samples and only approximates the diagonal samples. 2. Least-squares quadratic interpolation: Fit a quadratic function to the sample values by minimizing least-squares error, as described in Sec. E.2.3. In general, the resulting approximation function does not pass through any of the sample values. 3. Quartic interpolation: Fit a quartic (i.e., 4th-order) polynomial which has 9 degrees of freedom and thus passes through all samples, as described in Sec. E.2.4. However, such a polynomial tends to behave “wildly” outside a restricted domain, exhibits multiple extrema and requires iterative techniques for locating them. For the remainder of this section we assume a sequence of 9 discrete sample values s0 , . . . , s8 ∈ R, taken from a 3 × 3 neighborhood (e.g., from a grayscale image I) and arranged in the following order: −1



0



+1



−1



s4



s3



s2



0



s5



s0



s1



+1



s6



s7



s8



=



8



11



7



15



16



9



14



12



10



(E.14)



For simplicity, the center value s0 is assumed to be centered at position (0, 0). The concrete sample values on the right side of Eqn. (E.14) are used in the numerical examples below. Note that the center sample (s0 = 16) is a local maximum. E.2.1 Quadratic Functions in 2D A 2D quadratic function in its most general form is a second-degree polynomial, f (x, y) = c0 + c1 ·x + c2 ·y + c3 ·x2 + c4 ·y 2 + c5 ·xy ,



(E.15)



which is fully specified by the six coefficients c0 , . . . , c5 . Since the function has only six degrees of freedom it is in general not possible to fit exactly to the nine sample values s0 , . . . , s9 . To be used further below, we note that the gradient and the Hessian of f at some position x = (x, y) are     c1 +2c3 x+c5 y 2c3 c5 ∇f (x) = and Hf (x) = , (E.16) c2 +c5 x+2c4 y c5 2c4 respectively. Note that the Hessian matrix of this quadratic function is constant, i.e., independent of x, y. Extremum Position and Value 852



The functions (single) extremum position can be found by solving ∇f (x) = 0, which resolves to



 c2 c5 − 2c1 c4 ˘= x · , 4c3 c4 − c25 c1 c5 − 2c2 c3 1







(E.17) E.2 Subpixel Interpolation in 2D



−1 ˘ = −Hf,0 or, alternatively, by Taylor expansion as x · ∇f,0 (see Eqn. (D.53)). The associated (min. or. max.) function value is



f (˘ x) =



4c0 c3 c4 + c1 c2 c5 − c22 c3 − c21 c4 − c0 c25 , 4c3 c4 − c25



(E.18)



which also could be calculated by Taylor expansion (see Eqn. (D.54)) ⊺ −1 as f (˘ x) = f (0) − 12 ·∇f,0 ·Hf,0 ·∇f,0 . Relative Minimum or Maximum? ˘ found in Eqn. (E.17) may be a minimum The “stationary point” x or maximum, or possibly a saddle point, which happens if the de˘ is negative (see Eqn. terminant of the Hessian matrix for position x (D.60)). As noted, the Hessian matrix of a quadratic function is constant (see Eqn. (E.16)), with H00 = 2c3 ,



H11 = 2c4 ,



H01 = c5,



and thus the condition for a saddle-point is 2 det(Hf ) = H00 ·H11 − H01 = 4c3 c4 − c25 < 0 .



(E.19)



Given the quadratic function’s coefficients c0 , . . . , c5 , the classification can be written as ( relative maximum, if c3 < 0, 2 4c3 c4 − c5 > 0 ⇒ (E.20) relative minimum, otherwise. In general, the situation is undetermined if det(Hf ) = 0, noting that this determinant also appears in the denominators of Eqn. (E.17) and (E.18), which may not be zero either. E.2.2 Method A: Second-Order Taylor Interpolation To fit a quadratic function f (x, y) (see Eqn. (E.15)) by second-order ˜ and the Hessian Taylor expansion, we first estimate the gradient ∇ ˜ matrix H from the discrete samples s0 , . . . , s8 (centered at (u, v) = 0, i.e., at sample s0 ) as     d˜x s1 −s5 1 ˜ ∇ = ˜ = 2· and (E.21) s7 −s3 dy     1 ˜ 00 H ˜ 01 ·(s4 +s8 −s2 −s6 ) H s1 −2s0 +s5 4 ˜ H= ˜ ˜ = 1 , s3 −2s0 +s7 H10 H11 4 ·(s4 +s8 −s2 −s6 ) respectively (see Sec. D.3). From Eqn. (E.16), the gradient and Hessian of the unknown quadratic function f (x) at point x = (0, 0) are     c1 2c3 c5 ∇f (0) = and Hf (0) = . (E.22) c2 c5 2c4 The polynomial coefficients can now be derived directly. By letting



853



Appendix E Sub-Pixel Maximum Finding Alg. E.1 Local maximum localization by second-order Taylor interpolation.



1: FindMaxQuadraticTaylor(s) Input: s = (s0 , . . . , s8 ), a vector of 9 discrete sample values taken from a 3 × 3 neighborhood (as defined in Eqn. (E.14)). The center sample (s0 ) is assumed to be a local maximum. Returns the estimated maximum position (˘ x, y˘) and value z˘. 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14:



c0 ← s0 c1 ← (s1 −s5 )/2 ▷ see Eqn. (E.23) c2 ← (s7 −s3 )/2 c3 ← (s1 −2s0 +s5 )/2 c4 ← (s3 −2s0 +s7 )/2 c5 ← (−s2 +s4 −s6 +s8 )/4 d ← 4c3 c4 − c25 ▷ determinant of Hessian matrix H if (d < ϵ) ∨ (c3 ≥ 0) then ▷ undefined or not a maximum return nil x ˘ ← (c2 c5 − 2c1 c4 )/d ▷ see Eqn. (E.17) y˘ ← (c1 c5 − 2c2 c3 )/d z˘ ← c0 +c1 x ˘ +c2 y˘+c3 x ˘2 +c4 y˘2 +c5 x ˘y˘ ▷ z˘ = f (˘ x) return (˘ x, y˘, z˘)



    c0 s0 1 c     1   2 ·(s1 −s5 )     1 ·(s7 −s3 ) c 2    2  = 1  c3   2 ·(s1 −2s0 +s5 )     1  c4   2 ·(s3 −2s0 +s7 )  1 c5 4 ·(s4 +s8 −s2 −s6 )



(E.23)



we make the quadratic function f (x, y) pass through the center sample (s0 ) and let it have the same 1st- and 2nd-order derivatives at point (0, 0) as were estimated from the discrete sample data. Note that f (x, y) fits exactly to all on-axis samples s0 , s1 , s3 , s5 , s7 but not to the diagonal samples s2 , s4 , s6 , s8 . A detailed summary of this procedure is given in Alg. E.1. Numeric Example With the sample data in Eqn. (E.14), the estimate for the gradient at the center position is   ˜ = 1 · 9−15 = −3 ∇ 12−11 0.5 2 and the estimated Hessian matrix is   9−2·16+15 0.25·(8−14−7+10) ˜ = H = 0.25·(8−14−7+10) 11−2·16+12



−8 −0.75 −0.75 −9







.



After calculating the coefficients c0 , . . . , c5 (by Eqn. (E.23)), the resulting interpolation function is f (x) = 16 − 3·x + 0.5·y − 4·x2 − 4.5·y 2 − 0.75·xy.



854



The extremal point of this function is (by Eqn. (E.17))   1 ˘ = 71.4375 x · −27.375 = −0.3832 0.0875 , 6.25



and the associated function value (e.g., by Eqn. (E.18)) is



(E.24)



E.2 Subpixel Interpolation in 2D



s6 f (x, y) s5



s7



Fig. E.3 Two-dimensional, second-order Taylor interpolation example. The blue dots represent the discrete image samples s0 , . . . , s8 taken from a 3 × 3 neighborhood and centered at position (0, 0). The center sample (s0 ) is a local maximum (see Eqn. (E.14) for the concrete values). The surface represents the continuous approximation f (x, y) obtained by second-order Taylor expansion (see Eqn. (E.24)). It has a maximum at the marked ˘ = (−0.3832, 0.0875). position x



s8



s4



s1



s3 ˘ x



y



s2



x



f (˘ x) =



1185.63 71.4375



= 16.5967.



The determinant of the Hessian (i.e., the denominator 71.4375 in the ˘ is not a saddle point. Since above expression) is positive an thus x H00 = 2c3 = − 8 is negative, we have a relative maximum (see Eqn. (E.20)). Figure E.3 illustrates this example. As noted, the approximation function passes exactly through the five on-axes sample points but not through the corner samples (although these are considered for calculating the mixed derivatives of the Hessian matrix). E.2.3 Method B: Least-Squares Quadratic Interpolation In this case, the coefficients c0 , . . . , c5 of the quadratic interpolation function f (x, y) (see Eqn. (E.15)) are determined by calculating a least-squares fit to the nine discrete samples s0 , . . . , s8 taken at points x0 , . . . , x8 . Ideally, the function f should fit each sample value exactly, i.e., f (xi , yi ) = si , but this is not possible with f having only 6 degrees of freedom. To find the function coefficients, we set up one equation for each of the nine sample values si in the form f (xi , yi ) = (1, xi , yi , x2i , yi2 , xi yi ) · (c0 , c1 , c2 , c3 , c4 , c5 ) = si , (E.25) ⊺



with (xi , yi ) ∈ [−1, 1]2 , which equations  1 0 0 0 1 1 0 1   1 1 −1 1   1 0 −1 0   1 −1 −1 1   1 −1 0 1   1 −1 1 1  1 0 1 0 1 1 1 1



together create the system of linear    0 0 s0   s1  0 0  c0     1 −1   c1  s2     1 0    s3   c2      1 1  ·  = s4  . c3  s5  0 0  c4    s6  1 −1   c5    s7  1 0 1 1 s8



(E.26)



This 9 × 6 system of equations, M·c = s, is over-determined (i.e., has more equations than unknowns) and thus has no solution in general.



855



1: FindMaxQuadraticLeastSquares(s) Input: s = (s0 , . . . , s8 ), a vector of 9 discrete sample values taken from a 3×3 neighborhood (as defined in Eqn. (E.14)). The center sample (s0 ) is assumed to be a local maximum. Returns the estimated maximum position (˘ x, y˘) and value z˘.



Appendix E Sub-Pixel Maximum Finding Alg. E.2 Local maximum localization by least-squares fitting.



2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14:



c0 ← (5s0 +2s1 −s2 +2s3 −s4 +2s5 −s6 +2s7 −s8 )/9 c1 ← (s1 +s2 −s4 −s5 −s6 +s8 )/6 ▷ see Eqn. (E.27) c2 ← (−s2 −s3 −s4 +s6 +s7 +s8 )/6 c3 ← (−2s0 +s1 +s2 −2s3 +s4 +s5 +s6 −2s7 +s8 )/6 c4 ← (−2s0 −2s1 +s2 +s3 +s4 −2s5 +s6 +s7 +s8 )/6 c5 ← (−s2 +s4 −s6 +s8 )/4 d ← 4c3 c4 − c25 ▷ determinant of Hessian matrix H if (d < ϵ) ∨ (c3 ≥ 0) then ▷ undefined or not a maximum return nil x ˘ ← (c2 c5 − 2c1 c4 )/d ▷ see Eqn. (E.17) y˘ ← (c1 c5 − 2c2 c3 )/d z˘ ← c0 +c1 x ˘ +c2 y˘+c3 x ˘2 +c4 y˘2 +c5 x ˘y˘ ▷ z˘ = f (˘ x) return (˘ x, y˘, z˘)



However, a least-squares solution1 can be found, typically with a numerical solver (see also Sec. B.8.2). In this case, it is even possible to write the result in closed form, i.e.,   1  c0 9 ·(5s0 +2s1 −s2 +2s3 −s4 +2s5 −s6 +2s7 −s8 ) 1 c     1   6 ·(s1 +s2 −s4 −s5 −s6 +s8 )     1 ·(−s −s −s +s +s +s ) c2    2 3 4 6 7 8 6  = 1 . c3   6 ·(−2s0 +s1 +s2 −2s3 +s4 +s5 +s6 −2s7 +s8 )    1  c4   6 ·(−2s0 −2s1 +s2 +s3 +s4 −2s5 +s6 +s7 +s8 )  1 c5 4 ·(s4 +s8 −s2 −s6 )



(E.27)



The complete procedure is summarized in Alg. E.2 Numeric Example Again using the sample data given in Eqn. (E.14), the resulting coefficient vector is 5 −5 −3 c = (15, −11 6 , 3 , 2 , −3, 4 ) , ⊺



i.e., the interpolating function is2 ˙ + 1.6·y ˙ − 2.5·x2 − 3·y 2 − 0.75·xy. f (x) = 15 − 1.83·x



(E.28)



The maximum position and value (by Eqns. (E.17) and (E.18)) are ˘= x



−0.4161 0.3298







and



f (˘ x) = 15.6563.



(E.29)



This interpolation function is shown in Fig. E.4. 1 2



856



The least-squares solution minimizes the residual ∥M·c − s∥2 . Note that this function is quite different to the result of second-order Taylor expansion in Eqn. (E.24).



s6



E.2 Subpixel Interpolation in 2D



f (x, y) s7



s5



Fig. E.4 Quadratic interpolation function obtained by least-squares fitting. The surface represents the continuous approximation f (x, y) (see Eqn. (E.28)). It has a maximum at the marked ˘ = (−0.416, 0.3298). position x Note that none of the discrete samples (si ) are exactly on the function surface.



s8



s4



s3



s1 ˘ x



s2



y



x



E.2.4 Quartic Interpolation To exactly fit a polynomial to the set of nine sample values s0 , . . . , s8 a quadratic function is not sufficient—a function of higher order (i.e., with sufficient degrees of freedom) is required. A “quartic” polynomial of the form f (x) = c0 + c1 x + c2 y + c3 x2 + c4 y 2 + c5 xy + c6 x2 y + c7 xy 2 + c8 x2 y 2



(E.30)



has nine coefficients c = (c0 , . . . , c8 )⊺ and should thus allow an exact fit. Analogous to Eqn. (E.25), we get one equation per sample,  f (xi , yi ) = 1, xi , yi , x2i , yi2 , xi yi , x2i yi , xi yi2 , x2i yi2 ·c = si , (E.31) and again stack them to a system of equations M·c = s, i.e.,       1 0 0 0 0 0 0 0 0 c0 s0  1 1 0 1 0 0 0 0 0  c1  s1         1 1 −1 1 1 −1 −1 1 1  c2  s2         1 0 −1 0 1 0 0 0 0  c3  s3         1 −1 −1 1 1 1 −1 −1 1  · c4  = s4 . (E.32)        1 −1 0 1 0 0 0 0 0  c5  s5         1 −1 1 1 1 −1 1 −1 1  c6  s6         1 0 1 0 1 0 0 0 0  c7  s7  1 1 1 1 1 1 1 1 1 c8 s8 This 9 × 9 system is not over-determined but has a unique solution,     c0 s0 1 c1    2 ·(s1 −s5 )     1  c   ·(s −s ) 7 3  2   2 c    1 ·(s −2s +s )  3   1 0 5 2     1 ·(s −2s +s ) c4  =  . (E.33) 3 0 7 2     1 c5    ·(s +s −s −s ) 4 8 2 6 4     1 c6    4 ·(−s2 +2s3 −s4 +s6 −2s7 +s8 )     1 c7    4 ·(−2s1 +s2 −s4 +2s5 −s6 +s8 ) 1 c8 4 ·(4s0 −2s1 +s2 −2s3 +s4 −2s5 +s6 −2s7 +s8 )



857



Appendix E Sub-Pixel Maximum Finding



Note that the first six coefficients (c0 , . . . , c5 ) are the same as for the 2nd-order Taylor polynomial in Eqn. (E.23). The gradient (vector of first derivatives) of f is   c1 +2c3 x+c5 y+2c6 xy+c7 y 2 +2c8 xy 2 ∇f (x) = , (E.34) c2 +c5 x+2c4 y+2c7 xy+c6 x2 +2c8 x2 y and the associated Hessian matrix (i.e., the second derivatives) is   2(c3 +c6 y+c8 y 2 ) c5 +2(c6 x+c7 y)+4c8 xy Hf (x) = . (E.35) c5 +2(c6 x+c7 y)+4c8 xy 2(c4 +c7 x+c8 x2 ) In contrast to the quadratic function (see Eqn. (E.16)), the Hessian of the quartic function is not constant but varies with x. However, for x = 0, the gradient and Hessian reduce to     c 2c3 c5 ∇f (0) = 1 and Hf (0) = , (E.36) c2 c5 2c4 respectively, which are the same as for the quadratic function (see Eqn. (E.22)). Thus locally (near the center position) the two functions behave similarly. Iterative Maximum-Finding Unlike a quadratic function, a quartic polynomial generally has multiple extrema which cannot be found in closed form. A local extremum is found where the gradient of the function vanishes, i.e., where ∇f (x) = 0. Thus the task is to find a local root of ∇f (x), which in this case is a quadratic function (R2 7→ R2 , see Eqn. (E.34)). Such roots can be found, for example, with the iterative (Newton) scheme 3 (n) x(n+1) ← x(n) − H−1 ) · ∇f (x(n) ). f (x



(E.37)



Assuming that the extremum is somewhere near the origin, x(0) = 0 is taken as the starting position. Typically only a few iterations are required for convergence. Note, however, that this scheme does not always converge, i.e., may fail to find a local maximum under certain conditions.4 In the above formulation, the gradient and the inverse Hessian matrix are evaluated repeatedly for varying positions x(n) . The latter can also be calculated in closed form from the coefficients c as H−1 f (x) = · = with



1 4·[c4 +x(c7 +c8 x)]·[c3 +y(c6 +c8 y)]−(c5 +2c6 x+2c7 y+4c8 xy)2







2·[c4 +x(c7 +c8 x)]



−(c5 +2c6 x+2c7 y+4c8 xy)



−(c5 +2c6 x+2c7 y+4c8 xy)







2·[c3 +y(c6 +c8 y)]







1 H11 −H01 · , 2 −H01 H00 H00 ·H11 − H01







(E.38) (E.39)



H00 = 2·[c3 + y(c6 + c8 y)] , H11 = 2·[c4 + x(c7 + c8 x)] ,



(E.40)



H01 = c5 + 2c6 x + 2c7 y + 4c8 xy . 3 4



858



See also Eqn. (D.53). This situation can at least be detected by limiting the number of iterations and restricting the position x to a fixed search window (see Alg. E.3).



1: FindMaxQuartic(s, N, δmax , rmax ) Input: s = (s0 , . . . , s8 ), a vector of 9 discrete sample values taken from a 3 × 3 neighborhood (as defined in Eqn. (E.14), the center sample (s0 ) is assumed to be a local maximum); N , the maximum number of iterations; δmax , the maximum position shift to stop; rmax , the search limit (distance from the origin). Returns the estimated maximum position (˘ x, y˘) and value z˘. 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12:



c0 ← s0 c1 ← (s1 −s5 )/2 ▷ see Eqn. (E.33) c2 ← (s7 −s3 )/2 c3 ← (s1 −2s0 +s5 )/2 c4 ← (s3 −2s0 +s7 )/2 c5 ← (−s2 +s4 −s6 +s8 )/4 c6 ← (−s2 +2s3 −s4 +s6 −2s7 +s8 )/4 c7 ← (−2s1 +s2 −s4 +2s5 −s6 +s8 )/4 c8 ← (4s0 −2s1 +s2 −2s3 +s4 −2s5 +s6 −2s7 +s8 )/4 c ← (c0 , . . . , c8 ) fc (x, y) := c0 + c1 x + c2 y + c3 x2 + c4 y 2 + c5 xy + c6 x2 y + c7 xy 2 + c8 x2 y 2



13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24:



Search for the local function maximum, starting at x = 0: done ← false n←0 x←0 while ¬done ∧ (n < N ) ∧ (∥x∥ < rmax ) do ∇ ← GetGradient(x) H ← GetHessian(x) x′ ← x − H−1 ·∇ ▷ next position if ∥x − x′ ∥ < δmax then ▷ no change, stop done ← true else ▷ continue x ← x′ n←n+1



25: 26: 27: 28: 29: 30: 31:



isMax ← (det(H) > 0) ∧ (H00 < 0) ▷ see Eqn. (D.60) if done ∧ isMax then (˘ x, y˘) ← x z˘ ← fc (˘ x, y˘) ▷ get max. function value (see line 12) return (˘ x, y˘, z˘) else ▷ search did not converge or not a maximum return nil



E.2 Subpixel Interpolation in 2D Alg. E.3 Local maximum localization by quartic interpolation. The iterative search continues until (a) the maximum number of iterations N (typ. 20) is exceeded, (b) the position change is below δmax (typ. 10−6 ), or (c) the current position was outside the maximum search radius rmax (typ. 1.0). ϵ is a small positive constant (typ. 10−6 ).



32: GetGradient(c, x) 33: dx ← c1 +2c3 x+c5 y+2c6 xy+c7 y 2 +2c8 xy 2 34: dx ← c2 +c5 x+2c4 y+2c7 xy+c6 x2 +2c8 x2 y 35: return (dx , dy ) 36: GetHessian(c, x) 37: H00 ← 2·[c3 + y(c6 + c8 y)] 38: H11 ← 2·[c4 + x(c7 + c8 x)] 39: H01 ← c 5 + 2c6 x +  2c7 y + 4c8 xy 40:



return



H00 H01 H01 H11



859



Appendix E Sub-Pixel Maximum Finding



Fig. E.5 Quartic interpolation function obtained by exact fitting of 9 sample values s0 , . . . , s8 . The function f (x, y) (see Eqn. (E.30)) has a relative maximum at the marked posi˘ = (−0.4057, 0.1129). tion x



s6 f (x, y) s5



s7



s8



s4



s1



s3 ˘ x



y



s2



x Fig. E.6 Quadratic vs. quartic interpolation functions. The quadratic function (a) has a single global maximum and is well-behaved over the entire domain while the quartic function (b) has multiple extrema and shows “wild” behavior outside a small range.



Quadratic



Quartic



f (x, y)



f (x, y)



y x



y x



(a)



(b)



2 Note that the denominator H00 ·H11 −H01 is the determinant of Hf (x). The complete procedure is summarized in Alg. E.3, which also shows that the quartic technique is considerably more involved than the much simpler quadratic methods.



Numeric Example For the sample values given in Eqn. (E.14) the coefficients of the quartic polynomial (see Eqn. (E.33)) are ⊺



c = (16, −3, 0.5, −4, −4.5, −0.75, 1.75, 1.75, 2.25) . The associated interpolation function is shown in Fig. E.5. As expected, the function is an exact fit to all nine samples. The local maximum is found (by Eqn. (E.37)) after three iterations as  ˘ = −0.4057 x with f (˘ x) = 16.6204 . (E.41) 0.1129 ,



860



For illustration, Fig. E.6 compares the quadratic and quartic interpolation functions for the same data set but over an enlarged window of size [−2, 2] × [−2, 2]. Although both show similar behaviour near the center position, the quartic function “goes wild” within short range. In addition to the need for iterative maximum search this is one reason why quartic interpolation is usually avoided. In summary it offers no significant advantages over the simpler quadratic methods described earlier.



Appendix F Geometry



F.1 Straight Lines F.1.1 Conversions Between Different Line Equations This section is a supplement primarily to Chapters 10 and 12. Table F.1 gives a compact summary of the conversions described below. Slope-Intercept Form ↔ Algebraic Line Given a slope-intercept line y = kx + d (see Sec. 10.1.1) with parameters k, d, the equivalent algebraic line Ax + By + C = 0 (see Sec. 10.1.3) has the parameters A = k,



B = −1,



C = d.



(F.1)



Any slope/intercept line can be converted to an algebraic line. Reversely, an algebraic line can only be converted to slopeintercept form if it is not vertical (i.e., B ̸= 0); in this case, k = −A/B,



d = −C/B.



(F.2)



Parametric Line ↔ Algebraic Line Given is a parametric line x = s + t · v (see Sec. 10.1.2) with parameters s = (xs , ys )⊺ and v = (xv , yv )⊺ . Since vector v is parallel to the line, (A, B)⊺ must be a vector normal to v, for example,       A 0 −1 −yv = · v = . (F.3) B 1 0 x v



C is found by solving the line equation for point s, which is known to be on the line, i.e., A·xs + B·ys + C = 0, and thus C = −A·xs − B·ys .



(F.4)



Given an algebraic line with parameters (A, B, C), v must be a vector normal to (A, B), e.g.,      A v = 01 −1 · B = −B . (F.5) 0 A



© Springer Nature Switzerland AG 2022 W. Burger and M. J. Burge, Digital Image Processing, Texts in Computer Science, https://doi.org/10.1007/978-3-031-05744-1



861



Appendix F Geometry Table F.1 Converting parameters between different line types. Algebraic line parameters (A, B, C) are generally not assumed to be normalized.



Algebraic Line Ax + By + C = 0







Slope/Intercept Form y = kx + d k = −A/B d = −C/B







Eqns.



A, B, C







A=k B = −1 C =d







k, d



(F.1)



Algebraic Line Ax + By + C = 0







Parametric Line x = s + t·v (t ∈ R)



Eqns.



A, B, C







s =



C A2 +B 2



for B ̸= 0



·(A, B)



v = (−B, A)



(F.2)











(F.5, F.6)



A = −yv B = xv C = yv ·xs − xv ·ys







s, v



(F.3, F.4)



Algebraic Line Ax + By + C = 0







Hessian Normal Form cos(θ)·x + sin(θ)·y = r



Eqns.



A, B, C







A = cos(θ) B = sin(θ) C = −r







θ = ArcTan(B, A) r = −C



(F.7)



θ, r



(F.8)



For the start point s we must find a point on the line. A good candidate is the line point closest to the origin, which is at distance r= √



C A2 + B 2



in the direction (−A, −B). The associated unit vector is   −1 A e= √ · B A2 + B 2 and thus the start point coordinates are



  −C A s = r·e = 2 · . B A + B2



(F.6)



Algebraic Line ↔ Hessian Normal Form



Given an algebraic line with parameters (A, B, C), the parameters (θ, r) of the corresponding line in Hessian normal form (HNF) are r = −C.



θ = ArcTan(B, A),



(F.7)



Reversely, given a Hessian line (θ, r), the parameters of the equivalent algebraic line are, e.g., A = cos(θ),



2



862



C = −r.



B = sin(θ), 2



The resulting line is normalized (i.e., A +B = 1).



(F.8)



Algebraic Line From Two Points



F.1 Straight Lines



The algebraic line that passes through two given points x0 , x1 (with x0 ̸= x1 ) is easily found via the parametric form x = s + t · v, setting s = x0 and v = x1 −x0 . Eqns. (F.3, F.4) can then be used to calculate the (normalized) algebraic line parameters A, B, C as     A y0 −y1 1 B  = ·  x1 −x0  . (F.9) ∥x1 −x0 ∥ C x0 ·y1 −x1 ·y0 F.1.2 Intersections of Algebraic Lines Given two algebraic lines (see Sec. 10.1.3) L1 = ⟨A1 , B1 , C1 ⟩ and L2 = ⟨A2 , B2 , C2 ⟩, the coordinates of their intersection point x = (x, y) must satisfy the equations A1 ·x + B1 ·y + C1 = 0, A2 ·x + B2 ·y + C2 = 0, or, in matrix-vector notation,       A1 B 1 x −C1 · = . A2 B 2 y −C2



(F.10)



(F.11)



This can be solved with standard numerical methods (see Sec. B.8.1) or in closed form (using Cramer’s rule) as     1 x B1 ·C2 − B2 ·C1 = · . (F.12) y A1 ·B2 − A2 ·B1 A2 ·C1 − A1 ·C2 If the above denominator (i.e., the determinant of the matrix in Eqn. (F.11)) is zero, then lines L1 , L2 are parallel and the intersection point is undefined. F.1.3 Intersections of Lines in Hessian Normal Form Given two lines L1 = ⟨θ1 , r1 ⟩ and L2 = ⟨θ2 , r2 ⟩ in Hessian normal form (see Sec. 10.1.4), their intersection point is (from Eqn. (F.12))     1 x sin(θ2 )·r1 − sin(θ1 )·r2 = · y cos(θ1 ) sin(θ2 )− cos(θ2 ) sin(θ1 ) cos(θ1 )·r2 − cos(θ2 )·r1   1 sin(θ2 )·r1 − sin(θ1 )·r2 = · . (F.13) sin(θ2 −θ1 ) cos(θ1 )·r2 − cos(θ2 )·r1 F.1.4 Numeric Line Fitting Examples The following examples are a supplement to Sec. 10.2, intended for illustration and quick testing of implementations. They use the common 2D point set X = {xi } = {( 18 ), ( 45 ), ( 47 ), ( 64 ), ( 94 )}, with n = 5 points.



(F.14) 863



10



Appendix F Geometry Fig. F.1 Linear regression and orthogonal fits for the sample point set in Eqn. (F.14).



8



6



4



2



0



0



2



4



6



linear regression fit



8



10



orthogonal fit



Linear Regression Fitting The parameters of the line y = k·x+d are found by solving the system of linear equations in Eqn. (10.21) on p. 285. The required sums are n−1 X i0



xi = 24,



n−1 X



yi = 28,



i0



n−1 X



x2i = 150,



i0



n−1 X



xi yi = 116,



i0



and the solution is (by Eqn. (10.22)) k = −0.5287,



d = 8.1379



(see Fig. F.1). The objective function F (k, d), i.e., the sum of the squared vertical differences (see Eqn. (10.20)), evaluates to F (k, d) =



n−1 X



(k·xi + d − yi )2 = 3.4713.



i0



The corresponding algebraic line has the parameters (A, B, C) = (−0.5287, −1, 8.1379); its normalized equivalent is A = −0.4674,



B = −0.8840,



C = 7.1942.



Orthogonal Fitting ¯ = (4.800, 5.600) and For the point set in Eqn. (F.14), the centroid is x the associated scatter matrix (see Eqns. (10.27, 10.29)) is     s s 34.800 −18.400 S = xx xy = . sxy syy −18.400 13.200 The eigenvalues and eigenvectors of S are



864



λ = (λ0 , λ1 ) = (45.3354, 2.6646),   0.8678 −0.4969 E = (e0 |e1 ) = , −0.4969 −0.8678



respectively (e0 , e1 are the two column vectors of E). Since λ1 is the F.2 Circles smaller of the two eigenvalues, the associated eigenvector e1 is used to find the line parameters as     A·Σxi +B·Σyi A −0.4969 = e1 = and C = − = 7.2448. B −0.8678 n Thus the optimal fit is obtained as the (normalized) algebraic line A = −0.4969,



B = −0.8678,



C = 7.2448



(see Fig. F.1). The objective function for p = (A, B, C), i.e., the sum of the squared orthogonal point distances di (see Eqn. (10.24)), evaluates to F (p) =



n−1 X



d2i =



i=0



n−1 X



[Axi + Byi + C]2 = 2.6646.



i=0



F.2 Circles This is a supplement to Sec. 11.1. F.2.1 Circle Equations and Conversions Geometric Circle Equation: The implicit (geometric) circle equation is (x−xc )2 + (y−yc )2 = r2 ,



(F.15)



with parameters xc , yc (the center point) and r (the circle radius). Equivalently, Eqn. (F.15) can be written as (x−xc )2 (y−yc )2 + = 1. 2 r r2



(F.16)



Algebraic Circle Equation: The general (algebraic) circle equation is A·(x2 +y 2 ) + B·x + C ·y + D = 0 ,



(F.17)



with parameters A, B, C, D. Parametric Circle Equation: The parametric form yields explicit circle coordinates in dependence of a single scalar variable (angle) t,       x xc cos(t) = + r· , (F.18) y yc sin(t) for t = 0, . . . , 2π.



865



Appendix F Geometry



Geometric → Algebraic Circle Parameters Given geometric circle parameters xc , yc , r, the corresponding algebraic parameters are (e.g.) A=



1 , 2r



B = −2Axc ,



C = −2Ayc ,



D=



B 2 +C 2 −1 . (F.19) 4A



This result is not unique, since any scaled parameter vector s · (A, B, C, D) describes the same circle. Algebraic → Geometric Circle Parameters Given algebraic circle parameters A, B, C, D, the corresponding geometric parameters are unique (for A ̸= 0): √ B C B 2 +C 2 −4AD xc = − , yc = − , r= . (F.20) 2A 2A |2A| F.2.2 Circle From 3 Points Given three points x0 , x1 , x2 , the parameters of the algebraic circle equation (Eqn. (F.17)) are found, e.g., as A = x0 ·(y1 −y2 ) − y0 ·(x1 −x2 ) + x1 ·y2 − x2 ·y1



(F.21)



B = R0 ·(y2 −y1 ) + R1 ·(y0 −y2 ) + R2 ·(y1 −y0 ), C = R0 ·(x1 −x2 ) + R1 ·(x2 −x0 ) + R2 ·(x0 −x1 ), D = R0 ·(x2 ·y1 −x1 ·y2 ) + R1 ·(x0 ·y2 −x2 ·y0 ) + R2 ·(x1 ·y0 −x0 ·y1 ), 2



with Ri = ∥xi ∥ = x2i + yi2 . The three points are collinear (i.e., no finite circle fit exists) if A = 0. Equation (F.20) can be used to convert A, B, C, D to geometric circle parameters.



F.3 Ellipses This is mainly a supplement to Sec. 11.2. F.3.1 Ellipse Equations Geometric Ellipse Equation The geometry of a general ellipse can be described by a center xc = (xc , yc ) two radii ra , rb (with ra ≥ rb ) and the orientation angle θ. Thus any ellipse can be specified by five geometric ellipse parameters p = (xc , yc , ra , rb , θ). If the ellipse is axis-aligned (i.e., θ = 0), its geometric equation is, analogous to a circle (see Eqn. (F.16)), (y−yc )2 (x−xc )2 + = 1. 2 ra rb2



866



(F.22)



For an ellipse with arbitrary orientation θ (see Fig. 11.7, page 325), rotating the point coordinates by −θ to an axis-aligned frame, i.e.,







     cos(−θ) − sin(−θ) x−xc cos(θ) sin(θ) x−xc · = · , sin(−θ) cos(−θ) y−yc − sin(θ) cos(θ) y−yc



F.3 Ellipses



gives the geometric equation for a general ellipse, [(x−xc )·cθ + (y−yc )·sθ ]2 [(y−yc )·cθ − (x−xc )·sθ ]2 + = 1, (F.23) ra2 rb2 with cθ = sin(θ) and sθ = cos(θ). The associated parametric form is (analogous to Eqn. (F.18))        x xc cos(θ) − sin(θ) r · cos(t) = + · a (F.24) y yc sin(θ) cos(θ) rb · sin(t)   xc + ra · cos(θ)· cos(t) − rb · sin(θ)· sin(t) = , (F.25) yc + ra · sin(θ)· cos(t) + rb · cos(θ)· sin(t) for t = 0, . . . , 2π. Algebraic Ellipse Equation The algebraic definition of a general ellipse is, analogous to a circle (cf. Eqn. (F.17)), Ax2 + Bxy + Cy 2 + Dx + Ey + F = 0 ,



(F.26)



with parameters q = (A, B, C, D, E, F ). Equation F.26 actually describes a general 2D conic, i.e., a circle, ellipse, parabola or hyperbola, if at least one of A, B, C is nonzero. It represents an ellipse only if its discriminant is negative, that is, B 2 −4AC < 0 (see Sec. 11.2.1). F.3.2 Converting Between Algebraic and Geometric Parameters Algebraic → Geometric Ellipse Parameters Given a vector of algebraic ellipse parameters q = (A, B, C, D, E, F ), the properties of the equivalent “geometric” ellipse (i.e., the “geometric form parameters”) are calculated as follows. The corresponding center position is xc =



2CD − BE , B 2 − 4AC



yc =



2AE − BD , B 2 − 4AC



(F.27)



the lengths of the semi-axes are 



1/2 2·(AE 2 + CD2 + F B 2 − BDE − 4ACF ) p ra = , (B 2 − 4AC)·[−A − C + (A−C)2 + B 2 ]  1/2 2·(AE 2 + CD2 + F B 2 − BDE − 4ACF ) p rb = , (B 2 − 4AC)·[−A − C − (A−C)2 + B 2 ]



(F.28) (F.29)



respectively, and the orientation of the major axis is1 θ = 0.5 · ArcTan(C −A, −B).



1



See Sec. J.1.5 for the definition of ArcTan(x, y).



(F.30)



867



Appendix F Geometry Alg. F.1 Conversions between algebraic and geometric ellipse parameters. Ellipse axes are switched if necessary in line 12 to make sure that ra ≥ rb .



1: AlgebraicToGeometricEllipse(q) Input: q = (A, B, C, D, E, F ), a vector of algebraic ellipse parameters. Returns a vector p = (xc , yc , ra , rb , θ) of geometric ellipse parameters (see Eqns. (F.27)–(F.30)) or nil if p does not represent an ellipse. 2: 3: 4:



p ← B 2 − 4AC if p ≥ 0 then return nil



5: 6:



q ← (A − C)2 + B 2 s ← 2·(AE 2 + CD2 + F B 2 − BDE − 4ACF )



7: 8: 9: 10: 11: 12: 13:



▷ not an ellipse if B 2 −4AC ≥ 0



p



xc ← (2CD − BE)/p yc ← (2AE p − BD)/p ra ← s / (p · (−A − C + q)) p rb ← s / (p · (−A − C − q)) θ ← 0.5 · ArcTan(C −A, −B) ▷ 0.5 * atan2(-b, c-a)  (xc , yc , ra , rb , θ) if ra ≥ rb p← ▷ switch axes? (xc , yc , rb , ra , θ+ π2 ) otherwise return p ▷ geometric ellipse parameters



14: GeometricToAlgebraicEllipse(p) Input: p = (xc , yc , ra , rb , θ), a vector of geometric ellipse parameters. Returns a vector q = (A, B, C, D, E, F ) of algebraic ellipse parameters (see Eqns. (F.31)–(F.36)). 15: 16: 17: 18: 19: 20: 21: 22:



A ← [ra · sin(θ)]2 + [rb · cos(θ)]2 B ← 2·(rb2 − ra2 ) · sin(θ)· cos(θ) C ← [ra · cos(θ)]2 + [rb · sin(θ)]2 D ← −2·A·xc − B·yc E ← −2·C ·yc − B·xc F ← A·x2c + B·xc ·yc + C ·yc2 − ra2 ·rb2 q ← (A, B, C, D, E, F ) ▷ ∥q∥ = 1 return q ▷ algebraic ellipse parameters



Geometric → Algebraic Ellipse Parameters Given the geometric parameters p = (xc , yc , ra , rb , θ) of an ellipse, the equivalent algebraic parameters A, B, C, D, E, F are found as A = ra2 · sin2 (θ) + rb2 · cos2 (θ), B= C=



(F.31)



2·(rb2 − ra2 ) · sin(θ)· cos(θ), ra2 · cos2 (θ) + rb2 · sin2 (θ),



D = −2·A·xc − B·yc , F =



+ B·xc ·yc +



(F.33) (F.34)



E = −2·C ·yc − B·xc , A·x2c



(F.32)



(F.35) C ·yc2







ra2 ·rb2 .



(F.36)



The resulting algebraic parameter vector q is normalized, i.e., ∥q∥ = 1. Note that the values A, . . . , F are not unique, since any scaled parameter vector describes the same ellipse. See Alg. F.1 for a procedural definition of both conversions. 868



F.3.3 Ellipse From 5 Points



F.3 Ellipses



Any ellipse can be uniquely specified by 5 points. Fitting an ellipse to 5 given points (xi , yi ) means that the ellipse equation (Eqn. (F.26)) must be satisfied for all points. Equation (F.26) has 6 unknowns but is redundant and can be reduced to 5 unknowns (to match the number of equations), since we can arbitrarily scale the coefficients A, . . . , F . For example, we can apply the constraint A+C =1



(F.37)



to replace C by 1−A, i.e., the ellipse equation becomes A·x2i + B·xi ·yi + (1−A)·yi2 + D·xi + E·yi + F = 0



(F.38)



A·(x2i −yi2 ) + B·xi ·yi + D·xi + E·yi + F = −yi2 .



(F.39)



or



This yields five linear equations in five unknowns, which can be written in matrix-vector form as    2  2 2  A x0 −y0 x0 y0 x0 y0 1 −y0 B     .. .. .. .. ..  ·D =  ..  . (F.40)  . . . . .    .  E  2 2 2 x4 −y4 x4 y4 x4 y4 1 −y4 F



This linear system is in the standard form A·x = b and can be solved with common numerical techniques, e.g., by LU decomposition (see Sec. B.8.1). Unless the matrix on the left-hand side (A) is singular (which happens if two or more points are identical), a solution (A, B, C, D, E, F ) exists, with C = 1−A. Note that no ellipse may exist that passes through certain arrangements of five points. In fact, most sets of 5 randomly picked points will not fit an ellipse, but a parabola or hyperbola instead. The curve described by the resulting parameters is an ellipse only if B 2 − 4AC < 0, As an alternative to the exact 5-point fit in Eqn. (F.40), a leastsquares fit (see Sec. 11.2) can be applied to the 5 points to calculate the ellipse parameters. In this case, the residual error (e.g., the sum of absolute algebraic distances for all 5 points) needs to be checked to see if the resulting fit is “exact” or not.



869



Appendix G Statistical Prerequisites



This part summarizes some elementary statistical concepts for vectorvalued data, intended as a supplement mainly to Chapters 9 and 17.



G.1 Mean, Variance, and Covariance For the following definitions we assume a collection X = (x0 , x1 , . . . , xn−1 ) of n vector-valued, m-dimensional measurements (“samples”) ⊺



xi = (xi,0 , xi,1 , . . . , xi,m−1 ) ∈ Rm .



(G.1)



G.1.1 Mean

The m-dimensional sample mean vector is defined as

µ(X) = (µ0, µ1, . . . , µm−1)                                        (G.2)
     = (1/n)·(x0 + x1 + . . . + xn−1) = (1/n)·∑_{i=0}^{n−1} xi.      (G.3)

Geometrically speaking, the vector µ(X) corresponds to the centroid of the points xi in m-dimensional space. Each scalar component µp = x̄p is the mean of the associated component p (also called variate or dimension) over all n samples, that is,

µp = (1/n)·∑_{i=0}^{n−1} xi,p,                                       (G.4)

for p = 0, . . . , m−1.

G.1.2 Variance and Covariance

The covariance quantifies the strength of interaction between a pair of components p, q in the sample set X, defined as



σp,q(X) = (1/n)·∑_{i=0}^{n−1} (xi,p − µp)·(xi,q − µq).               (G.5)

For efficient calculation, this expression can be rewritten in the form

σp,q(X) = (1/n)·[ ∑_{i=0}^{n−1} xi,p·xi,q − (1/n)·( ∑_{i=0}^{n−1} xi,p )·( ∑_{i=0}^{n−1} xi,q ) ],   (G.6)

where the three sums are denoted Sp,q(X), Sp(X), and Sq(X), respectively,



which does not require the explicit (prior) calculation of µp and µq. In the special case of p = q, we get

σp,p(X) = σp²(X) = (1/n)·∑_{i=0}^{n−1} (xi,p − µp)²                             (G.7)
                 = (1/n)·[ ∑_{i=0}^{n−1} xi,p² − (1/n)·( ∑_{i=0}^{n−1} xi,p )² ],   (G.8)



which quantifies the variance within component p. This corresponds to the ordinary (one-dimensional) variance σp²(X) of the n scalar sample values x0,p, x1,p, . . . , xn−1,p (see also Sec. 2.7.1).

G.1.3 Biased vs. Unbiased Variance

If the variance (or covariance) of some population is estimated from a small set of random samples, the results obtained by the formulation given in the previous section are known to be statistically biased.1 The most common form of correcting for this bias is to use the factor 1/(n−1) instead of 1/n in the variance calculations. For example, Eqn. (G.5) would change to

σ̆p,q(X) = (1/(n−1))·∑_{i=0}^{n−1} (xi,p − µp)·(xi,q − µq)            (G.9)



to obtain an unbiased sample variance. In the following (and throughout the text), we ignore the bias issue, since samples are usually pixel values that come in large numbers, and consistently use the factor 1/n for all variance calculations. Note, however, that many software packages2 use the bias-corrected factor 1/(n−1) by default and thus may return different results (which can be easily scaled for comparison).



1 Note that the estimation of the mean by the sample mean (Eqn. (G.3)) is not affected by this bias problem.
2 For example, Apache Commons Math, Matlab, and Mathematica.

G.2 The Covariance Matrix

The covariance matrix for the m-dimensional sample set X is a square matrix of size m × m whose elements are the covariance values σp,q for all component pairs (p, q), that is,

         ⎡ σ0,0     σ0,1    · · ·  σ0,m−1   ⎤
Σ(X) =   ⎢ σ1,0     σ1,1    · · ·  σ1,m−1   ⎥                        (G.10)
         ⎢   ⋮        ⋮      ⋱        ⋮     ⎥
         ⎣ σm−1,0   σm−1,1  · · ·  σm−1,m−1 ⎦

         ⎡ σ0²      σ0,1    · · ·  σ0,m−1   ⎤
     =   ⎢ σ1,0     σ1²     · · ·  σ1,m−1   ⎥                        (G.11)
         ⎢   ⋮        ⋮      ⋱        ⋮     ⎥
         ⎣ σm−1,0   σm−1,1  · · ·  σm−1²    ⎦

Note that each diagonal element of Σ(X) is the ordinary (scalar) variance σp²(X) (see Eqn. (G.7)), for p = 0, . . . , m−1, which can never be negative. All other entries of a covariance matrix may be positive or negative in general. Since σp,q = σq,p, a covariance matrix is always symmetric, with up to (m²+m)/2 unique elements. Thus, any covariance matrix has the important property of being positive semidefinite, which implies that none of its eigenvalues (see Sec. B.5) are negative. The covariance matrix can also be written in the form



Σ(X) = (1/n)·∑_{i=0}^{n−1} [xi − µ(X)]·[xi − µ(X)]⊺,                 (G.12)

where each product term [xi − µ(X)]·[xi − µ(X)]⊺ = [xi − µ(X)] ⊗ [xi − µ(X)], and ⊗ denotes the outer (vector) product, which in this case yields an m × m matrix. The trace (i.e., the sum of the diagonal elements) of the covariance matrix,

σtotal(X) = tr(Σ(X)),                                                (G.13)

is called the total variance of the multivariate sample. Alternatively, the (Frobenius) norm of the covariance matrix Σ(X), defined as

∥Σ(X)∥₂ = ( ∑_{p=0}^{m−1} ∑_{q=0}^{m−1} σp,q² )^{1/2},               (G.14)



can be used to quantify the overall variance in the sample data.

G.2.1 Example

Assume that the sample set X consists of four 3D vectors (i.e., m = 3, n = 4),

x0 = (75, 37, 12)⊺,  x1 = (41, 27, 20)⊺,  x2 = (93, 81, 11)⊺,  x3 = (12, 48, 52)⊺,

representing RGB colors. The resulting sample mean vector (see Eqn. (G.3)) is

µ(X) = (µR, µG, µB)⊺ = (1/4)·(75+41+93+12, 37+27+81+48, 12+20+11+52)⊺
     = (1/4)·(221, 193, 95)⊺ = (55.25, 48.25, 23.75)⊺,

and the associated covariance matrix (Eqn. (G.11)) is

         ⎡  972.188   331.938  −470.438 ⎤
Σ(X) =   ⎢  331.938   412.688   −53.188 ⎥ .
         ⎣ −470.438   −53.188   278.188 ⎦

As expected, this matrix is symmetric and all diagonal elements are non-negative. Note that no sample bias-correction (see Sec. G.1.3) is used in this example. The total variance (see Eqn. (G.13)) of the sample set X is

σtotal(X) = tr(Σ(X)) = 972.188 + 412.688 + 278.188 ≈ 1663.06,

and the Frobenius norm of the covariance matrix (see Eqn. (G.14)) is ∥Σ(X)∥₂ ≈ 1364.36.

G.2.2 Practical Calculation

The calculation of covariance matrices is implemented in almost any software package for statistical analysis or linear algebra. For example, with the Apache Commons Math library this could be accomplished as follows:

import org.apache.commons.math3.stat.correlation.Covariance;
...
double[][] X;   // X[i] is the i-th sample vector of length m
Covariance cov = new Covariance(X, false);   // no bias correction
RealMatrix S = cov.getCovarianceMatrix();
...



Note that bias correction is explicitly turned off.
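For cross-checking the worked example in Sec. G.2.1, a direct implementation of Eqns. (G.3) and (G.5) in plain Java is sketched below. The class and method names are assumptions made for this illustration only.

// Direct implementation of Eqns. (G.3) and (G.5)/(G.11); names are hypothetical.
public class SampleStatistics {

    // sample mean vector, Eqn. (G.3)
    public static double[] mean(double[][] X) {
        int n = X.length, m = X[0].length;
        double[] mu = new double[m];
        for (double[] x : X)
            for (int p = 0; p < m; p++)
                mu[p] += x[p] / n;
        return mu;
    }

    // covariance matrix with factor 1/n (no bias correction), Eqns. (G.5), (G.11)
    public static double[][] covarianceMatrix(double[][] X) {
        int n = X.length, m = X[0].length;
        double[] mu = mean(X);
        double[][] S = new double[m][m];
        for (double[] x : X)
            for (int p = 0; p < m; p++)
                for (int q = 0; q < m; q++)
                    S[p][q] += (x[p] - mu[p]) * (x[q] - mu[q]) / n;
        return S;
    }

    public static void main(String[] args) {
        double[][] X = {{75, 37, 12}, {41, 27, 20}, {93, 81, 11}, {12, 48, 52}};
        double[][] S = covarianceMatrix(X);   // should reproduce the matrix of Sec. G.2.1
        System.out.println(java.util.Arrays.deepToString(S));
    }
}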



G.3 Mahalanobis Distance

The Mahalanobis distance3 [179] is used to measure distances in multi-dimensional distributions. Unlike the Euclidean distance it takes into account the amount of scatter in the distribution and the correlation between features. In particular, the Mahalanobis distance can be used to measure distances in distributions where the individual components differ substantially in scale. Depending on their scale, a few components (or even a single component) may dominate the ordinary (Euclidean) distance outcome and the "smaller" components have no influence whatsoever.

G.3.1 Definition

Given a distribution of m-dimensional samples X = (x0, . . . , xn−1), with xk ∈ R^m, the Mahalanobis distance between two samples xa, xb is defined as

dM(xa, xb) = ∥xa − xb∥M = √[ (xa − xb)⊺·Σ⁻¹·(xa − xb) ],             (G.15)

where Σ is the m × m covariance matrix of the distribution X, as described in Sec. G.2. Note that the expression under the square root in Eqn. (G.15) is the (dot) product of a row vector and a column vector, that is, the result is a non-negative scalar value.

3 https://en.wikipedia.org/wiki/Mahalanobis_distance

The Mahalanobis distance normalizes each feature component to zero mean and unit variance. This makes the distance calculation independent of the scale of the individual components, that is, all components are "treated fairly" even if their ranges differ by many orders of magnitude. In other words, no component can dominate the others even if its absolute magnitude is disproportionally large.

G.3.2 Relation to the Euclidean Distance

Recall that the common Euclidean distance between two points xa, xb ∈ R^m is equivalent to the (L2) norm of the difference vector xa − xb, which we can write in the form

dE(xa, xb) = ∥xa − xb∥₂ = √[ (xa − xb)⊺·(xa − xb) ].                 (G.16)

Note the structural similarity with the definition of the Mahalanobis distance in Eqn. (G.15), the only difference being the missing matrix Σ⁻¹. This becomes even clearer if we explicitly insert the identity matrix I into Eqn. (G.16), that is,

dE(xa, xb) = ∥xa − xb∥₂ = √[ (xa − xb)⊺·I·(xa − xb) ],               (G.17)

which obviously does not change the outcome at all. In Eqn. (G.15), the identity matrix is simply replaced by the matrix Σ⁻¹. Its purpose is to map the difference vectors xa − xb (and thus the involved sample vectors xa, xb) to a transformed (non-uniformly scaled and rotated) space, where the actual distance measurement is performed. Under the Euclidean distance, by contrast, all components contribute equally to the distance measure, without any scaling or other transformation.

G.3.3 Numerical Considerations

For calculating the Mahalanobis distance (Eqn. (G.15)) the inverse of the covariance matrix (see Sec. G.2) is needed. By definition, a covariance matrix Σ is symmetric and its diagonal values are non-negative. Similarly (at least in theory), its inverse Σ⁻¹ should also be symmetric with non-negative diagonal values. This is necessary to ensure that the quantities under the square root in Eqn. (G.15) are always positive. Unfortunately, Σ is often ill-conditioned because of diagonal values that are very small or even zero. In this case, Σ is not positive-definite (as it should be), that is, one or more of its eigenvalues are negative, the inversion becomes numerically unstable, and the resulting Σ⁻¹ is non-symmetric. A simple remedy to this problem is to add a small quantity to the diagonal of the original covariance matrix Σ, that is,

Σ̃ = Σ + ϵ·I,                                                        (G.18)

to enforce positive definiteness, and to use Σ̃⁻¹ in Eqn. (G.15).
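A direct implementation of Eqns. (G.15) and (G.18) with the Apache Commons Math library might look as follows. This is only a sketch; the class name, method names, and the choice of ϵ are assumptions.

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.RealVector;

// Sketch of Eqns. (G.15) and (G.18); names are hypothetical.
public class Mahalanobis {

    private final RealMatrix Sinv;   // inverse of the conditioned covariance matrix

    public Mahalanobis(double[][] Sigma, double epsilon) {
        int m = Sigma.length;
        RealMatrix S = new Array2DRowRealMatrix(Sigma)
                .add(MatrixUtils.createRealIdentityMatrix(m).scalarMultiply(epsilon));  // Eqn. (G.18)
        this.Sinv = new LUDecomposition(S).getSolver().getInverse();
    }

    // Mahalanobis distance between samples a and b, Eqn. (G.15)
    public double distance(double[] a, double[] b) {
        RealVector d = new ArrayRealVector(a).subtract(new ArrayRealVector(b));
        return Math.sqrt(d.dotProduct(Sinv.operate(d)));
    }
}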






A possible alternative is to calculate the eigendecomposition4 of Σ in the form

Σ = V·Λ·V⊺,                                                          (G.19)

where Λ is a diagonal matrix containing the eigenvalues of Σ (which may be zero or negative). From this we create a modified diagonal matrix Λ̃ by substituting all non-positive eigenvalues with a small positive quantity ϵ, that is,

Λ̃i,i = max(Λi,i, ϵ)                                                  (G.20)

(typically ϵ ≈ 10⁻⁶), and finally calculate the modified covariance matrix as

Σ̃ = V·Λ̃·V⊺,                                                         (G.21)

which should always be positive definite. The (symmetric) inverse Σ̃⁻¹ is then used in Eqn. (G.15).

G.3.4 Pre-Mapping Data For Efficient Mahalanobis Matching

Assume that we have a large set of sample vectors ("database") X = (x0, . . . , xn−1) which shall be frequently queried for the instance most similar (i.e., closest) to a given search sample xs. Assuming that the search through X is performed linearly, we would need to calculate dM(xs, xi), using Eqn. (G.15), for all elements xi in X. One way to accelerate the matching is to apply the transformation defined by Σ⁻¹ to the entire data set only once, such that the Euclidean norm alone can be used for the distance calculation.5 For the sake of simplicity we write



d²M(xa, xb) = ∥xa − xb∥²M = ∥y∥²M                                    (G.22)

with the difference vector y = xa − xb, such that the Mahalanobis distance (see Eqn. (G.15)) becomes

∥y∥²M = y⊺·Σ⁻¹·y.                                                    (G.23)



The goal is to find a transformation U, such that we can calculate the Mahalanobis distance from the transformed vectors

ŷ = U·y,                                                             (G.24)

directly, by using the ordinary Euclidean norm ∥·∥₂ instead, i.e., in the form

∥y∥²M = ∥ŷ∥²₂ = ŷ⊺·ŷ = (U·y)⊺·(U·y) = (y⊺·U⊺)·(U·y)                  (G.25)
      = y⊺·U⊺·U·y = y⊺·Σ⁻¹·y.                                        (G.26)

4 See also Sec. B.5.1.
5 This is also called a "whitening transformation", because it turns the data set X into a white noise vector with uncorrelated components and unit variance in each dimension. In other words, the data set is (linearly) transformed to form an m-dimensional, isotropic Gaussian distribution.



We see from Eqn. (G.26) that the transformation matrix U must satisfy

U⊺·U = Σ⁻¹.                                                          (G.27)

Fortunately, since Σ⁻¹ is symmetric and positive definite, such a decomposition of Σ⁻¹ always exists. The standard method for calculating U in Eqn. (G.27) is by the Cholesky decomposition,6 which can factorize any symmetric, positive definite matrix A in the form

A = L·L⊺     or     A = U⊺·U,                                        (G.28)



where L is a lower-triangular matrix or, alternatively, U is an upper-triangular matrix (the second variant is the one we need).7 Since the transformation of the difference vectors y → U·y is a linear operation, the result is the same if we apply the transformation individually to the original vectors, that is,

ŷ = U·y = U·(xa − xb) = U·xa − U·xb.                                 (G.29)

This means that, given the transformation U, we can obtain the Mahalanobis distance between two points xa, xb (as defined in Eqn. (G.15)) by simply calculating the Euclidean distance ∥·∥₂ in the form

dM(xa, xb) = ∥U·(xa − xb)∥₂ = ∥U·xa − U·xb∥₂.                        (G.30)
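A possible implementation of this pre-mapping (Eqns. (G.27)–(G.30)) with the Apache Commons Math library is sketched below, and the matching steps that follow summarize how it would be used. Σ is assumed to be positive definite (e.g., conditioned as described in Sec. G.3.3); the class and method names are assumptions.

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.CholeskyDecomposition;
import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.RealVector;

// Sketch of the whitening transformation U with U^T * U = Sigma^{-1} (Eqn. (G.27)).
// Class and method names are hypothetical.
public class WhiteningTransform {

    private final RealMatrix U;   // upper-triangular matrix from the Cholesky decomposition

    public WhiteningTransform(double[][] Sigma) {
        RealMatrix Sinv = new LUDecomposition(new Array2DRowRealMatrix(Sigma))
                .getSolver().getInverse();
        // Cholesky: Sinv = L * L^T, thus U = L^T satisfies U^T * U = Sinv
        this.U = new CholeskyDecomposition(Sinv).getLT();
    }

    // maps a sample x to its whitened version x^ = U * x (Eqn. (G.29))
    public double[] apply(double[] x) {
        return U.operate(new ArrayRealVector(x)).toArray();
    }

    // Euclidean distance of the whitened samples = Mahalanobis distance (Eqn. (G.30))
    public double distance(double[] xa, double[] xb) {
        RealVector d = new ArrayRealVector(apply(xa)).subtract(new ArrayRealVector(apply(xb)));
        return d.getNorm();
    }
}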



In summary, this suggests the following solution to a large-database Mahalanobis matching problem:

1. Calculate the covariance matrix Σ for the original dataset X = (x0, . . . , xn−1).
2. Condition Σ, such that it is positive definite (see Sec. G.3.3).
3. Find the matrix U, such that U⊺·U = Σ⁻¹ (by Cholesky decomposition of Σ⁻¹).
4. Transform all samples of the original data set X = (x0, . . . , xn−1) to X̂ = (x̂0, . . . , x̂n−1), with x̂k = U·xk. This now becomes the actual "database".
5. Apply the same transformation to the search sample xs, that is, calculate x̂s = U·xs.
6. Find the index l of the best-matching element in X (in terms of the Mahalanobis distance) by calculating the Euclidean (!) distance between the transformed vectors, that is,

   l = argmin_{0≤k<n} ∥x̂s − x̂k∥².

G.4 The Gaussian Distribution

The probability density function of the Gaussian (normal) distribution with mean µ and variance σ² is

p(x) = N(x | µ, σ²) = (1/√(2πσ²)) · e^(−(x−µ)²/(2σ²)),               (G.32)

with σ > 0 and ∫_{−∞}^{+∞} N(x | µ, σ²) dx = 1.                      (G.34)



Thus the area under the probability distribution curve is always one, that is, N() is normalized. The Gaussian function in Eqn. (G.32) has its maximum height (called "mode") at position x = µ, where its value is

p(x=µ) = 1/√(2πσ²).                                                  (G.35)



If a random variable X is normally distributed with mean µ and variance σ², then the result of some linear mapping of the kind X′ = aX + b is again a random variable that is normally distributed, with parameters µ̄ = a·µ + b and σ̄² = a²·σ²:

X ∼ N(µ, σ²)  ⇒  a·X + b ∼ N(a·µ + b, a²·σ²),                        (G.36)

for a, b ∈ R. Moreover, if X1, X2 are statistically independent, normally distributed random variables with means µ1, µ2 and variances σ1², σ2², respectively, then a linear combination of the form a1·X1 + a2·X2 is again normally distributed, with µ12 = a1·µ1 + a2·µ2 and σ12² = a1²·σ1² + a2²·σ2², that is,

(a1·X1 + a2·X2) ∼ N(a1·µ1 + a2·µ2, a1²·σ1² + a2²·σ2²).               (G.37)



G.4.1 Maximum Likelihood Estimation

The probability density function p(x) of a statistical distribution tells us how probable it is to observe the result x for some fixed distribution parameters, such as µ and σ in the case of a normal distribution. If these parameters are unknown and need to be estimated,8 it is interesting to ask the reverse question:

8 For example, for "minimum error thresholding" in Sec. 9.1.6.



How likely are particular parameter values for a given set of empirical observations (assuming a certain type of distribution)?

This is (in a casual sense) what the term "likelihood" stands for. In particular, a distribution's likelihood function quantifies the probability that a given (fixed) set of observations was generated by some varying distribution parameters. Note that the probability of observing the outcome x from the normal distribution,

p(x) = p(x | µ, σ²),



(G.38)



is really a conditional probability, stating how probable it is to observe the value x from a given normal distribution with known parameters µ and σ 2 . Conversely, a likelihood function for the normal distribution could be viewed as a conditional function L(µ, σ 2 | x),



(G.39)



which quantifies the likelihood of (µ, σ 2 ) being the correct distribution parameters for a given observation x. The maximum likelihood method tries to find optimal parameters by maximizing the value of a distribution’s likelihood function L. If we draw two independent9 samples xa , xb that are subjected to the same distribution, their joint probability (i.e., the probability of xa and xb occurring together in the sample) is the product of their individual probabilities, that is, p(xa ∧ xb ) = p(xa ) · p(xb ) .



(G.40)



In general, if we are given a vector of m independent observations X = (x0, x1, . . . , xm−1) from the same distribution, the probability of observing exactly this set of values is

p(X) = p(x0 ∧ x1 ∧ . . . ∧ xm−1) = p(x0)·p(x1)· . . . ·p(xm−1) = ∏_{i=0}^{m−1} p(xi).   (G.41)



Thus, if the sample X originates from a normal distribution N, a suitable likelihood function is

L(µ, σ² | X) = p(X | µ, σ²) = ∏_{i=0}^{m−1} N(xi | µ, σ²)
             = ∏_{i=0}^{m−1} (1/√(2πσ²)) · e^(−(xi−µ)²/(2σ²)).       (G.42)



The parameters (µ̂, σ̂²), for which L(µ, σ² | X) is a maximum, are called the maximum-likelihood estimate for X. Note that it is not necessary for a likelihood function to be a proper (i.e., normalized) probability distribution, since it is only necessary to calculate whether a particular set of distribution parameters is more probable than another. Thus the likelihood function L may be any monotonic function of the corresponding probability p in Eqn. (G.42), in particular its logarithm, which is commonly used to avoid multiplying small values.

9 Although often violated in reality, the assumption of independence is important to keep statistical problems simple and tractable. For example, the values of neighboring image pixels are usually not independent.
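Since the logarithm of Eqn. (G.42) is commonly used in practice, a minimal sketch of such a log-likelihood calculation is given below; the class and method names are hypothetical.

// Log-likelihood of the normal-distribution parameters (mu, sigma^2) for a sample X,
// following Eqn. (G.42); names are hypothetical.
public class LogLikelihood {

    public static double logLikelihood(double[] X, double mu, double sigma2) {
        double ll = 0;
        for (double x : X) {
            ll += -0.5 * Math.log(2 * Math.PI * sigma2)        // log of 1/sqrt(2*pi*sigma^2)
                  - (x - mu) * (x - mu) / (2 * sigma2);        // log of the exponential term
        }
        return ll;   // maximize this over (mu, sigma2) to obtain the ML estimate
    }
}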



G.4.2 Gaussian Mixtures

In practice, probabilistic models are often too complex to be described by a single Gaussian (or other standard) distribution. Without losing the mathematical convenience of Gaussian models, highly complex distributions can be modeled as combinations of multiple Gaussian distributions with different parameters. Such a Gaussian mixture model is a linear superposition of K Gaussian distributions of the form

p(x) = ∑_{k=0}^{K−1} πk · N(x | µk, σk²),                            (G.43)



where the weights ("mixing coefficients") πk express the probability that an event x was generated by the kth component (with ∑_{k=0}^{K−1} πk = 1).10 The interpretation of this mixture model is that there are K independent Gaussian "components" (each with its own parameters µk, σk) that contribute to a common stream of events xi. If a particular value x is observed, it is assumed to be the result of exactly one of the K components, but the identity of that component is unknown. Assume, as a special case, that a probability distribution p(x) is the superposition (mixture) of two Gaussian distributions, that is,

p(x) = πa·N(x | µa, σa²) + πb·N(x | µb, σb²).



(G.44)



Any observed value x is assumed to be generated by either the first component (with µa, σa² and prior probability πa) or the second component (with µb, σb² and prior probability πb). These parameters as well as the prior probabilities are unknown but can be estimated by maximizing the likelihood function L. Note that, in general, the unknown parameters cannot be calculated in closed form but only with numerical methods. For further details and solution techniques see [30, 76, 259], for example.

G.4.3 Creating Gaussian Noise

Synthetic Gaussian noise is often used for testing in image processing, particularly for assessing the quality of smoothing filters. While the generation of pseudo-random values that follow a Gaussian distribution is not a trivial task in general,11 it is readily implemented in Java by the standard class Random. For example, the Java method addGaussianNoise() in Prog. G.1 adds Gaussian noise with zero mean (µ = 0) and standard deviation s (σ) to a grayscale image I of type FloatProcessor (ImageJ). The random values produced by successive calls to the method nextGaussian() in line 12 follow a Gaussian distribution N(0, 1), with mean µ = 0 and variance σ² = 1. As implied by Eqn. (G.36),

X ∼ N(0, 1)  ⇒  (a + s·X) ∼ N(a, s²),                                (G.45)

10 The weight πk is also called the prior probability of the component k.
11 Typically the so-called polar method is used for generating Gaussian random values [161, Sec. 3.4.1].



Prog. G.1 Java method for adding Gaussian noise to an image of type FloatProcessor (ImageJ).

1  import java.util.Random;
2
3  void addGaussianNoise (FloatProcessor I, double s) {
4    final int w = I.getWidth();
5    final int h = I.getHeight();
6
7    Random rg = new Random();   // random generator
8
9    for (int v = 0; v < h; v++) {
10     for (int u = 0; u < w; u++) {
11       float val = I.getf(u, v);
12       float x = (float) (rg.nextGaussian() * s);
13       I.setf(u, v, val + x);
14     }
15   }
16 }



so scaling the return values of nextGaussian() by s makes the resulting random variable x normally distributed with N(0, s2 ).
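A minimal usage sketch for the method in Prog. G.1 is shown below. It assumes that addGaussianNoise() is accessible from the calling code; the image size and the value of σ are arbitrary choices for this illustration.

import ij.ImagePlus;
import ij.process.FloatProcessor;

// Hypothetical usage of addGaussianNoise() from Prog. G.1:
FloatProcessor fp = new FloatProcessor(400, 300);    // blank float image (all zeros)
addGaussianNoise(fp, 10.0);                          // add noise with sigma = 10
new ImagePlus("Gaussian noise (sigma = 10)", fp).show();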






Appendix H Gaussian Filters



This part contains material supplementary to Ch. 25 (SIFT).



H.1 Cascading Gaussian Filters

To compute a Gaussian scale space efficiently (as used in the SIFT method, for example), the scale layers are usually not obtained directly from the input image by smoothing with Gaussians of increasing size. Instead, each layer can be calculated recursively from the previous layer by filtering with relatively small Gaussians. Thus, the entire scale space is implemented as a concatenation or "cascade" of smaller Gaussian filters.1 If Gaussian filters of sizes σ1, σ2 are applied successively to the same image, the resulting smoothing effect is identical to using a single larger Gaussian filter H^G_σ12, that is,

(I ∗ H^G_σ1) ∗ H^G_σ2 = I ∗ (H^G_σ1 ∗ H^G_σ2) = I ∗ H^G_σ12,         (H.1)

with σ12 = √(σ1² + σ2²) being the size of the resulting combined Gaussian filter H^G_σ12 [151, Sec. 4.5.4]. Put in other words, the variances (squares of the σ values) of successive Gaussian filters add up, i.e.,

σ12² = σ1² + σ2².                                                    (H.2)

In the special case of the same Gaussian filter being applied twice (i.e., σ1 = σ2), the effective width of the combined filter is σ12 = √2·σ1.

1 See Sec. 25.1.1 for details.
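The additivity of the variances in Eqn. (H.2) can be verified numerically with ImageJ's GaussianBlur class, as sketched below. The class name, the accuracy value, and the comparison procedure are assumptions made for this illustration.

import ij.process.FloatProcessor;
import ij.process.ImageProcessor;
import ij.plugin.filter.GaussianBlur;

// Sketch illustrating Eqns. (H.1)/(H.2): two successive Gaussian filters of widths
// sigma1, sigma2 are (up to numerical accuracy) equivalent to a single filter of
// width sqrt(sigma1^2 + sigma2^2).
public class CascadedGaussianDemo {

    public static void compare(FloatProcessor I, double sigma1, double sigma2) {
        GaussianBlur gb = new GaussianBlur();

        ImageProcessor A = I.duplicate();            // cascaded filtering
        gb.blurGaussian(A, sigma1, sigma1, 0.002);
        gb.blurGaussian(A, sigma2, sigma2, 0.002);

        ImageProcessor B = I.duplicate();            // single equivalent filter
        double sigma12 = Math.sqrt(sigma1 * sigma1 + sigma2 * sigma2);
        gb.blurGaussian(B, sigma12, sigma12, 0.002);

        // A and B should be (almost) identical
    }
}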



H.2 Gaussian Filters and Scale Space

In a Gaussian scale space, the scale corresponding to each level is proportional to the width (σ) of the Gaussian filter required to derive this level from the original (completely unsmoothed) image. Given an image that is already pre-smoothed by a Gaussian filter of width σ1 and should be smoothed to some target scale σ2 > σ1, the required width of the additional Gaussian filter is

σd = √(σ2² − σ1²).                                                   (H.3)



Usually the neighboring layers of the scale space differ by a constant scale factor (κ) and the transformation from one scale level to another can be accomplished by successively applying Gaussian filters. Despite the constant scale factor, however, the width of the required filters is not constant but depends on the image's initial scale. In particular, if we want to transform some image with scale σ0 by a given factor κ to a new scale κ·σ0, then (from Eqn. (H.2)) the relation

(κ·σ0)² = σ0² + σd²                                                  (H.4)

must hold for σd. Thus, the required width σd of the additional Gaussian filter is

σd = σ0 · √(κ² − 1).                                                 (H.5)

For example, doubling the scale (i.e., κ = 2) of an image that is pre-smoothed with σ0 requires a Gaussian filter of width

σd = σ0 · √(2² − 1) = σ0 · √3 ≈ σ0 · 1.732.



H.3 Effects of Gaussian Filtering in the Frequency Domain

For the 1D Gaussian function

gσ(x) = (1/(σ·√(2π))) · e^(−x²/(2σ²)),                               (H.6)

the continuous Fourier transform2 F(gσ) is

Gσ(ω) = (1/√(2π)) · e^(−ω²σ²/2).                                     (H.7)



Doubling the width (σ) of a Gaussian filter corresponds to cutting the bandwidth by half. If σ is doubled, the Fourier transform becomes

G2σ(ω) = (1/√(2π)) · e^(−ω²(2σ)²/2) = (1/√(2π)) · e^(−(2ω)²σ²/2)     (H.8)
       = (1/√(2π)) · e^(−4ω²σ²/2) = Gσ(2ω)                           (H.9)



and, in general, when scaling the filter by some factor k, Gkσ (ω) = Gσ (kω).



(H.10)



That is, if σ is increased (i.e., the kernel widened) by factor k, the corresponding Fourier transform gets contracted by the same factor. In terms of linear filtering this means that widening the kernel by some factor k reduces the signal bandwidth by 1/k.



2 See also Sec. 18.1.

H.4 LoG-Approximation by Difference of Gaussians (DoG)



The 2D Laplacian-of-Gaussian (LoG, see Sec. 25.1.1) kernel,

Lσ(x, y) = ∇²gσ(x, y) = (1/(πσ⁴)) · ((x² + y² − 2σ²)/(2σ²)) · e^(−(x²+y²)/(2σ²)),   (H.11)



has a (negative) peak at the origin with the associated function value

Lσ(0, 0) = −1/(πσ⁴).                                                 (H.12)

Thus, the scale-normalized LoG kernel, defined in Eqn. (25.10) as

L̂σ(x, y) = σ²·Lσ(x, y),                                              (H.13)

has the peak value

L̂σ(0, 0) = −1/(πσ²)                                                  (H.14)



at the origin. In comparison, for a given scale factor κ, the unscaled DoG function

DoGσ,κ(x, y) = Gκσ(x, y) − Gσ(x, y)
             = (1/(2πκ²σ²)) · e^(−(x²+y²)/(2κ²σ²)) − (1/(2πσ²)) · e^(−(x²+y²)/(2σ²)),   (H.15)



has the peak value

DoGσ,κ(0, 0) = −(κ² − 1)/(2πκ²σ²).                                   (H.16)

By scaling the DoG function with some factor λ to match the LoG's center peak value, such that Lσ(0, 0) = λ·DoGσ,κ(0, 0), the unnormalized LoG (Eqn. (H.11)) can be approximated by the DoG as

Lσ(x, y) ≈ (2κ²/(σ²·(κ² − 1))) · DoGσ,κ(x, y).                       (H.17)

Similarly, the scale-normalized LoG (Eqn. (H.13)) can be approximated by the DoG in the form3

L̂σ(x, y) ≈ (2κ²/(κ² − 1)) · DoGσ,κ(x, y).                            (H.18)



Since the factor in Eqn. (H.18) depends on κ only, the DoG approximation is (for a constant size ratio κ) implicitly proportional to the scale normalized LoG for any scale σ.



3 A different formulation, L̂σ(x, y) ≈ (1/(κ−1)) · DoGσ,κ(x, y), is given in [175], which is the same as Eqn. (H.18) for κ → 1, but not for κ > 1. The essence is that the leading factor is constant and independent of σ, and can thus be ignored when comparing the magnitude of the filter responses at varying scales.
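The relation between the scale-normalized LoG and the scaled DoG peak values can be checked numerically, as in the following sketch. The class name and the chosen values of σ and κ are arbitrary assumptions.

// Numerical check of Eqn. (H.18) at the origin: the scale-normalized LoG peak
// -1/(pi*sigma^2) (Eqn. (H.14)) vs. the scaled DoG peak (Eqns. (H.16), (H.18)).
public class DogLogCheck {
    public static void main(String[] args) {
        double sigma = 1.6, kappa = Math.sqrt(2);
        double logPeak = -1.0 / (Math.PI * sigma * sigma);                                    // (H.14)
        double dogPeak = -(kappa * kappa - 1) / (2 * Math.PI * kappa * kappa * sigma * sigma); // (H.16)
        double approx  = (2 * kappa * kappa / (kappa * kappa - 1)) * dogPeak;                 // (H.18)
        System.out.println(logPeak + " vs. " + approx);   // the two values should agree
    }
}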






Appendix I Writing ImageJ Plugins



I.1 ImageJ Plugins

Plugins are small Java modules for extending the functionality of ImageJ by using a simple standardized interface (Fig. I.1). Plugins can be created, edited, compiled, invoked, and organized through the Plugin menu in ImageJ's main window (see Fig. 1.11, p. 26). Plugins can be grouped to improve modularity, and plugin commands can be arbitrarily placed inside the main menu structure. Also, many of ImageJ's built-in functions are actually implemented as plugins themselves.

Fig. I.1 ImageJ software structure (simplified). ImageJ is based on the Java core system and depends in particular upon Java's Advanced Windowing Toolkit (AWT) for the implementation of the user interface and the presentation of image data. Plugins are small Java classes that extend the functionality of the basic ImageJ system.



I.1.1 Program Structure

Technically speaking, plugins are Java classes that implement a particular interface specification defined by ImageJ. There are two main types of plugins:

• PlugIn: requires no image to be open to start a plugin.






• PlugInFilter: the currently active image is passed to the plugin when started. Throughout the examples in this book, we almost exclusively use plugins of the second type (i.e., PlugInFilter) for implementing imageprocessing operations. The interface specification requires that any plugin of type PlugInFilter must at least implement two methods, setup() and run(), with the following signatures: int setup (String args, ImagePlus im ) When the plugin is started, ImageJ calls this method first to verify that the capabilities of this plugin match the target image. setup() returns a vector of binary flags (packaged as a 32-bit int value) that describes the plugin’s properties. void run (ImageProcessor ip ) This method does the actual work for this plugin. It is passed a single argument ip, an object of type ImageProcessor, which contains the image to be processed and all relevant information about it. The run() method returns no result value (void) but may modify the passed image and create new images. I.1.2 A First Example: Inverting an Image Let us look at a real example to quickly illustrate this mechanism. The task of our first plugin is to invert any 8-bit grayscale image to turn a positive image into a negative. As we shall see later, inverting the intensity of an image is a typical point operation, which is discussed in detail in Chapter 3. In ImageJ, 8-bit grayscale images have pixel values ranging from 0 (black) to 255 (white), and we assume that the width and height of the image are M and N , respectively. The operation is very simple: the value of each image pixel I(u, v) is replaced by its inverted value, I(u, v) ← 255 − I(u, v), for all image coordinates (u, v), with u = 0, . . . , M −1 and v = 0, . . . , N −1. I.1.3 Plugin My_Inverter_A (PlugInFilter) We decide to name our first plugin “My_Inverter_A”, which is both the name of the Java class and the name of the source file1 that contains it (see Prog. I.1). The underscore characters (“_”) in the name cause ImageJ to recognize this class as a plugin and to insert it automatically into the menu list at startup. The Java source code in file My_Inverter.java contains a few import statements, followed by the definition of the class My_Inverter, which implements the PlugInFilter interface (because it will be applied to an existing image). 1



1 File My_Inverter_A.java.






Prog. I.1 ImageJ plugin for inverting 8-bit grayscale images. This plugin implements the interface PlugInFilter and defines the required methods setup() and run(). The target image is received by the run() method as an instance of type ImageProcessor. ImageJ assumes that the plugin modifies the supplied image and automatically redisplays it after the plugin is executed. Program I.2 shows an alternative implementation that is based on the PlugIn interface.

import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;

public class My_Inverter_A implements PlugInFilter {

  public int setup(String args, ImagePlus im) {
    return DOES_8G;   // this plugin accepts 8-bit grayscale images
  }

  public void run(ImageProcessor ip) {
    int M = ip.getWidth();
    int N = ip.getHeight();

    // iterate over all image coordinates (u,v)
    for (int u = 0; u < M; u++) {
      for (int v = 0; v < N; v++) {
        int p = ip.getPixel(u, v);
        ip.putPixel(u, v, 255 - p);
      }
    }
  }
}



The setup() Method When a plugin of type PlugInFilter is executed, ImageJ first invokes its setup() method to obtain information about the plugin itself. In this example, setup() only returns the value DOES_8G (a static int constant specified by the PlugInFilter interface), indicating that this plugin can handle 8-bit grayscale images. The parameters arg and im of the setup() method are not used in this example (see also Exercise 1.16). The run() Method As mentioned already, the run() method of a PlugInFilter plugin receives an object (ip) of type ImageProcessor, which contains the image to be processed and all relevant information about it. First, we use the ImageProcessor methods getWidth() and getHeight() to query the size of the image referenced by ip. Then we use two nested for loops (with loop variables u, v for the horizontal and vertical coordinates, respectively) to iterate over all image pixels. For reading and writing the pixel values, we use two additional methods of the class ImageProcessor: int getPixel (int u, int v ) Returns the pixel value at the given position or zero if (u, v ) is outside the image bounds. void putPixel (int u, int v, int a ) Sets the pixel value at position (u, v ) to the new value a. Does nothing if (u, v ) is outside the image bounds. Both methods check the supplied image coordinates and pixel values to avoid unwanted errors. While this makes them more or less fail-safe it also makes them slow. If we are sure that no coordinates






Prog. I.2 Alternative implementation of the inverter plugin, based on ImageJ's PlugIn interface. In contrast to Prog. I.1 this plugin has no setup() method but defines a run() method only. The current image (im) is obtained as an instance of class ImagePlus by invoking the IJ.getImage() method. After checking for the proper image type the associated ImageProcessor (ip) is retrieved from im. The parameter string (args) is not used in this example. The remaining parts of the plugin are identical to Prog. I.1, except that the (slightly faster) pixel access methods get() and set() are used. Also note that the modified image is not re-displayed automatically but by an explicit call to updateAndDraw().






import ij.IJ;
import ij.ImagePlus;
import ij.plugin.PlugIn;
import ij.process.ImageProcessor;

public class My_Inverter_B implements PlugIn {

  public void run(String args) {
    ImagePlus im = IJ.getImage();
    if (im.getType() != ImagePlus.GRAY8) {
      IJ.error("8-bit grayscale image required");
      return;
    }
    ImageProcessor ip = im.getProcessor();
    int M = ip.getWidth();
    int N = ip.getHeight();

    // iterate over all image coordinates (u,v)
    for (int u = 0; u < M; u++) {
      for (int v = 0; v < N; v++) {
        int p = ip.get(u, v);
        ip.set(u, v, 255 - p);
      }
    }

    im.updateAndDraw();   // redraw the modified image
  }
}



outside the image bounds are ever accessed (as in My_Inverter in Prog. I.1) and the inserted pixel values are guaranteed not to exceed the image processor’s range, we can use the significantly faster methods get() and set() in place of getPixel() and putPixel(), respectively. The most efficient way to process the image is to avoid read/write methods altogether and directly access the elements of the associated (1D) pixel array. Details on these and other methods can be found in the ImageJ API documentation.2 I.1.4 Plugin My_Inverter_B (PlugIn) Program I.2 shows an alternative implementation of the inverter plugin based on ImageJ’s PlugIn interface, which requires a run() method only. In this case the reference to the current image is not supplied directly but is obtained by invoking the (static) method IJ.getImage(). If no image is currently open, getImage() automatically displays an error message and aborts the plugin. However, the subsequent test for the correct image type (GRAY8) and the corresponding error handling must be performed explicitly. The run() method accepts a single string argument that can be used to pass arbitrary information for controlling the plugin.



2 https://imagej.nih.gov/ij/developer/api/index.html



I.1.5 When To Use PlugIn Or PlugInFilter?






The choice of PlugIn or PlugInFilter is mostly a matter of taste, since both versions have their advantages and disadvantages. As a rule of thumb, we use the PlugIn type for tasks that do not require any image to be open, i.e., for tasks that create, load, or record images, or perform operations without any images. Otherwise, if one or more open images should be processed, PlugInFilter is the preferred choice, and thus almost all plugins in this book are of type PlugInFilter.

Editing, Compiling, and Executing a Plugin

The Java source file for our plugin should be stored in directory /plugins/3 or an immediate subdirectory. New plugin files can be created with ImageJ's Plugins ▷ New... menu. ImageJ even provides a built-in Java editor for writing plugins, which is available through the Plugins ▷ Edit... menu but unfortunately is of little use for serious programming. A better alternative is to use a modern editor or a professional Java programming environment, such as Eclipse,4 IntelliJ,5 or Apache NetBeans,6 all of which are freely available. For compiling plugins (to Java bytecode), ImageJ comes with its own Java compiler as part of its runtime environment. To compile and execute the new plugin, simply use the menu Plugins ▷ Compile and Run... Compilation errors are displayed in a separate log window. Once the plugin is compiled, the corresponding .class file is automatically loaded and the plugin is applied to the currently active image. An error message is displayed if no images are open or if the current image cannot be handled by that plugin. At startup, ImageJ automatically loads all correctly named plugins found in the /plugins/ directory (or any immediate subdirectory) and installs them in its Plugins menu. These plugins can be executed immediately without any recompilation. References to plugins can also be placed manually with the Plugins ▷ Shortcuts ▷ Install Plugin... command at any other position in the ImageJ menu tree. Sequences of plugin calls and other ImageJ commands may be recorded as macro programs with Plugins ▷ Macros ▷ Record.

Displaying and "Undoing" Results

Our first plugins in Prog. I.1–I.2 did not create a new image but "destructively" modified the source image. This is not always desirable and, of course, plugins can also create additional images or calculate statistics only, without modifying the original image at all. It may be surprising, though, that our plugin contains no commands

3 refers to ImageJ's installation directory.
4 https://www.eclipse.org/downloads/
5 https://www.jetbrains.com/idea
6 https://netbeans.apache.org






for displaying the modified image. This is done automatically by ImageJ whenever it can be assumed that the image passed to a plugin was modified.7 In addition, ImageJ automatically makes a copy ("snapshot") of the image before passing it to the run() method of a PlugInFilter-type plugin. This feature makes it possible to restore the original image (by Edit ▷ Undo) after the plugin has finished, without any further measures in the plugin code.

Logging and Debugging

The usual console output from Java via System.out is not available in ImageJ by default.8 Instead, a separate logging window can be used which facilitates simple text output by the method IJ.log(String s). Such calls may be placed at any position in the plugin code for quick and simple debugging at runtime. However, because of the typically large amounts of data involved, they should be used with caution in real image-processing operations. Particularly when placed in the body of inner processing loops that could execute millions of times, text output may produce an enormous overhead compared to the time used for the actual calculations. ImageJ itself does not offer much support for "real" debugging, i.e., for setting breakpoints, inspecting local variables, etc. However, it is possible to launch ImageJ from within a programming environment (IDE) such as Eclipse or Netbeans and then use all debugging options that the given environment provides.9 According to experience, this is only needed in rare and exceptionally difficult situations. In most cases, inspection of pixel values displayed in ImageJ's main window (see Fig. 1.11) is much simpler and more effective. In general, many errors (in particular those related to image coordinates) can be easily avoided by careful planning in advance.

I.1.6 Executing ImageJ "Commands"

If possible, it is wise in most cases to re-use existing (and extensively tested) functionality instead of re-implementing it oneself. In particular, the Java library that comes with ImageJ covers many standard image-processing operations, many of which are used throughout this book. Additional classes and methods for specific operations are contained in the associated (imagingbook) library. In the context of ImageJ, the term "command" refers to any composite operation implemented as a (Java) plugin, a macro command, or a script.10 ImageJ itself includes numerous commands which can be listed with the menu Plugins ▷ Utilities ▷ Find Commands.... They are usually referenced "by name", i.e., by a unique string.

7 No automatic redisplay occurs if the NO_CHANGES flag is set in the return value of the plugin's setup() method.
8 This can be changed by calling ij.io.LogStream.redirectSystem(), which redirects Java's System.out stream to ImageJ's output console.
9 For details see the "HowTo" section at http://imagejdocu.tudor.lu.
10 Scripting languages for ImageJ currently include JavaScript, BeanShell, and Python.



Prog. I.3 Executing the ImageJ command "Invert" within a Java plugin of type PlugIn.

import ij.IJ;
import ij.ImagePlus;
import ij.plugin.PlugIn;

public class Command_From_PlugIn implements PlugIn {

  public void run(String args) {
    ImagePlus im = IJ.getImage();
    IJ.run(im, "Invert", "");   // run the "Invert" command on im
    // ... continue with this plugin
  }
}

Prog. I.4 Executing the ImageJ command "Invert" within a Java plugin of type PlugInFilter. In this case the current image is automatically locked during plugin execution, such that no other operation may be applied to it. However, the image can be temporarily unlocked by calling unlock() and lock(), respectively, to run the external command.

import ij.IJ;
import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;

public class Command_From_PlugInFilter implements PlugInFilter {
  ImagePlus im;

  public int setup(String args, ImagePlus im) {
    this.im = im;
    return DOES_ALL;
  }

  public void run(ImageProcessor ip) {
    im.unlock();                // unlock im to run other commands
    IJ.run(im, "Invert", "");   // run "Invert" command on im
    im.lock();                  // lock im again (to be safe)
    // ... continue with this plugin
  }
}



For example, the standard operation for inverting an image (Edit ▷ Invert) is implemented by the Java class ij.plugin.filter.Filters (with the string argument "invert"). An existing command can also be executed from within a Java plugin with the method IJ.run(), as demonstrated for the "Invert" command in Prog. I.3. Some caution is required with plugins of type PlugInFilter, since these lock the current image during execution, such that no other operation can be applied to it. The example in Prog. I.4 shows how this can be resolved by a pair of calls to unlock() and lock(), respectively, to temporarily release the current image.

I.1.7 ImageJ's Command Recorder

A convenient tool for putting together complex commands is ImageJ's built-in Macro Recorder. Started with Plugins ▷ Macros ▷ Record..., it logs all subsequent commands in a text file for later use. It can be set up to record commands in various modes, including Java, JavaScript, BeanShell, Python, or ImageJ macro code. Of course it does record the application of self-defined plugins as well.






Appendix J Java Notes



As a text for undergraduate engineering curricula, this book assumes basic programming skills in a procedural language, such as Java, C#, or C. The examples in the main text should be easy to understand with the help of an introductory book on Java or one of the many online tutorials. Experience shows, however, that difficulties with some basic Java concepts pertain and often cause complications, even at higher levels. The following sections address some typical problem spots.



J.1 Arithmetic

Java is a "strongly typed" programming language, which means that every variable has a fixed type that cannot be altered dynamically. Also, the result of an expression is determined by the types of the involved operands and not (in the case of an assignment) by the type of the "receiving" variable.

J.1.1 Integer Division

Division involving integer operands is a frequent cause of errors. If the variables a and b are both of type int (or long), then the expression a / b is evaluated according to the rules of integer division. The result (the number of times b is contained in a) is again of type int or long. For example, after the Java statements

int a = 2;
int b = 5;
double c = a / b;   // c is zero!



the value of c is not 0.4 but 0.0 because the expression a / b on the right yields the int-value 0, which is then automatically converted to the double value 0.0, before being assigned to c. If we wanted to evaluate a / b as a floating-point operation (as most pocket calculators do), at least one of the involved operands must be converted to a floating-point value, e.g., by an explicit type cast



double c = (double) a / b;    // c = 0.4

or alternatively

double c = a / (double) b;    // c = 0.4

The following, however, does not work as intended, since a/b is evaluated (in the integer domain) before the type cast takes effect:

double c = (double) (a / b);  // WRONG! c = 0.0



Example

Assume, for example, that we want to scale any pixel value a of an image such that the maximum pixel value amax is mapped to 255 (see Ch. 3). In mathematical notation, the scaling of the pixel values is simply expressed as

c ← (a / amax) · 255,

and it may be tempting to convert this 1:1 into Java code, e.g.,

ImageProcessor ip = ... ;            // the current image
int a_max = (int) ip.getMaxValue();
for ... {
  int a = ip.getPixel(u,v);
  int c = (a / a_max) * 255;         // ← problem!
  ip.putPixel(u, v, c);
}
...

As we can now predict, the resulting image will be all black (filled with zero values), except those pixels whose value was a_max originally (they are set to 255). The reason is again that the division a / a_max has two operands of type int, and the result is thus zero whenever the denominator (a_max) is greater than the numerator (a). Of course, the entire operation could be performed in the floating-point domain by converting one of the operands (as we have shown), but this is not even necessary. Instead, we may simply change the order of operations and do the multiplication first, i.e.,

int c = a * 255 / a_max;



Why does this work now? The subexpression a * 255 is evaluated first,1 generating large intermediate values that pose no problem for the subsequent (integer) division. Nevertheless, rounding should always be considered to obtain more accurate results when computing fractions of integers (see Sec. J.1.4). J.1.2 Modulus Operator The result of the modulus operator ‘a mod b’ (used in several places in the main text) is defined [112, p. 82] as the remainder of the “floored” division ⌊a/b⌋, that is, a mod b ≡ a − b · ⌊a/b⌋,



(J.1)



for a, b ∈ R, b ≠ 0. In Java, the standard method

Math.floorMod(int a, int b)
Math.floorMod(long a, long b)

implements the above operation for int and long arguments. While the mathematical 'mod' operator is not restricted to integer values, no standard Java method currently exists for floating-point arguments. It may be defined directly from the definition in Eqn. (J.1), e.g., as2

double mod(double a, double b) {
  return a - b * Math.floor(a / b);
}

1 In Java, expressions at the same level are always evaluated in left-to-right order, and therefore no parentheses are required in this example (though they would do no harm either).



Again, argument b is supposed to be non-zero. Note that Java’s (and C’s) remainder operator %, defined as a % b ≡ a − b · truncate(a/b)



(J.2)



for b ≠ 0, is often confused with the 'mod' operator, but yields the same results only when both operands have the same sign, e.g.,

  13 mod  4 →  1         13 %  4 →  1
 −13 mod  4 →  3        −13 %  4 → −1
  13 mod −4 → −3         13 % −4 →  1
 −13 mod −4 → −1        −13 % −4 → −1

Java's % operator works for floating-point quantities as well, with analogous differences to the 'mod' operator:

  3.5 mod  2.1 →  1.4         3.5 %  2.1 →  1.4
 −3.5 mod  2.1 →  0.7        −3.5 %  2.1 → −1.4
  3.5 mod −2.1 → −0.7         3.5 % −2.1 →  1.4
 −3.5 mod −2.1 → −1.4        −3.5 % −2.1 → −1.4
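For a quick check, a few of the integer cases listed above can be reproduced with Math.floorMod() and the % operator, as in this small sketch (the surrounding method is assumed):

// Reproduces some of the integer examples above:
System.out.println(Math.floorMod(-13, 4));   // 3   ('mod' semantics)
System.out.println(-13 % 4);                 // -1  (remainder semantics)
System.out.println(Math.floorMod(13, -4));   // -3
System.out.println(13 % -4);                 // 1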



Thus one should always be careful when coding a modulus operation, particularly if any of the arguments may become negative. If both arguments are guaranteed to be positive, the % operator is safe to use. Otherwise Math.floorMod() or the above mod() method for floating-point arguments should be used. J.1.3 Mathematical Functions in Class Math Java provides most standard mathematical functions as static methods in class Math, as listed in Table J.1. The Math class is part of the java.lang package and thus requires no explicit import to be used. Most Math methods accept arguments of type double and also return values of type double. As a simple example, a typical use of the cosine function y = cos(x) is double x; double y = Math.cos(x);



Similarly, the Math class defines some common numerical constants as static variables; for example, the value of π could be obtained by

2 This method is defined as Arithmetic.mod(double a, double b) in the imagingbook library.



Table J.1 Frequently used mathematical methods and constants defined by Java's standard Math class.

double abs(double a)              double max(double a, double b)
int abs(int a)                    float max(float a, float b)
float abs(float a)                int max(int a, int b)
long abs(long a)                  long max(long a, long b)
double ceil(double a)             double min(double a, double b)
double floor(double a)            float min(float a, float b)
int floorMod(int a, int b)        int min(int a, int b)
long floorMod(long a, long b)     long min(long a, long b)
double rint(double a)             double random()
long round(double a)
int round(float a)
double toDegrees(double rad)      double toRadians(double deg)
double sin(double a)              double asin(double a)
double cos(double a)              double acos(double a)
double tan(double a)              double atan(double a)
double atan2(double y, double x)
double hypot(double x, double y)
double log(double a)              double exp(double a)
double sqrt(double a)             double pow(double a, double b)
double E                          double PI



double pi = Math.PI;



The Apache Commons Math library provides class FastMath3 as a faster and potentially more accurate "drop-in" replacement for Java's standard Math class, using the same function names as listed in Table J.1.

J.1.4 Numerical Rounding

Java's standard Math class (confusingly) offers three different methods for rounding floating-point values:

double rint(double x)
long round(double x)
int round(float x)

For example, a double value x can be rounded to int in any of the following ways:

double x;
int k;
k = (int) Math.rint(x);
k = (int) Math.round(x);
k = Math.round((float) x);



If the operand x is known to be positive (as is typically the case with pixel values), rounding can be accomplished without using any method calls by

k = (int) (x + 0.5);   // only if x >= 0

3 Package org.apache.commons.math3.util



In this case, the expression (x + 0.5) is first computed as a floating-point (double) value, which is then truncated (toward zero) by the explicit (int) typecast. Note, however, that performance gains from avoiding method calls are usually marginal.

J.1.5 Inverse Tangent Function

The inverse tangent function φ = tan⁻¹(a) = arctan(a) is used in several places in the main text. This function is implemented by the method atan(double a) in Java's Math class (see Table J.1). The return value of atan() is in the range [−π/2, . . . , π/2] and thus restricted to only two of the four quadrants. Without any additional constraints, the resulting angle is ambiguous. In many practical situations, however, a is specified as the ratio of two catheti (x, y) of a right-angled triangle in the form

φ = arctan(y/x),                                                     (J.3)

for which we introduced the two-parameter function

φ = ArcTan(x, y)                                                     (J.4)

in the main text.4 Function ArcTan(x, y) is implemented by the standard method atan2(y,x) in Java's Math class, i.e.,

ArcTan(x, y) ≡ atan2(y, x)                                           (J.5)

(but note the reversed parameters), and returns an unambiguous angle φ in the range [−π, π), i.e., in all four quadrants of the unit circle.5 Method atan2() returns a finite value even if both arguments are zero. If angles in the (positive-only) range [0, 2π) are required, the simplest way is to use the modulus or the remainder operator (see Sec. J.1.2), e.g., in the form

φ1 = ArcTan(x, y) mod (2π)                                           (J.6)
or
φ2 = (ArcTan(x, y) + 2π) % (2π),                                     (J.7)

respectively. In Java syntax this translates to6

double phi1 = Arithmetic.mod(atan2(y,x), 2 * Math.PI);
double phi2 = (atan2(y,x) + 2 * Math.PI) % (2 * Math.PI);



4 See also p. 801.
5 The function atan2(y,x) is available in most current programming languages, including Java, C, and C++.
6 Method Arithmetic.mod() is defined in Sec. J.1.2.

J.1.6 Unsigned Byte Data

Most grayscale and indexed images in Java and ImageJ are composed of pixels of type byte, and the same holds for the individual components of most color images. A single byte consists of eight bits and can thus represent 2⁸ = 256 different bit patterns or values, usually



mapped to the numeric range [0, 255]. Unfortunately, Java (unlike C and C++) does not provide a native “unsigned” 8-bit data type. The primitive Java type byte is “signed”, using one of its eight bits for the ± sign, and is intended to hold values in the range [−128, 127]. Java’s byte type can still be used to represent (unsigned) values 0 to 255, but conversions must take place to perform proper arithmetic computations. For example, after execution of the statements int a = 200; byte b = (byte) a;



the variables a (32-bit int) and b (8-bit byte) contain the binary patterns

a = 00000000000000000000000011001000   (= 200 decimal)
b = 11001000                           (= −56 decimal)

Interpreted as a (signed) byte value, with the leftmost bit7 taken as the sign bit, variable b has the decimal value −56. Thus the statement

int c = b;             // c == -56 (WRONG!)

sets the value of int variable c to −56! To (ab-)use the signed byte type for unsigned 8-bit data, we can circumvent Java's standard conversion mechanism by disguising the content of b as a binary (i.e., non-arithmetic) bit pattern; for example, by

int c = (0xff & b);    // c == 200 (RIGHT!)



where 0xff (in hexadecimal notation) is an int value with the binary 32-bit pattern 00000000000000000000000011111111 and & is the bitwise AND operator. Now variable c contains the proper integer value (200) and we thus have a way to use Java’s (signed) byte data type for storing unsigned values. Inside ImageJ, access to 8-bit pixel data is usually implemented at the byte level, which is considerably faster than using the convenience methods getPixel() and putPixel(), which take integer arguments that need to be converted. J.1.7 Classes Float and Double The representation of floating-point numbers in Java follows the IEEE standard, and thus the types float and double include the values Float.MIN_VALUE, Double.MIN_VALUE, Float.MAX_VALUE, Double.MAX_VALUE, Float.POSITIVE_INFINITY, Double.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY, Float.NaN, Double.NaN. These values are defined as constants in the corresponding wrapper classes Float and Double, respectively. If any INFINITY or NaN8 value occurs in the course of a computation (e.g., as the result of dividing by zero),9 Java continues without raising an error, so incorrect 7



8 9



900



Java uses the standard “2s-complement” representation, where a sign bit = 1 stands for a negative value. NaN stands for “not a number”. In Java, this only holds for floating-point operations, whereas integer division by zero always causes an exception.



values may ripple unnoticed through a whole chain of calculations, J.2 Arrays thereby making it difficult to locate the original problem. J.1.8 Testing Floating-Point Values Against Zero Comparing floating-point values or testing them for zero is a nontrivial issue and a frequent cause of errors. In particular, one should never write something like if (x == 0.0) {...}



// DON’T!



if x is a floating-point variable. This is often needed, for example, to make sure that it is safe to divide by x. The above test is generally not sufficient since x may be non-zero but still too small as a divisor. A better alternative is to test if x is “close” to zero, that is, within some small positive/negative (epsilon) interval. While the proper choice of this interval depends on the specific situation, the following settings are usually sufficient for safe operation:10 static final float EPSILON_FLOAT = 1e-7f; static final double EPSILON_DOUBLE = 2e-16; float x; double y; if (Math.abs(x) < EPSILON_FLOAT) { ... // x is practically zero } if (Math.abs(y) < EPSILON_DOUBLE) { ... // y is practically zero }



The imagingbook library defines static convenience methods for testing zero values, e.g., boolean isZero(double x),11 boolean isZero(double x, double tolerance), and similarly for comparing two floating-point values, e.g., boolean equals(double x, double y), boolean equals(double x, double y, double tolerance). Analogous methods are available for float arguments.
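For illustration, such convenience methods could be written as follows. This is only a minimal sketch based on the epsilon settings shown above; the class name FloatCompare is made up for this example, and the actual imagingbook implementation may differ:

  public class FloatCompare {
    public static final float  EPSILON_FLOAT  = 1e-7f;
    public static final double EPSILON_DOUBLE = 2e-16;

    // true if x is "practically" zero (within the default tolerance)
    public static boolean isZero(double x) {
      return isZero(x, EPSILON_DOUBLE);
    }
    public static boolean isZero(double x, double tolerance) {
      return Math.abs(x) < tolerance;
    }
    // true if x and y differ by less than the (default) tolerance
    public static boolean equals(double x, double y) {
      return equals(x, y, EPSILON_DOUBLE);
    }
    public static boolean equals(double x, double y, double tolerance) {
      return Math.abs(x - y) < tolerance;
    }
  }

With these definitions, a call like FloatCompare.isZero(1e-20) returns true, whereas FloatCompare.isZero(0.001) returns false.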



J.2 Arrays

J.2.1 Creating Arrays

Unlike in most traditional programming languages (such as FORTRAN or C), arrays in Java can be created dynamically, meaning that the size of an array can be specified at runtime using the value of some variable or arithmetic expression. For example:

  int N = 20;
  int[] A = new int[N];
  int[] B = new int[N * N];

Once it is created, however, the size of any Java array is fixed and cannot be subsequently altered (for additional flexibility, Java provides a number of universal container classes, e.g., implementations of the Set and List interfaces, for a wide range of applications). Note that Java arrays may be of length zero! After its definition, an array variable can be assigned any other compatible array or the constant value null (this is not possible if the array variable was declared with the attribute final), for example,

  A = B;       // A now references the data in B
  B = null;

With the assignment A = B, the array initially referenced by A becomes inaccessible and thus turns into garbage. In contrast to C and C++, where unnecessary storage needs to be deallocated explicitly, this is taken care of in Java by its built-in “garbage collector”. It is also convenient that newly created arrays of numerical primitive types (i.e., int, float, double, etc.) are automatically initialized to zero.

J.2.2 Array Size

Since an array may be created dynamically, it is important that its actual size can be determined at runtime. This is done by accessing the array’s length field (note that length is not a method!):

  int k = A.length;    // number of elements in A



The size is a property of the array itself, thus (unlike in C, for example) it is not necessary to pass the size of an array as a separate method argument. If an array has more than one dimension, the size (length) along every dimension must be queried separately (see Sec. J.2.4). Also, arrays are not necessarily rectangular; for example, the individual rows of a 2D array may have different lengths (including zero).

J.2.3 Accessing Array Elements

In Java, the index of the first array element is always 0 and the index of the last element is N−1, for an array with a total of N elements. To iterate through a 1D array A of arbitrary size, one would typically use a construct like

  for (int i = 0; i < A.length; i++) {
    // do something with A[i]
  }

Alternatively, if only the array values are relevant and the array index (i) is not needed, one could use the following (even simpler) “for-each” loop construct:

  for (int a : A) {
    // do something with array values a
  }



In both cases, the Java compiler can generate very efficient runtime code, since the source code makes obvious that the for loop does not access any elements outside the array limits and thus no explicit boundary checking is needed at execution time. This fact is very important for implementing efficient image processing programs in Java. Images in Java and ImageJ are usually stored as 1D arrays (accessible through the ImageProcessor method getPixels() in ImageJ), with pixels arranged in row-first order, which means that horizontally adjacent image pixels are stored next to each other in computer memory. Statistical calculations and most point operations can thus be efficiently implemented by directly accessing the underlying 1D array. For example, the run method of the contrast enhancement plugin in Prog. 3.1 (see p. 50) could also be implemented in the following manner:

  public void run(ImageProcessor ip) {
    // ip is assumed to be of type ByteProcessor
    byte[] pixels = (byte[]) ip.getPixels();
    for (int i = 0; i < pixels.length; i++) {
      int a = 0xFF & pixels[i];           // direct read operation
      int b = (int) (a * 1.5 + 0.5);
      if (b > 255)
        b = 255;
      pixels[i] = (byte) (0xFF & b);      // direct write operation
    }
  }
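The index arithmetic behind this 1D pixel layout can also be made explicit. The following sketch (variable names and coordinates are arbitrary example values, and ip is again assumed to be a ByteProcessor) reads and writes a single pixel at position (u, v), using the standard ImageProcessor methods getWidth() and getPixels():

  int w = ip.getWidth();                // image width
  byte[] pixels = (byte[]) ip.getPixels();
  int u = 7, v = 3;                     // some pixel position (example)
  int i = v * w + u;                    // 1D index of pixel (u, v) in row-first order
  int val = 0xFF & pixels[i];           // read unsigned pixel value (0..255)
  int newVal = Math.min(val + 10, 255); // e.g., brighten by 10 (clamped to 255)
  pixels[i] = (byte) (0xFF & newVal);   // write the modified value back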



J.2.4 2D Arrays

Multidimensional arrays are a frequent source of confusion. In Java, all arrays are 1D in principle, and multi-dimensional arrays are implemented as 1D arrays of arrays, etc. If, for example, the 3 × 3 matrix

  A = \begin{pmatrix} a_{0,0} & a_{0,1} & a_{0,2} \\ a_{1,0} & a_{1,1} & a_{1,2} \\ a_{2,0} & a_{2,1} & a_{2,2} \end{pmatrix}
    = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}                    (J.8)

is defined as a 2D int array, i.e.,

  int[][] A = {{1,2,3}, {4,5,6}, {7,8,9}};

then A is actually a 1D array with three elements, each of which is again a 1D array. The elements A[0], A[1] and A[2] are of type int[] and correspond to the three rows of the matrix A (see Fig. J.1).

Fig. J.1 Layout of elements of a 2D Java array (see Eqn. (J.8)). In Java, multidimensional arrays are generally implemented as 1D arrays whose elements are again 1D arrays:
  A → A[0] → A[0][0]=1, A[0][1]=2, A[0][2]=3
      A[1] → A[1][0]=4, A[1][1]=5, A[1][2]=6
      A[2] → A[2][0]=7, A[2][1]=8, A[2][2]=9



The usual assumption is that the array elements are arranged in row-first order, as illustrated in Fig. J.1. The first index thus corresponds to the row number (r) and the second index corresponds to the column number (c), that is,

  a_{r,c} ≡ A[r][c].                    (J.9)

This conforms to the mathematical convention and makes the array definition in the code segment above look exactly the same as the original matrix in Eqn. (J.8). Note that, in this scheme, the first array index corresponds to the vertical coordinate and the second index to the horizontal coordinate. However, if an array is used to specify the contents of an image I(u, v) or a filter kernel H(i, j), we usually assume that the first index (u or i, respectively) is associated with the horizontal x-coordinate and the second index (v or j) with the vertical y-coordinate. For example, if we want to define the filter kernel

  H = \begin{pmatrix} h_{0,0} & h_{1,0} & h_{2,0} \\ h_{0,1} & h_{1,1} & h_{2,1} \\ h_{0,2} & h_{1,2} & h_{2,2} \end{pmatrix}
    = \begin{pmatrix} -1 & -2 & 0 \\ -2 & 0 & 2 \\ 0 & 2 & 1 \end{pmatrix}

as a 2D Java array,

  double[][] H = {{-1,-2, 0},
                  {-2, 0, 2},
                  { 0, 2, 1}};

then the row and column indexes must be reversed in order to access the correct elements. In this case we have the relation

  h_{i,j} ≡ H[j][i],                    (J.10)

that is, the ordering of the indexes is reversed. In this case, the first array index (j) corresponds to the vertical coordinate and the second index (i) to the horizontal coordinate. The advantage is that (as shown in the above code segment) the definition of the filter kernel can be written in the usual matrix form; otherwise we would have to specify the transposed kernel matrix. This scheme is used, for example, in the implementation of the 3 × 3 filter plugin in Prog. 4.2 (see p. 91). If a 2D array is merely used as an image container (whose contents are never defined in matrix form) any convention can be used



for the ordering of the indexes. For example, the ImageJ method getFloatArray() of class ImageProcessor, when called in the form

  float[][] Ia = ip.getFloatArray();

returns the image as a 2D array (Ia), whose indexes are arranged in the usual x/y order, that is,

  I(u, v) ≡ Ia[u][v].                    (J.11)

Thus the image pixels in Ia are arranged in column order, that is, vertically adjacent elements are stored next to each other in memory.

Size of Multi-Dimensional Arrays

The size of a multi-dimensional array can be obtained by querying the size of its sub-arrays. For example, given the following 3D array with dimensions P × Q × R,

  int[][][] A = new int[P][Q][R];

the size of A along its three dimensions is obtained by the statements

  int p = A.length;          // = P
  int q = A[0].length;       // = Q
  int r = A[0][0].length;    // = R

This at least works for “rectangular” Java arrays, that is, multidimensional arrays with all sub-arrays at the same level having identical lengths, which is warranted by the array initialization in the above case. However, every 1D sub-array of A may be replaced by a compatible 1D array of different length (even if the array A was originally declared final, the structure and contents of its sub-arrays may be modified at any time), for example, by the statement

  A[0][0] = new int[0];



To avoid “index-out-of-bounds” errors during iterations over multidimensional arrays, the length of each sub-array should be determined dynamically. The following example shows a “bullet-proof” iteration over all elements of a 3D array (A) whose sub-arrays may have different lengths or may even be empty (but not null):

  int[][][] A = ... ;
  ...
  for (int i = 0; i < A.length; i++) {
    for (int j = 0; j < A[i].length; j++) {
      for (int k = 0; k < A[i][j].length; k++) {
        // safely access A[i][j][k]
      }
    }
  }
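To summarize the different index conventions discussed in this section (Eqns. (J.9)–(J.11)), here is a small illustrative sketch; all names and values are arbitrary examples:

  int[][] M = {{1,2,3}, {4,5,6}, {7,8,9}};          // matrix A of Eqn. (J.8)
  double[][] H = {{-1,-2,0}, {-2,0,2}, {0,2,1}};    // kernel H as defined above
  int r = 1, c = 2;
  int    a = M[r][c];     // matrix element a_{r,c} = 6     (row first, Eqn. (J.9))
  int i = 2, j = 1;
  double h = H[j][i];     // kernel coefficient h_{i,j} = 2 (indexes reversed, Eqn. (J.10))
  // For a 2D image array Ia obtained by ip.getFloatArray():
  // float p = Ia[u][v];  // pixel value I(u, v)            (x/y order, Eqn. (J.11))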



J.2.5 Arrays of Objects

In Java, as mentioned earlier, we can create arrays dynamically, that is, the size of an array can be specified at runtime. This is convenient because we can adapt the size of the arrays to the given problem. For example, we could write

  Corner[] corners = new Corner[n];

to create an array that can hold n instances of type Corner (see Sec. 6.4). The newly created array corners is not filled with corners yet but initialized with null references, so it initially holds no objects at all. We can insert a Corner object into its first (or any other) cell, for example, by

  corners[0] = new Corner(10, 20, 6789.0f);

Of course, we may also remove it again from the array, e.g., by

  corners[0] = null;
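As an illustration (not part of the original example), such a partially filled Corner[] array is typically traversed by skipping the unused null cells; alternatively, a java.util.List can be used when the number of elements is not known in advance:

  int count = 0;
  for (Corner cn : corners) {
    if (cn != null) {
      count++;                  // process the corner; here we only count it
    }
  }
  // Using a resizable list instead of a fixed-size array:
  // List<Corner> cornerList = new ArrayList<>();
  // cornerList.add(new Corner(10, 20, 6789.0f));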



J.2.6 Searching for Minimum and Maximum Values

Unfortunately, the standard Java API does not provide methods for retrieving the minimum and maximum values of a numeric array. Although these values are easily found by iterating over all elements of the sequence, care must be taken regarding the initialization. For example, finding the extreme values of a sequence of int values could be accomplished as follows (alternatively, one could initialize minval and maxval with the first array element A[0]):

  int[] A = ...
  int minval = Integer.MAX_VALUE;
  int maxval = Integer.MIN_VALUE;
  for (int val : A) {
    minval = Math.min(minval, val);
    maxval = Math.max(maxval, val);
  }

Note the use of the constants MIN_VALUE and MAX_VALUE, which are defined for any numeric Java type. However, in the case of floating-point values, these are not the proper values for initialization, because Double.MIN_VALUE and Float.MIN_VALUE specify the smallest positive values. Instead, POSITIVE_INFINITY and NEGATIVE_INFINITY should be used, as shown in the following code segment:

  double[] B = ...
  double minval = Double.POSITIVE_INFINITY;
  double maxval = Double.NEGATIVE_INFINITY;
  for (double val : B) {
    minval = Math.min(minval, val);
    maxval = Math.max(maxval, val);
  }
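Note that, in newer Java versions (8 and above), the same extreme values can also be obtained with the stream API of class java.util.Arrays; a brief sketch (example data only, and the array must not be empty for getAsInt()/getAsDouble() to succeed):

  int[] A = {5, -3, 17, 0};
  int minA = Arrays.stream(A).min().getAsInt();        // = -3
  int maxA = Arrays.stream(A).max().getAsInt();        // = 17

  double[] B = {0.5, -3.2, 1.7};
  double maxB = Arrays.stream(B).max().getAsDouble();  // = 1.7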



J.2.7 Sorting Arrays

Arrays can be sorted efficiently with the standard method Arrays.sort(type[] arr) defined by class java.util.Arrays, where arr can be any array of primitive type (int, float, etc.) or an array of objects. In the latter case, the array may not have null entries. Also, the class of every contained object must implement the Comparable interface, that is, provide a public method compareTo() that returns a negative int value, zero, or a positive int value, depending upon the intended ordering. For example, class Corner defines the compareTo() method for sorting corners by decreasing corner score as follows:

  public class Corner implements Comparable<Corner> {
    private final float x, y;    // corner position
    private final float q;       // corner score
    ...
    @Override
    public int compareTo(Corner other) {    // to sort by decreasing q
      return Float.compare(other.q, this.q);
    }
  }

The (optional) @Override annotation indicates that the compareTo() method overrides the associated method required by the Comparable interface.
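As an aside (not shown in the original text), an explicit java.util.Comparator can be passed to Arrays.sort() instead of implementing the Comparable interface; a minimal sketch using a plain Double array (Comparator.reverseOrder() requires Java 8 or newer):

  Double[] scores = {2.5, 7.0, 1.0};
  Arrays.sort(scores, java.util.Comparator.reverseOrder());
  // scores is now {7.0, 2.5, 1.0}, i.e., sorted in descending order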






References



1. H. Abdul-Rahman and N. Chernov. Fast and numerically stable circle fit. Journal of Mathematical Imaging and Vision 49, 289–295 (2014).
2. Adobe Systems. “Adobe RGB (1998) Color Space Specification” (2005). www.adobe.com/digitalimag/pdfs/AdobeRGB1998.pdf.
3. M. Ahmed and R. Ward. A rotation invariant rule-based thinning algorithm for character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(12), 1672–1678 (2002).
4. N. Ahmed. How I came up with the Discrete Cosine Transform. Digital Signal Processing 1, 4–5 (1991).
5. S. J. Ahn. “Least Squares Orthogonal Distance Fitting of Curves and Surfaces in Space”, vol. 3151 of “Lecture Notes in Computer Science”. Springer (2004).
6. A. Al-Sharadqah and N. Chernov. Error analysis for circle fitting algorithms. Electronic Journal of Statistics 3, 886–911 (2009).
7. L. Alvarez, P.-L. Lions, and J.-M. Morel. Image selective smoothing and edge detection by nonlinear diffusion (II). SIAM Journal on Numerical Analysis 29(3), 845–866 (1992).
8. Apache Software Foundation. “Apache Commons Math User Guide”. http://commons.apache.org/proper/commons-math/userguide/.
9. Apache Software Foundation. “Commons Math: The Apache Commons Mathematics Library”. http://commons.apache.org/math/index.html.
10. H. Araujo and J. M. Dias. An introduction to the log-polar mapping. In “Proceedings II Workshop on Cybernetic Vision”, pp. 139–144, São Carlos, Brazil (1996).
11. G. R. Arce, J. Bacca, and J. L. Paredes. Nonlinear filtering for image analysis and enhancement. In A. Bovik, editor, “Handbook of Image and Video Processing”, pp. 109–133. Academic Press, New York, second ed. (2005).
12. C. Arcelli and G. Sanniti di Baja. A one-pass two-operation process to detect the skeletal pixels on the 4-distance transform. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(4), 411–414 (1989).
13. K. Arnold, J. Gosling, and D. Holmes. “The Java Programming Language”. Prentice Hall, fifth ed. (2012).
14. S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM 45(6), 891–923 (1998).
15. J. Astola, P. Haavisto, and Y. Neuvo. Vector median filters. Proceedings of the IEEE 78(4), 678–689 (1990).
16. J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda. Uniqueness of the Gaussian kernel for scale-space filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 8(1), 26–33 (1986).






17. W. Bailer. “Writing ImageJ Plugins—A Tutorial” (2003). www. imagingbook.com. 18. S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework: Part 1. Technical Report CMU-RI-TR-02-16, Robotics Institute, Carnegie Mellon University (2003). 19. S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision 56(3), 221–255 (2004). 20. D. H. Ballard and C. M. Brown. “Computer Vision”. Prentice Hall, Englewood Cliffs, NJ (1982). 21. D. Barash. Fundamental relationship between bilateral filtering, adaptive smoothing, and the nonlinear diffusion equation. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(6), 844–847 (2002). 22. D. Barath and J. Matas. Graph-cut RANSAC. In “Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)”, pp. 6733–6741 (June 2018). 23. C. B. Barber, D. P. Dobkin, and H. Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software 22(4), 469–483 (1996). 24. M. Barni. A fast algorithm for 1-norm vector median filtering. IEEE Transactions on Image Processing 6(10), 1452–1455 (1997). 25. H. G. Barrow, J. M. Tenenbaum, R. C. Bolles, and H. C. Wolf. Parametric correspondence and chamfer matching: two new techniques for image matching. In R. Reddy, editor, “Proceedings of the 5th International Joint Conference on Artificial Intelligence”, pp. 659–663, Cambridge, MA (1977). William Kaufmann, Los Altos, CA. 26. H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. Computer Vision, Graphics, and Image Processing: Image Understanding 110(3), 346–359 (2008). 27. J. S. Beis and D. G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In “Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR’97)”, pp. 1000–1006, Puerto Rico (June 1997). 28. R. Bencina and M. Kaltenbrunner. The design and evolution of fiducials for the reacTIVision system. In “Proceedings of the 3rd International Conference on Generative Systems in the Electronic Arts”, Melbourne (2005). 29. J. Bernsen. Dynamic thresholding of grey-level images. In “Proceedings of the International Conference on Pattern Recognition (ICPR)”, pp. 1251–1255, Paris (October 1986). IEEE Computer Society. 30. C. M. Bishop. “Pattern Recognition and Machine Learning”. Springer, New York (2006). 31. Å. Björk. “Numerical Methods for Least Squares Problems”. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (1996). 32. R. E. Blahut. “Fast Algorithms for Digital Signal Processing”. Addison-Wesley, Reading, MA (1985). 33. I. Blayvas, A. Bruckstein, and R. Kimmel. Efficient computation of adaptive threshold surfaces for image binarization. Pattern Recognition 39(1), 89–101 (2006). 34. J. Blinn. Consider the lowly 2 × 2 matrix. IEEE Computer Graphics and Applications 16(2), 82–88 (1996). 35. J. Blinn. “Jim Blinn’s Corner: Notation, Notation, Notation”. Morgan Kaufmann (2002). 36. J. Bloch. “Effective Java”. Addison-Wesley, third ed. (2018).



37. G. Borgefors. Distance transformations in digital images. Computer Vision, Graphics and Image Processing 34, 344–371 (1986). 38. G. Borgefors. Hierarchical chamfer matching: a parametric edge matching algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 10(6), 849–865 (1988). 39. A. I. Borisenko and I. E. Tarapov. “Vector and Tensor Analysis with Applications”. Dover Publications, New York (1979). 40. R. N. Bracewell. “The Fourier Transform and Its Applications”. McGraw-Hill, third ed. (2000). 41. J. E. Bresenham. A linear algorithm for incremental digital display of circular arcs. Communications of the ACM 20(2), 100–106 (1977). 42. E. O. Brigham. “The Fast Fourier Transform and Its Applications”. Prentice Hall, Englewood Cliffs, NJ (1988). 43. I. N. Bronstein and K. A. Semendjajew. “Handbook of Mathematics”. Springer-Verlag, Berlin, third ed. (2007). 44. I. N. Bronstein, K. A. Semendjajew, G. Musiol, and H. Mühlig. “Taschenbuch der Mathematik”. Verlag Harri Deutsch, fifth ed. (2000). 45. M. Brown and D. Lowe. Invariant features from interest point groups. In “Proceedings of the British Machine Vision Conference”, pp. 656–665 (2002). 46. M. Brown, R. Szeliski, and S. Winder. Multi-image matching using multi-scale oriented patches. In “Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, CVPR’05”, pp. 510–517, San Diego, CA, USA (2005). 47. H. Bunke and P. S.-P. Wang, editors. “Handbook of Character Recognition and Document Image Analysis”. World Scientific, Singapore (2000). 48. W. Burger and M. J. Burge. “ImageJ Short Reference for Java Developers” (2008). www.imagingbook.com. 49. W. Burger and M. J. Burge. “Digital Image Processing—An Algorithmic Introduction using Java”. Texts in Computer Science. Springer, London, second ed. (2016). 50. P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications 31(4), 532– 540 (1983). 51. J. F. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8(6), 679–698 (1986). 52. K. R. Castleman. “Digital Image Processing”. Prentice Hall, Upper Saddle River, NJ (1995). 53. E. E. Catmull and R. Rom. A class of local interpolating splines. In R. E. Barnhill and R. F. Riesenfeld, editors, “Computer Aided Geometric Design”, pp. 317–326. Academic Press, New York (1974). 54. F. Catté, P.-L. Lions, J.-M. Morel, and T. Coll. Image selective smoothing and edge detection by nonlinear diffusion. SIAM Journal on Numerical Analysis 29(1), 182–193 (1992). 55. C. I. Chang, Y. Du, J. Wang, S. M. Guo, and P. D. Thouin. Survey and comparative analysis of entropy and relative entropy thresholding techniques. IEE Proceedings—Vision, Image and Signal Processing 153(6), 837–850 (2006). 56. F. Chang, C. J. Chen, and C. J. Lu. A linear-time componentlabeling algorithm using contour tracing technique. Computer Vision, Graphics, and Image Processing: Image Understanding 93(2), 206– 220 (2004). 57. P. Charbonnier, L. Blanc-Feraud, G. Aubert, and M. Barlaud. Two deterministic half-quadratic regularization algorithms for



computed imaging. In “Proceedings IEEE International Conference on Image Processing (ICIP-94)”, vol. 2, pp. 168–172, Austin (November 1994).
58. Y. Chen and G. Leedham. Decompose algorithm for thresholding degraded historical document images. IEE Proceedings—Vision, Image and Signal Processing 152(6), 702–714 (2005).
59. H. D. Cheng, X. H. Jiang, Y. Sun, and J. Wang. Color image segmentation: advances and prospects. Pattern Recognition 34(12), 2259–2281 (2001).
60. N. Chernov. “Ellipse Fit (Taubin method)”. MATLAB Central File Exchange, www.mathworks.com/matlabcentral/fileexchange/22683-ellipse-fit-taubin-method (2009).
61. N. Chernov. “Circular and Linear Regression: Fitting Circles and Lines by Least Squares”. Monographs on Statistics and Applied Probability. Taylor & Francis (2011).
62. N. Chernov and H. Ma. Least squares fitting of quadratic curves and surfaces. In S. R. Yoshida, editor, “Computer Vision”, Computer Science, Technology and Applications, pp. 285–302. Nova Science Publishers (2011).
63. P. R. Cohen and E. A. Feigenbaum. “The Handbook of Artificial Intelligence”. William Kaufmann, Los Altos, CA (1982).
64. B. Coll, J. L. Lisani, and C. Sbert. Color images filtering by anisotropic diffusion. In “Proceedings of the IEEE International Conference on Systems, Signals, and Image Processing (IWSSIP)”, pp. 305–308, Chalkida, Greece (2005).
65. D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002).
66. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. “Introduction to Algorithms”. MIT Press, Cambridge, MA, second ed. (2001).
67. A. Criminisi, I. D. Reid, and A. Zisserman. A plane measuring device. Image and Vision Computing 17(8), 625–634 (1999).
68. F. C. Crow. Summed-area tables for texture mapping. SIGGRAPH Computer Graphics 18(3), 207–212 (1984).
69. A. Cumani. Edge detection in multispectral images. Computer Vision, Graphics and Image Processing 53(1), 40–51 (1991).
70. A. Cumani. Efficient contour extraction in color images. In “Proceedings of the Third Asian Conference on Computer Vision”, ACCV, pp. 582–589, Hong Kong (January 1998). Springer.
71. L. S. Davis. A survey of edge detection techniques. Computer Graphics and Image Processing 4, 248–270 (1975).
72. J. E. Dennis and R. B. Schnabel. “Numerical Methods for Unconstrained Optimization and Nonlinear Equations”. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (1996).
73. R. Deriche. Using Canny’s criteria to derive a recursively implemented optimal edge detector. International Journal of Computer Vision 1(2), 167–187 (1987).
74. S. Di Zenzo. A note on the gradient of a multi-image. Computer Vision, Graphics and Image Processing 33(1), 116–125 (1986).
75. M. Donoser, H. Riemenschneider, and H. Bischof. Shape guided maximally stable extremal region (MSER) tracking. In “Proceedings of the 20th International Conference on Pattern Recognition (ICPR’10)”, pp. 1800–1803 (2010).
76. R. O. Duda, P. E. Hart, and D. G. Stork. “Pattern Classification”. Wiley, New York (2001).



77. F. Durand and J. Dorsey. Fast bilateral filtering for the display of high-dynamic-range images. In “Proceedings of the 29th annual conference on Computer graphics and interactive techniques (SIGGRAPH’02)”, pp. 257–266, San Antonio, Texas (July 2002). 78. D. Eberly. Derivative approximation by finite differences. Technical Report, Geometric Tools, www.geometrictools.com, Redmont, WA (May 2001). 79. D. Eberly. Distance from a point to an ellipse, an ellipsoid, or a hyperellipsoid. Technical Report, Geometric Tools, www.geometrictools. com, Redmont, WA (June 2013). 80. B. Eckel. “Thinking in Java”. Prentice Hall, Englewood Cliffs, NJ, fourth ed. (2006). 81. M. Elad. On the origin of the bilateral filter and ways to improve it. IEEE Transactions on Image Processing 11(10), 1141–1151 (2002). 82. A. Ferreira and S. Ubeda. Computing the medial axis transform in parallel with eight scan operations. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(3), 277–282 (1999). 83. M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981). 84. N. I. Fisher. “Statistical Analysis of Circular Data”. Cambridge University Press (1995). 85. A. W. Fitzgibbon, M. Pilu, and R. B. Fisher. Direct leastsquares fitting of ellipses. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(5), 476–480 (1999). 86. D. Flanagan. “Java in a Nutshell: A Desktop Quick Reference”. O’Reilly, Sebastopol, CA, 7th ed. (2019). 87. L. M. J. Florack, B. M. ter Haar Romeny, J. J. Koenderink, and M. A. Viergever. Scale and the differential structure of images. Image and Vision Computing 10(6), 376–388 (1992). 88. J. Flusser. On the independence of rotation moment invariants. Pattern Recognition 33(9), 1405–1410 (2000). 89. J. Flusser. Moment forms invariant to rotation and blur in arbitrary number of dimensions. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(2), 234–246 (2003). 90. J. Flusser, B. Zitova, and T. Suk. “Moments and Moment Invariants in Pattern Recognition”. John Wiley & Sons (2009). 91. J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes. “Computer Graphics: Principles and Practice”. Addison-Wesley, Reading, MA, second ed. (1996). 92. A. Ford and A. Roberts. “Colour Space Conversions” (1998). www.poynton.com/PDFs/coloureq.pdf. 93. P.-E. Forssén. Maximally stable colour regions for recognition and matching. In “2007 IEEE Conference on Computer Vision and Pattern Recognition”, pp. 1–8, Los Alamitos, CA, USA (June 2007). IEEE Computer Society. 94. P.-E. Forssen and D. G. Lowe. Shape descriptors for maximally stable extremal regions. In “Proceedings of the 11th International Conference on Computer Vision (ICCV’07)”, pp. 1–8 (2007). 95. W. Förstner and E. Gülch. A fast operator for detection and precise location of distinct points, corners and centres of circular features. In A. Grün and H. Beyer, editors, “Proceedings, International Society for Photogrammetry and Remote Sensing Intercommission Conference on the Fast Processing of Photogrammetric Data”, pp. 281–305, Interlaken (June 1987).






96. D. A. Forsyth and J. Ponce. “Computer Vision—A Modern Approach”. Prentice Hall, Englewood Cliffs, NJ, second ed. (2012). 97. J. N. Franklin. “Matrix Theory”. Prentice Hall (1968). 98. H. Freeman. Computer processing of line drawing images. ACM Computing Surveys 6(1), 57–97 (1974). 99. J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software 3(3), 209–226 (1977). 100. W. Gander, M. J. Gander, and F. Kwok. “Scientific Computing – An Introduction using Maple and MATLAB”. Springer (2014). 101. M. Gervautz and W. Purgathofer. A simple method for color quantization: octree quantization. In A. Glassner, editor, “Graphics Gems I”, pp. 287–293. Academic Press, New York (1990). 102. T. Gevers, A. Gijsenij, J. van de Weijer, and J.-M. Geusebroek. “Color in Computer Vision”. Wiley (2012). 103. T. Gevers and H. Stokman. Classifying color edges in video into shadow-geometry, highlight, or material transitions. IEEE Transactions on Multimedia 5(2), 237–243 (2003). 104. T. Gevers, J. Van De Weijer, and H. Stokman. Color feature detection. In R. Lukac and K. N. Plataniotis, editors, “Color Image Processing: Methods and Applications”, pp. 203–226. CRC Press (2006). 105. C. A. Glasbey. An analysis of histogram-based thresholding algorithms. Computer Vision, Graphics, and Image Processing: Graphical Models and Image Processing 55(6), 532–537 (1993). 106. A. S. Glassner. “Principles of Digital Image Synthesis”. Morgan Kaufmann Publishers, San Francisco (1995). 107. G. H. Golub and C. F. Van Loan. “Matrix Computations”. The Johns Hopkins University Press, Baltimore, fourth ed. (2013). 108. R. C. Gonzalez and R. E. Woods. “Digital Image Processing”. Addison-Wesley, Reading, MA (1992). 109. R. C. Gonzalez and R. E. Woods. “Digital Image Processing”. Pearson Prentice Hall, Upper Saddle River, NJ, third ed. (2008). 110. M. Grabner, H. Grabner, and H. Bischof. Fast approximated SIFT. In “Proceedings of the 7th Asian Conference of Computer Vision”, pp. 918–927 (2006). 111. R. L. Graham. An efficient algorithm for determining the convex hull of a finite planar set. Information Processing Letters 1, 132–133 (1972). 112. R. L. Graham, D. E. Knuth, and O. Patashnik. “Concrete Mathematics: A Foundation for Computer Science”. Addison-Wesley, Reading, MA, second ed. (1994). 113. P. Green. Colorimetry and colour differences. In P. Green and L. MacDonald, editors, “Colour Engineering”, ch. 3, pp. 40–77. Wiley, New York (2002). 114. F. Guichard, L. Moisan, and J.-M. Morel. A review of P.D.E. models in image processing and image analysis. J. Phys. IV France 12(1), 137–154 (2002). 115. W. W. Hager. “Applied Numerical Linear Algebra”. Prentice Hall (1988). 116. R. Halíř and J. Flusser. Numerically stable direct least squares fitting of ellipses. In “Proceedings of the 6th International Conference in Central Europe on Computer Graphics and Visualization (WSCG’98)”, pp. 125–132, Plzeň, CZ (February 1998). 117. A. Hanbury. Circular statistics applied to colour images. In “Proceedings of the 8th Computer Vision Winter Workshop”, pp. 55–60, Valtice, Czech Republic (February 2003).



118. J. C. Hancock. “An Introduction to the Principles of Communication Theory”. McGraw-Hill (1961). 119. I. Hannah, D. Patel, and R. Davies. The use of variance and entropic thresholding methods for image segmentation. Pattern Recognition 28(4), 1135–1143 (1995). 120. W. W. Harman. “Principles of the Statistical Theory of Communication”. McGraw-Hill (1963). 121. C. G. Harris and M. Stephens. A combined corner and edge detector. In C. J. Taylor, editor, “4th Alvey Vision Conference”, pp. 147–151, Manchester (1988). 122. J. C. Hart. Distance to an ellipsoid. In P. S. Heckbert, editor, “Graphics Gems IV”, pp. 113–119. Academic Press (1994). 123. P. E. Hart. How the Hough transform was invented. IEEE Signal Processing Magazine 26(6), 18–22 (2009). 124. R. Hartley and A. Zisserman. “Multiple View Geometry in Computer Vision”. Cambridge University Press, second ed. (2013). 125. P. S. Heckbert. Color image quantization for frame buffer display. Computer Graphics 16(3), 297–307 (1982). 126. P. S. Heckbert. Fundamentals of texture mapping and image warping. Master’s thesis, University of California, Berkeley, Dept. of Electrical Engineering and Computer Science (1989). 127. R. Hess. An open-source SIFT library. In “Proceedings of the International Conference on Multimedia, MM’10”, pp. 1493–1496, Firenze, Italy (October 2010). 128. J. Holm, I. Tastl, L. Hanlon, and P. Hubel. Color processing for digital photography. In P. Green and L. MacDonald, editors, “Colour Engineering”, ch. 9, pp. 179–220. Wiley, New York (2002). 129. C. M. Holt, A. Stewart, M. Clint, and R. H. Perrott. An improved parallel thinning algorithm. Communications of the ACM 30(2), 156–160 (1987). 130. V. Hong, H. Palus, and D. Paulus. Edge preserving filters on color images. In “Proceedings Int’l Conf. on Computational Science, ICCS”, pp. 34–40, Kraków, Poland (2004). 131. B. K. P. Horn. “Robot Vision”. MIT-Press, Cambridge, MA (1982). 132. R. A. Horn and C. R. Johnson. “Matrix Analysis”. Cambridge University Press, second ed. (2012). 133. P. V. C. Hough. Method and means for recognizing complex patterns. US Patent 3,069,654 (1962). 134. M. K. Hu. Visual pattern recognition by moment invariants. IEEE Transactions on Information Theory 8, 179–187 (1962). 135. A. Huertas and G. Medioni. Detection of intensity changes with subpixel accuracy using Laplacian-Gaussian masks. IEEE Transactions on Pattern Analysis and Machine Intelligence 8(5), 651–664 (1986). 136. R. W. G. Hunt. “The Reproduction of Colour”. Wiley, New York, sixth ed. (2004). 137. J. Hutchinson. Culture, communication, and an information age madonna. IEEE Professional Communications Society Newsletter 45(3), 1, 5–7 (2001). 138. J. Illingworth and J. Kittler. Minimum error thresholding. Pattern Recognition 19(1), 41–47 (1986). 139. J. Illingworth and J. Kittler. A survey of the Hough transform. Computer Vision, Graphics and Image Processing 44, 87–116 (1988). 140. International Color Consortium. “Specification ICC.1:2010-12 (Profile Version 4.3.0.0): Image Technology Colour Management— Architecture, Profile Format, and Data Structure” (2010). www.color. org.






141. International Electrotechnical Commission, IEC, Geneva. “IEC 61966-2-1: Multimedia Systems and Equipment—Colour Measurement and Management, Part 2-1: Colour Management—Default RGB Colour Space—sRGB” (1999). www.iec.ch. 142. International Organization for Standardization, ISO, Geneva. “ISO 13655:1996, Graphic Technology—Spectral Measurement and Colorimetric Computation for Graphic Arts Images” (1996). 143. International Organization for Standardization, ISO, Geneva. “ISO 15076-1:2005, Image Technology Colour Management—Architecture, Profile Format, and Data Structure: Part 1” (2005). Based on ICC.1:2004-10. 144. International Telecommunications Union, ITU, Geneva. “ITU-R Recommendation BT.709-3: Basic Parameter Values for the HDTV Standard for the Studio and for International Programme Exchange” (1998). 145. International Telecommunications Union, ITU, Geneva. “ITU-R Recommendation BT.601-5: Studio Encoding Parameters of Digital Television for Standard 4:3 and Wide-Screen 16:9 Aspect Ratios” (1999). 146. K. Jack. “Video Demystified—A Handbook for the Digital Engineer”. LLH Publishing, Eagle Rock, VA, third ed. (2001). 147. B. Jähne. “Practical Handbook on Image Processing for Scientific Applications”. CRC Press, Boca Raton, FL (1997). 148. B. Jähne. “Digitale Bildverarbeitung”. Springer-Verlag, Berlin, fifth ed. (2002). 149. B. Jähne. “Digital Image Processing”. Springer-Verlag, Berlin, sixth ed. (2005). 150. A. K. Jain. “Fundamentals of Digital Image Processing”. Prentice Hall, Englewood Cliffs, NJ (1989). 151. R. Jain, R. Kasturi, and B. G. Schunck. “Machine Vision”. McGraw-Hill, Boston (1995). 152. Y. Jia and T. Darrell. Heavy-tailed distances for gradient based image descriptors. In “Proceedings of the Twenty-Fifth Annual Conference on Neural Information Processing Systems (NIPS)”, Grenada, Spain (December 2011). 153. X. Y. Jiang and H. Bunke. Simple and fast computation of moments. Pattern Recognition 24(8), 801–806 (1991). 154. L. Jin and D. Li. A switching vector median filter based on the CIELAB color space for color image restoration. Signal Processing 87(6), 1345–1354 (2007). 155. K. Kanatani and P. Rangarajan. Hyper least squares fitting of circles and ellipses. Computational Statistics and Data Analysis 55, 2197–2208 (2011). 156. J. N. Kapur, P. K. Sahoo, and A. K. C. Wong. A new method for gray-level picture thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image Processing 29, 273–285 (1985). 157. B. Kimia. A large binary image database. Technical Report, LEMS Vision Group, Brown University (2002). 158. J. King. Engineering color at Adobe. In P. Green and L. MacDonald, editors, “Colour Engineering”, ch. 15, pp. 341–369. Wiley, New York (2002). 159. R. A. Kirsch. Computer determination of the constituent structure of biological images. Computers in Biomedical Research 4, 315–328 (1971). 160. L. Kitchen and A. Rosenfeld. Gray-level corner detection. Pattern Recognition Letters 1, 95–102 (1982).



161. D. E. Knuth. “The Art of Computer Programming, Volume 2: Seminumerical Algorithms”. Addison-Wesley, third ed. (1997). 162. J. J. Koenderink. The structure of images. Biological Cybernetics 50(5), 363–370 (1984). 163. A. Koschan and M. A. Abidi. Detection and classification of edges in color images. IEEE Signal Processing Magazine 22(1), 64–73 (2005). 164. A. Koschan and M. A. Abidi. “Digital Color Image Processing”. Wiley (2008). 165. P. Kovesi. Arbitrary Gaussian filtering with 25 additions and 5 multiplications per pixel. Technical Report UWA-CSSE-09-002, The University of Western Australia, School of Computer Science and Software Engineering (2009). 166. M. Kuwahara, K. Hachimura, S. Eiho, and M. Kinoshita. Processing of RI-angiocardiographic image. In K. Preston and M. Onoe, editors, “Digital Processing of Biomedical Images”, pp. 187–202. Plenum, New York (1976). 167. I. Kåsa. A circle fitting procedure and its error analysis. IEEE Transactions on Instrumentation and Measurement 25(1), 8–14 (1976). 168. F. Larsson, M. Felsberg, and P.-E. Forssen. Patch contour matching by correlating Fourier descriptors. In “Proceedings of the Digital Image Computing: Technqiues and Applications (DICTA)”, pp. 40–46, Melbourne, Australia (January 2009). IEEE Computer Society. 169. D. C. Lay, S. R. Lay, and J. J. McDonald. “Linear Algebra and Its Applications”. Pearson, Boston, fifth ed. (2016). 170. P.-S. Liao, T.-S. Chen, and P.-C. Chung. A fast algorithm for multilevel thresholding. Journal of Information Science and Engineering 17, 713–727 (2001). 171. B. J. Lindbloom. Accurate color reproduction for computer graphics applications. SIGGRAPH Computer Graphics 23(3), 117–126 (1989). 172. T. Lindeberg. “Scale-Space Theory in Computer Vision”. Kluwer Academic Publishers (1994). 173. T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision 30(2), 77–116 (1998). 174. D. G. Lowe. Object recognition from local scale-invariant features. In “Proceedings of the 7th IEEE International Conference on Computer Vision”, vol. 2 of “ICCV’99”, pp. 1150–1157, Kerkyra, Corfu, Greece (1999). 175. D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004). 176. B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In P. J. Hayes, editor, “Proceedings of the 7th International Joint Conference on Artificial Intelligence IJCAI’81”, pp. 674–679, Vancouver, BC (1981). William Kaufmann, Los Altos, CA. 177. R. Lukac, B. Smolka, and K. N. Plataniotis. Sharpening vector median filters. Signal Processing 87(9), 2085–2099 (2007). 178. R. Lukac, B. Smolka, K. N. Plataniotis, and A. N. Venetsanopoulos. Vector sigma filters for noise detection and removal in color images. Journal of Visual Communication and Image Representation 17(1), 1–26 (2006). 179. P. C. Mahalanobis. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India 2(1), 49–55 (1936). 180. S. Mallat. “A Wavelet Tour of Signal Processing”. Academic Press, New York (1999).






181. C. Mancas-Thillou and B. Gosselin. Color text extraction with selective metric-based clustering. Computer Vision, Graphics, and Image Processing: Image Understanding 107(1-2), 97–107 (2007). 182. M. J. Maron and R. J. Lopez. “Numerical Analysis”. Wadsworth Publishing, third ed. (1990). 183. D. Marr and E. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London, Series B 207, 187–217 (1980). 184. J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust widebaseline stereo from maximally stable extremal regions. Image and Vision Computing 22(10), 761–767 (2004). 185. E. H. W. Meijering, W. J. Niessen, and M. A. Viergever. Quantitative evaluation of convolution-based methods for medical image interpolation. Medical Image Analysis 5(2), 111–126 (2001). 186. J. Miano. “Compressed Image File Formats”. ACM Press, AddisonWesley, Reading, MA (1999). 187. K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. van Gool. A comparison of affine region detectors. international journal of computer vision. International Journal of Computer Vision 65(1/2), 43–72 (2005). 188. D. P. Mitchell and A. N. Netravali. Reconstruction filters in computer-graphics. In R. J. Beach, editor, “Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH’88”, pp. 221–228, Atlanta, GA (1988). ACM Press, New York. 189. P. A. Mlsna and J. J. Rodriguez. Gradient and Laplacian-type edge detection. In A. Bovik, editor, “Handbook of Image and Video Processing”, pp. 415–431. Academic Press, New York (2000). 190. P. A. Mlsna and J. J. Rodriguez. Gradient and Laplacian-type edge detection. In A. Bovik, editor, “Handbook of Image and Video Processing”, pp. 415–431. Academic Press, New York, second ed. (2005). 191. J. Morovic. “Color Gamut Mapping”. Wiley (2008). 192. M. Nadler and E. P. Smith. “Pattern Recognition Engineering”. Wiley, New York (1993). 193. M. Nagao and T. Matsuyama. Edge preserving smoothing. Computer Graphics and Image Processing 9(4), 394–407 (1979). 194. S. K. Naik and C. A. Murthy. Standardization of edge magnitude in color images. IEEE Transactions on Image Processing 15(9), 2588– 2595 (2006). 195. L. Najman and M. Couprie. Building the component tree in quasilinear time. IEEE Transactions on Image Processing 15(11), 3531– 3539 (2006). 196. W. Niblack. “An Introduction to Digital Image Processing”. Prentice-Hall (1986). 197. D. Nistér and H. Stewénius. Linear time maximally stable extremal regions. In D. Forsyth, P. Torr, and A. Zisserman, editors, “Computer Vision – ECCV 2008”, pp. 183–196, Berlin, Heidelberg (2008). Springer Berlin Heidelberg. 198. M. Nitzberg and T. Shiota. Nonlinear image filtering with edge and corner enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(8), 826–833 (1992). 199. Š. Obdržálek and J. Matas. Local affine frames for image retrieval. In “Proceedings of the International Conference on Image and Video Retrieval, CIVR 2002”, Lecture Notes in Computer Science, Vol. 2383, pp. 318–327, Berlin, Heidelberg (July 2002). Springer.



200. H. Ochoa-Dominguez and K. R. Rao. “Discrete Cosine Transform”. CRC Press, second ed. (2019). 201. W. Oh and W. B. Lindquist. Image thresholding by indicator kriging. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(7), 590–602 (1999). 202. A. V. Oppenheim, R. W. Shafer, and J. R. Buck. “Discrete-Time Signal Processing”. Prentice Hall, Englewood Cliffs, NJ, second ed. (1999). 203. N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1), 62–66 (1979). 204. N. R. Pal and S. K. Pal. A review on image segmentation techniques. Pattern Recognition 26(9), 1277–1294 (1993). 205. S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach. International Journal of Computer Vision 81(1), 24–52 (2007). 206. T. Pavlidis. “Algorithms for Graphics and Image Processing”. Computer Science Press / Springer-Verlag, New York (1982). 207. O. Pele and M. Werman. A linear time histogram metric for improved SIFT matching. In “Proceedings of the 10th European Conference on Computer Vision (ECCV’08)”, pp. 495–508, Marseille, France (October 2008). 208. P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(4), 629–639 (1990). 209. T. Q. Pham and L. J. van Vliet. Separable bilateral filtering for fast video preprocessing. In “Proceedings IEEE International Conference on Multimedia and Expo”, pp. CD1–4, Los Alamitos, USA (July 2005). IEEE Computer Society. 210. K. N. Plataniotis and A. N. Venetsanopoulos. “Color Image Processing and Applications”. Springer (2000). 211. F. Porikli. Constant time O(1) bilateral filtering. In “Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 1–8, Anchorage (June 2008). 212. C. A. Poynton. “Digital Video and HDTV Algorithms and Interfaces”. Morgan Kaufmann Publishers, San Francisco (2003). 213. V. Pratt. Direct least-squares fitting of algebraic surfaces. ACM SIGGRAPH Computer Graphics 21(4), 145–152 (July 1987). 214. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. “Numerical Recipes”. Cambridge University Press, third ed. (2007). 215. J. Prewitt. Object enhancement and extraction. In B. Lipkin and A. Rosenfeld, editors, “Picture Processing and Psychopictorics”, pp. 415–431. Academic Press (1970). 216. R. R. Rakesh, P. Chaudhuri, and C. A. Murthy. Thresholding in edge detection: a statistical approach. IEEE Transactions on Image Processing 13(7), 927–936 (2004). 217. W. S. Rasband. “ImageJ”. U.S. National Institutes of Health, MD (1997–2007). imagej.nih.gov/ij/. 218. C. E. Reid and T. B. Passin. “Signal Processing in C”. Wiley, New York (1992). 219. D. Rich. Instruments and methods for colour measurement. In P. Green and L. MacDonald, editors, “Colour Engineering”, ch. 2, pp. 19–48. Wiley, New York (2002). 220. I. E. G. Richardson. “H.264 and MPEG-4 Video Compression”. Wiley, New York (2003).






221. T. W. Ridler and S. Calvard. Picture thresholding using an iterative selection method. IEEE Transactions on Systems, Man, and Cybernetics 8(8), 630–632 (1978). 222. L. G. Roberts. Machine perception of three-dimensional solids. In J. T. Tippet, editor, “Optical and Electro-Optical Information Processing”, pp. 159–197. MIT Press, Cambridge, MA (1965). 223. G. Robinson. Edge detection by compass gradient masks. Computer Graphics and Image Processing 6(5), 492–501 (1977). 224. P. I. Rockett. An improved rotation-invariant thinning algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10), 1671–1674 (2005). 225. J. B. Roerdink and A. Meijster. The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta Informaticae 41, 187–228 (10 2001). 226. A. Rosenfeld and J. L. Pfaltz. Sequential operations in digital picture processing. Journal of the ACM 12, 471–494 (1966). 227. P. L. Rosin. Ellipse fitting using orthogonal hyperbolae and stirling’s oval. Graphical Models and Image Processing 60(3), 209–213 (1998). 228. J. C. Russ. “The Image Processing Handbook”. CRC Press, Boca Raton, FL, third ed. (1998). 229. P. K. Sahoo, S. Soltani, A. K. C. Wong, and Y. C. Chen. A survey of thresholding techniques. Computer Vision, Graphics and Image Processing 41(2), 233–260 (1988). 230. P. D. Sampson. Fitting conic sections to ‘very scattered’ data: An iterative refinement of the Bookstein algorithm. Computer Graphics and Image Processing 18(1), 97–108 (1982). 231. G. Sapiro. “Geometric Partial Differential Equations and Image Analysis”. Cambridge University Press (2001). 232. G. Sapiro and D. L. Ringach. Anisotropic diffusion of multivalued images with applications to color filtering. IEEE Transactions on Image Processing 5(11), 1582–1586 (1996). 233. J. Sauvola and M. Pietikäinen. Adaptive document image binarization. Pattern Recognition 33(2), 1135–1143 (2000). 234. H. Schildt. “Java: The Complete Reference”. McGraw-Hill, 12th ed. (2021). 235. C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(5), 530–535 (1997). 236. C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. International Journal of Computer Vision 37(2), 151–172 (2000). 237. E. L. Schwartz. Spatial mapping in the primate sensory projection: analytic structure and relevance to perception. Biological Cybernetics 25, 181–194 (1977). 238. Y. Schwarzer, editor. “Die Farbenlehre Goethes”. Westerweide Verlag, Witten (2004). 239. R. Sedgewick and K. Wayne. “Algorithms”. Pearson Education, fourth ed. (2011). 240. M. Seul, L. O’Gorman, and M. J. Sammon. “Practical Algorithms for Image Analysis”. Cambridge University Press, Cambridge (2000). 241. M. Sezgin and B. Sankur. Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging 13(1), 146–165 (2004). 242. L. G. Shapiro and G. C. Stockman. “Computer Vision”. Prentice Hall, Englewood Cliffs, NJ (2001). 243. G. Sharma and H. J. Trussell. Digital color imaging. IEEE Transactions on Image Processing 6(7), 901–932 (1997).



244. J. Shi and C. Tomasi. Good features to track. In “Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, CVPR’94”, pp. 593–600, Seattle, WA, USA (1994). 245. F. Y. Shih and S. Cheng. Automatic seeded region growing for color image segmentation. Image and Vision Computing 23(10), 877–886 (2005). 246. N. Silvestrini and E. P. Fischer. “Farbsysteme in Kunst und Wissenschaft”. DuMont, Cologne (1998). 247. S. N. Sinha, J.-M. Frahm, M. Pollefeys, and Y. Genc. Feature tracking and matching in video using programmable graphics hardware. Machine Vision and Applications 22(1), 207–217 (2011). 248. Y. Sirisathitkul, S. Auwatanamongkol, and B. Uyyanonvara. Color image quantization using distances between adjacent colors along the color axis with highest color variance. Pattern Recognition Letters 25, 1025–1043 (2004). 249. S. M. Smith and J. M. Brady. SUSAN—a new approach to low level image processing. International Journal of Computer Vision 23(1), 45–78 (1997). 250. B. Smolka, M. Szczepanski, K. N. Plataniotis, and A. N. Venetsanopoulos. Fast modified vector median filter. In “Proceedings of the 9th International Conference on Computer Analysis of Images and Patterns”, CAIP’01, pp. 570–580, London, UK (2001). Springer-Verlag. 251. M. Sonka, V. Hlavac, and R. Boyle. “Image Processing, Analysis and Machine Vision”. PWS Publishing, Pacific Grove, CA, second ed. (1999). 252. M. Spiegel and S. Lipschutz. “Schaum’s Outline of Vector Analysis”. McGraw-Hill, New York, second ed. (2009). 253. M. Stokes and M. Anderson. “A Standard Default Color Space for the Internet—sRGB”. Hewlett-Packard, Microsoft, www.w3.org/ Graphics/Color/sRGB.html (1996). 254. S. Süsstrunk. Managing color in digital image libraries. In P. Green and L. MacDonald, editors, “Colour Engineering”, ch. 17, pp. 385–419. Wiley, New York (2002). 255. Z. L. Szpak, W. Chojnacki, and A. van den Hengel. Guaranteed ellipse fitting with the Sampson distance. In “Proceedings of the European Conference on Computer Vision (ECCV’12)”, pp. 87–100, Florence, Italy (2012). Springer Berlin Heidelberg. 256. B. Tang, G. Sapiro, and V. Caselles. Color image enhancement via chromaticity diffusion. IEEE Transactions on Image Processing 10(5), 701–707 (2001). 257. C.-Y. Tang, Y.-L. Wu, M.-K. Hor, and W.-H. Wang. Modified SIFT descriptor for image matching under interference. In “Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC)”, pp. 3294–3300, Kunming, China (July 2008). 258. G. Taubin. Estimation of planar curves, surfaces, and nonplanar space curves defined by implicit equations with applications to edge and range image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(11), 1115–1138 (1991). 259. S. Theodoridis and K. Koutroumbas. “Pattern Recognition”. Academic Press, New York (1999). 260. C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In “Proceedings Int’l Conf. on Computer Vision”, ICCV’98, pp. 839–846, Bombay (1998). 261. F. Tomita and S. Tsuji. Extraction of multiple regions by smoothing in selected neighborhoods. IEEE Transactions on Systems, Man, and Cybernetics 7, 394–407 (1977).






262. V. J. Traver and A. Bernardino. A review of log-polar imaging for visual perception in robotics. Robotics and Autonomous Systems 58, 378–398 (2010). 263. Ø. D. Trier and T. Taxt. Evaluation of binarization methods for document images. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(3), 312–315 (1995). 264. E. Trucco and A. Verri. “Introductory Techniques for 3-D Computer Vision”. Prentice Hall, Englewood Cliffs, NJ (1998). 265. D. Tschumperlé. “PDEs Based Regularization of Multivalued Images and Applications”. PhD thesis, Université de Nice, Sophia Antipolis, France (2005). 266. D. Tschumperlé. Fast anisotropic smoothing of multi-valued images using curvature-preserving PDEs. International Journal of Computer Vision 68(1), 65–82 (2006). 267. D. Tschumperlé and R. Deriche. Diffusion PDEs on vector-valued images: local approach and geometric viewpoint. IEEE Signal Processing Magazine 19(5), 16–25 (2002). 268. D. Tschumperlé and R. Deriche. Vector-valued image regularization with PDEs: A common framework for different applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(4), 506–517 (2005). 269. K. Turkowski. Filters for common resampling tasks. In A. Glassner, editor, “Graphics Gems I”, pp. 147–165. Academic Press, New York (1990). 270. T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: A survey. Foundations and Trends in Computer Graphics and Vision 3(3), 177–280 (2008). 271. T. Tuytelaars and L. J. Van Gool. Matching widely separated views based on affine invariant regions. International Journal of Computer Vision 59(1), 61–85 (2004). 272. J. van de Weijer. “Color Features and Local Structure in Images”. PhD thesis, University of Amsterdam (2005). 273. M. I. Vardavoulia, I. Andreadis, and P. Tsalides. A new vector median filter for colour image processing. Pattern Recognition Letters 22(6-7), 675–689 (2001). 274. A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. www.vlfeat.org (2008). 275. F. R. D. Velasco. Thresholding using the ISODATA clustering algorithm. IEEE Transactions on Systems, Man, and Cybernetics 10(11), 771–774 (1980). 276. D. Vernon. “Machine Vision”. Prentice Hall (1999). 277. L. Vincent and P. Soille. Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(6), 583–598 (1991). 278. P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision 57(2), 137–154 (2004). 279. D. Wallner. Color management and transformation through ICC profiles. In P. Green and L. MacDonald, editors, “Colour Engineering”, ch. 11, pp. 247–261. Wiley, New York (2002). 280. A. Watt. “3D Computer Graphics”. Addison-Wesley, Reading, MA, third ed. (1999). 281. A. Watt and F. Policarpo. “The Computer Image”. AddisonWesley, Reading, MA (1999). 282. J. Weickert. “Anisotropic Diffusion in Image Processing”. PhD thesis, Universität Kaiserslautern, Fachbereich Mathematik (1996). 283. J. Weickert. A review of nonlinear diffusion filtering. In B. M. ter Haar Romeny, L. Florack, J. J. Koenderink, and M. A.



Viergever, editors, “Proceedings First International Conference on Scale-Space Theory in Computer Vision, Scale-Space’97”, Lecture Notes in Computer Science, pp. 3–28, Utrecht (July 1997). Springer.
284. J. Weickert. Coherence-enhancing diffusion filtering. International Journal of Computer Vision 31(2/3), 111–127 (1999).
285. J. Weickert. Coherence-enhancing diffusion of colour images. Image and Vision Computing 17(3/4), 201–212 (1999).
286. B. Weiss. Fast median and bilateral filtering. ACM Transactions on Graphics 25(3), 519–526 (2006).
287. M. Welk, J. Weickert, F. Becker, C. Schnörr, C. Feddern, and B. Burgeth. Median and related local filters for tensor-valued images. Signal Processing 87(2), 291–308 (2007).
288. P. Wendykier. “High Performance Java Software for Image Processing”. PhD thesis, Emory University (2009).
289. G. Wolberg. “Digital Image Warping”. IEEE Computer Society Press, Los Alamitos, CA (1990).
290. G. Wyszecki and W. S. Stiles. “Color Science: Concepts and Methods, Quantitative Data and Formulae”. Wiley–Interscience, New York, second ed. (2000).
291. Q. Yang, K.-H. Tan, and N. Ahuja. Real-time O(1) bilateral filtering. In “Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 557–564, Miami (2009).
292. S. D. Yanowitz and A. M. Bruckstein. A new method for image segmentation. Computer Vision, Graphics, and Image Processing 46(1), 82–95 (1989).
293. D. Young. Straight lines and circles in the log-polar image. In “Proceedings of the 11th British Machine Vision Conference (BMVC)”, pp. 426–435, Bristol, UK (September 2000).
294. G. W. Zack, W. E. Rogers, and S. A. Latt. Automatic measurement of sister chromatid exchange frequency. Journal of Histochemistry and Cytochemistry 25(7), 741–753 (1977).
295. P. Zamperoni. A note on the computation of the enclosed area for contour-coded binary objects. Signal Processing 3(3), 267–271 (1981).
296. E. Zeidler, editor. “Teubner-Taschenbuch der Mathematik”. B. G. Teubner Verlag, Leipzig, second ed. (2002).
297. T. Y. Zhang and C. Y. Suen. A fast parallel algorithm for thinning digital patterns. Communications of the ACM 27(3), 236–239 (1984).
298. S.-Y. Zhu, K. N. Plataniotis, and A. N. Venetsanopoulos. Comprehensive analysis of edge detection in color image processing. Optical Engineering 38(4), 612–625 (1999).
299. S. Zokai and G. Wolberg. Image registration using log-polar mappings for recovery of large-scale similarity and projective transformations. IEEE Transactions on Image Processing 14(10), 1422–1434 (2005).






Index



Symbols ∀, 803 ∃, 803 ∠, 476, 615, 636, 637, 737, 785, 801 ⋄, 96, 800 ÷, 501, 800 ∗, 96–98, 121, 271, 576, 642, 715, 800, 847 ⊛, 670, 800 ⊗, 800, 814, 873 ×, 800 ⊕, 171, 800 ⊖, 171, 800 ◦, 800 •, 800 ∂, 119, 480, 481, 801, 838, 839 ∇, 119, 476, 481, 526–529, 801, 838 ∇2 , 136, 518, 711, 801, 839, 885 ⊞, 799, 800, 805 ∪, 803 ∩, 803 \, 803 . . . , 801 . . . , 801 ∧, 801 ∨, 801 ∅, 801, 803 ∼, 800, 878 ≈, 800 ≡, 800 ←, 800 ⊕ , 800 ← :=, 801 | |, 800, 803 ∥ ∥, 800 ⌈ ⌉, 800 ⌊ ⌋, 800 0, 801 µ, 802, 871, 878 σ, 803 τ (interval), 803 & (operator), 900 | (operator), 380 / (operator), 800 % (operator), 897



& (operator), 380 >> (operator), 380