Data Structures and Algorithms: Table of Contents

Front Page
Course Outline

1. Introduction
2. Programming Strategies
   ❍ 2.1 Objects and ADTs
      ■ 2.1.1 An Example: Collections
   ❍ 2.2 Constructors and destructors
   ❍ 2.3 Data Structure
   ❍ 2.4 Methods
   ❍ 2.5 Pre- and post-conditions
   ❍ 2.6 C conventions
   ❍ 2.7 Error Handling
   ❍ 2.8 Some Programming Language Notes
3. Data Structures
   ❍ 3.1 Arrays
   ❍ 3.2 Lists
   ❍ 3.3 Stacks
      ■ 3.3.1 Stack Frames
   ❍ 3.4 Recursion
      ■ 3.4.1 Recursive Functions
      ■ 3.4.2 Example: Factorial
4. Searching
   ❍ 4.1 Sequential Searches
   ❍ 4.2 Binary Search
   ❍ 4.3 Trees
5. Complexity
   ❍ 5. Complexity (PS)
6. Queues
   ❍ 6.1 Priority Queues
   ❍ 6.2 Heaps
7. Sorting
   ❍ 7.1 Bubble
   ❍ 7.2 Heap
   ❍ 7.3 Quick
   ❍ 7.4 Bin
   ❍ 7.5 Radix
8. Searching Revisited
   ❍ 8.1 Red-Black trees
      ■ 8.1.1 AVL trees
   ❍ 8.2 General n-ary trees
   ❍ 8.3 Hash Tables
9. Dynamic Algorithms
   ❍ 9.1 Fibonacci Numbers
   ❍ 9.2 Binomial Coefficients
   ❍ 9.3 Optimal Binary Search Trees
   ❍ 9.4 Matrix Chain Multiplication
   ❍ 9.5 Longest Common Subsequence
   ❍ 9.6 Optimal Triangulation
10. Graphs
   ❍ 10.1 Minimum Spanning Tree
   ❍ 10.2 Dijkstra's Algorithm
11. Huffman Encoding
12. FFT
13. Hard or Intractable Problems
   ❍ 13.1 Eulerian or Hamiltonian Paths
   ❍ 13.2 Travelling Salesman's Problem
14. Games

Appendices
   A. ANSI C
   B. Source code listings
   C. Getting these notes

Slides
   Slides from 1998 lectures (PowerPoint).

Course Management
   ❍ Key Points from Lectures
   ❍ Workshops
   ❍ Past Exams
   ❍ Tutorials

Texts
   Texts available in UWA library
   Other on-line courses and texts

Algorithm Animations

© John Morris, 1998






Data Structures and Algorithms

John Morris, Electrical and Electronic Engineering, University of Western Australia

These notes were prepared for the Programming Languages and System Design course in the BE(Information Technology) course at the University of Western Australia. The course covers:

● Algorithm Complexity
   ❍ Polynomial and Intractable Algorithms
● Classes of Efficient Algorithms
   ❍ Divide and Conquer
   ❍ Dynamic
   ❍ Greedy
● Searching
   ❍ Lists
   ❍ Trees
      ■ Binary
      ■ Red-Black
      ■ AVL
      ■ B-trees and other m-way trees
      ■ Optimal Binary Search Trees
   ❍ Hash Tables
● Queues
   ❍ Heaps and Priority Queues
● Sorting
   ❍ Quick
   ❍ Heap
   ❍ Bin and Radix
● Graphs
   ❍ Minimum Spanning Tree
   ❍ Dijkstra's Algorithm
● Huffman Encoding
● Fast Fourier Transforms
● Matrix Chain Multiplication
● Intractable Problems
● Alpha-Beta search

The algorithm animations were mainly written by Woi Ang with contributions by Chien-Wei Tan, Mervyn Ng, Anita Lee and John Morris.

Table of Contents

© John Morris, 1998



PLDS210 - Programming Languages and Data Structures



Course Synopsis

This course will focus on data structures and algorithms for manipulating them. Data structures for storing information in tables, lists, trees, queues and stacks will be covered. Some basic graph and discrete transform algorithms will also be discussed. You will also be introduced to some basic principles of software engineering: good programming practice for "long-life" software. For a full list of topics to be covered, view the table of contents page for the lecture notes.



Lectures - 1998

There are two lectures every week:

    Monday   12 pm   E273
    Tuesday  12 pm   AG11



Lecture Notes

A set of notes for this course is available on the Web. From the table of contents page you can jump to any section of the course. There is a home page set up for student information:

    http://www.ee.uwa.edu.au/internal/ug.courses.html

on which you will find an entry for course information; you can follow the links to this page and the notes themselves. You can also go directly to the PLSD210 page:

    http://www.ee.uwa.edu.au/~plsd210/ds/plds210.html









Note that the Web pages use the string plds (programming languages and data structures) - a historical accident, which we retain because this label describes the content more accurately!



Printed Notes

For a ridiculously low price, you can obtain a pre-printed copy of the notes from the bookshop. You are strongly advised to do so, as this will enable you to avoid laboriously taking notes in lectures and concentrate on understanding the material. (It will also save you a large amount of time printing each page from a Web browser!) The printed notes accurately represent the span of the course: you will be specifically advised if examinable material not appearing in these notes is added to the course. (But note that anything appearing in laboratory exercises and assignments is automatically considered examinable: this includes the feedback notes!) However, the Web notes are undergoing constant revision and improvement (comments are welcome!) so you are advised to browse through the Web copies for updated pages. You'll be advised in lectures if there is a substantial change to any section.

Textbooks

The material on data structures and algorithms may be found in many texts: lists of reference books in the library are part of the Web notes. The Web notes are, of necessity, abbreviated and should not be considered a substitute for studying the material in texts.

Web browsers

Web browsers have varying capabilities: the notes were checked with Netscape 2 - but should read intelligently with other browsers. If you have problems, I would be interested to know about them, but please note that updating these notes, adding the animations, tutoring and marking your assignments for this course have priority: problems with other browsers, your home computer, etc, will only be investigated if time permits.



Using the notes

The notes make use of the hypertext capabilities of Web browsers: you will find highlighted links to subsidiary information scattered throughout the text. Occasionally these links will point to Web resources which may be located off campus and take some time to download: you may find it productive to use the "Stop" facility on the browser to abort the current fetch - you can try again later when the Net is less heavily loaded. In all cases, the browser's "Back" command should take you back to the original page.

Program source






Example source code for programs will sometimes pop up in a separate window. This is to enable you to scan the code while referring to the notes in the main page. You will probably need to move the source code page out of the way of the main page. When you have finished with the source code page, select File:Close to close the window. Selecting File:Exit will close the window and exit from Netscape - possibly not your intention!



Tutorials - 1997

Exercises for the tutorials and laboratory sessions are also found in the Web pages.

Tutorial Times

    Weeks   Time            Location   Groups
    4-13    Thursday 9 am   E273       2ic, 2it1-3
    4-13    Thursday 2 pm   E269       rest



The first tutorial will be in the fourth week of semester. As long as one tutorial group does not become ridiculously overloaded, you may go to whichever tutorial suits you best.



Laboratory Sessions - 1998

There will be two formal introductory laboratory sessions early in the semester - watch these pages for the final details. These sessions will be in laboratory G.50. After the first two laboratories, a tutor will be available in G.50 every week at times to be advertised. The tutor will advise on any problems related to the whole course: assignments, lecture material, etc. You will be able to complete the assignment on any machine which has an ANSI C compiler. Assignments will be submitted electronically: programs may be submitted on the SGI machines, and the NT systems in 1.51 may also be used - refer to the submission instructions. Note that you are expected to write ANSI standard C which will run on any machine: programs which won't run on our SGIs risk failure! In 1998, Java programs written to an acceptable standard will also be accepted. (The standard required for C is set out explicitly: ensure that you understand how to translate the important elements of this to Java before starting work in Java. Seek feedback if uncertain!)



Assessment

    Assignments            20%
    Written Exam (3 hrs)   80%

As with all other courses with a practical component, the practical assignments are compulsory. Failure to obtain a satisfactory grade in the practical component of the course may cause you to be given a 0 for this component of PLSD210. Since this will make it virtually impossible to obtain more than a faculty pass for the year, failure to do the practical assignments will not only cause you to miss some feedback which may well be useful to you in the written exam, but may cause you to fail the whole unit. A "satisfactory" grade in assignments is more than 40% overall. Any less will put your whole year at risk. A much safer course is to do the assignments conscientiously, making sure that you understand every aspect of them: assume that the effort put into them will improve your examination mark also.



Assignments - 1998

Four assignment exercises will be set for the semester. You should be able to complete most of the first two assignments during the initial laboratory sessions. The 3rd and 4th are more substantial. Completed assignments (which should include a summary report, the program code and any relevant output) should be submitted by following the submission instructions at the end of the Web page. Performance on the assignments will be 20% of your overall assessment for the unit.



Assignments 1 & 2

These will be relatively short and should require only 1 or 2 hours extra work to complete. They contribute 6% of your final assessment. These assignments will provide some feedback on what is expected for the remaining two assignments. You may even find that you can use the (corrected) code from these assignments in the later assignments.



Assignments 3 & 4

For these two assignments, you will be expected to implement one algorithm and test another. You will be assigned an algorithm to implement as assignment 3. You may obtain from one of your class colleagues an implementation of any other algorithm and test it for assignment 4. You must submit them by the dates shown on the assignment sheets. They will constitute the remaining 14% of your assignment assessment.



A minimum standard must be obtained in the assignments to pass the unit as a whole. Failure to attempt the assignments will put you at a severe disadvantage in the exam.

Assignment reports






Each assignment submission should be accompanied by a summary report. The report should be clear and concise: it is unlikely that you will need to write more than 2 A4 pages (or about 120 lines of text).

Report Format



The report should be in plain ASCII text. The 'native form' of any word processor will be rejected. If you prefer to use a word processor to prepare your report, then ensure that you export a plain text file for submission when you have finished: all word processors have this capability. This allows you to concentrate on the content of the report, rather than the cosmetics of its format. However, the general standards for report structure and organisation (title, authors, introduction, body grouped into related paragraphs, conclusion, etc) expected for any other unit apply here also.

Communication

This course attempts to be "paperless" as much as possible! Assignments will be submitted electronically and comments will be emailed back to you. Please ensure that your reports include email addresses of all authors. The preferred method for communication with the lecturer and tutor(s) is, at least initially, email. All routine queries will be handled this way: we will attempt to respond to all email messages by the next day. If you have more complex problems, email for an appointment (suggest a few times when you will be free). You may of course try to find me in my office at any time (but early in the morning is likely to be a waste of time), but emailing for an appointment first ensures you some priority and enables you to avoid wasting a trip to the 4th floor when there may be zero probability of success!



Continue on to the lecture notes.

© John Morris, 1996









1. Introduction

This course is designed to teach you how to program efficiently. It assumes that:



● you know the basics of programming in C,
● can write, debug and run simple programs in C, and
● have some simple understanding of object-oriented design.



An introduction to object-oriented programming using ANSI standard C may be found in the companion Object First course.



Good Programs

There are a number of facets to good programs: they must



a. run correctly
b. run efficiently
c. be easy to read and understand
d. be easy to debug and
e. be easy to modify.



What does correct mean? We need to have some formal notion of the meaning of correct: thus we define it to mean "run in accordance with the specifications".



The first of these is obvious - programs which don't run correctly are clearly of little use. "Efficiently" is usually understood to mean in the minimum time - but occasionally there will be other constraints, such as memory use, which will be paramount. As will be demonstrated later, better running times will generally be obtained from use of the most appropriate data structures and algorithms, rather than through "hacking", i.e. removing a few statements by some clever coding - or even worse, programming in assembler! This course will focus on solving problems efficiently: you will be introduced to a number of fundamental data structures and algorithms (or procedures) for manipulating them.

The importance of the other points is less obvious. The early history of many computer installations is, however, testimony to their importance. Many studies have quantified the enormous costs of failing to build software systems that had all the characteristics listed. (A classic reference is Boehm's text.) Unfortunately, much recent evidence suggests that these principles are still not well understood! Any perusal of the Risks forum will soon convince you that there is an enormous amount of poor software in use. The discipline of software engineering is concerned with building large software systems which perform as their users expected, are reliable and easy to maintain. This course will introduce some software engineering principles but we will concentrate on the creation of small programs only.

By using well-known, efficient techniques for solving problems, not only do you






produce correct and fast programs in the minimum time, but you make your programs easier to modify. Another software engineer will find it much simpler to work with a well-known solution than something that has been hacked together and "looks a bit like" some textbook algorithm.



Key terms



correct
    A correct program runs in accordance with its specifications.
algorithm
    A precisely specified procedure for solving a problem.

Continue on to Programming Strategies



Back to the Table of Contents



© John Morris, 1998






Software Engineering

The discipline of Software Engineering was founded when it was discovered that large numbers of software projects



● exceeded their budgets,
● were late,
● were riddled with errors and
● did not satisfy their users' needs.



The term is believed to have been coined by a NATO study group in 1967. The first software engineering conference was the NATO Software Engineering Conference held in Garmisch, Germany in 1968.



Software Engineering References

One of the classic early texts is Boehm's book:

    B.W. Boehm, "Software Engineering Economics", Prentice-Hall, 1981



The Risks Forum

A continuing saga of problems with software systems is chronicled in the "Risks" section of the ACM journal, "Software Engineering Notes". The Risks section has appeared in every issue for more than ten years, i.e. there has been no shortage of material to keep it alive for all that time!



Notes

Clever programming
    By the end of this course, you should understand that hacking is far from clever: there are much more effective strategies for making programs run faster!

Back to Introduction
Back to the Table of Contents
© John Morris, 1998






2. Programming Strategies

It is necessary to have some formal way of constructing a program so that it can be built efficiently and reliably. Research has shown that this can be best done by decomposing a program into suitable small modules, which can themselves be written and tested before being incorporated into larger modules, which are in turn constructed and tested. The alternative is to create what was often called "spaghetti code" because of its tangle of statements and jumps. Many expensive, failed projects have demonstrated that, however much you like to eat spaghetti, using it as a model for program construction is not a good idea!

It's rather obvious that if we split any task into a number of smaller tasks which can be completed individually, then the management of the larger task becomes easier. However, we need a formal basis for partitioning our large task into smaller ones. The notion of abstraction is extremely useful here. Abstractions are high level views of objects or functions which enable us to forget about the low level details and concentrate on the problem at hand.

To illustrate, a truck manufacturer uses a computer to control the engine operation - adjusting fuel and air flow to match the load. The computer is composed of a number of silicon chips, their interconnections and a program. These details are irrelevant to the manufacturer - the computer is a black box to which a host of sensors (for engine speed, accelerator pedal position, air temperature, etc) are connected. The computer reads these sensors and adjusts the engine controls (air inlet and fuel valves, valve timing, etc) appropriately. Thus the manufacturer has a high level or abstract view of the computer. He has specified its behaviour with statements like: "When the accelerator pedal is 50% depressed, air and fuel valves should be opened until the engine speed reaches 2500 rpm". He doesn't care how the computer calculates the optimum valve settings - for instance it could use either integer or floating point arithmetic - he is only interested in behaviour that matches his specification.

In turn, the manager of a transport company has an even higher level or more abstract view of a truck. It's simply a means of transporting goods from point A to point B in the minimum time allowed by the road traffic laws. His specification contains statements like: "The truck, when laden with 10 tonnes, shall need no more than 20 l/100 km of fuel when travelling at 110 kph." How this specification is achieved is irrelevant to him: it matters little whether there is a control computer or some mechanical engineer's dream of cams, rods, gears, etc.

There are two important forms of abstraction: functional abstraction and structural abstraction. In functional abstraction, we specify a function for a module, i.e.



"This module will sort the items in its input stream into ascending order based on an ordering rule for the items and place them on its output stream."

As we will see later, there are many ways to sort items - some more efficient than others. At this level, we are not concerned with how the sort is performed, but simply that the output is sorted according to our ordering rule.

The second type of abstraction - structural abstraction - is better known as object orientation. In this approach, we construct software models of the behaviour of real world items, i.e. our truck manufacturer, in analysing the performance of his vehicle, would employ a software model of the control computer. For him, this model is abstract - it could mimic the behaviour of the real computer by simply providing a behavioural model with program statements like:

    if ( pedal_pos > 50.0 ) {
        set_air_intake( 0.78*pedal_pos );
        set_fuel_valve( 0.12 + 0.32*pedal_pos );
    }

Alternatively, his model could incorporate details of the computer and its program. However, he isn't concerned: the computer is a "black box" to him and he's solely concerned with its external behaviour. To simplify the complexity of his own model (the vehicle as a whole), he doesn't want to concern himself with the internal workings of the control computer; he wants to assume that someone else has correctly constructed a reliable model of it for him.



Key terms



hacking
    Producing a computer program rapidly, without thought and without any design methodology.

Continue on to Objects and ADTs
Back to the Table of Contents
© John Morris, 1998






2.1 Objects and ADTs

In this course, we won't delve into the full theory of object-oriented design. We'll concentrate on the pre-cursor of OO design: abstract data types (ADTs). A theory for the full object oriented approach is readily built on the ideas for abstract data types.

An abstract data type is a data structure and a collection of functions or procedures which operate on the data structure. To align ourselves with OO theory, we'll call the functions and procedures methods and the data structure and its methods a class, i.e. we'll call our ADTs classes. However our classes do not have the full capabilities associated with classes in OO theory. An instance of the class is called an object. Objects represent objects in the real world and appear in programs as variables of a type defined by the class. These terms have exactly the same meaning in OO design methodologies, but there they have additional properties such as inheritance that we will not discuss here.

It is important to note that object orientation is a design methodology. As a consequence, it is possible to write OO programs using languages such as C, Ada and Pascal. The so-called OO languages such as C++ and Eiffel simply provide some compiler support for OO design: this support must be provided by the programmer in non-OO languages.



2.2 An Example: Collections

Programs often deal with collections of items. These collections may be organised in many ways and use many different program structures to represent them, yet, from an abstract point of view, there will be a few common operations on any collection. These might include:

create
    Create a new collection
add
    Add an item to a collection
delete
    Delete an item from a collection
find
    Find an item matching some criterion in the collection
destroy
    Destroy the collection



2.2.1 Constructors and destructors

The create and destroy methods - often called constructors and destructors - are usually implemented for any abstract data type. Occasionally, the data type's use or semantics are such that there is only ever one object of that type in a program. In that case, it is possible to hide even the object's `handle'






from the user. However, even in these cases, constructor and destructor methods are often provided. Of course, specific applications may call for additional methods, e.g. we may need to join two collections (form a union in set terminology) - or may not need all of these. One of the aims of good program design would be to ensure that additional requirements are easily handled.



2.2.2 Data Structure

To construct an abstract software model of a collection, we start by building the formal specification. The first component of this is the name of a data type - this is the type of objects that belong to the collection class. In C, we use typedef to define a new type which is a pointer to a structure:

    typedef struct collection_struct *collection;

Note that we are defining a pointer to a structure only; we have not specified details of the attributes of the structure. We are deliberately deferring this - the details of the implementation are irrelevant at this stage. We are only concerned with the abstract behaviour of the collection. In fact, as we will see later, we want to be able to substitute different data structures for the actual implementation of the collection, depending on our needs.

The typedef declaration provides us with a C type (class in OO design parlance), collection. We can declare objects of type collection wherever needed. Although C forces us to reveal that the handle for objects of the class is a pointer, it is better to take an abstract view: we regard variables of type collection simply as handles to objects of the class and forget that the variables are actually C pointers.



2.2.3 Methods

Next, we need to define the methods:

    collection ConsCollection( int max_items, int item_size );
    void AddToCollection( collection c, void *item );
    void DeleteFromCollection( collection c, void *item );
    void *FindInCollection( collection c, void *key );

Note that we are using a number of C "hacks" here. C - even in ANSI standard form - is not exactly the safest programming language in the sense of the support it provides for the engineering of quality software. However, its portability and extreme popularity mean that it is a practical choice for even large software engineering projects. Unfortunately, C++, because it is based on C, isn't much better. Java, the latest fad in the software industry, shows some evidence that its designers have learned from experience (or actually read some of the literature in programming language research!) and has eliminated some of the more dangerous features of C.
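As a brief illustration - a sketch only, not part of the original notes - a client might use these methods as below. The item type (int) and the trivial ItemKey function are invented for the example; the array implementation described later expects the client to supply ItemKey:

    /* Hypothetical client code (sketch) */
    #include "collection.h"

    void *ItemKey( void *item )
    {
        return item;    /* here each item serves as its own key */
    }

    int main( void )
    {
        int x = 42;
        collection c = ConsCollection( 100, sizeof( int ) );  /* create   */
        AddToCollection( c, &x );                             /* add item */
        DeleteFromCollection( c, &x );                        /* delete   */
        return 0;
    }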






Just as we defined our collection object as a pointer to a structure, we assume that the objects which belong in this collection are themselves represented by pointers to data structures. Hence in AddToCollection, item is typed void *. In ANSI C, void * will match any pointer - thus AddToCollection may be used to add any object to our collection. Similarly, key in FindInCollection is typed void *, as the key which is used to find any item in the collection may itself be some object. FindInCollection returns a pointer to the item which matches key, so it also has the type void *.

The use of void * here highlights one of the deficiencies of C: it doesn't provide the capability to create generic objects, cf the ability to define generic packages in Ada or templates in C++. Note there are various other "hacks" to overcome C's limitations in this area. One uses the preprocessor. You might like to try to work out an alternative approach and try to convince your tutor that it's better than the one set out here!



2.2.4 Pre- and post-conditions

No formal specification is complete without pre- and post-conditions. A useful way to view these is as forming a contract between the object and its client. The pre-conditions define a state of the program which the client guarantees will be true before calling any method, whereas the post-conditions define the state of the program that the object's method will guarantee to create for you when it returns.

Again C (unlike Eiffel, for example) provides no formal support for pre- and post-conditions. However, the standard does define an assert function which can (and should!) be used to verify pre- and post-conditions [man page for assert]. We will see how this is used when we examine an implementation of our collection object. Thus pre- and post-conditions should be expressed as comments accompanying the method definition. Adding pre- and post-conditions to the collection object would produce:

Select to load collection.h

Aside

In order to keep the discussion simple at this stage, a very general specification of a collection has been implied by the definitions used here. Often, we would restrict our specification in various ways: for example, by not permitting duplicates (items with the same key) to be added to the collection. With such a collection, the pre- and post-conditions can be made more formal:

Select to load ucollection.h

Note how the pre- and post-conditions now use the FindInUCollection function to more precisely define the state of the object before and after the method has been invoked. Such formal pre- and post-conditions are obviously much more useful than the informal English ones previously specified. They are also easier to translate to appropriate assertions as will be seen when the implementation is constructed.
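As a sketch of how such formal conditions translate into assertions (using names from ucollection.h; the body of the method is elided):

    void AddToUCollection( u_collection c, void *item )
    {
        /* pre-conditions from the specification */
        assert( c != NULL );
        assert( item != NULL );
        assert( FindInUCollection( c, ItemKey( item ) ) == NULL );

        /* ... add the item ... */

        /* post-condition from the specification */
        assert( FindInUCollection( c, ItemKey( item ) ) != NULL );
    }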






2.2.5 C conventions

This specification - which is all a user or client of this object needs to see (he isn't interested in the implementation details) - would normally be placed in a file with a .h (h = header) suffix to its name. For the collection, we would place the specifications in files called collection.h and ucollection.h and use the C #include facility to import them into programs which needed to use them. The implementation or body of the class is placed in a file with a .c suffix.

References

Some additional sources of information on Object Orientation:

● What is Object-Oriented Software?
● Basic Principles and Concepts of Object-Orientation - an extensive set of notes on OO concepts. Unfortunately most of the bibliography links seem to be outdated.
● Object Oriented Programming - notes for a class at Indiana University (based on Objective-C).
● Object Oriented Programming Languages - summary of OO programming languages (with links to full details).



Key terms



abstract data type (ADT)
    A data structure and a set of operations which can be performed on it. A class in object-oriented design is an ADT. However, classes have additional properties (inheritance and polymorphism) not normally associated with ADTs.

Continue on to Error Handling
Continue on to Arrays
Back to the Table of Contents
© John Morris, 1998






Eiffel

Eiffel is an object-oriented language based on Ada designed by Bertrand Meyer. It provides formal support for pre-conditions by allowing a programmer to insert assert statements at appropriate points in his code.



Reference

Bertrand Meyer, "Eiffel: The Language", Prentice Hall, 1992. ISBN 0-13-247925-7.

There is also an Eiffel Home Page.






C hacks

C allows you to define a void pointer. This pointer will match a pointer to any type. When you invoke a function whose formal parameter is a void *, you can use any pointer as the actual argument. This allows you to build generic functions which will operate on a variety of data types. However it does bypass the compiler's type checking - and allows you to inadvertently make mistakes that a compiler for a strongly typed language such as Ada would detect for you.
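As a small illustration - a sketch, not from the original notes - of both the power and the danger:

    #include <string.h>

    /* Generic swap using void *: one function works for any item type,
       but the compiler cannot check that a, b, tmp and size agree */
    void swap( void *a, void *b, void *tmp, size_t size )
    {
        memcpy( tmp, a, size );
        memcpy( a, b, size );
        memcpy( b, tmp, size );
    }

A call like swap( &x, &y, &t, sizeof(int) ) works for three int variables - but the compiler will just as happily accept sizeof(double) there, exactly the kind of mistake a strongly typed language would have caught.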



Hacking

This term probably arises from the MIT expression for what we know in English as a "student prank". MIT students refer to these assaults on the conventions of society as "hacks".






Appendix A: Languages

A.1 ANSI C

Function prototypes
ANSI C Compilers



A.2 C++

A.3 Java

Designed by a group within Sun Microsystems, Java has eliminated some of the more dangerous features of C (to the undoubted disappointment of some hackers - who probably achieve their daily highs from discovering new ways to program dangerously in C!). A host of texts on Java have now appeared - possibly setting a new record for the rate of textbook production on any one subject!



References

Back to the Table of Contents
© John Morris, 1998






Ada



Ada was designed in response to a US Department of Defense initiative which sought a common higher order language for all defense applications. Jean Ichbiah's team at Honeywell Bull won the competition for the new language, which was named after Ada Augusta, daughter of Lord Byron and Countess of Lovelace. Ada was Babbage's assistant and thus may claim the title of the first computer programmer.

Ada is now an ANSI and ISO standard: Reference Manual for the Ada Programming Language, ANSI/MIL-STD-1815A-1983, Feb, 1983. The language reference manual may also be found as an appendix in some texts, eg J.G.P. Barnes, "Programming in Ada plus Language Reference Manual", 3rd ed, Addison-Wesley, 1991. ISBN 0-201-56539-0.

The Ada initiative pre-dates the discovery of object-oriented design. However, it does support many OO design strategies. It provides excellent support for the construction of Abstract Data Types through use of the package and private data type facilities. An object oriented Ada, "Ada 95", has been defined.

An on-line version of the Ada Language Reference Manual is available. It is also available from a number of other sites: any Internet search engine should be able to locate the nearest one for you. A full set of Ada resources, the Ada rationale, the Ada Information Clearing House (Ada IC), etc, is available at the Swiss Federal Institute of Technology in Lausanne (EPFL)'s Ada Resources page.









assert

ASSERT(3V)               C LIBRARY FUNCTIONS                ASSERT(3V)

NAME
     assert - program verification

SYNOPSIS
     #include <assert.h>

     assert(expression)

DESCRIPTION
     assert() is a macro that indicates expression is expected to be
     true at this point in the program.  If expression is false (0),
     it displays a diagnostic message on the standard output and
     exits (see exit(2V)).  Compiling with the cc(1V) option
     -DNDEBUG, or placing the preprocessor control statement

         #define NDEBUG

     before the ``#include <assert.h>'' statement effectively deletes
     assert() from the program.

SYSTEM V DESCRIPTION
     The System V version of assert() calls abort(3) rather than
     exit().

SEE ALSO
     cc(1V), exit(2V), abort(3)

DIAGNOSTICS
     Assertion failed: file f line n
          The expression passed to the assert() statement at line n
          of source file f was false.

SYSTEM V DIAGNOSTICS
     Assertion failed: expression, file f, line n
          The expression passed to the assert() statement at line n
          of source file f was false.



collection.h

    /* Specification for Collection */

    typedef struct t_Collection *Collection;

    Collection ConsCollection( int max_items, int item_size );
    /* Construct a new Collection
       Pre-condition: max_items > 0
       Post-condition: returns a pointer to an empty Collection */

    void AddToCollection( Collection c, void *item );
    /* Add an item to a Collection
       Pre-condition: (c is a Collection created by a call to ConsCollection) &&
                      (existing item count < max_items) && (item != NULL)
       Post-condition: item has been added to c */

    void DeleteFromCollection( Collection c, void *item );
    /* Delete an item from a Collection
       Pre-condition: (c is a Collection created by a call to ConsCollection) &&
                      (existing item count >= 1) && (item != NULL)
       Post-condition: item has been deleted from c */

    void *FindInCollection( Collection c, void *key );
    /* Find an item in a Collection
       Pre-condition: c is a Collection created by a call to ConsCollection
                      key != NULL
       Post-condition: returns an item identified by key if one exists,
                       otherwise returns NULL */






ucollection.h

    /* Specification of a collection which contains unique items only.
       Assumes that the items of which the collection is composed supply
       an ItemKey method which returns the pointer to an identifying key
       for the item */

    typedef struct u_collection_struct *u_collection;

    u_collection ConsUCollection( int max_items );
    /* Construct a new collection
       Pre-condition: None
       Post-condition: returns a pointer to an empty collection */

    void AddToUCollection( u_collection c, void *item );
    /* Add an item to a collection
       Pre-condition: (c was created by a call to ConsUCollection) &&
                      (item != NULL) &&
                      (FindInUCollection(c,ItemKey(item)) == NULL)
       Post-condition: FindInUCollection(c,ItemKey(item)) != NULL */

    void DeleteFromUCollection( u_collection c, void *item );
    /* Delete an item from a collection
       Pre-condition: (c was created by a call to ConsUCollection) &&
                      (item != NULL) &&
                      (FindInUCollection(c,ItemKey(item)) != NULL)
       Post-condition: FindInUCollection(c,ItemKey(item)) == NULL */

    void *FindInUCollection( u_collection c, void *key );
    /* Find an item in a collection
       Pre-condition: c is a collection created by a call to ConsUCollection
                      key != NULL
       Post-condition: returns an item identified by key if one exists,
                       otherwise returns NULL */






2.7 Error Handling

No program or program fragment can be considered complete until appropriate error handling has been added. Unexpected program failures are a disaster - at the best, they cause frustration because the program user must repeat minutes or hours of work, but in life-critical applications, even the most trivial program error, if not processed correctly, has the potential to kill someone.

If an error is fatal, in the sense that a program cannot sensibly continue, then the program must be able to "die gracefully". This means that it must



● inform its user(s) why it died, and
● save as much of the program state as possible.



2.7.1 Defining Errors

The first step in determining how to handle errors is to define precisely what is considered to be an error. Careful specification of each software component is part of this process. The pre-conditions of an ADT's methods will specify the states of a system (the input states) which a method is able to process. The post-conditions of each method should clearly specify the result of processing each acceptable input state. Thus, if we have a method:

    int f( some_class a, int i )
    /* PRE-CONDITION: i >= 0 */
    /* POST-CONDITION:
       if ( i == 0 ) return 0 and a is unaltered
       else return 1 and update a's i-th element by .... */

This specification tells us that

● i==0 is a meaningless input that f should flag by returning 0 but otherwise ignore,
● f is expected to handle correctly all positive values of i,
● the behaviour of f is not specified for negative values of i, ie it also tells us that
● it is an error for a client to call f with a negative value of i.



Thus, a complete specification will specify

● all the acceptable input states, and
● the action of a method when presented with each acceptable input state.



By specifying the acceptable input states in pre-conditions, it will also divide responsibility for errors unambiguously.













● The client is responsible for the pre-conditions: it is an error for the client to call the method with an unacceptable input state, and
● The method is responsible for establishing the post-conditions and for reporting errors which occur in doing so.
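A sketch - not from the original notes - of how this division of responsibility might look in code; some_class is a hypothetical handle type, and the method guards its pre-condition with an assert so that a client's violation fails loudly during testing:

    #include <assert.h>

    typedef struct t_some_class *some_class;    /* hypothetical type */

    int f( some_class a, int i )
    {
        assert( i >= 0 );         /* a negative i is the client's error */
        if ( i == 0 ) return 0;   /* meaningless input: flag and ignore */
        /* ... update a's i-th element ... */
        return 1;
    }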



2.7.2 Processing errors

Let's look at an error which must be handled by the constructor for any dynamically allocated object: the system may not be able to allocate enough memory for the object. A good way to create a disaster is to do this:

    X ConsX( .... )
    {
        X x = malloc( sizeof(struct t_X) );
        if ( x == NULL )
        {
            printf("Insuff mem\n");
            exit( 1 );
        }
        else
            .....
    }

Not only is the error message so cryptic that it is likely to be little help in locating the cause of the error (the message should at least be "Insuff mem for X"!), but the program will simply exit, possibly leaving the system in some unstable, partially updated, state. This approach has other potential problems:

● What if we've built this code into some elaborate GUI program with no provision for "standard output"? We may not even see the message as the program exits!
● We may have used this code in a system, such as an embedded processor (a control computer), which has no way of processing an output stream of characters at all.
● The use of exit assumes the presence of some higher level program, eg a Unix shell, which will capture and process the error code 1.

As a general rule, I/O is non-portable! A function like printf will produce error messages on the 'terminal' window of your modern workstation, but if you are running a GUI program like Netscape, where will the messages go? So, the same function may not produce useful diagnostic output for two programs running in different environments on the same processor! How can we expect it to be useful if we transport this program to another system altogether, eg a Macintosh or a Windows machine?

Before looking at what we can do in ANSI C, let's look at how some other languages tackle this problem.
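One portable alternative - a sketch only, not the strategy developed in the following pages - is for the constructor to report failure through its return value and leave the response to the caller, which may know more about the environment:

    #include <stdlib.h>

    typedef struct t_X { int attribute; } *X;   /* hypothetical object */

    X ConsX( void )
    {
        X x = malloc( sizeof(struct t_X) );
        /* NULL tells the caller "no memory": the caller - GUI, text
           terminal or embedded system - decides how to report it */
        return x;
    }

This keeps the re-usable code free of I/O, but it relies on every caller checking the return value: as the next section notes, it is only one of several ad hoc styles that a uniform mechanism like Ada's exceptions replaces.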






Continue on to Ada Exceptions
Back to the Table of Contents
© John Morris, 1998






3. Data Structures

In this section, we will examine some fundamental data structures: arrays, lists, stacks and trees.



3.1 Arrays

The simplest way to implement our collection is to use an array to hold the items. Thus the implementation of the collection object becomes:

    /* Array implementation of a collection */
    #include <assert.h>     /* Needed for assertions */
    #include "collection.h" /* import the specification */

    struct t_collection {
        int item_cnt;
        int max_cnt;        /* Not strictly necessary */
        int item_size;      /* Needed by FindInCollection */
        void *items[];
    };



Points to note:

a. We have imported the specification of this object into the implementation - this enables the compiler to verify that the implementation and the specification match. Although it's not necessary to include the specification (cf function prototypes), it is much safer to do so as it enables the compiler to detect some common errors and ensures that the specification and its implementation remain consistent when the object is changed.
b. items is typed as an array of void * in the struct. It is an array of items which happen to be pointers - but remember that we are trying to hide this from users of the class. Many C programmers would write the equivalent void ** here.

A question:

● Why is the attribute max_cnt not strictly necessary?
  Hint: it's related to the pre- and post-conditions specified for methods on this object.



The implementations of the methods are: Select here to load collection.c






Points to note:

a. ConsCollection uses the memory allocator calloc to dynamically allocate memory off the program's heap for the collection. Two calls are necessary - one to allocate space for the "header" structure itself and one to allocate space for the array of item pointers.
b. assert calls have been added for the pre-conditions (cf full description of assert). Note that the pre-conditions here are expressed as a number of conditions linked by &&. Since assert requires a single boolean expression as its argument, one assert would suffice. However, we have chosen to implement each individual condition as a separate assert. This is done to assist de-bugging: if the pre-conditions are not satisfied, it is more helpful to know which one of multiple conditions has not been satisfied!
c. memcmp is a standard function which compares blocks of memory byte by byte [man page for memcmp].
d. The use of memcmp and ItemKey severely constrain the form of the key - it must be in a contiguous string of characters in the item. There are ways of providing more flexible keys (eg ones having multiple fields within item, or ones calculated from item). These rely on C capabilities which will be discussed in a later section.
e. There is no treatment of errors, e.g. if no memory is available on the heap for calloc. This is a serious shortcoming. No software without a consistent strategy for detecting, reporting and recovering from errors can be considered well engineered. It is difficult to debug, prone to crashes from faults which are difficult to correct because there is no indication of the source of the error. Error handling is addressed in a later section.



Key terms



hacking
    Producing a computer program rapidly, without thought and without any design methodology.

Continue on to Lists
Back to the Table of Contents
© John Morris, 1998






Ada Exceptions

Ada defines an EXCEPTION which may be processed in an exception handler at any level in the program above that where the exception was generated or RAISEd.

    PACKAGE adtX IS
        TYPE X IS PRIVATE;
        out_of_range : EXCEPTION;
        PROCEDURE f( a: INOUT X; b: INTEGER );
    END adtX;

    PACKAGE BODY adtX IS
        PROCEDURE f( a: INOUT X; b: INTEGER ) IS
        BEGIN
            ......
            IF b < some_limit THEN
                -- Normal processing
            ELSE
                RAISE out_of_range;
            END IF;
        END f;
    END adtX;

This package exports the exception out_of_range which may be caught in any routine that uses f.

    WITH adtX; USE adtX;        -- Import adtX

    PROCEDURE g( ... ) IS
    BEGIN
        ...
        f( a, n );              -- Invoke method f
        ...                     -- Continue here if exception not raised
        ....                    -- Return from here if no errors
    EXCEPTION
        WHEN out_of_range =>
            ...                 -- process the exception
    END g;

In this example, the exception was processed in the procedure, g, which called the function, f, in which it was raised. The code processing the exception is any set of Ada statements: it could even raise another exception.






If the exception is not 'caught' it is propagated up the call stack until it encounters an exception handler prepared to process it. (If there are no exception handlers, then it will propagate to the highest level and cause the program to abort. However an implementation would be expected to print out the name of the exception causing the abort.)

Because they are propagated to arbitrarily high levels of an Ada program, it is easy to arrange for Ada exceptions to be caught at some level where there is an appropriate interface for dealing with them. For example, in a GUI program, the routines which handle interaction with a user through the windows, mouse events, keyboard input, etc, are generally at the highest level in the program. These routines "know" how to pop up the alert box that tells the user that a problem has occurred and force him or her to take some action to correct the problem. Alternatively, in an embedded processor, they would "know" to send a message via a communications channel to some master processor.

Lower level, re-usable code should be able to function correctly in any environment - GUI, text terminal, embedded system, etc. Ada's ability to propagate exceptions to a level at which the program knows sufficient about the environment to output the appropriate messages makes life simple for the writer of re-usable software. Exceptions are defined which correspond to all the errors that could occur. Re-usable code simply raises the exceptions. The users of the code then have the flexibility to decide when (ie at what level) to process the exceptions.

An added benefit of Ada's exception mechanism is that it provides a uniform method of handling errors. Left to their own devices, programmers are able to define a large grab-bag of styles of error raising and processing, for example, we can:

● use the return values of functions,
● add a call-by-reference error parameter to a function,
● set a global variable,
● call an error handling module,
● notify a separate process,
● ...



In Ada, a disciplined group of programmers will use Ada's in-built exception handling uniformly to propagate exceptions to some agreed level in programs where code which "knows" the current environment can handle the problem appropriately. Ada further standardises behaviour by pre-defining a number of exceptions for commonly encountered problems, such as constraint_error when an attempt is made to assign a value to a variable which is outside the permitted range for its type.



Key terms



Exception
    An exception is raised by a program or program module when some event occurs which should






    be handled by some other program or module.

Continue on to C++ exception handling
Back to the Table of Contents
© John Morris, 1998






collection.c

    /* Array implementation of a Collection */
    #include <stdio.h>      /* Definition of NULL */
    #include <stdlib.h>     /* calloc, free */
    #include <string.h>     /* memcmp */
    #include <assert.h>     /* Needed for assertions */
    #include "Collection.h" /* import the specification */

    extern void *ItemKey( void * );

    struct t_Collection {
        int item_cnt;
        int max_items;      /* Not strictly necessary */
        int size;           /* Needed by FindInCollection */
        void **items;
    };

    Collection ConsCollection(int max_items, int item_size )
    /* Construct a new Collection
       Pre-condition: (max_items > 0) && (item_size > 0)
       Post-condition: returns a pointer to an empty Collection */
    {
        Collection c;
        assert( max_items > 0 );
        assert( item_size > 0 );
        c = (Collection)calloc( 1, sizeof(struct t_Collection) );
        c->items = (void **)calloc( max_items, sizeof(void *) );
        c->size = item_size;
        c->max_items = max_items;
        return c;
    }

    void DeleteCollection( Collection c )
    {
        assert( c != NULL );
        assert( c->items != NULL );
        free( c->items );
        free( c );
    }

    void AddToCollection( Collection c, void *item )
    /* Add an item to a Collection
       Pre-condition: (c is a Collection created by a call to ConsCollection) &&
                      (existing item count < max_items) && (item != NULL)
       Post-condition: item has been added to c */
    {
        assert( c != NULL );
        assert( c->item_cnt < c->max_items );
        assert( item != NULL );
        c->items[c->item_cnt++] = item;
        /* Post-condition */
        assert( FindInCollection( c, ItemKey( item ) ) != NULL );
    }

    void DeleteFromCollection( Collection c, void *item )
    /* Delete an item from a Collection
       Pre-condition: (c is a Collection created by a call to ConsCollection) &&
                      (existing item count >= 1) && (item != NULL)
       Post-condition: item has been deleted from c */
    {
        int i;






        assert( c != NULL );
        assert( c->item_cnt >= 1 );
        assert( item != NULL );
        for(i=0;i<c->item_cnt;i++) {
            if ( item == c->items[i] ) {
                /* Found the item to be deleted,
                   shuffle all the rest down */
                while( i < c->item_cnt-1 ) {
                    c->items[i] = c->items[i+1];
                    i++;
                }
                c->item_cnt--;
                break;
            }
        }
    }

    void *FindInCollection( Collection c, void *key )
    /* Find an item in a Collection
       Pre-condition: c is a Collection created by a call to ConsCollection
                      key != NULL
       Post-condition: returns an item identified by key if one exists,
                       otherwise returns NULL */
    {
        int i;
        assert( c != NULL );
        assert( key != NULL );
        for(i=0;i<c->item_cnt;i++) {
            if (memcmp(ItemKey(c->items[i]),key,c->size)==0)
                return c->items[i];
        }
        return NULL;
    }






Memory Allocators

C implementations provide a number of memory allocation functions which allocate memory from the program's heap. This is usually an area of memory above the program and data blocks which grows upwards in memory as memory is allocated by program requests. The two most commonly used C functions are malloc and calloc. Full descriptions may be found in the Unix man pages for malloc and calloc.
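As a minimal sketch of the difference between the two (standard C signatures):

    #include <stdlib.h>

    void example( void )
    {
        int *a = malloc( 10 * sizeof(int) );  /* block contents undefined */
        int *b = calloc( 10, sizeof(int) );   /* block is zero-filled     */
        free( a );
        free( b );
    }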



[Fig 1: A typical program's use of memory.]

Back to the Table of Contents
© John Morris, 1998






memcmp

MEMORY(3)                C LIBRARY FUNCTIONS                 MEMORY(3)

NAME
     memory, memccpy, memchr, memcmp, memcpy, memset - memory
     operations

SYNOPSIS
     #include <memory.h>

     ... others omitted ...

     int memcmp(s1, s2, n)
     char *s1, *s2;
     int n;

DESCRIPTION
     These functions operate as efficiently as possible on memory
     areas (arrays of characters bounded by a count, not terminated
     by a null character).  They do not check for the overflow of
     any receiving memory area.

     memcmp() compares its arguments, looking at the first n
     characters only, and returns an integer less than, equal to, or
     greater than 0, according as s1 is lexicographically less than,
     equal to, or greater than s2.

     See full man page for other descriptions.

NOTES
     For user convenience, all these functions are declared in the
     <memory.h> header file.






3.2 Lists

The array implementation of our collection has one serious drawback: you must know the maximum number of items in your collection when you create it. This presents problems in programs in which this maximum number cannot be predicted accurately when the program starts up. Fortunately, we can use a structure called a linked list to overcome this limitation.



3.2.1 Linked lists

The linked list is a very flexible dynamic data structure: items may be added to it or deleted from it at will. A programmer need not worry about how many items a program will have to accommodate: this allows us to write robust programs which require much less maintenance. A very common source of problems in program maintenance is the need to increase the capacity of a program to handle larger collections: even the most generous allowance for growth tends to prove inadequate over time!

In a linked list, each item is allocated space as it is added to the list. A link is kept with each item to the next item in the list. Each node of the list has two elements:

1. the item being stored in the list and
2. a pointer to the next item in the list.

The last node in the list contains a NULL pointer to indicate that it is the end or tail of the list. As items are added to a list, memory for a node is dynamically allocated. Thus the number of items that may be added to a list is limited only by the amount of memory available.
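In C, such a node is declared exactly as in the linked list listing (collection_ll.c) which appears later in these notes:

struct t_node {
    void *item;            /* 1. the item stored in this node */
    struct t_node *next;   /* 2. link to the next node - NULL marks the tail */
};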



Handle for the list

The variable (or handle) which represents the list is simply a pointer to the node at the head of the list.



Adding to a list

The simplest strategy for adding an item to a list is to:






a. allocate space for a new node,
b. copy the item into it,
c. make the new node's next pointer point to the current head of the list and
d. make the head of the list point to the newly allocated node.



This strategy is fast and efficient, but each item is added to the head of the list. An alternative is to create a structure for the list which contains both head and tail pointers:

struct fifo_list {
    struct node *head;
    struct node *tail;
};

The code for AddToCollection is now trivially modified to make a list in which the item most recently added is the list's tail (a sketch appears after the points below). The specification remains identical to that used for the array implementation: the max_items parameter to ConsCollection is simply ignored [7]. Thus we only need to change the implementation; as a consequence, applications which use this object will need no changes. The ramifications for the cost of software maintenance are significant: the data structure is changed, but since the details (the attributes of the object or the elements of the structure) are hidden from the user, there is no impact on the user's program.

Select here to load collection_ll.c

Points to note:

a. This implementation of our collection can be substituted for the first one with no changes to a client's program. With the exception of the added flexibility that any number of items may be added to our collection, this implementation provides exactly the same high level behaviour as the previous one.
b. The linked list implementation has exchanged flexibility for efficiency - on most systems, the system call to allocate memory is relatively expensive. Pre-allocation in the array-based implementation is generally more efficient. More examples of such trade-offs will be found later.

The study of data structures and algorithms will enable you to make the implementation decision which most closely matches your users' specifications.
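A minimal sketch of the tail-insertion variant described above, assuming the fifo_list structure just shown (written here with the t_node structure of collection_ll.c; the function name AddToTail is illustrative, not part of the course listings):

struct t_node { void *item; struct t_node *next; };
struct fifo_list { struct t_node *head; struct t_node *tail; };

void AddToTail( struct fifo_list *list, struct t_node *new_node )
{
    new_node->next = NULL;            /* the new node becomes the tail */
    if ( list->head == NULL )
        list->head = new_node;        /* empty list: it is the head too */
    else
        list->tail->next = new_node;  /* link the old tail to it */
    list->tail = new_node;
}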



3.2.2 List variants






Circularly Linked Lists

By ensuring that the tail of the list is always pointing to the head, we can build a circularly linked list. If the external pointer (the one in struct t_Collection in our implementation) points to the current "tail" of the list, then the "head" is found trivially via tail->next, permitting us to have either LIFO or FIFO lists with only one external pointer. In modern processors, the few bytes of memory saved in this way would probably not be regarded as significant. A circularly linked list would more likely be used in an application which required "round-robin" scheduling or processing.

Doubly Linked Lists



Doubly linked lists have a pointer to the preceding item as well as one to the next. They permit scanning or searching of the list in both directions. (To go backwards in a simple list, it is necessary to go back to the start and scan forwards.) Many applications require searching backwards and forwards through sections of a list: for example, searching for a common name like "Kim" in a Korean telephone directory would probably need much scanning backwards and forwards through a small region of the whole list, so the backward links become very useful. In this case, the node structure is altered to have two links:

struct t_node {
    void *item;
    struct t_node *previous;
    struct t_node *next;
} node;

Lists in arrays

Although this might seem pointless (Why impose a structure which has the overhead of the "next" pointers on an array?), this is just what memory allocators do to manage available space. Memory is just an array of words. After a series of memory allocations and de-allocations, there are blocks of free memory scattered throughout the available heap space. In order to be able to re-use this memory, memory allocators will usually link freed blocks together in a free list by writing pointers to the next free block in the block itself. An external free-list pointer points to the first block in the free list. When a new block of memory is requested, the allocator will generally scan the free list looking for a freed block of suitable size and delete it from the free list (re-linking the free list around the deleted block). Many variations of memory allocators have been proposed: refer to a text on






operating systems or implementation of functional languages for more details. The entry in the index under garbage collection will probably lead to a discussion of this topic.
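To make the free list idea concrete, here is a minimal sketch of linking freed blocks and scanning for a first fit. All of the names here are illustrative - this is not taken from any particular allocator.

#include <stddef.h>

struct free_block {
    size_t size;              /* size of this free block */
    struct free_block *next;  /* next free block - NULL at the end */
};

static struct free_block *free_list = NULL;

/* Return a block to the free list:
   the link is written into the freed block itself */
void release_block( void *p, size_t size )
{
    struct free_block *b = (struct free_block *)p;
    b->size = size;
    b->next = free_list;      /* link in at the head */
    free_list = b;
}

/* First-fit scan: unlink and return the first block large enough */
void *acquire_block( size_t size )
{
    struct free_block **prev = &free_list, *b;
    for ( b = free_list; b != NULL; prev = &b->next, b = b->next )
    {
        if ( b->size >= size )
        {
            *prev = b->next;  /* re-link the free list around this block */
            return b;
        }
    }
    return NULL;              /* no suitable block - a real allocator
                                 would grow the heap here */
}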



Key terms



Dynamic data structures
    Structures which grow or shrink as the data they hold changes. Lists, stacks and trees are all dynamic structures.






C++ Exceptions

C++ defines a mechanism that is somewhat similar to Ada's exception mechanism. C++ allows you to execute a throw statement when an error is detected. When a throw is invoked, control jumps immediately to a catch routine. First you define a try block which encloses the "normal" (no exception) code. The try block is followed by a number of catch blocks: if code in the try block throws an exception, it is caught by one of the catch blocks.

try {
    classX x;
    x.mangle( 2 );
    x.straighten( -2 );
}
catch( const char *string ) { ..... }
catch( RangeErr &re ) { .... }
catch( ... ) {       // catches any other error
    ....
}

classX's constructor, mangle and straighten methods contain throw statements when problems are encountered:

classX::mangle( int degree ) {
    if ( degree > MAX_MANGLE )
        throw "Can't mangle this much!";
    ....             // Normal code for mangle
}

The throw causes control to jump straight out of the mangle method to the catch block, skipping all subsequent statements in the try block. However, like much of C++, the rules which are used to associate a throw with the correct catch are too complex to contemplate: Stroustrup attempted to make the throw like a general method invocation - with parameters, overloading, etc.

Historical note: many early C++ compilers did not implement the throw/catch pair (possibly because of the complexity alluded to above!) and some textbooks avoid it - or relegate it to the last few pages!






Basically, like a lot of C++ features, the throw/catch mechanism is best used in a very simple way, eg by providing a single catch block with a single int or enum parameter! Ada's simple and clean mechanism may lack some power (an exception handler can't be passed a parameter), but it's a lot easier to understand and use!






Notes

1. By the end of this course, you should understand that hacking is far from clever: there are much more effective strategies for making programs run faster!
2. Boehm, Software Engineering Economics
3. Software Engineering Notes
4. Ada LRM
5. B Meyer, Eiffel
6. Some compilers, e.g. Metrowerks' Macintosh C compiler, have an option "Require function prototypes" which can be turned on. If it is on, then the compiler will issue errors if the specification is not included - because the function prototypes (the formal specification of the methods of our objects) are in the specification. Other compilers, e.g. GNU gcc, will only issue warnings if the function prototypes are absent.
7. Or possibly used as "advice" to the system - enabling it to pre-allocate space in some efficient way, e.g. in a contiguous block in one page of memory.
8. Maintenance is well known to be the most costly phase of any large software development project; refer to any text on Software Engineering.
9. top is equivalent to x = pop(s); push(s,x);
10. In fact, adding and deleting from the head of a linked list is the simplest implementation and produces exactly the LIFO semantics of a stack.
11. In most operating systems, allocation and de-allocation of memory is a relatively expensive operation: there is a penalty for the flexibility of linked list implementations.
12. Pronounce this "big-Oh n" - or sometimes "Oh n".
13. You will find that not many people will be happy with a prediction that 90% of the time, this computer will calculate the [new position of the aircraft's flaps | the fastest rate at which the brakes can be applied | ...] to prevent [the aircraft crashing | hitting the car in front | ...].






/* Linked list implementation of a collection */

#include <stdlib.h>     /* calloc */
#include <stdio.h>      /* NULL */
#include <assert.h>     /* Needed for assertions */
#include <string.h>     /* memcmp */
#include "collection.h" /* import the specification */

extern void *ItemKey( void * );

struct t_node {
    void *item;
    struct t_node *next;
} node;

struct t_Collection {
    int size;               /* Needed by FindInCollection */
    struct t_node *node;
};

Collection ConsCollection( int max_items, int item_size )
/* Construct a new collection
   Pre-condition: (max_items > 0) && (item_size > 0)
   Post-condition: returns a pointer to an empty collection */
{
    Collection c;
    /* Although redundant, this assertion should be retained
       as it tests compliance with the formal specification */
    assert( max_items > 0 );
    assert( item_size > 0 );
    c = (Collection)calloc( 1, sizeof(struct t_Collection) );
    c->node = (struct t_node *)0;
    c->size = item_size;
    return c;
}

void AddToCollection( Collection c, void *item )
/* Add an item to a Collection
   Pre-condition: (c is a Collection created by a call to ConsCollection) &&
                  (existing item count < max_items) && (item != NULL)
   Post-condition: item has been added to c */
{
    struct t_node *new;
    assert( c != NULL );
    assert( item != NULL );
    /* Allocate space for a node for the new item */
    new = (struct t_node *)malloc( sizeof(struct t_node) );
    /* Attach the item to the node */
    new->item = item;
    /* Make the existing list `hang' from this one */
    new->next = c->node;
    /* The new item is the new head of the list */
    c->node = new;
    assert( FindInCollection( c, ItemKey( item ) ) != NULL );
}

void DeleteFromCollection( Collection c, void *item )
/* Delete an item from a Collection
   Pre-condition: (c is a Collection created by a call to ConsCollection) &&
                  (existing item count >= 1) && (item != NULL)






   Post-condition: item has been deleted from c */
{
    struct t_node *node, *prev;
    assert( c != NULL );
    /* The requirement that the Collection has at least
       one item is expressed a little differently */
    assert( c->node != NULL );
    assert( item != NULL );
    /* Select node at head of list */
    prev = node = c->node;
    /* Loop until we've reached the end of the list */
    while( node != NULL )
    {
        if ( item == node->item )
        {
            /* Found the item to be deleted,
               re-link the list around it */
            if ( node == c->node )
                /* We're deleting the head */
                c->node = node->next;
            else
                prev->next = node->next;
            /* Free the node */
            free( node );
            break;
        }
        prev = node;
        node = node->next;
    }
}

void *FindInCollection( Collection c, void *key )
/* Find an item in a Collection
   Pre-condition: (c is a Collection created by a call to ConsCollection) &&
                  (key != NULL)
   Post-condition: returns an item identified by key if one exists,
                   otherwise returns NULL */
{
    struct t_node *node;
    assert( c != NULL );
    assert( key != NULL );
    /* Select node at head of list */
    node = c->node;
    while( node != NULL )
    {
        if ( memcmp( key, ItemKey( node->item ), c->size ) == 0 )
            return node->item;
        node = node->next;
    }
    return NULL;
}






3.3 Stacks

Another way of storing data is in a stack. A stack is generally implemented with only two principal operations (apart from a constructor and destructor methods):

push
    adds an item to a stack
pop
    extracts the most recently pushed item from the stack



Other methods such as

top
    returns the item at the top without removing it [9]
isempty
    determines whether the stack has anything in it

are sometimes added.

A common model of a stack is a plate or coin stacker. Plates are "pushed" onto the top and "popped" off the top. Stacks form Last-In-First-Out (LIFO) queues and have many applications, from the parsing of algebraic expressions to ...



A formal specification of a stack class would look like:

typedef struct t_stack *stack;

stack ConsStack( int max_items, int item_size );
/* Construct a new stack
   Pre-condition: (max_items > 0) && (item_size > 0)
   Post-condition: returns a pointer to an empty stack */

void Push( stack s, void *item );
/* Push an item onto a stack
   Pre-condition: (s is a stack created by a call to ConsStack) &&
                  (existing item count < max_items) &&






                  (item != NULL)
   Post-condition: item has been added to the top of s */

void *Pop( stack s );
/* Pop an item off a stack
   Pre-condition: (s is a stack created by a call to ConsStack) &&
                  (existing item count >= 1)
   Post-condition: top item has been removed from s */

Points to note:

a. A stack is simply another collection of data items and thus it would be possible to use exactly the same specification as the one used for our general collection. However, collections with the LIFO semantics of stacks are so important in computer science that it is appropriate to set up a limited specification appropriate to stacks only.
b. Although a linked list implementation of a stack is possible (adding and deleting from the head of a linked list produces exactly the LIFO semantics of a stack), the most common applications for stacks have a space constraint, so that using an array implementation is a natural and efficient one. (In most operating systems, allocation and de-allocation of memory is a relatively expensive operation: there is a penalty for the flexibility of linked list implementations.)
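An array-based implementation satisfying this specification might look like the following sketch. The layout of struct t_stack is an assumption - it simply parallels the array implementation of the collection.

#include <assert.h>
#include <stdlib.h>

typedef struct t_stack *stack;    /* as in the specification above */

struct t_stack {
    int item_cnt, max_items;
    void **items;                 /* pointers to the pushed items */
};

stack ConsStack( int max_items, int item_size )
{
    stack s = (stack)calloc( 1, sizeof(struct t_stack) );
    s->items = (void **)calloc( max_items, sizeof(void *) );
    s->max_items = max_items;
    return s;                     /* item_size unused: we store pointers */
}

void Push( stack s, void *item )
{
    assert( s != NULL );
    assert( s->item_cnt < s->max_items );
    assert( item != NULL );
    s->items[s->item_cnt++] = item;   /* next free slot becomes the top */
}

void *Pop( stack s )
{
    assert( s != NULL );
    assert( s->item_cnt >= 1 );
    return s->items[--s->item_cnt];   /* most recently pushed item */
}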



3.3.1 Stack Frames

Almost invariably, programs compiled from modern high level languages (even C!) make use of a stack frame for the working memory of each procedure or function invocation. When any procedure or function is called, a number of words - the stack frame - is pushed onto a program stack. When the procedure or function returns, this frame of data is popped off the stack.

As a function calls another function, first its arguments, then the return address and finally space for local variables are pushed onto the stack. Since each function runs in its own "environment" or context, it becomes possible for a function to call itself - a technique known as recursion. This capability is extremely useful and extensively used, because many problems are elegantly specified or solved in a recursive way.






Program stack after executing a pair of mutually recursive functions:

function f(int x, int y) {
    int a;
    if ( term_cond ) return ...;
    a = .....;
    return g(a);
}

function g(int z) {
    int p, q;
    p = ...;
    q = ...;
    return f(p, q);
}

Note how all of functions f and g's environment (their parameters and local variables) is found in the stack frame. When f is called a second time from g, a new frame for the second invocation of f is created.



Key terms



push, pop
    Generic terms for adding something to, or removing something from, a stack.
context
    The environment in which a function executes: includes argument values, local variables and global variables. All the context except the global variables is stored in a stack frame.
stack frames
    The data structure containing all the data (arguments, local variables, return address, etc) needed each time a procedure or function is called.






4.3 Trees

4.3.1 Binary Trees

The simplest form of tree is a binary tree. A binary tree consists of

a. a node (called the root node) and
b. left and right sub-trees.

Both the sub-trees are themselves binary trees. You now have a recursively defined data structure. (It is also possible to define a list recursively: can you see how?)



A binary tree

The nodes at the lowest levels of the tree (the ones with no sub-trees) are called leaves. In an ordered binary tree,

1. the keys of all the nodes in the left sub-tree are less than that of the root,
2. the keys of all the nodes in the right sub-tree are greater than that of the root, and
3. the left and right sub-trees are themselves ordered binary trees.



Data Structure






The data structure for the tree implementation simply adds left and right pointers in place of the next pointer of the linked list implementation. [Load the tree struct.]

The AddToCollection method is, naturally, recursive. [Load the AddToCollection method.]

Similarly, the FindInCollection method is recursive. [Load the FindInCollection method.]
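For reference, the tree node structure (as it appears in the binary tree listing, coll_at.c, later in these notes) is:

struct t_node {
    void *item;
    struct t_node *left;    /* sub-tree of keys less than this node's */
    struct t_node *right;   /* sub-tree of keys greater than this node's */
};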



Analysis

Complete Trees

Before we look at more general cases, let's make the optimistic assumption that we've managed to fill our tree neatly, ie that each leaf is the same 'distance' from the root.



This forms a complete tree, whose height is defined as the number of links from the root to the deepest leaf.



A complete tree

First, we need to work out how many nodes, n, we have in such a tree of height, h. Now,

    n = 1 + 2^1 + 2^2 + .... + 2^h

from which we have

    n = 2^(h+1) - 1   and   h = floor( log2 n )

Examination of the Find method shows that in the worst case, h+1 or ceiling( log2 n ) comparisons are needed to find an item. This is the same as for binary search. However, Add also requires ceiling( log2 n ) comparisons to determine where to add an item. Actually adding the item takes a constant number of operations, so we say that a binary tree requires O(log n) operations for both adding and finding an item - a considerable improvement over binary search for a dynamic structure which often requires addition of new items. Deletion is also an O(log n) operation.






General binary trees

However, in general, addition of items to an ordered tree will not produce a complete tree. The worst case occurs if we add an ordered list of items to a tree. What will happen? Think before you click here!

This problem is readily overcome: we use a structure known as a heap. However, before looking at heaps, we should formalise our ideas about the complexity of algorithms by defining carefully what O(f(n)) means.



Key terms



Root Node
    Node at the "top" of a tree - the one from which all operations on the tree commence. The root node may not exist (a NULL tree with no nodes in it) or may have 0, 1 or 2 children in a binary tree.
Leaf Node
    Node at the "bottom" of a tree - farthest from the root. Leaf nodes have no children.
Complete Tree
    Tree in which each leaf is at the same distance from the root. A more precise and formal definition of a complete tree is set out later.
Height
    Number of nodes which must be traversed from the root to reach a leaf of a tree.






Java Exceptions

Java, somewhat unfortunately, follows C++ and has the try, throw, catch mechanism. However, Java defines a class hierarchy for errors, all of which are specialisations of the Exception class. Java exceptions, like Ada exceptions, are thrown from one class method to the invoking method, until they reach a try block. However, Java methods explicitly list the exceptions that they throw:

void readFromFile( String s ) throws IOException, InterruptedException {
    ......
}

try {
    readFromFile( "abc" );
}
catch( FileNotFoundException e ) { ..... }
catch( IOException e ) { .... }
catch( Exception e ) {
    // catches any other error (Exception is the "superclass")
    ....
}
finally {
    // Cleanup code - always executed
}

The finally block is executed whether an exception is thrown or not. Despite inheriting a mess, Java's designers have managed to simplify it to a useable system. catch appears (my text is not clear on this point - a problem with Java's instant fame!) to accept a single parameter belonging to the exception class. As in Ada, many exceptions are pre-defined, but you can also define your own.

Note: the problem of a precise definition of Java is a perennial one in the computer industry: the tendency to instantly adopt "fads" before they have a chance to be defined thoroughly and unambiguously has led to enormous porting problems and enormous, avoidable costs for users. No professional engineer would design any system using screws with non-standard threads!






2.7 Programming Languages

This section contains some notes on capabilities of programming languages. The first sub-section discusses the ability to pass a function as an argument to another function - an important capability which enables us to create flexible generic ADTs in ANSI C. The remaining sub-sections give brief overviews of the object-oriented capabilities of C++, Java and Ada - three of the more important programming languages.

● Functions as data types in C
● C++ classes
● Java classes
● ADTs in Ada






Functions as Data Types

2.7.1 C functions

C allows a function to be used as a data item. This makes it possible to pass functions as arguments to other functions. This capability, although not often used, is extremely useful when it is appropriate. For example, as we initially defined the collections, even though we were careful to design our collection so that it would handle any type of data, we limited ourselves to collections of only one type of data in any one program. This is caused by the need to define an external function for comparing items. Ideally, we would like to specify a general comparison function for the objects in the collection when we construct the collection.

In C, this is easy (although the syntax is definitely non-intuitive!). We want to have a general comparison function which tells us what order objects to be stored in our collection should be ranked in, ie we need a function:

int ItemCmp( void *a, void *b );

which returns -1, 0 or +1 depending on whether a is less than, equal to or greater than b. (Note that we're allowing a very general notion of 'less than': ItemCmp can order items according to any rule which might be appropriate to these items.)

So we add a comparison function to our collection structure:

struct t_collection {
    int item_cnt;
    int (*ItemCmp)( void *, void * );
    ....
};

ItemCmp is a pointer to a function which has the prototype:

int ItemCmp( void *, void * );

The parentheses are necessary to distinguish this declaration from a function prototype and an invocation of the function!

The ConsCollection function now becomes:

collection ConsCollection( int max_items,
                           int (*ItemCmp)( void *, void * ) );

A use of the collection now looks like:

#include "widget.h"   /* import the ADT for widgets */

int WidgetComp( widget, widget );






collection LotsOfWidgets;
LotsOfWidgets = ConsCollection( large_no, WidgetComp );

In the body of the ADT, the ItemCmp function is used by de-referencing the pointer to the function and using it normally: in FindInCollection, we might have:

int FindInCollection( collection c, void *a ) {
    .....
    if ( (*(c->ItemCmp))( c->items[i], a ) == 0 ) {
        /* Found match ... */
        ....
    }

In the example above, an excessive number of parentheses has been used, because I simply don't want to bother to look up the precedence rules: why risk making a silly mistake, when a few extra parentheses will ensure that the compiler treats the code as you intended? However, C permits a 'shortcut' which doesn't require de-referencing the pointer to the function: in the source code examples, an ItemCmp function has been added to a tree collection.

New collection specification
New tree implementation






/* Specification for collection */

typedef struct t_collection *collection;

collection ConsCollection( int max_items,
                           int (*ItemCmp)(void *, void *) );
/* Construct a new collection
   Pre-condition: max_items > 0
   Post-condition: returns a pointer to an empty collection */

void AddToCollection( collection c, void *item );
/* Add an item to a collection
   Pre-condition: (c is a collection created by a call to ConsCollection) &&
                  (existing item count < max_items) && (item != NULL)
   Post-condition: item has been added to c */

void DeleteFromCollection( collection c, void *item );
/* Delete an item from a collection
   Pre-condition: (c is a collection created by a call to ConsCollection) &&
                  (existing item count >= 1) && (item != NULL)
   Post-condition: item has been deleted from c */

void *FindInCollection( collection c, void *key );
/* Find an item in a collection
   Pre-condition: (c is a collection created by a call to ConsCollection) &&
                  (key != NULL)
   Post-condition: returns an item identified by key if one exists,
                   otherwise returns NULL */






/* Binary tree implementation of a collection */

#include <stdlib.h>  /* calloc */
#include <stdio.h>   /* NULL */
#include <assert.h>  /* Needed for assertions */
#include "coll_a.h"  /* import the specification */

struct t_node {
    void *item;
    struct t_node *left;
    struct t_node *right;
} node;

struct t_collection {
    /* Note that size is not needed any longer! */
    int (*ItemCmp)( void *, void * );
    struct t_node *node;
};

collection ConsCollection( int max_items,
                           int (*ItemCmp)(void *, void *) )
/* Construct a new collection
   Pre-condition: max_items > 0
   Post-condition: returns a pointer to an empty collection */
{
    collection c;
    /* Although redundant, this assertion should be retained
       as it tests compliance with the formal specification */
    assert( max_items > 0 );
    c = (collection)calloc( 1, sizeof(struct t_collection) );
    c->node = (struct t_node *)0;
    c->ItemCmp = ItemCmp;
    return c;
}

static void AddToTree( struct t_node **t, struct t_node *new,
                       int (*ItemCmp)(void *, void *) )
{
    struct t_node *base;
    base = *t;
    /* If it's a null tree, just add it here */
    if ( base == NULL ) {
        *t = new;
        return;
    }
    else {
        /* Items with keys less than this node's key
           belong in the left sub-tree */
        if ( ItemCmp( new->item, base->item ) < 0 )
            AddToTree( &(base->left), new, ItemCmp );
        else
            AddToTree( &(base->right), new, ItemCmp );
    }
}

void AddToCollection( collection c, void *item )
/* Add an item to a collection
   Pre-condition: (c is a collection created by a call to ConsCollection) &&
                  (existing item count < max_items) &&






                  (item != NULL)
   Post-condition: item has been added to c */
{
    struct t_node *new;
    assert( c != NULL );
    assert( item != NULL );
    /* Allocate space for a node for the new item */
    new = (struct t_node *)malloc( sizeof(struct t_node) );
    /* Attach the item to the node */
    new->item = item;
    new->left = new->right = (struct t_node *)0;
    /* Pass the address of the collection's root pointer
       so that an empty collection is updated correctly */
    AddToTree( &(c->node), new, c->ItemCmp );
}

void DeleteFromTree( struct t_node **t, void *item )
{
    /* Body not supplied in the original listing */
}

void DeleteFromCollection( collection c, void *item )
/* Delete an item from a collection
   Pre-condition: (c is a collection created by a call to ConsCollection) &&
                  (existing item count >= 1) && (item != NULL)
   Post-condition: item has been deleted from c */
{
    assert( c != NULL );
    /* The requirement that the collection has at least
       one item is expressed a little differently */
    assert( c->node != NULL );
    assert( item != NULL );
    DeleteFromTree( &(c->node), item );
}

static void *FindInTree( struct t_node *t, void *key,
                         int (*ItemCmp)(void *, void *) )
{
    /* The body was not supplied in the original listing: this is a
       minimal recursive search, assuming that ItemCmp can compare a
       search key with a stored item (an extra parameter has been
       added to make the comparison function available) */
    int cmp;
    if ( t == NULL ) return NULL;
    cmp = ItemCmp( key, t->item );
    if ( cmp == 0 ) return t->item;
    if ( cmp < 0 ) return FindInTree( t->left, key, ItemCmp );
    return FindInTree( t->right, key, ItemCmp );
}

void *FindInCollection( collection c, void *key )
/* Find an item in a collection
   Pre-condition: (c is a collection created by a call to ConsCollection) &&
                  (key != NULL)
   Post-condition: returns an item identified by key if one exists,
                   otherwise returns NULL */
{
    assert( c != NULL );
    assert( key != NULL );
    return FindInTree( c->node, key, c->ItemCmp );
}






6.2 ADTs in Ada

Ada was designed in the late 70's - just before object orientation was "discovered". However, at that time the value of abstract data types was well understood and Ada provides good support for this concept.

Two Ada constructs are needed for defining an ADT: the data type and its methods are placed in an Ada package. For safety and information hiding, the data type is made private. Although a package's "client" can see the structure of the data type, the compiler prevents access to individual attributes of the type: thus effectively implementing the information hiding principle. The client can see the information, but can't do anything with it! (I believe that the reason for exposing the structure of the private type is purely pragmatic: compilers and linkers know how much space an ADT in a separately compiled package - for which only the specification might be available - requires.)

An Ada package for complex numbers would be implemented:

PACKAGE complex_numbers IS
    TYPE complex IS PRIVATE;
    I : CONSTANT complex;   -- 'i'

    FUNCTION "-"( complex a ) RETURNS complex;   -- Unary minus
    FUNCTION "+"( complex a; complex b ) RETURNS complex;
    FUNCTION "-"( complex a; complex b ) RETURNS complex;
    FUNCTION "*"( complex a; complex b ) RETURNS complex;
    FUNCTION "="( complex a; complex b ) RETURNS boolean;

PRIVATE
    TYPE complex IS RECORD
        real, imag : FLOAT;
    END RECORD;
    I : CONSTANT complex := (0.0, 1.0);
END complex_numbers;

The body or implementation would usually be placed in a separate file and compiled separately:

PACKAGE BODY complex_numbers IS
    FUNCTION "-"( complex a ) RETURNS complex IS   -- Unary minus
        RETURN complex'(-a.real, -a.imag);






END "-"; FUNCTION "+"( complex a; complex b ) RETURNS complex IS RETURN complex'(a.real+b.real,a.imag+c.imag); END "+"; FUNCTION "-"( complex a; complex b ) RETURNS complex IS RETURN complex'(a.real-b.real,a.imag-c.imag); END "-"; FUNCTION "*"( complex a; complex b ) RETURNS complex IS RETURN complex'(a.real*b.real - a.imag*b.imag, a.real*b.imag + a.imag*b.real ); END "*"; FUNCTION "="( complex a; complex b ) RETURNS boolean IS RETURN (a.real = b.real) AND (a.imag = b.imag); END "="; END complex_numbers; Note that Ada provides excellent operator overloading capabilities, which enable us to write mathematically "natural" code: e.g. complex a, b, c, z; IF a = b THEN c := z - a; z := -z; END IF; You can also observe that Ada is an extreme case of the principle that programs are read many times more often than they are read (for modern languages at least - nothing is ever likely to match the verbosity of COBOL!). Keywords appear everywhere: IS, RETURNS, etc..






7 Sorting

Sorting is one of the most important operations performed by computers. In the days of magnetic tape storage, before modern data-bases, it was almost certainly the most common operation performed by computers, as most "database" updating was done by sorting transactions and merging them with a master file. It's still important for the presentation of data extracted from databases: most people prefer to get reports sorted into some relevant order before wading through pages of data!



7.1 Bubble, Selection, Insertion Sorts

There are a large number of variations of one basic strategy for sorting. It's the same strategy that you use for sorting your bridge hand. You pick up a card, start at the beginning of your hand and find the place to insert the new card, insert it and move all the others up one place.

/* Insertion sort for integers */
void insertion( int a[], int n )
/* Pre-condition: a contains n items to be sorted */
{
    int i, j, v;
    /* Initially, the first item is considered 'sorted' */
    /* i divides a into a sorted region, x < i,
       and an unsorted one, x >= i */
    for ( i = 1; i < n; i++ )
    {
        /* Select the item at the beginning of the
           as yet unsorted section */
        v = a[i];
        /* Work backwards through the sorted section,
           moving items up one slot, until the correct
           place for v is found */
        j = i;
        while ( a[j-1] > v )
        {
            a[j] = a[j-1];
            j = j - 1;
            if ( j <= 0 ) break;
        }
        a[j] = v;
    }
}

4.2 Binary Search

static void *bin_search( collection c, int low, int high, void *key )
{
    int mid, cmp;
    /* Termination check */
    if ( low > high ) return NULL;
    mid = (high + low) / 2;
    cmp = memcmp( ItemKey( c->items[mid] ), key, c->size );
    if ( cmp == 0 )
        /* Match, return item found */
        return c->items[mid];
    else if ( cmp > 0 )
        /* key is less than mid's key, search lower half */
        return bin_search( c, low, mid-1, key );
    else
        /* key is greater than mid's key, search upper half */
        return bin_search( c, mid+1, high, key );
}

void *FindInCollection( collection c, void *key )
/* Find an item in a collection
   Pre-condition: c is a collection created by ConsCollection,
                  c is sorted in ascending order of the key,
                  key != NULL
   Post-condition: returns an item identified by key if one exists,
                   otherwise returns NULL */
{
    int low, high;
    low = 0; high = c->item_cnt - 1;
    return bin_search( c, low, high, key );
}

Points to note:

a. bin_search is recursive: it determines whether the search key lies in the lower or upper half of the array, then calls itself on the appropriate half.
b. There is a termination condition (two of them in fact!):
   i. If low > high then the partition to be searched has no elements in it and
   ii. If there is a match with the element in the middle of the current partition, then we can return immediately.
c. AddToCollection will need to be modified to ensure that each item added is placed in its correct place in the array. The procedure is simple:
   i. Search the array until the correct spot to insert the new item is found,
   ii. Move all the following items up one position and
   iii. Insert the new item into the empty position thus created.
d. bin_search is declared static. It is a local function and is not used outside this class: if it were not declared static, it would be exported and be available to all parts of the program. The






static declaration also allows other classes to use the same name internally. static reduces the visibility of a function and should be used wherever possible to control access to functions!

Analysis



Each step of the algorithm divides the block of items being searched in half. We can divide a set of n items in half at most log2 n times. Thus the running time of a binary search is proportional to log n and we say this is an O(log n) algorithm. (For example, a million items can be searched with at most about 20 comparisons, since 2^20 > 10^6.)



Binary search requires a more complex program than our original search and thus for small n it may run slower than the simple linear search. However, for large n, log n is much smaller than n; consequently, an O(log n) algorithm is much faster than an O(n) one.



Plot of n and log n vs n.

We will examine this behaviour more formally in a later section. First, let's see what we can do about the insertion (AddToCollection) operation. In the worst case, insertion may require n operations to insert into a sorted list.

1. We can find the place in the list where the new item belongs using binary search in O(log n) operations.
2. However, we have to shuffle all the following items up one place to make way for the new one. In the worst case, the new item is the first in the list, requiring n move operations for the shuffle!

A similar analysis will show that deletion is also an O(n) operation. If our collection is static, ie it doesn't change very often - if at all - then we may not be concerned with the time required to change its contents: we may be prepared for the initial build of the collection and the occasional insertion and deletion to take some time. In return, we will be able to use a simple data structure (an array) which has little memory overhead. However, if our collection is large and dynamic, ie items are being added and deleted continually, then we can obtain considerably better performance using a data structure called a tree.



Key terms

Big Oh
    A notation formally describing the set of all functions which are bounded above by a nominated function.
Binary Search






    A technique for searching an ordered list in which we first check the middle item and - based on that comparison - "discard" half the data. The same procedure is then applied to the remaining half until a match is found or there are no more items left.






8.2 Red-Black Trees

A red-black tree is a binary search tree with one extra attribute for each node: the colour, which is either red or black. We also need to keep track of the parent of each node, so that a red-black tree's node structure would be:

struct t_red_black_node {
    enum { red, black } colour;
    void *item;
    struct t_red_black_node *left,
                            *right,
                            *parent;
}

For the purpose of this discussion, the NULL nodes which terminate the tree are considered to be the leaves and are coloured black.

Definition of a red-black tree

A red-black tree is a binary search tree which has the following red-black properties:

1. Every node is either red or black.
2. Every leaf (NULL) is black.
3. If a node is red, then both its children are black.
4. Every simple path from a node to a descendant leaf contains the same number of black nodes.

Property 3 implies that on any path from the root to a leaf, red nodes must not be adjacent. However, any number of black nodes may appear in a sequence.



A basic red-black tree






Basic red-black tree with the sentinel nodes added.

Implementations of the red-black tree algorithms will usually include the sentinel nodes as a convenient means of flagging that you have reached a leaf node. They are the NULL black nodes of property 2.

The number of black nodes on any path from, but not including, a node x to a leaf is called the black-height of the node, denoted bh(x). We can prove the following lemma:



Lemma

A red-black tree with n internal nodes has height at most 2 log(n+1). (For a proof, see Cormen, p 264.)

This demonstrates why the red-black tree is a good search tree: it can always be searched in O(log n) time.

As with heaps, additions and deletions from red-black trees destroy the red-black property, so we need to restore it. To do this we need to look at some operations on red-black trees.

Rotations

A rotation is a local operation in a search tree that preserves in-order traversal key ordering. Note that in both trees, an in-order traversal yields:

    A x B y C

The left_rotate operation may be encoded:

left_rotate( Tree T, node x ) {
    node y;
    y = x->right;
    /* Turn y's left sub-tree into x's right sub-tree */






x->right = y->left; if ( y->left != NULL ) y->left->parent = x; /* y's new parent was x's parent */ y->parent = x->parent; /* Set the parent to point to y instead of x */ /* First see whether we're at the root */ if ( x->parent == NULL ) T->root = y; else if ( x == (x->parent)->left ) /* x was on the left of its parent */ x->parent->left = y; else /* x must have been on the right */ x->parent->right = y; /* Finally, put x on y's left */ y->left = x; x->parent = y; } Insertion Insertion is somewhat complex and involves a number of cases. Note that we start by inserting the new node, x, in the tree just as we would for any other binary tree, using the tree_insert function. This new node is labelled red, and possibly destroys the red-black property. The main loop moves up the tree, restoring the red-black property. rb_insert( Tree T, node x ) { /* Insert in the tree in the usual way */ tree_insert( T, x ); /* Now restore the red-black property */ x->colour = red; while ( (x != T->root) && (x->parent->colour == red) ) { if ( x->parent == x->parent->parent->left ) { /* If x's parent is a left, y is x's right 'uncle' */ y = x->parent->parent->right; if ( y->colour == red ) { /* case 1 - change the colours */ x->parent->colour = black; y->colour = black; x->parent->parent->colour = red; /* Move x up the tree */ x = x->parent->parent; } else { /* y is a black node */ http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/red_black.html (3 of 5) [3/23/2004 2:50:18 PM]






if ( x == x->parent->right ) { /* and x is to the right */ /* case 2 - move x up and rotate */ x = x->parent; left_rotate( T, x ); } /* case 3 */ x->parent->colour = black; x->parent->parent->colour = red; right_rotate( T, x->parent->parent ); } } else { /* repeat the "if" part with right and left exchanged */ } } /* Colour the root black */ T->root->colour = black; }



Here's an example of the insertion operation.



Red-Black Tree Animation
This animation was written by Linda Luo, Mervyn Ng, Anita Lee, John Morris and Woi Ang.






Examination of the code reveals only one loop. In that loop, the node at the root of the sub-tree whose red-black property we are trying to restore, x, may be moved up the tree at least one level in each iteration of the loop. Since the tree originally has O(log n) height, there are O(log n) iterations. The tree_insert routine also has O(log n) complexity, so overall the rb_insert routine also has O(log n) complexity.



Key terms



Red-black trees
    Trees which remain balanced - and thus guarantee O(log n) search times - in a dynamic environment. Or more importantly, since any tree can be re-balanced - but at considerable cost - they can be re-balanced in O(log n) time.






/* Extract the highest priority from the heap */

/* The heap is stored in items[0..item_cnt-1]:
   node k's children are found at 2k+1 and 2k+2 */
#define LEFT(k)    (2*(k)+1)
#define RIGHT(k)   (2*(k)+2)
#define EMPTY(c,k) ((k) >= (c)->item_cnt)
#define SWAP(i,j)  { void *x = c->items[i]; \
                     c->items[i] = c->items[j]; \
                     c->items[j] = x; }

extern int ItemCmp( void *, void * );

void MoveDown( Collection c, int k )
{
    int larger, right, left;
    left = LEFT(k);
    right = RIGHT(k);
    if ( !EMPTY(c,left) )   /* Termination condition: k is a leaf! */
    {
        larger = left;
        if ( !EMPTY(c,right) )
        {
            if ( ItemCmp( c->items[right], c->items[larger] ) > 0 )
                larger = right;
        }
        if ( ItemCmp( c->items[k], c->items[larger] ) < 0 )
        {
            SWAP( k, larger );
            MoveDown( c, larger );
        }
    }
}

void *HighestPriority( Collection c )
/* Return the highest priority item
   Pre-condition: (c is a collection created by a call to ConsCollection) &&
                  (existing item count >= 1)
   Post-condition: returns the highest priority item,
                   which has been removed from c */
{
    int cnt;
    void *save;
    assert( c != NULL );
    assert( c->item_cnt >= 1 );
    /* Save the root */
    save = c->items[0];
    /* Put the last item in the root */
    cnt = c->item_cnt;
    c->items[0] = c->items[cnt-1];
    /* Adjust the count */
    c->item_cnt--;
    /* Move the new root item down if necessary */
    MoveDown( c, 0 );
    return save;
}
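This listing only covers extraction. As a complement, here is a sketch (not part of the original file) of the matching insertion: the new item is placed at the end of the array and moved up while it outranks its parent. PARENT is an assumed macro, consistent with the 0-based LEFT/RIGHT macros above.

#define PARENT(k) (((k)-1)/2)

void AddToHeap( Collection c, void *item )
{
    int k = c->item_cnt++;    /* place the new item in the first free slot */
    c->items[k] = item;
    /* Move up while the new item has higher priority than its parent */
    while ( k > 0 && ItemCmp( c->items[k], c->items[PARENT(k)] ) > 0 )
    {
        SWAP( k, PARENT(k) );
        k = PARENT(k);
    }
}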






6 Queues

Queues are dynamic collections which have some concept of order. This can be either based on order of entry into the queue - giving us First-In-First-Out (FIFO) or Last-In-First-Out (LIFO) queues. Both of these can be built with linked lists: the simplest "add-to-head" implementation of a linked list gives LIFO behaviour. A minor modification - adding a tail pointer and adjusting the addition method implementation - will produce a FIFO queue.



Performance

A straightforward analysis shows that for both these cases, the time needed to add or delete an item is constant and independent of the number of items in the queue. Thus we class both addition and deletion as an O(1) operation. For any given real machine + operating system + language combination, addition may take c1 seconds and deletion c2 seconds, but we aren't interested in the values of the constants: they will vary from machine to machine, language to language, etc. The key point is that the time is not dependent on n, producing O(1) algorithms.

Once we have written an O(1) method, there is generally little more that we can do from an algorithmic point of view. Occasionally, a better approach may produce a lower constant time. Often, enhancing our compiler, run-time system, machine, etc will produce some significant improvement. However, O(1) methods are already very fast, and it's unlikely that effort expended in improving such a method will produce much real gain!



6.1 Priority Queues

Often the items added to a queue have a priority associated with them: this priority determines the order in which they exit the queue - highest priority items are removed first.

This situation arises often in process control systems. Imagine the operator's console in a large automated factory. It receives many routine messages from all parts of the system: they are assigned a low priority because they just report the normal functioning of the system - they update various parts of the operator's console display simply so that there is some confirmation that there are no problems. It will make little difference if they are delayed or lost.

However, occasionally something breaks or fails and alarm messages are sent. These have high priority because some action is required to fix the problem (even if it is mass evacuation because nothing can stop the imminent explosion!).

Typically such a system will be composed of many small units, one of which will be a buffer for messages received by the operator's console. The communications system places messages in the buffer so that communications links can be freed for further messages while the console software is processing the message. The console software extracts messages from the buffer and updates






appropriate parts of the display system. Obviously we want to sort messages on their priority so that we can ensure that the alarms are processed immediately and not delayed behind a few thousand routine messages while the plant is about to explode.

As we have seen, we could use a tree structure - which generally provides O(log n) performance for both insertion and deletion. Unfortunately, if the tree becomes unbalanced, performance will degrade to O(n) in pathological cases. This will probably not be acceptable when dealing with dangerous industrial processes, nuclear reactors, flight control systems and other life-critical systems.

Aside

The great majority of computer systems would fall into the broad class of information systems - which simply store and process information for the benefit of people who make decisions based on that information. Obviously, in such systems, it usually doesn't matter whether it takes 1 or 100 seconds to retrieve a piece of data - this simply determines whether you take your coffee break now or later. However, as we'll see, using the best known algorithms is usually easy and straightforward: if they're not already coded in libraries, they're in text-books. You don't even have to work out how to code them! In such cases, it's just your reputation that's going to suffer if someone (who has studied his or her algorithms text!) comes along later and says "Why on earth did X (you!) use this O(n^2) method - there's a well known O(n) one!" Of course, hardware manufacturers are very happy if you use inefficient algorithms - it drives the demand for new, faster hardware - and keeps their profits high!

There is a structure which will provide guaranteed O(log n) performance for both insertion and deletion: it's called a heap.



Key terms

FIFO queue
    A queue in which the first item added is always the first one out.
LIFO queue
    A queue in which the item most recently added is always the first one out.
Priority queue
    A queue in which the items are sorted so that the highest priority item is always the next one to be extracted.
Life critical systems
    Systems on which we depend for safety and which may result in death or injury if they fail: medical monitoring, industrial plant monitoring and control and aircraft control systems are examples of life critical systems.
Real time systems






    Systems in which time is a constraint. A system which must respond to some event (eg the change in attitude of an aircraft caused by some atmospheric event like wind-shear) within a fixed time to maintain stability or continue correct operation (eg the aircraft systems must make the necessary adjustments to the control surfaces before the aircraft falls out of the sky!).






8.2 Red-Black Tree Operation

Here's an example of insertion into a red-black tree (taken from Cormen, p 269).



Here's the original tree .. Note that in the following diagrams, the black sentinel nodes have been omitted to keep the diagrams simple.



The tree insert routine has just been called to insert node "4" into the tree. This is no longer a red-black tree: there are two successive red nodes on the path 11 - 2 - 7 - 5 - 4. Mark the new node, x, and its uncle, y. y is red, so we have case 1 ...






Change the colours of nodes 5, 7 and 8.



Move x up to its grandparent, 7. x's parent (2) is still red, so this isn't a red-black tree yet. Mark the uncle, y. In this case, the uncle is black, so we have case 2 ...



Move x up and rotate left.






Still not a red-black tree .. the uncle is black, but x's parent is to the left ..



Change the colours of 7 and 11 and rotate right ..



This is now a red-black tree, so we're finished! O(log n) time!






8.3 AVL Trees

An AVL tree is another balanced binary search tree. Named after their inventors, Adelson-Velskii and Landis, they were the first dynamically balanced trees to be proposed. Like red-black trees, they are not perfectly balanced, but pairs of sub-trees differ in height by at most 1, maintaining an O(log n) search time. Addition and deletion operations also take O(log n) time.

Definition of an AVL tree

An AVL tree is a binary search tree which has the following properties:

1. The sub-trees of every node differ in height by at most one.
2. Every sub-tree is an AVL tree.

Balance requirement for an AVL tree: the left and right sub-trees differ by at most 1 in height.



You need to be careful with this definition: it permits some apparently unbalanced trees! For example, here are some trees:

Tree 1: AVL tree? Yes. Examination shows that each left sub-tree has a height 1 greater than each right sub-tree.

Tree 2: AVL tree? No. The sub-tree with root 8 has height 4 and the sub-tree with root 18 has height 2.



Insertion

As with the red-black tree, insertion is somewhat complex and involves a number of cases. Implementations of AVL tree insertion may be found in many textbooks: they rely on adding an extra attribute, the balance factor, to each node. This factor indicates whether the tree is left-heavy (the height of the left sub-tree is 1 greater than that of the right sub-tree), balanced (both sub-trees are the same height) or right-heavy (the height of the right sub-tree is 1 greater than that of the left sub-tree). If the balance would be destroyed by an insertion, a rotation is performed to correct the balance.

A new item has been added to the left sub-tree of node 1, causing its height to become 2 greater than 2's right sub-tree (shown in green). A right-rotation is performed to correct the imbalance.
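A node structure carrying the balance factor might be declared as follows (a sketch only; the textbook implementations referred to above differ in detail):

struct t_avl_node {
    void *item;
    int balance;    /* -1: left-heavy, 0: balanced, +1: right-heavy */
    struct t_avl_node *left, *right;
};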



Key terms



AVL trees
Trees which remain balanced - and thus guarantee O(log n) search times - in a dynamic environment. More importantly, since any tree can be re-balanced - but at considerable cost - AVL trees can be re-balanced in O(log n) time.






7.3 Quick Sort

Quicksort is a very efficient sorting algorithm invented by C.A.R. Hoare. It has two phases:
● the partition phase and
● the sort phase.



As we will see, most of the work is done in the partition phase - it works out where to divide the work. The sort phase simply sorts the two smaller problems that are generated in the partition phase. This makes Quicksort a good example of the divide and conquer strategy for solving problems. (You've already seen an example of this approach in the binary search procedure.) In quicksort, we divide the array of items to be sorted into two partitions and then call the quicksort procedure recursively to sort the two partitions, ie we divide the problem into two smaller ones and conquer by solving the smaller ones. Thus the conquer part of the quicksort routine looks like this:



    void quicksort( void *a, int low, int high )
    {
        int pivot;
        /* Termination condition! */
        if ( high > low )
        {
            pivot = partition( a, low, high );
            quicksort( a, low, pivot-1 );
            quicksort( a, pivot+1, high );
        }
    }



Initial Step - First Partition



Sort Left Partition in the same way

For the strategy to be effective, the partition phase must ensure that all the items in one part (the lower part) are less than all those in the other (upper) part. To do this, we choose a pivot element and arrange that all the items in the lower part are less than the pivot and all those in the upper part greater than it. In the most general case, we don't know anything about the items to be sorted, so any choice of the pivot element will do - the first element is a convenient one.

As an illustration of this idea, you can view this animation, which shows a partition algorithm in which items to be sorted are copied from the original array to a new one: items smaller than the pivot are placed to the left of the new array and items greater than the pivot are placed on the right. In the final step, the pivot is dropped into the remaining slot in the middle.

QuickSort Animation
This animation was based on a suggestion made by Jeff Rohl; it was written by Woi Ang.

Observe that the animation uses two arrays for the items being sorted: thus it requires O(n) additional space to operate. However, it's possible to partition the array in place. The next page shows a conventional implementation of the partition phase which swaps elements in the same array and thus avoids using extra space.



Key terms






Divide and Conquer Algorithms
Algorithms that solve (conquer) problems by dividing them into smaller sub-problems until the problem is so small that it is trivially solved.

in place
In place sorting algorithms don't require additional temporary space to store elements as they sort; they use the space originally occupied by the elements.






Quick Sort: Partition in place

Most implementations of quick sort make use of the fact that you can partition in place by keeping two pointers: one moving in from the left and a second moving in from the right. They are moved towards the centre until the left pointer finds an element greater than the pivot and the right one finds an element less than the pivot. These two elements are then swapped. The pointers are then moved inward again until they "cross over". The pivot is then swapped into the slot to which the right pointer points and the partition is complete.

    int partition( int *a, int low, int high )
    {
        int left, right;
        int pivot, pivot_item;
        pivot_item = a[low];
        pivot = left = low;
        right = high;
        while ( left < right )
        {
            /* Move left while item < pivot */
            while ( a[left] <= pivot_item ) left++;
            /* Move right while item > pivot */
            while ( a[right] > pivot_item ) right--;
            if ( left < right ) SWAP(a,left,right);
        }
        /* right is final position for the pivot */
        a[low] = a[right];
        a[right] = pivot_item;
        return right;
    }

Note that the above code does not check that left does not exceed the array bound. You need to add this check before performing the swaps - both the one in the loop and the final one outside the loop.
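One way of adding the missing check - a sketch of a possible fix, not the notes' own solution - is to bound the left-hand scan; the right-hand scan needs no guard because a[low] holds the pivot itself, so right can never move below low:

    int partition_checked( int *a, int low, int high )
    {
        int left = low, right = high;
        int pivot_item = a[low];
        while ( left < right )
        {
            /* Stop the left scan at the end of the partition */
            while ( left <= high && a[left] <= pivot_item ) left++;
            while ( a[right] > pivot_item ) right--;
            if ( left < right ) {
                int t = a[left]; a[left] = a[right]; a[right] = t;
            }
        }
        a[low] = a[right];
        a[right] = pivot_item;
        return right;
    }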



partition ensures that all items less than the pivot precede it and returns the position of the pivot. This meets our condition for dividing the problem: all the items in the lower half are known to be less than the pivot and all items in the upper half are known to be greater than it.

Note that a generic version of partition would use our ItemCmp function to compare items. This assumes that there is an external declaration for ItemCmp and that, in any one program, we only want to sort one type of object. Generally this will not be acceptable, so the formal specification for quicksort in the Unix and ANSI C libraries includes a function, compar, which is supplied to qsort when it is called. Passing the function compar, which defines the ordering of objects, when qsort is called avoids this problem in the same way that we passed an ItemCmp function to ConsCollection.
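For reference, the library route looks like this - a minimal, self-contained sketch using the standard qsort from <stdlib.h>; only the comparison function intcmp is our own:

    #include <stdio.h>
    #include <stdlib.h>

    /* Comparison function in the form qsort expects:
       negative / zero / positive for <, ==, > */
    static int intcmp( const void *a, const void *b )
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    int main( void )
    {
        int a[] = { 23, 5, 47, 9, 16 };
        int i, n = sizeof(a) / sizeof(a[0]);
        qsort( a, n, sizeof(int), intcmp );
        for ( i = 0; i < n; i++ ) printf( "%d ", a[i] );
        printf( "\n" );
        return 0;
    }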



Analysis

The partition routine examines every item in the array at most once, so it is clearly O(n). Usually, the partition routine will divide the problem into two roughly equal sized partitions. We know that we can divide n items in half log2 n times. This makes quicksort an O(n log n) algorithm - equivalent to heapsort. However, we have made an unjustified assumption - see if you can identify it before you continue.

QuickSort Animation
This animation uses the partition in place approach; it was written by Chien Wei Tan.






7.4 Bin Sort

Assume that
1. the keys of the items that we wish to sort lie in a small fixed range and
2. there is only one item with each value of the key.

Then we can sort with the following procedure:
1. Set up an array of "bins" - one for each value of the key - in order,
2. Examine each item and use the value of the key to place it in the appropriate bin.

Now our collection is sorted and it only took n operations, so this is an O(n) operation. However, note that it will only work under very restricted conditions.

Constraints on bin sort

To understand these restrictions, let's be a little more precise about the specification of the problem and assume that there are m values of the key. To recover our sorted collection, we need to examine each bin. This adds a third step to the algorithm above,
3. Examine each bin to see whether there's an item in it,
which requires m operations. So the algorithm's time becomes:

    T(n) = c1*n + c2*m

and it is strictly O(n + m). Now if m > n, then it is O(m). For example, if we wish to sort 10^4 32-bit integers, then m = 2^32 and we need 2^32 operations (and a rather large memory!). For n = 10^4:

    n log n ~ 10^4 x 13 ~ 2^13 x 2^4 ~ 2^17

So quicksort or heapsort would clearly be preferred. An implementation of bin sort might look like:

    #define EMPTY -1    /* Some convenient flag */

    void bin_sort( int *a, int *bin, int m, int n )
    {
        int i;
        /* Pre-condition: for 0 <= i < n : 0 <= a[i] < m */
        for ( i = 0; i < m; i++ ) bin[i] = EMPTY;     /* set up the bins */
        for ( i = 0; i < n; i++ ) bin[a[i]] = a[i];   /* place each item */
    }

Search Trees

... The search routine

    tree_search( tree T, Key key )
    {
        if ( key == T->root ) return T->root;
        else if ( key < T->root ) return tree_search( T->left, key );
        else return tree_search( T->right, key );
    }

is simple and provides us with an O(log n) searching routine as long as we can keep the tree balanced. However, if we simply add items to a tree, producing an unbalanced tree is easy!



This is what happens if we add the letters A B C D E F in that order to a tree: Not exactly well balanced!
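A minimal sketch of the naive insertion that produces this degenerate tree - our own illustration with an assumed node type, not the notes' code:

    #include <stdlib.h>

    typedef struct t_tree_node {
        char key;
        struct t_tree_node *left, *right;
    } TreeNode;

    /* Naive BST insertion - no re-balancing */
    TreeNode *insert( TreeNode *t, char key )
    {
        if ( t == NULL ) {
            t = malloc( sizeof(TreeNode) );
            t->key = key;
            t->left = t->right = NULL;
        } else if ( key < t->key )
            t->left = insert( t->left, key );
        else
            t->right = insert( t->right, key );
        return t;
    }

Inserting 'A','B','C','D','E','F' in order means every comparison takes the right branch, so the "tree" is really a chain and searching it is O(n) - exactly the problem that red-black and AVL trees solve.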



Key terms



Pre-order tree traversal
Traversing a tree in the order: root | left | right

In-order tree traversal
Traversing a tree in the order: left | root | right

Post-order tree traversal
Traversing a tree in the order: left | right | root






8.2 General n-ary trees

If we relax the restriction that each node can have only one key, we can reduce the height of the tree. An m-way search tree
a. is empty, or
b. consists of a root containing j (1 <= j < m) keys, k_1 ... k_j, and a set of j+1 sub-trees, T_0 ... T_j, such that ...

Graph Representations

... (from the breadth-first traversal of a graph, g, starting at node k)

    g->visited[k] = ++search_index;
    al_node = ListHead( g->adj_list[k] );
    while ( al_node != NULL ) {
        j = ANodeIndex( al_node );
        if ( !g->visited[j] ) {
            AddIntToQueue( g, j );
            g->visited[j] = -1;   /* C hack, 0 = false! */
        }
        al_node = ListNext( al_node );
    }



Key terms



Adjacency Matrix
A structure for representing a graph in which the presence of arcs between nodes is indicated by an entry in a matrix.

Adjacency Lists
An alternative structure for representing a graph in which the arcs are stored as lists of connections between nodes.

Breadth-first Traversal
Traversing a graph by visiting all the nodes attached directly to a starting node first.

Depth-first Traversal
Traversing a graph by visiting all the nodes attached to a node attached to a starting node before visiting a second node attached to the starting node.






Prim's Algorithm

Prim's algorithm is very similar to Kruskal's: whereas Kruskal's "grows" a forest of trees, Prim's algorithm grows a single tree until it becomes the minimum spanning tree. Both algorithms use the greedy approach - they add the cheapest edge that will not cause a cycle. But rather than choosing the cheapest edge that will connect any pair of trees together, Prim's algorithm only adds edges that join nodes to the existing tree. (In this respect, Prim's algorithm is very similar to Dijkstra's algorithm for finding shortest paths.)

Prim's algorithm works efficiently if we keep a list d[v] of the cheapest weights which connect a vertex, v, which is not in the tree, to any vertex already in the tree. A second list pi[v] keeps the index of the node already in the tree to which v can be connected with cost d[v].

    int *MinimumSpanningTree( Graph g, int n, double **costs )
    {
        Queue q;
        int i, u, v;
        int d[n], *pi;
        q = ConsEdgeQueue( g, costs );
        pi = ConsPredList( n );
        /* Best distances start at infinity ... */
        for ( i = 0; i < n; i++ ) d[i] = INT_MAX;
        /* ... except node 0, the root of the MST */
        d[0] = 0;
        while ( !Empty( q ) ) {
            u = ExtractCheapest( q );     /* step 5.1 below */
            for each v in g->adj[u] {     /* pseudo-code */
                if ( (v in q) && costs[u][v] < d[v] ) {
                    pi[v] = u;
                    d[v] = costs[u][v];
                }
            }
        }
        return pi;
    }

The steps are:
1. The edge queue is constructed.
2. A predecessor list, with an entry for each node, is constructed.
3. "Best" distances to each node are set to infinity.
4. Node 0 is chosen as the "root" of the MST (any node will do, as the MST must contain all nodes).
5. While the edge queue is not empty,
   1. Extract the cheapest vertex, u, from the queue,
   2. Relax all its neighbours - if the cost, costs[u][v], of the edge from u to v is smaller than d[v], the current best distance of v from the tree formed so far, then update d[v] and set v's predecessor to u.
6. Return the predecessor list.

The time complexity is O(V log V + E log V) = O(E log V), making it the same as Kruskal's algorithm. However, Prim's algorithm can be improved using Fibonacci heaps (cf Cormen) to O(E + V log V).



Key terms



Predecessor list
A data structure for defining a graph by storing a predecessor for each node with that node. Thus it uses a single array of integers to define a sub-graph of a graph.

Fibonacci Heaps
See Cormen, chapter 21.






10.2 Dijkstra's Algorithm

Dijkstra's algorithm (named after its discoverer, E.W. Dijkstra) solves the problem of finding the shortest path from a point in a graph (the source) to a destination. It turns out that one can find the shortest paths from a given source to all points in a graph in the same time, hence this problem is sometimes called the single-source shortest paths problem. The somewhat unexpected result that all the paths can be found as easily as one further demonstrates the value of reading the literature on algorithms!

This problem is related to the spanning tree one. The graph representing all the paths from one vertex to all the others must be a spanning tree - it must include all vertices. There will also be no cycles, as a cycle would define more than one path from the selected vertex to at least one other vertex.

For a graph, G = (V,E), where
● V is a set of vertices and
● E is a set of edges,

Dijkstra's algorithm keeps two sets of vertices:
S - the set of vertices whose shortest paths from the source have already been determined, and
V-S - the remaining vertices.

The other data structures needed are:
d - an array of best estimates of the shortest path to each vertex, and
pi - an array of predecessors for each vertex.

The basic mode of operation is:
1. Initialise d and pi,
2. Set S to empty,
3. While there are still vertices in V-S,
   i. sort the vertices in V-S according to the current best estimate of their distance from the source,
   ii. add u, the closest vertex in V-S, to S,
   iii. relax all the vertices still in V-S connected to u.

Relaxation

The relaxation process updates the costs of all the vertices, v, connected to a vertex, u, if we could improve the best estimate of the shortest path to v by including (u,v) in the path to v.
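In C, the relaxation step for an edge (u,v) can be sketched as follows - a sketch consistent with the d and pi arrays just described, not code from the notes:

    /* Relax edge (u,v): if the path through u improves the current best
       estimate of the shortest path to v, adopt it and record u as
       v's predecessor. w[u][v] is the cost of the edge (u,v). */
    void relax( int u, int v, double **w, double *d, int *pi )
    {
        if ( d[v] > d[u] + w[u][v] ) {
            d[v] = d[u] + w[u][v];
            pi[v] = u;
        }
    }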



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/dijkstra.html (1 of 3) [3/23/2004 2:56:52 PM]



Data Structures and Algorithms: Dijkstra's Algorithm



The relaxation process requires the following initialisation:

    initialise_single_source( Graph g, Node s )
        for each vertex v in Vertices( g )
            g.d[v] := infinity
            g.pi[v] := nil
        g.d[s] := 0;

This sets up the graph so that each node has no predecessor (pi[v] = nil) and the estimates of the cost (distance) of each node from the source (d[v]) are infinite, except for the source node itself (d[s] = 0). Note that we have also introduced a further way to store a graph (or part of a graph - as this structure can only store a spanning tree): the predecessor sub-graph - the list of predecessors of each node, pi[j], 1 <= j <= |V| ...

Feedback from assignment 2

... This helps when trying to verify log relationships, as you need values of n spanning quite a wide range, because d(log n)/dn decreases with n. Some of you were 'misled' when T(n)/log n appeared to become constant. In fact, to verify a log relationship, a set of logarithmically spaced values of n is best: n = 1000, 2000, 4000, 8000, 16000, 32000, ... would have given a clear result!

Perturbing results

When designing any experiment, it's important to eliminate all possible sources of error. Here you are timing certain operations (adds, finds, ...): make sure that you are timing only those operations!
■ All unnecessary code should be removed. This includes function calls - put the timing code inside the AddAll, etc, functions.
■ Remove all the code that was originally there to verify the functions, eg the if ( ip == &list[i] ) after the FindInCollection call.
These things don't perturb your results by very much, but you can eliminate these sources of error trivially, so why not do it?
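As an illustration of "put the timing code inside the function", here is a minimal sketch using the standard clock() from <time.h> and the AddToCollection operation specified in the tutorials; the wrapper name AddAllTimed is our own:

    #include <time.h>
    #include "collection.h"   /* AddToCollection, as specified in the tutorials */

    /* Time n add operations from inside one function, so that the test
       harness and any verification code are excluded from the measurement */
    double AddAllTimed( Collection c, int *items, int n )
    {
        int i;
        clock_t start, stop;
        start = clock();
        for ( i = 0; i < n; i++ )
            AddToCollection( c, &items[i] );
        stop = clock();
        return (double)(stop - start) / CLOCKS_PER_SEC;
    }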






As many of you will have found, the cache on a modern processor causes your results to 'creep' slowly towards the predicted result. If you're doing the experiment carefully and want to ensure that you've actually proven your hypothesis, then you will need to eliminate all potential sources of error!

Users of DOS/Windows machines

Please make sure that the extra CRs needed by DOS, etc, are removed before you submit. Unix files should have LFs only at the end of a line. The same applies to reports produced by your word processor: export them as plain text without the extra CRs. Reports which are not plain text will not be accepted - there are a large number of word processors out there: it's much more productive (and therefore beneficial for you!) if the tutors spend time marking your report's content rather than trying to work out which WP will read it!






Workshop 3 - Minimum Spanning Trees

For the assignments which follow from this workshop, you are expected to produce a program which reads a description of a problem from a file, calculates the minimum spanning tree and prints out the tree and its cost.

Rules

1. Nodes are labelled with an arbitrary string. In the present problem, this string will not be longer than 100 characters. However, it would be a mistake to submit a program in which you cannot trivially change this requirement!
2. Node labels have no spaces in them.
3. For simplicity, edges have the same cost in both directions, ie the cost of edge "abc"->"pqr" is the same as that for "pqr"->"abc". A program which can be trivially extended to handle asymmetric costs will attract a small bonus. You should indicate in your report the changes that are necessary.
4. All edge costs are positive.
5. Unspecified edges are assumed to be impossible. (Your program should use a suitable representation!)
6. You may print out the resulting MST in any intelligible format, but your report should obviously explain the format.

Procedure

1. Find a partner and decide how to split the work for this assignment between yourself and your colleague.
2. In the tutorial session preceding the lab session, start the design of the ADTs that you will need for this assignment. By now, you should interpret "design" as meaning "design and formally specify".
3. Get the full design checked off by the lecturer or tutor before proceeding to implement your solution. Note that, for graphs, there are a number of ways of implementing the graph structure. You should understand by now that the specification and the implementation details are quite separate. You can derive a specification by looking at the requirements of the problem, which specify the abstract operations that will be needed.






4. For assignment 3,
   a. Formally test each ADT used by performing an equivalence class analysis on it and generating a program (or programs) to check each class. Note that while you may have constructed the full MST program at this point, it is not required for this submission.
5. For assignment 4,
   1. Submit a program which will read the test file and find and print out the MST.
   2. Design (or automatically generate) additional test sets to formally test the whole program.
   3. Confirm that the running time of the whole algorithm is as expected. Running time of the algorithm does not include the time to load the data from the test file.
   For 2 and 3, you may generate the test data within the test programs themselves, ie it is not necessary to read data from a file. A script which automatically runs the test program with a set of test files is also acceptable.
6. Submissions
   a. Both submissions should be accompanied by an appropriate report.
   b. You should make one submission with your partner. Either one of you can make the actual submission: just make sure that you have collected all the relevant files into the submission directory.
   c. Your report (and the program file prologues) should clearly identify the contributions of each partner to the joint submission.

File format

    Lines          Content             Format      Notes
    1 to 1         n                   %d          Number of nodes
    2 to n+1       labeli              %s          Node label
    n+2 to EOF     labeli labelj cij   %s %s %g    Edge descriptor and weight



Notes:
● Fields are separated by spaces.
● The remainder of a line is to be ignored and may be used for comments.
● Costs are real numbers: you may assume costs are less than 10^6.

A sample of the format may be found in mst.test.
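A minimal sketch of reading this format - our own illustration, not a model answer: it merely echoes the edges, error handling is elided, and lines are read whole so that trailing comments are ignored:

    #include <stdio.h>

    #define MAX_LABEL 100

    int main( int argc, char *argv[] )
    {
        FILE *f = fopen( argv[1], "r" );
        char line[BUFSIZ], a[MAX_LABEL+1], b[MAX_LABEL+1];
        float cost;
        int n, i;

        fgets( line, sizeof line, f );           /* line 1: node count */
        sscanf( line, "%d", &n );
        for ( i = 0; i < n; i++ ) {              /* lines 2 .. n+1: labels */
            fgets( line, sizeof line, f );
            sscanf( line, "%100s", a );          /* add node a to the graph here */
        }
        while ( fgets( line, sizeof line, f ) )  /* lines n+2 .. EOF: edges */
            if ( sscanf( line, "%100s %100s %g", a, b, &cost ) == 3 )
                printf( "edge %s - %s, cost %g\n", a, b, cost );
        fclose( f );
        return 0;
    }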






Submission

Follow the same submission procedure used for the previous assignments.

Dates



Designs checked and signed off - 5pm, Thu 19th Sept. Designs submitted by this deadline will be checked and returned by the following day.
Assignment 3 (ADTs programmed and verified) - 5pm, Tue 8th Oct.
Assignment 4 (full program verified; time complexity verified) - 5pm, Thu 24th Oct.



Proceeding to the coding stage without having your design checked is not likely to be productive. For a quick review and sign-off, you may bring your design to me any time before the deadline: otherwise, submit it using the normal submission procedure as the "3des" assignment. The submission procedure automatically sends me an email message when you have submitted - I will attempt to review all designs within 24 hours and will email you my comments.






Feedback from assignment 3

1. Separating the classes

Many of you failed to place each class in a separate file! This allows:
1. separate development - once the specification is decided, you can individually work on separate parts of the whole problem,
2. separate verification - each class can be verified independently of the others.
Of course, sometimes one class depends on another, so complete independence can't be achieved. However, the testing strategy becomes crystal clear: test all the classes which don't depend on any others first, then test classes that only depend on this first group, and so on.

2. Equivalence Classes

Very few of you had any decent approach to proving individual methods of classes correct. Some were trivial - simply put some data in an object and verify that you could get it back! Such tests can be performed entirely automatically: the program sets the object's attributes and compares the values returned by 'projector' functions. By using a program for this, you make use of the machine's ability to mechanically compare large amounts of data accurately (once the test program is correct, of course!). Generally, there will be a large number of equivalence classes - and therefore test cases. These can be handled in three ways:
1. Use program code to generate the test data sets. For instance, if you want to test 0,1,2,...,n,n+1 items where the items are random numbers, write a function to generate the appropriately sized arrays of random numbers.
2. Use data structures to hold the tests. For example, for node labels, make an array of strings:

    char *labels[] = { "a", "aa", "aaa", "b", "" };






    #define N_LABELS (sizeof(labels)/sizeof(char *))



Note how C allows you to put an arbitrary number of items in an array, using [], and #define a symbol which gives the number of items. This means that as you discover a need for more tests, they are trivially added to labels and no other part of the program needs changing!
3. Put the test data in files - prepared with a text editor, or another program. This would be a good approach for testing the MST itself:
   ■ determine what cases need testing,
   ■ produce a number of files with the appropriate data in them,
   ■ run the program reading from each file in turn (give the files names like "graph1", "graph2", etc, so that a program can automatically read them all!) or
   ■ write a Unix shell script to run the program with each file as input and capture the test output.

3. Presenting Verification Results

The best way to do this is with a table. Some examples of the sorts of entries that could be made in each column are shown:

    Class                   Representative Test   Location      Expected Result      Result
    (brief description of   (value of data, name  (file, name
    equivalence class)      of test, name of      of program,
                            data set, etc)        name of
                                                  function, etc)
    No data                 -                     no_data.c     Assertion raised     Assertion raised
    Empty data set          -                     null_data.c   NULL return          NULL
    n > max                 10^6                  large_n.c     Assertion raised     Assertion raised
    Single datum            data_1                testx.c       Same data returned   OK
    2 points out of order   data_2_out            testx.c       Order reversed       OK
    2 points in order       data_2_in             testx.c       Order unchanged      OK

You can obviously vary the columns (particularly the second and third) to suit the style of test that you are making.






Feedback from assignment 4

1. Testing the MST

Proving your MST algorithm has two parts:
❍ proving that a spanning tree is produced and
❍ proving that it's the minimum spanning tree.
The first can be done easily by inspection or by a simple program. Proving that you've found the MST is not quite so simple: you could rely on the formal proof of the algorithm and simply attempt to prove that the operations (cycle, getcheapest, etc) on which the MST algorithm relies are performing correctly, or you could do something like exhaustively generate all the possible trees and thus demonstrate that the tree produced by your program was indeed a MST. For maximum marks, you were expected to address both parts of the problem in your report and do something about the first part.

One of the post-conditions of your MST method should have been: the tree produced is a spanning tree. Adding an additional method, SpanningTree( Graph g ), to your graph class and using it as the post-condition for MST is the best approach. You can then simply run the MST method on a number of carefully chosen graphs, and if the post-condition assertion is never raised (except perhaps for deliberate error inputs, such as disjoint graphs which don't have a spanning tree at all!) then you can assert that your function is producing a spanning tree at least!

2. Complexity of MST

Incredibly, quite a number of people started their experiments with a faulty hypothesis. There is no excuse for this - the correct expression can be found in any one of a number of texts. It's also in the PLDS210 notes on the Web.






Past Exams

1997: November, 1997 Final Exam

Note that the material on abstract data types which used to be in the PLSD210 course was moved to the CLP110 course in 1997 and will not be examined directly in PLSD210. However, CLP110 is a prerequisite for this course, so an understanding of the basic principles of object-oriented design and abstract data types is expected!






Tutorials

Tutorial 1

1. Arrays or Linked Lists?
An array implementation of a collection requires O(n) time to search it (assuming it's not ordered). A linked list also requires O(n) time to search. Yet one of these will be quite a bit faster on a high-performance modern processor. Which one? Why?
Hint: Part of the answer is found in the next question and part in IPS205 - the computer architecture section.



2. Overheads
The storage requirements for a typical modern RISC processor are:

    Type      Space (bytes)
    integer   4
    pointer   4
    float     4
    double    8

A typical implementation of malloc will use an extra 4 bytes every time it allocates a block of memory. Calculate the overheads for storing various numbers of items of the types listed using the array and list implementations of our collection object. Overhead here means that if a data structure requires 1140 bytes to store 1000 bytes of data, the overhead is 14%. Fill in the table:

    Item type                            Number of items   Array   List
    integer                              100
    integer                              1000
    double                               100
    double                               1000
    struct { int x, y; double z[20]; }   100
    struct { int x, y; double z[20]; }   1000
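As a worked example of the intended accounting - under one plausible model, which is our assumption rather than the tutorial's official answer - consider 100 integers. In a linked list with one node per item, each node needs a 4-byte pointer to the item, a 4-byte next pointer and malloc's extra 4 bytes:

    overhead per item = 4 + 4 + 4 = 12 bytes on 4 bytes of data, ie 300%

whereas an array of 100 integers allocated as a single block carries only one 4-byte malloc overhead: 4/400 = 1%.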



3. Complexity
Modern processors have clock speeds in excess of 100 MHz. Thus a RISC processor may be executing more than 1x10^8 machine instructions per second. This means they can process of the order of 1x10^7 "operations" per second. An "operation" is loosely defined here as something like one iteration of a very simple loop.

Assuming that your patience allows you to wait
a. one second,
b. one minute,
c. one hour,
d. one day,
calculate how large a problem you can solve if it is
i. O(log2 n)
ii. O(n)
iii. O(sqrt(n))
iv. O(n log2 n)
v. O(n^2)
vi. O(2^n)
vii. O(n!)
viii. O(n^n)

Numbers beyond the range of your calculator can simply be reported as "> 10^x" or "< 10^-x", where x is determined by your calculator.

To try this in reverse, assume that to be certain of beating Kasparov in the next "Man vs machine" chess challenge, we would need to look ahead 40 moves. How long will it take one of today's computers to calculate each move? For simplicity, assume that, on average, the number of possible moves is the same for every move: but if you know of any other estimate for the number of moves in chess, then use that. And if you don't know western chess, substitute Chinese chess or Go (and the appropriate current champion's name!).






Tutorial Problems: Part 2

Tutorial 2

● Asymptotic behaviour
a. Threshold values: For what values of n is 4 x 10^6 n^2 > 10 x 2^n ?
b. Algorithm comparison: Algorithm A requires 200 machine cycles for each iteration and requires n log n iterations to solve a problem of size n. A simpler algorithm, B, requires 25 machine cycles for each iteration and requires n^2 iterations to solve a problem of size n. Under what conditions will you prefer algorithm A over algorithm B?



Tutorial 3

● Simple ADT Design

A double-ended queue or deque is one that has both LIFO and FIFO behaviour, ie you can add an item to the head or the tail of a list and extract an item from the head or the tail. Taking the following specification for the Collection class, modify it to handle a deque. Note:
● There are quite a few ways that a software engineer could do this: see how many you can devise!
● A software engineer would probably try to ensure that code using the original specification continued to function correctly.

Similarly, modify the implementation to handle a deque.

/* Specification for Collection */



    typedef struct t_Collection *Collection;

    Collection ConsCollection( int max_items, int item_size );
    /* Construct a new Collection
       Pre-condition: max_items > 0
       Post-condition: returns a pointer to an empty Collection */

    void AddToCollection( Collection c, void *item );
    /* Add an item to a Collection
       Pre-condition: (c is a Collection created by a call to ConsCollection) &&
                      (existing item count < max_items) && (item != NULL)
       Post-condition: item has been added to c */

    void DeleteFromCollection( Collection c, void *item );
    /* Delete an item from a Collection
       Pre-condition: (c is a Collection created by a call to ConsCollection) &&
                      (existing item count >= 1) && (item != NULL)
       Post-condition: item has been deleted from c */

    void *FindInCollection( Collection c, void *key );
    /* Find an item in a Collection
       Pre-condition: (c is a Collection created by a call to ConsCollection) &&
                      (key != NULL)
       Post-condition: returns an item identified by key if one exists,
                       otherwise returns NULL */



    /* Linked list implementation of a collection */
    #include <stdlib.h>     /* calloc */
    #include <stdio.h>      /* NULL */
    #include <assert.h>     /* Needed for assertions */
    #include "collection.h" /* import the specification */

    extern void *ItemKey( void * );

    struct t_node {



        void *item;
        struct t_node *next;
    } node;

    struct t_collection {
        int size;           /* Needed by FindInCollection */
        struct t_node *node;
    };

    collection ConsCollection( int max_items, int item_size )
    /* Construct a new collection
       Pre-condition: (max_items > 0) && (item_size > 0)
       Post-condition: returns a pointer to an empty collection */
    {
        collection c;
        /* Although redundant, this assertion should be retained
           as it tests compliance with the formal specification */
        assert( max_items > 0 );
        assert( item_size > 0 );
        c = (collection)calloc( 1, sizeof(struct t_collection) );
        c->node = (struct t_node *)0;
        c->size = item_size;
        return c;
    }

    void AddToCollection( collection c, void *item )
    /* Add an item to a collection
       Pre-condition: (c is a collection created by a call to ConsCollection) &&
                      (existing item count < max_items) && (item != NULL)
       Post-condition: item has been added to c */
    {
        struct t_node *new;
        assert( c != NULL );
        assert( item != NULL );
        /* Allocate space for a node for the new item */
        new = (struct t_node *)malloc( sizeof(struct t_node) );
        /* Attach the item to the node */
        new->item = item;
        /* Make the existing list `hang' from this one */
        new->next = c->node;
        /* The new item is the new head of the list */



        c->node = new;
        assert( FindInCollection( c, ItemKey( item ) ) != NULL );
    }

    void DeleteFromCollection( collection c, void *item )
    /* Delete an item from a collection
       Pre-condition: (c is a collection created by a call to ConsCollection) &&
                      (existing item count >= 1) && (item != NULL)
       Post-condition: item has been deleted from c */
    {
        struct t_node *node, *prev;
        assert( c != NULL );
        /* The requirement that the collection has at least one item
           is expressed a little differently */
        assert( c->node != NULL );
        assert( item != NULL );
        /* Select node at head of list */
        prev = node = c->node;
        /* Loop until we've reached the end of the list */
        while ( node != NULL ) {
            if ( item == node->item ) {
                /* Found the item to be deleted, re-link the list around it */
                if ( node == c->node )
                    /* We're deleting the head */
                    c->node = node->next;
                else
                    prev->next = node->next;
                /* Free the node */
                free( node );
                break;
            }
            prev = node;
            node = node->next;
        }
    }
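One of the many possible approaches - a sketch only, with hypothetical names, since the exercise asks you to devise your own: add a tail pointer to t_collection (initialised to NULL in ConsCollection) so that adding at the tail is O(1):

    /* Assumes struct t_collection has gained a member:
       struct t_node *tail;   -- last node in the list, NULL when empty */
    void AddToCollectionTail( collection c, void *item )
    {
        struct t_node *new;
        assert( c != NULL );
        assert( item != NULL );
        new = (struct t_node *)malloc( sizeof(struct t_node) );
        new->item = item;
        new->next = NULL;
        if ( c->node == NULL )     /* empty list: new node is head and tail */
            c->node = new;
        else
            c->tail->next = new;   /* link after the current tail */
        c->tail = new;
    }

AddToCollection and DeleteFromCollection must, of course, be updated to maintain tail as well.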






Key terms



deque
A double-ended queue - one to which items can be added at both the head and the tail, and from which items can be extracted from the head or the tail.






Tutorial Problems: Part 3

Tutorial 3

4. B+-tree Design
You are constructing a database for the Tax Office of a medium-sized Pacific nation. The primary key for records is a 9-digit tax file number. (This Tax Office is so pre-occupied with a new tax scheme that their Prime Minister is touting as the answer to all their economic woes that it hasn't learnt about binary representations for numbers yet and still uses one byte per decimal digit. They get really uptight when anyone mentions the year 2000 problem!) The records for this database will be stored on discs which have an average access time of 2.5 ms to read a block of 4 Kbytes. Each disc uses a 32-bit integer to address blocks within the disc. The database spreads over multiple discs, so a 16-bit disc identifier has to be added to each block address. It takes 150 ns to read a word from the computer's main memory. The population is 17x10^6 taxpayers and 0.78x10^3 politicians (1.8x10^3 millionaires have never needed to submit a tax return). There are also records for 1.2x10^6 foreign residents who pay some tax and 2.8x10^6 companies, trusts and other tax avoidance structures.
a. How much space will you use to store the indices for this database? Work out the minimum and maximum values for the space - in either disc blocks or bytes. (Don't worry about the poor taxpayers' records, which are humungous as they include details such as the cost of the beer consumed by every travelling salesman in every bar in the country - in order to calculate FBT.)
b. How long will it take to find a taxpayer's record?
c. How much disc space has been wasted in the indices?
d. Many compilers align fields on word boundaries for efficiency. For example, a 9-byte field is padded out to 12 bytes (the next multiple of the 4-byte word length) so that the next field lies on a word boundary. Would this be a good thing to do in this case? Would it be worth going to some effort to prevent the compiler from padding out short fields?

SECOND SEMESTER EXAMINATIONS NOVEMBER 1996: PROGRAMMING LANGUAGES AND SOFTWARE DESIGN 210 (623.210)

... a very large (10^6) collection of items. Arrange the following sorting algorithms in order of their expected running time - slowest first. If you would expect two algorithms to take about the same time, then group them together. Assume that you have so much memory on your computer that memory will not be a factor.
(a) Quick sort (b) Insertion sort (c) Radix sort (d) Heap sort

20. When would you use insertion or bubble sort effectively? Explain your answer. [2 marks]
21. I can obtain better performance from another algorithm. What is it? [1 mark]
22. Give two restrictions on the use of this algorithm. [2 marks]

Graphs
23. Why is it necessary to be able to distinguish between problems which map to the travelling salesman's problem and the minimum spanning tree problem? [2 marks]
24. What data structure would you use to determine whether adding an edge to a graph causes a cycle? Write one sentence describing how this structure is used. [2 marks]
25. What is the time complexity of the cycle determining operation? [1 mark]

Hard problems
26. Give the time complexity of a typical intractable algorithm. [1 mark]
27. Can I solve such a problem for (a) small n? (b) large n? In each case, add to your yes or no answer a phrase describing the quality of the best answer that you can obtain with a practical computer. [3 marks]

Verifying functions
28. Why is the concept of equivalence classes useful in verifying functions? [2 marks]



Section B

QUESTION B1 (15 marks)



You are developing a private network for your company, which has a very large number of outlets all over the country for its mechanically produced, sterile (untouched by human hands!) hamburgers. Each outlet must be connected to the network so that management can arrange to ship all the packaging (the most significant component of your company's output) to the outlets just as it is needed. This network is to be just like the Internet, with nets of multiple redundant links connecting all the outlets. Nodes will be placed in strategic locations. From each node it is possible to have multiple links to other nodes. Network nodes receive large numbers of messages (not all relevant to the company's operations - see below) and are responsible for forwarding them to their correct destination. Links use a variety of technologies - copper wire, optical fibre and satellite. (The chairman of the board hasn't really understood the Internet yet, so there is a very low bandwidth link between his secretary's desk and his: his secretary copies the messages onto pieces of paper and carries them into his office.) All of these links have different bandwidths and thus different costs associated with sending a message over them. In cases of extreme packaging shortage, outlets have to communicate directly with each other to arrange emergency supplies.

As part of the network design process, you have to determine:

(a) The most efficient routing for messages from any outlet to any other. Which algorithm would you use for this? What is its complexity?

(b) The most efficient route for a broadcast message (which emanates from the chairman's office) to reach all nodes. Which algorithm would you use for this? What is its complexity?

(c) The chairman hasn't heard of Internet television yet, so insists on visiting each outlet once a year to encourage the workers. He's getting rather frail now, so it's important that the most efficient way for him to do this is found. For some reason, this pure public relations exercise has landed on your desk too - perhaps because you have the only up-to-date database containing all the outlet locations. You are required to plan his route. Which algorithm would you use for this? How long would it take you to compute the chairman's route?

(d) You once accidentally showed a few of your colleagues how the network could run all the chat programs - because it was using standard Internet protocols. Within two days, the prototype network became clogged with messages that seemed to contain little more than "Hi" followed by a name and questions about the weather at other outlets around the country. When the chairman heard about this, he thought it was magnificent that all his employees were talking to each other and refused your request to junk all chat packets from the network. You were forced to add a filter which tagged all chat packets as "non-urgent" and packaging supply and other messages which were actually relevant to the company's operations with 57 other levels of urgency. Which algorithm should you use at each network node to ensure that messages relating to the company's operations take precedence over the weather? If there are, on average, n messages waiting for forwarding at each node at any one time, and it takes approximately c microseconds to allocate space for a new message, compare its urgency level with another and decide to swap their positions, approximately how long will it take to receive each message at a node?

QUESTION B2 (10 marks)

Your P9 computer is able to analyse one million chess moves per second. A genetic engineer has succeeded in combining some of Kasparov's DNA with some recovered from Einstein's fingerprints in a cloned monkey which can now - with absolute reliability - think 10 moves ahead. Assume there are, on average, about 20 possible moves at each position. Assume also that you are able to purchase and connect together, without loss of efficiency, as many of your P9's as you need. You have 100 seconds for each move. How many P9's will you need in order to at least draw with this monkey?



QUESTION B3 (25 marks)

Design a software module for supporting operations on a class of graphs. This class must provide all the methods necessary to calculate a minimum spanning tree (MST). (Provision of methods to support other common graph algorithms will attract a small bonus - a maximum of 5 marks, which will be used to compensate for other flaws in your answer, increasing your chance of obtaining the maximum 25 marks for this question.)

Rules:
i) Graphs consist of a set of nodes and edges.
ii) Initially a graph will be constructed with no nodes and no edges.
iii) Nodes and edges are to be added separately.
iv) The number of nodes and edges in the graph at any one time needs to be available.

(a) Provide a complete formal software definition for the graph class. This should be in the form of a program module that would be accepted by an ANSI C compiler. (Minor syntactic errors will be ignored.)

(b) Suggest a set of data structures which could be used effectively internally in the graph structure to handle the nodes, edges and any other information needed by the class to support operations on it. Obviously, the structures you mention should be sufficient to implement the minimum spanning tree algorithm.

(c) Describe how the cycle determining step of the MST algorithm will work. You may do this by
i) simply describing the algorithm step by step in natural language (with appropriate references to the actual data structures to be used), or
ii) providing suitably annotated actual code (the comments should be sufficiently detailed to enable the algorithm to be understood), or
iii) any combination of (i) and (ii).
It is strongly suggested that appropriate diagrams showing the working of the algorithm be used to augment your description.



Tutorial Problems: Part 7

Under construction!



Texts

The following is a (non-exhaustive) list of texts which are in the UWA library which cover aspects of this course. Not all the texts cover all the material - you will need to search a little for some of the topics. Since there are many texts here, it's probably simpler to note a few representative catalogue numbers and simply look in the shelves in that area! For instance, 005.73 obviously has a decent block of texts. Texts highlighted in red have been used as sources for some of the material in this course.



Brown, Marc H. Algorithm animation. Cambridge, Mass: M.I.T. Press, c1988. [FIZ 006.6 1988 ALG]

Harel, David. Algorithmics: the spirit of computing. Wokingham, England; Reading, Mass: Addison-Wesley, c1987. [FIZ 004 1987 ALG]

Sedgewick, Robert. Algorithms. Reading, Mass: Addison-Wesley, c1983. [SRR 517.6 1983 ALG]

Sedgewick, Robert. Algorithms. Reading, Mass: Addison-Wesley, c1988. [FIZ Reserve 517.6 1988 ALG]

Kingston, Jeffrey H. Algorithms and data structures: design, correctness, analysis. Sydney: Addison-Wesley, 1990. [FIZ 005.73 1990 ALG]

Wirth, Niklaus. Algorithms + data structures = programs. Englewood Cliffs, N.J: Prentice-Hall, c1976. [FIZ 005.1 1976 ALG]

Moret, B.M.E. and Shapiro, H.D. Algorithms from P to NP. Redwood City, CA: Benjamin/Cummings, c1991. [FIZ 005.1 1991 ALG]

Sedgewick, Robert. Algorithms in C. Reading, Mass: Addison-Wesley, c1990. [SRR 005.133 1990 ALG]

Collected algorithms from ACM. New York, N.Y: Association for Computing Machinery, 1975-. [R 005.1 FIZ Reference; MICROFICHE MP 430 FIZ Microform]

Moffat, David V. Common algorithms in Pascal with programs for reading. Englewood Cliffs, N.J: Prentice-Hall, c1984. [FIZ 005.133 1984 COM]

Baase, Sara. Computer algorithms: introduction to design and analysis. Reading, Mass: Addison-Wesley, c1978. [FIZ 005.1 1978 COM]

Walker, Henry M. Computer science 2: principles of software engineering, data ... Glenview, Ill: Scott, Foresman, c1989. [FIZ 005.1 1989 COM]

Garey, Michael R. and Johnson, David S. Computers and intractability: a guide to the theory of NP-completeness. San Francisco: W.H. Freeman, c1979. [FIZ 005.1 1979 COM]

Aho, Alfred V., Hopcroft, John E. and Ullman, Jeffrey D. Data structures and algorithms. Reading, Mass: Addison-Wesley, c1983. [FIZ 005.73 1983 DAT]

Aho, Alfred V., Hopcroft, John E. and Ullman, Jeffrey D. The design and analysis of computer algorithms. Reading, Mass: Addison-Wesley, 1974. [FIZ 005.1 1974 DES]

Mehlhorn, Kurt. Data structures and algorithms (3 volumes). Berlin; New York: Springer, 1984. [FIZ 005.73 1984 DAT]

Brassard, Gilles and Bratley, Paul. Fundamentals of algorithmics. Englewood Cliffs, N.J: Prentice Hall, c1996. [FIZ 517.6 1996 FUN]

Horowitz, Ellis and Sahni, Sartaj. Fundamentals of computer algorithms. Potomac, Md: Computer Science Press, c1978. [FIZ 005.12 1978 FUN]

Gonnet, G.H. Handbook of algorithms and data structures: in Pascal and C. Wokingham, England; Reading, Mass: Addison-Wesley, c1991. [SRR 005.133 1991 HAN]

Cormen, Thomas H., Leiserson, Charles E. and Rivest, Ronald L. Introduction to algorithms. Cambridge, Mass: MIT Press; New York: McGraw-Hill, c1990. [FIZ Reserve 005.1 1990 INT]

Tremblay, Jean-Paul. An introduction to computer science: an algorithmic approach. New York: McGraw-Hill, c1979. [FIZ 005.1 1979 INT]

Machtey, Michael. An introduction to the general theory of algorithms. New York: North Holland, c1978. [FIZ 005.13 1978 INT]

Greene, Daniel H. and Knuth, Donald E. Mathematics for the analysis of algorithms. Boston: Birkhauser, c1981. [FIZ 517.6 1981 GRE]

Reinelt, G. The traveling salesman: computational solutions for TSP applications. Berlin; New York: Springer-Verlag, c1994. [FIZ P 004.05 P27]

Budd, Timothy A. Classic data structures in C++. Reading, Mass: Addison-Wesley, c1994. [FIZ 005.73 1994 CLA]

Standish, Thomas A. Data structure techniques. Reading, MA: Addison-Wesley, c1980. [FIZ 005.73 1980 DAT]






Data Structures & Algorithms - Online courses

This is a partial list of on-line course material and tutorials for data structures and algorithms.
1. Thomas Niemann's text on sorting and searching
2. Updated version of Thomas Niemann's text






Sorting and Searching Algorithms: A Cookbook

Thomas Niemann



Preface

This is a collection of algorithms for sorting and searching. Descriptions are brief and intuitive, with just enough theory thrown in to make you nervous. I assume you know C, and that you are familiar with concepts such as arrays and pointers. The first section introduces basic data structures and notation. The next section presents several sorting algorithms. This is followed by techniques for implementing dictionaries, structures that allow efficient search, insert, and delete operations. The last section illustrates algorithms that sort data and implement dictionaries for very large files. Source code for each algorithm, in ANSI C, is available at the site listed below. Permission to reproduce this document, in whole or in part, is given provided the original web site listed below is referenced, and no additional restrictions apply. Source code, when part of a software project, may be used freely without reference to the author.



THOMAS NIEMANN Portland, Oregon



email: [email protected]
home: http://members.xoom.com/thomasn/s_man.htm



By the same author: A Guide to Lex and Yacc, at http://members.xoom.com/thomasn/y_man.htm.



CONTENTS

1. INTRODUCTION

2. SORTING
   2.1 Insertion Sort
   2.2 Shell Sort
   2.3 Quicksort
   2.4 Comparison

3. DICTIONARIES
   3.1 Hash Tables
   3.2 Binary Search Trees
   3.3 Red-Black Trees
   3.4 Skip Lists
   3.5 Comparison

4. VERY LARGE FILES
   4.1 External Sorting
   4.2 B-Trees

5. BIBLIOGRAPHY



1. Introduction

Arrays and linked lists are two basic data structures used to store information. We may wish to search, insert or delete records in a database based on a key value. This section examines the performance of these operations on arrays and linked lists.



Arrays

Figure 1-1 shows an array, seven elements long, containing numeric values. To search the array sequentially, we may use the algorithm in Figure 1-2. The maximum number of comparisons is 7, and occurs when the key we are searching for is in A[6].

  index:  0   1   2   3   4   5   6
  value:  4   7  16  20  37  38  43

Figure 1-1: An Array



  int function SequentialSearch (Array A, int Lb, int Ub, int Key);
  begin
      for i = Lb to Ub do
          if A[i] = Key then
              return i;
      return -1;
  end;

Figure 1-2: Sequential Search



-4-



  int function BinarySearch (Array A, int Lb, int Ub, int Key);
  begin
      do forever
          M = (Lb + Ub) / 2;
          if (Key < A[M]) then
              Ub = M - 1;
          else if (Key > A[M]) then
              Lb = M + 1;
          else
              return M;
          if (Lb > Ub) then
              return -1;
  end;

Figure 1-3: Binary Search



If the data is sorted, a binary search may be done (Figure 1-3). Variables Lb and Ub keep track of the lower bound and upper bound of the array, respectively. We begin by examining the middle element of the array. If the key we are searching for is less than the middle element, then it must reside in the top half of the array. Thus, we set Ub to (M - 1). This restricts our next iteration through the loop to the top half of the array. In this way, each iteration halves the size of the array to be searched. For example, the first iteration will leave 3 items to test. After the second iteration, there will be one item left to test. Therefore it takes only three iterations to find any number. This is a powerful method. Given an array of 1023 elements, we can narrow the search to 511 elements in one comparison. Another comparison, and we're looking at only 255 elements. In fact, we can search the entire array in only 10 comparisons. In addition to searching, we may wish to insert or delete entries. Unfortunately, an array is not a good arrangement for these operations. For example, to insert the number 18 in Figure 1-1, we would need to shift A[3]…A[6] down by one slot. Then we could copy number 18 into A[3]. A similar problem arises when deleting numbers. To improve the efficiency of insert and delete operations, linked lists may be used.
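For readers who want to try this, here is a minimal C rendering of the pseudocode in Figure 1-3: a sketch for illustration only, with function and variable names that are assumptions, not part of the original listings.

  /* Binary search of a sorted int array a[lb..ub]; returns the index
     of key, or -1 if it is absent (illustrative sketch). */
  int binarySearch(const int a[], int lb, int ub, int key) {
      while (lb <= ub) {
          int m = lb + (ub - lb) / 2;   /* midpoint without overflowing lb + ub */
          if (key < a[m])
              ub = m - 1;               /* key must lie below the midpoint */
          else if (key > a[m])
              lb = m + 1;               /* key must lie above the midpoint */
          else
              return m;                 /* found */
      }
      return -1;                        /* bounds crossed: not found */
  }

Called as binarySearch(A, 0, 6, 16) on the array of Figure 1-1, it returns 2.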



-5-



Linked Lists

Figure 1-4: A Linked List (the values 4, 7, 16, 20, 37, 38, 43 in a chain; X points to a new node holding 18, and P to the node after which it is inserted)

In Figure 1-4 we have the same values stored in a linked list. Assuming pointers X and P, as shown in the figure, value 18 may be inserted as follows:

  X->Next = P->Next;
  P->Next = X;



Insertion and deletion operations are very efficient using linked lists. You may be wondering how pointer P was set in the first place. Well, we had to do a sequential search to find the insertion point X. Although we improved our performance for insertion/deletion, it was done at the expense of search time.
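The pointer manipulation is easy to get wrong in code, so a small sketch may help; the Node type and field names here are assumptions for illustration, not the cookbook's own typedefs.

  typedef struct Node {
      int data;
      struct Node *next;
  } Node;

  /* Insert node x after node p, exactly as in Figure 1-4 (sketch).
     The order matters: x must capture p's successor before p changes. */
  void insertAfter(Node *p, Node *x) {
      x->next = p->next;    /* x now points at p's old successor */
      p->next = x;          /* p now points at x */
  }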



Timing Estimates

Several methods may be used to compare the performance of algorithms. One way is simply to run several tests for each algorithm and compare the timings. Another way is to estimate the time required. For example, we may state that search time is O(n) (big-oh of n). This means that search time, for large n, is proportional to the number of items n in the list. Consequently, we would expect search time to triple if our list increased in size by a factor of three. The big-O notation does not describe the exact time that an algorithm takes, but only indicates an upper bound on execution time within a constant factor. If an algorithm takes O(n^2) time, then execution time grows no worse than the square of the size of the list.



-6-



           n   lg n       n lg n        n^1.25                  n^2
           1      0            0             1                    1
          16      4           64            32                  256
         256      8        2,048         1,024               65,536
       4,096     12       49,152        32,768           16,777,216
      65,536     16    1,048,576     1,048,576        4,294,967,296
   1,048,576     20   20,971,520    33,554,432    1,099,511,627,776
  16,777,216     24  402,653,184 1,073,741,824  281,474,976,710,656

Table 1-1: Growth Rates

Table 1-1 illustrates growth rates for various functions. A growth rate of O(lg n) occurs for algorithms similar to the binary search. The lg (logarithm, base 2) function increases by one when n is doubled. Recall that we can search twice as many items with one more comparison in the binary search. Thus the binary search is an O(lg n) algorithm. If the values in Table 1-1 represented microseconds, then an O(lg n) algorithm may take 20 microseconds to process 1,048,576 items, an O(n^1.25) algorithm might take 33 seconds, and an O(n^2) algorithm might take up to 12 days! In the following chapters a timing estimate for each algorithm, using big-O notation, will be included. For a more formal derivation of these formulas you may wish to consult the references.



Summary

As we have seen, sorted arrays may be searched efficiently using a binary search. However, we must have a sorted array to start with. In the next section various ways to sort arrays will be examined. It turns out that this is computationally expensive, and considerable research has been done to make sorting algorithms as efficient as possible. Linked lists improved the efficiency of insert and delete operations, but searches were sequential and time-consuming. Algorithms exist that do all three operations efficiently, and they will be discussed in the section on dictionaries.



-7-



2. Sorting

Several algorithms are presented, including insertion sort, shell sort, and quicksort. Sorting by insertion is the simplest method, and doesn't require any additional storage. Shell sort is a simple modification that improves performance significantly. Quicksort is probably the most efficient and popular method, and is the method of choice for large arrays.



2.1 Insertion Sort

One of the simplest methods to sort an array is an insertion sort. An example of an insertion sort occurs in everyday life while playing cards. To sort the cards in your hand you extract a card, shift the remaining cards, and then insert the extracted card in the correct place. This process is repeated until all the cards are in the correct sequence. Both average and worst-case time is O(n^2). For further reading, consult Knuth [1998].



-8-



Theory

Starting near the top of the array in Figure 2-1(a), we extract the 3. Then the elements above are shifted down until we find the correct place to insert the 3. This process repeats in Figure 2-1(b) with the next number. Finally, in Figure 2-1(c), we complete the sort by inserting 2 in the correct place.

Figure 2-1: Insertion Sort

Assuming there are n elements in the array, we must index through n - 1 entries. For each entry, we may need to examine and shift up to n - 1 other entries, resulting in an O(n^2) algorithm. The insertion sort is an in-place sort. That is, we sort the array in-place. No extra memory is required. The insertion sort is also a stable sort. Stable sorts retain the original ordering of keys when identical keys are present in the input data.



Implementation

Source for the insertion sort algorithm may be found in file ins.c. Typedef T and comparison operator compGT should be altered to reflect the data stored in the table.
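As a self-contained sketch along the lines just described (the typedef and comparison below are simplifications assumed for illustration, not the contents of ins.c):

  typedef int T;                        /* alter to reflect the data sorted */
  #define compGT(a,b) ((a) > (b))       /* alter to reflect the comparison  */

  /* Insertion sort of a[0..n-1]: in-place and stable (sketch). */
  void insertSort(T *a, int n) {
      int i, j;
      for (i = 1; i < n; i++) {
          T t = a[i];                          /* extract the next element */
          for (j = i; j > 0 && compGT(a[j-1], t); j--)
              a[j] = a[j-1];                   /* shift larger elements up one slot */
          a[j] = t;                            /* insert in the correct place */
      }
  }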



-9-



2.2 Shell Sort

Shell sort, developed by Donald L. Shell, is a non-stable in-place sort. Shell sort improves on the efficiency of insertion sort by quickly shifting values to their destination. Average sort time is O(n^1.25), while worst-case time is O(n^1.5). For further reading, consult Knuth [1998].



Theory

In Figure 2-2(a) we have an example of sorting by insertion. First we extract 1, shift 3 and 5 down one slot, and then insert the 1, for a count of 2 shifts. In the next frame, two shifts are required before we can insert the 2. The process continues until the last frame, where a total of 2 + 2 + 1 = 5 shifts have been made. In Figure 2-2(b) an example of shell sort is illustrated. We begin by doing an insertion sort using a spacing of two. In the first frame we examine numbers 3-1. Extracting 1, we shift 3 down one slot for a shift count of 1. Next we examine numbers 5-2. We extract 2, shift 5 down, and then insert 2. After sorting with a spacing of two, a final pass is made with a spacing of one. This is simply the traditional insertion sort. The total shift count using shell sort is 1 + 1 + 1 = 3. By using an initial spacing larger than one, we were able to quickly shift values to their proper destination.

Figure 2-2: Shell Sort

Various spacings may be used to implement shell sort. Typically the array is sorted with a large spacing, the spacing reduced, and the array sorted again. On the final sort, spacing is one. Although the shell sort is easy to comprehend, formal analysis is difficult. In particular, optimal spacing values elude theoreticians. Knuth has experimented with several values and recommends that spacing h for an array of size N be based on the following formula:

  Let h_1 = 1, h_(s+1) = 3h_s + 1, and stop with h_t when h_(t+2) ≥ N






Thus, values of h are computed as follows:

  h_1 = 1
  h_2 = (3 × 1) + 1 = 4
  h_3 = (3 × 4) + 1 = 13
  h_4 = (3 × 13) + 1 = 40
  h_5 = (3 × 40) + 1 = 121

To sort 100 items we first find h_s such that h_s ≥ 100. For 100 items, h_5 is selected. Our final value (h_t) is two steps lower, or h_3. Therefore our sequence of h values will be 13-4-1. Once the initial h value has been determined, subsequent values may be calculated using the formula h_(s-1) = h_s / 3.
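The rule is easy to mechanize. The following sketch (the function name is an assumption for illustration) returns the starting spacing for an array of size n:

  /* Starting shell sort spacing by Knuth's rule: generate h_1 = 1,
     h_(s+1) = 3*h_s + 1, and return the h lying two steps below the
     first value that reaches n (illustrative sketch). */
  int initialSpacing(int n) {
      int h1 = 1, h2 = 4, h3 = 13;      /* h_s, h_(s+1), h_(s+2) */
      while (h3 < n) {
          h1 = h2;
          h2 = h3;
          h3 = 3 * h3 + 1;
      }
      return h1;                        /* subsequent spacings: h = h / 3 */
  }

For n = 100 this returns 13, matching the 13-4-1 sequence above.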



Implementation

Source for the shell sort algorithm may be found in file shl.c. Typedef T and comparison operator compGT should be altered to reflect the data stored in the array. The central portion of the algorithm is an insertion sort with a spacing of h.



2.3 Quicksort

Although the shell sort algorithm is significantly better than insertion sort, there is still room for improvement. One of the most popular sorting algorithms is quicksort. Quicksort executes in O(n lg n) on average, and O(n^2) in the worst case. However, with proper precautions, worst-case behavior is very unlikely. Quicksort is a non-stable sort. It is not an in-place sort, as stack space is required. For further reading, consult Cormen [1990].



Theory

The quicksort algorithm works by partitioning the array to be sorted, then recursively sorting each partition. In Partition (Figure 2-3), one of the array elements is selected as a pivot value. Values smaller than the pivot value are placed to the left of the pivot, while larger values are placed to the right.






  int function Partition (Array A, int Lb, int Ub);
  begin
      select a pivot from A[Lb]…A[Ub];
      reorder A[Lb]…A[Ub] such that:
          all values to the left of the pivot are ≤ pivot
          all values to the right of the pivot are ≥ pivot
      return pivot position;
  end;

  procedure QuickSort (Array A, int Lb, int Ub);
  begin
      if Lb < Ub then
          M = Partition (A, Lb, Ub);
          QuickSort (A, Lb, M - 1);
          QuickSort (A, M + 1, Ub);
  end;

Figure 2-3: Quicksort Algorithm

In Figure 2-4(a), the pivot selected is 3. Indices are run starting at both ends of the array. One index starts on the left and selects an element that is larger than the pivot, while another index starts on the right and selects an element that is smaller than the pivot. In this case, numbers 4 and 1 are selected. These elements are then exchanged, as is shown in Figure 2-4(b). This process repeats until all elements to the left of the pivot are ≤ the pivot, and all items to the right of the pivot are ≥ the pivot. QuickSort recursively sorts the two sub-arrays, resulting in the array shown in Figure 2-4(c).



Figure 2-4: Quicksort Example (the array 4-2-3-5-1 is partitioned around pivot 3 to give 1-2-3-5-4, then recursively sorted to give 1-2-3-4-5)

As the process proceeds, it may be necessary to move the pivot so that correct ordering is maintained. In this manner, QuickSort succeeds in sorting the array. If we're lucky the pivot selected will be the median of all values, equally dividing the array. For a moment, let's assume






that this is the case. Since the array is split in half at each step, and Partition must eventually examine all n elements, the run time is O(n lg n). To find a pivot value, Partition could simply select the first element (A[Lb]). All other values would be compared to the pivot value, and placed either to the left or right of the pivot as appropriate. However, there is one case that fails miserably. Suppose the array was originally in order. Partition would always select the lowest value as a pivot and split the array with one element in the left partition, and Ub - Lb elements in the other. Each recursive call to quicksort would only diminish the size of the array to be sorted by one. Therefore n recursive calls would be required to do the sort, resulting in an O(n^2) run time. One solution to this problem is to randomly select an item as a pivot. This would make it extremely unlikely that worst-case behavior would occur.
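A compact recursive version, using the center element as pivot (the same safeguard the implementation below adopts), might look like this. It is a sketch with assumed names, not the contents of qui.c:

  typedef int T;
  #define compGT(a,b) ((a) > (b))

  static void swapT(T *x, T *y) { T t = *x; *x = *y; *y = t; }

  /* Quicksort of a[lb..ub] with a center-element pivot (sketch). */
  void quickSort(T *a, int lb, int ub) {
      T pivot;
      int i, j;
      if (lb >= ub) return;
      pivot = a[lb + (ub - lb) / 2];         /* center element as pivot */
      i = lb; j = ub;
      while (i <= j) {                       /* partition around the pivot */
          while (compGT(pivot, a[i])) i++;   /* skip values already on the left */
          while (compGT(a[j], pivot)) j--;   /* skip values already on the right */
          if (i <= j) swapT(&a[i++], &a[j--]);
      }
      quickSort(a, lb, j);                   /* sort the two partitions */
      quickSort(a, i, ub);
  }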



Implementation The source for the quicksort algorithm may be found in file qui.c. Typedef T and comparison operator compGT should be altered to reflect the data stored in the array. Several enhancements have been made to the basic quicksort algorithm: •



The center element is selected as a pivot in partition. If the list is partially ordered, this will be a good choice. Worst-case behavior occurs when the center element happens to be the largest or smallest element each time partition is invoked.







For short arrays, insertSort is called. Due to recursion and other overhead, quicksort is not an efficient algorithm to use on small arrays. Consequently, any array with fewer than 12 elements is sorted using an insertion sort. The optimal cutoff value is not critical and varies based on the quality of generated code.







Tail recursion occurs when the last statement in a function is a call to the function itself. Tail recursion may be replaced by iteration, resulting in a better utilization of stack space. This has been done with the second call to QuickSort in Figure 2-3.







After an array is partitioned, the smallest partition is sorted first. This results in a better utilization of stack space, as short partitions are quickly sorted and dispensed with.



Included in file qsort.c is the source for qsort, an ANSI-C standard library function usually implemented with quicksort. Recursive calls were replaced by explicit stack operations. Table 2-1 shows timing statistics and stack utilization before and after the enhancements were applied.



   count      time (µs)            stack size
             before     after     before  after
      16        103        51        540     28
     256      1,630       911        912    112
   4,096     34,183    20,016      1,908    168
  65,536    658,003   470,737      2,436    252

Table 2-1: Effect of Enhancements on Speed and Stack Utilization






2.4 Comparison

In this section we will compare the sorting algorithms covered: insertion sort, shell sort, and quicksort. There are several factors that influence the choice of a sorting algorithm:

• Stable sort. Recall that a stable sort will leave identical keys in the same relative position in the sorted output. Insertion sort is the only algorithm covered that is stable.

• Space. An in-place sort does not require any extra space to accomplish its task. Both insertion sort and shell sort are in-place sorts. Quicksort requires stack space for recursion, and therefore is not an in-place sort; tinkering with the algorithm considerably reduced the amount of stack space required (Table 2-1).

• Time. The time required to sort a dataset can easily become astronomical (Table 1-1). Table 2-2 shows the relative timings for each method. The time required to sort a randomly ordered dataset is shown in Table 2-3.

• Simplicity. The number of statements required for each algorithm may be found in Table 2-2. Simpler algorithms result in fewer programming errors.



  method           statements   average time   worst-case time
  insertion sort        9       O(n^2)         O(n^2)
  shell sort           17       O(n^1.25)      O(n^1.5)
  quicksort            21       O(n lg n)      O(n^2)

Table 2-2: Comparison of Methods

   count    insertion        shell        quicksort
      16        39 µs         45 µs          51 µs
     256     4,969 µs      1,230 µs         911 µs
   4,096     1.315 sec      .033 sec        .020 sec
  65,536   416.437 sec     1.254 sec        .461 sec

Table 2-3: Sort Timings



3. Dictionaries

Dictionaries are data structures that support search, insert, and delete operations. One of the most effective representations is a hash table. Typically, a simple function is applied to the key to determine its place in the dictionary. Also included are binary search trees and red-black trees. Both tree methods use a technique similar to the binary search algorithm to minimize the number of comparisons during search and update operations on the dictionary. Finally, skip lists illustrate a simple approach that utilizes random numbers to construct a dictionary.



3.1 Hash Tables

Hash tables are a simple and effective method to implement dictionaries. Average time to search for an element is O(1), while worst-case time is O(n). Cormen [1990] and Knuth [1998] both contain excellent discussions on hashing.



Theory

A hash table is simply an array that is addressed via a hash function. For example, in Figure 3-1, HashTable is an array with 8 elements. Each element is a pointer to a linked list of numeric data. The hash function for this example simply divides the data key by 8, and uses the remainder as an index into the table. This yields a number from 0 to 7. Since the range of indices for HashTable is 0 to 7, we are guaranteed that the index is valid.

Figure 3-1: A Hash Table (an 8-slot table; the values 16, 11, 27, 19, 22, and 6 are chained from the slots given by their remainders modulo 8)

To insert a new item in the table, we hash the key to determine which list the item goes on, and then insert the item at the beginning of the list. For example, to insert 11, we divide 11 by 8 giving a remainder of 3. Thus, 11 goes on the list starting at HashTable[3]. To find a number, we hash the number and chain down the correct list to see if it is in the table. To delete a number, we find the number and remove the node from the linked list.

Entries in the hash table are dynamically allocated and entered on a linked list associated with each hash table entry. This technique is known as chaining. An alternative method, where all entries are stored in the hash table itself, is known as direct or open addressing and may be found in the references. If the hash function is uniform, or equally distributes the data keys among the hash table indices, then hashing effectively subdivides the list to be searched. Worst-case behavior occurs when all keys hash to the same index. Then we simply have a single linked list that must be sequentially searched. Consequently, it is important to choose a good hash function. Several methods may be used to hash key values; a sketch of the chained insertion itself appears first, followed by the hashing methods. To illustrate the techniques, I will assume unsigned char is 8-bits, unsigned short int is 16-bits, and unsigned long int is 32-bits.
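Here is the promised sketch of chained insertion; the node type, table size, and function name are assumptions for illustration (the table size 8 merely matches the divide-by-8 example above).

  #include <stdlib.h>

  #define HASH_TABLE_SIZE 8

  typedef struct HashNode {
      int key;
      struct HashNode *next;
  } HashNode;

  HashNode *hashTable[HASH_TABLE_SIZE];   /* each slot heads a linked list */

  /* Insert key at the front of its chain, as in Figure 3-1 (sketch). */
  void hashInsert(int key) {
      int h = key % HASH_TABLE_SIZE;      /* division-method hash */
      HashNode *n = malloc(sizeof *n);
      if (n == NULL) return;              /* allocation failed */
      n->key = key;
      n->next = hashTable[h];             /* link in at the head of the list */
      hashTable[h] = n;
  }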



• Division method (tablesize = prime). This technique was used in the preceding example. A HashValue, from 0 to (HashTableSize - 1), is computed by dividing the key value by the size of the hash table and taking the remainder. For example:

  typedef int HashIndexType;

  HashIndexType Hash(int Key) {
      return Key % HashTableSize;
  }

Selecting an appropriate HashTableSize is important to the success of this method. For example, a HashTableSize of two would yield even hash values for even Keys, and odd hash values for odd Keys. This is an undesirable property, as all keys would hash to the same value if they happened to be even. If HashTableSize is a power of two, then the hash function simply selects a subset of the Key bits as the table index. To obtain a more random scattering, HashTableSize should be a prime number not too close to a power of two.



• Multiplication method (tablesize = 2^n). The multiplication method may be used for a HashTableSize that is a power of 2. The Key is multiplied by a constant, and then the necessary bits are extracted to index into the table. Knuth recommends using the fractional part of the product of the key and the golden ratio, or (√5 − 1)/2. For example, assuming a word size of 8 bits, the golden ratio is multiplied by 2^8 to obtain 158. The product of the 8-bit key and 158 results in a 16-bit integer. For a table size of 2^5 the 5 most significant bits of the least significant word are extracted for the hash value. The following definitions may be used for the multiplication method:

  /* 8-bit index */
  typedef unsigned char HashIndexType;
  static const HashIndexType K = 158;

  /* 16-bit index */
  typedef unsigned short int HashIndexType;
  static const HashIndexType K = 40503;

  /* 32-bit index */
  typedef unsigned long int HashIndexType;
  static const HashIndexType K = 2654435769;

  /* w = bitwidth(HashIndexType), size of table = 2**m */
  static const int S = w - m;
  HashIndexType HashValue = (HashIndexType)(K * Key) >> S;

For example, if HashTableSize is 1024 (2^10), then a 16-bit index is sufficient and S would be assigned a value of 16 - 10 = 6. Thus, we have:

  typedef unsigned short int HashIndexType;

  HashIndexType Hash(int Key) {
      static const HashIndexType K = 40503;
      static const int S = 6;
      return (HashIndexType)(K * Key) >> S;
  }







• Variable string addition method (tablesize = 256). To hash a variable-length string, each character is added, modulo 256, to a total. A HashValue, range 0-255, is computed.

  typedef unsigned char HashIndexType;

  HashIndexType Hash(char *str) {
      HashIndexType h = 0;
      while (*str) h += *str++;
      return h;
  }







• Variable string exclusive-or method (tablesize = 256). This method is similar to the addition method, but successfully distinguishes similar words and anagrams. To obtain a hash value in the range 0-255, all bytes in the string are exclusive-or'd together. However, in the process of doing each exclusive-or, a random component is introduced.

  typedef unsigned char HashIndexType;
  unsigned char Rand8[256];

  HashIndexType Hash(char *str) {
      unsigned char h = 0;
      while (*str) h = Rand8[h ^ *str++];
      return h;
  }






Rand8 is a table of 256 8-bit unique random numbers. The exact ordering is not critical.



The exclusive-or method has its basis in cryptography, and is quite effective (Pearson [1990]).



• Variable string exclusive-or method (tablesize ≤ 65536). If we hash the string twice, we may derive a hash value for an arbitrary table size up to 65536. The second time the string is hashed, one is added to the first character. Then the two 8-bit hash values are concatenated together to form a 16-bit hash value.

  typedef unsigned short int HashIndexType;
  unsigned char Rand8[256];

  HashIndexType Hash(char *str) {
      HashIndexType h;
      unsigned char h1, h2;
      if (*str == 0) return 0;
      h1 = *str; h2 = *str + 1;
      str++;
      while (*str) {
          h1 = Rand8[h1 ^ *str];
          h2 = Rand8[h2 ^ *str];
          str++;
      }
      /* h is in range 0..65535 */
      h = ((HashIndexType)h1 << 8) | h2;
      return h;
  }

3.2 Binary Search Trees

Binary search trees store keys in linked nodes rather than in an array, so that the halving behavior of the binary search can be combined with efficient insertion and deletion. Average search time is O(lg n); worst-case time is O(n) (see Table 3-2).

Theory

A binary search tree keeps, at every node, all smaller keys in the left subtree and all larger keys in the right subtree. For example, to search for 16 in the tree of Figure 3-2 (root 20, with children 7 and 38), we note that 16 < 20 and traverse to the left child. The second comparison finds that 16 > 7, so we traverse to the right child. On the third comparison, we succeed.






Figure 3-3: An Unbalanced Binary Search Tree (the values 4, 7, 16, 20, 37, 38, 43 in one long chain)

Each comparison results in reducing the number of items to inspect by one-half. In this respect, the algorithm is similar to a binary search on an array. However, this is true only if the tree is balanced. Figure 3-3 shows another tree containing the same values. While it is a binary search tree, its behavior is more like that of a linked list, with search time increasing proportional to the number of elements stored.



Insertion and Deletion

Let us examine insertions in a binary search tree to determine the conditions that can cause an unbalanced tree. To insert an 18 in the tree in Figure 3-2, we first search for that number. This causes us to arrive at node 16 with nowhere to go. Since 18 > 16, we simply add node 18 to the right child of node 16 (Figure 3-4).



Figure 3-4: Binary Tree After Adding Node 18 (root 20; children 7 and 38; 4 and 16 under 7, with 18 as the right child of 16; 37 and 43 under 38)



Now we can see how an unbalanced tree can occur. If the data is presented in an ascending sequence, each node will be added to the right of the previous node. This will create one long chain, or linked list. However, if data is presented for insertion in a random order, then a more balanced tree is possible. Deletions are similar, but require that the binary search tree property be maintained. For example, if node 20 in Figure 3-4 is removed, it must be replaced by node 37. This results in the tree shown in Figure 3-5. The rationale for this choice is as follows. The successor for node 20 must be chosen such that all nodes to the right are larger. Therefore we need to select the smallest valued node to the right of node 20. To make the selection, chain once to the right (node 38), and then chain to the left until the last node is found (node 37). This is the successor for node 20.

Figure 3-5: Binary Tree After Deleting Node 20 (node 37 now at the root)
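In code, the successor selection reads almost exactly like the prose. This is a sketch with assumed node fields; bin.c has its own layout.

  #include <stddef.h>

  typedef struct TNode {
      int data;
      struct TNode *left, *right;
  } TNode;

  /* In-order successor used when deleting a node with a right subtree:
     chain once to the right, then left to the last node (sketch). */
  TNode *successor(TNode *node) {
      TNode *s = node->right;      /* e.g., node 38 in Figure 3-4 */
      while (s->left != NULL)
          s = s->left;             /* smallest value right of the deleted node */
      return s;                    /* e.g., node 37 */
  }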



Implementation

Source for the binary search tree algorithm may be found in file bin.c. Typedef T and comparison operators compLT and compEQ should be altered to reflect the data stored in the tree. Each Node consists of left, right, and parent pointers designating each child and the parent. Data is stored in the data field. The tree is based at root, and is initially NULL. Function insertNode allocates a new node and inserts it in the tree. Function deleteNode deletes and frees a node from the tree. Function findNode searches the tree for a particular value.



3.3 Red-Black Trees

Binary search trees work best when they are balanced, that is, when the path length from root to any leaf is within some bounds. The red-black tree algorithm is a method for balancing trees. The name derives from the fact that each node is colored red or black, and the color of the node is instrumental in determining the balance of the tree. During insert and delete operations, nodes may be rotated to maintain tree balance. Both average and worst-case search time is O(lg n). See Cormen [1990] for details.






Theory

A red-black tree is a balanced binary search tree with the following properties:

1. Every node is colored red or black.
2. Every leaf is a NIL node, and is colored black.
3. If a node is red, then both its children are black.
4. Every simple path from a node to a descendant leaf contains the same number of black nodes.

The number of black nodes on a path from root to leaf is known as the black height of a tree. These properties guarantee that any path from the root to a leaf is no more than twice as long as any other path. To see why this is true, consider a tree with a black height of two. The shortest distance from root to leaf is two, where both nodes are black. The longest distance from root to leaf is four, where the nodes are colored (root to leaf): red, black, red, black. It is not possible to insert more black nodes as this would violate property 4, the black-height requirement. Since red nodes must have black children (property 3), having two red nodes in a row is not allowed. The largest path we can construct consists of an alternation of red-black nodes, or twice the length of a path containing only black nodes. All operations on the tree must maintain the properties listed above. In particular, operations that insert or delete items from the tree must abide by these rules.
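For concreteness, a node carrying the color might be declared as follows. This is a sketch with assumed names (the actual rbt.c layout may differ); the shared-sentinel idea is the one described in the insertion discussion below.

  #include <stddef.h>

  typedef enum { RED, BLACK } NodeColor;

  typedef struct RBNode {
      struct RBNode *left, *right, *parent;
      NodeColor color;             /* property 1: every node is red or black */
      int data;
  } RBNode;

  /* One shared black node stands in for every NIL leaf (property 2). */
  RBNode sentinel = { &sentinel, &sentinel, NULL, BLACK, 0 };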



Insertion

To insert a node, we search the tree for an insertion point, and add the node to the tree. The new node replaces an existing NIL node at the bottom of the tree, and has two NIL nodes as children. In the implementation, a NIL node is simply a pointer to a common sentinel node that is colored black. After insertion, the new node is colored red. Then the parent of the node is examined to determine if the red-black tree properties have been violated. If necessary, we recolor the node and do rotations to balance the tree. By inserting a red node with two NIL children, we have preserved the black-height property (property 4). However, property 3 may be violated. This property states that both children of a red node must be black. Although both children of the new node are black (they're NIL), consider the case where the parent of the new node is red. Inserting a red node under a red parent would violate this property. There are two cases to consider:



• Red parent, red uncle: Figure 3-6 illustrates a red-red violation. Node X is the newly inserted node, with both parent and uncle colored red. A simple recoloring removes the red-red violation. After recoloring, the grandparent (node B) must be checked for validity, as its parent may be red. Note that this has the effect of propagating a red node up the tree. On completion, the root of the tree is marked black. If it was originally red, then this has the effect of increasing the black-height of the tree.










Red parent, black uncle: Figure 3-7 illustrates a red-red violation, where the uncle is colored black. Here the nodes may be rotated, with the subtrees adjusted as shown. At this point the algorithm may terminate as there are no red-red conflicts and the top of the subtree (node A) is colored black. Note that if node X was originally a right child, a left rotation would be done first, making the node a left child.



Each adjustment made while inserting a node causes us to travel up the tree one step. At most one rotation (2 if the node is a right child) will be done, as the algorithm terminates in this case. The technique for deletion is similar. % EODFN



Figure 3-6: Insertion – Red Parent, Red Uncle (recoloring: parent A and uncle C become black, grandparent B becomes red; X stays red)






Figure 3-7: Insertion – Red Parent, Black Uncle (a rotation: parent A replaces grandparent B at the top of the subtree and is recolored black, B becomes red, uncle C stays black; subtrees α-ε are reattached)



Implementation

Source for the red-black tree algorithm may be found in file rbt.c. Typedef T and comparison operators compLT and compEQ should be altered to reflect the data stored in the tree. Each Node consists of left, right, and parent pointers designating each child and the parent. The node color is stored in color, and is either RED or BLACK. The data is stored in the data field. All leaf nodes of the tree are sentinel nodes, to simplify coding. The tree is based at root, and initially is a sentinel node. Function insertNode allocates a new node and inserts it in the tree. Subsequently, it calls insertFixup to ensure that the red-black tree properties are maintained. Function deleteNode deletes a node from the tree. To maintain red-black tree properties, deleteFixup is called. Function findNode searches the tree for a particular value.



3.4 Skip Lists

Skip lists are linked lists that allow you to skip to the correct node. The performance bottleneck inherent in a sequential scan is avoided, while insertion and deletion remain relatively efficient. Average search time is O(lg n). Worst-case search time is O(n), but is extremely unlikely. An excellent reference for skip lists is Pugh [1990].



Theory

The indexing scheme employed in skip lists is similar in nature to the method used to look up names in an address book. To look up a name, you index to the tab representing the first character of the desired entry. In Figure 3-8, for example, the top-most list represents a simple linked list with no tabs. Adding tabs (middle figure) facilitates the search. In this case, level-1 pointers are traversed. Once the correct segment of the list is found, level-0 pointers are traversed to find the specific entry.

Figure 3-8: Skip List Construction (three views of the list abe-art-ben-bob-cal-cat-dan-don: with level-0 pointers only, with level-1 pointers added, and with level-2 pointers added)






The indexing scheme may be extended as shown in the bottom figure, where we now have an index to the index. To locate an item, level-2 pointers are traversed until the correct segment of the list is identified. Subsequently, level-1 and level-0 pointers are traversed.

During insertion the number of pointers required for a new node must be determined. This is easily resolved using a probabilistic technique. A random number generator is used to toss a computer coin. When inserting a new node, the coin is tossed to determine if it should be level-1. If you win, the coin is tossed again to determine if the node should be level-2. Another win, and the coin is tossed to determine if the node should be level-3. This process repeats until you lose. If only one level (level-0) is implemented, the data structure is a simple linked list with O(n) search time. However, if sufficient levels are implemented, the skip list may be viewed as a tree with the root at the highest level, and search time is O(lg n).

The skip list algorithm has a probabilistic component, and thus probabilistic bounds on the time required to execute. However, these bounds are quite tight in normal circumstances. For example, to search a list containing 1000 items, the probability that search time will be 5 times the average is about 1 in 1,000,000,000,000,000,000.
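The coin-tossing procedure is tiny in code. The names below are assumptions for illustration; skl.c determines node levels with a random number generator in the same spirit, though not necessarily in these exact terms.

  #include <stdlib.h>

  #define MAXLEVEL 15                /* cap based on the maximum dataset size */

  /* Toss a fair coin repeatedly; each win promotes the node one level.
     Half of all nodes stay at level 0, a quarter reach level 1, and
     so on (illustrative sketch). */
  int randomLevel(void) {
      int level = 0;
      while (level < MAXLEVEL && (rand() & 1))
          level++;                   /* won the toss: add another level */
      return level;
  }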



Implementation

Source for the skip list algorithm may be found in file skl.c. Typedef T and comparison operators compLT and compEQ should be altered to reflect the data stored in the list. In addition, MAXLEVEL should be set based on the maximum size of the dataset. To initialize, initList is called. The list header is allocated and initialized. To indicate an empty list, all levels are set to point to the header. Function insertNode allocates a new node, searches for the correct insertion point, and inserts it in the list. While searching, the update array maintains pointers to the upper-level nodes encountered. This information is subsequently used to establish correct links for the newly inserted node. The newLevel is determined using a random number generator, and the node allocated. The forward links are then established using information from the update array. Function deleteNode deletes and frees a node, and is implemented in a similar manner. Function findNode searches the list for a particular value.



3.5 Comparison

We have seen several ways to construct dictionaries: hash tables, unbalanced binary search trees, red-black trees, and skip lists. There are several factors that influence the choice of an algorithm:



• Sorted output. If sorted output is required, then hash tables are not a viable alternative. Entries are stored in the table based on their hashed value, with no other ordering. For binary trees, the story is different. An in-order tree walk will produce a sorted list. For example:






  void WalkTree(Node *P) {
      if (P == NIL) return;
      WalkTree(P->Left);
      /* examine P->Data here */
      WalkTree(P->Right);
  }

  WalkTree(Root);



To examine skip list nodes in order, simply chain through the level-0 pointers. For example:

  Node *P = List.Hdr->Forward[0];
  while (P != NIL) {
      /* examine P->Data here */
      P = P->Forward[0];
  }







• Space. The amount of memory required to store a value should be minimized. This is especially true if many small nodes are to be allocated.

  ♦ For hash tables, only one forward pointer per node is required. In addition, the hash table itself must be allocated.

  ♦ For red-black trees, each node has a left, right, and parent pointer. In addition, the color of each node must be recorded. Although this requires only one bit, more space may be allocated to ensure that the size of the structure is properly aligned. Therefore each node in a red-black tree requires enough space for 3-4 pointers.

  ♦ For skip lists, each node has a level-0 forward pointer. The probability of having a level-1 pointer is ½. The probability of having a level-2 pointer is ¼. In general, the number of forward pointers per node is

      n = 1 + 1/2 + 1/4 + ... = 2.







• Time. The algorithm should be efficient. This is especially true if a large dataset is expected. Table 3-2 compares the search time for each algorithm. Note that worst-case behavior for hash tables and skip lists is extremely unlikely. Actual timing tests are described below.







• Simplicity. If the algorithm is short and easy to understand, fewer mistakes may be made. This not only makes your life easy, but the maintenance programmer entrusted with the task of making repairs will appreciate any efforts you make in this area. The number of statements required for each algorithm is listed in Table 3-2.






  method            statements   average time   worst-case time
  hash table             26      O(1)           O(n)
  unbalanced tree        41      O(lg n)        O(n)
  red-black tree        120      O(lg n)        O(lg n)
  skip list              55      O(lg n)        O(n)

Table 3-2: Comparison of Dictionaries

Average time for insert, search, and delete operations on a database of 65,536 (2^16) randomly input items may be found in Table 3-3. For this test the hash table size was 10,009 and 16 index levels were allowed for the skip list. Although there is some variation in the timings for the four methods, they are close enough so that other considerations should come into play when selecting an algorithm.

  method            insert   search   delete
  hash table           18        8       10
  unbalanced tree      37       17       26
  red-black tree       40       16       37
  skip list            48       31       35

Table 3-3: Average Time (µs), 65536 Items, Random Input

  order           count    hash table   unbalanced tree   red-black tree   skip list
  random input       16        4               3                 2             5
                    256        3               4                 4             9
                  4,096        3               7                 6            12
                 65,536        8              17                16            31
  ordered input      16        3               4                 2             4
                    256        3              47                 4             7
                  4,096        3           1,033                 6            11
                 65,536        7          55,019                 9            15

Table 3-4: Average Search Time (µs)

Table 3-4 shows the average search time for two sets of data: a random set, where all values are unique, and an ordered set, where values are in ascending order. Ordered input creates a worst-case scenario for unbalanced tree algorithms, as the tree ends up being a simple linked list. The times shown are for a single search operation. If we were to search for all items in a database of 65,536 values, a red-black tree algorithm would take .6 seconds, while an unbalanced tree algorithm would take 1 hour.






4. Very Large Files

The previous algorithms have assumed that all data reside in memory. However, there may be times when the dataset is too large, and alternative methods are required. In this section, we will examine techniques for sorting (external sorts) and implementing dictionaries (B-trees) for very large files.



4.1 External Sorting

One method for sorting a file is to load the file into memory, sort the data in memory, then write the results. When the file cannot be loaded into memory due to resource limitations, an external sort is applicable. We will implement an external sort using replacement selection to establish initial runs, followed by a polyphase merge sort to merge the runs into one sorted file. I highly recommend you consult Knuth [1998], as many details have been omitted.






Theory

For clarity, I'll assume that data is on one or more reels of magnetic tape. Figure 4-1 illustrates a 3-way polyphase merge. Initially, in phase A, all data is on tapes T1 and T2. Assume that the beginning of each tape is at the bottom of the frame. There are two sequential runs of data on T1: 4-8, and 6-7. Tape T2 has one run: 5-9. At phase B, we've merged the first run from tapes T1 (4-8) and T2 (5-9) into a longer run on tape T3 (4-5-8-9). Phase C simply renames the tapes, so we may repeat the merge again. In phase D we repeat the merge, with the final output on tape T3.

Figure 4-1: Merge Sort

Several interesting details have been omitted from the previous illustration. For example, how were the initial runs created? And, did you notice that they merged perfectly, with no extra runs on any tapes? Before I explain the method used for constructing initial runs, let me digress for a bit. In 1202, Leonardo Fibonacci presented the following exercise in his Liber Abbaci (Book of the Abacus): "How many pairs of rabbits can be produced from a single pair in a year's time?" We may assume that each pair produces a new pair of offspring every month, each pair becomes fertile at the age of one month, and that rabbits never die. After one month, there will be 2 pairs of rabbits; after two months there will be 3; the following month the original pair and the pair born during the first month will both usher in a new pair, and there will be 5 in all; and so on. This series, where each number is the sum of the two preceding numbers, is known as the Fibonacci sequence:






0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...

Curiously, the Fibonacci series has found widespread application to everything from the arrangement of flowers on plants to studying the efficiency of Euclid's algorithm. There's even a Fibonacci Quarterly journal. And, as you might suspect, the Fibonacci series has something to do with establishing initial runs for external sorts. Recall that we initially had one run on tape T2, and 2 runs on tape T1. Note that the numbers {1,2} are two sequential numbers in the Fibonacci series. After our first merge, we had one run on T1 and one run on T2. Note that the numbers {1,1} are two sequential numbers in the Fibonacci series, only one notch down. We could predict, in fact, that if we had 13 runs on T2, and 21 runs on T1 {13,21}, we would be left with 8 runs on T1 and 13 runs on T3 {8,13} after one pass. Successive passes would result in run counts of {5,8}, {3,5}, {2,3}, {1,1}, and {0,1}, for a total of 7 passes. This arrangement is ideal, and will result in the minimum number of passes. Should data actually be on tape, this is a big savings, as tapes must be mounted and rewound for each pass. For more than 2 tapes, higher-order Fibonacci numbers are used.

Initially, all the data is on one tape. The tape is read, and runs are distributed to other tapes in the system. After the initial runs are created, they are merged as described above. One method we could use to create initial runs is to read a batch of records into memory, sort the records, and write them out. This process would continue until we had exhausted the input tape. An alternative algorithm, replacement selection, allows for longer runs. A buffer is allocated in memory to act as a holding place for several records. Initially, the buffer is filled. Then, the following steps are repeated until the input is exhausted:

• Select the record with the smallest key that is ≥ the key of the last record written.
• If all keys are smaller than the key of the last record written, then we have reached the end of a run. Select the record with the smallest key for the first record of the next run.
• Write the selected record.
• Replace the selected record with a new record from input.






Figure 4-2 illustrates replacement selection for a small file. The beginning of the file is to the right of each frame. To keep things simple, I've allocated a 2-record buffer. Typically, such a buffer would hold thousands of records. We load the buffer in step B, and write the record with the smallest key (6) in step C. This is replaced with the next record (key 8). We select the smallest key ≥ 6 in step D. This is key 7. After writing key 7, we replace it with key 4. This process repeats until step F, where our last key written was 8, and all keys are less than 8. At this point, we terminate the run, and start another.

  Step   Input          Buffer   Output
  A      5-3-4-8-6-7
  B      5-3-4-8        6-7
  C      5-3-4          8-7      6
  D      5-3            8-4      7-6
  E      5              3-4      8-7-6
  F                     5-4      3 | 8-7-6
  G                     5        4-3 | 8-7-6
  H                              5-4-3 | 8-7-6

Figure 4-2: Replacement Selection

This strategy simply utilizes an intermediate buffer to hold values until the appropriate time for output. Using random numbers as input, the average length of a run is twice the length of the buffer. However, if the data is somewhat ordered, runs can be extremely long. Thus, this method is more effective than doing partial sorts. When selecting the next output record, we need to find the smallest key ≥ the last key written. One way to do this is to scan the entire list, searching for the appropriate key. However, when the buffer holds thousands of records, execution time becomes prohibitive. An alternative method is to use a binary tree structure, so that we only compare lg n items.
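For a buffer small enough to scan linearly, the selection step might be sketched as follows. The names and layout are assumptions for illustration; the actual ext.c uses a binary tree so that only lg n comparisons are needed.

  /* Pick the buffer slot whose key is the smallest >= lastKey. If no
     such key exists, the run has ended: report it and pick the smallest
     key overall to start the next run. Assumes n >= 1 (sketch). */
  int selectNext(const int buffer[], int n, int lastKey, int *endOfRun) {
      int i, best = -1, bestAny = 0;
      for (i = 0; i < n; i++) {
          if (buffer[i] >= lastKey &&
              (best < 0 || buffer[i] < buffer[best]))
              best = i;                   /* candidate that extends the run */
          if (buffer[i] < buffer[bestAny])
              bestAny = i;                /* smallest key overall */
      }
      *endOfRun = (best < 0);
      return (best < 0) ? bestAny : best; /* slot to write, then refill */
  }

Tracing this with the 2-record buffer of Figure 4-2 reproduces the steps shown there.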



Implementation

Source for the external sort algorithm may be found in file ext.c. Function makeRuns calls readRec to read the next record. Function readRec employs the replacement selection algorithm (utilizing a binary tree) to fetch the next record, and makeRuns distributes the records in a Fibonacci distribution. If the number of runs is not a perfect Fibonacci number, dummy runs are simulated at the beginning of each file. Function mergeSort is then called to do a polyphase merge sort on the runs.



4.2 B-Trees

Dictionaries for very large files typically reside on secondary storage, such as a disk. The dictionary is implemented as an index to the actual file and contains the key and record address of data. To implement a dictionary we could use red-black trees, replacing pointers with offsets from the beginning of the index file, and use random access to reference nodes of the tree. However, every transition on a link would imply a disk access, and would be prohibitively expensive. Recall that low-level disk I/O accesses disk by sectors (typically 256 bytes). We could equate node size to sector size, and group several keys together in each node to minimize the number of I/O operations. This is the principle behind B-trees. Good references for B-trees include Knuth [1998] and Cormen [1990]. For B+-trees, consult Aho [1983].



Theory

Figure 4-3 illustrates a B-tree with 3 keys/node. Keys in internal nodes are surrounded by pointers, or record offsets, to keys that are less than or greater than the key value. For example, all keys less than 22 are to the left and all keys greater than 22 are to the right. For simplicity, I have not shown the record address associated with each key.

Figure 4-3: B-Tree (root node 22; second-level nodes 10-16 and 26; leaf nodes 4-6-8, 12-14, 18-20, 24, and 28-30)

We can locate any key in this 2-level tree with three disk accesses. If we were to group 100 keys/node, we could search over 1,000,000 keys in only three reads. To ensure this property holds, we must maintain a balanced tree during insertion and deletion. During insertion, we examine the child node to verify that it is able to hold an additional node. If not, then a new sibling node is added to the tree, and the child's keys are redistributed to make room for the new node. When descending for insertion and the root is full, then the root is spilled to new children, and the level of the tree increases. A similar action is taken on deletion, where child nodes may be absorbed by the root. This technique for altering the height of the tree maintains a balanced tree.
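A node that groups several keys into one sector-sized record might be declared like this; the constants and field names are assumptions for illustration (btr.c's real layout is more elaborate).

  #define KEYS_PER_NODE 3                 /* as in Figure 4-3 */

  typedef struct BTreeNode {
      int  keyCount;                      /* number of keys in use */
      int  key[KEYS_PER_NODE];            /* keys, kept in ascending order */
      long child[KEYS_PER_NODE + 1];      /* file offsets of children: child[i]
                                             leads to keys less than key[i],
                                             child[i+1] to keys greater than it */
  } BTreeNode;

One disk read fetches a whole node, so comparisons against the keys within a node cost nothing extra; only descending to a child requires another access.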



                      B-Tree             B*-Tree            B+-Tree            B++-Tree
  data stored in      any node           any node           leaf only          leaf only
  on insert, split    1 x 1 → 2 x 1/2    2 x 1 → 3 x 2/3    1 x 1 → 2 x 1/2    3 x 1 → 4 x 3/4
  on delete, join     2 x 1/2 → 1 x 1    3 x 2/3 → 2 x 1    2 x 1/2 → 1 x 1    3 x 1/2 → 2 x 3/4

Table 4-1: B-Tree Implementations

Several variants of the B-tree are listed in Table 4-1. The standard B-tree stores keys and data in both internal and leaf nodes. When descending the tree during insertion, a full child node is first redistributed to adjacent nodes. If the adjacent nodes are also full, then a new node is created, and ½ the keys in the child are moved to the newly created node. During deletion, children that are ½ full first attempt to obtain keys from adjacent nodes. If the adjacent nodes are also ½ full, then two nodes are joined to form one full node. B*-trees are similar, only the nodes






are kept 2/3 full. This results in better utilization of space in the tree, and slightly better performance.

Figure 4-4: B+-Tree (root 22; second level 10-16 and 26; leaves 4-6-8, 10-12-14, 16-18-20, 22-24, and 26-28-30, with keys duplicated in the parents)

Figure 4-4 illustrates a B+-tree. All keys are stored at the leaf level, with their associated data values. Duplicates of the keys appear in internal parent nodes to guide the search. Pointers have a slightly different meaning than in conventional B-trees. The left pointer designates all keys less than the value, while the right pointer designates all keys greater than or equal to (GE) the value. For example, all keys less than 22 are on the left pointer, and all keys greater than or equal to 22 are on the right. Notice that key 22 is duplicated in the leaf, where the associated data may be found. During insertion and deletion, care must be taken to properly update parent nodes. When modifying the first key in a leaf, the last GE pointer found while descending the tree will require modification to reflect the new key value. Since all keys are in leaf nodes, we may link them for sequential access.

The last method, B++-trees, is something of my own invention. The organization is similar to B+-trees, except for the split/join strategy. Assume each node can hold k keys, and the root node holds 3k keys. Before we descend to a child node during insertion, we check to see if it is full. If it is, the keys in the child node and two nodes adjacent to the child are all merged and redistributed. If the two adjacent nodes are also full, then another node is added, resulting in four nodes, each ¾ full. Before we descend to a child node during deletion, we check to see if it is ½ full. If it is, the keys in the child node and two nodes adjacent to the child are all merged and redistributed. If the two adjacent nodes are also ½ full, then they are merged into two nodes, each ¾ full. Note that in each case, the resulting nodes are ¾ full. This is halfway between ½ full and completely full, allowing for an equal number of insertions or deletions in the future.

Recall that the root node holds 3k keys. If the root is full during insertion, we distribute the keys to four new nodes, each ¾ full. This increases the height of the tree. During deletion, we inspect the child nodes. If there are only three child nodes, and they are all ½ full, they are gathered into the root, and the height of the tree decreases. Another way of expressing the operation is to say we are gathering three nodes, and then scattering them. In the case of insertion, where we need an extra node, we scatter to four nodes. For deletion, where a node must be deleted, we scatter to two nodes. The symmetry of the operation allows the gather/scatter routines to be shared by insertion and deletion in the implementation.






Implementation

Source for the B++-tree algorithm may be found in file btr.c. In the implementation-dependent section, you'll need to define bAdrType and eAdrType, the types associated with B-tree file offsets and data file offsets, respectively. You'll also need to provide a callback function that is used by the B++-tree algorithm to compare keys. Functions are provided to insert/delete keys, find keys, and access keys sequentially. Function main, at the bottom of the file, provides a simple illustration for insertion.

The code provided allows for multiple indices to the same data. This was implemented by returning a handle when the index is opened. Subsequent accesses are done using the supplied handle. Duplicate keys are allowed. Within one index, all keys must be the same length. A binary search was implemented to search each node. A flexible buffering scheme allows nodes to be retained in memory until the space is needed. If you expect access to be somewhat ordered, increasing the bufCt will reduce paging.






5. Bibliography

Aho, Alfred V., and Jeffrey D. Ullman [1983]. Data Structures and Algorithms. Addison-Wesley, Reading, Massachusetts.

Cormen, Thomas H., Charles E. Leiserson, and Ronald L. Rivest [1990]. Introduction to Algorithms. McGraw-Hill, New York.

Knuth, Donald E. [1998]. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, Massachusetts.

Pearson, Peter K. [1990]. Fast Hashing of Variable-Length Text Strings. Communications of the ACM, 33(6):677-680, June 1990.

Pugh, William [1990]. Skip Lists: A Probabilistic Alternative to Balanced Trees. Communications of the ACM, 33(6):668-676, June 1990.






http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_int.htm



Introduction Arrays and linked lists are two basic data structures used to store information. We may wish to search, insert or delete records in a database based on a key value. This section examines the performance of these operations on arrays and linked lists.



Arrays Figure 1-1 shows an array, seven elements long, containing numeric values. To search the array sequentially, we may use the algorithm in Figure 1-2. The maximum number of comparisons is 7, and occurs when the key we are searching for is in A[6].



Figure 1-1: An Array int function SequentialSearch (Array A, int Lb, int Ub, int Key); begin for i = Lb to Ub do if A(i) = Key then return i; return -1; end;



Figure 1-2: Sequential Search If the data is sorted, a binary search may be done (Figure 1-3). Variables Lb and Ub keep track of the lower bound and upper bound of the array, respectively. We begin by examining the middle element http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_int.htm (1 of 4) [3/23/2004 3:06:22 PM]



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_int.htm



of the array. If the key we are searching for is less than the middle element, then it must reside in the top half of the array. Thus, we set Ub to (M - 1). This restricts our next iteration through the loop to the top half of the array. In this way, each iteration halves the size of the array to be searched. For example, the first iteration will leave 3 items to test. After the second iteration, there will be 1 item left to test. Therefore it takes only three iterations to find any number. This is a powerful method. Given an array of 1023 elements, we can narrow the search to 511 items in one comparison. Another comparison, and we're looking at only 255 elements. In fact, we can search the entire array in only 10 comparisons. In addition to searching, we may wish to insert or delete entries. Unfortunately, an array is not a good arrangement for these operations. For example, to insert the number 18 in Figure 1-1, we would need to shift A[3]...A[6] down by one slot. Then we could copy number 18 into A[3]. A similar problem arises when deleting numbers. To improve the efficiency of insert and delete operations, linked lists may be used. int function BinarySearch (Array A, int Lb, int Ub, int Key); begin do forever M = (Lb + Ub)/2; if (Key < A[M]) then Ub = M - 1; else if (Key > A[M]) then Lb = M + 1; else return M; if (Lb > Ub) then return -1; end;



Figure 1-3: Binary Search
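For readers who want to run the algorithm of Figure 1-3 directly, here is a straightforward C rendering (mine, not one of the included listings):

#include <stdio.h>

/* return the index of key in a[lb..ub], or -1 if it is not present */
int binarySearch(int *a, int lb, int ub, int key) {
    int m;

    while (lb <= ub) {
        m = lb + (ub - lb) / 2;     /* middle element */
        if (key < a[m])
            ub = m - 1;             /* key is below the middle */
        else if (key > a[m])
            lb = m + 1;             /* key is above the middle */
        else
            return m;
    }
    return -1;
}

int main(void) {
    int a[] = { 4, 7, 16, 20, 37, 38, 43 };

    printf("%d\n", binarySearch(a, 0, 6, 16));   /* prints 2 */
    return 0;
}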



Linked Lists

In Figure 1-4, we have the same values stored in a linked list. Assuming pointers X and P, as shown in the figure, value 18 may be inserted as follows:

X->Next = P->Next;
P->Next = X;

Insertion and deletion operations are very efficient using linked lists. You may be wondering how pointer P was set in the first place. Well, we had to do a sequential search to find the insertion point X. Although we improved our performance for insertion/deletion, it has been at the expense of search time.



Figure 1-4: A Linked List
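The two-statement splice above can be wrapped into a tiny self-contained C program (the Node type and the variable names here are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>

typedef struct Node_ {
    int value;
    struct Node_ *Next;
} Node;

int main(void) {
    Node n2 = { 20, NULL };     /* build the two-node list 16 -> 20 */
    Node n1 = { 16, &n2 };
    Node *P = &n1;              /* P: node after which 18 belongs */
    Node *X;                    /* X: the new node */
    Node *p;

    if ((X = malloc(sizeof(Node))) == NULL) return 1;
    X->value = 18;

    /* the splice from the text */
    X->Next = P->Next;
    P->Next = X;

    for (p = &n1; p != NULL; p = p->Next)   /* prints: 16 18 20 */
        printf("%d ", p->value);
    printf("\n");
    free(X);
    return 0;
}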



Timing Estimates

Several methods may be used to compare the performance of algorithms. One way is simply to run several tests for each algorithm and compare the timings. Another way is to estimate the time required. For example, we may state that search time is O(n) (big-oh of n). This means that search time, for large n, is proportional to the number of items n in the list. Consequently, we would expect search time to triple if our list increased in size by a factor of three. The big-O notation does not describe the exact time that an algorithm takes, but only indicates an upper bound on execution time within a constant factor. If an algorithm takes O(n^2) time, then execution time grows no worse than the square of the size of the list.

          n   lg n        n lg n         n^1.25                  n^2
          1      0             0              1                    1
         16      4            64             32                  256
        256      8         2,048          1,024               65,536
      4,096     12        49,152         32,768           16,777,216
     65,536     16     1,048,576      1,048,576        4,294,967,296
  1,048,576     20    20,971,520     33,554,432    1,099,511,627,776
 16,777,216     24   402,653,184  1,073,741,824  281,474,976,710,656

Table 1-1: Growth Rates

Table 1-1 illustrates growth rates for various functions. A growth rate of O(lg n) occurs for algorithms similar to the binary search. The lg (logarithm, base 2) function increases by one when n is doubled.






Recall that we can search twice as many items with one more comparison in the binary search. Thus the binary search is a O(lg n) algorithm. If the values in Table 1-1 represented microseconds, then a O(lg n) algorithm may take 20 microseconds to process 1,048,576 items, a O(n^1.25) algorithm might take 33 seconds, and a O(n^2) algorithm might take up to 12 days! In the following chapters a timing estimate for each algorithm, using big-O notation, will be included. For a more formal derivation of these formulas you may wish to consult the references.
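If you want to reproduce Table 1-1, a few lines of C will tabulate the same growth rates (an illustration of the arithmetic, not part of the original listings):

#include <stdio.h>
#include <math.h>

int main(void) {
    int i;
    double n;

    /* rows of Table 1-1: n = 2^0, 2^4, 2^8, ..., 2^24 */
    printf("%12s %6s %14s %16s %22s\n", "n", "lg n", "n lg n", "n^1.25", "n^2");
    for (i = 0; i <= 24; i += 4) {
        n = pow(2.0, (double)i);
        printf("%12.0f %6d %14.0f %16.0f %22.0f\n",
               n, i, n * i, pow(n, 1.25), n * n);
    }
    return 0;   /* compile with -lm for the math library */
}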



Summary

As we have seen, sorted arrays may be searched efficiently using a binary search. However, we must have a sorted array to start with. In the next section various ways to sort arrays will be examined. It turns out that this is computationally expensive, and considerable research has been done to make sorting algorithms as efficient as possible. Linked lists improved the efficiency of insert and delete operations, but searches were sequential and time-consuming. Algorithms exist that do all three operations efficiently, and they will be discussed in the section on dictionaries.









Sorting

Several algorithms are presented, including insertion sort, shell sort, and quicksort. Sorting by insertion is the simplest method, and doesn't require any additional storage. Shell sort is a simple modification that improves performance significantly. Probably the most efficient and popular method is quicksort, which is the method of choice for large arrays.









Insertion Sort

One of the simplest methods to sort an array is an insertion sort. An example of an insertion sort occurs in everyday life while playing cards. To sort the cards in your hand you extract a card, shift the remaining cards, and then insert the extracted card in the correct place. This process is repeated until all the cards are in the correct sequence. Both average and worst-case time is O(n^2). For further reading, consult Knuth [1998].



Theory

Starting near the top of the array in Figure 2-1(a), we extract the 3. Then the above elements are shifted down until we find the correct place to insert the 3. This process repeats in Figure 2-1(b) with the next number. Finally, in Figure 2-1(c), we complete the sort by inserting 2 in the correct place.



Figure 2-1: Insertion Sort






Assuming there are n elements in the array, we must index through n - 1 entries. For each entry, we may need to examine and shift up to n - 1 other entries, resulting in an O(n^2) algorithm. The insertion sort is an in-place sort: we sort the array in place, and no extra memory is required. The insertion sort is also a stable sort. Stable sorts retain the original ordering of keys when identical keys are present in the input data.



Implementation

An ANSI-C implementation for insertion sort is included. Typedef T and comparison operator compGT should be altered to reflect the data stored in the table.









/* insert sort */
#include <stdio.h>
#include <stdlib.h>

typedef int T;                  /* type of item to be sorted */
typedef int tblIndex;           /* type of subscript */

#define compGT(a,b) (a > b)

void insertSort(T *a, tblIndex lb, tblIndex ub) {
    T t;
    tblIndex i, j;

   /**************************
    *  sort array a[lb..ub]  *
    **************************/
    for (i = lb + 1; i <= ub; i++) {
        t = a[i];

        /* shift down until insertion point found */
        for (j = i - 1; j >= lb && compGT(a[j], t); j--)
            a[j+1] = a[j];

        /* insert */
        a[j+1] = t;
    }
}

void fill(T *a, tblIndex lb, tblIndex ub) {
    tblIndex i;
    srand(1);
    for (i = lb; i <= ub; i++)
        a[i] = rand();
}

Shell Sort

Shell sort is a simple modification of the insertion sort: the array is repeatedly sorted by insertion using a spacing h that is reduced on each pass, until a final pass with h = 1. A suitable sequence of spacings may be generated by starting with h1 = 1 and repeating h(s+1) = (3 x hs) + 1 until hs >= N. Thus, values of h are computed as follows:

h1 = 1
h2 = (3 x 1) + 1 = 4
h3 = (3 x 4) + 1 = 13
h4 = (3 x 13) + 1 = 40
h5 = (3 x 40) + 1 = 121

To sort 100 items we first find an hs such that hs >= 100. For 100 items, h5 is selected. Our final value (ht) is two steps lower, or h3. Therefore our sequence of h values will be 13-4-1. Once the initial h value has been determined, subsequent values may be calculated using the formula hs-1 = floor(hs / 3).
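The spacing computation is easy to check with a few lines of C (a sketch for illustration; the shell sort listing below computes the same sequence inline):

#include <stdio.h>

int main(void) {
    int n = 100;    /* number of items to sort */
    int h = 1;

    while (h < n)                /* h grows 1, 4, 13, 40, 121 */
        h = 3 * h + 1;
    h /= 3; h /= 3;              /* back off two steps: 13 */

    while (h > 0) {              /* prints: 13 4 1 */
        printf("%d ", h);
        h /= 3;
    }
    printf("\n");
    return 0;
}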



Implementation

An ANSI-C implementation for shell sort is included. Typedef T and comparison operator compGT should be altered to reflect the data stored in the array. The central portion of the algorithm is an insertion sort with a spacing of h.









/* shell sort */
#include <stdio.h>
#include <stdlib.h>

typedef int T;                  /* type of item to be sorted */
typedef int tblIndex;           /* type of subscript */

#define compGT(a,b) (a > b)

void shellSort(T *a, tblIndex lb, tblIndex ub) {
    tblIndex n, h, i, j;
    T t;

   /**************************
    *  sort array a[lb..ub]  *
    **************************/

    /* compute largest increment */
    n = ub - lb + 1;
    h = 1;
    if (n < 14)
        h = 1;
    else if (sizeof(tblIndex) == 2 && n > 29524)
        h = 3280;
    else {
        while (h < n) h = 3*h + 1;
        h /= 3;
        h /= 3;
    }

    while (h > 0) {
        /* sort-by-insertion in increments of h */
        for (i = lb + h; i <= ub; i++) {
            t = a[i];
            for (j = i - h; j >= lb && compGT(a[j], t); j -= h)
                a[j+h] = a[j];
            a[j+h] = t;
        }

        /* compute next increment */
        h /= 3;
    }
}

void fill(T *a, tblIndex lb, tblIndex ub) {
    tblIndex i;
    srand(1);
    for (i = lb; i <= ub; i++)
        a[i] = rand();
}

/* quicksort */
#include <stdio.h>
#include <stdlib.h>

typedef int T;                  /* type of item to be sorted */
typedef int tblIndex;           /* type of subscript */

#define compGT(a,b) (a > b)

void insertSort(T *a, tblIndex lb, tblIndex ub) {
    T t;
    tblIndex i, j;

   /**************************
    *  sort array a[lb..ub]  *
    **************************/
    for (i = lb + 1; i <= ub; i++) {
        t = a[i];

        /* shift down until insertion point found */
        for (j = i - 1; j >= lb && compGT(a[j], t); j--)
            a[j+1] = a[j];

        /* insert */
        a[j+1] = t;
    }
}

tblIndex partition(T *a, tblIndex lb, tblIndex ub) {
    T t, pivot;
    tblIndex i, j, p;

   /*******************************
    *  partition array a[lb..ub]  *
    *******************************/

    /* select pivot and exchange with 1st element */
    p = lb + ((ub - lb)>>1);
    pivot = a[p];
    a[p] = a[lb];

    /* sort lb+1..ub based on pivot */
    i = lb+1;
    j = ub;
    while (1) {
        while (i < j && compGT(pivot, a[i])) i++;
        while (j >= i && compGT(a[j], pivot)) j--;
        if (i >= j) break;
        t = a[i]; a[i] = a[j]; a[j] = t;
        j--; i++;
    }

    /* pivot belongs in a[j] */
    a[lb] = a[j];
    a[j] = pivot;

    return j;
}

void quickSort(T *a, tblIndex lb, tblIndex ub) {
    tblIndex m;

   /**************************









    *  sort array a[lb..ub]  *
    **************************/
    while (lb < ub) {

        /* quickly sort short lists */
        if (ub - lb <= 12) {
            insertSort(a, lb, ub);
            return;
        }

        m = partition(a, lb, ub);

        /* eliminate tail recursion and */
        /* sort the smallest partition first */
        if (m - lb <= ub - m) {
            quickSort(a, lb, m - 1);
            lb = m + 1;
        } else {
            quickSort(a, m + 1, ub);
            ub = m - 1;
        }
    }
}

/* partition loop of the generic (qsort-compatible) version:
   i and j are byte addresses, and elements are compared
   against the pivot, which has been exchanged into lb */
        while (1) {
            while (i < j && compar(lb, i) > 0) i += size;
            while (j >= i && compar(j, lb) > 0) j -= size;
            if (i >= j) break;
            exchange (i, j, size);
            j -= size;
            i += size;
        }









        /* pivot belongs in A[j] */
        exchange (lb, j, size);
        m = j;

        /* keep processing smallest segment, and stack largest */
        if (m - lb <= ub - m) {
            if (m + size < ub) {
                lbStack[sp] = m + size;
                ubStack[sp++] = ub;
            }
            ub = m - size;
        } else {
            if (m - size > lb) {
                lbStack[sp] = lb;
                ubStack[sp++] = m - size;
            }
            lb = m + size;
        }
    }
}

void fill(T *lb, T *ub) {
    T *i;
    srand(1);
    for (i = lb; i <= ub; i++) *i = rand();
}

Hash Tables

Several methods may be used to compute a hash function; common choices include the following:

• Multiplication method (tablesize = 2^n). The key is multiplied by a constant K, and the high-order bits of the product are extracted with a right shift to index the table:

HashValue = (HashIndexType)(K * Key) >> S;

For example, if HashTableSize is 1024 (2^10), then a 16-bit index is sufficient and S would be assigned a value of 16 - 10 = 6. Thus, we have:

typedef unsigned short int HashIndexType;

HashIndexType Hash(int Key) {
    static const HashIndexType K = 40503;
    static const int S = 6;
    return (HashIndexType)(K * Key) >> S;
}



• Variable string addition method (tablesize = 256). To hash a variable-length string, each character is added, modulo 256, to a total. A HashValue, in the range 0-255, is computed.

typedef unsigned char HashIndexType;

HashIndexType Hash(char *str) {
    HashIndexType h = 0;
    while (*str) h += *str++;
    return h;
}







• Variable string exclusive-or method (tablesize = 256). This method is similar to the addition method, but successfully distinguishes similar words and anagrams. To obtain a hash value in the range 0-255, all bytes in the string are exclusive-or'd together. However, in the process of doing each exclusive-or, a random component is introduced.

typedef unsigned char HashIndexType;
unsigned char Rand8[256];

HashIndexType Hash(char *str) {
    unsigned char h = 0;
    while (*str) h = Rand8[h ^ *str++];
    return h;
}

Rand8 is a table of 256 8-bit unique random numbers. The exact ordering is not critical. The exclusive-or method has its basis in cryptography, and is quite effective (Pearson [1990]).
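The text does not specify how Rand8 is built; one common approach, sketched here as an assumption, is to shuffle the values 0..255 into a random permutation:

#include <stdlib.h>

unsigned char Rand8[256];

/* fill Rand8 with a random permutation of 0..255 (Fisher-Yates shuffle) */
void initRand8(unsigned seed) {
    int i, j;
    unsigned char t;

    srand(seed);
    for (i = 0; i < 256; i++)
        Rand8[i] = (unsigned char)i;
    for (i = 255; i > 0; i--) {
        j = rand() % (i + 1);
        t = Rand8[i]; Rand8[i] = Rand8[j]; Rand8[j] = t;
    }
}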







• Variable string exclusive-or method (tablesize <= 65536). This method may be extended to derive a 16-bit hash value: the string is hashed twice, the second time with the first character modified, and the two 8-bit results are concatenated (Pearson [1990]).

Implementation

An ANSI-C implementation of a hash table is included. Typedef T and comparison operator compEQ should be altered to reflect the data stored in the table.

/* hash table */
#include <stdio.h>
#include <stdlib.h>

typedef int T;                  /* type of item to be stored */
#define compEQ(a,b) (a == b)

typedef struct Node_ {
    struct Node_ *next;         /* next node on chain */
    T data;                     /* data stored in node */
} Node;

typedef int hashTableIndex;     /* index into hash table */

Node **hashTable;
int hashTableSize;

hashTableIndex hash(T data) {
    return (data % hashTableSize);
}

Node *insertNode(T data) {
    Node *p, *p0;
    hashTableIndex bucket;

   /************************************************
    *  allocate node for data and insert in table  *
    ************************************************/

    /* insert node at beginning of chain */
    bucket = hash(data);
    if ((p = malloc(sizeof(Node))) == 0) {
        fprintf (stderr, "out of memory (insertNode)\n");
        exit(1);
    }
    p0 = hashTable[bucket];
    hashTable[bucket] = p;
    p->next = p0;
    p->data = data;
    return p;
}

void deleteNode(T data) {
    Node *p0, *p;
    hashTableIndex bucket;

   /********************************************
    *  delete node containing data from table  *
    ********************************************/

    /* find node */
    p0 = 0;
    bucket = hash(data);
    p = hashTable[bucket];
    while (p && !compEQ(p->data, data)) {
        p0 = p;
        p = p->next;
    }
    if (!p) return;









    /* p designates node to delete, remove it from list */
    if (p0)
        /* not first node, p0 points to previous node */
        p0->next = p->next;
    else
        /* first node on chain */
        hashTable[bucket] = p->next;
    free (p);
}

Node *findNode (T data) {
    Node *p;

   /*******************************
    *  find node containing data  *
    *******************************/
    p = hashTable[hash(data)];
    while (p && !compEQ(p->data, data))
        p = p->next;
    return p;
}

int main(int argc, char **argv) {
    int i, *a, maxnum, random;

    /* command-line:
     *
     *   has maxnum hashTableSize [random]
     *
     *   has 2000 100
     *       processes 2000 records, tablesize=100, sequential numbers
     *   has 4000 200 r
     *       processes 4000 records, tablesize=200, random numbers
     */
    maxnum = atoi(argv[1]);
    hashTableSize = atoi(argv[2]);
    random = argc > 3;

    if ((a = malloc(maxnum * sizeof(*a))) == 0) {
        fprintf (stderr, "out of memory (a)\n");
        exit(1);
    }
    /* calloc zeroes the buckets, so each chain starts empty */
    if ((hashTable = calloc(hashTableSize, sizeof(Node *))) == 0) {
        fprintf (stderr, "out of memory (hashTable)\n");
        exit(1);
    }

    if (random) {
        /* fill "a" with unique random numbers */
        for (i = 0; i < maxnum; i++) a[i] = rand();
        printf ("ran ht, %d items, %d hashTable\n", maxnum, hashTableSize);
    } else {
        for (i = 0; i < maxnum; i++) a[i] = i;
        printf ("seq ht, %d items, %d hashTable\n", maxnum, hashTableSize);
    }

    for (i = 0; i < maxnum; i++) {
        insertNode(a[i]);
    }
    for (i = maxnum-1; i >= 0; i--) {
        findNode(a[i]);
    }
    for (i = maxnum-1; i >= 0; i--) {
        deleteNode(a[i]);
    }
    return 0;
}









Binary Search Trees

In the introduction we used the binary search algorithm to find data stored in an array. This method is very effective, as each iteration reduced the number of items to search by one-half. However, since data was stored in an array, insertions and deletions were not efficient. Binary search trees store data in nodes that are linked in a tree-like fashion. For randomly inserted data, search time is O(lg n). Worst-case behavior occurs when ordered data is inserted. In this case the search time is O(n). See Cormen [1990] for a more detailed description.



Theory

A binary search tree is a tree where each node has a left and right child. Either child, or both children, may be missing. Figure 3-2 illustrates a binary search tree. Assuming k represents the value of a given node, then a binary search tree also has the following property: all children to the left of the node have values smaller than k, and all children to the right of the node have values larger than k. The top of a tree is known as the root, and the exposed nodes at the bottom are known as leaves. In Figure 3-2, the root is node 20 and the leaves are nodes 4, 16, 37, and 43. The height of a tree is the length of the longest path from root to leaf. For this example the tree height is 2.



Figure 3-2: A Binary Search Tree

To search a tree for a given value, we start at the root and work down. For example, to search for 16, we first note that 16 < 20 and we traverse to the left child. The second comparison finds that 16 > 7, so we traverse to the right child. On the third comparison, we succeed. Each comparison results in reducing the number of items to inspect by one-half. In this respect, the algorithm is similar to a binary search on an array. However, this is true only if the tree is balanced. For example, Figure 3-3 shows another tree containing the same values. While it is a binary search tree, its behavior is more like that of a linked list, with search time increasing proportional to the number of elements stored.









Figure 3-3: An Unbalanced Binary Search Tree



Insertion and Deletion

Let us examine insertions in a binary search tree to determine the conditions that can cause an unbalanced tree. To insert an 18 in the tree in Figure 3-2, we first search for that number. This causes us to arrive at node 16 with nowhere to go. Since 18 > 16, we simply add node 18 to the right child of node 16 (Figure 3-4). Now we can see how an unbalanced tree can occur. If the data is presented in an ascending sequence, each node will be added to the right of the previous node. This will create one long chain, or linked list. However, if data is presented for insertion in a random order, then a more balanced tree is possible.

Deletions are similar, but require that the binary search tree property be maintained. For example, if node 20 in Figure 3-4 is removed, it must be replaced by node 37. This results in the tree shown in Figure 3-5. The rationale for this choice is as follows. The successor for node 20 must be chosen such that all nodes to the right are larger. Therefore we need to select the smallest valued node to the right of node 20. To make the selection, chain once to the right (node 38), and then chain to the left until the last node is found (node 37). This is the successor for node 20.









Figure 3-4: Binary Tree After Adding Node 18



Figure 3-5: Binary Tree After Deleting Node 20



Implementation

An ANSI-C implementation for a binary search tree is included. Typedef T and comparison operators compLT and compEQ should be altered to reflect the data stored in the tree. Each Node consists of left, right, and parent pointers designating each child and the parent. Data is stored in the data field. The tree is based at root, and is initially NULL. Function insertNode allocates a new node and inserts it in the tree. Function deleteNode deletes and frees a node from the tree. Function findNode searches the tree for a particular value.









/* binary search tree */
#include <stdio.h>
#include <stdlib.h>

typedef int T;                  /* type of item to be stored */
#define compLT(a,b) (a < b)
#define compEQ(a,b) (a == b)

typedef struct Node_ {
    struct Node_ *left;         /* left child */
    struct Node_ *right;        /* right child */
    struct Node_ *parent;       /* parent */
    T data;                     /* data stored in node */
} Node;

Node *root = NULL;              /* root of binary tree */



Node *insertNode(T data) {
    Node *x, *current, *parent;

   /***********************************************
    *  allocate node for data and insert in tree  *
    ***********************************************/

    /* find x's parent */
    current = root;
    parent = 0;
    while (current) {
        if (compEQ(data, current->data)) return (current);
        parent = current;
        current = compLT(data, current->data) ?
            current->left : current->right;
    }

    /* setup new node */
    if ((x = malloc (sizeof(*x))) == 0) {
        fprintf (stderr, "insufficient memory (insertNode)\n");
        exit(1);
    }
    x->data = data;
    x->parent = parent;
    x->left = NULL;
    x->right = NULL;

    /* insert x in tree */
    if (parent)
        if (compLT(x->data, parent->data))
            parent->left = x;
        else
            parent->right = x;
    else
        root = x;

    return(x);
}

void deleteNode(Node *z) {
    Node *x, *y;

   /*****************************
    *  delete node z from tree  *
    *****************************/

    /* y will be removed from the parent chain */
    if (!z) return;









    /* find tree successor */
    if (z->left == NULL || z->right == NULL)
        y = z;
    else {
        y = z->right;
        while (y->left != NULL) y = y->left;
    }

    /* x is y's only child */
    if (y->left != NULL)
        x = y->left;
    else
        x = y->right;

    /* remove y from the parent chain */
    if (x) x->parent = y->parent;
    if (y->parent)
        if (y == y->parent->left)
            y->parent->left = x;
        else
            y->parent->right = x;
    else
        root = x;

    /* y is the node we're removing */
    /* z is the data we're removing */
    /* if z and y are not the same, replace z with y. */
    if (y != z) {
        y->left = z->left;
        if (y->left) y->left->parent = y;
        y->right = z->right;
        if (y->right) y->right->parent = y;
        y->parent = z->parent;
        if (z->parent)
            if (z == z->parent->left)
                z->parent->left = y;
            else
                z->parent->right = y;
        else
            root = y;
        free (z);
    } else {
        free (y);
    }
}

Node *findNode(T data) {

   /*******************************
    *  find node containing data  *
    *******************************/
    Node *current = root;
    while (current != NULL)
        if (compEQ(data, current->data))
            return (current);
        else
            current = compLT(data, current->data) ?
                current->left : current->right;
    return(0);
}

int main(int argc, char **argv) {
    int i, *a, maxnum, random;

    /* command-line:









     *
     *   bin maxnum [random]
     *
     *   bin 5000        processes 5000 sequential records
     *   bin 2000 r      processes 2000 random records
     */
    maxnum = atoi(argv[1]);
    random = argc > 2;

    if ((a = malloc(maxnum * sizeof(*a))) == 0) {
        fprintf (stderr, "insufficient memory (a)\n");
        exit(1);
    }

    if (random) {
        /* fill "a" with unique random numbers */
        for (i = 0; i < maxnum; i++) a[i] = rand();
        printf ("ran bt, %d items\n", maxnum);
    } else {
        for (i = 0; i < maxnum; i++) a[i] = i;
        printf ("seq bt, %d items\n", maxnum);
    }

    for (i = 0; i < maxnum; i++) {
        insertNode(a[i]);
    }
    for (i = maxnum-1; i >= 0; i--) {
        findNode(a[i]);
    }
    for (i = maxnum-1; i >= 0; i--) {
        deleteNode(findNode(a[i]));
    }
    return 0;
}









Red-Black Trees

Binary search trees work best when they are balanced, that is, when the path length from root to any leaf is within some bounds. The red-black tree algorithm is a method for balancing trees. The name derives from the fact that each node is colored red or black, and the color of the node is instrumental in determining the balance of the tree. During insert and delete operations, nodes may be rotated to maintain tree balance. Both average and worst-case search time is O(lg n). For details, consult Cormen [1990].



Theory

A red-black tree is a balanced binary search tree with the following properties:

1. Every node is colored red or black.
2. Every leaf is a NIL node, and is colored black.
3. If a node is red, then both its children are black.
4. Every simple path from a node to a descendant leaf contains the same number of black nodes.



The number of black nodes on a path from root to leaf is known as the black-height of a tree. These properties guarantee that any path from the root to a leaf is no more than twice as long as any other. To see why this is true, consider a tree with a black-height of two. The shortest distance from root to leaf is two, where both nodes are black. The longest distance from root to leaf is four, where the nodes are colored (root to leaf): red, black, red, black. It is not possible to insert more black nodes as this would violate property 4, the black-height requirement. Since red nodes must have black children (property 3), having two red nodes in a row is not allowed. The largest path we can construct consists of an alternation of red-black nodes, or twice the length of a path containing only black nodes.

All operations on the tree must maintain the properties listed above. In particular, operations which insert or delete items from the tree must abide by these rules.



Insertion

To insert a node, we search the tree for an insertion point, and add the node to the tree. A new node replaces an existing NIL node at the bottom of the tree, and has two NIL nodes as children. In the implementation, a NIL node is simply a pointer to a common sentinel node that is colored black. After insertion, the new node is colored red. Then the parent of the node is examined to determine if the red-black tree properties have been violated. If necessary, we recolor the node and do rotations to balance the tree.

By inserting a red node with two NIL children, we have preserved the black-height property (property 4). However, property 3 may be violated. This property states that both children of a red node must be black. Although both children of the new node are black (they're NIL), consider the case where the






parent of the new node is red. Inserting a red node under a red parent would violate this property. There are two cases to consider:

• Red parent, red uncle: Figure 3-6 illustrates a red-red violation. Node X is the newly inserted node, with both parent and uncle colored red. A simple recoloring removes the red-red violation. After recoloring, the grandparent (node B) must be checked for validity, as its parent may be red. Note that this has the effect of propagating a red node up the tree. On completion, the root of the tree is marked black. If it was originally red, then this has the effect of increasing the black-height of the tree.

• Red parent, black uncle: Figure 3-7 illustrates a red-red violation, where the uncle is colored black. Here the nodes may be rotated, with the subtrees adjusted as shown. At this point the algorithm may terminate as there are no red-red conflicts and the top of the subtree (node A) is colored black. Note that if node X was originally a right child, a left rotation would be done first, making the node a left child.



Each adjustment made while inserting a node causes us to travel up the tree one step. At most 1 rotation (2 if the node is a right child) will be done, as the algorithm terminates in this case. The technique for deletion is similar.









Figure 3-6: Insertion - Red Parent, Red Uncle









Figure 3-7: Insertion - Red Parent, Black Uncle



Implementation

An ANSI-C implementation for red-black trees is included. Typedef T and comparison operators compLT and compEQ should be altered to reflect the data stored in the tree. Each Node consists of left, right, and parent pointers designating each child and the parent. The node color is stored in






color, and is either RED or BLACK. The data is stored in the data field. All leaf nodes of the tree are sentinel nodes, to simplify coding. The tree is based at root, and initially is a sentinel node. Function insertNode allocates a new node and inserts it in the tree. Subsequently, it calls insertFixup to ensure that the red-black tree properties are maintained. Function deleteNode deletes a node from the tree. To maintain red-black tree properties, deleteFixup is called. Function findNode searches the tree for a particular value.









/* red-black tree */
#include <stdio.h>
#include <stdlib.h>







typedef int T;                  /* type of item to be stored */
#define compLT(a,b) (a < b)
#define compEQ(a,b) (a == b)

/* Red-Black tree description */
typedef enum { BLACK, RED } nodeColor;

typedef struct Node_ {
    struct Node_ *left;         /* left child */
    struct Node_ *right;        /* right child */
    struct Node_ *parent;       /* parent */
    nodeColor color;            /* node color (BLACK, RED) */
    T data;                     /* data stored in node */
} Node;

#define NIL &sentinel           /* all leafs are sentinels */
Node sentinel = { NIL, NIL, 0, BLACK, 0};

Node *root = NIL;               /* root of Red-Black tree */



void rotateLeft(Node *x) { /************************** * rotate node x to left * **************************/ Node *y = x->right; /* establish x->right link */ x->right = y->left; if (y->left != NIL) y->left->parent = x; /* establish y->parent link */ if (y != NIL) y->parent = x->parent; if (x->parent) { if (x == x->parent->left) x->parent->left = y; else x->parent->right = y; } else { root = y; } /* link x and y */ y->left = x; if (x != NIL) x->parent = y; } void rotateRight(Node *x) { /**************************** * rotate node x to right * ****************************/ Node *y = x->left; /* establish x->left link */ x->left = y->right; if (y->right != NIL) y->right->parent = x;



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_rbt.txt (1 of 5) [3/23/2004 3:08:21 PM]



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_rbt.txt



/* establish y->parent link */ if (y != NIL) y->parent = x->parent; if (x->parent) { if (x == x->parent->right) x->parent->right = y; else x->parent->left = y; } else { root = y; } /* link x and y */ y->right = x; if (x != NIL) x->parent = y; } void insertFixup(Node *x) { /************************************* * maintain Red-Black tree balance * * after inserting node x * *************************************/ /* check Red-Black properties */ while (x != root && x->parent->color == RED) { /* we have a violation */ if (x->parent == x->parent->parent->left) { Node *y = x->parent->parent->right; if (y->color == RED) { /* uncle is RED */ x->parent->color = BLACK; y->color = BLACK; x->parent->parent->color = RED; x = x->parent->parent; } else { /* uncle is BLACK */ if (x == x->parent->right) { /* make x a left child */ x = x->parent; rotateLeft(x); } /* recolor and rotate */ x->parent->color = BLACK; x->parent->parent->color = RED; rotateRight(x->parent->parent); } } else { /* mirror image of above code */ Node *y = x->parent->parent->left; if (y->color == RED) { /* uncle is RED */ x->parent->color = BLACK; y->color = BLACK; x->parent->parent->color = RED; x = x->parent->parent; } else { /* uncle is BLACK */ if (x == x->parent->left) { x = x->parent; rotateRight(x);



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_rbt.txt (2 of 5) [3/23/2004 3:08:21 PM]



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_rbt.txt



} x->parent->color = BLACK; x->parent->parent->color = RED; rotateLeft(x->parent->parent); } } } root->color = BLACK; } Node *insertNode(T data) { Node *current, *parent, *x; /*********************************************** * allocate node for data and insert in tree * ***********************************************/ /* find where node belongs */ current = root; parent = 0; while (current != NIL) { if (compEQ(data, current->data)) return (current); parent = current; current = compLT(data, current->data) ? current->left : current->right; } /* setup new node */ if ((x = malloc (sizeof(*x))) == 0) { printf ("insufficient memory (insertNode)\n"); exit(1); } x->data = data; x->parent = parent; x->left = NIL; x->right = NIL; x->color = RED; /* insert node in tree */ if(parent) { if(compLT(data, parent->data)) parent->left = x; else parent->right = x; } else { root = x; } insertFixup(x); return(x); } void deleteFixup(Node *x) { /************************************* * maintain Red-Black tree balance * * after deleting node x * *************************************/ while (x != root && x->color == BLACK) { if (x == x->parent->left) { Node *w = x->parent->right; if (w->color == RED) { w->color = BLACK; x->parent->color = RED; rotateLeft (x->parent);



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_rbt.txt (3 of 5) [3/23/2004 3:08:21 PM]



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_rbt.txt



w = x->parent->right; } if (w->left->color == BLACK && w->right->color == BLACK) { w->color = RED; x = x->parent; } else { if (w->right->color == BLACK) { w->left->color = BLACK; w->color = RED; rotateRight (w); w = x->parent->right; } w->color = x->parent->color; x->parent->color = BLACK; w->right->color = BLACK; rotateLeft (x->parent); x = root; } } else { Node *w = x->parent->left; if (w->color == RED) { w->color = BLACK; x->parent->color = RED; rotateRight (x->parent); w = x->parent->left; } if (w->right->color == BLACK && w->left->color == BLACK) { w->color = RED; x = x->parent; } else { if (w->left->color == BLACK) { w->right->color = BLACK; w->color = RED; rotateLeft (w); w = x->parent->left; } w->color = x->parent->color; x->parent->color = BLACK; w->left->color = BLACK; rotateRight (x->parent); x = root; } } } x->color = BLACK; } void deleteNode(Node *z) { Node *x, *y; /***************************** * delete node z from tree * *****************************/ if (!z || z == NIL) return; if (z->left == NIL || z->right == NIL) { /* y has a NIL node as a child */ y = z; } else { /* find tree successor with a NIL node as a child */ y = z->right; while (y->left != NIL) y = y->left; }



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_rbt.txt (4 of 5) [3/23/2004 3:08:21 PM]



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_rbt.txt



    /* x is y's only child */
    if (y->left != NIL)
        x = y->left;
    else
        x = y->right;

    /* remove y from the parent chain */
    x->parent = y->parent;
    if (y->parent)
        if (y == y->parent->left)
            y->parent->left = x;
        else
            y->parent->right = x;
    else
        root = x;

    if (y != z) z->data = y->data;

    if (y->color == BLACK)
        deleteFixup (x);

    free (y);
}

Node *findNode(T data) {

   /*******************************
    *  find node containing data  *
    *******************************/
    Node *current = root;
    while (current != NIL)
        if (compEQ(data, current->data))
            return (current);
        else
            current = compLT (data, current->data) ?
                current->left : current->right;
    return(0);
}

int main(int argc, char **argv) {
    int a, maxnum, ct;
    Node *t;

    /* command-line:
     *
     *   rbt maxnum
     *
     *   rbt 2000
     *       process 2000 records
     */
    maxnum = atoi(argv[1]);

    for (ct = maxnum; ct; ct--) {
        a = rand() % 9 + 1;
        if ((t = findNode(a)) != NULL) {
            deleteNode(t);
        } else {
            insertNode(a);
        }
    }
    return 0;
}









Skip Lists

Skip lists are linked lists that allow you to skip to the correct node. The performance bottleneck inherent in a sequential scan is avoided, while insertion and deletion remain relatively efficient. Average search time is O(lg n). Worst-case search time is O(n), but this is extremely unlikely. An excellent reference for skip lists is Pugh [1990].



Theory

The indexing scheme employed in skip lists is similar in nature to the method used to look up names in an address book. To look up a name, you index to the tab representing the first character of the desired entry. In Figure 3-8, for example, the top-most list represents a simple linked list with no tabs. Adding tabs (middle figure) facilitates the search. In this case, level-1 pointers are traversed. Once the correct segment of the list is found, level-0 pointers are traversed to find the specific entry.



Figure 3-8: Skip List Construction

The indexing scheme may be extended as shown in the bottom figure, where we now have an index to the index. To locate an item, level-2 pointers are traversed until the correct segment of the list is identified. Subsequently, level-1 and level-0 pointers are traversed.

During insertion the number of pointers required for a new node must be determined. This is easily resolved using a probabilistic technique. A random number generator is used to toss a computer coin. When inserting a new node, the coin is tossed to determine if it should be level-1. If you win, the coin is tossed again to determine if the node should be level-2. Another win, and the coin is tossed to determine if the node should be level-3. This process repeats until you lose. If only one level (level-0)






is implemented, the data structure is a simple linked-list with O(n) search time. However, if sufficient levels are implemented, the skip list may be viewed as a tree with the root at the highest level, and search time is O(lg n). The skip list algorithm has a probabilistic component, and thus a probabilistic bounds on the time required to execute. However, these bounds are quite tight in normal circumstances. For example, to search a list containing 1000 items, the probability that search time will be 5 times the average is about 1 in 1,000,000,000,000,000,000.
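The coin-tossing scheme described above translates directly into C; in this sketch, the p = 1/2 coin and the MAXLEVEL cap are the only assumptions:

#include <stdlib.h>

#define MAXLEVEL 15   /* cap on levels; set from the expected data set size */

/* toss a fair (p = 1/2) coin until we lose or hit the cap;
   returns 0 for a level-0 node, 1 for a level-1 node, and so on */
int randomLevel(void) {
    int level = 0;
    while ((rand() & 1) && level < MAXLEVEL)
        level++;
    return level;
}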



Implementation

An ANSI-C implementation for skip lists is included. Typedef T and comparison operators compLT and compEQ should be altered to reflect the data stored in the list. In addition, MAXLEVEL should be set based on the maximum size of the dataset.

To initialize, initList is called. The list header is allocated and initialized. To indicate an empty list, all levels are set to point to the header. Function insertNode allocates a new node, searches for the correct insertion point, and inserts it in the list. While searching, the update array maintains pointers to the upper-level nodes encountered. This information is subsequently used to establish correct links for the newly inserted node. The newLevel is determined using a random number generator, and the node allocated. The forward links are then established using information from the update array. Function deleteNode deletes and frees a node, and is implemented in a similar manner. Function findNode searches the list for a particular value.






Skip Lists: A Probabilistic Alternative to Balanced Trees

William Pugh

Skip lists are a data structure that can be used in place of balanced trees. Skip lists use probabilistic balancing rather than strictly enforced balancing and as a result the algorithms for insertion and deletion in skip lists are much simpler and significantly faster than equivalent algorithms for balanced trees.

Binary trees can be used for representing abstract data types such as dictionaries and ordered lists. They work well when the elements are inserted in a random order. Some sequences of operations, such as inserting the elements in order, produce degenerate data structures that give very poor performance. If it were possible to randomly permute the list of items to be inserted, trees would work well with high probability for any input sequence. In most cases queries must be answered on-line, so randomly permuting the input is impractical. Balanced tree algorithms re-arrange the tree as operations are performed to maintain certain balance conditions and assure good performance.

Skip lists are a probabilistic alternative to balanced trees. Skip lists are balanced by consulting a random number generator. Although skip lists have bad worst-case performance, no input sequence consistently produces the worst-case performance (much like quicksort when the pivot element is chosen randomly). It is very unlikely a skip list data structure will be significantly unbalanced (e.g., for a dictionary of more than 250 elements, the chance that a search will take more than 3 times the expected time is less than one in a million). Skip lists have balance properties similar to that of search trees built by random insertions, yet do not require insertions to be random.

Balancing a data structure probabilistically is easier than explicitly maintaining the balance. For many applications, skip lists are a more natural representation than trees, also leading to simpler algorithms. The simplicity of skip list algorithms makes them easier to implement and provides significant constant factor speed improvements over balanced tree and self-adjusting tree algorithms. Skip lists are also very space efficient. They can easily be configured to require an average of 1 1/3 pointers per element (or even less) and do not require balance or priority information to be stored with each node.



SKIP LISTS

We might need to examine every node of the list when searching a linked list (Figure 1a). If the list is stored in sorted order and every other node of the list also has a pointer to the node two ahead of it in the list (Figure 1b), we have to examine no more than n/2 + 1 nodes (where n is the length of the list).



Also giving every fourth node a pointer four ahead (Figure 1c) requires that no more than n/4 + 2 nodes be examined. If every (2^i)th node has a pointer 2^i nodes ahead (Figure 1d), the number of nodes that must be examined can be reduced to log2(n) while only doubling the number of pointers. This data structure could be used for fast searching, but insertion and deletion would be impractical.

A node that has k forward pointers is called a level k node. If every (2^i)th node has a pointer 2^i nodes ahead, then levels of nodes are distributed in a simple pattern: 50% are level 1, 25% are level 2, 12.5% are level 3 and so on. What would happen if the levels of nodes were chosen randomly, but in the same proportions (e.g., as in Figure 1e)? A node's ith forward pointer, instead of pointing 2^(i-1) nodes ahead, points to the next node of level i or higher. Insertions or deletions would require only local modifications; the level of a node, chosen randomly when the node is inserted, need never change. Some arrangements of levels would give poor execution times, but we will see that such arrangements are rare. Because these data structures are linked lists with extra pointers that skip over intermediate nodes, I named them skip lists.



SKIP LIST ALGORITHMS

This section gives algorithms to search for, insert and delete elements in a dictionary or symbol table. The Search operation returns the contents of the value associated with the desired key or failure if the key is not present. The Insert operation associates a specified key with a new value (inserting the key if it had not already been present). The Delete operation deletes the specified key. It is easy to support additional operations such as "find the minimum key" or "find the next key".

Each element is represented by a node, the level of which is chosen randomly when the node is inserted without regard for the number of elements in the data structure. A level i node has i forward pointers, indexed 1 through i. We do not need to store the level of a node in the node. Levels are capped at some appropriate constant MaxLevel. The level of a list is the maximum level currently in the list (or 1 if the list is empty). The header of a list has forward pointers at levels one through MaxLevel. The forward pointers of the header at levels higher than the current maximum level of the list point to NIL.



FIGURE 1 - Linked lists with additional pointers: (a) a simple linked list; (b) every second node with a pointer two ahead; (c) every fourth node with a pointer four ahead; (d) every (2^i)th node with a pointer 2^i ahead; (e) the same distribution of levels, assigned randomly. Each list holds the keys 3, 6, 7, 9, 12, 17, 19, 21, 25, 26, terminated by NIL.



Initialization

An element NIL is allocated and given a key greater than any legal key. All levels of all skip lists are terminated with NIL. A new list is initialized so that the level of the list is equal to 1 and all forward pointers of the list's header point to NIL.



Search Algorithm

We search for an element by traversing forward pointers that do not overshoot the node containing the element being searched for (Figure 2). When no more progress can be made at the current level of forward pointers, the search moves down to the next level. When we can make no more progress at level 1, we must be immediately in front of the node that contains the desired element (if it is in the list).

Search(list, searchKey)
    x := list→header
    -- loop invariant: x→key < searchKey
    for i := list→level downto 1 do
        while x→forward[i]→key < searchKey do
            x := x→forward[i]
    -- x→key < searchKey ≤ x→forward[1]→key
    x := x→forward[1]
    if x→key = searchKey then return x→value
    else return failure

FIGURE 2 - Skip list search algorithm

Insertion and Deletion Algorithms

To insert or delete a node, we simply search and splice, as shown in Figure 3. Figure 4 gives algorithms for insertion and deletion. A vector update is maintained so that when the search is complete (and we are ready to perform the splice), update[i] contains a pointer to the rightmost node of level i or higher that is to the left of the location of the insertion/deletion. If an insertion generates a node with a level greater than the previous maximum level of the list, we update the maximum level of the list and initialize the appropriate portions of the update vector. After each deletion, we check if we have deleted the maximum element of the list and if so, decrease the maximum level of the list.

Choosing a Random Level

Initially, we discussed a probability distribution where half of the nodes that have level i pointers also have level i+1 pointers. To get away from magic constants, we say that a fraction p of the nodes with level i pointers also have level i+1 pointers (for our original discussion, p = 1/2). Levels are generated randomly by an algorithm equivalent to the one in Figure 5. Levels are generated without reference to the number of elements in the list.



At what level do we start a search? Defining L(n)

In a skip list of 16 elements generated with p = 1/2, we might happen to have 9 elements of level 1, 3 elements of level 2, 3 elements of level 3 and 1 element of level 14 (this would be very unlikely, but it could happen). How should we handle this? If we use the standard algorithm and start our search at level 14, we will do a lot of useless work.

Where should we start the search? Our analysis suggests that ideally we would start a search at the level L where we expect 1/p nodes. This happens when L = log_{1/p} n. Since we will be referring frequently to this formula, we will use L(n) to denote log_{1/p} n.

There are a number of solutions to the problem of deciding how to handle the case where there is an element with an unusually large level in the list.

• Don't worry, be happy. Simply start a search at the highest level present in the list. As we will see in our analysis, the probability that the maximum level in a list of n elements is significantly larger than L(n) is very small. Starting a search at the maximum level in the list does not add more than a small constant to the expected search time. This is the approach used in the algorithms described in this paper.



FIGURE 3 - Pictorial description of steps involved in performing an insertion: the search path and the update[i]→forward[i] pointers in the original list (17 to be inserted), and the list after insertion with updated pointers in grey.

• Use less than you are given. Although an element may contain room for 14 pointers, we don't need to use all 14. We can choose to utilize only L(n) levels. There are a number of ways to implement this, but they all complicate the algorithms and do not noticeably improve performance, so this approach is not recommended.

• Fix the dice. If we generate a random level that is more than one greater than the current maximum level in the list, we simply use one plus the current maximum level in the list as the level of the new node. In practice and intuitively, this change seems to work well. However, it totally destroys our ability to analyze the resulting algorithms, since the level of a node is no longer completely random. Programmers should probably feel free to implement this, purists should avoid it.



Determining MaxLevel

Since we can safely cap levels at L(n), we should choose MaxLevel = L(N) (where N is an upper bound on the number of elements in a skip list). If p = 1/2, using MaxLevel = 16 is appropriate for data structures containing up to 2^16 elements.
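That rule is one line of C (my illustration, not code from the paper); log(N)/log(1/p) is just the log base 1/p of N:

#include <math.h>

/* MaxLevel = L(N) = log base 1/p of N, rounded up */
int maxLevel(double p, double N) {
    return (int)ceil(log(N) / log(1.0 / p));
}

/* maxLevel(0.5, 65536.0) returns 16, matching the 2^16 example above */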



ANALYSIS OF SKIP LIST ALGORITHMS

The time required to execute the Search, Delete and Insert operations is dominated by the time required to search for the appropriate element. For the Insert and Delete operations, there is an additional cost proportional to the level of the node being inserted or deleted. The time required to find an element is proportional to the length of the search path, which is determined by the pattern in which elements with different levels appear as we traverse the list.



randomLevel()
    lvl := 1
    -- random() returns a random value in [0...1)
    while random() < p and lvl < MaxLevel do
        lvl := lvl + 1
    return lvl

FIGURE 5 - Algorithm to calculate a random level

Insert(list, searchKey, newValue)
    local update[1..MaxLevel]
    x := list→header
    for i := list→level downto 1 do
        while x→forward[i]→key < searchKey do
            x := x→forward[i]
        -- x→key < searchKey ≤ x→forward[i]→key
        update[i] := x
    x := x→forward[1]
    if x→key = searchKey then x→value := newValue
    else
        lvl := randomLevel()
        if lvl > list→level then
            for i := list→level + 1 to lvl do
                update[i] := list→header
            list→level := lvl
        x := makeNode(lvl, searchKey, value)
        for i := 1 to lvl do
            x→forward[i] := update[i]→forward[i]
            update[i]→forward[i] := x

Delete(list, searchKey)
    local update[1..MaxLevel]
    x := list→header
    for i := list→level downto 1 do
        while x→forward[i]→key < searchKey do
            x := x→forward[i]
        update[i] := x
    x := x→forward[1]
    if x→key = searchKey then
        for i := 1 to list→level do
            if update[i]→forward[i] ≠ x then break
            update[i]→forward[i] := x→forward[i]
        free(x)
        while list→level > 1 and
              list→header→forward[list→level] = NIL do
            list→level := list→level - 1

FIGURE 4 - Skip List insertion and deletion algorithms

Probabilistic Philosophy

The structure of a skip list is determined only by the number of elements in the skip list and the results of consulting the random number generator. The sequence of operations that produced the current skip list does not matter. We assume an adversarial user does not have access to the levels of nodes; otherwise, he could create situations with worst-case running times by deleting all nodes that were not level 1. The probabilities of poor running times for successive operations on the same data structure are NOT independent; two successive searches for the same element will both take exactly the same time. More will be said about this later.



Analysis of expected search cost

We analyze the search path backwards, travelling up and to the left. Although the levels of nodes in the list are known and fixed when the search is performed, we act as if the level of a node is being determined only when it is observed while backtracking the search path.

At any particular point in the climb, we are at a situation similar to situation a in Figure 6: we are at the ith forward pointer of a node x and we have no knowledge about the levels of nodes to the left of x or about the level of x, other than that the level of x must be at least i. Assume x is not the header (this is equivalent to assuming the list extends infinitely to the left). If the level of x is equal to i, then we are in situation b. If the level of x is greater than i, then we are in situation c. The probability that we are in situation c is p. Each time we are in situation c, we climb up a level. Let C(k) = the expected cost (i.e., length) of a search path that climbs up k levels in an infinite list:

C(0) = 0
C(k) = (1-p) (cost in situation b) + p (cost in situation c)

By substituting and simplifying, we get:

C(k) = (1-p) (1 + C(k)) + p (1 + C(k-1))
C(k) = 1/p + C(k-1)
C(k) = k/p



Our assumption that the list is infinite is a pessimistic assumption. When we bump into the header in our backwards climb, we simply climb up it, without performing any leftward movements. This gives us an upper bound of (L(n)-1)/p on the expected length of the path that climbs from level 1 to level L(n) in a list of n elements.

We use this analysis to go up to level L(n) and use a different analysis technique for the rest of the journey. The number of leftward movements remaining is bounded by the number of elements of level L(n) or higher in the entire list, which has an expected value of 1/p. We also move upwards from level L(n) to the maximum level in the list. The probability that the maximum level of the list is greater than k is equal to 1-(1-p^k)^n, which is at most np^k. We can calculate that the expected maximum level is at most L(n) + 1/(1-p). Putting our results together, we find:

Total expected cost to climb out of a list of n elements
    ≤ L(n)/p + 1/(1-p)

which is O(log n).

FIGURE 6 - Possible situations in backwards traversal of the search path: at the ith forward pointer of node x (situation a), with probability 1-p the level of x equals i and we still need to climb k levels (situation b), and with probability p the level of x is greater than i and we need to climb only k-1 levels (situation c).

Number of comparisons

Our result is an analysis of the "length" of the search path. The number of comparisons required is one plus the length of the search path (a comparison is performed for each position in the search path, the "length" of the search path is the number of hops between positions in the search path).

Probabilistic Analysis

It is also possible to analyze the probability distribution of search costs. The probabilistic analysis is somewhat more complicated (see box). From the probabilistic analysis, we can calculate an upper bound on the probability that the actual cost of a search exceeds the expected cost by more than a specified ratio. Some results of this analysis are shown in Figure 8.

Choosing p

Table 1 gives the relative times and space requirements for different values of p. Decreasing p also increases the variability of running times. If 1/p is a power of 2, it will be easy to generate a random level from a stream of random bits (it requires an average of (log2 1/p)/(1-p) random bits to generate a random level). Since some of the constant overheads are related to L(n) (rather than L(n)/p), choosing p = 1/4 (rather than 1/2) slightly improves the constant factors of the speed of the algorithms as well. I suggest that a value of 1/4 be used for p unless the variability of running times is a primary concern, in which case p should be 1/2.

   p      Normalized search times      Avg. # of pointers per node
          (i.e., normalized L(n)/p)    (i.e., 1/(1-p))
  1/2     1                            2
  1/e     0.94...                      1.58...
  1/4     1                            1.33...
  1/8     1.33...                      1.14...
  1/16    2                            1.07...

TABLE 1 - Relative search speed and space requirements, depending on the value of p.



Sequences of operations

The expected total time for a sequence of operations is equal to the sum of the expected times of each of the operations in the sequence. Thus, the expected time for any sequence of m searches in a data structure that contains n elements is O(m log n). However, the pattern of searches affects the probability distribution of the actual time to perform the entire sequence of operations.

If we search for the same item twice in the same data structure, both searches will take exactly the same amount of time. Thus the variance of the total time will be four times the variance of a single search. If the search times for two elements are independent, the variance of the total time is equal to the sum of the variances of the individual searches. Searching for the same element over and over again maximizes the variance.



ALTERNATIVE DATA STRUCTURES

Balanced trees (e.g., AVL trees [Knu73] [Wir76]) and self-adjusting trees [ST85] can be used for the same problems as skip lists. All three techniques have performance bounds of the same order. A choice among these schemes involves several factors: the difficulty of implementing the algorithms, constant factors, type of bound (amortized, probabilistic or worst-case) and performance on a non-uniform distribution of queries.



Implementation difficulty

For most applications, implementers generally agree skip lists are significantly easier to implement than either balanced tree algorithms or self-adjusting tree algorithms.






Constant factors Constant factors can make a significant difference in the practical application of an algorithm. This is particularly true for sub-linear algorithms. For example, assume that algorithms A and B both require O(log n) time to process a query, but that B is twice as fast as A: in the time algorithm A takes to process a query on a data set of size n, algorithm B can process a query on a data set of size n 2 . There are two important but qualitatively different contributions to the constant factors of an algorithm. First, the inherent complexity of the algorithm places a lower bound on any implementation. Self-adjusting trees are continuously rearranged as searches are performed; this imposes a significant overhead on any implementation of self-adjusting trees. Skip list algorithms seem to have very low inherent constant-factor overheads: the inner loop of the deletion algorithm for skip lists compiles to just six instructions on the 68020. Second, if the algorithm is complex, programmers are deterred from implementing optimizations. For example, balanced tree algorithms are normally described using recursive insert and delete procedures, since that is the most simple and intuitive method of describing the algorithms. A recursive insert or delete procedure incurs a procedure call overhead. By using non-recursive insert and delete procedures, some of this overhead can be eliminated. However, the complexity of nonrecursive algorithms for insertion and deletion in a balanced tree is intimidating and this complexity deters most programmers from eliminating recursion in these routines. Skip list al1 10-1 10-2 10-3 10-4



[FIGURE 8: curves for p = 1/4 and p = 1/2 at n = 256, 4,096 and 65,536; horizontal axis: ratio of actual cost to expected cost (1.0 to 3.0); vertical axis: probability (1 down to 10^-9).]

FIGURE 8 – This graph shows a plot of an upper bound on the probability of a search taking substantially longer than expected. The vertical axis shows the probability that the length of the search path for a search exceeds the average length by more than the ratio on the horizontal axis. For example, for p = 1/2 and n = 4096, the probability that the search path will be more than three times the expected length is less than one in 200 million. This graph was calculated using our probabilistic upper bound.



    Implementation                              Search Time         Insertion Time      Deletion Time
    Skip lists                                  0.051 msec (1.0)    0.065 msec (1.0)    0.059 msec (1.0)
    non-recursive AVL trees                     0.046 msec (0.91)   0.10 msec (1.55)    0.085 msec (1.46)
    recursive 2–3 trees                         0.054 msec (1.05)   0.21 msec (3.2)     0.21 msec (3.65)
    Self-adjusting trees: top-down splaying     0.15 msec (3.0)     0.16 msec (2.5)     0.18 msec (3.1)
    Self-adjusting trees: bottom-up splaying    0.49 msec (9.6)     0.51 msec (7.8)     0.53 msec (9.0)

    Table 2 – Timings of implementations of different algorithms

Skip list algorithms are already non-recursive and they are simple enough that programmers are not deterred from performing optimizations. Table 2 compares the performance of implementations of skip lists and four other techniques. All implementations were optimized for efficiency. The AVL tree algorithms were written by James Macropol of Contel and based on those in [Wir76]. The 2–3 tree algorithms are based on those presented in [AHU83]. Several other existing balanced tree packages were timed and found to be much slower than the results presented in Table 2. The self-adjusting tree algorithms are based on those presented in [ST85]. The times in this table reflect the CPU time on a Sun-3/60 to perform an operation in a data structure containing 2^16 elements with integer keys. The values in parentheses show the results relative to the skip list times. The times for insertion and deletion do not include the time for memory management (e.g., in C programs, calls to malloc and free).

Note that skip lists perform more comparisons than other methods (the skip list algorithms presented here require an average of L(n)/p + 1/(1–p) + 1 comparisons). For tests using real numbers as keys, skip lists were slightly slower than the non-recursive AVL tree algorithms and search in a skip list was slightly slower than search in a 2–3 tree (insertion and deletion using the skip list algorithms was still faster than using the recursive 2–3 tree algorithms). If comparisons are very expensive, it is possible to change the algorithms so that we never compare the search key against the key of a node more than once during a search. For p = 1/2, this produces an upper bound on the expected number of comparisons of 7/2 + (3/2) log2 n. This modification is discussed in [Pug89b].



Type of performance bound

These three classes of algorithm have different kinds of performance bounds. Balanced trees have worst-case time bounds, self-adjusting trees have amortized time bounds and skip lists have probabilistic time bounds. With self-adjusting trees, an individual operation can take O(n) time, but the time bound always holds over a long sequence of operations. For skip lists, any operation or sequence of operations can take longer than expected, although the probability of any operation taking significantly longer than expected is negligible.

In certain real-time applications, we must be assured that an operation will complete within a certain time bound. For such applications, self-adjusting trees may be undesirable, since they can take significantly longer on an individual operation than expected (e.g., an individual search can take O(n) time instead of O(log n) time). For real-time systems, skip lists may be usable if an adequate safety margin is provided: the chance that a search in a skip list containing 1000 elements takes more than 5 times the expected time is about 1 in 10^18.



Non-uniform query distribution

Self-adjusting trees have the property that they adjust to non-uniform query distributions. Since skip lists are faster than self-adjusting trees by a significant constant factor when a uniform query distribution is encountered, self-adjusting trees are faster than skip lists only for highly skewed distributions. We could attempt to devise self-adjusting skip lists. However, there seems little practical motivation to tamper with the simplicity and fast performance of skip lists; in an application where highly skewed distributions are expected, either self-adjusting trees or a skip list augmented by a cache may be preferable [Pug90].



ADDITIONAL WORK ON SKIP LISTS

I have described a set of algorithms that allow multiple processors to concurrently update a skip list in shared memory [Pug89a]. These algorithms are much simpler than concurrent balanced tree algorithms. They allow an unlimited number of readers and n busy writers in a skip list of n elements with very little lock contention.

Using skip lists, it is easy to do most (all?) the sorts of operations you might wish to do with a balanced tree, such as use search fingers, merge skip lists and allow ranking operations (e.g., determine the kth element of a skip list) [Pug89b].

Tom Papadakis, Ian Munro and Patricio Poblette [PMP90] have done an exact analysis of the expected search time in a skip list. The upper bound described in this paper is close to their exact bound; the techniques they needed to use to derive an exact analysis are very complicated and sophisticated. Their exact analysis shows that for p = 1/2 and p = 1/4, the upper bound given in this paper on the expected cost of a search is not more than 2 comparisons more than the exact expected cost.

I have adapted the idea of probabilistic balancing to some other problems arising both in data structures and in incremental computation [PT89]. We can generate the level of a node based on the result of applying a hash function to the element (as opposed to using a random number generator). This results in a scheme where for any set S, there is a unique data structure that represents S and with high probability the data structure is approximately balanced. If we combine this idea with an applicative (i.e., persistent) probabilistically balanced data structure and a scheme such as hashed-consing [All78] which allows constant-time structural equality tests of applicative data structures, we get a number of interesting properties, such as constant-time equality tests for the representations of sequences. This scheme also has a number of applications for incremental computation. Since skip lists are somewhat awkward to make applicative, a probabilistically balanced tree scheme is used.
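The hash-based level generation just mentioned can be sketched in a few lines of C; the MAXLEVEL cap and the trailing-bits trick are my illustrative assumptions, not code from this paper:

    #define MAXLEVEL 15

    /* derive a node's level deterministically from a hash of the element:
       counting trailing 1-bits of the hash gives the same geometric
       distribution (p = 1/2) as coin flips, but the same element always
       receives the same level */
    int levelFromHash(unsigned int hash) {
        int level = 1;
        while ((hash & 1) && level < MAXLEVEL) {
            hash >>= 1;
            level++;
        }
        return level;
    }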



RELATED WORK

James Driscoll pointed out that R. Sprugnoli suggested a method of randomly balancing search trees in 1981 [Spr81]. With Sprugnoli's approach, the state of the data structure is not independent of the sequence of operations which built it. This makes it much harder or impossible to formally analyze his algorithms. Sprugnoli gives empirical evidence that his algorithm has good expected performance, but no theoretical results.

A randomized data structure for ordered sets is described in [BLLSS86]. However, a search using that data structure requires O(n^1/2) expected time.

Cecilia Aragon and Raimund Seidel describe a probabilistically balanced search tree scheme [AC89]. They discuss how to adapt their data structure to non-uniform query distributions.



SOURCE CODE AVAILABILITY

Skip list source code libraries for both C and Pascal are available for anonymous ftp from mimsy.umd.edu.



CONCLUSIONS

From a theoretical point of view, there is no need for skip lists. Balanced trees can do everything that can be done with skip lists and have good worst-case time bounds (unlike skip lists). However, implementing balanced trees is an exacting task and as a result balanced tree algorithms are rarely implemented except as part of a programming assignment in a data structures class.

Skip lists are a simple data structure that can be used in place of balanced trees for most applications. Skip list algorithms are very easy to implement, extend and modify. Skip lists are about as fast as highly optimized balanced tree algorithms and are substantially faster than casually implemented balanced tree algorithms.



ACKNOWLEDGEMENTS

Thanks to the referees for their helpful comments. Special thanks to all those people who supplied enthusiasm and encouragement during the years in which I struggled to get this work published, especially Alan Demers, Tim Teitelbaum and Doug McIlroy. This work was partially supported by an AT&T Bell Labs Fellowship and by NSF grant CCR-8908900.



REFERENCES

[AC89]     Aragon, Cecilia and Raimund Seidel, "Randomized Search Trees," Proceedings of the 30th Ann. IEEE Symp. on Foundations of Computer Science, pp. 540–545, October 1989.
[AHU83]    Aho, A., Hopcroft, J. and Ullman, J., Data Structures and Algorithms, Addison-Wesley Publishing Company, 1983.
[All78]    Allen, John, Anatomy of LISP, McGraw Hill Book Company, NY, 1978.
[BLLSS86]  Bentley, J., F. T. Leighton, M. F. Lepley, D. Stanat and J. M. Steele, A Randomized Data Structure for Ordered Sets, MIT/LCS Technical Memo 297, May 1986.
[Knu73]    Knuth, D., "Sorting and Searching," The Art of Computer Programming, Vol. 3, Addison-Wesley Publishing Company, 1973.
[PMP90]    Papadakis, Thomas, Ian Munro and Patricio Poblette, Exact Analysis of Expected Search Cost in Skip Lists, Tech Report # ????, Dept. of Computer Science, Univ. of Waterloo, January 1990.
[PT89]     Pugh, W. and T. Teitelbaum, "Incremental Computation via Function Caching," Proc. of the Sixteenth Conference on the Principles of Programming Languages, 1989.
[Pug89a]   Pugh, W., Concurrent Maintenance of Skip Lists, Tech Report TR-CS-2222, Dept. of Computer Science, University of Maryland, College Park, 1989.
[Pug89b]   Pugh, W., Whatever You Might Want to Do Using Balanced Trees, You Can Do It Faster and More Simply Using Skip Lists, Tech Report CS-TR-2286, Dept. of Computer Science, University of Maryland, College Park, July 1989.
[Pug90]    Pugh, W., "Slow Optimally Balanced Search Strategies vs. Cached Fast Uniformly Balanced Search Strategies," to appear in Information Processing Letters.
[Spr81]    Sprugnoli, R., "Randomly Balanced Binary Trees," Calcolo, V17 (1981), pp. 99–117.
[ST85]     Sleator, D. and R. Tarjan, "Self-Adjusting Binary Search Trees," Journal of the ACM, Vol. 32, No. 3, July 1985, pp. 652–666.
[Wir76]    Wirth, N., Algorithms + Data Structures = Programs, Prentice-Hall, 1976.



PROBABILISTIC ANALYSIS



Probabilistic analysis of search cost



In addition to analyzing the expected performance of skip lists, we can also analyze the probabilistic performance of skip lists. This will allow us to calculate the probability that an operation takes longer than a specified time. This analysis is based on the same ideas as our analysis of the expected cost, so that analysis should be understood first.

A random variable has a fixed but unpredictable value and a predictable probability distribution and average. If X is a random variable, Prob{ X = t } denotes the probability that X equals t and Prob{ X > t } denotes the probability that X is greater than t. For example, if X is the number obtained by throwing an unbiased die, Prob{ X > 3 } = 1/2.

It is often preferable to find simple upper bounds on values whose exact value is difficult to calculate. To discuss upper bounds on random variables, we need to define a partial ordering and equality on the probability distributions of non-negative random variables.

Definitions (=prob and ≤prob). Let X and Y be non-negative independent random variables (typically, X and Y would denote the time to execute algorithms AX and AY). We define X ≤prob Y to be true if and only if for any value t, the probability that X exceeds t is less than the probability that Y exceeds t. More formally:



    X =prob Y  iff  ∀ t, Prob{ X > t } = Prob{ Y > t }   and
    X ≤prob Y  iff  ∀ t, Prob{ X > t } ≤ Prob{ Y > t }. ■

For example, the graph in Figure 7 shows the probability distribution of three random variables X, Y and Z. Since the probability distribution curve for X is completely under the curves for Y and Z, X ≤prob Y and X ≤prob Z. Since the probability curves for Y and Z intersect, neither Y ≤prob Z nor Z ≤prob Y. Since the expected value of a random variable X is simply the area under the curve Prob{ X > t }, if X ≤prob Y then the average of X is less than or equal to the average of Y.

We make use of two probability distributions:

Definition (binomial distributions – B(t, p)). Let t be a non-negative integer and p be a probability. The term B(t, p) denotes a random variable equal to the number of successes seen in a series of t independent random trials where the probability of a success in a trial is p. The average and variance of B(t, p) are tp and tp(1 – p) respectively. ■

Definition (negative binomial distributions – NB(s, p)). Let s be a non-negative integer and p be a probability. The term NB(s, p) denotes a random variable equal to the number of failures seen before the sth success in a series of random independent trials where the probability of a success in a trial is p. The average and variance of NB(s, p) are s(1–p)/p and s(1–p)/p^2 respectively. ■

[FIGURE 7 – Plots of three probability distributions: Prob{ X > t }, Prob{ Y > t } and Prob{ Z > t } against t.]

The number of leftward movements we need to make before we move up a level (in an infinite list) has a negative binomial distribution: it is the number of failures (situation b's) we see before we see the first success (situation c) in a series of independent random trials, where the probability of success is p. Using the probabilistic notation introduced above:

    Cost to climb one level in an infinite list
        =prob 1 + NB(1, p).

We can sum the costs of climbing each level to get the total cost to climb up to level L(n):

    Cost to climb to level L(n) in an infinite list
        =prob (L(n) – 1) + NB(L(n) – 1, p).

Our assumption that the list is infinite is a pessimistic assumption:

    Cost to climb to level L(n) in a list of n elements
        ≤prob (L(n) – 1) + NB(L(n) – 1, p).

Once we have climbed to level L(n), the number of leftward movements is bounded by the number of elements of level L(n) or greater in a list of n elements. The number of elements of level L(n) or greater in a list of n elements is a random variable of the form B(n, 1/np).

Let M be a random variable corresponding to the maximum level in a list of n elements. The probability that the level of a node is greater than k is p^k, so Prob{ M > k } = 1 – (1 – p^k)^n < np^k. Since np^k = p^(k – L(n)) and Prob{ NB(1, 1 – p) + 1 > i } = p^i, we get a probabilistic upper bound of M ≤prob L(n) + NB(1, 1 – p) + 1. Note that the average of L(n) + NB(1, 1 – p) + 1 is L(n) + 1/(1 – p).

This gives a probabilistic upper bound on the cost once we have reached level L(n) of B(n, 1/np) + (L(n) + NB(1, 1 – p) + 1) – L(n). Combining our results to get a probabilistic upper bound on the total length of the search path (i.e., the cost of the entire search):

    total cost to climb out of a list of n elements
        ≤prob (L(n) – 1) + NB(L(n) – 1, p) + B(n, 1/np) + NB(1, 1 – p) + 1.

The expected value of our upper bound is equal to

    (L(n) – 1) + (L(n) – 1)(1 – p)/p + 1/p + p/(1 – p) + 1 = L(n)/p + 1/(1 – p),

which is the same as our previously calculated upper bound on the expected cost of a search. The variance of our upper bound is

    (L(n) – 1)(1 – p)/p^2 + (1 – 1/np)/p + p/(1 – p)^2
        < (1 – p)L(n)/p^2 + p/(1 – p)^2 + (2p – 1)/p^2.

Figure 8 shows a plot of an upper bound on the probability of an actual search taking substantially longer than average, based on our probabilistic upper bound.
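As a quick numeric check of the expected-cost bound just derived, the following short C program (my illustration; it is not part of the paper) evaluates L(n)/p + 1/(1 – p) for the parameter values used in Figure 8:

    #include <stdio.h>
    #include <math.h>

    static double costBound(double n, double p) {
        double L = log(n) / log(1.0 / p);   /* L(n) = log base 1/p of n */
        return L / p + 1.0 / (1.0 - p);
    }

    int main(void) {
        double p[] = { 0.5, 0.25 };
        double n[] = { 256.0, 4096.0, 65536.0 };
        int i, j;

        for (i = 0; i < 2; i++)
            for (j = 0; j < 3; j++)
                printf("p = %.2f, n = %6.0f: expected cost <= %.2f\n",
                    p[i], n[j], costBound(n[j], p[i]));
        return 0;
    }

For example, at p = 1/2 and n = 65,536 the bound is 16/0.5 + 2 = 34 comparisons; at p = 1/4 it is 8/0.25 + 1.33 ≈ 33.3, illustrating Table 1's observation that p = 1/4 costs roughly the same search time with far fewer pointers.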



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_skl.txt



/* skip list */
#include <stdio.h>
#include <stdlib.h>

/* define data-type and compare operators here */
typedef int T;                  /* type of item to be stored */
#define compLT(a,b) (a < b)
#define compEQ(a,b) (a == b)

/* levels range from (0 .. MAXLEVEL) */
#define MAXLEVEL 15

typedef struct Node_ {
    T data;                     /* user's data */
    struct Node_ *forward[1];   /* skip list forward pointer */
} Node;

typedef struct {
    Node *hdr;                  /* list Header */
    int listLevel;              /* current level of list */
} SkipList;

SkipList list;                  /* skip list information */

#define NIL list.hdr

Node *insertNode(T data) {
    int i, newLevel;
    Node *update[MAXLEVEL+1];
    Node *x;

   /***********************************************
    *  allocate node for data and insert in list  *
    ***********************************************/

    /* find where data belongs */
    x = list.hdr;
    for (i = list.listLevel; i >= 0; i--) {
        while (x->forward[i] != NIL
          && compLT(x->forward[i]->data, data))
            x = x->forward[i];
        update[i] = x;
    }
    x = x->forward[0];
    if (x != NIL && compEQ(x->data, data)) return(x);

    /* determine level of new node; each extra level is kept
       with probability 1/2 */
    for (newLevel = 0;
      rand() < RAND_MAX/2 && newLevel < MAXLEVEL;
      newLevel++);

    if (newLevel > list.listLevel) {
        for (i = list.listLevel + 1; i <= newLevel; i++)
            update[i] = NIL;
        list.listLevel = newLevel;
    }

    /* make new node */
    if ((x = malloc(sizeof(Node) + newLevel*sizeof(Node *))) == 0) {
        printf ("insufficient memory (insertNode)\n");
        exit(1);
    }
    x->data = data;

    /* update forward links */
    for (i = 0; i <= newLevel; i++) {
        x->forward[i] = update[i]->forward[i];
        update[i]->forward[i] = x;
    }
    return(x);
}

void deleteNode(T data) {
    int i;
    Node *update[MAXLEVEL+1], *x;

   /*******************************************
    *  delete node containing data from list  *
    *******************************************/

    /* find where data belongs */
    x = list.hdr;
    for (i = list.listLevel; i >= 0; i--) {
        while (x->forward[i] != NIL
          && compLT(x->forward[i]->data, data))
            x = x->forward[i];
        update[i] = x;
    }
    x = x->forward[0];
    if (x == NIL || !compEQ(x->data, data)) return;

    /* adjust forward pointers */
    for (i = 0; i <= list.listLevel; i++) {
        if (update[i]->forward[i] != x) break;
        update[i]->forward[i] = x->forward[i];
    }

    free (x);

    /* adjust header level */
    while ((list.listLevel > 0)
      && (list.hdr->forward[list.listLevel] == NIL))
        list.listLevel--;
}

Node *findNode(T data) {
    int i;
    Node *x = list.hdr;

   /*******************************
    *  find node containing data  *
    *******************************/

    for (i = list.listLevel; i >= 0; i--) {
        while (x->forward[i] != NIL
          && compLT(x->forward[i]->data, data))
            x = x->forward[i];
    }
    x = x->forward[0];
    if (x != NIL && compEQ(x->data, data)) return (x);
    return(0);
}

void initList() {
    int i;

   /**************************
    *  initialize skip list  *
    **************************/

    if ((list.hdr = malloc(sizeof(Node) + MAXLEVEL*sizeof(Node *))) == 0) {
        printf ("insufficient memory (initList)\n");
        exit(1);
    }
    for (i = 0; i <= MAXLEVEL; i++)
        list.hdr->forward[i] = NIL;
    list.listLevel = 0;
}

int main(int argc, char **argv) {
    int i, *a, maxnum, random;

    /* command-line:
     *
     *   skl maxnum [random]
     *
     *   skl 2000
     *       process 2000 sequential records
     *   skl 4000 r
     *       process 4000 random records
     */

    /* validate command line */
    if (argc < 2) {
        printf ("%s maxnum [random]\n", argv[0]);
        exit(1);
    }

    maxnum = atoi(argv[1]);
    random = argc > 2;

    initList();

    if ((a = malloc(maxnum * sizeof(*a))) == 0) {
        fprintf (stderr, "insufficient memory (a)\n");
        exit(1);
    }

    if (random) {
        /* fill "a" with unique random numbers */
        for (i = 0; i < maxnum; i++) a[i] = rand();
        printf ("ran, %d items\n", maxnum);
    } else {
        for (i = 0; i < maxnum; i++) a[i] = i;
        printf ("seq, %d items\n", maxnum);
    }

    for (i = 0; i < maxnum; i++) {
        insertNode(a[i]);
    }
    for (i = maxnum-1; i >= 0; i--) {
        findNode(a[i]);
    }
    for (i = maxnum-1; i >= 0; i--) {
        deleteNode(a[i]);
    }
    return 0;
}






Comparison

We have seen several ways to construct dictionaries: hash tables, unbalanced binary search trees, red-black trees, and skip lists. There are several factors that influence the choice of an algorithm:

● Sorted output. If sorted output is required, then hash tables are not a viable alternative. Entries are stored in the table based on their hashed value, with no other ordering. For binary trees, the story is different. An in-order tree walk will produce a sorted list. For example:

      void WalkTree(Node *P) {
          if (P == NIL) return;
          WalkTree(P->Left);
          /* examine P->Data here */
          WalkTree(P->Right);
      }
      WalkTree(Root);

  To examine skip list nodes in order, simply chain through the level-0 pointers. For example:

      Node *P = List.Hdr->Forward[0];
      while (P != NIL) {
          /* examine P->Data here */
          P = P->Forward[0];
      }







Space. The amount of memory required to store a value should be minimized. This is especially true if many small nodes are to be allocated. ❍











  ❍ For hash tables, only one forward pointer per node is required. In addition, the hash table itself must be allocated.

  ❍ For red-black trees, each node has a left, right, and parent pointer. In addition, the color of each node must be recorded. Although this requires only one bit, more space may be allocated to ensure that the size of the structure is properly aligned. Therefore each node in a red-black tree requires enough space for 3-4 pointers.

  ❍ For skip lists, each node has a level-0 forward pointer. The probability of having a level-1 pointer is 1/2. The probability of having a level-2 pointer is 1/4. In general, the number of forward pointers per node is n = 1 + 1/2 + 1/4 + ... = 2.







● Time. The algorithm should be efficient. This is especially true if a large dataset is expected. Table 3-2 compares the search time for each algorithm. Note that worst-case behavior for hash tables and skip lists is extremely unlikely. Actual timing tests are described below.

● Simplicity. If the algorithm is short and easy to understand, fewer mistakes may be made. This not only makes your life easy, but the maintenance programmer entrusted with the task of making repairs will appreciate any efforts you make in this area. The number of statements required for each algorithm is listed in Table 3-2.

    method            statements   average time   worst-case time
    hash table            26          O(1)            O(n)
    unbalanced tree       41          O(lg n)         O(n)
    red-black tree       120          O(lg n)         O(lg n)
    skip list             55          O(lg n)         O(n)

    Table 3-2: Comparison of Dictionaries

Average time for insert, search, and delete operations on a database of 65,536 (2^16) randomly input items may be found in Table 3-3. For this test the hash table size was 10,009 and 16 index levels were allowed for the skip list. Although there is some variation in the timings for the four methods, they are close enough so that other considerations should come into play when selecting an algorithm.



    method            insert   search   delete
    hash table           18        8       10
    unbalanced tree      37       17       26
    red-black tree       40       16       37
    skip list            48       31       35

    Table 3-3: Average Time (µs), 65536 Items, Random Input

Table 3-4 shows the average search time for two sets of data: a random set, where all values are unique, and an ordered set, where values are in ascending order. Ordered input creates a worst-case scenario for unbalanced tree algorithms, as the tree ends up being a simple linked list. The times shown are for a single search operation. If we were to search for all items in a database of 65,536 values, a red-black tree algorithm would take 0.6 seconds, while an unbalanced tree algorithm would take 1 hour.



                     count   hash table   unbalanced tree   red-black tree   skip list
    random input        16        4               3                 2             5
                       256        3               4                 4             9
                     4,096        3               7                 6            12
                    65,536        8              17                16            31
    ordered input       16        3               4                 2             4
                       256        3              47                 4             7
                     4,096        3           1,033                 6            11
                    65,536        7          55,019                 9            15

    Table 3-4: Average Search Time (µs)









Very Large Files

The previous algorithms have assumed that all data reside in memory. However, there may be times when the dataset is too large, and alternative methods are required. In this section, we will examine techniques for sorting (external sort) and implementing dictionaries (B-trees) for very large files.









External Sorting

One method for sorting a file is to load the file into memory, sort the data in memory, then write the results. When the file cannot be loaded into memory due to resource limitations, an external sort is applicable. We will implement an external sort using replacement selection to establish initial runs, followed by a polyphase merge sort to merge the runs into one sorted file. I highly recommend you consult Knuth [1998], as many details have been omitted.



Theory

For clarity, I'll assume that data is on one or more reels of magnetic tape. Figure 4-1 illustrates a 3-way polyphase merge. Initially, in phase A, all data is on tapes T1 and T2. Assume that the beginning of each tape is at the bottom of the frame. There are two sequential runs of data on T1: 4-8, and 6-7. Tape T2 has one run: 5-9. At phase B, we've merged the first run from tapes T1 (4-8) and T2 (5-9) into a longer run on tape T3 (4-5-8-9). Phase C simply renames the tapes, so that we may repeat the merge. In phase D we repeat the merge, with the final output on tape T3.



[Figure 4-1: Merge Sort. A 3-way polyphase merge: phase A has runs 4-8 and 6-7 on T1 and run 5-9 on T2; phase B merges the first runs onto T3 as 4-5-8-9; phase C renames the tapes; phase D leaves the final sorted output on T3.]



Several interesting details have been omitted from the previous illustration. For example, how were the initial runs created? And, did you notice that they merged perfectly, with no extra runs on any tapes? Before I explain the method used for constructing initial runs, let me digress for a bit.

In 1202, Leonardo Fibonacci presented the following exercise in his Liber Abbaci (Book of the Abacus): "How many pairs of rabbits can be produced from a single pair in a year's time?" We may assume that each pair produces a new pair of offspring every month, each pair becomes fertile at the age of one month, and that rabbits never die. After one month, there will be 2 pairs of rabbits; after two months there will be 3; the following month the original pair and the pair born during the first month will both usher in a new pair, and there will be 5 in all; and so on. This series, where each number is the sum of the two preceding numbers, is known as the Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ... . Curiously, the Fibonacci series has found widespread application to everything from the arrangement of flowers on plants to studying the efficiency of Euclid's algorithm. There's even a Fibonacci Quarterly journal. And, as you might suspect, the Fibonacci series has something to do with establishing initial runs for external sorts.

Recall that we initially had one run on tape T2, and 2 runs on tape T1. Note that the numbers {1,2} are two sequential numbers in the Fibonacci series. After our first merge, we had one run on T1 and one run on T2. Note that the numbers {1,1} are two sequential numbers in the Fibonacci series, only one notch down. We could predict, in fact, that if we had 13 runs on T2, and 21 runs on T1 {13,21}, we would be left with 8 runs on T1 and 13 runs on T3 {8,13} after one pass. Successive passes would result in run counts of {5,8}, {3,5}, {2,3}, {1,2}, {1,1}, and {0,1}, for a total of 7 passes (a short program after the list below checks this count). This arrangement is ideal, and will result in the minimum number of passes. Should data actually be on tape, this is a big savings, as tapes must be mounted and rewound for each pass. For more than 2 tapes, higher-order Fibonacci numbers are used.

Initially, all the data is on one tape. The tape is read, and runs are distributed to other tapes in the system. After the initial runs are created, they are merged as described above. One method we could use to create initial runs is to read a batch of records into memory, sort the records, and write them out. This process would continue until we had exhausted the input tape. An alternative algorithm, replacement selection, allows for longer runs. A buffer is allocated in memory to act as a holding place for several records. Initially, the buffer is filled. Then, the following steps are repeated until the input is exhausted:

● Select the record with the smallest key that is >= the key of the last record written.
● If all keys are smaller than the key of the last record written, then we have reached the end of a run. Select the record with the smallest key for the first record of the next run.
● Write the selected record.
● Replace the selected record with a new record from input.






Figure 4-2 illustrates replacement selection for a small file. The beginning of the file is to the right of each frame. To keep things simple, I've allocated a 2-record buffer. Typically, such a buffer would hold thousands of records. We load the buffer in step B, and write the record with the smallest key (6) in step C. This is replaced with the next record (key 8). We select the smallest key >= 6 in step D. This is key 7. After writing key 7, we replace it with key 4. This process repeats until step F, where our last key written was 8, and all keys are less than 8. At this point, we terminate the run, and start another.

    Step   Input          Buffer   Output
    A      5-3-4-8-6-7
    B      5-3-4-8        6-7
    C      5-3-4          8-7      6
    D      5-3            8-4      7-6
    E      5              3-4      8-7-6
    F                     5-4      3 | 8-7-6
    G                     5        4-3 | 8-7-6
    H                              5-4-3 | 8-7-6

    Figure 4-2: Replacement Selection

This strategy simply utilizes an intermediate buffer to hold values until the appropriate time for output. Using random numbers as input, the average length of a run is twice the length of the buffer. However, if the data is somewhat ordered, runs can be extremely long. Thus, this method is more effective than doing partial sorts.

When selecting the next output record, we need to find the smallest key >= the last key written. One way to do this is to scan the entire list, searching for the appropriate key. However, when the buffer holds thousands of records, execution time becomes prohibitive. An alternative method is to use a binary tree structure, so that we only compare lg n items.



Implementation An ANSI-C implementation of an external sort is included. Function makeRuns calls readRec to read the next record. Function readRec employs the replacement selection algorithm (utilizing a binary tree) to fetch the next record, and makeRuns distributes the records in a Fibonacci distribution. If the number of runs is not a perfect Fibonacci number, dummy runs are simulated at the beginning of each file. Function mergeSort is then called to do a polyphase merge sort on the runs.






http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_ext.txt



/* external sort */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/****************************
 * implementation dependent *
 ****************************/

/* template for workfiles (8.3 format) */
#define FNAME "_sort%03d.dat"
#define LNAME 13

/* comparison operators */
#define compLT(x,y) (x < y)
#define compGT(x,y) (x > y)

/* define the record to be sorted here */
#define LRECL 100
typedef int keyType;
typedef struct recTypeTag {
    keyType key;                        /* sort key for record */
#if LRECL
    char data[LRECL-sizeof(keyType)];   /* other fields */
#endif
} recType;

/******************************
 * implementation independent *
 ******************************/

typedef enum {false, true} bool;

typedef struct tmpFileTag {
    FILE *fp;                   /* file pointer */
    char name[LNAME];           /* filename */
    recType rec;                /* last record read */
    int dummy;                  /* number of dummy runs */
    bool eof;                   /* end-of-file flag */
    bool eor;                   /* end-of-run flag */
    bool valid;                 /* true if rec is valid */
    int fib;                    /* ideal fibonacci number */
} tmpFileType;

static tmpFileType **file;      /* array of file info for tmp files */
static int nTmpFiles;           /* number of tmp files */
static char *ifName;            /* input filename */
static char *ofName;            /* output filename */

static int level;               /* level of runs */
static int nNodes;              /* number of nodes for selection tree */

void deleteTmpFiles(void) {
    int i;

    /* delete merge files and free resources */
    if (file) {
        for (i = 0; i < nTmpFiles; i++) {
            if (file[i]) {
                if (file[i]->fp) fclose(file[i]->fp);
                if (*file[i]->name) remove(file[i]->name);
            }
        }
        free (file[0]);     /* file[i] = fileInfo + i: one contiguous block */
        free (file);
    }
}

void termTmpFiles(int rc) {

    /* cleanup files */
    remove(ofName);
    if (rc == 0) {
        int fileT;

        /* file[T] contains results */
        fileT = nTmpFiles - 1;
        fclose(file[fileT]->fp); file[fileT]->fp = NULL;
        if (rename(file[fileT]->name, ofName)) {
            perror("io1");
            deleteTmpFiles();
            exit(1);
        }
        *file[fileT]->name = 0;
    }
    deleteTmpFiles();
}

void cleanExit(int rc) {

    /* cleanup tmp files and exit */
    termTmpFiles(rc);
    exit(rc);
}

void *safeMalloc(size_t size) {
    void *p;

    /* safely allocate memory and initialize to zero */
    if ((p = calloc(1, size)) == NULL) {
        printf("error: malloc failed, size = %d\n", (int)size);
        cleanExit(1);
    }
    return p;
}

void initTmpFiles(void) {
    int i;
    tmpFileType *fileInfo;

    /* initialize merge files */
    if (nTmpFiles < 3) nTmpFiles = 3;
    file = safeMalloc(nTmpFiles * sizeof(tmpFileType*));
    fileInfo = safeMalloc(nTmpFiles * sizeof(tmpFileType));
    for (i = 0; i < nTmpFiles; i++) {
        file[i] = fileInfo + i;
        sprintf(file[i]->name, FNAME, i);
        if ((file[i]->fp = fopen(file[i]->name, "w+b")) == NULL) {
            perror("io2");
            cleanExit(1);
        }
    }
}

recType *readRec(void) {

    typedef struct iNodeTag {   /* internal node */
        struct iNodeTag *parent;/* parent of internal node */
        struct eNodeTag *loser; /* external loser */
    } iNodeType;

    typedef struct eNodeTag {   /* external node */
        struct iNodeTag *parent;/* parent of external node */
        recType rec;            /* input record */
        int run;                /* run number */
        bool valid;             /* input record is valid */
    } eNodeType;

    typedef struct nodeTag {
        iNodeType i;            /* internal node */
        eNodeType e;            /* external node */
    } nodeType;

    static nodeType *node;      /* array of selection tree nodes */
    static eNodeType *win;      /* new winner */
    static FILE *ifp;           /* input file */
    static bool eof;            /* true if end-of-file, input */
    static int maxRun;          /* maximum run number */
    static int curRun;          /* current run number */
    iNodeType *p;               /* pointer to internal nodes */
    static bool lastKeyValid;   /* true if lastKey is valid */
    static keyType lastKey;     /* last key written */

    /* read next record using replacement selection */

    /* check for first call */
    if (node == NULL) {
        int i;

        if (nNodes < 2) nNodes = 2;
        node = safeMalloc(nNodes * sizeof(nodeType));
        for (i = 0; i < nNodes; i++) {
            node[i].i.loser = &node[i].e;
            node[i].i.parent = &node[i/2].i;
            node[i].e.parent = &node[(nNodes + i)/2].i;
            node[i].e.run = 0;
            node[i].e.valid = false;
        }
        win = &node[0].e;
        lastKeyValid = false;

        if ((ifp = fopen(ifName, "rb")) == NULL) {
            printf("error: file %s, unable to open\n", ifName);
            cleanExit(1);
        }
    }

    while (1) {

        /* replace previous winner with new record */
        if (!eof) {
            if (fread(&win->rec, sizeof(recType), 1, ifp) == 1) {
                if ((!lastKeyValid || compLT(win->rec.key, lastKey))
                && (++win->run > maxRun))
                    maxRun = win->run;
                win->valid = true;
            } else if (feof(ifp)) {
                fclose(ifp);
                eof = true;
                win->valid = false;
                win->run = maxRun + 1;
            } else {
                perror("io4");
                cleanExit(1);
            }
        } else {
            win->valid = false;
            win->run = maxRun + 1;
        }

        /* adjust loser and winner pointers */
        p = win->parent;
        do {
            bool swap;
            swap = false;
            if (p->loser->run < win->run) {
                swap = true;
            } else if (p->loser->run == win->run) {
                if (p->loser->valid && win->valid) {
                    if (compLT(p->loser->rec.key, win->rec.key))
                        swap = true;
                } else {
                    swap = true;
                }
            }
            if (swap) {
                /* p should be winner */
                eNodeType *t;
                t = p->loser;
                p->loser = win;
                win = t;
            }
            p = p->parent;
        } while (p != &node[0].i);

        /* end of run? */
        if (win->run != curRun) {
            /* win->run = curRun + 1 */
            if (win->run > maxRun) {
                /* end of output */
                free(node);
                return NULL;
            }
            curRun = win->run;
        }

        /* output top of tree */
        if (win->run) {
            lastKey = win->rec.key;
            lastKeyValid = true;
            return &win->rec;
        }
    }
}

void makeRuns(void) {
    recType *win;       /* winner */
    int fileT;          /* last file */
    int fileP;          /* next to last file */
    int j;              /* selects file[j] */

    /* Make initial runs using replacement selection.
     * Runs are written using a Fibonacci distribution.
     */

    /* initialize file structures */
    fileT = nTmpFiles - 1;
    fileP = fileT - 1;
    for (j = 0; j < fileT; j++) {
        file[j]->fib = 1;
        file[j]->dummy = 1;
    }
    file[fileT]->fib = 0;
    file[fileT]->dummy = 0;

    level = 1;
    j = 0;

    win = readRec();
    while (win) {
        bool anyrun;

        anyrun = false;
        for (j = 0; win && j <= fileP; j++) {
            bool run;

            run = false;
            if (file[j]->valid) {
                if (!compLT(win->key, file[j]->rec.key)) {
                    /* append to an existing run */
                    run = true;
                } else if (file[j]->dummy) {
                    /* start a new run */
                    file[j]->dummy--;
                    run = true;
                }
            } else {
                /* first run in file */
                file[j]->dummy--;
                run = true;
            }

            if (run) {
                anyrun = true;

                /* flush run */
                while(1) {
                    if (fwrite(win, sizeof(recType), 1, file[j]->fp) != 1) {
                        perror("io3");
                        cleanExit(1);
                    }
                    file[j]->rec.key = win->key;
                    file[j]->valid = true;
                    if ((win = readRec()) == NULL) break;
                    if (compLT(win->key, file[j]->rec.key)) break;
                }
            }
        }

        /* if no room for runs, up a level */
        if (!anyrun) {
            int t;
            level++;
            t = file[0]->fib;
            for (j = 0; j < fileT; j++) {
                file[j]->dummy = t + file[j+1]->fib - file[j]->fib;
                file[j]->fib = t + file[j+1]->fib;
            }
        }
    }
}

void rewindFile(int j) {

    /* rewind file[j] and read in first record */
    file[j]->eor = false;
    file[j]->eof = false;
    rewind(file[j]->fp);
    if (fread(&file[j]->rec, sizeof(recType), 1, file[j]->fp) != 1) {
        if (feof(file[j]->fp)) {
            file[j]->eor = true;
            file[j]->eof = true;
        } else {
            perror("io5");
            cleanExit(1);
        }
    }
}

void mergeSort(void) {
    int fileT;
    int fileP;
    int j;
    tmpFileType *tfile;

    /* polyphase merge sort */

    fileT = nTmpFiles - 1;
    fileP = fileT - 1;

    /* prime the files */
    for (j = 0; j < fileT; j++) {
        rewindFile(j);
    }

    /* each pass through loop merges one run */
    while (level) {
        while(1) {
            bool allDummies;
            bool anyRuns;

            /* scan for runs */
            allDummies = true;
            anyRuns = false;
            for (j = 0; j <= fileP; j++) {
                if (!file[j]->dummy) {
                    allDummies = false;
                    if (!file[j]->eof) anyRuns = true;
                }
            }

            if (anyRuns) {
                int k;
                keyType lastKey;

                /* merge 1 run file[0]..file[P] --> file[T] */

                while(1) {
                    /* each pass thru loop writes 1 record to file[fileT] */

                    /* find smallest key */
                    k = -1;
                    for (j = 0; j <= fileP; j++) {
                        if (file[j]->eor) continue;
                        if (file[j]->dummy) continue;
                        if (k < 0 ||
                        (k != j && compGT(file[k]->rec.key, file[j]->rec.key)))
                            k = j;
                    }
                    if (k < 0) break;

                    /* write record[k] to file[fileT] */
                    if (fwrite(&file[k]->rec, sizeof(recType), 1,
                            file[fileT]->fp) != 1) {
                        perror("io6");
                        cleanExit(1);
                    }

                    /* replace record[k] */
                    lastKey = file[k]->rec.key;
                    if (fread(&file[k]->rec, sizeof(recType), 1,
                            file[k]->fp) == 1) {
                        /* check for end of run on file[s] */
                        if (compLT(file[k]->rec.key, lastKey))
                            file[k]->eor = true;
                    } else if (feof(file[k]->fp)) {
                        file[k]->eof = true;
                        file[k]->eor = true;
                    } else {
                        perror("io7");
                        cleanExit(1);
                    }
                }

                /* fixup dummies */
                for (j = 0; j <= fileP; j++) {
                    if (file[j]->dummy) file[j]->dummy--;
                    if (!file[j]->eof) file[j]->eor = false;
                }

            } else if (allDummies) {
                for (j = 0; j <= fileP; j++)
                    file[j]->dummy--;
                file[fileT]->dummy++;
            }

            /* end of run */
            if (file[fileP]->eof && !file[fileP]->dummy) {

                /* completed a fibonacci-level */
                level--;
                if (!level) {
                    /* we're done, file[fileT] contains data */
                    return;
                }

                /* fileP is exhausted, reopen as new */
                fclose(file[fileP]->fp);
                if ((file[fileP]->fp = fopen(file[fileP]->name, "w+b"))
                        == NULL) {
                    perror("io8");
                    cleanExit(1);
                }
                file[fileP]->eof = false;
                file[fileP]->eor = false;

                rewindFile(fileT);

                /* rotate: f[0],f[1]...,f[fileT] become f[1],...,f[fileT],f[0] */
                tfile = file[0];
                memmove(file, file + 1, fileT * sizeof(tmpFileType *));
                file[fileT] = tfile;

                /* start new runs */
                for (j = 0; j <= fileP; j++)
                    if (!file[j]->dummy) file[j]->eor = false;
            }
        }
    }
}

void extSort(void) {
    initTmpFiles();
    makeRuns();
    mergeSort();
    termTmpFiles(0);
}

int main(int argc, char *argv[]) {

    /* command-line:
     *
     *   ext ifName ofName nTmpFiles nNodes
     *
     *   ext in.dat out.dat 5 2000
     *       reads in.dat, sorts using 5 files and 2000 nodes,
     *       output to out.dat
     */

    if (argc != 5) {
        printf("%s ifName ofName nTmpFiles nNodes\n", argv[0]);
        cleanExit(1);
    }

    ifName = argv[1];
    ofName = argv[2];
    nTmpFiles = atoi(argv[3]);
    nNodes = atoi(argv[4]);

    printf("extSort: nFiles=%d, nNodes=%d, lrecl=%d\n",
        nTmpFiles, nNodes, (int)sizeof(recType));

    extSort();

    return 0;
}






B-Trees

Dictionaries for very large files typically reside on secondary storage, such as a disk. The dictionary is implemented as an index to the actual file and contains the key and record address of data. To implement a dictionary we could use red-black trees, replacing pointers with offsets from the beginning of the index file, and use random access to reference nodes of the tree. However, every transition on a link would imply a disk access, and would be prohibitively expensive. Recall that low-level disk I/O accesses disk by sectors (typically 256 bytes). We could equate node size to sector size, and group several keys together in each node to minimize the number of I/O operations. This is the principle behind B-trees. Good references for B-trees include Knuth [1998] and Cormen [1990]. For B+-trees, consult Aho [1983].



Theory

Figure 4-3 illustrates a B-tree with 3 keys/node. Keys in internal nodes are surrounded by pointers, or record offsets, to keys that are less than or greater than the key value. For example, all keys less than 22 are to the left and all keys greater than 22 are to the right. For simplicity, I have not shown the record address associated with each key.

[Figure 4-3: B-Tree]

We can locate any key in this 2-level tree with three disk accesses. If we were to group 100 keys/node, we could search over 1,000,000 keys in only three reads. To ensure this property holds, we must maintain a balanced tree during insertion and deletion. During insertion, we examine the child node to verify that it is able to hold an additional node. If not, then a new sibling node is added to the tree, and the child's keys are redistributed to make room for the new node. When descending for insertion and the root is full, then the root is spilled to new children, and the level of the tree increases. A similar action is taken on deletion, where child nodes may be absorbed by the root. This technique for altering the height of the tree maintains a balanced tree.
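A quick arithmetic check of the 100 keys/node claim (this small program is my illustration, not part of the included listing): each node holding 100 keys has up to 101 children, so three levels of nodes index well over a million keys.

    #include <stdio.h>

    int main(void) {
        long keys = 0, nodes = 1;
        int level;

        for (level = 0; level < 3; level++) {
            keys += nodes * 100;    /* 100 keys stored in each node at this level */
            nodes *= 101;           /* each node has up to 101 children */
        }
        printf("keys reachable with 3 node reads: %ld\n", keys);  /* 1,030,300 */
        return 0;
    }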






                        B-Tree             B*-Tree            B+-Tree            B++-Tree
    data stored in      any node           any node           leaf only          leaf only
    on insert, split    1 x 1 -> 2 x 1/2   2 x 1 -> 3 x 2/3   1 x 1 -> 2 x 1/2   3 x 1 -> 4 x 3/4
    on delete, join     2 x 1/2 -> 1 x 1   3 x 2/3 -> 2 x 1   2 x 1/2 -> 1 x 1   3 x 1/2 -> 2 x 3/4

    Table 4-1: B-Tree Implementations

Several variants on the B-tree are listed in Table 4-1. The standard B-tree stores keys and data in both internal and leaf nodes. When descending the tree during insertion, a full child node is first redistributed to adjacent nodes. If the adjacent nodes are also full, then a new node is created, and half the keys in the child are moved to the newly created node. During deletion, children that are 1/2 full first attempt to obtain keys from adjacent nodes. If the adjacent nodes are also 1/2 full, then two nodes are joined to form one full node. B*-trees are similar, only the nodes are kept 2/3 full. This results in better utilization of space in the tree, and slightly better performance.



[Figure 4-4: B+-Tree]

Figure 4-4 illustrates a B+-tree. All keys are stored at the leaf level, with their associated data values. Duplicates of the keys appear in internal parent nodes to guide the search. Pointers have a slightly different meaning than in conventional B-trees. The left pointer designates all keys less than the value, while the right pointer designates all keys greater than or equal to (GE) the value. For example, all keys less than 22 are on the left pointer, and all keys greater than or equal to 22 are on the right. Notice that key 22 is duplicated in the leaf, where the associated data may be found. During insertion and deletion, care must be taken to properly update parent nodes. When modifying the first key in a leaf, the tree is walked from leaf to root. The last GE pointer found while descending the tree will require modification to reflect the new key value. Since all keys are in the leaf nodes, we may link them for sequential access.

The last method, B++-trees, is something of my own invention. The organization is similar to B+-trees, except for the split/join strategy. Assume each node can hold k keys, and the root node holds 3k keys. Before we descend to a child node during insertion, we check to see if it is full. If it is, the keys in the child node and two nodes adjacent to the child are all merged and redistributed. If the two adjacent nodes are also full, then another node is added, resulting in four nodes, each 3/4 full. Before we descend to a child node during deletion, we check to see if it is 1/2 full. If it is, the keys in the child node and two nodes adjacent to the child are all merged and redistributed. If the two adjacent nodes are also 1/2 full, then they are merged into two nodes, each 3/4 full. This is halfway between 1/2 full and completely full, allowing for an equal number of insertions or deletions in the future.

Recall that the root node holds 3k keys. If the root is full during insertion, we distribute the keys to four new nodes, each 3/4 full. This increases the height of the tree. During deletion, we inspect the child nodes. If there are only three child nodes, and they are all 1/2 full, they are gathered into the root, and the height of the tree decreases. Another way of expressing the operation is to say we are gathering three nodes, and then scattering them. In the case of insertion, where we need an extra node, we scatter to four nodes. For deletion, where a node must be deleted, we scatter to two nodes. The symmetry of the operation allows the gather/scatter routines to be shared by insertion and deletion in the implementation.



Implementation An ANSI-C implementation of a B++-tree is included. In the implementation-dependent section, you'll need to define bAdrType and eAdrType, the types associated with B-tree file offsets and data file offsets, respectively. You'll also need to provide a callback function which is used by the B++-tree algorithm to compare keys. Functions are provided to insert/delete keys, find keys, and access keys sequentially. Function main, at the bottom of the file, provides a simple illustration for insertion. The code provided allows for multiple indices to the same data. This was implemented by returning a handle when the index is opened. Subsequent accesses are done using the supplied handle. Duplicate keys are allowed. Within one index, all keys must be the same length. A binary search was implemented to search each node. A flexible buffering scheme allows nodes to be retained in memory until the space is needed. If you expect access to be somewhat ordered, increasing the bufCt will reduce paging.
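A minimal compare callback matching the bCompType shape used by the listing below might look like this; the integer key type is my illustrative choice, not something the listing prescribes:

    /* compare two int keys, returning the CC_* codes defined in the listing */
    static int compInt(const void *key1, const void *key2) {
        int a = *(const int *)key1;
        int b = *(const int *)key2;
        if (a < b) return CC_LT;
        if (a > b) return CC_GT;
        return CC_EQ;       /* equal keys */
    }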






http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_btr.txt



#include #include #include #include







/* * this file is divided into sections: * stuff you'll probably want to place in a .h file... * implementation dependent * - you'll probably have to change something here * implementation independent * - types and function prototypes that typically go in a .h file * function prototypes * - prototypes for user functions * internals * - local functions * - user functions * main() */ /**************************** * implementation dependent * ****************************/ typedef long eAdrType; typedef long bAdrType; #define CC_EQ #define CC_GT #define CC_LT



/* record address for external record */ /* record address for btree node */



0 1 -1



/* compare two keys and return: * CC_LT key1 < key2 * CC_GT key1 > key2 * CC_EQ key1 = key2 */ typedef int (*bCompType)(const void *key1, const void *key2); /****************************** * implementation independent * ******************************/ /* statistics */ int maxHeight; int nNodesIns; int nNodesDel; int nKeysIns; int nKeysDel; int nDiskReads; int nDiskWrites;



/* /* /* /* /* /* /*



maximum height attained */ number of nodes inserted */ number of nodes deleted */ number of keys inserted */ number of keys deleted */ number of disk reads */ number of disk writes */



/* line number for last IO or memory error */ int bErrLineNo; typedef enum {false, true} bool; typedef enum { bErrOk, bErrKeyNotFound, bErrDupKeys, bErrSectorSize, bErrFileNotOpen, bErrFileExists, bErrIO, bErrMemory } bErrType; typedef void *bHandleType;



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_btr.txt (1 of 20) [3/23/2004 3:10:37 PM]



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_btr.txt



typedef struct { char *iName; int keySize; bool dupKeys; int sectorSize; bCompType comp; } bOpenType;



/* /* /* /* /* /*



info for bOpen() */ name of index file */ length, in bytes, of key */ true if duplicate keys allowed */ size of sector on disk */ pointer to compare function */



/*********************** * function prototypes * ***********************/ bErrType bOpen(bOpenType info, bHandleType *handle); /* * input: * info info for open * output: * handle handle to btree, used in subsequent calls * returns: * bErrOk open was successful * bErrMemory insufficient memory * bErrSectorSize sector size too small or not 0 mod 4 * bErrFileNotOpen unable to open index file */ bErrType bClose(bHandleType handle); /* * input: * handle handle returned by bOpen * returns: * bErrOk file closed, resources deleted */ bErrType bInsertKey(bHandleType handle, void *key, eAdrType rec); /* * input: * handle handle returned by bOpen * key key to insert * rec record address * returns: * bErrOk operation successful * bErrDupKeys duplicate keys (and info.dupKeys = false) * notes: * If dupKeys is false, then all records inserted must have a * unique key. If dupkeys is true, then duplicate keys are * allowed, but they must all have unique record addresses. * In this case, record addresses are included in internal * nodes to generate a "unique" key. */ bErrType bDeleteKey(bHandleType handle, void *key, eAdrType *rec); /* * input: * handle handle returned by bOpen * key key to delete * rec record address of key to delete * output: * rec record address deleted * returns: * bErrOk operation successful * bErrKeyNotFound key not found * notes: * If dupKeys is false, all keys are unique, and rec is not used * to determine which key to delete. If dupKeys is true, then * rec is used to determine which key to delete. */ http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_btr.txt (2 of 20) [3/23/2004 3:10:37 PM]



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_btr.txt



bErrType bFindKey(bHandleType handle, void *key, eAdrType *rec); /* * input: * handle handle returned by bOpen * key key to find * output: * rec record address * returns: * bErrOk operation successful * bErrKeyNotFound key not found */ bErrType bFindFirstKey(bHandleType handle, void *key, eAdrType *rec); /* * input: * handle handle returned by bOpen * output: * key first key in sequential set * rec record address * returns: * bErrOk operation successful * bErrKeyNotFound key not found */ bErrType bFindLastKey(bHandleType handle, void *key, eAdrType *rec); /* * input: * handle handle returned by bOpen * output: * key last key in sequential set * rec record address * returns: * bErrOk operation successful * bErrKeyNotFound key not found */ bErrType bFindNextKey(bHandleType handle, void *key, eAdrType *rec); /* * input: * handle handle returned by bOpen * output: * key key found * rec record address * returns: * bErrOk operation successful * bErrKeyNotFound key not found */ bErrType bFindPrevKey(bHandleType handle, void *key, eAdrType *rec); /* * input: * handle handle returned by bOpen * output: * key key found * rec record address * returns: * bErrOk operation successful * bErrKeyNotFound key not found */ /************* * internals * *************/ /* http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_btr.txt (3 of 20) [3/23/2004 3:10:37 PM]



http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/niemann/s_btr.txt



* algorithm: * A B+tree implementation, with keys stored in internal nodes, * and keys/record addresses stored in leaf nodes. Each node is * one sector in length, except the root node whose length is * 3 sectors. When traversing the tree to insert a key, full * children are adjusted to make room for possible new entries. * Similarly, on deletion, half-full nodes are adjusted to allow for * possible deleted entries. Adjustments are first done by * examining 2 nearest neighbors at the same level, and redistibuting * the keys if possible. If redistribution won't solve the problem, * nodes are split/joined as needed. Typically, a node is 3/4 full. * On insertion, if 3 nodes are full, they are split into 4 nodes, * each 3/4 full. On deletion, if 3 nodes are 1/2 full, they are * joined to create 2 nodes 3/4 full. * * A LRR (least-recently-read) buffering scheme for nodes is used to * simplify storage management, and, assuming some locality of reference, * improve performance. * * To simplify matters, both internal nodes and leafs contain the * same fields. * */ /* macros for addressing fields */ /* primitives */ #define bAdr(p) *(bAdrType *)(p) #define eAdr(p) *(eAdrType *)(p) /* based on k = &[key,rec,childGE] */ #define childLT(k) bAdr((char *)k - sizeof(bAdrType)) #define key(k) (k) #define rec(k) eAdr((char *)(k) + h->keySize) #define childGE(k) bAdr((char *)(k) + h->keySize + sizeof(eAdrType)) /* based on b = &bufType */ #define leaf(b) b->p->leaf #define ct(b) b->p->ct #define next(b) b->p->next #define prev(b) b->p->prev #define fkey(b) &b->p->fkey #define lkey(b) (fkey(b) + ks((ct(b) - 1))) #define p(b) (char *)(b->p) /* shortcuts */ #define ks(ct) ((ct) * h->ks) typedef char keyType;



typedef struct {
    unsigned int leaf:1;        /* first bit = 1 if leaf */
    unsigned int ct:15;         /* count of keys present */
    bAdrType prev;              /* prev node in sequence (leaf) */
    bAdrType next;              /* next node in sequence (leaf) */
    bAdrType childLT;           /* child LT first key */
    /* ct occurrences of [key,rec,childGE] */
    keyType fkey;               /* first occurrence */
} nodeType;

typedef struct bufTypeTag {     /* location of node */
    struct bufTypeTag *next;    /* next */
    struct bufTypeTag *prev;    /* previous */
    bAdrType adr;               /* on disk */
    nodeType *p;                /* in memory */
    bool valid;                 /* true if buffer contents valid */
    bool modified;              /* true if buffer modified */
} bufType;
/* one node for each open handle */
typedef struct hNodeTag {
    struct hNodeTag *prev;      /* previous node */
    struct hNodeTag *next;      /* next node */
    FILE *fp;                   /* idx file */
    int keySize;                /* key length */
    bool dupKeys;               /* true if duplicate keys */
    int sectorSize;             /* block size for idx records */
    bCompType comp;             /* pointer to compare routine */
    bufType root;               /* root of b-tree, room for 3 sets */
    bufType bufList;            /* head of buf list */
    void *malloc1;              /* malloc'd resources */
    void *malloc2;              /* malloc'd resources */
    bufType gbuf;               /* gather buffer, room for 3 sets */
    bufType *curBuf;            /* current location */
    keyType *curKey;            /* current key in current node */
    unsigned int maxCt;         /* maximum # keys in node */
    int ks;                     /* sizeof key entry */
    bAdrType nextFreeAdr;       /* next free b-tree record address */
} hNode;

static hNode hList;             /* list of hNodes */
static hNode *h;                /* current hNode */
#define error(rc) lineError(__LINE__, rc)

static bErrType lineError(int lineno, bErrType rc) {
    if (rc == bErrIO || rc == bErrMemory)
        if (!bErrLineNo)
            bErrLineNo = lineno;
    return rc;
}

static bAdrType allocAdr(void) {
    bAdrType adr;
    adr = h->nextFreeAdr;
    h->nextFreeAdr += h->sectorSize;
    return adr;
}

static bErrType flush(bufType *buf) {
    int len;            /* number of bytes to write */

    /* flush buffer to disk */
    len = h->sectorSize;
    if (buf->adr == 0) len *= 3;        /* root */
    if (fseek(h->fp, buf->adr, SEEK_SET)) return error(bErrIO);
    if (fwrite(buf->p, len, 1, h->fp) != 1) return error(bErrIO);
    buf->modified = false;
    nDiskWrites++;
    return bErrOk;
}

static bErrType flushAll(void) {
    bErrType rc;        /* return code */
    bufType *buf;       /* buffer */

    if (h->root.modified)
        if ((rc = flush(&h->root)) != 0) return rc;

    buf = h->bufList.next;
    while (buf != &h->bufList) {
        if (buf->modified)
            if ((rc = flush(buf)) != 0) return rc;
        buf = buf->next;
    }
    return bErrOk;
}

static bErrType assignBuf(bAdrType adr, bufType **b) {
    /* assign buf to adr */
    bufType *buf;       /* buffer */
    bErrType rc;        /* return code */

    if (adr == 0) {
        *b = &h->root;
        return bErrOk;
    }

    /* search for buf with matching adr */
    buf = h->bufList.next;
    while (buf->next != &h->bufList) {
        if (buf->valid && buf->adr == adr) break;
        buf = buf->next;
    }

    /* either buf points to a match, or it's the last one in list (LRR) */
    if (buf->valid) {
        if (buf->adr != adr) {
            if (buf->modified) {
                if ((rc = flush(buf)) != 0) return rc;
            }
            buf->adr = adr;
            buf->valid = false;
        }
    } else {
        buf->adr = adr;
    }

    /* remove from current position and place at front of list */
    buf->next->prev = buf->prev;
    buf->prev->next = buf->next;
    buf->next = h->bufList.next;
    buf->prev = &h->bufList;
    buf->next->prev = buf;
    buf->prev->next = buf;
    *b = buf;
    return bErrOk;
}

static bErrType writeDisk(bufType *buf) {
    /* write buf to disk */
    buf->valid = true;
    buf->modified = true;
    return bErrOk;
}

static bErrType readDisk(bAdrType adr, bufType **b) {
    /* read data into buf */
    int len;
    bufType *buf;       /* buffer */
    bErrType rc;        /* return code */

    if ((rc = assignBuf(adr, &buf)) != 0) return rc;
    if (!buf->valid) {
        len = h->sectorSize;
        if (adr == 0) len *= 3;         /* root */
        if (fseek(h->fp, adr, SEEK_SET)) return error(bErrIO);
        if (fread(buf->p, len, 1, h->fp) != 1) return error(bErrIO);
        buf->modified = false;
        buf->valid = true;
        nDiskReads++;
    }
    *b = buf;
    return bErrOk;
}

typedef enum { MODE_FIRST, MODE_MATCH } modeEnum;

static int search(
    bufType *buf,
    void *key,
    eAdrType rec,
    keyType **mkey,
    modeEnum mode) {
    /*
     * input:
     *   buf                  pointer to node
     *   key                  key to find
     *   rec                  record address (dupkey only)
     * output:
     *   mkey                 pointer to keyType info
     * returns:
     *   CC_EQ                key = mkey
     *   CC_LT                key < mkey
     *   CC_GT                key > mkey
     */
    int cc;             /* condition code */
    int m;              /* midpoint of search */
    int lb;             /* lower-bound of binary search */
    int ub;             /* upper-bound of binary search */
    bool foundDup;      /* true if found a duplicate key */

    /* scan current node for key using binary search */
    foundDup = false;
    lb = 0;
    ub = ct(buf) - 1;
    while (lb <= ub) {
        m = (lb + ub) / 2;
        *mkey = fkey(buf) + ks(m);
        cc = h->comp(key, key(*mkey));
        if (cc < 0)
            /* key less than key[m] */
            ub = m - 1;
        else if (cc > 0)
            /* key greater than key[m] */
            lb = m + 1;
        else {
            /* keys match */
            if (h->dupKeys) {
                switch (mode) {
                case MODE_FIRST:
                    /* backtrack to first key */
                    ub = m - 1;
                    foundDup = true;
                    break;
                case MODE_MATCH:
                    /* rec's must also match */
                    if (rec < rec(*mkey)) {
                        ub = m - 1;
                        cc = CC_LT;
                    } else if (rec > rec(*mkey)) {
                        lb = m + 1;
                        cc = CC_GT;
                    } else {
                        return CC_EQ;
                    }
                    break;
                }
            } else {
                return cc;
            }
        }
    }
    if (ct(buf) == 0) {
        /* empty list */
        *mkey = fkey(buf);
        return CC_LT;
    }
    if (h->dupKeys && (mode == MODE_FIRST) && foundDup) {
        /* next key is first key in set of duplicates */
        *mkey += ks(1);
        return CC_EQ;
    }
    /* didn't find key */
    return cc;
}

static bErrType scatterRoot(void) {
    bufType *gbuf;
    bufType *root;

    /* scatter gbuf to root */
    root = &h->root;
    gbuf = &h->gbuf;
    memcpy(fkey(root), fkey(gbuf), ks(ct(gbuf)));
    childLT(fkey(root)) = childLT(fkey(gbuf));
    ct(root) = ct(gbuf);
    leaf(root) = leaf(gbuf);
    return bErrOk;
}

static bErrType scatter(bufType *pbuf, keyType *pkey, int is, bufType **tmp) {
    bufType *gbuf;      /* gather buf */
    keyType *gkey;      /* gather buf key */
    bErrType rc;        /* return code */
    int iu;             /* number of tmp's used */
    int k0Min;          /* min #keys that can be mapped to tmp[0] */
    int knMin;          /* min #keys that can be mapped to tmp[1..3] */
    int k0Max;          /* max #keys that can be mapped to tmp[0] */
    int knMax;          /* max #keys that can be mapped to tmp[1..3] */
    int sw;             /* shift width */
    int len;            /* length of remainder of buf */
    int base;           /* base count distributed to tmps */
    int extra;          /* extra counts */
    int ct;
    int i;

    /*
     * input:
     *   pbuf                 parent buffer of gathered keys
     *   pkey                 where we insert a key if needed in parent
     *   is                   number of supplied tmps
     *   tmp                  array of tmp's to be used for scattering
     * output:
     *   tmp                  array of tmp's used for scattering
     */

    /* scatter gbuf to tmps, placing 3/4 max in each tmp */
    gbuf = &h->gbuf;
    gkey = fkey(gbuf);
    ct = ct(gbuf);

    /****************************************
     * determine number of tmps to use (iu) *
     ****************************************/
    iu = is;

    /* determine limits */
    if (leaf(gbuf)) {
        /* minus 1 to allow for insertion */
        k0Max = h->maxCt - 1;
        knMax = h->maxCt - 1;
        /* plus 1 to allow for deletion */
        k0Min = (h->maxCt / 2) + 1;
        knMin = (h->maxCt / 2) + 1;
    } else {
        /* can hold an extra gbuf key as it's translated to a LT pointer */
        k0Max = h->maxCt - 1;
        knMax = h->maxCt;
        k0Min = (h->maxCt / 2) + 1;
        knMin = ((h->maxCt+1) / 2) + 1;
    }

    /* calculate iu, number of tmps to use */
    while (1) {
        if (iu == 0 || ct > (k0Max + (iu-1)*knMax)) {
            /* add a buffer */
            if ((rc = assignBuf(allocAdr(), &tmp[iu])) != 0)
                return rc;
            /* update sequential links */
            if (leaf(gbuf)) {
                /* adjust sequential links */
                if (iu == 0) {
                    /* no tmps supplied when splitting root for first time */
                    prev(tmp[0]) = 0;
                    next(tmp[0]) = 0;
                } else {
                    prev(tmp[iu]) = tmp[iu-1]->adr;
                    next(tmp[iu]) = next(tmp[iu-1]);
                    next(tmp[iu-1]) = tmp[iu]->adr;
                }
            }
            iu++;
            nNodesIns++;
        } else if (iu > 1 && ct < (k0Min + (iu-1)*knMin)) {
            /* del a buffer */
            iu--;
            /* adjust sequential links */
            if (leaf(gbuf) && tmp[iu-1]->adr) {
                next(tmp[iu-1]) = next(tmp[iu]);
            }
            next(tmp[iu-1]) = next(tmp[iu]);
            nNodesDel++;
        } else {
            break;
        }
    }

    /* establish count for each tmp used */
    base = ct / iu;
    extra = ct % iu;
    for (i = 0; i < iu; i++) {
        int n;

        n = base;
        /* distribute extras, one at a time */
        /* don't do to 1st node, as it may be internal and can't hold it */
        if (i && extra) {
            n++;
            extra--;
        }
        ct(tmp[i]) = n;
    }

    /**************************************
     * update sequential links and parent *
     **************************************/
    if (iu != is) {
        /* link last node to next */
        if (leaf(gbuf) && next(tmp[iu-1])) {
            bufType *buf;
            if ((rc = readDisk(next(tmp[iu-1]), &buf)) != 0) return rc;
            prev(buf) = tmp[iu-1]->adr;
            if ((rc = writeDisk(buf)) != 0) return rc;
        }

        /* shift keys in parent */
        sw = ks(iu - is);
        if (sw < 0) {
            len = ks(ct(pbuf)) - (pkey - fkey(pbuf)) + sw;
            memmove(pkey, pkey - sw, len);
        } else {
            len = ks(ct(pbuf)) - (pkey - fkey(pbuf));
            memmove(pkey + sw, pkey, len);
        }

        /* don't count LT buffer for empty parent */
        if (ct(pbuf))
            ct(pbuf) += iu - is;
        else
            ct(pbuf) += iu - is - 1;
    }

    /*******************************
     * distribute keys to children *
     *******************************/
    for (i = 0; i < iu; i++) {
        /* update LT pointer and parent nodes */
        if (leaf(gbuf)) {
            /* update LT, tmp[i] */
            childLT(fkey(tmp[i])) = 0;
            /* update parent */
            if (i == 0) {
                childLT(pkey) = tmp[i]->adr;
            } else {
                memcpy(pkey, gkey, ks(1));
                childGE(pkey) = tmp[i]->adr;
                pkey += ks(1);
            }
        } else {
            if (i == 0) {
                /* update LT, tmp[0] */
                childLT(fkey(tmp[i])) = childLT(gkey);
                /* update LT, parent */
                childLT(pkey) = tmp[i]->adr;
            } else {
                /* update LT, tmp[i] */
                childLT(fkey(tmp[i])) = childGE(gkey);
                /* update parent key */
                memcpy(pkey, gkey, ks(1));
                childGE(pkey) = tmp[i]->adr;
                gkey += ks(1);
                pkey += ks(1);
                ct(tmp[i])--;
            }
        }

        /* install keys, tmp[i] */
        memcpy(fkey(tmp[i]), gkey, ks(ct(tmp[i])));
        leaf(tmp[i]) = leaf(gbuf);
        gkey += ks(ct(tmp[i]));
    }
    leaf(pbuf) = false;

    /************************
     * write modified nodes *
     ************************/
    if ((rc = writeDisk(pbuf)) != 0) return rc;
    for (i = 0; i < iu; i++)
        if ((rc = writeDisk(tmp[i])) != 0) return rc;
    return bErrOk;
}

static bErrType gatherRoot(void) {
    bufType *gbuf;
    bufType *root;

    /* gather root to gbuf */
    root = &h->root;
    gbuf = &h->gbuf;
    memcpy(p(gbuf), root->p, 3 * h->sectorSize);
    leaf(gbuf) = leaf(root);
    ct(root) = 0;
    return bErrOk;
}

static bErrType gather(bufType *pbuf, keyType **pkey, bufType **tmp) {
    bErrType rc;        /* return code */
    bufType *gbuf;
    keyType *gkey;

    /*
     * input:
     *   pbuf                 parent buffer
     *   pkey                 pointer to match key in parent
     * output:
     *   tmp                  buffers to use for scatter
     *   pkey                 pointer to match key in parent
     * returns:
     *   bErrOk               operation successful
     * notes:
     *   Gather 3 buffers to gbuf.  Setup for subsequent scatter by
     *   doing the following:
     *     - setup tmp buffer array for scattered buffers
     *     - adjust pkey to point to first key of 3 buffers
     */

    /* find 3 adjacent buffers */
    if (*pkey == lkey(pbuf))
        *pkey -= ks(1);
    if ((rc = readDisk(childLT(*pkey), &tmp[0])) != 0) return rc;
    if ((rc = readDisk(childGE(*pkey), &tmp[1])) != 0) return rc;
    if ((rc = readDisk(childGE(*pkey + ks(1)), &tmp[2])) != 0) return rc;

    /* gather nodes to gbuf */
    gbuf = &h->gbuf;
    gkey = fkey(gbuf);

    /* tmp[0] */
    childLT(gkey) = childLT(fkey(tmp[0]));
    memcpy(gkey, fkey(tmp[0]), ks(ct(tmp[0])));
    gkey += ks(ct(tmp[0]));
    ct(gbuf) = ct(tmp[0]);

    /* tmp[1] */
    if (!leaf(tmp[1])) {
        memcpy(gkey, *pkey, ks(1));
        childGE(gkey) = childLT(fkey(tmp[1]));
        ct(gbuf)++;
        gkey += ks(1);
    }
    memcpy(gkey, fkey(tmp[1]), ks(ct(tmp[1])));
    gkey += ks(ct(tmp[1]));
    ct(gbuf) += ct(tmp[1]);

    /* tmp[2] */
    if (!leaf(tmp[2])) {
        memcpy(gkey, *pkey+ks(1), ks(1));
        childGE(gkey) = childLT(fkey(tmp[2]));
        ct(gbuf)++;
        gkey += ks(1);
    }
    memcpy(gkey, fkey(tmp[2]), ks(ct(tmp[2])));
    ct(gbuf) += ct(tmp[2]);

    leaf(gbuf) = leaf(tmp[0]);

    return bErrOk;
}

bErrType bOpen(bOpenType info, bHandleType *handle) {
    bErrType rc;        /* return code */
    int bufCt;          /* number of tmp buffers */
    bufType *buf;       /* buffer */
    int maxCt;          /* maximum number of keys in a node */
    bufType *root;
    int i;
    nodeType *p;

    if ((info.sectorSize < sizeof(hNode)) || (info.sectorSize % 4))
        return bErrSectorSize;

    /* determine sizes and offsets */
    /* leaf/n, prev, next, [childLT,key,rec]... childGE */
    /* ensure that there are at least 3 children/parent for gather/scatter */
    maxCt = info.sectorSize - (sizeof(nodeType) - sizeof(keyType));
    maxCt /= sizeof(bAdrType) + info.keySize + sizeof(eAdrType);
    if (maxCt < 6) return bErrSectorSize;

    /* copy parms to hNode */
    if ((h = malloc(sizeof(hNode))) == NULL) return error(bErrMemory);
    memset(h, 0, sizeof(hNode));
    h->keySize = info.keySize;
    h->dupKeys = info.dupKeys;
    h->sectorSize = info.sectorSize;
    h->comp = info.comp;

    /* childLT, key, rec */
    h->ks = sizeof(bAdrType) + h->keySize + sizeof(eAdrType);
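    /*
     * Worked sizing example (illustrative; assumes 4-byte bAdrType,
     * eAdrType and keys, as in the test driver below): each key entry
     * then occupies ks = 4 + 4 + 4 = 12 bytes, so a 256-byte sector
     * holds maxCt = (256 - (sizeof(nodeType) - 1)) / 12 keys,
     * comfortably above the minimum of 6 enforced above.
     */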



    h->maxCt = maxCt;

    /* Allocate buflist.
     * During insert/delete, need simultaneous access to 7 buffers:
     *  - 4 adjacent child bufs
     *  - 1 parent buf
     *  - 1 next sequential link
     *  - 1 lastGE
     */
    bufCt = 7;
    if ((h->malloc1 = malloc(bufCt * sizeof(bufType))) == NULL)
        return error(bErrMemory);
    buf = h->malloc1;

    /*
     * Allocate bufs.
     * We need space for the following:
     *  - bufCt buffers, of size sectorSize
     *  - 1 buffer for root, of size 3*sectorSize
     *  - 1 buffer for gbuf, size 3*sectorSize + 2 extra keys
     *    to allow for LT pointers in last 2 nodes when gathering 3 full nodes
     */
    if ((h->malloc2 = malloc((bufCt+6) * h->sectorSize + 2 * h->ks)) == NULL)
        return error(bErrMemory);
    p = h->malloc2;

    /* initialize buflist */
    h->bufList.next = buf;
    h->bufList.prev = buf + (bufCt - 1);
    for (i = 0; i < bufCt; i++) {
        buf->next = buf + 1;
        buf->prev = buf - 1;
        buf->modified = false;
        buf->valid = false;
        buf->p = p;
        p = (nodeType *)((char *)p + h->sectorSize);
        buf++;
    }
    h->bufList.next->prev = &h->bufList;
    h->bufList.prev->next = &h->bufList;

    /* initialize root */
    root = &h->root;
    root->p = p;
    p = (nodeType *)((char *)p + 3*h->sectorSize);
    h->gbuf.p = p;              /* done last to include extra 2 keys */
    h->curBuf = NULL;
    h->curKey = NULL;

    /* initialize root */
    if ((h->fp = fopen(info.iName, "r+b")) != NULL) {
        /* open an existing database */
        if ((rc = readDisk(0, &root)) != 0) return rc;
        if (fseek(h->fp, 0, SEEK_END)) return error(bErrIO);
        if ((h->nextFreeAdr = ftell(h->fp)) == -1) return error(bErrIO);
    } else if ((h->fp = fopen(info.iName, "w+b")) != NULL) {
        /* initialize root */
        memset(root->p, 0, 3*h->sectorSize);
        leaf(root) = 1;
        h->nextFreeAdr = 3 * h->sectorSize;
    } else {
        /* something's wrong */
        free(h);
        return bErrFileNotOpen;
    }

    /* append node to hList */
    if (hList.next) {
        h->prev = hList.next;
        h->next = &hList;
        h->prev->next = h;
        h->next->prev = h;
    } else {
        /* first item in hList */
        h->prev = h->next = &hList;
        hList.next = hList.prev = h;
    }

    *handle = h;
    return bErrOk;
}

bErrType bClose(bHandleType handle) {
    h = handle;
    if (h == NULL) return bErrOk;

    /* remove from list */
    if (h->next) {
        h->next->prev = h->prev;
        h->prev->next = h->next;
    }

    /* flush idx */
    if (h->fp) {
        flushAll();
        fclose(h->fp);
    }

    if (h->malloc2) free(h->malloc2);
    if (h->malloc1) free(h->malloc1);
    free(h);
    return bErrOk;
}

bErrType bFindKey(bHandleType handle, void *key, eAdrType *rec) {
    keyType *mkey;      /* matched key */
    bufType *buf;       /* buffer */
    bErrType rc;        /* return code */

    h = handle;
    buf = &h->root;

    /* find key, and return address */
    while (1) {
        if (leaf(buf)) {
            if (search(buf, key, 0, &mkey, MODE_FIRST) == 0) {
                *rec = rec(mkey);
                h->curBuf = buf;
                h->curKey = mkey;
                return bErrOk;
            } else {
                return bErrKeyNotFound;
            }
        } else {
            if (search(buf, key, 0, &mkey, MODE_FIRST) < 0) {
                if ((rc = readDisk(childLT(mkey), &buf)) != 0) return rc;
            } else {
                if ((rc = readDisk(childGE(mkey), &buf)) != 0) return rc;
            }
        }
    }
}

bErrType bInsertKey(bHandleType handle, void *key, eAdrType rec) {
    int rc;                     /* return code */
    keyType *mkey;              /* match key */
    int len;                    /* length to shift */
    int cc;                     /* condition code */
    bufType *buf, *root;
    bufType *tmp[4];
    unsigned int keyOff;
    bool lastGEvalid;           /* true if GE branch taken */
    bool lastLTvalid;           /* true if LT branch taken after GE branch */
    bAdrType lastGE;            /* last childGE traversed */
    unsigned int lastGEkey;     /* last childGE key traversed */
    int height;                 /* height of tree */

    h = handle;
    root = &h->root;
    lastGEvalid = false;
    lastLTvalid = false;

    /* check for full root */
    if (ct(root) == 3 * h->maxCt) {
        /* gather root and scatter to 4 bufs */
        /* this increases b-tree height by 1 */
        if ((rc = gatherRoot()) != 0) return rc;
        if ((rc = scatter(root, fkey(root), 0, tmp)) != 0) return rc;
    }
    buf = root;
    height = 0;
    while (1) {
        if (leaf(buf)) {
            /* in leaf, and there's room guaranteed */

            if (height > maxHeight) maxHeight = height;

            /* set mkey to point to insertion point */
            switch (search(buf, key, rec, &mkey, MODE_MATCH)) {
            case CC_LT:  /* key < mkey */
                if (!h->dupKeys && h->comp(key, mkey) == CC_EQ)
                    return bErrDupKeys;
                break;
            case CC_EQ:  /* key = mkey */
                return bErrDupKeys;
                break;
            case CC_GT:  /* key > mkey */
                if (!h->dupKeys && h->comp(key, mkey) == CC_EQ)
                    return bErrDupKeys;
                mkey += ks(1);
                break;
            }

            /* shift items GE key to right */
            keyOff = mkey - fkey(buf);
            len = ks(ct(buf)) - keyOff;
            if (len) memmove(mkey + ks(1), mkey, len);

            /* insert new key */
            memcpy(key(mkey), key, h->keySize);
            rec(mkey) = rec;
            childGE(mkey) = 0;
            ct(buf)++;
            if ((rc = writeDisk(buf)) != 0) return rc;

            /* if new key is first key, then fixup lastGE key */
            if (!keyOff && lastLTvalid) {
                bufType *tbuf;
                keyType *tkey;

                if ((rc = readDisk(lastGE, &tbuf)) != 0) return rc;
                tkey = fkey(tbuf) + lastGEkey;
                memcpy(key(tkey), key, h->keySize);
                rec(tkey) = rec;
                if ((rc = writeDisk(tbuf)) != 0) return rc;
            }
            nKeysIns++;
            break;
        } else {
            /* internal node, descend to child */
            bufType *cbuf;      /* child buf */

            height++;

            /* read child */
            if ((cc = search(buf, key, rec, &mkey, MODE_MATCH)) < 0) {
                if ((rc = readDisk(childLT(mkey), &cbuf)) != 0) return rc;
            } else {
                if ((rc = readDisk(childGE(mkey), &cbuf)) != 0) return rc;
            }

            /* check for room in child */
            if (ct(cbuf) == h->maxCt) {
                /* gather 3 bufs and scatter */
                if ((rc = gather(buf, &mkey, tmp)) != 0) return rc;
                if ((rc = scatter(buf, mkey, 3, tmp)) != 0) return rc;

                /* read child */
                if ((cc = search(buf, key, rec, &mkey, MODE_MATCH)) < 0) {
                    if ((rc = readDisk(childLT(mkey), &cbuf)) != 0) return rc;
                } else {
                    if ((rc = readDisk(childGE(mkey), &cbuf)) != 0) return rc;
                }
            }

            if (cc >= 0 || mkey != fkey(buf)) {
                lastGEvalid = true;
                lastLTvalid = false;
                lastGE = buf->adr;
                lastGEkey = mkey - fkey(buf);
                if (cc < 0) lastGEkey -= ks(1);
            } else {
                if (lastGEvalid) lastLTvalid = true;
            }
            buf = cbuf;
        }
    }
    return bErrOk;
}

bErrType bDeleteKey(bHandleType handle, void *key, eAdrType *rec) {
    int rc;                     /* return code */
    keyType *mkey;              /* match key */
    int len;                    /* length to shift */
    int cc;                     /* condition code */
    bufType *buf;               /* buffer */
    bufType *tmp[4];
    unsigned int keyOff;
    bool lastGEvalid;           /* true if GE branch taken */
    bool lastLTvalid;           /* true if LT branch taken after GE branch */
    bAdrType lastGE;            /* last childGE traversed */
    unsigned int lastGEkey;     /* last childGE key traversed */
    bufType *root;
    bufType *gbuf;

    h = handle;
    root = &h->root;
    gbuf = &h->gbuf;
    lastGEvalid = false;
    lastLTvalid = false;

    buf = root;
    while (1) {
        if (leaf(buf)) {
            /* set mkey to point to deletion point */
            if (search(buf, key, *rec, &mkey, MODE_MATCH) == 0)
                *rec = rec(mkey);
            else
                return bErrKeyNotFound;

            /* shift items GT key to left */
            keyOff = mkey - fkey(buf);
            len = ks(ct(buf)-1) - keyOff;
            if (len) memmove(mkey, mkey + ks(1), len);
            ct(buf)--;
            if ((rc = writeDisk(buf)) != 0) return rc;

            /* if deleted key is first key, then fixup lastGE key */
            if (!keyOff && lastLTvalid) {
                bufType *tbuf;
                keyType *tkey;

                if ((rc = readDisk(lastGE, &tbuf)) != 0) return rc;
                tkey = fkey(tbuf) + lastGEkey;
                memcpy(key(tkey), mkey, h->keySize);
                rec(tkey) = rec(mkey);
                if ((rc = writeDisk(tbuf)) != 0) return rc;
            }
            nKeysDel++;
            break;
        } else {
            /* internal node, descend to child */
            bufType *cbuf;      /* child buf */

            /* read child */
            if ((cc = search(buf, key, *rec, &mkey, MODE_MATCH)) < 0) {
                if ((rc = readDisk(childLT(mkey), &cbuf)) != 0) return rc;
            } else {
                if ((rc = readDisk(childGE(mkey), &cbuf)) != 0) return rc;
            }

            /* check for room to delete */
            if (ct(cbuf) == h->maxCt/2) {
                /* gather 3 bufs and scatter */
                if ((rc = gather(buf, &mkey, tmp)) != 0) return rc;

                /* if last 3 bufs in root, and count is low enough... */
                if (buf == root
                    && ct(root) == 2
                    && ct(gbuf) < (3*(3*h->maxCt))/4) {
                    /* collapse tree by one level */
                    scatterRoot();
                    nNodesDel += 3;
                    continue;
                }

                if ((rc = scatter(buf, mkey, 3, tmp)) != 0) return rc;

                /* read child */
                if ((cc = search(buf, key, *rec, &mkey, MODE_MATCH)) < 0) {
                    if ((rc = readDisk(childLT(mkey), &cbuf)) != 0) return rc;
                } else {
                    if ((rc = readDisk(childGE(mkey), &cbuf)) != 0) return rc;
                }
            }

            if (cc >= 0 || mkey != fkey(buf)) {
                lastGEvalid = true;
                lastLTvalid = false;
                lastGE = buf->adr;
                lastGEkey = mkey - fkey(buf);
                if (cc < 0) lastGEkey -= ks(1);
            } else {
                if (lastGEvalid) lastLTvalid = true;
            }
            buf = cbuf;
        }
    }
    return bErrOk;
}

bErrType bFindFirstKey(bHandleType handle, void *key, eAdrType *rec) {
    bErrType rc;        /* return code */
    bufType *buf;       /* buffer */

    h = handle;
    buf = &h->root;
    while (!leaf(buf)) {
        if ((rc = readDisk(childLT(fkey(buf)), &buf)) != 0) return rc;
    }
    if (ct(buf) == 0) return bErrKeyNotFound;
    memcpy(key, key(fkey(buf)), h->keySize);
    *rec = rec(fkey(buf));
    h->curBuf = buf;
    h->curKey = fkey(buf);
    return bErrOk;
}

bErrType bFindLastKey(bHandleType handle, void *key, eAdrType *rec) {
    bErrType rc;        /* return code */
    bufType *buf;       /* buffer */

    h = handle;
    buf = &h->root;
    while (!leaf(buf)) {
        if ((rc = readDisk(childGE(lkey(buf)), &buf)) != 0) return rc;
    }
    if (ct(buf) == 0) return bErrKeyNotFound;
    memcpy(key, key(lkey(buf)), h->keySize);
    *rec = rec(lkey(buf));
    h->curBuf = buf;
    h->curKey = lkey(buf);
    return bErrOk;
}

bErrType bFindNextKey(bHandleType handle, void *key, eAdrType *rec) {
    bErrType rc;        /* return code */
    keyType *nkey;      /* next key */
    bufType *buf;       /* buffer */

    h = handle;
    if ((buf = h->curBuf) == NULL) return bErrKeyNotFound;
    if (h->curKey == lkey(buf)) {
        /* current key is last key in leaf node */
        if (next(buf)) {
            /* fetch next set */
            if ((rc = readDisk(next(buf), &buf)) != 0) return rc;
            nkey = fkey(buf);
        } else {
            /* no more sets */
            return bErrKeyNotFound;
        }
    } else {
        /* bump to next key */
        nkey = h->curKey + ks(1);
    }
    memcpy(key, key(nkey), h->keySize);
    *rec = rec(nkey);
    h->curBuf = buf;
    h->curKey = nkey;
    return bErrOk;
}

bErrType bFindPrevKey(bHandleType handle, void *key, eAdrType *rec) {
    bErrType rc;        /* return code */
    keyType *pkey;      /* previous key */
    keyType *fkey;      /* first key */
    bufType *buf;       /* buffer */

    h = handle;
    if ((buf = h->curBuf) == NULL) return bErrKeyNotFound;
    fkey = fkey(buf);
    if (h->curKey == fkey) {
        /* current key is first key in leaf node */
        if (prev(buf)) {
            /* fetch previous set */
            if ((rc = readDisk(prev(buf), &buf)) != 0) return rc;
            pkey = fkey(buf) + ks((ct(buf) - 1));
        } else {
            /* no more sets */
            return bErrKeyNotFound;
        }
    } else {
        /* bump to previous key */
        pkey = h->curKey - ks(1);
    }
    memcpy(key, key(pkey), h->keySize);
    *rec = rec(pkey);
    h->curBuf = buf;
    h->curKey = pkey;
    return bErrOk;
}

int comp(const void *key1, const void *key2) {
    unsigned int const *p1;
    unsigned int const *p2;

    p1 = key1;
    p2 = key2;
    return (*p1 == *p2) ? CC_EQ : (*p1 > *p2) ? CC_GT : CC_LT;
}

int main(void) {
    bOpenType info;
    bHandleType handle;
    bErrType rc;
    unsigned int key;

    remove("t1.dat");
    info.iName = "t1.dat";
    info.keySize = sizeof(int);
    info.dupKeys = false;
    info.sectorSize = 256;
    info.comp = comp;
    if ((rc = bOpen(info, &handle)) != bErrOk) {
        printf("line %d: rc = %d\n", __LINE__, rc);
        exit(0);
    }

    key = 0x11;
    if ((rc = bInsertKey(handle, &key, 0x300)) != bErrOk) {
        printf("line %d: rc = %d\n", __LINE__, rc);
        exit(0);
    }

    bClose(handle);

    printf("statistics:\n");
    printf("    maximum height: %8d\n", maxHeight);
    printf("    nodes inserted: %8d\n", nNodesIns);
    printf("    nodes deleted:  %8d\n", nNodesDel);
    printf("    keys inserted:  %8d\n", nKeysIns);
    printf("    keys deleted:   %8d\n", nKeysDel);
    printf("    disk reads:     %8d\n", nDiskReads);
    printf("    disk writes:    %8d\n", nDiskWrites);

    return 0;
}
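/*
 * Illustrative variation (not part of the original listing): comp above
 * compares unsigned int keys; a table holding fixed-length character
 * keys could instead compare with memcmp.  The 16-byte length here is
 * an assumed figure and must equal the keySize passed to bOpen; the
 * function name is hypothetical, and the #if 0 guard keeps it out of
 * the build.
 */
#if 0
int compFixedStr(const void *key1, const void *key2) {
    int c = memcmp(key1, key2, 16);     /* compare the full 16-byte keys */
    return (c == 0) ? CC_EQ : (c > 0) ? CC_GT : CC_LT;
}
#endif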



Algorithm Animations



Data Structures and Algorithms



Animated Algorithms

The following pages contain animations of some of the algorithms covered in this text. Please note that:

a. Some of the Java classes take a very long time to load!
b. These animations are the result of a major effort to enhance the data structures and algorithms course and are thus subject to continuous enhancement. Comments are most welcome!

1. UWA animations
   Please note that these are under active development!

   Sorting algorithms
   a. Woi Ang's Insertion Sort Animation
   b. Woi Ang's QuickSort Animation
   c. Chien Wei Tan's QuickSort Animation
   d. Woi Ang's Bin Sort Animation
   e. Woi Ang's Radix Sort Animation
   f. Woi Ang's Priority Queue Animation

   Searching Algorithms
   a. Mervyn Ng's Red Black Tree Animation
   b. Woi Ang's Hash Table Construction Animation
   c. Woi Ang's Optimal Binary Search Tree Animation

   Greedy algorithms
   a. Woi Ang's Huffman Encoding & Decoding Animation

   Dynamic algorithms
   a. Woi Ang's Matrix Chain Multiplication Animation

   Graph algorithms
   1. Mervyn Ng's Minimum Spanning Tree Animation
   2. Mervyn Ng's Animation of Dijkstra's Algorithm

If you find the animations useful, but want them a little closer to home, you can download a file of them all: anim.tar.gz. They are also available by ftp. If you do download them, please don't forget to acknowledge Woi Ang as the author wherever you use them, and I'd appreciate it if you'd let me know; and, of course, if you have any suggestions or comments, they're most welcome: [email protected].

Back to the Table of Contents

© John Morris, 1998



Quicksort Animation



QSort Test Demonstration

If you have any comments or suggestions for improvements, please feel free to email me.

© Chien Wei Tan, 1998


