Prev  Next  Up  Contents

The ISAM Library



The Indexed Sequential Access Method, or ISAM , is a technique for organizing data and efficiently retrieving it. It is designed for efficient operation in two modes: random access and sequential access; hence the name Indexed Sequential Access Method.

The ISAM is implemented as a C function library. The purpose of the ISAM library is to manage index and data files. The combination of the data file and its associated index file is called a "database".

Database Operations


The ISAM library defines the operations that can be performed on a database. There are six basic operations that are performed by any ISAM library:

  1. create a new database
  2. open an existing database
  3. make indexes for accessing the database
  4. add records to the database
  5. find records in the database
  6. delete records from the database

To make it easier to perform database operations, there is the notion of a "current record". This is an internal position indicator that points to a particular record in the database. Some ISAM operations apply to the current record. Functions are provided for changing the internal position indicator. For example, there are functions for moving to the next or previous record. These types of operations are always relative to an index.

Indexes

A database will often have more than one index. For example, the card catalog has an author index, a title index, and a subject index. In addition, the card catalog might have a unique index for finding a book by its ISBN number. Database operations are performed relative to a particular index. Each index defines a different ordering of the records. For example, the title index orders the records alphabetically by book title, while the author index orders them by the author names. Each index has its own current record, which will generally be a different record than the current record in another index. Since the indexes define a logical ordering of records, the physical ordering (i.e., the order in which the records are physically stored in the data file) is generally not important.

Remember that an index is a collection of keys that are values extracted from one or more fields of a record. An index contains one key for each record in the database. Each key points to the complete record from which it was extracted. If an index contains keys that are composed from more than one field, the keys are called compound (or multiple field) keys.

In our card catalog example, the author index might contain keys consisting of a single string value extracted from the last name field. This would cause the author index to order records by last name. For example, the name Thomas Jefferson would appear in sequence before the name George Washington, since the letter 'J' appears alphabetically before the letter 'W'. But what if we had two authors named Jefferson, Thomas Jefferson and Timothy Jefferson. Which would come first in the ordered sequence? The answer is that it could be either, since the ordering is not affected by the author's first name. If we wanted the author index to ensure that Thomas Jefferson appeared in sequence before Timothy Jefferson, we would need to include the first name field as part of the key.

Instead of using single field keys, we could make the author index using compound keys consisting of two string values extracted from both the last name and first name fields of each record. The order in which the two strings are stored in the compound key is important.

If the last name field is stored first, followed by the first name field, then the author index will order the records by last name. Then any records with the same last name will be further ordered by first name. For example, the name Thomas Jefferson will appear in sequence before the name Timothy Jefferson, since the letter 'h' in Thomas appears alphabetically before the letter 'i' in Timothy.

On the other hand, if the first name field was stored in the key before the last name field, then the author index would order the records by first name. Then any records with the same first name would be further ordered by last name. With this ordering, the name George Washington would appear in sequence before the name Thomas Jefferson, since the letter 'G' would appear alphabetically before the letter 'T'. Since it would probably be preferable for the author index to order records by last name rather than first name, the compound key should store the last name first, followed by the first name.

Data Types

Each field in a record is used to store a particular type of information (or data). The most common data type is a string. A string is simply a sequence of characters terminated by a binary zero (or null character). A string field is used to store textual information. In our card catalog example, string fields would be used to store the author's first and last names. The title and subject heading would also be stored in string fields.

The ISAM library supports two types of string fields: normal and case sensitive. The string type becomes an issue when strings are compared. For normal string fields, the string "ABC" would be equal to the string "abc", even though one string contains upper-case characters and the other contains lower-case characters. However, for case sensitive string fields, "ABC" would not be equal to "abc". Using the standard ASCII character set, the string "ABC" would be less than "abc" because the character 'A' appears before the character 'a' in the character set.

String fields can also be used to store numerical data. However, the standard C numeric data types (int, long, float, double, etc.) are stored internally in a binary format. This binary format is a bit- packed representation of a numeric value that is suitable for performing arithmetic operations. Before storing a numeric value in a string field, it must first be converted to string format using a standard C function like sprintf. Likewise, before performing arithmetic operations with a numeric value stored as a string, it must first be converted back to binary format using a standard C function like sscanf.

Instead of using string fields to store numerical data, a more efficient alternative is to use binary fields. A binary field stores numeric data using the internal binary format. Therefore it is not necessary to convert numeric values to and from string format. An additional benefit is that binary format saves disk space. For any given numeric data type (e.g., type short int), the binary format is fixed-length for all values (typically two bytes), whereas the length of the string format is dependent on the value (e.g., "10" is three bytes in length while "32000" is six bytes in length. Remember that a string requires a terminating null character.

There is also a disadvantage to using binary fields for storing numeric data. The size of a standard C numeric data type is implementation dependent. One compiler might treat the type int as a two byte (16-bit) value, while another might treat it as a four byte (32-bit) value. A compiler might use different sizes based on whether you are compiling for a 16-bit or 32-bit operating system. Another problem is that the byte ordering of binary data can be dependent on the computer processor (least significant versus most significant byte first). Also, binary floating point formats can vary (not all compilers use the standard IEEE format). As a result, a database that contains numeric data stored in binary format will not be as portable as one that stores numeric data in string format. The tradeoff is efficiency versus portability.

The ISAM library supports three standard binary integer data types: short int, unsigned short int and long int. Although the sizes can vary, most compilers treat short int as a 16-bit (two byte) signed integer, unsigned short int as a 16-bit (two byte) unsigned integer, and long int as a 32-bit (four byte) signed integer. Most processors store integer values in little endian (least significant byte first) format. Therefore, these binary integer data types are at least somewhat portable.

The ISAM library also supports the two standard binary floating point data types: float and double. Most compilers store floating point values using the standard IEEE format, with float stored as a four byte value and double stored as an eight byte value.

Finally, the ISAM library supports a generic binary data type. This is a fixed-length field that can hold any type of data. Although fixed-length, the size can be any specified number of bytes. For example, a single byte binary field could be used to efficiently store numeric values of limited range (0..255 or -128..127).

Variable Length Records

A database record typically contains many string fields. The length of the strings stored in a particular field usually varies from one record to the next. For example, a record might contain a string field that is used for general comments. One record might need a comment that is a couple of hundred characters in length, while the next record might not need any comment at all.

Some database programs (e.g., dBASE®) only support fixed-length records (i.e., each record is the same size). With fixed-length records, the string fields in each record must be large enough to accommodate the longest strings ever stored in those fields. For a given string field, most records will contain strings that are shorter than the longest string. However, since space is reserved for the longest string possible for that field, there is a lot of unused (or wasted) disk space. For example, if a string field allowed a maximum of 300 characters, then a zero length string stored in that field would waste 300 bytes of disk space.

The ISAM library solves the problem of wasted disk space by storing variable length records (i.e., each record may be a different size). With variable length records, a given string field may be a different length in each record. The string fields are only large enough to accommodate the actual strings that are stored in each record. An empty or zero length string requires only one byte of storage (i.e., the string terminating null character consumes one byte of disk space).


Next Page
C/Database Toolchest Description
Product List