Section 16

Development through Iterative Refinement

When we develop real functions, we are often confronted with the task of designing a data representation for complicated forms of information. The best strategy to approach this task is apply a well-known scientific technique: ITERATIVE REFINEMENT. A scientist's problem is to represent a part of the real world using mathematics. The result of the effort is called a MODEL. The scientist then tests the model in many ways, in particular by predicting certain properties of events. If the model truly captured the essential elements of the real world, the prediction will be accurate; otherwise, there will be discrepancies between the predictions and the actual outcomes. For example, a physicist may start by representing a jet plane as a point and by predicting its movement in a straight line using Newton's equations. Later, if there is a need to understand the plane's friction, the physicist may add certain aspects of the jet plane's contour to the model. In general, a scientist refines a model and retests its usefulness until it is sufficiently accurate.

A programmer or a computing scientist should proceed like a scientist. Since the representation of data plays a central role in the work of a programmer, the key is to find an accurate data representation of the real-world information. The best way to get there in complicated situations is to develop the representation in an iterative manner, starting with the essential elements and adding more attributes when the current model is fully understood.

In this book, we have encountered iterative refinement in many of our extended exercises. For example, the exercise on moving pictures started with simple circles and rectangles; later on we developed programs for moving entire lists of shapes. Similarly, we first introduced Web pages as a list of words and embedded Web pages; in section 15.3 we refined the representation of embedded Web pages. For all of these exercises, however, the refinement was built into the presentation.

This section illustrates iterative refinement as a principle of program development. The goal is to model a file system. A file system is that part of the computer that remembers programs and data when the computer is turned off. We first discuss files in more detail and then iteratively develop three data representations. The last part of the section suggests some programming exercises for the final model. We will use iterative refinement again in later sections.

16.1  Data Analysis

When we turn a computer off, it should remember the functions and the data we worked on. Otherwise we have to reenter everything when we turn it on again. Things that a computer is to remember for a long time are put into files. A file is a sequence of small pieces of data. For our purposes, a file resembles a list; we ignore why and how a computer stores a file in a permanent manner.

[curriculum2-Z-G-4.gif]

Figure 44:  A sample directory tree

It is more important to us that, on most computer systems, the collection of files is organized in directories.40 Roughly speaking, a directory contains some files and some more directories. The latter are called subdirectories and may contain yet more subdirectories and files, and so on. The entire collection is collectively called a file system or a directory tree.

Figure 44 contains a graphical sketch of a small directory tree.41 The tree's root directory is TS. It contains one file, called read!, and two subdirectories, called Text and Libs. The first subdirectory, Text, contains only three files; the latter, Libs, contains only two subdirectories, each of which contains at least one file. Each box has one of two annotations. A directory is annotated with DIR, and a file is annotated with a number, which signifies the file's size. Altogether TS contains seven files and consists of five (sub)directories.

Exercise 16.1.1.   How many times does a file name read! occur in the directory tree TS? What is the total size of all the files in the tree? How deep is the tree (how many levels does it contain)?    Solution

16.2  Defining Data Classes and Refining Them

Let's develop a data representation for file systems using the method of iterative refinement. The first decision we need to make is what to focus on and what to ignore.

Consider the directory tree in figure 44 and let's imagine how it is created. When a user first creates a directory, it is empty. As time goes by, the user adds files and directories. In general, a user refers to files by names but thinks of directories as containers of other things.

Model 1: Our thought experiment suggests that our first and most primitive model should focus on files as atomic entities, say, a symbol that represents a file's name, and on the directories' nature as containers. More concretely, we should think of a directory as just a list that contains files and directories.

All of this suggests the following two data definitions:

A file is a symbol.

A directory (short: dir) is either

  1. empty;

  2. (cons f d) where f is a file and d is a dir; or

  3. (cons d1 d2) where d1 and d2 are dirs.

The first data definition says that files are represented by their names. The second one captures how a directory is gradually constructed by adding files and directories.

A closer look at the second data definition shows that the class of directories is the class of Web pages of section 14.3. Hence we can reuse the template for Web-page processing functions to process directory trees. If we were to write a function that consumes a directory (tree) and counts how many files are contained, it would be identical to a function that counts the number of words in a Web tree.

Exercise 16.2.1.   Translate the file system in figure 44 into a Scheme representation according to model 1.    Solution

Exercise 16.2.2.   Develop the function how-many, which consumes a dir and produces the number of files in the dir tree.    Solution

Model 2: While the first data definition is familiar to us and easy to use, it obscures the nature of directories. In particular, it hides the fact that a directory is not just a collection of files and directories but has several interesting attributes. To model directories in a more faithful manner, we must introduce a structure that collects all relevant properties of a directory. Here is a minimal structure definition:

(define-struct dir (name content))

It suggests that a directory has a name and a content; other attributes can now be added as needed.

The intention of the new definition is that a directory has two attributes: a name, which is a symbol, and a content, which is a list of files and directories. This, in turn, suggests the following data definitions:

A directory (short: dir) is a structure:

 (make-dir n c) 
where n is a symbol and c is a list of files and directories.

A list-of-files-and-directories (short: LOFD) is either

  1. empty;

  2. (cons f d) where f is a file and d is a LOFD; or

  3. (cons d1 d2) where d1 is a dir and d2 is a LOFD.

Since the data definition for dir refers to the definition for LOFDs, and the definition for LOFDs refers back to that of dirs, the two are mutually recursive definitions and must be introduced together.

Roughly speaking, the two definitions are related like those of parent and list-of-children in section 15.1. This, in turn, means that the design recipe for programming from section 15.2 directly applies to dirs and LOFDs. More concretely, to design a function that processes dirs, we must develop templates for dir-processing functions and LOFD-processing functions in parallel.

Exercise 16.2.3.   Show how to model a directory with two more attributes: a size and a systems attribute. The former measures how much space the directory itself (as opposed to its files and subdirectories) consumes; the latter specifies whether the directory is recognized by the operating system.    Solution

Exercise 16.2.4.   Translate the file system in figure 44 into a Scheme representation according to model 2.    Solution

Exercise 16.2.5.   Develop the function how-many, which consumes a dir according to model 2 and produces the number of files in the dir tree.    Solution

Model 3: The second data definition refined the first one with the introduction of attributes for directories. Files also have attributes. To model those, we proceed just as above. First, we define a structure for files:

(define-struct file (name size content))

Second, we provide a data definition:

A file is a structure:

 (make-file n s x) 
where n is a symbol, s is a number, and x is some Scheme value.

For now, we think of the content field of a file as set to empty. Later, we will discuss how to get access to the data in a file.

Finally, let's split the content field of dirs into two pieces: one for a list of files and one for a list of subdirectories. The data definition for a list of files is straightforward and relies on nothing but the definition for files:

A list-of-files is either

  1. empty, or

  2. (cons s lof) where s is a file and lof is a list of files.

In contrast, the data definitions for dirs and its list of subdirectories still refer to each other and must therefore be introduced together. Of course, we first need a structure definition for dirs that has a field for files and another one for subdirectories:

(define-struct dir (name dirs files))

Here are the data definitions: A dir is a structure:

 (make-dir n ds fs) 
where n is a symbol, ds is a list of directories, and fs is a list of files.

A list-of-directories is either

  1. empty or

  2. (cons s lod) where s is a dir and lod is a list of directories.

This third model (or data representation) of a directory hierarchy captures the nature of a file system as a user typically perceives it. With two structure definitions and four data definitions, it is, however, far more complicated than the first model. But, by starting with a the simple representation of the first model and refining it step by step, we have gained a good understanding of how to work with this complex web of classes. It is now our job to use the design recipe from section 15.2 for developing functions on this set of data definitions. Otherwise, we cannot hope to understand our functions at all.

16.3  Refining Functions and Programs

The goal of the following sequence of exercises is to develop several common utility functions for directory and file systems, using our third and most refined model. Even though these functions process Scheme-based representations of files and directories, they give us a good idea how such real-world programs work.

Exercise 16.3.1.   Translate the file system in figure 44 into a Scheme representation. Remember to use empty for the content of the files.    Solution

To make the exercise more realistic, DrScheme supports the teachpack dir.ss. It introduces the two necessary structure definitions and a function to create representations of directories according to our third model:

;; create-dir : string  ->  dir
;; to create a representation of the directory that a-path specifies:
;; 1. Windows: (create-dir "c:\\windows") 
;; 2. Mac: (create-dir "My Disk:") 
;; 3. Unix: (create-dir "/home/scheme/")
(define (create-dir a-path) ...)

Use the function to create some small and large examples based on the directories in a real computer. Warning: For large directory trees, DrScheme may need a lot of time to build a representation. Use create-dir on small directory trees first. Do not define your own dir structures.

Exercise 16.3.2.   Develop the function how-many, which consumes a dir (according to model 3) and produces the number of files in the dir tree. Test the function on the directories created in exercise 16.3.1. Why are we confident that the function produces correct results?    Solution

Exercise 16.3.3.   Develop the function du-dir. The function consumes a directory and computes the total size of all the files in the entire directory tree. This function approximates a true disk-usage meter in that it assumes that directories don't require storage.

Refine the function to compute approximate sizes for subdirectories. Let's assume that storing a file and a directory in a dir structure costs 1 storage unit.    Solution

Exercise 16.3.4.   Develop the function find?, which consumes a dir and a file name and determines whether or not a file with this name occurs in the directory tree.

Challenge: Develop the function find. It consumes a directory d and a file name f. If (find? d f) is true, it produces a path to the file; otherwise it produces false. A path is a list of directory names. The first one is that of the given directory; the last one is that of the subdirectory whose files list contains f. For example:

(find TS 'part3)
;; expected value: 
(list 'TS 'Text)

(find TS 'read!)
;; expected value: 
(list 'TS)

assuming TS is defined to be the directory in figure 44.

Which read! file in figure 44 should find discover? Generalize the function to return a list of paths if the file name occurs more than once. Each path should lead to a different occurrence, and there should be a path for each occurrence.    Solution


40 On some computers, a directory is called a folder.

41 The picture explains why computer scientists call such directories trees.