Elisp Code Analyzer

Warning: I have not yet selected a license to distribute this stuff. It's a work-in-progress, remember? I'll let you view this stuff from my web page, but I do not give permission to redistribute in its current form.

50,000 foot level design

Layer 1: the “where defined” database

This layer provides a C file scanner and a Lisp file scanner. Both produce a database mapping Lisp function and variable names to files, and a mapping from files to a Lisp form, usually of the form (featurep 'foo) or the form t, that describes when that file is accessible.

The C scanner must filter out comments, keep track of #ifdef#endif forms, scan for DEFUN and DEFVAR, and scan for invocations of Fprovide.

The Lisp scanner must store variants for both variables and functions. For variables, the variants are: constant and variable. For functions, the variants are: macro, subst, and function.

Layer 2: the dependency checker

This problem is currently too hard to solve for the C sources. When my C source analyzer is ready, we can plug it in to do the same job. In the meantime, though, we punt on C code.

This layer provides a Lisp scanner that, for each function, produces a list of functions called and a list of variables accessed. These lists are then matched up with the name→file mapping to produce a list of file dependencies. Note that a called macro induces a build time dependency on the file containing the macro definition, and a runtime dependency on any functions called or variables accessed by the macro expansion.

We must be able to filter out calls to Emacs functions that are not made when the code is used on XEmacs. This means that we have to do some constant propagation, and know that (featurep 'xemacs) is always true, that running-xemacs is a constant with value true, and also be able to filter out the various string-match calls that determine flavor.

Layer 3: the call graph generator

Using the information produced in layer 2, we next construct a call graph. The edges are two-way, to support analysis of code that throws/signals. That is, each function has a list of functions that it calls, as well as a list of functions that call it.

Layer 4: the type analyzer

Next, we walk the call graph from the “bottom” (functions that don't call other functions) to the “top” (functions that are not called by other functions). As we walk, we check types on arguments versus expected types. Unfortunately, recursive and mutually recursive functions mess up this scheme. For such functions, we take our best guess, then reanalyze after pushing the best guess through. This is a far sight less sophisticated than the Haskell approach, but there's only so much we can do here. I suppose we could keep going around recursive loops until nothing changes …

Elisp Type Representation

TODO: Deal with throw/signal/error/…

Atomic types

The atomic types are those types defined in C where the name of the type describes the type entirely. The atomic types include the type null, whose only member is the constant nil. Note that the constant nil is not a member of any other type. This makes function type descriptions more verbose than they would be otherwise, but enables checks for misuses of nil. The atomic types also include the all-inclusive type t, which is the ultimate parent type of all other types.

Nonatomic types

The nonatomic types optionally supply additional information beyond the name of the type.

Bounds

A bound on type T indicates an upper or lower bound on type T that is a subtype of number. Each bound takes one of the following forms:

  1. A single element of type T, denoting an inclusive bound.
  2. A list containing a single element of type T, denoting an exclusive bound.
  3. An asterisk (*), indicating the absence of any bound.

Intervals

An interval on type T indicates a permitted range on type T, with a lower bound and an upper bound. An interval is a dotted pair of bounds. For example:

  1. (-20 . 5) indicates the fixnums -20, -19, …, 5.
  2. (-20 . (5)) indicates the fixnums -20, -19, …, 4.
  3. (-20 . *) indicates the fixnums -20, -19, …, most-positive-fixnum
  4. (0.0 . 10.0) indicates all floating point numbers ≥ 0.0 and ≤ 10.0.
  5. (0.0 . (10.0)) indicates all floating point numbers ≥ 0.0 and < 10.0.
  6. (* . *) indicates all elements of type T.

Length

A sequence length is one of the following:

  1. *, meaning the length is unknown or unbounded.
  2. A nonnegative integer, describing an exact length.
  3. An interval, describing minimum and maximum possible lengths.

Sequence types

Sequence element types are defined in one of the following ways:

  1. A type descriptor: indicates that every element of the sequence is of that type.
  2. (types type1 …): indicates that the sequence elements are to be matched up with successive types from the list of types. Each type in the list is a type descriptor. The keyword :repeat may appear at most once in this list; if it is not present, it is deemed to appear immediately after the symbol types. That keyword indicates that all following type descriptors should be repeated, in order, as many times as necessary to account for all members of a list.

Type descriptors

Type descriptors are similar, but not identical, to Common Lisp type descriptors. In particular, an Elisp type descriptor is one of the following.

Type name

A symbol naming a type is a complete type descriptor. If the symbol names a nonatomic type, then the type parameters are deemed to range over all possible values.

Sequence

Sequence types can be restricted with the following syntax.

  1. (bit-vector length): indicates a bit-vector with a particular length. Note that the type descriptors bit-vector, (bit-vector *), and (bit-vector (* . *)) are equivalent.
  2. (string length): indicates a sequence of characters of the indicated length.
  3. (list type length): indicates a list, where all elements of the list have the type indicated by type and the list has the indicated length.
  4. (vector type length): similar to list.

Cons cell

In the usual Lisp type taxonomy, the types cons and list differ in only one respect: nil is a list, but not a cons. Since we have already split null out as a separate type, this distinction is not useful. We instead use the cons type to describe dotted lists only; i.e., when the cdr of the cons cell is not another cons cell. In particular, the type descriptor (cons car cdr) contains type descriptors car and cdr, where cdr is not a subtype of either cons or list.

Numeric type

The numeric types can include an interval to show that a limited range of that type is allowed. For example:

  1. (fixnum (0 . 1)): indicates that the value must be 0 or 1.
  2. (integer (-4000000000 . 4000000000)): indicates an integer that is at at least -4 billion, but no more than 4 billion.
  3. (float (0.0 . (1.0))): indicates nonnegative floating point numbers less than 1.0.

Function type

A function type includes a parameter list, which is a list of type descriptors, possibly interspersed with the keywords &optional and &rest. As is the case with Elisp function declarations, if both &optional and &keyword appear in a parameter list, then &rest must appear last. Both keywords must be followed immediately by at least one type descriptor, and &rest must be followed by exactly one type descriptor.

Type names fun, subr, and macro correspond to definitions created with defun, defsubr, and defmacro, respectively. Each can appear as a bare symbol, meaning that any function is possible, or can appear in a list of the form (fun parameters return-type), where parameters is a list of parameter types, with optional keywords as described above, and return-type describes the return type of the function or macro.

Enumerated type

A type descriptor of the form (member obj1 …) indicates the type consisting exactly of the listed objects. The predicate eql is used to compare objects with elements of this list.

Symbol

A type descriptor can indicate that a symbol is expected simply by specifying the type symbol. However, in some cases, the requirement is a symbol bound to a particular value or function. For those cases, we use the following syntax.

  1. (symbol :value type): specifies the type of the value to which the symbol is bound.
  2. (symbol :function type): specifies the type of the function binding for the symbol.
  3. (symbol :value type :function type): specifies both.

Boolean expression

Boolean expressions can be used to combine type descriptors in the following ways.

  1. (and type1 …): indicates the intersection of the following type descriptors.
  2. (or type1 …): indicates the union of the following type descriptors.
  3. (not type): indicates that any element not of the indicated type may appear.

For example, the type descriptor (and t (not null)) matches any object except nil.

Type variable

A type variable is used to indicate that the same type must appear in multiple places. Type variables are indicated with Lisp keywords; i.e., symbols that start with a colon. For example, the type descriptor (fun (:A) :A) indicates a function of one parameter whose return type is the same as the type of its parameter.

Type variables may be restricted by requiring them to stand for some limited set of types. This is accomplished by wrapping the type descriptor with a restrict clause, which lists all restrictions on type variables appearing in an enclosed type descriptor. For example:

  1. (restrict ((:A float)) (fun (:A fixnum) :A)): indicates a function of two parameters, where the first parameter must be some subtype of float (e.g., (float (0.0 . (1.0)))) and the second parameter must be a fixnum. The return value of the function has the same type as its first argument.
  2. (restrict ((:A fixnum) (:B vector)) (fun (:A :B) (cons :A :B))): indicates a function of two parameters. The first parameter can have any type that is a subtype of fixnum (e.g., (fixnum (0 . 1))) and the second parameter can be of any vector type. The return type of the function is a dotted list whose car has the type of the first parameter and whose cdr has the type of the second parameter.

FIXME: They aren't just type variables! They are variables that match the corresponding parts of type descriptors. They match entire type descriptors in some places, but can be exploded.

FIXME: Describe the special cases:

TODO: Type simplification

TODO: Definitional groups

Describes when definitions appear together as a group. We need a syntax to say that a given provide causes a bunch of symbols to appear together, or the loading of a given file, or, for C definitions, a given preprocessor symbol. We may need subgroups, which add a constraint to a parent group, in order to specify conditionally defined symbols.

Lame attempt at implementing in Common Lisp

Lame attempt at implementing in XEmacs


Last modified: Tue May 19 17:09:16 MDT 2009 by Jerry James