Mr. Paul M. Yates and Associate Professor Ian D. Bishop
Department of Geomatics,
University of Melbourne,
Parkville, VIC, 3052, Australia
Email: firstname.lastname@example.org, Ian.Bishop@eng.unimelb.edu.au
Often modelling problems are most easily specified by a declaration of the required outcomes. Such a declarative specification of a problem can sometimes be used for automating the construction of software that computes the required outcomes. This paper investigates issues relating to the design and implementation of a system that takes a declarative specification of a problem and attempts to generate a computational solution to the problem. An interesting aspect of the solution described in the paper is the use of a deductive database environment in the implementation of a prototype execution plan construction tool and software generation tool. The paper includes a number of examples highlighting the important features of the design and operation of the prototype.
Often modelling problems are most easily specified by a declaration of the required outcomes. For example, "Provide me with a raster image showing the slope value in each cell", or, "Show me the best way to allocate my delivery trucks to routes so as to minimise the cost of delivering stock to customers while maximising the satisfaction of the customers" are specifications for particular problems. These specifications do not refer specifically to procedural methods for the computation of a result; they simply declare the requirements, and, hence, are called declarative specifications. Providing answers to such problems requires the development of both models of the data and models for the computation of the result. The responsibility for providing a software solution to such problems is usually allocated to a software engineer. A software engineer will take a declarative specification of a problem, a model for the data and a model for computation of a solution (often associated with some mathematical theory), and create a procedural solution to the problem. Software is then developed following the procedural solution.
Declarative specifications of problems/queries are commonly used for requesting data from standard relational and object-relational database management systems. The relational calculus is an example of such a language used for the specification of queries. Such languages are translated into a sequence of operations that can be executed on the underlying data model to provide an answer to the query. For example, a relational calculus specification is translated into relational algebra expressions operating on relations to provide answers to queries. Note that in this case the underlying data model for both the calculus and algebraic languages is the relational data model.
This paper addresses issues relating to the development of an environment in which a declarative specification of a problem incorporating geographic data and modelling systems can be translated into an operational system. The approach we have taken consists of four major components: a metadata database, an execution plan generator, a graphical user interface and a code generator.
An interesting attribute of the tool described here is the use of a deductive database system to compute possible solutions of a given problem. Deductive databases "originate from the fusion of database technology and logic programming (e.g. Prolog)" (Abiteboul, Hull & Vianu 1995). In this case the basic objects found in the deductive database include the entities object and solver described in Abel, Taylor & Kuo (1997). A problem is identified by specifying the required output objects. A query specification, written in Prolog, is used to construct execution plans (sequences of solver invocations). Given the appropriate input, executing a plan arrives at a state in which every identified output object has a method for its computation. It is possible that several execution plans exist to solve a given problem. These plans are presented to the user through a graphical user interface so that the most appropriate solution path can be chosen.
The selected execution plan is then used as the basis for code generation. Associated with each solver stored in the database is a template method for the invocation of that solver. This template along with additional metadata is used for code generation. The additional metadata used includes the location of the solver program, the underlying data model and the input and output data structures.
Solving problems using models incorporating geographic data often requires the use of several software components. Recently there has been much interest in the integration/interoperation of separately developed software systems. Often each system to be integrated has been developed to meet its own requirements. To date, these requirements have sometimes included specifications defining the ability to interoperate with external software systems. Rarely has the interoperability requirement been specified in terms of some common standard for interoperation (e.g., the Common Object Request Brokerage Architecture (CORBA) (Object Management Group 1998) and the Distributed Component Object Model (DCOM) (Brown & Kindel 1998)). Until recently, for many software systems an interoperability requirement has been seen as secondary. It has not been seen as essential to the proper operation of the software.
There are many methods that can be used to integrate separately developed pieces of software. The CORBA FAQ (Appelbaum, Cline & Girou 1997) identifies (amongst others) the following methods:
These mechanisms provide generic solutions for the integration of software systems. They do not themselves provide mechanisms to translate a specification of a problem into an executable solution. They are important because they often form part of a solution to a problem. In the current prototype system, the code makes use of a message model for communication and a simple protocol for interoperation (see Yates & Bishop 1997 and Tang, Bishop & Yates 1998).
Models and data sets linked together can be used to form other data sets. For example, population, housing and employment data sets may be linked together with an urban model measuring disposable income to form an instance of a disposable income data set. This data set may be used as input to other models or information systems. The software tool provides for such a network of data set generation by allowing the source of a data set to be the output of a model.
The tool provides a graphical interface in which data set types (objects) and modelling systems (solvers) are represented using icons. A set of object icons can be selected that represents the output required for solution to a given problem. The execution plans generated for a given problem are displayed by linking several solvers together. Code is then generated that can be interpreted or compiled for execution depending upon the requirements of a given application. It is usually the case that data and software reside on different machines, hence, the code generated makes use of systems providing solutions for distributed computation.
Our solution to the problem of integrating solvers to produce new solutions to problems consists of four central components. These are the metadata database, an execution plan generator, the graphical user interface and the code generator. The essential operation and interaction of each of the components is as follows:
The operation of the graphical user interface is quite simple. One or more icons for data set types are selected as the output of a given modelling activity. Similarly, one or more icons are selected as the starting point of a given activity. The execution plan generator is activated via a menu selection. Once generated, both textual and graphical descriptions of the execution plans are presented to the user for verification. The textual description comprises a sequence of invocations of solvers.
Verification of a plan generated by the execution plan builder is an important step for a number of reasons. One reason is that there may be more than one solution to a given problem. For example, generating a slope model at a particular resolution from a digital elevation model (DEM) at a higher resolution could be achieved in two ways. In the first method, the DEM could be generalised to the required resolution using some re-sampling technique. Then, the slope model would be generated from the generalised DEM. In the second method, a slope model could be generated at the higher resolution from the original DEM; then, the generalised slope model would be generated from the slope model derived from the DEM. One can see that the results from these two processes could differ; the choice of which process is the correct process is beyond the scope of the system described in this paper.
A second consideration is that there may be many options to select for a given solution. For example, the process of generalising a grid based representation of a data set has multiple options for the type of generalisation that occurs, e.g., ARC/INFO allows selection of the algorithm (e.g. bilinear, cubic) for resampling a grid. Such parameters need to be chosen by the model builder/integrator. These considerations highlight the need for some form of verification by the integrator/model builder of the suggested solution, and for appropriate selections to be made for a solution and its alternative options. Only when options have been selected by the model builder can software generation take place.
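The resample/slope example above can be made concrete with a toy numeric illustration (our construction, not from the paper) showing that generalise-then-slope and slope-then-generalise need not agree. Here "slope" is a crude one-dimensional finite difference and "generalise" keeps every second cell, standing in for real resampling algorithms:

```python
# Toy 1-D stand-ins for the real grid operations (illustrative only).
dem = [0.0, 10.0, 2.0, 12.0, 4.0]   # a bumpy 1-D elevation profile

def slope(profile):
    # absolute cell-to-cell gradient
    return [abs(b - a) for a, b in zip(profile, profile[1:])]

def generalise(profile):
    # naive 2x down-sampling: keep every second cell
    return profile[::2]

a = slope(generalise(dem))   # generalise first, then compute slope
b = generalise(slope(dem))   # compute slope first, then generalise
print(a, b)                  # the two orderings give different values
```

The down-sampled profile smooths out the bumps before the gradient is taken, so the first ordering reports much gentler slopes than the second; neither is "wrong", which is precisely why the choice is left to the model builder.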
The purpose of the metadata database is to maintain information about objects and solvers that can be used when constructing execution plans developed as solutions to particular problems. The metadata database contains relations for all objects, solvers (including newly developed execution plans), and known instances of objects. The metadata is used in the generation of solutions for given problems. Our implementation uses the PostgreSQL (The PostgreSQL Organisation 1998) database management system to manage the metadata database.
The essential components of the database are the objects and the solvers. In the database an object is a set of attribute types; a solver is identified by an input set and an output set of objects. A solver also has an associated set of parameters. An important attribute of a solver is its template code. This is used for code generation. The template code can contain variables. These variables come from the parameter set of the solver and the input and output object sets.
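The two core records can be sketched as data structures. The following Python sketch is ours, for illustration only; the names and field choices are assumptions, not the paper's actual database schema:

```python
# Illustrative sketch of the metadata records: an object is a set of
# attribute types; a solver maps an input object set to an output object
# set and carries parameters plus a code template.
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectType:
    attributes: frozenset        # e.g. frozenset({"dem", "resolutionX"})

@dataclass
class Solver:
    name: str
    inputs: frozenset            # set of input ObjectTypes
    outputs: frozenset           # set of output ObjectTypes
    parameters: tuple = ()       # e.g. ("machine", "location")
    template: str = ""           # template code containing variables

dem = ObjectType(frozenset({"dem", "resolutionX"}))
slope = ObjectType(frozenset({"slope", "resolutionX"}))
slope_cmd = Solver("slope_command",
                   inputs=frozenset({dem}),
                   outputs=frozenset({slope}),
                   parameters=("machine", "location"),
                   template="open arc {machine}\narc.exec workspace {location}")
print(slope_cmd.name)
```

The template's variables ({machine}, {location}) come from the solver's parameter set, mirroring the description above.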
The concept of an interface repository defined in the CORBA specification (Object Management Group 1998) has much in common with the metadata database described in this paper. The interface repository maintains collections of interface definitions. Such interface definitions are defined using the Interface Definition Language (IDL). It may be possible for our prototype tool to use an implementation of an interface repository as the metadata database. This is under further investigation.
Abel, Taylor & Kuo (1997) suggest that "a solver plan could be generated by an exhaustive enumeration algorithm." The solution generator described here makes use of a Prolog specification for objects and solvers stored in the metadata database. This has enabled us to quickly prototype a solution generator that searches for all possible solutions to a specified problem. Given an instantiation of the relation called directly_solves (provided in the metadata database), a set of given objects and a set of required objects, the following Prolog program computes alternative solutions for a given problem (where they exist). For the prototype we have made use of the XSB Prolog interpreter (Sagonas et al. 1998). The XSB system is easily integrated with the PostgreSQL database and the Java graphical user interface via external C calls. The core execution plan generator code is as follows:
solves(Given, Required, Needing, Solution) :-
    is_subset(Needing, Given),
    directly_solves(Needing, Out, Solution),
    is_subset(Required, Out).
solves(Given, Required, Needing, Solution) :-
    is_subset(In, Given),
    directly_solves(In, Out, Solution1),
    \+ is_subset(Out, Given),
    \+ is_subset(Required, Out),
    union(Given, Out, NewGiven),
    subtract(Required, Out, NewRequired),
    solves(NewGiven, NewRequired, Needed, Solution2),
    intersect(Out, Needed, OutNeeding),
    \+ OutNeeding = [],
    union(In, Needed, Needing),
    append(Solution2, Solution1, Solution).
The specification of the Solves relation has two clauses. A problem requiring solution is defined by the sets of Given and Required objects. The first clause specifies that a solution may be found by searching for a direct solution. A solution is a direct solution if there is a solver in the database for which the input is a subset of the Given objects and the output is a subset of the Required objects. As indicated by the second clause, it is possible for indirect solutions to exist. Such solutions make use of a sequence of applications of solvers until all the requirements have been computed.
A simple way of viewing the operation of this Prolog program is by considering an analogy of finding paths in a graph. Consider a graph where: each node represents a set of known data; and each edge represents a solver taking one set of known input data to a new set of known data that includes the output data of the solver. If for a given set of input data there is a solution to a problem, then there is a path between the nodes corresponding to the set of input data and to the set of data that includes both the input and output data. This occurs if they are directly connected (the first clause of the program), or if they are indirectly connected. Finding a solution for a given problem corresponds to the computation of the transitive closure of the relation between input and output data sets of solvers.
The calls to directly_solves and the recursive call to solves form the important sections of code for computing a solution. The first clause identifies the case where a direct solution to a given problem can be constructed. Note that the set of given objects that are used in finding a direct solution can be a subset of all of the given objects. Also note that the set of output objects of the direct solution can be a superset of the required objects.
In the second clause, the solver generator adds to the known objects those that can be directly solved from the current set of given objects. Attempts to solve the problem with the new set of given objects follow. Note that, as the direct solution found might meet some of the requirements, the partial solution needs only to find a reduced set of requirements. Hence, we subtract the output set of the directly_solves predicate from the required set. Also, the set of given objects is expanded for use in the solves predicate appearing in the second clause. Finally, note that the partial solution found by the solves predicate in the second clause must make use of newly added values in the set of given objects. This ensures that the predicate solves for a true path through the solution space by making sure that the next step in the solution always expands upon the current step.
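The search performed by solves/4 can be re-sketched in Python (our paraphrase, for illustration): repeatedly apply any solver whose inputs are covered by the known objects and which adds something new, until all required objects are known. Unlike the Prolog version, this sketch does not prune steps whose output is never used, so it may also enumerate plans containing redundant steps; the solver names and ground facts below are our own encodings of the paper's two-attribute example:

```python
# Exhaustive plan search over directly_solves facts.
# Each fact is (inputs, outputs, solver_name) with objects as tuples.
def plans(directly_solves, given, required, plan=()):
    given, required = frozenset(given), frozenset(required)
    if required <= given:
        yield list(plan)
        return
    for inputs, outputs, solver in directly_solves:
        # the solver must be applicable and must add new objects
        if inputs <= given and not outputs <= given:
            yield from plans(directly_solves, given | outputs,
                             required - outputs, plan + (solver,))

# Ground instances standing in for the paper's slope/resample example.
db = [
    (frozenset({("dem", "rX")}), frozenset({("slope", "rX")}), "slope_command"),
    (frozenset({("dem", "rX")}), frozenset({("dem", "rY")}), "resample_dem"),
    (frozenset({("dem", "rY")}), frozenset({("slope", "rY")}), "slope_command"),
    (frozenset({("slope", "rX")}), frozenset({("slope", "rY")}), "resample_slope"),
]
found = list(plans(db, {("dem", "rX")}, {("slope", "rY")}))
print(found)
```

Among the plans found are the two paths discussed later in the paper: resample the dem and then compute slope, or compute slope and then resample it.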
Once a solution to a problem is found, a new data set can be added to the metadata database. This data set is represented by the execution plan for the solved problem. As such it forms an object similar to a view in a relational database.
As indicated earlier, it is possible for zero, one or more execution plans to exist for a problem. If, for at least one selected plan, the steps for execution of each of the solvers in the plan are known, then it is possible to generate code that implements the execution plan by invoking each of the solvers in the plan. Associated with each solver is a template implementation specifying how to generate code for interoperation of the solver. In the current prototype these templates are quite simple. In fact, they can be expressed as predicates in the deductive database. For example, the slope_template for computing slope values of grids using ARC/INFO is as follows:
slope_template(Machine, Location, Input, Output) :-
    write('open arc '), write(Machine), nl,
    write('arc.exec workspace '), write(Location), nl,
    write('arc.exec grid'), nl,
    write('arc.exec '), write(Output), write(' = slope('), write(Input), write(')'), nl,
    write('arc.exec quit'), nl,
    write('arc.exec quit'), nl,
    write('close arc'), nl.
In the prototype these templates have been developed using a simple protocol for interoperation consisting of open, send, exec and mode commands.
Note that the predicate definition of slope_template has a number of parameters. These include the machine name, the location of the workspace containing the grids, the input grid and the output grid. Other options (not thoroughly explored in the prototype), such as the selection of the mechanism for the integration of the software systems, may be supported if appropriate templates are provided. For example, the chosen mechanism could follow the CORBA standard (Object Management Group 1998) by providing a template that makes use of that method for interoperation.
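What the slope_template predicate amounts to can be paraphrased in Python (our sketch, not part of the prototype): substituting the machine, workspace and grid names into a fixed command script for the open/exec/close interoperation protocol.

```python
# Python paraphrase of the Prolog slope_template: parameter substitution
# into a fixed protocol script.
def slope_template(machine, location, input_grid, output_grid):
    return "\n".join([
        f"open arc {machine}",
        f"arc.exec workspace {location}",
        "arc.exec grid",
        f"arc.exec {output_grid} = slope({input_grid})",
        "arc.exec quit",           # leave the grid module
        "arc.exec quit",           # leave arc
        "close arc",
    ])

script = slope_template("daisy", "/home/paul/grids",
                        "victoria_grid", "victoria_slope")
print(script)
```

Running this reproduces the generated code shown for the first example below.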
A simple example is where the database contains the fact

directly_solves([[dem,X]],[[slope,X]],[slope_command]).
This fact relates objects with multiple attributes. The attributes include the grid type and the resolution of the grid (X represents, in this case, any resolution). The fact simply states that there is a solver called slope_command that can be used for taking a dem at any resolution and producing a slope model of that dem at the same resolution. By specifying the query

solves([[dem,resolutionX]],[[slope,resolutionX]],X,Y).

at the Prolog prompt the answer,
X = [[dem,resolutionX]]
Y = [slope_command]
is returned. This is the only solution found. The solution consists of the invocation of one solver, slope_command, i.e., to solve for the slope map given a grid map at the same resolution, one uses the slope command. The solution will make use of a dem at resolutionX. Now, constructing the invocation code for the slope_command consists of the query

slope_template(daisy,'/home/paul/grids',victoria_grid,victoria_slope).

This will result in the code:
open arc daisy
arc.exec workspace /home/paul/grids
arc.exec grid
arc.exec victoria_slope = slope(victoria_grid)
arc.exec quit
arc.exec quit
close arc
This example is relatively simple! A slightly more complex example includes the additional fact

directly_solves([[X,resolutionX]],[[X,resolutionY]],[resample_command]).
and uses the query

solves([[dem,resolutionX]],[[slope,resolutionY]],X,Y).
Note that the input and output resolutions are different. The following results are returned by the query:
X = [[dem,resolutionX],[dem,resolutionY]]
Y = [resample_command,slope_command];

X = [[dem,resolutionX],[slope,resolutionX]]
Y = [slope_command,resample_command]
Now there are two possible paths that can be taken to reach the goal. Also note that there are additional objects placed in the Needing list (bound to X). For the first execution plan this is the object [dem,resolutionY]. This object is an intermediate object: it is computed by the application of the resample command and is then used as the input to the slope command.
After selection of the appropriate plan, code can be generated for the resample command as well as the slope command. The resample template has the same form as the slope template, except that the command executed is the resample command, which has two additional parameters, cell size and resampling algorithm. Note that in the prototype the user must provide and control the input and output parameters of each of the generated code segments. Furthermore, the simple code generation prototype does not merge separate templates for commands. For a machine called daisy, workspace /home/paul/grids, input grid victoria_dem, output grid victoria_slope, output cell size 50 and resampling algorithm NEAREST, the code generated for this second example is:
open arc daisy
arc.exec workspace /home/paul/grids
arc.exec grid
arc.exec temp = slope(victoria_dem)
arc.exec quit
arc.exec quit
close arc
open arc daisy
arc.exec workspace /home/paul/grids
arc.exec grid
arc.exec victoria_slope = resample(temp,50,NEAREST)
arc.exec quit
arc.exec quit
close arc
Note that the code generated from the prototype system is not very efficient. Clearly, a better solution would be:
open arc daisy
arc.exec workspace /home/paul/grids
arc.exec grid
arc.exec temp = slope(victoria_dem)
arc.exec victoria_slope = resample(temp,50,NEAREST)
arc.exec quit
arc.exec quit
close arc
This will require much more complex definitions of the templates for code generation. Furthermore, in this example the two grid commands could be replaced by the command victoria_slope = resample(slope(victoria_dem),50,NEAREST). At present, using the prototype, the only way to optimise the result of code generation is to create a new solver and edit the generated code to create an optimised template to perform the whole process.
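The kind of peephole optimisation the prototype lacks can be sketched as follows (our illustration, not part of the prototype): when one generated block closes a session that the next block immediately reopens with the same preamble, the close/open pair can be dropped so both commands run in a single session.

```python
# Merge two generated command blocks when the first ends a session that
# the second immediately reopens with an identical preamble.
def merge_blocks(block1, block2):
    tail = ["arc.exec quit", "arc.exec quit", "close arc"]
    head = block2[:3]                 # open arc ..., workspace ..., grid
    if block1[-3:] == tail and block1[:3] == head:
        return block1[:-3] + block2[3:]
    return block1 + block2            # otherwise just concatenate

slope_block = [
    "open arc daisy",
    "arc.exec workspace /home/paul/grids",
    "arc.exec grid",
    "arc.exec temp = slope(victoria_dem)",
    "arc.exec quit", "arc.exec quit", "close arc",
]
resample_block = [
    "open arc daisy",
    "arc.exec workspace /home/paul/grids",
    "arc.exec grid",
    "arc.exec victoria_slope = resample(temp,50,NEAREST)",
    "arc.exec quit", "arc.exec quit", "close arc",
]
merged = merge_blocks(slope_block, resample_block)
print("\n".join(merged))
```

Applied to the two blocks of the second example, this produces exactly the eight-line "better solution" shown above.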
The grid type in this example was constructed with two attributes, grid type and resolution. An alternative view of types would allow for the construction of a hierarchy of types. For example, the type dem would be a specialisation of the grid. The same would follow for the type slope. This would have several implications for the definition of the execution plan generator. Such implications are the subject of further research.
It should be clear that, although the examples found in this section use only ARC/INFO commands, the system is not limited to only ARC/INFO commands. The generation of code depends only on the template that is provided to the code generator. We will highlight this by the following example.
First, suppose we have a special slope model constructor that operates on TIF image representations of a dem. The above description of a grid type must now be expanded to include a type attribute. The problem is to take an ARC/INFO grid version of a dem and compute a slope model using our special slope model constructor and return the slope model back to ARC/INFO. In this case, there is a need to use a tool for data translation between systems. The database facts for the solvers are
directly_solves([[X,Y,arcgrid]],[[X,Y,tifimage]],[gridimage_command]).
directly_solves([[X,Y,tifimage]],[[X,Y,arcgrid]],[imagegrid_command]).
directly_solves([[X,resolutionX,arcgrid]],[[X,resolutionY,arcgrid]],[resample_command]).
directly_solves([[dem,X,tifimage]],[[slope,X,tifimage]],[special_slope_command]).
The query is now:
solves([[dem,resolutionX,arcgrid]],[[slope,resolutionY,arcgrid]],X,Y).

The resulting execution plans are:

X = [[dem,resolutionX,arcgrid],[dem,resolutionX,tifimage],[slope,resolutionX,tifimage],[slope,resolutionX,arcgrid]]
Y = [gridimage_command, special_slope_command, imagegrid_command, resample_command];

X = [[dem,resolutionX,arcgrid],[dem,resolutionY,arcgrid],[dem,resolutionY,tifimage],[slope,resolutionY,tifimage]]
Y = [resample_command, gridimage_command, special_slope_command, imagegrid_command];
These two execution paths correspond to the options of resampling the slope after importing the result, and resampling the dem prior to exporting it to a tif image. Two additional paths are found if we include the fact

directly_solves([[dem,X,arcgrid]],[[slope,X,arcgrid]],[slope_command]).
These two additional paths are similar to those of the earlier example, i.e.,
X = [[dem,resolutionX,arcgrid],[dem,resolutionY,arcgrid]]
Y = [resample_command,slope_command];

X = [[dem,resolutionX,arcgrid],[slope,resolutionX,arcgrid]]
Y = [slope_command,resample_command]
Issues relating to the design and implementation of a system for aiding the construction of software integrating several software components were investigated in this paper. In particular, the design of a system consisting of four major components was described, along with a prototype implementation. The important operational components of the system are the execution plan generator and the code generator. The execution plan generator is driven by the GUI component and makes use of an underlying metadata database. The code generator component of the system makes use of the generated execution plan and solver templates to generate code. Several examples of the operation of the prototype can be found in the paper.
As a result of investigating these examples several possible future research directions have been identified. These include:
Of these, the automatic or semi-automatic optimisation of the code poses the most difficult problem to solve. At the moment optimisation of code is the responsibility of the user of the system.
Abel, D. J., Taylor, K. & Kuo, D. (1997), Integrating modelling systems for environmental management information systems, SIGMOD Record 26(1), 5-10.
Abiteboul S., Hull R. & Vianu V. (1995), Foundations of Databases, Addison Wesley Publishing Company, Reading, Massachusetts.
Appelbaum R., Cline M. & Girou M., (1997) The CORBA FAQ --- Frequently Asked Questions, July 1997, available from http://www.cerfnet.com/~mpcline/Corba-FAQ/index.html.
Brown N. & Kindel C. (1998), Distributed Component Object Model Protocol, Microsoft Corporation, Network Working Group Internet-Draft, January 1998, available from http://www.microsoft.com/oledev/olecom/draft-brown-dcom-v1-spec-02.txt.
Object Management Group (OMG) (1998), The CORBA/IIOP Specification 2.2, February 1998, available from http://www.omg.com/corba/corbaiiop.htm.
The PostgreSQL Organization (1998), PostgreSQL HOME Page, available from http://www.postgresql.org.
Tang, H., Bishop, I.D. & Yates, P.M. (1998), Systems Design Options for Linking GIS, Modelling and Visualisation: Applications to Forest Management, Presented at Resource Technology 1998, Rovaniemi, Finland, 8-13 June 1998.
Sagonas K.F., Swift T., Warren D.S., Freire J. & Rao P. (1998), The XSB Programmer's Manual Version 1.8, Computer Science Department, State University of New York at Stony Brook, http://www.cs.sunysb.edu/, see also ftp://ftp.cs.sunysb.edu/pub/XSB/.
Yates, P.M. & Bishop, I.D. (1997), A Method for the Integration of Existing GIS and Modelling Systems, Presented at GeoComputation'97, Dunedin, New Zealand, August 1997.