Spatial Data Acquisition from Motion Video

Mark Williams
Departments of Computer and Information Science, University of Otago, PO Box 56, Dunedin, New Zealand

Abstract
The field of geocomputing utilises computer models of the real world. While in some cases there is no spatial correspondence between the computer model and reality, many cases require an accurate representation of the real world. For example, geographical information systems (GIS) commonly require accurate, scaled maps of land. A simple GIS may incorporate a flat two dimensional map from which the computer could, for example, measure distances or calculate areas. A more complex version may include a topographic map, which is essentially a layered set of two dimensional maps at fixed height intervals. From this information the GIS can calculate the volumes of hills, or viewshed images that determine the visibility of, for instance, a large structure on a hilltop. More detailed GIS models require true three dimensional models of objects, such as tunnels, to measure volume. In all cases a GIS is only as good as its source data, hence the data collection technique needs to be both accurate and practical for the size of the object being modelled. In the case of traditional GIS, photogrammetry is well established as a data gathering method. This large-scale technique has also proved successful for some small-scale tasks. For example, a semi-automated method has been demonstrated for measuring ships' hulls and satellite dishes. On an even smaller scale, photogrammetry has been used for measuring human faces to evaluate post-operative swelling.

The quintessential problem of photogrammetry is to establish correspondence, which involves identifying the points in each image that represent a given point in space. This has proven to be an easy task for a trained human operator, but very difficult for a computer. Solutions often require that easily identifiable reference objects be placed within the scene. They may also require controlled lighting and manual identification of corresponding target points. There is, however, a class of problems where alteration of the scene is undesirable, making automated measurement very difficult.

This paper reviews methods of capturing three dimensional spatial data from two dimensional images. A new approach is proposed that is based upon the principles of photogrammetry while circumventing the correspondence problem. In doing so, many restrictions of other methods are avoided: no reference points or special markers on the object are required. Where photogrammetry requires a widely spaced pair of images, the proposed method uses a dense image set. The motion of the camera relative to the scene is described by a simple function. Data from the camera is treated as a three dimensional block, consisting of X and Y spatial dimensions and a temporal dimension (a sketch of this structure is given below). Interesting features are tracked as they move within this space; the function describing each feature's motion determines its three dimensional location in space.
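To make the data structure concrete, the following is a minimal sketch in Python with NumPy. The function name and the assumption of equal-sized grayscale frames are illustrative, not part of the method as published.

```python
import numpy as np

def build_volume(frames):
    """Stack video frames into a spatiotemporal block of shape (T, Y, X).

    `frames` is assumed to be a sequence of equal-sized grayscale
    images, each a 2-D array indexed as [y, x]; colour frames would
    first be reduced to intensity.
    """
    return np.stack(frames, axis=0)
```

Fixing the Y index of such a block yields one XT slice per image row; these slices are the structures analysed below.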

The simplest case of the proposed method is linear motion. This will arise, for example, when the camera is stationary while an object moves past it. As the object moves, frames from the camera are captured to form a dense spatiotemporal volume. This volume can be segmented into XT slices, also referred to as epipolar images. Each image plots the paths of points on the object at one height, or Y value, in the XY frames. By considering the apparent motion of individual points over time, it can be observed that points at different distances from the camera move at different apparent velocities. All points move in straight lines, but those closer to the camera appear to move faster than those farther away. Imagine looking out the window of a moving train: trees close to the railway line appear to rush past in a flash, while mountains in the distance appear to move very slowly. When viewing this motion as an epipolar image, objects at different distances from the camera appear as lines of different angles. Measuring the angle between the lines allows a direct calculation of the difference in distance from the camera, and hence the size of the object (see the first sketch below).

This method extends to rotational motion. As an object rotates before a camera, or the camera rotates around the object, individual points on its surface form sinusoidal paths in an epipolar image. Consider watching a glass of water placed at the edge of the turntable inside a microwave oven. The glass appears to move back and forth periodically: plotting this motion against time produces a sinusoidal curve. Now consider a second glass in the microwave, closer to the centre than the first. The epipolar image will contain two sinusoids, one with a smaller amplitude than the other. If the two glasses are in line with the centre, the sinusoids will be in phase. Shifting the first glass around the edge of the turntable will give the large amplitude sinusoid a different phase to the small amplitude one. By extracting the phase and amplitude of each sinusoidal curve in an epipolar image, the position of each tracked point relative to the camera can be calculated (see the second sketch below). Again this allows the size of the object to be determined.
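For the linear case, depth follows from the slope of each straight track. Under perspective projection with focal length f (in pixels) and relative speed v between camera and object, parallel to the image plane, a point at depth Z projects at x(t) = f(X0 + vt)/Z + cx, so its track is a line with slope dx/dt = fv/Z, giving Z = fv/|slope|. The sketch below assumes a tracked point is supplied as arrays of frame times and X positions; the function name and parameters are illustrative only.

```python
import numpy as np

def depth_from_slope(track_t, track_x, focal_px, speed):
    """Estimate the depth of one point from its straight-line track in
    an epipolar (XT) image, assuming known relative speed between
    camera and object, parallel to the image plane.

    The track x(t) = f*(X0 + v*t)/Z + cx is a line of slope f*v/Z,
    so depth is recovered as Z = f*v/|slope|.
    """
    slope, _intercept = np.polyfit(track_t, track_x, deg=1)  # least-squares line
    return focal_px * speed / abs(slope)
```

For the rotational case, each track is (under an approximately orthographic projection) a sinusoid whose amplitude encodes the point's radius from the rotation axis and whose phase encodes its angular position. Writing x(t) = a sin(wt) + b cos(wt) + c makes the fit linear in (a, b, c), after which amplitude = sqrt(a^2 + b^2) and phase = atan2(b, a). Again the names, and the assumption that the angular velocity w is known, are illustrative.

```python
import numpy as np

def fit_sinusoid(track_t, track_x, omega):
    """Fit x(t) = a*sin(w*t) + b*cos(w*t) + c to one sinusoidal track
    in an epipolar image of a rotating scene, with angular velocity
    omega assumed known (e.g. from the turntable speed).

    Returns (amplitude, phase, centre): amplitude gives the point's
    radius from the rotation axis, phase its angular position.
    """
    t = np.asarray(track_t, dtype=float)
    basis = np.column_stack([np.sin(omega * t),
                             np.cos(omega * t),
                             np.ones_like(t)])
    (a, b, c), *_ = np.linalg.lstsq(basis, np.asarray(track_x, dtype=float),
                                    rcond=None)
    return np.hypot(a, b), np.arctan2(b, a), c
```

Converting the amplitude from pixels to world units requires the projection scale; the phase then places each point at angle phi around the axis, so (r cos phi, r sin phi) recovers its position in the plane of rotation.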

This paper establishes a new method of acquiring spatial data from motion video. The proposed method is based upon the principles of photogrammetry, but allows position to be calculated with feature tracking rather than point correspondence. By doing so, it avoids many constraints imposed by previous solutions. The proposed method is demonstrated with both linear and rotational motion.