Return to GeoComputation 99 Index

Database fusion for the comparative study of migration data

Oliver Duke-Williams
University of Leeds, Centre for Computational Geography, Leeds LS2 9JT United Kingdom

Marcus Blake
University of Adelaide,
Key Centre for Social Applications in GIS, 10 Pulteney Street, Adelaide SA 5005 Australia


Internal migration is one of the key processes influencing the population of small areas, together with the natural agents of birth and death. In most post-industrial countries it is both the most significant and the least predictable component of change. The study of migration is often made difficult by the particular characteristics of data collected: alternative datasets may not be directly comparable, data may be of limited extent, or where data does exist, it may do so in sufficiently large volumes so as to make analysis daunting. In addition, both the data themselves, and the results of analysis, can be complex, and require large amounts of metadata to support their explanation.

This paper describes a software system designed to facilitate the comparative study of two time series of migration data, detailing patterns of internal migration in Australia and the U.K. respectively. Both datasets cover the same 20-year period (1976-1996), but they differ in conceptual structure: the Australian data describes migration transitions over four inter-Censal 5-year periods, whereas the U.K. data consists of 20 years of migration event data compiled from administrative registers. A database has been designed that contains both source data and data transformed through demographic models. It is intended to generate migration statistics and measures which can be directly compared, in order to permit the study of features and patterns of change in migration in the two countries, with the primary aim being the elucidation of age-period-cohort relationships in the two sets of data.

The system is constructed from web servers coupled via server-side scripts to RDBMS servers in a manner that is intended to be flexible in terms of hardware and software used, and also to allow servers in the two countries to offer complementary services. The use of a hypertext system allows familiar interfaces to be offered to the user, and as much supporting explanatory text as is needed. The system is designed to support a wide variety of requirements, ranging from the extraction of subsets of data to be imported into external analysis packages, and the generation of statistics describing the intensities and effects of migration for selected data, to visualization of the data for speculative and exploratory analysis. Migration data is multidimensional; thus, visualization through means such as VRML models allows a rich summary of the data to be presented to the user. This multidimensional nature of the data includes both demographic detail and derived information such as measures of change over time, and also geographic detail. The effects of geographic scale are of considerable importance when studying migration data, and it is presumed that causal factors have differing priorities for local and long distance moves. The system is designed to allow re-aggregation of the migration interaction matrices to suit the user's requirements. In addition, many measures of migration intensity and impact rely on distance as a component, and studies have shown that alternative calculations of distance between origins and destinations can have very significant effects on the measures subsequently calculated; the system has been designed to accommodate a variety of methods for calculating distances between locations.

The system described in the paper uses data fusion and integration techniques, together with a variety of presentation approaches, to offer new analytic tools for researchers, allowing large volumes of data to be studied and compared in a manner that was not previously available.

1. Introduction

Internal migration is one of the key processes influencing the population of small areas, together with the natural agents of birth and death. In most post-industrial countries in the contemporary developed world migration is both the most significant and the least predictable component of change. Population geographers have tended to focus on migration as an agent of change rather than fertility and mortality, largely because it is a spatially dynamic phenomenon. However, differences in the way in which internal migration is recorded in different countries make international comparisons difficult, and have therefore restricted the number of comparative studies of the process.

There are a number of collections which describe migration data sources in different countries (e.g. Nam, Serow and Sly 1990) or migration patterns (e.g. Rees, Stillwell, Convey and Kupiszewski 1996), and there is also a large body of literature comparing particular aspects of internal migration, such as the phenomenon of counter-urbanisation (e.g. Fielding 1982; Champion 1989). There have also been direct comparisons drawn between countries in regard to overall levels of mobility (Long 1991), distances travelled by migrants (Long, Tucker and Urton 1988a, 1988b), age structures (Rogers, Racquillet and Castro 1978, Castro and Rogers 1983, Rogers and Castro 1986) and other demographic characteristics (Long 1992). Such work has been constrained by the lack of a robust framework which permits comparative quantitative analysis to be conducted in a sufficiently rigorous manner.

Such direct comparisons are desirable for a number of reasons. Measures of migration calculated for any one country become more meaningful when placed in a comparative context. If one measure of migration propensity is calculated for any particular country, then its reference against other countries will enable us to understand whether it is a particularly high or low propensity. Indeed, we may not realise that any country displays unusual or even unexpected patterns of migrancy until it is compared to a number of other countries. It is through comparative study of such phenomena that we can identify similarities and differences which help both to enrich the body of theory and determine whether patterns observed in one country are generalisable to others.

This paper describes a database that is being developed as part of a project which aims to compare migration levels at a variety of spatial scales in Australia and the United Kingdom, over the period 1976-1996. Our central aim is to test the theory proposed by Easterlin (1980) that demographic rates are a function of the relative size of cohorts with specific reference to migration rates and other measures of migration intensity.

Tools are required both to manipulate and process the large migration database and to enable visual exploration and give graphical summaries of the data. Migration data are complex, and the use of visualization can provide a useful overview of the data, and also help researchers to identify patterns that may not otherwise be immediately apparent. There is a great need for exploratory data analysis tools that are specifically tailored to interaction data sets. The need is therefore to develop a system which can be run on different hardware and software platforms, and can readily display a wide variety of data types including images and other visualized elements. The Web is an ideal interface, and one which also helps facilitate work between research groups who are geographically far apart. Whilst the database is currently being developed and used as a tool for researchers with experience of the data, the Web is also ideally suited to providing large volumes of hyper-linked documentation and explanation, should the database be subsequently used in a teaching role.

The complexity of migration data has implications for the development of computer programs and interfaces used to study the data. Migration data sets often take the form of a matrix showing all origins and destinations. These matrices are typically large and sparsely populated. Where these data sets form a long running time series, the volumes of data can become considerable, requiring powerful database engines to manage the data. On a practical level, the necessity for users to select numerous origins and destinations, possibly from a large list of small areas is time consuming and may hinder users (Duke-Williams, 1998). The requirement to construct a request involving numerous parameters which may require careful editing necessitates an interface through which the user can follow several paths to complete the query and in which revision of choices is permitted.

The remainder of this paper introduces the various data sources used, and describes the conceptual differences between them. A brief discussion of the constructs of age group, time period and cohort by which the data can be categorised follows. Following this is a description of the manner in which the data has been modified and loaded in the database, and an outline of the structure of the database and tools used. Descriptions of the type of analysis and visualization which will be possible are given, together with a list of features which the authors hope to incorporate in the future.

2. Data sources

This section of the paper describes the various data sources used in the project. Migration data are complex, and therefore the section is introduced with a discussion of different types of migration data and the ways in which they differ. A demonstration of the use of diagrams to illustrate the relationships in migration data is then given. Finally, the specific data sources used are described, and their strengths and weaknesses outlined.

2.1 Types of migration data

The processes of birth and death are unambiguous, and generally subject to explicit official registration. Migration, on the other hand, is harder to classify, with distinctions between various categories of short-term migration and other forms of migration being blurred, and ultimately subjective for official registrars, researchers and the migrants themselves. Migration is usually seen (in countries such as those studied here) as being related to a change in permanent residence. However, the concept of a permanent residence is itself not clear, with different definitions being used by different statistical organisations. In addition, few countries directly collect data on change in residence, and so data on internal migration must be collected in some other way. The problem is especially significant for those groups who may have more than one regular residence, such as students in higher education, those serving in the armed forces, and citizens who commute periodically between more than one home. Indeed, it is these groups who may be viewed as being amongst the most migratory sections of the population who are often the hardest to quantify.

Migration data can be divided into transition and event data types. Event (or movement) migration data attempts to record all individual migration events, whereas transition data attempts to measure the net effect of migration over some period of time, by comparing the usual residences of people at the time of collection with their usual residences at some previous point in time.

Transition data are typically collected by means of a national census, through the inclusion of a question asking respondents about their usual residence at an earlier date. The amount of time between this earlier date and the date on which the survey is taken is the transition period. In most censuses in which such a question is asked, the time period is either 1 year or 5 years; in some cases data are collected for both periods by asking two questions, about the place of usual residence 1 year previously and 5 years previously. A number of types of migration are not captured effectively by transition data. Multiple migrations - where a migrant has moved more than once over the course of the transition period - are only recorded as a single move from the location at the start of the period to the location at the end of the period. Return migration - where a person has moved away from an area but then moved back to it again within the period - may be noted as being only a local move, or missed altogether. Finally, data collected in this way requires the respondent to still be alive at the time that the data are collected - if someone has moved between two areas but dies before the next census is taken, then their migration will not be recorded.
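The way in which a transition question collapses multiple and return migrations can be illustrated with a minimal sketch. The function names, the use of decimal years for dates and the example migrants are all illustrative assumptions rather than part of the system described in this paper.

```python
def transition_from_events(initial_residence, moves, period_start, period_end):
    """Return the (origin, destination) transition a census would record, or None.

    moves is a time-ordered list of (time, destination) pairs for one person,
    with time expressed in decimal years (an illustrative convention).
    """
    start_residence = initial_residence
    end_residence = initial_residence
    for when, destination in moves:
        if when <= period_start:
            start_residence = destination
        if when <= period_end:
            end_residence = destination
    if start_residence == end_residence:
        return None  # no change of usual residence is visible
    return (start_residence, end_residence)


def events_in_period(moves, period_start, period_end):
    """Count the individual migration events that fall inside the period."""
    return sum(1 for when, _ in moves if period_start < when <= period_end)


# Hypothetical migrants observed over a 1991-1996 transition period.
multiple_mover = [(1992.3, "B"), (1994.1, "C")]  # A -> B -> C
return_mover = [(1992.0, "B"), (1995.5, "A")]    # A -> B -> A

print(transition_from_events("A", multiple_mover, 1991.0, 1996.0))  # ('A', 'C')
print(events_in_period(multiple_mover, 1991.0, 1996.0))             # 2
print(transition_from_events("A", return_mover, 1991.0, 1996.0))    # None
print(events_in_period(return_mover, 1991.0, 1996.0))               # 2
```

In both cases two events take place, but the transition record shows either a single move or no move at all. The sketch does not model mortality, the third source of under-recording noted above.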

Event data are typically associated with countries which have a strong system of population registers, in which people who move are required to register a change of address in some way. In countries which do not use registers in this way, event data may still be gathered from other sources (that is, as a secondary use of data which is being gathered for some other purpose). Event data overcomes many of the problems of transition data, although its quality depends on the completeness of coverage. However, event data may be inferior to transition data for assessing the net effect of migration if it is not possible to identify particular migrants and track them across several migrations. Analysis of this kind obviously requires uniquely identifying variables to be present in the data, and data of this kind is rarely made available to researchers without prior anonymisation.

A further significant difference between event and transition data lies in the amount of additional detail about the migrants that is available. As transition data are typically collected via censuses, it is often disaggregated by a range of socio-demographic variables. In contrast, register data such as migration event data are usually limited in detail to the age and sex of migrants.

Regardless of the way in which migration data are collected, they are generally reported in one of two ways: either as summary totals or as a full migration matrix. The full matrix can be represented as a cuboid, whose axes are formed by origins, destinations and characteristics of migrants. These characteristics are usually assumed to be limited to age and sex, but may include any other variables by which migrants can be disaggregated. Such a cuboid is illustrated in Figure 1. The internal part of the cuboid contains the full origin-destination-characteristic flows. Assuming characteristics to be limited to age and sex, we can refer to such data using the acronym ODAS (for origin-destination-age-sex data). The faces of the cuboid represent the totals of the internal values. There are three possible sets of totals: origin-destination (referred to as OD data), origin-age-sex (OAS) and destination-age-sex (DAS). These three summary totals may be given in preference to the full ODAS data, because of the large and sparse nature of the latter matrix.

[Migration data cube]
Figure 1: 'Three faces' summary totals of migration data
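The relationship between the full ODAS data and the three summary faces can be sketched as follows. The sparse dictionary representation, the function name and the example flows are assumptions made for illustration; they are not taken from the database described in this paper.

```python
from collections import defaultdict


def three_faces(odas):
    """Collapse sparse ODAS flows into OD, OAS and DAS summary totals.

    odas maps (origin, destination, age_group, sex) to a migrant count;
    only non-zero cells need to be stored, which suits a sparse matrix.
    """
    od = defaultdict(int)
    oas = defaultdict(int)
    das = defaultdict(int)
    for (origin, destination, age, sex), migrants in odas.items():
        od[(origin, destination)] += migrants   # sum over age and sex
        oas[(origin, age, sex)] += migrants     # sum over destinations
        das[(destination, age, sex)] += migrants  # sum over origins
    return dict(od), dict(oas), dict(das)


# Hypothetical flows between two zones.
flows = {
    ("North", "South", "20-24", "F"): 12,
    ("North", "South", "20-24", "M"): 9,
    ("South", "North", "65-69", "F"): 4,
}
od, oas, das = three_faces(flows)
print(od[("North", "South")])  # 21
```

Each face sums to the same grand total as the interior of the cuboid, but the interior cannot in general be recovered from the faces alone.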

2.2 Illustrating migration data

Time series demographic data can generally be indexed by three variables: age, time-period and cohort. All three of these need to be known to refer accurately to a given migrant (or aggregate set of data); if only two are known, the third can only be estimated. Upper and lower limits for the third variable can be deduced, but the distribution of data between these limits may not be linear. Such data, including migration data, are often illustrated using a two dimensional graph known as a Lexis diagram (although modern diagrams derive more directly from modified versions by Pressat (Pressat, 1961, Vanderschrick, 1992)). These diagrams show the relationship between a number of variables; Figure 2 shows part of a Lexis diagram. The parallel diagonal lines mark the path of the 1990-91 birth cohort through time (the x axis), showing the age of the people (the y axis) in the cohort. At any point in time, the diagonal lines describe the upper and lower possible ages for people of that birth cohort. In addition, the Figure includes a number of alternative representations of migration data.

[Lexis diagram]
Figure 2: Migration data representation
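The point that the third reference variable can only be bounded when two are known can be shown with a small sketch using single calendar years and age at last birthday. The function names are illustrative assumptions.

```python
def birth_year_bounds(event_year, age_last_birthday):
    """Earliest and latest possible birth year for an age-period observation."""
    return (event_year - age_last_birthday - 1, event_year - age_last_birthday)


def age_bounds(event_year, birth_year):
    """Youngest and oldest possible age (last birthday) for a period-cohort observation."""
    return (max(0, event_year - birth_year - 1), event_year - birth_year)


# A migrant aged 2 at a move made during 1990 was born in 1987 or 1988.
print(birth_year_bounds(1990, 2))  # (1987, 1988)

# A person born in 1990 who moved during 1992 was aged 1 or 2 at the time.
print(age_bounds(1992, 1990))      # (1, 2)
```

The bounds are certain, but how the observations are distributed between them is not, which is why splitting a block requires either exact dates or an estimation model.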

The stippled block marked 'A' in the Figure represents event migration data, with each stipple indicating one migration event. We assume that as with most event data, the migrations are recorded soon after they have happened. The block shows migration events occurring in the year 1990 to children aged 2 (i.e. from their second birthday up to the day before their third birthday). For each event, we know the date of migration, and the migrant's date of birth, and can therefore derive the migrant's age at the time of migration.

The other blocks shown in the Figure are shaded rather than stippled. This is intended to indicate that migrations have occurred somewhere within the shaded space, but unlike the data in 'A', we do not know precisely when.

The shaded block 'B' also shows migration occurring over a single year and for a specific age group. However, unlike the data in 'A', we assume that the block shows transition data. These data might have been collected if we asked people whether they had migrated at all during 1990, and what their age was at the time of migration. Clearly, 'A' (if it were collected for the same age group) would be a reasonable initial estimate of 'B', but would not be totally accurate. In order to make 'A' a closer estimate of 'B', we would need to use methods including demographic accounting to estimate the number of migrants who had died before the data was collected, and also use some model to help remove the effects of multiple migration. Such demographic accounting measures obviously become more significant with the elderly, where the mortality rate is high.

Both 'A' and 'B' are data which are referenced by age group and time period.

The hypothetical questionnaire used to collect the information in 'B' required the respondent to say whether they had moved, and what their exact age was at the time of migration. The parallelogram marked 'C' represents the data that would be collected if we asked, throughout 1992, all two year olds, on their birthday, whether they had moved in the past year. This unlikely data would be referenced by age group and cohort. In the Figure, the shaded area indicates the whole of the space in which age is between 1 and 2, and the birth date is during 1990; it is inside this space that migration events may have occurred.

The parallelogram shaded 'D' is far more representative of data gathered by a Census - it is collected by asking all people on one particular day (in this example, the first day of 1994) whether they had moved in the past year, and how old they are at present. Area 'D' represents those answers given for three year olds. The data in 'D' is referenced by period and cohort, and thus completes the possible combinations of the three reference variables.

The data illustrated in Figure 2 demonstrate the problem of incompatibility between event data (such as that in block 'A') and typical transition data as shown in 'D'. In order to carry out analysis of these data we need to find some way of making 'A' and 'D' comparable. Two general problems must be addressed in doing so. Firstly, we must account for differences between event and transition data - for example, the differences (other than age group) between 'A' and 'B'. Secondly, we must account for the different shaped areas. The latter can be achieved if we can find some way of dividing up 'B' or 'D' into smaller components which could then be reassembled in a different manner in order to create areas of identical shape. The obvious way of doing this is to split either unit into two triangles. With 'B', this would be a split along a diagonal line from bottom left to top right, whereas for 'D' a division along a line parallel to the x axis from the upper left corner to the lower right one would be required.

Figure 3 shows how a single age-period cell (that is, similar to 'A' or 'B' in Figure 2, but shown here with 5 year age groups and a five year transition period) can be split into two triangular age-period-cohort elements, abbreviated here to cohels. In most such age-period blocks, the migrants can be split into two birth cohorts, with older migrants belonging to an earlier cohort and younger ones belonging to a later birth cohort. It is important to note that the values which are collected for the whole block cannot necessarily be divided equally between the two cohels. In order to allocate a migrant correctly, we need to know both the date of migration and the age at the time of migration. The distribution of age at the time of migration within an age-period observation may well be skewed, as migration is often associated with age-related changes in status, such as the commencement of higher education, or retirement.

[Cohort elements]
Figure 3: Cohort elements
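Where both the date of migration and the exact age at migration are known, allocation to a cohel is straightforward. The following is a minimal sketch, assuming decimal years, 5 year units, and periods and birth cohorts aligned on 1976; the function name and these conventions are illustrative and are not a description of the estimation procedure used in the project.

```python
import math


def cohel(event_time, exact_age, width=5, period_origin=1976):
    """Return (age_start, period_start, cohort_start) for one migration event.

    event_time and exact_age are in decimal years. Periods and birth cohorts
    are aligned on period_origin (an assumed convention); age groups start at 0.
    """
    age_start = math.floor(exact_age / width) * width
    period_start = period_origin + math.floor((event_time - period_origin) / width) * width
    birth_time = event_time - exact_age
    cohort_start = period_origin + math.floor((birth_time - period_origin) / width) * width
    return (age_start, period_start, cohort_start)


# Two hypothetical migrants in the same age-period block (ages 20-24, 1976-81).
print(cohel(1978.5, 22.0))  # (20, 1976, 1956): the later birth cohort
print(cohel(1977.0, 23.5))  # (20, 1976, 1951): the earlier birth cohort
```

Both events fall in the same age-period block but belong to different birth cohorts, which is the distinction that aggregate age-period counts lose.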

The exception to the general scheme illustrated in Figure 3 occurs for the final age group. The final age group in demographic data is generally open ended, and collection of transition type data does not lead to a regular age-period block for this age group. If, for example, we have data with five year age groups collected for a five year transition, and the final age group is '75+', then all we can definitely say about the migrants in this age group at the collection date is that 5 years previously they were aged 70+; we cannot identify explicitly those migrants aged 75+ who were aged between 70 and 75 at the beginning of the transition period. This ambiguity makes estimation of cohels for the last two age groups somewhat harder.

When the data are split up into their component age-period-cohort elements, they can then be re-aggregated using any combination of the three reference variables, creating (or recreating) age-period, age-cohort or period-cohort observations. This permits analysis to be carried out with considerably more flexibility than would be the case if the data were stored in their original forms.
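Re-aggregation of cohels on any pair of reference variables can be sketched as below. The dictionary layout, function name and counts are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

# Position of each reference variable within a cohel key.
POSITION = {"age": 0, "period": 1, "cohort": 2}


def reaggregate(cohels, keep):
    """Sum cohel counts over every reference variable not named in keep.

    cohels maps (age_start, period_start, cohort_start) to a migrant count.
    """
    totals = Counter()
    for key, migrants in cohels.items():
        totals[tuple(key[POSITION[name]] for name in keep)] += migrants
    return dict(totals)


# Hypothetical cohel counts for one origin-destination pair.
cohels = {
    (20, 1976, 1956): 30,
    (20, 1976, 1951): 50,
    (25, 1976, 1951): 40,
    (25, 1976, 1946): 20,
}

print(reaggregate(cohels, ("age", "period")))
# {(20, 1976): 80, (25, 1976): 60}
print(reaggregate(cohels, ("period", "cohort")))
# {(1976, 1956): 30, (1976, 1951): 90, (1976, 1946): 20}
```

The first call recreates age-period blocks of the kind found in event data; the second recreates the period-cohort parallelograms typical of census transition data.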

2.3 Populations at risk

The preceding discussion has focussed entirely on migrant flows, and problems relating to the allocation of migrants to specific cohels. However, this is not the only problem that is posed. In order to make any sensible analysis of migration, we need to consider migration rates and transition probabilities rather than absolute numbers. Rates are calculated for any general demographic phenomenon by dividing an observed event total by the people to whom those events either occurred or might have occurred: the population at risk (PAR) of the event happening. In the case of migration events, the choice of population at risk is not a simple one, especially for data which are collected over a non-trivial time period. For a rate calculation, we wish to ensure that all persons in the numerator are present in the denominator, and that all persons in the denominator could be in the numerator; however as populations change over time the calculation of how many people are at risk of migration is one that needs to be carried out with care. The correct PAR to use will depend on the way in which the data were collected. For transition data, the origin represents the migrants' location 5 years (or however long the period was) previously, and the population at risk of outmigration from that area must be the population of the area at the start of the period. It is important to note that calculations based on the net number of migrants over a period and a fixed starting population will produce a transition probability rather than a migration rate. For event data the migration rate for each individual migrant will be related to populations on the day that he or she migrated. Consequently, the PAR for event data is usually calculated as being the mean population over the period, which is generally estimated with the assumption of linear change over time between the start and the end of the period.
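The two choices of population at risk described above can be written out as a short sketch. The function names and figures are illustrative assumptions, and the event rate here is expressed over the whole period rather than per year.

```python
def transition_probability(transition_migrants, start_population):
    """Transition data: migrants divided by the population at the start of the period."""
    return transition_migrants / start_population


def event_rate(events, start_population, end_population):
    """Event data: events divided by the mean population over the period.

    The mean is estimated by assuming linear change between the start
    and the end of the period.
    """
    mean_population = (start_population + end_population) / 2
    return events / mean_population


# Hypothetical area whose population grows from 10,000 to 11,000.
print(transition_probability(500, 10000))  # 0.05
print(event_rate(630, 10000, 11000))       # 0.06
```

The two results are not directly comparable: the first is a probability attached to a fixed starting population, whereas the second is a rate attached to an average population exposed to the risk of moving.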

The calculation of populations at risk is also affected by international migration. If we are attempting to measure internal migration, then we wish to view the process as existing within a closed system, where the overall population is changed only by birth and death. In practice however, there is also a flow of migrants into the system from outside and a concomitant flow of migrants out of the system via emigration. The latter process is not picked up by Census type monitoring, as emigrants who have left a country will not be captured by that country's Census. By definition, these people are not internal migrants. If we choose to exclude these flows from the migrants being studied, then it may be important to adjust the population at the end of the period by removing residents who migrated from overseas, in order to generate a logically more consistent PAR.

The correct calculation of PAR may seem to be less significant than derivation of comparable migration flows, perhaps because the data tend to be more abundant and require less pre-processing. However, different assumptions or approaches may lead to significantly different values for the PAR, and thus the importance attached to their calculation should not be understated.

2.4 Data sources used

The database is built on two large bodies of migration data, respectively detailing migration flows in Australia and in the UK, together with corresponding population data used to calculate migration rates and transition probabilities. All datasets cover a period of twenty years, from 1976 to 1996; however the migration datasets differ markedly in their structure. This section of the paper outlines some of the details of the data used.

The Australian migration data used in this project comes from the results of the question regarding respondents' address 5 years previously, as asked in the Censuses of 1981, 1986, 1991 and 1996. Whereas most censuses are collected on a decennial basis, the Australian census is collected on a quinquennial basis. The 5 year transition period thus corresponds to the collection date of the previous census (subject to minor changes in the census date), and effectively forms a continuous time series. Table 1 summarises the migration data collected in Australian Censuses since 1971, showing which questions have been asked, and the level of geographical detail at which moves were coded. It is apparent from this Table that migration data are not fully consistent over time, as the scales at which data have been collected have varied.

Table 1: Migration data collected in Australian Censuses since 1971
Census    Five year question               One year question
1971      Yes - statistical divisions      Not asked
1976      Yes - local government areas     Yes - local government areas
1981      Yes - local government areas     Yes - local government areas
1986      Yes - statistical local areas    Yes - statistical local areas
1991      Yes - statistical local areas    Yes - states and territories
1996      Yes - statistical local areas    Yes - statistical local areas

A common problem that confronts time series analysis of Census data is changes in the boundaries of spatial units. In Australia, information on usual residence, which forms the basis for migration flow matrices, is coded to Statistical Local Area (SLA) level. Substantial changes have been made to SLA boundaries in most states and territories over the past two decades, and thus the inconsistencies in Table 1 are more significant than at first appears. Derivation of a regional framework for migration analysis over the four intercensal periods 1976-81 to 1991-96 therefore requires identification of regions each comprising one or more whole SLAs with a common outer boundary. For the purpose of this project, a hybrid geography has been developed based on Statistical Divisions. The data for 1981 onwards has been re-aggregated on a common set of boundaries termed Temporally Consistent Statistical Divisions (TSDs) (Blake, 1998). There are 69 of these regions, with an average (1998) population of around 270,000 persons. Figure 4 shows the TSD boundaries. Migrant flows between TSDs are disaggregated by age (5 year age groups, with a final group of 75+) and sex.

[TSD boundaries, Australia]
Figure 4: TSD boundaries, Australia
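Re-aggregation of flows from small areas to larger regions built from whole small areas can be sketched with a simple lookup table. The zone codes, function name and the option to discard moves that become internal to a region are illustrative assumptions and do not describe the procedure actually used to build the TSD data.

```python
from collections import defaultdict


def aggregate_flows(flows, zone_of, keep_internal=False):
    """Re-aggregate origin-destination flows onto a coarser geography.

    flows maps (origin, destination) to a migrant count; zone_of maps each
    small area to the larger region that wholly contains it.
    """
    totals = defaultdict(int)
    for (origin, destination), migrants in flows.items():
        big_origin = zone_of[origin]
        big_destination = zone_of[destination]
        if big_origin == big_destination and not keep_internal:
            continue  # the move no longer crosses a regional boundary
        totals[(big_origin, big_destination)] += migrants
    return dict(totals)


# Hypothetical lookup of three small areas to two regions.
sla_to_tsd = {"SLA1": "TSD_A", "SLA2": "TSD_A", "SLA3": "TSD_B"}
sla_flows = {
    ("SLA1", "SLA3"): 10,
    ("SLA2", "SLA3"): 5,
    ("SLA1", "SLA2"): 7,
    ("SLA3", "SLA1"): 2,
}
tsd_flows = aggregate_flows(sla_flows, sla_to_tsd)
print(tsd_flows)  # {('TSD_A', 'TSD_B'): 15, ('TSD_B', 'TSD_A'): 2}
```

The flow of 7 between SLA1 and SLA2 disappears from the inter-regional matrix, which illustrates why measured migration intensity depends on the geographic scale chosen.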

The UK data is event data and comes from the National Health Service Central Register (NHSCR). Whenever a person moves, they are expected to register their change of address with a doctor, and if necessary register with a new doctor. These patient re-registrations are centrally processed (and used to transfer medical records to the new doctor) and abstracts of data are made available to researchers showing all migrations across the boundaries of Family Health Service Areas (FHSAs) in England and Wales or Health Board Areas (HBAs) in Scotland. The boundaries of these areas, hereafter collectively referred to as FHSAs, are shown in Figure 5. Whilst there are no formal population registers in the UK, the NHSCR acts as a useful source of migration data. The system is not ideal, as it fails to capture fully various groups of migrants, specifically migrants who move within an FHSA, and also those who move but do not immediately register with a new doctor. This problem is especially significant for young adult males, who may not register with a doctor unless they become ill. However, comparison of transition data (from the 1991 UK Census) and NHSCR data from the nearest comparable time period shows similar patterns of spatial change (Boden, Stillwell and Rees 1992).

[FHSA boundaries, UK]
Figure 5: FHSA boundaries, United Kingdom

Whilst NHSCR data for the UK is available for the same twenty year time period that the Australian Censuses cover, both the geographical coverage and the conceptual structure of the data vary. Prior to mid-1983, Scotland was treated as a single area for reporting purposes, whilst after this date it was split into 15 separate Health Board Areas. Similarly, the English county Middlesex was treated as a single area until mid-1986, when it was subdivided into 4 FHSAs. The geographic representation used consists of 115 zones within the UK (plus some additional zones and spatially non-specific flow categories), with an average population of around 510,000 persons.

Since mid-1983, anonymised records of individual migration events have been made directly available to researchers. These data show migration events between individual FHSAs (given the changing geographical frameworks outlined above) disaggregated by age (single years of age) and sex. This event data can be re-aggregated with considerable flexibility to meet researchers' requirements. However, prior to this date, summary data was made available by the Office of Population Censuses and Surveys (OPCS) (now the Office for National Statistics, [ONS]). The summary data included aggregate tables showing the total numbers of migrants in five-year age groups originating from and arriving at each FHSA, and the total numbers transferring between each origin-destination pair. For the purpose of this project Scotland was retained as a single area, because of general problems with missing data for flows between Scottish HBAs.

3. Database construction

The previous sections of this paper have summarised the data that are available, and outlined some of the steps required in making the datasets comparable. The two datasets used differ in terms of the way in which the data were collected, the time frames over which they were collected and the spatial systems for which they were collected (that is, in each country the spatial definitions have mutated over time). The primary aim of the work described here is to develop a database that contains data on population stocks and migration flows indexed by age, period and cohort, so that direct comparisons can be made between countries and also between cohorts within each country. Due to the differences in the data a considerable amount of processing is required - dis-aggregation, estimation and re-aggregation - in order to achieve some degree of initial comparability.

The manipulation of data can be carried out at a number of stages in the process of building a database. In general we wish to have access to both the original data and the transformed data, but this poses a potential problem: the balance of priorities in implementation between the most efficient use of various resources such as disk space, processing time and so on. It may be desirable to only store primary information once, and for all related values to be derived directly within the database. However, this has implications for both processing time and disk space required. It was decided therefore to modify the data as required before loading it into a database. The modified data must therefore be re-created and re-loaded should errors be identified or a better estimation model developed.

Figure 6 is a schematic diagram showing how both data sets are manipulated to reach a common format, that of 5 year age group, 5 year period, 5 year birth cohort elements. These data are then loaded into a relational database, and application programs are developed to extract the data in the ways required.

[Schematic diagram of application inputs and outputs]
Figure 6: Summary diagram of data pre-processing and web application outputs

The Figure shows the two sets of migration data with a summary description of each set; in practice the UK data changes in its structure over the course of the time period, and some processing is required to reach the starting point shown in the diagram. In addition, both migration data sets require corresponding population data sets, to be used for the calculations of population at risk, to feed in to rate calculations. Once this pre-processing phase had been completed, the data took the form of comparable cohels which could be loaded into a database and then analysed using unified tools.

3.1 Steps involved in pre-processing the Australian data

There are two main steps involved in pre-processing the Australian data. The first was to generate datasets which were geographically consistent over time, and the second was to estimate values for age-period-cohort elements from the period-cohort data collected via the census.

The Australian source data is held by the Australian Bureau of Statistics (ABS) at SLA level. The boundaries of SLAs are based on local government boundaries called Local Government Areas (LGAs). The number of SLAs has altered over time as new ones are defined; however, there are a large number of them (in 1996 there were 1329 SLAs with a spatial extent used to subdivide the whole of Australia, plus a total of 35 SLAs used for off-shore locations and a variety of non-spatial classifications), making analysis at SLA level difficult. As the geography of origins and destinations becomes more detailed, so the average flow becomes smaller, and at SLA level there would be many small flows. This is problematic not just for analysis but also because of data provision rules. Through a process termed Introduced Random Error any cell in an output table with a value of 3 or less is randomised (to a value in the range 0 to 3) for statistical disclosure control purposes. Where there are a lot of table cells with low values, the errors should average out, but the researcher's confidence in the actual value of the cells will be low.

The logical units of analysis in Australia are the 66 Statistical Divisions (SDs), whose boundaries are "delimited on the basis of socioeconomic criteria" and where possible "embrace contiguous whole local government areas" (ABS, 1996). Unfortunately, as described above, the boundaries change between the four censuses, making temporal comparisons of internal migration statistics impossible. Therefore the first step in pre-processing the Australian data was to spatially aggregate the migration data from the basic spatial unit (the SLA) to a set of spatial boundaries that were consistent over each of the four censuses. This involved the creation of four look-up-tables that defined the relationship between the SLAs of each census and a new set of Temporally Consistent Statistical Divisions (TSDs). These boundaries are shown in Figure 4.
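The look-up-table aggregation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used in the project: the function name aggregate_flows, the dictionary layout and the zone codes are all hypothetical.

```python
from collections import defaultdict

def aggregate_flows(flows, lookup):
    """Re-aggregate origin-destination flows to a coarser geography.

    flows  : dict mapping (origin, destination) source-zone codes to a count
    lookup : dict mapping each source-zone code to its target-zone code
    """
    out = defaultdict(int)
    for (orig, dest), count in flows.items():
        # Both ends of the flow are recoded through the same look-up table.
        out[(lookup[orig], lookup[dest])] += count
    return dict(out)

# Hypothetical codes: three SLAs, two of which fall in the same TSD.
lookup_1996 = {"sla_a": "tsd_1", "sla_b": "tsd_1", "sla_c": "tsd_2"}
flows_1996 = {
    ("sla_a", "sla_c"): 12,
    ("sla_b", "sla_c"): 5,
    ("sla_a", "sla_b"): 7,   # becomes an intra-zone flow after aggregation
    ("sla_c", "sla_a"): 4,
}
tsd_flows = aggregate_flows(flows_1996, lookup_1996)
```

With one look-up table per census, the same function could be applied to each census in turn, so that all four sets of flows end up on the common TSD geography.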

A model is required to divide each period-cohort observation into its two component cohels, because simple division of the total by 2 is too simplistic. Mobility rates vary markedly by age, which means that the volume of movements would be radically overstated in some age groups and understated in others. A more accurate method for 'splitting the period-cohort parallelograms' is for each inter-regional flow to be divided into age components based on separation factors derived from national mobility profiles for single year intervals by sex and single years of age. This method was applied to all four census periods.

3.2 Steps involved in pre-processing the UK data

The UK data required a greater amount of pre-processing than the Australian data. The steps taken included the estimation of a consistent migration data set over time, aggregation to a consistent geography over time, aggregation to a time frame consistent with the Australian data, and finally estimation of the cohort elements of age-period blocks.

Apart from minor changes in boundaries, the area definitions used for NHSCR data have remained stable over the course of the time series. However not all areas are used throughout the time series. In addition, flows between HBAs in Scotland are not fully reported. At the same time, the number of FHSAs used was considered to be sufficiently large as to make processing the resultant matrices difficult. For these reasons, a revised geography was adopted which consisted of 35 areas, based on metropolitan cities and surrounding areas. The geography used allowed the problems caused by differential reporting over the time series to be ignored, through aggregation of 'difficult' areas to single units which were consistent over time.

As described above, the UK data is divided into two formats. Data for the years 1975-76 to 1982-83 were published only as OD, OAS and DAS summary tables, whilst data after mid-1983 was made available as single event observations, from which full ODAS matrices could be constructed. In order to carry out useful analysis, the full ODAS data was required for the first part of the time series. The estimation of the full ODAS array was carried out using an iterative proportional fitting (IPF) model, using the three known sets of totals - having been aggregated to the previously defined City regions - as constraints. The IPF model cycles through the following stages until convergence is achieved:

  1. Define initial values - set all values of $M_{ijas}$ to 1, so that all cells have a chance of being filled with a non-zero value.
  2. Adjust to known origins by age:
    $M^{(1)}_{ijas} = M_{ijas} \cdot \left( O_{ias} / \sum_{j} M_{ijas} \right)$
  3. Adjust to known destinations by age:
    $M^{(2)}_{ijas} = M^{(1)}_{ijas} \cdot \left( D_{jas} / \sum_{i} M^{(1)}_{ijas} \right)$
  4. Adjust to known origin-destination flows:
    $M^{(3)}_{ijas} = M^{(2)}_{ijas} \cdot \left( M_{ij} / \sum_{a,s} M^{(2)}_{ijas} \right)$
  5. Test for convergence. For each cell,
    $d = \left| M^{(3)}_{ijas} - M_{ijas} \right|$
    1. If all observations of $d < 0.5$ then stop, else:
    2. Reset array as follows and repeat from step 2:
      $M_{ijas} = M^{(3)}_{ijas}$
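The IPF cycle listed above can be illustrated with a short sketch. It is written in plain Python for readability and is not the program used in the project. As a simplifying assumption, age and sex are collapsed into a single index k; the function name ipf, the max_iter safeguard and the example figures are all hypothetical.

```python
def ipf(O, D, OD, tol=0.5, max_iter=1000):
    """Fit M[i][j][k] to three sets of margins by iterative proportional fitting.

    O[i][k]  : known out-migrants from origin i in age-sex class k
    D[j][k]  : known in-migrants to destination j in age-sex class k
    OD[i][j] : known total flow from origin i to destination j
    """
    n_i, n_j, n_k = len(OD), len(OD[0]), len(O[0])
    # Step 1: start with every cell set to 1.
    M = [[[1.0] * n_k for _ in range(n_j)] for _ in range(n_i)]

    def scale(target, total):
        return target / total if total > 0 else 0.0

    for _ in range(max_iter):
        prev = [[cell[:] for cell in row] for row in M]
        # Step 2: adjust to known origin totals.
        for i in range(n_i):
            for k in range(n_k):
                f = scale(O[i][k], sum(M[i][j][k] for j in range(n_j)))
                for j in range(n_j):
                    M[i][j][k] *= f
        # Step 3: adjust to known destination totals.
        for j in range(n_j):
            for k in range(n_k):
                f = scale(D[j][k], sum(M[i][j][k] for i in range(n_i)))
                for i in range(n_i):
                    M[i][j][k] *= f
        # Step 4: adjust to known origin-destination totals.
        for i in range(n_i):
            for j in range(n_j):
                f = scale(OD[i][j], sum(M[i][j]))
                for k in range(n_k):
                    M[i][j][k] *= f
        # Step 5: stop when no cell has changed by as much as tol.
        d = max(abs(M[i][j][k] - prev[i][j][k])
                for i in range(n_i) for j in range(n_j) for k in range(n_k))
        if d < tol:
            return M
    raise RuntimeError("IPF did not converge within max_iter cycles")

# Hypothetical 2 x 2 x 2 array, used only to generate consistent margins.
T = [[[10, 20], [30, 40]], [[50, 60], [70, 80]]]
O = [[sum(T[i][j][k] for j in range(2)) for k in range(2)] for i in range(2)]
D = [[sum(T[i][j][k] for i in range(2)) for k in range(2)] for j in range(2)]
OD = [[sum(T[i][j]) for j in range(2)] for i in range(2)]
fitted = ipf(O, D, OD, tol=1e-9)
```

The default tolerance of 0.5 mirrors the stopping rule in step 5; the example uses a much tighter value only so that the fitted margins can be checked closely. Note that IPF reproduces the margins, not necessarily the individual cells of the array from which the margins were taken.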

After running the IPF model, a set of matrices had been created for each year in the time series, with single year of age observations; each cell in these matrices effectively being a 1 year age-period block. These were subsequently aggregated so as to give 5 year age-period blocks, a comparable time framework to the Australian data.

Having produced a set of UK 5 year age-period data for a consistent, usable geography over the time series, the final requirement was to split these age-period blocks into their component cohels. This was initially carried out by splitting the data equally for most blocks, except for the first and last age groups, in which an alternative set of weights was used.

4. Building the Web application

Previous sections of the paper have described the differences between alternative types of migration data, outlined the data sources available, and described the way in which the data need to be pre-processed before being loaded into a database. This section of the paper concentrates on the specific steps taken to build an application consisting of a database server and a client interface. The Web was selected as an appropriate mechanism as it allows access to the data for many people, regardless of physical location or computing environment, and permits easy delivery of a wide variety of content types.

4.1 Overview

The Web application consists of a number of parts: broadly, the interface, the data and the metadata. These are rather different in scope: the data - the actual details of migration in which the user is interested - are probably most easily defined and demarcated, whereas the boundary between the interface and metadata is less clear. The interface allows the user to build and perform queries through access to the metadata. The metadata themselves have a range of roles, some providing general information about the data (the number of records a particular data file has, what time period it refers to, the geography for which migration is being reported, etc.), while others are one step removed from this (data about how different geographies relate to one another, for example). The interface can be considered as an integrated set of software applications, including the web server and database manager, and also a set of programs designed to extract items of data for different purposes.

4.2 Database structure and metadata

The data was loaded into an SQL relational database and, therefore, when discussing the data, we use the terminology associated with such databases.

The 'table/row/column' and 'relation/tuple/field' terminology are often freely mixed, although in this description they are specifically used separately to describe respectively the outputs of the database, and the database view of the data themselves. It is worth noting that an output table may be the product of more than one source relation.

A variety of metadata are required to list the relationships between various pieces of data - which population data relate to which migration data for example, or how the geography of a certain relation is defined. As the data form a time series, we also need to store explicitly dates indicating the period or point in time that a set of data describe. In addition there is a requirement to store look-up tables in a controlled manner. These might be used, for example, for converting between different spatial representations of the data in a given relation. Amongst the metadata relations are contents lists. These are the central metadata, as they list the relations which are available for users to explore. In order to ensure that relations can only be used if all the necessary metadata are available, a rule is imposed in the application that only those relations which are listed in these contents tables can be used, and that the contents tables can themselves only be updated by the application, through forms which check that all metadata values are present and valid. The contents tables themselves depend on supporting metadata, which again must be updated via the application. Table 2 shows the metadata fields which must be completed to add an entry for a migration relation.

Field | Description | Source of value
relation_name | The name of the relation to use | This must come from a list of existing relations generated by the database
orig_field | The name of a field in the relation which contains id codes of some form indicating a flow origin | This must come from a list of fields in the relation relation_name, generated by the database
orig_geog_type | A code identifying the geographical scheme for origins in this relation | This must come from an existing list of codes and associated labels defined in another metadata relation
dest_field | The name of a field in the relation which contains id codes of some form indicating a flow destination | As for orig_field
dest_geog_type | A code identifying the geographical scheme for destinations in this relation | As for orig_geog_type
age_type | A code identifying the age coding used in this relation | This must come from an existing list of codes, associated labels and look-up tables defined in another metadata relation. The look-up table must define a mapping between single years of age and a new age group.
sex_type | A code identifying the representation of gender in this relation | A closed list limited to 'mf' or 'persons'
migdata_type | A code intended for future use | A closed list limited to 'event' or 'transition'
migdata_snapshot_date | For transition data (or pseudo-transition data), the date on which data were collected | A date entered by the user
migdata_anchor_date | For transition data, the date at the beginning of the period over which data are collected | A date entered by the user
migdata_interleave_type | For data that is disaggregated by both age and sex, observations are held as a stream of age and sex specific totals for a particular origin-destination pair. This variable indicates the order of data in that stream. | A closed list limited to 'mmmfff' or 'mfmfmf'
num_recs | Count of the number of records in the relation. Generally used for reference. | Generated automatically by the database
[not given] | Used to monitor changes to the database and to identify who is to blame if something unexpected happens | Generated automatically by the database
notes | Used to store any general information which may be helpful | Optional, added by user
Table 2: Metadata used to describe each source migration data table
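The kind of check that a metadata entry form might apply before a relation is added to the contents list can be sketched as follows. This is a hedged illustration rather than the application's actual logic: the function validate_entry, the choice of which fields are treated as mandatory, and the example codes are assumptions; only the field names and the closed lists are taken from Table 2.

```python
# Closed lists taken from Table 2.
CLOSED_LISTS = {
    "sex_type": {"mf", "persons"},
    "migdata_type": {"event", "transition"},
    "migdata_interleave_type": {"mmmfff", "mfmfmf"},
}
# Assumption: these fields are treated as mandatory for every entry.
REQUIRED = ["relation_name", "orig_field", "orig_geog_type", "dest_field",
            "dest_geog_type", "age_type", "sex_type", "migdata_type"]

def validate_entry(entry, relations, geog_types, age_types):
    """Return a list of problems; an empty list means the entry is acceptable.

    relations  : dict mapping relation name -> set of its field names
    geog_types : set of known geography codes
    age_types  : set of known age-coding codes
    """
    problems = [f"missing value: {name}" for name in REQUIRED
                if not entry.get(name)]
    rel = entry.get("relation_name")
    if rel and rel not in relations:
        problems.append(f"unknown relation: {rel}")
    for key in ("orig_field", "dest_field"):
        if rel in relations and entry.get(key) and entry[key] not in relations[rel]:
            problems.append(f"{key} is not a field of {rel}")
    for key in ("orig_geog_type", "dest_geog_type"):
        if entry.get(key) and entry[key] not in geog_types:
            problems.append(f"unknown geography code in {key}")
    if entry.get("age_type") and entry["age_type"] not in age_types:
        problems.append("unknown age_type")
    for key, allowed in CLOSED_LISTS.items():
        if entry.get(key) and entry[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    return problems

# Hypothetical relation, field and code names.
relations = {"uk_mig_8384": {"orig", "dest", "flows"}}
entry = {
    "relation_name": "uk_mig_8384", "orig_field": "orig", "dest_field": "dest",
    "orig_geog_type": "city35", "dest_geog_type": "city35",
    "age_type": "single_year", "sex_type": "mf", "migdata_type": "event",
    "migdata_interleave_type": "mfmfmf",
}
ok = validate_entry(entry, relations, {"city35"}, {"single_year"})
bad = validate_entry(dict(entry, sex_type="both"), relations,
                     {"city35"}, {"single_year"})
```

Returning a list of problems, rather than stopping at the first one, would allow a form to report every invalid value to the user in a single pass.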

Amongst the metadata which could be added are details of database servers where the data item is located. This might take the form of a list of alternative locations or a single location. In the former case, the web server could decide which host to use, based on whichever is the most efficient choice (or which is available), possibly incorporating some load-balancing techniques. The latter case, of data being listed with a single location, may be used if certain data is required to be stored on particular servers. In each case, the application program would 'know' the best server to use, and negotiation between the application and multiple database servers would be invisible to the user. These approaches may become more relevant if data from additional countries can be added to the project, especially if that data is located on servers in those countries.

4.3 Interface structure and state maintenance

The interface in a broad sense consists of a number of pieces of software which are integrated to a greater or lesser extent. These are listed in Table 3. The interface is presented to the user over the Web, and consists of a number of HTML pages which include a mixture of standard HTML and program code. This program code is written in a Perl-like language called PHP. The program instructions are interpreted by a module loaded into the Web server. In effect, therefore, the application consists of an integrated set of small programs, all of which produce an HTML page as their primary output, possibly with additional files such as embedded images. The PHP language contains general commands for input, output and processing of data, and also contains a large number of specialised function calls for interacting with other programs, primarily databases. The main database used for this project is PostgreSQL, an SQL relational database management system (RDBMS).

Software | Role
Apache 1.3.4 | Web server
PHP 3.0.7 | Server-side script language
phplib 6.0 | Libraries for authentication and state maintenance
PostgreSQL 6.4.2 | SQL database
gnuplot 3.7 | Graph plotting package
Table 3: Software used

As well as these primary pieces of software - the web server, the database and the language interpreter - we also use supporting programs such as the plotting package gnuplot to create graphs, and a set of libraries created to extend the functionality of PHP called phplib. The phplib libraries add useful functions to PHP to facilitate authentication, session management and state maintenance. One of the fundamental problems when developing applications over the Web is that it is an inherently stateless medium. By stateless, we mean that each page served by a web server is independent of all other interactions. However, in order to run any non-trivial database application, there is a requirement to store a variety of pieces of data which describe the current state of the application, and what has happened in the past. The standard methods of transferring data between web pages are limited and inefficient, and therefore it is desirable to store information about and provided by users on the server. As well as querying and updating the databases in which the user is ultimately interested, the scripting language can be used to store information about users and queries in a separate database (albeit one which is probably managed by the same RDBMS).

If such data (for example, a list of origins and destinations for which the user wishes to extract some data) are going to be stored, then the application needs to be able to separate requests from different users, and correctly keep the association between a user and the data relating to a query that he or she is building. This separation is achieved through correct identification and if necessary authentication of the users. Authentication is the process of verification of a claimed identity, through a password or some other means. This may be required to limit the operations that the user is allowed to carry out.
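The idea of keeping per-user query state on the server can be illustrated with a minimal sketch. It does not reproduce the phplib interface: the class SessionStore and its methods are hypothetical, and an in-memory dictionary stands in for the database relation in which such state would be held.

```python
import secrets

class SessionStore:
    """Server-side state keyed by an unguessable session identifier."""

    def __init__(self):
        self._sessions = {}

    def start(self, user):
        # The identifier is what travels between pages (e.g. in a cookie or URL);
        # the state itself never leaves the server.
        sid = secrets.token_hex(16)
        self._sessions[sid] = {"user": user, "state": {}}
        return sid

    def state(self, sid):
        # Raises KeyError for an unknown identifier, so requests cannot
        # read or modify another user's query.
        return self._sessions[sid]["state"]

store = SessionStore()
sid_a = store.start("user_a")
sid_b = store.start("user_b")
store.state(sid_a)["origins"] = ["Leeds"]
store.state(sid_b)["origins"] = ["Adelaide", "Sydney"]
```

Each request then needs to carry only the session identifier; the list of origins and destinations that a user has built up so far is looked up on the server and is kept separate from that of every other user.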

4.4 Interface components

The interface components that are visible to the user include or will include forms which allow the extraction of data whether for browsing via on screen tables, delivery in a form more suited to loading into external analysis packages, or representation through some form of visualization. In order to carry out analysis, data is required in a number of different forms. Various desirable features include data retrieval, data browsing, computation of metrics, maps and other forms of visualization.

An important facility is simply to be able to retrieve raw data from the database in as flexible a range of outputs as possible, in order for that data to be used in an external program. One reason for this is that requirements will inevitably change as analysis is carried out, and new hypotheses are formulated. In order to test these, data will generally be copied into an external statistical package or loaded into a bespoke program. A more general overview of the data may be gathered via data browsing, in which all data in the database can be quickly and easily examined, either in whole or via querying in subsets.

Data browsing pages can allow users to get a general idea of the nature of data in a particular relation by showing either summary statistics or perhaps listing part of the file. Data might be manipulated in some basic ways such as sorting it on a chosen field, calculating a simple derived measure such as net migration balances, or depiction through some graphical summary such as a population pyramid or a bar chart. In each case, this is carried out through a web page containing PHP code in which a query is generated in a number of stages and finally a result is displayed. Figure 7 shows a generalised explanation of the way in which a population pyramid might be drawn for a particular population relation. There are additional ways in which the data may be represented graphically, such as 2 or 3 dimensional surfaces and a range of graph types.

Web page: Options page
Actions: User selects the 'draw population pyramid' link; a script sets the pop_pyr_mode variable to start and loads pop_pyr.phtml

Web page: pop_pyr.phtml
Actions:

If pop_pyr_mode is start:

  • Query contents table to get a list of available population relations
  • Offer these as a selection to the user. The user then selects a relation; this calls pop_pyr.phtml again but with a mode of edit

If pop_pyr_mode is edit:

  • If it is the first visit in this mode, then:
    • Check the age, sex and geography definition of the chosen relation and store these values
    • Set initial set of chosen areas to all
  • Offer link to create a subset of data; this calls pop_pyr.phtml with a mode of subset
  • Offer link to graph the data; this calls pop_pyr.phtml with a mode of graph

If pop_pyr_mode is subset:

  • Offer menus to user allowing selection of a subset of geographical areas
  • When areas have been selected: call pop_pyr.phtml with a mode of edit

If pop_pyr_mode is graph:

  • Present a skeleton HTML page which includes a reference to an inline image; however rather than an image per se, the link is to a second program file, draw_pyr.php3

Web page: draw_pyr.php3
Actions: This is called only from pop_pyr.phtml

  • Retrieve stored variables from pop_pyr.phtml listing the relation to use, the age and sex definitions and the subset of areas
  • Write a header:
    "Content-type: image/gif"
  • Extract population data required from database given current settings
  • Open a temporary file
  • Write a series of gnuplot instructions to the temporary file. The file includes the data extracted, and will write a .gif file on standard output
  • Run gnuplot with the temporary file. The IMG link in pop_pyr.phtml thus correctly receives a content header followed by a stream of data which is a gif file
  • Delete temporary file
Figure 7: Application element to draw a population pyramid
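The drawing steps in Figure 7 can be sketched outside PHP as follows. This is an illustrative Python equivalent under stated assumptions, not the draw_pyr.php3 script itself: the function names are hypothetical, the plot styling is a guess, and it assumes a gnuplot build that provides the gif terminal.

```python
import os
import subprocess
import tempfile

def build_pyramid_script(ages, males, females):
    """Return gnuplot instructions, with inline data, for a population pyramid."""
    lines = [
        "set terminal gif",   # no 'set output', so the image goes to standard output
        'set xlabel "Population"',
        'set ylabel "Age"',
        "plot '-' using 1:2 with lines title 'Males', "
        "'-' using 1:2 with lines title 'Females'",
    ]
    # Males are plotted as negative counts so that they extend to the left.
    lines += [f"{-m} {age}" for age, m in zip(ages, males)]
    lines.append("e")         # terminates the first inline data block
    lines += [f"{f} {age}" for age, f in zip(ages, females)]
    lines.append("e")         # terminates the second inline data block
    return "\n".join(lines) + "\n"

def render_pyramid(ages, males, females):
    """Run gnuplot on a temporary script and return header plus image bytes."""
    script = build_pyramid_script(ages, males, females)
    with tempfile.NamedTemporaryFile("w", suffix=".gp", delete=False) as tmp:
        tmp.write(script)
    try:
        result = subprocess.run(["gnuplot", tmp.name],
                                capture_output=True, check=True)
    finally:
        os.remove(tmp.name)   # delete the temporary file
    return b"Content-type: image/gif\r\n\r\n" + result.stdout

# Hypothetical counts for two single years of age.
script = build_pyramid_script([0, 1], [10, 20], [11, 19])
```

Separating the construction of the script from the call to gnuplot means that the generated instructions can be inspected or tested on a machine where gnuplot is not installed.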

Where there are fixed ideas about the analysis that is required, calculations of various measures can be carried out directly within the application. For example, an out-migration rate can be simply computed (assuming that a suitable population at risk has been selected), and the answer returned for a selected set of areas. Numerous statistics can be computed which seek to summarise migration in various ways or indicate the effect that it has on populations. Plane and Rogerson (1994) describe a variety of such statistics.

Some of these can be computed directly from counts of population and migration flows, and may be calculated directly in the application, whereas others require more data; the calculation of migration expectancies for example requires a full life table, the construction of which requires age specific mortality rates.
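Two of the simpler measures, which need only counts of flows and populations, can be sketched as follows. The function names, the rate per 1000 persons and the example figures are illustrative assumptions rather than the definitions used in the application.

```python
def out_migration_rates(flows, population, per=1000):
    """Out-migrants per `per` persons at risk, for each zone.

    flows      : dict mapping (origin, destination) to a count of migrants
    population : dict mapping zone to its population at risk
    """
    out_totals = {}
    for (orig, dest), count in flows.items():
        if orig != dest:   # moves within a zone are not out-migration
            out_totals[orig] = out_totals.get(orig, 0) + count
    return {zone: per * out_totals.get(zone, 0) / pop
            for zone, pop in population.items()}

def net_migration(flows):
    """In-migrants minus out-migrants for each zone."""
    net = {}
    for (orig, dest), count in flows.items():
        if orig != dest:
            net[dest] = net.get(dest, 0) + count
            net[orig] = net.get(orig, 0) - count
    return net

# Hypothetical flows between two zones, with one intra-zone flow.
flows = {("a", "b"): 30, ("b", "a"): 10, ("a", "a"): 5}
population = {"a": 1000, "b": 2000}
rates = out_migration_rates(flows, population)
balances = net_migration(flows)
```

Across a closed system of zones the net balances sum to zero, which gives a simple consistency check on any extracted set of flows.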

There are a variety of migration statistics in which distance between origins and destinations is an important component of the calculation. The concept of distance is, however, not a simple one, and care must be taken when deciding on a measure of distance to use. Simple measures of distance include a direct 'as the crow flies' Euclidean distance between two points, whilst a more sophisticated approach would be to calculate distance over the Earth's surface between points (in a country as large as Australia, surface curvature can make a significant difference to distance calculations). Both of these measures attempt to draw a direct line between points, but this is not necessarily a representative measure of distance. Where origins and destinations are separated by a significant physical barrier, such as a body of water, we may wish to calculate some measure of distance which involves a route which goes around the barrier. Clearly, there are a variety of ways in which this might be done, some of which are relatively simple and some of which are sophisticated. As these methods become more sophisticated it may be desirable to use a GIS with network analysis features to calculate a distance through a transportation network. It may be that a distance matrix between all origins and destinations is calculated and then loaded into the database, although an alternative approach may be to establish a link with a GIS engine which can be used to calculate distances as and when required.
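The difference between the two direct measures can be illustrated with a short sketch. The great-circle distance is computed here with the haversine formula on a spherical Earth, which is one common choice and an assumption on our part; the function names and the coordinates used are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean radius, treating the Earth as a sphere

def planar_distance(x1, y1, x2, y2):
    """Straight-line (Euclidean) distance between two projected points."""
    return math.hypot(x2 - x1, y2 - y1)

def great_circle_distance(lat1, lon1, lat2, lon2):
    """Distance in km over the Earth's surface, by the haversine formula.

    Latitudes and longitudes are given in decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate coordinates for Sydney and Perth.
sydney_perth_km = great_circle_distance(-33.8688, 151.2093, -31.9505, 115.8605)
```

Either function could be used to fill a distance matrix between all origins and destinations that is then loaded into the database, in the manner described above; route-based distances through a transportation network would need a GIS or a network analysis tool instead.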

Another use for GIS is clearly to provide maps of migration data. This would clearly require close integration between the web application and a GIS, however, and an interim solution would be to provide rough maps that could be created with standard plotting tools or simply to provide data in a suitable format to import into a preferred GIS package. A longer term aim would be to provide a mapping engine which could generate maps of migration balances and rates subject to user queries. A persistent problem in the study of interaction data lies in the difficulty of showing large amounts of flow data on a map. Flow data (i.e. between specific origins and destinations) is often mapped using lines linking the origin and destination, whose widths are proportional to the size of the flow. Figure 8 shows a map of this type, illustrating net migration flows between Statistical Divisions in New South Wales, over the year preceding the 1996 Australian Census. As more origins and destinations are used such maps quickly become very crowded with information, and difficult to interpret, and therefore there is considerable interest in using techniques such as linked map and table windows (Dykes, 1998) to make such maps easier to understand.

[Map of migration flows in New South Wales, 1995-6]
Figure 8: Net migration in New South Wales, 1995-96

One of the key aims of the project is to examine change over time in patterns of migration. One way in which time series data can be summarised graphically is via animation. Figure 9 shows a population pyramid showing the population of the UK disaggregated by sex and single years of age. It would be possible to produce such a pyramid for each year in a time series. The Figure however (in the Web version of this paper) also serves as a link to an animated image which cycles through 7 frames showing pyramids for consecutive years in a time series. This animated image makes it very easy to follow the change in age structure over time.

[Link: Animated UK population pyramid 1991-97]
Figure 9: UK Population pyramid, 1991

Population pyramids can also be constructed for counts other than the total number of residents. Figure 10 shows two pyramids effectively overlaid - one (in blue) showing the numbers of out-migrants by age and sex from Leeds, in the UK, and a second (in red) showing in-migrants to Leeds from the rest of the UK. The difference in the two lines for any particular gender and year of age observation reveals the effects of net migration: if the red line (in-migration) is further away from the y axis than the blue line (out-migration), then there is a net inflow of people, whereas if the blue line extends beyond the red line, then the net flow is outwards. Again, in the Web version of this paper, the image forms a link to an animated image showing changing in and out-migration pyramids over a small time series. It can be seen from the animated image that the migration data exhibit greater volatility than the population estimates, but that the general pattern remains the same (although the volume of migration changes): for both males and females there is a significant net flow into Leeds around the ages of 18 to 20, and a significant but smaller outflow between the ages of 21 and 25.

[Link: Animated Leeds migration pyramid 1983-96]
Figure 10: Leeds in and out-migration, 1983-84

The Lexis diagrams introduced above are often used simply to illustrate the internal relationships in demographic data. However, they can also be used to show patterns in specific data, by attaching meaning to the cohort ribbons or to particular age-period-cohort elements. The shapes of cohels and other parts of a Lexis diagram play an important role in demarcating the space in which certain events have occurred or might have occurred. Whilst the shape of such components cannot therefore be altered, we can still attach meaning by using colour to indicate some level of migration intensity. Alternatively, it is possible to construct a 3 dimensional model in which the 2 dimensional graph (such as the one illustrated in Figure 2) becomes a base plane, and cohels are shown as being raised to some height above (or below) this plane, with the height corresponding to some variable in which the analyst is interested. Through the use of both a z dimension and colour we can simultaneously show at least two variables mapped on to the cohort representation. It may also be possible to animate such a model, although it would be important to note that such animation would fulfil a different role to that used in Figures 9 and 10. In those images, animation is used to show change over time; however in a Lexis diagram time is already included as one of the primary axes. Therefore, animation could be used to cycle through cases of some other variable. The preference in such a model would be to use a variable with a continuous scale, so as to show the way in which migration intensity, disaggregated by age, cohort and period, changes (if at all) as the selected variable changes. One possible use would be to use or create a distance-of-move variable, and then show a series of frames illustrating the way the Lexis diagram changes for moves of increasing length. Such models would then potentially show a considerable number of variables simultaneously; it must be remembered that much migration data is limited in the number of additional variables attached to it.

5. Conclusions

The paper has described two time series of migration data, and discussed some of the theoretical background behind the differences in the structure of the two datasets. A project to make the data comparable is under way, but the data remain complex and thus a suitably powerful and flexible mechanism to explore and analyse the data is required.

The global standardization and expansion of Internet technologies, coupled with the use of large digital datasets, provides opportunities for information to be disseminated much more widely. The authors are in the process of exploiting these technologies with the aim of providing an interface to migration datasets of the types described. As such it is an excellent research tool, transcending the physical boundaries that would normally hinder this type of collaborative research.

5.1 Future directions

The present database is designed around data collected for two countries. It is a logical step to include data for additional countries, and thus the metadata structure of the web application has been carefully designed to allow considerable flexibility in the way in which the data are described and referenced, and has also been designed with the ability to draw data from multiple database servers in mind.

The data is complex and often requires considerable explication. The Web is well placed to offer large amounts of documentation that is both integrated with the interface and also context-related. It is hoped that derivatives of this application, or similar such interfaces, may be used more directly for teaching purposes.


The research reported here is supported by the Australian Research Council (Grant reference A79803552) and the Economic and Social Research Council (Grant reference R000237375).


Australian Bureau of Statistics (1996), Statistical Geography Volume 1, Australian Standard Geographical Classification (ASGC), Catalogue No. 1216.0, Australian Government Publishing Service, Canberra

Blake, M, (1998), A temporally consistent spatial framework for Australia, Ninth National Conference of the Australian Population Association, University of Queensland, Brisbane

Boden P., Stillwell J.C.H. and Rees P.H. (1992) How good are the NHSCR data?, in Stillwell J.C.H., Rees P.H. and Boden P. (eds.) Migration Processes and Patterns Volume 2: Population Redistribution in the United Kingdom, Belhaven Press, London, pp13-27.

Castro L.J. and Rogers A. (1983) What the age composition of migrants can tell us, Population Bulletin of the United Nations, 15.

Champion A.G. (ed.) (1989) Counterurbanisation: The Changing Pace and Nature of Population Deconcentration, Edward Arnold, London.

Duke-Williams, O.W, (1998) Interfaces to interaction data, in Rees, P.H, (ed.) The 2001 Census: What do we really, really want, Working Paper 98/7, School Of Geography, University of Leeds, Leeds.

Dykes, J.A., 1998, Cartographic visualization: exploratory spatial data analysis with local indicators of spatial association using Tcl/Tk and cdv, The Statistician, 47(3), pp485-97

Easterlin, R,. (1980), Birth and fortune: the impact of numbers on personal welfare, Basic Books, New York

Fielding A.J. (1982) Counterurbanisation in Western Europe, Progress in Planning, 17, Pergammon Press, London, pp1-52.

Long L.H. (1991) Residential mobility differences among developed countries, International Regional Science Review, 14, pp133-47.

Long L.H., Tucker C.J. and Urton W.L. (1988a) Migration distances: an international comparison, Demography, 25, pp633-40.

Long L.H., Tucker C.J. and Urton W.L. (1988b) Measuring migration distances: self-reporting and indirect methods, Journal of the American Statistical Association, 83, pp674-78.

Long L.H. (1992) Changing residence: comparative perspectives on its relationship to age, sex and marital status, Population Studies, 46, pp141-58.

Nam C.B., Serow W. and Sly D. (1990) International Handbook on Internal Migration, Greenwood Wesport, Connecticut.

Plane, D.A, and Rogerson, P.A, (1994) The geographical analysis of population with applications to planning and business, Wiley

Pressat, R., (1961), L'analyse demographique, Presses Universitaires de France, Paris

Rees P.H., Stillwell J.C.H., Convey A. and Kupiszewski M. (eds.) (1996) Population Migration in the European Union, Wiley, Chichester.

Rogers A. and Castro L.J. (1986) Migration, in Rogers A. and Willekens F.J. (eds.) Migration and Settlement - A Multi-regional Comparative Study, Reidel, Dordrecht, pp157-208.

Rogers A., Racquillet R. and Castro R.J. (1978) Model migration schedules and their applications, Environment and Planning A, 10(5), pp475-502.

Vandershrick, (1992), Le diagramme de Lexis revisité, Population, 47(5), pp1241-1262