Considering (mis-)representation in geodemographics and lifestyles

Rich Harris
School of Geographical Sciences, University of Bristol, UK.
Email: R.J.Harris@bris.ac.uk
Website: http://www.ggy.bris.ac.uk/pgs/pgrsch/rh/research1.htm



Abstract

Users of (British) geodemographic products may encounter a problem of (mis-)representation giving rise to the ecological fallacy: the typologies give a socio-economic profile of the 1991 UK Census enumeration districts (EDs), not a direct classification of individual consumers. Marketeers are increasingly looking to lifestyle databases as a solution, 'fusing' together dis-aggregate and diverse individual-level data to form composite datasets. This paper contends that too little is known (or, perhaps, admitted) about the sources of error and of sampling bias associated with this process. Potential errors are propagated by the fusion process and this can lead to a marked divergence between the characteristics of the 'digital personae' and the characteristics of the real-world individuals the data purportedly represent. Consequently, whilst lifestyle databases may provide potentially rich sources of micro-data for both geodemographic and social-scientific analysis, the fundamental issue of representation remains.

Key words: ecological fallacy, geodemographic typologies, lifestyle databases, representation

1. Introduction

Unwittingly, or otherwise, users of geodemographic typologies may encounter a problem of representation. The typologies will have been derived from the 1991 Census, but the Census describes areas, not individuals and their immediate, ecological circumstances. To infer the characteristics of individuals from the characteristics of their 'neighbourhood' is to risk invoking the ecological fallacy. This problem, arising from the misapplication of geodemographics, is considered in Section 2.

The prevalent 'solution' adopted by marketeers is to profile customers at the individual level using their recent consumer behaviours, as stored in a lifestyle database. The prime advantage of geodemographics as a 'segmentation tool', differentiating between different neighbourhood-types and partially capturing the geography of consumers, might be maintained by 'fusing' together (that is, merging) geodemographic and lifestyles approaches [ Section 3 ].

However, such procedures do not avoid fundamental issues of representation. Are the 'digital personae' (after Clarke, 1994), as stored in the database, representative of actual, real-world individuals? Non-random methods of data collection may conspire to bias the lifestyle dataset towards particular sub-sections of the population, creating geographies of bias and exclusion. Some of these geographies are identified with respect to a region inclusive of and surrounding Bristol, England [ Section 4 ].

This paper should not be read as a criticism of the lifestyles (or geodemographic) methodology. Rather, lifestyle databases are cited:

2. Geodemographics

During the 1980s, marketeers developed a new maxim, 'we know who you are because we know where you live', selling it in the form of geodemographic products, of which CACI's ACORN, Experian's MOSAIC and CDMS' Super Profiles are current examples. Indeed, the maxim seems almost to define geodemographics, being similar to Sleight's (1997: 16) 'favourite definition' of the methodology viz. "the analysis of people by where they live".

These definitions usefully emphasise the geographical nature of geodemographic typologies. Simply, these typologies are formed by a 'like-with-like' grouping of almost 150,000 UK Census enumeration districts / Scottish output areas (both abbreviated to ED, here) into one of 10, 25, 50 or so different 'clusters', or area-types; their 'likeness' being defined with reference to fifty or more classificatory Census variables. The most alike EDs are usually defined as those separated by the shortest distance, d, in n-dimensional space, where d is the Euclidean distance between the EDs, calculated using Pythagoras' theorem in n dimensions, and n corresponds to the number of classificatory variables (see Curry, 1992: 204-5). The definitions also suggest the 'stick used to beat geodemographics' (after Sleight, 1998: 8). In practice, geodemographic-based analysis explicitly considers only the areas in which people live. The consideration given to the individuals themselves is secondary and ambiguous. The stick, then, is the 'ecological fallacy': it is inherently problematic to ascribe characteristics and relationships which dominate at one level of aggregation to a different level - here, from areas to individuals.
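The distance measure described above can be illustrated with a minimal sketch. It assumes each ED is represented by a list of n standardised variable scores, and that an ED is allocated to whichever cluster centre lies nearest to it. The allocation rule, the function names and the three-variable scores are hypothetical and chosen for illustration only; the commercial typologies may use different clustering procedures.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two EDs described by n classificatory variables."""
    if len(a) != len(b):
        raise ValueError("both EDs must have the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_cluster(ed, centres):
    """Return the label of the cluster centre closest to the ED (assumed rule)."""
    return min(centres, key=lambda label: euclidean_distance(ed, centres[label]))


# Hypothetical standardised scores on n = 3 variables (not real Census data)
centres = {"A": [1.0, 0.5, -0.2], "B": [-0.8, -0.4, 0.9]}
ed = [0.7, 0.1, 0.0]
print(nearest_cluster(ed, centres))
```

With fifty or more variables the calculation is unchanged; only the length of each list grows.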

A geodemographic cluster does not delimit a perfectly homogeneous social-economic/demographic 'landscape', because of the heterogeneity within and between its constituent EDs. For example, it might be assumed that because a cluster is characterised by a high number of elderly persons, each of the several thousand EDs within that cluster will share this characteristic. That assumption may be invalid. Although the clustering methodology attempts to limit within-cluster heterogeneity, age is only one of the n variables used to define the likeness of the constituent EDs (and thus the within-cluster homogeneity). Accordingly, Flowerdew (1991: 8) writes,
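The point about within-cluster heterogeneity can be made concrete with a small sketch. All figures below are hypothetical: six EDs grouped to one 'elderly' cluster, each with an invented percentage of residents aged 65 and above, and an invented study-region average. The sketch simply counts how many of the cluster's EDs fall below the regional average despite the cluster label.

```python
from statistics import mean


def share_below(values, threshold):
    """Fraction of values falling below the threshold."""
    return sum(v < threshold for v in values) / len(values)


# Hypothetical % of residents aged 65+ in six EDs grouped to one 'elderly' cluster
cluster_eds = [31.0, 28.0, 12.0, 35.0, 9.0, 29.0]
region_average = 16.0  # hypothetical figure for the whole study region

cluster_mean = mean(cluster_eds)
below = share_below(cluster_eds, region_average)
print(f"cluster mean {cluster_mean:.1f}%, share of EDs below region average {below:.2f}")
```

In this invented case the cluster mean is well above the regional figure, yet a third of the EDs are not 'elderly' areas at all, and no ED describes any single resident.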

"for any commercial application a general purpose [geodemographic] system is likely to be inferior to one designed specifically for the purpose."

Geodemographic vendors may respond that often the general purpose system is considered simpler by users: both to implement and as a basis for comparison between different analyses.

Birkin (1995: 130) notes that geodemographic typologies are derived primarily from Census data, and an inspection of the technical notes for one of the commercial typologies confirms this to be true. Direct measures of consumer behaviours and practices are absent during the formation of the typologies. Even setting aside the issue of within-cluster heterogeneity, this absence would seem sufficient to undermine any guarantee that because, say, a drinks retailer had achieved the most successful sales of New World Chardonnay in test stores located within cluster C, so it will also achieve further successful sales having 'rolled-out' the product to other stores located within this cluster. A social-economic-cum-demographic - i.e. Census-based! - areal typology is not the same as a typology derived using variables directly measuring consumer practices, let alone a categorisation of the individual consumers who may be of ultimate interest. Yet, geodemographics does seem 'to work', or, at least, to work well enough for the requirements of commercial users and to give the geodemographic and lifestyles industries a turnover of approximately £100 million by 1996! (source: Target Market Consultancy, 1995. Projected from an estimated £54 million turnover by the geodemographic industry and an estimated £30 million turnover by the lifestyle industry as of 1995. Quoted by Sleight, 1997: 17).

The assumption made when applying geodemographics is an old one - that 'birds of a feather flock together.' The assumption implies only an association between individual consumer behaviours and the wider socio-economic environment within which the consumer resides. The demographic, economic and physical criteria do not purport to totally classify the social individual (after, but contrary to the 'mechanistic approaches' described by Cathelat, 1990: 97). Admitting this should change the maxim from 'we know who you are because we know where you live' to (the less catchy) 'we've got a broad idea about your buying habits and preferences, inferred from the type of neighbourhood we think you live in'. Geodemographic advertising does not admit the association between area and individual to be vague and the discriminatory (or, segmentation) power of the typology to be ambiguous. This may be fine when a modest increase in sales of New World Chardonnay will recoup the cost of the analysis but has more profound implications when those living within a 'poor'-ly labelled cluster find their bank branches have closed. It is somewhat sobering to realise that at least one postcode unit in Bristol, England, can be considered as an 'Affluent Achievers' neighbourhood or as a 'Have Nots', dependent upon where the centre of the postcode is taken to be and relating this to the Super Profiles geodemographic typology.

3. Lifestyles

It was partly an awareness of the ecological fallacy, and of the indirect variables used by geodemographics to analyse consumer behaviours, that led to the appearance of lifestyle database companies in Britain from the mid-to-late 1980s.

"A working definition of lifestyle database companies might be - 'lifestyle database companies build large databases of individuals, sourced from lifestyle questionnaires'" (Sleight, 1997: 16).

These companies include CSL (Consumer Surveys Limited), NDL (National Demographics and Lifestyles) and ICD (International Communications and Data). A recent (1998) questionnaire asked over 200 questions pertaining to: shopping habits; homes and property; health; finances; motoring; recreation and leisure; and holiday and travel preferences. In addition to collating data from postal surveys, sent to residential addresses across Britain, lifestyle databases can also include information obtained from the registration guarantees of durable goods, from high court judgements, from shareholder records and from basically anywhere else where individual-level data may be obtained legally! (Information about ICD's parent company Metromail can be found at http://www.metromail.com/).

The traditional domain of lifestyles has been 'mail targeting': posting mail-shots ('junk mail'!) directly to individuals whose recent buying preferences or behaviours are indicated to conform to desired criteria. More recent developments have seen,

"… a blurring of former distinctions [between geodemographics and lifestyles], as lifestyle database companies adopt geodemographic techniques, and geodemographic companies utilise lifestyle data within their products. This convergence is now a well-established trend." (Sleight, op. cit.).

This convergence ought to give the best of both worlds - individual, consumer-based data coupled with the 'geo' of geodemographics. However, as with geodemographics, lifestyles advertising is unsurprisingly reluctant to divulge the short-comings associated with the approach: the data are not truly representative of the British population but are biased towards particular sub-groups of it (see Section 4, below).

The problem raised by data fusion is illustrated by reference to a claim made of Claritas' Lifestyle Universe product (Reed, 1997: 16). The Lifestyle Universe product is formed from the fusion of the NDL National Consumer Survey and CMT's National Shoppers Survey, as well as share registers, County Court judgements and other financial indicators, to cover 75% of UK households, purportedly with 100% accuracy. Using probability-based modelling, the lifestyle characteristics of the remaining 25% can be selected with an alleged 70% accuracy. Lifestyles surveys implicitly assume that respondents don't lie. Maybe not - but questionnaire recipients are instructed to answer questions on behalf of their partners and/or other members of their households. Inaccurate information may be accorded to the digital persona of the actual individual. In turn, the digital persona may be assumed to be an accurate proxy for the individual themselves. Can this hidden error be checked for and quantified?

Moreover, are the initial datasets (for example, the Consumer or Shoppers Surveys) biased in any way? Does the fusion process lead to error propagation? Can the sources and outcomes of the error be determined? What is the basis for claiming a hundred, or seventy, percent accuracy when the traditions and mathematical rigour of sampling theory and survey design have been departed from? Compare, for example, the lifestyles method against the survey design explicated by Moser and Kalton, 1985. (For other examples of composite datasets see Claritas' Dimension product at http://spider.claritas.com/dimensn.htm, or SRI Consulting's GeoVALS product at http://future.sri.com/vals/gv.dirmail.html, which links a segmentation of US consumers based on their 'psychographs' to the geography of the US Zip Codes).
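The arithmetic implied by the accuracy claims can be sketched as a coverage-weighted average. The first calculation uses the figures as claimed for the product (75% of households at 100% accuracy, 25% at 70%). The second replaces the 100% figure with 90%; that value is purely hypothetical, chosen to show how an unacknowledged error in the fused records (for instance, from answers given on behalf of other household members) would lower the overall figure. The function name is an assumption made for illustration.

```python
def overall_accuracy(segments):
    """Coverage-weighted accuracy; segments are (share_of_households, accuracy) pairs."""
    total_share = sum(share for share, _ in segments)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("household shares must sum to 1")
    return sum(share * accuracy for share, accuracy in segments)


# Figures as claimed for the product: 75% of households at 100%, 25% at 70%
claimed = overall_accuracy([(0.75, 1.00), (0.25, 0.70)])

# Hypothetical alternative: suppose only 90% of the fused records are correct
hedged = overall_accuracy([(0.75, 0.90), (0.25, 0.70)])

print(f"claimed: {claimed:.3f}, hypothetical: {hedged:.3f}")
```

The sketch treats the two segments as independent. In practice the modelled 25% is derived from the fused 75%, so any error in the latter would probably propagate into the former as well.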

These considerations may not be of immediate relevance to the marketeer, to whom the Lifestyle Universe product provides a welcome alternative to the aggregated and non-consumer based data of the 1991 Census and to 'traditional' geodemographics. However, lifestyle databases could also offer a source of micro-level data to those involved in service and needs planning, and to those in social-scientific research (Openshaw and Turton, 1998). It would seem essential to know the deficiencies of such data if confidence is to be attributed to the results and decisions arising from a lifestyle-based analysis.

4. Geographies of bias and exclusion: an example from Bristol, England

The study region is shown in Figure 1 (below) as including both the city (shaded blue) and the surrounding area (shaded grey) of Bristol. This region is defined by 'BS' (Bristol) postcodes matched to Census EDs, using the postcode-to-ED directory held at MIDAS, Manchester, England (http://midas.ac.uk/). There are nearly 70,000 lifestyle records (digital personae) geo-referenced to addresses within the 1667 Census EDs shown.

Figure 1 - The Bristol study region

The data form a subset of a national lifestyle database collated from a postal survey undertaken in the mid-1990s. Having excluded, in effect, Houses in Multiple Occupation (HMOs, more than one household per address), 20 million questionnaires were sent to addresses across Britain, a national response rate of approximately 10% being achieved. This is a large survey (compared with 'traditional' social-surveys or electoral polls, for example, or with the Target Group Index, a regular national survey of the consumer habits and preferences of 25,000 households and used to provide an additional 'descriptive layer' to the Super Profiles typology, see Batey and Brown, 1995: 89-95). However, not all the questionnaires were completed in their entirety.

(source: from correspondence with the data-supplying company).

There are other 'holes' in the data. Figure 2 shows the percentages of EDs for which there are no data upon any individual of a given lifestage group. For example, the first column indicates that in only 4% of EDs does the lifestyles survey fail to enumerate anybody of lifestage group thirteen. The lifestage groups are defined in Table 1.

Figure 2 - %EDs with no lifestyles data for particular lifestage groups

source: 1991 UK Census data, Crown copyright and lifestyles data

Table 1 - The 13 lifestage groups

#    Definition
1    aged 16-24 years, no children aged 0-15 years (in household)
2    16-24, child(ren), 0-15
3    25-34, no children
4    25-34, child(ren), 0-4
5    25-34, youngest child 5-10
6    25-34, youngest child 11-15
7    35-54, no children
8    35-54, child(ren), 0-4
9    35-54, youngest child 5-10
10   35-54, youngest child 11-15
11   55-64, working or retired
12   55-64, unemployed or economically inactive
13   65 and above

Along the horizontal scale of Figure 2 is given the percentage of the Bristol population per lifestage group, as derived from the 1991 Census. Lifestage group thirteen accounts for the largest proportion of the Bristol population (an estimated 23%). It is therefore unsurprising that in only 4% of EDs does the lifestyles survey fail to enumerate a single person of this lifestage. At the other extreme, lifestage group six forms a negligible proportion of the Census population. Accordingly, in nearly 90% of EDs there is not a single person of this group enumerated by the survey.
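The measure plotted in Figure 2 can be sketched as follows, assuming the lifestyle records are available as (ED identifier, lifestage group) pairs, one per digital persona. The record layout, the function name and the four-ED example are hypothetical; the actual data structure used by the data-supplying company is not known.

```python
from collections import defaultdict


def percent_eds_without_group(records, all_eds, groups):
    """For each lifestage group, the % of EDs with no lifestyle record of that group.

    records: iterable of (ed_id, lifestage_group) pairs, one per digital persona.
    """
    ed_set = set(all_eds)
    eds_seen = defaultdict(set)
    for ed_id, group in records:
        if ed_id in ed_set:  # ignore records outside the study region
            eds_seen[group].add(ed_id)
    n = len(ed_set)
    return {g: 100.0 * (n - len(eds_seen[g])) / n for g in groups}


# Hypothetical example: four EDs and a handful of records
all_eds = ["E1", "E2", "E3", "E4"]
records = [("E1", 13), ("E2", 13), ("E3", 13), ("E4", 13),
           ("E1", 6), ("E2", 1), ("E3", 1)]
gaps = percent_eds_without_group(records, all_eds, groups=[1, 6, 13])
print(gaps)
```

Comparing each group's percentage with its share of the Census population, as the text goes on to do, is what distinguishes an expected gap (a small group) from a suspected bias.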

The 'middle ground', half-way along the horizontal axis, is more interesting. In particular, the third most prevalent Census group, '16-24, no children', is absent from the lifestyles data covering 60% of the EDs. Yet the fourth most prevalent group, '25-34, no children', is absent from only 17%. Further, group two, also aged 16-24 years but with children (seventh column from left), is disproportionately absent from the lifestyles data when compared with group nine, '35-54, youngest child 5-10' (eighth column). Each of these groups accounts for 5% of the Census population, but the former is absent from the lifestyles data covering 60% of EDs, the latter from only 21%. These results support John Clements' contention about lifestyle databases (quoted by Reed, 1998: 11) that,

"there is a slight [?!] down-weighting in 18-22 year olds because [of] mails from the electoral roll, generally targeting the main income earner or shopper. So forms are not completed by 18-22s who are living at home or are in multi-occupancy student accommodation."

The bias against younger adults and against areas of HMOs is reflected in the geodemography of 'penetration rates' (Table 2, below). A penetration rate of 100% indicates that all the Census population, of any lifestage group, but living within an ED grouped to a specified Super Profiles geodemographic cluster, have been enumerated by the lifestyles survey. At the Super Profiles ten-cluster level of aggregation (all EDs assigned to one of ten groups, less a few 'residual' cases) the lowest average penetration rate is for the 'Urban Venturers' areas (8.8%), characterised by younger populations and city centre locations. The highest penetrations are achieved in the 'Nest Builders' and 'Settled Suburbans' areas (12.3% and 11.5%, respectively). Whether 'the data has a slight bias towards the upmarket' (after Reed, op. cit.) is not clear.

Table 2 - The geodemography of 'penetration rate'

'Economic Rank'   Geodemographic cluster   Penetration rate (%)
4                 Nest Builders            12.3
3                 Settled Suburbans        11.5
8                 Producers                11.2
1                 Affluent Achievers       10.6
9                 Hard Pressed             10.5
2                 Thriving Greys           10.4
7                 Senior Citizens          9.8
10                'Have Nots'              9.1
6                 Country Life             8.9
5                 Urban Venturers          8.8
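The calculation behind Table 2 can be sketched under two assumptions that are not confirmed by the source: that the penetration rate of an ED is the number of lifestyle records expressed as a percentage of its Census population, and that the cluster figure is the unweighted mean of the ED rates. The function names and the counts are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def penetration_rate(lifestyle_count, census_population):
    """Lifestyle records as a percentage of the Census population of an ED."""
    if census_population <= 0:
        raise ValueError("census population must be positive")
    return 100.0 * lifestyle_count / census_population


def mean_rate_by_cluster(eds):
    """Unweighted mean of ED penetration rates per cluster (assumed averaging rule).

    eds: iterable of (cluster_label, lifestyle_count, census_population).
    """
    rates = defaultdict(list)
    for cluster, count, population in eds:
        rates[cluster].append(penetration_rate(count, population))
    return {cluster: mean(values) for cluster, values in rates.items()}


# Hypothetical counts for four EDs (not the Bristol data)
eds = [
    ("Nest Builders", 50, 400),
    ("Nest Builders", 36, 300),
    ("Urban Venturers", 40, 500),
    ("Urban Venturers", 45, 450),
]
rates = mean_rate_by_cluster(eds)
print(rates)
```

A population-weighted mean would be an equally defensible choice and could give different rankings where ED sizes vary widely within a cluster.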

The lifestyles data exhibit a geography of penetration rate, especially tending to bias against and exclude those individuals living within the centre of the city. It is estimated that the survey questionnaire achieved at least partial responses from approximately 10% of all households within the study region. In Figure 3, a surface of penetration rate is shown, interpolated from the ED-average penetration rates. A lower than (study-region) average penetration is indicated in dark blue, a higher penetration in dark red. A sampling method biased against inner-city residents can be discerned.

Figure 3 - The geography of 'penetration rate' across all lifestage groups and by ED

5. Conclusions: 'Geo-marketing science'

Despite containing a potential wealth of micro-level data, lifestyle databases are not spatially replete. Issues of incomplete population coverage and of bias may prohibit lifestyles moving beyond their 'traditional' use for direct mail targeting to their use as a segmentation tool in the same way as geodemographics. The lifestyles data probably cannot (yet!) be used to create the 'general purpose [geodemographic] system' alluded to in Section 2 (after Flowerdew, 1991: 8). The applications of lifestyles data also appear limited. As Reed (1998: 11) notes, "there is an underclass which is left out because they are not on the electoral roll, but they are not of interest to marketers." This 'underclass' [sic] may nonetheless be of primary concern to those involved in service and needs planning.

To re-iterate, the intention of the paper has not been to criticise the lifestyles method but to raise a caveat against the growing temptations to 'geo-market' science - understood as the move towards automated data-mining (or geocomputation) on data derived from sources other than the traditional social surveys or decennial Census (see Goodchild and Longley, 1998, for a review of the traditions in spatial analysis and of elements of a new perspective). The caveat can be expressed simply: what comes out (in the form of analysis) is always a function of what (data) goes in; it is therefore necessary to 'know' the data being used, to be sure that it really is 'the right tool for the job'.

To the user who has defected from geodemographics to lifestyles, a warning might be issued. If the ecological fallacy is simply expressed as the 'misrepresentation of individuals', then, supposing a reliance on (biased) individual-level data which are more representative of fictitious digital personae than of actual individuals, it would seem that 'the stick' used to beat geodemographics could be beating lifestyles, too!

References

Berry MJA and G Linoff (1997) Data mining techniques: for marketing, sales, and customer support. John Wiley and Sons, Inc., New York.

Birkin M (1995) Customer targeting, geodemographic and lifestyles approaches. In P Longley and G Clarke (eds) GIS for business and service planning. GeoInformation International, Cambridge, 104-49.

Cathelat B (1990) Socio-styles, English edition. Kogan Page, London.

Clarke R (1994) The digital persona and its application to data surveillance. The Information Society, 10, 77-94.

Curry DJ (1993) The new marketing research systems: how to use strategic database information for better marketing decisions. John Wiley and Sons Inc., New York.

Flowerdew R (1991) Classified residential area profiles and beyond. University of Lancaster, North West Regional Research Laboratory, research report, 18.

Goodchild MF and PA Longley (1998) The future of GIS and spatial analysis. In PA Longley, MF Goodchild, DJ Maguire and DW Rhind (eds) Geographical Information Systems: principles, techniques, management and applications. John Wiley and Sons Inc., New York, chapter 40.

Moser CA and G Kalton (1985) Survey methods in social investigation, 2nd edition. Gower Publishing Company Ltd., Aldershot, Hampshire, England.

Openshaw S and C Openshaw (1997) Artificial intelligence in geography. John Wiley and Sons Inc., Chichester, England.

Openshaw S and I Turton (1998) Geographical research using lifestyles databases. Unpublished paper, presented at the RGS-IBG annual conference, Kingston, England.

Reed (1997) Affluent lifestyle. New Perspectives, 10, 14-17. Adams Business Media, Cambridge, UK.

Reed (1998) We're all individuals now. New Perspectives, 14, 8-14. Adams Business Media, Cambridge, UK.

Sleight P (1997) Targeting customers: how to use geodemographic and lifestyle data in your business, 2nd edition. NTC Publications Ltd., Henley-on-Thames.

Sleight P (1998) Through the line: ivory towers ltd. New Perspectives, 13, 8. Adams Business Media, Cambridge, UK.