Attributing ethnicity using someone’s name

Natasha Bance

In a recent methods and quality report, the Race Disparity Unit (RDU) talked about how someone’s name could be used to assign their ethnicity. This blog post further considers the strengths and weaknesses of this approach.

Research shows that using someone’s name to assign their ethnicity can help fill gaps when ethnicity data is not readily available. People’s names are usually collected in surveys and administrative processes. Easily-accessed, large data sources – such as the electoral register – make names and geographic information readily available.

Thanks to these datasets and modern computing power, developing ethnicity attribution methods using names is much more straightforward than it has been in the past. This has increased the number of third-party tools that use machine learning, algorithms, and other statistical models to classify ethnicities from name data. Estimates of other demographic information, such as gender and nationality, can be derived in a similar way.

These methods might use a person’s first name, last name, or both. First names often carry information about gender, cultural background, nationality, and historical naming trends. Last names carry information about family origins and ethnic heritage. Using both can increase the confidence with which someone’s ethnicity can be predicted.
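As a rough illustration of how first and last names might be combined, the sketch below uses a simple naive Bayes calculation. It is not the method used by any particular tool. The reference rows, the group labels "A" and "B", and the function names `train` and `attribute` are all invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy reference data: (first name, last name, self-reported group).
# The rows and the group labels "A" and "B" are invented for illustration only.
REFERENCE = [
    ("amy", "smith", "A"),
    ("amy", "smith", "A"),
    ("tom", "smith", "A"),
    ("amy", "khan", "B"),
    ("omar", "khan", "B"),
    ("omar", "khan", "B"),
]


def train(rows):
    """Count how often each first and last name appears within each group."""
    group_counts = Counter()
    first_counts = defaultdict(Counter)
    last_counts = defaultdict(Counter)
    for first, last, group in rows:
        group_counts[group] += 1
        first_counts[group][first] += 1
        last_counts[group][last] += 1
    return group_counts, first_counts, last_counts


def attribute(first, last, model, alpha=1.0):
    """Return a probability for each group, treating the two names as independent evidence."""
    group_counts, first_counts, last_counts = model
    first_vocab = {name for counts in first_counts.values() for name in counts}
    last_vocab = {name for counts in last_counts.values() for name in counts}
    total = sum(group_counts.values())
    scores = {}
    for group, n in group_counts.items():
        prior = n / total
        # Add-alpha smoothing so that unseen names do not give a zero probability.
        p_first = (first_counts[group][first] + alpha) / (n + alpha * len(first_vocab))
        p_last = (last_counts[group][last] + alpha) / (n + alpha * len(last_vocab))
        scores[group] = prior * p_first * p_last
    norm = sum(scores.values())
    return {group: score / norm for group, score in scores.items()}


model = train(REFERENCE)
probabilities = attribute("amy", "khan", model)
```

In this toy data the first name appears in both groups, so the last name carries most of the evidence. The result is still only a probability, not a statement of how the person would describe themselves.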

Benefits of using names

Names can be used to generate ethnicity data when it has not been collected:

  • directly through a question in a survey
  • through an administrative process
  • through data linking

Name-based data can be useful in some scenarios, such as when statistical precision is not the primary aim. For example, name-based ethnicity has been successfully used for targeting in health campaigns. In these cases, it might not be necessary to know exactly how many people from an ethnic group live in a particular area, just that a certain number of people from that group live there. The unit of interest is the geographical area, not the individuals who live there.
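The area-level use described above can be sketched as a simple count per area. This is a minimal example under stated assumptions: the area codes, the group labels, the function name `areas_to_target` and the threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical name-attributed records: (area code, attributed group).
ATTRIBUTED = [
    ("E01", "A"), ("E01", "A"), ("E01", "B"),
    ("E02", "B"), ("E02", "A"),
    ("E03", "A"), ("E03", "A"), ("E03", "A"),
]


def areas_to_target(records, group, min_people):
    """Return the areas where at least `min_people` records were attributed to `group`."""
    counts = Counter(area for area, attributed in records if attributed == group)
    return sorted(area for area, n in counts.items() if n >= min_people)


target_areas = areas_to_target(ATTRIBUTED, "A", min_people=2)
```

The output is a list of areas, not a list of people, which reflects the point that the unit of interest is the geographical area.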

But a name-based ethnic classification will never fully represent the ethnic breakdown of a population. This is because it does not correspond to someone’s subjective identification with an ethnic group. People reporting their own ethnicity is the “gold standard”. This can never be replicated by a name-based classification.

Limitations

There are limitations to attributing ethnicity using names.

The data source used

To provide the best attribution, the underlying dataset should be:

  • a large sample of names, including multiple spellings and variations
  • from a similar time period to the target list
  • geographically similar to the target list

However, some data sources are incomplete and not fully representative of the population. For example, electoral registers do not include people who cannot vote, such as children. Similar data sources are typically restricted to people aged 16 and over.

As the population becomes increasingly diverse, it is also important to use up-to-date lists.
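One way to check whether a reference list suits a target list is to measure how many target names it contains after allowing for simple spelling variations. The sketch below is an assumption about how this could be done, not a recommended standard. The `normalise` and `coverage` functions and the example names are hypothetical, and removing accents and punctuation is only one possible approach to handling variations.

```python
import unicodedata


def normalise(name):
    """Lower-case a name and drop accents, spaces and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in without_accents.lower() if ch.isalpha())


def coverage(target_names, reference_names):
    """Share of target names that appear in the reference list after normalisation."""
    if not target_names:
        return 0.0
    reference = {normalise(name) for name in reference_names}
    matched = sum(1 for name in target_names if normalise(name) in reference)
    return matched / len(target_names)


share_matched = coverage(["José", "O'Brien", "Zed"], ["jose", "OBrien"])
```

A low share would suggest that the reference list is too small, too old, or drawn from a different area to the target list.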

Misrepresentation of specific groups

Assigning ethnicity using names will misrepresent some groups, including:

  • people from a mixed ethnic background – this is because people are often given their father’s last name, which reflects only one side of their background
  • black Caribbean people – this is because of similarities between Caribbean and British last names
  • people from countries with a predominantly Muslim faith, such as Pakistan and Somalia – this is because of how common some Muslim last names are in different countries in Asia and Africa
  • people who married someone of a different ethnicity and who took their partner’s last name when they married

Differences in the level of detail available

Difficulties classifying some people, such as those from mixed ethnic backgrounds, could mean that data is not available for the harmonised 18 ethnic group classification, or even for the aggregated 5 group classification.

Some attribution tools include more ethnic groups than the 18 group classification, or related concepts such as ‘heritage’. For example, they may include further breakdowns of Eastern European groups. This can be useful to users interested in specific groups.

In both situations, there might be an effect on users trying to compare the name-attributed data with other datasets where information about ethnicity has been directly collected. The different sources might have categories that are not easily comparable.

Accuracy of ethnicity attribution

The relationship between ethnicity and names is specific to particular times, places, and groups of people. This affects the overall accuracy of the ethnicity category assigned because predicting someone’s ethnicity based on their name may not get the same result as if they were asked to provide it themselves.

Research into estimating ethnicity using family naming practices has also shown that accuracy can be affected by age, gender, marital status, and region of residence.

Some of the available tools provide a level of confidence that the ethnicity attributed from a name matches what the person would report themselves. This, and other information about data quality, should always be presented so that users can draw appropriate conclusions about the data based on its accuracy.
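A confidence level can also be used to decide when not to assign a group at all. The sketch below assumes a tool that returns a probability for each group. The function name `assign_with_threshold`, the "Unknown" label and the 0.8 threshold are hypothetical choices for illustration.

```python
def assign_with_threshold(probabilities, threshold=0.8):
    """Return (group, confidence), or ("Unknown", confidence) when confidence is too low."""
    if not probabilities:
        return ("Unknown", 0.0)
    group, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < threshold:
        # Keep the confidence so it can still be reported alongside the result.
        return ("Unknown", confidence)
    return (group, confidence)


confident = assign_with_threshold({"A": 0.9, "B": 0.1})
uncertain = assign_with_threshold({"A": 0.55, "B": 0.45})
```

A higher threshold leaves more records as unknown but reduces the number of wrong attributions, so the threshold chosen should be reported with the data.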

Conclusion

RDU encourages collecting ethnicity data directly using the correct harmonised categories, or linking to a dataset that has directly collected ethnicity, rather than assigning ethnicity using name attribution.

However, in some circumstances name-matching might be an acceptable way to estimate someone’s ethnicity when it was not originally gathered in a survey or other form of data collection.

We would encourage anybody using names to attribute someone’s ethnicity in a dataset to fully understand the limitations of the approach. These limitations should be explained to allow people to use and interpret the data correctly. The type of information that might be presented could include:

  • details of the classification – for example the statistical modelling or algorithm used
  • what level of confidence the analysts have in the ethnicity assigned
  • details of the underlying datasets that have been used in the attribution method, such as timeliness and coverage
  • any known limitations, such as unavailable ethnic group categories
  • levels of unknown attribution
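Some of the items in the list above can be calculated directly from the attributed data. The sketch below is one possible summary, assuming each record is stored as a group and a confidence value, with "Unknown" marking records that could not be attributed. The function name `quality_summary` and the field names are hypothetical.

```python
from statistics import mean


def quality_summary(assignments):
    """Summarise a list of (group, confidence) pairs for publication alongside the data."""
    total = len(assignments)
    known = [(group, conf) for group, conf in assignments if group != "Unknown"]
    return {
        "records": total,
        # Level of unknown attribution, as a share of all records.
        "unknown_rate": (total - len(known)) / total if total else 0.0,
        # Average confidence across records that were given a group.
        "mean_confidence_known": mean(conf for _, conf in known) if known else None,
    }


summary = quality_summary([("A", 0.9), ("Unknown", 0.5), ("B", 0.7), ("Unknown", 0.4)])
```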

We’re interested in examples of names being used to attribute ethnicity. If you have any case studies you would like to share, please contact us on Ethnicity@CabinetOffice.gov.uk.

References

Mateos P ‘A review of name-based ethnicity classification methods and their potential in population studies’ Popul. Space Place 2007: volume 13, pages 243-263

Kandt J and Longley PA ‘Ethnicity estimation using family naming practices’ PLoS ONE 2018: volume 13 (number 8), document ID e0201774

Webber R ‘Using names to segment customers by cultural, ethnic or religious origin’ Journal of Direct, Data and Digital Marketing Practice 2007: volume 8, pages 226-242

Fiscella K and Fremont AM ‘Use of geocoding and surname analysis to estimate race and ethnicity’ Health Services Research August 2006: volume 41 (number 4, part 1), pages 1482-500

Jun J and Mizuno T ‘Detecting Ethnic Spatial Distribution of Business People Using Machine Learning’ Information 2020: volume 11 (number 4), page 197

Darren Stillwell
Natasha Bance
Darren Stillwell is the Head of Quality and Standards at the Equality Hub.