Written by Philip Blair
Posted on: May 26, 2024 at 08:44 AM
JRC-Names-Retrieval: A Standardized Benchmark for Name Search
Nodora Partners' work presented at LREC-COLING 2024

From banks conducting a know-your-customer (KYC) check to hospitals looking up a patient in their records, names are searched in databases around the world millions of times a day. How well do these lookups work? In reality, we don’t really know. Moreover, in our increasingly globalized world, more and more names are being searched in databases developed by programmers from entirely different cultural backgrounds, which can lead to a number of pitfalls that need to be actively guarded against. This makes the quality of these systems even harder to assess.

This week, I traveled to LREC-COLING with my co-author, Babel Street Chief Scientist and Reichman University Professor Kfir Bar, to present a paper that he and I published together on this topic. Babel Street, a Nodora Partners client, provides a specialized solution for indexing and matching names across many languages ranging from English to Japanese to Arabic.

In this work, we sought to develop a multilingual name retrieval evaluation dataset. While there is a substantial body of research on matching one name with another (e.g. identifying whether “ジョン・スミス” is indeed the Japanese equivalent of “John Smith”) and on the closely related task of disambiguation, there is a lack of research on retrieval itself, a surprising gap given how widespread the task is in industry.

Our approach was based on the JRC-Names dataset provided by the European Commission’s Joint Research Centre. This dataset consists of person and organization names that have been scraped from the Europe Media Monitor (which aggregates global news in many languages) and then clustered together. The data points look like the following:

2447481 P       u       Rick+Genow
973108  P       u       Eric+Adams <---\
973108  P       u       Эрик+Адамс <------ Notice that these have the same ID
973108  P       u       Erik+Adams <---/
2448991 P       u       Kevin+Dooley
2452561 P       u       Angelo+Basile
2452560 P       u       Rubén+Matamoros+Delgado
2452560 P       u       Ruben+Matamoros+Delgado
2452563 P       u       Deirdre+Griffin
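As a minimal sketch (not code from the paper), the records above can be parsed by splitting each tab-separated line into an ID, an entity type (P for person), a flag column, and a '+'-delimited name, then grouping aliases that share an ID:

```python
from collections import defaultdict

# Sample records taken from the excerpt above (tab-separated).
SAMPLE = """\
973108\tP\tu\tEric+Adams
973108\tP\tu\tЭрик+Адамс
973108\tP\tu\tErik+Adams
2448991\tP\tu\tKevin+Dooley
"""

def group_aliases(text):
    """Group JRC-Names-style records into alias clusters keyed by entity ID."""
    clusters = defaultdict(list)
    for line in text.splitlines():
        ent_id, ent_type, flag, name = line.split("\t")
        clusters[ent_id].append(name.replace("+", " "))
    return dict(clusters)

clusters = group_aliases(SAMPLE)
print(clusters["973108"])  # ['Eric Adams', 'Эрик Адамс', 'Erik Adams']
```

Here the three spellings of the New York mayor's name collapse into one cluster because they share the ID 973108.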

We take this data and filter it into separate person and organization splits for each of the following writing-script combinations (related languages shown in parentheses):

  1. Latin (English, French, etc.), Arabic, and Cyrillic (Russian, Bulgarian, etc.)
  2. Latin and Hanzi (Chinese)
  3. Hangul (Korean) and Hebrew
  4. Devanagari (Hindi, Nepali, etc.) and Katakana/Hiragana (Japanese)

The first two collections of writing scripts are meant to be more reflective of “real-world” use cases for this technology, while the last two are meant to be stress tests.
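To illustrate how such a split could be produced, here is a hypothetical helper (not the paper's implementation) that guesses a name's dominant writing script from Unicode character names, which embed the script (e.g. "CYRILLIC CAPITAL LETTER E"); a production system might use ICU or the Unicode Scripts.txt data instead:

```python
import unicodedata

# Script keywords we look for inside Unicode character names.
SCRIPT_KEYWORDS = ["LATIN", "CYRILLIC", "ARABIC", "CJK", "HANGUL",
                   "HEBREW", "DEVANAGARI", "KATAKANA", "HIRAGANA"]

def guess_script(name):
    """Return the most frequent script keyword among alphabetic characters."""
    counts = {}
    for ch in name:
        if not ch.isalpha():
            continue
        uname = unicodedata.name(ch, "")
        for kw in SCRIPT_KEYWORDS:
            if kw in uname:
                counts[kw] = counts.get(kw, 0) + 1
                break
    return max(counts, key=counts.get) if counts else None

print(guess_script("Eric Adams"))  # LATIN
print(guess_script("Эрик Адамс"))  # CYRILLIC
print(guess_script("ジョン"))       # KATAKANA
```

Names could then be routed into, say, the Latin/Cyrillic split whenever a cluster contains aliases in both scripts.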

With this work, we accomplished our goal of providing a litmus test that others can use to compare and assess the accuracy of their name retrieval systems, helping improve accessibility for people across the globe.

Read the full paper here!

Curious about what we could do for your business? Get in touch