Give it a try: https://zigam.github.io/ginkgo/
When my wife and I were expecting our first child, we faced a daunting task: naming our baby in a way that works across all of our cultures. My wife was born in India, I was born in Slovenia, and our baby was about to be born in the US (which is where we live). We wanted the baby’s name to be pronounceable and familiar across all of these countries, while also passing the Starbucks spelling test. Turns out naming is hard, not just in computer science.
As we worked our way through various baby-naming websites and suggestions from friends, we would occasionally stumble upon names that seemed to cross cultures: Maya, Ana, Max, etc. But how could we build an exhaustive list of such universal names and pick one we both like?
We’re both software engineers, so we turned this into a data problem. If only we had large public datasets of first names with their country of origin, frequency, and optionally gender, we could filter those lists to our countries of interest (Slovenia, India, and the US) and intersect them. Exact matches are not enough, though: the name Maya, for example, is spelled Maja in Slovenian but pronounced the same as in English. We still consider this a good name candidate, so we have to take pronunciation into account when intersecting name lists.
The problem can thus be broken down into:
- Data sources: gather large lists of names from curated public sources. To generalize the problem and solve it for other families too, we’d gather names from all around the world.
- Filtering and matching: filter the list by an arbitrary set of countries and by gender or unisex names. When intersecting lists, take into account the pronunciation (phonetic encoding) of the name.
- Ranking: rank the results by how strongly they match (exact match or similar pronunciation) and how popular they are in their respective countries. Ranking lets us surface more common names and exclude spurious matches when the data sources contain unfamiliar names. Note, though, that we’re optimizing for recall: since naming is a very subjective process, we want to generate lots of good candidates.
A few notes:
- We chose country as the unit of localization instead of ethnicity, language, or other units. This has to do with how the public data sources are annotated.
- Naming a baby is a very personal process. Some might care about name meaning, history, or gender identity. We did not set out to solve those problems.
The best sources for first names are generally government statistics departments or census data. The US Social Security Administration has published first names from Social Security card applications going back to the 1880s. It publishes statistics, such as Emma and Liam being the most popular baby names in 2018, but it also provides the raw dataset: 98,400 unique names as of 2018.
There are similar statistical datasets for a handful of other countries (see references for the Wikipedia article on popular names). However, many more countries either don’t publish such datasets or publish them somewhere we weren’t able to find.
A great international dataset was produced by the German computer magazine c’t: 45,371 names across 52 countries (with a focus on European countries), with gender prediction and name popularity. The list is well curated.
To further expand coverage, we looked to Wikipedia: with its 40M articles across 301 languages, it’s a great resource for extracting names. Our initial idea was to extract names from wikitext dumps using named-entity recognition. However, it turns out that Wikipedia’s sister project Wikidata, the multilingual secondary database that collects structured data to support Wikipedia, already contains the data in a useful format. The following SPARQL query extracts Slovenian female names from the knowledge base:
SELECT ?nameLabel ?count
WITH {
  SELECT ?name (count(?person) AS ?count) WHERE {
    ?person wdt:P735 ?name .         # given name
    ?person wdt:P27 wd:Q215 .        # country: Slovenia
    ?person wdt:P21 wd:Q6581072 .    # sex or gender: female
  }
  GROUP BY ?name
  ORDER BY DESC(?count)
  LIMIT 1000
} AS %results
WHERE {
  INCLUDE %results
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?count)
The entire Wikidata dataset yielded 34,530 unique names across 128 countries! The dataset is not as well curated as the ones above, but we can use name frequency to exclude spurious entries. The other problem evident in the Wikidata data is gender bias: female names represent only 18% of all names!
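To cover more than one country and gender, the query above has to be repeated with different Wikidata item IDs. The following is a minimal sketch of how that could be parameterized in Python. The function name `build_name_query` and the constants are our own illustrative choices, not part of any library, and the sketch only builds the query text rather than sending it anywhere.

```python
import re

# Wikidata item IDs used in the query above, plus the item for "male".
SLOVENIA = "Q215"
FEMALE = "Q6581072"
MALE = "Q6581097"

# Same query as above; doubled braces are literal braces for str.format().
QUERY_TEMPLATE = """\
SELECT ?nameLabel ?count
WITH {{
  SELECT ?name (count(?person) AS ?count) WHERE {{
    ?person wdt:P735 ?name .        # given name
    ?person wdt:P27 wd:{country} .  # country of citizenship
    ?person wdt:P21 wd:{gender} .   # sex or gender
  }}
  GROUP BY ?name
  ORDER BY DESC(?count)
  LIMIT {limit}
}} AS %results
WHERE {{
  INCLUDE %results
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
}}
ORDER BY DESC(?count)
"""


def build_name_query(country_qid: str, gender_qid: str, limit: int = 1000) -> str:
    """Return the SPARQL text for one (country, gender) pair (hypothetical helper)."""
    for qid in (country_qid, gender_qid):
        # Reject anything that is not a plain item ID such as "Q215".
        if not re.fullmatch(r"Q[1-9]\d*", qid):
            raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return QUERY_TEMPLATE.format(country=country_qid, gender=gender_qid, limit=int(limit))


if __name__ == "__main__":
    print(build_name_query(SLOVENIA, FEMALE))
```

The resulting text can be pasted into the Wikidata Query Service or sent to its SPARQL endpoint with any HTTP client.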
Filtering the dataset by country and gender is straightforward. To account for pronunciation matches, we used the Double Metaphone phonetic algorithm. Double Metaphone is a refinement of Metaphone, which is itself an improved version of the Soundex algorithm; all of them index names by sound as pronounced in English. These algorithms can be used for spell-checking or for helping phone operators locate a person based on a spoken name.
Here’s an example of Metaphone phonetic encoding:
Karla (common Czech name) → KRL
Carla (common in the US) → KRL
Since the phonetic encodings match, we can consider the pair (Karla, Carla) a good candidate for a Czech-American name.
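The matching step can be sketched as grouping one country's names by phonetic key and looking up the other country's names in that index. The sketch below is an assumption-heavy illustration: `toy_phonetic_key` is a deliberately crude stand-in that we made up so the example runs with the standard library only. It is not Double Metaphone, and a real run would pass a Double Metaphone implementation as `encode` instead. The name lists are illustrative too.

```python
from collections import defaultdict

# Letters the toy encoder drops unless they start the name.
SILENT = set("AEIOUYHW")


def toy_phonetic_key(name: str) -> str:
    """Crude stand-in for a phonetic encoder. NOT Double Metaphone."""
    key = []
    prev = ""
    for ch in name.upper():
        if not ch.isalpha() or ch == prev:
            prev = ch
            continue  # skip punctuation and doubled letters
        prev = ch
        code = "K" if ch == "C" else ch  # treat C as a hard K
        if code in SILENT:
            if key:
                continue  # vowel-like letters only count at the very start
            code = "A"
        key.append(code)
    return "".join(key)


def match_names(names_a, names_b, encode=toy_phonetic_key):
    """Return (exact, phonetic) lists of (name_a, name_b) pairs."""
    index_b = defaultdict(list)
    for name in names_b:
        index_b[encode(name)].append(name)
    set_b = set(names_b)

    exact, phonetic = [], []
    for a in names_a:
        if a in set_b:
            exact.append((a, a))
        for b in index_b.get(encode(a), []):
            if b != a:
                phonetic.append((a, b))
    return exact, phonetic


if __name__ == "__main__":
    czech = ["Karla", "Anna", "Petr", "Jana"]   # illustrative lists
    american = ["Carla", "Anna", "Peter", "Emma"]
    exact, phonetic = match_names(czech, american)
    print("exact:", exact)
    print("phonetic:", phonetic)
```

Keeping the encoder as a parameter means the intersection logic stays the same when the toy key is swapped for a real phonetic algorithm.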
Note that while Double Metaphone takes into account spelling differences in some other languages, we expect it to work best for English pronunciation. This limits its usefulness for finding similar-sounding names when none of the selected countries is English-speaking.
Finally, we rank the results. We weight the results by popularity in their countries and separate exact matches from phonetic matches. Ranking matters, but our goal is also to produce as many good candidates as possible in hopes of finding the one (recall over precision).
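As a rough illustration of this ranking step, the sketch below keeps exact and phonetic matches in separate lists and orders each by a popularity score. The scoring rule (the lowest popularity across the selected countries) is purely an assumption for the example, as are the `Candidate` type and the made-up popularity values. The actual weighting used by the tool is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    names: tuple        # spelling per country, e.g. ("Maja", "Maya")
    popularity: tuple   # hypothetical normalized popularity in [0, 1], one per country
    exact: bool         # True if the spelling is identical in every country


def score(candidate: Candidate) -> float:
    """Assumed rule: a name is only as universal as its weakest country."""
    return min(candidate.popularity)


def rank(candidates):
    """Return (exact, phonetic), each sorted from most to least popular."""
    exact = sorted((c for c in candidates if c.exact), key=score, reverse=True)
    phonetic = sorted((c for c in candidates if not c.exact), key=score, reverse=True)
    return exact, phonetic


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    candidates = [
        Candidate(("Maja", "Maya"), (0.8, 0.6), exact=False),
        Candidate(("Ana", "Ana"), (0.9, 0.3), exact=True),
        Candidate(("Max", "Max"), (0.5, 0.7), exact=True),
        Candidate(("Karla", "Carla"), (0.2, 0.9), exact=False),
    ]
    exact, phonetic = rank(candidates)
    print([c.names for c in exact])
    print([c.names for c in phonetic])
```

Nothing is dropped here, only reordered, which fits the goal of favoring recall over precision.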
The final dataset after some cleanup contains:
- 95,095 unique names across 101 countries.
- 386,512 unique (name, gender, country) tuples.
The most international names weighted by popularity are:
- Female: Maria, found in 79 countries.
- Male: David, found in 94 countries.
- Unisex: Dominique, found in 22 countries.
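Counts like these can be derived from the (name, gender, country) tuples by filtering on gender and collecting the distinct countries per name. The following is a minimal sketch with made-up records and hypothetical gender and country codes. It counts countries only and leaves out the popularity weighting mentioned above.

```python
from collections import defaultdict


def countries_per_name(records, gender):
    """Count distinct countries per name for one gender.

    records: iterable of (name, gender, country) tuples.
    Returns (name, country_count) pairs, most widespread first.
    """
    seen = defaultdict(set)
    for name, record_gender, country in records:
        if record_gender == gender:
            seen[name].add(country)
    # Sort by descending country count, then alphabetically for stable output.
    return sorted(((name, len(countries)) for name, countries in seen.items()),
                  key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Made-up sample; "F"/"M" and the country codes are illustrative.
    records = [
        ("Maria", "F", "SI"),
        ("Maria", "F", "US"),
        ("Maria", "F", "IN"),
        ("Maria", "F", "US"),  # duplicates do not inflate the count
        ("Maja", "F", "SI"),
        ("David", "M", "US"),
    ]
    print(countries_per_name(records, "F"))
```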
There are no names for a boy that match across any pair of countries from the set: (Algeria, Greece, Lithuania, South Korea, Vietnam).
The longest common single-word name is the Sri Lankan Thamotharampillai. The most common short name is the Vietnamese My.
Brittney (Double Metaphone encoding: PRTN) can be spelled in at least 44 different ways.
Last but not least: both of our children were successfully named with the help of this tool! Give it a try: https://zigam.github.io/ginkgo/
By Ziga Mahkovec & Surabhi Gupta