Skip to content

Latest commit

 

History

History
173 lines (95 loc) · 18.8 KB

Reconciliation.md

File metadata and controls

173 lines (95 loc) · 18.8 KB

_This page is also available in http:_qiita.com/yayamamo/items/eade3e5788e6f359bce7.

Reconciliation is a semi-automated process of matching text names to database IDs (keys). This is semi-automated because in some cases, machine alone is not sufficient and human judgement is essential. For example, given "Ocean's Eleven" as the name of a film, should it be matched to

While we talk about "databases" here, perhaps a more accurate term would be "name registries". And conceptually, "reconciliation" isn't a new concept. In some fields, it's being done all the time. For example, when the police gets a name of a suspect, they run the name through their databases of known felons. More details about the suspect (e.g., race, height, last known address) would help the police narrow down the suspect among several database matches.

You can use OpenRefine to perform reconciliation of names in your data against any database that exposes a web service following this Reconciliation Service Api specification. One such database is https:www.wikidata.org/, and in this document we will use it as our example service.

Basics

Given a column containing names, we can reconcile those names against Wikidata. To do so, invoke the column's drop-down menu and pick Reconcile > Start reconciling. If you want to reconcile only some cells in that column, then use filters and facets to isolate them first.

Screenshot of the reconciliation menu

In the Reconcile dialog box, pick Wikidata Reconciliation Service on the left. That service comes by default, but you can install more services (see below to get a version of this service in your language).

There are a few options to configure the reconciliation process, which will help you get more reliable matches. These options are covered in the next sections. To start the reconciliation process, click the Start Reconciling button.

Screenshot of the reconciliation dialog

This process will take a while depending on how much data you have. Currently, the Wikidata reconciliation service processes about 3 rows per second.

When the process is done, you will see that the reconciliation data in the cells.

  • Either the cell was successfully matched: it displays a single dark blue link. In this case, the reconciliation is confident that the match is correct: you should not have to check it manually.

(Screenshot of a matched item)

  • Or a few candidates are displayed, together with their reconciliation score, with light blue links. You need to pick manually the correct one. For each matching decision you make, you have two options: either match this cell only ( button to perform a single match), or also use the same identifier for all other cells containing the same unreconciled value ( button to perform a multiple match).

(Screenshot of an unmatched item)

In addition to changing the ways the reconciled cells are displayed, OpenRefine also automatically creates two facets for you to use to filter the cells based on the reconciliation data. One is a numeric facet for the reconciliation scores of the best candidates of the cells. Higher scores mean better matches. You could filter for the higher scores, and approve them all in bulk (invoking that column's drop-down menu and using the Reconcile > Actions submenu).

There is also the judgment facet, which lets you filter for the cells that haven't been matched (pick "None" in the facet). As you process each cell, its judgment changes from "None" to "Matched" and it disappears from the view, because it no longer fits the facet's selection.

Restricting matches by type

One simple way to improve the quality of the matches is to restrict them to a given type. If your dataset contains universities, then you know that "Heidelberg" does not refer to the https:www.wikidata.org/wiki/ but rather the https:www.wikidata.org/wiki/Q151510.

In Wikidata, types are items themselves. For instance, the https:www.wikidata.org/wiki/Q1377 has type https:www.wikidata.org/wiki/Q875538, which is witnessed by a statement on https:www.wikidata.org/wiki/Q1377 using the https:www.wikidata.org/wiki/Property:P31 property.

Types are organized into an ontology, by specifying which types are subclasses of the others, using the https:www.wikidata.org/wiki/Property:P279 property. For instance, https:www.wikidata.org/wiki/Q875538 is a subclass of https:www.wikidata.org/wiki/Q3918.

(diagram explaining types in Wikidata)

One useful tool to visualize the subclasses of a given type is the https:angryloki.github.io/wikidata-graph-builder/. For instance, here are https:angryloki.github.io/wikidata-graph-builder/?property=P279&item=Q4671277&mode=reverse.

The reconciliation dialog allows you to select a type for your rows. This will restrict the matches to items which are instances of any subclass of the given type. For instance, if you select https:www.wikidata.org/wiki/Q875538, then https:www.wikidata.org/wiki/Q1377 will be a possible match, because of the path depicted above.

By default, OpenRefine will propose you some types based on the first few items of your dataset. It is quite likely that these types will be too specific for your needs. You can specify a broader type that will encompass more items. For instance, if you have a dataset of universities, it might be better to pick https:www.wikidata.org/wiki/Q2385804 rather than https:www.wikidata.org/wiki/Q3918, for instance if some entry happens to be marked as a https:www.wikidata.org/wiki/Q189004 in Wikidata.

Some items are not marked as the instances of anything, because no one has taken the time to add these statements yet. If you restrict the reconciliation to a type, these items will not appear in the results, except when no items of the given type could be found for the particular name. In this case, items with no type can be returned: they will have a low score and will not be matched automatically.

Refining by property

Names are ambiguous, and sometimes your dataset contains other columns that can help identify which item is being referred to. The reconciliation interface can be configured to take these columns into account in the matching scores. For instance, say you have a dataset of sportspeople with a column indicating which sport they play.

(example dataset)

If you want to use the second column in the reconciliation process, you need to identify how this additional information is represented in Wikidata. So you need to find out which Wikidata property is used to link sportspeople to the sport they play. Looking at https:www.wikidata.org/wiki/Q1189 it seems that https:www.wikidata.org/wiki/Property:P641 can be used for that.

In the reconciliation dialog, tick the box to include your column and bind it to the relevant Wikidata property.

(dialog to bind a column to a property)

The reconciliation service will then fuzzy-match on both columns.

(example dataset)

As you can see, the score is boosted when the sport matches. In the first row, there is a slight mismatch between the sport in the table ("rugby") and on Wikidata ("rugby league"), which explains why the first match does not reach the perfect score (100) and is therefore not matched automatically.

Advanced usage: property paths

This feature is specific to the reconciliation interface for Wikidata.

Sometimes, the relation between the reconciled item and the disambiguating column is not direct: it is not represented as a property itself. Let us consider this dataset of cities:

(an example dataset with cities and country codes)

To fetch the country code from an item representing a city, you need to follow two properties. First, follow https:www.wikidata.org/wiki/Property:P17 to get to the item for the country in which this city is located, then follow https:www.wikidata.org/wiki/Property:P297 to get the two-letter code string.

(an example dataset with cities and country codes)

This is supported by the reconciliation interface, with a syntax inspired by https:www.w3.org/TR/sparql11-property-paths/: just type the sequence of property identifiers separated by slashes, such as P17/P297:

(inputting a property path in the reconciliation interface)

This additional information allows to distinguish between namesakes, to some extent. As "Cambridge, US" is still ambiguous, there are multiple items with a perfect matching score, but "Oxford, GB" successfully disambiguates one particular city from its namesakes. (the disambiguated dataset)

The endpoint currently supports two property combinators: /, to concatenate two paths as above, and |, to compute the union of the values yielded by two paths. Concatenation / has precedence over disjunction |. The dot character . can be used to denote the empty path. For instance, the following property paths are equivalent:

  • P17|P749/P17
  • P17|(P749/P17)
  • (.|P749)/P17

They fetch the https:www.wikidata.org/wiki/Property:P17 of an item or that of its https:www.wikidata.org/wiki/Property:P17.

Comparing values

By default, the values supplied in OpenRefine and the ones present in Wikidata are compared by string fuzzy-matching. There are some exceptions to this:

  • If the value is an identifier, then exact string matching is used.
  • If the values are integers, exact equality between integers is used.
  • If the values are floating point numbers, the score is 100 if they are equal and decreases towards 0 as their absolute difference increases.
  • If the values are coordinates (specified in the "lat,lng" format on OpenRefine's side), then the matching score is 100 when they are equal and decreases as their distance increases. Currently a score of 0 is reached when the points are 1km away from each other.

Sometimes, we need a more specific matching on sub-parts of these values. It is possible to select these parts for matching by appending a modifier at the end of the property path:

  • @lat and @lng: latitude and longitude of geographical coordinates (float)
  • @year, @month, @day, @hour, @minute and @second: parts of a time value (int). They are returned only if the precision of the Wikidata value is good enough to define them.
  • @isodate: returns a date in the ISO format 1987-08-23 (string). A value is always returned.
  • @iso: returns the date and time in the ISO format 1996-03-17T04:15:00+00:00. A value is always returned. For times and dates, all values are returned in the UTC time zone.
  • @urlscheme ("https"), @netloc ("www.wikidata.org") and @urlpath ("/wiki/Q42") can be used to perform exact matching on parts of URLs.

For instance, if you want to refine people by their birth dates, but you only have the month and day. First, split the birthday dates in two columns, for month and day.

(the initial dataset)

Reconcile using the following parameters (P569 can be found with the autocompletion first):

(the reconciliation parameters: P569@month and P569@day)

Reconciling via unique identifiers

Many Wikidata items come with external unique identifiers, so if you have some in your dataset they can be very useful to get a precise reconciliation. When a Wikidata property corresponding to a unique identifier is used, the reconciliation interface uses a different matching strategy. It first retrieves any items with this identifier and gives it the perfect matching score (100). All the other properties are ignored, including the original name in the column being reconciled. If no item bears the supplied unique identifier, the reconciliation interface falls back on the standard strategy, using string search and fuzzy matching.

Note: Sometimes the same external identifier is assigned to different Wikidata items (in which case the conflict will be flagged in Wikidata as a uniqueness constraint violation). In this case, the multiple items are returned but are not automatically matched.

Note: It is important to supply the unique identifiers in the same format as they are stored on Wikidata. Matching of identifiers is done with strict string equality. Case differences, spaces or leading zeros all matter. Check the documentation of the relevant Wikidata property for the expected format.

This alternative matching strategy is only used for unique identifiers belonging to the reconciled items. For instance, while the https:www.wikidata.org/wiki/Property:P297 is a unique identifier for countries, it will not be used as a unique identifier for cities reconciled with the SPARQL property path P17/P297 demonstrated above.

However, disjunctions of unique identifiers can be still used: for instance, suppose you have a column containing sometimes a https:www.wikidata.org/wiki/Property:P356, and sometimes a https:www.wikidata.org/wiki/Property:P3784. If you are confident that such ids are not ambiguous (which is the case in this example as the formats are incompatible), then you can reconcile with this column mapped to (P356|P3784). For each row, any item with the given value as either a DOI or a CiteSeerX id will be matched, ignoring the other values.

Fetching values from the reconciled database

One reason to reconcile a dataset to some database is that it allows you to pull data from the database into OpenRefine. There are essentially two ways to do this.

Keep all the suggestions made

To keep all suggestion made by the reconciliation service and not match against the best one:

  1. Before starting reconciliation un-check option Auto-match candidates with high confidence in Reconcile dialog (it's at the bottom).

  2. After reconciliation you should be able to see all three suggested reconciliation candidates. This step is optional - it is just for you to see what will be extracted.

  3. After reconciliation click on the reconciled column and choose Add column based on this column.

  4. Put following expression in the Expression field if you want the value: cell.recon.candidates[0].name

  5. if you want a link to Wikidata use this expression: cell.recon.candidates[0].id

  6. To get all of the IDs at the same time use the expression: cell.recon.candidates.join(',') , which will give you a comma separated list of all the IDs. Then use the command "Split multi-valued cells" to split the identifiers into individual rows.

Language settings for Wikidata reconciliation

You can install a version of the Wikidata reconciliation service that uses your language.

First, you need to find your language code: this is the first letters in the domain name of your Wikipedia (for instance "fr" if your Wikipedia is https://fr.wikipedia.org/wiki/).

Then, open the reconciliation dialog and click Add Standard Service. The URL is https://tools.wmflabs.org/openrefine-wikidata/fr/api where "fr" is replaced by your language code.

When reconciling using this interface, items and properties will be displayed in your language if a translation is available. The matching score of the reconciliation is not influenced by your choice of language: items are matched by considering all labels and keeping the best possible match. So the language of your dataset is irrelevant to the choice of the language for the reconciliation interface.

Additional Resources

See also: