<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>DBpedia Ontology and Mapping Problems</title>
<!-- 2015-02-09 Mon 12:59 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="generator" content="Org-mode" />
<meta name="author" content="Vladimir Alexiev, Ontotext Corp" />
<style type="text/css">
<!--/*--><![CDATA[/*><!--*/
.title { text-align: center; }
.todo { font-family: monospace; color: red; }
.done { color: green; }
.tag { background-color: #eee; font-family: monospace;
padding: 2px; font-size: 80%; font-weight: normal; }
.timestamp { color: #bebebe; }
.timestamp-kwd { color: #5f9ea0; }
.right { margin-left: auto; margin-right: 0px; text-align: right; }
.left { margin-left: 0px; margin-right: auto; text-align: left; }
.center { margin-left: auto; margin-right: auto; text-align: center; }
.underline { text-decoration: underline; }
#postamble p, #preamble p { font-size: 90%; margin: .2em; }
p.verse { margin-left: 3%; }
pre {
border: 1px solid #ccc;
box-shadow: 3px 3px 3px #eee;
padding: 8pt;
font-family: monospace;
overflow: auto;
margin: 1.2em;
}
pre.src {
position: relative;
overflow: visible;
padding-top: 1.2em;
}
pre.src:before {
display: none;
position: absolute;
background-color: white;
top: -10px;
right: 10px;
padding: 3px;
border: 1px solid black;
}
pre.src:hover:before { display: inline;}
pre.src-sh:before { content: 'sh'; }
pre.src-bash:before { content: 'sh'; }
pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
pre.src-R:before { content: 'R'; }
pre.src-perl:before { content: 'Perl'; }
pre.src-java:before { content: 'Java'; }
pre.src-sql:before { content: 'SQL'; }
table { border-collapse:collapse; }
caption.t-above { caption-side: top; }
caption.t-bottom { caption-side: bottom; }
td, th { vertical-align:top; }
th.right { text-align: center; }
th.left { text-align: center; }
th.center { text-align: center; }
td.right { text-align: right; }
td.left { text-align: left; }
td.center { text-align: center; }
dt { font-weight: bold; }
.footpara:nth-child(2) { display: inline; }
.footpara { display: block; }
.footdef { margin-bottom: 1em; }
.figure { padding: 1em; }
.figure p { text-align: center; }
.inlinetask {
padding: 10px;
border: 2px solid gray;
margin: 10px;
background: #ffffcc;
}
#org-div-home-and-up
{ text-align: right; font-size: 70%; white-space: nowrap; }
textarea { overflow-x: auto; }
.linenr { font-size: smaller }
.code-highlighted { background-color: #ffff00; }
.org-info-js_info-navigation { border-style: none; }
#org-info-js_console-label
{ font-size: 10px; font-weight: bold; white-space: nowrap; }
.org-info-js_search-highlight
{ background-color: #ffff00; color: #000000; font-weight: bold; }
/*]]>*/-->
</style>
<style type="text/css">
h1,h2,h3,h4,h5,h6,h7 {font-family: Arial}
/* don't want empty lines in auto-postamble */
.author, .date, .creator {-webkit-margin-before: 0em; -webkit-margin-after: 0em}
/* style for #+begin_abstract */
.abstract {margin: 1em; padding: 1em; border: 1px solid black}
.abstract:before {content: "Abstract: "; font-weight: bold}
/* center the preamble (author name) and make it bigger */
#preamble p { font-size: 110%; margin-left: auto; margin-right: auto; text-align: center; }
/* table headers aligned same as table data */
th.left {text-align:left}
th.right {text-align:right}
/* table horizontal&vertical borders. First value is top&bottom, second is left&right. http://www.w3schools.com/css/css_border.asp */
th, td {border-width: 1px; border-style: solid solid; border-spacing: 2px 2px; padding:4px 2px}
/* colored TODO keywords */
.CANCELED {color: blue}
.MAYBE {color: blue}
.POSTPONED {color: blue}
.INPROGRESS {color: orange}
.NEXT {color: orange}
.IER {color: orange}
</style>
<script type="text/javascript">
/*
@licstart The following is the entire license notice for the
JavaScript code in this tag.
Copyright (C) 2012-2013 Free Software Foundation, Inc.
The JavaScript code in this tag is free software: you can
redistribute it and/or modify it under the terms of the GNU
General Public License (GNU GPL) as published by the Free Software
Foundation, either version 3 of the License, or (at your option)
any later version. The code is distributed WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU GPL for more details.
As additional permission under GNU GPL version 3 section 7, you
may distribute non-source (e.g., minimized or compacted) forms of
that code without the copy of the GNU GPL normally required by
section 4, provided you include this license notice and a URL
through which recipients can access the Corresponding Source.
@licend The above is the entire license notice
for the JavaScript code in this tag.
*/
<!--/*--><![CDATA[/*><!--*/
function CodeHighlightOn(elem, id)
{
var target = document.getElementById(id);
if(null != target) {
elem.cacheClassElem = elem.className;
elem.cacheClassTarget = target.className;
target.className = "code-highlighted";
elem.className = "code-highlighted";
}
}
function CodeHighlightOff(elem, id)
{
var target = document.getElementById(id);
if(elem.cacheClassElem)
elem.className = elem.cacheClassElem;
if(elem.cacheClassTarget)
target.className = elem.cacheClassTarget;
}
/*]]>*///-->
</script>
</head>
<body>
<div id="content">
<h1 class="title">DBpedia Ontology and Mapping Problems</h1>
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#sec-1">1. Intro</a>
<ul>
<li><a href="#sec-1-1">1.1. DBpedia Data Quality</a></li>
<li><a href="#sec-1-2">1.2. Ontotext's DBpedia Experience</a></li>
<li><a href="#sec-1-3">1.3. A Nonsense Mapping</a></li>
<li><a href="#sec-1-4">1.4. Have I got your attention?</a></li>
<li><a href="#sec-1-5">1.5. Mapping Issues Bigger Than Ontology Issues?</a></li>
<li><a href="#sec-1-6">1.6. Balanced Effort</a></li>
</ul>
</li>
<li><a href="#sec-2">2. Issue Tracking</a>
<ul>
<li><a href="#sec-2-1">2.1. Testing Best Practice</a></li>
</ul>
</li>
<li><a href="#sec-3">3. Mapping Language Issues</a>
<ul>
<li><a href="#sec-3-1">3.1. ConditionalMapping Not Flexible Enough</a></li>
<li><a href="#sec-3-2">3.2. Object/DataProp Dichotomy</a></li>
<li><a href="#sec-3-3">3.3. Mapping Framework is not Modular Enough</a></li>
</ul>
</li>
<li><a href="#sec-4">4. Mapping Server Deficiencies</a></li>
<li><a href="#sec-5">5. Mapping Wiki Deficiencies</a>
<ul>
<li><a href="#sec-5-1">5.1. Improve Display of Mappings</a></li>
</ul>
</li>
<li><a href="#sec-6">6. Mapping Issues</a>
<ul>
<li><a href="#sec-6-1">6.1. No Editorial Process</a></li>
<li><a href="#sec-6-2">6.2. Lack of Documentation</a></li>
<li><a href="#sec-6-3">6.3. Good Documentation Is Specific</a></li>
<li><a href="#sec-6-4">6.4. Duplicate & Semi-Duplicate Properties</a></li>
<li><a href="#sec-6-5">6.5. Need for Research</a></li>
<li><a href="#sec-6-6">6.6. Need for Research</a></li>
<li><a href="#sec-6-7">6.7. Validate Ontological Assumptions</a></li>
<li><a href="#sec-6-8">6.8. Property and Class Naming</a></li>
<li><a href="#sec-6-9">6.9. Various Mapping Issues</a></li>
</ul>
</li>
<li><a href="#sec-7">7. Extraction Framework Issues</a>
<ul>
<li><a href="#sec-7-1">7.1. Issues Important for Local Chapters</a></li>
<li><a href="#sec-7-2">7.2. Date as Page is not Extracted</a></li>
<li><a href="#sec-7-3">7.3. Object Extractor Does Not Respect Ranges</a></li>
<li><a href="#sec-7-4">7.4. Use Domain & Range to Guide Extraction</a></li>
<li><a href="#sec-7-5">7.5. Specific Properties</a></li>
<li><a href="#sec-7-6">7.6. Various Extraction Issues</a></li>
</ul>
</li>
<li><a href="#sec-8">8. External Mapping Problems</a>
<ul>
<li><a href="#sec-8-1">8.1. DUL Too Generic?</a></li>
<li><a href="#sec-8-2">8.2. owl:Thing Considered Useless</a></li>
<li><a href="#sec-8-3">8.3. No Choice</a></li>
</ul>
</li>
<li><a href="#sec-9">9. Ontology Problems</a>
<ul>
<li><a href="#sec-9-1">9.1. External Props Not Used Consistently</a></li>
<li><a href="#sec-9-2">9.2. rdfs:domain/range are Wishful</a></li>
<li><a href="#sec-9-3">9.3. Classes that Duplicate Properties</a></li>
<li><a href="#sec-9-4">9.4. Measurement Classes</a></li>
<li><a href="#sec-9-5">9.5. Place vs Organisation</a></li>
<li><a href="#sec-9-6">9.6. Simple Ontology Fixes</a></li>
</ul>
</li>
</ul>
</div>
</div>
<div id="outline-container-sec-1" class="outline-2">
<h2 id="sec-1"><span class="section-number-2">1</span> Intro</h2>
<div class="outline-text-2" id="text-1">
<p>
This is way <b>Too Many Slides</b>(TM). View at your leisure:
</p>
<ul class="org-ul">
<li><a href="./dbpedia-problems-long.html">Single HTML page</a>: easier to read, you can print it, external links don't lose context
</li>
<li><a href="./dbpedia-problems.html">2D interactive presentation</a>: one slide per page, 2D structure
<ul class="org-ul">
<li>Press O for 2D overview, Escape to zoom on the selected slide
</li>
<li>Press <a href="../../reveal.js/reveal-help.html">H for help</a>
</li>
<li>Proudly made in plain text with <a href="https://github.com/hakimel/reveal.js/">reveal.js</a>, <a href="https://github.com/yjwen/org-reveal">org-reveal</a>, <a href="http://orgmode.org">org-mode</a>, <a href="http://www.gnu.org/s/emacs/">emacs</a>
</li>
</ul>
</li>
</ul>
</div>
<div id="outline-container-sec-1-1" class="outline-3">
<h3 id="sec-1-1"><span class="section-number-3">1.1</span> DBpedia Data Quality</h3>
<div class="outline-text-3" id="text-1-1">
<ul class="org-ul">
<li>DBpedia is a crucial LOD dataset used by many, including for commercial applications by companies like Ontotext.
</li>
<li>But DBpedia data quality leaves a lot to be desired, and has been the subject of many recent papers.
</li>
<li>Most of these papers describe approaches for finding errors.
</li>
</ul>
<p>
Instead, I want to focus on the root causes of important error classes and to propose approaches for fixing them. I focus on:
</p>
<ul class="org-ul">
<li>Lack of documentation on classes and properties
</li>
<li>Weak editorial process in the mapping wiki, lack of issue tracking
</li>
<li>Ontology problems, mostly due to the weak editorial process. Compare to Wikidata property proposal process
</li>
<li>Potential improvements for error checking in the mapping wiki (both ontology and mapping)
</li>
<li>Deficiencies of ontology mapping to external ontologies
</li>
<li>Extractor deficiencies
</li>
</ul>
<p>
I give many concrete examples
</p>
</div>
</div>
<div id="outline-container-sec-1-2" class="outline-3">
<h3 id="sec-1-2"><span class="section-number-3">1.2</span> Ontotext's DBpedia Experience</h3>
<div class="outline-text-3" id="text-1-2">
<ul class="org-ul">
<li>Used DBpedia for at least 5 years
</li>
<li>Eg <a href="http://factforge.net">http://factforge.net</a> aggregates DBpedia, Freebase, GeoNames, etc (9 central LOD datasets), but old versions
</li>
<li>Developed mapping layers, eg PROTON; contributed to UMBEL
</li>
<li>Use in FP7 Multisensor: DBpedia in 5 languages as a core background dataset
</li>
<li>Use in FP7 Europeana Food and Drink: DBpedia in 11 languages as the backbone of EFD Classification
</li>
<li>Just started hosting <a href="http://bg.dbpedia.org">http://bg.dbpedia.org</a> (above FP7 projects include Bulgarian)
</li>
</ul>
<p>
Most importantly:
</p>
<ul class="org-ul">
<li>Use DBpedia labels and other features for commercial Semantic Enrichment (media, publishers, etc)
</li>
<li>Now also for Bulgarian (BG project with OffMedia)
</li>
</ul>
<p>
Until now, only grumbled internally about DBpedia data quality
</p>
<ul class="org-ul">
<li>A couple months ago started looking actively into that
</li>
<li>Many improvements to bg.dbpedia mappings
</li>
<li>Posted suggestions and issues to dbpedia
</li>
<li><b>Pragmatic</b> approach
</li>
</ul>
</div>
</div>
<div id="outline-container-sec-1-3" class="outline-3">
<h3 id="sec-1-3"><span class="section-number-3">1.3</span> A Nonsense Mapping</h3>
<div class="outline-text-3" id="text-1-3">
<p>
<a href="http://mappings.dbpedia.org/index.php?title=Mapping_el:Quote_box&action=edit">Mapping_el:Quote_box</a> is utter nonsense
</p>
<ul class="org-ul">
<li>Mapped to Road, so eg the <a href="https://el.wikipedia.org/wiki/Ιστορία">Greek article on History</a> will be mapped to Road
</li>
<li>The only meaningful property (quote text) won't be extracted because <a href="http://mappings.dbpedia.org/index.php/OntologyProperty:Category">category</a> is an ObjectProperty:
<pre class="example">
{{ PropertyMapping | templateProperty = quote | ontologyProperty = category }}
{{ PropertyMapping | templateProperty = quoted | ontologyProperty = category }}
</pre>
</li>
<li>"title" (if any), is intermixed with non-semantic properties like background and font:
<pre class="example">
{{PropertyMapping | templateProperty = title| ontologyProperty = title }}
{{PropertyMapping | templateProperty = title_bg| ontologyProperty = title }}
{{PropertyMapping | templateProperty = title_fnt| ontologyProperty = title }}
</pre>
</li>
<li>Most of the properties (eg size, style) have no semantic significance
</li>
<li>Alignment -> picture ??
<pre class="example">
{{PropertyMapping | templateProperty = align | ontologyProperty = picture }}
{{PropertyMapping | templateProperty = salign | ontologyProperty = picture }}
{{PropertyMapping | templateProperty = halign | ontologyProperty = picture }}
{{PropertyMapping | templateProperty = qalign | ontologyProperty = picture }}
</pre>
</li>
<li>I especially like these mappings. 1 is a number, alright ;-)
<pre class="example">
{{ PropertyMapping | templateProperty = 1 | ontologyProperty = number }}
{{ PropertyMapping | templateProperty = 2 | ontologyProperty = number }}
</pre>
</li>
<li><a href="http://mappings.dbpedia.org/server/templatestatistics/el/?template=Quote_box">Stats happily reports</a> all props are mapped
</li>
</ul>
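<p>
Some of these smells could be caught mechanically. A minimal sketch, not part of the mapping server: the function <code>lint_mapping</code> and its two heuristics are my assumptions for illustration. It flags positional (purely numeric) template properties and ontology properties that receive several template properties. Both are hints only, since several template fields may legitimately feed one property.
</p>
<pre class="src src-python">
```python
import re
from collections import defaultdict

# Matches one PropertyMapping template and captures both property names.
PROPERTY_MAPPING = re.compile(
    r"\{\{\s*PropertyMapping\s*\|\s*templateProperty\s*=\s*([^|}]+?)\s*"
    r"\|\s*ontologyProperty\s*=\s*([^|}]+?)\s*\}\}"
)


def lint_mapping(wikitext):
    """Return warnings for suspicious PropertyMapping entries (heuristics only)."""
    by_ontology_prop = defaultdict(list)
    warnings = []
    for template_prop, ontology_prop in PROPERTY_MAPPING.findall(wikitext):
        by_ontology_prop[ontology_prop].append(template_prop)
        # Heuristic 1: a positional parameter rarely has a fixed meaning.
        if template_prop.isdigit():
            warnings.append(
                f"positional parameter '{template_prop}' mapped to '{ontology_prop}'"
            )
    # Heuristic 2: many template fields funnelled into one ontology property.
    for ontology_prop, template_props in by_ontology_prop.items():
        if len(template_props) > 1:
            warnings.append(
                f"'{ontology_prop}' receives {len(template_props)} template properties: "
                + ", ".join(template_props)
            )
    return warnings


SAMPLE = """
{{PropertyMapping | templateProperty = title| ontologyProperty = title }}
{{PropertyMapping | templateProperty = title_bg| ontologyProperty = title }}
{{ PropertyMapping | templateProperty = 1 | ontologyProperty = number }}
"""

for warning in lint_mapping(SAMPLE):
    print(warning)
```
</pre>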
</div>
</div>
<div id="outline-container-sec-1-4" class="outline-3">
<h3 id="sec-1-4"><span class="section-number-3">1.4</span> Have I got your attention?</h3>
<div class="outline-text-3" id="text-1-4">
<p>
I wanted to open with a horrible example to get your attention
</p>
<ul class="org-ul">
<li>You may think the above is a weird exception, but it is not
</li>
<li>All of the DBpedia ontology and mappings are crowd-sourced
</li>
<li>But due to lack of editorial process, documentation and discussion, the results are… less than ideal
</li>
</ul>
<p>
<b>Ontology problems</b> include duplicated properties, non-standard properties, etc
</p>
<ul class="org-ul">
<li>But they pale in comparison to <b>mapping problems</b> (subjectively: 5% vs 95%)
</li>
<li>Efforts to improve the ontology and improve the mappings should be appropriately balanced
</li>
<li>These efforts must be intimately tied, else we'll not achieve much improvement
</li>
<li>It doesn't take an ontological discussion on the nature of Numbers to figure out this is wrong:
<pre class="example">
{{ PropertyMapping | templateProperty = 1 | ontologyProperty = number }}
</pre>
</li>
<li>Prop <a href="http://mappings.dbpedia.org/index.php/OntologyProperty:Number">number</a> is not documented (i.e. not well-defined), but that's not the problem here
</li>
<li>Crowdsourcing without editorial process = allowing any fool to write nonsense
</li>
</ul>
</div>
</div>
<div id="outline-container-sec-1-5" class="outline-3">
<h3 id="sec-1-5"><span class="section-number-3">1.5</span> Mapping Issues Bigger Than Ontology Issues?</h3>
<div class="outline-text-3" id="text-1-5">
<p>
Thesis: ontology problems pale in comparison to mapping problems
</p>
<ul class="org-ul">
<li>Lack of documentation of classes & props
<ul class="org-ul">
<li>Sometimes template props in wikipedia are also not documented
</li>
<li>This turns mapping into guesswork (also because of Object/DataProp Dichotomy <a href="#sec-3-2">3.2</a>)
</li>
<li>Many people don't research existing props before making new ones
</li>
</ul>
</li>
<li>Lack of editorial process
</li>
<li>Bad practices are copied & pasted (<a href="#sec-3-3">3.3</a>)
</li>
</ul>
</div>
</div>
<div id="outline-container-sec-1-6" class="outline-3">
<h3 id="sec-1-6"><span class="section-number-3">1.6</span> Balanced Effort</h3>
<div class="outline-text-3" id="text-1-6">
<p>
Please don't focus your energy and efforts only on ontology problems
</p>
<ul class="org-ul">
<li>The ontology and mappings are intimately connected
</li>
<li>The effort between fixing ontology & mapping problems should be balanced
</li>
<li>If we fix ontology problems in isolation from mapping: no useful result
</li>
</ul>
<p>
It will take lots of pragmatic & concerted editorial effort
</p>
<ul class="org-ul">
<li>Research current usage in various areas (eg Name props, Place parent hierarchy, Membership…)
</li>
<li>Best practice writing, wiki gardening, bot writing
</li>
<li>Not necessarily by world-class ontological thinkers
</li>
<li>But by people willing to spend the time and build consensus (examples: Wikipedia, Wikidata)
</li>
</ul>
<p>
Are we up to it?
</p>
</div>
</div>
</div>
<div id="outline-container-sec-2" class="outline-2">
<h2 id="sec-2"><span class="section-number-2">2</span> Issue Tracking</h2>
<div class="outline-text-2" id="text-2">
<p>
A major problem was that ontology and mapping issues were not tracked
</p>
<ul class="org-ul">
<li>D.Kontokostas made trackers on github about a month ago
</li>
<li><a href="https://github.com/dbpedia/mappings-tracker/issues">mappings-tracker/issues</a>: mapping issues, issues with the mapping wiki
</li>
<li><a href="https://github.com/dbpedia/ontology-tracker/issues">ontology-tracker/issues</a>: issues with the ontology
</li>
<li>(old): <a href="https://github.com/dbpedia/extraction-framework/issues">extraction-framework/issues</a>: technical problems with the extraction software
</li>
</ul>
<p>
But so far it seems I'm the only one using them :-(
</p>
<ul class="org-ul">
<li>I've posted 19 <a href="https://github.com/dbpedia/extraction-framework/issues?q=author:VladimirAlexiev+">extraction-framework/issues</a>. Referenced below with bigger numbers, eg #286
</li>
<li>I've posted 36 <a href="https://github.com/dbpedia/mappings-tracker/issues?q=author:VladimirAlexiev+">mappings-tracker/issues</a>. Referenced below with small numbers, eg #20
</li>
<li>I haven't posted ontology-tracker/issues, since IMHO ontology and mapping problems are intimately related
<ul class="org-ul">
<li>If we start using Web Protege, it must be just as intimately related to the mapping wiki!
</li>
</ul>
</li>
</ul>
<p>
All discussion should be in the wiki
</p>
<ul class="org-ul">
<li>The tracker is for tracking only, not for keeping knowledge
</li>
<li>Issue and Discussion should be interlinked (paste links in each)
</li>
</ul>
</div>
<div id="outline-container-sec-2-1" class="outline-3">
<h3 id="sec-2-1"><span class="section-number-3">2.1</span> Testing Best Practice</h3>
<div class="outline-text-3" id="text-2-1">
<p>
Say you made a <a href="http://mappings.dbpedia.org/index.php/Mapping_bg:Манекен_инфо">new mapping</a> or fixed a mapping
</p>
<ul class="org-ul">
<li>There's a <a href="http://mappings.dbpedia.org/server/mappings/bg/extractionSamples/Mapping_bg:Манекен_инфо">test link</a> to return triples
</li>
<li>But they're "random" triples and it works only for enwiki/ASCII (<a href="https://github.com/dbpedia/extraction-framework/issues/289">#289</a>)
</li>
</ul>
<p>
The individual triple extractor is more useful
</p>
<ul class="org-ul">
<li>First find <a href="http://bg.wikipedia.org/wiki/Special:WhatLinksHere/Template:Манекен_инфо?limit=500&namespace=0">wikipedia usages</a> and pick up some individuals, eg
<pre class="example">
Летисия Каста
</pre>
</li>
<li>Then go to Discussion page, add section "Testing" and make test links (cases), eg
<ul class="org-ul">
<li><a href="http://mappings.dbpedia.org/server/extraction/bg/extract?format=turtle-triples&extractors=custom&title=Летисия_Каста">http://mappings.dbpedia.org/server/extraction/bg/extract?format=turtle-triples&extractors=custom&title=Летисия_Каста</a>
</li>
</ul>
</li>
</ul>
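<p>
Test links like the one above are easy to generate. A minimal sketch, assuming the extraction URL pattern shown above stays as it is; the function name <code>extraction_test_url</code> is mine. Percent-encoding the title keeps the link pasteable even where non-ASCII URLs get mangled.
</p>
<pre class="src src-python">
```python
from urllib.parse import quote, urlencode

# Endpoint pattern taken from the test link above; {lang} is the wiki language code.
EXTRACT_ENDPOINT = "http://mappings.dbpedia.org/server/extraction/{lang}/extract"


def extraction_test_url(lang, title):
    """Build a test link for one article, percent-encoding non-ASCII titles."""
    query = urlencode(
        {
            "format": "turtle-triples",
            "extractors": "custom",
            # Wikipedia page titles use underscores instead of spaces.
            "title": title.replace(" ", "_"),
        },
        quote_via=quote,  # plain percent-encoding, UTF-8 by default
    )
    return EXTRACT_ENDPOINT.format(lang=lang) + "?" + query


print(extraction_test_url("bg", "Летисия Каста"))
```
</pre>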
<p>
These test cases serve important purposes:
</p>
<ul class="org-ul">
<li>To illustrate the problem
</li>
<li>To prove the mapping works after the problem is fixed
</li>
<li>To provide test cases for any bugs in the extraction framework (upstream bug reporting)
</li>
</ul>
<p>
Proposed as <a href="http://mappings.dbpedia.org/index.php/Main_Page#Testing_Best_Practices">editorial policy</a>
</p>
</div>
</div>
</div>
<div id="outline-container-sec-3" class="outline-2">
<h2 id="sec-3"><span class="section-number-2">3</span> Mapping Language Issues</h2>
<div class="outline-text-2" id="text-3">
<p>
The <b>mapping language</b> is a set of wiki templates expressing classes, props, mappings
</p>
<ul class="org-ul">
<li>The very concept of using a wiki to express mappings is quite excellent
</li>
<li>But the mapping framework has a few deficiencies
<ul class="org-ul">
<li>"ConditionalMapping" is quite possible to fix
</li>
<li>"Modularity" is hard/impossible to fix
</li>
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/22">#22</a> what are "super" datatypes? is more of a question
</li>
</ul>
</li>
<li>None of these is crucially important
</li>
</ul>
<p>
Various cosmetic fixes to the mapping wiki are in section <a href="#sec-5">5</a>
</p>
</div>
<div id="outline-container-sec-3-1" class="outline-3">
<h3 id="sec-3-1"><span class="section-number-3">3.1</span> ConditionalMapping Not Flexible Enough</h3>
<div class="outline-text-3" id="text-3-1">
<p>
<a href="https://github.com/dbpedia/extraction-framework/issues/310">#310</a>: <a href="http://mappings.dbpedia.org/index.php?title=Mapping_bg:Музикален_изпълнител&action=edit">bg:Musical_artist</a> has complex ConditionalMapping logic (translated from bg):
</p>
<ul class="org-ul">
<li>If "members", "former_members", "created" -> Band
</li>
<li>If "background" includes "group", "quartet", "ensemble", "choir" -> Band
</li>
<li>If "background" includes "composer" -> MusicComposer
</li>
<li>If "background" includes "director" -> MusicDirector
</li>
<li>If "background" includes "she-singer" -> MusicalArtist, gender=dbo:Female
</li>
<li>If "background" includes "he-singer" -> MusicalArtist, gender=dbo:Male
</li>
<li>If "background" includes "he-pianist" -> MusicalArtist, gender=dbo:Male
</li>
<li>If "suffix=a" -> MusicalArtist, gender=dbo:Female
<ul class="org-ul">
<li>"suffix=a" indicates Female gender, eg my wife is <b>Alexieva</b>
</li>
</ul>
</li>
<li>Otherwise -> MusicalArtist, gender=dbo:Male
</li>
</ul>
<p>
ConditionalMapping is <b>linear</b>, so we can't:
</p>
<ul class="org-ul">
<li>Check "suffix" of "composer" to emit gender
</li>
<li>Check if "background" includes "composer" and "director" to emit <b>both</b> MusicComposer <b>and</b> MusicDirector
</li>
</ul>
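<p>
To make the limitation concrete, here is a toy sketch, not the extraction framework's code. It assumes "linear" means first-match-wins, and the rule table and function names are hypothetical. With first-match evaluation a "composer, director" gets only one class; collecting all matches would emit both.
</p>
<pre class="src src-python">
```python
# Hypothetical rule table: (keyword looked for in "background", emitted class).
RULES = [
    ("composer", "MusicComposer"),
    ("director", "MusicDirector"),
]
DEFAULT_CLASS = "MusicalArtist"


def first_match(background):
    """Linear evaluation: stop at the first rule that fires."""
    for keyword, ontology_class in RULES:
        if keyword in background:
            return [ontology_class]
    return [DEFAULT_CLASS]


def all_matches(background):
    """Alternative evaluation: collect every rule that fires."""
    classes = [cls for keyword, cls in RULES if keyword in background]
    return classes or [DEFAULT_CLASS]


background = "composer, director"
print(first_match(background))  # ['MusicComposer']
print(all_matches(background))  # ['MusicComposer', 'MusicDirector']
```
</pre>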
<p>
Not hard to fix. Related to #19 GSoC warm-up task?
</p>
</div>
</div>
<div id="outline-container-sec-3-2" class="outline-3">
<h3 id="sec-3-2"><span class="section-number-3">3.2</span> Object/DataProp Dichotomy</h3>
<div class="outline-text-3" id="text-3-2">
<p>
The mapping language adopts the OWL dichotomy between owl:ObjectProperty and owl:DatatypeProperty
</p>
<ul class="org-ul">
<li>rdf:Property is more flexible in that the same property can take object values, literal values, or both
</li>
<li>This dichotomy doesn't always work well with current wikipedia practice
</li>
<li>Eg <a href="http://en.wikipedia.org/wiki/Saint_Peter">Saint_Peter</a>: <b>patronage</b> (to be created) has both:
<ul class="org-ul">
<li>object references, eg many cities
</li>
<li>text literals, eg "fishermen", "the sick"…
</li>
</ul>
</li>
<li>Many other examples
</li>
</ul>
<p>
Some mappings send <b>the same</b> template field to both an ObjectProp and a DataProp
</p>
<ul class="org-ul">
<li>Eg firstAscent -> firstAscentPerson (object), firstAscentYear (literal)
</li>
<li>Others exemplified by "field" (object) vs "fieldName" (literal)
</li>
<li>But this is not used systematically (eg there's no "childName" to complement "child")
</li>
<li>Hard to know when to use it: see Field Sampling in <a href="#sec-4">4</a>
</li>
</ul>
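<p>
Whether a field holds links, text, or both can be checked on sample values. A minimal sketch: the function <code>split_field</code>, the comma-separated layout and the sample value are my assumptions, not real extractor code or real infobox data.
</p>
<pre class="src src-python">
```python
import re

# A wiki link looks like [[Target]] or [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def split_field(value):
    """Split a template field value into link targets and leftover text items."""
    links = [target.strip() for target in WIKILINK.findall(value)]
    leftover = WIKILINK.sub("", value)
    texts = [item.strip() for item in leftover.split(",") if item.strip()]
    return links, texts


# Made-up field value in the spirit of the patronage example.
links, texts = split_field("[[Rome]], fishermen, the sick")
print(links)  # ['Rome']
print(texts)  # ['fishermen', 'the sick']
```
</pre>
<p>
A field that yields both non-empty lists is one where a single ObjectProp or DataProp mapping loses data.
</p>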
<p>
Do you think this should be fixed?
</p>
</div>
</div>
<div id="outline-container-sec-3-3" class="outline-3">
<h3 id="sec-3-3"><span class="section-number-3">3.3</span> Mapping Framework is not Modular Enough</h3>
<div class="outline-text-3" id="text-3-3">
<ul class="org-ul">
<li>There's no mapping of a <b>property</b> or <b>group of properties</b>
</li>
<li>Thus mapping patterns cannot be reused but have to be copy-pasted
</li>
<li>We need to copy the complex suffix/gender ConditionalMapping 11 times
</li>
<li>Some bad patterns are copied over and over again, replicating their problems
</li>
<li>IMHO hard to impossible to fix
</li>
</ul>
</div>
</div>
</div>
<div id="outline-container-sec-4" class="outline-2">
<h2 id="sec-4"><span class="section-number-2">4</span> Mapping Server Deficiencies</h2>
<div class="outline-text-2" id="text-4">
<p>
The mapping server has good Stats and Testing features, but more is needed
</p>
<ul class="org-ul">
<li>TODO: Field Sampling:
<ul class="org-ul">
<li>On template stats, for every field, add a hyperlink to show some occurrences
</li>
<li>Extremely useful to understand the meaning of some fields
</li>
<li>And whether they're links, text, or both (<a href="#sec-3-2">3.2</a>)
</li>
</ul>
</li>
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/3">#3</a> Statistics and Validator to check for redirected templates. Prevent problems like
<ul class="org-ul">
<li><a href="https://github.com/dbpedia/extraction-framework/issues/296">#296</a> Why Infobox_Geopolitical_organization (eg United_Nations) is mapped to Country?
</li>
<li><a href="https://github.com/dbpedia/extraction-framework/issues/326">#326</a> Why the redirect is not enacted?
</li>
</ul>
</li>
<li><a href="https://github.com/dbpedia/extraction-framework/issues/287">#287</a> some invalid domain, range, subPropertyOf
<ul class="org-ul">
<li>Check that prop names in templates start with lowercase
</li>
<li>Check that class names start with uppercase and include no comma
</li>
<li>Eg <code>firstAscentYear rdfs:domain Peak,Volcano</code> is broken: the domain lists two classes separated by a comma
</li>
</ul>
</li>
<li><a href="https://github.com/dbpedia/extraction-framework/issues/289">#289</a> testing works only for en/ASCII (see <a href="#sec-2-1">2.1</a> for workaround)
</li>
<li><a href="https://github.com/dbpedia/extraction-framework/issues/304">#304</a> extraction tester should return encoding UTF-8
<ul class="org-ul">
<li>Else browser displays gibberish: need to save file & open in proper editor
</li>
<li>Makes it unnecessarily hard to test international mappings
</li>
</ul>
</li>
<li><a href="https://github.com/dbpedia/extraction-framework/issues/308">#308</a> statistics should check params of GeocoordinatesMapping
</li>
</ul>
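<p>
The naming checks proposed for #287 are simple to express. A minimal sketch, assuming properties are lowerCamelCase and classes UpperCamelCase, each with an optional namespace prefix; the function <code>check_property</code> is hypothetical and not part of the mapping server.
</p>
<pre class="src src-python">
```python
import re

# Assumed naming conventions: properties are lowerCamelCase, classes are
# UpperCamelCase, both with an optional namespace prefix such as "foaf:".
PROPERTY_NAME = re.compile(r"(?:[a-z]+:)?[a-z][A-Za-z0-9_]*")
CLASS_NAME = re.compile(r"(?:[a-z]+:)?[A-Z][A-Za-z0-9_]*")


def check_property(name, domain=None):
    """Return the problems found in one property declaration."""
    problems = []
    if not PROPERTY_NAME.fullmatch(name):
        problems.append(f"property name '{name}' should start with lowercase")
    # The range is not checked here because it may be a datatype, not a class.
    if domain is not None and not CLASS_NAME.fullmatch(domain):
        problems.append(f"domain '{domain}' is not a single class name")
    return problems


print(check_property("firstAscentYear", domain="Peak,Volcano"))
print(check_property("firstAscentYear", domain="Peak"))
```
</pre>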
</div>
</div>
<div id="outline-container-sec-5" class="outline-2">
<h2 id="sec-5"><span class="section-number-2">5</span> Mapping Wiki Deficiencies</h2>
<div class="outline-text-2" id="text-5">
<p>
IMHO the mapping wiki is quite workable (some enhancements are in order)
</p>
<ul class="org-ul">
<li>Eg "OntologyProperty=foo" finds uses of "foo"
</li>
<li>If Web Protege is adopted, it should be as tightly knit with the mappings as the wiki ontology is currently
</li>
</ul>
<p>
Improve editing:
</p>
<ul class="org-ul">
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/31">#31</a> show class & prop info while/at Mapping
</li>
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/32">#32</a> add Preview and key shortcuts. Like on any wikipedia!
</li>
</ul>
<p>
Improve search:
</p>
<ul class="org-ul">
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/1">#1</a> add class hierarchy to left navbar
</li>
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/2">#2</a> add Search for Property to left navbar
</li>
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/25">#25</a> full-text search (FTS) doesn't index everything
</li>
</ul>
<p>
Improve collaboration:
</p>
<ul class="org-ul">
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/33">#33</a> Add editorial templates/addons: but the lack of them is not <b>why</b> we're not collaborating
</li>
</ul>
</div>
<div id="outline-container-sec-5-1" class="outline-3">
<h3 id="sec-5-1"><span class="section-number-3">5.1</span> Improve Display of Mappings</h3>
<div class="outline-text-3" id="text-5-1">
<ul class="org-ul">
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/30">#30</a>: The current display (left) is useless (nobody bothers with "header=no")
</li>
<li>I just look at the source in the Edit tab (right)
</li>
<li>The "diff" display (bottom) is quite good
</li>
</ul>
<div class="figure">
<p><img src="./img/dbpedia-mapping-views.png" alt="dbpedia-mapping-views.png" />
</p>
</div>
</div>
</div>
</div>
<div id="outline-container-sec-6" class="outline-2">
<h2 id="sec-6"><span class="section-number-2">6</span> Mapping Issues</h2>
<div class="outline-text-2" id="text-6">
<p>
The <b>biggest reason</b> for the current situation is the lack of <b>Discussion</b> and <b>Editorial process</b>
</p>
<ul class="org-ul">
<li>Contrast with the <b>Wikidata Property Proposal</b> process, eg for <a href="https://www.wikidata.org/wiki/Wikidata:Property_proposal/Authority_control">Authority_control</a>
</li>
<li>Rich metadata: guidelines on use (eg which items it applies to), corresponding
register/authority file (if any), examples, format validation, uniqueness constraints,
known exceptions, dynamic validation reports, etc.
</li>
<li>All reasoning & discussion preserved
</li>
</ul>
<div class="figure">
<p><img src="./img/wikidata-DNB-metadata.png" alt="wikidata-DNB-metadata.png" />
</p>
</div>
</div>
<div id="outline-container-sec-6-1" class="outline-3">
<h3 id="sec-6-1"><span class="section-number-3">6.1</span> No Editorial Process</h3>
<div class="outline-text-3" id="text-6-1">
<ul class="org-ul">
<li>Compare to Wikidata's <b>lack</b> of editorial process for Classes
</li>
<li>Any fool can make "instance of" or "subclass of" claims (thus classes and hierarchy)
</li>
<li>Result: 17k classes, of which at least 2/3 are junk (fewer than 5 instances)
</li>
</ul>
<p>
Examples
</p>
<ul class="org-ul">
<li><b>location> geographic location> facility> laboratory> lab-on-a-chip</b>:
<ul class="org-ul">
<li>But "lab-on-a-chip" is a "device that integrates one or several laboratory functions on a single chip of only millimeters to a few square centimeters in size", hardly a "geographic location"!!
</li>
</ul>
</li>
<li><b>location> storage> data storage device> audio storage device> album</b>:
<ul class="org-ul">
<li>Any NER implementor will balk at "albums are locations". The everyday understanding of "location" as "place" is implemented as the subclass "geographic location". Even so, an "album" is a creative work, and as such is a conceptual object that persists even after all its copies are destroyed. It's definitely not a "storage device"!
</li>
</ul>
</li>
</ul>
</div>
</div>
<div id="outline-container-sec-6-2" class="outline-3">
<h3 id="sec-6-2"><span class="section-number-3">6.2</span> Lack of Documentation</h3>
<div class="outline-text-3" id="text-6-2">
<p>
Many props/classes have no comment. Everyone has complained about this
</p>
<ul class="org-ul">
<li>It takes a lot of unnecessary digging to figure out the meaning of a prop
</li>
<li>You'd never guess what "event" is until you investigate usages, eg this Slovenian (SL) mapping:
<div class="org-src-container">
<pre class="src src-Turtle">Antonio_Pettigrew <span style="color: #228b22;">dbo:</span><span style="color: #008b8b;">event</span> Moški_tek_na_400_m <span style="color: #b22222;"># (men's 400 m race)</span>
</pre>
</div>
</li>
<li>Then you figure out it's the same as sportDiscipline and should be replaced
</li>
</ul>
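<p>
After such a replacement, the same statement would presumably read as follows (resource names left unprefixed, as in the snippet above):
</p>
<div class="org-src-container">
<pre class="src src-Turtle">Antonio_Pettigrew <span style="color: #228b22;">dbo:</span><span style="color: #008b8b;">sportDiscipline</span> Moški_tek_na_400_m <span style="color: #b22222;"># (men's 400 m race)</span>
</pre>
</div>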
<p>
Must be merciless about new props & classes: <b>no comment means automatic deletion</b>
</p>
<ul class="org-ul">
<li>But what to do about existing props with no comment?
</li>
<li>Thus <a href="https://github.com/dbpedia/mappings-tracker/issues/6">#6</a> "add documentation to every property" is a very large ongoing task
</li>
</ul>
</div>
</div>
<div id="outline-container-sec-6-3" class="outline-3">
<h3 id="sec-6-3"><span class="section-number-3">6.3</span> Good Documentation Is Specific</h3>
<div class="outline-text-3" id="text-6-3">
<p>
Comments should describe Usage (ie Scope Notes) and compare to similar props
</p>
<ul class="org-ul">
<li>Eg what's the difference between member and membership?
</li>
<li>When to use teamMember vs currentTeamMember vs sportsTeamMember?
</li>
</ul>
<p>
Good examples:
</p>
<ul class="org-ul">
<li><b>sportDiscipline</b>: the sport discipline the athlete practices, e.g. Diving, or that a board member of a sporting club is focussing at
</li>
<li><b>zodiacSign</b>: Applies to persons, planets, etc
</li>
<li><b>bustWaistHipSize</b>: Use this property if all 3 sizes are given together (DBpedia cannot currently extract 3 Lengths out of a field). Otherwise use separate fields bustSize, waistSize, hipSize
</li>
</ul>
</div>
</div>
<div id="outline-container-sec-6-4" class="outline-3">
<h3 id="sec-6-4"><span class="section-number-3">6.4</span> Duplicate & Semi-Duplicate Properties</h3>
<div class="outline-text-3" id="text-6-4">
<p>
<a href="https://github.com/dbpedia/mappings-tracker/issues/5">#5</a> "Eliminate semi-duplicate properties" is another long-term task:
</p>
<ul class="org-ul">
<li>Research individual problems
</li>
<li>Write up decisions and best practices
</li>
<li>Clean up mappings that violate them
</li>
</ul>
<p>
A few random examples (this just scratches the surface):
</p>
<ul class="org-ul">
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/17">#17</a> remove Racecourse, since there is already RaceTrack
</li>
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/36">#36</a> Merge motto and slogan
</li>
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/11">#11</a> blazonLink vs Blazon
</li>
<li><a href="https://github.com/dbpedia/mappings-tracker/issues/34">#34</a> replace shoeNumber with shoeSize
</li>
<li>replace event with sportDiscipline
</li>
</ul>
</div>
</div>
<div id="outline-container-sec-6-5" class="outline-3">
<h3 id="sec-6-5"><span class="section-number-3">6.5</span> Need for Research</h3>
<div class="outline-text-3" id="text-6-5">
<p>
Need to research problem areas & individual problems!
</p>
<ul class="org-ul">
<li>Need to write resolutions & best practices
</li>
</ul>
<p>
Example 1: <a href="http://mappings.dbpedia.org/index.php/What's_in_a_Name">What's_in_a_Name</a>
</p>
<ul class="org-ul">
<li>Believe it or not, DBO has 86 properties with "name" in their name.
</li>
<li>Birth, former, historical, old, original, previous, same, present: in what situations should each one be used?
</li>
<li>About 30 language-specific name props need to be converted to one prop with a lang tag
<ul class="org-ul">
<li>Eg <a href="https://github.com/dbpedia/mappings-tracker/issues/15">#15</a> use "language" instead of "cyrilliqueName"
</li>
</ul>
</li>
</ul>
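<p>
A sketch of the lang-tag conversion in Turtle. The resource <code>ex:SomePerson</code>, the literal values and the choice of <code>foaf:name</code> as the single target prop are assumptions for illustration; only <code>cyrilliqueName</code> comes from #15:
</p>
<div class="org-src-container">
<pre class="src src-Turtle">@prefix dbo:  &lt;http://dbpedia.org/ontology/&gt; .
@prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
@prefix ex:   &lt;http://example.org/&gt; .

# Before: the language (here really the script) is baked into the prop name
ex:SomePerson dbo:cyrilliqueName "Пример" .

# After (assumption): one name prop, with the language carried by the literal's lang tag
ex:SomePerson foaf:name "Пример"@ru , "Example"@en .
</pre>
</div>
<p>
Which tag to use (eg <code>ru</code> or <code>sr-Cyrl</code>) depends on the source language of the infobox field, so that is one of the things the write-up would have to settle.
</p>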
<p>
Other candidates:
</p>
<ul class="org-ul">
<li>Membership props
</li>
<li>Place hierarchy props, etc
</li>
</ul>
<p>
Any takers to research and write up?
</p>
</div>
</div>
<div id="outline-container-sec-6-6" class="outline-3">
<h3 id="sec-6-6"><span class="section-number-3">6.6</span> Need for Research (continued)</h3>
<div class="outline-text-3" id="text-6-6">
<p>
Example 2: <a href="https://github.com/dbpedia/mappings-tracker/issues/19">#19</a> fix mapping Listen. Conclusion:
</p>
<ul class="org-ul">
<li>delete class Listen, replace with prop soundRecording