att.referring for <span> (and potentially <seg>) #1666

bansp · 2017-07-12T16:12:30Z

This ticket addresses the issue of efficient and standards-compliant delimitation of textual spans/segments in language resources encoded in the TEI. This is needed for any kind of linguistic annotation, where not only sentences should be clearly marked up, but also phrases, words/tokens and sometimes also sub-word (morphological) elements, because all these objects may serve as anchors for sets of linguistic features.

Quick links

pull request (diff)
pull request (comments)
suggested documentation of att.referring
modification of the relevant Guidelines chapter (modest, pending the Council's decision)
Initial discussion of this set of changes can be found in "offset information for segments (from, to) -- what's the best move?" (laurentromary/stdfSpec#16)

Context and motivation

The Guidelines have long advocated pointing mechanisms that rely on "enhanced fragment identifiers", of which the most commonly known form (e.g. #word1, assuming the existence of xml:id="word1" somewhere in the addressed document) is the simplest case. At first, the Guidelines (very sensibly at the time) attempted to use the then brand-new XPointer technology to define the TEI's own so-called "XPointer schemes", which enhanced the existing W3C proposals. This very nice and extremely flexible idea has failed, because technology never caught up with the complexity of XPointer, and the TEI devices therefore remained merely a nifty markup artefact without actual implementations (for some of the history of the idea, see the TEI wiki article on XPointer and related pages, NB. prepared years ago by one of the proponents of the current ticket). This was followed by the simplified devices currently endorsed, which, to our knowledge, enjoy a status similar to that of the original XPointers: no accessible cross-platform implementations exist.

The current choice is whether to try to convince the linguistic community to create parsers designed to analyse the current incarnation of the TEI XPointer schemes (clever as they are; no one disputes that), or whether to rely on direct access to the values of the attributes @from and @to, in the way promoted by the same standardization efforts that the TEI participates in, namely the ISO Linguistic Annotation Framework.
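
For reference, the simplest pointing case mentioned above (a bare fragment identifier such as #word1) can be resolved with nothing more than a standard XML parser. The following is a minimal sketch, not part of the proposal: the function name `resolve_fragment` and the sample `<s>`/`<w>` markup are made up for illustration, and the sketch deliberately handles only bare `#id` pointers, not any XPointer scheme.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:id attribute as ElementTree reports it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_fragment(root, pointer):
    """Return the element whose xml:id matches a bare '#id' pointer, or None.

    Hypothetical helper: anything beyond a bare fragment identifier
    (e.g. an XPointer scheme) is out of scope here.
    """
    if not pointer.startswith("#"):
        raise ValueError("this sketch only handles bare fragment identifiers")
    target = pointer[1:]
    for element in root.iter():
        if element.get(XML_ID) == target:
            return element
    return None


# Illustrative sample document (not taken from the ticket).
doc = ET.fromstring(
    '<s><w xml:id="word1">Hello</w> <w xml:id="word2">world</w></s>'
)
word = resolve_fragment(doc, "#word1")
print(word.text if word is not None else "not found")
```

This covers only whole elements; addressing a span of characters inside an element is the part for which the XPointer schemes or offset attributes are needed.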

Description of the proposal

The LingSIG would like to submit to the TEI Council a proposal that is much more accessible for XML-aware tools than XPointer schemes and that enjoys the stamp of approval of the International Organization for Standardization (ISO), more precisely of its Technical Committee 37, Subcommittee 4, with which the TEI Consortium has a liaison agreement. The proposal is to identify spans of characters by means of their offsets, used as values of the @from and @to attributes. To this effect, the existing specification in span.xml was modified: an attribute class att.referring was created for the purpose of defining the attributes @to and @from, previously defined by the <span> element. <span> is, according to this proposal, a member of att.referring and behaves exactly as it used to, so no backwards-compatibility issues arise.
At the same time, other elements can be made members of att.referring; notably, <seg> can be redefined as a member of this class, as shown in the documentation prepared as part of the newly proposed specs (in the pull request that accompanies this ticket). The optional third attribute of the att.referring class, namely @referringMode (defaulting to the values currently used by <span>), provides a further means of controlling the content of @to and @from. Specifically, in markup that follows the ISO proposals, the value of @referringMode is set to "icp", which stands for "inter-character point", a concept originally defined by XPointer and re-used by the ISO Linguistic Annotation Framework. An example of the use of <seg> for this purpose can be seen in the diff view of the pull request, together with a piece of ODD that customises the TEI. In the long run, we hope that this piece of customization will not be needed, if/when the <seg> element also becomes a member of the att.referring class.
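
To make the "inter-character point" idea concrete, here is a minimal sketch of how a tool might resolve such offsets. It rests on several assumptions that are not confirmed by this ticket: that @from and @to hold plain non-negative integers, that point 0 lies before the first character and point n after the n-th, that offsets count Unicode code points, and that the base text is supplied separately. The sample text, the `<body>` wrapper and the function name `resolve_icp_spans` are invented for illustration; the actual example is in the pull request.

```python
import xml.etree.ElementTree as ET

# Illustrative base text and stand-off segments (assumed, not from the PR).
TEXT = "The cat sleeps."

ANNOTATION = """
<body>
  <seg type="token" from="0" to="3" referringMode="icp"/>
  <seg type="token" from="4" to="7" referringMode="icp"/>
  <seg type="token" from="8" to="14" referringMode="icp"/>
</body>
"""


def resolve_icp_spans(text, annotation_xml):
    """Return (start, end, substring) for every <seg> in 'icp' mode.

    Assumption: an inter-character point k sits between character k-1 and
    character k, so a span from=a to=b corresponds to the slice text[a:b].
    """
    spans = []
    for seg in ET.fromstring(annotation_xml).iter("seg"):
        if seg.get("referringMode") != "icp":
            continue  # other modes are out of scope for this sketch
        start, end = int(seg.get("from")), int(seg.get("to"))
        if not 0 <= start <= end <= len(text):
            raise ValueError(f"offsets out of range: {start}..{end}")
        spans.append((start, end, text[start:end]))
    return spans


for start, end, token in resolve_icp_spans(TEXT, ANNOTATION):
    print(start, end, token)
```

Under these assumptions the offsets map directly onto ordinary string slicing, which is the kind of direct attribute access the proposal contrasts with parsing XPointer schemes.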

The pull request also contains a sketchy fragment that identifies the place in the Guidelines where the current mechanism is discussed and where modifications should appear if the Council accepts the ticket. Since we realise that the exact wording, and the values used in it, are subject to debate, we do not provide more at this moment, but we are willing to elaborate on it if the Council gives this ticket the green light.

Summary of changes:

P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml -- minimal changes to hook up the proposal to the Guidelines
P5/Source/Specs/att.referring.xml -- class definition and documentation (parts moved over from span.xml)
P5/Source/Specs/teidata.referring.xml -- data definition used by att.referring
P5/Source/Specs/span.xml -- thinned down, added to att.referring

Extent of changes

The change, as proposed in this ticket and the initial pull request, is practically invisible to the end user of the Guidelines (it basically exposes att.referring as a target of potential customization). We would like to ask that the Council consider adding <seg> to att.referring as well.


laurentromary · 2017-09-22T09:45:26Z

This ticket, as well as #1670, is an important one for the LingSIG community. Is it planned that they will be discussed at the forthcoming F2F? Since they are "thick", maybe someone could take the lead in reading them through. Thanks!

susannehaaf · 2018-04-20T16:33:34Z

@to and @from are also directly assigned to the elements <app>, <locus> and <arc>. Would it be suitable to make these elements members of the class att.referring as well? In this case, would the datatypes teidata.word (for numeric values as well as strings) and teidata.pointer be sufficient? teidata.word would help include <locus> in the group.

bansp · 2020-02-12T16:28:44Z

I would like to retire this ticket. When I posted it, I was not aware that the Council does not wish to overload attributes (possibly, the Council wasn't aware of that either, back then :-)).
I don't think closing this is going to affect the proposal made by @joeytakeda, because as far as I can see, he uses the "proper" (or simply: established) datatype for @to and @from, namely teidata.pointer.
I intend to replace this ticket with a much simpler proposal that doesn't overload the two attributes and keeps things fairly tight (so stay tuned...).

bansp added SIG:LingSIG Type: FeatureRequest labels Jul 12, 2017

bansp mentioned this issue Jul 12, 2017

att.referring for <span> #1665

Closed

sydb assigned peterstadler and scstanley7 and unassigned peterstadler Jul 19, 2017

peterstadler assigned peterstadler and unassigned scstanley7 Nov 18, 2017

joeytakeda mentioned this issue Oct 31, 2018

New element for grouping notes: <noteGrp> #1833

Closed

bansp closed this as completed Feb 12, 2020

bansp mentioned this issue Feb 12, 2021

attributes start and end not allowed in seg element clarin-eric/parla-clarin#12

Closed
