att.referring for <span> (and potentially <seg>) #1666

Closed
bansp opened this issue Jul 12, 2017 · 3 comments

@bansp
Member

bansp commented Jul 12, 2017

This ticket addresses the issue of efficient and standards-compliant delimitation of textual spans/segments in language resources encoded in the TEI. This is needed for any kind of linguistic annotation, where not only sentences should be clearly marked up, but also phrases, words/tokens and sometimes also sub-word (morphological) elements, because all these objects may serve as anchors for sets of linguistic features.

Quick links

  1. pull request (diff)
  2. pull request (comments)
  3. suggested documentation of att.referring
  4. modification of the relevant Guidelines chapter (modest, pending the Council's decision)
  5. The initial discussion of this set of changes is in "offset information for segments (from, to) -- what's the best move?" (laurentromary/stdfSpec#16)

Context and motivation

The Guidelines have long advocated pointing mechanisms that rely on "enhanced fragment identifiers", of which the most commonly known form (e.g. #word1, assuming the existence of xml:id="word1" somewhere in the addressed document) was the simplest case. At first, the Guidelines (very sensibly at the time) attempted to use the then brand-new XPointer technology to define the TEI's own so-called "XPointer schemes", which enhanced the existing W3C proposals. This very nice and extremely flexible idea failed because technology did not catch up with the complexity of XPointer, and the TEI devices therefore remained a nifty markup artefact without actual implementations (for some of the history of the idea, see the TEI wiki article on XPointer and related pages, NB. prepared years ago by one of the proponents of the current ticket). This was followed by the simplified devices currently endorsed, which, to our knowledge, enjoy a status similar to the original XPointers: no accessible cross-platform implementations exist. The current choice is whether to try to convince the linguistic community to create parsers for the current incarnation of the TEI XPointer schemes (clever as they are; no one disputes that), or to rely on direct access to the values of the attributes @from and @to, in the way promoted by the same standardization efforts that the TEI participates in, namely the ISO Linguistic Annotation Framework.

Description of the proposal

The LingSIG would like to submit to the TEI Council a proposal that is much more accessible to XML-aware tools than XPointer schemes and that enjoys the stamp of approval of the International Organization for Standardization (ISO), more precisely of its Technical Committee 37, Subcommittee 4, with which the TEI Consortium has a liaison agreement. The proposal is to identify spans of characters by means of their offsets, used as values of the @from and @to attributes. To this effect, the existing specification in span.xml was modified: an attribute class att.referring was created to define the attributes @to and @from, previously defined by the <span> element. Under this proposal, <span> is a member of att.referring and behaves exactly as it used to, so no backwards-compatibility issues arise.
At the same time, other elements can be made members of att.referring; notably, <seg> can be redefined as a member of this class, as shown in the documentation prepared as part of the newly proposed specs (in the pull request that accompanies this ticket). The optional third attribute of the att.referring class, @referringMode (defaulting to the values currently used by <span>), provides a further means of controlling the content of @to and @from. Specifically, in markup that follows the ISO proposals, the value of @referringMode is set to "icp", which stands for "inter-character point", a concept originally defined by XPointer and re-used by the ISO Linguistic Annotation Framework. An example of the use of <seg> for this purpose can be seen in the diff view of the pull request, together with a piece of ODD that customises the TEI. In the long run, we hope that this piece of customization will not be needed, if and when the <seg> element also becomes a member of the att.referring class.
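
To make the offset idea concrete, here is a minimal Python sketch of how an XML-aware tool might resolve inter-character-point offsets against a base text. It is an illustration only: the base text, the `<annotations>` wrapper, the attribute values, and the function name `resolve_icp_spans` are hypothetical and are not taken from the pull request, and the sketch assumes that offsets count Unicode code points, which the proposal does not state here.

```python
import xml.etree.ElementTree as ET

# Hypothetical base text and stand-off markup, invented for illustration.
BASE_TEXT = "The cat sat"

STANDOFF = """
<annotations>
  <seg xml:id="w1" from="0" to="3" referringMode="icp"/>
  <seg xml:id="w2" from="4" to="7" referringMode="icp"/>
  <seg xml:id="w3" from="8" to="11" referringMode="icp"/>
</annotations>
"""

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_icp_spans(base_text, standoff_xml):
    """Map each <seg> id to the substring its @from/@to offsets delimit.

    Inter-character point 0 lies before the first character and point n
    lies after the n-th character, so a (from, to) pair corresponds to
    the Python slice base_text[from:to].
    Assumption: offsets are counted in Unicode code points.
    """
    root = ET.fromstring(standoff_xml)
    spans = {}
    for seg in root.iter("seg"):
        # Only handle segments that declare the inter-character-point mode.
        if seg.get("referringMode") != "icp":
            continue
        start = int(seg.get("from"))
        end = int(seg.get("to"))
        if not 0 <= start <= end <= len(base_text):
            raise ValueError(f"offsets out of range: from={start}, to={end}")
        spans[seg.get(XML_ID)] = base_text[start:end]
    return spans


RESOLVED = resolve_icp_spans(BASE_TEXT, STANDOFF)
```

The point of the sketch is that resolving such spans needs nothing beyond attribute access and integer parsing, in contrast to a dedicated parser for an XPointer scheme.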

The pull request also contains a sketchy fragment that identifies the place in the Guidelines where the current mechanism is discussed and where modifications should appear if the Council accepts the ticket. Since we realise that the exact text, and the values used in it, are subject to debate, we do not provide more at this moment, but we are willing to elaborate on it if the Council gives this ticket the green light.

Summary of changes:

  1. P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml -- minimal changes to hook up the proposal to the Guidelines
  2. P5/Source/Specs/att.referring.xml -- class definition and documentation (parts moved over from span.xml)
  3. P5/Source/Specs/teidata.referring.xml -- data definition used by att.referring
  4. P5/Source/Specs/span.xml -- thinned down, added to att.referring

Extent of changes

The change, as proposed in this ticket and the initial pull request, is practically invisible to the end user of the Guidelines (it basically exposes att.referring as a target of potential customization). We would like to ask that the Council consider adding <seg> to att.referring as well.

@laurentromary
Contributor

This ticket, as well as #1670, is important for the LingSIG community. Is it planned that they will be discussed at the forthcoming F2F? Since they are "thick", maybe someone could take the lead in reading them through. Thanks!

@susannehaaf

@to and @from are also directly assigned to the elements <app>, <locus>, and <arc>. Would it be suitable to make these elements members of the class att.referring as well? In that case, would the datatypes teidata.word (for numeric values as well as strings) and teidata.pointer be sufficient? teidata.word would help include <locus> in the group.

@bansp
Member Author

bansp commented Feb 12, 2020

I would like to retire this ticket. When I posted it, I was not aware that the Council does not wish to overload attributes (possibly, the Council wasn't aware of that either, back then :-)).
I don't think closing this is going to affect the proposal made by @joeytakeda, because as far as I can see, he uses the "proper" (or simply: established) datatype for @to and @from, namely teidata.pointer.
I intend to replace this ticket with a much simpler proposal that doesn't overload the two attributes and keeps things fairly tight (so stay tuned...).
