This ticket addresses the issue of efficient and standards-compliant delimitation of textual spans/segments in language resources encoded in the TEI. This is needed for any kind of linguistic annotation, where not only sentences should be clearly marked up, but also phrases, words/tokens and sometimes also sub-word (morphological) elements, because all these objects may serve as anchors for sets of linguistic features.
Context and motivation
The Guidelines have long advocated pointing mechanisms that rely on "enhanced fragment identifiers", of which the most commonly known form (e.g. #word1, assuming the existence of xml:id="word1" somewhere in the addressed document) is the simplest case. At first, the Guidelines (very sensibly at the time) attempted to use the then brand-new XPointer technology to define the TEI's own so-called "XPointer schemes", which enhanced the existing W3C proposals. This very nice and extremely flexible idea failed because the technology never caught up with the complexity of XPointer, and the TEI devices therefore remained merely a nifty markup artefact without actual implementations (for some of the history of the idea, see the TEI wiki article on XPointer and related pages, NB. prepared years ago by one of the proponents of the current ticket). It was followed by the simplified devices currently endorsed, which, to our knowledge, enjoy a status similar to that of the original XPointers -- no accessible cross-platform implementations exist.
The current choice is whether to try to convince the linguistic community to create parsers designed to analyse the current incarnation of the TEI XPointer schemes (clever as they are -- no one disputes that), or whether to rely on direct access to the values of the attributes @from and @to, in the way promoted by the same standardization efforts that the TEI participates in, namely the ISO Linguistic Annotation Framework.
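For readers less familiar with the simplest case mentioned above, here is a minimal Python sketch of how a bare fragment identifier such as #word1 can be resolved with a stock XML parser. The sample markup and the helper name resolve_fragment are illustrative assumptions, not taken from the ticket or the pull request.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample document (not from the ticket).
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w xml:id="word1">Hello</w>
  <w xml:id="word2">world</w>
</s>"""


def resolve_fragment(root, pointer):
    """Return the element whose xml:id matches a bare '#id' pointer, or None."""
    if not pointer.startswith("#"):
        raise ValueError("only same-document '#id' pointers are handled here")
    wanted = pointer[1:]
    for elem in root.iter():
        if elem.get(XML_ID) == wanted:
            return elem
    return None


root = ET.fromstring(SAMPLE)
target = resolve_fragment(root, "#word1")
print(target.text)
```

Anything beyond this bare form (the TEI XPointer schemes) needs a dedicated parser, which is the implementation gap the ticket describes.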
Description of the proposal
The LingSIG would like to submit to the TEI Council a proposal that is much more accessible to XML-aware tools than XPointer schemes and that enjoys the stamp of approval of the International Organization for Standardization (ISO), more precisely of its Technical Committee 37, Subcommittee 4, with which the TEI Consortium has a liaison agreement. The proposal is to identify spans of characters by means of their offsets, used as values of the @from and @to attributes. To this effect, the existing specification in span.xml has been modified: an attribute class att.referring has been created to define the attributes @to and @from, previously defined by the <span> element. Under this proposal, <span> is a member of att.referring and behaves exactly as it used to -- so no backwards-compatibility issues arise.
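To illustrate why plain offsets are easy for tools to consume, here is a minimal Python sketch. It assumes that offsets are non-negative integers counting positions between characters, starting at 0 before the first character; the exact value syntax is defined in the pull request, not here. The function name span_text and the sample text are hypothetical.

```python
def span_text(text, start, end):
    """Return the characters lying between two offsets.

    Assumption: offset 0 is the point before the first character and
    offset len(text) is the point after the last one, so a (start, end)
    pair maps directly onto a Python slice.
    """
    if not 0 <= start <= end <= len(text):
        raise ValueError(f"offsets {start}..{end} fall outside the text")
    return text[start:end]


# Hypothetical primary text and (from, to) pairs for its tokens.
PRIMARY = "The cat sat."
TOKENS = [(0, 3), (4, 7), (8, 11), (11, 12)]

words = [span_text(PRIMARY, start, end) for start, end in TOKENS]
print(words)
```

Note that Python strings are indexed by Unicode code point; a real tool would have to agree with the annotation producer on what counts as one character.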
At the same time, other elements can be made members of att.referring; notably, <seg> can be redefined as a member of this class, as shown in the documentation prepared as part of the newly proposed specs (in the pull request that accompanies this ticket). The optional third attribute of the att.referring class, @referringMode (defaulting to the values currently used by <span>), provides a further means of controlling the content of @to and @from. Specifically, in markup that follows the ISO proposals, the value of @referringMode is set to "icp", which stands for "inter-character point", a concept originally defined by XPointer and re-used by the ISO Linguistic Annotation Framework. An example of the use of <seg> for this purpose can be seen in the diff view of the pull request, together with a piece of ODD that customises the TEI -- in the long run, we hope that this customization will not be needed, if/when the <seg> element also becomes a member of the att.referring class.
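As a rough illustration of how a consumer might branch on @referringMode, the Python sketch below reads hypothetical <seg> elements and resolves only those marked "icp". The sample markup, the integer value syntax, and the function name resolve are assumptions made for this sketch; the authoritative example is the one in the pull request.

```python
import xml.etree.ElementTree as ET

PRIMARY = "The cat sat."

# Hypothetical stand-off markup; namespaces are omitted for brevity.
ANNOTATION = """<ab>
  <seg referringMode="icp" from="0" to="3" type="token"/>
  <seg referringMode="icp" from="4" to="7" type="token"/>
  <seg from="#w3" to="#w3" type="token"/>
</ab>"""


def resolve(seg, text):
    """Return the text addressed by a <seg>, or None for non-icp modes."""
    if seg.get("referringMode") == "icp":
        # Assumed value syntax: plain integers naming inter-character points.
        start, end = int(seg.get("from")), int(seg.get("to"))
        return text[start:end]
    # Default mode: @from/@to hold pointers, as <span> uses them today;
    # resolving those is left to a separate pointer resolver.
    return None


segs = ET.fromstring(ANNOTATION).findall("seg")
resolved = [resolve(seg, PRIMARY) for seg in segs]
print(resolved)
```

Because the default branch leaves pointer values untouched, existing <span>-style markup would pass through such a reader unchanged.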
The pull request also contains a sketchy fragment that identifies the place in the Guidelines where the current mechanism is discussed and where modifications should appear if the Council accepts the ticket. Since we realise that the exact wording, and the values used in it, are subject to debate, we do not provide more at this moment, but we are willing to elaborate on it if the Council gives this ticket the green light.
Summary of changes:
P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml -- minimal changes to hook up the proposal to the Guidelines
P5/Source/Specs/att.referring.xml -- class definition and documentation (parts moved over from span.xml)
P5/Source/Specs/teidata.referring.xml -- data definition used by att.referring
P5/Source/Specs/span.xml -- thinned down, added to att.referring
Extent of changes
The change, as proposed in this ticket / initial pull request, is practically invisible to the end-user of the Guidelines (it basically exposes att.referring as a target for potential customization). We would like to ask the Council to consider adding <seg> to att.referring as well.
This ticket and #1670 are important for the LingSIG community. Are they planned for discussion at the forthcoming F2F? Since they are "thick", maybe someone could take the lead in reading them through. Thanks!
@to and @from are also directly assigned to the elements <app>, <locus>, and <arc>. Would it perhaps be suitable to make these elements members of the class att.referring as well? In that case, would the datatypes teidata.word (for numeric values as well as strings) and teidata.pointer be sufficient? teidata.word would help include <locus> in the group.
I would like to retire this ticket. When I posted it, I was not aware that the Council does not wish to overload attributes (possibly, the Council wasn't aware of that either, back then :-)).
I don't think closing this is going to affect the proposal made by @joeytakeda, because as far as I can see, he uses the "proper" (or simply: established) datatype for @to and @from, namely teidata.pointer.
I intend to replace this ticket with a much simpler proposal that doesn't overload the two attributes and keeps things fairly tight (so stay tuned...).