Commit

Merge pull request #1078 from kermitt2/copyrights-licenses
Copyrights owner and licenses identification models
kermitt2 authored Feb 10, 2024
2 parents b829eff + 261f975 commit ed9fef7
Showing 27 changed files with 680 additions and 50 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -79,6 +79,7 @@ grobid-home/models/values
grobid-home/models/dataseer*
grobid-home/models/datasets*
grobid-home/models/*-bert*/
grobid-home/models/*_bert*/
grobid-home/models/*scibert*/
grobid-home/models/context_*

7 changes: 4 additions & 3 deletions Readme.md
@@ -33,8 +33,9 @@ The following functionalities are available:
- __Consolidation/resolution of the extracted bibliographical references__ using the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service or the [CrossRef REST API](https://github.com/CrossRef/rest-api-doc). In both cases, DOI/PMID resolution performance is higher than 0.95 F1-score from PDF extraction.
- __Extraction and parsing of patent and non-patent references in patent__ publications.
- __Extraction of Funders and funding information__ with optional matching of extracted funders with the CrossRef Funder Registry.
- __Identification of copyrights' owner and license associated to the document__, e.g. publisher or authors copyrights, CC-BY/CC-BY-NC/etc. license.

In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middle names, affiliation types, detailed address, journal, volume, issue, pages, DOI, PMID, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.).
In a complete PDF processing, GROBID manages 68 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middle names, affiliation types, detailed address, journal, volume, issue, pages, DOI, PMID, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.).

GROBID includes a comprehensive [web service API](https://grobid.readthedocs.io/en/latest/Grobid-service/), [Docker images](https://grobid.readthedocs.io/en/latest/Grobid-docker/), [batch processing](https://grobid.readthedocs.io/en/latest/Grobid-batch/), a JAVA API, a generic [training and evaluation framework](https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/) (precision, recall, etc., n-fold cross-evaluation), systematic [end-to-end benchmarking](https://grobid.readthedocs.io/en/latest/Benchmarking/) on thousand documents and the semi-automatic generation of training data.

@@ -108,7 +109,7 @@ A series of additional modules have been developed for performing __structure aw
- [grobid-quantities](https://github.com/kermitt2/grobid-quantities): recognition and normalization of physical quantities/measurements
- [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors): recognition of superconductor material and properties in scientific literature
- [entity-fishing](https://github.com/kermitt2/entity-fishing), a tool for extracting Wikidata entities from text and document, which can also use Grobid to pre-process scientific articles in PDF, leading to more precise and relevant entity extraction and the capacity to annotate the PDF with interactive layout
- [dataseer-ml](https://github.com/dataseer/dataseer-ml): identification of sections and sentences introducing datasets in a scientific article, and classification of the type of these datasets
- [datastet](https://github.com/kermitt2/datastet): identification of sections and sentences introducing datasets in a scientific article, identification of dataset names (implicit and named datasets) and classification of the type of these datasets
- [grobid-ner](https://github.com/kermitt2/grobid-ner): named entity recognition
- [grobid-astro](https://github.com/kermitt2/grobid-astro): recognition of astronomical entities in scientific papers
- [grobid-bio](https://github.com/kermitt2/grobid-bio): a toy bio-entity tagger using BioNLP/NLPBA 2004 dataset
@@ -143,7 +144,7 @@ If you want to cite this work, please refer to the present GitHub project, toget
title = {GROBID},
howpublished = {\url{https://github.com/kermitt2/grobid}},
publisher = {GitHub},
year = {2008--2023},
year = {2008--2024},
archivePrefix = {swh},
eprint = {1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c}
}
4 changes: 4 additions & 0 deletions doc/Grobid-service.md
@@ -138,6 +138,7 @@ Extract the header of the input PDF document, normalize it and convert it into a
| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed |
| | | | `consolidateHeader` | optional | consolidateHeader is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the header and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted) . |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |

Use `Accept: application/x-bibtex` to retrieve BibTeX format instead of TEI (note: the TEI XML format is much richer, it should be preferred if there is no particular reason to use BibTeX).

@@ -177,6 +178,7 @@ Convert the complete input document into TEI XML format (header, body and biblio
| | | | `consolidateFunders` | optional | `consolidateFunders` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). |
| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |
| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
| | | | `segmentSentences` | optional | Paragraphs structures in the resulting TEI will be further segmented into sentence elements <s> |
| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF) |
@@ -220,6 +222,8 @@ Regarding the bibliographical references, it is possible to include the original
curl -v --form input=@./thefile.pdf --form includeRawCitations=1 localhost:8070/api/processFulltextDocument
```

Raw strings can be added to the result in the same way for affiliations (`includeRawAffiliations=1`) and for copyrights/license sections (`includeRawCopyrights=1`).
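
For callers that do not use curl, the same request can be assembled by hand. The following is a minimal Java sketch, assuming Java 11+ and the standard `java.net.http` client; the class and method names (`RawCopyrightsRequestSketch`, `buildBody`, `buildRequest`) are hypothetical and not part of GROBID, while the form field names `input`, `includeRawAffiliations` and `includeRawCopyrights` are the ones documented in the tables above.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

public class RawCopyrightsRequestSketch {

    // Appends the UTF-8 bytes of a string to the buffer (no checked exception on this overload).
    private static void write(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    // Builds a multipart/form-data body: the plain form fields first, then the PDF as the `input` part.
    public static byte[] buildBody(String boundary, Map<String, String> fields,
                                   String fileName, byte[] pdfBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            write(out, "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"" + field.getKey() + "\"\r\n\r\n"
                    + field.getValue() + "\r\n");
        }
        write(out, "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"input\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n");
        out.write(pdfBytes, 0, pdfBytes.length);
        // Closing delimiter of the multipart body.
        write(out, "\r\n--" + boundary + "--\r\n");
        return out.toByteArray();
    }

    // Builds (but does not send) a POST request asking for raw affiliation and copyrights strings.
    public static HttpRequest buildRequest(String serviceUrl, String fileName, byte[] pdfBytes) {
        // Assumption: a random boundary is unlikely to collide with the PDF content.
        String boundary = "grobid-sketch-" + UUID.randomUUID();
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("includeRawAffiliations", "1");
        fields.put("includeRawCopyrights", "1");
        return HttpRequest.newBuilder(URI.create(serviceUrl))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(
                        buildBody(boundary, fields, fileName, pdfBytes)))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against a running service, for instance `http://localhost:8070/api/processFulltextDocument` as in the curl example.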

Example with requested additional sentence segmentation of the paragraph with bounding box coordinates of the sentence structures:

```console
4 changes: 3 additions & 1 deletion grobid-core/src/main/java/org/grobid/core/GrobidModels.java
@@ -51,7 +51,9 @@ public enum GrobidModels implements GrobidModel {
//ACKNOWLEDGEMENT("acknowledgement"),
FUNDING_ACKNOWLEDGEMENT("funding-acknowledgement"),
INFRASTRUCTURE("infrastructure"),
DUMMY("none");
DUMMY("none"),
LICENSE("license"),
COPYRIGHT("copyright");

//I cannot declare it before
public static final String DUMMY_FOLDER_LABEL = "none";
12 changes: 12 additions & 0 deletions grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java
@@ -6,6 +6,7 @@
import org.grobid.core.data.util.AuthorEmailAssigner;
import org.grobid.core.data.util.ClassicAuthorEmailAssigner;
import org.grobid.core.data.util.EmailSanitizer;
import org.grobid.core.data.CopyrightsLicense;
import org.grobid.core.document.*;
import org.grobid.core.engines.config.GrobidAnalysisConfig;
import org.grobid.core.exceptions.GrobidException;
@@ -376,6 +377,9 @@ public String toString() {
// Availability statement
private String availabilityStmt = null;

// Copyrights/license information object
CopyrightsLicense copyrightsLicense = null;

public static final List<String> confPrefixes = Arrays.asList("Proceedings of", "proceedings of",
"In Proceedings of the", "In: Proceeding of", "In Proceedings, ", "In Proceedings of",
"In Proceeding of", "in Proceeding of", "in Proceeding", "In Proceeding", "Proceedings",
@@ -4477,4 +4481,12 @@ public void setAvailabilityStmt(String availabilityStmt) {
public List<List<LayoutToken>> getAffiliationAddresslabeledTokens() {
return affiliationAddresslabeledTokens;
}

public void setCopyrightsLicense(CopyrightsLicense copyrightsLicense) {
this.copyrightsLicense = copyrightsLicense;
}

public CopyrightsLicense getCopyrightsLicense() {
return this.copyrightsLicense;
}
}
@@ -0,0 +1,96 @@
package org.grobid.core.data;

import org.grobid.core.utilities.TextUtilities;

import java.util.ArrayList;
import java.util.List;
import java.util.Arrays;

/**
* Class for representing information related to copyrights owner and file license.
*/
public class CopyrightsLicense {

// copyrights owner
public enum CopyrightsOwner {
PUBLISHER ("publisher"),
AUTHORS ("authors"),
UNDECIDED ("undecided");

private String name;

private CopyrightsOwner(String name) {
this.name = name;
}

public String getName() {
return name;
}
};

public static List<String> copyrightOwners = Arrays.asList("publisher", "authors", "undecided");

// File-level licenses
public enum License {
CC0 ("CC-0"),
CCBY ("CC-BY"),
CCBYNC ("CC-BY-NC"),
CCBYNCND ("CC-BY-NC-ND"),
CCBYSA ("CC-BY-SA"),
CCBYNCSA ("CC-BY-NC-SA"),
CCBYND ("CC-BY-ND"),
COPYRIGHT ("strict-copyrights"),
OTHER ("other"),
UNDECIDED ("undecided");

private String name;

private License(String name) {
this.name = name;
}

public String getName() {
return name;
}
};

public static List<String> licenses =
Arrays.asList("CC-0", "CC-BY", "CC-BY-NC", "CC-BY-NC-ND", "CC-BY-SA", "CC-BY-NC-SA", "CC-BY-ND", "copyright", "other", "undecided");

private CopyrightsOwner copyrightsOwner;
private double copyrightsOwnerProb;
private License license;
private double licenseProb;

public CopyrightsOwner getCopyrightsOwner() {
return this.copyrightsOwner;
}

public void setCopyrightsOwner(CopyrightsOwner owner) {
this.copyrightsOwner = owner;
}

public double getCopyrightsOwnerProb() {
return this.copyrightsOwnerProb;
}

public void setCopyrightsOwnerProb(double prob) {
this.copyrightsOwnerProb = prob;
}

public License getLicense() {
return this.license;
}

public void setLicense(License license) {
this.license = license;
}

public double getLicenseProb() {
return this.licenseProb;
}

public void setLicenseProb(double prob) {
this.licenseProb = prob;
}
}
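
Note that the `License` enum above maps `COPYRIGHT` to the name `strict-copyrights`, while the hand-written `licenses` list contains `copyright`. One way to avoid that kind of drift is to derive the list of names from the enum itself. The following is a minimal sketch and not part of the commit: `LicenseNamesSketch`, `licenseNames` and `fromName` are hypothetical names, and the enum is a local stand-in copy so that the example compiles on its own.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LicenseNamesSketch {

    // Local stand-in mirroring CopyrightsLicense.License from the diff above.
    public enum License {
        CC0("CC-0"), CCBY("CC-BY"), CCBYNC("CC-BY-NC"), CCBYNCND("CC-BY-NC-ND"),
        CCBYSA("CC-BY-SA"), CCBYNCSA("CC-BY-NC-SA"), CCBYND("CC-BY-ND"),
        COPYRIGHT("strict-copyrights"), OTHER("other"), UNDECIDED("undecided");

        private final String name;

        License(String name) {
            this.name = name;
        }

        public String getName() {
            return name;
        }
    }

    // Hypothetical helper: the label list is computed from the enum, so both always agree.
    public static List<String> licenseNames() {
        return Arrays.stream(License.values())
                .map(License::getName)
                .collect(Collectors.toList());
    }

    // Hypothetical helper: resolves a predicted label to the enum, falling back to UNDECIDED
    // for null or unrecognized labels.
    public static License fromName(String label) {
        for (License license : License.values()) {
            if (license.getName().equalsIgnoreCase(label)) {
                return license;
            }
        }
        return License.UNDECIDED;
    }

    public static void main(String[] args) {
        System.out.println(licenseNames());
        System.out.println(fromName("CC-BY-NC"));
    }
}
```

With this approach a label such as `copyright`, which has no matching enum name, resolves to `UNDECIDED` rather than silently passing a membership check against a separate list.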
@@ -14,6 +14,8 @@
import nu.xom.Text;

import org.grobid.core.GrobidModels;
import org.grobid.core.data.CopyrightsLicense.License;
import org.grobid.core.data.CopyrightsLicense.CopyrightsOwner;
import org.grobid.core.data.Date;
import org.grobid.core.data.*;
import org.grobid.core.document.xml.XmlBuilderUtils;
@@ -248,28 +250,75 @@ public StringBuilder toTEIHeader(BiblioItem biblio,

if ((biblio.getPublisher() != null) ||
(biblio.getPublicationDate() != null) ||
(biblio.getNormalizedPublicationDate() != null)) {
(biblio.getNormalizedPublicationDate() != null) ||
biblio.getCopyrightsLicense() != null) {
tei.append("\t\t\t<publicationStmt>\n");

CopyrightsLicense copyrightsLicense = biblio.getCopyrightsLicense();

if (biblio.getPublisher() != null) {
// publisher and date under <publicationStmt> for better TEI conformance
tei.append("\t\t\t\t<publisher>" + TextUtilities.HTMLEncode(biblio.getPublisher()) +
"</publisher>\n");

tei.append("\t\t\t\t<availability status=\"unknown\">");
tei.append("<p>Copyright ");
//if (biblio.getPublicationDate() != null)
tei.append(TextUtilities.HTMLEncode(biblio.getPublisher()) + "</p>\n");
tei.append("\t\t\t\t</availability>\n");
} else {
// a dummy publicationStmt is still necessary according to TEI
tei.append("\t\t\t\t<publisher/>\n");
if (defaultPublicationStatement == null) {
tei.append("\t\t\t\t<availability status=\"unknown\"><licence/></availability>");
}

// We introduce something more meaningful with TEI customization to encode copyrights information:
// - @resp with value "publisher", "authors", "unknown", we add a comment to clarify that @resp
// should be interpreted as the copyrights owner
// - license related to copyrights exception is encoded via <licence>
// (note: I have no clue what can mean "free" as status for a document - there are always some sort of
// restrictions like moral rights even for public domain documents)
if (copyrightsLicense != null) {
tei.append("\t\t\t\t<availability ");

boolean addCopyrightsComment = false;
if (copyrightsLicense.getCopyrightsOwner() != null && copyrightsLicense.getCopyrightsOwner() != CopyrightsOwner.UNDECIDED) {
tei.append("resp=\""+ copyrightsLicense.getCopyrightsOwner().getName() +"\" ");
addCopyrightsComment = true;
}

if (copyrightsLicense.getLicense() != null && copyrightsLicense.getLicense() != License.UNDECIDED) {
tei.append("status=\"restricted\">\n");
if (addCopyrightsComment) {
tei.append("\t\t\t\t\t<!-- the @resp attribute above gives the document copyrights owner (publisher, authors), if known -->\n");
}
tei.append("\t\t\t\t\t<licence>"+copyrightsLicense.getLicense().getName()+"</licence>\n");
} else {
tei.append("\t\t\t\t<availability status=\"unknown\"><p>" +
TextUtilities.HTMLEncode(defaultPublicationStatement) + "</p></availability>");
tei.append(" status=\"unknown\">\n");
if (addCopyrightsComment) {
tei.append("\t\t\t\t\t<!-- the @resp attribute above gives the document copyrights owner (publisher, authors), if known -->\n");
}
tei.append("\t\t\t\t\t<licence/>\n");
}
tei.append("\n");

if (config.getIncludeRawCopyrights() && biblio.getCopyright() != null && biblio.getCopyright().length()>0) {
tei.append("\t\t\t\t\t<p type=\"raw\">");
tei.append(TextUtilities.HTMLEncode(biblio.getCopyright()));
tei.append("</p>\n");
}

tei.append("\t\t\t\t</availability>\n");
} else {
tei.append("\t\t\t\t<availability ");

tei.append(" status=\"unknown\">\n");
tei.append("\t\t\t\t\t<licence/>\n");

if (defaultPublicationStatement != null) {
tei.append("\t\t\t\t\t<p>" +
TextUtilities.HTMLEncode(defaultPublicationStatement) + "</p>\n");
}

if (config.getIncludeRawCopyrights() && biblio.getCopyright() != null && biblio.getCopyright().length()>0) {
tei.append("\t\t\t\t\t<p type=\"raw\">");
tei.append(TextUtilities.HTMLEncode(biblio.getCopyright()));
tei.append("</p>\n");
}

tei.append("\t\t\t\t</availability>\n");
}

if (biblio.getNormalizedPublicationDate() != null) {
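
The serialization logic in this hunk can be summarized with a small standalone sketch. It is an illustration under assumptions, not the formatter's actual code: `AvailabilitySketch`, `availability` and `escape` are hypothetical names, `escape` is a simplified stand-in for `TextUtilities.HTMLEncode`, owner and license are passed as plain strings rather than enum values, and indentation and the explanatory XML comment are omitted.

```java
public class AvailabilitySketch {

    // Simplified stand-in for TextUtilities.HTMLEncode: escapes only XML-significant characters.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds the <availability> element: @resp carries the copyrights owner when it is known,
    // @status is "restricted" when a license was identified and "unknown" otherwise.
    public static String availability(String owner, String license, String rawCopyright) {
        StringBuilder tei = new StringBuilder("<availability ");
        if (owner != null && !owner.equals("undecided")) {
            tei.append("resp=\"").append(escape(owner)).append("\" ");
        }
        boolean licenseKnown = license != null && !license.equals("undecided");
        tei.append("status=\"").append(licenseKnown ? "restricted" : "unknown").append("\">");
        if (licenseKnown) {
            tei.append("<licence>").append(escape(license)).append("</licence>");
        } else {
            tei.append("<licence/>");
        }
        // The raw copyrights string is only emitted on request; it opens and closes with <p>.
        if (rawCopyright != null && !rawCopyright.isEmpty()) {
            tei.append("<p type=\"raw\">").append(escape(rawCopyright)).append("</p>");
        }
        return tei.append("</availability>").toString();
    }

    public static void main(String[] args) {
        System.out.println(availability("publisher", "CC-BY", null));
    }
}
```

Keeping the opening and closing of the raw paragraph in one place makes it harder for the two tags to diverge, which matters because the output has to remain well-formed XML.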
12 changes: 10 additions & 2 deletions grobid-core/src/main/java/org/grobid/core/engines/Engine.java
@@ -350,13 +350,15 @@ public String processHeader(
String inputFile,
int consolidate,
boolean includeRawAffiliations,
boolean includeRawCopyrights,
BiblioItem result
) {
GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder()
.startPage(0)
.endPage(2)
.consolidateHeader(consolidate)
.includeRawAffiliations(includeRawAffiliations)
.includeRawCopyrights(includeRawCopyrights)
.build();
return processHeader(inputFile, null, config, result);
}
@@ -380,12 +382,14 @@ public String processHeaderFunding(
File inputFile,
int consolidateHeader,
int consolidateFunders,
boolean includeRawAffiliations
boolean includeRawAffiliations,
boolean includeRawCopyrights
) throws Exception {
GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder()
.consolidateHeader(consolidateHeader)
.consolidateFunders(consolidateFunders)
.includeRawAffiliations(includeRawAffiliations)
.includeRawCopyrights(includeRawCopyrights)
.build();
return processHeaderFunding(inputFile, null, config);
}
@@ -408,13 +412,15 @@ public String processHeader(
String md5Str,
int consolidate,
boolean includeRawAffiliations,
boolean includeRawCopyrights,
BiblioItem result
) {
GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder()
.startPage(0)
.endPage(2)
.consolidateHeader(consolidate)
.includeRawAffiliations(includeRawAffiliations)
.includeRawCopyrights(includeRawCopyrights)
.build();
return processHeader(inputFile, md5Str, config, result);
}
@@ -440,12 +446,14 @@ public String processHeaderFunding(
String md5Str,
int consolidateHeader,
int consolidateFunders,
boolean includeRawAffiliations
boolean includeRawAffiliations,
boolean includeRawCopyrights
) throws Exception {
GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder()
.consolidateHeader(consolidateHeader)
.consolidateFunders(consolidateFunders)
.includeRawAffiliations(includeRawAffiliations)
.includeRawCopyrights(includeRawCopyrights)
.build();
return processHeaderFunding(inputFile, md5Str, config);
}