Merge pull request #1078 from kermitt2/copyrights-licenses

Copyrights owner and licenses identification models
kermitt2 · Feb 10, 2024 · ed9fef7
2 parents b829eff + 261f975
commit ed9fef7

Showing 27 changed files with 680 additions and 50 deletions.
diff --git a/.gitignore b/.gitignore
@@ -79,6 +79,7 @@ grobid-home/models/values
 grobid-home/models/dataseer*
 grobid-home/models/datasets*
 grobid-home/models/*-bert*/
+grobid-home/models/*_bert*/
 grobid-home/models/*scibert*/
 grobid-home/models/context_*
 

diff --git a/Readme.md b/Readme.md
@@ -33,8 +33,9 @@ The following functionalities are available:
 - __Consolidation/resolution of the extracted bibliographical references__ using the [biblio-glutton](https://github.com/kermitt2/biblio-glutton) service or the [CrossRef REST API](https://github.com/CrossRef/rest-api-doc). In both cases, DOI/PMID resolution performance is higher than 0.95 F1-score from PDF extraction.
 - __Extraction and parsing of patent and non-patent references in patent__ publications.
 - __Extraction of Funders and funding information__ with optional matching of extracted funders with the CrossRef Funder Registry.
+- __Identification of the copyrights owner and license associated with the document__, e.g. publisher or authors copyrights, CC-BY/CC-BY-NC/etc. license.
 
-In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middle names, affiliation types, detailed address, journal, volume, issue, pages, DOI, PMID, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.).
+In a complete PDF processing, GROBID manages 68 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/last/middle names, affiliation types, detailed address, journal, volume, issue, pages, DOI, PMID, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure captions, etc.).
 
 GROBID includes a comprehensive [web service API](https://grobid.readthedocs.io/en/latest/Grobid-service/), [Docker images](https://grobid.readthedocs.io/en/latest/Grobid-docker/), [batch processing](https://grobid.readthedocs.io/en/latest/Grobid-batch/), a JAVA API, a generic [training and evaluation framework](https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/) (precision, recall, etc., n-fold cross-evaluation), systematic [end-to-end benchmarking](https://grobid.readthedocs.io/en/latest/Benchmarking/) on thousands of documents and the semi-automatic generation of training data.
 
@@ -108,7 +109,7 @@ A series of additional modules have been developed for performing __structure aw
 - [grobid-quantities](https://github.com/kermitt2/grobid-quantities): recognition and normalization of physical quantities/measurements
 - [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors): recognition of superconductor material and properties in scientific literature
 - [entity-fishing](https://github.com/kermitt2/entity-fishing), a tool for extracting Wikidata entities from text and document, which can also use Grobid to pre-process scientific articles in PDF, leading to more precise and relevant entity extraction and the capacity to annotate the PDF with interactive layout
-- [dataseer-ml](https://github.com/dataseer/dataseer-ml): identification of sections and sentences introducing datasets in a scientific article, and classification of the type of these datasets
+- [datastet](https://github.com/kermitt2/datastet): identification of sections and sentences introducing datasets in a scientific article, identification of dataset names (implicit and named datasets) and classification of the type of these datasets
 - [grobid-ner](https://github.com/kermitt2/grobid-ner): named entity recognition
 - [grobid-astro](https://github.com/kermitt2/grobid-astro): recognition of astronomical entities in scientific papers
 - [grobid-bio](https://github.com/kermitt2/grobid-bio): a toy bio-entity tagger using BioNLP/NLPBA 2004 dataset
@@ -143,7 +144,7 @@ If you want to cite this work, please refer to the present GitHub project, toget
     title = {GROBID},
     howpublished = {\url{https://github.com/kermitt2/grobid}},
     publisher = {GitHub},
-    year = {2008--2023},
+    year = {2008--2024},
     archivePrefix = {swh},
     eprint = {1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c}
 }

diff --git a/doc/Grobid-service.md b/doc/Grobid-service.md
@@ -138,6 +138,7 @@ Extract the header of the input PDF document, normalize it and convert it into a
 | POST, PUT | `multipart/form-data` | `application/xml`    | `input`             | required      | PDF file to be processed |
 |           |                       |                      | `consolidateHeader` | optional      | consolidateHeader is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the header and inject DOI only), or `3` (consolidate  using only extracted DOI - if extracted) . |
 |           |                       |                      | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result).  |
+|           |                       |                      | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result).  |
 
 Use `Accept: application/x-bibtex` to retrieve BibTeX format instead of TEI (note: the TEI XML format is much richer, it should be preferred if there is no particular reason to use BibTeX).
 
@@ -177,6 +178,7 @@ Convert the complete input document into TEI XML format (header, body and biblio
 |           |                       |                      | `consolidateFunders` | optional         | `consolidateFunders` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). |
 |           |                       |                      | `includeRawCitations`  | optional      | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
 |           |                       |                      | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result).  |
+|           |                       |                      | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result).  |
 |           |                       |                      | `teiCoordinates`       | optional      | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
 |           |                       |                      | `segmentSentences`       | optional      | Paragraphs structures in the resulting TEI will be further segmented into sentence elements <s> |
 |           |                       |                      | `start`       | optional      | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF)  |
@@ -220,6 +222,8 @@ Regarding the bibliographical references, it is possible to include the original
 curl -v --form input=@./thefile.pdf --form includeRawCitations=1 localhost:8070/api/processFulltextDocument
 ```
 
+Raw strings can similarly be included in the result for affiliations and for copyrights/license sections, using `includeRawAffiliations=1` and `includeRawCopyrights=1` respectively.
+
 Example with requested additional sentence segmentation of the paragraph with bounding box coordinates of the sentence structures:
 
 ```console

diff --git a/grobid-core/src/main/java/org/grobid/core/GrobidModels.java b/grobid-core/src/main/java/org/grobid/core/GrobidModels.java
@@ -51,7 +51,9 @@ public enum GrobidModels implements GrobidModel {
     //ACKNOWLEDGEMENT("acknowledgement"),
     FUNDING_ACKNOWLEDGEMENT("funding-acknowledgement"),
     INFRASTRUCTURE("infrastructure"),
-    DUMMY("none");
+    DUMMY("none"),
+    LICENSE("license"),
+    COPYRIGHT("copyright");
 
     //I cannot declare it before
     public static final String DUMMY_FOLDER_LABEL = "none";

diff --git a/grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java b/grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java
@@ -6,6 +6,7 @@
 import org.grobid.core.data.util.AuthorEmailAssigner;
 import org.grobid.core.data.util.ClassicAuthorEmailAssigner;
 import org.grobid.core.data.util.EmailSanitizer;
+import org.grobid.core.data.CopyrightsLicense;
 import org.grobid.core.document.*;
 import org.grobid.core.engines.config.GrobidAnalysisConfig;
 import org.grobid.core.exceptions.GrobidException;
@@ -376,6 +377,9 @@ public String toString() {
     // Availability statement
     private String availabilityStmt = null;
 
+    // Copyrights/license information object
+    CopyrightsLicense copyrightsLicense = null;
+
     public static final List<String> confPrefixes = Arrays.asList("Proceedings of", "proceedings of",
             "In Proceedings of the", "In: Proceeding of", "In Proceedings, ", "In Proceedings of",
             "In Proceeding of", "in Proceeding of", "in Proceeding", "In Proceeding", "Proceedings",
@@ -4477,4 +4481,12 @@ public void setAvailabilityStmt(String availabilityStmt) {
     public List<List<LayoutToken>> getAffiliationAddresslabeledTokens() {
         return affiliationAddresslabeledTokens;
     }
+
+    public void setCopyrightsLicense(CopyrightsLicense copyrightsLicense) {
+        this.copyrightsLicense = copyrightsLicense;
+    }
+
+    public CopyrightsLicense getCopyrightsLicense() {
+        return this.copyrightsLicense;
+    }
 }
diff --git a/grobid-core/src/main/java/org/grobid/core/data/CopyrightsLicense.java b/grobid-core/src/main/java/org/grobid/core/data/CopyrightsLicense.java
@@ -0,0 +1,96 @@
+package org.grobid.core.data;
+
+import org.grobid.core.utilities.TextUtilities;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Arrays;
+
+/**
+ * Class for representing information related to copyrights owner and file license.
+ */
+public class CopyrightsLicense {
+
+    // copyrights owner
+    public enum CopyrightsOwner {
+        PUBLISHER  ("publisher"),
+        AUTHORS    ("authors"),
+        UNDECIDED   ("undecided");
+
+        private String name;
+
+        private CopyrightsOwner(String name) {
+            this.name = name;
+        }
+
+        public String getName() {
+            return name;
+        }
+    };
+
+    public static List<String> copyrightOwners = Arrays.asList("publisher", "authors", "undecided");
+
+    // File-level licenses
+    public enum License {
+        CC0     ("CC-0"),
+        CCBY    ("CC-BY"),
+        CCBYNC  ("CC-BY-NC"),
+        CCBYNCND ("CC-BY-NC-ND"),
+        CCBYSA  ("CC-BY-SA"),
+        CCBYNCSA  ("CC-BY-NC-SA"),
+        CCBYND  ("CC-BY-ND"),
+        COPYRIGHT ("strict-copyrights"),
+        OTHER   ("other"),
+        UNDECIDED   ("undecided");
+
+        private String name;
+
+        private License(String name) {
+            this.name = name;
+        }
+
+        public String getName() {
+            return name;
+        }
+    };
+
+    public static List<String> licenses = 
+        Arrays.asList("CC-0", "CC-BY", "CC-BY-NC", "CC-BY-NC-ND", "CC-BY-SA", "CC-BY-NC-SA", "CC-BY-ND", "copyright", "other", "undecided");
+
+    private CopyrightsOwner copyrightsOwner;
+    private double copyrightsOwnerProb;
+    private License license;
+    private double licenseProb;
+
+    public CopyrightsOwner getCopyrightsOwner() {
+        return this.copyrightsOwner;
+    }
+
+    public void setCopyrightsOwner(CopyrightsOwner owner) {
+        this.copyrightsOwner = owner;
+    }
+
+    public double getCopyrightsOwnerProb() {
+        return this.copyrightsOwnerProb;
+    }
+
+    public void setCopyrightsOwnerProb(double prob) {
+        this.copyrightsOwnerProb = prob;
+    }
+
+    public License getLicense() {
+        return this.license;
+    }
+
+    public void setLicense(License license) {
+        this.license = license;
+    }
+
+    public double getLicenseProb() {
+        return this.licenseProb;
+    }
+
+    public void setLicenseProb(double prob) {
+        this.licenseProb = prob;
+    }
+}
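
The `License` enum above exposes its label through `getName()` but offers no reverse lookup from a label to a constant. A minimal sketch of such a lookup follows, assuming a classifier returns one of the labels as a plain string. The class `LicenseLookup` and the helper `fromName` are hypothetical and not part of this pull request; the enum is a simplified copy of the one in the diff.

```java
import java.util.Arrays;

class LicenseLookup {

    // Simplified copy of CopyrightsLicense.License from the diff above.
    enum License {
        CC0("CC-0"), CCBY("CC-BY"), CCBYNC("CC-BY-NC"), CCBYNCND("CC-BY-NC-ND"),
        CCBYSA("CC-BY-SA"), CCBYNCSA("CC-BY-NC-SA"), CCBYND("CC-BY-ND"),
        COPYRIGHT("strict-copyrights"), OTHER("other"), UNDECIDED("undecided");

        private final String name;

        License(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // Hypothetical helper: maps a label to its enum constant, ignoring case and
    // surrounding whitespace, and falls back to UNDECIDED for null or unknown labels.
    static License fromName(String label) {
        if (label == null) {
            return License.UNDECIDED;
        }
        String cleaned = label.trim();
        return Arrays.stream(License.values())
                .filter(l -> l.getName().equalsIgnoreCase(cleaned))
                .findFirst()
                .orElse(License.UNDECIDED);
    }

    public static void main(String[] args) {
        System.out.println(fromName("CC-BY-NC"));
        System.out.println(fromName("no-such-label"));
    }
}
```

Falling back to `UNDECIDED` rather than throwing keeps the caller's control flow simple, at the cost of hiding unexpected labels; logging the unknown label would be a reasonable addition.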
diff --git a/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java b/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java
@@ -14,6 +14,8 @@
 import nu.xom.Text;
 
 import org.grobid.core.GrobidModels;
+import org.grobid.core.data.CopyrightsLicense.License;
+import org.grobid.core.data.CopyrightsLicense.CopyrightsOwner;
 import org.grobid.core.data.Date;
 import org.grobid.core.data.*;
 import org.grobid.core.document.xml.XmlBuilderUtils;
@@ -248,28 +250,75 @@ public StringBuilder toTEIHeader(BiblioItem biblio,
 
         if ((biblio.getPublisher() != null) ||
                 (biblio.getPublicationDate() != null) ||
-                (biblio.getNormalizedPublicationDate() != null)) {
+                (biblio.getNormalizedPublicationDate() != null) ||
+                biblio.getCopyrightsLicense() != null) {
             tei.append("\t\t\t<publicationStmt>\n");
+
+            CopyrightsLicense copyrightsLicense = biblio.getCopyrightsLicense();
+
             if (biblio.getPublisher() != null) {
                 // publisher and date under <publicationStmt> for better TEI conformance
                 tei.append("\t\t\t\t<publisher>" + TextUtilities.HTMLEncode(biblio.getPublisher()) +
                         "</publisher>\n");
-
-                tei.append("\t\t\t\t<availability status=\"unknown\">");
-                tei.append("<p>Copyright ");
-                //if (biblio.getPublicationDate() != null)
-                tei.append(TextUtilities.HTMLEncode(biblio.getPublisher()) + "</p>\n");
-                tei.append("\t\t\t\t</availability>\n");
             } else {
                 // a dummy publicationStmt is still necessary according to TEI
                 tei.append("\t\t\t\t<publisher/>\n");
-                if (defaultPublicationStatement == null) {
-                    tei.append("\t\t\t\t<availability status=\"unknown\"><licence/></availability>");
+            }
+
+            // We introduce something more meaningful with a TEI customization to encode copyrights information:
+            // - @resp with value "publisher" or "authors" (omitted when the owner is undecided); a comment is
+            //   added to clarify that @resp should be interpreted as the copyrights owner
+            // - the license related to copyrights exceptions is encoded via <licence>
+            // (note: I have no clue what "free" can mean as status for a document - there are always some sort of
+            // restrictions, like moral rights, even for public domain documents)
+            if (copyrightsLicense != null) {
+                tei.append("\t\t\t\t<availability ");
+
+                boolean addCopyrightsComment = false;
+                if (copyrightsLicense.getCopyrightsOwner() != null && copyrightsLicense.getCopyrightsOwner() != CopyrightsOwner.UNDECIDED) {
+                    tei.append("resp=\""+ copyrightsLicense.getCopyrightsOwner().getName() +"\" ");
+                    addCopyrightsComment = true;
+                }
+
+                if (copyrightsLicense.getLicense() != null && copyrightsLicense.getLicense() != License.UNDECIDED) {
+                    tei.append("status=\"restricted\">\n");
+                    if (addCopyrightsComment) {
+                        tei.append("\t\t\t\t\t<!-- the @resp attribute above gives the document copyrights owner (publisher, authors), if known -->\n");
+                    }
+                    tei.append("\t\t\t\t\t<licence>"+copyrightsLicense.getLicense().getName()+"</licence>\n");
                 } else {
-                    tei.append("\t\t\t\t<availability status=\"unknown\"><p>" +
-                            TextUtilities.HTMLEncode(defaultPublicationStatement) + "</p></availability>");
+                    tei.append("status=\"unknown\">\n");
+                    if (addCopyrightsComment) {
+                        tei.append("\t\t\t\t\t<!-- the @resp attribute above gives the document copyrights owner (publisher, authors), if known -->\n");
+                    }
+                    tei.append("\t\t\t\t\t<licence/>\n");
                 }
-                tei.append("\n");
+
+                if (config.getIncludeRawCopyrights() && biblio.getCopyright() != null && biblio.getCopyright().length()>0) {
+                    tei.append("\t\t\t\t\t<p type=\"raw\">");
+                    tei.append(TextUtilities.HTMLEncode(biblio.getCopyright()));
+                    tei.append("</p>\n");
+                }
+
+                tei.append("\t\t\t\t</availability>\n");
+            } else {
+                tei.append("\t\t\t\t<availability ");
+
+                tei.append("status=\"unknown\">\n");
+                tei.append("\t\t\t\t\t<licence/>\n");
+
+                if (defaultPublicationStatement != null) {
+                    tei.append("\t\t\t\t\t<p>" +
+                            TextUtilities.HTMLEncode(defaultPublicationStatement) + "</p>\n");
+                }
+
+                if (config.getIncludeRawCopyrights() && biblio.getCopyright() != null && biblio.getCopyright().length()>0) {
+                    tei.append("\t\t\t\t\t<p type=\"raw\">");
+                    tei.append(TextUtilities.HTMLEncode(biblio.getCopyright()));
+                    tei.append("</p>\n");
+                }
+
+                tei.append("\t\t\t\t</availability>\n");
             }
 
             if (biblio.getNormalizedPublicationDate() != null) {

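The branching in `toTEIHeader` above can be easier to follow as a standalone function. The sketch below is a simplified illustration of the intended `<availability>` shape, not the formatter's actual code: it takes plain strings instead of `CopyrightsLicense`, drops indentation and the explanatory XML comment, and uses a hypothetical `escape` helper as a stand-in for `TextUtilities.HTMLEncode`.

```java
class AvailabilitySketch {

    // Builds an <availability> element from an owner label, a license label and an
    // optional raw copyrights string. "undecided" and null are treated as unknown.
    static String availability(String owner, String license, String raw) {
        StringBuilder tei = new StringBuilder("<availability");
        if (owner != null && !owner.equals("undecided")) {
            // @resp carries the copyrights owner (publisher or authors)
            tei.append(" resp=\"").append(owner).append('"');
        }
        boolean licenseKnown = license != null && !license.equals("undecided");
        if (licenseKnown) {
            tei.append(" status=\"restricted\">");
            tei.append("<licence>").append(license).append("</licence>");
        } else {
            tei.append(" status=\"unknown\">");
            tei.append("<licence/>");
        }
        if (raw != null && !raw.isEmpty()) {
            // raw copyrights/license text, only when requested by the caller
            tei.append("<p type=\"raw\">").append(escape(raw)).append("</p>");
        }
        return tei.append("</availability>").toString();
    }

    // Hypothetical minimal XML escaping, standing in for TextUtilities.HTMLEncode.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(availability("publisher", "CC-BY", null));
        System.out.println(availability(null, "undecided", "(c) 2024 A & B"));
    }
}
```

Opening and closing the raw paragraph in the same function makes it harder for the start and end tags to drift apart.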
diff --git a/grobid-core/src/main/java/org/grobid/core/engines/Engine.java b/grobid-core/src/main/java/org/grobid/core/engines/Engine.java
@@ -350,13 +350,15 @@ public String processHeader(
         String inputFile,
         int consolidate,
         boolean includeRawAffiliations,
+        boolean includeRawCopyrights,
         BiblioItem result
     ) {
         GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder()
             .startPage(0)
             .endPage(2)
             .consolidateHeader(consolidate)
             .includeRawAffiliations(includeRawAffiliations)
+            .includeRawCopyrights(includeRawCopyrights)
             .build();
         return processHeader(inputFile, null, config, result);
     }
@@ -380,12 +382,14 @@ public String processHeaderFunding(
         File inputFile,
         int consolidateHeader,
         int consolidateFunders,
-        boolean includeRawAffiliations
+        boolean includeRawAffiliations,
+        boolean includeRawCopyrights
     ) throws Exception {
         GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder()
             .consolidateHeader(consolidateHeader)
             .consolidateFunders(consolidateFunders)
             .includeRawAffiliations(includeRawAffiliations)
+            .includeRawCopyrights(includeRawCopyrights)
             .build();
         return processHeaderFunding(inputFile, null, config);
     }
@@ -408,13 +412,15 @@ public String processHeader(
         String md5Str,
         int consolidate,
         boolean includeRawAffiliations,
+        boolean includeRawCopyrights,
         BiblioItem result
     ) {
         GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder()
             .startPage(0)
             .endPage(2)
             .consolidateHeader(consolidate)
             .includeRawAffiliations(includeRawAffiliations)
+            .includeRawCopyrights(includeRawCopyrights)
             .build();
         return processHeader(inputFile, md5Str, config, result);
     }
@@ -440,12 +446,14 @@ public String processHeaderFunding(
         String md5Str,
         int consolidateHeader,
         int consolidateFunders,
-        boolean includeRawAffiliations
+        boolean includeRawAffiliations,
+        boolean includeRawCopyrights
     ) throws Exception {
         GrobidAnalysisConfig config = new GrobidAnalysisConfig.GrobidAnalysisConfigBuilder()
             .consolidateHeader(consolidateHeader)
             .consolidateFunders(consolidateFunders)
             .includeRawAffiliations(includeRawAffiliations)
+            .includeRawCopyrights(includeRawCopyrights)
             .build();
         return processHeaderFunding(inputFile, md5Str, config);
     }
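
On the client side, the `includeRawCopyrights` flag threaded through these `Engine` methods corresponds to the form parameter documented in `doc/Grobid-service.md`. The sketch below shows one way a Java client might build the equivalent of the documented `curl` call with `java.net.http` (Java 11+). The class name `RawCopyrightsRequest`, the boundary handling and the file name are assumptions made for illustration; only the endpoint `localhost:8070/api/processFulltextDocument` and the parameter names `input` and `includeRawCopyrights` come from the documentation above.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Map;

class RawCopyrightsRequest {

    private static void write(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        out.write(b, 0, b.length);
    }

    // Encodes simple text fields plus one PDF part named "input" as multipart/form-data.
    static byte[] multipartBody(String boundary, Map<String, String> fields,
                                String fileName, byte[] pdf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            write(out, "--" + boundary + "\r\n");
            write(out, "Content-Disposition: form-data; name=\"" + field.getKey() + "\"\r\n\r\n");
            write(out, field.getValue() + "\r\n");
        }
        write(out, "--" + boundary + "\r\n");
        write(out, "Content-Disposition: form-data; name=\"input\"; filename=\"" + fileName + "\"\r\n");
        write(out, "Content-Type: application/pdf\r\n\r\n");
        out.write(pdf, 0, pdf.length);
        write(out, "\r\n--" + boundary + "--\r\n");
        return out.toByteArray();
    }

    // Builds the POST request; sending it requires a running GROBID service.
    static HttpRequest request(String boundary, byte[] body) {
        return HttpRequest.newBuilder(URI.create("http://localhost:8070/api/processFulltextDocument"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }
}
```

A caller would pass `Map.of("includeRawCopyrights", "1")` as the fields and send the request with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The boundary should be a string that cannot occur in the PDF bytes, for instance a random UUID.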