update documentation and configuration
lfoppiano committed Oct 23, 2024
1 parent 540dbd8 commit bbfa442
Showing 2 changed files with 54 additions and 14 deletions.
28 changes: 14 additions & 14 deletions doc/Grobid-service.md
@@ -178,21 +178,21 @@ curl -v -H "Accept: application/x-bibtex" --form input=@./thefile.pdf localhost:

Convert the complete input document into TEI XML format (header, body and bibliographical section).

| method | request type | response type | parameters | requirement | description |
|--- |--- |--- |--------------------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed |
| | | | `consolidateHeader` | optional | `consolidateHeader` is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the citation and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted). |
| | | | `consolidateCitations` | optional | `consolidateCitations` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only). |
| | | | `consolidatFunders` | optional | `consolidateFunders` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). |
| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |
| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
| | | | `segmentSentences` | optional | Paragraphs structures in the resulting TEI will be further segmented into sentence elements <s> |
| method | request type | response type | parameters | requirement | description |
|--- |--- |--- |--------------------------|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed |
| | | | `consolidateHeader` | optional | `consolidateHeader` is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the citation and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted). |
| | | | `consolidateCitations` | optional | `consolidateCitations` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only). |
| | | | `consolidatFunders` | optional | `consolidateFunders` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). |
| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |
| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
| | | | `segmentSentences` | optional | Paragraphs structures in the resulting TEI will be further segmented into sentence elements <s> |
| | | | `generateIDs` | optional | if supplied as a string equal to `1`, it generates uniqe identifiers for each text component |
| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF) |
| | | | `end` | optional | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF) |
| | | | `flavor` | optional | Indicate which flavor to apply for structuring the document. Useful when the default structuring cannot be applied to a specific document (e.g. the body is empty). Possible values are: `light` passes the document into "light" models which recognise only title, authors and body, `blank` return the raw extracted text from pdfalto in the body. |
| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF) |
| | | | `end` | optional | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF) |
| | | | `flavor` | optional | Indicate which flavor to apply for structuring the document. Useful when the default structuring cannot be applied to a specific document (e.g. the body is empty). Values and models are indicated below. |
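
The optional parameters in this table are all sent as string form fields, so their allowed values can be checked before a request is made. The following is a minimal Python sketch of such a check. The helper `build_fulltext_form` and its keyword layout are hypothetical and not part of GROBID; only the parameter names, value ranges and `-1` page defaults come from the table. `teiCoordinates`, `segmentSentences` and `generateIDs` are left out for brevity.

```python
from typing import Dict, Optional

# Allowed values, taken from the parameter table. The helper below is a
# hypothetical illustration and is not part of GROBID or any client library.
CONSOLIDATION_LEVELS = {
    "consolidateHeader": {0, 1, 2, 3},
    "consolidateCitations": {0, 1, 2},
    "consolidateFunders": {0, 1, 2},
}
BOOLEAN_FLAGS = {"includeRawCitations", "includeRawAffiliations", "includeRawCopyrights"}


def build_fulltext_form(start: int = -1, end: int = -1,
                        flavor: Optional[str] = None, **options) -> Dict[str, str]:
    """Validate optional parameters and return them as string form fields."""
    form: Dict[str, str] = {}
    for name, value in options.items():
        if name in CONSOLIDATION_LEVELS:
            if value not in CONSOLIDATION_LEVELS[name]:
                raise ValueError(f"{name} does not accept {value!r}")
            form[name] = str(int(value))
        elif name in BOOLEAN_FLAGS:
            form[name] = "1" if value else "0"
        else:
            raise ValueError(f"parameter not covered by this sketch: {name}")
    # -1 is the documented default for both page bounds; pages count from 1.
    for name, page in (("start", start), ("end", end)):
        if page == -1:
            continue  # omit the field so the service applies its default
        if page < 1:
            raise ValueError(f"{name} must be -1 or a page number >= 1")
        form[name] = str(page)
    if start != -1 and end != -1 and end < start:
        raise ValueError("end must not precede start")
    if flavor is not None:
        form["flavor"] = flavor  # passed through unchecked in this sketch
    return form


form = build_fulltext_form(consolidateHeader=2, includeRawCitations=True, start=2, end=5)
print(form)
```

The resulting dictionary would be sent together with the required `input` file part in a `multipart/form-data` request, using whichever HTTP client is at hand.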

Response status codes:

40 changes: 40 additions & 0 deletions grobid-home/config/grobid-full.yaml
@@ -103,6 +103,22 @@ grobid:
max_sequence_length: 3000
batch_size: 10

+    - name: "segmentation-article-light"
+      engine: "wapiti"
+      wapiti:
+        # wapiti training parameters, they will be used at training time only
+        epsilon: 0.0000001
+        window: 50
+        nbMaxIterations: 2000
+
+    - name: "segmentation-article-light-ref"
+      engine: "wapiti"
+      wapiti:
+        # wapiti training parameters, they will be used at training time only
+        epsilon: 0.0000001
+        window: 50
+        nbMaxIterations: 2000
+
     - name: "fulltext"
       # at this time, must always be CRF wapiti, the input sequence size is too large for a Deep Learning implementation
       engine: "wapiti"
@@ -137,6 +153,30 @@ grobid:
max_sequence_length: 3000
batch_size: 9

+    - name: "header-article-light"
+      # engine: "wapiti"
+      engine: "delft"
+      wapiti:
+        # wapiti training parameters, they will be used at training time only
+        epsilon: 0.000001
+        window: 30
+        nbMaxIterations: 1500
+      delft:
+        architecture: "BidLSTM_ChainCRF_FEATURES"
+        useELMo: false
+
+    - name: "header-article-light-ref"
+      # engine: "wapiti"
+      engine: "delft"
+      wapiti:
+        # wapiti training parameters, they will be used at training time only
+        epsilon: 0.000001
+        window: 30
+        nbMaxIterations: 1500
+      delft:
+        architecture: "BidLSTM_ChainCRF_FEATURES"
+        useELMo: false
+
     - name: "reference-segmenter"
       #engine: "wapiti"
       engine: "delft"
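
The added `header-article-light` entries carry both a `wapiti` and a `delft` block, with `engine` naming the one in use and the other engine left as a commented-out alternative. The Python sketch below illustrates that reading of the layout. It assumes the entry has already been parsed into a dictionary (copied by hand here), and the helper names `engine_settings` and `with_engine` are hypothetical, not GROBID code.

```python
from typing import Any, Dict

# Hand-copied mirror of the added "header-article-light" entry; in practice
# this dictionary would come from parsing the YAML file.
HEADER_ARTICLE_LIGHT: Dict[str, Any] = {
    "name": "header-article-light",
    "engine": "delft",
    "wapiti": {"epsilon": 0.000001, "window": 30, "nbMaxIterations": 1500},
    "delft": {"architecture": "BidLSTM_ChainCRF_FEATURES", "useELMo": False},
}

KNOWN_ENGINES = ("wapiti", "delft")


def engine_settings(model: Dict[str, Any]) -> Dict[str, Any]:
    """Return the settings block that matches the model's selected engine."""
    engine = model["engine"]
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    return model[engine]


def with_engine(model: Dict[str, Any], engine: str) -> Dict[str, Any]:
    """Return a copy of the entry with another engine selected.

    Mirrors swapping the commented and uncommented `engine:` lines.
    """
    if engine not in KNOWN_ENGINES or engine not in model:
        raise ValueError(f"no settings block for engine: {engine!r}")
    return {**model, "engine": engine}


print(engine_settings(HEADER_ARTICLE_LIGHT))
print(engine_settings(with_engine(HEADER_ARTICLE_LIGHT, "wapiti")))
```

Keeping both blocks in each entry means switching engines is a one-line change. Per the config comments, the `wapiti` values shown are training parameters only.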