HTTP worker refactoring #221

Merged
merged 125 commits into from
Nov 14, 2023


nck-mlcnv
Contributor

@nck-mlcnv nck-mlcnv commented Oct 6, 2023

SPARQLProtocolWorker is a draft for a better, more reliable worker that is tailored towards the SPARQL Protocol. Each worker uses a single HttpClient and handles work completion conditions itself. It also covers sending and receiving HTTP request and response bodies that exceed 2 GB.

This PR gives an idea of what the internals of such a worker could look like. It doesn't provide a full implementation, and the code is not yet used within IGUANA.
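To illustrate why bodies above 2 GB need special handling: a single Java `byte[]` cannot hold more than `Integer.MAX_VALUE` elements, so a buffer for larger bodies has to spread its data over several arrays and track the total size as a `long`. The following is only a minimal sketch of that idea. `ChunkedByteBuffer` and its methods are hypothetical names, and this is not the code of the BigByteArray streams in this PR.

```java
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch of a growable buffer that is not limited to one byte[]. */
final class ChunkedByteBuffer extends OutputStream {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private long size = 0; // long, so the total may exceed Integer.MAX_VALUE

    ChunkedByteBuffer(int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.chunkSize = chunkSize;
    }

    @Override
    public void write(int b) {
        int offset = (int) (size % chunkSize);
        if (offset == 0) chunks.add(new byte[chunkSize]); // no chunk yet, or the last one is full
        chunks.get(chunks.size() - 1)[offset] = (byte) b;
        size++;
    }

    @Override
    public void write(byte[] src, int off, int len) {
        Objects.checkFromIndexSize(off, len, src.length);
        while (len > 0) {
            int offset = (int) (size % chunkSize);
            if (offset == 0) chunks.add(new byte[chunkSize]);
            int n = Math.min(len, chunkSize - offset); // copy at most up to the chunk boundary
            System.arraycopy(src, off, chunks.get(chunks.size() - 1), offset, n);
            size += n;
            off += n;
            len -= n;
        }
    }

    /** Total number of bytes written so far. */
    long size() {
        return size;
    }

    /** Random access by a long index: chunk number first, then offset inside the chunk. */
    byte get(long index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException("index " + index);
        return chunks.get((int) (index / chunkSize))[(int) (index % chunkSize)];
    }
}
```

In practice the chunk size would be large (for example close to the maximum array length); a small value is only useful for testing the boundary logic.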

TODOs:

  • Complete TODOs in SPARQLProtocolWorker
  • Make SPARQLProtocolWorker configurable (and remove all other workers for now)
  • Implement LanguageProcessors for xml, csv and tsv SPARQL results -> move to separate issue
  • Port LanguageProcessor for RDF results -> move to separate issue
  • Adjust Stresstest to start SPARQLProtocolWorkers
  • Adjust Stresstest to start a ResponseBodyProcessor per unique (QuerySource,LanguageProcessor)
  • write/update tests for all classes touched
  • document adjusted behavior
  • Ask @MichaelRoeder for a review of BigByteArray{Input,Output}Stream after tests are implemented for them
  • Use LSQ to decide if a query is an update query; if yes, use the update endpoint -> move to separate issue

Future improvements:

  • Allow selecting the language processors for RDF and SPARQL-RESULT for each worker. The SPARQLProtocolWorker does content type negotiation based on info from LSQ about the query (basically: if it is a CONSTRUCT query, expect RDF)
  • Make ResponseBodyProcessor configurable, including timeout for processing after querying is done and number of threads
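
The content type negotiation mentioned in the first item could look roughly like the sketch below. It assumes the query type is already known (the PR wants to obtain it from LSQ, which is not shown here). `AcceptHeaderSketch`, `QueryType` and `acceptHeaderFor` are hypothetical names, and the chosen media types are just one plausible selection.

```java
/** Hypothetical sketch: choose an Accept header from the kind of SPARQL query. */
final class AcceptHeaderSketch {
    enum QueryType { SELECT, ASK, CONSTRUCT, DESCRIBE }

    static String acceptHeaderFor(QueryType type) {
        return switch (type) {
            // graph-producing queries return RDF
            case CONSTRUCT, DESCRIBE -> "application/n-triples";
            // SELECT and ASK return a SPARQL result set or boolean
            case SELECT, ASK -> "application/sparql-results+json";
        };
    }
}
```

The media type returned here would also be the natural key for picking the matching language processor for the response.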

bigerl and others added 30 commits June 30, 2023 16:39
…at is tailored towards SPARQL Protocol. Each worker uses a single HttpClient and handles work completion conditions itself.
Refactored SPARQLProtocolWorker to record workerId and execution stats for each worker. WorkerId was added to uniquely identify each worker. An ExecutionStats inner class was created to track start time, duration, HTTP status code, content length, number of bindings, and number of solutions for each worker's task.
This commit changes the query building mechanism within SPARQLProtocolWorker.java, shifting from StringBuilder to InputStream, to support processing of large queries and to reduce the overhead of using String for queryID. It now reads queries directly from QueryHandler's data stream, with modifications to a number of HTTP request methods to accommodate this change. The refactor also adds a new method to QueryHandler that returns a 'QueryHandle' record, a container for the index and InputStream of a query.
Introduced InputStream support in the QueryList and QuerySource to handle large queries more efficiently. Changes have been made to IndexedQueryReader, QuerySource, QueryHandler, and several other classes to accommodate the new streaming feature. Previously, all queries were loaded into memory, which could cause an OutOfMemoryError for large queries. Whether queries are actually streamed to the client still depends on the SPARQL worker used.
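A rough sketch of the 'QueryHandle' idea from the two commit messages above: the handler hands out an index together with an InputStream instead of a String. Only the record's two components come from the commit message. `InMemoryQuerySource`, `nextQueryStream` and `readFully` are hypothetical, and the in-memory source is just a stand-in so the example is self-contained; a file-backed source would return a stream over the query's region of the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class QueryHandleSketch {
    /** Container for the query's index and a stream over its text. */
    record QueryHandle(int index, InputStream queryInputStream) {}

    /** Hypothetical stand-in for a query source; it cycles through its queries. */
    static final class InMemoryQuerySource {
        private final List<String> queries;
        private int next = 0;

        InMemoryQuerySource(List<String> queries) {
            if (queries.isEmpty()) throw new IllegalArgumentException("no queries");
            this.queries = List.copyOf(queries);
        }

        QueryHandle nextQueryStream() {
            int index = next;
            next = (next + 1) % queries.size();
            byte[] bytes = queries.get(index).getBytes(StandardCharsets.UTF_8);
            return new QueryHandle(index, new ByteArrayInputStream(bytes));
        }
    }

    /** Convenience for small queries only; a worker would pass the stream on instead. */
    static String readFully(QueryHandle handle) {
        try (InputStream in = handle.queryInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing the integer index around, rather than a String ID, is what avoids the per-request string overhead mentioned in the commit message.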
…dled responses to avoid repeated processing. It uses a concurrent hash map to store the responses identified by unique keys. This approach aims to improve the efficiency of handling response bodies in multi-threaded scenarios.
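One way to picture the approach in the commit message above is a `ConcurrentHashMap` whose `computeIfAbsent` runs the expensive processing at most once per key, even when several threads see the same response. This is a sketch under assumptions: `ResponseDedupSketch`, the `Key` components and `processOnce` are hypothetical and not taken from the PR.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: process each distinct response only once. */
final class ResponseDedupSketch<R> {
    /** Assumed key: which query was answered plus a fingerprint of the body. */
    record Key(int queryId, long contentLength, long bodyHash) {}

    private final ConcurrentHashMap<Key, R> seen = new ConcurrentHashMap<>();
    private final AtomicInteger processed = new AtomicInteger();

    /**
     * Returns the stored result for the key, running the processor only if the
     * key is new. The processor must return a non-null result, otherwise
     * nothing is stored and it will run again for the same key.
     */
    R processOnce(Key key, Function<Key, R> processor) {
        return seen.computeIfAbsent(key, k -> {
            processed.incrementAndGet();
            return processor.apply(k);
        });
    }

    /** Number of times the processor actually ran. */
    int processedCount() {
        return processed.get();
    }
}
```

Because `computeIfAbsent` blocks other updates to the same bin while the function runs, long-running processing may be better stored as a future; the sketch keeps the simpler form.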
…ayOutputStream and complete rewrite of BigByteArrayInputStream. This should increase the performance of both streams significantly.
Implemented the AbstractLanguageProcessor interface to process InputStreams. A new SAX Parser (SaxSparqlJsonResultCountingParser) was introduced for SPARQL JSON results, returning solutions, bound values, and variables.
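The counting logic of such an event-driven parser can be sketched independently of any JSON library. The callbacks below are hypothetical (they are not the API of SaxSparqlJsonResultCountingParser or of a specific parser library); a streaming JSON parser is assumed to call them in document order. The sketch counts every object inside `results.bindings` as one solution and every key of such an object as one bound value, and collects the strings in `head.vars` as variables.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event handler that counts SPARQL JSON results without building a tree. */
final class SparqlJsonCounter {
    private int objectDepth = 0; // root = 1, "head"/"results" = 2, solution = 3
    private boolean inVars = false;
    private boolean inBindings = false;
    private String lastKey = null;
    private long solutions = 0;
    private long boundValues = 0;
    private final List<String> variables = new ArrayList<>();

    void startObject() {
        objectDepth++;
        if (inBindings && objectDepth == 3) solutions++; // a new solution object
    }

    void endObject() {
        objectDepth--;
    }

    void key(String name) {
        lastKey = name;
        if (inBindings && objectDepth == 3) boundValues++; // one bound variable
    }

    void startArray() {
        if (objectDepth == 2 && "vars".equals(lastKey)) inVars = true;
        else if (objectDepth == 2 && "bindings".equals(lastKey)) inBindings = true;
    }

    void endArray() {
        if (inVars) inVars = false;
        else if (inBindings && objectDepth == 2) inBindings = false;
    }

    void string(String value) {
        if (inVars) variables.add(value);
    }

    long solutions() { return solutions; }
    long boundValues() { return boundValues; }
    List<String> variables() { return List.copyOf(variables); }
}
```

Since only counters and a small amount of state are kept, memory use stays constant regardless of the size of the result.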
* delegated executeQuery method
* reuse bbaos if not consumed
* removed assert for non-differing content-length header value and actual content length
* better logging for malformed url
# Conflicts:
#	pom.xml
#	src/main/java/org/aksw/iguana/cc/config/IguanaConfig.java
#	src/main/java/org/aksw/iguana/cc/config/elements/ConnectionConfig.java
#	src/main/java/org/aksw/iguana/cc/config/elements/DatasetConfig.java
#	src/main/java/org/aksw/iguana/cc/config/elements/MetricConfig.java
#	src/main/java/org/aksw/iguana/cc/config/elements/StorageConfig.java
#	src/main/java/org/aksw/iguana/cc/config/elements/TaskConfig.java
#	src/main/java/org/aksw/iguana/cc/controller/TaskController.java
#	src/main/java/org/aksw/iguana/cc/lang/AbstractLanguageProcessor.java
#	src/main/java/org/aksw/iguana/cc/lang/QueryWrapper.java
#	src/main/java/org/aksw/iguana/cc/lang/impl/RDFLanguageProcessor.java
#	src/main/java/org/aksw/iguana/cc/lang/impl/SPARQLLanguageProcessor.java
#	src/main/java/org/aksw/iguana/cc/model/QueryExecutionStats.java
#	src/main/java/org/aksw/iguana/cc/query/handler/QueryHandler.java
#	src/main/java/org/aksw/iguana/cc/tasks/AbstractTask.java
#	src/main/java/org/aksw/iguana/cc/tasks/Task.java
#	src/main/java/org/aksw/iguana/cc/tasks/TaskManager.java
#	src/main/java/org/aksw/iguana/cc/tasks/stresstest/Stresstest.java
#	src/main/java/org/aksw/iguana/cc/tasks/stresstest/storage/impl/NTFileStorage.java
#	src/main/java/org/aksw/iguana/cc/tasks/stresstest/storage/impl/RDFFileStorage.java
#	src/main/java/org/aksw/iguana/cc/tasks/stresstest/storage/impl/TriplestoreStorage.java
#	src/main/java/org/aksw/iguana/cc/worker/AbstractWorker.java
#	src/main/java/org/aksw/iguana/cc/worker/Worker.java
#	src/main/java/org/aksw/iguana/cc/worker/impl/CLIInputFileWorker.java
#	src/main/java/org/aksw/iguana/cc/worker/impl/CLIInputPrefixWorker.java
#	src/main/java/org/aksw/iguana/cc/worker/impl/CLIInputWorker.java
#	src/main/java/org/aksw/iguana/cc/worker/impl/CLIWorker.java
#	src/main/java/org/aksw/iguana/cc/worker/impl/HttpGetWorker.java
#	src/main/java/org/aksw/iguana/cc/worker/impl/HttpPostWorker.java
#	src/main/java/org/aksw/iguana/cc/worker/impl/HttpWorker.java
#	src/main/java/org/aksw/iguana/cc/worker/impl/MultipleCLIInputWorker.java
#	src/main/java/org/aksw/iguana/cc/worker/impl/UPDATEWorker.java
#	src/main/java/org/aksw/iguana/rp/storage/TripleBasedStorage.java
#	src/test/java/org/aksw/iguana/cc/config/WorkflowTest.java
#	src/test/java/org/aksw/iguana/cc/lang/SPARQLLanguageProcessorTest.java
#	src/test/java/org/aksw/iguana/cc/tasks/storage/impl/NTFileStorageTest.java
#	src/test/java/org/aksw/iguana/cc/tasks/storage/impl/TriplestoreStorageTest.java
#	src/test/java/org/aksw/iguana/cc/tasks/stresstest/StresstestTest.java
#	src/test/java/org/aksw/iguana/cc/worker/HTTPWorkerTest.java
#	src/test/java/org/aksw/iguana/cc/worker/MockupWorker.java
#	src/test/java/org/aksw/iguana/cc/worker/UPDATEWorkerTest.java
#	src/test/java/org/aksw/iguana/cc/worker/impl/CLIWorkersTests.java
#	src/test/java/org/aksw/iguana/cc/worker/impl/HttpPostWorkerTest.java
#	src/test/resources/controller_test.properties
* this commit also moved some packages
* also updates CSVStorage
* adds Storable interface
* also delegate the deserializer class for the QueryHandler to the QueryHandler itself
Adjusted the test as well and integrated it into the StresstestResultProcessor and Storages.
The implemented method searches for every class that has been annotated with ContentType and maps the annotation's value to the class. This is done with the Spring Framework.
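The mapping step described above can be sketched with plain reflection. The commit uses the Spring Framework to discover the annotated classes; that classpath scan is not reproduced here, so the candidate classes are passed in explicitly. The annotation below is a local, hypothetical stand-in for IGUANA's ContentType, and the processor classes are placeholders.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ContentTypeRegistrySketch {
    /** Local stand-in annotation; must be retained at runtime to be visible to reflection. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface ContentType {
        String value();
    }

    @ContentType("application/sparql-results+json")
    static final class JsonProcessor {}

    @ContentType("application/sparql-results+xml")
    static final class XmlProcessor {}

    static final class Unannotated {}

    /** Maps each annotation value to the class carrying it; unannotated classes are skipped. */
    static Map<String, Class<?>> index(List<Class<?>> candidates) {
        Map<String, Class<?>> byContentType = new HashMap<>();
        for (Class<?> candidate : candidates) {
            ContentType annotation = candidate.getAnnotation(ContentType.class);
            if (annotation != null) {
                byContentType.put(annotation.value(), candidate);
            }
        }
        return byContentType;
    }
}
```

With such a map, the processor for a response can be looked up directly from the response's content type.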
@nck-mlcnv
Contributor Author

I accidentally enabled the BigByteArrayStream tests, and it looks like they cause some issues on the GitHub system. 🤔

@nck-mlcnv nck-mlcnv requested a review from bigerl November 6, 2023 18:06
@nck-mlcnv nck-mlcnv linked an issue Nov 6, 2023 that may be closed by this pull request
@nck-mlcnv nck-mlcnv added this to the 4.0 milestone Nov 6, 2023
@nck-mlcnv nck-mlcnv linked an issue Nov 6, 2023 that may be closed by this pull request
@nck-mlcnv nck-mlcnv added the "breaking change" label (changes that cause changes in the config file, output format, ontology or command-line interface) Nov 6, 2023
Member

@bigerl bigerl left a comment

Some minor things.

@nck-mlcnv nck-mlcnv merged commit 82dd89e into develop Nov 14, 2023
@nck-mlcnv nck-mlcnv deleted the proposal/http_worker_refactoring branch July 30, 2024 13:06