This page contains a series of answers to some anticipated questions about the project, organized by topic.
It is not clear at this time whether this project will be continued as is. Even if it is not, many of the technologies used and lessons learned may be applied to other projects.
This was simply for convenience. One advantage of microservices is that each microservice is developed on its own lifecycle and timeline, which means they often will live in different code repositories. This is often easier in larger organizations, since different groups can own their own microservices. But for this prototype, separating the microservices would have just made things more complicated than necessary.
This is not the first project to make use of patches (or even something similar, like commands) to capture the changes that a client desires to make. It is not really even the first data service for mobile apps to use patches. But this may be one of the first to make it the primary means of communicating changes to the data service.
Debezium's patches were modeled after RFC 6902, "JavaScript Object Notation (JSON) Patch", and so it makes perfect sense to represent them using the JSON Patch format. Debezium's patches add an `INCREMENT` action to the `ADD`, `REMOVE`, `REPLACE`, `MOVE`, `COPY`, and `REQUIRE` (or `test`) actions already included in JSON Patch. It is not clear whether `INCREMENT` is useful, and it may actually be harmful since any patch that includes `INCREMENT` is not idempotent. Consequently, `INCREMENT` may be removed in the future.
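For illustration, a patch expressed in this style might look like the following. This is only a sketch: the `replace` and `test` operations come straight from RFC 6902, while the exact representation of the `increment` action is an assumption rather than Debezium's actual wire format:

```json
[
  { "op": "replace", "path": "/address/home/city", "value": "Raleigh" },
  { "op": "test", "path": "/status", "value": "active" },
  { "op": "increment", "path": "/loginCount", "value": 1 }
]
```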
At this point, the API for Debezium's driver is fully asynchronous: a client submits a request along with a handler function to be called when the request has been fully processed, and the submission returns as soon as Debezium receives and records the request. The immediate response lets the client know whether a failure occurred before Debezium could record the request and whether the request should be resubmitted. However, once Debezium does record the request and acknowledges that in the response, Debezium will always fully process the request.
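A minimal sketch of this interaction pattern is shown below; the type and method names are hypothetical and do not reflect the actual driver API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical shape of an asynchronous submission: the returned future completes
// as soon as the request has been received and recorded, while the handler is
// invoked later, once the request has been fully processed.
interface AsyncDataService {
    CompletableFuture<Acknowledgement> submit(String patchJson, Consumer<Outcome> onCompleted);
}

record Acknowledgement(boolean recorded, String failureReason) {}
record Outcome(boolean succeeded, String updatedEntityJson) {}

class Usage {
    static void example(AsyncDataService service) {
        service.submit("[{\"op\":\"replace\",\"path\":\"/name\",\"value\":\"Sally\"}]",
                       outcome -> System.out.println("fully processed: " + outcome.succeeded()))
               .thenAccept(ack -> {
                   // If the request was never recorded, the client may safely resubmit it;
                   // once recorded, Debezium will always fully process it.
                   if (!ack.recorded()) System.out.println("resubmit: " + ack.failureReason());
               });
    }
}
```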
No, the Debezium driver is intended to be embedded into Java server processes that expose a network API to mobile apps. The messages that make up Debezium's API were designed to be naturally represented in JSON, making it even easier to wrap with a server process. For example, it might be used within a LiveOak subsystem, where LiveOak then exposes a public API with JSON messages over HTTP.
No, a stream-oriented architecture is not required. In fact, Debezium's API could easily be implemented by a service that simply stores information in a shared database such as MongoDB, Cassandra, or even a relational database. The stream-oriented architecture was explored in this prototype because it offers a lot of interesting features that are not naturally enabled with traditional backends: the ability of services to replay their inputs, the potentially very-high volumes, natural partitioning, ease of separating logic into decoupled microservices, the ability of these microservices to use their own separate storage systems to persist the information they needed, etc.
No, Kafka is very different from traditional message brokers like ActiveMQ, RabbitMQ, and ZeroMQ:
- With traditional message brokers, it is the broker that is responsible for sending messages to the consumers. With Kafka, each consumer is responsible for consuming messages and tracking which messages it has processed. This makes it possible for Kafka consumers to actually back up and replay older messages (see the sketch after this list).
- With traditional brokers, the messages are passed directly between processes, so if processes go down the broker has to coordinate and ensure the messages are eventually delivered. With Kafka, messages are transferred only via a persistent log, so that there is actually no direct communication between processes. This makes Kafka far more robust, since all messages are persisted (and replicated). Additionally, this makes Kafka producers independent of downstream consumers. For example, if a consumer goes down, the upstream Kafka producers can continue writing to the log without a problem, while consumers that are downstream of the unavailable consumer/producer will simply see a pause in the messages written to the log(s) they are consuming. When the service is restored, it will continue processing where it left off.
- Kafka is insanely fast. With a properly configured Kafka cluster it is quite possible to write 2 million messages per second, with throughput often limited by the NIC. Other Kafka setups have been shown to write as many as 11 billion messages per day (at about 80 bytes per message). Reading messages is also very quick, especially in the common case where consumers are reading messages that were just written; in that case, the written messages are still in the page cache and can be read with no additional disk operations.
- Kafka persistent logs are append-only and can be kept for any duration of time (as long as space is available). The ability for consumers to rewind and re-read "history" means that new services can be brought up and evaluated alongside the older services; only when the new services show themselves to be more beneficial do the old services need to be retired.
- Kafka logs are partitioned and each partition is replicated to a configurable number of replicas. They can even be replicated to other data centers, providing even greater levels of fault tolerance.
- All the messages in each partition are totally ordered: they will always be consumed in the same order. This feature makes it possible for Debezium to not use transactions, yet ensure that all updates to a single entity will always be totally ordered based upon the time the requests are recorded.
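As referenced in the list above, the following is a minimal sketch of the rewind-and-replay capability using the standard Kafka consumer API; the topic and group names are placeholders and this illustrates the general Kafka API rather than the prototype's actual code:

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-example");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("entity-updates"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Rewind to the start of each assigned partition to replay the full history.
                    consumer.seekToBeginning(partitions);
                }
            });

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Messages within a partition are delivered in the order they were written.
                    System.out.printf("offset %d: %s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```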
Yes and no. There are multiple open source projects that enable stream-oriented processing (including Apache Storm and Apache Spark Streaming), and there are even some proprietary products (including IBM InfoSphere Streams, TIBCO StreamBase, and Software AG's Apama). However, Kafka's approach is relatively distinct and very low-level: its primary purpose is to provide persistent logs that are partitioned and replicated across multiple machines, while Kafka producers are simply processes that write to those logs and Kafka consumers are simply processes that consume from those logs. Apache Storm and Apache Spark can actually use Kafka, but their focus is to make it easy to create, execute, and monitor multiple stream-oriented workflows. Because Debezium's workflow is relatively fixed (given a set of requirements and features), this flexibility offered by Storm and Spark may not be of tremendous benefit.
Yes. The primary master data is stored in the Kafka logs, but any Debezium service can consume messages from any of the available streams and update data in a separate database. In fact, it is likely that any services developed to support queries would want to reuse existing technology (such as ElasticSearch, Cassandra, or even a relational database) rather than implement the index and query capabilities from scratch. But if Debezium services were to use external data sources, those data sources should be considered as caches and not the primary master data.
All of Debezium's messages are written to Kafka logs in a simple JSON form. This kept things simple and allowed the prototype to easily change without having to define a formal message structure or schema. However, it is likely that a formal message schema technology (e.g., Avro) would be very beneficial and result in logs that are more compact.
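As a rough sketch of what that might look like, an Avro schema for one of these messages could be defined along the following lines; the record and field names are purely illustrative, not Debezium's actual message structure:

```json
{
  "type": "record",
  "name": "PatchRequest",
  "namespace": "io.example.messages",
  "fields": [
    { "name": "entityId", "type": "string" },
    { "name": "clientId", "type": "string" },
    { "name": "operations", "type": { "type": "array", "items": "string" } }
  ]
}
```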
Debezium uses a hierarchical structure of containers: databases, collections, zones, and entities. Collections are identified by the type of entity they store. But in reality, databases, collections (or rather the entity type), and zones are simply portions of the entity ID, which has the following structure:
```
/{databaseId}/{entityType}/{zoneId}/{entityId}
```
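For example, a hypothetical contact entity in the default zone might have an ID such as:

```
/contacts-db/contact/default/a1b2c3d4
```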
A collection is intended to store entities that have a similar structure, but even within a collection it is often useful to define subsets of entities. Debezium's zones serve this purpose. By default, all entities will go into the "default" zone, but clients can specify a zone at the time an entity is created. In reality, zones are simply part of the entity identifier, so adding a zone does not require explicitly creating an object or piece of data. Zones also make it possible to partition streams based upon zones (rather than just collections or entity IDs), making it easier to implement services such as the zone watch service that records events and sends notifications based upon activities within zones.
Not at this time, but it'd be fairly straightforward to add support for references within the entity schema, and to add a service that maintains a cache of all reverse references and provides the ability to find all entities that have a reference to a given entity. There even could be a service that removes references in entities when the referenced entities are removed.
Public, private, and even hybrid collections can be implemented on top of Debezium by the server/subsystem that provides the security infrastructure, combined with a predefined use of zones. See this explanation for details.
Debezium's schema learning service attempts to automatically figure out a schema that describes all fields within the entities loaded into a collection. Each field is defined by a name and the path within the document (e.g., the `address/home` path represents the `home` field within the `address` nested document), and information about the field includes the type(s) of values allowed in the field and whether the field is optional or required.
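For instance (using hypothetical entity contents), loading entities like the following would lead the service to record a field at the path `address/home` whose values are strings, and to mark it as optional if some entities in the collection omit it:

```json
{
  "firstName": "Sally",
  "address": {
    "home": "100 Main Street, Raleigh, NC",
    "work": "200 Oak Avenue, Durham, NC"
  }
}
```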
Yes, it appears that the schema information that Debezium can automatically learn from entities includes all of the essential information defined in the schemas of Avro, Protocol Buffers, and Thrift. Therefore, it is possible that any of these technologies could be used within Debezium to represent the schemas. Avro may be a little easier, since it has a native JSON format; both Protocol Buffers and Thrift have their own format.
No, Debezium does not require schemas to be used. One of the goals of Debezium was to understand the practicality of automatically learning the schema of a set of entities as they are loaded into a new database. Otherwise, schemas are not currently used.
One benefit of schemas is to use them to maintain a set of indexes for a collection, and to use queries to help the developer tune the indexes and prune out those indexes that are not needed. This was beyond the scope of the prototype effort.
Yes, it would be possible to introduce a service that would evaluate a patch based upon the current schema for an entity type, and to reject the patch if it ended up violating some rules within the schema. This was beyond the scope of the prototype effort.
No, at this point Debezium uses no authentication or authorization. However, because Debezium is intended to be embedded into other applications, such as web services or MBaaS systems, those systems would likely provide the needed authentication and authorization functionality before making the requests to Debezium. For example, if Debezium is included as a subsystem in LiveOak, then LiveOak's use of KeyCloak already provides authentication and authorization functionality.