Comprehensive guidance in Software\System Design and System Design using Azure Public Cloud

Glareone/Azure-Solution-and-Enterprise-Architecture-in-Depth

The material presented here is gathered from several sources: Stanford University and Carnegie Mellon University courses, various books, Udemy, DesignGurus, Alex Xu and O'Reilly courses, etc.

Categories of presented materials

  • Azure Solution Architecture
  • TOGAF Enterprise Architecture
  • Solution and System Design examples
  • Advanced topics regarding different aspects of solution architecture

AWS Cloud

  1. AWS: Solution Architecture in Practice

Azure Cloud

  1. Azure: Solution Architecture in Practice
    a. Azure: Logging & Monitoring
    b. Azure: Service Principal, Managed Identity, and Service Connection
    c. Azure: Data Materials
    d. Azure: Design Relational Data Storage
    e. Azure: Active Geo-replication
    f. Azure: Networking
    g. Azure: High Availability and Traffic Manager
    h. Azure: Architecture examples by Microsoft
    i. Azure Migration Strategy by Microsoft
    j. Azure Additional Links to Different Materials
    k. Azure: Articles that cover everything in Azure. Third-party repo I cloned
    l. API Management and VNet + APIM with Gateway for On-Premise
  2. Azure B2B, B2C, Entitlement, Guest Users. App roles & delegated permissions
    a. Azure: B2B & B2C, Azure AD DS Domain Services
    b. Azure Entitlement
    c. Azure Delegate Permissions (User to Azure App)
    d. Azure Assign App roles to another App. Api Permissions
  3. Azure: High Availability. Disaster Recovery for Azure Functions. Strategies
    a. Azure Service Bus: Performance Improvements Best Practices
    b. Azure App Service Disaster Recovery. Active-Active, Active-Passive, Active-Cold
  4. Azure: Skylines Academies materials
  5. Data & Data Storage Materials
    a. Design for Data Storage
    b. Azure SQL. Data Retention
    c. Azure SQL DB Data Retention for more than 35 days. Tip
  6. Traffic Manager
  7. Azure Data Factory
  8. Azure Proximity Placement Groups. Configuring Low-latency VMs in one DataCenter
  9. Azure App Configuration. Feature Flags and Configurations

Enterprise Architecture. ToGAF

  1. ToGAF. ADM. Basics
  2. ToGAF. Preliminary Phase
  3. ToGAF. A. Architecture Vision
  4. ToGAF. B. Business Architecture
  5. ToGAF. C. Information Systems Architecture
  6. ToGAF. D. Technology Architecture
  7. ToGAF. E. Opportunities and Solutions
  8. ToGAF. F. Migration Planning
  9. ToGAF. G. Governance. Table of Content
  10. ToGAF. H. Architecture Change Management
  11. ADM. Guidelines and Techniques. Table of Content
  12. Architecture Governance Techniques. Table of Content

Architecture. ATAM. Principles

  1. Architecture: How Much of Architecture is needed
    a. Architecture lifecycle
    b. Architecture patterns
    c. Architecture style vs pattern
    d. Architecture Influence Cycle
  2. Architecture: Jeffrey Richter's Course materials
  3. Architecture: Attribute-Driven Design
  4. Architecture: Quality Attributes and Tactics to achieve them
    a. Functional vs Non-Functional. Test Process
    b. Tactics to Achieve Quality Attributes
  5. Architecture: Architecture Style
    a. Module Styles
    b. Component & Connectors Styles
    c. Allocation Styles
  6. Architecture Views. Documenting Software Architecture. Properties to document in your Architecture Document
  7. Documenting Architecture. General Structure. General Principles. Mapping Requirements
    a. Documenting: Combining Views. Hybrid View
    b. Documenting: Interfaces, Behavior, Context
    c. View Packets. Alternatives
    d. Example: Architecture Decision Records (ADR\MADR). How to Document your architecture decisions and their consequences
  8. Views and Beyond. Alternatives: DoDAF, ISO42010 \ IEEE1471-2000
  9. Reviewing Architecture Documentation
  10. Architecture Evaluation

General Architecture Principles. Data. Caching. Messaging. Distributed Systems

  1. Data & Data Storage Materials
    a. Data Replication. Leader-Follower & Quorum patterns in SQL
    b. Blob Storage vs File Share vs Managed Disks
  2. Cache. Read-Through, Cache-Aside
    a. Cache Consistency Models
    b. Cache Challenges
    c. Cache Replacement Policies
    d. Cache Performance Metrics
  3. Governance and Compliance materials
  4. Kafka & Messaging patterns
    a. Kafka. Basics. Consumer Group. Compression & Batching. Load Balancing
    b. Messaging Patterns suitable for Kafka and for other services. Q&A
  5. Dapr. The Distributed Application Runtime
  6. CAP Theorem. PACELC
  7. Consistent Hashing for Data Replication and Data Partitioning

Architecture in Practice. System Design Interview Questions. Main Problems

  1. Architecture in Practice. System Design Interview. Q&A. Main Problems
    a. System Design Interview In practice
    b. URL Shortener
    c. Pastebin
    d. Web Crawler
  2. CAP Theorem. PACELC
  3. Consistent Hashing for Data Replication and Data Partitioning

System Design Interview. Low-Level System Design Interview. High-Level System Design Interview

Table of Contents

  1. System Design Interview. Low-Level SDI, High-level SDI
  2. Step-by-step guide by DesignGurus

Other Good Articles & Must Read Books

  1. System Design Complete Guide by Karan Pratap Singh
  2. Designing Data-Intensive Applications, M.Kleppmann
  3. System Design Interview, Alex Xu

How Much of Architecture is needed?

image

Architecture lifecycle

QAW (Quality Attribute Workshop) => ADD (Attribute-Driven Design) => V&B (Views and Beyond) => ATAM (Architecture Tradeoff Analysis Method)

image

Architecture patterns

image

Architecture style vs pattern

image

Architecture Influence Cycle

image

Architecture Materials. General Materials

Materials

Jeffrey Richter's Course. Slides

?aaS Cloud course from Jeffrey Richter
Jeffry Richter Presentation, Why cloud apps, Embracing Failures, Orchestrators, Virtualization.pptx
Jeffrey Richter Presentation, Regions and Microservices.pptx
Jeffrey Richter, Scaling, 12-Factor, Containers.pptx
Jeffrey Richter, Docker, Hyper-V containers, Containerd runtime, CI and CD.pptx
Jeffrey Richter, API Versioning, Troubleshooting, Steps towards Microservices(what need to take into account).pptx
Jeffrey Richter, Idempotency, Retry Policy, Exactly Once Strategy.pptx

8 Fallacies of Distributed Computing Explained

8_Fallacies_of_Distributed_Computing_Explained.pdf

image

4 Reasons to split a monolith into microservices. Richter's slide

4 reasons to split the monolith into microservices

twelve factor application (12 factor explained)

The twelve factor application (12-factor)

Forward & Reverse Proxies

When people talk about a proxy, they usually mean a forward proxy.
Forward proxy: a proxy that sits in front of several services that reach a resource on the internet. It can add information (headers) to their requests and\or transform them before forwarding, for example by changing the destination address.
Good forward proxy example: Fiddler.

Good reverse proxy examples: Nginx, IIS, API Gateway, Load Balancer, etc.
You may need them for:

  1. Manage calls coming to your services (acting as a facade for your services)
  2. Check whether incoming calls, for example those using basic authentication, are allowed to interact with the service
  3. Load balancing (at OSI layer 4 for TCP and UDP traffic and at OSI layer 7 for HTTP\HTTPS traffic)
  4. SSL termination: a request coming to the reverse proxy (Nginx, for example) may be HTTPS, but the forwarded request will be HTTP
  5. Caching mechanisms
  6. Throttling: to control the input, e.g. the number of requests per second
  7. Billing: to count requests and help produce a bill for them
  8. DDoS mitigation
  9. Retry policy: it may automatically retry against another service instance behind it if a given instance is unreachable
    image

Reverse proxy example. RP-I is a reverse proxy that plays the load balancer role here.

NOTE: It's impossible to keep endpoints in sync as service instances come\go. Client code must be robust against this

image

Another example of reverse proxy with load balancer + healthcheck role
image
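
The load balancing and retry roles listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ReverseProxy`, `healthy` and `broken` are hypothetical names, backends are plain functions instead of network calls, and an unreachable instance is modeled by raising `ConnectionError`.

```python
from itertools import cycle
from typing import Callable, Dict

# A backend is modeled as a function from request to response (assumption).
Backend = Callable[[str], str]


class ReverseProxy:
    """Toy reverse proxy: round-robin load balancing plus retry on failure."""

    def __init__(self, backends: Dict[str, Backend]) -> None:
        self._backends = backends
        self._order = cycle(list(backends))  # endless round-robin rotation

    def handle(self, request: str) -> str:
        # Try each instance at most once, starting from the next in rotation.
        for _ in range(len(self._backends)):
            name = next(self._order)
            try:
                return self._backends[name](request)
            except ConnectionError:
                continue  # instance unreachable: retry on the next instance
        raise ConnectionError("no backend instance is reachable")


def healthy(name: str) -> Backend:
    return lambda request: f"{name} handled {request}"


def broken(request: str) -> str:
    raise ConnectionError("instance down")


proxy = ReverseProxy({"a": healthy("a"), "b": broken, "c": healthy("c")})
responses = [proxy.handle(f"req{i}") for i in range(3)]
print(responses)
```

The client only ever talks to `proxy.handle`, so instances can fail or be replaced behind it without the client knowing their endpoints.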

Carnegie Mellon University

Attribute-Driven Design (ADD)

Attribute-Driven Design in detail

image image image image image image

Quality Attributes. Tactics to achieve Quality Attributes.

Functional vs Non-Func Attributes. Testing Process

Functional vs Non-Functional attributes. Non-Functional Testing Process. Testing Cycle

image
image
image
image
image

Quality attributes. How to describe them better. Concrete examples.

Instead of just saying that performance is important to you, you need to create scenarios.
Each scenario should have 6 parts: the source of the stimulus (where it comes from), the stimulus (what is happening), the environment (under which circumstances), the artifact (what is affected), the response (how the system reacts), and the response measure (how the reaction is judged) image

image

image
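
A scenario like this can also be kept as structured data, which makes it easy to review and to turn into tests later. The sketch below is an assumption-laden illustration: the class name `QualityAttributeScenario`, its field names and all example values are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QualityAttributeScenario:
    """One quality attribute scenario, one field per scenario part."""
    source: str            # who or what generates the stimulus
    stimulus: str          # what is happening
    environment: str       # under which circumstances
    artifact: str          # which part of the system is affected
    response: str          # how the system should react
    response_measure: str  # how the reaction is judged

    def summary(self) -> str:
        return (
            f"When {self.source} causes '{self.stimulus}' on {self.artifact} "
            f"during {self.environment}, the system must {self.response} "
            f"({self.response_measure})."
        )


# Hypothetical performance scenario, values invented for illustration.
performance = QualityAttributeScenario(
    source="500 concurrent users",
    stimulus="search request",
    environment="normal operation",
    artifact="search service",
    response="return the first page of results",
    response_measure="95% of requests within 2 seconds",
)
print(performance.summary())
```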

Scenario for Availability. The first slide shows the full spectrum of what is covered. The second and third show exactly what we handle and what we expect

image image image

Tactics to achieve Quality Attributes

Tactics: Availability. 3 types: fault detection, fault recovery, fault prevention

image

image

  • For example, for Availability you need to care about:
    • Passive redundancy: copy your data\state to extra instances and hot-swap to one of them when needed
    • Condition monitoring: detect when a failure occurs in your instance and react to the event (restart the instance, isolate it, etc.)

Example of Tactics. Detect Faults

image image

  • Voting is used in airplanes, where several processes perform the same calculation in parallel and the majority wins.
    • Usually a system element acting as a monitor watches the output of each process. If one process produces a result that differs from the others, it is deemed faulty.
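
A minimal sketch of the voting tactic, assuming each redundant process reports a comparable value and a strict majority is required. The function name `vote` and the process names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def vote(outputs: Dict[str, Hashable]) -> Tuple[Hashable, List[str]]:
    """Return the majority value and the names of the processes that disagree."""
    tally = Counter(outputs.values())
    winner, count = tally.most_common(1)[0]
    if count <= len(outputs) // 2:
        # No strict majority: the monitor cannot decide which result to trust.
        raise RuntimeError("no majority among redundant processes")
    faulty = sorted(name for name, value in outputs.items() if value != winner)
    return winner, faulty


# Three redundant processes computed the same quantity; one of them is off.
result, faulty = vote({"p1": 42, "p2": 42, "p3": 41})
print(result, faulty)
```

The monitor would typically act on `faulty`, for example by restarting or isolating those processes.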

Recovery from Faults

image image image

Prevent Faults

image

Tactics: Performance

image

image

Control Resource Demand

One way to increase performance is to carefully manage the demand for resources. This can be done by reducing the number of events processed or by limiting the rate at which the system responds to events. In addition, a number of techniques can be applied to ensure that the resources that you do have are applied judiciously

  • Manage work requests.
    One way to reduce work is to reduce the number of requests coming into the system to do work. Ways to do that include the following:
    • Manage event arrival.
      A common way to manage event arrivals from an external system is to put in place a service level agreement (SLA) that specifies the maximum event arrival rate that you are willing to support.
    • Manage sampling rate.
      In cases where the system cannot maintain adequate response levels, you can reduce the sampling frequency of the stimuli, for example the rate at which data is received from a sensor or the number of video frames per second that you process. Of course, the price paid here is the fidelity of the video stream or the information you gather from the sensor data. Nevertheless, this is a viable strategy if the result is “good enough.”
  • Limit event response.
    When discrete events arrive at the system (or component) too rapidly to be processed, then the events must be queued until they can be processed, or they are simply discarded. You may choose to process events only up to a set maximum rate, thereby ensuring predictable processing for the events that are actually processed.
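
As an illustration of "process events only up to a set maximum rate", here is a small fixed-window limiter. It is a sketch under assumptions: the class name `FixedWindowLimiter` is hypothetical, timestamps are passed in explicitly to keep the example deterministic, and rejected events are simply reported to the caller, who may queue or discard them.

```python
class FixedWindowLimiter:
    """Accept at most max_events per window; the rest must be queued or dropped."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._window_start = 0.0
        self._count = 0

    def allow(self, now: float) -> bool:
        # Start a fresh window once the current one has elapsed.
        if now - self._window_start >= self.window_seconds:
            self._window_start = now
            self._count = 0
        if self._count < self.max_events:
            self._count += 1
            return True
        return False


limiter = FixedWindowLimiter(max_events=2, window_seconds=1.0)
arrivals = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2]
decisions = [limiter.allow(t) for t in arrivals]
print(decisions)
```

In a real system the `now` argument would usually come from a monotonic clock such as `time.monotonic()`.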

Reduce computational overhead

  • Reduce indirection.
    The use of intermediaries (so important for modifiability, as we saw in Chapter 8) increases the computational overhead in processing an event stream, so removing them improves latency. This is a classic modifiability/performance tradeoff. Separation of concerns—another linchpin of modifiability—can also increase the processing overhead necessary to service an event if it leads to an event being serviced by a chain of components rather than a single component
  • Co-locate communicating resources.
    Context switching and intercomponent communication costs add up, especially when the components are on different nodes on a network. One strategy for reducing computational overhead is to co-locate resources.
  • Periodic cleaning.
    A special case when reducing computational overhead is to perform a periodic cleanup of resources that have become inefficient. For example, hash tables and virtual memory maps may require recalculation and reinitialization.
  • Increase efficiency of resource usage.
    Improving the efficiency of algorithms used in critical areas can decrease latency and improve throughput and resource consumption.

Manage Resources

Even if the demand for resources is not controllable, the management of these resources can be. Sometimes one resource can be traded for another.

  • Increase resources. Faster processors, additional processors, additional memory
  • Introduce concurrency. If requests can be processed in parallel, the blocked time can be reduced
  • Maintain multiple copies of computations. This tactic reduces the contention that would occur if all requests for service were allocated to a single instance.
  • Maintain multiple copies of data. Two common examples of maintaining multiple copies of data are data replication and caching.
  • Bound queue sizes. This tactic controls the maximum number of queued arrivals and consequently the resources used to process the arrivals.
  • Schedule resources. Whenever contention for a resource occurs, the resource must be scheduled. Processors are scheduled, buffers are scheduled, and networks are scheduled
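
The "bound queue sizes" tactic can be sketched with the standard library `queue.Queue`, whose `maxsize` argument caps the number of queued arrivals. The helper `submit` and the event names are hypothetical; what to do with a rejected arrival (drop, retry later, return an error) is left to the caller.

```python
import queue


def submit(q: queue.Queue, item: str) -> bool:
    """Try to enqueue without blocking; report whether the item was accepted."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # queue is at its bound: the arrival is rejected


pending = queue.Queue(maxsize=2)  # at most 2 queued arrivals
accepted = [submit(pending, f"event-{i}") for i in range(4)]
print(accepted, pending.qsize())
```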

Patterns

  • Service Mesh
    The service mesh pattern is used in microservice architectures. The main feature of the mesh is a sidecar—a kind of proxy that accompanies each microservice, and which provides broadly useful capabilities to address application-independent concerns such as interservice communications, monitoring, and security. A sidecar executes alongside each microservice and handles all interservice communication and coordination. image

  • Load Balancer
    A load balancer is a kind of intermediary that handles messages originating from some set of clients and determines which instance of a service should respond to those messages. The key to this pattern is that the load balancer serves as a single point of contact for incoming messages—for example, a single IP address image

  • Throttling
    The throttling pattern is a packaging of the manage work requests tactic. It is used to limit access to some important resource or service. In this pattern, there is typically an intermediary—a throttler—that monitors (requests to) the service and determines whether an incoming request can be serviced. image

  • Map-Reduce
    The map-reduce pattern efficiently performs a distributed and parallel sort of a large data set and provides a simple means for the programmer to specify the analysis to be done. Unlike our other patterns for performance, which are independent of any application, the map-reduce pattern is specifically designed to bring high performance to a specific kind of recurring problem: sort and analyze a large data set. This problem is experienced by any organization dealing with massive data image
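
To make the map-reduce idea concrete, here is a single-process word count split into map, shuffle and reduce phases. It is only a sketch: real map-reduce frameworks distribute these phases across many machines, and the function names below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    # Emit a (key, value) pair for every word.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Group all values that share a key, as the framework would do between phases.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    # Collapse each group into a single result per key.
    return {key: sum(values) for key, values in groups.items()}


documents = ["to be or not to be", "be quick"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

The programmer supplies only `map_phase` and `reduce_phase`; grouping and distribution are the infrastructure's job, which is what makes the pattern simple to use.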

Tactics: Deployability. Patterns
  • Sample of Deployability scenario
    image

  • Tactics
    image

Manage Deployment Pipeline

Scale rollouts.

Rather than deploying to the entire user base, scaled rollouts deploy a new version of a service gradually, to controlled subsets of the user population, often with no explicit notification to those users. (The remainder of the user base continues to use the previous version of the service.) By gradually releasing, the effects of new deployments can be monitored and measured and, if necessary, rolled back. This tactic minimizes the potential negative impact of deploying a flawed service. It requires an architectural mechanism (not part of the service being deployed) to route a request from a user to either the new or old service, depending on that user’s identity.
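
One possible shape of the routing mechanism mentioned above, assuming users are identified by a string ID and the rollout is expressed as a percentage. The names `rollout_bucket` and `route` are hypothetical; hashing the ID keeps each user on the same version between requests.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, rollout_percent: int) -> str:
    """Send the user to the new version if their bucket is inside the rollout."""
    return "new" if rollout_bucket(user_id) < rollout_percent else "old"


# Widening the rollout from 10% to 50% only adds users to the new version.
users = [f"user-{i}" for i in range(200)]
at_10 = {u for u in users if route(u, 10) == "new"}
at_50 = {u for u in users if route(u, 50) == "new"}
print(len(at_10), len(at_50), at_10 <= at_50)
```

Rolling back is then a matter of setting the percentage to 0, which sends every user to the old version again.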

Roll back.

If it is discovered that a deployment has defects or does not meet user expectations, then it can be “rolled back” to its prior state. Since deployments may involve multiple coordinated updates of multiple services and their data, the rollback mechanism must be able to keep track of all of these, or must be able to reverse the consequences of any update made by a deployment, ideally in a fully automated fashion.

Script deployment commands.

Deployments are often complex and require many steps to be carried out and orchestrated precisely. For this reason, deployment is often scripted. These deployment scripts should be treated like code—documented, reviewed, tested, and version controlled. A scripting engine executes the deployment script automatically, saving time and minimizing opportunities for human error.

Manage Deployed System

Manage service interactions.

This tactic accommodates simultaneous deployment and execution of multiple versions of system services. Multiple requests from a client could be directed to either version in any sequence. Having multiple versions of the same service in operation, however, may introduce version incompatibilities. In such cases, the interactions between services need to be mediated so that version incompatibilities are proactively avoided. This tactic is a resource management strategy, obviating the need to completely replicate the resources so as to separately deploy the old and new versions.

Package dependencies.

This tactic packages an element together with its dependencies so that they get deployed together and so that the versions of the dependencies are consistent as the element moves from development into production. The dependencies may include libraries, OS versions, and utility containers (e.g., sidecar, service mesh), which we will discuss in Chapter 9. Three means of packaging dependencies are using containers, pods, or virtual machines; these are discussed in more detail in Chapter 16.

Feature toggle.

Even when your code is fully tested, you might encounter issues after deploying new features. For that reason, it is convenient to be able to integrate a “kill switch” (or feature toggle) for new features. The kill switch automatically disables a feature in your system at runtime, without forcing you to initiate a new deployment. This provides the ability to control deployed features without the cost and risk of actually redeploying services.
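
A minimal in-memory sketch of a kill switch. The class `FeatureToggles`, the flag name `new_checkout` and the `checkout` function are hypothetical; in practice the flag values would usually live in an external configuration store so they can be changed without touching the running service.

```python
class FeatureToggles:
    """Tiny runtime flag store; unknown flags count as disabled."""

    def __init__(self, **flags: bool) -> None:
        self._flags = dict(flags)

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        self._flags[name] = False  # the "kill switch"


toggles = FeatureToggles(new_checkout=True)


def checkout(cart_total: float) -> str:
    # The new code path is guarded by the toggle; the old path stays deployed.
    if toggles.is_enabled("new_checkout"):
        return f"new flow: {cart_total:.2f}"
    return f"old flow: {cart_total:.2f}"


before = checkout(10)
toggles.disable("new_checkout")  # no redeployment needed
after = checkout(10)
print(before, "|", after)
```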

Patterns for Deployability

Code organization point of view

image

  • Benefits image

  • Tradeoffs image

Deployment changes point of view

image

  • Benefits and Tradeoffs image
Quality Attribute Utility Tree. Example

image

Architecture Styles and their Documentation. Module Styles, C&C Styles, Allocation Styles.

Module Styles. General Information

image

Module Styles: Decomposition, Uses, Generalization, Layered, Data Model, Aspects styles

Module styles are closer to code. Used for:

  • code construction,
  • analysis (impact of changes, planning, or for budgeting concerns).
  • education (onboarding new team-members).
  • Build vs Buy. You can use Decomposition to understand what is better for you: building it on your own or buying the product from a third-party vendor.

Module Styles:

  • Decomposition Style
  • Uses Style
  • Generalization Style
  • Layered Style
  • Data Model Style
  • Aspects Style

image image

Notations and Usage. Informal or semi-formal notation: Box and Lines or UML

image

Module Styles in details.

Module Styles: Decomposition Style. Decomposition Refinement

Decomposition Style. Notations

image image

Decomposition Refinement

  • Need to understand if inner structure is needed image

Decomposition refinement in UML

image

Module Styles: Uses Style
  • Useful for planning incremental development when you have several dependencies and need to know when each of them will be available to you;
  • Useful in debugging and testing, because you can stub and mock your dependencies or isolate where an issue comes from;
  • Useful for validating dependencies and avoiding circular dependencies, which prevent incremental deployment and delivery;
  • Useful for tracing changes when you want to guarantee that other dependencies will not suffer

image

Example:

image image
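
Validating a uses structure for circular dependencies can be automated. The sketch below assumes the uses relation is available as a mapping from a module to the modules it uses; `find_cycle` and the module names are hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(uses: Dict[str, List[str]]) -> Optional[List[str]]:
    """Depth-first search over the uses relation; return one cycle or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {module: WHITE for module in uses}
    path: List[str] = []

    def visit(module: str) -> Optional[List[str]]:
        state[module] = GREY
        path.append(module)
        for dep in uses.get(module, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path: we closed a loop.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[module] = BLACK
        return None

    for module in uses:
        if state[module] == WHITE:
            found = visit(module)
            if found:
                return found
    return None


cyclic = {"ui": ["orders"], "orders": ["billing"], "billing": ["orders"]}
acyclic = {"ui": ["orders"], "orders": ["db"]}
print(find_cycle(cyclic), find_cycle(acyclic))
```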

Module Styles: Generalization Style
  • Useful for code analysis
  • Useful for Architecture Representation Creation.
  • Useful for Application Skeleton creation image image

Example:

image

Module Styles: Layered Style
  • Useful for Modifiability image image image

  • A concentric diagram may not be equivalent to a stack diagram: it is ambiguous whether B1, B2 and B3 can use each other (especially B1 and B3, because they touch each other).

Informal notation in Layered style

image

Module Styles: Aspects Style (Also known as Multi-dimensional separation of concerns)
  • A style for depicting relations between classes or sets of classes (their aspect modules)
  • Could be used to understand the error-handling strategy, for example when a set of modules must use one protocol (handling policy) for handling an error if it occurs.
  • Could be useful for application decomposition
  • Could be useful to understand the scalability of solution image image

Example

  • Colored bar represents the aspect in module
  • Can be used only when the code already exists image
  • To understand scalability, the best strategy is to create a UML class diagram image
  • For very large apps you need a non-graphical notation, because with dozens of classes and their aspects a UML diagram becomes unreadable image

Good example of diagram which uses UML without crosscut arrows

  • Instead of crosscut arrows all aspects are bundled into the package
  • Annotation is attached to each aspect image
Module Styles: Data Model Style. ERD & UML Notation
  • Useful to depict Data Entities and relations between them
  • Useful for impact analysis
  • Helps to implement modules which have access to the data
  • Useful in Domain Analysis
  • Useful for performance optimizations image image

Example

image

  • In ERD Notation image
  • The same diagram in UML Notation image

Data Model Evolution over time

image

Components and Connectors Styles

Component and Connector Styles: Pipe and Filter, Client-Server, Service-Oriented (SOA), Publish-Subscribe, Shared-Data styles

Used for:

  • Performance, Security, Availability (Runtime attributes) analysis
  • Education
  • Construction. C&C Style could describe behavior that elements must demonstrate when they work together
  • Could help to describe how very specific part of the system works

Components and Connectors style:

  • Pipe-and-filter Style
  • Client-Server Style
  • Service-Oriented Style
  • Publish-Subscribe Style
  • Shared-Data Style
  • Repository Style
  • Others

image

  • DataFlow, Call-Return, Event-Based, Repository are Sub-families of C&C Styles

Rules of C&C Style:

  • In a C&C style, a component is a runtime unit of processing or a data store. A component has ports to the outside world;
  • A connector connects components. A connector has roles, which define how components can use it;
  • Ports and roles are just special interfaces of components and connectors, respectively;
  • There is only one relation between elements: attachment. It describes which component ports are attached to which connector roles;
  • In architecture, a connector is not just a procedure call; it can be a very sophisticated computation;
  • A C&C diagram can carry quality attribute properties that help with analysis;
  • Different C&C styles are useful for different quality attributes

image

C&C Style: Pipe and Filter. Pipes and Filters in UML. Yahoo! Pipes
  • Good for cases when data is transformed serially
  • A filter directly followed by another filter, or a pipe directly followed by another pipe, is prohibited by the style: pipes and filters must alternate
  • Good for Functional composition Data Analysis (what the output could be knowing the function and input) image
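
A small sketch of the style using Python generators: each filter transforms a stream, and the generator chaining plays the role of the pipes. The filter names and the `pipeline` helper are hypothetical.

```python
from typing import Callable, Iterable, Iterator

# A filter consumes a stream of lines and produces a new stream.
Filter = Callable[[Iterable[str]], Iterator[str]]


def strip_blank(lines: Iterable[str]) -> Iterator[str]:
    for line in lines:
        if line.strip():
            yield line


def to_upper(lines: Iterable[str]) -> Iterator[str]:
    for line in lines:
        yield line.upper()


def number(lines: Iterable[str]) -> Iterator[str]:
    for index, line in enumerate(lines, start=1):
        yield f"{index}: {line}"


def pipeline(source: Iterable[str], *filters: Filter) -> Iterable[str]:
    """Connect filters in series; each filter's output feeds the next one."""
    stream = source
    for apply_filter in filters:
        stream = apply_filter(stream)
    return stream


output = list(pipeline(["alpha", "", "beta"], strip_blank, to_upper, number))
print(output)
```

Because every filter is a pure stream transformation, the output can be reasoned about by composing the functions, which is the functional composition analysis mentioned above.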

In UML

  • Using a dependency arrow is possible but not recommended. It is better to use a straight line as the connector image

Yahoo! pipes.

  • Introduced by Yahoo.
  • 7 Filters and Blue Pipes from top to bottom. On the right there are support modules for mainline "components"(in Yahoo terminology). image
C&C Style: Client-Server
  • Useful for performance, availability, dependability, security analysis
  • Useful when you need to depict the number of clients served at the same time
  • Useful when you need to create a strategy "what to do if one server becomes compromised" image image

Example

image

C&C Style: Service-Oriented (SOA). Web Services
  • Useful for property analysis which could be associated with service or with service client;
  • Services are not necessarily discoverable or dynamically bound image
  • ESB (Enterprise Service Bus): a special component that takes care of message routing image

Use cases:

  • services written in different languages, for different platforms, or by different teams\organizations
  • services which have different styles
  • integration of external components
  • repackaging legacy systems: you can rehabilitate pieces of a legacy system one by one

Web Services

  • Web Services is not a synonym for SOA: SOA is an architecture style, while Web Services is one of many technologies you may choose to implement SOA
  • Micro-service Architecture, Web Services, Micro-Macro Service Architecture, CORBA are all part of the SOA style image

SOA In practice

image image

C&C Style: Publish-Subscribe
  • Pure event-based style
  • Useful if you don't know all of your subscribers or how many of them there are.
  • Useful for sending information for unknown recipients
  • Event distributor can be depicted as a bus or a component image image

Constraints

  • Need to understand which components can listen to which events. Some events may not be public
  • Need to decide whether components can listen to their own events (sometimes the answer is yes, sometimes no)

Example

  • this diagram shows more than just publish-subscribe and it's okay image
C&C Style: Shared-Data
  • If you see big fat database in the middle of the diagram - it's definitely Shared-Data Style
  • Useful for enterprise management system image

Example

image

C&C Style: Combination of styles

image

  • the same diagram depicted in another style loses its stylistic richness
  • it doesn't let you draw conclusions with any certainty image
C&C Style: Crosscutting issues in different C&C Styles. Grouping in Tiers. Tiers vs Layers. Multi-tier Notation

Tiers vs Layers

  • Layers are used more in the Module view; tiers, in the C&C view.

  • They can be mapped one-to-one, but it's also possible to map several layers to one tier

  • Tiers are used to check cohesion of components in C&C view

  • Tiers also could be transformed into Packages in UML Notation

  • Cross-communication processes issues (when different processes share resources) image image

Tiers. Commonly in Client-Server style

  • Useful to declare connections between tiers (only allowed to communicate with these exact tiers and not with others)
  • Tiers could be pass-through, but this should be explicitly declared
  • Grouping components in tiers, for example Tier execution
  • For the client tier you can also describe what kind of clients it has, thin or fat: thin clients are generally embedded within a web browser, while fat clients must be installed on the client's machine.
  • Tiers could be depicted as packages in UML notation image
  • We can declare how tiers are communicating image
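
Declared tier connections can also be checked mechanically. The sketch below assumes a four-tier layout and a table of allowed tier-to-tier connections; the tier names, component names and the `violations` helper are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Which tiers each tier is allowed to call (assumed example layout).
ALLOWED: Dict[str, Set[str]] = {
    "client": {"web"},
    "web": {"business"},
    "business": {"data"},
    "data": set(),
}


def violations(
    calls: List[Tuple[str, str]], tier_of: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return the component-to-component calls that break the tier rules."""
    bad = []
    for caller, callee in calls:
        src, dst = tier_of[caller], tier_of[callee]
        # Calls inside one tier are fine; cross-tier calls must be declared.
        if src != dst and dst not in ALLOWED[src]:
            bad.append((caller, callee))
    return bad


tier_of = {"browser": "client", "api": "web", "orders": "business", "db": "data"}
calls = [("browser", "api"), ("api", "orders"), ("browser", "db")]
print(violations(calls, tier_of))
```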

Multi-tier notation

image

  • Tiers could be depicted as packages in UML notation image

Dynamic Creation and Destruction

  • Could be done in State-Machine diagram image image

Allocation Styles

Allocation Styles: Deployment, Install, Work Assignment styles

Allocation Style:

  • Deployment Style
  • Install Style
  • Work Assignment Style

image

Allocation Styles: Deployment
  • Maps software to hardware
  • Good for performance, availability, durability, disaster recovery analysis image

image

Examples

image

  • Boxes depict hardware boxes or processors image
Allocation Styles: Install
  • Useful for creating build and deploy procedures image image

Example

image

  • The same in UML, showing internal components
  • EAR - Enterprise Archive. JAR - Java Archive files image
Allocation Styles: Work Assignment. "Who will do the job". Specializations: Platform-style, Competence-center, Open-Source

image image

Example

image

Specialization. Platform-style vs Competence-center vs Open-Source

image

Architecture Views. Documenting Software Architecture. Properties to document in your Architecture Document

General. Important Concepts of documentation. View-Based Documentation. Types of Views

image image image image

General. Select Properties to document. Examples in terms of Quality Attribute

image image

Example:

  1. Performance Attribute: you need to document best and worst response time properties, or the maximum number of events that an element can service per time unit (per second or per minute).
  2. Security: perhaps you need to document the level of encryption and authorization rules for different elements and relations.
General. Apply Views. Information on Views. Structural vs Behavioral. Trace-Oriented vs Comprehensive Language

image image image

Documenting Behavior. 99% of the time in practice you work with a trace-oriented language

image image

View Type: Module View

image image

View Type: (C&C) Component And Connector View

In short: how your code maps onto resources, where it executes, and what major parameters it has:

  • Processor cores vs processes
  • Sockets
  • REST APIs (dependency relationships between information carriers)
  • Where your code is executed: Client Machine vs Server, PC or Mobile.

Attachments:

  • Output from one port goes to the input port of another

Quality of service information

  • Amount of requests per hour
  • Latency

image

For what?

image

View Type: Allocation View: Deployment, Implementation views. Their Refinements

Details: https://sites.google.com/site/softwarearchitectureinpractice/9-documenting-software-architecture/d-allocation-views

In short: requirements about where and how the application runs

  • Memory requirements
  • Processor requirements
  • Execution in processes and threads

Allocation View consists of:

  • Deployment view (how and where you deploy your app)
  • Implementation View

image

Implementation Refinement

  • On the right we reveal that our connector is an event dispatcher
  • It might help to understand how teams need to change their interfaces to interact with the dispatcher appropriately
  • New questions for the architect may appear, and you may need to pay attention to how exactly the event bus will work and what limitations you now have
  • The diagram on the left could be interesting to people who are less concerned with technical details; the one on the right, to people who are more concerned about design image
Heterogeneous Architecture Style & View: Demonstrate a mix of approaches on one diagram. Examples

image image image

Notations for Architecture View. Module View. Uses View. Generalization View

In General

image

Notations for Module View. Decomposition View

image image

Notations for Uses View

image image

Notation for Generalization View

image image

Notation for Layered View. Usually used for portability and usability

image image image

What View You need to Document. General pieces of advice
  1. Identify who your stakeholders are
  2. Decide which diagrams each of them needs in order to understand and sell your product
  3. Consolidate views (if their number is too high)
  4. Document the rationale for your design decisions
  5. Document the functional and non-functional attributes and constraints of your system
  6. Put a legend on all diagrams

image

Put a legend on each diagram:

image

Documenting Architecture. General Structure. General Principles. Mapping Requirements

General Document Structure. General Principles

General Structure

image

Principles

image image

Mapping requirements to design Decisions. Microsoft template

image image

Where to record such mapping

image image

Allocation view could be used to depict mapping between requirements and design decisions.

  • It could be part of the software design explanation image

Summary. Microsoft Reference Template

image

Combine Views. Hybrid View

What Views to choose. 3 step principle: Build Stakeholders table, Combine Views (rule of thumb), Prioritize documentation. Part 1

Stakeholders table

  • PS: It's also important to include at least one Model diagram and one C&C diagram in your documentation.
  • Build a stakeholders table
  • Don't accept all their wishes: document the views that help them in their work, not the "just nice to have" ones image

Which View to apply?

image

Example using real system

image image

  • D - detailed, S - some information, O - Overview image

Combine Views. use Rule of Thumb

image

Example. View set

  • Out of 12 candidate diagrams we selected 8 and decided to merge 2 of them into one combined view, which leaves 7 diagrams (8 - 2 + 1)
  • This could still make our documentation redundant, so we need to prioritize the views and combine or simplify those that are not essential
    image

Prioritize and Stage documentation

  • Hints
  • Start with a high-level overview (not the detailed architecture). It lets developers start their work and helps with forming budgets
  • A high-level overview helps to begin the analysis image

Merging different views together

  • After prioritizing the 7 views we get 3 primary and 4 minor views. image image

Summary

image

Combine Views. Hybrid Views. Overlay Part 2

image

Overlay approach. Examples

image image

Case. Example

  • We had 2 small diagrams and decided to merge them image
  • Such an overlay can expose a component that is not well defined and help us understand how we are going to build it image

Tips and Tricks

  • In the case of aspects and generalization: aspects are a special kind of class, so they overlay very easily in a generalization view image

Summary

image

Documenting Interfaces, Behavior, Context

Documenting: Interfaces

General idea

image image

Examples

image image

UML Example with explanation

image image

Documenting: Behavior. Dynamic properties vs static properties. time to response (TTR), Throughput
  • Behavior documentation is needed to describe the dynamic properties of the built system: response time, throughput, etc.
  • It supports analysis of the system as it executes.
  • It answers the question "in what order do components interact?".
  • It depicts transitions from state to state
  • It shows the system's status under certain circumstances
  • It shows how the system starts up
  • It helps verify that the system works as intended under a variety of conditions

image image image

Documenting: Behavior. When and why. Trace-oriented vs Comprehensive language
  • useful in documenting interfaces and templates image image

  • Both Comprehensive and Trace-oriented languages can represent the whole system or any of its sub-systems image

Documenting: Behavior. Trace-oriented Language & Comprehensive Language. BPMN Notation. Diagrams: Collaboration, Sequence, Activity

image

Trace-oriented Language

  • A trace-oriented language answers the question "what happens when a particular stimulus arrives, or when the system is in a specific state?"
  • It does not capture all possible behaviors unless you collect every trace

Comprehensive Language

  • Shows the complete behavior of the system
  • Usually State Machine or Statechart diagrams

image

Activity diagram

image

Activity diagram example

image

BPMN

image

BPMN Example (Same diagram as Activity diagram, but with another focus)

image

Sequence (rely on Trace-oriented Language)

  • Sequence diagram shows only one trace, not all possible traces image

Sequence diagram example

image

Sequence diagram in UML

image

State Machine diagram

  • Not good for showing interactions among components because states are not software modules but states of a subsystem or system
  • Tend to be well understood by developers, but managers usually need extra explanation image

State Machine example

  • It's possible that two different transitions become eligible at the same time
  • A state machine diagram can depict parallel processes
  • One state machine can depend on another state machine image

Documenting: Context Diagrams. Their Notations. System, environment, Relations. in C&C and Layered view.

image image image

Example and explanation

General idea

  • Apply the context diagram to only one view so that you do not repeat yourself
  • On a context diagram you may show the system itself and its runtime interactions with the environment
  • Runtime interactions suggest using a C&C view for that image

Context in layered View

image

In other views

image

Notations. Boxes and Lines, UML.

image image

Documenting: Decisions. Capturing Complex Architecture Decisions. 12 steps-parts document

General rule. When and What

  • If a decision took 5 minutes, it's probably not worth documenting.
  • If it took 5 days, it is.
  • Document the future steps, especially if you have concerns about the decisions you have made. This is a good starting point for the next conversation with your manager. image image image

Complex Architecture Decisions

  • Consists of 12 steps. All are on the slides image image

  • In section 9 you may record the temporary decisions you have made. This is useful for cases where you hit a dead end and need to go back and reconsider.

  • Under Affected Artifacts you can put what is affected: budgets and/or schedules. image

Documenting: Evaluation of Alternatives and Objectives

image

View Packets. Alternatives. How to build and document

Documenting: View Packets
  • View packets are a way to split complex diagrams into parts.

  • Each view packet has a parent, children, and/or siblings.

  • If you use view packets, the overall document will look a bit different image

  • View packets can also be very useful in ADD, because inside a view packet you can address specific attribute questions.

  • Create a view and assign it particular responsibilities

  • It is also useful to keep chronological predecessors image

Examples when view packets could be helpful

image image

Alternatives

  1. Make one huge, unwieldy diagram, but present it in a tool with zoom-in, zoom-out, and fly-through abilities
  2. Series of diagrams

Summary

image

Views and Beyond. Alternatives: ISO 42010 (ISO\IEC 42010), DoDAF, Documentation in Agile

Views and Beyond. Template for Beyond view and rationale

General structure

image image image

Mapping

image

Directory

  • in directory you may find better outcome from stakeholders than you expect.
  • here you can discuss acronyms image

Background

image

ISO\IEC 42010 (also known as IEEE 1471-2000). ISO42010 vs "Views and Beyond". Alternative to "Views and Beyond"

ISO\IEC 42010. Also known as IEEE 1471-2000. Enterprise standard.

image image image

Structure

image

42010 also defines a view as a representation of the whole system from the perspective of a related set of concerns

![image](https://user-images.githubusercontent.com/4239376/224509958-7a6c3bf4-936d-496d-9ad0-ca90d831ae67.png)

42010. Content Requirements

image image image

ISO 42010 vs Views and Beyond. Differences and Similarities

ISO42010 vs Views and Beyond

image image

  • 42010 requires explicit identification of stakeholders and their concerns image

  • 42010 requires that you define the viewpoints. In Views and Beyond it's close to "architecture styles" image image

DoDAF. Alternatives to "Views and Beyond"

Summary. NOT A GREAT CHOICE.

image image

General Structure

image

DoDAF Pros

image

DoDAF Cons

image

image

Operational View. All-view

image image

Documentation in Agile. Alternatives to "Views and Beyond"

Summary

  • ISO42010 works well in an Agile environment, so it is better to go straight to this format
  • Views and Beyond works well in Agile too, and it can be made compliant with ISO42010

General

image image

Points

image image

BDUF and SAFe

image image

Reviewing Architecture Documentation. 6 Steps. Active Reviewing

6 steps of Documentation Review Process

Overall

  • Avoid the type of review process when you gather people two weeks before deadline. image

Step 1. Identify Stakeholders

image image

Step 2. Identify Artifacts needed to be on hand for the review

Also it's important to understand what else stakeholders need along with architecture diagrams

image

Step 3. Build question set

The worst scenario is asking reviewers to just "let us know what you think!"

image

Step 4. Plan the Review. Manage date, participants, assign questions

Options: a group workshop with everyone in one room, or separate meetings in parallel. One-on-one interviews are also an option

image image

New option: Active design review

  • Avoid the type of review process when you gather people two weeks before deadline. image

Step 5. Perform the Review session

  • Gather questions and answers
  • Gather the strengths and weaknesses of your document image

Step 6. Summarize results

  • Aggregate questions from reviewers
  • Find problems in the document. It's not a simple "Pass/Fail"; it's about finding weaknesses in the documentation image

Architecture Evaluation. Measure and Output. ATAM

Architecture Evaluation. Approaches and Techniques. Evaluation Output

Measuring Techniques:

image
image

image

Evaluation Output:

image

ATAM. Risk identification method

ATAM Conceptual flow

image

The point of ATAM is only to find risks, not to mitigate them. You do that by asking the right questions of architects, senior designers, and key developers. So it is a risk identification method, not a risk resolution method. We do not provide precise analysis.

image image

When to use:

image

  • After creating architecture, but not so much code is in place;
  • Check existing system architecture, evaluate it;
  • Decide whether we will build this system or buy it from 3rd party vendor;

ATAM Phases

image

Phase 0: gather a small group of architects and evaluators and discuss what you are going to evaluate, what you have, etc.

image

Phase 1: explain what ATAM is about, what the process is, who needs to be there, and what the expectations are

Business goals from Project Owners;

Architect presents an architecture;

Utility Tree: each scenario gets two ratings of L, M, or H. The first value is business importance (High to Low); the second is risk (High to Low).

image image image

H,H scenarios - our main business scenarios and drivers we must focus on
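The (importance, risk) ranking of a utility tree can be sketched as a simple sort. This is only an illustration: the scenario names and the `Scenario` and `prioritize` helpers are hypothetical, not part of ATAM itself.

```python
from dataclasses import dataclass

# Numeric rank for the L / M / H ratings used in the utility tree.
RANK = {"H": 3, "M": 2, "L": 1}


@dataclass(frozen=True)
class Scenario:
    name: str
    importance: str  # business importance: H, M or L (1st value)
    risk: str        # risk: H, M or L (2nd value)


def prioritize(scenarios):
    """Order scenarios by importance first, then by risk, highest first."""
    return sorted(
        scenarios,
        key=lambda s: (RANK[s.importance], RANK[s.risk]),
        reverse=True,
    )


# Hypothetical scenarios, for illustration only.
scenarios = [
    Scenario("Nightly report finishes within 1 hour", "L", "M"),
    Scenario("Checkout responds within 2 s at peak load", "H", "H"),
    Scenario("Add a new payment provider within 2 weeks", "H", "M"),
    Scenario("Fail over to the second region within 5 minutes", "M", "H"),
]

ordered = prioritize(scenarios)
# The (H, H) scenarios are the ones the evaluation should focus on.
focus = [s.name for s in ordered if (s.importance, s.risk) == ("H", "H")]
```

Sorting by importance before risk is a design choice made for this sketch; a team may weigh the two ratings differently.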

Scenarios:

image image image image

Risks and Tradeoffs. Non-risks:

image image

Non-risks may become risks if the situation changes

Phase 2. More like a QAW (Quality Attribute Workshop) with scenario brainstorming. We do more architecture analysis, gather more scenarios from a broader group, and map them onto architecture approaches:

image image

Phase 3. Formal meeting\presentation about findings

image image

Architecture Rule of Thumb. Rules to follow in architecture

https://www.linkedin.com/pulse/architecture-10-rules-thumb-matthew-golzari/
https://medium.com/@i.gorton/six-rules-of-thumb-for-scaling-software-architectures-a831960414f9

Carnegie Mellon University Slides. Solution Architecture Principles in Practice

Solution Architecture Principles in Practice.pdf
Solution Architecture Principles in Practice,_Student_Workbook_2020.pdf

SEI Exams

SEI Software Architecture Professional Exam: https://www.sei.cmu.edu/education-outreach/courses/course.cfm?courseCode=V19
Service-based Architecture Professional Cert: https://www.sei.cmu.edu/education-outreach/credentials/credential.cfm?customel_datapageid_14047=15189

Architecture Materials. AZ-304 SA Azure Architecture Design Exam materials

Powershell intro

PowerShell materials from Skylines Academy

SLA, SLO, SLI Difference: https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli

  • SLI explained in detail: http://cs.brown.edu/courses/csci2952-f/slides/Class9.pdf

SLI of the platform:

  • Critical Replica Threshold = CRT
  • Available replicas = min(total available pods, CRT)
  • Replica availability = (available replicas / CRT) * 100
  • Critical Replica Availability = mean(replica availability of each service)
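The platform SLI formulas above can be written as a short Python sketch. The service names and pod counts are made up for illustration, and the function names are not taken from any monitoring tool.

```python
from statistics import mean


def replica_availability(available_pods: int, crt: int) -> float:
    """Percentage of the Critical Replica Threshold (CRT) that is available."""
    if crt <= 0:
        raise ValueError("CRT must be positive")
    # Pods above the CRT do not raise availability beyond 100%.
    available_replicas = min(available_pods, crt)
    return available_replicas / crt * 100


def critical_replica_availability(services: dict) -> float:
    """Mean replica availability over all services.

    `services` maps a service name to a tuple (available pods, CRT).
    """
    return mean(replica_availability(pods, crt) for pods, crt in services.values())


# Hypothetical services: name -> (total available pods, CRT)
services = {"api": (3, 3), "worker": (1, 2), "gateway": (5, 4)}
platform_sli = critical_replica_availability(services)
```

Here `worker` runs 1 of its 2 critical replicas, so it contributes 50%, while `gateway` is capped at 100% even though it runs more pods than its CRT.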

AZ-300 AZ-301 AZ-304 AZ-305 Exam tips

AZ+301 SKYLINES ACADEMY Slides_Student_Version.pdf

SKYLINES ACADEMY MATERIALS:

SECTION-1 Workload requirements
SECTION-2 Identity and Security
SECTION-3 Data Platform Solutions
SECTION-4 Business Continuity
SECTION-5 Deployment, Migration, Integration
SECTION-6 Infrastructure Strategy

Insights and everything related to that.

Cost Optimizations materials

Security materials

  • Azure AD, Azure AD Connect, Azure AD Connect sync, Azure AD B2B, Azure AD B2C, Azure AD Conditional Policies
  • Managed Identities: user-assigned and system-assigned
  • Azure Key Vault, when to use
  • SAS Token, when to use

All information related to Authorization and Authentication is here: Authentication, Authorization, Azure AD and features

Design Security

RBAC:
RBAC
Policy:
policy

Consider how Azure policy is different from role-based access control (RBAC).
It’s important not to confuse Azure Policy and Azure RBAC. Azure RBAC and Azure Policy should be used together to achieve full scope control.

  1. You use Azure Policy to ensure that the resource state is compliant to your organization’s business rules. Compliance doesn’t depend on who made the change or who has permission to make changes. Azure Policy will evaluate the state of a resource, and act to ensure the resource stays compliant.

  2. You use Azure RBAC to focus on user actions at different scopes. Azure RBAC manages who has access to Azure resources, what they can do with those resources, and what areas they can access. If actions need to be controlled, then use Azure RBAC. If an individual has access to complete an action, but the result is a non-compliant resource, Azure Policy still blocks the action.
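The interplay of the two mechanisms can be illustrated with a toy model. It does not call any Azure API: the data structures, the prefix-based scope check, and the function names are assumptions made for this sketch.

```python
def rbac_allows(role_assignments, user, action, scope):
    """RBAC is default deny: allowed only if some role assignment grants the action."""
    return any(
        a["user"] == user
        and action in a["actions"]
        and scope.startswith(a["scope"])  # crude stand-in for scope inheritance
        for a in role_assignments
    )


def policy_violations(policies, resource):
    """Policy is default allow: compliant unless a rule flags the resource."""
    return [p["name"] for p in policies if p["deny_if"](resource)]


def can_deploy(role_assignments, policies, user, action, scope, resource):
    """A request must pass both checks: who may act (RBAC) and what state is allowed (Policy)."""
    if not rbac_allows(role_assignments, user, action, scope):
        return False, ["RBAC: no role assignment grants this action"]
    violations = policy_violations(policies, resource)
    return (not violations), violations


# Hypothetical assignments and rules, for illustration only.
role_assignments = [
    {"user": "alice", "actions": {"vm/write"}, "scope": "/subscriptions/demo"},
]
policies = [
    {"name": "allowed-locations", "deny_if": lambda r: r["location"] not in {"westeurope"}},
]

scope = "/subscriptions/demo/rg1"
blocked_by_policy = can_deploy(role_assignments, policies, "alice", "vm/write", scope, {"location": "eastus"})
allowed = can_deploy(role_assignments, policies, "alice", "vm/write", scope, {"location": "westeurope"})
blocked_by_rbac = can_deploy(role_assignments, policies, "bob", "vm/write", scope, {"location": "westeurope"})
```

In this model `alice` has permission to create the VM, yet the request for `eastus` is still rejected because the resulting resource would be non-compliant.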

| Area | Azure Policy | Role-based Access Control |
| --- | --- | --- |
| Description | Ensure resources are compliant with a set of rules. | Authorization system to provide fine-grained access controls. |
| Focus | Focused on the properties of resources. | Focused on what resources the users can access. |
| Implementation | Specify a set of rules. | Assign roles and scopes. |
| Default access | By default, rules are set to allow. | By default, all access is denied. |

Trust Center, Compliance Manager, Data Protection, Azure Security and Compliance, Blueprints

Microsoft Trust Center

  • In-depth access to FedRAMP, ISO, and SOC audit reports, data protection white papers, and assessment reports
  • Centralized resources around security, compliance and privacy

Compliance Manager

  • Manage compliance from a central location
  • Proactive risk assessment
  • Insights and recommended actions
  • Prepare compliance reports for audit

Data Protection Resources

  • Trust documents, GDPR, Compliance guides, Pen test and Security Assessment tests

Azure Security and Compliance, Blueprints

  • Industry-specific overview and guidance
  • Customer responsibilities matrix
  • Reference architectures with threat models

Cache

Cache. Patterns. Challenges. Info

image

Cache consistency models

image

Cache Challenges

Cache. Challenges

image image image image

Cache Replacement Policies

Cache. Cache Replacement Policies. Performance Metrics

Cache Replacement Policies

image image

Cache Performance Metrics

image

Governance and Compliance materials

Design Governance

Lab. Microsoft Docs. Azure Blueprints. Additional Materials

Lab:
https://docs.microsoft.com/en-us/learn/paths/design-identity-governance-monitor-solutions/

Microsoft docs:
https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/govern/guides/
https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/govern/guides/standard/

Azure blueprints

With Azure Blueprints, the relationship between the blueprint definition (what should be deployed) and the blueprint assignment (what was deployed) is preserved.

In other words, Azure creates a record that associates a resource with the blueprint that defines it. This connection helps you track and audit your deployments. Azure Blueprints orchestrates the deployment of various resource templates and other artifacts.

Blueprint

How are Azure Blueprints different from Azure Policy

  1. A policy is a default allow and explicit deny system focused on resource properties during deployment and for already existing resources. It supports cloud governance by validating that resources within a subscription adhere to requirements and standards.

  2. A policy can be included as one of many artifacts in a blueprint definition. Including a policy in a blueprint enables the creation of the right pattern or design during assignment of the blueprint. The policy inclusion makes sure that only approved or expected changes can be made to the environment to protect ongoing compliance to the intent of the blueprint.

Additional Materials

Build a cloud governance strategy on Azure
Describe core Azure architectural components
Microsoft Cloud Adoption Framework for Azure
Intro to Azure blueprints

Data materials

Design for Data Storage

Azure SQL Server vs Azure SQL managed instance. Difference

* Azure SQL Server vs Azure SQL Managed Instance, difference: [CHECK THIS LINK](https://medium.com/awesome-azure/azure-difference-between-azure-sql-database-and-azure-sql-managed-instance-sql-mi-2e61e4485a65)

AKS. Persistent Volumes - Types of replication

In the case of AKS:

Infrastructure-based asynchronous replication

  • Your apps might require persistent storage. In Kubernetes, you can use persistent volumes to persist data storage. These persistent volumes are mounted to a node VM and then exposed to the pods. Typically, you provide a common storage point where apps write their data. This data is then replicated across regions and accessed locally, as displayed in the following graphic.
    image

Application-based asynchronous replication

  • Kubernetes currently provides no native implementation for application-based asynchronous replication. However, because containers and Kubernetes are loosely coupled, you should be able to use any traditional app or language approach to replicate storage.
    image

Consider Azure Backup or Velero
As with any app, it's important you back up the data related to your AKS clusters and their apps. When your apps consume and store data which is persisted on disks or in files, you should schedule frequent backups or take regular snapshots of that data. You can use several tools for these backup operations, including:

  • Azure Disks: Azure Disks can use built-in snapshot technologies. However, your apps might need to flush writes-to-disk before the snapshot operation.
  • Velero: Velero can back up persistent volumes along with additional cluster resources and configurations.

Design for Relational Data Storage

Business Critical Tier. General Purpose
  • General Purpose:
    image

  • Business Critical Tier:
    The next service tier to consider is Business Critical, which can generally achieve the highest performance and availability of all Azure SQL service tiers (General Purpose, Hyperscale, Business Critical). Business Critical is meant for mission-critical applications that need low latency and minimal downtime.

image

Azure SQL Types. Difference

image image

Azure SQL Data Replication types and strategies. Quorum and Leader-Follower patterns: Pros & Cons

Patterns used in Databases
  • Hyperscale: The Hyperscale service tier is currently available for Azure SQL Database, and not Azure SQL Managed Instance. This service tier has a unique architecture because it uses a tiered layer of caches and page servers to expand the ability to quickly access database pages without having to access the data file directly.

image

Data Replication

image image image

Leader and Follower Pattern

image image

Quorum in Databases

image image

Active Geo-Replication

image

Active geo-replication is available for:

  • Azure SQL Database: You can configure active geo-replication for any database in any elastic database pool.

You can use active geo-replication to:

  • Create a readable secondary replica in a different region.
  • Fail over to a secondary database if your primary database fails or needs to be taken offline.

Data Retention in Azure SQL DBs (Managed Instance and SQL Server)

image

How to manage situations when you need data retention for more than 35 days

image

Design for Storage Accounts: File Share, Managed Disks (used by Azure VMs), Blob Storage

Materials are taken from this site: https://rajanieshkaushikk.com/2023/04/08/azure-blob-storage-vs-file-storage-vs-disk-storage-which-is-right-for-you/#:~:text=Azure%20File%20storage%20is%20not,low%20latency%20and%20high%20IOPS.

image

Azure File Share vs Azure Managed Disks vs Blob Storage

image
image
image
image

Design for Backups and Recovery

Design Networking

Traffic Manager

Traffic manager. General Info

traffic manager

Failover scenarios:

image

  • Manually, by using Azure DNS: this failover solution uses the standard DNS mechanism to fail over to your backup site. This option works best in conjunction with the cold standby or pilot light approaches.

  • Automatically, by using Traffic Manager: for more complex architectures with multiple sets of resources capable of performing the same function, you can configure Azure Traffic Manager (which is DNS-based). Traffic Manager checks the health of your resources and automatically routes traffic away from unhealthy resources to healthy ones.
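The automatic option can be pictured as priority-based endpoint selection. The sketch below is a toy model of that idea, not the Traffic Manager API: the endpoint names, the `priority` field, and the `is_healthy` callback are assumptions for illustration.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the healthy endpoint with the lowest priority number, or None if all are down."""
    healthy = [e for e in endpoints if is_healthy(e["name"])]
    return min(healthy, key=lambda e: e["priority"], default=None)


# Hypothetical endpoints: a primary region and a standby region.
endpoints = [
    {"name": "primary-westeurope", "priority": 1},
    {"name": "standby-northeurope", "priority": 2},
]

# Normal operation: every health probe succeeds.
normal = pick_endpoint(endpoints, lambda name: True)

# Primary region outage: the probe for the primary endpoint fails.
down = {"primary-westeurope"}
during_outage = pick_endpoint(endpoints, lambda name: name not in down)

# Total outage: nothing healthy is left to route to.
total_outage = pick_endpoint(endpoints, lambda name: False)
```

Because the real service is DNS-based, clients only notice such a switch once their cached DNS answer expires, so the configured TTL bounds how fast failover can be.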

High Availability with Traffic Manager:

| Approach | Description |
| --- | --- |
| Active/Passive with cold standby | Your VMs (and other appliances) that are running in the standby region aren't active until needed. However, your production environment is replicated to a different region. This approach is cost-effective but takes longer to undertake a complete failover. |
| Active/Passive with pilot light | You establish the standby environment with a minimal configuration; it has only the necessary services running to support a minimal and critical set of apps. In its default form, this approach can only execute minimal functionality. However, it can scale up and spawn more services, as needed, to take more of the production load during a failover. |
| Active/Passive with warm standby | Your standby region is pre-warmed and is ready to take the base load. Auto scaling is on, and all the instances are up and running. This approach isn't scaled to take the full production load but is functional, and all services are up and running. |

Messaging Systems. Messaging Patterns. Kafka

Kafka. Kafka Basics. Kafka Cluster

Kafka Basics. Record. Topics. Consumers. Consumer Groups. Load Balancing. Compression and Batching

image image image

Consumers

Consumers are the applications that subscribe to (read and process) data from Kafka topics. Consumers subscribe **to one or more topics** and consume published messages by pulling data from the brokers.

Records

A record is a message or an event that gets stored in Kafka. Essentially, it is the data that travels from producer to consumer through Kafka. A record contains a key, a value, a timestamp, and optional metadata headers.

Consumer Groups. A consumer group can have multiple consumers that subscribe to the same topic, allowing the system to process messages in parallel.

Consumer groups are a way to manage multiple consumers of a messaging system that work together to process messages from one or more topics.  
Each consumer group ensures that all messages in the topic are processed, and each message is processed by only one consumer within the group.  
This approach allows for parallel processing and load balancing among consumers.  
For example, in Apache Kafka, a consumer group can have multiple consumers that subscribe to the same topic, allowing the system to process messages in parallel and distribute the workload evenly.
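The "only one consumer within the group" guarantee comes from giving each partition to exactly one group member. Below is a simplified round-robin sketch of that assignment; real Kafka assignors are configurable and more elaborate, and `assign_partitions` is a hypothetical helper, not a Kafka client call.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer in the group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = range(6)  # a topic with 6 partitions

# Three consumers in the group share the partitions evenly.
three_members = assign_partitions(partitions, ["c1", "c2", "c3"])

# After c3 leaves, a rebalance spreads its partitions over the remaining members.
two_members = assign_partitions(partitions, ["c1", "c2"])

# With more consumers than partitions, the extra consumers stay idle.
too_many = assign_partitions(range(2), ["c1", "c2", "c3"])
```

Since parallelism is bounded by the number of partitions, adding consumers beyond that number does not increase throughput.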

Load Balancing in Kafka (balancing messages in partitions)

Kafka producers use an algorithm referred to as "sticky partitioning" to balance load across partitions.  
For records without a key, the producer sticks to one partition until the current batch is full (or the linger time expires) and then switches to another partition for the next batch; records with a key are always hashed to the same partition. This spreads load evenly across partitions while keeping batches large, so that partitions remain balanced.
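A toy model of producer-side partition selection, under these assumptions: keyed records hash to a fixed partition, and key-less records stay on one partition until a batch of `batch_size` records is full. `zlib.crc32` stands in for Kafka's own hash function, and the next partition is picked round-robin for simplicity. `ToyPartitioner` is hypothetical and not part of any Kafka client.

```python
import zlib


class ToyPartitioner:
    def __init__(self, num_partitions: int, batch_size: int):
        self.num_partitions = num_partitions
        self.batch_size = batch_size
        self.current = 0   # partition the current key-less batch sticks to
        self.in_batch = 0  # key-less records already placed in the current batch

    def partition(self, key=None) -> int:
        if key is not None:
            # The same key always maps to the same partition.
            return zlib.crc32(key.encode()) % self.num_partitions
        if self.in_batch == self.batch_size:
            # The batch is full: move on to the next partition.
            self.current = (self.current + 1) % self.num_partitions
            self.in_batch = 0
        self.in_batch += 1
        return self.current


partitioner = ToyPartitioner(num_partitions=3, batch_size=2)
keyless = [partitioner.partition() for _ in range(6)]
keyed_first = partitioner.partition("user-1")
keyed_again = partitioner.partition("user-1")
```

Sticking to one partition per batch keeps batches large, which is what makes batching and compression effective.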

Compression and Batching in Kafka

Message batching is the process of combining multiple messages into a single batch before processing or transmitting them.  
This approach can improve throughput and reduce the overhead of processing individual messages.  
  
Compression, on the other hand, reduces the size of the messages, leading to less network bandwidth usage and faster transmission.  
For example, Apache Kafka supports both batching and compression:  
Producers can batch messages together, and the system can compress these batches using various compression algorithms like Snappy or Gzip, reducing the amount of data transmitted and improving overall performance.
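The effect of batching before compressing can be seen with the standard library alone. This sketch uses JSON and `gzip` purely for illustration; Kafka uses its own binary record-batch format, and the sample messages are made up.

```python
import gzip
import json


def encode_individually(messages):
    """Compress every message on its own."""
    return [gzip.compress(json.dumps(m).encode()) for m in messages]


def encode_batch(messages):
    """Combine all messages into one batch and compress the batch once."""
    return gzip.compress(json.dumps(messages).encode())


# Hypothetical sensor readings with a lot of repetition between messages.
messages = [{"sensor": "temp-1", "reading": 20 + i % 3, "unit": "C"} for i in range(200)]

individual_bytes = sum(len(blob) for blob in encode_individually(messages))
batch_bytes = len(encode_batch(messages))

# The batch still decodes to exactly the same messages.
decoded = json.loads(gzip.decompress(encode_batch(messages)))
```

Compressing the whole batch lets the compressor exploit repetition across messages and pays the per-stream overhead only once, so `batch_bytes` ends up far smaller than `individual_bytes`.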

Kafka Cluster. ZooKeeper

image

Kafka cluster

Kafka is deployed as a cluster of one or more servers, where each server is responsible for running one Kafka broker.

ZooKeeper

ZooKeeper is a distributed key-value store and is used for coordination and storing configurations. It is highly optimized for reads.  
Kafka uses ZooKeeper to coordinate between Kafka brokers; ZooKeeper maintains metadata information about the Kafka cluster.

Kafka vs RabbitMQ vs ActiveMQ vs Azure ServiceBus vs AWS SQS

image

Kafka vs ServiceBus

Kafka is ideal for streaming data in an at-least-once manner and provides powerful features such as transmission of data over partitions, replication, and high-availability across multiple data centers.  
It is optimized for large-scale data streaming and has built-in support for high throughput, low latency and scalability.   
  
Azure Service Bus is a message broker used for messaging rather than streaming. It targets queue and topic scenarios and supports at-least-once delivery (peek-lock mode) as well as at-most-once delivery (receive-and-delete mode).    
It supports mobile devices, and provides integration with other Azure services such as Azure Storage and Service Fabric.  

Kafka vs AWS SQS

Kafka is ideal for streaming data in an at-least-once manner and provides powerful features such as transmission of data over partitions, replication, and high-availability across multiple data centers.  
It can be used to ingest data from multiple sources to multiple destinations and is optimized for large-scale streaming.  

AWS SQS is not suitable for streaming and is used for message queuing scenarios. Standard queues provide at-least-once delivery; FIFO queues add ordering and exactly-once processing.  
It supports mobile devices, and provides integration with other AWS services.  

Patterns with Kafka and Other Services. Q&A

Messaging Patterns with Kafka. Point to Point. Pub-Sub. Request-Response. Fan-Out/Fan-In (Scatter-Gather). Dead Letter Queue

Patterns not covered in the images:

  • Competing consumers
  • Guaranteed delivery
  • Content-based routing
  • Routing slip
  • Correlation identifier
  • Routing by header
  • Receiver-initiated workflow
  • Routing using selectors
  • Sagas

image image image image image

Q&A:

Which messaging pattern fits better for data stream processing?
The best messaging pattern for data stream processing is Publish/Subscribe. This pattern is typically used for passing data between applications, decoupling producers and consumers, and ensuring that messages are distributed to all interested parties in the system.

System Design Interview. Low-level System Design Interview. High-level System Design Interview

CAP Theorem. PACELC Theorem. Examples

CAP Theorem
  • Consistency ( C ): All nodes see the same data at the same time. This means users can read or write from/to any node in the system and will receive the same data. It is equivalent to having a single up-to-date copy of the data.
  • Availability ( A ): Availability means every request received by a non-failing node in the system must result in a response. Even when severe network failures occur, every request must terminate. In simple terms, availability refers to a system's ability to remain accessible even if one or more nodes in the system go down.
  • Partition tolerance ( P ): A partition is a communication break (or a network failure) between any two nodes in the system, i.e., both nodes are up but cannot communicate with each other.
    A partition-tolerant system continues to operate even if there are partitions in the system. Such a system can sustain any network failure that does not result in the failure of the entire network.
    Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.

image

PACELC Theorem. General Info: ACID vs BASE. PACELC Theorem Examples

General Info.

  • We cannot avoid partitions in a distributed system; therefore, according to the CAP theorem, a distributed system must choose between consistency and availability.
  • ACID (Atomicity, Consistency, Isolation, Durability) databases, such as RDBMSs like MySQL, Oracle, and Microsoft SQL Server, chose consistency (refuse response if it cannot check with peers), while BASE (Basically Available, Soft-state, Eventually consistent) databases, such as NoSQL databases like MongoDB, Cassandra, and Redis, chose availability (respond with local data without ensuring it is the latest with its peers).

image

Examples

  • Dynamo and Cassandra are PA/EL systems: They choose availability over consistency when a partition occurs; otherwise, they choose lower latency.
  • BigTable and HBase are PC/EC systems: They will always choose consistency, giving up availability and lower latency.
  • MongoDB can be considered PA/EC (default configuration): MongoDB works in a primary/secondaries configuration. In the default configuration, all writes and reads are performed on the primary.
    As all replication is done asynchronously (from primary to secondaries), when there is a network partition in which primary is lost or becomes isolated on the minority side, there is a chance of losing data that is unreplicated to secondaries, hence there is a loss of consistency during partitions.
    Therefore it can be concluded that in the case of a network partition, MongoDB chooses availability, but otherwise guarantees consistency. Alternately, when MongoDB is configured to write on majority replicas and read from the primary, it could be categorized as PC/EC.

Consistent Hashing. Data Partitioning. Data Replication.

image

Q: Where is Consistent Hashing used for Data Partitioning?
A: Amazon's Dynamo and Apache Cassandra use Consistent Hashing to distribute and replicate data across nodes.

Q: In what other scenarios may we use Consistent Hashing for data servers?
A: In the following scenarios:

  • Any system working with a set of storage (or database) servers that needs to scale up or down based on usage, e.g., the system could need more storage during Christmas because of high traffic.
  • Any distributed system that needs dynamic adjustment of its cache usage by adding or removing cache servers based on the traffic load.
  • Any system that wants to replicate its data shards to achieve high availability.

Data Partitioning. Data Replication. Naive approach

Basics

Data partitioning: It is the process of distributing data across a set of servers. It improves the scalability and performance of the system.
Data replication: It is the process of making multiple copies of data and storing them on different servers. It improves the availability and durability of the data across the system.
Data partition and replication strategies lie at the core of any distributed system. A carefully designed scheme for partitioning and replicating the data enhances the performance, availability, and reliability of the system and also defines how efficiently the system will be scaled and managed.

Naive Approach

  1. How do we know on which node a particular piece of data will be stored?
  2. When we add or remove nodes, how do we know what data will be moved from existing nodes to the new nodes? Additionally, how can we minimize data movement when nodes join or leave?

image
PROS:

  1. Easy to create and understand

CONS:

  1. Hard to add or remove a node
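
To make the drawback concrete, here is a minimal sketch. It assumes the naive approach places a key on node `hash(key) mod N`, which the text above does not spell out; the function names and the `user-…` keys are hypothetical.

```python
import hashlib


def key_hash(key: str) -> int:
    """Stable integer hash of a key (MD5, so results do not change between runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def naive_node(key: str, node_count: int) -> int:
    """Naive placement (assumed): node index = hash(key) mod number of nodes."""
    return key_hash(key) % node_count


def moved_fraction(keys: list[str], old_count: int, new_count: int) -> float:
    """Fraction of keys whose node changes when the cluster is resized."""
    moved = sum(
        1 for k in keys if naive_node(k, old_count) != naive_node(k, new_count)
    )
    return moved / len(keys)


# Hypothetical keys, used only to measure how much data would have to move.
keys = [f"user-{i}" for i in range(10_000)]
print(f"4 -> 5 nodes: {moved_fraction(keys, 4, 5):.0%} of keys move")
```

With a uniform hash, a key stays on its node only when `hash mod 4` equals `hash mod 5`, so growing from 4 to 5 nodes relocates about four out of five keys. This is why adding or removing a node is expensive with this scheme.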
Consistent Hashing for Data Partitioning. Algorithm: MD5

Problem statement

Distributed systems can use Consistent Hashing to distribute data across nodes. Consistent Hashing maps data to physical nodes and ensures that only a small set of keys move when servers are added or removed.
Consistent Hashing stores the data managed by a distributed system in a ring. Each node in the ring is assigned a range of data.
image

How consistent hashing may help us

Whenever the system needs to read or write data, the first step it performs is to apply the MD5 hashing algorithm to the key. The output of this hashing algorithm determines within which range the data lies and hence, on which node the data will be stored.
Thus, the hash generated from the key tells us the node where the data will be stored.
image
PROS:

  1. When a node is added or removed, only a limited amount of data is affected
  2. When a node is removed, the next node on the ring takes over all operations of the removed node

CONS:

  1. Each node on the ring represents one real server, so the load distribution is often uneven
  2. Works well only in homogeneous systems. If the servers differ in capacity, you can't balance them well
  3. High chance of hotspots (when one server is used more often than the others)
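
The ring lookup described above can be sketched as follows. This is an illustration under assumptions, not the implementation used by Dynamo or Cassandra: the `HashRing` class, the server names, and the choice to give each node a single token derived from the MD5 hash of its name are all hypothetical.

```python
import bisect
import hashlib


def md5_hash(value: str) -> int:
    """Map a string to a position on the ring using MD5."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Basic consistent hashing: one token per physical node (no virtual nodes)."""

    def __init__(self, nodes=()):
        self._tokens = []  # sorted ring positions
        self._owners = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        token = md5_hash(node)
        bisect.insort(self._tokens, token)
        self._owners[token] = node

    def remove_node(self, node: str) -> None:
        token = md5_hash(node)
        self._tokens.remove(token)
        del self._owners[token]

    def get_node(self, key: str) -> str:
        # The owner is the first token at or after the key's hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_left(self._tokens, md5_hash(key)) % len(self._tokens)
        return self._owners[self._tokens[idx]]


ring = HashRing(["server-a", "server-b", "server-c", "server-d"])
keys = [f"user-{i}" for i in range(10_000)]

before = {k: ring.get_node(k) for k in keys}
ring.remove_node("server-c")
after = {k: ring.get_node(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
print("new owners of moved keys:", {after[k] for k in moved})
```

Only the keys that belonged to the removed server change owner, and they all land on the next node on the ring. That single successor absorbing the whole range is also the source of the uneven load listed in the CONS.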
Consistent Hashing. Virtual Nodes

image
PROS:

  1. The load spreads more evenly across the physical nodes in the cluster because the hash ranges are divided into smaller subranges. This also speeds up the rebalancing process after adding or removing nodes
  2. When a new node is added, it receives many Vnodes from the existing nodes to maintain a balanced cluster
  3. Many nodes participate in the rebuild process when a node needs to be rebuilt
  4. It's easier to maintain the data cluster if it consists of different machines (heterogeneous servers). More powerful machines may have more Vnodes than others
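
A rough sketch of the Vnode idea, under assumptions: each physical server is placed on the ring many times, and a more powerful server gets more positions. The `VnodeRing` class, the `name#vnode-i` token scheme, and the Vnode counts (100 and 200) are hypothetical choices for illustration.

```python
import bisect
import hashlib
from collections import Counter


def md5_hash(value: str) -> int:
    """Map a string to a position on the ring using MD5."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class VnodeRing:
    """Consistent hashing where each physical node owns many small subranges."""

    def __init__(self) -> None:
        self._tokens = []  # sorted ring positions of all Vnodes
        self._owners = {}  # ring position -> physical node name

    def add_node(self, node: str, vnodes: int = 100) -> None:
        # Assumed token scheme: hash "<node>#vnode-<i>" for each Vnode.
        for i in range(vnodes):
            token = md5_hash(f"{node}#vnode-{i}")
            bisect.insort(self._tokens, token)
            self._owners[token] = node

    def remove_node(self, node: str) -> None:
        self._tokens = [t for t in self._tokens if self._owners[t] != node]
        self._owners = {t: n for t, n in self._owners.items() if n != node}

    def get_node(self, key: str) -> str:
        idx = bisect.bisect_left(self._tokens, md5_hash(key)) % len(self._tokens)
        return self._owners[self._tokens[idx]]


ring = VnodeRing()
ring.add_node("small-1", vnodes=100)
ring.add_node("small-2", vnodes=100)
ring.add_node("large-1", vnodes=200)  # more capacity, so more Vnodes

load = Counter(ring.get_node(f"user-{i}") for i in range(20_000))
print(load)
```

The server with twice as many Vnodes should receive roughly twice as many keys. When a server is removed, its subranges are scattered around the ring, so its keys are taken over by several servers instead of one.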
Consistent Hashing. Data Replication using Consistent Hashing

image

System Design Interview: General Rules. Step by Step guide

Step by Step guide by Design Gurus. 3 Steps: Clarify Requirements, Expectations & Estimations, System interface definition

Clarify Requirements

  1. Ask questions about the problem you are trying to solve; design questions in an interview are open-ended
    a) There is no ONE correct answer. You must clarify ambiguities
    b) You need to know which aspects to focus on

Question examples:

  • Will users of our service be able to post tweets and follow other people?
  • Should we also design to create and display the user's timeline?
  • Will tweets contain photos and videos?
  • Are we focusing on the backend only, or are we developing the front-end too?
  • Will users be able to search tweets?
  • Do we need to display hot trending topics?
  • Will there be any push notification for new (or important) tweets?

Define API Interface

  • Define what APIs are expected from the system. This will establish the exact contract expected from the system and ensure we haven't gotten any requirements wrong. Some examples of APIs for our Twitter-like service will be:
    postTweet(user_id, tweet_data, tweet_location, user_location, timestamp, …)
    generateTimeline(user_id, current_time, user_location, …)
    markTweetFavorite(user_id, tweet_id, timestamp, …)
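
One way to pin this contract down is a typed interface. The sketch below uses only the parameters named above (the elided ones are left out); all parameter types, return types, and the `InMemoryTweetService` stand-in are assumptions made for illustration, not part of the original guide.

```python
from datetime import datetime, timezone
from typing import Protocol


class TweetService(Protocol):
    """Contract sketch; parameter and return types are assumptions."""

    def post_tweet(self, user_id: str, tweet_data: str, tweet_location: str,
                   user_location: str, timestamp: datetime) -> str:
        """Store a tweet and return its ID."""
        ...

    def generate_timeline(self, user_id: str, current_time: datetime,
                          user_location: str) -> list[str]:
        """Return tweet IDs for the user's timeline."""
        ...

    def mark_tweet_favorite(self, user_id: str, tweet_id: str,
                            timestamp: datetime) -> None:
        """Record that the user marked the tweet as a favorite."""
        ...


class InMemoryTweetService:
    """Toy stand-in used only to exercise the contract; not a real design."""

    def __init__(self) -> None:
        self.tweets: dict[str, tuple[str, str, datetime]] = {}
        self.favorites: set[tuple[str, str]] = set()

    def post_tweet(self, user_id, tweet_data, tweet_location, user_location,
                   timestamp):
        tweet_id = f"t{len(self.tweets) + 1}"
        self.tweets[tweet_id] = (user_id, tweet_data, timestamp)
        return tweet_id

    def generate_timeline(self, user_id, current_time, user_location):
        # Placeholder rule: the user's own tweets up to current_time, newest first.
        own = [(ts, tid) for tid, (uid, _, ts) in self.tweets.items()
               if uid == user_id and ts <= current_time]
        return [tid for ts, tid in sorted(own, reverse=True)]

    def mark_tweet_favorite(self, user_id, tweet_id, timestamp):
        if tweet_id not in self.tweets:
            raise KeyError(tweet_id)
        self.favorites.add((user_id, tweet_id))


service: TweetService = InMemoryTweetService()
now = datetime.now(timezone.utc)
tweet_id = service.post_tweet("u1", "hello", "Berlin", "Berlin", now)
service.mark_tweet_favorite("u2", tweet_id, now)
print(service.generate_timeline("u1", now, "Berlin"))
```

Writing the signatures down this early tends to surface missing requirements, for example whether the timeline should return IDs or full tweets.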
Step by Step guide by Design Gurus. Next 4 Steps: Define data model, Define Database Type, High-level design

Defining Data Model

  • Defining the data model in the early part of the interview will clarify how data will flow between different system components. Later, it will guide data partitioning and management.

  • The candidate should identify various system entities, how they will interact with each other, and different aspects of data management like storage, transportation, encryption, etc. Here are some entities for our Twitter-like service:

    User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc.
    Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc.
    UserFollow: UserID1, UserID2
    FavoriteTweets: UserID, TweetID, TimeStamp
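
As a quick sketch, the entities above could be written down as plain data classes. Only the listed fields are included (whatever "etc." stands for is left out); the field types and the follower/followee reading of `UserID1`/`UserID2` are assumptions.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class User:
    user_id: int
    name: str
    email: str
    dob: date
    creation_date: datetime
    last_login: datetime


@dataclass
class Tweet:
    tweet_id: int
    content: str
    tweet_location: str
    number_of_likes: int
    timestamp: datetime


@dataclass(frozen=True)
class UserFollow:
    user_id1: int  # assumed: the follower
    user_id2: int  # assumed: the user being followed


@dataclass(frozen=True)
class FavoriteTweet:
    user_id: int
    tweet_id: int
    timestamp: datetime


# Example values are made up for illustration.
tweet = Tweet(tweet_id=1, content="hello", tweet_location="Berlin",
              number_of_likes=0, timestamp=datetime(2024, 1, 1, 12, 0))
follow = UserFollow(user_id1=10, user_id2=20)
favorite = FavoriteTweet(user_id=10, tweet_id=tweet.tweet_id,
                         timestamp=datetime(2024, 1, 1, 12, 5))
print(follow, favorite)
```

The two relationship entities are frozen because they behave like join-table rows: they are created or deleted, not edited in place.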

Defining Database Type

Which database system should we use? Will NoSQL like Cassandra best fit our needs, or should we use a SQL-like solution? What kind of block storage should we use to store photos and videos?

Questions here:
1) Do we need to be ACID-compliant?
2) Do we need to support Strong Data Consistency and Transactions?
3) What data structure do we have?
4) What is the volume of Read/Write operations?
In the case of Twitter, the answer is probably NoSQL: we can sacrifice Strong Consistency in favor of Availability (Low Latency).
To pick a specific database, we need to estimate the balance between read and write operations. We will probably read more often than we write, so Cassandra is a poor fit here and MongoDB is probably the better choice.

PS: If we are choosing among native NoSQL solutions, note that for DynamoDB there is essentially one good DB pattern, named "One Big Table" (single-table design). Many modeling patterns eventually lead you to One Big Table: https://www.alexdebrie.com/posts/dynamodb-single-table/

High-Level Design

image

  • Draw a block diagram with 5-6 boxes representing the core components of our system. We should identify enough components that are needed to solve the actual problem from end to end.
  • For Twitter, at a high level, we will need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution.
    If we're assuming that we will have a lot more read traffic (compared to write), we can decide to have separate servers to handle these scenarios. On the back-end, we need an efficient database that can store all the tweets and support a large number of reads. We will also need a distributed file storage system for storing photos and videos.
Step by Step guide by Design Gurus. Last Step: Low-Level Design with details. Identify Bottlenecks

image