- Data Structures
- Algorithms
- System Design
- Coding Assessment
- Python
- Big-O
- Bit Manipulation
- Dynamic Programming
- OSI Model
- Misc. Programming Concepts
- Powers of 10
- Resources
- Weighted graph: A graph that has values/weights associated with its edges
- Degree: Denotes the number of edges connected to a certain node
- In a directed graph, this is further broken down into indegree and outdegree
- Hamiltonian path: A path that visits each vertex exactly once
- Hamiltonian cycle/circuit: A Hamiltonian path that ends at its starting vertex, forming a cycle
- Directed graph (digraph): A graph whose edges have an explicit direction. This implies that relations between vertices are not always symmetric. In cases where they are symmetric (e.g. each directed edge has a counterpart edge in the opposite direction), the graph is considered bi-directional or parallel.
- Directed acyclic graph (DAG): A digraph that contains no cycles
  - Enables a topological sorting of vertices
    - Any DAG has a minimum of one topological ordering
  - Ideal for scheduled systems of events
    - The vertices of a graph can represent tasks to be performed
    - The edges between vertices can represent constraints, e.g. that a task must be completed after a task `A` but prior to a task `B`
- Connected graph: A graph with no isolated vertices. Each vertex is connected to at least one other vertex
  - A connected component is a subgraph that is internally connected but not connected to the vertices of the supergraph
  - In BFS/DFS, if the number of nodes visited is equal to the total number of vertices, then the graph can be considered connected
- Strongly-connected digraph: A graph in which every pair of vertices has a directed path both from `x` to `y` and from `y` to `x`. Otherwise, it is considered weakly-connected
  - Strongly-connected components: maximal subgraphs in which every vertex is reachable from every other vertex
- Adjacency List
  - A hashmap whose keys are vertices and the corresponding values are an unordered list of neighbors of a given vertex
  - More efficient than a matrix in regards to fetching a vertex's neighbors - `O(1)` lookup
  - Less efficient for ascertaining whether 2 vertices are neighbors - `O(|V|)` lookup
  - Slow to remove a vertex/edge because the entire map must be iterated over to remove any instances
  - A minimal sketch follows this list
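A minimal sketch of an adjacency list built as a dictionary of neighbor lists (the graph and vertex labels are illustrative):

from collections import defaultdict

# undirected graph stored as {vertex: [neighbors]}
adj_list = defaultdict(list)
edges = [("a", "b"), ("a", "c"), ("b", "c")]
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)

print(adj_list["a"])         # O(1) fetch of a vertex's neighbors -> ['b', 'c']
print("c" in adj_list["a"])  # O(|V|) membership check in the worst case -> True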
- Adjacency Matrix
  - Each (non-diagonal) entry `A[i][j]` is equal to the number of edges from vertex `i` to vertex `j`
    - If the graph is parallel, the matrix can be composed entirely with booleans
  - In a graph without cycles, the diagonal values in the matrix are `0`
  - Space efficient (1 bit per entry)
  - `O(1)` time to check if any 2 vertices are neighbors
  - Slow to add/remove vertices because the matrix will need to be resized
  - A minimal sketch follows this list
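A minimal sketch of the equivalent adjacency-matrix representation for the same three-vertex undirected graph (vertex indices are illustrative):

# vertices 0, 1, 2 with edges (0,1), (0,2), (1,2)
n = 3
adj_matrix = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2)]:
    adj_matrix[u][v] = 1
    adj_matrix[v][u] = 1

print(adj_matrix[0][1] == 1)  # O(1) check whether two vertices are neighbors -> True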
A tree is an undirected graph in which any two vertices are connected by exactly one path (not edge). In other words, a tree is an undirected graph without cycles.
- Depth: the number of edges from the root node to a node `n` (a root has a depth of `0`)
- Height: the number of edges from a node `n` to its deepest leaf (a leaf has a height of `0`)
Starting at the root node and going leftwards & downwards, the following traversals can be described like so (a minimal sketch follows this list):
- Pre-order: Visit the left side of each node
- In-order: Visit the bottom of each node
- Post-order: Visit the right side of each node
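A minimal recursive sketch of the three traversals, assuming a simple `Node` class with `left`/`right` children (class and function names are illustrative):

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def preorder(node):
    if not node: return []
    return [node.value] + preorder(node.left) + preorder(node.right)

def inorder(node):
    if not node: return []
    return inorder(node.left) + [node.value] + inorder(node.right)

def postorder(node):
    if not node: return []
    return postorder(node.left) + postorder(node.right) + [node.value]

root = Node(2, Node(1), Node(3))
print(preorder(root), inorder(root), postorder(root))  # [2, 1, 3] [1, 2, 3] [2, 3, 1]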
- Full tree: Each node (aside from the leaves) has either `0` or `2` children
- Complete tree: Every level, except possibly the last, is completely filled and all nodes in the bottom level are as far left as possible
- Perfect tree: A tree which is both full and wholly complete, i.e. the last level is entirely filled
- Balanced tree: A tree in which the heights of any left and right sub-trees differ by <= 1
- Tree rotation: An operation that changes the tree structure without affecting the original order of the nodes
Binary trees that are optimized for insertions and lookups, these trees have an additional constraint such that all nodes of a left sub-tree are lesser than the root and all nodes of the right sub-tree are greater. An inorder traversal of such a tree would then print nodes in a sorted, ascending order.
- BSTs have a performance that's proportional to their height
  - Therefore, a smaller height is algorithmically preferable
  - Balanced trees have a height of `O(logN)` for `N` nodes
  - The worst case scenario for an unbalanced tree can be `O(N)` (linked-list structure)
- A minimal insert/search sketch follows this list
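A minimal BST insert/search sketch under the ordering constraint described above (class and function names are illustrative):

class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    if not root:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def search(root, value):
    if not root or root.value == value:
        return root
    return search(root.left, value) if value < root.value else search(root.right, value)

root = None
for v in [5, 3, 8, 1]:
    root = insert(root, v)
print(search(root, 8) is not None)  # True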
A sub-graph of an undirected, connected graph which includes all vertices of the supergraph, trimmed down to contain the absolute minimum number of edges. The total number of spanning trees with `n` vertices that can be created from a complete graph is equal to `n^(n-2)`.
A minimum spanning tree is a logical extension of a spanning tree. The only difference is that, for a minimum spanning tree, weighted edges are taken into account in order to derive the global minimum cost required to connect all vertices.
Algorithms:
Given a connected, directed graph `G`, a shortest-path tree rooted at vertex `v` is a spanning tree `T` of `G`, such that the path distance from root `v` to any other vertex `u` in `T` is the shortest path distance from `v` to `u` in `G`.
A key component in SPTs, edge relaxation is the method by which the next-shortest path is determined. All node distances are initially assumed to be infinity. Starting from the source node `v`, the neighboring nodes are then greedily selected based on their distance from `v`. The selected node itself is then "visited" by this algorithm.
On each visit of the next-closest vertex `u`, the paths to every other vertex are re-calculated from `u`. If this recalculation leads to a lesser-weighted path to any vertex `w`, then the stored distance to `w` is updated to reflect the new, lesser value.
Eventually, once all the nodes have been visited, the shortest paths from a source vertex `v` to every other node in the graph will have been computed.
Note: An SPT is not guaranteed to be an MST. An SPT guarantees that any node `u` will have a locally optimized minimal path in regards to the source node `v`, whereas an MST guarantees a globally optimal cost to connect all vertices (without taking a source node into consideration).
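A minimal sketch of the relaxation step described above, assuming a `dist` map of current best distances from the source (all names and weights are illustrative):

import math

dist = {"v": 0, "u": math.inf, "w": math.inf}  # source "v"

def relax(u, w, weight):
    # if going through u yields a shorter path to w, record it
    if dist[u] + weight < dist[w]:
        dist[w] = dist[u] + weight

relax("v", "w", 7)   # dist["w"] becomes 7
relax("v", "u", 2)   # dist["u"] becomes 2
relax("u", "w", 3)   # dist["w"] improves to 5
print(dist)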
Algorithms:
A tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. This type of tree is well-suited for storage systems that read and write relatively large blocks of data (i.e. disks), and as such, it is commonly implemented in database and file systems.
Definition
- Every node has, at most, `m` children
- Every non-leaf node has at least `ceil(m/2)` children
- The root has at least two children
- A non-leaf node with `k` children contains `k-1` keys
- All leaves exist on the same level
Insertions
- Find a leaf node where the item should be inserted by iteratively comparing existing keys to the item-to-be-inserted
- If a leaf node can accommodate another key, insert into the leaf
- If it can't be accommodated, then the node is split into two. The median key is then promoted to the parent node. If the parent node itself is also full, the process is potentially repeated all the way back to the root node, in which case the height of the entire tree would be incremented by one
There are 4 main properties intrinsic to Red-Black trees:
- Each node is either red or black
- How the tree is "painted" ensures its balance
- Tree is constantly re-painted to maintain its red-black properties
- Root & all leaves are black
- Leaf nodes have no data
- If a node is red, then both of its children are black
- Every path from a given node to its descendant leaves has the same number of black nodes (not counting the original node)
  - Therefore, there exists a black height for every node, denoted as `bh(n)`
These properties thus reveal an axiom that a path from the root node to its farthest leaf node is no more than twice as long as the path from the root node to its nearest leaf node (shortest path is entirely black nodes while the longest is of alternating colors). This ensures the tree is roughly height-balanced and therefore optimally efficient.
A self-balancing binary search tree in which each node maintains extra information called a balance factor whose value is either `-1`, `0`, or `+1`.
- Balance factor of a node in an AVL tree is the difference between the height of the left subtree and that of the right subtree
- More stringent balancing than that of a Red-Black tree
- Slower insertion & removal, faster search
- The heights of two subtrees from any node differ at most by `1`
  - If at any point the heights differ by more than `1`, a rebalancing is performed via tree rotations
- A minimal balance-factor sketch follows this list
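A minimal sketch of computing heights and balance factors on a plain binary-tree node (the `Node` class is illustrative; a real AVL tree would store and update these values during insertions rather than recomputing them):

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def height(node):
    # an empty subtree has height -1 so a leaf ends up with height 0
    if not node:
        return -1
    return 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    # the AVL property requires this to stay within {-1, 0, +1}
    return height(node.left) - height(node.right)

root = Node(4, Node(2, Node(1)), Node(5))
print(balance_factor(root))  # 1 -> still balanced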
A complete binary tree that satisfies the heap property: for any given node `n`, its children are either all greater than or equal to `n` (min-heap) or all less than or equal to `n` (max-heap).
When represented as an array:
- Children live at indices `2i+1` (left child) and `2i+2` (right child)
- Parent lives at `(i-1)//2`
def heapify(arr, n, i):
    # Find largest among root and children
    largest = i
    l = 2 * i + 1
    r = 2 * i + 2
    if l < n and arr[i] < arr[l]:
        largest = l
    if r < n and arr[largest] < arr[r]:
        largest = r
    # If root is not largest, swap with largest and continue heapifying
    if largest != i:
        arr[i], arr[largest] = arr[largest], arr[i]
        heapify(arr, n, largest)
def heap_sort(arr):
    n = len(arr)
    # Build max heap
    for i in range(n//2, -1, -1):
        heapify(arr, n, i)
    for i in range(n-1, 0, -1):
        # Swap
        arr[i], arr[0] = arr[0], arr[i]
        # Heapify root element
        heapify(arr, i, 0)

arr = [1, 12, 9, 5, 6, 10]
heap_sort(arr)
An abstract data structure that addresses the connectivity between components of a given network. Disjoint sets do not contain cycles, therefore, they can be represented as trees.
- `find(vertex)`
- `union(vertex1, vertex2)`
- Initialize an array of size `V`, where `V` is the total number of vertices
  - For each index, assign its value to that of the node it's representing
- Iterate through a list of edges
  - Check whether the vertices of these edges belong to the same disjoint set
  - If not, union the vertices by updating the array to reflect the new set association
  - Continue iterating through the rest of the edges
- `O(1)` lookups but unions have a time complexity of `O(N)`
  - Each index in the array is assigned to the root node of its corresponding set
  - Therefore, looking up whether two nodes belong to a set is trivial: `return find(x) == find(y)`
  - Unions, however, necessitate all nodes in the joinee set to be updated with a new root value
class UnionFind:  # Quick Find (wrapper class name assumed)
    def __init__(self, size):
        self.root = [i for i in range(size)]

    def find(self, x):
        return self.root[x]  # efficient

    def union(self, x, y):
        rootX = self.find(x)
        rootY = self.find(y)
        if rootX != rootY:
            for i in range(len(self.root)):  # inefficient
                if self.root[i] == rootY:
                    self.root[i] = rootX
- `O(logN)` unions and finds
  - Each parent index in the array is assigned to another parent index
  - Therefore, finding whether two vertices belong to a set takes a bit longer. `find(x), find(y)` has to access the index of `x` and `y`, and if their values are not equal, the algorithm continuously travels up their lineages until it finds their respective root nodes
  - Unions, on the other hand, are very efficient. To union two sets, the parent node of `set2` is the sole node that needs to be updated in the array
class UnionFind:  # Quick Union (wrapper class name assumed)
    def __init__(self, size):
        self.root = [i for i in range(size)]

    def find(self, x):
        while x != self.root[x]:  # inefficient
            x = self.root[x]
        return x

    def union(self, x, y):
        rootX = self.find(x)
        rootY = self.find(y)
        if rootX != rootY:
            self.root[rootY] = rootX  # efficient
There are 2 main optimizations we can implement into the disjoint set union algorithm
- An extension of quick-union, union by rank provides optimizations for `union()`
  - Serves to limit the maximum height of each set
  - When unioning two vertices, instead of selecting either the root of `x` or `y` at random (as quick union does), the root of the vertex with the lesser "rank" is chosen as the unionee
    - Ranks are determined by tree height
  - Thus, the "smaller" tree is merged into the larger and the possibility of creating a set that mimics a linked-list (`O(N)`) structure is reduced
- The general concept is that the array structure should represent a tree in order for the time complexity to be reduced down to logarithmic
  - Fewer parent nodes, fewer iterations
class UnionFind:  # Union by Rank (wrapper class name assumed)
    def __init__(self, size):
        self.root = [i for i in range(size)]
        self.rank = [0] * size  # good!

    def find(self, x):
        while x != self.root[x]:
            x = self.root[x]  # could be better :(
        return x

    def union(self, x, y):
        rootX = self.find(x)
        rootY = self.find(y)
        if rootX != rootY:
            if self.rank[rootX] > self.rank[rootY]:
                self.root[rootY] = rootX
            elif self.rank[rootX] < self.rank[rootY]:
                self.root[rootX] = rootY
            else:
                self.root[rootY] = rootX
                self.rank[rootX] += 1  # union by rank
- This optimization leverages the fact that the parent node of a parent node must be computed in order to find a node's given root
- Since these computations are made along the entire length of the branch up to the root, the branch as a whole can be compressed
- The individual nodes composing this branch can be directly assigned to their ancestral root node rather than their respective parents
- This mimics the functionality in the Quick Find implementation and grants a best-case `O(1)` lookup time, `O(logN)` on average
class UnionFind:  # Path Compression (wrapper class name assumed)
    def __init__(self, size):
        self.root = [i for i in range(size)]

    def find(self, x):
        if x == self.root[x]:
            return x
        self.root[x] = self.find(self.root[x])  # great!
        return self.root[x]

    def union(self, x, y):
        rootX = self.find(x)
        rootY = self.find(y)
        if rootX != rootY:
            self.root[rootY] = rootX
class UnionFind:  # Path Compression + Union by Rank (wrapper class name assumed)
    def __init__(self, size):
        self.root = [i for i in range(size)]
        self.rank = [1] * size

    def find(self, x):
        if x == self.root[x]:
            return x
        self.root[x] = self.find(self.root[x])
        return self.root[x]

    def union(self, x, y):
        rootX = self.find(x)
        rootY = self.find(y)
        if rootX != rootY:
            if self.rank[rootX] > self.rank[rootY]:
                self.root[rootY] = rootX
            elif self.rank[rootX] < self.rank[rootY]:
                self.root[rootX] = rootY
            else:
                self.root[rootY] = rootX
                self.rank[rootX] += 1
- Nodes do not store keys, but rather their position within the trie defines the key
- e.g. They typically only store a character and the full string is built via depth traversal
- They also typically store a boolean value representing whether a given node signifies the end of a word
- Implemented as a `dictionary` of dictionaries
- `O(N)` lookup where `N` is the length of the word
- `O(M*N)` space complexity, for `M` words of length `N`
- A minimal implementation sketch follows this list
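A minimal sketch of the dictionary-of-dictionaries trie described above (the end-of-word marker symbol is an assumption):

class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#"] = True  # end-of-word marker (symbol assumed)

    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "#" in node

trie = Trie()
trie.insert("cat")
print(trie.search("cat"), trie.search("ca"))  # True False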
- Fast, slow pointers
  - Both start at the head but one pointer jumps 2 nodes at a time, the other just 1
  - When `fast == slow`, then a cycle exists
  - To find the length of the cycle, move one of the pointers one-at-a-time until `fast == slow` again
  - To find the start of the cycle, start 2 pointers at the `root`, move 1 pointer ahead by the cycle length, then increment both by 1 until `ptr1 == ptr2` (a minimal sketch follows this list)
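A minimal sketch of the cycle-start technique described above, assuming a simple singly-linked `ListNode` (class and function names are illustrative):

class ListNode:
    def __init__(self, value):
        self.value = value
        self.next = None

def find_cycle_start(head):
    slow = fast = head
    while fast and fast.next:
        slow, fast = slow.next, fast.next.next
        if slow == fast:                   # cycle detected
            cycle_length = 1
            current = slow.next
            while current != slow:         # measure the cycle
                current = current.next
                cycle_length += 1
            ptr1 = ptr2 = head
            for _ in range(cycle_length):  # move ptr2 ahead by the cycle length
                ptr2 = ptr2.next
            while ptr1 != ptr2:            # advance both until they meet at the cycle start
                ptr1, ptr2 = ptr1.next, ptr2.next
            return ptr1
    return None

# build 1 -> 2 -> 3 -> 4 -> back to 2
nodes = [ListNode(i) for i in range(1, 5)]
for a, b in zip(nodes, nodes[1:]):
    a.next = b
nodes[-1].next = nodes[1]
print(find_cycle_start(nodes[0]).value)  # 2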
Problem | Pattern |
---|---|
Sorted array | Binary search, 2 pointers |
Permutations, subsets, all possible solutions | Backtracking |
Recursion not permitted | Stack |
Max/min subarray, subset | Dynamic Programming |
Top, least K items | Heap |
Common strings | Map, trie |
Anything else | Hash table, set |
- Ideal when the objective is to calculate something among contiguous subarrays (windows) of an array or list
  - e.g. Given an array, find the average of all contiguous subarrays of size `K`
- Can reuse the previous window's calculation by simply removing the element that is exiting the window upon the next iteration - `O(n)`
def find_averages_of_subarrays(arr, k):
    results = []
    windowSum = 0
    windowStart = 0
    for windowEnd in range(len(arr)):
        windowSum += arr[windowEnd]
        if windowEnd >= k - 1:
            results.append(windowSum / k)
            windowSum -= arr[windowStart]
            windowStart += 1
    return results
- Ideal for sorted arrays/linked lists or finding a set (pair, triplet, subarray) of elements that fit a certain constraint
- e.g. Squaring a sorted array, finding a triplet that sum to 0
from typing import List

def remove_element(arr: List[int], key: int) -> int:
    nextElement = 0  # index of the next element which is not 'key'
    for i in range(len(arr)):
        if arr[i] != key:
            arr[nextElement] = arr[i]
            nextElement += 1
    return nextElement
- Typically reserved for cyclical arrays and linked lists
- e.g. Identifying palindromes, cycles
def has_cycle(head: Optional[ListNode]) -> bool:
    slow = fast = head
    while fast and fast.next:
        fast = fast.next.next
        slow = slow.next
        if fast == slow:
            return True
    return False
def find_happy_number(num):
    slow = num
    fast = num
    while True:
        slow = find_square_sum(slow)
        fast = find_square_sum(find_square_sum(fast))
        if slow == fast: break
    # check if the cycle is stuck on the number 1
    return slow == 1

def find_square_sum(num):
    sum = 0
    while num > 0:
        digit = num % 10
        sum += digit * digit
        num //= 10
    return sum
- Relevant for any problem involving overlapping intervals
  - Can be either finding these intervals or merging them
- If there are 2 intervals `A` and `B`, there are 6 possible outcomes: `A` and `B` do not overlap (`A` first), `A` and `B` overlap with `B` ending last, `A` completely overlaps `B`, and the three mirror cases with `B` first
class Interval:
    def __init__(self, start, end):
        self.start = start
        self.end = end

def merge(intervals):
    if len(intervals) < 2:
        return intervals
    intervals.sort(key=lambda x: x.start)
    mergedIntervals = []
    start = intervals[0].start
    end = intervals[0].end
    for i in range(1, len(intervals)):
        interval = intervals[i]
        if interval.start <= end:
            end = max(interval.end, end)
        else:
            mergedIntervals.append(Interval(start, end))
            start = interval.start
            end = interval.end
    mergedIntervals.append(Interval(start, end))
    return mergedIntervals
- Ideal for arrays containing numbers in a given range
- Iterates through the array to identify any misplaced values and then places them into their correct position
  - e.g. Given an array of unsorted numbers 1 to `n` (duplicates allowed), find the missing number. We can sort by placing `1` at index `0`, `2` at index `1`, etc. and then iterate again to find the missing number
def cyclic_sort(nums):
    i = 0
    while i < len(nums):
        index = nums[i] - 1
        if nums[i] != nums[index]:
            nums[i], nums[index] = nums[index], nums[i]
        else:
            i += 1
    return nums
- Reversing a linked list using existing nodes i.e. no extra memory is used
def reverse(head):
    current = head
    previous = None
    while current:
        next_node = current.next
        current.next = previous
        previous = current
        current = next_node
    return previous
- Optimal choice for any problem involving traversal of a tree level-by-level
- Ideal for finding the shortest path between two vertices in a graph whose edges have equal and positive weights.
- Implemented via queues
- `O(V)` space complexity where `V` is equal to the number of nodes
- `O(V + E)` time complexity, where `E` is equal to the number of edges
from collections import deque

def bfs(graph, root):
    visited, queue = set(), deque([root])
    visited.add(root)
    while queue:
        vertex = queue.popleft()
        for neighbour in graph[vertex]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return visited
- Implemented via stacks or recursion
- Start at root node
- Perform null check
- Make recursive call to children
- Perform operations on node
- `O(V)` space complexity
- `O(V + E)` time complexity
def dfs(graph, start, visited=None):
    if visited is None:
        visited = set()
    visited.add(start)
    for next_node in graph[start]:
        if next_node not in visited:
            dfs(graph, next_node, visited)
    return visited
# Given a binary tree and a number 'S',
# find if the tree has a root-to-leaf path such that the sum of all the node values of that path equals 'S'
def has_path(root, sum):
    if not root: return False
    if root.value == sum and not root.left and not root.right:
        return True
    return has_path(root.left, sum - root.value) or has_path(root.right, sum - root.value)
- Performant for problems that involve a set of elements that can be divided into 2 parts in order to find some kind of median value
- Store the first half into a max-heap and the other half into a min-heap
- Median of the current list of numbers can be calculated from the top element of 2 heaps at any time
- e.g. Priority queues, scheduling, finding the smallest/largest/median elements of a given set
import heapq

class MedianFinder:
    def __init__(self):
        self.maxheap = []  # lesser values
        self.minheap = []  # greater values

    def addNum(self, num: int) -> None:
        heapq.heappush(self.maxheap, -num)
        if len(self.maxheap) - len(self.minheap) > 1:
            heapq.heappush(self.minheap, -heapq.heappop(self.maxheap))
        if self.maxheap and self.minheap and -self.maxheap[0] > self.minheap[0]:
            ret = -heapq.heappop(self.maxheap)
            ret2 = heapq.heapreplace(self.minheap, ret)
            heapq.heappush(self.maxheap, -ret2)

    def findMedian(self) -> float:
        if len(self.maxheap) != len(self.minheap):
            return -self.maxheap[0]
        else:
            return (-self.maxheap[0] + self.minheap[0]) / 2
- BFS approach to dealing with permutations and combinations of sets
- Iteratively add each element of a set to all existing subsets to create new subsets: `[1, 5, 3]` -> `[[]]` -> `[[], [1]]` -> `[[], [1], [5], [1,5]]` -> `[[], [1], [5], [1,5], [3], [1,3], [5,3], [1,5,3]]`
- e.g. Find combinations or permutations of a given set, subsets with dupes
def find_subsets(nums):
    subsets = [[]]
    for val in nums:
        for i in range(len(subsets)):
            subsets.append(subsets[i] + [val])
    return subsets
- Used for finding an element in a sorted list by comparing its mid-point to the target value and continuously narrowing the scope of its search space by half
- `O(logN)` time complexity
This template is the most basic and elementary form of binary search. It's used to search for an element or condition which can be determined by accessing a single index in the array.
- Initial condition: `left = 0, right = length-1`
- Searching right: `left = mid+1`
- Searching left: `right = mid-1`
def binary_search(nums, target):  # function wrapper assumed
    left, right = 0, len(nums) - 1
    while left <= right:
        mid = (left + right) // 2
        if nums[mid] == target:
            return mid
        elif nums[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    # End Condition: left > right
    return -1
This template is a more advanced binary search. It's used to search for an element or condition that requires accessing the current index and its immediate right neighbor's index.
- Initial condition: `left = 0, right = length`
- Searching right: `left = mid+1`
- Searching left: `right = mid`
def binary_search(nums, target):  # function wrapper assumed
    left, right = 0, len(nums)
    while left < right:
        mid = (left + right) // 2
        if nums[mid] == target:
            return mid
        elif nums[mid] < target:
            left = mid + 1
        else:
            right = mid
    # End Condition: left == right
    if left != len(nums) and nums[left] == target:
        return left
    return -1
This template is used to search for an element or condition which requires accessing the current index and both its neighbors.
- Initial condition: `left = 0, right = length-1`
- Searching right: `left = mid`
- Searching left: `right = mid`
def binary_search(nums, target):  # function wrapper assumed
    left, right = 0, len(nums) - 1
    while left + 1 < right:
        mid = (left + right) // 2
        if nums[mid] == target:
            return mid
        elif nums[mid] < target:
            left = mid
        else:
            right = mid
    # End Condition: left + 1 == right
    if nums[left] == target: return left
    if nums[right] == target: return right
    return -1
- Logical bitwise operator that returns `0` if both bits are the same, `1` otherwise
# Given an array of n-1 integers in the range from 1 to n, find the one number that is missing from the array.
def find_missing_number(arr):
    n = len(arr) + 1
    # x1 represents XOR of all values from 1 to n
    x1 = 1
    for i in range(2, n + 1):
        x1 ^= i
    # x2 represents XOR of all values in arr
    x2 = arr[0]
    for i in range(1, len(arr)):
        x2 ^= arr[i]
    # missing number is the xor of x1 and x2
    return x1 ^ x2

find_missing_number([1,5,2,6,4]) # 3
def find_single_number(arr):
    # XOR all the numbers in the input; duplicate numbers will zero out each other and we will be left with the single number.
    num = 0
    for val in arr:
        num ^= val
    return num

find_single_number([1, 4, 2, 1, 3, 2, 3]) # 4
- Implement using a heap
  - Insert K elements into a min/max-heap
  - Iterate through the remaining numbers; if the current number has higher priority than the top of the heap, replace the top of the heap with that element (a minimal sketch follows this list)
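A minimal sketch of the size-K min-heap approach described above for finding the K largest elements (function name is illustrative); the `heapq.nlargest` snippet that follows is the library shortcut applied to the top-K-frequent variant:

import heapq

def k_largest(nums, k):
    heap = nums[:k]
    heapq.heapify(heap)           # min-heap of the first K elements
    for num in nums[k:]:
        if num > heap[0]:         # current number beats the smallest of the current top K
            heapq.heapreplace(heap, num)
    return heap

print(k_largest([3, 1, 5, 12, 2, 11], 3))  # [5, 12, 11] in heap order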
import heapq
from collections import Counter
from typing import List

def top_k_frequent(nums: List[int], k: int) -> List[int]:
    count = Counter(nums)
    return heapq.nlargest(k, count.keys(), key=count.get)
- Problems that involve a set of sorted arrays
- Use a heap to efficiently perform sorted traversal of all elements of all K arrays
- Insert first element of each array into a min-heap
- Take out the top of the heap and add it to a merged list
- Insert the next element from the array of the removed element
- Repeat 2 & 3 to populate the merged list in sorted order
# You are given an array of k linked-lists lists, each linked-list is sorted in ascending order.
# Merge all the linked-lists into one sorted linked-list and return it.
import heapq

def merge_k_lists(lists):
    h = [(l.val, idx) for idx, l in enumerate(lists) if l]
    heapq.heapify(h)
    head = cur = ListNode(None)
    while h:
        val, idx = heapq.heappop(h)
        cur.next = ListNode(val)
        cur = cur.next
        node = lists[idx] = lists[idx].next
        if node:
            heapq.heappush(h, (node.val, idx))
    return head.next
A general algorithm for finding all (or some) solutions to computational problems which incrementally build candidates to the solution and abandons a candidate (backtracks) as soon as it determines that the candidate cannot lead to a valid solution. Implemented via recursion, it is often the preferable option (over a divide and conquer algorithm) if the number of solutions is unknown.
from typing import List

def all_paths_from_src_to_target(graph: List[List[int]]) -> List[List[int]]:
    paths = []
    target = len(graph) - 1

    def backtrack(node, path):
        if node == target:
            paths.append(list(path))
            return
        for neighbor in graph[node]:
            backtrack(neighbor, path + [neighbor])

    backtrack(0, [0])
    return paths
A graph traversal in which each node v
is visited only after all of its dependencies are visited first. Note: multiple topological sortings can exist for any given graph.
- Must be a directed, acyclic graph
- There must be at least one vertex with an in-degree of `0`, i.e. an origin node
from collections import defaultdict, deque

def topological_sort(n, edges):
    indegrees = [0 for x in range(n)]
    neighbors = defaultdict(list)
    ret = []
    for src, dest in edges:
        neighbors[src].append(dest)
        indegrees[dest] += 1
    q = deque(node for node in range(n) if indegrees[node] == 0)
    while q:
        node = q.popleft()
        if node in neighbors:
            for neighbor in neighbors[node]:
                indegrees[neighbor] -= 1
                if indegrees[neighbor] == 0:
                    q.append(neighbor)
        ret.append(node)
    return ret if len(ret) == n else []
from collections import deque

class Solution:
    WHITE = 1
    GRAY = 2
    BLACK = 3
    has_cycle = False

    def __init__(self, graph):  # assumed constructor: graph is an adjacency list
        self.graph = graph
        self.V = len(graph)

    def dfs(self, node, visited, stack):
        visited[node] = Solution.GRAY
        for neighbor in self.graph[node]:
            if visited[neighbor] == Solution.WHITE:
                self.dfs(neighbor, visited, stack)
            elif visited[neighbor] == Solution.GRAY:
                self.has_cycle = True
        visited[node] = Solution.BLACK
        stack.appendleft(node)  # prepend so the final deque reads in topological order

    def topological_sort(self):
        visited = [Solution.WHITE] * self.V
        stack = deque()
        for node in range(self.V):
            if visited[node] == Solution.WHITE:
                self.dfs(node, visited, stack)
        return stack if not self.has_cycle else []
- `O(V + E)` time complexity: `O(E)` for constructing an adjacency list, `O(V + E)` time to repeatedly visit each vertex and update all of its outgoing edges in the worst case scenario. `O(E)` + `O(V + E)` = `O(V + E)`
- `O(V + E)` space complexity for an adjacency list of size `O(E)`, storing `O(V)` in-degrees for each vertex, and `O(V)` nodes in the queue
A greedy algorithm that takes a weighted, connected, undirected graph and calculates the minimum possible sum from the set of edges to produce a minimum spanning tree.
Algorithm
- Sort all edges by their weights
- Take the edge with the lowest weight and add it to the minimum spanning tree (skip edges that would produce a cycle)
- MST can be represented using a disjoint set
- Keep adding edges until all vertices are accounted for in the MST
Complexity
- `O(E⋅logE)` time complexity, where `E` is equal to the number of edges
  - This is due to the initial sorting of the edges
- `O(V)` space complexity, where `V` represents the total number of vertices
import heapq
from typing import List

class Solution:  # wrapper class assumed
    def min_cost_to_connect_points(self, points: List[List[int]]) -> int:
        edges = []
        sum = 0
        self.init(len(points))
        for i in range(len(points)):
            for j in range(i+1, len(points)):
                cost = self.calc_manhattan_distance(points[i], points[j])
                edges.append((cost, i, j))
        heapq.heapify(edges)
        while edges:
            edge = heapq.heappop(edges)
            sum += self.union(edge)
        return sum

    def init(self, sz):
        self.roots = [x for x in range(sz)]
        self.ranks = [0 for _ in range(sz)]

    def calc_manhattan_distance(self, pt1: List[int], pt2: List[int]):
        return abs(pt1[0] - pt2[0]) + abs(pt1[1] - pt2[1])

    def find(self, x):
        if self.roots[x] != x:
            self.roots[x] = self.find(self.roots[x])
        return self.roots[x]

    def union(self, data):
        cost, x, y = data
        rootx = self.find(x)
        rooty = self.find(y)
        sum = 0
        if rootx != rooty:
            sum += cost
            if self.ranks[rootx] == self.ranks[rooty]:
                self.roots[rooty] = rootx
                self.ranks[rootx] += 1
            elif self.ranks[rootx] > self.ranks[rooty]:
                self.roots[rooty] = rootx
            else:
                self.roots[rootx] = rooty
        return sum
Very similar to Kruskal's except rather than constructing an MST with edges, Prim's algorithm builds it using vertices
- Steps
- Initialize the minimum spanning tree with a vertex chosen at random
- Find all the edges that connect to the current vertex
- From those edges, "select" the minimum and add it to the tree
- Repeat from step 2 until MST is fully constructed
- Time complexity
  - Binary heap: `O(E⋅logV)`
    - `O(V + E)` time to traverse all vertices
    - `O(logV)` time to extract the minimum element
  - Fibonacci heap: `O(E + V⋅logV)`
    - `O(logV)` time to extract the minimum element
- Space complexity: `O(V)` to store all vertices in the data structure
import heapq
from typing import List

class Solution:  # wrapper class assumed
    def min_cost_to_connect_points(self, points: List[List[int]]) -> int:
        sz = len(points)
        visited = [False] * sz
        pq, res, v1 = [], 0, points[0]
        visited[0] = True
        for i in range(1, sz):
            v2 = points[i]
            cost = self.calc_manhattan(v1, v2)
            heapq.heappush(pq, (cost, 0, i))
        while pq:
            cost, pt1, pt2 = heapq.heappop(pq)
            if not visited[pt2]:
                res += cost
                visited[pt2] = True
                for j in range(sz):
                    if not visited[j]:
                        cost = self.calc_manhattan(points[pt2], points[j])
                        heapq.heappush(pq, (cost, pt2, j))
        return res

    def calc_manhattan(self, pt1: List[int], pt2: List[int]):
        return abs(pt1[0] - pt2[0]) + abs(pt1[1] - pt2[1])
Intended for weighted, directed graphs with non-negative weights. The order of the vertex traversal is determined by the smallest weighted edge at any given time.
Complexity
- `O(E + V⋅logV)` time complexity using a Fibonacci heap, where `E` is equal to the number of edges and `V` the vertices
  - Using a min-heap, for `V` vertices, it takes `O(logV)` time to extract the minimum element
  - Alternatively, `O(V + E⋅logV)` time complexity for a binary heap implementation
- `O(V)` space complexity
import heapq
from collections import defaultdict
from typing import List

class Solution:
    def network_delay_time(self, times: List[List[int]], n: int, k: int) -> int:
        graph, dist, pq = defaultdict(list), {}, [(0, k)]
        for src, dest, cost in times:
            graph[src].append((dest, cost))
        while pq:
            time, node = heapq.heappop(pq)
            if node not in dist:
                dist[node] = time
                for neighbor, w in graph[node]:
                    heapq.heappush(pq, (w + time, neighbor))
        return max(dist.values()) if len(dist) == n else -1
Intended for any weighted, directed graph (including negative edge weights). This algorithm recalculates every edge for each new vertex visitation (as opposed to Dijkstra's greedy method, which limits its scope of analysis to the edges of a visited vertex's immediate neighbors). The order of the vertex traversal is not dependent on any locally-optimum weight; rather, all nodes `V` are iterated over `V-1` times.
Complexity
- `O(V⋅E)` time complexity in the worst case scenario in which every edge is relaxed for every vertex
- `O(V)` space complexity for two 1-dimensional arrays: one for storing the shortest distance from the source vertex to every other vertex (using `V-1` edges) and the other for representing the shortest distance using at most `V` edges
class Graph:
    def __init__(self, vertices):
        self.V = vertices
        self.graph = []  # Array of edges

    def add_edge(self, s, d, w):
        self.graph.append([s, d, w])

    def bellman_ford(self, src):
        # Step 1: fill the distance array
        dist = [float("Inf")] * self.V
        # Mark the source vertex
        dist[src] = 0
        # Step 2: relax edges |V| - 1 times
        for _ in range(self.V - 1):
            for s, d, w in self.graph:
                if dist[s] != float("Inf") and dist[s] + w < dist[d]:
                    dist[d] = dist[s] + w
        # Step 3: detect negative cycle
        # if a value still changes then we have a negative cycle in the graph
        # and we cannot find the shortest distances
        for s, d, w in self.graph:
            if dist[s] != float("Inf") and dist[s] + w < dist[d]:
                return None
        return dist
Refer here
Stability: A sorting algorithm is considered stable if the two or more items with the same value maintain the same relative positions even after sorting.
For a graphical Big-O analysis of these algorithms, refer to the Big-O section.
- A divide-and-conquer sorting algorithm that divides an array into multiple smaller subproblems. When each subproblem is solved, the results are combined to form a sorted array.
- Stable
- `O(N⋅logN)` time complexity
- `O(N)` space complexity
def merge_sort(my_list):
    if len(my_list) > 1:
        mid = len(my_list) // 2
        left = my_list[:mid]
        right = my_list[mid:]
        merge_sort(left)
        merge_sort(right)
        i = j = k = 0
        while i < len(left) and j < len(right):
            if left[i] < right[j]:
                my_list[k] = left[i]
                i += 1
            else:
                my_list[k] = right[j]
                j += 1
            k += 1
        while i < len(left):
            my_list[k] = left[i]
            i += 1
            k += 1
        while j < len(right):
            my_list[k] = right[j]
            j += 1
            k += 1

def find_minimum(lst):
    if (len(lst) <= 0):
        return None
    merge_sort(lst)
    return lst[0]
- A divide-and-conquer sorting algorithm that divides an array into smaller subarrays by random selection of a pivot. Elements less than the pivot occupy the left side of it, while those greater go on the right. The constituent subarrays follow the same approach and all subproblems are combined to form a sorted array.
- Not stable
- `O(N^2)` time complexity in the worst case, `O(N⋅logN)` on average
- `O(logN)` space complexity
- Select a random element as the pivot
- Swap the pivot with the last index
- Create a dedicated swap pointer, assign it to the specified start index
- Iterate through the array from the start index, swapping values with the swap pointer iff the current element is lesser than that of the pivot
  - Increment the swap pointer by 1
  - If the current element is greater than or equal to the pivot, do nothing
- Finally, once we reach the end of the loop, swap the pivot with the swap pointer
- Return the index of the swap pointer and repeat as necessary (a standalone quicksort sketch follows this list)
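A minimal standalone quicksort sketch following the partition steps above (function names are illustrative); the quickselect-based `topKFrequent` solution below reuses the same partitioning idea:

import random

def partition(arr, start, end):
    pivot_idx = random.randint(start, end)
    arr[pivot_idx], arr[end] = arr[end], arr[pivot_idx]  # move pivot to the last index
    pivot = arr[end]
    swap_ptr = start
    for i in range(start, end):
        if arr[i] < pivot:                               # smaller elements go left of the swap pointer
            arr[swap_ptr], arr[i] = arr[i], arr[swap_ptr]
            swap_ptr += 1
    arr[swap_ptr], arr[end] = arr[end], arr[swap_ptr]    # place pivot at its final position
    return swap_ptr

def quick_sort(arr, start=0, end=None):
    if end is None:
        end = len(arr) - 1
    if start < end:
        pivot_idx = partition(arr, start, end)
        quick_sort(arr, start, pivot_idx - 1)
        quick_sort(arr, pivot_idx + 1, end)

nums = [9, 2, 7, 1, 6]
quick_sort(nums)
print(nums)  # [1, 2, 6, 7, 9]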
import random
from collections import Counter
from typing import List

class Solution:
    def topKFrequent(self, nums: List[int], k: int) -> List[int]:
        counter = Counter(nums)
        uniques = list(counter.keys())
        N = len(uniques)

        def partition(start, end):
            pivot_idx = random.randint(start, end)
            pivot = uniques[pivot_idx]
            swap_ptr = start
            uniques[pivot_idx], uniques[end] = uniques[end], uniques[pivot_idx]
            for i in range(start, end):
                if counter[uniques[i]] < counter[pivot]:
                    uniques[swap_ptr], uniques[i] = uniques[i], uniques[swap_ptr]
                    swap_ptr += 1
            uniques[swap_ptr], uniques[end] = uniques[end], uniques[swap_ptr]
            return swap_ptr

        def quick_select(start, end):
            if start == end: return
            pivot_idx = partition(start, end)
            if N - k == pivot_idx: return
            if N - pivot_idx < k:
                quick_select(start, pivot_idx - 1)
            else:
                quick_select(pivot_idx + 1, end)

        quick_select(0, N - 1)
        return uniques[N - k:]
- Algorithm that places an unsorted element at its suitable place in each iteration
- Stable
- `O(n^2)` time complexity
- Assume the first element of the array is already sorted. Consider this to be the partition between it and the "unsorted" remainder
- Iterate through the array starting at index `1` and compare each element to the sorted portion
  - Insert the element in its appropriate place in the sorted portion
def insertion_sort(array):
    for i in range(1, len(array)):
        key = array[i]
        left = i - 1
        while left >= 0 and key < array[left]:
            array[left + 1] = array[left]
            left = left - 1
        array[left + 1] = key
- Algorithm that selects the smallest element from the "unsorted" portion of an array and then places that element at the beginning of the sorted portion
- Not stable
- `O(n^2)` time complexity
- At every iteration, set the first element from the unsorted portion to `minimum`
- Walk through the array and re-assign `minimum` if a lesser number is encountered
- Once we reach the end of the array, swap `minimum` with the first element of the unsorted portion, which becomes the last index of the "sorted" portion
def selection_sort(array, size):
    for i in range(size):
        min_idx = i
        for j in range(i + 1, size):
            if array[j] < array[min_idx]:
                min_idx = j
        array[i], array[min_idx] = array[min_idx], array[i]
- Algorithm that compares two adjacent elements and swaps them to be in the intended order.
- Stable
- `O(n^2)` time complexity
- Starting at index `0`, compare the element with its neighbor at index `1`
  - Swap them such that the lesser element is on the left and the greater element is on the right
- Increment such that we now compare the values at indices `1` and `2`
- Repeat until the end of the array is reached
- Invoke the function again until the array is fully sorted
def bubble_sort(array):
    for _ in range(len(array) - 1):  # repeat passes until the array is fully sorted
        for i in range(len(array) - 1):
            if array[i] > array[i + 1]:
                array[i], array[i+1] = array[i+1], array[i]
- Efficient algorithm that leverages arrays and trees
- Recall that all leaves are proper heaps by default (they have no children and thus they are the min/max element of their tree)
  - Efficiency of the algorithm can therefore be improved by starting at the first non-leaf node from the bottom
    - `n//2 - 1`, where `n` is the total number of nodes, gives us the index to start at
- `O(N⋅logN)` time complexity
- Heapify
  - The discrete algorithm that is used to construct a heap
  - Given a node within the tree whose children are proper heaps, compare the parent to its children
    - If the node is greater than its children, do nothing
    - Otherwise, swap the node with its highest-priority child
      - Percolate this node downwards until it's in its appropriate index by recursively invoking `heapify()`
  - Building the full heap this way takes `O(N)` time
def heapify(arr, n, i):  # n == array size, i == index
    largest = i
    left = 2 * i + 1
    right = 2 * i + 2
    if left < n and arr[i] < arr[left]:
        largest = left
    if right < n and arr[largest] < arr[right]:
        largest = right
    if largest != i:
        arr[i], arr[largest] = arr[largest], arr[i]
        heapify(arr, n, largest)
- Building max-heap
  - To build a max-heap from any tree, we start from the bottom up (start at the first non-leaf)
  - Iterate upwards through these nodes, running `heapify()` on each
def build_max_heap(arr):
    n = len(arr)
    for i in range(n//2, -1, -1):
        heapify(arr, n, i)
- Sorting
  - Swap: Remove the root element by swapping it with the node in the last index
  - Remove: Reduce the size of the heap by `1`
  - Heapify: `heapify()` the root element again to restore the max-heap property
  - Repeat the process until the entire heap is sorted
def sort(arr):
    n = len(arr)
    for i in range(n-1, 0, -1):
        arr[i], arr[0] = arr[0], arr[i]
        heapify(arr, i, 0)
- Result
def heapSort(arr):
    build_max_heap(arr)
    sort(arr)
    print(arr)
- If complete binary trees have a height of `O(logN)`, then the worst case scenario for a root node to percolate is `O(logN)`
- During the `build_max_heap()` stack, we invoke `heapify()` for `n/2` elements
  - `n/2` is asymptotically equivalent to `n`, thus a complexity of `O(N⋅logN)`
- During the `sort()` stack, we call `heapify()` `n` times, thus also resulting in a time complexity of `O(N⋅logN)`
  - Not factored into the equation since these two methods are invoked sequentially
- `O(1)` space complexity
A stable sorting algorithm that divides the unsorted array elements into several buckets. Each bucket is then sorted by using any of the sorting mechanisms mentioned above or by recursively applying the same bucket algorithm.
The algorithm itself is described as a scatter-gather approach due to the nature of its process. An array, whose indices represent individual buckets, is created to store the elements of an input. Each bucket is assigned a specific range which is then used as the partition criteria when scattering the input elements.
A stable sorting algorithm is then applied to each bucket. Once complete, the elements are then gathered back into a holistic data structure by iterating through each bucket and inserting the elements into the original array, in order.
- `O(n^2)` worst case, `O(n)` average case time complexity
- `O(n + k)` space complexity
def bucketSort(array):
    N = len(array)
    buckets = [[] for _ in range(N)]  # note: [[]] * N would alias the same inner list N times
    for e in array:
        bucket = e % N
        buckets[bucket].append(e)
    for bucket in buckets:
        bucket.sort()
    k = 0
    for bucket in buckets:
        for e in bucket:
            array[k] = e
            k += 1
    return array
- Clarify functional requirements and scope
- Who is going to use it? How are they going to use it?
- How many users?
- What does the system do? What are its inputs and outputs?
- Additional non-functional requirements (i.e. consistency vs. availability)
- Estimations and constraints
- Throughput/traffic estimates
- How many queries per second?
- 86400 seconds in a day
- Read-to-write ratio
- How many queries per second?
- Storage estimates
- Total storage required over 5 years
- Memory estimates
- What do we want to store in cache?
- Consider 80-20 rule
- Approximate RAM required
- What do we want to store in cache?
- Bandwidth estimates
- QPS * payload
- Throughput/traffic estimates
- Define the APIs and data schemas
- Define the API: the resources, parameters, functions, & responses
- Define the database schema: the fields and estimated bytes per record
- High-level design
- Sketch a basic system that includes the main components and the connections between them
- Scaling
- Iterate through each component and scale individually
- For the application layer, break down into microservices
- DNS
- CDN
- Push vs. pull
- Load Balancers
- Active-passive
- Active-active
- Layer 4
- Layer 7
- Databases
- RDBMS
- NoSQL
- Key-value (DynamoDB)
- Document (MongoDB)
- Column (Cassandra)
- Graph (Neo4j)
- Partitioning
- Vertical
- Horizontal (sharding)
- Replication
- Master-slave
- Master-master
- Leaderless
- Caching
- Write-through
- Write-behind
- Cache-aside
- Refresh-ahead
- Asynchronism
- Message queues
- Task queues
- Iterate through each component and scale individually
Content delivery networks are a globally distributed network of proxy servers that aid in serving clients static files such as HTML, CSS, and JS to improve end-user experience and reduce server load.
Push CDNs receive new content whenever changes occur on the upstream server. The onus of updating the CDN (providing content and updating URLs to point to the CDN) lies squarely with the application server. This type of CDN is best-suited for websites with a small amount of traffic and/or ones which are not updated frequently.
Pull CDNs, on the other hand, poll the server for new updates. Time-to-live (TTLs) are used to determine how long content is cached for. Pull CDNs minimize the storage space but can potentially create redundant traffic if the TTL has expired and is re-requested.
These types of CDNs are optimal for sites with heavy traffic as they inherently only hold the most-recently requested content.
- Recently requested data is likely to be requested again
- 80-20 rule: 80% of traffic is generated by 20% of the resources
- Most commonly implemented at level nearest to front-end, so as to avoid taxing downstream systems
A process in computer systems whereby entries in a cache are replaced or removed. If data is modified in the database, its cached version should be invalidated.
- Write-through cache (a minimal sketch follows this list)
  - Data is written into the cache and DB synchronously
  - Sacrifices latency for a minimized risk of data loss/inconsistency
    - This is because each update necessitates 2 writes
  - Disadvantages
    - Most written data will never end up being read (but this can be minimized with a TTL)
- Write-around cache
  - Data is written directly to storage, bypassing the cache entirely
  - Pro: Reduces the risk of the cache being flooded with writes that ultimately won't be read (80-20 rule)
  - Disadvantages
    - Read requests for recently-written data are likely to result in a cache miss. Latency would increase in this case
    - A request for recently-written data may actually be in the cache but could be inconsistent with the database. Depending on requirements, this could be acceptable but otherwise a mechanism to invalidate stale caches would need to be devised
- Write-back cache
  - Written only to the cache
  - Completion is immediately confirmed to the client
  - Writes to permanent storage are done after a specified interval or under defined conditions
  - Pro: Low latency, high throughput
  - Con: Increased risk of data loss if the cache were to fail
- Refresh-ahead cache
  - Cache automatically and asynchronously reloads any recently-accessed cache entry prior to its TTL expiration
  - Reduces latency when an entry expires and a fresh, synchronous request needs to be sent to the data store
    - If the entry has expired, a synchronous request is made
    - If close to expiring, an async request is made
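A minimal sketch contrasting write-through and write-around behavior, with plain dictionaries standing in for the cache and the database (all names are illustrative assumptions):

cache, database = {}, {}

def write_through(key, value):
    # both stores are updated synchronously; reads of fresh data hit the cache
    cache[key] = value
    database[key] = value

def write_around(key, value):
    # the cache is bypassed; a subsequent read of this key will miss the cache
    database[key] = value

def read(key):
    if key in cache:        # cache hit
        return cache[key]
    value = database[key]   # cache miss: fall back to the database
    cache[key] = value      # populate the cache for future reads
    return value

write_through("a", 1)
write_around("b", 2)
print(read("a"), read("b"))  # 1 2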
- Consistency: all nodes see the same data all the time
- Availability: every request results in a response
- Partition tolerance: a system continues to operate despite a partial system failure
- Data is sufficiently replicated across nodes to keep system operational through partial outages
- A distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table (DHT) by assigning them a position on an abstract ring
- Allows servers and objects to scale without affecting the overall system architecture
- Traditional distributed hashing, i.e. `hash(key) % num_servers = partition_index`
  - Hash algorithm will have to be reconfigured when any server is added or removed
    - All existing indices will also have to be rehashed
  - Not horizontally scalable
  - Can easily become unbalanced over time with development of hot-spots
- Hash output range is mapped onto an abstract circle or ring
- Both objects and servers are then mapped onto this ring
  - Servers should be mapped in a random but repeatable way, i.e. hashing `server_name` or `ip_address`
- Objects are designated to servers based on proximity
  - Rotate the key either in a counter-clockwise or clockwise direction until a server is located
  - Can be implemented by maintaining a sorted list of servers and their associated positions on the ring and then walking through this list until the position is greater than or equal to that of the key's
- This solution works but
  - Hotspots can still easily develop over time
    - Due to either an unbalanced distribution of keys or servers
    - e.g. Server nodes can be clustered to the point that some nodes don't even receive data
  - All servers are treated equally when that might not be the case, i.e. varying capacities
servers = ['a', 'b', 'c']
key = 'foo'
ring = [hash(s) for s in servers]  # each server's position on the ring

def find_from_ring(i):
    return ...  # index of the server whose ring position is closest to i

server = servers[find_from_ring(hash(key))]
- Introducing: virtual nodes!
  - Helps ameliorate the issues of load distribution and data uniformity in consistent hashing
  - Instead of a server having one position/range in the ring, it now holds several, all interspersed along the ring
    - Weight: the factor by which to increase the number of vnodes for a given server
      - Implemented at the discretion of the engineer. More powerful servers can be assigned greater weights
  - When a node is removed, all of its virtual nodes are removed as well
    - Objects formerly adjacent to these nodes are now assigned to the next-closest server node
    - Much better in contrast to the solution without vnodes since the changes are distributed across multiple nodes rather than just one
  - When a node is added, a similar reassignment happens (a fuller ring sketch follows the pseudocode below)
servers = ['a', 'b', 'c']
key = 'foo'
vnodes = [0, 1, 2, 0, 1, 2]  # each server is assigned several vnodes
server = servers[vnodes[hash(key) % len(vnodes)]]
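A minimal consistent-hashing ring sketch using `bisect` over sorted vnode positions (the hash choice and vnode count are illustrative assumptions):

import bisect
import hashlib

def ring_hash(key):
    # stable hash so positions are repeatable across runs
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

servers = ['a', 'b', 'c']
vnodes_per_server = 3
ring = sorted((ring_hash(f"{s}-{i}"), s) for s in servers for i in range(vnodes_per_server))
positions = [pos for pos, _ in ring]

def lookup(key):
    # walk clockwise to the first vnode position >= the key's position, wrapping around
    idx = bisect.bisect_left(positions, ring_hash(key)) % len(ring)
    return ring[idx][1]

print(lookup("foo"), lookup("bar"))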
Load balancers improve the responsiveness and availability of applications, websites, databases, etc by distributing the load across a host of servers.
Load balancers introduce a single point of failure and need to be engineered correctly to continue handling requests in the event of a failure.
In this schema, one load balancer is kept as the active node while a secondary node is on stand-by at all times, ready to replace the active node if it were to fail. The secondary node sends occasional health checks to the primary node to ensure it is still alive.
As the name implies, this schema has 2 active load balancers sharing the load.
- Least connections: server with fewest active connections is given priority
- Lowest latency: priority given to server with minimal response time
- Least bandwidth: priority given to server with least amount of traffic, in terms of Mbps
- Round robin: a simple algorithm that distributes load equally among all servers by running through a cycle
- Weighted round robin: A variation of the above whereby servers are assigned a weight (indicating processing capacity, i.e. processor speed) and ordering the cycle in respect to that metric
- Layer 4: Leveraging the transport layer, this type of routing is facilitated by data such as the source/destination IPs and ports
- Layer 7: At layer 7, the application layer, this type of routing inspects the payload (header, body, cookies) of the request to determine an appropriate server
A server that sits in front of a back-end system and acts as the public-facing interface for all incoming requests. Provides a multitude of benefits:
- Security
- A reverse proxy can deter DDoS attacks by blacklisting certain IPs, payload matching, or limiting the number of connections
- Scalability
- The back-end service is free to change its configuration without affecting clients because the entry point to this service remains static
- Compression
- Bandwidth can be conserved by compressing the server response prior to sending it to the client
- SSL termination
- SSL encryption/decryption is computationally expensive and performing these operations at the reverse proxy level frees up resources for the back-end
- Caching
- Response time and server load can be reduced by storing responses in a local cache and doing a lookup prior to forwarding requests to the back-end
- Horizontal (Sharding)
  - All partitions share the same schema
  - Each partition is a shard that holds a specific subset of data
    - e.g. In a `Places` table, we partition based on ZIP codes. ZIP codes under an `xyz` threshold go into `Server1`, otherwise they go into `Server2`
  - Can be problematic if the range value is not thoroughly vetted: the tables can become easily unbalanced
  - Onerous to change the partition key after the system is in operation
- Common problems
- Most constraints are due to the fact that a horizontally distributed system, by definition, requires operations across multiple servers
- Joins & denormalization
- Denormalization: a strategy that DBAs employ to increase the performance of a DB by adding redundant data to the DB so as to reduce the operational cost of queries which combine data from multiple sources into a single source
- Normalization: organizing a DB into tables and columns so as to reduce data redundancy and improve data integrity
- Cons: how do we address data inconsistency?
- Vertical
- Partitions data based on an abstract concept or feature
- Each partition holds a subset of the fields within a certain table
- Ideal for reducing the I/O costs associated with frequently-accessed items
- Can be problematic as growth could beget a further partitioning of the table
- Partitions data based on an abstract concept or feature
- Directory-based
  - Loosely-coupled approach that creates a lookup service that is cognizant of the partitioning scheme and abstracts away interfacing with the DB
    - Contains the mapping between PKs and servers
  - Flexible as we can add servers to the DB pool or change the partition scheme without application impact
- Key or hash-based partitioning
  - Applies a hash `f(x)` to a key attribute `x` of an entity to yield a unique hash number that can then be used for partitioning
    - e.g. If we have `y` servers, we can derive a partition index by applying a modulo to the hash result: `f(x) % y = index_partition`
- List partitioning
- Each partition is assigned a list of values
- Appropriate partition for storage can be retrieved by the server's list
- e.g. Country-based lists
- Round robin
- A uniform data distribution scheme
- Composite
- Any of the above in combination
- e.g. A list-based partitioning scheme than then follows a hash-based partition
- Any of the above in combination
- Leader-follower
- One replica is designated as the leader and all others are followers.
- All write requests are processed by the leader who then sends the writes to its followers as part of a replication log or change stream.
- Each follower takes this log and updates its local copy of the database accordingly, applying updates in the same order as the log. Reads can be handled by leaders or followers.
- Cons
- One node for all writes
- Leader can be blocked from writes if implemented synchronously (waits for successful response from a failing follower node)
- Easier to implement if one node is synchronous and the rest are asynchronous
- Guarantees an up-to-date copy of data on at least 2 nodes
- Easier to implement if one node is synchronous and the rest as asynchronous
- Complexity of new leader election / split-brain
- Replication lag
- Leader-leader
- Same general process as above with the exception that there is more than one leader node. Suitable for data spread across multiple datacenters: each datacenter can have its own leader.
- Improves latency in comparison to single leader as there could be a write node closer to the client
- Tolerant of datacenter outages
- Cons
- Data can be concurrently modified on two different leaders
- Often considered dangerous
- Leaderless
- All nodes function as leaders.
- No concept of failover
- Common to implement with quorums
- Denormalization
- The practice of adding datasets from a remote node onto a local node to reduce the costs of complex joins over a network
- Used to improve query response time by facilitating faster searching
- Implemented with a sorted list of (narrowly-scoped) data that can be used to look something up
- i.e. Table of contents
- Decreases write performance because for every write to a table, the index has to be written as well
1. `FROM`
2. `WHERE`
3. `GROUP BY`
4. `HAVING`
   - requires an aggregation function
   - i.e. `HAVING sum(population) > 200,000`
5. `SELECT`
6. `ORDER BY` (can use a `SELECT` alias)
7. `LIMIT`
- Data is stored in array of KV pairs
- Redis, Voldemort, DynamoDB
- Data is stored in documents (instead of rows and columns) and grouped together in collections
- Each document can have an entirely different structure
- CouchDB, MongoDB
- Instead of tables, column families are used as containers for rows
- All of the columns do not need to be known up front
- Each row doesn't have to have same number of columns
- Best suited for large datasets
- Cassandra, HBase
- Store data whose relations are best represented in a graph
- Contains nodes (entities), properties (entity metadata), and lines (connections between entities)
- Neo4J, InfiniteGraph
- ACID
  - Atomicity
    - Transactions are either performed in whole or they are not. There is no concept of a partially-completed transaction
  - Consistency
    - Ensures that a transaction only makes changes in pre-defined, predictable ways. It will not corrupt the database in any way
  - Isolation
    - Transactions are performed with no other concurrent processes being performed on the data. No race conditions will arise
  - Durability
    - Guarantees that, once written, the data is persisted and will stay persisted, even in the event of a system failure
  - Strong focus on consistency and availability
- BASE
  - Basically available
    - Does not enforce immediate consistency, but rather guarantees availability of data (in terms of the CAP theorem)
  - Soft state
    - The state of a system may change over time due to eventual consistency
  - Eventually consistent
    - Eventual consistency: a system becomes consistent over time
  - Strong focus on availability and partition tolerance
- NoSQL is better at horizontal scaling because each object exists as a self-contained file with no relations to any other object
  - Therefore, no joins to other objects (that exist on other servers) are required
 | SQL | NoSQL |
---|---|---|
Storage | Tables (rows are entities, columns are datapoints) | Objects |
Schemas | Fixed, pre-defined | Dynamic, non-uniform |
Querying | SQL | UnQL |
Scalability | Vertically | Horizontally |
Reliability | Consistent, available | Performant, scalable |
- The duplication of critical components with the explicit intent of increasing the reliability of a system
- Single point of failure mitigation
- Asynchronous Javascript & XML (AJAX) polling: client continuously polls server for response
- Long polling: client sends request with the understanding that the server will respond when the response is ready
- Websockets: a protocol with a full duplex communication channel over a single TCP connection
- Provides a persistent connection between a server and its client, which both parties can use to send data to each other at any time, in real time
- Server-sent events (SSE): a client connects to a server once and then solely acts as a receiver for the remainder of the connection
- Proxy server
  - Typically used to filter, log, encrypt/decrypt, and/or transform requests
  - An ideal location to implement caching
- Gather requirements
- Format of input?
- Size of input?
- Range of values?
- What kind of values? Negatives? Floating points?
- Can we assume the input is well-formed and non-null?
- Duplicates?
- Edge cases?
- Should the original input be preserved?
- Can we assume English alphabet?
- Devise a small example to ensure the question is understood
- Explain high-level approach
- Elaborate on a brute force algo first, then consider optimizations
- Explain estimated time and space complexity
- If solution not clear, consider multiple and verbalize why each would (or wouldn't) work
- Once an approach has been determined, then start coding
- Draft skeleton using pseudocode
- If stuck, explain why what you initially thought would work is no longer true
- Devise test cases, see if a pattern emerges
- Think about tangentially related problems and how they were solved before
- Iterate through different data structures and see if they can be leveraged for the problem at hand
- Can repeated work be cached?
- Trading off memory for speed
- Talk about what you're currently coding and its intended purpose
- Start with a simplified example and solve for base cases
- Review code
- Refactor where possible
- Come up with test cases
- Ensure the edge cases are covered
- Step through with debugger
- Estimate the time and space complexity
- Explain any potential trade-offs that could be made
- `for key in dict` only iterates over keys
  - Use `dict.items()` to iterate over both keys and values
- Merging dictionaries (Python 3.9+): `dict3 = dict1 | dict2`
  - Returns a new dict containing the keys of both; for duplicate keys, the value from `dict2` takes precedence
- `defaultdict`: A dictionary that, when a key doesn't exist, returns an object instead of raising a `KeyError`
  - Alternative is `dict.get(key, default_val)`
  - Instantiation: `defaultdict(fx)`
    - Where `fx` is a function that returns the expected default value
    - e.g. `defaultdict(list)`
- `OrderedDict`: A stateful `dict` that preserves the order in which the keys were inserted
  - `from collections import OrderedDict`
  - Methods
    - `popitem(last=True)`
      - Pops in LIFO order if `True`, else FIFO
    - `move_to_end(key, last=True)`
      - Moves an existing key to either end of the dictionary
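A minimal sketch of the `defaultdict` and `OrderedDict` behaviors above (the values are illustrative):

```python
from collections import defaultdict, OrderedDict

# defaultdict(list) returns an empty list for missing keys instead of raising KeyError
groups = defaultdict(list)
for word in ["apple", "avocado", "banana"]:
    groups[word[0]].append(word)
print(groups["a"])  # ['apple', 'avocado']
print(groups["z"])  # [] (no KeyError)

# OrderedDict remembers insertion order and can pop/re-order from either end
od = OrderedDict(a=1, b=2, c=3)
od.move_to_end("a")            # order is now b, c, a
print(od.popitem(last=True))   # ('a', 1) - LIFO
print(od.popitem(last=False))  # ('b', 2) - FIFO
```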
import heapq
heapq.heapify(arr) # O(N)
heapq.heappush(arr, val) # O(logN)
heapq.heappop(arr) # O(logN)
heapq.heappushpop(arr, val) # heappush() followed by heappop(), in a single, faster call
heapq.heapreplace(arr, val) # heappop() followed by heappush(), in a single, faster call
# function that returns a list of n values from iterable according to the comparison criteria defined in key
heapq.nlargest(K, arr, key=None) # O(N*logK)
heapq.nsmallest(K, arr, key=None) # same as the above, just for minimums
# for a max heap, negate the val
# note: this library modifies an array in-place
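A short sketch of the max-heap-via-negation trick mentioned in the comments above:

```python
import heapq

# min heap (heapq's default)
nums = [5, 1, 8, 3]
heapq.heapify(nums)
print(heapq.heappop(nums))  # 1

# max heap: negate values on the way in and on the way out
max_heap = [-n for n in [5, 1, 8, 3]]
heapq.heapify(max_heap)
print(-heapq.heappop(max_heap))  # 8
```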
A dict subclass that receives an iterable as input and maps each distinct element to the number of occurrences of that element within the iterable. The `Counter` object can be accessed using the same methods and syntax as a typical dictionary.
from collections import Counter
arr = ['a', 'b', 'a', 'c', 'c', 'c']
counter = Counter(arr)
print(counter) # Counter({'c': 3, 'a': 2, 'b': 1})
# separate counters can also be added/subtracted (any key whose resulting count is <= 0 is dropped)
counter2 = Counter("ab")
counter3 = counter + counter2 # counter - counter2
print(counter3) # Counter({'a': 3, 'c': 3, 'b': 2})
# unlike standard dictionaries, counters can be unioned! | keeps the max of each count (non-positive results are dropped)
counter5 = counter | counter2
# intersections are also supported: & keeps the min of each count (non-positive results are dropped)
counter6 = counter & counter2
counter.update("add me pls")
counter.subtract("remove mi")
counter.elements() # returns an iterator over all elements (each repeated as many times as its count) with a count > 0
counter.most_common(n=None) # returns a descending n-sized list of (element, count) tuples sorted by count. If n is not supplied, the full sorted list is returned
-
Wraps an existing function inside another function
-
Decorator receives original function as input, defines an inner function, invokes the original function somewhere within the inner function, and then returns the outer function to the caller
-
Use cases
- Measuring function execution time
- URL routing
- Input sanitization
-
Examples
- `@classmethod`: receives the class (`cls`) as its implicit first argument and can be called on the class itself (no instance required)
- `@staticmethod`: receives neither the instance nor the class; can be called from both the class and the instance level
- `@synchronized(lock)`
def decorator(func):
    def inner():
        print("Hello world!")
        func()
    return inner

@decorator # equivalent to fx = decorator(fx)
def fx():
    print("Goodbye world!")
Iterator
- Any object whose class implements `__next__()` and `__iter__()`
  - `__iter__(self)` returns `self`
  - `__next__()` is the custom implementation that describes how the class retrieves the next object
- Unwieldy to implement because it requires manually maintaining state (the current position in the iteration)
- Calling `next(x)` on an iterator produces the next sequential item of `x`
Generator
- Every `generator` is an `iterator`, but not every `iterator` is a `generator`
- Built by writing a function that uses `yield`
  - The `generator` (lazily) lends us a sequence of values for Python to iterate over
  - Provides an easy, shorthand way to create iterators, where the state is automatically maintained for us
- Implemented in a function; does not need a class
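A minimal sketch contrasting the two: the generator below replaces a hand-written iterator class, with the suspended function frame holding the state:

```python
def countdown(n):
    # yield turns this function into a generator; state is preserved between next() calls
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
print(next(gen))  # 3
print(list(gen))  # [2, 1] - iteration resumes where it left off
```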
method | description |
---|---|
`.remove()` | Throws exception if not found |
`.discard()` | No exception thrown |
`.union()` or `\|` | Creates a new set that contains all elements from the provided sets |
`.intersection()` or `&` | Returns all elements that are common to both sets |
`.difference()` or `-` | Returns elements that are solely unique to the first set |
`.symmetric_difference()` or `^` | Returns elements that appear in exactly one of the two sets |
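The operator forms from the table, evaluated on two small (illustrative) sets:

```python
a = {1, 2, 3}
b = {3, 4}

print(a | b)   # {1, 2, 3, 4} - union
print(a & b)   # {3}          - intersection
print(a - b)   # {1, 2}       - difference
print(a ^ b)   # {1, 2, 4}    - symmetric difference
a.discard(99)  # no exception even though 99 is absent
```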
`itertools`: a module with a collection of fast and memory-efficient tools for iterables
method | description |
---|---|
`.cycle(iterable)` | Cycles through an iterable endlessly |
`.combinations(it, r)` | Accepts an iterable and an integer `r` (the combination size) and returns all combinations as an iterator |
`.combinations_with_replacement(it, r)` | Same as the above, except combinations can include repeated elements |
`.permutations(it, r=None)` | Returns an iterator of permutations of length `r` (defaults to the full length) |
- Order of elements does not matter for combinations
- A set with `n` distinct elements has `n!` permutations
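A quick sketch of the table above (`islice` is only used here to bound the infinite `cycle`):

```python
from itertools import combinations, permutations, cycle, islice

letters = ["a", "b", "c"]

print(list(combinations(letters, 2)))   # [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(list(permutations(letters, 2)))   # [('a', 'b'), ('a', 'c'), ('b', 'a'), ('b', 'c'), ('c', 'a'), ('c', 'b')]
print(list(islice(cycle(letters), 5)))  # ['a', 'b', 'c', 'a', 'b']
```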
import os

for folder_name, subfolders, filenames in os.walk('/home/user'):
    print('The current folder is {}'.format(folder_name))
    for subfolder in subfolders:
        print('SUBFOLDER OF {}: {}'.format(folder_name, subfolder))
    for filename in filenames:
        print('FILE INSIDE {}: {}'.format(folder_name, filename))
with open(file, mode) as obj_file: # mode is e.g. 'r', 'w', or 'a' (add 'b' for binary)
    obj_file.read() # reads whole file
    obj_file.readlines() # gets list of string values, by line
    for line in obj_file: # iterates through each line, newline included
- In a function declaration, it packs all remaining positional arguments into a `tuple`
- In a function call, it unpacks a `tuple` or `list` into positional arguments
- Use this to allow a function to accept a variable number of positional args
def fruits(*args):
    for fruit in args:
        print(fruit)

fruits("apples", "bananas", "grapes")
- In a function declaration, it is the same as the above except that it packs keyword arguments into a `dict`
- In a function call, it unpacks a `dict` into keyword arguments
def fruit(**kwargs):
    for k, v in kwargs.items():
        print("{0}: {1}".format(k, v))

fruit(name="apple", color="red")
- Ideal for storing data
- Requires data type declarations
  - Use `from typing import Any` when the datatype doesn't need to be explicitly declared
from dataclasses import dataclass
from typing import Any

@dataclass
class Product:
    name: str
    customer: Any # fields without defaults must come before fields with defaults
    count: int = 0
    price: float = 0.0
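A quick usage sketch (the values are illustrative); `@dataclass` auto-generates `__init__`, `__repr__`, and `__eq__`:

```python
p = Product("notebook", "acme corp")
print(p)                                      # Product(name='notebook', customer='acme corp', count=0, price=0.0)
print(p == Product("notebook", "acme corp"))  # True - field-by-field equality for free
```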
-
Nested list comprehensions follow the same flow as their loop equivalents:
[x for row in grid for x in row] # is equivalent to... for row in grid: for x in row:
-
In dict comprehensions with a ternary operator, the
else
is only applicable to the value. The key is unaffected: `{k: (v if v % 2 == 0 else 1) for k, v in collection.items()}`
-
List of empty lists:
[[] for i in range(x)]
-
Matrix:
[[False for x in range(len(grid[0]))] for y in range(len(grid))]
-
Sanitizing/extracting elements from string:
s = "".join(c for c in s if c not in x)
-
Converting a string to list of chars:
list(string)
- Conversely:
"".join(string)
- Conversely:
-
  `import string`: provides handy constants such as `string.ascii_lowercase`, `string.ascii_uppercase`, and `string.digits`
-
Check all elements meet a condition:
all(c == '' for c in x)
-
Check that at least 1 element meets a condition:
any(x < 0 or y < 0 for (x, y) in z)
-
Reverse list:
x[::-1]
- In-place reversal:
x.reverse()
- In-place reversal:
-
Sort list:
sorted(x)
-
In-place:
x.sort()
-
Sorting with key:
sorted(x, key=lambda x: x[1], reverse=True)
-
O(nlogn)
time complexity andO(n)
space
-
Function calls
- Pass by reference
  - The variable from the caller is passed directly into the function, implicitly allowing direct changes to that memory location.
- Pass by value (Java)
  - A new variable is created in the function and the contents of the supplied argument are copied into this new object (a new location in memory).
- Pass by object reference (Python)
  - A new variable is created yet again, but instead of copying, the variable points to the same memory location as the variable supplied in the argument. Any changes to the new variable will therefore be reflected outside of the function. Re-assignments, however, are scoped to the method and will not affect the original variable.
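A minimal sketch of Python's pass-by-object-reference behavior described above (the function name is illustrative):

```python
def mutate_and_reassign(items):
    items.append(4)  # mutation is visible to the caller (same underlying object)
    items = [0]      # re-assignment only rebinds the local name

nums = [1, 2, 3]
mutate_and_reassign(nums)
print(nums)  # [1, 2, 3, 4] - the append persisted, the re-assignment did not
```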
-
x is y
: object comparison -
x == y
: value comparison -
Swapping:
a, b = b, a
-
Appending a list to another list stores a reference to the appended list
- i.e. Any later updates to the appended list will be visible inside the outer list as well
- To resolve this issue, append a copy of the list instead:
`list1.append(list2[:])` or `list1.append(list(list2))`
-
Instance variables are declared in
__init__()
-
Class variables are declared outside the constructor
class Square:
    length = 0 # class variable
    height = 0 # class variable

    def __init__(self, color):
        self.color = color # instance variable
- `@property` enables a method to be accessed as an attribute
  - Useful for overriding setter behavior via `@{method_name}.setter`
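A small sketch of `@property` with a custom setter (the `Celsius` class is illustrative):

```python
class Celsius:
    def __init__(self, temp=0):
        self._temp = temp

    @property
    def temp(self):          # accessed as c.temp, no parentheses
        return self._temp

    @temp.setter
    def temp(self, value):   # custom validation runs on assignment
        if value < -273.15:
            raise ValueError("below absolute zero")
        self._temp = value

c = Celsius()
c.temp = 25
print(c.temp)  # 25
```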
- Maximum/minimum number: `float('inf')` / `float('-inf')`
- Caching results for recursive calls / dynamic programming
  - `from functools import cache`
  - Decorate any method that should be memoized with `@cache` (see the sketch below)
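A minimal sketch, assuming Python 3.9+ (where `functools.cache` was introduced):

```python
from functools import cache

@cache  # memoizes results keyed by the call's arguments
def fib(n):
    return n if n <= 1 else fib(n - 1) + fib(n - 2)

print(fib(80))  # 23416728348467685 - returns instantly thanks to memoization
```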
-
Memory management
- Managed entirely by the Python private heap space; all objects and data structures are located here. The interpreter manages this space itself and does not grant the engineer direct access to it.
Problem | Solution |
---|---|
divide by 2^i | `num >> i` |
multiply by 2^i | `num << i` |
get ith bit | `num & (1 << i)` |
set ith bit | `num \| (1 << i)` |
clear ith bit | `num & ~(1 << i)` |
update ith bit | `(num & ~(1 << i)) \| (x << i)` |
clear right-most set bit (also a power-of-2 check) | `num & (num - 1)` |
swap variables | `num1 ^= num2` -> `num2 ^= num1` -> `num1 ^= num2` |
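A few of the table's identities evaluated in Python (the value of `num` is arbitrary):

```python
num = 0b10110  # 22

print(num >> 1)                # 11 - divide by 2
print(num & (1 << 2))          # 4 - bit 2 is set (non-zero result)
print(bin(num | (1 << 0)))     # 0b10111 - set bit 0
print(bin(num & ~(1 << 4)))    # 0b110 - clear bit 4
print((num & (num - 1)) == 0)  # False - 22 is not a power of two
```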
An XOR between two numbers is the sum of the integers' binary representations, without taking carry into account (sum) or without taking borrow into account (difference). In other words, it keeps exactly the bit positions where the two operands differ.
When summing two numbers, the carry-forward values need to be leveraged. To get these values, it follows that if an XOR x^y
finds all bit differences, an AND x&y
would provide all bit similarities. This result is then shifted over to the left once in order to properly align it with the XOR result. This process is then continuously repeated until the carry is 0.
For a difference between two numbers, the borrow values need to be computed and applied against the XOR result. The borrow can be derived using ((~x)&y) << 1
. The borrow is then used to repeat the above process until it is 0.
def get_sum(a, b):
    # ensure |a| >= |b| so that the sign of the result follows a
    if abs(a) < abs(b):
        return get_sum(b, a)
    x, y = abs(a), abs(b)
    sign = 1 if a > 0 else -1
    if a * b >= 0:
        # addition: carry = AND of the operands, shifted left
        while y:
            x, y = x ^ y, (x & y) << 1
    else:
        # subtraction: borrow = (~x & y), shifted left
        while y:
            x, y = x ^ y, ((~x) & y) << 1
    return x * sign
In big-endian, the most significant byte is stored first (at the lowest memory address). In little-endian, the least significant byte is stored first.
A technique that helps to efficiently solve a class of problems which have overlapping subproblems. This is done by storing the results of subproblems in memory for quick retrieval - memoization (top-down approach).
By reversing the direction in which the algorithm works i.e. by starting from the base case and working towards the solution, we can also implement dynamic programming in a bottom-up manner using tabulation.
The dynamic programming framework is composed of 3 core modules:
- A function/data structure that computes/contains the answer to every problem for a given state
- A state is defined by a combination of certain variables at a certain point in time that are required to calculate the optimal result
- These states can then be preserved in-memory for algorithmic optimization
- A recurrence relation to transition between states
- Base cases
Dynamic programming problems can be categorized into 5 main frameworks:
- 0/1 Knapsack
- Knapsack problems generally require at least 2 states; the index is a common one
- Unbounded Knapsack
- Shortest Path (Unique Paths)
- Fibonacci Sequence (House Thief, Jump Game)
- Longest Common Substring/Sub-sequence
Ideal for scenarios in which not every problem needs to be solved. Typically implemented with a recursive function and hash map.
memo = {}
def fib(n):
    if n <= 1:
        return n
    if n not in memo:
        memo[n] = fib(n - 1) + fib(n - 2)
    return memo[n]
Tabulation is best suited for scenarios that require a large amount of recursive calls i.e. all problems to be solved. Typically implemented with nested for-loops and an array.
def climbStairs(n: int) -> int:
    if n == 1:
        return 1
    dp = [0] * (n + 1)
    dp[1] = 1 # base cases
    dp[2] = 2 # base cases
    for i in range(3, n + 1):
        dp[i] = dp[i - 1] + dp[i - 2] # recurrence relation
    return dp[n]
The Open Systems Interconnection (OSI) model describes seven layers that computer systems use to communicate over a network. OSI is a generic, protocol-independent model that is intended to describe all forms of network communication. This is in contrast to TCP/IP which is a functional model designed to solve specific communication problems using discrete, standard protocols.
The foundational cornerstone, this layer is responsible for the physical equipment connecting network nodes. It defines the connector, the electrical cable (or wireless technology) connecting the devices and transmits raw binary data between the two.
This layer is similar to the network layer, with the exception that it's concerned with data transfer between two devices on the same network. It establishes a connection between these two physically-connected nodes and is responsible for transferring data from the source to its destination by breaking up packets into frames. This layer deals with MAC addresses.
This layer serves to facilitate data transfer between two different networks. It has two primary functions: 1. breaking up/re-assembling segments into packets and 2. routing packets by discovering the most optimal path in the network. This layer deals with IP addresses.
The transport layer handles the data transferred in the session layer: when transmitting, it breaks the data into segments and when receiving, it's responsible for reassembling the segments into a coherent structure that can be interpreted by the session layer. It implements a flow control, such that it sends data at a rate that matches the connection speed of the receiving device as well as error control (checking if data is malformed and re-requesting if necessary).
This layer is responsible for creating and maintaining the communication channels (sessions) between devices.
A preparation layer for the application layer. It defines how two devices encode, encrypt, and compress data, transforming it into a consumable format for applications; this is also where encryption occurs. Additionally, this layer is responsible for compressing the data received from the application layer prior to delivery to the session layer.
This layer is utilized by end-user software such as web browsers and email clients. The applications leverage the protocols of this layer to present meaningful data to the client: HTTP, FTP, POP, SMTP, DNS, etc.
 | Overriding | Overloading |
---|---|---|
Purpose | Changes the inherited behavior of a method to implement a new, custom behavior | Extends an existing (non-inherited) method by modifying its parameters |
Polymorphism | Run-time polymorphism | Compile-time polymorphism |
- The ability of a programming language to present the same interface for objects of differing underlying structures
  - e.g. Shapes can manifest in many different forms
    - Area is calculated differently for each shape, but they all share the same interface: `getArea()`
    - The interface is the same for all, but the implementation varies for each (see the sketch below)
- Can be implemented via an interface or inheritance: method overriding, abstract classes
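A minimal Python sketch of the shape example (names like `get_area` are illustrative; the shared interface is expressed here via an abstract base class):

```python
from abc import ABC, abstractmethod
import math

class Shape(ABC):
    @abstractmethod
    def get_area(self) -> float:
        ...

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def get_area(self):
        return math.pi * self.radius ** 2

class Rect(Shape):
    def __init__(self, width, height):
        self.width, self.height = width, height

    def get_area(self):
        return self.width * self.height

# same interface, different implementations
for shape in [Circle(1), Rect(2, 3)]:
    print(shape.get_area())  # 3.141592653589793, then 6
```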
- A way for a class to extend an already existing class
- Commonly used for specialization on top of a "base" class
- Base class contains common functionality and variables that are shared among all inheritors
A mutual exclusion object that synchronizes access to a resource. It's a locking mechanism that ensures only one thread can acquire the mutex at a time and enter the critical section. The thread releases the mutex only after it exits the critical section.
acquire(mutex)
# critical section
release(mutex)
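A minimal sketch of the same idea using Python's `threading.Lock` (the counter workload is illustrative):

```python
import threading

lock = threading.Lock()
counter = 0

def increment():
    global counter
    for _ in range(100_000):
        with lock:  # acquire() on entry, release() on exit
            counter += 1

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 - no lost updates inside the critical section
```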
A signaling mechanism that allows a thread to be notified by other threads when a resource is available. A mutex, in contrast, can only be signaled by the thread which activated the lock.
wait(S):
    while S <= 0:
        pass # busy wait until a resource is available
    S -= 1

signal(S):
    S += 1
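A small sketch using Python's `threading.Semaphore` (the pool size and sleep are illustrative):

```python
import threading
import time

pool = threading.Semaphore(2)  # at most 2 threads inside the critical section at once

def worker(i):
    with pool:  # wait()/acquire on entry, signal()/release on exit
        print(f"worker {i} acquired a slot")
        time.sleep(0.1)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```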
Interface | Abstract Class |
---|---|
Method signatures only (implicitly abstract), i.e. no implementation | Abstract and non-abstract methods |
A class can implement multiple interfaces | A class can only extend one abstract class (Java) |
Public, final variables | No constraint on class variables |
Changes have downstream impact (implementers need to implement the new method) | Changes may or may not beget downstream changes |
Primitive | Bytes | Range |
---|---|---|
`byte` | 1 | -128 to 127 |
`char` | 2 | 0 to 65,535 (unsigned) |
`short` | 2 | -32,768 to 32,767 |
`int` | 4 | -2,147,483,648 to 2,147,483,647 |
`float` | 4 | ~±3.4 × 10^38 |
`double` | 8 | ~±1.7 × 10^308 |
`long` | 8 | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
`boolean` | undefined | true or false |

Note: Java only supports signed types (with the sole exception of `char`)
integer
float
string
boolean
Note: Python only supports signed types. Additionally, primitive sizes vary depending on the machine architecture
Power | Number | Bytes |
---|---|---|
10^3 | 1,000 | 1 KB |
10^6 | 1,000,000 | 1 MB |
10^9 | 1,000,000,000 | 1 GB |
10^12 | 1,000,000,000,000 | 1 TB |
10^15 | 1,000,000,000,000,000 | 1 PB |
Power | Number | Bytes |
---|---|---|
2^8 | 256 | < 1 KiB |
2^10 | 1,024 | 1 KiB |
2^20 | 1,048,576 | 1 MiB |
2^30 | 1,073,741,824 | 1 GiB |
2^32 | 4,294,967,296 | 4 GiB |
2^40 | 1,099,511,627,776 | 1 TiB |