CSE Class Notes 1
These are my notes for the first half of the Computer System Engineering course, mainly covering concepts in file systems, especially distributed file systems. This is part of the material I prepared for the mid-term examination, and it should be helpful for general study as well.
Inode-fs
Unix FS (top -> down):
- Symbolic link layer: name files on other disks
- Soft link: a new inode that stores the target file's path
- Hard link: no new inode
- Absolute path name layer: root directory
- Path name layer: path name -> inode number (hierarchy of directories and files, link/unlink)
- File name layer: file name -> inode number (mapping table in directory)
- Inode number layer: inode number -> inode (inode table)
- File layer: block index number in inode -> block number on disk (inode: metadata of file)
- Block layer: block number -> block data (block: the basic unit in a file system, not the same thing as a sector)
- Super block: one per file system: size, number of free blocks / free list (bitmap), other metadata
A file name is not part of the file itself; it is data of a directory and metadata of the file system. All hard links to a file are equal. No hard links to directories are allowed -> the namespace remains an acyclic graph.
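To make the layering concrete, here is a toy Python sketch of name resolution from a path down to blocks; every structure and the `name:inum` directory format are invented for illustration, not real kernel code.

```python
# Toy layered lookup; all structures and the on-disk directory format
# ("name:inum" pairs) are hypothetical, not real kernel code.
disk = {}          # block layer: block number -> block data (bytes)
inode_table = {}   # inode number layer: inode number -> inode

class Inode:
    def __init__(self, blocks, size):
        self.blocks = blocks   # file layer: block index -> block number
        self.size = size

def read_file(inum):
    """File layer: concatenate the blocks named by the inode."""
    ino = inode_table[inum]
    return b"".join(disk[b] for b in ino.blocks)[:ino.size]

def parse_dir(data):
    """File name layer: directory data is a name -> inode-number table."""
    return {n: int(i) for n, i in
            (entry.split(":") for entry in data.decode().split())}

def resolve(path, root_inum=1):
    """Path name layer: walk each component starting at the root."""
    inum = root_inum
    for name in path.strip("/").split("/"):
        inum = parse_dir(read_file(inum))[name]
    return inum

# Root directory (inode 1) holds one file "a.txt" (inode 2).
disk[100] = b"a.txt:2"; disk[200] = b"hello"
inode_table[1] = Inode([100], len(disk[100]))
inode_table[2] = Inode([200], 5)
assert resolve("/a.txt") == 2
```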
Fs-API
FD -> fd_table (in memory, per process); file cursor: shared / not shared -> file_table (in memory, system-wide). A process uses its fd_table to find the file_table index, then looks up the corresponding inode number and file cursor in file_table (sketched below).
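A toy sketch of this double indirection (the names `fd_table` / `file_table` follow the notes; the code itself is illustrative): two fds that map to the same file_table entry share one cursor.

```python
# Illustrative structures; not real kernel code.
file_table = []   # system-wide: each entry holds inode number + cursor

class OpenFile:
    def __init__(self, inode_num):
        self.inode_num = inode_num
        self.cursor = 0          # shared by every fd that maps here

class Process:
    def __init__(self):
        self.fd_table = {}       # per-process: fd -> index into file_table

    def open(self, inode_num):
        file_table.append(OpenFile(inode_num))
        fd = len(self.fd_table)
        self.fd_table[fd] = len(file_table) - 1
        return fd

    def fork_fd(self, child, fd):
        # After fork the child inherits the fd_table entry, so parent
        # and child point at the same file_table entry -> shared cursor.
        child.fd_table[fd] = self.fd_table[fd]

p, c = Process(), Process()
fd = p.open(inode_num=2)
p.fork_fd(c, fd)
file_table[p.fd_table[fd]].cursor += 10         # parent reads 10 bytes
assert file_table[c.fd_table[fd]].cursor == 10  # child sees the move
```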
Delete after open but before close: the inode is not freed until the last process holding the file open calls CLOSE.
Renaming:
- LINK(from_name, to_name)
- UNLINK(from_name)
Dist-sys
Scalable web apps: high request rate, massive data, transparent scaling.
Routing method: consistent hashing (find the caching server responsible for a given key-value pair).
CAP theorem: Consistency, Availability, Partition tolerance: pick 2 out of 3.
Distributed system properties: scalability, performance, fault tolerance, consistency, usability.
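A minimal consistent-hashing sketch (my own illustration; production rings also add virtual nodes per server so load spreads evenly):

```python
# Hash servers onto a ring; a key goes to the first server clockwise.
import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        self.points = sorted((h(s), s) for s in servers)

    def server_for(self, key):
        """First server clockwise from the key's position on the ring."""
        keys = [p for p, _ in self.points]
        i = bisect.bisect_right(keys, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["cache-a", "cache-b", "cache-c"])
print(ring.server_for("user:42"))
# Only keys between a departing/joining server and its predecessor
# move, which is what makes scaling transparent.
```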
Remote Procedure Call
Other File System Type:
- FAT: a linked list with one entry per block
- Follow the list to get a file's blocks
- Uses a free list instead of a bitmap
- Poor random-read performance; no hard-link support
- Directory: filename -> file number, size
Filesystem + RPC: a form of distributed system. Idea: build RPC on top of the socket interface.
RPC stub:
- Client:
- put arguments in request
- send to server
- wait for response
- Server:
- wait for message
- get parameters
- call the procedure according to the parameters
- put the results in response
- send the response to the client
Message: service ID, service parameters, marshaled (pointers cannot be sent).
Message types: standardized encoding, binary formats.
Automatic stub generation: needs the argument types and the function name.
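A toy stub pair over a socket pair, with json standing in for the marshal format (illustrative only, not a real RPC framework):

```python
import json, socket, threading

def call(sock, service, *args):                 # client stub
    sock.sendall(json.dumps({"svc": service, "args": args}).encode())
    return json.loads(sock.recv(4096))["ret"]   # wait for the response

def serve_one(sock, procedures):                # server stub
    msg = json.loads(sock.recv(4096))           # wait, then get params
    ret = procedures[msg["svc"]](*msg["args"])  # dispatch on service ID
    sock.sendall(json.dumps({"ret": ret}).encode())

client, server = socket.socketpair()
t = threading.Thread(target=serve_one,
                     args=(server, {"add": lambda a, b: a + b}))
t.start()
print(call(client, "add", 1, 2))                # -> 3
t.join()
```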
RPC facing failure:
- At-least-once semantics: retry until a response arrives
- At-most-once semantics
- the server saves the transaction ID (this can achieve exactly-once, but it is unclear when the saved IDs can safely be deleted)
- Idempotent: can run any number of times without harm; this guarantees at-most-once behavior
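A sketch of the at-most-once table (illustrative): the server remembers each transaction ID and replays the saved reply on retries; the unbounded growth of the table is exactly the deletion problem noted above.

```python
# At-most-once on the server side: dedup by transaction ID.
seen = {}   # transaction ID -> saved reply

def execute_once(txid, op, *args):
    if txid in seen:
        return seen[txid]        # duplicate retry: do not re-execute
    reply = op(*args)
    seen[txid] = reply           # grows forever without an eviction rule
    return reply

counter = [0]
def bump():
    counter[0] += 1
    return counter[0]

assert execute_once("tx1", bump) == 1
assert execute_once("tx1", bump) == 1    # retried: not applied twice
```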
Distributed File System
Remote Access Model
NFS: Network File System
Goals:
- transparency
- recovery from failure: server stateless, client retries
- high performance: caching
System calls -> RPC
- few open/close calls
- returns a file handle
- read requires an explicit offset
NFS Clients
- provide open/close interfaces to applications
- fd <-> fh (a local inode is created from the file attributes)
- Why no open in the protocol? The related state is handled on the client side
- file handler: inode number + generation number
- Delete after open: return an error (unlike the local Unix FS)
Problem: performance overhead
- Caching at client(file data, file attribute, pathname bindings)
- Cache coherence
- close-to-open consistency
- open: compare last modification time with local cache
- close: send cached writes to the server
- read/write consistency
- NFS guarantees coherence only for certain operations
- validation
- compare last modification time, invalidate cached data if remote is more recent
- always invalidate data after some time
- close-to-open consistency
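A sketch of these checks on the client side (all names are invented; a real NFS client issues GETATTR and other RPCs for this):

```python
# Close-to-open consistency: validate on open, flush on close.
import time

class Server:
    def __init__(self):
        self.files = {}                       # fh -> (data, mtime)
    def getattr(self, fh):
        return {"mtime": self.files[fh][1]}
    def read(self, fh):
        return self.files[fh][0]
    def write(self, fh, data):
        self.files[fh] = (self.files[fh][0] + data, time.time())

class CachedFile:
    def __init__(self, data, mtime):
        self.data, self.mtime, self.dirty = data, mtime, []

class Client:
    def __init__(self, server):
        self.server, self.cache = server, {}

    def open(self, fh):
        attrs = self.server.getattr(fh)       # compare modification time
        c = self.cache.get(fh)
        if c is None or c.mtime < attrs["mtime"]:   # remote is newer
            self.cache[fh] = CachedFile(self.server.read(fh),
                                        attrs["mtime"])
        return fh

    def write(self, fh, data):
        self.cache[fh].dirty.append(data)     # buffer writes locally

    def close(self, fh):
        for d in self.cache[fh].dirty:        # send cached writes
            self.server.write(fh, d)
        self.cache[fh].dirty.clear()

srv = Server(); srv.files["fh1"] = (b"", 0)
cl = Client(srv)
cl.open("fh1"); cl.write("fh1", b"hi"); cl.close("fh1")
assert srv.files["fh1"][0] == b"hi"
```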
Key-Value Store
System Modeling
- Build an internal abstraction of the system, distill the main components that determine the system behavior
- Qualitative Model
- Typically needs to be parameterized (static / input)
Random write is slower than sequential write -> log-structured file (append-only).
Get latency: add an index (key -> offset), e.g. an in-memory hash index.
- fat pointer: virtual memory address + offset in a file
How to prevent the file from growing forever? Compaction.
Range queries? Replace the hash table with a B+Tree (no log structure; values live in the leaf nodes).
- Trade-off: Get, Insert, and Update become slower
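A minimal sketch of the append-only log with an in-memory hash index (toy record format of my own):

```python
# Bitcask-style store: sequential appends + hash index for fast Get.
import io

log = io.BytesIO()   # stands in for an append-only file on disk
index = {}           # key -> byte offset of the latest record

def put(key, value):
    log.seek(0, io.SEEK_END)             # sequential write: append only
    index[key] = log.tell()
    log.write(f"{key}\t{value}\n".encode())

def get(key):
    log.seek(index[key])                 # one seek, thanks to the index
    return log.readline().decode().rstrip("\n").split("\t")[1]

put("a", "1"); put("a", "2")
assert get("a") == "2"                   # the newest record wins
# Compaction (not shown) copies only live records into a fresh log,
# which is what keeps the file from growing forever.
```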
Consistency
Consistency Model: defines rules for the apparent order and visibility of updates
- Eventual -> Causal -> Sequential -> Linearizable -> Strict (weakest to strongest)
Eventual data consistency: eventually all replicas hold the same data
- Deterministically sort all posts upon syncing
- ensure both nodes have the same updates in the log
- ordered lists of updates at each time
- reapply sorted updates after the sync
- a total order among operations
- Problem: unsynchronized clock across different devices
- Ideal order: Causal ordering
- Lamport Clock: a logical clock
- Each server keeps a clock T
- Increment T as real time passes
- T = max(T' + 1, T) when receiving a message carrying timestamp T' from another server
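A direct transcription of these rules into code (illustrative):

```python
# Lamport clock: T advances locally and jumps forward on receive.
class Server:
    def __init__(self):
        self.T = 0

    def local_event(self):
        self.T += 1                      # increment as time passes
        return self.T

    def send(self):
        self.T += 1
        return self.T                    # timestamp carried by the message

    def receive(self, T_msg):
        self.T = max(T_msg + 1, self.T)  # T = max(T' + 1, T)

a, b = Server(), Server()
a.T = 10
b.receive(a.send())                      # b's clock jumps past a's
assert b.T == 12
```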
A Lamport clock gives a total order over every message -> too strong.
Partial order: vector clock
- Local lamport clock and cached copies of clocks in other servers
- Update $T_i = \max(T_i, T'_i + 1)$ when seeing server $i$'s clock $T'_i$
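A sketch of the vector-clock update and the resulting partial order (illustrative):

```python
# Vector clock: own clock plus cached copies of other servers' clocks.
class Server:
    def __init__(self, my_id, n):
        self.id, self.V = my_id, [0] * n

    def local_event(self):
        self.V[self.id] += 1
        return list(self.V)

    def receive(self, V_other):
        # merge entry-wise, then tick our own entry
        self.V = [max(a, b) for a, b in zip(self.V, V_other)]
        self.V[self.id] += 1

def happens_before(V1, V2):
    """Partial order: V1 < V2 iff V1 <= V2 entry-wise and V1 != V2."""
    return all(a <= b for a, b in zip(V1, V2)) and V1 != V2

s0, s1 = Server(0, 2), Server(1, 2)
e1, e2 = s0.local_event(), s1.local_event()     # [1,0] and [0,1]
assert not happens_before(e1, e2) and not happens_before(e2, e1)
# Neither order holds: the two events are concurrent.
```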
Strong consistency model: the system behaves as if there were a single copy of the data
- strict consistency: operations are ordered by issue time
Consistency under Failure: Atomicity
Completion-to-issue: if operation a's completion time (as observed by the client) is before operation b's issue time, then b must be observed to execute after a.
Linearizability: clients cannot infer the execution order of concurrent operations.
Implementing linearizability in replicated data storage
Principle: one writer rule
- only one process is allowed to do the read/write
- then propagate to other processes
Primary backup implementation:
- In-order updates: the primary assigns a sequence number; all replicas apply writes in sequence-number order
- Performance issues & reliability issues (Raft)
Atomicity
Ensure a set of operations written to a file is all-or-nothing.
Shadow copy: copy-on-write (replace the original file with the updated copy only if all writes succeed)
- Goal: file system APIs are all-or-nothing
- Journaling
- record changes in journal
- commit journal
- update
- Mitigation: only protect metadata via journaling
- Drawback: hard to apply when multiple clients share the same file
Realizing atomicity: Logging+Checkpoint
Journaling Drawbacks:
- everything is written to disk twice
- hard to generalize to multiple files
Logging for Atomicity
Idea: avoid updating the on-disk state until we can recover it after a failure.
Transaction and commit point: marking atomic units.
Recovery of the commit log (redo-only logging):
- Traverse the log from start to end
- Re-apply the updates recorded in each complete log entry
Cons of redo-only logging:
- All updates must be buffered in memory
- Log file is continuously growing
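A sketch of redo-only recovery under an invented record format (illustrative):

```python
# Redo-only recovery: scan start -> end, re-apply committed updates.
def recover(log, db):
    committed = {txid for kind, txid, *_ in log if kind == "CMT"}
    for kind, txid, *rest in log:        # traverse from start to end
        if kind == "UPDATE" and txid in committed:
            key, new_value = rest
            db[key] = new_value          # re-apply the update
    return db

log = [("UPDATE", 1, "x", 10), ("CMT", 1),
       ("UPDATE", 2, "x", 99)]           # TX 2 never committed
assert recover(log, {}) == {"x": 10}
```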
Undo Logging
- Should contain sufficient information to undo uncommitted transactions
- Log records
- records from different transactions may interleave -> need a pointer to trace operations from the same TX
- Recovery rules
- Traverse the log from end to start
- Mark every transaction without a CMT record and append an ABORT record for it
- Undo aborted transactions' updates from end to start
- Redo committed transactions' updates from start to end
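The same invented record format as before, now carrying old values so the recovery rules above can be applied directly (illustrative):

```python
# Undo-log recovery: abort unfinished TXs, undo them, redo committed.
def recover(log, db):
    committed = {t for kind, t, *_ in log if kind == "CMT"}
    # 1. Transactions with no CMT record get an ABORT record appended.
    for kind, t, *_ in list(log):
        if kind == "UPDATE" and t not in committed:
            log.append(("ABORT", t))
    aborted = {t for kind, t, *_ in log if kind == "ABORT"}
    # 2. Undo aborted transactions' updates, scanning end -> start.
    for kind, t, *rest in reversed(log):
        if kind == "UPDATE" and t in aborted:
            key, old, new = rest
            db[key] = old
    # 3. Redo committed transactions' updates, scanning start -> end.
    for kind, t, *rest in log:
        if kind == "UPDATE" and t in committed:
            key, old, new = rest
            db[key] = new
    return db

db = {"x": 99}   # TX 2's uncommitted write already reached the disk
log = [("UPDATE", 1, "x", 0, 10), ("CMT", 1),
       ("UPDATE", 2, "x", 10, 99)]
assert recover(log, db) == {"x": 10}
```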
Checkpointing
Idea: save the system state in a compact form so that restore is faster.
Checkpointing with logging. Observation: uncommitted updates live only in the page cache. Challenge: maintaining correctness:
- wait until no operation is in progress
- write a CKPT to the log
- Flush page cache
- Discard all the log records except CKPT record
Checkpointing in LLM Training
- Checkpointing must be atomic -> shadow copy
- Reduce fault-tolerance overhead (checkpoint overhead + recomputation)
Checkpointing in Eventual Consistency
Challenge: the data is decided by multiple servers, so we cannot simply use all the records.
Decentralized approach: an update with time T is stable if all nodes have seen all updates with time <= T.
Centralized approach:
- One primary server
- Assign a total commit order CSN to each write
- Complete timestamp
- Any write with a known CSN is stable
- All stable writes are ordered
- CSNs are exchanged between servers
Note: a server asks the primary to assign CSNs for all of its tentative writes.
Before-or-After Atomicity and Serializability
Linearizability gives a serial execution of reads / writes; here we want serial execution of whole actions.
Goal: before-or-after atomicity.
Global lock:
- The action must acquire the global lock before executing and release it after it commits.
- Trade performance for correctness
Fine-grained locking: each piece of shared data has its own lock
- The action must acquire the data's lock before accessing it and release it after the access finishes
- Does not ensure before-or-after atomicity
- Solution: acquire all locks first and release them all at the end
2PL(Two phase locking)
- The action must acquire the shared data's lock before accessing it, and once any lock is released, no new lock can be acquired.
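A sketch of the growing/shrinking discipline (illustrative; a real lock manager would also have to handle deadlock):

```python
# 2PL: acquire as you go (growing phase); after the first release,
# no new lock may be acquired (shrinking phase).
import threading

class Transaction:
    def __init__(self, locks):
        self.locks = locks           # shared: name -> threading.Lock
        self.held, self.shrinking = set(), False

    def access(self, name):
        if name not in self.held:
            assert not self.shrinking, "2PL violated: lock after release"
            self.locks[name].acquire()
            self.held.add(name)

    def commit(self):
        self.shrinking = True        # release phase: all locks at once
        for name in self.held:
            self.locks[name].release()

locks = {"x": threading.Lock(), "y": threading.Lock()}
tx = Transaction(locks)
tx.access("x"); tx.access("y")       # growing phase
tx.commit()                          # shrinking phase
```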
Serializability: run actions concurrently and have it appear as if they ran sequentially.
How to check it?
View serializability: the final written state as well as the intermediate reads are the same as in some serial schedule (informal definition).
Conflict serializability:
- Two operations conflict if:
- they operate on the same data object
- at least one of them is write
- they belong to different transactions
- the order of conflicting operations is the same as the order of conflicts in some sequential schedule
Conflict serializability is strictly stronger than view serializability.
Serializability, OCC & Transaction
Proof of 2PL: by contradiction.
2PL problems:
- Phantom problem: only locking the things you touch is insufficient
- Solutions: Predicate lock; Range locks in a B-tree index or lock entire table
Deadlock: two-phase locking is pessimistic: before proceeding, each TX must wait for conflicting TXs to release their locks.
Methods:
- Acquire locks in a pre-defined order (must know the whole data set before access)
- Detect deadlock by calculating the conflict graph & abort one TX to break the cycle
- Using heuristics to pre-abort TXs
Optimistic concurrency control
- Concurrent local processing
- Reads data in a read set
- Buffers writes into a write set
- Validate serializability in a critical section
- compare each data item's version
- Commit the results in the critical section, or abort
Phases 2 & 3 should execute in one critical section.
- Two-phase locking (no deadlock since the read set is known)
- No need for read-set locks (validation <=> lock)
- elements in the read set are neither changed nor locked
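A sketch of the validate-then-commit step using per-key versions (illustrative; a single global mutex stands in for the critical section):

```python
# OCC: read + record versions, compute locally, validate + commit.
import threading

db, versions = {"x": 0}, {"x": 0}
critical = threading.Lock()

def run_occ(keys, compute):
    with critical:                         # read data + record versions
        local = {k: db[k] for k in keys}
        read_set = {k: versions[k] for k in keys}
    write_set = compute(local)             # concurrent local processing
    with critical:                         # validation + commit together
        if any(versions[k] != v for k, v in read_set.items()):
            return False                   # version changed: abort/retry
        for k, v in write_set.items():
            db[k] = v
            versions[k] = versions.get(k, 0) + 1
        return True

assert run_occ(["x"], lambda local: {"x": local["x"] + 1})
assert db["x"] == 1
```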
Locking Preliminary: atomic compare and swap
- the LOCK prefix ensures an instruction executes atomically on a memory address
Transaction & Multi-site Transaction
OCC Problems
- False aborts & livelock
- Under high contention, OCC may continuously abort
System Principle behind OCC:
- Fast Path & Slow Path
- use a fast path to accelerate execution
- validate that the fast path matches the results
- if validation fails, go to slow path
RTM(Restricted Transactional Memory)
- xbegin / xend mark the beginning and end of a transaction
- if fail, go to fallback
- Problems
- limited working set
- external events break the RTM execution
Transaction: an abstraction for managing data that guarantees the ACID properties.
Consistency in a TX is defined by the application; atomicity alone is not sufficient for application consistency.
Read-only or read-mostly transactions. Idea: multi-version concurrency control (MVCC)
- A set of versions represents a snapshot of the database
- Read: read only from a consistent snapshot taken at the TX's start time
- Write: install a new version of data instead of overwriting the existing one
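A sketch of version lists and snapshot reads (illustrative; timestamps come from a toy counter):

```python
# MVCC: writes install new versions; a reader with start timestamp ts
# sees the latest version with commit_ts <= ts.
versions = {"x": [(0, "v0")]}        # key -> list of (commit_ts, value)
clock = [0]

def read(key, start_ts):
    # consistent snapshot: ignore versions committed after start_ts
    return max(v for v in versions[key] if v[0] <= start_ts)[1]

def write(key, value):
    clock[0] += 1                    # install a new version instead of
    versions[key].append((clock[0], value))   # overwriting the old one

ts = clock[0]                        # reader's start timestamp
write("x", "v1")                     # concurrent writer commits later
assert read("x", ts) == "v0"         # reader still sees its snapshot
```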
Optimize OCC
- Concurrent local processing
- Commit the results in critical section
- No validation is needed
Problem: partial snapshots. Solution: enforce write atomicity of a snapshot with locking:
- get commit timestamp after locking
MVCC ensures no race for reads, but not writes
- Validation: check whether another TX has installed a new snapshot after the committing TX's start time
Write skew anomaly
- Case: R(X), W(Y) & R(Y), W(X)
- Fix: validate the read set in read-write TX
Multi-site transactions: Two-phase Commit (2PC)
- High-layer TX: coordinates the execution of lower-layer TXs
- Low-layer TX: specific reads and writes that execute on a single machine
- Phase 1:
- Delay the commitment of lower-layer TXs
- Lower-layer transactions either abort or tentatively commit
- The higher-layer transaction evaluates the lower-layer outcomes
- Phase 2:
- Higher layer decides whether low-layer TXs will commit or abort
- coordinate the commitment of lower layers
- Idea: the coordinator is responsible for committing all the TXs
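A sketch of the coordinator's two phases (illustrative; real participants would log their votes durably before replying):

```python
# 2PC: collect votes, then broadcast one global decision.
class Participant:
    def prepare(self, tx):
        # Tentatively commit: hold locks, log state, expose nothing yet.
        return True                   # vote yes (or False to abort)

    def finish(self, tx, commit):
        print("commit" if commit else "abort", tx)

def two_phase_commit(tx, participants):
    votes = [p.prepare(tx) for p in participants]   # phase 1
    decision = all(votes)             # any single "no" aborts everyone
    for p in participants:            # phase 2
        p.finish(tx, decision)
    return decision

assert two_phase_commit("tx1", [Participant(), Participant()])
```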
Consistency under Partial Failure
Goal: under partial failure, all sites either all commit or all abort.
Multi-site Atomicity
- Principles
- Follow the coordinator’s decision
- Log sufficient state to tolerate failures
- Problem: if the coordinator fails, the server cannot know whether to release the lock
Primary-Backup Model
- Replicated State Machine
- View Server: keeps a table that maintains a sequence of views and manages who is primary and backup; coordinators ask the view server who the primary is
- Rules when facing network partitions
- Primary must wait for backups to accept each request
- the backup checks that it is in the same view as the primary
- Non-primary must reject direct coordinator requests
- Quorum
- Confirm a write after writing to at least Qw replicas
- A read must see at least Qr replicas agree on the data or witness value; overlap requires Qw + Qr > N
- Together
- One server is responsible for ordering the inputs received from clients
- After receiving an input, it syncs to W backups before ACKing
- If the primary fails, select R backups and choose the one with the most up-to-date inputs
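A sketch of why the quorum sizes must overlap (Qw + Qr > N), using a toy versioned store:

```python
# Quorum reads/writes: any read quorum intersects any write quorum
# in at least one replica, so a read always sees the newest write.
N, Qw, Qr = 3, 2, 2
assert Qw + Qr > N                          # the overlap condition

replicas = [{} for _ in range(N)]
version = [0]

def write(key, value):
    version[0] += 1
    for r in replicas[:Qw]:                 # ACK after Qw replicas
        r[key] = (version[0], value)

def read(key):
    answers = [r.get(key) for r in replicas[-Qr:]]      # any Qr replicas
    return max(a for a in answers if a is not None)[1]  # newest version

write("x", "v1")
assert read("x") == "v1"    # overlap guarantees the fresh value is seen
```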
Paxos
Single-decree Paxos (agree on a single value)
- Challenge: support multiple writers
- Phase 1:
- Leader creates proposal N and sends it to a quorum
- Acceptor:
- N > any previous proposal:
- reply with the highest past proposal number & value
- promise to ignore all IDs < N
- else: ignore
- Phase 2:
- Leader receives enough promises (a majority)
- set the proposal's value to V; if any accepted value was returned, replace V with the returned one
- send accept request to quorum with the chosen value V
- Acceptor:
- If promise still holds:
- register the value V
- send accepted message to Proposer / Learners
- else: ignore
- Phase 3:
- Learner: respond to the client or take action on the request
- When is the value V chosen? When the leader receives accepted messages from a majority
- Acceptor fails after sending a promise: must have persisted Nh (highest promised number)
- Acceptor fails after receiving accept: must have persisted Nh, Na, Va (accepted number and value)
- Used for the view server, but single-decree is not enough
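A sketch of the acceptor's state machine, keeping the Nh / Na / Va names from above (illustrative):

```python
# Single-decree Paxos acceptor: Nh, Na, Va must survive crashes.
class Acceptor:
    def __init__(self):
        self.Nh = 0                  # highest promised proposal number
        self.Na, self.Va = 0, None   # highest accepted number + value

    def prepare(self, N):
        if N > self.Nh:              # N > any previous proposal
            self.Nh = N              # promise to ignore IDs < N
            return ("promise", self.Na, self.Va)
        return None                  # else: ignore

    def accept(self, N, V):
        if N >= self.Nh:             # promise still holds
            self.Nh, self.Na, self.Va = N, N, V
            return ("accepted", N)
        return None                  # else: ignore

a = Acceptor()
assert a.prepare(1) == ("promise", 0, None)
assert a.accept(1, "x") == ("accepted", 1)
assert a.prepare(2) == ("promise", 1, "x")   # leader must adopt "x"
```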
Multi-Paxos (value -> log)
Basic Multi-Paxos: correct but insufficient (2 rounds of RPC per value). Solution:
- Leader election: prepare message batching, prepare multiple writes at a time
- Reason: if there is only one writer for most of the time, batching the prepare messages lets the subsequent accept stages succeed in most cases (multiple writers still work)
Problem: holes in the log