
CSE Class Notes 1

These are my notes for the first half of the Computer System Engineering course, which mainly covers concepts in file systems, especially distributed file systems. This is part of the material I prepared for the mid-term examination, and it should also be helpful for general study.


Inode-fs

Unix FS (top -> down):

  1. Symbolic link layer: name files on other disks
    • Soft link: new inode(target file path)
    • Hard link: no new inode
  2. Absolute path name layer: root directory
  3. Path name layer: path name -> inode number (hierarchy of directories and files, link/unlink)
  4. File name layer: file name -> inode number (mapping table in directory)
  5. Inode number layer: inode number -> inode (inode table)
  6. File layer: block index number in inode -> block number on disk (inode: metadata of file)
  7. Block layer: block number -> block data (block: the basic unit in a file system, not the same thing as a sector)
    • Super Block: one per file system: size, number of free blocks / free list (bitmap), metadata

A file name is not part of the file; it is data of a directory and metadata of the file system. All hard links are equal (there is no "original"). No links to directories are allowed -> the namespace stays an acyclic graph.

Fs-API

FD -> fd_table (in memory, per process). File cursor: shared / not shared -> file_table (in memory). A process uses its fd_table to find the corresponding file_table index, then looks up the corresponding inode_num and file_cursor in the file_table.
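
A minimal sketch of this two-level lookup (the dict-based `fd_table` / `file_table` layout and helper names are illustrative, not the kernel's actual structures):

```python
# Hypothetical sketch of the fd_table -> file_table indirection described above.
# Each process has its own fd_table; file_table entries (inode_num + cursor)
# can be shared between processes, e.g. after fork().

file_table = []  # system-wide: one entry per OPEN call

def open_file(proc, inode_num):
    file_table.append({"inode_num": inode_num, "cursor": 0})
    fd = len(proc["fd_table"])
    proc["fd_table"].append(len(file_table) - 1)   # fd -> file_table index
    return fd

def read(proc, fd, n):
    entry = file_table[proc["fd_table"][fd]]       # fd_table -> file_table
    data = DISK[entry["inode_num"]][entry["cursor"]:entry["cursor"] + n]
    entry["cursor"] += len(data)                   # shared cursor advances
    return data

DISK = {7: b"hello world"}                         # inode_num -> file bytes
parent = {"fd_table": []}
fd = open_file(parent, 7)
child = {"fd_table": list(parent["fd_table"])}     # fork: copy fd_table, share file_table
print(read(parent, fd, 5), read(child, fd, 5))     # b'hello' b' worl' (shared cursor)
```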

Delete after open but before close: the inode is not freed until the last process that has it open calls CLOSE.

Renaming:

  • LINK(from_name, to_name)
  • UNLINK(from_name)

Dist-sys

Scalable web apps: high request rate, massive data, transparent scaling.

Routing method: consistent hashing (find the caching server for a given key); see the sketch below.

CAP theorem: Consistency, Availability, Partition tolerance: pick 2 out of 3.

Distributed system properties: scalability, performance, fault tolerance, consistency, usability.
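
A small sketch of consistent hashing with virtual nodes (the server names, hash function choice, and `vnodes=100` are illustrative):

```python
# Minimal consistent-hashing ring: a key maps to the first server point
# clockwise from hash(key), so adding/removing a server only moves nearby keys.
import bisect, hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, vnodes=100):
        # virtual nodes smooth out the key distribution across servers
        self.points = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def lookup(self, key):
        # first ring point clockwise from hash(key), wrapping around
        i = bisect.bisect(self.hashes, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["cache-a", "cache-b", "cache-c"])
print(ring.lookup("user:42"))  # the caching server responsible for this key
```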

Remote Procedure Call

Other File System Type:

  • FAT: linked list 1-1 with blocks
    • Follow lists to get blocks
    • Use free list instead of bitmap
    • Poor random-read performance; hard links are not supported
    • Directory: filename -> file number, size

Filesystem + RPC: a form of distributed system. Idea: build RPC atop the socket interface.

RPC stub:

  • Client:
    • put arguments in request
    • send to server
    • wait for response
  • Server:
    • wait for message
    • get parameters
    • call the procedure according to the parameters
    • put the results in response
    • send the response to the client

Message: Service ID + service parameters, marshalled (pointers cannot be used). Message types: standardized encoding, binary formats. Automatic stub generation needs the argument types and the function name.
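
A minimal sketch of the stub pair above, assuming JSON marshalling and an in-process call standing in for the socket send/receive (all names are illustrative):

```python
# Sketch of the RPC stub pattern: the client marshals arguments into a flat
# message (values only, no pointers), the server dispatches on the service ID.
import json

PROCEDURES = {"add": lambda a, b: a + b}      # service ID -> procedure

def server_stub(request: bytes) -> bytes:
    msg = json.loads(request)                 # unmarshal: get parameters
    result = PROCEDURES[msg["service_id"]](*msg["params"])
    return json.dumps({"result": result}).encode()

def client_stub(service_id, *params):
    # marshal arguments into a request message
    request = json.dumps({"service_id": service_id, "params": params}).encode()
    response = server_stub(request)           # stand-in for send + wait
    return json.loads(response)["result"]

print(client_stub("add", 2, 3))  # 5 -- looks just like a local call
```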

RPC facing failure:

  • At-least-once semantics: retry until getting a response
  • At-most-once semantics
    • server saves transaction IDs (this can achieve exactly-once, but it is unclear when the saved IDs can safely be deleted)
    • Idempotent: an operation that can run any number of times without harm; this guarantees at-most-once behavior
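
A sketch of the transaction-ID table for at-most-once semantics (the `seen` map and the counter operation are illustrative); note the open question of when entries can be deleted:

```python
# At-most-once: the server remembers transaction IDs and replays the saved
# reply on duplicate retries instead of re-executing the operation.
seen = {}  # transaction ID -> saved response

def execute(tx_id, op, state):
    if tx_id in seen:              # duplicate retry: do NOT re-execute
        return seen[tx_id]
    state["counter"] += op         # a non-idempotent operation
    seen[tx_id] = state["counter"]
    return seen[tx_id]

state = {"counter": 0}
print(execute("tx1", 5, state))    # 5
print(execute("tx1", 5, state))    # still 5: the retry is deduplicated
# Open question from the notes: when can entries in `seen` be deleted?
```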

Distributed File System

Remote Access Model

NFS: Network File System

Goals:

  • transparency
  • recovery from failure: server stateless, client retries
  • high performance: caching

System calls -> RPC

  • no OPEN / CLOSE RPCs
  • RPCs return a file handle
  • READ takes an explicit offset

NFS Clients

  • provide the open / close interface to applications
  • map fd <-> fh (a local inode is created from the file attributes)
  • Why no OPEN RPC? The open-related state is kept on the client side
  • file handle: inode number + generation number
  • Delete after open: returns an error (unlike a local Unix FS)

Problem: performance overhead

  • Caching at client(file data, file attribute, pathname bindings)
  • Cache coherence
    • close-to-open consistency
      • open: compare last modification time with local cache
      • close: send cached writes to the server
    • read/write consistency
      • NFS guarantee coherence only for certain operations
      • validation
        • compare last modification time, invalidate cached data if remote is more recent
        • always invalidate data after some time
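
A sketch of the validation rules above (the `FakeServer`, the 3-second TTL, and the cache layout are illustrative stand-ins for NFS GETATTR/READ):

```python
# Client-side cache validation: trust the cached copy only if the server's
# mtime has not changed, and always revalidate after a fixed TTL.
import time

TTL = 3.0  # always revalidate after this many seconds

class FakeServer:
    def __init__(self):
        self.files = {1: {"mtime": 1, "data": b"v1"}}
    def getattr(self, fh):                    # stand-in for the GETATTR RPC
        return {"mtime": self.files[fh]["mtime"]}
    def read(self, fh):                       # stand-in for the READ RPC
        return self.files[fh]["data"]

def read_cached(cache, fh, server):
    entry = cache.get(fh)
    if entry and time.time() - entry["checked"] < TTL:
        return entry["data"]                  # fresh enough: skip the server
    attrs = server.getattr(fh)
    if entry and attrs["mtime"] == entry["mtime"]:
        entry["checked"] = time.time()        # still valid: refresh the timer
        return entry["data"]
    data = server.read(fh)                    # remote is newer: refetch
    cache[fh] = {"data": data, "mtime": attrs["mtime"], "checked": time.time()}
    return data

cache, srv = {}, FakeServer()
print(read_cached(cache, 1, srv))             # b'v1' fetched from the server
print(read_cached(cache, 1, srv))             # b'v1' served from the cache
```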

Key-Value Store

System Modeling

  1. Build an internal abstraction of the system, distill the main components that determine the system behavior
  2. Qualitative Model
  3. Typically needs to be parameterized(static / input)

Random write < sequential write -> use a log-structured file (append only). Get latency: add an index (key -> offset), e.g. an in-memory hash index. A toy store is sketched after the list below.

  • fat pointer: virtual memory address + offset in a file
  • How to prevent the file from growing forever? Compaction
  • Range queries? Hash table -> B+Tree (no log structure; values live in leaf nodes)
  • Trade-off: Get, Insert, and Update become slower
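
A toy log-structured store with an in-memory hash index (the file format and names are illustrative; compaction is omitted):

```python
# Log-structured KV store: every put appends sequentially; the in-memory
# index maps each key to the offset of its latest record, so get is one seek.
class LogKV:
    def __init__(self, path):
        self.path = path
        self.index = {}                       # key -> offset of latest record

    def put(self, key, value):
        with open(self.path, "ab") as f:      # sequential append only
            self.index[key] = f.tell()
            f.write(f"{key}\t{value}\n".encode())

    def get(self, key):
        with open(self.path, "rb") as f:      # one seek thanks to the index
            f.seek(self.index[key])
            return f.readline().decode().rstrip("\n").split("\t", 1)[1]

db = LogKV("/tmp/kv.log")
db.put("a", "1"); db.put("a", "2")            # update = append a new record
print(db.get("a"))                            # "2": index points at the latest
```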

Consistency

Consistency Model: defines rules for the apparent order and visibility of updates

  • Eventual -> Causal -> Sequential -> Linearizability -> Strict

Eventual Data Consistency: Eventually all the data becomes the same

  • Deterministically sort all posts upon syncing
    • ensure both nodes have the same updates in the log
    • ordered lists of updates at each time
    • reapply sorted updates after the sync
  • a total order among operations
    • Problem: unsynchronized clock across different devices
    • Ideal order: Causal ordering
  • Lamport Clock: a logical clock
    1. Each server keeps a clock T
    2. Increment T as real time passes
    3. T = max(T' + 1, T) when receiving a message with timestamp T' from another server
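
The three rules in code (a minimal sketch; message passing is simulated by handing timestamps over directly):

```python
# Lamport clock: a logical clock that preserves causality between messages.
class LamportClock:
    def __init__(self):
        self.T = 0

    def tick(self):                 # rule 2: local event / passage of time
        self.T += 1
        return self.T

    def send(self):                 # timestamp attached to an outgoing message
        return self.tick()

    def recv(self, T_msg):          # rule 3: T = max(T' + 1, T)
        self.T = max(T_msg + 1, self.T)
        return self.T

a, b = LamportClock(), LamportClock()
t = a.send()                        # a: T = 1
print(b.recv(t))                    # b jumps to 2: causality is preserved
```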

The Lamport clock gives a total order over every message -> too strong. Partial order: vector clocks.

  • Each server keeps its local Lamport clock plus cached copies of the other servers' clocks
  • On receiving another server's vector $T'$, take the element-wise maximum $T_j = \max(T_j, T'_j)$ for every entry, then increment the local entry
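
A vector-clock sketch under the standard merge rule (element-wise max, then increment the local entry); `happens_before` tests the resulting partial order:

```python
# Vector clock for n servers: concurrent events get incomparable vectors,
# so the order is partial rather than total.
class VectorClock:
    def __init__(self, i, n):
        self.i, self.T = i, [0] * n

    def local_event(self):
        self.T[self.i] += 1

    def merge(self, T_other):           # on receiving another server's vector
        self.T = [max(a, b) for a, b in zip(self.T, T_other)]
        self.T[self.i] += 1

def happens_before(u, v):               # partial order: u < v
    return all(a <= b for a, b in zip(u, v)) and u != v

s0, s1 = VectorClock(0, 2), VectorClock(1, 2)
s0.local_event(); s1.merge(s0.T)        # s1 now has [1, 1]
print(happens_before([1, 0], s1.T))     # True: causally before
print(happens_before([2, 0], s1.T))     # False: concurrent, no order imposed
```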

Strong consistency models: the system behaves as if there were only one copy of the data

  • strict consistency: operations are ordered by their issue time

Consistency under Failure: Atomicity

Completion-to-issue: if operation a's completion (as seen by the client) happens before operation b is issued, then b must be observed as executing after a. Linearizability: a client cannot infer the execution order of concurrent operations.

Implementing linearizability in replicated data storage

Principle: one writer rule

  • only one process is allowed to do the read/write
  • then propagate to other processes

Primary backup implementation:

  • In-order updates: the primary assigns sequence numbers; all replicas apply writes in sequence-number order
  • Performance issues & reliability issues (Raft)

Atomicity

Ensure a set of operations written to a file is all-or-nothing.

Shadow copy: copy-on-write (replace the original file with the updated copy only if all writes succeed)

  • Goal: file system APIs are all-or-nothing
  • Journaling
    1. record changes in journal
    2. commit journal
    3. update
  • Mitigation: only protect metadata via journaling
  • Drawback: hard to handle multiple clients sharing the same file

Realizing atomicity: Logging+Checkpoint

Journaling Drawbacks:

  • everything is written to disk twice
  • hard to generalize to multiple files

Logging for Atomicity

Idea: avoid updating the disk state until we can recover it after a failure.

Transaction and commit point: marking atomic units.

Recovery of Commit Log(redo-only logging)

  1. Travel from start to end
  2. Re-apply the updates recorded in every complete log entry

Cons of redo-only logging:

  • All updates must be buffered in memory until commit
  • The log file grows continuously

Undo Logging

  • Should contain sufficient information to undo uncommitted transactions
  • Log records
    • records from different transaction may interleave -> need pointer to trace operations from the same TX
  • Recovery rules
    1. Travel from end to start
    2. Mark all transactions whose log records lack a CMT record and append an ABORT record for each
    3. Undo ABORT log from end to start
    4. Redo CMT log from start to end
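
A sketch of these four recovery rules over an in-memory log (the record format `(tx, kind, key, old, new)` is illustrative):

```python
# Undo-logging recovery: abort transactions without a CMT record, undo their
# writes backward, then redo committed writes forward.
def recover(log, state):
    # rule 1 (scan) + rule 2: TXs without a CMT record get an ABORT record
    committed = {tx for tx, kind, *_ in log if kind == "CMT"}
    aborted = {tx for tx, kind, *_ in log if kind == "W"} - committed
    log = log + [(tx, "ABORT", None, None, None) for tx in aborted]
    # rule 3: undo aborted transactions' writes from end to start
    for tx, kind, key, old, new in reversed(log):
        if kind == "W" and tx in aborted:
            state[key] = old
    # rule 4: redo committed transactions' writes from start to end
    for tx, kind, key, old, new in log:
        if kind == "W" and tx in committed:
            state[key] = new
    return state

crash_log = [("t1", "W", "x", 0, 1),
             ("t2", "W", "y", 0, 9),
             ("t1", "CMT", None, None, None)]
print(recover(crash_log, {"x": 1, "y": 9}))  # {'x': 1, 'y': 0}: t2 undone
```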

Checkpointing

Idea: save the system state in a compact form so that restore is faster.

Checkpointing in logging. Observation: uncommitted updates are only in the page cache. Challenge: maintaining correctness:

  1. wait till no operation in progress
  2. write a CKPT to the log
  3. Flush page cache
  4. Discard all the log records except CKPT record

Checkpointing in LLM Training

  • Checkpointing must be atomic -> shadow copy
  • Reduce fault tolerance overhead(checkpoint overhead + recomputation)

Checkpointing under eventual consistency. Challenge: the data is decided by multiple servers, so we cannot simply use all the records.

Decentralized approach: an update with timestamp T is stable once all nodes have seen all updates with time <= T.

Centralized approach:

  1. One primary server
    • Assign a total commit order CSN to each write
    • Complete timestamp
    • Any write with a known CSN is stable
  2. All stable writes are ordered
  3. CSNs are exchanged between servers. Note: a server asks the primary to assign CSNs for all of its tentative writes.

Before-or-After Atomicity and Serializability

Linearizability: serial execution of reads / writes -> serial execution of whole actions. Goal: before-or-after atomicity.

Global lock:

  • The action must acquire the global lock before executing and release it after it commits.
  • Trade performance for correctness

Fine-grained Locking: each shared data has a lock

  • The action must acquire it before accessing it and release it after the data access finishes.
  • Does not ensure before-or-after atomicity
  • Solution: acquire all locks first and release them only at the end

2PL(Two phase locking)

  • The action must acquire the shared data’s lock before accessing it, and once the lock is released, no lock can be held.

Serializability: run actions concurrently and have it appear as if they ran sequentially. How to check it? View serializability: the final written state as well as the intermediate reads are the same as in some serial schedule (informal definition). Conflict serializability:

  • Two operations conflict if:
    • they operate on the same data object
    • at least one of them is write
    • they belong to different transactions
  • A schedule is conflict-serializable if the order of its conflicts is the same as the order of conflicts in some sequential schedule. Conflict serializability is stricter than view serializability (conflict-serializable implies view-serializable).

Serializability, OCC & Transaction

Proof of 2PL correctness: by contradiction.

2PL problems:

  • Phantom Problem: Only lock the things you touch is insufficient
  • Solutions: Predicate lock; Range locks in a B-tree index or lock entire table

Deadlock. Two-phase locking is pessimistic: before proceeding, each TX must wait for conflicting TXs to release their locks. Methods to handle deadlock:

  1. Acquire locks in a pre-defined order(must know all the data set before access)
  2. Detect deadlock by calculating the conflict graph & abort one TX to break the cycle
  3. Using heuristics to pre-abort TXs

Optimistic concurrency control

  1. Concurrent local processing
    • Reads data in a read set
    • Buffers writes into a write set
  2. Validate serializability in a critical section
    • compare the data’s version
  3. Commit the results in the critical section, or abort

Phases 2 & 3 should execute in a critical section (see the sketch after this list):

    • Two-phase locking (no deadlock, since the read set is known)
    • No lock is needed for the read set (validation <=> lock)
    • Elements in the read set are neither changed nor locked
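
A sketch of OCC validation and commit, approximating the critical section with one global mutex (the store layout and names are illustrative):

```python
# OCC: phase 1 reads versions and buffers writes locally; phases 2 & 3
# validate the read set and install the write set inside a critical section.
import threading

store = {"x": (0, 0)}                         # key -> (value, version)
mutex = threading.Lock()

def occ_commit(read_set, write_set):
    # phase 1 already happened: reads recorded versions, writes were buffered
    with mutex:                               # phases 2 & 3: critical section
        for key, seen_version in read_set.items():
            if store[key][1] != seen_version:
                return False                  # read set changed: abort
        for key, value in write_set.items():
            store[key] = (value, store[key][1] + 1)
        return True

value, version = store["x"]                   # phase 1: local processing
ok = occ_commit({"x": version}, {"x": value + 1})
print(ok, store)                              # True {'x': (1, 1)}
```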

Locking Preliminary: atomic compare and swap

  • The x86 LOCK prefix ensures an instruction is executed atomically on a memory address

Transaction & Multi-site Transaction

OCC Problems

  • False aborts & livelock
    • Under high contention, OCC may continuously abort

System Principle behind OCC:

  • Fast Path & Slow Path
    • use a fast path to accelerate execution
    • validate that the fast path matches the results
    • if validation fails, go to slow path

RTM(Restricted Transactional Memory)

  • xbegin / xend mark the beginning and end of a transaction
  • on failure, execution jumps to the fallback path
  • Problems
    • limited working set
    • external events break the RTM execution

Transaction: an abstraction for managing data that guarantees the ACID properties. Consistency in TX: defined by applications. Atomicity alone is not sufficient for application-level consistency.

Read-only or read-mostly transactions. Idea: multi-version concurrency control (MVCC)

  • A set of versions represents a snapshot of the database
  • Read: only read from a consistent snapshot at a start time
  • Write: install a new version of data instead of overwriting the existing one

Optimize OCC

  1. Concurrent local processing
  2. Commit the results in critical section
    • No validation is needed

Problem: partial snapshots. Solution: enforce write atomicity of snapshots with locking:

    • get the commit timestamp after locking

MVCC ensures no race for reads, but not writes

  • Validation: check whether another TX has installed a new snapshot after the committing TX's start time

Write skew anomaly

  • Case: R(X), W(Y) & R(Y), W(X)
  • Fix: validate the read set in read-write TX

Multi-site transactions: two-phase commit (2PC), sketched after the list below

  • High-layer TX: coordinates the execution of lower-layer TXs
  • Low-layer TX: specific reads and writes that execute on a single machine
  • Phase 1:
    • Delay the commitment of lower-layer TXs
    • Lower-layer transactions either abort or tentatively commit
    • The higher-layer transaction evaluates the lower-layer outcomes
  • Phase 2:
    • Higher layer decides whether low-layer TXs will commit or abort
    • coordinate the commitment of lower layers
  • Idea: the coordinator is responsible for committing all the TXs
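
A coordinator-side sketch of 2PC (the `prepare` / `commit` / `abort` participant API is illustrative; logging and retries are omitted):

```python
# Two-phase commit: all participants tentatively commit in phase 1; in phase 2
# the coordinator alone decides, and everyone follows its decision.
class Participant:
    def __init__(self, ok=True):
        self.ok, self.state = ok, "init"
    def prepare(self):                 # tentative commit: hold locks, log state
        self.state = "prepared"
        return self.ok
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if all(votes):                                # phase 2: decide and announce
        for p in participants: p.commit()
        return "commit"
    for p in participants: p.abort()
    return "abort"

print(two_phase_commit([Participant(), Participant()]))        # commit
print(two_phase_commit([Participant(), Participant(False)]))   # abort
```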

Consistency under Partial Failure

Goal: under partial failure, all sites either all commit or all abort.

Multi-site atomicity:

  • Principles
    • Follow the coordinator’s decision
    • Log sufficient state to tolerate failures
  • Problem: if the coordinator fails, the server cannot know whether to release the lock

Primary-Backup Model

  • Replicated State Machine
  • View server: keeps a table that maintains a sequence of views and manages which node is primary and which is backup; the coordinator asks the view server who the primary is
  • Rules when facing network partitions
    • Primary must wait for backups to accept each request
      • check if it is in the same view as the primary
    • Non-primary must reject direct coordinator requests
  • Quorum (sketched after this list)
    • Confirm a write only after writing to at least Qw replicas
    • A read must see at least Qr replicas agree on the data or witness value (Qw + Qr > N, so every read quorum overlaps every write quorum)
  • Together
    1. One server is responsible for ordering inputs received from clients
    2. After receiving an input, sync to at least W backups before ACKing
    3. If the primary fails, select R backups and choose the one with the most up-to-date inputs
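
A quorum sketch with N = 3, Qw = Qr = 2 (the replica layout is illustrative); because the quorums intersect, a read always sees the last confirmed write:

```python
# Quorum read/write: a write is confirmed after QW replicas accept it; a read
# contacts QR replicas and takes the highest-versioned value among them.
N, QW, QR = 3, 2, 2                  # Qw + Qr > N: quorums must intersect
replicas = [{"ver": 0, "val": None} for _ in range(N)]

def quorum_write(val, ver, reachable):
    if len(reachable) < QW:
        raise RuntimeError("cannot reach a write quorum")
    for i in reachable:              # ACK only after QW replicas accept
        replicas[i] = {"ver": ver, "val": val}

def quorum_read(reachable):
    if len(reachable) < QR:
        raise RuntimeError("cannot reach a read quorum")
    # the highest version among any QR replicas includes the last write
    return max((replicas[i] for i in reachable), key=lambda r: r["ver"])["val"]

quorum_write("v1", 1, reachable=[0, 1])   # replica 2 is partitioned away
print(quorum_read(reachable=[1, 2]))      # 'v1': the overlap (replica 1) wins
```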

Paxos

Single-decree Paxos(agree on single value)

  • Challenge: support multiple writers
  • Phase 1:
    1. The leader creates proposal N and sends it to a quorum
    2. Acceptor:
      • If N > any previous proposal number:
        1. reply with the highest past accepted proposal number & its value
        2. promise to ignore all proposal IDs < N
      • else: ignore
  • Phase 2:
    1. Leader receives enough promise (majority)
      1. set the value to the proposal's V; if any accepted value was returned, replace V with the value of the highest-numbered one
      2. send accept request to quorum with the chosen value V
    2. Acceptor:
      • If promise still holds:
        1. register the value V
        2. send accepted message to Proposer / Learners
      • else: ignore
  • Phase 3:
    1. Learner: respond to the client or take action on the request
  • When is the value V chosen? When the leader receives accepted messages from a majority
  • If an acceptor fails after sending a promise, it must still remember Nh on recovery
  • If an acceptor fails after accepting, it must still remember Nh, Na, and Va on recovery
  • Used for the view server, but single-decree Paxos is not enough (see the acceptor sketch below)
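
An acceptor-side sketch of single-decree Paxos (persistence of Nh, Na, Va is elided here; in a real system they must survive crashes, per the failure cases above):

```python
# Single-decree Paxos acceptor: Nh is the highest promised proposal number,
# (Na, Va) the highest accepted proposal and its value.
class Acceptor:
    def __init__(self):
        self.Nh = 0
        self.Na, self.Va = 0, None

    def prepare(self, N):
        if N > self.Nh:                        # promise to ignore IDs < N
            self.Nh = N
            return ("promise", self.Na, self.Va)
        return None                            # else: ignore

    def accept(self, N, V):
        if N >= self.Nh:                       # the promise still holds
            self.Nh, self.Na, self.Va = N, N, V
            return ("accepted", N, V)
        return None                            # else: ignore

acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(1) for a in acceptors]
# no promise carried an accepted value, so the leader keeps its own V
V = "view-1"
acks = [a.accept(1, V) for a in acceptors]
print(sum(ack is not None for ack in acks) >= 2)  # True: majority, V is chosen
```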

Multi-Paxos (agree on a log of values instead of a single value). Basic Multi-Paxos: correct but insufficient (2 rounds of RPC per value). Solution:

  1. Leader election with prepare-message batching: prepare multiple writes at a time
    • Reason: if there is only one writer for most of the time, batching the prepare messages lets the following accept stages succeed in most cases (multiple writers still work)

Problem: holes in the log.