CSE Class Notes 2

This is the notes of second half of Computer System Engineering course, which goes through the basic idea of Network, Security and Distributed Computing.

Posted Jan 5, 2026

By Akihi

11 min read

CSE Class Notes 2

Network

Link Layer

Purpose: moving data directly from one point to another

Physical transmission
- Serial transmission(no waiting for response)
- 不同机器上如何进行频率同步？VCO硬件根据输入信号进行对齐
- 输入大量0对VCO对齐不友好？Manchester Code(0->01, 1->10)
Multiplexing
- Isochronous: TDM，每个连接在建立的时候在链路上被分配一个time slot（适合数据均匀的情况）
- Asychronous: 通过packet(header)发送，不需要维护连接状态（适合burst的情况）
Error handling
- Error detection: checksum at the end
- Hamming distance: A $xor$ B中1的个数，通过增加冗余位，拉开合法编码之间的H-distance
  - e.g: 4-bit -> 7-bit，重新计算纠错位的值，根据不同错误情况可以纠错1-bit error

Network Layer

Function: forwarding(data-plane) and routing(control-plane) IP: Best-effort network Routing

Link-state
- advertisements contain its neighbors and its costs
- advertise to everyone(flooding)
- Dijkstra
- 每个顶点都有整个网络的所有拓扑结构
- 问题：flooding开销大
Distance-vector
- advertisements contain its current costs to every node it’s aware of
- advertise to neighbor only
- 自身维护其到每个node的路径，需要经过哪个neighbor
- update: A受到B->dst的路径，更新A->dst
- Problem: 存在infinity问题，会认为没有路的地方有路径
- Split Horizon: 不向neighbor发经过neighbor到达的node的distance
Scale up
- Hierarchy of routing(route between regions): BGP
- Topological Addressing
- Path Vector: includes full path in its routing advertisements
  - refuse path including itself in case of circle
  - graph change: discard any path that a neighbor stops advertising

NAT: bridge private networks and public networks

src的IP和port映射到NAT的公网IP和port

Ethernet Mapping

Broadcast network: every frame is delivered to every station
ARP: IP <-> MAC(ask for IP, get MAC address)
Attack: ARP spoofing: when one device ask for IP, hacker can reply with a wrong MAC to poison the ARP cache

BGP

Policies: for financial incentives
- Customer/Provider: customer pays
- Peers: provide mutual access to a subset of each’s routing tables
  - Usually peers will tell each other about their customers but not their providers

End2End Layer

Purpose: provide transport protocol to application which can ensure some characteristics

Assurance with at-least-once
- set packet with 随机数
- resend until receive ACK
- try limit times before returning error
- Timeout? No fix timer(congestion collapse, 不同设备间相位同步)
- Adaptive timeout: 统计RTT数据，根据RTT设置Timeout时间
- NAK: receiver asks for missing items(sender no timer)
Assurance with at-most-once
- receiver side maintains a table of nonce(Problem: duplicate suppression)
- 幂等性
Assurance of data integrity: add checksum in sender
Segment and reassembly of long messages
- segment contains ID for where it fits
- Solution: hold early packets in buffer until in order/only ACK in order packets
Assurance of Jitter Control: delay all arriving segments and cache
End2End Performance
- Fixed Window: receiver tells sender a window size, window advances when all packets in previous window are acked
- Sliding Window: Sender advances the window by 1 for each in-sequence ACK it receives
  - handling packet loss: timeout之后重传
  - Window Size $\ge$ RTT * bottleneck data rate
- TCP Congestion Control
  - Window size $\leq$ Receiver Buffer
  - Idea: increase slowly, drop quickly
  - AIMD(Additive increase, multiplicative decrease)
    - too slow start? do multiplicative increase at the beginning
    - leads to efficiency and fairness
  - Weakness: packet loss not always caused by congestion, bias against long RTTs(TxRate = W / RTT)

DNS: Binding IP and Domain Name

DNS Hierarchy: structure the hostname
Fault Tolerant: each zone can have multiple name servers
Three Enhancements
1. Initial DNS request can go to any name server
2. Recursion: the name server does all the lookup and return the final IP
3. Caching: DNS clients and servers keep a cache of names(cache has expire time limit)
Client can get nearby name server from ISP

CDN(Content Distribution Network)

Network of computers that replicate content across the Internet
Content providers actively pushes data into the network

Security

ROP and CFI

Security Principles: least trust

Stack Buffer Overflow: inject malicious code to buffer, overwrite return address

Defense: Data area marked non-executable

ROP(Return Orientated Programming)

Find code gadgets in existed code base(end with ret)
Push address of gadgets on the stack
So each gadgets will be connected by return Defense:
Hide the binary file
ASLR to randomize the code position(fork不会改变偏移量)
Canary to protect the stack(在ret地址前加入随机数，detect stack overflow, 防止栈被修改)

CFI: Control Flow Integrity Main Idea: pre-determine control flow graph of an application(大部分indirect jump只会去确定的几个位置)

Use binary rewriting to instrument code with runtime checks, ensure the execution always stays within the statically determined CFG
Insert a unique bit pattern at every destination and check whether is the same
Effective against attacks based on illegitimate control-flow transfer

Blind ROP

Stack reading: overwrite a single byte with value X(试出ASLR的偏移量和Canary)
Find gadgets
- 先找stop gadget(never crashes, e.g. sleep(10))
- 修改return address上的address为stop gadget，之后尝试不同return address
- 目标是找到BROP gadgets(ret前存在pop语句)，埋crash gadget（为了找pop rdi/pop rsi)
- 通过函数对于不同参数的反应识别关键函数（strcmp：可以改rdx）
  - 找PLT表（记录所有jmp，特征是向下找每一行都能执行）
  - strcmp fingerprint: only two args are readable will not crash
- 找到write
  - 是否受到字符
get binary

Secure Data Flow

Taint Tracking: tracking sensitive data flow

输入的信息标记为Taint
由标记taint的信息计算的到变量都标记为taint
给sensitive data标记taint

Defending Malicious Input TaintCheck: mark all input data to the computer as tainted, detect when tainted data is used in dangerous ways

use binary re-writer
seed - tracker - assert

Secure Channel

简单加密：

Replay attack(Solution: add sequence number)
Reflection attack(Solution: 使用不同的key和seq)

Diffie-Hellman

A pick a, send $g^a \mod p$
B pick b, send $g^b \mod p$
$key = g^{ab} \mod p$
Problem: man-in-the-middle, attacker sends $g^e \mod p$ to both

RSA Algorithm: public key & private key

Certificate: 用CA的私钥签署一条消息包含用户身份和其公钥，接收方用CA的公钥解密

Distributed and Parallel Computing

How to scale Computation?

Forward Path: $2size(W)B$ Backward Path: calculating the dX and dW, $4size(W)B$

Single Device Capability

Single-core Machine: pipeline / increased clock rate Multi-core Machine:

Problem: coherence under memory access hierarchy(cache in different core)
increase per-core density: more ALUs on a single core(SIMD processing)
- Bottleneck: access memory
Roofline Model: Gflops/s / Flops/Byte(检查算力瓶颈还是访存瓶颈)（斜率为带宽）
Conditional execution under SIMD
- Constraint: every SIMD ALUs share the same program counter
- Solution: masked instruction(execute all the branch)
SIMT abstraction: threads are grouped in a block, which can synchronize execution

System Methods to fully utilize the device

GPU

Basic Building Block: SM(Streaming Multiprocessor)(硬件)
- core->warp->SM
SIMT: thread->block->grid(kernel)(硬件抽象)
Execution: 每个SM执行若干个block，一个block有若干个warp，每个周期scheduler会分配warp到各个core上执行（一个warp下所有thread的PC一致）
Access Memory: Register, L1 Cache, L2 Cache, HBM, DRAM
Why block?
1. Shared Memory(SRAM)
2. Block内部线程间同步
Memory Hierarchy

Optimizing Attention

Memory Hierarchy: SRAM > HBM > DRAM
Goal: Minimize HBM load
1. naive->$O(NMK)$
2. change loop order->$O(NMK/BB)$
3. Tiling: 分块矩阵乘法
  - Read: $O(NMK/BM + NMK/BN)$
  - Write: $O(NM)$
Overall close to $O(N^2)$

System Methods to do computation on multiple devices

MapReduce Programming Interface

Map: Partition input data, group all the intermediate values, pass them to Reduce(pass to Intermediate File)
Reduce: partition key space, accept a key & a set of values, merge the values
Intermediate File is on local disk of Map Worker
Reduce Worker needs to sort intermediate files to ensure all occurrences of the same key are grouped
Fault Tolerance
- Detect: Master sends heartbeat
- Recovery: re-execution
- Master Failure: checkpoint(master state is persisted to GFS: worker state, location of IF)
- Problem: bad records(master sees two failure of the same record, skip!)
Optimization for Locality: schedule map worker on its input chunk server
Performance: spawn backup copies of tasks, first wins
Problem: not fit for multi-stage execution

Computation Graph

Job Manager, Daemon Process
Scheduling Rules: prefer executing a vertex near its inputs
Fault Tolerance: re-execution, run upstream vertices recursively, finish first

Distributed Training

Parallelization: data parallelism / model parallelism

Data Parallelism

high GPU utilization
- overlapping data loader on CPU
- optimize synchronization part
Allreduce
1. parameter server
  - Problem: huge bandwidth requirement at the PS
2. Co-located & sharded PS: each server handles the reduce of a portion of data($N/P$)
  - Problem: high fan-in: 同时和所有server连接，降低network性能
3. De-centralize approach
  - Problem: high communication data($O(N*(P-1))$)
4. Ring Allreduce: only connect to peers, use sharded data
  - each process send one partition, compute it and send to the next process
  - will reorder the computation
  - $O(2p)$ round / $2(p-1)*N/p$ communication per machine
    Model Parallelism
    Pipeline Parallelism
Partition on the Layer, each partition is on a device
Problem: bubbles during forward-backward process
Techniques: micro-batching(breaking batches into smaller pieces)
- bubble size depends on partitions / ratio of bubble overhead depends on m
- Bubble time fraction: $\frac{p-1}{m}$
  Tensor Parallelism
Partition the parameters of a layer
Forward Process: each node send $(P-1)\frac{diB}{P}=di*B$（simple forward, 只有all gather部分)
Backward Process
- $\nabla_W$ : no need of communication
- $\nabla_X$ : need Allreduce $2\frac{diB}{P}*(P-1)$ (ring reduce)
  Summary
  Tensor Parallelism的communication cost高于Pipeline Parallelism(Pipeline Parallelism只需要传递一个矩阵)
NVLink: Tensor Parallelism
RDMA: Pipeline Parallelism

Raft

High Level Approach

Leader election
Log replication
Safety

Basics

Server States
- Leader: receive client requests, replicate logs to followers
- Follower: only respond to requests from leaders or candidates
Term: logical clock to identify obsolete information
Heartbeats & Timeouts
- Leader send heartbeats to followers to maintain authority in a term
  - if timeout, follower assumes leader has crashed and starts a new election
Election
1. change its state to the candidate state
2. increment current term
3. vote for itself, send RequestVote RPC to all other servers until
  - receive votes from majority
  - receive heartbeat from a valid leader
  - no one wins(timeout) - Safety: each server only vote for one per term - Liveness: use random timeout before retry
Log Replication
- Log Structure: log entry(index, term, command)
  1. send commands to leader
  2. leader appends to its log
  3. leader send RPC to followers(until succeed)
  4. once commited, leader passes command to state machine, returns results and notifies followers to commit
- Handle Inconsistency
  - check if the preceding entries is the same
  - if reject, leader will decrement nextIndex and try again(找到leader和follower最早匹配的entry位置并覆盖follower之后的所有entry)
- Safety of the Commit Entry
  - During election, choose candidate with log most likely to contain all committed entries
  - Voter denies to vote if its log is more complete
- Commit Rules
  - the entry must be stored on a majority of servers
  - at least one entries from current term must also be stored on a majority of servers(so it will not be overwritten by new leaders)

Computer Science, Basics

This post is licensed under CC BY 4.0 by the author.