OS Kernels
Lecture 1
-
Will require C!
-
Will have a complicated C assignment
-
Office Hours in 105 Schorr, Tuesday 10:30-12:30
-
more low-level
-
HW (35%)
-
2 programming assignments
-
2 problem solving
-
-
Project (15%)
-
Pair programming
-
Trio programming (requires additional components)
-
-
Midterms 3x (40% total)
-
Quizzes (10%)
-
Tips
-
Get to know classmates
-
participate!!
-
Prior to due dates, class time will be spent answering questions
-
-
Goals
-
knowledge of internals
-
OS roles
-
interaction with hardware
-
systems programming
-
memory management
-
concurrency
-
practical OS experience
-
-
Know what a process is
-
And how multiple processes work
-
thinking about shared resources
-
And things like race conditions
-
Thread-safety is important – all code must be thread safe!
Lecture 2: Review – Cache Memories
-
OS Objectives
-
Provides convenience
-
simplifies I/O
-
simplifies user interfaces
-
simplifies portability
-
-
manages resources
-
memory
-
processes
-
I/O devices
-
communications media
-
-
ease of evolution
-
support new hardware components
-
provides new services
-
supports maintenance
-
-
-
Major portion on resource management
-
review of memory hierarchy and cache
Memory Hierarchy and Cache
-
Memory Hierarchy
-
Registers
-
L1 Cache (SRAM)
-
L2 Cache (SRAM)
-
Main Memory (DRAM)
-
Local storage
-
remote storage
-
-
Levels nearer the top are smaller in capacity and faster, but more expensive per byte
-
rule of locality – bring things closer as you use them
-
The Cache
-
A small, fast storage device – a staging area for data in a larger, slower device
-
for each level $k$, the faster, smaller device at level $k$ serves as a cache for the larger, slower device at level $k+1$
-
Hierarchies work because of locality – programs tend to access data at level $k$ more often than at level $k+1$ – thus storage at level $k+1$ can be slower, and thus larger and cheaper per bit
-
The memory hierarchy creates a large pool of storage that costs as much per byte as the cheap storage near the bottom, while serving data to programs at the rate of the fast storage near the top
-
SRAM – managed automatically in hardware – holds frequently accessed blocks of main memory
-
CPU looks for data in caches first, then in the main memory
-
data is copied in transfer blocks
-
Requires various ways of managing which blocks are in the cache and which aren't
-
-
Cache concepts
-
Cache hits – When the cache has the block
-
Cache misses – when the cache doesn't have the block
-
Get from main memory
-
store in cache – using placement (determines where a block goes) and replacement (determines which block gets evicted) policies
-
Types
-
Cold start / Compulsory – cache is empty
-
conflict miss – the placement policy restricts blocks from level $k+1$ to a small subset of the positions at level $k$
-
occurs when the level $k$ cache is large enough, but multiple objects map to the same level $k$ block
-
-
capacity miss – occurs when the set of active blocks (the working set) is larger than the cache
-
-
-
Faults – when the needed data isn't even in main memory
-
Cache Organization
-
$S$ – number of sets
-
$E$ – lines (ways) per set
-
$B$ – block size in bytes
-
Each line has $v$ (valid bit), a tag, and $B$ bytes of data
-
Cache size $C = S \times E \times B$
-
Address is Tag, Set, Block offset
-
The cache operates on physical addresses
-
the CPU generally issues virtual addresses, not physical addresses
-
To read:
-
$B = 8$
-
$S = 4$
-
$E = 1$
-
Locate Set
-
Locate matching tag, and valid
-
Locate data starting at the offset – see the decomposition sketch below
-
-
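A minimal sketch of this decomposition for the example parameters above ($B = 8$, $S = 4$, so 3 offset bits and 2 index bits); the address value is made up for illustration:

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 3   /* log2(B), B = 8 */
    #define INDEX_BITS  2   /* log2(S), S = 4 */

    int main(void) {
        uint16_t addr = 0x0D4;  /* example 12-bit physical address */
        unsigned offset = addr & ((1u << OFFSET_BITS) - 1);
        unsigned set    = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        /* lookup: index into the set, compare tags, check valid bit,
           then read starting at the byte offset */
        printf("tag=%u set=%u offset=%u\n", tag, set, offset);
        return 0;
    }
-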
Direct mapped cache, $E = 1$
-
when no match, evict line and replace
-
$E$-way set associative cache – e.g., two lines per set; more compact storage, but requires comparing the tag against every line in the set – a miss is a bit harder to handle, since a replacement policy must pick which line in the set to evict
Lecture 3
Cache – Continued
-
understand matching and replacement
-
replacement policies – random, least recently used, etc
-
writes – updates must eventually reach the original memory
-
write-through immediately writes to memory
-
write-back defers writing until the line is replaced – needs a dirty bit
-
write-miss policies
-
write-allocate – load into cache, update line
-
no-write-allocate – immediately write to memory
-
-
miss rate, hit time, miss penalty are the performance metrics
-
make sure to write s.t. the common case goes fast
-
minimize misses in the inner loops
-
repeated refs to vars are good
-
stride-1 reference patterns are good – see the loop-ordering sketch below
-
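A sketch (not from the lecture) of why stride-1 matters: both functions sum the same $N \times N$ matrix, but the row-wise version walks memory in layout order and hits in cache far more often:

    #define N 1024
    double a[N][N];

    double sum_rowwise(void) {        /* stride-1: good locality */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_colwise(void) {        /* stride-N: poor locality */
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }
-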
Problems
-
physical address only
-
memory is byte addressable
-
access to 1 byte words
-
physical address is 12 bits wide
-
the cache is physically addressed and direct mapped, with a 4-byte line and 16 total sets
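-
Worked decomposition for these parameters: block offset $= \log_2 4 = 2$ bits, set index $= \log_2 16 = 4$ bits, tag $= 12 - 4 - 2 = 6$ bits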
Lecture 4: Virtual Memory
-
lets a computer think it has more memory than it actually does
-
helps to deal with a lot of more basic stuff
-
physical addressing – used in small systems – embedded microcontrollers for things like cars, elevators, etc
-
can be an issue – a set amount of memory is provided, no more, no less
-
512 MB means 512 MB
-
programs are machine dependent
-
running multiple programs becomes very difficult
-
-
virtual memory uses an MMU to translate a virtual address to a physical address
-
means a program doesn't need to care how much physical memory the machine has
-
-
linear address space – ordered set of contiguous, non-negative integer addresses
-
virtual address space – set of $N = 2^n$ virtual addresses
-
physical address space – set of $M = 2^m$ physical addresses
-
difference between data and attributes
-
an object can have multiple addresses
-
every byte in main memory has one PA and one or more VAs
-
virtual memory uses main memory efficiently
-
simplifies memory management
-
isolates address space – one process can't interfere with another's, and a user program cannot access privileged kernel information
Virtual memory as a cache
-
VM is an array of $N$ contiguous bytes stored on disk
-
contents of the array are cached in physical memory
-
blocks are called pages (size $P = 2^p$ bytes)
-
pages are of fixed, but configurable size
-
Given an address space of $2^n$ bytes and a page size of $2^p$ bytes, you have $2^{n-p}$ pages
-
keeping track of the mapping requires software support; it can't all be done in hardware – the bookkeeping structure is the page table
-
write-back, not write-through
-
Page tables
-
an array of page table entries (PTEs) that map virtual pages to physical pages
-
per-process kernel data structure, in DRAM
-
Has valid bit, keeps physical page number or disk address
-
a clear valid bit signals the page is on disk or unallocated
-
works because of locality
-
the set of active pages is called the working set
-
if working set < main memory size: good performance for one process after the initial misses
-
if the sum of working sets > main memory size: thrashing, lots of swapping
-
Virtual memory for memory management
-
each process has its own virtual address space – own page table
-
views memory as a simple linear array
-
mapping function scatters through physical memory
-
multiple processes can map virtual pages to the same physical page (sharing)
-
and can allow a single library to be used for multiple programs
-
can make linking and loading easier
-
kernel vm is shared
-
linking maps programs into the common structure
-
loading allocates and copies things as needed
Lecture 5: Virtual Memory
-
homework is a tool
-
quiz based
-
midterm will be electronic?
-
first midterm and a programming assignment coming in a week to a week and a half
Virtual memory for memory management
-
bring data up from disk in page-sized chunks!
-
everything has its own virtual address space
-
mapping scatters addresses through physical memory
-
well chosen mappings simplify allocation and management
-
includes protecting sections from execution
Memory protection
-
per-page protection bits – supervisor, read, write – plus the block address; a violating access causes a fault
Address Translation
-
$N = 2^n$ number of addresses in virtual address space
-
$M = 2^m$ number of addresses in physical memory
-
$P = 2^p$ page size
-
TLBI – TLB index
-
TLBT – TLB tag
-
VPO – virtual page offset
-
VPN – virtual page number
-
PPO – Physical page offset
-
PPN – Physical page number
-
CO – byte offset
-
CI – cache index
-
CT – cache tag
-
given base address, use page number and offset
-
the MMU translates based on the page table entry for the address – see the translation sketch below
-
VM and cache together make things faster/more able to handle a lot of memory
-
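A toy translation sketch (a minimal model under assumed parameters – 4 KB pages and a 256-entry single-level table – not a real MMU):

    #include <stdint.h>

    #define PAGE_BITS 12           /* 4 KB pages: VPO = PPO = low 12 bits */
    #define NUM_PAGES 256          /* assumed size of this toy address space */

    typedef struct {
        int valid;                 /* is the page resident in physical memory? */
        uint64_t ppn;              /* physical page number */
    } pte_t;

    pte_t page_table[NUM_PAGES];   /* per-process, lives in kernel memory */

    /* Returns the physical address, or -1 to indicate a page fault. */
    int64_t translate(uint64_t va) {
        uint64_t vpn = va >> PAGE_BITS;               /* virtual page number */
        uint64_t vpo = va & ((1u << PAGE_BITS) - 1);  /* virtual page offset */
        if (vpn >= NUM_PAGES || !page_table[vpn].valid)
            return -1;             /* fault: the OS must fetch the page */
        return (int64_t)((page_table[vpn].ppn << PAGE_BITS) | vpo);
    }
-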
TLB – another cache in the MMU; maps VPNs to PPNs; contains complete page table entries for a small number of pages
-
multilevel page tables – a flat table would need 512 GB for 48-bit addresses with 8-byte PTEs at a 4 KB page size ($2^{36}$ pages $\times$ 8 bytes)
-
paging the page table!
-
page tables that list page tables
-
Lecture 6: Virtual Memory
Page faults
-
small page size – less internal fragmentation
-
more pages required
-
larger page tables
-
large programs mean page tables may themselves be in virtual memory
-
rotational characteristics of disks favor a larger page size
-
-
4 KB is the most common size
-
internal fragmentation
-
10 seats, 10 students - no fragmentation
-
20 seats, 10 students - fragmentation of 10
-
40 seats, 10 students - fragmentation of 30
-
-
small resident set with small pages – high page fault rate, but faster to fill
-
large size, large pages – low page fault rate, slower to fill
-
many pages in process, large page size, low page fault
-
contemporary techniques decrease locality of references within a process
-
Fetch policy – when pages should be brought into memory
-
demand paging – bring a page in only when it is needed
-
many page faults when a process is first started
-
-
prepaging – load what is needed, and then a bit more
-
exploits characteristics of secondary memory devices
-
more efficient to bring in a number of pages at one time
-
ineffective if extra pages aren't referenced
-
-
-
placement policy
-
determines where stuff is to be put
-
-
replacement policy
-
deals with selection of a page in main memory to be replaced
-
objective is to remove the page least likely to be referenced in the near future
-
more elaborate policies require greater overhead
-
-
frame locking
-
a frame that must stay in memory at all times
-
the kernel, and key control structures
-
i/o buffers and time-critical areas may be locked
-
locking is achieved by a lock bit associated with the frame
-
-
algorithms
-
optimal (Belady's policy)
-
selects the page whose time to next reference is longest; in the textbook example, it produces three page faults after the frame allocation has been filled
-
-
FIFO
-
treats page frames as a circular buffer
-
first thing in gets pushed out
-
-
least recently used
-
based on locality
-
least likely to be used is the least recently used
-
difficult to implement
-
-
clock policy
-
requires a use bit per frame
-
set when the page is loaded
-
set again on each reference
-
a frame is replaced only if its use bit is 0
-
the sweeping pointer clears use bits as it passes
-
-
-
-
Desirable to maintain as many processes in memory as possible
-
frees programmers from size restrictions
-
virt mem
-
references are logical, translated
-
breaks processes into pieces
-
two approaches are paging and segmentation
-
management scheme requires both hardware and software
-
Lecture 7: Virtual Memory
-
Consider a paged VM system with 32-bit virtual addresses and 2K-byte pages. Each page table entry requires 32 bits. Each page table is exactly one page (2K bytes)
-
how many levels of page tables
-
how many entries does the table use at each level (at one level, the table uses a smaller number of entries)
-
the smaller number of entries can be used at the top or bottom. Which strategy uses the least number of pages? Explain by calculating the number of pages for both
-
total virtual pages: $2^{32}/2^{11} = 2^{21}$
-
PTEs per page table: $2^{11}/2^2 = 2^9 = 512$, so 3 levels are needed, splitting the 21 VPN bits as $9 + 9 + 3$
-
entries per level: 512, 512, 8
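-
One way to answer the last part (a worked calculation, not from the notes): if the 3-bit (8-entry) level sits at the top, fully mapping the space needs $1 + 8 + 8 \times 512 = 4105$ table pages; if it sits at the bottom, it needs $1 + 512 + 2^{21}/8 = 262{,}657$ table pages, since each 8-entry bottom table still occupies a full page. Putting the smaller level at the top uses far fewer pages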
-
Dynamic Memory allocation
-
basic concepts
-
malloc etc acquires vm at run time
-
manage an area of process vm known as the heap
-
allocator maintains heap as collection of variable sized blocks, either allocated or free
-
types
-
explicit – malloc and free in C
-
implicit – application allocates, doesn't free, GC in Java, ML, Lisp
-
-
-
void *malloc(size_t size) – returns a void pointer to a memory block of at least size bytes, aligned to an 8-byte boundary (usually); on failure returns NULL and sets errno
-
void free(void *p) – returns the block pointed at by p to the pool of available memory – p must come from a previous call to malloc
-
sbrk is an internal function used to grow the heap
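-
A minimal sketch of the malloc/free contract in use (checking for NULL since allocation can fail):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1000;
        int *p = malloc(n * sizeof *p);   /* at least n ints, 8-byte aligned */
        if (p == NULL) {                  /* allocation can fail: check first */
            perror("malloc");
            return 1;
        }
        for (size_t i = 0; i < n; i++)
            p[i] = (int)i;
        free(p);                          /* p must not be used after this */
        p = NULL;                         /* defensive: avoid a dangling pointer */
        return 0;
    }
-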
memory is word addressed
-
applications – can issue an arbitrary sequence of malloc and free
-
free must be called on a malloc'd block
-
allocator can't control the number or size of allocated blocks
-
must respond immediately to malloc
-
must allocate from free memory
-
must align blocks to satisfy all alignment requirements (8 byte for GNU Malloc)
-
can manipulate and modify only free memory
-
can't move the allocated blocks once malloc'd
-
given sequence of malloc and free,
-
maximize throughput and peak memory utilization
-
throughput is number of completed requests per unit time
-
payload, malloc(p) results in a block with payload of p bytes
-
aggregate payload is the sum of the currently allocated payloads
-
current heap size is non-decreasing
-
peak utilization after $k$ requests = max aggregate payload over current heap size
-
internal fragmentation is when the payload is smaller than the block size
-
enough aggregate memory, but not enough contiguous memory, is external fragmentation
-
compacting the heap can be done, but requires a tremendous amount of work – a linear search of the heap and introspection into memory
-
size is embedded in a block
-
keep track of free blocks – some structure in the heap
-
picking blocks for allocation
-
how are freed blocks reinserted?
-
standard block – length of the block in the word preceding the block (the header field, or header); requires an extra word for every allocated block
-
implicit list using length (links all blocks)
-
explicit list (lists free blocks)
-
segregated free list
-
different free lists for different size classes
Lecture 8: Malloc
-
how much memory to free – the word before the block holds the size
-
implicit list - use length/links of all blocks – implemented in K&R
-
segregated free list – free list for different size classes
-
blocks sorted by size – use an RB tree with pointers to each block, length used as key
-
The implicit list
-
size and allocation status – could store in two words, but this is wasteful
-
use one word for block size; the low bit holds allocation status (works because block sizes are multiples of 8, so the low bits are always zero) – see the header-packing sketch below
-
status word, payload, padding fulfill alignment
-
alignment with 8
-
first fit: while not past the end, and the block is allocated or too small, go to the next block
-
next fit – like first fit, but starts where the previous search finished
-
best fit – looks for the smallest free block of appropriate size
-
allocation in a free block – split a block
-
freeing – simply clear allocated flag
-
or coalesce – check whether the next/previous block is free
-
boundary tags – size/status at both ends of a block let you traverse the list backwards; requires extra space; an important and general technique [Knuth 73]
-
extra internal fragmentation with boundary tags
-
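A sketch of the header-packing and first-fit ideas above, in the style of the CS:APP allocator (macro and function names here are illustrative):

    #include <stddef.h>

    /* Sizes are multiples of 8, so the low 3 bits of the header are free
       to hold the allocated flag. */
    #define PACK(size, alloc)  ((size) | (alloc))       /* build a header word */
    #define GET_SIZE(h)        (*(size_t *)(h) & ~0x7)  /* mask off status bits */
    #define GET_ALLOC(h)       (*(size_t *)(h) & 0x1)   /* low bit = allocated? */

    /* First fit over the implicit list; assumes a well-formed heap whose
       blocks tile [heap_start, heap_end). */
    static void *first_fit(char *heap_start, char *heap_end, size_t asize) {
        char *bp = heap_start;
        while (bp < heap_end) {
            size_t size = GET_SIZE(bp);
            if (!GET_ALLOC(bp) && size >= asize)
                return bp;                 /* free and big enough: take it */
            bp += size;                    /* step to the next block's header */
        }
        return NULL;                       /* no fit: must extend the heap */
    }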
-
-
Placement – first fit, next fit, best fit
-
splitting – how much internal fragmentation
-
coalescing – immediate or deferred
-
often used in special purpose apps, easy to implement
-
-
The explicit free list
-
allocated blocks are as before; a free block has size, allocation status, and next/previous pointers stored in the first two words of the free payload
-
allocation requires updating the previous and next pointers of several blocks
-
faster when memory is more full
-
more complex free/allocate because blocks need to be spliced in/out of the list
-
extra space required by links
-
commonly used with segregated free lists – see the node sketch below
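-
A sketch of an explicit free-list node and an O(1) splice, assuming the pointers live inside the free block's payload (names illustrative):

    #include <stddef.h>

    typedef struct free_block {
        size_t header;              /* size | allocated bit, as before        */
        struct free_block *next;    /* next free block (any order, e.g. LIFO) */
        struct free_block *prev;    /* previous free block                    */
        /* ... remaining payload bytes, boundary tag at the end ...           */
    } free_block_t;

    /* Splice a block out of the doubly linked free list in O(1). */
    static void splice_out(free_block_t **root, free_block_t *b) {
        if (b->prev) b->prev->next = b->next;
        else         *root = b->next;       /* b was the head of the list */
        if (b->next) b->next->prev = b->prev;
    }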
-
Lecture 9: Malloc Continued
-
programming assignment coming
-
Explicit List
-
splice out predecessor and successor blocks, insert new block at root of list, coalesce if needed
-
explicit – only include free blocks
-
linked lists are mostly used in segregated free lists
-
segregated free list
-
blocks sorted by size – rb tree, or similar to use length as a key
-
size classes of blocks have own free list
-
large sizes – one class for each power of two
-
Search for appropriate free list for block of size $m > n$
-
if an appropriate block is found
-
split block and place on appropriate list
-
-
If no block found, try next larger class
-
repeat until block found
-
If no block is yet found – request additional heap memory from os (sbrk)
-
allocate block of $n$ bytes from this new memory
-
place remainder as a single free block in largest size class
-
-
-
Memory perils and pitfalls
-
dereferencing bad pointers
-
reading uninitialized memory
-
overwriting memory
-
referencing non existent variables
-
freeing multiple times
-
referencing freed blocks
-
failing to free blocks
-
be careful of various operators and their precedence – classic pitfalls are sketched below
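-
Illustrative (deliberately buggy) examples of pitfalls from this list; the offending lines are left commented out:

    #include <stdlib.h>
    #include <string.h>

    void pitfalls(void) {
        char *s = malloc(16);
        if (s == NULL) return;
        strcpy(s, "hello");
        free(s);
        /* s[0] = 'H';   BUG: referencing a freed block, undefined behavior */
        /* free(s);      BUG: freeing multiple times corrupts the allocator */

        int val = 5, *q = &val;
        /* *q--;         BUG: precedence – decrements the pointer, not the value */
        (*q)--;          /* intended: decrement the value q points to */
    }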
-
Lecture 10: Garbage Collection!
-
Function pointers are used to implement dynamic loading and in some OO languages, v-tables
Garbage Collection
-
OO and Functional Programming languages
-
storage is still exhaustible
-
Let the runtime handle it
-
less debugging
-
no leaks
-
no explicit management
-
-
malloc function, new operator
-
used to allocate needed memory
-
-
mutator – does useful work; changes or mutates the connectivity of the object graph
-
gc – shift responsibility of freeing to systems
-
static allocation – binding at compile time – local vars bound to the same locations at every activation of the procedure
-
heap allocation – allocated and deallocated in any order, outlives method creating it, real-world data structures
-
liveness – an object is live if a reference to it is held in the roots or reachable from them; this can be detected
-
memory is now a directed graph
-
non-reachable is garbage
-
nodes are reachable if there is a path from any root to that node
-
initialization is part of new – which acts like malloc otherwise
-
ref count – the number of objects that refer to the current object
-
harder to determine when memory can be deallocated
-
but less time is spent debugging memory
-
Classical algorithms
-
Reference counting
-
counts number of references to each object
-
incremental
-
Perl, Photoshop, Smalltalk, VB
-
update(right(m), null)
-
recursively decrement all of the things below it, and update anything below again
-
-
management distributed through execution
-
better locality than tracing
-
cells that are short lived can be reused to reduce the amount of heap usage
-
but fragile, and doesn't deal with cyclic structures – see the sketch below
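-
A minimal reference-counting sketch (illustrative, not from the slides); note how a cycle would keep its members' counts above zero forever:

    #include <stdlib.h>

    typedef struct obj {
        int refcount;
        struct obj *left, *right;   /* outgoing references in the object graph */
    } obj_t;

    static void rc_inc(obj_t *o) { if (o) o->refcount++; }

    static void rc_dec(obj_t *o) {
        if (o && --o->refcount == 0) {
            rc_dec(o->left);        /* recursively release children */
            rc_dec(o->right);
            free(o);                /* reclaim immediately: incremental GC */
        }
    }

    /* update(field, new): the write barrier that keeps counts correct */
    static void update(obj_t **field, obj_t *new_val) {
        rc_inc(new_val);            /* increment first, in case *field == new_val */
        rc_dec(*field);
        *field = new_val;
    }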
-
-
Mark sweep
-
dead objects not reclaimed immediately
-
begins when a request is made that results in memory exhaustion
-
useful work is suspended (stop-the-world)
-
sweep unused objects back to list
-
global traversal of all live objects starting from root
-
sweep from the bottom up
-
unmarked cells returned to the free pool
-
marking bits are unset in prep for next gc
-
stop/start – the mutator is paused during collection
-
complexity proportional to size of heap
-
more fragmentation/reduced locality
-
requires head-room
-
-
Copying algorithm
-
split into from and to spaces
-
allocation through pointer bumping in from-space
-
live identification is done through tracing
-
live object copied to other side of heap
-
once copied, the roles of the two spaces are flipped
-
fast allocation, simple check for exhaustion
-
no fragmentation
-
poor space utilization
-
more page faults
-
-
Generational gc
-
the de facto standard
-
uses multiple generations
-
-
Lecture 11: Process Management
-
difference between programs and processes
-
program is a lifeless object, sits on disk, just instructions
-
a process is a running program
-
OS takes the instructions and gets them running
-
transforms the program into something useful
-
-
issues with programs running on different OSes – because of memory layout differences, ABIs, etc
-
creation
-
program loaded from disk
-
stored to memory (code and data), stack and heap
-
i/o – create file descriptors
-
execute, starting at entry point (main)
-
-
System to be initialized
-
user requests a new process
-
initiation of batch jobs
-
be very careful of fork – a minimal sketch follows
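-
A minimal fork sketch: after fork, both processes run the code that follows it, which is why careless forks (e.g., in loops) multiply processes:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();
        if (pid < 0) {                     /* fork failed */
            perror("fork");
            return 1;
        } else if (pid == 0) {             /* child: fork returned 0 */
            printf("child %d\n", (int)getpid());
            return 0;
        }
        waitpid(pid, NULL, 0);             /* parent: reap the child */
        printf("parent %d reaped %d\n", (int)getpid(), (int)pid);
        return 0;
    }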
-
termination conditions
-
normal exit
-
error exit
-
fatal error
-
killed by another process
-
-
Parents create children, children can create children
-
forming a hierarchy, called a process group (except on Windows)
-
processes may be either running or not running (suspension is used to wait for other things)
-
has to use a queue – kinda
Lecture 12: Process Management, continued
-
the two process model
-
running or not running
-
running requires availability of cpu resources
-
many state models
-
two state model
-
running, and not running, looped together
-
not running is entry point
-
exit is exit point from running
-
-
not-running-queue
-
uses a queue for not yet running
-
-
three state model
-
ready, blocked, running
-
running to blocked, or to ready
-
ready to running
-
blocked to ready
-
lowest layer of a process-structured os
-
handles interrupts and scheduling
-
above the scheduler are sequential processes
-
can preempt, switch, etc
-
-
scheduler preemption
-
save program counter
-
register values
-
memory, etc
-
-
five state
-
running, ready, new, blocked, exit
-
new admits to ready
-
ready dispatches to running
-
running times out back to ready
-
an event wait moves running to blocked
-
an event occurrence moves blocked to ready
-
running releases to exit
-
two/multiple queues
-
blocked queue
-
ready queue
-
a queue for each event waiting type
-
-
-
process suspension
-
processor is faster than i/o
-
swap processes to disk to free memory
-
blocked state becomes suspend when swapped to disk
-
blocked/suspend, ready/suspend
-
-
process suspension may happen for a variety of reasons
-
i/o tables
-
file tables
-
status tables
-
process structure points to where various things are
-
registers
-
pc
-
program status
-
stack pointer
-
process state
-
priority
-
sched params
-
pid
-
parent
-
process group
-
a variety of others
-
-
shared memory can be used to communicate/coordinate between processes
-
Threading
-
lighter weight than processes
-
but have a similar purpose
-
share the same process resources and execution environment
-
and run only a specific function
-
can still lead to race conditions
-
can cause issues w/ stack space
-
Lecture 13: Process Examples
-
virtual addresses, physical addresses must be mapped for shared memory
-
know how many processes are in play
-
remember that processes have parent/child relationships
Threading
-
miniature processes – part of one, same resources as the original, but new execution context (new stack space for the thread)
-
may/does require dynamically allocating a new stack
-
multi threading helps to utilize as much of the processor as possible
-
threads can be implemented either in userspace or by the kernel
-
requires scheduling, a thread table, etc
-
-
kernel threads are now the norm – makes things much easier to handle
-
good concurrency support – blocking thread doesn't starve siblings
-
scheduling needs system calls
-
not as flexible
-
not as scalable
-
-
hybrid implementations – part kernel threads, part user threads
-
scheduler activations
-
mimic kernel threads, gaining the performance of user threads
-
avoid unnecessary user/kernel transitions
-
kernel assigns virtual processors to each process
-
relies on the kernel to notify of kernel blocking events
-
more virtual processors are assigned when blocking occurs
-
-
MySQL
-
popup threads – new threads on message arrival
-
some before, but most after
-
-
threads must be reentrant
-
may have private globals
-
and static locals
-
Lecture 14: Threads, Continued
-
pthread_create creates a thread and runs a start routine
-
pthread_exit simply terminates the calling thread
-
pthread_join suspends execution of calling thread until the target thread terminates, unless the target thread has already terminated
-
Be careful with how sync is done.
-
And consider how syncing in threads is done
-
Mutexes, flags, semaphores, etc – a minimal pthread sketch follows
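-
A minimal pthread_create / pthread_join sketch:

    #include <pthread.h>
    #include <stdio.h>

    static void *start_routine(void *arg) {
        int id = *(int *)arg;
        printf("thread %d running\n", id);
        return NULL;                       /* same as pthread_exit(NULL) */
    }

    int main(void) {
        pthread_t tid;
        int id = 1;
        if (pthread_create(&tid, NULL, start_routine, &id) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
        pthread_join(tid, NULL);           /* suspend until the thread terminates */
        return 0;
    }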
Lecture 16: More Concurrency
-
Print Queue – $N = 100$
-
Processes add print jobs to the queue, but can't add more to a full queue
-
Processes remove jobs from the queue and do the printing, but can't consume from an empty queue
-
Some buffers are circular
-
When buffer is full, producers go to sleep
-
When empty, consumers go to sleep
-
If something is added to an empty buffer, wake a consumer; if something is removed from a full buffer, wake a producer
-
Updates may, and likely do, require a mutex
-
Semaphores – a number, 0 or a positive number
-
down – if val > 0, decrement; if val = 0, suspend without completing the down
-
up – increment, and if there are processes sleeping on the semaphore, wake one (or all) up – see the bounded-buffer sketch below
-
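A bounded-buffer sketch using POSIX unnamed semaphores plus a mutex (a minimal sketch with assumed names; initialization shown in the trailing comment):

    #include <pthread.h>
    #include <semaphore.h>

    #define N 100

    int buffer[N];
    int in = 0, out = 0;
    sem_t empty_slots;                /* initialized to N */
    sem_t full_slots;                 /* initialized to 0 */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void producer_put(int item) {
        sem_wait(&empty_slots);       /* down: sleep if the buffer is full  */
        pthread_mutex_lock(&lock);    /* mutex protects the buffer update   */
        buffer[in] = item;
        in = (in + 1) % N;
        pthread_mutex_unlock(&lock);
        sem_post(&full_slots);        /* up: wake a sleeping consumer       */
    }

    int consumer_get(void) {
        sem_wait(&full_slots);        /* down: sleep if the buffer is empty */
        pthread_mutex_lock(&lock);
        int item = buffer[out];
        out = (out + 1) % N;
        pthread_mutex_unlock(&lock);
        sem_post(&empty_slots);       /* up: wake a sleeping producer       */
        return item;
    }
    /* in main: sem_init(&empty_slots, 0, N); sem_init(&full_slots, 0, 0); */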
-
Monitors
-
Semaphores are used for mutual exclusion
-
barriers and signalling events
-
low level
-
-
Monitors provide mutual exclusion implicitly
Lecture 17: Monitor
-
Semaphores as barriers
-
Monitor is a language construct
-
Only one process can be active in a monitor at any instant
-
Use condition variables to block processes
-
wait operations
-
signal operations
-
See slide 11 from class
-
producer-consumer problem: only one process can be active at one time
-
buffer has $n$ slots
-
implementation independent
-
only visible effects matter
-
callers are independent
-
monitor writers can change the implementation as long as the same effects are maintained
-
provides mutual exclusion implicitly
-
condition variables
-
delay a process if a monitor state fails to satisfy boolean condition
-
awaken a process or processes if the condition becomes true
-
-
signaling disciplines
-
signal-and-continue, signal-and-wait
-
Helps to solve race conditions – a condition-variable sketch follows
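-
A monitor-style sketch using a pthread mutex plus condition variable (pthreads give signal-and-continue semantics; the shared counter stands in for arbitrary monitor state):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonzero = PTHREAD_COND_INITIALIZER;
    static int count = 0;

    void monitor_increment(void) {
        pthread_mutex_lock(&m);        /* only one thread active "in" the monitor */
        count++;
        pthread_cond_signal(&nonzero); /* awaken a waiter; the signaler continues */
        pthread_mutex_unlock(&m);
    }

    void monitor_decrement(void) {
        pthread_mutex_lock(&m);
        while (count == 0)             /* re-check: condition may no longer hold */
            pthread_cond_wait(&nonzero, &m);  /* releases m while blocked */
        count--;
        pthread_mutex_unlock(&m);
    }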
Lecture 18: Deadlock
-
When there's a shared resource that you only need a part of
-
many different methods of dealing with deadlock
-
happens on mutual exclusion
-
no preemption
-
circular waiting
-
mutual exclusion helps to avoid race conditions
-
hold-and-wait – must request all resources initially
-
no preemption – difficult to prevent in practice
-
circular wait – define linear ordering of resource types
-
Most OSes use the ostrich approach – head in the sand
-
Use the banker's algorithm to help control allocation/release
-
max requirements stated in advance
-
processes must be independent – no synchronization requirements
-
fixed number of resources
-
no process may exit while holding resources
-
detect by
-
marking each process that has a row of all zeroes in the allocation matrix
-
initialize a temp vector $W$ to equal the available vector
-
find an $i$ such that process $i$ is unmarked and the $i$th row of $Q \leq W$
-
if no such $i$ exists, terminate
-
if one exists, mark process $i$, add the $i$th row of $A$ to $W$, and repeat – a sketch follows below
-
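A sketch of this detection loop (Stallings-style: $A$ = allocation matrix, $Q$ = request matrix; sizes are assumed for illustration, and the caller pre-marks processes whose allocation row is all zeros):

    #include <stdbool.h>

    #define NPROC 4
    #define NRES  3

    bool detect_deadlock(int A[NPROC][NRES], int Q[NPROC][NRES],
                         int avail[NRES], bool marked[NPROC]) {
        int W[NRES];
        for (int r = 0; r < NRES; r++)
            W[r] = avail[r];                      /* temp vector = available */

        bool progress = true;
        while (progress) {
            progress = false;
            for (int i = 0; i < NPROC; i++) {
                if (marked[i]) continue;
                bool satisfiable = true;          /* is row Q_i <= W ? */
                for (int r = 0; r < NRES; r++)
                    if (Q[i][r] > W[r]) { satisfiable = false; break; }
                if (satisfiable) {
                    marked[i] = true;             /* process i could finish */
                    for (int r = 0; r < NRES; r++)
                        W[r] += A[i][r];          /* reclaim its allocation */
                    progress = true;
                }
            }
        }
        for (int i = 0; i < NPROC; i++)
            if (!marked[i]) return true;          /* unmarked => deadlocked */
        return false;
    }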
-
Lecture 20: Dealing with Interrupts
-
be careful of how synchronous/asynchronous function calls happen
-
interrupts are handled by function pointers
-
execution context
-
alt_exception_entry.s – handles changing execution context, i.e., interrupts
stores return address, exception handling exit
-
stores current stack pointer, all registers, etc
-
Loads and stores into the stack
-
-
Stack pointer points to the last used slot
-
FP points to the last saved frame
-
In multi-threading, you can simply switch to the next thread on return from an interrupt handler
-
Next up – scheduling algorithms
-
Signal-and-continue – the signaler continues, and the signaled process executes at some later time
Lecture 21: Uniprocessor Scheduling
-
Assign processes to be executed
-
minimize response time
-
maximize throughput and processor efficiency
-
Long-term scheduling – determines which processes are admitted; determines the degree of multiprogramming
-
Medium-term – swapping function, manage degree of multiprogramming and available memory
-
short-term scheduling – dispatcher, executes most frequently, when events occur (clock, i/o interrupts, OS calls, signals)
-
user-oriented by response time, which should be minimized
-
throughput for the system
-
turnaround time – interval between submission and completion – batch processing
-
response time – submission and initial response
-
deadline – time for a process to complete; maximize the percentage of deadlines met
-
predictability – job should take about the same time to run and incur the same cost on a system regardless of load
-
-
Multiple queues based on what is being waited on
-
Schedulers often use priorities – low priority may suffer starvation
-
First Come First served – first process
-
Consider arrival time and time required
-
Visualize it like a Gantt chart
-
Pick oldest in queue to go first
-
A short process may have to wait a long time before it can execute
-
-
Round Robin – Preemption based on a clock
-
a time quantum is determined; each process may use the processor for at most that amount of time
-
Time slicing
-
-
Shortest process next
-
shortest remaining time
-
highest response ratio next – choose the process maximizing $(w + s)/s$, where $w$ is time spent waiting and $s$ is service time
-
Multilevel feedback queue – penalizes jobs that have been running longer; useful when the remaining time a process needs to execute is unknown
Lecture 22: Scheduling
-
Use the context library to implement a thread scheduler – a ucontext sketch follows below
-
Context should be pretty helpful
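-
A minimal sketch assuming the "context library" is POSIX <ucontext.h>: one worker context plus main, switched with swapcontext; a scheduler would keep an array of these and pick the next one to resume:

    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, worker_ctx;

    static void worker(void) {
        printf("worker: running, yielding back\n");
        swapcontext(&worker_ctx, &main_ctx);   /* save self, resume main */
        printf("worker: resumed, finishing\n");
    }                                          /* uc_link resumes main at return */

    int main(void) {
        char *stack = malloc(64 * 1024);       /* each context needs its own stack */
        getcontext(&worker_ctx);
        worker_ctx.uc_stack.ss_sp = stack;
        worker_ctx.uc_stack.ss_size = 64 * 1024;
        worker_ctx.uc_link = &main_ctx;        /* where to go when worker returns */
        makecontext(&worker_ctx, worker, 0);

        swapcontext(&main_ctx, &worker_ctx);   /* run worker until it yields */
        printf("main: worker yielded, resuming it\n");
        swapcontext(&main_ctx, &worker_ctx);   /* resume worker to completion */
        free(stack);
        return 0;
    }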