CSCE 230: Computer Organization

Lecture 1: Binary Numbers

a number with only two possible symbols, 0 or 1, a bit
we study binary numbers because computers use binary numbers
computers use binary numbers because digital circutits only support two states, on and off
we frequently speak of n-bit binary numbers, a word
a byte is a part of a word, frequently 8-bits, but historically has been 6
two types of binary numbers,
- unsigned
  - positive numbers or 0
  - a 4 bit unsigned binary number has up to 16 possibilities
    - $b_3b_2b_1b_0$
    - $b_0$ is the lsb, $b_3$ is the msb
    - in decimal, it is $b_0 2^0 + b_1 2^1 + b_2 2^2 + b_3 2^3$
  - an $n$ bit unsigned binary number has up to $2^n - 1$ possibilities, $[0, 2^n - 1]$
  - Decimal conversion, use divide by two and remainders, first remainder is lsb
  - Addition of two binary numbers:
    - When sum is within the range $[0, 2^n - 1]$
      - start with the LSB, adding each bit, carrying as necessary
      - $1 + 1 = 10$ where $0$ is the one-bit sum and $1$ is the carry-out.
    - When sum is out of the range $[0, 2^n - 1]$ (overflow)
      - When possible, just add a bit.
      - You loose information.
    - To detect overflow, if the msb has a carry-out, there is overflow.
  - Subtraction of two numbers:
    - $a - b = a + (- b)$, becomes the addition of two signed binary numbers
- signed
  - msb is the sign, 1 is negative, 0 is positive
  - In all, the unsigned representation of the positive numbers is the same
  - In signed magnitude and 1's complement, there is negative zero
  - signed magnitude
    - change the sign, that's it.
    - $[-(2^{n - 1} - 1), 2^{n - 1} - 1]$
  - 1's complement
    - Complement each bit of the unsigned postive number (flip each bit)
    - $[-(2^{n - 1} - 1), 2^{n - 1} - 1]$
  - 2's complement
    - Take the ones complement and add 1
    - this has slightly greater range
    - $[-(2^{n - 1}), 2^{n - 1} - 1]$
    - Addition:
      - Within range:
        
        Start at the LSB, summing as before
      - Overflows:
        
        overflow if MSB carry-over and MSB-1 carry-over are not the same
        
        overflow if both have same sign, and the sign is different in the final result

Lecture 2: More Binary Numbers

Subtraction of signed binary numbers (2's complement)
- $a - b = a + (- b)$
- first negate b by flipping all bits, and adding 1
- $a - b = a + b^{\prime} + 1$
- add as normal, be careful of overflow
subtraction of unsigned binary numbers:
- see above.
- if the final overflow bit is different from the sign of both numbers, then there is overflow.
sign extension: Add the correct number of "1"s before the MSB or the correct number of "0"s to the same position, depending on the sign bit.
to extend an unsigned number, just and the correct number of "0"s before the MSB
Hex!
- 0-15, as 0-9, A-F
- 1 hex digit is 1 nybble (4 bits)
- $1111110 = FE$

Recitation 1: Welcome

format
- announcements
- new material review
- common mistakes
- intro to lab
read and start lab prior to coming to recitation!

Binary Numbers

0b for binary numbers, nothing special for decimal, and 0x for Hex.
Place-value review.
binary is on $2^n$
remember that index is alwasy 0-based

Lab

THE HORROR THAT IS ASSEMBLY

Lecture 3: Assembly Language

assembly is very low level
- made for a specific processor
high level languags include java, c, c++, python and lisps
- can be executed on a variety of processors
In 230, we do assembly for NIOS2, and for the expiremental, we also deal with x86
there is an almost 1-to-1 relationship between an assembly language and an isa, conversion is supported by an assembler
processor is connected to memory
- processor works on n-bits
- memory is a sequnce of bits, organized as words.
- we use byte addressing, assigning successive addresses to successive bytes.
- n-bits = 32 bits
- 4 bytes per word.
- bytes: 0, 1, 2, 3, 4, …
- words: 0, 4, 8, 12, 16, …
little endian versus big endian
- little-endian
  - first word is 3, 2, 1, 0
  - addresses start at the least significant bit
- big-endian
  - first word is 0, 1, 2, 3
  - addresses start from the most significant bit
- most archs use little endian, including NIOS and x86
- TCP/IP and many older computers use big-endian
  - hton
  - ntoh
Word alignment:
- a word address has an aligned address if the address is a multiple of the number of bytes in a word
- for n=32: 0, 4, 8, 12, 16,…
- unaligned for n=32: 1, 2, 3, 5, …
- some archs don't support unaligned word addresses

Assembly Basics:

java/c:

int declarations

  int r2 = 1;
  int r3 = r2;
  int r4 = r2 + r3;

Assembly:

          .text
          .global _start
  _start:
          movi r2, 1              ; Sets Register 2 to 1
          movi r3, r2             ; Set Register 3 to the contents of register 2
          add r4, r2, r3          ; Set Register 4 to r2 + r3
          .data
          .end
          ;; Registers 2-23 are general purpose
          ;; There are 32 registers, those that are not general purpose are special purpose
          ;; There are 6 control registers, used for I/O
          ;; PC is also a register, the program counter.  This points to the address of the next instruction to be executed

A register stores one word of data
register namse are case sensitive in NIOS2 assembly, all others are case-insensitive
assembly follows the mnemonic operand* pattern
two types of instructions, regular instructions (directly implemented in hardware) and psuedo instructions (implemented by regular insturctions (used to be called macro instructions))
two types of instruction sets:
- Reduced – RISC
  - small number of simple instructions
  - instructions can fit in a single word
  - ARM
  - OpenRISC
  - NIOS2
  - POWER
- Complex – CISC
  - large number of both simple and complex instructions
  - instructions fit in one or more words
  - x86(_64)
  - VAX
NIOS2 Instructions
- arithmetic
  - add
  - addi
  - sub
  - subi
  - mul
  - muli
  - div
  - divu
  - divi
  - diviu
- logical
- data copy
- control transfer
  - unconditional branching
    - jmp
    - br
      - label must be ±32 kb from current instruction
  - conditional branching
    - blt, bltu
    - bgt, bgtu
    - ble, bleu
    - bge, bgeu
    - beq
    - bne
- misc
  - shift
  - rotate
Remember to INF loop at the end of a program if you are unable to send a software signal to shut off

ex:

  _start:
          movi r2, 1
          movi r3, r2
          movi r4, 0
          bgtu r2, r3, _true
  _false: movi r4, r3n
          br _cont
  _true:  movi r4, r2
  _cont:  nop
          br _cont

ex:

  _start: movi r2, 1
          movi r3, 10
          movi r4, 0
  _loop:  add r4, r4, r2
          add r2, r2, 1
          bleu r2, r3, _loop
  _cont:  nop
          br _cont

Lecture 4: Assembly Continued

types
- arithmetic
- branching
- logical
  - and
    - and/r,r,r
    - andi/r,r,v
    - andhi/r,r,v
  - or
    - or
    - ori
    - orhi
  - xor
    - xor
    - xori
    - xorhi
- data
  - move
    - mov
    - movi
    - movui
    - movio – move label address to register
  - load/store
    - assembler directives
      - .egu
      - .data
      - .byte
      - .hword
      - .word
      - .ascii
      - .stop
    - memory addresses
      - displacement addresses: V(rj), number(register) = v + rj
      - number is 16 bit signed
      - absolute mode v(r0)
      - register indirect 0(rj)
    - instructions
      - ex:
        
        X: .byte 1
        
        can be addressed using absolute mode X(r0)
        
        address to register, and then register indirect
      - ex:
        
        ARRAY: .byte 1,2,3,4
        
        movi r2, 0; ARRAY(r2)
        
        movi r2, 1; ARRAY(r2)
        
        movia r2, ARRAY; 0(r2)
        
        movia r2, ARRAY; 1(r2)
      - ldw
      - ldh
      - ldb
      - ldhu
      - ldbu
      - stw
      - sth
      - stb
- subroutines

Lecture 5: More Assembly

Exam on February 16, clases split afterward

More addressing

addressing modes
- immediate mode – addi
- register mode – add
- displacement mode – ldw r5, num(r6)
- register indirect mode – ldw r5, (r6)
- absolute mode – ldw r5, num(r0)

ex:

  #define N 10

  int num[N] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

  int i = 0;

  .egu N 10
  start:
          movi r2, 0
  loop:   ldb r3, NUM(r2)
          ldw r4, SUM(r0)
          add r4, r4, r3
          addi r2, r2, 1
          stw r4, SUM(r0)
          blt r2, N, loop
  end:    br end
          .data
  NUM:    .byte 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
          .skip 2
  SUM:    .word 0

  .egu N 40
  start:
          movi r2, 0
  loop:   ldw r3, NUM(r2)
          ldw r4, SUM(r0)
          add r4, r4, r3
          addi r2, r2, 4
          stw r4, SUM(r0)
          blt r2, N, loop
  end:    br end
          .data
  NUM:    .word 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
  SUM:    .word 0

Actual Subroutines

instructions
- callr
- call
- ret
On call, next address is put in the ra (return address) register, and then the subroutine is jumped to.
At return, the return-address is jumped back to.

ex (factorial):

  start:  movi r2, 6
          call factorial
  end:    br end
  factorial:
          movi r3, 1
          movi r4, 1
  loop:   mul r3, r3, r4
          addi r4, r4, 1
          ble r4, r4, loop
          ret

  factorial2:
          movi r3, 1
          mov r4, r2
  loop2:  mul r3, r3, r4
          subi r4, r4, 1
          bne r4, r0, loop2
          ret

.stack, a lifo

r27 is the stack pointer, sp
add element
- subi, sp, sp, 4
- stw ri,(sp)
remove top element
- ldw ri,(sp)
- addi sp, sp, 4
to save registers in a subroutine
pass parameters to a subroutine

ex:

          movi r5,
          subi sp, sp, 4
          stw r5, (sp)
          call factorial
          ldw r5, (sp)
          addi sp, sp, 4
  end:    br end
  factorial:
          ldw r3, 4(sp)
          movi r4, 1
  loop:   mul r4, r4, r3
          subi r3, r3, 1
          bne r3, 0, loop
          subi sp, sp, 4
          stw r4, (sp)
          ret

Lecture 6: Hardware, and the background of logic circuits

a logic circuit implements a function of logic signals (logical variables) that take two possible values that take two possible values 0 and 1 corresponding to different voltage levels.
or looks like an arrowhead with a curved back, $f = x_1 + x_2$
and looks like a box with a leading half-circle, $f = x_1 \cdot x_2$
not takes one, wedge with leading circle, $f = \overline{x}$
synthesis of logic functions:
- truth table
- sum-of-products form
- minimize

Ex: Bit multiplexer

takes S, A, B, single output
S is control Signal
data are inputs a and b, each one bit
if S = 0, a, else, B

truth table

s	a	b	out (f(s, a, b))	term
0	0	0	0
0	0	1	0
0	1	0	1	$\overline{s} \cdot a \cdot \overline{b}$
0	1	1	1	$\overline{s} \cdot a \cdot b$
1	0	0	0
1	0	1	1	$s \cdot \overline{a} \cdot b$
1	1	0	0
1	1	1	1	$s \cdot a \cdot b$

Product Form terms
- $f(s, a, b) = \overline{s} \cdot a \cdot \overline{b} + \overline{s} \cdot a \cdot b + s \cdot \overline{a} \cdot b + s \cdot a \cdot b$
- not and or
$f = (\overline{S}a) + (Sb)$
minimize, using logic rules:
- Commutative: $w + y = y + w$, $wy = yw$
- Distributive: $wy + wz = w(y + z)$, $w + yz = (w + y)(w + z)$
- Associative: $(w + y) + z = w + (y + z)$, $(wy)z = w(yz)$
- Idempotent: $w + w = w$, $ww = w$
- Involution: $\overline{\overline{w}} = w$
- Complement: $w + \overline{w} = 1$, $w\overline{w} = 0$
- DeMorgan's: $\overline{w + y} = \overline{w}\cdot\overline{y}$, $\overline{wy} = \overline{w} + \overline{y}$

Ex:

Truth table:

x_1	x_2	x_3	f	term
0	0	0	1	$\overline{x_1}\overline{x_2}\overline{x_3}$
0	0	1	1	$\overline{x_1}\overline{x_2}x_3$
0	1	0	1	$\overline{x_1}x_2\overline{x_3}$
0	1	1	0
1	0	0	1	$x_1\overline{x_2}\overline{x_3}$
1	0	1	1	$x_1\overline{x_2}x_3$
1	1	0	0
1	1	1	0

Product of Terms:
- $f = \overline{x_1}\overline{x_2}\overline{x_3} + \overline{x_1}\overline{x_2}x_3 + \overline{x_1}x_2\overline{x_3} + x_1\overline{x_2}\overline{x_3} + x_1\overline{x_2}x_3$
Minimized:
- $f = \overline{x_1}\overline{x_2} + \overline{x_1}x_2\overline{x_3} + x_1\overline{x_2}$
- $f = \overline{x_2} + \overline{x_1}x_2\overline{x_3}$
- $f = \overline{x_2} + \overline{x_1}\cdot\overline{x_3}$

Other Logical Gates:
- xor, or with a line behind it
- nand, and with a not on the output
- nor, or with a not on the output
- xnor, xor with a not on the output
Implementation:
- propagation delay: a very small amount of time between input change and output change, $T_p$, non-zero.
- PLA(Programable Logic Array): Appendix A.11.1
- FPGA(Field Programmable Gate Array): A form of programmable logic device, the most powerful and flexible of the main programmable logic devices.
  - VHDL is the hardware description language used
  - Quartus2 is the software used to write and develop VHDL, can minimize functions automagically
  - Modelsim is modeling software to simulate logic circuits

Recitation 3: Subroutines and Logic Design

NIOS2 is little-endian
.text always starts assembly
.end is always at the end
.global _start is the main entry point
.data starts the data section
Make sure to mention TA's poor comunication skills

Lecture 7: VHDL

Damn bastardized child of Ada.

example: $f(s, a, b) = \overbar{s}a + sb$

  library ieee;
  use ieee.std_logic_1164.all;
  entity mux2inputbit is
    part (
      s, a, b : in std_logic;
      result: out std_logic );
  end mux2inputbit;

  architecture implementation1 of mux2inputbit is
  begin
    result <= (not s and a) or (s and b);
  end implementation1;

use ieee and ieee.std_logic_1164.all for the std_logic type
entity defines an interface for a part, including the signals
architecture defines the implementation
case insensitive
double-dach comments

use quartus to:
- compile
- minimize
- rtl view before minimize
- technology map view after minimize
Datatypes
- std_logic: 9 possible values, only use 2: 0 and 1; others are Z, X, W, L, H, -, U
- std_logic_vector: an array of std_logic; (n downto m); V : out std_logic_vector(3 downto 0); V <= "1000"; V(2 downto 1) <= "10"; V(0) <= '1'
Operators
- logic: and, or, not, xor, nand, nor, xnor
- concatenation operator: &

Lecture 8: More VHDL

Combinations circuits, output is a function of the current inputs
- multiplexor: see mux2inputb.vhd
  - 2 data input, 1 bit wide.
  - control input
- multiplexor: m data inputs ($m = 2^k$), $\lg m$ control inputs, n-bit-wide max
  - for 4 data inputs, 1 bit wide, we need 2 control inputs, see mux4input1bit.vhd
  - for 4 data inputs, 2 bit wide, we need 2 control inputs, see mux4input2bit.vhd
Sequential circuits: output depends on both input and the current state
- SR Latch
  - see srlatch.vhd
- Gated SR latch
  - see gatedsrlatch.vhd
- D latch
  - see dlatch.vhd
- D flip-flop
  - D latch triggered only on rising edge of the clock.
  - d latch who's output p is connected to another d latch whos clock is notted (triggered on falling edge)
  - to make above triggered on rising edge, not the clock before the first, and renot on second

Lecture 9: VHDL Sequential Statements

Normally al statements within an architecture are concurrent.
process is concurrent, however the statements within it are sequential.
- select/vhdl conditionals are not allowed, use if or a case statement.

Lecture 10 (First Day Split): Machine-Level Programming 1: X86 Basics

History of processors and archs
C, assembly, machine code
assembly basics: registers, operands, move
intro to x86-64

History

totally dominates the computer market
evolutionary design
- compatible back to the 8086 from 1978
- Added more features as time goes on
Is CISC
- many different instructions with many different formats
  - only a small subset encountered within GNU/Linux programs
- Hard to match the performance of RISC
- But Intel has done it in terms of speed, but less so for power
Milestones
- 8086, 1978
- 386, 1985
- Pentium 4F, 2004
- Core i7, 2008
64 bit means you can have more memory
Multi-core lets you run more at the same time (parallelism), thanks to Moore's law, we can have more cores on the same space
Introduced multimedia operations
AMD recruted DEC's designers and built Opteron and developed x86-64
Intel tried IA64, but ended up building EM64T, nearly identical to x86-64

C, Assembly, Machine Code

Architecture: the ISA. The parts of a processor design that one needs to understand to write assembly code, specification, registers
Microarchitecture: implementation of the architecture, chache sizes, core frequencies
ISAs include x86, Nios, MIPS, SPARC, Power
Generally have Program Counter, Registers, Condition Codes, in memory, you have object code, program data, os data, the stack and possibly a heap

Assembly Basics

C (gcc -S) to assembly (as) to object (ld) to executable
compile, assemble, link
remember that the stack is important, thus there are push and pop instructions
Integers are 1, 2 or 4 bytes, they are data values or untyped pointers (addresses)
floating point data are 4, 8 or 10 bytes
no aggregate like arrays or structures
arith functions
data transfer
control flow
variable length instructions are a PITA
Stacks have %esp for the stacp pointer
follows a similar pattern as NIOS
control flow, call, return, various branching ops
program counter is %eip
stacks are important for languages that support recursion.
code must be reentrant
you must have a place to store all of the relavent data

Lecture 11: More X86

only 6 registers in x86
caller saves – the caller saves the registers before calling
- eax – ax, ah, al – accumulator
- ecx – cx, ch, cl – counter
- edx – dx, dh, dl – data
callee saves – the callee saves the registers before using
- ebx – bx, bh, bl – base
- esi – si – source index
- edi – di – destination index
esp – sp – stack pointer
ebp – bp – base pointer

Moving data

movl source, dest
- imm, reg/mem
- reg, reg/mem
- mem, reg
operand types
- immediate – prefix with '$', integer data of 1, 2 or 4 bytes
- register, one of the 8 registers
- memory (4 consecutive bytes a address given by register), (%eax) or other address modes
Memory Adressing
- Normal, (reg), the register specifies the address
- Displacement n(reg), reg + d is the address
- General n(reg,in,scale) reg + s*in + d is the address

Lecture 12: Still More X86

64 bit has registers r8 to r15, and their half-word variants
-l instructions hav -q instructions also for 8 byte words (quad words)
with the registers, we have enough to pass most arguments in registers rather than on the stack
leal, load effective address, source is mode address expression, set destination to address denoted by expression
arith operations take src, dest, where dest = dest op source
be careful with arith, can be done with atypical functions
condition codes are used to determine equality and the like, these are in cf, zf, sf, of registers
- cd is unsigned overflow
- zf if zero
- sf if negative
- of if signed overflow
- can be set with cmpl/cmpq
  - cf if unsigned carry out
  - zf if a = b
  - sf if a - b < 0
  - of if signed carry out
- Or with testl/testq
  - based on a & b
- can be set with setX instructions
Jumping go based on tthe status of various flags, named in the X of jX

Lecture 13: C Programming, Part 1

using gcc
pointers are used for indirection
ampersand and a variable name give it's address
- is used to deref a pointer
malloc is used to allocate memory from the heap, returns the address of the allocated chunk
- takes an argument to specify the size to allocate
- done at run time

Lecture 14: C Programming, Part 2

  #include <stdio.h>
  #include <stdlib.h>

  #define SIZE 10

  struct node {
    int value;
    struct node *next;
  };

  typedef struct node NODE;

  void display_list(int, NODE *, int);

  int main(void) {
    NODE *head, *current, *prior;
    int i;
    for (i = 0 ; i < SIZE; i++) {
      current = (NODE*) malloc(sizeof(NODE));
      current->value = i*i;
      current->next = NULL;
      if (i == 0)
        head = current;
      else
        prior->next = current;
      prior = current;
    }
    current->next = head;

    display_list(SIZE*2, head, SIZE);
  }

  void display_list(int itr, NODE * mynode, int len)
  {
    int i;
    NODE * current = mynode;
    for(i = 0; i < itr; i++) {
      printf("node[%d] = %d,%p\n", i % len, current->value, current->next);
      current = current->next;
    }
  }

argc is an argument count
arg is an argument vector

  #include <stdio.h>
  #include <stdlib.h>

  struct node {
    int value;
    struct node *next;
  };

  typedef struct node NODE;

  void display_list(int, NODE *, int);

  int main(int argc, char** argv) {
    int i, size;
    NODE *head, *current, *prior;

    if(argc != 2)
      {
        fprintf(stderr, "usage: %s number\n", argv[0]);
        exit(1);
      }

    size = atoi(argv[1]);
  
    for (i = 0 ; i < size; i++) {
      current = (NODE*) malloc(sizeof(NODE));
      current->value = i*i;
      current->next = NULL;
      if (i == 0)
        head = current;
      else
        prior->next = current;
      prior = current;
    }
    current->next = head;

    display_list(size*2, head, size);
  }

  void display_list(int itr, NODE * mynode, int len)
  {
    int i;
    NODE * current = mynode;
    for(i = 0; i < itr; i++) {
      printf("node[%d] = %d,%p\n", i % len, current->value, current->next);
      current = current->next;
    }
  }

  #include <stdio.h>
  #include <stdlib.h>

  struct node {
    int value;
    struct node *next;
  };

  typedef struct node NODE;

  void display_list(int, NODE *, int);

  int main(int argc, char** argv) {
    int size = 0, temp;
    FILE * fptr;
    NODE *head, *current, *prior;

    if(argc != 2)
      {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        exit(1);
      }

    if (!(fptr = fopen(argv[1], "r")))
      {
        fprintf(stderr, "error: %s, file not found\n", argv[1]);
        exit(1);
      }

    while(fscanf(fptr,"%d", &temp) == 1)
      {
        current = (NODE*) malloc(sizeof(NODE));
        current->value = temp;
        current->next = NULL;
        if (size == 0)
          head = current;
        else
          prior->next = current;
        prior = current;
        size++;
      }

    current->next = head;
    fclose(fptr);
    display_list(size*2, head, size);
  }

  void display_list(int itr, NODE * mynode, int len)
  {
    int i;
    NODE * current = mynode;
    for(i = 0; i < itr; i++) {
      printf("node[%d] = %d,%p\n", i % len, current->value, current->next);
      current = current->next;
    }
  }

  #include <stdio.h>
  #include <stdlib.h>

  struct node {
    int value;
    struct node *next;
  };

  typedef struct node NODE;

  void display_list(int, NODE *, int);
  void free_list(int, NODE *);

  int main(int argc, char** argv) {
    int size = 0, temp;
    FILE * fptr;
    NODE *head, *current, *prior;

    if(argc != 2)
      {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        exit(1);
      }

    if (!(fptr = fopen(argv[1], "r")))
      {
        fprintf(stderr, "error: %s, file not found\n", argv[1]);
        exit(1);
      }
  
    while(fscanf(fptr,"%d", &temp) == 1)
      {
        current = (NODE*) malloc(sizeof(NODE));
        current->value = temp;
        current->next = NULL;
        if (size == 0)
          head = current;
        else
          prior->next = current;
        prior = current;
        size++;
      }
  
    current->next = head;
    fclose(fptr);
    display_list(size*2, head, size);
    free_list(size, head);
  }

  void display_list(int itr, NODE * mynode, int len)
  {
    int i;
    NODE * current = mynode;
    for(i = 0; i < itr; i++) {
      printf("node[%d] = %d,%p\n", i % len, current->value, current->next);
      current = current->next;
    }
  }

  void free_list(int size, NODE* mynode)
  {
    NODE *current = mynode, *next;
    next = current->next;
    current->next = NULL;
    current = next;
    while (current->next != NULL) {
      next = current->next;
      printf("Free %d,%p\n", current->value, current->next);
      free(current);
      current = next;
    }
    printf("Free %d,%p\n", current-value, current->next);
    free(current);
  }

Lecture 15: C Programming, Part 3

be careful of endian-ness and dangling references
worms can run by itself, and can propagate a fully woring versionto other computers
virus adds themselves tto other programs. Cannot run independently
return oriented programming involves finding gadgets in the code
load stack with addresses to these gadgets
at the end of each gadget a return is called

Lecture 16: Accessing I/O devices in C

memory mapped I/O
nios timers: 6 16 bit timer registers
- provides information about status, and for control mechanism
- to is a timeout signal set to by the timer when the count is 0, can be reset by writing a 0 to it
- run is to 1 by the timer if counting
- start/stop are used to start/suspend the timer by setting the relevant bit to 1
- managed by simply setting the various registers/addresses to specific values
Be careful of pointer types and thus type sizes
Be careful of sign extension

Lecture 17: Interrupt Handling

Single CPU, Multiple Tasks
interrupt handling requires all registers and context to be saved
exceptions are a transfer of control away from the normal program. These are software exceptions or hardware interrupts
threading allows for a simulation of multiple processes
threads are used for specific tasks
multithreading happens with scheduling techniques

Lecture 18: Hardware IO with the DE1

Final Project due the 20th
Midterm on the 6th over intel assembly and C programming
Use eclipse todeal with the mechanics
must open, install interrupt handler, and can then read/write to the device
use the HAL for the interface, can use keys 1-3 and ledg2-8
must set interrupt mask

Lecture 19: ABIs and Memory Layout

an ABI specifies
- how data is arranged in memory
- supported types and sizes
- behavior and structure of the stack
- function calling conventions
- nios2: https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/nios2/n2cpu_nii51016.pdf
- Will specify whether numbers are straight binary or two's complement
- needed for
  - developing embedded systems
  - developing runtime systems
  - explaining runtime behaviors
Memory Layout
- On linux, everything gets 4 gb, layed out in various ways
- understanding it will help you find errors

Recitation Notes

Buliding a vending machine

Lecture 20: The NIOS2 Datapath

document avaliable on line
3 types of instructions in NIOS
- R: rsrc1 (31:27), rsrc2 (26:22), rdst(21:17), opx (16:6), op code (5:0)
- I: rsrc (31:27), rdst (26:22), immediate operand (21:6), op code (5:0)
- J: address (31:6), 0(5:0)
  - involves dark voodo magic
MIPS also has R, I, J and they are similar, but layed out slightly differently
data paths are a tad confusing, but make sense if you examine them carefully
this is von Neumann
classical model
- Steps
  - fetch
  - decode
  - read registers (if not call or j-type)
  - execute (most use the ALU, but in different ways)
  - write back to either registers or memory
- Must be completed in one cycle
- implications:
  - constraint: need separate memories for instructions and data (Harvard architecture)
  - choice: No caches are needed (to simplify design)
  - choice: register file has 2 output ports (to speed up reading 2 operand registers) and one input port
  - choice: programmable alu instead of separate functional units
ALUs must be controlled, in many cases, this is done with a nybble to choose the function

Lecture 21: Memory Hierarchy

Overview
- Storage technologies and trends – pyramidal
- Locality of reference
- caching in the memory hierarchy
Tech and Trends
- Registers
  - within processors
- RAM
  - traditionally packaged as a chip
  - basic unit is a one-bit cell
  - multiple chips form the memory
  - static ram
    - 4 or 6 transistor circuit, keeps value as long as it's powered
    - insensitive to noise and radiation
    - faster and moe expensive than dram
  - dynamic ram
    - stored with capacitor and transistor for access
    - value must be refreshed every 10-100ms
    - more sensitive to disturbances
    - slower and cheaper than sram
    - dxw organization
    - access is by row and then column
    - double data-rate synchronous dram (ddr sdram)
- Non-volatile
  - types
    
    rom
    
    programmed during production
    
    prom
    
    can be programmed once
    
    eprom
    
    can be bulk erased using uv or xray
    
    eeprom
    
    electronic erase capability
    
    flash memory
    
    eeproms with partial erase capability, wears out after about 100000 erasings
  - used for:
    - firmware programes
    - solid state disks
    - disk caches
- Disk Drives
  - very mechanical
  - each platter is double sided, and has a head for each side
  - uses one of several interfaces
  - organized in tracks (concentric circles) and sectors (chords of each track)
  - aligned tracks form a cylinder
  - recording density is bits/inch
  - track density is tracks/inch
  - areal density bits/sq.inch
  - bytes/sector * sectors/track * tracks/surface * surfaces/platter * platters/disk
  - access = seektime + rotationtime + transfertime
  - logical blocks are maped to physical sectors by hardware/firmware
- SSDs
  - are flash, take a long time to write
  - but are very fast for reads
  - generally have wear-leveling logic
busses
- system bus
- memory bus
- io bridge interfaces memory bus and system bus
speeds have become dramatically faster, while storage hasn't caught up
locality of reference – programs tend to use data and instructions with addresses near or equal to those they have used recently
- temporal – think loops, like a loop counter
- spatial locality – nearby addresses tend to be referenced close together in time

Lecture 22: Pipelining

steps
1. fetch (f)
2. decode/read (r)
3. ALU/pc adder (c)
4. memory access (m)
5. write (w)
single-cycle
- takes single cycle for frcmw
multi-cycle
- takes a cycle for each step of frcmw
pipelining lets the process for a new instruction start at each pulse, effectively being both single cycle and multicycle simltaneously
- has hazards: situations that delay or discard an instruction in the pipeline, we say that the pipeline is 'stalled', or that there is a 'bubble' in the pipeline
  structure hazards
  multiple instructions using the same component at the same time
  - deleteing memory
    
    when two instructions must read memory at the same time
    
    pipeline stalling, delay instructions by a clock cycle, staggering, simple, but inefficient
    
    component duplication for to code and normal memory
    
    single memory, duplicated i/o
  - due to pc adder
    
    when changing pc address due to branching
    
    multiple pc adders
  data hazards
  data dependency between two instructions
  - when registers are accessed at the same time
    
    pipeline stalling, is very, very inefficient
    
    operand forwarding, don't delay second instruction, but instead send the values of the relevant registers to the next instruction in the pipeline, much faster
    
    software handling, puts nops detween relevant operations
  control hazard
  
  control transfer instruction

Lecture 23: Caching in the Memory Hierarchy

levels:
- registers
- l1 cache
- l2 cache
- main memory
- secondary storage (disks, solid state)
- remote storage (tapes, distributed file systems, the web)
caches provide a smaller, faster storage device to act as a staging area for subset of data in a larger, slower device
these are used because of locality
three types of misses
- cold (start up, always happens)
- conflict miss (maps two potential blocks to the same place)
- capacity miss (when active cache blocks is larger than the cache)
small and sram based, generally 2 levels
entries in caches have valid and tag
cache addresses have 3 parts

block offset

4 bits

set index

3 bits

tag

remainder
addresses have s, b, tag, and if valid bit is set, you can use that cache line
where there is only one line per set, this is a direct mapped cache
- when no match, then the old line must be replaced
- on miss, grap from memory, cpu only accesses data from the cache
set associative, two lines per set
- allows for two different group of info in the cache

Lecture 24: More Caching

review of last lecture
E-way set assosiactive cashe lets you have E lines per set
writes can have problems when multiple copies exist
- L1, L2, Ln
- Main memory
- disk
- other
On write hit

write-through

write straight to memory

write-back

defer writing until line relpacement, needs a dirty bit
On write-miss

write-allocate

load into cache, update line in cache

no-write-allocate

writes immediately to memory
typicall
- write-through, write-allocate
- write-back, no-write-allocate (most common)
metrics
- miss rate
- hit time
- miss penalty
huge numbers game
to write friendly code, make the common case go fast
minimize the misses in inner loops
- repeated references to variables are good
- stride-1 reference patterns are good

Virtual Memory

Address Range may be up to a specific number, but only a certain portion is available
virtual memory lets you swap pages of memory in and out of slower devices
pages are the primary unit, and don't always have to be in memory
chunks of a process called pages, and chunks of memory are frames
the OS maintains a page table for each process, contains frame location for each page, address consists of a page number and offset within the page

Lecture 25: More Memory Management and Paging

taking process and splitting memory up into bits that can be swapped in and out of memory, stored on disk temporarily if necessary
this is done with virtual memory, locality and the like and managed by the operating system, sometimes with the help of an mmu
page is a virtual address, frame is a physical address
each page can go in any free frame
the page table is used to manage where each page is located, either in a frame or in the swap space
then addresses have a logical address, the first n bits is a page number, the remaining is an offset
- the page number is replaced with the address start from the page table
replacement policy
- deals with the selection of pages to replace and when new pages must be brought in
- the more elaborate the policy, the greater the hardware and software overhead to implement it
- policies
  
  optimal
  
  selects the page for which the next reference is the longest (may not be known ahead of time), produces three page faults after the frame allocation has been filled
  
  least-recently-used
  
  the one that hasn't been used most recently is replaced, many faults can be caused. This is very difficult to ipmlement, as most implementations require a great deal of overhead
  
  first-in-first-out
  
  exactly as it sounds
translation lookaside buffer – a type of cache, caching the page table to make access to memory easier
- page number and frame number pairs are cached
- virtual access is checked, if tlb has it, get physical address, if not, check page table and then get physical address, unless non-address value is returned
page faults:
- disk to memory
- update page table
- update tlb
- get physical address
converting physicals to cache blocks:
- convert VA to PA, check l1 cache, if available, get block, otherwise, update cache
- in the event of tlb hit, no page fault is possible