Tuesday, June 11, 2019

Early Boot is a Lie

Everything you know about early boot is a lie. The classic introduction to IBM PC compatible (x86, x64) startup sequence goes as follows. Read-only memory (Boot ROM) loads the Master Boot Record (MBR) from the first Peripheral Component Interconnect (PCI) Device. MBR loads the rest of the Operating System (OS). Everyone should write an MBR bootloader (it's fun!), but there is a glaring constraint that is a hold-over from the old days, which is that you have 512 bytes (about 400 available to use for machine code). Here is an example MBR bootloader that closely resembles the "Standard MBR", traditionally written in assembly, but for clarity we have rewritten it in pesudo-Rust.
// Oversimplified Legacy Standard MBR
pub unsafe fn mbr_main() {
    const SECTOR: usize = 0x200;

    // Pointer to Master Boot Record (MBR)
    // This is empty before shadowing, MBR after shadowing.
    let mut mbr = Vec::from_raw_parts(
        0x0600 as *mut u8, SECTOR, SECTOR);

    // Pointer to Volume Boot Record (VBR)
    // This is the MBR before shadowing, VBR after shadowing.
    let mut vbr = Vec::from_raw_parts(
        0x7c00 as *mut u8, SECTOR, SECTOR);

    // Chainload/Underlay/Shadow Master Boot Record (MBR)
    // This program started running from 0x7c00, but the
    // instruction pointer has moved since then, so we need to 
    // jump to here + 0x0600 - 0x7c00, represented by asm_shadow!().
    ptr::copy(vbr.as_ptr(), mbr.as_mut_ptr(), SECTOR);
    asm_shadow!(mbr.as_ptr() as usize - vbr.as_ptr() as usize);

    // Find first bootable partition entry (oversimplified!)
    let bootable = MbrTable::from(mbr).partitions[0];

    // Chainload/Overlay/Fetch Volume Boot Record (VBR)
    csm::ChsReader::from(bootable).read(vbr.as_mut_ptr());

    // Tail call the Volume Boot Record (VBR)
    let vbr_main = vbr.as_ptr() as *const Fn();
    vbr_main();
    unreachable!();
}
The first thing to realize is that since this is the first 512 bytes on the hard drive, that people have written many versions of this. There are MBR bootloaders found in Windows, macOS, Linux, SYSLINUX, GRUB, CoreBoot, SeaBios, EDK2, and many others. There are far too many versions of this to discuss here, and so I will move on to more interesting things. Let's go in the other direction. What loads the MBR? What jumps to the MBR? Enter the reset vector. The code that goes in the the reset vector is called the Volume Top File (VTF). The reset vector is physical address 0xfffffff0 (-16), which has a jump to physical address 0x00007c00 (31744). Sounds like a dream. Here is some equivalent pseudo-Rust:
// Oversimplified Boot Device Selection (BDS)
pub unsafe fn bds_main() {
    const SECTOR: usize = 0x200;

    // Pointer to Master Boot Record (MBR)
    let mbr = Vec::from_raw_parts(
        0x7c00 as *mut u8, SECTOR, SECTOR);

    // Load first sector from first hard drive
    csm::ChsReader::default().read(mbr.as_mut_slice());

    // Tail call the Master Boot Record (MBR)
    let mbr_main = mbr.as_ptr() as *const Fn();
    mbr_main();
    unreachable!();
}
I put my compiled binary code in the first sector of the hard drive, which gets loaded at 0x7c00, and I never have to worry about anything else. Perhaps if you've heard about the Power-On Self-Test (POST), then you may say to yourself, even if there are processes before the MBR, then I don't have to worry about them because they're in ROM, and therefore read-only.

That's not true.

The 16 bytes at the top of memory is where all of UEFI takes place. How is this even possible? UEFI is beyond the scope of this article, but to overview, here is some psuedo-Rust:

pub unsafe fn vtf_main() {
    // There is only enough room for a jump, so imagine 
    // that this is what the top of memory jumps to.

    // Security (SEC)
    sec_main();

    // Pre-EFI Initialization (PEI)
    pei_main();

    // Driver eXecution Environment (DXE)
    dxe_main(); 

    // Tail call Boot Device Selection (BDS)
    bds_main(); 
    unreachable!();
}
If we had the source code to the final stages of the Boot ROM, then it might look something like this pesudo-Rust:
pub unsafe fn rom_main() {
    const SECTOR: usize = 0x200;

    // Pointer to Volume Top File (VTF)
    let vtf = Vec::from_raw_parts(
        0xfffff000 as *mut u8, 8*SECTOR, 8*SECTOR);

    // Copy 8 sectors from SPI flash to memory
    rom::SpiReader::default().read(vtf.as_mut_slice());

    // Tail call the Volume Top File (VTF)
    let vtf_main = 0xfffffff0 as *const Fn();
    vtf_main();
    unreachable!();
}
That seems a bit unfair. The modern UEFI runtime environment gets 8 sectors (about 4 KiB), and the legacy MBR environment gets 1 sector (512 bytes? If I were a bootloader, I would ask for a refund. So let us overview the process:
  • rom_main(), presumably still read-only...
  • vtf_main(), writable (HW)
  • sec_main(), writable (HW)
  • pei_main(), writable (HW, or PEI modules)
  • dxe_main(), writable (HW, or DXE modules)
  • bds_main(), writable (SW, or NVRAM)
  • mbr_main(), writable (SW)
  • vbr_main(), writable (SW)
  • boot_loader()
  • oper_system()
We have certainly skipped a few steps, for example, the Compatibility Support Module (CSM), System Management Mode (SMM), and the Trusted Platform Module (TPM). These three are all critical parts of the startup process, but are too complex to get into detail here. There really is no such thing as "Early Boot" anymore, because what we think of as early, is very late in the process now. Suffice it to say that there is an entire ecosystem of companies and hackers trying to squeeze ever more out of those 16 bytes at the top of memory.

Tuesday, January 27, 2015

Hardware accelerated SHA-1

There are many articles out there describing hardware accelerated AES, which has been out for a few years now, but SHA instructions are not out yet, but should be appearing in ARM and x86 instruction sets sometime this year. So I figure it's time to start talking about them now. The first and formost standard for the SHA-1 algorithm is NIST's current FIPS-180 document, specifically section 6.1.2 "SHA-1 Hash Computation" part 3, which says:

T = ROTL5(a) + ft(b, c, d) + e + Kt + Wt

This formula (which I call the T-formula) is probably the most computationally intensive part of SHA-1, and one of the most surprising things about my research into the ARM NEON and x86 SHA instruction sets is that neither of them implement this formula exactly as is.

SHA-1 Instructions

ARMx86Comment
sha1h aRotate left 30
sha1su0 a, b, csha1msg1 a, bMessage schedule part 1
sha1su1 a, bsha1msg2 a, bMessage schedule part 2
sha1c hash, e, msgsha1rnds4 hash, msg, 0Rounds 0 to 20
sha1p hash, e, msgsha1rnds4 hash, msg, 1Rounds 20 to 40
sha1m hash, e, msgsha1rnds4 hash, msg, 2Rounds 40 to 60
sha1p hash, e, msgsha1rnds4 hash, msg, 3Rounds 60 to 80
sha1nexte hash, msgInvisible e operand

As you can see, ARM and x86 are a bit different. From my research, it seems that ARM omits "Kt" from the T-formula, and x86 omits "e" from the T-formula. These implementation choices imply that SHA-1 optimizations on each architecture will use these instructions very differently. x86 has a publication documenting the SHA instruction set extensions, and if there are any conflicts with this article, then the final word is there. X86 omits "e" in the T-formula, so it must be calculated with the sha1nexte instruction. In pseudo-code, a representative sample of 4 SHA-1 rounds on x86 can be computed by:

work18 = sha1msg2(sha1msg1(work14, work15) ^ work16, work17);
hash19 = sha1rnds4(hash18, sha1nexte(hash17, work18), 3);

ARM on the other hand, has absolutely no documentation about their SHA instruction set, except how to determine if the processor supports it. While this is handy, it does very little to describe the inner workings of each instruction. So the following is just a guess on my part, but given that the inputs to (sha1c, sha1p, and sha1m) are two vectors, there is enough information to compute 4 rounds of SHA-1, just like the x86 sha1rnds4 instruction. ARM omits "Kt" in the T-formula, so it must be added to the work vector. In pseudo-code, a representative sample of 4 SHA-1 rounds on ARM can be computed by:

work18 = sha1su1(sha1su0(work14, work15, work16), work17);
hash19 = sha1p(hash18, sha1h(hash17[0]), work18 + SHA1_K3V);

where SHA1_K3V is a vector of 4 uint32's all of which are SHA1_K3.

For more information see also: sha1.rs

Monday, November 24, 2014

Programming Struggles

Software is hard. Lots of programmers out there are constantly searching for new solutions to old problems, and sometimes they find something that works well. It seems that many recent languages (e.g. Dart, Scala, Groovy, Vala, Hack, Swift) are simply syntax sugar for existing programming languages. Are we ever going to make any progress? I know this question has been rehashed a million times, but I'm going to ask it again anyway.

What makes a good programming language?

  1. Algebraic Data Types - includes sum types (open and closed), and product types (open and closed); too many languages miss these.
  2. Generics, Parameterized Types - includes arrays, vectors, lists, hash tables, trees, parsers, I/O streams, pointers
  3. Reflection, Metaprogramming - includes macros, annotations, doc parsers, syntax highlighting, coverage tools, static analysis tools, serialization, etc.
  4. Memory Pool Management - this is different from memory management, but it refers to the ability to manage the pools themselves, such as the ability to switch between Gc (garbage collected) and Rc (reference counted) memory pool models.

There are probably countless other conditions that I can't think of right now, but if you can think of them for me and leave your comments below, perhaps I will make this article longer.

Algebraic Data Types

What is an algebraic data type? Fundamentally it is a way of combining existing types. For two types, there are two fundamental options for a new type, the new type can store both types (ProductType), or either type (SumType). However, it can get more complicated than that. The way that algebraic data types are implemented in most languages differ by how extensible the resulting types are. I call them open if they can be extended after definition, and closed if they cannot. So to give some examples from existing languages:

  1. closed product types - usually called structures, classes, records, tuples, etc.
  2. closed sum types - usually called enumerations, or tagged unions. Most languages get these wrong because C got them wrong by making them equivalent to integers (which are an example of an open sum type).
  3. open product types - usually called arrays, vectors, or lists. These can always store more data by appending. Adding fields in a subclass can also be an example of this kind of construction.
  4. open sum types - usually called objects, dynamics, variants, etc. These can always be extended by inheritance or by subclassing this type.

There are usually too many ways of performing one of these type definitions. For example Haskell has a Dynamic type, which can be used as a variant type for every type in the language, but it also provides type classes, which also allow open sum type definition.

In Java, and most other OOP programming languages, classes perform most of the above kinds of definitions. Final class fields are an example of closed product type definition. Abstract classes can be used for open sum type definition. Class fields added to a subclass that inherits from a superclass is an example of open product type definition.

Generics

Common generic types include Optional (Maybe in Haskell), List, Array, Pointer, GcPointer (Gc in Rust, ^ in C++/CLI), RcPointer (Rc in Rust, shared_ptr in C++), Box (Box in Rust, unique_ptr in C++), HashTable, etc. Some language don't have generics, but end up expressing concepts with special syntax, for example, * for pointers (in C and Go), and [ ] for arrays (in C and Go).

Reflection

One prerequisite for reflection is that the language must expose the parser used in the compiler (or interpreter, although I cannot think of any language that still uses an interpreter these days). If it does not expose the same parser the compiler uses, then the language must be easy to parse. Too many languages miss this. If people are going to write tools for your language (and they will), then they might not use a fancy parsing framework to help, they might only look for documentation strings or lines that start with an @ symbol. C/C++ is one of the most difficult languages to parse, and yet it makes up most desktop software in use today.

Memory Pool Management

Most languages pick one memory pool model and stick with it. C does not provide any, except for malloc/free, and so reference counting (Rc) emerged as a common solution to this problem. Many scripting languages have opted for garbage collection (Gc) for every single object, which makes it impossible to opt-out. For a general-purpose programming language, it is unacceptable to require that users are forced into one model or another.

Conclusion

Rust has all of these features.

Update: Since the time of this writing, #rust has since informed the author that Gc has been removed from the roadmap, and the standard library. Gc was also not implemented as a garbage-collected pointer, it was more of an auto-reference-counted pointer type.

Saturday, June 8, 2013

Rust wc

So in my attempt to practice writing POSIX command-line tools in Rust, the next tool I tried to rewrite was wc (or word count). This is a simple tool that counts 4 things:

  • lines (separated by U+0D, LINE FEED)
  • words (separated by any whitespace)
  • characters (multibyte encoded, probably UTF-8)
  • bytes

In the interest of brevity, I'm going to refer the reader to this link for the source code. One thing I learned is that the current std::getopts library doesn't have many features, e.g., it doesn't link short options and long options, so you have to check each separately. While I was writing this, some people in #rust on irc.mozilla.org suggested that I look at docopt which has not been ported to Rust yet, still mostly a Python library, but if ported would make a great addition to the language.

Saturday, June 1, 2013

Rust echo

I recently found the Rust programming language, and I've been having a blast of a time learning it. So far, it seems like a mixture of 50 different languages, most notably C++ and Haskell.

To help myself learn it, I decided a good place to start was with command-line utilities, the most basic of which are defined by POSIX. POSIX echo does not have any parameters, it just prints all of its arguments back to the user. BSD echo takes one parameter '-n' and if it's not given, then it prints a newline after all arguments are printed.

This seemed like a good example to learn a language so this was my first attempt:

fn main() {
    let mut newline = true;
    let mut arguments = os::args();
    arguments.shift();
    if arguments[0] == ~"-n" {
        arguments.shift();
        newline = false;
    }
    for arguments.eachi |argi, &argument| {
        print(argument);
        if argi == arguments.len() - 1 {
            if newline {
                print("\n");
            }
        } else {
            print(" ");
        }
    }
}

I talked about this on #rust at irc.mozilla.org and bjz and huonw recommended:

fn main() {
    let args = os::args();
    match args.tail() {
        [~"-n",..strs] => print(str::connect(strs, " ")),
        strs => println(str::connect(strs, " ")),
    }
}

which had some issues with it, so I refactored it to be more functional:

fn unwords(args: &[~str]) -> ~str {
    return str::connect(args, " ");
}
 
fn echo(args: &[~str]) {
    match args.tail() {
        [~"-n", ..strs] => print(unwords(strs)),
        strs => println(unwords(strs)),
    }
}
 
fn main() {
    echo(os::args());
}

which defined a Haskell-like unwords function to help. I am quite pleased with how supportive Rust is to approaching the problem from a procedural paradigm or a functional paradigm. Hooray for multi-paradigm programming!

Review of previous posts

In this post, I'm going to be reviewing my old posts with what I've learned over the past few years. The purpose behind this analysis is to consolidate my ideas into fewer ideas and hopefully come up with some new insights along the way.

This was my first post about XML entity references. The goal was to restrict the DTD grammar to only entities so facilitate usage in other applications that have no concern about elements and attribute lists. Now I would recommend using only DTD since its usage is widespread. I was considering folding this into a macro system I was planning to write, so that one could replace © with © just as easily as it could replace sqrt(4) with 2. I've come to the conclusion that this post was premature, and needs to be included in a more detailed discussion about macro systems, and templating systems such as M4, Genshi, Mustache, and many others.

This was an example of binary parsing, which is still not possible in many languages, but a topic more suited to C structs. GCC has a large number of attributes and packing parameters that accomplish some of the goals of binary parsing, but I still haven't found a good binary parsing framework to date.

This was a bunch of things trying to make Go something that it wasn't. I have since found the Rust programming language, which has all the features I was looking for in this post and more. So, now I would recommend using Rust.

These posts were just crazy comments.

The following posts were all about RDF:

This was a comment primarily intended to allow for arbitrary Common Lisp lists to be represented in RDF.

This was a comment as well, and I still haven't found a datatype URI for this, perhaps something for a future article.

This is interesting because there isn't an XSD datatype associated with large finite-size integers, although you could define it as xsd:integer, then restrict the domain until it matches int128_t. I have since defined a URI for this on my other website.

This was fun. I have since written about these algorithms on my other website.

This is a type of post I'd like to write more of. You know, explaining the pros and cons of certain implementation techniques and why one is better or worse than the other.

This was attempting to illustrate the inconsistencies that I found between different XML standards.

This and this were comments about The Web. I would like to add that I've started to use unqualified identifiers to refer to RDF and OWL core concepts in my private notes:

  • Type = rdfs:Class
    • ObjectType = owl:Class
    • DataType = rdfs:Datatype
  • Property = rdf:Property
    • ObjectProperty = owl:ObjectProperty
    • DataProperty = owl:DataProperty

The following posts were unintentionally about Data URIs:

This was attempting to solve a problem that Data URIs already solve. Now I recommend using Data URIs instead.

This was a good comment, but I don't recommend these namespaces anymore. Now I recommend using Data URIs instead, for example:

<data:text/x-scheme;version=R5RS,vector-length>

This mentions a problem regarding xsd:hexBinary and xsd:base64Binary, which can be solved in several ways:

  1. Candle assigned the type identifier "binary" to the combined value space.
  2. The MediaType "application/octet-stream" is naturally isomorphic to the binary datatype.
  3. The Data URI <data:application/octet-stream,DATA> may be used to represent data of type binary.
  4. The URI
    <http://www.iana.org/assignments/media-types/application/octet-stream>
    may also be used to represent the binary datatype itself.

Sunday, July 29, 2012

On binary search

There are many articles about the pedagogical benefits of writing a binary search algorithm. Instead of just implementing binary search, I wanted to take it one step furtuer. Is is possible to write a standards compliant, formally correct, and beautiful implementation of bsearch() in C?

// This type simplifies the 
// argument list of bsearch.
typedef int (*compare_f)(
  const void *x,
  const void *y);
void *
bsearch(
  const void *key,
  const void *base,
  size_t nel,
  size_t width,
  compare_f compar)
{
  size_t k;
  size_t lo = 0;
  size_t hi = nel - 1;
  const void *mid;

  while (lo <= hi) {
    // If we used (lo + hi)/2, then it
    // would overflow for large integers. 
    // The following formula prevents that.
    k = lo + (hi - lo)/2;
    mid = base + k*width;

    // If we used compar(mid, key), then it 
    // would violate POSIX, which specifies
    // that key must be the first argument.
    int cmp = compar(key, mid);
    if (cmp < 0) hi = k - 1;
    if (cmp > 0) lo = k + 1;
    if (cmp == 0)
      return (void *)mid;
  }

  return NULL;
}

Note that not every argument and variable have been colorized, just the ones that usually give people the hardest time when reasoning about this function. The average (k) can be reduced to (lo + hi)/2 if the array you are handling is less than 263 elements long, but since it isn't technically correct, I didn't write it that way.