=====================================================
Introduction to the Genode port for Xilinx MicroBlaze
=====================================================


                         Norman Feske

                         Martin Stein


This file gives an overview of the Genode port for MicroBlaze-based
platforms. For a quick introduction to building and running Genode on such
platforms, please refer to:

!

Xilinx MicroBlaze is a so-called softcore CPU, which is commonly used as
part of FPGA-based system-on-chip designs. At Genode Labs, we regularly use
this IP core, in particular for our Genode FPGA Graphics Project, which is a
GUI software stack and a set of IP cores for implementing fully-fledged
windowed GUIs on FPGAs:

:Website of the Genode FPGA Graphics Project:

  [http://genode-labs.com/products/fpga-graphics]

Ever since we first released the Genode FPGA Graphics Project, we envisioned
combining it with the Genode OS Framework. In spring 2010, Martin Stein
joined our team at Genode Labs and accepted the challenge to bring the
Genode OS Framework to the realm of FPGA-based SoCs. Technically, this
implies porting the framework to the MicroBlaze CPU architecture.

In contrast to most softcore CPUs such as the popular Lattice Mico32, the
MicroBlaze features an MMU, which is a fundamental requirement for
implementing a microkernel-based system. Architecture-wise, MicroBlaze is a
RISC CPU similar to MIPS. Many system parameters of the CPU (caches, certain
arithmetic and shift instructions) can be configured when the SoC is
synthesized.

We found that the relatively simple architecture of this CPU provides a
perfect playground for pursuing some of our ideas about kernel design that
go beyond the scope of current microkernels. So instead of adding MicroBlaze
support to one of the existing microkernels already supported by Genode, we
went for a new kernel design. Deviating from the typical microkernel, which
is a self-sufficient program running in kernel mode that executes user-level
processes on top, our design regards the kernel as a part of Genode's core.
It is not a separate program but a library that implements the glue between
user-level core and the raw CPU. Specifically, it provides the entrypoint
for hardware exceptions, a thread scheduler, an IPC mechanism, and functions
to manipulate virtual address spaces (loading and flushing entries of the
CPU's software-loaded TLB). It does not manage any physical memory resources
or the relationship between processes. This is the job of core.

From the kernel-developer's point of view, the kernel part can be summarized
as follows:

* The kernel provides user-level threads that are scheduled in a round-robin
  fashion.

* Threads can communicate via synchronous IPC.

* There is a mechanism for blocking and waking up threads. This mechanism
  can be used by Genode to implement locking as well as asynchronous
  inter-process communication.

* There is a single kernel thread, which never blocks in the kernel code
  paths. So the kernel acts as a state machine and there is no concurrency
  in the execution paths traversed in kernel mode, vastly simplifying these
  code parts. Because the kernel cannot be preempted, the length of its code
  paths determines the worst-case interrupt latency. However, all code paths
  are extremely short and bounded with regard to execution time. Hence, we
  expect the impact on interrupt latencies to be low.

* The IPC operation transfers payload between UTCBs only. Each thread has a
  so-called user-level thread control block, which is mapped transparently
  by the kernel. Because of this mapping, user-level page faults cannot
  occur during IPC transfers.

* There is no mapping database. Virtual address spaces are manipulated by
  loading and flushing physical TLB entries. There is no caching of mappings
  done in the kernel. All higher-level information about the
  interrelationship of memory and processes is managed by the user-level
  core.

* Core runs in user mode, mapped 1-to-1 to the physical address space except
  for its virtual thread-context area.

* The kernel paths are executed in the physical address space because, on
  MicroBlaze, address translation is disabled when entering kernel mode.
  Because both kernel code and user-level core code observe the same
  address-space layout, both worlds appear to run within a single address
  space.

* User processes can use the entire virtual address space (4G) except for a
  helper page for invoking syscalls and a page containing atomic operations.
  There is no reservation used for the kernel.

* The MicroBlaze architecture lacks an atomic compare-and-swap instruction.
  At user level, this functionality is emulated via delayed preemption. A
  kernel-provided page holds the sequences of operations to be executed
  atomically, and the kernel prevents (actually delays) the preemption of a
  thread that is currently executing instructions at that page (see the
  sketch after this list).

* The MicroBlaze MMU supports several different page sizes (1K up to 16MB).
  Genode fully supports this feature for page sizes >= 4K. This way, the TLB
  footprint can be minimized by choosing sensible alignments of memory
  objects.
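
To illustrate the delayed-preemption mechanism, the following is a minimal
sketch of how the kernel's preemption path could detect that the interrupted
thread executes within the atomic-operations page and postpone the context
switch. All names ('ATOMIC_OPS_PAGE', 'Thread', 'schedule_next_thread') are
assumptions made for illustration, not the actual 'base-mb' interface:

! /* hypothetical location and size of the kernel-provided atomic-ops page */
! static const unsigned long ATOMIC_OPS_PAGE = 0xfffff000;
! static const unsigned long PAGE_SIZE       = 0x1000;
!
! struct Thread
! {
!     unsigned long pc;                 /* program counter saved on kernel entry */
!     bool          preemption_pending; /* set when preemption had to be delayed */
! };
!
! /* hypothetical scheduler entry point */
! void schedule_next_thread() { /* pick the next ready thread */ }
!
! static bool in_atomic_ops_page(unsigned long pc)
! {
!     /* compare the page base of 'pc' against the atomic-ops page */
!     return (pc & ~(PAGE_SIZE - 1)) == ATOMIC_OPS_PAGE;
! }
!
! /* called from the timer interrupt when the current thread would be preempted */
! void handle_preemption(Thread *current)
! {
!     if (in_atomic_ops_page(current->pc)) {
!
!         /* let the thread finish its atomic sequence first */
!         current->preemption_pending = true;
!         return; /* resume 'current' instead of switching threads */
!     }
!     schedule_next_thread();
! }

The 'preemption_pending' flag could then be evaluated as soon as the thread
leaves the page, for instance on the next kernel entry, so that the delayed
context switch is merely postponed, never lost.
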
Current state
=============

The MicroBlaze platform support resides in the 'base-mb' repository. At the
current stage, core is able to successfully start multiple nested instances
of the init process. Most of the critical kernel functionality is working.
This includes inter-process communication, address-space creation,
multi-threading, thread synchronization, page-fault handling, and TLB
eviction.

The nested init scenario runs on Qemu, emulating the Petalogix Spartan 3A
DSP1800 design, as well as on real hardware, tested with the Xilinx Spartan
3A Starter Kit configured with an appropriate MicroBlaze SoC.

This simple scenario already illustrates the vast advantage of using the
different page sizes supported by the MicroBlaze CPU. When using 4KB pages
only, a scenario with three nested init processes produces more than 300,000
page faults. The pressure on the TLB, which holds only 64 entries, is
extremely high. Those entries are constantly evicted so that thrashing
effects are likely to occur. By making use of flexible page sizes (4K, 16K,
64K, 256K, 1M, 4M, 16M), the number of page faults drops to only 1,800,
speeding up the boot time by a factor of 10 (a sketch of such a mapping
strategy is given at the end of this document). On real hardware, execution
speed could be increased significantly further by turning on the instruction
and data caches. However, this feature has not been tested yet.

Beyond the requirements of the nested init scenario, the kernel provides the
allocation, handling, and deallocation of IRQs to the userland, enabling
core to offer IRQ and IO-memory session services. This allows for custom
device-driver implementations within the userland.

Currently, there is no restriction of IPC communication rights. Threads are
addressed using their global thread IDs (in fact, using their respective
indices in the KTCB array). For the future, we plan to add capability-based
delegation of communication rights.
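
As an illustration of the flexible-page-size mapping referred to above, the
following sketch shows a greedy strategy that covers a region with the
largest page sizes its alignment permits. The names 'map_region' and
'load_tlb_entry' are assumptions for illustration; the actual core code may
proceed differently:

! /* supported page sizes, given as log2 values: 16M down to 4K */
! static const unsigned sizes_log2[] = { 24, 22, 20, 18, 16, 14, 12 };
!
! /* hypothetical hook that would program one TLB entry of the given size */
! void load_tlb_entry(unsigned long virt, unsigned long phys,
!                     unsigned size_log2)
! { /* stub, a real implementation would write the TLB registers here */ }
!
! /*
!  * Map a physical to a virtual region, greedily using the largest page
!  * size that fits alignment and remaining size. Assumes that 'virt',
!  * 'phys', and 'size' are multiples of 4K, so the loop always advances.
!  */
! void map_region(unsigned long virt, unsigned long phys, unsigned long size)
! {
!     while (size) {
!         for (unsigned i = 0; i < sizeof(sizes_log2)/sizeof(sizes_log2[0]); i++) {
!             unsigned long page_size = 1UL << sizes_log2[i];
!
!             /* both base addresses must be aligned to the page size */
!             if (((virt | phys) & (page_size - 1)) == 0 && size >= page_size) {
!                 load_tlb_entry(virt, phys, sizes_log2[i]);
!                 virt += page_size; phys += page_size; size -= page_size;
!                 break;
!             }
!         }
!     }
! }

For example, a 16M-aligned 16M region then consumes a single TLB entry
instead of the 4,096 entries needed with 4K pages, which illustrates the
reduced TLB pressure behind the page-fault numbers quoted above.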