===============================================
Release notes for the Genode OS Framework 13.08
===============================================
Genode Labs
The release of version 13.08 marks the 5th anniversary of the Genode OS
framework. We celebrate this anniversary with the addition of three major
features that we have much longed for, namely the port of Qt5 to Genode,
profound multi-processor support, and a light-weight event tracing
framework. Additionally, the new version comes with new device drivers for
SATA 3.0 and power management for the Exynos-5 SoC, improved virtualization
support on NOVA on x86, updated kernels, and integrity checks for
downloaded 3rd-party source code.
Over the course of the past five years, Genode's development was primarily
motivated by adding and cultivating features to make the framework fit for as
many application areas as possible. Now that we have a critical mass of
features, the focus on mere functionality does not suffice anymore. The
question of what Genode can do ultimately turns into the question of how well
Genode can do something: How stable is a certain workload? How does networking
perform? How does it scale to multi-processor systems? Because we are lacking
concise answers to these kind of questions, we have to investigate.
When talking about stability, our recently introduced automated testing
infrastructure makes us more confident than ever. Each night, over 200
automated tests are performed, covering various kernels and several hardware
platforms. All those tests are publicly available in the form of so-called run
scripts and are under continues development.
Regarding performance investigations, recently we have begun to benchmark
application performance focusing on network throughput. Interestingly, our
measurements reveal significant differences between the used kernels, but
also shortcomings in our software stack. For example, currently we see that
our version of lwIP performs poorly with gigabit networking. To thoroughly
investigate such performance issues, the current version adds support
for tracing the behaviour of Genode components. This will allow us to get a
profound understanding of all inter-component interaction that are on the
critical path for the performance of complex application-level workloads.
Thanks to the Genode architecture, we could come up with a strikingly simple,
yet powerful design for a tracing facility. Section
[Light-weight event tracing] explains how it works.
When it comes to multi-processor scalability, we used to shy away from such
inquiries because, honestly, we haven't paid much consideration to it. This
view has changed by now. With the current release, we implemented the
management of CPU affinities right into the heart of the framework, i.e.,
Genode's session concept. Additionally, we cracked a damn hard nut by enabling
Genode to use multiple CPUs on the NOVA hypervisor. This kernel is by far the
most advanced Open-Source microkernel for the x86 architecture. However,
NOVA's MP model seemed to inherently contradict with the API design of Genode.
Fortunately, we found a fairly elegant way to go forward and we're able to
tame the beast. Section [Enhanced multi-processor support] goes into more
detail.
Functionality-wise, we always considered the availability of Qt on Genode as a
big asset. With the current release, we are happy to announce that we finally
made the switch from Qt4 to Qt5. Section [Qt5 available on all kernels] gives
insights into the challenges that we faced during porting work.
In addition to those highlights, the new version comes with improvements all
over the place. To name a few, there are improved support for POSIX threads,
updated device drivers, an updated version of the Fiasco.OC kernel and
L4Linux, and new device drivers for Exynos-5. Finally, the problem of
verifying the integrity of downloaded 3rd-party source codes has been
addressed.
Qt5 available on all kernels
############################
Since its integration with Genode version 9.02, Qt4 is regarded as
one of the most prominent features of the framework. For users, combining Qt
with Genode makes the world of sophisticated GUI-based end-user applications
available on various microkernels. For Genode developers, Qt represents by far
the most complex work load natively executed on top of the framework API,
thereby stressing the underlying system in any way imaginable. We have been
keeping an eye on Qt version 5 for a while and highly anticipate the direction
where Qt is heading. We think that the time is right to leave Qt4 behind to
embrace Qt5 for Genode.
For the time being, both Qt4 and Qt5 are available for Genode, but Qt4 is
declared as deprecated and will be removed with the upcoming version 13.11.
Since Qt5 is almost API compatible to Qt4, the migration path is relatively
smooth. So we recommend to move your applications over to Qt5 during the
next release cycle.
In the following, we briefly describe the challenges we faced while adding Qt5
support to Genode, point you to the place where to find Qt5 in the source
tree, and give quick-start instructions for getting a Qt5 application
scenario running.
We found that the biggest architectural difference between version 4 and
version 5 is the use of the so-called Qt Platform Abstraction (QPA) interface,
which replaces the former Qt Window System (QWS).
Moving from QWS to QPA
======================
With Qt4, we relied on QWS
to perform the window handling. A Qt4 application used to create a session to
Genode's GUI server (called nitpicker) and applied its graphical output onto a
virtual framebuffer. The virtual framebuffer was not visible per se. To make
portions of the virtual framebuffer visible on screen, the application had to
create so-called nitpicker views. A view is a rectangular area of the physical
screen that displays a (portion of) the virtual framebuffer. The position,
size, and stacking order of views is managed by the application. For each Qt
window, the application would simply create a corresponding nitpicker view and
maintain the consistency of the view with the geometry of the Qt window. Even
though each Qt application seemingly operated as a full-screen application
with the windows managed by the application-local QWS, the use of nitpicker
views still allowed the integration of any number of Qt applications into one
windowed environment.
With the advent of compositing window managers, the typical way of how an
application interacts with the window system of the OS changed. Whereas
old-generation GUIs relied on a tight interplay of the application with the
window server in order to re-generate newly exposed window regions whenever
needed (e.g., revealing the window content previously covered by another
window), the modern model of a GUI server keeps all pixels of all windows in
memory regardless of whether the window is visible or covered by other
windows. The use of one pixel buffer per window seems wasteful with respect to
memory usage when many large windows are overlapping each other. On the other
hand, this technique largely simplifies GUI servers and makes the
implementation of fancy effects, like translucent windows, straight forward.
Since memory is cheap, the Qt developers abandoned the old method and fully
embraced the buffer-per-window approach by the means of QPA.
For Genode, we faced the challenge that we don't have a window server
in the usual sense. With nitpicker, we have a GUI server, but with a more
radical design. In particular, nitpicker leaves the management of window
geometries and stacking to the client. In contrast, QPA expects the window
system to provide both means for a user to interactively change the window
layout and a way for an application to define the properties (such as the
geometry, title, and visibility) of its windows.
The obviously missing piece was the software component that deals with window
controls. Fortunately, we already have a bunch of native nitpicker applications
that come with client-side window controls, in particular the so-called liquid
framebuffer (liquid_fb). This nitpicker client presents a virtual framebuffer
in form of a proper window on screen and, in turn, provides a framebuffer and
input service. These services can be used by other Genode processes, for
example, another nested instance of nitpicker.
This way, liquid_fb lends itself to be the interface between the nitpicker
GUI server and QPA.
For each QPA window, the application creates a new liquid_fb instance as a
child process. The liquid_fb instance will request a dedicated nitpicker
session, which gets routed through the application towards the parent of the
application, which eventually routes the request to the nitpicker GUI server.
Finally, the liquid_fb instance announces its input and framebuffer services
to its parent, which happens to be the application. Now, the application is
able to use those services in order to access the window. Because the
liquid_fb instances are children of the application, the application can
impose control over those. In particular, it can update the liquid_fb
configuration including the window geometry and title at any time. Thanks to
Genode's dynamic reconfiguration mechanism, the liquid_fb instances are able
to promptly respond to such reconfigurations.
Combined, those mechanisms give the application a way to receive user input
(via the input services provided by the liquid_fb instances), perform
graphical output (via the virtual framebuffers provided by the liquid_fb
instances), and define window properties (by dynamically changing the
respective liquid_fb configurations). At the same time, the user can use
liquid_fb's window controls to move, stack, and resize application windows as
expected.
[image qt5_screenshot]
Steps of porting Qt5
====================
Besides the switch to QPA, the second major change was related to the build
system. For the porting work, we use a Linux host system to obtain the
starting point for the build rules. The Qt4 build system would initially
generate all Makefiles, which could be inspected and processed at once. In
contrast, Qt5 generates Makefiles during the build process whenever needed.
When having configured Qt for Genode, however, the build on Linux will
ultimately fail. So the much-desired intermediate Makefiles won't be created.
The solution was to have 'configure' invoke 'qmake -r' instead of 'qmake'.
This way, qmake project files will be processed recursively. A few additional
tweaks were needed to avoid qmake from backing out because of missing
dependencies (qt5_configuration.patch). To enable the build of the Qt tools
out of tree, qmake-specific files had to be slightly adapted
(qt5_tools.patch). Furthermore, qtwebkit turned out to use code-generation
tools quite extensively during the build process. On Genode, we perform this
step during the 'make prepare' phase when downloading and integrating the Qt
source code with the Genode source tree.
For building Qt5 on Genode, we hit two problems. First, qtwebkit depends on
the ICU (International Components for Unicode) library, which was promptly
ported and can be found in the libports repository. Second, qtwebkit
apparently dropped the support of the 'QThread' API in favor of
POSIX-thread support only. For this reason, we had to extend the coverage
of Genode's pthread library to fulfill the needs of qtwebkit.
Once built, we entered the territory of debugging problems at runtime.
* We hit a memory-corruption problem caused by an assumption of 'QArrayData'
with regard to the alignment of memory allocated via malloc. As a
work-around, we weakened the assumptions to 4-byte alignment
(qt5_qarraydata.patch).
* Page faults in QWidgetAnimator caused by
use-after-free problems. Those could be alleviated by adding pointer
checks (qt5_qwidgetanimator.patch).
* Page faults caused by the slot function 'QWidgetWindow::updateObjectName()'
with a 'this' pointer of an incompatible type 'QDesktopWidget*'.
As a workaround, we avoid this condition by delegating the
'QWidgetWindow::event()' that happened to trigger the slot method to
'QWindow' (base class of 'QWidgetWindow') rather than to a
'QDesktopWidget' member (qt5_qwidgetwindow.patch).
* We observed that Arora presented web sites incomplete, or including HTTP
headers. During the evaluation of HTTP data, a signal was sent to another
thread, which activated a "user provided download buffer" for optimization
purposes. On Linux, the receiving thread was immediately scheduled and
everything went fine. However, on some kernels used by Genode, scheduling
is different, so that the original thread continued to execute a bit longer,
ultimately triggering a race condition. As a workaround, we disabled the
"user provided download buffer" optimization.
* Page faults in the JavaScript engine of Webkit. The JavaScript
'RegExp.exec()' function returned invalid string objects. We worked around
this issue by deactivating the JIT compiler for the processing of
regular expressions (ENABLE_YARR_JIT).
The current state of the Qt5 port is fairly complete. It covers the core, gui,
jscore, network, script, scriptclassic, sql, ui, webcore, webkit, widgets,
wtf, and xml modules. That said, there are a few known limitations and
differences compared to Qt4. First, the use of one liquid_fb instance per
window consumes more memory compared to the use of QWS in Qt4. Furthermore,
external window movements are not recognized by our QPA implementation yet.
This can cause popup menus to appear at unexpected positions. Key repeat is
not yet handled. The 'QNitpickerViewWidget' is not yet adapted to Qt5. For this
reason, qt_avplay is not working yet.
Test drive
==========
Unlike Qt4, which was hosted in the dedicated 'qt4' repository, Qt5 is
integrated in the libports repository. It can be downloaded and integrated
into the Genode build system by issuing 'make prepare' from within the
libports repository. The Qt5 versions of the known Qt examples are located at
libports/src/app/qt5. Ready-to-use run scripts for those examples are available
at libports/run.
Migration away from Qt4 to Qt5
==============================
The support for Qt4 for Genode has been declared as deprecated. By default,
it's use is inhibited to avoid name aliasing problems between both versions.
Any attempt to build a qt4-based target will result in a message:
!Skip target app/qt_launchpad because it requires qt4_deprecated
To re-enable the use of Qt4, the SPEC value qt4_deprecated must be defined
manually for the build directory:
!echo "SPECS += qt4_deprecated" >> etc/specs.conf
We will keep the qt4 repository in the source tree during the current
release cycle. It will be removed with version 13.11.
Light-weight event tracing
##########################
With Genode application scenarios getting increasingly sophisticated,
the need for thorough performance analysis has come into spotlight.
Such scenarios entail the interaction of many components.
For example, with our recent work on optimizing network performance, we
have to consider several possible attack points:
* Device driver: Is the device operating in the proper mode? Are there
CPU-intensive operations such as allocations within the critical path?
* Interface to the device driver: How frequent are context switches between
client and device driver? Is the interface designed appropriately for
the access patterns?
* TCP/IP stack: How does the data flow from the raw packet level to the
socket level? How dominant are synchronization costs between the involved
threads? Are there costly in-band operations performed, e.g., dynamic
memory allocations per packet?
* C runtime: How does integration of the TCP/IP stack with the C runtime
work, for example how does the socket API interfere with timeout
handling during 'select' calls?
* Networking application
* Timer server: How often is the timer consulted by the involved components?
What is the granularity of timeouts and thereby the associated costs for
handling them?
* Interaction with core: What is the profile of the component's interaction
with core's low-level services?
This example is just an illustration. Most real-world performance-critical
scenarios have a similar or even larger scope. With our traditional
tools, it is hardly possible to gather a holistic view of the scenario. Hence,
finding performance bottlenecks tends to be a series of hit-and-miss
experiments, which is a tiresome and costly endeavour.
To overcome this situation, we need the ability to gather traces of component
interactions. Therefore, we started investigating the design of a tracing
facility for Genode one year ago while posing the following requirements:
* Negligible impact on the performance, no side effects:
For example, performing a system call per traced event
is out of question because this would severely influence the flow of
control (as the system call may trigger the kernel to take a scheduling
decision) and the execution time of the traced code, not to speak of
the TLB and cache footprint.
* Kernel independence: We want to use the same tracing facility across
all supported base platforms.
* Accountability of resources: It must be clearly defined where the
resources for trace buffers come from. Ideally, the tracing tool should be
able to dimension the buffers according to its needs and, in turn, pay for
the buffers.
* Suitable level of abstraction: Only if the trace contains information at
the right level of abstraction, it can be interpreted for large scenarios.
A counter example is the in-kernel trace buffer of the Fiasco.OC kernel,
which logs kernel-internal object names and a few message words when tracing
IPC messages, but makes it almost impossible to map this low-level
information to the abstraction of the control flow of RPC calls. In
contrast, we'd like to capture the names of invoked RPC calls (which is an
abstraction level the kernel is not aware of). This requirement implies the
need to have useful trace points generated automatically. Ideally those
trace points should cover all interactions of a component with the outside
world.
* (Re-)definition of tracing policies at runtime: The
question of which information to gather when a trace point is passed
should not be solely defined at compile time. Instead of changing static
instrumentations in the code, we'd prefer to have a way to configure
the level of detail and possible conditions for capturing events at runtime,
similar to dtrace. This way, a series of different hypotheses could be
tested by just changing the tracing policy instead of re-building and
rebooting the entire scenario.
* Straight-forward implementation: We found that most existing tracing
solutions are complicated. For example, dtrace comes with a virtual
machine for the sandboxed interpretation of policy code. Another typical
source of complexity is the synchronization of trace-buffer accesses.
Because for Genode, low TCB complexity is of utmost importance, the
simplicity of the implementation is the prerequisite to make it an
integral part of the base system.
* Support for both online and offline analysis of traces.
We are happy to report to have come up with a design that meets all those
requirements thanks to the architecture of Genode. In the following, we
present the key aspects of the design.
The tracing facility comes in the form of a new TRACE service implemented
in core. Using this service, a TRACE client can gather information about
available tracing subjects (existing or no-longer existing threads),
define trace buffers and policies and assign those to tracing subjects,
obtain access to trace-buffer contents, and control the tracing state
of tracing subjects. When a new thread is created via a CPU session, the
thread gets registered at a global registry of potential tracing sources. Each
TRACE service manages a session-local registry of so-called trace subjects.
When requested by the TRACE client, it queries new tracing sources from the
source registry and obtains references to the corresponding threads. This way,
the TRACE session becomes able to control the thread's tracing state.
To keep the tracing overhead as low as possible, we assign a separate trace
buffer to each individually traced thread. The trace buffer is a shared memory
block mapped to the virtual address space of the thread's process. Capturing
an event comes down to a write operation into the thread-local buffer. Because
of the use of shared memory for the trace buffer, no system call is needed and
because the buffer is local to the traced thread, there is no need for
synchronizing the access to the buffer. When no tracing is active, a thread
has no trace buffer. The buffer gets installed only when tracing is started.
The buffer is not installed magically from the outside of the traced process
but from the traced thread itself when passing a trace point. To detect
whether to install a new trace buffer, there exists a so-called trace-control
dataspace shared between the traced process and its CPU service. This
dataspace contains control bits for each thread created via the CPU session.
The control bits are evaluated each time a trace point is passed by the
thread. When the thread detects a change of the tracing state, it actively
requests the new trace buffer from the CPU session and installs it into its
address space. The same technique is used for loading the code for tracing
policies into the traced process. The traced thread actively checks for
policy-version updates by evaluating the trace-control bits. If an update is
detected, the new policy code is requested from the CPU session. The policy
code comes in the form of position-independent code, which gets mapped into
the traced thread's address space by the traced thread itself. Once mapped,
a trace point will call the policy code. When called, the policy
module code returns the data to be captured into the trace buffer. The
relationship between the trace monitor (the client of TRACE service), core's
TRACE service, core's CPU service, and the traced process is depicted in
Figure [trace_control].
[image trace_control]
There is one trace-control dataspace per CPU session, which gets accounted
to the CPU session's quota. The resources needed for the
trace-buffer dataspaces and the policy dataspaces are paid-for by the
TRACE client. On session creation, the TRACE client can specify the amount
of its own RAM quota to be donated to the TRACE service in core. This
enables the TRACE client to define trace buffers and policies of arbitrary
sizes, limited only by its own RAM quota.
In Genode, the interaction of a process with its outside world is
characterized by its use of inter-process communication, namely synchronous
RPC, signals, and access to shared memory. For the former two types of
inter-process communication, Genode generates trace points automatically. RPC
clients generate trace points when an RPC call is issued and when a call
returned. RPC servers generate trace points in the RPC dispatcher, capturing
incoming RPC requests as well as RPC replies. Thanks to Genode's RPC
framework, we are able to capture the names of the RPC functions in the RPC
trace points. This information is obtained from the declarations of the RPC
interfaces. For signals, trace points are generated for submitting and
receiving signals. Those trace points form a useful base line for gathering
tracing data. In addition, manual trace points can be inserted into the code.
State of the implementation
===========================
The implementation of Genode's tracing facility is surprisingly low complex.
The addition to the base system (core as well as the base library) are
merely 1500 lines of code. The mechanism works across all base platforms.
Because the TRACE client provides the policy code and trace buffer to the
traced thread, the TRACE client imposes ultimate control over the traced
thread. In contrast to dtrace, which sandboxes the trace policy, we express
the policy module in the form of code executed in the context of the traced
thread. However, in contrast to dtrace, such code is never loaded into a large
monolithic kernel, but solely into the individually traced processes. So the
risk of a misbehaving policy is constrained to the traced process.
In the current form, the TRACE service of core should be considered as a
privileged service because the trace-subject namespace of each session
contains all threads of the system. Therefore, TRACE sessions should be routed
only for trusted processes. In the future, we plan to constrain the
namespaces for tracing subjects per TRACE session.
The TRACE session interface is located at base/include/trace_session/.
A simple example for using the service is available at os/src/test/trace/
and is accompanied with the run script os/run/trace.run. The test
demonstrates the TRACE session interface by gathering a trace of a thread
running locally in its address space.
Enhanced multi-processor support
################################
Multi-processor (MP) support is one of those features that most users take for
granted. MP systems are so ubiquitous, even on mobile platforms, that a
limitation to utilizing a single CPU only is almost a fallacy.
That said, MP support in operating systems is hard to get right. For this
reason, we successively deferred the topic on the agenda of Genode's
road map.
For some base platforms such as the Linux or Codezero kernels, Genode
always used to support SMP because the kernel would manage the affinity
of threads to CPU cores transparently to the user-level process. So on
these kernels, there was no need to add special support into the framework.
However, on most microkernels, the situation is vastly different. The
developers of such kernels try hard to avoid complexity in the kernel and
rightfully argue that in-kernel affinity management would contribute to kernel
complexity. Another argument is that, in contrast to monolithic kernels that
have a global view on the system and an "understanding" of the concerns of the
user processes, a microkernel is pretty clueless when it comes to the roles
and behaviours of individual user-level threads. Not knowing whether a thread
works as a device driver, an interactive program, or a batch process, the
microkernel is not in the position to form a reasonably useful model of the
world, onto which it could intelligently apply scheduling and affinity
heuristics. In fact, from the perspective of a microkernel, each thread does
nothing else than sending and receiving messages, and causing page faults.
For these reasons, microkernel developers tend to provide the bootstrapping
procedure for the physical CPUs and a basic mechanism to assign
threads to CPUs but push the rest of the problem to the user space, i.e.,
Genode. The most straight-forward way would make all physical CPUs visible
to all processes and require the user or system integrator to assign
physical CPUs when a thread is created. However, on the recursively
structured Genode system, we virtualize resources at each level, which
calls for a different approach. Section [Management of CPU affinities]
explains our solution.
When it comes to inter-process communication on MP systems, there is a
certain diversity among the kernel designs. Some kernels allow the user land
to use synchronous IPC between arbitrary threads, regardless of whether both
communication partners reside on the same CPU or on two different CPUs. This
convenient model is provided by Fiasco.OC. However, other kernels do not offer
any synchronous IPC mechanism across CPU cores at all, NOVA being a poster
child of this school of thought. If a user land is specifically designed for
a particular kernel, those peculiarities can be just delegated to the
application developers. For example, the NOVA user land called NUL is designed
such that a recipient of IPC messages spawns a dedicated thread on each
physical CPU. In contrast, Genode is meant to provide a unified API that
works well across various different kernels. To go forward, we had four
options:
# Not fully supporting the entity of API semantics across all base platforms.
For example, we could stick with the RPC API for synchronous communication
between threads. Programs just would happen to fail on some base platforms
when the called server resides on a different CPU. This would effectively
push the problem to the system integrator. The downside would be the
sacrifice of Genode's nice feature that a program developed
on one kernel usually works well on other kernels without any changes.
# Impose the semantics provided by the most restrictive kernel onto all
users of the Genode API. Whereas this approach would facilitate that
programs behave consistently across all base platforms, the restrictions
would be artificially imposed onto all Genode users, in particular the
users of kernels with less restrictions. Of course, we don't change
the Genode API lightheartedly, which attributes to our hesitance to go
into this direction.
# Hiding kernel restrictions behind the Genode API. This approach could come
in many different shapes. For example, Genode could transparently spawn a
thread on each CPU when a single RPC entrypoint gets created, following
the model of NUL. Or Genode could emulate synchronous IPC using the core
process as a proxy.
# Adapting the kernel to the requirements of Genode. That is, persuading
kernel developers to implement the features we find convenient, i.e., adding
a cross-CPU IPC feature to NOVA. History shows that our track record in
doing that is not stellar.
Because each of those options is like opening a different can of worms, we
used to defer the solution of the problem. Fortunately, however, we finally
kicked off a series of practical experiments, which led to a fairly elegant
solution, which is detailed in Section
[Adding multi-processor support to Genode on NOVA].
Management of CPU affinities
============================
In line with our experience of supporting
[http://www.genode.org/documentation/release-notes/10.02#Real-time_priorities - real-time priorities]
in version 10.02, we were seeking a way to express CPU affinities such that
Genode's recursive nature gets preserved and facilitated. Dealing with
physical CPU numbers would contradict with this mission. Our solution
is based on the observation that most MP systems have topologies that can
be represented on a two-dimensional coordinate system. CPU nodes close
to each other are expected to have closer relationship than distant nodes.
In a large-scale MP system, it is natural to assign clusters of closely
related nodes to a given workload. Genode's architecture is based on the
idea to recursively virtualize resources and thereby lends itself to the
idea to apply this successive virtualization to the problem of clustering
CPU nodes.
In our solution, each process has a process-local view on a so-called affinity
space, which is a two-dimensional coordinate space. If the process creates a
new subsystem, it can assign a portion of its own affinity space to the new
subsystem by imposing a rectangular affinity location to the subsystem's CPU
session. Figure [affinity_space] illustrates the idea.
[image affinity_space]
Following from the expression of affinities as a rectangular location within a
process-local affinity space, the assignment of subsystems to CPU nodes
consists of two parts, the definition of the affinity space dimensions as used
for the process and the association of sub systems with affinity locations
(relative to the affinity space). For the init process, the affinity space is
configured as a sub node of the config node. For example, the following
declaration describes an affinity space of 4x2:
!
! ...
!
! ...
!
Subsystems can be constrained to parts of the affinity space using the
'' sub node of a '' entry:
!
! ...
!
!
! ...
!
! ...
!
As illustrated by this example, the numbers used in the declarations for this
instance of the init process are not directly related to physical CPUs. If
the machine has just two cores, init's affinity space would be mapped
to the range [0,1] of physical CPUs. However, in a machine
with 16x16 CPUs, the loader would obtain 8x8 CPUs with the upper-left
CPU at position (4,0). Once a CPU session got created, the CPU client can
request the physical affinity space that was assigned to the CPU session
via the 'Cpu_session::affinity()' function. Threads of this CPU session
can be assigned to those physical CPUs via the 'Cpu_session::affinity()'
function, specifying a location relative to the CPU-session's affinity space.
Adding multi-processor support to Genode on NOVA
================================================
The NOVA kernel has been supporting MP systems for a long time. However
Genode did not leverage this capability until now. The main reason was that
the kernel does not provide - intentionally by the kernel developer - the
possibility to perform synchronous IPC between threads residing on different
CPUs.
To cope with this situation, Genode servers and clients would need to make
sure to have at least one thread on a common CPU in order to communicate.
Additionally, shared memory and semaphores could be used to communicate across
CPU cores. Both options would require rather fundamental changes to the Genode
base framework and the API. An exploration of this direction should in any
case be pursued in evolutionary steps rather than as one big change, also
taking into account that other kernels do not impose such hard requirements on
inter-CPU communication. To tackle the challenge, we conducted a series of
experiments to add some kind of cross-CPU IPC support to Genode/NOVA.
As a general implication of the missing inter-CPU IPC, messages between
communication partners that use disjoint CPUs must take an indirection through
a proxy process that has threads running on both CPUs involved. The sender
would send the message to a proxy thread on its local CPU, the proxy process
would transfer the message locally to the CPU of the receiver by using
process-local communication, and the proxy thread on the receiving CPU would
deliver the message to the actual destination. We came up with three options
to implement this idea prototypically:
# Core plays the role of the proxy because it naturally has access to all
CPUs and emulates cross-CPU IPC using the thread abstractions of the
Genode API.
# Core plays the role of the proxy but uses NOVA system calls directly
rather than Genode's thread abstraction.
# The NOVA kernel acts as the proxy and emulates cross-CPU IPC directly
in the kernel.
After having implemented the first prototypes, we reached the following
conclusions.
For options 1 and 2 where core provides this service: If a client can not
issue a local CPU IPC, it asks core - actually the pager of the client
thread - to perform the IPC request. Core then spawns or reuses a proxy thread
on the target CPU and performs the actual IPC on behalf of the client. Option
1 and 2 only differ in respect to code size and the question to whom to
account the required resources - since a proxy thread needs a stack and some
capability selectors.
As one big issue for option 1 and 2, we found that in order to delegate
capabilities during the cross-CPU IPC, core has to receive capability mappings
to delegate them to the target thread. However, core has no means to know
whether the capabilities must be maintained in core or not. If a capability is
already present in the target process, the kernel would just translate the
capability to the target's capability name space. So core wouldn't need to
keep it. In the other case where the target receives a prior unknown
capability, the kernel creates a new mapping. Because the mapping gets
established by the proxy in core, core must not free the capability.
Otherwise, the mapping would disappear in the target process. This means that
the use of core as a proxy ultimately leads to leaking kernel resources
because core needs to keep all transferred capabilities, just for the case a
new mapping got established.
For option 3, the same general functionality as for option 1 and 2 is
implemented in the kernel instead of core. If a local CPU IPC call
fails because of a BAD_CPU kernel error code, the cross-CPU IPC extension
will be used. The kernel extension creates - similar as to option 1 and 2 - a
semaphore (SM), a thread (EC), and a scheduling context (SC) on the remote
CPU and lets it run on behalf of the caller thread. The caller thread
gets suspended by blocking on the created semaphore until the remote EC has
finished the IPC. The remote proxy EC reuses the UTCB of the suspended caller
thread as is and issues the IPC call. When the proxy EC returns, it wakes up
the caller via the semaphore. Finally, the proxy EC and SC de-schedule
themselves and the resources get to be destroyed later on by the kernel's RCU
mechanism. Finally, when the caller thread got woken up, it takes care to
initiate the deconstruction of the semaphore.
The main advantage of option 3 compared to options 1 and 2 is that we don't
have to keep and track the capability delegations during a cross-CPU IPC.
Furthermore, we do not have potentially up to two additional address space
switches per cross-CPU IPC (from client to core and core to the server).
Additionally, the UTCB of the caller is reused by the proxy EC and does not
need to be allocated separately as for option 1 and 2.
For these reasons, we decided to go for the third option. From Genode's API
point of view, the use of cross-CPU IPC is completely transparent. Combined
with the affinity management described in the previous section, Genode/NOVA
just works on MP systems.
As a simple example for using Genode on MP systems, there is a ready-to-use
run script available at base/run/affinity.run.
Base framework
##############
Affinity propagation in parent, root, and RPC entrypoint interfaces
===================================================================
To support the propagation of CPU affinities with session requests, the
parent and root interfaces had to be changed. The 'Parent::Session'
and 'Root::Session' take the affinity of the session as a new argument.
The affinity argument contains both the dimensions of the affinity space
used by the session and the session's designated affinity location within
the space. The corresponding type definitions can be found at
base/affinity.h.
Normally, the 'Parent::Session' function is not used directly but indirectly
through the construction of a so-called connection object, which represents an
open session. For each session type there is a corresponding connection type,
which takes care of assembling the session-argument string by using the
'Connection::session()' convenience function. To maintain API compatibility,
we kept the signature of the existing 'Connection::session()' function using a
default affinity and added a new overload that takes the affinity as
additional argument. Currently, this overload is used in
cpu_session/connection.h.
For expressing the affinities of RPC entrypoints to CPUs within the affinity
space of the server process, the 'Rpc_entrypoint' takes the desired affinity
location of the entrypoint as additional argument. For upholding API
compatibility, the affinity argument is optional.
CPU session interface
=====================
The CPU session interface underwent changes to accommodate the new event
tracing infrastructure and the CPU affinity management.
Originally the 'Cpu_session::num_cpus()' function could be used to
determine the number of CPUs available to the session. This function
has been replaced by the new 'affinity_space' function, which returns the
bounds of the CPU session's physical affinity space. In the simplest case of
an SMP machine, the affinity space is one-dimensional where the width
corresponds to the number of CPUs. The 'affinity' function, which is used to
bind a thread to a specified CPU, has been changed to take an affinity
location as argument. This way, the caller can principally express the
affiliation of the thread with multiple CPUs to guide load-balancing in a CPU
service.
New TRACE session interface
===========================
The new event tracing mechanism as described in Section
[Light-weight event tracing] is exposed to Genode processes in the form
of the TRACE service provided by core. The new session interface
is located under base/include/trace_session/. In addition to the new session
interface, the CPU session interface has been extended with functions for
obtaining the trace-control dataspace for the session as well as the trace
buffer and trace policy for a given thread.
Low-level OS infrastructure
###########################
Event-driven operation of NIC bridge
====================================
The NIC bridge component multiplexes one physical network device among
multiple clients. It enables us to multiplex networking on the network-packet
level rather than the socket level and thereby take TCP/IP out of the
critical software stack for isolating network applications. As it
represents an indirection in the flow of all networking packets, its
performance is important.
The original version of NIC bridge was heavily multi-threaded. In addition to
the main thread, a timer thread, and a thread for interacting with the NIC
driver, it employed one dedicated thread per client. By merging those flows of
control into a single thread, we were able to significantly reduce the number
of context switches and improve data locality. These changes reduced the
impact of the NIC bridge on the packet throughput from 25% to 10%.
Improved POSIX thread support
=============================
To accommodate qtwebkit, we had to extend Genode's pthread library with
working implementations of condition variables, mutexes, and thread-local
storage. The implemented functions are attr_init, attr_destroy, attr_getstack,
attr_get_np, equal, mutex_attr, mutexattr_init, mutexattr_destroy,
mutexattr_settype, mutex_init, mutex_destroy, mutex_lock, mutex_unlock,
cond_init, cond_timedwait, cond_wait, cond_signal, cond_broadcast, key_create,
setspecific, and getspecific.
Device drivers
##############
SATA 3.0 on Exynos 5
====================
The previous release featured the initial version of our SATA 3.0 driver for
the Exynos 5 platform. This driver located at os/src/drivers/ahci/exynos5 has
reached a fully functional state by now. It supports UDMA-133 with up to
6 GBit/s.
For driver development, we set the goal to reach a performance equal to
the Linux kernel. To achieve that goal, we had to make sure to
operate the controller and the disks in the same ways as Linux does.
For this reason, we modeled our driver closely after the behaviour of the
Linux driver. That is, we gathered traces of I/O transactions to determine the
initialization steps and the request patterns that Linux performs to access
the device, and used those behavioral traces as a guide for our
implementation. Through step-by-step analysis of the traces, we not only
succeeded to operate the device in the proper modes, but we also found
opportunities for further optimization, in particular regarding the error
recovery implementation.
This approach turned out to be successful. We measured that our driver
generally operates as fast (and in some cases even significantly faster)
than the Linux driver on solid-state disks as well as on hard disks.
Dynamic CPU frequency scaling for Exynos 5
==========================================
As the Samsung Exynos-5 SoC is primarily targeted at mobile platforms,
power management is an inherent concern. Until now, Genode did not pay
much attention to power management though. For example, we completely
left out the topic from the scope of the OMAP4 support. With the current
release, we took the first steps towards proper power management on ARM-based
platforms in general, and the Exynos-5-based Arndale platform in particular.
First, we introduced a general interface to regulate clocks and voltages.
Priorly, each driver did its own part of configuring clock and power control
registers. The more device drivers were developed, the higher were the chances
that they interfere when accessing those clock, or power units.
The newly introduced "Regulator" interface provides the possibility to enable
or disable, and to set or get the level of a regulator. A regulator might be
a clock for a specific device (such as a CPU) or a voltage regulator.
For the Arndale board, an exemplary implementation of the regulator interface
exists in the form of the platform driver. It can be found at
os/src/drivers/platform/arndale. Currently, the driver implements
clock regulators for the CPU, the USB 2.0 and USB 3.0 host controller, the
eMMC controller, and the SATA controller. Moreover, it provides power
regulators for SATA, USB 2.0, and USB 3.0 host controllers. The selection of
regulators is dependent on the availability of drivers for the platform.
Otherwise it wouldn't be possible to test that clock and power state doesn't
affect the device.
Apart from providing regulators needed by certain device drivers, we
implemented a clock regulator for the CPU that allows changing the CPU
frequency dynamically and thereby giving the opportunity to scale down
voltage and power consumption. The possible values range from 200 MHz to
1.7 GHz whereby the last value isn't recommended and might provoke system
crashes due to overheating. When using Genode's platform driver for Arndale
it sets CPU clock speed to 1.6 GHz by default. When reducing
the clock speed to the lowest level, we observed a power consumption
reduction of approximately 3 Watt. Besides reducing dynamic power consumption
by regulating the CPU clock frequency, we also explored the gating of the clock
management and power management to further reduce power consumption.
With the CPU frequency scaling in place, we started to close all clock gates
not currently in use. When the platform driver for the Arndale board gets
initialized, it closes everything. If a device driver enables its clock
regulator, all necessary clock gates for the device's clock are opened. This
action saves about 0.7 Watt. The initial closing of all unnecessary power
gates was much more effective. Again, everything not essential for the working
of the kernel is disabled on startup. When a driver enables its power
regulator, all necessary power gates for the device are opened. Closing all
power gates saves about 2.6 Watt.
If we consider all measures taken to save power, we were able to reduce power
consumption to about 59% without performance degradation. When measuring power
consumption after boot up, setting the CPU clock to 1.6 GHz, and fully load
both CPU cores without the described changes, we measured about 8 Watt. With
the described power saving provisions enabled, we measured about 4.7 Watt.
When further reducing the CPU clock frequency to 200 MHz, only 1.7 Watt were
measured.
VESA driver moved to libports
=============================
The VESA framebuffer driver executes the initialization code located
in VESA BIOS of the graphics card. As the BIOS code is for real mode,
the driver uses the x86emu library from X11 as emulation environment.
We updated x86emu to version 1.20 and moved the driver from the 'os'
repository to the 'libports' repository as the library is third-party
code. Therefore, if you want to use the driver, the 'libports'
repository has to be prepared
('make -C /libports prepare PKG=x86emu') and enabled in
your 'etc/build.conf'.
Runtime environments
####################
Seoul (aka Vancouver) VMM on NOVA
=================================
Since we repeatedly received requests for using the Seoul respectively
Vancouver VMM on NOVA, we improved the support for this virtualization
solution on Genode. Seoul now supports booting from raw hard disk images
provided via Genode's block session interface. Whether this image is actually
a file located in memory, or it is coming directly from the hard disk, or just
from a partition of the hard disk using Genode's part_blk service, is
completely transparent thanks to Genode's architecture.
Additionally, we split up the one large Vancouver run script into several
smaller Seoul run scripts for easier usage - e.g. one for disk, one for
network testing, one for automated testing, and one we call "fancy". The
latter resembles the former vancouver.run script using Genode's GUI to let the
user start VMs interactively. The run scripts prefixed with 'seoul-' can be
found at ports/run. For the fancy and network scripts, ready-to-use VM images
are provided. Those images are downloaded automatically when executing the run
script for the first time.
L4Linux on Fiasco.OC
====================
L4Linux has been updated from version 3.5.0 to Linux kernel version 3.9.0 thus
providing support for contemporary user lands running on top of L4Linux on both
x86 (32bit) and ARM platforms.
Noux runtime for Unix software
==============================
Noux is our way to use the GNU software stack natively on Genode. To improve
its performance, we revisited the address-space management of the runtime to
avoid redundant revocations of memory mappings when Noux processes are cleaned
up.
Furthermore, we complemented the support for the Genode tool chain to
cover GNU sed and GNU grep as well. Both packages are available at the ports
repository.
Platforms
#########
Fiasco.OC updated to revision r56
=================================
Fiasco.OC and the required L4RE parts have been updated to the current SVN
revision (r56). For us, the major new feature is the support of Exynos SOCs in
the mainline version of Fiasco.OC (www.tudos.org). Therefore Genode's
implementation of the Exynos5250 platform could be abandoned leading to less
maintenance overhead of Genode on Fiasco.OC.
Furthermore, Genode's multi-processor support for this kernel has been
improved so that Fiasco.OC users benefit from the additions described in
Section [Enhanced multi-processor support].
NOVA updated
============
In the process of our work on the multi-processor support on NOVA, we updated
the kernel to the current upstream version. Additionally, our customized branch
(called r3) comes with the added cross-CPU IPC system call and improvements
regarding the release of kernel resources.
Integrity checks for downloaded 3rd-party software
##################################################
Even though Genode supports a large variety of 3rd-party software, its
source-code repository contains hardly any 3rd-party source code. Whenever
3rd-party source code is needed, Genode provides automated tools for
downloading the code and integrating it with the Genode environment. As of
now, there exists support for circa 70 software packages, including the
tool chain, various kernels, libraries, drivers, and a few applications. Of
those packages, the code for 13 packages comes directly from their respective
Git repositories. The remaining 57 packages are downloaded in the form of tar
archives from public servers via HTTP or FTP. Whereas we are confident with
the integrity of the code that comes from Git repositories, we are less so
about the archives downloaded from HTTP or FTP servers.
Fortunately, most Open-Source projects provide signature files that allow
the user to verify the origin of the archive. For example, archives of
GNU software are signed with the private key of the GNU project. So the
integrity of the archive can be tested with the corresponding public key.
We used to ignore the signature files for many years but
this has changed now. If there is a signature file available for a package,
the package gets verified right after downloading. If only a hash-sum file
is provided, we check it against a known-good hash sum.
The solution required three steps, the creation of tools for validating
signatures and hashes, the integration of those tools into Genode's
infrastructure for downloading the 3rd-party code, and the definition of
verification rules for the individual packages.
First, new tools for downloading and validating hash sums and signatures were
added in the form of the shell scripts download_hashver (verify hash sum) and
download_sigver (verify signature) found at the tool/ directory. Under the
hood, download_sigver uses GNU GPG, and download_hashver uses the tools
md5sum, sha1sum, and sha256sum provided by coreutils.
Second, hooks for invoking the verification tools were added to the
tool-chain build script as well as the ports and the libports repositories.
The third and the most elaborative step, was going through all the packages,
looking for publicly available signature files, and adding corresponding
package rules. As of now, this manual process has been carried out for 30
packages, thereby covering the half of the archives.
Thanks to Stephan Mueller for pushing us into the right direction, kicking off
the work on this valuable feature, and for the manual labour of revisiting all
the 3rd-party packages!