mirror of
https://github.com/genodelabs/genode.git
synced 2025-01-11 15:33:04 +00:00
416 lines
20 KiB
Plaintext
416 lines
20 KiB
Plaintext
|
|
|
|
=======================================
|
|
Genode on seL4 - IPC and virtual memory
|
|
=======================================
|
|
|
|
|
|
Norman Feske
|
|
|
|
|
|
This is the second part of a series of hands-on articles about bringing Genode
|
|
to the seL4 kernel.
|
|
[http://genode.org/documentation/articles/sel4_part_1 - Read the previous part here...]
|
|
|
|
After having created a minimalistic root task consisting of two threads, we
|
|
can move forward with exercising the functionality provided by the kernel,
|
|
namely inter-process communication and the handling of virtual memory.
|
|
Once we have tested those functionalities in our minimalistic root task
|
|
environment, we will be able to apply the gained knowledge to the actual
|
|
porting effort of Genode's core component.
|
|
|
|
|
|
Inter-process communication
|
|
###########################
|
|
|
|
In the L4 universe, the term IPC (inter-process communication) usually stands
|
|
for synchronous communication between two threads. In seL4, IPC has two uses.
|
|
First, it enables threads of different protection domains (or the same
|
|
protection domain) to exchange messages. So information can be transferred
|
|
across protection-domain boundaries. Second, IPC is the mechanism used to
|
|
delegate access rights throughout the system. This is accomplished by sending
|
|
capabilities as message payload. When a capability is part of a message, the
|
|
kernel translates the local name of the capability in the sender's protection
|
|
domain to a local name in the receiver's protection domain.
|
|
|
|
In Genode, IPC is realized via two thin abstractions that build upon each
|
|
other:
|
|
|
|
# At the low level, the IPC library _src/base/ipc/ipc.cc_ is responsible
|
|
for sending and receiving messages using the kernel mechanism. It has a
|
|
generic interface _base/include/base/ipc.h_, which supports the marshalling
|
|
and un-marshalling of message arguments and capabilities using C++
|
|
insertion and extraction
|
|
operators. Genode users never directly interact with the IPC library.
|
|
|
|
# Built on top the IPC library, the so-called RPC framework adds the notion
|
|
of RPC functions and RPC objects. RPC interfaces are declared using
|
|
abstract C++ base classes with a few annotations. Under the hood, the
|
|
RPC framework uses C++ meta-programming techniques to turn RPC definitions
|
|
into code that transfers messages via the IPC library. In contrast to
|
|
the IPC library, the RPC library is platform-agnostic.
|
|
|
|
To enable Genode's RPC mechanism on seL4, we merely have to provide a
|
|
seL4-specific IPC library implementation. To warm up with seL4's IPC
|
|
mechanism, however, we first modify our existing test program (see
|
|
[http://genode.org/documentation/articles/sel4_part_1 - the first part])
|
|
to let the main thread perform an IPC call to the second thread.
|
|
|
|
To let the second thread receive IPC messages, we first need to create a
|
|
synchronous IPC endpoint using the 'seL4_Untyped_RetypeAtOffset' function
|
|
with 'seL4_EndpointObject' as type, an offset that skips the already allocated
|
|
TCB (the TCB object has a known size of 1024 bytes), and the designated
|
|
capability number (EP_CAP).
|
|
|
|
As a first test, we want the second thread to receive an incoming message.
|
|
So we change the entry function as follows:
|
|
|
|
! PDBG("call seL4_Wait");
|
|
! seL4_MessageInfo_t msg_info = seL4_Wait(EP_CAP, nullptr);
|
|
! PDBG("returned from seL4_Wait, call seL4_Reply");
|
|
! seL4_Reply(msg_info);
|
|
! PDBG("returned from seL4_Reply");
|
|
|
|
At the end of the main function (the context of the first thread), we try to
|
|
call the second thread via 'seL4_Call':
|
|
|
|
! PDBG("call seL4_Call");
|
|
! seL4_MessageInfo_t msg_info = seL4_MessageInfo_new(0, 0, 0, 0);
|
|
! seL4_Call(EP_CAP, msg_info);
|
|
! PDBG("returned from seL4_Call");
|
|
|
|
When executing the code, we get an error as follows:
|
|
|
|
! int main(): call seL4_Call
|
|
! void second_thread_entry(): call seL4_Wait
|
|
! Caught cap fault in send phase at address 0x0
|
|
! while trying to handle:
|
|
! vm fault on data at address 0x4 with status 0x6
|
|
! in thread 0xe0100080 at address 0x10002e1
|
|
|
|
By looking at the output of 'objdump -lSd', we see that fault happens at the
|
|
instruction
|
|
! mov %edi,%gs:0x4(,%ebx,4)
|
|
The issue is the same as the one we experienced before for the main thread - we
|
|
haven't initialized the GS register with a proper segment, yet. This can be
|
|
easily fixed by adding a call to our 'init_ipc_buffer' function right at the
|
|
start of the second thread's entry function. Still, the program does not work
|
|
yet:
|
|
|
|
! vm fault on data at address 0x4 with status 0x6
|
|
! in thread 0xe0100080 at address 0x10002e8
|
|
|
|
Looking at the objdump output, we see that the fault still happens at the same
|
|
instruction. So what is missing? The answer is that we haven't equipped the
|
|
second thread with a proper IPC buffer. The attempt to call 'seL4_Wait',
|
|
however, tries to access the IPC buffer of the calling thread. The IPC buffer
|
|
can be configured for a thread using the 'seL4_TCB_SetIPCBuffer' function. But
|
|
wait - what arguments do we need to pass? In addition to the TCB capability,
|
|
there are two arguments: a pointer to the IPC buffer and a page capability,
|
|
which contains the IPC buffer. Well, I had hoped to get away without dealing
|
|
with the memory management at this point. I figure that setting up the IPC
|
|
buffer for the second thread would require me to create a seL4_IA32_4K frame
|
|
object via 'seL4_Untyped_RetypeAtOffset' and insert a mapping of the page frame
|
|
within the roottask's address space, and possibly also create and install a
|
|
page-table object.
|
|
|
|
To avoid becoming side-tracked by those memory-management issues, I decide
|
|
to assign the IPC buffer of the second thread right at the same page as
|
|
the one for the initial thread. Both the local address and the page
|
|
capability for the initial thread's IPC buffer are conveniently provided by
|
|
seL4's boot info structure. So let's give this a try:
|
|
|
|
! /* assign IPC buffer to second thread */
|
|
! {
|
|
! static_assert(sizeof(seL4_IPCBuffer) % 512 == 0,
|
|
! "unexpected seL4_IPCBuffer size");
|
|
!
|
|
! int const ret = seL4_TCB_SetIPCBuffer(SECOND_THREAD_CAP,
|
|
! (seL4_Word)(bi->ipcBuffer + 1),
|
|
! seL4_CapInitThreadIPCBuffer);
|
|
!
|
|
! PDBG("seL4_TCB_SetIPCBuffer returned %d", ret);
|
|
! }
|
|
|
|
With the initialization of the IPC buffer in place, we finally get our
|
|
desired output:
|
|
|
|
! int main(): call seL4_Call
|
|
! void second_thread_entry(): call seL4_Wait
|
|
! void second_thread_entry(): returned from seL4_Wait, call seL4_Reply
|
|
! int main(): returned from seL4_Call
|
|
! void second_thread_entry(): returned from seL4_Reply
|
|
|
|
|
|
Delegation of capabilities via IPC
|
|
==================================
|
|
|
|
The seL4 kernel supports the delegation of capabilities across address-space
|
|
boundaries by the means of synchronous IPC. As Genode fundamentally relies
|
|
on such a mechanism, I decide to give it a try by extending the simple IPC
|
|
test. Instead of letting the main thread call the second thread without any
|
|
arguments, the main thread will pass the thread capability of the second
|
|
thread as argument. Upon reception of the call, the second thread will find
|
|
a capability in its IPC buffer. To validate that the received capability
|
|
corresponds to the thread cap, the second thread issues a 'seL4_TCB_Suspend'
|
|
operation on the received cap. It is supposed to stop its execution right
|
|
there. This experiment requires the following steps:
|
|
|
|
# At the caller side, we need to supply a capability as argument to the
|
|
'seL4_Call' operation by specifying the number of capabilities to transfer
|
|
at the 'extraCaps' field of the 'seL4_MessageInfo', and marshalling the
|
|
index of the capability via the 'seL4_SetCap' function (specifying
|
|
SECOND_THREAD_CAP as argument).
|
|
|
|
# At the callee side, we need to define where to receive an incoming
|
|
capability. First, we have to reserve a CNode slot designated for the
|
|
new capability. For the test, a known-free index will do:
|
|
|
|
! enum { RECV_CAP = 0x102 };
|
|
|
|
Second, we have to configure the IPC buffer of the second thread to
|
|
point to the RECV_CAP:
|
|
|
|
! seL4_SetCapReceivePath(seL4_CapInitThreadCNode, RECV_CAP, 32);
|
|
|
|
We specify 32 as receive depth because the CNode of the initial thread has a
|
|
size of 2^12 and a guard of 20.
|
|
|
|
At this point, I am wondering that there is apparently no way to specify a
|
|
receive window rather than an individual CNode for receiving capabilities.
|
|
After revisiting Section 4.2.2 of the manual, I came to the realization that
|
|
that *seL4 does not support delegating more than one capability in a single IPC*.
|
|
From Genode's perspective, this could become an issue because Genode's RPC
|
|
framework generally allows for the delegation of multiple capabilities via a
|
|
single RPC call.
|
|
|
|
That said, the simple capability-delegation test works as expected.
|
|
|
|
When repeatedly performing an IPC call with a delegated capability, the
|
|
RECV_CAP index will be populated by the first call. Subsequent attempts to
|
|
override the RECV_CAP capability do not work (the 'extraCaps' field of the
|
|
received message info remains 0). The receiver has to make sure that the
|
|
specified 'CapReceivePath' is an empty capability slot. I.e., by calling
|
|
'seL4_CNode_Delete' prior 'seL4_Wait'.
|
|
|
|
|
|
Translation of capabilities aka "unwrapping"
|
|
============================================
|
|
|
|
In addition to receiving delegated capabilities under a new name, seL4's IPC
|
|
mechanism allows the recipient of a capability that originated from the
|
|
recipient to obtain a custom defined value instead of receiving a new name.
|
|
In seL4 terminology, this mechanism is called "unwrapping".
|
|
|
|
Capability unwrapping is supposed to happen if the transferred capability is
|
|
a minted endpoint capability and the recipient is the original creator of
|
|
the capability. For testing the mechanism, we replace the 'SECOND_THREAD_CAP'
|
|
argument of the 'seL4_Call' by a minted endpoint capability derived from
|
|
the endpoint used by the second thread.
|
|
|
|
For creating a minted endpoint capability, we allocate a new index for the
|
|
minted capability (EP_MINTED_CAP) and use the 'seL4_CNode_Mint' operation
|
|
as follows:
|
|
|
|
! seL4_CNode const service = seL4_CapInitThreadCNode;
|
|
! seL4_Word const dest_index = EP_MINTED_CAP;
|
|
! uint8_t const dest_depth = 32;
|
|
! seL4_CNode const src_root = seL4_CapInitThreadCNode;
|
|
! seL4_Word const src_index = EP_CAP;
|
|
! uint8_t const src_depth = 32;
|
|
! seL4_CapRights const rights = seL4_Transfer_Mint;
|
|
! seL4_CapData_t const badge = seL4_CapData_Badge_new(111);
|
|
!
|
|
! int const ret = seL4_CNode_Mint(service,
|
|
! dest_index,
|
|
! dest_depth,
|
|
! src_root,
|
|
! src_index,
|
|
! src_depth,
|
|
! rights,
|
|
! badge);
|
|
|
|
The badge is set to the magic value 111.
|
|
When specifying the resulting EP_MINTED_CAP as IPC argument for a 'seL4_Call',
|
|
the kernel will translate the capability to the badge value. The callee
|
|
observes the reception of such an "unwrapped" capability via the
|
|
'capsUnwrapped' field of the 'seL4_MessageInfo' structure returned by the
|
|
'seL4_Wait' operation. The badge value can be obtained from the IPC buffer via
|
|
'seL4_GetBadge(0)'. This simple experiment shows that the mechanism works
|
|
as expected.
|
|
|
|
|
|
Management of virtual memory
|
|
############################
|
|
|
|
Besides synchronous IPC, Genode relies on two other forms of inter-process
|
|
communication, namely asynchronous notifications and shared memory. I will
|
|
save the investigation of the former mechanism for later but focus on seL4's
|
|
mechanisms for managing virtual memory for now.
|
|
|
|
|
|
Virtual-memory management in traditional L4 kernels
|
|
===================================================
|
|
|
|
Traditional L4 kernels rely on an in-kernel mapping database to track memory
|
|
mappings. If a protection domain has access to a range of memory pages, it can
|
|
transfer a "mapping" of those memory pages to other protection domains as IPC
|
|
payload. As a side effect of the IPC operation, the kernel populates the page
|
|
tables of the receiver of the mapping accordingly. In reverse, the originator
|
|
of a mapping can revoke the mapping by using a system call. This system call
|
|
takes a virtual-address range of the calling protection domain as argument and
|
|
flushes all mappings that originated from this range. Because mappings could
|
|
be established transitively, the kernel has to maintain a tree structure
|
|
(mapping database) that records how each memory mapping got established. This
|
|
approach has several problems:
|
|
|
|
First, it intertwines two concerns, namely the delegation of the authority
|
|
over memory pages and making memory pages accessible. In order to have the
|
|
authorization over memory pages, i.e., the right to hand them out to other
|
|
protection domains or to revoke them from other protection domains, a
|
|
protection domain has to make the pages accessible within its own virtual
|
|
address space. For Genode's core component, which needs to have the authority
|
|
over the entirety of physical memory, this raises two problems: First, even
|
|
though core is not supposed to ever touch memory pages that are in use by
|
|
other components, it still has to make those memory pages accessible within
|
|
its virtual address space. This complicates the assessment of the isolation
|
|
properties of core with respect to the components on top. For example, a
|
|
dangling pointer bug in core could leak or corrupt information that should be
|
|
local to an individual component. Second, the virtual address space of core
|
|
limits the amount of physical memory that can be handed out to other
|
|
components. On a 32-bit machine with 4 GiB of memory, core can hand out a
|
|
maximum of 3 GiB to other components because the upper 1 GiB of its virtual
|
|
address space is owned by the kernel.
|
|
|
|
Second, the mapping database keeps records about how mappings got established.
|
|
Thereby, the memory required for storing this information in the kernel
|
|
depends on the behaviour of the user land. As a consequence, a malicious
|
|
user-level program is able to provoke a high consumption of kernel memory
|
|
by establishing mappings. Eventually, this represents an attack vector
|
|
for denial-of-service attacks onto the kernel.
|
|
|
|
|
|
Virtual-memory management in seL4
|
|
=================================
|
|
|
|
Fortunately, the seL4 kernel breaks with this traditional approach.
|
|
Authority over memory pages is represented by capabilities. In fact, in
|
|
order to use a memory page, a capability for the memory page must exist.
|
|
This capability can be delegated. But the possession of the authority
|
|
over a memory page does not imply that the memory page is accessible.
|
|
Very good! Second, seL4 dropped the traditional mapping data base.
|
|
It does not record _how_ memory mappings were established. But it provides
|
|
an thin abstraction of page tables that define _who_ has access to which
|
|
memory page. The backing store for those page tables is managed outside
|
|
the kernel. This effectively alleviates the denial-of-service attack vector
|
|
present in traditional L4 kernels.
|
|
|
|
To test the virtual-memory management facilities of seL4, I decide to
|
|
conduct a simple experiment: I want to attach a page of physical memory into
|
|
the virtual address space of the root task twice. I will write a
|
|
string at the first virtual address and expect to read the same string
|
|
from the second virtual address. Compared to other kernels, this simple
|
|
scenario is quote elaborative on seL4 because the kernel barely abstracts
|
|
from the MMU hardware. I have to perform the following steps:
|
|
|
|
# Creating a page table and a page frame. I want to attach the page somewhere
|
|
at 0x40000000, which is an address range not yet in use but the root task.
|
|
So I have to create an 'seL4_IA32_PageTableObject' first. This is done by
|
|
the 'seL4_Untyped_RetypeAtOffset' operation as for all other kernel objects.
|
|
In order to be able to use a portion of physical memory as actual memory, we
|
|
also have to convert untyped memory to a 'seL4_IA32_4K' kernel object,
|
|
also using the 'Untyped_Retype' operation.
|
|
|
|
# With the page table created, we can tell seL4 to insert the page table
|
|
into the page directory of the root task:
|
|
|
|
! seL4_IA32_PageTable const service = PAGE_TABLE_CAP;
|
|
! seL4_IA32_PageDirectory const pd = seL4_CapInitThreadPD;
|
|
! seL4_Word const vaddr = 0x40000000;
|
|
! seL4_IA32_VMAttributes const attr = seL4_IA32_Default_VMAttributes;
|
|
!
|
|
! int const ret = seL4_IA32_PageTable_Map(service, pd, vaddr, attr);
|
|
|
|
# Now that the virtual memory range at 0x40000000 is backed by a page table,
|
|
we can finally insert our page frame at the designated virtual address:
|
|
|
|
! seL4_IA32_Page const service = PAGE_CAP;
|
|
! seL4_IA32_PageDirectory const pd = seL4_CapInitThreadPD;
|
|
! seL4_Word const vaddr = 0x40001000;
|
|
! seL4_CapRights const rights = seL4_AllRights;
|
|
! seL4_IA32_VMAttributes const attr = seL4_IA32_Default_VMAttributes;
|
|
!
|
|
! int const ret = seL4_IA32_Page_Map(service, pd, vaddr, rights, attr);
|
|
|
|
After these steps, root task is able to touch the memory at 0x40001000
|
|
without crashing, which is just expected.
|
|
|
|
However, the attempt to attach the same page at a second virtual address
|
|
fails:
|
|
|
|
! <<seL4 [decodeIA32FrameInvocation/1630 Te3ffd880 @100025f]:
|
|
! IA32Frame: Frame already mapped.>>
|
|
|
|
This is the point where the similarity of seL4's kernel interface to real
|
|
page tables ends. With real page tables, one physical page frame can be
|
|
present in any number of page tables by simply writing the frame number into
|
|
the corresponding page-table entry. In principle, a frame number corresponds
|
|
to an IA32_4K capability. But in contrast to a frame number, which can be
|
|
reused for any number of page-table entries, on seL4, each insertion of a
|
|
page frame into a page table requires a distinct IA32_4K capability. For a
|
|
given IA32_4K capability, the creation of a second capability that refers to
|
|
the same frame is easy enough:
|
|
|
|
! seL4_CNode const service = seL4_CapInitThreadCNode;
|
|
! seL4_Word const dest_index = PAGE_CAP_2;
|
|
! uint8_t const dest_depth = 32;
|
|
! seL4_CNode const src_root = seL4_CapInitThreadCNode;
|
|
! seL4_Word const src_index = PAGE_CAP;
|
|
! uint8_t const src_depth = 32;
|
|
! seL4_CapRights const rights = seL4_AllRights;
|
|
!
|
|
! int const ret = seL4_CNode_Copy(service, dest_index, dest_depth,
|
|
! src_root, src_index, src_depth, rights);
|
|
|
|
Inserting the page mapping using the copy of the original IA32_4K capability
|
|
works:
|
|
|
|
! seL4_IA32_Page const service = PAGE_CAP_2;
|
|
! seL4_IA32_PageDirectory const pd = seL4_CapInitThreadPD;
|
|
! seL4_Word const vaddr = 0x40002000;
|
|
! seL4_CapRights const rights = seL4_AllRights;
|
|
! seL4_IA32_VMAttributes const attr = seL4_IA32_Default_VMAttributes;
|
|
!
|
|
! int const ret = seL4_IA32_Page_Map(service, pd, vaddr, rights, attr);
|
|
|
|
The subsequent test of writing a string to 0x40001000 and reading it from
|
|
0x40002000 produces the desired result.
|
|
|
|
|
|
*Sentiments*
|
|
|
|
Compared to traditional L4 kernels, the virtual memory management of seL4
|
|
is much more advanced. It solves pressing problems that plagued L4 kernels
|
|
since an eternity. This is extremely valuable!
|
|
|
|
On the other hand, it does so by putting the burden of kernel-resource
|
|
management onto the user land while further complicating the problem
|
|
compared to the underlying mechanism provided by the MMU hardware.
|
|
I understand that the copies of the page-frame capabilities are needed to
|
|
enable to kernel to find and flush all page-table entries when a page frame is
|
|
destroyed. This is a fundamental security property of the kernel.
|
|
|
|
In Genode, each range of physical memory is represented as a dataspace that
|
|
can be attached to different or the same virtual address spaces any number of
|
|
times. The seemingly small detail that the population of each page-table entry
|
|
requires a dedicated kernel object raises quite a challenge.
|
|
We do not only need to perform the book keeping of the physical page
|
|
frames and their corresponding capability numbers, but additionally need to
|
|
create and keep track of an additional kernel object (a copy of the page-frame
|
|
capability) each time a physical page is mapped to a virtual address space.
|
|
It goes without saying that all the book-keeping meta data must be allocated
|
|
somewhere and accounted properly.
|
|
Fortunately, I see how Genode's resource-trading mechanisms could come to
|
|
the rescue here.
|
|
|