corda

ExternalVendorCode/corda

Fork 0

mirror of https://github.com/corda/corda.git synced 2025-01-09 14:33:30 +00:00

Commit Graph

Author	SHA1	Message	Date
Johannes Schindelin	edb48ffec2	Regex: support prioritized threads If we want to match greedy or reluctant regular expressions, we have to make sure that certain threads are split off with a higher priority than others. We will use the ThreadQueues' natural order as priority order: high to low. To support splitting into different-priority threads, let's introduce a second SPLIT opcode: SPLIT_JMP. The latter prefers to jump while the former prefers to execute the opcode directly after the SPLIT opcode. There is a subtle challenge here, though: let's assume that there are two current threads and the higher-priority one wants to jump where the lower-priority one is already. In the PikeVM implementation before this change, queueImmediately() would see that there is already a thread queued for that program counter and not queue the higher-priority one. Example: when matching the pattern '(a?)(a??)(a?)' against the string 'aa', after the first character, the first (high priority) thread will have matched the first group while the second thread matched the second group. In the following step, therefore, the first thread will want to SPLIT_JMP to match the final 'a' to the third group but the second thread already queued that program counter. The proposed solution is to introduce a third thread queue: 'queued'. When queuing threads to be executed after reading the next character from the string to match, they are not directly queued into 'next' but into 'queued'. Every thread requiring immediate execution (i.e. before reading the next character) will be queued into 'current'. Whenever 'current' is drained, the next thread from 'queued' that has not been queued to 'current' yet will be executed. That way, we can guarantee that 1) no lower-priority thread can override a higher-priority thread and 2) infinite loop are prevented. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>	2013-12-03 12:28:10 -06:00
Johannes Schindelin	63b06ebde8	Regex: optimize matching characters Instead of having an opcode 'CHAR', let's have the opcodes that fall within the range of a char be the opcode 'match this character'. While at it, break the ranges of the different types of opcodes apart into ranges so that related operations are clustered. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>	2013-12-03 12:28:10 -06:00
Johannes Schindelin	944f5f3567	Start implementing a regular expression engine So far, these are humble beginnings indeed. Based on the descriptions of http://swtch.com/%7Ersc/regexp/regexp2.html I started implementing a Thompson NFA / Pike VM. The idea being that eventually, regular expressions are to be compiled into special-purpose bytecode for the Pike VM that executes a varying number of threads in lock-step over each character of the text to match. The thread count is bounded by the length of the program: two different threads with identical instruction pointer at the same character-to-match would yield exactly the same outcome (and therefore, we can execute just one such thread instead of possibly many). To allow for matching groups, each thread carries a state with it, saving the group offsets acquired so far. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>	2013-12-03 12:28:10 -06:00

Author

SHA1

Message

Date

Johannes Schindelin

edb48ffec2

Regex: support prioritized threads

If we want to match greedy or reluctant regular expressions, we have
to make sure that certain threads are split off with a higher priority
than others. We will use the ThreadQueues' natural order as priority
order: high to low.

To support splitting into different-priority threads, let's introduce
a second SPLIT opcode: SPLIT_JMP. The latter prefers to jump while the
former prefers to execute the opcode directly after the SPLIT opcode.

There is a subtle challenge here, though: let's assume that there are
two current threads and the higher-priority one wants to jump where
the lower-priority one is already. In the PikeVM implementation
before this change, queueImmediately() would see that there is
already a thread queued for that program counter and *not* queue the
higher-priority one.

Example: when matching the pattern '(a?)(a??)(a?)' against the string
'aa', after the first character, the first (high priority) thread
will have matched the first group while the second thread matched the
second group. In the following step, therefore, the first thread will
want to SPLIT_JMP to match the final 'a' to the third group but the
second thread already queued that program counter.

The proposed solution is to introduce a third thread queue: 'queued'.
When queuing threads to be executed after reading the next character
from the string to match, they are not directly queued into 'next' but
into 'queued'. Every thread requiring immediate execution (i.e. before
reading the next character) will be queued into 'current'. Whenever
'current' is drained, the next thread from 'queued' that has not been
queued to 'current' yet will be executed.

That way, we can guarantee that 1) no lower-priority thread can override
a higher-priority thread and 2) infinite loop are prevented.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

2013-12-03 12:28:10 -06:00

Johannes Schindelin

63b06ebde8

Regex: optimize matching characters

Instead of having an opcode 'CHAR', let's have the opcodes that fall
within the range of a char *be* the opcode 'match this character'.

While at it, break the ranges of the different types of opcodes apart
into ranges so that related operations are clustered.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

2013-12-03 12:28:10 -06:00

Johannes Schindelin

944f5f3567

Start implementing a regular expression engine

So far, these are humble beginnings indeed. Based on the descriptions of

	http://swtch.com/%7Ersc/regexp/regexp2.html

I started implementing a Thompson NFA / Pike VM.

The idea being that eventually, regular expressions are to be compiled
into special-purpose bytecode for the Pike VM that executes a varying
number of threads in lock-step over each character of the text to match.

The thread count is bounded by the length of the program: two different
threads with identical instruction pointer at the same character-to-match
would yield exactly the same outcome (and therefore, we can execute just
one such thread instead of possibly many).

To allow for matching groups, each thread carries a state with it, saving
the group offsets acquired so far.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

2013-12-03 12:28:10 -06:00

3 Commits