/* Copyright (c) 2008-2013, Avian Contributors

   Permission to use, copy, modify, and/or distribute this software
   for any purpose with or without fee is hereby granted, provided
   that the above copyright notice and this permission notice appear
   in all copies.

   There is NO WARRANTY for this software. See license.txt for
   details. */

package regex;

/**
 * Opcodes for the Pike VM.
 * <p>
 * See {@link PikeVM}.
 * </p>
 *
 * @author Johannes Schindelin
 */
interface PikeVMOpcodes {
  final static int DOT = -1;
  final static int DOTALL = -2;

  final static int WORD_BOUNDARY = -10;
  final static int NON_WORD_BOUNDARY = -11;
  final static int LINE_START = -12;
  final static int LINE_END = -13;

  final static int CHARACTER_CLASS = -20;

  final static int LOOKAHEAD = -30;
  final static int LOOKBEHIND = -31;
  final static int NEGATIVE_LOOKAHEAD = -32;
  final static int NEGATIVE_LOOKBEHIND = -33;

  final static int SAVE_OFFSET = -40;

  final static int SPLIT = -50;
  /*
   * Regex: support prioritized threads
   *
   * If we want to match greedy or reluctant regular expressions, we have
   * to make sure that certain threads are split off with a higher priority
   * than others. We will use the ThreadQueues' natural order as priority
   * order: high to low.
   *
   * To support splitting into different-priority threads, let's introduce
   * a second SPLIT opcode: SPLIT_JMP. The latter prefers to jump while the
   * former prefers to execute the opcode directly after the SPLIT opcode.
   *
   * There is a subtle challenge here, though: let's assume that there are
   * two current threads and the higher-priority one wants to jump to where
   * the lower-priority one already is. In the PikeVM implementation
   * before this change, queueImmediately() would see that there is
   * already a thread queued for that program counter and *not* queue the
   * higher-priority one.
   *
   * Example: when matching the pattern '(a?)(a??)(a?)' against the string
   * 'aa', after the first character, the first (high-priority) thread
   * will have matched the first group while the second thread will have
   * matched the second group. In the following step, therefore, the first
   * thread will want to SPLIT_JMP to match the final 'a' to the third
   * group, but the second thread has already queued that program counter.
   *
   * The proposed solution is to introduce a third thread queue: 'queued'.
   * When queuing threads to be executed after reading the next character
   * from the string to match, they are not directly queued into 'next' but
   * into 'queued'. Every thread requiring immediate execution (i.e. before
   * reading the next character) will be queued into 'current'. Whenever
   * 'current' is drained, the next thread from 'queued' that has not been
   * queued to 'current' yet will be executed.
   *
   * That way, we can guarantee that 1) no lower-priority thread can
   * override a higher-priority thread and 2) infinite loops are prevented.
   *
   * Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
   * Date: 2013-11-11 22:36:19 +00:00
   */
  final static int SPLIT_JMP = -51; // this split prefers to jump
  final static int JMP = -52;
  /*
   * Regex: add a method to reverse the PikeVM program
   *
   * A program for the PikeVM corresponds to a regular expression pattern.
   * The program matches the character sequence in left-to-right order.
   * However, for look-behind expressions, we will want to match the
   * character sequence backwards.
   *
   * To this end, it is nice that regular expression patterns can be
   * reversed in a straightforward manner. However, it would be nice if we
   * could avoid multiple parsing passes and simply parse even look-behind
   * expressions as if they were look-ahead ones, and then simply reverse
   * the program for that part.
   *
   * Happily, it is not difficult to reverse the program so it is
   * equivalent to matching the pattern backwards.
   *
   * There is one catch, though. Imagine matching the sequence "a" against
   * the regular expression "(a?)a?". If we match forward, the group will
   * match the letter "a"; when matching backwards, it will match the empty
   * string. So the reverse pattern is equivalent to the forward pattern in
   * terms of "does the pattern match that sequence", but not in terms of
   * its sub-matches. For that reason, Java simply ignores capturing groups
   * in look-behind patterns (and for consistency, the same holds for
   * look-ahead patterns).
   *
   * Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
   * Date: 2013-11-13 17:13:06 +00:00
   */
  final static int SINGLE_ARG_START = CHARACTER_CLASS;
  final static int SINGLE_ARG_END = JMP;
}
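
The difference between SPLIT and SPLIT_JMP described in the notes on prioritized threads can be illustrated with a small sketch. It is not the PikeVM: it uses plain backtracking instead of thread queues, and it assumes that the int following SPLIT, SPLIT_JMP or JMP is the jump target. The `MATCH` marker, the class name and the `run` method are hypothetical names introduced only for this illustration.

```java
class SplitPrioritySketch {
  // Values copied from PikeVMOpcodes.
  static final int SPLIT = -50;
  static final int SPLIT_JMP = -51;
  static final int JMP = -52;
  // Hypothetical end-of-program marker; not part of PikeVMOpcodes.
  static final int MATCH = -1000;

  /**
   * Runs the program from pc at input offset i and returns the end offset
   * of the highest-priority match, or -1 if nothing matches.
   * Assumption: the int following SPLIT, SPLIT_JMP or JMP is the jump target.
   */
  static int run(int[] program, int pc, String input, int i) {
    int op = program[pc];
    if (op == MATCH) return i;
    if (op == JMP) return run(program, program[pc + 1], input, i);
    if (op == SPLIT || op == SPLIT_JMP) {
      int fallThrough = pc + 2;
      int target = program[pc + 1];
      // SPLIT prefers the next opcode, SPLIT_JMP prefers the jump target.
      int preferred = op == SPLIT ? fallThrough : target;
      int other = op == SPLIT ? target : fallThrough;
      int end = run(program, preferred, input, i);
      return end >= 0 ? end : run(program, other, input, i);
    }
    // Any other value is treated as a literal character.
    if (i < input.length() && input.charAt(i) == op) {
      return run(program, pc + 1, input, i + 1);
    }
    return -1;
  }

  public static void main(String[] args) {
    int[] greedy = { SPLIT, 3, 'a', MATCH };        // a?
    int[] reluctant = { SPLIT_JMP, 3, 'a', MATCH }; // a??
    System.out.println(run(greedy, 0, "a", 0));     // prints 1
    System.out.println(run(reluctant, 0, "a", 0));  // prints 0
  }
}
```

Both programs accept the input "a"; the only difference is which branch is tried first, and that decides whether the 'a' is consumed.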
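
The two pattern examples from the notes, '(a?)(a??)(a?)' against 'aa' and "(a?)a?" against "a", can be checked against the JDK's `java.util.regex` as a reference. This is only a sketch of the expected group contents, not a test of Avian's PikeVM. The backwards match is emulated by reversing the pattern by hand to "a?(a?)", which is an assumption made for this illustration; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupPrioritySketch {
  /** Returns capturing groups 1..n of a full match, or null if there is no match. */
  static String[] groups(String regex, String input) {
    Matcher m = Pattern.compile(regex).matcher(input);
    if (!m.matches()) return null;
    String[] result = new String[m.groupCount()];
    for (int g = 1; g <= m.groupCount(); g++) {
      result[g - 1] = m.group(g);
    }
    return result;
  }

  public static void main(String[] args) {
    // Greedy, reluctant, greedy: the middle group stays empty.
    System.out.println(String.join("|", groups("(a?)(a??)(a?)", "aa"))); // prints a||a
    // Forward match: the group consumes the "a".
    System.out.println(groups("(a?)a?", "a")[0].isEmpty()); // prints false
    // Hand-reversed pattern: the group is left with the empty string.
    System.out.println(groups("a?(a?)", "a")[0].isEmpty()); // prints true
  }
}
```

The second and third calls accept the same input but disagree on the group, which is the sub-match discrepancy the notes on reversing the program describe.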