Now that we have non-greedy repeats, we can implement the find() (which
essentially prefixes the regular expression pattern with '.*?'.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Now that we have reluctant quantifiers, we can get rid of the hardcoded
program for the challenging regular expression pattern.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
While at it, let's get rid of the unescaping in TrivialPattern which was
buggy anyway: special operators such as \b were misinterpreted as trivial
patterns.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Originally, this developer wanted to (ab)use the PikeVM with a
hand-crafted program and an added "callback" opcode to parse the regular
expressions.
However, this turned out to be completely unnecessary: there are no
ambiguities in regular expression patterns, so there is no need to do
anything else than parse the pattern, one character at a time, into a
nested expression that then knows how to write itself into a program for
the PikeVM.
For the moment, we still hardcode the program for the regular expression
pattern demonstrating the challenge with the prioritized threads because
the compiler cannot yet parse reluctant operators.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This will be used to match character classes (such as '[0-9a-f]'),
but it will also be used by the regular expression pattern compiler
to determine whether a character has special meaning in regular
expressions.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Among other challenges, this regular expression is designed to demonstrate
that thread prioritization is finicky: Given the string 'aaaaaa' to match,
the first four threads will try to grab the second 'a', the third thread
(the one that matched the '(a??)' group) having scheduled the same
instruction pointer to the '(a+)' group that the second -- higher-priority
-- thread will try to advance to only after processing the '(a??)' group's
SPLIT. The second thread must override the third thread in that case,
essentially stopping the latter.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
If we want to match greedy or reluctant regular expressions, we have
to make sure that certain threads are split off with a higher priority
than others. We will use the ThreadQueues' natural order as priority
order: high to low.
To support splitting into different-priority threads, let's introduce
a second SPLIT opcode: SPLIT_JMP. The latter prefers to jump while the
former prefers to execute the opcode directly after the SPLIT opcode.
There is a subtle challenge here, though: let's assume that there are
two current threads and the higher-priority one wants to jump where
the lower-priority one is already. In the PikeVM implementation
before this change, queueImmediately() would see that there is
already a thread queued for that program counter and *not* queue the
higher-priority one.
Example: when matching the pattern '(a?)(a??)(a?)' against the string
'aa', after the first character, the first (high priority) thread
will have matched the first group while the second thread matched the
second group. In the following step, therefore, the first thread will
want to SPLIT_JMP to match the final 'a' to the third group but the
second thread already queued that program counter.
The proposed solution is to introduce a third thread queue: 'queued'.
When queuing threads to be executed after reading the next character
from the string to match, they are not directly queued into 'next' but
into 'queued'. Every thread requiring immediate execution (i.e. before
reading the next character) will be queued into 'current'. Whenever
'current' is drained, the next thread from 'queued' that has not been
queued to 'current' yet will be executed.
That way, we can guarantee that 1) no lower-priority thread can override
a higher-priority thread and 2) infinite loop are prevented.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Instead of having an opcode 'CHAR', let's have the opcodes that fall
within the range of a char *be* the opcode 'match this character'.
While at it, break the ranges of the different types of opcodes apart
into ranges so that related operations are clustered.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
We still do not parse the regular expression patterns, but we can at
least test that the hardcoded 'a(bb)+a' works as expected.
This class will be extended as we support more and more features.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Based on the just-implemented PikeVM, let's test it with a specific
regular expression. At this point, no parsing is implemented but instead
an explicit program executing a(bb)?a is hardcoded.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
So far, these are humble beginnings indeed. Based on the descriptions of
http://swtch.com/%7Ersc/regexp/regexp2.html
I started implementing a Thompson NFA / Pike VM.
The idea being that eventually, regular expressions are to be compiled
into special-purpose bytecode for the Pike VM that executes a varying
number of threads in lock-step over each character of the text to match.
The thread count is bounded by the length of the program: two different
threads with identical instruction pointer at the same character-to-match
would yield exactly the same outcome (and therefore, we can execute just
one such thread instead of possibly many).
To allow for matching groups, each thread carries a state with it, saving
the group offsets acquired so far.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This makes both the Pattern and the Matcher class abstract so that more
specialized patterns than the trivial patterns we support so far can be
implemented as convenient subclasses of the respective abstract base
classes.
To ease development, we work on copies in test/regex/ in the 'regex'
package. That way, it can be developed in Eclipse (because it does not
interfere with Oracle JRE's java.util.regex.* classes).
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
... for proper statistics (I thought I was #5 contributor at the
time I started the mailmap, but I was only #6).
Unfortunately, I could not find the full name of Stan
<goo.in.my.shoes@gmail.com> for proper credit.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
When the class whose field is to be inspected has no annotations at all,
at least my javac here (1.6.0_51 on MacOSX) does not produce any class
addendum.
Therefore, let's verify that the addendum is not null before proceeding.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This is done by implementing the readObject()/writeObject() method
pair as demanded by the serialization specification. The specifics
were reverse-engineered from serializing trivial TreeMap instances
with OpenJDK.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This will be needed for Java-compatible serialization of tree maps.
Note that the field should be null when the TreeMap uses the default
comparator.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
We punted previously on any serializable super class' descriptor and
simply expected the super class not to be serializable (and consequently,
we expected the respective descriptor to be null). However, for quite
common classes, e.g. OpenJDK's Double class, this is not true.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
There are serialized objects out in the wild which make heavy use of
TC_REFERENCE: for example when an object has a reference to itself.
Therefore we need to support that, too.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
We punted previously on any serializable super class' descriptor and
simply expected the super class not to be serializable (and consequently,
we expected the respective descriptor to be null).
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The specification of the Java deserialization demands that a private
readObject(ObjectOutputStream) method is used -- if it exists. In
that case, ObjectInputStream must not initialize the contents of the
fields (called 'classdata[]' in the documentation) but offer that
functionality via the defaultReadObject() method.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The specification of the Java serialization demands that a private
writeObject(ObjectOutputStream) method is used -- if it exists. In that
case, ObjectOutputStream must not write the contents of the fields
(called 'classdata[]' in the documentation) but offer that via the
defaultWriteObject() method.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The serialization protocol specifies a quick method to serialize
a String (because that is so common an operation): TC_STRING +
(short)length + bytes. Let's use that, also to make it easier to test
the upcoming changes to TreeMap harmonizing that Avian's serialization
of said class with OpenJDK's.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This is by no means a complete support for the deserialization compliant
to the Java Language Specification, but it is better to add the support
incrementally, for better readability of the commits.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The Java Language Specification documents the serialization protocol
implemented by this change set:
http://docs.oracle.com/javase/7/docs/platform/serialization/spec/protocol.html#10258
This change is intended to make it easier to use Avian VM as a drop-in
replacement for the Oracle JVM when serializing objects.
The previous serialization code is still available as
avian.LegacyObjectInputStream.
This commit only implements the non-object parts of the deserialization
specification.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>