Commit Graph

349 Commits

Author SHA1 Message Date
Johannes Schindelin
b4c768b101 Regex: Test Pattern#split(String)
The particular pattern we use to test it is used in ImgLib2, based on
this answer on stackoverflow:

	http://stackoverflow.com/a/279337

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
8ab10a6953 Regex: support special character classes
This adds support for character classes such as \d or \W, leaving \p{...}
style character classes as an exercise for later.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
098f688cd8 Regex: implement negative look-arounds
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
8b611c8075 Regex: support look-behind patterns
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
62d1964779 Regex: add a method to reverse the PikeVM program
A program for the PikeVM corresponds to a regular expression pattern. The
program matches the character sequence in left-to-right order. However,
for look-behind expressions, we will want to match the character sequence
backwards.

To this end, it is nice that regular expression patterns can be reversed
in a straight-forward manner. However, it would be nice if we could avoid
multiple parsing passes and simply parse even look-behind expressions as
if they were look-ahead ones, and then simply reverse the program for that
part.

Happily, it is not difficult to reverse the program so it is equivalent to
matching the pattern backwards.

There is one catch, though. Imagine matching the sequence "a" against the
regular expression "(a?)a?". If we match forward, the group will match the
letter "a", when matching backwards, it will match the empty string. So,
while the reverse pattern is equivalent to the forward pattern in terms of
"does the pattern match that sequence", but not its sub-matches. For that
reason, Java simply ignores capturing groups in look-behind patterns (and
for consistency, the same holds for look-ahead patterns).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
85af36ef90 Regex: support lookaheads
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
d4a2f58eb5 Regex: implement alternatives
Now we support regular expressions like 'A|B|C'.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
c3a06a600a Regex: implement non-capturing groups
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
53563c4f8e Regex: add support for character classes
Now we support regular expression patterns a la '[0-9]'.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
ca428c406c Regex: implement find()
Now that we have non-greedy repeats, we can implement the find() (which
essentially prefixes the regular expression pattern with '.*?'.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
7da03b0f19 Regex: Implement reluctant '?', '*' and '+'
Now that we have reluctant quantifiers, we can get rid of the hardcoded
program for the challenging regular expression pattern.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:11 -06:00
Johannes Schindelin
f979505b3d Regex: implement * and + operators
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
d753edafcd Regex: support the dot
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
e2105670a0 Regex compiler: fall back to TrivialPattern when possible
While at it, let's get rid of the unescaping in TrivialPattern which was
buggy anyway: special operators such as \b were misinterpreted as trivial
patterns.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
04d8955f98 Regex: Implement compiler for regular expression patterns
Originally, this developer wanted to (ab)use the PikeVM with a
hand-crafted program and an added "callback" opcode to parse the regular
expressions.

However, this turned out to be completely unnecessary: there are no
ambiguities in regular expression patterns, so there is no need to do
anything else than parse the pattern, one character at a time, into a
nested expression that then knows how to write itself into a program for
the PikeVM.

For the moment, we still hardcode the program for the regular expression
pattern demonstrating the challenge with the prioritized threads because
the compiler cannot yet parse reluctant operators.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
26c4bf8d8b Regex: add a class for matching character classes
This will be used to match character classes (such as '[0-9a-f]'),
but it will also be used by the regular expression pattern compiler
to determine whether a character has special meaning in regular
expressions.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
d00f799d2e Regex: special-case a(a*?)(a?)(a??)(a+)(a*)a
Among other challenges, this regular expression is designed to demonstrate
that thread prioritization is finicky: Given the string 'aaaaaa' to match,
the first four threads will try to grab the second 'a', the third thread
(the one that matched the '(a??)' group) having scheduled the same
instruction pointer to the '(a+)' group that the second -- higher-priority
-- thread will try to advance to only after processing the '(a??)' group's
SPLIT. The second thread must override the third thread in that case,
essentially stopping the latter.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
edb48ffec2 Regex: support prioritized threads
If we want to match greedy or reluctant regular expressions, we have
to make sure that certain threads are split off with a higher priority
than others. We will use the ThreadQueues' natural order as priority
order: high to low.

To support splitting into different-priority threads, let's introduce
a second SPLIT opcode: SPLIT_JMP. The latter prefers to jump while the
former prefers to execute the opcode directly after the SPLIT opcode.

There is a subtle challenge here, though: let's assume that there are
two current threads and the higher-priority one wants to jump where
the lower-priority one is already. In the PikeVM implementation
before this change, queueImmediately() would see that there is
already a thread queued for that program counter and *not* queue the
higher-priority one.

Example: when matching the pattern '(a?)(a??)(a?)' against the string
'aa', after the first character, the first (high priority) thread
will have matched the first group while the second thread matched the
second group. In the following step, therefore, the first thread will
want to SPLIT_JMP to match the final 'a' to the third group but the
second thread already queued that program counter.

The proposed solution is to introduce a third thread queue: 'queued'.
When queuing threads to be executed after reading the next character
from the string to match, they are not directly queued into 'next' but
into 'queued'. Every thread requiring immediate execution (i.e. before
reading the next character) will be queued into 'current'. Whenever
'current' is drained, the next thread from 'queued' that has not been
queued to 'current' yet will be executed.

That way, we can guarantee that 1) no lower-priority thread can override
a higher-priority thread and 2) infinite loop are prevented.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
63b06ebde8 Regex: optimize matching characters
Instead of having an opcode 'CHAR', let's have the opcodes that fall
within the range of a char *be* the opcode 'match this character'.

While at it, break the ranges of the different types of opcodes apart
into ranges so that related operations are clustered.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
b03283033e Add a unit test for the regular expression engine
We still do not parse the regular expression patterns, but we can at
least test that the hardcoded 'a(bb)+a' works as expected.

This class will be extended as we support more and more features.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
2073d4bffb Prepare the Matcher class for multiple groups
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
e6ad10de04 Implement Pattern / Matcher classes based on the PikeVM
Based on the just-implemented PikeVM, let's test it with a specific
regular expression. At this point, no parsing is implemented but instead
an explicit program executing a(bb)?a is hardcoded.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
944f5f3567 Start implementing a regular expression engine
So far, these are humble beginnings indeed. Based on the descriptions of

	http://swtch.com/%7Ersc/regexp/regexp2.html

I started implementing a Thompson NFA / Pike VM.

The idea being that eventually, regular expressions are to be compiled
into special-purpose bytecode for the Pike VM that executes a varying
number of threads in lock-step over each character of the text to match.

The thread count is bounded by the length of the program: two different
threads with identical instruction pointer at the same character-to-match
would yield exactly the same outcome (and therefore, we can execute just
one such thread instead of possibly many).

To allow for matching groups, each thread carries a state with it, saving
the group offsets acquired so far.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Johannes Schindelin
84829dc390 Refactor Pattern / Matcher classes
This makes both the Pattern and the Matcher class abstract so that more
specialized patterns than the trivial patterns we support so far can be
implemented as convenient subclasses of the respective abstract base
classes.

To ease development, we work on copies in test/regex/ in the 'regex'
package. That way, it can be developed in Eclipse (because it does not
interfere with Oracle JRE's java.util.regex.* classes).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-12-03 12:28:10 -06:00
Joshua Warner
c3bbe555be make Sockets test Java6-compilable, make it more generic, and move it to 'extra' 2013-11-08 10:05:53 -07:00
Ilya Mizus
45ee25f68c Implement socket API 2013-11-08 09:55:43 -07:00
Joshua Warner
fd81e126ef fix Dates test for openjdk and stub out java.util.TimeZone 2013-11-07 20:44:02 -07:00
Joshua Warner
76b0bb4872 remove non-conforming ZipEntry.getJavaTime API and associated tests (which failed the openjdk build) 2013-11-07 19:13:13 -07:00
Joshua Warner
dd460ab55e Merge pull request #99 from dscho/fix-get-annotation
Fix NPE in Field#getAnnotation
2013-11-06 09:03:45 -08:00
Joshua Warner
4cf3d9de88 Merge pull request #95 from dscho/compatible-serialization
Java-compatible (de)serialization of TreeMap, ArrayList and Number
2013-11-06 09:02:12 -08:00
Joshua Warner
d0d4f600dc Merge pull request #94 from dscho/serialization
Implement Java-compatible serialization
2013-11-06 08:49:14 -08:00
Johannes Schindelin
ff50034206 Fix NPE in Field#getAnnotation
When the class whose field is to be inspected has no annotations at all,
at least my javac here (1.6.0_51 on MacOSX) does not produce any class
addendum.

Therefore, let's verify that the addendum is not null before proceeding.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-11-06 09:46:56 -06:00
Johannes Schindelin
dddd9e5016 Serialize test: augment the hexdump with address and ASCII dump
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-11-06 09:12:18 -06:00
Johannes Schindelin
7e72f4362b Add a test to ensure TreeMap's (de)serialization compatibility with OpenJDK
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-11-06 09:12:18 -06:00
Johannes Schindelin
48e0912ad4 Test the new, Java-compatible (de)serialization
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-11-06 09:10:51 -06:00
Johannes Schindelin
6159f5cd3c Support Logger#log(Level,String,Object)
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-11-06 09:07:58 -06:00
Johannes Schindelin
dba8d39e63 Implement Class#getDeclaredClasses
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-11-06 09:07:58 -06:00
Joshua Warner
da69b735f4 Merge branch 'addzip' of git://github.com/CUBoulderBoy/avian into CUBoulderBoy-addzip 2013-11-04 17:38:27 -07:00
Joshua Warner
d128838617 Merge pull request #92 from dscho/collections
Various improvements regarding Collections
2013-11-04 16:33:06 -08:00
Joshua Warner
ac32e0de39 Merge pull request #91 from dscho/intro-sort
Replace Arrays.sort() with an efficient sort algorithm
2013-11-04 16:31:08 -08:00
Johannes Schindelin
a2feec0bab Add a pseudo-integration test for getResources()
This adds an extra class path element to the VM running the unit tests,
writes files with identical file names into both directories and then
verifies that SystemClassLoader#getResources can find them.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-11-04 16:53:02 -06:00
Johannes Schindelin
6a81623690 Implement the Arrays#copyOf family
... as introduced in Java 6.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-11-04 15:11:00 -06:00
Johannes Schindelin
d37b5ada37 Implement Collections#sort
This is really a verbatim translation of Arrays#sort.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-11-04 12:08:22 -06:00
Johannes Schindelin
605701e40a Replace Arrays.sort() with an efficient sort algorithm
This change reuses the existing insertion sort (which was previously what
Arrays.sort() executed) in a full intro sort pipeline.

The implementation is based on the Musser paper on intro sort (Musser,
David R. "Introspective sorting and selection algorithms." Softw., Pract.
Exper. 27.8 (1997): 983-993.) and Wikipedia's current description of the
heap sort: http://en.wikipedia.org/wiki/Heapsort.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-11-04 00:27:04 -06:00
Johannes Schindelin
95fcc9ac8e Test the newly-introduced Integer#decode method
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-10-25 15:32:33 -05:00
Johannes Schindelin
c1ec6020a6 Verify that String#lastIndexOf handles large fromIndex correctly
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-10-25 15:32:33 -05:00
Johannes Schindelin
6f64e3aaab Verify that two-dimensional object arrays have the correct component type
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-10-25 15:32:33 -05:00
Joshua Warner
b20dcd268c Merge pull request #85 from dscho/simple-regex
Simple regex
2013-10-21 13:30:57 -07:00
Johannes Schindelin
359f99c0f7 Add a unit test for regular expressions
We do not really support regular expressions yet, but we do support
trivial patterns including ones with escaped characters. Let's make sure
that that works as advertised.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-10-21 13:34:33 -05:00
Johannes Schindelin
db99aada94 Test date parsing/formatting
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2013-10-21 10:41:40 -05:00