Monday, August 23, 2004

Java Regex Speed

Tim Bray:

There are all sorts of variations around I/O and so on, but my finding is that for this problem, the Java 1.4.2 regex processing is somewhere around twice as fast as Perl 5.8.1. Frankly, I’m astounded.


Behold, the power of UTF-8!

Tim's likely running over Unicode data. Perl 5 stores Unicode in UTF-8 format, a variable-width storage form. It's really, really inefficient to access, though it does take up very little space. Java uses UTF-16, which is effectively a fixed-width format. (And yes, I know about combining characters and alternate planes and such.) I fully expect the place Perl buys it big time is in the code that has to do character boundary checking. (This is one of the reasons Parrot's going with a fixed-width encoding scheme. Variable-width schemes suck.)
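To make the boundary-checking cost concrete, here is a rough sketch in modern Java (not Perl's or Java's actual internals). It assumes well-formed UTF-8 and a sample string that stays inside the Basic Multilingual Plane; the class and method names are made up for illustration. Finding the n-th character in the UTF-8 bytes means walking every lead byte before it, while the UTF-16 string can index straight to it.

```java
import java.nio.charset.StandardCharsets;

class EncodingWidthDemo {
    // Hypothetical helper: byte offset of the n-th code point in well-formed UTF-8.
    // There is no way to jump there directly; each earlier lead byte must be
    // inspected to learn how many bytes that character occupies.
    static int utf8OffsetOfCodePoint(byte[] utf8, int n) {
        int offset = 0;
        for (int i = 0; i < n; i++) {
            int lead = utf8[offset] & 0xFF;
            if (lead < 0x80) {
                offset += 1;        // ASCII, one byte
            } else if (lead < 0xE0) {
                offset += 2;        // two-byte sequence
            } else if (lead < 0xF0) {
                offset += 3;        // three-byte sequence
            } else {
                offset += 4;        // four-byte sequence
            }
        }
        return offset;
    }

    public static void main(String[] args) {
        String s = "a\u00e9\u20acz"; // 'a', e-acute, euro sign, 'z' (all in the BMP)
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(utf8.length);                    // 7 bytes: 1 + 2 + 3 + 1
        System.out.println(s.length());                     // 4 UTF-16 code units
        System.out.println(utf8OffsetOfCodePoint(utf8, 3)); // 6, found by scanning
        System.out.println(s.charAt(3));                    // z, found by direct indexing
    }
}
```

The same string is smaller in UTF-8 (7 bytes against 8 for UTF-16), which is the space-for-speed trade described above.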

If we call a method, say, 1,000,000 times, I have found a tenfold difference in speed between a plain validation method and a regex. Is there any way to speed up the regex? I need it for validations.

I have coded and tested this in Java.
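One thing worth checking, sketched below under assumptions: if the validation goes through `String.matches`, the pattern is recompiled on every call, and compiling it once with `Pattern.compile` and reusing it usually helps. The rule here (1 to 10 ASCII digits) and all class and method names are invented for illustration, since the comment does not say what is being validated. The timing loop is a crude measurement, not a proper benchmark, so treat its numbers as a rough guide only.

```java
import java.util.regex.Pattern;

class PrecompiledRegexDemo {
    // Hypothetical validation rule: 1 to 10 ASCII digits.
    private static final String DIGITS_REGEX = "\\d{1,10}";

    // Compiled once and reused; Pattern objects are immutable and thread-safe.
    private static final Pattern DIGITS = Pattern.compile(DIGITS_REGEX);

    // Recompiles the pattern on every call.
    static boolean isValidRecompiling(String input) {
        return input.matches(DIGITS_REGEX);
    }

    // Reuses the precompiled pattern.
    static boolean isValidPrecompiled(String input) {
        return DIGITS.matcher(input).matches();
    }

    // Hand-written equivalent of the same rule, for comparison.
    static boolean isValidByHand(String input) {
        int n = input.length();
        if (n < 1 || n > 10) {
            return false;
        }
        for (int i = 0; i < n; i++) {
            char c = input.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        final int iterations = 1_000_000;
        final String sample = "1234567";
        int hits = 0; // accumulated so the calls have an observable result

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (isValidRecompiling(sample)) hits++;
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (isValidPrecompiled(sample)) hits++;
        }
        long t2 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (isValidByHand(sample)) hits++;
        }
        long t3 = System.nanoTime();

        System.out.println("recompiling: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("precompiled: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("by hand:     " + (t3 - t2) / 1_000_000 + " ms");
        System.out.println("hits: " + hits);
    }
}
```

For a rule this simple, the hand-written check will likely remain the fastest of the three; precompiling narrows the gap rather than closing it.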

