package com.google.common.base;
import static com.google.common.base.Preconditions.checkArgument;
import static com.google.common.base.Preconditions.checkNotNull;
import javax.annotation.CheckReturnValue;
Determines a true or false value for any Java char value, just as Predicate does for any Object. Also offers basic text processing methods based on this function.
Implementations are strongly encouraged to be side-effect-free and immutable.
Throughout the documentation of this class, the phrase "matching character" is used to mean "any character c for which this.matches(c) returns true".
Note: This class deals only with char values; it does not understand supplementary Unicode code points in the range 0x10000 to 0x10FFFF. Such logical characters are encoded into a String using surrogate pairs, and a CharMatcher treats these just as two separate characters.
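The note above can be checked with plain JDK calls. The following is a minimal sketch, not part of CharMatcher: the class name SurrogateSketch and its method are hypothetical, and only java.lang.Character and String methods are used. It shows that one supplementary code point occupies two char positions, so any per-char test sees two separate values.

```java
final class SurrogateSketch {
  /** Counts the char positions in s that hold a surrogate (half of a surrogate pair). */
  static int countSurrogateChars(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); i++) {
      // A per-char test like this one never sees the combined code point.
      if (Character.isSurrogate(s.charAt(i))) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // U+1F600 lies in the supplementary range 0x10000 to 0x10FFFF.
    String s = "a" + new String(Character.toChars(0x1F600));
    System.out.println(s.length());                      // char count
    System.out.println(s.codePointCount(0, s.length())); // code point count
    System.out.println(countSurrogateChars(s));
  }
}
```

Code that needs to treat supplementary characters as single units would have to iterate by code point instead, for example with String.codePoints().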
Example usages:
String trimmed = WHITESPACE.trimFrom(userInput);
if (ASCII.matchesAllOf(s)) { ... }
See the Guava User Guide article on CharMatcher.
- Author: Kevin Bourrillion
- Since: 1.0
Determines whether a character is a breaking whitespace (that is, a whitespace which can be interpreted as a break between words for formatting purposes). See WHITESPACE for a discussion of that term.
anyOf("\t\n\013\f\r \u0085\u1680\u2028\u2029\u205f\u3000")
Determines whether a character is ASCII, meaning that its code point is less than 128.
Determines whether a character is a digit according to Unicode.
"\u0660\u06f0\u07c0\u0966\u09e6\u0a66\u0ae6\u0b66\u0be6\u0c66"
+ "\u0ce6\u0d66\u0e50\u0ed0\u0f20\u1040\u1090\u17e0\u1810\u1946"
+ "\u19d0\u1b50\u1bb0\u1c40\u1c50\ua620\ua8d0\ua900\uaa50\uff10";
digit = digit.or(inRange(base, (char) (base + 9)));
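The fragment above builds the digit matcher from a list of "zero" characters, treating each zero and the nine code units after it as one block of digits. A minimal standalone sketch of that idea follows. The class name DigitRangeSketch is hypothetical, and the leading ASCII '0' in the string is an assumption, since the excerpt starts mid-expression.

```java
final class DigitRangeSketch {
  // Zero digits copied from the excerpt; the leading ASCII '0' is an assumption.
  private static final String ZEROES =
      "0\u0660\u06f0\u07c0\u0966\u09e6\u0a66\u0ae6\u0b66\u0be6\u0c66"
          + "\u0ce6\u0d66\u0e50\u0ed0\u0f20\u1040\u1090\u17e0\u1810\u1946"
          + "\u19d0\u1b50\u1bb0\u1c40\u1c50\ua620\ua8d0\ua900\uaa50\uff10";

  /** Returns true if c falls in the ten-character block that starts at any listed zero. */
  static boolean isDigit(char c) {
    for (int i = 0; i < ZEROES.length(); i++) {
      char base = ZEROES.charAt(i);
      if (base <= c && c <= (char) (base + 9)) {
        return true;
      }
    }
    return false;
  }
}
```

A linear scan over thirty-odd ranges is only for illustration; the source combines the ranges with or() instead.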
Determines whether a character is a digit according to Java's definition. If you only care to match ASCII digits, you can use inRange('0', '9').
Determines whether a character is a letter according to Java's definition. If you only care to match letters of the Latin alphabet, you can use inRange('a', 'z').or(inRange('A', 'Z')).
Determines whether a character is a letter or digit according to Java's definition (Character.isLetterOrDigit(char)).
Determines whether a character is upper case according to Java's definition.
Determines whether a character is lower case according to Java's definition.
Determines whether a character is an ISO control character as specified by Character.isISOControl(char).
Determines whether a character is invisible; that is, if its Unicode category is any of
SPACE_SEPARATOR, LINE_SEPARATOR, PARAGRAPH_SEPARATOR, CONTROL, FORMAT, SURROGATE, and
PRIVATE_USE according to ICU4J.
.or(anyOf("\u06dd\u070f\u1680\u180e"))
.or(anyOf("\ufeff\ufff9\ufffa\ufffb"))
Determines whether a character is single-width (not double-width). When in doubt, this matcher errs on the side of returning false (that is, it tends to assume a character is double-width).
Note: as the reference file evolves, we will modify this constant to keep it up to
date.
return (sequence.length() == 0) ? -1 : 0;
int length = sequence.length();
return (start == length) ? -1 : start;
return sequence.length() == 0;
char[] array = new char[sequence.length()];
Arrays.fill(array, replacement);
for (int i = 0; i < sequence.length(); i++) {
return (sequence.length() == 0) ? "" : String.valueOf(replacement);
int length = sequence.length();
return sequence.length() == 0;
Returns a char matcher that matches only one specified character.
return other.matches(match) ? other : super.or(other);
Returns a char matcher that matches any character except the one specified.
To negate another CharMatcher, use negate().
return other.matches(match) ? super.and(other) : other;
Returns a char matcher that matches any character present in the given character sequence.
final char match1 = sequence.charAt(0);
final char match2 = sequence.charAt(1);
return c == match1 || c == match2;
Returns a char matcher that matches any character not present in the given character sequence.
Returns a char matcher that matches any character in a given range (both endpoints are inclusive). For example, to match any lowercase letter of the English alphabet, use CharMatcher.inRange('a', 'z').
- Throws:
- IllegalArgumentException if endInclusive < startInclusive
return inRange(startInclusive, endInclusive, description);
return startInclusive <= c && c <= endInclusive;
if (c++ == endInclusive) {
Returns a matcher with identical behavior to the given Character-based predicate, but which operates on primitive char instances instead.
return predicate.apply(c);
Sets the toString() from the given description.
Constructor for use by subclasses. When subclassing, you may want to override toString() to provide a useful description.
Determines a true or false value for the given character.
public abstract boolean matches(char c);
Returns a matcher that matches any character not matched by this matcher.
Returns a matcher that matches any character matched by both this matcher and other.
this(a, b, "CharMatcher.and(" + a + ", " + b + ")");
return new And(this, other);
Returns a matcher that matches any character matched by either this matcher or other.
this(a, b, "CharMatcher.or(" + a + ", " + b + ")");
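The negate, and, and or combinators described above compose per-character tests. As a rough illustration only, here is a minimal sketch using a hypothetical functional interface named CharTest. It is not the CharMatcher class, and it omits the toString() descriptions that the real And and Or constructors build.

```java
interface CharTest {
  boolean matches(char c);

  /** Matches any character between the two endpoints, both inclusive. */
  static CharTest inRange(char startInclusive, char endInclusive) {
    if (endInclusive < startInclusive) {
      throw new IllegalArgumentException("endInclusive < startInclusive");
    }
    return c -> startInclusive <= c && c <= endInclusive;
  }

  /** Matches exactly the characters this test rejects. */
  default CharTest negate() {
    return c -> !matches(c);
  }

  /** Matches characters accepted by both this test and other. */
  default CharTest and(CharTest other) {
    return c -> matches(c) && other.matches(c);
  }

  /** Matches characters accepted by this test, by other, or by both. */
  default CharTest or(CharTest other) {
    return c -> matches(c) || other.matches(c);
  }
}
```

With this sketch, CharTest.inRange('a', 'z').or(CharTest.inRange('A', 'Z')) plays the role of the Latin-letter matcher mentioned in the JAVA_LETTER documentation.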
Returns a char matcher functionally equivalent to this one, but which may be faster to query than the original; your mileage may vary. Precomputation takes time and is likely to be worthwhile only if the precomputed matcher is queried many thousands of times.
This method has no effect (returns this) when called in GWT: it's unclear whether a precomputed matcher is faster, but it certainly consumes more memory, which doesn't seem like a worthwhile tradeoff in a browser.
Construct an array of all possible chars in the slowest way possible.
char[] allChars = new char[65536];
allChars[size++] = (char) c;
char[] retValue = new char[size];
System.arraycopy(allChars, 0, retValue, 0, size);
This is the actual implementation of precomputed, but we bounce calls through a method on Platform so that we can have different behavior in GWT.
If the number of matched characters is small enough, we try to build a small hash table to contain all of the characters. Otherwise, we record the characters in an eight-kilobyte bit array. In many situations this produces a matcher which is faster to query than the original.
int totalCharacters = chars.length;
if (totalCharacters == 0) {
} else if (totalCharacters == 1) {
Subclasses should provide a new CharMatcher with the same characteristics as this, but with their toString method overridden with the new description.
This is unsupported by default.
For use by implementors; sets the bit corresponding to each character ('\0' to '\uFFFF') that matches this matcher in the given bit array, leaving all other bits untouched.
The default implementation loops over every possible character value, invoking matches for each one.
A bit array with one bit per char value, used by CharMatcher.precomputed.
TODO(kevinb): possibly share a common BitArray class with BloomFilter and others... a
simpler java.util.BitSet.
int[] data = new int[2048];
data[index >> 5] |= (1 << index);
boolean get(char index) {
return (data[index >> 5] & (1 << index)) != 0;
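The fragments above can be assembled into a compilable sketch of the one-bit-per-char table. The class name CharBitTable and the set method name are assumptions made for illustration; the array size and the bit arithmetic are taken from the excerpt. 2048 ints of 32 bits each give 65536 bits, one per char value, which is the eight kilobytes mentioned earlier.

```java
final class CharBitTable {
  // 2048 ints * 32 bits = 65536 bits, one for every possible char value.
  private final int[] data = new int[2048];

  /** Marks index as present. */
  void set(char index) {
    // index >> 5 picks the int; Java masks an int shift distance to its low
    // five bits, so (1 << index) picks the bit inside that int.
    data[index >> 5] |= (1 << index);
  }

  /** Returns true if index was previously marked with set. */
  boolean get(char index) {
    return (data[index >> 5] & (1 << index)) != 0;
  }
}
```

A lookup is then one array read, one shift, and one mask, regardless of how expensive the original matches call was.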
Returns true if a character sequence contains at least one matching character. Equivalent to !matchesNoneOf(sequence).
The default implementation iterates over the sequence, invoking matches for each character, until this returns true or the end is reached.
- Parameters:
- sequence - the character sequence to examine, possibly empty
- Returns:
- true if this matcher matches at least one character in the sequence
- Since:
- 8.0
Returns true if a character sequence contains only matching characters.
The default implementation iterates over the sequence, invoking matches for each character, until this returns false or the end is reached.
- Parameters:
- sequence - the character sequence to examine, possibly empty
- Returns:
- true if this matcher matches every character in the sequence, including when the sequence is empty
for (int i = sequence.length() - 1; i >= 0; i--) {
Returns true if a character sequence contains no matching characters. Equivalent to !matchesAnyOf(sequence).
The default implementation iterates over the sequence, invoking matches for each character, until this returns true or the end is reached.
- Parameters:
- sequence - the character sequence to examine, possibly empty
- Returns:
- true if this matcher matches no character in the sequence, including when the sequence is empty
Returns the index of the first matching character in a character sequence, or -1 if no matching character is present.
The default implementation iterates over the sequence in forward order calling matches for each character.
- Parameters:
- sequence - the character sequence to examine from the beginning
- Returns:
- an index, or -1 if no character matches
int length = sequence.length();
for (int i = 0; i < length; i++) {
Returns the index of the first matching character in a character sequence, starting from a given position, or -1 if no character matches at or after that position.
The default implementation iterates over the sequence in forward order, beginning at start, calling matches for each character.
- Parameters:
- sequence - the character sequence to examine
- start - the first index to examine; must be nonnegative and no greater than sequence.length()
- Returns:
- the index of the first matching character, guaranteed to be no less than start, or -1 if no character matches
- Throws:
- IndexOutOfBoundsException if start is negative or greater than sequence.length()
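The contract above (start may equal sequence.length(), anything outside that range throws) can be sketched as follows. This is an illustration under assumptions rather than the library code: IndexInSketch is a hypothetical class, and a java.util.function.IntPredicate stands in for the matcher.

```java
import java.util.function.IntPredicate;

final class IndexInSketch {
  /** Returns the first index at or after start whose char satisfies matcher, or -1. */
  static int indexIn(CharSequence sequence, int start, IntPredicate matcher) {
    int length = sequence.length();
    // start == length is allowed and simply yields -1.
    if (start < 0 || start > length) {
      throw new IndexOutOfBoundsException("start: " + start + ", length: " + length);
    }
    for (int i = start; i < length; i++) {
      if (matcher.test(sequence.charAt(i))) {
        return i;
      }
    }
    return -1;
  }
}
```

For the input "hello" and a test for 'l', a start of 0 yields 2, a start of 3 yields 3, and a start of 4 or 5 yields -1.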
int length = sequence.length();
for (int i = start; i < length; i++) {
Returns the index of the last matching character in a character sequence, or -1 if no matching character is present.
The default implementation iterates over the sequence in reverse order calling matches for each character.
- Parameters:
- sequence - the character sequence to examine from the end
- Returns:
- an index, or -1 if no character matches
for (int i = sequence.length() - 1; i >= 0; i--) {
Returns the number of matching characters found in a character sequence.
for (int i = 0; i < sequence.length(); i++) {
Returns a string containing all non-matching characters of a character sequence, in order. For example:
CharMatcher.is('a').removeFrom("bazaar")
... returns "bzr".
if (pos == chars.length) {
chars[pos - spread] = chars[pos];
return new String(chars, 0, pos - spread);
Returns a string containing all matching characters of a character sequence, in order. For example:
CharMatcher.is('a').retainFrom("bazaar")
... returns "aaa".
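removeFrom and retainFrom are mirror images: retaining the matching characters is the same as removing the non-matching ones. A minimal sketch of that relationship follows, with a hypothetical class RemoveRetainSketch and an IntPredicate standing in for the matcher. The in-place array compaction hinted at by the fragments above is replaced here by a plain StringBuilder.

```java
import java.util.function.IntPredicate;

final class RemoveRetainSketch {
  /** Returns the characters of sequence that do not satisfy matcher, in order. */
  static String removeFrom(CharSequence sequence, IntPredicate matcher) {
    StringBuilder kept = new StringBuilder(sequence.length());
    for (int i = 0; i < sequence.length(); i++) {
      char c = sequence.charAt(i);
      if (!matcher.test(c)) {
        kept.append(c);
      }
    }
    return kept.toString();
  }

  /** Returns the characters of sequence that satisfy matcher, in order. */
  static String retainFrom(CharSequence sequence, IntPredicate matcher) {
    // Retaining matches is removing everything the negated test accepts.
    return removeFrom(sequence, matcher.negate());
  }
}
```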
Returns a string copy of the input character sequence, with each character that matches this matcher replaced by a given replacement character. For example:
CharMatcher.is('a').replaceFrom("radar", 'o')
... returns "rodor".
The default implementation uses indexIn(CharSequence) to find the first matching character, then iterates the remainder of the sequence calling matches(char) for each character.
- Parameters:
- sequence - the character sequence to replace matching characters in
- replacement - the character to append to the result string in place of each matching character in sequence
- Returns:
- the new string
chars[pos] = replacement;
for (int i = pos + 1; i < chars.length; i++) {
Returns a string copy of the input character sequence, with each character that matches this matcher replaced by a given replacement sequence. For example:
CharMatcher.is('a').replaceFrom("yaha", "oo")
... returns "yoohoo".
Note: If the replacement is a fixed string with only one character, you are better off calling replaceFrom(CharSequence, char) directly.
- Parameters:
- sequence - the character sequence to replace matching characters in
- replacement - the characters to append to the result string in place of each matching character in sequence
- Returns:
- the new string
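The behaviour documented above can be sketched in a few lines. This is an illustrative stand-in, not the library implementation: ReplaceSketch is a hypothetical class, an IntPredicate plays the role of the matcher, and the special cases for empty and single-character replacements that the fragments below hint at are left out.

```java
import java.util.function.IntPredicate;

final class ReplaceSketch {
  /** Copies sequence, substituting replacement for every char that satisfies matcher. */
  static String replaceFrom(
      CharSequence sequence, IntPredicate matcher, CharSequence replacement) {
    StringBuilder out = new StringBuilder(sequence.length());
    for (int i = 0; i < sequence.length(); i++) {
      char c = sequence.charAt(i);
      if (matcher.test(c)) {
        out.append(replacement);
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }
}
```

With an empty replacement this degenerates to removing the matching characters.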
int replacementLen = replacement.length();
if (replacementLen == 0) {
if (replacementLen == 1) {
buf.append(string, oldpos, pos);
buf.append(string, oldpos, len);
Returns a substring of the input character sequence that omits all characters this matcher matches from the beginning and from the end of the string. For example:
CharMatcher.anyOf("ab").trimFrom("abacatbab")
... returns "cat".
Note that:
CharMatcher.inRange('\0', ' ').trimFrom(str)
... is equivalent to String.trim().
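Trimming from both ends amounts to two scans, one forward and one backward, that each stop at the first non-matching character. A minimal sketch under assumptions: TrimSketch is a hypothetical class, and an IntPredicate stands in for the matcher.

```java
import java.util.function.IntPredicate;

final class TrimSketch {
  /** Drops matching characters from both ends of sequence; the middle is untouched. */
  static String trimFrom(CharSequence sequence, IntPredicate matcher) {
    int len = sequence.length();
    int first;
    int last;
    // Forward scan: first ends at the first non-matching index, or at len.
    for (first = 0; first < len; first++) {
      if (!matcher.test(sequence.charAt(first))) {
        break;
      }
    }
    // Backward scan: never crosses first, so an all-matching input yields "".
    for (last = len - 1; last > first; last--) {
      if (!matcher.test(sequence.charAt(last))) {
        break;
      }
    }
    return sequence.subSequence(first, last + 1).toString();
  }
}
```

Passing the test c -> c <= ' ' reproduces the String.trim() equivalence noted above.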
for (first = 0; first < len; first++) {
for (last = len - 1; last > first; last--) {
Returns a substring of the input character sequence that omits all characters this matcher matches from the beginning of the string. For example:
CharMatcher.anyOf("ab").trimLeadingFrom("abacatbab")
... returns "catbab".
for (first = 0; first < len; first++) {
Returns a substring of the input character sequence that omits all characters this matcher matches from the end of the string. For example:
CharMatcher.anyOf("ab").trimTrailingFrom("abacatbab")
... returns "abacat".
for (last = len - 1; last >= 0; last--) {
Returns a string copy of the input character sequence, with each group of consecutive characters that match this matcher replaced by a single replacement character. For example:
CharMatcher.anyOf("eko").collapseFrom("bookkeeper", '-')
... returns "b-p-r".
The default implementation uses indexIn(CharSequence) to find the first matching character, then iterates the remainder of the sequence calling matches(char) for each character.
- Parameters:
- sequence - the character sequence to replace matching groups of characters in
- replacement - the character to append to the result string in place of each group of matching characters in sequence
- Returns:
- the new string
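Collapsing needs one bit of state: whether the previous character already belonged to a matching group. The sketch below is a simplified illustration, with a hypothetical class CollapseSketch and an IntPredicate in place of the matcher. It skips the indexIn fast path that the default implementation is described as using.

```java
import java.util.function.IntPredicate;

final class CollapseSketch {
  /** Replaces each run of consecutive matching chars with a single replacement char. */
  static String collapseFrom(CharSequence sequence, IntPredicate matcher, char replacement) {
    StringBuilder out = new StringBuilder(sequence.length());
    boolean inMatchingGroup = false;
    for (int i = 0; i < sequence.length(); i++) {
      char c = sequence.charAt(i);
      if (matcher.test(c)) {
        // Emit the replacement only once, at the start of each group.
        if (!inMatchingGroup) {
          out.append(replacement);
          inMatchingGroup = true;
        }
      } else {
        out.append(c);
        inMatchingGroup = false;
      }
    }
    return out.toString();
  }
}
```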
for (int i = first + 1; i < sequence.length(); i++) {
Collapses groups of matching characters exactly as collapseFrom does, except that groups of matching characters at the start or end of the sequence are removed without replacement.
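One way to get the trim-and-collapse behaviour in a single pass is to defer the replacement until the next non-matching character arrives, so leading and trailing groups never produce output. The sketch below assumes a hypothetical class TrimAndCollapseSketch and an IntPredicate matcher; it is not claimed to mirror the library's loop.

```java
import java.util.function.IntPredicate;

final class TrimAndCollapseSketch {
  /** Collapses interior matching runs to one replacement char; drops leading and trailing runs. */
  static String trimAndCollapseFrom(
      CharSequence sequence, IntPredicate matcher, char replacement) {
    StringBuilder out = new StringBuilder(sequence.length());
    boolean pendingGroup = false;
    for (int i = 0; i < sequence.length(); i++) {
      char c = sequence.charAt(i);
      if (matcher.test(c)) {
        pendingGroup = true;
      } else {
        // A pending group is written only when kept text surrounds it on both sides.
        if (pendingGroup && out.length() > 0) {
          out.append(replacement);
        }
        out.append(c);
        pendingGroup = false;
      }
    }
    return out.toString();
  }
}
```

A typical use is normalizing whitespace, for example turning "  hello   world  " into "hello world".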
boolean inMatchingGroup = false;
for (int i = first; i < sequence.length(); i++) {
Returns true if this matcher matches the given character.
- Throws:
- NullPointerException if character is null
Returns a string representation of this CharMatcher, such as CharMatcher.or(WHITESPACE, JAVA_DIGIT).
Determines whether a character is whitespace according to the latest Unicode standard, as illustrated here. This is not the same definition used by other Java APIs. (See a comparison of several definitions of "whitespace".)
Note: as the Unicode definition evolves, we will modify this constant to keep it up
to date.
A special-case CharMatcher for Unicode whitespace characters that is extremely
efficient both in space required and in time to check for matches.
Implementation details.
It turns out that all current (early 2012) Unicode whitespace characters are unique modulo 79, so we can construct a lookup table of exactly 79 entries, compute the character code mod 79, and check whether the table entry at that index equals the character.
There is a 1 at the beginning of the table so that the null character is not listed
as whitespace.
Other things we tried that did not prove to be beneficial, mostly due to speed concerns:
* Binary search into the sorted list of characters, i.e., what CharMatcher.anyOf() does
* Perfect hash function into a table of size 26 (using an offset table and a special
Jenkins hash function)
* Perfect-ish hash function that required two lookups into a single table of size 26.
* Using a power-of-2 sized hash table (size 64) with linear probing.
--Christopher Swenson, February 2012.
private final char[] table = {1, 0, 160, 0, 0, 0, 0, 0, 0, 9, 10, 11, 12, 13, 0, 0,
8232, 8233, 0, 0, 0, 0, 0, 8239, 0, 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
12288, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 133, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199,
8200, 8201, 8202, 0, 0, 0, 0, 0, 8287, 5760, 0, 0, 6158, 0, 0, 0};
return table[c % 79] == c;
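The table above can be rebuilt from the list of whitespace code units, which also verifies the "unique modulo 79" claim: construction fails if two characters land in the same slot. This is a sketch for illustration. The class name ModuloTableSketch is hypothetical, and the code list is simply the set of nonzero table entries other than the sentinel 1.

```java
final class ModuloTableSketch {
  private static final int MODULUS = 79;

  // The nonzero entries of the table above, minus the sentinel at index 0.
  private static final int[] WHITESPACE_CODES = {
    9, 10, 11, 12, 13, 32, 133, 160, 5760, 6158,
    8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200, 8201, 8202,
    8232, 8233, 8239, 8287, 12288
  };

  private static final char[] TABLE = buildTable();

  private static char[] buildTable() {
    char[] table = new char[MODULUS];
    // Sentinel: slot 0 would otherwise hold 0 and wrongly match the null character.
    table[0] = 1;
    for (int code : WHITESPACE_CODES) {
      int slot = code % MODULUS;
      if (table[slot] != 0) {
        throw new IllegalStateException("collision at slot " + slot);
      }
      table[slot] = (char) code;
    }
    return table;
  }

  /** One modulo, one array read, one comparison. */
  static boolean matches(char c) {
    return TABLE[c % MODULUS] == c;
  }
}
```

The sentinel 1 is safe because 1 mod 79 is 1, so the character 1 is looked up in slot 1 (which holds 0) and never in slot 0.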