Before you begin

This unit is part of the “Intro to Java programming” learning path. Although the concepts discussed in the individual units are standalone in nature, the hands-on component builds as you progress through the units, and I recommend that you review the prerequisites, setup, and unit details before proceeding.

Unit objectives

  • Learn what the three core regex classes are and what they do
  • Become familiar with regex pattern syntax
  • Be able to perform both simple and more complex searches and replacements
  • Know how to reference matches by capturing groups

The Regular Expressions API

A regular expression is essentially a pattern that describes a set of strings sharing a common structure. This unit gets you started with using regular expressions in your Java programs.

Here’s a set of strings that have a few things in common:

  • A string
  • A longer string
  • A much longer string

Note that each of these strings begins with A and ends with string. The Java Regular Expressions API helps you pull out these elements, see the pattern among them, and do interesting things with the information you’ve gleaned.

The Regular Expressions API has three core classes that you use almost all the time:

  • Pattern describes a string pattern.
  • Matcher tests a string to see if it matches the pattern.
  • PatternSyntaxException tells you that something wasn’t acceptable about the pattern that you tried to define.
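
To make the division of labor concrete, here is a minimal sketch that touches all three classes. The class name CoreClassesSketch and its two helper methods are illustrative names, not part of the API, and the pattern syntax it uses is explained in the next section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CoreClassesSketch {

  // Returns true if the regex compiles, false if Pattern rejects it.
  static boolean isValidRegex(String regex) {
    try {
      Pattern.compile(regex);
      return true;
    } catch (PatternSyntaxException e) {
      // getDescription() explains what was wrong with the pattern.
      System.out.println("Bad pattern: " + e.getDescription());
      return false;
    }
  }

  // Compiles the regex and asks a Matcher whether the whole input matches.
  static boolean matchesWhole(String regex, String input) {
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    return matcher.matches();
  }

  public static void main(String[] args) {
    System.out.println(isValidRegex("A.*string")); // a well-formed pattern
    System.out.println(isValidRegex("[A-Z"));      // an unclosed character class
    System.out.println(matchesWhole("A.*string", "A longer string"));
  }
}
```

PatternSyntaxException is an unchecked exception, so the compiler does not force you to catch it. Catching it is mostly useful when the pattern comes from user input rather than from a literal in your code.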

You’ll begin working on a simple regular-expressions pattern that uses these classes shortly. But first, take a look at the regex pattern syntax.

Regex pattern syntax

A regex pattern describes the structure of the string that the expression tries to find in an input string. The pattern syntax can look strange to the uninitiated, but once you understand it, you’ll find it easier to decipher. Table 1 lists some of the most common regex constructs that you use in pattern strings.

Table 1. Common regex constructs
Regex construct    What qualifies as a match
.                  Any character
?                  Zero (0) or one (1) of what came before
*                  Zero (0) or more of what came before
+                  One (1) or more of what came before
[]                 A set or range of characters or digits
^                  Inside [], negation of whatever follows (that is, “not whatever”)
\d                 Any digit (alternatively, [0-9])
\D                 Any nondigit (alternatively, [^0-9])
\s                 Any whitespace character (alternatively, [ \t\n\x0B\f\r])
\S                 Any nonwhitespace character (alternatively, [^\s])
\w                 Any word character (alternatively, [a-zA-Z_0-9])
\W                 Any nonword character (alternatively, [^\w])

The ?, *, and + constructs are called quantifiers, because they quantify what comes before them. Constructs like \d are predefined character classes. Any character that doesn’t have special meaning in a pattern is a literal and matches itself.
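
A few of these constructs can be tried out with a short sketch like the following. It assumes the static convenience method Pattern.matches(), which compiles a pattern and tests the entire input in one call. The class name ConstructsSketch and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class ConstructsSketch {

  // Pattern.matches() compiles the regex and tests the entire input in one call.
  static boolean wholeMatch(String regex, String input) {
    return Pattern.matches(regex, input);
  }

  public static void main(String[] args) {
    // In Java source code, each regex backslash is written as "\\".
    System.out.println(wholeMatch("colou?r", "color"));   // ? makes the u optional
    System.out.println(wholeMatch("colou?r", "colour"));
    System.out.println(wholeMatch("\\d+", "2024"));        // + needs one or more digits
    System.out.println(wholeMatch("\\d+", ""));            // so an empty string fails
    System.out.println(wholeMatch("\\d*", ""));            // * accepts zero digits
    System.out.println(wholeMatch("[^0-9]+", "abc"));      // ^ inside [] negates the range
    System.out.println(wholeMatch("\\w+\\s\\w+", "two words"));
  }
}
```

Every call except the fourth prints true. The doubled backslash is a Java string-literal requirement, not part of the regex syntax itself.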

Pattern matching

Armed with the pattern syntax in Table 1, you can work through the simple example in Listing 1, using the classes in the Java Regular Expressions API.

Listing 1. Pattern matching with regex
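// Imports needed: java.util.logging.Logger, java.util.regex.Matcher, java.util.regex.Pattern
Pattern pattern = Pattern.compile("[Aa].*string");
Matcher matcher = pattern.matcher("A string");
// matches() must succeed before start() and end() can be called
boolean didMatch = matcher.matches();
// Logger.info() takes a String, so build one from each value
Logger.getAnonymousLogger().info("didMatch: " + didMatch);
int patternStartIndex = matcher.start();
Logger.getAnonymousLogger().info("start: " + patternStartIndex);
int patternEndIndex = matcher.end();
Logger.getAnonymousLogger().info("end: " + patternEndIndex);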

First, Listing 1 creates a Pattern instance by calling compile(), a static method on Pattern, with a string literal representing the pattern you want to match. That literal uses the regex pattern syntax. In this example, the English translation of the pattern is:

Find a string of the form A or a followed by zero or more characters, followed by string.

Methods for matching

Next, Listing 1 calls matcher() on Pattern. That call creates a Matcher instance. The Matcher then searches the string you passed in for matches against the pattern string you used when you created the Pattern.

Every Java language string is an indexed collection of characters, starting with 0 and ending with the string length minus one. The Matcher parses the string, starting at 0, and looks for matches against it. After that process is complete, the Matcher contains information about matches found (or not found) in the input string. You can access that information by calling various methods on Matcher:

  • matches() tells you if the entire input sequence was an exact match for the pattern.
  • start() tells you the index value in the string where the matched string starts.
  • end() tells you the index value in the string where the matched string ends, plus one.

Listing 1 finds a single match starting at 0 and ending at 7. Thus, the call to matches() returns true, the call to start() returns 0, and the call to end() returns 8.
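
One detail worth seeing in code: start() and end() are only meaningful after a successful match, and they throw IllegalStateException otherwise. The following sketch guards against that case. The class name MatchBoundsSketch and the describe() helper are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchBoundsSketch {

  // Returns "start-end" for a whole-input match, or "no match" otherwise.
  static String describe(String regex, String input) {
    Matcher matcher = Pattern.compile(regex).matcher(input);
    if (!matcher.matches()) {
      // Calling start() or end() here would throw IllegalStateException.
      return "no match";
    }
    return matcher.start() + "-" + matcher.end();
  }

  public static void main(String[] args) {
    System.out.println(describe("[Aa].*string", "A string"));             // 0-8
    System.out.println(describe("[Aa].*string", "A much longer string")); // 0-20
    System.out.println(describe("[Aa].*string", "The string"));           // no match
  }
}
```

Because end() is one past the last matched index, the difference between end() and start() equals the length of the matched text.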

lookingAt() versus matches()

If the input string continues past the part that matches your pattern, you can use lookingAt() instead of matches(). The lookingAt() method checks whether the beginning of the input matches the pattern, without requiring the pattern to cover the entire input. For example, consider the following string:

a string with more than just the pattern.

If you search this string for a.*string, you get a match if you use lookingAt(). But if you use matches(), it returns false, because there’s more to the string than what’s in the pattern.
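
The difference can be sketched side by side. The example below also includes find(), which looks for a match anywhere in the input, to show that lookingAt() is anchored at the start. The class and method names are illustrative, and the second input string is made up for the comparison.

```java
import java.util.regex.Pattern;

class LookingAtSketch {

  // True only if the pattern covers the entire input.
  static boolean matchesWhole(String regex, String input) {
    return Pattern.compile(regex).matcher(input).matches();
  }

  // True if the input starts with text that matches the pattern.
  static boolean matchesPrefix(String regex, String input) {
    return Pattern.compile(regex).matcher(input).lookingAt();
  }

  // True if the pattern matches anywhere in the input.
  static boolean matchesAnywhere(String regex, String input) {
    return Pattern.compile(regex).matcher(input).find();
  }

  public static void main(String[] args) {
    String regex = "a.*string";
    String input = "a string with more than just the pattern.";
    System.out.println(matchesWhole(regex, input));     // false: text follows "string"
    System.out.println(matchesPrefix(regex, input));    // true: input starts with a match

    String shifted = "Here is a string";
    System.out.println(matchesPrefix(regex, shifted));   // false: input starts with "H"
    System.out.println(matchesAnywhere(regex, shifted)); // true: "a string" appears later
  }
}
```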

Complex patterns in regex

Simple searches are easy with the regex classes, but you can also do highly sophisticated things with the Regular Expressions API.

Wikis rely heavily on regular expressions. Wiki content is based on string input from users, which is parsed and formatted using regular expressions. Any user can create a link to another topic in a wiki by entering a wiki word, which is typically a series of concatenated words, each of which begins with an uppercase letter, like this:

MyWikiWord

Suppose a user inputs the following string:

Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.

You could search for wiki words in this string with a regex pattern like this:

[A-Z][a-z]*([A-Z][a-z]*)+

And here’s code to search for wiki words:


String input = "Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.";
Pattern pattern = Pattern.compile("[A-Z][a-z]*([A-Z][a-z]*)+");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
  Logger.getAnonymousLogger().info("Found this wiki word: " + matcher.group());
}

Run this code, and you can see the three wiki words in your console.
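
If you want the matches as data rather than log output, the same find() loop can collect them into a list. This is a sketch under the assumption that a wiki word is written as [A-Z][a-z]*([A-Z][a-z]*)+ (an uppercase letter and optional lowercase letters, repeated at least twice). The class name WikiWordFinderSketch is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiWordFinderSketch {

  // Assumed pattern: one capitalized word followed by one or more capitalized words.
  static final Pattern WIKI_WORD = Pattern.compile("[A-Z][a-z]*([A-Z][a-z]*)+");

  // Returns every wiki word in the input, in the order it appears.
  static List<String> findWikiWords(String input) {
    List<String> words = new ArrayList<>();
    Matcher matcher = WIKI_WORD.matcher(input);
    while (matcher.find()) {
      // group() with no argument returns the text of the current match.
      words.add(matcher.group());
    }
    return words;
  }

  public static void main(String[] args) {
    String input = "Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.";
    System.out.println(findWikiWords(input));
    System.out.println(findWikiWords("Here is no wiki word."));
  }
}
```

The word Here is skipped because it has only one uppercase letter. Note that this simple pattern also accepts an all-uppercase token such as ABC, so a real wiki would likely need a stricter rule.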

Replacing strings

Searching for matches is useful, but you also can manipulate strings after you find a match for them. You can do that by replacing matched strings with something else, just as you might search for text in a word-processing program and replace it with other text. Matcher has a couple of methods for replacing string elements:

  • replaceAll() replaces all matches with a specified string.
  • replaceFirst() replaces only the first match with a specified string.

Using Matcher’s replace methods is straightforward:


String input = "Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.";
Pattern pattern = Pattern.compile("[A-Z][a-z]*([A-Z][a-z]*)+");
Matcher matcher = pattern.matcher(input);
Logger.getAnonymousLogger().info("Before: " + input);
String result = matcher.replaceAll("replacement");
Logger.getAnonymousLogger().info("After: " + result);

This code finds wiki words, as before. When the Matcher finds a match, it replaces the wiki word text with its replacement. When you run the code, you can see the following on your console:


Before: Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.
  After: Here is a replacement followed by replacement, then replacement.

If you had used replaceFirst(), you would have seen this:


Before: Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.
  After: Here is a replacement followed by AnotherWikiWord, then SomeWikiWord.
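
For reference, the replaceFirst() variant might look like the following sketch, shown next to replaceAll() for comparison. It assumes the wiki-word pattern [A-Z][a-z]*([A-Z][a-z]*)+, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class ReplaceSketch {

  // Assumed wiki-word pattern: two or more capitalized words run together.
  static final Pattern WIKI_WORD = Pattern.compile("[A-Z][a-z]*([A-Z][a-z]*)+");

  // Replaces only the first wiki word found in the input.
  static String replaceFirstWikiWord(String input, String replacement) {
    return WIKI_WORD.matcher(input).replaceFirst(replacement);
  }

  // Replaces every wiki word found in the input.
  static String replaceAllWikiWords(String input, String replacement) {
    return WIKI_WORD.matcher(input).replaceAll(replacement);
  }

  public static void main(String[] args) {
    String input = "Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.";
    System.out.println(replaceFirstWikiWord(input, "replacement"));
    System.out.println(replaceAllWikiWords(input, "replacement"));
  }
}
```

Neither method modifies the input. Java strings are immutable, so each call returns a new string.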

Matching and manipulating groups

When you search for matches against a regex pattern, you can get information about what you found. You’ve seen some of that capability with the start() and end() methods on Matcher. But it’s also possible to reference matches by capturing groups.

In each pattern, you typically create groups by enclosing parts of the pattern in parentheses. Groups are numbered from left to right, starting with 1 (group 0 represents the entire match). The code in Listing 2 replaces each wiki word with a string that “wraps” the word:

Listing 2. Matching groups

String input = "Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.";
Pattern pattern = Pattern.compile("[A-Z][a-z]*([A-Z][a-z]*)+");
Matcher matcher = pattern.matcher(input);
Logger.getAnonymousLogger().info("Before: " + input);
String result = matcher.replaceAll("blah$0blah");
Logger.getAnonymousLogger().info("After: " + result);

Run the Listing 2 code, and you get the following console output:

Before: Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.
  After: Here is a blahWikiWordblah followed by blahAnotherWikiWordblah,
  then blahSomeWikiWordblah.

Listing 2 references the entire match by including $0 in the replacement string. Any portion of a replacement string of the form $int refers to the group identified by the integer (so $1 refers to group 1, and so on). In other words, $0 is equivalent to matcher.group(0).
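
Numbered groups other than $0 work the same way. As a small sketch, the hypothetical pattern (\w+) (\w+) below captures two adjacent words as groups 1 and 2, and the replacement string swaps them. The pattern, class name, and sample strings are illustrative and do not come from the listings above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupReferenceSketch {

  // Group 1 is the first word, group 2 is the second word.
  static final Pattern TWO_WORDS = Pattern.compile("(\\w+) (\\w+)");

  // Swaps each pair of adjacent words by referencing the groups as $2 and $1.
  static String swapPairs(String input) {
    return TWO_WORDS.matcher(input).replaceAll("$2 $1");
  }

  // Reads a group directly from the Matcher; returns null if nothing matched.
  static String firstWordOfFirstPair(String input) {
    Matcher matcher = TWO_WORDS.matcher(input);
    return matcher.find() ? matcher.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println(swapPairs("hello world"));            // world hello
    System.out.println(swapPairs("one two three four"));     // two one four three
    System.out.println(firstWordOfFirstPair("hello world")); // hello
  }
}
```

Inside a replacement string you write $1. In code, the equivalent is matcher.group(1).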

You could accomplish the same replacement goal by using other methods. Rather than calling replaceAll(), you could do this:


StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
  matcher.appendReplacement(buffer, "blah$0blah");
}
matcher.appendTail(buffer);
Logger.getAnonymousLogger().info("After: " + buffer.toString());

And you’d get the same result:

Before: Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.
  After: Here is a blahWikiWordblah followed by blahAnotherWikiWordblah,
  then blahSomeWikiWordblah.
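
The loop form becomes useful when the replacement has to be computed from each match, which a fixed replacement string cannot do. The sketch below uppercases every wiki word. It assumes the wiki-word pattern [A-Z][a-z]*([A-Z][a-z]*)+, and the class and method names are made up for the example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementSketch {

  // Assumed wiki-word pattern: two or more capitalized words run together.
  static final Pattern WIKI_WORD = Pattern.compile("[A-Z][a-z]*([A-Z][a-z]*)+");

  // Rewrites each wiki word in uppercase and leaves all other text unchanged.
  static String shoutWikiWords(String input) {
    Matcher matcher = WIKI_WORD.matcher(input);
    StringBuffer buffer = new StringBuffer();
    while (matcher.find()) {
      String upper = matcher.group().toUpperCase(Locale.ROOT);
      // quoteReplacement() keeps any $ or \ in the computed text literal.
      matcher.appendReplacement(buffer, Matcher.quoteReplacement(upper));
    }
    // Copy whatever follows the last match.
    matcher.appendTail(buffer);
    return buffer.toString();
  }

  public static void main(String[] args) {
    String input = "Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.";
    System.out.println(shoutWikiWords(input));
  }
}
```

StringBuffer is used here because appendReplacement() has accepted it since the method was introduced. An overload that takes StringBuilder exists only in Java 9 and later.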

Test your understanding

  1. Which statement best describes the ? quantifier?

    1. Matches zero or more times
    2. Matches one or more times
    3. Matches the first occurrence and appends the match to the output group
    4. Matches once or not at all
    5. None of the above
  2. Which statement best describes the + quantifier?

    1. Matches zero or more times
    2. Matches one or more times
    3. Matches the first occurrence and appends the match to the output group
    4. Matches once or not at all
    5. None of the above
  3. Which statement best describes the * quantifier?

    1. Matches zero or more times
    2. Matches one or more times
    3. Matches the first occurrence and appends the match to the output group
    4. Matches once or not at all
    5. None of the above
  4. True or false: The Matcher class is used to describe the input string to the Pattern class.

  5. Which answer best describes an application of the following regular expression string: [A-Z]?\d

    1. Match any character A through Z one or more times, followed by a single optional digit.
    2. Match any character A through Z zero or one times, followed by a single optional digit.
    3. Match any character A through Z one or more times, followed by a single digit.
    4. Match any character A through Z zero or one times, followed by a single digit.
    5. None of the above.
  6. Examine the following code and choose the response that best describes the matches (in order).

    
    @Test
    public void testFindMatches() {
      String input = "Do you run? Ran? No, bro, run! Bro, I ran and run.";
      String regex = "r[au]n";

      Pattern pattern = Pattern.compile(regex);
      Matcher matcher = pattern.matcher(input);

      int matchCount = 0;
      StringBuilder matchHolder = new StringBuilder();
      while (matcher.find()) {
        if (matchCount > 0) {
          matchHolder.append(',');
        }
        matchHolder.append(matcher.group());
        matchCount++;
      }

      System.out.println("Matches: " + matchHolder.toString());
    }
    

    1. run,Ran,run,ran,run
    2. run,run,run,run
    3. run,ran,run,Ran,run
    4. run,run,ran,run
    5. The specified pattern does not match any part of the input string.
  7. Programming exercise, part 1: Create a new class (call it MyRegExMatcher), and write a method called matchesAll that takes two parameters (a String called regex and a String called input) and returns a boolean. For now, just write the method to return false.

  8. Programming exercise, part 2: Create a JUnit test case that calls the method you wrote for Question 7. Your JUnit test will invoke the method with the simplest regular expression you can come up with that matches this input String: The quick brown fox jumped over the lazy dogs

    Note: the regular expression may only contain quantifiers and must contain the letters l and x only (no other letters).

  9. Programming exercise, part 3: Implement the method from Question 7 so that your test case passes (if your test case does not pass, your regular expression might be wrong). Return true if the entire input string matches the regular expression pattern, false otherwise. Hint: Use the Pattern class, and the Matcher class, as you saw in Listing 1.

Check your answers

  1. Which statement best describes the ? quantifier?

    1. Matches zero or more times
    2. Matches one or more times
    3. Matches the first occurrence and appends the match to the output group
    4. Matches once or not at all
    5. None of the above
    Answer: 4. The ? quantifier matches once or not at all.
  2. Which statement best describes the + quantifier?

    1. Matches zero or more times
    2. Matches one or more times
    3. Matches the first occurrence and appends the match to the output group
    4. Matches once or not at all
    5. None of the above
    Answer: 2. The + quantifier matches one or more times.
  3. Which statement best describes the * quantifier?

    1. Matches zero or more times
    2. Matches one or more times
    3. Matches the first occurrence and appends the match to the output group
    4. Matches once or not at all
    5. None of the above
    Answer: 1. The * quantifier matches zero or more times.
  4. True or false: The Matcher class is used to describe the input string to the Pattern class. False. The Matcher class is used to match the input string to the regular expression represented by the Pattern class.

  5. Which answer best describes an application of the following regular expression string: [A-Z]?\d

    1. Match any character A through Z one or more times, followed by a single optional digit.
    2. Match any character A through Z zero or one times, followed by a single optional digit.
    3. Match any character A through Z one or more times, followed by a single digit.
    4. Match any character A through Z zero or one times, followed by a single digit.
    5. None of the above.
    Answer: 4. The ? applies only to [A-Z], so the letter is optional and the digit is required.
  6. Examine the following code and choose the response that best describes the matches (in order).

    
    @Test
    public void testFindMatches() {
      String input = "Do you run? Ran? No, bro, run! Bro, I ran and run.";
      String regex = "r[au]n";

      Pattern pattern = Pattern.compile(regex);
      Matcher matcher = pattern.matcher(input);

      int matchCount = 0;
      StringBuilder matchHolder = new StringBuilder();
      while (matcher.find()) {
        if (matchCount > 0) {
          matchHolder.append(',');
        }
        matchHolder.append(matcher.group());
        matchCount++;
      }

      System.out.println("Matches: " + matchHolder.toString());
    }
    

    1. run,Ran,run,ran,run
    2. run,run,run,run
    3. run,ran,run,Ran,run
    4. run,run,ran,run
    5. The specified pattern does not match any part of the input string.
    Answer: 4 (run,run,ran,run). The pattern is case-sensitive, so Ran does not match.
  7. Programming exercise, part 1: Create a new class (call it MyRegExMatcher), and write a method called matchesAll that takes two parameters (a String called regex and a String called input) and returns a boolean. For now, just write the method to return false.

     public class MyRegExMatcher {
    
       public boolean matchesAll(String regex, String input) {
         boolean ret = false;
         // TODO: write some code...
         return ret;
       }
    
     }
    
  8. Programming exercise, part 2: Create a JUnit test case that calls the method you wrote for Question 7. Your JUnit test will invoke the method with the simplest regular expression you can come up with that matches this input String: The quick brown fox jumped over the lazy dogs

    Note: the regular expression may only contain quantifiers and must contain the letters l and x only (no other letters).

     import static org.junit.Assert.assertTrue;
    
     import org.junit.Test;
    
     public class MyRegExMatcherTest {
    
       @Test
       public void testMatchesAll() {
         MyRegExMatcher classUnderTest = new MyRegExMatcher();
    
         String input = "The quick brown fox jumped over the lazy dogs";
         String regEx = ".*x.*l.*";
         boolean matches = classUnderTest.matchesAll(regEx, input);
    
         assertTrue(matches);
       }
    
     }
    
  9. Programming exercise, part 3: Implement the method from Question 7 so that your test case passes (if your test case does not pass, your regular expression might be wrong). Return true if the entire input string matches the regular expression pattern, false otherwise. Hint: Use the Pattern class, and the Matcher class, as you saw in Listing 1.

     import java.util.regex.Matcher;
     import java.util.regex.Pattern;
    
     public class MyRegExMatcher {
    
       public boolean matchesAll(String regex, String input) {
         boolean ret = false;
    
         Pattern pattern = Pattern.compile(regex);
         Matcher matcher = pattern.matcher(input);
    
         ret = matcher.matches();
         return ret;
       }
    
     }
    
