Regular expression multiline mode - what's a newline?

Posted on Wed 30 May 2012 in Coding

I stumbled upon an interesting little detail as I was using a regular expression in a unit test case in a C# application. I had a multiline string and was searching for a particular substring in multiline mode. The newlines in the string were Windows newlines, meaning CR followed by LF. Can you see where this is heading?

To illustrate the situation, consider the following C# program:

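using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main(string[] args)
    {
        const string input = "foo\r\nbar";
        // Multiline mode: $ is supposed to match at the end of every line.
        var match = Regex.IsMatch(input, "foo$", RegexOptions.Multiline);
        Console.WriteLine(match);
    }
}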

What does it print? Intuitively, it should print “True”, as multiline mode changes the meaning of $ to match at the end of any line, not just at the end of input. But if you run the program, you’ll find that it prints “False”. It took me a little while to figure out why, but it’s documented here. The relevant part:

If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

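So the fix is easy: in the input above, foo is followed by CR rather than LF, so the pattern has to be foo\r?$ for $ to match. But I’m a bit surprised that $ doesn’t match CR+LF, given that that’s the sequence Windows uses for newlines, and the .NET Framework was developed for the Windows platform…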

For comparison, I tried the same pattern matching in Java:

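import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String input = "foo\r\nbar";
        Pattern p = Pattern.compile("foo$", Pattern.MULTILINE);
        // find() searches the whole input, like Regex.IsMatch does in C#.
        boolean match = p.matcher(input).find();
        System.out.println(match);
    }
}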

This program prints “true”. In Java, $ matches a range of different line terminators. See the API documentation for the complete list.
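To make that a bit more concrete, here is a small sketch that tries foo$ against each line terminator listed in the java.util.regex.Pattern documentation. It is not part of the original experiment: the class name LineTerminatorDemo and the helper found are made up for illustration. The last two checks add Pattern.UNIX_LINES, which tells Java to treat only \n as a line terminator, so as far as I can tell it mimics what the .NET example above ran into.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class LineTerminatorDemo {

    // Returns true if regex is found anywhere in input under the given flags.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Line terminators as listed in the Pattern class documentation.
        Map<String, String> terminators = new LinkedHashMap<>();
        terminators.put("LF", "\n");
        terminators.put("CR+LF", "\r\n");
        terminators.put("CR", "\r");
        terminators.put("NEL", "\u0085");
        terminators.put("LS", "\u2028");
        terminators.put("PS", "\u2029");

        // In plain multiline mode, $ should match before each of them.
        for (Map.Entry<String, String> e : terminators.entrySet()) {
            String input = "foo" + e.getValue() + "bar";
            System.out.println(e.getKey() + ": "
                    + found("foo$", Pattern.MULTILINE, input));
        }

        // With UNIX_LINES, only LF counts as a line terminator.
        int unixMultiline = Pattern.MULTILINE | Pattern.UNIX_LINES;
        System.out.println("UNIX_LINES, foo$: "
                + found("foo$", unixMultiline, "foo\r\nbar"));
        // The \r?$ workaround from the .NET documentation applies here too.
        System.out.println("UNIX_LINES, foo\\r?$: "
                + found("foo\\r?$", unixMultiline, "foo\r\nbar"));
    }
}
```

Running it should report true for all six terminators in plain multiline mode. With UNIX_LINES added, foo$ no longer matches the CR+LF input, while foo\r?$ does.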

Moral of the story? Don’t make assumptions about the behavior of regular expression engines (or about regular expression syntax, for that matter) - make sure to read the documentation!