The other day I wanted to extract a part of a filename and was quite dumbfounded when the extracted part came back empty. The regular expression used to extract the part using a capturing group was:
[A-Za-z0-9]*_?([0-9]*)\..*
The filenames fed into this expressing where mostly of the pattern someText_12345.data
. However, occasionally a file named 12345.data
would pass along and even though its name matched the expression, the capturing group came up empty. But… but… why?
The answer to that lies in the quantifiers. Quantifiers are the parts that determine how often a character or group of characters must appear to be matched, *
meaning any number of matches, +
meaning 1 or more matches, ?
meaning 0 or 1 matches. The quantifiers used in the above expression are called “greedy quantifiers:” if they have the choice of matching a shorter or a longer part of the string they will always choose the longer part (unless a shorter match would allow an overall match to occur and the longer match would not). So, how exactly does that influence the evaluation of the expression on the second filename?
The regular expression consists of 5 parts.
[A-Za-z0-9]*
– this matches an arbitrary number of characters or digits._?
– this matches an optional underscore.([0-9]*)
– this matches any number of digits in a capturing group.\.
– a dot..*
– any number of arbitrary characters.
Parsing 12345.data
the parts are tried sequentially.
- Part 1 of the expression matches the
12345
. - Part 2 of the expression matches the null string between the
12345
and the dot. - So does part 3.
- Part 4 matches the dot.
- Part 5 matches the rest of the filename.
So the filename is parsed successfully but our capturing group matched the null string right before the dot. This is clearly not what we wanted.
The solution is to turn some of the greedy quantifiers *
and ?
into reluctant quantifiers, i.e. *?
and ??
: [A-Za-z0-9]*?_??([0-9]*)\..*
. This will change the matching to the following sequence:
- Part 1 will be matched by the null string at the beginning of the filename.
- Part 2 will match the null string following the first null string. (The discussion whether those null strings have to be considered the same null string is a philosophical one.)
- Part 3 now matches the
12345
part of the filename. This part is still matched by a greedy quantifier so it will not simply accept the null string as a match. - Part 4 matches the dot, and
- part 5 matches the extension data.
So, even though both regular expressions match both filenames, only one is parsed in a way we expect it to.
(There is also a third type of quantifier: possessive quantifiers. Those consume as much of the input string as they can but never back off, causing a regular expression to not match an input string that could very well match if a different quantifier had been used.)