Escape Sequences

There are three ways to match a single character that could be one of several characters.

The first way involves using the . meta-character, which matches any single character except a line feed (“\n”) unless modifiers are used (these will be covered later). It can be used with repetition just like any other character.
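
As a minimal sketch of this behavior, the following checks one pattern against a few strings. The pattern and sample strings are hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical pattern: "a", then any single character, then "c".
$pattern = '/^a.c$/';

$dotMatchesLetter   = (preg_match($pattern, 'abc') == 1);   // true: . matches "b"
$dotMatchesSpace    = (preg_match($pattern, 'a c') == 1);   // true: . matches a space
$dotMatchesLineFeed = (preg_match($pattern, "a\nc") == 1);  // false: . does not match "\n"

// Repetition applies to . like any other character: exactly three of anything.
$threeOfAnything = (preg_match('/^.{3}$/', 'x-9') == 1);     // true
```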

The second way requires using special escape sequences that represent a range of characters. Aside from the escape sequences mentioned in the previous section’s examples, here are some that are commonly used.

  • \d: a digit, 0 through 9.
  • \h: a horizontal whitespace character, such as a space or a tab.
  • \v: a vertical whitespace character, such as a carriage return or line feed.
  • \s: any whitespace character, the equivalent of all characters represented by \h and \v.
  • \w: any letter or digit or an underscore.

Each of these escape sequences has a complement.

  • \D: a non-digit character.
  • \H: a non-horizontal whitespace character.
  • \V: a non-vertical whitespace character.
  • \S: a non-whitespace character.
  • \W: a character that is not a letter, digit, or underscore.
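
The sequences above and their complements might be exercised as follows. This is only a sketch; the input strings and variable names are assumptions made for the example.

```php
<?php
$sample = 'Order 66 shipped';  // hypothetical input

$hasDigit      = (preg_match('/\d/', $sample) == 1);      // true: "6" is a digit
$hasWhitespace = (preg_match('/\s/', $sample) == 1);      // true: the string contains spaces
$onlyWordChars = (preg_match('/^\w+$/', $sample) == 1);   // false: a space is not matched by \w

// A tab is horizontal whitespace, so \h matches it and \v does not.
$tabIsHorizontal = (preg_match('/\h/', "a\tb") == 1);     // true
$tabIsVertical   = (preg_match('/\v/', "a\tb") == 1);     // false

// Complements match whatever the lowercase form does not.
$hasNonDigit = (preg_match('/\D/', '12345') == 1);        // false: every character is a digit
```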

The third and final way involves using character ranges, which are characters listed within square brackets ([ and ]). A character range matches a single character, and like any normal single character it can have repetition applied to it.

<?php
// Matches the same as \d
$matches = (preg_match('/[0-9]/', $string) == 1);

// Matches the same as \w
$matches = (preg_match('/[a-zA-Z0-9_]/', $string) == 1);
?>

Ranges are ordered by ASCII (American Standard Code for Information Interchange) value. In other words, the ASCII value of the beginning character must precede the ASCII value of the ending character. Otherwise, the warning “Warning: preg_match(): Compilation failed: range out of order in character class at offset n” is emitted, where n is the character offset within the regular expression.
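
To make the ordering rule concrete, here is a small sketch that compares ASCII values with ord() and then tries a range in each direction. The variable names are illustrative, and the @ operator is used only to keep the warning quiet so the return value can be inspected.

```php
<?php
// ord() returns the byte value of a character, which is its ASCII code for ASCII input.
$aCode = ord('a');  // 97
$zCode = ord('z');  // 122

// "a" precedes "z", so this range compiles and matches.
$validRange = preg_match('/[a-z]/', 'q');        // 1

// "z" does not precede "a": compilation fails and preg_match() returns false.
$invalidRange = @preg_match('/[z-a]/', 'q');     // false

// A range covers everything between its endpoints, including the
// punctuation that sits between "Z" (90) and "a" (97) in ASCII.
$underscoreInRange = (preg_match('/[A-z]/', '_') == 1);  // true
```

The last line shows why a range such as [A-z] is usually a mistake when only letters are wanted; [a-zA-Z] avoids the extra punctuation.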

Within square brackets, single characters and special ranges are simply listed side by side with no delimiter, as shown in the second example above. Additionally, the escape sequences mentioned earlier such as \w can be used both inside and outside square brackets.
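
For instance, several ranges, single characters, and an escape sequence can share one pair of brackets. The patterns and inputs below are hypothetical and serve only as a sketch.

```php
<?php
// Three ranges side by side: a hexadecimal digit is 0-9, a-f, or A-F.
$isHex = (preg_match('/^[0-9a-fA-F]+$/', 'c0ffee') == 1);             // true

// An escape sequence mixed with single characters: word characters, dots, and hyphens.
// The hyphen is placed last so that it is treated as a literal rather than a range.
$isSimpleName = (preg_match('/^[\w.-]+$/', 'report-2014.final') == 1); // true
$hasBadChar   = (preg_match('/^[\w.-]+$/', 'report 2014') == 1);       // false: space is not listed
```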

ASCII Ranges

For an excellent ASCII lookup table, see http://www.asciitable.com.

There are three other noteworthy points about character ranges, as illustrated in the examples below.

<?php
// Using a literal ] in a character range is done like so
$matches = (preg_match('/[\]]/', $string) == 1);

// Matches any character that is not 'a'
$matches = (preg_match('/[^a]/', $string) == 1);

// Using a literal ^ in a character range is done like so
$matches = (preg_match('/[\^]/', $string) == 1);

// A ^ that is not the first character is also treated as a literal
$matches = (preg_match('/[a^]/', $string) == 1);
?>

  • To use a literal ] character in a character range, escape it in the same manner in which other meta-characters are escaped.
  • To negate a character range, use ^ as the first character in that character range. (Yes, this can be confusing since ^ is also used to denote the beginning of a line or entire string when it is not used inside a character range.) Note that negation applies to all characters in the range. In other words, a negated character range means “any character that is not any of these characters.”
  • To use a literal ^ character in a character range, either escape it in the same manner in which other meta-characters are escaped or place it anywhere other than the first position in the range.
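
The point that negation covers every character in the range can be seen with a range that lists more than one item. This is a sketch with made-up inputs and variable names.

```php
<?php
// [^0-9,] means "any character that is neither a digit nor a comma".
$plainNumberHasOther = (preg_match('/[^0-9,]/', '1,024') == 1);     // false: only digits and a comma
$withUnitHasOther    = (preg_match('/[^0-9,]/', '1,024 MB') == 1);  // true: the space is a match

// A caret that is not in first position is an ordinary character.
$containsAOrCaret = (preg_match('/[a^]/', '2^8') == 1);             // true: the literal ^ matches
```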

ctype Extension

Some simple patterns have equivalent functions available in the ctype library. These generally perform better and should be used over PCRE when appropriate. See http://php.net/ctype for more information on the ctype extension and the functions it offers.
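
As a rough illustration, the two checks below agree for the sample inputs shown. The inputs are hypothetical, the ctype functions expect string arguments, and the example assumes the ctype extension is enabled.

```php
<?php
// Digits only: a PCRE pattern and its ctype counterpart.
$pcreSaysDigits  = (preg_match('/^[0-9]+$/', '2014') == 1);   // true
$ctypeSaysDigits = ctype_digit('2014');                        // true

// Both reject a string that contains a letter.
$pcreRejectsMixed  = (preg_match('/^[0-9]+$/', '20a14') == 1); // false
$ctypeRejectsMixed = ctype_digit('20a14');                      // false
```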


© PCRE Extension — Web Scraping
Category: Article | Added by: Marsipan (03.09.2014)