Text Processing with Regular Expressions
Contact Hour 10: To be discussed on Tuesday 21st February, 2012.
Lecturer: Dr Chris P. Jobling.
Text Processing with Regular Expressions
- We conclude our review of the Basics of JavaScript with a discussion of text manipulation with regular expressions.
- Regular expressions are a key idea that we shall return to in the context of server-side scripting.
The slides and notes for this lecture are based on Chapter 4 of Robert W. Sebesta, Programming the World-Wide Web, 3rd Edition, Addison Wesley, 2006. There is a good discussion of JavaScript regular expressions in Sections 7.2 and 9.9 of Chris Bates, Web Programming: Building Internet Applications, 3rd Edition, John Wiley, 2006. A good website that introduces this topic is Regular-Expressions.info.
Contents of this Lecture
Text processing with regular expressions
Learning Outcomes
At the end of this lecture you should be able to answer these questions:
- What is a character class in a pattern?
- What are the predefined character classes, and what do they mean?
- What are the symbolic quantifiers, and what do they mean?
- Describe the two end-of-line anchors.
Learning Outcomes (2)
At the end of this lecture you should be able to answer these questions:
- What does the i pattern modifier do?
- What exactly does the String method replace do?
- What exactly does the String method match do?
Text manipulation in JavaScript
- Text manipulation is a very important feature of many Web Applications.
- Some examples:
- search for a string in a document or form field
- search for a string and replace it with another
- validate the text fields of a form
- Can use string manipulation, but it tends to be restrictive and inefficient
- Better to use a technique called pattern matching
Pattern Matching
- JavaScript provides two ways to do pattern matching:
- Using RegExp objects
- Using methods on String objects
- A powerful pattern matcher called a regular expression matcher is provided
- This is our first exposure to regular expressions but it is an important topic in its own right.
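The two approaches can be compared side by side. The following is a minimal sketch; the variable names text, pattern, found and position are chosen here for illustration only.

```javascript
// Two ways to ask the same question: does the text contain "script"?
var text = "JavaScript";
var pattern = /script/i;   // a RegExp object written as a literal

// 1. A method on the RegExp object: test returns true or false
var found = pattern.test(text);

// 2. A method on the String object: search returns a position or -1
var position = text.search(pattern);

console.log(found);     // true
console.log(position);  // 4
```

Both calls use the same RegExp object; they differ only in which object owns the method and in what is returned.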
A Little History
Regular expression pattern matching is a technique that was first developed for the text editors ed and sed, which were (and still are) part of the Unix system. The ideas were extended in the program awk and eventually reached their full potential in the Perl programming language. Perl regular expressions are the inspiration for JavaScript's, and variations of the Perl form of regular expression are to be found in many other contexts such as the text editors vi and emacs, most scripting languages, and even the standard Java library.
If you are interested, Regular expression has more to say on the subject.
Demo
- This clever piece of JavaScript magic was developed by Cüneyt Yılmaz
- You can use it to play with regular expressions
Simple patterns: characters
Normal characters (match themselves)
- E.g. /ee/ matches need, greed, weed, but not wed or dead
Simple patterns: meta-characters
Meta-characters have special meanings in patterns – they do not match themselves:
\ | ( ) [ ] { } ^ $ * + ? .
- A meta-character is treated as a normal character if it is escaped (preceded with a backslash \)
- period (.) is a special meta-character – it matches any character except newline
- E.g. /c.t/ matches Ascot, cat, cut and crt but not act or cart.
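The difference between a meta-character and its escaped form is easiest to see with the period. A small sketch follows; the variable names are illustrative, not part of the original notes.

```javascript
var unescaped = /3.14/;   // the dot matches any character except a newline
var escaped = /3\.14/;    // the escaped dot matches only a literal "."

var r1 = unescaped.test("3.14");  // true
var r2 = unescaped.test("3x14");  // true: "x" is accepted by the dot
var r3 = escaped.test("3.14");    // true
var r4 = escaped.test("3x14");    // false: "x" is not a period

console.log(r1, r2, r3, r4);
```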
The search function
search (pattern)
returns the position in the object string of the pattern (position is relative to zero);
- returns -1 if it fails
var str = "Gluckenheimer";
var position = str.search(/n/);
/* position is now 6 */
Character classes
- Put a sequence of characters in brackets, and it defines a set of characters, any one of which matches:
[abcd]
matches any of letters 'a', 'b', 'c', or 'd'.
- Dashes can be used to specify spans of characters in a class:
[a-z]
matches any lower-case letter (in the English alphabet).
- A caret at the left end of a class definition means match anything but the characters in the class:
[^0-9]
matches any character that is not a decimal digit.
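The three kinds of class described above can be tried out with search. This is only a sketch, and the sample strings are invented for the purpose.

```javascript
var vowel = /[aeiou]/;     // any one of the listed letters
var lower = /[a-z]/;       // any lower-case letter
var nonDigit = /[^0-9]/;   // any character that is not a decimal digit

var p1 = "rhythm".search(vowel);    // -1: no listed vowel present
var p2 = "Hello".search(lower);     // 1: "H" is upper case, "e" matches
var p3 = "2012a".search(nonDigit);  // 4: the first non-digit is "a"

console.log(p1, p2, p3);
```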
Character class abbreviations
Abbr. | Equiv. | Pattern Matches |
---|---|---|
\d | [0-9] | a digit |
\D | [^0-9] | not a digit |
\w | [A-Za-z_0-9] | a word character |
\W | [^A-Za-z_0-9] | not a word character |
\s | [ \r\t\n\f] | a whitespace character |
\S | [^ \r\t\n\f] | not a whitespace character |
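Unlike Perl, JavaScript does not interpolate variables in regular expression literals; to build a pattern from a variable, pass a string to the RegExp constructor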
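In JavaScript, a pattern that depends on a variable is typically built with the RegExp constructor, because the text between the slashes of a regular expression literal is taken literally. A minimal sketch, in which the names word, literal and built are made up for illustration:

```javascript
var word = "cat";                   // hypothetical search term
var literal = /word/;               // matches the letters w-o-r-d, not "cat"
var built = new RegExp(word, "i");  // equivalent to /cat/i

var viaLiteral = literal.test("A Cat sat on the mat");  // false
var viaBuilt = built.test("A Cat sat on the mat");      // true

console.log(viaLiteral, viaBuilt);
```

Note that when a pattern is written as a string, each backslash has to be doubled, e.g. new RegExp("\\d+") for /\d+/.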
Quantifiers
Quantifiers in braces
Quantifier | Meaning |
---|---|
{n} | exactly n repetitions |
{m,} | at least m repetitions |
{m,n} | at least m but not more than n repetitions (no space after the comma) |
Other Quantifiers
Just abbreviations for the most commonly used quantifiers
- * means zero or more repetitions, e.g., \d* means zero or more digits
- + means one or more repetitions, e.g., \d+ means one or more digits
- ? means zero or one, e.g., \d? means zero or one digit
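The quantifiers can be compared on a single test string. The sketch below uses invented sample data; the match method with the g modifier, used here to list every match, is described later in these notes.

```javascript
var str = "a1 b22 c333 d4444";

// {2,3}: at least two but not more than three digits in a row
var twoOrThree = str.match(/\d{2,3}/g);

// +: one or more digits
var oneOrMore = str.match(/\d+/g);

// ?: the "u" may appear zero times or once
var optionalU = /colou?r/;

// *: the "b" may appear any number of times, including zero
var anyBs = /ab*c/;

console.log(twoOrThree);                // ["22", "333", "444"]
console.log(oneOrMore);                 // ["1", "22", "333", "4444"]
console.log(optionalU.test("color"));   // true
console.log(optionalU.test("colour"));  // true
console.log(anyBs.test("ac"));          // true
console.log(anyBs.test("abbbc"));       // true
```

The last run of four digits yields "444" for {2,3} because the matcher takes at most three digits, and the single digit left over is too short to match again.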
Anchors
The pattern can be forced to match only at the start with ^ or at the end with $
- Example 1: /^Lee/ matches “Lee Ann” but not “Mary Lee Ann”
- Example 2: /Lee Ann$/ matches “Mary Lee Ann”, but not “Mary Lee Ann is nice”
- The anchor operators (^ and $) do not match characters in the string – they match positions, at the beginning or end
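The two examples above can be checked directly, and using both anchors together forces the pattern to cover the whole string. A short sketch with illustrative variable names:

```javascript
var startsWithLee = /^Lee/;
var endsWithLeeAnn = /Lee Ann$/;
var allDigits = /^\d+$/;   // both anchors: the whole string must be digits

var a1 = startsWithLee.test("Lee Ann");                // true
var a2 = startsWithLee.test("Mary Lee Ann");           // false
var a3 = endsWithLeeAnn.test("Mary Lee Ann");          // true
var a4 = endsWithLeeAnn.test("Mary Lee Ann is nice");  // false
var a5 = allDigits.test("2012");                       // true
var a6 = allDigits.test("2012 AD");                    // false

console.log(a1, a2, a3, a4, a5, a6);
```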
Pattern modifiers
The i modifier tells the matcher to ignore the case of letters
- Example: /oak/i matches “OAK” and “Oak”
The x modifier, which tells the matcher to ignore whitespace in the pattern (allowing comments in patterns), is a Perl feature; JavaScript regular expressions do not support it
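The effect of the i modifier can be seen by running the same pattern with and without it. A minimal sketch; the variable names are illustrative.

```javascript
var caseSensitive = /oak/;
var caseInsensitive = /oak/i;

var m1 = caseSensitive.test("OAK");               // false: case differs
var m2 = caseInsensitive.test("OAK");             // true
var m3 = caseInsensitive.test("Oak");             // true
var m4 = "An old Oak".search(caseInsensitive);    // 7

console.log(m1, m2, m3, m4);
```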
The replace function
replace(pattern, string)
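- Finds a substring that matches the pattern and replaces it with the string; the result is returned as a new string and the original string is not changed
- The g modifier means “replace globally”: all matched substrings are replaced, not just the first
- Substrings matched by parenthesized parts of the pattern can be referred to in the replacement string as $1, $2, etc.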
The replace function: example
var str = "Some rabbits are rabid";
var result = str.replace(/rab/g, "tim");
// result is "Some timbits are timid"; str itself is unchanged
// the pattern has no parentheses, so $1 and $2 are not set
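The $1 and $2 references become useful once the pattern contains parenthesized groups. The following sketch swaps the two parts of a name; the sample string and variable names are assumptions made for illustration.

```javascript
var name = "Jobling, Chris";

// (\w+) captures the surname as $1 and the first name as $2
var swapped = name.replace(/(\w+), (\w+)/, "$2 $1");

console.log(swapped);  // "Chris Jobling"
console.log(name);     // "Jobling, Chris" (the original is unchanged)
```

Because strings in JavaScript cannot be modified in place, the value returned by replace has to be stored or used; calling replace and ignoring its result has no effect.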
The match function
match(pattern)
- The most general pattern-matching method
- Returns an array of results of the pattern-matching operation
- With the g modifier, it returns an array of all of the substrings that matched
- Without the g modifier, the first element of the returned array is the matched substring and the other elements hold the values of $1, $2, …
- Returns null if there is no match
The match function: example
var str = "My 3 kings beat your 2 aces";
var matches = str.match(/[ab]/g);
// matches is set to ["b", "a", "a"]
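The example above uses the g modifier. For comparison, here is a sketch of match without g, where the parenthesized groups appear after the whole match in the returned array. The sample sentence and variable names are invented for illustration.

```javascript
var sentence = "Lecture on 21 February 2012";

// Three groups: day, month name, year
var parts = sentence.match(/(\d+) (\w+) (\d+)/);

console.log(parts[0]);  // "21 February 2012" (the whole match)
console.log(parts[1]);  // "21"
console.log(parts[2]);  // "February"
console.log(parts[3]);  // "2012"

// When nothing matches, the result is null rather than an empty array
var nothing = sentence.match(/xyz/);
console.log(nothing);   // null
```

Because a failed match returns null, it is worth testing the result before indexing into it.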
The split function
split(parameter)
- Example:
var str = "grapes:apples:oranges";
var fruit = str.split(/:/);
// fruit is set to ["grapes", "apples", "oranges"]
- Here ":" and /:/ are equivalent
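A pattern becomes worthwhile as the split parameter when the separator varies. The sketch below splits on any run of commas, semicolons or whitespace; the input string is made up for illustration.

```javascript
var messy = "grapes, apples;oranges  pears";

// One or more separator characters in a row count as a single separator
var items = messy.split(/[,;\s]+/);

console.log(items);  // ["grapes", "apples", "oranges", "pears"]
```

A plain string separator could not do this, since it would have to match one fixed sequence of characters.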
Another Example
Common use of JavaScript is to check validity of user inputs on forms
- avoids a trip to server that would result in an error page
- error handling is kept local
- usually triggered by submission button
- error message generated locally by writing into document object.
- This example defines a function that could be used in a registration page to check that a phone number is valid (using US conventions!). HTML5 markup: forms_check.html; script: forms_check.js
Markup:
<!DOCTYPE html>
<!-- forms_check.html
     A function tst_phone_num is defined and tested.
     This function checks the validity of phone number input from a form -->
<html lang="en">
<head>
  <meta charset="utf-8" />
  <title> Phone number tester </title>
</head>
<body>
  <form id="phone" method="post" action="http://eng-hope.swan.ac.uk/cgi-bin/echo_form.cgi">
    <label for="phone_number">Phone number: </label>
    <input id="phone_number" type="text" name="phone_number" placeholder="444-4444" />
    <input type="submit" onclick="return validate();" name="Submit" value="Submit" />
  </form>
  <!-- Best practice guidelines suggest that you load scripts last -->
  <script src="forms_check.js"></script>
</body>
</html>
The script (the validation function validate() will be explained later):
/* Function tst_phone_num
   Parameter: A string
   Result: Returns true if the parameter has the form of a legal
           seven-digit phone number (3 digits, a dash, 4 digits) */
function tst_phone_num(num) {
    // Use a simple pattern to check the number of digits and the dash.
    // The ^ and $ anchors make the pattern cover the whole string, so
    // input with extra characters (e.g. "444-54321") is rejected.
    var ok = num.search(/^\d{3}-\d{4}$/);
    return ok === 0;
} // end of function tst_phone_num

/* Actual form validation. Called onclick */
var validate = function () {
    var phoneNumber = document.getElementById("phone_number");
    if (tst_phone_num(phoneNumber.value)) {
        return true;
    } else {
        alert("Phone number is invalid. Please use format ddd-dddd.");
        return false; // prevents submission
    }
};
Test code for tst_phone_num
// Test tst_phone_num -- commented out in production
var tests = ["444-5432", "444-r432", "44-1234"];
for (var i = 0; i < tests.length; i++) {
    var tst = tst_phone_num(tests[i]);
    if (tst) {
        console.log(tests[i] + " is a legal phone number");
    } else {
        console.error("Error in tst_phone_num: " + tests[i] + " is not a legal phone number");
    }
}
Regular Expression Validator in HTML5
- The new pattern attribute can be used on some modern browsers
- Pattern text is actually evaluated as the JavaScript expression /^pattern$/ by the JavaScript engine.
- You may need to provide a JavaScript fallback for older browsers (see later)
- E.g. <input id="phone_number" type="text" name="phone_number" placeholder="444-4444" pattern="\d{3}-\d{4}" />
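The way the browser applies the pattern text can be imitated in script, which is also the core of a fallback for older browsers. The following is a sketch only: matchesPattern is a hypothetical helper, not a browser API, and it assumes the pattern text has already been read from the input element's pattern attribute.

```javascript
// Hypothetical helper: checks a value against pattern text in the
// same spirit as the pattern attribute, i.e. the whole value must match.
function matchesPattern(value, patternText) {
    // Wrap the text in ^(?: ... )$ so that it is anchored at both ends
    var re = new RegExp("^(?:" + patternText + ")$");
    return re.test(value);
}

var phonePattern = "\\d{3}-\\d{4}";  // backslashes doubled inside a string

console.log(matchesPattern("444-5432", phonePattern));   // true
console.log(matchesPattern("444-54321", phonePattern));  // false
console.log(matchesPattern("x444-5432", phonePattern));  // false
```

The anchoring explains why no ^ or $ is written in the attribute itself: a value with extra characters before or after a valid number is rejected.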
HTML5 Version of the Phone Number Validator
<!DOCTYPE html>
<!-- forms_check_html5.html
     Uses the new HTML5 pattern attribute to validate phone number -->
<html lang="en">
<head>
  <meta charset="utf-8" />
  <title> Phone number tester (HTML5)</title>
</head>
<body>
  <form id="phone" method="post" action="http://eng-hope.swan.ac.uk/cgi-bin/echo_form.cgi">
    <label for="phone_number">Phone number: </label>
    <input id="phone_number" type="text" name="phone_number" pattern="\d{3}-\d{4}" placeholder="444-4444" />
    <!-- No need for onclick validator now -->
    <input type="submit" name="Submit" value="Submit" />
  </form>
  <!-- Look no scripts! -->
</body>
</html>
Debugging JavaScript: IE6+
- Select Internet Options from the Tools menu
- Choose the Advanced tab
- Uncheck the Disable script debugging box
- Check the Display a notification about every script error box
- Now, a script error causes a small window to be opened with an explanation of the error
Debugging JavaScript: Firefox
- Select Tools → JavaScript Console
- A small window appears to display script errors
- Remember to clear the console after correcting an error message – avoids confusion
Debugging JavaScript (continued)
- If you need to trace the execution of your scripts you need more than a JavaScript console
- Both IE6 and Firefox have JavaScript Debuggers
- In IE6 the debugger is part of the browser. See http://www.microsoft.com/scripting/debugger/default.htm for documentation.
- For Firefox (and other Mozilla-based browsers, including Netscape), the JavaScript debugger is called Venkman and is an optional plug in available at http://www.mozilla.org/projects/venkman/.
Debugging with Firebug
- Firefox only!
- Firebug plugin provides sophisticated web page analysis tools including JavaScript debugging facilities and a console
- Firebug Lite provides (limited) facilities for IE and other browsers.
- Demo
Debugging in WebKit Browsers
- Apple Safari
- Google Chrome
- Have built-in development tools
Summary of This Lecture
Text processing with regular expressions
Learning Outcomes
At the end of this lecture you should be able to answer these questions:
- What is a character class in a pattern?
- What are the predefined character classes, and what do they mean?
- What are the symbolic quantifiers, and what do they mean?
- Describe the two end-of-line anchors.
Learning Outcomes (2)
At the end of this lecture you should be able to answer these questions:
- What does the i pattern modifier do?
- What exactly does the String method replace do?
- What exactly does the String method match do?
Exercises
Write, test and debug (if necessary) XHTML files that include JavaScript scripts for the following problems. When required to write functions, you must include a script to test the function with at least two different data sets.
- Input: A text string, using prompt; Output: either legal name or Illegal name, depending on whether the input string fits the required format, which is: Last name, first name, middle initial where neither of the names can have more than 15 characters.
- Input: A text string, using prompt; Output: The words of the input text, in alphabetical order
- Function: tst_name; Parameter: a string; Returns: true if the given string has the form: string1, string2, letter where both strings must be all lowercase letters except the first letter, and letter must be uppercase; false otherwise.
More Homework Exercises
- Further basic JavaScript exercises, taken from Chapter 4 of Chris Bates, Web Programming: Building Internet Applications, 3rd Edition, John Wiley, 2006, are available. See the additional exercises for details.
- Watch the two instructional videos on Regular Expressions and Debugging in Firebug.
- Work through the Practical Exercises
What's Next?
Manipulating web documents through the Document Object Model (DOM) and the JavaScript event model.