In imperative languages:
In declarative languages:
Good (?:morning|evening)
?calc_expr: NUMBER | calc_op
calc_op: "(" OPERATOR calc_expr* ")"
OPERATOR: "+" | "-" | "*" | "/"
select max(longitude) from cities where longitude >= 115
Many declarative languages are domain-specific: they are designed to tackle problems in a particular domain, instead of being general purpose multi-domain programming languages.
Language | Domain |
---|---|
Regular expressions | Pattern-matching strings |
Backus-Naur Form | Parsing strings into parse trees |
SQL | Querying and modifying database tables |
HTML | Describing the semantic structure of webpage content |
CSS | Styling webpages based on selectors |
Prolog | Describes and queries logical relations |
Pattern matching in strings is a common problem in computer programming.
An imperative approach:
def is_email_address(str):
parts = str.split('@')
if len(parts) != 2:
return False
domain_parts = parts[1].split('.')
return len(domain_parts) >= 2 and len(domain_parts[-1]) == 3
An equivalent regular expression:
(.+)@(.+)\.(.{3})
With regular expressions, a programmer can just describe the pattern using a common syntax, and a regular expression engine figures out how to do the pattern matching for them.
The following are special characters in regular expressions:
\
(
)
[
]
{
}
+
*
?
|
$
^
.
To match an exact string that has no special characters, just use the string:
Berkeley, CA 94720
But if the matched string contains special characters, they must be escaped using a backslash.
\(1\+3\)
The .
character matches any single character
that is not a new line.
.a.a.a
It's typically better to match a more specific range of characters, however...
Pattern | Description | Example | Matches: |
---|---|---|---|
[] |
Denotes a character class. Matches characters in a set (including ranges of characters like 0-9 ). Use [^] to match characters outside a set. |
[top] |
|
. |
Matches any character other than the newline character. | 1. |
|
\d |
Matches any digit character. Equivalent to [0-9] . \D is the complement and refers to all non-digit characters. |
\d\d |
|
\w |
Matches any word character. Equivalent to [A-Za-z0-9_] . \W is the complement. |
\d\w |
|
\s |
Matches any whitespace character: spaces, tabs, or line breaks. \S is the complement. |
\d\s\w |
These indicate how many of a character/character class to match.
Pattern | Description | Example | Matches: |
---|---|---|---|
* |
Matches 0 or more of the previous pattern. | a* |
|
+ |
Matches 1 or more of the previous pattern. | lo+l |
|
? |
Matches 0 or 1 of the previous pattern. | lo?l |
|
{} |
Used like {Min, Max} . Matches a quantity between Min and Max of the previous pattern. |
a{2,4} |
These don't match an actual character, they indicate the position where the surrounding pattern should be found.
Pattern | Description | Example | Matches: |
---|---|---|---|
^ |
Matches the beginning of a string. | ^aw+ |
|
$ |
Matches the end of a string. | \w+y$ |
|
\b |
Matches a word boundary, the beginning or end of a word. | \w+e\b |
Patterns P₁
and P₂
can be combined in various ways.
Combination | Description | Example | Matches: |
---|---|---|---|
P₁P₂ |
A match for P₁ followed immediately by one for P₂. | ab[.,] |
|
P₁|P₂ |
Matches anything that either P₁ or P₂ does. | \d+|Inf |
|
(P₁) |
Matches whatever P₁ does. Parentheses group, just as in arithmetic expressions. | (<3)+ |
Regular expressions are supported natively in many languages and tools.
Languages: Perl, ECMAScript, Java, Python, ..
Tools: Excel/Google Spreadsheets, SQL, BigQuery, VSCode, grep, ...
In normal Python strings, a backslash indicates an escape sequence, like \n
for new line
or \b
for bell.
>>> print("I have\na newline in me.")
I have
a newline in me
But backslash has a special meaning in regular expressions. To make it easy to write regular expressions
in Python strings, use raw strings by prefixing the string with an r
:
pattern = r"\b[ab]+\b"
The re module provides many helpful functions.
Function | Description |
re.search(pattern, string)
| returns a match object representing the first occurrence of pattern within string |
re.fullmatch(pattern, string)
| returns a match object, requiring that pattern matches the entirety of string |
re.match(pattern, string)
| returns a match object, requiring that string starts with a substring that matches pattern |
re.findall(pattern, string)
| returns a list of strings representing all matches of pattern within string, from left to right |
re.sub(pattern, repl, string)
| substitutes all matches of pattern within string with repl |
The functions re.match
, re.search
, and re.fullmatch
all take a string containing a regular expression and a string of text.
They return either a
Match
object or, if there is no match, None
.
re.fullmatch
requires that the pattern matches the entirety of the string:
import re
re.fullmatch(r'-?\d+', '123') # <re.Match object>
re.fullmatch(r'-?\d+', '123 peeps') # None
Match objects are treated as true values, so you can use the result as a boolean:
bool(re.fullmatch(r'-?\d+', '123')) # True
bool(re.fullmatch(r'-?\d+', '123 peeps')) # False
re.search
returns a match object representing the first occurrence of pattern within string.
title = "I Know Why the Caged Bird Sings"
bool(re.search(r'Bird', title)) # True
Match objects also carry information about what has been matched.
The .group()
method allows you to retrieve it.
x = "This string contains 35 characters."
mat = re.search(r'\d+', x)
mat.group(0) # 35
If there are parentheses in a patterns, each of the parenthesized groups will become groups in the match object.
x = "There were 12 pence in a shilling and 20 shillings in a pound."
mat = re.search(r'(\d+)[a-z\s]+(\d+)', x)
mat.group(0) # '12 pence in a shilling and 20'
mat.group(1) # 12
mat.group(2) # 20
mat.groups() # (12, 20)
re.findall()
returns a list of strings representing all matches of pattern within string, from left to right.
locations = "CA 91105, NY 13078, CA 94702"
re.findall(r'\d\d\d\d\d', locations)
# ['91105', '13078', '94702']
Regular expressions can match a given string in more than one way. Especially when there are parenthesized groups, this can lead to ambiguity:
mat = re.match(r'wind|window', 'window')
mat.group() # 'wind'
mat = re.match(r'window|wind', 'window')
mat.group() # 'window'
mat = re.match(r'(wind|window)(.*)shade', 'window shade')
mat.groups() # ('wind', 'ow ')
mat = re.match(r'(window|wind)(.*)shade', 'window shade')
mat.groups() # ('window', ' ')
Python resolves these particular ambiguities in favor of the first option.
Likewise, there is ambiguity with *
, +
, and ?
.
mat = re.match(r'(x*)(.*)', 'xxx')
mat.groups() # ('xxx', '')
mat = re.match(r'(x+)(.*)', 'xxx')
mat.groups() # ('xxx', '')
mat = re.match(r'(x?)(.*)', 'xxx')
mat.groups() # ('x', 'xx')
mat = re.match(r'(.*)/(.+)', '12/10/2020')
mat.groups() # ('12/10', '2020')
Python chooses to match greedily, matching the pattern left-to-right and, when given a choice, matching as much as possible while still allowing the rest of the pattern to match.
Sometimes, you don’t want to match as much as possible.
The lazy operators *?
, +?
, and ??
match only as much as necessary for the whole pattern to match.
mat = re.match(r'(.*)(\d*)', 'I have 5 dollars')
mat.groups() # ('I have 5 dollars', '')
mat = re.match(r'(.*?)(\d+)', 'I have 5 dollars')
mat.groups() # ('I have ', '5')
mat = re.match(r'(.*?)(\d*)', 'I have 5 dollars')
mat.groups() # ('', '')
The ambiguities introduced by *
, +
, ?
, and |
don’t matter if
all you care about is whether there is a match!
Regular expressions can be very useful. However: