Discussion 12: Regular Expressions & BNF
Files: disc12.pdf
This is an online worksheet that you can work on during discussions. Your work is not graded and you do not need to submit anything.
BNF
Backus-Naur Form (BNF) is a syntax for describing a context-free grammar. It was invented for describing the syntax of programming languages, and is still commonly used in documentation and language parsers. EBNF is a dialect of BNF which contains some convenient shorthands.
An EBNF grammar contains symbols and a set of recursive production rules. In 61A, we are using the Python Lark library to write EBNF grammars, which has a few specific rules for grammar writing.
There are two types of symbols: Non-terminal symbols can expand into non-terminals (including themselves) or terminals. In the Python Lark library, non-terminal symbols are always lowercase. Terminal symbols can be strings or regular expressions. In Lark, terminals are always uppercase.
Consider these two production rules:
numbers: INTEGER | numbers "," INTEGER
INTEGER: /-?\d+/
The symbol numbers
is a non-terminal with a recursive production rule.
It corresponds to either an INTEGER
terminal or to the numbers
symbol (itself) plus a comma plus an INTEGER
terminal. The INTEGER
terminal is defined using a regular expression which matches any number
of digits with an optional - sign in front.
This grammar can describe strings like:
10
10,-11
10,-11,12
And so on, with any number of integers in front.
A grammar should also specify a start symbol, which corresponds to the whole expression being parsed (or the whole sentence, for a spoken language).
For the simple example of comma-separated numbers, the start symbol could just be the numbers terminal itself:
?start: numbers
numbers: numbers "," INTEGER | INTEGER
INTEGER: /-?\d+/
EBNF grammars can use these shorthand notations for specifying how many symbols to match:
EBNF Notation | Meaning | Pure BNF Equivalent |
---|---|---|
item\* | Zero or more items | items: \| items item |
item+ | One or more items | items: item \| items item |
[item] item? |
Optional item | optitem: \| item |
Lark also includes a few handy features:
- You can specify tokens to complete ignore by using the ignore
directive at the bottom of a grammar. For example,
%ignore /\s+/
ignores all whitespace (tabs/spaces/new lines). - You can import pre-defined terminals for common types of data to
match. For example,
%import common.NUMBER
imports a terminal that matches any integer or decimal number.
Q1: lambda BNF
We’ve written a simple BNF grammar to handle lambda expressions. The body of our lambda has to consist of a single expression, which can be a number, word, or another lambda expression.
?start: lambda_expression
lambda_expression: "lambda " arguments ":" body
arguments: WORD ("," WORD)*
body: expression
?expression: value | lambda_expression
?value: WORD | NUMBER
%import common.WORD
%import common.NUMBER
%ignore /\s+/
For each of the given examples, draw the resulting tree created by this BNF.
lark> lambda x: 5
lark> lambda x, y: x
lark> lambda x: lambda y: x
Regular Expressions
Q6: Email Domain Validator
Create a regular expression that makes sure a given string email
is a
valid email address and that its domain name is in the provided list of
domains
.
An email address is valid if it contains letters, number, or underscores, followed by an @ symbol, then a domain.
All domains will have a 3 letter extension following the period.
Hint: For this problem, you will have to make a regex pattern based
on the inputs domains
. A for loop can help with that.
Extra: There is a particularly elegant solution that utilizes join and replace instead of a for loop.
Note: The skeleton code is just a suggestion; feel free to use your own structure if you prefer.
import re
def email_validator(email, domains):
"""
>>> email_validator("oski@berkeley.edu", ["berkeley.edu", "gmail.com"])
True
>>> email_validator("oski@gmail.com", ["berkeley.edu", "gmail.com"])
True
>>> email_validator("oski@berkeley.com", ["berkeley.edu", "gmail.com"])
False
>>> email_validator("oski@berkeley.edu", ["yahoo.com"])
False
>>> email_validator("xX123_iii_OSKI_iii_123Xx@berkeley.edu", ["berkeley.edu", "gmail.com"])
True
>>> email_validator("oski@oski@berkeley.edu", ["berkeley.edu", "gmail.com"])
False
>>> email_validator("oski@berkeleysedu", ["berkeley.edu", "gmail.com"])
False
"""
pattern = _____________________
for _____________________:
'Use as many lines as necessary'
return bool(re.search(pattern, email))