Discussion 12: Regular Expressions & BNF

Files: disc12.pdf

This is an online worksheet that you can work on during discussions. Your work is not graded and you do not need to submit anything.

BNF

Backus-Naur Form (BNF) is a syntax for describing a context-free grammar. It was invented for describing the syntax of programming languages, and is still commonly used in documentation and language parsers. EBNF is a dialect of BNF which contains some convenient shorthands.

An EBNF grammar contains symbols and a set of recursive production rules. In 61A, we are using the Python Lark library to write EBNF grammars, which has a few specific rules for grammar writing.

There are two types of symbols: Non-terminal symbols can expand into non-terminals (including themselves) or terminals. In the Python Lark library, non-terminal symbols are always lowercase. Terminal symbols can be strings or regular expressions. In Lark, terminals are always uppercase.

Consider these two production rules:

numbers: INTEGER | numbers "," INTEGER
INTEGER: /-?\d+/

The symbol numbers is a non-terminal with a recursive production rule. It corresponds to either an INTEGER terminal or to the numbers symbol (itself) plus a comma plus an INTEGER terminal. The INTEGER terminal is defined using a regular expression which matches any number of digits with an optional - sign in front.

This grammar can describe strings like:

10
10,-11
10,-11,12

And so on, with any number of integers in front.

A grammar should also specify a start symbol, which corresponds to the whole expression being parsed (or the whole sentence, for a spoken language).

For the simple example of comma-separated numbers, the start symbol could just be the numbers terminal itself:

?start: numbers
numbers: numbers "," INTEGER | INTEGER
INTEGER: /-?\d+/

EBNF grammars can use these shorthand notations for specifying how many symbols to match:

EBNF Notation	Meaning	Pure BNF Equivalent
item\*	Zero or more items	items: \\| items item
item+	One or more items	items: item \\| items item
[item] item?	Optional item	optitem: \\| item

Lark also includes a few handy features:

You can specify tokens to complete ignore by using the ignore directive at the bottom of a grammar. For example, %ignore /\s+/ ignores all whitespace (tabs/spaces/new lines).
You can import pre-defined terminals for common types of data to match. For example, %import common.NUMBER imports a terminal that matches any integer or decimal number.

Q1: lambda BNF

We’ve written a simple BNF grammar to handle lambda expressions. The body of our lambda has to consist of a single expression, which can be a number, word, or another lambda expression.

?start: lambda_expression
lambda_expression:  "lambda " arguments ":" body
arguments: WORD ("," WORD)*
body: expression
?expression: value | lambda_expression
?value: WORD | NUMBER

%import common.WORD
%import common.NUMBER
%ignore /\s+/

For each of the given examples, draw the resulting tree created by this BNF.

lark> lambda x: 5

lark> lambda x, y: x

lark> lambda x: lambda y: x

Regular Expressions

Q6: Email Domain Validator

Create a regular expression that makes sure a given string email is a valid email address and that its domain name is in the provided list of domains.

An email address is valid if it contains letters, number, or underscores, followed by an @ symbol, then a domain.

All domains will have a 3 letter extension following the period.

Hint: For this problem, you will have to make a regex pattern based on the inputs domains. A for loop can help with that.

Extra: There is a particularly elegant solution that utilizes join and replace instead of a for loop.

Note: The skeleton code is just a suggestion; feel free to use your own structure if you prefer.

import re
def email_validator(email, domains):
    """
    >>> email_validator("oski@berkeley.edu", ["berkeley.edu", "gmail.com"])
    True
    >>> email_validator("oski@gmail.com", ["berkeley.edu", "gmail.com"])
    True
    >>> email_validator("oski@berkeley.com", ["berkeley.edu", "gmail.com"])
    False
    >>> email_validator("oski@berkeley.edu", ["yahoo.com"])
    False
    >>> email_validator("xX123_iii_OSKI_iii_123Xx@berkeley.edu", ["berkeley.edu", "gmail.com"])
    True
    >>> email_validator("oski@oski@berkeley.edu", ["berkeley.edu", "gmail.com"])
    False
    >>> email_validator("oski@berkeleysedu", ["berkeley.edu", "gmail.com"])
    False
    """
    pattern = _____________________
    for _____________________:
        'Use as many lines as necessary'
    return bool(re.search(pattern, email))