When to safely and reliably deploy regular expressions in your codebase.
Regular expressions have long been demonised as totally unreliable, tacky solutions that should be avoided at all costs in most, if not all, situations. However, in my opinion and experience, regexes are not all bad and evil, as long as you use them responsibly.
When you overuse or abuse them, you run into the kind of trouble you can't easily dig yourself out of.
Watch this short video (2 minutes) below:
This is the famous John Papa of the Angular community making a comment about regexes, and from what I see in the video frame above, only two of the regexes (i.e. the JavaScript regexes assigned to the variables x and z) are too complex and bad for use in any codebase.
For instance, the regular expression assigned to the variable x is for matching email address strings, and it is probably used for validating emails entered into an input text box on a web page. Yet, this is a very unsafe/unreliable application of a regular expression and should never be used. I will state why in a second; stay with me.
When I was a much less experienced software engineer, I would use an email address regex to perform web frontend validations for HTML forms.
In my naive mind, using a regular expression was the perfect way to fully match and validate email addresses. I mean, what could be better?
Back in the day, programming languages like Perl handled regular expressions of mind-bending complexity so well that most programmers then reached for regular expressions as a solution to everything. These programmers were wrong: the regex-based solutions written in Perl were not tamper-proof (i.e. secure against malicious input) or assertion-proof (i.e. guaranteed to allow only correct/valid patterns). This means that sometimes they failed to do the job well enough, whether on an edge case or when the input string could cause the regex to malfunction.
Well, as I have gained more experience, I have realised that most things (not everything) that have an RFC or spec (e.g. email addresses, URIs, JSON, HTTP multipart request bodies) and that require a lengthy, hard-to-read, complicated regular expression to match all valid samples or string patterns (out there in the wild) are not worth deploying a lengthy regex for. Simply use a parser instead.
Furthermore, when you use a parser, you save yourself so much unnecessary hassle.
Several languages, like Go, have an email address parser as well as a JSON parser in their standard libraries. Other languages, like JavaScript, have ad hoc email address parsers available for Node.js, plus native support for a URL parser and a JSON parser in both Node.js and the browser.
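As a quick sketch (the URL and JSON string here are purely illustrative), leaning on JavaScript's native parsers in Node.js or the browser looks like this:
// Parse a URL with the built-in WHATWG URL parser instead of a regex
const parsed = new URL("https://example.com/search?q=regex");
console.log(parsed.host); // "example.com"
console.log(parsed.pathname); // "/search"
// Parse JSON with the built-in JSON parser instead of a regex
const payload = JSON.parse('{"id": 42, "tag": "regex"}');
console.log(payload.id); // 42
// Malformed input throws instead of silently slipping through
try {
  new URL("definitely not a url");
} catch (_) {
  console.log("Not a valid URL");
}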
There’s a simple formula I personally use to determine when to use a regular expression. This formula is centered around two ideas:
- Character Cardinality
- Character Entropy
These two ideas have always helped me decide whether or not to use a regex, and I will explain.
Cardinality is simply the size or number of distinct characters (the set of characters), occurring in a particular order (or arrangement), across a set of string patterns of a fixed or minimally varied length that can correctly match fully/globally against a given regex.
This definition of mine for character cardinality is taken from set theory in maths, stemming from this article. Let me explain further with a concrete example.
Let’s say I want to create a regex to match UUIDs. Now I have to look at many samples of UUIDs, start noting down similar patterns among them, and encode these patterns using the regex meta-characters (which, by the way, are mostly the same across programming languages).
After looking at several samples, it’s noticeable that a large portion of a UUID string consists of digits, the first six letters of the alphabet and hyphens. The cardinality, therefore, comes from the set of characters that make up most or every UUID, which is 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,- (17 distinct characters in all).
Also, the order in which these characters occur in the string is particularly distinctive: the characters occur in groups of 8, 4, 4, 4 and 12, separated by the hyphen (-) character, like so:
9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d
Since the number of distinct characters that make up all samples of a UUID is low (fewer than 40 distinct characters is what I consider low), and the characters occur in a specific order and at a specific length (the canonical form of a UUID of any version is always 36 characters long), the character cardinality of any UUID is low.
This means I can match a UUID (v4) string with a regex like this:
// Bear in mind that UUIDs are organized by version. Below is UUID version 4
// Yet, for all versions of UUID, cardinality is still always low
const regexForUUIDv4 = /^[{(]?[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}[)}]?$/;
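To illustrate, here is a quick sanity check of that regex against the sample UUID shown earlier (the second input is just a made-up non-UUID string):
// The sample v4 UUID from earlier matches; a random string does not
console.log(regexForUUIDv4.test("9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d")); // true
console.log(regexForUUIDv4.test("not-a-uuid-at-all")); // false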
Therefore, any string sample with a high character cardinality that can perhaps be matched by a regex (short for regular expression) should very rarely, if ever, have a regex created for it.
So what are other examples of strings with low character cardinality, apart from UUIDs?
- All credit/debit card numbers
- All international phone numbers (E.164 format)
- All local Nigerian bank account numbers (NUBAN)
- Any username (e.g. a Twitter handle)
- Any date string (ISO, GMT)
For strings like these, with low character cardinality, it’s very much okay to use a regular expression that fully matches the string sample.
See examples below: 👇🏾👇🏾
// The format for MASTERCARD
const intlMasterCardRegex = /^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$/;
// The format for VERVE
const localVerveRegex = /^(?:50[067][180]|6500)(?:[0-9]{15})$/;
// The international E.164 phone number format
const intlPhoneNumberRegex = /^\+(?:[1-9]{1,3})(?:\d{9,12})$/;
International E.164 phone numbers are usually not of one specific length but of minimally varied length (some are 12 to 15 characters long, which is not a lot of variation).
NOTE: The structure of an E.164 international phone number is: [+][country code][area code][subscriber number]
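As a small illustration (the sample numbers below are made up), the E.164 regex above can be exercised like this:
// A made-up but well-formed E.164 number: matches
console.log(intlPhoneNumberRegex.test("+2348012345678")); // true
// Missing the leading "+", so it does not match
console.log(intlPhoneNumberRegex.test("2348012345678")); // false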
On the other hand, emails and URLs have very high character cardinality, so they are not suitable for full regex pattern matching (notice I highlighted the word “full”; this will be cleared up later in this article).
For instance, the regular expressions below for validating emails look like they’d do the job, but they won’t, because of the ways in which we can abuse them and get something past them that isn’t quite an email.
NOTE: The structure of an email address is: [local-part][@][domain]
// Which RFC should an email regex even follow: RFC 2822 or its successor RFC 5322?
// Both actually allow digits in the local-part of an email address, so a purely
// syntactic regex will happily accept addresses at domains that may not even exist.
// Confusing as to which RFC the regex should follow!!!
// See: https://datatracker.ietf.org/doc/html/rfc2822#section-3.4.1
// This regex will pass "2345@34555.com" as an email address
const badEmailAddressRegex = /^[^\s@]+@[^\s@.]+\.[a-z]{2,6}$/;
// This regex will pass "2345@34555.com" & "_2345@34555.0.bn.cdm" as email addresses
const veryBadEmailAddressRegex = /^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
// This regex will also pass "2345@34555.com" as an email address
const anotherHorribleEmailAddressRegex = /^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$/g;
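To see the problem concretely, here is a quick, purely illustrative check showing two of the regexes above accepting the bogus addresses called out in their comments:
// Both suspicious-looking addresses are syntactically accepted
console.log(badEmailAddressRegex.test("2345@34555.com")); // true
console.log(veryBadEmailAddressRegex.test("_2345@34555.0.bn.cdm")); // true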
Apart from the trivial fact that the regular expressions above for email addresses are not constructed well enough to reject invalid/non-existent domains (e.g. 34555.0.bn.cdm), lengthy regular expressions that match a full string can also be exploited for weaknesses (such as catastrophic backtracking, a.k.a. ReDoS) and lead to security issues or other syntax-related issues.
See below: 👇🏾
This sample email address (_23%%#45@345\-55.0.8.mp) is passed by the regular expression on the frontend (using JavaScript) below but rejected by an email parser (using Go) on the backend.
See below: 👇🏾
The point I am making here is that a parser offers you much more safety and correctness overall than a regular expression, even though, on the surface, they both achieve the same thing (security issues can be costly down the line, so why not be safe?). Still, you shouldn’t make an invalid email address more of your software’s problem than it really needs to be. It’s more of the end user’s problem, if anything.
If the parser validates and accepts the email address, it’s possible to follow it up with a lookup of the MX record for the email domain/host (attached to the email address; see the CLI command below) so that you are certain your app can send emails to that address. If there’s an error with the lookup, report that to the user as soon as possible so they can change the address (since there’s no other way to contact them if/after they leave your app/site).
nslookup -q=mx "gmail.com"
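If you’d rather do this check from application code than from the command line, Node.js exposes the same lookup through its dns module. Below is only a sketch of that idea (a hypothetical helper, not a complete deliverability check):
import { resolveMx } from "node:dns/promises";

// Returns true only if the domain publishes at least one MX record
async function domainCanReceiveEmail(domain: string): Promise<boolean> {
  try {
    const records = await resolveMx(domain);
    return records.length > 0;
  } catch (_) {
    // NXDOMAIN, no MX records, or a DNS failure
    return false;
  }
}

// domainCanReceiveEmail("gmail.com").then(console.log); // normally true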
Furthermore, be sure that bots are not spamming the HTML forms on your app/site with fictitious email addresses. An extra layer of checks can’t hurt your software.
The same goes for URLs and JSON strings. It’s always much safer to use a parser.
Lastly, for string patterns with high character cardinality, it’s very much okay to either split the string into parts using a parser and then apply regular expressions to each part, or use a regular expression to match only a small part of the entire string.
This can be done for emails and URLs respectively like so:
package main

import (
    "fmt"
    "regexp"
    "strings"
)

func checkIfEmailLike(email string) bool {
    // match a part of the email string: only the `@` sign
    // This is much better and safer than trying to match the full email
    var emailRegexp = regexp.MustCompile(`[@]`)
    return emailRegexp.MatchString(strings.ToLower(email))
}

func main() {
    isLikeEmail := checkIfEmailLike("aron+samuelson@gmail.com")
    if isLikeEmail {
        fmt.Println("It's possibly an email")
    } else {
        fmt.Println("It's possibly not an email")
    }
}
const checkIfURL = (url: string) => {
  try {
    // Use the parser to break the URI/URL into parts
    // Then return those parts so a regex can be used to check them
    const { host, protocol, pathname } = new URL(url);
    return { host, protocol, pathname };
  } catch (_) {
    throw new Error("Not a valid URL");
  }
};

export default function isWebTransportURL(url: string) {
  try {
    // The `protocol` has a low character cardinality
    // So we can now use a regex to check it safely
    const { protocol } = checkIfURL(url);
    // The URL parser reports the protocol with a trailing colon (e.g. "https:"),
    // so the regex checks for it while confirming `url` is an HTTP or WebSockets URL
    return /^(?:http|ws)s?:$/.test(protocol);
  } catch (e) {
    if (e instanceof Error) {
      throw e;
    }
  }
  return false;
}
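For completeness, here is a quick usage sketch of the function above (the URLs are just placeholders):
console.log(isWebTransportURL("wss://example.com/live")); // true
console.log(isWebTransportURL("https://example.com/api")); // true
console.log(isWebTransportURL("ftp://example.com/file")); // false
// isWebTransportURL("definitely not a url") throws: "Not a valid URL"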
Now, let’s move on to the second idea, which is character entropy.
Entropy is the degree to which the characters matched by a given regex, at any position, vary across different samples of all valid string patterns. It’s also, basically, the degree of randomness in the order or arrangement of the characters that make up every sample of the string pattern that can match a given regex.
This means that the characters that make up the string vary from sample to sample. A very good example of string patterns with high character entropy is passwords.
If you take a look at samples of plain passwords (not the hashed versions stored inside a typical database), chosen at random from different login credentials stored in different password managers (like 1Password), you will find that the characters that make up the passwords, and the arrangement or order of those characters, vary enormously.
This is the reason why no one creates regular expressions that fully match password strings.
Solid password strength checkers match only parts of a password to determine its strength. Naturally, there are no parsers for password strings, because passwords follow no decipherable pattern of character arrangement.
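As a rough sketch of that idea (the rules and scoring below are made up for illustration, not a production-grade strength checker), each regex matches only one aspect of the password rather than the whole string:
const strengthChecks = [
  /[a-z]/, // has a lowercase letter
  /[A-Z]/, // has an uppercase letter
  /[0-9]/, // has a digit
  /[^a-zA-Z0-9]/, // has a symbol
  /.{12,}/, // is at least 12 characters long
];

// The score is simply the number of partial checks that pass (0 to 5)
const passwordStrength = (password: string) =>
  strengthChecks.filter((check) => check.test(password)).length;

console.log(passwordStrength("hunter2")); // 2
console.log(passwordStrength("c0rrect-H0rse-b4ttery")); // 5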
URIs (including URLs) have high character entropy as well as high character cardinality, so doing a full regular expression match on them is a big NO!
So, if you are ever wondering whether or not you should use a regular expression on a string, check if the string has either high character cardinality or high character entropy. This should help you decide.
Cheers!