Regexes Got Good: The History And Future Of Regular Expressions In JavaScript

Category Image 080

Modern JavaScript regular expressions have come a long way compared to what you might be familiar with. Regexes can be an amazing tool for searching and replacing text, but they have a longstanding reputation (perhaps outdated, as I’ll show) for being difficult to write and understand.

This is especially true in JavaScript-land, where regexes languished for many years, comparatively underpowered compared to their more modern counterparts in PCRE, Perl, .NET, Java, Ruby, C++, and Python. Those days are over.

In this article, I’ll recount the history of improvements to JavaScript regexes (spoiler: ES2018 and ES2024 changed the game), show examples of modern regex features in action, introduce you to a lightweight JavaScript library that makes JavaScript stand alongside or surpass other modern regex flavors, and end with a preview of active proposals that will continue to improve regexes in future versions of JavaScript (with some of them already working in your browser today).

The History of Regular Expressions in JavaScript

ECMAScript 3, standardized in 1999, introduced Perl-inspired regular expressions to the JavaScript language. Although it got enough things right to make regexes pretty useful (and mostly compatible with other Perl-inspired flavors), there were some big omissions, even then. And while JavaScript waited 10 years for its next standardized version with ES5, other programming languages and regex implementations added useful new features that made their regexes more powerful and readable.

But that was then.

Did you know that nearly every new version of JavaScript has made at least minor improvements to regular expressions?

Let’s take a look at them.

Don’t worry if it’s hard to understand what some of the following features mean — we’ll look more closely at several of the key features afterward.

  • ES5 (2009) fixed unintuitive behavior by creating a new object every time regex literals are evaluated and allowed regex literals to use unescaped forward slashes within character classes (/[/]/).
  • ES6/ES2015 added two new regex flags: y (sticky), which made it easier to use regexes in parsers, and u (unicode), which added several significant Unicode-related improvements along with strict errors. It also added the RegExp.prototype.flags getter, support for subclassing RegExp, and the ability to copy a regex while changing its flags.
  • ES2018 was the edition that finally made JavaScript regexes pretty good. It added the s (dotAll) flag, lookbehind, named capture, and Unicode properties (via \p{...} and \P{...}, which require ES6’s flag u). All of these are extremely useful features, as we’ll see.
  • ES2020 added the string method matchAll, which we’ll also see more of shortly.
  • ES2022 added flag d (hasIndices), which provides start and end indices for matched substrings.
  • And finally, ES2024 added flag v (unicodeSets) as an upgrade to ES6’s flag u. The v flag adds a set of multicharacter “properties of strings” to \p{...}, multicharacter elements within character classes via \p{...} and \q{...}, nested character classes, set subtraction [A--B] and intersection [A&&B], and different escaping rules within character classes. It also fixed case-insensitive matching for Unicode properties within negated sets [^...].

As for whether you can safely use these features in your code today, the answer is yes! The latest of these features, flag v, is supported in Node.js 20 and 2023-era browsers. The rest are supported in 2021-era browsers or earlier.

Each edition from ES2019 to ES2023 also added additional Unicode properties that can be used via \p{...} and \P{...}. And to be a completionist, ES2021 added string method replaceAll — although, when given a regex, the only difference from ES3’s replace is that it throws if not using flag g.

Aside: What Makes a Regex Flavor Good?

With all of these changes, how do JavaScript regular expressions now stack up against other flavors? There are multiple ways to think about this, but here are a few key Aspects:

  • Performance.
    This is an important Aspect but probably not the main one since mature regex implementations are generally pretty fast. JavaScript is strong on regex performance (at least considering V8’s Irregexp engine, used by Node.js, Chromium-based browsers, and even Firefox; and JavaScriptCore, used by Safari), but it uses a backtracking engine that is missing any syntax for backtracking control — a major limitation that makes ReDoS vulnerability more common.
  • Support for advanced features that handle common or important use cases.
    Here, JavaScript stepped up its game with ES2018 and ES2024. JavaScript is now best in class for some features like lookbehind (with its infinite-length support) and Unicode properties (with multicharacter “properties of strings,” set subtraction and intersection, and script extensions). These features are either not supported or not as robust in many other flavors.
  • Ability to write readable and maintainable patterns.
    Here, native JavaScript has long been the worst of the major flavors since it lacks the x (“extended”) flag that allows insignificant whitespace and comments. Additionally, it lacks regex subroutines and subroutine definition groups (from PCRE and Perl), a powerful set of features that enable writing grammatical regexes that build up complex patterns via composition.

So, it’s a bit of a mixed bag.

JavaScript regexes have become exceptionally powerful, but they’re still missing key features that could make regexes safer, more readable, and more maintainable (all of which hold some people back from using this power).

The good news is that all of these holes can be filled by a JavaScript library, which we’ll see later in this article.

Using JavaScript’s Modern Regex Features

Let’s look at a few of the more useful modern regex features that you might be less familiar with. You should know in advance that this is a moderately advanced guide. If you’re relatively new to regex, here are some excellent tutorials you might want to start with:

Named Capture

Often, you want to do more than just check whether a regex matches — you want to extract substrings from the match and do something with them in your code. Named capturing groups allow you to do this in a way that makes your regexes and code more readable and self-documenting.

The following example matches a record with two date fields and captures the values:

const record = 'Admitted: 2024-01-01\nReleased: 2024-01-03';
const re = /^Admitted: (?<admitted>\d{4}-\d{2}-\d{2})\nReleased: (?<released>\d{4}-\d{2}-\d{2})$/;
const match = record.match(re);
console.log(match.groups);
/* → {
  admitted: '2024-01-01',
  released: '2024-01-03'
} */

Don’t worry — although this regex might be challenging to understand, later, we’ll look at a way to make it much more readable. The key things here are that named capturing groups use the syntax (?<name>...), and their results are stored on the groups object of matches.

You can also use named backreferences to rematch whatever a named capturing group matched via \k<name>, and you can use the values within search and replace as follows:

// Change 'FirstName LastName' to 'LastName, FirstName'
const name = 'Shaquille Oatmeal';
name.replace(/(?<first>\w+) (?<last>\w+)/, '$<last>, $<first>');
// → 'Oatmeal, Shaquille'

For advanced regexers who want to use named backreferences within a replacement callback function, the groups object is provided as the last argument. Here’s a fancy example:

function fahrenheitToCelsius(str) {
  const re = /(?<degrees>-?\d+(\.\d+)?)F\b/g;
  return str.replace(re, (...args) => {
    const groups = args.at(-1);
    return Math.round((groups.degrees - 32) * 5/9) + 'C';
  });
}
fahrenheitToCelsius('98.6F');
// → '37C'
fahrenheitToCelsius('May 9 high is 40F and low is 21F');
// → 'May 9 high is 4C and low is -6C'

Lookbehind

Lookbehind (introduced in ES2018) is the complement to lookahead, which has always been supported by JavaScript regexes. Lookahead and lookbehind are assertions (similar to ^ for the start of a string or \b for word boundaries) that don’t consume any characters as part of the match. Lookbehinds succeed or fail based on whether their subpattern can be found immediately before the current match position.

For example, the following regex uses a lookbehind (?<=...) to match the word “cat” (only the word “cat”) if it’s preceded by “fat ”:

const re = /(?<=fat )cat/g;
'cat, fat cat, brat cat'.replace(re, 'pigeon');
// → 'cat, fat pigeon, brat cat'

You can also use negative lookbehind — written as (?<!...) — to invert the assertion. That would make the regex match any instance of “cat” that’s not preceded by “fat ”.

const re = /(?<!fat )cat/g;
'cat, fat cat, brat cat'.replace(re, 'pigeon');
// → 'pigeon, fat cat, brat pigeon'

JavaScript’s implementation of lookbehind is one of the very best (matched only by .NET). Whereas other regex flavors have inconsistent and complex rules for when and whether they allow variable-length patterns inside lookbehind, JavaScript allows you to look behind for any subpattern.

The matchAll Method

JavaScript’s String.prototype.matchAll was added in ES2020 and makes it easier to operate on regex matches in a loop when you need extended match details. Although other solutions were possible before, matchAll is often easier, and it avoids gotchas, such as the need to guard against infinite loops when looping over the results of regexes that might return zero-length matches.

Since matchAll returns an iterator (rather than an array), it’s easy to use it in a for...of loop.

const re = /(?<char1>\w)(?<char2>\w)/g;
for (const match of str.matchAll(re)) {
  const {char1, char2} = match.groups;
  // Print each complete match and matched subpatterns
  console.log(Matched "${match[0]}" with "${char1}" and "${char2}");
}

Note: matchAll requires its regexes to use flag g (global). Also, as with other iterators, you can get all of its results as an array using Array.from or array spreading.

const matches = [...str.matchAll(/./g)];

Unicode Properties

Unicode properties (added in ES2018) give you powerful control over multilingual text, using the syntax \p{...} and its negated version \P{...}. There are hundreds of different properties you can match, which cover a wide variety of Unicode categories, scripts, script extensions, and binary properties.

Note: For more details, check out the documentation on MDN.

Unicode properties require using the flag u (unicode) or v (unicodeSets).

Flag v

Flag v (unicodeSets) was added in ES2024 and is an upgrade to flag u — you can’t use both at the same time. It’s a best practice to always use one of these flags to avoid silently introducing bugs via the default Unicode-unaware mode. The decision on which to use is fairly straightforward. If you’re okay with only supporting environments with flag v (Node.js 20 and 2023-era browsers), then use flag v; otherwise, use flag u.

Flag v adds support for several new regex features, with the coolest probably being set subtraction and intersection. This allows using A--B (within character classes) to match strings in A but not in B or using A&&B to match strings in both A and B. For example:

// Matches all Greek symbols except the letter 'π'
/[\p{Script_Extensions=Greek}--π]/v

// Matches only Greek letters
/[\p{Script_Extensions=Greek}&&\p{Letter}]/v

For more details about flag v, including its other new features, check out this explainer from the Google Chrome team.

A Word on Matching Emoji

Emoji are 🤩🔥😎👌, but how emoji get encoded in text is complicated. If you’re trying to match them with a regex, it’s important to be aware that a single emoji can be composed of one or many individual Unicode code points. Many people (and libraries!) who roll their own emoji regexes miss this point (or implement it poorly) and end up with bugs.

The following details for the emoji “👩🏻‍🏫” (Woman Teacher: Light Skin Tone) show just how complicated emoji can be:

// Code unit length
'👩🏻‍🏫'.length;
// → 7
// Each astral code point (above \uFFFF) is divided into high and low surrogates

// Code point length
[...'👩🏻‍🏫'].length;
// → 4
// These four code points are: \u{1F469} \u{1F3FB} \u{200D} \u{1F3EB}
// \u{1F469} combined with \u{1F3FB} is '👩🏻'
// \u{200D} is a Zero-Width Joiner
// \u{1F3EB} is '🏫'

// Grapheme cluster length (user-perceived characters)
[...new Intl.Segmenter().segment('👩🏻‍🏫')].length;
// → 1

Fortunately, JavaScript added an easy way to match any individual, complete emoji via \p{RGI_Emoji}. Since this is a fancy “property of strings” that can match more than one code point at a time, it requires ES2024’s flag v.

If you want to match emojis in environments without v support, check out the excellent libraries emoji-regex and emoji-regex-xs.

Making Your Regexes More Readable, Maintainable, and Resilient

Despite the improvements to regex features over the years, native JavaScript regexes of sufficient complexity can still be outrageously hard to read and maintain.

Regular Expressions are SO EASY!!!! pic.twitter.com/q4GSpbJRbZ

— Garabato Kid (@garabatokid) July 5, 2019


ES2018’s named capture was a great addition that made regexes more self-documenting, and ES6’s String.raw tag allows you to avoid escaping all your backslashes when using the RegExp constructor. But for the most part, that’s it in terms of readability.

However, there’s a lightweight and high-performance JavaScript library named regex (by yours truly) that makes regexes dramatically more readable. It does this by adding key missing features from Perl-Compatible Regular Expressions (PCRE) and outputting native JavaScript regexes. You can also use it as a Babel plugin, which means that regex calls are transpiled at build time, so you get a better developer experience without users paying any runtime cost.

PCRE is a popular C library used by PHP for its regex support, and it’s available in countless other programming languages and tools.

Let’s briefly look at some of the ways the regex library, which provides a template tag named regex, can help you write complex regexes that are actually understandable and maintainable by mortals. Note that all of the new syntax described below works identically in PCRE.

Insignificant Whitespace and Comments

By default, regex allows you to freely add whitespace and line comments (starting with #) to your regexes for readability.

import {regex} from 'regex';
const date = regex`
  # Match a date in YYYY-MM-DD format
  (?<year>  \d{4}) - # Year part
  (?<month> \d{2}) - # Month part
  (?<day>   \d{2})   # Day part
`;

This is equivalent to using PCRE’s xx flag.

Subroutines and Subroutine Definition Groups

Subroutines are written as \g<name> (where name refers to a named group), and they treat the referenced group as an independent subpattern that they try to match at the current position. This enables subpattern composition and reuse, which improves readability and maintainability.

For example, the following regex matches an IPv4 address such as “192.168.12.123”:

import {regex} from 'regex';
const ipv4 = regex`\b
  (?<byte> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d)
  # Match the remaining 3 dot-separated bytes
  (\. \g<byte>){3}
\b`;

You can take this even further by defining subpatterns for use by reference only via subroutine definition groups. Here’s an example that improves the regex for admittance records that we saw earlier in this article:

const record = 'Admitted: 2024-01-01\nReleased: 2024-01-03';
const re = regex`
  ^ Admitted:\ (?<admitted> \g<date>) \n
    Released:\ (?<released> \g<date>) $

  (?(DEFINE)
    (?<date>  \g<year>-\g<month>-\g<day>)
    (?<year>  \d{4})
    (?<month> \d{2})
    (?<day>   \d{2})
  )
`;
const match = record.match(re);
console.log(match.groups);
/* → {
  admitted: '2024-01-01',
  released: '2024-01-03'
} */

A Modern Regex Baseline

regex includes the v flag by default, so you never forget to turn it on. And in environments without native v, it automatically switches to flag u while applying v’s escaping rules, so your regexes are forward and backward-compatible.

It also implicitly enables the emulated flags x (insignificant whitespace and comments) and n (“named capture only” mode) by default, so you don’t have to continually opt into their superior modes. And since it’s a raw string template tag, you don’t have to escape your backslashes \\\\ like with the RegExp constructor.

Atomic Groups and Possessive Quantifiers Can Prevent Catastrophic Backtracking

Atomic groups and possessive quantifiers are another powerful set of features added by the regex library. Although they’re primarily about performance and resilience against catastrophic backtracking (also known as ReDoS or “regular expression denial of service,” a serious issue where certain regexes can take forever when searching particular, not-quite-matching strings), they can also help with readability by allowing you to write simpler patterns.

Note: You can learn more in the regex documentation.

What’s Next? Upcoming JavaScript Regex Improvements

There are a variety of active proposals for improving regexes in JavaScript. Below, we’ll look at the three that are well on their way to being included in future editions of the language.

Duplicate Named Capturing Groups

This is a Stage 3 (nearly finalized) proposal. Even better is that, as of recently, it works in all major browsers.

When named capturing was first introduced, it required that all (?<name>...) captures use unique names. However, there are cases when you have multiple alternate paths through a regex, and it would simplify your code to reuse the same group names in each alternative.

For example:

/(?<year>\d{4})-\d\d|\d\d-(?<year>\d{4})/

This proposal enables exactly this, preventing a “duplicate capture group name” error with this example. Note that names must still be unique within each alternative path.

Pattern Modifiers (aka Flag Groups)

This is another Stage 3 proposal. It’s already supported in Chrome/Edge 125 and Opera 111, and it’s coming soon for Firefox. No word yet on Safari.

Pattern modifiers use (?ims:...), (?-ims:...), or (?im-s:...) to turn the flags i, m, and s on or off for only certain parts of a regex.

For example:

/hello-(?i:world)/
// Matches 'hello-WORLD' but not 'HELLO-WORLD'

Escape Regex Special Characters with RegExp.escape

This proposal recently reached Stage 3 and has been a long time coming. It isn’t yet supported in any major browsers. The proposal does what it says on the tin, providing the function RegExp.escape(str), which returns the string with all regex special characters escaped so you can match them literally.

If you need this functionality today, the most widely-used package (with more than 500 million monthly npm downloads) is escape-string-regexp, an ultra-lightweight, single-purpose utility that does minimal escaping. That’s great for most cases, but if you need assurance that your escaped string can safely be used at any arbitrary position within a regex, escape-string-regexp recommends the regex library that we’ve already looked at in this article. The regex library uses interpolation to escape embedded strings in a context-aware way.

Conclusion

So there you have it: the past, present, and future of JavaScript regular expressions.

If you want to journey even deeper into the lands of regex, check out Awesome Regex for a list of the best regex testers, tutorials, libraries, and other resources. And for a fun regex crossword puzzle, try your hand at regexle.

May your parsing be prosperous and your regexes be readable.

The AI Bubble Might Burst Soon – And That’s a Good Thing

Featured Imgs 23

Almost two years into the AI hype, a looming market correction may soon separate true innovators from those who are trying to capitalize on the hype. The burst of the bubble could pave the way for a more mature phase of AI development.

ai-bubble.jpg

Amidst recent turmoil on the stock markets, during which the 7 biggest tech companies collectively lost some $650 billion, experts and media alike are warning that the next tech bubble is about to pop (e.g.: The Guardian, Cointelegraph, The Byte). The AI industry has indeed been riding a wave of unprecedented hype and investment, with inflated expectations potentially setting up investors and CEOs for a rude awakening. However, the bursting of a bubble often has a cleansing effect, separating the wheat from the chaff. This article examines the current state of the AI industry, exploring both the signs that point to an imminent burst and the factors that suggest continued growth.

Why the Bubble Must Burst

Since the release of ChatGPT started a mainstream hype around AI, it looks like investors jumped at the opportunity to put their money into AI-related projects. Billions have been spent on them this year alone, and analysts expect AI to become a $1 trillion industry within the next 4-5 years. OpenAI alone is currently valued at $80 billion, which is almost twice the valuation of General Motors, or four times that of Western Digital. The list of other AI companies with high valuations has been growing quickly, as has the list of failed AI startups. At the same time, the progress visible to end-users has slowed down, and the hype around AI has been overshadowed by an endless string of PR disasters.

Here are three key reasons why the AI bubble might pop soon:

  1. AI doesnt sell. A study led by researchers of Washington State University revealed that using 'artificial intelligence' in product descriptions decreases purchase likelihood. This effect most likely stems from the emotional trust people typically associate with human interaction. AI distrust might have been further fueled by various PR disasters ranging from lying chatbots to discriminatory algorithms and wasteful public spending on insubstantial projects.
  2. AI investments aren't paying off. Most AI companies remain unprofitable and lack clear paths to profitability. For instance, OpenAI received a $13 billion investment from Microsoft for a 49% stake. Yet OpenAI's estimated annual revenue from 8.9 million subscribers is just $2.5 billion. Even with minimal operational costs (which isn't the case), Microsoft faces a long road to recouping its investment, let alone profiting.
  3. Regulation is hampering progress. End-users have seen little tangible improvement in AI applications over the past year. While video generation has advanced, ChatGPT and other LLMs have become less useful despite boasting higher model numbers and larger training data. A multitude of restrictions aimed, for example, at copyright protection, preventing misuse, and ensuring inoffensiveness have led to a "dumbification of LLMs." This has created a noticeable gap between hype and reality. Nevertheless, AI companies continue hyping minor updates and little new features that fail to meet expectations.

It's also crucial to remember that technology adoption takes time. Despite ChatGPT's record-breaking user growth, it still lags behind Netflix by about 100 million users, and has only about 3.5% of Netflix's paid subscribers. Consider that it took 30 years for half the world's population to get online after the World Wide Web's birth in 1989. Even today, 37% globally (and 9-12% in the US and Europe) don't use the internet. Realistically, AI's full integration into our lives will take considerable time. The burst of economic bubbles is much more likely to occur before that.

The Thing About Bubbles

A potential counter-argument to the thesis that AI development is slowing down, lacks application value and will struggle to expand its userbase, is that some big players might be hiding groundbreaking developments, which they could pull out of their metaphorical hats any moment. Speculations about much better models or even AGI lurking on OpenAI's internal testing network are nothing new. And indeed it is a fact that tech that is being developed usually surpasses the capabilities of tech that has already been thoroughly tested and released such is the nature of development. While AI development certainly might have the one or the other surprise in stock, and new applications arise all the time, it is questionable if there's a wildcard that can counteract an overheated market and hastily made investments in the billions. So anyone who's invested into AI-related stocks might want to buckle up, as turbulent quarters are likely to be ahead.

Now forget your investment portfolio and think about progress. Here's why a bursting AI bubble might actually benefit the industry:

The thing about bubbles is, they don't say much about the real-life value of a new technology. Sure, the bursting of a bubble might show that a useless thing is useless, as was the case with NFTs, which got hyped up and then quickly lost their "value" (NFTs really were the tulip mania of the digital age). But the bursting of bubbles also does not render a useful thing useless. There are many good examples for this:

  • During the .com-bubble of the late 1990s countless companies boasting little more than a registered domain name were drastically overvalued and when the bubble did burst their stock became worthless from one day to another. Yet, .com-services are not only still around, they have become the driving force behind the economy.
  • The bursting of the crypto bubble in early 2018 blasted many shitcoins into oblivion, but Bitcoin is still standing and not far off its all-time high. Also, blockchain tech is already applied in many areas other than finance e.g. in supply-chain management.
  • The crash of the housing market in 2007 worked a little differently, as it was not a tech-bubble. Property was hopelessly overvalued and people couldn't keep up with rising interest rates. The bursting of the bubble exposed a dire reality of financial markets where investors bet on whether you will be able to pay your mortgage or not. And today? Well, take a look at the chart on average, housing in the US costs almost twice as much now as it did at the height of the bubble of 2007. Even when adjusted for inflation, buying a house is now more expensive than ever before.

In case of the housing market, the bursting of the bubble had the effect that mortgages became more difficult to access and financial speculations became a little more regulated. In the case of the .com- and crypto-bubble, however, the burst had a cleansing effect that drove away the fakes and shillers, and left the fraction of projects alive that were actually on to something. It can be suspected that a bursting of the AI bubble would have a similar effect.

While the prospect of an AI bubble burst may cause short-term market turbulence, it could ultimately prove beneficial for the industry's long-term health and innovation. A market correction would likely weed out ventures that lack substance and redirect focus towards applications with real-world impact.

Investors, developers, and users alike should view this potential reset not as an end, but as a new beginning. The AI revolution is far from over it's entering a more mature, pragmatic phase.

Compatibility Issues with SAS Controllers and Host Bus Adapters (HBA) in My

Featured Imgs 23

Hello DaniWeb community,

I'm currently working on a server build and have run into some issues with my SAS controllers and Host Bus Adapters (HBA). Heres the situation:

I'm using a [specific server motherboard model] with [specific CPU model] and [specific amount of RAM], and I've installed a [specific model] SAS controller. The goal is to connect several high-capacity SAS drives for storage, but I'm experiencing unexpected behavior. The system sometimes fails to recognize the drives during boot, and even when it does, the drives are intermittently dropping out during operation.

Ive tried updating the firmware on the SAS controller and the HBA, and Ive also tested different cables and power supplies, but the issue persists. Could this be a compatibility issue between the SAS controller and the HBA? Or is there something else I might be overlooking?

Any suggestions or insights would be greatly appreciated, especially if youve had similar experiences with SAS controllers and HBAs.

Thanks in advance!