Intro to Regular Expressions

2019-12-01 893 words 5 minutes

Introduction

Being a Linux user and not coming across regular expressions, or regex, is next to impossible. I kept seeing these cryptic little strings of characters everywhere, and the people who could use them looked like ninjas to me. They show up in text editors, JavaScript, Python, Java, Bash, and the list goes on.

What Exactly is a Regular Expression?

A regular expression is a string that combines normal characters with special meta-characters or meta-sequences. The normal characters match themselves. Meta-characters and meta-sequences are characters or sequences of characters that represent ideas such as quantity, location, or type of character.

Literals (literally match this)

This is what we normally do when searching for text: the pattern is matched character by character.

Sample Text:

hello world

The regex below contains no special or meta-characters, so it matches only literals.

hello

It matches the first five characters, hello, in the source string:

hello world
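
As a rough sketch, the same literal match might look like this with Python's built-in re module. Python and the variable names are assumptions made for illustration; any regex engine treats plain literals the same way.

```python
import re

text = "hello world"

# A pattern with no meta-characters matches itself, character by character.
match = re.search(r"hello", text)

matched_text = match.group(0)   # the text that was matched
matched_span = match.span()     # (start, end) offsets, end is exclusive

print(matched_text, matched_span)  # hello (0, 5)
```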

Meta-characters

Single Character (what to match?):

Option	Description
.	any single character (in most engines, except a newline by default)
\d	digit in 0123456789
\D	non-digit
\w	“word” character: letters, digits and _
\W	non-word character
(a single space)	space
\t	tab
\r	carriage return
\n	newline
\s	whitespace (a single space, \t, \r, \n)
\S	non-whitespace
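
To get a feel for a few rows of this table, here is a small sketch using Python's re module. The sample string is made up for illustration, and other engines may differ in details such as Unicode handling.

```python
import re

# Hypothetical sample text containing a space, a tab and a newline.
sample = "Room 42\tis open\n"

digits = re.findall(r"\d", sample)       # every single digit
spaces = re.findall(r"\s", sample)       # space, tab and newline all count
wildcard = re.findall(r"R..m", sample)   # each . stands for any one character

print(digits)    # ['4', '2']
print(spaces)    # [' ', '\t', ' ', '\n']
print(wildcard)  # ['Room']
```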

Quantifiers (How many occurrences to match?)

Option	Description
X*	Zero or more occurrences of X
X+	One or more occurrences of X
X?	Zero or one occurrence of X
X{m}	Exactly m occurrences of X
X{m,}	At least m occurrences of X
X{m,n}	Between m and n occurrences (inclusive) of X

Note

A quantifier comes after the character we want to repeat, not before it.

+T is invalid
T+ means one or more occurrence of the character T
u? means check for 0 or 1 occurrence of u in the string
?u is invalid
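
The quantifier rules can be tried out with a short sketch in Python's re module. The sample words and numbers are invented for illustration, and how an invalid pattern such as +T is reported depends on the engine; Python raises re.error.

```python
import re

# u? makes the u optional, so both spellings match but "colouur" does not.
candidates = ["color", "colour", "colouur"]
spellings = [w for w in candidates if re.fullmatch(r"colou?r", w)]

# T+ grabs each run of one or more T characters.
runs = re.findall(r"T+", "T TT xTTTy")

# {2,3} accepts two or three digits; a run of four yields only its first three.
two_or_three = re.findall(r"\d{2,3}", "1 22 333 4444")

# A quantifier with nothing in front of it is rejected by Python's engine.
try:
    re.compile(r"+T")
    leading_plus_ok = True
except re.error:
    leading_plus_ok = False

print(spellings)        # ['color', 'colour']
print(runs)             # ['T', 'TT', 'TTT']
print(two_or_three)     # ['22', '333', '444']
print(leading_plus_ok)  # False
```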

Position (where?)

Option	Description
\b	Word boundaries, defined as any edge between \w and \W
\B	non-word-boundaries
^	the beginning of the line
$	the end of the line
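
Here is a minimal sketch of the position markers in Python. One engine-specific detail worth knowing: in Python, ^ and $ anchor to the whole string unless the re.MULTILINE flag is passed, and only then do they behave per line. The log text is a made-up example.

```python
import re

# Hypothetical three-line log used only for illustration.
log = "error: disk full\nwarning: low memory\nerror: timeout"

# Without re.MULTILINE, ^ only matches at the very start of the string.
start_of_string = re.findall(r"^error", log)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
start_of_lines = re.findall(r"^error", log, flags=re.MULTILINE)
last_words = re.findall(r"\w+$", log, flags=re.MULTILINE)

# \b needs a word edge before "or"; \B needs the opposite.
at_boundary = [m.start() for m in re.finditer(r"\bor", "or error")]
inside_word = [m.start() for m in re.finditer(r"\Bor", "or error")]

print(start_of_string)  # ['error']
print(start_of_lines)   # ['error', 'error']
print(last_words)       # ['full', 'memory', 'timeout']
print(at_boundary)      # [0]
print(inside_word)      # [6]
```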

Practising the what, How many and where of regular expressions

That covers the absolute basics of regular expressions. Whenever you write one, ask yourself what the “What”, “How many”, and “Where” of the expression are.

The best way to learn regular expressions is through trial and error.

To follow along, open a regex testing tool in your text editor or an online regex tester.

Scenario 1:

Matching the word cat

What to match : ‘cat’
How many : none, since we are doing literal matching with no repetition (we want to match cat, not catcat).
Where : only ‘cat’ as a whole word, not inside another word.

regex : \bcat\b

This regular expression matches the literal cat between word boundaries (word boundaries mark the start or end of a word). Without the word boundaries, the regex would match every occurrence of ‘cat’ in the text. Always remember that there is a cat in catastrophe, and that is why word boundaries matter.
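
To see the difference the word boundaries make, here is a quick sketch in Python. The sentence is invented for the purpose.

```python
import re

# Hypothetical sentence with "cat" both as a word and inside other words.
text = "The cat caused a catastrophe; concatenate cat."

whole_words = re.findall(r"\bcat\b", text)   # only the standalone word
everywhere = re.findall(r"cat", text)        # also inside catastrophe, concatenate

print(len(whole_words))  # 2
print(len(everywhere))   # 4
```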

Scenario 2:

Match all occurrences of 10-digit phone numbers

What to match : digits, \d
How many : 10 occurrences, {10}
Where : not needed

regex : \d{10}
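
A possible way to try this scenario in Python is sketched below, using \d{10} for "a digit, ten times". The phone numbers are made up. The sketch also shows one consequence of leaving out the "where": a longer run of digits still yields a 10-digit match unless word boundaries are added.

```python
import re

# Hypothetical text; the numbers are invented.
text = "Call 9876543210 or 1234567890, not 12345."
numbers = re.findall(r"\d{10}", text)

# With no position constraint, the first 10 digits of a longer run match too.
loose = re.findall(r"\d{10}", "id 123456789012")
# Adding \b on both sides insists on exactly 10 digits standing alone.
strict = re.findall(r"\b\d{10}\b", "id 123456789012")

print(numbers)  # ['9876543210', '1234567890']
print(loose)    # ['1234567890']
print(strict)   # []
```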

Character Classes and Alternations

Character classes (anything in between [ ])

Think of them as the OR operation in programming. Inside the square brackets of a character class we list the alternative characters, any one of which may match.

[ab] matches a or b
f[au]n matches fan and fun

Note: a single-character meta-character such as . or + placed inside [ ] loses its special meaning and is matched literally.

But a few characters do have a special meaning inside [ ] in regular expressions. They are:

^ - As the first character inside [ ], it means anything except what is specified. [^ab] matches any character except a or b.

- - If it is the first character inside the [ ] it is treated as a literal -, but between two characters or digits it specifies a range: [a-z] means the lowercase letters from a to z and [0-9] means the digits from 0 to 9. A range covers single characters only, so [0-10] does not mean the numbers 0 to 10; it matches just 0 or 1.
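
The character class rules can be checked with a small sketch in Python. All sample strings are invented for illustration.

```python
import re

fan_fun = re.findall(r"f[au]n", "fan fun fin")        # a or u in the middle
not_ab = re.findall(r"[^ab]", "abcab")                # anything except a or b
lower_runs = re.findall(r"[a-z]+", "Hello World 42")  # runs of lowercase letters
literal_dot = re.findall(r"[.]", "a.b")               # . is literal inside [ ]
leading_dash = re.findall(r"[-a]", "a-b")             # a leading - is literal

print(fan_fun)       # ['fan', 'fun']
print(not_ab)        # ['c']
print(lower_runs)    # ['ello', 'orld']
print(literal_dot)   # ['.']
print(leading_dash)  # ['a', '-']
```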

Alternations ( anything in between ( | ) )

What if we need an OR between whole sequences of characters, such as matching either ‘net’ or ‘com’? We can use (net|com) for multi-character alternatives. Example: www\.facebook\.(com|net) matches www.facebook.com and www.facebook.net. The dots are escaped because an unescaped . matches any character.
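
Here is a rough sketch of alternation in Python. The list of host names is an assumption for illustration, and the dots are escaped so that they match only a literal dot; the last line shows what an unescaped dot would also let through.

```python
import re

pattern = re.compile(r"www\.facebook\.(com|net)")

# Hypothetical candidates to test against the pattern.
hosts = ["www.facebook.com", "www.facebook.net", "www.facebook.org"]
accepted = [h for h in hosts if pattern.fullmatch(h)]

# With unescaped dots, any character is allowed where the dots are.
unescaped_hit = re.fullmatch(r"www.facebook.(com|net)", "wwwxfacebookxcom") is not None

print(accepted)       # ['www.facebook.com', 'www.facebook.net']
print(unescaped_hit)  # True
```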

Advanced Regular Expressions with Groups and Back References

Groups

A regular expression can be divided into groups using parentheses, and the text each group matched can be used later for replacement and other operations.

$0 is the text matched by the whole regular expression.

$1 is the text matched by the first group.

$2 is the text matched by the second group, and so on, numbered from left to right.

For example, to turn Last-name, First-name into First-name Last-name, search for (\w+), (\w+) and replace with $2 $1.
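
A minimal sketch of that name swap in Python follows. The pattern and the names are assumptions made for illustration. Note that Python's re.sub writes group references in the replacement as \1 and \2 rather than $1 and $2, which is the notation used by many editors and by JavaScript.

```python
import re

# Hypothetical input in "Last-name, First-name" form.
names = "Doe, John\nSmith, Jane"

pattern = re.compile(r"(\w+), (\w+)")

first = pattern.search(names)
whole = first.group(0)       # the whole match, what the text calls $0
last_name = first.group(1)   # first group, $1
first_name = first.group(2)  # second group, $2

# Swap the two groups in every match.
swapped = pattern.sub(r"\2 \1", names)

print(whole)    # Doe, John
print(swapped)  # prints "John Doe" and "Jane Smith" on separate lines
```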

Back References

A back reference is a way of referring to an earlier group later in the same regular expression, using \ followed by the group number (\1, \2 and so on).

Imagine we are searching for the same word occurring twice in a row.

hello hello ha ha kill kill

We can use the regex (\w+)\s\1, where \1 matches the same text that the first group matched.
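
The repeated-word search can be sketched in Python as below. The second half is an addition for illustration: without word boundaries the pattern can also fire on a word followed by a longer word that starts with it, so a stricter variant with \b is shown as one possible fix.

```python
import re

text = "hello hello ha ha kill kill"
pattern = re.compile(r"(\w+)\s\1")

# group(0) is the full doubled word, group(1) the word that was repeated.
doubles = [m.group(0) for m in pattern.finditer(text)]
repeated = [m.group(1) for m in pattern.finditer(text)]

# "the then" is not a repeated word, but the loose pattern matches "the the".
loose_hit = pattern.search("the then") is not None
strict_hit = re.search(r"\b(\w+)\s\1\b", "the then") is not None

print(doubles)     # ['hello hello', 'ha ha', 'kill kill']
print(repeated)    # ['hello', 'ha', 'kill']
print(loose_hit)   # True
print(strict_hit)  # False
```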

Reading Regular Expressions

^\w\w\w\S$ - (^) begins with any word character (\w), then two more word characters, then any non-whitespace character (\S), and then the end ($).

^T+\w\w\d$ - (^) begins with one or more (+) occurrences of T, followed by two word characters (\w), then a digit (\d), and then the end ($).
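
One way to check a reading like this is to run the pattern against a few strings and see which ones pass. The sketch below does that in Python; the test strings are made up.

```python
import re

three_words_then_nonspace = re.compile(r"^\w\w\w\S$")
t_run_two_words_digit = re.compile(r"^T+\w\w\d$")

# Hypothetical candidates for each pattern.
first_results = {s: bool(three_words_then_nonspace.search(s))
                 for s in ["abc!", "abcd", "abc ", "ab!"]}
second_results = {s: bool(t_run_two_words_digit.search(s))
                  for s in ["Tab1", "TTTxy9", "ab1", "Tabc"]}

print(first_results)   # {'abc!': True, 'abcd': True, 'abc ': False, 'ab!': False}
print(second_results)  # {'Tab1': True, 'TTTxy9': True, 'ab1': False, 'Tabc': False}
```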
