Ch1
Introduction to Regular Expressions
- You are tasked with checking every page on a web server for doubled words.
- Accept any number of files, report each line of each file that contains doubled words, and format the results nicely.
- Catch the case where the word at the end of a line is repeated at the beginning of the next line.
- Ignore capitalization.
- Find doubled words even when they are separated by HTML tags.
- These tasks can be done with regular expressions. It takes some effort, but significantly less than traditional solutions.
Regular expressions are key to powerful, flexible, and efficient text processing. They let you describe and parse text. With additional support from the tool being used, regular expressions can add, remove, isolate, and generally fold, spindle, and mutilate all kinds of text and data.
Solving Real Problems
- Check a lot of files (say, 70) to confirm that each file contains `SetSize` exactly as often as it contains `ResetSize`. To complicate matters, capitalization is optional.
- Find the sent and received pairs among a few hundred emails without actually digging through their content :P
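The first task can be sketched in Python roughly like this. The function name `counts_balanced` and the sample strings are made up for illustration. The word boundaries (`\b`) are my own assumption: once case is ignored, a bare `SetSize` would also hit the tail of `ResetSize`.

```python
import re

# \b keeps "SetSize" from matching inside "ResetSize" when case is ignored.
SET_RE = re.compile(r"\bSetSize\b", re.IGNORECASE)
RESET_RE = re.compile(r"\bResetSize\b", re.IGNORECASE)


def counts_balanced(text):
    """Return True if SetSize occurs exactly as often as ResetSize."""
    return len(SET_RE.findall(text)) == len(RESET_RE.findall(text))


print(counts_balanced("SetSize(1); resetsize(); SETSIZE(2); ResetSize()"))
print(counts_balanced("SetSize(1); SetSize(2); ResetSize()"))
```

Running this over 70 files is then just a loop that reads each file and calls the function.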
Regular Expressions as a Language
Full regular expressions are composed of two types of characters:
- Metacharacters (e.g. `*`, `?`, `+`)
- Literals (i.e. normal text characters)
Regular Expression Frame of Mind
- Searching Text Files: Egrep
Solving the email problem: a regex such as `^(From|Subject):` matches every line of the mail that starts with `From:` or `Subject:`. Here the metacharacters are `^`, `(`, `)`, and `|`.
A simple example: searching for `cat` matches those three letters wherever they appear, whether in the word `cat` or inside `vacation`, without regard to word boundaries.
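Here is a small Python sketch of both ideas. The pattern `^(From|Subject):` is reconstructed from the description (starts with `From:` or `Subject:`, uses `^ ( ) |`), and the sample message and the name `header_lines` are invented for the example.

```python
import re

# Lines that begin with "From:" or "Subject:".
HEADER_RE = re.compile(r"^(From|Subject):")


def header_lines(message):
    """Return only the From:/Subject: lines of a message."""
    return [line for line in message.splitlines() if HEADER_RE.search(line)]


sample = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: lunch\n"
    "\n"
    "From here on it is body text.\n"
)
print(header_lines(sample))

# A bare "cat" ignores word boundaries, so it is found inside "vacation".
print(bool(re.search(r"cat", "vacation")))
```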
Metacharacters
Typographical Conventions
`^` : (caret) represents the start of a line
`$` : (dollar) represents the end of a line
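A quick way to see the two anchors at work, sketched in Python with a made-up list of strings standing in for lines of a file:

```python
import re

lines = ["cat", "catalog", "bobcat", "a cat sat"]

# ^cat : "cat" at the start of the line
starts_with = [s for s in lines if re.search(r"^cat", s)]
# cat$ : "cat" at the end of the line
ends_with = [s for s in lines if re.search(r"cat$", s)]
# ^cat$ : the line is exactly "cat"
exactly = [s for s in lines if re.search(r"^cat$", s)]

print(starts_with, ends_with, exactly)
```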
Character Classes
- A character class matches any one of several characters.
- Example: say you want to match 'grey' but also catch the spelling 'gray'. A simple regex with a character class does it: `gr[ea]y`
- Within a character class, `-` represents a range of characters. For detecting header tags: `<H[1-6]>`. Other common ranges are `[0-9]` and `[a-z]`.
- We can combine multiple ranges: rather than listing `0123456789abcd...`, write `[0-9a-fA-F]`.
NOTE: `-` is a metacharacter only inside a character class; everywhere else it is a normal dash character.
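The class examples above can be tried out in Python like this. The sample strings are invented, and `fullmatch` is used only to test a whole string against the hex class.

```python
import re

# One position, two allowed letters.
spellings = re.findall(r"gr[ea]y", "grey gray groy")

# A range inside a class: header tags H1 through H6 only.
headers = re.findall(r"<H[1-6]>", "<H1>Title</H1> <H7>bad</H7> <H3>")

# Several ranges combined: hexadecimal digits.
is_hex = re.fullmatch(r"[0-9a-fA-F]+", "c0ffee") is not None
not_hex = re.fullmatch(r"[0-9a-fA-F]+", "coffee") is not None

# Outside a class the dash is an ordinary character.
literal_dash = re.findall(r"a-z", "a-z abc")

print(spellings, headers, is_hex, not_hex, literal_dash)
```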
Negated Character Classes
- Using `[^...]` negates the character class: it matches any single character except those listed. It still has to match a character, so it cannot match "nothing".
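A short Python sketch of negation. The `q[^u]` pattern and the sample words are my own illustration, picked to show that a negated class still needs a character to match.

```python
import re

# Every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2c3")

# q followed by something other than u: needs an actual character after the q.
no_match = re.search(r"q[^u]", "Iraq")   # nothing follows the q
match = re.search(r"q[^u]", "Iraqi")     # "qi"

print(non_digits, no_match, match.group())
```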
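Matching Any Character with Dot
`.` : matches any single character. Example: to find dates that could be written as dd/mm/yy or dd-mm-yy or some other format, use the dot for the separators: `13.11.2001`. Keep in mind that the dot also matches characters you may not want there, such as a digit.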
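To see how permissive the dot is, here is a rough Python comparison. The stricter pattern `13[-./]11[-./]2001` is not from the notes above; it is an assumed alternative that only allows `-`, `.` or `/` as separators.

```python
import re

LOOSE = re.compile(r"13.11.2001")            # dot matches any character
STRICT = re.compile(r"13[-./]11[-./]2001")   # assumed tighter variant

samples = ["13/11/2001", "13-11-2001", "13.11.2001", "1391172001"]

loose_hits = [s for s in samples if LOOSE.search(s)]
strict_hits = [s for s in samples if STRICT.search(s)]

print(loose_hits)
print(strict_hits)
```

The last sample is just a run of digits, yet the loose pattern accepts it because each dot happily matches a digit.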
Alternation
`|` : (or). For example, to search for Jeffrey or Jeffery we can use any of these three: `Jeffrey|Jeffery`, `Jeff(rey|ery)`, `Jeff(re|er)y`
- NOTE: a character class matches only a single character, while each alternative can be an expression of any length. So `Jeff[re][re]y` might seem correct but it is not :P it also matches `Jeffrry` and `Jeffeey`.
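The three equivalent patterns and the tempting-but-wrong character class version can be compared in Python. The helper `accepts` is a made-up name for this sketch.

```python
import re

PATTERNS = [r"Jeffrey|Jeffery", r"Jeff(rey|ery)", r"Jeff(re|er)y"]
WRONG = r"Jeff[re][re]y"


def accepts(pattern, name):
    """True if the whole name matches the pattern."""
    return re.fullmatch(pattern, name) is not None


for p in PATTERNS:
    print(p, accepts(p, "Jeffrey"), accepts(p, "Jeffery"), accepts(p, "Jeffrry"))

print(WRONG, accepts(WRONG, "Jeffrry"), accepts(WRONG, "Jeffeey"))
```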
Ignoring Differences in Capitalization
- One approach is to write a character class such as `[Jj]` for each letter whose capitalization may vary. The other is to use `egrep -i`, which ignores case.
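Both approaches, sketched in Python. Here `re.IGNORECASE` plays the role that `-i` plays for egrep; the list of names is invented.

```python
import re

names = ["Jeffrey", "jeffrey", "JEFFREY"]

# Character class: only the first letter may vary in case.
by_class = [n for n in names if re.fullmatch(r"[Jj]effrey", n)]

# Flag: every letter may vary in case.
by_flag = [n for n in names if re.fullmatch(r"jeffrey", n, re.IGNORECASE)]

print(by_class)
print(by_flag)
```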
Word Boundaries
- To match word boundaries we can use `\<` (start of word) and `\>` (end of word). For our cat example: `\<cat\>` matches `cat` only when it stands as a whole word.
- Note: `<` and `>` on their own are not metacharacters. It is the preceding `\` that turns them into the word-boundary metasequences.
- NOTE: not all versions of `egrep` support these metasequences.
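Python's `re` module does not have `\<` and `\>`; the sketch below assumes its `\b` (which matches at either edge of a word) as a stand-in. The sample sentence is made up.

```python
import re

text = "cat vacation concat cat."

# Without boundaries: every "cat", including the ones buried in other words.
anywhere = re.findall(r"cat", text)

# With boundaries: only "cat" as a whole word.
whole_words = re.findall(r"\bcat\b", text)

print(len(anywhere), len(whole_words))
```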
Optional Items
- Let's find the words `color` and `colour` :) `?` means the preceding character is optional.
- The above can be done using: `colou?r`
- Another use case: `(July|Jul) (fourth|4th|4)` can be reduced to `July? (fourth|4th|4)`
- A further simplification: `July? (fourth|4(th)?)`. NOTE: here `()` creates a subexpression for `?` to apply to.
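A small Python check of the optional-item patterns above, with made-up sample strings:

```python
import re

colors = re.findall(r"colou?r", "color colour colouur")

DATE_RE = re.compile(r"July? (fourth|4(th)?)")
dates = ["July fourth", "Jul 4th", "July 4", "Jul 4"]
all_match = all(DATE_RE.fullmatch(d) for d in dates)

# "Ju 4" fails: only the y is optional, not the l.
short_form = DATE_RE.fullmatch("Ju 4")

print(colors, all_match, short_form)
```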
Repetition
`+` : one or more of the immediately preceding item
`*` : any number, including none, of the item
Phrased differently:
`...*` : match as many times as possible, but it's OK to settle for nothing if need be
`...+` : match as many times as possible, but fail if not even a single match is found
NOTE: `...?` and `...*` always succeed, since they are also allowed to match nothing.
NOTE: a space followed by `?` allows a single optional space, while a space followed by `*` allows any number of spaces, including none.
A direct consequence of the above: say you want to match 15 or any other number. `[0-9][0-9]` is wrong because it demands exactly two digits, while `[0-9]+` matches a number of any length.
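The difference between the two number patterns, and between `*` and `+`, shows up quickly in Python. The test values are invented.

```python
import re

numbers = ["7", "15", "2001"]

# Exactly two digits versus one or more digits.
two_digits = [n for n in numbers if re.fullmatch(r"[0-9][0-9]", n)]
one_or_more = [n for n in numbers if re.fullmatch(r"[0-9]+", n)]

# x* succeeds even with no x around (it matches the empty string);
# x+ needs at least one x and therefore fails here.
star = re.search(r"x*", "abc")
plus = re.search(r"x+", "abc")

print(two_digits, one_or_more, repr(star.group()), plus)
```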
Defined range of matches: intervals
`...{min,max}` : some versions of `egrep` support a metasequence for providing min and max occurrences.
`...{3,12}` : matches up to 12 times if possible, but settles for 3.
`[a-zA-Z]{1,5}` : this can match a US stock ticker of 1 to 5 letters.
So our `?` is just `{0,1}` :)
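Python's `re` supports the same `{min,max}` syntax, so the interval examples can be sketched as follows. `is_ticker` is a hypothetical helper name.

```python
import re

TICKER_RE = re.compile(r"[a-zA-Z]{1,5}")


def is_ticker(s):
    """True if s is 1 to 5 letters and nothing else."""
    return TICKER_RE.fullmatch(s) is not None


# {3,12} grabs as many as it can, up to 12, and needs at least 3.
greedy = re.match(r"a{3,12}", "a" * 20).group()
too_few = re.match(r"a{3,12}", "aa")

# ? behaves like {0,1}.
same_as_optional = re.fullmatch(r"colou{0,1}r", "color") is not None

print(is_ticker("IBM"), is_ticker("TOOLONG"), len(greedy), too_few, same_as_optional)
```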
Parentheses and Backreferences
- In many regex flavors, parentheses can "remember" the text matched by the subexpression they enclose.
- We are now equipped to solve the doubled-word problem: `\<the +the\>` (the character before `+` is a literal space) finds every doubled `the`. The word boundaries are important here, otherwise `the theory` would match too.
- Replacing `the` with a general word on both sides, `\<[A-Za-z]+ +[A-Za-z]+\>`, is not enough: that matches any two adjacent words, not a repeated one. Instead, capture the first word with `()` and refer back to it with `\1`. Capture groups are numbered from left to right.
- Effectively our regex becomes: `\<([A-Za-z]+) +\1\>`
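A minimal Python sketch of the doubled-word check, under a few assumptions: `\b` stands in for `\<` and `\>` (Python's `re` lacks those), the gap between the words is written as a literal space followed by `+`, and `re.IGNORECASE` covers the "ignore capitalization" requirement. It only handles words on the same line, and `doubled_words` is a made-up name.

```python
import re

# Group 1 captures a word; \1 requires the same word again after the spaces.
DOUBLED_RE = re.compile(r"\b([A-Za-z]+) +\1\b", re.IGNORECASE)


def doubled_words(text):
    """Return the first word of every doubled pair found in text."""
    return [m.group(1) for m in DOUBLED_RE.finditer(text)]


print(doubled_words("The the cat sat on on the mat"))
print(doubled_words("This is the theory"))
```

The second call finds nothing: the boundaries stop `is` inside `This` and `the` inside `theory` from counting as repeats.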
The Great Escape
- How do we match the metacharacters themselves? Escape them with a backslash, e.g. `ega\.att\.com`
- Another example is escaping parentheses: `\([A-Za-z]+\)` matches a parenthesized word.
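Why the escapes matter, sketched in Python. The sample strings are invented; `re.escape` is a standard-library helper that adds the backslashes for you.

```python
import re

UNESCAPED = re.compile(r"ega.att.com")    # dots match any character
ESCAPED = re.compile(r"ega\.att\.com")    # dots match only a literal dot

text = "megawatt computing"
false_hit = UNESCAPED.search(text)        # matches "egawatt com"
no_hit = ESCAPED.search(text)
real_hit = ESCAPED.search("mail to ega.att.com today")

# Escaped parentheses match literal ( and ).
parenthesized = re.findall(r"\([A-Za-z]+\)", "see (note) and (42)")

print(false_hit.group(), no_hit, real_hit.group(), parenthesized)
print(re.escape("ega.att.com"))
```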
The Goal of Regular Expression
Sometimes the goal of a regex is not just to match but also to operate on the matched text. For example, looking for a number in a text file in order to update it, add to it, or subtract from it.
Other interesting examples:
- Variable names: they must not start with a digit. They are matched by `[a-zA-Z_][a-zA-Z_0-9]*`. If there is a limit on the length of a variable name, say 32 characters, you can replace the `*` with `{0,31}` or something similar.
- A string within double quotes: `"[^"]*"`. The inner `[^"]*` matches anything except a double quote, so the match cannot run past the closing quote :P
- Dollar amount with optional cents: `\$[0-9]+(\.[0-9][0-9])?`
- An HTTP/HTML URL: `\<http://[-a-z0-9_.:]+/[-a-z0-9_:@&?=+,.!/~*%$]*\.html?\>`. This could be simplified to `\<http://[^ ]*\.html?\>`
- Knowing the data you'll be searching is an important aspect of finding the balance between complexity and completeness.
- An HTML tag: `<.*>` (slightly incorrect, but simple)
- Time of day such as "9:17 am" or "12:30 pm": `[0-9]?[0-9]:[0-9][0-9] (am|pm)`. This is very generic, but we know the hour never goes past 12 and the minutes never past 59. Tightening it up: `(1[012]|[1-9]):[0-5][0-9] (am|pm)`
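A few of these examples, tried out in Python with `fullmatch` so that the whole string has to fit. The sample inputs are invented; the patterns are the ones from the list.

```python
import re

LOOSE_TIME = re.compile(r"[0-9]?[0-9]:[0-9][0-9] (am|pm)")
TIGHT_TIME = re.compile(r"(1[012]|[1-9]):[0-5][0-9] (am|pm)")
DOLLARS = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")
QUOTED = re.compile(r'"[^"]*"')

times = ["9:17 am", "12:30 pm", "13:00 pm", "9:61 am", "99:99 pm"]
loose_ok = [t for t in times if LOOSE_TIME.fullmatch(t)]
tight_ok = [t for t in times if TIGHT_TIME.fullmatch(t)]

amounts = ["$15", "$15.99", "$15.9"]
dollars_ok = [a for a in amounts if DOLLARS.fullmatch(a)]

quoted = QUOTED.findall('say "hi" and "bye"')

# <.*> is greedy, so it swallows everything up to the last ">".
greedy_tag = re.findall(r"<.*>", "<b>bold</b>")

print(loose_ok, tight_ok, dollars_ok, quoted, greedy_tag)
```

The last line shows why `<.*>` is only approximately right for HTML tags.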
Regular Expression Nomenclature
- Regex: short form of regular expression
- Matching
- Metacharacter
- Flavor (don't confuse flavors with tools: an older version of the same tool may behave differently)
- Subexpression
- Character