A primer on grep and regular expressions, with Linux examples

Greetings, guests!

In today's article I want to touch on the huge topic of regular expressions. I think everyone understands that the topic of regexes (as regular expressions are called in slang) is far too large for a single post. So I will try to gather my thoughts and convey them to you briefly but as clearly as possible.

To begin with, there are several flavors of regular expressions:

1. Traditional regular expressions (also known as basic regular expressions, BRE)

The syntax of these expressions is considered obsolete, but it is still widespread and used by many UNIX utilities.
Basic regular expressions include the following metacharacters (see their meanings below):
- . [ ] [^ ] ^ $ *
- \{ \}: the original form of { } (in extended syntax)
- \( \): the original form of ( ) (in extended syntax)
- \n, where n is a number from 1 to 9

Features of using these metacharacters:
- An asterisk must follow an expression that matches a single character. Example: [xyz]*.
- The expression \(block\)* should be considered invalid. In some implementations it matches zero or more repetitions of the string block; in others it matches the string block*.
- Within a character class, special character meanings are generally ignored. Special cases:
- To add a ^ character to a set, it must not be placed first.
- To add a - character to a set, it must be placed first or last. For example:
  - a DNS name pattern, which can include letters, digits, a minus, and a separating period: [-0-9a-zA-Z.];
  - any character except a minus and a digit: [^-0-9].
- To add a [ or ] character to the set, it must be placed first. For example:
  - [][ab] matches ], [, a, or b.

2. Extended regular expressions (ERE)

The syntax of these expressions is the same as for basic expressions, except:
- Backslashes are no longer used for the { } and ( ) metacharacters.
- A backslash in front of a metacharacter cancels its special meaning.
- The theoretically non-regular construct \n is dropped.
- The metacharacters +, ?, and | are added.

3. Perl-compatible regular expressions (PCRE)

These have a richer and at the same time more predictable syntax than even POSIX ERE, which is why applications often use them.

Regular expressions consist of patterns, or rather they define a search pattern. A pattern consists of search rules, which are made up of characters and metacharacters.

Search rules are defined by the following operations:

Alternation |

A vertical bar (|) separates the acceptable alternatives; you can think of it as a logical OR. For example, "gray|grey" matches gray or grey.

Grouping ()

Parentheses are used to define the scope and precedence of operators. For example, "gray|grey" and "gr(a|e)y" are different patterns, but both describe the set containing gray and grey.

Quantification {} ? * +

A quantifier after a character or group determines how many times the preceding expression may occur.

{m,n}	general form: from m to n repetitions inclusive.

{m,}	general form: m or more repetitions.

{,n}	general form: no more than n repetitions.

{n}	exactly n repetitions.

A question mark means 0 or 1 times, the same as {0,1}. For example, "colou?r" matches both color and colour.

An asterisk means 0, 1, or any number of times ({0,}). For example, "go*gle" matches ggle, gogle, google, and so on.

A plus means at least 1 time ({1,}). For example, "go+gle" matches gogle, google, and so on (but not ggle).
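The quantifiers and grouping described above can be tried out directly with grep. This is a minimal sketch, assuming a grep that supports the -E (extended syntax) option; the word list and variable names are made up for illustration.

```shell
# Hypothetical sample words, one per line.
words='ggle
gogle
google
color
colour'

# -E selects extended syntax, where ?, + and | need no backslash.
star=$(printf '%s\n' "$words" | grep -E '^go*gle$')     # zero or more "o"
plus=$(printf '%s\n' "$words" | grep -E '^go+gle$')     # one or more "o"
opt=$(printf '%s\n' "$words" | grep -E '^colou?r$')     # optional "u"
alt=$(printf '%s\n' "$words" | grep -E '^col(o|ou)r$')  # grouping with alternation

printf 'star:\n%s\nplus:\n%s\nopt:\n%s\nalt:\n%s\n' "$star" "$plus" "$opt" "$alt"
```

The ^ and $ anchors are used here only so that each pattern has to cover the whole word.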

The specific syntax of these regular expressions depends on the implementation (for example, in basic regular expressions the characters ( and ) are escaped with a backslash).

Metacharacters, in simple terms, are characters that do not stand for themselves: the character . (dot) is not a literal dot but any single character, and so on. Please familiarize yourself with the metacharacters and their meanings:

.	Matches any single character.
[something]	Matches any single character listed in the brackets. The "-" character is interpreted literally only if it comes immediately after the opening bracket or immediately before the closing bracket: [abc-] or [-abc]. Otherwise it denotes a character range. For example, [abc] matches "a", "b", or "c", and [a-z] matches the lowercase letters of the Latin alphabet. These notations can be combined: [abcq-z] matches a, b, c, q, r, s, t, u, v, w, x, y, z. To match the characters "[" or "]", it is enough for the closing bracket to be the first character after the opening one: [][ab] matches "]", "[", "a", or "b". If the list in square brackets is preceded by a ^, the expression matches a single character that is not in the brackets. For example, [^abc] matches any character other than "a", "b", or "c", and [^a-z] matches any character other than a lowercase Latin letter.
^	Matches the beginning of the text (or the beginning of any line in multiline mode).
$	Matches the end of the text (or the end of any line in multiline mode).
\( \) or ( )	Declares a "marked subexpression" (a group) that can be used later (see the next entry, \n). A marked subexpression is also called a "block". Unlike other operators, this one requires backslashes in the traditional syntax; in extended and Perl syntax the \ is not needed.
\n	Where n is a digit from 1 to 9; matches the n-th marked subexpression (for example, in \(abcd\)\1 the \1 refers to the first marked subexpression, abcd). This construct is theoretically non-regular and was not adopted in the extended regular expression syntax.
*	An asterisk after an expression that matches a single character matches zero or more copies of that (preceding) expression. For example, "[xyz]*" matches an empty string, "x", "y", "zx", "zyx", and so on. \n*, where n is a digit from 1 to 9, matches zero or more occurrences of the n-th marked subexpression. For example, "\(a.\)c\1" matches "abcab" and "abcaba" but not "abcac". An expression enclosed in "\(" and "\)" followed by "*" should be considered invalid. In some implementations it matches zero or more occurrences of the string enclosed in the parentheses; in others it matches the parenthesized expression followed by a literal "*" character.
\{x,y\}	Matches the preceding block occurring at least x and at most y times. For example, "a\{3,5\}" matches "aaa", "aaaa", or "aaaaa". Unlike other operators, this one requires backslashes in the traditional syntax.
.*	Denotes any number of any characters between two parts of a regular expression.
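Two of the table entries, the marked subexpression with its \n backreference and the \{x,y\} interval, are easier to grasp when run. The sketch below uses plain grep (basic syntax); the input strings are illustrative, and -x is used in the second command so that the pattern has to cover the whole line.

```shell
# \(a.\) captures "a" plus any one character; \1 must repeat that exact text.
backref=$(printf '%s\n' abcab abcaba abcac | grep '\(a.\)c\1')

# a\{3,5\} means three to five "a" characters; -x requires a whole-line match.
interval=$(printf '%s\n' aa aaa aaaaaa | grep -x 'a\{3,5\}')

printf 'backref:\n%s\ninterval:\n%s\n' "$backref" "$interval"
```

"abcac" is rejected because the text after "c" would have to repeat "ab", and "aaaaaa" is rejected because six repetitions exceed the upper bound.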

Metacharacters let us describe many kinds of matches. But how do you make a metacharacter stand for an ordinary character, that is, make [ (a square bracket) mean a literal square bracket? Simple:

the metacharacter (. * + \ ? { }) must be preceded (escaped) by a backslash. For example, \. or \[
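A short sketch of escaping in practice, using made-up names as input. It also shows grep -F, which treats the whole pattern as a fixed string and so avoids escaping altogether; that option is an addition here, not something the text above relies on.

```shell
# Hypothetical input: three names, one per line.
names='a.txt
abtxt
a[1]'

unescaped=$(printf '%s\n' "$names" | grep 'a.txt')   # . matches any character
escaped=$(printf '%s\n' "$names" | grep 'a\.txt')    # \. matches a literal dot
fixed=$(printf '%s\n' "$names" | grep -F 'a[1]')     # -F: no metacharacters at all

printf 'unescaped:\n%s\nescaped:\n%s\nfixed:\n%s\n' "$unescaped" "$escaped" "$fixed"
```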

To simplify the definition of some character sets, they were combined into so-called character classes and categories. POSIX has standardized the declaration of certain classes and categories of characters, as shown in the following table:

POSIX class	Equivalent	Meaning
[:upper:]	[A-Z]	uppercase characters
[:lower:]	[a-z]	lowercase characters
[:alpha:]	[A-Za-z]	uppercase and lowercase characters
[:alnum:]	[A-Za-z0-9]	digits, uppercase and lowercase characters
[:digit:]	[0-9]	digits
[:xdigit:]	[0-9A-Fa-f]	hexadecimal digits
[:punct:]	[.,!?:…]	punctuation marks
[:blank:]	[ \t]	space and TAB
[:space:]	[ \t\n\r\f\v]	whitespace characters
[:cntrl:]		control characters
[:graph:]	[^ \t\n\r\f\v]	printable characters
[:print:]	[^\t\n\r\f\v]	printable characters and the space
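The classes in the table are written inside a bracket expression, which is why they appear with double brackets in a real pattern. A minimal sketch with made-up input lines follows; setting LC_ALL=C is an assumption made here so that the classes behave like the ASCII equivalents in the table.

```shell
# Assumption: force the C locale so classes match their ASCII equivalents.
export LC_ALL=C

lines='alpha
Beta
42
x 9'

digits=$(printf '%s\n' "$lines" | grep '[[:digit:]]')      # lines containing a digit
upper=$(printf '%s\n' "$lines" | grep '^[[:upper:]]')      # lines starting with an uppercase letter
nospace=$(printf '%s\n' "$lines" | grep -v '[[:space:]]')  # lines without whitespace

printf 'digits:\n%s\nupper:\n%s\nnospace:\n%s\n' "$digits" "$upper" "$nospace"
```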

In regexes there is such a thing as:

Regex greediness

I will try to describe it as clearly as possible. Let's say we want to find all HTML tags in some text. Narrowing the task down, we want to find the values enclosed between < and >, together with the brackets themselves. But tags have different lengths, and there are at least 50 different tags. Listing them all and wrapping them in metacharacters would take far too long. We do know, however, that the expression .* (dot asterisk) stands for any number of any characters in a line. Using the pattern <.*>, we will try to find all the values between < and > in the HTML source of this line:

So, How to create a RAID level 10/50 on an LSI MegaRAID controller (also valid for: Intel SRCU42x, Intel SRCS16):

As a result, the WHOLE line matches the expression. Why? Because the regex is GREEDY and tries to capture as MANY characters as possible between < and >, so the whole line, starting with <p>So... and ending with ...>, falls under this rule!

Hopefully this example shows what greediness is. To get rid of it, you can go one of the following ways:

exclude the characters that must not appear in the desired match (for example, <[^>]*> for the case above)
make the quantifier non-greedy by adding a ? to it:
- *? is the non-greedy ("lazy") equivalent of *
- +? is the non-greedy ("lazy") equivalent of +
- {n,}? is the non-greedy ("lazy") equivalent of {n,}
- .*? is the non-greedy ("lazy") equivalent of .*
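The difference between the greedy pattern and the negated-class workaround can be seen with grep -o, which prints only the matched parts of a line. This is a sketch under two assumptions: the HTML line is a made-up stand-in for the example above, and -o is available (it is in GNU and BSD grep, but it is not required by POSIX). Lazy quantifiers such as .*? are not demonstrated because they belong to Perl-compatible syntax rather than POSIX BRE or ERE.

```shell
# Hypothetical HTML line used only for illustration.
html='<p>So, <b>how</b> to create a RAID</p>'

# Greedy: .* runs to the LAST > on the line, so the match is the whole line.
greedy=$(printf '%s\n' "$html" | grep -o '<.*>')

# Negated class: [^>]* cannot cross a >, so each tag is matched separately.
tags=$(printf '%s\n' "$html" | grep -o '<[^>]*>')

printf 'greedy:\n%s\ntags:\n%s\n' "$greedy" "$tags"
```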

I want to supplement all of the above with the extended regular expression syntax:

POSIX extended regular expressions are similar to the traditional Unix syntax, but with some metacharacters added:

A plus indicates that the preceding character or group can be repeated one or more times. Unlike the asterisk, at least one repetition is required.

A question mark makes the preceding character or group optional. In other words, it may be absent from the matching line or present exactly once.

A vertical bar separates alternative regular expressions. One bar defines two alternatives, but there can be more; just use more vertical bars. Remember that this operator takes in as much of the expression as possible. For this reason, the alternation operator is most often used inside parentheses.

The use of the backslash has also been dropped: \{...\} becomes {...} and \(...\) becomes (...).

At the end of the post, here are some examples of using regexes:

$ cat text1
1 apple
2 pear
3 banana
$ grep p text1
1 apple
2 pear
$ grep "pp*" text1
1 apple
2 pear
$ cat text1 | grep "l\|n"
1 apple
3 banana
$ echo -e "find an\n* here" | grep "\*"
* here
$ grep "pl\?.*r" text1 # p on lines containing r
2 pear
$ grep "a.." text1 # lines with a followed by at least 2 characters
1 apple
3 banana
$ grep "[3p]" text1 # search for lines with 3 or p
1 apple
2 pear
3 banana
$ echo -e "find an\n* here\nsomewhere." | grep "[.*]"
* here
somewhere.
$ echo -e "123\n456\n789\n0" | grep ""
123
456
789
$ sed -e "/\(a.*a\)\|\(p.*p\)/s/a/A/g" text1 # replace a with A in all lines where an a is followed by another a or a p by another p
1 Apple
2 pear
3 bAnAnA
* \ ./ LAST WORD./g "First. A LAST WORD. This is a LAST WORD.

Regards, Mc.Sim!

Background and source: not everyone who needs to use regular expressions fully understands how they work and how to write them. I used to belong to this group too: I looked for example regular expressions that fit my tasks and tried to adjust them as needed. Everything changed radically for me after reading the book The Linux Command Line (Second Internet Edition) by William E. Shotts, Jr. It explains the principles behind regular expressions so clearly that after reading it I learned to understand them and to write regular expressions of any complexity, and I now use them whenever I need to. This material is a translation of part of the chapter on regular expressions. It is intended for absolute beginners who do not understand how regular expressions work but have some idea of what they are for. Hopefully this article helps you make the same breakthrough it helped me make. If the material here is not new to you, try the article Regular Expressions and the grep Command for more details on grep options and additional examples.

How regular expressions are used

Text data plays an important role in all Unix-like systems such as Linux. Among other things, text is the output of console programs, configuration files, reports, and so on. Regular expressions are (perhaps) one of the most difficult concepts in working with text, since they involve a high level of abstraction. But the time spent studying them pays off handsomely. Knowing how to use regular expressions lets you do amazing things, although their full value may not be immediately apparent.

This article walks you through using regular expressions with the grep command. But their use is not limited to grep: regular expressions are supported by other Linux commands and many programming languages, they are used in configuration (for example, in mod_rewrite rules in Apache), and some GUI programs let you define search, copy, and delete rules with regular expression support. Even in the popular office program Microsoft Word you can use regular expressions and wildcards to find and replace text.

What are regular expressions?

In simple terms, a regular expression is a shorthand, symbolic notation for a pattern that is searched for in text. Regular expressions are supported by many command line tools and most programming languages, and they make text manipulation tasks easier. However (as if their complexity were not enough), not all regular expressions are the same. They vary slightly from tool to tool and from one programming language to another. For our discussion, we will restrict ourselves to the regular expressions described in the POSIX standard (which covers most command line tools), in contrast to many programming languages (primarily Perl) that use somewhat larger and richer sets of notations.

grep

The main program we'll be using for regular expressions is our old friend, grep. The name "grep" actually comes from the phrase "global regular expression print", so we can see that grep has something to do with regular expressions. Essentially, grep searches text files for text that matches the specified regex and prints any line containing a match to standard output.

grep can search text received on standard input, for example:

ls /usr/bin | grep zip

This command lists the files in the /usr/bin directory whose names contain the substring "zip".

grep can also search for text in files.

General usage syntax:

grep [options] regex [file...]

regex is a regular expression.
[file...] is one or more files in which the regular expression search will be performed.

[options] and [file...] may be omitted.

A list of the most commonly used grep options:

Option	Description
-i	Ignore case. Make no distinction between uppercase and lowercase characters. Can also be specified as --ignore-case.
-v	Invert match. Normally grep prints lines that contain a match. This option causes grep to print every line that does not contain a match. Can also be specified as --invert-match.
-c	Print the number of matching lines (or non-matching lines if -v is also specified) instead of the lines themselves. Can also be specified as --count.
-l	Instead of the lines themselves, print the name of each file that contains a match. Can also be specified as --files-with-matches.
-L	Like -l, but print only the names of files that do not contain matches. Can also be specified as --files-without-match.
-n	Prefix each matching line with its line number within the file. Can also be specified as --line-number.
-h	When searching multiple files, suppress the output of file names. Can also be specified as --no-filename.
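The options in the table can be compared side by side on two tiny files. This is a sketch only: the file names a.txt and b.txt and their contents are invented for the demonstration and are created in a temporary directory.

```shell
# Create two small hypothetical files in a scratch directory.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' bzip2 gzip Zip > a.txt
printf '%s\n' tar cpio > b.txt

count=$(grep -c zip a.txt)            # number of matching lines
count_i=$(grep -ci zip a.txt)         # same, ignoring case
inverted=$(grep -v zip a.txt)         # lines that do NOT match
numbered=$(grep -n gzip a.txt)        # match prefixed with its line number
with=$(grep -l zip a.txt b.txt)       # files that contain a match
without=$(grep -L zip a.txt b.txt)    # files that contain no match
plain=$(grep -h zip a.txt b.txt)      # matches without the file name prefix

printf '%s\n' "$count" "$count_i" "$inverted" "$numbered" "$with" "$without" "$plain"
```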

To explore grep more fully, let's create some text files to search for:

ls /bin > dirlist-bin.txt
ls /usr/bin > dirlist-usr-bin.txt
ls /sbin > dirlist-sbin.txt
ls /usr/sbin > dirlist-usr-sbin.txt
ls dirlist*.txt
dirlist-bin.txt dirlist-sbin.txt dirlist-usr-bin.txt dirlist-usr-sbin.txt

We can do a simple search through our list of files like this:

grep bzip dirlist*.txt
dirlist-bin.txt:bzip2
dirlist-bin.txt:bzip2recover

In this example grep searches all the listed files for the string bzip and finds two matches, both in the file dirlist-bin.txt. If we are only interested in the list of files containing matches, and not the matching lines themselves, we can specify the -l option:

grep -l bzip dirlist*.txt
dirlist-bin.txt

Conversely, if we only wanted to see a list of files that did not contain matches, we could do this:

grep -L bzip dirlist*.txt
dirlist-sbin.txt
dirlist-usr-bin.txt
dirlist-usr-sbin.txt

If there is no output, then no files matching the conditions were found.

Metacharacters and Literals

While this may not seem obvious, our grep searches have been using regular expressions all along, albeit very simple ones. The regular expression "bzip" means that a match occurs (that is, the line is considered suitable) only if the line in the file contains at least four characters and somewhere in it the characters "b", "z", "i", and "p" appear in that order with no other characters in between. The characters in the string "bzip" are literals, that is, literal characters that match themselves. In addition to literals, regular expressions can include metacharacters, which are used to specify more complex matches. Regular expression metacharacters consist of the following:

^ $ . [ ] { } - ? * + ( ) | \

All other characters are considered literals. The backslash character can have different meanings. It is used in several cases to create meta sequences and also allows metacharacters to be escaped and treated as literals rather than metacharacters.

Note: as we can see, many of the regex metacharacters are also shell-meaningful (performing expansion) characters. When specifying a regular expression containing command line metacharacters, it is imperative that you enclose it in quotes, otherwise the shell will interpret them differently and break your command.

Any character

The first metacharacter we will get acquainted with is the dot, which means "any character". If we include it in a regex, it matches any character at that character position. Example:

grep -h ".zip" dirlist*.txt
bunzip2
bzip2
bzip2recover
gunzip
gzip
funzip
gpg-zip
mzip
p7zip
preunzip
prezip
prezip-bin
unzip
unzipsfx

We looked for any line in our files that matches the regular expression ".zip". A couple of interesting points should be noted in the results obtained. Please note that the zip program was not found. This is because including the dot metacharacter in our regular expression increased the length required to match to four characters, and since the name "zip" contains only three, it does not match. Also, if any of the files on our lists contained the .zip file extension, they would also be considered valid, since the dot character in the file extension also matches the "any character" condition.
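The two observations above (the dot needs a character to consume, and it also matches a literal dot) can be checked on a few made-up names. A minimal sketch; archive.zip is a hypothetical file name.

```shell
# .zip needs some character before "zip", so the bare name "zip" is skipped.
dot=$(printf '%s\n' zip gzip archive.zip | grep '.zip')

# \.zip requires a literal dot, so only the file extension matches.
literal=$(printf '%s\n' zip gzip archive.zip | grep '\.zip')

printf 'dot:\n%s\nliteral:\n%s\n' "$dot" "$literal"
```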

Anchors

The caret (^) and the dollar sign ($) are treated as anchors in regular expressions. This means that they match only if the regex is found at the beginning of the line (^) or at the end of the line ($):

grep -h "^zip" dirlist*.txt
zip
zipcloak
zipdetails
zipgrep
zipinfo
zipnote
zipsplit
grep -h "zip$" dirlist*.txt
gunzip
gzip
funzip
gpg-zip
mzip
p7zip
preunzip
prezip
prezip
unzip
zip
grep -h "^zip$" dirlist*.txt
zip

Here we searched the file lists for the string "zip" located at the beginning of a line, at the end of a line, and on a line where it is at both the beginning and the end (that is, the whole line consists only of "zip"). Note that the regular expression "^$" (a beginning and an end with nothing between them) matches empty lines.
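The same three anchor patterns, plus the empty-line pattern "^$", can be tried without the directory listings. The sketch below uses a short made-up list that contains one empty line.

```shell
# Hypothetical list with one empty line in the middle.
list='zip
gzip
zipinfo

unzip'

starts=$(printf '%s\n' "$list" | grep '^zip')    # "zip" at the start of the line
ends=$(printf '%s\n' "$list" | grep 'zip$')      # "zip" at the end of the line
whole=$(printf '%s\n' "$list" | grep '^zip$')    # the line is exactly "zip"
blank=$(printf '%s\n' "$list" | grep -c '^$')    # count of empty lines

printf 'starts:\n%s\nends:\n%s\nwhole:\n%s\nblank: %s\n' "$starts" "$ends" "$whole" "$blank"
```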

A small lyrical digression: crossword puzzle assistant

Even with our currently limited knowledge of regular expressions, we can do something useful.

If you have ever solved crosswords, you have had to answer questions like "what is a five-letter word whose third letter is 'j' and last letter is 'r' that means...". Such a question can make you think hard. Did you know that Linux has a dictionary? It does. Look in the /usr/share/dict directory, where you can find one or more dictionaries. The dictionaries there are just long lists of words, one per line, in alphabetical order. On my system, the dictionary file contains 99,171 words. To search for possible answers to the crossword question above, we can do this:

grep -i "^..j.r$" /usr/share/dict/american-english
Major
major

Using this regex, we can find all the words in our dictionary file that are five letters long, have "j" in the third position and "r" in the last position.

The example used an English dictionary file because it is present on the system by default. After downloading a suitable dictionary, you can run similar searches on Cyrillic words or on words in any other character set.

Bracket Expressions and Character Classes

In addition to matching any character at a given position in our regex, we can use bracket expressions to match a single character from a specified set. With bracket expressions we can specify the set of characters to match (including characters that would otherwise be interpreted as metacharacters). In this example, using a set of two characters:

grep -h "[bg]zip" dirlist*.txt
bzip2
bzip2recover
gzip

we will find any lines containing the strings "bzip" or "gzip".

A set can contain any number of characters, and metacharacters lose their special meaning when placed inside square brackets. However, there are two cases in which the metacharacters used inside square brackets have different meanings. The first is the caret ( ^ ), which is used to indicate negation; the second is a dash ( - ), which is used to specify a range of characters.

Negation

If the first character of the expression in square brackets is a caret ( ^ ), then the rest of the characters are accepted as a set of characters that should not be present in the given character position. Let's do this by modifying our previous example:

grep -h "[^bg]zip" dirlist*.txt
bunzip2
gunzip
funzip
gpg-zip
mzip
p7zip
preunzip
prezip
prezip-bin
unzip
unzipsfx

With negation in place, we got a list of files that contain the string "zip" preceded by any character except "b" or "g". Note that zip itself was not found. A negated character set still requires a character at the given position, but that character must not be a member of the negated set.

The caret means negation only if it is the first character within a bracket expression; otherwise, it loses its special meaning and becomes an ordinary character of the set.

Traditional character ranges

If we want to construct a regular expression that must find every file in our list that starts with an uppercase letter, we can do the following:

grep -h "^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]" dirlist*.txt
MAKEDEV
GET
HEAD
POST
VBoxClient
X
X11
Xorg
ModemManager
NetworkManager
VBoxControl
VBoxService

The point is that we placed all 26 uppercase letters inside the bracket expression. But the idea of typing them all out is hardly appealing, so there is another way:

grep -h "^[A-Z]" dirlist*.txt

Using a three-character range, we can abbreviate the 26 letters. Any range of characters can be expressed this way, including several ranges at once, as in this expression, which matches all file names starting with a letter or a digit:

grep -h "^[A-Za-z0-9]" dirlist*.txt

In character ranges the hyphen is treated specially, so how can we include a literal dash in a bracket expression? By making it the first character in the expression. Let's look at two examples:

grep -h "[A-Z]" dirlist*.txt

This will match every file name containing an uppercase letter, whereas:

grep -h "[-AZ]" dirlist*.txt

will match every file name containing a dash, an uppercase "A", or an uppercase "Z".
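The difference between a range and a leading literal dash shows up clearly on a handful of names. A minimal sketch with invented names; LC_ALL=C is set as an assumption, because in some locales a range such as A-Z may follow dictionary order and also take in lowercase letters.

```shell
# Assumption: C locale, so that [A-Z] means exactly the 26 ASCII capitals.
export LC_ALL=C

names='gpg-zip
X11
Azure
zip'

range=$(printf '%s\n' "$names" | grep '[A-Z]')   # any uppercase letter
dash=$(printf '%s\n' "$names" | grep '[-AZ]')    # a dash, "A", or "Z" only

printf 'range:\n%s\ndash:\n%s\n' "$range" "$dash"
```

X11 matches the range but not the second pattern, since "X" is neither a dash, an "A", nor a "Z".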

In order to fully process texts in bash scripts using sed and awk, you just need to understand regular expressions. Implementations of this most useful tool can be found literally everywhere, and although all regular expressions are arranged in a similar way, based on the same ideas, working with them has certain peculiarities in different environments. Here we will talk about regular expressions that are suitable for use in Linux command line scripts.

This material is intended to be an introduction to regular expressions, aimed at those who may not know at all about what it is. So let's start from the very beginning.

What are regular expressions

For many, when they first see regular expressions, the thought immediately arises that they are in front of a meaningless jumble of characters. But this, of course, is far from the case. Take a look at this regex for example

In our opinion, even an absolute beginner will immediately understand how it works and why you need it :) If you don't quite understand, just read on and everything will fall into place.
A regular expression is a pattern that programs like sed or awk use to filter text. The templates use regular ASCII characters to represent themselves, and so-called metacharacters, which play a special role, for example, by allowing you to refer to certain groups of characters.

Regular Expression Types

Regular expression implementations in various environments, such as programming languages like Java, Perl, and Python, and Linux tools like sed, awk, and grep, have certain quirks. These features depend on so-called regex engines that interpret patterns.
There are two regular expression engines on Linux:

An engine that supports the POSIX Basic Regular Expression (BRE) standard.
An engine that supports the POSIX Extended Regular Expression (ERE) standard.

Most Linux utilities conform at least to the POSIX BRE standard, but some utilities (including sed) understand only a subset of the BRE standard. One of the reasons for this limitation is the desire to make such utilities as fast as possible at text processing.

The POSIX ERE standard is often implemented in programming languages. It allows you to use a lot of tools when designing regular expressions. For example, these can be special sequences of characters for frequently used patterns, such as searching the text for individual words or sets of numbers. Awk supports the ERE standard.

There are many ways to develop regular expressions, depending both on the opinion of the programmer and on the features of the engine for which they are created. It is not easy to write generic regular expressions that any engine can understand. Therefore, we will focus on the most commonly used regular expressions and take a look at how they are implemented for sed and awk.

POSIX BRE Regular Expressions

Perhaps the simplest BRE pattern is a regular expression that looks for an exact occurrence of a sequence of characters in the text. This is how sed and awk search for a string:

$ echo "This is a test" | sed -n '/test/p'
$ echo "This is a test" | awk '/test/{print $0}'

Searching for text by pattern in sed

Finding text by pattern in awk

You can notice that the search for a given pattern is performed without taking into account the exact location of the text in the string. In addition, the number of occurrences does not matter. After the regular expression finds the given text anywhere in the string, the string is considered valid and is passed on for further processing.

When working with regular expressions, keep in mind that they are case-sensitive:

$ echo "This is a test" | awk '/Test/{print $0}'
$ echo "This is a test" | awk '/test/{print $0}'

Regular expressions are case sensitive

The first regular expression did not match, since the word "Test" with a capital letter does not occur in the text. The second, which searches for the word in lowercase, found a matching line in the stream.

In regular expressions, you can use not only letters, but also spaces and digits:

$ echo "This is a test 2 again" | awk '/test 2/{print $0}'

Find a piece of text containing spaces and numbers

Spaces are treated as regular characters by the regex engine.
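The points made so far (literal search, case sensitivity, spaces as ordinary characters) fit into one short sketch. The awk and sed programs are wrapped in single quotes so that the shell does not expand $0; the variable names are arbitrary.

```shell
text='This is a test 2 again'

exact=$(printf '%s\n' "$text" | awk '/test/ {print $0}')        # lowercase: matches
wrong_case=$(printf '%s\n' "$text" | awk '/Test/ {print $0}')   # capital T: no match
with_space=$(printf '%s\n' "$text" | awk '/test 2/ {print $0}') # space and digit are literals
sed_hit=$(printf '%s\n' "$text" | sed -n '/test/p')             # same search with sed

printf 'exact=[%s]\nwrong_case=[%s]\nwith_space=[%s]\nsed_hit=[%s]\n' \
  "$exact" "$wrong_case" "$with_space" "$sed_hit"
```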

Special symbols

There are a few things to keep in mind when using different characters in regular expressions. Some special characters, or metacharacters, need special handling when used in a pattern. Here they are:

.*[]^${}\+?|()
If one of them is needed in the pattern as a literal character, it must be escaped with a backslash (\).

For example, if you need to find a dollar sign in the text, it must be included in the pattern preceded by the escape character. Let's say you have a file called myfile with the following text:

There is 10$ on my pocket
The dollar sign can be found using a pattern like this:

$ awk '/\$/{print $0}' myfile

Using a special character in a template

In addition, the backslash is itself a special character, so if you need it in a pattern, it must be escaped too. This looks like two backslashes in a row:

$ echo "\ is a special character" | awk '/\\/{print $0}'

Backslash escaping

Although the forward slash is not in the list of special characters above, using it unescaped in a regular expression written for sed or awk results in an error:

$ echo "3/2" | awk '///{print $0}'

Incorrect use of forward slash in a pattern

If you need it, it must be escaped as well:

$ echo "3/2" | awk '/\//{print $0}'

Escaping forward slash
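The three escapes discussed in this section can be verified together. The sketch below feeds made-up one-line inputs to awk; the last command is an addition not covered above, showing that sed lets you pick another delimiter (here |, introduced with a leading backslash) so that the slash needs no escaping.

```shell
dollar=$(printf '%s\n' 'There is 10$ on my pocket' | awk '/\$/ {print $0}')
backslash=$(printf '%s\n' '\ is a special character' | awk '/\\/ {print $0}')
slash=$(printf '%s\n' '3/2' | awk '/\// {print $0}')

# In sed, \| ... | uses | as the pattern delimiter instead of /.
sed_slash=$(printf '%s\n' '3/2' 'no slash here' | sed -n '\|/|p')

printf '%s\n' "$dollar" "$backslash" "$slash" "$sed_slash"
```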

Anchor symbols

There are two special characters for anchoring a pattern to the beginning or the end of a line of text. The caret (^) lets you describe sequences of characters that appear at the beginning of lines. If the pattern occurs elsewhere in the line, the regular expression does not react to it. Its use looks like this:

$ echo "welcome to likegeeks website" | awk '/^likegeeks/{print $0}'
$ echo "likegeeks website" | awk '/^likegeeks/{print $0}'

Finding a pattern at the beginning of a string

The ^ character searches for a pattern at the beginning of a line, and case is still taken into account. Let's see how this affects the processing of a text file:

$ awk '/^this/{print $0}' myfile

Search for a pattern at the beginning of a line in text from a file

With sed, if you place the caret anywhere other than the beginning of the pattern, it is treated like any ordinary character:

$ echo "This ^ is a test" | sed -n '/s ^/p'

Caret not at the beginning of a pattern in sed

In awk, when using the same pattern, the caret must be escaped:

$ echo "This ^ is a test" | awk '/s \^/{print $0}'

Caret not at the beginning of a pattern in awk

We figured out the search for text fragments located at the beginning of the line. What if you want to find something at the end of a line?

The dollar sign ($), the anchor character for the end of a line, helps us with this:

$ echo "This is a test" | awk '/test$/{print $0}'

Finding text at the end of a line

Both anchor characters can be used in the same pattern. Let's process the file myfile, the contents of which are shown in the figure below, using the following regular expression:

$ awk '/^this is a test$/{print $0}' myfile

Pattern that uses special characters for the beginning and end of a line

As you can see, the pattern reacted only to the line that fully matched the specified sequence of characters and their position.

Here's how to filter out empty lines using anchor characters:

$ awk '!/^$/{print $0}' myfile
In this pattern I used the negation symbol, the exclamation mark (!). The pattern matches lines that contain nothing between the beginning and the end of the line, and the exclamation mark makes awk print only the lines that do not match it.
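One detail worth seeing in action: "^$" removes only lines that are truly empty, while a line holding just spaces survives. The sketch below uses invented input with one empty line and one line of three spaces; the second, looser pattern (spaces or tabs only) is an illustrative extension rather than part of the text above.

```shell
# Count the lines kept by the strict filter: the space-only line is still there.
strict=$(printf 'first\n\nsecond\n   \nthird\n' | awk '!/^$/ {print $0}' | grep -c '')

# Looser filter: also drop lines made up only of spaces or tabs.
loose=$(printf 'first\n\nsecond\n   \nthird\n' | awk '!/^[ \t]*$/ {print $0}')

printf 'strict keeps %s lines\nloose keeps:\n%s\n' "$strict" "$loose"
```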

Point symbol

The period matches any single character except the newline character. Let's run the file myfile, whose contents are shown below, through a regular expression that uses it:

$ awk '/.st/{print $0}' myfile

Using a dot in regular expressions

As you can see from the displayed data, only the first two lines of the file match the pattern, since they contain the sequence of characters "st" preceded by one more character. The third line contains no suitable sequence, and in the fourth line "st" is present but sits at the very beginning of the line, with nothing before it.
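The behaviour described above can be reproduced with a small sketch. The four sample lines below are an assumption that mimics the description, not the actual contents of myfile:

```shell
# Hypothetical four-line input matching the description in the text.
sample='this is a test
This is another test
And this is one more
start with this'

# .st needs some character immediately before "st".
dot_matches=$(printf '%s\n' "$sample" | awk '/.st/{print $0}')
printf '%s\n' "$dot_matches"
```

The first two lines match through "est"; the last line does not, because its only "st" is at the start of the line.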

Character classes

The period matches any single character, but what if you need to be more flexible in limiting the set of characters you are looking for? In a similar situation, you can use character classes.

With this approach you can search for any character from a given set. Square brackets, [], are used to describe a character class:

$ awk '/[oi]th/{print $0}' myfile

Regular Expression Character Class Description

Here we are looking for a sequence of characters "th", preceded by the character "o" or the character "i".

Classes come in handy when looking for words that can start with both uppercase and lowercase letters:

$ echo "this is a test" | awk '/[Tt]his is a test/{print $0}'
$ echo "This is a test" | awk '/[Tt]his is a test/{print $0}'

Find words that can start with a lowercase or uppercase letter

Character classes are not limited to letters. Other symbols can be used here as well. It is impossible to say in advance in what situation the classes will be needed - it all depends on the problem being solved.

Negation of character classes

Character classes can also be used to solve the opposite problem described above. Namely, instead of searching for symbols included in the class, you can organize a search for everything that is not included in the class. In order to achieve this behavior of a regular expression, a ^ must be placed in front of the character list of the class. It looks like this:

$ awk '/[^oi]th/{print $0}' myfile

Find characters outside of a class

In this case, sequences of characters "th" will be found, before which there is neither "o" nor "i".
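To see a class and its negation side by side, here is a minimal sketch on made-up input (the sample words and variable names are assumptions for illustration):

```shell
# Hypothetical input: "oth", "ith", "ath", and a "th" with nothing before it.
sample='nothing here
with care
math class
the end'

# "th" preceded by "o" or "i".
in_class=$(printf '%s\n' "$sample" | awk '/[oi]th/{print $0}')

# "th" preceded by some character that is neither "o" nor "i".
not_in_class=$(printf '%s\n' "$sample" | awk '/[^oi]th/{print $0}')

printf 'class:   %s\n' "$in_class"
printf 'negated: %s\n' "$not_in_class"
```

Note that "the end" matches neither pattern: a negated class still has to match some character, and here nothing precedes "th".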

Ranges of characters

In character classes, you can describe ranges of characters using a dash:

$ awk '/[e-p]st/{print $0}' myfile

Describing a range of characters in a character class

In this example, the regular expression responds to the sequence of characters "st", preceded by any character located, in alphabetical order, between the characters "e" and "p".

Ranges can also be created from numbers:

$ echo "123" | awk '/[0-9][0-9][0-9]/'
$ echo "12a" | awk '/[0-9][0-9][0-9]/'

Regular expression to find any three digits

Several ranges can be included in a character class:

$ awk '/[a-fm-z]st/{print $0}' myfile

Multi-Ranged Character Class

This regular expression matches every sequence "st" that is preceded by a character from the range a-f or the range m-z.
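The range examples can be tried on made-up input as well. In this sketch the sample words are assumptions, and LC_ALL=C is set so that the ranges follow plain ASCII order regardless of the system locale:

```shell
# "e" and "i" fall inside e-p; "a" and "u" do not.
ranged=$(printf 'test\nlist\nfast\nrust\n' | LC_ALL=C awk '/[e-p]st/{print $0}')
printf '%s\n' "$ranged"

# Three digits in a row: "123" qualifies, "12a" does not.
three_digits=$(printf '123\n12a\n' | LC_ALL=C awk '/[0-9][0-9][0-9]/')
printf '%s\n' "$three_digits"
```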

Special character classes

BRE has special character classes that you can use when writing regular expressions:

[[:alpha:]] - matches any uppercase or lowercase alphabetic character.
[[:alnum:]] - matches any alphanumeric character, namely, characters in the ranges 0-9, A-Z, a-z.
[[:blank:]] - matches a space and a tab.
[[:digit:]] - any digit from 0 to 9.
[[:upper:]] - uppercase alphabetic characters, A-Z.
[[:lower:]] - lowercase alphabetic characters, a-z.
[[:print:]] - matches any printable character.
[[:punct:]] - matches punctuation marks.
[[:space:]] - whitespace characters, in particular space, tab, NL, FF, VT, CR.

You can use special classes in templates like this:

$ echo "abc" | awk '/[[:alpha:]]/{print $0}'
$ echo "abc" | awk '/[[:digit:]]/{print $0}'
$ echo "abc123" | awk '/[[:digit:]]/{print $0}'

Special character classes in regular expressions
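The same classes work in grep, which makes it easy to wrap a check in a small helper. This is only a sketch: the function name has_digit is made up for illustration.

```shell
# Hypothetical helper: succeeds if the argument contains at least one digit.
has_digit() {
    printf '%s\n' "$1" | grep -q '[[:digit:]]'
}

for word in abc abc123; do
    if has_digit "$word"; then
        echo "$word: contains a digit"
    else
        echo "$word: no digits"
    fi
done
```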

Star symbol

If you place an asterisk after a character in the pattern, this means that the regular expression will work if the character appears in the string any number of times - including the situation when the character is absent in the string.

$ echo "test" | awk '/tes*t/{print $0}'
$ echo "tessst" | awk '/tes*t/{print $0}'

Using the * character in regular expressions

The asterisk is commonly used for words that are often misspelled or that have several valid spellings:

$ echo "I like green color" | awk '/colou*r/{print $0}'
$ echo "I like green colour" | awk '/colou*r/{print $0}'

Search for a word that has different spellings

In this example, the same regexp matches both the word "color" and the word "colour". This is because the character "u", which is followed by an asterisk, can either be absent or appear several times in a row.

Another useful feature that follows from the peculiarities of the asterisk symbol is to combine it with a dot. This combination allows the regular expression to respond to any number of any characters:

$ awk '/this.*test/{print $0}' myfile

A template that responds to any number of any characters

In this case, it doesn't matter how many and what characters are between the words "this" and "test".

The asterisk can also be used with character classes:

$ echo "st" | awk '/s[ae]*t/{print $0}'
$ echo "sat" | awk '/s[ae]*t/{print $0}'
$ echo "set" | awk '/s[ae]*t/{print $0}'

Using an asterisk with character classes

In all three examples, the regular expression works because the asterisk after the character class means that if any number of "a" or "e" characters are found, or if they cannot be found, the string will match the specified pattern.
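Here is a minimal sketch that runs the class-plus-asterisk pattern over several words at once. The word list is an assumption, and the pattern is anchored with ^ and $ so that the whole word has to fit:

```shell
# "st", "sat", "set" and "seat" consist of "s", any number of a/e, then "t".
# "sit" is rejected because "i" is not in the class.
star_hits=$(printf 'st\nsat\nset\nseat\nsit\n' | awk '/^s[ae]*t$/')
printf '%s\n' "$star_hits"
```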

POSIX ERE Regular Expressions

POSIX ERE patterns, which some Linux utilities support, may contain additional characters. As already mentioned, awk supports this standard, whereas sed uses BRE by default (GNU sed accepts ERE when run with the -E option).

Here we will look at the most commonly used symbols in ERE patterns, which will come in handy when creating your own regular expressions.

▍Question mark

The question mark indicates that the preceding character may appear once in the text or not at all. This character is one of the repetition metacharacters. Here are some examples:

$ echo "tet" | awk '/tes?t/{print $0}'
$ echo "test" | awk '/tes?t/{print $0}'
$ echo "tesst" | awk '/tes?t/{print $0}'

Question mark in regular expressions

As you can see, in the third case, the letter "s" occurs twice, so the regular expression does not react to the word "tesst".

The question mark can be used with character classes as well:

Question mark and character classes

If there are no characters from the class in the string, or one of them occurs once, the regular expression is triggered, but as soon as two characters appear in the word, the system no longer finds a match for the pattern in the text.
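The behaviour just described can be sketched as follows. The class [ae] and the word list are assumptions picked for illustration:

```shell
# "tst" has no class character, "test" and "tast" have exactly one,
# while "taest" and "teest" have two in a row and are rejected.
qmark_hits=$(printf 'tst\ntest\ntast\ntaest\nteest\n' | awk '/t[ae]?st/{print $0}')
printf '%s\n' "$qmark_hits"
```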

▍Plus symbol

The plus symbol in the pattern indicates that the regular expression will find the desired one if the preceding character occurs one or more times in the text. At the same time, such a construction will not react to the absence of a symbol:

$ echo "test" | awk '/te+st/{print $0}'
$ echo "teest" | awk '/te+st/{print $0}'
$ echo "tst" | awk '/te+st/{print $0}'

Plus sign in regular expressions

In this example, if there is no "e" in a word, the regex engine will not find a match for the pattern in the text. The plus symbol works with character classes as well, just like the asterisk and the question mark:

$ echo "tst" | awk '/t[ae]+st/{print $0}'
$ echo "test" | awk '/t[ae]+st/{print $0}'
$ echo "teast" | awk '/t[ae]+st/{print $0}'
$ echo "teeast" | awk '/t[ae]+st/{print $0}'

Plus sign and character classes

In this case, if the string contains any character from the class, the text will be considered to match the pattern.

▍Curly braces

The curly braces that you can use in ERE patterns are similar to the characters discussed above, but they allow you to more accurately specify the required number of occurrences of the character that precedes them. The limitation can be specified in two formats:

n - a number that specifies the exact number of occurrences to find
n, m - two numbers, which are interpreted as follows: "at least n times, but not more than m".

Here are examples of the first option:

$ echo "tst" | awk '/te{1}st/{print $0}'
$ echo "test" | awk '/te{1}st/{print $0}'

Curly braces in patterns, find exact number of occurrences

In older versions of awk, you had to use the --re-interval command line switch in order for the program to recognize intervals in regular expressions, but in newer versions this is not necessary.

$ echo "tst" | awk '/te{1,2}st/{print $0}'
$ echo "test" | awk '/te{1,2}st/{print $0}'
$ echo "teest" | awk '/te{1,2}st/{print $0}'
$ echo "teeest" | awk '/te{1,2}st/{print $0}'

An interval specified in curly braces

In this example, the character "e" must appear in the line 1 or 2 times, then the regular expression will react to the text.

Curly braces can also be used with character classes. Here are the principles that are already familiar to you:

$ echo "tst" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "test" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teest" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teeast" | awk '/t[ae]{1,2}st/{print $0}'

Curly braces and character classes

The template will react to the text if it contains the character "a" or the character "e" once or twice.
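Intervals can also be tried with grep -E, which uses the same ERE syntax. The sketch below uses grep rather than awk only as an assumption about portability, since interval support varies between awk builds; the word list is made up:

```shell
# One or two "e" characters between "t" and "st".
interval_hits=$(printf 'tst\ntest\nteest\nteeest\n' | grep -E 'te{1,2}st')
printf '%s\n' "$interval_hits"
```

"tst" has no "e" at all and "teeest" has three, so neither is printed.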

▍The "or" symbol

Symbol | - vertical bar, means logical "or" in regular expressions. When processing a regular expression containing several fragments separated by such a sign, the engine will consider the parsed text to be appropriate if it matches any of the fragments. Here's an example:

Boolean "or" in regular expressions

In this example, the regular expression is configured to search the text for the words "test" or "exam". Note that between the template fragments and the separating symbol | there should be no spaces.
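A minimal sketch of such a search for "test" or "exam" might look like this (the three input lines are assumptions for illustration):

```shell
# The third line contains neither word and is dropped.
alt_hits=$(printf 'This is a test\nThis is an exam\nThis is something else\n' \
    | awk '/test|exam/{print $0}')
printf '%s\n' "$alt_hits"
```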

Regular expression fragments can be grouped using parentheses. A grouped sequence of characters is treated by the engine as a single unit, so, for example, repetition metacharacters can be applied to it. This is how it looks:

$ echo "Like" | awk '/Like(Geeks)?/{print $0}'
$ echo "LikeGeeks" | awk '/Like(Geeks)?/{print $0}'

Grouping Regular Expression Fragments

In these examples, the word "Geeks" is enclosed in parentheses, followed by a question mark. Recall that a question mark means "0 or 1 repetition", as a result, the regular expression will respond to both the "Like" string and the "LikeGeeks" string.

Practical examples

Now that we've covered the basics of regular expressions, it's time to do something useful with them.

▍Counting the number of files

Let's write a bash script that counts the files in the directories that are written to the PATH environment variable. In order to do this, you will first need to generate a list of paths to directories. Let's do it with sed, replacing colons with spaces:

$ echo $PATH | sed 's/:/ /g'
The substitute command supports regular expressions as patterns for searching text. In this case everything is extremely simple, we are just looking for the colon character, but nothing prevents you from using something more elaborate here; it all depends on the specific task.
Now you need to go through the resulting list in a loop and perform the actions necessary to count the number of files there. The general scheme of the script will be as follows:

mypath=$(echo $PATH | sed 's/:/ /g')
for directory in $mypath
do
    : # the counting code will go here
done
Now let's write the full text of the script, using the ls command to get information about the number of files in each of the directories:

#!/bin/bash
# Replace the colons in PATH with spaces to get a list of directories
mypath=$(echo $PATH | sed 's/:/ /g')
count=0
for directory in $mypath
do
    # List the directory and count every entry in it
    check=$(ls $directory)
    for item in $check
    do
        count=$((count + 1))
    done
    echo "$directory - $count"
    count=0
done
When you run the script, it may turn out that some directories from PATH do not exist, however, this will not prevent it from counting files in existing directories.

Counting files

The main value of this example lies in the fact that using the same approach, you can solve much more complex problems. Which one exactly depends on your needs.

▍Checking Email Addresses

There are websites with huge collections of regular expressions that allow you to validate email addresses, phone numbers, and so on. However, it's one thing to take a ready-made one, and quite another to create something yourself. So let's write a regular expression to validate email addresses. Let's start by analyzing the initial data. For example, here is a certain address:

[email protected]
The username part can consist of alphanumeric characters and a few others, namely the period, dash, underscore and plus sign. The username is followed by the @ sign.

Armed with this knowledge, let's start assembling the regular expression from its left side, which serves to validate the username. Here's what we got:

^([a-zA-Z0-9_.+-]+)@
This regular expression can be read like this: "At the beginning of a line, there must be at least one character from those in the group specified in square brackets, and after that there must be an @ sign."

Now it is the hostname's turn. Much the same rules apply here as for the username, so the pattern for it looks like this:

([a-zA-Z0-9_.-]+)
The top-level domain name is subject to special rules. There can be only alphabetic characters, of which there must be at least two (for example, such domains usually contain a country code), and no more than five. All this means that the template for checking the last part of the address will be like this:

\.([a-zA-Z]{2,5})$
You can read it like this: "First there must be a period, then - from 2 to 5 alphabetical characters, and after that the line ends."

Having prepared templates for the individual parts of the regular expression, let's put them together:

^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$
Now all that remains is to test what happened:

$ echo "[email protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'
$ echo "[email protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'

Validating an Email Address Using Regular Expressions

The fact that the text passed to awk is displayed on the screen means that the system recognized the email address in it.
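The check can also be wrapped in a small shell function. What follows is only a sketch under stated assumptions: the function name is_email and the address user@example.com are made up for illustration, the character classes follow the description above (letters, digits, period, dash, underscore, plus sign), and grep -E is used in place of awk.

```shell
# Hypothetical helper: succeeds if the argument looks like an email address
# according to the simplified rules described in the text.
is_email() {
    printf '%s\n' "$1" |
        grep -Eq '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'
}

for address in 'user@example.com' 'user@example' 'user@example.c'; do
    if is_email "$address"; then
        echo "$address: looks valid"
    else
        echo "$address: rejected"
    fi
done
```

The second address has no top-level domain at all and the third has one that is only a single letter long, so both are rejected. Real-world address validation has many more corner cases than this pattern covers.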

Outcomes

If the regular expression for validating email addresses that you met at the very beginning of the article seemed completely incomprehensible then, we hope that it no longer looks like a meaningless set of characters. If so, this material has fulfilled its purpose. Regular expressions are a topic you can keep studying all your life, but even the little we have covered here can already help you write fairly advanced text-processing scripts.

In this series of articles, we usually showed very simple examples of bash scripts that consisted of literally a few lines. Next time, let's look at something bigger.

Dear Readers! Do you use regular expressions when processing text in command line scripts?

One of the most useful and versatile commands in the Linux terminal is the "grep" command. The name is an acronym for the English phrase "search Globally for lines matching the Regular Expression, and Print them". The "grep" command scans the input stream sequentially, line by line, looking for matches and outputs (filters) only those lines that contain text that matches the given pattern - regular expression.

Regular expressions are a special formal language for searching and manipulating substrings in a text, based on the use of metacharacters. Almost all modern programming languages now have built-in support for regular expressions for processing text. Historically, however, this approach was popularized largely by the UNIX world and, in particular, by the ideas embodied in the commands "grep", "sed" and others. The philosophy "everything is a file" permeates UNIX, and mastery of the tools for working with text files is one of the essential skills of every Linux user.

SAMPLE

The simplest case is a search for all lines that contain the text "Adams". In this and the following examples we stick to the same layout: the command-line parameters come first, followed by the standard input (stdin) and then the standard output (stdout).
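A minimal sketch of that simplest search, reusing the list of presidents that appears in the examples of this article (the variable names are assumptions):

```shell
# Input lines taken from the examples in this article.
presidents='George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809'

# Print only the lines that contain the literal text "Adams".
adams=$(printf '%s\n' "$presidents" | grep "Adams")
printf '%s\n' "$adams"
```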

The "grep" command has an impressive number of options that you can specify at startup. You can do a lot of useful things with these options and you don't even need to be good at regular expression syntax.

OPTIONS

To begin with, "grep" can not only filter stdin, but also search through files. By default, "grep" searches only the files named on the command line, but with the very useful --recursive option you can tell it to search recursively, starting from a given directory.
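Here is a hedged sketch of a recursive search. It builds a throwaway directory tree with mktemp so that it does not depend on any existing files; the directory and file names are made up, and --files-with-matches is added so that only the names of matching files are printed:

```shell
# Build a small temporary tree with one matching file two levels down.
workdir=$(mktemp -d)
mkdir -p "$workdir/logs/old"
printf 'John Adams, 1797-1801\n' > "$workdir/logs/old/presidents.txt"
printf 'nothing to see here\n' > "$workdir/notes.txt"

# Search the whole tree and list the files that contain "Adams".
recursive_hits=$(grep --recursive --files-with-matches "Adams" "$workdir")
printf '%s\n' "$recursive_hits"

# Clean up the temporary tree.
rm -rf "$workdir"
```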

By default, the "grep" command is case sensitive. The following example shows a case-insensitive search, in which "Adams" and "adams" are treated as the same:

$ grep --ignore-case "adams"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
John Adams, 1797-1801

Inverted search (sometimes called inverse search): all lines are printed except those that contain the specified pattern:

$ grep --invert-match "Adams"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
George Washington, 1789-1797
Thomas Jefferson, 1801-1809

The options, of course, can and should be combined with each other. For example, here is an inverted search that also prints the numbers of the selected lines:

$ grep --line-number --invert-match "Adams"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
1:George Washington, 1789-1797
3:Thomas Jefferson, 1801-1809

Coloring. Sometimes it is convenient to have the word we are looking for highlighted in color. This is already built into "grep"; all that remains is to turn it on:

$ grep --line-number --color=always "Adams"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
2:John Adams, 1797-1801

Suppose we want to select all errors from a log file, and we know that the line after an error may contain useful information. In that case it is convenient to print a few lines of context. By default, "grep" only prints the line where a match was found, but there are several options that make it print more. To display several lines (in our case, two) after the match:

$ grep --color=always -A2 "Adams"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809
James Madison, 1809-1817
James Monroe, 1817-1825

stdout:
John Adams, 1797-1801
Thomas Jefferson, 1801-1809
James Madison, 1809-1817

Likewise, to print several additional lines before the match:

$ grep --color=always -B2 "James"

However, most often a symmetric context is needed, and there is an even shorter notation for it. Let's print two lines both above and below the match:

$ grep --color=always -C2 "James"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809
James Madison, 1809-1817
James Monroe, 1817-1825
John Quincy Adams, 1825-1829
Andrew Jackson, 1829-1837
Martin Van Buren, 1837-1841

stdout:
John Adams, 1797-1801
Thomas Jefferson, 1801-1809
James Madison, 1809-1817
James Monroe, 1817-1825
John Quincy Adams, 1825-1829
Andrew Jackson, 1829-1837

When you search for qwe, by default "grep" will also print qwe123, 345qwerty and similar combinations. Let's find only those lines that contain the pattern as a whole word:

$ grep --word-regexp --color=always "John"

stdin:
John Fitzgerald Kennedy, 1961-1963
Lyndon Baines Johnson, 1963-1969

stdout:
John Fitzgerald Kennedy, 1961-1963

And finally, if you only want the number of matching lines as a single number, with nothing else printed:

$ grep --count --color=always "John"

stdin:
John Fitzgerald Kennedy, 1961-1963
Lyndon Baines Johnson, 1963-1969
Richard Milhous Nixon, 1969-1974

stdout:
2

It is worth noting that most of the options have a short form, for example --ignore-case can be shortened to -i, and so on.
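As a small illustration, the following sketch runs the same search with long and with short options and compares the results. It assumes the usual correspondences -i for --ignore-case and -n for --line-number; the variable names are made up:

```shell
presidents='George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809'

# Long option names...
long_form=$(printf '%s\n' "$presidents" | grep --ignore-case --line-number "adams")

# ...and their short counterparts, bundled together.
short_form=$(printf '%s\n' "$presidents" | grep -in "adams")

if [ "$long_form" = "$short_form" ]; then
    echo "Both forms print: $long_form"
fi
```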

BASIC REGULAR EXPRESSIONS

All regular expressions are made up of two types of characters: standard text characters, called literals, and special characters, called metacharacters. In the previous examples the search was carried out by literals (an exact letter-by-letter match), but from here on it gets much more interesting. Welcome to the world of regular expressions!

The caret character ^ and the dollar sign $ have special meanings in a regular expression. They are called “anchors”. Anchors are special characters that indicate the location in a string of a desired match. When the search reaches the anchor, it checks if there is a match, and if so, it continues to follow the pattern, adding nothing to the result.

The caret anchor is used to indicate that the regular expression should be checked from the beginning of the line:

$ grep --color=always "^J"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
John Adams, 1797-1801

Similarly, the dollar anchor should be used at the end of the pattern to indicate that the match is valid only if the search string is at the end of the text line, and not otherwise:

$ grep --color=always "9$"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
Thomas Jefferson, 1801-1809

Any character. The dot symbol is used in regular expressions to indicate that absolutely any character can be in the specified location:

$ grep --color=always "0.$"

Escaping. If you need to find exactly the dot character, escaping will help. An escape character (usually a backslash) preceding a character such as a period turns the metacharacter into a literal:

$ grep --color=always "\."

stdin:
George Washington. 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
George Washington. 1789-1797

Classes of characters. Ranges and character classes can be used in regular expressions. For this, square brackets are used when composing the pattern. By placing a group of characters (including characters that would otherwise be interpreted as metacharacters) in square brackets, you specify that any of the characters in the brackets can be at that position:

$ grep --color=always "0"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

Range. These are two characters separated by a hyphen, for example 0-9 (decimal digits) or 0-9a-fA-F (hexadecimal digits):

$ grep --color=always "[0-9]"

stdin:
George Washington, ???
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

Negation. If the first character of the expression in square brackets is a caret, then the rest of the characters are taken as a set of characters that must not be present at the given position of the regular expression:

$ grep --color=always "[^7]$"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

POSIX character classes. There is a set of predefined character classes that you can use in regular expressions. There are about a dozen of them, and a quick look through the manual is enough to understand the purpose of each. For example, let's keep only the lines that consist entirely of hexadecimal digits:

$ grep --color=always "^[[:xdigit:]]*$"

stdin:
4.2
42
42abc

stdout:
42
42abc

Repetition (0 or more times). One of the most commonly used metacharacters is the asterisk, which means "repeat the previous character or expression zero or more times":

$ grep --color=always "^[^0-9]*$"

stdin:
George Washington, ???
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
George Washington, ???

A distinction is made between basic regular expressions (BRE) and extended regular expressions (ERE). BRE recognizes the metacharacters ^ $ . [ ] * and treats all other characters as literals. ERE adds the metacharacters ( ) { } ? + | and the functionality related to them. To confuse matters further, in "grep" the characters ( ) { } are treated as metacharacters in BRE if they are escaped with a backslash, whereas in ERE a backslash before any metacharacter makes it a literal.
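The difference in escaping is easiest to see by running the same input through both syntaxes. This is a minimal sketch; the two input lines and the variable names are assumptions for illustration:

```shell
input='abab
(ab)'

# BRE: escaped parentheses and braces act as grouping and interval.
bre_group=$(printf '%s\n' "$input" | grep '\(ab\)\{2\}')

# ERE: the same metacharacters are written without backslashes.
ere_group=$(printf '%s\n' "$input" | grep -E '(ab){2}')

# BRE: unescaped parentheses are plain literal characters.
bre_literal=$(printf '%s\n' "$input" | grep '(ab)')

# ERE: a backslash turns the parentheses back into literals.
ere_literal=$(printf '%s\n' "$input" | grep -E '\(ab\)')

printf 'group:   %s / %s\n' "$bre_group" "$ere_group"
printf 'literal: %s / %s\n' "$bre_literal" "$ere_literal"
```

Both grouping patterns select "abab" ("ab" repeated twice), and both literal patterns select the line that really contains parentheses.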

EXTENDED REGULAR EXPRESSIONS

Disjunction. Just as square brackets indicate different possible matches for a single character, disjunction allows you to specify alternative matches for character strings or expressions. The pipe symbol is used to indicate disjunction:

$ grep --extended-regexp --color=always "George|John"

stdin:
George Washington, 1789-1797
John Adams, 1797-1801
Thomas Jefferson, 1801-1809

stdout:
George Washington, 1789-1797
John Adams, 1797-1801

Match zero or one time. In extended regular expressions, there are several additional metacharacters that indicate how often a character or expression is repeated (just as the asterisk metacharacter indicates 0 or more matches). One such metacharacter is the question mark, which makes the preceding character or expression essentially optional:

$ grep --extended-regexp --color=always "^(Andrew )?John"

stdin:
John Adams, 1797-1801
Andrew Johnson, 1865-1869
Lyndon Baines Johnson, 1963-1969

stdout:
John Adams, 1797-1801
Andrew Johnson, 1865-1869

Match one or more times. A plus sign metacharacter is provided for this. It works almost like the asterisk, except that the expression must match at least once:

$ grep --extended-regexp --color=always "^[[:alpha:] ]+$"

stdin:
John Adams
Andrew Johnson, 1865-1869
Lyndon Baines Johnson, 1963-1969

stdout:
John Adams

Match the specified number of times. You can use curly braces for this. These metacharacters are used to specify an exact number, a range, or an upper and lower limit for the number of matches of an expression:

$ grep --extended-regexp --color=always "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

stdin:
42
127.0.0.1

stdout:
127.0.0.1

The grep command is so useful, versatile, and easy to use that once you get to know it, it’s impossible to imagine working without it.

A regular expression is a text pattern consisting of a combination of letters, numbers and special characters known as metacharacters. Close relatives of regular expressions are the wildcard expressions commonly used in file management. Regular expressions are mainly used for matching and searching text, and extensively for syntax parsing.

UNIX users are familiar with regular expressions from grep, sed, awk (or gawk), and ed. Using these programs or their analogues, you can try and verify the examples below. Text editors such as (X)Emacs and vi also make heavy use of regular expressions. Perhaps the most famous and widest use of regular expressions occurs in the Perl language. It is difficult for a software developer or system administrator to do without knowledge of regular expressions.

Metacharacters

So, strings can be composed of letters, numbers, and metacharacters. The metacharacters are:

\ | ( ) [ ] { } ^ $ * + ? . < >

Metacharacters can play the following roles in a regular expression:

- quantifier;
- assertion;
- group sign;
- alternative;
- sequence sign.

Quantifiers

The * (asterisk) metacharacter stands for 0 or more repetitions of the preceding character. The + (plus) metacharacter stands for 1 or more repetitions. The . (dot) metacharacter stands for exactly 1 arbitrary character. The ? (question mark) metacharacter stands for 0 or 1 occurrences. The difference between * and + is that a search for c* matches any string, including an empty one, whereas a search for c+ matches only strings that contain the character c.

Empty strings obey the following conventions: an empty string contains exactly one empty string; a non-empty string contains an empty string before each of its characters and also at its end.

Regular expressions also use the {n,m} construction, which means that the character preceding it occurs from n to m times in the string. Omitting the number m means infinity. That is, the entries {0,}, {1,} and {0,1} are special cases of the construction: the first corresponds to *, the second to +, and the third to ?. These equalities follow easily from the definitions of the corresponding quantifiers. In addition, the construction {n} means that the character occurs exactly n times.
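These equivalences can be checked directly with grep -E. The sketch below is only an illustration; the three sample words and the variable names are assumptions:

```shell
words='ct
cat
caat'

# {0,} behaves like *: zero or more "a".
star=$(printf '%s\n' "$words" | grep -E '^ca*t$')
star_interval=$(printf '%s\n' "$words" | grep -E '^ca{0,}t$')

# {1,} behaves like +: at least one "a".
plus=$(printf '%s\n' "$words" | grep -E '^ca+t$')
plus_interval=$(printf '%s\n' "$words" | grep -E '^ca{1,}t$')

# {0,1} behaves like ?: at most one "a".
opt=$(printf '%s\n' "$words" | grep -E '^ca?t$')
opt_interval=$(printf '%s\n' "$words" | grep -E '^ca{0,1}t$')

[ "$star" = "$star_interval" ] && echo '* and {0,} agree'
[ "$plus" = "$plus_interval" ] && echo '+ and {1,} agree'
[ "$opt" = "$opt_interval" ] && echo '? and {0,1} agree'
```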

Because some punctuation marks and mathematical symbols are used as metacharacters, an additional metacharacter \ (backslash) has been introduced, which, when written before a metacharacter, turns the latter into an ordinary character. That is, ? is a quantifier, and \? is a literal question mark.

Groups

The quantifiers described above, as already mentioned, act on the character closest to them on the left (the last preceding one). Groups, written with the metacharacters ( and ), let you get around this limitation. These characters mark off a subexpression as a group, to which a quantifier can then be applied.

Example:

means (or replaces)

Ho ho ho ho ho ho hohoho

Nesting of subexpressions is possible, i.e. shorter subexpressions can be extracted from a subexpression.
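To illustrate how a group changes what a quantifier applies to, here is a minimal sketch. The pattern (ho)+ and the input words are assumptions chosen for illustration:

```shell
words='ho
hoho
hohoho
hoo'

# With a group, + repeats the whole sequence "ho".
grouped=$(printf '%s\n' "$words" | grep -E '^(ho)+$')

# Without a group, + repeats only the last character, "o".
ungrouped=$(printf '%s\n' "$words" | grep -E '^ho+$')

printf 'grouped:\n%s\n' "$grouped"
printf 'ungrouped:\n%s\n' "$ungrouped"
```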

Alternatives

Formed using the metacharacter | (vertical bar) denoting a logical “or”.

Example: the regular expression коров(а|ы|е|у|ой|ою)? covers all singular case forms of the Russian word "корова" (cow).

Assertions

Some metacharacters denote special objects: zero-length strings that mark a position relative to the text preceding or following them. Such objects are called assertions. The following assertions exist in regular expressions:

^ start of line
$ end of line
< start of word
> end of word

Example: the regular expression ^The matches a string that starts with The.

Note: Regular characters can be viewed as assertions with non-zero length.

Sequences

A special construction enclosed in the [ and ] metacharacters (square brackets) lists the characters that may appear at a given place in the regular expression; it is called a sequence. Inside square brackets all metacharacters are treated as ordinary characters, while the symbols - (minus) and ^ acquire new meanings: the first specifies a continuous range of characters between the two given ones, and the second gives a logical "not" (negation). This is easiest to see from examples:

any of the lowercase Latin letters:

[a-z]

Latin alphanumeric character (a to z, A to Z, and 0 to 9):

[a-zA-Z0-9]

character that is not a Latin alphanumeric:

[^a-zA-Z0-9]

any word (without hyphens, mathematical symbols and numbers):

<[a-zA-Z]+>

For brevity and simplicity, the following abbreviations are introduced:

\d - a digit (i.e. matches the expression [0-9]);
\D - not a digit (i.e. [^0-9]);
\w - a Latin word character (alphanumeric);
\W - a character that is not a Latin alphanumeric ([^a-zA-Z0-9]);
\s - whitespace [ \t\n\r\f], i.e. spaces, tabs, etc.;
\S - a non-whitespace character ([^ \t\n\r\f]).

Relationship with wildcards

Every user is probably familiar with wildcards. An example of a wildcard expression is *.jpg, which denotes all files with the jpg extension. How are regular expressions different from wildcards? The differences can be summarized in three rules for converting an arbitrary wildcard expression to a regular expression:

Replace * with .*

Replace ? with .

Replace all characters that match metacharacters with their backslashed variants.

Indeed, in a regular expression, writing * on its own is useless and yields an empty string, since it means that the empty string is repeated any number of times. By contrast, .* (repeat an arbitrary character any number of times, including 0) coincides exactly in meaning with the * character in wildcards.

The regular expression matching *.jpg will look like this: .*\.jpg. For example, the wildcard sequence ez*.[ch]pp corresponds to two equivalent regular expressions, ez.*\.[ch]pp and ez.*\.(cpp|hpp).
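The three conversion rules can be sketched as a small shell function built on sed. This is an illustration under stated assumptions, not a complete converter: the function name glob_to_regex is made up, bracket expressions such as [ch] are passed through unchanged, and the metacharacters are escaped first so that the dots introduced by the later steps are not escaped again.

```shell
# Hypothetical helper: turn a simple wildcard expression into a regular expression.
glob_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/[\\.^$+(){}|]/\\&/g' \
        -e 's/\*/.*/g' \
        -e 's/?/./g'
}

regex=$(glob_to_regex '*.jpg')
echo "$regex"

# Anchor the result so that it has to cover the whole file name.
matches=$(printf 'a.jpg\nb.png\nxjpg\n' | grep -E "^${regex}\$")
printf '%s\n' "$matches"
```

The first sed expression implements the escaping rule, the second replaces * with .*, and the third replaces ? with a single dot.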

Regular Expression Examples

E-mail in the format [email protected]

+(\.+)*@+(\.+)+

E-mail in the format "Ivan Ivanov "

("? +"? [\ t] *) + \<+(\.+)*@+(\.+)+\>

Checking the web protocol in the URL (http://, ftp:// or https://)

[a-z]+://

Some C/C++ constructs and directives:

^#include[ \t]+[<"][^>"]+[">] - include directive

//.+$ - comment on one line

/\*[^*]*\*/ - comment on several lines

-?[0-9]+\.[0-9]+ - floating point number

0x[0-9a-fA-F]+ - hexadecimal number.

And here, for example, the program for finding the word cow:

grep -E "cow|vache" * >/dev/null && echo "Found a cow"

Here the -E option is used to enable extended regular expression syntax support.

This text is based on an article by Jan Borsodi from the HOWTO-regexps.htm file