Regular Expressions


In the next few chapters, we are going to look at tools used to manipulate text. As we have seen, text data plays an important role on all Unix-like systems, such as Linux. But before we can fully appreciate all of the features offered by these tools, we have to first examine a technology that is frequently associated with the most sophisticated uses of these tools — regular expressions.

As we have navigated the many features and facilities offered by the command line, we have encountered some truly arcane shell features and commands, such as shell expansion and quoting, keyboard shortcuts, and command history, not to mention the vi editor. Regular expressions continue this “tradition” and may be (arguably) the most arcane feature of them all. This is not to suggest that the time it takes to learn about them is not worth the effort. Quite the contrary. A good understanding will enable us to perform amazing feats, though their full value may not be immediately apparent.

What Are Regular Expressions?

Simply put, regular expressions are symbolic notations used to identify patterns in text. In some ways, they resemble the shell’s wildcard method of matching file and pathnames, but on a much grander scale. Regular expressions are supported by many command line tools and by most programming languages to facilitate the solution of text manipulation problems. However, to further confuse things, not all regular expressions are the same; they vary slightly from tool to tool and from programming language to language. For our discussion, we will limit ourselves to regular expressions as described in the POSIX standard (which will cover most of the command line tools), as opposed to many programming languages (most notably Perl), which use slightly larger and richer sets of notations.

grep

The main program we will use to work with regular expressions is our old pal, grep. The name “grep” is actually derived from the phrase “global regular expression print,” so we can see that grep has something to do with regular expressions. In essence, grep searches text files for the occurrence of a specified regular expression and outputs any line containing a match to standard output.

So far, we have used grep with fixed strings, like so:

[me@linuxbox ~]$ ls /usr/bin | grep zip

This will list all the files in the /usr/bin directory whose names contain the substring “zip”.

The grep program accepts options and arguments this way:

grep [options] regex [file...]

where regex is a regular expression.

Here is a list of the commonly used grep options:

Table 20-1: grep Options
Option Description
-i Ignore case. Do not distinguish between upper and lower case characters. May also be specified --ignore-case.
-v Invert match. Normally, grep prints lines that contain a match. This option causes grep to print every line that does not contain a match. May also be specified --invert-match.
-c Print the number of matches (or non-matches if the -v option is also specified) instead of the lines themselves. May also be specified --count.
-l Print the name of each file that contains a match instead of the lines themselves. May also be specified --files-with-matches.
-L Like the -l option, but print only the names of files that do not contain matches. May also be specified --files-without-match.
-n Prefix each matching line with the number of the line within the file. May also be specified --line-number.
-h For multi-file searches, suppress the output of filenames. May also be specified --no-filename.

In order to more fully explore grep, let's create some text files to search:

[me@linuxbox ~]$ ls /bin > dirlist-bin.txt
[me@linuxbox ~]$ ls /usr/bin > dirlist-usr-bin.txt
[me@linuxbox ~]$ ls /sbin > dirlist-sbin.txt
[me@linuxbox ~]$ ls /usr/sbin > dirlist-usr-sbin.txt
[me@linuxbox ~]$ ls dirlist*.txt
dirlist-bin.txt     dirlist-sbin.txt    dirlist-usr-sbin.txt
dirlist-usr-bin.txt

We can perform a simple search of our list of files like this:

[me@linuxbox ~]$ grep bzip dirlist*.txt
dirlist-bin.txt:bzip2
dirlist-bin.txt:bzip2recover

In this example, grep searches all of the listed files for the string bzip and finds two matches, both in the file dirlist-bin.txt. If we were only interested in the list of files that contained matches rather than the matches themselves, we could specify the -l option:

[me@linuxbox ~]$ grep -l bzip dirlist*.txt
dirlist-bin.txt

Conversely, if we wanted only to see a list of the files that did not contain a match, we could do this:

[me@linuxbox ~]$ grep -L bzip dirlist*.txt
dirlist-sbin.txt
dirlist-usr-bin.txt
dirlist-usr-sbin.txt
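
The remaining options from the table can be tried the same way. The following is only a sketch: it works in a throwaway directory, and the file names list-a.txt and list-b.txt and their contents are made up for illustration, so the real dirlist files are not needed.

```shell
# Work in a temporary directory so nothing on the real system is touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Two small, made-up listings standing in for the dirlist files.
printf '%s\n' bzip2 bunzip2 gzip Zipper > list-a.txt
printf '%s\n' ls cat unzip > list-b.txt

count_a=$(grep -c zip list-a.txt)                  # lines containing "zip"
count_a_nocase=$(grep -ic zip list-a.txt)          # -i also counts "Zipper"
non_matching=$(grep -v zip list-b.txt)             # lines without "zip"
numbered=$(grep -n zip list-b.txt)                 # match prefixed by line number
with_bz=$(grep -l bz list-a.txt list-b.txt)        # files containing a match
without_bz=$(grep -L bz list-a.txt list-b.txt || true)  # files with no match

printf '%s\n' "$count_a" "$count_a_nocase" "$non_matching" \
    "$numbered" "$with_bz" "$without_bz"

# Clean up the temporary directory.
cd / && rm -rf "$tmpdir"
```

Combining options, as in -ic above, works just as it does for other commands.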

Metacharacters And Literals

While it may not seem apparent, our grep searches have been using regular expressions all along, albeit very simple ones. The regular expression “bzip” is taken to mean that a match will occur only if the line in the file contains at least four characters and that somewhere in the line the characters “b”, “z”, “i”, and “p” are found in that order, with no other characters in between. The characters in the string “bzip” are all literal characters, in that they match themselves. In addition to literals, regular expressions may also include metacharacters that are used to specify more complex matches. Regular expression metacharacters consist of the following:

^ $ . [ ] { } - ? * + ( ) | \

All other characters are considered literals, though the backslash character is used in a few cases to create meta sequences, as well as allowing the metacharacters to be escaped and treated as literals instead of being interpreted as metacharacters.


Note: As we can see, many of the regular expression metacharacters are also characters that have meaning to the shell when expansion is performed. When we pass regular expressions containing metacharacters on the command line, it is vital that they be enclosed in quotes to prevent the shell from attempting to expand them.
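
To see why the quotes matter, we can watch what the shell does to an unquoted pattern before any command receives it. This is a small sketch using echo in a temporary directory; the two file names are hypothetical.

```shell
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch zip zipinfo   # two made-up file names for the shell to find

# Unquoted: the shell replaces zip* with the matching file names
# before the command ever sees the pattern.
unquoted=$(echo zip*)

# Quoted: the pattern reaches the command untouched.
quoted=$(echo 'zip*')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

cd / && rm -rf "$tmpdir"
```

Had the command been grep rather than echo, the unquoted form would have searched for “zip” inside a file named zipinfo, which is almost certainly not what was intended.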


The Any Character

The first metacharacter we will look at is the dot or period character, which is used to match any character. If we include it in a regular expression, it will match any character in that character position. Here’s an example:

[me@linuxbox ~]$ grep -h '.zip' dirlist*.txt
bunzip2
bzip2
bzip2recover
gunzip
gzip
funzip
gpg-zip
preunzip
prezip
prezip-bin
unzip
unzipsfx

We searched for any line in our files that matches the regular expression “.zip”. There are a couple of interesting things to note about the results. Notice that the zip program was not found. This is because the inclusion of the dot metacharacter in our regular expression increased the length of the required match to four characters, and because the name “zip” only contains three, it does not match. Also, if there had been any files in our lists that contained the file extension .zip, they would also have been matched, because the period character in the file extension is treated as “any character,” too.
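
Both points can be checked on a short, made-up list of names. The sketch below also uses the backslash escape mentioned earlier to match a literal period; the names are hypothetical.

```shell
# A made-up list of names, one per line.
names=$(printf '%s\n' zip gzip archive.zip azip)

# The dot matches any single character, so "zip" alone is too short.
any_char=$(printf '%s\n' "$names" | grep '.zip')

# Escaping the dot makes it a literal period.
literal_dot=$(printf '%s\n' "$names" | grep '\.zip')

echo "any character before zip:"
echo "$any_char"
echo "literal period before zip:"
echo "$literal_dot"
```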

Anchors

The caret and dollar sign characters are treated as anchors in regular expressions. This means that they cause the match to occur only if the regular expression is found at the beginning of the line or at the end of the line:

[me@linuxbox ~]$ grep -h '^zip' dirlist*.txt
zip
zipcloak
zipgrep
zipinfo
zipnote
zipsplit
[me@linuxbox ~]$ grep -h 'zip$' dirlist*.txt
gunzip
gzip
funzip
gpg-zip
preunzip
prezip
unzip
zip
[me@linuxbox ~]$ grep -h '^zip$' dirlist*.txt
zip

Here we searched the list of files for the string “zip” located at the beginning of the line, the end of the line, and on a line where it is at both the beginning and the end of the line (i.e., by itself on the line.) Note that the regular expression ‘^$’ (a beginning and an end with nothing in between) will match blank lines.
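
The same four patterns, including ‘^$’ for blank lines, can be counted with the -c option. This is a minimal sketch on made-up input rather than on the dirlist files.

```shell
# Made-up input containing two blank lines.
text=$(printf '%s\n' zip zipinfo '' gunzip '' 'zip file')

starts=$(printf '%s\n' "$text" | grep -c '^zip')   # lines beginning with zip
ends=$(printf '%s\n' "$text" | grep -c 'zip$')     # lines ending with zip
exact=$(printf '%s\n' "$text" | grep -c '^zip$')   # lines that are exactly zip
blank=$(printf '%s\n' "$text" | grep -c '^$')      # blank lines

echo "starts=$starts ends=$ends exact=$exact blank=$blank"
```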

A Crossword Puzzle Helper

Even with our limited knowledge of regular expressions at this point, we can do something useful.

My wife loves crossword puzzles and she will sometimes ask me for help with a particular question. Something like, “what’s a five letter word whose third letter is ‘j’ and last letter is ‘r’ that means…?” This kind of question got me thinking.

Did you know that your Linux system contains a dictionary? It does. Take a look in the /usr/share/dict directory and you might find one, or several. The dictionary files located there are just long lists of words, one per line, arranged in alphabetical order. On my system, the words file contains just over 98,500 words. To find possible answers to the crossword puzzle question above, we could do this:

[me@linuxbox ~]$ grep -i '^..j.r$' /usr/share/dict/words
Major
major

Using this regular expression, we can find all the words in our dictionary file that are five letters long and have a “j” in the third position and an “r” in the last position.

Bracket Expressions And Character Classes

In addition to matching any character at a given position in our regular expression, we can also match a single character from a specified set of characters by using bracket expressions. With bracket expressions, we can specify a set of characters (including characters that would otherwise be interpreted as metacharacters) to be matched. In this example, using a two character set:

[me@linuxbox ~]$ grep -h '[bg]zip' dirlist*.txt
bzip2
bzip2recover
gzip

we match any line that contains the string “bzip” or “gzip”.

A set may contain any number of characters, and metacharacters lose their special meaning when placed within brackets. However, there are two cases in which metacharacters are used within bracket expressions, and have different meanings. The first is the caret (^), which is used to indicate negation; the second is the dash (-), which is used to indicate a character range.

Negation

If the first character in a bracket expression is a caret (^), the remaining characters are taken to be a set of characters that must not be present at the given character position. We do this by modifying our previous example:

[me@linuxbox ~]$ grep -h '[^bg]zip' dirlist*.txt
bunzip2
gunzip
funzip
gpg-zip
preunzip
prezip
prezip-bin
unzip
unzipsfx

With negation activated, we get a list of files that contain the string “zip” preceded by any character except “b” or “g”. Notice that the file zip was not found. A negated character set still requires a character at the given position, but the character must not be a member of the negated set.

The caret character only invokes negation if it is the first character within a bracket expression; otherwise, it loses its special meaning and becomes an ordinary character in the set.
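
The position of the caret makes all the difference, as this sketch on a made-up list suggests. The name ^zip is contrived purely so that there is a literal caret to find.

```shell
# Made-up names; "^zip" exists only to provide a literal caret.
names=$(printf '%s\n' bzip2 gzip unzip zip '^zip')

# Caret first: negation. Any character except b or g must precede "zip".
negated=$(printf '%s\n' "$names" | grep '[^bg]zip')

# Caret elsewhere: an ordinary member of the set b, g, ^.
literal_caret=$(printf '%s\n' "$names" | grep '[bg^]zip')

echo "negated set:"
echo "$negated"
echo "caret as a literal:"
echo "$literal_caret"
```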

Traditional Character Ranges

If we wanted to construct a regular expression that would find every file in our lists beginning with an upper case letter, we could do this:

[me@linuxbox ~]$ grep -h '^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' dirlist*.txt

It’s just a matter of putting all twenty-six upper case letters in a bracket expression. But the idea of all that typing is deeply troubling, so there is another way:

[me@linuxbox ~]$ grep -h '^[A-Z]' dirlist*.txt
MAKEDEV
ControlPanel
GET
HEAD
POST
X
X11
Xorg
MAKEFLOPPIES
NetworkManager
NetworkManagerDispatcher

By using a three character range, we can abbreviate the twenty-six letters. Any range of characters can be expressed this way including multiple ranges, such as this expression that matches all filenames starting with letters and numbers:

[me@linuxbox ~]$ grep -h '^[A-Za-z0-9]' dirlist*.txt

In character ranges, we see that the dash character is treated specially, so how do we actually include a dash character in a bracket expression? By making it the first character in the expression. Consider these two examples:

[me@linuxbox ~]$ grep -h '[A-Z]' dirlist*.txt

This will match every filename containing an upper case letter. While:

[me@linuxbox ~]$ grep -h '[-AZ]' dirlist*.txt

will match every filename containing a dash, an uppercase “A”, or an uppercase “Z”.
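
The two expressions can be compared side by side on a made-up list of names. In this sketch, LC_ALL=C is set for each grep so that the range is read in plain ASCII order; that setting is an assumption added here to keep the result predictable.

```shell
# Made-up names, one per line.
names=$(printf '%s\n' MAKEDEV gpg-zip Xorg ls AZ 7z)

# A range: any uppercase letter from A through Z.
range=$(printf '%s\n' "$names" | LC_ALL=C grep '[A-Z]')

# Dash first: a set of three literal characters, dash, A, and Z.
dash_set=$(printf '%s\n' "$names" | LC_ALL=C grep '[-AZ]')

echo "range [A-Z]:"
echo "$range"
echo "set [-AZ]:"
echo "$dash_set"
```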

POSIX Character Classes

The traditional character ranges are an easily understood and effective way to handle the problem of quickly specifying sets of characters. Unfortunately, they don’t always work. While we have not encountered any problems with our use of grep so far, we might run into problems using other programs.

Back in Chapter 5, we looked at how wildcards are used to perform pathname expansion. In that discussion, we said that character ranges could be used in a manner almost identical to the way they are used in regular expressions, but here’s the problem:

[me@linuxbox ~]$ ls /usr/sbin/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*
/usr/sbin/MAKEFLOPPIES
/usr/sbin/NetworkManagerDispatcher
/usr/sbin/NetworkManager

(Depending on the Linux distribution, we will get a different list of files, possibly an empty list. This example is from Ubuntu.) This command produces the expected result: a list of only the files whose names begin with an uppercase letter, but:

[me@linuxbox ~]$ ls /usr/sbin/[A-Z]*
/usr/sbin/biosdecode
/usr/sbin/chat
/usr/sbin/chgpasswd
/usr/sbin/chpasswd
/usr/sbin/chroot
/usr/sbin/cleanup-info
/usr/sbin/complain
/usr/sbin/console-kit-daemon

with this command we get an entirely different result (only a partial listing of the results is shown). Why is that? It’s a long story, but here’s the short version:

Back when Unix was first developed, it only knew about ASCII characters, and this feature reflects that fact. In ASCII, the first thirty-two characters (numbers 0-31) are control codes (things like tabs, backspaces, and carriage returns). The next thirty-two (32-63) contain printable characters, including most punctuation characters and the numerals zero through nine. The next thirty-two (numbers 64-95) contain the uppercase letters and a few more punctuation symbols. The final thirty-two (numbers 96-127) contain the lowercase letters, yet more punctuation symbols, and one last control code (127, delete). Based on this arrangement, systems using ASCII used a collation order that looked like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

This differs from proper dictionary order, which is like this:

aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

As the popularity of Unix spread beyond the United States, there grew a need to support characters not found in U.S. English. The ASCII table was expanded to use a full eight bits, adding character numbers 128-255, which accommodated many more languages.

To support this ability, the POSIX standards introduced a concept called a locale, which could be adjusted to select the character set needed for a particular location. We can see the language setting of our system using this command:

[me@linuxbox ~]$ echo $LANG
en_US.UTF-8

With this setting, POSIX compliant applications will use a dictionary collation order rather than ASCII order. This explains the behavior of the commands above. A character range of [A-Z] when interpreted in dictionary order includes all of the alphabetic characters except the lowercase “a”, hence our results.
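
The effect of collation order can be observed with sort. The sketch below forces the traditional ordering by setting LC_ALL=C for a single command; the last line prints whatever the current locale produces, which will differ from system to system, so no particular result is claimed for it.

```shell
# Four letters in no particular order.
letters=$(printf '%s\n' b A a B)

# ASCII (C locale) order: all uppercase letters sort before lowercase.
ascii_order=$(printf '%s\n' "$letters" | LC_ALL=C sort | tr -d '\n')

# In the C locale, the range A-Z covers only the uppercase letters.
ascii_range=$(printf '%s\n' "$letters" | LC_ALL=C grep '[A-Z]' | tr -d '\n')

echo "C locale sort order:     $ascii_order"
echo "C locale [A-Z] matches:  $ascii_range"
echo "current locale ordering: $(printf '%s\n' "$letters" | sort | tr -d '\n')"
```

Prefixing one command with LC_ALL=C changes the locale for that command only, which is a less drastic step than changing the setting for the whole session.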

To partially work around this problem, the POSIX standard includes a number of character classes which provide useful ranges of characters. They are described in the table below:

Table 20-2: POSIX Character Classes
Character Class Description
[:alnum:] The alphanumeric characters. In ASCII, equivalent to: [A-Za-z0-9]
[:word:] The same as [:alnum:], with the addition of the underscore (_) character.
[:alpha:] The alphabetic characters. In ASCII, equivalent to: [A-Za-z]
[:blank:] Includes the space and tab characters.
[:cntrl:] The ASCII control codes. Includes the ASCII characters zero through thirty-one and 127.
[:digit:] The numerals zero through nine.
[:graph:] The visible characters. In ASCII, it includes characters 33 through 126.
[:lower:] The lowercase letters.
[:punct:] The punctuation characters. In ASCII, equivalent to:[-!"#$%&'()*+,./:;<=>?@[\\\]_`{|}~]
[:print:] The printable characters. All the characters in [:graph:] plus the space character.
[:space:] The whitespace characters including space, tab, carriage return, newline, vertical tab, and form feed. In ASCII, equivalent to: [ \t\r\n\v\f]
[:upper:] The upper case characters.
[:xdigit:] Characters used to express hexadecimal numbers. In ASCII, equivalent to: [0-9A-Fa-f]

Even with the character classes, there is still no convenient way to express partial ranges, such as [A-M].

Using character classes, we can repeat our directory listing and see an improved result:

[me@linuxbox ~]$ ls /usr/sbin/[[:upper:]]*
/usr/sbin/MAKEFLOPPIES
/usr/sbin/NetworkManagerDispatcher
/usr/sbin/NetworkManager

Remember, however, that this is not an example of a regular expression, rather it is the shell performing pathname expansion. We show it here because POSIX character classes can be used for both.
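
For the regular expression side, the same classes go inside a bracket expression passed to grep, which is why the brackets appear doubled. A brief sketch on made-up names:

```shell
# Made-up names, one per line.
names=$(printf '%s\n' MAKEDEV Xorg chroot 7z _hidden)

# Names beginning with an uppercase letter.
upper_start=$(printf '%s\n' "$names" | grep '^[[:upper:]]')

# Names beginning with a letter or a digit (the underscore is neither).
alnum_start=$(printf '%s\n' "$names" | grep '^[[:alnum:]]')

# Names containing a digit anywhere.
has_digit=$(printf '%s\n' "$names" | grep '[[:digit:]]')

echo "$upper_start"
echo "$alnum_start"
echo "$has_digit"
```

The outer pair of brackets forms the bracket expression, and the inner [:upper:] names the class, so several classes and literal characters can share one set.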

Reverting To Traditional Collation Order

You can opt to have your system use the traditional (ASCII) collation order by changing the value of the LANG environment variable. As we saw above, the LANG variable contains the name of the language and character set used in your locale. This value was originally determined when you selected an installation language while your Linux system was being installed.

To see the locale settings, use the locale command:

[me@linuxbox ~]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

To change the locale to use the traditional Unix behaviors, set the LANG variable to POSIX:

[me@linuxbox ~]$ export LANG=POSIX

Note that this change converts the system to use U.S. English (more specifically, ASCII) for its character set, so be sure if this is really what you want.

You can make this change permanent by adding this line to your .bashrc file:

export LANG=POSIX

POSIX Basic Vs. Extended Regular Expressions

Just when we thought this couldn’t get any more confusing, we discover that POSIX also splits regular expression implementations into two kinds: basic regular expressions (BRE) and extended regular expressions (ERE). The features we have covered so far are supported by any application that is POSIX-compliant and implements BRE. Our grep program is one such program.

What’s the difference between BRE and ERE? It’s a matter of metacharacters. With BRE, the following metacharacters are recognized:

^ $ . [ ] *

All other characters are considered literals. With ERE, the following metacharacters (and their associated functions) are added:

( ) { } ? + |

However (and this is the fun part), the “(”, “)”, “{”, and “}” characters are treated as metacharacters in BRE if they are escaped with a backslash, whereas with ERE, preceding any metacharacter with a backslash causes it to be treated as a literal. Any weirdness that comes along will be covered in the discussions that follow.

Since the features we are going to discuss next are part of ERE, we are going to need to use a different grep. Traditionally, this has been performed by the egrep program, but the GNU version of grep also supports extended regular expressions when the -E option is used.
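
The backslash reversal between BRE and ERE is easier to believe once seen. In this sketch the test strings abab and (ab){2} are made up, and the behavior shown is what we would expect from a grep that follows the description above, such as the GNU version.

```shell
# BRE: escaped parentheses and braces act as metacharacters.
bre_meta=$(echo 'abab' | grep '\(ab\)\{2\}')

# ERE: the same metacharacters are written without backslashes.
ere_meta=$(echo 'abab' | grep -E '(ab){2}')

# BRE: unescaped, the four characters are ordinary literals.
bre_literal=$(echo '(ab){2}' | grep '(ab){2}')

# ERE: escaped, the four characters are literals.
ere_literal=$(echo '(ab){2}' | grep -E '\(ab\)\{2\}')

printf '%s\n' "$bre_meta" "$ere_meta" "$bre_literal" "$ere_literal"
```

Each pair of commands uses the same characters with the opposite meaning, which is the whole of the difference the paragraph above describes.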

POSIX

During the 1980’s, Unix became a very popular commercial operating system, but by 1988, the Unix world was in turmoil. Many computer manufacturers had licensed the Unix source code from its creators, AT&T, and were supplying various versions of the operating system with their systems. However, in their efforts to create product differentiation, each manufacturer added proprietary changes and extensions. This started to limit the compatibility of the software.

As always with proprietary vendors, each was trying to play a winning game of “lock-in” with their customers. This dark time in the history of Unix is known today as “the Balkanization.”

Enter the IEEE (Institute of Electrical and Electronics Engineers). In the mid-1980s, the IEEE began developing a set of standards that would define how Unix (and Unix-like) systems would perform. These standards, formally known as IEEE 1003, define the application programming interfaces (APIs), shell and utilities that are to be found on a standard Unix-like system. The name “POSIX,” which stands for Portable Operating System Interface (with the “X” added to the end for extra snappiness), was suggested by Richard Stallman (yes, that Richard Stallman), and was adopted by the IEEE.

Alternation

The first of the extended regular expression features we will discuss is called alternation, which is the facility that allows a match to occur from among a set of expressions. Just as a bracket expression allows a single character to match from a set of specified characters, alternation allows matches from a set of strings or other regular expressions. To demonstrate, we’ll use grep in conjunction with echo. First, let's try a plain old string match:

[me@linuxbox ~]$ echo "AAA" | grep AAA
AAA
[me@linuxbox ~]$ echo "BBB" | grep AAA
[me@linuxbox ~]$

A pretty straightforward example, in which we pipe the output of echo into grep and see the results. When a match occurs, we see it printed out; when no match occurs, we see no results.

Now we’ll add alternation, signified by the vertical bar metacharacter:

[me@linuxbox ~]$ echo "AAA" | grep -E 'AAA|BBB'
AAA
[me@linuxbox ~]$ echo "BBB" | grep -E 'AAA|BBB'
BBB
[me@linuxbox ~]$ echo "CCC" | grep -E 'AAA|BBB'
[me@linuxbox ~]$

Here we see the regular expression ‘AAA|BBB’ which means “match either the string AAA or the string BBB.” Notice that since this is an extended feature, we added the -E option to grep (though we could have just used the egrep program instead), and we enclosed the regular expression in quotes to prevent the shell from interpreting the vertical bar metacharacter as a pipe operator. Alternation is not limited to two choices:

[me@linuxbox ~]$ echo "AAA" | grep -E 'AAA|BBB|CCC'
AAA

To combine alternation with other regular expression elements, we can use () to separate the alternation:

[me@linuxbox ~]$ grep -Eh '^(bz|gz|zip)' dirlist*.txt

This expression will match the filenames in our lists that start with either “bz”, “gz”, or “zip”. Had we left off the parentheses, the meaning of this regular expression:

[me@linuxbox ~]$ grep -Eh '^bz|gz|zip' dirlist*.txt

changes to match any filename that begins with “bz” or contains “gz” or contains “zip”.
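
Running both forms against the same short, made-up list shows how far the anchor reaches in each case. This is a sketch only; results on the real dirlist files will depend on what is installed.

```shell
# Made-up names, one per line.
names=$(printf '%s\n' bzip2 gzip zipinfo gunzip unzip)

# Grouped: the anchor applies to all three alternatives.
grouped=$(printf '%s\n' "$names" | grep -E '^(bz|gz|zip)')

# Ungrouped: the anchor applies to "bz" only; "gz" and "zip"
# may appear anywhere in the line.
ungrouped=$(printf '%s\n' "$names" | grep -E '^bz|gz|zip')

echo "grouped:"
echo "$grouped"
echo "ungrouped:"
echo "$ungrouped"
```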

Quantifiers

Extended regular expressions support several ways to specify the number of times an element is matched.

? - Match An Element Zero Or One Time

This quantifier means, in effect, “make the preceding element optional.” Let's say we wanted to check a phone number for validity and we considered a phone number to be valid if it matched either of these two forms:

(nnn) nnn-nnnn

nnn nnn-nnnn

where “n” is a numeral. We could construct a regular expression like this:

^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$

In this expression, we follow the parentheses characters with question marks to indicate that they are to be matched zero or one time. Again, since the parentheses are normally metacharacters (in ERE), we precede them with backslashes to cause them to be treated as literals instead.

Let's try it:

[me@linuxbox ~]$ echo "(555) 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
(555) 123-4567
[me@linuxbox ~]$ echo "555 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
555 123-4567
[me@linuxbox ~]$ echo "AAA 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
[me@linuxbox ~]$

Here we see that the expression matches both forms of the phone number, but does not match one containing non-numeric characters.
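
Wrapped in a small shell function, the same expression becomes a reusable check. The function name is_valid_phone is made up for this sketch, and grep's -q option is used so that only the exit status is reported.

```shell
# Succeeds (exit status 0) if the argument matches either phone format.
# The function name is hypothetical.
is_valid_phone() {
    printf '%s\n' "$1" |
        grep -Eq '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
}

for number in '(555) 123-4567' '555 123-4567' 'AAA 123-4567' '(555 123-4567'; do
    if is_valid_phone "$number"; then
        echo "valid:   $number"
    else
        echo "invalid: $number"
    fi
done
```

The last test value is accepted even though its parentheses are unbalanced, because each parenthesis is made optional independently of the other. That is a limitation of the expression as written.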

* - Match An Element Zero Or More Times

Like the ? metacharacter, the * is used to denote an optional item; however, unlike the ?, the item may occur any number of times, not just once. Let's say we wanted to see if a string was a sentence; that is, it starts with an uppercase letter, then contains any number of upper and lowercase letters and spaces, and ends with a period. To match this (very crude) definition of a sentence, we could use a regular expression like this:

[[:upper:]][[:upper:][:lower:] ]*\.

The expression consists of three items: a bracket expression containing the [:upper:] character class, a bracket expression containing both the [:upper:] and [:lower:] character classes and a space, and a period escaped with a backslash. The second element is trailed with an * metacharacter, so that after the leading uppercase letter in our sentence, any number of upper and lowercase letters and spaces may follow it and still match:

這個表示式由三個元素組成:一個包含[:upper:]字符集的中括號表示式,一個包含[:upper:]和[:lower:] 兩個字符集以及一個空格的中括號表示式,和一個被反斜槓字元轉義過的圓點。第二個元素末尾帶有一個 *元字元,所以在開頭的大寫字母之後,可能會跟隨著任意數目的大寫和小寫字母和空格,並且匹配:

[me@linuxbox ~]$ echo "This works." | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'
This works.
[me@linuxbox ~]$ echo "This Works." | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'
This Works.
[me@linuxbox ~]$ echo "this does not" | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'
[me@linuxbox ~]$

The expression matches the first two tests, but not the third, since it lacks the required leading uppercase character and trailing period.

這個表示式匹配前兩個測試語句,但不匹配第三個,因為第三個句子缺少開頭的大寫字母和末尾的句號。
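
Two properties of this expression are easy to overlook: the * allows the middle element to occur zero times, and grep reports a match anywhere in the line because the expression is not anchored. The following is a minimal sketch of both points; the helper name check_sentence and the sample strings are made up for illustration.

```shell
# Same (very crude) sentence pattern as in the text above.
pattern='[[:upper:]][[:upper:][:lower:] ]*\.'

check_sentence() {
    # Prints "match" when the argument contains a match for the pattern.
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
        echo "match"
    else
        echo "no match"
    fi
}

for line in "A." "This works." "lowercase start, then Ok." "No period"; do
    printf '%-10s %s\n' "$(check_sentence "$line")" "$line"
done
```

The first three strings match: "A." has zero characters between the uppercase letter and the period, and the third string matches only because of the trailing "Ok.". Adding ^ and $ anchors would make the test apply to the whole line.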

+ - Match An Element One Or More Times

The + metacharacter works much like the *, except it requires at least one instance of the preceding element to cause a match. Here is a regular expression that will only match lines consisting of groups of one or more alphabetic characters separated by single spaces:

+ 元字元的作用與 * 非常相似,除了它要求前面的元素至少出現一次匹配。這個正則表示式只匹配 那些由一個或多個字母字元組構成的文字行,字母字元之間由單個空格分開:

^([[:alpha:]]+ ?)+$
[me@linuxbox ~]$ echo "This that" | grep -E '^([[:alpha:]]+ ?)+$'
This that
[me@linuxbox ~]$ echo "a b c" | grep -E '^([[:alpha:]]+ ?)+$'
a b c
[me@linuxbox ~]$ echo "a b 9" | grep -E '^([[:alpha:]]+ ?)+$'
[me@linuxbox ~]$ echo "abc  d" | grep -E '^([[:alpha:]]+ ?)+$'
[me@linuxbox ~]$

We see that this expression does not match the line “a b 9” because it contains a non-alphabetic character; nor does it match “abc  d” because more than one space character separates the characters “c” and “d”.

我們看到這個正則表示式不匹配“a b 9”這一行,因為它包含了一個非字母的字元;它也不匹配 “abc  d” ,因為在字元“c”和“d”之間不止一個空格。
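
To see the difference between + and * in isolation, we can count matching lines with grep -c. This is only a sketch; the helper names letters and count_lines and the sample input are assumptions chosen for the demonstration.

```shell
letters() { printf '%s\n' "abc" "" "a1"; }   # three sample lines; the second is empty

count_lines() {
    # $1 = extended regular expression; prints how many lines on stdin match it.
    grep -Ec "$1"
}

letters | count_lines '^[[:alpha:]]+$'    # + needs at least one letter
letters | count_lines '^[[:alpha:]]*$'    # * also accepts the empty line

# The pattern from the text tolerates one trailing space,
# because the " ?" at the end of the group is optional:
echo "a b " | count_lines '^([[:alpha:]]+ ?)+$'
```

The first count is 1 (only "abc"), the second is 2 ("abc" and the empty line), and the third is 1.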

{ } - Match An Element A Specific Number Of Times

The { and } metacharacters are used to express minimum and maximum numbers of required matches. They may be specified in four possible ways:

{ 和 } 元字元都被用來表達要求匹配的最小和最大數目。它們可以透過四種方法來指定:

Table 20-3: Specifying The Number Of Matches

Specifier   Meaning
{n}         Match the preceding element if it occurs exactly n times.
{n,m}       Match the preceding element if it occurs at least n times, but no more than m times.
{n,}        Match the preceding element if it occurs n or more times.
{,m}        Match the preceding element if it occurs no more than m times.

表20-3: 指定匹配的數目

限定符      意思
{n}         匹配前面的元素,如果它確切地出現了 n 次。
{n,m}       匹配前面的元素,如果它至少出現了 n 次,但是不多於 m 次。
{n,}        匹配前面的元素,如果它出現了 n 次或多於 n 次。
{,m}        匹配前面的元素,如果它出現的次數不多於 m 次。

Going back to our earlier example with the phone numbers, we can use this method of specifying repetitions to simplify our original regular expression from:

回到之前處理電話號碼的例子,我們能夠使用這種指定重複次數的方法來簡化我們最初的正則表示式:

^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$

to:

簡化為:

^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$

Let's try it:

讓我們試一下:

[me@linuxbox ~]$ echo "(555) 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
(555) 123-4567
[me@linuxbox ~]$ echo "555 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
555 123-4567
[me@linuxbox ~]$ echo "5555 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
[me@linuxbox ~]$

As we can see, our revised expression can successfully validate numbers both with and without the parentheses, while rejecting those numbers that are not properly formatted.

我們可以看到,我們修訂的表示式能成功地驗證帶有和不帶有圓括號的數字,而拒絕那些格式 不正確的數字。
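
The specifier forms can also be compared side by side on runs of digits of different lengths. The sketch below uses a hypothetical helper, count_digit_runs, and writes the "no more than m" case as {0,2}, on the assumption that some tools do not accept the {,m} shorthand (GNU grep documents it as an extension).

```shell
count_digit_runs() {
    # $1 = interval specifier such as '{3}'. Counts the sample lines that
    # consist entirely of a run of digits whose length satisfies it.
    printf '%s\n' 1 12 123 1234 12345 | grep -Ec "^[0-9]$1\$"
}

count_digit_runs '{3}'     # exactly three digits: 123
count_digit_runs '{2,4}'   # two to four digits: 12, 123, 1234
count_digit_runs '{4,}'    # four or more digits: 1234, 12345
count_digit_runs '{0,2}'   # at most two digits: 1, 12
```

The anchors matter here: without ^ and $, a specifier such as {3} would also succeed on longer lines, since any three consecutive digits inside them would satisfy it.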

Putting Regular Expressions To Work

Let's look at some of the commands we already know and see how they can be used with regular expressions.

讓我們看看一些我們已經知道的命令,然後看一下它們怎樣使用正則表示式。

Validating A Phone List With grep

In our earlier example, we looked at single phone numbers and checked them for proper formatting. A more realistic scenario would be checking a list of numbers instead, so let's make a list. We’ll do this by reciting a magical incantation to the command line. It will be magic because we have not covered most of the commands involved, but worry not. We will get there in future chapters. Here is the incantation:

在我們先前的例子中,我們檢視過單個電話號碼,並且檢查了它們的格式。一個更現實的 情形是檢查一個數字列表,所以我們先建立一個列表。我們將念一個神奇的咒語到命令列中。 它會很神奇,因為我們還沒有涵蓋所涉及的大部分命令,但是不要擔心。我們將在後面的章節裡面 討論那些命令。這就是那個咒語:

[me@linuxbox ~]$ for i in {1..10}; do echo "(${RANDOM:0:3}) ${RANDOM:0:3}-${RANDOM:0:4}" >> phonelist.txt; done

This command will produce a file named phonelist.txt containing ten phone numbers. Each time the command is repeated, another ten numbers are added to the list. We can also change the value 10 near the beginning of the command to produce more or fewer phone numbers. If we examine the contents of the file, however, we see we have a problem:

這個命令會建立一個包含10個電話號碼的名為 phonelist.txt 的檔案。每次重複這個命令的時候,另外10個號碼會被新增到這個列表中。我們也能夠更改命令開頭附近的數值10,來產生或多或少的電話號碼。如果我們檢視這個檔案的內容,然而我們會發現一個問題:

[me@linuxbox ~]$ cat phonelist.txt
(232) 298-2265
(624) 381-1078
(540) 126-1980
(874) 163-2885
(286) 254-2860
(292) 108-518
(129) 44-1379
(458) 273-1642
(686) 299-8268
(198) 307-2440

Some of the numbers are malformed, which is perfect for our purposes, since we will use grep to validate them.

一些號碼是殘缺不全的,這正是我們想要的,因為我們將使用 grep 命令來驗證電話號碼的正確性。

One useful method of validation would be to scan the file for invalid numbers and display the resulting list on the screen:

一個有用的驗證方法是掃描這個檔案,查詢無效的號碼,並把搜尋結果顯示到螢幕上:

[me@linuxbox ~]$ grep -Ev '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$' phonelist.txt
(292) 108-518
(129) 44-1379
[me@linuxbox ~]$

Here we use the -v option to produce an inverse match so that we will only output the lines in the list that do not match the specified expression. The expression itself includes the anchor metacharacters at each end to ensure that the number has no extra characters at either end. This expression also requires that the parentheses be present in a valid number, unlike our earlier phone number example.

這裡我們使用-v 選項來產生相反的匹配,因此我們將只輸出不匹配指定表示式的文字行。這個 表示式自身的兩端都包含定位點(錨)元字元,是為了確保這個號碼的兩端沒有多餘的字元。 這個表示式也要求圓括號出現在一個有效的號碼中,不同於我們先前電話號碼的範例。
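
When a list is long, a count of good and bad entries, plus the line numbers of the bad ones, can be more useful than the raw lines. The sketch below reuses the validation expression with grep's -c and -n options; the helper names and the four sample entries (taken from the listing above) are only for illustration.

```shell
valid='^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$'

count_valid() {
    # Reads a phone list on stdin and prints the number of well-formed entries.
    grep -Ec "$valid"
}

report_invalid() {
    # Reads a phone list on stdin and prints each malformed entry,
    # prefixed with its line number (-n) in the list.
    grep -Evn "$valid"
}

sample_list() {
    printf '%s\n' "(232) 298-2265" "(292) 108-518" "(129) 44-1379" "(458) 273-1642"
}

sample_list | count_valid
sample_list | report_invalid
```

With the real file, the same helpers could be fed with input redirection, for example count_valid < phonelist.txt.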

Finding Ugly Filenames With find

The find command supports a test based on a regular expression. There is an important consideration to keep in mind when using regular expressions in find versus grep. Whereas grep will print a line when the line contains a string that matches an expression, find requires that the pathname exactly match the regular expression. In the following example, we will use find with a regular expression to find every pathname that contains any character that is not a member of the following set:

這個 find 命令支援一個基於正則表示式的測試。當在使用正則表示式方面比較 find 和 grep 命令的時候, 還有一個重要問題要牢記在心。當某一行包含的字串匹配上了一個表示式的時候,grep 命令會打印出這一行, 然而 find 命令要求路徑名精確地匹配這個正則表示式。在下面的例子裡面,我們將使用帶有一個正則 表示式的 find 命令,來查詢每個路徑名,其包含的任意字元都不是以下字符集中的一員。

[-_./0-9a-zA-Z]

Such a scan would reveal pathnames that contain embedded spaces and other potentially offensive characters:

這樣一種掃描會發現包含空格和其它潛在不規範字元的路徑名:

[me@linuxbox ~]$ find . -regex '.*[^-_./0-9a-zA-Z].*'

Due to the requirement for an exact match of the entire pathname, we use .* at both ends of the expression to match zero or more instances of any character. In the middle of the expression, we use a negated bracket expression containing our set of acceptable pathname characters.

由於要精確地匹配整個路徑名,所以我們在表示式的兩端使用了.*,來匹配零個或多個字元。 在表示式中間,我們使用了否定的中括號表示式,其包含了我們一系列可接受的路徑名字元。
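
The whole-pathname requirement is easy to verify in a scratch directory. The following sketch assumes a find that supports the -regex test (as GNU find does); the function name list_ugly and the file names are hypothetical.

```shell
list_ugly() {
    # $1 = directory to scan. The cd happens in a subshell so that the
    # reported pathnames start with "./" and the caller's directory is kept.
    ( cd "$1" && find . -regex '.*[^-_./0-9a-zA-Z].*' | sort )
}

demo=$(mktemp -d)
touch "$demo/good_name.txt" "$demo/bad name.txt" "$demo/also+bad.txt"

list_ugly "$demo"    # lists the two names containing a space or a plus sign

# Without the surrounding .* the expression would have to match an entire
# pathname that is a single unacceptable character, so nothing is printed:
( cd "$demo" && find . -regex '[^-_./0-9a-zA-Z]' )

rm -rf "$demo"
```

The second find command shows why the .* on both ends is needed: find compares the expression with the complete pathname rather than searching inside it.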

Searching For Files With locate

The locate program supports both basic (the --regexp option) and extended (the --regex option) regular expressions. With it, we can perform many of the same operations that we performed earlier with our dirlist files:

這個 locate 程式支援基本的(--regexp 選項)和擴充套件的(--regex 選項)正則表示式。透過 locate 命令,我們能夠執行許多與先前操作 dirlist 檔案時相同的操作:

[me@linuxbox ~]$ locate --regex 'bin/(bz|gz|zip)'
/bin/bzcat
/bin/bzcmp
/bin/bzdiff
/bin/bzegrep
/bin/bzexe
/bin/bzfgrep
/bin/bzgrep
/bin/bzip2
/bin/bzip2recover
/bin/bzless
/bin/bzmore
/bin/gzexe
/bin/gzip
/usr/bin/zip
/usr/bin/zipcloak
/usr/bin/zipgrep
/usr/bin/zipinfo
/usr/bin/zipnote
/usr/bin/zipsplit

Using alternation, we perform a search for pathnames that contain either bin/bz, bin/gz, or bin/zip.

透過使用 alternation,我們搜尋包含 bin/bz,bin/gz,或 bin/zip 字串的路徑名。
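
The parentheses in this expression limit the alternation to the part that follows bin/. Because the contents of the locate database differ from system to system, the sketch below demonstrates the effect with grep -E on a fixed, made-up list of pathnames instead of locate itself.

```shell
paths() {
    # A small, hypothetical list of pathnames standing in for locate's database.
    printf '%s\n' /bin/bzip2 /bin/gzip /usr/bin/zip \
        /usr/share/doc/gzip/README /usr/bin/unzip
}

paths | grep -Ec 'bin/(bz|gz|zip)'   # alternation applies only after "bin/"
paths | grep -Ec 'bin/bz|gz|zip'     # alternation spans the whole expression
```

The first command counts three pathnames. The second counts all five, because without the parentheses the expression means "bin/bz, or gz, or zip" anywhere in the pathname.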

Searching For Text With less And vim

less and vim both share the same method of searching for text. Pressing the / key followed by a regular expression will perform a search. If we use less to view our phonelist.txt file:

less 和 vim 兩者享有相同的文字查詢方法。按下/按鍵,然後輸入正則表示式,來執行搜尋任務。 如果我們使用 less 程式來瀏覽我們的 phonelist.txt 檔案:

[me@linuxbox ~]$ less phonelist.txt

Then search for our validation expression:

然後查詢我們有效的表示式:

(232) 298-2265
(624) 381-1078
(540) 126-1980
(874) 163-2885
(286) 254-2860
(292) 108-518
(129) 44-1379
(458) 273-1642
(686) 299-8268
(198) 307-2440
~
~
~
/^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$

less will highlight the strings that match, leaving the invalid ones easy to spot:

less 將會高亮匹配到的字串,這樣就很容易看到無效的電話號碼:

(232) 298-2265
(624) 381-1078
(540) 126-1980
(874) 163-2885
(286) 254-2860
(292) 108-518
(129) 44-1379
(458) 273-1642
(686) 299-8268
(198) 307-2440
~
~
~
(END)

vim, on the other hand, supports basic regular expressions, so our search expression would look like this:

另一方面,vim 支援基本的正則表示式,所以我們用於搜尋的表示式看起來像這樣:

/([0-9]\{3\}) [0-9]\{3\}-[0-9]\{4\}

We can see that the expression is mostly the same; however, many of the characters that are considered metacharacters in extended expressions are considered literals in basic expressions. They are only treated as metacharacters when escaped with a backslash.

我們看到表示式幾乎一樣;然而,在擴充套件表示式中,許多被認為是元字元的字元在基本的表示式 中被看作是文字字元。只有用反斜槓把它們轉義之後,它們才被看作是元字元。
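
The same contrast can be reproduced on the command line, since grep uses basic regular expressions by default and extended ones with -E. This is a sketch that uses grep as a stand-in; vim's own pattern syntax has additional features that are not covered here. The variable and helper names are assumptions.

```shell
# Basic syntax: ( and ) are literals, braces need backslashes to act as metacharacters.
bre='^([0-9]\{3\}) [0-9]\{3\}-[0-9]\{4\}$'
# Extended syntax: braces are metacharacters, literal parentheses need backslashes.
ere='^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$'

numbers() { printf '%s\n' "(232) 298-2265" "(292) 108-518"; }

numbers | grep    "$bre"
numbers | grep -E "$ere"
```

Both commands print only the well-formed number, which shows that the two expressions describe the same pattern in different notations.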

Depending on the particular configuration of vim on our system, the matching will be highlighted. If not, try this command mode command:

依賴於系統中 vim 的特殊配置,匹配項將會被高亮。如若不是,試試這個命令模式:

:set hlsearch

to activate search highlighting.

來啟用搜尋高亮功能。


Note: Depending on your distribution, vim may or may not support text search highlighting. Ubuntu, in particular, supplies a very stripped-down version of vim by default. On such systems, you may want to use your package manager to install a more complete version of vim.

注意:依賴於你的發行版,vim 有可能支援或不支援文字搜尋高亮功能。尤其是 Ubuntu 自帶了 一款非常簡化的 vim 版本。在這樣的系統中,你可能要使用你的軟體套件管理器來安裝一個功能 更完備的 vim 版本。


Summing Up

In this chapter, we’ve seen a few of the many uses of regular expressions. We can find even more if we use regular expressions to search for additional applications that use them. We can do that by searching the man pages:

在這章中,我們已經看到幾個使用正則表示式例子。如果我們使用正則表示式來搜尋那些使用正則表示式的應用程式, 我們可以找到更多的使用範例。透過查詢手冊頁,我們就能找到:

[me@linuxbox ~]$ cd /usr/share/man/man1
[me@linuxbox man1]$ zgrep -El 'regex|regular expression' *.gz

The zgrep program provides a front end for grep, allowing it to read compressed files. In our example, we search the compressed section one man page files located in their usual location. The result of this command is a list of files containing either the string “regex” or “regular expression”. As we can see, regular expressions show up in a lot of programs.

這個 zgrep 程式是 grep 的前端,允許 grep 來讀取壓縮檔案。在我們的例子中,我們在手冊檔案所在的 目錄中,搜尋壓縮檔案中的內容。這個命令的結果是一個包含字串“regex”或者“regular expression”的檔案列表。正如我們所看到的,正則表示式會出現在大量程式中。

There is one feature found in basic regular expressions that we did not cover. Called back references, this feature will be discussed in the next chapter.

基本正則表示式中有一個特性,我們沒有涵蓋。叫做反引用,這個特性在下一章中會被討論到。

Further Reading

There are many online resources for learning regular expressions, including various tutorials and cheat sheets.

有許多線上學習正則表示式的資源,包括各種各樣的教材和速記表。

In addition, the Wikipedia has good articles on the following background topics:

另外,關於下面的背景話題,Wikipedia 有不錯的文章。

