Python Regular Expressions

from nltk.book import *
import nltk
import re
# Keep only the lowercase entries of the English word list.
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

Introductory Examples for the NLTK Book
Loading text1, …, text9 and sent1, …, sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

1 Start ^ and End $

^ matches the start of a string, and $ matches the end of a string.

For example, r'ed$' matches any string that ends in ed.

regexp = r'ed$'
[w for w in wordlist if re.search(regexp, w)][0:10]

[u'abaissed', u'abandoned', u'abased', u'abashed', u'abatised', u'abed', u'aborted', u'abridged', u'abscessed', u'absconded']

Or match the words that begin with aa:

regexp = r'^aa'
[w for w in wordlist if re.search(regexp, w)]

[u'aa', u'aal', u'aalii', u'aam', u'aardvark', u'aardwolf']
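The two anchors can also be tried without downloading any corpus. The following is a minimal sketch, assuming a small hand-made list (`sample_words` is hypothetical and merely stands in for the NLTK word list):

```python
import re

# Hypothetical sample list standing in for the NLTK word list.
sample_words = ["abandoned", "aardvark", "bed", "editor", "baa", "aal"]

# '$' anchors the pattern at the end of the string.
ends_with_ed = [w for w in sample_words if re.search(r"ed$", w)]

# '^' anchors the pattern at the start of the string.
starts_with_aa = [w for w in sample_words if re.search(r"^aa", w)]

print(ends_with_ed)    # ['abandoned', 'bed']
print(starts_with_aa)  # ['aardvark', 'aal']
```

Note that "editor" contains ed and "baa" contains aa, but neither matches, because the anchors pin the pattern to one end of the string.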

2 Any Single Character "." and Optional "?"

"." matches any single character.

"?" makes the preceding character optional. For example, r'E-?mail' matches both Email and E-mail.

regexp = r'^b.t$'
[w for w in wordlist if re.search(regexp, w)]

[u'bat', u'bet', u'bit', u'bot', u'but']

regexp = r'e-?mail'
# Count the tokens in the chat corpus that contain "email" or "e-mail".
sum(1 for w in nltk.book.text5 if re.search(regexp, w))

3
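A corpus-free sketch of both operators follows. The `tokens` list is hypothetical, and unlike the cell above the e-mail pattern is anchored with ^ and $ here so that only whole tokens count:

```python
import re

# Hypothetical tokens; the cell above counts matches in the NLTK chat corpus instead.
tokens = ["email", "e-mail", "e--mail", "bat", "boat", "bt", "b-t"]

# '.' matches exactly one character of any kind (except a newline).
three_chars = [w for w in tokens if re.search(r"^b.t$", w)]

# '?' makes the preceding '-' optional: zero or one hyphen.
mail_count = sum(1 for w in tokens if re.search(r"^e-?mail$", w))

print(three_chars)  # ['bat', 'b-t']
print(mail_count)   # 2
```

"bt" fails because "." must consume one character, and "e--mail" fails because "?" allows at most one hyphen.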

3 Ranges and Closures

Ranges

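r'^[abc]' matches strings that begin with a, b, or c. The order of the letters inside the brackets does not matter.

r'^[a-d]' matches strings that begin with any letter from a to d.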
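As a quick illustration of character sets and ranges, here is a small sketch on a made-up list (`sample` is hypothetical, not taken from any corpus):

```python
import re

# Hypothetical sample words.
sample = ["apple", "banana", "cherry", "date", "egg", "Avocado"]

# A set lists the allowed characters; their order is irrelevant.
set_match = [w for w in sample if re.search(r"^[abc]", w)]
same_set = [w for w in sample if re.search(r"^[cba]", w)]

# A range is shorthand for every character between its endpoints.
range_match = [w for w in sample if re.search(r"^[a-d]", w)]

print(set_match)              # ['apple', 'banana', 'cherry']
print(set_match == same_set)  # True
print(range_match)            # ['apple', 'banana', 'cherry', 'date']
```

"Avocado" is excluded in every case because the sets are case-sensitive.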

Closures

'+' matches one or more instances of the preceding item.

'*' matches zero or more instances of the preceding item.

chatWords = sorted(set(w for w in nltk.corpus.nps_chat.words()))
regexp = r'^[ha]+$'
[w for w in chatWords if re.search(regexp, w)][0:10]

[u'a', u'aaaaaaaaaaaaaaaaa', u'aaahhhh', u'ah', u'ahah', u'ahahah', u'ahh', u'ahhahahaha', u'ahhh', u'ahhhh']

regexp1 = r'^m+i+n+e+$'
[w for w in chatWords if re.search(regexp1, w)]

[u'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', u'miiiiiinnnnnnnnnneeeeeeee', u'mine', u'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

regexp2 = r'^m*i*n*e*$'
[w for w in chatWords if re.search(regexp2, w)]

[u'', u'e', u'i', u'in', u'm', u'me', u'meeeeeeeeeeeee', u'mi', u'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', u'miiiiiinnnnnnnnnneeeeeeee', u'min', u'mine', u'mm', u'mmm', u'mmmm', u'mmmmm', u'mmmmmm', u'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee', u'mmmmmmmmmm', u'mmmmmmmmmmmmm', u'mmmmmmmmmmmmmm', u'n', u'ne']
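The difference between the two closures is easiest to see side by side. The sketch below is only an illustration on a hypothetical list (`candidates` is not taken from the chat corpus):

```python
import re

# Hypothetical candidates, including the empty string.
candidates = ["", "mine", "miiinnne", "me", "in", "mien", "mines"]

# '+' requires at least one of each letter, in order.
plus_matches = [w for w in candidates if re.search(r"^m+i+n+e+$", w)]

# '*' lets any of the letters be absent, so even '' matches.
star_matches = [w for w in candidates if re.search(r"^m*i*n*e*$", w)]

print(plus_matches)  # ['mine', 'miiinnne']
print(star_matches)  # ['', 'mine', 'miiinnne', 'me', 'in']
```

"mien" matches neither pattern because the letters must still appear in the order m, i, n, e.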

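'[^..]': a ^ as the first character inside the brackets matches any character that is not listed.

For example, to match the words that contain no vowels: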

regexp2 = r'^[^aeiouAEIOU]+$'
[w for w in wordlist if re.search(regexp2, w)][0:10]

[u'b', u'by', u'byth', u'c', u'cly', u'cry', u'crypt', u'cwm', u'cyp', u'cyst']
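One detail worth checking is that a negated set excludes only the listed characters, so digits and punctuation also pass. A small sketch, assuming a hypothetical `sample` list:

```python
import re

# Hypothetical sample; it mixes words, digits and punctuation on purpose.
sample = ["cry", "myth", "tree", "Hmm", "a", "42", "t-v"]

# Every character of the token must be something other than a vowel.
no_vowels = [w for w in sample if re.search(r"^[^aeiouAEIOU]+$", w)]

print(no_vowels)  # ['cry', 'myth', 'Hmm', '42', 't-v']
```

On the lowercase NLTK word list this makes no practical difference, but on raw corpus tokens it may be preferable to list the consonants explicitly.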

4 Other Useful Expressions

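'\' escapes a metacharacter. For example, '\.' matches a literal '.', and '\$' matches a literal '$'.

'|' expresses a choice. For example, '(ed|ing)$' matches strings that end in ed or ing.

'{m,n}' repeats the preceding item between m and n times. Either bound may be omitted.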
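These three constructs can be checked on a few literal strings. The sketch below uses only the standard `re` module; the test strings are made up for illustration:

```python
import re

# Escaping: r'\.' matches only a literal dot, a bare '.' matches any character.
escaped = bool(re.search(r"^[0-9]+\.[0-9]+$", "3x14"))
unescaped = bool(re.search(r"^[0-9]+.[0-9]+$", "3x14"))
print(escaped, unescaped)  # False True

# Alternation: either branch inside the parentheses may match.
endings = [w for w in ["walked", "walking", "walks"] if re.search(r"(ed|ing)$", w)]
print(endings)  # ['walked', 'walking']

# Repetition bounds: {m,n}, {m,} and {,n}.
strings = ["a", "aa", "aaa", "aaaa"]
two_to_three = [s for s in strings if re.search(r"^a{2,3}$", s)]
at_least_two = [s for s in strings if re.search(r"^a{2,}$", s)]
at_most_two = [s for s in strings if re.search(r"^a{,2}$", s)]
print(two_to_three)  # ['aa', 'aaa']
print(at_least_two)  # ['aa', 'aaa', 'aaaa']
print(at_most_two)   # ['a', 'aa']
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the regex engine sees them.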

wsj = sorted(set(nltk.corpus.treebank.words()))
# The dot is escaped so that it matches only a literal decimal point.
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)][1:10]

[u'0.05', u'0.1', u'0.16', u'0.2', u'0.25', u'0.28', u'0.3', u'0.4', u'0.5']

[w for w in wsj if re.search('^[A-Z]+\$$',w)]

[u'US$']

[w for w in wsj if re.search('^[0-9]{4}',w)][0:10]

[u'1614', u'1637', u'1738.1', u'1787', u'1901', u'1903', u'1917', u'1920s', u'1925']
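Tokens such as '1738.1' and '1920s' appear in this result because the pattern is anchored only at the start. A minimal sketch of the difference, using a hypothetical token list rather than the treebank corpus:

```python
import re

# Hypothetical tokens chosen to resemble the output above.
tokens = ["1614", "1738.1", "1920s", "19", "12345"]

# Anchored at the start only: the token must begin with four digits.
prefix_only = [t for t in tokens if re.search(r"^[0-9]{4}", t)]

# Anchored at both ends: the token must be exactly four digits.
whole_token = [t for t in tokens if re.search(r"^[0-9]{4}$", t)]

print(prefix_only)  # ['1614', '1738.1', '1920s', '12345']
print(whole_token)  # ['1614']
```

Adding the trailing $ is the usual choice when only four-digit tokens such as years are wanted.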

[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}',w)][0:10]

[u'10-day', u'10-lap', u'10-year', u'100-megabyte', u'100-share', u'11-month-old', u'12-member', u'12-point', u'12-year', u'14-hour']

[w for w in wsj if re.search('[a-z]{5,}-[a-z]{2,3}-[a-z]',w)][0:10]

[u'black-and-white', u'bread-and-butter', u'father-in-law', u'machine-gun-toting', u'savings-and-loan', u'search-and-seizure', u'truth-in-lending']

[w for w in wsj if re.search('(ed|ing)$',w)][0:10]

[u'62%-owned', u'Absorbed', u'According', u'Adopting', u'Advanced', u'Advancing', u'Alfred', u'Allied', u'Annualized', u'Anything']