An analyst who doesn't write R packages isn't truly full-stack

Python Regular Expressions

















In [17]:



from nltk.book import *
import nltk
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]









*** Introductory Examples for the NLTK Book ***
Loading text1, …, text9 and sent1, …, sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908








1 Start ^ and end $

^ matches the start of the string, and $ matches the end of the string:


For example, r'ed$' matches strings that end in ed








In [10]:



regexp = r'ed$'
[w for w in wordlist if re.search(regexp,w)][0:10]







Out[10]:

[u'abaissed',
u'abandoned',
u'abased',
u'abashed',
u'abatised',
u'abed',
u'aborted',
u'abridged',
u'abscessed',
u'absconded']







Or match strings that begin with aa:








In [11]:



regexp = r'^aa'
[w for w in wordlist if re.search(regexp,w)]







Out[11]:

[u'aa', u'aal', u'aalii', u'aam', u'aardvark', u'aardwolf']
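The effect of the two anchors can also be checked without the NLTK corpus. The following is a minimal sketch; the word list is a hypothetical stand-in chosen for illustration, not data from the corpus.

```python
import re

# Hypothetical mini word list standing in for the NLTK corpus.
words = ["abed", "edit", "aardvark", "bazaar", "red", "education"]

# Unanchored: 'ed' may appear anywhere in the word.
anywhere = [w for w in words if re.search(r"ed", w)]

# '$' pins the match to the end of the string.
ends_with_ed = [w for w in words if re.search(r"ed$", w)]

# '^' pins the match to the start of the string.
starts_with_aa = [w for w in words if re.search(r"^aa", w)]

print(anywhere)
print(ends_with_ed)
print(starts_with_aa)
```

Without an anchor, re.search reports a hit wherever the pattern occurs, which is why "edit" and "education" appear only in the first list.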







2 Any single character "." and optional "?"

"." matches any single character


"?" makes the preceding character optional; for example, r'E-?mail' matches both Email and E-mail








In [12]:



regexp = r'^b.t$'
[w for w in wordlist if re.search(regexp,w)]







Out[12]:

[u'bat', u'bet', u'bit', u'bot', u'but']




In [19]:



regexp = r'e-?mail'
sum(1 for w in nltk.book.text5 if re.search(regexp,w))







Out[19]:

3
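As a small self-contained check of "." and "?", the sketch below uses a hypothetical token list (not taken from the chat corpus) and anchors both patterns so that only whole tokens match.

```python
import re

# Hypothetical tokens chosen for illustration.
tokens = ["email", "e-mail", "e--mail", "Email", "E-mail", "emails",
          "bat", "boot", "b.t"]

# '-?' allows zero or one hyphen, so "e--mail" is rejected.
mail = [t for t in tokens if re.search(r"^[eE]-?mail$", t)]

# '.' stands for exactly one character, so "boot" (two middle letters) fails.
b_t = [t for t in tokens if re.search(r"^b.t$", t)]

print(mail)
print(b_t)
```

Note that "." also accepts a literal dot, which is why "b.t" is matched by the second pattern.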







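3 Ranges and closures

Ranges

r'^[abc]' matches strings that begin with a, b, or c; the order of the letters inside the brackets does not matter


r'^[a-d]' matches strings that begin with any letter from a to d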
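A minimal sketch of both bracket forms, using a hypothetical word list, shows that reordering the letters in a set changes nothing and that a range is shorthand for listing every letter in it:

```python
import re

# Hypothetical word list for illustration.
words = ["apple", "banana", "cherry", "date", "elder", "fig"]

# A character set: the order of the letters is irrelevant.
set_abc = [w for w in words if re.search(r"^[abc]", w)]
set_cba = [w for w in words if re.search(r"^[cba]", w)]

# A range: equivalent to writing [abcd].
range_ad = [w for w in words if re.search(r"^[a-d]", w)]

print(set_abc == set_cba)
print(range_ad)
```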


Closures

'+' matches one or more occurrences of the preceding item


'*' matches zero or more occurrences of the preceding item








In [39]:



chatWords = sorted(set(w for w in nltk.corpus.nps_chat.words()))
regexp = r'^[ha]+$'
[w for w in chatWords if re.search(regexp,w)][0:10]







Out[39]:

[u'a',
u'aaaaaaaaaaaaaaaaa',
u'aaahhhh',
u'ah',
u'ahah',
u'ahahah',
u'ahh',
u'ahhahahaha',
u'ahhh',
u'ahhhh']




In [31]:



regexp1 = r'^m+i+n+e+$'
[w for w in chatWords if re.search(regexp1,w)]







Out[31]:

[u'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
u'miiiiiinnnnnnnnnneeeeeeee',
u'mine',
u'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']




In [32]:



regexp2 = r'^m*i*n*e*$'
[w for w in chatWords if re.search(regexp2,w)]







Out[32]:

[u'',
u'e',
u'i',
u'in',
u'm',
u'me',
u'meeeeeeeeeeeee',
u'mi',
u'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
u'miiiiiinnnnnnnnnneeeeeeee',
u'min',
u'mine',
u'mm',
u'mmm',
u'mmmm',
u'mmmmm',
u'mmmmmm',
u'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee',
u'mmmmmmmmmm',
u'mmmmmmmmmmmmm',
u'mmmmmmmmmmmmmm',
u'n',
u'ne']
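The difference between "+" and "*" is easiest to see side by side. This sketch applies both closures to the same hypothetical token list; the tokens are made up for illustration.

```python
import re

# Hypothetical tokens, including the empty string.
tokens = ["", "mine", "miiine", "me", "mmm", "nime", "mines"]

# '+' requires every letter to appear at least once, in order.
plus = [t for t in tokens if re.search(r"^m+i+n+e+$", t)]

# '*' lets any letter be absent, so even the empty string matches.
star = [t for t in tokens if re.search(r"^m*i*n*e*$", t)]

print(plus)
print(star)
```

Both patterns still enforce the order m, i, n, e, so "nime" is rejected by each of them.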







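'[^...]': a ^ placed first inside the brackets matches any character that is not listed


For example, to match words that contain no vowels: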








In [38]:



regexp2 = r'^[^aeiouAEIOU]+$'
[w for w in wordlist if re.search(regexp2,w)][0:10]







Out[38]:

[u'b', u'by', u'byth', u'c', u'cly', u'cry', u'crypt', u'cwm', u'cyp', u'cyst']
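Because "^" means different things inside and outside brackets, a short comparison may help. The sketch below uses a hypothetical word list.

```python
import re

# Hypothetical word list for illustration.
words = ["cry", "rhythm", "apple", "sky", "Ugh", "tsk"]

# Outer '^' anchors at the start; inner '^' negates the set,
# so the whole word must consist of non-vowel characters.
no_vowels = [w for w in words if re.search(r"^[^aeiouAEIOU]+$", w)]

# Without the inner '^', the set is positive: the word starts with a vowel.
starts_with_vowel = [w for w in words if re.search(r"^[aeiouAEIOU]", w)]

print(no_vowels)
print(starts_with_vowel)
```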







4 Other useful expressions


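  • '\' escapes a metacharacter: '\.' matches a literal '.', and '\$' matches a literal '$'

  • '|' denotes alternation: '(ed|ing)$' matches strings that end in ed or ing

  • '{m,n}' repeats the preceding item m to n times; either bound may be omitted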
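The repetition bounds and alternation listed above can be sketched on hypothetical inputs as follows. In Python's re module, omitting m means a lower bound of zero and omitting n means no upper bound.

```python
import re

# Hypothetical inputs for illustration.
runs = ["a", "aa", "aaa", "aaaa", "aaaaa"]

exactly_two = [t for t in runs if re.search(r"^a{2}$", t)]
two_to_three = [t for t in runs if re.search(r"^a{2,3}$", t)]
at_least_three = [t for t in runs if re.search(r"^a{3,}$", t)]  # no upper bound
at_most_two = [t for t in runs if re.search(r"^a{,2}$", t)]     # lower bound is 0

# Alternation: either suffix is accepted.
verbs = ["walked", "walking", "walks"]
ed_or_ing = [v for v in verbs if re.search(r"(ed|ing)$", v)]

print(exactly_two, two_to_three, at_least_three, at_most_two)
print(ed_or_ing)
```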








In [41]:



wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$',w)][1:10]







Out[41]:

[u'0.05', u'0.1', u'0.16', u'0.2', u'0.25', u'0.28', u'0.3', u'0.4', u'0.5']
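When matching decimal numbers, whether the dot is escaped matters: an unescaped "." accepts any character between the digit groups. The sketch below compares the two forms on hypothetical tokens (not drawn from the treebank).

```python
import re

# Hypothetical tokens for illustration.
tokens = ["0.05", "1,000", "12345", "3.14", "7-8"]

# Unescaped '.': any single character may sit between the digit groups.
unescaped = [t for t in tokens if re.search(r"^[0-9]+.[0-9]+$", t)]

# Escaped '\.': only a literal decimal point is accepted.
escaped = [t for t in tokens if re.search(r"^[0-9]+\.[0-9]+$", t)]

print(unescaped)
print(escaped)
```

Writing the pattern as a raw string (the r prefix) keeps the backslash from being interpreted by Python before the regex engine sees it.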




In [42]:



[w for w in wsj if re.search(r'^[A-Z]+\$$',w)]







Out[42]:

[u'US$']




In [43]:



[w for w in wsj if re.search('^[0-9]{4}',w)][0:10]







Out[43]:

[u'1614',
u'1637',
u'1738.1',
u'1787',
u'1901',
u'1903',
u'1917',
u'1920s',
u'1925']




In [44]:



[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}',w)][0:10]







Out[44]:

[u'10-day',
u'10-lap',
u'10-year',
u'100-megabyte',
u'100-share',
u'11-month-old',
u'12-member',
u'12-point',
u'12-year',
u'14-hour']




In [45]:



[w for w in wsj if re.search('[a-z]{5,}-[a-z]{2,3}-[a-z]',w)][0:10]







Out[45]:

[u'black-and-white',
u'bread-and-butter',
u'father-in-law',
u'machine-gun-toting',
u'savings-and-loan',
u'search-and-seizure',
u'truth-in-lending']




In [46]:



[w for w in wsj if re.search('(ed|ing)$',w)][0:10]







Out[46]:

[u'62%-owned',
u'Absorbed',
u'According',
u'Adopting',
u'Advanced',
u'Advancing',
u'Alfred',
u'Allied',
u'Annualized',
u'Anything']

