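Regular expressions (regex) are a handy tool for working with strings, and they work in much the same way in every language. This post uses Python to record the regex techniques I learned in class.
Importing the module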
import re
patterns = ['term1', 'term2']
text = 'This is a string with term1, not not the other term'
print(re.search('h', 'hihhhhi')) # search particular pattern, return match object
>>> <re.Match object; span=(0, 1), match='h'>
match = re.search(patterns[0], text)
print(match.start())
print(match.end())
>>> 22
>>> 27
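One thing worth noting about the example above: re.search returns None when the pattern is not found, so calling .start() directly on the result would raise an AttributeError. Below is a minimal sketch of a guard for that case; find_span is a hypothetical helper name, not something from the course.

```python
import re

text = 'This is a string with term1, not not the other term'

def find_span(pattern, text):
    """Return (start, end) of the first match, or None if nothing matches.

    Hypothetical helper, for illustration only.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.start(), match.end()

print(find_span('term1', text))  # (22, 27)
print(find_span('term2', text))  # None, because 'term2' is not in the text
```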
# split method
split_term = "@"
string = "aaa@gmail.com"
re.split(split_term, string)
>>> ['aaa', 'gmail.com']
# findall method
re.findall("on", "how do you turn this on and on")
>>> ['on', 'on']
Pattern syntax
def multi_re_find(patterns, phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: %r' % (pattern))
        # String formatting (details: https://www.geeksforgeeks.org/str-vs-repr-in-python/)
        # %r formats the object with repr(), so the output includes quotes
        # %s formats the object with str(), so the output has no quotes
        print(re.findall(pattern, phrase))
        print('\n')
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd...dds...dddds'
test_patterns = [ 'sd*', # s followed by zero or more d's
'sd+', # s followed by one or more d's
'sd?', # s followed by zero or one d's
'sd{3}', # s followed by three d's
'sd{2,3}', # s followed by two to three d's
'd{2}s*', # two d followed by 0 or more s
]
multi_re_find(test_patterns,test_phrase)
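To make the quantifiers easier to verify by eye, here is a small sketch that runs the same patterns against a much shorter phrase. The phrase is a made-up example, not the one from the course.

```python
import re

# Hypothetical short phrase, chosen so each result is easy to check by hand.
phrase = 's sd sdd sddd ddss'

patterns = ['sd*', 'sd+', 'sd?', 'sd{3}', 'sd{2,3}', 'd{2}s*']

# Map each pattern to the list of non-overlapping matches it finds.
results = {p: re.findall(p, phrase) for p in patterns}

for pattern, matches in results.items():
    print('%-10r %s' % (pattern, matches))

# What findall returns for each pattern:
# 'sd*'     -> ['s', 'sd', 'sdd', 'sddd', 's', 's']
# 'sd+'     -> ['sd', 'sdd', 'sddd']
# 'sd?'     -> ['s', 'sd', 'sd', 'sd', 's', 's']
# 'sd{3}'   -> ['sddd']
# 'sd{2,3}' -> ['sdd', 'sddd']
# 'd{2}s*'  -> ['dd', 'dd', 'ddss']
```

Note that 'sd?' only ever consumes one d, so 'sdd' and 'sddd' both yield 'sd'.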
Character sets
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
test_patterns = ['[sd]', # either s or d
's[sd]+'] # s followed by one or more s or d
multi_re_find(test_patterns,test_phrase)
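As a quick check of the two character-set patterns, here is a sketch using only the first chunk of the test phrase, so the output stays short.

```python
import re

# First chunk of the test phrase above, kept short for readability.
phrase = 'sdsd..sssddd'

single = re.findall('[sd]', phrase)   # every individual s or d
runs = re.findall('s[sd]+', phrase)   # an s followed by a run of s/d characters

print(single)  # ['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd']
print(runs)    # ['sdsd', 'sssddd']
```

The dots are not in the set, so they split the phrase into two separate runs.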
Exclusion
A ^ at the start of a character set negates the set. For example, [^5] will match any character except '5'.
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'
re.findall('[^!.? ]+',test_phrase)
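For reference, a sketch of what that call returns, plus one possible way to turn the result back into a sentence without punctuation. Joining with a single space is my own assumption about the desired output.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Each match is a run of characters that are NOT '!', '.', '?' or a space.
words = re.findall('[^!.? ]+', test_phrase)
print(words)
# ['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation',
#  'How', 'can', 'we', 'remove', 'it']

# Assumption: rebuild the sentence with single spaces between the words.
cleaned = ' '.join(words)
print(cleaned)  # This is a string But it has punctuation How can we remove it
```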
Escape code
Escapes are indicated by prefixing the character with a backslash \. Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with r, eliminates this problem and maintains readability.
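A small sketch comparing the two spellings may help here. The sample strings are made up for illustration.

```python
import re

# In a normal string literal the backslash has to be doubled;
# in a raw string it is written once.
escaped = '\\d+'
raw = r'\d+'
print(escaped == raw)  # True: both are the three characters \ d +

digits = re.findall(raw, 'room 101, floor 7')
print(digits)  # ['101', '7']

# Matching a literal backslash shows the readability gap more clearly.
path = r'C:\temp'                       # hypothetical sample string
with_raw = re.findall(r'\\', path)      # raw string: two backslashes
without_raw = re.findall('\\\\', path)  # normal string: four backslashes
print(with_raw == without_raw)  # True
```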
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'
test_patterns=[ r'\d+', # sequence of digits
r'\D+', # sequence of non-digits
r'\s+', # sequence of whitespace
r'\S+', # sequence of non-whitespace
r'\w+', # sequence of word characters (alphanumeric or underscore)
r'\W+', # sequence of non-word characters
]
multi_re_find(test_patterns,test_phrase)
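Here is a sketch of the same escape codes on a shorter phrase, so the results can be checked by hand. The phrase is a made-up example; it includes an underscore to show that \w treats it as a word character.

```python
import re

# Hypothetical short phrase with digits, a symbol and an underscore.
phrase = 'Call 1233 #my_tag'

codes = [r'\d+', r'\D+', r'\s+', r'\S+', r'\w+', r'\W+']
results = {code: re.findall(code, phrase) for code in codes}

for code, matches in results.items():
    print('%-8r %s' % (code, matches))

# What findall returns for each code:
# \d+ -> ['1233']
# \D+ -> ['Call ', ' #my_tag']
# \s+ -> [' ', ' ']
# \S+ -> ['Call', '1233', '#my_tag']
# \w+ -> ['Call', '1233', 'my_tag']
# \W+ -> [' ', ' #']
```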
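The main takeaway: get familiar with the official documentation, and the rest should come with practice.
Official documentation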