re.match, search Examples
Execute regular expressions with re: call match, search, split and findall.
Python
This page was last reviewed on Dec 15, 2021.
Regular expressions. In Python we access regular expressions through the "re" library. We call methods like re.match() to test for patterns.
Re details. A regular expression is a small program that describes text. With match() and search() we run it against a string. More advanced methods like groupdict can process groups. Findall handles multiple matches and returns them in a list.
Match example. This program uses a regular expression in a loop. It applies a for-loop over the elements in a list. In the loop body, we call re.match().
Then We test this call for success. If it was successful, groups() returns a tuple containing the text content that matches the pattern.
Pattern This uses metacharacters to describe what strings can be matched. The "\w" means "word character." The plus means "one or more."
import re

# Sample strings.
values = ["dog dot", "data day", "no match"]

# Loop.
for element in values:
    # Match if 2 words starting with letter "d."
    m = re.match(r"(d\w+)\W(d\w+)", element)

    # See if success.
    if m:
        print(m.groups())

('dog', 'dot')
('data', 'day')

d     Lowercase letter d.
\w+   One or more word characters.
\W    A non-word character.
Search. This method is different from match. Both apply a pattern. But search attempts this at all possible starting points in the string. Match just tries the first starting point.
So Search scans through the input string and tries to match at any location. In this example, search succeeds but match fails.
import re

# Input.
value = "voorheesville"

m = re.search(r"(vi.*)", value)
if m:
    # This is reached.
    print("search:", m.group(1))

m = re.match(r"(vi.*)", value)
if m:
    # This is not reached.
    print("match:", m.group(1))

search: ville

vi   The lowercase letters v and i together.
.*   Zero or more characters of any type (except a newline, by default).
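The difference between the two methods can also be seen in the match position. This is a small sketch on the same input string; the variable names are only illustrative.

```python
import re

value = "voorheesville"

# search() scans forward and reports where the match begins.
found = re.search(r"vi.*", value)
start_index = found.start() if found else -1

# match() only tries index 0, so it returns None for this input.
anchored = re.match(r"vi.*", value)

print(start_index)
print(anchored is None)
```

Here search() finds the pattern partway through the string, while match() gives up because the string does not begin with "vi".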
Split. The re.split() method accepts a pattern argument. This pattern specifies the delimiter. With it, we can use any text that matches a pattern as the delimiter to separate text data.
Here We split the string on one or more non-digit characters. The regular expression is described after the script output.
Tip A split() method is also available directly on a string. This method handles no regular expressions. It is simpler.
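import re

# Input string.
value = "one 1 two 2 three 3"

# Separate on one or more non-digit characters.
result = re.split(r"\D+", value)

# Print results.
for element in result:
    print(element)

1
2
3

\D+   One or more non-digit characters.

Because the input begins with non-digit characters, the first element of the result is an empty string, so a blank line is printed before the digits.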
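To compare the two kinds of split side by side, here is a short sketch. It also drops empty strings from the re.split() result, which is one possible way to handle a delimiter at the start of the input; the variable names are illustrative.

```python
import re

value = "one 1 two 2 three 3"

# str.split() separates on a literal delimiter (whitespace by default).
plain_parts = value.split()

# re.split() separates on any text that matches the pattern.
regex_parts = re.split(r"\D+", value)

# The input starts with non-digits, so the first element is empty; drop it.
digits = [part for part in regex_parts if part]

print(plain_parts)
print(regex_parts)
print(digits)
```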
Findall. This is similar to split(). Findall accepts a pattern that indicates which strings to return in a list. It is like split() but we specify matching parts, not delimiters.
Here We scan a string for all sequences that begin with the letter "d" or "p" and continue with one or more word characters.

import re

# Input.
value = "abc 123 def 456 dot map pat"

# Find all words starting with d or p.
matches = re.findall(r"[dp]\w+", value)

# Print result.
print(matches)

['def', 'dot', 'pat']

[dp]   A lowercase d, or a lowercase p.
\w+    One or more word characters.
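The pattern above can also match in the middle of a word. If only whole words starting with d or p are wanted, one option is to add a word boundary "\b" in front. This sketch uses a slightly extended input string (the extra word "adapt" is an assumption for illustration).

```python
import re

value = "abc 123 def 456 dot map pat adapt"

# Without a boundary, "dapt" inside "adapt" is also returned.
loose = re.findall(r"[dp]\w+", value)

# With \b, the d or p must be the first character of a word.
strict = re.findall(r"\b[dp]\w+", value)

print(loose)
print(strict)
```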
Finditer. Unlike re.findall, which returns strings, finditer returns matches. For each match, we call methods like start() or end(). And we can access the value of the match with group().
import re

value = "123 456 7890"

# Loop over all matches found.
for m in re.finditer(r"\d+", value):
    print(m.group(0))
    print("start index:", m.start())

123
start index: 0
456
start index: 4
7890
start index: 8
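The end() method works the same way as start(). As a brief sketch, the two indexes together can be used to slice the matched text back out of the original string.

```python
import re

value = "123 456 7890"

# Collect the text, start index and end index of each match.
spans = [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\d+", value)]

for text, start, end in spans:
    # Slicing with start and end gives back the matched text.
    print(text, start, end, value[start:end] == text)
```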
Start, end. We can use special characters in an expression to match the start and end of a string. For the start, we use the character "^" and for the end, we use the "$" sign.
Here We loop over a list of strings and call re.match. We detect all the strings that start or end with a digit character "\d."
Tip The match method tests from the leftmost part of the string. So to test the end, we use ".*" to handle these initial characters.
import re

values = ["123", "4cat", "dog5", "6mouse"]

for element in values:
    # See if string starts in digit.
    m = re.match(r"^\d", element)
    if m:
        print("START:", element)

    # See if string ends in digit.
    m = re.match(r".*\d$", element)
    if m:
        print(" END:", element)

START: 123
 END: 123
START: 4cat
 END: dog5
START: 6mouse

^\d     Match at the start, check for single digit.
.*\d$   Check for zero or more of any char. Check for single digit. Match at the end.
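An alternative for the end-of-string test is re.search, which tries every starting point and so needs no leading ".*" part. This is a sketch of that variation, not the approach used in the program above.

```python
import re

items = ["123", "4cat", "dog5", "6mouse"]

# match() already starts at index 0, so "^" is optional here.
starts_with_digit = [s for s in items if re.match(r"\d", s)]

# search() can find the final digit without a leading ".*".
ends_with_digit = [s for s in items if re.search(r"\d$", s)]

print(starts_with_digit)
print(ends_with_digit)
```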
Or, repeats. Here we match strings that start with three word characters (letters, digits or underscores) or three dashes. And the final three characters must be digits. We use non-capturing groups with the "?:" syntax.
And We use the "{3}" codes to require three repetitions of word characters or hyphens.
Finally We specify digit characters with the code "\d" and the metacharacter "$" to require the end of the string.
import re

values = ["cat100", "---200", "xxxyyy", "jjj", "box4000", "tent500"]

for v in values:
    # Require 3 word characters OR 3 dashes.
    # ... Also require 3 digits.
    m = re.match(r"(?:(?:\w{3})|(?:\-{3}))\d\d\d$", v)
    if m:
        print(" OK:", v)
    else:
        print("FAIL:", v)

 OK: cat100
 OK: ---200
FAIL: xxxyyy
FAIL: jjj
FAIL: box4000
FAIL: tent500

(?:     The start of a non-capturing group.
\w{3}   Three word characters.
|       Logical or: a group within the chain must match.
\-      An escaped hyphen.
\d      A digit.
$       The end of the string.
Named groups. A regular expression can have named groups. This makes it easier to retrieve those groups after calling match(). But it makes the pattern more complex.
Here We can get the first name with the string "first" and the group() method. We use "last" for the last name.

import re

# A string.
name = "Clyde Griffiths"

# Match with named groups.
m = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)

# Print groups using names as id.
if m:
    print(m.group("first"))
    print(m.group("last"))

Clyde
Griffiths

Pattern: (?P<first>\w+)\W+(?P<last>\w+)

(?P<first>\w+)   First named group.
\W+              One or more non-word characters.
(?P<last>\w+)    Second named group.
Groupdict. A regular expression with named groups can fill a dictionary. This is done with the groupdict() method. In the dictionary, each group name is a key.
And Each value is the data matched by the regular expression. So we receive a key-value store based on groups.
Here With groupdict, we eliminate all references to the original regular expression. We can change the data to dictionary format.
import re

name = "Roberta Alden"

# Match names.
m = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)
if m:
    # Get dict.
    d = m.groupdict()

    # Loop over dictionary with for-loop.
    for t in d:
        print(" key:", t)
        print("value:", d[t])

 key: first
value: Roberta
 key: last
value: Alden
Comment. Sometimes a regular expression is confusing. A comment can be used to explain a complex part. One problem is that the comment syntax may be confusing too, which is worth considering before adding one.
Here We see that a Regex comment is written as "(?#...)": it starts with "(?#" and ends at the closing parenthesis.

import re

data = "bird frog"

# Use comments inside a regular expression.
m = re.match(r"(?#Before part).+?(?#Separator)\W(?#End part)(.+)", data)
if m:
    print(m.group(1))

frog

(?#Before part)   Comment, ignored
.+?               As few characters as possible
(?#Separator)     Comment, ignored
\W                Non-word character
(?#End part)      Comment, ignored
(.+)              One or more characters, captured
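Another way to comment a pattern is the re.VERBOSE flag, which ignores whitespace in the pattern and allows "#" comments that run to the end of a line. This sketch rewrites the same expression in that style as an alternative, not as the method used above.

```python
import re

data = "bird frog"

# With re.VERBOSE, whitespace is ignored and "#" starts a comment.
pattern = re.compile(r"""
    .+?     # Before part: as few characters as possible
    \W      # Separator: one non-word character
    (.+)    # End part: one or more characters, captured
""", re.VERBOSE)

m = pattern.match(data)
end_part = m.group(1) if m else None
print(end_part)
```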
Not-followed-by. We use a negative match pattern to ensure a value does not match. In this example, we match all the 3-digit strings except ones that are followed by a "dog" string.
Tip This is called a "negative lookahead assertion." It may be clearer to filter out results in Python code after matching.
import re

data = "100cat 200cat 300dog 400cat 500car"

# Find all 3-digit strings except those followed by "dog" string.
# ... Dogs are not allowed.
m = re.findall(r"(?!\d\d\ddog)(\d\d\d)", data)
print(m)

['100', '200', '400', '500']

(?!\d\d\ddog)   Not followed by 3 digits and "dog"
(\d\d\d)        3 digit value
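The filtering approach mentioned in the tip could look like the sketch below. It assumes each number is followed directly by lowercase letters, as in the sample data, and captures that word so Python code can decide what to keep.

```python
import re

data = "100cat 200cat 300dog 400cat 500car"

# Capture each 3-digit value together with the word that follows it.
pairs = re.findall(r"(\d\d\d)([a-z]+)", data)

# Keep the numbers whose following word does not begin with "dog".
allowed = [number for number, word in pairs if not word.startswith("dog")]

print(allowed)
```

The pattern stays simple, and the rule about dogs is stated in ordinary Python.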
Benchmark, search. Regular expressions often hinder performance in programs. I tested the in-operator on a string against the re.search method.
Version 1 This version of the code uses the in-operator to find the letter "x" in the string.
Version 2 Here we use re.search (a regular expression method) to find the same letter.
Result I found that the in-operator was much faster than the re.search method. For searching with no pattern, prefer the in-operator.
import time
import re

value = "max"

if "x" in value:
    print(1)

if re.search("x", value):
    print(2)

print(time.time())

# Version 1: in.
c = 0
i = 0
while i < 1000000:
    if "x" in value:
        c += 1
    i += 1

print(time.time())

# Version 2: re.search.
i = 0
while i < 1000000:
    if re.search("x", value):
        c += 1
    i += 1

print(time.time())

1
2
1381081435.177
1381081435.615 [in = 0.438 s]
1381081437.224 [re.search = 1.609 s]
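The same comparison can be sketched with the standard timeit module, which handles the loop and the timing. The helper names and the loop count here are assumptions chosen so the script finishes quickly; the timings will vary by machine and Python version.

```python
import re
import timeit

value = "max"

def with_in():
    # Version 1: the in-operator.
    return "x" in value

def with_search():
    # Version 2: re.search.
    return re.search("x", value) is not None

# number= is the loop count for each measurement.
in_seconds = timeit.timeit(with_in, number=100_000)
search_seconds = timeit.timeit(with_search, number=100_000)

print("in:       ", in_seconds)
print("re.search:", search_seconds)
```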
Benchmark, match. In another test I rewrote a method that uses re.match to use if-statements and a for-loop. It became much faster.
Version 1 In this experiment, stringmatch returns after finding an invalid length or an invalid start character.
Version 2 This version of the code uses a regular expression instead of custom Python looping logic.
Result It is faster to use a for-loop and test characters. But the code is more confusing and prone to errors.
import re
import time

def stringmatch(s):
    # Check for "ca+t" with if-statements and loop.
    if len(s) >= 3 and s[0] == 'c' and s[len(s) - 1] == 't':
        # Test every character between the first and the last.
        for v in range(1, len(s) - 1):
            if s[v] != 'a':
                return False
        return True
    return False

def stringmatch_re(s):
    # Check for "ca+t" with re.
    m = re.match(r"ca+t", s)
    if m:
        return True
    return False

print(time.time())

# Version 1: use string loop with if-statement.
for i in range(0, 10000000):
    result = stringmatch("ct")
    result = stringmatch("caat")
    result = stringmatch("dooog")

print(time.time())

# Version 2: use re.match.
for i in range(0, 10000000):
    result = stringmatch_re("ct")
    result = stringmatch_re("caat")
    result = stringmatch_re("dooog")

print(time.time())

1411309406.96144
1411309430.354504 stringmatch = 23.39 s
1411309480.849815 stringmatch_re = 50.50 s
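Because hand-written matching logic is prone to errors, it can help to cross-check it against the regular expression on a few inputs. The sketch below assumes the whole string must match, so it uses re.fullmatch; the helper names and sample strings are hypothetical.

```python
import re

def matches_cat_loop(s):
    # Whole string must be "c", then one or more "a", then "t".
    if len(s) < 3 or s[0] != "c" or s[-1] != "t":
        return False
    return all(ch == "a" for ch in s[1:-1])

def matches_cat_re(s):
    # fullmatch requires the pattern to cover the entire string.
    return re.fullmatch(r"ca+t", s) is not None

samples = ["ct", "cat", "caat", "cabt", "dooog", "caatx"]
results = [(s, matches_cat_loop(s), matches_cat_re(s)) for s in samples]

for s, loop_result, re_result in results:
    print(s, loop_result, re_result)
```

Note that re.match only anchors at the start, so it would accept "caatx", while the loop version and re.fullmatch reject it.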
Sub method. The re.sub method can apply a method or lambda to each match found in a string. We specify a pattern and a method that receives a match. And we can process matches in any way.
re.sub, subn
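As a minimal sketch of this idea, the following passes a function and then a lambda to re.sub. The sample string and the helper name double_number are assumptions for illustration.

```python
import re

def double_number(m):
    # m is the match object for one run of digits.
    return str(int(m.group(0)) * 2)

value = "10 apples and 25 pears"

# Replace each number with the result of calling the function.
result = re.sub(r"\d+", double_number, value)

# A lambda works the same way: uppercase each run of lowercase letters.
upper = re.sub(r"[a-z]+", lambda m: m.group(0).upper(), value)

print(result)
print(upper)
```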
Word count. We implement a simple word-counting routine. We use re.findall and count non-whitespace sequences in a string. And then we return the length of the resulting list.
Word Count
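The word-counting routine described above might look like this minimal sketch. The function name word_count and the sample strings are assumptions.

```python
import re

def word_count(text):
    # Each run of non-whitespace characters counts as one word.
    return len(re.findall(r"\S+", text))

print(word_count("To be or not to be"))
print(word_count("  spaced   out  "))
```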
A summary. A regular expression is often hard to write correctly. But when finished, it is shorter and often simpler to maintain than the equivalent hand-written logic. It describes a specific type of logic.
Dot Net Perls is a collection of tested code examples. Pages are continually updated to stay current, with code correctness a top priority.
Sam Allen is passionate about computer languages. In the past, his work has been recommended by Apple and Microsoft and he has studied computers at a selective university in the United States.
This page was last updated on Dec 15, 2021 (simplify).
© 2007-2023 Sam Allen.