Python - Regular Expressions

Regular expressions, also known as regex, in Python

© Prof Mukesh N Tekwani, 2016

Unit I Chap 3: Python – Regular Expressions

3.1 Concept of Regular Expression

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are widely used for text pattern matching, text extraction, and search-and-replace operations. Regular expressions are also called REs, regexes, or regex patterns.

The module re provides full support for regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.

We can specify the rules for the set of possible strings that we want to match; this set might contain English sentences, e-mail addresses, or anything you like. You can then ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

The patterns or regular expressions can be defined as follows:

● Literal characters must match exactly. For example, "a" matches "a".
● Concatenated patterns match concatenated targets. For example, "ab" ("a" followed by "b") matches "ab".
● Alternate patterns (separated by a vertical bar) match either of the alternative patterns. For example, "(aaa)|(bbb)" will match either "aaa" or "bbb".
● Repeating and optional items:
  ○ "abc*" matches "ab" followed by zero or more occurrences of "c", for example, "ab", "abc", "abcc", etc.
  ○ "abc+" matches "ab" followed by one or more occurrences of "c", for example, "abc", "abcc", etc., but not "ab".
  ○ "abc?" matches "ab" followed by zero or one occurrence of "c", for example, "ab" or "abc".
● Sets of characters: Characters and ranges of characters in square brackets form a set; a set matches any single character in the set or range. For example, "[abc]" matches "a" or "b" or "c", and "[_a-z0-9]" matches an underscore, any lower-case letter, or any digit.
● Groups: Parentheses indicate a group within a pattern. For example, "ab(cd)*ef" is a pattern that matches "ab" followed by any number of occurrences of "cd" followed by "ef", for example, "abef", "abcdef", "abcdcdef", etc.
● There are special names for some sets of characters, for example "\d" (any digit), "\w" (any alphanumeric character or underscore), "\W" (any character that is not alphanumeric or an underscore), etc.
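The pattern rules listed above can be checked directly in Python. The following is a minimal sketch, not part of the original notes: the names CASES and check_cases are made up for this illustration, and re.fullmatch() is used (rather than search() or match()) so that a pattern only counts as matching when it covers the whole test string.

```python
import re

# Each entry pairs a pattern from the list above with strings that should
# match it in full and strings that should not. (Illustrative data only.)
CASES = [
    (r"a", ["a"], ["b"]),
    (r"ab", ["ab"], ["a"]),
    (r"(aaa)|(bbb)", ["aaa", "bbb"], ["aab"]),
    (r"abc*", ["ab", "abc", "abcc"], ["a"]),
    (r"abc+", ["abc", "abcc"], ["ab"]),
    (r"abc?", ["ab", "abc"], ["abcc"]),
    (r"[abc]", ["a", "b", "c"], ["d"]),
    (r"[_a-z0-9]", ["_", "q", "7"], ["Q"]),
    (r"ab(cd)*ef", ["abef", "abcdef", "abcdcdef"], ["abcef"]),
    (r"\d", ["5"], ["x"]),
]


def check_cases(cases):
    """Return (pattern, text, expected_match) for every case that disagrees."""
    failures = []
    for pattern, should_match, should_not_match in cases:
        for text in should_match:
            if re.fullmatch(pattern, text) is None:
                failures.append((pattern, text, True))
        for text in should_not_match:
            if re.fullmatch(pattern, text) is not None:
                failures.append((pattern, text, False))
    return failures


failures = check_cases(CASES)
print("all examples behave as described" if not failures else failures)
```

The patterns are written as raw strings (r"...") so that the backslash in "\d" reaches the re module unchanged.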
3.2 Metacharacters

In forming a regular expression we use certain characters as metacharacters. These characters don’t match themselves; instead, they signal that something else should be matched. The complete list of metacharacters is:

. ^ $ * + ? { } [ ] \ | ( )

Metacharacters [ and ]: They are used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you want to match only lowercase letters, your RE would be [a-z]. If you want to match the digits from 2 to 7, the RE would be [2-7].

Metacharacter ^: You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class. For example, [^5] will match any character except '5'. A '^' that appears anywhere else inside a class simply matches the '^' character; outside a character class, '^' anchors the match at the start of the string.

Metacharacter \: The backslash is one of the most important metacharacters. It can be followed by various characters to signal various special sequences. It is also used to escape all the metacharacters so that you can still match them in patterns. If you need to match a [ or a \, you can precede it with a backslash to remove its special meaning: \[ or \\. These match the literal [ character and the literal \ character.

Metacharacter .: The . matches anything except a newline character, and there is an alternate mode (re.DOTALL) where it will match even a newline. '.' is often used where you want to match “any character”. Example: ‘x.x’ will match ‘xxx’ and also ‘xyx’.

Special Sequences:

\d  Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].
\D  Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9].
\s  Matches any whitespace character.
\S  Matches any non-whitespace character.
\w  Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_].
\W  Matches any character that is not alphanumeric or an underscore; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].

3.3 The re package: search() and match()

The re package provides two functions to perform queries on an input string: re.search() and re.match().

re.search() method:

Syntax of search():

    re.search(pattern, string)

The description of parameters is as follows:

Parameter   Description
pattern     The regular expression to be matched.
string      The string to be searched; the pattern may match anywhere in the string.

The re.search function returns a match object on success, None on failure.

Program 1: RegEx1.py
Write a program to search if a pattern 'aa[bc]*dd' appears in the line input by the user.

    import re

    # Compile once so the pattern can be reused for every input line.
    pat = re.compile(r'aa[bc]*dd')

    while True:
        line = input('Enter a line ("q" to quit):')
        if line == 'q':
            break
        if pat.search(line):
            print('matched:', line)
        else:
            print('no match:', line)

Analysis:
1. We import module re in order to use regular expressions.
2. re.compile() compiles a regular expression so that we can reuse the compiled regular expression without compiling it repeatedly.

Output:

    Enter a line ("q" to quit):aabcdd
    matched: aabcdd
    Enter a line ("q" to quit):abcd
    no match: abcd
    Enter a line ("q" to quit):aacd
    no match: aacd
    Enter a line ("q" to quit):aadd
    matched: aadd
    Enter a line ("q" to quit):aabcbcdd
    matched: aabcbcdd
    Enter a line ("q" to quit):aabcdddd
    matched: aabcdddd
    Enter a line ("q" to quit):q

Program 2: RegEx2.py
Write a program that searches for the occurrence of the pattern ‘A’ followed by a single digit, followed by the pattern ‘bb’.

    import re

    # 'A', then exactly one digit, then 'bb'.
    pat = re.compile(r'A[0-9]bb')

    while True:
        line = input('Enter a line ("q" to quit):')
        if line == 'q':
            break
        if pat.search(line):
            print('matched:', line)
        else:
            print('no match:', line)

In the above program, search() scans the string from left to right and reports the first place where the pattern matches. The pattern may occur anywhere in the string, which is why 'AA6bb' and 'AA6bbc' are matched below.

Output:

    Enter a line ("q" to quit):A65b
    no match: A65b
    Enter a line ("q" to quit):A65bb
    no match: A65bb
    Enter a line ("q" to quit):A6bb
    matched: A6bb
    Enter a line ("q" to quit):AA6bb
    matched: AA6bb
    Enter a line ("q" to quit):AA6bbc
    matched: AA6bbc
    Enter a line ("q" to quit):q

re.match() method:

Syntax of match():

    re.match(pattern, string, flags=0)

The description of parameters is as follows:

Parameter   Description
pattern     The regular expression to be matched.
string      The string to be tested; unlike search(), the pattern must match at the beginning of the string.
flags       Optional modifiers; several flags can be combined using bitwise OR (|).

The re.match function returns a match object on success, None on failure. We use the group(num) or groups() method of the match object to get the matched expression.

group(num=0)  Returns the entire match (or the specific subgroup num).
groups()      Returns all matching subgroups in a tuple.

** Example of using an escape sequence:

Start Python and type the following two lines.

    >>> name = 'Albert\nEinstein'
    >>> print(name)

The output is shown below. Note that \n is the newline character. It is treated as a single character, and it causes the remaining text to appear on the next line.

Output:

    Albert
    Einstein

** Raw Strings:

Raw strings are strings in which backslash escape sequences are not processed. We add the character ‘r’ or ‘R’ as a prefix to a string literal to make it a raw string. Modify the above example as follows:

    >>> name = r'Albert\nEinstein'
    >>> print(name)
    Albert\nEinstein

Note that \n had no effect in this case: the backslash and the letter n are kept as two ordinary characters.

IMPORTANT QUESTIONS

1. What is a regular expression? Which module provides support for regular expressions?
2. What is meant by the following: literal characters, concatenated patterns, alternate patterns, repeating and optional items, sets of characters?
3. What is a metacharacter? List the metacharacters used in Python. Explain the following metacharacters: [ and ], ^, \ and .
4. With an example, explain the search and match methods.
5. What is a raw string? Explain with a simple example.
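As a starting point for questions 4 and 5, the sketch below contrasts search() with match(), reads subgroups with group() and groups(), and shows why patterns are usually written as raw strings. It is an illustration under assumptions, not part of the original notes: the sample text and the variable names (text, pat, found, at_start) are made up, and the pattern is the one from Program 2 with parentheses added to create two subgroups.

```python
import re

# Hypothetical sample text: the pattern occurs, but not at position 0.
text = "Order AA6bb shipped"

# Program 2's pattern with two groups added: the digit and the 'bb'.
pat = re.compile(r"A([0-9])(bb)")

found = pat.search(text)    # looks for the pattern anywhere in the string
at_start = pat.match(text)  # only tries to match at the beginning

print(found.group(0))  # entire match: A6bb
print(found.group(1))  # first subgroup: 6
print(found.groups())  # all subgroups: ('6', 'bb')
print(at_start)        # None, because the text starts with 'Order'

# '\n' is one newline character; r'\n' is a backslash followed by 'n'.
print(len('\n'), len(r'\n'))  # 1 2

# With a raw string, the backslash in \d reaches the re module unchanged.
print(re.search(r"\d+", "room 42").group())  # 42
```

Always test the result against None before calling group(), since both search() and match() return None when there is no match.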
