An Introduction to Regex in Python

Category: Development (Python) | Posted by: 店小二04 | Date: 2017 | Author: 红领巾

A regular expression is simply a sequence of characters that defines a pattern.

When you want to match a string to perhaps validate an email or password, or even extract some data, a regex is an indispensable tool.

Everything in a regex is a character, even the dot (.).

While Unicode characters can be used to match any international text, most patterns use plain ASCII (letters, digits, punctuation and keyboard symbols like $@%#!.).

Why should I learn regex?

Regular expressions are everywhere. Here are some of the reasons why you should learn them:

- They do a lot with less: you can write a few characters to do something that could have taken dozens of lines of code to implement.
- Standing out from the crowd: most programmers don't know regex. If you don't know it, you are about to detach yourself from that category.
- They are super fast: regex patterns written with performance in mind take a very short time to execute. Backtracking might take some time, but even that has optimized variations that run fast.
- They are portable: the majority of regex syntax works the same way in a variety of programming languages.

You should learn them for the same reason I did: they make your work a lot easier.

Are there any real world applications?

Common applications of regex are:

- Input validation (emails, usernames, passwords)
- Web scraping
- Data wrangling
- Simple parsing

Also, regex is used for text matching in spreadsheets, text editors, IDEs and Google Analytics.

Let the coding begin!

We are going to use Python to write some regex. Python is known for its readability, which makes the patterns easier to implement and follow.

In Python, the re module provides full support for regular expressions. A GitHub repo contains the code and concepts we'll use here.

Our first regex pattern

Python uses raw string notation to write regular expressions:

    r"write-expression-here"

First, we'll import the re module. Then we'll write out the regex pattern.

    import re

    pattern = re.compile(r"")

The purpose of the compile method is to compile the regex pattern into a pattern object that will be used for matching later. It's advisable to compile a regex when it'll be used several times in your program: saving the resulting regular expression object and reusing it is more efficient than building it again each time.
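To see why the raw string notation matters, here is a minimal sketch. It assumes nothing beyond the standard re module; the variable names are made up for illustration, and \d is one of the special sequences described in the next paragraphs.

```python
import re

# A raw string keeps the backslash as a literal character.
plain_length = len("\n")   # 1: a single newline character
raw_length = len(r"\n")    # 2: a backslash followed by the letter n
print(plain_length, raw_length)

# Both spellings hand the same two characters, \ and d, to the regex engine,
# but the raw string is easier to read.
escaped = re.compile("\\d")
raw = re.compile(r"\d")
print(escaped.pattern == raw.pattern)  # True
```

Without the r prefix, every backslash meant for the regex engine has to be doubled, which gets hard to read quickly.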

To add some regular expression inside the raw string notation, we'll put some special sequences to make our work easier.

So, what are special sequences? They are simply sequences of characters that begin with a backslash \. For instance, \d matches one digit [0-9], and \w matches one word character. For ASCII text, a word character is a letter, a digit or an underscore [a-zA-Z0-9_].

It's important to know them since they help us write simpler and shorter regex.

Here's a table with more special sequences:

    Element   Description
    .         Matches any character except \n
    \d        Matches any digit [0-9]
    \D        Matches any non-digit character [^0-9]
    \s        Matches any whitespace character [ \t\n\r\f\v]
    \S        Matches any non-whitespace character [^ \t\n\r\f\v]
    \w        Matches any word character [a-zA-Z0-9_]
    \W        Matches any non-word character [^a-zA-Z0-9_]

Points to note:

- [0-9] is the same as [0123456789]
- \d is short for [0-9]
- \w is short for [a-zA-Z0-9_]
- [7-9] is the same as [789]

Having learned something about special sequences, let's continue with our coding. Write down and run the code below.

    import re

    pattern = re.compile(r"\w")

    # Let's feed in some strings to match
    string = "regex is awesome!"

    # Then call a matching method to match our pattern
    result = pattern.match(string)
    print(result.group())  # will print out 'r'

The match method returns a match object, or None if no match was found. We are printing result.group(). group() is a match object method that returns the entire match. If match had returned None instead, calling group() on it would raise an AttributeError, so real code should check the result first.

You may wonder why the output is only a letter and not the whole word. It's simply because \w matches a single word character, and match only looks at the start of the string. We've just written our first regex program!
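As a rough sketch of handling the None case, the helper below (first_char is a hypothetical name, not part of the re module) checks the result before calling group():

```python
import re

def first_char(pattern, text):
    """Return what `pattern` matches at the start of `text`, or None."""
    result = re.compile(pattern).match(text)
    # match() returns None when the pattern does not fit at position 0,
    # so check the result before calling group().
    return result.group() if result else None

print(first_char(r"\w", "regex is awesome!"))  # r
print(first_char(r"\d", "regex is awesome!"))  # None
print(first_char(r"\d", "42 is the answer"))   # 4
```

The second call returns None because the string starts with a letter, not a digit.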

Let's do more than that

We want to do more than simply match a single letter. So we amend our code to look like this:

    # Replace the pattern variable with this
    pattern = re.compile(r"\w+")  # Notice the plus sign we just added

The + in our second pattern is what we call a quantifier.

Quantifiers simply specify the quantity of characters to match.

Here are some other regex quantifiers and how to use them.

    Quantifier   Description                                  Example    Sample match
    +            one or more times                            \w+        ABCDEF097
    {2}          exactly 2 times                              \d{2}      01
    {1,}         one or more times                            \w{1,}     smiling
    {2,4}        2, 3 or 4 times                              \w{2,4}    1234
    *            0 or more times                              A*B        AAAAB
    ?            once or none; placed after another           \d+?       1 in 12345
                 quantifier, it makes that quantifier lazy
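The rows of the table can be checked with a short sketch like the one below. The variable names are illustrative only.

```python
import re

text = "12345"

greedy = re.compile(r"\d+").match(text).group()       # as many digits as possible
lazy = re.compile(r"\d+?").match(text).group()        # as few digits as possible
bounded = re.compile(r"\d{2,4}").match(text).group()  # between 2 and 4 digits
star = re.compile(r"A*B").match("AAAAB").group()      # zero or more A, then B

print(greedy)   # 12345
print(lazy)     # 1
print(bounded)  # 1234
print(star)     # AAAAB
```

Quantifiers are greedy by default: they take as much as they can. Adding ? after a quantifier flips it to take as little as it can.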

Let's write some more quantifiers in our program!

    import re

    def regex(string):
        """This function returns at least one matching digit."""
        pattern = re.compile(r"\d{1,}")  # For brevity, this is the same as r"\d+"
        result = pattern.match(string)
        if result:
            return result.group()
        return None

    # Call our function, passing in our string
    print(regex("007 James Bond"))

The above regex uses a quantifier to match at least one digit. Running the program will print this output: 007

What are ^ and $ ?

You may have noticed that a regex usually has the ^ and $ characters. For example, r"^\w+$". Here's why.

^ and $ are boundaries or anchors. ^ marks the start, while $ marks the end of the string being matched.

However, when ^ is the first character inside square brackets, [^ ... ], it means not. For example, [^\s] tells regex to match any character that is not whitespace. Inside brackets $ loses its special meaning as well, so [^\s$] matches any character that is neither whitespace nor a literal $.

Let's write some code to prove this

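    import re

    line = "dance more"
    # One or more characters that are neither digits nor whitespace
    result = re.match(r"[^\d\s]+", line)
    print(result.group())  # Prints out 'dance'

First, notice there's no re.compile this time. Programs that use only a few regular expressions at a time don't have to compile them explicitly, so we call re.match() directly. The module-level re.match() takes the pattern as its first argument and the string to match as its second, so we passed it the line variable. Also note that inside square brackets a + is a literal plus sign, not a quantifier, which is why the quantifier goes after the closing bracket.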
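To see the two meanings of ^ side by side, here is a small sketch; whole_word and non_space are hypothetical names chosen for this example.

```python
import re

# Anchored: the entire string must consist of word characters.
whole_word = re.compile(r"^\w+$")
print(bool(whole_word.match("regex")))        # True
print(bool(whole_word.match("regex rocks")))  # False: the space is not a word character

# Negated class: a run of characters that are not whitespace.
non_space = re.compile(r"[^\s]+")
print(non_space.match("regex rocks").group())  # regex
```

Outside brackets ^ pins the match to the start of the string. As the first character inside brackets it negates the character class.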

Let's look at a new concept: search.

Searching versus Matching

The match method checks for a match only at the beginning of the string, while the search method checks for a match anywhere in the string.

Let's write some search functionality.

    import re

    string = "\n dreamer"
    result = re.search(r"\w+", string, re.MULTILINE)
    print(result.group())  # Prints out 'dreamer'

The search method, like the match method, can also take an extra flags argument. re.MULTILINE makes ^ and $ match at the start and end of every line, not just at the start and end of the whole string. Our pattern here has no anchors, so search would find 'dreamer' with or without the flag.
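A minimal sketch of what re.MULTILINE changes, using a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ only matches at the very start of the string.
plain = re.search(r"^second", text)
print(plain)  # None

# With re.MULTILINE, ^ also matches right after each newline.
multi = re.search(r"^second", text, re.MULTILINE)
print(multi.group())  # second
```

The flag only affects the ^ and $ anchors. A pattern without anchors behaves the same either way.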

Let's take a look at another example of how search works:

    import re

    pattern = re.compile(r"^<html>")
    result = pattern.search("<html></html>")
    print(result.group())

This will print out <html>.
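The difference between the two methods shows up as soon as the interesting part is not at the start of the string. A short sketch with an illustrative input:

```python
import re

pattern = re.compile(r"\d+")
text = "agent 007"

matched = pattern.match(text)    # only tries position 0
searched = pattern.search(text)  # scans forward until the pattern fits

print(matched)           # None: the string does not start with a digit
print(searched.group())  # 007
```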

Splitting

re.split() splits a string into a list, using the passed pattern as the delimiter. For example, consider having names read from a file, one per line, that we want to put in a list:

    text = "John Doe\nJane Doe\nJin Du\nChin Doe"

We can use split to break the text at each line ending and collect the names in a list:

    import re

    results = re.split(r"\n+", text)
    print(results)  # will print: ['John Doe', 'Jane Doe', 'Jin Du', 'Chin Doe']

Finding it all

But what if we wanted to find all instances of words in a string? Enter re.findall .

re.findall() finds all occurrences of a pattern, not just the first one as re.search() does. Unlike search, which returns a match object, findall returns a list of matches. Let's write and run this functionality.

    import re

    def finder(string):
        """This function finds all the words in a given string."""
        result_list = re.findall(r"\w+", string)
        return result_list

    # Call finder function, passing in the string argument
    print(finder("finding dory"))

The output will be a list: ['finding', 'dory']

Let's say we want to search for people with 5 or 6-figure salaries. Regex will make it easy for us. Let's try it out:

    import re

    salaries = "120000 140000 10000 1000 200"
    result_list = re.findall(r"\d{5,6}", salaries)
    print(result_list)  # prints out: ['120000', '140000', '10000']

Manipulating that string

Suppose we wanted to do some string replacement. The re.sub method will help us do that. It simply returns a string that has undergone some replacement using a matched pattern.

Let's write code to do some string replacement

    import re

    pattern = re.compile(r"[0-9]+")
    result = pattern.sub("__", "there is only 1 thing 2 do")
    print(result)

The program's aim is to replace every run of digits in the string with __. Therefore, the printed output will be: there is only __ thing __ do

Let's try out another example. Write down the following code:

    import re

    pattern = re.compile(r"\w+")  # Match only alphanumeric characters
    input_string = "Lorem ipsum with steroids"
    result = pattern.sub("regex", input_string)  # replace with the word regex
    print(result)  # prints 'regex regex regex regex'

We have managed to replace the words in the input string with the word "regex". Regex is very powerful in string manipulations.
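Two more features of sub are worth a quick sketch: the \g<0> reference, which stands for the whole match inside the replacement string, and the optional count argument. The sentence and variable names below are only examples.

```python
import re

sentence = "there is only 1 thing 2 do"

# \g<0> in the replacement refers to the text that was matched.
wrapped = re.sub(r"\d+", r"[\g<0>]", sentence)
print(wrapped)  # there is only [1] thing [2] do

# count limits how many replacements are made, starting from the left.
once = re.sub(r"\d+", "__", sentence, count=1)
print(once)  # there is only __ thing 2 do
```

Referring back to the match lets you add text around it instead of throwing it away.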

Look ahead!

Sometimes you might encounter (?=...) in regex. This syntax defines a look ahead. A look ahead checks what comes right after the current position without making it part of the match. For instance, r"a(?=b)" will match a only if it's followed by b, and the b itself is not included in the result.

Let's write some code to elaborate that.

    import re

    pattern = re.compile(r'\w+(?=\sfox)')
    result = pattern.search("The quick brown fox")
    print(result.group())  # prints 'brown'

The pattern matches the word that is immediately followed by a space character and the word fox.
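A look ahead can also be negated with (?!...), which succeeds only when the given pattern does not follow. The sketch below is an illustration under that assumption; it also uses \b, the word-boundary sequence, which this article has not covered, to stop the engine from matching part of a word.

```python
import re

text = "Me, myself, and I"

# Whole words that are NOT immediately followed by a comma.
# \b marks a word boundary, so only complete words are considered.
no_comma = re.findall(r"\b\w+\b(?!,)", text)
print(no_comma)  # ['and', 'I']
```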

Let's look at another example. Go ahead and write this snippet:

    import re

    # Match any word followed by a comma.
    # The example below is not the same as re.compile(r"\w+,")
    # for that would result in ['Me,', 'myself,']
    pattern = re.compile(r"\w+(?=,)")
    res = pattern.findall("Me, myself, and I")
    print(res)

The above regex matches every run of word characters that is followed by a comma. When we run this, it prints a list containing: ['Me', 'myself']

When to escape

What if you wanted to match a string that has a bunch of these special regex characters? A backslash is used to define special sequences in regex. So to match special characters literally in our pattern string, we need to escape them by putting a backslash \ in front of them.

Here's an example.

    import re

    pattern = re.compile('\\\\')  # the regex \\ matches one literal backslash
    result = pattern.match("\\author")
    print(result.group())  # will print \

Let's try it one more time just to get it. Suppose we want to include a + (a reserved quantifier) in a string to be matched by a pattern. We'll do something like this:

    import re

    pattern = re.compile(r"\w+\+")  # match alphanumeric characters followed by a + character
    result = pattern.search("file+")
    print(result.group())  # will print out file+

We have successfully escaped the + character so that regex does not mistake it for a quantifier.
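When the text to match comes from somewhere else and may contain several special characters, escaping by hand gets error-prone. The re module provides re.escape for this; a minimal sketch with a made-up literal:

```python
import re

literal = "1+1=2?"  # contains the regex metacharacters + and ?

# re.escape puts a backslash in front of characters that would
# otherwise be interpreted as regex syntax.
pattern = re.compile(re.escape(literal))
found = pattern.search("is 1+1=2? yes")
print(found.group())  # 1+1=2?
```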

Can we monetize it?

For a real world application, here's a function that monetizes a number using thousands separator commas.

    import re

    def monetizer(number):
        """This function adds a thousands separator using comma characters."""
        number = str(number)
        try:
            int(number)  # raises ValueError if the input is not a whole number
        except ValueError:
            return "Not a Number"
        # Format into groups of three from the right to the left
        pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
        # substitute with a comma then return
        return pattern.sub(r'\g<0>,', number)

    number = input("Enter your number\n")
    # Function call, passing in number as an argument
    print(monetizer(number))

As you might have noticed, the pattern uses a look-ahead mechanism. The parentheses group the remaining digits into clusters of three, and the look ahead only succeeds when those clusters run all the way to the end of the number, so a comma is inserted after each matched group. For example, the number 1223456 will become 1,223,456.
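One way to sanity-check the pattern is to compare it with Python's built-in thousands-separator formatting, format(value, ","). The sketch below assumes non-negative whole numbers; add_commas is a hypothetical helper that reuses the same pattern without the input handling.

```python
import re

def add_commas(digits):
    """Insert thousands separators into a string of digits."""
    return re.sub(r"\d{1,3}(?=(\d{3})+(?!\d))", r"\g<0>,", digits)

for value in (999, 1000, 1223456):
    # The regex result should agree with the built-in formatting.
    assert add_commas(str(value)) == format(value, ",")
    print(add_commas(str(value)))
```

In everyday code the built-in formatting is the simpler choice; the regex version is mainly useful as an exercise in look aheads.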

Conclusion

Congratulations on making it to the end of this intro! From special sequences, matching and searching, to finding all matches, using look aheads and manipulating strings, we've covered quite a lot.

There are some advanced concepts in regex, such as backtracking and performance optimization, which we can continue to learn as we grow. A good resource for more intricate details is the re module documentation.

Great job on learning something that many consider difficult! If you found this helpful, spread the word.
