What is Regex?

Regex (short for regular expression) is a domain-specific language (DSL) for describing patterns that you want to match in, or extract from, a text document. For example, the regex [0-9] or \d matches any single digit from 0 to 9. \d is known as a metacharacter: one or more special characters that have a unique meaning.
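As a small sketch using Python's built-in re module (the sample sentence and variable names are made up for illustration), both forms pick out the same single digits:

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Room 7 is on floor 3, next to room 12."

# \d matches one digit at a time, so "12" comes back as "1" and "2".
digits_shorthand = re.findall(r"\d", text)

# [0-9] is the character class written out explicitly; for plain ASCII
# text like this it matches the same characters as \d.
digits_class = re.findall(r"[0-9]", text)

print(digits_shorthand)
print(digits_class)
```

Note that each match is a single character, which is why a quantifier is needed to capture a whole number such as 12.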

What are Quantifiers?

You can use {} to specify how many times the preceding expression must repeat. For example, \d{3} matches exactly three digits in a row, such as the first run of 3 digits found in the text document.
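A minimal sketch of \d{3} in Python, assuming an invented sample string: re.search returns the first run of three digits, while re.findall keeps scanning and returns every non-overlapping run.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Call 555 0199 or extension 42."

# re.search stops at the first place where three digits appear in a row.
first_match = re.search(r"\d{3}", text)
first_three = first_match.group() if first_match else None

# re.findall returns every non-overlapping run of exactly three digits.
# "0199" contributes "019" (the leftover "9" is too short), and "42"
# never matches because it has only two digits.
all_threes = re.findall(r"\d{3}", text)

print(first_three)
print(all_threes)
```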

We also have 3 inexact quantifiers: “?”, “*”, and “+”. “?” means zero or one. “*” means zero or more. “+” means one or more. Each one applies to the expression immediately before it.
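To see the three quantifiers side by side, here is a short Python sketch. The word list is made up; each pattern differs only in the quantifier placed after the letter "u".

```python
import re

# Hypothetical sample text with zero, one, and two "u" characters.
text = "color colour colouur colr"

# "u?" allows zero or one "u".
zero_or_one = re.findall(r"colou?r", text)

# "u*" allows zero or more "u" characters.
zero_or_more = re.findall(r"colou*r", text)

# "u+" requires at least one "u".
one_or_more = re.findall(r"colou+r", text)

print(zero_or_one)
print(zero_or_more)
print(one_or_more)
```

The word "colr" never matches because the literal "o" before the "u" is still required by all three patterns.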

What are literals and escape sequences?

A literal is any character in our matching expression that stands for itself. It is literally the string we want to find. An escape sequence is how we tell regex to treat a metacharacter as a literal. For example, “.” means any single character in regex (by default, any character except a newline), but “\.” means we want to look for a period character.
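The difference is easy to miss in practice, so here is a small Python sketch with an invented sample string. It also uses re.escape, a helper in the standard re module that adds the backslashes for you.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Version 3.14 and 3x14"

# Unescaped "." matches any single character, so "3x14" matches too.
unescaped = re.findall(r"3.14", text)

# Escaped "\." matches only a literal period.
escaped = re.findall(r"3\.14", text)

# re.escape builds the escaped pattern from a plain string.
auto_escaped = re.findall(re.escape("3.14"), text)

print(unescaped)
print(escaped)
print(auto_escaped)
```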

How to use Regex?

  1. Import the re library

  2. Compile the expression you want to extract using re.compile([regex pattern])

  3. Substitute the matches found with new text using regex_pattern.sub([what you are subbing with], [target sentence])
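The three steps above can be sketched as follows. The phone-number pattern, the sample sentence, and the replacement text are assumptions chosen only to illustrate the calls.

```python
# Step 1: import the re library.
import re

# Step 2: compile the pattern. This hypothetical pattern matches
# three digits, a hyphen, then four digits.
phone_pattern = re.compile(r"\d{3}-\d{4}")

# Step 3: substitute every match in the target sentence.
sentence = "Call 555-0199 or 555-0123 for details."
redacted = phone_pattern.sub("[REDACTED]", sentence)

print(redacted)
```

Compiling once is mainly useful when the same pattern is applied to many sentences, since the compiled object can be reused.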

Alternatively, you can use the spaCy library (recommended), whose rule-based matcher supports regular expressions.

What are the use cases for Regex?

  1. Finding / Searching for specific terms in text documents

  2. Finding & Replacing

  3. Text Processing –> your data might be unique and require additional filtering
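As a rough illustration of the three use cases, the sketch below applies one compiled pattern to a few made-up log lines. The log format and variable names are hypothetical.

```python
import re

# Hypothetical log lines, invented for this illustration.
lines = [
    "2021-03-04 ERROR disk full",
    "2021-03-04 INFO backup done",
    "2021-03-05 error: retrying",
]

# One case-insensitive pattern reused for all three tasks.
error_pattern = re.compile(r"error", re.IGNORECASE)

# 1. Searching: find the first occurrence in a single line.
first_hit = error_pattern.search(lines[0])
found_term = first_hit.group() if first_hit else None

# 2. Finding and replacing: normalise the spelling in every line.
normalised = [error_pattern.sub("ERROR", line) for line in lines]

# 3. Text processing: keep only the lines that mention an error.
error_lines = [line for line in lines if error_pattern.search(line)]

print(found_term)
print(normalised)
print(error_lines)
```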

Regex vs String methods

Python has a built-in string method, str.find, that returns the position of the first occurrence of a substring in the text (or -1 if it is not there). What’s the difference between the two approaches?

  1. String methods are easier to use and understand

  2. Regex can handle broader use cases

  3. Regex is language independent as you are specifying the patterns

  4. Regex can be faster with large data, for example when one compiled pattern replaces many separate string searches
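A brief side-by-side sketch of the first two points, using an invented sentence: str.find is simple but only locates one fixed substring, while a regex can describe a whole family of strings.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Order A17 shipped, order B204 pending."

# str.find looks for one exact, case-sensitive substring and returns
# the index of its first occurrence, or -1 if it is absent.
position = text.find("order")
missing = text.find("refund")

# A regex can describe a pattern instead: here, one capital letter
# followed by one or more digits.
order_codes = re.findall(r"[A-Z]\d+", text)

print(position, missing)
print(order_codes)
```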

Types of Regex errors

You have two types of regex errors:

  1. Type I (False positives): matching strings that we shouldn’t have matched

  2. Type II (False negatives): missing strings that we should have matched

It’s important to analyse both types of errors to see how well your regex performs!
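One way to analyse both error types is to run the pattern over a handful of hand-labelled examples. The sketch below assumes a deliberately naive pattern for four-digit years; the examples and labels are made up for illustration.

```python
import re

# Deliberately naive, hypothetical pattern: any four digits in a row.
year_pattern = re.compile(r"\d{4}")

# Hand-labelled examples: (text, should the pattern match?).
labelled = [
    ("Published in 2019", True),
    ("Call 55501234 now", False),  # digits, but not a year
    ("Written in '98", True),      # a year the pattern cannot see
    ("No numbers here", False),
]

# Type I: the pattern matched, but the label says it should not have.
false_positives = [
    text for text, expected in labelled
    if year_pattern.search(text) and not expected
]

# Type II: the pattern did not match, but the label says it should have.
false_negatives = [
    text for text, expected in labelled
    if not year_pattern.search(text) and expected
]

print(false_positives)
print(false_negatives)
```

Tightening the pattern to reduce one list often grows the other, so it helps to check both after every change.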


