Regex Essentials

This article provides an overview of regular expressions and their key components, including the symbols they use and what each one means. It explains wildcards, character classes, and the meta-characters that must be escaped to be matched literally. It also covers repetition symbols, grouping with parentheses, and alternation with the pipe operator. Knowing both the base set and the extended set makes it easier to build and interpret regular expressions.

Base Set

  1. a* represents zero or more occurrences of a

    1. b* - zero or more occurrences of b
  2. "." (a period, or dot) represents any single character; in most engines this excludes the newline character by default

    1. .* - zero or more occurrences of any character. Basically any string of characters, including an empty one.
  3. \s represents a single whitespace character

    1. \s* represents zero or more whitespace characters
  4. [abc] - Character class. One of the characters inside the square brackets - a,b,c

  5. [^abc] - Any one character other than a,b,c - Negation of character class

  6. [a-c] - One of the characters falling in the range given in square brackets - a,b,c (Character Range class)

  7. [a-cx] - One of the characters falling in the range OR any of the other choices given in the square bracket - a,b,c,x

  8. [a-cA-Cx] - One of the characters falling in the range OR any of the other choices given in the square bracket - a,b,c,A,B,C,x

  9. The following characters have special meaning and must be escaped with a "\" to be matched literally: ^ $ * . [ ( ) \ (the extended-set symbols + ? { } | need the same treatment when the extended set is in use)

  10. If a "." is inside square brackets, it need NOT be escaped.

  11. If a "^" or a "-" is meant literally inside square brackets, it should be escaped with a backslash, as both have special meaning there.

  12. ^ (caret) is a placeholder that signifies the beginning of a line.

    1. The interpretation of ^ differs within the square bracket and outside of it.

    2. Inside the square bracket, ^ stands for negation.

    3. Outside it is a placeholder for the beginning of the line.

  13. $ is a placeholder that signifies the end of a line.
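
The base-set rules above can be checked with a short script. This is a minimal sketch using Python's standard re module; the helper name matches and the sample strings are assumptions made for illustration, and other regex engines may differ in small details (for example, whether ^ and $ match at every line or only at the ends of the whole input).

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# a* : zero or more occurrences of 'a'
assert matches(r"a*", "")
assert matches(r"a*", "aaa")

# . : exactly one character; .* : any number of characters
assert matches(r".", "x")
assert not matches(r".", "xy")
assert matches(r".*", "anything at all")

# \s : one whitespace character; \s* : zero or more of them
assert matches(r"\s", " ")
assert matches(r"a\s*b", "a   b")
assert matches(r"a\s*b", "ab")

# Character classes, negation, and ranges
assert matches(r"[abc]", "b")
assert not matches(r"[abc]", "d")
assert matches(r"[^abc]", "d")
assert not matches(r"[^abc]", "a")
assert matches(r"[a-c]", "c")
assert matches(r"[a-cx]", "x")
assert matches(r"[a-cA-Cx]", "B")

# Escaping: an unescaped dot matches any character, an escaped one only "."
assert matches(r"1.5", "1x5")
assert not matches(r"1\.5", "1x5")
assert matches(r"1\.5", "1.5")

# Inside square brackets the dot is already literal
assert matches(r"[.]", ".")
assert not matches(r"[.]", "x")

# Inside square brackets, ^ and - are escaped to be taken literally
assert matches(r"[\^\-]", "^")
assert matches(r"[\^\-]", "-")
assert not matches(r"[\^\-]", "a")

# Anchors: in Python, re.MULTILINE makes ^ and $ apply to every line
lines = "cat\nconcat\ncatalog"
line_starts = re.findall(r"^cat", lines, flags=re.MULTILINE)
line_ends = re.findall(r"cat$", lines, flags=re.MULTILINE)
print(line_starts, line_ends)
```

The script uses re.fullmatch so that each pattern has to account for the entire sample string, which keeps the checks strict; re.search would instead report a match anywhere inside the text.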

Extended Set

  1. {}

    1. a{m} represents exactly 'm' repetitions of whatever immediately precedes it, i.e. 'a'

    2. a{m,n} represents at least 'm' and at most 'n' repetitions of whatever immediately precedes it, i.e. 'a'

    3. a{m,} represents at least 'm' repetitions

    4. a{,n} represents at most 'n' repetitions (not every regex engine accepts an omitted lower bound; a{0,n} is the portable form)

  2. () - Parentheses group a sub-pattern so that it is treated as a single entity.

  3. a+ - One or more occurrences of 'a' (the character just preceding the plus symbol)

  4. a? - Zero or one occurrences of 'a'

  5. Pipe |

    1. (log|ply) represents either log or ply. Spaces around the pipe would be matched literally, so they are left out.

    2. (a|b) represents either a or b, where a and b can be multi-character strings
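
The extended-set symbols can be exercised the same way. The following is a minimal sketch with Python's re module; the helper name matches and the sample words (such as colou?r) are assumptions chosen for illustration. Python accepts the a{,n} form, but that is not guaranteed in other engines.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# a{m} : exactly m repetitions
assert matches(r"a{3}", "aaa")
assert not matches(r"a{3}", "aa")

# a{m,n} : at least m and at most n repetitions
assert matches(r"a{2,4}", "aaa")
assert not matches(r"a{2,4}", "aaaaa")

# a{m,} : at least m repetitions
assert matches(r"a{2,}", "aaaaaa")
assert not matches(r"a{2,}", "a")

# a{,n} : at most n repetitions (accepted by Python's re)
assert matches(r"a{,2}", "")
assert matches(r"a{,2}", "aa")
assert not matches(r"a{,2}", "aaa")

# Parentheses make the repetition apply to the whole group
assert matches(r"(ab)+", "ababab")
assert not matches(r"(ab)+", "abbb")
assert matches(r"ab+", "abbb")  # without the group, + applies to 'b' only

# ? : zero or one occurrence of the preceding character
assert matches(r"colou?r", "color")
assert matches(r"colou?r", "colour")
assert not matches(r"colou?r", "colouur")

# | : alternation between whole alternatives
assert matches(r"(log|ply)", "log")
assert matches(r"(log|ply)", "ply")
assert not matches(r"(log|ply)", "lop")

# Whitespace inside the group is part of the pattern
assert not matches(r"(log | ply)", "log")
assert matches(r"(log | ply)", "log ")

print("all extended-set checks passed")
```

The last two checks show why the alternatives are written without spaces around the pipe: with the spaces present, the pattern only matches "log " and " ply".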