Tutorial
Preface
Human2Regex (H2R) is a way to spell out a regular expression in an easy-to-read, easy-to-modify language. H2R supports multiple languages as well as many (though not all) regular expression features, such as named groups and quantifiers. You may notice multiple keywords that specify the same thing; that is intentional. H2R is made to be flexible and easy to understand. With a range, do you prefer "...", "through", or "to"? It's up to you: H2R supports all of them.
Your first Match
Every language tutorial starts with a "Hello World" program, so let's match the output of those programs. Matching is done using the keyword "match" followed by what you want to match.
match "Hello World"
The above statement will generate a regular expression that matches "Hello World". Any characters that are not valid as literals in a regular expression are escaped automatically, so you don't need to worry about them. H2R also supports block comments with /**/ and line comments with // or #, so you can explain why or what you intend to match.
match "Hello World" // matches the output of "Hello World" programs
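To make the idea of automatic escaping concrete, here is a minimal TypeScript sketch. The helper `escapeForRegex` is hypothetical and is not H2R's own implementation; it only illustrates the kind of regular expression a literal match corresponds to in a JavaScript engine.

```typescript
// Hypothetical helper (not part of H2R): escape characters that have a
// special meaning in a JavaScript regular expression.
function escapeForRegex(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Roughly what a literal match such as: match "Hello World" corresponds to.
const helloWorldPattern: RegExp = new RegExp(escapeForRegex("Hello World"));

// A literal containing special characters is matched verbatim once escaped.
const pricePattern: RegExp = new RegExp(escapeForRegex("$5.00 (approx.)"));

console.log(helloWorldPattern.test("Hello World")); // true
console.log(pricePattern.test("cost: $5.00 (approx.)")); // true
console.log(pricePattern.test("cost: $5x00 (approx.)")); // false
```

Without the escaping step, the `.` in the second literal would match any character and the parentheses would form a group, so the pattern would accept text that differs from the literal.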
Now what if we want to match every case variation of "Hello World", like "hello world" or "hELLO wORLD"? H2R supports the "or" operator, which lets you list several alternatives.
match "Hello World" or "hello world" or "hELLO wORLD"
Alternatively, you can add a "using" statement to make the whole match case insensitive.
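As a rough illustration, the "or" statement above corresponds to regex alternation. The pattern below is an assumed equivalent written by hand in TypeScript; the exact expression H2R generates (for example, how it groups the alternatives) may differ.

```typescript
// Assumed hand-written equivalent of:
//   match "Hello World" or "hello world" or "hELLO wORLD"
const greetingVariants: RegExp = /Hello World|hello world|hELLO wORLD/;

console.log(greetingVariants.test("hello world")); // true
console.log(greetingVariants.test("hELLO wORLD")); // true
console.log(greetingVariants.test("HELLO WORLD")); // false: this variant was not listed
```

The last line shows the limitation of listing alternatives: any case variation that is not spelled out is not matched.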
Using Specifiers
Using statements appear at the beginning. You may have one or more using statements, each of which can contain one or more specifiers. For example:
using global and case insensitive matching
or
using global
using case insensitive
The "matching" keyword is optional. The available flags are:
Specifier | Description | Regex flag |
---|---|---|
Multiline | Matches can cross line breaks | /<your regex>/m |
Global | Multiple matches are allowed | /<your regex>/g |
Case Sensitive | Match must be exact case | none |
Case Insensitive | Match may be any case | /<your regex>/i |
Exact | An exact statement matches a whole line exactly, nothing before, nothing after | /^<your regex>$/ |
To match any variation of hello world, we would then do the following:
using case insensitive matching
match "hello world"
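The effect of some of the specifiers in the table can be sketched with plain JavaScript-style regular expressions in TypeScript. These patterns are written by hand from the "Regex flag" column and are assumed equivalents, not captured H2R output.

```typescript
// Case Insensitive -> the i flag
const anyCaseHello: RegExp = /hello world/i;
console.log(anyCaseHello.test("hELLO wORLD")); // true

// Global -> the g flag: every match is returned, not only the first one.
const allHellos: string[] = "hello hello".match(/hello/g) || [];
console.log(allHellos.length); // 2

// Exact -> ^...$ anchors: nothing before, nothing after.
const exactHello: RegExp = /^hello world$/;
console.log(exactHello.test("hello world")); // true
console.log(exactHello.test("say hello world")); // false
```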
Matching multiple items
H2R offers two ways to match multiple items in a row. The first is to simply write multiple separate "match" statements:
match "hello"
match " "
match "world"
However, you can also use a comma, "and", or "then" for a more concise match:
match "hello", " ", "world"
or
match "hello" and " " and "world"
or
match "hello" then " " then "world"
or any combination, like
match "hello", " " and then "world"
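All of the forms above describe the same sequence, which in regex terms is simple concatenation. A small TypeScript sketch, with hypothetical variable names and assuming the items are plain literals without special characters:

```typescript
// The items of: match "hello", " ", "world"
const sequenceParts: string[] = ["hello", " ", "world"];

// Matching items in a row amounts to concatenating their patterns.
const sequencePattern: RegExp = new RegExp(sequenceParts.join(""));

console.log(sequencePattern.source); // hello world
console.log(sequencePattern.test("hello world")); // true
console.log(sequencePattern.test("helloworld")); // false: the space is required
```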
Optionality
Sometimes you want to match something that may or may not be present. In H2R, this is done with the "optional" or "optionally" keyword.
optionally match "hello world"
will match zero or one occurrence of "hello world". It can also be used alongside multiple items in a single "match" statement.
match "hello", optionally " ", "world"
will match "hello", a space if one is present, and "world". However, an "optional" at the start of a statement applies to the entire match statement. Thus,
optionally match "hello", " ", then "world"
makes the whole "hello world" optional rather than just the first "hello". If you want to make the first item optional but keep the rest required, use multiple "match" statements.
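The difference between an optional item and an optional statement can be sketched with two hand-written patterns in TypeScript. These are assumed equivalents rather than actual H2R output, and the `^` and `$` anchors are added here only to make the difference observable.

```typescript
// match "hello", optionally " ", "world"  -> only the space is optional
const optionalSpace: RegExp = /^hello ?world$/;

// optionally match "hello", " ", then "world"  -> the whole sequence is optional
const optionalAll: RegExp = /^(?:hello world)?$/;

console.log(optionalSpace.test("helloworld")); // true
console.log(optionalSpace.test("world")); // false: "hello" is still required
console.log(optionalAll.test("")); // true: the whole statement may be absent
console.log(optionalAll.test("hello")); // false: it is all or nothing
```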
Negation
You can negate a match with the "not" operator.
match not "hello world"
will match everything except for "hello world".
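One common way to express "everything except" in a regular expression is a negative lookahead. The TypeScript sketch below uses that technique; whether H2R emits this exact construct is an assumption, and the anchors are added only so that the behavior is easy to check.

```typescript
// Assumed translation of: match not "hello world"
// The negative lookahead (?!...) rejects input that is exactly "hello world".
const notHelloWorld: RegExp = /^(?!hello world$).*$/;

console.log(notHelloWorld.test("goodbye world")); // true
console.log(notHelloWorld.test("hello world")); // false
console.log(notHelloWorld.test("hello world!")); // true: not exactly "hello world"
```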
Other matching specifiers
Often you don't know exactly what you want to match. H2R comes with many specifiers that you can use in a match. For example, to match any word:
match a word
The "a" or "an" is optional. H2R supports the following specifiers:
Specifier | Description | Regex alternative |
---|---|---|
Anything | Matches any character | . |
Word(s) | Matches a word | \w+ |
Number(s) | Matches an integer | \d+ |
Character(s) | Matches any letter character | \w |
Digit(s) | Matches any digit character | \d |
Whitespace(s) | Matches any whitespace character | \s |
Boundary | Matches a word boundary | \b |
Line Feed | Matches a newline | \n |
Newline | Matches a newline | \n |
Carriage Return | Matches a carriage return | \r |
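To show how a few of these specifiers behave, the TypeScript sketch below applies the patterns from the "Regex alternative" column to a sample string. The sample text and variable names are made up for illustration.

```typescript
const sampleText: string = "Order 66 shipped";

// Word(s) -> \w+
const wordsFound: string[] = sampleText.match(/\w+/g) || [];
// Number(s) -> \d+
const numbersFound: string[] = sampleText.match(/\d+/g) || [];
// Whitespace -> \s
const spacesFound: string[] = sampleText.match(/\s/g) || [];
// Boundary -> \b (here: "ship" must start at a word boundary)
const startsWord: boolean = /\bship/.test(sampleText);

console.log(wordsFound.join(",")); // Order,66,shipped
console.log(numbersFound.join(",")); // 66
console.log(spacesFound.length); // 2
console.log(startsWord); // true
```

Note that `\w` also matches digits and underscores, which is why "66" appears among the words.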
You can also create ranges of characters to match. For example, to match any character between a and z, you could write any of the following:
match from "a" to "z" // from is optional
or
match between "a" and "z" // between is optional
or
match "a" ... "z" // can use ... or ..
or
match "a" - "z"
or
match "a" through "z" // can also use thru
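In regex terms, each of the range forms above describes a character class. A short TypeScript sketch, assuming the range corresponds to `[a-z]`:

```typescript
// Assumed equivalent of: match from "a" to "z"
const lowercaseLetter: RegExp = /[a-z]/;

console.log(lowercaseLetter.test("q")); // true
console.log(lowercaseLetter.test("Q")); // false: uppercase is outside the range
console.log(lowercaseLetter.test("7")); // false
```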
Repetition
TODO
String value | Numeric value |
---|---|
Zero | 0 |
One | 1 |
Two | 2 |
Three | 3 |
Four | 4 |
Five | 5 |
Six | 6 |
Seven | 7 |
Eight | 8 |
Nine | 9 |
Ten | 10 |
Grouping
TODO
Miscellaneous features
Unicode character properties
You can match specific Unicode characters using "\uXXXX" or "\UXXXXXXXX", where X is a hexadecimal digit.
match "\u0669" // matches Arabic-Indic digit nine "٩"
Unicode character classes and scripts can be matched using the "unicode" keyword.
match unicode "Latin" // matches any Latin character
match unicode "N" // matches any number character
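For comparison, here is how the same matches can be written directly as JavaScript regular expressions in TypeScript. JavaScript spells a script property as `\p{Script=Latin}` and requires the `u` flag; other engines use different spellings, so treat these patterns as assumed equivalents rather than H2R's exact output.

```typescript
// match "\u0669"
const arabicNine: RegExp = new RegExp("\\u0669");

// match unicode "Latin" (JavaScript spelling, needs the u flag)
const latinChar: RegExp = new RegExp("\\p{Script=Latin}", "u");

// match unicode "N" (any number character)
const numberChar: RegExp = new RegExp("\\p{N}", "u");

console.log(arabicNine.test("\u0669")); // true
console.log(latinChar.test("\u00e9")); // true: U+00E9 is a Latin letter
console.log(latinChar.test("\u0416")); // false: U+0416 is Cyrillic
console.log(numberChar.test("\u0669")); // true: U+0669 is a decimal number
```

The patterns are built with the `RegExp` constructor rather than literals so that the sketch does not depend on the compiler's target setting.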
The following Unicode class specifiers are available:
Class | Description | Class | Description | Class | Description | Class | Description | Class | Description | Class | Description |
---|---|---|---|---|---|---|---|---|---|---|---|
C | Other | Cc | Control | Cf | Format | Cn | Unassigned | Co | Private use | Cs | Surrogate |
L | Letter | Ll | Lower case letter | Lm | Modifier letter | Lo | Other letter | Lt | Title case letter | Lu | Upper case letter |
M | Mark | Mc | Spacing mark | Me | Enclosing mark | Mn | Non-spacing mark | N | Number | Nd | Decimal number |
Nl | Letter number | No | Other number | P | Punctuation | Pc | Connector punctuation | Pd | Dash punctuation | Pe | Close punctuation |
Pf | Final punctuation | Pi | Initial punctuation | Po | Other punctuation | Ps | Open punctuation | S | Symbol | Sc | Currency symbol |
Sk | Modifier symbol | Sm | Mathematical symbol | So | Other symbol | Z | Separator | Zl | Line separator | Zp | Paragraph separator |
Zs | Space separator |
The following Unicode script specifiers are available:
Note: Java and .NET require "Is" in front of the script name. For example, "IsLatin" rather than just "Latin"
Arabic | Armenian | Avestan | Balinese | Bamum |
Batak | Bengali | Bopomofo | Brahmi | Braille |
Buginese | Buhid | Canadian_Aboriginal | Carian | Chakma |
Cham | Cherokee | Common | Coptic | Cuneiform |
Cypriot | Cyrillic | Deseret | Devanagari | Egyptian_Hieroglyphs |
Ethiopic | Georgian | Glagolitic | Gothic | Greek |
Gujarati | Gurmukhi | Han | Hangul | Hanunoo |
Hebrew | Hiragana | Imperial_Aramaic | Inherited | Inscriptional_Pahlavi |
Inscriptional_Parthian | Javanese | Kaithi | Kannada | Katakana |
Kayah_Li | Kharoshthi | Khmer | Lao | Latin |
Lepcha | Limbu | Linear_B | Lisu | Lycian |
Lydian | Malayalam | Mandaic | Meetei_Mayek | Meroitic_Cursive |
Meroitic_Hieroglyphs | Miao | Mongolian | Myanmar | New_Tai_Lue |
Nko | Ogham | Old_Italic | Old_Persian | Old_South_Arabian |
Old_Turkic | Ol_Chiki | Oriya | Osmanya | Phags_Pa |
Phoenician | Rejang | Runic | Samaritan | Saurashtra |
Sharada | Shavian | Sinhala | Sora_Sompeng | Sundanese |
Syloti_Nagri | Syriac | Tagalog | Tagbanwa | Tai_Le |
Tai_Tham | Tai_Viet | Takri | Tamil | Telugu |
Thaana | Thai | Tibetan | Tifinagh | Ugaritic |
Vai | Yi |