Human2Regex Tutorial

Preface

Human2Regex (H2R) is a way to spell out a regular expression in an easy-to-read, easy-to-modify language. H2R supports multiple languages as well as many (though not all) regular expression options, such as named groups and quantifiers. You may notice multiple keywords specifying the same thing; that is intentional. H2R is made to be flexible and easy to understand. With a range, do you prefer "...", "through", or "to"? The choice is yours: H2R supports all of them.

Your first Match

Every language starts with a "Hello World" program, so let's match the output of those programs. Matching is done using the keyword "match" followed by what you want to match.

    match "Hello World"

The above statement will generate a regular expression that matches "Hello World". Any invalid characters are escaped automatically, so you don't need to worry about them. H2R also supports block comments with /**/ and line comments with // or #, so you can explain why or what you intend to match.

    match "Hello World" // matches the output of "Hello World" programs

Now what if we want to match every case variation of "Hello World" like "hello world" or "hELLO wORLD"? H2R supports the or operator, which allows you to specify many possible combinations.

    match "Hello World" or "hello world" or "hELLO wORLD"

Or, you can use a using statement to specify that you want it to be case insensitive.
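To see how an alternation like the or example behaves once it has become a regular expression, the sketch below runs a hand-written equivalent through Python's re module. The pattern is an assumption about the general shape of the result, not text produced by H2R, and Python serves only as a convenient regex engine; the helper name is_greeting is made up for this illustration.

```python
import re

# Hand-written stand-in for:
#   match "Hello World" or "hello world" or "hELLO wORLD"
# (assumed shape; the exact expression H2R emits may differ).
variants = ["Hello World", "hello world", "hELLO wORLD"]
pattern = re.compile("|".join(re.escape(v) for v in variants))


def is_greeting(text: str) -> bool:
    """Return True if the whole text is one of the listed variants."""
    return pattern.fullmatch(text) is not None


# Escaping matters once a literal contains a regex metacharacter:
# without it, the "." below would match any character.
escaped_literal = re.escape("Hello World.")
```

Only the three spelled-out variants are accepted here; a spelling such as "HELLO WORLD" is not, which is what motivates the case insensitive specifier.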

Using Specifiers

Using statements appear at the beginning. You may have one or more using statements, each of which can contain one or more specifiers. For example:

    using global and case insensitive matching

or

    using global
    using case insensitive

The matching keyword is optional. The available specifiers are:

Specifier	Description	Regex flag
Multiline	Matches can cross line breaks	/<your regex>/m
Global	Multiple matches are allowed	/<your regex>/g
Case Sensitive	Match must be exact case	none
Case Insensitive	Match may be any case	/<your regex>/i
Exact	An exact statement matches a whole line exactly, nothing before, nothing after	/^<your regex>$/

To match any variation of hello world, we would then do the following:

    using case insensitive matching
    match "hello world"
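As a rough illustration of what these specifiers mean on the regex side, here is a sketch using Python's re module. The patterns are written by hand from the flag table above rather than generated by H2R. Python has no direct counterpart to the g flag, so findall() stands in for "global" here.

```python
import re

# "case insensitive" corresponds to the i flag (re.IGNORECASE in Python).
pattern = re.compile(r"hello world", re.IGNORECASE)

# "global" has no Python flag; findall() returns every match instead.
text = "Hello World, hELLO wORLD and HELLO WORLD"
all_matches = pattern.findall(text)

# "exact" wraps the pattern in ^...$ so nothing may come before or after it.
exact = re.compile(r"^hello world$", re.IGNORECASE)
exact_hit = exact.search("HELLO world") is not None
exact_miss = exact.search("say hello world") is not None
```

With these inputs, all_matches holds the three greetings in order, exact_hit is true, and exact_miss is false.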

Matching multiple items

H2R comes with two options to match multiple items in a row. The first is to simply write multiple separate match statements like:

    match "hello"
    match " "
    match "world"

However, you can also use a comma, "and", or "then" for a more concise match.

    match "hello", " ", "world"

or

    match "hello" and " " and "world"

or

    match "hello" then " " then "world"

or any combination like

    match "hello", " " and then "world"
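All of these spellings describe the same sequence, so one would expect them to reduce to a plain concatenation. A minimal Python sketch of that idea follows, assuming each item is a literal; the joined pattern is written by hand and is not H2R output.

```python
import re

# Assumed equivalent of: match "hello", " ", "world"
# Each literal is escaped, then the pieces are concatenated in order.
parts = ["hello", " ", "world"]
sequence = re.compile("".join(re.escape(p) for p in parts))

matches_in_order = sequence.fullmatch("hello world") is not None
matches_reordered = sequence.fullmatch("world hello") is not None
```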

Optionality

Sometimes you wish to match something that may or may not exist. In H2R, this is done via the optional or optionally keyword.

    optionally match "hello world"

will match zero or one occurrence of "hello world". This can be used alongside matching multiple items in a single match statement.

    match "hello", optionally " ", "world"

will match "hello", an optional space if it exists, and "world". However, an optional at the start applies to the entire match statement. Thus,

    optionally match "hello", " ", then "world"

will actually make the whole "hello world" an optional match rather than just the first "hello". If you want to make the first match optional but keep the rest required, use multiple match statements.
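The difference between an optional item and an optional statement is easy to check with ordinary regular expressions. The two patterns below are hand-written guesses at the equivalent expressions, not H2R output, and are run here with Python's re module.

```python
import re

# match "hello", optionally " ", "world"
# Only the space is optional.
space_optional = re.compile(r"hello ?world")

# optionally match "hello", " ", then "world"
# The whole phrase is optional, so the empty string also matches.
phrase_optional = re.compile(r"(?:hello world)?")

checks = {
    "space_optional/helloworld": space_optional.fullmatch("helloworld") is not None,
    "space_optional/hello": space_optional.fullmatch("hello") is not None,
    "phrase_optional/empty": phrase_optional.fullmatch("") is not None,
    "phrase_optional/hello": phrase_optional.fullmatch("hello") is not None,
}
```

The first pattern accepts "helloworld" but never a bare "hello". The second accepts the empty string, but still rejects a partial phrase.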

Negation

You can negate a match with the not operator.

    match not "hello world"

will match everything except for "hello world".

Other matching specifiers

Many times you don't know exactly what you wish to match. H2R comes with many specifiers that you can use for your matching. For example, you may wish to match any word. You can do that with:

    match a word

The a or an is optional. The possible specifiers that H2R supports are the following:

Specifier	Description	Regex alternative
Anything	Matches any character	.
Word(s)	Matches a word	\w+
Number(s)	Matches an integer	\d+
Character(s)	Matches any letter character	\w
Digit(s)	Matches any digit character	\d
Whitespace(s)	Matches any whitespace character	\s
Boundary	Matches a word boundary	\b
Line Feed	Matches a newline	\n
Newline	Matches a newline	\n
Carriage Return	Matches a carriage return	\r
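To get a feel for the regex alternatives in the table, the sketch below applies a few of them to a sample string with Python's re module. The sample text is invented for illustration. Note that in Python, as in most engines, \w also covers digits and underscores, so "66" counts as a word here.

```python
import re

text = "order 66 shipped"

words = re.findall(r"\w+", text)        # Word(s)
numbers = re.findall(r"\d+", text)      # Number(s)
spaces = re.findall(r"\s", text)        # Whitespace(s)

# Boundary: \b keeps "66" from matching inside a longer token like "1660".
standalone = re.search(r"\b66\b", text) is not None
embedded = re.search(r"\b66\b", "order 1660") is not None
```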

You can also create ranges of characters to match. Say, for example, you wanted to match any character between a and z; you could write any of the following:

    match from "a" to "z" // from is optional

or

    match between "a" and "z" // between is optional

or

    match "a" ... "z" // can use ... or ..

or

    match "a" - "z"

or

    match "a" through "z" // can also use thru
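In regex terms, a range like this is normally written as a character class. The sketch below assumes the a-to-z range corresponds to [a-z] (a hand-written guess, not verified H2R output) and checks it with Python's re module.

```python
import re

# Assumed equivalent of: match from "a" to "z"
lowercase = re.compile(r"[a-z]")

in_range = lowercase.fullmatch("m") is not None
upper_case = lowercase.fullmatch("M") is not None
digit = lowercase.fullmatch("5") is not None
```

Only the lowercase letter falls inside the range; the uppercase letter and the digit do not.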

Repetition

TODO

String value	Numeric value
Zero	0
One	1
Two	2
Three	3
Four	4
Five	5
Six	6
Seven	7
Eight	8
Nine	9
Ten	10

Grouping

TODO

Miscellaneous features

Unicode character properties

You can match specific Unicode sequences using "\uXXXX" or "\UXXXXXXXX" where X is a hexadecimal digit.

    match "\u0669" // matches Arabic digit 9 "٩"

Unicode character classes/scripts can be matched using the unicode keyword.

    match unicode "Latin" // matches any Latin character
    match unicode "N" // matches any number character

The following Unicode class specifiers are available:

Class	Description	Class	Description	Class	Description	Class	Description	Class	Description	Class	Description
C	Other	Cc	Control	Cf	Format	Cn	Unassigned	Co	Private use	Cs	Surrogate
L	Letter	Ll	Lower case letter	Lm	Modifier letter	Lo	Other letter	Lt	Title case letter	Lu	Upper case letter
M	Mark	Mc	Spacing mark	Me	Enclosing mark	Mn	Non-spacing mark	N	Number	Nd	Decimal number
Nl	Letter number	No	Other number	P	Punctuation	Pc	Connector punctuation	Pd	Dash punctuation	Pe	Close punctuation
Pf	Final punctuation	Pi	Initial punctuation	Po	Other punctuation	Ps	Open punctuation	S	Symbol	Sc	Currency symbol
Sk	Modifier symbol	Sm	Mathematical symbol	So	Other symbol	Z	Separator	Zl	Line separator	Zp	Paragraph separator
Zs	Space separator
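The class codes in this table are the standard Unicode general categories, so they can be explored outside of H2R. Python's built-in re module does not support property classes, so the sketch below uses the standard unicodedata module instead; the helper in_class is a made-up name for this illustration and says nothing about how H2R implements the unicode keyword.

```python
import unicodedata

# The same escape as in the tutorial example: U+0669.
digit = "\u0669"
digit_name = unicodedata.name(digit)
digit_category = unicodedata.category(digit)  # two-letter class code


def in_class(ch: str, cls: str) -> bool:
    """Return True if ch belongs to a one- or two-letter class such as "N" or "Nd"."""
    return unicodedata.category(ch).startswith(cls)


is_number = in_class(digit, "N")     # broad class: any number
is_decimal = in_class(digit, "Nd")   # narrow class: decimal number
is_letter = in_class(digit, "L")     # not a letter
```

A one-letter class such as "N" covers every two-letter class that starts with it (Nd, Nl, No), which is why the prefix check is enough.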

The following Unicode script specifiers are available:

Note: Java and .NET require "Is" in front of the script name, for example "IsLatin" rather than just "Latin".

Arabic	Armenian	Avestan	Balinese	Bamum
Batak	Bengali	Bopomofo	Brahmi	Braille
Buginese	Buhid	Canadian_Aboriginal	Carian	Chakma
Cham	Cherokee	Common	Coptic	Cuneiform
Cypriot	Cyrillic	Deseret	Devanagari	Egyptian_Hieroglyphs
Ethiopic	Georgian	Glagolitic	Gothic	Greek
Gujarati	Gurmukhi	Han	Hangul	Hanunoo
Hebrew	Hiragana	Imperial_Aramaic	Inherited	Inscriptional_Pahlavi
Inscriptional_Parthian	Javanese	Kaithi	Kannada	Katakana
Kayah_Li	Kharoshthi	Khmer	Lao	Latin
Lepcha	Limbu	Linear_B	Lisu	Lycian
Lydian	Malayalam	Mandaic	Meetei_Mayek	Meroitic_Cursive
Meroitic_Hieroglyphs	Miao	Mongolian	Myanmar	New_Tai_Lue
Nko	Ogham	Old_Italic	Old_Persian	Old_South_Arabian
Old_Turkic	Ol_Chiki	Oriya	Osmanya	Phags_Pa
Phoenician	Rejang	Runic	Samaritan	Saurashtra
Sharada	Shavian	Sinhala	Sora_Sompeng	Sundanese
Syloti_Nagri	Syriac	Tagalog	Tagbanwa	Tai_Le
Tai_Tham	Tai_Viet	Takri	Tamil	Telugu
Thaana	Thai	Tibetan	Tifinagh	Ugaritic
Vai	Yi