<!DOCTYPE html><html lang="en" dir="ltr"><head><meta name="viewport" content="width=device-width,initial-scale=1,shrink-to-fit=no"><meta name="description" content="Create regular expressions with natural, human language"><meta name="keywords" content="Human2Regex, Human, Regex, Natural, Language, Natural Language"><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><title>Human2Regex Tutorial</title><link href="bundle.min.css" rel="stylesheet" type="text/css"><meta name="theme-color" content="#212529"><meta name="apple-mobile-web-app-capable" content="yes"><meta name="apple-mobile-web-app-status-bar-style" content="default"><link rel="icon" type="image/x-icon" href="favicon.ico"></head><body><a class="skip skip-top" href="#maincontent">Skip to main content</a><div class="wrapper"><nav class="navbar navbar-expand-lg navbar-light fixed-top" id="mainNav"><div class="container"><a class="navbar-brand" href="index.html"><img src="favicon-small.png" width="30" height="30" class="d-inline-block align-top" alt="logo"> Human2Regex</a><div class="float-right heading-links"><a class="heading-link" href="index.html">Index</a> <span> | </span> <a class="heading-link" href="tutorial.html">Tutorial</a></div></div></nav><div class="container contained-container" id="maincontent" role="main"><div id="tutorial"><h2>Tutorial</h2><br><p class="font-weight-bold" id="tut-preface">0. Preface</p><p>Human2Regex (H2R) is a way to spell out a regular expression in an easy to read, easy to modify language. H2R supports multiple languages as well as many (though not all) different regular expression options such as named groups and quantifiers. You may notice multiple keywords specifying the same thing, and that is intended! Just like how in English there are many ways to express yourself, H2R is made to be flexible and easy to understand. With a range, do you prefer "...", "through", or "to"? 
It's up to you: H2R supports all of them!</p><br><p class="font-weight-bold" id="tut-first-match">1. Your first match</p><p>Every language starts with a "Hello World" program, so let's match the output of those programs. Matching is done using the keyword <code class="cm-s-idea">match</code> followed by what you want to match. <span class="tutorial-code"><code class="cm-s-idea">match "Hello World"</code></span> The above statement will generate a regular expression that matches "Hello World", like "/Hello World/". Any characters that would be invalid in a regular expression are escaped automatically, so you don't need to worry about them. H2R also supports block comments with <code class="cm-s-idea">/**/</code>, and line comments with <code class="cm-s-idea">//</code> or <code class="cm-s-idea">#</code>, so you can explain what you intend to match and why.</p><pre class="tutorial-code"><code class="cm-s-idea">/* This is a block comment */
match "Hello World" // matches the output of "Hello World" programs
</code></pre><p>Now what if we want to match every case variation of "Hello World" like "hello world" or "hELLO wORLD"? H2R supports the <code class="cm-s-idea">or</code> operator which allows you to specify many possible combinations. <span class="tutorial-code"><code class="cm-s-idea">match "Hello World" or "hello world" or "hELLO wORLD"</code></span> Or, you can use a <code class="cm-s-idea">using</code> statement to specify that you want it to be case insensitive.</p><br><p class="font-weight-bold" id="tut-using">2. Using Specifiers</p><p>Using statements appear at the beginning. You may have one or more using statements which each can contain one or more specifiers. For example: <span class="tutorial-code"><code class="cm-s-idea">using global and case insensitive matching</code></span> or</p><pre class="tutorial-code">
<code class="cm-s-idea">using global
using case insensitive
</code></pre><p>The <code class="cm-s-idea">matching</code> keyword is optional. The available specifiers are:</p><table class="table table-sm table-striped table-bordered"><thead><tr><th scope="col">Specifier</th><th scope="col">Description</th><th scope="col">Regex flag</th></tr></thead><tbody><tr><td><code class="cm-s-idea">multiline</code></td><td>Matches can cross line breaks</td><td>/&lt;your regex&gt;/m</td></tr><tr><td><code class="cm-s-idea">global</code></td><td>Multiple matches are allowed</td><td>/&lt;your regex&gt;/g</td></tr><tr><td><code class="cm-s-idea">case sensitive</code></td><td>Match must be exact case</td><td><span class="font-italic">none</span></td></tr><tr><td><code class="cm-s-idea">case insensitive</code></td><td>Match may be any case</td><td>/&lt;your regex&gt;/i</td></tr><tr><td><code class="cm-s-idea">exact</code></td><td>An exact statement matches a whole line exactly, nothing before, nothing after</td><td>/^&lt;your regex&gt;$/</td></tr></tbody></table><p>To match any variation of hello world, we would then do the following:</p><pre class="tutorial-code"><code class="cm-s-idea">using case insensitive matching
match "hello world"
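// Sketch of the expected output (an assumption based on the flag table above): a regex like /hello world/i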
</code></pre><br><p class="font-weight-bold" id="tut-multiple-match">3. Matching multiple items</p><p>H2R comes with 2 options to match multiple items in a row. The first is to simply write multiple separate <code class="cm-s-idea">match</code> statements like:</p><pre class="tutorial-code">
<code class="cm-s-idea">match "hello"
match " "
match "world"
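// Assumed output: the three statements are joined in order, giving a regex like /hello world/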
</code></pre><p>However, you can also use a comma, <code class="cm-s-idea">and</code>, or <code class="cm-s-idea">then</code> for a more concise match. <span class="tutorial-code"><code class="cm-s-idea">match "hello", " ", "world"</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match "hello" and " " and "world"</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match "hello" then " " then "world"</code></span> or any combination like <span class="tutorial-code"><code class="cm-s-idea">match "hello", " " and then "world"</code></span><br></p><p class="font-weight-bold" id="tut-optionality">4. Optionality</p><p>Sometimes you wish to match something that may or may not exist. In H2R, this is done via the <code class="cm-s-idea">optional</code>, <code class="cm-s-idea">optionally</code>, <code class="cm-s-idea">possibly</code> or <code class="cm-s-idea">maybe</code> keyword. <span class="tutorial-code"><code class="cm-s-idea">optionally match "hello world"</code></span> will match "hello world" 0 or 1 times. This can be combined with matching multiple items in a single <code class="cm-s-idea">match</code> statement. <span class="tutorial-code"><code class="cm-s-idea">match "hello", maybe " ", "world"</code></span> will match "hello", an optional space if it exists, and "world". However, an <code class="cm-s-idea">optional</code> placed at the start applies to the entire match statement. Thus, <span class="tutorial-code"><code class="cm-s-idea">possibly match "hello", " ", then "world"</code></span> will actually make the whole "hello world" an optional match rather than just the first "hello". If you want to make the first match optional but keep the rest required, place the <code class="cm-s-idea">optional</code> immediately after the <code class="cm-s-idea">match</code>.</p><br><p class="font-weight-bold" id="tut-negation">5. 
Negation</p><p>You can negate a match with the operator <code class="cm-s-idea">not</code> <span class="tutorial-code"><code class="cm-s-idea">match not "hello world"</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match anything but "hello world"</code></span> will match everything except for "hello world".</p><br><p class="font-weight-bold" id="tut-other-match">6. Other matching specifiers</p><p>Many times you don't know exactly what you wish to match. H2R comes with many specifiers that you can use for your matching. For example, you may wish to match any word. You can do that with: <span class="tutorial-code"><code class="cm-s-idea">match a word</code></span> The <code class="cm-s-idea">a</code> or <code class="cm-s-idea">an</code> is optional. The possible specifiers that H2R supports are the following:</p><table class="table table-sm table-striped table-bordered"><thead><tr><th scope="col">Specifier</th><th scope="col">Description</th><th scope="col">Regex alternative</th><th scope="col">Note</th></tr></thead><tbody><tr><td><code class="cm-s-idea">anything</code></td><td>Matches any character</td><td>.</td><td> </td></tr><tr><td><code class="cm-s-idea">word(s)</code></td><td>Matches many a-z, A-Z, _, or digit characters</td><td>\w+</td><td>For a-z only, use <code class="cm-s-idea">letter(s)</code></td></tr><tr><td><code class="cm-s-idea">letter(s)</code></td><td>Matches any letter character</td><td>[a-zA-Z]</td><td> </td></tr><tr><td><code class="cm-s-idea">number(s)</code></td><td>Matches a string of digit characters</td><td>\d+</td><td> </td></tr><tr><td><code class="cm-s-idea">digit(s)</code></td><td>Matches any digit character</td><td>\d</td><td> </td></tr><tr><td><code class="cm-s-idea">integer(s)</code></td><td>Matches an integer</td><td>[+-]?\d+</td><td> </td></tr><tr><td><code class="cm-s-idea">decimal(s)</code></td><td>Matches digits, an optional decimal point and more 
digits</td><td>[+-]?((\d+[,.]?\d*)|([,.]\d+))</td><td>Supports both "," and "." decimal points</td></tr><tr><td><code class="cm-s-idea">character(s)</code></td><td>Matches a-z, A-Z, _, or digits</td><td>\w</td><td>For letters only, use <code class="cm-s-idea">letter(s)</code></td></tr><tr><td><code class="cm-s-idea">whitespace(s)</code></td><td>Matches any whitespace character</td><td>\s</td><td>&nbsp;</td></tr><tr><td><code class="cm-s-idea">(word )boundary</code></td><td>Matches the boundary at the start or end of a word</td><td>\b</td><td>&nbsp;</td></tr><tr><td><code class="cm-s-idea">line feed</code>/<code class="cm-s-idea">newline</code></td><td>Matches a newline</td><td>\n</td><td>&nbsp;</td></tr><tr><td><code class="cm-s-idea">carriage return</code></td><td>Matches a carriage return</td><td>\r</td><td>&nbsp;</td></tr></tbody></table><p>You can also create ranges of characters to match. For example, to match any character between a and z, you could write any of the following: <span class="tutorial-code"><code class="cm-s-idea">match from "a" to "z" // "from" is optional</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match between "a" and "z" // "between" is optional</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match "a" ... "z" // can use "..." or ".."</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match "a" - "z"</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match "a" through "z" // can also use thru</code></span><br></p><p class="font-weight-bold" id="tut-repeition">7. Repetition</p><p>H2R supports 2 types of repetition: single-match repetition and grouped repetition. When using <code class="cm-s-idea">match</code>, you can specify the number of repetitions you want just before the item to match. <span class="tutorial-code"><code class="cm-s-idea">match 2 digits</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match exactly 2 digits</code></span> will match any 2 digits in a row. 
You can also specify a range of repetitions: <span class="tutorial-code"><code class="cm-s-idea">match 2 ... 5 digits</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match 2 to 5 digits</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match between 2 to 5 digits</code></span> will match 2, 3, 4, or 5 digits. You can specify whether the final number is included with the <code class="cm-s-idea">exclusive</code> or <code class="cm-s-idea">inclusive</code> keywords. <span class="tutorial-code"><code class="cm-s-idea">match 2 to 5 exclusive digits</code></span> will only match up to 4 digits. You can also choose to leave the end unspecified. <span class="tutorial-code"><code class="cm-s-idea">match 2+ digits</code></span> or <span class="tutorial-code"><code class="cm-s-idea">match 2 or more digits</code></span> will match 2 or more digits. Repetition can be chained with the <code class="cm-s-idea">and then</code> keywords or the <code class="cm-s-idea">optional</code> keyword. For example: <span class="tutorial-code"><code class="cm-s-idea">match 1+ digits then optionally "." then optionally 0...8 digits</code></span> Suppose you want to repeat a group of these match statements. You can group a repetition using the <code class="cm-s-idea">repeat</code> keyword. Everything indented (scoped) underneath it will be repeated. By default, this will match 0 or more of the following statements.</p><pre class="tutorial-code">
<code class="cm-s-idea">repeat
	match "Hello "
match "World"
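// Roughly equivalent regex (an assumption, not verified H2R output): /(?:Hello )*World/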
</code></pre><p>This will match 0 or more "Hello "s followed by exactly 1 "World". The same qualifiers that exist for <code class="cm-s-idea">match</code> statements also exist for <code class="cm-s-idea">repeat</code> statements.</p><pre class="tutorial-code">
<code class="cm-s-idea">optionally repeat 3...7 times
	match "Hello World"
</code></pre><p>This will optionally match "Hello World" between 3 and 7 times. H2R also accepts the following number words: <code class="cm-s-idea">One, Two, Three, Four, Five, Six, Seven, Eight, Nine, and Ten</code></p><br><p class="font-weight-bold" id="tut-grouping">8. Grouping</p><p>Just like regular expressions, capture groups are supported in H2R. Each group is defined using the <code class="cm-s-idea">create a group</code> keyphrase.</p><pre class="tutorial-code">
<code class="cm-s-idea">create a group
	match "Hello World"
</code></pre><p>This will create an unnamed capture group, equivalent to the regular expression "/(Hello World)/". An unnamed capture group will show up in your chosen language's matches but will not be given a name, so to access it you will need to know the index of the group. Most regular expression engines support named capture groups, and H2R highly recommends using this feature. To do so, simply give the group a name:</p><p><pre class="tutorial-code">
<code class="cm-s-idea">create a group called TestGroup
	match "Hello World"
</code></pre></p><p>In most languages, a named group can be accessed through the match result's group list. Take for example, in JavaScript,<pre class="tutorial-code">
<code class="cm-s-idea">"hello".match(/(?&lt;TestGroup&gt;hello)/).groups</code>
</pre></p><p>Will return an object with {TestGroup: "hello"}. For another example, check out <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match#Using_named_capturing_groups">MDN web docs</a>. Groups can also be optional.</p><pre class="tutorial-code">
<code class="cm-s-idea">create an optional group
	match "Hello World"
</code></pre><p>And groups may be nested</p><pre class="tutorial-code">
<code class="cm-s-idea">create a group called TestGroup
	match "Hello"
	create a group called InnerGroup
		match "World"
</code></pre><p>The regular expression returned by this will be "/(?&lt;TestGroup&gt;Hello(?&lt;InnerGroup&gt;World))/". Again, in JavaScript, the following</p><pre class="tutorial-code">
<code class="cm-s-idea">"HelloWorld".match(/(?&lt;TestGroup&gt;Hello(?&lt;InnerGroup&gt;World))/).groups</code>
</pre><p>Will return an object with {TestGroup: "HelloWorld", InnerGroup: "World"}.</p><br><h3 id="tut-final">Putting it all together</h3><p>Grouping, repetition, and matching are the 3 primary elements that make up H2R. They can be combined in any way to generate a regular expression. See the <a href="index.html">main page</a> for an example that combines all above to parse a URL.</p><h3>Miscellaneous features</h3><p class="font-weight-bold" id="tut-unicode">Unicode character properties</p><p>You can match specific unicode sequences using <code class="cm-s-idea">"\uXXXX"</code> or <code class="cm-s-idea">"\UXXXXXXXX"</code> where X is a hexadecimal character. <span class="tutorial-code"><code class="cm-s-idea">match "\u0669" // matches arabic digit 9 "٩"</code></span> Unicode character classes/scripts can be matched using the <code class="cm-s-idea">unicode</code> keyword. <span class="tutorial-code"><code class="cm-s-idea">match unicode "Latin" // matches any latin character</code></span> <span class="tutorial-code"><code class="cm-s-idea">match unicode "N" // matches any number character</code></span> The following Unicode class specifiers are available:</p><table class="table table-sm table-striped table-bordered"><thead><tr><th scope="col">Class</th><th scope="col">Description</th></tr></thead><tbody><tr><td>C</td><td>Other</td></tr><tr><td>Cc</td><td>Control</td></tr><tr><td>Cf</td><td>Format</td></tr><tr><td>Cn</td><td>Unassigned</td></tr><tr><td>Co</td><td>Private use</td></tr><tr><td>Cs</td><td>Surrogate</td></tr><tr><td>L</td><td>Letter</td></tr><tr><td>Ll</td><td>Lower case letter</td></tr><tr><td>Lm</td><td>Modifier letter</td></tr><tr><td>Lo</td><td>Other letter</td></tr><tr><td>Lt</td><td>Title case letter</td></tr><tr><td>Lu</td><td>Upper case letter</td></tr><tr><td>M</td><td>Mark</td></tr><tr><td>Mc</td><td>Spacing mark</td></tr><tr><td>Me</td><td>Enclosing mark</td></tr><tr><td>Mn</td><td>Non-spacing 
mark</td></tr><tr><td>N</td><td>Number</td></tr><tr><td>Nd</td><td>Decimal number</td></tr><tr><td>Nl</td><td>Letter number</td></tr><tr><td>No</td><td>Other number</td></tr><tr><td>P</td><td>Punctuation</td></tr><tr><td>Pc</td><td>Connector punctuation</td></tr><tr><td>Pd</td><td>Dash punctuation</td></tr><tr><td>Pe</td><td>Close punctuation</td></tr><tr><td>Pf</td><td>Final punctuation</td></tr><tr><td>Pi</td><td>Initial punctuation</td></tr><tr><td>Po</td><td>Other punctuation</td></tr><tr><td>Ps</td><td>Open punctuation</td></tr><tr><td>S</td><td>Symbol</td></tr><tr><td>Sc</td><td>Currency symbol</td></tr><tr><td>Sk</td><td>Modifier symbol</td></tr><tr><td>Sm</td><td>Mathematical symbol</td></tr><tr><td>So</td><td>Other symbol</td></tr><tr><td>Z</td><td>Separator</td></tr><tr><td>Zl</td><td>Line separator</td></tr><tr><td>Zp</td><td>Paragraph separator</td></tr><tr><td>Zs</td><td>Space separator</td></tr></tbody></table><p>The following Unicode script specifiers are available:</p><p>Note: Java and .NET require "Is" in front of the script name. 
For example, "IsLatin" rather than just "Latin".</p><table class="table table-sm table-striped table-bordered"><tbody><tr><td>Arabic</td><td>Armenian</td><td>Avestan</td><td>Balinese</td><td>Bamum</td></tr><tr><td>Batak</td><td>Bengali</td><td>Bopomofo</td><td>Brahmi</td><td>Braille</td></tr><tr><td>Buginese</td><td>Buhid</td><td>Canadian_Aboriginal</td><td>Carian</td><td>Chakma</td></tr><tr><td>Cham</td><td>Cherokee</td><td>Common</td><td>Coptic</td><td>Cuneiform</td></tr><tr><td>Cypriot</td><td>Cyrillic</td><td>Deseret</td><td>Devanagari</td><td>Egyptian_Hieroglyphs</td></tr><tr><td>Ethiopic</td><td>Georgian</td><td>Glagolitic</td><td>Gothic</td><td>Greek</td></tr><tr><td>Gujarati</td><td>Gurmukhi</td><td>Han</td><td>Hangul</td><td>Hanunoo</td></tr><tr><td>Hebrew</td><td>Hiragana</td><td>Imperial_Aramaic</td><td>Inherited</td><td>Inscriptional_Pahlavi</td></tr><tr><td>Inscriptional_Parthian</td><td>Javanese</td><td>Kaithi</td><td>Kannada</td><td>Katakana</td></tr><tr><td>Kayah_Li</td><td>Kharoshthi</td><td>Khmer</td><td>Lao</td><td>Latin</td></tr><tr><td>Lepcha</td><td>Limbu</td><td>Linear_B</td><td>Lisu</td><td>Lycian</td></tr><tr><td>Lydian</td><td>Malayalam</td><td>Mandaic</td><td>Meetei_Mayek</td><td>Meroitic_Cursive</td></tr><tr><td>Meroitic_Hieroglyphs</td><td>Miao</td><td>Mongolian</td><td>Myanmar</td><td>New_Tai_Lue</td></tr><tr><td>Nko</td><td>Ogham</td><td>Old_Italic</td><td>Old_Persian</td><td>Old_South_Arabian</td></tr><tr><td>Old_Turkic</td><td>Ol_Chiki</td><td>Oriya</td><td>Osmanya</td><td>Phags_Pa</td></tr><tr><td>Phoenician</td><td>Rejang</td><td>Runic</td><td>Samaritan</td><td>Saurashtra</td></tr><tr><td>Sharada</td><td>Shavian</td><td>Sinhala</td><td>Sora_Sompeng</td><td>Sundanese</td></tr><tr><td>Syloti_Nagri</td><td>Syriac</td><td>Tagalog</td><td>Tagbanwa</td><td>Tai_Le</td></tr><tr><td>Tai_Tham</td><td>Tai_Viet</td><td>Takri</td><td>Tamil</td><td>Telugu</td></tr><tr><td>Thaana</td><td>Thai</td><td>Tibetan</td><td>Tifinagh</td><td>Ugaritic</td></tr>
<tr><td>Vai</td><td>Yi</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr></tbody></table></div></div><footer><div class="container"><div class="row"><div class="col-lg-8 col-md-10 mx-auto"><p class="copyright">Copyright © 2020 Patrick Demian. This page's source code is available at <a rel="noopener noreferrer" href="https://github.com/pdemian/human2regex">github.com/pdemian/human2regex</a></p></div></div></div></footer></div><script defer="defer" src="bundle.min.js"></script></body></html>