NAME
Regexp::Common::Markdown - Markdown Common Regular Expressions
SYNOPSIS
use Regexp::Common qw( Markdown );
while( <> )
{
    my $pos = pos( $_ ) // 0;
    /\G$RE{Markdown}{Header}/gmc and print "Found a header at pos $pos\n";
    /\G$RE{Markdown}{Bold}/gmc and print "Found bold text at pos $pos\n";
}
VERSION
v0.1.1
DESCRIPTION
This module provides Markdown regular expressions as set out by Markdown's original author, John Gruber.
There are two types of patterns: vanilla and extended. To get the extended regular expressions, use the -extended switch.
You can use each regular expression by its name: Bold, Blockquote, CodeBlock, CodeLine, CodeSpan, Em, HtmlOpen, HtmlClose, HtmlEmpty, Header, HeaderLine, Image, ImageRef, Line, Link, LinkAuto, LinkDefinition, LinkRef, List.
Almost all of the regular expressions use named capture. See "%+" in perlvar for more information on named capture.
For example:
if( $text =~ /$RE{Markdown}{LinkAuto}/ )
{
    print( "Found https url \"$+{link_https}\"\n" ) if( $+{link_https} );
    print( "Found file url \"$+{link_file}\"\n" ) if( $+{link_file} );
    print( "Found ftp url \"$+{link_ftp}\"\n" ) if( $+{link_ftp} );
    print( "Found e-mail address \"$+{link_mailto}\"\n" ) if( $+{link_mailto} );
    print( "Found phone number \"$+{link_tel}\"\n" ) if( $+{link_tel} );
    # The URI module must be loaded beforehand with: use URI;
    my $url = URI->new( $+{link_https} );
}
As a general rule, Markdown requires that the text being parsed be de-tabbed, i.e. with its tabs converted into 4 spaces. These regular expressions reflect this principle.
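As an illustration of de-tabbing, the sketch below expands every tab up to the next 4-column tab stop before any pattern is applied. The helper name detab and the variable $TAB_WIDTH are hypothetical and are not provided by this module; the substitution is modelled on the approach used by the original Markdown.pl script.

```perl
use strict;
use warnings;

# Hypothetical helper, not provided by Regexp::Common::Markdown.
my $TAB_WIDTH = 4;

sub detab
{
    my $text = shift;
    # Each tab is replaced by the number of spaces needed to reach
    # the next tab stop on its line.
    $text =~ s{(.*?)\t}{ $1 . ( ' ' x ( $TAB_WIDTH - length( $1 ) % $TAB_WIDTH ) ) }ge;
    return( $text );
}

my $text = detab( "\tcode line\nab\tcd\n" );
```

The de-tabbed $text can then be matched against any of the patterns described below.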
STANDARD MARKDOWN
$RE{Markdown}
This returns a pattern that recognises any of the supported vanilla Markdown formatting.
If you pass the -extended parameter, some patterns are added and some of the vanilla regular expressions are replaced by their extended counterparts, such as ExtAbbr, ExtCodeBlock, ExtLink and ExtAttributes.
Blockquote
$RE{Markdown}{Blockquote}
For example:
> foo
>
> > bar
>
> foo
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/TdKq0K/1/tests
The capture names are:
-
bquote_all
The entire capture of the blockquote.
-
bquote_other
The inner content of the blockquote.
See also Markdown::Parser::Blockquote
Bold
$RE{Markdown}{Bold}
For example:
**This is a text in bold.**
__And so is this.__
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/Jp2Kos/2/tests
The capture names are:
-
bold_all
The entire capture of the text in bold including the enclosing marker, which can be either ** or __
-
bold_text
The text within the markers.
-
bold_type
The marker type used to highlight the text. This can be either ** or __
See also Markdown::Parser::Bold
Code Block
$RE{Markdown}{CodeBlock}
For example:
```
Some text
Indented code block sample code
```
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/M6W99K/1/tests
The capture names are:
-
code_all
The entire capture of the code block, including the enclosing markers, such as ```
-
code_content
The content of the code enclosed within the 2 markers.
-
code_start
The enclosing marker used to mark the code. Typically ```.
-
code_trailing_new_line
The possible trailing new lines. This is used to detect whether any were captured, in order to put them back in the parsed text for the next markdown element, since the last new lines of one markdown element are also the first new lines of the next one, and new lines are used to delimit markdown elements.
See also Markdown::Parser::Code
Code Line
$RE{Markdown}{CodeLine}
For example:
    the lines in this block
    all contain trailing spaces
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/toEboU/1/tests
The capture names are:
-
code_after
This contains the data that follows the code block.
-
code_all
The entire capture of the code lines.
-
code_content
The content of the code.
-
code_prefix
This contains the leading spaces used to mark the code as code.
See also Markdown::Parser::Code
Code Span
$RE{Markdown}{CodeSpan}
For example:
This is some `inline code`
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/C2Vl9M/1/tests
The capture names are:
-
code_all
The entire capture of the code span.
-
code_start
Contains the marker that delimits the inline code. The delimiter is `
-
code_content
The content of the code.
See also Markdown::Parser::Code
Emphasis
$RE{Markdown}{Em}
For example:
This routine parameter is _test_
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/eDb6RN/2/tests
See also Markdown::Parser::Emphasis
Header
$RE{Markdown}{Header}
For example:
### This is a H3 Header
### And so is this one ###
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/9uQwBk/2/tests
The capture names are:
-
header_all
The entire capture of the header.
-
header_content
The text that is enclosed in the header marker.
-
header_level
This contains all the hash signs (#) that precede the text. The number of hash signs indicates the level of the header. Thus, you could do something like this:
length( $+{header_level} );
See also Markdown::Parser::Header
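As a minimal sketch of the header_level technique, the following turns a matched header into an HTML heading. The helper name header_to_html is hypothetical and not part of this module; only the pattern and capture names come from this documentation, and the content is not HTML-escaped here.

```perl
use strict;
use warnings;
use Regexp::Common qw( Markdown );

# Hypothetical helper: builds an HTML heading from the two captures.
sub header_to_html
{
    my( $level_marker, $content ) = @_;
    # The heading level is the number of characters in header_level
    my $level = length( $level_marker );
    return( sprintf( '<h%d>%s</h%d>', $level, $content, $level ) );
}

my $text = "### This is a H3 Header\n";
if( $text =~ /$RE{Markdown}{Header}/ )
{
    print( header_to_html( $+{header_level}, $+{header_content} ), "\n" );
}
```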
Header Line
$RE{Markdown}{HeaderLine}
For example:
This is an H1 header
====================
And this is a H2
-----------
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/sQLEqz/2/tests
The capture names are:
-
header_all
The entire capture of the header.
-
header_content
The text that is enclosed in the header marker.
-
header_type
This contains the marker line used to mark the line above as header.
A line using = is a header of level 1, while a line using - is a header of level 2.
See also Markdown::Parser::Header
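The mapping from header_type to a header level can be sketched as follows. The helper name header_line_level is hypothetical, and the sketch assumes that header_type starts with the marker character itself.

```perl
use strict;
use warnings;
use Regexp::Common qw( Markdown );

# Hypothetical helper: maps the marker line to a header level.
sub header_line_level
{
    my $type = shift;
    return unless( defined( $type ) );
    # A line of "=" marks a level 1 header, a line of "-" a level 2 header
    return( 1 ) if( index( $type, '=' ) == 0 );
    return( 2 ) if( index( $type, '-' ) == 0 );
    return;
}

my $text = "This is an H1 header\n====================\n";
if( $text =~ /$RE{Markdown}{HeaderLine}/ )
{
    my $level = header_line_level( $+{header_type} );
    print( "Level $level: $+{header_content}\n" ) if( defined( $level ) );
}
```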
HTML
$RE{Markdown}{Html}
For example:
<div>
foo
</div>
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/SH8ki3/1/tests
The capture names are:
-
html_all
The entire capture of the html block.
-
html_comment
If this html block is a comment, this will contain the data within the comment.
-
html_content
The inner content between the opening and closing tag. This could be more html blocks or some text.
Obviously, this capture will not be available for html tags that are "empty" by nature, such as <hr />
-
tag_attributes
The attributes of the opening tag, if any. For example:
<div title="Start" class="center large" id="extra_stuff">
<span title="Brand name">MyWorld</span>
</div>
Here, the attributes will be:
title="Start" class="center large" id="extra_stuff"
-
tag_close
The closing tag, including enclosing brackets.
-
tag_name
This contains the name of the first html tag encountered, i.e. the one that starts the html block. For example:
<div>
<span title="Brand name">MyWorld</span>
</div>
Here the tag name will be div
See also Markdown::Parser::HTML
Image
$RE{Markdown}{Image}
For example:

or

or, with reference:
![alt text][foo]
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/z0yH2F/4/tests
The capture names are:
-
img_all
The entire capture of the markdown, such as:

-
img_alt
The alternative text to be displayed for this image. This is mandatory as per markdown, so it is guaranteed to be available.
-
img_id
If the image is an image reference, this will contain the reference id. When an image id is provided, there is no url and no title, because the image reference provides that information.
-
img_title
This is the title of the image, which may not exist, since it is optional in markdown. The title is surrounded by single or double quotes that are captured in img_title_container
-
img_url
This is the url of the image.
See also Markdown::Parser::Image
Line
$RE{Markdown}{Line}
For example:
---
or
- - -
or
***
or
* * *
or
___
or
_ _ _
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/Vlew4X/2
The capture names are:
-
line_all
The entire capture of the horizontal line.
-
line_type
This contains the marker used to set the line. Valid markers are *, -, or _
See also the reference for horizontal lines by Markdown's original author
See also Markdown::Parser::Line
Line Break
$RE{Markdown}{LineBreak}
For example:
Mignonne, allons voir si la rose
Qui ce matin avait déclose
Sa robe de pourpre au soleil,
A point perdu cette vesprée,
Les plis de sa robe pourprée,
Et son teint au vôtre pareil.
To ensure arbitrary line breaks, each line ends with 2 spaces and 1 line break. This should become:
Mignonne, allons voir si la rose<br />
Qui ce matin avait déclose<br />
Sa robe de pourpre au soleil,<br />
A point perdu cette vesprée,<br />
Les plis de sa robe pourprée,<br />
Et son teint au vôtre pareil.
P.S.: If you're wondering, this is an extract from Ronsard.
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/6VG46H/1/
This is basically used like this:
if( $text =~ /\G$RE{Markdown}{LineBreak}/ )
{
    print( "Found a line break\n" );
}
Or
$text =~ s/$RE{Markdown}{LineBreak}/<br \/>\n/gs;
The capture name is:
-
br_all
The entire capture of the line break.
See also Markdown::Parser::NewLine
Link
$RE{Markdown}{Link}
For example:
[Inline link](https://www.example.com "title")
or
[Inline link](/some/path "title")
or, without title
[Inline link](/some/path)
or with a reference id:
[reference link][refid]
[refid]: /path/to/something (Title)
or, using the link text as the id for the reference:
[My Example][]
[My Example]: https://example.com (Great Example)
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/sGsOIv/4/tests
The capture names are:
-
link_all
The entire capture of the link.
-
link_title_container
If there is a link title, this contains the single or double quote enclosing it.
-
link_id
The link reference id. For example, here 1 is the id:
[Reference link 1 with parens][1]
-
link_name
The link text
-
link_title
The link title, if any.
-
link_url
The link url, if any
See also Markdown::Parser::Link
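Since a link can carry either a url or a reference id, code using these captures has to decide where the target comes from. The sketch below is one way to do it. The helper name resolve_link and the %defs hash of link definitions are hypothetical, and looking ids up in lowercase is an assumption based on the original Markdown syntax, in which reference ids are not case sensitive.

```perl
use strict;
use warnings;
use Regexp::Common qw( Markdown );

# Hypothetical helper: returns the url of a link from a copy of its
# named captures and a hash of link definitions keyed by lowercase id.
sub resolve_link
{
    my( $caps, $defs ) = @_;
    my $url = $caps->{link_url};
    return( $url ) if( defined( $url ) && length( $url ) );
    # An empty id, as in [My Example][], falls back on the link text
    my $id = $caps->{link_id};
    $id = $caps->{link_name} unless( defined( $id ) && length( $id ) );
    return unless( defined( $id ) );
    return( $defs->{ lc( $id ) } );
}

my %defs = ( 'my example' => 'https://example.com' );
my $text = '[My Example][]';
if( $text =~ /$RE{Markdown}{Link}/ )
{
    # %+ only reflects the last successful match, so copy it first
    my %caps = %+;
    my $url = resolve_link( \%caps, \%defs );
    print( "Link target: ", ( $url // 'unknown' ), "\n" );
}
```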
Link Auto
$RE{Markdown}{LinkAuto}
Supports http, https, ftp, newsgroup, local file, e-mail address and phone number links.
For example:
<https://www.example.com>
would become:
<a href="https://www.example.com">https://www.example.com</a>
An e-mail such as:
<!#$%&'*+-/=?^_`.{|}~@example.com>
would become:
<a href="mailto:!#$%&'*+-/=?^_`.{|}~@example.com">!#$%&'*+-/=?^_`.{|}~@example.com</a>
Other possible and valid e-mail addresses:
<"abc@def"@example.com>
<jsmith@[192.0.2.1]>
A file link:
<file:///Volume/User/john/Document/form.rtf>
A newsgroup link:
<news:alt.fr.perl>
An ftp uri:
<ftp://ftp.example.com/plop/>
Phone numbers:
<+81-90-1234-5678>
<tel:+81-90-1234-5678>
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/bAUu1E/3/tests
The capture names are:
-
link_all
The entire capture of the link.
-
link_file
A local file url, such as:
file:///Volume/User/john/Document/form.rtf
-
link_ftp
Contains an ftp url
-
link_http
Contains an http url
-
link_https
Contains an https url
-
link_mailto
An e-mail address with or without the mailto: prefix.
-
link_news
A newsgroup link url, such as news:alt.fr.perl
-
link_tel
Contains a telephone url according to RFC 3966
-
link_url
Contains the link uri, which contains one of link_file, link_ftp, link_http, link_https, link_mailto, link_news or link_tel
See also Markdown::Parser::Link
Link Definition
$RE{Markdown}{LinkDefinition}
For example:
[1]: /url/ "Title"
[refid]: /path/to/something (Title)
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/edg2F7/2/tests
The capture names are:
-
link_all
The entire capture of the link.
-
link_id
The link id
-
link_title
The link title
-
link_title_container
The character used to enclose the title, if any. This is either " or '
-
link_url
The link url
See also Markdown::Parser::LinkDefinition
Link Reference
$RE{Markdown}{LinkRef}
Example:
Foo [bar] [1].
Foo [bar][1].
Foo [bar]
[1].
[Foo][]
[1]: /url/ "Title"
[Foo]: https://www.example.com
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/QmyfnH/1/tests
The capture names are:
-
link_all
The entire capture of the link.
-
link_id
The link reference id. For example, here 1 is the id:
[Reference link 1 with parens][1]
-
link_name
The link text
See also the reference on links by Markdown's original author
See also Markdown::Parser::Link
List
$RE{Markdown}{List}
For example, an unordered list:
* asterisk 1
* asterisk 2
* asterisk 3
or, an ordered list:
1. One item
1. Second item
1. Third item
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/RfhRVg/4
The capture names are:
-
list_after
The data that follows the list.
-
list_all
The entire capture of the markdown.
-
list_content
The content of the list.
-
list_prefix
Contains the first list marker, possibly preceded by some space. A list marker is *, or +, or - or a digit with a dot such as 1.
-
list_type_any
Contains the list marker such as *, or +, or - or a digit with a dot such as 1.
This is included in the list_prefix named capture.
-
list_type_any2
Same as list_type_any, but matches the following item if any. If there is no matching item, then an end of string is expected.
-
list_type_ordered
Contains a digit followed by a dot if the list is an ordered one.
-
list_type_ordered2
Same as list_type_ordered, but for the following list item, if any.
-
list_type_unordered_minus
Contains the minus marker (-) if the list marker uses a minus sign.
-
list_type_unordered_minus2
Same as list_type_unordered_minus, but for the following list item, if any.
-
list_type_unordered_plus
Contains the plus marker (+) if the list marker uses a plus sign.
-
list_type_unordered_plus2
Same as list_type_unordered_plus, but for the following list item, if any.
-
list_type_unordered_star
Contains the star marker (*) if the list marker uses a star.
-
list_type_unordered_star2
Same as list_type_unordered_star, but for the following list item, if any.
See also Markdown::Parser::List
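Because only the capture matching the marker in use is set, the list type can be derived from which capture is defined. The following is a minimal sketch of that idea; the helper name list_kind is hypothetical and not part of this module.

```perl
use strict;
use warnings;
use Regexp::Common qw( Markdown );

# Hypothetical helper: classifies a list from a copy of its named captures.
sub list_kind
{
    my $caps = shift;
    return( 'ordered' ) if( defined( $caps->{list_type_ordered} ) );
    foreach my $name ( qw( list_type_unordered_minus list_type_unordered_plus list_type_unordered_star ) )
    {
        return( 'unordered' ) if( defined( $caps->{ $name } ) );
    }
    return;
}

my $text = "* asterisk 1\n* asterisk 2\n* asterisk 3\n";
if( $text =~ /$RE{Markdown}{List}/ )
{
    # %+ only reflects the last successful match, so copy it first
    my %caps = %+;
    my $kind = list_kind( \%caps );
    print( "Found an ", ( $kind // 'unknown' ), " list\n" );
}
```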
List First Level
$RE{Markdown}{ListFirstLevel}
This regular expression is used for top level lists, as opposed to the nth level pattern, which is used for sub lists. Both will match lists within a list, but the processing under markdown differs depending on whether the list is a top level one or a sub list.
See also Markdown::Parser::List
List Nth Level
$RE{Markdown}{ListNthLevel}
Regular expression to process a list within a list.
See also Markdown::Parser::List
List Item
$RE{Markdown}{ListItem}
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/bulBCP/1/tests
The capture names are:
-
li_all
The entire capture of the markdown.
-
li_content
Contains the data contained in this list item
-
li_lead_line
The optional leading line breaks
-
li_lead_space
The optional leading spaces or tabs. This is used to check that following items belong to the same list level
-
list_type_any
This contains the list type marker, which can be *, +, - or a digit with a dot such as 1.
-
list_type_any2
Same as list_type_any, but matches the following item if any. If there is no matching item, then an end of string is expected.
-
list_type_ordered
This contains a true value if the list marker contains a digit followed by a dot, such as 1.
-
list_type_ordered2
Same as list_type_ordered, but for the following list item, if any.
-
list_type_unordered_minus
This contains a true value if the list marker is a minus sign, i.e. -
-
list_type_unordered_minus2
Same as list_type_unordered_minus, but for the following list item, if any.
-
list_type_unordered_plus
This contains a true value if the list marker is a plus sign, i.e. +
-
list_type_unordered_plus2
Same as list_type_unordered_plus, but for the following list item, if any.
-
list_type_unordered_star
This contains a true value if the list marker is a star, i.e. *
-
list_type_unordered_star2
Same as list_type_unordered_star, but for the following list item, if any.
See also Markdown::Parser::ListItem
Paragraph
$RE{Markdown}{Paragraph}
For example:
The quick brown fox
jumps over the lazy dog
Lorem Ipsum
> Why am I matching?
1. Nonononono!
* Aaaagh!
# Stahhhp!
This regular expression would capture the whole block up to and including "Lorem Ipsum", but is careful not to catch other markdown elements after that. Thus, anything after "Lorem Ipsum" would not be caught, because the next line starts a blockquote.
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/0B3gR4/2/
The capture names are:
-
para_all
The entire capture of the paragraph.
-
para_content
Content of the paragraph
-
para_prefix
Any leading space (up to 3)
See also Markdown::Parser::Paragraph
EXTENDED MARKDOWN
Abbreviation
$RE{Markdown}{ExtAbbr}
For example:
Some discussion about HTML, SGML and HTML4.
*[HTML4]: Hyper Text Markup Language version 4
*[HTML]: Hyper Text Markup Language
*[SGML]: Standard Generalized Markup Language
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/ztM2Pw/2/tests
The capture names are:
-
abbr_all
The entire capture of the abbreviation.
-
abbr_name
Contains the abbreviation. For example HTML
-
abbr_value
Contains the abbreviation value. For example Hyper Text Markup Language
See also Markdown::Parser::Abbr
Attributes
$RE{Markdown}{ExtAttributes}
For example, a header with the attributes .cl.class#id7
### Header {.cl.class#id7 }
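To act on such an attribute set, it has to be split into classes and an id. The sketch below is one way to do that in plain Perl. The helper name parse_attributes is hypothetical and not provided by this module, and it assumes that classes are introduced by a dot and the id by a hash sign, as in the example above; any other token is ignored.

```perl
use strict;
use warnings;

# Hypothetical helper, not provided by Regexp::Common::Markdown.
sub parse_attributes
{
    my $attr = shift;
    my $info = { classes => [], id => undef };
    # Each token is a "." or a "#" followed by a name
    while( $attr =~ /([.#])([\w-]+)/g )
    {
        if( $1 eq '.' )
        {
            push( @{$info->{classes}}, $2 );
        }
        else
        {
            $info->{id} = $2;
        }
    }
    return( $info );
}

my $info = parse_attributes( '{.cl.class#id7 }' );
printf( "classes: %s; id: %s\n", join( ' ', @{$info->{classes}} ), $info->{id} // '' );
```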
Code Block
$RE{Markdown}{ExtCodeBlock}
This is the same as conventional blocks with backticks, except the extended version uses tilde characters.
For example:
~~~
<div>
~~~
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/Y9lPAz/1/tests
The capture names are:
-
code_all
The entire capture of the code.
-
code_attr
The class and/or id attributes for this code. This is something like:
`````` .html {#codeid}
</div>
``````
Here, code_attr would contain #codeid
-
code_class
The class of code. For example:
``````html {#codeid}
</div>
``````
Here the code class would be html
-
code_content
The code data enclosed within the code markers (backticks or tildes)
-
code_start
Contains the code delimiter, which is either a series of backticks ` or tildes ~
See also Markdown::Parser::Code
Footnotes
$RE{Markdown}{ExtFootnote}
This looks like this:
[^1]: Content for fifth footnote.
[^2]: Content for sixth footnote spanning on
three lines, with some span-level markup like
_emphasis_, a [link][].
A reference to those footnotes could be:
Some paragraph with a footnote[^1], and another[^2].
The footnote_id reference can be anything as long as it is unique.
See also Markdown::Parser::Footnote
Inline Footnotes
For consistency with links, footnotes can be added inline, like this:
I met Jack [^jack](Co-founder of Angels, Inc) at the meet-up.
Inline notes will work even without the identifier. For example:
I met Jack [^](Co-founder of Angels, Inc) at the meet-up.
However, in compliance with pandoc footnotes style, inline footnotes can also be added like this:
Here is an inline note.^[Inlines notes are easier to write, since
you don't have to pick an identifier and move down to type the
note.]
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/WuB1FR/2/
The capture names are:
-
footnote_all
The entire capture of the footnote.
-
footnote_id
The footnote id which must be unique and will be referenced in text.
-
footnote_text
The footnote text
See also Markdown::Parser::Footnote
Footnote Reference
$RE{Markdown}{ExtFootnoteReference}
This regular expression matches 3 types of footnote references:
-
1 Conventional
An id is specified referring to a footnote that provides details.
Here's a simple footnote,[^1]
[^1]: This is the first footnote.
-
2 Inline
I met Jack [^jack](Co-founder of Angels, Inc) at the meet-up.
Inline footnotes without any id, i.e. auto-generated id. For example:
I met Jack [^](Co-founder of Angels, Inc) at the meet-up.
-
3 Inline auto-generated, pandoc style
Here is an inline note.^[Inlines notes are easier to write, since you don't have to pick an identifier and move down to type the note.]
See pandoc manual for more information
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/3eO7rJ/1/
The capture names are:
-
footnote_all
The entire capture of the footnote reference.
-
footnote_id
The footnote id which must be unique and must match a footnote declared anywhere in the document and not necessarily before. For example:
Here's a simple footnote,[^1]
[^1]: This is the first footnote.
1 here is the id of the footnote.
If it is not provided, then an id will be auto-generated, but a footnote text is then required.
-
footnote_text
The footnote text is optional if an id is provided. If an id is not provided, the footnote text is guaranteed to have some value.
See also Markdown::Parser::FootnoteReference
Header
$RE{Markdown}{ExtHeader}
This extends the regular header with attributes.
For example:
### Header {.cl.class#id7 }
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/GyzbR2/1
The capture names are:
-
header_all
The entire capture of the header.
-
header_attr
Contains the extended attribute set. For example:
{.class#id}
-
header_content
The text that is enclosed in the header marker.
-
header_level
This contains all the hash signs (#) that precede the text. The number of hash signs indicates the level of the header. Thus, you could do something like this:
length( $+{header_level} );
See also Markdown::Parser::Header
Header Line
$RE{Markdown}{ExtHeaderLine}
Same as header line, but with attributes.
For example:
Header {#id5.cl.class}
======
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/berfAR/2/tests
The capture names are:
-
header_all
The entire capture of the header.
-
header_attr
Contains the extended attribute set. For example:
{.class#id}
-
header_content
The text that is enclosed in the header marker.
-
header_type
This contains the marker line used to mark the line above as header.
A line using = is a header of level 1, while a line using - is a header of level 2.
See also Markdown::Parser::Header
Image
$RE{Markdown}{ExtImage}
Same as regular image, but with attributes.
For example:
This is an {.class #inline-img}.
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/xetHV1/2
The capture names are:
-
img_all
The entire capture of the markdown, such as:

-
img_alt
The alternative text to be displayed for this image. This is mandatory as per markdown, so it is guaranteed to be available.
-
img_attr
Contains the extended attribute set. For example:
{.class#id}
-
img_id
If the image is an image reference, this will contain the reference id. When an image id is provided, there is no url and no title, because the image reference provides that information.
-
img_title
This is the title of the image, which may not exist, since it is optional in markdown. The title is surrounded by single or double quotes that are captured in img_title_container
-
img_url
This is the url of the image.
See also Markdown::Parser::Image
Link
$RE{Markdown}{ExtLink}
Same as regular links, but with attributes.
For example:
This is an [inline link](/url "title"){.class #inline-link}.
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/7mLssJ/2
The capture names are:
-
link_all
The entire capture of the link.
-
link_attr
Contains the extended attribute set. For example:
{.class#id}
-
link_title_container
If there is a link title, this contains the single or double quote enclosing it.
-
link_id
The link reference id. For example, here 1 is the id:
[Reference link 1 with parens][1]
-
link_name
The link text
-
link_title
The link title, if any.
-
link_url
The link url, if any
See also Markdown::Parser::Link
Link Definition
$RE{Markdown}{ExtLinkDefinition}
Same as regular link definition, but with attributes.
For example:
[refid]: /path/to/something (Title) { .class #ref data-key=val }
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/hVfXCe/2/
The capture names are:
-
link_all
The entire capture of the link.
-
link_attr
Contains the extended attribute set. For example:
{.class#id}
-
link_id
The link id
-
link_title
The link title
-
link_title_container
The character used to enclose the title, if any. This is either " or '
-
link_url
The link url
See also Markdown::Parser::LinkDefinition
Table
$RE{Markdown}{ExtTable}
For example:
You can see an example of this regular expression along with unit tests here: https://regex101.com/r/01XCqB/9/tests
The capture names are:
-
table
The entire capture of the table.
-
table_after
Contains the data that follows the table.
-
table_caption
Contains the table caption if set. A table caption, in markdown, can be positioned before or after the table.
-
table_headers
Contains the entire header rows
-
table_rows
Contains the table body rows
Table format is taken from David E. Wheeler's RFC
See also Markdown::Parser::Table
SEE ALSO
Regexp::Common for a general description of how to use this interface.
Markdown::Parser for a Markdown parser using this module.
CHANGES & CONTRIBUTIONS
Feel free to reach out to the author for possible corrections, improvements, or suggestions.
AUTHOR
Jacques Deguest <jack@deguest.jp>
CREDITS
Credits to Michel Fortin and John Gruber for their test units.
Credits to Firas Dib for his online regular expression test tool.
COPYRIGHT & LICENSE
Copyright (c) 2020 DEGUEST Pte. Ltd.
You can use, copy, modify and redistribute this package and associated files under the same terms as Perl itself.