human_regex
Overview
Regular expressions for humans.
The human-regex package provides the classes StringRegex and BytesRegex, which are subclasses of str and bytes, respectively. Their methods and properties let you build regular expressions with readable code.
from human_regex import StringRegex as Sre
regex = Sre("match")
assert regex == "match"
regex = regex.not_preceded_by("element")
assert regex == "(?<!element)match"
regex = regex.named("my_group")
assert regex == "(?P<my_group>(?<!element)match)"
regex = Sre("match").not_preceded_by("element").named("my_group").optional
assert regex == "(?P<my_group>(?<!element)match)?"
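To see what the patterns built above actually match, the following sketch passes the same pattern strings to the built-in re module directly. It uses only the standard library, and the sample strings are invented for illustration.

```python
import re

# Pattern strings copied from the assertions in the example above.
named = re.compile(r"(?P<my_group>(?<!element)match)")
optional = re.compile(r"(?P<my_group>(?<!element)match)?")

# "match" is captured when it is not directly preceded by "element".
found = named.search("a match here")
captured = found.group("my_group")

# Directly after "element" the negative lookbehind rejects the match.
rejected = named.search("elementmatch")

# The optional variant still succeeds, but with an empty match
# and an unset group.
empty = optional.search("elementmatch")
```

The optional form always matches somewhere, so checking the named group is more informative than checking whether a match object was returned.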
Let's construct regular expressions for words ending with the letter "c" and for words starting with the letter "a":
from human_regex import StringRegex as Sre
word = Sre(r"\w").zero_or_more
assert word == r"\w*"
word_endswith_c = word.append("c").named("ends_with_c").prepend(r"\b").append(r"\b")
assert word_endswith_c == r"\b(?P<ends_with_c>\w*c)\b"
word_startswith_a = word.prepend("a").named("starts_with_a").prepend(r"\b").append(r"\b")
assert word_startswith_a == r"\b(?P<starts_with_a>a\w*)\b"
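As a quick check of the two word patterns, this sketch runs the resulting pattern strings through re.findall from the standard library. The sample sentence is an assumption chosen for illustration.

```python
import re

# Pattern strings copied from the assertions in the example above.
ends_with_c = r"\b(?P<ends_with_c>\w*c)\b"
starts_with_a = r"\b(?P<starts_with_a>a\w*)\b"

text = "abc magic apple arc cat"

# With a single capturing group, findall returns the group contents.
words_ending_in_c = re.findall(ends_with_c, text)
words_starting_with_a = re.findall(starts_with_a, text)
```

The `\b` anchors on both sides are what keep "cat" out of both results: the letters "c" and "a" occur in it, but not at the required word boundary.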
Subclasses of str, bytes
StringRegex and BytesRegex are subclasses of str and bytes, respectively, and interoperate with these types seamlessly. Here are some alternative ways to construct the pattern from the previous example in the Overview section, mixing StringRegex and str instances:
from human_regex import StringRegex as Sre
word = Sre(r"\w")
word += "*"
assert word == r"\w*"
assert isinstance(word, Sre)
assert isinstance(word, str)
word_endswith_c: str = "".join((r"\b", "(?P<ends_with_c>", word, "c", ")", r"\b"))
word_endswith_c: Sre = Sre(word_endswith_c)
# same as:
word_endswith_c: Sre = Sre("").join((r"\b", "(?P<ends_with_c>", word, "c", ")", r"\b"))
word_endswith_c: Sre = Sre.concatenate((r"\b", "(?P<ends_with_c>", word, "c", ")", r"\b"))
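The example above relies on operations such as `+=` and join returning the subclass rather than a plain str. The sketch below shows one way a str subclass could achieve that. MiniRegex is a hypothetical stand-in written for this illustration and is not the library's implementation.

```python
class MiniRegex(str):
    """Hypothetical str subclass that keeps its own type (illustration only)."""

    def __add__(self, other):
        # str.__add__ returns a plain str, so wrap the result again.
        return type(self)(str.__add__(self, other))

    def join(self, iterable):
        # str.join also returns a plain str; wrap it the same way.
        return type(self)(str.join(self, iterable))


word = MiniRegex(r"\w")
word += "*"  # falls back to __add__, so the result is still a MiniRegex

parts = (r"\b", "(?P<ends_with_c>", word, "c", ")", r"\b")
plain = "".join(parts)               # joined by a plain str: result is str
wrapped = MiniRegex("").join(parts)  # joined by the subclass: type is kept
```

Both results hold the same text; only their types differ, which is why the original example wraps the plain result in Sre again.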
Proxy re Module's Functions and Flags
StringRegex and BytesRegex objects proxy the class re.RegexFlag and all flags and functions of the built-in re module (e.g. re.compile, re.search, etc.). These functions automatically take the StringRegex or BytesRegex instance as their first argument:
from human_regex import StringRegex as Sre
import re # needed only for the assert statements below
assert Sre.RegexFlag is re.RegexFlag
sre = Sre("abc.")
# Use the proxied `re.compile` function of the StringRegex instance
# and the proxied flags on the StringRegex class
compiled = sre.compile(flags=Sre.IGNORECASE | Sre.DOTALL)
# same as:
# compiled = re.compile(sre, flags=re.IGNORECASE | re.DOTALL)
assert isinstance(compiled, re.Pattern)
text = "abc\nABCd\nAbc"
found = sre.findall(text, flags=Sre.IGNORECASE | Sre.DOTALL)
# same as:
# found = re.findall(sre, text, flags=re.IGNORECASE | re.DOTALL)
assert found == ["abc\n", "ABCd"]
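The proxying described above can be pictured as methods that forward to the re module with the instance as the pattern argument. This is a minimal sketch under that assumption. MiniProxy is a hypothetical class covering only two functions and three attributes, and it does not reproduce the library's ReProxy.

```python
import re


class MiniProxy(str):
    """Hypothetical sketch of forwarding to the re module (illustration only)."""

    # Flags re-exported as class attributes.
    RegexFlag = re.RegexFlag
    IGNORECASE = re.IGNORECASE
    DOTALL = re.DOTALL

    def compile(self, flags=0):
        # The instance itself is passed as the pattern.
        return re.compile(self, flags)

    def findall(self, string, flags=0):
        return re.findall(self, string, flags)


pattern = MiniProxy("abc.")
flags = MiniProxy.IGNORECASE | MiniProxy.DOTALL

compiled = pattern.compile(flags=flags)
found = pattern.findall("abc\nABCd\nAbc", flags=flags)
```

Because the class attributes are the re module's own flag objects, they combine with `|` exactly as `re.IGNORECASE | re.DOTALL` would.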
StringRegex and BytesRegex
Every method demonstrated with StringRegex is also available on BytesRegex, where it operates on bytes objects rather than str objects:
from human_regex import StringRegex as Sre, BytesRegex as Bre
import re # needed only for the assert statements below
string_re = Sre("abc.").named("my_group")
string_pattern = string_re.compile(flags=Sre.IGNORECASE | Sre.DOTALL)
assert isinstance(string_pattern, re.Pattern)
bytes_re = Bre(b"abc.").named(b"my_group")
bytes_pattern = bytes_re.compile(flags=Bre.IGNORECASE | Bre.DOTALL)
assert isinstance(bytes_pattern, re.Pattern)
assert string_pattern.flags == 50 # includes the implicit Sre.UNICODE flag
assert bytes_pattern.flags == 18 # bytes patterns cannot use the UNICODE flag
assert (Bre.IGNORECASE | Bre.DOTALL | Bre.UNICODE).value == 50
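The flag values in the assertions above come from the re module itself, so they can be reproduced with the standard library alone. The sketch below also shows two restrictions of bytes patterns that the comments above allude to.

```python
import re

flags = re.IGNORECASE | re.DOTALL  # 2 | 16

str_pattern = re.compile("abc.", flags)
bytes_pattern = re.compile(b"abc.", flags)

# UNICODE (32) is added implicitly for str patterns only.
str_flags = str_pattern.flags
bytes_flags = bytes_pattern.flags

# Requesting UNICODE for a bytes pattern is rejected by the re module.
try:
    re.compile(b"abc.", flags | re.UNICODE)
    unicode_allowed = True
except ValueError:
    unicode_allowed = False

# A bytes pattern can only search bytes; passing a str raises TypeError.
try:
    bytes_pattern.search("abcd")
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```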
Caution When Iterating Over Bytes Objects
Iterating over str instances yields individual string characters, but iterating over bytes instances yields instances of int.
some_strings = "abc"
assert tuple(some_strings) == ("a", "b", "c")
s: str = "".join(some_strings) # iterates over "abc" and joins its elements
assert s == "abc"
some_integers = b"abc"
assert tuple(some_integers) == (97, 98, 99)
b: bytes = b"".join(some_integers)
# will raise a TypeError because elements of the iterable b"abc"
# are the integers 97, 98, 99 but bytes.join
# expects instances of bytes-like objects
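The failure described above can be observed safely by catching the exception. This sketch, which uses only built-in types, also shows that slicing a bytes object yields bytes of length 1, unlike iterating or indexing.

```python
data = b"abc"

# Joining the bytes object directly fails: its elements are 97, 98, 99.
try:
    b"".join(data)
    joined_directly = True
except TypeError:
    joined_directly = False

# Indexing yields an int, while a one-element slice yields bytes.
first_by_index = data[0]
first_by_slice = data[0:1]

# Slices are bytes-like, so they can be joined again.
pieces = [data[i:i + 1] for i in range(len(data))]
rejoined = b"".join(pieces)
```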
StringRegex and BytesRegex are subclasses of str and bytes, respectively, so they inherit this behavior. You can use a StringRegex instance as an iterable of string characters, but iterating over a BytesRegex instance yields integers. The methods BytesRegex.concatenate and BytesRegex.join, both of which use bytes.join internally, cannot work with iterables of integers. They expect iterables of bytes-like objects.
from human_regex import BytesRegex as Bre
# as long as the iterable yields bytes-like objects, everything is fine:
some_bytes = (b"a", Bre(b"b"), b"c")
assert Bre.concatenate(some_bytes) == Bre(b"abc")
some_integers = Bre(b"abc")
b = Bre.concatenate(some_integers) # will raise a TypeError
# because the elements of Bre(b"abc") are integers,
# rather than bytes-like objects:
assert tuple(Bre(b"abc")) == (97, 98, 99)
# we would have to convert each integer back to a single-byte bytes object;
# bytes([i]) works for every value 0-255, whereas str.encode(chr(i))
# would produce two UTF-8 bytes for values above 127:
b = Bre.concatenate(bytes([i]) for i in some_integers)
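When turning the integers back into bytes, the conversion method matters once values above 127 appear. The sketch below compares two conversions on a made-up byte string, using built-in types only.

```python
raw = b"caf\xe9"  # the last byte is 233, outside the ASCII range

# bytes([i]) builds a one-byte object for any value from 0 to 255.
via_bytes = b"".join(bytes([i]) for i in raw)

# chr(i).encode() uses UTF-8 by default, which turns 233 into two bytes.
via_utf8 = b"".join(chr(i).encode() for i in raw)

# Encoding as Latin-1 maps code points 0-255 to the same byte values.
via_latin1 = b"".join(chr(i).encode("latin-1") for i in raw)
```

For pure ASCII input such as b"abc" all three conversions agree, so the difference only shows up with non-ASCII bytes.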
Inherited Methods and Properties
StringRegex and BytesRegex differ slightly in their private class variables, but all of their public methods and properties are inherited from the human_regex.bases.general_regex.GeneralRegexBase class. The documentation for these inherited public methods and properties can therefore be found there. For methods proxied from the built-in re module or inherited from str or bytes, see the Python standard library documentation.
"""
.. include:: ../../README.md
"""

from .__about__ import __version__  # noqa: F401
from .bases import GeneralRegexBase, ReProxy
from .utilities import building_blocks, create_class_property

__all__ = ["StringRegex", "BytesRegex"]

# Classes StringRegex and BytesRegex are very similar.
# BytesRegex has a base class bytes rather than str
# and has class properties made of building blocks
# which are encoded in UTF-8, rather than Unicode strings.
#
# Hence we generate the classes StringRegex and BytesRegex dynamically:
for class_name, str_or_bytes, encoding in (
    ("StringRegex", str, ""),
    ("BytesRegex", bytes, "utf-8"),
):
    globals()[class_name] = type(
        class_name,
        (GeneralRegexBase, ReProxy, str_or_bytes),
        {k: create_class_property(v, encoding=encoding) for k, v in building_blocks.items()},
    )
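The module source above creates both classes in a loop with the three-argument form of type(). The following self-contained sketch illustrates that technique with hypothetical stand-ins: ToyBase, toy_blocks and the Toy class names are invented here, and plain class attributes replace create_class_property, whose behavior is not shown in the source.

```python
class ToyBase:
    """Hypothetical mixin standing in for the shared base class."""

    def append(self, other):
        # str + str (or bytes + bytes) returns the plain built-in type,
        # so the result is wrapped in the subclass again.
        return type(self)(self + other)


# Hypothetical building blocks; the real mapping is not shown in the source.
toy_blocks = {"DIGIT": r"\d", "WORD_CHAR": r"\w"}

classes = {}
for class_name, str_or_bytes, encoding in (
    ("ToyStringRegex", str, ""),
    ("ToyBytesRegex", bytes, "utf-8"),
):
    # Encode the blocks for the bytes variant, keep them as str otherwise.
    attributes = {
        name: value.encode(encoding) if encoding else value
        for name, value in toy_blocks.items()
    }
    # type(name, bases, namespace) builds a class at runtime.
    classes[class_name] = type(class_name, (ToyBase, str_or_bytes), attributes)

ToyStringRegex = classes["ToyStringRegex"]
ToyBytesRegex = classes["ToyBytesRegex"]

digits = ToyStringRegex.DIGIT
byte_digits = ToyBytesRegex.DIGIT
extended = ToyBytesRegex(b"ab").append(b"c")
```

The sketch stores the classes in a dictionary instead of writing to globals(), which keeps it independent of the scope it runs in. The effect is otherwise the same: one loop body yields a str-based and a bytes-based class that share their behavior.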
Base class for human_regex.StringRegex and human_regex.BytesRegex.
Inherited Members
- human_regex.bases.general_regex.GeneralRegexBase
  - concatenate
  - __add__
  - __or__
  - append
  - prepend
  - join
  - unnamed
  - extension
  - set_flags
  - no_capture
  - modify_flags
  - atomic
  - named
  - backreference
  - comment
  - followed_by
  - not_followed_by
  - preceded_by
  - not_preceded_by
  - yes_no
  - set
  - optional
  - zero_or_more
  - one_or_more
  - lazy
  - repeat
  - exactly
- human_regex.bases.re_proxy.ReProxy
  - A
  - ASCII
  - DEBUG
  - I
  - IGNORECASE
  - L
  - LOCALE
  - M
  - MULTILINE
  - NOFLAG
  - S
  - DOTALL
  - U
  - UNICODE
  - X
  - VERBOSE
  - compile
  - search
  - match
  - fullmatch
  - split
  - findall
  - finditer
  - sub
  - subn
  - escape
  - purge
  - RegexFlag
- builtins.str
  - encode
  - replace
  - rsplit
  - capitalize
  - casefold
  - title
  - center
  - count
  - expandtabs
  - find
  - partition
  - index
  - ljust
  - lower
  - lstrip
  - rfind
  - rindex
  - rjust
  - rstrip
  - rpartition
  - splitlines
  - strip
  - swapcase
  - translate
  - upper
  - startswith
  - endswith
  - removeprefix
  - removesuffix
  - isascii
  - islower
  - isupper
  - istitle
  - isspace
  - isdecimal
  - isdigit
  - isnumeric
  - isalpha
  - isalnum
  - isidentifier
  - isprintable
  - zfill
  - format
  - format_map
  - maketrans
Base class for human_regex.StringRegex and human_regex.BytesRegex.
Inherited Members
- human_regex.bases.general_regex.GeneralRegexBase
  - concatenate
  - __add__
  - __or__
  - append
  - prepend
  - join
  - unnamed
  - extension
  - set_flags
  - no_capture
  - modify_flags
  - atomic
  - named
  - backreference
  - comment
  - followed_by
  - not_followed_by
  - preceded_by
  - not_preceded_by
  - yes_no
  - set
  - optional
  - zero_or_more
  - one_or_more
  - lazy
  - repeat
  - exactly
- human_regex.bases.re_proxy.ReProxy
  - A
  - ASCII
  - DEBUG
  - I
  - IGNORECASE
  - L
  - LOCALE
  - M
  - MULTILINE
  - NOFLAG
  - S
  - DOTALL
  - U
  - UNICODE
  - X
  - VERBOSE
  - compile
  - search
  - match
  - fullmatch
  - split
  - findall
  - finditer
  - sub
  - subn
  - escape
  - purge
  - RegexFlag
- builtins.bytes
  - capitalize
  - center
  - count
  - decode
  - endswith
  - expandtabs
  - find
  - fromhex
  - hex
  - index
  - isalnum
  - isalpha
  - isascii
  - isdigit
  - islower
  - isspace
  - istitle
  - isupper
  - ljust
  - lower
  - lstrip
  - maketrans
  - partition
  - replace
  - removeprefix
  - removesuffix
  - rfind
  - rindex
  - rjust
  - rpartition
  - rsplit
  - rstrip
  - splitlines
  - startswith
  - strip
  - swapcase
  - title
  - translate
  - upper
  - zfill