TDDA's API for Rexpy
tdda.rexpy
Functions
Rexpy: Automatic Regular Expression Extraction
Given a set of example strings, rexpy infers one or more regular expressions that collectively match them all. When a single pattern would require a large alternation to cover structurally different inputs, rexpy produces multiple patterns instead — one per distinct structure — so each remains readable and precise.
Algorithm
Extraction proceeds in the following stages:
Coarse classification. Each character is assigned to a broad category: lowercase letter, uppercase letter, digit, punctuation, whitespace, or a catch-all. The result is a sequence of category codes for each string.
Run-length encoding (RLE). Consecutive characters in the same category are collapsed into (category, count) pairs. For example, "hello123" becomes [(letter, 5), (digit, 3)].
Variable run-length encoding (VRLE). RLEs that share the same sequence of categories and differ only in their counts are merged into a single variable pattern. For example, [(letter, 3)] and [(letter, 5)] both have the shape [letter] and merge into [(letter, 3, 5)].
Fragment refinement. Within each fragment position of a VRLE, the actual characters seen across all matching examples are examined to find the narrowest character class that fits; for instance, hex digits rather than all alphanumeric characters.
Pattern merging. The refined VRLEs are aligned and merged. Fixed literal sub-sequences that appear at a consistent position (relative to the start or end) across all patterns in a group are used as anchors: patterns are split at those anchors and the pieces are recursively merged. Patterns too different to reconcile remain as separate entries in the output.
Regex generation. Each merged pattern is converted to a regular expression string in the requested output dialect ('portable', 'grep', 'java', 'perl', or 'posix').
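The first three stages can be sketched in a few lines of plain Python. This is an illustrative approximation, not rexpy's implementation: the function names classify, rle and vrle are hypothetical, and all letters are folded into a single letter category so that the output matches the examples above.

```python
from itertools import groupby
import string


def classify(ch):
    """Stage 1: assign one character to a coarse category."""
    if ch.isalpha():
        return 'letter'
    if ch.isdigit():
        return 'digit'
    if ch.isspace():
        return 'space'
    if ch in string.punctuation:
        return 'punct'
    return 'other'


def rle(s):
    """Stage 2: collapse runs of one category into (category, count) pairs."""
    return [(cat, len(list(run))) for cat, run in groupby(s, key=classify)]


def vrle(rles):
    """Stage 3: merge RLEs that share a category sequence into
    (category, min_count, max_count) triples, one list per shape."""
    by_shape = {}
    for r in rles:
        shape = tuple(cat for cat, _ in r)
        by_shape.setdefault(shape, []).append([n for _, n in r])
    merged = []
    for shape, count_lists in by_shape.items():
        columns = zip(*count_lists)  # counts seen at each fragment position
        merged.append([(cat, min(col), max(col))
                       for cat, col in zip(shape, columns)])
    return merged


print(rle("hello123"))                    # [('letter', 5), ('digit', 3)]
print(vrle([rle("abc"), rle("hello")]))   # [[('letter', 3, 5)]]
```

Strings with different shapes, such as "ab1" and "abc", stay in separate entries at this point; reconciling them is the job of the later merging stage.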
For large inputs, rexpy works on a sample first and then checks which examples the resulting patterns fail to match. Any failures are folded back into the working set and extraction is repeated, ensuring that even sampled runs achieve complete coverage.
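The sample-then-verify loop might look roughly like the sketch below. It is not rexpy's code: infer is a deliberately trivial, hypothetical stand-in for the extraction stages (one pattern per string length), and extract_with_sampling and sample_size are names chosen for illustration only.

```python
import random
import re


def infer(strings):
    """Hypothetical stand-in for extraction: one anchored pattern per
    distinct string length. It always covers its own input."""
    return sorted({'^.{%d}$' % len(s) for s in strings})


def extract_with_sampling(examples, sample_size, seed=None):
    """Infer patterns from a sample, then fold in any unmatched examples
    and repeat until every example is matched."""
    rng = random.Random(seed)
    distinct = sorted(set(examples))
    working = rng.sample(distinct, min(sample_size, len(distinct)))
    while True:
        patterns = infer(working)
        compiled = [re.compile(p, re.DOTALL) for p in patterns]
        failures = [s for s in distinct
                    if not any(c.match(s) for c in compiled)]
        if not failures:
            return patterns
        # Failures are never already in the working set, so the set grows
        # on every pass and the loop terminates.
        working.extend(failures)


examples = ['1', '22', '333', '4444', '55555'] * 3
print(extract_with_sampling(examples, sample_size=2, seed=0))
```

Because each pass adds at least one previously unmatched string, the loop ends with patterns that cover the full input even when the initial sample is small.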
Python API
The simplest entry point is the extract() function; the
Extractor class gives full control over the process and
provides access to coverage statistics.
- tdda.rexpy.rexpy.extract(examples, tag=False, encoding=None, as_object=False, extra_letters=None, full_escape=False, remove_empties=False, strip=False, variableLengthFrags=False, max_patterns=None, min_diff_strings_per_pattern=1, min_strings_per_pattern=1, size=None, seed=None, dialect='portable', verbose=0)
Extract regular expression(s) from examples and return them.
Normally examples should be unicode (i.e. str in Python 3). Byte strings can be passed provided encoding is specified; results are always unicode strings.
- Parameters:
examples – The input strings to analyse. May be a list of strings, an integer-valued string-keyed dictionary (or Counter) where values are string frequencies, or a callable conforming to the example_check_function protocol.
tag (bool) – If True, return tagged (capturing-group) regular expressions.
encoding (str) – Encoding to use when decoding byte-string examples. None (the default) means examples are already unicode.
as_object (bool) – If True, return the Extractor object rather than the list of regular expressions. The expressions are then available as result.results.rex.
extra_letters (str) – Additional characters to treat as word letters when building character classes.
full_escape (bool) – If True, escape all special regex characters, not just those strictly necessary.
remove_empties (bool) – If True, discard empty strings from the examples before extraction.
strip (bool) – If True, strip leading and trailing whitespace from each example before extraction.
variableLengthFrags (bool) – If True, allow fragments of variable length within patterns.
max_patterns (int) – Maximum number of patterns to produce. None means no limit.
min_diff_strings_per_pattern (int) – Minimum distinct strings per pattern. Currently ignored.
min_strings_per_pattern (int) – Minimum total strings (counting duplicates) for a pattern to be retained.
size – Controls sampling. Pass a Size instance, None (the default) to use built-in defaults, or False/0 to disable sampling.
seed (int) – Random seed for reproducibility. None means non-deterministic.
dialect (str) – The kind of regular expression to emit. One of 'portable' (the default), 'grep', 'java', 'perl', or 'posix'.
verbose (int) – Verbosity level. 0 is silent, 1 prints progress information, 2 prints maximum detail.
- Returns:
A list of regular expressions as unicode strings, or the Extractor object if as_object=True.
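A short usage sketch for extract(), using only the parameters documented above. The helper uncovered and the wrapper demo are hypothetical names introduced here; the example strings are made up, and the check assumes that patterns in the default 'portable' dialect are accepted by Python's re module. No output is shown because the exact patterns depend on the installed tdda version.

```python
import re
from collections import Counter


def uncovered(patterns, examples):
    """Return the examples that no pattern matches (expected to be empty)."""
    compiled = [re.compile(p) for p in patterns]
    return [s for s in examples if not any(c.match(s) for c in compiled)]


def demo():
    # Imported here so the helper above can be used without tdda installed.
    from tdda.rexpy import extract

    # Keys are example strings, values are their frequencies.
    examples = Counter({'123-AA-971': 3, '12-DQ-802': 1, '198-AA-045': 2})

    patterns = extract(examples)              # list of regex strings
    tagged = extract(examples, tag=True)      # capturing-group variants
    x = extract(examples, as_object=True)     # Extractor instead of a list

    print(patterns)
    print(tagged)
    print(x.results.rex)
    print('uncovered:', uncovered(patterns, list(examples)))
```

Calling demo() with tdda installed prints the three results; the final line is a quick way to confirm that the returned patterns collectively match every input.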
- tdda.rexpy.rexpy.pdextract(cols, seed=None)
Extract regular expression(s) from one or more Pandas columns.
All columns provided should be string columns (i.e. of type object or categorical), possibly containing null values, which are ignored.
Example:
    import numpy as np
    import pandas as pd
    from tdda.rexpy import pdextract

    df = pd.DataFrame({'a3': ["one", "two", np.nan],
                       'a45': ['three', 'four', 'five']})
    re3 = pdextract(df['a3'])
    re45 = pdextract(df['a45'])
    re345 = pdextract([df['a3'], df['a45']])
This should result in:
    re3 = ['^[a-z]{3}$']
    re45 = ['^[a-z]{4,5}$']
    re345 = ['^[a-z]{3,5}$']
- Parameters:
cols – A Pandas Series or list of Series objects.
seed (int) – Random seed for reproducibility. None (the default) means non-deterministic.
- Returns:
A list of regular expressions as unicode strings.
Extractor Class
- class tdda.rexpy.rexpy.Extractor(examples, extract=True, tag=False, extra_letters=None, full_escape=False, remove_empties=False, strip=False, variableLengthFrags=False, specialize=False, max_patterns=None, min_diff_strings_per_pattern=1, min_strings_per_pattern=1, size=None, seed=None, dialect='portable', verbose=0)
Regular expression extractor.
Given a set of examples, constructs a regular expression that characterizes them; or, if no single pattern suffices, a list of regular expressions that collectively cover the cases.
Results are stored in self.results once extraction has occurred, which happens by default on initialization but can be triggered manually by calling extract().
- Parameters:
examples –
The input strings to analyse. May be:
a list of strings, one per example
an integer-valued, string-keyed dictionary (or Counter), where the values are taken as string frequencies (must be non-negative)
a callable conforming to the example_check_function protocol
extract (bool) – If True (the default), run extraction immediately on initialization. Set to False to defer extraction and call extract() manually.
tag (bool) – If True, return tagged (capturing-group) regular expressions instead of plain ones.
extra_letters (str) – Additional characters to treat as word letters when building character classes. By default only standard alphanumeric characters qualify.
full_escape (bool) – If True, escape all special regex characters, not just those strictly necessary.
remove_empties (bool) – If True, discard empty strings from the examples before extraction.
strip (bool) – If True, strip leading and trailing whitespace from each example before extraction.
variableLengthFrags (bool) – If True, allow fragments of variable length within patterns. Off by default, which produces simpler patterns.
specialize (bool) – If True, produce more specialized (narrower) patterns that match fewer strings outside the examples.
max_patterns (int) – Maximum number of patterns to produce. None (the default) means no limit.
min_diff_strings_per_pattern (int) – Minimum number of distinct strings a pattern must account for. Currently ignored.
min_strings_per_pattern (int) – Minimum total number of strings (counting duplicates) a pattern must account for to be retained. Default is 1.
size – Controls sampling behaviour. Pass a Size instance to tune the sampling parameters, None (the default) to use rexpy's built-in defaults, or False/0 to disable sampling entirely.
seed (int) – Random seed for reproducibility when sampling. None (the default) means non-deterministic.
dialect (str) – The kind of regular expression to emit. One of 'portable' (the default), 'grep', 'java', 'perl', or 'posix'.
verbose (int) – Verbosity level. 0 (the default) is silent. 1 prints some progress information. 2 prints maximum detail.
- coverage(dedup=False)
Return match counts for each extracted regular expression.
Returns a list of counts in the same order as self.results.rex. Each count is the number of input strings matched by that pattern.
- Parameters:
dedup (bool) – If True, count only distinct strings, ignoring duplicate frequencies.
- Returns:
A list of match counts, one per regular expression in self.results.rex, in the same order.
- extract()
Run the regular expression extraction.
Analyses the examples supplied at initialization and populates self.results with the extracted patterns. The regular expressions are then available as self.results.rex.
Called automatically on initialization unless extract=False was passed.
- full_incremental_coverage(dedup=False, debug=False)
Return regular expressions sorted by incremental coverage, with full per-pattern statistics.
Like incremental_coverage(), but the dictionary values are Coverage objects rather than plain counts, giving both cumulative and incremental match statistics for each pattern.
- Parameters:
dedup (bool) – If True, use deduplicated incremental coverage as the sort key rather than raw coverage.
debug (bool) – If True, print debugging information during computation.
- Returns:
An OrderedDict mapping each regular expression to a Coverage object, in decreasing order of incremental coverage. See Coverage for the available attributes.
- incremental_coverage(dedup=False, debug=False)
Return regular expressions sorted by incremental coverage.
Returns an ordered dictionary mapping each regular expression to the number of new (previously unmatched) examples it accounts for, sorted from most to fewest, with ties broken by pattern sort order.
- Parameters:
dedup (bool) – If True, ignore string frequencies when counting and sorting.
debug (bool) – If True, print debugging information during computation.
- Returns:
An OrderedDict mapping each regular expression to its new-match count, in decreasing order of incremental coverage.
- n_examples(dedup=False)
Return the total number of examples used for extraction.
- Parameters:
dedup (bool) – If True, return the number of distinct examples; otherwise return the total count including duplicates. In both cases, examples are counted after any stripping has been applied.
- Returns:
The number of examples as an integer.
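The methods above can be combined as in the following sketch, which defers extraction and then prints per-pattern coverage. The wrapper name report and the output layout are illustrative choices; only the constructor arguments, attributes and methods documented in this section are used, and no output is shown because it depends on the input and the installed tdda version.

```python
def report(examples):
    """Extract patterns from examples and print coverage summaries."""
    # Imported inside the function so that defining it does not require tdda.
    from tdda.rexpy.rexpy import Extractor

    x = Extractor(examples, extract=False)   # defer extraction
    x.extract()                              # populates x.results
    patterns = x.results.rex

    # coverage() returns counts in the same order as x.results.rex.
    for rex, count in zip(patterns, x.coverage()):
        print('%6d  %s' % (count, rex))

    print('examples: %d (distinct: %d)'
          % (x.n_examples(), x.n_examples(dedup=True)))

    # Patterns ordered by how many previously unmatched examples each adds.
    for rex, new in x.incremental_coverage().items():
        print('%6d new  %s' % (new, rex))
    return patterns
```

A call such as report(['one', 'two', 'three', '12', '345']) would list each pattern with its match count, followed by the incremental ordering.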
Coverage Object
- class tdda.rexpy.rexpy.Coverage(n, n_uniq, incr, incr_uniq, index)
Container for coverage information.
- n
Number of matches.
- n_uniq
Number of matches, deduplicating strings.
- incr
Number of new (unique) matches for this regex.
- incr_uniq
Number of new (unique) deduplicated matches for this regex.
- index
Index of this regex in the original list returned.
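To make the relationship between these fields concrete, the sketch below computes them for hand-written patterns using a greedy ordering. It is an assumption-laden illustration rather than the library's algorithm: CoverageSketch is a stand-in namedtuple with the same field names, full_incremental_coverage_sketch is a hypothetical function, and the interpretation of each field follows the one-line descriptions above.

```python
import re
from collections import Counter, OrderedDict, namedtuple

# Stand-in with the documented field names; not the library's Coverage class.
CoverageSketch = namedtuple('CoverageSketch', 'n n_uniq incr incr_uniq index')


def full_incremental_coverage_sketch(patterns, freqs, dedup=False):
    """Greedy ordering: repeatedly take the pattern that adds the most
    not-yet-matched examples, recording total and incremental counts.

    freqs maps each example string to its frequency. Patterns are assumed
    to be distinct, since they are used as dictionary keys."""
    compiled = [re.compile(p) for p in patterns]
    hits = [{s for s in freqs if c.match(s)} for c in compiled]
    matched = set()                          # strings claimed so far
    todo = sorted(range(len(patterns)), key=lambda i: patterns[i])
    result = OrderedDict()

    def weight(strings):
        return len(strings) if dedup else sum(freqs[s] for s in strings)

    while todo:
        # max() returns the first maximal item, so ties fall back to
        # pattern sort order.
        best = max(todo, key=lambda i: weight(hits[i] - matched))
        new = hits[best] - matched
        result[patterns[best]] = CoverageSketch(
            n=sum(freqs[s] for s in hits[best]),   # all matches, weighted
            n_uniq=len(hits[best]),                # distinct matches
            incr=sum(freqs[s] for s in new),       # new matches, weighted
            incr_uniq=len(new),                    # distinct new matches
            index=best,                            # position in input list
        )
        matched |= new
        todo.remove(best)
    return result


freqs = Counter({'abc': 5, 'de': 1, '123': 2, '45': 2})
patterns = [r'^[0-9]+$', r'^[a-z]{3}$', r'^[a-z]+$']
for rex, cov in full_incremental_coverage_sketch(patterns, freqs).items():
    print(rex, cov)
```

In this example the pattern ^[a-z]{3}$ matches five strings in total (n=5) but contributes nothing new (incr=0) once ^[a-z]+$ has been counted, which is the distinction the incr fields are meant to capture.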