TDDA's API for Rexpy

tdda.rexpy

Functions

Rexpy: Automatic Regular Expression Extraction

Given a set of example strings, rexpy infers one or more regular expressions that collectively match them all. When a single pattern would require a large alternation to cover structurally different inputs, rexpy produces multiple patterns instead, one per distinct structure, so that each remains readable and precise.

Algorithm

Extraction proceeds in the following stages:

  1. Coarse classification. Each character is assigned to a broad category: lowercase letter, uppercase letter, digit, punctuation, whitespace, or a catch-all. The result is a sequence of category codes for each string.

  2. Run-length encoding (RLE). Consecutive characters in the same category are collapsed into (category, count) pairs. For example, "hello123" becomes [(letter, 5), (digit, 3)].

  3. Variable run-length encoding (VRLE). RLEs that share the same sequence of categories but differ only in their counts are merged into a single variable pattern. For example, [(letter, 3)] and [(letter, 5)] both have the shape [letter] and merge into [(letter, 3, 5)].

  4. Fragment refinement. Within each fragment position of a VRLE, the actual characters seen across all matching examples are examined to find the narrowest character class that fits (for instance, hex digits rather than all alphanumeric characters).

  5. Pattern merging. The refined VRLEs are aligned and merged. Fixed literal sub-sequences that appear at a consistent position (relative to the start or end) across all patterns in a group are used as anchors: patterns are split at those anchors and the pieces are recursively merged. Patterns too different to reconcile remain as separate entries in the output.

  6. Regex generation. Each merged pattern is converted to a regular expression string in the requested output dialect ('portable', 'grep', 'java', 'perl', or 'posix').
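
Stages 1 to 3 and stage 6 can be sketched in a few lines of plain Python. This is a simplified illustration and not rexpy's implementation: the names classify, rle, vrle, to_regex and CLASS are hypothetical, the category set follows the worked examples above (a single letter category rather than separate upper and lower case), and refinement and anchored merging (stages 4 and 5) are omitted.

```python
import re
import string
from itertools import groupby

def classify(ch):
    """Stage 1: assign a character to a coarse category (illustrative names)."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    if ch in string.punctuation:
        return "punct"
    return "other"

def rle(s):
    """Stage 2: collapse runs of one category into (category, count) pairs."""
    return [(cat, len(list(run))) for cat, run in groupby(s, key=classify)]

def vrle(examples):
    """Stage 3: merge RLEs with the same category sequence into (category, min, max)."""
    groups = {}
    for s in examples:
        encoded = rle(s)
        shape = tuple(cat for cat, _ in encoded)
        counts = [n for _, n in encoded]
        if shape not in groups:
            groups[shape] = [(n, n) for n in counts]
        else:
            groups[shape] = [(min(lo, n), max(hi, n))
                             for (lo, hi), n in zip(groups[shape], counts)]
    return [[(cat, lo, hi) for cat, (lo, hi) in zip(shape, ranges)]
            for shape, ranges in groups.items()]

# Stage 6: one regex fragment per category (ASCII-only simplification).
CLASS = {
    "letter": "[A-Za-z]",
    "digit": r"\d",
    "space": r"\s",
    "punct": "[" + re.escape(string.punctuation) + "]",
    "other": ".",
}

def to_regex(pattern):
    """Convert one (category, min, max) pattern into an anchored regex string."""
    parts = []
    for cat, lo, hi in pattern:
        if lo == hi == 1:
            quant = ""
        elif lo == hi:
            quant = "{%d}" % lo
        else:
            quant = "{%d,%d}" % (lo, hi)
        parts.append(CLASS[cat] + quant)
    return "^" + "".join(parts) + "$"

examples = ["abc", "hello", "ab12"]
patterns = [to_regex(p) for p in vrle(examples)]
```

Here rle("hello123") gives [('letter', 5), ('digit', 3)], and the two letter-only examples merge into the single pattern [('letter', 3, 5)], matching the examples in stages 2 and 3.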

For large inputs, rexpy works on a sample first and then checks which examples the resulting patterns fail to match. Any failures are folded back into the working set and extraction is repeated, ensuring that even sampled runs achieve complete coverage.
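
The sample-then-verify loop described above might look roughly like the following. This is a sketch under stated assumptions: toy_extract is a hypothetical stand-in for the real extractor (it simply emits one pattern per distinct string length), and extract_with_sampling and sample_size are illustrative names, not part of the rexpy API.

```python
import random
import re

def toy_extract(strings):
    """Hypothetical stand-in for extraction: one pattern per distinct length."""
    return sorted({"^.{%d}$" % len(s) for s in strings})

def extract_with_sampling(examples, sample_size, seed=None):
    """Extract from a sample, then fold unmatched examples back in and repeat."""
    rng = random.Random(seed)
    working = rng.sample(examples, min(sample_size, len(examples)))
    while True:
        patterns = toy_extract(working)
        compiled = [re.compile(p) for p in patterns]
        # Check every example, not just the sample, against the patterns.
        failures = [s for s in examples
                    if not any(rx.match(s) for rx in compiled)]
        if not failures:
            return patterns
        # Add the failures to the working set and extract again.
        working.extend(failures)

examples = ["ab", "cd", "efg", "hijk", "lm", "nopqr"]
patterns = extract_with_sampling(examples, sample_size=2, seed=0)
```

Whatever the initial sample contains, the loop only terminates once every example is matched by at least one pattern.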

Python API

The simplest entry point is the extract() function; the Extractor class gives full control over the process and provides access to coverage statistics.

tdda.rexpy.rexpy.extract(examples, tag=False, encoding=None, as_object=False, extra_letters=None, full_escape=False, remove_empties=False, strip=False, variableLengthFrags=False, max_patterns=None, min_diff_strings_per_pattern=1, min_strings_per_pattern=1, size=None, seed=None, dialect='portable', verbose=0)

Extract regular expression(s) from examples and return them.

Normally examples should be unicode (i.e. str in Python 3). Byte strings can be passed provided encoding is specified; results are always unicode strings.

Parameters:
  • examples – The input strings to analyse. May be a list of strings, an integer-valued string-keyed dictionary (or Counter) where values are string frequencies, or a callable conforming to the example_check_function protocol.

  • tag (bool) – If True, return tagged (capturing-group) regular expressions.

  • encoding (str) – Encoding to use when decoding byte-string examples. None (the default) means examples are already unicode.

  • as_object (bool) – If True, return the Extractor object rather than the list of regular expressions. The expressions are then available as the returned object's results.rex attribute.

  • extra_letters (str) – Additional characters to treat as word letters when building character classes.

  • full_escape (bool) – If True, escape all special regex characters, not just those strictly necessary.

  • remove_empties (bool) – If True, discard empty strings from the examples before extraction.

  • strip (bool) – If True, strip leading and trailing whitespace from each example before extraction.

  • variableLengthFrags (bool) – If True, allow fragments of variable length within patterns.

  • max_patterns (int) – Maximum number of patterns to produce. None means no limit.

  • min_diff_strings_per_pattern (int) – Minimum distinct strings per pattern. Currently ignored.

  • min_strings_per_pattern (int) – Minimum total strings (counting duplicates) for a pattern to be retained.

  • size – Controls sampling. Pass a Size instance, None (the default) to use built-in defaults, or False/0 to disable sampling.

  • seed (int) – Random seed for reproducibility. None means non-deterministic.

  • dialect (str) – The kind of regular expression to emit. One of 'portable' (the default), 'grep', 'java', 'perl', or 'posix'.

  • verbose (int) – Verbosity level. 0 is silent, 1 prints progress information, 2 prints maximum detail.

Returns:

A list of regular expressions as unicode strings, or the Extractor object if as_object=True.
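
As an illustration of the accepted input shapes, the snippet below prepares a list of strings and an equivalent frequency dictionary using only the standard library. The sample data is made up, and the manual decoding step merely shows what passing byte strings together with encoding implies; it is not a description of how extract() handles them internally.

```python
from collections import Counter

# Hypothetical raw data: byte strings with one duplicate.
raw = [b"AB-12", b"CD-34", b"AB-12"]

# Decoding by hand yields the unicode strings that rexpy normally expects.
decoded = [b.decode("utf-8") for b in raw]

# The same examples as a string-keyed, integer-valued frequency mapping.
freqs = Counter(decoded)

# Either `decoded` or `freqs` has the shape described for `examples`.
```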

tdda.rexpy.rexpy.pdextract(cols, seed=None)

Extract regular expression(s) from one or more Pandas columns.

All columns provided should be string columns (i.e. of type object or categorical), possibly containing null values, which are ignored.

Example:

import numpy as np
import pandas as pd
from tdda.rexpy import pdextract

df = pd.DataFrame({'a3': ["one", "two", np.nan],
                   'a45': ['three', 'four', 'five']})

re3 = pdextract(df['a3'])
re45 = pdextract(df['a45'])
re345 = pdextract([df['a3'], df['a45']])

This should result in:

re3   = ['^[a-z]{3}$']
re45  = ['^[a-z]{4,5}$']
re345 = ['^[a-z]{3,5}$']

Parameters:
  • cols – A Pandas Series or list of Series objects.

  • seed (int) – Random seed for reproducibility. None (the default) means non-deterministic.

Returns:

A list of regular expressions as unicode strings.

Extractor Class

class tdda.rexpy.rexpy.Extractor(examples, extract=True, tag=False, extra_letters=None, full_escape=False, remove_empties=False, strip=False, variableLengthFrags=False, specialize=False, max_patterns=None, min_diff_strings_per_pattern=1, min_strings_per_pattern=1, size=None, seed=None, dialect='portable', verbose=0)

Regular expression extractor.

Given a set of examples, constructs a regular expression that characterizes them; or, if no single pattern suffices, a list of regular expressions that collectively cover the cases.

Results are stored in self.results once extraction has occurred, which happens by default on initialization but can be triggered manually by calling extract().

Parameters:
  • examples

    The input strings to analyse. May be:

    • a list of strings, one per example

    • an integer-valued, string-keyed dictionary (or Counter), where the values are taken as string frequencies (must be non-negative)

    • a callable conforming to the example_check_function protocol

  • extract (bool) – If True (the default), run extraction immediately on initialization. Set to False to defer extraction and call extract() manually.

  • tag (bool) – If True, return tagged (capturing-group) regular expressions instead of plain ones.

  • extra_letters (str) – Additional characters to treat as word letters when building character classes. By default only standard alphanumeric characters qualify.

  • full_escape (bool) – If True, escape all special regex characters, not just those strictly necessary.

  • remove_empties (bool) – If True, discard empty strings from the examples before extraction.

  • strip (bool) – If True, strip leading and trailing whitespace from each example before extraction.

  • variableLengthFrags (bool) – If True, allow fragments of variable length within patterns. Off by default, which produces simpler patterns.

  • specialize (bool) – If True, produce more specialized (narrower) patterns that match fewer strings outside the examples.

  • max_patterns (int) – Maximum number of patterns to produce. None (the default) means no limit.

  • min_diff_strings_per_pattern (int) – Minimum number of distinct strings a pattern must account for. Currently ignored.

  • min_strings_per_pattern (int) – Minimum total number of strings (counting duplicates) a pattern must account for to be retained. Default is 1.

  • size – Controls sampling behaviour. Pass a Size instance to tune the sampling parameters, None (the default) to use rexpy's built-in defaults, or False/0 to disable sampling entirely.

  • seed (int) – Random seed for reproducibility when sampling. None (the default) means non-deterministic.

  • dialect (str) – The kind of regular expression to emit. One of 'portable' (the default), 'grep', 'java', 'perl', or 'posix'.

  • verbose (int) – Verbosity level. 0 (the default) is silent. 1 prints some progress information. 2 prints maximum detail.

coverage(dedup=False)

Return match counts for each extracted regular expression.

Returns a list of counts in the same order as self.results.rex. Each count is the number of input strings matched by that pattern.

Parameters:

dedup (bool) – If True, count only distinct strings, ignoring duplicate frequencies.

Returns:

A list of match counts, one per regular expression in self.results.rex, in the same order.
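
The meaning of these counts can be reproduced with the standard re module. The sketch below is an illustration rather than the Extractor's own code: the standalone coverage function, the hand-written patterns and the frequency data are all assumptions made for the example.

```python
import re
from collections import Counter

def coverage(patterns, freqs, dedup=False):
    """Count matches per pattern; with dedup, each distinct string counts once."""
    counts = []
    for p in patterns:
        rx = re.compile(p)
        counts.append(sum(1 if dedup else n
                          for s, n in freqs.items() if rx.match(s)))
    return counts

# Hypothetical examples with frequencies, and two hand-written patterns.
freqs = Counter({"abc": 3, "de": 1, "123": 2})
patterns = [r"^[a-z]+$", r"^\d{3}$"]

raw_counts = coverage(patterns, freqs)            # [4, 2]
distinct_counts = coverage(patterns, freqs, True)  # [2, 1]
```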

extract()

Run the regular expression extraction.

Analyses the examples supplied at initialization and populates self.results with the extracted patterns. The regular expressions are then available as self.results.rex.

Called automatically on initialization unless extract=False was passed.

full_incremental_coverage(dedup=False, debug=False)

Return regular expressions sorted by incremental coverage, with full per-pattern statistics.

Like incremental_coverage(), but the dictionary values are Coverage objects rather than plain counts, giving both cumulative and incremental match statistics for each pattern.

Parameters:
  • dedup (bool) – If True, use deduplicated incremental coverage as the sort key rather than raw coverage.

  • debug (bool) – If True, print debugging information during computation.

Returns:

An OrderedDict mapping each regular expression to a Coverage object, in decreasing order of incremental coverage. See Coverage for the available attributes.

incremental_coverage(dedup=False, debug=False)

Return regular expressions sorted by incremental coverage.

Returns an ordered dictionary mapping each regular expression to the number of new (previously unmatched) examples it accounts for, sorted from most to fewest, with ties broken by pattern sort order.

Parameters:
  • dedup (bool) – If True, ignore string frequencies when counting and sorting.

  • debug (bool) – If True, print debugging information during computation.

Returns:

An OrderedDict mapping each regular expression to its new-match count, in decreasing order of incremental coverage.
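
One way to picture incremental coverage is as a greedy selection: repeatedly pick the pattern that matches the most still-unmatched examples, then remove those examples. The sketch below assumes that greedy strategy and uses hand-written patterns; it is an illustration, not the method's actual implementation.

```python
import re
from collections import OrderedDict

def gain(pattern, remaining, dedup):
    """Number of still-unmatched examples that `pattern` would account for."""
    rx = re.compile(pattern)
    return sum(1 if dedup else n for s, n in remaining.items() if rx.match(s))

def incremental_coverage(patterns, freqs, dedup=False):
    """Greedy ordering by new matches; ties fall back to pattern sort order."""
    remaining = dict(freqs)
    pending = sorted(patterns)
    result = OrderedDict()
    while pending:
        # max() returns the first maximal item, so sorted order breaks ties.
        best = max(pending, key=lambda p: gain(p, remaining, dedup))
        result[best] = gain(best, remaining, dedup)
        rx = re.compile(best)
        remaining = {s: n for s, n in remaining.items() if not rx.match(s)}
        pending.remove(best)
    return result

# Hypothetical data: the third pattern adds nothing once the first is chosen.
freqs = {"abc": 3, "de": 1, "123": 2}
patterns = [r"^[a-z]{3}$", r"^\d+$", r"^[a-z]+$"]
ordered = incremental_coverage(patterns, freqs)
```

In this example the result lists ^[a-z]+$ with 4 new matches, then ^\d+$ with 2, then ^[a-z]{3}$ with 0, because every string it matches was already accounted for.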

n_examples(dedup=False)

Return the total number of examples used for extraction.

Parameters:

dedup (bool) – If True, return the number of distinct examples; otherwise return the total count including duplicates. In both cases the examples are counted after stripping, when strip=True was requested.

Returns:

The number of examples as an integer.
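
The difference between the two counts, and the effect of stripping first, can be seen with a Counter. This is a sketch with made-up data; the standalone n_examples function only mirrors the behaviour described above and is not the method itself.

```python
from collections import Counter

def n_examples(freqs, dedup=False):
    """Distinct strings if dedup, otherwise the total including duplicates."""
    return len(freqs) if dedup else sum(freqs.values())

# Hypothetical inputs; stripping makes "abc " and "abc" the same example.
raw = ["abc ", "abc", " de", "de", "xyz"]
freqs = Counter(s.strip() for s in raw)

total = n_examples(freqs)                 # 5
distinct = n_examples(freqs, dedup=True)  # 3
```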

Coverage Object

class tdda.rexpy.rexpy.Coverage(n, n_uniq, incr, incr_uniq, index)

Container for coverage information.

n

Number of matches.

n_uniq

Number of matches, deduplicating strings.

incr

Number of new (previously unmatched) matches for this regex.

incr_uniq

Number of new (previously unmatched) matches for this regex, deduplicating strings.

index

Index of this regex in the original list returned.
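
To show how the five fields relate, the sketch below fills in a local namedtuple with the same field names for a few hand-written patterns. It is an assumption-laden illustration: the namedtuple is a stand-in for the real class, coverage_records is a hypothetical helper, and patterns are processed in the order given rather than re-sorted by incremental coverage.

```python
import re
from collections import namedtuple

# Local stand-in with the same field names as the documented class.
Coverage = namedtuple("Coverage", "n n_uniq incr incr_uniq index")

def coverage_records(patterns, freqs):
    """Build one record per pattern, tracking which strings are already matched."""
    seen = set()
    records = []
    for index, p in enumerate(patterns):
        rx = re.compile(p)
        matched = {s for s in freqs if rx.match(s)}
        new = matched - seen  # strings no earlier pattern accounted for
        records.append(Coverage(
            n=sum(freqs[s] for s in matched),
            n_uniq=len(matched),
            incr=sum(freqs[s] for s in new),
            incr_uniq=len(new),
            index=index,
        ))
        seen |= matched
    return records

# Hypothetical data: the second pattern matches only already-covered strings.
freqs = {"abc": 3, "de": 1, "123": 2}
patterns = [r"^[a-z]+$", r"^[a-z]{3}$", r"^\d+$"]
records = coverage_records(patterns, freqs)
```

The second record has n=3 and n_uniq=1 but incr=0 and incr_uniq=0, since "abc" was already matched by the first pattern.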