TDDA’s API for Rexpy

tdda.rexpy

Python API

The tdda.rexpy.rexpy module provides a Python API, to allow discovery of regular expressions to be incorporated into other Python programs.

class tdda.rexpy.rexpy.Coverage

Container for coverage information.

Attributes:

  • n: number of matches
  • n_unique: number matches, deduplicating strings
  • incr: number of new (unique) matches for this regex
  • incr_uniq: number of new (unique) deduplicated matches for this regex
  • index: index of this regex in original list returned.
class tdda.rexpy.rexpy.Extractor(examples, extract=True, tag=False, extra_letters=None, full_escape=False, remove_empties=False, strip=False, variableLengthFrags=False, specialize=False, max_patterns=None, min_diff_strings_per_pattern=1, min_strings_per_pattern=1, size=None, seed=None, dialect='portable', verbose=0)

Regular expression ‘extractor’.

Given a set of examples, this tries to construct a useful regular expression that characterizes them; failing which, a list of regular expressions that collectively cover the cases.

Results are stored in self.results once extraction has occurred, which happens by default on initialization, but can be invoked manually.

The examples may be given as a list of strings, a integer-valued, string-keyed dictionary or a function.

  • If it’s a list, each string in the list is an example string
  • It it’s a dictionary (or counter), each string is to be used, and the values are taken as frequencies (should be non-negative)
  • If it’s a function, it should be as specified below (see the definition of example_check_function)
size can be provided as:
  • a Size() instance, to control various sizes within rexpy
  • None (the default), in which case rexpy’s defaults are used
  • False or 0, which means don’t use sampling

Verbose is usually 0 or False. It can be to True or 1 for various extra output, and to higher numbers for even more verbose output. The highest level currently used is 2.

aligned_parts(parts)

Given a list of parts, each consisting of the fragments from a set of partially aligned patterns, show them aligned, and in a somewhat ambigous, numbered, fairly human-readable, compact form.

analyse_fragments(vrle, v_id)

Analyse the contents of each fragment in vrle across the examples it matches.

Return zip of

  • the characters in each fragment
  • the strings in each fragment
  • the run-length encoded fine classes in each fragment
  • the run-length encoded characters in each fragment
  • the fragment itself

all indexed on the (zero-based) group number.

batch_extract()

Find regular expressions for a batch of examples (as given).

build_tree_inner(vrles, fulls=None)

Turn the VRLEs into a tree based on the different initial fragments.

check_for_failures(rexes, maxExamples)

This method is the default check_fn (See the definition of example_check_function below.)

clean(examples)

Compute length of each string and count number of examples of each length.

coarse_classify(s)

Classify each character in a string into one of the coarse categories

coarse_classify_char(c)

Classify character into one of the coarse categories

coverage(dedup=False)

Get a list of frequencies for each regular expression, i.e the number of the (stripped) input strings it matches. The list is in the same order as the regular expressions in self.results.rex.

If dedup is set to True, shows only the number of distinct (stripped) input examples matches

extract()

Actually perform the regular expression ‘extraction’.

find_bad_patterns(freqs)

Given freqs, a list of frequencies (for the corresponding indexed RE) identify indexes to patterns that:

  • have too few strings
  • cause too many patterns to be returned

NOTE: min_diff_strings_per_pattern is currently ignored.

Returns set of indices for deletion

find_frag_sep_frag_repeated(tree, existing=None, results=None)

This specifically looks for patterns in the tree constructed by self.build_tree of the general form

A ABA ABABA

etc., where A and B are both fragments. A common example is that A is recognizably a pattern and B is recognizably a separator. So, for example:

HM MH-RT QY-TR-BF QK-YT-IU-QP

all fit

[A-Z]{2}(-[A-Z])*

which, as fragments, would be A = (C, 2, 2) and B = (‘.’, 1, 1).

The tree is currently not recording or taking account of frequency in the fragments. We might want to do that. Or we might leave that as a job for whatever is going to consolidate the different branches of the tree that match the repeating patterns returned.

find_non_matches(rexes)

Returns all example strings that do not match any of the regular expressions in results, together with their frequencies.

fine_class(c)

Map a character in coarse class ‘C’ (AlphaNumeric) to a fine class.

fragment2re(fragment, tagged=False, as_re=True, output=False)

Convert fragment to RE.

If output is set, this is for final output, and should be in the specified dialect (if any).

full_incremental_coverage(dedup=False, debug=False)

Returns an ordered dictionary of regular expressions, sorted by the number of new examples they match/explain, from most to fewest, with ties broken by pattern sort order. The values in the results dictionary are the numbers of (new) examples matched.

If dedup is set to True, frequencies are ignored in the sort order.

Each result is a Coverage object with the following attributes:

n:
number of examples matched including duplicates
n_uniq:
number of examples matched, excluding duplicates
incr:
number of previously unmatched examples matched, including duplicates
incr_uniq:
number of previously unmatched examples matched, excluding duplicates
incremental_coverage(dedup=False, debug=False)

Returns an ordered dictionary of regular expressions, sorted by the number of new examples they match/explain, from most to fewest, with ties broken by pattern sort order. The values in the results dictionary are the numbers of (new) examples matched.

If dedup is set to True, frequencies are ignored.

merge_fixed_omnipresent_at_pos(patterns)

Find unusual columns in fixed positions relative to ends. Align those, split and recurse

merge_fixed_only_present_at_pos(patterns)

Find unusual columns in fixed positions relative to ends. Align those Split and recurse

n_examples(dedup=False)

Returns the total number of examples used by rexpy. If dedup is set to True, this the number of different examples, otherwise it is the “raw” number of examples. In all cases, examples have been stripped.

refine_fragments(vrle, v_id)

Refine the categories for variable-run-length-encoded pattern (vrle) provided by narrowing the characters in each fragment.

rle2re(rles, tagged=False, as_re=True)

Convert run-length-encoded code string to regular expression

rle_fc_c(s, pattern, rlefc_in, rlec_in)
Convert a string, matching a ‘C’-(fragment) pattern, to
  • a run-length encoded sequence of fine classes
  • a run-length encoded sequence of characters
Given inputs:
s — a string representing the actual substring of an
example that matches a pattern fragment described by pattern

pattern — a VRLE of coarse classes

rlefc_in — a VRLE of fine classes, or None, or False

rlec_in — a VRLE of characters, or None, or False

Returns new rlefc and rlec, each of which is:

False, if the string doesn’t match the corresponding input VRLE

a possibly expanded VRLE, if it does match, or would match if expanded (by allowing more of fewer repetitions).

run_length_encode_coarse_classes(s)

Returns run-length encoded coarse classification

sample(n)

Sample self.all_examples for potentially faster induction.

sample_examples(examples, n)

Sample examples provided for potentially faster induction.

specialize(patterns)

Check all the catpure groups in each patterns and simplify any that are sufficiently low frequency.

vrle2re(vrles, tagged=False, as_re=True, output=False)

Convert variable run-length-encoded code string to regular expression

If output is set, this is for final output, and should be in the specified dialect (if any).

vrle2refrags(vrles, output=False)

Convert variable run-length-encoded code string to regular expression and list of fragments

class tdda.rexpy.rexpy.Fragment

Container for a fragment.

Attributes:

  • re: the regular expression for the fragment
  • group: True if it forms a capture group (i.e. is not constant)
class tdda.rexpy.rexpy.IDCounter

Rather Like a counter, but also assigns a numeric ID (from 1) to each key and actually builds the counter on that.

Use .add to increment an existing key’s count, or to initialize it to (by default) 1.

Get the key’s ID with .ids[key] or .keys.get(key).

add(key, freq=1)

Adds the given key, counting it and ensuring it has an id.

Returns the id.

getitem(key)

Gets the count for the key

class tdda.rexpy.rexpy.PRNGState(n)

Seeds the Python PRNG and after captures its state.

restore() cam be used to set them back to the captured state.

tdda.rexpy.rexpy.capture_group(s)

Places parentheses around s to form a capure group (a tagged piece of a regular expression), unless it is already a capture group.

tdda.rexpy.rexpy.cre(rex)

Compiled regular expression Memoized implementation.

tdda.rexpy.rexpy.escaped_bracket(chars, dialect=None, inner=False)

Construct a regular expression Bracket (character class), obeying the special regex rules for escaping these:

  • Characters do not, in general need to be escaped
  • If there is a close bracket (“]”) it mst be the first character
  • If there is a hyphen (“-”) it must be the last character
  • If there is a carat (“^”), it must not be the first character
  • If there is a backslash, it’s probably best to escape it. Some implementations don’t require this, but it will rarely do any harm, and most implementation understand at least some escape sequences (“w”, “W”, “d”, “s” etc.), so escaping seems prudent.

However, javascript and ruby do not follow the unescaped “]” as the first character rule, so if either of these dialects is specified, the “]” will be escaped (but still placed in the first position.

If inner is set to True, the result is returned without brackets.

tdda.rexpy.rexpy.example_check_function(rexes, maxN=None)

CHECK FUNCTIONS This is an example check function

A check function takes a list of regular expressions (as strings) and optionally, a maximum number of (different) strings to return.

It should return two things:

  • An Examples object (importing that class from rexpy.py) containing strings that don’t match any of the regular expressions in the list provided. (If the list is empty, all strings are candidates to be returned.)
  • a list of how many strings matched each regular expression provided (in the same order).

If maxN is None, it should return all strings that fail to match; if it is a number, that is the maximum number of (distinct) failures to return. The function is expected to return all failures, however, if there are fewer than maxN failures (i.e., it’s not OK if maxN is 20 to return just 1 failiing string if actually 5 different strings fail.)

Examples: The examples object is initialized with a list of (distinct) failing strings, and optionally a corresponding list of their frequencies. If no frequencies are provided, all frequencies will be set to 1 when the Examples object is initialized.

The regular expression match frequencies are used to eliminate low-frequency or low-ranked regular expressions. It is not essential that the values cover all candidate strings; it is enough to give frequencies for those strings tested before maxN failures are generated.

(Normally, the regular expressions provided will be exclusive, i.e. at most one will match, so it’s also fine only to test a string against regular expressions until a match is found…you don’t need to test against other patterns in case the string also matches more than one.)

tdda.rexpy.rexpy.expand_or_falsify_vrle(rle, vrle, fixed=False, variableLength=False)
Given a run-length encoded sequence
(e.g. [('A', 3), ('B', 4)])
and (usually) a variable run-length encoded sequence
(e.g. [('A', 2, 3), ('B', 1, 2)])

expand the VRLE to include the case of the RLE, if they can be consistent.

If they cannot, return False.

If vrle is None, this indicates it hasn’t been found yet, so rle is simply expanded to a VRLE.

If vrle is False, this indicates that a counterexample has already been found, so False is returned again.

If variableLength is set to True, patterns will be merged even if it is a different length from the vrle, as long as the overlapping part is consistent.

tdda.rexpy.rexpy.extract(examples, tag=False, encoding=None, as_object=False, extra_letters=None, full_escape=False, remove_empties=False, strip=False, variableLengthFrags=False, max_patterns=None, min_diff_strings_per_pattern=1, min_strings_per_pattern=1, size=None, seed=None, dialect='portable', verbose=0)

Extract regular expression(s) from examples and return them.

Normally, examples should be unicode (i.e. str in Python3, and unicode in Python2). However, encoded strings can be passed in provided the encoding is specified.

Results will always be unicode.

If as_object is set, the extractor object is returned, with results in .results.rex; otherwise, a list of regular expressions, as unicode strings is returned.

tdda.rexpy.rexpy.get_omnipresent_at_pos(fragFreqCounters, n, **kwargs)

Find patterns in fragFreqCounters for which the frequency is n.

fragFreqCounters is a dictionary (usually keyed on ‘fragments’) of whose values are dictionaries mapping positions to frequencies.

For example:

{
    ('a', 1, 1, 'fixed'): {1: 7, -1: 7, 3: 4},
    ('b', 1, 1, 'fixed'): {2: 6, 3: 4},
}

This indicates that the pattern ('a', 1, 1, 'fixed') has frequency 7 at positions 1 and -1, and frequency 4 at position 3, while pattern ('b', 1, 1, 'fixed') has frequency 6 at position 2 and 4 at position 3.

With n set to 7, this returns:

[
    (('a', 1, 1, 'fixed'), -1)
    (('a', 1, 1, 'fixed'), 1),
]

(sorted on pos; each pos really should occur at most once.)

tdda.rexpy.rexpy.get_only_present_at_pos(fragFreqCounters, *args, **kwargs)

Find patterns in fragFreqCounters that, when present, are always at the same position.

fragFreqCounters is a dictionary (usually keyed on fragments) of whose values are dictionaries mapping positions to frequencies.

For example:

{
    ('a', 1, 1, 'fixed'): {1: 7, -1: 7, 3: 4},
    ('b', 1, 1, 'fixed'): {2: 6},
}
This indicates that the
  • pattern ('a', 1, 1, 'fixed') has frequency 7 at positions 1 and -1, and frequency 4 at position 3;
  • pattern ('b', 1, 1, 'fixed') has frequency 6 at position 2 (only)

So this would return:

[
    (('b', 1, 1, 'fixed'), 2)
]

(sorted on pos; each pos really should occur at most once.)

tdda.rexpy.rexpy.left_parts(patterns, fixed)

patterns is a list of patterns each consisting of a list of frags.

fixed is a list of (fragment, position) pairs, sorted on position, specifying points at which to split the patterns.

This function returns a list of lists of pattern fragments, split at each fixed position.

tdda.rexpy.rexpy.length_stats(patterns)

Given a list of patterns, returns named tuple containing

all_same_length:
boolean, True if all patterns are the same length
max_length:
length of the longest pattern in patterns
tdda.rexpy.rexpy.matrices2incremental_coverage(patterns, matrix, deduped, indexes, examples, sort_on_deduped=False)

Find patterns, in (descending) order of # of matches, and pull out freqs.

Then set overlapping matches to zero and repeat.

Returns ordered dict, sorted by incremental match rate, with number of (previously unaccounted for) strings matched.

tdda.rexpy.rexpy.nvl(v, w)

This function is used as syntactic sugar for replacing null values.

tdda.rexpy.rexpy.pdextract(cols, seed=None)

Extract regular expression(s) from the Pandas column (Series) object or list of Pandas columns given.

All columns provided should be string columns (i.e. of type np.dtype(‘O’), possibly including null values, which will be ignored.

Example use:

import numpy as np
import pandas as pd
from tdda.rexpy import pdextract

df = pd.DataFrame({'a3': ["one", "two", np.NaN],
                   'a45': ['three', 'four', 'five']})

re3 = pdextract(df['a3'])
re45 = pdextract(df['a45'])
re345 = pdextract([df['a3'], df['a45']])

This should result in:

re3   = '^[a-z]{3}$'
re5   = '^[a-z]{3}$'
re345 = '^[a-z]{3}$'
tdda.rexpy.rexpy.rex_coverage(patterns, examples, dedup=False)

Given a list of regular expressions and a dictionary of examples and their frequencies, this counts the number of times each pattern matches a an example.

If dedup is set to True, the frequencies are ignored, so that only the number of keys is returned.

tdda.rexpy.rexpy.rex_full_incremental_coverage(patterns, examples, sort_on_deduped=False, debug=False)

Returns an ordered dictionary containing, keyed on terminated regular expressions, from patterns, sorted in decreasing order of incremental coverage, i.e. with the pattern matching the most first, followed by the one matching the most remaining examples etc.

If dedup is set to True, the ordering ignores duplicate examples; otherise, duplicates help determine the sort order.

Each entry in the dictionary returned is a Coverage object with the following attributes:

n:
number of examples matched including duplicatesb
n_uniq:
number of examples matched, excluding duplicates
incr:
number of previously unmatched examples matched, including duplicates
incr_uniq:
number of previously unmatched examples matched, excluding duplicates
tdda.rexpy.rexpy.rex_incremental_coverage(patterns, examples, sort_on_deduped=False, debug=False)

Given a list of regular expressions and a dictionary of examples and their frequencies, this computes their incremental coverage, i.e. it produces an ordered dictionary, sorted from the “most useful” patterns (the one that matches the most examples) to the least useful. Usefulness is defined as “matching the most previously unmatched patterns”. The dictionary entries are the number of (new) matches for the pattern.

If dedup is set to True, the frequencies are ignored when computing match rate; if set to false, patterns get credit for the nmultiplicity of examples they match.

Ties are broken by lexical order of the (terminated) patterns.

For example, given patterns p1, p2, and p3, and examples e1, e2 and e3, with a match profile as follows (where the numbers are multiplicities)

example p1 p2 p3
e1 2 2 0
e2 0 3 3
e3 1 0 0
e4 0 0 4
e5 1 0 1
TOTAL 4 4 8

If dedup is False this would produce:

OrderedDict(
    (p3, 8),
    (p1, 3),
    (p2, 0)
)

because:

  • p3 matches the most, with 8
  • Of the strings unmatched by p3, p1 accounts for 3 (e1 x 2 and e3 x 1) whereas p2 accounts for no new strings.

With dedup set to True, the matrix transforms to

example p1 p2 p3
e1 1 1 0
e2 0 1 1
e3 1 0 0
e4 0 0 1
e5 1 0 1
TOTAL 3 2 3

So p1 and p3 are tied.

If we assume the p1 sorts before p3, the result would then be:

OrderedDict(
    (p1, 3),
    (p3, 2),
    (p2, 0)
)
tdda.rexpy.rexpy.rexpy_streams(in_path=None, out_path=None, skip_header=False, quote=False, **kwargs)
in_path is
None: to read inputs from stdin path to file: to read inputs from file at in_path list of strings: to use those strings as the inputs
out_path is:
None: to write outputs to stdout path to file: to write outputs from file at out_path False: to return the strings as a list
tdda.rexpy.rexpy.right_parts(patterns, fixed)

patterns is a list of patterns each consisting of a list of frags.

fixed is a list of (fragment, pos) pairs where position specifies the position from the right, i.e a position that can be indexed as -position.

Fixed should be sorted, increasing on position, i.e. sorted from the right-most pattern. The positions specify points at which to split the patterns.

This function returns a list of lists of pattern fragments, split at each fixed position.

tdda.rexpy.rexpy.run_length_encode(s)

Return run-length-encoding of string s, e.g.:

'CCC-BB-A' --> (('C', 3), ('-', 1), ('B', 2), ('-', 1), ('A', 1))
tdda.rexpy.rexpy.signature(rle)

Return the sequence of characters in a run-length encoding (i.e. the signature).

Also works with variable run-length encodings

tdda.rexpy.rexpy.terminate_patterns_and_sort(patterns)

Given a list of regular expressions, this terminates any that are not and returns them in sorted order. Also returns a list of the original indexes of the results.

tdda.rexpy.rexpy.to_vrles(rles)

Convert a list of run-length encodings to a list of variable run-length encodings, one for each common signature.

For example, given inputs of:

    (('C', 2),)
    (('C', 3),)
and (('C', 2), ('.', 1))

this would return:

    (('C', 2, 3),)
and (('C', 2, 2), ('.', 1, 1))