Rexpy

Rexpy infers regular expressions on a line-by-line basis from text data examples.

To run the rexpy tool:

tdda rexpy [inputfile]

Command-line Tool

Usage:

rexpy [FLAGS] [input file [output file]]

If input file is provided, it should contain one string per line; otherwise lines will be read from standard input.

If output file is provided, regular expressions found will be written to that (one per line); otherwise they will be printed.

FLAGS are optional flags. Currently:

-h, --header      Discard first line, as a header.

-?, --help        Print this usage information and exit (without error)

-g, --group       Generate capture groups for each variable fragment
                  of each regular expression generated, i.e. surround
                  variable components with parentheses
                      e.g.     '^[A-Z]+\-[0-9]+$'
                      becomes  '^([A-Z]+)\-([0-9]+)$'

-q, --quote       Display the resulting regular expressions as
                  double-quoted, escaped strings, in a form broadly
                  suitable for use in Unix shells, JSON, and string
                  literals in many programming languages.
                      e.g.     ^[A-Z]+\-[0-9]+$
                      becomes  "^[A-Z]+\-[0-9]+$"

-u, --underscore  Allow underscore to be treated as a letter.
                  Mostly useful for matching identifiers.
                  Also -_.

-d, --dot         Allow dot to be treated as a letter.
                  Mostly useful for matching identifiers.
                  Also -. --period.

-m, --minus       Allow minus to be treated as a letter.
                  Mostly useful for matching identifiers.
                  Also --hyphen or --dash.

-v, --version     Print the version number.

-V, --verbose     Set verbosity level to 1.

-VV, --Verbose    Set verbosity level to 2.

-vlf, --variable  Use variable-length fragments.

-flf, --fixed     Use fixed-length fragments.

Python API

The tdda.rexpy.rexpy module also provides a Python API, to allow discovery of regular expressions to be incorporated into other Python programs.

class tdda.rexpy.rexpy.Coverage

Container for coverage information.

Attributes:

  • n: number of matches
  • n_unique: number of matches, after deduplicating strings
  • incr: number of new matches for this regex
  • incr_uniq: number of new, deduplicated matches for this regex
  • index: index of this regex in the original list returned.

class tdda.rexpy.rexpy.Extractor(examples, extract=True, tag=False, extra_letters=None, full_escape=False, remove_empties=False, strip=False, variableLengthFrags=False, specialize=False, max_patterns=None, min_diff_strings_per_pattern=1, min_strings_per_pattern=1, seed=None, dialect=None, verbose=0)

Regular expression ‘extractor’.

Given a set of examples, this tries to construct a useful regular expression that characterizes them; failing which, a list of regular expressions that collectively cover the cases.

Results are stored in self.results once extraction has occurred, which happens by default on initialization, but can be invoked manually.

The examples may be given as a list or as a dictionary: if a dictionary, the values are assumed to be string frequencies.

Verbose is usually 0 or False. It can be set to True or 1 for various extra output, and to higher numbers for even more verbose output. The highest level currently used is 2.

aligned_parts(parts)

Given a list of parts, each consisting of the fragments from a set of partially aligned patterns, show them aligned, and in a somewhat ambiguous, numbered, fairly human-readable, compact form.

analyse_groups(pattern, examples)

Analyse the contents of each group (fragment) in pattern across the examples it matches.

Return zip of
  • the characters in each group
  • the strings in each group
  • the run-length encoded fine classes in each group
  • the run-length encoded characters in each group
  • the group itself

all indexed on the (zero-based) group number.

batch_extract(examples)

Find regular expressions for a batch of examples (as given).

clean(examples)

Compute length of each string and count number of examples of each length.

coarse_classify(s)

Classify each character in a string into one of the coarse categories.

coarse_classify_char(c)

Classify a character into one of the coarse categories.

coverage(dedup=False)

Get a list of frequencies for each regular expression, i.e. the number of (stripped) input strings it matches. The list is in the same order as the regular expressions in self.results.rex.

If dedup is set to True, shows only the number of distinct (stripped) input examples matched.

extract()

Actually perform the regular expression ‘extraction’.

find_non_matches()

Returns all example strings that do not match any of the regular expressions in results.

fine_class(c)

Map a character in coarse class ‘C’ (AlphaNumeric) to a fine class.

full_incremental_coverage(dedup=False, debug=False)

Returns an ordered dictionary of regular expressions, sorted by the number of new examples they match/explain, from most to fewest, with ties broken by pattern sort order. The values in the results dictionary are the numbers of (new) examples matched.

If dedup is set to True, frequencies are ignored.

Each result is a Coverage object with the following attributes:

n:
number of examples matched including duplicates
n_uniq:
number of examples matched, excluding duplicates
incr:
number of previously unmatched examples matched, including duplicates
incr_uniq:
number of previously unmatched examples matched, excluding duplicates

incremental_coverage(dedup=False, debug=False)

Returns an ordered dictionary of regular expressions, sorted by the number of new examples they match/explain, from most to fewest, with ties broken by pattern sort order. The values in the results dictionary are the numbers of (new) examples matched.

If dedup is set to True, frequencies are ignored.

merge_fixed_omnipresent_at_pos(patterns)

Find unusual columns in fixed positions relative to ends. Align those, split, and recurse.

merge_fixed_only_present_at_pos(patterns)

Find unusual columns in fixed positions relative to ends. Align those, split, and recurse.

n_examples(dedup=False)

Returns the total number of examples used by rexpy. If dedup is set to True, this is the number of different examples; otherwise it is the “raw” number of examples. In all cases, examples have been stripped.

refine_groups(pattern, examples)

Refine the categories for variable run-length-encoded patterns provided by narrowing the characters in the groups.

rle2re(rles, tagged=False, as_re=True)

Convert run-length-encoded code string to regular expression

rle_fc_c(s, pattern, rlefc_in, rlec_in)

Convert a string, matching a ‘C’-(fragment) pattern, to
  • a run-length encoded sequence of fine classes
  • a run-length encoded sequence of characters
Given inputs:
s — a string representing the actual substring of an
example that matches a pattern fragment described by pattern

pattern — a VRLE of coarse classes

rlefc_in — a VRLE of fine classes, or None, or False

rlec_in — a VRLE of characters, or None, or False

Returns new rlefc and rlec, each of which is:

False, if the string doesn’t match the corresponding input VRLE

a possibly expanded VRLE, if it does match, or would match if expanded (by allowing more or fewer repetitions).

run_length_encode_coarse_classes(s)

Returns run-length encoded coarse classification

sample(nPerLength)

Sample strings for potentially faster induction. Only used if over a hundred million distinct strings are given. For now.

specialize(patterns)

Check all the capture groups in each pattern and simplify any that have sufficiently low frequency.

vrle2re(vrles, tagged=False, as_re=True)

Convert variable run-length-encoded code string to regular expression

vrle2refrags(vrles)

Convert variable run-length-encoded code string to regular expression and list of fragments

class tdda.rexpy.rexpy.Fragment

Container for a fragment.

Attributes:

  • re: the regular expression for the fragment
  • group: True if it forms a capture group (i.e. is not constant)

tdda.rexpy.rexpy.capture_group(s)

Places parentheses around s to form a capture group (a tagged piece of a regular expression), unless it is already a capture group.
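The behaviour can be approximated as follows. This is an illustrative sketch (the function name is hypothetical), not the library implementation, whose detection of existing groups may be more careful:

```python
def capture_group_sketch(s):
    # Wrap s in parentheses unless it already looks like a capture group.
    # (Illustrative only: a regex that merely starts with '(' and ends
    # with ')' is treated as already grouped here.)
    already = s.startswith('(') and s.endswith(')')
    return s if already else '(%s)' % s

print(capture_group_sketch('[A-Z]+'))     # ([A-Z]+)
print(capture_group_sketch('([A-Z]+)'))   # unchanged
```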

tdda.rexpy.rexpy.cre(rex)

Return a compiled regular expression. Memoized implementation.

tdda.rexpy.rexpy.escaped_bracket(chars, dialect=None, inner=False)

Construct a regular expression Bracket (character class), obeying the special regex rules for escaping these:

  • Characters do not, in general, need to be escaped
  • If there is a close bracket (“]”) it must be the first character
  • If there is a hyphen (“-”) it must be the last character
  • If there is a caret (“^”), it must not be the first character
  • If there is a backslash, it’s probably best to escape it. Some implementations don’t require this, but it will rarely do any harm, and most implementations understand at least some escape sequences (“\w”, “\W”, “\d”, “\s” etc.), so escaping seems prudent.

However, JavaScript and Ruby do not follow the unescaped “]” as the first character rule, so if either of these dialects is specified, the “]” will be escaped (but still placed in the first position).

If inner is set to True, the result is returned without brackets.

tdda.rexpy.rexpy.expand_or_falsify_vrle(rle, vrle, fixed=False, variableLength=False)

Given a run-length encoded sequence
(e.g. [('A', 3), ('B', 4)])
and (usually) a variable run-length encoded sequence
(e.g. [('A', 2, 3), ('B', 1, 2)])

expand the VRLE to include the case of the RLE, if they can be consistent.

If they cannot, return False.

If vrle is None, this indicates it hasn’t been found yet, so rle is simply expanded to a VRLE.

If vrle is False, this indicates that a counterexample has already been found, so False is returned again.

If variableLength is set to True, the rle will be merged even if it is a different length from the vrle, as long as the overlapping part is consistent.

tdda.rexpy.rexpy.extract(examples, tag=False, encoding=None, as_object=False, extra_letters=None, full_escape=False, remove_empties=False, strip=False, variableLengthFrags=False, max_patterns=None, min_diff_strings_per_pattern=1, min_strings_per_pattern=1, seed=None, dialect=None, verbose=0)

Extract regular expression(s) from examples and return them.

Normally, examples should be unicode (i.e. str in Python3, and unicode in Python2). However, encoded strings can be passed in provided the encoding is specified.

Results will always be unicode.

If as_object is set, the extractor object is returned, with results in .results.rex; otherwise, a list of regular expressions, as unicode strings is returned.

tdda.rexpy.rexpy.get_omnipresent_at_pos(fragFreqCounters, n, **kwargs)

Find patterns in fragFreqCounters for which the frequency is n.

fragFreqCounters is a dictionary (usually keyed on ‘fragments’) whose values are dictionaries mapping positions to frequencies.

For example:

{
    ('a', 1, 1, 'fixed'): {1: 7, -1: 7, 3: 4},
    ('b', 1, 1, 'fixed'): {2: 6, 3: 4},
}

This indicates that the pattern ('a', 1, 1, 'fixed') has frequency 7 at positions 1 and -1, and frequency 4 at position 3, while pattern ('b', 1, 1, 'fixed') has frequency 6 at position 2 and 4 at position 3.

With n set to 7, this returns:

[
    (('a', 1, 1, 'fixed'), -1),
    (('a', 1, 1, 'fixed'), 1),
]

(sorted on pos; each pos really should occur at most once.)
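In outline, the behaviour on the example above can be sketched as follows. This is an illustrative reimplementation under a hypothetical name, not the library code:

```python
def get_omnipresent_at_pos_sketch(frag_freq_counters, n):
    # Collect (fragment, position) pairs whose frequency is exactly n,
    # sorted on position.
    pairs = [(frag, pos)
             for frag, freqs in frag_freq_counters.items()
             for pos, freq in freqs.items()
             if freq == n]
    return sorted(pairs, key=lambda fp: fp[1])

counters = {
    ('a', 1, 1, 'fixed'): {1: 7, -1: 7, 3: 4},
    ('b', 1, 1, 'fixed'): {2: 6, 3: 4},
}
print(get_omnipresent_at_pos_sketch(counters, 7))
# [(('a', 1, 1, 'fixed'), -1), (('a', 1, 1, 'fixed'), 1)]
```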

tdda.rexpy.rexpy.get_only_present_at_pos(fragFreqCounters, *args, **kwargs)

Find patterns in fragFreqCounters that, when present, are always at the same position.

fragFreqCounters is a dictionary (usually keyed on fragments) whose values are dictionaries mapping positions to frequencies.

For example:

{
    ('a', 1, 1, 'fixed'): {1: 7, -1: 7, 3: 4},
    ('b', 1, 1, 'fixed'): {2: 6},
}

This indicates that:
  • pattern ('a', 1, 1, 'fixed') has frequency 7 at positions 1 and -1, and frequency 4 at position 3;
  • pattern ('b', 1, 1, 'fixed') has frequency 6 at position 2 (only).

So this would return:

[
    (('b', 1, 1, 'fixed'), 2)
]

(sorted on pos; each pos really should occur at most once.)
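The selection on the example above can be sketched as follows (an illustrative reimplementation under a hypothetical name, not the library code):

```python
def get_only_present_at_pos_sketch(frag_freq_counters):
    # Keep fragments that occur at exactly one position; report that
    # position, with results sorted on position.
    pairs = [(frag, next(iter(freqs)))
             for frag, freqs in frag_freq_counters.items()
             if len(freqs) == 1]
    return sorted(pairs, key=lambda fp: fp[1])

counters = {
    ('a', 1, 1, 'fixed'): {1: 7, -1: 7, 3: 4},
    ('b', 1, 1, 'fixed'): {2: 6},
}
print(get_only_present_at_pos_sketch(counters))
# [(('b', 1, 1, 'fixed'), 2)]
```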

tdda.rexpy.rexpy.left_parts(patterns, fixed)

patterns is a list of patterns each consisting of a list of frags.

fixed is a list of (fragment, position) pairs, sorted on position, specifying points at which to split the patterns.

This function returns a list of lists of pattern fragments, split at each fixed position.

tdda.rexpy.rexpy.length_stats(patterns)

Given a list of patterns, returns named tuple containing

all_same_length:
boolean, True if all patterns are the same length
max_length:
length of the longest pattern in patterns

tdda.rexpy.rexpy.matrices2incremental_coverage(patterns, matrix, deduped, indexes, example_freqs, dedup=False)

Find patterns in order of number of matches and pull out frequencies; then set overlapping matches to zero and repeat. Returns an ordered dict, sorted by incremental match rate, with the number of (previously unaccounted-for) strings matched.

tdda.rexpy.rexpy.pdextract(cols)

Extract regular expression(s) from the Pandas column (Series) object or list of Pandas columns given.

All columns provided should be string columns (i.e. of type np.dtype(‘O’)), possibly including null values, which will be ignored.

Example use:

import numpy as np
import pandas as pd
from tdda.rexpy import pdextract

df = pd.DataFrame({'a3': ['one', 'two', np.nan],
                   'a45': ['three', 'four', 'five']})

re3 = pdextract(df['a3'])
re45 = pdextract(df['a45'])
re345 = pdextract([df['a3'], df['a45']])

This should result in:

re3   = ['^[a-z]{3}$']
re45  = ['^[a-z]{4,5}$']
re345 = ['^[a-z]{3,5}$']

tdda.rexpy.rexpy.rex_coverage(patterns, example_freqs, dedup=False)

Given a list of regular expressions and a dictionary of examples and their frequencies, this counts the number of times each pattern matches an example.

If dedup is set to True, the frequencies are ignored, so that only the number of keys is returned.
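The counting can be sketched as follows, using only the standard library (an illustrative reimplementation under a hypothetical name; it assumes the patterns are already anchored/terminated, which the library handles itself):

```python
import re

def rex_coverage_sketch(patterns, example_freqs, dedup=False):
    # For each pattern, count matching examples: by frequency normally,
    # or one per distinct example if dedup is True.
    return [sum(1 if dedup else freq
                for s, freq in example_freqs.items()
                if re.match(p, s))
            for p in patterns]

freqs = {'abc': 2, 'de': 1, '42': 3}
print(rex_coverage_sketch(['^[a-z]+$', r'^\d+$'], freqs))              # [3, 3]
print(rex_coverage_sketch(['^[a-z]+$', r'^\d+$'], freqs, dedup=True))  # [2, 1]
```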

tdda.rexpy.rexpy.rex_full_incremental_coverage(patterns, example_freqs, dedup=False, debug=False)

Returns an ordered dictionary, keyed on the terminated regular expressions from patterns, sorted in decreasing order of incremental coverage, i.e. with the pattern matching the most examples first, followed by the one matching the most remaining examples, etc.

If dedup is set to True, the ordering ignores duplicate examples; otherwise, duplicates help determine the sort order.

Each entry in the dictionary returned is a Coverage object with the following attributes:

n:
number of examples matched, including duplicates
n_uniq:
number of examples matched, excluding duplicates
incr:
number of previously unmatched examples matched, including duplicates
incr_uniq:
number of previously unmatched examples matched, excluding duplicates

tdda.rexpy.rexpy.rex_incremental_coverage(patterns, example_freqs, dedup=False, debug=False)

Given a list of regular expressions and a dictionary of examples and their frequencies, this computes their incremental coverage, i.e. it produces an ordered dictionary, sorted from the “most useful” patterns (the one that matches the most examples) to the least useful. Usefulness is defined as “matching the most previously unmatched patterns”. The dictionary entries are the number of (new) matches for the pattern.

If dedup is set to True, the frequencies are ignored when computing match rate; if set to False, patterns get credit for the multiplicity of examples they match.

Ties are broken by lexical order of the (terminated) patterns.

For example, given patterns p1, p2, and p3, and examples e1 to e5, with a match profile as follows (where the numbers are multiplicities):

example   p1   p2   p3
e1         2    2    0
e2         0    3    3
e3         1    0    0
e4         0    0    4
e5         1    0    1
TOTAL      4    5    8

If dedup is False this would produce:

OrderedDict(
    (p3, 8),
    (p1, 3),
    (p2, 0)
)

because:

  • p3 matches the most, with 8
  • Of the strings unmatched by p3, p1 accounts for 3 (e1 x 2 and e3 x 1) whereas p2 accounts for no new strings.

With dedup set to True, the matrix transforms to:

example   p1   p2   p3
e1         1    1    0
e2         0    1    1
e3         1    0    0
e4         0    0    1
e5         1    0    1
TOTAL      3    2    3

So p1 and p3 are tied.

If we assume that p1 sorts before p3, the result would then be:

OrderedDict(
    (p1, 3),
    (p3, 2),
    (p2, 0)
)
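The greedy procedure described above can be sketched with the standard library alone (an illustrative reimplementation under a hypothetical name, not the library code; it assumes anchored patterns):

```python
import re

def rex_incremental_coverage_sketch(patterns, example_freqs, dedup=False):
    # Repeatedly credit the pattern matching the most remaining examples,
    # remove the examples it matched, and continue; ties are broken by
    # lexical order of the pattern.
    remaining = dict(example_freqs)
    pats = list(patterns)
    result = {}
    while pats:
        scores = {p: sum(1 if dedup else f
                         for s, f in remaining.items() if re.match(p, s))
                  for p in pats}
        best = min(pats, key=lambda p: (-scores[p], p))
        result[best] = scores[best]
        remaining = {s: f for s, f in remaining.items()
                     if not re.match(best, s)}
        pats.remove(best)
    return result

freqs = {'abc': 2, 'de': 1, '42': 1}
print(rex_incremental_coverage_sketch(['^[a-z]+$', r'^\d+$'], freqs))
# {'^[a-z]+$': 3, '^\\d+$': 1}
```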

tdda.rexpy.rexpy.rexpy_streams(in_path=None, out_path=None, skip_header=False, quote=False, **kwargs)

in_path is:
  • None: read inputs from standard input
  • path to a file: read inputs from the file at in_path
  • list of strings: use those strings as the inputs

out_path is:
  • None: write outputs to standard output
  • path to a file: write outputs to the file at out_path
  • False: return the regular expressions as a list

tdda.rexpy.rexpy.right_parts(patterns, fixed)

patterns is a list of patterns each consisting of a list of frags.

fixed is a list of (fragment, pos) pairs where position specifies the position from the right, i.e. a position that can be indexed as -position.

fixed should be sorted, increasing on position, i.e. sorted from the right-most position. The positions specify points at which to split the patterns.

This function returns a list of lists of pattern fragments, split at each fixed position.

tdda.rexpy.rexpy.run_length_encode(s)

Return run-length-encoding of string s, e.g.:

'CCC-BB-A' --> (('C', 3), ('-', 1), ('B', 2), ('-', 1), ('A', 1))
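Equivalent behaviour can be sketched with itertools.groupby (an illustrative reimplementation under a hypothetical name, not the library code):

```python
from itertools import groupby

def run_length_encode_sketch(s):
    # Collapse each run of identical characters to a (char, count) pair.
    return tuple((c, len(list(g))) for c, g in groupby(s))

print(run_length_encode_sketch('CCC-BB-A'))
# (('C', 3), ('-', 1), ('B', 2), ('-', 1), ('A', 1))
```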

tdda.rexpy.rexpy.signature(rle)

Return the sequence of characters in a run-length encoding (i.e. the signature).

Also works with variable run-length encodings

tdda.rexpy.rexpy.terminate_patterns_and_sort(patterns)

Given a list of regular expressions, this terminates any that are not already terminated and returns them in sorted order. Also returns a list of the original indexes of the results.

tdda.rexpy.rexpy.to_vrles(rles)

Convert a list of run-length encodings to a list of variable run-length encodings, one for each common signature.

For example, given inputs of:

    (('C', 2),)
    (('C', 3),)
    (('C', 2), ('.', 1))

this would return:

    (('C', 2, 3),)
    (('C', 2, 2), ('.', 1, 1))
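The merging step can be sketched as follows. This is an illustrative reimplementation under a hypothetical name: the library may apply further limits (e.g. on how many distinct lengths are merged), but the core idea is grouping by signature and widening the run lengths:

```python
def to_vrles_sketch(rles):
    # Group RLEs by signature (their sequence of characters), then take
    # the min and max run length at each position to form a VRLE.
    groups = {}
    for rle in rles:
        sig = tuple(c for c, _ in rle)
        groups.setdefault(sig, []).append(rle)
    return [tuple((c, min(r[i][1] for r in members),
                      max(r[i][1] for r in members))
                  for i, c in enumerate(sig))
            for sig, members in groups.items()]

rles = [(('C', 2),), (('C', 3),), (('C', 2), ('.', 1))]
print(to_vrles_sketch(rles))
# [(('C', 2, 3),), (('C', 2, 2), ('.', 1, 1))]
```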

Examples

The tdda.rexpy module includes a set of examples.

To copy these examples to your own rexpy-examples subdirectory (or to a location of your choice), run the command:

tdda examples rexpy [mydirectory]