Rexpy¶
Rexpy infers regular expressions on a line-by-line basis from text data examples.
To run the rexpy tool:
tdda rexpy [inputfile]
Command-line Tool¶
Usage:
rexpy [FLAGS] [input file [output file]]
If input file is provided, it should contain one string per line; otherwise lines will be read from standard input.
If output file is provided, regular expressions found will be written to that (one per line); otherwise they will be printed.
FLAGS are optional flags. Currently:
-h, --header Discard first line, as a header.
-?, --help Print this usage information and exit (without error)
-g, --group Generate capture groups for each variable fragment
of each regular expression generated, i.e. surround
variable components with parentheses
e.g. '^([A-Z]+)\-([0-9]+)$'
becomes '^[A-Z]+\-[0-9]+$'
-u, --underscore Allow underscore to be treated as a letter.
Mostly useful for matching identifiers
Also allow -_.
-d, --dot Allow dot to be treated as a letter.
Mostly useful for matching identifiers.
Also -. --period.
-m, --minus Allow minus to be treated as a letter.
Mostly useful for matching identifiers.
Also --hyphen or --dash.
-v, --version Print the version number.
-V, --verbose Set verbosity level to 1
-VV, --Verbose Set verbosity level to 2
-vlf, --variable Use variable length fragments
-flf, --fixed Use fixed length fragments
Python API¶
The tdda.rexpy.rexpy
module also provides a Python API, to allow
discovery of regular expressions to be incorporated into other Python
programs.
-
class
tdda.rexpy.rexpy.
Coverage
¶ Container for coverage information.
Attributes:
n:
number of matchesn_unique:
number matches, deduplicating stringsincr:
number of new (unique) matches for this regexincr_uniq:
number of new (unique) deduplicated matches for this regexindex:
index of this regex in original list returned.
-
class
tdda.rexpy.rexpy.
Extractor
(examples, extract=True, tag=False, extra_letters=None, full_escape=False, remove_empties=False, strip=False, variableLengthFrags=False, specialize=False, max_patterns=None, min_diff_strings_per_pattern=1, min_strings_per_pattern=1, verbose=0)¶ Regular expression ‘extractor’.
Given a set of examples, this tries to construct a useful regular expression that characterizes them; failing which, a list of regular expressions that collectively cover the cases.
Results are stored in
self.results
once extraction has occurred, which happens by default on initialization, but can be invoked manually.The examples may be given as a list or as a dictionary: if a dictionary, the values are assumed to be string frequencies.
Verbose is usually 0 or
False
. It can be toTrue
or 1 for various extra output, and to higher numbers for even more verbose output. The highest level currently used is 2.-
aligned_parts
(parts)¶ Given a list of parts, each consisting of the fragments from a set of partially aligned patterns, show them aligned, and in a somewhat ambigous, numbered, fairly human-readable, compact form.
-
analyse_groups
(pattern, examples)¶ Analyse the contents of each group (fragment) in pattern across the examples it matches.
- Return zip of
- the characters in each group
- the strings in each group
- the run-length encoded fine classes in each group
- the run-length encoded characters in each group
- the group itself
all indexed on the (zero-based) group number.
-
batch_extract
(examples)¶ Find regular expressions for a batch of examples (as given).
-
clean
(examples)¶ Compute length of each string and count number of examples of each length.
-
coarse_classify
(s)¶ Classify each character in a string into one of the coarse categories
-
coarse_classify_char
(c)¶ Classify character into one of the coarse categories
-
coverage
(dedup=False)¶ Get a list of frequencies for each regular expression, i.e the number of the (stripped) input strings it matches. The list is in the same order as the regular expressions in
self.results.rex
.If
dedup
is set toTrue
, shows only the number of distinct (stripped) input examples matches
-
extract
()¶ Actually perform the regular expression ‘extraction’.
-
find_non_matches
()¶ Returns all example strings that do not match any of the regular expressions in results.
-
fine_class
(c)¶ Map a character in coarse class ‘C’ (AlphaNumeric) to a fine class.
-
full_incremental_coverage
(dedup=False, debug=False)¶ Returns an ordered dictionary of regular expressions, sorted by the number of new examples they match/explain, from most to fewest, with ties broken by pattern sort order. The values in the results dictionary are the numbers of (new) examples matched.
If
dedup
is set toTrue
, frequencies are ignored.Each result is a
Coverage
object with the following attributes:n
:- number of examples matched including duplicates
n_uniq
:- number of examples matched, excluding duplicates
incr
:- number of previously unmatched examples matched, including duplicates
incr_uniq
:- number of previously unmatched examples matched, excluding duplicates
-
incremental_coverage
(dedup=False, debug=False)¶ Returns an ordered dictionary of regular expressions, sorted by the number of new examples they match/explain, from most to fewest, with ties broken by pattern sort order. The values in the results dictionary are the numbers of (new) examples matched.
If
dedup
is set toTrue
, frequencies are ignored.
-
merge_fixed_omnipresent_at_pos
(patterns)¶ Find unusual columns in fixed positions relative to ends. Align those, split and recurse
-
merge_fixed_only_present_at_pos
(patterns)¶ Find unusual columns in fixed positions relative to ends. Align those Split and recurse
-
n_examples
(dedup=False)¶ Returns the total number of examples used by rexpy. If
dedup
is set toTrue
, this the number of different examples, otherwise it is the “raw” number of examples. In all cases, examples have been stripped.
-
refine_groups
(pattern, examples)¶ Refine the categories for variable run-length-encoded patterns provided by narrowing the characters in the groups.
-
rle2re
(rles, tagged=False, as_re=True)¶ Convert run-length-encoded code string to regular expression
-
rle_fc_c
(s, pattern, rlefc_in, rlec_in)¶ - Convert a string, matching a ‘C’-(fragment) pattern, to
- a run-length encoded sequence of fine classes
- a run-length encoded sequence of characters
- Given inputs:
s
— a string representing the actual substring of an- example that matches a pattern fragment described by pattern
pattern
— a VRLE of coarse classesrlefc_in
— a VRLE of fine classes, or None, or Falserlec_in
— a VRLE of characters, or None, or False
Returns new rlefc and rlec, each of which is:
False
, if the string doesn’t match the corresponding input VRLEa possibly expanded VRLE, if it does match, or would match if expanded (by allowing more of fewer repetitions).
-
run_length_encode_coarse_classes
(s)¶ Returns run-length encoded coarse classification
-
sample
(nPerLength)¶ Sample strings for potentially faster induction. Only used if over a hundred million distinct strings are given. For now.
-
specialize
(patterns)¶ Check all the catpure groups in each patterns and simplify any that are sufficiently low frequency.
-
vrle2re
(vrles, tagged=False, as_re=True)¶ Convert variable run-length-encoded code string to regular expression
-
vrle2refrags
(vrles)¶ Convert variable run-length-encoded code string to regular expression and list of fragments
-
-
class
tdda.rexpy.rexpy.
Fragment
¶ Container for a fragment.
Attributes:
re
: the regular expression for the fragmentgroup
: True if it forms a capture group (i.e. is not constant)
-
tdda.rexpy.rexpy.
capture_group
(s)¶ Places parentheses around s to form a capure group (a tagged piece of a regular expression), unless it is already a capture group.
-
tdda.rexpy.rexpy.
cre
(rex)¶ Compiled regular expression Memoized implementation.
-
tdda.rexpy.rexpy.
expand_or_falsify_vrle
(rle, vrle, fixed=False, variableLength=False)¶ - Given a run-length encoded sequence
- (e.g.
[('A', 3), ('B', 4)]
) - and (usually) a variable run-length encoded sequence
- (e.g.
[('A', 2, 3), ('B', 1, 2)]
)
expand the VRLE to include the case of the RLE, if they can be consistent.
If they cannot, return False.
If vrle is None, this indicates it hasn’t been found yet, so rle is simply expanded to a VRLE.
If vrle is False, this indicates that a counterexample has already been found, so False is returned again.
If variableLength is set to True, patterns will be merged even if it is a different length from the vrle, as long as the overlapping part is consistent.
-
tdda.rexpy.rexpy.
extract
(examples, tag=False, encoding=None, as_object=False, extra_letters=None, full_escape=False, remove_empties=False, strip=False, variableLengthFrags=False, max_patterns=None, min_diff_strings_per_pattern=1, min_strings_per_pattern=1, verbose=0)¶ Extract regular expression(s) from examples and return them.
Normally, examples should be unicode (i.e.
str
in Python3, andunicode
in Python2). However, encoded strings can be passed in provided the encoding is specified.Results will always be unicode.
If as_object is set, the extractor object is returned, with results in .results.rex; otherwise, a list of regular expressions, as unicode strings is returned.
-
tdda.rexpy.rexpy.
get_omnipresent_at_pos
(fragFreqCounters, n, **kwargs)¶ Find patterns in
fragFreqCounters
for which the frequency isn
.fragFreqCounters is a dictionary (usually keyed on ‘fragments’) of whose values are dictionaries mapping positions to frequencies.
For example:
{ ('a', 1, 1, 'fixed'): {1: 7, -1: 7, 3: 4}, ('b', 1, 1, 'fixed'): {2: 6, 3: 4}, }
This indicates that the pattern
('a', 1, 1, 'fixed')
has frequency 7 at positions 1 and -1, and frequency 4 at position 3, while pattern('b', 1, 1, 'fixed')
has frequency 6 at position 2 and 4 at position 3.With n set to 7, this returns:
[ (('a', 1, 1, 'fixed'), -1) (('a', 1, 1, 'fixed'), 1), ]
(sorted on pos; each pos really should occur at most once.)
-
tdda.rexpy.rexpy.
get_only_present_at_pos
(fragFreqCounters, *args, **kwargs)¶ Find patterns in fragFreqCounters that, when present, are always at the same position.
fragFreqCounters
is a dictionary (usually keyed onfragments
) of whose values are dictionaries mapping positions to frequencies.For example:
{ ('a', 1, 1, 'fixed'): {1: 7, -1: 7, 3: 4}, ('b', 1, 1, 'fixed'): {2: 6}, }
- This indicates that the
- pattern
('a', 1, 1, 'fixed')
has frequency 7 at positions 1 and -1, and frequency 4 at position 3; - pattern
('b', 1, 1, 'fixed')
has frequency 6 at position 2 (only)
- pattern
So this would return:
[ (('b', 1, 1, 'fixed'), 2) ]
(sorted on
pos
; eachpos
really should occur at most once.)
-
tdda.rexpy.rexpy.
left_parts
(patterns, fixed)¶ patterns is a list of patterns each consisting of a list of frags.
fixed is a list of
(fragment, position)
pairs, sorted on position, specifying points at which to split the patterns.This function returns a list of lists of pattern fragments, split at each fixed position.
-
tdda.rexpy.rexpy.
length_stats
(patterns)¶ Given a list of patterns, returns named tuple containing
all_same_length
:- boolean, True if all patterns are the same length
max_length
:- length of the longest pattern in patterns
-
tdda.rexpy.rexpy.
matrices2incremental_coverage
(patterns, matrix, deduped, indexes, example_freqs, dedup=False)¶ Find patterns, in order of # of matches, and pull out freqs. Then set overlapping matches to zero and repeat. Returns ordered dict, sorted by incremental match rate, with number of (previously unaccounted for) strings matched.
-
tdda.rexpy.rexpy.
pdextract
(cols)¶ Extract regular expression(s) from the Pandas column (
Series
) object or list of Pandas columns given.All columns provided should be string columns (i.e. of type np.dtype(‘O’), possibly including null values, which will be ignored.
Example use:
import pandas as pd from tdda.rexpy import pdextract df = pd.DataFrame({'a3': ["one", "two", pd.np.NaN], 'a45': ['three', 'four', 'five']}) re3 = pdextract(df['a3']) re45 = pdextract(df['a45']) re345 = pdextract([df['a3'], df['a45']])
This should result in:
re3 = '^[a-z]{3}$' re5 = '^[a-z]{3}$' re345 = '^[a-z]{3}$'
-
tdda.rexpy.rexpy.
rex_coverage
(patterns, example_freqs, dedup=False)¶ Given a list of regular expressions and a dictionary of examples and their frequencies, this counts the number of times each pattern matches a an example.
If
dedup
is set toTrue
, the frequencies are ignored, so that only the number of keys is returned.
-
tdda.rexpy.rexpy.
rex_full_incremental_coverage
(patterns, example_freqs, dedup=False, debug=False)¶ Returns an ordered dictionary containing, keyed on terminated regular expressions, from patterns, sorted in decreasing order of incremental coverage, i.e. with the pattern matching the most first, followed by the one matching the most remaining examples etc.
If
dedup
is set toTrue
, the ordering ignores duplicate examples; otherise, duplicates help determine the sort order.Each entry in the dictionary returned is a
Coverage
object with the following attributes:n
:- number of examples matched including duplicatesb
n_uniq
:- number of examples matched, excluding duplicates
incr
:- number of previously unmatched examples matched, including duplicates
incr_uniq
:- number of previously unmatched examples matched, excluding duplicates
-
tdda.rexpy.rexpy.
rex_incremental_coverage
(patterns, example_freqs, dedup=False, debug=False)¶ Given a list of regular expressions and a dictionary of examples and their frequencies, this computes their incremental coverage, i.e. it produces an ordered dictionary, sorted from the “most useful” patterns (the one that matches the most examples) to the least useful. Usefulness is defined as “matching the most previously unmatched patterns”. The dictionary entries are the number of (new) matches for the pattern.
If
dedup
is set toTrue
, the frequencies are ignored when computing match rate; if set to false, patterns get credit for the nmultiplicity of examples they match.Ties are broken by lexical order of the (terminated) patterns.
For example, given patterns p1, p2, and p3, and examples e1, e2 and e3, with a match profile as follows (where the numbers are multiplicities)
example p1 p2 p3 e1 2 2 0 e2 0 3 3 e3 1 0 0 e4 0 0 4 e5 1 0 1 TOTAL 4 4 8 If
dedup
isFalse
this would produce:OrderedDict( (p3, 8), (p1, 3), (p2, 0) )
because:
- p3 matches the most, with 8
- Of the strings unmatched by p3, p1 accounts for 3 (e1 x 2 and e3 x 1) whereas p2 accounts for no new strings.
With
dedup
set toTrue
, the matrix transforms toexample p1 p2 p3 e1 1 1 0 e2 0 1 1 e3 1 0 0 e4 0 0 1 e5 1 0 1 TOTAL 3 2 3 So p1 and p3 are tied.
If we assume the p1 sorts before p3, the result would then be:
OrderedDict( (p1, 3), (p3, 2), (p2, 0) )
-
tdda.rexpy.rexpy.
right_parts
(patterns, fixed)¶ patterns is a list of patterns each consisting of a list of frags.
fixed is a list of
(fragment, pos)
pairs where position specifies the position from the right, i.e a position that can be indexed as-position
.Fixed should be sorted, increasing on position, i.e. sorted from the right-most pattern. The positions specify points at which to split the patterns.
This function returns a list of lists of pattern fragments, split at each fixed position.
-
tdda.rexpy.rexpy.
run_length_encode
(s)¶ Return run-length-encoding of string s, e.g.:
'CCC-BB-A' --> (('C', 3), ('-', 1), ('B', 2), ('-', 1), ('A', 1))
-
tdda.rexpy.rexpy.
signature
(rle)¶ Return the sequence of characters in a run-length encoding (i.e. the signature).
Also works with variable run-length encodings
-
tdda.rexpy.rexpy.
terminate_patterns_and_sort
(patterns)¶ Given a list of regular expressions, this terminates any that are not and returns them in sorted order. Also returns a list of the original indexes of the results.
-
tdda.rexpy.rexpy.
to_vrles
(rles)¶ Convert a list of run-length encodings to a list of variable run-length encodings, one for each common signature.
For example, given inputs of:
(('C', 2),) (('C', 3),) and (('C', 2), ('.', 1))
this would return:
(('C', 2, 3),) and (('C', 2, 2), ('.', 1, 1))
Examples¶
The tdda.rexpy
module includes a set of examples.
To copy these examples to your own rexpy-examples subdirectory (or to a location of your choice), run the command:
tdda examples rexpy [mydirectory]