Rexpy

The rexpy command

rexpy [FLAGS] [inputfile [outputfile]]

If inputfile is provided, it should contain one string per line; otherwise lines will be read from standard input.

If outputfile is provided, regular expressions found will be written to that (one per line); otherwise they will be written to standard output.

Optional FLAGS may be used to modify Rexpy's behaviour:

  • -h, --header
    Discard first line, as a header.

  • -?, --help
    Print this usage information and exit (without error).

  • -g, --group
    Generate capture groups for each variable fragment of each regular expression generated, i.e. surround variable components with parentheses, e.g.

    ^[A-Z]+\-[0-9]+$
    

    becomes

    ^([A-Z]+)\-([0-9]+)$
    

    This is now the default.

  • -G, --no-group
    Do not generate capture groups for each variable fragment of each regular expression generated, i.e. do not surround variable components with parentheses, e.g.

    ^([A-Z]+)\-([0-9]+)$
    

    becomes

    ^[A-Z]+\-[0-9]+$
    
  • -q, --quote
    Display the resulting regular expressions as double-quoted, escaped strings, in a form broadly suitable for use in Unix shells, JSON, and string literals in many programming languages, e.g.

    ^[A-Z]+\-[0-9]+$
    

    becomes

    "^[A-Z]+\\-[0-9]+$"
    
  • --portable
    Produce maximally portable regular expressions (e.g. [0-9] rather than \d). (This is the default.)

  • --grep
    Same as --portable

  • --java
    Produce Java-style regular expressions (e.g. \p{Digit})

  • --posix
    Produce POSIX-compliant regular expressions (e.g. [[:digit:]] rather than \d).

  • --perl
    Produce Perl-style regular expressions (e.g. \d)

  • -u, --underscore, -_
    Allow underscore to be treated as a letter. Mostly useful for matching identifiers.

  • -d, --dot, -., --period
    Allow dot to be treated as a letter. Mostly useful for matching identifiers.

  • -m, --minus, --hyphen, --dash
    Allow minus to be treated as a letter. Mostly useful for matching identifiers.

  • -v, --version
    Print the version number.

  • -V, --verbose
    Set verbosity level to 1

  • -VV, --Verbose
    Set verbosity level to 2

  • -vlf, --variable
    Use variable length fragments

  • -flf, --fixed
    Use fixed length fragments
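To make the effect of --group and --quote concrete, the short Python sketch below uses only the standard library and the example patterns from the flag descriptions above. The sample string AB-123 is an illustrative assumption, not rexpy output.

```python
import json
import re

# Patterns copied from the --group / --no-group descriptions above.
ungrouped = r"^[A-Z]+\-[0-9]+$"
grouped = r"^([A-Z]+)\-([0-9]+)$"

sample = "AB-123"  # hypothetical input string, for illustration only

# Both forms accept the same string; only the grouped form exposes
# the variable fragments as capture groups.
assert re.match(ungrouped, sample)
fragments = re.match(grouped, sample).groups()
print(fragments)

# The --quote form is a double-quoted, escaped string. Decoding it as
# JSON recovers the original, unquoted regular expression.
quoted = r'"^[A-Z]+\\-[0-9]+$"'
decoded = json.loads(quoted)
print(decoded == ungrouped)
```

Grouping is useful when the matched fragments are needed afterwards; quoting is useful when the regular expression is to be pasted into a string literal or a JSON file.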

Supplied Rexpy Examples

TDDA rexpy is supplied with a set of examples.

To copy the rexpy examples, run the command:

tdda examples rexpy

This will create or overwrite a directory rexpy_examples in the current directory.

Alternatively, you can copy all examples using the following command:

tdda examples

which will create a number of separate subdirectories.

Rexpy automatically constructs regular expressions from data. Regular expressions are powerful but hard to write and read. Rexpy infers them for you from examples of strings that should match.

Rexpy is intended for strings with structure: identifiers, postcodes, UUIDs, phone numbers, email addresses, version numbers, and so on. It is not useful for free text.

For all of these examples, first change to the rexpy_examples directory with cd, then run the commands on the command line.

Command-line examples

Simple identifiers

The file headed-ids.txt contains a column of structured IDs with a header line:

ID
123-AA-971
12-DQ-802
198-AA-045
1-BA-834

Use -h to discard the header line, so that rexpy infers the pattern from the IDs alone:

rexpy -h headed-ids.txt

This should produce:

^[0-9]{1,3}\-[A-Z]{2}\-[0-9]{3}$
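
As a quick sanity check, the reported pattern can be tested against the file contents listed above using Python's standard re module. This is only a sketch for verifying the result; it does not call rexpy itself.

```python
import re

# Pattern reported above for headed-ids.txt.
pattern = re.compile(r"^[0-9]{1,3}\-[A-Z]{2}\-[0-9]{3}$")

# File contents as listed above, including the header line.
lines = ["ID", "123-AA-971", "12-DQ-802", "198-AA-045", "1-BA-834"]

matches = [s for s in lines if pattern.match(s)]
non_matches = [s for s in lines if not pattern.match(s)]

print(matches)       # the four IDs
print(non_matches)   # only the header line
```

The header line does not fit the structure of the IDs, which is why it is discarded with -h before inference.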

Postcodes

rexpy postcodes.txt

The UK postcodes in this file share a regular structure, which rexpy captures:

^[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}$

UUIDs

rexpy uuids.txt

UUIDs are highly regular, so rexpy produces a precise pattern:

^[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
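
The sketch below checks this pattern against freshly generated UUIDs from Python's standard uuid module. It assumes the entries in uuids.txt are lower-case, as the a-f ranges in the pattern suggest; the upper-case string is a made-up example.

```python
import re
import uuid

# Pattern reported above for uuids.txt.
pattern = re.compile(
    r"^[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$"
)

# str(uuid.uuid4()) yields lower-case hex in the 8-4-4-4-12 layout.
generated = [str(uuid.uuid4()) for _ in range(100)]
all_match = all(pattern.match(u) for u in generated)
print(all_match)

# Hypothetical upper-case UUID: rejected, because the pattern only
# allows the lower-case letters a-f.
upper_example = "123E4567-E89B-12D3-A456-426614174000"
upper_matches = bool(pattern.match(upper_example))
print(upper_matches)
```

The regular expression describes only what was present in the input, so variants that never appeared in the examples (such as upper-case hex digits) are not matched.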

User-agent strings

rexpy agents9.txt

User-agent strings are structured but extremely varied. With only 9 strings rexpy cannot generalise, so it produces one regex per distinct pattern. This illustrates a limit: rexpy works well when strings share a common structure, but not when they are all different.

Reading from standard input

You can also pipe strings directly to rexpy:

printf '2024-01-15\n2024-03-22\n2023-11-07\n' | rexpy

Python API examples

ids.py — basic usage

python ids.py

Uses rexpy.extract() on a list of strings.
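
A minimal sketch of what such a script might look like follows; the actual contents of ids.py are not reproduced here. It assumes that the tdda package is installed and that rexpy.extract() accepts a list of strings and returns a list of regular-expression strings. The import is guarded, and the helper infer() is a hypothetical name introduced only for this illustration.

```python
import re

try:
    from tdda import rexpy  # assumes tdda is installed (pip install tdda)
except ImportError:
    rexpy = None

# The IDs from headed-ids.txt, without the header line.
ids = ["123-AA-971", "12-DQ-802", "198-AA-045", "1-BA-834"]


def infer(strings):
    """Return the regexes rexpy infers, or [] if tdda is unavailable."""
    if rexpy is None:
        return []
    return list(rexpy.extract(strings))


patterns = infer(ids)
for p in patterns:
    print(p)

# Every input string should be matched by at least one inferred regex.
if patterns:
    print(all(any(re.match(p, s) for p in patterns) for s in ids))
```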

pandas_ids.py — Pandas integration

python pandas_ids.py

Uses rexpy.pdextract() to infer regexes from Pandas Series, including combining multiple columns to find a shared pattern.

Command-line flags

Run rexpy --help for full options. Useful flags include:

-h, --header     Discard the first line (treat it as a header)
-u, --underscore Allow underscore as a letter (useful for identifiers)
-d, --dot        Allow dot as a letter (useful for identifiers)
-G, --no-group   Don't add extra capture groups to the regex
--portable       Use maximally portable regular expressions (default)
--perl           Use Perl-style regular expressions (e.g. \d)
--java           Use Java-style regular expressions (e.g. \p{Digit})
--posix          Use POSIX-style regular expressions (e.g. [[:digit:]])

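Regular Expression Styles

The style flags change only how character classes are written. The same postcode pattern in each style: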

$ rexpy postcodes.txt
^[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}$

$ rexpy postcodes.txt --posix
^[[:upper:]]{1,2}[[:digit:]]{1,2} [[:digit:]][[:upper:]]{2}$

$ rexpy postcodes.txt --java
^\p{Upper}{1,2}\p{Digit}{1,2} \p{Digit}\p{Upper}{2}$

$ rexpy postcodes.txt --perl
^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$
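
Of these styles, Python's re module accepts the portable and Perl forms directly (it does not support [[:digit:]] or \p{Digit}). The sketch below compares the two on a few hypothetical strings that are not taken from postcodes.txt. It also shows one reason the portable form is the default: in Python, \d matches any Unicode decimal digit unless re.ASCII is given, whereas [0-9] matches ASCII digits only.

```python
import re

# Portable and Perl-style patterns as shown above.
portable = re.compile(r"^[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}$")
perl = re.compile(r"^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$")

# Hypothetical test strings, for illustration only.
ascii_samples = ["EH8 9AB", "G12 8QQ", "M1 1AE", "eh8 9ab", "12345"]
results = [(s, bool(portable.match(s)), bool(perl.match(s)))
           for s in ascii_samples]
for row in results:
    print(row)

# A string containing ARABIC-INDIC DIGIT EIGHT (U+0668) in place of "8".
unicode_sample = "EH\u0668 9AB"
portable_hit = bool(portable.match(unicode_sample))
perl_hit = bool(perl.match(unicode_sample))
print(portable_hit, perl_hit)
```

On ASCII input the two styles behave identically here; they diverge only on non-ASCII digits, so the choice of style matters mainly for the engine that will consume the regular expression.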