Rexpy

The `rexpy` command

rexpy [FLAGS] [inputfile [outputfile]]

If inputfile is provided, it should contain one string per line; otherwise lines will be read from standard input.

If outputfile is provided, regular expressions found will be written to that (one per line); otherwise they will be written to standard output.

Optional FLAGS may be used to modify Rexpy's behaviour:

-h, --header
Discard first line, as a header.
-?, --help
Print this usage information and exit (without error).
-g, --group
Generate capture groups for each variable fragment of each regular expression generated, i.e. surround variable components with parentheses, e.g.
```
^[A-Z]+\-[0-9]+$
```
becomes
```
^([A-Z]+)\-([0-9]+)$
```
This is now the default.
-G, --no-group
Do not generate capture groups for each variable fragment of each regular expression generated, i.e. do not surround variable components with parentheses, e.g.
```
^([A-Z]+)\-([0-9]+)$
```
becomes
```
^[A-Z]+\-[0-9]+$
```
-q, --quote
Display the resulting regular expressions as double-quoted, escaped strings, in a form broadly suitable for use in Unix shells, JSON, and string literals in many programming languages, e.g.
```
^[A-Z]+\-[0-9]+$
```
becomes
```
"^[A-Z]+\\-[0-9]+$"
```
--portable
Produce maximally portable regular expressions (e.g. [0-9] rather than \d). (This is the default.)
--grep
Same as --portable.
--java
Produce Java-style regular expressions (e.g. \p{Digit}).
--posix
Produce POSIX-compliant regular expressions (e.g. [[:digit:]] rather than \d).
--perl
Produce Perl-style regular expressions (e.g. \d).
-u, --underscore, -_
Allow underscore to be treated as a letter. Mostly useful for matching identifiers.
-d, --dot, -., --period
Allow dot to be treated as a letter. Mostly useful for matching identifiers.
-m, --minus, --hyphen, --dash
Allow minus to be treated as a letter. Mostly useful for matching identifiers.
-v, --version
Print the version number.
-V, --verbose
Set verbosity level to 1.
-VV, --Verbose
Set verbosity level to 2.
-vlf, --variable
Use variable-length fragments.
-flf, --fixed
Use fixed-length fragments.
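
To make the effect of -g and -q concrete, here is a minimal sketch that uses only Python's standard library (re and json) on the patterns shown for those flags. The sample string AB-123 is an assumption chosen for illustration; it does not come from the rexpy examples.

```python
import json
import re

plain = r'^[A-Z]+\-[0-9]+$'        # form shown for -G / --no-group
grouped = r'^([A-Z]+)\-([0-9]+)$'  # form shown for -g / --group

sample = 'AB-123'  # hypothetical input string, for illustration only

# Both forms accept the same strings ...
plain_matches = re.match(plain, sample) is not None

# ... but only the grouped form exposes the variable fragments.
fragments = re.match(grouped, sample).groups()
print(fragments)  # ('AB', '123')

# The form shown for -q is a double-quoted, escaped string. Read as a
# JSON string, it decodes back to the unquoted regular expression.
quoted = r'"^[A-Z]+\\-[0-9]+$"'
decoded = json.loads(quoted)
print(decoded == plain)  # True
```

Capture groups are mainly useful when the regular expression will later be used to pull the variable parts out of matching strings; if it is only used for validation, the ungrouped form is shorter and easier to read.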

Supplied Rexpy Examples

TDDA rexpy is supplied with a set of examples.

To copy the rexpy examples, run the command:

tdda examples rexpy

This will create or overwrite a directory rexpy_examples in the current directory.

Alternatively, you can copy all examples using the following command:

tdda examples

which will create a number of separate subdirectories.

Rexpy automatically constructs regular expressions from data. Regular expressions are powerful but hard to write and read. Rexpy infers them for you from examples of strings that should match.

Rexpy is intended for strings with structure: identifiers, postcodes, UUIDs, phone numbers, email addresses, version numbers, and so on. It is not useful for free text.

For all of these examples, first change to the rexpy_examples directory with cd, then run the commands on the command line.

Command-line examples

Simple identifiers

The file headed-ids.txt contains a column of structured IDs with a header line:

ID
123-AA-971
12-DQ-802
198-AA-045
1-BA-834

Use -h to discard the header, so that rexpy infers the pattern from the IDs alone:

rexpy -h headed-ids.txt

This should produce:

^[0-9]{1,3}\-[A-Z]{2}\-[0-9]{3}$
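
As a quick check of what this pattern means, the sketch below tests it against the lines of headed-ids.txt listed above, using Python's standard re module. It only verifies the regular expression shown; it does not run rexpy itself.

```python
import re

# The regular expression reported by rexpy for headed-ids.txt.
pattern = re.compile(r'^[0-9]{1,3}\-[A-Z]{2}\-[0-9]{3}$')

# The contents of headed-ids.txt as listed above.
lines = ['ID', '123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
header, ids = lines[0], lines[1:]

# Every ID matches the inferred pattern.
all_ids_match = all(pattern.match(s) for s in ids)
print(all_ids_match)  # True

# The header line does not, which is why -h is used to discard it.
header_matches = pattern.match(header) is not None
print(header_matches)  # False
```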

Postcodes

rexpy postcodes.txt

UK postcodes have a regular structure, which rexpy captures:

^[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}$

UUIDs

rexpy uuids.txt

UUIDs are highly regular, so rexpy produces a precise pattern:

^[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
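
The sketch below checks this pattern against UUIDs generated with Python's standard uuid module. The generated values are stand-ins (an assumption), not the contents of uuids.txt. It also shows that the pattern is specific to lowercase hexadecimal digits.

```python
import re
import uuid

# The regular expression reported by rexpy for uuids.txt.
pattern = re.compile(
    r'^[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$'
)

# Freshly generated UUIDs, used here in place of the example file.
samples = [str(uuid.uuid4()) for _ in range(5)]
all_match = all(pattern.match(s) for s in samples)
print(all_match)  # True

# The pattern only allows lowercase a-f, so an uppercase UUID is rejected.
fixed = str(uuid.UUID(int=0xABCDEF))  # '00000000-0000-0000-0000-000000abcdef'
upper_matches = pattern.match(fixed.upper()) is not None
print(upper_matches)  # False
```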

User-agent strings

rexpy agents9.txt

User-agent strings are structured but extremely varied. With only 9 strings, rexpy cannot generalise, so it produces one regex per distinct pattern. This illustrates a limitation: rexpy works well when strings share a common structure, but not when they are all different.

Reading from standard input

You can also pipe strings directly to rexpy:

printf '2024-01-15\n2024-03-22\n2023-11-07\n' | rexpy
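
The same pipe can be driven from Python. This is a minimal sketch, assuming the rexpy command is installed and on the PATH. The helper name rexpy_from_strings is hypothetical and not part of TDDA. It relies only on the behaviour described above: with no inputfile, rexpy reads one string per line from standard input and writes one regular expression per line to standard output.

```python
import shutil
import subprocess


def rexpy_from_strings(strings):
    """Pipe strings to the rexpy command and return its output lines.

    Hypothetical helper: returns None if rexpy is not on the PATH.
    """
    if shutil.which('rexpy') is None:
        return None
    result = subprocess.run(
        ['rexpy'],
        input='\n'.join(strings) + '\n',  # one string per line on stdin
        capture_output=True,
        text=True,
        check=True,                       # raise if rexpy exits with an error
    )
    return result.stdout.splitlines()     # one regular expression per line


regexes = rexpy_from_strings(['2024-01-15', '2024-03-22', '2023-11-07'])
print(regexes)
```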

Python API examples

ids.py — basic usage

python ids.py

Uses rexpy.extract() on a list of strings.
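
A minimal sketch of this kind of call is given below. It is not the contents of ids.py. It assumes the tdda package is installed and that rexpy.extract() returns a list of regular expression strings; the import guard is only there so the snippet still runs when tdda is absent.

```python
import re

# The same IDs as in headed-ids.txt, without the header line.
corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']

try:
    from tdda import rexpy
except ImportError:
    rexpy = None  # tdda is not installed; nothing will be extracted

# Infer regular expressions from the example strings.
regexes = rexpy.extract(corpus) if rexpy is not None else []
for rex in regexes:
    print(rex)

# Sanity check: list any input string that none of the regexes matches.
unmatched = [
    s for s in corpus
    if regexes and not any(re.match(r, s) for r in regexes)
]
print(unmatched)
```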

pandas_ids.py — Pandas integration

python pandas_ids.py

Uses rexpy.pdextract() to infer regexes from Pandas Series, including combining multiple columns to find a shared pattern.

Command-line flags

Run rexpy --help for full options. Useful flags include:

-h, --header     Discard the first line (treat it as a header)
-u, --underscore Allow underscore as a letter (useful for identifiers)
-d, --dot        Allow dot as a letter (useful for identifiers)
-G, --no-group   Don't add extra capture groups to the regex
--portable       Use maximally portable regular expressions (default)
--perl           Use Perl-style regular expressions (e.g. \d)
--java           Use Java-style regular expressions (e.g. \p{Digit})
--posix          Use POSIX-style regular expressions (e.g. [[:digit:]])

Regular Expression Styles

$ rexpy postcodes.txt
^[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}$

$ rexpy postcodes.txt --posix
^[[:upper:]]{1,2}[[:digit:]]{1,2} [[:digit:]][[:upper:]]{2}$

$ rexpy postcodes.txt --java
^\p{Upper}{1,2}\p{Digit}{1,2} \p{Digit}\p{Upper}{2}$

$ rexpy postcodes.txt --perl
^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$
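
Which style to choose depends on the regular expression engine that will consume the output. As an illustration, the sketch below tries three of the forms above in Python's standard re module. The sample strings are assumptions in postcode format, not the contents of postcodes.txt.

```python
import re

portable = r'^[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}$'
perl = r'^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$'
java = r'^\p{Upper}{1,2}\p{Digit}{1,2} \p{Digit}\p{Upper}{2}$'

# Hypothetical strings in postcode format, for illustration only.
samples = ['EH3 6AD', 'G12 8QQ', 'M1 1AE']

# The portable and Perl-style forms both work in Python's re module.
portable_ok = all(re.match(portable, s) for s in samples)
perl_ok = all(re.match(perl, s) for s in samples)
print(portable_ok, perl_ok)  # True True

# The Java-style form uses \p{...}, which Python's re module rejects.
try:
    re.compile(java)
    java_compiles = True
except re.error:
    java_compiles = False
print(java_compiles)  # False

# POSIX bracket classes such as [[:upper:]] are likewise not supported by
# Python's re module, so the --posix form is intended for POSIX tools.
```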