Command Line Reference

tdda

NAME

tdda — test-driven data analysis

SYNOPSIS

tdda discover      Generate constraints for data validation  
tdda verify        Verify (validate) data against constraints  
tdda detect        Detect data that fails constraints  

tdda examples      Copy the tdda example data and code  
tdda gentest       Auto-generate Python tests for code in any language  

tdda diff          Find difference in datasets in parquet or CSV files  
tdda ls            List fields in a dataset  
tdda cat           Display rows from a dataset as a rich table  
tdda head          Display the first N rows of a dataset  
tdda tail          Display the last N rows of a dataset  
tdda sample        Display N random rows from a dataset  
tdda serial        Convert or infer flat-file metadata in tdda.serial,  
                   CSVW, or Frictionless formats  
tdda tag           Tag tests that failed in the last reference test run  
tdda config        Show TDDA configuration  

tdda version       Print the TDDA version number  
tdda help          Print this help  
tdda help COMMAND  Print help on COMMAND (e.g. discover, verify)  
tdda installman    Install tdda man pages  

tdda test          Run the tdda library's self-tests.  

OPTIONS

-v, --version Print version number (same as tdda version)
-h, -?, --help Print this help

SEE ALSO

rexpy(1), tdda-installman(1)

TDDA Book


tdda discover

NAME

tdda discover — automatically generate constraints for data

SYNOPSIS

tdda discover [-h] [-?] [-7] [--no-config] [--colour]
              [--no-colour] [-x] [-X] [-g] [-G]
              [-r REPORT ...] [-o REPORT_PATH]
              [--no-md] [--allowed] [--no-allowed]
              [--required] [--no-required] [--no-ar]
              [--pandas] [--polars] [--backend BACKEND]
              INPUT [CONSTRAINTS]

POSITIONAL ARGUMENTS

INPUT is one of:

  • a CSV file or other flat file (e.g. .csv, .txt, .psv), optionally using : format to specify flat-file metadata (see the help for tdda serial)

  • a data frame in a Parquet file (.parquet) e.g. from pandas, polars, R

  • a table from PostgreSQL databases (e.g. postgres:tablename)

  • a table from MySQL databases (e.g. mysql:tablename)

  • a table from SQLite databases (e.g. sqlite:tablename)

  • Standard input (stdin): Use - to read from stdin

(Use tdda help serial, tdda serial --help, or man tdda-serial for more information.)

CONSTRAINTS Name of the (JSON) constraints file to create.

  • Will use .tdda extension if no extension is specified.

  • Can be missing or - to write to standard output.
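The two naming rules above can be pictured with a minimal Python sketch. The function name constraints_destination is hypothetical, and this is an illustration of the stated rules rather than TDDA's own code.

```python
from pathlib import Path
from typing import Optional


def constraints_destination(constraints: Optional[str]) -> Optional[Path]:
    """Return the file to write, or None to mean standard output.

    Hypothetical helper mirroring the documented rules:
    a missing argument or "-" means stdout, and a name with no
    extension gets ".tdda" appended.
    """
    if constraints is None or constraints == "-":
        return None  # write to standard output
    path = Path(constraints)
    if path.suffix == "":
        path = path.with_suffix(".tdda")  # no extension given: use .tdda
    return path


print(constraints_destination("elements"))
print(constraints_destination("elements.json"))
print(constraints_destination("-"))
```

An explicit extension is left untouched in this sketch; only a bare name gains .tdda.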

DESCRIPTION

The tdda discover command is used to find constraints that are satisfied (in most cases) by the input ("training") data provided.

OPTIONS

The following options are available.

* indicates options that are the default behaviours

-h, --help Show this help message and exit
-?, --? Same as -h or --help
-7, --ascii Report without using special characters
-N, --no-config Skip loading ~/.tdda.toml
--colour Use colour in terminal output *
--no-colour Do not use colour in terminal output
-x, --rex Include regular expression generation
-X, --no-rex Exclude regular expression generation *
-g, --group-rex Group regular expression generation
-G, --no-group-rex Do not group regular expression generation *

-r, --report [REPORT ...] Report formats to write, space-separated.
Formats: html, md (markdown), txt (text), json, yaml, toml. The stem of the output file is taken from REPORT_PATH if -o is given, otherwise from CONSTRAINTS.

-o, --report-path REPORT_PATH Stem path for report files (extension
is replaced by the format).

--no-md Do not create metadata in constraints file
--allowed Create allowed-fields constraint (default)
--no-allowed Do not create allowed-fields constraint
--required Create required-fields constraint (default)
--no-required Do not create required-fields constraint
--no-allowed-required Same as --no-allowed --no-required
--no-ar Same as --no-allowed --no-required
--pandas, --pd Use Pandas as DataFrame engine. *
--polars, --pl Use Polars as DataFrame engine.
--backend, -B BACKEND Backend choice for Pandas
(when the DataFrame engine is Pandas): n for numpy_nullable *, a for pyarrow, o for original.

EXAMPLES

The example data can be obtained by running 'tdda examples', which will create various directories, including constraints_examples, containing the source data for these examples.

  1. tdda discover elements.parquet elements.tdda

This command will read data from elements.parquet and (attempt to) find constraints satisfied by every record and by the data as a whole. By default this can include minimum and maximum constraints on field values or lengths, nullability constraints, uniqueness constraints, sign constraints, and allowed-values constraints.

The results will be written to elements.tdda in a JSON format, including metadata. The output constraints file, elements.tdda, can be used with tdda verify to verify that another dataset with the same structure satisfies the constraints, or with tdda detect to find which records and/or values fail to satisfy the constraints. The .tdda file can be edited (carefully) by hand, or programmatically, to add, remove, tighten, or loosen constraints.

  2. tdda discover elements.csv

This command is almost the same as the first except that it reads data from the CSV file specified, and writes the constraints to the screen (standard output).

The CSV structure and field types will normally be inferred (possibly incorrectly) by TDDA, and if the inference is bad, the command may fail. If you use:

tdda discover elements.csv:format.serial

metadata in format.serial will be used to guide the DataFrame creation. If you use

tdda discover elements.csv:

it will look for any associated metadata for elements.csv using naming conventions described in the help for tdda serial.

  3. tdda discover --rex elements.csv:md.serial

This is similar to the last two except that:

  • regular expression inference is requested (--rex) for text fields. Rexpy will be used to attempt to infer one or a few regular expressions that characterize each field in the input data.

  • a metadata file to be used to interpret the .csv file is provided explicitly.

  4. tdda discover elements.parquet elements.tdda -r html -o elements

This discovers constraints as in example 1, and also writes an HTML report to elements.html.

  5. tdda discover elements.parquet elements.tdda -r md json txt -o elements

This discovers constraints as in example 1, and also writes reports to elements.md, elements.json, and elements.txt.

  6. tdda discover --rex postgres:elements

This is similar again, except that now the postgres: specifier will be interpreted as a reference to a database connection file in the user's home directory, with the name ~/.dbCredential.postgres. This file should contain connection information for a supported database. The extension .postgres does not itself mean that this is a PostgreSQL database, though that is a common convention. Use one of

tdda help db
tdda help database

to get help with the database connection file format.
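The examples above use several forms of INPUT. As a rough illustration of how those forms differ, here is a small Python sketch that classifies an INPUT string. The function classify_input, the returned dictionary keys, and the rule of recognising a database specifier by the three prefixes listed under POSITIONAL ARGUMENTS are all assumptions made for this sketch; TDDA's real parsing may differ.

```python
# Prefixes taken from the INPUT list in this page (assumed to be exhaustive here).
DB_PREFIXES = ("postgres", "mysql", "sqlite")


def classify_input(spec: str) -> dict:
    """Classify an INPUT argument (hypothetical helper, not TDDA code)."""
    if spec == "-":
        return {"kind": "stdin"}
    head, colon, tail = spec.partition(":")
    if colon and head in DB_PREFIXES:
        # e.g. postgres:elements -> connection name plus table name
        return {"kind": "database", "connection": head, "table": tail}
    if head.lower().endswith(".parquet"):
        return {"kind": "parquet", "path": head}
    # Anything else is treated as a flat file.
    if not colon:
        metadata = "infer"    # elements.csv   : structure is inferred
    elif tail == "":
        metadata = "lookup"   # elements.csv:  : find associated metadata
    else:
        metadata = tail       # elements.csv:format.serial : explicit metadata
    return {"kind": "flat", "path": head, "metadata": metadata}


for spec in ("elements.parquet", "elements.csv", "elements.csv:",
             "elements.csv:format.serial", "postgres:elements", "-"):
    print(spec, "->", classify_input(spec))
```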

SEE ALSO

tdda-verify(1), tdda-detect(1), tdda-serial(1)

Test Driven Data Analysis, book by Nicholas J. Radcliffe, chapters 2-7.


tdda verify

NAME

tdda verify — Verify that constraints are satisfied by data

SYNOPSIS

tdda verify [-h] [-?] [-7] [--no-config]
            [--colour] [--no-colour]
            [--epsilon EPSILON] [-a] [-f] [--dense]
            [-t {strict,loose}] [--verify-required-fields]
            [--verify-allowed-fields] [--no-verify-required-fields]
            [--no-verify-allowed-fields] [--varf] [--no-varf]
            [--pandas] [--polars] [--backend BACKEND]
            INPUT [CONSTRAINTS]

POSITIONAL ARGUMENTS

INPUT is one of:

  • a CSV file or other flat file (e.g. .csv, .txt, .psv), optionally using : format to specify flat-file metadata (see the help for tdda serial)

  • a data frame in a Parquet file (.parquet) e.g. from pandas, polars, R

  • a table from PostgreSQL databases (e.g. postgres:tablename)

  • a table from MySQL databases (e.g. mysql:tablename)

  • a table from SQLite databases (e.g. sqlite:tablename)

  • Standard input (stdin): Use - to read from stdin

CONSTRAINTS, if provided, is a JSON .tdda file containing constraints.

If no constraints file is provided, a file with the same path as the input file, but with a .tdda extension, will be tried.

DESCRIPTION

The tdda verify command is used to check that data conforms to the constraints specified. Any constraints not satisfied by the data are reported, together with summary statistics.

The tdda verify command does not report which records and values cause constraints to be violated: the companion command tdda detect performs this function.

OPTIONS

-h, --help Show this help message and exit
-?, --? Same as -h or --help
-7, --ascii Report without using special characters
-N, --no-config Skip loading ~/.tdda.toml

--colour Use colour in terminal output
--no-colour Do not use colour in terminal output

--epsilon EPSILON Epsilon fuzziness (tolerance for comparisons)

-a, --all Report all fields, even if there are no
failures

-f, --fields Report only fields with failures
--dense Compact output: less vertical space used

-t, --type_checking {strict,loose}
"loose" means consider all numeric types equivalent

--verify-required-fields, --vrf Force verification of required fields

--verify-allowed-fields, --vaf
Force verification of allowed fields

--no-verify-required-fields, --no-vrf
Force no verification of required fields

--no-verify-allowed-fields, --no-vaf
Force no verification of allowed fields

--varf, --vraf Force verification of allowed and required
fields

--no-varf, --no-vraf Force no verification of allowed and required
fields

--pandas, --pd Use Pandas as DataFrame engine.
--polars, --pl Use Polars as DataFrame engine.
--backend, -B BACKEND Backend choice for Pandas
(when the DataFrame engine is Pandas): n for numpy_nullable *, a for pyarrow, o for original.
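The -t/--type_checking option above distinguishes strict from loose checking, where loose treats all numeric types as equivalent. The following Python sketch shows one way to read that rule. The type names ("int", "real", "string") and the function types_match are hypothetical, chosen only for illustration.

```python
# Hypothetical set of type names regarded as numeric in this sketch.
NUMERIC_TYPES = {"int", "real"}


def types_match(expected: str, actual: str, mode: str) -> bool:
    """Return True if actual satisfies a type constraint of expected.

    mode is "strict" or "loose"; under "loose", any numeric type
    is accepted where another numeric type is expected.
    """
    if mode not in ("strict", "loose"):
        raise ValueError(f"unknown mode: {mode!r}")
    if expected == actual:
        return True
    if mode == "loose":
        return expected in NUMERIC_TYPES and actual in NUMERIC_TYPES
    return False


print(types_match("int", "real", "strict"))
print(types_match("int", "real", "loose"))
print(types_match("string", "int", "loose"))
```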

EXAMPLES

The example data can be obtained by running tdda examples, which will create various directories, including constraints_examples, containing source data for these examples.

  1. tdda verify elements.parquet elements.tdda

This command reads data from elements.parquet and checks it against the constraints in elements.tdda, reporting any constraints that are not satisfied.

SEE ALSO

tdda-detect(1), tdda-discover(1), tdda-serial(1)

Test Driven Data Analysis, book by Nicholas J. Radcliffe, chapters 2-7.


tdda detect

NAME

tdda detect — Detect data that does not obey supplied constraints

SYNOPSIS

tdda detect [-h] [-?] [-7] [--no-config] [--colour] [--no-colour]
            [--epsilon EPSILON] [-o REPORT_PATH] [-a] [-f]
            [-t {strict,loose}] [--write-all-records]
            [--per-constraint] [--no-per-constraint]
            [--no-original-fields] [--original-fields]
            [--no-output-fields] [--output-fields [OUTPUT_FIELDS ...]]
            [-r [REPORT ...]] [--interleave] [--no-interleave]
            [--index] [--int] [--key [KEY ...]] [--dense]
            [--verify-required-fields] [--verify-allowed-fields]
            [--no-verify-required-fields] [--no-verify-allowed-fields]
            [--varf] [--no-varf] [--pandas] [--polars]
            [--backend BACKEND]
            INPUT [CONSTRAINTS [OUTPUT]]

POSITIONAL ARGUMENTS

INPUT is one of:

  • a CSV file or other flat file (e.g. .csv, .txt, .psv), optionally using : format to specify flat-file metadata (see the help for tdda serial)

  • a data frame in a Parquet file (.parquet) e.g. from pandas, polars, R

  • a table from PostgreSQL databases (e.g. postgres:tablename)

  • a table from MySQL databases (e.g. mysql:tablename)

  • a table from SQLite databases (e.g. sqlite:tablename)

  • Standard input (stdin): Use - to read from stdin

CONSTRAINTS, if provided, is a JSON .tdda file containing constraints.

If no constraints file is provided, a file with the same path as the input file, but with a .tdda extension, will be tried.

OUTPUT specifies the destination for detected records.

This is usually a file if the input was a file (e.g. a .csv file or a parquet file), but does not have to be the same type. If the input is a database table, the output is always a database table in the same database.

DESCRIPTION

The tdda detect command finds and reports data that fails to satisfy the constraints in the CONSTRAINTS file specified. It also performs all the same functions as tdda verify.

OPTIONS

-h, --help Show this help message and exit
-?, --? Same as -h or --help
-7, --ascii Report without using special characters
-N, --no-config Skip loading ~/.tdda.toml

--colour Use colour in terminal output
--no-colour Do not use colour in terminal output

--epsilon EPSILON Epsilon fuzziness (tolerance for comparisons)

-a, --all Report all fields, even if there are no
failures

-f, --fields Report only fields with failures

-r, --report [REPORT ...]
Report formats to write, space-separated. Formats: html, md (markdown), txt (text), json, yaml, toml. The stem of the output file is taken from REPORT_PATH if -o is given, otherwise from OUTPUT.

-t, --type_checking {strict,loose}
"loose" means consider all numeric types equivalent

-o, --report-path REPORT_PATH
Stem path for report files (extension is replaced by the format).

--write-all-records Include passing records
--per-constraint Write one flag column per failing constraint in
addition to n_failures. Set by default.

--no-per-constraint Do not write out any per-constraint flag columns
--no-original-fields Do not write out original fields columns
--original-fields Write out original fields columns (default)
--no-output-fields Do not write out any original fields in the output. By
default, all original columns will be included.

--output-fields [OUTPUT_FIELDS ...]
Specify original columns to write out.

--interleave Interleave ok columns with original fields.
--no-interleave Do not interleave ok columns with original fields.
--index Include a row-number index in the output file when
detecting. Rows are usually numbered from 1, unless the input file already has an index.

--int Write out boolean fields as integers, with 1 for true
and 0 for false.

--key [KEY ...] Key or key fields to use when reporting failures
--dense Compact output: less vertical space used

--verify-required-fields, --vrf
Force verification of required fields

--verify-allowed-fields, --vaf
Force verification of allowed fields

--no-verify-required-fields, --no-vrf
Force no verification of required fields

--no-verify-allowed-fields, --no-vaf
Force no verification of allowed fields

--varf, --vraf Force verification of allowed and required
fields

--no-varf, --no-vraf Force no verification of allowed and required
fields

--pandas, --pd Use Pandas as DataFrame engine.
--polars, --pl Use Polars as DataFrame engine.
--backend, -B BACKEND Backend choice for Pandas
(when the DataFrame engine is Pandas): n for numpy_nullable *, a for pyarrow, o for original.
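The output-related options above (--write-all-records, --per-constraint, --int) can be pictured with a small pure-Python sketch. This is not TDDA's implementation: the function detect, the representation of constraints as named predicates, and the "_ok" column suffix are assumptions made for illustration. Only the n_failures column name comes from the option descriptions.

```python
def detect(rows, constraints, write_all_records=False,
           per_constraint=True, as_int=False):
    """Return output records for rows (list of dicts).

    constraints maps a hypothetical constraint name to a predicate
    taking a row and returning True when the row passes.
    """
    results = [{name: bool(check(row)) for name, check in constraints.items()}
               for row in rows]
    # Constraints that fail for at least one row get a flag column.
    failing = [name for name in constraints
               if any(not ok[name] for ok in results)]
    output = []
    for row, ok in zip(rows, results):
        n_failures = sum(not ok[name] for name in constraints)
        if n_failures == 0 and not write_all_records:
            continue  # passing records are omitted unless requested
        record = dict(row)  # original fields
        if per_constraint:
            for name in failing:
                flag = ok[name]
                record[name + "_ok"] = int(flag) if as_int else flag
        record["n_failures"] = n_failures
        output.append(record)
    return output


rows = [
    {"Z": 1, "Name": "Hydrogen"},
    {"Z": -2, "Name": "Helium"},
    {"Z": 3, "Name": None},
]
constraints = {
    "Z_min": lambda r: r["Z"] >= 1,
    "Z_max": lambda r: r["Z"] <= 118,
    "Name_nonnull": lambda r: r["Name"] is not None,
}
for record in detect(rows, constraints, as_int=True):
    print(record)
```

In this sketch the constraint Z_max never fails, so it gets no flag column, matching the description of one flag column per failing constraint.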

EXAMPLES

The example data can be obtained by running tdda examples, which will create various directories, including constraints_examples, containing source data for these examples.

  1. tdda detect elements.parquet elements.tdda elements-failures.parquet

This command reads data from elements.parquet, checks it against the constraints in elements.tdda, and writes records with one or more constraint failures to elements-failures.parquet.

  2. tdda detect elements.parquet elements.tdda elements-failures.parquet -r html -o elements

As above, and also writes an HTML report to elements.html.

  3. tdda detect elements.parquet elements.tdda elements-failures.parquet -r md json txt -o elements

As above, and also writes reports to elements.md, elements.json, and elements.txt.
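The report file names in the last two examples follow from the -r and -o descriptions: the stem comes from REPORT_PATH if given, otherwise from OUTPUT, and the extension is replaced by each format. A minimal Python sketch of that naming rule follows. The function report_paths is hypothetical and only illustrates the rule as documented.

```python
from pathlib import Path

# Formats listed under -r/--report in this page.
REPORT_FORMATS = ("html", "md", "txt", "json", "yaml", "toml")


def report_paths(formats, report_path=None, output=None):
    """Return one report path per requested format (illustrative only)."""
    source = report_path if report_path is not None else output
    if source is None:
        raise ValueError("need REPORT_PATH or OUTPUT to name the reports")
    unknown = [fmt for fmt in formats if fmt not in REPORT_FORMATS]
    if unknown:
        raise ValueError(f"unknown report formats: {unknown}")
    # with_suffix replaces an existing extension or adds one if absent.
    return [Path(source).with_suffix("." + fmt) for fmt in formats]


print(report_paths(["md", "json", "txt"], report_path="elements"))
print(report_paths(["html"], output="elements-failures.parquet"))
```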

SEE ALSO

tdda-verify(1), tdda-discover(1), tdda-serial(1)

Test Driven Data Analysis, book by Nicholas J. Radcliffe, chapters 2-7.


tdda diff

NAME

tdda diff — compare csv or parquet files

SYNOPSIS

tdda diff [--fields FIELD1,FIELD2,...]
            [--xfields FIELD1,FIELD2,...  ]
            [--horizontal] [-H] [--vertical] [-V]
            [--find-md] [--no-md]
            [--maxdiffs N] [--key FIELD]
            [--mono] [--bw] [--colours COLOURS] [-c COLOURS]
            [--dps N]  [--precision N]
            [--AE] [--LR] [--angles] [--pm]
            [--prefixes PREFIXES]
            [-N] [--no-config]
            [--strict] [--medium] [--loose] [--permissive]
            LEFT RIGHT

POSITIONAL ARGUMENTS

LEFT The first dataset to be compared, as a parquet or flat file (e.g. CSV), optionally using : format to specify flat-file metadata (see the help for tdda serial). (Normally thought of as left or actual)

RIGHT The second dataset to be compared as a parquet or flat file (e.g. CSV), optionally using : format to specify flat-file metadata (see the help for tdda serial). (Normally thought of as right, expected, reference, etc.)

DESCRIPTION

The tdda diff command compares two tabular datasets in CSV or Parquet files and shows some or all differences. It uses the same underlying functionality as the tdda.referencetest assertions such as assertDataFramesEqual, and provides similar control over what differences to consider, e.g. which fields, and strictness of type and numeric comparisons. It also provides a number of options for controlling the display of differences.

By default, comparisons are row-based and consider all fields (columns), as typed values after reading.
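To make the difference between the default row-based comparison and key-based matching (the --key option under OPTIONS) concrete, here is a small pure-Python sketch using lists of dictionaries in place of data frames. The function names and the shape of the returned results are hypothetical; tdda diff itself works on data frames and reports differences differently.

```python
def diff_by_position(left, right):
    """Compare row i of left with row i of right, field by field."""
    diffs = []
    for i, (lrow, rrow) in enumerate(zip(left, right)):
        for field in lrow:
            if lrow[field] != rrow.get(field):
                diffs.append((i, field, lrow[field], rrow.get(field)))
    return diffs


def index_by_key(rows, key):
    """Map key tuple -> row, insisting that the key is unique."""
    index = {}
    for row in rows:
        k = tuple(row[f] for f in key)
        if k in index:
            raise ValueError(f"key {k} is not unique")
        index[k] = row
    return index


def diff_by_key(left, right, key):
    """Match rows on the key fields, then compare the remaining fields."""
    lindex, rindex = index_by_key(left, key), index_by_key(right, key)
    changed = []
    for k, lrow in lindex.items():
        rrow = rindex.get(k)
        if rrow is None:
            continue  # reported under left_only instead
        for field in lrow:
            if field not in key and lrow[field] != rrow.get(field):
                changed.append((k, field, lrow[field], rrow.get(field)))
    return {
        "changed": changed,
        "left_only": [k for k in lindex if k not in rindex],
        "right_only": [k for k in rindex if k not in lindex],
    }


left = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
right = [{"id": 2, "v": 20}, {"id": 1, "v": 11}]
print(diff_by_position(left, right))    # every cell differs by position
print(diff_by_key(left, right, ["id"]))  # only v for id 1 differs
```

The same two datasets, reordered, show many positional differences but a single difference once rows are matched on the key.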

OPTIONS

* indicates options that are the default behaviours

--fields FIELD1,FIELD2,...
Check only these fields (comma-separated list)

--xfields FIELD1,FIELD2,...
Check all fields except these (comma-separated list)

--horizontal, -H
Horizontal display (left and right, side by side)

--vertical, -V
Vertical display (left above right)

--find-md
Attempt to find associated metadata for flat files automatically, without requiring : colon syntax in the path.

--no-md, --no-find-md
Do not attempt to find associated metadata for flat files (default).

--key FIELD
Use this field as a join key when reporting differences.

--maxdiffs N
Maximum number of differences to show.

--mono
Show monochrome output with different values in bold and shared values dimmed.

--bw
Show black and white output with different values in bold and shared values in the terminal's default style.

--colours COLOURS, -c COLOURS
Use colours specified e.g. -c red-blue

--dps N
Number of decimal places to show for floating-point values. Also sets precision if not specified separately.

--precision N
Precision for floating point comparisons. Two floats a and b will be considered equal if abs(a - b) < 1e-N.

--AE
Use A: and E: as labels for the two datasets (actual/expected)

--LR
Use L: and R: as labels for the two datasets (left/right)

--angles
Use < and > as labels for the two datasets

--pm
Use + and - as labels for the two datasets

--prefixes PREFIXES
Use prefixes specified as labels for the two datasets e.g. --prefixes "actual:-ref:" or "actual: -ref: " to include spaces

-N, --no-config
Use default configuration (ignore ~/.tdda.toml)

--strict
Use strict type comparisons

--medium
Use medium-strictness type comparisons

--loose
Use loose (permissive) type comparisons

--permissive
Use loose (permissive) type comparisons

--pandas, --pd Use Pandas as DataFrame engine. *
--polars, --pl Use Polars as DataFrame engine.
--backend, -B BACKEND Backend choice for Pandas
(when the DataFrame engine is Pandas): n for numpy_nullable *, a for pyarrow, o for original.

--help, -?, --?
Show help on tdda diff.

EXAMPLES

Data suitable for all examples can be obtained with

tdda examples diff

  1. tdda diff a.csv a.csv

This is the simplest form of the command. It reads each input (here a.csv both times) into a data frame, using the default engine (Pandas), and compares the two.

  2. tdda diff a.csv b.csv --vertical

Compare two CSV files, stacking left and right values vertically rather than side by side. Useful when there are many columns or long values.

  3. tdda diff before.parquet after.parquet --key Income,Expenditure

Compare two Parquet files using a composite join key. The fields Income and Expenditure must form a primary key in both datasets. Rows are matched by key rather than by position.

  4. tdda diff actual.csv expected.csv --AE --bw

Compare two CSV files using A: and E: as markers for actual and expected, with black-and-white output (different values in bold) instead of colour.

  5. tdda diff foo.csv: bar.csv:

Compare two CSV files, asking TDDA to find associated metadata files for each using naming conventions (e.g. @.serial or foo-metadata.json in the same directory).

  6. tdda diff foo.csv bar.txt:money.serial

Compare foo.csv (loaded with default settings) against bar.txt, using money.serial as the metadata file describing its format.

  7. tdda diff a.parquet b.csv --loose --dps 3

Compare a Parquet file against a CSV file with loose type matching and floating-point values compared to 3 decimal places.
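The floating-point rule stated under --precision (two floats a and b are equal if abs(a - b) < 1e-N), together with the note that --dps also sets the precision when none is given, can be written as a short Python sketch. The function names are hypothetical and this is an illustration of the documented rule, not TDDA's code.

```python
def resolve_precision(dps=None, precision=None):
    """An explicit precision wins; otherwise fall back to dps (may be None)."""
    return precision if precision is not None else dps


def floats_equal(a: float, b: float, precision: int) -> bool:
    """Equal if the absolute difference is below 10 to the power -precision."""
    return abs(a - b) < 10 ** -precision


n = resolve_precision(dps=3)          # as with --dps 3 and no --precision
print(floats_equal(1.0004, 1.0, n))   # difference 0.0004 is below 0.001
print(floats_equal(1.002, 1.0, n))    # difference 0.002 is not
```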


tdda ls

NAME

tdda ls — List fields in a dataset

SYNOPSIS

tdda ls [-h] [-1|--one-line] [-l] [--pandas] [--polars]
         [--backend BACKEND]
         INPUT

POSITIONAL ARGUMENTS

INPUT is one of:

  • a CSV file (or .tsv, .psv, .txt)

  • a Parquet file (.parquet)

  • a flat file with colon syntax to trigger metadata lookup (e.g. foo.csv:)

  • a flat file with an explicit metadata path (e.g. foo.csv:foo.serial)

DESCRIPTION

The tdda ls command lists the fields in a dataset.

Without --long, it prints a one-line summary followed by the field names, right-aligned.

With --long, it prints a one-line summary followed by a table showing each field's dtype, minimum value, maximum value, and null count.

For flat files, a second line reports how the file was read and which metadata file was used, if any.

OPTIONS

-h, -?, --help Show this help message and exit

-1, --one-line List all field names on one line, space-separated
-l, --long Show dtype, min, max, and null count per field

--pandas, --pd Use Pandas as DataFrame engine (default)
--polars, --pl Use Polars as DataFrame engine
--backend, -B BACKEND Backend choice for Pandas
n for numpy_nullable *, a for pyarrow, o for original

EXAMPLES

The example data can be obtained by running tdda examples, which will create various directories, including serial_examples.

  1. tdda ls accounts1k.parquet

List the fields in accounts1k.parquet.

  2. tdda ls -l accounts1k.csv:

Show field details for accounts1k.csv, using any associated metadata file found automatically.

  3. tdda ls -l accounts1k.csv --polars

Show field details using Polars.
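The per-field details shown by --long (dtype, minimum, maximum, and null count) can be approximated in a few lines of plain Python. The sketch below works on a list of dictionaries rather than a data frame, uses Python type names in place of real dtypes, and the function summarise_fields is hypothetical.

```python
def summarise_fields(rows):
    """Return one summary dict per field for rows (list of dicts).

    None stands in for null. The dtype here is just the Python type
    name of the first non-null value, a simplification of real dtypes.
    """
    fields = list(rows[0]) if rows else []
    summary = []
    for field in fields:
        values = [row[field] for row in rows]
        present = [v for v in values if v is not None]
        summary.append({
            "field": field,
            "dtype": type(present[0]).__name__ if present else "unknown",
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "nulls": len(values) - len(present),
        })
    return summary


rows = [
    {"name": "Ann", "balance": 10.5},
    {"name": "Bob", "balance": None},
    {"name": "Cy", "balance": -3.0},
]
for line in summarise_fields(rows):
    print(line)
```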

SEE ALSO

tdda-diff(1), tdda-serial(1), tdda-verify(1)


tdda cat

NAME

tdda cat — Display rows from a dataset as a rich table

SYNOPSIS

tdda cat [-h] [N | -N | +N] [-s | -S]
           [--fields FIELDS] [--xfields FIELDS]
           [-r N [--seed SEED]]
           [--pandas] [--polars] [--backend BACKEND]
           INPUT [FIELD ...]

POSITIONAL ARGUMENTS

INPUT is one of:

  • a CSV file (or .tsv, .psv, .txt)

  • a Parquet file (.parquet)

  • a flat file with colon syntax to trigger metadata lookup (e.g. foo.csv:)

  • a flat file with an explicit metadata path (e.g. foo.csv:foo.serial)

FIELD ... Field names (or fnmatch wildcard patterns) to display. Fields appear in the order given. Equivalent to --fields; both may be combined. Wildcards must be quoted in the shell.
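Field selection by name or fnmatch wildcard pattern can be sketched with Python's standard fnmatch module. The helper names below are hypothetical, and the way --fields and --xfields are combined (select first, then exclude) is an assumption for this sketch rather than documented behaviour.

```python
import re
from fnmatch import fnmatchcase


def split_patterns(spec):
    """Split a comma- or space-separated list of names or patterns."""
    return [p for p in re.split(r"[,\s]+", spec) if p]


def select_fields(all_fields, fields=None, xfields=None):
    """Return the field names to display (illustrative only)."""
    if fields:
        # Selected fields appear in the order the patterns are given.
        selected = []
        for pattern in split_patterns(fields):
            for name in all_fields:
                if fnmatchcase(name, pattern) and name not in selected:
                    selected.append(name)
    else:
        selected = list(all_fields)  # dataset order
    if xfields:
        excluded = split_patterns(xfields)
        selected = [name for name in selected
                    if not any(fnmatchcase(name, p) for p in excluded)]
    return selected


columns = ["name", "amount", "amount_raw", "balance"]
print(select_fields(columns, fields="balance,name"))
print(select_fields(columns, fields="amount*", xfields="*_raw"))
print(select_fields(columns, xfields="*_raw"))
```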

DESCRIPTION

The tdda cat command displays rows from a dataset as a rich table.

Without a row count, all rows are shown.

N or -N First N rows
+N Last N rows

Null values are shown as .
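The row-count forms described above (N or -N for the first N rows, +N for the last N rows, nothing for all rows) amount to simple slicing. A minimal Python sketch, with a hypothetical function name and the row count passed as a string as it would arrive from the command line:

```python
def select_rows(rows, spec=None):
    """Return the rows chosen by a cat-style row count.

    spec is None (all rows), "N" or "-N" (first N), or "+N" (last N).
    """
    if spec is None:
        return list(rows)
    n = int(spec.lstrip("+-"))
    if spec.startswith("+"):
        # rows[-0:] would return everything, so handle zero explicitly
        return list(rows[-n:]) if n else []
    return list(rows[:n])


rows = list(range(1, 8))
print(select_rows(rows, "-3"))
print(select_rows(rows, "+3"))
```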

OPTIONS

-h, -?, --help Show this help message and exit

--fields FIELDS Show only these fields. FIELDS is a comma- or space-separated list of field names or fnmatch wildcard patterns (e.g. eu_*, [a-z]*). Fields appear in the order specified. Requires quoting in the shell when using spaces or wildcards.

--xfields FIELDS Exclude these fields. Same format as --fields. Fields appear in dataset order.

-s Short headers: column width driven by data; headers split at word boundaries (punctuation and lowercase→uppercase transitions) and packed onto as few lines as possible.

-S Short headers: as -s but split anywhere (mid-word) to fit the data width.

-r N, --random N Show N random rows instead of a slice.

--seed SEED Random seed for -r. If omitted, a seed is chosen automatically and printed.

--pandas, --pd Use Pandas as DataFrame engine (default)
--polars, --pl Use Polars as DataFrame engine
--backend, -B BACKEND Backend choice for Pandas
n for numpy_nullable *, a for pyarrow, o for original

EXAMPLES

  1. tdda cat accounts1k.parquet

Display all rows from accounts1k.parquet.

  2. tdda cat -10 accounts1k.csv:

Display the first 10 rows, using any associated metadata file.

  3. tdda cat +10 accounts1k.csv:

Display the last 10 rows.

  4. tdda cat --fields 'name,balance' accounts1k.csv:

Display only the name and balance fields.

  5. tdda cat --fields 'amount*' --xfields '*_raw' accounts1k.csv:

Display fields matching amount*, excluding those ending in _raw.

  6. tdda cat -r 20 --seed 42 accounts1k.csv:

Display 20 random rows with a fixed seed.

  7. tdda cat -s accounts1k.csv:

Display all rows with compact multi-line headers, splitting at word boundaries (open_date → open date, accountType → account Type).
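The -s header splitting (at punctuation and at lowercase-to-uppercase transitions, then packed onto as few lines as possible) can be sketched with the standard re module. The functions below are hypothetical, treat any non-alphanumeric character as punctuation, and use a simple greedy packing. TDDA's actual rules may differ in detail.

```python
import re


def split_header(name):
    """Split a field name into words at punctuation and case transitions."""
    # Mark each lowercase-to-uppercase transition with a space.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    # Any run of non-alphanumeric characters separates words.
    return [part for part in re.split(r"[^A-Za-z0-9]+", spaced) if part]


def pack_words(words, width):
    """Greedily pack words onto lines no wider than width where possible."""
    lines = []
    for word in words:
        if lines and len(lines[-1]) + 1 + len(word) <= width:
            lines[-1] += " " + word
        else:
            lines.append(word)
    return lines


print(split_header("open_date"))
print(split_header("accountType"))
print(pack_words(split_header("accountType"), width=7))
```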

SEE ALSO

tdda-head(1), tdda-tail(1), tdda-sample(1), tdda-ls(1), tdda-diff(1), tdda-serial(1)


tdda head

NAME

tdda head — Display the first N rows of a dataset

SYNOPSIS

tdda head [-h] [N] [-s | -S]
            [--fields FIELDS] [--xfields FIELDS]
            [--pandas] [--polars] [--backend BACKEND]
            INPUT [FIELD ...]

POSITIONAL ARGUMENTS

INPUT Dataset path (CSV, Parquet, or colon syntax).

FIELD ... Field names (or fnmatch wildcard patterns) to display. Fields appear in the order given. Equivalent to --fields; both may be combined. Wildcards must be quoted in the shell.

DESCRIPTION

The tdda head command displays the first N rows of a dataset (default 10) as a rich table.

Null values are shown as .

OPTIONS

-h, -?, --help Show this help message and exit

N Number of rows to show (default 10)

--fields FIELDS Show only these fields. FIELDS is a comma- or space-separated list of field names or fnmatch wildcard patterns (e.g. eu_*, [a-z]*). Fields appear in the order specified. Requires quoting in the shell when using spaces or wildcards.

--xfields FIELDS Exclude these fields. Same format as --fields. Fields appear in dataset order.

-s Short headers: column width driven by data; headers split at word boundaries and packed onto as few lines as possible. See tdda-cat(1) for details.

-S Short headers: split anywhere to fit data width.

--pandas, --pd Use Pandas as DataFrame engine (default)
--polars, --pl Use Polars as DataFrame engine
--backend, -B BACKEND Backend choice for Pandas
n for numpy_nullable *, a for pyarrow, o for original

EXAMPLES

  1. tdda head accounts1k.parquet

Display the first 10 rows of accounts1k.parquet.

  2. tdda head 20 accounts1k.csv:

Display the first 20 rows, using any associated metadata file.

  3. tdda head --fields 'name,balance' accounts1k.csv:

Display only name and balance for the first 10 rows.

  4. tdda head -s 20 accounts1k.csv:

Display the first 20 rows with compact multi-line headers.

SEE ALSO

tdda-cat(1), tdda-tail(1), tdda-sample(1), tdda-ls(1), tdda-diff(1), tdda-serial(1)


tdda tail

NAME

tdda tail — Display the last N rows of a dataset

SYNOPSIS

tdda tail [-h] [N] [-s | -S]
            [--fields FIELDS] [--xfields FIELDS]
            [--pandas] [--polars] [--backend BACKEND]
            INPUT [FIELD ...]

POSITIONAL ARGUMENTS

INPUT Dataset path (CSV, Parquet, or colon syntax).

FIELD ... Field names (or fnmatch wildcard patterns) to display. Fields appear in the order given. Equivalent to --fields; both may be combined. Wildcards must be quoted in the shell.

DESCRIPTION

The tdda tail command displays the last N rows of a dataset (default 10) as a rich table.

Null values are shown as .

OPTIONS

-h, -?, --help Show this help message and exit

N Number of rows to show (default 10)

--fields FIELDS Show only these fields. FIELDS is a comma- or space-separated list of field names or fnmatch wildcard patterns (e.g. eu_*, [a-z]*). Fields appear in the order specified. Requires quoting in the shell when using spaces or wildcards.

--xfields FIELDS Exclude these fields. Same format as --fields. Fields appear in dataset order.

-s Short headers: column width driven by data; headers split at word boundaries and packed onto as few lines as possible. See tdda-cat(1) for details.

-S Short headers: split anywhere to fit data width.

--pandas, --pd Use Pandas as DataFrame engine (default)
--polars, --pl Use Polars as DataFrame engine
--backend, -B BACKEND Backend choice for Pandas
n for numpy_nullable *, a for pyarrow, o for original

EXAMPLES

  1. tdda tail accounts1k.parquet

Display the last 10 rows of accounts1k.parquet.

  2. tdda tail 20 accounts1k.csv:

Display the last 20 rows, using any associated metadata file.

  3. tdda tail --fields 'name,balance' accounts1k.csv:

Display only name and balance for the last 10 rows.

  4. tdda tail -s 20 accounts1k.csv:

Display the last 20 rows with compact multi-line headers.

SEE ALSO

tdda-cat(1), tdda-head(1), tdda-sample(1), tdda-ls(1), tdda-diff(1), tdda-serial(1)


tdda sample

NAME

tdda sample — Display N random rows from a dataset

SYNOPSIS

tdda sample [-h] [N] [--seed SEED] [-s | -S]
              [--fields FIELDS] [--xfields FIELDS]
              [--pandas] [--polars] [--backend BACKEND]
              INPUT [FIELD ...]

POSITIONAL ARGUMENTS

INPUT Dataset path (CSV, Parquet, or colon syntax).

FIELD ... Field names (or fnmatch wildcard patterns) to display. Fields appear in the order given. Equivalent to --fields; both may be combined. Wildcards must be quoted in the shell.

DESCRIPTION

The tdda sample command displays N randomly selected rows from a dataset (default 10) as a rich table.

When no --seed is given, a random seed is chosen automatically and printed so the result can be reproduced.

Null values are shown as .

OPTIONS

-h, -?, --help Show this help message and exit

N Number of random rows to show (default 10)

--seed SEED Random seed. If omitted, a seed is chosen automatically and printed.

--fields FIELDS Show only these fields. FIELDS is a comma- or space-separated list of field names or fnmatch wildcard patterns (e.g. eu_*, [a-z]*). Fields appear in the order specified. Requires quoting in the shell when using spaces or wildcards.

--xfields FIELDS Exclude these fields. Same format as --fields. Fields appear in dataset order.

-s Short headers: column width driven by data; headers split at word boundaries and packed onto as few lines as possible. See tdda-cat(1) for details.

-S Short headers: split anywhere to fit data width.

--pandas, --pd Use Pandas as DataFrame engine (default)
--polars, --pl Use Polars as DataFrame engine
--backend, -B BACKEND Backend choice for Pandas
n for numpy_nullable *, a for pyarrow, o for original

EXAMPLES

  1. tdda sample accounts1k.parquet

Display 10 random rows from accounts1k.parquet, printing the seed used.

  2. tdda sample 50 accounts1k.csv:

Display 50 random rows, using any associated metadata file.

  3. tdda sample 20 --seed 42 accounts1k.csv:

Display 20 random rows with a fixed seed (reproducible).

  4. tdda sample --fields 'name,balance' accounts1k.csv:

Display 10 random rows showing only name and balance.

  5. tdda sample -s 20 --seed 42 accounts1k.csv:

Display 20 random rows with compact multi-line headers.
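The seed behaviour described above (a seed is chosen and printed when none is given, so that a sample can be reproduced) can be sketched with Python's standard random module. The function sample_rows and the way the seed is chosen are assumptions for this sketch. The rows TDDA selects for a given seed will not match those selected here.

```python
import random


def sample_rows(rows, n=10, seed=None):
    """Return (seed, sampled_rows); choose and print a seed if none given."""
    if seed is None:
        seed = random.SystemRandom().randrange(2 ** 32)
        print(f"seed: {seed}")  # printed so the sample can be reproduced
    rng = random.Random(seed)
    return seed, rng.sample(rows, min(n, len(rows)))


rows = list(range(100))
seed, first = sample_rows(rows, 5)           # seed chosen automatically
_, again = sample_rows(rows, 5, seed=seed)   # same seed, same rows
print(first == again)
```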

SEE ALSO

tdda-cat(1), tdda-head(1), tdda-tail(1), tdda-ls(1), tdda-diff(1), tdda-serial(1)


tdda serial

NAME

tdda serial — Converts and generates serial metadata files.

SYNOPSIS

tdda serial [FLAGS] inmetadata outmetadata  
tdda serial --to FMT [FLAGS] inmetadata outmetadata  

Converts metadata from one metadata format, in inmetadata,
to another, in outmetadata.

tdda serial [FLAGS] indata outmetadata

Creates metadata for indata in outmetadata

tdda serial [FLAGS] inmetadata script.py

Creates Python code for reading a file in the format described by
inmetadata. Often, a reading library would be specified, e.g.

tdda serial a.serial a.py --to pd.r

which specifies that the Python script should use pandas.read_csv.


Supported formats FMT:

  SHORT FORM  LONG FORM/Description
  .         tdda.serial
  pd.r      pandas.read_csv
  pd.w      pandas.DataFrame.to_csv
  pl.r      polars.read_csv
  pl.w      polars.DataFrame.write_csv
  csv.r     python.csv.reader
  csv.w     python.csv.writer
  csvw      CSVW
  fl        frictionless
  fless     frictionless
  fl.r      frictionless.resource
  fl.p      frictionless.package

Multiple formats can be separated by commas.

Format is usually inferred from filename if following common conventions
for tdda.serial, CSVW, and frictionless.

OPTIONS

--to FMT Specify output metadata format (see list of formats above)

-B BE, --backend BE Specify backend for Pandas flavours: n: numpy_nullable a: pyarrow o: original Pandas backend.

--for FILE Filename for data to use when generating CSVW or Frictionless data. (Can also be used for tdda.serial and .py output)

-N, --no-config Use default configuration (ignore ~/.tdda.toml)

-g, --gen, --generate Generate (infer) metadata for flat file

-q, --quiet Quiet output

-v, --verbose Verbose output

-V, --Verbose More verbose output

Options used primarily or exclusively with --generate/--gen/-g

--sep D, --delimiter D Specify D as the field separator.

--quote-char Q, --quote Q Specify Q as the quote character. (Q is always " or ' in practice.)

--nulls S Specify null indicator, or comma-separated list of null indicators.

--escape Use backslash as escape character. NOTE: Always backslash: does not take argument.

--no-escape Do not support backslash escaping with -g. NOTE: This only affects quotes, separators, and backslashes. Standard escapes for control sequences (\t, \n, \r, \f) are always supported.

--stutter Specify quote stuttering. Usually an alternative to --escape.

--no-stutter Do not use quote stuttering. Usually used with --escape.
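The difference between quote stuttering and backslash escaping can be seen with Python's standard csv module. This sketch assumes that "stuttering" means writing an embedded quote character twice, which corresponds to the csv module's doublequote setting; the helper write_row is hypothetical and says nothing about how tdda reads or writes files.

```python
import csv
import io


def write_row(row, **fmt):
    """Write one row with csv.writer and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", quoting=csv.QUOTE_ALL, **fmt).writerow(row)
    return buf.getvalue()


row = ['say "hi"', "plain"]

# Stuttering: an embedded quote is written twice.
stuttered = write_row(row, doublequote=True)
# Escaping: an embedded quote is preceded by a backslash instead.
escaped = write_row(row, doublequote=False, escapechar="\\")

print(stuttered, end="")
print(escaped, end="")

# Either form reads back to the original row when the reader uses the same settings.
back = next(csv.reader(io.StringIO(escaped), doublequote=False, escapechar="\\"))
print(back == row)
```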

--encoding ENC, -e ENC Specify ENC as encoding.

--date-format D Specify D as the (file-wide default) date format.

--datetime-format D Specify D as the (file-wide default) format for datetime fields.

--sample-lines N, -n N Use (up to) N sample lines when inferring metadata.

--single-field, -1 Inform the metadata inferrer that the file contains only a single field (column).

--include-path Include path in .serial output

--exclude-path Do not include path in .serial output

--quoting Q Set quoting to Q. Q must be one of: QUOTE_ALL QUOTE_MINIMAL QUOTE_NONNUMERIC QUOTE_NONE QUOTE_NOTNULL QUOTE_STRINGS QUOTE_STRINGS_ONLY

--use-literal-dates Specifies that date formats should be written to .serial files with unambiguous literal examples such as 2000-12-31T12:34:56.

--use-yyyy-dates Specifies that date formats should be written to .serial files in the form exemplified by YYYY-MM-DD HH:MM:SS.

--use-pc-dates Specifies that date formats should be written to .serial files in Python strftime-compatible % formats, exemplified by %Y-%m-%dT%H:%M:%S.
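The relationship between the strftime-style format and the literal example quoted above can be checked with Python's datetime module. This is only a sketch showing that the two describe the same layout; the variable names are illustrative.

```python
from datetime import datetime

PC_FORMAT = "%Y-%m-%dT%H:%M:%S"   # the strftime-compatible form quoted above
moment = datetime(2000, 12, 31, 12, 34, 56)

# Formatting the moment with the % format gives the literal example.
text = moment.strftime(PC_FORMAT)
print(text)

# Parsing the literal example with the same format recovers the moment.
print(datetime.strptime(text, PC_FORMAT) == moment)
```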

EXAMPLES

  1. tdda serial a.csv a.serial
    Generate tdda.serial metadata describing format of a.csv in a.serial

  2. tdda serial --to . a.csv a.serial
    Same as previous, explicitly specifying the default, tdda.serial, output format (. is short for tdda.serial format).

  3. tdda serial a.csv a-metadata.json
    Generate CSVW metadata describing format of a.csv in a-metadata.json

  4. tdda serial --to csvw a.csv a.json
    Same as previous, explicitly specifying format with non-standard output name

  5. tdda serial a.serial a-metadata.json
    Converts tdda.serial metadata to CSVW

  6. tdda serial a-metadata.json a.serial
    Converts CSVW metadata to tdda.serial

USING SERIAL METADATA WITH TDDA COMMANDS

For all tdda command-line commands, and in most places within API calls where a CSV or other flat file is specified, there is the option to specify the file format using tdda.serial files, CSVW files, or Frictionless files. This is based on the : (colon) specifier.

When specifying a path to a CSV (or other flat) file:

  • If the path is used by itself, the tdda library will use either tdda.serial.csv_to_pandas or tdda.serial.csv_to_polars to read it into a DataFrame. The default is currently pandas (with the numpy_nullable back end), but this can be configured (see tdda config) or, in many cases, controlled with command-line flags (--polars, --pandas, or, for Pandas only, --backend BACKEND).

  • If the path ends in a colon (e.g. foo.csv:), TDDA will search for metadata in the same directory as the file and, if it finds one, pass that to the appropriate csv_to_... function for more accurate DataFrame generation.

  • In doing this, it will look for the following in priority order, given a file foo.csv:

    • foo.csv.serial (tdda.serial metadata)

    • foo.serial (tdda.serial metadata). This is actually more common than the previous form, but if there are multiple files with different extensions, the former is more specific, so is checked first.

    • Anything that matches foo using @ as a wildcard, e.g. @.serial, f@.serial, f@o.serial, @oo.serial. (@ acts like * in the shell, while avoiding needing * in filenames, which can be awkward.)

    • foo-metadata.json, foo-csvmetadata.json, foo-csv-metadata.json, foo.csvmetadata.json, foo.csv-metadata.json (all of which are common conventions for CSVW metadata files).

    • The same CSVW patterns with @ wildcards

    • foo.serial.json, foo.serial.yaml, foo.resource.json, foo.resource.yaml, foo.package.json, foo.package.yaml, all of which are common for Frictionless metadata files.

    • The same patterns for serial or package frictionless files with @ wildcards. Wildcards are not searched in resource files, because in frictionless these always correspond to a single data file.

  • If the path contains a colon, the part to the right of the colon will be interpreted as a metadata file. So foo.csv:bar.serial will use bar.serial.
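The colon rules above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not tdda's implementation: the function names are hypothetical, only the tdda.serial names are covered (the CSVW and Frictionless patterns are omitted), and Windows drive letters are not handled.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath


def split_spec(spec):
    """Split 'data[:[metadata]]' into (data, metadata, search).

    metadata is None unless given explicitly; search is True when the
    spec ends in a bare colon, meaning "look for metadata beside the file".
    """
    data, colon, md = spec.partition(":")
    return data, (md or None), bool(colon) and not md


def exact_serial_candidates(data_path):
    """The two exact tdda.serial names, most specific first."""
    p = PurePath(data_path)
    return [p.name + ".serial", p.stem + ".serial"]


def wildcard_applies(md_name, data_path):
    """True if a name such as 'f@.serial' matches the data file's stem."""
    suffix = ".serial"
    if "@" not in md_name or not md_name.endswith(suffix):
        return False
    pattern = md_name[:-len(suffix)].replace("@", "*")   # @ acts like *
    return fnmatchcase(PurePath(data_path).stem, pattern)


print(split_spec("foo.csv"))
print(split_spec("foo.csv:"))
print(split_spec("foo.csv:bar.serial"))
print(exact_serial_candidates("data/foo.csv"))
print(wildcard_applies("f@.serial", "foo.csv"))
```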

BUGS

The tdda serial functionality is fairly new, and there are probably still bugs and undesirable features in the implementation.

SEE ALSO

Test Driven Data Analysis, book by Nicholas J. Radcliffe, chapter 8.


tdda gentest

NAME

tdda gentest — Gentest writes tests, so you don't have to.™

SYNOPSIS

tdda gentest   Runs the Gentest Wizard

tdda gentest   'SHELL COMMAND' [OPTIONS] [test_output.py]
               [REFERENCE_FILE ...]

POSITIONAL ARGUMENTS

SHELL COMMAND is the command to be tested. It should normally be enclosed in single quotes. It can be any terminal command — a shell built-in, a shell script, an R program, a Python program, or anything else that can be run from the terminal.

test_output.py is the name of the Python test script to generate. If not specified, Gentest derives a name from the command.

REFERENCE_FILE ... are optional additional files or directories that Gentest should monitor for files created or modified during command execution.

DESCRIPTION

Gentest will create Python tests, using tdda's reference-testing capabilities, for terminal-based programs written in any language. For example, the shell command can be a built-in shell command or can run a shell script, an R program, or of course a Python program.

It has a wizard, invoked just by typing tdda gentest, that prompts for the information it needs before generating the tests.

Alternatively, the command to be tested and optionally other parameters can all be specified on the command line.

Gentest:

  • Runs the provided command more than once (by default)

  • Captures output to stdout and stderr

  • Captures the exit code

  • Notices any files created in the directory or subdirectories or other specified places

  • Uses variations in output and other heuristics to identify parts of the output that appear variable and uses rexpy to write reference tests that only test things that appear to be fixed and not system dependent.

  • Writes a Python test script, using tdda.referencetest, that contains a set of tests of the shell command specified.

The test script can then, of course, be edited by hand.

The test script, when run, executes the command again and checks that its behaviour is as expected (i.e., is “the same” as when Gentest ran originally, except for the variations allowed in the reference test specifications).
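The idea of running a command more than once and keeping only the output that does not vary can be sketched with Python's subprocess module. This is a deliberately simplified illustration, not Gentest's algorithm: the function names are hypothetical, the comparison is line by line, and every run is assumed to print the same number of lines. Gentest itself uses rexpy and other heuristics, as described above.

```python
import subprocess
import sys


def run_once(command):
    """Run a shell command, returning (stdout, stderr, exit code)."""
    done = subprocess.run(command, shell=True, capture_output=True, text=True)
    return done.stdout, done.stderr, done.returncode


def stable_lines(command, iterations=2):
    """Run command several times; return the stdout lines that never vary."""
    runs = [run_once(command)[0].splitlines() for _ in range(iterations)]
    return [lines[0] for lines in zip(*runs) if len(set(lines)) == 1]


# Hypothetical command: one fixed line and one line that changes on every run.
code = "import os; print('Hey, cats'); print('token', os.urandom(8).hex())"
command = f'"{sys.executable}" -c "{code}"'

print(stable_lines(command))
```

Only the fixed line survives the comparison, so a test built from it would not fail merely because the token changes between runs.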

OPTIONS

-h, --help Show this help message and exit
-?, --? Same as -h or --help
-m N, --max-files N Max files to track
-r, --relative-paths Show relative paths wherever possible
-n N, --iterations N Number of times, N, to run the command (default 2)

-O, --no-stdout Do not generate a test checking output to STDOUT
-E, --no-stderr Do not generate a test checking output to STDERR
-Z, --non-zero-exit Do not require exit status to be 0
-C, --no-clobber Do not overwrite existing test script or reference directory

-N, --no-config Use default configuration (ignore ~/.tdda.toml)

EXAMPLES

  1. tdda gentest

Runs the Gentest wizard, which presents a dialogue something like the one below, where every suggested answer (shown in square brackets) is accepted by hitting RETURN. This is an improbably simple command to test; usually it would be a command that runs a script or program.

$ tdda gentest
Enter shell command to be tested: echo "Hey, cats!"
Enter name for test script [test_echo__Hey__cats__]:
Check all files written under $(pwd)?: [y]:
Check all files written under (gentest's) $TMPDIR?: [y]:
Enter other files/directories to be checked, one per line, then a blank line:

Check stdout?: [y]:
Check stderr?: [y]:
Exit code should be zero?: [y]:
Clobber (overwrite) previous outputs (if they exist)?: [y]:
Number of times to run script?: [2]:

Running command 'echo "Hey, cats!"' to generate output (run 1 of 2).
Saved (non-empty) output to stdout to /home/tdda/ref/echo__Hey__cats__/STDOUT.
Saved (empty) output to stderr to /home/tdda/ref/echo__Hey__cats__/STDERR.

Running command 'echo "Hey, cats!"' to generate output (run 2 of 2).
Saved (non-empty) output to stdout to /home/tdda/ref/echo__Hey__cats__/2/STDOUT.
Saved (empty) output to stderr to /home/tdda/ref/echo__Hey__cats__/2/STDERR.

Test script written as /home/tdda/test_echo__Hey__cats__.py
Command execution took: 0.022s

SUMMARY:

Directory to run in:        /home/tdda
Shell command:              echo "Hey, cats!"
Test script generated:      /home/tdda/test_echo__Hey__cats__.py
Reference files: (none)
Check stdout:               yes (was 'Hey, cats!\n')
Check stderr:               yes (was empty)
Expected exit code:         0
Clobbering permitted:       yes
Number of times script ran: 2
Number of tests written:    4

  2. tdda gentest 'echo "Hey, cats!"' 'test_echo.py' -n 3

Same as above, except that the command and a custom name for the test script have been supplied, so the wizard does not run, and the number of times to run the command has been increased to three.

The test script produced is almost identical except for the number of times the command is run.

  3. tdda gentest 'diff verifier1.txt verifier2.txt' -Z

Gentest will normally fail if the program produces a non-zero exit code, generally indicating an error. Commands like diff, however, produce a non-zero exit code (1) when there are differences. The -Z option (or --non-zero-exit) allows the exit code to be non-zero, and Gentest generates a test that checks it is the expected value (1, in this case, if the two verifier files should be different).
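The check on a non-zero exit code amounts to recording the status once and comparing against it later. The following is a minimal sketch of that idea using Python's subprocess module; the command is a hypothetical stand-in for something like diff, and the code is not what Gentest generates.

```python
import subprocess
import sys


def exit_code(command):
    """Run a shell command and return its exit status."""
    return subprocess.run(command, shell=True, capture_output=True).returncode


# Hypothetical stand-in for a command, such as diff, that exits with status 1.
command = f'"{sys.executable}" -c "import sys; sys.exit(1)"'

expected = exit_code(command)            # recorded when the test is generated
assert exit_code(command) == expected    # checked each time the test is rerun
print(expected)
```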

SEE ALSO

rexpy(1), tdda-diff(1)

Test Driven Data Analysis, book by Nicholas J. Radcliffe, chapter 9 for Gentest, and chapters 9-12 for reference testing more generally.


tdda tag

NAME

tdda tag — tag tests that failed in the last reference test run

SYNOPSIS

tdda tag

DESCRIPTION

The tdda tag command reads the log of failing tests written by the most recent logged tdda.referencetest run and adds @tag decorators to those tests in their source files. Tagged tests can then be run in isolation, allowing a rapid edit-test cycle focused on failing tests. A logged run of tdda.referencetest uses --log-failures or (for unittest-style tests only) -F.

WORKFLOW

A typical workflow with unittest-style tests (ReferenceTestCase) is:

python tests.py -9      # Remove any existing @tag decorators
python tests.py -F      # Run tests, logging failures
tdda tag                # Add @tag to failing tests
python tests.py -1      # Run only tagged (failing) tests

When all tests are passing:

python tests.py -9      # Remove @tag decorators

The equivalent workflow with pytest is:

pytest --untag          # Remove any existing @tag decorators
pytest --log-failures   # Run tests, logging failures
tdda tag                # Add @tag to failing tests
pytest --tagged         # Run only tagged (failing) tests

When all tests are passing:

pytest --untag          # Remove @tag decorators

SEE ALSO

tdda(1)


tdda examples

NAME

tdda examples — Creates example data for TDDA

SYNOPSIS

tdda examples [OUTDIR]  
tdda examples [MODULE...] [OUTDIR]  
tdda examples all [OUTDIR]

POSITIONAL ARGUMENTS

MODULE can be any of:

  • referencetest

  • constraints

  • rexpy

  • gentest

  • book

If no MODULE is specified, the first four are created, which does not require internet access.

OUTDIR is an optional directory in which to write the example directories; by default this will be the current working directory (.).

If all is specified, or book is included, the tdda-book-examples will be downloaded from GitHub, which does require internet access.

DESCRIPTION

Writes out example code and data for the modules specified.

If no module is specified, examples for the first four modules (referencetest, constraints, rexpy, and gentest) are written out.

Examples are created in subdirectories of OUTDIR (default: the current directory .).

EXAMPLES

  1. tdda examples
    Creates the referencetest, constraints, rexpy, and gentest examples in .

  2. tdda examples gentest
    Creates examples_gentest in .

  3. tdda examples gentest book
    Creates gentest and book examples in .

  4. tdda examples all
    Creates all the examples, four from local files and the book examples from GitHub in .


tdda version

NAME

tdda version — Reports the (active) installed version of tdda

SYNOPSIS

tdda version

DESCRIPTION

Reports the version number of the (active) TDDA tools.

EXAMPLES

tdda version


tdda config

NAME

tdda config — Shows config settings

SYNOPSIS

tdda config [--annotated|-a] [--current|-c] [--default|-d] [--file|-f]  
tdda config [--annotated|-a] current|default|file

DESCRIPTION

Shows configuration information. Use:

-c, --current, or current for the current configuration
-d, --default, or default for the default configuration
-f, --file, or file for the configuration file location and contents.

With no argument, it shows the current configuration.

Use -a or --annotated with any of the above to show allowed values alongside each parameter.

EXAMPLES

tdda config
tdda config -c
tdda config -d
tdda config -f

PARAMETERS

null_rep

Used to show nulls in some contexts.
Default: "∅"
Allowed: Any string

colour

Controls whether output is colourized.
Default: true
Allowed: true, false

engine

Controls whether pandas or polars is used for CSV files by default.
Default: "pandas"
Allowed: "pandas", "polars"

pandas_backend

Controls default backend for CSV loading etc.
Default: "numpy_nullable"
Allowed: "numpy_nullable" (or "n"), "pyarrow" (or "a"), "original" (or "o")
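The allowed values listed above, including the short names for pandas_backend, can be expressed as a small validation table. This is a sketch for illustration only: the names ALLOWED, BACKEND_ALIASES, and normalize are hypothetical and do not describe how tdda reads or checks its configuration.

```python
# Allowed values as listed in the parameter descriptions above.
ALLOWED = {
    "engine": {"pandas", "polars"},
    "pandas_backend": {"numpy_nullable", "pyarrow", "original"},
    "colour": {True, False},
}
BACKEND_ALIASES = {"n": "numpy_nullable", "a": "pyarrow", "o": "original"}


def normalize(name, value):
    """Expand short backend names, then check value against ALLOWED."""
    if name == "pandas_backend":
        value = BACKEND_ALIASES.get(value, value)
    if name in ALLOWED and value not in ALLOWED[name]:
        raise ValueError(f"{name}: {value!r} is not allowed")
    return value   # parameters not in ALLOWED (e.g. null_rep) pass through


print(normalize("pandas_backend", "a"))
print(normalize("engine", "polars"))
print(normalize("null_rep", "NULL"))
```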

PARAMETERS (referencetest)

left_colour

Colour for left (actual) side of diffs.
Default: "red"
Allowed: A named ANSI colour (red, bright_red etc.) or an RGB hex colour with leading # such as #FF0000 for pure red. Interpreted by the rich library.

right_colour

Colour for right (expected) side of diffs.
Default: "green"
Allowed: A named ANSI colour (red, bright_red etc.) or an RGB hex colour with leading # such as #FF0000 for pure red. Interpreted by the rich library.

failure_colour

Colour used to highlight failures.
Default: "red"
Allowed: A named ANSI colour (red, bright_red etc.) or an RGB hex colour with leading # such as #FF0000 for pure red. Interpreted by the rich library.

mono

Use bold instead of colour for diffs.
Default: false
Allowed: true, false

bw

Black and white mode: no colour or bold.
Default: false
Allowed: true, false

left_prefix

Prefix string for left (actual) diff lines.
Default: "< "
Allowed: Any string

right_prefix

Prefix string for right (expected) diff lines.
Default: "> "
Allowed: Any string

vertical

Show diffs vertically rather than side by side.
Default: false
Allowed: true, false

force_val_prefixes

Always show left/right prefixes on diff lines.
Default: false
Allowed: true, false

type_checking

How strictly to check types in reference test comparisons.
Default: "strict"
Allowed: "strict", "medium", "loose"

log_failures

Log failing test IDs to file for use with tdda tag.
Default: false
Allowed: true, false

PARAMETERS (constraints)

interleave

Interleave pass and fail results in verify output.
Default: true
Allowed: true, false

per_constraint

Report results per constraint rather than per field.
Default: true
Allowed: true, false

detect_passes

Include passing fields in detect output.
Default: true
Allowed: true, false

report_formats

List of additional report formats to generate.
Default: []
Allowed: Any subset of "html", "md", "txt", "json", "yaml", "toml"

write_all_records

Write all records to detect output, not just failures.
Default: false
Allowed: true, false

int_bools

Use integers (0/1) rather than booleans in detect output.
Default: false
Allowed: true, false

verify_required_fields

Verify that all required fields are present.
Default: unset
Allowed: true, false

verify_allowed_fields

Verify that no fields are present outside the allowed set.
Default: unset
Allowed: true, false

write_required_fields

Discover should include the required-fields constraint.
Default: false
Allowed: true, false

write_allowed_fields

Discover should include an allowed-fields constraint.
Default: false
Allowed: true, false

PARAMETERS (tddadiff)

type_checking

How strictly to check types when comparing dataframes.
Default: "medium"
Allowed: "strict", "medium", "loose"

find_md

Infer metadata when comparing dataframes with tdda diff.
Default: true
Allowed: true, false

PARAMETERS (serial)

md_inpath

Path(s) to search for serial metadata files; relative paths are resolved relative to the CSV file.
Default: "./_write.serial"


tdda test

NAME

tdda test — Run the tdda library's self-tests

SYNOPSIS

tdda test

DESCRIPTION

Runs tdda's (internal) self-tests.

NOTE: It is hard to guarantee that all will pass on all systems given that dependencies are not tightly pinned. It is not necessarily a problem if some tests fail, but is a concern if a very large number fail.

SEE ALSO

tdda(1)


tdda help

NAME

tdda help — Provides help on tdda and its sub-commands.

SYNOPSIS

tdda help
tdda help COMMAND

POSITIONAL ARGUMENTS

COMMAND can be any of:

discover
verify
detect

examples
gentest

diff
serial

tag
config

help
version
test
installman

DESCRIPTION

Shows help on a tdda subcommand or topic.

Taking inspiration from git, if the man pages are installed (see tdda installman), help on main commands can also be obtained with

man tdda-COMMAND

For example:

man tdda-discover

Help can also be obtained on each command with --help, -h or -?, e.g.

tdda discover --help

EXAMPLES

tdda help Shows this help

tdda help gentest Shows help on gentest

SEE ALSO

tdda-installman(1)


tdda installman

NAME

tdda installman — install tdda man pages

SYNOPSIS

tdda installman [--system]

DESCRIPTION

Installs the tdda man pages so they can be accessed with the man command.

Once installed, the main tdda man page is available as:

man tdda

Man pages for tdda subcommands are available as:

man tdda-COMMAND

For example:

man tdda-discover
man tdda-gentest

The rexpy man page is accessed as:

man rexpy

By default, man pages are installed to ~/.local/share/man/man1. On macOS, this directory may not be in the default man search path; if so, tdda installman will print the line to add to your shell config file to make the man pages available in new shells.

With --system, man pages are installed to /usr/local/share/man/man1, which is in the default search path on most systems but may require running with sudo.

On Windows, man pages are not supported; consider running tdda under WSL (Windows Subsystem for Linux).

OPTIONS

--system, -s
Install system-wide to /usr/local/share/man/man1 (may require sudo).

EXAMPLES

  1. tdda installman
    Install man pages to ~/.local/share/man/man1.

  2. tdda installman --system
    Install man pages system-wide (may require sudo).

SEE ALSO

tdda-help(1)


rexpy

NAME

rexpy — infer regular expressions from example strings

SYNOPSIS

rexpy [FLAGS] [INPUTFILE [OUTPUTFILE]]

DESCRIPTION

rexpy reads a list of strings (one per line) and infers one or more regular expressions that characterize them.

If INPUTFILE is provided it should contain one string per line; otherwise lines are read from standard input.

If OUTPUTFILE is provided, the regular expressions found will be written there (one per line); otherwise they will be printed to standard output.

OPTIONS

-h, --header
Discard the first line as a header.

-?, --help
Print usage information and exit.

-g, --group
Generate capture groups for each variable fragment of each regular expression, i.e. surround variable components with parentheses.
e.g.     ^[A-Z]+\-[0-9]+$
becomes ^([A-Z]+)\-([0-9]+)$

-q, --quote
Display regular expressions as double-quoted, escaped strings, suitable for use in Unix shells, JSON, and string literals in many programming languages.
e.g.     ^[A-Z]+\-[0-9]+$
becomes "^[A-Z]+\\-[0-9]+$"

--portable, --grep
Produce maximally portable regular expressions (e.g. [0-9] rather than \d). This is the default.

--java
Produce Java-style regular expressions (e.g. \p{Digit}).

--posix
Produce POSIX-compliant regular expressions (e.g. [[:digit:]] rather than \d).

--perl
Produce Perl-style regular expressions (e.g. \d).

-u, --underscore
Allow underscore to be treated as a letter. Mostly useful for matching identifiers. The short form -_ is also accepted.

-d, --dot, --period
Allow dot to be treated as a letter. Mostly useful for matching identifiers. The short form -. is also accepted.

-m, --minus, --hyphen, --dash
Allow minus to be treated as a letter. Mostly useful for matching identifiers.

-vlf, --variable
Use variable-length fragments.

-flf, --fixed
Use fixed-length fragments.

-v, --version
Print the version number.

-V, --verbose
Set verbosity level to 1.

-VV, --Verbose
Set verbosity level to 2.
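The effect of the -g option, and the equivalence of the portable and Perl-style digit classes for ASCII digits, can be checked with Python's re module. This sketch uses only the patterns quoted in the option descriptions above; the example string is hypothetical, and rexpy itself is not called.

```python
import re

plain = re.compile(r"^[A-Z]+\-[0-9]+$")        # portable form, no groups
grouped = re.compile(r"^([A-Z]+)\-([0-9]+)$")  # the same pattern as produced with -g

code = "AB-1234"  # hypothetical example string

print(bool(plain.match(code)))
print(grouped.match(code).groups())   # the variable fragments, captured separately

# For ASCII digits, the Perl-style class \d matches the same strings as [0-9].
perl = re.compile(r"^[A-Z]+\-\d+$")
print(bool(perl.match(code)))
```

Capture groups are what make the variable parts of each string available to code that applies the expression.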

SEE ALSO

tdda(1), tdda-discover(1)