TDDA's Constraints API

Basic API for Pandas DataFrames

tdda.constraints.api.detect(indata, constraints_path, outpath=None, engine=None, backend=None, **kwargs)

Detect records that fail any of the constraints in the .tdda file provided.

Parameters:

indata – Path to a data file or a DataFrame to be checked.
constraints_path – Path to a JSON .tdda file, or an in-memory DatasetConstraints object.
outpath – Optional path for output records (CSV or parquet). None for no output.
engine – DataFrame engine: 'pandas' or 'polars'.
backend – Pandas backend: 'numpy_nullable' (or 'n'), 'pyarrow' (or 'a'), or 'original' (or 'o').
**kwargs – Additional keyword arguments passed to detect_df.

Returns:

Detection results.

Return type:

PandasDetection

tdda.constraints.api.discover(indata, constraints_path=None, report_path=None, report_formats=None, engine=None, backend=None, verbose=True, **kwargs)

Discover constraints characterizing the data provided.

Parameters:

indata – Data for which constraints are to be discovered. Can be a path to a data file (CSV, parquet, or other flat file) or a DataFrame (Pandas or Polars).
constraints_path – Path to write discovered constraints to. If None, constraints are not written. If '-', constraints are written to stdout.
report_path – Path for reports (extension ignored). Writes reports to variations of this path if set; otherwise uses constraints_path.
report_formats – List of report formats to write. Options: 'html', 'markdown' (or 'md'), 'text' (or 'txt'), 'yaml', 'json', 'toml'.
engine – DataFrame engine: 'pandas' or 'polars'.
backend – Pandas backend: 'numpy_nullable' (or 'n'), 'pyarrow' (or 'a'), or 'original' (or 'o').
verbose – Controls level of output reporting. Default is True.
**kwargs – Additional keyword arguments passed to discover_df.

Returns:

Discovered constraints.

Return type:

DatasetConstraints

tdda.constraints.api.verify(indata, constraints_path, outdata=None, verbose=True, engine=None, backend=None, md_path=None, **kwargs)

Verify that the data provided satisfies the constraints in the .tdda file provided.

Parameters:

indata – Path to a data file or a DataFrame to be verified.
constraints_path – Path to a JSON .tdda file, or an in-memory DatasetConstraints object.
outdata – Optional destination for output data.
verbose – Controls level of output reporting. Default is True.
engine – DataFrame engine: 'pandas' or 'polars'.
backend – Pandas backend: 'numpy_nullable' (or 'n'), 'pyarrow' (or 'a'), or 'original' (or 'o').
md_path – Path to metadata for indata, if any.
**kwargs – Additional keyword arguments passed to verify_df.

Returns:

Verification results.

Return type:

PandasVerification
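
The three functions above are typically used together: constraints are discovered from a reference dataset and then used to verify (or detect failures in) new data. The following is a minimal sketch of that workflow, assuming tdda is installed and using only the parameters and attributes documented in this section. The function name and the file names in the comment are hypothetical, and the import is deferred so that the sketch can be loaded even where tdda is absent.

```python
def discover_then_verify(reference_csv, new_csv, tdda_path):
    """Discover constraints on one file, then verify another against them.

    Hypothetical helper: it relies only on the discover() and verify()
    signatures documented above.
    """
    # Imported lazily so this sketch can be loaded without tdda installed.
    from tdda.constraints.api import discover, verify

    # Write the discovered constraints to tdda_path (a JSON .tdda file).
    discover(reference_csv, constraints_path=tdda_path, verbose=False)

    # Verify the new data against the constraints just written.
    v = verify(new_csv, tdda_path, verbose=False)
    return v.passes, v.failures


# Possible usage (file names are placeholders):
#     passes, failures = discover_then_verify(
#         'reference.csv', 'new.csv', 'reference.tdda')
```

Passing the path of the written .tdda file to verify keeps the two steps decoupled, so the same constraints file can be reused against later datasets.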

Basic API for Database Tables

TDDA constraint discovery and verification is provided for a number of DB-API (PEP-0249) compliant databases, and also for a number of other (NoSQL) databases.

The top-level functions are:

tdda.constraints.discover_db_table():
Discover constraints from a single database table.

tdda.constraints.verify_db_table():
Verify (check) a single database table, against a set of previously discovered constraints.

tdda.constraints.detect_db_table():
Detect failing records in a single database table. This is not yet implemented for databases.

tdda.constraints.db.constraints.detect_db_table(dbtype, dbc, tablename, constraints_path, destination, epsilon=None, type_checking='strict', testing=False, **kwargs)

Detect records in the database table that fail any of the constraints in the .tdda file provided.

Parameters:

dbtype – Database type (e.g. 'postgres', 'mysql').
dbc – Database connection object.
tablename – Name of the table to check.
constraints_path – Path to a JSON .tdda file, or an in-memory DatasetConstraints object.
destination – Destination for output records.
epsilon –
Tolerance for min/max constraint checks, as a proportion of the constraint value. For example, 0.01 allows values up to 1% larger than a max constraint without generating a failure, and minimum values can be up to 1% smaller than the minimum constraint value without generating a failure. (These are modified, as appropriate, for negative values.)

If not specified, an epsilon of 0 is used, so there is no tolerance.

NOTE: A consequence of the fact that these are proportionate is that min/max values of zero do not have any tolerance, i.e. the wrong sign always generates a failure.
type_checking – 'strict', 'sloppy', or 'loose' ('loose' and 'sloppy' are equivalent). Defaults to 'strict' for databases. With 'sloppy'/'loose', a database real column may satisfy an int type constraint.
testing – If True, suppresses type-compatibility warnings. Default is False.
**kwargs – Additional keyword arguments.

Returns:

Detection results.

Return type:

DatabaseVerification
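
The proportional tolerance described for epsilon can be illustrated with a short sketch. This is not TDDA's implementation: the function names are hypothetical, and using the absolute value of the constraint is an assumption about how the tolerance is "modified, as appropriate, for negative values".

```python
def within_max(value, max_constraint, epsilon=0.0):
    """True if value satisfies a max constraint with proportional tolerance."""
    # The tolerance is a proportion of the constraint's magnitude, so a
    # constraint of zero gets no tolerance at all.
    return value <= max_constraint + abs(max_constraint) * epsilon


def within_min(value, min_constraint, epsilon=0.0):
    """True if value satisfies a min constraint with proportional tolerance."""
    return value >= min_constraint - abs(min_constraint) * epsilon


print(within_max(101, 100, epsilon=0.01))   # 1% above the max is tolerated
print(within_max(102, 100, epsilon=0.01))   # 2% above the max is not
print(within_max(0.001, 0, epsilon=0.5))    # zero constraint: no tolerance
```

Because the tolerance scales with the constraint value, a large epsilon still cannot rescue a value of the wrong sign when the constraint is zero, which matches the note above.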

tdda.constraints.db.constraints.discover_db_table(dbtype, dbc, tablename, inc_rex=False, group_rexes=True, report_path=None, report_formats=None, seed=None, no_md=False, **kw)

Discover constraints characterizing the database table provided.

Examines each column and generates constraints that describe the data. The kinds of constraints potentially generated for each field are:

type: the coarse TDDA type: 'bool', 'int', 'real', 'string', or 'date'.
min: for non-string fields, the minimum value (not generated for all-null columns).
max: for non-string fields, the maximum value (not generated for all-null columns).
min_length: for string fields, the shortest string length.
max_length: for string fields, the longest string length.
sign: if all values in a numeric field have consistent sign, a sign constraint is written with a value chosen from:
- 'positive' — for all values v in field: v > 0
- 'non-negative' — for all values v in field: v >= 0
- 'zero' — for all values v in field: v == 0
- 'non-positive' — for all values v in field: v <= 0
- 'negative' — for all values v in field: v < 0
- 'null' — for all values v in field: v is null
max_nulls: the maximum number of nulls allowed in the field. Set to 0 if the field has no nulls, 1 if it has a single null. Not generated if the field has more than one null.
no_duplicates: for string fields (only, for now), True if every non-null value in the field is distinct. Only generated when all non-null values are unique; otherwise no constraint is written.
allowed_values: for string fields only, if there are MAX_CATEGORIES (currently 20) or fewer distinct values, an allowed-values constraint listing them will be generated.

Regular expression constraints are not (currently) generated for database tables.
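
Some of the discovery rules listed above can be sketched in plain Python. This is an illustrative sketch rather than TDDA's code: the function names are hypothetical, nulls are represented as None, and returning the allowed values in sorted order is an assumption.

```python
MAX_CATEGORIES = 20  # value quoted in the text above


def sign_constraint(values):
    """Return the sign constraint for a numeric field, or None if mixed."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 'null'
    # Test the most specific descriptions first.
    if all(v > 0 for v in non_null):
        return 'positive'
    if all(v < 0 for v in non_null):
        return 'negative'
    if all(v == 0 for v in non_null):
        return 'zero'
    if all(v >= 0 for v in non_null):
        return 'non-negative'
    if all(v <= 0 for v in non_null):
        return 'non-positive'
    return None  # inconsistent sign: no constraint is written


def max_nulls_constraint(values):
    """Return 0 or 1 if the field has that many nulls, else None."""
    n_nulls = sum(1 for v in values if v is None)
    return n_nulls if n_nulls <= 1 else None


def allowed_values_constraint(values):
    """Return the distinct non-null values if there are few enough."""
    distinct = sorted({v for v in values if v is not None})
    return distinct if len(distinct) <= MAX_CATEGORIES else None


print(sign_constraint([0, 2, None]))
print(max_nulls_constraint(['a', None]))
print(allowed_values_constraint(['b', 'a', 'b', None]))
```

In each helper, a return value of None stands for "no constraint of this kind is generated".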

Parameters:

dbtype – Database type (e.g. 'postgres', 'mysql').
dbc – Database connection object.
tablename – Name of the table to discover constraints for.
inc_rex – If True, include regular expression constraints. Default is False.
group_rexes – If True, group regular expression constraints. Default is True.
report_path – Path for reports (extension ignored).
report_formats – List of report formats to write. Options: 'html', 'markdown' (or 'md'), 'text' (or 'txt'), 'yaml', 'json', 'toml'.
seed – Optional random seed.
no_md – If True, suppress the metadata section of the .tdda file. Default is False.
**kw – Additional keyword arguments.

Returns:

Discovered constraints, or None if no constraints were found. The returned object includes a to_json() method, which converts the constraints to JSON for saving as a .tdda constraints file. By convention, such files use a .tdda extension. The constraints file can then be used to check whether other datasets satisfy the same constraints.

Return type:

DatasetConstraints

Example:

import pgdb
from tdda.constraints import discover_db_table

dbspec = 'localhost:databasename:username:password'
tablename = 'schemaname.tablename'
db = pgdb.connect(dbspec)
constraints = discover_db_table('postgres', db, tablename)

with open('myconstraints.tdda', 'w') as f:
    f.write(constraints.to_json())

tdda.constraints.db.constraints.verify_db_table(dbtype, db, tablename, constraints_path, epsilon=None, type_checking='strict', testing=False, report='all', **kwargs)

Verify that the database table satisfies the constraints in the .tdda file provided.

Parameters:

dbtype – Database type (e.g. 'postgres', 'mysql').
db – Database connection object.
tablename – Name of the table to verify.
constraints_path – Path to a JSON .tdda file, or an in-memory DatasetConstraints object.
epsilon –
Tolerance for min/max constraint checks, as a proportion of the constraint value. For example, 0.01 allows values up to 1% larger than a max constraint without generating a failure, and minimum values can be up to 1% smaller than the minimum constraint value without generating a failure. (These are modified, as appropriate, for negative values.)

If not specified, an epsilon of 0 is used, so there is no tolerance.

NOTE: A consequence of the fact that these are proportionate is that min/max values of zero do not have any tolerance, i.e. the wrong sign always generates a failure.
type_checking – 'strict', 'sloppy', or 'loose' ('loose' and 'sloppy' are equivalent). Defaults to 'strict' for databases. With 'sloppy'/'loose', a database real column may satisfy an int type constraint.
testing – If True, suppresses type-compatibility warnings. Should only be set when running automated tests. Default is False.
report –
'all' or 'fields'. Controls the behaviour of __str__ on the resulting DatabaseVerification object (but not its content).

'all' (the default) means that all fields are shown, together with the verification status of each constraint for that field.

If set to 'fields', only fields for which at least one constraint failed are shown.
**kwargs – Additional keyword arguments.

Returns:

Verification results, with passes and failures attributes giving the number of passing and failing constraints.

Return type:

DatabaseVerification

Example:

import pgdb
from tdda.constraints import verify_db_table

dbspec = 'localhost:databasename:username:password'
tablename = 'schemaname.tablename'
db = pgdb.connect(dbspec)
v = verify_db_table('postgres', db, tablename,
                    'myconstraints.tdda')

print('Constraints passing:', v.passes)
print('Constraints failing: %d\n' % v.failures)
print(str(v))

Basic Constraints Objects

Classes for representing individual constraints.

class tdda.constraints.base.DatasetConstraints(per_field_constraints=None, loadpath=None, no_md=False, allowed_fields=True, required_fields=True)

Constraints discovered for a dataset.

Returned by discover_df, discover_db_table, and related functions. Can also be loaded from a .tdda JSON file.

fields: Per-field constraints, keyed by field name.

n_records: Number of records in the source dataset.

n_selected: Number of records selected (if filtering was applied).

source: Source path or description.

The key method for saving discovered constraints is to_json(), which serializes the constraints to a .tdda JSON string.

initialize_from_dict(in_constraints)

Initializes this object from a dictionary in_constraints. Currently, the only key used from in_constraints is fields.

The value of in_constraints['fields'] is expected to be a dictionary, keyed on field name, whose values are the constraints for that field.

The constraints are keyed on the kind of constraint, and should contain either a single value (a scalar or a list), or a dictionary of keyword arguments for the constraint initializer.
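
As an illustration of the dictionary shape described above, the following sketch builds a dictionary with a single fields key, keyed on field name and then on constraint kind. The field names and values are hypothetical, and only the scalar and list forms are shown; the keyword-argument dictionary form is omitted because its keys are not specified here.

```python
import json

# Hypothetical field names and values, for illustration only.
in_constraints = {
    'fields': {
        'age': {
            'type': 'int',            # scalar value
            'min': 0,
            'max': 120,
            'sign': 'non-negative',
            'max_nulls': 0,
        },
        'status': {
            'type': 'string',
            'min_length': 4,
            'max_length': 8,
            # list value
            'allowed_values': ['active', 'inactive', 'lost'],
        },
    }
}

# The same structure, as it might appear in JSON form.
print(json.dumps(in_constraints, indent=4))
```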

load(path): Builds a DatasetConstraints object from a json file

sort_fields(fields=None)

Sorts the field constraints within the object by field name, alphabetically by default.

If a list of field names is provided, then the fields will appear in that given order (with any additional fields appended at the end).
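
The ordering rule can be sketched as a stand-alone function. This is a hypothetical helper, not the method itself: it assumes that requested names which do not exist are ignored, and that the additional fields are appended in alphabetical order, neither of which is stated above.

```python
def ordered_field_names(existing, fields=None):
    """Return field names in the order described for sort_fields."""
    if fields is None:
        # Default: alphabetical order.
        return sorted(existing)
    # Requested names first (ignoring any that do not exist) ...
    present = set(existing)
    first = [f for f in fields if f in present]
    # ... then the remaining fields; alphabetical order is an assumption.
    listed = set(first)
    rest = sorted(f for f in existing if f not in listed)
    return first + rest


print(ordered_field_names(['b', 'c', 'a']))
print(ordered_field_names(['b', 'c', 'a'], fields=['c']))
```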

to_dict(tddafile=None): Converts the constraints in this object to a dictionary.

to_json(tddafile=None): Converts the constraints in this object to JSON. The resulting JSON is returned.

write_discovery_reports(reports_path, formats): If any discovery reports are specified by the report_formats parameter or by configuration, this writes the report or reports.

class tdda.constraints.base.FieldConstraints(name=None, constraints=None)

Constraints discovered for a single field.

Holds a dictionary of constraints keyed by constraint kind. The constraint kinds potentially present are:

type: coarse TDDA type ('bool', 'int', 'real', 'string', or 'date').
min: minimum value (non-string fields).
max: maximum value (non-string fields).
min_length: shortest string length (string fields).
max_length: longest string length (string fields).
sign: sign constraint ('positive', 'non-negative', 'zero', 'non-positive', 'negative', or 'null').
max_nulls: maximum number of null values allowed.
no_duplicates: True if all non-null values are distinct (string fields).
allowed_values: list of permitted values (string fields with few distinct values).
rex: list of regular expressions that values must match (string fields, if rex discovery is enabled).

name: Field name.

constraints: OrderedDict of constraint objects keyed by kind.

Parameters:

name – Field name, or None if applying to multiple fields.
constraints – List of constraint objects to initialise with.

to_dict_value(raw=False)

Returns a pair consisting of the name supplied, or the stored name, and an ordered dictionary keyed on constraint kind with the value specifying the constraint. For simple constraints, the value is a base type; for more complex constraints with several components, the value will itself be an (ordered) dictionary.

The ordering exists so that the JSON file is written in a sensible order, rather than being a jumbled mess.

class tdda.constraints.base.MultiFieldConstraints(names=None, constraints=None)

Constraints discovered for a group of two or more fields.

Subclass of FieldConstraints for multi-field constraints such as cross-field relationships.

names: Tuple of field names.

constraints: OrderedDict of constraint objects keyed by kind.

Parameters:

names – Field names, or None. Leaving them null can be appropriate if the same constraint is to be used for multiple field groups, though it will not serialize particularly well.
constraints – List of constraint objects to initialise with.

to_dict_value()

Returns a pair consisting of

a comma-separated list of the field names
an ordered dictionary keyed on constraint kind with the value specifying the constraint.

For simple constraints, the value is a base type; for more complex constraints with several components, the value will itself be an (ordered) dictionary.

The ordering exists so that the JSON file is written in a sensible order, rather than being a jumbled mess.

class tdda.constraints.base.Verification(constraints, n_source_records, report='all', ascii=False, detect=False, outpath=None, write_all_records=False, per_constraint=False, output_fields=None, index=False, in_place=False, colour=False, verify_allowed_fields=None, verify_required_fields=None, config=None, **kwargs)

Result of verifying a dataset against a set of constraints.

Returned by verify_df, verify_db_table, and related functions. Also used to represent detection results when anomaly detection is performed.

passes: Number of constraints that passed.

failures: Number of constraints that failed.

fields: Per-field verification results, keyed by field name.

n_source_records: Number of records in the source dataset.

report: Which fields to include in string output: 'all' or 'fields' (only fields with failures).

to_string(colour=None, ascii=None)

Returns string representation of the Verification object.

The format of the string is controlled by the value of the object's report property. If this is set to 'fields', then it reports only those fields that have failures.

to_table(fails, constraints)

Produce the summary table for detection.

Parameters:

fails – dictionary keyed on fieldname for fields with any failures.
constraints – original constraints

write_detection_reports(minimal=True): If any detection reports are specified (by the extension of the output file, by the -r / --report flags, or by configuration), this writes the report or reports.

Pandas Constraints Objects

The tdda.constraints.pd.constraints module provides an implementation of TDDA constraint discovery and verification for Pandas DataFrames.

This allows it to be used for data in CSV files, or for DataFrames read from Parquet files.

The top-level functions are:

tdda.constraints.discover_df:
Discover constraints from a Pandas DataFrame.

tdda.constraints.verify_df:
Verify (check) a Pandas DataFrame, against a set of previously discovered constraints.

tdda.constraints.detect_df:
Detect failing rows in a Pandas DataFrame by verifying it against a set of previously discovered constraints, and generate an output dataset containing information about the input rows that failed any of the constraints.

class tdda.constraints.pd.constraints.PandasDetection(*args, **kwargs)

Detection result for a Pandas DataFrame.

Extends PandasVerification with a detected() method giving access to the detected records as a Pandas DataFrame.

n_passing_records: Number of records that passed all constraints.

n_failing_records: Number of records that failed at least one constraint.

Returned by detect_df.

detected()

Return a DataFrame of detected (failing) records.

Returns:

DataFrame of records that failed at least one constraint, or None if there were no failures and write_all_records was not set.

class tdda.constraints.pd.constraints.PandasVerification(*args, **kwargs)

Verification result for a Pandas DataFrame.

Extends Verification with to_frame() to convert the verification result to a Pandas DataFrame, with columns:

field: field (column) name.
failures: number of failing constraints for the field.
passes: number of passing constraints for the field.
One boolean column per constraint type, with values True (constraint satisfied), False (constraint failed), or np.nan (no constraint of this kind).

Returned by verify_df.
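
The column layout described above can be pictured with a small stand-alone sketch that builds the rows as plain dictionaries. The field names and results are hypothetical, math.nan stands in for np.nan, and the first-seen ordering of the constraint columns is an assumption; the real to_frame() returns a Pandas DataFrame.

```python
import math

NAN = math.nan  # stands in for np.nan: no constraint of this kind

# Hypothetical per-field results: True = satisfied, False = failed.
results = {
    'age': {'type': True, 'min': True, 'max': False},
    'name': {'type': True, 'max_length': True},
}

# One column per constraint kind seen on any field, in first-seen order.
kinds = []
for outcome in results.values():
    for kind in outcome:
        if kind not in kinds:
            kinds.append(kind)

rows = []
for field, outcome in results.items():
    row = {
        'field': field,
        'failures': sum(1 for ok in outcome.values() if not ok),
        'passes': sum(1 for ok in outcome.values() if ok),
    }
    # Fill in NaN where the field has no constraint of a given kind.
    for kind in kinds:
        row[kind] = outcome.get(kind, NAN)
    rows.append(row)

for row in rows:
    print(row)
```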

get_field_stats(field)

Counts the passes and failures across all constraints for the named field, returning them as a PassFailCount object.

Used to calculate number of failing (constrained) values.

to_dataframe(): Converts object to a Pandas DataFrame.

to_frame(): Converts object to a Pandas DataFrame.

Database Constraints Objects

class tdda.constraints.db.constraints.DatabaseVerification(*args, **kwargs)

Verification and detection result for a database table.

Extends Verification for use with database tables. Used for both verification results (from verify_db_table) and detection results (from detect_db_table).

passes: Number of constraints that passed.

failures: Number of constraints that failed.

fields: Per-field verification results, keyed by field name.

n_source_records: Number of records in the source table.

Extension Framework

The tdda command-line utility provides built-in support for constraint discovery and verification for tabular data stored in CSV files, Pandas DataFrames saved in .parquet files, and tables in a variety of different databases.

The utility can be extended to provide support for constraint discovery and verification for other kinds of data, via its Python extension framework.

The framework will automatically use any extension implementations that have been declared using the TDDA_EXTENSIONS environment variable. This should be set to a list of class names, for Python classes that extend the ExtensionBase base class.

The class names in the TDDA_EXTENSIONS environment variable should be colon-separated for Unix systems, or semicolon-separated for Microsoft Windows. To be usable, the classes must be accessible by Python (either by being installed in Python's standard module directory, or by being included in the PYTHONPATH environment variable).

For example:

export TDDA_EXTENSIONS="mytdda.MySpecialExtension"
export PYTHONPATH="/my/python/sources:$PYTHONPATH"

With these in place, the tdda command will include constraint discovery and verification using the MySpecialExtension implementation class provided in the Python file /my/python/sources/mytdda.py.
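
To make the separator convention concrete, here is a minimal sketch of how a separated list of dotted class names might be resolved into classes. It is not the framework's own loader, and load_extension_classes is a hypothetical name. It relies on os.pathsep, which is a colon on Unix and a semicolon on Windows, and uses standard-library classes as stand-ins for real extension classes.

```python
import importlib
import os


def load_extension_classes(spec, sep=os.pathsep):
    """Resolve a separated list of dotted class names into class objects."""
    classes = []
    for name in spec.split(sep):
        name = name.strip()
        if not name:
            continue  # skip empty entries, e.g. from an unset variable
        # Split 'package.module.ClassName' into module and class parts.
        module_name, _, class_name = name.rpartition('.')
        module = importlib.import_module(module_name)
        classes.append(getattr(module, class_name))
    return classes


# Standard-library classes stand in for real extension classes here.
spec = os.pathsep.join(['collections.OrderedDict', 'json.JSONDecoder'])
print(load_extension_classes(spec))
```

In real use the value would come from os.environ.get('TDDA_EXTENSIONS', ''), and the modules named would need to be importable, which is why PYTHONPATH is set in the example above.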

An example of a simple extension is included with the set of standard examples (see tdda examples).

Extension Overview

An extension should provide:

an implementation (subclass) of ExtensionBase, to provide a command-line interface, extending the tdda command to support a particular type of input data.

an implementation (subclass) of BaseConstraintCalculator, to provide methods for computing individual constraint results.

an implementation (subclass) of BaseConstraintDetector, to provide methods for generating detection results.

A typical implementation looks like:

import sys

from tdda.constraints.flags import discover_parser, discover_flags
from tdda.constraints.flags import verify_parser, verify_flags
from tdda.constraints.flags import detect_parser, detect_flags
from tdda.constraints.extension import ExtensionBase
from tdda.constraints.base import DatasetConstraints, Detection
from tdda.constraints.baseconstraints import (BaseConstraintCalculator,
                                              BaseConstraintVerifier,
                                              BaseConstraintDetector,
                                              BaseConstraintDiscoverer)
from tdda.rexpy import rexpy

class MyExtension(ExtensionBase):
    def applicable(self):
        ...

    def help(self, stream=sys.stdout):
        print('...', file=stream)

    def spec(self):
        return '...'

    def discover(self):
        parser = discover_parser()
        parser.add_argument(...)
        params = {}
        flags = discover_flags(parser, self.argv[1:], params)
        data = ...  # get data source from flags
        discoverer = MyConstraintDiscoverer(data, **params)
        constraints = discoverer.discover()
        results = constraints.to_json()
        ...  # write constraints JSON to output file
        return results

    def verify(self):
        parser = verify_parser()
        parser.add_argument(...)
        params = {}
        flags = verify_flags(parser, self.argv[1:], params)
        data = ...  # get data source from flags
        verifier = MyConstraintVerifier(data, **params)
        constraints = DatasetConstraints(loadpath=...)
        results = verifier.verify(constraints)
        return results

    def detect(self):
        parser = detect_parser()
        parser.add_argument(...)
        params = {}
        flags = detect_flags(parser, self.argv[1:], params)
        data = ...  # get data source from flags
        detector = MyConstraintDetector(data, **params)
        constraints = DatasetConstraints(loadpath=...)
        results = detector.detect(constraints)
        return results

Extension API

class tdda.constraints.extension.ExtensionBase(argv, verbose=False)

Base class for tdda command-line extensions.

Subclass this to add support for new data sources to the tdda command. The subclass must implement applicable(), and should implement discover(), verify(), and detect().

Parameters:

argv – List of command-line argument strings (e.g. sys.argv).
verbose – If True, enable verbose output. Default is False.

applicable()

Return True if this extension can handle the given arguments.

For example, an extension for Excel files should return True if any of the argv strings have a .xlsx suffix.
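
The Excel example can be sketched as follows. The class below is a hypothetical stand-in that mirrors the documented constructor but deliberately does not inherit from ExtensionBase, so that it runs without tdda installed; a real extension would subclass ExtensionBase. Matching the suffix case-insensitively is a choice made for this sketch.

```python
class ExcelExtensionSketch:
    """Stand-in showing the shape of applicable(); not a real ExtensionBase."""

    def __init__(self, argv, verbose=False):
        self.argv = argv
        self.verbose = verbose

    def applicable(self):
        # Claim the command if any argument names an .xlsx file.
        return any(arg.lower().endswith('.xlsx') for arg in self.argv)

    def spec(self):
        # Brief one-line description of how to specify the input source.
        return 'an Excel file, e.g. data.xlsx'


print(ExcelExtensionSketch(['tdda', 'discover', 'sales.xlsx']).applicable())
print(ExcelExtensionSketch(['tdda', 'discover', 'sales.csv']).applicable())
```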

detect()

Implement constraint detection.

Read constraints from a .tdda file specified in self.argv, verify them against the specified data, and write detection output. Use self.argv to get the data source, where the detection output should be written, and any detection-specific flags.

discover()

Implement constraint discovery.

Use self.argv to obtain the data source and output path for the discovered constraints.

help(stream=sys.stdout)

Write help text for this extension to stream.

Parameters:

stream – Output stream. Default is sys.stdout.

spec(): Return a brief one-line string describing how to specify the input source.

verify()

Implement constraint verification.

Read constraints from a .tdda file specified in self.argv and verify them against the specified data.