TDDA's Constraints API
Basic API for Pandas DataFrames
- tdda.constraints.api.detect(indata, constraints_path, outpath=None, engine=None, backend=None, **kwargs)
Detect records that fail any of the constraints in the .tdda file provided.
- Parameters:
indata – Path to a data file or a DataFrame to be checked.
constraints_path – Path to a JSON .tdda file, or an in-memory DatasetConstraints object.
outpath – Optional path for output records (CSV or parquet). None for no output.
engine – DataFrame engine: 'pandas' or 'polars'.
backend – Pandas backend: 'numpy_nullable' (or 'n'), 'pyarrow' (or 'a'), or 'original' (or 'o').
**kwargs – Additional keyword arguments passed to detect_df.
- Returns:
Detection results.
- Return type:
- tdda.constraints.api.discover(indata, constraints_path=None, report_path=None, report_formats=None, engine=None, backend=None, verbose=True, **kwargs)
Discover constraints characterizing the data provided.
- Parameters:
indata – Data for which constraints are to be discovered. Can be a path to a data file (CSV, parquet, or other flat file) or a DataFrame (Pandas or Polars).
constraints_path – Path to write discovered constraints to. If None, constraints are not written. If '-', constraints are written to stdout.
report_path – Path for reports (extension ignored). Writes reports to variations of this path if set; otherwise uses constraints_path.
report_formats – List of report formats to write. Options: 'html', 'markdown' (or 'md'), 'text' (or 'txt'), 'yaml', 'json', 'toml'.
engine – DataFrame engine: 'pandas' or 'polars'.
backend – Pandas backend: 'numpy_nullable' (or 'n'), 'pyarrow' (or 'a'), or 'original' (or 'o').
verbose – Controls level of output reporting. Default is True.
**kwargs – Additional keyword arguments passed to discover_df.
- Returns:
Discovered constraints.
- Return type:
- tdda.constraints.api.verify(indata, constraints_path, outdata=None, verbose=True, engine=None, backend=None, md_path=None, **kwargs)
Verify that the data provided satisfies the constraints in the .tdda file provided.
- Parameters:
indata – Path to a data file or a DataFrame to be verified.
constraints_path – Path to a JSON .tdda file, or an in-memory DatasetConstraints object.
outdata – Optional destination for output data.
verbose – Controls level of output reporting. Default is True.
engine – DataFrame engine: 'pandas' or 'polars'.
backend – Pandas backend: 'numpy_nullable' (or 'n'), 'pyarrow' (or 'a'), or 'original' (or 'o').
md_path – Path to metadata for indata, if any.
**kwargs – Additional keyword arguments passed to verify_df.
- Returns:
Verification results.
- Return type:
Basic API for Database Tables
TDDA constraint discovery and verification is provided for a number of DB-API (PEP-0249) compliant databases, and also for a number of other (NoSQL) databases.
The top-level functions are:
tdda.constraints.discover_db_table(): Discover constraints from a single database table.
tdda.constraints.verify_db_table(): Verify (check) a single database table against a set of previously discovered constraints.
tdda.constraints.detect_db_table(): Detect failing records in a single database table. (Detection is not yet implemented for databases.)
- tdda.constraints.db.constraints.detect_db_table(dbtype, dbc, tablename, constraints_path, destination, epsilon=None, type_checking='strict', testing=False, **kwargs)
Detect records in the database table that fail any of the constraints in the .tdda file provided.
- Parameters:
dbtype – Database type (e.g. 'postgres', 'mysql').
dbc – Database connection object.
tablename – Name of the table to check.
constraints_path – Path to a JSON .tdda file, or an in-memory DatasetConstraints object.
destination – Destination for output records.
epsilon – Tolerance for min/max constraint checks, as a proportion of the constraint value. For example, 0.01 allows values up to 1% larger than a max constraint without generating a failure, and minimum values can be up to 1% smaller than the minimum constraint value without generating a failure. (These are modified, as appropriate, for negative values.)
If not specified, an epsilon of 0 is used, so there is no tolerance.
NOTE: A consequence of the fact that these are proportionate is that min/max values of zero do not have any tolerance, i.e. the wrong sign always generates a failure.
type_checking – 'strict', 'sloppy', or 'loose' ('loose' and 'sloppy' are equivalent). Defaults to 'strict' for databases. With 'sloppy'/'loose', a database real column may satisfy an int type constraint.
testing – If True, suppresses type-compatibility warnings. Default is False.
**kwargs – Additional keyword arguments.
- Returns:
Detection results.
- Return type:
DatabaseVerification
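The proportional tolerance described for epsilon can be illustrated with a small standalone sketch. This is not taken from the tdda source: the helper names within_max and within_min are hypothetical, and using the absolute value of the constraint is only one plausible reading of how the tolerance is "modified, as appropriate, for negative values".

```python
def within_max(value, max_constraint, epsilon=0.0):
    """Return True if value exceeds max_constraint by no more than the
    proportional tolerance epsilon (hypothetical helper, not tdda code)."""
    # abs() keeps the tolerance on the permissive side for negative
    # constraint values; this is an assumption about the library's behaviour.
    return value <= max_constraint + epsilon * abs(max_constraint)


def within_min(value, min_constraint, epsilon=0.0):
    """Return True if value falls below min_constraint by no more than the
    proportional tolerance epsilon (hypothetical helper, not tdda code)."""
    return value >= min_constraint - epsilon * abs(min_constraint)


# With epsilon=0.01, a max constraint of 100 tolerates values up to about 101.
print(within_max(100.5, 100, epsilon=0.01))
print(within_max(101.5, 100, epsilon=0.01))
# A constraint value of zero gets no tolerance, whatever epsilon is.
print(within_max(0.0001, 0, epsilon=0.5))
```

Because the tolerance is a multiple of the constraint value, it vanishes when the constraint is zero, which is the behaviour the NOTE above describes.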
- tdda.constraints.db.constraints.discover_db_table(dbtype, dbc, tablename, inc_rex=False, group_rexes=True, report_path=None, report_formats=None, seed=None, no_md=False, **kw)
Discover constraints characterizing the database table provided.
Examines each column and generates constraints that describe the data. The kinds of constraints potentially generated for each field are:
type: the coarse TDDA type: 'bool', 'int', 'real', 'string', or 'date'.
min: for non-string fields, the minimum value (not generated for all-null columns).
max: for non-string fields, the maximum value (not generated for all-null columns).
min_length: for string fields, the shortest string length.
max_length: for string fields, the longest string length.
sign: if all values in a numeric field have consistent sign, a sign constraint is written with a value chosen from:
'positive' – for all values v in field: v > 0
'non-negative' – for all values v in field: v >= 0
'zero' – for all values v in field: v == 0
'non-positive' – for all values v in field: v <= 0
'negative' – for all values v in field: v < 0
'null' – for all values v in field: v is null
max_nulls: the maximum number of nulls allowed in the field. Set to 0 if the field has no nulls, 1 if it has a single null. Not generated if the field has more than one null.
no_duplicates: for string fields (only, for now), True if every non-null value in the field is distinct. Only generated when all non-null values are unique; otherwise no constraint is written.
allowed_values: for string fields only, if there are MAX_CATEGORIES (currently 20) or fewer distinct values, an allowed-values constraint listing them will be generated.
Regular expression constraints are not (currently) generated for database tables.
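The sign and max_nulls rules listed above can be sketched in plain Python. This is an illustration only, not the library's implementation: the function names are hypothetical, None stands in for a database null, and two points are assumptions, namely that nulls are ignored when classifying the sign and that the most specific matching value is chosen when several apply.

```python
def discover_sign(values):
    """Return the sign constraint value for a numeric field, or None if the
    values have mixed signs (hypothetical helper, not tdda code)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 'null'            # every value is null
    if all(v == 0 for v in non_null):
        return 'zero'
    if all(v > 0 for v in non_null):
        return 'positive'
    if all(v >= 0 for v in non_null):
        return 'non-negative'
    if all(v < 0 for v in non_null):
        return 'negative'
    if all(v <= 0 for v in non_null):
        return 'non-positive'
    return None                  # mixed signs: no sign constraint


def discover_max_nulls(values):
    """Return 0 or 1 when the field has that many nulls; return None
    (no constraint generated) when it has more than one."""
    n_nulls = sum(1 for v in values if v is None)
    return n_nulls if n_nulls <= 1 else None


print(discover_sign([0, 3, 7]), discover_max_nulls([0, 3, 7]))
print(discover_sign([-2, 5]), discover_max_nulls([None, None, 1]))
```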
- Parameters:
dbtype – Database type (e.g. 'postgres', 'mysql').
dbc – Database connection object.
tablename – Name of the table to discover constraints for.
inc_rex – If True, include regular expression constraints. Default is False.
group_rexes – If True, group regular expression constraints. Default is True.
report_path – Path for reports (extension ignored).
report_formats – List of report formats to write. Options: 'html', 'markdown' (or 'md'), 'text' (or 'txt'), 'yaml', 'json', 'toml'.
seed – Optional random seed.
no_md – If True, suppress the metadata section of the .tdda file. Default is False.
**kw – Additional keyword arguments.
- Returns:
Discovered constraints, or None if no constraints were found. The returned object includes a to_json() method, which converts the constraints to JSON for saving as a .tdda constraints file. By convention, such files use a .tdda extension. The constraints file can then be used to check whether other datasets satisfy the same constraints.
- Return type:
Example:
import pgdb
from tdda.constraints import discover_db_table
dbspec = 'localhost:databasename:username:password'
tablename = 'schemaname.tablename'
db = pgdb.connect(dbspec)
constraints = discover_db_table('postgres', db, tablename)
with open('myconstraints.tdda', 'w') as f:
    f.write(constraints.to_json())
- tdda.constraints.db.constraints.verify_db_table(dbtype, db, tablename, constraints_path, epsilon=None, type_checking='strict', testing=False, report='all', **kwargs)
Verify that the database table satisfies the constraints in the .tdda file provided.
- Parameters:
dbtype – Database type (e.g. 'postgres', 'mysql').
db – Database connection object.
tablename – Name of the table to verify.
constraints_path – Path to a JSON .tdda file, or an in-memory DatasetConstraints object.
epsilon – Tolerance for min/max constraint checks, as a proportion of the constraint value. For example, 0.01 allows values up to 1% larger than a max constraint without generating a failure, and minimum values can be up to 1% smaller than the minimum constraint value without generating a failure. (These are modified, as appropriate, for negative values.)
If not specified, an epsilon of 0 is used, so there is no tolerance.
NOTE: A consequence of the fact that these are proportionate is that min/max values of zero do not have any tolerance, i.e. the wrong sign always generates a failure.
type_checking – 'strict', 'sloppy', or 'loose' ('loose' and 'sloppy' are equivalent). Defaults to 'strict' for databases. With 'sloppy'/'loose', a database real column may satisfy an int type constraint.
testing – If True, suppresses type-compatibility warnings. Should only be set when running automated tests. Default is False.
report – 'all' or 'fields'. Controls the behaviour of __str__ on the resulting DatabaseVerification object (but not its content).
'all' (the default) means that all fields are shown, together with the verification status of each constraint for that field.
If set to 'fields', only fields for which at least one constraint failed are shown.
**kwargs – Additional keyword arguments.
- Returns:
Verification results, with passes and failures attributes giving the number of passing and failing constraints.
- Return type:
DatabaseVerification
Example:
import pgdb
from tdda.constraints import verify_db_table
dbspec = 'localhost:databasename:username:password'
tablename = 'schemaname.tablename'
db = pgdb.connect(dbspec)
v = verify_db_table('postgres', db, tablename, 'myconstraints.tdda')
print('Constraints passing:', v.passes)
print('Constraints failing: %d\n' % v.failures)
print(str(v))
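The difference between the type_checking modes can be shown with a minimal sketch. It encodes only the one relaxation documented for verify_db_table (a real column satisfying an int constraint under 'sloppy' or 'loose'); the function name type_satisfies is hypothetical, and the library may apply further rules that are not modelled here.

```python
def type_satisfies(column_type, constraint_type, type_checking='strict'):
    """Return True if a column of column_type satisfies a type constraint of
    constraint_type (hypothetical helper, not tdda code)."""
    if type_checking not in ('strict', 'sloppy', 'loose'):
        raise ValueError('unknown type_checking: %r' % type_checking)
    if column_type == constraint_type:
        return True
    if type_checking in ('sloppy', 'loose'):
        # The only relaxation modelled: a real column may satisfy an int
        # type constraint.
        return column_type == 'real' and constraint_type == 'int'
    return False


print(type_satisfies('real', 'int'))                          # strict
print(type_satisfies('real', 'int', type_checking='sloppy'))
```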
Basic Constraints Objects
Classes for representing individual constraints.
- class tdda.constraints.base.DatasetConstraints(per_field_constraints=None, loadpath=None, no_md=False, allowed_fields=True, required_fields=True)
Constraints discovered for a dataset.
Returned by discover_df, discover_db_table, and related functions. Can also be loaded from a .tdda JSON file.
- fields
Per-field constraints, keyed by field name.
- n_records
Number of records in the source dataset.
- n_selected
Number of records selected (if filtering was applied).
- source
Source path or description.
The key method for saving discovered constraints is to_json(), which serializes the constraints to a .tdda JSON string.
- initialize_from_dict(in_constraints)
Initializes this object from a dictionary in_constraints. Currently, the only key used from in_constraints is fields.
The value of in_constraints['fields'] is expected to be a dictionary, keyed on field name, whose values are the constraints for that field.
The constraints are keyed on the kind of constraint, and each should contain either a single value (a scalar or a list) or a dictionary of keyword arguments for the constraint initializer.
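A dictionary of the shape described for in_constraints might look like the following. The field names ('age', 'status') and all values are invented for illustration, and the sketch uses only the standard json module rather than tdda itself, so it shows the expected layout without claiming to reproduce what the library writes.

```python
import json

# Top-level 'fields' key; each field maps constraint kinds to their values.
in_constraints = {
    'fields': {
        'age': {                      # hypothetical numeric field
            'type': 'int',
            'min': 0,
            'max': 120,
            'sign': 'non-negative',
            'max_nulls': 0,
        },
        'status': {                   # hypothetical string field
            'type': 'string',
            'min_length': 6,
            'max_length': 8,
            'allowed_values': ['active', 'inactive'],   # a list value
        },
    }
}

# Round-trip through JSON, as would happen when saving and reloading a file.
text = json.dumps(in_constraints, indent=4)
loaded = json.loads(text)

for field_name, field_constraints in loaded['fields'].items():
    for kind, value in field_constraints.items():
        print('%s: %s = %r' % (field_name, kind, value))
```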
- load(path)
Builds a DatasetConstraints object from a JSON file.
- sort_fields(fields=None)
Sorts the field constraints within the object. By default, fields are sorted alphabetically by name.
If a list of field names is provided, then the fields will appear in that given order (with any additional fields appended at the end).
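The ordering rule can be sketched as a standalone function over a plain dictionary. It is not the method's actual code: the name sorted_field_order is hypothetical, and keeping unlisted fields in their existing relative order is an assumption, since the text only says they are appended at the end.

```python
from collections import OrderedDict


def sorted_field_order(field_constraints, fields=None):
    """Return an OrderedDict of field_constraints in the requested order
    (hypothetical helper, not tdda code)."""
    if fields is None:
        order = sorted(field_constraints)        # alphabetical by default
    else:
        listed = [name for name in fields if name in field_constraints]
        listed_set = set(listed)
        # Assumption: fields not named keep their existing relative order.
        extras = [name for name in field_constraints
                  if name not in listed_set]
        order = listed + extras
    return OrderedDict((name, field_constraints[name]) for name in order)


constraints = {'b': {'type': 'int'}, 'a': {'type': 'string'},
               'c': {'type': 'real'}}
print(list(sorted_field_order(constraints)))
print(list(sorted_field_order(constraints, fields=['c', 'a'])))
```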
- to_dict(tddafile=None)
Converts the constraints in this object to a dictionary.
- to_json(tddafile=None)
Converts the constraints in this object to JSON. The resulting JSON is returned.
- write_discovery_reports(reports_path, formats)
If any discovery reports are specified, by the formats parameter or by configuration, this writes the report or reports.
- class tdda.constraints.base.FieldConstraints(name=None, constraints=None)
Constraints discovered for a single field.
Holds a dictionary of constraints keyed by constraint kind. The constraint kinds potentially present are:
type: coarse TDDA type ('bool', 'int', 'real', 'string', or 'date').
min: minimum value (non-string fields).
max: maximum value (non-string fields).
min_length: shortest string length (string fields).
max_length: longest string length (string fields).
sign: sign constraint ('positive', 'non-negative', 'zero', 'non-positive', 'negative', or 'null').
max_nulls: maximum number of null values allowed.
no_duplicates: True if all non-null values are distinct (string fields).
allowed_values: list of permitted values (string fields with few distinct values).
rex: list of regular expressions that values must match (string fields, if rex discovery is enabled).
- name
Field name.
- constraints
OrderedDict of constraint objects keyed by kind.
- Parameters:
name – Field name, or None if applying to multiple fields.
constraints – List of constraint objects to initialise with.
- to_dict_value(raw=False)
Returns a pair consisting of the name supplied, or the stored name, and an ordered dictionary keyed on constraint kind with the value specifying the constraint. For simple constraints, the value is a base type; for more complex constraints with several components, the value will itself be an (ordered) dictionary.
The ordering is all to make the JSON file get written in a sensible order, rather than being a jumbled mess.
- class tdda.constraints.base.MultiFieldConstraints(names=None, constraints=None)
Constraints discovered for a group of two or more fields.
Subclass of FieldConstraints for multi-field constraints such as cross-field relationships.
- names
Tuple of field names.
- constraints
OrderedDict of constraint objects keyed by kind.
- Parameters:
names – Field names, or None. Leaving them null can be appropriate if the same constraint is to be used for multiple field groups, though it will not serialize particularly well.
constraints – List of constraint objects to initialise with.
- to_dict_value()
Returns a pair consisting of:
a comma-separated list of the field names
an ordered dictionary keyed on constraint kind with the value specifying the constraint.
For simple constraints, the value is a base type; for more complex constraints with several components, the value will itself be an (ordered) dictionary.
The ordering is all to make the JSON file get written in a sensible order, rather than being a jumbled mess.
- class tdda.constraints.base.Verification(constraints, n_source_records, report='all', ascii=False, detect=False, outpath=None, write_all_records=False, per_constraint=False, output_fields=None, index=False, in_place=False, colour=False, verify_allowed_fields=None, verify_required_fields=None, config=None, **kwargs)
Result of verifying a dataset against a set of constraints.
Returned by verify_df, verify_db_table, and related functions. Also used to represent detection results when anomaly detection is performed.
- passes
Number of constraints that passed.
- failures
Number of constraints that failed.
- fields
Per-field verification results, keyed by field name.
- n_source_records
Number of records in the source dataset.
- report
Which fields to include in string output: 'all' or 'fields' (only fields with failures).
- to_string(colour=None, ascii=None)
Returns string representation of the Verification object.
The format of the string is controlled by the value of the object's report property. If this is set to 'fields', then it reports only those fields that have failures.
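How the report setting narrows the output, and how passes and failures relate to per-field results, can be sketched without the library. The data layout (a dictionary of field name to per-constraint booleans), the field names, and the function name summarize are all assumptions made for this illustration; the real Verification object's internals may differ.

```python
def summarize(field_results, report='all'):
    """Return (passes, failures, fields_shown) for per-field results that
    map constraint kind to True (passed) or False (failed).
    Hypothetical helper, not tdda code."""
    passes = sum(ok for results in field_results.values()
                 for ok in results.values())
    failures = sum(not ok for results in field_results.values()
                   for ok in results.values())
    if report == 'all':
        shown = list(field_results)
    elif report == 'fields':
        # Only fields with at least one failing constraint.
        shown = [name for name, results in field_results.items()
                 if not all(results.values())]
    else:
        raise ValueError('report must be "all" or "fields"')
    return passes, failures, shown


results = {
    'age': {'type': True, 'min': True, 'max': False},
    'name': {'type': True, 'max_length': True},
}
print(summarize(results, report='all'))
print(summarize(results, report='fields'))
```

Note that the report setting changes only which fields are listed; the pass and failure counts are the same in both cases.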
- to_table(fails, constraints)
Produce the summary table for detection.
- Parameters:
fails – dictionary keyed on fieldname for fields with any failures.
constraints – original constraints
- write_detection_reports(minimal=True)
If any detection reports are specified (by the extension of the output file, by the -r / --report flags, or by configuration), this writes the report or reports.
Pandas Constraints Objects
The tdda.constraints.pd.constraints module provides an
implementation of TDDA constraint discovery and verification
for Pandas DataFrames.
This allows it to be used for data in CSV files, or for DataFrames read from Parquet files.
The top-level functions are:
tdda.constraints.discover_df: Discover constraints from a Pandas DataFrame.
tdda.constraints.verify_df: Verify (check) a Pandas DataFrame against a set of previously discovered constraints.
tdda.constraints.detect_df: Detect rows in a Pandas DataFrame that fail any of a set of previously discovered constraints, and generate an output dataset containing information about the failing rows.
- class tdda.constraints.pd.constraints.PandasDetection(*args, **kwargs)
Detection result for a Pandas DataFrame.
Extends PandasVerification with a detected() method giving access to the detected records as a Pandas DataFrame.
- n_passing_records
Number of records that passed all constraints.
- n_failing_records
Number of records that failed at least one constraint.
Returned by detect_df.
- detected()
Return a DataFrame of detected (failing) records.
- Returns:
DataFrame of records that failed at least one constraint, or None if there were no failures and write_all_records was not set.
- class tdda.constraints.pd.constraints.PandasVerification(*args, **kwargs)
Verification result for a Pandas DataFrame.
Extends Verification with to_frame() to convert the verification result to a Pandas DataFrame, with columns:
field: field (column) name.
failures: number of failing constraints for the field.
passes: number of passing constraints for the field.
One boolean column per constraint type, with values True (constraint satisfied), False (constraint failed), or np.nan (no constraint of this kind).
Returned by verify_df.
- get_field_stats(field)
Returns the number of passes and failures across all constraints for the specified field (name), as a PassFailCount object.
Used to calculate number of failing (constrained) values.
- to_dataframe()
Converts object to a Pandas DataFrame.
- to_frame()
Converts object to a Pandas DataFrame.
Database Constraints Objects
- class tdda.constraints.db.constraints.DatabaseVerification(*args, **kwargs)
Verification and detection result for a database table.
Extends Verification for use with database tables. Used for both verification results (from verify_db_table) and detection results (from detect_db_table).
- passes
Number of constraints that passed.
- failures
Number of constraints that failed.
- fields
Per-field verification results, keyed by field name.
- n_source_records
Number of records in the source table.
Extension Framework
The tdda command-line utility provides built-in support for constraint
discovery and verification for tabular data stored in CSV files, Pandas
DataFrames saved in .parquet files, and tables in a variety of
different databases.
The utility can be extended to provide support for constraint discovery and verification for other kinds of data, via its Python extension framework.
The framework will automatically use any extension implementations that
have been declared using the TDDA_EXTENSIONS environment variable. This
should be set to a list of class names, for Python classes that extend the
ExtensionBase base class.
The class names in the TDDA_EXTENSIONS environment variable should be
colon-separated for Unix systems, or semicolon-separated for Microsoft
Windows. To be usable, the classes must be accessible by Python (either
by being installed in Python's standard module directory, or by being
included in the PYTHONPATH environment variable).
For example:
export TDDA_EXTENSIONS="mytdda.MySpecialExtension"
export PYTHONPATH="/my/python/sources:$PYTHONPATH"
With these in place, the tdda command will include constraint discovery
and verification using the MySpecialExtension implementation class
provided in the Python file /my/python/sources/mytdda.py.
An example of a simple extension is included with the set of standard
examples (see tdda examples).
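A minimal sketch of how a TDDA_EXTENSIONS value could be split and resolved to classes is given here. It is not the framework's own loader: the function names are hypothetical, and the use of os.pathsep is an assumption, chosen because it is ':' on Unix and ';' on Windows, which matches the separators described above.

```python
import importlib
import os


def extension_class_names(environ=None, sep=os.pathsep):
    """Split the TDDA_EXTENSIONS value into qualified class names
    (hypothetical helper, not tdda code)."""
    environ = os.environ if environ is None else environ
    raw = environ.get('TDDA_EXTENSIONS', '')
    return [name for name in raw.split(sep) if name]


def load_class(qualified_name):
    """Import 'package.module.ClassName' and return the named attribute.
    The module must be importable, e.g. via PYTHONPATH."""
    module_name, _, class_name = qualified_name.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Parsing a Unix-style value; the second class name is invented.
env = {'TDDA_EXTENSIONS': 'mytdda.MySpecialExtension:other.AnotherExtension'}
print(extension_class_names(env, sep=':'))
```

Calling load_class('mytdda.MySpecialExtension') would then succeed only if mytdda.py is on the import path, which is what the PYTHONPATH setting in the example above arranges.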
Extension Overview
An extension should provide:
an implementation (subclass) of ExtensionBase, to provide a command-line interface, extending the tdda command to support a particular type of input data.
an implementation (subclass) of BaseConstraintCalculator, to provide methods for computing individual constraint results.
an implementation (subclass) of BaseConstraintDetector, to provide methods for generating detection results.
A typical implementation looks like:
import sys

from tdda.constraints.flags import discover_parser, discover_flags
from tdda.constraints.flags import verify_parser, verify_flags
from tdda.constraints.flags import detect_parser, detect_flags
from tdda.constraints.extension import ExtensionBase
from tdda.constraints.base import DatasetConstraints, Detection
from tdda.constraints.baseconstraints import (BaseConstraintCalculator,
                                              BaseConstraintVerifier,
                                              BaseConstraintDetector,
                                              BaseConstraintDiscoverer)
from tdda.rexpy import rexpy

class MyExtension(ExtensionBase):
    def applicable(self):
        ...

    def help(self, stream=sys.stdout):
        print('...', file=stream)

    def spec(self):
        return '...'

    def discover(self):
        parser = discover_parser()
        parser.add_argument(...)
        params = {}
        flags = discover_flags(parser, self.argv[1:], params)
        data = ...  # get data source from flags
        discoverer = MyConstraintDiscoverer(data, **params)
        constraints = discoverer.discover()
        results = constraints.to_json()
        ...  # write constraints JSON to output file
        return results

    def verify(self):
        parser = verify_parser()
        parser.add_argument(...)
        params = {}
        flags = verify_flags(parser, self.argv[1:], params)
        data = ...  # get data source from flags
        verifier = MyConstraintVerifier(data, **params)
        constraints = DatasetConstraints(loadpath=...)
        results = verifier.verify(constraints)
        return results

    def detect(self):
        parser = detect_parser()
        parser.add_argument(...)
        params = {}
        flags = detect_flags(parser, self.argv[1:], params)
        data = ...  # get data source from flags
        detector = MyConstraintDetector(data, **params)
        constraints = DatasetConstraints(loadpath=...)
        results = detector.detect(constraints)
        return results
Extension API
- class tdda.constraints.extension.ExtensionBase(argv, verbose=False)
Base class for tdda command-line extensions.
Subclass this to add support for new data sources to the tdda command. The subclass must implement applicable(), and should implement discover(), verify(), and detect().
- Parameters:
argv – List of command-line argument strings (e.g. sys.argv).
verbose – If True, enable verbose output. Default is False.
- applicable()
Return True if this extension can handle the given arguments.
For example, an extension for Excel files should return True if any of the argv strings have a .xlsx suffix.
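The Excel case mentioned for applicable() could be sketched as follows. To keep the example runnable without tdda installed, the class here is a stand-in that only mimics the argv and verbose constructor arguments; a real extension would subclass ExtensionBase instead. The class name and the file names are invented.

```python
class ExcelExtensionSketch:
    """Stand-in for an ExtensionBase subclass (hypothetical, not tdda code)."""

    def __init__(self, argv, verbose=False):
        self.argv = argv
        self.verbose = verbose

    def applicable(self):
        # Claim the command line if any argument names an .xlsx file.
        return any(arg.endswith('.xlsx') for arg in self.argv)


print(ExcelExtensionSketch(['tdda', 'discover', 'sales.xlsx']).applicable())
print(ExcelExtensionSketch(['tdda', 'discover', 'sales.csv']).applicable())
```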
- detect()
Implement constraint detection.
Read constraints from a .tdda file specified in self.argv, verify them against the specified data, and write detection output. Use self.argv to get the data source, where the detection output should be written, and any detection-specific flags.
- discover()
Implement constraint discovery.
Use self.argv to obtain the data source and output path for the discovered constraints.
- help(stream=sys.stdout)
Write help text for this extension to stream.
- Parameters:
stream – Output stream. Default is sys.stdout.
- spec()
Return a brief one-line string describing how to specify the input source.
- verify()
Implement constraint verification.
Read constraints from a .tdda file specified in self.argv and verify them against the specified data.