Constraints¶
The constraints
module provides support for
constraint generation, verification and anomaly detection for datasets,
including CSV files and Pandas DataFrames.
The module includes:
- A Command-line Tool for discovering constraints in data
from various sources, and for verifying data against those constraints,
using the
.tdda
TDDA JSON file format. - A Python library
constraints
containing classes that implement constraint discovery and validation, for use from within other Python programs. - Python implementations of constraint discovery, verification and
and anomaly detection for a number of data sources:
- CSV files
- Pandas and R DataFrames saved as
.feather
files - PostgreSQL database tables (
postgres:
) - MySQL database tables (
mysql:
) - SQLite database tables (
sqlite:
) - MongoDB document collections (
mongodb
)
Python Prerequisites¶
numpy
andpandas
(required for CSV files andfeather
files)feather-format
(required forfeather
files)pygresql
(required for PostgreSQL database tables)mysql
(required for MySQL database tables)
These can be installed with (some/all of):
pip install numpy
pip install pandas
pip install feather-format
pip install pygresql
pip install mysql-python
The sqlite3
module is provided by default as part of the
standard Python libraries, so SQLite database tables can be used without
having to explicitly install it.
To install feather-format
on Windows, you will need to install
cython
as a prerequisite, which might also require you to install
the Microsoft Visual C++ compiler for python, from http://aka.ms/vcpython27.
Command-line Tool¶
The tdda
command-line utility provides a tool for discovering constraints
in data and saving them as a .tdda
file using the
TDDA JSON file format, and also for verifying constraints in
data against a previously prepared .tdda
file.
It also provides some other functionality to help with using the tool. It takes commands in the following forms:
- tdda discover to perform constraint discovery.
- tdda verify to verify data against constraints.
- tdda detect to detect anomalies in data by checking constraints.
tdda examples
to copy example data and code where you can see them.tdda help
to show help on how to use the tool.tdda test
to run the TDDA library’s internal tests.
See Examples for more detail on the code and data
examples that are included as part of the tdda
package.
See Tests for more detail on the tdda
package’s own tests,
used to test that the package is installed and configured correctly.
tdda discover¶
Discover TDDA constraints for data from various sources, and save the generated constraints as a TDDA JSON file format file.
Usage:
tdda discover [FLAGS] input [constraints.tdda]
where
- input is one of:
- a CSV file
- a
-
, meaning it will read a csv file from standard input- a
feather
file containing a DataFrame, with extension.feather
- a database table
- constraints.tdda, if provided, specifies the name of a file to which the generated constraints will be written.
If no constraints output file is provided, or is -
, the generated constraints
are written to standard output.
Optional flags are:
-r
or--rex
to include regular expression generation-R
or--norex
to exclude regular expression generation
See Constraints for CSV Files and Pandas DataFrames for details of how a CSV file is read.
See Constraints for Databases for details of how database tables are accessed.
tdda verify¶
Verify data from various sources, against constraints from a TDDA JSON file format constraints file.
Usage:
tdda verify [FLAGS] input [constraints.tdda]
where:
- input is one of:
- a csv file
- a
-
, meaning it will read a csv file from standard input - a
feather
file containing a DataFrame, with extension.feather
- a database table
- constraints.tdda, if provided, is a JSON .tdda file constaining constraints.
If no constraints file is provided and the input is a CSV or feather file, a constraints file with the same path as the input file, but with a .tdda extension, will be used.
For database tables, the constraints file parameter is mandatory.
Optional flags are:
-a
,--all
- Report all fields, even if there are no failures
-f
,--fields
- Report only fields with failures
-7
,--ascii
- Report in ASCII form, without using special characters.
--epsilon E
- Use this value of epsilon for fuzziness in comparing numeric values.
--type_checking strict|sloppy
- By default, type-checking is sloppy, meaning that when checking type
constraints, all numeric types are considered to be equivalent. With
strict typing,
int
is considered different fromreal
.
See Constraints for CSV Files and Pandas DataFrames for details of how a CSV file is read.
See Constraints for Databases for details of how database tables are accessed.
tdda detect¶
Detect anomalies on data from various sources, by checking against constraints from a TDDA JSON file format constraints file.
Usage:
tdda detect [FLAGS] input constraints.tdda output
where:
- input is one of:
- a csv file name
- a
-
, meaning it will read a csv file from standard input - a
feather
file containing a DataFrame, with extension.feather
- a database table
- constraints.tdda, is a JSON .tdda file constaining constraints.
- output is one of:
- a csv file to be created containing failing records
- a
-
, meaning it will write the csv file containing failing records to standard output - a
feather
file with extension.feather
, to be created containing a DataFrame of failing records
If no constraints file is provided and the input is a CSV or feather file, a constraints file with the same path as the input file, but with a .tdda extension, will be used.
Optional flags are:
-a
,--all
- Report all fields, even if there are no failures
-f
,--fields
- Report only fields with failures
-7
,--ascii
- Report in ASCII form, without using special characters.
--epsilon E
- Use this value of epsilon for fuzziness in comparing numeric values.
--type_checking strict|sloppy
- By default, type-checking is sloppy, meaning that when checking type
constraints, all numeric types are considered to be equivalent. With
strict typing,
int
is considered different fromreal
.
--write-all
- Include passing records in the output.
--per-constraint
- Write one column per failing constraint, as well as the
n_failures
total column for each row.
--output-fields FIELD1 FIELD2 ...
- Specify original columns to write out. If used with no field names, all original columns will be included.
--index
- Include a row-number index in the output file. The row number is automatically included if no output fields are specified. Rows are usually numbered from 1, unless the (feather) input file already has an index.
If no records fail any of the constraints, then no output file is created (and if the output file already exists, it is deleted).
See Constraints for CSV Files and Pandas DataFrames for details of how a CSV file is read.
See Constraints for Databases for details of how database tables are accessed.
Constraints for CSV Files and Pandas DataFrames¶
The tdda.constraints.pd.constraints
module provides an
implementation of TDDA constraint discovery and verification
for Pandas DataFrames.
This allows it to be used for data in CSV files, or for Pandas or R DataFrames saved as Feather files.
The top-level functions are:
discover_df()
:- Discover constraints from a Pandas DataFrame.
verify_df()
:- Verify (check) a Pandas DataFrame, against a set of previously discovered constraints.
detect_df()
:- Verify (check) a Pandas DataFrame, against a set of previously discovered constraints, and generate an output dataset containing information about input rows which failed any of the constraints.
-
tdda.constraints.pd.constraints.
discover_df
(df, inc_rex=False, df_path=None)¶ Automatically discover potentially useful constraints that characterize the Pandas DataFrame provided.
Input:
- df:
- any Pandas DataFrame.
- inc_rex:
- If
True
, include discovery of regular expressions for string fields, using rexpy (default:False
). - df_path:
- The path from which the dataframe was loaded, if any.
Possible return values:
DatasetConstraints
objectNone
— (if no constraints were found).
This function goes through each column in the DataFrame and, where appropriate, generates constraints that describe (and are satisified by) this dataframe.
Assuming it generates at least one constraint for at least one field it returns a
tdda.constraints.base.DatasetConstraints
object.This includes a
fields
attribute, keyed on the column name.The returned
DatasetConstraints
object includes ato_json()
method, which converts the constraints into JSON for saving as a tdda constraints file. By convention, such JSON files use a.tdda
extension.The JSON constraints file can be used to check whether other datasets also satisfy the constraints.
The kinds of constraints (potentially) generated for each field (column) are:
type
:- the (coarse, TDDA) type of the field. One of
bool
,int
,real
,string
ordate
. min
:- for non-string fields, the minimum value in the column. Not generated for all-null columns.
max
:- for non-string fields, the maximum value in the column. Not generated for all-null columns.
min_length
:- For string fields, the length of the shortest string(s) in the field. N.B. In Python2, this assumes the strings are encoded in UTF-8, and an error may occur if this is not the case. String length counts unicode characters, not bytes.
max_length
:- For string fields, the length of the longest string(s) in the field. N.B. In Python2, this assumes the strings are encoded in UTF-8, and an error may occur if this is not the case. String length counts unicode characters, not bytes.
sign
:If all the values in a numeric field have consistent sign, a sign constraint will be written with a value chosen from:
positive
— For all valuesv
in field:v > 0
non-negative
— For all valuesv
in field:v >= 0
zero
— For all valuesv
in field:v == 0
non-positive
— For all valuesv
in field:v <= 0
negative
— For all valuesv
in field:v < 0
null
— For all valuesv
in field:v is null
max_nulls
:The maximum number of nulls allowed in the field.
- If the field has no nulls, a constraint
will be written with
max_nulls
set to zero. - If the field has a single null, a constraint will
be written with
max_nulls
set to one. - If the field has more than 1 null, no constraint will be generated.
- If the field has no nulls, a constraint
will be written with
no_duplicates
:- For string fields (only, for now), if every
non-null value in the field is different,
this constraint will be generated (with value
True
); otherwise no constraint will be generated. So this constraint indicates that all the non-null values in a string field are distinct (unique). allowed_values
:- For string fields only, if there are
MAX_CATEGORIES
or fewer distinct string values in the dataframe, an AllowedValues constraint listing them will be generated.MAX_CATEGORIES
is currently “hard-wired” to 20.
Example usage:
import pandas as pd from tdda.constraints import discover_df df = pd.DataFrame({'a': [1, 2, 3], 'b': ['one', 'two', pd.np.NaN]}) constraints = discover_df(df) with open('example_constraints.tdda', 'w') as f: f.write(constraints.to_json())
See simple_generation.py in the Examples for a slightly fuller example.
-
tdda.constraints.pd.constraints.
verify_df
(df, constraints_path, epsilon=None, type_checking=None, report='all', **kwargs)¶ Verify that (i.e. check whether) the Pandas DataFrame provided satisfies the constraints in the JSON
.tdda
file provided.Mandatory Inputs:
- df:
- A Pandas DataFrame, to be checked.
- constraints_path:
- The path to a JSON
.tdda
file (possibly generated by the discover_df function, below) containing constraints to be checked.
Optional Inputs:
- epsilon:
When checking minimum and maximum values for numeric fields, this provides a tolerance. The tolerance is a proportion of the constraint value by which the constraint can be exceeded without causing a constraint violation to be issued.
For example, with epsilon set to 0.01 (i.e. 1%), values can be up to 1% larger than a max constraint without generating constraint failure, and minimum values can be up to 1% smaller that the minimum constraint value without generating a constraint failure. (These are modified, as appropriate, for negative values.)
If not specified, an epsilon of 0 is used, so there is no tolerance.
NOTE: A consequence of the fact that these are proportionate is that min/max values of zero do not have any tolerance, i.e. the wrong sign always generates a failure.
- type_checking:
strict
orsloppy
. Because Pandas silently, routinely and automatically “promotes” integer and boolean columns to reals and objects respectively if they contain nulls, strict type checking can be problematical in Pandas. For this reason,type_checking
defaults tosloppy
, meaning that type changes that could plausibly be attributed to Pandas type promotion will not generate constraint values.If this is set to strict, a Pandas
float
columnc
will only be allowed to satisfy a anint
type constraint if:c.dropnulls().astype(int) == c.dropnulls()
Similarly, Object fields will satisfy a
bool
constraint only if:c.dropnulls().astype(bool) == c.dropnulls()
- report:
all
orfields
. This controls the behaviour of the__str__()
method on the resultingPandasVerification
object (but not its content).The default is
all
, which means that all fields are shown, together with the verification status of each constraint for that field.If report is set to
fields
, only fields for which at least one constraint failed are shown.
Returns:
PandasVerification
object.This object has attributes:
- passes — Number of passing constriants
- failures — Number of failing constraints
It also has a
to_frame()
method for converting the results of the verification to a Pandas DataFrame, and a__str__()
method to print both the detailed and summary results of the verification.Example usage:
import pandas as pd from tdda.constraints import verify_df df = pd.DataFrame({'a': [0, 1, 2, 10, pd.np.NaN], 'b': ['one', 'one', 'two', 'three', pd.np.NaN]}) v = verify_df(df, 'example_constraints.tdda') print('Constraints passing: %d\n' % v.passes) print('Constraints failing: %d\n' % v.failures) print(str(v)) print(v.to_frame())
See simple_verification.py in the Examples for a slightly fuller example.
-
tdda.constraints.pd.constraints.
detect_df
(df, constraints_path, epsilon=None, type_checking=None, outpath=None, write_all=False, per_constraint=False, output_fields=None, index=False, in_place=False, rownumber_is_index=True, boolean_ints=False, report='records', **kwargs)¶ Check the records from the Pandas DataFrame provided, to detect records that fail any of the constraints in the JSON
.tdda
file provided. This is anomaly detection.Mandatory Inputs:
- df:
- A Pandas DataFrame, to be checked.
- constraints_path:
- The path to a JSON
.tdda
file (possibly generated by the discover_df function, below) containing constraints to be checked.
Optional Inputs:
- epsilon:
When checking minimum and maximum values for numeric fields, this provides a tolerance. The tolerance is a proportion of the constraint value by which the constraint can be exceeded without causing a constraint violation to be issued.
For example, with epsilon set to 0.01 (i.e. 1%), values can be up to 1% larger than a max constraint without generating constraint failure, and minimum values can be up to 1% smaller that the minimum constraint value without generating a constraint failure. (These are modified, as appropriate, for negative values.)
If not specified, an epsilon of 0 is used, so there is no tolerance.
NOTE: A consequence of the fact that these are proportionate is that min/max values of zero do not have any tolerance, i.e. the wrong sign always generates a failure.
- type_checking:
strict
orsloppy
. Because Pandas silently, routinely and automatically “promotes” integer and boolean columns to reals and objects respectively if they contain nulls, strict type checking can be problematical in Pandas. For this reason,type_checking
defaults tosloppy
, meaning that type changes that could plausibly be attributed to Pandas type promotion will not generate constraint values.If this is set to strict, a Pandas
float
columnc
will only be allowed to satisfy a anint
type constraint if:c.dropnulls().astype(int) == c.dropnulls()
Similarly, Object fields will satisfy a
bool
constraint only if:c.dropnulls().astype(bool) == c.dropnulls()
- outpath:
This specifies that the verification process should detect records that violate any constraints, and write them out to this CSV (or feather) file.
By default, only failing records are written out to file, but this can be overridden with the
write_all
parameter.By default, the columns in the detection output file will be a boolean
ok
field for each constraint on each field, an andn_failures
field containing the total number of constraints that failed for each row. This behavious can be overridden with theper_constraint
,output_fields
andindex
parameters.- write_all:
- Include passing records in the detection output file when detecting.
- per_constraint:
- Write one column per failing constraint, as well
as the
n_failures
total. - output_fields:
Specify original columns to write out when detecting.
If passed in as an empty list (rather than None), all original columns will be included.
- index:
Boolean to specify whether to include a row-number index in the output file when detecting.
This is automatically enabled if no output field names are specified.
Rows are numbered from 0.
- in_place:
Detect failing constraints by adding columns to the input DataFrame.
If
outpath
is also specified, then failing records will also be written to file.- rownumber_is_index:
False
if the DataFrame originated from a CSV file (and therefore any detection output file should refer to row numbers from the file, rather than items from the DataFrame index).- boolean_ints:
- If
True
, write out all boolean values to CSV file as integers (1 for true, and 0 for false), rather than astrue
andfalse
values.
The report parameter from
verify_df()
can also be used, in which case a verification report will also be produced in addition to the detection results.Returns:
PandasDetection
object.This object has a
detected()
method for obtaining the Pandas DataFrame containing the detection results.Example usage:
import pandas as pd from tdda.constraints import detect_df df = pd.DataFrame({'a': [0, 1, 2, 10, pd.np.NaN], 'b': ['one', 'one', 'two', 'three', pd.np.NaN]}) v = detect_df(df, 'example_constraints.tdda') detection_df = v.detected() print(detection_df.to_string())
-
class
tdda.constraints.pd.constraints.
PandasConstraintCalculator
(df)¶ Implementation of the Constraint Calculator methods for Pandas dataframes.
-
class
tdda.constraints.pd.constraints.
PandasConstraintDetector
(df)¶ Implementation of the Constraint Detector methods for Pandas dataframes.
-
class
tdda.constraints.pd.constraints.
PandasConstraintVerifier
(df, epsilon=None, type_checking=None)¶ A
PandasConstraintVerifier
object provides methods for verifying every type of constraint against a Pandas DataFrame.
-
class
tdda.constraints.pd.constraints.
PandasConstraintDiscoverer
(df, inc_rex=False)¶ A
PandasConstraintDiscoverer
object is used to discover constraints on a Pandas DataFrame.
-
class
tdda.constraints.pd.constraints.
PandasVerification
(*args, **kwargs)¶ A
PandasVerification
object adds ato_frame()
method to atdda.constraints.base.Verification
object.This allows the result of constraint verification to be converted to a Pandas DataFrame, including columns for the field (column) name, the numbers of passes and failures and boolean columns for each constraint, with values:
True
— if the constraint was satified for the columnFalse
— if column failed to satisfy the constraintpd.np.NaN
— if there was no constraint of this kind
This Pandas-specific implementation of constraint verification also provides methods
to_frame()
to get the overall verification result as as a Pandas DataFrame, anddetected()
to get any detection results as a a Pandas DataFrame (if the verification has been run with indetect
mode).-
to_dataframe
()¶ Converts object to a Pandas DataFrame.
-
to_frame
()¶ Converts object to a Pandas DataFrame.
-
class
tdda.constraints.pd.constraints.
PandasDetection
(*args, **kwargs)¶ A
PandasDetection
object adds adetected()
method to aPandasVerification
object.This allows the Pandas DataFrame resulting from constraint detection to be made available.
The object also provides properties n_passing_records and n_failing_records, recording how many records passed and failed the detection process.
-
detected
()¶ Returns a Pandas DataFrame containing the detection results.
If there are no failing records, and the detection was not run with the write_all flag set, then
None
is returned.
-
If a CSV file is used with the tdda
command-line tool, it will be
processed by the standard Pandas CSV file reader with the following settings:
index_col
isNone
infer_datetime_format
isTrue
quotechar
is"
quoting
iscsv.QUOTE_MINIMAL
escapechar
is\\
(backslash)na_values
are the empty string,"NaN"
, and"NULL"
keep_default_na
isFalse
Constraints for Databases¶
When a database table is used with the tdda
command-line tool, the table
name (including an optional schema) can be preceded by DBTYPE
chosen
from postgres
, mysql
, sqlite
or mongodb
:
DBTYPE:schema.tablename
If DBTYPE
is used, this will cause the file .tdda_db_conn_DBTYPE
to be read from your home directory (see Database Connection Files), which can
contain all connection parameters.
Parameters can also be provided using the following flags (which override
the values in the .tdda_db_conn_DBTYPE
file, if provided):
-conn FILE
- Database connection file (see Database Connection Files)
-dbtype DBTYPE
- Type of database
-db DATABASE
- Name of database to connect to
-host HOSTNAME
- Name of server to connect to
-port PORTNUMBER
- IP port number to connect to
-user USERNAME
- Username to connect as
-password PASSWORD
- Password to authenticate with
If -conn
is provided, then none of the other options are required, and
the database connection details are read from the specified file.
If the database type is specified (with the -dbtype
option, or by
prefixing the table name, such as postgres:mytable
), then a default
connection file .tdda_db_conn_DBTYPE
(in your home directory) is used,
if present.
Database Connection Files¶
To use a database source, you can either specify the database type
using the --dbtype DBTYPE
option, or you can prefix the table name
with DBTYPE:
.
You can provide default values for all of the other database options in
a database connection file .tdda_db_conn_DBTYPE
, in your home directory.
Any database-related options passed in on the command line will override the default settings from the connection file.
A tdda_db_conn_DBTYPE
file is a JSON file of the form:
{
"dbtype": DBTYPE,
"db": DATABASE,
"host": HOSTNAME,
"port": PORTNUMBER,
"user": USERNAME,
"password": PASSWORD,
"schema": SCHEMA,
}
All the entries are optional.
If a password
is provided, then care should be taken to ensure that the
file has appropriate filesystem permissions so that it cannot be read by
other users.
If a schema
is provided, then it will be used as the default schema,
when constraints are discovered or verified on a table name with no
schema specified.
API¶
TDDA constraint discovery and verification is provided for a number of DB-API (PEP-0249) compliant databases, and also for a number of other (NoSQL) databases.
The top-level functions are:
discover_db_table()
:- Discover constraints from a single database table.
verify_db_table()
:- Verify (check) a single database table, against a set of previously discovered constraints.
detect_db_table()
:- Verify (check) a single database table, against a set of previously discovered constraints.
-
tdda.constraints.db.constraints.
discover_db_table
(dbtype, db, tablename, inc_rex=False)¶ Automatically discover potentially useful constraints that characterize the database table provided.
Input:
- dbtype:
- Type of database.
- db:
- a database object
- tablename:
- a table name
Possible return values:
DatasetConstraints
objectNone
— (if no constraints were found).
This function goes through each column in the table and, where appropriate, generates constraints that describe (and are satisified by) this dataframe.
Assuming it generates at least one constraint for at least one field it returns a
tdda.constraints.base.DatasetConstraints
object.This includes a ‘fields’ attribute, keyed on the column name.
The returned
DatasetConstraints
object includes ato_json()
method, which converts the constraints into JSON for saving as a tdda constraints file. By convention, such JSON files use a ‘.tdda’ extension.The JSON constraints file can be used to check whether other datasets also satisfy the constraints.
The kinds of constraints (potentially) generated for each field (column) are:
- type:
- the (coarse, TDDA) type of the field. One of ‘bool’, ‘int’, ‘real’, ‘string’ or ‘date’.
- min:
- for non-string fields, the minimum value in the column. Not generated for all-null columns.
- max:
- for non-string fields, the maximum value in the column. Not generated for all-null columns.
- min_length:
- For string fields, the length of the shortest string(s) in the field.
- max_length:
- For string fields, the length of the longest string(s) in the field.
- sign:
If all the values in a numeric field have consistent sign, a sign constraint will be written with a value chosen from:
- positive — For all values v in field: v > 0
- non-negative — For all values v in field: v >= 0
- zero — For all values v in field: v == 0
- non-positive — For all values v in field: v <= 0
- negative — For all values v in field: v < 0
- null — For all values v in field: v is null
- max_nulls:
The maximum number of nulls allowed in the field.
- If the field has no nulls, a constraint will be written with max_nulls set to zero.
- If the field has a single null, a constraint will be written with max_nulls set to one.
- If the field has more than 1 null, no constraint will be generated.
- no_duplicates:
- For string fields (only, for now), if every
non-null value in the field is different,
this constraint will be generated (with value
True
); otherwise no constraint will be generated. So this constraint indicates that all the non-null values in a string field are distinct (unique). - allowed_values:
- For string fields only, if there are
MAX_CATEGORIES
or fewer distinct string values in the dataframe, an AllowedValues constraint listing them will be generated.MAX_CATEGORIES
is currently “hard-wired” to 20.
Example usage:
import pgdb from tdda.constraints import discover_db_table dbspec = 'localhost:databasename:username:password' tablename = 'schemaname.tablename' db = pgdb.connect(dbspec) constraints = discover_db_table('postgres', db, tablename) with open('myconstraints.tdda', 'w') as f: f.write(constraints.to_json())
-
tdda.constraints.db.constraints.
verify_db_table
(dbtype, db, tablename, constraints_path, epsilon=None, type_checking='strict', testing=False, report='all', **kwargs)¶ Verify that (i.e. check whether) the database table provided satisfies the constraints in the JSON .tdda file provided.
Mandatory Inputs:
- dbtype:
- Type of database.
- db:
- A database object
- tablename:
- A database table name, to be checked.
- constraints_path:
- The path to a JSON .tdda file (possibly generated by the discover_constraints function, below) containing constraints to be checked.
Optional Inputs:
- epsilon:
When checking minimum and maximum values for numeric fields, this provides a tolerance. The tolerance is a proportion of the constraint value by which the constraint can be exceeded without causing a constraint violation to be issued.
For example, with epsilon set to 0.01 (i.e. 1%), values can be up to 1% larger than a max constraint without generating constraint failure, and minimum values can be up to 1% smaller that the minimum constraint value without generating a constraint failure. (These are modified, as appropriate, for negative values.)
If not specified, an epsilon of 0 is used, so there is no tolerance.
NOTE: A consequence of the fact that these are proportionate is that min/max values of zero do not have any tolerance, i.e. the wrong sign always generates a failure.
- type_checking:
strict
orsloppy
. For databases (unlike Pandas DataFrames), this defaults to ‘strict’.If this is set to sloppy, a database “real” column c will only be allowed to satisfy a an “int” type constraint.
- report:
all
orfields
. This controls the behaviour of the__str__()
method on the resultingDatabaseVerification
object (but not its content).The default is
all
, which means that all fields are shown, together with the verification status of each constraint for that field.If report is set to
fields
, only fields for which at least one constraint failed are shown.- testing:
- Boolean flag. Should only be set to
True
when being run as part of an automated test. It suppresses type-compatibility warnings.
Returns:
DatabaseVerification
object.This object has attributes:
- passed — Number of passing constriants
- failures — Number of failing constraints
Example usage:
import pgdb from tdda.constraints import verify_db_table dbspec = 'localhost:databasename:username:password' tablename = 'schemaname.tablename' db = pgdb.connect(dbspec) v = verify_db_table('postgres' db, tablename, 'myconstraints.tdda') print('Constraints passing:', v.passes) print('Constraints failing: %d\n' % v.failures) print(str(v))
-
class
tdda.constraints.db.constraints.
DatabaseConstraintVerifier
(dbtype, db, tablename, epsilon=None, type_checking='strict', testing=False)¶ A
DatabaseConstraintVerifier
object provides methods for verifying every type of constraint against a single database table.
-
class
tdda.constraints.db.constraints.
DatabaseVerification
(*args, **kwargs)¶ A
DatabaseVerification
object is the variant of thetdda.constraints.base.Verification
object used for verification of constraints on a database table.
-
class
tdda.constraints.db.constraints.
DatabaseConstraintDiscoverer
(dbtype, db, tablename, inc_rex=False)¶ A
DatabaseConstraintDiscoverer
object is used to discover constraints on a single database table.
Extension Framework¶
The tdda
command-line utility provides built-in support for constraint
discovery and verification for tabular data stored in CSV files, Pandas
DataFrames saved in .feather
files, and for a tables in a variety of
different databases.
The utility can be extended to provide support for constraint discovery and verification for other kinds of data, via its Python extension framework.
The framework will automatically use any extension implementations that
have been declared using the TDDA_EXTENSIONS
environment variable. This
should be set to a list of class names, for Python classes that extend the
ExtensionBase
base class.
The class names in the TDDA_EXTENSIONS
environment variable should be
colon-separated for Unix systems, or semicolon-separated for Microsoft
Windows. To be usable, the classes must be accessible by Python (either
by being installed in Pythons standard module directory, or by being
included in the PYTHONPATH
environment variable.
For example:
export TDDA_EXTENSIONS="mytdda.MySpecialExtension"
export PYTHONPATH="/my/python/sources:$PYTHONPATH"
With these in place, the tdda
command will include constraint discovery
and verification using the MySpecialExtension
implementation class
provided in the Python file /my/python/sources/mytdda.py
.
An example of a simple extension is included with the set of standard examples. See Examples.
Extension Overview¶
An extension should provide:
- an implementation (subclass) of
ExtensionBase
, to provide a command-line interface, extending thetdda
command to support a particular type of input data.- an implementation (subclass) of
BaseConstraintCalculator
, to provide methods for computing individual constraint results.- an implementation (subclass) of
BaseConstraintDetector
, to provide methods for generating detection results.
A typical implementation looks like:
from tdda.constraints.flags import discover_parser, discover_flags
from tdda.constraints.flags import verify_parser, verify_flags
from tdda.constraints.flags import detect_parser, detect_flags
from tdda.constraints.extension import ExtensionBase
from tdda.constraints.base import DatasetConstraints, Detection
from tdda.constraints.baseconstraints import (BaseConstraintCalculator,
BaseConstraintVerifier,
BaseConstraintDetector,
BaseConstraintDiscoverer)
from tdda.rexpy import rexpy
class MyExtension(ExtensionBase):
def applicable(self):
...
def help(self, stream=sys.stdout):
print('...', file=stream)
def spec(self):
return '...'
def discover(self):
parser = discover_parser()
parser.add_argument(...)
params = {}
flags = discover_flags(parser, self.argv[1:], params)
data = ... get data source from flags ...
discoverer = MyConstraintDiscoverer(data, **params)
constraints = discoverer.discover()
results = constraints.to_json()
... write constraints JSON to output file
return results
def verify(self):
parser = verify_parser()
parser.add_argument(...)
params = {}
flags = verify_flags(parser, self.argv[1:], params)
data = ... get data source from flags ...
verifier = MyConstraintVerifier(data, **params)
constraints = DatasetConstraints(loadpath=...)
results = verifier.verify(constraints)
return results
def detect(self):
parser = detect_parser()
parser.add_argument(...)
params = {}
flags = detect_flags(parser, self.argv[1:], params)
data = ... get data source from flags ...
detector = MyConstraintDetector(data, **params)
constraints = DatasetConstraints(loadpath=...)
results = detector.detect(constraints)
return results
Extension API¶
-
class
tdda.constraints.extension.
BaseConstraintCalculator
¶ The
BaseConstraintCalculator
class defines a default or dummy implementation of all of the methods that are required in order to implement a constraint discoverer or verifier via subclasses of the baseBaseConstraintDiscoverer
andBaseConstraintVerifier
classes.-
allowed_values_exclusions
()¶ Get list of values to ignore when computing allowed values
-
calc_all_non_nulls_boolean
(colname)¶ Checks whether all the non-null values in a column are boolean. Returns True of they are, and False otherwise.
This is only required for implementations where a dataset column may contain values of mixed type.
-
calc_max
(colname)¶ Calculates the maximum (non-null) value in the named column.
-
calc_max_length
(colname)¶ Calculates the length of the longest string(s) in the named column.
-
calc_min
(colname)¶ Calculates the minimum (non-null) value in the named column.
-
calc_min_length
(colname)¶ Calculates the length of the shortest string(s) in the named column.
-
calc_non_integer_values_count
(colname)¶ Calculates the number of unique non-integer values in a column
This is only required for implementations where a dataset column may contain values of mixed type.
-
calc_non_null_count
(colname)¶ Calculates the number of nulls in a column
-
calc_null_count
(colname)¶ Calculates the number of nulls in a column
-
calc_nunique
(colname)¶ Calculates the number of unique non-null values in a column
-
calc_rex_constraint
(colname, constraint, detect=False)¶ Verify whether a given column satisfies a given regular expression constraint (by matching at least one of the regular expressions given).
Returns a ‘truthy’ value (typically the set of the strings that do not match any of the regular expressions) on failure, and a ‘falsy’ value (typically False or None or an empty set) if there are no failures. Any contents of the returned value are used in the case where detect is set, by the corresponding extension method for recording detection results.
-
calc_tdda_type
(colname)¶ Calculates the TDDA type of a column
-
calc_unique_values
(colname, include_nulls=True)¶ Calculates the set of unique values (including or excluding nulls) in a column
-
column_exists
(colname)¶ Returns whether this column exists in the dataset
-
find_rexes
(colname, values=None)¶ Generate a list of regular expressions that cover all of the patterns found in the (string) column.
-
get_column_names
()¶ Returns a list containing the names of all the columns
-
get_nrecords
()¶ Return total number of records
-
is_null
(value)¶ Determine whether a value is null
-
to_datetime
(value)¶ Convert a value to a datetime
-
types_compatible
(x, y, colname)¶ Determine whether the types of two values are compatible
-
-
class
tdda.constraints.extension.
BaseConstraintDetector
¶ The
BaseConstraintDetector
class defines a default or dummy implementation of all of the methods that are required in order to implement constraint detection via the a subclass of the baseBaseConstraintVerifier
class.-
detect_allowed_values_constraint
(colname, value, violations)¶ Detect failures for an allowed_values constraint.
-
detect_max_constraint
(colname, value, precision, epsilon)¶ Detect failures for a max constraint.
-
detect_max_length_constraint
(colname, value)¶ Detect failures for a max_length constraint.
-
detect_max_nulls_constraint
(colname, value)¶ Detect failures for a max_nulls constraint.
-
detect_min_constraint
(colname, value, precision, epsilon)¶ Detect failures for a min constraint.
-
detect_min_length_constraint
(colname, value)¶ Detect failures for a min_length constraint.
-
detect_no_duplicates_constraint
(colname, value)¶ Detect failures for a no_duplicates constraint.
-
detect_rex_constraint
(colname, value, violations)¶ Detect failures for a rex constraint.
-
detect_sign_constraint
(colname, value)¶ Detect failures for a sign constraint.
-
detect_tdda_type_constraint
(colname, value)¶ Detect failures for a type constraint.
-
write_detected_records
(detect_outpath=None, detect_write_all=False, detect_per_constraint=False, detect_output_fields=None, detect_index=False, detect_in_place=False, rownumber_is_index=True, boolean_ints=False, **kwargs)¶ Write out a detection dataset.
Returns a :py:class:
~tdda.constraints.base.Detection
object (orNone
).
-
-
class
tdda.constraints.extension.
ExtensionBase
(argv, verbose=False)¶ An extension must provide a class that is based on the
ExtensionBase
class, providing implementations for itsapplicable()
,help()
,discover()
andverify()
methods.-
applicable
()¶ The
applicable()
method should returnTrue
if theargv
property contains command-line parameters that can be used by this implementation.For example, if the extension can handle data stored in Excel
.xlsx
files, then itsapplicable()
method should returnTrue
if any of its parameters are filenames that have a.xlsx
suffix.
-
detect
()¶ The
detect()
method should implement constraint detection.It should read constraints from a
.tdda
file specified on the command line, and verify these constraints on the data specified, and produce detection output.It should use the
self.argv
variable to get whatever other optional or mandatory flags or parameters are required to specify the data on which the constraints are to be verified, where the output detection data should be written, and detection-specific flags.
-
discover
()¶ The
discover()
method should implement constraint discovery.It should use the
self.argv
variable to get whatever other optional or mandatory flags or parameters are required to specify the data from which constraints are to be discovered, and the name of the file to which the constraints are to be written.
-
help
(self, stream=sys.stdout)¶ The
help()
method should document itself by writing lines to the given output stream.This is used by the
tdda
command’shelp
option.
-
spec
()¶ The
spec()
method should return a short one-line string describing, briefly, how to specify the input source.
-
verify
()¶ The
verify()
method should implement constraint verification.It should read constraints from a
.tdda
file specified on the command line, and verify these constraints on the data specified.It should use the
self.argv
variable to get whatever other optional or mandatory flags or parameters are required to specify the data on which the constraints are to be verified.
-
Constraints API¶
TDDA constraint discovery and verification, common underlying functionality.
-
class
tdda.constraints.baseconstraints.
BaseConstraintDiscoverer
(inc_rex=False, **kwargs)¶ The
BaseConstraintDiscoverer
class provides a generic framework for discovering constraints.A concrete implementation of this class is constructed by creating a mix-in subclass which inherits both from
BaseConstraintDiscover
and from a specific implementation ofBaseConstraintCalculator
.
-
class
tdda.constraints.baseconstraints.
BaseConstraintVerifier
(epsilon=None, type_checking=None, **kwargs)¶ The
BaseConstraintVerifier
class provides a generic framework for verifying constraints.A concrete implementation of this class is constructed by creating a mix-in subclass which inherits both from
BaseConstraintVerifier
and from specific implementations ofBaseConstraintCalculator
andBaseConstraintDetector
.-
cache_values
(colname)¶ Returns the dictionary for colname from the cache, first creating it if there isn’t one on entry.
-
detect
(constraints, VerificationClass=<class 'tdda.constraints.base.Verification'>, outpath=None, write_all=False, per_constraint=False, output_fields=None, index=False, in_place=False, rownumber_is_index=True, boolean_ints=False, **kwargs)¶ Apply verifiers to a set of constraints, for detection.
Note that if there is a constraint for a field that does not exist, then it fails verification, but there are no records to detect against. Similarly if the field exists but the dataset has no records.
-
get_all_non_nulls_boolean
(colname)¶ Looks up or caches the number of non-integer values in a real column, or calculates and caches it.
-
get_cached_value
(value, colname, f)¶ Return cached value of colname, calculating it and caching it first, if it is not already there.
-
get_max
(colname)¶ Looks up cached maximum of column, or calculates and caches it
-
get_max_length
(colname)¶ Looks up cached maximum string length in column, or calculates and caches it
-
get_min
(colname)¶ Looks up cached minimum of column, or calculates and caches it
-
get_min_length
(colname)¶ Looks up cached minimum string length in column, or calculates and caches it
-
get_non_integer_values_count
(colname)¶ Looks up or caches the number of non-integer values in a real column, or calculates and caches it.
-
get_non_null_count
(colname)¶ Looks up or caches the number of non-null values in a column, or calculates and caches it
-
get_null_count
(colname)¶ Looks up or caches the number of nulls in a column, or calculates and caches it
-
get_nunique
(colname)¶ Looks up or caches the number of unique (distinct) values in a column, or calculates and caches it.
-
get_tdda_type
(colname)¶ Looks up cached tdda type of a column, or calculates and caches it
-
get_unique_values
(colname)¶ Looks up or caches the list of unique (distinct) values in a column, or calculates and caches it.
-
verifiers
()¶ Returns a dictionary mapping constraint types to their callable (bound) verification methods.
-
verify
(constraints, VerificationClass=<class 'tdda.constraints.base.Verification'>, **kwargs)¶ Apply verifiers to a set of constraints, for reporting
-
verify_allowed_values_constraint
(colname, constraint, detect=False)¶ Verify whether a given column satisfies the constraint on allowed (string) values provided.
-
verify_max_constraint
(colname, constraint, detect=False)¶ Verify whether a given column satisfies the maximum value constraint specified.
-
verify_max_length_constraint
(colname, constraint, detect=False)¶ Verify whether a given (string) column satisfies the maximum length constraint specified.
-
verify_max_nulls_constraint
(colname, constraint, detect=False)¶ Verify whether a given column satisfies the supplied constraint that it should contain no nulls.
-
verify_min_constraint
(colname, constraint, detect=False)¶ Verify whether a given column satisfies the minimum value constraint specified.
-
verify_min_length_constraint
(colname, constraint, detect=False)¶ Verify whether a given (string) column satisfies the minimum length constraint specified.
-
verify_no_duplicates_constraint
(colname, constraint, detect=False)¶ Verify whether a given column satisfies the constraint supplied, that it should contain no duplicate (non-null) values.
-
verify_rex_constraint
(colname, constraint, detect=False)¶ Verify whether a given column satisfies a given regular expression constraint (by matching at least one of the regular expressions given).
-
verify_sign_constraint
(colname, constraint, detect=False)¶ Verify whether a given column satisfies the supplied sign constraint.
-
verify_tdda_type_constraint
(colname, constraint, detect=False)¶ Verify whether a given column satisfies the supplied type constraint.
-
Underlying API Classes¶
Classes for representing individual constraints.
-
class
tdda.constraints.base.
DatasetConstraints
(per_field_constraints=None, loadpath=None)¶ Container for constraints pertaining to a dataset. Currently only supports per-field constraints.
-
initialize_from_dict
(in_constraints)¶ Initializes this object from a dictionary in_constraints. Currently, the only key used from in_constraints is fields.
The value of in_constraints[‘fields’] is expected to be a dictionary, keyed on field name, whose values are the constraints for that field.
They constraints are keyed on the kind of constraint, and should contain either a single value (a scalar or a list), or a dictionary of keyword arguments for the constraint initializer.
-
load
(path)¶ Builds a DatasetConstraints object from a json file
-
sort_fields
(fields=None)¶ Sorts the field constraints within the object by field order, by default by alphabetical order.
If a list of field names is provided, then the fields will appear in that given order (with any additional fields appended at the end).
-
to_dict
(tddafile=None)¶ Converts the constraints in this object to a dictionary.
-
to_json
(tddafile=None)¶ Converts the constraints in this object to JSON. The resulting JSON is returned.
-
-
class
tdda.constraints.base.
FieldConstraints
(name=None, constraints=None)¶ Container for constraints on a field.
-
to_dict_value
()¶ Returns a pair consisting of the name supplied, or the stored name, and an ordered dictionary keyed on constraint kind with the value specifying the constraint. For simple constraints, the value is a base type; for more complex constraints with several components, the value will itself be an (ordered) dictionary.
The ordering is all to make the JSON file get written in a sensible order, rather than being a jumbled mess.
-
-
class
tdda.constraints.base.
MultiFieldConstraints
(names=None, constraints=None)¶ Container for constraints on a pairs (or higher numbers) of fields
-
to_dict_value
()¶ - Returns a pair consisting of
- a comma-separated list of the field names
- an ordered dictionary keyed on constraint kind with the value specifying the constraint.
For simple constraints, the value is a base type; for more complex Constraints with several components, the value will itself be an (ordered) dictionary.
The ordering is all to make the JSON file get written in a sensible order, rather than being a jumbled mess.
-
-
class
tdda.constraints.base.
Constraint
(kind, value, **kwargs)¶ Base container for a single constraint. All specific constraint types (should) subclass this.
-
check_validity
(name, value, *valids)¶ Check that the value of a constraint is allowed. If it isn’t, then the TDDA file is not valid.
-
-
class
tdda.constraints.base.
MinConstraint
(value, precision=None)¶ Constraint specifying the minimum allowed value in a field.
-
class
tdda.constraints.base.
MaxConstraint
(value, precision=None)¶ Constraint specifying the maximum allowed value in a field.
-
class
tdda.constraints.base.
SignConstraint
(value)¶ Constraint specifying allowed sign of values in a field. Used only for numeric fields (
real
,int
,bool
), and normally used in addition to Min and Max constraints.Possible values are
positive
,non-negative
,zero
,non-positive
,negative
andnull
.
-
class
tdda.constraints.base.
TypeConstraint
(value)¶ Constraint specifying the allowed (TDDA) type of a field. This can be a single value, chosen from:
bool
int
real
string
date
or a list of such values, most commonly
['int', 'real']
, sometimes used because of Pandas silent and automatic promotion of integer fields to floats if nulls are present.)
-
class
tdda.constraints.base.
MaxNullsConstraint
(value)¶ Constraint on the maximum number of nulls allowed in a field. Usually 0 or 1. (The constraint generator only generates 0 and 1, but the verifier will verify and number.)
-
class
tdda.constraints.base.
NoDuplicatesConstraint
(value=True)¶ Constraint specifying that non dupicate non-null values are allowed in a field.
Currently only generated for string fields, though could be used more broadly.
-
class
tdda.constraints.base.
AllowedValuesConstraint
(value)¶ Constraint restricting the allowed values in a field to an explicity list.
Currently only used for string fields.
When generating constraints, this code will only generate such a constraint if there are no more than
MAX_CATEGORIES
(= 20 at the time of writing, but check above in case this comment rusts) different values in the field.
-
class
tdda.constraints.base.
MinLengthConstraint
(value)¶ Constraint restricting the minimum length of strings in a string field.
Generated instead of a
MinConstraint
by this generation code, but can be used in conjunction with aMinConstraint
.
-
class
tdda.constraints.base.
MaxLengthConstraint
(value)¶ Constraint restricting the maximum length of strings in a string field.
Generated instead of a
MaxConstraint
by this generation code, but can be used in conjunction with aMinConstraint
.
-
class
tdda.constraints.base.
LtConstraint
(value)¶ Constraint specifying that the first field of a pair should be (strictly) less than the second, where both are non-null.
-
class
tdda.constraints.base.
LteConstraint
(value)¶ Constraint specifying that the first field of a pair should be no greater than the second, where both are non-null.
-
class
tdda.constraints.base.
EqConstraint
(value)¶ Constraint specifying that two fields should have identical values where they are both non-null.
-
class
tdda.constraints.base.
GtConstraint
(value)¶ Constraint specifying that the first field of a pair should be (strictly) greater than the second, where both are non-null.
-
class
tdda.constraints.base.
GteConstraint
(value)¶ Constraint specifying that the first field of a pair should be greater than or equal to the second, where both are non-null.
-
class
tdda.constraints.base.
RexConstraint
(value)¶ Constraint restricting a string field to match (at least) one of the regular expressions in a list given.
-
class
tdda.constraints.base.
Verification
(constraints, report='all', ascii=False, detect=False, detect_outpath=None, detect_write_all=False, detect_per_constraint=False, detect_output_fields=None, detect_index=False, detect_in_place=False, **kwargs)¶ Container for the result of a constraint verification for a dataset in the context of a given set of constraints.
TDDA JSON file format¶
A .tdda
file is a JSON file containing a single JSON object of the form:
{
"fields": {
field-name: field-constraints,
...
}
}
Each field-constraints
item is a JSON object containing a property for
each included constraint:
{
"type": one of int, real, bool, string or date
"min": minimum allowed value,
"max": maximum allowed value,
"min_length": minimum allowed string length (for string fields),
"max_length": maximum allowed string length (for string fields),
"max_nulls": maximum number of null values allowed,
"sign": one of positive, negative, non-positive, non-negative,
"no_duplicates": true if the field values must be unique,
"values": list of distinct allowed values,
"rex": list of regular expressions, to cover all cases
}