TDDA Serial API
tdda.serial
Reading CSV Files
- tdda.serial.csv_to_pandas(path=None, md_path=None, md_file_type=None, find_md=False, backend=None, upgrade_types=True, upgrade_possible_ints=False, return_md=False, table_number=None, use_table_name=False, preferred=None, verbosity=2, infer_datetime_formats=False, warner=None, config=None, **kw)
Load a CSV file into a Pandas DataFrame using metadata for type guidance.
- Parameters:
path (str) – Path to the data file (usually CSV) to be read. If None, md_path must be set and the metadata file must contain the path to the data. Can contain ':' to trigger metadata search.
md_path (str) – Optional path to the associated metadata file. If path is None, md_path must be set and the metadata file must contain the path to the data (CSV file). If path is not None, the path in the metadata file is ignored. If md_path is None, path must not be None; in that case, if find_md is True, this function tries to find an associated metadata file and raises an error if none can be found.
md_file_type (str) – Optional specification of the kind of metadata file. One of 'tdda.serial', 'csvw', 'frictionless'.
find_md (bool) – If True, the library will try to find associated metadata based on filename conventions. Should not be set if md_path is provided. If associated metadata cannot be found, an error will be raised.
upgrade_types (bool) – If True (the default), upgrade some object-dtype columns to stricter types.
upgrade_possible_ints (bool) – If True (not the default), upgrade float columns with nulls but no fractional components to nullable Int types.
return_md (bool) – If True, returns a (DataFrame, metadata) tuple instead of just the DataFrame.
table_number (int) – If set, use the nth table (indexed from zero) from a multi-table metadata file.
preferred (str) – Override the metadata flavour used. By default csv_to_pandas uses the pandas.read_csv flavour if present. Can be set to 'tdda.serial' or 'csvw'.
verbosity (int) – Controls warning output for the metadata reader.
**kw – Passed directly to pandas.read_csv, overriding any values derived from the metadata file.
- Returns:
DataFrame, or (DataFrame, metadata) tuple if return_md is True.
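A minimal sketch of a typical call, assuming tdda is installed. The helper names check_reader_args and load_frame are hypothetical and not part of the library. The first restates the documented rules for path, md_path and find_md as a pre-flight check (it raises where the documentation only says "should not"); the second forwards to csv_to_pandas.

```python
def check_reader_args(path=None, md_path=None, find_md=False):
    """Hypothetical pre-flight check mirroring the documented argument rules."""
    if path is None and md_path is None:
        raise ValueError("either path or md_path must be given")
    if find_md and md_path is not None:
        raise ValueError("find_md should not be set when md_path is provided")
    return {"path": path, "md_path": md_path, "find_md": find_md}


def load_frame(path=None, md_path=None, find_md=False, **kw):
    """Validate the arguments, then delegate to tdda.serial.csv_to_pandas."""
    args = check_reader_args(path, md_path, find_md)
    # Imported lazily so the helper above can be used without tdda installed.
    from tdda.serial import csv_to_pandas
    return csv_to_pandas(**args, **kw)
```

With a hypothetical data file orders.csv and a metadata file next to it, a call such as load_frame(path="orders.csv", find_md=True, return_md=True) would be expected to return a (DataFrame, metadata) tuple.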
- tdda.serial.csv_to_polars(path=None, md_path=None, md_file_type=None, find_md=False, upgrade_types=True, upgrade_possible_ints=False, return_md=False, table_number=None, use_table_name=False, preferred=None, map_other_bools_to_string=False, include_data_path_in_md=None, verbosity=2, warner=None, infer_datetime_formats=False, **kw)
Load a CSV file into a Polars DataFrame using metadata for type guidance.
- Parameters:
path (str) – Path to the data file (usually CSV) to be read. If None, md_path must be set and the metadata file must contain the path to the data.
md_path (str) – Optional path to the associated metadata file. If path is None, md_path must be set and the metadata file must contain the path to the data (CSV file). If path is not None, the path in the metadata file is ignored. If md_path is None, path must not be None; in that case, if find_md is True, this function tries to find an associated metadata file and raises an error if none can be found.
md_file_type (str) – Optional specification of the kind of metadata file. One of 'tdda.serial', 'csvw', 'frictionless'.
find_md (bool) – If True, the library will try to find associated metadata based on filename conventions. Should not be set if md_path is provided. If associated metadata cannot be found, an error will be raised.
upgrade_types (bool) – If True (the default), upgrade some object-dtype columns to stricter types.
upgrade_possible_ints (bool) – If True (not the default), upgrade float columns with nulls but no fractional components to nullable Int types.
return_md (bool) – If True, returns a (DataFrame, metadata) tuple instead of just the DataFrame.
table_number (int) – If set, use the nth table (indexed from zero) from a multi-table metadata file.
preferred (str) – Override the metadata flavour used. By default csv_to_polars uses the polars.read_csv flavour if present. Can be set to 'tdda.serial' or 'csvw'.
map_other_bools_to_string (bool) – If True, boolean fields whose metadata specifies non-true/false values are read as strings. Default: False.
include_data_path_in_md – If None, the path is not set in tdda.serial metadata. If set to any truthy value, the path to the datafile is included. For csvw and frictionless, None causes a url/path to be written.
verbosity (int) – Controls warning output for the metadata reader.
**kw – Passed directly to polars.read_csv, overriding any values derived from the metadata file.
- Returns:
DataFrame, or (DataFrame, metadata) tuple if return_md is True.
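The upgrade_possible_ints rule described for both readers (float columns with nulls but no fractional components) can be illustrated in plain Python. This is only a sketch of the condition, not the library's implementation; the function name is hypothetical, and nulls are assumed to appear as None or NaN.

```python
import math


def can_upgrade_to_int(values):
    """Return True if every non-null value is a whole number.

    Nulls are assumed to be None or NaN; a column with no non-null
    values is left alone (returns False).
    """
    non_null = [v for v in values if v is not None and not math.isnan(v)]
    return bool(non_null) and all(float(v).is_integer() for v in non_null)


print(can_upgrade_to_int([1.0, float("nan"), 3.0]))   # whole numbers plus a null
print(can_upgrade_to_int([1.5, None, 3.0]))           # has a fractional component
```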
Writing CSV Files
- tdda.serial.pandas_to_csv(df, path=None, md_inpath=None, md_outpath=None, auto_md_inpath=False, auto_md_outpath=False, flavour=None, preferred_in_flavour=None, in_table_number=None, find_safe_null=False, include_data_path_in_md=None, warner=None, **kw_overrides)
Write a Pandas DataFrame to a CSV file, optionally using metadata.
- Parameters:
df (DataFrame) – The DataFrame to write.
path (str) – Path to write the CSV data to.
md_inpath (str) – Optional path to a .serial (or CSVW) metadata file to use when writing the CSV.
md_outpath (str or bool) – Optional path to write a .serial metadata file describing the format used. If True, the .serial path is derived from the data path by swapping the extension.
auto_md_inpath (bool) – If True, find the input metadata path automatically from filename conventions.
auto_md_outpath (bool) – If True, choose the output metadata path automatically.
flavour (str or list) – Flavour(s) to include in the written .serial file. By default only 'tdda.serial' is included.
preferred_in_flavour (str) – If multiple formats are available in the .serial file, use this one. By default, uses the first available of: pandas.DataFrame.to_csv, pandas.read_csv, tdda.serial, or anything it can find.
find_safe_null (bool) – If True, choose a null representation that is safe for this data (not present in any string column).
include_data_path_in_md – If None, the path is not set in tdda.serial metadata. If set to any truthy value, the path to the datafile is included. For csvw and frictionless, None causes a url/path to be written.
**kw_overrides – Passed directly to DataFrame.to_csv, overriding any values derived from md_inpath. It is usually better not to mix md_inpath and overrides as it is easy to generate incompatibilities. Any na_rep specified here will be replaced if find_safe_null is set and the nominated null indicator is not safe (a warning is issued).
- Returns:
Object with attributes:
.md_out_path: path metadata was written to, or None.
.out_path: path data was written to, or None.
.md_inpath: path from which write metadata was read.
.to_csv_kwargs: keyword args used to write the CSV.
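Two of the behaviours described for pandas_to_csv can be sketched without the library: deriving the .serial path by swapping the extension (md_outpath=True), and choosing a null representation that does not collide with the data (find_safe_null). Both functions and the candidate list are hypothetical illustrations, and "safe" is assumed here to mean that no string value is exactly equal to the candidate.

```python
from pathlib import PurePosixPath


def derive_serial_path(data_path):
    """Swap the data file's extension for .serial."""
    return str(PurePosixPath(data_path).with_suffix(".serial"))


def pick_safe_null(string_values, candidates=("", "NULL", "NA", "\\N")):
    """Return the first candidate that never occurs as a string value.

    The candidate list is an assumption made for this sketch.
    """
    present = set(string_values)
    for candidate in candidates:
        if candidate not in present:
            return candidate
    raise ValueError("no safe null representation among the candidates")


print(derive_serial_path("out/data.csv"))
print(pick_safe_null(["", "NULL", "x"]))
```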
Loading Metadata
- tdda.serial.load_metadata(path, md_file_type=None, table_number=None, for_table_name=None, preferred_serial_flavour=None, verbosity=2)
Load metadata from a .serial, CSVW, or Frictionless file.
- Parameters:
path (str) – Path to the metadata file.
md_file_type (str) – Optional metadata file type. One of 'tdda.serial', 'csvw', 'frictionless'.
table_number (int) – If specified, use the nth table from a multi-table metadata file (indexed from zero). Raises an error if not present.
for_table_name (str) – If specified, select the metadata from a multi-table metadata file by matching the end of the url in the metadata to this table name.
preferred_serial_flavour (str or list) – If multiple metadata flavours are found at the same level of a .serial file, the one to choose (or a priority list).
verbosity (int) – Controls warning/error output. 2: errors and warnings to stderr. 1: warnings to stderr only. 0: silent.
- Returns:
SerialMetadata (or subclass) loaded from the file.
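The for_table_name selection rule (match the end of the url in the metadata) can be sketched as follows. The table dicts and the function name are hypothetical, and requiring exactly one match is an assumption of this sketch rather than documented behaviour.

```python
def select_table(tables, for_table_name):
    """Pick the single table whose url ends with for_table_name."""
    matches = [t for t in tables if t.get("url", "").endswith(for_table_name)]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one table for {for_table_name!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


tables = [
    {"url": "data/customers.csv"},
    {"url": "data/orders.csv"},
]
print(select_table(tables, "orders.csv")["url"])
```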
Inferring Metadata
- tdda.serial.infer_format_from_flat_file(path, lines_to_use=None, warner=None, add_defaults=False, report_added_defaults=True, raise_error=False, delimiter=None, quote_char=None, escape=None, no_escape=False, stutter=None, null=None, encoding=None, date_format=None, datetime_format=None, quoting=None, **kw)
Infer SerialMetadata for a flat file by sampling its contents.
Reads a sample of the file and infers delimiter, quoting style, encoding, field types, date formats, and null indicators.
- Parameters:
path (str) – Path to the flat file (CSV or similar) to sample.
lines_to_use (int) – Number of lines to sample. Uses a default sample size if not specified.
warner – Optional callable for issuing warnings.
add_defaults (bool) – If True, include default values (e.g. standard encoding, quote char) in the inferred metadata even when they match the library default.
report_added_defaults (bool) – If True (the default), issue warnings when defaults are added. Only relevant when add_defaults is True.
raise_error (bool) – If True, raise an error on problems rather than issuing a warning.
delimiter (str) – Override the inferred field separator.
quote_char (str) – Override the inferred quote character.
escape (str) – Override the inferred escape character.
no_escape (bool) – If True, treat no escape character as given (do not infer one).
stutter (bool) – Override whether stutter (doubled-quote) escaping is used.
null (str) – Override the inferred null indicator string.
encoding (str) – Override the inferred file encoding.
date_format (str) – Override the inferred date format.
datetime_format (str) – Override the inferred datetime format.
quoting (str) – Override the inferred quoting style (e.g. 'QUOTE_MINIMAL', 'QUOTE_ALL').
**kw – Additional keyword arguments passed to MetadataInferrer.
- Returns:
SerialMetadata inferred from the file.
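To give a feel for what sampling-based inference involves, the sketch below uses csv.Sniffer from the Python standard library to guess the delimiter of a small sample. This is a stand-in technique, not the method used by infer_format_from_flat_file, and it covers only the delimiter, not encodings, types, date formats or null indicators.

```python
import csv
import io

# A small pipe-separated sample standing in for the first lines of a file.
sample = "id|name|joined\n1|Ada|2024-01-05\n2|Grace|2024-02-11\n"

# Restrict the search to a few plausible separators.
dialect = csv.Sniffer().sniff(sample, delimiters=",|\t;")

# Parse the sample using the sniffed dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))

print(repr(dialect.delimiter))
print(rows[0])
```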
Metadata Classes
- class tdda.serial.SerialMetadata(fields=None, path=None, encoding=None, delimiter=None, quote_char=None, escape_char=None, stutter_quotes=None, date_format=None, datetime_format=None, null_indicator=None, header_row_count=None, header_row=None, quoting=None, decimal_point=None, dps=None, accept_percentages_as_floats=None, map_missing_trailing_cols_to_null=None, true_values=None, false_values=None, thou_sep=None, dp=None, verbosity=2, libs=None, source=None, extra_kwargs='warn', **kw)
Metadata describing the format and structure of a flat file (CSV or similar). Corresponds to the 'tdda.serial' section of a .serial file.
All parameters are optional. Where not specified, library defaults (e.g. pandas.read_csv defaults) apply.
- Parameters:
fields (list or dict) – List of FieldMetadata objects, or a dict mapping CSV column names to field attribute dicts. Use a list when specifying all fields (complete schema); use a dict when specifying only a subset (partial schema), allowing extra fields in the file. OPTIONAL.
path (str) – Path to the associated flat file. OPTIONAL.
encoding (str) – Character encoding of the file (e.g. 'UTF-8', 'latin-1'). OPTIONAL.
delimiter (str) – Field separator character (e.g. ',', '|', '\t'). OPTIONAL.
quote_char (str) – Quote character used to wrap fields containing delimiters or newlines (e.g. '"'). OPTIONAL.
escape_char (str) – Escape character used within quoted strings (e.g. '\'). OPTIONAL.
stutter_quotes (bool) – If True, quotes within quoted strings are doubled rather than escaped (doublequote=True in pandas). OPTIONAL.
date_format (str) – Default date/datetime format for all date and datetime fields in the dataset. Can be a named format (e.g. 'iso8601-date', 'eu-datetime') or a strftime string. Overridden by per-field format if set. OPTIONAL.
null_indicator (str or list) – String(s) to interpret as NULL/NA values throughout the file. OPTIONAL.
header_row_count (int) – Number of header rows at the top of the file. 0 means no header. Defaults to 1. OPTIONAL.
header_row – TBC. OPTIONAL.
quoting (str) – CSV quoting style (e.g. 'QUOTE_MINIMAL', 'QUOTE_ALL'). Accepts Python csv module quoting constants by name or value. OPTIONAL.
decimal_point (str) – Character used as decimal separator (e.g. '.' or ','). OPTIONAL.
dps (int) – Default number of decimal places for float fields. TBC. OPTIONAL.
accept_percentages_as_floats (bool) – If True, values like '12.5%' are read as 0.125. OPTIONAL.
map_missing_trailing_cols_to_null (bool) – If True, short rows (fewer fields than expected) are padded with nulls rather than causing an error. Useful for Excel-generated CSVs. OPTIONAL.
true_values (str or list) – Default string(s) to interpret as True for bool fields across the dataset. OPTIONAL.
false_values (str or list) – Default string(s) to interpret as False for bool fields across the dataset. OPTIONAL.
thou_sep (str) – Thousands separator character. TBC. OPTIONAL.
dp – TBC. OPTIONAL.
verbosity (int) – Controls warning/error output. 0=silent, 1=errors only, 2=errors and warnings (default), 3=verbose.
libs (dict) – Library-specific parameter blocks (e.g. 'pandas.read_csv', 'polars.read_csv'). When present for a given library, these parameters are used directly instead of being derived from the tdda.serial section. OPTIONAL.
source – TBC. OPTIONAL.
extra_kwargs (str) – Controls handling of unrecognised keyword arguments. 'warn' (default) issues a warning, 'error' raises an error, 'allow' silently accepts.
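The difference between a list (complete schema) and a dict (partial schema) for fields can be sketched as a column check. This is an illustration only: plain column names stand in for FieldMetadata objects, the function name is hypothetical, and the exact matching rules (same order for a list, all named columns present for a dict) are assumptions.

```python
def columns_match_schema(fields, file_columns):
    """Check file columns against a complete (list) or partial (dict) schema."""
    file_columns = list(file_columns)
    if isinstance(fields, dict):
        # Partial schema: every described column must exist; extras are fine.
        return all(name in file_columns for name in fields)
    # Complete schema: the file must contain exactly these columns.
    return list(fields) == file_columns


columns = ["id", "name", "joined"]
print(columns_match_schema({"id": {"fieldtype": "int"}}, columns))
print(columns_match_schema(["id", "name"], columns))
```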
- single_date_format(warner=None)
Get a single date/time format from serial metadata.
Typically needed for write parameters. Uses the default format if set; otherwise looks at field formats and uses the modal value, or iso8601datetime if there is a tie.
- single_null_indicator(default='', warner=None)
Return a single null indicator string, for use when writing.
- write(path, use_serial_ext=True, indent=4, verbose=0, date_style=None)
Write metadata to a .serial file.
- Parameters:
path (str) – Output path. Changed to .serial extension unless use_serial_ext is False.
use_serial_ext (bool) – If False, keep the extension in path unchanged.
indent (int) – JSON indentation level (default 4).
verbose (int) – If non-zero, print the output path.
date_style (DateStyle) – Controls output date format style.
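The selection rule described for single_date_format (default if set, otherwise the modal field format, with a fallback on a tie) can be sketched with collections.Counter. The function below is a hypothetical illustration rather than the library's code, and using the fallback when no field has a format is an assumption.

```python
from collections import Counter


def pick_single_format(default, field_formats, fallback="iso8601datetime"):
    """Choose one date format: default, else the mode, else the fallback."""
    if default:
        return default
    counts = Counter(f for f in field_formats if f)
    if not counts:
        return fallback  # assumption: nothing to choose from
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return fallback  # tie between the most common formats
    return ranked[0][0]


print(pick_single_format(None, ["%d/%m/%Y", "%d/%m/%Y", "%Y-%m-%d"]))
print(pick_single_format(None, ["%d/%m/%Y", "%Y-%m-%d"]))
```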
- class tdda.serial.FieldMetadata(name, fieldtype=None, csvname=None, format=None, null_indicator=None, true_values=None, false_values=None, allow_extra_keys=False, description=None, thou_sep=None, dp=None, dps=None, examples=None, rdf_type=None, altnames=None, **kw)
Metadata for a single field (column) in a flat file.
- Parameters:
name (str) – Internal name for the field/column used in the resulting dataframe. Need not match the name in the file (see csvname). MANDATORY.
fieldtype (str) – Type of the field. Must be one of the values in FieldType: bool, int, float, number, string, date, datetime, datetime_tz, time, iso8601. OPTIONAL.
csvname (str) – Name of the column in the file, if different from name. OPTIONAL.
format (str) – Format of the field. Interpretation depends on fieldtype. For date/datetime: a named format (e.g. 'iso8601-date', 'eu-datetime') or a strftime string (e.g. '%d/%m/%Y'). For bool: a boolean format spec (e.g. 'yes|no'). Unambiguous because fieldtype is known. OPTIONAL.
null_indicator (str or list) – String(s) to interpret as NULL/NA in this field. Overrides the dataset-level null_indicator. OPTIONAL.
true_values (str or list) – String(s) to interpret as True for bool fields. OPTIONAL.
false_values (str or list) – String(s) to interpret as False for bool fields. OPTIONAL.
description (str) – Human-readable description of the field. OPTIONAL.
thou_sep (str) – Thousands separator character (e.g. ','). TBC. OPTIONAL.
dp – TBC. OPTIONAL.
dps (int) – Number of decimal places for float fields. TBC. OPTIONAL.
examples (list) – Example values for the field. TBC. OPTIONAL.
rdf_type (str) – RDF type URI for the field. TBC. OPTIONAL.
altnames (list) – Alternative names for the field (e.g. from CSVW titles). TBC. OPTIONAL.
- class tdda.serial.FieldType
Constants for the supported field types in tdda.serial metadata.
Use these values as the fieldtype parameter in FieldMetadata.
- class tdda.serial.DateFormat
Named date/datetime format constants for tdda.serial metadata.
These values can be used as the format parameter in FieldMetadata for date and datetime fields, or as the date_format/datetime_format parameter in SerialMetadata.
ISO8601 formats accept any ISO variant on read and write the canonical form. European and US formats use slash-separated day/month/year or month/day/year respectively.
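Why the European/US distinction matters can be seen with the standard library alone. The sketch below parses the same string with two strftime-style formats of the kind accepted by the format parameter; it does not use the named constants from DateFormat.

```python
from datetime import datetime

text = "03/04/2024"

# European convention: day/month/year.
eu = datetime.strptime(text, "%d/%m/%Y").date()

# US convention: month/day/year.
us = datetime.strptime(text, "%m/%d/%Y").date()

print(eu.isoformat())  # 3 April 2024
print(us.isoformat())  # 4 March 2024
```

The same text yields two different dates, which is why recording the format in metadata is safer than leaving it to inference.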
- class tdda.serial.DateStyle
Constants controlling the format style used for date strings.
Used when converting between date format representations. LITERAL uses human-readable names (e.g. 'YYYY-MM-DD'), YYYY uses four-digit-year notation, PERCENT uses strftime % codes.
Format Conversion
- tdda.serial.serial_to_csvw(md, name='data.csv')
Convert a SerialMetadata object to a CSVWMetadata object.
- Parameters:
md (SerialMetadata) – Metadata to convert.
name (str) – Data file name to record in the CSVW url field (default 'data.csv').
- Returns:
A broadly equivalent CSVWMetadata object.
- tdda.serial.serial_to_frictionless(md)
Convert a SerialMetadata object to a FrictionlessMetadata object.
- Parameters:
md (SerialMetadata) – Metadata to convert.
- Returns:
A broadly equivalent FrictionlessMetadata object.
- class tdda.serial.CSVWMetadata(spec=None, extensions=False, table_number=None, for_table_name=None, url=None, verbosity=2)
SerialMetadata subclass that reads CSVW metadata.
Imports metadata from a CSVW JSON file (typically foo-metadata.json for file foo.csv) into the SerialMetadata representation.
- Parameters:
spec (str or dict) – Path to a CSVW file (usually .json), or a dict of the form returned by json.load on a valid CSVW file. If None, minimal initialization is performed.
extensions (bool) – If True, accept tdda CSVW extensions.
table_number (int) – If set, use the nth table (indexed from zero) from a multi-table CSVW file.
for_table_name (str) – If set, select the table whose url ends with this name from a multi-table CSVW file.
url (str) – Override the url for the data file.
verbosity (int) – Controls warning/error output.
- Validation attributes (read-only):
_valid (bool): True if no errors were encountered.
_errors (list): Textual errors found while reading.
_warnings (list): Textual warnings generated while reading.
- get_context()
Read and validate the mandatory @context property.
CSVW files must have @context set to http://www.w3.org/ns/csvw (CSVW.CONTEXT). It may be stored as a plain string or as the first item in a list; when a list, the second element may be a dict with keys @base (a base URL for resolving other URLs) and/or @language (a language code such as en).
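The two permitted shapes of @context can be handled as in the sketch below. It is a hypothetical stand-alone function operating on a dict such as json.load would return, not the body of get_context.

```python
CSVW_CONTEXT = "http://www.w3.org/ns/csvw"


def parse_context(spec):
    """Validate @context and return (base, language), either possibly None."""
    context = spec.get("@context")
    if context is None:
        raise ValueError("@context is mandatory")
    extras = {}
    if isinstance(context, list):
        if not context:
            raise ValueError("@context list is empty")
        # An optional second element may carry @base and/or @language.
        if len(context) > 1 and isinstance(context[1], dict):
            extras = context[1]
        context = context[0]
    if context != CSVW_CONTEXT:
        raise ValueError(f"unexpected @context: {context!r}")
    return extras.get("@base"), extras.get("@language")


print(parse_context({"@context": CSVW_CONTEXT}))
print(parse_context({"@context": [CSVW_CONTEXT, {"@language": "en"}]}))
```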
- get_dialect()
Read the dialect from the CSVW spec into _dialect.
Reads from the first table's dialect section. Falls back to the dc:replaces block if no dialect section is present.
- get_schema_and_columns()
Set _schema and _columns from the CSVW spec.
- process_dialect()
Processes the dialect part of a CSVW specification.
https://w3c.github.io/csvw/metadata/#dfn-dialect-descriptions specifies the defaults for these as:
{ "encoding": "utf-8", "lineTerminators": ["\r\n", "\n"], "quoteChar": "\"", "doubleQuote": true, "skipRows": 0, "commentPrefix": "#", "header": true, "headerRowCount": 1, "delimiter": ",", "skipColumns": 0, "skipBlankRows": false, "skipInitialSpace": false, "trim": false }
which presumably means that a conformant CSV reader will use those settings if they are not specified in the CSVW file.
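Applying those defaults amounts to overlaying the dialect found in the file on top of the default values. A minimal sketch, with the defaults copied from the list above and a hypothetical function name; it is not the body of process_dialect.

```python
CSVW_DIALECT_DEFAULTS = {
    "encoding": "utf-8",
    "lineTerminators": ["\r\n", "\n"],
    "quoteChar": '"',
    "doubleQuote": True,
    "skipRows": 0,
    "commentPrefix": "#",
    "header": True,
    "headerRowCount": 1,
    "delimiter": ",",
    "skipColumns": 0,
    "skipBlankRows": False,
    "skipInitialSpace": False,
    "trim": False,
}


def effective_dialect(dialect=None):
    """Return the defaults with any explicitly specified settings applied."""
    return {**CSVW_DIALECT_DEFAULTS, **(dialect or {})}


merged = effective_dialect({"delimiter": ";", "encoding": "latin-1"})
print(merged["delimiter"], merged["encoding"], merged["quoteChar"])
```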
- read(spec)
Read the CSVW spec from file or dict and store in ._csvw.
- Parameters:
spec (str or dict) – Path to a CSVW file, or a dict of the form returned by json.load on a valid CSVW file.
- set_if_attr_non_null(d, key, attribute=None)
Set d[key] to self.<attribute> if non-null.
- Parameters:
d (dict) – Dictionary to update.
key (str) – Key to set.
attribute (str) – Attribute of self to look up; defaults to key.
- set_if_non_null(d, key, value)
Set d[key] = value if value is not None.
- Parameters:
d (dict) – Dictionary to update.
key (str) – Key to set.
value – Value to assign.
- class tdda.serial.FrictionlessMetadata(spec=None, extensions=False, table_number=None, for_table_name=None, verbosity=2)
SerialMetadata subclass that reads Frictionless metadata.
Imports metadata from a Frictionless YAML or JSON file (typically foo.resource.yaml or similar for file foo.csv) into the SerialMetadata representation.
- Parameters:
spec (str or dict) – Path to a Frictionless file (.yaml or .json), or a dict of the form returned by loading a valid Frictionless file. If None, minimal initialization is performed.
extensions (bool) – If True, accept tdda Frictionless extensions.
table_number (int) – If set, use the nth table (indexed from zero) from a multi-table Frictionless file.
for_table_name (str) – If set, select the table whose path ends with this name from a multi-table Frictionless file.
verbosity (int) – Controls warning/error output.
- Validation attributes (read-only):
_valid (bool): True if no errors were encountered.
_errors (list): Textual errors found while reading.
_warnings (list): Textual warnings generated while reading.
- get_dialect()
Read the dialect from the Frictionless spec.
Supports flat v4 dialect (CSV keys at top level) and the older nested {'csv': {...}} form. Falls back to dc:replaces if no dialect section is present.
- get_schema_and_fields()
Set _schema and _fields from the Frictionless spec.
Handles three layouts: resource inside a package, standalone resource with a schema key, or bare schema without a wrapper.
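The three layouts can be told apart by their top-level keys, as in the sketch below. It relies on the usual Frictionless key names (resources for a package, schema for a resource, fields for a bare schema); the function itself is hypothetical and is not the library's implementation.

```python
def find_schema(spec, table_number=0):
    """Locate the schema dict in any of the three supported layouts."""
    if "resources" in spec:
        # Package: the schema lives inside one of the resources.
        return spec["resources"][table_number]["schema"]
    if "schema" in spec:
        # Standalone resource with a schema key.
        return spec["schema"]
    if "fields" in spec:
        # Bare schema without a wrapper.
        return spec
    raise ValueError("no schema found in Frictionless spec")


schema = {"fields": [{"name": "id", "type": "integer"}]}
package = {"resources": [{"path": "data.csv", "schema": schema}]}
resource = {"path": "data.csv", "schema": schema}

print(find_schema(package) == find_schema(resource) == find_schema(schema))
```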
- read(spec)
Read the Frictionless spec from file or dict into ._frictionless.
- Parameters:
spec (str or dict) – Path to a Frictionless file (.yaml or .json), or a dict of the form returned by loading one.
- set_if_attr_non_null(d, key, attribute=None)
Set d[key] to self.<attribute> if non-null.
- Parameters:
d (dict) – Dictionary to update.
key (str) – Key to set.
attribute (str) – Attribute of self to look up; defaults to key.
- set_if_non_null(d, key, value)
Set d[key] = value if value is not None.
- Parameters:
d (dict) – Dictionary to update.
key (str) – Key to set.
value – Value to assign.
Low-level Arguments
- tdda.serial.serial_to_pandas_read_csv_args(md, backend=None, warner=None, config=None)
Convert SerialMetadata to keyword arguments for pandas.read_csv.
- Parameters:
md (SerialMetadata) – Metadata describing the CSV file format.
backend (str) – Pandas dtype backend, e.g. 'numpy_nullable' or 'pyarrow'.
warner – Optional callable for issuing warnings.
config – Optional tdda configuration object.
- Returns:
Dict of keyword arguments suitable for passing to pandas.read_csv.
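The kind of translation involved can be sketched with a small hand-written mapping from SerialMetadata attribute names to pandas.read_csv parameter names. The mapping is an illustration, not the library's actual table; only stutter_quotes to doublequote is stated in this documentation, and the rest are assumed from the parameter descriptions. The real function also handles field types, dates and more.

```python
# Hand-written illustration: SerialMetadata attribute -> pandas.read_csv kwarg.
PANDAS_READ_CSV_NAMES = {
    "delimiter": "sep",
    "quote_char": "quotechar",
    "escape_char": "escapechar",
    "stutter_quotes": "doublequote",
    "encoding": "encoding",
    "null_indicator": "na_values",
    "decimal_point": "decimal",
    "thou_sep": "thousands",
}


def to_read_csv_kwargs(md_attrs):
    """Translate set (non-None) attributes into pandas.read_csv kwargs."""
    return {
        PANDAS_READ_CSV_NAMES[name]: value
        for name, value in md_attrs.items()
        if name in PANDAS_READ_CSV_NAMES and value is not None
    }


kwargs = to_read_csv_kwargs(
    {"delimiter": "|", "quote_char": None, "stutter_quotes": True}
)
print(kwargs)
```

Attributes left as None are omitted, so the pandas defaults apply for anything the metadata does not specify.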
- tdda.serial.serial_to_polars_read_csv_args(md, warner=None, serializable=False, map_other_bools_to_string=False, backend=None)
Convert SerialMetadata to keyword arguments for polars.read_csv.
- Parameters:
md (SerialMetadata) – Metadata describing the CSV file format.
warner – Optional callable for issuing warnings.
serializable (bool) – If True, return only JSON-serializable values (e.g. dtype repr strings instead of Polars types).
map_other_bools_to_string (bool) – If True, boolean fields whose metadata specifies non-true/false values are mapped to string type.
backend – Unused; accepted for API consistency with the pandas equivalent.
- Returns:
Dict of keyword arguments suitable for passing to polars.read_csv.