tdda.serial: Metadata and Tools for Flat (“CSV”) Files
A .serial file (also known as a tdda.serial file)
is a JSON file describing the format of one or more
flat (“CSV”) files. It is the primary metadata format
used by the tdda library, which includes various tools
for working with .serial metadata in the tdda.serial module.
The primary goal of tdda.serial is to allow data in flat files to be read
more accurately and to serve as documentation of how data
has been written to flat files.
The current main features of tdda.serial are:
A file format for describing key aspects of CSV and other “flat” files, including things like separators, quoting, escaping and nulls, together with field names, types, and formats.
Facilities for using
.serialfiles when reading and writing flat files with (currently) Pandas and Polars in Python (csv_to_pandas,csv_to_polars,pandas_to_csv,polars_to_csv, as well as tools for generating keyword arguments forread_csvfunctions and Python code for reading flat files based on a metadata specification).Tools allowing conversion between
tdda.serialfiles and other metadata formats—currently CSVW and Frictionless.Facilities for generating inferred
tdda.serialfiles from a flat file.A command-line interface for most of the above functionality (
tdda serial).The ability for various tdda tools to be guided by a
tdda.serialfiles when reading flat files for things like constraint generation, data validation, and dataset comparison withtdda diff. (In general, this can be achieved by using eitherfoo.csv:as a filename specifier, which tellstddato look for metadata based on filename conventions, orfoo.csv:md.serialto use metadata inmd.serial.)
A single .serial file can describe a format shared by many concrete
flat files, unlike CSVW, which is primarily concerned with describing
specific named files. In this respect, it is more similar to a
Frictionless schema, as opposed to a Frictionless resource or package.
A .serial file can also contain library-specific sections (for
pandas.read_csv, polars.read_csv etc.) alongside or instead of the
tdda.serial section.
In general, it is preferable to use the tdda.serial section;
the other sections are intended primarily for cases in which:
Something is not capable of being expressed in
tdda.serial;The intention is only to use a single library for reading/writing, and higher fidelity may be guaranteed by using a custom section.
The tdda.serial File Structure
A .serial file is a JSON object with the following top-level keys:
{
"format": "http://tdda.info/ns/tdda.serial",
"writer": "tdda.serial-3.0.0",
"tdda.serial": { ... },
"pandas.read_csv": { ... },
"pandas.DataFrame.to_csv": { ... },
"polars.read_csv": { ... },
"polars.DataFrame.write_csv": { ... },
"python.csv.reader": { ... },
"python.csv.writer": { ... }
}
In fact, any key may be used for a custom format specification for some flat-file reader or writer.
Top-level Keys
format (string, required)
: Must be "http://tdda.info/ns/tdda.serial". Identifies the file as
a tdda.serial file. In future, this string may include numbered
versions of tdda.serial.
writer (string, optional)
: The name and version of the software that wrote the file,
e.g. "tdda.serial-3.0.0".
tdda.serial (object, optional)
: The primary metadata section, described in detail below.
pandas.read_csv, pandas.DataFrame.to_csv (objects, optional)
: Library-specific sections containing keyword arguments for the
corresponding Pandas functions, stored verbatim.
polars.read_csv, polars.DataFrame.write_csv (objects, optional)
: Library-specific sections containing keyword arguments for the
corresponding Polars functions, stored verbatim.
python.csv.reader, python.csv.writer (objects, optional)
: Library-specific sections for the Python standard library csv module.
When multiple sections are present, the tdda.serial library chooses
which to use based on a preference order (typically tdda.serial
first, then library-specific sections). A preferred section can also
be specified explicitly with both the API and the command-line tools.
Example
{
"format": "http://tdda.info/ns/tdda.serial",
"tdda.serial": {
"encoding": "UTF-8",
"delimiter": "|",
"quote_char": "\"",
"escape_char": "\\",
"stutter_quotes": false,
"null_indicator": "",
"date_format": "YYYY-MM-DD",
"datetime_format": "DD/MM/YYYY HH:MM:SS",
"header_row_count": 1,
"header_row": 0,
"fields": [
{
"name": "id",
"fieldtype": "int"
},
{
"name": "name",
"fieldtype": "string"
},
{
"name": "joined",
"fieldtype": "date"
},
{
"name": "last_seen",
"fieldtype": "datetime",
"format": "MM/DD/YYYY HH:MM:SS"
}
]
}
}
The tdda.serial Section
This section contains a dataset-level description of the flat file format, with optional per-field overrides.
Dataset-level Keys
All dataset-level keys are optional.
encoding (string)
: The text encoding of the file. Any Python codec name is accepted,
e.g. "UTF-8", "latin-1".
(Suggested default: "UTF-8".)
delimiter (string)
: The field separator character. (Suggested default: ",".)
quote_char (string)
: The character used to quote fields string data or, in some cases,
other data, especially data that contains the delimiter or
the field separator.
Almost always double quote or single quote.
(Suggested default: ", written as "\"" in JSON.)
escape_char (string)
: When this is set to backslash (the only value it is likely to take,
written as "\\" in JSON)
this means that quote characters in quoted strings are escaped
with backslash "like \"this\" example". Backslash also escapes itself,
and may be used to escape separators, whether quoted or not.
It has no significance for common control escapes, which
may appear as \n, \r, \t, \f in files regardless
of this setting.
Overescaping is common (escaping characters that don't
need to be escaped in context) and this setting has no
particular bearing on the cases. Most readers simply remove
unnecessary escape characters.
(Suggested default: "\\".)
stutter_quotes (boolean)
: If true, the quote character is escaped by doubling it,
"like ""this"" example".
If false, the escape character is used to escape embedded quotes.
"""this""" would be used to represent the word this in double
quotes when stutter is true.
(Suggested default: false.)
null_indicator (string or array of strings)
: The string or strings used to represent null/missing values.
(Suggested Default: empty string, written as "" in JSON.)
date_format (string)
: The default format for date fields that have no per-field
format. Also used as the fallback for datetime fields if
datetime_format is not set. Accepts named formats (see
Date and Datetime Formats) and Python
strftime strings. (Suggested default: "iso8601-date".)
datetime_format (string)
: The default format for datetime fields that have no per-field
format. If not set, date_format is used as the fallback.
Accepts named formats or Python strftime strings.
header_row_count (integer)
: The number of header rows at the top of the file. Default: 1.
header_row (integer)
: The zero-based index of the row containing column names, within the
header rows. Default: 0.
accept_percentages_as_floats (boolean)
: If true, values like "12.5%" are read as 0.125.
(Suggested default: false.)
map_missing_trailing_cols_to_null (boolean)
: If true, rows with fewer fields than expected (as Excel sometimes
produces) have missing trailing fields treated as null.
(Suggested default: false.)
decimal_point (string)
: The character used as the decimal point in numeric fields.
(Suggested default: ".").
thou_sep (string)
: The thousands separator character in numeric fields, e.g. ",".
Optional; no suggested default.
true_values (string or array of strings)
: String values to interpret as true for boolean fields, in addition
to the library defaults. Applies to all bool fields unless
overridden for a given field.
false_values (string or array of strings)
: String values to interpret as false for boolean fields, in addition
to the library defaults. Applies to all bool fields unless
overridden for a given field.
quoting (string)
: Specifies which values are quoted
There is little consistency about which values are quoted in flat files.
Python's csv module defines constants for various options, but fail
to include one of the most important. The tdda.serial format allows
quoting to be set to any of the values supported by Python csv and
also to one other common scheme. The values from Python's csv library
are:
QUOTE_ALL: all values are quoted.QUOTE_MINIMAL: Only values that contain special characters, such as the field delimiter, or literal newlines.QUOTE_NONNUMERIC: All non-numeric values (including nulls) are quoted.QUOTE_NONE: No values are quoted (escape is used for special characters).QUOTE_NOTNULL: All non-null values are quoted (including numbers)QUOTE_STRINGS: This is the same as QUOTE_NONNUMERIC except that null values are not quoted.
The extra value supported by tdda.serial is
QUOTE_STRINGS_ONLY: Only string values are quoted: nulls and values of all non-string types are not quoted. (This is essentially what JSON does, except that JSON has no date/datetime types, so those are usually stored as strings, and therefore quoted.)
Optional. No default.
path (string)
: Optional path to a datafile. Not usually populated since .serial
data usually applies to a set of datafiles rather than a single one.
Can be relative or absolute. The path should be considered advisory
rather than definitive, i.e. there is no problem at all to use
a .serial file specifying a particular flat file with a different
flat file. Indeed, this is common.
Note: since path is only really for information, it can be set
to a glob (wildcard) pattern like '.csv' or 'foo.csv' to indicate
a set of files. @ can also be used as the wildcard
to match tdda.serial filename conventions (see below).
fields (array or object)
: Descriptions of the fields in the file. See
The fields Entry.
The fields Entry
Fields can be specified as either an array or an object (dictionary).
Array form — the field list is taken to be complete and ordered. Fields appear in the order they occur in the file. Each entry is a field object (described below).
"fields": [
{"name": "id", "fieldtype": "int"},
{"name": "date", "fieldtype": "date"}
]
Object (dictionary) form — the keys are the names of the fields as they appear in the file (the external names), and the values are field objects. This form is used for partial specifications, where only some fields are described, and/or where internal names differ from external names. Fields may appear in any order, and additional fields in the file are permitted.
"fields": {
"commission date": {"name": "DateOfCommission", "fieldtype": "date"},
"qa passed?": {"name": "PassedQA", "fieldtype": "bool",
"true_values": "yes", "false_values": "no"}
}
NOTE (field sortedness):
JSON Objects (dictionaries) are formally unordered, i.e. the
order in which the entries appear has no semantic significance.
The object form is preferred when a .serial file is not making a
claim about the order of fields in a flat file. Despite this,
all recent versions of Python read JSON objects into Python dictionaries
preserving key order. The tdda.serial library always preserves order
when reading and writing serial files, so in practice the order in the
.serial file will always match the order in the flat file
if tdda.serial is used to generate the .serial file.
Conversely, in the rare cases that data is written without a header,
the table form is preferred because this unambiguously specifies field
order. In practice, the object/dictionary form will also work provided
the .serial has the field entries in the same order as the data file.
Field Keys
name (string, required in array form)
: The internal name for the field — the column name used in the
resulting dataframe. In array form this is mandatory. In object form
it can be omitted if the internal name is the same as the external
(dictionary key) name. This is useful if there is a desire to replace
some or all of the names used in the flat file with more convenient
ones during processing.
fieldtype (string)
: The type of the field, from the table below.
Value |
Description |
|---|---|
|
Boolean |
|
Integer |
|
Floating-point |
|
Unspecified numeric (integer or float) |
|
Text |
|
Date (no time component) |
|
Date and time |
|
Date and time with timezone |
|
Time only |
|
ISO 8601 date or datetime (unspecified) |
csvname (string)
: The name of the column in the file, if different from name. Used
in array form when the file's column name differs from the desired
internal name.
format (string)
: For date and datetime fields: the format of the date or datetime
values in this field. Overrides date_format / datetime_format.
Accepts named formats or Python strftime strings (see
Date and Datetime Formats).
For bool fields: a boolean format specification (e.g. "yes|no").
null_indicator (string or array of strings)
: Null indicator(s) for this field, overriding the dataset-level
null_indicator.
true_values (string or array of strings)
: True value(s) for this bool field, overriding the dataset-level
true_values.
false_values (string or array of strings)
: False value(s) for this bool field, overriding the dataset-level
false_values.
description (string)
: A human-readable description of the field.
TDDA Serial Date and Datetime Formats
ISO8601
ISO8601 is the most widely understood, unproblematical format for dates and times and is recommended for new data written with dates and datetimes.
This can be specified in tdda serial as follows.
iso8601-datecan be used for ISO8601 dates (2000-12-31)iso8601-datetimecan be used for ISO8601 datetimes (2000-12-31T12:34:56)iso8601-datetime-tzcan be used for ISO8601 datetimes with timezone (2000-12-31T12:34:56+00:00) etc.iso8601can be used to indicate “any of the above”.
With the ISO8601 variants, reading is liberal, allowing some variation in separators, whereas writing is strict.
The tdda.serial format allows date formats to be specified in three
other ways:
Dates can be specified using:
YYYY for four-digit years
YY for two-digit years
MM for numeric month
DD for day
HH for hour
MM for minute (Same as month! That's OK. Context disambiguates.)
MON for a spelt-out three letter month like Jan
MONTH for a spelt-out full month like January
SS for whole seconds
SS.S, SS.SS etc. (any number of S's after period) for fractional seconds.
+ZZ:ZZ or +ZZZZ for timezone. (Can use
-instead of+)AM or PM for 12-hour clock AM/PM indicator
Any case may be used. MM for minutes and month is disambiguated by saying that MM when adjacent to YY or DD (other than a separator) is a month, and when adjacent to SS or HH indicates minutes.
For example:
YYYY-MM-DDTHH:MM:SS.S+ZZ:ZZYYYY-MM-DDTHH:MM:SS.S-ZZ:ZZYYYY-MM-DD HH:MM:SS+ZZZZYYYY-MM-DDTHH:MM:SSYYYY-MM-DD HH:MMYYYY-MM-DDYY-MM-DDYY-MM-DDTHH:MM:SS.SPMYY-MM-DDTHH:MM:SSAMDD/MM/YYYY HH:MM:SSMM/DD/YYYY HH:MM:SSMM.DD.YY HH:MM:SSAM
The string will always be standardized to upper case on write, but is
case insensitive on read, for .serial files. (For CSVW files,
mixed case is used, following CSVW's conventions).
Any unambiguous date or datetime in the specified format can be used as a specifier. By unambiguous, we mean:
The day is at least 13
The year is either four digits or 60 or greater or 00 So
2000-12-31T12:34:56.789+000031/12/2000T12:34:56.789+00:0031-12-2000T12:34:56.789-000031.12.2000T12:34:56-00:0031/12/2000T12:34:5631/12/2000 12:34:5631/12/200012/31/200012/31/0031/12/0031 Dec 200031 December 2000Dec 31 002000-12-31T12:34:56.789AM2000-12-31T12:34:56.789PM
etc. are all acceptable. Things like
2000-02-01T12:34:56.789+000001/02/2000T12:34:56.789+00:0001-02-2000T12:34:56.789+000001.02.2000T12:34:56+000001/02/2000T12:34:5601/02/2000 12:34:5601/02/200002/01/200002/01/0001/02/0001 Dec 2222 Dec 01
are not, because they are ambiguous.
When reading .serial files the tdda library will accept
any unambiguous date and reject anything it considers ambiguous.
On writing, it will replace any other specific date with
components from 2000-12-31T12:34:56.789+0000.
Strings usable by Python
strftimecan be used. These are typified by%Y-%m-%dT%H:%M:%S.
On writing new .serial files, the library will default to named
iso8601 formats when the format is conformant with ISO8601,
and will then choose the YYYY/MM/DD type specifier or
strftime-conforming strings.
You can specify a preferred form at the command line with –use-literal-dates, –use-yyyy-dates, –use-pc-dates.
ISO 8601 Formats
Named Format |
|
Example |
|---|---|---|
|
|
|
|
|
|
|
|
|
|
Any ISO 8601 date or datetime |
Any of the above |
On read, ISO8601 formats should ideally accept / and .
as separators for date components in addition to - (e.g. 2000/12/31, 2000.12.31),
should accept space (" ") instead of T,
and should also accept fractional seconds on times.
Format Precedence
For each date or datetime field, the format is determined as follows:
The field's own
formatkey, if present.For
datefields: the dataset-leveldate_format, if set. Fordatetimefields: the dataset-leveldatetime_format, if set.The other dataset-level format (
date_formatfalling back todatetime_formatand vice versa).The library default (typically ISO 8601).
Library-specific Sections
In addition to, or instead of, the tdda.serial section, a .serial
file can contain sections with library-specific keyword arguments.
These are stored verbatim and used directly when reading or writing
with the corresponding library.
For example, a pandas.read_csv section contains exactly the keyword
arguments to be passed to pandas.read_csv:
{
"format": "http://tdda.info/ns/tdda.serial",
"pandas.read_csv": {
"sep": "|",
"encoding": "UTF-8",
"dtype": {"id": "Int64", "name": "string"},
"parse_dates": ["joined"],
"date_format": {"joined": "%d/%m/%Y"},
"na_values": [""],
"keep_default_na": false
}
}
When both tdda.serial and library-specific sections are present, the
library chooses which to use based on a preference order, or an
explicit preference can be specified.
Supported Format Abbreviations
When specifying formats to the tdda serial command-line tool
(with --to), the
following abbreviations are accepted:
Abbreviation |
Full name |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Reading Data Using tdda.serial Files and other Metadata Specifications
The tdda library provides several mechanisms for assisting the reading
of flat files as guided by tdda.serial, or by CSVW or Frictionless files,
both as command-line tools and using APIs.
In illustrating the tdda.serial functionality, many of the examples
will use this small flat file, which is in a deliberately obscure
format (though all of its features are individually not particularly
uncommon).
b;i;f;s;t
'n';0;0.5;;31/01/1970
.;.;.;.;.
'Yes';1;1.5;'a';31/12/1999
and the following (starting) tdda.serial file:
{
"format": "http://tdda.info/ns/tdda.serial",
"writer": "tdda.serial-2.2.15",
"tdda.serial": {
"fields": [
{
"csvname": "b",
"name": "IAmBoolean",
"fieldtype": "bool",
"true_values": [
"Yes",
"y"
],
"false_values": [
"No",
"n"
]
},
{
"csvname": "i",
"name": "IAmInt",
"fieldtype": "int"
},
{
"name": "f",
"fieldtype": "float"
},
{
"csvname": "s",
"name": "IAmString",
"fieldtype": "string"
},
{
"csvname": "t",
"name": "IAmDate",
"fieldtype": "datetime",
"format": "%d/%m/%Y"
}
],
"encoding": "latin-1",
"delimiter": ";",
"quote_char": "'",
"escape_char": "`",
"null_indicator": ".",
"header_row_count": 1,
"header_row": 0,
"path": "docdata.txt"
}
}
When read correctly in Pandas, this produces (with the nullable backend
for Pandas to which tdda.serial defaults):
IAmBoolean IAmInt f IAmString IAmDate
0 False 0 0.5 1970-01-31
1 <NA> <NA> <NA> <NA> NaT
2 True 1 1.5 a 1999-12-31
dtypes: boolean Int64 Float64 string datetime64[ns]
Reading Flat Files with Metadata from Python using the API
The simplest way to read a flat file with accompanying metadata
is to use the csv_to_x functions. In its simplest forms,
from tdda.serial import csv_to_pandas
df1 = csv_to_pandas('docdata.txt', 'docdata.serial')
df2 = csv_to_pandas('docdata.txt:docdata.serial')
df3 = csv_to_pandas('docdata.txt:')
df4 = csv_to_pandas('docdata.txt', find_md=True)
all do the same thing, reading the flat file docdata.txt using the metadata
specification in docdata.serial. The last two forms are only available when the
.serial file has the same stem name as the flat file.
The pandas dtype back end can also be passed in, e.g.
from tdda.serial import csv_to_pandas
dfb = csv_to_pandas('docdata.txt', 'docdata.serial', backend='original')
print(dfb)
for the original Pandas dtype back end.
Similarly
from tdda.serial import csv_to_polars
df1 = csv_to_polars('docdata.txt', 'docdata.serial')
df2 = csv_to_polars('docdata.txt:docdata.serial')
df3 = csv_to_polars('docdata.txt:')
df4 = csv_to_polars('docdata.txt', find_md=True)
are the equivalent forms for Polars.
NOTE (polars.read_csv):
Whereas csv_to_pandas largely just translates metadata settings
into arguments for pandas.read_csv, the corresponding polars.read_csv
function is less flexible, and there are a number of kinds of flat files
that it struggles to read accurately. A particular example of this is
that polars can only read ISO8601-formatted dates and datestamps.
The tdda.serial csv_to_polars works around this by instructing
polars to read fields it cannot parse with read_csv as strings
and then post-processing them to convert them to date or datetime fields.
Metadata Matching and @ wildcards
When a colon is added to the end of a flat file name or path
to request that tdda.serial finds matching metadata,
and then find_md=True is passed into relevant API calls,
tdda's matching process for a file foo.ext is as follows:
It first looks for
foo.ext.serial(in the same directory asfoo.ext).It then looks for
foo.serial(in the same directory asfoo.ext). (This the most common pattern;foo.ext.serialis checked first to allow matching of metadata when the same stem appears with multiple extensions, e.g..csvand.psv)Failing that, it looks for any wildcard matches using
@as a wildcard similar to how*is used in globbing, i.e.@matches any characters or no characters. So for example any of@.serialfoo@.serial@foo.serial@f@o@.serial
will match, but none of
Foo.serialfool*.serialf0*.serial
will do so. If a single metadata file containing
@matches, that will be used. If multiple.serialfiles with wildcards match, an error will be raised.Next, the following are checked:
foo-metadata.jsonfoo-csvmetadata.jsonfoo-csv-metadata.jsonfoo.csvmetadatafoo.csv-metadata
These are common patterns for CSVW, and will be used as CSVW if the
@contextattribute indicates csvw, and astdda.serialif theformatattribute indicates that.Common Frictionless patterns are explored. Frictionless usually used either
.yamlor.json, and includes either.resource,.package, or.schemain the filename before it, so any of:foo.resource.jsonfoo.package.jsonfoo.schema.jsonfoo.resource.yamlfoo.package.yamlfoo.schema.yaml
will match.
Pandas dtype Back Ends
Pandas has three different back ends for storing data.
Its original one uses floating-point columns whenever integer fields
contain null (na) values, and uses the object type for strings,
as well as for various other things.
More recent versions of Pandas have introduced support for
nullable integers (e.g. Int64) and a proper string type.
This backend is called numpy_nullable.
Even more recently, Pandas has also added an (Apache) pyarrow backend,
which also has (different) nullable integers and string types.
A single DataFrame can contain columns with any mixture of these types.
The tdda.serial library supports all of these backends, but defaults
to the nullable_numpy backend. In general, alternate backends may be
specified with either the command-line switch --backend or -B,
with values:
oororiginalfor the original back end,nornumpy_nullablefor thenumpy_nullableback end, andaorpyarrowfor thepyarrowback end.
The API generally uses a backend keyword argument on calls to specify
the backend when appropriate.
CSVW and Frictionless data may also be used in place of .serial files. In the case of CSVW, the most common file name format is:
docdata-metadata.json
for data docdata.csv, and one of
docdata.resource.yamldocdata.resource.jsondocdata.package.yamldocdata.package.json
for Frictionless.
If the metadata specifies the name of the flat file (which is usual for csvw and Frictionless, and allowed for tdda.serial) then the metadata file itself can be specified instead. For example:
from tdda.serial import csv_to_polars, csv_to_pandas
df1 = csv_to_pandas('docdata.serial')
df2 = csv_to_polars('docdata-metadata.json')
df3 = csv_to_polars('docdata.resource.yaml')
would use three different metadata files that point to data to read them into three dataframes, the first using Pandas and the last two using Polars.
NOTE: (full function arguments)
There are more optional parameters to the csv_to_pandas
and csv_to_polars functions;
see the detailed API documentation for details.
Using the tdda serial Command to Assist with Reading Flat Files
The tdda serial command can generate Python code or sets of keyword
arguments for read_csv methods from Pandas or Polars.
tdda serial docdata.serial docdata_pandas.py --to pd.r
This command generates stand-alone Python code docdata_pandas.py containing
a function for reading flat file in the format specified by docdata.serial
with Pandas, taking the path to the datafile as an argument.
(Standalone, here, means code that does not require the tdda library.)
tdda serial docdata.serial docdata_polars.py --to pl.r
This command generates Python code docdata_polars.py containing a function
for reading flat file in the format specified by docdata.serial
with Polars, taking the path to the datafile as an argument.
The --for parameter can be added so that the generated Python
calls the function with the appropriate inpath. For example:
tdda serial docdata.serial docdata_polars2.py --to pl.r --for docdata.txt
Writing Data with tdda.serial (API)
Just as csv_to_pandas and csv_to_polars read flat files
into dataframes for the two libraries, pandas_to_csv and
polars_to_csv write data.
Metadata files can have two roles in this process:
Metadata files can be written with the flat data to specify the format used (these can be
.serialfiles, or CSVW files, or Frictionless files).Metadata files can be used to specify the write format.
These two roles can be combined.
The pandas_to_csv function
The simplest form for writing a Pandas dataframe to CSV with metadata is:
from tdda.serial import pandas_to_csv, csv_to_pandas
df = csv_to_pandas('docdata.txt:')
pandas_to_csv(df, 'docdata1pd.csv', md_outpath='docdata1pd.serial')
This will write the Pandas dataframe df to docdata1pd.csv using
df.to_csv, with default settings,
and writing accompanying metadata to docdata1pd.serial.
Specific write formatting parameters can be passed directly to
to_csv as keyword arguments, and these will also, where appropriate,
affect the written metadata in the .serial file. For example:
from tdda.serial import pandas_to_csv, csv_to_pandas
df = csv_to_pandas('docdata.txt:')
pandas_to_csv(df, 'docdata2pd.csv', md_outpath='docdata2pd.serial',
sep='|', na_rep='NULL', quotechar="'")
will write the data using a pipe separator (|), single quotes ('),
and NULL as the null marker, and the resulting .serial file
will document those settings.
Instead of providing settings as keyword parameters, a metadata file
can be used to determine the flat-file write settings using the
md_inpath argument. So if we want to write the data in a DataFrame
using the metadata in docdata.serial, we can use:
from tdda.serial import pandas_to_csv, csv_to_pandas
df = csv_to_pandas('docdata.txt:')
pandas_to_csv(df, 'docdata3pd.csv', md_inpath='docdata.serial')
This does not write a serial file.
It is occasionally useful to specify both an in_metadata
and an out_metadata path, like this:
from tdda.serial import pandas_to_csv, csv_to_pandas
df = csv_to_pandas('docdata.txt:')
pandas_to_csv(df, 'docdata4pd.csv', md_inpath='docdata.serial',
md_outpath='docdata4pd.serial')
The reasons for wanting to write metadata when a metadata file is used to specify the write format might include:
The input and output metadata formats might be different (e.g.
tdda.serialin and CSVW out)A central metadata file in one location might be used to specify the format, and the output format file might be in a different location, e.g. by the data file written
The output metadata can include a pointer to the specific dataset. This is normal in the case of CSVW and Frictionless, and optional in the case of
tdda.serial.
The polars_to_csv function
The polars function works in the same way as its Pandas counterpart.
The simplest form for writing a polars dataframe to CSV with metadata is:
from tdda.serial import polars_to_csv, csv_to_polars
df = csv_to_polars('docdata.txt:')
polars_to_csv(df, 'docdata1pl.csv', md_outpath='docdata1pl.serial')
This will write the Polars dataframe df to docdata1pl.csv using
df.write_csv, with default settings,
and writing accompanying metadata to docdata1pl.serial.
Specific write formatting parameters can be passed directly to
to_csv as keyword arguments, and these will also, where appropriate,
affect the written metadata in the .serial file. For example:
from tdda.serial import polars_to_csv, csv_to_polars
df = csv_to_polars('docdata.txt:')
polars_to_csv(df, 'docdata2pl.csv', md_outpath='docdata2pl.serial',
sep='|', na_rep='NULL', quotechar="'")
will write the data using a pipe separator (|), single quotes ('),
and NULL as the null marker, and the resulting .serial file
will document those settings.
Instead of providing settings as keyword parameters, a metadata file
can be used to determine the flat-file write settings using the
md_inpath argument. So if we want to write the data in a DataFrame
using the metadata in docdata.serial, we can use:
from tdda.serial import polars_to_csv, csv_to_polars
df = csv_to_polars('docdata.txt:')
polars_to_csv(df, 'docdata3pl.csv', md_inpath='docdata.serial')
This does not write a serial file.
It is occasionally useful to specify both an in_metadata
and an out_metadata path, like this:
from tdda.serial import polars_to_csv, csv_to_polars
df = csv_to_polars('docdata.txt:')
polars_to_csv(df, 'docdata4pl.csv', md_inpath='docdata.serial',
md_outpath='docdata4pl.serial')
The reasons for wanting to write metadata when a metadata file is used to specify the write format might include:
The input and output metadata formats might be different (e.g.
tdda.serialin and CSVW out)A central metadata file in one location might be used to specify the format, and the output format file might be in a different location, e.g. by the data file written
The output metadata can include a pointer to the specific dataset. This is normal in the case of CSVW and Frictionless, and optional in the case of
tdda.serial.
NOTE: There are more optional parameters to the polars_to_csv
and csv_to_polars functions;
see the detailed API documentation for details.
Generating tdda.serial Files by Inference from a Flat File
Experimental Functionality
The tdda.serial library can also read a flat file and attempt
to infer its format and write it to a tdda.serial or other
metadata file using the --generate (or -g) switch.
This is an inherently heuristic process and results cannot be guaranteed
to be correct.
The general approach used in tdda's format inference is only to write
that for which there is evidence in the data, i.e. it does not write
default values to the tdda.serial or other metadata files. So for example,
if there are no quote characters used in a datafile, the quote character
will not be set. Similarly, if the data is pure ASCII, no encoding will
be written to the file.
Basic usage is as follows:
tdda serial --generate docdata.txt docdata-inferred.serial
Warnings will be issued in some cases if the inference process has
significant ambiguity, and the process may fail if tdda.serial cannot
infer the format.
Overrides for most top-level parameters can be provided with switches.
These can be used as hints, e.g. specifying an encoding where in cases
where this cannot be inferred, or to select alternatives if, for example,
the inference process infers iso-8859-1 (latin-1) but the data is in fact
iso-8859015 (latin-9) or cp1252. (These encodings are hard to
differentiate.)
The overrides will normally be accepted except in cases where it is too obviously in conflict with the data, which complicates further inference.
The available Override switches are:
--sep C, --delimiter C
: Set field delimiter to C
--quote-char Q, --quote Q
: Set the quote character to Q (normally straight single or double quote)
--escape
: Set the escape character to backslash (no argument)
--no-escape
: Do not set the escape character
--stutter
: Specify stuttering for embedded quotes
--no-stutter
: Specify that quotes are not stuttered (usually implying escaping)
--date-format FMT
: Use FMT as the default date format for the file (can be a named
format or a %-string suitable for Python's strftime.
--datetime-format FMT
: Use FMT as the default datetime format for the file (can be a named
format or a %-string suitable for Python's strftime.
--quoting STYLE
: Specifies that the quoting style should be set to STYLE,
which must be one of Python csv's QUOTE styles or
the tdda-specific JSON-style QUOTE_STRINGS_ONLY.
In addition to these overrides, other useful switches for inference are:
--sample-lines N, -n N
: Specify the number of lines to use for inference.
This defaults to 1000. (This will be reduced to the number of lines
in the file if it is larger than that.)
--single, -1
: Specifies that there is only one field in the file.
Single-field files cause difficulty with finding the field delimiter.
--Verbose, -V
: Be highly verbose.
--verbose, -v
: Be somewhat verbose.
--quiet, -q
: Be as quiet as practical.
Creating Fieldless tdda.serial files
It is sometimes desirable to create a tdda.serial file that contains
specification of whole-file-level properties only, e.g., delimiter,
quote character etc.—in fact, the very things that support overrides
with command-line switches. If the filename is set to empty, such
a metadata file can be generated. For example:
tdda serial --generate '' generated.serial --null NULL --sep '|' --quote-char "'"
will generate a tdda.serial file with only those properties specified:
{
"format": "http://tdda.info/ns/tdda.serial",
"writer": "tdda.serial-3.0.00",
"tdda.serial": {
"delimiter": "|",
"quote_char": "'",
"null_indicator": "NULL"
}
}
This can either be used directly, allowing the chosen flat-file reader
to figure out field names, types and per-field formats, or can be used
as the starting point for a more detailed .serial file including
hand-crafted field specifications.
Converting Between tdda.serial, CSVW, and Frictionless
Experimental Functionality
The tdda.serial module has some ability to convert between different
tdda.serial formats and between tdda.serial, CSVW, and Frictionless
metadata specifications. There are limitations in all cases, but
the most important specifications generally work.
Here are some example conversion commands:
tdda serial docdata.serial docdata-metadata.json
tdda serial docdata.serial docdata-metadata-from-serial2.json --to csvw
The first command produces docdata-metadata.json as CSVW, assuming
a CSVW target because the pattern fits the normal CSVW convention.
The second version uses a non-standard name for the CSVW metadata,
so the --to flag needs to be used to specify csvw as the target
format.
The result is:
{
"@context": "http://www.w3.org/ns/csvw",
"dc:conformsTo": "data-package",
"dc:creator": "tdda.serial-3.0.00",
"tables": [
{
"tableSchema": {
"columns": [
{
"name": "IAmBoolean",
"titles": "b",
"datatype": {
"base": "boolean",
"format": "Yes|No"
}
},
{
"name": "IAmInt",
"titles": "i",
"datatype": "integer"
},
{
"name": "f",
"datatype": "float"
},
{
"name": "IAmString",
"titles": "s",
"datatype": "string"
},
{
"name": "IAmDate",
"titles": "t",
"datatype": {
"base": "datetime",
"format": "dd/MM/yyyy"
}
}
]
},
"url": "docdata.txt"
}
],
"dialect": {
"encoding": "latin-1",
"delimiter": ";",
"headerRowCount": 1,
"quoteChar": "'"
},
"null": "."
}
tdda serial docdata.serial docdata.package.yaml
tdda serial docdata.serial docdata.resource.json
These two commands produce Frictionless metadata specifications.
The first produces a Frictionless package metadata files as YAML,
while the second produces a Frictionless resource metadata description
as JSON. Again, the names are conventional for Frictionless, so the
format is inferred. Again --to frictionless or --to fless could
be used to specify this if an unconventional name were used.
The YAML package output is:
resources:
- name: docdata
type: table
path: docdata.txt
scheme: file
format: csv
mediatype: text/csv
encoding: latin-1
dialect:
header: true
headerRows:
- 0
delimiter: ;
quoteChar: ''''
escapeChar: '`'
schema:
fields:
- name: b
type: boolean
trueValues:
- 'Yes'
- y
falseValues:
- 'No'
- n
- name: i
type: integer
- name: f
type: number
- name: s
type: string
- name: t
type: datetime
missingValues:
- .
and the JSON resource output is
{
"name": "docdata",
"type": "table",
"path": "docdata.txt",
"scheme": "file",
"format": "csv",
"mediatype": "text/csv",
"encoding": "latin-1",
"dialect": {
"header": true,
"headerRows": [
0
],
"delimiter": ";",
"quoteChar": "'",
"escapeChar": "`"
},
"schema": {
"fields": [
{
"name": "b",
"type": "boolean",
"trueValues": [
"Yes",
"y"
],
"falseValues": [
"No",
"n"
]
},
{
"name": "i",
"type": "integer"
},
{
"name": "f",
"type": "number"
},
{
"name": "s",
"type": "string"
},
{
"name": "t",
"type": "datetime"
}
],
"missingValues": [
"."
]
}
}
In the case of converting to a Pandas csv_read specification, a
dtype back end can be specified. So we might say:
tdda serial docdata.serial docdata-pd.r-o.serial --to pd.r -B o
will ask for the pandas.read_csv version of a tdda.serial file
using the original Pandas dtype backend.
The result is:
{
"format": "http://tdda.info/ns/tdda.serial",
"writer": "tdda.serial-3.0.00",
"pandas.read_csv": {
"sep": ";",
"encoding": "latin-1",
"escapechar": "`",
"quotechar": "'",
"dtype": {
"IAmBoolean": "object",
"f": "float",
"IAmString": "object"
},
"date_format": {
"IAmDate": "%d/%m/%Y"
},
"parse_dates": [
"IAmDate"
],
"na_values": ".",
"keep_default_na": false,
"names": [
"IAmBoolean",
"IAmInt",
"f",
"IAmString",
"IAmDate"
],
"header": 0,
"true_values": [
"Yes",
"y"
],
"false_values": [
"No",
"n"
]
}
}
The source does not have to be a tdda.serial file: tdda.serial can convert
between any supported formats and (in the case of Pandas, any of the
dtype back ends).
Custom Sections within tdda.serial Files
The primary goals of the tdda.serial format are to allow detailed
specification of flat-file formats and to provide facilities for
utilising such specifications for more accurate reading and writing
of data using (inherently ambiguous) flat-file formats. The tdda.serial
specifications are not written with a single library in mind, and are
intended, where possible, to be sufficiently general to allow
use with various libraries.
If, however, a single library is to be used, there is still potentially
value in recording read and write settings for that library in an
companion metadata file, and tdda.serial can serve that function too.
Where library parameters are expressible as JSON (i.e. nulls, numbers,
strings, booleans, arrays, and dictionaries), they can simply be written
into a relevant section of a tdda.serial file.
For example, we can convert the Example tdda.serial file
to one that contains custom sections for Pandas read_csv function
and DataFrame.to_csv methods as follows:
tdda serial example.serial examplepd.serial --to pd.r,pd.w
(Here we have specified both pd.r and pd.w, but we could have requested
only one of them.)
The result is the following file:
{
"format": "http://tdda.info/ns/tdda.serial",
"writer": "tdda.serial-3.0.00",
"pandas.read_csv": {
"sep": "|",
"encoding": "UTF-8",
"escapechar": "\\",
"quotechar": "\"",
"doublequote": false,
"dtype": {
"id": "Int64",
"name": "string"
},
"date_format": {
"joined": "ISO8601",
"last_seen": "%m/%d/%Y %H:%M:%S"
},
"parse_dates": [
"joined",
"last_seen"
],
"na_values": "",
"keep_default_na": false
},
"pandas.DataFrame.to_csv": {
"sep": "|",
"encoding": "UTF-8",
"escapechar": "\\",
"quotechar": "\"",
"doublequote": false,
"date_format": "%Y-%m-%d",
"na_rep": ""
}
}
The reason tdda.serial supports both read and write sections for
libraries such as Pandas and Polars is that the parameters for reading
and writing are different in various ways. First, most libraries aim to
write consistently (e.g. using a single null marker and a consistent
date format), so allow fewer things to be specified on write. They also
use different parameter names in some cases.
Things are slightly more complicated when the parameters are not naturally
expressible in JSON, as is the case with Polars types, which are passed
as type objects. In this case, tdda.serial converts such objects
to strings and the csv_to_polars method handles this when it uses them.
So, considering only the read case, the Polars equivalent conversion is:
tdda serial example.serial examplepl.serial --to pl.r
which produces:
{
"format": "http://tdda.info/ns/tdda.serial",
"writer": "tdda.serial-3.0.00",
"polars.read_csv": {
"separator": "|",
"quote_char": "\"",
"null_values": [
""
],
"encoding": "UTF-8",
"schema": {
"id": "Int64",
"name": "String",
"joined": "Datetime",
"last_seen": "String"
}
}
}
Here, last_seen set to type String (i.e. polars.String) because
Polars will not understand the date format. However, if the Python read
code is generated with
tdda serial example.serial examplereadpl.py --to pl.r
all is handled:
import polars as pl
def read_data(inpath):
df = pl.read_csv(
inpath,
separator='|',
quote_char='"',
null_values=[''],
encoding='UTF-8',
schema={
'id': pl.Int64,
'name': pl.String,
'joined': pl.Datetime,
'last_seen': pl.String
}
)
df = df.with_columns([
pl.col('last_seen').str.to_datetime(format='%m/%d/%Y %H:%M:%S'),
])
return df
Using tdda.serial Metadata in the tdda Library
For all tdda command-line commands, and in most places within
API calls where CSV or other flat file is specified, there is the
option to specify the file format using tdda.serial files,
CSVW files, or Frictionless files. This is based on the : (colon) specifier.
When specifying a path to a CSV (or other flat) file:
If the path is used by itself, the
tddalibrary will use eithertdda.serial.csv_to_pandasortdda.serial.csv_to_polarsto read it into a DataFrame. The default is currently pandas (with thenumpy_nullableback end), but this can be configured or, in many cases controlled with command line flags (--polars,--pandas,--backend BACKEND(for Pandas only)).If the path ends in a colon (e.g.
foo.csv:), TDDA will search for metadata in the same directory as the file and, if it finds one, pass that to the appropriatecsv_to_...function for more accurate DataFrame generation.In doing this, it will look for the following in priority order, given a file
foo.csv:foo.csv.serial(tdda.serialmetadata)foo.serial(tdda.serialmetadata). This is actually more common than the previous form, but if there are multiple files with different extensions, the former is more specific, so is checked first.Anything that matches foo using
@as a wildcard, e.g.@.serial,f@.serial,f@o.serial,@oo.serial. (@acts like*in the shell, while avoiding needing*in filenames, which can be awkward.)foo-metadata.json,foo-csvmetadata.json,foo-csv-metadata.json,foo.csvmetadata.json,foo.csv-metadata.json(all of which are common conventions for CSVW metadata files).The same CSVW patterns with
@wildcardsfoo.serial.json,foo.serial.yaml,foo.resource.json,foo.resource.yaml,foo.package.json,foo.package.yaml, all of which are common for Frictionless metadata files.The same patterns for
serialorpackagefrictionless files with@wildcards. Wildcards are not searched inresourcefiles, because in frictionless these always correspond to a single data file.
If the path contains a colon, the part to the right of the colon will be interpreted as a metadata file. So
foo.csv:bar.serialwill usebar.serial.