# `tdda.serial`: Metadata and Tools for Flat (“CSV”) Files A `.serial` file (also known as a `tdda.serial` file) is a JSON file describing the format of one or more flat (“CSV”) files. It is the primary metadata format used by the `tdda` library, which includes various tools for working with `.serial` metadata in the `tdda.serial` module. The primary goal of `tdda.serial` is to allow data in flat files to be read more accurately and to serve as documentation of how data has been written to flat files. The current main features of `tdda.serial` are: * A file format for describing key aspects of CSV and other “flat” files, including things like separators, quoting, escaping and nulls, together with field names, types, and formats. * Facilities for using `.serial` files when reading and writing flat files with (currently) Pandas and Polars in Python (`csv_to_pandas`, `csv_to_polars`, `pandas_to_csv`, `polars_to_csv`, as well as tools for generating keyword arguments for `read_csv` functions and Python code for reading flat files based on a metadata specification). * Tools allowing conversion between `tdda.serial` files and other metadata formats---currently [CSVW](https://csvw.org) and [Frictionless](https://frictionlessdata.io). * Facilities for generating inferred `tdda.serial` files from a flat file. * A command-line interface for most of the above functionality (`tdda serial`). * The ability for various tdda tools to be guided by a `tdda.serial` files when reading flat files for things like constraint generation, data validation, and dataset comparison with `tdda diff`. (In general, this can be achieved by using either `foo.csv:` as a filename specifier, which tells `tdda` to look for metadata based on filename conventions, or `foo.csv:md.serial` to use metadata in `md.serial`.) A single `.serial` file can describe a format shared by many concrete flat files, unlike CSVW, which is primarily concerned with describing specific named files. In this respect, it is more similar to a Frictionless schema, as opposed to a Frictionless resource or package. A `.serial` file can also contain library-specific sections (for `pandas.read_csv`, `polars.read_csv` etc.) alongside or instead of the `tdda.serial` section. In general, it is preferable to use the `tdda.serial` section; the other sections are intended primarily for cases in which: - Something is not capable of being expressed in `tdda.serial`; - The intention is only to use a single library for reading/writing, and higher fidelity may be guaranteed by using a custom section. ## The `tdda.serial` File Structure A `.serial` file is a JSON object with the following top-level keys: ``` { "format": "http://tdda.info/ns/tdda.serial", "writer": "tdda.serial-3.0.0", "tdda.serial": { ... }, "pandas.read_csv": { ... }, "pandas.DataFrame.to_csv": { ... }, "polars.read_csv": { ... }, "polars.DataFrame.write_csv": { ... }, "python.csv.reader": { ... }, "python.csv.writer": { ... } } ``` In fact, any key may be used for a custom format specification for some flat-file reader or writer. ### Top-level Keys **`format`** *(string, required)* : Must be `"http://tdda.info/ns/tdda.serial"`. Identifies the file as a `tdda.serial` file. In future, this string may include numbered versions of tdda.serial. **`writer`** *(string, optional)* : The name and version of the software that wrote the file, e.g. `"tdda.serial-3.0.0"`. **`tdda.serial`** *(object, optional)* : The primary metadata section, described in detail below. **`pandas.read_csv`**, **`pandas.DataFrame.to_csv`** *(objects, optional)* : Library-specific sections containing keyword arguments for the corresponding Pandas functions, stored verbatim. **`polars.read_csv`**, **`polars.DataFrame.write_csv`** *(objects, optional)* : Library-specific sections containing keyword arguments for the corresponding Polars functions, stored verbatim. **`python.csv.reader`**, **`python.csv.writer`** *(objects, optional)* : Library-specific sections for the Python standard library `csv` module. When multiple sections are present, the `tdda.serial` library chooses which to use based on a preference order (typically `tdda.serial` first, then library-specific sections). A preferred section can also be specified explicitly with both the API and the command-line tools. ### Example %% docdata/example.serial ``` { "format": "http://tdda.info/ns/tdda.serial", "tdda.serial": { "encoding": "UTF-8", "delimiter": "|", "quote_char": "\"", "escape_char": "\\", "stutter_quotes": false, "null_indicator": "", "date_format": "YYYY-MM-DD", "datetime_format": "DD/MM/YYYY HH:MM:SS", "header_row_count": 1, "header_row": 0, "fields": [ { "name": "id", "fieldtype": "int" }, { "name": "name", "fieldtype": "string" }, { "name": "joined", "fieldtype": "date" }, { "name": "last_seen", "fieldtype": "datetime", "format": "MM/DD/YYYY HH:MM:SS" } ] } } ``` ### The `tdda.serial` Section This section contains a dataset-level description of the flat file format, with optional per-field overrides. #### Dataset-level Keys All dataset-level keys are optional. **`encoding`** *(string)* : The text encoding of the file. Any Python codec name is accepted, e.g. `"UTF-8"`, `"latin-1"`. (Suggested default: `"UTF-8"`.) **`delimiter`** *(string)* : The field separator character. (Suggested default: `","`.) **`quote_char`** *(string)* : The character used to quote fields string data or, in some cases, other data, especially data that contains the delimiter or the field separator. Almost always double quote or single quote. (Suggested default: `"`, written as `"\""` in JSON.) **`escape_char`** *(string)* : When this is set to backslash (the only value it is likely to take, written as `"\\"` in JSON) this means that quote characters in quoted strings are escaped with backslash `"like \"this\" example"`. Backslash also escapes itself, and may be used to escape separators, whether quoted or not. It has no significance for common control escapes, which may appear as `\n`, `\r`, `\t`, `\f` in files regardless of this setting. Overescaping is common (escaping characters that don't need to be escaped in context) and this setting has no particular bearing on the cases. Most readers simply remove unnecessary escape characters. (Suggested default: `"\\"`.) **`stutter_quotes`** *(boolean)* : If `true`, the quote character is escaped by doubling it, `"like ""this"" example"`. If `false`, the escape character is used to escape embedded quotes. `"""this"""` would be used to represent the word *this* in double quotes when stutter is `true`. (Suggested default: `false`.) **`null_indicator`** *(string or array of strings)* : The string or strings used to represent null/missing values. (Suggested Default: empty string, written as `""` in JSON.) **`date_format`** *(string)* : The default format for `date` fields that have no per-field `format`. Also used as the fallback for `datetime` fields if `datetime_format` is not set. Accepts named formats (see [Date and Datetime Formats](#tdda-serial-date-and-datetime-formats)) and Python `strftime` strings. (Suggested default: `"iso8601-date"`.) **`datetime_format`** *(string)* : The default format for `datetime` fields that have no per-field `format`. If not set, `date_format` is used as the fallback. Accepts named formats or Python `strftime` strings. **`header_row_count`** *(integer)* : The number of header rows at the top of the file. Default: `1`. **`header_row`** *(integer)* : The zero-based index of the row containing column names, within the header rows. Default: `0`. **`accept_percentages_as_floats`** *(boolean)* : If `true`, values like `"12.5%"` are read as `0.125`. (Suggested default: `false`.) **`map_missing_trailing_cols_to_null`** *(boolean)* : If `true`, rows with fewer fields than expected (as Excel sometimes produces) have missing trailing fields treated as null. (Suggested default: `false`.) **`decimal_point`** *(string)* : The character used as the decimal point in numeric fields. (Suggested default: `"."`). **`thou_sep`** *(string)* : The thousands separator character in numeric fields, e.g. `","`. Optional; no suggested default. **`true_values`** *(string or array of strings)* : String values to interpret as *true* for boolean fields, in addition to the library defaults. Applies to all `bool` fields unless overridden for a given field. **`false_values`** *(string or array of strings)* : String values to interpret as *false* for boolean fields, in addition to the library defaults. Applies to all `bool` fields unless overridden for a given field. **`quoting`** *(string)* : Specifies which values are quoted There is little consistency about which values are quoted in flat files. Python's `csv` module defines constants for various options, but fail to include one of the most important. The `tdda.serial` format allows quoting to be set to any of the values supported by Python `csv` and also to one other common scheme. The values from Python's `csv` library are: - `QUOTE_ALL`: all values are quoted. - `QUOTE_MINIMAL`: Only values that contain special characters, such as the field delimiter, or literal newlines. - `QUOTE_NONNUMERIC`: All non-numeric values (including nulls) are quoted. - `QUOTE_NONE`: No values are quoted (escape is used for special characters). - `QUOTE_NOTNULL`: All non-null values are quoted (including numbers) - `QUOTE_STRINGS`: This is the same as QUOTE_NONNUMERIC except that null values are not quoted. The extra value supported by `tdda.serial` is - `QUOTE_STRINGS_ONLY`: Only string values are quoted: nulls and values of all non-string types are not quoted. (This is essentially what JSON does, except that JSON has no date/datetime types, so those are usually stored as strings, and therefore quoted.) Optional. No default. **`path`** *(string)* : Optional path to a datafile. Not usually populated since `.serial` data usually applies to a set of datafiles rather than a single one. Can be relative or absolute. The path should be considered advisory rather than definitive, i.e. there is no problem at all to use a `.serial` file specifying a particular flat file with a different flat file. Indeed, this is common. Note: since `path` is only really for information, it can be set to a glob (wildcard) pattern like '*.csv' or 'foo*.csv' to indicate a set of files. `@` can also be used as the wildcard to match `tdda.serial` filename conventions (see below). **`fields`** *(array or object)* : Descriptions of the fields in the file. See [The `fields` Entry](#the-fields-entry). (the-fields-entry)= #### The `fields` Entry Fields can be specified as either an **array** or an **object (dictionary)**. **Array form** — the field list is taken to be complete and ordered. Fields appear in the order they occur in the file. Each entry is a field object (described below). ```json "fields": [ {"name": "id", "fieldtype": "int"}, {"name": "date", "fieldtype": "date"} ] ``` **Object (dictionary) form** — the keys are the names of the fields *as they appear in the file* (the external names), and the values are field objects. This form is used for partial specifications, where only some fields are described, and/or where internal names differ from external names. Fields may appear in any order, and additional fields in the file are permitted. ```json "fields": { "commission date": {"name": "DateOfCommission", "fieldtype": "date"}, "qa passed?": {"name": "PassedQA", "fieldtype": "bool", "true_values": "yes", "false_values": "no"} } ``` **NOTE (field sortedness):** JSON Objects (dictionaries) are formally unordered, i.e. the order in which the entries appear has no semantic significance. The object form is preferred when a `.serial` file is not making a claim about the order of fields in a flat file. Despite this, all recent versions of Python read JSON objects into Python dictionaries preserving key order. The `tdda.serial` library always preserves order when reading and writing serial files, so in practice the order in the `.serial` file will always match the order in the flat file if `tdda.serial` is used to generate the `.serial` file. Conversely, in the rare cases that data is written without a header, the table form is preferred because this unambiguously specifies field order. In practice, the object/dictionary form will also work provided the `.serial` has the field entries in the same order as the data file. #### Field Keys **`name`** *(string, required in array form)* : The internal name for the field — the column name used in the resulting dataframe. In array form this is mandatory. In object form it can be omitted if the internal name is the same as the external (dictionary key) name. This is useful if there is a desire to replace some or all of the names used in the flat file with more convenient ones during processing. **`fieldtype`** *(string)* : The type of the field, from the table below. | Value | Description | |:---------------|:-----------------------------------------| | `bool` | Boolean | | `int` | Integer | | `float` | Floating-point | | `number` | Unspecified numeric (integer or float) | | `string` | Text | | `date` | Date (no time component) | | `datetime` | Date and time | | `datetime_tz` | Date and time with timezone | | `time` | Time only | | `iso8601` | ISO 8601 date or datetime (unspecified) | **`csvname`** *(string)* : The name of the column in the file, if different from `name`. Used in array form when the file's column name differs from the desired internal name. **`format`** *(string)* : For `date` and `datetime` fields: the format of the date or datetime values in this field. Overrides `date_format` / `datetime_format`. Accepts named formats or Python `strftime` strings (see [Date and Datetime Formats](#tdda-serial-date-and-datetime-formats)). For `bool` fields: a boolean format specification (e.g. `"yes|no"`). **`null_indicator`** *(string or array of strings)* : Null indicator(s) for this field, overriding the dataset-level `null_indicator`. **`true_values`** *(string or array of strings)* : True value(s) for this `bool` field, overriding the dataset-level `true_values`. **`false_values`** *(string or array of strings)* : False value(s) for this `bool` field, overriding the dataset-level `false_values`. **`description`** *(string)* : A human-readable description of the field. ### TDDA Serial Date and Datetime Formats 1. ISO8601 ISO8601 is the most widely understood, unproblematical format for dates and times and is recommended for new data written with dates and datetimes. This can be specified in tdda serial as follows. - `iso8601-date` can be used for ISO8601 dates (2000-12-31) - `iso8601-datetime` can be used for ISO8601 datetimes (2000-12-31T12:34:56) - `iso8601-datetime-tz` can be used for ISO8601 datetimes with timezone (2000-12-31T12:34:56+00:00) etc. - `iso8601` can be used to indicate “any of the above”. With the ISO8601 variants, reading is liberal, allowing some variation in separators, whereas writing is strict. The `tdda.serial` format allows date formats to be specified in three other ways: 2. Dates can be specified using: - YYYY for four-digit years - YY for two-digit years - MM for *numeric* month - DD for day - HH for hour - MM for minute (Same as month! That's OK. Context disambiguates.) - MON for a spelt-out three letter month like Jan - MONTH for a spelt-out full month like January - SS for whole seconds - SS.S, SS.SS etc. (any number of S's after period) for fractional seconds. - +ZZ:ZZ or +ZZZZ for timezone. (Can use `-` instead of `+`) - AM or PM for 12-hour clock AM/PM indicator Any case may be used. MM for minutes and month is disambiguated by saying that MM when adjacent to YY or DD (other than a separator) is a month, and when adjacent to SS or HH indicates minutes. For example: - `YYYY-MM-DDTHH:MM:SS.S+ZZ:ZZ` - `YYYY-MM-DDTHH:MM:SS.S-ZZ:ZZ` - `YYYY-MM-DD HH:MM:SS+ZZZZ` - `YYYY-MM-DDTHH:MM:SS` - `YYYY-MM-DD HH:MM` - `YYYY-MM-DD` - `YY-MM-DD` - `YY-MM-DDTHH:MM:SS.SPM` - `YY-MM-DDTHH:MM:SSAM` - `DD/MM/YYYY HH:MM:SS` - `MM/DD/YYYY HH:MM:SS` - `MM.DD.YY HH:MM:SSAM` The string will always be standardized to upper case on write, but is case insensitive on read, for `.serial` files. (For `CSVW` files, mixed case is used, following CSVW's conventions). 3. Any **unambiguous** date or datetime in the specified format can be used as a specifier. By unambiguous, we mean: - The day is at least 13 - The year is either four digits or 60 or greater or 00 So - ` 2000-12-31T12:34:56.789+0000` - ` 31/12/2000T12:34:56.789+00:00` - ` 31-12-2000T12:34:56.789-0000` - ` 31.12.2000T12:34:56-00:00` - ` 31/12/2000T12:34:56` - ` 31/12/2000 12:34:56` - ` 31/12/2000` - ` 12/31/2000` - ` 12/31/00` - ` 31/12/00` - ` 31 Dec 2000` - ` 31 December 2000` - ` Dec 31 00` - ` 2000-12-31T12:34:56.789AM` - ` 2000-12-31T12:34:56.789PM` etc. are all acceptable. Things like - ` 2000-02-01T12:34:56.789+0000` - ` 01/02/2000T12:34:56.789+00:00` - ` 01-02-2000T12:34:56.789+0000` - ` 01.02.2000T12:34:56+0000` - ` 01/02/2000T12:34:56` - ` 01/02/2000 12:34:56` - ` 01/02/2000` - ` 02/01/2000` - ` 02/01/00` - ` 01/02/00` - ` 01 Dec 22` - ` 22 Dec 01` are not, because they are ambiguous. When reading `.serial` files the `tdda` library will accept any unambiguous date and reject anything it considers ambiguous. On writing, it will replace any other specific date with components from 2000-12-31T12:34:56.789+0000. 4. Strings usable by [Python `strftime`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) can be used. These are typified by `%Y-%m-%dT%H:%M:%S`. On writing new `.serial` files, the library will default to named `iso8601` formats when the format is conformant with ISO8601, and will then choose the YYYY/MM/DD type specifier or `strftime-conforming` strings. You can specify a preferred form at the command line with --use-literal-dates, --use-yyyy-dates, --use-pc-dates. #### ISO 8601 Formats | Named Format | `strftime` equivalent | Example | |:----------------------|:------------------------------|:--------------------------| | `iso8601-date` | `%Y-%m-%d` | `2000-12-31` | | `iso8601-datetime` | `%Y-%m-%dT%H:%M:%S` | `2000-12-31T12:34:56` | | `iso8601-datetime-tz` | `%Y-%m-%dT%H:%M:%S%z` | `2000-12-31T12:34:56+00:00` | | `iso8601` | Any ISO 8601 date or datetime | Any of the above | On read, ISO8601 formats should ideally accept `/` and `.` as separators for date components in addition to `-` (e.g. `2000/12/31`, `2000.12.31`), should accept space (`" "`) instead of T, and should also accept fractional seconds on times. #### Format Precedence For each date or datetime field, the format is determined as follows: 1. The field's own `format` key, if present. 2. For `date` fields: the dataset-level `date_format`, if set. For `datetime` fields: the dataset-level `datetime_format`, if set. 3. The other dataset-level format (`date_format` falling back to `datetime_format` and vice versa). 4. The library default (typically ISO 8601). ### Library-specific Sections In addition to, or instead of, the `tdda.serial` section, a `.serial` file can contain sections with library-specific keyword arguments. These are stored verbatim and used directly when reading or writing with the corresponding library. For example, a `pandas.read_csv` section contains exactly the keyword arguments to be passed to `pandas.read_csv`: ```json { "format": "http://tdda.info/ns/tdda.serial", "pandas.read_csv": { "sep": "|", "encoding": "UTF-8", "dtype": {"id": "Int64", "name": "string"}, "parse_dates": ["joined"], "date_format": {"joined": "%d/%m/%Y"}, "na_values": [""], "keep_default_na": false } } ``` When both `tdda.serial` and library-specific sections are present, the library chooses which to use based on a preference order, or an explicit preference can be specified. ### Supported Format Abbreviations When specifying formats to the `tdda serial` command-line tool (with `--to`), the following abbreviations are accepted: | Abbreviation | Full name | |:-------------|:-------------------------------| | `.` | `tdda.serial` | | `pd.r` | `pandas.read_csv` | | `pd.w` | `pandas.DataFrame.to_csv` | | `pl.r` | `polars.read_csv` | | `pl.w` | `polars.DataFrame.write_csv` | | `csv.r` | `python.csv.reader` | | `csv.w` | `python.csv.writer` | | `fl` | `frictionless` | | `fl.r` | `frictionless.resource` | | `fl.p` | `frictionless.package` | | `csvw` | `csvw` | ## Reading Data Using `tdda.serial` Files and other Metadata Specifications The `tdda` library provides several mechanisms for assisting the reading of flat files as guided by `tdda.serial`, or by CSVW or Frictionless files, both as command-line tools and using APIs. In illustrating the `tdda.serial` functionality, many of the examples will use this small flat file, which is in a deliberately obscure format (though all of its features are individually not particularly uncommon). %% docdata/docdata.txt ```text b;i;f;s;t 'n';0;0.5;;31/01/1970 .;.;.;.;. 'Yes';1;1.5;'a';31/12/1999 ``` and the following (starting) `tdda.serial` file: %% docdata/docdata.serial ``` { "format": "http://tdda.info/ns/tdda.serial", "writer": "tdda.serial-2.2.15", "tdda.serial": { "fields": [ { "csvname": "b", "name": "IAmBoolean", "fieldtype": "bool", "true_values": [ "Yes", "y" ], "false_values": [ "No", "n" ] }, { "csvname": "i", "name": "IAmInt", "fieldtype": "int" }, { "name": "f", "fieldtype": "float" }, { "csvname": "s", "name": "IAmString", "fieldtype": "string" }, { "csvname": "t", "name": "IAmDate", "fieldtype": "datetime", "format": "%d/%m/%Y" } ], "encoding": "latin-1", "delimiter": ";", "quote_char": "'", "escape_char": "`", "null_indicator": ".", "header_row_count": 1, "header_row": 0, "path": "docdata.txt" } } ``` When read correctly in Pandas, this produces (with the nullable backend for Pandas to which `tdda.serial` defaults): %% docdata/docdata-output.txt ```text IAmBoolean IAmInt f IAmString IAmDate 0 False 0 0.5 1970-01-31 1 NaT 2 True 1 1.5 a 1999-12-31 dtypes: boolean Int64 Float64 string datetime64[ns] ``` ### Reading Flat Files with Metadata from Python using the API The simplest way to read a flat file with accompanying metadata is to use the `csv_to_x` functions. In its simplest forms, %% docdata/csv2pandas.py ```python from tdda.serial import csv_to_pandas df1 = csv_to_pandas('docdata.txt', 'docdata.serial') df2 = csv_to_pandas('docdata.txt:docdata.serial') df3 = csv_to_pandas('docdata.txt:') df4 = csv_to_pandas('docdata.txt', find_md=True) ``` all do the same thing, reading the flat file `docdata.txt` using the metadata specification in `docdata.serial`. The last two forms are only available when the `.serial` file has the same stem name as the flat file. The pandas `dtype` back end can also be passed in, e.g. %% docdata/csv2pandasbackend.py ```python from tdda.serial import csv_to_pandas dfb = csv_to_pandas('docdata.txt', 'docdata.serial', backend='original') print(dfb) ``` for the original Pandas `dtype` back end. Similarly % TODO: polars needs to handle boolean conversion (like dates) %% docdata/csv2polars.py ```python from tdda.serial import csv_to_polars df1 = csv_to_polars('docdata.txt', 'docdata.serial') df2 = csv_to_polars('docdata.txt:docdata.serial') df3 = csv_to_polars('docdata.txt:') df4 = csv_to_polars('docdata.txt', find_md=True) ``` are the equivalent forms for Polars. **NOTE (`polars.read_csv`):** Whereas `csv_to_pandas` largely just translates metadata settings into arguments for `pandas.read_csv`, the corresponding `polars.read_csv` function is less flexible, and there are a number of kinds of flat files that it struggles to read accurately. A particular example of this is that polars can only read ISO8601-formatted dates and datestamps. The `tdda.serial` `csv_to_polars` works around this by instructing polars to read fields it cannot parse with `read_csv` as strings and then post-processing them to convert them to date or datetime fields. ### Metadata Matching and `@` wildcards When a colon is added to the end of a flat file name or path to request that `tdda.serial` finds matching metadata, and then `find_md=True` is passed into relevant API calls, `tdda`'s matching process for a file `foo.ext` is as follows: 1. It first looks for `foo.ext.serial` (in the same directory as `foo.ext`). 2. It then looks for `foo.serial` (in the same directory as `foo.ext`). (This the most common pattern; `foo.ext.serial` is checked first to allow matching of metadata when the same stem appears with multiple extensions, e.g. `.csv` and `.psv`) 3. Failing that, it looks for any wildcard matches using `@` as a wildcard similar to how `*` is used in globbing, i.e. `@` matches any characters or no characters. So for example any of - `@.serial` - `foo@.serial` - `@foo.serial` - `@f@o@.serial` will match, but none of - `Foo.serial` - `fool*.serial` - `f0*.serial` will do so. If a single metadata file containing `@` matches, that will be used. If multiple `.serial` files with wildcards match, an error will be raised. 4. Next, the following are checked: - `foo-metadata.json` - `foo-csvmetadata.json` - `foo-csv-metadata.json` - `foo.csvmetadata` - `foo.csv-metadata` These are common patterns for CSVW, and will be used as CSVW if the `@context` attribute indicates csvw, and as `tdda.serial` if the `format` attribute indicates that. 5. Common Frictionless patterns are explored. Frictionless usually used either `.yaml` or `.json`, and includes either `.resource`, `.package`, or `.schema` in the filename before it, so any of: - `foo.resource.json` - `foo.package.json` - `foo.schema.json` - `foo.resource.yaml` - `foo.package.yaml` - `foo.schema.yaml` will match. ### Pandas `dtype` Back Ends Pandas has three different back ends for storing data. Its `original` one uses floating-point columns whenever integer fields contain null (`na`) values, and uses the object type for strings, as well as for various other things. More recent versions of Pandas have introduced support for nullable integers (e.g. `Int64`) and a proper `string` type. This backend is called `numpy_nullable`. Even more recently, Pandas has also added an (Apache) `pyarrow` backend, which also has (different) nullable integers and `string` types. A single DataFrame can contain columns with any mixture of these types. The `tdda.serial` library supports all of these backends, but defaults to the `nullable_numpy` backend. In general, alternate backends may be specified with either the command-line switch `--backend` or `-B`, with values: - `o` or `original` for the original back end, - `n` or `numpy_nullable` for the `numpy_nullable` back end, and - `a` or `pyarrow` for the `pyarrow` back end. The API generally uses a `backend` keyword argument on calls to specify the backend when appropriate. CSVW and Frictionless data may also be used in place of .serial files. In the case of CSVW, the most common file name format is: * `docdata-metadata.json` for data `docdata.csv`, and one of * `docdata.resource.yaml` * `docdata.resource.json` * `docdata.package.yaml` * `docdata.package.json` for Frictionless. If the metadata specifies the name of the flat file (which is usual for csvw and Frictionless, and allowed for tdda.serial) then the metadata file itself can be specified instead. For example: %% docdata/pandas-and-polars-reads.py ```python from tdda.serial import csv_to_polars, csv_to_pandas df1 = csv_to_pandas('docdata.serial') df2 = csv_to_polars('docdata-metadata.json') df3 = csv_to_polars('docdata.resource.yaml') ``` would use three different metadata files that point to data to read them into three dataframes, the first using Pandas and the last two using Polars. **NOTE: (full function arguments)** There are more optional parameters to the `csv_to_pandas` and `csv_to_polars` functions; see the detailed API documentation for details. ### Using the `tdda serial` Command to Assist with Reading Flat Files The `tdda serial` command can generate Python code or sets of keyword arguments for `read_csv` methods from Pandas or Polars. %% docdata/tddaserial1.sh ```bash tdda serial docdata.serial docdata_pandas.py --to pd.r ``` This command generates stand-alone Python code `docdata_pandas.py` containing a function for reading flat file in the format specified by `docdata.serial` with Pandas, taking the path to the datafile as an argument. (Standalone, here, means code that does not require the `tdda` library.) %% docdata/tddaserial2.sh ```bash tdda serial docdata.serial docdata_polars.py --to pl.r ``` This command generates Python code `docdata_polars.py` containing a function for reading flat file in the format specified by `docdata.serial` with Polars, taking the path to the datafile as an argument. The `--for` parameter can be added so that the generated Python calls the function with the appropriate inpath. For example: % TODO: This doesn't seem to work currently %% docdata/tddaserial3.sh ```bash tdda serial docdata.serial docdata_polars2.py --to pl.r --for docdata.txt ``` ## Writing Data with `tdda.serial` (API) Just as `csv_to_pandas` and `csv_to_polars` read flat files into dataframes for the two libraries, `pandas_to_csv` and `polars_to_csv` *write* data. Metadata files can have two roles in this process: * Metadata files can be written *with* the flat data to specify the format used (these can be `.serial` files, or CSVW files, or Frictionless files). * Metadata files can be used to *specify* the write format. These two roles can be combined. ### The `pandas_to_csv` function The simplest form for writing a Pandas dataframe to CSV *with metadata* is: %% docdata/pandas2csv1.py ```python from tdda.serial import pandas_to_csv, csv_to_pandas df = csv_to_pandas('docdata.txt:') pandas_to_csv(df, 'docdata1pd.csv', md_outpath='docdata1pd.serial') ``` This will write the Pandas dataframe `df` to `docdata1pd.csv` using `df.to_csv`, with default settings, and writing accompanying metadata to `docdata1pd.serial`. Specific write formatting parameters can be passed directly to `to_csv` as keyword arguments, and these will also, where appropriate, affect the written metadata in the `.serial` file. For example: %% docdata/pandas2csv2.py ```python from tdda.serial import pandas_to_csv, csv_to_pandas df = csv_to_pandas('docdata.txt:') pandas_to_csv(df, 'docdata2pd.csv', md_outpath='docdata2pd.serial', sep='|', na_rep='NULL', quotechar="'") ``` will write the data using a pipe separator (`|`), single quotes (`'`), and `NULL` as the null marker, and the resulting `.serial` file will document those settings. Instead of providing settings as keyword parameters, a metadata file can be used to determine the flat-file write settings using the `md_inpath` argument. So if we want to write the data in a DataFrame using the metadata in `docdata.serial`, we can use: %% docdata/pandas2csv3.py ```python from tdda.serial import pandas_to_csv, csv_to_pandas df = csv_to_pandas('docdata.txt:') pandas_to_csv(df, 'docdata3pd.csv', md_inpath='docdata.serial') ``` This does *not* write a serial file. It is occasionally useful to specify both an `in_metadata` and an `out_metadata` path, like this: %% docdata/pandas2csv4.py ```python from tdda.serial import pandas_to_csv, csv_to_pandas df = csv_to_pandas('docdata.txt:') pandas_to_csv(df, 'docdata4pd.csv', md_inpath='docdata.serial', md_outpath='docdata4pd.serial') ``` The reasons for wanting to write metadata when a metadata file is used to specify the write format might include: - The input and output metadata formats might be different (e.g. `tdda.serial` in and CSVW out) - A central metadata file in one location might be used to specify the format, and the output format file might be in a different location, e.g. by the data file written - The output metadata can include a pointer to the specific dataset. This is normal in the case of CSVW and Frictionless, and optional in the case of `tdda.serial`. ### The `polars_to_csv` function The polars function works in the same way as its Pandas counterpart. The simplest form for writing a polars dataframe to CSV *with metadata* is: %% docdata/polars2csv1.py ```python from tdda.serial import polars_to_csv, csv_to_polars df = csv_to_polars('docdata.txt:') polars_to_csv(df, 'docdata1pl.csv', md_outpath='docdata1pl.serial') ``` This will write the Polars dataframe `df` to `docdata1pl.csv` using `df.write_csv`, with default settings, and writing accompanying metadata to `docdata1pl.serial`. Specific write formatting parameters can be passed directly to `to_csv` as keyword arguments, and these will also, where appropriate, affect the written metadata in the `.serial` file. For example: %% docdata/polars2csv2.py ```python from tdda.serial import polars_to_csv, csv_to_polars df = csv_to_polars('docdata.txt:') polars_to_csv(df, 'docdata2pl.csv', md_outpath='docdata2pl.serial', sep='|', na_rep='NULL', quotechar="'") ``` will write the data using a pipe separator (`|`), single quotes (`'`), and `NULL` as the null marker, and the resulting `.serial` file will document those settings. Instead of providing settings as keyword parameters, a metadata file can be used to determine the flat-file write settings using the `md_inpath` argument. So if we want to write the data in a DataFrame using the metadata in `docdata.serial`, we can use: %% docdata/polars2csv3.py ```python from tdda.serial import polars_to_csv, csv_to_polars df = csv_to_polars('docdata.txt:') polars_to_csv(df, 'docdata3pl.csv', md_inpath='docdata.serial') ``` This does *not* write a serial file. It is occasionally useful to specify both an `in_metadata` and an `out_metadata` path, like this: %% docdata/polars2csv4.py ```python from tdda.serial import polars_to_csv, csv_to_polars df = csv_to_polars('docdata.txt:') polars_to_csv(df, 'docdata4pl.csv', md_inpath='docdata.serial', md_outpath='docdata4pl.serial') ``` The reasons for wanting to write metadata when a metadata file is used to specify the write format might include: - The input and output metadata formats might be different (e.g. `tdda.serial` in and CSVW out) - A central metadata file in one location might be used to specify the format, and the output format file might be in a different location, e.g. by the data file written - The output metadata can include a pointer to the specific dataset. This is normal in the case of CSVW and Frictionless, and optional in the case of `tdda.serial`. % TODO: for tdda.serial **NOTE:** There are more optional parameters to the `polars_to_csv` and `csv_to_polars` functions; see the detailed API documentation for details. ## Generating `tdda.serial` Files by Inference from a Flat File **Experimental Functionality** The `tdda.serial` library can also read a flat file and attempt to infer its format and write it to a `tdda.serial` or other metadata file using the `--generate` (or `-g`) switch. This is an inherently heuristic process and results cannot be guaranteed to be correct. The general approach used in `tdda`'s format inference is only to write that for which there is evidence in the data, i.e. it does not write default values to the `tdda.serial` or other metadata files. So for example, if there are no quote characters used in a datafile, the quote character will not be set. Similarly, if the data is pure ASCII, no encoding will be written to the file. Basic usage is as follows: %% docdata/inference1.sh ```bash tdda serial --generate docdata.txt docdata-inferred.serial ``` Warnings will be issued in some cases if the inference process has significant ambiguity, and the process may fail if `tdda.serial` cannot infer the format. Overrides for most top-level parameters can be provided with switches. These can be used as hints, e.g. specifying an encoding where in cases where this cannot be inferred, or to select alternatives if, for example, the inference process infers `iso-8859-1` (`latin-1`) but the data is in fact `iso-8859015` (`latin-9`) or `cp1252`. (These encodings are hard to differentiate.) The overrides will normally be accepted except in cases where it is too obviously in conflict with the data, which complicates further inference. The available Override switches are: `--sep C`, `--delimiter C` : Set field delimiter to `C` `--quote-char Q`, `--quote Q` : Set the quote character to `Q` (normally straight single or double quote) `--escape` : Set the escape character to backslash (no argument) `--no-escape` : Do not set the escape character `--stutter` : Specify stuttering for embedded quotes `--no-stutter` : Specify that quotes are not stuttered (usually implying escaping) `--date-format FMT` : Use `FMT` as the default date format for the file (can be a named format or a `%`-string suitable for Python's `strftime`. `--datetime-format FMT` : Use `FMT` as the default datetime format for the file (can be a named format or a %-string suitable for Python's `strftime`. % TODO: `--quoting STYLE` : Specifies that the quoting style should be set to `STYLE`, which must be one of Python `csv`'s `QUOTE` styles or the `tdda`-specific JSON-style `QUOTE_STRINGS_ONLY`. In addition to these overrides, other useful switches for inference are: `--sample-lines N`, `-n N` : Specify the number of lines to use for inference. This defaults to 1000. (This will be reduced to the number of lines in the file if it is larger than that.) `--single`, `-1` : Specifies that there is only one field in the file. Single-field files cause difficulty with finding the field delimiter. `--Verbose`, `-V` : Be highly verbose. `--verbose`, `-v` : Be somewhat verbose. `--quiet`, `-q` : Be as quiet as practical. ### Creating Fieldless `tdda.serial` files It is sometimes desirable to create a `tdda.serial` file that contains specification of whole-file-level properties only, e.g., delimiter, quote character etc.---in fact, the very things that support overrides with command-line switches. If the filename is set to empty, such a metadata file can be generated. For example: %% docdata/generation1.sh ```bash tdda serial --generate '' generated.serial --null NULL --sep '|' --quote-char "'" ``` will generate a `tdda.serial` file with only those properties specified: %% docdata/generated.serial ``` { "format": "http://tdda.info/ns/tdda.serial", "writer": "tdda.serial-3.0.00", "tdda.serial": { "delimiter": "|", "quote_char": "'", "null_indicator": "NULL" } } ``` This can either be used directly, allowing the chosen flat-file reader to figure out field names, types and per-field formats, or can be used as the starting point for a more detailed `.serial` file including hand-crafted field specifications. ## Converting Between `tdda.serial`, CSVW, and Frictionless **Experimental Functionality** The `tdda.serial` module has some ability to convert between different `tdda.serial` formats and between `tdda.serial`, CSVW, and Frictionless metadata specifications. There are limitations in all cases, but the most important specifications generally work. Here are some example conversion commands: %% docdata/convert1.sh ```bash tdda serial docdata.serial docdata-metadata.json tdda serial docdata.serial docdata-metadata-from-serial2.json --to csvw ``` The first command produces `docdata-metadata.json` as CSVW, assuming a CSVW target because the pattern fits the normal CSVW convention. The second version uses a non-standard name for the CSVW metadata, so the `--to` flag needs to be used to specify `csvw` as the target format. The result is: %% docdata/docdata-metadata-from-serial2.json ```json { "@context": "http://www.w3.org/ns/csvw", "dc:conformsTo": "data-package", "dc:creator": "tdda.serial-3.0.00", "tables": [ { "tableSchema": { "columns": [ { "name": "IAmBoolean", "titles": "b", "datatype": { "base": "boolean", "format": "Yes|No" } }, { "name": "IAmInt", "titles": "i", "datatype": "integer" }, { "name": "f", "datatype": "float" }, { "name": "IAmString", "titles": "s", "datatype": "string" }, { "name": "IAmDate", "titles": "t", "datatype": { "base": "datetime", "format": "dd/MM/yyyy" } } ] }, "url": "docdata.txt" } ], "dialect": { "encoding": "latin-1", "delimiter": ";", "headerRowCount": 1, "quoteChar": "'" }, "null": "." } ``` % Why is a warning issued? Doesn't CSVW understand that format? --- %% docdata/convert2.sh ```bash tdda serial docdata.serial docdata.package.yaml tdda serial docdata.serial docdata.resource.json ``` These two commands produce Frictionless metadata specifications. The first produces a Frictionless package metadata files as YAML, while the second produces a Frictionless resource metadata description as JSON. Again, the names are conventional for Frictionless, so the format is inferred. Again `--to frictionless` or `--to fless` could be used to specify this if an unconventional name were used. % .ssv again! The YAML package output is: %% docdata/docdata.package.yaml ```yaml resources: - name: docdata type: table path: docdata.txt scheme: file format: csv mediatype: text/csv encoding: latin-1 dialect: header: true headerRows: - 0 delimiter: ; quoteChar: '''' escapeChar: '`' schema: fields: - name: b type: boolean trueValues: - 'Yes' - y falseValues: - 'No' - n - name: i type: integer - name: f type: number - name: s type: string - name: t type: datetime missingValues: - . ``` and the JSON resource output is %% docdata/docdata.resource.json ```json { "name": "docdata", "type": "table", "path": "docdata.txt", "scheme": "file", "format": "csv", "mediatype": "text/csv", "encoding": "latin-1", "dialect": { "header": true, "headerRows": [ 0 ], "delimiter": ";", "quoteChar": "'", "escapeChar": "`" }, "schema": { "fields": [ { "name": "b", "type": "boolean", "trueValues": [ "Yes", "y" ], "falseValues": [ "No", "n" ] }, { "name": "i", "type": "integer" }, { "name": "f", "type": "number" }, { "name": "s", "type": "string" }, { "name": "t", "type": "datetime" } ], "missingValues": [ "." ] } } ``` --- In the case of converting to a Pandas `csv_read` specification, a `dtype` back end can be specified. So we might say: %% docdata/convert3.sh ```bash tdda serial docdata.serial docdata-pd.r-o.serial --to pd.r -B o ``` will ask for the `pandas.read_csv` version of a `tdda.serial` file using the original Pandas `dtype` backend. The result is: %% docdata/docdata-pd.r-o.serial ``` { "format": "http://tdda.info/ns/tdda.serial", "writer": "tdda.serial-3.0.00", "pandas.read_csv": { "sep": ";", "encoding": "latin-1", "escapechar": "`", "quotechar": "'", "dtype": { "IAmBoolean": "object", "f": "float", "IAmString": "object" }, "date_format": { "IAmDate": "%d/%m/%Y" }, "parse_dates": [ "IAmDate" ], "na_values": ".", "keep_default_na": false, "names": [ "IAmBoolean", "IAmInt", "f", "IAmString", "IAmDate" ], "header": 0, "true_values": [ "Yes", "y" ], "false_values": [ "No", "n" ] } } ``` --- The source does not have to be a `tdda.serial` file: `tdda.serial` can convert between any supported formats and (in the case of Pandas, any of the `dtype` back ends). ## Custom Sections within `tdda.serial` Files The primary goals of the `tdda.serial` format are to allow detailed specification of flat-file formats and to provide facilities for utilising such specifications for more accurate reading and writing of data using (inherently ambiguous) flat-file formats. The `tdda.serial` specifications are not written with a single library in mind, and are intended, where possible, to be sufficiently general to allow use with various libraries. If, however, a single library is to be used, there is still potentially value in recording read and write settings for that library in an companion metadata file, and `tdda.serial` can serve that function too. Where library parameters are expressible as JSON (i.e. nulls, numbers, strings, booleans, arrays, and dictionaries), they can simply be written into a relevant section of a `tdda.serial` file. For example, we can convert the [Example](#example) `tdda.serial` file to one that contains custom sections for Pandas `read_csv` function and `DataFrame.to_csv` methods as follows: %% docdata/converttopd.sh ```bash tdda serial example.serial examplepd.serial --to pd.r,pd.w ``` (Here we have specified both `pd.r` and `pd.w`, but we could have requested only one of them.) The result is the following file: %% docdata/examplepd.serial ``` { "format": "http://tdda.info/ns/tdda.serial", "writer": "tdda.serial-3.0.00", "pandas.read_csv": { "sep": "|", "encoding": "UTF-8", "escapechar": "\\", "quotechar": "\"", "doublequote": false, "dtype": { "id": "Int64", "name": "string" }, "date_format": { "joined": "ISO8601", "last_seen": "%m/%d/%Y %H:%M:%S" }, "parse_dates": [ "joined", "last_seen" ], "na_values": "", "keep_default_na": false }, "pandas.DataFrame.to_csv": { "sep": "|", "encoding": "UTF-8", "escapechar": "\\", "quotechar": "\"", "doublequote": false, "date_format": "%Y-%m-%d", "na_rep": "" } } ``` The reason `tdda.serial` supports both read and write sections for libraries such as Pandas and Polars is that the parameters for reading and writing are different in various ways. First, most libraries aim to write consistently (e.g. using a single null marker and a consistent date format), so allow fewer things to be specified on write. They also use different parameter names in some cases. Things are slightly more complicated when the parameters are not naturally expressible in JSON, as is the case with Polars types, which are passed as `type` objects. In this case, `tdda.serial` converts such objects to strings and the `csv_to_polars` method handles this when it uses them. So, considering only the read case, the Polars equivalent conversion is: % TODO: pl.w %% docdata/converttopl.sh ```bash tdda serial example.serial examplepl.serial --to pl.r ``` which produces: %% docdata/examplepl.serial ``` { "format": "http://tdda.info/ns/tdda.serial", "writer": "tdda.serial-3.0.00", "polars.read_csv": { "separator": "|", "quote_char": "\"", "null_values": [ "" ], "encoding": "UTF-8", "schema": { "id": "Int64", "name": "String", "joined": "Datetime", "last_seen": "String" } } } ``` Here, `last_seen` set to type `String` (i.e. `polars.String`) because Polars will not understand the date format. However, if the Python read code is generated with %% docdata/converttoplpy.sh ```bash tdda serial example.serial examplereadpl.py --to pl.r ``` all is handled: %% docdata/examplereadpl.py ```python import polars as pl def read_data(inpath): df = pl.read_csv( inpath, separator='|', quote_char='"', null_values=[''], encoding='UTF-8', schema={ 'id': pl.Int64, 'name': pl.String, 'joined': pl.Datetime, 'last_seen': pl.String } ) df = df.with_columns([ pl.col('last_seen').str.to_datetime(format='%m/%d/%Y %H:%M:%S'), ]) return df ``` (tdda-serial-colon-format)= ## Using `tdda.serial` Metadata in the `tdda` Library For all [`tdda`](cli.md#tdda) command-line commands, and in most places within API calls where CSV or other flat file is specified, there is the option to specify the file format using `tdda.serial` files, CSVW files, or Frictionless files. This is based on the `:` (colon) specifier. When specifying a path to a CSV (or other flat) file: * If the path is used by itself, the `tdda` library will use either `tdda.serial.csv_to_pandas` or `tdda.serial.csv_to_polars` to read it into a DataFrame. The default is currently pandas (with the `numpy_nullable` back end), but this can be [configured](configuration.md) or, in many cases controlled with command line flags (`--polars`, `--pandas`, `--backend BACKEND` (for Pandas only)). * If the path ends in a colon (e.g. `foo.csv:`), TDDA will search for metadata in the same directory as the file and, if it finds one, pass that to the appropriate `csv_to_...` function for more accurate DataFrame generation. In doing this, it will look for the following in priority order, given a file `foo.csv`: - `foo.csv.serial` (`tdda.serial` metadata) - `foo.serial` (`tdda.serial` metadata). This is actually more common than the previous form, but if there are multiple files with different extensions, the former is more specific, so is checked first. - Anything that matches foo using `@` as a wildcard, e.g. `@.serial`, `f@.serial`, `f@o.serial`, `@oo.serial`. (`@` acts like `*` in the shell, while avoiding needing `*` in filenames, which can be awkward.) - `foo-metadata.json`, `foo-csvmetadata.json`, `foo-csv-metadata.json`, `foo.csvmetadata.json`, `foo.csv-metadata.json` (all of which are common conventions for CSVW metadata files). - The same CSVW patterns with `@` wildcards - `foo.serial.json`, `foo.serial.yaml`, `foo.resource.json`, `foo.resource.yaml`, `foo.package.json`, `foo.package.yaml`, all of which are common for Frictionless metadata files. - The same patterns for `serial` or `package` frictionless files with `@` wildcards. Wildcards are not searched in `resource` files, because in frictionless these always correspond to a single data file. * If the path contains a colon, the part to the right of the colon will be interpreted as a metadata file. So `foo.csv:bar.serial` will use `bar.serial`.