Commit 13f8a4a

Merge pull request zarr-developers#1 from d-v-b/zarr-python-dtypes
fill out numpy `U` dtype
2 parents aa21bfa + 46ac0b8 commit 13f8a4a

5 files changed: 55 additions, 95 deletions

data-types/datetime64/README.md

Lines changed: 0 additions & 33 deletions
This file was deleted.

data-types/datetime64/schema.json

Lines changed: 0 additions & 26 deletions
This file was deleted.

data-types/fixed-length-ucs4/README.md

Lines changed: 0 additions & 33 deletions
This file was deleted.
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# `fixed_length_utf32` data type

This document defines a data type for fixed-length Unicode strings encoded using [UTF-32](https://www.unicode.org/versions/Unicode5.0.0/appC.pdf#M9.19040.HeadingAppendix.C2.Encoding.Forms.in.ISOIEC.10646). UTF-32, also known as UCS-4, is an encoding of Unicode strings that allocates 4 bytes to each Unicode code point.

"Fixed length" as used here means that the `fixed_length_utf32` data type is parametrized by an integer length, which sets a fixed length for every scalar belonging to that data type.

### Name

The name of this data type is the string `"fixed_length_utf32"`.

### Configuration

This data type requires a configuration. The configuration is a JSON object with the following fields:

| field name | type | required | notes |
|------------|----------|---|---|
| `"length_bytes"` | integer | yes | The value MUST be an integer divisible by 4 in the inclusive range `[0, 2147483644]`. |

> Note: the maximum length of 2147483644 was chosen to match the semantics of the [NumPy `"U"` data type](https://numpy.org/devdocs/reference/arrays.scalars.html#numpy.str_), which as of this writing has a maximum length in code points of 536870911, i.e. 2147483644 / 4.

> Note: given a particular `fixed_length_utf32` data type, the length of an array element in Unicode code points is the value of the `length_bytes` field divided by 4.

### Examples

```json
{
    "name": "fixed_length_utf32",
    "configuration": {
        "length_bytes": 4
    }
}
```

## Fill value representation

The value of the `fill_value` metadata key MUST be a string. When encoded in UTF-32, the fill value MUST have a length in bytes equal to the value of `length_bytes` specified in the `configuration` of this data type.

## Codec compatibility

This data type is compatible with any codec that supports arrays with fixed-size data types.

## Notes

This data type is designed for NumPy compatibility. UTF-32 is not a good fit for many applications that need to model arrays of strings, as real string datasets are often composed of variable-length strings. A variable-length string data type should be preferred in these cases.

## Change log

No changes yet.

## Current maintainers

* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)

data-types/fixed-length-ucs4/schema.json renamed to data-types/fixed-length-utf32/schema.json

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@
 "type": "object",
 "properties": {
 "name": {
-"const": "fixed-length-ucs4"
+"const": "fixed_length_utf32"
 },
 "configuration": {
 "type": "object",
@@ -14,13 +14,13 @@
 "type": "integer"
 }
 },
-"required": ["length_bits"],
+"required": ["length_bytes"],
 "additionalProperties": false
 }
 },
 "required": ["name", "configuration"],
 "additionalProperties": false
 },
-{ "const": "fixed-length-ucs4" }
+{ "const": "fixed_length_utf32" }
 ]
 }
