Creating a new volume¶
Initial setup¶
The first step in creating a new dataset is to create a new repository. Splitting datasets into separate repositories keeps the data release and management processes much simpler. To make this step easier, we provide a copier template that can be used to initialise a new repository with the following commands.
uvx copier copy gh:climate-resource/copier-bookshelf-dataset directory/to/new/repo
cd directory/to/new/repo
git add .
git commit -m "Initial commit"
Copier will then ask you a few questions to set up the new repository. After the new repository is generated, a few more administrative steps are required before you are fully set up:
- A new repository should be created on GitHub
- The git remote origin should be updated to point to the new repository (git remote add origin <new-repo-ssh-url>)
- A PERSONAL_ACCESS_TOKEN secret must be added to the repository (instructions)
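These steps can also be performed from the command line. The following is a minimal sketch using the GitHub CLI (gh); the repository name and remote URL are placeholders, the token value is prompted for interactively, and the GitHub web UI works equally well.

# Create the new repository on GitHub and push the initial commit
gh repo create my-org/my-new-dataset --private --source . --push

# Alternatively, if the repository already exists, point origin at it and push
git remote add origin git@github.com:my-org/my-new-dataset.git
git push -u origin main

# Add the PERSONAL_ACCESS_TOKEN secret used by the repository's workflows
gh secret set PERSONAL_ACCESS_TOKEN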
After this, you can start creating your new dataset.
Repository structure¶
The repository structure is as follows:
- pyproject.toml: A description of the repository and its dependencies
- src/{dataset_name}.py: The main script that generates the dataset (this file includes a jupytext header)
- src/{dataset_name}.yaml: The metadata that describes the dataset and the versions that are to be processed.
Some example source files for datasets can be found in the notebooks directory of this repository. A minimal sketch of the jupytext header mentioned above is shown below.
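The jupytext header is a YAML block embedded in comments at the top of the script, which allows jupytext to convert the .py file into a notebook for execution. The exact contents are set by the copier template and may differ; a typical percent-format header looks roughly like this:

# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
# ---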
Metadata storage¶
Update the {example_volume}.yaml file with the volume's metadata. This may include:
- name of the volume
- edition
- license
- metadata about author and author_email
- data dictionary
- detailed version information
- etc.
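As an illustration only, a trimmed-down metadata file mirroring the example_volume metadata loaded later in this tutorial might look roughly like the following. The field names are inferred from that loaded metadata; the authoritative schema (including how detailed version information is nested) is the YAML file generated by the copier template, so check an existing volume before relying on this sketch.

name: example_volume
version: v0.1.0
edition: 1
license: MIT
private: false
metadata:
  author: Yini Lai
  author_email: yini.lai@climate-resource.com
dataset:
  author: Zebedee Nicholls
  files:
    - url: https://rcmip-protocols-au.s3-ap-southeast-2.amazonaws.com/v5.1.0/rcmip-radiative-forcing-annual-means-v5-1-0.csv
      hash: 15ef911f0ea9854847dcd819df300cedac5fd001c6e740f2c5fdb32761ddec8b
data_dictionary:
  - name: region
    description: Area that the results are valid for
    type: string
    required_column: true
    allowed_NA: false
    controlled_vocabulary:
      - value: World
        description: Aggregate results for the world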
Steps in processing a dataset¶
The steps involved in processing a dataset are described in more detail below. These steps are included in the src/{dataset_name}.py file as a starting point, but can be modified as needed.
Logging configuration¶
Load the packages and set up the basic configuration for logging:
import logging
import tempfile
from scmdata import testing
from bookshelf import LocalBook
from bookshelf_producer.notebook import load_nb_metadata
logging.basicConfig(level=logging.INFO)
Parameters¶
If multiple versions are to be processed, the version can be passed as an argument to the script using a parameters cell. Papermill will inject a new set of parameters into the notebook when it is run.
# %% tags=["parameters"]
# This cell contains additional parameters that are controlled using papermill
local_bookshelf = tempfile.mkdtemp()
version = "v3.4"
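To make this concrete, the following is a rough sketch of what papermill does when the producer tooling runs the notebook for a particular version (in practice make run drives this for every version listed in the YAML configuration; the notebook file names here are placeholders):

import papermill as pm

# Execute the notebook converted from the .py script, overriding the
# values defined in the parameters cell above
pm.execute_notebook(
    "example_volume.ipynb",
    "example_volume_v3.4_processed.ipynb",
    parameters={"version": "v3.4", "local_bookshelf": "/tmp/bookshelf"},
)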
Metadata loading¶
Load and verify the volume's metadata
metadata = load_nb_metadata("example_volume/example_volume")
metadata.dict()
{'name': 'example_volume', 'version': 'v0.1.0', 'edition': 1, 'description': None, 'license': 'MIT', 'source_file': '/home/runner/work/bookshelf/bookshelf/notebooks/example_volume/example_volume.yaml', 'private': False, 'metadata': {'author': 'Yini Lai', 'author_email': 'yini.lai@climate-resource.com'}, 'dataset': {'url': None, 'doi': None, 'files': [{'url': 'https://rcmip-protocols-au.s3-ap-southeast-2.amazonaws.com/v5.1.0/rcmip-radiative-forcing-annual-means-v5-1-0.csv', 'hash': '15ef911f0ea9854847dcd819df300cedac5fd001c6e740f2c5fdb32761ddec8b'}], 'author': 'Zebedee Nicholls'}, 'data_dictionary': [{'name': 'model', 'description': 'The IAM that was used to create the scenario', 'type': 'string', 'required_column': True, 'allowed_NA': True, 'controlled_vocabulary': None}, {'name': 'unit', 'description': 'Unit of the timeseries', 'type': 'string', 'required_column': True, 'allowed_NA': False, 'controlled_vocabulary': None}, {'name': 'scenario', 'description': 'scenario', 'type': 'string', 'required_column': True, 'allowed_NA': False, 'controlled_vocabulary': None}, {'name': 'region', 'description': 'Area that the results are valid for', 'type': 'string', 'required_column': True, 'allowed_NA': False, 'controlled_vocabulary': [{'value': 'World', 'description': 'Aggregate results for the world'}]}, {'name': 'variable', 'description': 'Variable name', 'type': 'string', 'required_column': True, 'allowed_NA': True, 'controlled_vocabulary': None}]}
Data loading and transformation¶
Load the data intended for storage in the volume. This data may be sourced locally, scraped from the web, or downloaded from a server. For data downloads, we recommend using pooch to ensure integrity through hash verification. Once the data is loaded, perform any necessary manipulations to prepare it for storage and convert it to an scmdata.ScmRun object if it isn't already in that format.
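As an example, a download step using pooch could look roughly like the following. The URL and hash are read from the metadata loaded above; passing the downloaded CSV straight to scmdata.ScmRun is shown only schematically, since real source files usually need some reshaping first.

import pooch
import scmdata

# Download the source file, verifying it against the expected SHA-256 hash
source = metadata.dict()["dataset"]["files"][0]
fname = pooch.retrieve(url=source["url"], known_hash=source["hash"])

# Load the downloaded CSV and convert it into an ScmRun for storage in the volume
data = scmdata.ScmRun(fname)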
data = testing.get_single_ts()
data.timeseries()
| model | region | scenario | unit | variable | 0001-01-01 00:00:00 | 0002-01-01 00:00:00 | 0003-01-01 00:00:00 |
|---|---|---|---|---|---|---|---|
| mod | World | scen | GtC / yr | Emissions\|CO2 | 1.0 | 2.0 | 3.0 |
Local book creation¶
Initialize a local book instance using the prepared metadata:
# create and return a unique temporary directory
local_bookshelf = tempfile.mkdtemp()
book = LocalBook.create_from_metadata(metadata, local_bookshelf=local_bookshelf)
Resource creation¶
Add a new Resource to the Book using the scmdata.ScmRun object. This process copies the timeseries data into a local file and calculates a hash of the file's contents to ensure data integrity. The timeseries data is also transformed into a long format, and a hash of this transformed data is calculated as well. These hashes provide a straightforward way to verify whether the files have been modified.
book.add_timeseries("example_resource_name", data)
Display the Book's metadata, which encompasses all metadata about the Book and its associated Resources:
book.metadata()
{'name': 'example_volume', 'version': 'v0.1.0', 'private': False, 'edition': 1, 'resources': [{'name': 'example_resource_name_wide', 'timeseries_name': 'example_resource_name', 'shape': 'wide', 'format': 'csv.gz', 'filename': 'example_volume_v0.1.0_e001_example_resource_name_wide.csv.gz', 'hash': '5062b49bf8e836e95debab847eb58cbeed4ad9edf22dd6cd1cb916b3d71f4167', 'content_hash': '7a80f4271ddae808da0d517deaeaab6cf0a484bb0002d38ff2c09d226e4e221a', 'profile': 'data-resource'}, {'name': 'example_resource_name_long', 'timeseries_name': 'example_resource_name', 'shape': 'long', 'format': 'csv.gz', 'filename': 'example_volume_v0.1.0_e001_example_resource_name_long.csv.gz', 'hash': '7888571e9e5d8c856a56dd36bee7c37e78117f5faccf7d0cd6c70374c2235827', 'content_hash': 'a08fa87c0f8d908b7f685c2bcc725b7736fc290cae94833b391157605508342c', 'profile': 'data-resource'}], 'profile': 'data-package'}
The metadata outlined above is available for clients to download and use to fetch the Book's Resources. Upon deployment, the Book becomes immutable, so any modification to its metadata or data requires the release of a new Book version. Note that the steps described here cover constructing a volume locally; they do not cover publishing the volume.
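Once published, clients fetch the Book and its Resources through the bookshelf consumer interface. The following is a rough sketch of that flow, assuming the BookShelf API described in the bookshelf client documentation; method names and arguments may differ slightly between versions.

from bookshelf import BookShelf

shelf = BookShelf()

# Fetch the published volume's metadata and load one of its Resources as an ScmRun
remote_book = shelf.load("example_volume", version="v0.1.0")
timeseries = remote_book.timeseries("example_resource_name")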
Generation¶
The Books for a volume can be generated using:
make run
This will run the src/{volume_name}.py script for each version in the configuration file and write the output to the dist/ directory. The output folder contains the generated data, metadata, and processed notebooks.

The CI will automatically run this command during a Pull Request to verify that the processing scripts are valid.