# RDM Tools Search

###### tags: `aiida`, `2022`, `data management`, `RDM`, `RSE`

- Created: 2022-01-05.
- Last modified: 2022-05-11, Johannes Wasmer.
- Authors: Johannes Wasmer.

## Table of contents

[TOC]

## Introduction

### Why this document

The division **PGI-1/IAS-1 Quantum Theory of Materials** is researching ways to better implement modern **Research Data Management (RDM)** (German: Forschungsdatenmanagement (FDM)). This document aims to serve as an overview of possibly suitable RDM tools.

To foster RDM at FZJ, the board of directors has started the [Jülich RDM Challenge Call](https://intranet.fz-juelich.de/en/topics/rdm/data-management/funding/julich-rdm-challenge-call) in cooperation with the [HMC Hub Information](https://www.hmc-plattform.org/en/information/uebersicht) at FZJ.

Under the **[project proposal](https://iffcloud.fz-juelich.de/s/KowjgSMZbrtTWPk) "Linking DFT Simulation Workflows to Data Repositories"** to the [Research Data Management Challenges at FZJ call](https://helmholtz-metadaten.de/en/news/datenmanagement-challengesatfzj-start-der-ausschreibung), the [AiiDA group](https://iffchat.fz-juelich.de/aiida/channels/aiida-meeting) at PGI-1/IAS-1 is researching ways to implement sustainable RDM, with the help of the [HMC Hub Information](https://www.hmc-plattform.org/en/information/people) and the [PGI/JCNS-TA](https://www.fz-juelich.de/pgi/pgi-ta/EN/Home/home_node.html).

- [👉 Meetings, start of discussion](https://iffchat.fz-juelich.de/aiida/pl/gbuxncixqb8fzfya4eop3oftic).

**Update 2022-04-17**: The project proposal has **not** been selected for the [first round of funded projects](https://intranet.fz-juelich.de/en/topics/rdm/on-campus/rdm-challenge-projects). However, this document is still valid as research for an application to the next call round at the end of 2022.

-----

### What is in this document

The landscape of available RDM tools is vast.
This collaborative notebook maps it out exploratively. The goal here is to

1. see what is out there,
2. help in selecting tools for this project.

Requirements for tools in this list:

- As outlined in the proposal, a special requirement of this RDM project is the integration with **[iffaiida](iffaiida.iff.kfa-juelich.de/)** in particular, and with **[AiiDA](aiida.net)** databases in general.
- Open source. Enterprise tools may be listed for comparison, clearly marked as such.
- Should be usable completely on-premise, and not dependent on an external host provider. Cloud tools may be listed for comparison, but clearly marked as such.

AiiDA already solves a large part of the RDM problem (FAIR data management). The tools here should solve the missing parts (project templates, and, put simply, project data plus code archiving).

The list is sorted from general / end-to-end solutions to specific / sub-concerns.

## RDM knowledge

### ZB Research Data Management

The **[ZB Research Data Management information portal](https://intranet.fz-juelich.de/en/organization/zb/expertise/research_data_management)** of the FZJ central library (ZB) defines RDM in the FZJ context. It offers tools such as the data management plan tool [DMP](https://dmp.fz-juelich.de/), lists metadata standards (e.g. [here](https://rd-alliance.github.io/metadata-directory/standards/) and [here](https://beta.fairsharing.org/search?fairsharingRegistry=Standard)), and defines best practices for groups and individual researchers.

**Update 2022-04-17**. The **[FZJ RDM Portal](https://intranet.fz-juelich.de/en/topics/rdm)** replaces the ZB RDM portal as the official RDM portal in the FZJ Intranet. It is now jointly curated by the HMC and ZB. New:

- RDM / FDM event announcements
- **[RDM Toolbox](https://intranet.fz-juelich.de/en/topics/rdm/toolbox)**. This collection partially supersedes the collection in this document.

## Data management tools

This section lists solutions for the whole pipeline of distributed data storage, versioning, annotation, archiving, retrieval, and publication. This includes end-to-end as well as only-part-way solutions.

### JülichDATA

**[JülichDATA](https://intranet.fz-juelich.de/en/topics/rdm/toolbox/julichdata)** is the central institutional repository for research data of Forschungszentrum Jülich. It serves as a platform for all research data generated at Forschungszentrum Jülich or created in this context. The service is based on [Dataverse](https://dataverse.org/).

### SampleDB

The PGI/JCNS-TA's **[SampleDB](https://intranet.fz-juelich.de/en/topics/rdm/toolbox/sampledb)** (mentioned in proposal WP2 Task 2.3) is a web-based electronic lab notebook with a focus on sample and measurement metadata. In more detail (from the [publication](https://joss.theoj.org/papers/10.21105/joss.02107)):

> SampleDB is a web-based sample and measurement metadata database developed at Jülich Centre for Neutron Science (JCNS) and Peter Grünberg Institute (PGI). Researchers can use SampleDB to store and retrieve information on samples, measurements **and simulations**, analyze them using Jupyter notebooks, track sample storage locations and responsibilities and view sample life cycles.

In particular, it features a [notebook templating server](https://scientific-it-systems.iffgit.fz-juelich.de/SampleDB/administrator_guide/jupyterhub_support.html#notebook-templating-server).

**Contact persons:**

- [Florian Rhiem](https://www.fz-juelich.de/SharedDocs/Personen/PGI/PGI-TA/EN/Rhiem_F.html) (PGI/JCNS-TA, developer)

### DataLad

**[DataLad](https://intranet.fz-juelich.de/en/topics/rdm/toolbox/datalad)** is a Python-based tool for the joint management of code, data, and their relationship, built on top of a versatile system for data logistics (git-annex) and the most popular distributed version control system (Git). It is mainly used in neuroscience, but invites general use.
A DataLad Dataset is just a Git repository, i.e., a folder. In addition to the code, though, DataLad also keeps track of the data within the Dataset which the code produced (e.g., result plots and PDFs). Furthermore, DataLad Datasets can be nested to allow for modular components.

**Contact persons:**

- [Benjamin Poldrack](https://www.fz-juelich.de/SharedDocs/Personen/INM/INM-7/DE/Poldrack_b.html?nn=653620) (INM-7, developer)
- [Prof. Michael Hanke](https://www.fz-juelich.de/inm/inm-7/DE/Forschung/Psychoinformatik/artikel.html?nn=653620) (INM-7, lead)

### MLOps data management tools

Some resources discuss data management in terms of MLOps (machine learning operations). However, some tools discussed in that context are more general, such that 'ML' could be replaced with any computational task and associated data, such as 'DFTOps'. The tools listed here fulfill the requirement, with respect to the AiiDA constraint, that they [do not](https://dagshub.com/blog/solve-your-mlops-problems-with-an-open-source-data-science-stack/) impose their own data model or database.

**[Data Version Control (DVC)](https://dvc.org/)** is a lightweight, Git-like solution for management and versioning of data and machine learning models. Despite its machine learning focus, it should also be suitable for this proposal, since it is better understood as an alternative to Git LFS for data versioning. In practice it works in tandem with Git (`git add code`, `dvc add data`, ...).

![Model for versioning & sharing: Git for code, DVC for data](https://dvc.org/static/9cccd49a995845bdc6466caa17ad3bad/ab90f/model-sharing-digram.png)

Other MLOps data management tools which might be suitable here are [Pachyderm](https://www.pachyderm.com/) (reproducible containers and pipelines) and [LakeFS](https://lakefs.io/data-versioning-does-it-mean-what-you-think-it-means/) (versioning in data lakes). However, they are considerably more complicated to set up.

### Other data management tools

For these tools, it is not yet clear whether they are suitable for the proposal.

**[Metador](https://intranet.fz-juelich.de/en/topics/rdm/toolbox/metador)** facilitates FAIRification of research data and forms a boundary between the unstructured, unannotated outside world and the FAIR, semantically annotated data inside your amazing research institution.

**Contact person:** [Dr. Anton Pirogov](https://fz-juelich.de/SharedDocs/Kontaktdaten/Mitarbeiter/P/Pirogov_a_pirogov_fz_juelich_de.html?nn=2750422) (IAS-9, developer).

### Articles and resource collections on data management

- [github/awesome-reproducible-research](https://github.com/leipzig/awesome-reproducible-research). Resource collection on reproducible research.
- [github/awesome-mlops > data management](https://github.com/kelvins/awesome-mlops#data-management). List of data management tools used in MLOps.
- [github/topics/research-data-management](https://github.com/topics/research-data-management).
- [github/awesome-data-engineering](https://github.com/igorbarinov/awesome-data-engineering). List of data engineering tools.
- [Comparing Data Version Control Tools - 2020](https://dagshub.com/blog/data-version-control-tools/)
- [Data Versioning – Does It Mean What You Think It Means?](https://lakefs.io/data-versioning-does-it-mean-what-you-think-it-means/)
- [Introducing the Machine Learning Reproducibility Scale](https://dagshub.com/blog/introducing-the-machine-learning-reproducibility-scale/). Breaks down (technical) reproducibility of computational workflows into five steps: 1. Code, 2. Configuration, 3. Data (+Artifacts), 4. Environment, 5. Evaluation. It comes with a nice [reproducibility checklist](https://dagshub.com/DAGsHub-Official/reproducibility-challenge/src/master/REPRODUCIBILITY.md).

## Containerization tools

### Docker

**[Docker](https://www.docker.com/)** packages software into standardized units called containers that have everything the software needs to run, including libraries, system tools, code, and runtime. Using Docker, you can quickly deploy and scale applications into any environment and know your code will run.

### Apptainer

**[Apptainer](https://apptainer.org/)** (formerly called **Singularity**) is the most widely used container system for HPC.

### Jupyter Docker Stacks

**[Jupyter Docker Stacks](https://jupyter-docker-stacks.readthedocs.io)** are a set of ready-to-run Docker images containing Jupyter applications and interactive computing tools. For example, [Jupyter-JSC](https://jupyter-jsc.fz-juelich.de), the Jupyter service of the JSC running in their [HDF-Cloud](https://www.fz-juelich.de/ias/jsc/EN/Expertise/SciCloudServices/HDFCloud/_node.html), [allows](https://github.com/FZJ-JSC/jupyter-jsc-notebooks/blob/master/FAQ_HDFCloud.ipynb) users to select from the [eight base jupyter images](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#jupyter-base-notebook) which this tool defines. (Info on the Jupyter-JSC architecture: [here](https://www.unicore.eu/about-unicore/case-studies/jupyter-at-jsc/)).

## Version control tools

### Git LFS

**[Git Large File Storage (LFS)](https://git-lfs.github.com/)** replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.

### Git-annex

**[Git-annex](https://git-annex.branchable.com/)** allows managing files with Git, without checking the file contents into Git. While that may seem paradoxical, it is useful when dealing with files larger than Git can currently easily handle, whether due to limitations in memory, time, or disk space.
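
Both Git LFS and git-annex keep large file contents out of the Git history. As a minimal sketch of the Git LFS side, the following Python snippet builds the small text pointer that Git LFS commits in place of a large file. The function name `make_lfs_pointer` and the dummy data are illustrative assumptions; the three-line layout follows the Git LFS pointer file specification, but the snippet is not a replacement for the `git lfs` client.

```python
import hashlib


def make_lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer file for the given file content.

    Hypothetical helper for illustration; the real pointer is written by
    the `git lfs` client when a tracked file is staged.
    """
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    # Dummy stand-in for a large binary file (assumption, 1 MB of zeros).
    data = bytes(1_000_000)
    # Git would store only these three short lines, not the 1 MB payload.
    print(make_lfs_pointer(data))
```

The pointer stays a few lines long no matter how large the file is, which is why the Git repository itself remains small while the payload lives on the LFS server.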
[SO > How do Git LFS and git-annex differ?](https://stackoverflow.com/questions/39337586/how-do-git-lfs-and-git-annex-differ).

## Jupyter notebook tools

Definition here: Notebooks = Jupyter notebooks = `.ipynb` files, if not otherwise specified.

### Jupyter notebook workflow tools

**[ploomber.io](https://ploomber.io/)**. Maintainable and collaborative pipelines in Jupyter (and script files). Use your favorite editor (Jupyter, VSCode, PyCharm) to develop interactively and deploy to the cloud without code changes (Kubernetes, Airflow, AWS Batch, and **SLURM**). Do you have legacy notebooks? Refactor them into modular pipelines with a single command.

- [Jupyter Blog post introducing Ploomber in 2021](https://blog.jupyter.org/ploomber-maintainable-and-collaborative-pipelines-in-jupyter-acb3ad2101a7).
- [Example ML project combining Ploomber, PyCaret, MLFlow](https://towardsdatascience.com/machine-learning-pipeline-with-ploomber-pycaret-and-mlflow-db6e76ee8a10).

### Jupyter notebook template tools

**[jpmorganchase/jupyterlab_templates](https://github.com/jpmorganchase/jupyterlab_templates)**. Support for Jupyter notebook templates in JupyterLab.

**[xtreamsrl/jupytemplate](https://github.com/xtreamsrl/jupytemplate)**. A simple template for Jupyter notebooks. [Article](https://towardsdatascience.com/stop-copy-pasting-notebooks-embrace-jupyter-templates-6bd7b6c00b94).

**[Jinja](https://palletsprojects.com/p/jinja/)** is one of the most used template engines for Python. It can be used to create [custom templates for Jupyter notebooks](https://www.datacamp.com/community/tutorials/jinja2-custom-export-templates-jupyter).

### Jupyter notebooks version control tools

Jupyter notebooks produce dirty diffs since they store cell inputs and outputs in JSON format. This makes collaboration (code review, merging) difficult. Below are some tools that address this problem.

Summary:

- nbdime & the jupyterlab-git extension are great for local diffing.
- The GitLab web interface does the same.
- jupytext & nbstripout are good if you don't need outputs in version control.
- ReviewNB is good for diffing & commenting on GitHub commits & pull requests.

#### GitLab clean notebook diffs

Since GitLab 14.5, the web interface shows **[clean diffs](https://docs.gitlab.com/ee/user/project/repository/jupyter_notebooks/#cleaner-diffs)** of Jupyter notebooks, by internally converting them to markdown files for diff views. Diffs of graphical cell outputs (e.g., plots) are represented by short hash strings. This feature is activated in iffgit.

#### Nbstripout

**[Nbstripout](https://github.com/kynan/nbstripout)** removes the cell outputs from notebook files. That's all. It can be used as a pre-commit hook.

#### Nbdime

**[Nbdime](https://github.com/jupyter/nbdime)** is a CLI diff and merge tool for Jupyter notebooks. In the CLI, it shows a cleaner diff. In the browser, diffs of graphical cell outputs (e.g., plot images) are visualized with the actual images.

#### Jupyterlab-Git

**[Jupyterlab-Git](https://github.com/jupyterlab/jupyterlab-git)** integrates Git into the JupyterLab interface. This includes a visual diff viewer. Diffs of graphical cell outputs are visualized with the actual images.

#### ReviewNB

**[ReviewNB](https://www.reviewnb.com)** is a commercial online service for GitHub which does what the Nbdime browser view does, but nicer and automatically, and allows for collaboration. For open source projects it is free to use. It does not offer integration with GitLab.

#### Jupytext

**[Jupytext](https://jupytext.readthedocs.io/)** is a Python library. It allows treating notebooks as code (script) or text (markdown) files in a variety of formats, and vice versa. This allows for flexibility in many respects, such as code refactoring. With the 'pairing' feature (a notebook is represented by automatically synced files, one of them e.g.
a script file), it can also be used for clean notebook diffs and collaboration via Git, by only committing the script mirror file (without cell outputs) of a notebook to the repository.

### Articles and resource collections on Jupyter notebooks

- [github/awesome-jupyter](https://github.com/markusschanta/awesome-jupyter). Resource collection concerning Jupyter.
- [github/awesome-notebooks](https://github.com/jupyter-naas/awesome-notebooks). Resource collection concerning Jupyter.
- [github/best-of-jupyter](https://github.com/ml-tooling/best-of-jupyter). Resource collection concerning Jupyter.
- [Jupyter notebook version tracking with Neptune](https://docs.neptune.ai/integrations-and-supported-tools/ide-and-notebooks/jupyter-lab-and-jupyter-notebook) (external host provider solution, for comparison).
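
To make the dirty-diff problem from the version control section concrete: a notebook is a JSON document in which each code cell carries its outputs and an execution counter, so re-running a cell changes the file even if the code did not change. The following Python sketch removes both fields from a notebook dictionary. It only illustrates the idea behind output stripping; the function name `strip_outputs` and the example notebook are assumptions, and this is not how nbstripout itself is implemented.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code cell outputs removed.

    Hypothetical helper for illustration; assumes the nbformat 4 layout,
    in which cells are stored in a top-level "cells" list.
    """
    stripped = json.loads(json.dumps(notebook))  # simple deep copy via JSON
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Made-up minimal notebook with one markdown cell and one executed code cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

clean = strip_outputs(example)
# Only the source remains in the code cell, so Git diffs show code changes only.
print(json.dumps(clean["cells"][1], indent=1))
```

For real repositories, the tools listed above are the better choice, since they also take care of Git integration and further notebook metadata.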