
1 Basic Information:

1.1 Give the project name or acronym to be used:

To whom will the data be useful? (data utility):

1.3 Other DMP Metadata

1.4 Choose some additional information

Where will you submit your data as endpoints?

2. What kind of data will you handle:


3. How much data will you probably generate:

GB
GB

4. Are there any standards that are relevant for you?


4.1 Will you adhere to any high level metadata submission standards?


4.2 When will you make your data public?


5. Do you use visualization in the project?







The project aim should be part of a sentence.

Example 1: aims at creating a computational model of carbon and water flow within a whole plant architecture


Example 2: aims at generating data management plan with minimal effort and making the data as open as possible

The project object = target.

Example 1: carbon and water flow in plants


Example 2: data management plan

User-defined template

You can click the dotted box to start editing.
Click the grey buttons to reuse templates.
Click submit when you have finished.
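The templates on this page mark placeholders as `$_NAME` and optional passages as `#if$_NAME … #endif$_NAME`, with `#if!$_NAME` and `#if$_A|$_B` variants. When editing a template it can help to picture how such markers might be resolved. The following is only a minimal sketch under assumptions: the marker semantics are inferred from the templates themselves, and `render`, `values` and `selected` are hypothetical names, not the tool's actual implementation.

```python
import re

# Assumed marker grammar, inferred from the templates on this page:
#   #if$_A ... #endif$_A            keep the body when A is selected
#   #if!$_A ... #endif!$_A          keep the body when A is NOT selected
#   #if$_A|$_B ... #endif$_A|$_B    keep the body when A or B is selected
_NAME = r"\$_[A-Za-z0-9]+"
_BLOCK = re.compile(
    r"#if(!?)(" + _NAME + r"(?:\|" + _NAME + r")*)(?![A-Za-z0-9|])"  # opening marker
    r"(.*?)"                                                          # body (shortest match)
    r"#endif\1\2(?![A-Za-z0-9|])",                                    # matching closing marker
    re.DOTALL,
)
_VAR = re.compile(r"\$_([A-Za-z0-9]+)")


def render(template, values, selected):
    """Resolve conditional blocks first, then substitute placeholders.

    values:   dict mapping placeholder names (without "$_") to text
    selected: set of option names (without "$_") that the user ticked
    """

    def resolve_block(match):
        negated = match.group(1) == "!"
        names = [part[2:] for part in match.group(2).split("|")]  # strip "$_"
        keep = any(name in selected for name in names)
        if negated:
            keep = not keep
        # Kept bodies may contain further (nested) blocks.
        return render_blocks(match.group(3)) if keep else ""

    def render_blocks(text):
        return _BLOCK.sub(resolve_block, text)

    def resolve_var(match):
        # Unknown placeholders are left untouched so they stay visible.
        return values.get(match.group(1), match.group(0))

    return _VAR.sub(resolve_var, render_blocks(template))


if __name__ == "__main__":
    demo = (
        "Data will be curated.#if$_DATAPLANT ARCs are used.#endif$_DATAPLANT"
        "#if!$_DATAPLANT A standard process is used.#endif!$_DATAPLANT"
        " Project: $_PROJECT."
    )
    print(render(demo, {"PROJECT": "DemoProject"}, {"DATAPLANT"}))
```

Because each closing marker repeats the condition of its opening marker, a back-reference is enough to pair them in this sketch, and blocks with different conditions can be nested. An opening marker without its matching `#endif` is simply left in the text, which is one reason to keep both markers intact when editing.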


Data Management Plan of $_PROJECT


Action Number:

$_PROJECT

Action Acronym:

$_PROJECT

Action Title:

$_PROJECT

Date:

DMP version:

$_DMPVERSION


1    Introduction

#if$_EU The $_PROJECT is part of the Open Data Initiative (ODI) of the EU. #endif$_EU To best profit from open data, it is necessary not only to store data but to make data Findable, Accessible, Interoperable, and Reusable (FAIR). #if$_PROTECT Open and FAIR data sharing, however, takes into account the need to protect individual data sets. #endif$_PROTECT

The aim of this document is to provide guidelines on principles guiding the data management in the $_PROJECT and what data will be stored by using the responses to the EU questionnaire on Data Management Plan (DMP) as a DMP document.

The detailed DMP describes how data will be handled during and after the project. The $_PROJECT DMP is modelled according to the Horizon 2020 and Horizon Europe Online Manual. #if$_UPDATE It will be updated, or its validity checked, several times during the $_PROJECT. At the very least, this will happen at month $_UPDATEMONTH. #endif$_UPDATE

2    Data Management Plan EU Template

2.1    Data Summary

What is the purpose of the data collection/generation and its relation to the objectives of the project?

For the $_PROJECT, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT through the DataPLANT ARC structure #endif$_DATAPLANT #if!$_DATAPLANT through a standardized data management process #endif!$_DATAPLANT are absolutely necessary, as data is used not only to understand principles but also for further analysis. In the end, stakeholders need to be informed about data provenance. It is therefore important that data is not only well generated but also well annotated with metadata using open standards, as laid out in the following section. This is essential because the $_PROJECT aims at $_PROJECTAIM.

What types and formats of data will the project generate/collect?

We foresee that the following data will be collected and generated at the very least: $_PHENOTYPIC, $_GENETIC, $_GENOMIC, $_METABOLOMIC, $_RNASEQ, data about $_STUDYOBJECT. In addition, data derived from the original raw data sets will be collected. This is important, as different analytical pipelines might yield different results or include ad-hoc data analysis parts#if$_DATAPLANT, and these pipelines will be tracked in the DataPLANT ARC#endif$_DATAPLANT. Therefore, specific care will be taken to document and archive these resources (including the analytic pipelines) as well#if$_DATAPLANT, relying on the vast expertise in the DataPLANT consortium#endif$_DATAPLANT.

Will you re-use any existing data and how?

The project builds on existing data sets and relies on them. #if$_RNASEQ For instance, without a proper genomic reference it is very difficult to analyze NGS data sets. #endif$_RNASEQ It is also important to include existing data sets on the expression and metabolic behaviour of $_STUDYOBJECT but, of course, also on existing characterization and the background knowledge#if$_PARTNERS of the partners#endif$_PARTNERS. Genomic references can simply be gathered from reference databases for genomes/sequences, like the National Center for Biotechnology Information: NCBI (US); European Bioinformatics Institute: EBI (EU); DNA Data Bank of Japan: DDBJ (JP). Furthermore, prior 'unstructured' data in the form of publications and data contained therein will be used for decision making.

What is the origin of the data?

Public data will be extracted as described in the previous paragraph. For the $_PROJECT, specific data sets will be generated by the consortium partners.

Data of different types or from different domains will be generated differently. For example:

#if$_PREVIOUSPROJECTS

Data from previous projects such as $_PREVIOUSPROJECTS will be considered.

#endif$_PREVIOUSPROJECTS

What is the expected size of the data?

We expect to generate raw data in the range of $_RAWDATA GB. The size of the derived data will be about $_DERIVEDDATA GB.

To whom might it be useful ('data utility')?

The data will be useful for the $_PROJECT partners, the scientific community working on $_STUDYOBJECT or the general public interested in $_STUDYOBJECT. Hence, the $_PROJECT also strives to collect the data that has been disseminated and potentially advertise it#if$_DATAPLANT e.g. through the DataPLANT platform or other means#endif$_DATAPLANT, if it is not included in a publication anyway, which is the most likely form of dissemination.

$_DATAUTILITY

2.2    FAIR data

Making data findable, including provisions for metadata

Are the data produced and/or used in the project discoverable with metadata, identifiable and locatable by means of a standard identification mechanism (e.g. persistent and unique identifiers such as Digital Object Identifiers)?

All data sets will receive unique identifiers, and they will be annotated with metadata. #if$_MIAPPE The $_PROJECT will rely on community standards plus additional recommendations necessary in plant science, e.g. by adopting suggestions from the Minimum Information About a Plant Phenotyping Experiment (MIAPPE). #endif$_MIAPPE Unlike cross-domain minimal sets such as Dublin Core, which mostly define the submitter and the general type of data being dealt with (e.g. images), domain-specific standards allow reuse by other researchers, as they also define properties of the plant (see the preceding section). Minimal cross-domain annotations are, of course, also part of the $_PROJECT. #if$_DATAPLANT The core integration with DataPLANT will also allow one to tag individual releases with a Digital Object Identifier (DOI). #endif$_DATAPLANT #if$_OTHERSTANDARDS $_OTHERSTANDARDINPUT #endif$_OTHERSTANDARDS

What naming conventions do you follow?

Data variables will use standard names. For example, this is the case for genes, metabolites, and proteins. These will also be linked to free biomedical ontologies where possible. In the case of datasets, the dataset names will also be made to be meaningful and human readable. In addition, traditional names will of course be included where necessary as synonyms.

Will search keywords be provided that optimize possibilities for re-use?

Keywords about the experiment and the general consortium will be included, as well as an abstract about the data, where useful. In addition, certain keywords can be auto-generated from dense metadata and its underlying ontologies. #if$_DATAPLANT Here, DataPLANT strives to complement these with standardized DataPLANT ontologies that are supplemented where the ontology does not yet include the variables. #endif$_DATAPLANT

Do you provide clear version numbers?

To maintain data integrity and to be able to re-analyze data, data sets will get version numbers where this is useful (e.g. raw data must not be changed and will not get a version number and is considered immutable). #if$_DATAPLANT This is automatically supported by the ARC Git DataPLANT infrastructure. #endif$_DATAPLANT

What metadata will be created? In case metadata standards do not exist in your discipline, please outline what type of metadata will be created and how.

We foresee using the Investigation, Study, Assay (ISA) specification for metadata creation. #if$_RNASEQ|$_GENOMIC For specific data, e.g. RNAseq or genomic data, we use the metadata templates of the end-point repositories. #if$_MINSEQE The Minimum Information about a high-throughput Nucleotide Sequencing Experiment (MinSEQe) will also be used. #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC #if$_METABOLOMIC The Metabolights submission-compliant standards are used for metabolomic data.#issuewarning Some metabolomics partners do not consider Metabolights an accepted standard.#endissuewarning#endif$_METABOLOMIC The plant community uses #if$_MIAPPE MIAPPE for phenotyping data in the broadest sense, but we will also be relying on #endif$_MIAPPE specific SOPs for additional annotations#if$_DATAPLANT, relying on advanced DataPLANT annotation and ontologies#endif$_DATAPLANT.

Making data openly accessible

Which data produced and/or used in the project will be made openly available as the default? If certain datasets cannot be shared (or need to be shared under restrictions), we explain why, clearly separating legal and contractual reasons from voluntary restrictions.

By default, all data sets from the $_PROJECT will be shared with the community and made openly available. This will happen, however, only after partners have had the opportunity to check for IP protection (according to agreements and background rights). #if$_INDUSTRY This applies in particular to data pertaining to the industry. #endif$_INDUSTRY However, all partners also strive for IP protection of data sets, which will be examined with due diligence.

Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if relevant provisions are made in the consortium agreement and are in line with the reasons for opting out.

How will the data be made accessible (e.g. by deposition in a repository)?

#if!$_DATAPLANT Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. In addition, data that can be stored in international discipline-specific repositories that use specialized technologies (Sequencing at the national US center $_NCBI, $_GEO; Sequencing and sequence data at the EU center: EBI, $_ENA; Proteome database: $_PRIDE; metabolomic database: $_METABOLIGHTS; #if$_OTHEREP and $_OTHEREP #endif$_OTHEREP ) will be stored and processed there as well. #endif!$_DATAPLANT

As noted above, specialized repositories like SRA/ENA and PRIDE/ProteomeXchange are the most common ones and will be used when appropriate. Unstructured, less standardized data (e.g. experimental phenotypic measurements) will be annotated with metadata and, if complete, given a digital object identifier (DOI). #if$_DATAPLANT The whole data sets wrapped into an ARC will get DOIs as well. The ARC and the converters provided by DataPLANT will ensure that the upload into the endpoint repositories is fast and easy. #endif$_DATAPLANT

What methods or software tools are needed to access the data?

#if$_PROPRIETARY The $_PROJECT relies on the tool(s) $_PROPRIETARY. #endif$_PROPRIETARY

#if!$_PROPRIETARY No specialized software will be needed to access the data, usually just a modern browser. Access will be possible through web interfaces. For data processing after obtaining raw data, typical open-source software can be used. #endif!$_PROPRIETARY

#if$_DATAPLANT DataPLANT offers tools such as the open-source SWATE plugin for Excel, the ARC commander, and the DMP tool, which are not strictly necessary but make the interaction with data more convenient. #endif$_DATAPLANT

Is documentation about the software needed to access the data included?

As no software is needed, no documentation needs to be provided. #if$_DATAPLANT However, DataPLANT resources are well described, and their setup is documented on their GitHub project pages. #endif$_DATAPLANT

Is it possible to include the relevant software (e.g. in open-source code)?

As stated above, here we use publicly available open-source and well-documented certified software #if$_PROPRIETARY except for $_PROPRIETARY #endif$_PROPRIETARY.

Where will the data and associated metadata, documentation and code be deposited? Preference should be given to certified repositories that support open access, where possible.

As noted above, specialized repositories like SRA/ENA and PRIDE/ProteomeXchange are the most common ones and will be used when appropriate. Unstructured, less standardized data (e.g. experimental phenotypic measurements) will be annotated with metadata and, if complete, given a digital object identifier (DOI)#if$_DATAPLANT, and the whole data sets wrapped into an ARC will get DOIs as well#endif$_DATAPLANT.

Have you explored appropriate arrangements with the identified repository?

Submission is free of charge, and it is the goal (at least of ENA) to obtain as much data as possible. Therefore, arrangements are neither necessary nor useful. Catch-all repositories are not required. #if$_DATAPLANT For DataPLANT, this has been agreed upon. #issuewarning Please use endpoint repositories to store or archive your data after publication. #endissuewarning #endif$_DATAPLANT

If there are restrictions on use, how will access be provided?

There are no restrictions, beyond the aforementioned IP checks, which are in line with e.g. European open data policies.

Is there a need for a data access committee?

Consequently, there is no need for a committee.

Are there well described conditions for access (i.e. a machine-readable license)?

Yes, where possible, e.g. CC REL will be used for data not submitted to specialized repositories such as ENA.

How will the identity of the person accessing the data be ascertained?

If data is shared only within the consortium because it is not yet finished or is under IP checks, it is hosted internally, and a username and password will be required (see also our GDPR rules). When data is made public in final EU or US repositories, completely anonymous access is normally allowed. This is the case for ENA as well, and both are in line with GDPR requirements.

#if$_DATAPLANT Currently, data management relies on the Annotated Research Context (ARC). It is password protected, so before any data can be obtained or samples generated, authentication needs to take place. #endif$_DATAPLANT

Making data interoperable

Are the data produced in the project interoperable, that is allowing data exchange and re-use between researchers, institutions, organizations, countries, etc. (i.e. adhering to standards for formats, as much as possible compliant with available (open) software applications, and in particular facilitating re-combinations with different datasets from different origins)?

Whenever possible, data will be stored in common and openly defined formats including all the necessary metadata to interpret and analyze data in a biological context. By default, no proprietary formats will be used; however, Microsoft Excel files (according to ISO/IEC 29500-1:2016) might be used as intermediates by the consortium#if$_DATAPLANT and by some ARC components in form#endif$_DATAPLANT. In addition, text documents might be edited in word processor formats but will be shared as PDF.

What data and metadata vocabularies, standards or methodologies will you follow to make your data interoperable?

As mentioned above, we foresee using #if$_RNASEQ|$_GENOMIC e.g. #if$_MINSEQE MinSEQe for sequencing data and #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC Metabolights-compatible forms for metabolites #if$_MIAPPE as well as MIAPPE for phenotyping-like data #endif$_MIAPPE. The latter will thus allow the integration of data across projects and safeguard the reuse of established and tested protocols. Additionally, we will use ontology terms to enrich the data sets, relying on free and open ontologies. Further ontology terms might be created and canonized during the $_PROJECT.

Will you be using standard vocabularies for all data types present in your data set, to allow inter-disciplinary interoperability?

In fact, open biomedical ontologies will be used where they are mature. As stated in the previous question, sometimes ontologies and controlled vocabularies might have to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the advanced ontologies developed in DataPLANT. #endif$_DATAPLANT

In case it is unavoidable that you use uncommon or generate project specific ontologies or vocabularies, will you provide mappings to more commonly used ontologies?

Common and open ontologies will be used, hence this question does not apply.

Increase data reuse (by clarifying licences)

How will the data be licensed to permit the widest re-use possible?

Open licenses, such as Creative Commons CC, will be used whenever possible.

When will the data be made available for re-use? If an embargo is sought to give time to publish or seek patents, specify why and how long this will apply, bearing in mind that research data should be made available as soon as possible.

#if$_early The data will be published as soon as possible to guarantee reusability. #endif$_early #if$_ipissue In general, IP issues will first be checked. #endif$_ipissue All consortium partners will be encouraged to make data available prior to publication, openly and/or under pre-publication agreements#if$_GENOMIC such as those started in Fort Lauderdale and set forth by the Toronto International Data Release Workshop#endif$_GENOMIC.

Are the data produced and/or used in the project usable by third parties, in particular after the end of the project? If the re-use of some data is restricted, explain why.

There will be no restrictions once the data is made public.

How long is it intended that the data remains re-usable?

Data will be made available for many years#if$_DATAPLANT and potentially indefinitely after the end of the project#endif$_DATAPLANT.

In any case, data submitted to repositories (as detailed above), e.g. ENA/PRIDE, would be subject to local data storage regulation.

Are data quality assurance processes described?

The data will be checked and curated. #if$_DATAPLANT Furthermore, data will be analyzed for quality control (QC) problems using automatic procedures as well as by manual curation. #endif$_DATAPLANT

2.3    Allocation of resources

What are the costs for making data FAIR in your project?

The costs comprise data curation, #if$_DATAPLANT ARC consistency checks, #endif$_DATAPLANT and maintenance on the $_PROJECT's side.

Additionally, storage costs are incurred at last-level repositories (e.g. ENA); these are not charged to the $_PROJECT or its members but are covered by the operating budget of these repositories.

How will these be covered? Note that costs related to open access to research data are eligible as part of the Horizon 2020 or Horizon Europe grant (if compliant with the Grant Agreement conditions).

A large part of the cost is covered by the $_PROJECT. #if$_DATAPLANT In addition, the $_PROJECT draws on the structures, tools, and knowledge laid down in the DataPLANT consortium. #endif$_DATAPLANT

Who will be responsible for data management in your project?

$_DATAOFFICER will be responsible for data management.

Are the resources for long term preservation discussed (costs and potential value, who decides and how/what data will be kept and for how long)?

The person(s) responsible for the data (data officer #if$_PARTNERS or $_PARTNERS #endif$_PARTNERS) decide on the preservation of data not submitted to end-point subject area repositories #if$_DATAPLANT or ARCs in DataPLANT #endif$_DATAPLANT after project end. This will be in line with EU and institute policies, and with data sharing based on EU and international standards.

2.4    Data security

What provisions are in place for data security (including data recovery as well as secure storage and transfer of sensitive data)?

Once data is transferred to the $_PROJECT platform#if$_DATAPLANT and ARCs have been generated in DataPLANT#endif$_DATAPLANT, data security will be imposed. This comprises secure storage and the use of usernames and passwords, which are generally transferred via separate safe media.

Is the data safely stored in certified repositories for long term preservation and curation?

Wherever there are certified repositories, these will be used as end-point repositories. #if$_RNASEQ Transcriptomics data and gene sequence data will also be made available upon publication via the standard repositories ENA/SRA, #endif$_RNASEQ #if$_METABOLOMIC metabolite data in e.g. Metabolights (and/or national repositories such as the German NFDI or the French INRAE), #endif$_METABOLOMIC #if$_PROTEOMIC Proteomics data in e.g. PRIDE/ProteomeXchange #endif$_PROTEOMIC. In addition, the national resource will safeguard the data after the project ends. #if$_DATAPLANT In addition, databases such as ProteomeXchange do not support deep plant-specific metadata; hence ARCs will be maintained to ensure reusability. #endif$_DATAPLANT

2.5    Ethical aspects

Are there any ethical or legal issues that can have an impact on data sharing? These can also be discussed in the context of an ethics review. If relevant, include references to ethics deliverables and ethics chapter in the Description of the Action (DoA).

At the moment, we do not anticipate ethical or legal issues with data sharing. In terms of ethics, since this is plant data, there is no need for an ethics committee; however, due diligence for plant resource benefit sharing is considered (see Nagoya Protocol). #issuewarning You have to check this and enter any due diligence here. At the moment, we are waiting to see whether sequence information will also become part of the Nagoya Protocol. In any case, if you use material that is not from your (partner) country and characterize it physically (e.g. metabolites, proteome) or biochemically (e.g. RNASeq), this might represent a Nagoya-relevant action, unless the material is from e.g. the US (non partner) or Ireland (not signed; still contact them). Other laws might apply, however. #endissuewarning

Is informed consent for data sharing and long term preservation included in questionnaires dealing with personal data?

The only personal data that will potentially be stored is the submitter name and affiliation in the metadata for data. In addition, personal data will be collected for dissemination and communication activities using specific methods and procedures developed by the $_PROJECT partners to adhere to data protection. #issuewarning You need to inform people that you store email addresses and names, or even pseudonyms such as Twitter handles, and should preferably obtain their WRITTEN consent. We are very sorry about these issues; we did not invent them. #endissuewarning

2.6    Other issues

Do you make use of other national/funder/sectorial/departmental procedures for data management? If yes, which ones?

Yes, the $_PROJECT will use common Research Data Management (RDM) tools#if$_DATAPLANT and in particular resources developed by the NFDI of Germany#endif$_DATAPLANT.

3     Annexes

3.1     Abbreviations

#if$_DATAPLANT

ARC Annotated Research Context

#endif$_DATAPLANT

CC Creative Commons

CC REL Creative Commons Rights Expression Language

DDBJ DNA Data Bank of Japan

DMP Data Management Plan

DoA Description of Action

DOI Digital Object Identifier

EBI European Bioinformatics Institute

ENA European Nucleotide Archive

EU European Union

FAIR Findable, Accessible, Interoperable, Reusable

GDPR General Data Protection Regulation (of the EU)

IP Intellectual Property

ISO International Organization for Standardization

MIAMET Minimum Information About a Metabolomics Experiment

MIAPPE Minimum Information About a Plant Phenotyping Experiment

MinSEQe Minimum Information about a high-throughput Nucleotide Sequencing Experiment

NCBI National Center for Biotechnology Information

NFDI National Research Data Infrastructure (of Germany)

NGS Next Generation Sequencing

RDM Research Data Management

RNASeq RNA Sequencing

SOP Standard Operating Procedures

SRA Sequence Read Archive

#if$_DATAPLANT

SWATE Swate Workflow Annotation Tool for Excel

#endif$_DATAPLANT

ONP Oxford Nanopore

qRTPCR quantitative real time polymerase chain reaction

WP Work Package


Data Management Plan required by DFG

1.    Data description

1.1    Introduction

#if$_EU

The $_PROJECT is part of the Open Data Initiative (ODI) of the EU. #endif$_EU To best profit from open data, it is necessary not only to store data but to make data Findable, Accessible, Interoperable and Reusable (FAIR). #if$_PROTECT Open and FAIR data sharing, however, takes into account the need to protect individual data sets. #endif$_PROTECT

The aim of this document is to provide guidelines on principles guiding the data management in the $_PROJECT and what data will be stored by using the responses to the DFG Data Management Plan (DMP) checklist to generate a DMP document.

The detailed DMP describes how data will be handled during and after the project. The $_PROJECT DMP is modelled according to the DFG data management checklist. #if$_UPDATE It will be updated, or its validity checked, several times during the $_PROJECT. At the very least, this will happen at month $_UPDATEMONTH. #endif$_UPDATE

1.2    How does your project generate new data?

Data of different types or from different domains will be generated differently. For example:

For the $_PROJECT, data collection#if!$_VVISUALIZATION and integration #endif!$_VVISUALIZATION#if$_VVISUALIZATION, integration and visualization #endif$_VVISUALIZATION #if$_DATAPLANT through the DataPLANT ARC structure #endif$_DATAPLANT are absolutely necessary, as data is used not only to understand principles but also for further analysis. In the end, stakeholders need to be informed about data provenance. It is therefore important that data is not only well generated but also well annotated with metadata using open standards, as laid out in the following section. This is essential because the $_PROJECT aims at $_PROJECTAIM.

Public data will be extracted as described in section 1.3. For the $_PROJECT, specific data sets will be generated by the consortium partners.

1.3    Is existing data reused?

The project builds on existing data sets and relies on them. #if$_RNASEQ For instance, without a proper genomic reference it is very difficult to analyze NGS data sets.#endif$_RNASEQ It is also important to include existing data sets on the expression and metabolic behaviour of $_STUDYOBJECT but, of course, also on existing characterization and the background knowledge#if$_PARTNERS of the partners#endif$_PARTNERS. Genomic references can simply be gathered from reference databases for genomes/sequences, like the National Center for Biotechnology Information: NCBI (US); European Bioinformatics Institute: EBI (EU); DNA Data Bank of Japan: DDBJ (JP). Furthermore, prior “unstructured” data in the form of publications and data contained therein will be used for decision making.

1.4    Which data types (in terms of data formats like image data, text data or measurement data) arise in your project and in what way are they further processed?

We foresee that the following data about $_STUDYOBJECT will be collected and generated at the very least: $_PHENOTYPIC, $_GENETIC, $_GENOMIC, $_METABOLOMIC, $_RNASEQ, $_IMAGE, $_PROTEOMIC, $_TARGETED, $_MODELS, $_CODE, $_EXCEL, $_CLONED-DNA and result data. Furthermore, data derived from the original raw data sets will be collected. This is important, as different analytical pipelines might yield different results or include ad-hoc data analysis parts#if$_DATAPLANT, and these pipelines will be tracked in the DataPLANT ARC#endif$_DATAPLANT. Therefore, specific care will be taken to document and archive these resources (including the analytic pipelines) as well#if$_DATAPLANT, relying on the vast expertise in the DataPLANT consortium#endif$_DATAPLANT.

1.5    To what extent do these arise or what is the anticipated data volume?

We expect to generate raw data in the range of $_RAWDATA GB. The size of the derived data will be about $_DERIVEDDATA GB.

2.    Documentation and data quality

2.1.    What approaches are being taken to describe the data in a comprehensible manner (such as the use of available metadata, documentation standards or ontologies)?

All data sets will receive unique identifiers, and they will be annotated with metadata. #if$_MIAPPE The $_PROJECT will rely on community standards plus additional recommendations necessary in plant science, e.g. by adopting suggestions from the Minimum Information About a Plant Phenotyping Experiment (MIAPPE). #endif$_MIAPPE Unlike cross-domain minimal sets such as Dublin Core, which mostly define the submitter and the general type of data being dealt with (e.g. images), domain-specific standards allow reuse by other researchers, as they also define properties of the plant (see the preceding section). Minimal cross-domain annotations are, of course, also part of the $_PROJECT. #if$_DATAPLANT The core integration with DataPLANT will also allow one to tag individual releases with a Digital Object Identifier (DOI). #endif$_DATAPLANT #if$_OTHERSTANDARDS $_OTHERSTANDARDINPUT #endif$_OTHERSTANDARDS

Keywords about the experiment and the general consortium will be included, as well as an abstract about the data, where useful. In addition, certain keywords can be auto-generated from dense metadata and its underlying ontologies. #if$_DATAPLANT Here, DataPLANT strives to complement these with standardized DataPLANT ontologies that are supplemented where the ontology does not yet include the variables. #endif$_DATAPLANT

We foresee using the Investigation, Study, Assay (ISA) specification for metadata creation. #if$_RNASEQ|$_GENOMIC For specific data, e.g. RNAseq or genomic data, we use the metadata templates of the end-point repositories. #if$_MINSEQE The Minimum Information about a high-throughput Nucleotide Sequencing Experiment (MinSEQe) will also be used. #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC #if$_METABOLOMIC The Metabolights submission-compliant standards are used for metabolomics data.#issuewarning Some metabolomics partners do not consider Metabolights an accepted standard.#endissuewarning#endif$_METABOLOMIC The plant community uses #if$_MIAPPE MIAPPE for phenotyping data in the broadest sense, but we will also be relying on #endif$_MIAPPE specific SOPs for additional annotations#if$_DATAPLANT, relying on advanced DataPLANT annotation and ontologies#endif$_DATAPLANT.

In fact, open biomedical ontologies will be used where they are mature. As stated in the previous question, sometimes ontologies and controlled vocabularies might have to be extended. #if$_DATAPLANT Here, the $_PROJECT will build on the advanced ontologies developed in DataPLANT. #endif$_DATAPLANT

2.2    What measures are being adopted to ensure high data quality?

Data variables will use standard names. For example, this is the case for genes, metabolites, and proteins. These will also be linked to free biomedical ontologies where possible. In the case of datasets, the dataset names will also be made to be meaningful and human readable. In addition, traditional names will of course be included where necessary as synonyms.

To maintain data integrity and to be able to re-analyze data, data sets will get version numbers where this is useful (e.g. raw data must not be changed and will not get a version number and is considered immutable). #if$_DATAPLANT This is automatically supported by the ARC Git DataPLANT infrastructure. #endif$_DATAPLANT

As mentioned above, we foresee using e.g. #if$_RNASEQ|$_GENOMIC #if$_MINSEQE MinSEQe for sequencing data and #endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC Metabolights-compatible forms for metabolites#if$_MIAPPE as well as MIAPPE for phenotyping-like data#endif$_MIAPPE. The latter will thus allow the integration of data across projects and safeguard the reuse of established and tested protocols. Additionally, we will use ontology terms to enrich the data sets, relying on free and open ontologies. Further ontology terms might be created and canonized during the $_PROJECT.

2.3    Are quality controls in place and if so, how do they operate?

The data will be checked and curated throughout the project period. #if$_DATAPLANT Furthermore, data will be analyzed for quality control (QC) problems using automatic procedures as well as by manual curation. #endif$_DATAPLANT PhD students and lab professionals will be responsible for the first-hand quality control. Afterwards, the data will be checked and annotated by $_DATAOFFICER. #if$_RNASEQ|$_GENOMIC FastQC will be run to check the base-calling quality. #endif$_RNASEQ|$_GENOMIC Before publication, the data will be checked again.

2.4    Which digital methods and tools (e.g. software) are required to use the data?

The $_PROJECT will use common Research Data Management (RDM) tools#if$_DATAPLANT and in particular resources developed by the NFDI of Germany#endif$_DATAPLANT.

#if$_PROPRIETARY The $_PROJECT relies on the tool(s) $_PROPRIETARY. #endif$_PROPRIETARY

#if!$_PROPRIETARY No specialized software will be needed to access the data, usually just a modern browser. Access will be possible through web interfaces. For data processing after obtaining raw data, typical open-source software can be used. #endif!$_PROPRIETARY

As no software is needed, no documentation needs to be provided. #if$_DATAPLANT However, DataPLANT resources are well described, and their setup is documented on their GitHub project pages. #endif$_DATAPLANT

#if$_DATAPLANT DataPLANT offers tools such as the open-source SWATE plugin for Excel, the ARC commander, and the DMP tool, which are not strictly necessary but will make the interaction with data more convenient. #endif$_DATAPLANT

As stated above, here we use publicly available open-source and well-documented certified software #if$_PROPRIETARY except for $_PROPRIETARY #endif$_PROPRIETARY.

3.    Storage and technical archiving in the project

3.1    How is the data to be stored and archived throughout the project duration?

Wherever there are certified repositories, these will be used as end-point repositories. #if$_RNASEQ Transcriptomics data and gene sequence data will also be made available upon publication via the standard repositories ENA/SRA. #endif$_RNASEQ #if$_METABOLOMIC Metabolite data will be deposited in e.g. Metabolights (and/or national repositories such as the German NFDI or the French INRAe). #endif$_METABOLOMIC #if$_PROTEOMIC Proteomics data will be deposited in e.g. Pride/Proteomexchange. #endif$_PROTEOMIC In addition, the national resource will safeguard the data after the project ends. #if$_DATAPLANT Moreover, databases such as Proteomexchange do not support deep plant-specific metadata; hence ARCs will be maintained to ensure reusability. #endif$_DATAPLANT

Data will be made available for many years#if$_DATAPLANT and potentially indefinitely after the end of the project#endif$_DATAPLANT.

In any case, data submitted to international discipline-related repositories that use specialized technologies (as detailed above), e.g. ENA/Pride, would be subject to local data storage regulation.

3.2    What is in place to secure sensitive data throughout the project duration (access and usage rights)?

#if$_DATAPLANT In DataPLANT, data management relies on the Annotated Research Context (ARC). It is password protected, so before any data can be obtained or samples generated, an authentication needs to take place. #endif$_DATAPLANT

In case data is only shared within the consortium, if the data is not yet finished or under IP checks, the data is hosted internally, and a username and password will be required (see also our GDPR rules). In case the data is made public in final EU or US repositories, completely anonymous access is normally allowed; this is the case for ENA as well, and both are in line with GDPR requirements.

There will be no restrictions once the data is made public.

4.    Legal obligations and conditions

4.1    What are the legal specifics associated with the handling of research data in your project?

At the moment, we do not anticipate ethical or legal issues with data sharing. In terms of ethics, since this is plant data, there is no need for an ethics committee; however, due diligence regarding plant resource benefit sharing is considered (see the Nagoya Protocol). #issuewarning You have to check this and enter any due diligence here. At the moment, we are awaiting a decision on whether sequence information will also fall under the Nagoya Protocol. In any case, if you use material that is not from your (partner) country and characterize it physically or biochemically (e.g., metabolites, proteome, RNASeq), this might represent a Nagoya-relevant action, unless the material is from e.g. the US (non-partner) or Ireland (not signed; still contact them). Other laws might apply. #endissuewarning

The only personal data that will potentially be stored is the submitter name and affiliation in the metadata of the data. In addition, personal data will be collected for dissemination and communication activities using specific methods and procedures developed by the $_PROJECT partners to adhere to data protection. #issuewarning You need to inform people and, better, get WRITTEN consent that you store emails and names or even pseudonyms such as Twitter handles. We are very sorry about these issues; we didn't invent them. #endissuewarning

4.2    Do you anticipate any implications or restrictions regarding subsequent publication or accessibility?

Once data is transferred to the $_PROJECT platform#if$_DATAPLANT and ARCs have been generated in DataPLANT#endif$_DATAPLANT, data security will be imposed. This comprises secure storage and the use of usernames and passwords, which are generally transferred via separate secure media.

4.3    What is in place to consider aspects of use and copyright law as well as ownership issues?

Open licenses, such as Creative Commons (CC), will be used whenever possible.

4.4    Are there any significant research codes or professional standards to be taken into account?

Whenever possible, data will be stored in common and openly defined formats including all the necessary metadata to interpret and analyze data in a biological context. By default, no proprietary formats will be used; however, Microsoft Excel files (according to ISO/IEC 29500-1:2016) might be used as intermediates by the consortium#if$_DATAPLANT and by some ARC components in form#endif$_DATAPLANT. In addition, text files might be edited in word processor formats, but will be shared as PDF.

5.    Data exchange and long-term data accessibility

5.1    Which data sets are especially suitable for use in other contexts?

The data will be useful for the $_PROJECT partners, the scientific community working on $_STUDYOBJECT or the general public interested in $_STUDYOBJECT. Hence, the $_PROJECT also strives to collect the data that has been disseminated and potentially advertise it#if$_DATAPLANT e.g. through the DataPLANT platform or other means #endif$_DATAPLANT, if it is not included in a publication anyway, which is the most likely form of dissemination.

5.2    Which criteria are used to select research data to make it available for subsequent use by others?

By default, all data sets from the $_PROJECT will be shared with the community and made openly available. This will, however, happen only after partners have had the opportunity to check for IP protection (according to agreements and background rights). #if$_INDUSTRY This applies in particular to data pertaining to the industry. #endif$_INDUSTRY All partners also strive for IP protection of data sets where applicable; this will be checked with due diligence.

Note that in multi-beneficiary projects it is also possible for specific beneficiaries to keep their data closed if relevant provisions are made in the consortium agreement and are in line with the reasons for opting out.

5.3    Are you planning to archive your data in a suitable infrastructure?

#if$_DATAPLANT As the $_PROJECT is closely aligned with DataPLANT, the ARC converter and DataHUB will be used to find the end-point repositories and upload to the repositories automatically. #endif$_DATAPLANT

#if!$_DATAPLANT Data will be made available via the $_PROJECT platform using a user-friendly front end that allows data visualization. In addition, data that can be stored in international discipline-related repositories that use specialized technologies (sequencing data at the US national center: $_NCBI, $_GEO; sequencing and sequence data at the EU center: EBI, $_ENA; proteome database: $_PRIDE; metabolomic database: $_METABOLIGHTS; #if$_OTHEREP and $_OTHEREP #endif$_OTHEREP ) will be deposited and processed there as well. The ARC and the converters provided by DataPLANT will guarantee that the upload into the end-point repositories is fast and easy.#endif!$_DATAPLANT

As noted above, specialized repositories like SRA/ENA and Pride/Proteomexchange are the most common ones and will be used when appropriate. Unstructured, less standardized data (e.g. experimental phenotypic measurements) will be annotated with metadata and, if complete, given a digital object identifier (DOI). #if$_DATAPLANT The whole data sets wrapped into an ARC will get DOIs as well. #endif$_DATAPLANT

Submission is free of charge, and it is the goal (at least of ENA) to obtain as much data as possible. Therefore, arrangements are neither necessary nor useful. Catch-all repositories are not required. #if$_DATAPLANT For DataPLANT, this has been agreed upon. #issuewarning Please use end-point repositories to store or archive your data after publication. #endissuewarning #endif$_DATAPLANT

5.4    If so, how and where? Are there any retention periods?

There are no restrictions, beyond the aforementioned IP checks, which are in line with e.g. European open data policies.

The $_PARTNERS decides on preservation of data not submitted to end-point subject area repositories #if$_DATAPLANT or ARCs in DataPLANT#endif$_DATAPLANT after project end. This will be in line with EU institute policies and data sharing based on EU and international standards.

5.5    When is the research data available for use by third parties?

#if$_early The data will be published as soon as possible to guarantee reusability. #endif$_early #if$_ipissue In general, IP issues will first be checked. #endif$_ipissue All consortium partners will be encouraged to make data available prior to publication openly and/or under pre-publication agreements #if$_GENOMIC such as those started in Fort Lauderdale and set forth by the Toronto International Data Release Workshop. #endif$_GENOMIC

6.    Responsibilities and resources

6.1    Who is responsible for adequate handling of the research data (description of roles and responsibilities within the project)?

$_DATAOFFICER will be responsible as data officer. The data responsible(s) (data officer#if$_PARTNERS or $_PARTNERS #endif$_PARTNERS) decide on the preservation of data not submitted to end-point subject area repositories #if$_DATAPLANT or ARCs in DataPLANT #endif$_DATAPLANT after the project end. This will be in line with EU institute policies and data sharing based on EU and international standards.

6.2    Which resources (costs; time or other) are required to implement adequate handling of research data within the project?

The costs comprise data curation, #if$_DATAPLANT ARC consistency checks, #endif$_DATAPLANT and maintenance on the $_PROJECT's side.

Additionally, last-level storage costs are incurred by the end-point repositories (e.g. ENA); these are not charged to the $_PROJECT or its members but are covered by the operating budgets of these repositories.

A large part of the cost is covered by the $_PROJECT #if$_DATAPLANT and the structures, tools and knowledge laid down in the DataPLANT consortium. #endif$_DATAPLANT

6.3    Who is responsible for curating the data once the project has ended?

As applicable, $_DATAOFFICER, who is responsible for ongoing data maintenance, will also take care of it after the end of the $_PROJECT. #if$_DATAPLANT DataPLANT, as an external data archive, may provide such services in some cases. #endif$_DATAPLANT

7     Annexes

7.1     Abbreviations

#if$_DATAPLANT

ARC Annotated Research Context

#endif$_DATAPLANT

CC Creative Commons

CC REL Creative Commons Rights Expression Language

DDBJ DNA Data Bank of Japan

DMP Data Management Plan

DoA Description of Action

DOI Digital Object Identifier

EBI European Bioinformatics Institute

ENA European Nucleotide Archive

EU European Union

FAIR Findable, Accessible, Interoperable, Reusable

GDPR General Data Protection Regulation (of the EU)

IP Intellectual Property

ISO International Organization for Standardization

MIAMET Minimum Information About a Metabolomics Experiment

MIAPPE Minimum Information About a Plant Phenotyping Experiment

MinSEQe Minimum Information about a high-throughput Sequencing Experiment

NCBI National Center for Biotechnology Information

NFDI National Research Data Infrastructure (of Germany)

NGS Next Generation Sequencing

RDM Research Data Management

RNASeq RNA Sequencing

SOP Standard Operating Procedures

SRA Sequence Read Archive (formerly Short Read Archive)

#if$_DATAPLANT

SWATE Swate Workflow Annotation Tool for Excel

#endif$_DATAPLANT

ONP Oxford Nanopore

qRTPCR quantitative real time polymerase chain reaction

WP Work Package

Practical Data Management Guide of the $_PROJECT

This practical guide to data management in the $_PROJECT should be considered a minimum description, leaving flexibility to include additional actions specific to a domain or required by national or local legislation.#if$_EU The $_PROJECT will follow EU FAIR principles. #endif$_EU 


The practical guide to data management in the $_PROJECT aims to provide a complete walkthrough for the researcher. The contents are customized based on the user input in the Data Management Plan Generator (DMPG). The practices in this guide are tailored to the relevant legal, ethical, standardization and funding body requirements. They cover all steps of the data management life cycle:


  1. Data acquisition:

    1. Data generation

Data should be generated by devices that are compatible with open-source formats. The $_STUDYOBJECT should be compliant with biodiversity protocols. The protocols used to collect $_PHENOTYPIC, $_GENETIC, $_GENOMIC, $_METABOLOMIC, $_RNASEQ data about $_STUDYOBJECT will be stored#if$_DATAPLANT in the assays folder of ARC repositories.#endif$_DATAPLANT#if!$_DATAPLANT in a FAIR data storage. #endif!$_DATAPLANT 

  1. Data collection

The data collection process is conducted by experimental scientists and stewarded by $_DATAOFFICER.#if$_DATAPLANT An electronic lab notebook will be used to ensure that enough metadata is recorded and to guarantee that the data can be reused.#endif$_DATAPLANT 

  1. Data Organization

The data organization process is conducted by $_DATAOFFICER. The detailed organization method and procedure are reported to the PIs. #if$_DATAPLANT The data organization will profit from the knowledge base and database of DataPLANT; elastic search will be used to find better ways to organize the data. #endif$_DATAPLANT 



  1. Annotation

    1. Workflow documentation

The data collection process is conducted by experimental scientists and stewarded by $_DATAOFFICER.#if$_DATAPLANT An electronic lab notebook is used to ensure that enough metadata is recorded and to guarantee that the data can be reused. The workflow can be retrieved from the electronic lab notebook using the toolkits provided by DataPLANT, such as SWATE and the ARC commander. #endif$_DATAPLANT 

  1. Metadata completion

Some of the metadata may still be missing from the documentation provided by the experimental scientists and the data officer. #if$_DATAPLANT Raw data identifiers and parsers provided by DataPLANT will be used to extract metadata directly from the raw data files. The metadata extracted from the raw data files can also be used to validate the previously collected metadata and catch mistakes. #endif$_DATAPLANT We foresee using #if$_RNASEQ|$_GENOMIC e.g.#if$_MINSEQE MinSEQe for sequencing data and#endif$_MINSEQE #endif$_RNASEQ|$_GENOMIC Metabolights compatible forms for metabolites as well as MIAPPE for phenotyping-like data. The latter will thus allow the integration of data across projects and safeguard the reuse of established and tested protocols. Additionally, we will use ontology terms to enrich the data sets, relying on free and open ontologies. Further ontology terms might be created and canonized during the $_PROJECT.
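Validating recorded metadata against metadata parsed from a raw file can be pictured as a field-by-field comparison. The sketch below is illustrative only: it does not use the DataPLANT parsers, and the function name `metadata_mismatches` and the field names are hypothetical.

```python
def metadata_mismatches(recorded: dict, parsed: dict) -> dict:
    """Compare manually recorded metadata with metadata parsed from a raw file.

    Returns {field: (recorded_value, parsed_value)} for every parsed field that
    is missing from the record (recorded_value is None) or has a different value.
    """
    issues = {}
    for field, parsed_value in parsed.items():
        recorded_value = recorded.get(field)
        if recorded_value != parsed_value:
            issues[field] = (recorded_value, parsed_value)
    return issues


# Hypothetical example: the notebook entry has a wrong instrument name
# and lacks the acquisition date that the raw file header provides.
recorded = {"sample_id": "S-001", "instrument": "QExactive"}
parsed = {
    "sample_id": "S-001",
    "instrument": "Q Exactive HF",
    "acquisition_date": "2023-05-04",
}
issues = metadata_mismatches(recorded, parsed)
```

Here `issues` contains the `instrument` and `acquisition_date` fields, which a curator could then correct or complete in the record.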


  1. Maintenance: 

  1. Data storage

Raw data collected in previous steps are stored immediately #if$_DATAPLANT using the infrastructure of DataPLANT. #endif$_DATAPLANT #if!$_DATAPLANT in a secure infrastructure. ARCs (Annotated Research Contexts) are used as containers to store the raw data as well as metadata and workflows.#endif!$_DATAPLANT

  1. Data curation

#if$_DATAPLANT Data stored in ARCs is curated regularly whenever updates or revisions are needed.#endif$_DATAPLANT #if!$_DATAPLANT Data is curated regularly whenever updates or revisions are needed.#endif!$_DATAPLANT



  1. Publication and sharing

    1. Data publishing

#if$_RNASEQ Transcriptomics data and gene sequence data will also be made available upon publication via the standard repositories ENA/SRA. #endif$_RNASEQ #if$_METABOLOMIC Metabolite data will be deposited in e.g. Metabolights (and/or national repositories such as the German NFDI or the French INRAe). #endif$_METABOLOMIC #if$_PROTEOMIC Proteomics data will be deposited in e.g. Pride/Proteomexchange. #endif$_PROTEOMIC In addition, the national resource will safeguard the data after the project ends. #if$_DATAPLANT Moreover, databases such as Proteomexchange do not support deep plant-specific metadata; hence ARCs will be maintained to ensure reusability. #endif$_DATAPLANT

  1. Data sharing

In case data is only shared within the consortium, if the data is not yet finished or under IP checks, the data is hosted internally, and the username and the password will be required (see also our GDPR rules). In the case data is made public under final EU or US repositories, completely anonymous access is normally allowed; this is the case for ENA as well and both are in line with GDPR requirements.

Metadata focus timeline



Stages | Actions
Study initialization | The metadata of the study is created at the beginning of the project and updated continuously afterwards. #if$_DATAPLANT The input of the DMP generator created during the proposal stage can be reused. #endif$_DATAPLANT
Sample collection | The information used to identify exact samples is initiated before the experiments and updated at the assay creation stage. #if$_DATAPLANT The sample SWATE template will be used to document the sample metadata. The part of the sample metadata that can be retrieved from the raw data will be updated afterwards using the ARC parsers. #endif$_DATAPLANT
Assay creation | Assay metadata must be collected as a daily routine during the experimental phase. #if$_DATAPLANT An electronic lab notebook will be used to guarantee the applicability and correctness of the notebook content. #endif$_DATAPLANT
Computational analysis | Workflow annotation will be conducted during the computational analysis phase. #if$_DATAPLANT The workflow metadata will be stored in the assay folder of the ARC. #endif$_DATAPLANT
Results sharing | The metadata of results is collected after all modifications and should not be changed after publication. #if$_DATAPLANT Collection of result metadata before publication and the conversion from ARC to the repositories will be taken care of by the ARC2REPO converter and done with minimal effort. #endif$_DATAPLANT


Preferred formats for raw data

#if$_GENOMIC  

Extension | Format name
.h5 | Hierarchical Data Format
.bam | compressed binary version of a SAM file
.cram | compressed columnar file format for storing biological sequences aligned to a reference sequence
.fa | fasta
.faa | fasta
.fas | fasta
.fasta | fasta
.fastq | fastq
.ffn | fasta
.fna | fasta
.fq | fastq
.frn | fasta
.sff | sff-trim

#endif$_GENOMIC


#if$_RNASEQ  

.bam | compressed binary version of a SAM file
.cram | compressed columnar file format for storing biological sequences aligned to a reference sequence
.fa | fasta
.faa | fasta
.fas | fasta
.fast5 | HDF5
.fasta | fasta
.fastq | fastq
.ffn | fasta
.fna | fasta
.fq | fastq
.frn | fasta
.sff | sff-trim
.bas.h5 | HDF5
.h5 | Hierarchical Data Format

#endif$_RNASEQ


 #if$_METABOLOMIC  

.cdf | netCDF (AIA/ANDI) interchange data format
.cmp | netCDF compare file
.abf | Axon Binary File
.d | Agilent
.dat | Chromtech, Finnigan, VG
.idb | MASSLAB binary file
.jpf | Mass Center Main Mass Spectrometry Data (JEOL USA, Inc.)
.lcd | Shimadzu LC Solution / Labsolutions Data File
.mgf | Mascot Generic File
.raw | Thermo Xcalibur, Micromass (Waters), PerkinElmer, Waters
.scan | a spectrum or a Total Ion Chromatogram (TIC)
.wiff | ABI/Sciex
.xps | Thermo Fisher Scientific K-Alpha+ spectrometer file
.cdf.cmp | netCDF compare file

#endif$_METABOLOMIC


 #if$_PROTEOMIC  

.baf | Bruker
.d | Agilent
.dat | Chromtech, Finnigan, VG
.fid | Bruker
.ita | ION-TOF
.itm | ION-TOF
.mgf | Mascot Generic File
.ms | Finnigan (Thermo)
.ms2 | Sequest MS/MS peak list
.pkl | Micromass peak list
.qgd | Shimadzu
.raw | Thermo Xcalibur, Micromass (Waters), PerkinElmer, Waters
.raw | Physical Electronics/ULVAC-PHI
.sms | Bruker/Varian
.spc | Shimadzu
.splib | spectral library file
.t2d | ABI/Sciex
.tdc | Physical Electronics/ULVAC-PHI
.wiff | ABI/Sciex
.xms | Bruker/Varian
.yep | Bruker
.dta | Sequest MS/MS peak list
.msp |
.nist |

#endif$_PROTEOMIC
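The extension tables above can be used as a simple lookup to check whether an incoming raw data file is in one of the preferred formats. The following is a minimal sketch that covers only a subset of the listed extensions; the names `PREFERRED_FORMATS` and `preferred_format` are hypothetical, and matching on the final extension alone is a simplifying assumption.

```python
from pathlib import Path

# Subset of the extension tables above; extend with further entries as needed.
PREFERRED_FORMATS = {
    ".bam": "compressed binary version of a SAM file",
    ".fa": "fasta",
    ".fasta": "fasta",
    ".fastq": "fastq",
    ".fq": "fastq",
    ".h5": "Hierarchical Data Format",
    ".mgf": "Mascot Generic File",
}


def preferred_format(filename: str):
    """Return the format name for the file's final extension, or None if unlisted."""
    return PREFERRED_FORMATS.get(Path(filename).suffix.lower())
```

With this sketch, `preferred_format("sample_01.FASTQ")` resolves to the fastq entry, while a compressed file such as `reads.fastq.gz` is not recognized because only the last extension is inspected; handling compressed or compound extensions would need an extra step.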



Data Management Plan

Project name: $_PROJECT

Research funder: Bundesministerium für Bildung und Forschung

Funding programme:

Funding reference number (FKZ):

Principal investigator/scientist:

ID of principal investigator/scientist: $_USERNAME

Data management contact person: $_DATAOFFICER

ID of data management contact person:

Contact: $_EMAIL

Project description:

Creation date:

Modification date:

Requirements to be observed:

Data storage

Files are named according to the following standard:

Files are stored in formats that are as open and standardized as possible.

Data documentation

The following documents will be created:

Public data will be extracted as described in the previous paragraph. For $_PROJECT, specific data sets will be generated by the consortium partners.

#if$_RNASEQ

Short-read sequencing data will either be collected by the partners, or the sequencing will be outsourced and the raw data received.

#endif$_RNASEQ#if$_METABOLOMIC

Metabolomic data will be generated using chromatography coupled to mass spectrometry and from enzyme platforms mostly.

#endif$_METABOLOMIC#if$_PROTEOMIC

Proteomic data will be generated using an EU platform that is in line with community standards.

#endif$_PROTEOMIC

#if$_PREVIOUSPROJECTS Data from previous projects such as $_PREVIOUSPROJECTS will be considered. #endif$_PREVIOUSPROJECTS

Legitimacy

Data sharing

Data preservation
