SDRF notes

Introduction

This page provides detailed help on the Sample and Data Relationship Format (SDRF), and lists all the available tags for use in MAGE-TAB SDRF documents. For an overview of the MAGE-TAB submission process, and examples of both IDF and SDRF, please see these submission help notes and MAGE-TAB overview. For more detail on the MAGE-TAB format, please see the MAGE-TAB specification.

MAGE-TAB documents may optionally include information on the sources of any controlled terms used, providing users with the ability to link to ontologies, databases or other sources of controlled vocabularies to describe their experiment. These so-called "Term Sources" are defined in the IDF and may be used throughout the document. If they are not used, then all controlled vocabulary terms are assumed to be user-defined.

[ Back ][ Top of page ]

Overall SDRF layout

The SDRF file consists of a table in which each hybridization channel is represented by a row, and columns represent the steps of the experiment. In contrast to the Tab2MAGE format, the ordering of these columns is important, and should read left-to-right in chronological order. The overall organization of this table is shown below. To get more detail on the properties of each section, click on the relevant box:

An experiment can be described in terms of a graph, in which the graph nodes correspond to materials or data files, and the graph edges (or arcs) correspond to treatments. Each block in the diagram above would be represented as a node in such a graph, with the treatments (protocols) acting as edges. Each node block starts with a "Name" or "File" column (e.g. "Extract Name", "Array Data File") identifying the type of node, followed by a set of attribute columns. Each block is separated from its predecessor by "Protocol REF" graph edge columns containing references to the "Protocol Name" values defined in the IDF.

A further set of columns is used to specify the values for the variables ("experimental factors") within the experiment. These Factor Value[] columns reference the Experimental Factor Names defined in the IDF, and should be placed after the hybridization section (i.e., to the right of it, in or after the scanning, normalization and data section in the image above). The contents of these columns will usually duplicate those in a material Characteristics or a protocol Parameter Value column. See below for an example.

[ Back ][ Top of page ]

Example SDRF sections

Protocols
In most cases, each treatment within an experiment will be represented simply by a Protocol REF column containing references to the Protocol Names defined in the IDF. If multiple protocols need to be chained together this can be achieved using multiple Protocol REF columns.

Additionally, any Protocol Parameters associated with the protocol (as defined in the IDF) should have their values listed after the Protocol REF column. For example, with this in your IDF:

Protocol Name	My Ext. Protocol	My Labeling Protocol
Protocol Type	nucleic_acid_extraction	labeling
Protocol Parameters	amplification;RNA quality	amount of RNA used

The SDRF describing the use of these protocols might look as follows:

Protocol REF	Parameter Value [RNA quality]	Parameter Value [amplification]	Protocol REF	Parameter Value [amount of RNA used]	Unit [MassUnit]
My Ext. Protocol	RIN 8	RNA polymerase	My Labeling Protocol	10	ug

Other columns which may be used to annotate these Protocol REF columns are: Performer, Date, and Comment.

Sources
Sources are the starting material for the experiment. The section starts with a Source Name column, which will typically be followed by several Characteristics columns and a Material Type column:

Source Name	Material Type	Characteristics [Organism]	Characteristics [OrganismPart]	Characteristics [DiseaseState]	Characteristics [BioSourceType]
Tumor 1	organism_part	Homo sapiens	mammary gland	invasive ductal carcinoma	frozen_sample

Additional columns which may be used to annotate Sources are: Provider, Description, and Comment.

Samples
Samples represent steps in the chain of treatments applied to the original Source. MAGE-TAB allows you to create as many Sample steps as necessary:

Source Name	Protocol REF	Sample Name	Material Type	Protocol REF	Sample Name	Material Type	Characteristics [OrganismPart]
Young Rat 99	My growth protocol	Adult Rat 99	whole_organism	My dissect protocol	Adult Rat Liver 99	organism_part	liver

For ArrayExpress submissions, typically only one or two Sample steps are needed. Columns which may be used to annotate Samples are: Characteristics[], Material Type, Description, and Comment.

Extracts
Extracts refer to the extracted nucleic acid used in the experiment. Again, as many Extract steps may be used as are necessary. For example, if you need to represent separate nucleic acid extraction and chromatin immunoprecipitation steps in your SDRF, we recommend that you use two Extract steps. In most cases, however, a single Extract Name column would suffice:

Sample Name	Material Type	Protocol REF	Extract Name	Material Type
Rat Liver 99	organism_part	My Ext. Protocol	Liver RNA 99	total_RNA

Columns which may be used to annotate Extracts are: Characteristics[], Material Type, Description, and Comment.

Labeled Extracts
The Labeled Extracts in an experiment are those materials which have been conjugated to a label of some kind, prior to hybridization on an array. Typically there is only one Labeled Extract step. For submission to ArrayExpress, a Label column must be included with the Labeled Extract Name column to indicate which label (and therefore scanner channel) corresponds to which sample:

Extract Name	Material Type	Protocol REF	Labeled Extract Name	Label	Material Type
Liver RNA 99	total_RNA	My labeling protocol	Liver LE 99 Cy3	Cy3	synthetic_DNA
Kidney RNA 34	total_RNA	My labeling protocol	Kidney LE 34 Cy5	Cy5	synthetic_DNA

Note that it it wise to also include the Label in the Labeled Extract Name itself, so that unique objects are correctly created for each labeled extract. Columns which may be used to annotate Labeled Extracts are: Characteristics[], Material Type, Description, and Comment.

Hybridizations
The hybridization of Labeled Extract to an array is a key step in the SDRF, since it connects the "materials" section of the SDRF from the "data" section. For submission to ArrayExpress, an Array Design REF column must be included with the Hybridization Name column, indicating which array design was used in the hybridization:

Labeled Extract Name	Label	Hybridization Name	Array Design REF
Liver LE 1	Cy3	Liver vs. Kidney 1	A-MEXP-88
Kidney LE 1	Cy5	Liver vs. Kidney 1	A-MEXP-88

It is also possible to use Comment columns to annotate both Hybridization Name and Array Design REF columns. Note that the values in Hybridization Name columns may be used in Data Matrix files to link columns of data to individual hybridizations.

Scans
If desired, the act of scanning the hybridized array may be represented as a distinct node in the experimental graph, and encoded in the SDRF using Scan Name columns. These columns are optional, but can be useful in cases where e.g. multiple scans have been made of a single hybridized array, but where the data files do not explicitly reflect this:

Hybridization Name	Array Design REF	Scan Name	Array Data File
Liver vs. Kidney 1	A-MEXP-88	LK1 First Scan	Data1.txt
Liver vs. Kidney 1	A-MEXP-88	LK1 Second Scan	Data1.txt

Again, Comment columns may be used to further annotate Scan Name columns, where appropriate. Note that the values in Scan Name columns may be used in Data Matrix files to link columns of data to individual scanning events.

Array Data Files
The raw data files generated by an investigation should be listed in an Array Data File column following the Hybridization Name and (optional) Scan Name columns:

Hybridization Name	Array Design REF	Array Data File	Comment [EXP]
Liver vs. Kidney 1	A-AFFY-33	Data1.CEL	Data1.EXP

Comment columns can be used to add information relating to Array Data Files. For example, if you are coding an Affymetrix-based experiment and you wish to include the EXP files in your submission, you should list them in a "Comment[EXP]" column following the "Array Data File" column, as shown above.

Normalizations
Similarly to the use of Scan Name columns above, it is possible to represent the act of normalizing your data independently from the listing of data files themselves. This is done using the optional Normalization Name column:

Hybridization Name	Array Design REF	Array Data File	Normalization Name	Derived Array Data File
Liver vs. Kidney 1	A-AFFY-33	Data1.CEL	Norm 1	Data1.CHP

Again, Comment columns may be used to further annotate Normalization Name columns, where appropriate. Note that the values in Normalization Name columns may be used in Data Matrix files to link columns of data to individual normalization events.

Derived Array Data Files
The processed data files which have been derived from the raw data should be listed in an Derived Array Data File column. Note that this generally only applies to processed data arranged into one file per hybridization (or scan, or normalization). If your files contain processed data columns for more than one hybridization, you should reformat these into the MAGE-TAB Data Matrix format and include them instead in a Derived Array Data Matrix File column. Multiple steps of normalization can be captured:

Hybridization Name	Array Design REF	Array Data File	Normalization Name	Derived Array Data File	Normalization Name	Derived Array Data Matrix File	Comment [CDF]
Liver vs. Kidney 1	A-AFFY-33	Data1.CEL	MAS5 Norm 1	Data1.CHP	RMA Norm	RMANormData.txt	HG-U133A.cdf
Liver vs. Kidney 2	A-AFFY-33	Data2.CEL	MAS5 Norm 2	Data2.CHP	RMA Norm	RMANormData.txt	HG-U133A.cdf
Liver vs. Kidney 3	A-AFFY-33	Data3.CEL	MAS5 Norm 3	Data3.CHP	RMA Norm	RMANormData.txt	HG-U133A.cdf
Liver vs. Kidney 4	A-AFFY-33	Data4.CEL	MAS5 Norm 4	Data4.CHP	RMA Norm	RMANormData.txt	HG-U133A.cdf

In the above example, the columns from the "RMANormData.txt" data matrix file could be linked to either the Hybridization Names or the previous set of Normalization Names, allowing for flexible representation of the flow of data through the process. See the Data Matrix notes for details of how these links are encoded in the data matrix file header.

Comment columns may be used to add information relating to processed data files. For example, when coding an Affymetrix-based experiment with a Data Matrix file, as in the example above, a Comment[CDF] column should be used to indicate which Affymetrix library ("CDF") file applies to these data.

Factor Values
The Factor Values for an experiment are the values of the variables under investigation. For example, an experiment studying the effect of different compounds on a cell culture would have "compound" as an experimental variable. These variables are listed in the IDF as "Experimental Factor Names" with associated Types:

Experimental Factor Name	Cells	Drug
Experimental Factor Type	cell_line	compound

Protocol Name	All Treatments
Protocol Parameters	drug compound

Given the above definitions in the accompanying IDF, the SDRF file can then reference these factors when we come to list the factor values:

Source Name	Characteristics [CellLine]	Protocol REF	Parameter Value [drug compound]	Hybridization Name	Factor Value [Cells]	Factor Value [Drug]
Line 1	Jurkat	All Treatments	imatinib	Jurkat vs imatinib	Jurkat	imatinib
Line 1	Jurkat	All Treatments	lapatinib	Jurkat vs lapatinib	Jurkat	lapatinib
Line 2	RKO	All Treatments	imatinib	RKO vs imatinib	RKO	imatinib
Line 2	RKO	All Treatments	lapatinib	RKO vs lapatinib	RKO	lapatinib

Note that there is inevitably duplication between Factor Values and values entered elsewhere in the SDRF. It is particularly common to have the Factor Value column duplicate either a Characteristics[] column or a Parameter Value[] column.

Factor Value columns may be placed anywhere after the hybridization section of the SDRF, although they should be placed to avoid disrupting any of the other sections. This is most easily achieved by adding them at the end (i.e., the far right) of the SDRF.

[ Back ][ Top of page ]

All valid SDRF column headings

Source Name
Used as an identifier within the MAGE-TAB document. This column contains user-defined names for the Source materials. The following columns can be used to annotate Source Name columns:

Sample Name
Used as an identifier within the MAGE-TAB document. This column contains user-defined names for each Sample material. The following columns can be used to annotate Sample Name columns:

Extract Name
Used as an identifier within the MAGE-TAB document. This column contains user-defined names for each Extract material. The following columns can be used to annotate Extract Name columns:

Labeled Extract Name
Used as an identifier within the MAGE-TAB document. This column contains user-defined names for each Labeled Extract material. The following columns can be used to annotate Labeled Extract Name columns:

Hybridization Name
Used as an identifier within the MAGE-TAB document.This column contains user-defined names for each Hybridization. The following columns can be used to annotate Hybridization Name columns:

Array Design REF (required)
Comment[]

Assay Name
Used as an identifier within the MAGE-TAB document.This column contains user-defined names for each Assay. "Assay Name" may be used instead of "Hybridization Name" to identify generic biological assays, such as rtPCR. Note that this column should not be used for submission of regular microarray experiments to ArrayExpress. The following columns can be used to annotate Assay Name columns:

Note that as of MAGE-TAB version 1.1, all Assay Name columns must be followed by a Technology Type column.

Scan Name
Used as an identifier within the MAGE-TAB document. This optional column contains user-defined names for each Scan event. The following columns can be used to annotate Scan Name columns:

Comment[]

Normalization Name
Used as an identifier within the MAGE-TAB document. This optional column contains user-defined names for each Normalization event. The following columns can be used to annotate Normalization Name columns:

Comment[]

Array Data File
This column contains a list of raw data files, one for each row of the SDRF file, linking these data files to their respective hybridizations. The following columns can be used to annotate Array Data File columns:

Comment[]

Derived Array Data File
This column contains a list of processed data files, one for each row of the SDRF file, linking these data files to their respective hybridizations. The following columns can be used to annotate Derived Array Data File columns:

Comment[]

Array Data Matrix File
This column contains a list of raw data matrix files, where data from multiple hybridizations is stored in a single file, and the data mapped to each hybridization via the Data Matrix format itself. The following columns can be used to annotate Array Data Matrix File columns:

Comment[]

Derived Array Data Matrix File
This column contains a list of processed data matrix files, where data from multiple hybridizations is stored in a single file, and the data mapped to each hybridization (or scan, or normalization) via the Data Matrix format itself. The following columns can be used to annotate Derived Array Data Matrix File columns:

Comment[]

Image File
This optional column contains a list of image files, one for each row of the SDRF file, linking these image files to their respective hybridizations. Note that ArrayExpress does not store image data due to size constraints on the database. If desired, you may use this column to include links to image files stored on your local webserver. The following columns can be used to annotate Derived Array Data File columns:

Comment[]

Array Design REF
This column contains references to the array design used for each hybridization. For ArrayExpress submissions this should be an ArrayExpress accession number, e.g. "A-AFFY-33". The following columns can be used to annotate Array Design REF columns:

The Term Source REF column here can be used to point to the source of the array design referenced; however for ArrayExpress submissions this should always be ArrayExpress itself, and so this column is in effect ignored.

Protocol REF
This column contains references to Protocol Names defined in the IDF, or accession numbers of protocols already deposited with ArrayExpress. The following columns can be used to annotate Protocol REF columns:

The Term Source REF column here can be used to point to the source of the protocol referenced, if it is not contained within the IDF; for ArrayExpress submissions this should always be ArrayExpress itself, and a suitable ArrayExpress Term Source should be defined in the IDF.

Characteristics[<category term>]
Controlled vocabulary term or measurement. Used as an attribute column following Source Name, Sample Name, Extract Name, or Labeled Extract Name. This column contains terms describing each material according to the characteristics category indicated in the column header. For example, a column headed "Characteristics[OrganismPart]" would contain individual OrganismPart terms. These terms may be user-defined (the default), from an external ontology source (indicated using a Term Source REF column), or a measurement (indicated using a Unit[] column).

Provider
Used as an attribute column following Source Name. A free-text string identifying the organization or person from which the Source was obtained.

Material Type
Controlled vocabulary term. Used as an attribute column following Source Name, Sample Name, Extract Name, or Labeled Extract Name. This column contains terms describing the type of each material. For ArrayExpress submissions this term should be an instance of MaterialType from the MGED Ontology. Examples: whole_organism, organism_part, cell, total_RNA. The following columns can be used to annotate Material Type columns:

Term Source REF

The Term Source REF column in this case would point to the ontology (defined in the IDF) from which the Material Type terms are taken (the MGED Ontology in the example above).

Label
Controlled vocabulary term. Used as an attribute column following Labeled Extract Name. The label compound which is conjugated to an Extract to create the Labeled Extract. For ArrayExpress submissions this term should be an instance of LabelCompound from the MGED Ontology. Examples: Cy3, Cy5, biotin, alexa_546. The following columns can be used to annotate Label columns:

Term Source REF

The Term Source REF column in this case would point to the ontology (defined in the IDF) from which the Label terms are taken (the MGED Ontology in the example above).

Technology Type
Controlled vocabulary term. Used as an attribute column following Assay Name. This column contains terms describing the type of each generic (non-hybridization) assay. Example: rtPCR. The following columns can be used to annotate Technology Type columns:

Term Source REF

The Term Source REF column in this case would point to the ontology (defined in the IDF) from which the Technology Type terms are taken.

Factor Value[<experiment factor name>]
Controlled vocabulary term or measurement. This column contains terms describing the experimental factor values (i.e., variables) for each row of the SDRF. The Experimental Factor Name to which it pertains (from the accompanying IDF) should be indicated in the column heading. For example, if you have this in your IDF:

Experimental Factor Name	TissueEF
Experimental Factor Type	organism_part

You could then use this factor in your SDRF (assuming you had also defined the "Mouse Anatomy" term source in your IDF):

Factor Value[TissueEF]	Term Source REF
brain	Mouse Anatomy
kidney	Mouse Anatomy
liver	Mouse Anatomy
intestine	Mouse Anatomy
pancreas	Mouse Anatomy

The terms in the column may be user-defined (the default), from an external ontology source (indicated using a Term Source REF column), or a measurement (indicated using a Unit[] column).

In the example above, the column terms would be treated as describing organism parts. For more precise control over the treatment of these terms, the optional form "Factor Value [] ()" is available, e.g. "Factor Value [growth condition EF] (Nutrients)".

Performer
Used as an attribute column following Protocol REF. The name of the researcher who carried out the protocol.

Date
Used as an attribute column following Protocol REF. The date (and time, where available) upon which the protocol was performed, in the following format: YYYY-MM-DD

Parameter Value[<protocol parameter>]
Used as an attribute column following Protocol REF columns. This column contains values for the protocol parameters referenced in the column header. The following columns can be used to annotate Parameter Value[] columns:

For example, if a Protocol Name "Array Hybridization" is defined in the accompanying IDF, with Protocol Parameters "hyb temp;hyb volume", the following would be valid:

Protocol REF	Parameter Value [hyb temp]	Unit [TemperatureUnit]	Parameter Value [hyb volume]	Unit [VolumeUnit]
Array Hybridization	55	degrees_C	100	ul

Unit[<unit category>]
Controlled vocabulary term. Used as an attribute column following Characteristics[], Factor Value[] or Parameter Value[]. This column contains terms describing the unit(s) to be applied to the values in the preceding column. The type of unit is included in the column heading, e.g. "Unit[TimeUnit]". These unit types should correspond to Unit subclasses from the MGED Ontology. The following columns can be used to annotate Unit[] columns:

Term Source REF

The Term Source REF column in this case would point to the ontology (defined in the IDF) from which the Unit terms are taken.

Description
Used as an attribute column following Source Name, Sample Name, Extract Name, or Labeled Extract Name. A free-text description to be attached to the corresponding material. To be used sparingly, if at all - most annotations should be provided using controlled vocabulary terms, using Characteristics[] columns.

Term Source REF
Used as an attribute column following any controlled vocabulary column (e.g., Characteristics[], or column allowing reference of external entities (e.g., Protocol REF. This column contains references to ontology or database Term Sources defined in the IDF, and from which the values in the previous column were taken. The following columns can be used to annotate Term Source REF columns:

Term Accession Number

Term Accession Number
Used as an attribute column following Term Source REF columns. This column contains the accession numbers from the term source used to identify the ontology or database terms in question. For example:

Source Name	Characteristics [DiseaseState]	Term Source REF	Term Accession Number
Sample 1	acute lymphocytic leukemia	NCI Metathesaurus	C0023449

(This example relies on the "NCI Metathesaurus" Term Source having been pre-defined in the IDF accompanying the SDRF.)

Comment[<comment name>]
This column can be used to annotate the main graph node and edge columns listed above. It is included as an extensibility mechanism, and should not generally be used to encode meaningful biological annotation. The column heading should contain a name for the type of values included in the column.

[ Back ][ Top of page ]