Overview of the MAGE-TAB format

Introduction

These notes provide an introduction to the MAGE-TAB format and a brief primer on how to submit microarray data to ArrayExpress in this format. MAGE-TAB is the successor to the Tab2MAGE format, developed by members of the microarray research community to extend the capabilities of the spreadsheet format. A full MAGE-TAB specification document is available from the MGED web site.

ArrayExpress supports submission of MAGE-TAB documents via a web-based submission system. This system allows the user to create an experiment submission and download an automatically generated template MAGE-TAB document. This document can then be completed by the user and uploaded to the web page alongside their data files. For more information, please see the ArrayExpress submission help notes.

Overall structure of MAGE-TAB

The MAGE-TAB format uses a number of different files to capture information about a microarray experiment:

The IDF file is used to give an overview of the experiment, including the experimental variables (factors) used, protocols, quality control strategy, publication information and contact details. Also included in the IDF file is an (optional) list of sources from which controlled vocabulary terms may have been used elsewhere in the MAGE-TAB document. These term sources may be fully-fledged ontologies (e.g. the MGED ontology), databases providing queryable accession numbers (e.g. ArrayExpress), or simply a file defining terms for local users.

The SDRF file describes the relationship between every step in the chain of biological materials used in the experiment through to the hybridization, and the acquisition and normalization of data. It is similar in concept to the Hybridization section found in Tab2MAGE spreadsheets. Experimental factors, protocols, protocol parameters and term sources defined in the IDF are referenced by the SDRF.

The ADF file provides the array-level annotation for the experiment, relating the row-level identifiers in the data files to biological sequence annotation. Array designs are usually deposited in ArrayExpress as separate submissions to the experimental data, and in the case of commercial arrays may not need to be submitted to ArrayExpress at all.

Currently, ArrayExpress only supports submission of array designs using ADF files via the MIAMExpress web submission system. For help with this process, please see the MIAMExpress ADF help documentation. This array design component of MAGE-TAB will not be discussed further here.

An experimental data submission will usually consist of an IDF file, an SDRF file, and a series of data files. Typically there will be one raw data file per hybridization. Each hybridization may also have a normalized data file, or the final transformed data may be combined into a data matrix file. Note that, for the sake of convenience, the ArrayExpress MAGE-TAB generating system combines the IDF and the SDRF into a single file.

For more detail on the MAGE-TAB document format, please see the MAGE-TAB specification document, available from the MGED web site.

IDF overview

The IDF component of a MAGE-TAB document consists of a set of unique tags attached to their corresponding values in a simple tab-delimited text format. For example, "Experiment Description" should be followed by a free-text description of the experiment. Most of the fields in the IDF document can handle multiple values. A full description of every valid IDF tag is given in the detailed IDF help notes. An example of an IDF document is given below:

Investigation Title       Invasive vs. non-invasive strains of yeast

Experimental Design       individual_genetic_characteristics_design  growth_condition_design
Experimental Factor Name  EF_Genotype                                EF_GrowthCond
Experimental Factor Type  genotype                                   growth_condition

Person Last Name          Falstaff                                   Shakespeare
Person First Name         John                                       Bill
Person Email              jfalstaff@wagglespike.com                  bills@wagglespike.com
Person Address            Ontario, Canada                            Ontario, Canada
Person Affiliation        Windsor Laboratories                       Windsor Laboratories
Person Roles              submitter;investigator                     investigator

Quality Control Type      dye_swap_quality_control

Public Release Date       2004-08-30

PubMed ID                 12345678                                   87654321
Publication Author List   Falstaff, J. and Shakespeare, B.           Goodfellow, R. et al.
Publication Title         Improved yeast flocculation                Yeast and beer: a retrospective
Publication Status        in preparation                             submitted

Experiment Description    An experiment was performed to...

Protocol Name             Yeast Growth                               RNA extraction
Protocol Type             grow                                       nucleic_acid_extraction
Protocol Description      S. cerevisiae cultures were grown on...    Total cellular RNA was extracted...
Protocol Parameters       carbon source;temperature

SDRF File                 my_sdrf_file.txt

Note that this is only a minimal subset of the available IDF tags. Blank lines may be included for legibility. Lines beginning with the "#" symbol are treated as comments and ignored. A full listing of all supported IDF tags can be found in these IDF help notes.
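As an illustration of the tag/value layout described above, the following Python sketch reads tab-delimited IDF text into a dictionary mapping each tag to its list of values, skipping blank lines and "#" comment lines. The function name parse_idf and the sample text are hypothetical and are not part of any ArrayExpress tool; the sketch also assumes one tag per line and does no validation of tag names.

```python
def parse_idf(text):
    """Return {tag: [value, ...]} for tab-delimited IDF text (illustrative only)."""
    idf = {}
    for line in text.splitlines():
        # Blank lines are allowed for legibility; "#" lines are comments.
        if not line.strip() or line.startswith("#"):
            continue
        cells = [cell.strip() for cell in line.split("\t")]
        tag, values = cells[0], cells[1:]
        # Drop empty trailing cells left by spreadsheet exports.
        while values and not values[-1]:
            values.pop()
        idf[tag] = values
    return idf


# Hypothetical sample, modelled on the example IDF above.
SAMPLE_IDF = (
    "# A comment line\n"
    "Investigation Title\tInvasive vs. non-invasive strains of yeast\n"
    "\n"
    "Experimental Factor Name\tEF_Genotype\tEF_GrowthCond\n"
    "Experimental Factor Type\tgenotype\tgrowth_condition\n"
    "SDRF File\tmy_sdrf_file.txt\t\n"
)

idf = parse_idf(SAMPLE_IDF)
print(idf["Experimental Factor Name"])
```

Keeping every value in a list, even for single-valued tags, reflects the fact that most IDF fields can hold multiple values.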

SDRF overview

The SDRF file consists of a table in which each hybridization channel is represented by a row, and columns represent the steps of the experiment. In contrast to the Tab2MAGE format, the ordering of these columns is important, and should read left-to-right in chronological order. The overall organization of this table is shown below. For more detail on the properties of each section, see these detailed SDRF notes.

[Figure: SDRF layout. Blocks, left to right: Source, Sample, Extract, Labeled Extract, Hybridization, Scan, Array Data File, Normalization, Derived Array Data File.]

Each block in the diagram above starts with a "Name" or "File" column (e.g. "Extract Name", "Array Data File"), followed by a set of attribute columns. Each block is separated from its predecessor by "Protocol REF" columns containing references to the "Protocol Name" values defined in the IDF.

A further set of columns is used to specify the values for the variables ("experimental factors") within the experiment. These Factor Value[] columns reference the Experimental Factor Names defined in the IDF, and should be placed after the hybridization section (i.e., to the right of it, in or after the scanning, normalization and data section in the image above). The contents of these columns will usually duplicate those in a material Characteristics or a protocol Parameter Value column. See below for an example.

Below is a very simple example SDRF showing the links between materials and hybridization for a simple Affymetrix-based experiment:

Source Name      Characteristics [OrganismPart]  Labeled Extract Name  Label   Protocol REF       Hybridization Name  Array Design REF
liver sample 1   liver                           LE 1                  biotin  Hyb protocol name  Hyb 1               A-AFFY-33
liver sample 2   liver                           LE 2                  biotin  Hyb protocol name  Hyb 2               A-AFFY-33
kidney sample 1  kidney                          LE 3                  biotin  Hyb protocol name  Hyb 3               A-AFFY-33
kidney sample 2  kidney                          LE 4                  biotin  Hyb protocol name  Hyb 4               A-AFFY-33

Note that normally, many more "Characteristics[]" columns would be used to fully describe the Source. The SDRF might then continue in the following fashion, linking the Hybridization to data files and experimental factor values:

Hybridization Name  Array Design REF  Scan Name  Array Data File  Protocol REF        Normalization Name  Derived Array Data File  Factor Value [FactorOP]
Hyb 1               A-AFFY-33         Scan 1     Data1.CEL        Norm protocol name  Norm 1              Data1.CHP                liver
Hyb 2               A-AFFY-33         Scan 2     Data2.CEL        Norm protocol name  Norm 2              Data2.CHP                liver
Hyb 3               A-AFFY-33         Scan 3     Data3.CEL        Norm protocol name  Norm 3              Data3.CHP                kidney
Hyb 4               A-AFFY-33         Scan 4     Data4.CEL        Norm protocol name  Norm 4              Data4.CHP                kidney

In this case, the "Factor Value [FactorOP]" column refers to an Experimental Factor named "FactorOP", with Type "organism_part", from the IDF which would accompany this SDRF. Blank lines may be included for legibility. Lines beginning with the "#" symbol are treated as comments and ignored. Note that the examples on this page illustrate only a very minimal set of the available SDRF columns. A full listing of all supported column names can be found in these SDRF help notes.
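The link between "Factor Value [...]" columns and the Experimental Factor Names in the IDF can be checked with a short script. The Python sketch below is illustrative only: the function names, the sample SDRF text and the IDF_FACTOR_NAMES list are hypothetical, and the regular expression assumes that the factor name appears in square brackets with optional whitespace before the opening bracket.

```python
import re

# Matches headers such as "Factor Value [FactorOP]" or "Factor Value[FactorOP]".
FACTOR_HEADER = re.compile(r"^Factor Value\s*\[(.+)\]$")


def read_sdrf(text):
    """Return (header, rows) as lists of strings.

    Column names such as "Protocol REF" may repeat in an SDRF, so rows are
    kept as lists rather than being converted to dictionaries.
    """
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    header = [cell.strip() for cell in lines[0].split("\t")]
    rows = [[cell.strip() for cell in ln.split("\t")] for ln in lines[1:]]
    return header, rows


def factor_columns(header):
    """Map each factor name referenced in the header to its column index."""
    found = {}
    for index, name in enumerate(header):
        match = FACTOR_HEADER.match(name)
        if match:
            found[match.group(1).strip()] = index
    return found


# Hypothetical sample, modelled on the example SDRF above.
SAMPLE_SDRF = (
    "Hybridization Name\tArray Design REF\tArray Data File\tFactor Value [FactorOP]\n"
    "Hyb 1\tA-AFFY-33\tData1.CEL\tliver\n"
    "# a comment line\n"
    "Hyb 3\tA-AFFY-33\tData3.CEL\tkidney\n"
)
IDF_FACTOR_NAMES = ["FactorOP"]  # would come from "Experimental Factor Name"

header, rows = read_sdrf(SAMPLE_SDRF)
factors = factor_columns(header)
undefined = sorted(set(factors) - set(IDF_FACTOR_NAMES))
hyb_col = header.index("Hybridization Name")
values = {row[hyb_col]: row[factors["FactorOP"]] for row in rows}
print(undefined)  # factor names used in the SDRF but missing from the IDF
print(values)
```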

Data Files

A variety of data file formats, produced by several different scanner makes and models, are supported by the ArrayExpress MAGE-TAB parser. A full list of supported formats can be found in the Tab2MAGE data file documentation. Note that the MAGE-TAB specification introduces a new file format, that of the data matrix, for files which contain data from multiple hybridizations. This new format is discussed below.

Data Matrices

If you wish to represent data from more than one hybridization, scan or normalization in a single data file, you will need to reformat it as a MAGE-TAB Data Matrix. This is a simplified format which allows data columns to be mapped to rows in the SDRF file. The first header line of a Data Matrix file describes this mapping, and the second lists the quantitation types for each column (e.g. "log2 ratio"). The first column is used to map the data rows to identifiers from the array design used. Examples are shown here:

Example two-color data matrix

Hybridization REF  Hyb1        Hyb2        Hyb3        Hyb4        Hyb5        Hyb6
Reporter REF       log2 ratio  log2 ratio  log2 ratio  log2 ratio  log2 ratio  log2 ratio
Probe 1            0.27        0.43        0.32        0.12        0.54        0.28
Probe 2            1.67        1.46        1.91        1.49        1.50        1.89
Probe 3            0.78        0.69        0.91        0.99        0.75        0.80

In this example, six hybridizations from the SDRF (Hyb1 - Hyb6) are being mapped to log2 ratio values. Each row of data is mapped to a Reporter Identifier defined by the array design (itself indicated in the SDRF file).

Example Affymetrix data matrix

Hybridization REF     Hyb1          Hyb1          Hyb2          Hyb2          Hyb3          Hyb3
CompositeElement REF  CELIntensity  CELStdev      CELIntensity  CELStdev      CELIntensity  CELStdev
Gene 1                22287.9       56.8          3222.9        111.1         9984.3        34.8
Gene 2                267.4         7.6           118.1         7.6           236.8         9.0
Gene 3                876.5         16.7          936.8         14.9          735.6         8.0

In this example, three hybridizations from the SDRF file (Hyb1, Hyb2 and Hyb3) are being mapped to data with two different quantitation types (CELIntensity, CELStdev). Each row of data is mapped to a CompositeElement Identifier (equivalent to MAGE CompositeSequence Identifiers) defined by the array design.
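To make the two-header-line layout concrete, the following Python sketch reads a tab-delimited data matrix and keys each value by its (hybridization, quantitation type) pair. It is a minimal illustration under stated assumptions: the function name parse_data_matrix and the sample text are hypothetical, every data cell is assumed to be numeric, and no check is made against the SDRF or the array design.

```python
def parse_data_matrix(text):
    """Parse a MAGE-TAB style data matrix (illustrative sketch only).

    Returns (row_ref_type, columns, data) where columns is a list of
    (hybridization, quantitation_type) pairs and data maps each row
    identifier to {(hybridization, quantitation_type): value}.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    refs = lines[0].split("\t")    # e.g. "Hybridization REF", "Hyb1", ...
    qtypes = lines[1].split("\t")  # e.g. "CompositeElement REF", "CELIntensity", ...
    if len(refs) != len(qtypes):
        raise ValueError("the two header lines have different lengths")
    columns = list(zip(refs[1:], qtypes[1:]))
    data = {}
    for line in lines[2:]:
        cells = line.split("\t")
        # The first cell is the identifier from the array design.
        data[cells[0]] = {col: float(v) for col, v in zip(columns, cells[1:])}
    return qtypes[0], columns, data


# Hypothetical sample, modelled on the Affymetrix example above.
SAMPLE_MATRIX = (
    "Hybridization REF\tHyb1\tHyb1\tHyb2\tHyb2\n"
    "CompositeElement REF\tCELIntensity\tCELStdev\tCELIntensity\tCELStdev\n"
    "Gene 2\t267.4\t7.6\t118.1\t7.6\n"
    "Gene 3\t876.5\t16.7\t936.8\t14.9\n"
)

row_type, columns, data = parse_data_matrix(SAMPLE_MATRIX)
print(row_type)
print(data["Gene 2"][("Hyb2", "CELIntensity")])
```

Keying on the pair rather than on the hybridization alone is what allows one hybridization to carry several quantitation types, as in the Affymetrix example.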

There are some limitations imposed by ArrayExpress when submitting data in this format. Firstly, each data matrix should correspond to hybridizations performed on a single array design. Experiments using multiple array designs should use one data matrix per design. Secondly, we rely on there being an ordered and regular organisation of the columns: first by hybridization, and then by quantitation type:

Correct:
Hybridization REF  Hyb1  Hyb1  Hyb2  Hyb2
Reporter REF       QT X  QT Y  QT X  QT Y

Wrong:
Hybridization REF  Hyb1  Hyb2  Hyb1  Hyb2
Reporter REF       QT X  QT X  QT Y  QT Y

If your processed data does not readily fall into such a structure you may need to break the data up into multiple data matrices, and use the SDRF file to represent the relationships between them.
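The column-ordering rule can be checked before submission. The Python sketch below encodes one interpretation of "ordered and regular": all columns for a hybridization are adjacent, and every hybridization lists the same quantitation types in the same order. This interpretation and the function name columns_are_grouped are assumptions made for illustration; the ArrayExpress parser may apply different or stricter checks.

```python
def columns_are_grouped(hyb_refs, quant_types):
    """Return True if columns are ordered by hybridization, then by
    quantitation type, with the same quantitation types for every
    hybridization (an assumed reading of the rule above)."""
    if not hyb_refs or len(hyb_refs) != len(quant_types):
        return False
    # Collect runs of adjacent columns that share a hybridization.
    groups = []
    for hyb, qt in zip(hyb_refs, quant_types):
        if groups and groups[-1][0] == hyb:
            groups[-1][1].append(qt)
        else:
            groups.append((hyb, [qt]))
    names = [hyb for hyb, _ in groups]
    if len(set(names)) != len(names):
        return False  # a hybridization is split across non-adjacent columns
    first = groups[0][1]
    return all(qts == first for _, qts in groups)


# The "Correct" and "Wrong" layouts shown above, without the first column.
correct = columns_are_grouped(["Hyb1", "Hyb1", "Hyb2", "Hyb2"],
                              ["QT X", "QT Y", "QT X", "QT Y"])
wrong = columns_are_grouped(["Hyb1", "Hyb2", "Hyb1", "Hyb2"],
                            ["QT X", "QT X", "QT Y", "QT Y"])
print(correct, wrong)
```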

Please see the MAGE-TAB specification for further information and examples.


Last modified: Mon Feb 25 18:04:42 GMT 2008