Script detail: tab2mage.pl

NAME

tab2mage.pl - a script to produce valid MAGE-ML from a set of raw datafiles and a summary spreadsheet of defined format.


SYNOPSIS

 tab2mage.pl -e <spreadsheet_filename> -t <target_directory>

DESCRIPTION

This script is designed to take a set of unprocessed raw datafiles and a spreadsheet providing the experiment metadata and generate valid MAGE-ML. The supported data file formats are listed in the accompanying documentation.

Important: the data files must each contain a column heading line immediately preceding the start of the data. Typically, raw unprocessed data files have this as standard. Do not remove this line, as the script will use it to determine which QuantitationTypes to keep and which to discard.
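The role of the heading line can be illustrated with a minimal sketch, written in Python rather than the script's own Perl. It compares the column names on a heading line with a set of recognized QuantitationTypes and reports which columns would be kept and which ignored. The names in KNOWN_QTS, the sample heading line, and the assumption that the data file is tab-delimited are all hypothetical; they are not the defaults shipped with Tab2MAGE, and the script's actual matching logic may differ.

```python
import csv
import io

# Hypothetical QuantitationType names, for illustration only;
# these are not the defaults supplied with Tab2MAGE.
KNOWN_QTS = {"Signal Mean", "Background Mean"}

def split_columns(heading_line, known_qts=KNOWN_QTS):
    """Return (kept, ignored) column names from a tab-delimited heading line."""
    headings = next(csv.reader(io.StringIO(heading_line), delimiter="\t"))
    kept = [h for h in headings if h.strip() in known_qts]
    ignored = [h for h in headings if h.strip() not in known_qts]
    return kept, ignored

# Assumed example heading line with one unrecognized column.
kept, ignored = split_columns("Signal Mean\tFlagX\tBackground Mean")
```

If the heading line were removed from the data file, there would be no column names to compare against the known list, which is why the line must be left in place.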

There are two auxiliary files which can be supplied alongside the data files. The first, the experiment summary spreadsheet, describes the way in which the experiment was performed. The second simply defines the QuantitationTypes which are to be extracted from the data files. This QuantitationType file is optional, however, as the script is supplied with a set of defaults determined by a survey of incoming QuantitationTypes by ArrayExpress curators. The script generates a log file in the target directory listing the columns that have been ignored.


EXPERIMENT SUMMARY FILE

The experiment summary file has a flexible format based around a set of predefined column headers. Comments may be inserted using the '#' character at the start of any line. The file consists of three sections, each of which ends with a blank line:

Experiment section

This section contains top-level information about the experiment, such as the title, description and accession number. The section is constructed in two columns, with row names as described in SUPPORTED HEADINGS in the ArrayExpress::Curator::MAGE::Definitions manpage.

Protocol section

Protocols are defined as needed in this section. The section is organized into rows, with column headings as described in SUPPORTED HEADINGS in the ArrayExpress::Curator::MAGE::Definitions manpage. If all of the protocols used in the experiment have previously been loaded into ArrayExpress and given accession numbers, this whole section can be omitted.

Hybridization section

This section contains the bulk of the experiment information. At its simplest, each row describes the route taken from BioSource to output data file.

Column headings for BioMaterialCharacteristics and FactorValue should be provided in the following form:

BioMaterialCharacteristics[<MGED Ontology Category>]
FactorValue[<MGED Ontology Category>]

Each of these headings should contain a valid Category subclass from the MGED ontology. The values in the columns must likewise be valid ontology entry values for these subclasses.
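As an illustration of the heading form above, the following Python sketch extracts the heading type and the bracketed category from a column heading. The function name parse_heading and the example categories are assumptions made for this sketch; it checks only the syntax of the heading and does not validate the category against the MGED ontology.

```python
import re

# Matches headings such as BioMaterialCharacteristics[Organism].
HEADING_RE = re.compile(
    r"^(BioMaterialCharacteristics|FactorValue)\[\s*([^\[\]]+?)\s*\]$"
)

def parse_heading(heading):
    """Return (kind, category) for a bracketed heading, or None otherwise.

    Hypothetical helper: syntax check only, no ontology lookup.
    """
    match = HEADING_RE.match(heading.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

# Example headings (categories chosen for illustration).
factor = parse_heading("FactorValue[Age]")
characteristic = parse_heading("BioMaterialCharacteristics[ Organism ]")
plain = parse_heading("Hybridization")
```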

Multichannel (e.g., two-colour) data can be described by entering each channel as a separate line. Pooling can be described at multiple levels by using as many lines as necessary to describe all the relationships between upstream and downstream samples.

In a sense the Hybridization table can be compared to an SQL database table in the way that it provides links between MAGE objects. Again, the recognized column headings for this section are described in SUPPORTED HEADINGS in the ArrayExpress::Curator::MAGE::Definitions manpage.
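The overall layout of the experiment summary file (comment lines starting with '#', and sections separated by blank lines) can be sketched in Python as follows. This is an illustrative reading of the format only, not the parser used by tab2mage.pl; it assumes tab-delimited fields, and the row and column names in the sample text are placeholders rather than real supported headings.

```python
def split_sections(text):
    """Split summary-file text into sections separated by blank lines.

    Lines starting with '#' are treated as comments and dropped.
    Each section is returned as a list of rows, each row a list of fields.
    """
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        if line.strip() == "":
            # A blank line closes the current section, if any.
            if current:
                sections.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sections.append(current)
    return sections

# Placeholder content: the names below are hypothetical, not supported headings.
sample = (
    "# a comment line\n"
    "name_a\tvalue_a\n"
    "\n"
    "col1\tcol2\n"
    "x\ty\n"
    "\n"
    "colA\tcolB\n"
    "1\t2\n"
)
sections = split_sections(sample)
```

In this reading, the first section corresponds to the two-column Experiment section, and the later sections to the row-oriented Protocol and Hybridization tables.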


OPTIONS

-e filename

The Tab2MAGE spreadsheet to be processed.

-t directory

The target directory to be created. This directory will contain the MAGE-ML file and external data files ready for validation.

-n accession

Normally the script uses the experiment accession number from the spreadsheet to be parsed. In cases where no accession has been entered in the spreadsheet, or you wish to override that accession, use this option.

-q QT filename

QuantitationType file. This option allows you to specify a custom QuantitationType definition file to override those defined as part of the Tab2MAGE package. See the ArrayExpress::Datafile::QT_list manpage for more information.

-Q QT filename

QuantitationType file. This option will add the new QuantitationType definitions to those included with the Tab2MAGE package. See the ArrayExpress::Datafile::QT_list manpage for more information.

-k

Keep all columns in the data files, regardless of whether they are recognized or not. Unrecognized QTs will be created as generic SpecializedQuantitationTypes in the output MAGE-ML.

-K

If the autosubmissions system is configured, tab2mage.pl will automatically reassign protocol accessions to fit a local convention. Use the -K option to suppress this behaviour.

-s

Standalone option. This prevents the script from attempting to connect to ArrayExpress to retrieve array information.

-R namespace

Reporter identifier prefix. By default the script uses the MIAMExpress convention for generating reporter identifiers. This option allows you to override this behaviour by supplying an alternate prefix for identifiers.

-C namespace

CompositeSequence identifier prefix. By default the script uses the MIAMExpress convention for generating composite sequence identifiers. This option allows you to override this behaviour by supplying an alternate prefix for identifiers.

-d directory

Source directory containing all the data files referenced in the tab2mage spreadsheet. If this is omitted, the current working directory will be searched for data files.

-P

By default, native (usually binary) file formats are used for Affymetrix CEL files and all NimbleScan (NimbleGen) files. This encoding carries far less overhead and retains the files in their original formats, which often appeals to end users. For ArrayExpress submissions, the default also allows such files to be downloaded directly from the ArrayExpress web interface. In the unusual circumstances where you wish to use plain-text encoding for these data files, use this option.

-L

Ignore the data file size limit as configured in Config.yml (i.e., MAX_DATAFILE_SIZE).

-f font name

Name of the font to be used for Graphviz-generated PNGs.

-c

Overwrite preexisting files ("clobber" option).

-v

Print the version number of the script.

-h

Print a short help text summarizing these options.


QUANTITATIONTYPE FILE

Please see the ArrayExpress::Datafile::QT_list manpage for a description of the format of this file.


AUTHOR

Tim Rayner (rayner@ebi.ac.uk), ArrayExpress team, EBI, 2004.

Acknowledgements go to the ArrayExpress curation team for feature requests, bug reports and other valuable comments. Particular credit goes to Ele Holloway, who was responsible for curating the lists of QuantitationTypes included with this script.

