Module detail: MAGETAB.pm

NAME

ArrayExpress::MAGETAB - A parser class for MAGE-TAB documents.

SYNOPSIS

 use ArrayExpress::MAGETAB;
 my $magetab = ArrayExpress::MAGETAB->new({
     idf              => $idf_file,
     target_directory => $dir,
     expt_accession   => $accn,
 });

 $magetab->write_mageml();

DESCRIPTION

This module acts as a front end to the MAGE-TAB parsing API. Parser objects are instantiated with a number of attributes that control how the MAGE-TAB document is parsed. At present only IDF and SDRF documents are supported; it is anticipated that the parser will be extended to support ADF at a later date.

Currently the parser is built using a MAGEv1.1 object model to store the MAGE-TAB metadata. It is envisioned that this dependence on the Bio::MAGE modules may be removed once a full MAGE-TAB object model is agreed upon by the community.

To simplify the process of data submission, ArrayExpress has introduced a new flavour of MAGE-TAB in which the IDF and SDRF sections are combined into a single worksheet. This parser supports both MAGE-TAB v1.1 documents (with separate IDF and SDRF) and these combined documents.
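
As a minimal sketch of the combined-document case, the constructor can be given magetab_doc rather than idf. That magetab_doc is passed in place of idf is an assumption drawn from the attribute descriptions in this document; the file name, directory and accession are hypothetical placeholders. The call is guarded so that the script still runs where the module is not installed.

```perl
use strict;
use warnings;

# Hypothetical placeholders: substitute your own file, directory and accession.
my %args = (
    magetab_doc      => 'combined_idf_sdrf.txt',   # single IDF+SDRF worksheet
    target_directory => 'output',
    expt_accession   => 'E-TEST-1',
);

# Load the parser only if it is available on this system.
if ( eval { require ArrayExpress::MAGETAB; 1 } ) {
    my $magetab = ArrayExpress::MAGETAB->new( \%args );
    $magetab->write_mageml();
}
else {
    print "ArrayExpress::MAGETAB is not installed; skipping.\n";
}
```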

METHODS

new

Object constructor. This recognises the following attributes:

idf: The path of the IDF file with which to start parsing.
magetab_doc: The path of a combined IDF+SDRF file to parse.
output_file: The name of the output MAGE-ML file.
namespace: The namespace to use in MAGE identifier creation.
authority: The authority to use in MAGE identifier creation.
expt_accession: The accession number assigned to the experiment.
target_directory: The directory into which to write the output files.
source_directory: The directory which contains all data and SDRF files.
is_standalone: A flag indicating whether the script is able to connect to ArrayExpress to retrieve array design information. It is sometimes desirable to skip these downloads, which can be quite large.
qt_filename: QuantitationType file. This option allows you to specify a custom QuantitationType definition file to override those defined as part of the Tab2MAGE package. See the ArrayExpress::Datafile::QT_list manpage for more information.
include_default_qts: This option can be used in conjunction with qt_filename to indicate that the QuantitationType listing from the Tab2MAGE package itself should be included in the lists of known QuantitationTypes used in data file parsing. The default behaviour is to deactivate these known QTs if a custom QT file is to be used.
keep_all_qts: A flag indicating whether unrecognised QuantitationTypes in data files should be kept or not. The default behaviour is to strip unrecognised columns out of the data files.
reporter_prefix: The prefix to be used during Reporter identifier construction. This prefix is prepended to the identifiers listed in the data files.
compseq_prefix: The prefix to be used during CompositeElement (CompositeSequence) identifier construction. This prefix is prepended to the identifiers listed in the data files.
protocol_accession_service: A code reference used to reassign protocol accessions. See PROTOCOL_ACCESSIONS, below.
protocol_accession_prefix: The prefix to be used for protocol accession creation, when the autosubmissions system is in use. See PROTOCOL_ACCESSIONS, below.
keep_protocol_accns: A flag indicating that the protocol accessions given in the IDF should be kept, rather than being replaced with newly assigned accession numbers. This option overrides protocol_accession_service.
use_plain_text: Some file formats are only supported in their native forms by ArrayExpress. Nonetheless, this package can parse some of these data formats (examples include Nimblegen data and Affymetrix CEL files) into tab-delimited representations fully encoded in the MAGE-ML document. Set this option to request that conversion.
skip_datafiles: This option tells the parser not to read the data files referenced by a given MAGE-TAB document, and to attempt to generate MAGE in their absence instead. This option is particularly useful for unsupported data file formats.
ignore_size_limits: The Tab2MAGE configuration file allows the user to set maximum data file sizes for parsing and web download, to provide some protection from overloading the system in a production pipeline setting. To temporarily ignore these size limits, use this option.
in_relaxed_mode: A flag indicating whether to allow minor errors during parsing. At present the only errors ignored in this mode are references, in Term Source REF, Protocol REF, Parameter Value [] and Factor Value [] columns, to Names which have not been defined in the IDF.
clobber: Flag indicating whether or not to overwrite existing files without prompting the user.
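
The attributes above can be combined in a single constructor call. The following is a sketch only: it assumes that flag attributes are switched on with a true value such as 1, and every file name, directory and accession shown is a hypothetical placeholder.

```perl
use strict;
use warnings;

# All file names and the accession below are hypothetical placeholders.
my %args = (
    idf                 => 'experiment.idf.txt',
    source_directory    => 'data',      # where the SDRF and data files live
    target_directory    => 'output',    # where the output files are written
    expt_accession      => 'E-TEST-1',
    qt_filename         => 'custom_QTs.txt',
    include_default_qts => 1,    # use the packaged QT list as well
    keep_all_qts        => 1,    # do not strip unrecognised QT columns
    in_relaxed_mode     => 1,    # tolerate undefined Name references
    clobber             => 1,    # overwrite existing files without prompting
);

# Load the parser only if it is available on this system.
if ( eval { require ArrayExpress::MAGETAB; 1 } ) {
    my $magetab = ArrayExpress::MAGETAB->new( \%args );
    $magetab->write_mageml();
}
else {
    print "ArrayExpress::MAGETAB is not installed; nothing to do.\n";
}
```

Here include_default_qts is set alongside qt_filename because, as described above, the packaged QuantitationTypes are otherwise deactivated when a custom QT file is supplied.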

parse

Starts the MAGE-TAB parse and loads the document into memory.

write_mageml

Writes out MAGE-ML corresponding to the input MAGE-TAB document. If the MAGE-TAB has not yet been parsed, parse() is called automatically.

PROTOCOL_ACCESSIONS

The parser provides a set of callbacks which can be used to map MAGE-TAB Protocol Names to unique accessions at the point of parsing. If the autosubmissions system has been set up and configured, the parser defaults to using that mechanism to assign protocol accessions. If you wish to use your own service, use the protocol_accession_service and protocol_accession_prefix attributes to control this. The protocol_accession_service attribute should be a code reference which accepts two arguments: (a) the Protocol Name as given in the IDF, and (b) the experiment accession. The code reference should return a unique accession, which is then assigned to the protocol in the output MAGE-ML.

If the autosubmissions system is to be used, the protocol_accession_prefix attribute must be set, e.g. to "P-MTAB-".
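
A custom protocol_accession_service of the kind described above might look like the following sketch. The helper make_accession_service, the "P-LOCAL-" prefix and the example names are hypothetical and not part of the package. The closure takes the two documented arguments (Protocol Name, experiment accession) and returns an accession that is unique only within the running process; a production service would presumably need persistent storage.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a code reference which hands out sequential
# accessions and remembers them, so that the same Protocol Name within the
# same experiment always receives the same accession.
sub make_accession_service {
    my ($prefix) = @_;
    my %assigned;       # "experiment\0protocol name" => accession
    my $counter = 0;

    return sub {
        my ( $protocol_name, $expt_accession ) = @_;
        my $key = join "\0", $expt_accession, $protocol_name;
        $assigned{$key} = $prefix . ++$counter
            unless exists $assigned{$key};
        return $assigned{$key};
    };
}

my $service = make_accession_service('P-LOCAL-');    # hypothetical prefix

print $service->( 'Hybridization', 'E-TEST-1' ), "\n";    # P-LOCAL-1
print $service->( 'Extraction',    'E-TEST-1' ), "\n";    # P-LOCAL-2
print $service->( 'Hybridization', 'E-TEST-1' ), "\n";    # P-LOCAL-1 again
```

The resulting code reference would then be passed to the constructor as protocol_accession_service => $service, alongside the other attributes.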

AUTHOR

Tim Rayner (rayner@ebi.ac.uk), ArrayExpress team, EBI, 2008.

Acknowledgements go to the ArrayExpress curation team for feature requests, bug reports and other valuable comments.