expt_check.pl - a script to check experiment data files submitted to MIAMExpress
expt_check.pl -e <Tab2MAGE spreadsheet>
expt_check.pl -i <IDF file>
expt_check.pl -l <login name> -t <experiment title>
expt_check.pl -s <list of data files>
(use -A or -a to check files against an array design in standalone mode).
This script can be used to check experimental data for submission to ArrayExpress in a few different ways. For Tab2MAGE submissions it parses the Tab2MAGE spreadsheet format and reports on problems with data files and MIAME metadata. Used in conjunction with a local MIAMExpress installation, the script takes the experiment title and the submitter login name, and checks the submitted data files for errors. Several log files are written to the relevant MIAMExpress submission directory unless the -p option is used to redirect them to the current directory.
Normally the script can determine which array design to use from the Tab2MAGE spreadsheet or by querying the MIAMExpress database. An optional ADF filename may also be provided using the -a option, and ArrayExpress accession numbers may be specified using the -A option. Specifying your own ADFs or accession numbers makes the script ignore any array designs pointed to by the spreadsheet or database entry for the experiment.
Known QuantitationTypes are listed in a separate file or files, created with a simple tab-delimited format. The layout of these files is described in the ArrayExpress::Datafile::QT_list manpage.
-e <spreadsheet filename>
The Tab2MAGE spreadsheet to be checked.
-i <IDF filename>
The MAGE-TAB IDF file to be checked.
-l <login name>
The MIAMExpress login name of the experiment submitter.
-t <experiment title>
The MIAMExpress title of the experiment, surrounded by quotes if the title contains spaces.
-a <ADF filename>
The -a switch designates the ADF filename to be used for all the hybridizations in the experiment. This option overrides any database links between hybridizations and array designs, and is provided initially as a convenience.
-A <Array accession number (ArrayExpress)>
Use the -A switch to indicate the accession number of an ArrayExpress array design to be used for checking the data files. Ordinarily this should not be needed, as the script should be able to link the MIAMExpress submission with ArrayExpress array designs automatically.
Forces overwriting of existing files ("clobber"). If this switch is omitted the user will be asked whether to overwrite already existing files.
-p
Write to files in the present working directory ("pwd"). The default is to write to files in the submission directory.
-s
Standalone option. The script will check the files listed on the command line rather than connecting to MIAMExpress and ArrayExpress. When used with the -e or -i options, the Tab2MAGE or MAGE-TAB document is checked but no connection is made to ArrayExpress to retrieve array information. To check features and reporter identifiers with this option, an ADF must be specified with the -a option or an ArrayExpress accession number can be used with the -A option. In the latter case a connection is made to ArrayExpress without connecting to MIAMExpress.
directory
Source directory. This indicates the directory to search for data files. This option is only used for Tab2MAGE submission checks, as MIAMExpress defines its own directory structure which is automatically searched by this script. If this option is omitted, only the current working directory is searched for Tab2MAGE submission data files.
Skip data file checking. This option can be used to quickly check experiment annotation without having to wait for the script to validate all the data files.
QT filename
QuantitationType file. This option allows you to specify a custom QuantitationType definition file to override those defined in the ArrayExpress::Curator::Config module. See the ArrayExpress::Datafile::QT_list manpage for more information.
QT filename
QuantitationType file. This option will add the new QuantitationType definitions to those included with the Tab2MAGE package. See the ArrayExpress::Datafile::QT_list manpage for more information.
namespace
Prefix of the Reporter identifier to use when checking data files against array designs. This prefix is added to each "Reporter Identifier" in the data files prior to comparison with the actual identifiers in the ADF. The default is to assume MIAMExpress-like identifiers.
namespace
Prefix of the CompositeSequence identifier to use when checking data files against array designs. This prefix is added to each "CompositeSequence Identifier" in the data files prior to comparison with the actual identifiers in the ADF. The default is to assume MIAMExpress-like identifiers.
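The prefixing behaviour described for the two namespace options can be pictured with a small sketch. This is not the script's own code (the script is written in Perl). The function name, the identifiers and the prefix string below are all hypothetical and chosen only for illustration, and the actual MIAMExpress-like default prefix is not shown here.

```python
def find_unknown_identifiers(data_ids, adf_ids, prefix=""):
    """Return data-file identifiers that, once prefixed, are absent from the ADF."""
    known = set(adf_ids)
    # The prefix is prepended to each raw identifier before the lookup,
    # mirroring the comparison described for the namespace options.
    return [raw for raw in data_ids if prefix + raw not in known]


# Hypothetical ADF identifiers and namespace prefix (assumptions, not real values).
adf_ids = ["example.org:Reporter:R1", "example.org:Reporter:R2"]
data_ids = ["R1", "R2", "R9"]

missing = find_unknown_identifiers(data_ids, adf_ids, prefix="example.org:Reporter:")
print(missing)  # ['R9']
```

With an empty prefix every identifier in this example would be reported as unknown, which illustrates why the prefix has to match the namespace used in the ADF.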
Ignore the data file size limit as configured in Config.yml (i.e., MAX_DATAFILE_SIZE).
The checker will support MAGE-TAB documents in which a single IDF and SDRF have been combined (in that order), with the start of each section marked by [IDF] and [SDRF] respectively. Note that such documents are not compliant with the MAGE-TAB format specification; this format is used by ArrayExpress to simplify data submissions.
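As a rough illustration of the combined layout, the following sketch splits such a document into its two sections using the [IDF] and [SDRF] markers. It is an assumption-laden example rather than the checker's actual parser: the function name is hypothetical, the sample rows are invented, and it assumes each marker sits on a line of its own (possibly followed by trailing tabs or spaces).

```python
def split_combined_magetab(text):
    """Split a combined document into {'IDF': [...], 'SDRF': [...]} lists of lines."""
    sections = {}
    current = None
    for line in text.splitlines():
        marker = line.strip()
        if marker in ("[IDF]", "[SDRF]"):
            # A marker line starts a new section; following lines belong to it.
            current = marker.strip("[]")
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# Invented sample content, for illustration only.
combined = (
    "[IDF]\n"
    "Investigation Title\tExample experiment\n"
    "[SDRF]\n"
    "Source Name\tArray Data File\n"
    "sample 1\tsample1.txt\n"
)

parts = split_combined_magetab(combined)
print(list(parts))         # ['IDF', 'SDRF']
print(len(parts["SDRF"]))  # 2
```

Lines appearing before the first marker are ignored in this sketch; whether the real checker tolerates such lines is not stated in this document.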
Prints the version number of the script.
Prints a short help text.
There are numerous tests performed by this script. Listed below are the tests which are performed on each data file. There is also a series of checks which are made on any MIAME metadata supplied to the script. This metadata may be in the form of a Tab2MAGE spreadsheet, or an experiment submission in a local MIAMExpress database. Please see TESTS in the ArrayExpress::Curator::ExperimentChecker manpage for a list of general tests performed on all submissions. See also TESTS in the ArrayExpress::Curator::Validate manpage for the Tab2MAGE spreadsheet tests, TESTS in the ArrayExpress::MAGETAB::Checker manpage for MAGE-TAB document checking, and TESTS in the ArrayExpress::Curator::MIAMExpress manpage for more information on the MIAMExpress tests.
Users wishing to use this script with local installations of MIAMExpress should check the ArrayExpress::Curator::Config module and change whatever parameters are necessary.
Tim Rayner (rayner@ebi.ac.uk), ArrayExpress team, EBI, 2004.
Acknowledgements go to the ArrayExpress curation team for feature requests, bug reports and other valuable comments.
Affymetrix CHP files are not currently supported. Moreover, the script will not check the features referred to in CEL files against an array design until I can figure out how to retrieve a list of valid features in a timely fashion. CompositeSequence IDs in final data matrix files from Affymetrix submissions are fully supported and checked against the array design. Basic data quality calculations (percent null, Benford's law) are made on Affymetrix CEL and FGEM data.
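For readers unfamiliar with the two data quality measures named above, here is a minimal sketch of how they are commonly computed. It does not reproduce the script's own calculation, which is not described in this document. The function names are hypothetical, and the definition of a "null" value (None or an empty string) is an assumption. Benford's law gives the expected proportion of values with leading digit d as log10(1 + 1/d).

```python
import math
from collections import Counter


def percent_null(values):
    """Percentage of entries that are None or empty strings (assumed null definition)."""
    if not values:
        return 0.0
    nulls = sum(1 for v in values if v is None or str(v).strip() == "")
    return 100.0 * nulls / len(values)


def benford_expected(digit):
    """Expected proportion of values with this leading digit under Benford's law."""
    return math.log10(1 + 1 / digit)


def leading_digit(x):
    """First significant digit of a nonzero finite number."""
    # Scientific notation always starts with the first significant digit.
    return int(f"{abs(x):.15e}"[0])


def first_digit_frequencies(values):
    """Observed proportion of each leading digit 1-9, skipping zeros and non-finite values."""
    digits = [leading_digit(v) for v in values if v and math.isfinite(v)]
    counts = Counter(digits)
    total = len(digits)
    return {d: (counts[d] / total if total else 0.0) for d in range(1, 10)}


observed = first_digit_frequencies([123.4, 0.3, 19, 0.0])
for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

Comparing the observed and expected columns gives a quick indication of whether a set of intensities looks like naturally distributed measurements. How the script summarises or thresholds any such comparison is not covered here.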