ArrayExpress::Curator::ExperimentChecker - a module used by expt_check.pl
use base qw/ArrayExpress::Curator::ExperimentChecker/;
This module represents an abstract parent class providing methods for data file and experiment annotation checks to its child classes. See the following documentation for concrete checker classes: the ArrayExpress::Curator::Validate manpage (Tab2MAGE), the ArrayExpress::MAGETAB::Checker manpage (MAGE-TAB), the ArrayExpress::Curator::MIAMExpress manpage (MIAMExpress) and the ArrayExpress::Curator::Standalone manpage (standalone).
The objects created by this module can be instantiated with the following options, common to all submission routes.
log_to_current_dir
Write the log files to the current working directory, rather than in the submissions directory.
is_standalone
Run checker in standalone mode.
clobber
Overwrite existing log and graph files without asking for confirmation.
adf_filename
The name of the file to use as ADF in feature/reporter checks. This overrides any array designs specified in the submission.
array_accession
The ArrayExpress accession number of the array design to use for feature/reporter checks. This overrides any array accessions specified in the submission.
qt_filename
The name of the file to be used for QT definitions. See the ArrayExpress::Datafile::QT_list manpage for information on the format of this file. See also include_default_qts.
include_default_qts
Use the QT definitions supplied with these scripts alongside any new QT definitions provided by the qt_filename option.
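As a rough illustration of how the options above might be assembled before a checker object is created, the following sketch validates an option hash against the documented option names. The helper unknown_options, the example file name and the commented-out constructor call are assumptions made for this example; the calling convention of new() is not described in this documentation.

```perl
use strict;
use warnings;

# Option names taken from the list above. Everything else in this
# sketch (helper name, file name, constructor call) is hypothetical.
my %KNOWN_OPTION = map { $_ => 1 } qw(
    log_to_current_dir is_standalone clobber
    adf_filename array_accession
    qt_filename include_default_qts
);

# Return the sorted names of any options that are not documented above.
sub unknown_options {
    my ($opts) = @_;
    my @unknown = grep { !$KNOWN_OPTION{$_} } keys %{$opts};
    return sort @unknown;
}

my %opts = (
    log_to_current_dir  => 1,
    clobber             => 1,
    qt_filename         => 'my_qts.txt',    # hypothetical file name
    include_default_qts => 1,
);

my @bad = unknown_options( \%opts );
die "Unknown options: @bad\n" if @bad;

# Assumed calling convention, not confirmed by this documentation:
# my $checker = ArrayExpress::Curator::Standalone->new( { %opts } );
# $checker->check();
```

Checking the option names up front is only a convenience for catching typos; the concrete checker classes may well perform their own validation.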
check()
Starts the checks, based on the options specified in the constructor.
get_miamexpress_software_type()
Returns the software term to use for a MIAMExpress export, if a unanimous verdict can be reached. Otherwise returns undef. This is also used for information purposes with Tab2MAGE and MAGE-TAB submissions, which is why it's in this superclass.
There are currently five log files written out by the script:
expt_report.log
Contains some summary information and data quality statistics for each data file.
expt_errors.log
Lists the errors that were encountered in parsing the data files. Details on Feature and QuantitationType errors are given in separate log files. Qualifier Value Source usage is also logged in this error log file.
expt_biomaterials.log
Provides a basic sanity check of the flow of BioMaterials through the experiment (i.e. Sample -> Extract -> Labeled Extract -> Hybridization). For Tab2MAGE checking the majority of the information in this file has been moved to an output PNG file, since it is a much clearer and more flexible visualization format.
expt_feature.log
Lists feature coordinates and/or reporter identifiers (FGEM only) missing from the array design(s) used in the experiment. Entries here will typically mean that a dummy array has to be used. This file also lists duplicate features in a separate list.
expt_columnheadings.log
Lists unrecognized QuantitationTypes or hybridization IDs (FGEM only) appearing in the data column headings.
The following tests are performed by this module, with output printed to the error and/or report filehandles:
Checks that the data files referred to in the Tab2MAGE spreadsheet/MAGE-TAB SDRF/MIAMExpress database actually exist on the filesystem (error log).
Checks that CHP files have been submitted as normalized rather than raw data (error log).
Confirms that submitted data files are text, not binary. This test is not applied to Affymetrix CHP files.
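How the module decides between text and binary is not stated here. One common approach in Perl is the built-in -T file test, sketched below under that assumption; the function name looks_like_text and the temporary files are purely illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return 1 if Perl's -T heuristic classifies the file as text, 0 otherwise.
# -T inspects only the start of the file and treats an empty file as text.
sub looks_like_text {
    my ($path) = @_;
    return ( -f $path && -T $path ) ? 1 : 0;
}

# Demonstration with two temporary files: one tab-delimited, one binary.
my ( $text_fh, $text_path ) = tempfile( UNLINK => 1 );
print {$text_fh} "Reporter\tSignal\nr1\t1.5\n";
close $text_fh or die "close failed: $!";

my ( $bin_fh, $bin_path ) = tempfile( UNLINK => 1 );
binmode $bin_fh;
print {$bin_fh} "\x00\x01\x02\xff" x 64;
close $bin_fh or die "close failed: $!";

printf "%s\n", looks_like_text($text_path) ? 'text' : 'binary';
printf "%s\n", looks_like_text($bin_path)  ? 'text' : 'binary';
```

Because -T is a heuristic, a check of this kind can misclassify unusual files, which may be one reason formats such as CHP are handled separately.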
Checks for Unix/DOS/Mac line endings (report log).
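A line-ending check of this kind can be sketched as a simple count of the three conventions. This is a minimal illustration and not the module's own code; the function name count_line_endings is hypothetical.

```perl
use strict;
use warnings;

# Count the line endings found in a string of file content.
# Returns a hash ref with counts for DOS (CRLF), Unix (LF) and
# classic Mac (bare CR) line endings.
sub count_line_endings {
    my ($content) = @_;
    my %count = ( dos => 0, unix => 0, mac => 0 );

    # CRLF is listed first so that it is never counted as CR plus LF.
    while ( $content =~ /(\r\n|\n|\r)/g ) {
        if    ( $1 eq "\r\n" ) { $count{dos}++ }
        elsif ( $1 eq "\n" )   { $count{unix}++ }
        else                   { $count{mac}++ }
    }
    return \%count;
}

my $count = count_line_endings("a\tb\r\nc\td\r\n");
printf "DOS: %d, Unix: %d, Mac: %d\n", @{$count}{qw(dos unix mac)};
```

When applying this to a real file, the content should be read with binmode set on the filehandle so that no line-ending translation takes place before counting.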
Checks that EXP files are submitted as raw data and that they all have Protocol, Station and Module information (error log).
Prints out the file format (Affymetrix, GenePix or Generic), the type (raw, normalized or transformed), row and column counts (report log).
Checks for repeated column headings in each data file (error log).
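As an illustration, repeated headings in a tab-delimited header line can be found with a hash of counts. This is a sketch only, assuming tab-delimited files; the function name repeated_headings is hypothetical.

```perl
use strict;
use warnings;

# Return the column headings that occur more than once in a
# tab-delimited header line. Each repeated heading is reported once,
# in the order in which its first repeat is encountered.
sub repeated_headings {
    my ($header_line) = @_;
    $header_line =~ s/[\r\n]+\z//;    # strip any trailing line ending

    my ( %seen, @repeated );
    for my $heading ( split /\t/, $header_line, -1 ) {
        push @repeated, $heading if ++$seen{$heading} == 2;
    }
    return @repeated;
}

my @repeats = repeated_headings("X\tY\tF635 Median\tF635 Median\n");
print "Repeated: $_\n" for @repeats;
```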
Checks for possible data corruption by Excel truncation of the file (error log).
Checks column headings against a list of known QTs. Reports on unrecognized QTs (error log, column headings log). This is a work in progress.
Checks FGEM column headings against the hybridization IDs for the submission, reports on those which are not recognized. This incorporates a check on the QTs for FGEM files (error log, column headings log).
Checks that the included final data matrix is laid out in DBQ order, rather than DQB. The ArrayExpress MAGE-ML loader software does not support the DQB order.
Data checks are only performed on recognized QT columns (error log). Checks are for:
Text in numeric columns
Null values in numeric columns
Floats in integer columns
Inappropriate boolean values (i.e., not 0 or 1)
Log ratios outside reasonable range
Basic check on saturation indicators (primarily GenePix files)
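The first four of these data checks can be sketched as a per-value test driven by the expected column type. The type names ('integer', 'float', 'boolean'), the messages and the function name check_value are assumptions for this example and do not reflect the module's internal representation of QuantitationTypes.

```perl
use strict;
use warnings;

# Pattern for a plain decimal number, optionally in exponent notation.
my $NUMBER_RE = qr/\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/;

# Return a list of problems (possibly empty) for one value from a
# column of the given type: 'integer', 'float' or 'boolean'.
sub check_value {
    my ( $value, $type ) = @_;

    return ('null value') if !defined $value || $value =~ /\A\s*\z/;

    if ( $type eq 'boolean' ) {
        return $value =~ /\A[01]\z/ ? () : ('boolean not 0 or 1');
    }

    return ('text in numeric column') if $value !~ $NUMBER_RE;
    return ('float in integer column')
        if $type eq 'integer' && $value !~ /\A[+-]?\d+\z/;
    return ();
}

for my $case ( [ '1.5', 'integer' ], [ 'abc', 'float' ], [ '2', 'boolean' ] ) {
    my @problems = check_value( @{$case} );
    print "$case->[0] ($case->[1]): @problems\n";
}
```

Range checks on log ratios and saturation indicators would need thresholds that this documentation does not specify, so they are left out of the sketch.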
Calculates a Benford's law statistic across dimensioned float data. In theory this should be approximately 30% for good data, which is the frequency that Benford's law predicts for a leading digit of 1 (report log).
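One way to compute such a statistic is to take the fraction of non-zero values whose first significant digit is 1, for which Benford's law predicts log10(2), or about 30.1%. The sketch below assumes that this is the quantity being reported; the function name leading_one_fraction is hypothetical.

```perl
use strict;
use warnings;

# Pattern for a plain decimal number, optionally in exponent notation.
my $BENFORD_NUMBER_RE = qr/\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/;

# Fraction of non-zero numeric values whose first significant digit
# is 1. Non-numeric and zero values are skipped. Returns undef if no
# usable values are supplied.
sub leading_one_fraction {
    my @values = @_;
    my ( $ones, $total ) = ( 0, 0 );

    for my $value (@values) {
        next unless defined $value && $value =~ $BENFORD_NUMBER_RE;
        next if $value == 0;

        # In exponent notation the first character of the absolute
        # value is its first significant digit, e.g. "1.2e+02".
        my $sci = sprintf '%.15e', abs $value;
        $total++;
        $ones++ if substr( $sci, 0, 1 ) eq '1';
    }
    return $total ? $ones / $total : undef;
}

my $fraction = leading_one_fraction( 1, 12, 0.15, 2, 30, 400, 0.005, 9 );
printf "Leading digit 1: %.1f%%\n", 100 * $fraction;
```

A value far from 30% on data spanning several orders of magnitude can hint at truncated, clipped or otherwise manipulated measurements, although it is not proof of a problem.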
Calculates overall percent null across the whole data set. Zero values are also counted as null. In practice under 10% null values seems to be a reasonable expectation (report log).
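The percentage can be sketched as a count of undefined, empty and zero values over the total. This is an illustration under the rule stated above (zeros count as null); the function name percent_null is hypothetical.

```perl
use strict;
use warnings;

# Pattern for a plain decimal number, optionally in exponent notation.
my $NULL_NUMBER_RE = qr/\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/;

# Percentage of values that are undefined, empty or numerically zero.
# Returns nothing (undef in scalar context) for an empty list.
sub percent_null {
    my @values = @_;
    return unless @values;

    my $null_count = grep {
        !defined $_
            || $_ =~ /\A\s*\z/
            || ( $_ =~ $NULL_NUMBER_RE && $_ == 0 )
    } @values;

    return 100 * $null_count / scalar @values;
}

printf "%.1f%% null\n", percent_null( '1.5', '0', '2.25', '0.0', '', '7' );
```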
Checks the feature coordinates (raw, normalized data) or the reporter identifiers (FGEM) against either the array designs linked to the hybridization in MIAMExpress, or against a user-supplied ADF. Prints out a list of features not found in the array design (error log, features log). Also alerts the curator when significantly fewer features are found in the data file compared to the array design (error log). Note that this will give false errors on array designs associated with dummy array designs (e.g., some Affy arrays).
Checks that Features or Reporter identifiers are not repeated within the same file. Prints out a list of duplicate features/reporters (error log, features log).
Checks that there are no duplicate files associated with the submission (error log).
Checks that all the files of a given type (raw, normalized) have consistent numbers of rows and columns (error log).
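A consistency check of this kind can be sketched by grouping files by type and collecting the distinct row and column counts seen for each type. The per-file hash structure (name, type, rows, cols) and the function name inconsistent_types are assumptions for this example.

```perl
use strict;
use warnings;

# Each file is described by a hash ref with keys name, type, rows and
# cols (a hypothetical structure). Returns the sorted list of file
# types whose files do not all share the same dimensions.
sub inconsistent_types {
    my @files = @_;

    my %dims_for;
    for my $file (@files) {
        my $dims = join 'x', $file->{rows}, $file->{cols};
        $dims_for{ $file->{type} }{$dims}++;
    }

    my @bad = grep { scalar( keys %{ $dims_for{$_} } ) > 1 } keys %dims_for;
    return sort @bad;
}

my @types = inconsistent_types(
    { name => 'a.txt', type => 'raw',        rows => 100, cols => 5 },
    { name => 'b.txt', type => 'raw',        rows => 100, cols => 5 },
    { name => 'c.txt', type => 'normalized', rows => 100, cols => 3 },
    { name => 'd.txt', type => 'normalized', rows => 90,  cols => 3 },
);
print "Inconsistent dimensions for: @types\n";
```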
Checks that the parameters in each EXP file which should be the same are the same (error log).
Checks a single line from each file against the same line in every other file, and reports on any matching pairs of files (error log).
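The cross-file comparison can be sketched by indexing one chosen line from every file and reporting each pair of files that share it. Which line the module samples is not stated here, so the line index is a parameter; the data structure and the function name matching_pairs are hypothetical.

```perl
use strict;
use warnings;

# Given a hash ref of file name => array ref of lines, compare the
# line at $line_index across all files. Returns a list of
# [ file_a, file_b ] pairs whose sampled lines are identical.
sub matching_pairs {
    my ( $lines_for, $line_index ) = @_;

    # Group file names by the content of the sampled line.
    my %files_with;
    for my $file ( sort keys %{$lines_for} ) {
        my $line = $lines_for->{$file}[$line_index];
        next unless defined $line;
        push @{ $files_with{$line} }, $file;
    }

    # Every group with two or more files yields one pair per combination.
    my @pairs;
    for my $line ( sort keys %files_with ) {
        my @files = @{ $files_with{$line} };
        for my $i ( 0 .. $#files - 1 ) {
            for my $j ( $i + 1 .. $#files ) {
                push @pairs, [ $files[$i], $files[$j] ];
            }
        }
    }
    return @pairs;
}

my @pairs = matching_pairs(
    {
        'a.txt' => [ "X\tY", "1\t2" ],
        'b.txt' => [ "X\tY", "1\t2" ],
        'c.txt' => [ "X\tY", "3\t4" ],
    },
    1,
);
print "Possible duplicate data: $_->[0] and $_->[1]\n" for @pairs;
```

Sampling a data line rather than the header avoids flagging every pair of files that merely share the same column headings.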
Tim Rayner (rayner@ebi.ac.uk), ArrayExpress team, EBI, 2004.
Acknowledgements go to the ArrayExpress curation team for feature requests, bug reports and other valuable comments.