Module detail: ExperimentChecker.pm

NAME

ArrayExpress::Curator::ExperimentChecker - a module used by expt_check.pl

SYNOPSIS

 use base qw/ArrayExpress::Curator::ExperimentChecker/;

DESCRIPTION

This module represents an abstract parent class providing methods for data file and experiment annotation checks to its child classes. See the following documentation for concrete checker classes: the ArrayExpress::Curator::Validate manpage (Tab2MAGE), the ArrayExpress::MAGETAB::Checker manpage (MAGE-TAB), the ArrayExpress::Curator::MIAMExpress manpage (MIAMExpress) and the ArrayExpress::Curator::Standalone manpage (standalone).

OPTIONS

The objects created by this module can be instantiated with the following options, common to all submission routes.

log_to_current_dir: Write the log files to the current working directory, rather than to the submissions directory.
is_standalone: Run checker in standalone mode.
clobber: Overwrite existing log and graph files without asking for confirmation.
adf_filename: The name of the file to use as ADF in feature/reporter checks. This overrides any array designs specified in the submission.
array_accession: The ArrayExpress accession number of the array design to use for feature/reporter checks. This overrides any array accessions specified in the submission.
qt_filename: The name of the file to be used for QT definitions. See the ArrayExpress::Datafile::QT_list manpage for information on the format of this file. See also include_default_qts.
include_default_qts: Use the QT definitions supplied with these scripts alongside any new QT definitions provided by the qt_filename option.

METHODS

check(): Starts the checks, based on the options specified in the constructor.
get_miamexpress_software_type(): Returns the software term to use for a MIAMExpress export, if a unanimous verdict can be reached. Otherwise returns undef. This is also used for information purposes with Tab2MAGE and MAGE-TAB submissions, which is why it's in this superclass.

FILES

There are currently five log files written out by the script:

expt_report.log: Contains some summary information and data quality statistics for each data file.
expt_errors.log: Lists the errors that were encountered in parsing the data files. Details on Feature and QuantitationType errors are given in separate log files. Qualifier Value Source usage is also logged in this error log file.
expt_biomaterials.log: Provides a basic sanity check of the flow of BioMaterials through the experiment (i.e. Sample -> Extract -> Labeled Extract -> Hybridization). For Tab2MAGE checking the majority of the information in this file has been moved to an output PNG file, since it is a much clearer and more flexible visualization format.
expt_feature.log: Lists feature coordinates and/or reporter identifiers (FGEM only) missing from the array design(s) used in the experiment. Entries here typically mean that a dummy array has to be used. This file also lists duplicate features in a separate list.
expt_columnheadings.log: Lists unrecognized QuantitationTypes or hybridization IDs (FGEM only) appearing in the data column headings.

TESTS

The following tests are performed by this module, with output printed to the error and/or report filehandles:

File existence

Checks that the data files referred to in the Tab2MAGE spreadsheet/MAGE-TAB SDRF/MIAMExpress database actually exist on the filesystem (error log).

Affymetrix CHP file as normalized data

Checks that CHP files have been submitted as normalized rather than raw data (error log).

Text file check

Confirms that submitted data files are text, not binary. This test is not applied to Affymetrix CHP files.

File line endings check

Checks for Unix/DOS/Mac line endings (report log).
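
As an illustration of the kind of test described here, the following is a minimal sketch that counts each style of line ending in a string of file content. The helper name count_line_endings is hypothetical and is not part of this module; the real implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ExperimentChecker): count the DOS (CRLF),
# Unix (LF) and old Mac (CR) line endings found in a string.
sub count_line_endings {
    my ($content) = @_;
    my %count = ( dos => 0, unix => 0, mac => 0 );

    # CRLF is listed first so that its CR and LF are not counted separately.
    while ( $content =~ /(\r\n|\n|\r)/g ) {
        my $eol = $1;
        if    ( $eol eq "\r\n" ) { $count{dos}++ }
        elsif ( $eol eq "\n" )   { $count{unix}++ }
        else                     { $count{mac}++ }
    }
    return \%count;
}

my $counts = count_line_endings("a\tb\r\nc\td\r\ne\tf\n");
printf "%s: %d\n", $_, $counts->{$_} for sort keys %$counts;
```

A file reporting more than one non-zero count has mixed line endings, which is usually worth flagging.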

Affymetrix EXP file check

Checks that EXP files are submitted as raw data and that they all have Protocol, Station and Module information (error log).

Basic data file summary

Prints out the file format (Affymetrix, GenePix or Generic), the type (raw, normalized or transformed), and the row and column counts (report log).

Duplicate columns

Checks for repeated column headings in each data file (error log).
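
A repeated-heading test of this kind can be sketched as below. The helper name duplicate_headings and the example header line are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ExperimentChecker): return the headings
# that occur more than once, each reported a single time.
sub duplicate_headings {
    my (@headings) = @_;
    my ( %seen, @dups );
    for my $heading (@headings) {
        # Record a heading the second time it is seen, and only then.
        push @dups, $heading if ++$seen{$heading} == 2;
    }
    return @dups;
}

# Illustrative tab-delimited header line with one repeated heading.
my $header_line = "MetaColumn\tMetaRow\tF635 Mean\tF532 Mean\tF635 Mean";
my @dups = duplicate_headings( split /\t/, $header_line );
print "Duplicate column: $_\n" for @dups;
```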

Excel truncated files

Checks for possible data corruption by Excel truncation of the file (error log).

QuantitationTypes

Checks column headings against a list of known QTs. Reports on unrecognized QTs (error log, column headings log). This is a work in progress.

FGEM hybridization IDs

Checks FGEM column headings against the hybridization IDs for the submission and reports on those which are not recognized. This incorporates a check on the QTs for FGEM files (error log, column headings log).

FGEM BioDataCube order

Checks that the included final data matrix is laid out in DBQ order, rather than DQB. The ArrayExpress MAGE-ML loader software does not support the DQB order.

Data checks

Data checks are only performed on recognized QT columns, with output written to the error log. Checks are for:

  Text in numeric columns
  Null values in numeric columns
  Floats in integer columns
  Inappropriate boolean values (i.e., not 0 or 1)
  Log ratios outside reasonable range
  Basic check on saturation indicators (primarily GenePix files)
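
The first five checks in the list above could be sketched roughly as follows. The helper name check_value, the type names ('integer', 'float', 'boolean', 'log_ratio') and the log ratio limit of 20 are all assumptions of this sketch; this document does not state the values used by the module itself. The saturation check is omitted.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ExperimentChecker): return a description
# of the problem with $value for a column of the given type, or undef if the
# value looks acceptable.
sub check_value {
    my ( $value, $type ) = @_;

    return 'null value' if !defined $value || $value =~ /^\s*$/;

    if ( $type eq 'boolean' ) {
        return 'boolean not 0 or 1' if $value !~ /^[01]$/;
        return;
    }

    # All remaining types are numeric: digits, optional fraction and exponent.
    return 'text in numeric column'
        unless $value =~ /^[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?$/;

    return 'float in integer column'
        if $type eq 'integer' && $value !~ /^[+-]?\d+$/;

    # The limit of 20 is an arbitrary illustration of a "reasonable range".
    return 'log ratio out of range'
        if $type eq 'log_ratio' && abs($value) > 20;

    return;
}

for my $case ( [ '1.5', 'integer' ], [ 'abc', 'float' ], [ '2', 'boolean' ], [ '0.8', 'log_ratio' ] ) {
    my ( $value, $type ) = @$case;
    my $problem = check_value( $value, $type );
    printf "%-6s %-10s %s\n", $value, $type, defined $problem ? $problem : 'ok';
}
```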

Benford's law

Calculates a Benford's law statistic across dimensioned float data. Benford's law predicts that roughly 30% of values begin with the digit 1, so in theory the figure should be approximately 30% for good data (report log).
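
One plausible reading of this statistic is the percentage of values whose first significant digit is 1, which Benford's law puts at log10(2), or about 30.1%. The sketch below computes that percentage; the helper name percent_leading_one is hypothetical, and the module's actual calculation may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ExperimentChecker): percentage of non-zero
# numeric values whose first significant digit is 1. Returns undef when there
# are no non-zero values.
sub percent_leading_one {
    my (@values) = @_;
    my ( $ones, $total ) = ( 0, 0 );
    for my $value (@values) {
        # Scientific notation puts the first significant digit up front.
        my $sci   = sprintf '%.15e', abs $value;
        my $digit = substr $sci, 0, 1;
        next if $digit eq '0';    # zero has no leading significant digit
        $total++;
        $ones++ if $digit eq '1';
    }
    return $total ? 100 * $ones / $total : undef;
}

# Powers of two are a classic example of data that follows Benford's law.
my @powers = map { 2**$_ } 0 .. 19;
printf "Leading digit 1: %.1f%%\n", percent_leading_one(@powers);
```

Six of the first twenty powers of two (1, 16, 128, 1024, 16384 and 131072) begin with the digit 1, so this example reports 30.0%.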

Percent null

Calculates overall percent null across the whole data set. Zero values are also counted as null. In practice under 10% null values seems to be a reasonable expectation (report log).
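
A minimal sketch of such a calculation is shown below, treating empty cells and zero values as null as described above. The helper names percent_null and is_null are hypothetical. Whether the module also treats other tokens (such as "NA") as null is not stated here, so none are assumed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ExperimentChecker): true if a cell is
# undefined, empty, whitespace-only, or a zero such as "0", "0.0" or ".00".
sub is_null {
    my ($cell) = @_;
    return 1 if !defined $cell || $cell =~ /^\s*$/;
    return $cell =~ /^[+-]?(?:0+\.?0*|\.0+)$/ ? 1 : 0;
}

# Hypothetical helper: percentage of null cells across all rows, where each
# row is an array reference of cell values. Returns undef for no cells.
sub percent_null {
    my (@rows) = @_;
    my ( $null, $total ) = ( 0, 0 );
    for my $row (@rows) {
        for my $cell (@$row) {
            $total++;
            $null++ if is_null($cell);
        }
    }
    return $total ? 100 * $null / $total : undef;
}

my @rows = (
    [ '12.5', '0', '7.1', '3.3', '9.0' ],
    [ '4.2',  '',  '8.8', '1.0', '6.5' ],
);
my $percent = percent_null(@rows);
printf "%.1f%% null%s\n", $percent, $percent > 10 ? ' (above the 10% guideline)' : '';
```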

Feature/Reporter check vs. array design

Checks the feature coordinates (raw, normalized data) or the reporter identifiers (FGEM) against either the array designs linked to the hybridization in MIAMExpress, or against a user-supplied ADF. Prints out a list of features not found in the array design (error log, features log). Also alerts the curator when significantly fewer features are found in the data file compared to the array design (error log). Note that this will give false errors on array designs associated with dummy array designs (e.g., some Affy arrays).
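
At its core this is a set comparison between the identifiers in the data file and those in the array design. The sketch below shows one way to do it; the helper name missing_features, the "MetaColumn.MetaRow.Column.Row" string form of a feature coordinate, and the 90% coverage threshold are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ExperimentChecker): return the features
# present in the data file but absent from the array design.
sub missing_features {
    my ( $design_features, $data_features ) = @_;
    my %in_design = map { ( $_ => 1 ) } @$design_features;
    return grep { !$in_design{$_} } @$data_features;
}

my @design = qw(1.1.1.1 1.1.1.2 1.1.2.1 1.1.2.2);
my @data   = qw(1.1.1.1 1.1.2.2 1.1.3.1);

my @missing = missing_features( \@design, \@data );
print "Not in array design: $_\n" for @missing;

# Alert when the data file covers much less of the design than expected.
# The 90% threshold is an arbitrary illustration.
my $coverage = ( @data - @missing ) / @design;
printf "Only %d%% of design features found in data file\n", 100 * $coverage
    if $coverage < 0.9;
```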

Duplicate Features/Reporters

Checks that Features or Reporter identifiers are not repeated within the same file. Prints out a list of duplicate features/reporters (error log, features log).

Duplicate filenames

Checks that there are no duplicate files associated with the submission (error log).

Row/column count consistency

Checks that all the files of a given type (raw, normalized) have consistent numbers of rows and columns (error log).

Affy EXP file consistency

Checks that parameters which are expected to be identical across the EXP files are in fact identical (error log).

Potentially duplicated files with different names

Checks a single line from each file against the same line in every other file, and reports on any matching pairs of files (error log).
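
The pairwise comparison described here could be sketched as follows. The helper names nth_line and matching_pairs are hypothetical, and the choice of line 10 is an assumption: this document does not say which line the module samples.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ExperimentChecker): return the line at the
# given 1-based line number, or undef if the file is shorter than that.
sub nth_line {
    my ( $path, $n ) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $line;
    while ( my $current = <$fh> ) {
        if ( $. == $n ) { $line = $current; last }
    }
    close $fh;
    return $line;
}

# Hypothetical helper: given a hash reference of filename => sampled line,
# return each pair of files (as an array reference) whose lines are identical.
sub matching_pairs {
    my ($line_for) = @_;
    my @files = sort keys %$line_for;
    my @pairs;
    for my $i ( 0 .. $#files - 1 ) {
        for my $j ( $i + 1 .. $#files ) {
            my ( $first, $second ) = @files[ $i, $j ];
            next unless defined $line_for->{$first} && defined $line_for->{$second};
            push @pairs, [ $first, $second ]
                if $line_for->{$first} eq $line_for->{$second};
        }
    }
    return @pairs;
}

# Compare line 10 of every file named on the command line.
my %line_for = map { ( $_ => nth_line( $_, 10 ) ) } @ARGV;
print "Possible duplicates: $_->[0] and $_->[1]\n" for matching_pairs( \%line_for );
```

Sampling a single line keeps the comparison cheap, at the cost of occasional false positives when two distinct files happen to share that line.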

AUTHOR

Tim Rayner (rayner@ebi.ac.uk), ArrayExpress team, EBI, 2004.

Acknowledgements go to the ArrayExpress curation team for feature requests, bug reports and other valuable comments.