These data guidelines are aimed at helping researchers to improve the integrity of their research data and the efficiency of analysis, either in-house or by external groups such as QCIF Bioinformatics. Correctly specified data is essential for processing by analytical software and interpretation of the results. Well-formatted data is simpler and easier to import into analysis software, and also simplifies engagement with core analysis facilities and external service providers.
If exporting or using data in a text-only format, tab-separated values (.tab or .tsv) are preferred over space-delimited or comma-separated values (.csv). Text fields often contain spaces of their own, and many European countries use a comma rather than a full stop as the decimal separator, so neither spaces nor commas are reliable delimiters.
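As a minimal sketch of a tab-separated export, the following uses Python's standard csv module. The function name, column names and sample values are hypothetical and chosen only for illustration.

```python
import csv
import io


def write_tsv(rows, fieldnames, handle):
    """Write dictionaries as tab-separated values with a single header row."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical records: one text field contains a space, which is harmless
# because the tab is the only delimiter.
rows = [
    {"SampleId": "S001", "Tissue": "left lung", "Concentration": "12.5"},
    {"SampleId": "S002", "Tissue": "liver", "Concentration": "8.75"},
]

buffer = io.StringIO()  # stands in for a file opened with encoding="utf-8"
write_tsv(rows, ["SampleId", "Tissue", "Concentration"], buffer)
tsv_text = buffer.getvalue()
```

When writing to a real file, opening it with newline="" and encoding="utf-8" keeps line endings and characters consistent across platforms.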
File size and transfer
QCIF Bioinformatics currently has no file size limit, but the size of the file determines the method of delivery.
- Files up to 100 gigabytes in size can be transferred using CloudStor or CloudStor FileSender.
- Files larger than 100 gigabytes are best transferred by Secure File Transfer Protocol (SFTP), Secure Copy Protocol (SCP) or rsync. QCIF Bioinformatics can provide access to a server for SFTP, SCP or rsync transfers and instructions for depositing data.
If files are to be compressed, QCIF Bioinformatics recommends the gzip or zip utilities. Multiple small files should be compressed and stored in a tar archive or compressed folder so that only one file is transferred; one large file will upload much faster than many small files.
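Bundling many small files into one gzip-compressed tar archive could look like the sketch below, which uses Python's standard tarfile module. The file names are placeholders created in a temporary directory for the sake of the example.

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_files(paths, archive_path):
    """Store the given files in one gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            # arcname drops the directory part so the archive holds bare names.
            archive.add(path, arcname=Path(path).name)
    return archive_path


with tempfile.TemporaryDirectory() as work_dir:
    work = Path(work_dir)
    small_files = []
    for index in range(3):
        sample = work / f"sample_{index}.tsv"  # hypothetical small data file
        sample.write_text("SampleId\tValue\n", encoding="utf-8")
        small_files.append(sample)

    archive_path = bundle_files(small_files, work / "samples.tar.gz")

    # Read the archive back to confirm what it contains.
    with tarfile.open(archive_path, "r:gz") as archive:
        archived_names = sorted(archive.getnames())
```

The single samples.tar.gz file is then what gets transferred.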
Please contact us (firstname.lastname@example.org) regarding the most convenient method for data transfer.
File versions should be uniquely labelled by a date or an iterative version number. File versions using dates should have a yyyymmdd format in the file name.
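A date-stamped file name in the yyyymmdd style can be generated as in this small sketch. The helper name and the underscore between stem and date are assumptions, not a QCIF requirement.

```python
from datetime import date


def versioned_name(stem, extension, when):
    """Return a file name such as 'expression_20240307.tsv'."""
    return f"{stem}_{when.strftime('%Y%m%d')}.{extension}"


example_name = versioned_name("expression", "tsv", date(2024, 3, 7))
```

Names built this way sort chronologically when listed alphabetically.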
All file or data information should be restricted to the UTF-8 character set. The QCIF Bioinformatics Data Guidelines document provides additional information on this.
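One simple way to test whether the bytes of a file are valid UTF-8 is to attempt to decode them, as sketched here. The function name is hypothetical, and the check only confirms the encoding is valid, not that the text is what was intended.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word encoded two ways: only the first is valid UTF-8.
utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("latin-1")
```

For a file on disk, the bytes can be obtained with open(path, "rb").read().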
Bioinformatic data formats have specific metadata conventions that must be adhered to; traditionally, the metadata is defined at the top of a file, followed by the biological data.
Batch effects increase variation and can make analyses less reliable. The more information we have about your experiment, the more likely we are to be able to account and adjust for any batch effects.
We are potentially interested in knowing the following, if available: sample extraction measurement id, session, order and date; sequencing date, session, time, batch number, array number, spatial location, operator, equipment identifier, equipment calibration settings (if this changes); as well as quality control information.
QCIF Bioinformatics prefers to receive the raw data and do its own data transformation and normalisation. The client will inform QCIF Bioinformatics of any pre-processing performed such as cleaning, transformation, standardisation, or normalisation and the methodology used.
Please provide us with the reference genome, and version, and the annotation link or file, if appropriate.
The following additional restrictions are either required by statistical software or are sensible precautions when using it.
There is to be only one header line or row at the top and no summary statistics at the bottom.
Variable or field names in the headings are to contain only alphanumeric characters with no spaces and be in CamelCase style: e.g. “Gene name” becomes “GeneName”. A variable or field name should never begin with a number. Avoid using uppercase characters except to join words together or to distinguish program/place acronyms.
Use a single underscore (“_”), not a double underscore (“__”). Do not use special characters in file, folder, or variable names (e.g., - % ( ) # $ @ . , * & [ ] / \).
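The naming rules above can be applied mechanically. The following is a rough sketch, assuming that any run of non-alphanumeric characters marks a word boundary; the function name is hypothetical.

```python
import re


def to_camel_case(name):
    """Convert a free-text heading into a CamelCase variable name."""
    # Keep only runs of letters and digits; spaces, underscores and special
    # characters act as word separators and are dropped.
    words = re.findall(r"[A-Za-z0-9]+", name)
    camel = "".join(word[:1].upper() + word[1:] for word in words)
    if not camel:
        raise ValueError(f"no usable characters in heading: {name!r}")
    if camel[0].isdigit():
        raise ValueError(f"variable name would begin with a number: {camel!r}")
    return camel


gene_heading = to_camel_case("Gene name")
batch_heading = to_camel_case("sample_id (batch #2)")
```

Existing capitals inside a word, such as acronyms, are left untouched, so “RNA conc.” becomes “RNAConc”.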
Each sample or subject should be identified by a single, unique identifier known as a record number or key.
Record dates in DD/MM/YYYY format to avoid confusion.
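Dates recorded as DD/MM/YYYY can be checked by parsing them strictly, as in this sketch. The function name is an assumption; the parsing itself relies on the standard datetime module.

```python
from datetime import datetime


def parse_record_date(text):
    """Parse a DD/MM/YYYY string, raising ValueError if it is not a real date."""
    return datetime.strptime(text, "%d/%m/%Y").date()


parsed = parse_record_date("07/03/2024")  # 7 March 2024, not 3 July
```

Impossible dates such as 31/02/2024 raise a ValueError, which makes typing errors easy to find.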
Personal subject or patient information is to be removed or de-identified. A unique ID replaces the subject’s name and Age replaces date of birth. Addresses are to be replaced by a post code or GIS coordinates rounded to the nearest kilometre.
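A minimal sketch of this de-identification step is shown below. The field names, the SUBJ prefix and the sequential numbering are illustrative assumptions; sequential IDs are only appropriate when the order of the records does not itself identify anyone.

```python
from datetime import date


def age_in_years(date_of_birth, reference_date):
    """Whole years between date of birth and a reference date."""
    years = reference_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (
        date_of_birth.month,
        date_of_birth.day,
    ):
        years -= 1
    return years


def deidentify(records, reference_date):
    """Replace names with generated IDs and dates of birth with ages."""
    cleaned = []
    for index, record in enumerate(records, start=1):
        cleaned.append(
            {
                "SubjectId": f"SUBJ{index:04d}",  # hypothetical ID scheme
                "Age": age_in_years(record["DateOfBirth"], reference_date),
                "PostCode": record["PostCode"],  # street address is not carried over
            }
        )
    return cleaned


# Entirely made-up record for illustration.
subjects = [
    {"Name": "Jane Citizen", "DateOfBirth": date(1980, 6, 15), "PostCode": "4072"}
]
deidentified = deidentify(subjects, reference_date=date(2024, 6, 14))
```

The key linking each generated ID back to a name should be stored separately from the data that is shared.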
Each line or record should ideally represent data for a single sample or subject even if there are multiple fixed time points. This is not always possible especially if the data set contains differing times and numbers of times per sample or subject.
Most statistical software requires text data to be converted or coded into numerical data before analysis. For analysis not performed in R, almost all text (character, string, non-numeric) data needs to be coded as numbers so that the analytical software can read it. Exceptions might include variables such as Notes or Comments that are written in sentences and not easily classifiable.
Numerical data can be further defined as being continuous, categorical, ordinal, date, or time.
Columns containing multiple items of information, for example a listing of patient comorbidities, must be split up into a series of yes/no variables or else consolidated into one or a few summary variables.
Categories of more than five levels should be consolidated into categories with five or fewer levels to facilitate statistical analysis.
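Splitting a multi-item column into yes/no variables might be sketched as follows. The semicolon separator, the Has prefix and the sample records are assumptions, and an empty cell is treated here as "none recorded" rather than as missing, which may not suit every study.

```python
def split_multivalue(records, column, separator=";"):
    """Replace one multi-item text column with 1/0 presence columns."""
    item_sets = [
        {
            item.strip().lower()
            for item in record[column].split(separator)
            if item.strip()
        }
        for record in records
    ]
    # Every distinct item seen in any record becomes its own column.
    levels = sorted(set().union(*item_sets))

    result = []
    for record, items in zip(records, item_sets):
        row = {key: value for key, value in record.items() if key != column}
        for level in levels:
            # CamelCase column name, e.g. "heart disease" -> "HasHeartDisease".
            row["Has" + level.title().replace(" ", "")] = 1 if level in items else 0
        result.append(row)
    return result


# Made-up records for illustration.
patients = [
    {"SubjectId": "S1", "Comorbidities": "Diabetes; asthma"},
    {"SubjectId": "S2", "Comorbidities": ""},
    {"SubjectId": "S3", "Comorbidities": "Asthma"},
]
split_patients = split_multivalue(patients, "Comorbidities")
```

Each new column uses 1 for present and 0 for absent.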
Age should be converted into an AgeClass variable based on a common definition discovered by a literature review.
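Once cut-off ages have been chosen, deriving AgeClass is a simple binning step. In the sketch below the break points and labels are purely hypothetical placeholders; the real ones should come from the literature for the field in question.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the second, third and fourth classes.
AGE_BREAKS = [18, 40, 65]
# Hypothetical labels for codes 1 to 4, for use in the data dictionary.
AGE_LABELS = {1: "Under18", 2: "18To39", 3: "40To64", 4: "65Plus"}


def age_class(age):
    """Return an ordered class code, starting at 1, for an age in years."""
    return bisect_right(AGE_BREAKS, age) + 1


example_classes = [age_class(age) for age in (17, 18, 39, 65)]
```

Keeping the breaks in one list makes it easy to change the definition without touching the rest of the code.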
Each data set is to be accompanied by a variable explanatory tab or file known as a data dictionary, data definition, or metadata. For each variable, the data dictionary explains what the variable stands for (if abbreviated), type (numeric, date, or character) and what each numeric code represents. Additional information can be recorded depending on the size of the study.
QCIF Bioinformatics uses the following standardised data definition style:
- Gender (or Sex): Female – 0, Male – 1, Missing value – 9
- Success (or Agree or Correct etc.) variable: Yes – 1, No – 0, Missing value – 9
- Presence variable: Present – 1, Absent – 0, Missing value – 9
- Status variable: e.g. Alive – 1, Deceased – 0, Missing value – 9
- Ordered variable: e.g. Low – 1, Medium – 2, High – 3, Missing value – 9
- Conflicting variable: Negative – -1, Neutral – 0, Positive – 1, Missing value – 9
- Categories with nine or more levels instead use a missing value of 99, those with more than 99 levels use a missing value of 999 etc.
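Coding text values according to the definitions above could be sketched like this. The dictionaries mirror the listed codes; the function name and the choice to raise an error on unexpected text are assumptions.

```python
SEX_CODES = {"female": 0, "male": 1}
ORDERED_CODES = {"low": 1, "medium": 2, "high": 3}


def encode(value, codes, missing_code=9):
    """Map a text value to its numeric code, using missing_code for blanks."""
    if value is None or not value.strip():
        return missing_code
    # A KeyError here flags a value that the data dictionary does not define.
    return codes[value.strip().lower()]


encoded_sex = [encode(value, SEX_CODES) for value in ["Female", " Male ", ""]]
encoded_level = [encode(value, ORDERED_CODES) for value in ["High", None, "low"]]
```

Failing loudly on undefined values is deliberate: it surfaces miscoded entries instead of silently treating them as missing.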
Survey data missing values
There are a number of irregular coding possibilities that also need to be entered:
- The respondent answered the question incorrectly (Other).
- The question is only applicable for certain respondents (Not Applicable).
- The data has yet to arrive (Not Available).
- The respondent did not answer the question (Missing).
For coding these responses use your profession’s conventions but failing that the following convention is suggested:
- Other 6[6…]
- Not Applicable 7[7…]
- Not Available 8[8…]
- Missing 9[9…]
For example, categories with nine or more levels instead use an Other value of 66, those with more than 99 levels use an Other value of 666 etc.
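The repeated-digit convention can be generated from the number of levels in a variable. This sketch follows the thresholds literally as worded above (fewer than nine levels: one digit; nine to 99 levels: two digits; more than 99: three digits, and so on). The names are hypothetical, and whether a special code clashes with a real category code at the boundaries depends on how the levels are numbered, so that is worth checking for each variable.

```python
SPECIAL_DIGITS = {
    "Other": "6",
    "NotApplicable": "7",
    "NotAvailable": "8",
    "Missing": "9",
}


def code_width(level_count):
    """Number of repeated digits to use for a variable with this many levels."""
    if level_count < 9:
        return 1
    width = 2
    limit = 99
    while level_count > limit:
        width += 1
        limit = limit * 10 + 9
    return width


def special_codes(level_count):
    """Return the Other/Not Applicable/Not Available/Missing codes."""
    width = code_width(level_count)
    return {label: int(digit * width) for label, digit in SPECIAL_DIGITS.items()}


codes_for_small = special_codes(3)
codes_for_medium = special_codes(12)
codes_for_large = special_codes(150)
```

The resulting codes belong in the data dictionary alongside the ordinary category codes.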
For all other missing values use the software’s default for missing values. An empty cell may not necessarily be converted or identified as a missing value. The default missing value for R is “NA”.
Data validation and cleaning
These are some simple checks for any formatting or data entry errors in your data:
- Search for missing or duplicated unique identifiers.
- Change the display format of date columns; incorrectly typed dates may then stand out.
- Check the ranges of percentages and ratios for inappropriate entries.
- Check for decimal places in supposed integers.
- Check for notation such as “< 5 days”, which will be rejected because it is not numeric. Decide on a procedure for converting such entries, for example to 2.5 or 4 days.
- Depending on the size of the data, randomly audit 1% to 10% of records looking for systematic errors.
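Several of the checks above can be automated. The following is a minimal sketch using only the Python standard library; all function names are hypothetical, values are assumed to arrive as text as they would from a .tsv file, and the default range of 0 to 100 assumes percentages.

```python
import random
from collections import Counter


def find_id_problems(ids):
    """Report row positions with missing IDs and IDs that occur more than once."""
    cleaned = [(identifier or "").strip() for identifier in ids]
    counts = Counter(identifier for identifier in cleaned if identifier)
    return {
        "missing_rows": [i for i, identifier in enumerate(cleaned) if not identifier],
        "duplicates": sorted(i for i, n in counts.items() if n > 1),
    }


def non_numeric(values):
    """Return entries that cannot be read as numbers, e.g. '< 5days'."""
    bad = []
    for value in values:
        try:
            float(value)
        except ValueError:
            bad.append(value)
    return bad


def out_of_range(values, low=0.0, high=100.0):
    """Return numeric entries outside [low, high]; run non_numeric first."""
    return [value for value in values if not low <= float(value) <= high]


def non_integers(values):
    """Return numeric entries with a fractional part in a supposed integer column."""
    return [value for value in values if not float(value).is_integer()]


def audit_sample(records, fraction, seed=0):
    """Pick a reproducible random subset of records for manual auditing."""
    size = min(len(records), max(1, round(len(records) * fraction)))
    return random.Random(seed).sample(records, size)


id_report = find_id_problems(["S1", "S2", "", "S2"])
bad_durations = non_numeric(["3", "< 5days", "4.5"])
bad_percentages = out_of_range(["12.5", "101", "-3"])
bad_counts = non_integers(["2", "3.5", "4.0"])
audit_rows = audit_sample(list(range(200)), fraction=0.05)
```

The fixed seed makes the audit sample repeatable, so the same records can be re-examined later.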
Quality Control (QC)
Perform statistical summaries
- Frequency tables for each categorical variable.
- Histogram plots for each continuous variable and date.
- Summary statistics on each numerical variable including count, minimum, median, mean, maximum, and standard deviation.
These will reveal any miscoding and outliers.
Statistical summaries should not be mixed in with the data but derived and stored elsewhere.
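The frequency tables and summary statistics described in this section can be produced with the standard library alone, as in this sketch. The function names and example values are hypothetical; histograms would need a plotting library and are not shown.

```python
import statistics
from collections import Counter


def frequency_table(values):
    """Count how often each category code occurs."""
    return dict(Counter(values))


def numeric_summary(values):
    """Basic summary statistics; needs at least two values for the standard deviation."""
    return {
        "count": len(values),
        "minimum": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "maximum": max(values),
        "standard_deviation": statistics.stdev(values),  # sample standard deviation
    }


# Made-up values for illustration.
sex_codes = [0, 1, 1, 9, 0, 1]
measurements = [2.0, 4.0, 4.0, 5.0, 10.0]

sex_frequencies = frequency_table(sex_codes)
measurement_summary = numeric_summary(measurements)
```

Writing these results to a separate file keeps the summaries out of the data itself.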