Database Guidelines

Data capture and collection are, for the most part, the responsibility of the client (student or researcher). The exception is when the consulting unit hosts an online questionnaire, in which case the data is captured automatically by the service.

The service does not retain any rights to that data, or to data submitted for analysis, and confidentiality is guaranteed.

Data is not used beyond the termination of services, except by special permission from the researcher in question, or in the case of continued joint research where the unit plays a collaborative role.

Database management is of utmost importance. Although data cleaning, validation and manipulation are offered as an additional service, databases submitted for analysis are expected to conform to the prerequisites outlined below, unless otherwise arranged with the consultant in advance. Templates are available on request.

Database Guidelines:

Databases are to be submitted in Excel, text or any equivalent format (e.g. comma- or tab-delimited files, or SPSS/Statistica/Stata database files).

Codebooks should be provided with the data sets.

When capturing data from hard copies, it is generally useful to enter the data in Excel first, before importing it into a statistical software package. The following guidelines apply to data capture in Excel.

  • All information should be entered into a single sheet
  • Any other information (such as graphs or initial calculations) should be removed from this sheet
  • Each row should correspond to a single unit (e.g. a person or an animal) on which you have made observations
  • Each unit should have a unique identifier (which in some way corresponds to the hard copy)
  • Each column should correspond to a measured variable/field
  • Both the database and the variables should have meaningful names (e.g. “projectname_date.xlsx” rather than “mydata.xlsx”, and “Gender” rather than “var1”).
  • Variables (columns) should only be named using the FIRST row/header.
  • Variable names should be as short as possible, and restricted to one cell (i.e. do not merge across cells). Additionally, underscores are preferred to blank spaces between words.
  • Do not use punctuation or special characters in variable names (e.g. apostrophes, inverted commas, accents)
  • Missing information should be represented using a blank cell
  • Individual cells should contain either numerical or text information. Do not include units in the cell contents; rather, include them in the variable name or in the codebook that should accompany your data.
  • Dates should be entered consistently: they may be entered as DD-MM-YYYY, as YYYY-MM-DD, or written out (e.g. 25 January 2011), provided that ALL dates use the same format.
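
Several of the guidelines above can be checked automatically before a database is submitted. The following Python sketch is one possible way to do so, assuming the sheet has been exported as a comma-delimited file. The function names, the naming rule (letters, digits and underscores only) and the date patterns are illustrative assumptions, not the unit's own checks.

```python
import csv
import io
import re

# Assumed naming rule: starts with a letter; letters, digits, underscores only.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

# Assumed patterns for the three date styles mentioned in the guidelines.
DATE_PATTERNS = {
    "DD-MM-YYYY": re.compile(r"^\d{2}-\d{2}-\d{4}$"),
    "YYYY-MM-DD": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "D Month YYYY": re.compile(r"^\d{1,2} [A-Za-z]+ \d{4}$"),
}


def date_style(value):
    """Return the name of the first matching date pattern, or None."""
    for name, pattern in DATE_PATTERNS.items():
        if pattern.match(value):
            return name
    return None


def check_sheet(text, id_column, date_columns=()):
    """Return a list of problems found in comma-delimited text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, records = rows[0], rows[1:]
    problems = []

    # Variable names live in the first row only and must follow the rule.
    for name in header:
        if not NAME_PATTERN.match(name):
            problems.append(f"bad variable name: {name!r}")
    if id_column not in header:
        problems.append(f"identifier column {id_column!r} not found")
        return problems

    # Every unit (row) needs a unique, non-blank identifier.
    id_index = header.index(id_column)
    seen = set()
    for line_no, record in enumerate(records, start=2):
        if len(record) != len(header):
            problems.append(f"row {line_no}: wrong number of cells")
            continue
        ident = record[id_index]
        if ident == "":
            problems.append(f"row {line_no}: missing identifier")
        elif ident in seen:
            problems.append(f"row {line_no}: duplicate identifier {ident!r}")
        seen.add(ident)

    # All non-blank dates within a column must share one style.
    for column in date_columns:
        if column not in header:
            problems.append(f"date column {column!r} not found")
            continue
        index = header.index(column)
        styles = {
            date_style(r[index])
            for r in records
            if len(r) == len(header) and r[index] != ""
        }
        if None in styles:
            problems.append(f"unrecognised date in {column!r}")
        if len(styles - {None}) > 1:
            problems.append(f"inconsistent date formats in {column!r}")
    return problems
```

Calling `check_sheet(text, "ID", ["Visit_Date"])` on an exported sheet returns an empty list when no problems are found. Blank cells are treated as missing values and are not reported.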

Codebook Guidelines:

Codebooks should contain information on the following:

  • Variable name and description
  • Data type (numeric/text etc.)
  • Units/Range
  • Codes (labelled categories for variables which require them, and the corresponding values, e.g. for a Likert scale: “0-Strongly Disagree”, “1-Disagree”, “2-Neutral”, “3-Agree”, “4-Strongly Agree”)
  • Any and all calculations used to derive “created” variables
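
A codebook with the fields listed above can also be kept in machine-readable form, which makes it possible to check captured values against it. The Python sketch below illustrates the idea. The variable names `Age` and `Satisfaction`, their descriptions and the age range are hypothetical; only the Likert codes are taken from the example above.

```python
# Hypothetical codebook: one entry per variable, holding the fields
# recommended above (description, data type, units/range, codes).
CODEBOOK = {
    "Age": {
        "description": "Age of participant (hypothetical variable)",
        "type": "numeric",
        "units": "years",
        "range": (0, 120),  # assumed plausible range
    },
    "Satisfaction": {
        "description": "Response on a Likert scale (hypothetical variable)",
        "type": "numeric",
        "codes": {
            0: "Strongly Disagree",
            1: "Disagree",
            2: "Neutral",
            3: "Agree",
            4: "Strongly Agree",
        },
    },
}


def check_value(variable, raw):
    """Return None if the raw cell text is acceptable, else a message."""
    entry = CODEBOOK[variable]
    if raw == "":
        return None  # a blank cell represents missing information
    if entry["type"] == "numeric":
        try:
            value = float(raw)
        except ValueError:
            return f"{variable}: {raw!r} is not numeric"
        if "codes" in entry and value not in entry["codes"]:
            return f"{variable}: {raw!r} is not a listed code"
        if "range" in entry:
            low, high = entry["range"]
            if not low <= value <= high:
                return f"{variable}: {raw!r} is outside {low}-{high}"
    return None
```

For example, `check_value("Satisfaction", "3")` returns `None`, whereas `check_value("Satisfaction", "7")` returns a message, because 7 is not among the listed codes.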

The consultancy unit reserves the right to refuse to analyse databases not conforming to these specifications.

Database integrity:

Regardless of the software used to analyse the data, database integrity is maintained through the use of a scripted program. In this way, analyses can be re-run or checked at a later stage, either at the request of reviewers or if new information comes to light.

Copies of the original, unaltered database are preserved; however, it is recommended that clients maintain their own backup as well. For reasons of transparency and accountability, databases submitted for analysis are altered only by means of a script, so that changes can be tracked and verified.
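
As an illustration of this principle, the Python sketch below applies corrections to a copy of the data and records each change in a log, leaving the original untouched. It is a minimal sketch under stated assumptions rather than the unit's actual script: the row layout (a list of dictionaries with an `ID` key), the shape of a change record, and the function names `fingerprint` and `apply_changes` are all hypothetical.

```python
import copy
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest of the rows, serialised as canonical JSON."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def apply_changes(original, changes):
    """Apply scripted changes to a copy of the data.

    Returns the cleaned copy and a log of every change made; the
    original list of rows is never modified.
    """
    cleaned = copy.deepcopy(original)
    log = []
    for change in changes:
        for row in cleaned:
            if row["ID"] == change["id"]:
                old = row[change["column"]]
                row[change["column"]] = change["new"]
                log.append({
                    "id": change["id"],
                    "column": change["column"],
                    "old": old,
                    "new": change["new"],
                    "reason": change["reason"],
                })
    return cleaned, log


# Hypothetical data and one documented correction.
original = [{"ID": "1", "Gender": "f"}, {"ID": "2", "Gender": "M"}]
changes = [
    {"id": "1", "column": "Gender", "new": "F", "reason": "standardise case"},
]

before = fingerprint(original)
cleaned, log = apply_changes(original, changes)

# The original is unchanged, and the log records what was altered and why.
assert fingerprint(original) == before
print(log)
```

Because every alteration is expressed as data passed to a script, the same corrections can be re-applied to the preserved original at any time, and the fingerprint offers a simple check that the original has not been modified.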