Review Process

Data Extraction

Design a data extraction form, code study variables, pilot the form, double-extract with inter-rater reliability, and prepare data for synthesis.

By Angel Reyes · Last updated

Tools

  • Excel
  • REDCap
  • Covidence
  • Literature Matrix

Phase 3: Extract data from included studies

Data extraction is where a review becomes a dataset. Every variable you need in the synthesis — effect sizes, sample sizes, participant characteristics, intervention details, outcome definitions, risk of bias judgements — has to be pulled from the included full texts in a consistent, auditable way. A well-designed extraction form makes synthesis almost mechanical; a bad one makes synthesis impossible.

1. Design the extraction form around your question

The form is a direct translation of the variables your review needs to answer its question. Start from the PICO/PEO framework you set in the search strategy phase and map each element to concrete fields.

A minimum extraction form contains:

  • Study identifiers — author, year, citation, study ID, country, funding source, conflict of interest statement.
  • Methods — study design, setting, recruitment, randomisation method, blinding, follow-up duration, analysis type (intention-to-treat, per protocol).
  • Participants — n enrolled, n analysed, age, sex, ethnicity, inclusion criteria, baseline comparability.
  • Intervention / exposure — description, dose, duration, delivery mode, fidelity.
  • Comparator — description (usual care, placebo, alternative intervention).
  • Outcomes — definition, measurement instrument, time points, effect estimate, variance, missing data handling.
  • Risk of bias / quality — tool used (Cochrane RoB 2, ROBINS-I, JBI, Newcastle–Ottawa), judgements by domain.
  • Notes — ambiguities, correspondence with authors, translation notes.

For qualitative reviews, replace effect sizes with themes, participant quotes, and contextual codes as described in our integrative review guide.
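The field list above can be sketched as a structured record. This is a minimal illustration, not a required standard: the field names, the "NR" (not reported) code, and the `StudyRecord` class are all assumptions made for the example.

```python
# Minimal sketch of an extraction form as a structured record.
# Field names and the "NR" sentinel are illustrative choices.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StudyRecord:
    # Study identifiers
    study_id: str
    author: str
    year: int
    country: str = "NR"            # "NR" = not reported (never blank)
    funding: str = "NR"
    # Methods
    design: str = "NR"             # fixed vocabulary, e.g. "RCT", "cohort"
    followup_weeks: Optional[float] = None
    # Participants
    n_enrolled: Optional[int] = None
    n_analysed: Optional[int] = None
    # Notes
    notes: str = ""

rec = StudyRecord(study_id="S001", author="Smith", year=2022,
                  design="RCT", n_enrolled=124, n_analysed=119)
print(rec.design)  # RCT
```

Defining the form as a typed record (rather than free-text columns) forces the "never leave a cell blank" rule at entry time.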

2. Choose the right software

  • Microsoft Excel or Google Sheets — flexible, familiar, free or near-free. Best for small reviews (n < 50 included studies) with a disciplined naming convention. Lock structure with data validation; use a single row per study or one row per outcome.
  • REDCap — browser-based data capture with audit trail and user roles. Built for clinical research, ideal for medium-to-large reviews with multiple extractors.
  • Covidence — extraction module paired with dual-reviewer workflow and consensus resolution. Integrates with the screening phase so records flow in automatically.
  • DistillerSR — enterprise tool for HTA and policy-grade reviews with configurable forms and full audit trails.

Download the Data Extraction Form Template from the templates library as an Excel starting point.

3. Code the variables

Coding turns prose into structured data. Decide in advance how you will encode:

  • Study design — a fixed vocabulary (RCT, cluster-RCT, quasi-experimental, cohort, case-control, cross-sectional, qualitative, mixed).
  • Outcome measures — map heterogeneous scales to a common metric where possible (e.g., convert all quality-of-life scores to z-scores or SMDs).
  • Missing data — use distinct codes for "not reported," "not applicable," and "unclear." Never leave a cell blank.
  • Multiple arms or outcomes — long-format data (one row per arm/outcome) is almost always cleaner than wide-format for meta-analysis.

Write a codebook that defines every variable, its allowed values, and its extraction rule. The codebook is as important as the form itself.
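A codebook can itself be machine-checkable. The sketch below is illustrative: the variable names, codes, and the `validate` helper are assumptions for the example, not a standard schema.

```python
# A minimal machine-checkable codebook: each variable maps to its
# allowed values and its extraction rule. Codes are illustrative.
CODEBOOK = {
    "design": {
        "allowed": {"RCT", "cluster-RCT", "quasi-experimental", "cohort",
                    "case-control", "cross-sectional", "qualitative", "mixed"},
        "rule": "Code the design as reported; if unclear, flag for consensus.",
    },
    "missing_code": {
        "allowed": {"NR", "NA", "UNCLEAR"},   # never leave a cell blank
        "rule": "NR = not reported, NA = not applicable, UNCLEAR = ambiguous.",
    },
}

def validate(variable, value):
    """Return True if the value is permitted for this variable."""
    return value in CODEBOOK[variable]["allowed"]

print(validate("design", "RCT"))         # True
print(validate("design", "randomized"))  # False — not in the fixed vocabulary
```

Running every extracted row through such a check catches vocabulary drift between extractors before it reaches the synthesis.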

4. Pilot the form

Never extract the full set of studies before piloting the form. A typical pilot:

  1. Two reviewers independently extract 5–10 studies representing the diversity of your included set (different designs, different outcomes).
  2. Compare extractions field by field.
  3. Revise the form: add missing variables, split ambiguous ones, clarify rules in the codebook.
  4. Re-pilot on another 3–5 studies if the revisions were substantial.

First-draft forms routinely miss variables that reviewers turn out to need. Piloting is cheap; re-extracting 80 studies after the form changes is not.
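Step 2 of the pilot, comparing extractions field by field, can be automated. A minimal sketch, assuming each extraction is held as a dict (the field names and values below are hypothetical):

```python
# Compare two reviewers' pilot extractions of the same study and
# return the fields on which they disagree. Field names illustrative.
def compare_extractions(a, b):
    """List fields where the two extraction dicts differ (or one is missing)."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

reviewer_1 = {"design": "RCT", "n_analysed": 119, "followup_weeks": 12}
reviewer_2 = {"design": "RCT", "n_analysed": 121, "followup_weeks": 12}
print(compare_extractions(reviewer_1, reviewer_2))  # ['n_analysed']
```

Disagreement lists like this feed directly into the form revisions in step 3: a field that disagrees across many pilot studies is usually an ambiguous field, not two careless reviewers.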

5. Double-extract and measure agreement

For systematic reviews and meta-analyses, Cochrane requires dual independent extraction of at least the key outcome data. Two reviewers extract each study independently, blinded to each other's entries, then reconcile. Disagreements expose ambiguous forms, ambiguous source papers, or reviewer error — all fixable, all necessary to document.

Measure agreement on a subset using Cohen's κ for categorical variables (design, risk of bias judgements) and the intraclass correlation coefficient (ICC) for continuous variables (effect sizes, sample sizes). Benchmarks mirror those in screening: κ ≥ 0.70 for well-operationalised variables.
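Cohen's κ is simple enough to compute without a statistics package. A stdlib-only sketch for two reviewers' categorical judgements (the ratings below are made up for illustration):

```python
# Cohen's kappa for two reviewers' categorical judgements,
# e.g. risk-of-bias domain ratings. Pure stdlib.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["Low", "Low", "High", "Some", "Low", "High"]
b = ["Low", "Low", "High", "High", "Low", "High"]
print(round(cohens_kappa(a, b), 2))  # 0.71 — just at the benchmark
```

For the ICC on continuous variables, a dedicated package (e.g. `pingouin.intraclass_corr` in Python or `irr` in R) is more practical than hand-rolling the ANOVA decomposition.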

Log every disagreement, its resolution, and the resolving reviewer. A public audit trail is what distinguishes a systematic extraction from a casual read.

6. Handle missing and ambiguous data

  • Contact authors for missing outcome data, especially for meta-analyses. Give 14–28 days and log the request.
  • Impute standard deviations from standard errors, confidence intervals, or p-values only with an explicit, pre-registered rule (Cochrane Handbook Chapter 6).
  • Flag unclear reporting so the risk-of-bias assessment reflects it.
  • Never invent data. If a value cannot be extracted or calculated, record it as missing and discuss its impact in the synthesis.
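The standard-deviation imputations mentioned above reduce to short formulas (the Cochrane Handbook, Chapter 6, gives the derivations). The numbers below are illustrative, not from any real study:

```python
# Pre-registered imputation rules for a single group mean, as formulas.
import math

def sd_from_se(se, n):
    """SD = SE * sqrt(n), for the SE of a single group mean."""
    return se * math.sqrt(n)

def sd_from_ci(lower, upper, n, z=1.96):
    """SD from a 95% CI of a mean: SE = (upper - lower) / (2 * z)."""
    return (upper - lower) / (2 * z) * math.sqrt(n)

print(round(sd_from_se(0.12, 124), 2))  # 1.34
```

Whatever rule you use, record in the extraction notes which values were imputed and by which formula, so a sensitivity analysis can exclude them later.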

7. Prepare data for synthesis

Export the final extraction in long format (one row per study-outcome-time-arm combination) and attach a data dictionary. Your synthesis tool — R's metafor, Review Manager, NVivo, or a narrative matrix — will consume this export directly.

An example extraction table layout is shown below.

Study ID     | Design      | n (I) | n (C) | Outcome   | Effect | 95% CI         | RoB
Smith 2022   | RCT         | 124   | 121   | HbA1c (%) | -0.42  | -0.78 to -0.06 | Low
Lee 2023     | Cluster-RCT | 310   | 298   | HbA1c (%) | -0.28  | -0.55 to -0.01 | Some concerns
Oyewole 2024 | Quasi-exp   | 88    | 90    | HbA1c (%) | -0.51  | -0.97 to -0.05 | High
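Writing the long-format export needs nothing beyond the standard library. A sketch using the `csv` module, with illustrative column names and made-up values (one row per study-arm here; add outcome and time columns as your review requires):

```python
# Export the extraction in long format (one row per
# study-outcome-time-arm combination) as CSV. Columns illustrative.
import csv
import io

rows = [
    {"study_id": "Smith 2022", "arm": "I", "outcome": "HbA1c (%)",
     "time_weeks": 12, "n": 124, "mean": 7.1, "sd": 1.1},
    {"study_id": "Smith 2022", "arm": "C", "outcome": "HbA1c (%)",
     "time_weeks": 12, "n": 121, "mean": 7.5, "sd": 1.2},
]

buf = io.StringIO()  # swap for open("extraction_long.csv", "w", newline="")
writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

A file in exactly this shape loads into R's metafor via `read.csv` with no reshaping, which is the point of exporting long rather than wide.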

8. Common extraction pitfalls

  • Single extractor. Violates Cochrane guidance for systematic reviews.
  • No codebook. Makes the extraction unauditable.
  • Wide-format tables for meta-analysis. Painful to transform later.
  • Conflating "not reported" with "zero." Produces biased pooled estimates.
  • Skipping the pilot. Almost always costs more time than it saves.


Next phase

With extraction complete and double-checked, move to Phase 4 — Synthesis →