Inter-rater reliability of AMSTAR: is it dependent on the pair of reviewers?

Poster session 3


Tuesday 25 October 2016 - 10:30 to 11:00


All authors in correct order:

Wegewitz U¹, Weikert B¹, Fishta A¹, Jacobs A², Pieper D³
¹ Federal Institute for Occupational Safety and Health (BAuA), Germany
² The Federal Joint Committee (G-BA), Germany
³ Witten/Herdecke University, Germany
Presenting author: Dawid Pieper
Abstract
Background: A recent systematic review found AMSTAR (A MeaSurement Tool to Assess systematic Reviews), but not R(evised)-AMSTAR, to have good measurement properties, including inter-rater reliability. However, inter-rater reliability is typically assessed with only two reviewers and without information about their level of expertise, both of which may influence the results. This has not been investigated in prior studies in evidence-based health care.

Objectives: To examine differences in the inter-rater reliability of AMSTAR depending on the pair of reviewers.

Methods: We randomly sampled 16 systematic reviews (eight Cochrane Reviews and eight non-Cochrane reviews) from the field of occupational health via MEDLINE and CDSR. Following a calibration exercise with two systematic reviews, five reviewers independently applied AMSTAR and R-AMSTAR to all 16 systematic reviews. Responses were dichotomized ('yes' scores vs any other scores) and reliability measures were calculated using Holsti's method (r) and Cohen's kappa (κ) for all ten possible pairs of reviewers.
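The pairwise calculation described above can be sketched as follows. This is a minimal illustration, not the authors' actual analysis code: the reviewer labels and ratings are hypothetical, and real data would cover 16 reviews with 11 AMSTAR items each. Holsti's method here reduces to the proportion of items coded identically, and Cohen's κ corrects that agreement for chance.

```python
from itertools import combinations

def holsti_r(a, b):
    # Holsti's method for two raters: 2 * agreements / (n1 + n2),
    # i.e. the proportion of items coded identically.
    agree = sum(x == y for x, y in zip(a, b))
    return 2 * agree / (len(a) + len(b))

def cohens_kappa(a, b):
    # Cohen's kappa for dichotomized ratings (1 = 'yes', 0 = other):
    # chance-corrected agreement. Assumes p_e < 1 (ratings are not
    # constant for both raters).
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_yes_a, p_yes_b = sum(a) / n, sum(b) / n
    p_e = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical dichotomized ratings from five reviewers on the same items.
ratings = {
    "R1": [1, 1, 0, 1, 0, 1, 1, 0],
    "R2": [1, 0, 0, 1, 0, 1, 1, 0],
    "R3": [1, 1, 0, 1, 1, 1, 0, 0],
    "R4": [1, 1, 0, 0, 0, 1, 1, 0],
    "R5": [0, 1, 0, 1, 0, 1, 1, 1],
}

# Five reviewers yield C(5, 2) = 10 pairs.
for (n1, a), (n2, b) in combinations(ratings.items(), 2):
    print(f"{n1}-{n2}: r = {holsti_r(a, b):.2f}, kappa = {cohens_kappa(a, b):.2f}")
```

With equal numbers of items per rater, Holsti's r equals simple percentage agreement; κ is lower whenever chance agreement is substantial, which is why the two measures give different medians in the results below.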

Results: For AMSTAR, inter-rater reliability ranged from r = 0.83 to r = 0.98 (median r = 0.88) with Holsti's method and from κ = 0.55 to κ = 0.84 (median κ = 0.64) with Cohen's kappa; for R-AMSTAR, it ranged from r = 0.82 to r = 0.92 (median r = 0.87) and from κ = 0.60 to κ = 0.77 (median κ = 0.65). The same pair of reviewers yielded the highest inter-rater reliability for both instruments, regardless of the reliability measure. The pairwise Cohen's κ values showed a strong correlation between AMSTAR and R-AMSTAR (Spearman r = 0.68).

Conclusions: Inter-rater reliability varies considerably depending on the pair of reviewers. The range we observed for Cohen's κ mirrors the range reported for AMSTAR across several published studies. Conducting reliability studies with only one pair of reviewers may therefore not be sufficient. Future studies should include more reviewers and should probably also account for their level of expertise. Although we observed a wide range of values, our study also supports prior findings that AMSTAR has good inter-rater reliability.