RESEARCH ARTICLE Open Access
The reproducibility of psychiatric
evaluations of work disability: two reliability
and agreement studies
Regina Kunz1* , David Y. von Allmen1, Renato Marelli2,3, Ulrike Hoffmann-Richter4,5, Joerg Jeger6, Ralph Mager2,7,
Etienne Colomb8, Heinz J. Schaad9, Monica Bachmann1, Nicole Vogel1, Jason W. Busse10,11, Martin Eichhorn12,
Oskar Bänziger13, Thomas Zumbrunn1, Wout E. L. de Boer1 and Katrin Fischer14
Abstract
Background: Expert psychiatrists conducting work disability evaluations often disagree on work capacity (WC)
when assessing the same patient. More structured and standardised evaluations focusing on function could
improve agreement. The RELY studies aimed to establish the inter-rater reproducibility (reliability and agreement) of
‘functional evaluations’ in patients with mental disorders applying for disability benefits and to compare the effect
of limited versus intensive expert training on reproducibility.
Methods: We performed two multi-centre reproducibility studies on standardised functional WC evaluation (RELY 1
and 2). Trained psychiatrists interviewed 30 and 40 patients respectively and determined WC using the Instrument for
Functional Assessment in Psychiatry (IFAP). Three psychiatrists per patient estimated WC from videotaped evaluations.
We analysed reliability (intraclass correlation coefficients [ICC]) and agreement (‘standard error of measurement’ [SEM]
and proportions of comparisons within prespecified limits) between expert evaluations of WC. Our primary outcome
was WC in alternative work (WCalternative.work), 100–0%. Secondary outcomes were WC in last job (WClast.job), 100–0%;
patients’ perceived fairness of the evaluation, 10–0, higher is better; usefulness to psychiatrists.
Results: Inter-rater reliability for WCalternative.work was fair in RELY 1 (ICC 0.43; 95%CI 0.22–0.60) and RELY 2 (ICC 0.44;
0.25–0.59). Agreement was low in both studies: the ‘standard error of measurement’ for WCalternative.work was 24.6
percentage points (20.9–28.4) and 19.4 (16.9–22.0) respectively. Using a ‘maximum acceptable difference’ of 25
percentage points WCalternative.work between two experts, 61.6% of comparisons in RELY 1, and 73.6% of comparisons in
RELY 2 fell within these limits. Post-hoc secondary analysis for RELY 2 versus RELY 1 showed a significant change in
SEMalternative.work (− 5.2 percentage points WCalternative.work [95%CI − 9.7 to − 0.6]), and in the proportion of comparisons with
differences ≤ 25 percentage points WCalternative.work between two experts (p = 0.008). Patients perceived the functional
evaluation as fair (RELY 1: mean 8.0; RELY 2: 9.4), and psychiatrists perceived it as useful.
Conclusions: Evidence from non-randomised studies suggests that intensive training in functional evaluation may
increase agreement on WC between experts, but agreement still fell short of stakeholders’ expectations. Training did not alter reliability.
Isolated efforts in training psychiatrists may not suffice to reach the expected level of agreement. A societal discussion
about achievable goals and readiness to consider procedural changes in WC evaluations may deserve consideration.
Keywords: Disability evaluation, Work capacity evaluation, Return to work, Social security, Reproducibility of results,
Observer variation, Evidence-based medicine
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
* Correspondence: regina.kunz@usb.ch; https://www.unispital-basel.ch/ebim
1Department of Clinical Research, Evidence-based Insurance Medicine,
University of Basel, University Hospital, 4031 Basel, Switzerland
Full list of author information is available at the end of the article
Kunz et al. BMC Psychiatry          (2019) 19:205 
https://doi.org/10.1186/s12888-019-2171-y
Background
Western countries have social security systems in place
that provide wage replacement benefits to individuals
whose reduced health restricts or prevents them from
working [1]. Over the last decade, most countries of the
Organisation for Economic Co-operation and Develop-
ment (OECD) have reported escalating rates of disabled
workers, with current estimates ranging between four and
eight individuals per thousand of working age population
per year [2, 3]. In absolute terms, the annual number of
new recipients of disability benefits ranges between 16,000
individuals for Switzerland and 1,700,000 for the USA.
These numbers constitute a substantial economic chal-
lenge for society.
Many treating psychiatrists [3, 4] are engaged to perform
medical evaluations, aimed at clarifying functional capacity
of workers who claim inability to work due to illness or in-
jury. Work capacity (WC) evaluations integrate detailed in-
formation about patients’ jobs, their functioning at work,
residual ability to perform job-specific skills, and self-
perceived work ability. This process involves a number of
implicit and explicit judgements. The experts’ final judge-
ment is further influenced by their interaction with pa-
tients, personal experiences, training, personal and societal
norms and values [5]. This complexity calls for a rigorously
structured approach to medical evaluations, with clear
guidance on the process for acquiring and integrating
information.
Our research on the reproducibility of WC evaluations
evolved from widespread dissatisfaction with medical evalu-
ations in Switzerland, where two nationwide surveys
highlighted serious concerns regarding psychiatric evalua-
tions of WC [3, 6]. Respondents ranked the missing link be-
tween expert findings and their final judgement on work
incapacity as their top concern. Moreover, a systematic re-
view on work disability evaluations from 12 countries re-
vealed low reproducibility [7]. Almost all countries lacked
an evidence-based approach to address the complexity of
the task [8, 9]. We developed and piloted a functional
evaluation programme that was intended to close the gap
between health complaints and work limitations, and
thereby increase transparency and uniformity of WC evalu-
ations [10, 11].
Reproducibility is an umbrella term that encompasses
two related concepts [12, 13]. First, the reliability of a
‘measuring device’, which –in our context– means how well
expert judgements can distinguish patients with different
degrees of WC from each other, despite measurement errors.
Second, agreement, which assesses how close the
scores for repeated measurements (by the same or different
raters) are for the same individual, and therefore concerns
measurement error.
Good reproducibility, which encompasses both reli-
ability and agreement, is a prerequisite for implementing
a procedure in routine practice. We explored the effect
of standardised training in functional evaluation for psy-
chiatrists assessing the WC of patients reporting disabil-
ity due to mental illness. We focused on patients with
mental disorders, as this population is perceived as being
particularly vulnerable to subjectivity regarding the
evaluation of work disability [2, 14, 15].
Methods
Two major administrative governmental changes1 inter-
fered with our original research plan - a reproducibility
study followed by a randomised controlled trial (RCT) on
work disability evaluations based on usual practice versus
evaluations using functional evaluations [16]. We therefore
conducted two reproducibility studies in the same setting,
one based on limited training in functional evaluation with
delayed application in the study (RELY 1), the second pro-
viding intensive standardised and manualised training with
timely application [16] (RELY 2).
Study design and participants
We performed two multi-centre reproducibility studies,
RELY 1 and 2, using a partially crossed design in which
four expert psychiatrists (one interviewer, three video
raters) independently rated the WC of actual patients
claiming disability benefits (see study protocol [11],
Additional files 1 and 2 for detailed methodology). We
followed the Guidelines for Reporting Reliability and
Agreement Studies (GRRAS) [13].
In RELY 1, eligible psychiatrists performed disability eval-
uations commissioned by the National Disability Insurance
Scheme or the Swiss National Accident Insurance Fund
(Suva). Psychiatrist recruitment took place in five assess-
ment centres. Eligible patients had submitted an application
for disability benefits from the Zurich office of the National
Disability Insurer or from Suva, were fluent in German,
and were attending an independent psychiatric evaluation
for the first time. In line with routine procedures of the
commissioning organisation1, eligible patients were ran-
domly distributed among the assessment centres and allo-
cated to the next available interviewing psychiatrist. Patient
recruitment in the five disability assessment centres took
place between November 2013 and February 2015.
In RELY 2, all but one of the RELY 1 experts were recruited
as interviewers. The recruitment of new video raters was
carried out through Swiss Insurance Medicine, the pro-
fessional society of insurance medicine. Patient recruit-
ment for RELY 2 followed the procedures of RELY 1 and
took place between July 2015 and April 2016. To com-
pensate for the time loss in RELY 1 (see below), RELY 2
re-used 15 videos from RELY 1 that scored highest for
functional interviewing criteria [17].
Procedures
Our functional evaluation approach incorporated three tools
to systematically collect and document information for judg-
ing the patients’ work disability: (1) a semi-structured inter-
view about their work and self-perceived work limitations;
(2) concise descriptions of exemplary reference jobs for alter-
native work, and (3) a three-part instrument for document-
ing work-related limitations (ICF-based Instrument for
Functional Assessment in Psychiatry, IFAP 1 on mental
functions; IFAP 2a&b on functional capacities, based on [18]
(with further enhancements in RELY 2), IFAP 3a&b on over-
all WC, single-item scale from 100 to 0% WC, relating to
the patients’ last job [3a] and alternative work [3b]) [10, 11].
IFAP 1 and 2 will be reported elsewhere.
Formal training in RELY 1 included written mater-
ial, instructions on the use of IFAP [10] and three
training sessions with didactic presentations, inter-
active small group sessions [11] and individual prac-
tice between sessions. The governmental changes1 and
the enforced reorganization in the assessment centres
stalled RELY 1 with a mean training-to-rating delay
exceeding one year. We named RELY 1 the group
‘with limited training and delayed application’. The
rating psychiatrists in RELY 2 underwent an intensive
manualised training with expert calibration to the
IFAP rating rules and enhanced descriptions of refer-
ence jobs, and doubling of training hours followed by
timely implementation.
Assigning video raters randomly to patients ensured
concealed allocation and prevented rater-group member-
ship where the same raters repeatedly form a rating group
for a patient [19]. Video raters reviewed the material inde-
pendently, unaware of the other raters. Neither patients
nor psychiatrists were blinded. Interviewing psychiatrists
integrated the functional interview into their usual evalu-
ation which was videotaped. They completed IFAP ratings,
and summarized patients’ medical files for the rating psy-
chiatrists. Three psychiatric raters per patient viewed the
videos with medical summaries and job descriptions, and
completed the IFAP ratings. In total, four independent rat-
ings were generated for each patient.
Outcomes, data collection, analysis
The primary outcome was expert judgement of patients’
overall WC for alternative work (IFAP 3b) used by the in-
surers to calculate the patients’ benefits. Secondary out-
comes were WC for patients’ last job (IFAP 3a), experts’
certainty in their own judgements of WC (scale 0–10), pa-
tients’ perceived fairness of the evaluation (a 29-item
questionnaire [20, 21], see Additional file 3), including
general satisfaction with the evaluation (scale 0–10), and
experts’ perception of the functional evaluation (telephone
interviews, RELY 1; online survey, RELY 2).
We collected socio-demographic data on patients,
experts, patients’ mental disorder(s) [22] with impact
on WC and the experts’ judgement of the disorders’
severity (scale from 0 to 10). To establish patients’
main diagnosis, three of four psychiatrists had to code
the same diagnosis on the second digit level of ICD-
10 (i.e. F0, F1, etc.). Typicality was ascertained by
comparing study patients to patients in usual practice
with respect to six predefined criteria [11].
Observations that expert evaluations without stan-
dardised procedures typically achieve low reliability
(ICC or Kappa around 0.4) [8] informed our sample
size calculation. With a sample size of 30 in RELY 1,
a two-sided 95%CI around the intraclass correlation
coefficient (ICC) would extend ± 0.15 from the observed
ICC, assuming a true ICC value of 0.6 [23,
24]. The sample size of 40 in RELY 2 accounted for
the wider 95%CI observed in RELY 1.
We used descriptive statistics for continuous and categor-
ical data, plotting experts’ ratings of overall WC per patient
(‘last job’, ‘alternative work’) and counting patients with
maximum divergent WC ratings (i.e., ranging between 100
and 0%) [6]. Variance components (psychiatrists, patients,
residuals) underlying the ICC were determined using a lin-
ear mixed-effects model. We reported reliability by the ICC
variant measuring absolute agreement, ICCabs.agree [25]:
$$\mathrm{ICC}_{\text{abs.agree}} = \frac{\sigma^2_{\text{Patients}}}{\sigma^2_{\text{Patients}} + \sigma^2_{\text{Psychiatrists}} + \sigma^2_{\text{Residuals}}}$$
with $\sigma^2_{\text{Patients}}$ (between-patient variance), $\sigma^2_{\text{Psychiatrists}}$
(between-psychiatrist variance), and $\sigma^2_{\text{Residuals}}$ (residual
variance) as a value between 0 and 1. The linear mixed-effects
model used WC as response and crossed random
intercepts for patients and psychiatrists. An intercept
was fitted as the only fixed effect. Model-based paramet-
ric bootstrapping was used to estimate 95%CIs. We
interpreted the ICC as poor (ICC < 0.40), fair (0.40–
0.59), good (0.60–0.74) and excellent (≥ 0.75) [26].
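As an illustration of the ICC definition and the interpretation bands above, the following minimal Python sketch computes the ICC for absolute agreement from three variance components and assigns the verbal category. The function names and the variance components in the example are hypothetical and are not taken from Table 3; treating the bands as continuous cut-offs (for example, values between 0.59 and 0.60 counted as fair) is also an assumption.

```python
def icc_abs_agree(var_patients, var_psychiatrists, var_residuals):
    """ICC for absolute agreement: patient variance over total variance."""
    total = var_patients + var_psychiatrists + var_residuals
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_patients / total


def interpret_icc(icc):
    """Verbal category following the bands cited in the text [26].

    Assumption: the bands are applied as continuous cut-offs at 0.40, 0.60 and 0.75.
    """
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Hypothetical variance components, chosen for illustration only.
example_icc = icc_abs_agree(var_patients=450.0,
                            var_psychiatrists=100.0,
                            var_residuals=450.0)
print(round(example_icc, 2), interpret_icc(example_icc))
```

In this sketch, a larger between-patient variance relative to the rater and residual variance raises the ICC, which is why a more homogeneous patient sample can lower reliability even when raters agree more closely.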
For agreement, we report 1) standard error of meas-
urement (SEM) and 2) proportion of psychiatrist-by-
psychiatrist comparisons that stayed within a prespeci-
fied limit for the difference in WC [6, 12]. Agreement
parameters retain their actual scale of measurement
making clinical interpretations more accessible [13].
‘Standard error of measurement’ describes the psych-
iatrist variation in WC [12].
$$\mathrm{SEM}_{\text{agreement}} = \sqrt{\sigma^2_{\text{Psychiatrists}} + \sigma^2_{\text{Residuals}}}$$
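Because the total variance is the sum of the between-patient variance and the squared ‘standard error of measurement’, the ICC can also be written as 1 − SEM²/total variance. The short Python sketch below uses this identity as a consistency check on the values reported in the Results for ‘alternative work’ (SEM 24.6 with total variance 1060 in RELY 1; SEM 19.4 with total variance 669 in RELY 2). The function names are hypothetical, and rounding in the published values means the check is approximate.

```python
import math


def sem_agreement(var_psychiatrists, var_residuals):
    """Standard error of measurement for agreement, in percentage points WC."""
    return math.sqrt(var_psychiatrists + var_residuals)


def icc_from_sem(sem, total_variance):
    """ICC implied by an SEM and a total variance (ICC = 1 - SEM^2 / total)."""
    return 1.0 - sem ** 2 / total_variance


# Reported values for 'alternative work' (SEM, total variance).
icc_rely1 = icc_from_sem(24.6, 1060.0)
icc_rely2 = icc_from_sem(19.4, 669.0)
print(round(icc_rely1, 2), round(icc_rely2, 2))
```

Under these inputs the implied ICCs are close to the reported 0.43 and 0.44, which illustrates how the SEM can fall while the ICC stays nearly unchanged when total variance falls in parallel.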
To facilitate the clinical interpretation of the observed
‘standard error of measurement’, we calculated an ex-
pected value of ‘standard error of measurement’ [12]
based on the results of a recent survey [6] in which more
than 600 Swiss stakeholders from five interest groups (psy-
chiatrists, experts, lawyers, judges, insurers) expressed their
expectations on what constitutes a ‘maximum acceptable
difference’ (Table 1). Expected value of ‘standard error of
measurement’ is defined [12] as
$$\mathrm{SEM}_{\text{expected}} = \frac{\mathrm{MAD}}{1.96\,\sqrt{2}},$$
where MAD denotes the ‘maximum acceptable difference’
in WC ratings between any two raters (corresponding to the
‘smallest detectable change’ in de Vet 2006 [12]). We used
the upper limit of the interquartile range (IQR) of the ‘max-
imum acceptable difference’ determined by psychiatrists and
experts (25 percentage points, see Table 1) and by lawyers,
judges and insurers (20 percentage points). We used the
upper limit of the IQR rather than the median to indicate
that 75% of that stakeholder group considered higher differ-
ences in WC ratings as unacceptable. However, many in this
group felt that the ‘maximum acceptable difference’ should
be as low as 20, 15, or even 10 percentage points.
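The conversion from a ‘maximum acceptable difference’ to an expected ‘standard error of measurement’ can be sketched in a few lines of Python. The function name is hypothetical; the inputs of 25 and 20 percentage points are the upper IQR limits named in the text.

```python
import math


def sem_expected(max_acceptable_difference):
    """Expected SEM implied by a maximum acceptable difference (MAD).

    Implements SEM_expected = MAD / (1.96 * sqrt(2)).
    """
    return max_acceptable_difference / (1.96 * math.sqrt(2.0))


# Upper IQR limits: psychiatrists and experts (25), lawyers, judges, insurers (20).
for mad in (25.0, 20.0):
    print(mad, round(sem_expected(mad), 1))
```

With these inputs the sketch yields roughly 9.0 and 7.2 percentage points WC, the thresholds against which the observed ‘standard error of measurement’ is judged.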
For the stakeholders [6], observed ‘standard error of
measurement’ had to be smaller than 9.0 percentage
points WC (SEMexpected by psychiatrists and experts), and
7.2 percentage points WC (lawyers, judges, insurers).
2) Proportion of comparisons within a prespecified limit:
Comparing the ratings of all four psychiatrists per patient
with each other resulted in six comparisons per patient.
We calculated the proportion of comparisons that fell within
thresholds of ≤ 10, 15, 20, and up to 50 percentage points
WC. These thresholds were informed by a Swiss survey
[6] with over 600 stakeholders (lawyers, treating psychia-
trists, expert psychiatrists, social judges, insurers’ em-
ployees) who reported what degree of deviation of
assumed WC between two psychiatrists they would find –
at maximum – acceptable (Table 1, ‘maximum acceptable
difference’, reported as median and interquartile range
[IQR]). We used the upper limit of IQR as threshold (i.e.,
75% of respondents who approved only equal or lower dif-
ferences between two raters as acceptable) to determine
agreement between study psychiatrists at different levels
of stakeholder expectations.
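The pairwise comparison described above can be sketched as follows: four ratings per patient give six rater pairs, and the proportion of pairs whose absolute difference stays within a threshold is pooled across patients. This is a minimal illustration, not the authors' analysis code; the function name and the ratings are hypothetical, and the use of ‘≤’ for the threshold follows the wording of the Methods.

```python
from itertools import combinations


def proportion_within(ratings_by_patient, threshold):
    """Share of rater pairs whose WC ratings differ by at most `threshold`."""
    within = 0
    total = 0
    for ratings in ratings_by_patient:
        # Four ratings per patient yield six unordered pairs.
        for a, b in combinations(ratings, 2):
            total += 1
            if abs(a - b) <= threshold:
                within += 1
    if total == 0:
        raise ValueError("no comparisons available")
    return within / total


# Hypothetical WC ratings (percent) from four psychiatrists for two patients.
hypothetical_ratings = [[50, 70, 80, 100], [0, 0, 20, 30]]
for limit in (10, 15, 20, 25, 50):
    print(limit, round(proportion_within(hypothetical_ratings, limit), 2))
```

Repeating the calculation over a range of thresholds gives the kind of curve shown in Fig. 2, where each threshold corresponds to a different level of stakeholder expectation.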
To test whether the psychiatrists systematically differ in
their ratings, we formulated two mixed-effects models. The
null model consists of percentage WC as the response vari-
able, an intercept as the single fixed effect, and a random
intercept for the claimants. The alternative model includes
crossed random intercepts for claimants and psychiatrists.
A likelihood ratio test was performed to test whether allow-
ing for a separate variance component for the psychiatrists
significantly improved model fit. For each test, we reported
χ2-statistic and associated p-value using Satterthwaite’s ap-
proximation of degrees of freedom [27].
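The logic of the likelihood ratio test can be sketched in Python as below. This is a simplified illustration under stated assumptions: the log-likelihoods are hypothetical inputs that would come from fitting the null and alternative mixed-effects models, and the statistic is referred to a χ² distribution with one degree of freedom, which does not reproduce the degrees-of-freedom approximation cited in the text. Because a variance component is tested at the boundary of its parameter space, a p-value obtained this way tends to be conservative.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt):
    """Likelihood ratio statistic and p-value for one extra variance component.

    Assumption: the statistic is compared with a chi-square distribution
    with 1 degree of freedom, whose upper tail is erfc(sqrt(x / 2)).
    """
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods of the null and alternative models.
stat, p_value = likelihood_ratio_test(loglik_null=-140.0, loglik_alt=-137.0)
print(round(stat, 2), p_value)
```

A small p-value in this sketch would indicate that allowing psychiatrists their own variance component improves model fit, i.e. that some raters are systematically stricter than others.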
Comparing RELY 1 and RELY 2
RELY 1 and 2 can be conceptualized as two treatment arms
of a non-randomised comparative study, with psychiatrists in
RELY 1 resembling the control group, having received limited
training but probably having suffered substantial knowledge
decay due to the one-year delay in starting the study. Those
in RELY 2 resemble the intervention group with intensive
training in functional evaluation and timely application in
the study as planned. Both studies had recruited psychiatrists
and patients from the same population, patients had received
the same procedures and had been rated using the same
reporting instrument. We used these similarities to justify
post-hoc analyses comparing RELY 1 and 2 [28]. Intensive
calibration of experts was expected to decrease psychiatrist
variance and total variance, and to reduce maximum diver-
gent ratings among patients.
We used the linear mixed effects model to compare
RELY 1 and RELY 2 for difference in percentage WC (‘last
job’; ‘alternative work’) and ‘standard error of measurement’
(WCalternative.work). We used model-based parametric boot-
strapping analogous to estimating the 95%CI of the ICC.
Each pair of datasets was compared by fitting the linear
mixed-effects models described above and by calculating
the differences in percentage WC and ‘standard error of
measurement’ (RELY 2 minus RELY 1). The procedure was
repeated 9999 times.
Table 1 Inter-rater variability: Expectation of stakeholders. ‘Maximum acceptable difference’ in work capacity (WC) ratings between
two experts performing a psychiatric evaluation in the same patient [6]
What is the maximum difference in WC ratings that stakeholders would find acceptable when two experts independently assess the
same patient?
… in the current situation of performing evaluations, median difference (interquartile range, IQR):
Lawyers (n = 81): 15% (10–20%)
Psychiatrists (n = 242): 20% (10–25%)
Experts (n = 114): 20% (10–25%)
Judges (n = 47): 15% (10–20%)
Insurers (n = 108): 10% (10–20%)
Legend: WC: work capacity; % WC = absolute percentage points in work capacity
How to interpret this table?
• 75% of treating and expert psychiatrists felt that the ‘maximum acceptable difference’ in WC ratings between two experts should be 25% corresponding to the
upper limit of the IQR
• 75% of lawyers, judges and insurers and 50% of treating and expert psychiatrists felt that the ‘maximum acceptable difference’ in WC ratings between two
experts should be 20% WC corresponding to the upper limit of the IQR (jurists) or the median (psychiatrists)
• 50% of lawyers, judges and insurers felt that the ‘maximum acceptable difference’ in WC ratings between two experts should be 15% corresponding to
the median
• 25% of all stakeholders felt that the ‘maximum acceptable difference’ in WC ratings between two experts should be 10% corresponding to the lower
limit of the IQR
Patient and public involvement
To promote trust in our study, we assembled an obser-
ver group of stakeholders from patient organisations, the
legal profession (patient lawyers, academics, cantonal
courts and the Swiss Federal Supreme Court), profes-
sional medical societies and representatives of social se-
curity. The group met once a year for update and
discussion. Furthermore, we piloted a questionnaire on
perceived fairness focusing on comprehension, accept-
ance, and ease of use with 40 patients from one assess-
ment centre. We communicated study rationale and
design online (www.unispital-basel.ch/ebim/RELY).
Results
RELY 1-study
Of 160 potentially eligible patients, 109 met inclusion
criteria and 30 (28%) entered the RELY 1-study (Additional
file 4). Non-responder analysis showed no difference
for age (p = 0.65), but a greater number of
females among non-responders (p = 0.02). Twelve of
19 psychiatrists performed interviews, all performed
IFAP ratings. Table 2 describes psychiatrists and
patients.
Mean WC was 43.6% for ‘last job’ (95%CI 34.1–53.2%)
and 55.0% for ‘alternative work’ (95%CI 47.3–62.8%).
When judging WC for ‘last job’ and ‘alternative work’,
experts arrived at maximum divergent estimates in two
(2/30, 6.7%) and five (5/30, 16.7%) patients, respectively.
Although the WC ratings of the same patient varied
widely across psychiatrists (Fig. 1), psychiatrists were
highly certain that their own ratings reflected patients’
WC (rating scale: 7.4 points, mean, 95%CI 6.8–8.0 for
‘last job’ and 7.2 points, 95%CI 6.6–7.8 for ‘alternative
work’). The ratings for ‘last job’ showed that some psy-
chiatrists were systematically stricter than others (rater
effect, ‘last job’ p < 0.001, ‘alternative work’ p = 0.07).
Reliability and agreement
Table 3 provides variance estimates on the absolute and
relative contributions of three sources of variation - psy-
chiatrists, patients, residuals - to WC ratings, adding up
to a total variance of 1092 (‘last job’) and 1060 (‘alterna-
tive work’), respectively. Inter-rater reliability on WC
ratings was poor for ‘last job’ (ICC 0.38; 95%CI 0.19–
0.55) and fair for ‘alternative work’ (ICC 0.43; 95%CI
0.22–0.60).
Figure 2 shows the proportion of psychiatrist-by-
psychiatrist comparisons across a spectrum of varying
limits for ‘maximum acceptable difference’ in WC be-
tween two psychiatrists. With a difference of < 25 per-
centage points WC – the limit suggested by treating
psychiatrists and experts (Table 1) -, 61.6% of compari-
sons would fall within this prespecified limit.
Observed ‘standard error of measurement’ as a meas-
ure for agreement on WC was 26.0 percentage points
(95%CI 21.5–31.0) for ‘last job’ and 24.6 percentage
points (95%CI 20.9–28.4) for ‘alternative work’. Both re-
sults were larger than the expected ‘standard error of
measurement’ converted from the ‘maximum acceptable
difference’ that stakeholders considered appropriate
(9.0 for experts and psychiatrists; 7.2 for lawyers,
judges, insurers, Table 4).
RELY 2-study
Of 147 potentially eligible patients, 123 met inclusion
criteria and 25 entered the RELY 2-study, along with 15
RELY 1-patient videos (Additional file 5). Non-responder
analysis showed no difference for age (p = 0.09) or gender
(p = 0.34). Twenty-four new psychiatrists participated in
the study. Eleven RELY 1-psychiatrists performed the in-
terviews, and all psychiatrists performed IFAP ratings.
Table 2 provides characteristics of psychiatrists and
patients.
Mean WC was 46.3% for ‘last job’ (95%CI 39.9–52.6%)
and 62.9% for ‘alternative work’ (95%CI 57.7–68.0%).
Psychiatrists arrived at maximum divergent WC ratings
in two patients for ‘last job’ (2/40, 5%) and none for ‘al-
ternative work’. Again, WC ratings of the same patient
varied widely across psychiatrists (Fig. 3), even though
the psychiatrists were highly confident in their own rat-
ings (rating scale: 7.7 points, mean, 95%CI 7.3–8.1 for
‘last job’ and 7.4 points, 95%CI 7.0–7.9, for ‘alternative
work’). There was no rater effect (‘last job’, p = 0.07, ‘al-
ternative work’, p = 0.10).
Reliability and agreement
Table 3 provides variance estimates on the contributions
of the different sources of variance to WC ratings, add-
ing up to a total variance of 1064 (‘last job’) and 669 (‘al-
ternative work’), respectively. Inter-rater reliability on
WC (Table 3) was fair for ‘last job’ (ICCabs.agree 0.47;
95%CI 0.29–0.61) and for ‘alternative work’ (0.44; 95%CI
0.25–0.59).
Figure 2 shows the proportion of psychiatrist-by-
psychiatrist comparisons with difference in WC rat-
ing < 25 percentage points. Here, 73.6% of comparisons
would fall within this limit. ‘Standard error of measure-
ment’ was a difference in WC of 23.9 percentage points
(95%CI 20.8–27.0) for ‘last job’ and of 19.4 percentage
points (95%CI 16.9–22.0) for ‘alternative work’. Both
results were larger than the expected ‘standard error of
measurement’ converted from the ‘maximum acceptable
difference’ that stakeholders considered appropriate
(9.0 for experts and psychiatrists, 7.2 for lawyers,
judges, insurers, Table 4).
Comparing RELY 1 and 2
Sociodemographics
Psychiatrists and patients in RELY 1 resembled those
in RELY 2. RELY 1 and 2 patients showed no differ-
ence in WClast.job (43.6% versus 46.3%, 2.7% WC,
mean difference, 95%CI − 8.8 to 13.9), but a trend for
higher WCalternative.work (55.0% versus 62.9%, 7.9% WC,
95%CI − 1.1 to 17.1) in RELY 2.
Table 2 Characteristics of psychiatrists and patients, including the main diagnoses of
the patients’ mental disorder(s) with impact on work capacity. In RELY 1 (RELY 2), six (seven) patients had been assigned two main
diagnoses
Characteristic: RELY 1; RELY 2
Psychiatrists (RELY 1: n=19 (a); RELY 2: n=35 (b))
Age, 31–40/ 41–50/ 51–60/ > 60 years/ missing: 5/ 42/ 21/ 32/ 0% (c); 3/ 40/ 31/ 20/ 6%
Gender, male: 79%; 83%
Experience
Years since board certification as psychiatrist, mean (SD): 15.6 (9.7); 15.8 (9.0)
Number of years performing disability evaluations, mean (SD): 13.8 (9.2); 12.4 (7.5)
Number of evaluations in the previous year, 0–4/ 5–20/ 21–50/ > 50/ missing: 0/ 10/ 32/ 58/ 0%; 6/ 17/ 31/ 40/ 6%
Time span from training to rating in days, mean (range): 404 days (115–578); 41 days (5–88)
Patients (RELY 1: n=30; RELY 2: n=40)
Age, years, mean (SD): 47.2 (8.6); 48.6 (10.1)
Gender, male: 57%; 53%
Marital status, unmarried/ married/ divorced/ missing: 20/ 40/ 40/ 0%; 20/ 28/ 45/ 8%
Nationality, Swiss/ others/ missing: 63/ 23/ 14%; 70/ 28/ 2%
Country of birth, Switzerland/ others/ missing: 67/ 27/ 6%; 75/ 23/ 2%
Severity of disorder (d), mean (SD): 5.3 (2.1); 4.9 (1.8)
Typicality of study patient compared to other patients seen by the expert, frequent/ semifrequent/ rare: 36/ 44/ 20%; 27/ 56/ 17%
Main diagnoses (ICD 10 classification), number of diagnoses RELY 1: n=36; RELY 2: n=47
Mood disorders (F3): 26%; 40%
Neurotic, stress-related, somatoform disorders (F4): 19%; 21%
thereof somatoform disorders (F45): 6%; 15%
Organic (F0): 11%; 9%
Disorders of adult personality and behaviour (F6): 11%; 6%
Psychoactive substance use (F1): 3%; 0%
Mental retardation (F7): 0%; 2%
Behavioural and emotional disorders with onset in childhood (F9): 0%; 2%
Patients without main diagnosis: 19%; 19%
a) Twelve out of 19 psychiatrists performed interviews, all performed ratings.
b) Eleven out of 35 psychiatrists performed interviews, all performed ratings.
c) Percentages are rounded to nearest whole numbers.
d) Scale from 0 to 10, higher score indicates more severe disorder.
Variances, reliability, and agreement (Table 3)
While 24% of variance in WC for ‘last job’ in RELY 1 was attributable
to the psychiatrists, more intensive standardisation
reduced this variation to 7% in RELY 2. The reliability of the
expert judgement on the patients’ WC did not change in any
of the four situations (RELY 1 and 2, last job and alternative
work), indicating that our training programme in
functional evaluation did not improve the reliability
of expert judgements (ICCabs.agree between 0.38 [poor]
and 0.47 [fair]), i.e., experts were not enabled to better
distinguish between individuals with higher and
those with lower remaining WC.
With regard to agreement between experts, the
proportion of psychiatrist-by-psychiatrist comparisons
that stayed below the prespecified threshold was
higher in RELY 2 for all thresholds (Fig. 2). For ex-
ample, at a threshold of 25 percentage points WC,
the proportion of comparisons within the ‘maximum
acceptable difference’ was 73.6% in RELY 2,
contrasted by 61.6% in RELY 1 (p = 0.008). The com-
parison of SEMalternative.work showed a significant
change by − 5.2 percentage points (95%CI − 9.7 to − 0.6,
Tables 3 and 4) in RELY 2.
Patients’ and psychiatrists’ perception of the functional
evaluation
Patients’ approval of the functional evaluation was high,
with scores of 8.0 points (mean, 95%CI 7.2–8.8) in RELY 1
and 9.4 (95%CI 9.1–9.7) in RELY 2 for ‘Overall perception
of fairness’ (see Additional file 3). Psychiatrists experienced
the functional evaluation as a valuable addition to their
current approach. RELY 2-psychiatrists reported a greater
focus on functional aspects (21/25, 84%) by integrating the
IFAP in their WC evaluations and acknowledged substan-
tial professional benefit from the training (96%, 24/25).
Fig. 1 Work capacity ratings in RELY 1. Thirty plots of the four psychiatrists’ ratings of the patients’ overall work capacity in their last job and in
alternative work for 30 patients (c01 to c30). The dots on the left in each cell indicate the psychiatrists’ ratings in relation to the patients’ last job
and the dots on the right indicate their ratings in relation to the patients’ alternative work. The lines linking the dots represent the changes in the
psychiatrists’ ratings. Each psychiatrist has a different colour. Red frames: patients with maximum divergent expert ratings, i.e. psychiatrists
disagreed with each other by 100% about the extent of work capacity. This was the case for two patients in relation to their last job, and for
five patients in relation to alternative work. For ‘alternative work’, one rating of patient 26 was excluded from the analysis due to a violation of the
rating rules
Discussion
Main findings
Two multi-centre real-life reproducibility studies (RELY
1 and 2) of expert psychiatrists assessing WC in patients
with mental disorders found that more intensive training
in functional evaluation of WC reduced variance but did
not change psychiatrists’ low ability to discriminate pa-
tients with different degrees of WC for alternative work
from each other. Post-hoc comparisons of RELY 1 and 2
indicated that intensive training achieved higher agree-
ment between experts for WCalternative.work ratings, albeit
improvements fell short of expectations. Patients perceived
the functional evaluation as fair, and psychiatrists
perceived it as a useful addition to their current practice
of work disability evaluation.
Strengths, limitations, challenges in design and
performance
Strengths of our studies include the use of real-life
disability evaluations with their heterogeneous mix of
typical patients, a broad spectrum of experts, and
calibration of experts and description of work de-
mands as reference. Despite clear differences in con-
cepts [12, 13, 25], both reproducibility parameters
‘reliability’ and ‘agreement’ are frequently used inter-
changeably in the literature. In our study, we analysed
these parameters separately.
We did not achieve the expected improvement in
reliability in RELY 2alternative.work. There, experts con-
sidered fewer patients as fully able or fully unable to
work in alternative work compared to RELY 1, and
consequently, almost all patients were attributed some
remaining WC. The reduction of patient variance in
RELY 2alternative.work indicates that patients were per-
ceived as more homogeneous than those in RELY 1.
However, the equal reduction of variance across all
variance components resulted in unaltered low discrimin-
ation of remaining WC across patients (ICCalternative.work
RELY 1 versus RELY 2: 0.43 versus 0.44, Table 5) [29].
This reflects reality: ‘It is more difficult to tell people apart
if they are relatively similar than if they are very different’
([25], Chapter 8).
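The arithmetic behind this observation can be checked with a short sketch. It is an illustration rather than the authors' analysis code: it assumes the agreement-type ICC is the patient variance divided by the sum of all three variance components (the general formula given in Table 5), and it uses the rounded ‘alternative work’ variance components reported in Table 3. The function and variable names are hypothetical.

```python
def icc_agreement(var_patients, var_psychiatrists, var_residual):
    """ICC (absolute agreement): patient variance as a share of total variance."""
    total = var_patients + var_psychiatrists + var_residual
    return var_patients / total


# Rounded variance components for 'alternative work', transcribed from Table 3.
rely1 = {"var_patients": 457, "var_psychiatrists": 88, "var_residual": 515}
rely2 = {"var_patients": 292, "var_psychiatrists": 50, "var_residual": 328}

icc1 = icc_agreement(**rely1)
icc2 = icc_agreement(**rely2)
print(f"RELY 1: {icc1:.2f}, RELY 2: {icc2:.2f}")

# Shrinking every component by the same factor leaves the ratio unchanged.
halved = {name: value / 2 for name, value in rely1.items()}
icc_halved = icc_agreement(**halved)
print(f"RELY 1 with all components halved: {icc_halved:.2f}")
```

Under these assumptions the two ratios round to 0.43 and 0.44, and halving all three components leaves the ratio unchanged, which is the point Table 5 makes with round numbers.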
Table 3 Reliability and agreement measures. Absolute and relative contributions of the different sources of variation to work capacity ratings: work capacity ratings, total variance and variance components (psychiatrists, patients, residuals), reliability and agreement parameters for ‘last job’ and ‘alternative work’ in RELY 1 and RELY 2

| Reference for WC | Study | N | WC mean (95%CI) | Total variance | Variance component: psychiatrists, absolute (relative) | Variance component: patients, absolute (relative) | Variance component: residuals, absolute (relative) | Reliability: ICCabs.agree (95%CI) | Agreement: proportion of comparisons between two psychiatrists whose ratings differed by no more than the ‘maximum acceptable difference’ of 25 percentage points WC | Agreement: ‘standard error of measurement’ (95%CI), in natural units | Agreement: ‘maximum acceptable difference’ (95%CI), in natural units |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Last job | RELY 1 | 120 | 43.6% (34.1–53.2) | 1092 | 263 (24%) | 414 (38%) | 415 (38%) | 0.38 (0.19–0.55) | 52.2% (94/180) | 26.0% WC (21.5–31.0) | 72.2% WC (59.5–86.0) |
| Last job | RELY 2 | 160 | 46.3% (39.9–52.6) | 1064 | 76 (7%) | 495 (47%) | 493 (46%) | 0.47 (0.29–0.61) | 61.7% (148/240) | 23.9% WC (20.8–27.0) | 66.1% WC (57.7–74.9) |
| Alternative work | RELY 1 | 119 | 55.0% (47.3–62.8) | 1060 | 88 (8%) | 457 (43%) | 515 (49%) | 0.43 (0.22–0.60) | 61.6% (109/177) | 24.6% WC (20.9–28.4) | 68.1% WC (57.9–78.8) |
| Alternative work | RELY 2 | 155 | 62.9% (57.7–68.0) | 669 | 50 (7%) | 292 (44%) | 328 (49%) | 0.44 (0.25–0.59) | 73.6% (170/231) | 19.4% WC (16.9–22.0) | 53.8% WC (46.8–61.0) |

Legend: WC: work capacity; % WC = absolute percentage points in work capacity; ICCabs.agree = intraclass correlation coefficient (agreement variant); CI: confidence interval
Agreement provides information about the measurement
error of an instrument. Here, the ‘instrument’ of functional
evaluation was the experts themselves, with their presumed
ability to elicit the relevant information from patients,
their use of suitable instruments, their understanding of
work demands, and their skill in turning the compiled
information into a reasoned judgement on WC.
Intensive manualised training improved expert agreement,
but agreement remained low, indicating that the
measurement error of functional evaluation with limited
and with intensive training exceeded any maximum acceptable
disagreement [12]. The high measurement error, far exceeding
patient variance, contributed directly to low reliability
(Streiner 2014, chapter 8 [25]).
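As a rough check on the size of this measurement error, the sketch below recomputes the ‘standard error of measurement’ from the variance components in Table 3. It assumes the agreement-type definition described by de Vet et al. [12], namely the square root of psychiatrist variance plus residual variance. That formula is not spelled out in this section, so it should be read as an assumption that reproduces the published values after rounding. All names are hypothetical.

```python
import math


def sem_agreement(var_psychiatrists, var_residual):
    """Agreement-type standard error of measurement, in WC percentage points."""
    return math.sqrt(var_psychiatrists + var_residual)


# (psychiatrist variance, residual variance), transcribed from Table 3.
components = {
    ("last job", "RELY 1"): (263, 415),
    ("last job", "RELY 2"): (76, 493),
    ("alternative work", "RELY 1"): (88, 515),
    ("alternative work", "RELY 2"): (50, 328),
}

sem = {key: sem_agreement(*value) for key, value in components.items()}
for (reference, study), value in sem.items():
    print(f"{reference:17s} {study}: {value:.1f} percentage points")
```

Because the inputs are the rounded components from the table, small deviations in the last digit from the published estimates would not be surprising.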
Studies on WC evaluations focus on reproducibility
without addressing validity [8]. Although validity is crucial
for credibility, it remains challenging to quantify work
(in-)capacity, a social notion with implicit societal values,
using psychometric methodology. Professional consensus
grounded in evidence or predictive validity may provide a
surrogate for validity. This assumption needs proof.
Psychiatrists consistently rated their confidence in their
own WC assessment as very high, despite the fact that
experts seeing the same patient often disagreed with
each other. This phenomenon suggests that individual
raters work with different frames of reference, as can
be seen with chronic pain: some clinicians believe
strongly that patients with (for example) fibromyalgia
will not be able to work, while others feel very
differently.
Prior evidence to inform our study design was very
limited [7, 8]: We lacked information about potential ef-
fect sizes, sources and extent of variations, the impact of
expert calibration on reproducibility, criteria to decide
Fig. 2 Agreement between experts for varying levels of ‘maximum acceptable difference’. This figure demonstrates the impact of varying limits for
‘maximum acceptable difference’ in WC ratings on the level of agreement. Agreement is defined as the proportion of comparisons (in percent, values
in the bars) for which the WC ratings of any two experts differ by no more than a prespecified limit, here the ‘maximum acceptable difference’.
We used the expectations from a recent survey among stakeholders to specify the limits for ‘maximum acceptable difference’ (see Table 1 [6]).
Illustrative examples from the stakeholder survey [6]: a Treating and expert psychiatrists defined 25 percentage points* in work capacity ratings
between two experts as the ‘maximum acceptable difference’. In RELY 1, 61.6% (109/177) of comparisons would fall within this limit versus 73.6% (170/231)
of comparisons in RELY 2. b Lawyers, judges and insurers defined 20 percentage points* in work capacity ratings between two experts as the ‘maximum
acceptable difference’. In RELY 1, 59.3% (105/177) of comparisons would fall within this limit versus 65.4% (151/231) of comparisons in RELY 2.
* upper limit of the interquartile range (see Table 1)
on the most appropriate outcome measure, trustworthy
data to feed the design, including power calculation for
reliability and agreement estimates.
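The agreement proportion reported in Fig. 2 and Table 3 can be illustrated with a minimal sketch. The ratings below are invented for illustration and are not study data, and the function name is hypothetical. The sketch counts, over all pairs of psychiatrists who rated the same patient, how many pairs differ by no more than a chosen limit (the inclusive reading used in Table 3).

```python
from itertools import combinations


def proportion_within_limit(ratings_by_patient, limit):
    """Count rater pairs per patient whose WC ratings differ by at most `limit`."""
    within = total = 0
    for ratings in ratings_by_patient:
        for a, b in combinations(ratings, 2):
            total += 1
            within += abs(a - b) <= limit
    return within, total


# Hypothetical WC ratings (percent) by four psychiatrists for three patients.
example = [
    [50, 60, 70, 50],
    [0, 100, 50, 40],
    [80, 80, 70, 100],
]

within, total = proportion_within_limit(example, limit=25)
print(f"{within}/{total} comparisons within 25 points ({100 * within / total:.1f}%)")
```

With four ratings per patient there are six pairs, so 30 patients give 180 comparisons, which matches the ‘last job’ denominator for RELY 1 in Table 3; excluded ratings reduce the count.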
The original research plan proposed a reliability study
on functional evaluation and WC judgements (RELY 1),
followed by a randomised comparison with current prac-
tice [16]. Since governmental changes stalled RELY 1 for
more than a year, the observed reproducibility reflects
the impact of short training in functional evaluation
without standardisation [7]. The reproducibility of ex-
perts without training may be comparably low or worse.
External factors interfered with the planned randomised
comparison. It remains untested whether patients and
lawyers would indeed have consented to allocation by chance,
rather than by ‘preferred option’, to either type of
disability evaluation. In case of objections, a non-randomised
comparison might have been the best alternative to test
the effectiveness of training on reproducibility. Since the
RELY studies lacked randomisation (‘low quality evidence’),
findings need to be interpreted with caution.
Some might argue that our study examines videos of
patients rather than actual patients. However, our design
purposefully mimics real-life disability assessments (see
Bachmann 2016 [11], Fig. 3; [8]) where training intensity
was balanced against feasibility for practicing psychia-
trists, and the functional evaluation was integrated in
individual conventional psychiatric interviews. Semi-structured
questions introduced mandatory themes
about work, but left space for open questions. However,
these elements facilitated heterogeneity in the raters’
interpretation and reduced the intended reproducibility.
This contrasts with lab-like designs that use highly standardised
video-recorded interviews and experienced interviewers
calibrated over longer periods in performing and rating
interviews; such designs achieve high reproducibility [8], but do
not mirror reality.
Our study focused on the psychiatrists’ evaluations, all
of which were part of multi-disciplinary WC evaluations.
The ultimate judgements of remaining work capacity
would have to integrate functional and WC evaluations
from other disciplines (e.g., musculoskeletal, neurological),
which adds challenges that were beyond the scope of our
study.
The complexity of WC evaluation brings about many
more sources of variation than we could address in our
study (Table 6) [5, 30]. We targeted the study to raise a low
ICC (around 0.4) to a fair to good level (ICC of 0.6). Sys-
tematic efforts will be required to identify and tackle add-
itional modifiable sources of variance in future research.
Comparisons to other studies
Systematic research on direct evaluation of WC is sparse
[7]. Our recent systematic review with a low threshold
for inclusion identified 16 reproducibility studies from
12 countries published over a period of 25 years [8].
Most studies were of low methodological quality, only
three studies were conducted with real patients, and
most reported only poor to fair reproducibility for work
disability, although exceptions existed [31, 32].
Implications for practice and policy
Was training in functional evaluation sufficient to tackle
the tasks of a medical expert? A critical revision would
include a review of training material, training intensity, and
duration, and documentation of success in expert calibration.
This requirement is analogous to medical training
Table 4 Expected versus observed agreement

a) Expected by stakeholders

| ‘Maximum acceptable difference’* | Corresponding ‘standard error of measurement’ |
|---|---|
| 25% WC | 9.0% WC |
| 20% WC | 7.2% WC |
| 15% WC | 5.4% WC |
| 10% WC | 3.6% WC |

b) Observed in the RELY studies

| Reference for WC | Study | ‘Standard error of measurement’ | Corresponding ‘maximum acceptable difference’ |
|---|---|---|---|
| Last job | RELY 1 | 26.0% WC | 72.2% WC |
| Last job | RELY 2 | 23.9% WC | 66.1% WC |
| Alternative job | RELY 1 | 24.6% WC | 68.1% WC |
| Alternative job | RELY 2 | 19.4% WC | 53.9% WC |

Legend: % WC = absolute percentage points in work capacity
* derived from the stakeholder survey (Table 1) [6]
This table compares the expectations of Swiss stakeholders of the agreement in WC ratings between two experts, expressed as ‘maximum acceptable difference’*,
with the agreement observed in the RELY studies, i.e., the variation between experts, expressed as ‘standard error of measurement’. Converting ‘maximum
acceptable difference’ into ‘standard error of measurement’ and vice versa allows comparison of the level of agreement
a) Agreement expected by stakeholders: Treating and expert psychiatrists considered a difference of 25% WC between two experts as the ‘maximum acceptable
difference’ (for example, expert A: 60% WC; expert B: 35% WC or 85% WC), which corresponds to a variation between experts of 9.0% WC ‘standard error of
measurement’
If the ‘maximum acceptable difference’ between two experts were only 15% WC (for example, expert A: 60% WC; expert B: 45% WC or 75% WC), the
corresponding variation between experts would be as low as 5.4% WC ‘standard error of measurement’
b) Agreement observed in the RELY studies: RELY 2 (last job) found a level of agreement of 23.9% WC ‘standard error of measurement’, which corresponds to a
(‘maximum acceptable’) difference in WC of 66.1% (for example, expert A: 30% WC; expert B: 96% WC)
where trainees acquire skills, e.g. in ultrasound imaging,
by performing hundreds of scans under supervision in
order to distinguish normal images from pathologies and
discriminate similar but different pathologies.
Current expert-based WC evaluations contain many
discretionary judgements that contribute to low repro-
ducibility. Standardising the process as done in RELY 2
appears to have some but not sufficient impact. Experts
have called for more tools to complement their func-
tional judgements [33], such as the Work Disability-
Functional Assessment Battery (WD-FAB [34, 35]) that
elicits self-reported behavioural and physical impair-
ments, or tests for mental or physical functional capaci-
ties [36]. The impact of these tests on the experts’ final
judgement and their agreement on WC would require
empirical testing.
More far-reaching approaches would restrict the physi-
cians’ role to their professional core competences: reporting
the impact of impaired health on the patients’ functional
capacities. Work capacity is legally defined as expected
Fig. 3 Work capacity ratings in RELY 2. Forty plots of the four psychiatrists’ ratings of the patients’ overall work capacity in their last job and in
alternative work for 40 patients (c01 to c40). Red frames: Psychiatrists disagreed with each other by 100% about the extent of work capacity for
two patients in their last job, and for no patient in relation to alternative work, which was the primary outcome. Patients with maximum
divergent ratings. For ‘alternative work’, all ratings of patient 19 and one rating of patient 23 were excluded from the analysis due to violations
of the rating rules
earning capacity in suitable work [37]. Most physicians
understand little about the diversity of modern work life,
specific job demands, and their interactions with functional
impairments. These tasks could be shifted to labour experts
and their specific expertise. Models exist in the Netherlands
[38], where medical experts establish the patients’ func-
tional profile and labour experts match potential jobs for
determining wage replacement. In Sweden and Denmark,
labour experts participate in interdisciplinary evaluation
teams [39].
What level of variation is acceptable for WC evalu-
ation (Figs. 1, 3, Table 1)? Insurers who commission
evaluations expressed lowest tolerance for ‘maximum
acceptable differences’ between experts, while psychia-
trists who perform the evaluations showed the highest
tolerance [6], albeit tolerance was half the variation
observed in RELY. While it is crucial to reduce variation,
it is equally important for insurers to align
their expectations with reality and abandon hopes
of a precision in WC ratings that evaluations are
unlikely to provide even with improved methods.
The acceptable level of variation in WC evaluation is a
social policy issue that requires a societal discussion,
addressing the balance between the principle of fairness
(‘similar treatment for similar cases’) and the
principle of treating each case individually, which implies
discretionary expert judgements and highly variable
WC ratings across cases. Our stakeholder survey
demonstrates a strong preference for fair and equal
evaluations.
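To make the gap between tolerated and observed variation concrete, the following sketch converts between a ‘standard error of measurement’ and the corresponding ‘maximum acceptable difference’ between two experts. The conversion factor of 1.96 × √2 (about 2.77) is an assumption: it is not stated in this section, but it reproduces the stakeholder column of Table 4. Function names are hypothetical, and because the input is the rounded standard error from Table 3, the last digit of the result can differ from the published value.

```python
import math

# Assumed conversion factor (not stated in the text): 1.96 * sqrt(2), about 2.77.
FACTOR = 1.96 * math.sqrt(2)


def sem_to_max_difference(sem):
    """Difference between two experts implied by a given SEM (percentage points)."""
    return FACTOR * sem


def max_difference_to_sem(limit):
    """SEM that would be needed to stay within a given difference limit."""
    return limit / FACTOR


# Limits named by stakeholders (Table 4, part a).
stakeholder_limits = [25, 20, 15, 10]
implied_sem = {limit: max_difference_to_sem(limit) for limit in stakeholder_limits}
for limit, value in implied_sem.items():
    print(f"limit {limit} points -> SEM of about {value:.1f} points")

# Observed SEM for 'alternative work' in RELY 2 (Table 3, rounded).
implied_difference = sem_to_max_difference(19.4)
print(f"observed SEM 19.4 -> difference of about {implied_difference:.1f} points")
```

Read this way, even the most permissive stakeholder limit of 25 percentage points would call for a standard error of roughly 9 points, less than half of the smallest value observed in the RELY studies.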
What level of agreement would be required to
reach these objectives? Widely accepted guidance for
evaluating psychological tests [40] requires reliabilities
of 0.9 for decisions on individuals. In contrast, clinical
guidance acknowledges that purpose and conse-
quences of scores determine how much error should
be allowed in clinical decision-making [25]. While the
functional evaluation uses instruments such as IFAP
to ascertain the patients’ functional capacities, the
translation from functional capacities to WC is a
judgement at the experts’ discretion, not a measure-
ment. Judgements will never reach the same level of
Table 6 Sources of variation. Potential factors for the three sources of variation (psychiatrists, patients, residuals) which may
contribute to the variance in overall WC ratings (modified from [5, 30])
Source of variation Factors that may impact on the variance of overall work capacity
Psychiatrists • Experience in disability evaluation
• Knowledge about previous work
• Structuring and prioritizing of information
• Psychiatrists’ idiosyncrasies (e.g. leniency/strictness)
Patients • Socio-demographic features
• Diagnosis, severity of disorder
• Compliance, including malingering
• Skills in presenting their case
• Symptom exaggeration
Residuals • Interaction psychiatrists*patients
• Interaction patient*last job; patient*‘alternative work’
External factors:
• Changes in legislation with impact on medical evaluations
• Interferences of legal demands with medical judgements
• Turn-over of staff involved in the studies
• Overall attitude in society towards disability
Table 5 Interaction of various sources of variance on reliability
Illustration of the interaction of various sources of variance and their impact on the reliability measure ICC.
General formula for ICCabs.agree [25]:
\[
\mathrm{ICC}_{\mathrm{abs.agree}} = \frac{\sigma^2_{\mathrm{Patients}}}{\sigma^2_{\mathrm{Patients}} + \sigma^2_{\mathrm{Psychiatrists}} + \sigma^2_{\mathrm{Residuals}}}
\]
Example 1 - Analogy to the situation observed in RELY 1: the ICC is calculated based on a patient variance of 500, a psychiatrist variance of 100
and a large residual (unexplained) variance of 500.
\[
\mathrm{ICC} = \frac{500}{500 + 100 + 500} = 0.45
\]
which corresponds to a fair discrimination of patients [26]
Example 2 - Analogy to the situation observed in RELY 2: The ICC is calculated with a patient variance of 250, a psychiatrist variance of 50 and a
large residual (unexplained) variance of 250.
\[
\mathrm{ICC} = \frac{250}{250 + 50 + 250} = 0.45
\]
which corresponds to a fair discrimination of patients (equal to example 1)
Despite reduction of total variance, the proportionate reduction of variance across all sources of variance results in an ICC of 0.45 identical to
example 1. Despite reduction of variance by half, the ability to discriminate patients in their ability to work did not change.
Example 3 - Typical situation for a reliable instrument: Most variance is explained by patient variance, with little psychiatrist variance and residual
variance: patient variance of 500, psychiatrist variance of 25, and residual variance of 75. As a result, expert variance and residual variance contribute
little to the total variance, indicating low measurement error. This allows excellent discrimination among patients.
\[
\mathrm{ICC} = \frac{500}{500 + 25 + 75} = 0.83
\]
reliability and agreement as fully standardised psycho-
logical tests do. Nevertheless, society needs a discus-
sion on the desired levels, how to get there and at
what price.
Future research
Unexplained variance remained a major concern in
the RELY studies. Research needs to identify add-
itional potentially modifiable sources of variance
(Streiner 2014, chapter 8 [25]), such as the psychia-
trist*patient interaction [5, 8, 12, 30] and may require
lab-type settings [31, 32]. Reproducibility is closely
linked to the population under investigation and its
characteristics [13], (Streiner 2014, chapter 8 [25]).
To reach an in-depth understanding of the perform-
ance of functional evaluation in social security, similar
studies need to investigate other health conditions
and settings. Despite the current low reproducibility,
which badly affects validity, considerations on how to
establish validity of WC evaluations beyond professional
consensus are warranted.
All aspects of WC evaluations are seriously under-researched,
which challenges the planning of studies. We
need data about potential effect sizes, sources and extent
of variations, the impact of expert calibration on reproducibility,
criteria to decide on the outcome measure, and data to
feed power calculations. Training experts alone may not
result in acceptable reproducibility. Nevertheless, a better
understanding of the cognitive processes by which medical
experts arrive at WC ratings may inform
training. Furthermore, teaching ‘functional evaluation’, a
novel technique, needs iterative refinement integrating
experience from practice into training curricula, includ-
ing material, intensity, duration, teaching techniques,
and evaluation of learning.
No study in our systematic review [8] provided recom-
mendations on what level of reproducibility would be
mandatory, desirable, or acceptable to ensure equal
treatment of patients. A societal discussion would need
to address alternative approaches with their advantages
and limitations, how to get there and at what cost. Conceptually,
establishing a direct link from functional capacity
to WC would require matching ICF-based [41]
functional profiles of patients with ICF-based functional
features of job demands in today’s working environment.
The scale of the task may require the umbrella of
organisations such as the World Health Organisation or
the International Social Security Association [1].
Conclusions
Evidence from non-randomised studies suggests that intensive
training in functional evaluation may increase agreement
on WC between experts, but the improvement fell short of
stakeholders’ expectations. It did not alter reliability. Isolated
efforts in training psychiatrists may not suffice to
tackle the complexity of the task and reach the expected level
of agreement. Adding further components to the procedures
of WC evaluations may deserve consideration.
Endnotes
1The administrative changes of the Swiss Government
led to the instalment of an electronic system for random
distribution of patients to registered assessment centres
www.suissemedap.ch and to the implementation of strict
deadlines for the delivery of reports.
Additional files
Additional file 1: Planned design versus actual conduct of the studies.
Comparison of the design for the RELY study as planned with the actual
conduct of the studies as RELY 1 and RELY 2. *SIM = Swiss Insurance
Medicine, professional society of medical experts (DOCX 15 kb)
Additional file 2: Design of the RELY studies. Both RELY studies
recruited psychiatrists from the same population (practicing experts
being SIM members). Training differed in training intensity and duration
to implementation. Patients were recruited from the same population
through the National Disability Insurer and Suva. In RELY 2, we re-used 15
interviews from RELY 1. The 25 new RELY 2-interviews were conducted
by 11 RELY 1-interviewers who were re-trained for rating. Both studies
used the same implementation procedure. (PNG 45 kb)
Additional file 3: Questionnaire on Perceived Fairness. Patients’
perception of the fairness of the work disability evaluation. The
questionnaire had 29 items on a scale from 1 to 5 (higher scores indicate
stronger affirmation) and a single item on overall perception of fairness
on a scale from 10 to 0. The table shows five typical items. (DOCX 18 kb)
Additional file 4: Patient flow in RELY 1. * n = 1 missing due to
violation of rating rules (JPG 67 kb)
Additional file 5: Patient flow in RELY 2. *: n = 5 missing due to
violation of rating rules (JPG 74 kb)
Abbreviations
95%CI: 95% confidence interval; GRRAS: Guidelines for Reporting Reliability
and Agreement Studies; ICC: Intraclass correlation coefficient; ICD
10: International Classification of Diseases (10th revision); ICF: International
Classification of Functioning, Disability and Health; IFAP: Instrument for
Functional Assessment in Psychiatry; OECD: Organisation for Economic Co-
operation and Development; RELY: Reliable disability EvaLuation in
psychiatrY; SMBA: Sociaal-Medische Beoordeling van Arbeidsvermogen
(‘Socio-Medical Assessment of Work Capacity’); Suva: Swiss National Accident
Insurance Fund; WC: Work capacity
Acknowledgments
We thank the patients who participated in the RELY studies; Gordon Guyatt,
McMaster University, Canada and James Young, Christchurch, New Zealand,
for methodological advice; Yvonne Bollag, asim Basel; the assessment
centres, the Disability Insurance Office Zurich, and Suva for their support in
the recruitment of patients; all expert psychiatrists for participation and
valuable discussions in training sessions; Sacha Röschard and Brigitte Walter
Meyer for technical and administrative support; and the RELY observer group
including Claudia Bretscher (Inclusion Handicap), Andreas Brunner (Cantonal
Court of Basel-Country), Etienne Colomb (French-Speaking Swiss Association
of Practitioners in Medical Expertise), Walter Gekle (Swiss Foundation Pro
Mente Sana), Ulrich Kieser (Institute for Legal Studies and Legal Practice),
Renato Marelli (Swiss Society of Insurance Psychiatry), Volker Pribnow (Law
Firm DFP&Z Advokatur Baden), Martin Reinert (Swiss Foundation Pro Mente
Sana), Fulvia Rota (Swiss Society of Psychiatry and Psychotherapy), Andreas
Traub (Swiss Federal Supreme Court, Bundesgericht).
Research partners
Zurich Office of the Swiss Federal Disability Insurance, Swiss National
Accident Insurance Fund (Suva), and the following assessment centres: asim
Basel, ZMB Basel, MEDAS Zentralschweiz, MEDAS Interlaken, Suva
Clearinghouse, Lucerne.
Authors’ contributions
RK, WdB, KF, RM1, JJ, UHR, RM2, and JWB conceived and designed the study.
RK, KF, WdB obtained the funding. RK, WdB, RM1, JJ, HS, UHR, RM2 recruited
psychiatrists. MB, DvA, NV, OB, and UHR recruited patients. WdB, RM1, ME,
and OH trained psychiatrists in functional evaluation. WdB, MB, DvA, and NV
supervised the psychiatrists’ rating and data entry. RK, WdB, KF, RM1, EC,
UHR, RM2 monitored progress. TZ and DvA analyzed the data, TZ wrote the
statistical report. All interpreted the data. RK and DvA drafted the manuscript,
all revised it critically for important intellectual content. RK, WdB, KF had full
access to all the data in the study. They take responsibility for the integrity of
the data and the accuracy of the data analysis. All authors read and
approved the final manuscript.
Funding
The studies were supported by grants from the Swiss National Science
Foundation (project number 325130_144200), from the Federal Social
Insurance Office, and from the Swiss National Accident Insurance. Swiss
National Science Foundation, Federal Social Insurance Office, and the Swiss
National Accident Insurance had no role in the design, data collection,
analysis or interpretation of the data.
Availability of data and materials
The datasets used and analysed during the current study are available from
the corresponding author on reasonable request.
Ethics approval and consent to participate
All study procedures were approved by the cantonal ethics committees in Basel
(EKBB, the lead ethics committee: decision 21/13; 22 Jan 2013, Dec 2014), Berne
(KEK Z033/13), Lucerne (EK 13066), Zürich (KEK-ZH-Nr.2013–0329); Amendment
RELY 2 (EKBB, lead ethics committee, Approval 22 Feb 2015); the data
protection officers of Basel-Stadt; Swiss National Science Foundation, Federal
Social Insurance Office, Swiss National Accident Insurance Fund (Suva), and Dis-
ability Insurance Office in Zürich. All patients provided written informed consent
according to procedures approved by the ethics committees.
Consent for publication
Not applicable.
Competing interests
None of the authors received support from any external organization or
company for the submitted work. No financial relationships with any
organizations that might have an interest in the submitted work in the
previous three years; after data collection was finished (07/2016), RK
became head of the Medical Competence Center of Suva, Lucerne. No
other relationships or activities that could appear to have influenced the
submitted work.
Author details
1Department of Clinical Research, Evidence-based Insurance Medicine,
University of Basel, University Hospital, 4031 Basel, Switzerland. 2Swiss Society
of Insurance Psychiatry, SGVP, 4051 Basel, Switzerland. 3Private Practice for
Psychiatry, 4051 Basel, Switzerland. 4Swiss National Accident Insurance Funds,
6004 Luzern, Switzerland. 5Private Practice for Psychiatry and Psychotherapy,
6004 Lucerne, Switzerland. 6Institute of Medical Disability Evaluations of
Central Switzerland, 6003 Lucerne, Switzerland. 7Psychiatric University
Hospital Basel, 4002 Basel, Switzerland. 8French-Speaking Swiss Association of
Practitioners in Medical Expertise (ARPEM), 1025 St Sulpice, Switzerland.
9Institute for Medical Disability Evaluation Interlaken, 3800 Unterseen,
Switzerland. 10Department of Anaesthesia, McMaster University, Hamilton L8S
4K1, ON, Canada. 11Department of Health Research Methods, Evidence and
Impact, McMaster University Hamilton, Hamilton L8S 4K1, ON, Canada.
12Private Practice for Psychiatry, 4057 Basel, Switzerland. 13Zuerich Office of
the Swiss National Disability Insurance, 8005 Zürich, Switzerland. 14Institute
Humans in Complex Systems, School of Applied Psychology, University of
Applied Sciences Northwestern Switzerland, 4600 Olten, Switzerland.
Received: 25 August 2018 Accepted: 4 June 2019
References
1. International Social Security Association I: Country Profiles. https://www.issa.
int/en/country-profiles, last accessed 14.04.2019.
2. OECD. Sickness, disability and work: breaking the barriers. A synthesis of
findings across OECD countries. Paris: OECD; 2010.
3. Schandelmaier S, Fischer K, Mager R, Hoffmann-Richter U, Leibold A,
Bachmann MS, Kedzia S, Jeger J, Marelli R, Kunz R, et al. Evaluation of work
capacity in Switzerland: a survey among psychiatrists about practice and
problems. Swiss Med Wkly. 2013;143:w13890.
4. de Boer W, Brage S, Kunz R. Insurance medicine in clinical
epidemiological terms: A concept paper for discussion. Dutch J Occup
Insurance Med (Tijdschrift voor Bedrijfs- en Verzekeringsgeneeskunde -
TBV). 2018;26(2):97–9.
5. Spanjer J, Krol B, Brouwer S, Groothoff JW. Sources of variation in work
disability assessment. Work. 2010;37(4):405–11.
6. Schandelmaier S, Leibold A, Fischer K, Mager R, Hoffmann-Richter U,
Bachmann MS, Kedzia S, Busse JW, Guyatt GH, Jeger J, et al. Attitudes
towards evaluation of psychiatric disability claims: a survey of Swiss
stakeholders. Swiss Med Wkly. 2015;145:w14160.
7. Baumberg Geiger B, Garthwaite K, Warren J, Bambra C. Assessing work
disability for social security benefits: international models for the direct
assessment of work capacity. Disabil Rehabil. 2018;40(24):2962–70.
8. Barth J, de Boer WEL, Busse JW, Hoving JL, Kedzia S, Couban R, Fischer K, von Allmen DY,
Spanjer J, Kunz R. Inter-rater agreement in evaluation of disability:
systematic review of reproducibility studies. BMJ. 2017;356:j14.
9. Anner J, Kunz R, de Boer W. Reporting about disability evaluation in European
countries. Disabil Rehabil. 2013;36(10):848–54.
10. de Boer W, Marelli R, Hoffmann-Richter U, Eichhorn M, Jeger J, Colomb E,
Mager R, Fischer K, Kunz R. Functional assessment in psychiatry. The manual
(die Funktionsorientierte Begutachtung in der Psychiatrie. Ein manual). Basel:
Evidence-based Insurance Medicine, Dept. of Clinical Research, University of
Basel; 2015.
11. Bachmann M, de Boer W, Schandelmaier S, Leibold A, Marelli R, Jeger J,
Hoffmann-Richter U, Mager R, Schaad H, Zumbrunn T, et al. Use of a
structured functional evaluation process for independent medical
evaluations of claimants presenting with disabling mental illness: rationale
and design for a multi-center reliability study. BMC Psychiatry. 2016;16:271.
12. de Vet HC, Terwee CB, Knol DL, Bouter LM. When to use agreement versus
reliability measures. J Clin Epidemiol. 2006;59(10):1033–9.
13. Kottner J, Audigé L, Brorson S, Donner A, Gajewski BJ, Hróbjartsson A,
Roberts C, Shoukri M, Streiner DL. Guidelines for reporting reliability
and agreement studies (GRRAS) were proposed. J Clin Epidemiol. 2011;
64(1):96–106.
14. World Report on Disability In. Geneva: World Health Organization; 2011.
www.who.int/disabilities/world_report/2011/en/ last accessed 14.Apr.2019.
15. Holwerda A, Groothoff JW, de Boer MR, van der Klink JJL, Brouwer S. Work-
ability assessment in young adults with disabilities applying for disability
benefits. Disabil Rehabil. 2013;35(6):498–505.
16. Kunz R. Improving reliability and transparency of Independent Medical
Expertises (IMEs) and their usefulness to social judges, claimants and social
insurance organisations. Swiss National Science Foundation, SNSF; 2013.
http://p3.snf.ch/project-144200 last accessed 14.Apr.2019.
17. von Allmen DY, Kedzia S, Dettwiler R, Vogel N, Kunz R, de Boer W. Higher
agreement in psychiatric disability evaluations through information about
claimants' self-perceived work capacities and limitations (in preparation).
18. Linden M, Baron S, Muschalla B. Mini-ICF-APP. Mini-ICF-rating for activity
and participation in mental health disorders. Göttingen: Hans Huber; 2009.
19. Crits-Christoph P, Johnson J, Gallop R, Gibbons MBC, Ring-Kurtz S, Hamilton
JL, Tu X. A generalizability theory analysis of group process ratings in the
treatment of cocaine dependence. Psychother Res. 2011;21(3):252–66.
20. Harmsen J. Development and analysis of the questionnaire for client
monitoring in social-medical affairs. Leiden: Ontwikkeling en Analyse
Vragenlijst Cliëntenmonitor SMZ; 2013.
21. Lohss R, Bachmann M, de Boer W, Kunz R, Fischer K. What are the concerns of
claimants who underwent a disability assessment? Dutch J Occup Insurance
Med (Tijdschrift voor Bedrijfs- en Verzekeringsgeneeskunde - TBV). 2018;
26(7):358.
22. ICD-10. International Statistical Classification of Diseases and Related Health
Problems 10th Revision [https://icd.who.int/browse10/2016/en, last accessed
19.Apr.2019].
23. Bonett DG. Sample size requirements for estimating intraclass correlations
with desired precision. Stat Med. 2002;21(9):1331–5.
24. Schellart AJ, Mulders H, Steenbeek R, Anema JR, Kroneman H, Besseling J.
Inter-doctor variations in the assessment of functional incapacities by
insurance physicians. BMC Public Health. 2011;11:864.
25. Streiner DL, Norman GR, Cairney J. Health measurement scales: a practical
guide to their development and use. Oxford: Oxford University Press; 2014.
26. Fleiss JL. Statistical methods for rates and proportions. New York:
Wiley; 1981.
27. Satterthwaite FE. An approximate distribution of estimates of variance
components. Biom Bull. 1946;2(6):110–4.
28. Sterne JA, Hernan MA, Reeves BC, Savovic J, Berkman ND, Viswanathan M,
Henry D, Altman DG, Ansari MT, Boutron I, et al. ROBINS-I: a tool for
assessing risk of bias in non-randomised studies of interventions. BMJ. 2016;
355:i4919.
29. Ten Cate DF, Luime JJ, Hazes JM, Jacobs JW, Landewe R. Does the intraclass
correlation coefficient always reliably express reliability? Comment on the
article by Cheung et al. Arthritis Care Res (Hoboken). 2010;62(9):1357–8;
author reply 1358.
30. Kobak KA, Brown B, Sharp I, Levy-Mack H, Wells K, Okum F, Williams JBW.
Sources of unreliability in depression ratings. J Clin Psychopharmacol. 2009;
29(1):82–5.
31. Schellart AJM, Zwerver F, Anema JR, van der Beek AJ. The influence of applying
insurance medicine guidelines for depression on disability assessments.
BMC Research Notes. 2013;6:225.
32. Slebus FG, Kuijer PFM, Willems JHBM, Frings-Dresen MHW, Sluiter JK. Work
ability assessment in prolonged depressive illness. Occup Med (Lond). 2010;
60(4):307–9.
33. Kunz R, Verbel A, Weida R, Hoving JL, Weinbrenner S, Friberg E, De Boer
WEL, Schaafsma F. Knowledge and training needs on evidence-based
medicine in social security and insurance medicine. An international survey.
submitted 2019.
34. Marfeo EE, McDonough C, Ni P, Peterik K, Porcino J, Meterko M, Rasch E,
Kazis L, Chan L. Measuring work related physical and mental health
function: updating the work disability functional assessment battery (WD-
FAB) using item response theory. J Occup Environ Med. 2019;61(3):219–24.
35. Meterko M, Marino M, Ni P, Marfeo E, McDonough CM, Jette A, Peterik K,
Rasch E, Brandt DE, Chan L. Psychometric evaluation of the improved work-
disability functional assessment battery. Arch Phys Med Rehabil epub. 2018.
https://doi.org/10.1016/j.apmr.2018.09.125.
36. Gouttebarge V, Wind H, Kuijer PP, Frings-Dresen MH. Reliability and validity
of functional capacity evaluation methods: a systematic review with
reference to Blankenship system, Ergos work simulator, ergo-kit and
Isernhagen work system. Int Arch Occup Environ Health. 2004;77(8):527–37.
37. de Boer WEL, Besseling JJM, Willems JHBM. Organisation of disability
evaluation in 15 countries. Revue pratiques et organisations des soins. 2007;
3(38):205–17.
38. Mabbett D. Definitions of disability in Europe: a comparative analysis.
Brussels: European Commission. Directorate for Employment and Social
Affairs; 2003.
39. Toren K, Jarvholm B. Who is the expert for the evaluation of work ability?
Scand J Work Environ Health. 2015;41(1):102–4.
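40. American Educational Research Association (AERA), American Psychological
Association (APA), National Council on Measurement in Education (NCME).
Standards for educational and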
psychological testing. Washington, DC: American Educational Research
Association; 2013.
41. World Health Organisation. International Classification of Functioning,
Disability and Health. [http://www.who.int/classifications/icf/en/]. Last
accessed: 14.04.2019.