
COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error of outcome measurement instrument

user manual

Version 1.0 dated January 2021

Lidwine B Mokkink, Maarten Boers, Cees van der Vleuten, Donald L Patrick, Jordi Alonso, Lex M Bouter, Henrica CW de Vet, Caroline B Terwee

Contact
LB Mokkink, PhD
Amsterdam UMC, Vrije Universiteit Amsterdam, Department of Epidemiology and Data Science
Amsterdam Public Health research institute
De Boelelaan 1117, 1081 BT Amsterdam
The Netherlands
Website: www.cosmin.nl
E-mail: w.mokkink@amsterdamumc.nl

The development of the COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error was part of the VENI programme with project number 91617098, funded by ZonMw (The Netherlands Organisation for Health Research and Development).

Table of Contents

Foreword

1. Background information
1.1 COSMIN initiative and steering committee
1.2 How to cite this manual
1.3 Development of the COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error
1.4 Definitions of reliability and measurement error
1.5 Focus of the COSMIN Risk of Bias tool
1.6 The structure of the COSMIN Risk of Bias tool
1.7 The “worst-score-counts” method
1.8 Relevance of the research question
1.9 Using the COSMIN Risk of Bias tool in a systematic review
1.10 Expertise required for using the tool
1.11 Using the COSMIN Risk of Bias tool to assess studies on PROMs or ObsROMs
1.12 A Risk of Bias tool is not a study design checklist, nor a reporting guideline

2. Part A. Understanding how a study informs us about the reliability and measurement error of an outcome measurement instrument
2.1 Components of outcome measurement instruments
2.2 Extracting the elements of a comprehensive research question
2.3 Example of how to use Part A of the COSMIN Risk of Bias tool to assess the quality of a study by Skeie et al. (2015)

3. Part B. Assessing the risk of bias of a study on reliability or measurement error
3.1 Elaboration on standards for studies on reliability
3.2 Elaboration on standards for studies on measurement error
3.3 Example of how to use Part B of the COSMIN Risk of Bias tool to assess the quality of a study by Skeie et al. (2015)

4. Using the COSMIN Risk of Bias tool in a systematic review of outcome measurement instruments
4.1 The eleven-step procedure for conducting a systematic review of ClinROMs, PerFOMs, or laboratory values

Appendix 1. Data Extraction table of relevant information for each included study in a systematic review.
Appendix 2. Risk of Bias ratings per standard per study
Appendix 3. Example of a Flow-chart
Appendix 4. Example of reporting table on characteristics of the included measurement instruments.
Appendix 5. Example of reporting table on characteristics of the study populations.
Appendix 6. Overview Table of quality and results of studies on reliability and measurement error.
Appendix 7. Summary of Findings Tables for Reliability and Measurement error.

References

Foreword

The COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error was developed to transparently and systematically assess the methodological quality of studies on reliability and measurement error of all types of outcome measurement instruments. It is an extended version of the COSMIN Risk of Bias checklist for the boxes reliability and measurement error for PROMs (1). It was developed for clinician-reported outcome measures (ClinROMs) (including e.g. readings based on imaging modalities and ratings based on observations), performance-based outcome measurement instruments (PerFOMs), or biomarkers, also called laboratory values (2, 3). These measurement instruments are more complex than PROMs, as not only patients are involved, but also professionals, and sometimes (complex) devices. Specifically in studies on reliability and measurement error, these additional sources of variation complicate the design of the studies and may influence their quality.

As different sources of variation can play a role, different studies can be conducted to assess the reliability or measurement error of an outcome measurement instrument. To assess the quality of such a study, one should understand (1) how the results of a published study on reliability or measurement error inform us about the reliability and measurement error of the outcome measurement instrument under study, and (2) whether we can trust the result found in the study by assessing the risk of bias of the study. These two steps are reflected in the new COSMIN Risk of Bias tool to assess the quality of studies on reliability or measurement error of outcome measurement instruments (4).

The quality assessment of a study on reliability or measurement error can be conducted in the context of a systematic review of outcome measurement instruments. In such a review all measurement properties are considered, the quality of each study is assessed, the results of the studies are extracted, and per measurement property an overall conclusion is drawn about the quality of the instrument based on all available evidence for each measurement instrument. Subsequently, the quality of the evidence is graded, taking the number, quality, and (consistency of) results of the studies into account. A recommendation for the most suitable instrument is made, based on quality, feasibility and interpretability of each instrument. As this is not an easy task to perform, we encourage the use of systematic and transparent methods when conducting such systematic reviews. We developed the COSMIN methodology for conducting systematic reviews of PROMs (5), including the COSMIN Risk of Bias checklist (1, 6). When conducting a systematic review of other types of outcome measurement instruments, such as ClinROMs, PerFOMs, or laboratory values, this newly developed COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error can be incorporated into the COSMIN methodology. In this manual we will explain how this new tool should be used.

1. Background information

1.1 COSMIN initiative and steering committee

The COSMIN initiative aims to improve the selection of health measurement instruments both in research and clinical practice by developing tools for selecting the most suitable instrument for a given situation. COSMIN is an international initiative consisting of a multidisciplinary team of researchers with expertise in epidemiology, psychometrics, and qualitative research, and in the development and evaluation of outcome measurement instruments in the field of health care, as well as in performing systematic reviews of outcome measurement instruments.

This tool was developed in a Delphi study (4). The steering committee of this study consisted of: Lidwine B Mokkink, Maarten Boers, Cees van der Vleuten, Donald L Patrick, Jordi Alonso, Lex M Bouter, Henrica CW de Vet, Caroline B Terwee.

We are very grateful to all the panelists of this study, who provided us with many helpful and critical comments and arguments (in alphabetical order): M.A. D’Agostino, Dorcas Beaton, Sophie van Belle, Sandra Beurskens, Kristie Bjornson, Jan Boehnke, Patrick Bossuyt, Don Bushnell, Stefan Cano, Saskia le Cessie, Alessandro Chiarotto, Mike Clark, Jon Deeks, Iris Eekhout, Jim Farnsworth II, Oke Gerke, Sabine Goldhahn, Robert M. Gow, Philip Griffiths, Cristian Gugiu, Jean-Benoit Hardouin, Desirée van der Heijde, I-Chan Huang, Ellen Janssen, Brian Jolly, Lars Konge, Jan Kottner, Brittany Lapin, Hanneke van der Lee, Mariska Leeflang, Nancy Mayo, Sue Mallett, Joy C. MacDermid, Geert Molenberghs, Holger Muehlan, Koen Neijenhuijs, Raymond Ostelo, Laura Quinn, Dennis Revicki, Jussi Repo, Johannes B. Reitsma, Anne W. Rutjes, Mohsen Sadatsafavi, David Streiner, Matthew Stephenson, Berend Terluin, Zyphanie Tyack, Werner Vach, Gemma Vilagut Saiz, Marc K. Walton, Matthijs Warrens, and Daniel Yee Tak Fong.

1.2 How to cite this manual

This manual accompanies the tool developed in the Delphi study. Please refer to the article when using the manual of the COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error.

LB Mokkink, M Boers, CPM van der Vleuten, LM Bouter, J Alonso, DL Patrick, HCW de Vet, CB Terwee. COSMIN Risk of Bias tool to assess the quality of studies on reliability or measurement error of outcome measurement instruments: a Delphi study. BMC Medical Research Methodology. 2020;20(293).

1.3 Development of the COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error

This COSMIN tool was developed in a Delphi study, containing three rounds. For more information about the methods of this study, we refer to Mokkink et al. 2020. In this Delphi study we reached consensus on how to formulate a comprehensive research question for studies on reliability and measurement error, on components of outcome measurement instruments (which are the potential sources of variation relevant in studies on reliability and measurement error), and on standards to assess the quality of a study on reliability and measurement error of ClinROMs, PerFOMs, or laboratory values. Based on those results, we developed the COSMIN Risk of Bias tool which comprises two parts: 1) seven elements that make up a comprehensive research question of the study, which informs us on how the reliability and measurement error of the outcome measurement instrument was studied, and 2) standards on design requirements and preferred statistical methods of studies on reliability and measurement error, which can be used to assess the quality of the study.

1.4 Definitions of reliability and measurement error

Reliability and measurement error are important measurement properties of outcome measurement instruments. Reliability and measurement error are determined based on the same study design and data collection, but with different statistical methods. These measurement properties are therefore related, but distinct.

Reliability is defined as the proportion of the total variance in the measurement which is due to true differences between patients (7). It refers to the extent to which an instrument is able to distinguish between patients; a reliability study investigates the extent to which different sources of variation influence the measurement. This gives direction for how to improve the measurement, for example by standardization or restriction of the source of variation. Reliability can be calculated with an Intra-class Correlation Coefficient (ICC), a Generalizability Coefficient or with a kappa. Reliability parameters are expressed as a proportion and lie between 0 and 1.

Measurement error is defined as the systematic and random error of a patient’s score that is not attributed to true changes in the construct to be measured (7). It refers to how close the scores of repeated measurements in stable patients are; such studies investigate the absolute deviation of the scores or the amount of error of repeated measurements in stable patients. In case of categorical outcomes it is also called ‘agreement’. For continuous outcomes measurement error is expressed in the measurement units of the measurement instrument with a Standard Error of Measurement (SEM) or Limits of Agreement (LoA). For categorical outcomes agreement is expressed as percentage total agreement or percentages specific (e.g. positive and negative) agreement.

1.5 Focus of the COSMIN Risk of Bias tool

We focus on outcome measurement instruments, defined as instruments used to monitor the health status of (a group of) people over time, for example in a clinical trial or in clinical practice.

Several types of measurement instruments exist, such as patient-reported outcome measures (PROMs); observer-reported outcome measures (ObsROMs; i.e. proxy measures); clinician-reported outcome measurement instruments (ClinROMs) (including e.g. readings based on imaging modalities and ratings based on observations); performance-based outcome measurement instruments (PerFOMs); and biomarker outcomes, also called laboratory values (2). The COSMIN Risk of Bias tool to assess reliability and measurement error is specifically developed for ClinROMs, PerFOMs, and laboratory values (see Table 1 for examples). These outcome measurement instruments typically require involvement of one or more professionals to operate equipment or tools, to give instructions to the patient (e.g. to perform a task or action) or to come to a score through their clinical expertise (e.g. after observing a patient or an image). An outcome measurement instrument comprises the whole measurement procedure to come to a score, including issues such as materials, communication (e.g. instructions and motivating patients in case of a performance-based test), clinical judgment, and performing a task. All issues relevant for reliable and valid measurement should be described in the measurement protocol of an outcome measurement instrument.

Table 1. Examples of ClinROMs, PerFOMs, and laboratory values

Clinician-reported outcome measurement instruments (ClinROMs)
- Clinician-reported rating of the severity of a disease or condition. For example, the Hamilton Anxiety Rating Scale to assess the severity of anxiety symptoms comprises 14 items that are scored by a clinician (8).
- A Global Assessment of the severity of a condition scored e.g. on a single-item Visual Analogue Scale by a health-care professional.
- Result of clinical examination of (patho)physiology, such as blood pressure or a count of swollen joints.
- Clinical reading of device-based results (often imaging), such as power Doppler ultrasonography to assess cardiac structure, function and hemodynamics (echocardiography) (9), or MRI used to evaluate cartilage defect size, depth, and subchondral bone in order to assess chondral and osteochondral lesions at the knee (10).

Performance-based outcome measurement instruments (PerFOMs)
- A performance-based walking test (e.g. the timed 25-foot walk test (11)), in which a professional instructs a patient to walk 25 feet at his own comfortable pace with or without a walking aid. Time needed to cover 25 feet is measured by the professional.

Laboratory value or biomarker
- Laboratory value such as HbA1c (glycated haemoglobin) measured by the turbidimetric inhibition immunoassay (TINIA) (12).

Different versions or operationalizations of outcome measurement instruments

To measure a specific construct, different versions of a measurement instrument may exist. For example, the Doloplus is a clinical assessment tool for behavioural pain assessment in cognitively impaired patients, and is administered e.g. by the attending nurse. The original Doloplus-1 contained 15 items, while the Doloplus-2 contains 10 items (13).

A measurement instrument (i.e. the measurement protocol) can be operationalized in many different ways, and each operationalization could be considered a different version. For example, the specific equipment used to measure the range of motion (ROM) can differ, e.g., a simple universal goniometer (14) or an electromagnetic 3-dimensional tracking system (15). The location to be measured can differ, e.g., the neck (14) or the shoulder (16). The background of the professional involved can differ, e.g., a rheumatologist or a radiologist who conducts the measurement, and these raters may have had different levels of training (17).

In principle, we consider each version of an outcome measurement instrument or each different operationalization of the measurement protocol as a separate measurement instrument, until evidence is provided (e.g. testing of measurement invariance, or reliability) that the versions perform similarly.

1.6 The structure of the COSMIN Risk of Bias tool

The COSMIN Risk of Bias tool comprises two parts. Part A helps to understand how the results of a published study inform us about the reliability or measurement error of the outcome measurement instruments under study. Part B helps to assess whether we can trust the result obtained in the study by assessing the risk of bias of the study.

Part A
For a good understanding of how the results of a study inform us about the reliability and measurement error of the instrument, a good understanding of the design of the study and its corresponding comprehensive research question is needed. In Part A we describe the seven elements that we recommend to be extracted, and that together can be used to construct a comprehensive research question for each analysis. In addition, Part A of the tool contains an overview of the components of outcome measurement instruments. These components are the potential sources of variation that can either be studied (i.e. varied across the repeated measurements), or are kept or assumed to be stable (i.e. standardized).

Part B
Next, we developed two boxes with standards for studies on reliability and for studies on measurement error, respectively. As in the COSMIN Risk of Bias checklist for PROMs (1), standards refer to design requirements and preferred statistical methods of studies on measurement properties. For example, ‘reliability and measurement error should be assessed in patients that are assumed to be stable’; or ‘measurement error should be assessed with the standard error of measurement or with the limits of agreement’. The standards are stated as questions: e.g. ‘were patients stable in the interim period on the construct to be measured?’.

We refer to ‘preferred’ statistical methods. We mean by ‘preferred’ that these statistical methods are appropriate to use when evaluating reliability or measurement error of outcome measurement instruments, and are commonly used. Other methods may be appropriate to use as well (for example bi-factor models or Multi-Trait Multi-Method (MTMM) analyses, or newly developed methods). It is not our intention to comprehensively describe all possible statistical methods, but rather to describe the adequate methods that are commonly used in the literature. It is up to the user of the COSMIN tool how studies using these less commonly used methods are assessed.

1.7 The “worst-score-counts” principle

Each standard in a box is scored on a four-point scale, i.e. ‘very good’, ‘adequate’, ‘doubtful’, and ‘inadequate’; see chapter 3 for more information. As in the COSMIN Risk of Bias checklist for PROMs (1), we use the worst-score-counts method (18) to come to a rating for the quality of the study on reliability or measurement error.
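The worst-score-counts method can be illustrated with a short sketch: the overall rating of a box is taken as the lowest rating given to any standard in that box. The names `Rating` and `worst_score_counts` are hypothetical and chosen for illustration only; treating `None` as "standard not applicable" is an assumption of this sketch, not a rule stated in this section.

```python
from enum import IntEnum


class Rating(IntEnum):
    # Ordered from best to worst, so that max() returns the worst rating.
    VERY_GOOD = 1
    ADEQUATE = 2
    DOUBTFUL = 3
    INADEQUATE = 4


def worst_score_counts(ratings):
    """Return the overall rating of a box: the worst rating of any standard.

    `ratings` is an iterable of Rating values; None marks a standard that
    was not applicable (an assumption of this sketch) and is skipped.
    """
    rated = [r for r in ratings if r is not None]
    if not rated:
        raise ValueError("at least one standard must be rated")
    return max(rated)


# Hypothetical ratings for the standards of one box in one study.
study_ratings = [Rating.VERY_GOOD, Rating.ADEQUATE, Rating.DOUBTFUL, None]
overall = worst_score_counts(study_ratings)
print(overall.name)
```

In this hypothetical example a single ‘doubtful’ standard determines the overall rating, even though the other standards were rated higher.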

1.8 Relevance of the research question

While many different research questions concerning the reliability or measurement error of an outcome measurement instrument can be investigated, the relevance of a study is not under question when using this tool. The relevance of a study refers to different aspects.

- Choice of the potential source(s) of variation that have been varied over the repeated measurements.

- Choice of the target population of patients and professionals (when applicable) of the study.

- Choice of how the measurement protocol was executed, when applicable.

- Choice of evaluating the specific measurement property, either reliability or measurement error. Often only reliability is reported, while the measurement error can be calculated using the same data.

When using this COSMIN Risk of Bias tool, these aspects will be extracted from the design of the study (in Part A). However, no judgement will be given about the appropriateness of the choices made. The choices made in the research question and study design by the researchers determine the interpretation and generalizability of the results.

1.9 Using the COSMIN Risk of Bias tool in a systematic review

The COSMIN Risk of Bias tool is developed to assess the quality of a published study. One application of the COSMIN Risk of Bias tool is to assess the quality of studies when conducting a systematic review on measurement instruments. COSMIN developed a systematic methodology for conducting systematic reviews of PROMs (5). It consists of a 10-step procedure, in which the COSMIN Risk of Bias checklist (1) (containing standards for all nine measurement properties) can be applied to the studies to assess the quality of each study. To use the COSMIN methodology for conducting systematic reviews of other types of instruments (that is, other than PROMs), we advise replacing boxes 6 (Reliability) and 7 (Measurement error) with the COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error of outcome measurement instruments. More information about how to conduct a systematic review using the new COSMIN Risk of Bias tool can be found in chapter 4.

1.10 Expertise required for using the tool

Assessing the quality of a study on reliability and measurement error, i.e. for use in a systematic review on the quality of outcome measurement instruments, is quite complex and time consuming, and it requires expertise within the research team on several aspects. We recommend that at least one of the team members should have expertise on the construct to be measured, e.g. to understand what appropriate time intervals are between repeated measurements; on the measurement instruments, e.g. to understand what concomitant sources of variation could be (and these should be restricted or standardized, see element 2 in Part A); on the patient population, e.g. to understand whether patients were stable between repeated measurements or whether subgroups of patients can be considered in one study. A clinical expert might combine these areas of expertise. A methodological expert with expertise on the theory of reliability and measurement error should be part of the team, e.g. to understand whether the design is appropriately analyzed (e.g. standard 7).

1.11 Using the COSMIN Risk of Bias tool to assess studies on PROMs or ObsROMs

This new COSMIN Risk of Bias tool is developed specifically for ClinROMs, PerFOMs, and laboratory values. However, it can also be used to assess the quality of studies on reliability or measurement error of PROMs or observer-reported outcome measures (ObsROMs; i.e. observations made, appraised, and recorded by a person other than the patient who does not require specialized professional training (2), e.g. proxy measures). For these two types of instruments, though, the tool may seem unnecessarily complex. The first step in the tool (i.e. understanding how the results inform us on the quality of the measurement instrument under study) is often obvious, as the aim of reliability studies of PROMs and ObsROMs is most often to assess test-retest reliability or measurement error of the whole measurement instrument (as these measurement instruments can only be taken in one go, and the only potential source of variance is occasion). The second step in the tool (assessing the quality of the study using the standards) will lead to the same rating as using the standards of the Risk of Bias checklist for PROMs. The standards on design requirements in both tools are partly the same. However, the new types of outcome measurement instruments for which we adapted the COSMIN checklist (i.e. ClinROMs, PerFOMs and laboratory values) require additional standards, which are not usually applicable for PROMs and ObsROMs. (If such a standard is applicable in a specific study, it could be rated using the ‘other flaws’ standard in the Risk of Bias checklist for PROMs.) The response options for standards on preferred statistical methods in the new tool are somewhat differently formulated, but will lead to the same rating as the PROM Risk of Bias checklist.

1.12 A Risk of Bias tool is not a study design checklist, nor a reporting guideline

This COSMIN Risk of Bias tool is developed to assess the quality (i.e. risk of bias) of a published study on reliability or measurement error. This tool is not developed as a design checklist or a reporting guideline. When designing or reporting a study on reliability or measurement error, additional items are relevant to consider or report. For example, the sample size of patient samples and the number of raters or repeated measurements are important in the design of a study, and when reporting, specific results such as the variance components, 95% confidence intervals around ICCs, marginals when reporting kappas, or additional assumptions are required.

2. Part A. Understanding how a study informs us about the reliability and measurement error of an outcome measurement instrument

In general, the design of a study on reliability and measurement error is about repeated measurement in stable patients. Each measurement is accompanied by some error. This error is caused by sources of variation, such as the equipment used, the professionals involved, and other components of measurement instruments. For example, the score on an instrument can be influenced by how the rater motivates the patient, how the machine was set up, or by the occasion (e.g. first and second occasion, day of the week, time of the day). In chapter 2.1 we systematically describe all components of outcome measurement instruments, which are the potential sources of variation of an outcome measurement instrument.

Many different sources of variation can affect the measurement, and each of them can be studied using a different study design. Each study design answers a different research question, and each research question gives specific information about the quality of the measurement instrument. To understand how a study can inform us about the quality of an outcome measurement instrument we describe in chapter 2.2 seven elements of a comprehensive research question.

Part A of the tool contains the overviews of the components of outcome measurement instruments (for outcome measurement instruments that do not involve biological sampling, and those that involve biological sampling, respectively), and the seven elements of a comprehensive research question. In chapter 2.3 we provide an example in which we show how to use Part A of the tool, by applying it to a paper by Skeie (19). In chapter 2.2 we will use this example, too (among other examples).

2.1 Components of outcome measurement instruments

All measurement instruments consist of components, such as equipment and preparatory actions. We developed two taxonomies of components of outcome measurement instruments, one for outcome measurement instruments that do not involve biological sampling (i.e. ClinROMs and PerFOMs) (see Table 2), and one for those that do (i.e. the laboratory values, such as blood or urine tests, tissue biopsy) (see Table 3).

Table 2. Components of outcome measurement instruments that do not involve biological sampling

Component: Equipment
Elaboration: All equipment necessary in the preparation, the administration, and the assignment of scores of the outcome measurement instrument
Examples: Questionnaire forms, computers, tablet, pen and paper; stair steps of a specific height; device or tools (such as stopwatch, probe, tube); ultrasound machine, ultrasound gels, MRI scanner; software.

Component: Preparatory actions preceding raw data collection by professionals, patients, and others (if applicable)
Elaboration 1: General preparatory actions, such as required expertise or training for professionals to prepare, administer, store or assign the scores
Examples: Training, education or experience required, certification.
Elaboration 2: Specific preparatory actions for each measurement, such as
- preparations of equipment, environment, storage by professionals [a]
  Examples: Preparation of equipment: calibration of device/equipment, adjust settings of the machine. Preparation of the environment: light conditions, room temperature, humidity, specific length of a walking track. Preparation for storage: design database and logbook.
- preparations of the patient [b] by the professional
  Examples: Provide general and preparatory instructions for the patients, such as explaining the tasks/action that need to be performed including time schedule, safety issues and side effects; instructions on diet (e.g. use of caffeine), clothing (e.g. comfortable shoes, no jewelry, glasses or devices), performance during tests (e.g. perform a task as usual; try to walk as fast as you can; lie as calm as possible); set some training or perform a familiarization session. Attaching electrodes to the body, injection with radioactive substance or contrast dye, positioning the patient, applying ultrasound gel.
- preparations undertaken by the patients
  Examples: Listening to and understanding the instructions provided; adherence to the preparatory instructions such as fasting, resting, taking medication, bowel preparation, exercising, shaving.

Component: Collection of raw data
Elaboration: All actions undertaken by patient and professional(s) to collect the data, before any data processing
Examples: The patient completing questions at home, or at the hospital; or performing the tasks; the rater observing or timing the performance; switching the imaging device on and off; positioning and moving the ultrasound probe.

Component: Data processing and storage
Elaboration: All actions undertaken on the raw data to store it in a usable (electronic) form for later data manipulation (such as score assignment or statistical analysis)
Examples: The digitally converted signal of a specific body MRI scan, which is temporarily stored in the K-space, is sent to an image processor where a mathematical formula (i.e. Fourier transformation) is applied, leading to an image which is displayed on a monitor and saved on a computer. Other examples: answers to question items are recorded on e.g. paper forms and stored, or Likert scale format response options are converted into a 0-4 score and directly entered in a computer database. Performance of data quality checks, e.g. double entry or validation checks on the stored/entered data.

Component: Assignment of the score(s)
Elaboration: Methods used to convert processed data into a score [c] that constitutes the outcome measurement instrument.
Examples: A calculation of a mathematical formula or the application of a scoring algorithm (e.g. a set of rules to be followed) to the processed data; a clinician selects the specific images and judges the severity and quantity of e.g. lesions on the set of images or compares it to a reference; scores adjusted for e.g. missing data or patients using devices such as mobility aids.

[a] Professionals are those who are involved in the preparation or the performance of the measurement, in the data processing, or in the assignment of the score; this may be done by one and the same person, or by different persons.
[b] In the COSMIN methodology we use the word ‘patient.’ However, sometimes the target population is not patients, but e.g. healthy individuals, caregivers, clinicians, or body structures (e.g. joints, or lesions). In these cases, the word patient should be read as e.g. healthy volunteer, clinician, or the relevant body structure.
[c] The score can be further used or interpreted, by converting a score to another scale, metric or classification. For example, a continuous score is classified into an ordinal score (e.g. mild/moderate/severe), a score is dichotomized into below or above a normal value, patients are classified as responder to the intervention (e.g. when their change is larger than the Minimal Important Change (MIC) value).

Table 3. Components of outcome measurement instruments that involve biological sampling

Component: Equipment
Elaboration: All equipment used in the preparation, the administration, and the determination of the values of the outcome measurement instrument
Examples: Collection tools, such as vena puncture set, biopsy tool; material containers, such as for blood plasma (EDTA or heparin tube), for tissue (container for frozen specimens for immunofluorescence, jar filled with formalin), for urine collection (sterile, screw-top container), for standard microscopic tissue evaluation (fluid or tissue for culture (sterile jar)); laboratory equipment such as centrifuges, cabinets, and chromatography systems, computers, software.

Component: Preparatory actions preceding sample collection by professionals, patients, and others (if applicable)
Elaboration 1: General preparatory actions, such as required expertise or training for professionals to prepare, administer, store and determine the value
Examples: Training, education or experience required, certification.
Elaboration 2: Specific preparatory actions for each measurement, such as
- preparations of equipment, environment, and storage by professionals [a]
  Examples: Preparation of equipment: calibration of device/equipment, adjust settings of the machine. Preparation of the environment: light conditions, room temperature, humidity. Preparation of storage: set-up all equipment for storage.
- preparation of the patient [b] by the professional
  Examples: Provide general and preparatory instructions to the patients, such as explaining the measurement procedure including safety issues and side effects; instructions on diet; insertion and withdrawal of a catheter into a blood vessel.
- preparatory actions undertaken by the patients
  Examples: Listening to and understanding the instructions provided; adherence to the preparatory instructions such as fasting, resting, taking medication, exercising, shaving, washing of hands.

Component: Collection of biological sample
Elaboration: All actions undertaken to collect the biological sample, before any sample processing
Examples: Taking a blood sample or tissue biopsy, collection of a sample of urine ‘mid-stream’ in a container.

Component: Biological sampling processing and storage
Elaboration: All actions undertaken to be able to preserve, transport, and store the biological sample for determination; and, if applicable, further actions undertaken on the stored sample to be able to conduct the determination of the biological sample
Examples: Initial reaction of material to reagent in container (e.g. anticoagulation by heparin). Blood is decomposed (by gravity) into plasma and blood cells, and stored at a specific temperature. Tissue is snap frozen by immersion in liquid nitrogen, or fixed in formalin embedded in/processed to paraffin for long-term storage. Blood is collected in a tube containing an aqueous solution of tetra-sodium salt of ethylene-diamine-tetra-acetic acid (EDTA) and mixed with air to lyse the erythrocytes and convert hemoglobin to oxyhemoglobin. Cut sections or prepare a smear on a slide; tissues are stained by immunofluorescent markers specific for certain surface antigens. Screw the lid of the urine container shut, put it in a sealed plastic bag and store it in the fridge at around 4 degrees Celsius, for max. 24 hours.

Component: Determination of the value of the biological sample
Elaboration: Methods used for counting or quantifying the amount of the substance or entity of interest [c]
Examples: The absorbance of oxyhemoglobin at 540 nm through spectrophotometry quantifies the hemoglobin concentration in the sample. The presence of the marker on the cell surface is detected and quantified by fluorescence signal intensity. Rater observes each slide and counts positive cells in an area. A calculation or the application of a mathematical formula to the prepared sample.

[a] Professionals are those who are involved in the preparation or the performance of the measurement, in the data processing, or in the assignment of the score; this may be done by one and the same person, or by different persons.
[b] In the COSMIN methodology we use the word ‘patient.’ However, sometimes the target population is not patients, but e.g. healthy individuals, caregivers, clinicians, or body structures (e.g. joints, or lesions). In these cases, the word patient should be read as e.g. healthy volunteer, clinician, or relevant body structure.
[c] The value can be further processed into a clinical score, if applicable, by a linear or semi-quantitative conversion. For example, a continuous score is classified into an ordinal score (e.g. mild/moderate/severe), a score is dichotomized into below or above a normal value, patients are classified as responder on treatment (e.g. when their change is larger than the Minimal Important Change (MIC) value). As no noise will occur from this conversion, this is not a potential source of variance, but rather an interpretation of the value. Therefore we do not include this phase in the components for outcome measurement instruments that involve biological materials.

2.2 Extracting the elements of a comprehensive research question

Before we can comprehensively assess the information in a study on the reliability or measurement error of an instrument, we need to fully understand the design of the study and reformulate the research question into what we call a ‘comprehensive research question’. Often the published research question is not specific enough to rate the adequacy of the study design. For example, if the stated aim of a study is to assess inter-rater reliability of an instrument, it is clear that raters will be varied. However, without further information it is not clear whether the interest is in the inter-rater reliability of the whole measurement procedure (e.g. by different clinicians), or only in the reliability of a part of the measurement procedure (e.g. only the assignment of the score based on an image). To get a complete picture, we recommend extracting seven elements from the publication that together can form the ‘comprehensive research question’ (see Table 4). Note that one article can contain multiple questions, each requiring an extraction of the seven elements.

Table 4. Elements of a comprehensive research question.

1 the name of the outcome measurement instrument

2 the version of the outcome measurement instrument or way of operationalization of the measurement protocol

3 the construct measured by the measurement instrument

4 a specification whether one is interested in a reliability parameter (i.e. a relative parameter such as for continuous outcomes an ICC, Generalizability coefficient φ, or Kappa κ) or a parameter of measurement error (i.e. an absolute parameter expressed in the unit of measurement e.g. SEM, LoA or SDC; or for categorical outcomes expressed as agreement or misclassification, e.g. the percentage specific agreement).

5 a specification of the components of the measurement instrument that will be repeated (especially when only part of the measurement instrument is repeated, e.g. only assignment of the score based on the same images)

6 a specification of the source(s) of variation that will be varied (e.g. time or occasion, the (level of expertise of) professionals, the machines, or other components of the measurement)

7 a specification of the patient population studied

ICC = Intraclass correlation coefficient; SEM = standard error of measurement; LoA = Limits of Agreement; SDC = smallest detectable change.


Elaboration on the elements of a comprehensive research question

Element 1. The name of the outcome measurement instrument

The name of the instrument should be exactly specified. Sometimes, this is readily apparent, e.g. the 6-minute Walking test (6MWT) or the Nine Hole Peg Test (NHPT). In some cases, a measurement protocol involves multiple measurement instruments (e.g. the Multiple Sclerosis Functional Composite (MSFC) includes the Timed 25-Foot Walk test, the Nine Hole Peg Test, and the Paced Auditory Serial Addition Test (11)), while in other cases (e.g. imaging) there may not yet be a clear name. Note that the name of the machine is not the name of the outcome measurement instrument; often a machine can be used to measure a variety of parameters (e.g. Grey scale ultrasound [to measure] synovial thickening (synovial hypertrophy) or Doppler ultrasound [to measure] increased blood flow (synovial hyperemia) (19)), or a pathological entity can be measured by different types of images (for example, enthesitis measured by ultrasound (17) or by MRI (20)). We recommend including the type of measurement (e.g. ultrasound) in combination with the entity measured as the name of the score (e.g. ultrasound enthesitis score).

Element 2. The version of the outcome measurement instrument or way of operationalization of the measurement protocol

Details on the version and operationalization of the outcome measurement instrument should be extracted. Details on the specific version refer to e.g. the length of the task (e.g. the 2-, 6- or 12-minute walking test (21)), the number of items included in the version (e.g. Doloplus-1 or Doloplus-2 (13)), or the language used (the English (21) or Dutch version (22) of the 6-minute walk test). Choices in how the measurement protocol was operationalized may affect the measurement, and should thus be made explicit. Specifically, the components that are potential sources of variation need to be listed, for example, specific characteristics of the equipment used (e.g. brand and type of the machine), and characteristics of the professionals involved in the measurement (e.g. background and experience). The taxonomy of the components of measurement instruments (see chapter 2.1) can be used for this.

Element 2 refers to components known or expected to influence the score that are not the object of study. To eliminate the influence of these potential sources of variation on the scores obtained, these components should have been restricted or standardized in the study. For example, if it is expected that different types or brands of machines may interfere with the score, only one type and brand of machine is used (and reported). In the study by Skeie et al. (2015) only the Medison Accuvix V10 ultrasound scanner with a 3–7 MHz curvilinear probe was used (19); in other words, the brand and type of machine and probe were standardized. Moreover, chiropractors with 4 and 8 years of experience, respectively, in diagnostic ultrasound for the musculoskeletal system, and with a postgraduate diploma in diagnostic ultrasound, were involved in the measurements (19). Thus, the background of the raters was restricted to a specific profession (i.e. chiropractors) with a specific duration of experience (4 and 8 years in diagnostic ultrasound) and specific training.

In addition, in some cases the instrument procedure requires multiple readings, and a summary statistic (usually the mean, but sometimes the median, maximum or minimum) is calculated as or used to assign the final score (i.e. the result of the measurement). A well-known example is blood pressure measurement in the clinic (see footnote 1). How the measurement is taken should be specified, as it is needed to assess standard 7 (see chapter 3).

For people familiar with the terminology of the Generalizability Theory, the version or the way of operationalization of the measurement instrument refers to the facets of stratification, where patients (i.e. the object of measurement) are nested in a facet (23).

Element 3. The construct measured by the measurement instrument

To identify exactly which outcome measurement instrument was studied, we recommend extracting the construct measured, unless it is clear from the given name. The construct refers to what is being measured, i.e. the ‘aspect of health’. It is also referred to as the ‘concept of interest’ or the ‘intended objective to be measured’. When the measurement instrument does not have a name, identifying the construct can help to fully characterize the outcome measurement instrument (which we also recommend mentioning in the name, i.e. element 1). Table 5 provides some examples. Note that a study on reliability or measurement error does not provide information about whether the construct is indeed being measured; for that, validity and accuracy studies are needed.

1 To measure blood pressure, the technician first palpates the radial artery, inflates the cuff until the pulse disappears, inflates an extra 20-30 mm Hg, and then slowly deflates until the pulse reappears. The pressure is noted, and the measurement begins: first, the stethoscope is placed on the brachial artery just medial and above the cubital fold. Then the cuff is reinflated. The pressure is quickly increased to 30 mm Hg above the previous reading, and then slowly deflated until the pulse sounds are detected (systolic blood pressure, measured in 2 mm increments), then further deflated until the sounds disappear (diastolic blood pressure). The cuff is fully deflated, then inflated again to repeat the measurement.


Table 5. Examples of elements 1, 2, and 3.

Example 1
Element 1 (name): Nine hole peg test (24)
Element 2 (version/operationalization): A wooden or plastic board with 9 holes (10 mm diameter, 15 mm depth), placed apart by 32 mm (25)
Element 3 (construct): Finger dexterity

Example 2
Element 1 (name): Ultrasound enthesitis score
Element 2 (version/operationalization): Sonography images obtained by experienced sonographers using the Esaote Technos MPX machine
Element 3 (construct): Enthesitis

Example 3
Element 1 (name): HbA1c value based on immune-turbidimetry (12)
Element 2 (version/operationalization): Turbidimetric inhibition immunoassay (TINIA), including 2 reagents (i.e. anti-HbA1c antibody (R1), and buffer/polyhapten reagent (R2)); Tetradecyltrimethylammonium bromide (TTAB) as detergent; Roche/Hitachi cobas c systems.
Element 3 (construct): HbA1c (glycated haemoglobin)

Element 4. Specification of the measurement property of interest

When the measurement property of interest is reliability, the study will report relative parameters such as an ICC, Generalizability coefficient φ, or Kappa κ. When the measurement property of interest is measurement error, the study will report absolute parameters, either expressed in the unit of measurement, such as SEM, LoA or SDC, or expressed as agreement or misclassification, e.g. the percentage specific agreement.

We recommend using the COSMIN terminology to determine whether a study assessed reliability or measurement error, regardless of the terms used in the article, because confusion persists about the correct application of these terms. For example, when a particular article states that ‘reliability’ was assessed, but the standard error of measurement (SEM) or the limits of agreement are reported, the result of that study should be considered as evidence for measurement error (26). When an author states to have evaluated ‘agreement between raters’ using the kappa statistic, the result of this study refers to the reliability of the outcome measurement instrument (27).
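The rule of classifying a study by the parameter it reports, rather than by the term the authors use, can be written down as a small lookup. The sketch below is only an illustration: the parameter labels come from the text above, while the function and dictionary names are hypothetical and not part of the COSMIN tool.

```python
# Hypothetical lookup: labels are taken from the manual text;
# the names of the dictionary and function are illustrative only.
RELIABILITY = "reliability"
MEASUREMENT_ERROR = "measurement error"

PARAMETER_TO_PROPERTY = {
    # relative parameters
    "icc": RELIABILITY,
    "generalizability coefficient": RELIABILITY,
    "kappa": RELIABILITY,
    # absolute parameters
    "sem": MEASUREMENT_ERROR,
    "loa": MEASUREMENT_ERROR,
    "sdc": MEASUREMENT_ERROR,
    "percentage specific agreement": MEASUREMENT_ERROR,
}


def classify_property(reported_parameter: str) -> str:
    """Return the measurement property a reported parameter gives evidence for."""
    key = reported_parameter.strip().lower()
    if key not in PARAMETER_TO_PROPERTY:
        raise ValueError(f"Parameter not covered by this sketch: {reported_parameter!r}")
    return PARAMETER_TO_PROPERTY[key]


# An article may say 'reliability' but report the SEM: classify by the parameter.
print(classify_property("SEM"))    # measurement error
print(classify_property("Kappa"))  # reliability
```

Parameters that are not listed raise an error on purpose, so that the reviewer decides explicitly how to classify them.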


Element 5. Specification of the components of the measurement instrument that will be repeated (Figure 1)

It should be extracted whether the interest of the study is in the reliability or measurement error of the whole measurement procedure (see Figure 1, study A), or only in part of the measurement procedure (see Figure 1, study B). For example, only the assignment of the score was repeated, based on a static image that was made once for each patient; or the performance of a task by each patient was videotaped, and only the last component (i.e. assignment of the scores) was repeated.

Figure 1. Which part of the measurement is repeated.

Element 6. Specification of the components of the measurement instrument that will be varied

The component of the measurement instrument that is being varied across the measurements is the main focus of the study. Examples are time or occasion (test-retest, or intra-rater), the professionals (inter-rater), or the machines (inter-machine or inter-device) (28). For example, in Figure 1 raters are varied: rater A conducts the first measurement and rater B conducts the second measurement for each patient.


In the design of the study one or more sources can be considered. For example, both the machine and the rater who conducts the whole measurement are varied across the repeated measurements (see Figure 2, study A). The taxonomies of components of measurement instruments (see chapter 2.1) can be used to consider various potential sources of variation.

Figure 2. Designs in which components are varied across repeated measurements.

Alternatively, the researchers can assume that a component (e.g. preparation or assignment of the score) is ‘stable’, in other words, that the rater who prepares the measurement or who assigns the score will not introduce error in this part of the measurement (indicated in grey in Figure 2, studies B and C), and investigate only the influence of the other components, e.g. equipment, preparation, collection of raw data, and data processing and storage.

In the designs shown in Figures 1 and 2 we assume that all patients were measured this way. This is called a crossed design (29). However, so-called nested designs are possible, too (see Figure 3). In these designs, some of the patients are measured under measurement condition A and other patients are measured under measurement condition B. In Figure 3 a nested inter-rater reliability design is shown, where some of the patients are measured first by rater A and next by rater B (i.e. measurement condition A), while other patients are measured first by rater C and next by rater D (i.e. measurement condition B), etc. These designs are appropriate to use, and the nesting can be taken into account in the calculation of the ICC, for example by calculating variance components per measurement condition and then pooling these variance components (weighted by sample size) across the measurement conditions (e.g. (30)), or by using a one-way random effects model (31).

Figure 3. Nested inter-rater reliability design.

For people familiar with the terminology of the Generalizability Theory, the components that are being varied across measurements are called the random or fixed facets of Generalizability (23).

Element 7. Patient population

The reliability depends on the homogeneity or heterogeneity of the study population. Therefore, the sample (and its subgroups) included in the study should be extracted and assessed by the user of this tool. In the study by Skeie et al. (2015) the recruited sample consisted of low back pain patients and patients with other spinal complaints, but also of pain-free subjects. This latter group could have increased the variance between patients and, subsequently, influenced the results of the reliability study (i.e. increased the ICC).
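Why a more heterogeneous sample raises the ICC while leaving the measurement error unchanged can be seen from the variance components. The sketch below is a simplified illustration under stated assumptions: a single error term, the ICC taken as patient variance divided by total variance, and the SEM taken as the square root of the error variance. The numbers are invented and the function name is hypothetical.

```python
import math


def icc_and_sem(var_patients: float, var_error: float) -> tuple:
    """ICC and SEM from variance components (simplified: one error term)."""
    icc = var_patients / (var_patients + var_error)
    sem = math.sqrt(var_error)
    return icc, sem


# Same measurement error, different spread between patients (invented values).
homogeneous = icc_and_sem(var_patients=4.0, var_error=4.0)
heterogeneous = icc_and_sem(var_patients=36.0, var_error=4.0)

print(homogeneous)    # (0.5, 2.0)
print(heterogeneous)  # (0.9, 2.0)
```

In this example the SEM is identical in both samples, whereas the ICC rises from 0.5 to 0.9 solely because the patients differ more from each other.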

In the COSMIN methodology we use the word patient. However, sometimes the study population of interest consists of healthy individuals, body structures (e.g. joints, kidneys), clinicians or caregivers. In these cases, the word patient should be read as e.g. healthy person or caregiver.

For people familiar with the terminology of the Generalizability Theory, the patient population refers to the object of measurement or the facets of differentiation (23).


2.3 Example of how to use Part A of the COSMIN Risk of Bias tool to assess the quality of a study by Skeie et al. (2015)

In this chapter we provide an example of how to use Part A of the COSMIN tool, using a paper by Skeie et al. (19). To get a full understanding of the study, we recommend first reading the introduction and methods sections of the paper. In this paper four different studies are described. Here we use the first two substudies, and provide a summary of these two studies.

In this paper, the lumbar multifidus muscle (LMM) thickness score (study 1) and contraction score (study 2) were investigated by ultrasound. The measurement proceeds as follows: a patient is asked to lie down in a specific position, and the probe is placed on a very specific body part. This yields an on-screen image. Subsequently, a marker is placed on a specific structure (i.e. the apex of the facet joint) identified on the image. In study 1, a still image is recorded, and the first rater places the second marker on another specific structure (i.e. processus mammillaris) on this image, and measures the distance between the markers with the calliper software. The distance between the two markers corresponds to the thickness of the LMM. The first rater repeats the second marker placement and distance measurement on the still image twice, for a total of three measurements. The patient leaves. Next, based on the very same still image (with only the first marker visible) a second rater places the second marker on the screen and measures the distance a total of three times. Next, all data is transferred to a separate paper by rater 1, who calculates a mean value per patient per rater. This mean value is the LMM thickness score. The repeated placement of the second marker on the still image and application of the calliper tool to measure the distance between the two markers is part of one measurement (19). This procedure is depicted in Figure 3, study 1.

Figure 3. Study designs of Skeie et al.


In study 2, for each patient each of the raters independently generated one image of the LMM in the resting state and one image of the LMM in the contracted state. Using a split-screen of the two still images of both states, each rater measured thickness (i.e. calliper-assessed distance between the markers) of the two states three times. Next, rater 1 transferred the data to a separate paper and calculated mean values of the thickness of each state. Next, rater 1 calculated the ‘LMM contraction score’ as the exact change in thickness (contracted LMM minus resting LMM) (19). This procedure is depicted in Figure 3, study 2.

Based on the thorough elaboration of the study performed and described by Skeie and colleagues, we extract the elements of a comprehensive research question.

Table 6. Example of how to use Part A of the COSMIN Risk of Bias tool based on the study by Skeie (19).

1. Name of the instrument
Instruction: Alternatively: type of instrument and parameter
Study 1: Ultrasound measurement of the lumbar multifidus muscle (LMM) thickness score
Study 2: Ultrasound measurement of the LMM contraction score

2. Version or way of operationalization
Instruction: All relevant components that are known or expected to influence the score, and which are standardized or restricted (facet of stratification (23))
Study 1:
Equipment: Medison Accuvix V10 ultrasound scanner with a 3–7 MHz curvilinear probe.
Preparatory actions: two chiropractors with 4 and 8 years of experience, respectively, in diagnostic ultrasound for the musculoskeletal system, with a postgraduate diploma in diagnostic ultrasound; still on-screen images were obtained with the subjects in a prone position with a pillow placed under the abdomen to flatten the lumbar lordosis.
Preparation: An on-screen image was generated and a marker was placed on the image on the mamillary process of the level to be measured.
Unprocessed data collection: The second marker was placed on the on-screen image, and the distance was computed by the calliper software. This part was repeated three times.
Data processing and storage: Data is transferred to a separate paper by rater 1.
Assignment of the score: Rater 1 calculated a mean value per patient per rater.
Study 2:
Preparation: In resting position, an on-screen image was generated and a marker was placed on the image on the mamillary process of the level to be measured. Next, in contracted state (LMM contraction was induced by a contralateral arm lifting task), an on-screen image was generated, too, and a marker was placed on the image.
Unprocessed data collection: Based on the split-screen of both images, the second marker was placed on each image, and the distance (per image) was calculated by the calliper software. This part was repeated three times.
Data processing and storage: Data is transferred to a separate paper by rater 1.
Assignment of the score: Rater 1 calculated a mean value per patient per rater for both states. Next, the rater calculated the ‘LMM contraction score’ as the exact change in thickness (contracted LMM minus resting LMM).

3. Construct
Instruction: Description of what is being measured
Study 1: LMM thickness
Study 2: LMM contraction, which is the change in LMM thickness between the contracted and resting state (contracted LMM minus resting LMM).

4. Measurement property
Study 1: Reliability and measurement error
Study 2: Reliability and measurement error

5. Components that will be repeated
Instruction: Either the whole measurement (i.e. all components) or the assignment of the score (i.e. last component)
Study 1: The whole measurement will be repeated. However, the focus of interest is on the unprocessed data collection: placing of the second marker on the on-screen image (mean of three times).
Study 2: The whole measurement will be repeated. However, the focus of interest is on the preparation (i.e. preparation and generation of images in the resting and contracted states, and the placing of the first marker), and on the unprocessed data collection (placing of the second marker on the on-screen image (mean of three times)).

6. Source(s) of variation varied
Instruction: Component that is varied across the measurements (i.e. focus of analysis; facet of generalizability (23))
Study 1: Raters (n=2; inter-rater reliability)
Study 2: Raters (n=2; inter-rater reliability)

7. Patient population
Instruction: (i.e. facet of differentiation (23))
Studies 1 and 2: LBP patients, patients with other spinal complaints such as mid back pain, neck pain, and/or extremity pain, and pain-free subjects (n=30 in each experiment, total n=120)

  

Based on the extracted information, a comprehensive research question can be formulated as:

Study 1: What is the inter-rater reliability of the data collection phase of the lumbar multifidus muscle (LMM) thickness score, based on the mean of three distances marked with the calliper software on a still image of the ultrasound measurement, measured using the Medison Accuvix V10 ultrasound scanner with a 3–7 MHz curvilinear probe by experienced chiropractors with postgraduate training, in LBP patients, patients with other spinal complaints such as mid back pain, neck pain, and/or extremity pain, and pain-free subjects?

Study 2: What is the inter-rater reliability of the preparation, image generation, and data collection phases of the lumbar multifidus muscle (LMM) contraction score, based on the mean of three distances marked with the calliper software on an on-screen image in the resting and in the contracted state of the ultrasound measurement, measured using the Medison Accuvix V10 ultrasound scanner with a 3–7 MHz curvilinear probe by experienced chiropractors with postgraduate training, in LBP patients, patients with other spinal complaints such as mid back pain, neck pain, and/or extremity pain, and pain-free subjects?

Please note that we do not recommend always reporting the research question as one long question like this. However, we consider it very useful to describe all this information clearly, e.g. in the methods section of a paper.
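For reviewers who extract many studies, the seven elements can be kept in a small record so that incomplete extractions are easy to spot. The sketch below is one possible way to do this and is not part of the COSMIN tool; the class, field, and method names are hypothetical, and the example values are abbreviated from Table 6 (study 1).

```python
from dataclasses import dataclass


@dataclass
class ComprehensiveResearchQuestion:
    """Hypothetical container for the seven extracted elements."""
    instrument_name: str             # element 1
    version_operationalization: str  # element 2
    construct: str                   # element 3
    measurement_property: str        # element 4
    components_repeated: str         # element 5
    sources_of_variation: str        # element 6
    patient_population: str          # element 7

    def missing_elements(self) -> list:
        """Names of elements left empty during extraction."""
        return [name for name, value in vars(self).items() if not value.strip()]


study1 = ComprehensiveResearchQuestion(
    instrument_name="Ultrasound measurement of the LMM thickness score",
    version_operationalization=(
        "Medison Accuvix V10 scanner, 3-7 MHz curvilinear probe; "
        "two chiropractors with a postgraduate diploma"
    ),
    construct="LMM thickness",
    measurement_property="Reliability and measurement error",
    components_repeated="Unprocessed data collection (mean of three marker placements)",
    sources_of_variation="Raters (n=2; inter-rater reliability)",
    patient_population="LBP patients, patients with other spinal complaints, pain-free subjects",
)

print(study1.missing_elements())  # []
```

An empty list means all seven elements were extracted; any field left blank is reported by name.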


3. Part B. Assessing the risk of bias of a study on reliability or measurement error

Part B of the COSMIN Risk of Bias tool contains two boxes with standards that can be used to determine whether the result of a study on reliability or measurement error, respectively, can be trusted. Standards refer to the design requirements of the study or to the preferred statistical methods. Standards 1 to 5 in both boxes refer to design requirements. These standards are the same for studies on reliability and for studies on measurement error, as the same design can be used for assessing both measurement properties. Three standards refer to the preferred statistical methods for studies on reliability and two standards refer to the preferred statistical methods for studies on measurement error.

In the COSMIN Risk of Bias tool, we included standards concerning the preferred statistical methods that are appropriate to use when evaluating reliability or measurement error of outcome measurement instruments (see also section 1.6). Other methods may be appropriate to use as well (for example bi-factor models or Multi-Trait Multi-Method (MTMM) analyses, or newly developed methods). It is not our intention to comprehensively describe all possible statistical methods, but rather to describe the adequate methods that are commonly used in the literature.

Each box also contains a standard asking if there were any other important methodological flaws that are not covered by the other standards (standard 6), but that may have led to biased results or conclusions. Some flaws are rather uncommon and therefore do not justify a separate standard. In chapter 3.1 we provide several examples of these flaws.

Each standard will be scored on a four-point rating system (i.e. ‘very good’, ‘adequate’, ‘doubtful’, or ‘inadequate’) in line with the COSMIN Risk of Bias checklist for Patient-Reported Outcome Measures (PROMs) (1). Subsequently, the lowest rating given in a box determines the final rating, i.e. the quality of the study (this is called the worst-score-counts method (18) to determine the risk of bias). Sometimes a response option is indicated in grey, meaning that the response option is not applicable for the standard, and users should choose between the other options. Finally, some standards can be rated as ‘not applicable’.

In general, a standard on a design requirement is rated as ‘very good’ when evidence or convincing arguments were provided that the standard is met; ‘adequate’ when it can be assumed, although it is not explicitly described, that the standard is met; ‘doubtful’ when it is unclear whether the standard is met; and ‘inadequate’ when there is evidence that the standard is not met (18). A standard about preferred statistical methods is in general rated as ‘very good’ when a preferred method was optimally used; ‘adequate’ when the preferred method was used, but it was not optimally used; ‘doubtful’ when it is unclear if a preferred method was used; and ‘inadequate’ when the statistical methods used are considered inadequate.

The boxes for reliability and measurement error, respectively, can be found here. Below, an elaboration of each standard is described for reliability (chapter 3.1) and measurement error (chapter 3.2). In chapter 3.3 we provide an example for rating the box on reliability in the study by Skeie, which was also used as an example in chapter 2.3.
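The worst-score-counts method amounts to taking the lowest rating among the applicable standards in a box. A minimal sketch, assuming ratings are recorded as the four labels named above plus ‘not applicable’ (the function and variable names are hypothetical):

```python
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]  # worst to best


def worst_score_counts(ratings):
    """Return the lowest rating in a box, ignoring 'not applicable' standards."""
    applicable = [r for r in ratings if r != "not applicable"]
    if not applicable:
        raise ValueError("No applicable standards were rated")
    # The rating with the smallest position in RATING_ORDER is the worst one.
    return min(applicable, key=RATING_ORDER.index)


# Invented ratings for the standards of one box.
box = ["not applicable", "very good", "adequate", "very good",
       "very good", "doubtful", "very good"]
print(worst_score_counts(box))  # doubtful
```

A single ‘doubtful’ rating therefore determines the overall rating of the box, however many standards were rated ‘very good’.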


3.1 Elaboration on standards for studies on reliability

The box on reliability contains five standards about design requirements, one standard on ‘other flaws’, and three standards about preferred statistical methods. For each standard we give suggestions for how to rate the standard.

Standard 1. Stability of the patient

Were patients stable in the time between the repeated measurements on the construct to be measured?
very good: Yes (evidence provided)
adequate: Reasons to assume standard was met
doubtful: Unclear
inadequate: No (evidence provided)
NA: Not applicable

Elaboration: Patients should be stable with regard to the construct to be measured between the repeated measurements. When an intervention such as surgery or medication is given in the interim period, it is likely that (many of) the patients have changed on the construct to be measured. In other words, they are not stable, and the standard should be rated as ‘inadequate’. When the aim is to assess the reliability of the assignment of the score, e.g. using static images or videos of the performance of a task as object of interest (see Figure 1, study B), this standard is not applicable, as the images and videos were acquired only once. Furthermore, the measurement can interfere with the stability of the patient. For example, there should be enough time between repeated measurements for patients to recover from experienced pain or fatigue and to return to their initial state. If not, the standard should be rated as ‘doubtful’, as it is unclear whether the patients are stable on the construct to be measured. When evidence or convincing arguments are provided that the patients were stable, the standard is scored ‘very good’.

Standard 2. Time interval

Was the time interval between the measurements appropriate?
very good: Yes
doubtful: Doubtful, OR time interval not stated
inadequate: No

Elaboration: The time interval between the measurements must be appropriate. The definition of “appropriate” depends on the construct to be measured and the study population. The time interval should be long enough to prevent recall bias of previous scores in case of intra-rater reliability, and short enough to ensure that patients have not changed on the construct to be measured. For example, synovitis can change in a few days, while a change in cartilage or bone status would take a few months.


Standard 3. Similar measurement conditions

Were the measurement conditions similar for the measurements, except for the condition being evaluated as a source of variation?
very good: Yes (evidence provided)
adequate: Reasons to assume standard was met, OR change was unavoidable
doubtful: Unclear
inadequate: No (evidence provided)

Elaboration: Each repeated measurement should be conducted with the same measurement protocol, except for the source of variation that was intentionally varied, i.e. element 6 of the comprehensive research question (see chapter 2.2). For example, if the aim was to understand the variation due to different raters (i.e. inter-rater reliability), only the raters should be varied. Other concomitant sources of variation (i.e. element 2 of the comprehensive research question, see chapter 2.2) should be kept similar. Was the study up to standard? Were all equipment, preparatory actions, environmental conditions (e.g. temperature), and methods of processing the same in both measurements? For example, when the patients are very likely to show a learning effect (for example on a performance-based test), the absence of a familiarization session should yield a rating of doubtful or inadequate on this standard, as the first measurement can then be considered to be the familiarization session, and the measurement conditions are not the same. A description of the similarity of the measurement conditions of the repeated measurements can be considered as evidence.

Standard 4. Administration of measurements

In instruments that do not involve biological sampling, the administration refers to the components ‘Collection of raw data’ and ‘Data processing and storage’ (see chapter 2.1). In instruments involving biological sampling, it refers to the components ‘Collection of biological sampling’ and ‘Biological sampling processing and storage’ (see chapter 2.1).

Did the professional(s) administer the measurement without knowledge of scores or values of other repeated measurement(s) in the same patients?
very good: Yes (evidence provided)
adequate: Reasons to assume standard was met
doubtful: Unclear
inadequate: No (evidence provided)

Elaboration: All measurements should be administered by the professional(s) involved without them having knowledge of the scores or values of other repeated measurements on the same outcome measurement instrument. This means that the measurements should all be administered without knowledge of the prior (e.g. in case of an intra-rater reliability study) or other (e.g. in case of an inter-rater reliability study) score(s) or value(s) on the instrument of interest.


The rating of this standard is rather subjective. For example, if in a study the raters independently administered the measurement, and none were involved in the care of the patients (making it very unlikely that the raters received additional information about the patients, including knowledge of the score(s) of other repeated measurements), this can be considered as ‘evidence provided’, and the rating is ‘very good’. When the other score is known to the professional while administering the repeated measurement, it may influence the way the measurement is administered. For example, with a severe score obtained with an imaging technique, the repeated measurement can be administered more carefully, and more time can be used to look at the patient. If it is known that this was the case, the rating is ‘inadequate’. When there is no explicit description, but it seems very unlikely that the raters knew the scores or values of other repeated measurements, it can be rated as ‘adequate’ or ‘doubtful’. In some situations this standard is not applicable, i.e. when the administration (i.e. collection of the raw material or biological sample, data or sampling processing and storage) is not repeated in the study, but only the assignment of the score or the determination of the value (see for example chapter 2.2, element 5 of the comprehensive research question, or Figure 1, study B).

Standard 5. Assignment of the score or determination of the biological value

Did the professional(s) assign scores or determine values without knowledge of the scores or values of other repeated measurement(s) in the same patients?
very good: Yes (evidence provided)
adequate: Reasons to assume standard was met
doubtful: Unclear
inadequate: No (evidence provided)

Elaboration: The scores on all measurements should be assigned, or values should be determined, by the professional(s) involved without them having knowledge of the scores or values of other repeated measurements. This means that assigning a score to a measurement or determining the value of a biological sample should be done without knowledge of the prior (e.g. in case of an intra-rater reliability study) or other (e.g. in case of an inter-rater reliability study) score(s) or value(s) on the instrument of interest. Although part of the determination of the value of a biological sample can be an automatic step, human action may be required to do this determination. An example is a urine pH level test to measure the acidity or alkalinity of urine, where the color of the strip is interpreted by the professional. The rating is done in the same way as explained for standard 4.


Standard 6. Other important flaws

Were there any other important flaws in the design or statistical methods of the study?
very good: No
doubtful: Minor other methodological flaws
inadequate: Yes

Elaboration: This standard is included because there might be uncommon design flaws that are not covered by other standards but that may cause additional risk of bias. Below, some examples are provided.

When various professionals are involved in the measurement instrument, and one of the professionals is the attending physician of the patient, this physician has (much) more information about the patient than the other professionals. In some situations, depending on the aim of the study and the specific construct to be measured, this could be considered a flaw because of the influence on the scores obtained.

In the previous chapter we saw in the example of Skeie that part of the sample comprised healthy (pain-free) subjects, whereas the authors were ultimately interested in these measurements in low back pain patients (19). This will increase the variance between patients, and thereby increase the result of the study (i.e. the ICC or G coefficient). Depending on where this study sits in the development of the instrument, this could be deemed proper (when the full range of the scores is not yet known) or an important flaw (when the purpose is to determine the reliability of measurement in the clinical setting of low back pain).

A final example refers to the use of the ICC model for average scores. Although discussed under standard 7 for reliability, it may be that the ICC for the mean score of the measurements is reported, whereas in clinical practice the single score is used. Depending on the purpose of the study this can be proper (when the mean score is going to be used in future research) or an important flaw (when the study is aimed at demonstrating reliability in clinical practice, where the single score is used).

It is up to the user of the COSMIN Risk of Bias tool whether a flaw is considered minor (and is rated as ‘doubtful’) or important (and is rated as ‘inadequate’). The rating for other flaws is included in the overall rating based on the worst-score-counts principle.
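How much an ICC for average scores can overstate the reliability of a single score can be approximated with the Spearman-Brown relation. This relation is not taken from the manual; it is a standard psychometric formula that holds when the error variance of the mean of k measurements is the single-measurement error variance divided by k. The sketch below uses hypothetical function names and invented values.

```python
def average_from_single(icc_single: float, k: int) -> float:
    """ICC of the mean of k measurements, given the single-measurement ICC."""
    return k * icc_single / (1 + (k - 1) * icc_single)


def single_from_average(icc_average: float, k: int) -> float:
    """Single-measurement ICC implied by a reported ICC for the mean of k."""
    return icc_average / (k - (k - 1) * icc_average)


# Invented example: a score defined as the mean of three readings.
print(average_from_single(0.5, 3))   # 0.75
print(single_from_average(0.75, 3))  # 0.5
```

In this example a reported ICC of 0.75 for the mean of three readings corresponds to an ICC of only 0.5 for a single reading, which is the relevant figure when a single score is used in practice.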


Standard 7. Preferred statistical methods for continuous scores

For continuous scores: was an intraclass correlation coefficient (ICC) calculated?
very good: ICC calculated; the model or formula was described, and matches the study design and the data
adequate: ICC calculated but model or formula was not described or does not optimally match the study design, OR Pearson or Spearman correlation coefficient calculated WITH evidence provided that no systematic difference between measurements has occurred
doubtful: Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic difference between measurements has occurred, OR WITH evidence provided that systematic difference between measurements has occurred
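The role that systematic differences play in these rating options can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the method prescribed by the tool: it estimates variance components from a two-way ANOVA of a fully crossed patients-by-raters table and returns a single-measurement ICC for agreement (rater variance counted as error) and for consistency (rater variance ignored). The data are invented, with rater B scoring every patient exactly 5 points higher than rater A.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_two_way(scores):
    """Return (ICC agreement, ICC consistency) for single measurements.

    scores: one row per patient, one column per rater (fully crossed design).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_patients = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_patients - ss_raters
    ms_patients = ss_patients / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    # Variance components; negative estimates are set to zero.
    var_patients = max((ms_patients - ms_error) / k, 0.0)
    var_raters = max((ms_raters - ms_error) / n, 0.0)
    var_error = max(ms_error, 0.0)
    agreement = var_patients / (var_patients + var_raters + var_error)
    consistency = var_patients / (var_patients + var_error)
    return agreement, consistency


rater_a = [10, 12, 14, 16, 18]
rater_b = [a + 5 for a in rater_a]  # systematic difference of 5 points
agreement, consistency = icc_two_way(list(zip(rater_a, rater_b)))

print(pearson(rater_a, rater_b))  # 1.0
print(consistency)                # 1.0
print(round(agreement, 2))        # 0.44
```

The Pearson coefficient and the consistency ICC both equal 1.0 despite the 5-point offset, whereas the agreement ICC drops to about 0.44, which is why evidence about systematic differences is needed before a correlation coefficient can be trusted.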

Elaboration: For continuous scores the intraclass correlation coefficient (ICC) is preferred to evaluate reliability. ICCs are a family of statistical parameters, including Generalizability (G) coefficients and Decision (D) coefficients. To get a ‘very good’ rating, the ICC model used in the reliability study should match the study design (and the aim) of the study that is being assessed. Therefore, the model or formula of the ICC or G coefficient used should be described. It should be clear, e.g., whether a crossed or nested design was used (see also chapter 2.2, element 6), or whether a one-way random effects model, or a two- or three-way random or mixed effects model was used. Next, it should be compared to the study design using the extracted information from Part A, and determined whether the ICC or G coefficient used indeed matches the study design.

The ICC based on the two-way mixed effects model of consistency (31) (also referred to as ICC model 3.1 (32)), and the Pearson or Spearman correlation coefficient, do not take a systematic difference between the repeated measurements into account. They are therefore considered less appropriate, as this can lead to overestimating the reliability. Therefore, based on information about a systematic difference between the levels of the source of variation considered (e.g. raters), either ‘adequate’ (when no or very little systematic difference occurred) or ‘doubtful’ (when there was a systematic difference between e.g. the raters) can be rated. When the study was designed to investigate a specific source of variation (e.g. inter-rater), and the systematic differences between the levels of this source of variation in the repeated measurements were taken into account in the formula (for example, by using the ICC random effects model for agreement (31), also referred to as Model 2.1 (32), or the φ coefficient (see e.g. (23))), the study can be rated as ‘very good’.

When a study is designed without any specific source of variation being considered, the appropriate ICC model is a one-way random effects model (31). In this situation the use of a one-way random effects model can be rated as ‘very good’, while the use of other models can be rated as ‘adequate’.

Next, the ICC can be calculated for a single measurement or an average measurement (31). If a single measurement is normally used in clinical practice or trials (and not the average score of multiple measurements, as is done in blood pressure measurement), the ICC for single measures should have been calculated. The ICC for average measures refers to the reliability of the averaged score of the repeated measurements. When the ICC for average measures is reported in a situation where usually a single measurement is taken, we recommend rating this standard as ‘adequate’, as the model does not optimally match the design of the study. However, we also recommend in this situation rating standard 6 (i.e. other flaws) as ‘doubtful’ or even ‘inadequate’ (see also the example at standard 6).

Moreover, to get a ‘very good’ rating, the described ICC or G coefficient model or formula should match the data. If there is a (known) problem with the normal distribution of the data (normality) which is not properly taken into account, the study could be rated as ‘adequate’ instead of ‘very good’. It is impossible to describe all possible flaws here. Therefore, it is up to the user of the COSMIN Risk of Bias tool to decide how an identified flaw should be scored. The relevant question in this regard is how certain and how large the influence on the study result is.

Standard 8. Preferred statistical methods for ordinal scores

For ordinal scores: was a (weighted) kappa calculated?
very good: Kappa calculated; the weighting scheme was described, and matches the study design and the data
adequate: Kappa calculated, but weighting scheme not described or does not optimally match the study design

Elaboration: To assess reliability for ordinal scores, Cohen’s kappa (33-35) is considered the preferred statistical parameter. No better alternative is known (4, 36). The specific kappa used should be described in terms of whether a weighting scheme was used and, if so, which scheme. Unweighted kappa considers any misclassification equally inappropriate. However, a misclassification between two adjacent categories may be less erroneous than a misclassification between categories that are further apart. A weighted kappa takes this into account (e.g. using linear or quadratic weights (37)). If the goal of the study was to consider any misclassification as equally important, and it was stated that the unweighted kappa was used, this standard can be rated as ‘very good’. However, in other situations (e.g. when misclassification between categories further apart is a bigger problem than misclassification between adjacent categories) a specific weighting scheme is preferred. If an unweighted kappa was calculated in that case, the standard could be rated as ‘adequate’.

Standard 9: Preferred statistical methods for dichotomous or nominal scores

For dichotomous/nominal scores: was kappa calculated for each category against the other categories combined?

very good: Kappa calculated for each category against the other categories combined

Elaboration: A study on the reliability of an outcome measurement instrument with dichotomous or nominal scores gets a ‘very good’ rating when an unweighted kappa was calculated for each category against the other categories (33).
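The kappa variants named in standards 8 and 9 can be illustrated with a minimal sketch. The function names, the `weights` argument and the example scores are assumptions made for illustration only; they are not part of the COSMIN tool. Weights are expressed as disagreement weights (0 for identical scores, 1 for the two most distant categories), which is one common way to write linear and quadratic weighting.

```python
def cohen_kappa(rater_a, rater_b, categories, weights=None):
    """Cohen's kappa for two raters scoring the same patients.

    categories: ordered list of possible scores (at least two).
    weights: None (unweighted), "linear" or "quadratic".
    Undefined (division by zero) if no disagreement is expected by chance.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Observed joint proportions of (score rater A, score rater B).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weights is None:
            return 0.0 if i == j else 1.0
        distance = abs(i - j) / (k - 1)
        return distance if weights == "linear" else distance ** 2

    obs = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    exp = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    return 1.0 - obs / exp


def kappa_per_category(rater_a, rater_b, categories):
    """Unweighted kappa of each category against all other categories combined."""
    result = {}
    for c in categories:
        a = [score == c for score in rater_a]
        b = [score == c for score in rater_b]
        result[c] = cohen_kappa(a, b, [True, False])
    return result


# Hypothetical ordinal scores (0, 1, 2) with adjacent disagreements only.
scores_a = [0, 0, 1, 1, 2, 2]
scores_b = [0, 1, 1, 2, 2, 2]
print(cohen_kappa(scores_a, scores_b, [0, 1, 2]))
print(cohen_kappa(scores_a, scores_b, [0, 1, 2], weights="quadratic"))
print(kappa_per_category(scores_a, scores_b, [0, 1, 2]))
```

In this made-up example all disagreements are between adjacent categories, so the weighted kappas come out higher than the unweighted kappa. This is why the weighting scheme needs to be reported before a kappa value can be interpreted.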

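Returning to standard 7 for reliability, the difference between the one-way, agreement and consistency ICCs for single measures can be sketched from the mean squares of a patients-by-raters table. This is an illustrative calculation using the textbook mean-square expressions for these three models; the function names and the scores are hypothetical, and a real analysis would normally use a statistical package that also reports confidence intervals.

```python
def anova_mean_squares(scores):
    """scores[i][j] is the score of patient i on repeated measurement j (e.g. rater j)."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(r) / k for r in scores]
    col_means = [sum(r[j] for r in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for r in scores for x in r)
    ss_error = ss_total - ss_rows - ss_cols                  # residual
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    ms_within = (ss_cols + ss_error) / (n * (k - 1))
    return n, k, ms_rows, ms_cols, ms_error, ms_within


def icc_single_measures(scores):
    n, k, msr, msc, mse, msw = anova_mean_squares(scores)
    return {
        # One-way random effects model: no specific source of variation.
        "one_way": (msr - msw) / (msr + (k - 1) * msw),
        # Two-way random effects, agreement (Model 2.1): rater differences count as error.
        "agreement": (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        # Two-way mixed effects, consistency (Model 3.1): rater differences are ignored.
        "consistency": (msr - mse) / (msr + (k - 1) * mse),
    }


# Hypothetical scores of 6 patients by 4 raters with clear systematic rater differences.
scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(icc_single_measures(scores))
```

Because the raters in this made-up table differ systematically, the consistency ICC is much higher than the agreement ICC. That gap is the overestimation the elaboration warns about when the aim is to generalize beyond the raters in the study.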

3.2 Elaboration on standards for studies on measurement error

Standards 1 to 6 of the box with standards for studies on measurement error are the same as for studies on reliability. For an elaboration on each of these standards, please see above.

Standard 7: Preferred statistical methods for continuous scores

For continuous scores: was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC), Limits of Agreement (LoA) or Coefficient of Variation (CV) calculated?

very good: SEM, SDC, LoA or CV calculated; the model or formula for the SEM/SDC is described; it matches the reviewer constructed research question and the data

adequate: SEM, SDC, LoA or CV calculated, but the model or formula is not described or does not optimally match the reviewer constructed research question, and evidence is provided that no systematic difference has occurred

doubtful: SEM consistency, SDC consistency, LoA or CV calculated, without knowledge about systematic difference or with evidence provided that a systematic difference has occurred

inadequate: SEM calculated based on Cronbach’s alpha, OR using the SD from another population

Elaboration: For continuous scores, the preferred measures for the measurement error of a single score are the SEM, the LoA or the Coefficient of Variation (CV); the SDC is preferred as a measure for change scores. Different formulas can be used to calculate these measures. Therefore, we will first describe their formulas. Subsequently, we will explain the standard for studies using SEM and SDC derived from variance components analyses. Next, we will discuss LoA, SEM and SDC using the SD difference. We will explain when ignoring the influence of the source of variation is appropriate. Finally, we will discuss some other methods used, including the CV.

Measures that take all error into account, including the systematic difference between repeated measurements, based on a one-way or two-way effects model, are:

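\[ SEM_{\text{one-way}} = \sqrt{\sigma^2_{\text{error}}} \qquad (1) \]

\[ SEM_{\text{agreement}} = \sqrt{\sigma^2_{r} + \sigma^2_{\text{residual}}} \qquad (2) \]

\[ SDC_{\text{agreement}} = 1.96 \cdot \sqrt{2} \cdot SEM_{\text{agreement}} = 1.96 \cdot \sqrt{2} \cdot \sqrt{\sigma^2_{r} + \sigma^2_{\text{residual}}} \qquad (3) \]

Here r denotes the source of variation that is varied across the repeated measurements (e.g. raters).

Measures that do not take the systematic difference between repeated measurements into account:

\[ SEM_{\text{consistency}} = \sqrt{\sigma^2_{\text{residual}}} \qquad (4) \]

\[ SDC_{\text{consistency}} = 1.96 \cdot \sqrt{2} \cdot SEM_{\text{consistency}} = 1.96 \cdot \sqrt{2} \cdot \sqrt{\sigma^2_{\text{residual}}} \qquad (5) \]

\[ SEM_{\text{consistency}} = \frac{SD_{\text{difference}}}{\sqrt{2}} \qquad (6) \]

\[ SDC_{\text{consistency}} = 1.96 \cdot \sqrt{2} \cdot SEM_{\text{consistency}} = 1.96 \cdot \sqrt{2} \cdot \frac{SD_{\text{difference}}}{\sqrt{2}} \qquad (7) \]

\[ LoA = \bar{d} \pm 1.96 \cdot SD_{\text{difference}} \qquad (8) \]

\[ SDC_{\text{consistency}} = 1.96 \cdot SD_{\text{difference}} \qquad (9) \]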
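The relation between the agreement and consistency versions of the SEM and SDC can be sketched for the simplest design, two repeated measurements per patient (for example two raters). The sketch assumes the usual two-way variance components: the residual variance equals half the variance of the differences, and the rater variance is estimated from the mean difference. The function name and the scores are hypothetical.

```python
import math
import statistics


def measurement_error_two_measurements(scores_1, scores_2):
    """SEM, SDC and LoA for two repeated measurements per patient."""
    n = len(scores_1)
    diffs = [a - b for a, b in zip(scores_1, scores_2)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)

    # Two-way variance components for k = 2 measurements (assumed model).
    var_residual = sd_diff ** 2 / 2            # mean square error
    ms_raters = n * mean_diff ** 2 / 2         # mean square between measurements
    var_raters = max(0.0, (ms_raters - var_residual) / n)

    sem_agreement = math.sqrt(var_raters + var_residual)
    sem_consistency = math.sqrt(var_residual)  # equals sd_diff / sqrt(2)
    factor = 1.96 * math.sqrt(2)
    return {
        "sem_agreement": sem_agreement,
        "sdc_agreement": factor * sem_agreement,
        "sem_consistency": sem_consistency,
        "sdc_consistency": factor * sem_consistency,  # equals 1.96 * sd_diff
        "loa": (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff),
    }


# Hypothetical scores: the first rater scores about 5 points higher than the second.
rater_1 = [26, 29, 36, 39, 46, 49]
rater_2 = [20, 25, 30, 35, 40, 45]
print(measurement_error_two_measurements(rater_1, rater_2))
```

With a systematic difference of about 5 points, the SEM agreement is several times larger than the SEM consistency. Without a systematic difference the two would be almost identical.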

To get a ‘very good’ rating, the formula used should match the design (and the aim) of the study that is being assessed. Therefore, it should be clear what the aim is, and which measure or formula was used in the study being assessed.

Measurement error derived from variance components analyses (formulas 1-5)

The specific model used should be clearly described, e.g. whether a one-way random effects model or a two- or three-way random or mixed effects model was used. It should also be clear whether all error (except the variance due to variation between patients) was included in the calculation of the measurement error, or whether the systematic error between the levels of the source of variation that is varied in the design was ignored (as happens when calculating SEM consistency for single scores (formula 4) and SDC consistency for change scores (formula 5)). Next, the method should be compared to the study design using the extracted information about the comprehensive research question (see Part A of the tool), to determine whether the method used indeed matches the study design. In other words, when the aim of the study was to assess the measurement error of a single score of any measurement taken in clinical practice or trials, the aim is to generalize the results beyond (e.g.) the specific raters involved in the study. In this case, the systematic error between raters should be taken into account; the raters (in this example) should be considered random; and all error should be taken into account (i.e. formulas 1-3) to match the design of the study (and this is rated ‘very good’). If in this case (with the aim to generalize beyond the specific raters) the SEM consistency (formula 4) or SDC consistency (formula 5) was calculated (i.e. ignoring a systematic difference between raters), evidence should be provided that no (or only a very small) systematic difference has occurred between the raters. In case of no or very small differences the standard can be rated as ‘adequate’, as the SEM agreement (formula 2) and SEM consistency (formula 4), or SDC agreement (formula 3) and SDC consistency (formula 5), will be the same or very close. If it is unclear whether systematic differences occurred (because this was not reported), the standard is rated as ‘doubtful’.

Measurement error derived from the SD difference (formulas 6-9)

The measurement error of a single score or a change score can also be calculated using the SD difference. This refers to the standard deviation of the differences between the scores on the repeated measurements (38, 39). In a Bland and Altman plot, two repeated measurements per patient are plotted: on the x-axis the mean score of the two measurements, and on the y-axis the difference between the two measurements (39). Although the plot is designed in such a way that systematic differences can easily be seen (i.e. the line of the mean difference in scores, and limits of agreement located asymmetrically around zero), the systematic difference is disregarded when the SDC is calculated from these limits (resulting in the SDC consistency). Therefore, if a (large) systematic error between the repeated measurements occurred, while the aim of the study is to generalize beyond the specific source of variation (e.g. raters), the standard should be rated as ‘doubtful’, as the results of the study underestimate the measurement error.

When is a measure of consistency (formulas 4-9) appropriate?

Sometimes, the source of variation that is varied across the measurements is considered to be fixed in a study. This means that the aim of the study is not to generalize beyond the specific study objects included in the study. For example, in a study only two raters are considered (e.g. the raters Myrthe and Brechtje), and the aim of the study is to determine whether these two raters will arrive at equal scores (e.g. because they will be the only two raters involved in the measurements for a specific trial). If a systematic error occurs between Myrthe and Brechtje (e.g. Myrthe systematically scores 5 points higher than Brechtje), the scores obtained in the trial can easily be adjusted by subtracting 5 points from each measurement obtained by Myrthe. In this study, the source of variation ‘rater’ is deemed irrelevant (31), as the systematic difference will be adjusted for later on when the instrument is used by either Myrthe or Brechtje. In this specific situation, the SEM consistency, SDC consistency or the limits of agreement match the aim and design of the study, so the standard can be rated as ‘very good’. However, these results cannot be generalized to other raters, as ‘rater’ was considered fixed. Therefore, the study is less relevant in other situations, especially when there is a systematic difference between the raters.


Measurement error calculated using the formula SD * √(1 - ICC)

There is another formula which is sometimes used to calculate the SEM from the ICC: SEM = SD * √(1 - ICC) (40). The standard deviation refers to the pooled SD of the sample, that is, of SD test and SD retest. Using this formula is only justified if the data for the ICC and the SD are derived from the same study. When the SD is based on another population, this is considered inadequate, as the SD of this other population may be smaller, and subsequently the measurement error is smaller. Moreover, sometimes Cronbach’s alpha is inserted in the formula instead of the ICC. This is considered inadequate, as Cronbach’s alpha is based on one full-scale measurement in which the items are considered as the repeated measurements, instead of at least two full-scale measurements using the total score in the calculation of the SEM. Often Cronbach’s alpha is higher than ICCs based on repeated measurements, thus leading to smaller SEM values. By rating this as inadequate, the result of the study can still be considered; however, it is considered to be less trustworthy. Moreover, Cronbach’s alpha is sometimes used inappropriately, because it is calculated for a scale that is not unidimensional, or that is based on a formative model. In such cases Cronbach’s alpha cannot be interpreted. Other parameters that are based on single measurements, such as the person separation index (or other IRT-based measurement error measures) or Omega, are not covered by measurement error according to the COSMIN taxonomy, but by internal consistency.

The Coefficient of Variation

The Coefficient of Variation (CV) is also a parameter of measurement error. It is often used in physics and to present the measurement error of a device. When developing a new device, the measurement error is assessed by measuring a fixed sample many (e.g. 50) times. The SD of these measurements is the standard error of measurement. Often the measurement error increases with higher values. For these situations the CV is a suitable measure, as the CV expresses the SD as a proportion of the mean value: in formula, CV = SD/mean. Usually it is expressed as a percentage; for example, the measurement error is 2% of the measured value. The assumption underlying the CV is that it is constant over all values of the mean, so that the SD is e.g. 2% of the mean value, regardless of a mean value of 10, 100 or 1000. In a Bland and Altman plot we had the contrary assumption, i.e. that the SD of the difference is constant over the mean values on the x-axis. If the differences are smaller for small values and larger for large values, the horizontal lines of the limits of agreement give a wrong value: too large for the small values and too small for the large mean values. In that case one should transform the data. Often a natural logarithm or base-10 logarithm transformation is used. This has the advantage that the limits of agreement can be directly expressed as a coefficient of variation (41).
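The two formulas in this part, SEM = SD * √(1 - ICC) and CV = SD/mean, can be illustrated with a short sketch. The function names and all numbers are hypothetical. The pooling of SD test and SD retest as the root of the mean of the two variances is an assumption that holds for equal sample sizes at both occasions.

```python
import math
import statistics


def pooled_sd(sd_test, sd_retest):
    """Pooled SD of test and retest (assumes the same patients at both occasions)."""
    return math.sqrt((sd_test ** 2 + sd_retest ** 2) / 2)


def sem_from_icc(sd, icc):
    """SEM = SD * sqrt(1 - ICC); SD and ICC should come from the same study."""
    return sd * math.sqrt(1 - icc)


def coefficient_of_variation(repeated_values):
    """CV = SD / mean of repeated measurements of one fixed sample."""
    return statistics.stdev(repeated_values) / statistics.mean(repeated_values)


sd = pooled_sd(10.0, 10.0)
print(sem_from_icc(sd, 0.84))   # SEM based on a test-retest ICC
print(sem_from_icc(sd, 0.91))   # a higher coefficient (e.g. Cronbach's alpha) gives a smaller SEM
print(coefficient_of_variation([98, 100, 102]))
```

The second call shows why inserting a higher coefficient such as Cronbach’s alpha into the formula makes the measurement error look smaller than the repeated-measurements ICC would suggest.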


Standard 8: Preferred statistical methods for dichotomous, nominal, or ordinal scores

For dichotomous/nominal/ordinal scores: was the percentage specific (e.g. positive and negative) agreement calculated?

very good: % specific agreement calculated

adequate: % agreement calculated

Elaboration: Kappa is often considered a measure of agreement; however, kappa is a measure of reliability (42). An appropriate parameter of measurement error (also called agreement) of dichotomous/nominal/ordinal scores is the proportion of specific agreement (42-44). It is a measure that expresses the agreement separately for each category of the score, that is, positive and negative agreement in case the score is dichotomous.
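For a dichotomous score rated twice, the proportions of positive and negative specific agreement can be computed from the four cells of the 2x2 table. The sketch below uses the commonly cited expressions 2a/(2a+b+c) and 2d/(2d+b+c); the function name, argument names and counts are hypothetical.

```python
def specific_agreement(both_pos, only_first_pos, only_second_pos, both_neg):
    """Positive, negative and overall agreement from a 2x2 table of two ratings."""
    disagreements = only_first_pos + only_second_pos
    total = both_pos + disagreements + both_neg
    return {
        "positive": 2 * both_pos / (2 * both_pos + disagreements),
        "negative": 2 * both_neg / (2 * both_neg + disagreements),
        "overall": (both_pos + both_neg) / total,
    }


# Hypothetical counts: 20 patients positive twice, 15 negative twice, 15 disagreements.
print(specific_agreement(20, 5, 10, 15))
```

Reporting the positive and negative agreement separately shows whether the raters disagree more about one category than about the other, which the overall percentage agreement hides.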


3.3 Example of how to use Part B of the COSMIN Risk of Bias tool to assess the quality of a study by Skeie et al. (2015)

In this chapter we provide an example of how to use Part B of the COSMIN tool, again using the paper by Skeie et al. (19). To fully understand the explanation in Table 7, we recommend first reading the introduction and methods section of the paper, and the summary provided at page 27/28. In this paper four different studies are described. Here we use the first two substudies.

Table 7. Example of how to use Part B of the COSMIN Risk of Bias tool, based on the study by Skeie et al. (19).

Standards on design requirements for Reliability and Measurement error

1. Were patients stable in the time between the repeated measurements on the construct to be measured?
Rating study 1: NA (measurements were based on a still image).
Rating study 2: Very good. Measurements were conducted in succession.

2. Was the time interval between the repeated measurements appropriate?
Rating study 1: NA.
Rating study 2: Very good. The time interval (i.e. the second rater started immediately after the first had completed the procedure) has probably not influenced the scores.

3. Were the measurement conditions similar for the repeated measurements, except for the condition being evaluated as a source of variation?
Rating study 1: Very good.
Rating study 2: Very good.

4. Did the professional(s) administer the measurement without knowledge of scores or values of other repeated measurement(s) in the same patients?
Rating study 1: Very good. None of the previous scores were available.
Rating study 2: Very good. None of the previous scores were available.

5. Did the professional(s) assign the scores or determine the values without knowledge of the scores or values of other repeated measurement(s) in the same patients?
Rating study 1: Very good. None of the previous scores were available.
Rating study 2: Very good. None of the previous scores were available.

6. Were there any other important flaws in the design or statistical methods of the study?
Rating study 1, for reliability: Doubtful. 5 of 30 persons (see Table 1 of the paper) were pain-free subjects, which could have substantially increased the variation between the patients, and subsequently the ICC.
Rating study 2, for reliability: Very good (in this study no pain-free persons were included, see Table 1 of the paper).
For measurement error: Very good. Heterogeneity of the sample is considered less of a problem, as the variation between patients is not included in the parameter.

Standards on preferred statistical methods for Reliability

7. For continuous scores: was an Intraclass Correlation Coefficient (ICC) calculated?
Rating: Adequate. ICC two-way mixed single measures (3.1) and two-way mixed average measures (3.2) were calculated. This is the ICC consistency, which does not take the systematic error between raters into account. The study aims to generalize beyond the raters involved; therefore, the raters should not be considered fixed, and the ICC model does not optimally match the research aim and design. Based on the means of the measurements provided in Table 2, we can conclude that no systematic difference between the raters occurred. The ICC two-way mixed average measures (3.2) refers to the practice in which two raters would measure each patient (with triple placement of the second marker), and both final scores would be averaged. As this will not be common practice, we will ignore this ICC. The repetition of part of the measurement is already part of one measurement.

8. For ordinal scores: was a (weighted) Kappa calculated?
Rating study 1: Not applicable.
Rating study 2: Not applicable.

9. For dichotomous/nominal scores: was Kappa calculated for each category against the other categories combined?
Rating study 1: Not applicable.
Rating study 2: Not applicable.

Final Risk of Bias rating Reliability studies
Rating study 1: Doubtful.
Rating study 2: Adequate.

Standards on preferred statistical methods for Measurement error

7. For continuous scores: was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC), Limits of Agreement (LoA) or Coefficient of Variation (CV) calculated?
Rating: Adequate, as the limits of agreement were calculated, while the aim was to generalize beyond the raters included in this study, and probably there was no systematic difference between the raters.

8. For dichotomous/nominal/ordinal scores: was the percentage specific (e.g. positive and negative) agreement calculated?
Rating study 1: Not applicable.
Rating study 2: Not applicable.

Final Risk of Bias rating study on Measurement error
Rating study 1: Adequate.
Rating study 2: Adequate.


4. Using the COSMIN Risk of Bias tool in a systematic review of outcome measurement instruments

Researchers and clinicians who are deciding on the most suitable outcome measurement instrument for use in their study can often choose from multiple instruments. The selection should be based on the evidence on the quality of the outcome measurement instruments (i.e. reliability, validity, and responsiveness), as well as on aspects of feasibility and interpretability. A high-quality systematic review of outcome measurement instruments gives a clear overview of all aspects that are important for making this choice. Understanding the quality of the studies and the quality of the measurement instrument under study is a challenging task, specifically for researchers and clinicians who are less familiar with the methodology to evaluate all measurement properties. Therefore, in 2018, we (the COSMIN initiative) published a thorough methodology to conduct a systematic review of PROMs (5). It consisted of a ten-step procedure to summarize the available evidence per measurement property per included PROM, to draw conclusions on each measurement property per PROM, and subsequently to give recommendations on the most suitable PROM for a given purpose, also taking feasibility and interpretability aspects into account. This methodology also includes the COSMIN Risk of Bias checklist to assess the quality of studies on measurement properties of PROMs (1), including standards for design requirements and preferred statistical methods organized in boxes per measurement property.

To perform a systematic review on the quality of ClinROMs, PerFOMs and laboratory values, the same methodology can be used. However, we recommend some adaptations. Two aspects of the COSMIN methodology for systematic reviews of PROMs are different for ClinROMs, PerFOMs or laboratory values: the recommendation to use different boxes for reliability and measurement error, and the addition of a new step.

The new boxes

In systematic reviews of ClinROMs, PerFOMs or laboratory values the COSMIN Risk of Bias checklist for PROMs (1) can be used, although the boxes for reliability and measurement error should be replaced with the COSMIN Risk of Bias tool to assess the quality of a study on reliability or measurement error (4). Standards for most of the remaining measurement properties (i.e. content validity, internal consistency, construct validity, criterion validity and responsiveness) developed for PROMs can be used for other types of measurement instruments as well. Some measurement properties are only relevant for multi-item instruments based on a reflective model (i.e. structural validity and internal consistency). For some other measurement properties only the final score or value of a measurement instrument is considered (i.e. hypotheses testing for construct validity, criterion validity and responsiveness). The quality of studies on these measurement properties is assessed similarly for all types of outcome measurement instruments, and the existing boxes from the COSMIN Risk of Bias checklist for PROMs can be used.

An additional step

In a reliability study or a study on measurement error of a PROM, the focus of interest is usually on the quality of the PROM as it is being used in clinical practice (analyzed using a one-way random effects model), or on the test-retest reliability (using a two-way random effects model of agreement). However, the focus of interest in a reliability study of other types of measurement instruments is much more diverse. As explained in chapter 2, there are many potential sources of variation (i.e. many different ways to operationalize the components of outcome measurement instruments) that could be the focus of interest in a study on reliability. The result of each of those studies tells you something about the quality of the instrument (and gives suggestions for improving the measurement by standardizing or restricting the source of variation which showed the largest error). Based on an overview of all these studies, a best-evidence measurement protocol can be recommended. In a COSMIN review of ClinROMs, PerFOMs or laboratory values, an additional step is needed in the ten-step procedure (see Figure 3), specifically in the assessment of reliability and measurement error. To interpret the results of studies included in a systematic review well, you need to decide how the results of the study you want to assess inform you about the quality of the measurement instrument. Therefore, we separated the assessment of reliability and measurement error from the other measurement properties.

Change in the methodology

Based on our experience using the methodology, we decided to remove step 8 (which was ‘Evaluate interpretability and feasibility’) from the methodology. Aspects of interpretability and feasibility are only extracted (and summarized) rather than evaluated. Therefore, this step is irrelevant in the methodology. However, we consider it very useful to have a separate step on data extraction. Once you have included all the studies in a review, we recommend first extracting all necessary information from each article, before assessing the risk of bias and the quality of the instrument. Relevant information to be extracted refers to characteristics of the included measurement instruments, information on feasibility and interpretability, characteristics of the studies, and the results of the studies.

Consequently, the step numbers deviate from the step numbers presented in the original 10-step procedure of the COSMIN methodology to conduct a systematic review of PROMs (5).

Figure 3. Eleven-step procedure for conducting a systematic review on any type of outcome measurement instrument

4.1 The eleven-step procedure for conducting a systematic review of ClinROMs, PerFOMs, or laboratory values

Below, a summary is given of the eleven-step procedure. In the user manual of the COSMIN methodology for systematic reviews of PROMs (45) a thorough explanation of each step is provided. Only the steps that are different for a review of outcome measurement instruments other than PROMs are described here in detail. Please note that the step numbers have changed.

The methodology of a systematic review of outcome measurement instruments is subdivided into three parts (A, B, and C) (5).

Steps 1-4: Perform the literature search

Steps 1-4 are standard procedures when performing systematic reviews, and are in agreement with existing guidelines for reviews (46, 47): formulating the specific aim of the review and the eligibility criteria, performing the literature search, and selecting relevant publications.

In the research question and eligibility criteria, four key elements should be included: 1) the construct; 2) the population; 3) the type(s) of instruments; and 4) the measurement properties of interest.

In the search strategy we recommend also using these key elements, except for the type of instruments, as we are not aware of highly sensitive search blocks for different types of measurement instruments. Search filters for different constructs may be found at https://blocks.bmi-online.nl/. When using the search filter for finding studies on measurement properties (48) of ClinROMs, PerFOMs and laboratory values, we recommend using additional search terms for finding studies using Generalizability theory. This string, developed with the help of a clinical librarian, can be added to the search filter with the Boolean operator “OR”.

PubMed search string for finding studies using Generalizability theory:

G-theory[tiab] OR "G theory"[tiab] OR "generalizability theory"[tiab] OR "generalisability theory"[tiab]

EMBASE search string for finding studies using Generalizability theory:

'g-theory':ti,ab OR 'g theory':ti,ab OR 'generalizability theory':ti,ab OR 'generalisability theory':ti,ab


Step 5: Data extraction

Once you have included all relevant articles, you check per article which measurement properties were evaluated (and subsequently decide which COSMIN boxes are relevant to complete for the specific article). When reading through the article at this point, we recommend extracting all information from the article about the characteristics of the included measurement instruments (for suggestions of characteristics see appendix 4), including aspects of feasibility and interpretability (see below). Interpretability is defined as the degree to which one can assign qualitative meaning (that is, clinical or commonly understood connotations) to a quantitative score or change in scores of an outcome measurement instrument (7). Both the interpretability of single scores and the interpretability of change scores are informative to report in a systematic review. The interpretation of single scores can be supported by providing information on the distribution of scores in the study population or other relevant subgroups, as it may reveal clustering of scores, and it can indicate floor and ceiling effects. The interpretability of change scores can be enhanced by reporting M(C)IC values. However, there is an ongoing debate about how these values should be assessed.

Feasibility is defined as the ease of application of the measurement instrument in its intended context of use, given constraints such as time or money (49). Aspects of feasibility are, for example, completion time, cost of an instrument, length of the instrument, and type and ease of administration. Feasibility applies to both the patients and the professionals who are involved in the measurement. The concept ‘feasibility’ is related to the concept ‘clinical utility’, where feasibility refers to a measurement instrument, and clinical utility refers to an intervention (50).

Interpretability and feasibility are not measurement properties because they do not refer to the quality of an outcome measurement instrument. However, they are considered important aspects for a well-considered selection of an outcome measurement instrument.


Steps 6-9: Evaluate the measurement properties

Steps 6-9 concern the evaluation of the nine measurement properties of the included outcome measurement instruments. In these steps, per measurement property, data are extracted on the characteristics of the studies and the result of each study; the risk of bias of the included studies is rated using the COSMIN Risk of Bias standards; and the results of the studies are rated by applying the criteria for good measurement properties. Subsequently, all evidence is summarized, and the quality of all available evidence per measurement property per measurement instrument is graded using a modified GRADE approach.

Characteristics of the studies refer to the characteristics of the included patient populations and the population of included professionals (for suggestions of characteristics see appendix 5). For specific recommendations for extracting information on the results of studies on reliability and measurement error, see step 8, extracting information (p 53).

In step 6 the content validity is assessed. In step 7 the internal structure (structural validity, internal consistency and cross-cultural validity/measurement invariance) is assessed. As the assessment of reliability and measurement error requires an additional step (i.e. understanding how the results of a study inform you about the reliability or measurement error of an outcome measurement instrument), these two measurement properties are now assessed in a separate step, i.e. step 8, apart from the assessment of the measurement properties criterion validity, hypotheses testing for construct validity, and responsiveness (i.e. step 9).

Step 6. Evaluate content validity

In step 6 content validity is evaluated. In the current standards and criteria for assessing content validity of PROMs (6), emphasis is put on the relevance, comprehensiveness, and comprehensibility of the PROM for the construct, target population, and intended context of use. In this assessment the development of the PROM is also considered, specifically the item elicitation phase and the results from the pilot-testing phase. The assessment of content validity of other types of instruments may be different, and more research is needed to develop standards and criteria for other types of measurement instruments.

Assessing the content validity of measurement instruments that include multiple items, based on either a reflective or a formative model, can lean heavily on the standards and criteria for PROMs. However, because professionals are involved in the measurement, the three aspects of content validity (i.e. relevance, comprehensiveness, and comprehensibility) should be asked of the professionals. Depending on the construct of interest, these aspects could be asked of patients, too, for example for PerFOMs, or ClinROMs about symptoms or severity of conditions.


For the assessment of content validity of measurement instruments that consist of a single parameter (e.g. imaging-based parameters, or laboratory values), other aspects are likely more relevant. For example, you should judge whether it makes sense that the measurement instrument indeed measures the construct it purports to measure, based on theory and medical knowledge, and based on the claims by the manufacturer. In addition, the unit of measurement should match the construct to be measured. For example, a 6-minute walk test, expressed as the distance covered over a time of 6 minutes, measures walking capacity rather than physical functioning (51). As currently no standards and criteria for content validity exist for such instruments, face validity (which is a rather subjective judgment about whether the content of the instrument indeed looks like an adequate reflection of the construct to be measured) could be assessed by the reviewer.

Step 7. Evaluate the internal structure

In step 7 the internal structure (structural validity, internal consistency and cross-cultural validity/measurement invariance) is assessed. This step is only relevant when the measurement instrument is a multi-item instrument based on a reflective model. The standards (1) and criteria (5) provided for systematic reviews of PROMs can be used.

Step 8. Evaluate reliability and measurement error

Next, in step 8 reliability and measurement error are assessed. In chapters 2 and 3 we have explained how to assess the quality of each study on reliability and measurement error.

In a systematic review, per study, you should first extract information about the elements of a comprehensive research question (see chapter 2), the specific ICC model or formula, and the results of each study. Next, you should assess the study quality using the standards (see chapter 3), and assess the results of each study by comparing them against the criteria for good measurement properties (5). Subsequently, you should summarize all evidence for reliability and for measurement error, respectively, and grade the quality of the evidence using the modified GRADE approach (5). Based on this overview, you can make a recommendation on the best-evidence measurement protocol for a specific measurement instrument.

Extracting information

In Appendix 1 we provide an example of a data extraction table. First, we recommend extracting the seven elements of a comprehensive research question, and the research question as stated by the authors in the article. Based on the elements, you can subsequently formulate a comprehensive research question. Next, we recommend extracting the information about the key elements of the review, i.e. the construct, population, type of measurement instrument, and measurement properties of interest. The construct to be measured (element 3 of a comprehensive research question) and the specific measurement properties (element 4 of a comprehensive research question) have already been extracted, so the target population and the type of measurement instrument remain to be extracted. The target population refers to the target population of the specific study. In the example of Skeie et al. (19), the target population consisted of patients with low-back pain. This can be different from the study population (i.e. the sample used) as extracted in item 7, or (slightly) different from the target population of the review (e.g. a broader population). In the study of Skeie, not only patients with low-back pain were included, but also patients with other spinal complaints such as midback pain, neck pain, and/or extremity pain, or even pain-free subjects. The type of measurement instrument refers to whether the instrument under study is a ClinROM, PerFOM, laboratory value, PROM or ObsROM.

Last, we recommend extracting information about the statistics: the model or formula used, the result and, if applicable, its 95% confidence interval. If available, we recommend extracting the variance components, or the SD sample or SD difference (see also chapter 3.2 for more explanation). For ordinal or dichotomous data we recommend extracting the raw numbers in the cells plus the marginal totals.

Risk of Bias assessment

The next step in the review is to assess the quality of each study, using Part B of the Risk of Bias tool to assess reliability and measurement error (as described in chapter 3). We recommend using the worst-score-counts method to arrive at an overall rating per study. In Appendix 2 we provide an example of a table to organize these ratings. We recommend that each study is assessed by two independent reviewers, and that they reach consensus.

Comparison against the criteria for good measurement properties

Each result of each single study on reliability or measurement error is now compared against the criteria for good measurement properties (5). As no criteria for the unweighted Kappa and the CV were provided in the guidelines for systematic reviews of PROMs, we added these missing criteria (see Table 8). Criteria for % specific agreement are difficult to set, because they are, just like sensitivity and specificity, highly dependent on the situation. As a rule of thumb 80% might be used.
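The worst-score-counts method amounts to taking the lowest rating over all applicable standards of a box. A minimal sketch follows, assuming ratings are recorded as lower-case strings and that ‘NA’ or ‘not applicable’ entries are skipped; the function name is hypothetical. The example ratings are those given for reliability in Table 7.

```python
# Ratings ordered from best to worst.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]
NOT_APPLICABLE = {"na", "not applicable"}


def worst_score_counts(ratings):
    """Return the worst rating among the applicable standards, or None if none apply."""
    applicable = [r.lower() for r in ratings if r.lower() not in NOT_APPLICABLE]
    if not applicable:
        return None
    return max(applicable, key=RATING_ORDER.index)


# Reliability ratings for standards 1 to 9 as listed in Table 7.
study_1 = ["NA", "NA", "very good", "very good", "very good",
           "doubtful", "adequate", "not applicable", "not applicable"]
study_2 = ["very good", "very good", "very good", "very good", "very good",
           "very good", "adequate", "not applicable", "not applicable"]
print(worst_score_counts(study_1))
print(worst_score_counts(study_2))
```

Applied to Table 7, this gives the final reliability ratings ‘doubtful’ for study 1 and ‘adequate’ for study 2.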


Table 8. Extended criteria for good reliability and measurement error (adapted from Prinsen et al. (5))

Reliability
+  ICC or (weighted) Kappa ≥ 0.70
?  ICC or (weighted) Kappa not reported
-  ICC or (weighted) Kappa < 0.70

Measurement error
+  SDC or LoA or CV*√2*1.96 < M(C)IC ¹; % specific agreement > 80% ²
?  MIC not defined
-  SDC or LoA or CV*√2*1.96 > M(C)IC ¹; % specific agreement < 80% ²

¹ The M(C)IC value may come from another study.
² Sometimes a higher percentage is more appropriate; when substantiated, this could be appropriate, too.
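The criteria in Table 8 can be applied mechanically to each study result. The following sketch assumes that a missing ICC/kappa or an undefined M(C)IC is passed as `None`, and, as an assumption not settled by the table, treats an SDC exactly equal to the M(C)IC as insufficient. The function names are hypothetical.

```python
def rate_reliability(icc_or_kappa):
    """'+' if ICC or (weighted) kappa >= 0.70, '-' if lower, '?' if not reported."""
    if icc_or_kappa is None:
        return "?"
    return "+" if icc_or_kappa >= 0.70 else "-"


def rate_measurement_error(sdc_or_loa, mic):
    """'+' if SDC (or LoA) is smaller than the M(C)IC, '?' if the M(C)IC is not defined."""
    if mic is None:
        return "?"
    # Equality is not covered by the table; it is treated as insufficient here.
    return "+" if sdc_or_loa < mic else "-"


print(rate_reliability(0.85))
print(rate_measurement_error(2.1, 3.0))
print(rate_measurement_error(2.1, None))
```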

Summarizing the evidence

To arrive at an overall conclusion about the reliability or the measurement error of an outcome measurement instrument, one should first decide whether the results from multiple studies can be combined. You should take two aspects into account in this decision. 1) Do the results refer to the same information (i.e. do they refer to the same underlying comprehensive research question)? Results from different designs (i.e. different components were varied across the repeated measurements) give you different information about the reliability of an instrument, and therefore cannot simply be summarized. 2) Are the results consistent, that is, are all results either sufficient (+) or insufficient (-)? In case of inconsistency in results, we recommend searching for reasons for this inconsistency, e.g. different designs or statistical models, different populations, or different backgrounds of raters. Subsequently, subgroups of studies can be summarized.

To summarize the evidence, you can either qualitatively summarize the results (e.g. describe the range of the results) or quantitatively pool the results. In reliability studies, only the point estimate of an ICC or Cohen’s kappa is used to conclude whether the specific measurement instrument has sufficient reliability (e.g. in the criteria that we propose above). Therefore, it is not necessary to pool the data to obtain a more precise point estimate.

The measurement error refers to the absolute deviation of the score from the ‘true’ score, or the amount of error in the score. The point estimate of the measurement error parameter refers to this deviation or error, and it is therefore used to know how precisely the measurement instrument is able to measure a patient. To arrive at a more precise point estimate of the measurement error, the parameters obtained in studies with the same design (i.e. that have the same underlying comprehensive research question) can be pooled, when the confidence intervals are also reported (which can be obtained using the sample size (39) or bootstrapping methods (52)).

Note,thatyoushouldonlysummarizeorpoolparametersofmeasurementerrorthatwerederivedfromthesamestudydesignandmodelorformulaused.Forexample,theSEMconsistency(eitherformula4or8,chapter3.2)andSEMagreement(formula2,chapter3.2)shouldnotbecombined.However,SEMconsistencyusingeitherformula4or6(chapter3.2)canbecombinedastheywillleadtothesameresult,andtheSDCconsistencyusingeitherformula5,7,or9(chapter3.2)canbecombined.ThesameresultsarefoundwhenusingeithertheSEMone‐wayrandomeffectsmodel(formula1,chapter3.2)orSEMagreement(formula2,chapter3.2).Thisisbecauseallsourcesofvariance(apartfromthevariancebetweenpatients)aretakenintoaccountinbothformulas.Therefore,theseparameterscanbecombined.

Handling inconsistent results

If the results of studies with the same underlying research question are inconsistent (e.g. both sufficient and insufficient results are found), explanations for the inconsistency should first be explored. For example, slightly different study populations or methods may have been used. If an explanation is found, subgroups of studies (e.g. based on the same study population, or in which the same source of variation is varied) can be summarized. The overall conclusion for (e.g.) reliability can subsequently be drawn per subgroup. When the explanation is found in the quality of the studies (i.e. very good and adequate studies lead to a different overall rating than doubtful and inadequate studies), the doubtful and inadequate quality studies may be reported but ignored in this step; only the very good and adequate quality studies are then considered decisive in determining the overall rating. This should be explained in the manuscript.

If studies with the same underlying research question show inconsistent results, and no explanation can be found, you can conclude that the results are inconsistent.
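The quality-based explanation described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function name is hypothetical, each study is reduced to a (quality, rating) pair with a rating of '+' or '-', and study quality is accepted as the explanation only when the higher-quality and lower-quality studies each agree among themselves but disagree with each other. Other explanations (population, design, raters) need the reviewer's judgement and are not modelled here.

```python
HIGHER_QUALITY = {"very good", "adequate"}


def rating_after_quality_check(studies):
    """studies: list of (quality, rating) pairs, rating being '+' or '-'.

    Returns '+', '-', '±' (inconsistent) or '?' (no studies).
    """
    if not studies:
        return "?"
    all_ratings = {rating for _, rating in studies}
    if len(all_ratings) == 1:
        # Results are consistent; nothing to explain.
        return all_ratings.pop()
    higher = {r for q, r in studies if q in HIGHER_QUALITY}
    lower = {r for q, r in studies if q not in HIGHER_QUALITY}
    # Assumption: quality explains the inconsistency only if each quality
    # group is internally consistent and the two groups disagree.
    if len(higher) == 1 and len(lower) == 1 and higher != lower:
        return higher.pop()  # higher-quality studies are decisive
    return "±"               # unexplained inconsistency


print(rating_after_quality_check([("very good", "+"), ("doubtful", "-")]))
```

If this route is taken, the doubtful and inadequate studies are still reported, and the choice is explained in the manuscript.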

We refer to the User manual of the COSMIN methodology for systematic reviews of PROMs for more information.

Rate the quality of the summarized result

If multiple studies can be qualitatively summarized (e.g. the range of results) or quantitatively pooled, the overall result can again be compared to the criteria for good measurement properties (see Table 8); you can then conclude that the outcome measurement instrument has either sufficient (+) or insufficient (-) reliability or measurement error. Otherwise, you should conclude that the results are inconsistent (±) or indeterminate (?). For more information, we refer to the User manual of the COSMIN methodology for systematic reviews of PROMs.
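The four possible overall ratings can be derived from the ratings of the individual results. This is a minimal sketch with a hypothetical function name. It assumes that indeterminate results are set aside whenever at least one result is sufficient or insufficient; the exact rules are given in the User manual for systematic reviews of PROMs and are not reproduced here.

```python
def overall_rating(ratings):
    """Combine per-study ratings ('+', '-', '?') into one overall rating."""
    # Assumption: '?' results are ignored if any '+' or '-' result exists.
    determinate = {r for r in ratings if r in ("+", "-")}
    if not determinate:
        return "?"   # indeterminate: nothing to base a conclusion on
    if len(determinate) == 2:
        return "±"   # inconsistent: both '+' and '-' were found
    return determinate.pop()


print(overall_rating(["+", "+", "+"]))  # all results sufficient
print(overall_rating(["+", "-"]))       # mixed results
```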

Grading the quality of the evidence using the modified GRADE approach

After summarizing or pooling all evidence per outcome measurement instrument for reliability and for measurement error, and rating the summarized or pooled results against the criteria for good measurement properties, the next step is to grade the quality of the evidence. The quality of the evidence refers to the confidence that the summarized or pooled result is trustworthy. We developed a modified GRADE (Grading of Recommendations Assessment, Development, and Evaluation) approach to grade the evidence as high, moderate, low or very low (5), based on 1) risk of bias (i.e. the methodological quality of the studies), 2) inconsistency (i.e. unexplained inconsistency of results across studies), 3) imprecision (i.e. total sample size of the available studies), and 4) indirectness (i.e. evidence from populations other than the population of interest in the review). This procedure is described in the User manual of the COSMIN methodology for systematic reviews of PROMs (5, 45).
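The grading can be pictured as a ladder with four levels. The sketch below assumes the usual GRADE convention that the evidence starts at high and is lowered one level per downgrading step. How many steps each of the four factors warrants is described in the User manual for systematic reviews of PROMs (5, 45) and has to be decided by the reviewer; the function and parameter names are hypothetical.

```python
LEVELS = ["high", "moderate", "low", "very low"]


def grade_quality(risk_of_bias=0, inconsistency=0, imprecision=0, indirectness=0):
    """Return the quality of evidence after downgrading.

    Each argument is the number of levels to downgrade for that factor,
    as judged by the reviewer (not computed in this sketch).
    """
    steps = [risk_of_bias, inconsistency, imprecision, indirectness]
    if any(s < 0 for s in steps):
        raise ValueError("downgrading steps cannot be negative")
    # Start at 'high' (index 0) and never go below 'very low'.
    return LEVELS[min(sum(steps), len(LEVELS) - 1)]


print(grade_quality())                # no downgrading
print(grade_quality(risk_of_bias=1))  # one level down for risk of bias
```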

Draw conclusion on ‘best-evidence measurement protocol’

The results of reliability studies with their specific designs inform you whether a source of variation (for example the training of a rater, or the specific machine used) importantly affects the score (i.e. the measurement). If possible, this source of variation should be standardized or restricted in future measurements. By looking at all evidence for the various sources of variation, you can now draw conclusions about how to standardize and restrict the measurement, and describe this best-evidence measurement protocol.

Step 8. Evaluate criterion validity, hypotheses testing for construct validity, and responsiveness

In step 8, criterion validity, hypotheses testing for construct validity, and responsiveness are assessed. The standards (1) and criteria (5) provided for systematic reviews of PROMs can be used.

Steps 10-11. Select the outcome measurement instrument

Steps 10 and 11 concern formulating recommendations (step 10) and reporting the systematic review (step 11).

Step 10. Formulate recommendations

The goal of a systematic review on measurement instruments is to get an overview of all available evidence on the quality of outcome measurement instruments that measure a specific construct in a defined patient population. Based on this overview, and taking aspects of feasibility and interpretability into account, we recommend that you formulate recommendations about the most suitable outcome measurement instrument. To come to an evidence-based and fully transparent recommendation, we recommend categorizing the included measurement instruments into three categories. Per type of measurement instrument you can conclude which instrument(s) are recommended (category A), promising (category B), or insufficient (category C) and should not be used anymore.

Category (A):

We recommend using different definitions of category (A), depending on the structure of the measurement instrument:

Multi-item reflective: evidence for sufficient content validity (any level), AND sufficient internal consistency (at least low quality, meaning also sufficient structural validity)

Multi-item formative: evidence for sufficient content validity (any level)

Single item (single parameter) (no gold standard): sufficient face validity (rated by e.g. the review team), AND evidence for sufficient reliability (any level)

Single item (gold standard available): evidence for sufficient criterion validity, AND evidence for sufficient reliability (any level)

Category (B): outcome measurement instrument not categorized as ‘A’ or ‘C’.

Category (C): outcome measurement instrument with high quality evidence for an insufficient measurement property.
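The category definitions can be written down as a small decision rule. The sketch below is illustrative only: the class, field and function names are hypothetical, each piece of evidence is reduced to a yes/no flag, instruments that fall in neither A nor C are treated as category B, and category C is given precedence when an instrument would otherwise also qualify for A (an assumption, since the definitions do not state which takes priority).

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    structure: str                           # one of the four keys used below
    content_validity: bool = False           # sufficient, any level
    internal_consistency: bool = False       # sufficient, at least low quality
    face_validity: bool = False              # sufficient, rated by the review team
    reliability: bool = False                # sufficient, any level
    criterion_validity: bool = False         # sufficient
    high_quality_insufficient: bool = False  # high quality evidence, insufficient property


def categorize(e: Evidence) -> str:
    """Return 'A' (recommended), 'B' (promising) or 'C' (insufficient)."""
    if e.high_quality_insufficient:
        return "C"  # assumption: C takes precedence over A
    meets_a = {
        "multi-item reflective": e.content_validity and e.internal_consistency,
        "multi-item formative": e.content_validity,
        "single item, no gold standard": e.face_validity and e.reliability,
        "single item, gold standard": e.criterion_validity and e.reliability,
    }[e.structure]
    return "A" if meets_a else "B"


print(categorize(Evidence("single item, no gold standard",
                          face_validity=True, reliability=True)))
```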

Step 11. Report the systematic review

In accordance with the PRISMA Statement (53, 54), we recommend reporting the following information:

(1) the search strategy (for example on a website or in the (online) supplemental materials to the article at issue), and the results of the literature search and selection of the studies and measurement instruments, displayed in the PRISMA flow diagram (including the final number of articles and the final number of measurement instruments included in the review) (Appendix 3);

(2) the characteristics of the included measurement instruments, including aspects of feasibility and interpretability (Appendix 4);

(3) the characteristics of the studies, including the characteristics of the included patient populations and the population of included professionals (Appendix 5);

(4) the methodological quality ratings of each study per measurement property per measurement instrument (i.e. very good, adequate, doubtful, inadequate), the results of each study, and the accompanying ratings of the results based on the criteria for good measurement properties (sufficient (+) / insufficient (-) / indeterminate (?)). An example is provided in the User Manual for conducting systematic reviews of PROMs (45). In Appendix 6 we provide examples specifically for columns on reliability and measurement error. The table could be published, for example, as an appendix or as supplemental material on the website of the journal only;

(5) a Summary of Findings (SoF) table per measurement property, including the pooled or summarized results of the measurement properties, their overall rating (i.e. sufficient (+) / insufficient (-) / inconsistent (±) / indeterminate (?)), and the grading of the quality of evidence (high, moderate, low, very low). An example is provided in the User Manual for conducting systematic reviews of PROMs (45). In Appendix 7 we provide examples specifically for columns on reliability and measurement error. These SoF tables (i.e. one per measurement property) will ultimately be used in providing recommendations for the selection of the most appropriate PROM for a given purpose or a particular context of use.

Appendix 1. Data extraction table of relevant information for each included study in a systematic review.

Extraction item | Instruction | Study 1 | Study 2

Elements of a comprehensive research question
1. Name of the instrument | Alternatively: type of instrument and parameter | |
2. Version or way of operationalization | All relevant components that are known or expected to influence the score, and which are standardized or restricted (facet of stratification (23)) | Equipment: / Preparatory actions: / Unprocessed data/sample collection: / Data processing and storage: / Assignment of the score/determination of the value: | Equipment: / Preparatory actions: / Unprocessed data/sample collection: / Data processing and storage: / Assignment of the score/determination of the value:
3. Construct | Description of what is being measured | |
4. Measurement property | Reliability and/or measurement error | |
5. Components that will be repeated | e.g. whole measurement (i.e. all components) or some of the components | |
6. Source(s) of variation varied | Components which are varied across the measurements (i.e. focus of analysis; facet of generalizability (23)) | |
7. Patient population | (i.e. facet of differentiation (23)) | |

The research question
Published research question | As formulated by the authors | |
Comprehensive research question | As formulated by the reviewer | |

Additional key element of research aim of the review
Target population | Description of the population to which the authors want to generalize | |
Types of measurement instrument | e.g. ClinROM, PerFOM, laboratory value, PROM or ObsROM | |

Statistical information and results
Model or formula used | Statistical model | |
Result | e.g. results (95% CI) of ICC, kappa, SEM, LoA and systematic difference | |
Variance components | All reported variance components | |
Apply criteria for good measurement property* | sufficient (+), insufficient (-), or indeterminate (?) | |

* Although this is a rating, and not data extraction, we include it here, as the information required to make the rating is extracted here.

Appendix 2. Risk of Bias ratings per standard per study

Risk of Bias rating | Study 1: rater 1 | Study 1: rater 2 | Study 1: consensus

Design requirements
1 Stability of the patients | | |
2 Time interval | | |
3 Similarity of measurement condition | | |
4 Administration without knowledge of scores or values | | |
5 Score assignment or determination of values without knowledge of the scores or values | | |
6 Other important flaws | | |

Statistical methods
7 For continuous scores: ICC | | |
8 For ordinal scores: Kappa | | |
9 For dichotomous/nominal scores: Kappa for each category against the other categories combined? | | |

Final rating | | |
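The final rating in the last row follows from the ratings of the individual standards through the “worst-score-counts” method (section 1.7): the lowest rating given to any standard determines the rating of the study. A minimal sketch, with a hypothetical function name and the assumption that standards marked "not applicable" are left out of the comparison:

```python
# Ratings ordered from best to worst.
ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def final_rating(standard_ratings):
    """Return the worst rating among the applicable standards."""
    # Assumption: 'not applicable' standards do not count towards the rating.
    rated = [r for r in standard_ratings if r != "not applicable"]
    if not rated:
        raise ValueError("no applicable standards were rated")
    # The rating with the highest index in ORDER is the worst one.
    return max(rated, key=ORDER.index)


consensus = ["very good", "adequate", "very good", "doubtful",
             "very good", "very good", "adequate"]
print(final_rating(consensus))
```

The same function can be applied to each rater's column separately before the consensus column is filled in.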

Appendix 3. Example of a flow chart

Appendix 4. Example of reporting table on characteristics of the included measurement instruments.

Name (reference to first article) | Construct | Intended context of use | Best-evidence measurement protocol | Target population | Type of measurement instrument | Feasibility aspects | Interpretability aspects
LMM thickness (19) | Thickness of resting muscle | Evaluation | Training diagnostic ultrasound. Specific instructions for patient, and probe positions. | Patients with low back pain | Ultrasound | | Mean score in mix of pain patients was 27.9 mm (±3.2)
LMM contraction (19) | Comparison of the thickness of resting muscle with that of activated muscle | Evaluation | Training diagnostic ultrasound. Specific instructions for patient, and probe positions. | Patients with low back pain | Ultrasound | | Mean score in mix of pain patients ranges from 1.3 mm (±1.7) to 3.5 mm (±2.6)

Other characteristics which may be extracted are: conceptual model used, recommended by standardization initiatives, full copy available, fit for purpose (diagnostic, prognostic, evaluation).

Aspects of feasibility are, for example, completion time, licensing information and costs of an instrument, and type and ease of administration. Feasibility applies to both the patients and the professionals who are involved in the measurement. You may consider reporting this information in a separate table.

Aspects of interpretability refer to 1) interpretability of single scores (e.g. information on the distribution of scores in the study population or other relevant subgroups, and floor and ceiling effects), and 2) interpretability of change scores (i.e. M(C)IC values).

Appendix 5. Example of reporting table on characteristics of the study populations.

Measurement instrument | Reference | Measurement property assessed | Patient population: sample size | Patient population: patient characteristics | Professional population: sample size | Professional population: characteristics of professionals | Response rate
LMM contraction | (19) Study 2 | Reliability, measurement error | 30 | 47% female, age mean (SD) 37 (±12); LBP n=20; neck/mid back pain n=5; extremity pain n=1; pain free n=4 | 2 | Chiropractors experienced in diagnostic ultrasound for the musculoskeletal system, i.e. 4 and 8 years resp., with a postgraduate diploma in diagnostic ultrasound. Before the study, both developed the protocol of diagnostic ultrasound that was applied in this study. |
| (19) Study 3 | Reliability | 30 | 50% female, age mean (SD) 38 (±11); LBP n=23; neck/mid back pain n=7 | 2 | |
| (19) Study 4 | Reliability, measurement error | 30 | 43% female, age mean (SD) 40 (±11); LBP n=20; neck/mid back pain n=6; extremity pain n=3; pain free n=1 | 2 | |
B | 1 | | | | | |
| 2 | | | | | |

Patient characteristics refer to, e.g. age, gender, disease characteristics (diagnosis, disease duration, disease severity), setting, and geographical location.

Rater characteristics may refer to, e.g. professional background, specific training received, or years of experience.

Appendix 6. Overview table of quality and results of studies on reliability and measurement error.

Measurement instrument (MI) (ref) | Type of MI | Reliability: n | Reliability: study quality | Reliability: result (rating) | Measurement error: n | Measurement error: study quality | Measurement error: result (rating)
LMM contraction score (study 2) (19) | Ultrasound | 30 | Adequate | 0.97 (0.92-0.98) | 30 | Adequate | LoA [−0.94; 1.22 mm]
LMM contraction score (study 3) (19) | Ultrasound | 30 | Adequate | 0.94 (0.88-0.97) | | |
LMM contraction score (study 4) (19) | Ultrasound | 30 | Adequate | 0.97 (0.94-0.99) | 30 | Adequate | LoA [−1.32; 1.25 mm]
LMM contraction score (ref) | | | | | | |
LMM contraction score (ref) | | | | | | |
Pooled or summary result (overall rating) | | 90 | | 0.94-0.97 (+) | 90 | | SDC consistency = 1.08; 1.29 (a)

(a) Calculated from LoA
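The SDC values in the summary row can be reproduced from the reported limits of agreement. The sketch below assumes the standard Bland and Altman definition, LoA = mean difference ± 1.96 × SD of the differences, together with SDC = 1.96 × √2 × SEM and SEM consistency = SD of the differences / √2. Under these assumptions the SDC equals half the width of the LoA interval. The function name is hypothetical.

```python
def sdc_from_loa(lower: float, upper: float) -> float:
    """SDC as half the width of the limits of agreement.

    LoA = d ± 1.96 * SD_diff, so (upper - lower) / 2 = 1.96 * SD_diff,
    which equals 1.96 * sqrt(2) * SEM_consistency.
    """
    if upper < lower:
        raise ValueError("upper limit must not be below lower limit")
    return (upper - lower) / 2.0


study2 = sdc_from_loa(-0.94, 1.22)  # approximately 1.08 mm
study4 = sdc_from_loa(-1.32, 1.25)  # approximately 1.285 mm, reported as 1.29
print(study2, study4)
```

The mean difference cancels out in this calculation, so the result describes random error only; a systematic difference between the repeated measurements is not reflected in it.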

Appendix 7. Summary of Findings tables for reliability and measurement error.

Based on the studies on reliability described by Skeie (19)

Reliability | Summary result | Overall rating | Quality of evidence
Ultrasound measurement of the LMM contraction score – best-evidence measurement protocol: rater, day and active motor tasks performed before measurement were not of influence | Range ICC: 0.94-0.97 | Sufficient | High (two studies of adequate quality)
Measurement instrument B – | | |

Based on the studies on measurement error described by Skeie (19)

Measurement error | Summary result | Overall rating | Quality of evidence
Ultrasound measurement of the LMM contraction score – best-evidence measurement protocol: rater, day and active motor tasks performed before measurement were not of influence | Range SDC consistency: 1.08-1.29; MIC = not assessed | ? |
Measurement instrument B – | | |

References

1. Mokkink LB, de Vet HCW, Prinsen CAC, Patrick DL, Alonso J, Bouter LM, et al. COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res. 2018;27(5):1171-9.
2. Walton MK, Powers JA, Hobart J, et al. Clinical outcome assessments: A conceptual foundation – Report of the ISPOR Clinical Outcomes Assessment Emerging Good Practices Task Force. Value Health. 2015;18:741-52.
3. Powers JH, 3rd, Patrick DL, Walton MK, Marquis P, Cano S, Hobart J, et al. Clinician-Reported Outcome Assessments of Treatment Benefit: Report of the ISPOR Clinical Outcome Assessment Emerging Good Practices Task Force. Value Health. 2017;20(1):2-14.
4. Mokkink LB, Boers M, van der Vleuten CPM, Bouter LM, Alonso J, Patrick DL, et al. COSMIN Risk of Bias tool to assess the quality of studies on reliability or measurement error of outcome measurement instruments: a Delphi study. BMC Medical Research Methodology. 2020;20(293).
5. Prinsen CAC, Mokkink LB, Bouter LM, Alonso J, Patrick DL, de Vet HCW, et al. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1147-57.
6. Terwee CB, Prinsen CAC, Chiarotto A, Westerman MJ, Patrick DL, Alonso J, et al. COSMIN methodology for evaluating the content validity of patient-reported outcome measures: a Delphi study. Qual Life Res. 2018;27(5):1159-70.
7. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63(7):737-45.
8. Hamilton M. The assessment of anxiety states by rating. Br J Med Psychol. 1959;32(1):50-5.
9. Douglas PS, DeCara JM, Devereux RB, Duckworth S, Gardin JM, Jaber WA, et al. Echocardiographic imaging in clinical trials: American Society of Echocardiography Standards for echocardiography core laboratories: endorsed by the American College of Cardiology Foundation. J Am Soc Echocardiogr. 2009;22(7):755-65.
10. Jungmann PM, Welsch GH, Brittberg M, Trattnig S, Braun S, Imhoff AB, et al. Magnetic Resonance Imaging Score and Classification System (AMADEUS) for Assessment of Preoperative Cartilage Defect Severity. Cartilage. 2017;8(3):272-82.
11. Fischer JSJ, A.J.; Kniker, J.E.; Rudick, R.A.; Cutter, G. Multiple Sclerosis Functional Composite (MSFC). Administration and scoring manual. 2001.
12. Genc S, Omer B, Aycan-Ustyol E, Ince N, Bal F, Gurdol F. Evaluation of turbidimetric inhibition immunoassay (TINIA) and HPLC methods for glycated haemoglobin determination. J Clin Lab Anal. 2012;26(6):481-5.
13. Holen JC, Saltvedt I, Fayers PM, Hjermstad MJ, Loge JH, Kaasa S. Doloplus-2, a valid tool for behavioural pain assessment? BMC Geriatr. 2007;7:29.
14. Farooq MN, Mohseni Bandpei MA, Ali M, Khan GA. Reliability of the universal goniometer for assessing active cervical range of motion in asymptomatic healthy persons. Pak J Med Sci. 2016;32(2):457-61.
15. Jordan K, Haywood KL, Dziedzic K, Garratt AM, Jones PW, Ong BN, et al. Assessment of the 3-dimensional Fastrak measurement system in measuring range of motion in ankylosing spondylitis. J Rheumatol. 2004;31(11):2207-15.
16. Correll S, Field J, Hutchinson H, Mickevicius G, Fitzsimmons A, Smoot B. Reliability and Validity of the Halo Digital Goniometer for Shoulder Range of Motion in Healthy Subjects. Int J Sports Phys Ther. 2018;13(4):707-14.
17. D'Agostino MA, Aegerter P, Jousse-Joulin S, Chary-Valckenaere I, Lecoq B, Gaudin P, et al. How to evaluate and improve the reliability of power Doppler ultrasonography for assessing enthesitis in spondylarthritis. Arthritis Rheum. 2009;61(1):61-9.
18. Terwee CB, Mokkink LB, Knol DL, Ostelo RW, Bouter LM, de Vet HC. Rating the methodological quality in systematic reviews of studies on measurement properties: a scoring system for the COSMIN checklist. Qual Life Res. 2012;21(4):651-7.
19. Skeie EJ, Borge JA, Leboeuf-Yde C, Bolton J, Wedderkopp N. Reliability of diagnostic ultrasound in measuring the multifidus muscle. Chiropr Man Therap. 2015;23:15.
20. Mathew AJ, Ostergaard M. Magnetic Resonance Imaging of Enthesitis in Spondyloarthritis, Including Psoriatic Arthritis-Status and Recent Advances. Front Med (Lausanne). 2020;7:296.
21. Butland RJ, Pang J, Gross ER, Woodcock AA, Geddes DM. Two-, six-, and 12-minute walking tests in respiratory disease. Br Med J (Clin Res Ed). 1982;284(6329):1607-8.
22. de Jong K ea. Richtlijnen 6-minutes timed walking test. 2000.
23. Bloch R, Norman G. Generalizability theory for the perplexed: a practical introduction and guide: AMEE Guide No. 68. Med Teach. 2012;34(11):960-92.
24. Feys P, Lamers I, Francis G, Benedict R, Phillips G, LaRocca N, et al. The Nine-Hole Peg Test as a manual dexterity performance measure for multiple sclerosis. Mult Scler. 2017;23(5):711-20.
25. Mathiowetz V, Weber K, Kashman N, Volland G. Adult norms for the Nine Hole Peg Test of finger dexterity. Occup Particip Health. 1985;5:24-38.
26. Arvidsson Lindvall M, Anderzen-Carlsson A, Appelros P, Forsberg A. Validity and test-retest reliability of the six-spot step test in persons after stroke. Physiother Theory Pract. 2020;36(1):211-8.
27. Romani J, Giavedoni P, Roe E, Vidal D, Luelmo J, Wortsman X. Inter- and Intra-rater Agreement of Dermatologic Ultrasound for the Diagnosis of Lobular and Septal Panniculitis. J Ultrasound Med. 2020;39(1):107-12.
28. Gellhorn AC, Carlson MJ. Inter-rater, intra-rater, and inter-machine reliability of quantitative ultrasound measurements of the patellar tendon. Ultrasound Med Biol. 2013;39(5):791-6.
29. Brennan RL. Generalizability Theory. New York: Springer-Verlag; 2001.
30. Govaerts MJ, van der Vleuten CP, Schuwirth LW. Optimising the reproducibility of a performance-based assessment test in midwifery education. Adv Health Sci Educ Theory Pract. 2002;7(2):133-45.
31. McGraw KOW, S.P. Forming inferences about some intraclass correlation coefficients. Psychological Methods. 1996;1:30-46.
32. Shrout PE, Fleiss JL. Intraclass Correlations: Uses in assessing rater reliability. Psychological Bulletin. 1979;86:420-8.
33. Kraemer HC, Periyakoil VS, Noda A. Kappa coefficients in medical research. Tutorial in biostatistics. Statistics in Medicine. 2002;21:2109-29.
34. Cohen J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin. 1968;70:213-20.
35. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;20:37-46.
36. Vach W. The dependence of Cohen's kappa on the prevalence does not matter. J Clin Epidemiol. 2005;58(7):655-61.
37. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement. 1973;33:613-9.
38. Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res. 1999;8(2):135-60.
39. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307-10.
40. de Vet HC, Terwee CB, Mokkink L, Knol DL. Measurement in Medicine. Cambridge: Cambridge University Press; 2011.
41. Euser AM, Dekker FW, le Cessie S. A practical approach to Bland-Altman plots and variation coefficients for log transformed variables. J Clin Epidemiol. 2008;61(10):978-82.
42. de Vet HC, Mokkink LB, Terwee CB, Hoekstra OS, Knol DL. Clinicians are right not to like Cohen's kappa. BMJ. 2013;346:f2125.
43. de Vet HC, Dikmans RE, Eekhout I. Specific agreement on dichotomous outcomes can be calculated for more than two raters. J Clin Epidemiol. 2017.
44. de Vet HCW, Mullender MG, Eekhout I. Specific agreement on ordinal and multiple nominal outcomes can be calculated for more than two raters. J Clin Epidemiol. 2018;96:47-53.
45. Mokkink LB, de Vet HC, Prinsen CA, Patrick DL, Alonso J, Bouter LM, et al. COSMIN methodology for systematic reviews of Patient-Reported Outcome Measures (PROMs) - user manual. 2018 [Available from: www.cosmin.nl].
46. Higgins JP, Green S. Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 [updated March 2011]. The Cochrane Collaboration; 2011 [Available from: www.handbook.cochrane.org].
47. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Reviews. 2013 [Available from: http://methods.cochrane.org/sdt/handbook-dta-reviews].
48. Terwee CB, Jansma EP, Riphagen, II, de Vet HC. Development of a methodological PubMed search filter for finding studies on measurement properties of measurement instruments. Qual Life Res. 2009;18(8):1115-23.
49. Boers M, Kirwan JR, Tugwell P, Beaton D, Bingham CO, III, Conaghan PG, et al. The OMERACT handbook: OMERACT; 2015.
50. Smart A. A multi-dimensional model of clinical utility. International journal for quality in health care: journal of the International Society for Quality in Health Care. 2006;18(5):377-82.
51. Stratford PW, Kennedy D, Pagura SM, Gollish JD. The relationship between self-report and performance-related measures: questioning the content validity of timed tests. Arthritis Rheum. 2003;49(4):535-40.
52. Efron B. Better bootstrap confidence intervals. Journal of the American Statistical Association. 1987;82(397):171-85.
53. Moher D, Liberati A, Tetzlaff J, Altman DG, Group P. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097.
54. Peterson DAB, P.; Jabusch, H. C.; Altenmuller, E.; Frucht, S. J. Rating scales for musician's dystonia: the state of the art. Neurology. 2013;81(6):589-98.