SAS Personal


  • 8/17/2019 SAS Personal


    Q- What is regression analysis?

    A- Regression analysis is a statistical process for determining the relationship between one variable, called the dependent variable, and one or more other variables, called independent variables, of which it is modeled as a function.

    Q- What is logistic regression?

    A- Logistic regression is a statistical process for determining the relationship between a dependent variable whose response is binary and other variables, called independent variables.

    Q- What is covariance?

    A- Covariance is a measure of how much two variables change together.

    Q- What is correlation?

    A- Correlation is a scaled version of covariance: it measures how much two equally scaled variables change together. The result of correlation is called the coefficient of correlation, denoted by r. It ranges from -1.0 to +1.0. Correlation does not have units. Covariance always has units (the product of the units of the two variables).

    Q- What is the formula of covariance?

    A- Let us denote the dependent variable by Y and the independent variable by X. Then the formula of covariance is

    \[ \operatorname{cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n} \ \text{(population)} \qquad \text{or} \qquad \operatorname{cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} \ \text{(sample)}. \]

    Q- What is mean, median and mode?

    A- We use the mean to describe the entire set of observations with a single value representing the center of the data. The mean, or arithmetic mean, is the sum of the observations divided by the number of observations. The median is the central point of the data after arranging it in ascending order: if the number of observations N is odd, the median is the ((N+1)/2)-th value; if N is even, the median is the midpoint of the (N/2)-th and (N/2 + 1)-th values.

    The mode is the value that occurs most frequently in a set of observations.
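The covariance, correlation and central-tendency definitions above can be checked on a small data set. The following is a minimal Python sketch, not part of the original notes; the height and weight values are made up for illustration, and the helper names `covariance` and `correlation` are hypothetical.

```python
from statistics import mean, median, mode

def covariance(xs, ys, sample=True):
    """Covariance of paired observations: divide by n-1 for a sample, n for a population."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

def correlation(xs, ys):
    """Pearson r: covariance rescaled by both standard deviations, so it has no units."""
    return covariance(xs, ys) / (covariance(xs, xs) ** 0.5 * covariance(ys, ys) ** 0.5)

heights_cm = [150, 160, 170, 180, 190]  # made-up illustrative data
weights_kg = [52, 58, 69, 77, 84]

print(covariance(heights_cm, weights_kg))         # sample covariance, in cm * kg
print(covariance(heights_cm, weights_kg, False))  # population covariance
print(correlation(heights_cm, weights_kg))        # unitless, between -1 and +1
print(median([1, 2, 3, 4]), mode([2, 3, 3, 5]))   # even-N median and the mode
```

Changing the units (for example, metres instead of centimetres) rescales the covariance but leaves the correlation unchanged.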

    Q- What is statistics?

    A- A method to get information from a given set of observations.

    Q- What is descriptive statistics?

    A- Descriptive statistics provides a concise summary of data.

    Q- What is inferential statistics?

    A- Inferential statistics uses a random sample of a population to draw conclusions about that population.

    Q- What is a variable?

    A- A variable is some characteristic of a population or sample.

    Q- What is data?

    A- Data is a collection of variables, or the observed values of all variables.

    Q- Define the types of data.

    A- Data is categorized into two categories: numeric (quantitative) data and categorical (qualitative) data. Quantitative data has meaning as a measurement, such as height, weight etc. Numerical data can be further broken into two types: discrete and continuous. Discrete data represent items that can be counted.

    Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line.


    Categorical data: categorical data represent characteristics such as a person's gender. Categorical data fall into two categories: ordinal, where the categories have a natural order (e.g. ranks 1, 2, 3), and nominal, where they do not (e.g. sex).

     

    Q- What is Range?

    Range is the difference between the highest and lowest observations of a variable.

    Q- What is the coefficient of determination?

    The coefficient of determination is a measure of how well the regression line represents the data. If the regression line passes exactly through every point on the scatter plot, it would be able to explain all of the variation. The further the line is away from the points, the less it is able to explain.

    What is the rejection region?

    The rejection region is the range of values such that if the test statistic falls into it, we reject the null hypothesis.


    What is the p-value?

    The p-value of a test is the probability of observing a test statistic at least as extreme as the one computed, assuming that the null hypothesis is true.
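As an illustration of how a rejection region and a p-value relate, here is a minimal Python sketch for a two-sided z-test. The choice of a z-test, the 0.05 significance level and the function names are assumptions made for the example; the notes themselves do not name a specific test.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability, under the null hypothesis, of a standard normal statistic at least as extreme as z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def in_rejection_region(z, alpha=0.05):
    """Two-sided z-test: reject the null when |z| exceeds the critical value for alpha."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical

print(two_sided_p_value(2.5))    # well below 0.05
print(in_rejection_region(2.5))  # True
print(in_rejection_region(1.0))  # False
```

The two views agree: a statistic falls in the rejection region exactly when its p-value is below the significance level.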

    What is SSR?

    SSR stands for the regression sum of squares: the sum of squares of the differences between the mean of the observed dependent variable and the regression (fitted) values of the dependent variable.

    What is SSE?

    SSE is the sum of squared errors: the sum of squares of the differences between the regression value of the dependent variable and the observed value of that dependent variable.

    What is the total sum of squares (SST)?

    It is the sum of SSR and SSE.

    What is the coefficient of determination, or R-square?

    The R-square of a regression model is the proportion of total variance explained by the independent variables of the model.

    It is defined by \( R^2 = SSR / SST \).

    What is adjusted R-square and why do we use it?

    When we add any variable to a linear regression model, the value of R-square never decreases, even if the added variable is not significant. Hence we use adjusted R-square, which is a modified version of R-square.

    When we add an insignificant variable to the model, the adjusted R-square is penalized, whereas its value increases when we add a significant variable to the model.

    R-square assumes that every single variable explains the variation in the dependent variable. The adjusted R-square tells you the percentage of variation explained by only the independent variables that actually affect the dependent variable.

    The formula for adjusted R-square is

    \[ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}, \]

    where n is the number of observations and p is the number of independent variables.
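The sums of squares and both R-square measures can be computed by hand for a one-predictor regression. This is a minimal Python sketch under that assumption; the data values and the names `fit_line` and `r_squared_summary` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def r_squared_summary(xs, ys, p=1):
    """SSR, SSE, SST, R-square and adjusted R-square for a fitted line with p predictors."""
    n = len(xs)
    intercept, slope = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    y_bar = sum(ys) / n
    ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained by the regression
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # left in the residuals
    sst = ssr + sse
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {"SSR": ssr, "SSE": sse, "SST": sst, "R2": r2, "adj_R2": adj_r2}

xs = [1, 2, 3, 4, 5]                # made-up illustrative data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
summary = r_squared_summary(xs, ys)
print(summary)
```

Because (n - 1)/(n - p - 1) is greater than 1, the adjusted value is never larger than the plain R-square.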



    What is standard deviation?

    Standard deviation is a measure of the spread of data around the mean.

    Standard deviation is useful when we compare two data sets in order to assess the variance or spread in each.

    What are the two rules or theorems to calculate the spread of data around the mean?

    There are two rules to calculate the spread of data.

    1- Empirical rule, which states that if the distribution of the data is bell shaped (the data is normally distributed), then
    (a) 68% of the data will fall within one standard deviation of the mean.
    (b) 95% of the data will fall within two standard deviations of the mean.
    (c) 99.7% of the data will fall within three standard deviations of the mean.

    ... where k > 1, so when k = 2 the ...


    One of the key assumptions of regression is that the variance of the errors is constant across observations. If the errors have constant variance, the errors are called homoscedastic. Typically, residuals are plotted to assess this assumption. Standard estimation methods are inefficient when the errors are heteroscedastic, that is, have non-constant variance.

    How to remove heteroscedasticity? Need to learn...

    What is VIF (variance inflation factor)?

    The variance inflation factor (VIF) is used to describe the multicollinearity between the independent variables. It measures how much the variance of an estimated regression coefficient is inflated compared with when there is no linear relation between the IVs.
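For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below is a minimal Python illustration restricted to the two-predictor case, where that R² is simply the squared correlation between the two; the data and function names are hypothetical.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with only two predictors R^2 is their squared correlation."""
    r2 = pearson_r(x1, x2) ** 2
    return 1 / (1 - r2)

x1 = [1, 2, 3, 4, 5]
x2_unrelated = [2, -1, -2, -1, 2]        # uncorrelated with x1 by construction
x2_related = [1.1, 2.0, 2.9, 4.2, 5.0]   # almost a copy of x1

print(vif_two_predictors(x1, x2_unrelated))  # 1.0: no inflation
print(vif_two_predictors(x1, x2_related))    # far above the usual cutoff of 10
```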

    What are the methods to check the fit of a linear regression model?

    • Mean absolute percentage error (MAPE) should be less than or equal to [cutoff illegible in source]

    • VIF should be less than 10

    • Model fit check by plotting predicted vs actual

    • Residual check / heteroscedasticity

    What is logistic regression and when do we use it?

    When the dependent variable is categorical and the IVs are a mix of categorical and quantitative variables, we use logistic regression. It predicts the log odds of the response instead of directly predicting the response.
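To make "predicts the log odds" concrete, here is a minimal Python sketch of the logit and its inverse. The intercept and slope are hypothetical numbers chosen for illustration, not estimates from any real model.

```python
import math

def log_odds(p):
    """Logit: the natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

def probability(log_odds_value):
    """Inverse logit (sigmoid): turns a predicted log odds back into a probability."""
    return 1 / (1 + math.exp(-log_odds_value))

# Hypothetical fitted model: log odds = intercept + slope * x
intercept, slope = -4.0, 0.08
for x in (20, 50, 80):
    z = intercept + slope * x
    print(x, round(z, 2), round(probability(z), 3))
```

The linear part of the model lives on the log-odds scale, which is unbounded; the inverse logit maps it back into the 0 to 1 range of a probability.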

    How many types of logistic regression are there?

    Binary logistic regression: when the response of the dependent variable is dichotomous (only two responses), we use binary logistic regression.

    Ordered or ordinal logistic regression: when the response of the dependent variable is ordinal with more than two levels, e.g. high, low, medium, we use the ordinal logistic regression model.

    Multinomial logistic regression: when the response of the dependent variable is nominal with more than two levels, e.g. type of bread, we use the multinomial logistic regression model.

    How to check multicollinearity in logistic regression in SAS?


    What is MLE (Maximum Likelihood Estimation)?

    Maximum likelihood is a technique to find the regression coefficients of the logistic regression.

    What are the assumptions of MLE?

    • Variance of error in the estimated model is constant.

    What is central tendency of data?

    The term central tendency refers to the middle value of the data. Mean, median and mode are three statistical measures used to measure the central tendency of the data.

    Mean is the arithmetic average, mode is the frequency average and median is the positional average.

     

    What are different types of sampling?


    Simple Random Sample - A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen.

    Stratified Random Sample - A stratified random sample is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum.

    Cluster Sample - A cluster sample is a simple random sample of groups or clusters of elements.

    https://faculty.elgin.edu/dkernler/statistics/ch01/1-4.html

    Simple Random Sampling and Other Sampling Methods

    Sampling methods can be classified into one of two categories:

    Probability Sampling: the sample has a known probability of being selected.

    Non-probability Sampling: the sample does not have a known probability of being selected, as in convenience or voluntary response surveys.

    Probability Sampling

    In probability sampling it is possible to both determine which sampling units belong to which sample

    and the probability that each sample will be selected. The following sampling methods, which are

    listed in Chapter 4, are types of probability sampling:

    1. Simple Random Sampling (SRS)

    2. Stratified Sampling

    3. Cluster Sampling

    4. Systematic Sampling

    5. Multistage Sampling (in which some of the methods above are combined in stages)

    Of the five methods listed above, students have the most trouble distinguishing between stratified

    sampling and cluster sampling.

    https://onlinecourses.science.psu.edu/stat100/print/book/export/html/18


    Stratified Sampling is possible when it makes sense to partition the population into groups based on

    a factor that may influence the variable that is being measured. These groups are then called strata.

    An individual group is called a stratum. With stratified sampling one should:

    • partition the population into groups (strata)

    • obtain a simple random sample from each group (stratum)

    • collect data on each sampling unit that was randomly sampled from each group (stratum)

    Stratified sampling works best when a heterogeneous population is split into fairly homogeneous groups. Under these conditions, stratification generally produces more precise estimates of the population percents than estimates that would be found from a simple random sample. Table 3.2 shows some examples of ways to obtain a stratified sample.

    Table 3.2. Examples of Stratified Samples

    | | Example 1 | Example 2 | Example 3 |
    |---|---|---|---|
    | Population | All people in U.S. | All PSU intercollegiate athletes | All elementary students in the local school district |
    | Groups (Strata) | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district |
    | Obtain a Simple Random Sample | 500 people from each of the 4 time zones | 5 athletes from each of the 26 PSU teams | 20 students from each of the 11 elementary schools |
    | Sample | 4 × 500 = 2000 selected people | 26 × 5 = 130 selected athletes | 11 × 20 = 220 selected students |

    Cluster Sampling is very different from Stratified Sampling. With cluster sampling one should:

    • divide the population into groups (clusters).

    • obtain a simple random sample of so many clusters from all possible clusters.

    • obtain data on every sampling unit in each of the randomly selected clusters.


    It is important to note that, unlike with the strata in stratified sampling, the clusters should be microcosms, rather than subsections, of the population. Each cluster should be heterogeneous. Additionally, the statistical analysis used with cluster sampling is not only different, but also more complicated than that used with stratified sampling.

    Table 3.3. Examples of Cluster Samples

    | | Example 1 | Example 2 | Example 3 |
    |---|---|---|---|
    | Population | All people in U.S. | All PSU intercollegiate athletes | All elementary students in a local school district |
    | Groups (Clusters) | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district |
    | Obtain a Simple Random Sample | 2 time zones from the 4 possible time zones | 8 teams from the 26 possible teams | 4 elementary schools from the 11 possible elementary schools |
    | Sample | every person in the 2 selected time zones | every athlete on the 8 selected teams | every student in the 4 selected elementary schools |
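The difference between the two designs can be sketched in a few lines of Python. This is an illustrative sketch only: the population of 11 schools with 50 students each is hypothetical (the tables do not give school sizes), and the function names are made up for the example.

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Simple random sample of `per_stratum` units from EVERY stratum."""
    return {name: rng.sample(units, per_stratum) for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Simple random sample of whole clusters; EVERY unit in a chosen cluster is kept."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical population: 11 schools with 50 students each
schools = {f"school_{i}": [f"s{i}_{j}" for j in range(50)] for i in range(1, 12)}
rng = random.Random(0)  # fixed seed so the run is repeatable

strat = stratified_sample(schools, 20, rng)  # some students from all 11 schools
clust = cluster_sample(schools, 4, rng)      # all students from 4 schools

print(len(strat), sum(len(v) for v in strat.values()))  # 11 220
print(len(clust), sum(len(v) for v in clust.values()))  # 4 200
```

Stratified sampling touches every group but only part of each; cluster sampling touches only some groups but all of each.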

    What are the different types of error that arise when a sample of observations is taken from a population?

    There are two types of error.

    Sampling error - Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.

    Non-sampling error - Nonsampling errors result from mistakes made in the acquisition of data or from the sample observations being selected improperly.

    1- Errors in data acquisition.

    2- Nonresponse error.

    3- Selection bias. Selection bias occurs when the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample. Together with nonresponse error, selection bias played a role in the Literary Digest poll being so wrong, as voters without telephones or without a subscription to Literary Digest were excluded from possible inclusion in the sample taken.


    What is a random experiment?

    A random experiment is an experiment or process that leads to one of several possible outcomes.

    What is a sample space?

    A sample space of a random experiment is a list of all possible outcomes of the experiment. The outcomes must be exhaustive and mutually exclusive.

    What is the probability of an event?

    The probability of an event is the sum of the probabilities of the simple events that constitute the event.

    What is discrete and continuous data?

    Discrete data can only take certain values.

    Example: the number of students in a class (you can't have half a student).

    Continuous data can take any value (within a range).

    Examples:

    • A person's height: could be any value (within the range of human heights), not just certain fixed heights.

    What is a random variable?

    • A random variable is a function or rule that assigns a number to each outcome of an experiment.

    What is a binomial experiment?

    1. The binomial experiment consists of a fixed number of trials. We represent the number of trials by n.

    2. Each trial has two possible outcomes. We label one outcome a success and the other a failure.

    3. The probability of success is p. The probability of failure is 1 - p.

    4. The trials are independent, which means that the outcome of one trial does not affect the outcomes of any other trials.

    Binomial Probability Distribution

    The probability of x successes in a binomial experiment with n trials and probability of success p is

    \[ P(x) = \frac{n!}{x!\,(n - x)!}\, p^x (1 - p)^{n - x} \quad \text{for } x = 0, 1, 2, \ldots, n. \]
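The binomial probability formula translates directly into code. The following is a minimal Python sketch; the function name `binomial_pmf` and the coin-toss numbers are chosen for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x successes in n independent trials), each with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Four fair coin tosses: probability of exactly two heads
print(binomial_pmf(2, 4, 0.5))  # 0.375

# The probabilities over x = 0..n add up to 1 (up to floating-point rounding)
print(sum(binomial_pmf(x, 10, 0.3) for x in range(11)))
```

`math.comb(n, x)` computes n! / (x! (n - x)!) without building the large factorials explicitly.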

    Poisson distribution: from book.

    What is central limit theorem?


    The Central Limit Theorem (CLT for short) basically says that for non-normal data, the distribution of the sample means has an approximate normal distribution, no matter what the distribution of the original data looks like, as long as the sample size is large enough (usually at least 30) and all samples have the same size.
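A small simulation can make the theorem tangible. The sketch below is a minimal Python illustration under assumed settings: the original data follow a skewed exponential distribution with mean 1, each sample has 30 observations, and 2000 samples are drawn. The seed and all names are arbitrary choices for the example.

```python
import random
from statistics import mean, stdev

def sample_means(draw, sample_size, n_samples):
    """Mean of each of n_samples samples, every sample having the same size."""
    return [mean(draw() for _ in range(sample_size)) for _ in range(n_samples)]

rng = random.Random(42)  # fixed seed so the run is repeatable
means = sample_means(lambda: rng.expovariate(1.0), sample_size=30, n_samples=2000)

# The sample means cluster around the population mean (1.0) with a spread
# close to sigma / sqrt(n) = 1 / sqrt(30), even though the raw data are skewed.
print(round(mean(means), 3))
print(round(stdev(means), 3), round(1 / 30 ** 0.5, 3))
```

Plotting a histogram of `means` would show the roughly bell-shaped curve the theorem describes.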

    What is concordance, discordance and tied pair?

    Percent Concordant: percentage of pairs where the observation with the desired outcome (event) has a higher predicted probability than the observation without the outcome (nonevent).

    Percent Discordant: percentage of pairs where the observation with the desired outcome (event) has a lower predicted probability than the observation without the outcome (nonevent).

    Percent Tied: percentage of pairs where the observation with the desired outcome (event) has the same predicted probability as the observation without the outcome (nonevent).
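The three percentages can be computed by comparing every event with every nonevent. This is a minimal Python sketch; the predicted probabilities are made up and `pair_percentages` is a hypothetical helper name.

```python
from itertools import product

def pair_percentages(event_probs, nonevent_probs):
    """Percent concordant, discordant and tied over all (event, nonevent) pairs."""
    pairs = list(product(event_probs, nonevent_probs))
    concordant = sum(e > n for e, n in pairs)   # event scored higher
    discordant = sum(e < n for e, n in pairs)   # event scored lower
    tied = len(pairs) - concordant - discordant
    total = len(pairs)
    return {
        "concordant": 100 * concordant / total,
        "discordant": 100 * discordant / total,
        "tied": 100 * tied / total,
    }

events = [0.9, 0.7, 0.4]          # predicted probabilities for observed events
nonevents = [0.6, 0.4, 0.2, 0.1]  # predicted probabilities for observed nonevents
result = pair_percentages(events, nonevents)
print(result)  # 10 of the 12 pairs are concordant, 1 discordant, 1 tied
```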

    What is ROC?

    ROC (Receiver Operating Characteristic): in a ROC curve we plot the true positive rate (sensitivity) against the false positive rate (100 - specificity) for different cutoff points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.

    Sensitivity: probability that a test result will be positive when the disease is present (true positive rate, expressed as a percentage) = a / (a + b).

    Specificity: probability that a test result will be negative when the disease is not present (true negative rate, expressed as a percentage) = d / (c + d).

    Here a and b are the numbers of diseased subjects testing positive and negative, and c and d are the numbers of non-diseased subjects testing positive and negative.
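The points of a ROC curve can be generated by sweeping the cutoff and recomputing sensitivity and specificity each time. The sketch below is a minimal Python illustration; the scores, labels and thresholds are made up, rates are returned as fractions rather than percentages, and `roc_points` is a hypothetical helper name.

```python
def roc_points(scores, labels, thresholds):
    """(false positive rate, true positive rate) per cutoff; label 1 marks an event."""
    points = []
    for t in thresholds:
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        tn = sum(s < t and y == 0 for s, y in zip(scores, labels))
        sensitivity = tp / (tp + fn)   # a / (a + b)
        specificity = tn / (fp + tn)   # d / (c + d)
        points.append((1 - specificity, sensitivity))
    return points

scores = [0.9, 0.7, 0.4, 0.6, 0.4, 0.2, 0.1]  # predicted probabilities
labels = [1, 1, 1, 0, 0, 0, 0]                # 1 = event, 0 = nonevent
points = roc_points(scores, labels, thresholds=[0.0, 0.5, 1.01])
print(points)
```

A cutoff of 0 classifies everything as positive (upper right corner of the plot), while a cutoff above every score classifies everything as negative (lower left corner).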

    What is the difference between NODUPKEY and NODUPRECS?

        data test1;
          input id1 $ id2 $ extra;
          cards;
        aa ab 3
        aa ab 1
        aa ab 2
        aa ab 3
        ;
        run;
        options nocenter;

        /* NODUPRECS (alias NODUP) compares the values of all variables of each
           observation with those of the previous observation written to the
           output and, if it is a duplicate, does not write it */
        proc sort data=test1 out=recs1 nodup;
          by id1;
        run;
        proc print data=recs1;
        run;

        proc sort data=test1 out=recs2 nodup;
          by id1 id2;
        run;
        proc print data=recs2;
        run;

        proc sort data=test1 out=recs3 nodup;
          by id1 id2 extra;
        run;
        proc print data=recs3;
        run;

        /* NODUPKEY checks for and deletes the observations that have
           duplicate BY variable values */
        proc sort data=test1 out=key1 nodupkey;
          by id1;
        run;
        proc print data=key1;
        run;

        proc sort data=test1 out=key2 nodupkey;
          by id1 id2;
        run;
        proc print data=key2;
        run;

        proc sort data=test1 out=key3 nodupkey;
          by id1 id2 extra;
        run;
        proc print data=key3;
        run;

    What are the differences between the WHERE and IF statements?

    1- The IF statement can be used only in a DATA step, whereas the WHERE statement can be used in a DATA step as well as a PROC step, and also as a data set option:

        data test (where=(name='gau'));

    2- The IF statement can be used in a DATA step that reads raw records with an INPUT statement, whereas the WHERE statement cannot be used there.

        /* some digits of the sample rows are illegible in the source,
           so the values below are illustrative */
        data test;
          input a b c;
          if b in (2, 3);
          datalines;
        1 3 3
        4 5 4
        7 5 3
        ;
        run;

    3- The IF statement is applied after the data is read into the PDV, whereas the WHERE statement is applied before the data is written to the PDV.

    Execute multiple conditional statements

    Suppose you have data for college students' mathematics scores. You want to rate them on the basis of their scores.

    Conditions:


    1. If a score is less than 40, create a new variable named "Rating" and give a "Poor" rating to these students.

    What are SAS macros?

    Macros are used to perform repetitive tasks.

    What is a macro variable and how many types of macro variables are there?

    Macro variables are used to store values (text). There are two types of macro variable.

    Local - If the macro variable is defined inside macro code, then its scope is local. It is available for use in that macro only and is removed when the macro finishes.

    Global - If the macro variable is defined outside macro code, then its scope is global. It can be used anywhere in the SAS program and is removed at the end of the session.

        data test;
          input a;
          datalines;
        12
        ;
        run;

        /* global variable */
        %let dat = test;

        /* define macro */
        %macro sample;
          proc print data=&dat;
          run;
        %mend sample;

        /* invoke macro */
        %sample
        run;

        /* local variable */
        %macro sample1;
          %let dat1 = test;
          proc print data=&dat1;
          run;
        %mend sample1;

        /* invoke macro */
        %sample1
        run;

    Note that the value does not require quotation marks, even for character values.

    What are the methods to create macro variables?

    1- %LET statement

    2- Macro parameters (named and positional)

    • Positional: in this we provide only the parameter names while defining the macro.

        %macro <macro name>(obs, b);
          data t2;
            set &b;
          run;
        %mend;

    • Keyword parameters: in this we provide the parameter name with an equal sign, and we can also assign default values to the parameters.

    Definition

        %MACRO <macro name>(Parameter1=Value, Parameter2=Value, ..., Parameter-n=Value);
          Macro Text;
        %MEND;

    Calling

        %<macro name>(Parameter1=Value, Parameter2=Value, ..., Parameter-n=Value);

    3- CALL SYMPUT: suppose we want to use the variable ... (need to check).


    What is the difference between PROC UNIVARIATE and PROC MEANS?

    1- Both procedures produce descriptive statistics. PROC UNIVARIATE by default produces all the statistics (sometimes not all are required), whereas in PROC MEANS it is possible to request only the statistics that we want.

    2- PROC UNIVARIATE produces histograms, quartiles and box plots, whereas PROC MEANS does not by default (it can report quartiles on request, but it does not produce plots).

    What are the rules for SAS data sets?

    1- A SAS data set name can be 1 to 32 characters long.

    2- It should start with an underscore or a letter, and subsequent characters can be letters, numerals or underscores.

    3- A SAS data set consists of two parts: the descriptor portion and the data portion.

    What are the SAS variable attributes?

    • Name: can be 1 to 32 characters long. It should start with an underscore or a letter, and subsequent characters can be letters, numerals or underscores.

    • Type: numeric, which holds numbers (digits, +, -, ., E and scientific notation), or character, which can contain letters.

    • A missing character value is represented by a blank, while a missing numeric value is represented by a period (.).

    • Length: a SAS variable has a default length of 8 bytes; a character variable can be up to 32,767 bytes long.


    What is the YEARCUTOFF system option?

    When SAS encounters a two-digit year in a date from an external or internal source, we normally use the YEARCUTOFF= option, because SAS by default considers 1...
