SAS Personal

8/17/2019 SAS Personal

Q- What is regression analysis?

A- Regression analysis is a statistical process for determining the relationship between one variable, called the dependent variable, and one or more other variables, called independent variables.

Q- What is logistic regression?

A- Logistic regression is a statistical process for determining the relationship between a dependent variable whose response is binary and one or more other variables, called independent variables.

Q- What is covariance?

A- Covariance is a measure of how much two variables change together.

Q- What is correlation?

A- Correlation is a scaled version of covariance: it measures how much two variables change together once both are put on the same scale. The result of correlation is called the coefficient of correlation, denoted by r. It ranges from -1.0 to +1.0. Correlation does not have units. Covariance is expressed in the product of the units of the two variables.

Q- What is the formula of covariance?

A- Let us denote the dependent variable by Y and the independent variable by X. Then the covariance is

population: $\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)}{N}$

sample: $\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

Q- What is mean, median and mode?

A- We use the mean to describe the entire set of observations with a single value representing the center of the data. The mean, or arithmetic mean, is the sum of the observations divided by the number of observations. The median is the central point of the data after arranging it in ascending order. If the number of observations N is odd, the median is the value at position (N+1)/2; if N is even, the median is the midpoint of the values at positions N/2 and N/2 + 1.

The mode is the value that occurs most frequently in a set of observations.
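The definitions above can be tried on a small data set. The following is a minimal sketch, not part of the original notes: the height and weight figures are made up, and the helper names `covariance` and `correlation` are chosen here for illustration.

```python
from statistics import mean, median, mode

def covariance(xs, ys, sample=True):
    """Covariance of paired observations; divides by n-1 (sample) or n (population)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    total = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

def correlation(xs, ys):
    """Coefficient of correlation r: covariance scaled by both standard deviations."""
    sd_x = covariance(xs, xs) ** 0.5
    sd_y = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sd_x * sd_y)

heights = [150, 160, 170, 180, 190]   # hypothetical data, in cm
weights = [52, 58, 69, 77, 84]        # hypothetical data, in kg

print(mean(heights), median(heights), mode([1, 2, 2, 3]))
print(covariance(heights, weights))   # carries units (cm * kg)
print(correlation(heights, weights))  # unitless, between -1 and +1
```

Rescaling the heights to metres would change the covariance but leave the correlation unchanged, which is the sense in which correlation is the "scaled version" of covariance.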

Q- What is statistics?

A- A method to get information from a given set of observations.

Q- What is descriptive statistics?

A- Descriptive statistics provides a concise summary of data.

Q- What is inferential statistics?

A- Inferential statistics uses a random sample of a population to draw conclusions about that population.

Q- What is a variable?

A- A variable is some characteristic of a population or sample.

Q- What is data?

A- Data is the collection of observed values of all variables.

Q- Define the types of data.

A- Data is categorized into two categories: numeric (quantitative) data and categorical (qualitative) data. Quantitative data has meaning as a measurement, such as height or weight. Numerical data can be further broken into two types: discrete and continuous. Discrete data represent items that can be counted.

Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line.


Categorical data: categorical data represent characteristics such as a person's gender. Categorical data fall into two categories: ordinal, where the categories have a natural order (for example ratings 1, 2, 3), and nominal, where they have no order (for example sex).

Q- What is Range?

Range is the difference between the highest and lowest observations of a variable.

Q- What is the coefficient of determination?

The coefficient of determination is a measure of how well the regression line represents the data. If the regression line passes exactly through every point on the scatter plot, it would be able to explain all of the variation. The further the line is away from the points, the less it is able to explain.

What is the rejection region?

The rejection region is the range of values such that if the test statistic falls into it, we reject the null hypothesis.


What is the p-value?

The p-value of a test is the probability of observing a test statistic at least as extreme as the one computed, assuming that the null hypothesis is true.
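As an illustration of the p-value definition, here is a minimal sketch for one specific case that the notes do not spell out: a two-sided test whose statistic follows the standard normal distribution under the null hypothesis. The observed statistic `z` and the significance level `alpha` are made-up values.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal test statistic Z under the null hypothesis."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05   # hypothetical significance level
z = 2.1        # hypothetical observed test statistic
p = two_sided_p_value(z)
print(round(p, 4))
print("reject H0" if p < alpha else "fail to reject H0")
```

Rejecting when p is below alpha is the same decision as checking whether the test statistic falls in the rejection region for that alpha.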

What is SSR?

SSR stands for the sum of squares due to regression: the sum of squared differences between the mean of the observed dependent variable and the regression values of the dependent variable.

What is SSE (sum of squared errors)?

It is the sum of squared differences between the regression value of the dependent variable and the observed value of that dependent variable.

What is the total sum of squares (SST)?

It is the sum of SSR and SSE.

What is the coefficient of determination, or R square?

The R² of a regression model is the proportion of total variance explained by the independent variables of the model.

It is defined by $R^2 = SSR / SST$.

What is adjusted R square and why do we use it?

When we add any variable to a linear regression model, the value of R square increases whether or not the added variable is significant. Hence we use adjusted R², which is a modified version of R square.

When we add an insignificant variable to the model, the adjusted R square is penalized, whereas its value increases when we add a significant variable to the model.

R² assumes that every single variable explains the variation in the dependent variable. The adjusted R² tells you the percentage of variation explained by only the independent variables that actually affect the dependent variable.

The formula for adjusted R square is $R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - 1 - p}$, where n is the number of observations and p is the number of independent variables.
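The sums of squares and both R square measures can be computed by hand for a one-variable regression. This is a minimal sketch under stated assumptions: the data are made up, the line is fitted by ordinary least squares, and the function names are chosen here for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1*x (one independent variable)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def r_squared_report(xs, ys, p=1):
    """Return SSR, SSE, SST, R^2 and adjusted R^2 for a fitted line with p predictors."""
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    my = sum(ys) / n
    ssr = sum((f - my) ** 2 for f in fitted)              # explained by the regression
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # left in the residuals
    sst = sum((y - my) ** 2 for y in ys)                  # total; equals SSR + SSE
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 1 - p)
    return ssr, sse, sst, r2, adj_r2

xs = [1, 2, 3, 4, 5, 6]                  # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
ssr, sse, sst, r2, adj_r2 = r_squared_report(xs, ys)
print(round(r2, 4), round(adj_r2, 4))
```

Because (n - 1)/(n - 1 - p) is greater than 1 whenever p is at least 1, the adjusted value can never exceed R square.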

What is standard deviation?

Standard deviation is the measure of the spread of data around the mean.

Standard deviation is useful when we compare two data sets in order to compare the variability or spread in both.

What are the two rules or theorems used to describe the spread of data around the mean?

There are two rules to describe the spread of data:

1- Empirical rule, which states that if the distribution of the data is bell shaped (the data is normally distributed), then
(a) 68% of the data fall within one standard deviation of the mean.
(b) 95% of the data fall within two standard deviations of the mean.
(c) 99.7% of the data fall within three standard deviations of the mean.

... where k > 1, so when k = 2 the
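The percentages in the empirical rule can be checked against the standard normal distribution. The sketch below also evaluates the bound 1 - 1/k², on the assumption that the truncated fragment about k refers to Chebyshev's theorem; that reading is a guess and is not confirmed by the notes.

```python
from statistics import NormalDist

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

def chebyshev_lower_bound(k):
    """Assumed second rule: at least 1 - 1/k^2 of any distribution lies within k sd (k > 1)."""
    return 1 - 1 / k ** 2

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
print(chebyshev_lower_bound(2))
```

The empirical rule applies only to bell-shaped data, while a bound of the Chebyshev kind holds for any distribution and is therefore much looser.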


One of the key assumptions of regression is that the variance of the errors is constant across observations. If the errors have constant variance, the errors are called homoscedastic. Typically, residuals are plotted to assess this assumption.

Standard estimation methods are inefficient when the errors are heteroscedastic, that is, when they have non-constant variance.

How to remove heteroscedasticity? Need to learn.

What is VIF (variance inflation factor)?

The variance inflation factor (VIF) is used to describe the multicollinearity between the independent variables. It measures how much the variance of the estimated regression coefficients is inflated compared to when there is no linear relation between the IVs.
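A common way to compute the VIF of an independent variable is 1 / (1 - R²), where R² comes from regressing that variable on the other IVs. The notes do not give this formula, so treat the sketch below as an assumption-laden illustration: it covers only the two-IV case, where that R² is simply the squared correlation between the two variables, and the data are made up.

```python
def pearson_r(xs, ys):
    """Coefficient of correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two IVs, R^2 is their squared correlation."""
    r2 = pearson_r(x1, x2) ** 2
    return 1 / (1 - r2)

x1 = [-2, -1, 0, 1, 2]
uncorrelated = [2, -1, -2, -1, 2]          # no linear relation with x1
near_copy = [-2.1, -0.9, 0.1, 1.0, 1.9]    # almost the same as x1

print(vif_two_predictors(x1, uncorrelated))  # no inflation
print(vif_two_predictors(x1, near_copy))     # strong multicollinearity
```

A VIF of 1 means no inflation; a cut-off of 10 corresponds to an R² of 0.9 between one IV and the others.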

What are the methods to check the fit of a linear regression model?

• Mean absolute percentage error should be less than or equal to 5

• VIF should be less than 10

• Model fit check by plotting predicted vs actual

• Residual check / heteroscedasticity

What is logistic regression and when to use logistic regression?

When the dependent variable is categorical and the IVs are a mix of categorical and quantitative variables, we use logistic regression. It predicts the log odds of the response instead of directly predicting the response.
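To make "log odds" concrete, here is a minimal sketch of the conversion between a probability and its log odds. The coefficients `b0` and `b1` are made-up values standing in for a fitted model, not output from any real fit.

```python
import math

def log_odds(p):
    """Logit: log of the odds p / (1 - p), for a probability 0 < p < 1."""
    return math.log(p / (1 - p))

def probability(logit):
    """Inverse of the logit (the logistic function)."""
    return 1 / (1 + math.exp(-logit))

# Hypothetical fitted model: log odds = b0 + b1 * x
b0, b1 = -3.0, 0.5
for x in (2, 6, 10):
    z = b0 + b1 * x
    print(x, z, round(probability(z), 3))
```

The model is linear on the log odds scale, and the inverse transformation keeps every predicted probability strictly between 0 and 1.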

How many types of logistic regression are there?

Binary logistic regression - when the response of the dependent variable is dichotomous (only two responses), we use binary logistic regression.

Ordered or ordinal logistic regression - when the response of the dependent variable is ordinal and has more than two levels, e.g. high, low, medium, we use the ordinal logistic regression model.

Multinomial logistic regression - when the response of the dependent variable is nominal and has more than two levels, e.g. sex of group, type of bread etc., we use the multinomial logistic regression model.

How to check multicollinearity in logistic regression in SAS?


What is MLE (Maximum Likelihood Estimation)?

Maximum likelihood is a technique to find the regression coefficients of the logistic regression.

What are the assumptions of MLE?

• Variance of error in the estimated model is constant.

What is the central tendency of data?

The term central tendency refers to the middle value of the data. Mean, median and mode are the three statistical measures used to measure the central tendency of the data.

The mean is the arithmetic average, the mode is the frequency average and the median is the positional average.

What are different types of sampling?


Simple Random Sample - A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen.

Stratified Random Sample - A stratified random sample is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum.

Cluster Sample - A cluster sample is a simple random sample of groups or clusters of elements.

https://faculty.elgin.edu/dkernler/statistics/ch01/1-4.html

Simple Random Sampling and Other Sampling Methods

Sampling Methods can be classified into one of two categories:

Probability Sampling: Sample has a known probability of being selected

Non-probability Sampling: Sample does not have a known probability of being selected, as in convenience or voluntary response surveys

Probability Sampling

In probability sampling it is possible to both determine which sampling units belong to which sample

and the probability that each sample will be selected. The following sampling methods, which are

listed in Chapter 4, are types of probability sampling:

1. Simple Random Sampling (SRS)

2. Stratified Sampling

3. Cluster Sampling

4. Systematic Sampling

5. Multistage Sampling (in which some of the methods above are combined in stages)

Of the five methods listed above, students have the most trouble distinguishing between stratified

sampling and cluster sampling.

https://onlinecourses.science.psu.edu/stat100/print/book/export/html/18
Stratified Sampling is possible when it makes sense to partition the population into groups based on

a factor that may influence the variable that is being measured. These groups are then called strata.

An individual group is called a stratum. With stratified sampling one should:

partition the population into groups (strata)

obtain a simple random sample from each group (stratum)

collect data on each sampling unit that was randomly sampled from each group (stratum)

Stratified sampling works best when a heterogeneous population is split into fairly homogeneous groups. Under these conditions, stratification generally produces more precise estimates of the population percents than estimates that would be found from a simple random sample. Table 3.2 shows some examples of ways to obtain a stratified sample.

Table 3.2. Examples of Stratified Samples

Row | Example 1 | Example 2 | Example 3

Population | All people in U.S. | All PSU intercollegiate athletes | All elementary students in the local school district

Groups (Strata) | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district

Obtain a Simple Random Sample | 500 people from each of the 4 time zones | 5 athletes from each of the 26 PSU teams | 20 students from each of the 11 elementary schools

Sample | 4 × 500 = 2000 selected people | 26 × 5 = 130 selected athletes | 11 × 20 = 220 selected students

Cluster Sampling is very different from Stratified Sampling. With cluster sampling one should

divide the population into groups (clusters).

obtain a simple random sample of so many clusters from all possible clusters.

obtain data on every sampling unit in each of the randomly selected clusters.


It is important to note that, unlike with the strata in stratified sampling, the clusters should be

microcosms, rather than subsections, of the population. Each cluster should be heterogeneous.

Additionally, the statistical analysis used with cluster sampling is not only different, but also more

complicated than that used with stratified sampling.

Table 3.3. Examples of Cluster Samples

Row | Example 1 | Example 2 | Example 3

Population | All people in U.S. | All PSU intercollegiate athletes | All elementary students in a local school district

Groups (Clusters) | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district

Obtain a Simple Random Sample | 2 time zones from the 4 possible time zones | 8 teams from the 26 possible teams | 4 elementary schools from the 11 possible elementary schools

Sample | every person in the 2 selected time zones | every athlete on the 8 selected teams | every student in the selected elementary schools
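The difference between the two designs shows up clearly in code. The following is a minimal sketch loosely modelled on Example 3 of the tables; the school roster, the 50 students per school, and the function names are all made up for illustration.

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of n_per_stratum units from every stratum."""
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Pick n_clusters whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical population: 11 schools with 50 students each.
schools = {f"school_{i}": [f"s{i}_{j}" for j in range(50)] for i in range(1, 12)}
rng = random.Random(0)  # fixed seed so the run is repeatable

strat = stratified_sample(schools, 20, rng)   # some students from every school
clust = cluster_sample(schools, 4, rng)       # every student from some schools
print(len(strat), sum(len(v) for v in strat.values()))
print(len(clust), sum(len(v) for v in clust.values()))
```

Stratified sampling touches every group but only part of each; cluster sampling touches only some groups but all of each.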

What are the different types of error that arise when a sample of observations is taken from a population?

There are two types of error:

Sampling error - Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.

Nonsampling error - Nonsampling errors result from mistakes made in the acquisition of data or from the sample observations being selected improperly.

1- Errors in data acquisition.

2- Nonresponse error.

3- Selection bias. Selection bias occurs when the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample. Together with nonresponse error, selection bias played a role in the Literary Digest poll being so wrong, as voters without telephones or without a subscription to Literary Digest were excluded from possible inclusion in the sample taken.


What is a random experiment?

A random experiment is an experiment or process that leads to one of several possible outcomes.

What is a sample space?

A sample space of a random experiment is a list of all possible outcomes of the experiment. The outcomes must be exhaustive and mutually exclusive.

What is the probability of an event?

The probability of an event is the sum of the probabilities of the simple events that constitute the event.

What is discrete and continuous data?

Discrete Data can only take certain values.

Example: the number of students in a class (you can't have half a student).

Continuous Data can take any value (within a range)

Examples:

• A person's height: could be any value (within the range of human heights), not just certain fixed heights,

What is a random variable?

• A random variable is a function or rule that assigns a number to each outcome of an experiment.

What is a binomial experiment?

1. The binomial experiment consists of a fixed number of trials. We represent the number of trials by n.

2. Each trial has two possible outcomes. We label one outcome a success, and the other a failure.

3. The probability of success is p. The probability of failure is 1 − p.

4. The trials are independent, which means that the outcome of one trial does not affect the outcomes of any other trials.

Binomial Probability Distribution

The probability of x successes in a binomial experiment with n trials and probability of success p is

$P(x) = \frac{n!}{x!\,(n - x)!}\, p^x (1 - p)^{n - x}$ for x = 0, 1, 2, ..., n
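The binomial formula translates directly into code. A minimal sketch, with n and p chosen arbitrarily (ten tosses of a fair coin) purely for illustration:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x) = n! / (x! (n - x)!) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.5   # hypothetical: 10 tosses of a fair coin
for x in range(n + 1):
    print(x, round(binomial_pmf(x, n, p), 4))
```

Summing the probabilities over x = 0, 1, ..., n gives 1, since those outcomes are exhaustive and mutually exclusive.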

Poisson distribution: from book.

What is central limit theorem?


The Central Limit Theorem (CLT for short) basically says that for non-normal data, the distribution of the sample means has an approximately normal distribution, no matter what the distribution of the original data looks like, as long as the sample size is large enough (usually at least 30) and all samples have the same size.
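A small simulation can make the theorem tangible. This is a minimal sketch under assumptions chosen here, not taken from the notes: the original data are exponential with mean 1 (strongly right-skewed), each sample has 30 observations, and 2000 samples are drawn with a fixed seed.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)   # fixed seed so the run is repeatable
sample_size = 30          # the "at least 30" rule of thumb
n_samples = 2000          # hypothetical number of repeated samples

# Exponential data with mean 1 and standard deviation 1: far from normal.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(n_samples)
]

# Share of sample means within one standard error (1 / sqrt(30)) of the true mean.
within_one_sd = sum(abs(m - 1) <= sample_size ** -0.5 for m in sample_means) / n_samples

print(round(mean(sample_means), 3))    # should be close to the population mean, 1
print(round(stdev(sample_means), 3))   # should be close to 1 / sqrt(30)
print(round(within_one_sd, 3))         # should be near the 68% of a normal curve
```

Even though single observations are skewed, the sample means cluster symmetrically around the population mean with spread close to sigma / sqrt(n).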

What are concordant, discordant and tied pairs?

Percent Concordant: Percentage of pairs where the observation with the desired outcome (event) has a higher predicted probability than the observation without the outcome (nonevent).

Percent Discordant: Percentage of pairs where the observation with the desired outcome (event) has a lower predicted probability than the observation without the outcome (nonevent).

Percent Tied: Percentage of pairs where the observation with the desired outcome (event) has the same predicted probability as the observation without the outcome (nonevent).
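The three percentages can be counted directly by forming every (event, nonevent) pair. A minimal sketch with made-up predicted probabilities; the function name `pair_summary` is chosen here for illustration.

```python
def pair_summary(event_probs, nonevent_probs):
    """Percent concordant, discordant and tied over all (event, nonevent) pairs."""
    pairs = [(e, n) for e in event_probs for n in nonevent_probs]
    concordant = sum(e > n for e, n in pairs)
    discordant = sum(e < n for e, n in pairs)
    tied = len(pairs) - concordant - discordant
    return tuple(100 * count / len(pairs) for count in (concordant, discordant, tied))

events = [0.9, 0.7, 0.4]          # hypothetical predicted probabilities, event observed
nonevents = [0.6, 0.4, 0.2, 0.1]  # hypothetical predicted probabilities, no event

pct_concordant, pct_discordant, pct_tied = pair_summary(events, nonevents)
print(pct_concordant, pct_discordant, pct_tied)
```

Here 3 events and 4 nonevents give 12 pairs, of which 10 are concordant, 1 is discordant and 1 is tied.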

What is ROC?

ROC (Receiver Operating Characteristic): in a ROC curve we plot the true positive rate (sensitivity) against the false positive rate (100 - specificity) for different cutoff points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.

Sensitivity: probability that a test result will be positive when the disease is present (true positive rate, expressed as a percentage) = a / (a+b).

Specificity: probability that a test result will be negative when the disease is not present (true negative rate, expressed as a percentage) = d / (c+d).

Here a is the number of true positives, b the false negatives, c the false positives and d the true negatives.
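Each cutoff point yields one sensitivity/specificity pair, and sweeping the cutoff traces out the ROC curve. A minimal sketch with made-up scores and labels, using the a, b, c, d counts from the formulas above; treating a score at or above the cutoff as positive is an assumption made here.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Classify score >= cutoff as positive; return (sensitivity, specificity) in percent."""
    a = sum(s >= cutoff and y == 1 for s, y in zip(scores, labels))  # true positives
    b = sum(s < cutoff and y == 1 for s, y in zip(scores, labels))   # false negatives
    c = sum(s >= cutoff and y == 0 for s, y in zip(scores, labels))  # false positives
    d = sum(s < cutoff and y == 0 for s, y in zip(scores, labels))   # true negatives
    return 100 * a / (a + b), 100 * d / (c + d)

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # hypothetical predicted probabilities
labels = [1, 1, 0, 1, 0, 1, 0, 0]                   # 1 = disease present

for cutoff in (0.25, 0.45, 0.65, 0.85):
    sens, spec = sensitivity_specificity(scores, labels, cutoff)
    # Sensitivity and false positive rate (100 - specificity) give one ROC point.
    print(cutoff, sens, 100 - spec)
```

Raising the cutoff lowers the false positive rate but also lowers sensitivity, which is the trade-off the curve displays.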

What is the difference between NODUPKEY and NODUPRECS?

data test1;
  input id1 $ id2 $ extra;
  cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
run;

options nocenter;

/* NODUPRECS (alias NODUP) compares the values of all variables of an
   observation with the previous observation written to the output,
   and does not write it if it is a duplicate.
   OUT= keeps test1 unchanged so it can be reused by every sort. */
proc sort nodup data=test1 out=nodup1;
  by id1;
run;
proc print data=nodup1;
run;

proc sort nodup data=test1 out=nodup2;
  by id1 id2;
run;
proc print data=nodup2;
run;

/* Sorting by all variables places identical records next to each other. */
proc sort nodup data=test1 out=nodup3;
  by id1 id2 extra;
run;
proc print data=nodup3;
run;

/* NODUPKEY checks for and deletes observations that have duplicate
   BY variable values. */
proc sort nodupkey data=test1 out=nodupkey1;
  by id1;
run;
proc print data=nodupkey1;
run;

proc sort nodupkey data=test1 out=nodupkey2;
  by id1 id2;
run;
proc print data=nodupkey2;
run;

proc sort nodupkey data=test1 out=nodupkey3;
  by id1 id2 extra;
run;
proc print data=nodupkey3;
run;

What are the differences between the WHERE and IF statements?

1- The IF statement can be used only in a data step, whereas the WHERE statement can be used in a data step as well as in a proc step.

data test (where=(name='gau'));

2- The IF statement can be used in a data step to subset records that are read with an INPUT statement, whereas the WHERE statement cannot be used.

data test;
  input a b c;
  if b in (2,3);
  datalines;
1 3 3
4 5 4
6 5 3
;
run;

3- The IF statement is applied after the data is read into the PDV, whereas the WHERE statement is applied before the data is read into the PDV.

Execute multiple conditional statements

Suppose you have data for college students' mathematics scores. You want to rate them on the basis of their scores.

Conditions:

1. If a score is less than 40, create a new variable named "Rating" and give a "Poor" rating to these students.

What are SAS macros?

Macros are used to perform repetitive tasks.

What is a macro variable and how many types of macro variables are there?

Macro variables are used to store values. There are two types of macro variable:

Local - If the macro variable is defined inside a macro, then its scope is local. It is available for use in that macro only and is removed when the macro finishes.

Global - If the macro variable is defined outside a macro, then its scope is global. It can be used anywhere in the SAS program and is removed at the end of the session.

data test;
  input a;
  datalines;
12
;
run;

/* global variable */
%let dat=test;

/* define macro */
%macro sample;
  proc print data=&dat;
  run;
%mend sample;

/* invoke macro */
%sample

/* local variable */
%macro sample1;
  %let dat1=test;
  proc print data=&dat1;
  run;
%mend sample1;

/* invoke macro */
%sample1

Note that the value does not require quotation marks, even for character values.

What are the methods to create macro variables?

1- %LET statement

2- Macro parameters (named and positional)

• Positional - In this we provide the parameter name while defining the macro

%macro macro_name(obs, b);
  data t2;
    set &b;
  run;
%mend;

• Keyword Parameters - In this we provide the parameter name with an equal sign, and we can also assign default values to the parameters

Definition

%MACRO macro_name(Parameter1=Value, Parameter2=Value, ..., Parameter-n=Value);

Macro Text;

%MEND;

Calling

%macro_name(Parameter1=Value, Parameter2=Value, ..., Parameter-n=Value);

3- CALL SYMPUT - suppose we want to use the variable - need to check.


What is the difference between PROC UNIVARIATE and PROC MEANS?

1- Both procedures produce descriptive statistics. PROC UNIVARIATE by default produces all of its statistics (sometimes not all are required), but in PROC MEANS it is possible to request only the statistics that we want.

2- PROC UNIVARIATE reports quartiles by default and can produce histograms and box plots, whereas PROC MEANS produces no plots and reports quartiles only when they are requested.

What are the rules for SAS data sets?

1- A SAS data set name can be 1 to 32 characters long.

2- It should start with an underscore or a letter, and subsequent characters can be letters, digits or underscores.

3- A SAS data set consists of two parts: a descriptor portion and a data portion.

What are the SAS variable attributes?

• Name - can be 1 to 32 characters long. It should start with an underscore or a letter, and subsequent characters can be letters, digits or underscores.

• Type - Numeric: numbers (+, -, ., E and scientific notation). Character: contains letters.

• To represent a missing character value we use a blank, while a period (.) is used for numeric variables.

• Length - A SAS character variable has a default length of 8 and can have a length of up to 32,767.


What is the YEARCUTOFF system option?

When SAS encounters a two-digit year in a date in external or internal data, we normally use the YEARCUTOFF= option, because SAS considers (by default) 1

