SAS Personal
gaurav-tiwari -
8/17/2019 SAS Personal
1/19
Q- What is regression analysis?
A- Regression analysis is a statistical process for estimating the relationship
between one variable, called the dependent variable, and one or more other
variables, called independent variables.
Q- What is logistic regression?
A- Logistic regression is a statistical process for estimating the relationship
between a dependent variable whose response is binary and one or more other
variables, called independent variables.
Q- What is covariance?
A- Covariance is a measure of how much two variables change together.
Q- What is correlation?
A- Correlation is a scaled version of covariance: it measures how much two
equally scaled variables change together. The result is called the coefficient
of correlation and is denoted by "r". It ranges from -1.0 to +1.0. Correlation
does not have units, whereas covariance carries the units of the two variables.
Q- What is formula of covariance?
A- Let us denote the dependent variable by Y and the independent variable by X. Then the formula of covariance is
\operatorname{cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n} for a population, or
\operatorname{cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} for a sample.
Q- What is mean, median and mode?
A- We use the mean to describe the entire set of observations with a single value representing the center of the data. The mean, or arithmetic mean, is the sum of the observations divided by the number of observations.
The median is the central point of the data after arranging it in ascending order. If the number of observations N is odd, the median is the ((N+1)/2)-th value; if N is even, the median is the midpoint of the (N/2)-th and (N/2 + 1)-th values.
The mode is the value that occurs most frequently in a set of observations.
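The definitions above can be tried out on a small data set. The following is a minimal Python sketch, not part of the original notes; the function names and the sample numbers are illustrative assumptions.

```python
from collections import Counter


def mean(values):
    """Arithmetic mean: sum of observations divided by their count."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (average of the two middle values if N is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mode(values):
    """Most frequent value (the first one encountered if several tie)."""
    return Counter(values).most_common(1)[0][0]


def covariance(xs, ys, sample=True):
    """Covariance with divisor n - 1 (sample) or n (population)."""
    mx, my = mean(xs), mean(ys)
    total = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return total / (len(xs) - 1 if sample else len(xs))


def correlation(xs, ys):
    """Covariance scaled by both standard deviations, so it is unit-free."""
    return covariance(xs, ys) / (covariance(xs, xs) ** 0.5 * covariance(ys, ys) ** 0.5)


# Illustrative data only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(mean(x), median(x), mode([1, 2, 2, 3]))
print(covariance(x, y), covariance(x, y, sample=False), correlation(x, y))
```

Because y is an exact multiple of x here, the correlation comes out at (numerically) 1, while the covariance still depends on the scale of the data.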
Q- What is statistics?
A- A method to get information from a given set of observations.
Q- What is descriptive statistics?
A- Descriptive statistics provides a concise summary of data.
Q- What is inferential statistics?
A- Inferential statistics uses a random sample of a population to draw conclusions about that population.
Q- What is a variable?
A- A variable is some characteristic of a population or sample.
Q- What is data?
Data is a collection of variables, or the observed values of all variables.
Q- Define types of data?
Data is categorized into two categories: numeric or quantitative data, and categorical or qualitative data. Quantitative data has meaning as a measurement, such as height, weight etc.
Numerical data can be further broken into two types: discrete and continuous.
Discrete data represent items that can be counted.
Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line.
Categorical data: categorical data represent characteristics such as a person's gender. Categorical data fall into two categories: ordinal (ordered categories, e.g. ratings 1, 2, 3) and nominal (unordered categories, such as sex).
Q- What is Range?
Range is the difference between the highest and lowest observations of a variable.
Q- What is coefficient of determination?
The coefficient of determination is a measure of how well the regression line
represents the data. If the regression line passes exactly through every point on
the scatter plot, it would be able to explain all of the variation. The further the line
is away from the points, the less it is able to explain.
What is Rejection Region?
The rejection region is the range of values such that if the test statistic falls into it we reject
the null hypothesis.
What is P value?
The P value of a test is the probability of observing a test statistic at least as extreme as the one
computed, assuming that the null hypothesis is true.
What is SSR?
SSR stands for the regression sum of squares: the sum of squared differences between the mean value of the observed
dependent variable and the regression (fitted) value of the dependent variable.
What is SSE?
Sum of squared errors.
It is the sum of squared differences between the regression value of the dependent variable and the observed value of that dependent variable.
What is the total sum of squares (SST)?
It is the sum of SSR and SSE.
What is coefficient of determination or R square?
The R² of a regression model is the proportion of total variance explained by the
independent variables of the model.
It is defined by R² = SSR/SST.
What is adjusted R square and why do we use it?
When we add any variable to a linear regression model, the value of R square never decreases,
even if the added variable is not significant; hence we use adjusted R², which is a
modified version of R square.
When we add an insignificant variable to the model the adjusted R square is penalized,
whereas its value increases when we add a significant variable to the model.
R2 assumes that every single variable explains the variation in the dependent variable. The
adjusted R2 tells you the percentage of variation explained by only the independent
variables that actually affect the dependent variable.
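The formula for adjusted R square is \bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - 1 - p}, where n is the number of observations and p is the number of independent variables.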
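To see how SSR, SSE, SST, R square and adjusted R square fit together, here is a minimal Python sketch for a one-variable least-squares line. It is an illustration only; the helper names and the five data points are assumptions, not taken from the notes.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_summary(xs, ys, p=1):
    """Return SSR, SSE, SST, R square and adjusted R square (p = number of predictors)."""
    n = len(ys)
    intercept, slope = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    y_bar = sum(ys) / n
    ssr = sum((f - y_bar) ** 2 for f in fitted)          # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in ys)              # total variation
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 1 - p)
    return {"ssr": ssr, "sse": sse, "sst": sst, "r2": r2, "adj_r2": adj_r2}


# Illustrative data only.
summary = fit_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(summary)
```

For this data SSR + SSE equals SST, and the adjusted value is lower than R square because the penalty factor (n - 1)/(n - 1 - p) is greater than 1.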
What is standard deviation?
Standard deviation is the measure of the spread of data around the mean.
Standard deviation is useful when we compare two data sets in order to compare the variance or spread in both.
What are the two rules or theorems to calculate the spread of data around the mean?
There are two rules to calculate the spread of data:
1- Empirical rule - which states that if the distribution of data is bell shaped, or the data is normally distributed, then
(a) 68% of the data will fall within one standard deviation of the mean.
(b) 95% of the data will fall within two standard deviations of the mean.
(c) 99.7% of the data will fall within three standard deviations of the mean.
2- Chebyshev's theorem - which applies to any distribution: at least 1 - 1/k² of the data fall within k standard deviations of the mean, where k > 1; so when k = 2 the proportion is at least 75%.
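The empirical-rule percentages can be checked against the normal distribution itself. The sketch below is an illustration under stated assumptions: it uses Python's standard `statistics.NormalDist`, and the helper names are hypothetical. For comparison it also computes the distribution-free Chebyshev lower bound 1 - 1/k².

```python
from statistics import NormalDist


def normal_share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)


def chebyshev_lower_bound(k):
    """Minimum share within k standard deviations for ANY distribution (requires k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (1, 2, 3):
    line = f"k={k}: normal {normal_share_within(k):.4f}"
    if k > 1:
        line += f", Chebyshev at least {chebyshev_lower_bound(k):.4f}"
    print(line)
```

The normal shares are the more precise versions of the rounded 68%, 95% and 99.7% figures; the Chebyshev bound is much weaker because it makes no assumption about the shape of the data.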
One of the key assumptions of regression is that the variance of the errors is
constant across observations. If the errors have constant variance, the errors are
called homoscedastic. Typically, residuals are plotted to assess this assumption.
Standard estimation methods are inefficient when the errors are heteroscedastic or
have non-constant variance.
How to remove heteroscedasticity? Need to learn...
What is VIF - variance inflation factor?
The variance inflation factor (VIF) is used to describe the multicollinearity between the
independent variables. It is a measure of how much the variance of the estimated regression coefficients is
inflated as compared to when there is no linear relation between the IVs.
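In general the VIF of predictor j is 1 / (1 - R_j²), where R_j² comes from regressing that predictor on all the others. With only two predictors, R_j² is just their squared correlation, which allows a small self-contained Python sketch; the function names and data are illustrative assumptions.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Illustrative data only: two moderately correlated predictors.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
print(pearson_r(x1, x2), vif_two_predictors(x1, x2))
```

A VIF of 1 means no linear relation between the predictors; the value grows without bound as the correlation approaches 1 in absolute value (perfectly collinear predictors would make this sketch divide by zero).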
What are the checks for the fit of a linear regression model?
• Mean absolute percentage error (MAPE) should be less than or equal to 5
• VIF should be less than 10
• Model fit check by plotting predicted vs actual
• Residual check/heteroscedasticity
What is logistic regression and when to use logistic regression?
When the dependent variable is categorical and the IVs are a mix of categorical and quantitative
variables, we use logistic regression. It predicts the log odds of the response instead of directly
predicting the response.
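The link between a probability and its log odds can be sketched in a few lines of Python. This is an illustration only; the intercept and slope below are made-up numbers, not estimates from any data.

```python
import math


def log_odds(p):
    """Logit: natural log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1 - p))


def probability_from_log_odds(z):
    """Inverse logit: turns a predicted log odds back into a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical fitted model: log odds = intercept + slope * x
intercept, slope = -1.5, 0.8
z = intercept + slope * 2.0
print(z, probability_from_log_odds(z))
```

A log odds of 0 corresponds to a probability of 0.5; positive values push the probability above 0.5 and negative values below it.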
How many types of logistic regression are there?
Binary logistic regression - when the response of the dependent variable is dichotomous (only two
responses) we use the binary logistic regression model.
Ordered or ordinal logistic regression - when the response of the dependent variable is ordinal with more than two levels, e.g. high, low, medium, we use the ordinal logistic regression model.
Multinomial logistic regression - when the response of the dependent variable is nominal with more
than two levels, e.g. type of bread, we use the multinomial logistic regression model.
How to check multicollinearity in logistic regression in SAS?
What is MLE (Maximum Likelihood Estimation)?
Maximum likelihood is a technique to find the regression coefficients of the logistic regression.
What are the assumptions of MLE?
• Variance of error in estimated model is constant.
What is central tendency of data?
The term central tendency refers to the middle value of the data. Mean, median and mode are three statistical measures used to measure the central tendency of the data.
Mean is the arithmetic average, mode is the frequency average and median is the positional average.
What are different types of sampling?
Simple Random Sample - A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen.
Stratified Random Sample - A stratified random sample is obtained by separating the population into mutually
exclusive sets, or strata, and then drawing simple random samples from each stratum.
Cluster Sample - A cluster sample is a simple random sample of groups or clusters of elements.
https://faculty.elgin.edu/dkernler/statistics/ch01/1-4.html
Simple Random Sampling and Other Sampling Methods
Sampling Methods can be classified into one of two categories:
Probability Sampling: Sample has a known probability of being selected
Non-probability Sampling: Sample does not have known probability of being selected, as in convenience or voluntary response surveys
Probability Sampling
In probability sampling it is possible to both determine which sampling units belong to which sample
and the probability that each sample will be selected. The following sampling methods, which are
listed in Chapter 4, are types of probability sampling:
1. Simple Random Sampling (SRS)
2. Stratified Sampling
3. Cluster Sampling
4. Systematic Sampling
5. Multistage Sampling (in which some of the methods above are
combined in stages)
Of the five methods listed above, students have the most trouble distinguishing between stratified
sampling and cluster sampling.
https://onlinecourses.science.psu.edu/stat100/print/book/export/html/18
Stratified Sampling is possible when it makes sense to partition the population into groups based on
a factor that may influence the variable that is being measured. These groups are then called strata.
An individual group is called a stratum. With stratified sampling one should:
• partition the population into groups (strata)
• obtain a simple random sample from each group (stratum)
• collect data on each sampling unit that was randomly sampled from each group (stratum)
Stratified sampling works best when a heterogeneous population is split into fairly homogeneous
groups. Under these conditions, stratification generally produces more precise estimates of the
population percents than estimates that would be found from a simple random sample. Table
3.2 shows some examples of ways to obtain a stratified sample.
Table 3.2. Examples of Stratified Samples
| Example 1 | Example 2 | Example 3
Population | All people in U.S. | All PSU intercollegiate athletes | All elementary students in the local school district
Groups (Strata) | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district
Obtain a Simple Random Sample | 500 people from each of the 4 time zones | 5 athletes from each of the 26 PSU teams | 20 students from each of the 11 elementary schools
Sample | 4 × 500 = 2000 selected people | 26 × 5 = 130 selected athletes | 11 × 20 = 220 selected students
Cluster Sampling is very different from Stratified Sampling. With cluster sampling one should:
• divide the population into groups (clusters).
• obtain a simple random sample of so many clusters from all possible clusters.
• obtain data on every sampling unit in each of the randomly selected clusters.
It is important to note that, unlike with the strata in stratified sampling, the clusters should be
microcosms, rather than subsections, of the population. Each cluster should be heterogeneous.
Additionally, the statistical analysis used with cluster sampling is not only different, but also more
complicated than that used with stratified sampling.
Table 3.3. Examples of Cluster Samples
| Example 1 | Example 2 | Example 3
Population | All people in U.S. | All PSU intercollegiate athletes | All elementary students in a local school district
Groups (Clusters) | 4 Time Zones in the U.S. (Eastern, Central, Mountain, Pacific) | 26 PSU intercollegiate teams | 11 different elementary schools in the local school district
Obtain a Simple Random Sample | 2 time zones from the 4 possible time zones | 8 teams from the 26 possible teams | 4 elementary schools from the 11 possible elementary schools
Sample | every person in the 2 selected time zones | every athlete on the 8 selected teams | every student in the selected elementary schools
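The difference between the two designs can be sketched in Python using the elementary-school example: stratified sampling draws some students from every school, while cluster sampling draws a few schools and takes every student in them. The function names, the seed and the school size of 30 students are assumptions made purely for illustration.

```python
import random


def stratified_sample(groups, per_group, rng):
    """Simple random sample of `per_group` units from EVERY group (stratum)."""
    return {name: rng.sample(units, per_group) for name, units in groups.items()}


def cluster_sample(groups, n_clusters, rng):
    """Simple random sample of whole groups (clusters); keep every unit in each chosen one."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return {name: list(groups[name]) for name in chosen}


# Hypothetical population: 11 schools with 30 students each.
schools = {
    f"school_{i}": [f"student_{i}_{j}" for j in range(30)]
    for i in range(1, 12)
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
stratified = stratified_sample(schools, per_group=20, rng=rng)
clustered = cluster_sample(schools, n_clusters=4, rng=rng)

print(len(stratified), sum(len(v) for v in stratified.values()))  # all 11 schools, 220 students
print(len(clustered), sum(len(v) for v in clustered.values()))    # 4 schools, every student in them
```

The stratified sample always touches all 11 schools, whereas the cluster sample's size depends on how large the selected schools happen to be.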
What are the different types of error that arise when a sample of observations is taken from a
population?
There are two types of error:
Sampling error - Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.
Nonsampling error - Nonsampling errors result from mistakes made in the acquisition of data or from the sample observations being selected improperly.
1- Errors in data acquisition.
2- Nonresponse error.
3- Selection bias. Selection bias occurs when the sampling plan is such that some members of the target population cannot possibly be selected for inclusion in the sample. Together with nonresponse error, selection bias played a role in the Literary Digest poll being so wrong, as voters without telephones or without a subscription to Literary Digest were excluded from possible inclusion in the sample taken.
What is a random experiment?
A random experiment is an experiment or process that leads to one of several possible outcomes.
What is Sample Space?
A sample space of a random experiment is a list of all possible outcomes of the experiment. The outcomes must be exhaustive and mutually exclusive.
What is Probability of an Event?
The probability of an event is the sum of the probabilities of the simple events that constitute the event.
What is discrete and continuous data?
Discrete Data can only take certain values.
Example: the number of students in a class (you can't have half a student).
Continuous Data can take any value (within a range)
Examples:
• A person's height: could be any value (within the range of human
heights), not just certain fixed heights.
What is Random Variable?
• A random variable is a function or rule that assigns a number to each outcome of
an experiment.
What is Binomial Experiment?
1. The binomial experiment consists of a fixed number of trials. We represent the number of trials by n.
2. Each trial has two possible outcomes. We label one outcome a success, and the other a failure.
3. The probability of success is p. The probability of failure is 1 - p.
4. The trials are independent, which means that the outcome of one trial does not affect the outcomes of any other trials.
Binomial Probability Distribution
The probability of x successes in a binomial experiment with n trials and probability of success = p is
P(x) = \frac{n!}{x!\,(n - x)!} p^x (1 - p)^{n - x} for x = 0, 1, 2, ..., n
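The binomial probability formula translates directly into code. The following minimal Python sketch uses `math.comb` for the n!/(x!(n - x)!) term; the function name and the coin-flip numbers are illustrative assumptions.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Illustrative case: 4 fair coin flips.
n, p = 4, 0.5
for x in range(n + 1):
    print(x, binomial_pmf(x, n, p))
```

Summing the probabilities over x = 0, 1, ..., n gives 1, which is a quick sanity check on any hand calculation.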
Poisson distribution from book.
What is central limit theorem?
The Central Limit Theorem (CLT for short) basically says that for non-normal data, the
distribution of the sample means has an approximate normal distribution, no matter what the
distribution of the original data looks like, as long as the sample size is large enough (usually at
least 30) and all samples have the same size.
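A small simulation makes the theorem concrete. The Python sketch below repeatedly draws samples of size 30 from a strongly skewed exponential distribution (mean 1, standard deviation 1) and looks at the sample means; the seed, the number of samples and the choice of distribution are assumptions made for illustration.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples):
    """Mean of each of `n_samples` samples, every sample having `sample_size` draws."""
    return [
        statistics.fmean([draw() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
means = sample_means(lambda: rng.expovariate(1.0), sample_size=30, n_samples=2000)

center = statistics.fmean(means)   # should be close to the population mean, 1
spread = statistics.stdev(means)   # should be close to 1 / sqrt(30)
print(center, spread, 1 / 30 ** 0.5)
```

The sample means cluster around the population mean with a spread near sigma / sqrt(n), even though the individual draws are far from normal.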
What is concordance, discordance and tied pair?
Percent Concordant: Percentage of pairs where the observation with the desired outcome (event) has a higher predicted probability than the observation without the outcome (nonevent).
Percent Discordant: Percentage of pairs where the observation with the desired outcome (event) has a lower predicted probability than the observation without the outcome (nonevent).
Percent Tied: Percentage of pairs where the observation with the desired outcome (event) has the
same predicted probability as the observation without the outcome (nonevent).
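These three percentages come from comparing every event with every nonevent. A minimal Python sketch of that pairwise count follows; the function name and the predicted probabilities are illustrative assumptions.

```python
def pair_percentages(event_probs, nonevent_probs):
    """Percent concordant, discordant and tied over all (event, nonevent) pairs."""
    concordant = discordant = tied = 0
    for p_event in event_probs:
        for p_nonevent in nonevent_probs:
            if p_event > p_nonevent:
                concordant += 1
            elif p_event < p_nonevent:
                discordant += 1
            else:
                tied += 1
    total = len(event_probs) * len(nonevent_probs)
    return {
        "concordant": 100 * concordant / total,
        "discordant": 100 * discordant / total,
        "tied": 100 * tied / total,
    }


# Illustrative predicted probabilities only.
result = pair_percentages(event_probs=[0.9, 0.6, 0.4], nonevent_probs=[0.5, 0.4])
print(result)
```

With 3 events and 2 nonevents there are 6 pairs; here 4 are concordant, 1 is discordant and 1 is tied.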
What is ROC?
ROC - Receiver Operating Characteristic - In an ROC curve we plot the true positive rate (sensitivity) vs the false positive rate (100 - specificity) for different cutoff points.
Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap in the two distributions) has an ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.
Sensitivity: probability that a test result will be positive when the disease is present (true positive rate, expressed as a percentage) = a / (a + b), where a = true positives and b = false negatives.
Specificity: probability that a test result will be negative when the disease is not present (true negative rate, expressed as a percentage) = d / (c + d), where c = false positives and d = true negatives.
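Each point on an ROC curve is one sensitivity/specificity pair at a given cutoff. The Python sketch below computes those pairs with the a, b, c, d counts (true positives, false negatives, false positives, true negatives); the labels, scores, cutoffs and function names are illustrative assumptions, and proportions are returned rather than percentages.

```python
def sensitivity_specificity(labels, scores, cutoff):
    """Classify score >= cutoff as positive; return (sensitivity, specificity) as proportions."""
    a = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)  # true positives
    b = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)   # false negatives
    c = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)  # false positives
    d = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)   # true negatives
    return a / (a + b), d / (c + d)


def roc_points(labels, scores, cutoffs):
    """(false positive rate, true positive rate) for each cutoff."""
    points = []
    for cutoff in cutoffs:
        sens, spec = sensitivity_specificity(labels, scores, cutoff)
        points.append((1 - spec, sens))
    return points


# Illustrative data only: 1 = disease present, 0 = disease absent.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2]
points = roc_points(labels, scores, cutoffs=[0.3, 0.5, 0.7])
print(points)
```

Lowering the cutoff raises sensitivity at the cost of specificity, which is why the points trace a curve from the lower left to the upper right.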
What is the difference between NODUPKEY and NODUPRECS?
/* NODUPRECS (alias NODUP) compares all variable values of each observation with those of the previous observation written to the output, and does not write the observation if it is a duplicate */
data test1;
input id1 $ id2 $ extra;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1;
run;
options nocenter;
proc print data=test1;
run;
data test1;
input id1 $ id2 $ extra;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1 id2;
run;
options nocenter;
proc print data=test1;
run;
data test1;
input id1 $ id2 $ extra;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodup data=test1;
by id1 id2 extra;
run;
options nocenter;
proc print data=test1;
run;
/* NODUPKEY checks for and deletes the observations that have duplicate BY variable values */
data test1;
input id1 $ id2 $ extra;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodupkey data=test1;
by id1;
run;
options nocenter;
proc print data=test1;
run;
data test1;
input id1 $ id2 $ extra;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodupkey data=test1;
by id1 id2;
run;
options nocenter;
proc print data=test1;
run;
data test1;
input id1 $ id2 $ extra;
cards;
aa ab 3
aa ab 1
aa ab 2
aa ab 3
;
proc sort nodupkey data=test1;
by id1 id2 extra;
run;
options nocenter;
proc print data=test1;
run;
What are the differences between the WHERE and IF statements?
1- The IF statement can be used only in a DATA step, whereas the WHERE statement can be used in a DATA step as well as a PROC step.
data test (where=(name='gau'));
2- The IF statement can be used in a DATA step to subset records that are read with an INPUT statement, whereas the WHERE statement cannot be used there.
data test;
input a b c;
if b in (2,3);
datalines;
1 2 3
3 4 5
4 7 5
;
run;
3- The IF statement is applied after the data is read into the PDV, whereas the WHERE statement is applied before
the data is read into the PDV.
Execute multiple conditional statements
Suppose you have data for college students' mathematics scores. You want to rate them on the basis of their scores.
Conditions:
1. If a score is less than 40, create a new variable named "Rating" and give "Poor" rating to these students.
What is SAS Macros?
Macros are used to perform repetitive tasks.
What is a macro variable and how many types of macro variables are there?
Macro variables are used to store text values. There are two types of macro variable:
Local - If the macro variable is defined inside a macro code, then its
scope is local. It is available for use in that macro only and gets
removed when the macro is finished.
Global - If the macro variable is defined outside a macro code, then its
scope is global. It can be used anywhere in the SAS program and gets
removed at the end of the session.
data test;
input a;
datalines;
12
;
run;
/*global variable*/
%let dat=test;
/*define macro*/
%macro sample;
proc print data=&dat;
run;
%mend sample;
/*invoke macro*/
%sample
run;
/*local variable*/
%macro sample1;
%let dat1=test;
proc print data=&dat1;
run;
%mend sample1;
/*invoke macro*/
%sample1
run;
Note that the value does not require quotation marks, even for characters.
What are the methods to create macro variables?
1- %LET statement
2- Macro parameters (named and positional)
• Positional - In this we provide the parameter name while defining the macro
%macro <macro name> (obs, b);
data t2;
set &b;
run;
%mend;
• Keyword Parameters - In this we provide the parameter name with an equal sign, and we can also assign default values to the parameters
Definition
%MACRO <macro name> (Parameter1=Value, Parameter2=Value ... Parameter-n=Value);
Macro Text;
%MEND;
Calling
%<macro name> (Parameter1=Value, Parameter2=Value ... Parameter-n=Value);
3- CALL SYMPUT - suppose we want to use the variable - need to check.
What is the difference between PROC UNIVARIATE and PROC MEANS?
1- Both procedures produce descriptive statistics. PROC UNIVARIATE by default produces all the statistics (sometimes not all are required), but in PROC MEANS it is possible
to request just the statistics that we want.
2- PROC UNIVARIATE produces histograms and box plots (and reports quartiles by default),
whereas PROC MEANS does not produce plots.
What are the rules for SAS data set names?
1- A SAS data set name can be 1 to 32 characters long.
2- It should start with an underscore or a letter, and subsequent
characters can be letters, numerals or underscores.
3- A SAS data set consists of two parts - a descriptor portion and a
data portion.
What are the SAS variable attributes?
• Name - can be 1 to 32 characters long. It should start with an underscore or a letter, and subsequent characters can be letters, numerals or underscores.
• Type - Numeric - numbers (+, -, ., E and scientific notation); Character - contains letters (and may also contain digits and special characters)
• To represent a missing character value we use a blank, while "." is used for
numeric variables.
• Length - A SAS character variable has a default length of 8 and can have a length of up to 32,767 bytes
What is the YEARCUTOFF system option?
When SAS encounters a two-digit year in a date from an external or internal source, we normally use the YEARCUTOFF option because SAS considers (by default) 1