For today’s lesson, you will want to download/make a copy of the Github repository https://github.com/melindahiggins2000/N736Fall2017_lesson06. This repo contains 3 datasets we will be working with in the exercises below:
dataA.csv
- has 3 variables (var1, var2, var3) and 6 subjects (1,2,3,5,8,9)dataB.csv
- has 3 variables: 2 same and 1 different than dataA
(var1, var2, var4) and 6 more subjects (11,12,14,15,17,18)dataC.csv
- has 3 more variables (var5, var6, var7) and 6 subjects (1,2,3,4,7,10) some in common with dataA
and some differentLesson 05 covered:
For Lesson 06 we’ll cover:
In future lessons, we’ll also discuss:
factors
in R
specifically - but in SAS (and in SPSS), numeric variables assumed to be continuous are modeled differently than numeric variables treated as “factors” (i.e. ordinal or categorical data represented numerically).It can sometimes be helpful to add a label to your dataset or to your variables.
In R
you can add a comment on your dataset which can then be pulled up using the attributes()
function, which could be used later in a title or other text function.
R CODE
library(readr)
library(knitr)
dataA <- read_csv("dataA.csv")
dataB <- read_csv("dataB.csv")
comment(dataA) <- "This is a comment on dataset dataA"
comment(dataB) <- "This is a comment on dataset dataB"
attributes(dataA)
## $names
## [1] "id" "var1" "var2" "var3"
##
## $class
## [1] "tbl_df" "tbl" "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6
##
## $spec
## cols(
## id = col_integer(),
## var1 = col_integer(),
## var2 = col_character(),
## var3 = col_integer()
## )
##
## $comment
## [1] "This is a comment on dataset dataA"
attributes(dataB)
## $names
## [1] "id" "var1" "var2" "var4"
##
## $class
## [1] "tbl_df" "tbl" "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6
##
## $spec
## cols(
## id = col_integer(),
## var1 = col_integer(),
## var2 = col_character(),
## var4 = col_integer()
## )
##
## $comment
## [1] "This is a comment on dataset dataB"
I could use the result of the attributes()
function to print out info on dataA
.
print(paste("Here are the details on dataset dataA:",
attributes(dataA)$comment))
## [1] "Here are the details on dataset dataA: This is a comment on dataset dataA"
SAS Code
Load dataA
and dataB
.
* ======================================;
* load dataA;
* ======================================;
proc import datafile='C:\MyGithub\N736Fall2017_lesson05\dataA.csv'
out=dataA dbms=csv;
run;
* ======================================;
* load dataB;
* ======================================;
proc import datafile='C:\MyGithub\N736Fall2017_lesson05\dataB.csv'
out=dataB dbms=csv;
run;
In SAS
it is possible to add a comment about the dataset, but this only shows up as a note in output like PROC CONTENTS
.
data dataAcomment (label="This is a comment on dataA");
set dataA;
run;
proc contents data=dataAcomment; run;
Data Set Name | WORK.DATAACOMMENT | Observations | 6 |
---|---|---|---|
Member Type | DATA | Variables | 4 |
Engine | V9 | Indexes | 0 |
Created | 09/13/2017 12:42:23 | Observation Length | 32 |
Last Modified | 09/13/2017 12:42:23 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | This is a comment on dataA | ||
Data Representation | WINDOWS_64 | ||
Encoding | wlatin1 Western (Windows) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 2039 |
Obs in First Data Page | 6 |
Number of Data Set Repairs | 0 |
ExtendObsCounter | YES |
Filename | C:Temporary Files_TD12732_SON-HIGGINS_.sas7bdat |
Release Created | 9.0401M1 |
Host Created | X64_8PRO |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
1 | id | Num | 8 | BEST12. | BEST32. |
2 | var1 | Num | 8 | BEST12. | BEST32. |
3 | var2 | Char | 1 | $1. | $1. |
4 | var3 | Num | 8 | BEST12. | BEST32. |
factors
in R
R Code
For var
which has 5-point LiKert scale data coded from 1 to 5, let’s assume it should have the following labeling and interpretation:
1 = "strongly disagree"
2 = "disagree"
3 = "neutral"
4 = "agree"
5 = "strongly agree"
Also notice the changes in the “class”/“variable type” when we make var1
a factor.
kable(dataA)
id | var1 | var2 | var3 |
---|---|---|---|
1 | 3 | m | 45 |
2 | 3 | m | 44 |
3 | 2 | f | 34 |
5 | 4 | f | 38 |
8 | 1 | m | 41 |
9 | 5 | f | 39 |
class(dataA$var1)
[1] "integer"
typeof(dataA$var1)
[1] "integer"
is.numeric(dataA$var1)
[1] TRUE
dataA$var1 <- factor(dataA$var1,
levels=c(1,2,3,4,5),
labels=c("strongly disagree",
"disagree", "neutral",
"agree","strongly agree"))
kable(dataA)
id | var1 | var2 | var3 |
---|---|---|---|
1 | neutral | m | 45 |
2 | neutral | m | 44 |
3 | disagree | f | 34 |
5 | agree | f | 38 |
8 | strongly disagree | m | 41 |
9 | strongly agree | f | 39 |
class(dataA$var1)
[1] "factor"
typeof(dataA$var1)
[1] "integer"
is.numeric(dataA$var1)
[1] FALSE
We can also apply better labeling for gender var2
which is currently a character variable coed as “m” or “f”.
kable(dataA)
id | var1 | var2 | var3 |
---|---|---|---|
1 | neutral | m | 45 |
2 | neutral | m | 44 |
3 | disagree | f | 34 |
5 | agree | f | 38 |
8 | strongly disagree | m | 41 |
9 | strongly agree | f | 39 |
class(dataA$var2)
[1] "character"
typeof(dataA$var2)
[1] "character"
is.numeric(dataA$var2)
[1] FALSE
dataA$var2 <- factor(dataA$var2,
levels=c("m","f"),
labels=c("male","female"))
kable(dataA)
id | var1 | var2 | var3 |
---|---|---|---|
1 | neutral | male | 45 |
2 | neutral | male | 44 |
3 | disagree | female | 34 |
5 | agree | female | 38 |
8 | strongly disagree | male | 41 |
9 | strongly agree | female | 39 |
class(dataA$var2)
[1] "factor"
typeof(dataA$var2)
[1] "integer"
is.numeric(dataA$var2)
[1] FALSE
SAS
formatsSAS Code
Create some formats for likert scale variables and for the gender character variable.
proc format;
value likert
1='strongly disagree'
2='disagree'
3='neutral'
4='agree'
5='strongly agree';
value $gender
'm'='male'
'f'='female';
run;
See what it looks like without the format applied.
proc print data=dataA; run;
Apply the format to a new dataset dataA2
for the var1
variable.
data dataA2;
set dataA;
format var1 likert.;
run;
proc print data=dataA2; run;
Obs | id | var1 | var2 | var3 |
---|---|---|---|---|
1 | 1 | neutral | m | 45 |
2 | 2 | neutral | m | 44 |
3 | 3 | disagree | f | 34 |
4 | 5 | agree | f | 38 |
5 | 8 | strongly disagree | m | 41 |
6 | 9 | strongly agree | f | 39 |
You don’t have to use the format in a DATA
step, you can simply use it when you need it, such as within a proc print statement to get the formatting you want in the output. Let’s use the format
statement here when we print the data table for dataB
.
proc print data=dataB;
format var2 $gender.;
run;
Obs | id | var1 | var2 | VAR4 |
---|---|---|---|---|
1 | 11 | 4 | female | 0 |
2 | 12 | 4 | male | 6 |
3 | 14 | 2 | male | 4 |
4 | 15 | 3 | female | 7 |
5 | 17 | 1 | female | 8 |
6 | 18 | 5 | male | 5 |
We can also use the format
statement here to print the likert scale labels in a frequency table. First, without formatting and again with the formatting - compare the output.
* run a frequency table without formatting;
proc freq data=dataB;
tables var2;
run;
* add the formatting for the table;
proc freq data=dataB;
format var2 $gender.;
tables var2;
run;
Frequency table and plots without format
applied for dataA
and again with format
applied for dataA2
.
SAS Code
proc freq data=dataA;
tables var1;
run;
proc freq data=dataA2;
tables var1;
run;
SAS
- titlesHere is an example of adding a title to the ODS output.
proc freq data=dataB;
title "Frequency Table of Gender";
format var2 $gender.;
tables var2;
run;
Frequency Table of Gender |
var2 | Frequency | Percent |
Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
female | 3 | 50.00 | 3 | 50.00 |
male | 3 | 50.00 | 6 | 100.00 |
Notice the title remains in the next output also because the title
command is a GLOBAL setting that stays in effect until you change it.
NOTE The changes will show up in the SAS
output window but are not shown here in this markdown
based output.
proc print data=dataB; run;
Set the title back to empty. In general, it is a good idea to ALWAYS have some kind of title with your output to keep track.
title;
proc print data=dataB; run;
You can also use footnotes. You can have up to 3 titles and 3 footnotes.
see more at http://support.sas.com/documentation/cdl/en/lestmtsref/69738/HTML/default/viewer.htm#p10gcmrmf83iaxn1ilrx4pra969n.htm.
title "Univariate Report for var 1 in dataA2";
title2 "Report by Melinda Higgins";
title3 "dated 09/15/2017";
footnote "this is footnote 1";
footnote2 "this is footnote 2";
footnote3 "this is footnote 3";
proc univariate data=dataA2 plots;
var var1;
histogram var1/normal;
run;
* set all back to empty;
title;
title2;
title3;
footnote;
footnote2;
footnote3;
Univariate Report for var 1 in dataA2 |
Report by Melinda Higgins |
dated 09/15/2017 |
Moments | |||
---|---|---|---|
N | 6 | Sum Weights | 6 |
Mean | 3 | Sum Observations | 18 |
Std Deviation | 1.41421356 | Variance | 2 |
Skewness | 0 | Kurtosis | -0.3 |
Uncorrected SS | 64 | Corrected SS | 10 |
Coeff Variation | 47.1404521 | Std Error Mean | 0.57735027 |
Basic Statistical Measures | |||
---|---|---|---|
Location | Variability | ||
Mean | 3.000000 | Std Deviation | 1.41421 |
Median | 3.000000 | Variance | 2.00000 |
Mode | 3.000000 | Range | 4.00000 |
Interquartile Range | 2.00000 |
Tests for Location: Mu0=0 | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Student's t | t | 5.196152 | Pr > |t| | 0.0035 |
Sign | M | 3 | Pr >= |M| | 0.0313 |
Signed Rank | S | 10.5 | Pr >= |S| | 0.0313 |
Quantiles (Definition 5) | |
---|---|
Level | Quantile |
100% Max | 5 |
99% | 5 |
95% | 5 |
90% | 5 |
75% Q3 | 4 |
50% Median | 3 |
25% Q1 | 2 |
10% | 1 |
5% | 1 |
1% | 1 |
0% Min | 1 |
Extreme Observations | |||
---|---|---|---|
Lowest | Highest | ||
Value | Obs | Value | Obs |
1 | 5 | 2 | 3 |
2 | 3 | 3 | 1 |
3 | 2 | 3 | 2 |
3 | 1 | 4 | 4 |
4 | 4 | 5 | 6 |
this is footnote 1 |
this is footnote 2 |
this is footnote 3 |
Univariate Report for var 1 in dataA2 |
Report by Melinda Higgins |
dated 09/15/2017 |
this is footnote 1 |
this is footnote 2 |
this is footnote 3 |
Univariate Report for var 1 in dataA2 |
Report by Melinda Higgins |
dated 09/15/2017 |
Parameters for Normal Distribution | ||
---|---|---|
Parameter | Symbol | Estimate |
Mean | Mu | 3 |
Std Dev | Sigma | 1.414214 |
Goodness-of-Fit Tests for Normal Distribution | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Kolmogorov-Smirnov | D | 0.16666667 | Pr > D | >0.150 |
Cramer-von Mises | W-Sq | 0.02803177 | Pr > W-Sq | >0.250 |
Anderson-Darling | A-Sq | 0.17100986 | Pr > A-Sq | >0.250 |
Quantiles for Normal Distribution | ||
---|---|---|
Percent | Quantile | |
Observed | Estimated | |
1.0 | 1.00000 | -0.28995 |
5.0 | 1.00000 | 0.67383 |
10.0 | 1.00000 | 1.18761 |
25.0 | 2.00000 | 2.04613 |
50.0 | 3.00000 | 3.00000 |
75.0 | 4.00000 | 3.95387 |
90.0 | 5.00000 | 4.81239 |
95.0 | 5.00000 | 5.32617 |
99.0 | 5.00000 | 6.28995 |
this is footnote 1 |
this is footnote 2 |
this is footnote 3 |
[1] "R"
Suppose we want to recode var1
into a new variable identifying subjects with var1
scores <=3 and those >3 (splitting the data at 3). We’ll create the new variable and then add the formatting we want using the factor()
function in R
.
Since we made var1
a “factor” type variable above, we have to use the as.numeric()
function to be able to manipulate the numeric values underlying this variable as a factor.
We also use the as.numeric()
function again since the statement x > 3
is a logic statement and returns FALSE
and TRUE
and what we want is 0
and 1
. If you do want a logic variable you can leave off this 2nd (outer) as.numeric()
function.
R Code
x <- c(1,2,3,4,5)
x > 3
[1] FALSE FALSE FALSE TRUE TRUE
typeof(x > 3)
[1] "logical"
dataA$var1r <- as.numeric(
as.numeric(dataA$var1) > 3
)
typeof(dataA$var1r)
[1] "double"
dataA$var1r <- factor(dataA$var1r,
levels=c(0,1),
labels=c("NO var1 <= 3",
"YES var1 > 3"))
kable(dataA)
id | var1 | var2 | var3 | var1r |
---|---|---|---|---|
1 | neutral | male | 45 | NO var1 <= 3 |
2 | neutral | male | 44 | NO var1 <= 3 |
3 | disagree | female | 34 | NO var1 <= 3 |
5 | agree | female | 38 | YES var1 > 3 |
8 | strongly disagree | male | 41 | NO var1 <= 3 |
9 | strongly agree | female | 39 | YES var1 > 3 |
SAS Code
From lesson 5 - merge dataA
and dataB
together.
data dataAB;
set dataA dataB;
run;
proc print data=dataAB noobs; run;
id | var1 | var2 | var3 | VAR4 |
---|---|---|---|---|
1 | 3 | m | 45 | . |
2 | 3 | m | 44 | . |
3 | 2 | f | 34 | . |
5 | 4 | f | 38 | . |
8 | 1 | m | 41 | . |
9 | 5 | f | 39 | . |
11 | 4 | f | . | 0 |
12 | 4 | m | . | 6 |
14 | 2 | m | . | 4 |
15 | 3 | f | . | 7 |
17 | 1 | f | . | 8 |
18 | 5 | m | . | 5 |
Let’s recode var1
into a new variable identifying subjects with var1
scores <=3 and those >3 (splitting the data at 3). First we’ll define the formatting we want for our new variable and after we recode it in the DATA
step we’ll apply the formatting.
proc format;
value splitYN
0 = 'NO var1 <= 3'
1 = 'YES var1 > 3';
run;
Create new dataset with recoded variables and apply new format labels.
data dataAB_new;
set dataAB;
if var1 > 3 then var1split3 = 1;
else var1split3 = 0;
format var1split3 splitYN.;
run;
Create table and plot of new variable split <=3 and >3.
proc freq data=dataAB_new;
tables var1split3/plots=freqplot;
run;