Data Wrangling

For today’s lesson, you will want to download/make a copy of the Github repository https://github.com/melindahiggins2000/N736Fall2017_lesson06. This repo contains 3 datasets we will be working with in the exercises below:

Lesson 05 covered:

For Lesson 06 we’ll cover:

In future lessons, we’ll also discuss:

Adding Labels and Other Notations

It can sometimes be helpful to add a label to your dataset or to your variables.

In R you can add a comment on your dataset which can then be pulled up using the attributes() function, which could be used later in a title or other text function.

R CODE

library(readr)
library(knitr)
dataA <- read_csv("dataA.csv")
dataB <- read_csv("dataB.csv")

comment(dataA) <- "This is a comment on dataset dataA"
comment(dataB) <- "This is a comment on dataset dataB"

attributes(dataA)
## $names
## [1] "id"   "var1" "var2" "var3"
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
## 
## $row.names
## [1] 1 2 3 4 5 6
## 
## $spec
## cols(
##   id = col_integer(),
##   var1 = col_integer(),
##   var2 = col_character(),
##   var3 = col_integer()
## )
## 
## $comment
## [1] "This is a comment on dataset dataA"
attributes(dataB)
## $names
## [1] "id"   "var1" "var2" "var4"
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
## 
## $row.names
## [1] 1 2 3 4 5 6
## 
## $spec
## cols(
##   id = col_integer(),
##   var1 = col_integer(),
##   var2 = col_character(),
##   var4 = col_integer()
## )
## 
## $comment
## [1] "This is a comment on dataset dataB"

I could use the result of the attributes() function to print out info on dataA.

print(paste("Here are the details on dataset dataA:",
            attributes(dataA)$comment))
## [1] "Here are the details on dataset dataA: This is a comment on dataset dataA"

SAS Code

Load dataA and dataB.

* ======================================;
* load dataA;
* ======================================;
proc import datafile='C:\MyGithub\N736Fall2017_lesson05\dataA.csv'
  out=dataA dbms=csv;
run;

* ======================================;
* load dataB;  
* ======================================;
proc import datafile='C:\MyGithub\N736Fall2017_lesson05\dataB.csv'
  out=dataB dbms=csv;
run;

In SAS it is possible to add a comment about the dataset, but this only shows up as a note in output like PROC CONTENTS.

data dataAcomment (label="This is a comment on dataA");
  set dataA;
run;

proc contents data=dataAcomment; run;
Data Set Name WORK.DATAACOMMENT Observations 6
Member Type DATA Variables 4
Engine V9 Indexes 0
Created 09/13/2017 12:42:23 Observation Length 32
Last Modified 09/13/2017 12:42:23 Deleted Observations 0
Protection Compressed NO
Data Set Type Sorted NO
Label This is a comment on dataA
Data Representation WINDOWS_64
Encoding wlatin1 Western (Windows)

Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 2039
Obs in First Data Page 6
Number of Data Set Repairs 0
ExtendObsCounter YES
Filename C:Temporary Files_TD12732_SON-HIGGINS_.sas7bdat
Release Created 9.0401M1
Host Created X64_8PRO

Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
1 id Num 8 BEST12. BEST32.
2 var1 Num 8 BEST12. BEST32.
3 var2 Char 1 $1. $1.
4 var3 Num 8 BEST12. BEST32.

adding labels and levels for factors in R

R Code

For var which has 5-point LiKert scale data coded from 1 to 5, let’s assume it should have the following labeling and interpretation:

1 = "strongly disagree"
2 = "disagree"
3 = "neutral"
4 = "agree"
5 = "strongly agree"

Also notice the changes in the “class”/“variable type” when we make var1 a factor.

kable(dataA)
id var1 var2 var3
1 3 m 45
2 3 m 44
3 2 f 34
5 4 f 38
8 1 m 41
9 5 f 39
class(dataA$var1)
[1] "integer"
typeof(dataA$var1)
[1] "integer"
is.numeric(dataA$var1)
[1] TRUE
dataA$var1 <- factor(dataA$var1,
                     levels=c(1,2,3,4,5),
                     labels=c("strongly disagree",
                              "disagree", "neutral",
                              "agree","strongly agree")) 

kable(dataA)
id var1 var2 var3
1 neutral m 45
2 neutral m 44
3 disagree f 34
5 agree f 38
8 strongly disagree m 41
9 strongly agree f 39
class(dataA$var1)
[1] "factor"
typeof(dataA$var1)
[1] "integer"
is.numeric(dataA$var1)
[1] FALSE

We can also apply better labeling for gender var2 which is currently a character variable coed as “m” or “f”.

kable(dataA)
id var1 var2 var3
1 neutral m 45
2 neutral m 44
3 disagree f 34
5 agree f 38
8 strongly disagree m 41
9 strongly agree f 39
class(dataA$var2)
[1] "character"
typeof(dataA$var2)
[1] "character"
is.numeric(dataA$var2)
[1] FALSE
dataA$var2 <- factor(dataA$var2,
                     levels=c("m","f"),
                     labels=c("male","female")) 

kable(dataA)
id var1 var2 var3
1 neutral male 45
2 neutral male 44
3 disagree female 34
5 agree female 38
8 strongly disagree male 41
9 strongly agree female 39
class(dataA$var2)
[1] "factor"
typeof(dataA$var2)
[1] "integer"
is.numeric(dataA$var2)
[1] FALSE

using SAS formats

SAS Code

Create some formats for likert scale variables and for the gender character variable.

proc format;
  value likert  
    1='strongly disagree'
    2='disagree'
    3='neutral'
      4='agree'
      5='strongly agree';
  value $gender  
    'm'='male'
    'f'='female';
run;
    

See what it looks like without the format applied.

proc print data=dataA; run;
Obs id var1 var2 var3
1 1 3 m 45
2 2 3 m 44
3 3 2 f 34
4 5 4 f 38
5 8 1 m 41
6 9 5 f 39

Apply the format to a new dataset dataA2 for the var1 variable.

data dataA2;
  set dataA;
  format var1 likert.;
run;

proc print data=dataA2; run;
Obs id var1 var2 var3
1 1 neutral m 45
2 2 neutral m 44
3 3 disagree f 34
4 5 agree f 38
5 8 strongly disagree m 41
6 9 strongly agree f 39

You don’t have to use the format in a DATA step, you can simply use it when you need it, such as within a proc print statement to get the formatting you want in the output. Let’s use the format statement here when we print the data table for dataB.

proc print data=dataB;
  format var2 $gender.;
  run;
Obs id var1 var2 VAR4
1 11 4 female 0
2 12 4 male 6
3 14 2 male 4
4 15 3 female 7
5 17 1 female 8
6 18 5 male 5

We can also use the format statement here to print the likert scale labels in a frequency table. First, without formatting and again with the formatting - compare the output.

* run a frequency table without formatting;

proc freq data=dataB;
  tables var2;
  run;

* add the formatting for the table;

proc freq data=dataB;
  format var2 $gender.;
  tables var2;
  run;
var2 Frequency Percent Cumulative
Frequency
Cumulative
Percent
f 3 50.00 3 50.00
m 3 50.00 6 100.00



var2 Frequency Percent Cumulative
Frequency
Cumulative
Percent
female 3 50.00 3 50.00
male 3 50.00 6 100.00

tables with labels

Frequency table and plots without format applied for dataA and again with format applied for dataA2.

SAS Code

proc freq data=dataA;
 tables var1;
 run;

proc freq data=dataA2;
 tables var1;
 run;
var1 Frequency Percent Cumulative
Frequency
Cumulative
Percent
1 1 16.67 1 16.67
2 1 16.67 2 33.33
3 2 33.33 4 66.67
4 1 16.67 5 83.33
5 1 16.67 6 100.00



var1 Frequency Percent Cumulative
Frequency
Cumulative
Percent
strongly disagree 1 16.67 1 16.67
disagree 1 16.67 2 33.33
neutral 2 33.33 4 66.67
agree 1 16.67 5 83.33
strongly agree 1 16.67 6 100.00

Other labeling in SAS - titles

Here is an example of adding a title to the ODS output.

proc freq data=dataB;
  title "Frequency Table of Gender";
  format var2 $gender.;
  tables var2;
  run;
Frequency Table of Gender

var2 Frequency Percent Cumulative
Frequency
Cumulative
Percent
female 3 50.00 3 50.00
male 3 50.00 6 100.00

Notice the title remains in the next output also because the title command is a GLOBAL setting that stays in effect until you change it.

NOTE The changes will show up in the SAS output window but are not shown here in this markdown based output.

proc print data=dataB; run;
Obs id var1 var2 VAR4
1 11 4 f 0
2 12 4 m 6
3 14 2 m 4
4 15 3 f 7
5 17 1 f 8
6 18 5 m 5

Set the title back to empty. In general, it is a good idea to ALWAYS have some kind of title with your output to keep track.

title;

proc print data=dataB; run;
Obs id var1 var2 VAR4
1 11 4 f 0
2 12 4 m 6
3 14 2 m 4
4 15 3 f 7
5 17 1 f 8
6 18 5 m 5

You can also use footnotes. You can have up to 3 titles and 3 footnotes.

see more at http://support.sas.com/documentation/cdl/en/lestmtsref/69738/HTML/default/viewer.htm#p10gcmrmf83iaxn1ilrx4pra969n.htm.

title "Univariate Report for var 1 in dataA2";
title2 "Report by Melinda Higgins";
title3 "dated 09/15/2017";
footnote "this is footnote 1";
footnote2 "this is footnote 2";
footnote3 "this is footnote 3";

proc univariate data=dataA2 plots;
  var var1;
  histogram var1/normal;
  run;

* set all back to empty;
title;
title2;
title3;
footnote;
footnote2;
footnote3;
Univariate Report for var 1 in dataA2
Report by Melinda Higgins
dated 09/15/2017

Variable: var1

Moments
N 6 Sum Weights 6
Mean 3 Sum Observations 18
Std Deviation 1.41421356 Variance 2
Skewness 0 Kurtosis -0.3
Uncorrected SS 64 Corrected SS 10
Coeff Variation 47.1404521 Std Error Mean 0.57735027

Basic Statistical Measures
Location Variability
Mean 3.000000 Std Deviation 1.41421
Median 3.000000 Variance 2.00000
Mode 3.000000 Range 4.00000
Interquartile Range 2.00000

Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 5.196152 Pr > |t| 0.0035
Sign M 3 Pr >= |M| 0.0313
Signed Rank S 10.5 Pr >= |S| 0.0313

Quantiles (Definition 5)
Level Quantile
100% Max 5
99% 5
95% 5
90% 5
75% Q3 4
50% Median 3
25% Q1 2
10% 1
5% 1
1% 1
0% Min 1

Extreme Observations
Lowest Highest
Value Obs Value Obs
1 5 2 3
2 3 3 1
3 2 3 2
3 1 4 4
4 4 5 6

Plots for var1


this is footnote 1
this is footnote 2
this is footnote 3



Univariate Report for var 1 in dataA2
Report by Melinda Higgins
dated 09/15/2017

Histogram for var1


this is footnote 1
this is footnote 2
this is footnote 3



Univariate Report for var 1 in dataA2
Report by Melinda Higgins
dated 09/15/2017

Fitted Normal Distribution for var1

Parameters for Normal Distribution
Parameter Symbol Estimate
Mean Mu 3
Std Dev Sigma 1.414214

Goodness-of-Fit Tests for Normal Distribution
Test Statistic p Value
Kolmogorov-Smirnov D 0.16666667 Pr > D >0.150
Cramer-von Mises W-Sq 0.02803177 Pr > W-Sq >0.250
Anderson-Darling A-Sq 0.17100986 Pr > A-Sq >0.250

Quantiles for Normal Distribution
Percent Quantile
Observed Estimated
1.0 1.00000 -0.28995
5.0 1.00000 0.67383
10.0 1.00000 1.18761
25.0 2.00000 2.04613
50.0 3.00000 3.00000
75.0 4.00000 3.95387
90.0 5.00000 4.81239
95.0 5.00000 5.32617
99.0 5.00000 6.28995

this is footnote 1
this is footnote 2
this is footnote 3

[1] "R"

Recoding

Suppose we want to recode var1 into a new variable identifying subjects with var1 scores <=3 and those >3 (splitting the data at 3). We’ll create the new variable and then add the formatting we want using the factor() function in R.

Since we made var1 a “factor” type variable above, we have to use the as.numeric() function to be able to manipulate the numeric values underlying this variable as a factor.

We also use the as.numeric() function again since the statement x > 3 is a logic statement and returns FALSE and TRUE and what we want is 0 and 1. If you do want a logic variable you can leave off this 2nd (outer) as.numeric() function.

R Code

x <- c(1,2,3,4,5)
x > 3
[1] FALSE FALSE FALSE  TRUE  TRUE
typeof(x > 3)
[1] "logical"
dataA$var1r <- as.numeric(
  as.numeric(dataA$var1) > 3
  )

typeof(dataA$var1r)
[1] "double"
dataA$var1r <- factor(dataA$var1r,
                      levels=c(0,1),
                      labels=c("NO var1 <= 3",
                               "YES var1 > 3")) 

kable(dataA)
id var1 var2 var3 var1r
1 neutral male 45 NO var1 <= 3
2 neutral male 44 NO var1 <= 3
3 disagree female 34 NO var1 <= 3
5 agree female 38 YES var1 > 3
8 strongly disagree male 41 NO var1 <= 3
9 strongly agree female 39 YES var1 > 3

SAS Code

From lesson 5 - merge dataA and dataB together.

data dataAB;
  set dataA dataB;
run;

proc print data=dataAB noobs; run;
id var1 var2 var3 VAR4
1 3 m 45 .
2 3 m 44 .
3 2 f 34 .
5 4 f 38 .
8 1 m 41 .
9 5 f 39 .
11 4 f . 0
12 4 m . 6
14 2 m . 4
15 3 f . 7
17 1 f . 8
18 5 m . 5

Let’s recode var1 into a new variable identifying subjects with var1 scores <=3 and those >3 (splitting the data at 3). First we’ll define the formatting we want for our new variable and after we recode it in the DATA step we’ll apply the formatting.

proc format;
  value splitYN
    0 = 'NO var1 <= 3'
    1 = 'YES var1 > 3';
run;

Create new dataset with recoded variables and apply new format labels.

data dataAB_new;
  set dataAB;
  if var1 > 3 then var1split3 = 1;
  else             var1split3 = 0;
  format var1split3 splitYN.;
run;

Create table and plot of new variable split <=3 and >3.

proc freq data=dataAB_new;
 tables var1split3/plots=freqplot;
 run;
var1split3 Frequency Percent Cumulative
Frequency
Cumulative
Percent
NO var1 <= 3 7 58.33 7 58.33
YES var1 > 3 5 41.67 12 100.00

Bar Chart of Frequencies for var1split3