Data Wrangling

For today’s lesson, you will want to download/make a copy of the Github repository https://github.com/melindahiggins2000/N736Fall2017_lesson06. This repo contains 3 datasets we will be working with in the exercises below:

dataA.csv - has 3 variables (var1, var2, var3) and 6 subjects (1,2,3,5,8,9)
dataB.csv - has 3 variables: 2 same and 1 different than dataA (var1, var2, var4) and 6 more subjects (11,12,14,15,17,18)
dataC.csv - has 3 more variables (var5, var6, var7) and 6 subjects (1,2,3,4,7,10) some in common with dataA and some different

Lesson 05 covered:

merging;
joining;
stacking; and
concatenating 2 datasets together.

For Lesson 06 we’ll cover:

adding labels and other notations
recoding

In future lessons, we’ll also discuss:

manipulating data types
switching from numeric to character and back
handling dates and times
handling factors in R specifically - but in SAS (and in SPSS), numeric variables assumed to be continuous are modeled differently than numeric variables treated as “factors” (i.e. ordinal or categorical data represented numerically).

Adding Labels and Other Notations

It can sometimes be helpful to add a label to your dataset or to your variables.

In R you can add a comment on your dataset which can then be pulled up using the attributes() function, which could be used later in a title or other text function.

R CODE

library(readr)
library(knitr)
dataA <- read_csv("dataA.csv")
dataB <- read_csv("dataB.csv")

comment(dataA) <- "This is a comment on dataset dataA"
comment(dataB) <- "This is a comment on dataset dataB"

attributes(dataA)

## $names
## [1] "id"   "var1" "var2" "var3"
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
## 
## $row.names
## [1] 1 2 3 4 5 6
## 
## $spec
## cols(
##   id = col_integer(),
##   var1 = col_integer(),
##   var2 = col_character(),
##   var3 = col_integer()
## )
## 
## $comment
## [1] "This is a comment on dataset dataA"

attributes(dataB)

## $names
## [1] "id"   "var1" "var2" "var4"
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
## 
## $row.names
## [1] 1 2 3 4 5 6
## 
## $spec
## cols(
##   id = col_integer(),
##   var1 = col_integer(),
##   var2 = col_character(),
##   var4 = col_integer()
## )
## 
## $comment
## [1] "This is a comment on dataset dataB"

I could use the result of the attributes() function to print out info on dataA.

print(paste("Here are the details on dataset dataA:",
            attributes(dataA)$comment))

## [1] "Here are the details on dataset dataA: This is a comment on dataset dataA"

SAS Code

Load dataA and dataB.

* ======================================;
* load dataA;
* ======================================;
proc import datafile='C:\MyGithub\N736Fall2017_lesson05\dataA.csv'
  out=dataA dbms=csv;
run;

* ======================================;
* load dataB;  
* ======================================;
proc import datafile='C:\MyGithub\N736Fall2017_lesson05\dataB.csv'
  out=dataB dbms=csv;
run;

In SAS it is possible to add a comment about the dataset, but this only shows up as a note in output like PROC CONTENTS.

data dataAcomment (label="This is a comment on dataA");
  set dataA;
run;

proc contents data=dataAcomment; run;

Data Set Name	WORK.DATAACOMMENT	Observations	6
Member Type	DATA	Variables	4
Engine	V9	Indexes	0
Created	09/13/2017 12:42:23	Observation Length	32
Last Modified	09/13/2017 12:42:23	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label	This is a comment on dataA
Data Representation	WINDOWS_64
Encoding	wlatin1 Western (Windows)

Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	2039
Obs in First Data Page	6
Number of Data Set Repairs	0
ExtendObsCounter	YES
Filename	C:Temporary Files_TD12732_SON-HIGGINS_.sas7bdat
Release Created	9.0401M1
Host Created	X64_8PRO

Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
1	id	Num	8	BEST12.	BEST32.
2	var1	Num	8	BEST12.	BEST32.
3	var2	Char	1	$1.	$1.
4	var3	Num	8	BEST12.	BEST32.

adding labels and levels for `factors` in `R`

R Code

For var which has 5-point LiKert scale data coded from 1 to 5, let’s assume it should have the following labeling and interpretation:

1 = "strongly disagree"
2 = "disagree"
3 = "neutral"
4 = "agree"
5 = "strongly agree"

Also notice the changes in the “class”/“variable type” when we make var1 a factor.

kable(dataA)

id	var1	var2	var3
1	3	m	45
2	3	m	44
3	2	f	34
5	4	f	38
8	1	m	41
9	5	f	39

class(dataA$var1)

[1] "integer"

typeof(dataA$var1)

[1] "integer"

is.numeric(dataA$var1)

[1] TRUE

dataA$var1 <- factor(dataA$var1,
                     levels=c(1,2,3,4,5),
                     labels=c("strongly disagree",
                              "disagree", "neutral",
                              "agree","strongly agree")) 

kable(dataA)

id	var1	var2	var3
1	neutral	m	45
2	neutral	m	44
3	disagree	f	34
5	agree	f	38
8	strongly disagree	m	41
9	strongly agree	f	39

class(dataA$var1)

[1] "factor"

typeof(dataA$var1)

[1] "integer"

is.numeric(dataA$var1)

[1] FALSE

We can also apply better labeling for gender var2 which is currently a character variable coed as “m” or “f”.

kable(dataA)

id	var1	var2	var3
1	neutral	m	45
2	neutral	m	44
3	disagree	f	34
5	agree	f	38
8	strongly disagree	m	41
9	strongly agree	f	39

class(dataA$var2)

[1] "character"

typeof(dataA$var2)

[1] "character"

is.numeric(dataA$var2)

[1] FALSE

dataA$var2 <- factor(dataA$var2,
                     levels=c("m","f"),
                     labels=c("male","female")) 

kable(dataA)

id	var1	var2	var3
1	neutral	male	45
2	neutral	male	44
3	disagree	female	34
5	agree	female	38
8	strongly disagree	male	41
9	strongly agree	female	39

class(dataA$var2)

[1] "factor"

typeof(dataA$var2)

[1] "integer"

is.numeric(dataA$var2)

[1] FALSE

using `SAS` formats

SAS Code

Create some formats for likert scale variables and for the gender character variable.

proc format;
  value likert  
    1='strongly disagree'
    2='disagree'
    3='neutral'
      4='agree'
      5='strongly agree';
  value $gender  
    'm'='male'
    'f'='female';
run;

See what it looks like without the format applied.

proc print data=dataA; run;

Obs	id	var1	var2	var3
1	1	3	m	45
2	2	3	m	44
3	3	2	f	34
4	5	4	f	38
5	8	1	m	41
6	9	5	f	39

Apply the format to a new dataset dataA2 for the var1 variable.

data dataA2;
  set dataA;
  format var1 likert.;
run;

proc print data=dataA2; run;

Obs	id	var1	var2	var3
1	1	neutral	m	45
2	2	neutral	m	44
3	3	disagree	f	34
4	5	agree	f	38
5	8	strongly disagree	m	41
6	9	strongly agree	f	39

You don’t have to use the format in a DATA step, you can simply use it when you need it, such as within a proc print statement to get the formatting you want in the output. Let’s use the format statement here when we print the data table for dataB.

proc print data=dataB;
  format var2 $gender.;
  run;

Obs	id	var1	var2	VAR4
1	11	4	female	0
2	12	4	male	6
3	14	2	male	4
4	15	3	female	7
5	17	1	female	8
6	18	5	male	5

We can also use the format statement here to print the likert scale labels in a frequency table. First, without formatting and again with the formatting - compare the output.

* run a frequency table without formatting;

proc freq data=dataB;
  tables var2;
  run;

* add the formatting for the table;

proc freq data=dataB;
  format var2 $gender.;
  tables var2;
  run;

var2	Frequency	Percent	Cumulative Frequency	Cumulative Percent
f	3	50.00	3	50.00
m	3	50.00	6	100.00

var2	Frequency	Percent	Cumulative Frequency	Cumulative Percent
female	3	50.00	3	50.00
male	3	50.00	6	100.00

tables with labels

Frequency table and plots without format applied for dataA and again with format applied for dataA2.

SAS Code

proc freq data=dataA;
 tables var1;
 run;

proc freq data=dataA2;
 tables var1;
 run;

var1	Frequency	Percent	Cumulative Frequency	Cumulative Percent
1	1	16.67	1	16.67
2	1	16.67	2	33.33
3	2	33.33	4	66.67
4	1	16.67	5	83.33
5	1	16.67	6	100.00

var1	Frequency	Percent	Cumulative Frequency	Cumulative Percent
strongly disagree	1	16.67	1	16.67
disagree	1	16.67	2	33.33
neutral	2	33.33	4	66.67
agree	1	16.67	5	83.33
strongly agree	1	16.67	6	100.00

Other labeling in `SAS` - titles

Here is an example of adding a title to the ODS output.

proc freq data=dataB;
  title "Frequency Table of Gender";
  format var2 $gender.;
  tables var2;
  run;

Frequency Table of Gender

var2	Frequency	Percent	Cumulative Frequency	Cumulative Percent
female	3	50.00	3	50.00
male	3	50.00	6	100.00

Notice the title remains in the next output also because the title command is a GLOBAL setting that stays in effect until you change it.

NOTE The changes will show up in the SAS output window but are not shown here in this markdown based output.

proc print data=dataB; run;

Obs	id	var1	var2	VAR4
1	11	4	f	0
2	12	4	m	6
3	14	2	m	4
4	15	3	f	7
5	17	1	f	8
6	18	5	m	5

Set the title back to empty. In general, it is a good idea to ALWAYS have some kind of title with your output to keep track.

title;

proc print data=dataB; run;

Obs	id	var1	var2	VAR4
1	11	4	f	0
2	12	4	m	6
3	14	2	m	4
4	15	3	f	7
5	17	1	f	8
6	18	5	m	5

You can also use footnotes. You can have up to 3 titles and 3 footnotes.

see more at http://support.sas.com/documentation/cdl/en/lestmtsref/69738/HTML/default/viewer.htm#p10gcmrmf83iaxn1ilrx4pra969n.htm.

title "Univariate Report for var 1 in dataA2";
title2 "Report by Melinda Higgins";
title3 "dated 09/15/2017";
footnote "this is footnote 1";
footnote2 "this is footnote 2";
footnote3 "this is footnote 3";

proc univariate data=dataA2 plots;
  var var1;
  histogram var1/normal;
  run;

* set all back to empty;
title;
title2;
title3;
footnote;
footnote2;
footnote3;

Univariate Report for var 1 in dataA2

Report by Melinda Higgins

dated 09/15/2017

Variable: var1

Moments
N	6	Sum Weights	6
Mean	3	Sum Observations	18
Std Deviation	1.41421356	Variance	2
Skewness	0	Kurtosis	-0.3
Uncorrected SS	64	Corrected SS	10
Coeff Variation	47.1404521	Std Error Mean	0.57735027

Basic Statistical Measures
Location		Variability
Mean	3.000000	Std Deviation	1.41421
Median	3.000000	Variance	2.00000
Mode	3.000000	Range	4.00000
		Interquartile Range	2.00000

Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	5.196152	Pr > \|t\|	0.0035
Sign	M	3	Pr >= \|M\|	0.0313
Signed Rank	S	10.5	Pr >= \|S\|	0.0313

Quantiles (Definition 5)
Level	Quantile
100% Max	5
99%	5
95%	5
90%	5
75% Q3	4
50% Median	3
25% Q1	2
10%	1
5%	1
1%	1
0% Min	1

Extreme Observations
Lowest		Highest
Value	Obs	Value	Obs
1	5	2	3
2	3	3	1
3	2	3	2
3	1	4	4
4	4	5	6

Plots for var1

this is footnote 1

this is footnote 2

this is footnote 3

Univariate Report for var 1 in dataA2

Report by Melinda Higgins

dated 09/15/2017

Histogram for var1

this is footnote 1

this is footnote 2

this is footnote 3

Univariate Report for var 1 in dataA2

Report by Melinda Higgins

dated 09/15/2017

Fitted Normal Distribution for var1

Parameters for Normal Distribution
Parameter	Symbol	Estimate
Mean	Mu	3
Std Dev	Sigma	1.414214

Goodness-of-Fit Tests for Normal Distribution
Test	Statistic		p Value
Kolmogorov-Smirnov	D	0.16666667	Pr > D	>0.150
Cramer-von Mises	W-Sq	0.02803177	Pr > W-Sq	>0.250
Anderson-Darling	A-Sq	0.17100986	Pr > A-Sq	>0.250

Quantiles for Normal Distribution
Percent	Quantile
Percent	Observed	Estimated
1.0	1.00000	-0.28995
5.0	1.00000	0.67383
10.0	1.00000	1.18761
25.0	2.00000	2.04613
50.0	3.00000	3.00000
75.0	4.00000	3.95387
90.0	5.00000	4.81239
95.0	5.00000	5.32617
99.0	5.00000	6.28995

this is footnote 1

this is footnote 2

this is footnote 3

[1] "R"

Recoding

Suppose we want to recode var1 into a new variable identifying subjects with var1 scores <=3 and those >3 (splitting the data at 3). We’ll create the new variable and then add the formatting we want using the factor() function in R.

Since we made var1 a “factor” type variable above, we have to use the as.numeric() function to be able to manipulate the numeric values underlying this variable as a factor.

We also use the as.numeric() function again since the statement x > 3 is a logic statement and returns FALSE and TRUE and what we want is 0 and 1. If you do want a logic variable you can leave off this 2nd (outer) as.numeric() function.

R Code

x <- c(1,2,3,4,5)
x > 3

[1] FALSE FALSE FALSE  TRUE  TRUE

typeof(x > 3)

[1] "logical"

dataA$var1r <- as.numeric(
  as.numeric(dataA$var1) > 3
  )

typeof(dataA$var1r)

[1] "double"

dataA$var1r <- factor(dataA$var1r,
                      levels=c(0,1),
                      labels=c("NO var1 <= 3",
                               "YES var1 > 3")) 

kable(dataA)

id	var1	var2	var3	var1r
1	neutral	male	45	NO var1 <= 3
2	neutral	male	44	NO var1 <= 3
3	disagree	female	34	NO var1 <= 3
5	agree	female	38	YES var1 > 3
8	strongly disagree	male	41	NO var1 <= 3
9	strongly agree	female	39	YES var1 > 3

SAS Code

From lesson 5 - merge dataA and dataB together.

data dataAB;
  set dataA dataB;
run;

proc print data=dataAB noobs; run;

id	var1	var2	var3	VAR4
1	3	m	45	.
2	3	m	44	.
3	2	f	34	.
5	4	f	38	.
8	1	m	41	.
9	5	f	39	.
11	4	f	.	0
12	4	m	.	6
14	2	m	.	4
15	3	f	.	7
17	1	f	.	8
18	5	m	.	5

Let’s recode var1 into a new variable identifying subjects with var1 scores <=3 and those >3 (splitting the data at 3). First we’ll define the formatting we want for our new variable and after we recode it in the DATA step we’ll apply the formatting.

proc format;
  value splitYN
    0 = 'NO var1 <= 3'
    1 = 'YES var1 > 3';
run;

Create new dataset with recoded variables and apply new format labels.

data dataAB_new;
  set dataAB;
  if var1 > 3 then var1split3 = 1;
  else             var1split3 = 0;
  format var1split3 splitYN.;
run;

Create table and plot of new variable split <=3 and >3.

proc freq data=dataAB_new;
 tables var1split3/plots=freqplot;
 run;

var1split3	Frequency	Percent	Cumulative Frequency	Cumulative Percent
NO var1 <= 3	7	58.33	7	58.33
YES var1 > 3	5	41.67	12	100.00

Bar Chart of Frequencies for var1split3

Data Wrangling

Melinda K. Higgins, PhD.

September 13, 2017

Data Wrangling

Adding Labels and Other Notations

adding labels and levels for `factors` in `R`

using `SAS` formats

tables with labels

Other labeling in `SAS` - titles

Recoding

id	var1	var2	var3	VAR4
1	3	m	45	.
2	3	m	44	.
3	2	f	34	.
5	4	f	38	.
8	1	m	41	.
9	5	f	39	.
11	4	f	.	0
12	4	m	.	6
14	2	m	.	4
15	3	f	.	7
17	1	f	.	8
18	5	m	.	5

id	var1	var2	var3	VAR4
1	3	m	45	.
2	3	m	44	.
3	2	f	34	.
5	4	f	38	.
8	1	m	41	.
9	5	f	39	.
11	4	f	.	0
12	4	m	.	6
14	2	m	.	4
15	3	f	.	7
17	1	f	.	8
18	5	m	.	5

Data Wrangling

Melinda K. Higgins, PhD.

September 13, 2017

Data Wrangling

Adding Labels and Other Notations

adding labels and levels for factors in R

using SAS formats

tables with labels

Other labeling in SAS - titles

Recoding

adding labels and levels for `factors` in `R`

using `SAS` formats

Other labeling in `SAS` - titles

id	var1	var2	var3	VAR4
1	3	m	45	.
2	3	m	44	.
3	2	f	34	.
5	4	f	38	.
8	1	m	41	.
9	5	f	39	.
11	4	f	.	0
12	4	m	.	6
14	2	m	.	4
15	3	f	.	7
17	1	f	.	8
18	5	m	.	5