How to Change Variable Names While Reading Data in R
Addressing Data
Overview
Teaching: 20 min
Exercises: 0 min
Questions
What are the different methods for accessing parts of a data frame?
Objectives
Understand the three different ways R can address data inside a data frame.
Combine different methods for addressing data with the assignment operator to update subsets of data.
R is a powerful language for data manipulation. There are 3 main ways for addressing data inside R objects.
- By index (subsetting)
- By logical vector
- By name
Let's start by loading some sample data:
dat <- read.csv(file = 'data/sample.csv', header = TRUE, stringsAsFactors = FALSE)
The first row of this csv file is a list of column names. We used the header = TRUE argument to read.csv so that R can interpret the file correctly. We pass stringsAsFactors = FALSE so that text columns are read as character vectors rather than factors. Before R 4.0.0 the default was TRUE, so this argument overrides the default on older versions; from 4.0.0 onwards it simply makes the choice explicit. Using factors in R is covered in a separate lesson.
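As a rough illustration of what these two arguments do, the sketch below writes a three-row stand-in file to a temporary location and reads it back. The temporary file and the names tmp and small are assumptions made for this example; the values are copied from the first rows of the lesson data, not read from data/sample.csv.

```r
# Stand-in CSV: the first line holds the column names.
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,Gender,BloodPressure",
             "Sub001,m,132",
             "Sub002,m,139",
             "Sub003,m,130"), tmp)

# header = TRUE: the first line becomes the column names.
# stringsAsFactors = FALSE: text columns stay character vectors.
small <- read.csv(file = tmp, header = TRUE, stringsAsFactors = FALSE)

names(small)         # column names taken from the first line
class(small$Gender)  # character, not factor
class(small)         # data.frame
unlink(tmp)          # remove the temporary file
```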
Let's take a look at this data.
R has loaded the contents of the .csv file into a variable called dat, which is a data frame.
We can compactly display the internal structure of a data frame using the structure function str.
'data.frame':	100 obs. of  9 variables:
 $ ID           : chr  "Sub001" "Sub002" "Sub003" "Sub004" ...
 $ Gender       : chr  "m" "m" "m" "f" ...
 $ Group        : chr  "Control" "Treatment2" "Treatment2" "Treatment1" ...
 $ BloodPressure: int  132 139 130 105 125 112 173 108 131 129 ...
 $ Age          : num  16 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 ...
 $ Aneurisms_q1 : int  114 148 196 199 188 260 135 216 117 188 ...
 $ Aneurisms_q2 : int  140 209 251 140 120 266 98 238 215 144 ...
 $ Aneurisms_q3 : int  202 248 122 233 222 320 154 279 181 192 ...
 $ Aneurisms_q4 : int  237 248 177 220 228 294 245 251 272 185 ...
The str function tells us that the data has 100 rows and 9 columns. It also tells us that the data frame is made up of character (chr), integer (int) and numeric (num) vectors.
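To see how str reports dimensions and column types, here is a minimal sketch on a small stand-in data frame. The three rows are copied from the lesson data; the reduced set of columns is an assumption made to keep the example short.

```r
# Stand-in for dat: only three rows and three columns.
dat <- data.frame(
  ID = c("Sub001", "Sub002", "Sub003"),
  BloodPressure = c(132L, 139L, 130L),
  Age = c(16.0, 17.2, 19.5),
  stringsAsFactors = FALSE
)

str(dat)            # compact summary: dimensions plus one line per column
dim(dat)            # number of rows, then number of columns
sapply(dat, class)  # class of each column
```

dim and sapply(dat, class) return the same facts as str in a form that can be used in further code.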
      ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
1 Sub001      m    Control           132 16.0          114          140
2 Sub002      m Treatment2           139 17.2          148          209
3 Sub003      m Treatment2           130 19.5          196          251
4 Sub004      f Treatment1           105 15.7          199          140
5 Sub005      m Treatment1           125 19.9          188          120
6 Sub006      M Treatment2           112 14.3          260          266
  Aneurisms_q3 Aneurisms_q4
1          202          237
2          248          248
3          122          177
4          233          220
5          222          228
6          320          294
The data are the results of a (fictional) experiment looking at the number of aneurysms that formed in the eyes of patients who underwent 3 different treatments.
Addressing by Index
Data can be accessed by index. We have already seen how square brackets [ can be used to subset data (sometimes also called "slicing"). The generic format is dat[row_numbers, column_numbers].
Selecting Values
What will be returned by dat[1, 1]? Think about the number of rows and columns you would expect as the result.
Solution
[1] "Sub001"
If we leave out a dimension R will interpret this as a request for all values in that dimension.
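The difference between giving both indices and leaving one out can be sketched on a six-row stand-in for dat. The values are the first six rows of the lesson data; building them with data.frame() instead of reading data/sample.csv is an assumption made so the example runs on its own.

```r
# Six-row stand-in for dat, same column order as the lesson data.
dat <- data.frame(
  ID = c("Sub001", "Sub002", "Sub003", "Sub004", "Sub005", "Sub006"),
  Gender = c("m", "m", "m", "f", "m", "M"),
  Group = c("Control", "Treatment2", "Treatment2",
            "Treatment1", "Treatment1", "Treatment2"),
  BloodPressure = c(132L, 139L, 130L, 105L, 125L, 112L),
  Age = c(16.0, 17.2, 19.5, 15.7, 19.9, 14.3),
  Aneurisms_q1 = c(114L, 148L, 196L, 199L, 188L, 260L),
  Aneurisms_q2 = c(140L, 209L, 251L, 140L, 120L, 266L),
  Aneurisms_q3 = c(202L, 248L, 122L, 233L, 222L, 320L),
  Aneurisms_q4 = c(237L, 248L, 177L, 220L, 228L, 294L),
  stringsAsFactors = FALSE
)

dat[1, 1]      # row 1, column 1: a single value
dat[1, ]       # column index left out: every column of row 1
dim(dat[1, ])  # one row, nine columns
```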
Selecting More Values
What will be returned by dat[, 2]?
Solution
 [1] "m" "m" "m" "f" "m" "M" "f" "m" "m" "f" "m" "f" "f" "m" "m" "m" "f" "m"
[19] "m" "F" "f" "m" "f" "f" "m" "M" "M" "f" "m" "f" "f" "m" "m" "m" "m" "f"
[37] "f" "m" "M" "m" "f" "m" "m" "m" "f" "f" "M" "M" "m" "m" "m" "f" "f" "f"
[55] "m" "f" "m" "m" "m" "f" "f" "f" "f" "M" "f" "m" "f" "f" "M" "m" "m" "m"
[73] "F" "m" "m" "f" "M" "M" "M" "f" "m" "M" "M" "m" "m" "f" "f" "f" "m" "m"
[91] "f" "m" "F" "f" "m" "m" "F" "m" "M" "M"
The colon : can be used to create a sequence of integers. For example, 6:9 creates a vector of the numbers from 6 to 9.
This can be very useful for addressing data.
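A short sketch of how the colon operator behaves, using only base R. The variable name cols and the use of the built-in letters vector are choices made for this illustration.

```r
cols <- 6:9     # the integers 6, 7, 8, 9
cols
length(cols)    # four elements

# A sequence can be used directly as an index.
letters[2:4]    # second to fourth letters of the alphabet

# The sequence counts down when the first number is larger.
10:7
```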
Subsetting with Sequences
Use the colon operator to index just the aneurism count data (columns 6 to 9).
Solution
dat[, 6:9]
    Aneurisms_q1 Aneurisms_q2 Aneurisms_q3 Aneurisms_q4
1            114          140          202          237
2            148          209          248          248
3            196          251          122          177
4            199          140          233          220
5            188          120          222          228
6            260          266          320          294
7            135           98          154          245
8            216          238          279          251
9            117          215          181          272
10           188          144          192          185
11           134          155          247          223
12           152          177          323          245
13           112          220          225          195
14           109          150          177          189
15           146          140          239          223
16            97          172          203          207
17           165          157          200          193
18           158          265          243          187
19           178          109          206          182
20           107          188          167          218
21           174          160          203          183
22            97          110          194          133
23           187          239          281          214
24           188          191          256          265
25           114          199          242          195
26           115          160          158          228
27           128          249          294          315
28           112          230          281          126
29           136          109          105          155
30           103          148          219          228
31           132          151          234          162
32           118          154          260          160
33           166          176          253          233
34           152          105          197          299
35           191          148          166          185
36           152          178          158          170
37           161          270          232          284
38           239          184          317          269
39           132          137          193          206
40           168          255          273          274
41           140          184          239          202
42           166           85          179          196
43           141          160          179          239
44           161          168          212          181
45           103          111          254          126
46           231          240          260          310
47           192          141          180          225
48           178          180          169          183
49           167          123          236          224
50           135          150          208          279
51           150          166          153          204
52           192           80          138          222
53           153          153          236          216
54           205          264          269          207
55           117          194          216          211
56           199          119          183          251
57           182          129          226          218
58           180          196          250          294
59           111          111          244          201
60           101           98          178          116
61           166          167          232          241
62           158          171          237          212
63           189          178          177          238
64           189          101          193          172
65           239          189          297          300
66           185          224          151          182
67           224          112          304          288
68           104          139          211          204
69           222          199          280          196
70           107           98          204          138
71           153          255          218          234
72           118          165          220          227
73           102          184          246          222
74           188          125          191          157
75           180          283          204          298
76           178          214          291          240
77           168          184          184          229
78           118          170          249          249
79           169          114          248          233
80           156          138          218          258
81           232          211          219          246
82           188          108          180          136
83           169          168          180          211
84           241          233          292          182
85            65          207          234          235
86           225          185          195          235
87           104          116          173          221
88           179          158          216          244
89           103          140          209          186
90           112          130          175          191
91           226          170          307          244
92           228          221          316          259
93           209          142          199          184
94           153          104          194          214
95           111          118          173          191
96           148          132          200          194
97           141          196          322          273
98           193          112          123          181
99           130          226          286          281
100          126          157          129          160
Finally we can use the c() (combine) function to address non-sequential rows and columns.
dat[c(1, 5, 7, 9), 1:5]
      ID Gender      Group BloodPressure  Age
1 Sub001      m    Control           132 16.0
5 Sub005      m Treatment1           125 19.9
7 Sub007      f    Control           173 17.7
9 Sub009      m Treatment2           131 19.4
Returns the first 5 columns for patients in rows 1, 5, 7 and 9.
Subsetting Non-Sequential Data
Write code to return the age and gender values for the first 5 patients.
Solution
   Age Gender
1 16.0      m
2 17.2      m
3 19.5      m
4 15.7      f
5 19.9      m
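One way to produce a result shaped like the solution output is sketched below. It assumes the lesson's column order (Gender in position 2, Age in position 5) and uses a six-row stand-in built from the first rows of the lesson data; the name first_five is hypothetical.

```r
# Six-row stand-in for dat, same column order as the lesson data.
dat <- data.frame(
  ID = c("Sub001", "Sub002", "Sub003", "Sub004", "Sub005", "Sub006"),
  Gender = c("m", "m", "m", "f", "m", "M"),
  Group = c("Control", "Treatment2", "Treatment2",
            "Treatment1", "Treatment1", "Treatment2"),
  BloodPressure = c(132L, 139L, 130L, 105L, 125L, 112L),
  Age = c(16.0, 17.2, 19.5, 15.7, 19.9, 14.3),
  stringsAsFactors = FALSE
)

# Rows 1 to 5 as a sequence; columns 5 (Age) then 2 (Gender) combined with c().
first_five <- dat[1:5, c(5, 2)]
first_five
```

The order of the numbers inside c() sets the order of the returned columns, which is why Age appears before Gender.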
Addressing by Name
Columns in an R data frame are named.
[1] "ID"            "Gender"        "Group"         "BloodPressure"
[5] "Age"           "Aneurisms_q1"  "Aneurisms_q2"  "Aneurisms_q3"
[9] "Aneurisms_q4"
Default Names
If column names are not specified, e.g. using header = FALSE in a read.csv() call, R assigns the default names V1, V2, ..., Vn.
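The default names can be seen with a small header-less file, and the col.names argument of read.csv offers one way to set the variable names while reading. This is a sketch on a temporary stand-in file; the file contents and the names tmp, no_names and named are assumptions for illustration.

```r
# Header-less stand-in CSV written to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Sub001,m,132",
             "Sub002,m,139"), tmp)

# No header row and no names supplied: R falls back to V1, V2, V3.
no_names <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE)
names(no_names)

# col.names supplies the variable names at read time.
named <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("ID", "Gender", "BloodPressure"))
names(named)

unlink(tmp)  # remove the temporary file
```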
We usually use the $ operator to address a column by name.
 [1] "m" "m" "m" "f" "m" "M" "f" "m" "m" "f" "m" "f" "f" "m" "m" "m" "f" "m"
[19] "m" "F" "f" "m" "f" "f" "m" "M" "M" "f" "m" "f" "f" "m" "m" "m" "m" "f"
[37] "f" "m" "M" "m" "f" "m" "m" "m" "f" "f" "M" "M" "m" "m" "m" "f" "f" "f"
[55] "m" "f" "m" "m" "m" "f" "f" "f" "f" "M" "f" "m" "f" "f" "M" "m" "m" "m"
[73] "F" "m" "m" "f" "M" "M" "M" "f" "m" "M" "M" "m" "m" "f" "f" "f" "m" "m"
[91] "f" "m" "F" "f" "m" "m" "F" "m" "M" "M"
When we extract a single column from a data frame using the $ operator, R will return a vector of that column's class and not a data frame.
Named addressing can also be used in square brackets.
head(dat[, c('Age', 'Gender')])
   Age Gender
1 16.0      m
2 17.2      m
3 19.5      m
4 15.7      f
5 19.9      m
6 14.3      M
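The difference between what $ and square brackets return can be checked directly. The sketch below uses a four-row stand-in with two columns from the lesson data; the drop = FALSE variant is an addition not shown in the lesson, included to show how a single named column can be kept as a data frame.

```r
# Four-row stand-in with two of the lesson's columns.
dat <- data.frame(
  Gender = c("m", "m", "m", "f"),
  Age = c(16.0, 17.2, 19.5, 15.7),
  stringsAsFactors = FALSE
)

gender <- dat$Gender
is.data.frame(gender)  # FALSE: $ returns the column itself
class(gender)          # character
class(dat$Age)         # numeric

# Two names inside square brackets return a data frame.
both <- dat[, c("Age", "Gender")]
is.data.frame(both)

# A single name with drop = FALSE also stays a data frame.
gender_df <- dat[, "Gender", drop = FALSE]
is.data.frame(gender_df)
```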
Best Practice
Best practice is to address columns by name. Often, you will create or delete columns and the column position will change.
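A minimal sketch of why positions are fragile: deleting one column shifts every column after it, while the name keeps pointing at the same data. The two-row stand-in below is an assumption made for illustration.

```r
# Two-row stand-in with three of the lesson's columns.
dat <- data.frame(
  ID = c("Sub001", "Sub002"),
  Gender = c("m", "m"),
  Age = c(16.0, 17.2),
  stringsAsFactors = FALSE
)

names(dat)[3]       # position 3 holds Age
dat$Gender <- NULL  # delete the Gender column
names(dat)[2]       # Age has moved to position 2
dat[, "Age"]        # addressing by name still finds it
```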
Rows in an R data frame can also be named, and rows can also be addressed by their names.
By default, row names are indices (i.e. position of each row in the data frame):
 [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
[13] "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24"
[25] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36"
[37] "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48"
[49] "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60"
[61] "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72"
[73] "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
[85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96"
[97] "97" "98" "99" "100"
We can add row names as we read in the file with the row.names parameter in read.csv.
In the following example, we choose the first column, ID, to become the vector of row names of the data frame, with row.names = 1.
dat2 <- read.csv(file = 'data/sample.csv', header = TRUE, stringsAsFactors = FALSE, row.names = 1)
rownames(dat2)
 [1] "Sub001" "Sub002" "Sub003" "Sub004" "Sub005" "Sub006" "Sub007" "Sub008"
 [9] "Sub009" "Sub010" "Sub011" "Sub012" "Sub013" "Sub014" "Sub015" "Sub016"
[17] "Sub017" "Sub018" "Sub019" "Sub020" "Sub021" "Sub022" "Sub023" "Sub024"
[25] "Sub025" "Sub026" "Sub027" "Sub028" "Sub029" "Sub030" "Sub031" "Sub032"
[33] "Sub033" "Sub034" "Sub035" "Sub036" "Sub037" "Sub038" "Sub039" "Sub040"
[41] "Sub041" "Sub042" "Sub043" "Sub044" "Sub045" "Sub046" "Sub047" "Sub048"
[49] "Sub049" "Sub050" "Sub051" "Sub052" "Sub053" "Sub054" "Sub055" "Sub056"
[57] "Sub057" "Sub058" "Sub059" "Sub060" "Sub061" "Sub062" "Sub063" "Sub064"
[65] "Sub065" "Sub066" "Sub067" "Sub068" "Sub069" "Sub070" "Sub071" "Sub072"
[73] "Sub073" "Sub074" "Sub075" "Sub076" "Sub077" "Sub078" "Sub079" "Sub080"
[81] "Sub081" "Sub082" "Sub083" "Sub084" "Sub085" "Sub086" "Sub087" "Sub088"
[89] "Sub089" "Sub090" "Sub091" "Sub092" "Sub093" "Sub094" "Sub095" "Sub096"
[97] "Sub097" "Sub098" "Sub099" "Sub100"
We can now extract one or more rows using those row names:
dat2["Sub072", ]
       Gender   Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
Sub072      m Control           116 17.4          118          165          220
       Aneurisms_q4
Sub072          227
dat2[c("Sub009", "Sub072"), ]
       Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
Sub009      m Treatment2           131 19.4          117          215
Sub072      m    Control           116 17.4          118          165
       Aneurisms_q3 Aneurisms_q4
Sub009          181          272
Sub072          220          227
Note that row names must be unique!
For instance, if we try to read in the data setting the Group column as row names, R will throw an error because values in that column are duplicated:
dat2 <- read.csv(file = 'data/sample.csv', header = TRUE, stringsAsFactors = FALSE, row.names = 3)
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
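The uniqueness rule can be explored without the sample file. The sketch below builds a three-row stand-in with row names set directly, checks a column for duplicates before using it as row names, and captures the error from a bad assignment with tryCatch. The stand-in data and the name msg are assumptions for illustration, and R may also print a warning about non-unique values alongside the error.

```r
# Three-row stand-in with subject IDs as row names.
dat2 <- data.frame(
  Gender = c("m", "m", "m"),
  Group = c("Control", "Treatment2", "Treatment2"),
  row.names = c("Sub001", "Sub002", "Sub003"),
  stringsAsFactors = FALSE
)

dat2["Sub002", ]               # address a row by its name

# anyDuplicated() returns 0 when all values are unique.
anyDuplicated(dat2$Group) > 0  # Group repeats, so it cannot serve as row names

# Assigning duplicated row names fails; tryCatch keeps the script running.
msg <- tryCatch({
  rownames(dat2) <- dat2$Group
  "no error"
}, error = function(e) conditionMessage(e))
msg
rownames(dat2)                 # unchanged, because the assignment failed
```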
Addressing by Logical Vector
A logical vector contains only the special values TRUE and FALSE.
c(TRUE, TRUE, FALSE, FALSE, TRUE)
[1]  TRUE  TRUE FALSE FALSE  TRUE
Truth and Its Opposite
Note the values TRUE and FALSE are all capital letters and are not quoted.
Logical vectors can be created using relational operators, e.g. <, >, ==, !=, %in%.
x <- c(1, 2, 3, 11, 12, 13)
x < 10
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
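As a sketch of how several of these operators behave on the same vector, the lines below compare x < 10 with a membership test. The expression x %in% 1:10 is an illustrative choice, not necessarily the call used to produce the output above.

```r
x <- c(1, 2, 3, 11, 12, 13)

x < 10       # element-wise comparison against a single value
x %in% 1:10  # TRUE where the element appears in 1, 2, ..., 10
x != 11      # TRUE everywhere except the fourth element
x == 12      # TRUE only for the fifth element

# The first two tests agree for this particular vector.
identical(x < 10, x %in% 1:10)
```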
We can use logical vectors to select data from a data frame. This is often referred to as logical indexing.
index <- dat$Group == 'Control'
dat[index, ]$BloodPressure
 [1] 132 173 129  77 158  81 137 111 135 108 133 139 126 125  99 122 155 133  94
[20]  98  74 116  97 104 117  90 150 116 108 102
Often this operation is written as one line of code:
plot(dat[dat$Group == 'Control', ]$BloodPressure)
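The steps behind logical indexing can be inspected one at a time on a six-row stand-in (values from the first rows of the lesson data; only two columns are kept, which is an assumption made for brevity). The plot call is left out so the example prints values only.

```r
# Six-row stand-in with the two columns used here.
dat <- data.frame(
  Group = c("Control", "Treatment2", "Treatment2",
            "Treatment1", "Treatment1", "Treatment2"),
  BloodPressure = c(132L, 139L, 130L, 105L, 125L, 112L),
  stringsAsFactors = FALSE
)

index <- dat$Group == "Control"  # one TRUE or FALSE per row
index
sum(index)                       # TRUE counts as 1: number of matching rows

dat[index, ]$BloodPressure       # keep rows where index is TRUE, then take the column
dat$BloodPressure[index]         # take the column first, then keep matching elements
```

Both forms return the same values; the second avoids building an intermediate data frame.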
Using Logical Indexes
- Create a scatterplot showing BloodPressure for subjects not in the control group.
- How many ways are there to index this set of subjects?
Solution
The code for such a plot:
plot(dat[dat$Group != 'Control', ]$BloodPressure)
In addition to dat$Group != 'Control', one could use dat$Group %in% c("Treatment1", "Treatment2").
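Whether two ways of indexing select the same subjects can be checked with identical(). The sketch below assumes the group labels are exactly Control, Treatment1 and Treatment2 with no missing values; the vector group and the three result names are hypothetical.

```r
# Group labels from the first six rows of the lesson data.
group <- c("Control", "Treatment2", "Treatment2",
           "Treatment1", "Treatment1", "Treatment2")

by_not_equal  <- group != "Control"
by_membership <- group %in% c("Treatment1", "Treatment2")
by_negation   <- !(group == "Control")

identical(by_not_equal, by_membership)
identical(by_not_equal, by_negation)
```

With missing values the forms can differ, because != returns NA for a missing label while %in% returns FALSE.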
Combining Addressing and Assignment
The assignment operator <- can be combined with addressing.
x <- c(1, 2, 3, 11, 12, 13)
x[x < 10] <- 0
x
[1]  0  0  0 11 12 13
Updating a Subset of Values
In this dataset, values for Gender have been recorded as both uppercase M, F and lowercase m, f. Combine the addressing and assignment operations to convert all values to lowercase.
Solution
dat[dat$Gender == 'M', ]$Gender <- 'm'
dat[dat$Gender == 'F', ]$Gender <- 'f'
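A variant of the same idea addresses the column first and then the matching elements. The sketch below applies it to a three-row stand-in (subjects and values taken from the lesson data) and compares the result with base R's tolower(), which is an alternative not used in the lesson. The name original is hypothetical.

```r
# Three-row stand-in containing lowercase and uppercase codes.
dat <- data.frame(
  ID = c("Sub005", "Sub006", "Sub020"),
  Gender = c("m", "M", "F"),
  stringsAsFactors = FALSE
)
original <- dat$Gender  # keep a copy for comparison

# Address the column, select the matching elements, assign the new value.
dat$Gender[dat$Gender == "M"] <- "m"
dat$Gender[dat$Gender == "F"] <- "f"
dat$Gender

# A value that never occurs selects nothing, so this line changes nothing.
dat$Gender[dat$Gender == "X"] <- "x"

# tolower() converts every element in one call and gives the same result here.
identical(dat$Gender, tolower(original))
```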
Key Points
Data in data frames can be addressed by index (subsetting), by logical vector, or by name (column names, or row names where they have been set).
Use the $ operator to address a column by name.
Source: https://swcarpentry.github.io/r-novice-inflammation/10-supp-addressing-data/