How to Change Variable Names While Reading Data in R

Addressing Data

Overview

Teaching: 20 min
Exercises: 0 min

Questions

  • What are the different methods for accessing parts of a data frame?

Objectives

  • Understand the three different ways R can address data inside a data frame.

  • Combine different methods for addressing data with the assignment operator to update subsets of data.

R is a powerful language for data manipulation. There are 3 main ways of addressing data inside R objects.

  • By index (subsetting)
  • By logical vector
  • By name

Let's start by loading some sample data:

    dat <- read.csv(file = 'data/sample.csv', header = TRUE, stringsAsFactors = FALSE)

The first row of this csv file is a list of column names. We used the header = TRUE argument to read.csv so that R can interpret the file correctly. We are using the stringsAsFactors = FALSE argument to override the default behaviour for R, which before R 4.0.0 was to convert strings to factors (from R 4.0.0 onwards the default is already FALSE). Using factors in R is covered in a separate lesson.

Let's take a look at this data.

R has loaded the contents of the .csv file into a variable called dat, which is a data frame.

We can compactly display the internal structure of a data frame using the structure function str.

    str(dat)
    'data.frame':	100 obs. of  9 variables:
     $ ID           : chr  "Sub001" "Sub002" "Sub003" "Sub004" ...
     $ Gender       : chr  "m" "m" "m" "f" ...
     $ Group        : chr  "Control" "Treatment2" "Treatment2" "Treatment1" ...
     $ BloodPressure: int  132 139 130 105 125 112 173 108 131 129 ...
     $ Age          : num  16 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 ...
     $ Aneurisms_q1 : int  114 148 196 199 188 260 135 216 117 188 ...
     $ Aneurisms_q2 : int  140 209 251 140 120 266 98 238 215 144 ...
     $ Aneurisms_q3 : int  202 248 122 233 222 320 154 279 181 192 ...
     $ Aneurisms_q4 : int  237 248 177 220 228 294 245 251 272 185 ...

The str function tells us that the data has 100 rows and 9 columns. It also tells us that the data frame is made up of character chr, integer int and numeric vectors.

    head(dat)
          ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
    1 Sub001      m    Control           132 16.0          114          140
    2 Sub002      m Treatment2           139 17.2          148          209
    3 Sub003      m Treatment2           130 19.5          196          251
    4 Sub004      f Treatment1           105 15.7          199          140
    5 Sub005      m Treatment1           125 19.9          188          120
    6 Sub006      M Treatment2           112 14.3          260          266
      Aneurisms_q3 Aneurisms_q4
    1          202          237
    2          248          248
    3          122          177
    4          233          220
    5          222          228
    6          320          294

The data are the results of a (not real) experiment looking at the number of aneurysms that formed in the eyes of patients who underwent 3 different treatments.

Addressing by Index

Data can be accessed by index. We have already seen how square brackets [ can be used to subset data (sometimes also called "slicing"). The generic format is dat[row_numbers, column_numbers].

Selecting Values

What will be returned by dat[1, 1]? Think about the number of rows and columns you would expect as the result.

Solution

    dat[1, 1]
    [1] "Sub001"

If we leave out a dimension, R will interpret this as a request for all values in that dimension.
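As a minimal sketch of what leaving out a dimension does, the block below uses a small stand-in data frame instead of reading data/sample.csv. The name dat_small is hypothetical; its values are the first six rows and five columns shown in the head output above.

```r
# Stand-in for dat: first six rows and five columns from the lesson's head() output
dat_small <- data.frame(
  ID = c("Sub001", "Sub002", "Sub003", "Sub004", "Sub005", "Sub006"),
  Gender = c("m", "m", "m", "f", "m", "M"),
  Group = c("Control", "Treatment2", "Treatment2",
            "Treatment1", "Treatment1", "Treatment2"),
  BloodPressure = c(132L, 139L, 130L, 105L, 125L, 112L),
  Age = c(16.0, 17.2, 19.5, 15.7, 19.9, 14.3),
  stringsAsFactors = FALSE
)

# Rows left out: every row of column 2, returned as a character vector
second_col <- dat_small[, 2]
second_col

# Columns left out: every column of row 1, returned as a one-row data frame
first_row <- dat_small[1, ]
first_row
dim(first_row)
```

Note the difference in what comes back: a single column is simplified to a vector, while a single row stays a data frame because its columns have different types.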

Selecting More Values

What will be returned by dat[, 2]?

Solution

    dat[, 2]
      [1] "m" "m" "m" "f" "m" "M" "f" "m" "m" "f" "m" "f" "f" "m" "m" "m" "f" "m"
     [19] "m" "F" "f" "m" "f" "f" "m" "M" "M" "f" "m" "f" "f" "m" "m" "m" "m" "f"
     [37] "f" "m" "M" "m" "f" "m" "m" "m" "f" "f" "M" "M" "m" "m" "m" "f" "f" "f"
     [55] "m" "f" "m" "m" "m" "f" "f" "f" "f" "M" "f" "m" "f" "f" "M" "m" "m" "m"
     [73] "F" "m" "m" "f" "M" "M" "M" "f" "m" "M" "M" "m" "m" "f" "f" "f" "m" "m"
     [91] "f" "m" "F" "f" "m" "m" "F" "m" "M" "M"

The colon : can be used to create a sequence of integers.

For example, 6:9 creates a vector of numbers from 6 to 9.

This can be very useful for addressing data.
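A short sketch of the colon operator on its own, applied here to a plain character vector rather than the data frame. The vector column_names is a hypothetical stand-in holding the nine column names listed by str above.

```r
# The colon operator builds an integer sequence
cols <- 6:9
cols

# Stand-in vector with the lesson's nine column names
column_names <- c("ID", "Gender", "Group", "BloodPressure", "Age",
                  "Aneurisms_q1", "Aneurisms_q2", "Aneurisms_q3", "Aneurisms_q4")

# The sequence can be used directly as an index
column_names[cols]
```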

Subsetting with Sequences

Use the colon operator to index just the aneurism count data (columns 6 to 9).

Solution

    dat[, 6:9]
        Aneurisms_q1 Aneurisms_q2 Aneurisms_q3 Aneurisms_q4
    1            114          140          202          237
    2            148          209          248          248
    3            196          251          122          177
    4            199          140          233          220
    5            188          120          222          228
    6            260          266          320          294
    7            135           98          154          245
    8            216          238          279          251
    9            117          215          181          272
    10           188          144          192          185
    11           134          155          247          223
    12           152          177          323          245
    13           112          220          225          195
    14           109          150          177          189
    15           146          140          239          223
    16            97          172          203          207
    17           165          157          200          193
    18           158          265          243          187
    19           178          109          206          182
    20           107          188          167          218
    21           174          160          203          183
    22            97          110          194          133
    23           187          239          281          214
    24           188          191          256          265
    25           114          199          242          195
    26           115          160          158          228
    27           128          249          294          315
    28           112          230          281          126
    29           136          109          105          155
    30           103          148          219          228
    31           132          151          234          162
    32           118          154          260          160
    33           166          176          253          233
    34           152          105          197          299
    35           191          148          166          185
    36           152          178          158          170
    37           161          270          232          284
    38           239          184          317          269
    39           132          137          193          206
    40           168          255          273          274
    41           140          184          239          202
    42           166           85          179          196
    43           141          160          179          239
    44           161          168          212          181
    45           103          111          254          126
    46           231          240          260          310
    47           192          141          180          225
    48           178          180          169          183
    49           167          123          236          224
    50           135          150          208          279
    51           150          166          153          204
    52           192           80          138          222
    53           153          153          236          216
    54           205          264          269          207
    55           117          194          216          211
    56           199          119          183          251
    57           182          129          226          218
    58           180          196          250          294
    59           111          111          244          201
    60           101           98          178          116
    61           166          167          232          241
    62           158          171          237          212
    63           189          178          177          238
    64           189          101          193          172
    65           239          189          297          300
    66           185          224          151          182
    67           224          112          304          288
    68           104          139          211          204
    69           222          199          280          196
    70           107           98          204          138
    71           153          255          218          234
    72           118          165          220          227
    73           102          184          246          222
    74           188          125          191          157
    75           180          283          204          298
    76           178          214          291          240
    77           168          184          184          229
    78           118          170          249          249
    79           169          114          248          233
    80           156          138          218          258
    81           232          211          219          246
    82           188          108          180          136
    83           169          168          180          211
    84           241          233          292          182
    85            65          207          234          235
    86           225          185          195          235
    87           104          116          173          221
    88           179          158          216          244
    89           103          140          209          186
    90           112          130          175          191
    91           226          170          307          244
    92           228          221          316          259
    93           209          142          199          184
    94           153          104          194          214
    95           111          118          173          191
    96           148          132          200          194
    97           141          196          322          273
    98           193          112          123          181
    99           130          226          286          281
    100          126          157          129          160

Finally, we can use the c() (combine) function to address non-sequential rows and columns.

    dat[c(1, 5, 7, 9), 1:5]
          ID Gender      Group BloodPressure  Age
    1 Sub001      m    Control           132 16.0
    5 Sub005      m Treatment1           125 19.9
    7 Sub007      f    Control           173 17.7
    9 Sub009      m Treatment2           131 19.4

This returns the first 5 columns for patients in rows 1, 5, 7 and 9.
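The same idea can be tried on a small stand-in data frame. This is a sketch only: dat_small is a hypothetical name, and it holds just the ID and BloodPressure values of the first nine subjects as listed in the str output above.

```r
# Stand-in for dat: IDs and blood pressure of the first nine subjects
dat_small <- data.frame(
  ID = sprintf("Sub%03d", 1:9),
  BloodPressure = c(132L, 139L, 130L, 105L, 125L, 112L, 173L, 108L, 131L),
  stringsAsFactors = FALSE
)

# c() combines arbitrary row numbers into one index vector
picked <- dat_small[c(1, 5, 7, 9), ]
picked

# The original row numbers are kept as row names of the result
rownames(picked)
```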

Subsetting Non-Sequential Data

Write code to return the age and gender values for the first 5 patients.

Solution

    dat[1:5, c(5, 2)]
       Age Gender
    1 16.0      m
    2 17.2      m
    3 19.5      m
    4 15.7      f
    5 19.9      m

Addressing by Name

Columns in an R data frame are named.

    names(dat)
    [1] "ID"            "Gender"        "Group"         "BloodPressure"
    [5] "Age"           "Aneurisms_q1"  "Aneurisms_q2"  "Aneurisms_q3"
    [9] "Aneurisms_q4"

Default Names

If column names are not specified, e.g. using header = FALSE in a read.csv() call, R assigns default names V1, V2, ..., Vn.
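To see the default names in practice, the sketch below writes a tiny CSV to a temporary file and reads it twice. The file contents are a hypothetical three-row excerpt modelled on data/sample.csv, not the real file.

```r
# Hypothetical excerpt: a header line plus three data rows
csv_lines <- c(
  "ID,Gender,Group,BloodPressure,Age",
  "Sub001,m,Control,132,16.0",
  "Sub002,m,Treatment2,139,17.2",
  "Sub003,m,Treatment2,130,19.5"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# header = TRUE: the first line supplies the column names
with_header <- read.csv(file = tmp, header = TRUE, stringsAsFactors = FALSE)
names(with_header)
nrow(with_header)

# header = FALSE: R invents V1, V2, ... and treats the first line as data
no_header <- read.csv(file = tmp, header = FALSE, stringsAsFactors = FALSE)
names(no_header)
nrow(no_header)
```

Because the header line is read as data in the second call, the result has one extra row and every column is read as character.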

We usually use the $ operator to address a column by name.

    dat$Gender
      [1] "m" "m" "m" "f" "m" "M" "f" "m" "m" "f" "m" "f" "f" "m" "m" "m" "f" "m"
     [19] "m" "F" "f" "m" "f" "f" "m" "M" "M" "f" "m" "f" "f" "m" "m" "m" "m" "f"
     [37] "f" "m" "M" "m" "f" "m" "m" "m" "f" "f" "M" "M" "m" "m" "m" "f" "f" "f"
     [55] "m" "f" "m" "m" "m" "f" "f" "f" "f" "M" "f" "m" "f" "f" "M" "m" "m" "m"
     [73] "F" "m" "m" "f" "M" "M" "M" "f" "m" "M" "M" "m" "m" "f" "f" "f" "m" "m"
     [91] "f" "m" "F" "f" "m" "m" "F" "m" "M" "M"

When we extract a single column from a data frame using the $ operator, R will return a vector of that column's class and not a data frame.

Named addressing can also be used in square brackets.
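A quick way to check what kind of object comes back is class(). The sketch below uses a hypothetical stand-in dat_small with the Gender and Age values of the first six subjects. The drop = FALSE argument of `[` is not used elsewhere in this lesson; it is shown only as a contrast, since it keeps the result as a data frame.

```r
# Stand-in for dat: Gender and Age of the first six subjects
dat_small <- data.frame(
  Gender = c("m", "m", "m", "f", "m", "M"),
  Age = c(16.0, 17.2, 19.5, 15.7, 19.9, 14.3),
  stringsAsFactors = FALSE
)

# $ returns the column itself: a character vector with no dimensions
by_dollar <- dat_small$Gender
class(by_dollar)

# Square brackets with drop = FALSE keep a one-column data frame
by_bracket <- dat_small[, "Gender", drop = FALSE]
class(by_bracket)
dim(by_bracket)
```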

    head(dat[, c('Age', 'Gender')])
       Age Gender
    1 16.0      m
    2 17.2      m
    3 19.5      m
    4 15.7      f
    5 19.9      m
    6 14.3      M

Best Practice

Best practice is to address columns by name. Often, you will create or delete columns and the column position will change.

Rows in an R data frame can also be named, and rows can also be addressed by their names.
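The sketch below illustrates why positions are fragile. It uses a hypothetical stand-in dat_small and adds a made-up Site column at the front, which shifts every other column one place to the right.

```r
# Stand-in for dat: three columns of the first three subjects
dat_small <- data.frame(
  ID = c("Sub001", "Sub002", "Sub003"),
  Gender = c("m", "m", "m"),
  Age = c(16.0, 17.2, 19.5),
  stringsAsFactors = FALSE
)

# Before the change, position 3 and the name "Age" agree
age_by_position <- dat_small[, 3]
age_by_name <- dat_small[, "Age"]

# Hypothetical new column inserted in front of the existing ones
dat_shifted <- data.frame(Site = "A", dat_small, stringsAsFactors = FALSE)

# Position 3 now points at Gender, while the name still finds Age
dat_shifted[, 3]
dat_shifted[, "Age"]
```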
By default, row names are indices (i.e. position of each row in the data frame):

    rownames(dat)
      [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"
     [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"
     [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36"
     [37] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48"
     [49] "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60"
     [61] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72"
     [73] "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"
     [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"
     [97] "97"  "98"  "99"  "100"

We can add row names as we read in the file with the row.names parameter in read.csv.
In the following example, we choose the first column ID to become the vector of row names of the data frame, with row.names = 1.

    dat2 <- read.csv(file = 'data/sample.csv', header = TRUE, stringsAsFactors = FALSE, row.names = 1)
    rownames(dat2)
      [1] "Sub001" "Sub002" "Sub003" "Sub004" "Sub005" "Sub006" "Sub007" "Sub008"
      [9] "Sub009" "Sub010" "Sub011" "Sub012" "Sub013" "Sub014" "Sub015" "Sub016"
     [17] "Sub017" "Sub018" "Sub019" "Sub020" "Sub021" "Sub022" "Sub023" "Sub024"
     [25] "Sub025" "Sub026" "Sub027" "Sub028" "Sub029" "Sub030" "Sub031" "Sub032"
     [33] "Sub033" "Sub034" "Sub035" "Sub036" "Sub037" "Sub038" "Sub039" "Sub040"
     [41] "Sub041" "Sub042" "Sub043" "Sub044" "Sub045" "Sub046" "Sub047" "Sub048"
     [49] "Sub049" "Sub050" "Sub051" "Sub052" "Sub053" "Sub054" "Sub055" "Sub056"
     [57] "Sub057" "Sub058" "Sub059" "Sub060" "Sub061" "Sub062" "Sub063" "Sub064"
     [65] "Sub065" "Sub066" "Sub067" "Sub068" "Sub069" "Sub070" "Sub071" "Sub072"
     [73] "Sub073" "Sub074" "Sub075" "Sub076" "Sub077" "Sub078" "Sub079" "Sub080"
     [81] "Sub081" "Sub082" "Sub083" "Sub084" "Sub085" "Sub086" "Sub087" "Sub088"
     [89] "Sub089" "Sub090" "Sub091" "Sub092" "Sub093" "Sub094" "Sub095" "Sub096"
     [97] "Sub097" "Sub098" "Sub099" "Sub100"

We can now extract one or more rows using those row names:

    dat2["Sub072", ]
           Gender   Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2 Aneurisms_q3
    Sub072      m Control           116 17.4          118          165          220
           Aneurisms_q4
    Sub072          227

    dat2[c("Sub009", "Sub072"), ]
           Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
    Sub009      m Treatment2           131 19.4          117          215
    Sub072      m    Control           116 17.4          118          165
           Aneurisms_q3 Aneurisms_q4
    Sub009          181          272
    Sub072          220          227

Note that row names must be unique!
For instance, if we try and read in the data setting the Group column as row names, R will throw an error because values in that column are duplicated:

    dat2 <- read.csv(file = 'data/sample.csv', header = TRUE, stringsAsFactors = FALSE, row.names = 3)
    Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
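The uniqueness rule can be reproduced without the full data set. The sketch below is an illustration under assumptions: it writes a hypothetical three-row excerpt to a temporary file, and it wraps the failing call in tryCatch (not used elsewhere in this lesson) so that the error message is captured instead of stopping the script.

```r
# Hypothetical excerpt in which the Group column repeats a value
csv_lines <- c(
  "ID,Gender,Group",
  "Sub001,m,Control",
  "Sub002,m,Treatment2",
  "Sub003,m,Treatment2"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Column 1 (ID) is unique, so it works as row names
ok <- read.csv(file = tmp, header = TRUE, stringsAsFactors = FALSE, row.names = 1)
rownames(ok)
ok["Sub002", "Group"]

# Column 3 (Group) has duplicates, so read.csv signals an error
msg <- tryCatch(
  {
    read.csv(file = tmp, header = TRUE, stringsAsFactors = FALSE, row.names = 3)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```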

Addressing by Logical Vector

A logical vector contains only the special values TRUE and FALSE.

    c(TRUE, TRUE, FALSE, FALSE, TRUE)
    [1]  TRUE  TRUE FALSE FALSE  TRUE

Truth and Its Opposite

Note the values TRUE and FALSE are all capital letters and are not quoted.

Logical vectors can be created using relational operators e.g. <, >, ==, !=, %in%.

    x <- c(1, 2, 3, 11, 12, 13)
    x < 10
    [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
    x %in% 1:10
    [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

We can use logical vectors to select data from a data frame. This is often referred to as logical indexing.

    index <- dat$Group == 'Control'
    dat[index, ]$BloodPressure
     [1] 132 173 129  77 158  81 137 111 135 108 133 139 126 125  99 122 155 133  94
    [20]  98  74 116  97 104 117  90 150 116 108 102

Often this operation is written as one line of code:

    plot(dat[dat$Group == 'Control', ]$BloodPressure)

[Figure: scatterplot of BloodPressure for subjects in the Control group]
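To make the intermediate logical vector visible, here is a sketch on a hypothetical stand-in dat_small that holds the Group and BloodPressure values of the first seven subjects shown earlier in the lesson.

```r
# Stand-in for dat: Group and BloodPressure of the first seven subjects
dat_small <- data.frame(
  Group = c("Control", "Treatment2", "Treatment2", "Treatment1",
            "Treatment1", "Treatment2", "Control"),
  BloodPressure = c(132L, 139L, 130L, 105L, 125L, 112L, 173L),
  stringsAsFactors = FALSE
)

# One TRUE or FALSE per row
is_control <- dat_small$Group == "Control"
is_control

# TRUE counts as 1, so sum() gives the number of selected rows
sum(is_control)

# Only the rows where the index is TRUE are kept
control_bp <- dat_small[is_control, ]$BloodPressure
control_bp
```

The two values returned here, 132 and 173, are the first two entries of the full Control output above.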

Using Logical Indexes

  1. Create a scatterplot showing BloodPressure for subjects not in the control group.
  2. How many ways are there to index this set of subjects?

Solution

  1. The code for such a plot:

    plot(dat[dat$Group != 'Control', ]$BloodPressure)

    [Figure: scatterplot of BloodPressure for subjects not in the Control group]

  2. In addition to dat$Group != 'Control', one could use dat$Group %in% c("Treatment1", "Treatment2").
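The two indexing approaches can be compared directly. In this sketch, group is a hypothetical stand-in vector holding the Group values of the first seven subjects; the final lines append a made-up missing value to show one case where the two approaches differ.

```r
# Stand-in for dat$Group: the first seven subjects
group <- c("Control", "Treatment2", "Treatment2", "Treatment1",
           "Treatment1", "Treatment2", "Control")

not_control <- group != "Control"
in_treatment <- group %in% c("Treatment1", "Treatment2")

# Both approaches select the same rows here
identical(not_control, in_treatment)
which(not_control)

# Hypothetical missing value: != returns NA, %in% returns FALSE
group_na <- c(group, NA)
na_by_compare <- group_na != "Control"
na_by_in <- group_na %in% c("Treatment1", "Treatment2")
na_by_compare[8]
na_by_in[8]
```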

Combining Addressing and Assignment

The assignment operator <- can be combined with addressing.

    x <- c(1, 2, 3, 11, 12, 13)
    x[x < 10] <- 0
    x
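Broken into steps, the assignment above works as sketched below; the helper name small is introduced here only for illustration.

```r
x <- c(1, 2, 3, 11, 12, 13)

# Step 1: the logical vector that marks the elements to change
small <- x < 10
small

# Step 2: assign into just those positions; the rest are left alone
x[small] <- 0
x
```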

Updating a Subset of Values

In this dataset, values for Gender have been recorded as both uppercase M, F and lowercase m, f. Combine the addressing and assignment operations to convert all values to lowercase.

Solution

    dat[dat$Gender == 'M', ]$Gender <- 'm'
    dat[dat$Gender == 'F', ]$Gender <- 'f'
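The sketch below checks the solution on a tiny made-up data frame (dat_small and its four rows are hypothetical) and compares it with two alternatives that are not part of the original lesson: addressing the column first and then the elements, and the base R function tolower.

```r
# Hypothetical data with one value of each spelling
dat_small <- data.frame(
  ID = c("A", "B", "C", "D"),
  Gender = c("m", "M", "f", "F"),
  stringsAsFactors = FALSE
)

# The lesson's solution: address the rows, then the column
fixed_rows <- dat_small
fixed_rows[fixed_rows$Gender == 'M', ]$Gender <- 'm'
fixed_rows[fixed_rows$Gender == 'F', ]$Gender <- 'f'

# Alternative: address the column, then the elements
fixed_vec <- dat_small
fixed_vec$Gender[fixed_vec$Gender == 'M'] <- 'm'
fixed_vec$Gender[fixed_vec$Gender == 'F'] <- 'f'

# Alternative: convert the whole column in one call
fixed_lower <- dat_small
fixed_lower$Gender <- tolower(fixed_lower$Gender)

fixed_rows$Gender
```

All three approaches give the same Gender column for this input; the lesson's version is the one that practises combining addressing with assignment.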

Key Points

  • Data in data frames can be addressed by index (subsetting), by logical vector, or by name (columns only).

  • Use the $ operator to address a column by name.

Source: https://swcarpentry.github.io/r-novice-inflammation/10-supp-addressing-data/