Here, we show how to aggregate data in R based on functions, using colSums(), rowSums(), colMeans(), rowMeans(), sapply(), apply(), and aggregate().

All the functions are from the "base" package, except for aggregate() which is from the "stats" package.

Summary Functions for Aggregating Data Objects in R
Function Usage
colSums() Column sums
rowSums() Row sums
colMeans() Column means
rowMeans() Row means
sapply() Aggregate data based on a function
apply() Aggregate data based on a function
aggregate() Aggregate data by groups or subsets

1 Derive Sums and Means of Columns and Rows in R

Using cars data from the datasets package.

Sample rows from cars:

cars
   speed dist
1      4    2
6      9   10
10    11   17
17    13   34
23    14   80
29    17   32
38    19   68
39    20   32
45    23   54
50    25   85

Derive the column sums of the dataframe:

This returns the sum of the values in each column.

colSums(cars)
speed  dist 
  770  2149 

Derive the row sums of the dataframe:

This returns the sum of the values in each row.

rowSums(cars)
 [1]   6  14  11  29  24  19  28  36  44  28  39  26  32  36  40  39  47  47  59
[20]  40  50  74  94  35  41  69  48  56  49  57  67  60  74  94 102  55  65  87
[39]  52  68  72  76  84  88  77  94 116 117 144 110

Derive the column means of the dataframe:

This returns the mean of the values in each column.

colMeans(cars)
speed  dist 
15.40 42.98 

Derive the row means of the dataframe:

This returns the mean of the values in each row.

rowMeans(cars)
 [1]  3.0  7.0  5.5 14.5 12.0  9.5 14.0 18.0 22.0 14.0 19.5 13.0 16.0 18.0 20.0
[16] 19.5 23.5 23.5 29.5 20.0 25.0 37.0 47.0 17.5 20.5 34.5 24.0 28.0 24.5 28.5
[31] 33.5 30.0 37.0 47.0 51.0 27.5 32.5 43.5 26.0 34.0 36.0 38.0 42.0 44.0 38.5
[46] 47.0 58.0 58.5 72.0 55.0

2 Aggregate Data with sapply()

sapply() is similar to lapply(), but it returns a vector instead of a list like lapply().

Using trees data from the datasets package.

Sample rows from trees:

trees
   Girth Height Volume
1    8.3     70   10.3
6   10.8     83   19.7
10  11.2     75   19.9
13  11.4     76   21.4
16  12.9     74   22.2
17  12.9     85   33.8
23  14.5     74   36.3
27  17.5     82   55.7
29  18.0     80   51.5
31  20.6     87   77.0

Aggregate the means of the columns in the dataframe:

This gives the means of the values in each column.

sapply(trees, FUN = mean)
   Girth   Height   Volume 
13.24839 76.00000 30.17097 

Aggregate the sum of the columns in the dataframe:

This gives the sum of the values in each column.

sapply(trees, FUN = sum)
 Girth Height Volume 
 410.7 2356.0  935.3 

Aggregate the standard deviation of the columns in the dataframe:

This gives the standard deviation of the values in each column.

sapply(trees, FUN = sd)
    Girth    Height    Volume 
 3.138139  6.371813 16.437846 

Aggregate the variance of the columns in the dataframe:

This gives the variance of the values in each column.

sapply(trees, FUN = var)
     Girth     Height     Volume 
  9.847914  40.600000 270.202796 

Aggregate the medians of the columns in the dataframe:

This gives the median of the values in each column.

sapply(trees, FUN = quantile, probs = 1/2)
 Girth.50% Height.50% Volume.50% 
      12.9       76.0       24.2 

Aggregate the minimum values of the columns in the dataframe:

This gives the minimum value in each column.

sapply(trees, FUN = min)
 Girth Height Volume 
   8.3   63.0   10.2 

Aggregate the maximum values of the columns in the dataframe:

This gives the maximum value in each column.

sapply(trees, FUN = max)
 Girth Height Volume 
  20.6   87.0   77.0 

Aggregate the five number summary of the columns in the dataframe:

This gives the quartiles for each column, from the first to the last row, the min, 25th percentile, 50th percentile or the median, 75th percentile and the max, respectively.

sapply(trees, FUN = fivenum)
     Girth Height Volume
[1,]  8.30     63   10.2
[2,] 11.05     72   19.4
[3,] 12.90     76   24.2
[4,] 15.25     80   37.3
[5,] 20.60     87   77.0

Aggregate quantiles (20th, 40th, 60th and 80th) of the columns in the dataframe:

This gives the 20th, 40th, 60th and 80th quantiles of the values in each column.

sapply(trees, FUN = quantile, probs = c(0.2, 0.4, 0.6, 0.8))
    Girth Height Volume
20%  11.0     71   18.8
40%  11.4     75   21.4
60%  13.7     79   27.4
80%  16.3     81   42.6

3 Aggregate Data with apply()

See a sample from the trees dataframe used here above.

You can perform the aggregates with all of the same functions shown with sapply() above with apply(), with the additional step of specifying the "MARGIN" argument. "MARGIN = 1" for rows, and "MARGIN = 2" for columns as in the case of sapply().

Aggregate the sum of the rows in the dataframe:

This shows the sum of the values in each row.

apply(trees, MARGIN = 1, FUN = sum)
 [1]  88.6  83.9  82.0  98.9 110.5 113.5  92.6 104.2 113.7 106.1 114.5 108.4
[13] 108.8 102.0 106.1 109.1 131.7 126.7 110.4 102.7 126.5 125.9 124.8 126.3
[25] 135.9 153.7 155.2 156.2 149.5 149.0 184.6

Aggregate the sum of the columns in the dataframe using apply():

This shows the sum of the values in each column.

apply(trees, MARGIN = 2, FUN = sum)
 Girth Height Volume 
 410.7 2356.0  935.3 

Aggregate the means of the rows in the dataframe:

This shows the mean of the values in each row.

apply(trees, MARGIN = 1, FUN = mean)
 [1] 29.53333 27.96667 27.33333 32.96667 36.83333 37.83333 30.86667 34.73333
 [9] 37.90000 35.36667 38.16667 36.13333 36.26667 34.00000 35.36667 36.36667
[17] 43.90000 42.23333 36.80000 34.23333 42.16667 41.96667 41.60000 42.10000
[25] 45.30000 51.23333 51.73333 52.06667 49.83333 49.66667 61.53333

Aggregate quantiles (20th, 40th, 60th and 80th) of the rows in the dataframe:

This shows the 20th, 40th, 60th and 80th quantiles of the values in each row.

apply(trees, MARGIN = 1, FUN = quantile, probs = c(0.2, 0.4, 0.6, 0.8))
     [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
20%  9.10  9.28  9.36 12.86 13.94 14.36 12.84 13.88 15.70 14.68 16.46 15.24
40%  9.90  9.96  9.92 15.22 17.18 17.92 14.68 16.76 20.30 18.16 21.62 19.08
60% 22.24 21.24 20.76 27.52 31.24 32.36 25.68 29.56 34.08 30.92 35.16 32.00
80% 46.12 43.12 41.88 49.76 56.12 57.68 45.84 52.28 57.04 52.96 57.08 54.00
    [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
20% 15.40 15.54 14.84 16.62 21.26 18.94 18.50 18.24  22.2 21.20 23.22 24.92
40% 19.40 19.38 17.68 20.34 29.62 24.58 23.30 22.68  30.4 28.20 31.94 33.84
60% 32.32 30.84 30.28 32.56 44.04 39.12 34.76 32.72  43.2 41.36 43.84 45.04
80% 54.16 49.92 52.64 53.28 64.52 62.56 52.88 48.36  60.6 60.68 58.92 58.52
    [,25] [,26] [,27] [,28] [,29] [,30] [,31]
20% 26.82 32.54 32.78 34.06  31.4  31.2 43.16
40% 37.34 47.78 48.06 50.22  44.8  44.4 65.72
60% 49.48 60.52 60.96 62.64  57.2  56.8 79.00
80% 63.24 70.76 71.48 71.32  68.6  68.4 83.00

4 Aggregate Data by Groups or Subsets with aggregate()

You can perform the aggregates with all of the same functions shown with sapply() above with aggregate().

With aggregate(), you can perform the aggregation based on groups and subsets, hence performing statistical analysis based on subsets by groups or factors in the dataframe.

Using CO2 data from datasets package.

Sample rows from CO2:

CO2
   Plant        Type  Treatment conc uptake
1    Qn1      Quebec nonchilled   95   16.0
4    Qn1      Quebec nonchilled  350   37.2
12   Qn2      Quebec nonchilled  500   40.6
15   Qn3      Quebec nonchilled   95   16.2
39   Qc3      Quebec    chilled  350   34.0
42   Qc3      Quebec    chilled 1000   41.4
50   Mn2 Mississippi nonchilled   95   12.0
54   Mn2 Mississippi nonchilled  500   32.4
73   Mc2 Mississippi    chilled  250   12.3
84   Mc3 Mississippi    chilled 1000   19.9

Aggregate the means by Type:

aggregate(uptake ~ Type, data = CO2, FUN = mean)
         Type   uptake
1      Quebec 33.54286
2 Mississippi 20.88333

Aggregate the means by Treatment:

aggregate(uptake ~ Treatment, data = CO2, FUN = mean)
   Treatment   uptake
1 nonchilled 30.64286
2    chilled 23.78333

Aggregate the means by Type and Treatment:

aggregate(uptake ~ Type + Treatment, data = CO2, FUN = mean)
         Type  Treatment   uptake
1      Quebec nonchilled 35.33333
2 Mississippi nonchilled 25.95238
3      Quebec    chilled 31.75238
4 Mississippi    chilled 15.81429

Aggregate quantiles by Type and Treatment:

aggregate(uptake ~ Type + Treatment,
          data = CO2, quantile, probs = c(1/3, 2/3))
         Type  Treatment uptake.33.33333% uptake.66.66667%
1      Quebec nonchilled         35.13333         40.86667
2 Mississippi nonchilled         26.06667         30.70000
3      Quebec    chilled         31.76667         38.26667
4 Mississippi    chilled         13.46667         18.03333

5 Aggregate Data Based on Special or Custom Functions in R

You can perform the aggregates on dataframes based on special or custom functions to derive results of interest.

The is applicable to apply(), sapply(), and aggregate().

Derive column sum of squares:

See a sample from the cars dataframe used here above.

# First, sum of squares formula for any x vector
sumSS = function(x){sum(x^2)}

sapply(cars, FUN = sumSS)

# Or
apply(cars, MARGIN = 2, FUN = sumSS)
 speed   dist 
 13228 124903 

Derive row sum of squares:

See a sample from the cars dataframe used here above.

# First, sum of squares formula for any x vector
sumSS = function(x){sum(x^2)}

apply(cars, MARGIN = 1, FUN = sumSS)
 [1]    20   116    65   533   320   181   424   776  1256   410   905   340
[13]   544   720   928   845  1325  1325  2285   872  1492  3796  6596   625
[25]   901  3141  1280  1856  1313  1889  2789  2088  3460  6100  7380  1657
[37]  2477  4985  1424  2704  3104  3536  4496  4840  3445  5476  9040  9225
[49] 14976  7850

Derive sum of squared deviations by groups:

See a sample from the CO2 dataframe used here above.

# First, sum of squared deviations formula for any x vector
sumSD = function(x){sum(x^2 - mean(x))}

aggregate(uptake ~ Type + Treatment, data = CO2, FUN = sumSD)
         Type  Treatment   uptake
1      Quebec nonchilled 27317.14
2 Mississippi nonchilled 14694.88
3      Quebec    chilled 22366.14
4 Mississippi    chilled  5249.33

Copyright © 2020 - 2024. All Rights Reserved by Stats Codes