Here, we show how to aggregate data in R based on functions, using
colSums()
, rowSums()
, colMeans()
,
rowMeans()
, sapply()
, apply()
,
and aggregate()
.
All the functions are from the "base" package, except for
aggregate()
which is from the "stats" package.
Function | Usage |
colSums() |
Column sums |
rowSums() |
Row sums |
colMeans() |
Column means |
rowMeans() |
Row means |
sapply() |
Aggregate data based on a function |
apply() |
Aggregate data based on a function |
aggregate() |
Aggregate data by groups or subsets |
Using cars data from the datasets package.
Sample rows from cars:
speed dist
1 4 2
6 9 10
10 11 17
17 13 34
23 14 80
29 17 32
38 19 68
39 20 32
45 23 54
50 25 85
Derive the column sums of the dataframe:
This returns the sum of the values in each column.
speed dist
770 2149
Derive the row sums of the dataframe:
This returns the sum of the values in each row.
[1] 6 14 11 29 24 19 28 36 44 28 39 26 32 36 40 39 47 47 59
[20] 40 50 74 94 35 41 69 48 56 49 57 67 60 74 94 102 55 65 87
[39] 52 68 72 76 84 88 77 94 116 117 144 110
Derive the column means of the dataframe:
This returns the mean of the values in each column.
speed dist
15.40 42.98
Derive the row means of the dataframe:
This returns the mean of the values in each row.
[1] 3.0 7.0 5.5 14.5 12.0 9.5 14.0 18.0 22.0 14.0 19.5 13.0 16.0 18.0 20.0
[16] 19.5 23.5 23.5 29.5 20.0 25.0 37.0 47.0 17.5 20.5 34.5 24.0 28.0 24.5 28.5
[31] 33.5 30.0 37.0 47.0 51.0 27.5 32.5 43.5 26.0 34.0 36.0 38.0 42.0 44.0 38.5
[46] 47.0 58.0 58.5 72.0 55.0
sapply()
is similar to lapply()
, but it
returns a vector instead of a list like lapply()
.
Using trees data from the datasets package.
Sample rows from trees:
Girth Height Volume
1 8.3 70 10.3
6 10.8 83 19.7
10 11.2 75 19.9
13 11.4 76 21.4
16 12.9 74 22.2
17 12.9 85 33.8
23 14.5 74 36.3
27 17.5 82 55.7
29 18.0 80 51.5
31 20.6 87 77.0
Aggregate the means of the columns in the dataframe:
This gives the means of the values in each column.
Girth Height Volume
13.24839 76.00000 30.17097
Aggregate the sum of the columns in the dataframe:
This gives the sum of the values in each column.
Girth Height Volume
410.7 2356.0 935.3
Aggregate the standard deviation of the columns in the dataframe:
This gives the standard deviation of the values in each column.
Girth Height Volume
3.138139 6.371813 16.437846
Aggregate the variance of the columns in the dataframe:
This gives the variance of the values in each column.
Girth Height Volume
9.847914 40.600000 270.202796
Aggregate the medians of the columns in the dataframe:
This gives the median of the values in each column.
Girth.50% Height.50% Volume.50%
12.9 76.0 24.2
Aggregate the minimum values of the columns in the dataframe:
This gives the minimum value in each column.
Girth Height Volume
8.3 63.0 10.2
Aggregate the maximum values of the columns in the dataframe:
This gives the maximum value in each column.
Girth Height Volume
20.6 87.0 77.0
Aggregate the five number summary of the columns in the dataframe:
This gives the quartiles for each column, from the first to the last row, the min, 25th percentile, 50th percentile or the median, 75th percentile and the max, respectively.
Girth Height Volume
[1,] 8.30 63 10.2
[2,] 11.05 72 19.4
[3,] 12.90 76 24.2
[4,] 15.25 80 37.3
[5,] 20.60 87 77.0
Aggregate quantiles (20th, 40th, 60th and 80th) of the columns in the dataframe:
This gives the 20th, 40th, 60th and 80th quantiles of the values in each column.
Girth Height Volume
20% 11.0 71 18.8
40% 11.4 75 21.4
60% 13.7 79 27.4
80% 16.3 81 42.6
See a sample from the trees dataframe used here above.
You can perform the aggregates with all of the same functions shown
with sapply()
above with apply()
, with the
additional step of specifying the "MARGIN" argument. "MARGIN = 1" for
rows, and "MARGIN = 2" for columns as in the case of
sapply()
.
Aggregate the sum of the rows in the dataframe:
This shows the sum of the values in each row.
[1] 88.6 83.9 82.0 98.9 110.5 113.5 92.6 104.2 113.7 106.1 114.5 108.4
[13] 108.8 102.0 106.1 109.1 131.7 126.7 110.4 102.7 126.5 125.9 124.8 126.3
[25] 135.9 153.7 155.2 156.2 149.5 149.0 184.6
Aggregate the sum of the columns in the dataframe using apply():
This shows the sum of the values in each column.
Girth Height Volume
410.7 2356.0 935.3
Aggregate the means of the rows in the dataframe:
This shows the mean of the values in each row.
[1] 29.53333 27.96667 27.33333 32.96667 36.83333 37.83333 30.86667 34.73333
[9] 37.90000 35.36667 38.16667 36.13333 36.26667 34.00000 35.36667 36.36667
[17] 43.90000 42.23333 36.80000 34.23333 42.16667 41.96667 41.60000 42.10000
[25] 45.30000 51.23333 51.73333 52.06667 49.83333 49.66667 61.53333
Aggregate quantiles (20th, 40th, 60th and 80th) of the rows in the dataframe:
This shows the 20th, 40th, 60th and 80th quantiles of the values in each row.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
20% 9.10 9.28 9.36 12.86 13.94 14.36 12.84 13.88 15.70 14.68 16.46 15.24
40% 9.90 9.96 9.92 15.22 17.18 17.92 14.68 16.76 20.30 18.16 21.62 19.08
60% 22.24 21.24 20.76 27.52 31.24 32.36 25.68 29.56 34.08 30.92 35.16 32.00
80% 46.12 43.12 41.88 49.76 56.12 57.68 45.84 52.28 57.04 52.96 57.08 54.00
[,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
20% 15.40 15.54 14.84 16.62 21.26 18.94 18.50 18.24 22.2 21.20 23.22 24.92
40% 19.40 19.38 17.68 20.34 29.62 24.58 23.30 22.68 30.4 28.20 31.94 33.84
60% 32.32 30.84 30.28 32.56 44.04 39.12 34.76 32.72 43.2 41.36 43.84 45.04
80% 54.16 49.92 52.64 53.28 64.52 62.56 52.88 48.36 60.6 60.68 58.92 58.52
[,25] [,26] [,27] [,28] [,29] [,30] [,31]
20% 26.82 32.54 32.78 34.06 31.4 31.2 43.16
40% 37.34 47.78 48.06 50.22 44.8 44.4 65.72
60% 49.48 60.52 60.96 62.64 57.2 56.8 79.00
80% 63.24 70.76 71.48 71.32 68.6 68.4 83.00
You can perform the aggregates with all of the same functions shown
with sapply()
above with aggregate()
.
With aggregate()
, you can perform the aggregation based
on groups and subsets, hence performing statistical analysis based on
subsets by groups or factors in the dataframe.
Using CO2 data from datasets package.
Sample rows from CO2:
Plant Type Treatment conc uptake
1 Qn1 Quebec nonchilled 95 16.0
4 Qn1 Quebec nonchilled 350 37.2
12 Qn2 Quebec nonchilled 500 40.6
15 Qn3 Quebec nonchilled 95 16.2
39 Qc3 Quebec chilled 350 34.0
42 Qc3 Quebec chilled 1000 41.4
50 Mn2 Mississippi nonchilled 95 12.0
54 Mn2 Mississippi nonchilled 500 32.4
73 Mc2 Mississippi chilled 250 12.3
84 Mc3 Mississippi chilled 1000 19.9
Aggregate the means by Type:
Type uptake
1 Quebec 33.54286
2 Mississippi 20.88333
Aggregate the means by Treatment:
Treatment uptake
1 nonchilled 30.64286
2 chilled 23.78333
Aggregate the means by Type and Treatment:
Type Treatment uptake
1 Quebec nonchilled 35.33333
2 Mississippi nonchilled 25.95238
3 Quebec chilled 31.75238
4 Mississippi chilled 15.81429
Aggregate quantiles by Type and Treatment:
Type Treatment uptake.33.33333% uptake.66.66667%
1 Quebec nonchilled 35.13333 40.86667
2 Mississippi nonchilled 26.06667 30.70000
3 Quebec chilled 31.76667 38.26667
4 Mississippi chilled 13.46667 18.03333
You can perform the aggregates on dataframes based on special or custom functions to derive results of interest.
The is applicable to apply()
, sapply()
, and
aggregate()
.
Derive column sum of squares:
See a sample from the cars dataframe used here above.
# First, sum of squares formula for any x vector
sumSS = function(x){sum(x^2)}
sapply(cars, FUN = sumSS)
# Or
apply(cars, MARGIN = 2, FUN = sumSS)
speed dist
13228 124903
Derive row sum of squares:
See a sample from the cars dataframe used here above.
# First, sum of squares formula for any x vector
sumSS = function(x){sum(x^2)}
apply(cars, MARGIN = 1, FUN = sumSS)
[1] 20 116 65 533 320 181 424 776 1256 410 905 340
[13] 544 720 928 845 1325 1325 2285 872 1492 3796 6596 625
[25] 901 3141 1280 1856 1313 1889 2789 2088 3460 6100 7380 1657
[37] 2477 4985 1424 2704 3104 3536 4496 4840 3445 5476 9040 9225
[49] 14976 7850
Derive sum of squared deviations by groups:
See a sample from the CO2 dataframe used here above.
# First, sum of squared deviations formula for any x vector
sumSD = function(x){sum(x^2 - mean(x))}
aggregate(uptake ~ Type + Treatment, data = CO2, FUN = sumSD)
Type Treatment uptake
1 Quebec nonchilled 27317.14
2 Mississippi nonchilled 14694.88
3 Quebec chilled 22366.14
4 Mississippi chilled 5249.33
The feedback form is a Google form but it does not collect any personal information.
Please click on the link below to go to the Google form.
Thank You!
Go to Feedback Form
Copyright © 2020 - 2024. All Rights Reserved by Stats Codes