DATAMASH
Section: User Commands (1)
Updated: July 2014
NAME
datamash - command-line calculations
SYNOPSIS
datamash
[OPTION] op [col] [op col ...]
DESCRIPTION
Performs numeric/string operations on input from stdin.
'op' is the operation to perform.
For grouping operations, 'col' is the input field to use.
File operations:
-
transpose, reverse
Numeric Grouping operations:
-
sum, min, max, absmin, absmax
Textual/Numeric Grouping operations:
-
count, first, last, rand
unique, collapse, countunique
Statistical Grouping operations:
-
mean, median, q1, q3, iqr, mode, antimode
pstdev, sstdev, pvar, svar, mad, madraw
pskew, sskew, pkurt, skurt, dpo, jarque
OPTIONS
Grouping Options:
- -f, --full
-
print entire input line before op results
(default: print only the grouped keys)
- -g, --group=X[,Y,Z]
-
group via fields X[,Y,Z]
- --header-in
-
first input line is column headers
- --header-out
-
print column headers as first line
- -H, --headers
-
same as '--header-in --header-out'
- -i, --ignore-case
-
ignore upper/lower case when comparing text;
this affects grouping, and string operations
- -s, --sort
-
sort the input before grouping; this removes the
need to manually pipe the input through 'sort'
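As a rough illustration of the grouping options above, the sketch below runs a small table through datamash. The sample data and the variable names are invented for this example; only options listed on this page are used.

```shell
# Made-up sample: a header line, then keys that differ only in case.
data=$(printf 'key\tval\na\t1\nA\t2\nb\t3\n')

# --header-in skips the header line; -i treats "a" and "A" as one group.
# cut keeps only the sum column.
sums=$(printf '%s\n' "$data" | datamash --header-in -i -g 1 sum 2 | cut -f2)
printf '%s\n' "$sums"

# Without -i, "a" and "A" are separate groups.
sums_cs=$(printf '%s\n' "$data" | datamash --header-in -g 1 sum 2 | cut -f2)
printf '%s\n' "$sums_cs"
```

With -i the first two data lines fall into one group, so two sums are printed; without it, three.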
File Operation Options:
- --no-strict
-
allow lines with varying number of fields
- --filler=X
-
fill missing values with X (default N/A)
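A minimal sketch of how the two file operation options might be combined with 'transpose' on ragged input. The input is made up for illustration, and the filler value X is chosen arbitrarily.

```shell
# Made-up ragged input: the first line has three fields, the second has two.
ragged=$(printf 'a\tb\tc\nd\te\n')

# --no-strict accepts the uneven lines; --filler=X pads the missing cell.
transposed=$(printf '%s\n' "$ragged" | datamash --no-strict --filler=X transpose)
printf '%s\n' "$transposed"
```

Without --no-strict, datamash is expected to reject this input because the lines have different numbers of fields.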
General Options:
- -t, --field-separator=X
-
use X instead of TAB as field delimiter
- -W, --whitespace
-
use whitespace (one or more spaces and/or tabs)
for field delimiters
- -z, --zero-terminated
-
end lines with 0 byte, not newline
- --help
-
display this help and exit
- --version
-
output version information and exit
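The field delimiter options can be tried on inline data, as in the sketch below. The sample values and variable names are assumptions made for this example.

```shell
# Comma-separated input: -t, sets the field delimiter to a comma.
csv_sums=$(printf 'A,10\nA,5\nB,9\n' | datamash -t, -g 1 sum 2)
printf '%s\n' "$csv_sums"

# Irregular spacing (runs of spaces and tabs): -W splits on any whitespace.
ws_sums=$(printf 'A   10\nA \t 5\nB 9\n' | datamash -W -g 1 sum 2)
printf '%s\n' "$ws_sums"
```

Both commands group on field 1 and sum field 2, giving 15 for group A and 9 for group B.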
AVAILABLE OPERATIONS
File operations:
- transpose
-
transpose rows, columns of the input file
- reverse
-
reverse field order in each line
Numeric Grouping operations
- sum
-
sum of the values
- min
-
minimum value
- max
-
maximum value
- absmin
-
minimum of the absolute values
- absmax
-
maximum of the absolute values
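The numeric operations above can be compared side by side on one column. This is a small sketch with invented values; no grouping option is given, so the whole input is treated as a single group.

```shell
# Made-up values, mixing negative and positive numbers.
numeric=$(printf '%s\n' -3 1 4 -2 8 | datamash sum 1 min 1 max 1 absmin 1 absmax 1)
printf '%s\n' "$numeric"
```

Here 'min' reports -3 while 'absmin' reports 1, since 1 has the smallest absolute value in the input.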
Textual/Numeric Grouping operations
- count
-
count number of elements in the group
- first
-
the first value of the group
- last
-
the last value of the group
- rand
-
one random value from the group
- unique
-
comma-separated sorted list of unique values
- collapse
-
comma-separated list of all input values
- countunique
-
number of unique/distinct values
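To see how 'count', 'unique', 'collapse' and 'countunique' differ, the following sketch applies all four to the same field. The two-column input is made up for illustration.

```shell
# Made-up input: group key in field 1, a text value in field 2.
textual=$(printf 'A\tx\nA\ty\nA\tx\nB\tz\n' |
  datamash -g 1 count 2 unique 2 collapse 2 countunique 2)
printf '%s\n' "$textual"
```

For group A, 'collapse' keeps the repeated value (x,y,x) while 'unique' lists each value once (x,y).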
Statistical Grouping operations
- mean
-
mean of the values
- median
-
median value
- q1
-
1st quartile value
- q3
-
3rd quartile value
- iqr
-
inter-quartile range
- mode
-
mode value (most common value)
- antimode
-
anti-mode value (least common value)
- pstdev
-
population standard deviation
- sstdev
-
sample standard deviation
- pvar
-
population variance
- svar
-
sample variance
- mad
-
median absolute deviation, scaled by constant 1.4826 for normal distributions
- madraw
-
median absolute deviation, unscaled
- sskew
-
skewness of the (sample) group
- pskew
-
skewness of the (population) group
values x reported by 'sskew' and 'pskew' operations:
x > 0 - positively skewed / skewed right
0 > x - negatively skewed / skewed left
x > 1 - highly skewed right
1 > x > 0.5 - moderately skewed right
0.5 > x > -0.5 - approximately symmetric
-0.5 > x > -1 - moderately skewed left
-1 > x - highly skewed left
- skurt
-
excess kurtosis of the (sample) group
- pkurt
-
excess kurtosis of the (population) group
- jarque
-
p-value of the Jarque-Bera test for normality
- dpo
-
p-value of the D'Agostino-Pearson Omnibus test for normality.
For the 'jarque' and 'dpo' operations:
the null hypothesis is normality;
low p-values indicate non-normal data;
high p-values indicate that the null hypothesis cannot be rejected.
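A few of the statistical operations above can be checked against hand-computable inputs. The sketch below uses small invented data sets; the variable names are illustrative only.

```shell
# Quartiles of the integers 1..9.
quartiles=$(seq 9 | datamash median 1 q1 1 q3 1 iqr 1)
printf '%s\n' "$quartiles"

# Eight values with mean 5 and squared deviations summing to 32,
# so the population variance is 32/8 = 4 and pstdev is 2.
spread=$(printf '%s\n' 2 4 4 4 5 5 7 9 | datamash mean 1 pvar 1 pstdev 1)
printf '%s\n' "$spread"

# One outlier (100): the median is 3, the absolute deviations are
# 2 1 0 1 97, and their median is 1. 'mad' multiplies that by 1.4826.
robust=$(printf '%s\n' 1 2 3 4 100 | datamash madraw 1 mad 1)
printf '%s\n' "$robust"
```

The last command shows why 'mad' is often preferred for data with outliers: the single value 100 does not affect the result.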
EXAMPLES
Print the sum and the mean of values from column 1:
- $ seq 10 | datamash sum 1 mean 1
55 5.5
Group input based on field 1, and sum values (per group) on field 2:
- $ cat example.txt
A 10
A 5
B 9
B 11
$ datamash -g 1 sum 2 < example.txt
A 15
B 20
Unsorted input must be sorted (with '-s'):
- $ cat example.txt
A 10
C 4
B 9
C 1
A 5
B 11
$ datamash -s -g1 sum 2 < example.txt
A 15
B 20
C 5
Which is equivalent to:
- $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2
Use -H (--headers) if the input file has a header line:
- # Given a file with student name, field, test score...
$ head -n5 scores_h.txt
Name Major Score
Shawn Engineering 47
Caleb Business 87
Christian Business 88
Derek Arts 60
# Calculate the mean and standard deviation for each major
$ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt
(or use short form)
$ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt
GroupBy(Major) mean(Score) pstdev(Score)
Arts 68.9 10.1
Business 87.3 4.9
Engineering 66.5 19.1
Health-Medicine 90.6 8.8
Life-Sciences 55.3 19.7
Social-Sciences 60.2 16.6
Reverse field order in each line:
- $ seq 6 | paste - - | datamash reverse
2 1
4 3
6 5
Transpose rows, columns:
- $ seq 6 | paste - - | datamash transpose
1 3 5
2 4 6
ADDITIONAL INFORMATION
See
GNU Datamash Website (http://www.gnu.org/software/datamash)
AUTHOR
Written by Assaf Gordon.
COPYRIGHT
Copyright © 2014 Assaf Gordon
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
SEE ALSO
The full documentation for datamash is maintained as a Texinfo manual. If the info and datamash programs are properly installed at your site, the command
-
info datamash
should give you access to the complete manual.