Student Scores - Continued
Let's continue with our students' exam results. This time, we will write an awk program!
1. Preparation
Raw Data
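Run the command below to create the test data. Each row holds a first name, last name, gender, nationality, birth year, and score, separated by spaces.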
cat <<SCORES > scores.txt
William Shakespeare Male English 1564 90
Jane Austen Female English 1775 87
Alexandre Dumas Male French 1802 58
Mark Twain Male American 1835 79
Charles Dickens Male English 1812 83
Franz Kafka Male German 1883 74
J.R.R. Tolkien Male English 1892 47
Ernest Hemingway Male American 1899 66
SCORES
2. Using a Script
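An awk program is mainly composed of three blocks: BEGIN, COMMANDS, and END. BEGIN and END are awk keywords; COMMANDS here stands for the unlabeled main block, which runs once for every input line.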
Name | Description |
---|---|
BEGIN | Initialize |
COMMANDS | Process Rows |
END | Finalize |
First, let us try a Hello World program in awk.
This program will:
- Print Hello World! at initialization.
- For each line, print Hello, followed by the student's name.
- Print Good Luck! at finalization.
Run the program with the command shown after the code, and check the result against the output that follows it.
Hello World
Save the following code in a file named helloworld.awk:
BEGIN {
    printf "Hello World!\n"
}
{
    printf "Hello, %s %s!\n", $1, $2
}
END {
    printf "Good Luck!\n"
}
awk -f helloworld.awk scores.txt
Hello World!
Hello, William Shakespeare!
Hello, Jane Austen!
Hello, Alexandre Dumas!
Hello, Mark Twain!
Hello, Charles Dickens!
Hello, Franz Kafka!
Hello, J.R.R. Tolkien!
Hello, Ernest Hemingway!
Good Luck!
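The same three-block structure also fits on the command line. The sketch below is a minimal illustration, not part of the original tutorial files: it inlines two rows from scores.txt so it runs on its own, and uses the built-in NR variable to show that the END block still sees the total number of rows read.

```shell
# Two rows copied from scores.txt are piped in, so no file is needed.
result=$(printf 'Jane Austen Female English 1775 87\nMark Twain Male American 1835 79\n' |
  awk 'BEGIN { print "start" }             # runs once, before any input
       { print NR ": " $1 " " $2 }         # runs for every input line
       END { print "rows: " NR }')         # runs once, after the last line
echo "$result"
```

BEGIN prints before the first row is read, the main block prints one numbered line per row, and END reports the final row count.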
3. Statistics
3.1 Count
Let's try a simple program: count how many male and female students there are.
Gender Count
Save the code into a file named gender-count.awk.
BEGIN {
    female = 0
    male = 0
    mystery = 0
    printf "Gender Count \n"
}
{
    if ($3 == "Male")
        male += 1
    else if ($3 == "Female")
        female += 1
    else
        mystery += 1
}
END {
    printf "%-10s%-8d\n", "Female", female
    printf "%-10s%-8d\n", "Male", male
    printf "%-10s%-8d\n", "Mystery", mystery
}
awk -f gender-count.awk scores.txt
Gender Count
Female 1
Male 7
Mystery 0
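Next, let's count the students by nationality. Instead of one variable per category, this program uses an array indexed by the nationality itself.
Nationality Count
Save the code into a file named nationality-count.awk.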
{
    nationality[$4] += 1
}
END {
    printf "%-12s%6s\n", "Nationality", "Counts"
    for (n in nationality)
        printf "%-12s%6d\n", n, nationality[n]
}
awk -f nationality-count.awk scores.txt
Nationality Counts
German           1
American         2
French           1
English          4
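awk does not define the order in which `for (n in array)` visits the keys, so the rows above may appear in a different order on another awk implementation. One common way to get stable output is to pipe the result through sort. The sketch below is only an illustration: it inlines three rows from scores.txt and omits the header line so that sort does not move it.

```shell
# Three rows copied from scores.txt are piped in, so no file is needed.
result=$(printf '%s\n' \
  'William Shakespeare Male English 1564 90' \
  'Mark Twain Male American 1835 79' \
  'Jane Austen Female English 1775 87' |
  awk '{ count[$4] += 1 }
       END { for (n in count) printf "%-12s%6d\n", n, count[n] }' |
  sort)   # sort makes the row order independent of the awk implementation
echo "$result"
```

With these three rows, American is listed first with a count of 1, followed by English with a count of 2.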
3.2 Average
Now let's calculate some average scores.
First, let's calculate the average score by gender.
Average by Gender
Save the code into a file named gender-average.awk.
BEGIN {
    female = 0; male = 0; mystery = 0
    female_score = 0; male_score = 0; mystery_score = 0
    printf "Gender Avg Scores \n"
    printf "---------------------\n"
}
{
    # $NF is the last field of the row, which holds the score
    if ($3 == "Male") {
        male += 1
        male_score += $NF
    }
    else if ($3 == "Female") {
        female += 1
        female_score += $NF
    }
    else {
        mystery += 1
        mystery_score += $NF
    }
}
END {
    # Guard every division so an empty category prints 0 instead of failing
    printf "%-10s%-12.2f\n", "Female", (female == 0 ? 0 : female_score / female)
    printf "%-10s%-12.2f\n", "Male", (male == 0 ? 0 : male_score / male)
    printf "%-10s%-12.2f\n", "Mystery", (mystery == 0 ? 0 : mystery_score / mystery)
    printf "---------------------\n"
    printf "%-10s%-12.2f\n", "Total", (female_score + male_score + mystery_score) / NR
}
awk -f gender-average.awk scores.txt
Gender Avg Scores
---------------------
Female 87.00
Male 71.00
Mystery 0.00
---------------------
Total 73.00
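Dividing by a count of zero is a fatal error in awk, which is why the Mystery row is guarded with a conditional. The sketch below shows one possible way to apply the same guard to every category. The `avg` function is a hypothetical helper introduced here for illustration, not part of gender-average.awk, and the input is two male rows from scores.txt, so the Female category is empty.

```shell
# Two rows copied from scores.txt; no row has gender "Female".
result=$(printf 'Mark Twain Male American 1835 79\nFranz Kafka Male German 1883 74\n' |
  awk 'function avg(sum, n) { return n == 0 ? 0 : sum / n }   # hypothetical helper
       { count[$3] += 1; score[$3] += $NF }
       END {
         printf "%-10s%.2f\n", "Female", avg(score["Female"], count["Female"])
         printf "%-10s%.2f\n", "Male", avg(score["Male"], count["Male"])
       }')
echo "$result"
```

The Female row prints 0.00 rather than aborting the program, and the Male row prints 76.50, the mean of 79 and 74.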
Now, let's look at the students' performance by nationality.
Nationality Average
Save the code into a file named nationality-average.awk.
{
    nationality[$4] += $NF
    nationality_count[$4] += 1
}
END {
    printf "%-12s%10s\n", "Nationality", "Avg Score"
    for (n in nationality)
        printf "%-12s%10.2f\n", n, nationality[n] / nationality_count[n]
}
awk -f nationality-average.awk scores.txt
Nationality  Avg Score
German           74.00
American         72.50
French           58.00
English          76.75
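Because the grouping key is just a field number, the same two-array pattern can group by any column. The sketch below is an illustration under assumptions: `col` is a hypothetical variable name passed with awk's `-v` option, and the input is three rows from scores.txt. Setting `col=3` groups by gender; `col=4` would group by nationality.

```shell
# Three rows copied from scores.txt are piped in, so no file is needed.
result=$(printf '%s\n' \
  'William Shakespeare Male English 1564 90' \
  'Jane Austen Female English 1775 87' \
  'Mark Twain Male American 1835 79' |
  awk -v col=3 '{ sum[$col] += $NF; n[$col] += 1 }   # col is a hypothetical name
       END { for (k in sum) printf "%-12s%10.2f\n", k, sum[k] / n[k] }' |
  sort)   # sorted only to make the row order predictable
echo "$result"
```

For these rows the Female average is 87.00 and the Male average is 84.50, the mean of 90 and 79.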
Finally, let's compare the average score of students born before 1800 with that of students born in or after 1800.
Average by Time
Save the code into a file named time-average.awk.
BEGIN {
    before_1800 = 0; after_1800 = 0
    before_1800_score = 0; after_1800_score = 0
    printf "%-12s%12s\n", "Period", "Avg Scores"
    printf "-------------------------\n"
}
{
    # $5 is the birth year; a year of exactly 1800 counts as "After 1800"
    if ($5 >= 1800) {
        after_1800 += 1
        after_1800_score += $NF
    }
    else {
        before_1800 += 1
        before_1800_score += $NF
    }
}
END {
    printf "%-12s%12.2f\n", "Before 1800", before_1800_score / before_1800
    printf "%-12s%12.2f\n", "After 1800", after_1800_score / after_1800
    printf "-------------------------\n"
    printf "%-12s%12.2f\n", "Total", (before_1800_score + after_1800_score) / NR
}
awk -f time-average.awk scores.txt
Period        Avg Scores
-------------------------
Before 1800        88.50
After 1800         67.83
-------------------------
Total              73.00
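The two pairs of counters can also be folded into the array pattern used for nationality by computing the group label first. The sketch below is one possible variation, not the tutorial's own program: `period` is a hypothetical variable name, and the input is three rows from scores.txt.

```shell
# Three rows copied from scores.txt are piped in, so no file is needed.
result=$(printf '%s\n' \
  'William Shakespeare Male English 1564 90' \
  'Jane Austen Female English 1775 87' \
  'Mark Twain Male American 1835 79' |
  awk '{ period = ($5 >= 1800) ? "After 1800" : "Before 1800"   # hypothetical name
         count[period] += 1; score[period] += $NF }
       END { for (p in count) printf "%s: %.2f\n", p, score[p] / count[p] }' |
  sort)   # sorted only to make the row order predictable
echo "$result"
```

Only labels that actually occur become array keys, so a period with no students is simply not printed and no division by zero can happen.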