Student Scores - Continued

Let's continue with our students' exam results. This time, we will write an awk program!

1. Preparation

Raw Data

Run the command below to create our test data:

cat <<SCORES > scores.txt
William Shakespeare Male English 1564 90
Jane Austen Female English 1775 87
Alexandre Dumas Male French 1802 58
Mark Twain Male American 1835 79
Charles Dickens Male English 1812 83
Franz Kafka Male German 1883 74
J.R.R. Tolkien Male English 1892 47
Ernest Hemingway Male American 1899 66
SCORES
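The scripts in this section read the gender from $3, the nationality from $4, the birth year from $5, and the score from the last field, so every row needs exactly six whitespace-separated fields. A minimal sanity check, assuming the same eight rows as above (the variable names bad_rows and row_count are only illustrative):

```shell
# Recreate the sample data so this check is self-contained.
cat > scores.txt <<'SCORES'
William Shakespeare Male English 1564 90
Jane Austen Female English 1775 87
Alexandre Dumas Male French 1802 58
Mark Twain Male American 1835 79
Charles Dickens Male English 1812 83
Franz Kafka Male German 1883 74
J.R.R. Tolkien Male English 1892 47
Ernest Hemingway Male American 1899 66
SCORES

# Report every row whose field count (NF) is not 6; stays empty if all rows are fine.
bad_rows=$(awk 'NF != 6 { printf "line %d has %d fields\n", NR, NF }' scores.txt)

# NR in the END block holds the number of rows read.
row_count=$(awk 'END { print NR }' scores.txt)

echo "rows: $row_count, malformed: ${bad_rows:-none}"
```

A name with three words (for example a middle name written separately) would shift the later fields, which is why a check like this is worth running before trusting the statistics.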

2. Using a Script

An awk program is mainly composed of three blocks: BEGIN, COMMANDS, and END.

Name	Description
BEGIN	Initialize
COMMANDS	Process Rows
END	Finalize

First, let us try a Hello World program in awk.

This program will:

Print Hello World! at initialization.
For each line, print Hello, followed by the student's name.
Print Good Luck! at finalization.

Save the code below into a file named helloworld.awk, run the command that follows it, and compare the result with the expected output.

Hello World

Save the following code in a file named helloworld.awk:

BEGIN {
    printf "Hello World!\n"
}
{
    printf "Hello, %s %s!\n", $1,$2
}
END {
    printf "Good Luck!\n"
}

awk -f helloworld.awk scores.txt

Hello World!
Hello, William Shakespeare!
Hello, Jane Austen!
Hello, Alexandre Dumas!
Hello, Mark Twain!
Hello, Charles Dickens!
Hello, Franz Kafka!
Hello, J.R.R. Tolkien!
Hello, Ernest Hemingway!
Good Luck!
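For short programs, the same three blocks can also be passed to awk directly on the command line instead of through a file. The sketch below assumes a two-row excerpt of the data in a scratch file (sample.txt is a hypothetical name) and adds the built-in record counter NR to show what each block can see:

```shell
# Two-row excerpt of scores.txt, written to a scratch file.
printf '%s\n' \
  'William Shakespeare Male English 1564 90' \
  'Jane Austen Female English 1775 87' > sample.txt

# BEGIN runs before any input, the bare block runs once per line,
# and END runs after the last line (NR then holds the total row count).
greeting=$(awk '
BEGIN { print "Hello World!" }
      { printf "%d: Hello, %s %s!\n", NR, $1, $2 }
END   { printf "Greeted %d students. Good Luck!\n", NR }
' sample.txt)

echo "$greeting"
```

Keeping the program in a .awk file, as in this section, tends to be easier to read and edit once it grows beyond a few lines.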

3. Statistics

3.1 Count

Let's try a simple program: find out how many male and female students there are.

Gender Count

Save the code into a file named gender-count.awk.

BEGIN {
    female = 0
    male = 0
    mystery = 0
    printf "Gender    Count   \n"
}
{
    if ($3 == "Male")
        male += 1
    else if ($3 == "Female")
        female += 1
    else
        mystery += 1
}
END {
    printf "%-10s%-8d\n", "Female",female
    printf "%-10s%-8d\n", "Male",male
    printf "%-10s%-8d\n", "Mystery",mystery
}

awk -f gender-count.awk scores.txt

Gender    Count
Female    1
Male      7
Mystery   0
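The if/else chain in the script can also be written with awk's pattern-action form, where a condition in front of a block selects the rows that block runs on. A minimal sketch, assuming a three-row excerpt in a scratch file (sample.txt is a hypothetical name) and relying on uninitialized awk variables counting as 0:

```shell
# Three-row excerpt of scores.txt.
printf '%s\n' \
  'William Shakespeare Male English 1564 90' \
  'Jane Austen Female English 1775 87' \
  'Alexandre Dumas Male French 1802 58' > sample.txt

# Each rule runs only on rows where its pattern is true.
gender_counts=$(awk '
$3 == "Male"   { male++ }
$3 == "Female" { female++ }
END { printf "Female=%d Male=%d\n", female, male }
' sample.txt)

echo "$gender_counts"
```

Unlike the if/else version, this sketch does not count rows that match neither pattern; a third rule would be needed to reproduce the Mystery bucket.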

What about another count by nationality?

Nationality Count

Save the code into a file named nationality-count.awk.

{
    nationality[$4] += 1
}
END {
    printf "%-12s%6s\n", "Nationality","Counts"
    for (n in nationality)
        printf "%-12s%6d\n", n, nationality[n]
}

awk -f nationality-count.awk scores.txt

Nationality Counts
German           1
American         2
French           1
English          4
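The order in which for (n in array) visits the keys is not specified by awk, so the rows may appear in a different order on another machine or awk implementation. One way to get a stable listing is to pipe the rows through sort. The sketch below assumes the same eight rows and orders by count (descending), then by name; the header line is left out so it does not get sorted into the data:

```shell
# Recreate the sample data so this example is self-contained.
cat > scores.txt <<'SCORES'
William Shakespeare Male English 1564 90
Jane Austen Female English 1775 87
Alexandre Dumas Male French 1802 58
Mark Twain Male American 1835 79
Charles Dickens Male English 1812 83
Franz Kafka Male German 1883 74
J.R.R. Tolkien Male English 1892 47
Ernest Hemingway Male American 1899 66
SCORES

# Count per nationality, then sort: column 2 numeric descending, ties by column 1.
nationality_sorted=$(awk '
{ count[$4]++ }
END { for (n in count) printf "%-12s%6d\n", n, count[n] }
' scores.txt | LC_ALL=C sort -k2,2nr -k1,1)

echo "$nationality_sorted"
```

LC_ALL=C is set only to make the tie-breaking by name independent of the local language settings.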

3.2 Average

Now let's calculate some average scores.

First, let's calculate the average score by gender.

Average by Gender

Save the code into a file named gender-average.awk.

BEGIN {
    female = 0; male = 0; mystery = 0
    female_score = 0; male_score = 0; mystery_score = 0
    printf "Gender    Avg Scores \n"
    printf "---------------------\n"
}
{
    if ($3 == "Male") {
        male += 1
        male_score += $NF
    }
    else if ($3 == "Female"){
        female += 1
        female_score += $NF
    }
    else {
        mystery += 1
        mystery_score += $NF
    }
}
END {
    printf "%-10s%-12.2f\n", "Female",female == 0 ? 0:female_score / female
    printf "%-10s%-12.2f\n", "Male",male == 0 ? 0:male_score / male
    printf "%-10s%-12.2f\n", "Mystery",mystery == 0 ? 0:mystery_score / mystery
    printf "---------------------\n"
    printf "%-10s%-12.2f\n", "Total",(female_score + male_score + mystery_score) / NR
}

awk -f gender-average.awk scores.txt

Gender    Avg Scores
---------------------
Female    87.00
Male      71.00
Mystery   0.00
---------------------
Total     73.00
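Dividing by a count of zero typically aborts an awk program, so any group that might be empty needs a guard. One way to keep that guard in a single place is a small user-defined function. The sketch below is an assumption-laden variant, not the script above: avg() is a hypothetical helper, and the scratch file sample.txt deliberately contains only male students to exercise the empty case:

```shell
# Two-row excerpt with no female students.
printf '%s\n' \
  'Mark Twain Male American 1835 79' \
  'Charles Dickens Male English 1812 83' > sample.txt

averages=$(awk '
# avg() is a hypothetical helper: it returns 0 instead of dividing by zero.
function avg(sum, n) { return n == 0 ? 0 : sum / n }
{ count[$3]++; total[$3] += $NF }
END {
    printf "%s %.2f\n", "Female", avg(total["Female"], count["Female"])
    printf "%s %.2f\n", "Male", avg(total["Male"], count["Male"])
}
' sample.txt)

echo "$averages"
```

Reporting 0.00 for an empty group mirrors what the Mystery row does; printing a placeholder such as n/a would be an equally reasonable choice.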

Now, let's take a look at the students' performance by home country.

Nationality Average

Save the code into a file named nationality-average.awk.

{
    nationality[$4] += $NF
    nationality_count[$4] += 1
}
END {
    printf "%-12s%10s\n", "Nationality","Avg Score"
    for (n in nationality)
        printf "%-12s%10.2f\n", n, nationality[n] / nationality_count[n]
}

awk -f nationality-average.awk scores.txt

Nationality  Avg Score
German           74.00
American         72.50
French           58.00
English          76.75
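Because the array-based script only depends on which column holds the group key, the same logic can serve both the gender and the nationality report if the column number is passed in with awk's -v option. A minimal sketch, assuming the same eight rows; group_average and the variable col are hypothetical names introduced for illustration:

```shell
# Recreate the sample data so this example is self-contained.
cat > scores.txt <<'SCORES'
William Shakespeare Male English 1564 90
Jane Austen Female English 1775 87
Alexandre Dumas Male French 1802 58
Mark Twain Male American 1835 79
Charles Dickens Male English 1812 83
Franz Kafka Male German 1883 74
J.R.R. Tolkien Male English 1892 47
Ernest Hemingway Male American 1899 66
SCORES

# group_average is a hypothetical helper: $1 is the column number to group by.
group_average() {
    awk -v col="$1" '
    { sum[$col] += $NF; n[$col]++ }
    END { for (k in sum) printf "%s %.2f\n", k, sum[k] / n[k] }
    ' scores.txt | LC_ALL=C sort
}

by_gender=$(group_average 3)
by_nationality=$(group_average 4)

echo "$by_gender"
echo "$by_nationality"
```

Every key in sum has been seen at least once, so n[k] is never zero here and no division guard is needed.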

Let's calculate the average score of students born before 1800 and of those born in 1800 or later.

Average by Time

Save the code into a file named time-average.awk.

BEGIN {
    before_1800 = 0; after_1800 = 0
    before_1800_score = 0; after_1800_score = 0
    printf "Period      Avg Scores \n"
    printf "-------------------------\n"
}
{
    # Students born in 1800 or later count as "After 1800"
    if ($5 >= 1800) {
        after_1800 += 1
        after_1800_score += $NF
    }
    else {
        before_1800 += 1
        before_1800_score += $NF
    }
}
END {
    printf "%-12s%12.2f\n", "Before 1800",before_1800_score / before_1800
    printf "%-12s%12.2f\n", "After 1800",after_1800_score / after_1800
    printf "-------------------------\n"
    printf "%-12s%12.2f\n", "Total",(before_1800_score + after_1800_score) / NR
}

awk -f time-average.awk scores.txt

Period      Avg Scores
-------------------------
Before 1800        88.50
After 1800         67.83
-------------------------
Total              73.00
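The year 1800 is hard-coded in the script, but the split point can also be supplied at run time with awk's -v option. The sketch below is one possible variant under stated assumptions: era_average and cutoff are hypothetical names, the scratch file sample.txt holds a four-row excerpt, and an empty group reports 0 rather than dividing by zero:

```shell
# Four-row excerpt of scores.txt.
printf '%s\n' \
  'William Shakespeare Male English 1564 90' \
  'Jane Austen Female English 1775 87' \
  'Mark Twain Male American 1835 79' \
  'Franz Kafka Male German 1883 74' > sample.txt

# era_average is a hypothetical helper: $1 is the cutoff year.
era_average() {
    awk -v cutoff="$1" '
    {
        # Rows born before the cutoff go to "before", all others to "from".
        era = ($5 < cutoff) ? "before" : "from"
        sum[era] += $NF
        n[era]++
    }
    END {
        printf "before %d: %.2f\n", cutoff, (n["before"] ? sum["before"] / n["before"] : 0)
        printf "from %d: %.2f\n", cutoff, (n["from"] ? sum["from"] / n["from"] : 0)
    }
    ' sample.txt
}

split_1800=$(era_average 1800)
split_1600=$(era_average 1600)

echo "$split_1800"
echo "$split_1600"
```

With the cutoff as a parameter, the same program can be rerun for other years without editing the script.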