Second-Hand Car Trades

It is so exciting that we are having our first mini project! Time to practice the skills we have learned.

Our task is to find the following stats from the mock records of last year's second-hand car trades and fill in the table below:

Mileage              Average Price   Highest Price   Most Expensive Car's Model   Lowest Price   Cheapest Car's Model
0 ~ 5000 miles
5000 ~ 10000 miles
10000 ~ 15000 miles
15000 ~ 20000 miles
...

1. Preparation

Test Data

The test data for this section can be found in our Underneath GitHub repository, at /Linux/AWK/auto-sales.csv.

It contains 50,000 mock records of second-hand car trades from last year.

Here is a glimpse of the data:

Make        Model     Model Year  City of Sale    Price      Mileage
Chrysler    Concorde  1995        San Mariano     $54363.18  22943
Toyota      Tacoma    1997        Cibitungmasjid  $54454.63  40357
Chevrolet   Lumina    2001        Xinjie          $66763.48  14185
Mitsubishi  Chariot   1995        Dū Laīnah       $81671.18  14242

2. Explore the Data

2.1 Warm Up

Before diving into our task, let's first get some clues about this unfamiliar data.

Head

head auto-sales.csv
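
We can also double-check the record count; since the file holds 50,000 records plus a header line, wc should report 50,001 lines:

wc -l auto-sales.csv
50001 auto-sales.csv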

Then let's have a look at how many car makes there are in our data.

Car Makes Count

awk 'BEGIN {FS=","} NR>1 && !car_make[$1]++ {print $1}' auto-sales.csv | wc -l
78

From the result, we know there are 78 car makes in total, but ...

Works like a charm?

Do you know why these two are here?

  • NR>1 — We don't need the header.
  • !car_make[$1]++ — We only want to count each car make once; after a make has been counted, every later occurrence is ignored.

And wc -l simply counts how many lines there are. A tiny demo of the counting idiom follows below.
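
Here is the dedup idiom on a minimal, made-up input (not rows from the CSV):

printf 'make\nToyota\nFord\nToyota\n' | awk 'NR>1 && !seen[$1]++ {print $1}'
Toyota
Ford

The first Toyota line sets seen["Toyota"] to 1, so the later duplicate fails the !seen[$1]++ test and is skipped.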

2.2 Abnormal Data?

A brief summary of the whole data set would be quite informative. However, before we try it, let's look further into the data: with 50,000 rows, it is possible that the format is not perfect.

Find the Anomalies

awk '
BEGIN {
    FS=","; printf "%-12s%8s\n", "No. of Fields", "Counts"
}
{
    if (NR > 1) {
        field_count[NF]++
    }
}
END {
    for (n in field_count) {
        printf "%-12s%8d\n", n" fields", field_count[n]
    }
}
' auto-sales.csv
No. of Fields  Counts
6 fields       49989
7 fields           9
9 fields           2

The result shows there are 9 rows with 7 fields and 2 rows with 9 fields. We had better have a look at what's happening in these rows.

What's the Trouble?

awk 'BEGIN {FS=","; print "Abnormal Lines:"} NF > 6 {print NR,$0}' auto-sales.csv
Abnormal Lines:
163 GMC,Vandura G1500,1995,"Palmas De Gran Canaria, Las",$44040.73,20190
3270 Buick,LaCrosse,2008,"Villa Presidente Frei, Ñuñoa, Santiago, Chile",$20017.13,76130
14621 Chrysler,Crossfire,2007,"Hospitalet De Llobregat, L'",$37630.40,67305
20305 Aston Martin,DB9,2012,"Brgy. Nalook, kalibo",$38022.23,69482
21516 Porsche,928,1987,"Villa Presidente Frei, Ñuñoa, Santiago, Chile",$70256.79,32108
27181 Tesla,Roadster,2011,"Palmas De Gran Canaria, Las",$53713.62,12872
29971 Toyota,Solara,2006,"Hospitalet De Llobregat, L'",$47681.80,67924
34442 Mazda,MX-5,1999,"Laikit, Laikit II (Dimembe)",$64224.04,11707
41271 Mazda,Mazda3,2006,"Palmas De Gran Canaria, Las",$84869.12,31552
42110 GMC,Yukon XL 2500,2011,"Palmas De Gran Canaria, Las",$49027.72,11711
47084 Chevrolet,Equinox,2010,"Las Vegas, Santa Barbara",$57602.29,15432

It is the commas (,) in the addresses that are causing the problem. However, luckily for us, the abnormal addresses are all double-quoted.

FPAT

FPAT, a gawk feature, gives you the ability to define fields with a regex: instead of describing the separators, you describe what the content of a field looks like.
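
A minimal illustration on a made-up line (assuming gawk, where FPAT is available):

echo 'a,"b,c",d' | gawk 'BEGIN {FPAT="([^,]+)|(\"[^\"]+\")"} {print NF, $2}'
3 "b,c"

The quoted field keeps its comma and is treated as a single field.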

Using FPAT

awk '
BEGIN {
    FPAT="([^,]+)|(\"[^\"]+\")"; printf "%-12s%8s\n", "No. of Fields", "Counts"
}
{
    if (NR > 1) {
        field_count[NF]++
    }
}
END {
    for (n in field_count) {
        printf "%-12s%8d\n", n" fields", field_count[n]
    }
}
' auto-sales.csv
No. of Fields  Counts
6 fields       50000

Now the troubles are gone!

FPAT

We will only use FPAT for this little summary. You might have noticed that the regex matching slows things down a bit; it's better to avoid it when dealing with large data. See the comparison sketch below.
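
To feel the difference yourself, you can time the same field tally with FPAT and with a plain FS (a sketch; the numbers will depend on your machine):

time gawk 'BEGIN {FPAT="([^,]+)|(\"[^\"]+\")"} NR>1 {field_count[NF]++}' auto-sales.csv
time gawk 'BEGIN {FS=","} NR>1 {field_count[NF]++}' auto-sales.csv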

2.3 Brief Summary Stats

Now it's time for us to make a brief summary of the data.

auto-sales-summary.awk

Save the code as auto-sales-summary.awk

BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
    printf "%-20s%15s\n", "Name", "Counts"
    for (i=0;i<35;i++) printf "-"
    printf "\n"
}
{
    if (NR > 1){
        # count each make/model/model-year/city only on first sight
        !car_make[$1]++ && no_car_make++
        !car_model[$1$2]++ && no_car_model++
        !model_year[$1$2$3]++ && no_model_year++
        !city[$4]++ && no_city++
        # drop the leading "$" before summing up the prices
        sales += substr($5, 2)
        mileage += $6
    }
}
END {
    printf "%-20s%15d\n", "Car Make", no_car_make
    printf "%-20s%15d\n", "Car Model", no_car_model
    printf "%-20s%15d\n", "Car Model Year", no_model_year
    printf "%-20s%15d\n", "City", no_city
    for (i=0;i<35;i++) printf "-"
    printf "\n\n\n"
    printf "%-20s%15s\n", "Name", "Average"
    for (i=0;i<35;i++) printf "-"
    printf "\n"
    printf "%-20s%15.2f\n", "Average Price", sales / (NR - 1)
    printf "%-20s%15.2f\n", "Average Mileage", mileage / (NR - 1)
}

Summary

awk -f auto-sales-summary.awk auto-sales.csv
Name                         Counts
-----------------------------------
Car Make                         78
Car Model                      1046
Car Model Year                 7254
City                          24647
-----------------------------------


Name                        Average
-----------------------------------
Average Price              54933.42
Average Mileage             44980.66
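
By the way, substr($5, 2) above just drops the leading dollar sign and lets awk coerce the rest into a number when summing; a sub()-based one-liner (a quick sketch) shows the same idea:

echo '$54363.18' | awk '{sub(/^\$/, "", $1); print $1}'
54363.18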

3. Our Task

Now let's finish the task.

auto-sales-task.awk

Save the code as auto-sales-task.awk

BEGIN {
    # set field separator to ","
    FS = ","
}
{
    if (NR > 1){
        # key is the mileage factor:
        # given a key, it represents the range
        # from key * 5000 to (key + 1) * 5000 - 1;
        # for example, if key = 1, the mileage range is [5000, 9999]
        key = int($NF / 5000)

        # remove the dollar ($) sign and coerce the price to a number
        price = substr($(NF-1), 2) + 0

        # record the largest factor for printing purposes in the `END` part
        if (key > max_key) max_key = key

        # sum up the sales and count the rows within the mileage range
        sales[key] += price
        count[key]++

        # record the most expensive car
        if (max_price[key] < price) {
            max_price[key] = price
            max_car_model[key] = $1" "$2" "$3
        }
        # record the cheapest car
        if (min_price[key] == 0 || min_price[key] > price) {
            min_price[key] = price
            min_car_model[key] = $1" "$2" "$3
        }
    }
}
END {
    # 1. print avg, max, min prices grouped by mileage
    # 1.1 print the headers
    for (i=0;i<65;i++) printf "-"; printf "\n"
    printf "%20s%15s", "Mileage", "Average Price"
    printf "%15s%15s\n", "Max Price", "Min Price"
    for (i=0;i<65;i++) printf "-"; printf "\n"

    # 1.2 print the prices
    for (i = 0; i <= max_key; i++){
        printf "%6s ~%6s miles", i * 5000, (i + 1) * 5000 - 1
        printf "%15s", sales[i] ? "$ "sales[i] / count[i] : ""
        printf "%15s", max_price[i] ? "$ "max_price[i] : ""
        printf "%15s\n", min_price[i] ? "$ "min_price[i] : ""
    }
    for (i=0;i<65;i++) printf "-"; printf "\n"

    # 2. print most expensive and cheapest model per mileage range
    # 2.1 print the headers
    for (i=0;i<90;i++) printf "-"; printf "\n"
    printf "%20s%30s%40s\n", "Mileage", "Most Expensive Model", "Cheapest Model"
    for (i=0;i<90;i++) printf "-"; printf "\n"

    # 2.2 print the most expensive and cheapest car models
    for (i = 0; i <= max_key; i++){
        printf "%6s ~%6s miles", i * 5000, (i + 1) * 5000 - 1
        printf "%30s%40s\n", max_car_model[i], min_car_model[i]
    }
    for (i=0;i<90;i++) printf "-"; printf "\n"
}

There is no need to panic. Most of the lines are just printing; we only need to take a look at the commands in the middle.

First, we need a plan: we want to group the records by mileage range.

How can we assign each line to its mileage range?

We can use a key to index the data in arrays. The key we use here is int(mileage/5000), which is the floor of the mileage divided by the range interval (5000).

Note also that we read the price and the mileage as $(NF-1) and $NF instead of $5 and $6: even in the abnormal rows where the quoted city contains commas, the last two fields are still the price and the mileage, so the plain FS="," is safe here.

Then we just need to record the data into arrays, which we've already done loads of times.
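
Here is the key calculation on a few made-up mileages (a sketch, not rows from the CSV):

printf '22943\n4999\n67305\n' | awk '{k = int($1/5000); printf "%s -> [%d, %d]\n", $1, k*5000, (k+1)*5000-1}'
22943 -> [20000, 24999]
4999 -> [0, 4999]
67305 -> [65000, 69999]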

Tip

Like in a bash script, use "#" to write your comments.

Run

awk -f auto-sales-task.awk auto-sales.csv

Output

-----------------------------------------------------------------
             Mileage  Average Price      Max Price      Min Price
-----------------------------------------------------------------
    0 ~  4999 miles
 5000 ~  9999 miles
10000 ~ 14999 miles      $ 54381.6        $ 89981        $ 20013
15000 ~ 19999 miles      $ 54154.6        $ 89981        $ 20028
20000 ~ 24999 miles      $ 55095.8        $ 89982        $ 20000
25000 ~ 29999 miles        $ 54785        $ 89964        $ 20008
30000 ~ 34999 miles      $ 55028.3        $ 89991        $ 20037
35000 ~ 39999 miles      $ 55340.8        $ 89985        $ 20083
40000 ~ 44999 miles      $ 55064.9        $ 89975        $ 20031
45000 ~ 49999 miles      $ 55261.1        $ 89993        $ 20015
50000 ~ 54999 miles      $ 55009.5        $ 89996        $ 20004
55000 ~ 59999 miles      $ 55013.5        $ 89966        $ 20008
60000 ~ 64999 miles      $ 54686.9        $ 89970        $ 20017
65000 ~ 69999 miles      $ 54650.6        $ 89998        $ 20038
70000 ~ 74999 miles      $ 55116.5        $ 89972        $ 20044
75000 ~ 79999 miles        $ 55465        $ 89999        $ 20017
80000 ~ 84999 miles        $ 27307        $ 27307        $ 27307
-----------------------------------------------------------------
------------------------------------------------------------------------------------------
             Mileage          Most Expensive Model                          Cheapest Model
------------------------------------------------------------------------------------------
    0 ~  4999 miles
 5000 ~  9999 miles
10000 ~ 14999 miles             Subaru Justy 1987                     Audi Cabriolet 1997
15000 ~ 19999 miles             Subaru Leone 1989                      Toyota Tacoma 1995
20000 ~ 24999 miles  Bentley Continental GTC 2011                   GMC Vandura 3500 1994
25000 ~ 29999 miles                 Lexus GS 1995                      Nissan Maxima 1998
30000 ~ 34999 miles            Plymouth Neon 1996     Land Rover Discovery Series II 2000
35000 ~ 39999 miles             Pontiac Vibe 2009                          Ford F250 2012
40000 ~ 44999 miles           Toyota 4Runner 1996                  Mitsubishi Mirage 2000
45000 ~ 49999 miles       Buick Coachbuilder 1991             Mercedes-Benz SL-Class 2003
50000 ~ 54999 miles                 Lexus GX 2010           Ford Explorer Sport Trac 2008
55000 ~ 59999 miles              Suzuki XL-7 2007                     Pontiac LeMans 1989
60000 ~ 64999 miles              Porsche 911 2009                    Nissan Frontier 2011
65000 ~ 69999 miles            Subaru Legacy 2001                            Audi A6 2000
70000 ~ 74999 miles             Jeep Patriot 2012                      Ford E-Series 1994
75000 ~ 79999 miles             BMW 8 Series 1992                     Buick LaCrosse 2008
80000 ~ 84999 miles             Aptera Typ-1 2009                       Aptera Typ-1 2009
------------------------------------------------------------------------------------------

4. How fast is awk?

By appending the data to itself multiple times, I got a CSV of 5 million records.

How fast is our program with 5 million records?
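
If you want to build such a file yourself, here is one possible way (a sketch; 100 copies of the 50,000 data rows plus the original header line add up to the 5,000,001 lines shown below):

head -n 1 auto-sales.csv > 5m-data.csv
for i in $(seq 1 100); do tail -n +2 auto-sales.csv >> 5m-data.csv; done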

Run

cat 5m-data.csv | wc -l
Output
5000001

Run

time awk -f auto-sales-task.awk 5m-data.csv
Output
awk -f auto-sales-task.awk 5m-data.csv  8.79s user 0.05s system 99% cpu 8.839 total

8.839s is surely not a disappointment for our task, given that our data is quite random!

Fast or Slow?

Our task is not an efficient one; actually, the program is quite time-consuming.

By comparison, the easy task from our warm-up session costs only 1.852s to go through the 5M data:

time awk 'BEGIN {FS=","} NR>1 && !car_make[$1]++ {print $1}' 5m-data.csv
awk 'BEGIN {FS=","} NR>1 && !car_make[$1]++ {print $1}' 5m-data.csv  1.82s user 0.04s system 99% cpu 1.852 total

And if you explore further, you will find that the most time-consuming part is the key calculation in our program, which is responsible for almost half of the time used.
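
One rough way to check this yourself (a sketch: time a pass that only computes the key against one that only reads the field; the gap approximates the key calculation's cost on your machine):

time awk 'BEGIN {FS=","} NR>1 {key = int($NF / 5000)}' 5m-data.csv
time awk 'BEGIN {FS=","} NR>1 {key = $NF}' 5m-data.csv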