Second-Hand Car Trades¶
It is so exciting that we are having our first mini project! Time to practice the skills we have learned.
Our task is to compute the following stats from the mock records of last year's second-hand car trades, and fill in the table below:
Mileage | Average Price | Highest Price | Most Expensive Car's Model | Lowest Price | Cheapest Car's Model |
---|---|---|---|---|---|
0 ~ 5000 miles | |||||
5000 ~ 10000 miles | |||||
10000 ~ 15000 miles | |||||
15000 ~ 20000 miles | |||||
... |
1. Preparation¶
Test Data
The test data for this section can be found in our Underneath GitHub repo, at /Linux/AWK/auto-sales.csv.
It contains 50,000 mock records of second-hand car trades from last year.
Here is a glimpse of the data:
Make | Model | Model Year | City of Sale | Price | Mileage |
---|---|---|---|---|---|
Chrysler | Concorde | 1995 | San Mariano | $54363.18 | 22943 |
Toyota | Tacoma | 1997 | Cibitungmasjid | $54454.63 | 40357 |
Chevrolet | Lumina | 2001 | Xinjie | $66763.48 | 14185 |
Mitsubishi | Chariot | 1995 | Dū Laīnah | $81671.18 | 14242 |
2. Explore the Data¶
2.1 Warm Up¶
Before diving into our task, let's first get some clues about the unfamiliar data.
Head
head auto-sales.csv
Then let's have a look at how many car makes are in our data.
Car Makes Count
awk 'BEGIN {FS=","} NR>1 && !car_make[$1]++ {print $1}' auto-sales.csv | wc -l
78
From the result, we know there are 78 car makes in total, but ...
Works like a charm?
Do you know why these two are here?
- NR>1 — we don't need the header row.
- !car_make[$1]++ — we just want to count each car make once; after the first time we see a make, every later occurrence is ignored.
And wc -l simply counts how many lines awk printed.
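Count Once
If the !car_make[$1]++ idiom still looks like magic, here is a minimal sketch you can run on made-up input (the makes below are hypothetical): seen[$1] starts at 0, so the negation is true only the first time a make appears, and the ++ makes every later test false. It also does the counting inside awk, which makes the wc -l optional.
printf 'Ford\nToyota\nFord\nToyota\nFord\n' |
awk '!seen[$1]++ {print "first time seeing:", $1; n++} END {print n, "unique makes"}'
first time seeing: Ford
first time seeing: Toyota
2 unique makes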
2.2 Abnormal Data?¶
A brief summary of the whole data would be quite informative; however, before we try it, let's look a little further into the data. As there are 50,000 rows, it is possible that the format is not perfect.
Find the Anomalies
awk '
BEGIN {
    FS=","; printf "%-12s%8s\n", "No. of Fields", "Counts"
}
{
    # skip the header, then tally how many rows have each field count
    if (NR > 1) {
        field_count[NF]++
    }
}
END {
    for (n in field_count) {
        printf "%-12s%8d\n", n" fields", field_count[n]
    }
}
' auto-sales.csv
No. of Fields Counts
6 fields 49989
7 fields 9
9 fields 2
The result shows there are 9 rows with 7 fields and 2 rows with 9 fields. We'd better have a look at what's happening in these rows.
What's the Trouble?
awk 'BEGIN {FS=","; print "Abnormal Lines:"} NF > 6 {print NR,$0}' auto-sales.csv
Abnormal Lines:
163 GMC,Vandura G1500,1995,"Palmas De Gran Canaria, Las",$44040.73,20190
3270 Buick,LaCrosse,2008,"Villa Presidente Frei, Ñuñoa, Santiago, Chile",$20017.13,76130
14621 Chrysler,Crossfire,2007,"Hospitalet De Llobregat, L'",$37630.40,67305
20305 Aston Martin,DB9,2012,"Brgy. Nalook, kalibo",$38022.23,69482
21516 Porsche,928,1987,"Villa Presidente Frei, Ñuñoa, Santiago, Chile",$70256.79,32108
27181 Tesla,Roadster,2011,"Palmas De Gran Canaria, Las",$53713.62,12872
29971 Toyota,Solara,2006,"Hospitalet De Llobregat, L'",$47681.80,67924
34442 Mazda,MX-5,1999,"Laikit, Laikit II (Dimembe)",$64224.04,11707
41271 Mazda,Mazda3,2006,"Palmas De Gran Canaria, Las",$84869.12,31552
42110 GMC,Yukon XL 2500,2011,"Palmas De Gran Canaria, Las",$49027.72,11711
47084 Chevrolet,Equinox,2010,"Las Vegas, Santa Barbara",$57602.29,15432
It is the commas (,) in the addresses that are causing the problem. Luckily for us, the abnormal addresses are all double-quoted.
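Misparsed Row
To see the misparse concretely, push one of the abnormal rows through awk with a plain comma separator; the quoted city splits into two fields and NF becomes 7 (a quick sketch using the first abnormal row from above):
echo 'GMC,Vandura G1500,1995,"Palmas De Gran Canaria, Las",$44040.73,20190' |
awk 'BEGIN {FS=","} {print NF; print $4; print $5}'
7
"Palmas De Gran Canaria
 Las"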
FPAT
FPAT (a gawk feature) gives you the ability to define fields with a regex that matches the field contents, instead of relying on a separator.
Using FPAT
awk '
BEGIN {
    # a field is either a run of non-comma characters, or a double-quoted string
    FPAT="([^,]+)|(\"[^\"]+\")"; printf "%-12s%8s\n", "No. of Fields", "Counts"
}
{
    if (NR > 1) {
        field_count[NF]++
    }
}
END {
    for (n in field_count) {
        printf "%-12s%8d\n", n" fields", field_count[n]
    }
}
' auto-sales.csv
No. of Fields Counts
6 fields 50000
Now the trouble is gone!
FPAT
We will only use FPAT for this little summary. You might have noticed that using the regex slows things down a bit, so it's better to avoid it when dealing with large data.
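Time It
If you want to measure the slowdown on your own machine, time the same field-count pass both ways (a rough comparison; exact numbers depend on your hardware and awk build):
time awk 'BEGIN {FS=","} NR>1 {n[NF]++} END {for (k in n) print k" fields", n[k]}' auto-sales.csv
time awk 'BEGIN {FPAT="([^,]+)|(\"[^\"]+\")"} NR>1 {n[NF]++} END {for (k in n) print k" fields", n[k]}' auto-sales.csv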
2.3 Brief Summary Stats¶
Now it's time for us to make a brief summary of the data.
auto-sales-summary.awk
Save the code as auto-sales-summary.awk
BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
    printf "%-20s%15s\n", "Name", "Counts"
    for (i = 0; i < 35; i++) printf "-"
    printf "\n"
}
{
    if (NR > 1) {
        # the `!seen[key]++ && counter++` idiom bumps the counter
        # only the first time a key appears
        !car_make[$1]++ && no_car_make++
        !car_model[$1$2]++ && no_car_model++
        !model_year[$1$2$3]++ && no_model_year++
        !city[$4]++ && no_city++
        # drop the leading "$" before summing
        sales += substr($5, 2)
        mileage += $6
    }
}
END {
    printf "%-20s%15d\n", "Car Make", no_car_make
    printf "%-20s%15d\n", "Car Model", no_car_model
    printf "%-20s%15d\n", "Car Model Year", no_model_year
    printf "%-20s%15d\n", "City", no_city
    for (i = 0; i < 35; i++) printf "-"
    printf "\n\n\n"
    printf "%-20s%15s\n", "Name", "Average"
    for (i = 0; i < 35; i++) printf "-"
    printf "\n"
    printf "%-20s%15.2f\n", "Average Price", sales / (NR - 1)
    printf "%-20s%15.2f\n", "Average Mileage", mileage / (NR - 1)
}
Summary
awk -f auto-sales-summary.awk auto-sales.csv
Name Counts
-----------------------------------
Car Make 78
Car Model 1046
Car Model Year 7254
City 24647
-----------------------------------
Name Average
-----------------------------------
Average Price 54933.42
Average Mileage 44980.66
3. Our Task¶
Now let's finish the task.
auto-sales-task.awk
Save the code as auto-sales-task.awk
BEGIN {
    # set field separator to ","
    FS = ","
}
{
    if (NR > 1) {
        # key is the mileage bucket:
        # a given key represents the range
        # from key * 5000 to (key + 1) * 5000 - 1
        # for example, if key = 1, the mileage range is [5000, 9999]
        # note: we count fields from the end ($NF, $(NF-1)), so the rows
        # whose double-quoted city contains commas still parse correctly
        key = int($NF / 5000)
        # remove the dollar ($) sign and force a numeric value,
        # so the max/min comparisons below are numeric, not string-based
        price = substr($(NF-1), 2) + 0
        # record the largest key for printing purposes in the `END` part
        if (key > max_key) max_key = key
        # sum up the sales and count the rows within the mileage range
        sales[key] += price
        count[key]++
        # record the most expensive car
        if (max_price[key] < price) {
            max_price[key] = price
            max_car_model[key] = $1" "$2" "$3
        }
        # record the cheapest car
        if (min_price[key] == 0 || min_price[key] > price) {
            min_price[key] = price
            min_car_model[key] = $1" "$2" "$3
        }
    }
}
END {
    # 1. print avg, max, min prices grouped by mileage
    # 1.1 print the headers
    for (i = 0; i < 65; i++) printf "-"; printf "\n"
    printf "%20s%15s", "Mileage", "Average Price"
    printf "%15s%15s\n", "Max Price", "Min Price"
    for (i = 0; i < 65; i++) printf "-"; printf "\n"
    # 1.2 print the prices
    for (i = 0; i <= max_key; i++) {
        printf "%6s ~%6s miles", i * 5000, (i + 1) * 5000 - 1
        printf "%15s", sales[i] ? "$ " sales[i] / count[i] : ""
        printf "%15s", max_price[i] ? "$ " max_price[i] : ""
        printf "%15s\n", min_price[i] ? "$ " min_price[i] : ""
    }
    for (i = 0; i < 65; i++) printf "-"; printf "\n"
    # 2. print the most expensive and cheapest model per mileage range
    # 2.1 print the headers
    for (i = 0; i < 90; i++) printf "-"; printf "\n"
    printf "%20s%30s%40s\n", "Mileage", "Most Expensive Model", "Cheapest Model"
    for (i = 0; i < 90; i++) printf "-"; printf "\n"
    # 2.2 print the most expensive and cheapest car models
    for (i = 0; i <= max_key; i++) {
        printf "%6s ~%6s miles", i * 5000, (i + 1) * 5000 - 1
        printf "%30s%40s\n", max_car_model[i], min_car_model[i]
    }
    for (i = 0; i < 90; i++) printf "-"; printf "\n"
}
There is no need to panic: most of the lines are just printing. We only need to take a look at the commands in the middle.
First, we need a plan. We want to group the records by mileage range.
How can we assign each line to its mileage range?
We can use a key to record the data into an array. The key we use here is int(mileage / 5000), which is the floor of the mileage divided by the range interval (5000).
Then we just need to record the data into arrays, which we've already done loads of times.
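Key Check
As a quick sanity check of the bucketing, here is a small sketch (using a few sample mileages, two of them taken from the glimpse table above) that runs the key formula by hand:
awk 'BEGIN {
    split("0 4999 5000 14242 40357", m, " ")
    for (i = 1; i <= 5; i++)
        printf "mileage %5d -> key %d\n", m[i], int(m[i] / 5000)
}'
mileage     0 -> key 0
mileage  4999 -> key 0
mileage  5000 -> key 1
mileage 14242 -> key 2
mileage 40357 -> key 8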
Tip
Like in a bash script, use "#" to write your comments.
Run
awk -f auto-sales-task.awk auto-sales.csv
Output
-----------------------------------------------------------------
Mileage Average Price Max Price Min Price
-----------------------------------------------------------------
0 ~ 4999 miles
5000 ~ 9999 miles
10000 ~ 14999 miles $ 54381.6 $ 89981 $ 20013
15000 ~ 19999 miles $ 54154.6 $ 89981 $ 20028
20000 ~ 24999 miles $ 55095.8 $ 89982 $ 20000
25000 ~ 29999 miles $ 54785 $ 89964 $ 20008
30000 ~ 34999 miles $ 55028.3 $ 89991 $ 20037
35000 ~ 39999 miles $ 55340.8 $ 89985 $ 20083
40000 ~ 44999 miles $ 55064.9 $ 89975 $ 20031
45000 ~ 49999 miles $ 55261.1 $ 89993 $ 20015
50000 ~ 54999 miles $ 55009.5 $ 89996 $ 20004
55000 ~ 59999 miles $ 55013.5 $ 89966 $ 20008
60000 ~ 64999 miles $ 54686.9 $ 89970 $ 20017
65000 ~ 69999 miles $ 54650.6 $ 89998 $ 20038
70000 ~ 74999 miles $ 55116.5 $ 89972 $ 20044
75000 ~ 79999 miles $ 55465 $ 89999 $ 20017
80000 ~ 84999 miles $ 27307 $ 27307 $ 27307
-----------------------------------------------------------------
------------------------------------------------------------------------------------------
Mileage Most Expensive Model Cheapest Model
------------------------------------------------------------------------------------------
0 ~ 4999 miles
5000 ~ 9999 miles
10000 ~ 14999 miles Subaru Justy 1987 Audi Cabriolet 1997
15000 ~ 19999 miles Subaru Leone 1989 Toyota Tacoma 1995
20000 ~ 24999 miles Bentley Continental GTC 2011 GMC Vandura 3500 1994
25000 ~ 29999 miles Lexus GS 1995 Nissan Maxima 1998
30000 ~ 34999 miles Plymouth Neon 1996 Land Rover Discovery Series II 2000
35000 ~ 39999 miles Pontiac Vibe 2009 Ford F250 2012
40000 ~ 44999 miles Toyota 4Runner 1996 Mitsubishi Mirage 2000
45000 ~ 49999 miles Buick Coachbuilder 1991 Mercedes-Benz SL-Class 2003
50000 ~ 54999 miles Lexus GX 2010 Ford Explorer Sport Trac 2008
55000 ~ 59999 miles Suzuki XL-7 2007 Pontiac LeMans 1989
60000 ~ 64999 miles Porsche 911 2009 Nissan Frontier 2011
65000 ~ 69999 miles Subaru Legacy 2001 Audi A6 2000
70000 ~ 74999 miles Jeep Patriot 2012 Ford E-Series 1994
75000 ~ 79999 miles BMW 8 Series 1992 Buick LaCrosse 2008
80000 ~ 84999 miles Aptera Typ-1 2009 Aptera Typ-1 2009
------------------------------------------------------------------------------------------
4. How fast is awk?¶
By appending the data to itself multiple times, I got a csv of 5 million records.
How fast is our program with 5 million records?
Run
cat 5m-data.csv | wc -l
5000001
Run
time awk -f auto-sales-task.awk 5m-data.csv
awk -f auto-sales-task.awk 5m-data.csv 8.79s user 0.05s system 99% cpu 8.839 total
8.839s is surely not a disappointment for our task, given that our data is quite random!
Fast or Slow?
Our task is not a light one; the program is actually quite time-consuming. By contrast, the easy task from our warm-up session only takes 1.852s to go through the same 5m data.
time awk 'BEGIN {FS=","} NR>1 && !car_make[$1]++ {print $1}' 5m-data.csv
awk 'BEGIN {FS=","} NR>1 && !car_make[$1]++ {print $1}' 5m-data.csv 1.82s user 0.04s system 99% cpu 1.852 total
And if you explore further, you will find that the most time-consuming part of our program is the key calculation, which is responsible for almost half of the running time.
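Profile the Key
One rough way to see this yourself, assuming you still have 5m-data.csv around, is to time a stripped-down program that does nothing but the key calculation, and compare it with one that has an empty per-line body (the numbers will of course vary by machine):
time awk 'BEGIN {FS=","} NR>1 {key = int($NF / 5000)}' 5m-data.csv
time awk 'BEGIN {FS=","} NR>1 {}' 5m-data.csv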