
Use Regex

grep only shows its full power when combined with regular expressions (regex).

Let's continue with our test data.

1. Preparation

Raw Data

Run the following command to create the test data.

cat <<'LOGS' > tut-access.log
161.138.187.117 - - [05/Jan/2021:23:05:01 -0500] "PUT /app/main/posts HTTP/1.0" 200 4973 "http://smith.com/" "Mozilla/5.0 (Windows NT 6.1; yi-US; rv:1.9.2.20) Gecko/2011-09-10 13:36:12 Firefox/3.6.13"
83.191.216.184 - - [05/Jan/2021:23:05:39 -0500] "POST /apps/cart.jsp?appID=3885 HTTP/1.0" 200 5031 "http://www.bryant.com/terms/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_12_4) AppleWebKit/534.0 (KHTML, like Gecko) Chrome/60.0.831.0 Safari/534.0"
164.42.246.104 - - [05/Jan/2021:23:09:56 -0500] "PUT /wp-content HTTP/1.0" 200 4976 "https://flynn-cruz.com/home/" "Mozilla/5.0 (Android 2.1; Mobile; rv:45.0) Gecko/45.0 Firefox/45.0"
28.219.159.236 - - [05/Jan/2021:23:11:43 -0500] "PUT /list HTTP/1.0" 200 5010 "http://www.wright.com/home/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.1 (KHTML, like Gecko) Chrome/42.0.863.0 Safari/536.1"
189.143.182.79 - - [05/Jan/2021:23:14:06 -0500] "GET /apps/cart.jsp?appID=9015 HTTP/1.0" 200 5037 "http://www.davis-moreno.com/app/tag/home.asp" "Mozilla/5.0 (Windows CE; ti-ET; rv:1.9.1.20) Gecko/2013-12-25 17:20:54 Firefox/3.8"
28.105.221.183 - - [05/Jan/2021:23:17:22 -0500] "GET /wp-admin HTTP/1.0" 200 4947 "http://www.smith-phelps.biz/homepage/" "Mozilla/5.0 (Windows; U; Windows 95) AppleWebKit/535.24.6 (KHTML, like Gecko) Version/5.0.5 Safari/535.24.6"
86.74.0.138 - - [05/Jan/2021:23:19:53 -0500] "GET /app/main/posts HTTP/1.0" 200 5036 "http://turner-brown.com/blog/search/" "Mozilla/5.0 (Windows NT 4.0; hr-HR; rv:1.9.2.20) Gecko/2014-09-22 21:20:35 Firefox/3.6.4"
154.106.166.221 - - [05/Jan/2021:23:20:24 -0500] "GET /app/main/posts HTTP/1.0" 200 4942 "https://www.white.com/wp-content/faq/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/531.2 (KHTML, like Gecko) Chrome/30.0.872.0 Safari/531.2"
160.75.38.89 - - [05/Jan/2021:23:25:07 -0500] "GET /list HTTP/1.0" 200 5043 "http://gordon.biz/search/main/tags/privacy.asp" "Mozilla/5.0 (Windows NT 5.1; fil-PH; rv:1.9.0.20) Gecko/2019-03-26 20:49:08 Firefox/3.8"
90.1.228.91 - - [05/Jan/2021:23:25:57 -0500] "GET /wp-admin HTTP/1.0" 200 4982 "http://brady.com/tag/home/" "Mozilla/5.0 (X11; Linux i686; rv:1.9.6.20) Gecko/2012-05-30 12:45:08 Firefox/3.8"
LOGS
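
The command above uses a here-document: everything between the two delimiter words is passed to cat, and the redirection writes it to a file. A minimal sketch of the same mechanism on a small scale follows. The file name heredoc-demo.log, the DEMO delimiter, and the two log lines are made up for illustration. Quoting the delimiter keeps characters such as $ literal.

```shell
# heredoc-demo.log is a hypothetical scratch file, separate from tut-access.log.
cat <<'DEMO' > heredoc-demo.log
10.0.0.1 - - "GET / HTTP/1.0" 200 "$HOME is kept literally"
10.0.0.2 - - "GET /list HTTP/1.0" 200 "http://example.com/"
DEMO

# Count the lines written; tr strips the padding some wc implementations add.
line_count=$(wc -l < heredoc-demo.log | tr -d ' ')
echo "lines written: $line_count"
```

The same check against tut-access.log (wc -l < tut-access.log) should report 10 lines.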

2. Extract All the Web Hosts

Web Hosts

grep -oE "http[s]{0,1}://[^/]+" tut-access.log
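
To see what each part of this pattern does, here is a small sketch that runs it on two shortened sample lines instead of the full log. The sample variable and its contents are assumptions made for illustration. The pattern is written with s?, which matches the same text as [s]{0,1}.

```shell
# Two sample lines shaped like entries in tut-access.log (shortened for illustration).
sample='"PUT /list HTTP/1.0" 200 5010 "http://www.wright.com/home/" "Mozilla/5.0"
"GET /wp-admin HTTP/1.0" 200 4982 "https://flynn-cruz.com/home/" "Mozilla/5.0"'

# https?  -> "http" followed by an optional "s" (same as http[s]{0,1})
# ://     -> the literal scheme separator
# [^/]+   -> one or more characters that are not "/", so the match stops at the first path slash
# -o prints only the matched text, one match per output line; -E enables extended regex.
hosts=$(printf '%s\n' "$sample" | grep -oE 'https?://[^/]+')
printf '%s\n' "$hosts"
```

This prints http://www.wright.com and https://flynn-cruz.com. The uppercase HTTP/1.0 in the request field is not matched because grep is case-sensitive by default.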

Another option is to match the host name explicitly, label by label:

grep -oE "http[s]{0,1}://[[:alnum:]-]+(\.[[:alnum:]-]+){1,2}" tut-access.log

For this test data, both commands print the same result:

http://smith.com
http://www.bryant.com
https://flynn-cruz.com
http://www.wright.com
http://www.davis-moreno.com
http://www.smith-phelps.biz
http://turner-brown.com
https://www.white.com
http://gordon.biz
http://brady.com
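
The two patterns agree on this data, but they can differ on other input. The {1,2} interval accepts at most two ".label" groups after the first label, so a host with more labels would be cut short. A minimal sketch of the difference, using a hypothetical URL that does not appear in tut-access.log:

```shell
# Hypothetical URL with four dot-separated labels; not part of tut-access.log.
url='http://blog.eu.example.com/post/1'

# {1,2} allows at most two ".label" groups after the first label.
capped=$(printf '%s\n' "$url" | grep -oE 'https?://[[:alnum:]-]+(\.[[:alnum:]-]+){1,2}')

# + allows any number of ".label" groups.
uncapped=$(printf '%s\n' "$url" | grep -oE 'https?://[[:alnum:]-]+(\.[[:alnum:]-]+)+')

printf 'capped:   %s\nuncapped: %s\n' "$capped" "$uncapped"
```

The capped pattern yields http://blog.eu.example, while the uncapped one yields http://blog.eu.example.com. Which form is preferable depends on the host names expected in the log.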

Without www.

Now let's find the URLs that were not accessed through the www subdomain.

There are multiple ways to achieve this. One is to filter the previous result with grep -v, which prints only the lines that do not match:

grep -oE "http[s]{0,1}://[^/]+" tut-access.log | grep -v 'www'

http://smith.com
https://flynn-cruz.com
http://gordon.biz
http://brady.com
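
Filtering on the bare string www also drops any host that merely contains those letters. As one of the other ways to do this, the sketch below anchors the filter to the start of the host name. The sample URLs are assumptions made for illustration, and awwwards.example is a hypothetical host.

```shell
# Sample URLs; awwwards.example is a hypothetical host that merely contains "www".
urls='http://smith.com
http://www.bryant.com
http://awwwards.example'

# Loose filter: drops every line containing "www" anywhere.
loose=$(printf '%s\n' "$urls" | grep -v 'www')

# Stricter filter: drops only lines whose host starts with "www.".
strict=$(printf '%s\n' "$urls" | grep -vE '^https?://www\.')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose filter keeps only http://smith.com, while the stricter one also keeps http://awwwards.example. On the tutorial's test data both filters give the same four lines.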