Linux - Filters

Last update on November 13 2023 10:30:56 (UTC/GMT +8 hours)

Introduction

This session covers the most common filters on a Linux system. Commands that are designed to be used with a pipe are often called filters. Filters are small programs that each do one specific thing very efficiently, so they can be used as building blocks. Combining simple commands and filters in a long pipe lets you design elegant solutions.

cat

When between two pipes, the cat command does nothing (except putting stdin on stdout).

datasoft @ datasoft-linux ~$ tac count | cat | cat | cat |cat |cat
four
three 
two
one 
 datasoft @ datasoft-linux ~$

tee

Writing long pipes in Unix is fun, but sometimes you may want intermediate results. The tee filter puts stdin on stdout and also into a file. So tee is almost the same as cat, except that it has two identical outputs.

datasoft @ datasoft-linux ~$ tac count | tee  temp.txt | tac
one 
two
three 
four

 datasoft @ datasoft-linux ~$ cat temp.txt

four
three 
two
one 
 datasoft @ datasoft-linux ~$
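As a small sketch of the same idea, the script below saves the intermediate result of a pipe with tee while the pipe carries on. The file names count and temp.txt are created in a temporary directory purely for illustration, and the sample data is an assumption that mirrors the count file above.

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data mirroring the count file used above.
printf 'one\ntwo\nthree\nfour\n' > count

# tac reverses the lines, tee saves that intermediate result in temp.txt
# and passes it along, and wc -l counts what came through the pipe.
line_count=$(tac count | tee temp.txt | wc -l)
first_saved=$(head -n 1 temp.txt)

echo "lines passed through tee: $line_count"
echo "first line saved in temp.txt: $first_saved"
```

The file written by tee holds the reversed list, so its first line is the last line of the input.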

grep

In Linux, the grep command is used as a searching and pattern matching tool. The most common use of grep is to filter lines of text containing (or not containing) a certain string.

datasoft @ datasoft-linux ~$ cat xyz.txt
raju das
ayan roy
riju saha
dustu saha
ajoy das
 datasoft @ datasoft-linux ~$ cat xyz.txt | grep saha
riju saha
dustu saha
 datasoft @ datasoft-linux ~$

You can write this without the cat.

 datasoft @ datasoft-linux ~$ grep saha xyz.txt
riju saha
dustu saha
 datasoft @ datasoft-linux ~$ grep das xyz.txt
raju das
ajoy das
 datasoft @ datasoft-linux ~$

One of the most useful options of grep is grep -i, which filters in a case insensitive way. Because xyz.txt is written entirely in lowercase, both commands below print the same line.

datasoft @ datasoft-linux ~$ grep roy xyz.txt
ayan roy
 datasoft @ datasoft-linux ~$ grep -i roy xyz.txt
ayan roy
 datasoft @ datasoft-linux ~$
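The effect of -i is easier to see when the search string and the text differ in case. The following is a minimal sketch using made-up sample lines similar to xyz.txt; the variable names are only for illustration.

```shell
# Hypothetical sample lines, similar to xyz.txt above.
sample='raju das
ayan roy
riju saha'

# Case-sensitive search: "ROY" does not occur, so nothing is printed.
# "|| true" keeps the script going when grep exits non-zero for "no match".
plain=$(printf '%s\n' "$sample" | grep ROY || true)

# Case-insensitive search: "ROY" now matches "roy".
insensitive=$(printf '%s\n' "$sample" | grep -i ROY || true)

echo "without -i: [$plain]"
echo "with -i:    [$insensitive]"
```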

Another very useful option is grep -v which outputs lines not matching the string.

datasoft @ datasoft-linux ~$ grep -v dustu xyz.txt
raju das
ayan roy
riju saha
ajoy das
 datasoft @ datasoft-linux ~$

And of course, both options can be combined to output all lines that do not contain the string, ignoring case.

datasoft @ datasoft-linux ~$ grep -vi das xyz.txt
ayan roy
riju saha
dustu saha
datasoft @ datasoft-linux ~$

With grep -A1 one line after the result is also displayed.

datasoft @ datasoft-linux ~$ grep -A1 raju xyz.txt
raju das
ayan roy
datasoft @ datasoft-linux ~$

With grep -B1 one line before the result is also displayed.

datasoft @ datasoft-linux ~$ grep -B1 riju xyz.txt
ayan roy
riju saha
datasoft @ datasoft-linux ~$

With grep -C1 (context) one line before and one after are also displayed. All three options (-A, -B and -C) can display any number of lines (using e.g. -A2, -B4 or -C20).

datasoft @ datasoft-linux ~$ grep -C1 riju xyz.txt
ayan roy
riju saha
dustu saha
datasoft @ datasoft-linux ~$
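To illustrate a larger context count, here is a short sketch with -A2 and -B2 on sample lines that copy the contents of xyz.txt shown earlier; the variable names are assumptions made for the example.

```shell
# Sample lines copied from the xyz.txt listing above.
sample='raju das
ayan roy
riju saha
dustu saha
ajoy das'

# -A2: the matching line plus the two lines after it.
after=$(printf '%s\n' "$sample" | grep -A2 raju)

# -B2: the matching line plus the two lines before it.
before=$(printf '%s\n' "$sample" | grep -B2 ajoy)

printf '%s\n' "$after"
echo "----"
printf '%s\n' "$before"
```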

cut

The cut filter is used to cut out selected fields (columns) of each line of a file, depending on a delimiter or a count of bytes. The following code uses "cut" to filter the username and userid in the /etc/passwd file. It uses the colon as a delimiter, and selects fields 1 and 3.

datasoft @ datasoft-linux ~$ cut -d: -f1,3 /etc/passwd | tail -4
colord:113
hplip:114
pulse:115
datasoft:1000
 datasoft @ datasoft-linux ~$

When using a space as the delimiter for cut, you have to quote the space.

datasoft @ datasoft-linux ~$ cut -d" " -f1 xyz.txt
raju
ayan
riju
dustu
ajoy
 datasoft @ datasoft-linux ~$

This example uses cut to display the second to the seventh character of each line of /etc/passwd.

datasoft @ datasoft-linux ~$ cut -c2-7 /etc/passwd | tail -4
olord:
plip:x
ulse:x
atasof
 datasoft @ datasoft-linux ~$
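The two selection modes (fields with -d and -f, characters with -c) can be tried without touching /etc/passwd. This sketch uses made-up comma-separated records; the names come from xyz.txt and the numbers are invented for illustration.

```shell
# Hypothetical comma-separated records: first name, surname, number.
records='raju,das,42
ayan,roy,37'

# Fields 1 and 3, using the comma as the delimiter.
names_and_numbers=$(printf '%s\n' "$records" | cut -d, -f1,3)

# Characters 1 to 3 of every line.
first_three=$(printf '%s\n' "$records" | cut -c1-3)

printf '%s\n' "$names_and_numbers"
printf '%s\n' "$first_three"
```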

tr

You can translate characters with tr. The following command translates all occurrences of 'e' to 'E'. (xyz.txt happens to contain no 'e', so the output is identical to the input.)

datasoft @ datasoft-linux ~$ cat xyz.txt | tr 'e' 'E'
raju das
ayan roy
riju saha
dustu saha
ajoy das
datasoft @ datasoft-linux ~$

Here we set all letters to uppercase by defining two ranges.

datasoft @ datasoft-linux ~$ cat xyz.txt | tr 'a-z' 'A-Z'
RAJU DAS
AYAN ROY
RIJU SAHA
DUSTU SAHA
AJOY DAS
datasoft @ datasoft-linux ~$

Here we translate all newlines to spaces.

datasoft @ datasoft-linux ~$ cat count
one 
two
three 
four

datasoft @ datasoft-linux ~$ cat count | tr '\n' ' '
one  two three  four   datasoft @ datasoft-linux ~$

The tr -s filter can also be used to squeeze multiple occurrences of a character to one.

datasoft @ datasoft-linux ~$ cat pqr.txt
apple     mango   orange    guava   lemon
 datasoft @ datasoft-linux ~$ cat pqr.txt | tr -s ' '
apple mango orange guava lemon
 datasoft @ datasoft-linux ~$

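You can also use tr to 'encrypt' text with a simple letter substitution, in the spirit of rot13. Note that the two mappings below are arbitrary substitutions, not true rot13: because the second set is shorter than 'a-z', GNU tr repeats its last character, so several letters map to the same output and the result cannot be decoded.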

datasoft @ datasoft-linux ~$ cat count | tr 'a-z' 'khkasdkhaskdfhkahskfh'
khs 
fhk
fhsss 
dkhs

 datasoft @ datasoft-linux ~$ cat count | tr 'a-z' 'k-sa-f'
feo 
fff
frfoo 
pfff

 datasoft @ datasoft-linux ~$
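For comparison, a genuine rot13 mapping shifts every letter by 13 places, so applying it twice gives back the original text. The sketch below assumes plain ASCII letters; the sample text is arbitrary.

```shell
# True rot13: each letter maps to the one 13 places further on,
# wrapping around, for both the lowercase and uppercase ranges.
encoded=$(echo 'one two' | tr 'a-zA-Z' 'n-za-mN-ZA-M')

# Running the same translation again restores the input.
decoded=$(echo "$encoded" | tr 'a-zA-Z' 'n-za-mN-ZA-M')

echo "encoded: $encoded"
echo "decoded: $decoded"
```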

This last example uses tr -d to delete characters. (xyz.txt contains no 'e', so nothing is removed here.)

datasoft @ datasoft-linux ~$ cat xyz.txt | tr -d e
raju das
ayan roy
riju saha
dustu saha
ajoy das
 datasoft @ datasoft-linux ~$
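Since xyz.txt contains no 'e', translating or deleting that letter leaves it unchanged. A short sketch on an arbitrary sample string that does contain the letter shows the effect more clearly:

```shell
# Translate every 'e' to 'E'.
translated=$(echo 'hello there' | tr 'e' 'E')

# Delete every 'e'.
deleted=$(echo 'hello there' | tr -d 'e')

echo "$translated"
echo "$deleted"
```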

wc

The wc command counts lines, words and bytes for each file.

datasoft @ datasoft-linux ~$ wc xyz.txt
 5 10 48 xyz.txt
 datasoft @ datasoft-linux ~$ wc -l xyz.txt
5 xyz.txt
 datasoft @ datasoft-linux ~$ wc -w xyz.txt
10 xyz.txt
 datasoft @ datasoft-linux ~$ wc -c xyz.txt
48 xyz.txt
 datasoft @ datasoft-linux ~$
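When wc reads from a pipe instead of a named file, it prints only the number, which makes it convenient at the end of a pipeline. A minimal sketch on two sample lines follows; the tr -d ' ' step is only there because some wc implementations pad the number with spaces.

```shell
# Two sample lines taken from xyz.txt.
sample='raju das
ayan roy'

# Count lines, words and bytes of the piped text.
lines=$(printf '%s\n' "$sample" | wc -l | tr -d ' ')
words=$(printf '%s\n' "$sample" | wc -w | tr -d ' ')
bytes=$(printf '%s\n' "$sample" | wc -c | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes"
```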

sort

The sort filter (alphabetical sort) is used to sort lines of text files.

datasoft @ datasoft-linux ~$ cat xyz.txt
raju das
ayan roy
riju saha
dustu saha
ajoy das
 datasoft @ datasoft-linux ~$ sort xyz.txt
ajoy das
ayan roy
dustu saha
raju das
riju saha
 datasoft @ datasoft-linux ~$

But the sort filter has many options to tweak its usage. This example shows sorting different columns (column 1 or column 2).

datasoft @ datasoft-linux ~$ sort -k1 abc.txt
Bihar, Andrapradesh, 90
Burdwan, Bhubaneswar, 20
Delhi, Orrisa, 65
Goa, Gujrat, 45
Kolkata , Karnataka, 15
 datasoft @ datasoft-linux ~$ sort -k2 abc.txt
Bihar, Andrapradesh, 90
Burdwan, Bhubaneswar, 20
Goa, Gujrat, 45
Kolkata , Karnataka, 15
Delhi, Orrisa, 65
 datasoft @ datasoft-linux ~$

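The example below shows the difference between an alphabetical sort and a numerical sort (both on the third column). Note that sort splits fields on whitespace by default, and the Kolkata line has an extra space before its first comma, so its third field is "Karnataka," rather than 15. That is why it sorts last alphabetically, and first numerically (a non-numeric key counts as 0).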

datasoft @ datasoft-linux ~$ sort -k3 abc.txt
Burdwan, Bhubaneswar, 20
Goa, Gujrat, 45
Delhi, Orrisa, 65
Bihar, Andrapradesh, 90
Kolkata , Karnataka, 15
 datasoft @ datasoft-linux ~$ sort -n -k3 abc.txt
Kolkata , Karnataka, 15
Burdwan, Bhubaneswar, 20
Goa, Gujrat, 45
Delhi, Orrisa, 65
Bihar, Andrapradesh, 90
 datasoft @ datasoft-linux ~$
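For delimited data it is often more robust to tell sort what the separator is with -t and to limit the key to a single field. The following sketch assumes comma-separated records without stray spaces (adapted from abc.txt); LC_ALL=C is set only to make the ordering predictable.

```shell
# Sample records adapted from abc.txt, with the spaces removed.
records='Delhi,Orrisa,65
Kolkata,Karnataka,15
Bihar,Andrapradesh,90'

# Numeric sort on field 3 only, with the comma as field separator.
by_number=$(printf '%s\n' "$records" | LC_ALL=C sort -t, -k3,3n)

# Alphabetical sort on field 2 only.
by_second=$(printf '%s\n' "$records" | LC_ALL=C sort -t, -k2,2)

printf '%s\n' "$by_number"
echo "----"
printf '%s\n' "$by_second"
```

Writing the key as -k3,3 rather than -k3 stops the key at the end of field 3 instead of running to the end of the line.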

uniq

The uniq command omits repeated lines. It only compares adjacent lines, which is why its input is usually sorted first.

datasoft @ datasoft-linux ~$ cat abc.txt
Kolkata , Karnataka, 15
Burdwan, Bhubaneswar, 20
Goa, Gujrat, 45
Delhi, Orrisa, 65
Bihar, Andrapradesh, 90
datasoft @ datasoft-linux ~$ sort abc.txt
Bihar, Andrapradesh, 90
Burdwan, Bhubaneswar, 20
Delhi, Orrisa, 65
Goa, Gujrat, 45
Kolkata , Karnataka, 15
datasoft @ datasoft-linux ~$ sort abc.txt |uniq
Bihar, Andrapradesh, 90
Burdwan, Bhubaneswar, 20
Delhi, Orrisa, 65
Goa, Gujrat, 45
Kolkata , Karnataka, 15
datasoft @ datasoft-linux ~$

uniq can also count occurrences with the -c option.

datasoft @ datasoft-linux ~$ sort abc.txt |uniq -c
      1 Bihar, Andrapradesh, 90
      1 Burdwan, Bhubaneswar, 20
      1 Delhi, Orrisa, 65
      1 Goa, Gujrat, 45
      1 Kolkata , Karnataka, 15
 datasoft @ datasoft-linux ~$
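Because abc.txt has no repeated lines, the examples above leave it unchanged. The sketch below uses a made-up list of surnames with repeats to show what uniq and uniq -c do, and why sorting first matters.

```shell
# Hypothetical list with repeated entries.
surnames='saha
das
saha
roy
das
saha'

# uniq only removes adjacent repeats, so unsorted input is left unchanged.
unsorted_count=$(printf '%s\n' "$surnames" | uniq | wc -l)

# After sorting, repeats sit next to each other and collapse.
sorted_count=$(printf '%s\n' "$surnames" | sort | uniq | wc -l)

# -c prefixes each line with its count; sort -rn puts the largest first.
most_common=$(printf '%s\n' "$surnames" | sort | uniq -c | sort -rn | head -n 1)

echo "lines after uniq alone:  $unsorted_count"
echo "lines after sort | uniq: $sorted_count"
echo "most common entry:       $most_common"
```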

comm

Streams (or files) can be compared with comm. By default comm outputs three columns: lines only in the first file, lines only in the second file, and lines in both. Both inputs must be sorted. In this example, Ape, Cat and Dog are in both lists; Bat, Mat, Sit and Zip are only in the first file; Nest and vest are only in the second.

datasoft @ datasoft-linux ~$ cat > lebel1.txt
Ape
Bat
Cat
Dog
Mat
Sit
Zip
^C
datasoft @ datasoft-linux ~$ cat > lebel2.txt
Ape
Cat
Dog
Nest
vest
^C
datasoft @ datasoft-linux ~$ comm lebel1.txt lebel2.txt
		Ape
Bat
		Cat
		Dog
Mat
	Nest
Sit
	vest
Zip
 datasoft @ datasoft-linux ~$

The output of comm can be easier to read when outputting only a single column. The digits point out which output columns should not be displayed.

datasoft @ datasoft-linux ~$ comm -12 lebel1.txt lebel2.txt
Ape
Cat
Dog
 datasoft @ datasoft-linux ~$ comm -13 lebel1.txt lebel2.txt
Nest
vest
 datasoft @ datasoft-linux ~$ comm -23 lebel1.txt lebel2.txt
Bat
Mat
Sit
Zip
 datasoft @ datasoft-linux ~$
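Since comm expects sorted input, unsorted lists are usually passed through sort first. Here is a minimal sketch with two made-up lists; the file names first.txt and second.txt are hypothetical and live in a temporary directory.

```shell
workdir=$(mktemp -d)

# Sort both lists before comparing them; LC_ALL=C keeps sort and comm
# in agreement about the ordering.
printf 'Dog\nApe\nCat\nBat\n' | LC_ALL=C sort > "$workdir/first.txt"
printf 'Cat\nNest\nApe\n'     | LC_ALL=C sort > "$workdir/second.txt"

# -12 hides columns 1 and 2, leaving the lines common to both files.
in_both=$(LC_ALL=C comm -12 "$workdir/first.txt" "$workdir/second.txt")

# -23 leaves the lines found only in the first file.
only_first=$(LC_ALL=C comm -23 "$workdir/first.txt" "$workdir/second.txt")

# -13 leaves the lines found only in the second file.
only_second=$(LC_ALL=C comm -13 "$workdir/first.txt" "$workdir/second.txt")

rm -r "$workdir"

echo "in both:     $(echo $in_both)"
echo "only first:  $(echo $only_first)"
echo "only second: $(echo $only_second)"
```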

od

People like to work with ASCII characters, but computers store files in bytes. The example below creates a simple file, and then uses od to show the contents of the file in hexadecimal bytes.

datasoft @ datasoft-linux ~$ cat > sample.txt
ABCDEFGHIJKL
123456789101112
^C
 datasoft @ datasoft-linux ~$ od -t x1 sample.txt
0000000 41 42 43 44 45 46 47 48 49 4a 4b 4c 0a 31 32 33
0000020 34 35 36 37 38 39 31 30 31 31 31 32 0a
0000035
 datasoft @ datasoft-linux ~$

The same file can also be displayed in octal bytes.

datasoft @ datasoft-linux ~$ od -b sample.txt
0000000 101 102 103 104 105 106 107 110 111 112 113 114 012 061 062 063
0000020 064 065 066 067 070 071 061 060 061 061 061 062 012
0000035
 datasoft @ datasoft-linux ~$

And here is the file in ASCII (or backslashed) characters.

datasoft @ datasoft-linux ~$ od -c sample.txt
0000000   A   B   C   D   E   F   G   H   I   J   K   L  \n   1   2   3
0000020   4   5   6   7   8   9   1   0   1   1   1   2  \n
0000035
 datasoft @ datasoft-linux ~$
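od also reads from a pipe, and the -An option suppresses the offset column on the left, which is handy for short inputs. A small sketch on a three-byte sample:

```shell
# Dump the bytes of "AB" plus a newline, without the offset column.
hex=$(printf 'AB\n' | od -An -t x1)

# The same bytes shown as characters, with the newline as a backslash escape.
chars=$(printf 'AB\n' | od -An -c)

printf 'hex:   %s\n' "$hex"
printf 'chars: %s\n' "$chars"
```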

sed

Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).

datasoft @ datasoft-linux ~$ echo level5 | sed 's/5/42/'
level42
 datasoft @ datasoft-linux ~$ echo level5 | sed 's/level/high/'
high5
 datasoft @ datasoft-linux ~$

Add g for global replacements (all occurrences of the string per line).

datasoft @ datasoft-linux ~$ echo level5 level6 | sed 's/level/high/'
high5 level6
 datasoft @ datasoft-linux ~$ echo level5 level6 | sed 's/level/high/g'
high5 high6
datasoft @ datasoft-linux ~$

With d you can remove lines that match a pattern from a stream.

datasoft @ datasoft-linux ~$ cat > cricket.txt
Sachin Tendulkar, Maharastra
Sourav Ganguly, Kolkata
Mahendra singh Dhoni, Jharkhand
Birat Kohili, Delhi
Birendra Sewag, Delhi
Anil Kumble, Chenni
^C
 datasoft @ datasoft-linux ~$ cat cricket.txt | sed '/Delhi/d'
Sachin Tendulkar, Maharastra
Sourav Ganguly, Kolkata
Mahendra singh Dhoni, Jharkhand
Anil Kumble, Chenni
 datasoft @ datasoft-linux ~$
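Two related sed features are worth a quick sketch: -n with the p command prints only the matching lines (the opposite of d), and several commands can be chained with -e. The sample lines are taken from cricket.txt; the replacement string is invented for the example.

```shell
# Three sample lines from cricket.txt.
players='Sourav Ganguly, Kolkata
Birat Kohili, Delhi
Anil Kumble, Chenni'

# -n suppresses automatic printing; p prints only lines matching the pattern.
delhi_only=$(printf '%s\n' "$players" | sed -n '/Delhi/p')

# Chain two commands: delete the Delhi lines, then substitute in the rest.
edited=$(printf '%s\n' "$players" | sed -e '/Delhi/d' -e 's/Kolkata/Calcutta/')

printf '%s\n' "$delhi_only"
echo "----"
printf '%s\n' "$edited"
```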

pipe examples

who | wc

How many users are logged on to this system?

datasoft @ datasoft-linux ~$ who
datasoft :0           2014-08-02 10:51 (:0)
datasoft pts/0        2014-08-02 10:54 (:0)
datasoft pts/7        2014-08-02 10:57 (:0)
datasoft pts/14       2014-08-02 14:10 (:0)
 datasoft @ datasoft-linux ~$ 
 datasoft @ datasoft-linux ~$ who | wc -l
4

who | cut | sort

Display a sorted list of logged on users.

datasoft @ datasoft-linux ~$ who | cut -d' ' -f1 | sort
datasoft
datasoft
datasoft
datasoft
 datasoft @ datasoft-linux ~$

Display a sorted list of logged on users, but every user only once.

 datasoft @ datasoft-linux ~$ who | cut -d' ' -f1 | sort | uniq
datasoft
 datasoft @ datasoft-linux ~$
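The sort | uniq pair can also be written as sort -u. Because the output of who differs from machine to machine, this sketch runs both forms on made-up who-style lines; the user names alice and bob are hypothetical.

```shell
# Hypothetical who-style output; the user names are made up.
sessions='alice pts/0 2014-08-02 10:51
bob pts/1 2014-08-02 10:54
alice pts/2 2014-08-02 10:57'

# The form used above: sort, then drop adjacent repeats.
with_uniq=$(printf '%s\n' "$sessions" | cut -d' ' -f1 | sort | uniq)

# The shorter form: sort -u sorts and removes repeats in one step.
with_sort_u=$(printf '%s\n' "$sessions" | cut -d' ' -f1 | sort -u)

printf '%s\n' "$with_sort_u"
```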

grep | cut

Display a list of all bash user accounts on this computer. User accounts are explained in detail later.

datasoft @ datasoft-linux ~$ grep bash /etc/passwd
root:x:0:0:root:/root:/bin/bash
datasoft:x:1000:1000:datasoft,,,:/home/datasoft:/bin/bash
 datasoft @ datasoft-linux ~$ grep bash /etc/passwd | cut -d: -f1
root
datasoft
 datasoft @ datasoft-linux ~$

Exercise, Practice and Solution:

1. Put a sorted list of all bash users in bashusers.txt.

Code:

grep bash /etc/passwd | cut -d: -f1 | sort > bashusers.txt

2. Put a sorted list of all logged on users in onlineusers.txt.

Code:

who | cut -d' ' -f1 | sort > onlineusers.txt

3. Make a list of all filenames in /etc that contain the string samba.

Code:

ls /etc | grep samba

4. Make a sorted list of all files in /etc that contain the case insensitive string samba.

Code:

ls /etc | grep -i samba | sort

5. Look at the output of /sbin/ifconfig. Write a line that displays only ip address and the subnet mask.

Code:

/sbin/ifconfig | head -2 | grep 'inet ' | tr -s ' ' | cut -d' ' -f3,5

6. Write a line that removes all non-letters from a stream.

Code:

datasoft @ datasoft-linux ~$ cat text
This is, yes really! , a text with ?&* too many str$ange# characters ;-)
datasoft @ datasoft-linux ~$ cat text | tr -d ',!$?.*&^%#@;()-'
This is yes really a text with too many strange characters
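An alternative sketch, which does not rely on listing every unwanted character, is to delete the complement of the characters you want to keep with tr -cd. Here letters, spaces and newlines are kept, and tr -s ' ' then squeezes the doubled spaces left behind. LC_ALL=C is an assumption made so that a-z and A-Z mean plain ASCII letters.

```shell
text='This is, yes really! , a text with ?&* too many str$ange# characters ;-)'

# -c complements the set and -d deletes: everything that is not a letter,
# a space or a newline is removed. tr -s ' ' then squeezes repeated spaces.
cleaned=$(printf '%s\n' "$text" | LC_ALL=C tr -cd 'a-zA-Z \n' | tr -s ' ')

printf '[%s]\n' "$cleaned"
```

One trailing space remains, because the space before the final ";-)" is kept.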

7. Write a line that receives a text file, and outputs all words on a separate line.

Code:

datasoft @ datasoft-linux ~$ cat text2
it is very cold today without the sun
datasoft @ datasoft-linux ~$ cat text2 | tr ' ' '\n'
it
is
very
cold
today
without
the
sun
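If the text contains runs of several spaces, the command above produces empty lines. A possible refinement, sketched here on an arbitrary sample string, is to add -s so that tr squeezes the repeated newlines after translating:

```shell
# Translate spaces to newlines, then squeeze repeated newlines into one.
words=$(echo 'it is  very   cold' | tr -s ' ' '\n')

printf '%s\n' "$words"
```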

8. Write a spell checker on the command line. (There may be a dictionary in /usr/share/dict/ .)

Code:

datasoft @ datasoft-linux ~$ echo "The zun is shining today" > text
datasoft @ datasoft-linux ~$ cat > DICT
is
shining
sun
the
today
datasoft @ datasoft-linux ~$ cat text | tr 'A-Z ' 'a-z\n' | sort | uniq | comm -23 - DICT
zun

You could also add the solution from question number 6 to remove non-letters, and tr -s ' ' to remove redundant spaces.
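Putting those pieces together, one possible version of the spell checker is sketched below. The dictionary is a tiny made-up word list stored in a temporary file (the name dict is hypothetical), and LC_ALL=C is assumed so that tr, sort and comm agree on character ranges and ordering.

```shell
workdir=$(mktemp -d)

# A tiny hypothetical dictionary: one lowercase word per line, sorted.
printf 'is\nshining\nsun\nthe\ntoday\nyes\n' > "$workdir/dict"

misspelled=$(echo 'The zun is shining today, yes!' |
  LC_ALL=C tr -cd 'a-zA-Z \n' |      # drop everything except letters and spaces
  tr 'A-Z' 'a-z' |                   # lowercase, to match the dictionary
  tr -s ' ' '\n' |                   # one word per line
  LC_ALL=C sort -u |                 # sorted, each word once
  LC_ALL=C comm -23 - "$workdir/dict")   # keep words missing from the dictionary

rm -r "$workdir"

echo "not in dictionary: $misspelled"
```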
