Count unique lines in a file in Linux
Counting unique lines in a file is a common task in Linux, and there are a number of different tools and methods that can be used to perform this operation. In general, the appropriate method depends on the specific requirements and constraints of the task, such as the size of the input file, performance and memory requirements, and the format and content of the data.
Count the number of unique lines in a file using the sort and uniq commands
One way to count the number of unique lines in a file in Linux is to use the sort and uniq commands. The sort command sorts the input data into a specified order, and the uniq command filters out duplicate lines from the sorted data.
The data.txt file contains the following content for this article's examples.
arg1
arg2
arg3
arg2
arg2
arg1
To count the number of unique lines in a file, you can use the following command:
sort data.txt | uniq -c | wc -l
Output:
3
This command sorts the data.txt file in ascending order (the default) and pipes the output to the uniq command. The uniq command filters out duplicate lines from the sorted data and, because of the -c option, prefixes each remaining line with the number of times it appears in the input.
The output is then piped to the wc command, which counts the number of lines in the input and prints the results to the terminal.
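To see what the middle of the pipeline produces, you can run the first two commands on their own. With the sample data.txt shown above, the sorted and counted output looks like this (the exact spacing of the counts may vary between implementations):

sort data.txt | uniq -c

Output:

      2 arg1
      3 arg2
      1 arg3

The wc -l command then simply counts these three lines.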
The sort and uniq commands are simple and efficient tools for counting the number of unique lines in a file and are suitable for most common scenarios. However, they have some limitations and disadvantages, such as requiring the input data to be sorted, which can be slow for large files and use a lot of memory.
Also, the uniq command removes only adjacent duplicate lines, so the input must be sorted first; on unsorted data it can report more lines than are actually unique.
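For example, running uniq directly on the unsorted sample file collapses only the two adjacent arg2 lines, so the count comes out wrong:

uniq data.txt | wc -l

Output:

5

If you only need the number of unique lines and not the per-line counts, sort -u data.txt | wc -l gives the same result as the full pipeline by letting sort drop the duplicates itself.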
Count the number of unique lines in a file using the awk command
Another way to count the number of unique lines in a file in Linux is to use the awk command, which is a powerful text-processing tool that can perform a wide range of operations on text files. awk provides built-in associative arrays that can be used to store and count the number of occurrences of each line in the input.
For example, to count the number of unique lines in a file named data.txt, you can use the following command:
awk '!a[$0]++' data.txt | wc -l
Output:
3
This command uses awk to read the data.txt file and applies the condition !a[$0]++ to each input line. The expression uses the whole line, $0, as a key into the associative array a. The post-increment a[$0]++ returns the current count for that line (0 the first time the line is seen) and then increments it, so the array ends up recording how many times each line occurs.
The leading ! negates that returned value, so the condition is true only when the count is still 0, that is, the first time each distinct line is encountered. As a result, awk prints each unique line exactly once and suppresses all later duplicates.
The output is then piped to the wc command, which counts the number of lines in the input and prints the results to the terminal.
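Running the awk part of the pipeline on its own makes this behavior visible: each distinct line is printed once, in the order of its first appearance, without any sorting:

awk '!a[$0]++' data.txt

Output:

arg1
arg2
arg3

Unlike the sort and uniq approach, this preserves the original order of the lines, but it keeps every unique line in memory for the lifetime of the command.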
The awk command also provides several options and features that you can use to control its behavior and customize its output. For example, you can use the -F option to specify a different field separator or use the -v option to define variables that you can use in your scripts.
You can also use the printf function to format the output of the awk command in various ways.
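As a small illustration of the -v option (this variation is not part of the pipelines above), a variable defined on the command line can be combined with the same counting technique and printed in the END block:

awk -v label="unique lines" '!a[$0]++ {n++} END {print label ": " n}' data.txt

Output:

unique lines: 3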
Here is an example of a more complex awk script that uses these features to count the number of unique values of the first field in a file named data.txt, where each line is treated as a comma-delimited list of fields:
awk -F, '{a[$1]++} END {for (i in a) { printf "%s,%d\n", i, a[i] }}' data.txt | wc -l
Output:
3
The script uses the -F option to set the , character as the field separator. The {a[$1]++} action defines an associative array a that is keyed by the first field of each line; as awk reads each line of the data.txt file, it increments the array entry for that field's value. This effectively counts the number of times each unique first field appears in the input.
The END block of the script runs after all the input lines have been read. It iterates over the array with a for loop and uses the printf function to format and print each unique field together with its count.
The output is then piped to the wc command, which counts the number of lines in the input and prints the results to the terminal.
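To make the field-based variant more concrete, here is a sketch using a hypothetical comma-delimited file; the file name fruits.csv and its contents are illustrative and do not come from the original data.txt:

apple,10
banana,5
apple,7
cherry,3

Running the script without the final wc -l shows each unique first field and its count (the order of the lines is not guaranteed, because for (i in a) iterates in an unspecified order):

awk -F, '{a[$1]++} END {for (i in a) { printf "%s,%d\n", i, a[i] }}' fruits.csv

Output:

apple,2
banana,1
cherry,1

Piping this output through wc -l would report 3, the number of unique first fields.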
In summary, there are several ways to count the number of unique lines in a file in Linux, and the appropriate method will depend on the specific requirements and constraints of the task. The sort and uniq commands are simple and effective tools for counting unique lines, while the awk command provides more advanced features and options to customize the output and behavior of the script.