How to Use the Command Line
I am a cell biologist; I perform experiments in a laboratory to answer scientific questions. Because of the coronavirus outbreak our laboratory has been closed, so I am spending all my time at home. I have decided to write some tutorials on bioinformatics data analysis so that I can come back to them whenever I need to. These tutorials might be useful for other biologists as well.
If you have a Mac, search for ‘Terminal’ on your computer. When you open it, a window appears where you can type commands; the computer carries out each command and prints the result. If you are using a Windows machine, you can google “how to use command line in windows”; I think you will be able to figure it out.
Once the terminal is open, you are controlling your computer from the command line. First we are going to see how to navigate between the folders (directories) on the computer. The first thing to check is where we are. To get your current location, type the command ‘pwd’ and the computer will print it. The location you start in after opening the terminal is your ‘home directory’; please remember this term, as you will need it later.
pwd
output
/Users/biplab
After starting the terminal and entering the ‘pwd’ command I get the path to my home directory, which is the working directory when the terminal opens. Your location in the computer at any given time is called the current (or working) directory; ‘pwd’ means ‘print working directory’. Now, what is a path? The directories on a computer are arranged in a hierarchy that starts at the ‘root directory’. In the output of ‘pwd’, ‘/Users/biplab’, the ‘/’ at the beginning denotes the root directory, and each later ‘/’ separates a directory from the one inside it: ‘biplab’ is inside ‘Users’, which is inside the root directory.
Now we would like to check which files and directories are inside the home directory. We can do that with the command ‘ls’.
ls
output
Applications Movies
Desktop Music
Documents Pictures
Downloads Public
Dropbox (Partners HealthCare) opt
Library
The output of the ‘ls’ command gives the names of the directories and files in my working directory. Here I can see that I have, among others, the Applications, Desktop and Documents directories.
We can move to any directory we want. Let’s say we would like to go to the Documents directory. To change directory we use the command ‘cd’, which means ‘change directory’. So far we have typed commands without giving them any input. This time we need to give input to the ‘cd’ command; the input to a command is called an ‘argument’ (remember this term). Arguments are separated from the command, and from each other, by spaces. The argument to ‘cd’ is the name of, or the path to, the directory where we would like to go. To go to the Documents directory, I type ‘Documents’ after ‘cd’ and a space. If you do not put any argument after ‘cd’, you will go back to your home directory.
cd Documents
This command prints nothing if the directory name or path is correct. If you give an incorrect name as the argument you will get an error message saying ‘no such file or directory’. The above command changes my working directory to Documents. Again, you can check the path to the working directory with ‘pwd’, and use ‘ls’ to see which files and directories are in your current location.
Now that I am in the Documents directory, I can check what it contains with the ‘ls’ command.
ls
output
AB.pdf Presentation Tutorial
ACMB.pdf ResearchArticles Writing
MATLAB ResearchData megacc
MEGA X Scripts
Let’s say I would like to go to the ResearchData directory; I can again use ‘cd’ with the name of the directory (‘cd ResearchData’). Now I am in ResearchData, but I would like to go one directory up, back to Documents. In this case ‘cd Documents’ gives an error, because ‘cd’ looks for a directory called Documents inside ResearchData. Instead of ‘Documents’ you need to use ‘..’, which stands for the directory one level above the current one, here Documents.
cd ..
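To practise moving around without touching your real folders, here is a small sketch that builds a throwaway directory tree and walks through it. The names (cli_demo, Documents, ResearchData) are only examples, and ‘mktemp’ and ‘mkdir’ are two extra commands, not covered above, that I use here to create the practice directories.

```shell
# Create a scratch directory so nothing in your real folders is touched.
# "cli_demo" and the folder names below are made-up examples.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
mkdir -p "$scratch/Documents/ResearchData"

# An absolute path starts with "/" and works from anywhere.
cd "$scratch/Documents/ResearchData"
pwd                 # prints a path ending in /Documents/ResearchData

cd ..               # one level up: now in Documents
ls                  # lists ResearchData

# A relative path is looked up inside the working directory.
cd ResearchData
cd ../..            # two levels up: back in the scratch directory
here=$(pwd)
echo "$here"
```

The difference between ‘cd ResearchData’ and ‘cd /full/path/to/ResearchData’ is only where the lookup starts: the first depends on where you currently are, the second does not.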
Now we are able to move around our computer’s directories. In my Documents directory I have two pdf files, and I would like to copy one of them, AB.pdf, to the Tutorial directory. I can use the ‘cp’ command to copy files (copying a whole directory needs the extra option ‘-R’). ‘cp’ needs two arguments: first the name of (or path to) the file, then the name of (or path to) the directory where you would like to copy it.
cp AB.pdf Tutorial
Now we would like to change the name of the file AB.pdf to bioinformatics.pdf. We can use the ‘mv’ command, which renames a file or moves it from one directory to another.
mv AB.pdf bioinformatics.pdf
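The next thing is deleting files. We can use the ‘rm’ command to remove a file from the computer. Be careful: ‘rm’ deletes the file immediately and does not move it to the Trash. Since AB.pdf was renamed in the previous step, here I remove the copy we placed in the Tutorial directory.
rm Tutorial/AB.pdf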
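The copy, rename and delete steps can be tried safely in a scratch directory. This is only a sketch: the file AB.pdf created here is a small text stand-in rather than a real pdf, and ‘mktemp’, ‘mkdir’ and ‘echo’ are helper commands I use just to set up the example.

```shell
# Work in a throwaway directory (the name "cli_demo" is arbitrary).
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
cd "$scratch"
mkdir Tutorial
echo "placeholder text" > AB.pdf   # stand-in file, not a real PDF

cp AB.pdf Tutorial                 # copy: the original stays where it is
mv AB.pdf bioinformatics.pdf       # rename: AB.pdf no longer exists here
ls                                 # shows Tutorial and bioinformatics.pdf

rm Tutorial/AB.pdf                 # delete the copy; there is no undo
ls Tutorial                        # prints nothing, the directory is empty
```

If you are nervous about deleting the wrong file, ‘rm -i’ asks for confirmation before each removal.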
I have a file called lamina.bed in the Tutorial directory. After moving into that directory with ‘cd Tutorial’, I can check the content of the file with the ‘cat’ command, which prints the whole file on the screen.
cat lamina.bed
If the file is very big, it is a good idea to use the ‘less’ command instead, which shows the content one screen at a time.
less lamina.bed
Press Enter to move down one line or the space bar to move down one screen. Press ‘q’ to go back to your command line.
Most of the time I use the ‘head’ command to check the first few lines of a file and the ‘tail’ command to check the last few lines. By default each prints 10 lines.
head lamina.bed
output
chrom start end value
chr1 11323785 11617177 0.86217008797654
chr1 12645605 13926923 0.934891485809683
chr1 14750216 15119039 0.945945945945946
chr1 18102157 19080189 0.895174708818636
chr1 29491029 30934636 0.892526250772082
chr1 33716472 35395979 0.911901081916538
chr1 36712462 37685238 0.95655951346655
chr1 37838094 38031209 0.944206008583691
chr1 38272060 39078902 0.940932642487047
tail lamina.bed
output
chrX 135187116 135597436 0.940133037694013
chrX 135860243 138648904 0.888570404872805
chrX 138846031 148357359 0.83263937080731
chrX 148454624 148844607 0.784702549575071
chrX 153719866 153904495 0.772413793103448
chrY 2940166 7172793 0.808911201757138
chrY 7880008 13098461 0.79400260756193
chrY 13556427 13843364 0.892655367231638
chrY 14113371 15137286 0.93640897755611
chrY 15475619 19472504 0.813842482100239
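Both commands accept the ‘-n’ option to choose how many lines to print. The sketch below uses a tiny made-up file, demo.bed, as a stand-in for lamina.bed; the coordinates and values in it are invented for illustration.

```shell
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
cd "$scratch"

# A tiny tab-separated stand-in for lamina.bed (values are made up).
printf 'chrom\tstart\tend\tvalue\n'  > demo.bed
printf 'chr1\t100\t200\t0.9\n'      >> demo.bed
printf 'chr1\t300\t400\t0.8\n'      >> demo.bed
printf 'chr2\t150\t250\t0.7\n'      >> demo.bed
printf 'chrX\t500\t600\t0.6\n'      >> demo.bed

first_two=$(head -n 2 demo.bed)    # header line plus the first data row
last_one=$(tail -n 1 demo.bed)     # only the final row
no_header=$(tail -n +2 demo.bed)   # everything from line 2 onward

echo "$first_two"
echo "$last_one"
echo "$no_header"
```

The ‘tail -n +2’ form is handy for files like this one, where the first line is a header that you want to skip.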
Let’s do something more useful. I would like to know how many lines there are in lamina.bed. I can use the ‘wc’ command with the option ‘-l’ to count the number of lines in the file.
wc -l lamina.bed
1345 lamina.bed
The lamina.bed file has 1345 lines.
Sometimes we need specific columns of a data file. We can use the ‘cut’ command with the ‘-f’ option to select them; after ‘-f’ we give the numbers of the columns that we would like to select.
cut -f 1,2 lamina.bed
This command outputs the first and second columns of the lamina.bed file. By default ‘cut’ expects the columns to be separated by tabs. Sometimes columns are separated by commas instead; for that case ‘cut’ has the ‘-d’ option to specify the separator. I have a comma-separated file with four columns. For that file I need to use -d “,”
cut -d "," -f 1,2 lamin_head.bed
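Here is a small sketch that shows both cases side by side. The two files, demo.bed (tab-separated) and demo.csv (comma-separated), are made-up examples with invented values.

```shell
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
cd "$scratch"

# The same two rows, once tab-separated and once comma-separated.
printf 'chr1\t100\t200\t0.9\nchr2\t150\t250\t0.7\n' > demo.bed
printf 'chr1,100,200,0.9\nchr2,150,250,0.7\n'       > demo.csv

# Tab is the default separator, so no -d is needed: columns 1 and 4.
tab_cols=$(cut -f 1,4 demo.bed)

# For commas we name the separator with -d; "2-3" selects a range.
csv_cols=$(cut -d "," -f 2-3 demo.csv)

echo "$tab_cols"
echo "$csv_cols"
```

Note that ‘cut’ keeps the original separator in its output, so the second command prints the start and end values still joined by a comma.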
If we need to sort the data based on the values in one column, we can use the ‘sort’ command with the ‘-k’ option.
sort -k2n lamina.bed
output
chrom start end value
chr20 11619 194068 0.980891719745223
chr8 696699 1664048 0.736240913811007
chr9 808574 2755227 0.904992729035385
chr18 857086 2083983 0.93071000855432
chr10 1193532 3050735 0.827242524916944
chr20 1406160 1690748 0.96989966555184
chr5 1715774 5463587 0.817772511848341
chr2 1829412 3344344 0.865861027190332
chr20 1876704 2021710 0.951351351351351
chr8 1971125 6247858 0.872897196261682
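In the above command, ‘-k2n’ means sort on column 2 and treat it as a number. We can see that the output is sorted by the numbers in column 2. The header line comes first because its second field, ‘start’, is not a number and is therefore treated as 0.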
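To see why the ‘n’ matters, compare a text sort with a numeric sort on the same column. The file starts.txt and its values are made up for this sketch.

```shell
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
cd "$scratch"

# Three rows whose second column sorts differently as text and as numbers.
printf 'chr1\t900\nchr2\t1000\nchr3\t85\n' > starts.txt

# Without "n" the column is compared character by character,
# so 1000 comes before 85 because "1" sorts before "8".
text_first=$(sort -k2 starts.txt | head -n 1)

# With "n" the column is compared as a number, so 85 comes first.
num_first=$(sort -k2n starts.txt | head -n 1)

echo "text sort starts with:    $text_first"
echo "numeric sort starts with: $num_first"
```

As far as I understand it, ‘-k2n’ starts the sort key at column 2 and lets it run to the end of the line; writing ‘-k2,2n’ limits the key to column 2 only, which can matter when several rows share the same value.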
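One advantage of the command line is that you can send the output of one command to another command by using the pipe character ‘|’. The first column of the lamina.bed file contains the chromosome name; we can select the first column with ‘cut’ and then pipe it to the ‘uniq’ command to see which chromosomes occur in the file. Note that ‘uniq’ only removes repeated lines that are next to each other, so this works because the rows of each chromosome are grouped together in the file.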
cut -f 1 lamina.bed | uniq
So far we have printed all of our output on the screen. You can redirect the output to a file by using ‘>’; if the file already exists, it is overwritten. Let’s say we would like to save the output of the above command in a file called chrome_name.txt.
cut -f 1 lamina.bed | uniq > chrome_name.txt
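If the rows of a file are not grouped by chromosome, ‘uniq’ on its own will not remove the repeats, and a ‘sort’ step is needed first. The sketch below shows this on a made-up file, chroms.txt, in which the names are deliberately interleaved.

```shell
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
cd "$scratch"

# Chromosome names that are deliberately NOT grouped together.
printf 'chr1\nchr2\nchr1\nchr2\nchr1\n' > chroms.txt

# uniq alone only drops repeats on neighbouring lines: all 5 lines remain.
naive=$(uniq chroms.txt | wc -l | tr -d ' ')
echo "lines after uniq alone: $naive"

# Sorting first puts equal names next to each other, so uniq works.
sort chroms.txt | uniq > unique_chroms.txt
cat unique_chroms.txt

# uniq -c also reports how many times each name occurred.
sort chroms.txt | uniq -c
```

Each ‘|’ hands the output of the command on its left to the command on its right, so pipelines like these can be extended one step at a time.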
Sometimes we need to replace a specific character in a file with another character. For example, the lamin_head.bed file is comma-separated, and I would like to replace each comma with a space. We can use the ‘tr’ command for this. The first argument to ‘tr’ is the character we would like to replace, and the second argument is the character we would like to replace it with. In our case the first argument is “,” and the second is a space, given inside quotes: “ ”. The ‘tr’ command cannot take a file name as an argument; it only reads what is piped into it, so I print the content of the file with ‘cat’ and pipe it to ‘tr’.
cat lamin_head.bed | tr "," " "
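An alternative to ‘cat file | tr’ is to feed the file to ‘tr’ with input redirection, ‘<’, which is the counterpart of ‘>’. The sketch below converts commas to tabs and saves the result; demo.csv and demo.tsv are made-up file names.

```shell
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
cd "$scratch"

printf 'chr1,100,200\nchr2,150,250\n' > demo.csv

# "<" sends the file to tr as its input; ">" saves what tr prints.
# tr itself understands "\t" as a tab character.
tr "," "\t" < demo.csv > demo.tsv

cat demo.tsv
```

The original file is left untouched; only the new file contains the replaced characters.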
Wildcards
I often use wildcard characters, which act as placeholders for other characters. Most of the time I use the star wildcard, *, which stands for any number of characters, including none. *.txt means any name that ends with “.txt”; similarly, abc* means any name that starts with “abc”. Let’s see one example of using a wildcard. In my working directory I have a lot of different types of files, such as pdf, doc, jpg and fasta files. I only want a list of the pdf files, so I can use the wildcard *.pdf.
ls *.pdf
AB.pdf ACMB.pdf
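The sketch below creates a few empty files and tries two patterns on them. The file names are invented, and ‘touch’ is an extra command I use here only to create the empty files.

```shell
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
cd "$scratch"

# Empty example files of several types.
touch AB.pdf ACMB.pdf notes.txt figure.jpg abc abc1.txt

ls *.pdf     # lists AB.pdf and ACMB.pdf
ls abc*      # lists abc and abc1.txt: the star can also match nothing

pdf_count=$(ls *.pdf | wc -l | tr -d ' ')
abc_count=$(ls abc* | wc -l | tr -d ' ')
echo "pdf files: $pdf_count, names starting with abc: $abc_count"
```

The shell replaces the pattern with the matching file names before the command runs, so wildcards work with ‘cp’, ‘mv’, ‘rm’ and the other commands as well. That is also a reason to test a pattern with ‘ls’ before using it with ‘rm’.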
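Bioinformatic analysis often involves downloading data from the web. We can use the ‘wget’ or ‘curl’ command to download a file; both take the web link (URL) of the file as an argument. Note that ‘curl’ prints the downloaded data to the screen unless you add the ‘-O’ option, which saves it under its original file name.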
wget https://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_R63-1-1_20100105.tgz
This will download the file S288C_reference_genome_R63-1-1_20100105.tgz into your working directory. It is a genome sequence file from the Saccharomyces Genome Database (SGD).
Files of genomic data are very big, so they are usually compressed to reduce their size. One common way is gzip: a file name ending in ‘.gz’ means the file was compressed with gzip, and you need the ‘gunzip’ command to decompress it. Sometimes multiple files are archived and compressed together into a single file called a ‘tarball’, whose name has the extension ‘.tgz’ or ‘.tar.gz’. The file we downloaded from the SGD website is a tarball compressed with gzip. It can be extracted with the following command.
tar -zxvf S288C_reference_genome_R63-1-1_20100105.tgz
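The above command extracts the file we downloaded, and you will see a directory named ‘S288C_reference_genome_R63-1-1_20100105’. In the options, ‘z’ means decompress with gzip, ‘x’ means extract, ‘v’ means list each file as it is extracted, and ‘f’ means the name of the archive follows. You can go into that directory and check the files.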
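To get a feel for these commands without downloading anything, you can build a small tarball yourself and take it apart again. Everything in this sketch (genome_demo, a.fasta, notes.txt) is a made-up example; ‘rm -r’, which removes a directory and its contents, is used only to clean up before extracting.

```shell
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
cd "$scratch"

# A single file: gzip replaces notes.txt with notes.txt.gz,
# and gunzip turns it back into notes.txt.
echo "hello" > notes.txt
gzip notes.txt
gunzip notes.txt.gz

# A directory: tar bundles it into one file, z compresses it with gzip.
mkdir genome_demo
echo ">seq1" > genome_demo/a.fasta
tar -czf genome_demo.tgz genome_demo   # c = create
rm -r genome_demo                      # remove the original directory

listing=$(tar -tzf genome_demo.tgz)    # t = list contents, no extraction
echo "$listing"

tar -xzf genome_demo.tgz               # x = extract
ls genome_demo                         # a.fasta is back
```

Listing a tarball with ‘t’ before extracting it is a quick way to see what it will create in your working directory.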
Another command I use a lot is ‘grep’. You can look for a pattern in a file with ‘grep’, and it will output every line that matches the pattern. The ‘S288C_reference_genome_R63-1-1_20100105’ directory contains multi-fasta files. We know that the name of each sequence in a fasta file starts with ‘>’. If we would like to extract the names of the sequences in a multi-fasta file, we can use ‘grep’ with ‘>’ as the pattern and the name of the file as the second argument. The quotes around ‘>’ are essential: without them the shell would treat > as output redirection and overwrite the file.
grep '>' S288C_reference_sequence_R63-1-1_20100105.fsa
This will print the sequence names in the fasta file S288C_reference_sequence_R63-1-1_20100105.fsa on the screen. Sometimes I need to print the lines that do not contain the pattern; in that case I use ‘grep’ with the option ‘-v’.
grep -v '>' S288C_reference_sequence_R63-1-1_20100105.fsa
This will print the sequences in the file without the names.
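Two variations I find useful are counting the sequences and printing the names without the leading ‘>’. The sketch below uses a tiny made-up fasta file, demo.fsa, with invented sequences.

```shell
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
cd "$scratch"

# A made-up multi-fasta file with two short sequences.
printf '>chrI\nACGT\nTTGA\n>chrII\nGGCC\n' > demo.fsa

# -c counts the matching lines instead of printing them.
n_seqs=$(grep -c '>' demo.fsa)

# tr -d deletes a character: here the ">" in front of each name.
names=$(grep '>' demo.fsa | tr -d '>')

# -v keeps the lines that do NOT match: the sequence lines.
n_seq_lines=$(grep -v '>' demo.fsa | wc -l | tr -d ' ')

echo "sequences: $n_seqs"
echo "$names"
echo "sequence lines: $n_seq_lines"
```

In every one of these commands the ‘>’ is inside quotes, so the shell passes it to ‘grep’ or ‘tr’ as an ordinary character.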
So far we have worked with single files. In reality I often need to run a command on multiple files and store the output in multiple files. You can write a for loop to do this kind of work. Let’s write a simple for loop in the shell. The ‘S288C_reference_genome_R63-1-1_20100105’ directory has multiple fasta files, and we would like to print the head of every file. Below is the for loop to do that.
for i in *.fasta
do
    head "$i"
done
I use similar for loops to perform a lot of tasks where multiple files need to be processed by the same command.
These are very simple command line applications; you can do much more on the command line. Most of the commands I talked about have various options, and you can google each command to find out more. I hope this will help you get started with the command line.
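As an example of storing the output in multiple files, the loop below writes the sequence names of each fasta file into its own text file. This is a sketch on two made-up files, one.fasta and two.fasta; the ‘_names.txt’ ending for the output files is my own choice.

```shell
scratch=$(mktemp -d "${TMPDIR:-/tmp}/cli_demo.XXXXXX")
cd "$scratch"

# Two made-up fasta files to loop over.
printf '>a\nAC\n>b\nGT\n' > one.fasta
printf '>c\nTT\n'         > two.fasta

for f in *.fasta
do
    # ${f%.fasta} is the file name with the ".fasta" ending removed,
    # so one.fasta gives one_names.txt.
    out="${f%.fasta}_names.txt"
    grep '>' "$f" > "$out"
    echo "wrote $out"
done
```

Quoting "$f" and "$out" keeps the loop working even when a file name contains spaces.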