Bioinformatics File Formats
Bioinformatics analysis involves manipulating various file formats, and manipulating these files requires a solid understanding of how each format is structured. Here I will try to explain the different types of file formats.
FASTA
A FASTA file is a text file that stores either nucleotide or amino acid sequences, of DNA or proteins respectively.
Each record in a FASTA file has two parts:
Line 1: ‘>’ followed by the sequence ID.
Line 2: One or more lines of nucleotide or amino acid sequence.
For example:
>gi|425153|gb|L26238.1|MUSHOME Mus domesticus (lbx) homeodomain mRNA, partial cds
CCATTTCAACAAGTACCTGACCAGGGCTCGGCGAGTGGAAGTTGCCGCTATTCTCGAGCTCAACGAAACT
CAAGTGAAAATT
Sequence ID: >gi|425153|gb|L26238.1|MUSHOME Mus domesticus (lbx) homeodomain mRNA, partial cds
Sequence: CCATTTCAACAAGTACCTGACCAGGGCTCGGCGAGTGGAAGTTGCCGCTATTCTCGAGCTCAACGAAACTCAAGTGAAAATT
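The two-part record structure described above can be read with a few lines of Python. This is only a minimal sketch: the function name read_fasta is my own, and it assumes a well-formed file in which every record starts with a ‘>’ line.

```python
import io


def read_fasta(handle):
    """Yield (sequence_id, sequence) tuples from an open FASTA file handle.

    Hypothetical helper for illustration; assumes every record starts
    with a '>' header line.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# The example record from this section, held in memory instead of a file.
example = io.StringIO(
    ">gi|425153|gb|L26238.1|MUSHOME Mus domesticus (lbx) homeodomain mRNA, partial cds\n"
    "CCATTTCAACAAGTACCTGACCAGGGCTCGGCGAGTGGAAGTTGCCGCTATTCTCGAGCTCAACGAAACT\n"
    "CAAGTGAAAATT\n"
)
records = list(read_fasta(example))
for seq_id, sequence in records:
    print(seq_id, len(sequence))
```

For real work a library such as Biopython is usually a better choice, but the sketch shows that the sequence lines of one record are simply joined together.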
FASTQ File
FASTQ files contain nucleotide sequences with a quality score associated with each nucleotide. Each record in a FASTQ file consists of four lines.
Line 1: Starts with “@” followed by the sequence ID
Line 2: Sequence data
Line 3: Starts with “+”
Line 4: Quality score for each base in Line 2; each score is encoded as a single ASCII character.
Example FASTQ record:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Sequence ID: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
Sequence: GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
Line 3: '+'
Quality score: !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
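The quality score expresses the probability of an error in base calling: a high score means a low probability of error. Scores are on the Phred scale, Q = -10 * log10(P), where P is the error probability, so Q = 20 corresponds to a 1 in 100 chance that the base call is wrong. The example above uses the Phred+33 encoding, in which the score is the ASCII code of the character minus 33 (so '!' is a score of 0).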
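To make the link between the quality characters and error probabilities concrete, here is a small sketch that decodes the quality line from the example. It assumes the Phred+33 encoding (ASCII code minus 33); the function names are my own and not part of any library.

```python
def phred33_scores(quality):
    """Convert a Phred+33 encoded quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Line 4 of the example FASTQ record.
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred33_scores(quality)

# Show the first few bases: character, score, and error probability.
for ch, q in list(zip(quality, scores))[:5]:
    print(ch, q, round(error_probability(q), 3))
```

Older Illumina pipelines used a Phred+64 offset instead, so the offset of 33 is an assumption you should check for old data.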
GFF files
A GFF file is a nine-column, tab-delimited file that describes the features in a genome. Each line in a GFF file represents one feature, such as a gene, transcript, exon, or promoter.
Column 1: Chromosome name or ID
Column 2: Source, i.e. the program or database that generated the feature
Column 3: Feature type, such as gene, exon, or coding sequence (CDS)
Column 4: Start position of the feature on the chromosome (1-based)
Column 5: End position of the feature on the chromosome (inclusive)
Column 6: Score of the feature (“.” if there is no score)
Column 7: Strand, + (forward) or - (reverse)
Column 8: Frame (phase), one of the integers 0, 1, or 2 for CDS features: the number of bases to skip from the start of the feature to reach the first base of the next codon (“.” for other feature types)
Column 9: Attributes of the feature, such as ID, Name, Alias, Parent, Target, Gap, Ontology term etc. Attributes are separated by “;”
Example:
chrI SGD CDS 538 792 . + 0 Parent=YAL068W-A_mRNA;Name=YAL068W-A_CDS;orf_classification=Dubious
Column 1: chrI - Chromosome name
Column 2: SGD - Source of the GFF file, in this case the Saccharomyces Genome Database (SGD)
Column 3: CDS - Feature type is coding sequence (CDS)
Column 4: 538 - Feature starts at position 538 of the chromosome
Column 5: 792 - Feature ends at position 792 of the chromosome
Column 6: . - No score for the feature
Column 7: + - Feature is on the forward strand
Column 8: 0 - Codon starts at the first base of the feature
Column 9: Parent=YAL068W-A_mRNA;Name=YAL068W-A_CDS;orf_classification=Dubious - Attributes of the feature
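The column breakdown above maps directly onto a small parser. The sketch below is my own illustration (the names GffFeature and parse_gff_line are hypothetical), and it assumes the nine columns are separated by tabs and that column 9 uses the key=value style shown in the example.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    """One GFF line; field names are my own choice for this sketch."""
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff_line(line):
    """Split one tab-delimited GFF line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    # Column 9: key=value pairs separated by ';'
    attributes = {}
    for item in cols[8].split(";"):
        if item:
            key, _, value = item.partition("=")
            attributes[key.strip()] = value.strip()
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


# The example line from this section, written with explicit tabs.
line = ("chrI\tSGD\tCDS\t538\t792\t.\t+\t0\t"
        "Parent=YAL068W-A_mRNA;Name=YAL068W-A_CDS;orf_classification=Dubious")
feature = parse_gff_line(line)

# Start and end are both inclusive, so the length is end - start + 1.
length = feature.end - feature.start + 1
print(feature.feature_type, feature.strand, length, feature.attributes["Name"])
```

A real file also contains comment lines starting with “#”, which this sketch does not handle.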
GTF file
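The first 8 columns are the same as in a GFF file, but the 9th column must include two mandatory attributes, gene_id and transcript_id. Each attribute is written as a key followed by a quoted value (key "value"), and attributes are separated by “;”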
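The difference in the 9th column is easiest to see in code. This is a hedged sketch with a hypothetical function name; the attribute string is made up for illustration (reusing the yeast IDs from the GFF example), and unquoted values, which some GTF files contain, are not handled.

```python
import re

# One attribute: a key, whitespace, a double-quoted value, then ';'
GTF_ATTRIBUTE = re.compile(r'\s*(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(column9):
    """Return the key "value" pairs of a GTF attribute column as a dict."""
    return {key: value for key, value in GTF_ATTRIBUTE.findall(column9)}


# Made-up 9th column for illustration only.
column9 = 'gene_id "YAL068W-A"; transcript_id "YAL068W-A_mRNA";'
attrs = parse_gtf_attributes(column9)
print(attrs["gene_id"], attrs["transcript_id"])
```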
SAM File
A SAM file is the output of a sequence alignment program. It begins with optional header lines that start with “@”, followed by alignment lines that have 11 mandatory tab-separated columns; each alignment line holds the alignment information for one read.
Examples of header lines:
@HD - Header line
@SQ - Reference sequence information
@RG - Read group information
@PG - Program information
Example alignment lines:
HISEQ:496:C4KY7ACXX:8:1101:1606:2994 73 4 13740599 36 100M * 0 0 ATCACAAAGAATATTCATCAATGCTTCACAAAACATTGGAAGGGGTAATAATGATGGAGACGTTTCCAAAAACAACCGTTGATGTTTTTCCATTGTTTCT ;;?=?;=BDDCA:CEEE@4A?,AEB?A?9A?<+?::?CCCD1))08?BD4B?<BBD:C=)(5-;A7@AA=CC/=??(3>@5;;AD############### MD:Z:32G10T45G5G4 NH:i:1 HI:i:1 NM:i:4 SM:i:36 XQ:i:40 X2:i:0 XO:Z:HU PG:Z:A
HISEQ:496:C4KY7ACXX:8:1101:1606:2994 133 * 0 0 * 4 13740599 0 ATACAATCGAAAATCATAGTTATTTATGCTCATTCATCGGAAGCTGGGGCAGACTGTTTCAGACAATTACCCATTATTTCTCGAACACTTGAACTAGCAT (85@34?############################################################################################# XO:Z:HU
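The columns of the first alignment line are:
Column 1: HISEQ:496:C4KY7ACXX:8:1101:1606:2994 - Query sequence ID
Column 2: 73 - SAM FLAG
Column 3: 4 - Reference sequence name is 4, which means the read aligns to chromosome 4
Column 4: 13740599 - 1-based leftmost position on the chromosome where the read aligns
Column 5: 36 - Mapping quality
Column 6: 100M - CIGAR string; 100M means 100 bases aligned to the reference (each one a match or a mismatch)
Column 7: * - Reference name of the mate read (* means not available)
Column 8: 0 - Position of the mate read (0 means not available)
Column 9: 0 - Observed template length
Column 10: Query sequence
Column 11: Quality scores for the sequence
Any columns after column 11, such as NM:i:4, are optional tags in TAG:TYPE:VALUE format.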
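As a rough illustration of this layout, the sketch below splits one alignment line into the 11 mandatory columns plus optional tags. The function and field names are my own, and the record is a shortened, made-up version of the example (5 bases instead of 100) so that it fits on one line.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Split one SAM alignment line (not a header line) into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 columns, found {len(cols)}")
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    # Everything after column 11 is an optional TAG:TYPE:VALUE field.
    tags = {}
    for tag in cols[11:]:
        key, tag_type, value = tag.split(":", 2)
        tags[key] = int(value) if tag_type == "i" else value
    record["tags"] = tags
    return record


# Shortened, made-up record for illustration only.
line = "read1\t73\t4\t13740599\t36\t5M\t*\t0\t0\tATCAC\t;;?=?\tNM:i:0\tXO:Z:HU"
record = parse_sam_line(line)
print(record["rname"], record["pos"], record["cigar"], record["tags"])
```

In practice a library such as pysam does this parsing for you; the sketch is only meant to show which column is which.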
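Two columns in a SAM file are especially important. Column 2 is the SAM FLAG, and it is important that you understand its meaning: the FLAG is a sum of bit values, each describing one property of the read (for example, 73 = 1 + 8 + 64 means paired, mate unmapped, first in pair). You can enter the number from this column into an online FLAG decoder to find out what that FLAG means. Column 6 is the CIGAR string, which describes how the read aligns to the reference; the SAM specification explains it in detail.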
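Both columns can also be decoded by hand. The following sketch uses the bit values and CIGAR operation letters from the SAM specification; the function names and the short descriptions are my own wording, not an official API.

```python
import re

# FLAG bits from the SAM specification, with short descriptions.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string such as '100M' into (length, operation) pairs."""
    if cigar == "*":
        return []  # '*' means no CIGAR is available
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# FLAG and CIGAR values from the two example alignment lines.
print(decode_flag(73))
print(decode_flag(133))
print(parse_cigar("100M"))
```

Decoding the two example reads this way shows that the first read is mapped while its mate is not, which matches the “*” in the CIGAR column of the second line.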
BAM file
BAM is the compressed binary version of a SAM file. It stores the same information but is smaller and faster for programs to read.