Questions tagged [fasta]

FASTA is a software package for sequence alignment of proteins and nucleic acids. FASTA is also the name of the file format used by these programs to represent sequences of peptides or nucleotides. The format is a de facto standard in bioinformatics.

The FASTA format (pronounced "fast A") is a text-based format, originally used by the FASTA software, for representing nucleic acid and protein sequences. Each nucleotide or amino acid is represented by a single letter, and each sequence can be given a name.

A record in FASTA format consists of a header (description) line followed by one or more lines of sequence data, with one letter per nucleotide or amino acid. Header lines begin with >. The sequence that follows is usually wrapped at a fixed width (often 60 characters, and generally no more than 80).

> Sample nucleotide sequence
AGCACTGAGTAACGTATAAGCAGTCCCCGGACGCGTA
> Nucleotide sequence #2
GCCACGGGAGTTGAAGAACATCGAGAATGCCACTAGTTTTCACCCTTCATAGATATCCTA
GCGCCGTACATGTATACGAGATCTTTGTCACGCAGTATGGAGGATTGTGGCCAGCAATAC
GTCGTGTCCCGCAATGCTTCATTAGATCCCCGTATATCCATCCTGAGTCATTGTCTGTTG
TCCGTTTTGAAGGAGTCTAGCAGCTTGATA
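As a rough illustration of the record structure described above, here is a minimal Python sketch of a reader. The name `parse_fasta` is hypothetical and not part of the FASTA package or any library. It assumes that header lines start with > and that every non-blank line up to the next header belongs to that record's sequence.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Illustrative sketch only: the header is the text after '>', and the
    wrapped sequence lines of a record are joined into one string.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence line belonging to the current record.
            chunks.append(line)
        # Lines before the first header are ignored (an assumption).
    if header is not None:
        yield header, "".join(chunks)


# Small made-up input, passed as a list of lines.
demo_records = list(parse_fasta([">seq1 demo", "ACGT", "AC", "", ">seq2", "GG"]))
for name, seq in demo_records:
    print(name, len(seq))
```

Because the function is a generator, it can also be given an open file object and will hold only one record in memory at a time, which may help with large files.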
743 questions
18 votes, 13 answers

Converting FASTQ to FASTA with SED/AWK

I have data that always comes in blocks of four lines in the following format (called FASTQ): @SRR018006.2016 GA2:6:1:20:650 length=36 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN +SRR018006.2016 GA2:6:1:20:650…
neversaint
15 votes, 1 answer

Perl6: What is the best way to deal with very big files?

Last week I decided to give Perl6 a try and started to reimplement one of my programs. I have to say, Perl6 is so easy for object programming, an aspect that was very painful to me in Perl5. My program has to read and store big files, such as whole…
Beuss
15 votes, 3 answers

Read FASTA into a dataframe and extract subsequences of FASTA file

I have a small fasta file of DNA sequences which looks like this: >NM_000016 700 200 234 ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC >NM_000775 700 124 236 CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG >NM_003820 700 111…
Paul.j
13 votes, 3 answers

Sequence length of FASTA file

I have the following FASTA file: >header1 CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC >header2 GGT >header3 TTATGAT My desired output: >header1 117 >header2 3 >header3 7 # 3…
cucurbit
11 votes, 9 answers

Remove line breaks in a FASTA file

I have a fasta file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file: >accession1 ATGGCCCATG GGATCCTAGC >accession2 GATATCCATG AAACGGCTTA I'd like to convert it into…
chimeric
10 votes, 4 answers

Parsing a FASTA file using a generator (Python)

I am trying to parse a large FASTA file and I am encountering out-of-memory errors. Some suggestions to improve the data handling would be appreciated. Currently the program correctly prints out the names; however, partway through the file I get a…
Lamar B
9 votes, 4 answers

Efficient file buffering & scanning methods for large files in python

The description of the problem I am having is a bit complicated, and I will err on the side of providing more complete information. For the impatient, here is the briefest way I can summarize it: What is the fastest (least execution time) way to…
eblume
9 votes, 4 answers

Writing fasta files using R package seqinr?

When I use write.fasta in seqinr, the file that it outputs looks like this: >Sequence name 1 >Sequence name 2 >Sequence name 3 ...etc Sequence 1 Sequence 2 Sequence 3 ...etc In other words, the sequence names are all at the beginning of the…
Jennifer Collins
8 votes, 2 answers

Extract sequences from a multi-FASTA file by IDs listed in a file using awk

I would like to extract sequences from the multi-FASTA file that match the IDs given in a separate list of IDs. FASTA file…
7 votes, 2 answers

FASTA Algorithm Explanation

I'm trying to understand the basic steps of the FASTA algorithm for searching a database for sequences similar to a query sequence. These are the steps of the algorithm: Identify common k-words between I and J; score diagonals with k-word matches,…
conmadoi
7 votes, 1 answer

Chaos game for DNA sequences

I have tried the Mathematica code for the chaos game for DNA sequences posted at this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html, which looks like this: genome = Import["c:\data\sequence.fasta", "Sequence"]; genome =…
Layla
7 votes, 3 answers

multiFASTA file processing

I was curious to know if there is any bioinformatics tool out there that can process a multiFASTA file and give me information such as the number of sequences, length, nucleotide/amino acid content, etc., and maybe automatically draw descriptive plots. Also an R…
Federico Giorgi
6 votes, 3 answers

Reading in file block by block using specified delimiter in python

I have an input_file.fa file like this (FASTA format): > header1 description data data data >header2 description more data data data I want to read in the file one chunk at a time, so that each chunk contains one header and the corresponding data,…
Chris_Rands
6 votes, 2 answers

Biopython SeqIO to Pandas Dataframe

I have a FASTA file that can easily be parsed by SeqIO.parse. I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's waaaay too heavy (two iterations, conversions, etc.) from Bio import…
Sara
6 votes, 4 answers

How to find an inverted repeat pattern in a FASTA sequence?

Suppose my long sequence looks like: 5’-AGGGTTTCCC**TGACCT**TCACTGC**AGGTCA**TGCA-3’ The two italicized subsequences (here marked with double asterisks) in this long sequence are together called an inverted repeat pattern. The length and the combination of…
user1964587