Questions tagged [fasta]

FASTA is a software package for sequence alignment of proteins and nucleic acids. FASTA is also the name of the file format used by these programs to represent sequences of peptides or nucleotides. The format is a de facto standard in bioinformatics.

The FASTA format (read as "fast A format") is a text-based format used by the FASTA software for representing nucleic acid and protein sequences. Each nucleotide or amino acid is represented by a single letter, and each sequence can be given a name.

A record in FASTA format consists of a header (comment) line followed by one or more lines of sequence data (one letter per nucleotide or amino acid). Header lines begin with >. The sequence that follows is usually wrapped at a fixed width (often 60 characters, and generally no more than 80).

> Sample nucleotide sequence
AGCACTGAGTAACGTATAAGCAGTCCCCGGACGCGTA
> Nucleotide sequence #2
GCCACGGGAGTTGAAGAACATCGAGAATGCCACTAGTTTTCACCCTTCATAGATATCCTA
GCGCCGTACATGTATACGAGATCTTTGTCACGCAGTATGGAGGATTGTGGCCAGCAATAC
GTCGTGTCCCGCAATGCTTCATTAGATCCCCGTATATCCATCCTGAGTCATTGTCTGTTG
TCCGTTTTGAAGGAGTCTAGCAGCTTGATA
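
The record layout described above can be sketched in plain Python. This is only an illustration, not the FASTA package's or any library's own parser: the function names `read_fasta` and `format_fasta` are made up for this example, and the code assumes that blank lines can be ignored and that anything before the first header line is not sequence data.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle.

    Assumption for this sketch: blank lines are skipped, and any text
    before the first ">" line is ignored.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Wrapped sequence lines are joined back into one string.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def format_fasta(header, sequence, width=60):
    """Return one record as text, wrapping the sequence at `width` letters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + lines) + "\n"


# Small usage example: the first record from the sample above.
sample = "> Sample nucleotide sequence\nAGCACTGAGTAACGTATAAGCAGTCCCCGGACGCGTA\n"
for name, seq in read_fasta(io.StringIO(sample)):
    print(name, len(seq))
    print(format_fasta(name, seq, width=20), end="")
```

For real work, a maintained parser such as Biopython's `Bio.SeqIO` (mentioned in several of the questions below) is usually the safer choice; the sketch only shows how little structure the format has.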
743 questions
4 votes, 4 answers

Printing a sequence from a fasta file

I often need to find a particular sequence in a fasta file and print it. For those who don't know, fasta is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple, you have a line with the sequence name preceded by a…
Colin
4 votes, 2 answers

Using Bio.SeqIO to write single-line FASTA

QIIME requests this (here) regarding the fasta files it receives as input: The file is a FASTA file, with sequences in the single line format. That is, sequences are not broken up into multiple lines of a particular length, but instead the entire…
Korem
4 votes, 3 answers

Convert table into fasta in R

I have a table like this: >head(X) column1 column2 sequence1 ATCGATCGATCG sequence2 GCCATGCCATTG I need an output in a fasta file, looking like this: sequence1 ATCGATCGATCG sequence2 GCCATGCCATTG So, basically I need all entries of the 2nd…
user3586764
4 votes, 3 answers

append contents from one file to another with newline separation

I'm trying to, I think, replicate the cat functionality of the Linux shell in a platform-agnostic way such that I can take two text files and merge their contents in the following manner: file_1 contains: 42 bottles of beer on the wall file_2…
glarue
4 votes, 2 answers

Using Biopython (Python) to extract sequence from FASTA file

Ok so I need to extract part of a sequence from a FASTA file, using python (biopython, http://biopython.org/DIST/docs/tutorial/Tutorial.html) I need to get the first 10 bases from each sequence and put them in one file, preserving the sequence info…
user1784467
4 votes, 2 answers

Scala functional way of processing large scala data with lazy collections

I am trying to figure out memory-efficient AND functional ways to process a large scale of data using strings in scala. I have read many things about lazy collections and have seen quite a bit of code examples. However, I run into "GC overhead…
Wayne Jhukie
3 votes, 1 answer

Making Blast database from FASTA in Python

How can I do this? I use Biopython and saw manual already. Of course I can make blastdb from FASTA using "makeblastdb" in standalone NCBI BLAST+, but I want to whole process in one program. It seems there are two possible solutions. Find a function…
3 votes, 2 answers

Parsing file in parallel

I am thinking about a way to parse a fasta-file in parallel. For those of you not knowing fasta-format an example: >SEQUENCE_1 MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG …
peri4n
3 votes, 3 answers

How do I merge two FASTA files (one file with line break) in Perl?

I have two following Fasta file: file1.fasta >0 GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT >1 GTTAAGTTATATCAAACTAAATATACATACTATAAA >2 GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC file2.qual >0 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40…
neversaint
3 votes, 2 answers

Removing lines which match with specific pattern from another file

I've got two files (I only show the beginning of these files)…
Paillou
3 votes, 2 answers

Find length of a contig in one fasta, using the header of another fasta as query in python

I'm trying to find a python solution to extract the length of a specific sequence within a fasta file using the full header of the sequence as the query. The full header is stored as a variable earlier in the pipeline (i.e. "CONTIG"). I would like…
Gunther
3 votes, 1 answer

How to remove duplicates from fasta file but keep at least one per group based on header

I have a multifasta file that looks like this: (all sequences are >100bp, more than one line, and same length…
Xela Vi
3 votes, 1 answer

Pairwise alignment of multi-FASTA file sequences

I have multi-FASTA file containing more than 10 000 fasta sequences resulted from Next Generation Sequencing and I want to do pairwise alignment of each sequence to each sequence inside the file and store all the results in the same new file in…
Aurora
3 votes, 1 answer

Is there a way to collect many multiline strings delineated by a specific character into an Arraylist using the data stream in Java 8?

I have a fasta file that I want to parse into an ArrayList, each position having an entire sequence. The sequences are multiline strings, and I don't want to include the identification line in the string that I store. My current code splits each…
Sam
3 votes, 1 answer

Directly calling SeqIO.parse() in for loop works, but using it separately beforehand doesn't? Why?

In python this code, where I directly call the function SeqIO.parse() , runs fine: from Bio import SeqIO a = SeqIO.parse("a.fasta", "fasta") records = list(a) for asq in SeqIO.parse("a.fasta", "fasta"): print("Q") But this, where I first…