Welcome to pyfastx’s documentation!
pyfastx is a lightweight Python C extension that enables users to randomly access sequences in plain and gzipped FASTA/Q files. This module aims to provide simple APIs for extracting sequences from FASTA files and reads from FASTQ files by identifier or index number. pyfastx builds indexes stored in a sqlite3 database file for random access, which avoids consuming an excessive amount of memory. In addition, pyfastx can parse both standard FASTA (sequence spread over multiple lines of the same length) and nonstandard FASTA (sequence spread over one or more lines of different lengths). This module uses kseq.h, written by @attractivechaos in the klib project, to parse plain FASTA/Q files, and zran.c, written by @pauldmccarthy in the indexed_gzip project, to index gzipped files for random access.
This project was heavily inspired by @mdshw5’s project pyfaidx and @brentp’s project pyfasta.
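To make the idea of an on-disk index concrete, here is a minimal pure-Python sketch of offset-based random access to a plain FASTA file. It is not pyfastx's implementation or API: the table layout and the helper names `build_index` and `fetch` are assumptions chosen for illustration, and it uses only the standard library. It records where each sequence block starts and how many bytes it spans, so a single record can be read with one seek instead of loading the whole file. Because lookup relies on byte offsets rather than line lengths, it also copes with records whose lines differ in length.

```python
import os
import sqlite3
import tempfile


def build_index(fasta_path, db_path):
    """Store each record's sequence-block offset and size in a sqlite3 file.

    Hypothetical schema for illustration; assumes every header line has a name.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS seq "
        "(name TEXT PRIMARY KEY, offset INTEGER, nbytes INTEGER)"
    )
    con.execute("DELETE FROM seq")
    name, start, pos = None, 0, 0
    with open(fasta_path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                if name is not None:
                    # Close the previous record: it ends where this header begins.
                    con.execute(
                        "INSERT INTO seq VALUES (?, ?, ?)", (name, start, pos - start)
                    )
                name = line[1:].split()[0].decode()
                start = pos + len(line)  # sequence begins right after the header
            pos += len(line)
    if name is not None:
        con.execute("INSERT INTO seq VALUES (?, ?, ?)", (name, start, pos - start))
    con.commit()
    return con


def fetch(con, fasta_path, name):
    """Read one sequence by name using the stored offset, then strip newlines."""
    row = con.execute(
        "SELECT offset, nbytes FROM seq WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    offset, nbytes = row
    with open(fasta_path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(nbytes)
    return b"".join(raw.split()).decode()


with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "demo.fa")
    with open(fasta, "w", newline="\n") as fh:
        # seq2 is deliberately "nonstandard": its lines have different lengths.
        fh.write(">seq1 first record\nACGT\nAC\n>seq2\nGGGCCC\nTT\nA\n")
    con = build_index(fasta, os.path.join(tmp, "demo.fa.idx"))
    seq1 = fetch(con, fasta, "seq1")
    seq2 = fetch(con, fasta, "seq2")
    con.close()

print(seq1, seq2)
```

Only the small index table is consulted for a lookup, so memory use stays independent of the FASTA file size. Gzipped input needs a different kind of index (checkpoints into the compressed stream), which this sketch does not attempt.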
Features
Single file for the Python extension
Lightweight, memory-efficient parsing of FASTA files
Fast random access to sequences in gzipped FASTA files
Read sequences from a FASTA file line by line
Calculate assembly N50 and L50
Calculate GC content and nucleotide composition
Extract reverse, complement and antisense sequences
Excellent compatibility, including support for parsing nonstandard FASTA files
Support for random access to reads in FASTQ files
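For readers unfamiliar with the statistics and sequence transformations named in this list, the following standard-library sketch shows what each one computes. These are hypothetical helper functions written for illustration, not pyfastx's API. Two assumptions are worth flagging: GC content is taken over A/C/G/T bases only, and "antisense" is taken to mean the reverse complement.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that covers at least half of the total length;
    L50 is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequence lengths given")


def gc_content(seq):
    """GC percentage over A/C/G/T only (assumption: N and others are ignored)."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    acgt = gc + s.count("A") + s.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


# Translation table for DNA complement; unknown symbols pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse(seq):
    """Sequence read backwards."""
    return seq[::-1]


def complement(seq):
    """Base-by-base complement, same orientation."""
    return seq.translate(_COMPLEMENT)


def antisense(seq):
    """Reverse complement (assumed meaning of 'antisense' here)."""
    return complement(seq)[::-1]


n50, l50 = n50_l50([80, 70, 50, 40, 30, 20, 10])
gc = gc_content("ACGTNN")
print(n50, l50, gc, reverse("AACG"), complement("AACG"), antisense("AACG"))
```

In the N50 example the total length is 300, and the two longest sequences (80 + 70) already reach half of it, so N50 is 70 and L50 is 2.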