Biology
Biology and bioinformatics-oriented data cleaning functions.
join_fasta(df, filename, id_col, column_name)
Convenience method to join in a FASTA file as a column.
This adds the sequence string of each FASTA record as a new column of data in the DataFrame.
This method attaches only the string representation of the SeqRecord.Seq object from Biopython; it does not attach the full SeqRecord. The alphabet is also not stored, on the assumption that the data scientist has the domain knowledge to know what kind of sequence is being read in (nucleotide vs. amino acid).
This method mutates the original DataFrame.
For more advanced functions, please use phylopandas.
Examples:
>>> import tempfile
>>> import pandas as pd
>>> import janitor.biology
>>> tf = tempfile.NamedTemporaryFile()
>>> tf.write('''>SEQUENCE_1
... MTEITAAMVKELRESTGAGMMDCK
... >SEQUENCE_2
... SATVSEINSETDFVAKN'''.encode('utf8'))
66
>>> tf.seek(0)
0
>>> df = pd.DataFrame({"sequence_accession":
...                    ["SEQUENCE_1", "SEQUENCE_2", ]})
>>> df = df.join_fasta(
...     filename=tf.name,
...     id_col='sequence_accession',
...     column_name='sequence',
... )
>>> df.sequence
0    MTEITAAMVKELRESTGAGMMDCK
1           SATVSEINSETDFVAKN
Name: sequence, dtype: object
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
filename | str | Path to the FASTA file. | required |
id_col | str | The column in the DataFrame that houses sequence IDs. | required |
column_name | str | The name of the new column. | required |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with new FASTA string sequence column. |
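To make the join concrete, here is a minimal standard-library sketch of the lookup that `join_fasta` performs conceptually: build a mapping from record ID to sequence string, then look up each value of the ID column in that mapping. This is not the library's implementation, which reads records through Biopython. The helper `parse_fasta` and the variable names are hypothetical, and the sketch assumes that a record's ID is the first whitespace-delimited token after `>`.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID to its sequence string (hypothetical helper)."""
    chunks: Dict[str, List[str]] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the ID is the first token of the header line.
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks[current_id] = []
        elif current_id is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


fasta_text = """>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCK
>SEQUENCE_2
SATVSEINSETDFVAKN"""

id_to_sequence = parse_fasta(fasta_text.splitlines())

# Stand-in for the values of the DataFrame's ID column.
accessions = ["SEQUENCE_1", "SEQUENCE_2"]

# One sequence string per row, in row order; this list becomes the new column.
sequences = [id_to_sequence[acc] for acc in accessions]
```

Only plain strings are kept in this sketch, mirroring the note above that the full SeqRecord and the alphabet are not attached. An accession that is missing from the FASTA data raises a `KeyError` here; how `join_fasta` itself treats that case is not stated on this page.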
Source code in janitor/biology.py