Commit a650039

Author: Alan Rubin
updates for 1.1.0 (#10)
* add support for ambiguous amino acid codes
* updated benchmarks
1 parent bafdb38 commit a650039

11 files changed, 165 additions and 131 deletions

docs/benchmark.rst

Lines changed: 15 additions & 8 deletions
@@ -1,18 +1,25 @@
 Performance comparison
 *****************************
 
-This page contains some performance and usage comparisons for processing FASTQ_ files with fqfa and `pyfastx <https://github.com/lmdu/pyfastx>`_.
+This page contains some performance and usage comparisons for processing FASTQ_ files with
+fqfa and `pyfastx <https://github.com/lmdu/pyfastx>`_.
 
 In these benchmarks, fqfa is comparable to `pyfastx <https://github.com/lmdu/pyfastx>`_,
-although `pyfastx <https://github.com/lmdu/pyfastx>`_ run in non-indexed mode is fastest.
+although `pyfastx <https://github.com/lmdu/pyfastx>`_ has made substantial performance
+improvements since fqfa was written, particularly when reading gzip-compressed input files.
 
 The results are derived from `Jupyter notebooks <https://jupyter.org/>`_.
-If you'd like to run this code yourself, the notebooks are available with the fqfa documentation in ``fqfa/docs/notebooks``.
-The file used in the benchmark is from the `Enrich2 example dataset <https://github.com/FowlerLab/Enrich2-Example>`_.
-To run the benchmarks as written, you will have to decompress the bz2 file and also create a gzipped version.
-
-This section includes examples of usage that are common in my work, primarily in processing files of barcode reads for high-throughput functional genomic assays.
-`pyfastx <https://github.com/lmdu/pyfastx>`_ includes many other functions that are not demonstrated here.
+If you'd like to run this code yourself, the notebooks are available with the fqfa
+documentation in ``fqfa/docs/notebooks``.
+The file used in the benchmark is from the
+`Enrich2 example dataset <https://github.com/FowlerLab/Enrich2-Example>`_.
+To run the benchmarks as written, you will have to decompress the bz2 file and also
+create a gzipped version.
+
+This section includes examples of usage that are common in my work, primarily in
+processing files of barcode reads for high-throughput functional genomic assays.
+`pyfastx <https://github.com/lmdu/pyfastx>`_ includes many other functions that are not
+demonstrated here.
 
 Benchmarking for raw FASTQ files
 #####################################
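
Note on the setup step in docs/benchmark.rst: the text asks readers to decompress the bz2 file and also create a gzipped version. A minimal sketch of that step, using only the Python standard library, might look like the following. The function name `prepare_benchmark_files` is hypothetical and is not part of fqfa or pyfastx; the input filename in the usage note is the one listed in the notebooks changed by this commit.

```python
import bz2
import gzip
import shutil
from pathlib import Path


def prepare_benchmark_files(bz2_path):
    """Decompress a .bz2 file and write a gzipped copy next to it.

    Hypothetical helper for illustration only; not part of fqfa or pyfastx.
    Returns the paths of the raw and gzipped files.
    """
    bz2_path = Path(bz2_path)
    raw_path = bz2_path.with_suffix("")  # drop the trailing ".bz2"
    gz_path = raw_path.with_name(raw_path.name + ".gz")

    # Stream the data so the whole file never has to fit in memory.
    with bz2.open(bz2_path, "rb") as src, open(raw_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    with open(raw_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    return raw_path, gz_path
```

Calling `prepare_benchmark_files("BRCA1_input_sample.fq.bz2")` would write `BRCA1_input_sample.fq` and `BRCA1_input_sample.fq.gz` in the same directory. The resulting file sizes may differ slightly from those listed in the notebooks, depending on the compression level used.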

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
 author = "Alan F Rubin"
 
 # The full version, including alpha/beta/rc tags
-release = "1.0.0"
+release = "1.1.0"
 
 
 # -- General configuration ---------------------------------------------------

docs/notebooks/benchmarks_bz2.ipynb

Lines changed: 4 additions & 18 deletions
@@ -114,20 +114,6 @@
 "print(f\"Kept {len(filt_reads)} reads after applying filter.\")\n",
 "del filt_reads"
 ]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
 }
 ],
 "metadata": {
@@ -146,18 +132,18 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.5rc1"
+"version": "3.8.2"
 },
 "pycharm": {
 "stem_cell": {
 "cell_type": "raw",
-"source": [],
 "metadata": {
 "collapsed": false
-}
+},
+"source": []
 }
 }
 },
 "nbformat": 4,
 "nbformat_minor": 4
-}
+}

docs/notebooks/benchmarks_gz.ipynb

Lines changed: 6 additions & 20 deletions
@@ -33,10 +33,10 @@
 "\n",
 "```\n",
 "334M BRCA1_input_sample.fq\n",
-"48M BRCA1_input_sample.fq.bz2\n",
-"520M BRCA1_input_sample.fq.fxi\n",
-"68M BRCA1_input_sample.fq.gz\n",
-"522M BRCA1_input_sample.fq.gz.fxi\n",
+" 48M BRCA1_input_sample.fq.bz2\n",
+"511M BRCA1_input_sample.fq.fxi\n",
+" 68M BRCA1_input_sample.fq.gz\n",
+"513M BRCA1_input_sample.fq.gz.fxi\n",
 "```"
 ]
 },
@@ -184,7 +184,7 @@
 "# Benchmark 3: filtering reads on quality\n",
 "\n",
 "This code creates a list of reads for which all bases are at least Q20.\n",
-"The performance and usage in this section is quite similar to Benchmark 2."
+"The performance and usage in this section is quite a bit faster than Benchmark 2 following recent performance improvements in pyfastx."
 ]
 },
 {
@@ -245,20 +245,6 @@
 "print(f\"Kept {len(filt_reads)} reads after applying filter.\")\n",
 "del filt_reads"
 ]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
 }
 ],
 "metadata": {
@@ -277,7 +263,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.5rc1"
+"version": "3.8.2"
 },
 "pycharm": {
 "stem_cell": {

docs/notebooks/benchmarks_raw.ipynb

Lines changed: 6 additions & 20 deletions
@@ -32,10 +32,10 @@
 "\n",
 "```\n",
 "334M BRCA1_input_sample.fq\n",
-"48M BRCA1_input_sample.fq.bz2\n",
-"520M BRCA1_input_sample.fq.fxi\n",
-"68M BRCA1_input_sample.fq.gz\n",
-"522M BRCA1_input_sample.fq.gz.fxi\n",
+" 48M BRCA1_input_sample.fq.bz2\n",
+"511M BRCA1_input_sample.fq.fxi\n",
+" 68M BRCA1_input_sample.fq.gz\n",
+"513M BRCA1_input_sample.fq.gz.fxi\n",
 "```"
 ]
 },
@@ -181,7 +181,7 @@
 "# Benchmark 3: filtering reads on quality\n",
 "\n",
 "This code creates a list of reads for which all bases are at least Q20.\n",
-"The performance and usage in this section is quite similar to Benchmark 2."
+"The performance and usage in this section is quite a bit faster than Benchmark 2 following recent performance improvements in pyfastx."
 ]
 },
 {
@@ -240,20 +240,6 @@
 "print(f\"Kept {len(filt_reads)} reads after applying filter.\")\n",
 "del filt_reads"
 ]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
 }
 ],
 "metadata": {
@@ -272,7 +258,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.5rc1"
+"version": "3.8.2"
 }
 },
 "nbformat": 4,

docs/notebooks/exported/benchmarks_bz2.rst

Lines changed: 6 additions & 8 deletions
@@ -31,8 +31,8 @@ statement.
 
 .. parsed-literal::
 
-CPU times: user 51.3 s, sys: 993 ms, total: 52.3 s
-Wall time: 52.3 s
+CPU times: user 42.2 s, sys: 1.05 s, total: 43.3 s
+Wall time: 43.4 s
 @140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1
 CCCGTGGCCTTTTCCA
 +
@@ -81,8 +81,8 @@ FastqRead class.
 
 .. parsed-literal::
 
-CPU times: user 1min 59s, sys: 174 ms, total: 1min 59s
-Wall time: 1min 59s
+CPU times: user 1min 35s, sys: 277 ms, total: 1min 35s
+Wall time: 1min 35s
 Median average quality is 37.5
 
 
@@ -109,9 +109,7 @@ class.
 
 .. parsed-literal::
 
-CPU times: user 58.8 s, sys: 848 ms, total: 59.7 s
-Wall time: 59.7 s
+CPU times: user 43 s, sys: 784 ms, total: 43.8 s
+Wall time: 43.8 s
 Kept 3641762 reads after applying filter.
 
-
-
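
Note on what Benchmarks 2 and 3 compute: the exported output reports a median of per-read average quality and a count of reads for which all bases are at least Q20. The sketch below illustrates both calculations in plain Python. It does not use the fqfa or pyfastx APIs, the helper names are hypothetical, and it assumes the quality strings use the Phred+33 encoding. The three reads are the (name, sequence, quality) tuples shown in the pyfastx output elsewhere in this commit.

```python
from statistics import mean, median

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_scores(quality_string, offset=PHRED_OFFSET):
    """Convert an ASCII quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def passes_min_quality(quality_string, min_quality=20):
    """Return True if every base in the read has at least min_quality."""
    return all(q >= min_quality for q in phred_scores(quality_string))


# (name, sequence, quality) tuples copied from the output shown in this commit.
reads = [
    ("140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1", "CCCGTGGCCTTTTCCA", "B@CFFFFFHHHHHJJJ"),
    ("140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1", "TTTGGTAAAGGGTAAC", "BBCFFDFFHHHHDHIJ"),
    ("140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1", "AATAATGTATGTACCT", "BC@FFFFEFHHHHJJJ"),
]

# Benchmark 2 style: median of the per-read average quality.
read_quals = [mean(phred_scores(quality)) for _, _, quality in reads]
print(f"Median average quality is {median(read_quals)}")

# Benchmark 3 style: keep reads for which all bases are at least Q20.
filt_reads = [read for read in reads if passes_min_quality(read[2])]
print(f"Kept {len(filt_reads)} reads after applying filter.")
```

The benchmarked notebooks perform the same kind of calculation over the full input file using each library's own read objects, so timings from this sketch are not comparable to the figures in the diff.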

docs/notebooks/exported/benchmarks_gz.rst

Lines changed: 36 additions & 22 deletions
@@ -22,10 +22,10 @@ than the reads in this case:
 ::
 
 334M BRCA1_input_sample.fq
-48M BRCA1_input_sample.fq.bz2
-520M BRCA1_input_sample.fq.fxi
-68M BRCA1_input_sample.fq.gz
-522M BRCA1_input_sample.fq.gz.fxi
+48M BRCA1_input_sample.fq.bz2
+511M BRCA1_input_sample.fq.fxi
+68M BRCA1_input_sample.fq.gz
+513M BRCA1_input_sample.fq.gz.fxi
 
 .. code:: ipython3
 
@@ -37,8 +37,8 @@ than the reads in this case:
 
 .. parsed-literal::
 
-CPU times: user 32.1 s, sys: 18.6 s, total: 50.7 s
-Wall time: 50.8 s
+CPU times: user 9.1 s, sys: 1.05 s, total: 10.1 s
+Wall time: 10.2 s
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1 with length of 16
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1 with length of 16
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1 with length of 16
@@ -62,8 +62,8 @@ doesn’t perform any extra computation or quality value conversion.
 
 .. parsed-literal::
 
-CPU times: user 3.34 s, sys: 452 ms, total: 3.79 s
-Wall time: 3.79 s
+CPU times: user 2.59 s, sys: 312 ms, total: 2.9 s
+Wall time: 2.9 s
 ('140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1', 'CCCGTGGCCTTTTCCA', 'B@CFFFFFHHHHHJJJ')
 ('140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1', 'TTTGGTAAAGGGTAAC', 'BBCFFDFFHHHHDHIJ')
 ('140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1', 'AATAATGTATGTACCT', 'BC@FFFFEFHHHHJJJ')
@@ -89,8 +89,8 @@ statement.
 
 .. parsed-literal::
 
-CPU times: user 39.7 s, sys: 757 ms, total: 40.5 s
-Wall time: 40.5 s
+CPU times: user 30.8 s, sys: 881 ms, total: 31.6 s
+Wall time: 31.6 s
 @140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1
 CCCGTGGCCTTTTCCA
 +
@@ -138,6 +138,14 @@ information is not provided.
 print(f"Median average quality is {median(read_quals)}")
 del read_quals
 
+
+.. parsed-literal::
+
+CPU times: user 53.9 s, sys: 323 ms, total: 54.2 s
+Wall time: 54.2 s
+Median average quality is 37.5
+
+
 pyfastx without index
 ---------------------
 
@@ -154,8 +162,8 @@ processing the input file.
 
 .. parsed-literal::
 
-CPU times: user 1min 12s, sys: 95.6 ms, total: 1min 12s
-Wall time: 1min 12s
+CPU times: user 55.9 s, sys: 15.4 ms, total: 55.9 s
+Wall time: 56 s
 Median average quality is 37.5
 
 
@@ -175,17 +183,17 @@ FastqRead class.
 
 .. parsed-literal::
 
-CPU times: user 1min 42s, sys: 119 ms, total: 1min 42s
-Wall time: 1min 42s
+CPU times: user 1min 23s, sys: 55.6 ms, total: 1min 23s
+Wall time: 1min 23s
 Median average quality is 37.5
 
 
 Benchmark 3: filtering reads on quality
 =======================================
 
 This code creates a list of reads for which all bases are at least Q20.
-The performance and usage in this section is quite similar to Benchmark
-2.
+The performance and usage in this section is quite a bit faster than
+Benchmark 2 following recent performance improvements in pyfastx.
 
 pyfastx with index
 ------------------
@@ -199,6 +207,14 @@ information is not provided.
 print(f"Kept {len(filt_reads)} reads after applying filter.")
 del filt_reads
 
+
+.. parsed-literal::
+
+CPU times: user 6.17 s, sys: 360 ms, total: 6.53 s
+Wall time: 6.53 s
+Kept 3641707 reads after applying filter.
+
+
 pyfastx without index
 ---------------------
 
@@ -211,8 +227,8 @@ pyfastx without index
 
 .. parsed-literal::
 
-CPU times: user 9.29 s, sys: 356 ms, total: 9.65 s
-Wall time: 9.65 s
+CPU times: user 7.24 s, sys: 620 ms, total: 7.86 s
+Wall time: 7.87 s
 Kept 3641762 reads after applying filter.
 
 
@@ -232,9 +248,7 @@ class.
 
 .. parsed-literal::
 
-CPU times: user 39.9 s, sys: 884 ms, total: 40.8 s
-Wall time: 40.8 s
+CPU times: user 31.2 s, sys: 660 ms, total: 31.9 s
+Wall time: 31.9 s
 Kept 3641762 reads after applying filter.
 
-
-
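
Note on reading the timing changes in docs/notebooks/exported/benchmarks_gz.rst: one way to summarise them is as the ratio of old to new wall time for each hunk that replaces a timing. The sketch below does that arithmetic with the values copied from the diff, keyed by hunk header. The dictionary and function names are hypothetical. The notebook metadata in this commit also changes the Python version from 3.7.5rc1 to 3.8.2, so these ratios probably reflect the whole environment rather than library changes alone.

```python
# (old, new) wall times in seconds, copied from the benchmarks_gz.rst hunks.
# Keys are the starting old-file line of each hunk header.
wall_times = {
    "@@ -37": (50.8, 10.2),
    "@@ -62": (3.79, 2.9),
    "@@ -89": (40.5, 31.6),
    "@@ -154": (72.0, 56.0),   # 1min 12s -> 56 s
    "@@ -175": (102.0, 83.0),  # 1min 42s -> 1min 23s
    "@@ -211": (9.65, 7.87),
    "@@ -232": (40.8, 31.9),
}


def speedup(old_seconds, new_seconds):
    """Ratio of old to new wall time; values above 1 mean the new run was faster."""
    return old_seconds / new_seconds


for hunk, (old, new) in wall_times.items():
    print(f"{hunk}: {old:g} s -> {new:g} s ({speedup(old, new):.2f}x)")
```

The two timings that appear only as additions in this file (54.2 s and 6.53 s) have no earlier value in the diff, so they are left out of the comparison.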
