Elegant SciPy: The Art of Scientific Python
Juan Nunez-Iglesias, Stéfan van der Walt & Harriet Dashnow
Book Details
| Price | 3.00 USD |
|---|---|
| Pages | 277 |
| File Size | 29,767 KB |
| File Type | PDF |
| ISBN | 978-1-491-92287-3 |
| Copyright | 2017 Juan Nunez-Iglesias, Stéfan van der Walt, and Harriet Dashnow |
Juan Nunez-Iglesias is a freelance consultant and a Research Scientist at the University
of Melbourne, Australia. Prior positions include Research Associate at HHMI
Janelia Farm (where he worked with Mitya Chklovskii) and Research Assistant/PhD
student at the University of Southern California (where he studied computational
biology supervised by Xianghong Jasmine Zhou). His principal research interests are
neuroscience and image analysis. He is also interested in graph methods in bioinformatics
and in biostatistics.
Stéfan van der Walt is an assistant researcher at the Berkeley Institute for Data Science
at the University of California, Berkeley, and a senior lecturer in applied mathematics
at Stellenbosch University, South Africa. He has been involved in the
development of scientific open source software for more than a decade, and enjoys
teaching Python at workshops and conferences. Stéfan is the founder of scikit-image
and a contributor to NumPy, SciPy, and cesium-ml.
Harriet Dashnow is a bioinformatician and has worked at the Murdoch Childrens
Research Institute, the Department of Biochemistry at the University of Melbourne,
and the Victorian Life Sciences Computation Initiative (VLSCI). Harriet obtained a
BA (Psychology), a BS (Genetics and Biochemistry), and an MS (Bioinformatics) from
the University of Melbourne. She is currently working toward a PhD. She organizes
and teaches computational skills workshops in such areas as genomics, Software Carpentry,
Python, R, Unix, and Git version control.
Preface
Unlike the stereotypical wedding dress, it was—to use a technical term—elegant, like a computer
algorithm that achieves an impressive outcome with just a few lines of code.
—Graeme Simsion, The Rosie Effect
Welcome to Elegant SciPy. We’re going to spend rather a lot of time focusing on the
“SciPy” bit of the title, so let’s take a moment to reflect on the “Elegant” bit. There are
plenty of manuals, tutorials, and documentation websites out there that describe the
SciPy library. Elegant SciPy goes further. More than just teaching you how to write
code that works, we will inspire you to write code that rocks!
In The Rosie Effect (hilarious book; go read its prequel The Rosie Project when you’re
done with Elegant SciPy), Graeme Simsion twists the conventions of the word “elegant”
around. Most would use it to describe the visual simplicity, style, and grace of,
say, the first iPhone. Instead Graeme Simsion’s hero, Don Tillman, uses a computer
algorithm to define elegance. We hope that you will understand exactly what he
means after reading this book; that you will read or write a piece of elegant code, and
feel calmed in the glow of its beauty and grace. (Note: The authors may be prone to hyperbole.)
A good piece of code just feels right. When you look at it, its intent is clear, it is often
concise (but not so concise as to be obscure), and it is efficient at executing the task at
hand. For the authors, the joy of analyzing elegant code lies in the lessons hidden
within, and the way it inspires us to be creative in how we approach new coding problems.
Ironically, creativity can also tempt us to show off cleverness at the expense of the
reader, and write obtuse code that is hard to understand. PEP 8 (the Python style
guide) and PEP 20 (the Zen of Python) remind us that “code is read much more often
than it is written” and therefore “readability counts.”
The conciseness of elegant code comes through abstraction and the judicious use of
functions, not just through packing in a bunch of nested function calls. It may take a
minute or two to grok, but it should ultimately provide a crisp, “ah-ha!” moment of
understanding. Once you know the various components of the code, its correctness
should be obvious. This can be aided by clear variable and function names, and carefully
crafted comments that explain the code, rather than merely describe it.
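To make the contrast concrete, here is a minimal sketch that is not taken from the book. It assumes a small, made-up table of read counts (rows are genes, columns are samples) and compares a nested one-liner with the same computation split into named steps. The function names and the counts-per-million scaling are hypothetical choices for illustration only.

```python
import numpy as np


def terse(counts):
    # Correct, but the reader has to unpack four nested calls to see the intent.
    return np.log(np.add(np.multiply(np.divide(counts, np.sum(counts, axis=0)), 1e6), 1))


def counts_per_million(counts):
    """Scale each column (sample) so that it sums to one million."""
    # Samples are sequenced to different depths, so raw counts are not
    # comparable between columns until they share a common total.
    return counts / counts.sum(axis=0) * 1e6


def log_transform(values):
    """Compress the dynamic range of the scaled counts."""
    # log1p(x) computes log(1 + x), so a count of zero maps to zero
    # instead of negative infinity.
    return np.log1p(values)


def normalize(counts):
    """Depth-normalize, then log-transform, a genes-by-samples count table."""
    return log_transform(counts_per_million(counts))


# Made-up data: three genes (rows) measured in two samples (columns).
counts = np.array([[10.0, 0.0],
                   [30.0, 5.0],
                   [60.0, 15.0]])
normalized = normalize(counts)
```

Both versions return the same numbers. The second is longer, but each name states what a step does and each comment records why the step is there.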
In the New York Times, software engineer J. Bradford Hipps recently argued that “to
write better code, [one should] read Virginia Woolf”:
As a practice, software development is far more creative than algorithmic.
The developer stands before her source code editor in the same way the author confronts
the blank page. […] They may also share a healthy impatience for the ways
things “have always been done” and a generative desire to break conventions. When
the module is finished or the pages complete, their quality is judged against many of
the same standards: elegance, concision, cohesion; the discovery of symmetries where
none were seen to exist. Yes, even beauty.
This is the position we take in this book.
Now that we’ve dealt with the “elegant” part of the title, let’s come back to the “SciPy.”
Depending on context, “SciPy” can mean a software library, an ecosystem, or a community.
Part of what makes SciPy great is that it has excellent online documentation
and tutorials, rendering Just Another Reference book pointless; instead, Elegant SciPy
wants to present the best code built with SciPy.
The code we have chosen highlights clever, elegant uses of advanced features of
NumPy, SciPy, and related libraries. The beginning reader will learn to apply these
libraries to real-world problems using beautiful code. And we use real scientific data
to motivate our examples.
Like SciPy itself, we wanted Elegant SciPy to be driven by the community. We’ve taken
many of our examples from working code found in the wider scientific Python ecosystem,
selecting them for their illustration of the principles of elegant code we outlined above.
Table of Contents
Preface
1. Elegant NumPy: The Foundation of Scientific Python
Introduction to the Data: What Is Gene Expression?
NumPy N-Dimensional Arrays
Why Use ndarrays Instead of Python Lists?
Vectorization
Broadcasting
Exploring a Gene Expression Dataset
Reading in the Data with pandas
Normalization
Between Samples
Between Genes
Normalizing Over Samples and Genes: RPKM
Taking Stock
2. Quantile Normalization with NumPy and SciPy
Getting the Data
Gene Expression Distribution Differences Between Individuals
Biclustering the Counts Data
Visualizing Clusters
Predicting Survival
Further Work: Using the TCGA’s Patient Clusters
Further Work: Reproducing the TCGA’s Clusters
3. Networks of Image Regions with ndimage
Images Are Just NumPy Arrays
Exercise: Adding a Grid Overlay
Filters in Signal Processing
Filtering Images (2D Filters)
Generic Filters: Arbitrary Functions of Neighborhood Values
Exercise: Conway’s Game of Life
Exercise: Sobel Gradient Magnitude
Graphs and the NetworkX library
Exercise: Curve Fitting with SciPy
Region Adjacency Graphs
Elegant ndimage: How to Build Graphs from Image Regions
Putting It All Together: Mean Color Segmentation
4. Frequency and the Fast Fourier Transform
Introducing Frequency
Illustration: A Birdsong Spectrogram
History
Implementation
Choosing the Length of the DFT
More DFT Concepts
Frequencies and Their Ordering
Windowing
Real-World Application: Analyzing Radar Data
Signal Properties in the Frequency Domain
Windowing, Applied
Radar Images
Further Applications of the FFT
Further Reading
Exercise: Image Convolution
5. Contingency Tables Using Sparse Coordinate Matrices
Contingency Tables
Exercise: Computational Complexity of Confusion Matrices
Exercise: Alternative Algorithm to Compute the Confusion Matrix
Exercise: Multiclass Confusion Matrix
scipy.sparse Data Formats
COO Format
Exercise: COO Representation
Compressed Sparse Row Format
Applications of Sparse Matrices: Image Transformations
Exercise: Image Rotation
Back to Contingency Tables
Exercise: Reducing the Memory Footprint
Contingency Tables in Segmentation
Information Theory in Brief
Exercise: Computing Conditional Entropy
Information Theory in Segmentation: Variation of Information
Converting NumPy Array Code to Use Sparse Matrices
Using Variation of Information
Further Work: Segmentation in Practice
6. Linear Algebra in SciPy
Linear Algebra Basics
Laplacian Matrix of a Graph
Exercise: Rotation Matrix
Laplacians with Brain Data
Exercise: Showing the Affinity View
Exercise Challenge: Linear Algebra with Sparse Matrices
PageRank: Linear Algebra for Reputation and Importance
Exercise: Dealing with Dangling Nodes
Exercise: Equivalence of Different Eigenvector Methods
Concluding Remarks
7. Function Optimization in SciPy
Optimization in SciPy: scipy.optimize
An Example: Computing Optimal Image Shift
Image Registration with Optimize
Avoiding Local Minima with Basin Hopping
Exercise: Modify the align Function
“What Is Best?”: Choosing the Right Objective Function
8. Big Data in Little Laptop with Toolz
Streaming with yield
Introducing the Toolz Streaming Library
k-mer Counting and Error Correction
Currying: The Spice of Streaming
Back to Counting k-mers
Exercise: PCA of Streaming Data
Markov Model from a Full Genome
Exercise: Online Unzip
Epilogue
Appendix: Exercise Solutions
Index
Who Is This Book For?
Elegant SciPy is intended to inspire you to take your Python to the next level. You will
learn SciPy by example, from the very best code.
Before starting, you should at least have seen Python, and know about variables,
functions, loops, and maybe a bit of NumPy. You might have even honed your Python
skills with advanced material, such as Fluent Python. If this doesn’t describe you, you
should start with some beginner Python tutorials, such as Software Carpentry, before
continuing with this book.
But perhaps you don’t know whether the “SciPy stack” is a library or a menu item
from the International House of Pancakes, and you aren’t sure about best practices.
Perhaps you are a scientist who has read some Python tutorials online, downloaded
some analysis scripts from another lab or a previous member of your own lab,
and fiddled with them. And you might think that you are more or less
alone when you learn to code SciPy. You are not.
As we progress, we will teach you how to use the internet as your reference. And we
will point you to the mailing lists, repositories, and conferences where you will meet
like-minded scientists who are a little further in their journey than you.
This is a book that you will read once, but may return to for inspiration (and maybe
to admire some elegant code snippets!).
Why SciPy?
The NumPy and SciPy libraries make up the core of the Scientific Python ecosystem.
The SciPy software library implements a set of functions for processing scientific
data, such as statistics, signal processing, image processing, and function optimization.
SciPy is built on top of NumPy, the Python numerical array computation library.
Building on NumPy and SciPy, an entire ecosystem of apps and libraries has grown
dramatically over the past few years, spanning a broad spectrum of disciplines that
includes astronomy, biology, meteorology and climate science,
and materials science, among others.
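As a rough illustration of how SciPy builds on NumPy, the sketch below creates one NumPy array and hands it to three SciPy submodules covering the areas named above. The synthetic sine signal, the random seed, the filter width, and the toy objective function are all assumptions made for this example; none of them come from the book.

```python
import numpy as np
from scipy import ndimage, optimize, stats

# A synthetic noisy signal stored in a plain NumPy array (illustrative data only).
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
clean = np.sin(x)
noisy = clean + rng.normal(scale=0.3, size=x.size)

# Image and signal processing: smooth the array with a 1D Gaussian filter.
smooth = ndimage.gaussian_filter1d(noisy, sigma=5)

# Statistics: summarize the noisy samples (count, min/max, mean, variance, ...).
summary = stats.describe(noisy)

# Function optimization: find the minimum of a simple one-variable function.
result = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)

print(type(smooth), summary.nobs, result.x)
```

Each SciPy function takes a NumPy array (or a plain Python callable) as input, and the filtered signal comes back as a NumPy array, so results from one submodule can be passed directly to another.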