Publisher: O’Reilly Media | 2010 | Pages: 538
Collecting data is relatively easy, but turning raw information into something useful requires that you know how to extract precisely what you need. With this insightful book, intermediate to experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You'll learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and then feed your understanding back into the organization through business plans, metrics dashboards, and other applications.
Along the way, you'll experiment with concepts through hands-on workshops at the end of each chapter. Above all, you'll learn how to think about the results you want to achieve, rather than relying on tools to think for you.
Use graphics to describe data with one, two, or dozens of variables
Develop conceptual models using back-of-the-envelope calculations, as well as scaling and probability arguments
Mine data with computationally intensive methods such as simulation and clustering
Make your conclusions understandable through reports, dashboards, and other metrics programs
Understand financial calculations, including the time-value of money
Use dimensionality reduction techniques or predictive analytics to conquer challenging data analysis situations
Become familiar with different open source programming environments for data analysis
Including:
Data Analysis
What’s in This Book
What’s with the Workshops?
What’s with the Math?
What You’ll Need
What’s Missing
Graphics: Looking at Data
A Single Variable: Shape and Distribution
Dot and Jitter Plots
Histograms and Kernel Density Estimates
The Cumulative Distribution Function
Rank-Order Plots and Lift Charts
Only When Appropriate: Summary Statistics and Box Plots
Workshop: NumPy
Further Reading
Two Variables: Establishing Relationships
Scatter Plots
Conquering Noise: Smoothing
Logarithmic Plots
Banking
Linear Regression and All That
Showing What’s Important
Graphical Analysis and Presentation Graphics
Workshop: matplotlib
Further Reading
Time As a Variable: Time-Series Analysis
Examples
The Task
Smoothing
Don’t Overlook the Obvious!
The Correlation Function
Optional: Filters and Convolutions
Workshop: scipy.signal
Further Reading
More Than Two Variables: Graphical Multivariate Analysis
False-Color Plots
A Lot at a Glance: Multiplots
Composition Problems
Novel Plot Types
Interactive Explorations
Workshop: Tools for Multivariate Graphics
Further Reading
Intermezzo: A Data Analysis Session
A Data Analysis Session
Workshop: gnuplot
Further Reading
Analytics: Modeling Data
Guesstimation and the Back of the Envelope
Principles of Guesstimation
How Good Are Those Numbers?
Optional: A Closer Look at Perturbation Theory and Error Propagation
Workshop: The GNU Scientific Library (GSL)
Further Reading
Models from Scaling Arguments
Models
Arguments from Scale
Mean-Field Approximations
Common Time-Evolution Scenarios
Case Study: How Many Servers Are Best?
Why Modeling?
Workshop: Sage
Further Reading
Arguments from Probability Models
The Binomial Distribution and Bernoulli Trials
The Gaussian Distribution and the Central Limit Theorem
Power-Law Distributions and Non-Normal Statistics
Other Distributions
Optional: Case Study—Unique Visitors over Time
Workshop: Power-Law Distributions
Further Reading
What You Really Need to Know About Classical Statistics
Genesis
Statistics Defined
Statistics Explained
Controlled Experiments Versus Observational Studies
Optional: Bayesian Statistics—The Other Point of View
Workshop: R
Further Reading
Intermezzo: Mythbusting—Bigfoot, Least Squares, and All That
How to Average Averages
The Standard Deviation
Least Squares
Further Reading
Computation: Mining Data
Simulations
A Warm-Up Question
Monte Carlo Simulations
Resampling Methods
Workshop: Discrete Event Simulations with SimPy
Further Reading
Finding Clusters
What Constitutes a Cluster?
Distance and Similarity Measures
Clustering Methods
Pre- and Postprocessing
Other Thoughts
A Special Case: Market Basket Analysis
A Word of Warning
Workshop: Pycluster and the C Clustering Library
Further Reading
Seeing the Forest for the Trees: Finding Important Attributes
Principal Component Analysis
Visual Techniques
Kohonen Maps
Workshop: PCA with R
Further Reading
Intermezzo: When More Is Different
A Horror Story
Some Suggestions
What About Map/Reduce?
Workshop: Generating Permutations
Further Reading
Applications: Using Data
Reporting, Business Intelligence, and Dashboards
Business Intelligence
Corporate Metrics and Dashboards
Data Quality Issues
Workshop: Berkeley DB and SQLite
Further Reading
Financial Calculations and Modeling
The Time Value of Money
Uncertainty in Planning and Opportunity Costs
Cost Concepts and Depreciation
Should You Care?
Is This All That Matters?
Workshop: The Newsvendor Problem
Further Reading
Predictive Analytics
Topics in Predictive Analytics
Some Classification Terminology
Algorithms for Classification
The Process
The Secret Sauce
The Nature of Statistical Learning
Workshop: Two Do-It-Yourself Classifiers
Further Reading
Epilogue: Facts Are Not Reality
Appendix Programming Environments for Scientific Computation and Data Analysis
Software Tools
A Catalog of Scientific Software
Writing Your Own
Further Reading
Appendix Results from Calculus
Common Functions
Calculus
Useful Tricks
Notation and Basic Math
Where to Go from Here
Further Reading
Appendix Working with Data
Sources for Data
Cleaning and Conditioning
Sampling
Data File Formats
The Care and Feeding of Your Data Zoo
Skills
Terminology
Further Reading
About the Author
Colophon