Wes McKinney

About Wes McKinney
Since 2007, I have been creating fast, easy-to-use data wrangling and statistical computing tools, mostly in the Python programming language. I am best known for creating the pandas project and writing the book Python for Data Analysis. I am also a contributor to the Apache Arrow, Kudu, and Parquet projects within the Apache Software Foundation. I was the co-founder and CEO of DataPad. I later spent a couple of years leading efforts to bring Python and Hadoop together at Cloudera. I'm now working for Two Sigma in New York.
Author Updates
Blog post: Joint Post from Wes McKinney and Josh Patterson (9 months ago)
Allow us to reintroduce ourselves. Too often people say "let's do something together" in passing, and don't. There's the occasional inter-project collaboration, but rarely will people take that next step. There are countless reasons why this happens, and aligning goals is challenging, to say the least. But after spending the last several years working separately on related problems in the data ecosystem, we realized our best hope to mak…
Blog post (3 years ago)
About 6 years ago I gave up on Apple laptops and switched to Linux full time. When I'm at home, I prefer to work on a well-equipped desktop with a large monitor and one of my beloved Kinesis Advantage keyboards. But I am on the road a lot for work, and so I need to do a lot of hacking on the go.
My initial reasons for giving up on OS X / macOS were my frustrations with developer tooling and package management, problems now largely solved by Homebrew. I got tired of being able…
Blog post (3 years ago)
The first quarter of 2019 has now wrapped up. In March we spent a good amount of time focused on getting the 0.13.0 Apache Arrow release out the door. I will mention a few development highlights from the month and provide the full changelog of patches later in the post.
Development Highlights: We are continuing to set up our physical build and test cluster, which we'll use to run integration tests, GPU-enabled builds, benchmark comparisons, and other automated tests to help with Arro…
Blog post (3 years ago)
The team had a busy 28 days this February. The Apache Arrow community is discussing a 0.13 release toward the end of March, so we spent February helping the project toward the next release milestone. We have been pushing projects on multiple fronts and discuss some of those here.
The Apache Arrow project just had its 3rd birthday, and we are pleased to report that the community is thriving and growing fast after only a short time as a top-level project in The Apache Software Foundatio…
Blog post (3 years ago)
Ursa Labs had a busy January that went by too quickly. After a high-intensity 3 months of development, we helped release Apache Arrow 0.12 on January 20th. A good chunk of our time was spent fighting fires (in packaging and builds) related to the continued expansion of the project in recent months.
The 0.12 release contains a new merged documentation site, where you can expect more project-level documentation to appear this year.
Upcoming Focus Areas: The team is working in a nu…
Blog post (3 years ago)
The tax situation in the United States is pretty messed up.
Much ado has been made over the last week about AOC suggesting bringing back the pre-Reagan-era 70% marginal tax rate on ordinary income over $10,000,000. This means that if you made $12 million in ordinary income in a single year, you would pay 70% of the last $2 million to the federal government.
At the same time, the American taxpayer still hasn't seen Donald J. Trump's tax returns. There has been speculation about wh…
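The marginal-rate arithmetic described in that post can be sketched in a few lines (the 70% rate and $10M threshold are the figures from the post; the function name and structure here are purely illustrative):

```python
def marginal_tax_on_top_bracket(income, threshold=10_000_000, rate=0.70):
    """Tax owed on only the portion of income above the bracket threshold;
    income below the threshold is untouched by this bracket."""
    return max(0, income - threshold) * rate

# $12M of ordinary income: only the last $2M is taxed at 70%
print(marginal_tax_on_top_bracket(12_000_000))  # 1400000.0
```

The key point, as the post notes, is that a marginal rate applies only to the dollars above the threshold, not to all income.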
Blog post (3 years ago)
For ten of the last eleven years, I've lived in two places: New York City and San Francisco. The last two years have been in NYC. After founding Ursa Labs, a not-for-profit open source development group, I felt it was time to make my home somewhere that isn't either of those places. After some contemplation and consulting many friends, I decided on Nashville, Tennessee. This post explains some of my feelings about this lifestyle change, and why I hope to see an increased migration of tech wo…
Blog post (4 years ago)
This blog post discusses the design and performance implications of using bitmaps to mark null values, instead of sentinel values (special values like NaN).
How Apache Arrow's columnar format handles null values: Depending on who you talk to, one controversial aspect of the Apache Arrow columnar format is that it uses a validity bitmap to mark values as null or not null. Here's the basic idea:
The number of nulls is a property of an array, generally computed once and then st…
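The idea can be illustrated with a simplified pure-Python sketch (this is not Arrow's actual implementation, just the concept): one validity bit per value, with the null count derivable from the bitmap in a single pass.

```python
def build_validity_bitmap(values):
    """Pack one validity bit per value (1 = valid, 0 = null),
    least-significant bit first within each byte."""
    nbytes = (len(values) + 7) // 8
    bitmap = bytearray(nbytes)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

def null_count(bitmap, length):
    """Number of null slots: array length minus the bitmap's popcount."""
    valid = sum(bin(b).count("1") for b in bitmap)
    return length - valid

data = [1.5, None, 3.0, None, None]
bm = build_validity_bitmap(data)
print(null_count(bm, len(data)))  # 3
```

Because validity lives in a separate bitmap, the values buffer itself needs no reserved sentinel, which is what distinguishes this design from NaN-style null marking.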
Blog post (4 years ago)
I'm excited to announce that NVIDIA AI Labs has signed on as a supporter of Ursa Labs. NVIDIA's new open source RAPIDS data science platform uses Apache Arrow for an interoperable representation of tabular data (data frames). We are looking forward to collaborating on our respective development roadmaps and growing the ecosystem of projects that use Arrow.
This new financial support will enable us to grow our team of full-time open source software developers.
Apache Arrow: An Open…
Blog post (4 years ago)
Funding open source software development is a complicated subject. I'm excited to announce that I've founded Ursa Labs (https://ursalabs.org), an independent development lab with the mission of innovation in data science tooling.
I am initially partnering with RStudio and Two Sigma to assist me in growing and maintaining the lab's operations, and to align engineering efforts on creating interoperable, cross-language computational systems for data science, all powered by Apache Arrow.…
Blog post (5 years ago)
Well-known database systems researcher Daniel Abadi published a blog post yesterday asking "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?"
Despite the somewhat confrontational title, based on his analysis the answer is "Yes, we do", but I have a number of issues to discuss, including in part the premise of the article.
Storage and runtime formats, in context: Arrow is not competing with Parquet and ORC. Th…
Blog post (5 years ago)
Earlier this year, development of the Feather file format moved into the Apache Arrow codebase. I will explain how this has already affected Feather and what to expect from the project going forward.
Feather: how is it related to Arrow? Shortly after we announced the formation of Apache Arrow in February 2016, Hadley Wickham and I met up and discussed how we could foster more collaboration in the Python and R communities around shared infrastructure for data science. Hadley suggested (as…
Blog post (5 years ago)
This post is the first of many to come on Apache Arrow, pandas, pandas2, and the general trajectory of my work in recent times and into the foreseeable future. It is a bit of a read and fairly technical overall, but if you're interested I encourage you to take the time to work through it.
In this post I hope to explain as concisely as I can some of the key problems with pandas's internals, and how I've been steadily planning and building pragmatic, working solutions for them. To the outside…
Blog post (5 years ago)
Modern-Day Gollums: You can see the hype building. The rabid, foam-mouthed speculation. Apple is about to release a new, never-before-seen iPhone model. Edge-to-edge screen, no home button? What will it be called? You can see the upcoming queue at the Apple Store, the zombies lining up for the feeding. My precious...
For many people nowadays, the smartphone has become the single most important object in their lives. In public places (like public transportation, cafés, or resta…
Blog post (5 years ago)
TL;DR: I discuss my perspective on the recent BSD+Patents episode with Facebook Open Source.
Patents and Open Source. Disclaimer: I am not a lawyer.
Many open source software (OSS) projects led by industry giants use the Apache License 2.0 (henceforth the ASL2.0), compared with the MIT and BSD licenses, which also enjoy popularity among many other OSS projects. This is the license used by projects in the Apache Software Foundation, but you can use the license without being part…
Blog post (5 years ago)
In this post, I show how Parquet can encode very large datasets in a small file footprint, and how we can achieve data throughput significantly exceeding disk I/O bandwidth by exploiting parallelism (multithreading).
Apache Parquet: top performer on low-entropy data. As you can read in the Apache Parquet format specification, the format features multiple layers of encoding to achieve small file size, among them:
Dictionary encoding (similar to how pandas.Categorical represents data, …
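Dictionary encoding is easy to sketch in pure Python (a toy illustration of the idea, not Parquet's actual on-disk encoding): repeated values are stored once in a dictionary, and the column becomes an array of small integer codes, which is exactly why low-entropy data compresses so well.

```python
def dictionary_encode(column):
    """Return (dictionary, codes): each value is replaced by the index of
    its first occurrence, much like pandas.Categorical's categories/codes."""
    dictionary, codes, seen = [], [], {}
    for v in column:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        codes.append(seen[v])
    return dictionary, codes

# Low-entropy column: 6 strings collapse to a 3-entry dictionary
dict_, codes = dictionary_encode(["NY", "SF", "NY", "NY", "SF", "LA"])
print(dict_)  # ['NY', 'SF', 'LA']
print(codes)  # [0, 1, 0, 0, 1, 2]
```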
Blog post (5 years ago)
Over the past couple of weeks, Nong Li and I added a streaming binary format to Apache Arrow, accompanying the existing random access / IPC file format. We have implementations in Java and C++, plus Python bindings. In this post, I explain how the format works and show how you can achieve very high data throughput to pandas DataFrames.
Columnar streaming data: A common question I get about using Arrow is the high cost of transposing large tabular datasets from record- or row-oriented form…
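The general shape of such a stream can be illustrated with a toy length-prefixed framing (a schematic sketch only; the actual Arrow IPC stream format is more involved, with schema messages and aligned buffers): a writer emits each batch as a length-prefixed message, and a reader consumes batches incrementally without loading the whole stream.

```python
import io
import struct

def write_stream(batches, sink):
    # Each message: 4-byte little-endian length prefix, then the payload
    for payload in batches:
        sink.write(struct.pack("<I", len(payload)))
        sink.write(payload)
    sink.write(struct.pack("<I", 0))  # zero length marks end-of-stream

def read_stream(source):
    while True:
        (n,) = struct.unpack("<I", source.read(4))
        if n == 0:
            return
        yield source.read(n)

buf = io.BytesIO()
write_stream([b"batch-one", b"batch-two"], buf)
buf.seek(0)
print(list(read_stream(buf)))  # [b'batch-one', b'batch-two']
```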
Blog post (5 years ago)
Over the last year, I have been working with the Apache Parquet community to build out parquet-cpp, a first-class C++ Parquet file reader/writer implementation suitable for use in Python and other data applications. Uwe Korn and I have built the Python interface and integration with pandas within the Python codebase (pyarrow) in Apache Arrow.
This blog is a follow-up to my 2017 Roadmap post.
Design: high-performance columnar data in Python. The Apache Arrow and Parquet C++ libr…
Blog post (5 years ago)
Many Python libraries have been developed for interacting with the Hadoop File System, HDFS, via its WebHDFS gateway as well as its native Protocol Buffers-based RPC interface. I'll give you an overview of what's out there and show some engineering I've been doing to offer a high-performance HDFS interface within the developing Arrow ecosystem.
This blog is a follow-up to my 2017 Roadmap post.
Hadoop file system protocols: HDFS is a part of Apache Hadoop, and its design w…
Blog post (5 years ago)
2017 is shaping up to be an exciting year in Python data development. In this post I'll give you a flavor of what to expect from my end. In follow-up blog posts, I plan to go into more depth about how all the pieces fit together. I have been a bit delinquent in blogging in 2016, since my hands have been quite full with development and working on the 2nd edition of Python for Data Analysis. I am going to do my best to write more in 2017.
New position: After a productive 2 years with Cl…
Blog post (5 years ago)
In this post I discuss some recent work in Apache Arrow to accelerate converting general Arrow columnar memory to pandas objects.
Challenges constructing pandas DataFrame objects quickly: One of the difficulties in fast construction of pandas DataFrame objects is that the "native" internal memory structure is more complex than a dictionary or list of one-dimensional NumPy arrays. I won't go into the reasons for this complexity, but it's something we're hoping to do away w…
Blog post (5 years ago)
TL;DR: I discuss my impressions of the newest version of the classic Kinesis Advantage contoured mechanical keyboard, the Advantage2.
Mechanical keyboards: Mechanical keyboards have become a big business over the last 5 years or so, with clackity-clack Cherry MX key switches becoming all the rage among programmers and gamers alike. In the age of ever-thinner laptop keyboards (and Apple even getting rid of physical buttons and keys in recent MacBook Pros), the strong tactile feedback and sat…
Blog postTL;DR One of the most harmful parts of the GitHub platform is the code contribution calendar. This "hacker score card" overemphasizes the value of commits over the other kinds of important contributions to open source projects, like doing code reviews and discussing bugs and new features on the issue tracker.
A skewed view of reality We love it and we hate it: the GitHub contribution calendar.
Some of the common grievances against the calendar, until very recently, w6 years ago Read more -
Blog post (6 years ago)
Kinesis Corporation has produced a long-awaited update to their Savant Elite line of foot pedals. If you find yourself with wrist or RSI pain, you might consider giving them a look.
Disclaimer: Kinesis Corporation sent me an evaluation model of the SE 2 foot pedal. As a long-time fan (and advocate for anyone who helps people overcome RSI problems), I agreed to write a blog post about it!
Being that weird person with the foot pedals: click click click.
"What is that noise?…
Blog post (6 years ago)
Summary: Feather's good performance is a side effect of its design, but the primary goal of the project is to have a common memory layout (Apache Arrow) and metadata (type information) for use in multiple programming languages.
Feather performance: Several people asked me about Matt Dowle's blog post about fast CSV writing. I say: bravo!
The dirty secret of Feather's performance is that neither Hadley nor I spent much effort on performance optimization. Through the project's c…
Blog post (6 years ago)
Summary: I explain the relationship between Feather and Apache Arrow in more technical detail.
Memory representation and file formats: I was recently asked to explain the difference between Apache Arrow (providing a standard in-memory columnar representation) and Feather (a file format using Apache Arrow).
Before going deeper into Feather and Arrow, let's look at how memory representations (probably more commonly called data structures) can lead to file fo…
Blog post (6 years ago)
Summary: I raise some follow-up points to yesterday's post on community-governed packaging efforts.
The problem with conda-forge today: As discussed in my blog post yesterday, conda-forge may offer the way forward to create a community-governed package repository that meets the standards already established in the Linux, R, and other open source communities.
The single biggest problem with conda-forge right now is that its hosting of build artifacts depends on closed-source sof…
Blog post (6 years ago)
Summary: It's finally time we worked as a community to create a reliable, community-governed repository of trusted Python binary package artifacts, just as Linux, R, Java, and many other open source tool ecosystems have already done. Enterprise-friendly platform distributions do play an important role, though. I examine the various nuances within. I also talk about the new conda-forge project, which may offer the way forward.
Python environment management hell: a personal story. When…
Blog post (6 years ago)
Summary: It's much easier to create impressive demos than it is to create dependable, functionally comprehensive production software. I discuss my thoughts on this topic.
Post-Conference Lows: Last week was the annual California Strata-Hadoop World conference. I've now been to some variant of the Strata conference 12 times since Fall 2011.
Having worked more than 8 of the last 10 years on building production-grade data analysis systems, conferences like Strata have grown em…
Blog post (6 years ago)
Unsigned integers (size_t, uint32_t, and friends) can be hazardous, as signed-to-unsigned integer conversions can happen without so much as a compiler warning.
An example: size_t as an index variable. Occasionally, discussions come up about using unsigned integers as index variables for STL containers (whose size() method returns an unsigned type). So the debate is effectively between these two alternatives:

#include <iostream>

void do_something_unsigned(size_t i) {
    std::cout << …
Blog post (6 years ago)
Many people have asked me about the proliferation of DataFrame APIs like Spark DataFrames, Ibis, Blaze, and others.
As it turns out, executing pandas-like code in a scalable environment is a difficult compiler engineering problem: composable, imperative Python or R code must be translated into a SQL or Spark/MapReduce-like representation. I show an example of what I mean and some work that I've done to create a better "pandas compiler" with Ibis.
An example: compa…
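What "translating composable code into SQL" means can be sketched with a toy deferred-expression builder (purely illustrative; Ibis's real API and compiler are far more sophisticated): method calls accumulate state instead of executing eagerly, and a compile step renders the expression to SQL.

```python
class Table:
    """Toy deferred expression: filter/select build up state, and
    compile() renders it to SQL rather than executing anything."""
    def __init__(self, name, predicates=None, columns=None):
        self.name = name
        self.predicates = predicates or []
        self.columns = columns

    def filter(self, predicate):
        return Table(self.name, self.predicates + [predicate], self.columns)

    def select(self, *columns):
        return Table(self.name, self.predicates, list(columns))

    def compile(self):
        sql = "SELECT {} FROM {}".format(", ".join(self.columns or ["*"]), self.name)
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql

expr = Table("funding").filter("amount > 100").select("name", "amount")
print(expr.compile())  # SELECT name, amount FROM funding WHERE amount > 100
```

The hard part, which this toy skips entirely, is handling the full breadth of imperative pandas-style operations and choosing efficient physical plans for them.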
Blog post (6 years ago)
TL;DR: At the risk of stating the obvious, manual management of files on disks in 2016 is increasingly old-fashioned and largely unnecessary, especially among the non-technorati. Encapsulated / managed cloud services and consumer web applications have made it anachronistic for most normal people. Whether this is a good thing can be debated, but it is happening nonetheless.
In this post, I explore this topic in some detail as it relates to my personal experience.
Data packratry…
Blog post (6 years ago)
I'm super excited to be involved in the new open source Apache Arrow community initiative. For Python (and R, too!), it will help enable:
Substantially improved data access speeds
Closer-to-native-performance Python extensions for big data systems like Apache Spark
New in-memory analytics functionality for nested / JSON-like data
There are plenty of places you can learn more about Arrow, but this post is about how it's specifically relevant to pandas users. See, for example:
"Py…
Blog post (7 years ago)
http://github.com/wesm/ib-flex-analyzer
I published a small set of tools to parse, clean, and pandas-ify XML flex statements from Interactive Brokers. I love IB, but they aren't the best at giving you detailed portfolio analysis to understand how you are succeeding or failing at personal trading. Luckily, we have nice things in Python!
The project will help you:
Roll up options P&L by underlying security and compare stock vs. option P&L by underlying (very helpful if you wri…
Blog post (7 years ago)
I really enjoyed the cheeky blog post by my pal Rob Story.
Like many other data tool creators, I've been annoyed by the assorted "Python vs. R" click-bait articles and Hacker News posts by folks who, in all likelihood, might not survive an interview panel with me on it.
The worst part of the superficial "R vs. Python" articles is that they add noise where there ought to be more signal about some of the real problems facing the data science community. Let…
Blog post (7 years ago)
Python's mock module (unittest.mock in Python 3.3 and higher) allows you to observe the parameters passed to functions.
I'm a little slow, so I had to dig around to figure out how to do this.
Let's say you have a class:

class ProductionClass:
    def user_api(self, a, b=False):
        if b:
            return self._internal_api_one(a)
        else:
            return self._internal_api_two(a)

    def _internal_api_one(self, a):
        # Do something necessary with a
        pass

    def _inte…
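A minimal sketch of how unittest.mock can observe those internal calls (ProductionClass is reconstructed from the excerpt; the stubbed method bodies here are assumptions):

```python
from unittest import mock

class ProductionClass:
    def user_api(self, a, b=False):
        if b:
            return self._internal_api_one(a)
        return self._internal_api_two(a)

    def _internal_api_one(self, a):
        pass  # stand-in body for this sketch

    def _internal_api_two(self, a):
        pass  # stand-in body for this sketch

# Patch the internal method on one instance and observe the arguments
obj = ProductionClass()
with mock.patch.object(obj, "_internal_api_one") as fake:
    obj.user_api(42, b=True)
    fake.assert_called_once_with(42)  # passes: called exactly once with 42
```

patch.object restores the real method when the with-block exits, so the patching never leaks into other tests.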
Blog post (7 years ago)
Selling your stuff on Amazon is a losing game, and I don't recommend that you do it.
Let me tell you what happened to me recently.
I wasn't using my iPad, so I decided to sell it on Amazon.
Boom, sold. February 10 it sells; I ship with Shyp the same day.
23 days later, the buyer requests to return it to me with the comment: "Perfectly described and in great condition. Was shipped fast. But had to return since I received another one from work."
Aft…
Blog post (7 years ago)
I turned 30 on Friday.
I threw a party. Nearly all my favorite people in the Bay Area (with some exceptions: travel and life and so forth) came to wish me well and be merry. It was wonderful.
The decade I just completed has been quite the personal journey, with growth and change in my perspective on life. I've been incredibly lucky in more ways than I can count. And it's probably true that, as Thomas Jefferson said, "I am a great believer in luck, and I find the harder I work,…
Blog post (8 years ago)
After some unanticipated media leaks (here and here), I was very excited to finally share that my team and I are joining Cloudera. You can find all the concrete details in those articles, but I wanted to give a more intimate perspective on the move and what we see in the future inside Cloudera Engineering.
Chang She and I conceived DataPad in 2012 while we were building out pandas and helping the PyData ecosystem get off the ground. I was writing a book, and every 6 week…
Blog post (9 years ago)
I was excited to be able to talk at two recent data-centric conferences in New York. The talks touch on some related subjects, with the PyData talk being a lot more technical, having to do with low-level architecture in pandas and engineering work I've been doing this year at DataPad.
Before anyone yells at me: I'm going to revisit the PostgreSQL benchmarks in my PyData talk at some point, as the performance would be a lot better with the data stored in fully normalized form (a single fa…
Blog post (9 years ago)
I was graciously invited to give the keynote presentation at this year's PyCon Singapore. Luckily, I love to hack on long plane rides. See the slides from the talk below.
During the talk I showed some analytics on Python posts on Stack Overflow; here is the IPython notebook. The raw data is right here.
I also gave a half-day pandas tutorial; here is the IPython notebook. You will need the data to do it yourself; here's a download link.
PyCon Singapore 2013 Keynote from We…
Blog post (9 years ago)
PyCon and PyData 2013 were a blast this last week. Several people noted that my GitHub activity on pandas hasn't quite been the same lately and wondered if I was getting a little burned out. I'm happy to say quite the opposite; I'll still be involved with pandas development (though not the 80 hours/week of the last 2 years), but I'm starting an ambitious new data project that I'm looking forward to sharing later this year. This endeavor is also taking me from New York to San Francisco. I'm sa…
Blog post (9 years ago)
We're hard at work as usual getting the next major pandas release out. I hope you're as excited as I am! An interesting problem came up recently with the ever-popular FEC Disclosure database used in my book and in many pandas demos. The powers that be decided it would be cool to put commas at the end of each line, fooling most CSV readers into thinking there are empty fields at the end of each line:

In [4]: path
Out[4]: '/home/wesm/Downloads/P00000001-ALL.csv'

In [5]: !head $path -n 5…
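One way to cope with those trailing commas (a stdlib sketch; the post itself handles this in pandas, and the sample rows below are made up, merely shaped like the FEC file) is to drop the spurious empty field before handing rows downstream:

```python
import csv
import io

# Illustrative data only: every line ends with a trailing comma
raw = (
    "cand_id,cand_nm,contb_amt,\n"
    "P00000001,Bachmann,250.0,\n"
    "P00000001,Romney,100.0,\n"
)

def read_without_trailing_field(text):
    # Each line ends with a comma, so csv.reader sees one extra empty
    # column; strip it when present.
    for row in csv.reader(io.StringIO(text)):
        if row and row[-1] == "":
            row = row[:-1]
        yield row

rows = list(read_without_trailing_field(raw))
print(rows[0])  # ['cand_id', 'cand_nm', 'contb_amt']
```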
Blog post (10 years ago)
I taught a class this past Monday, June 18, at General Assembly. Here are the (very brief) slides and a link to the IPython notebooks. You'll need at least pandas 0.8.0b2, though unfortunately I identified a few bugs during the class that have since been fixed. Look out for the final release of pandas 0.8.0 any day now.
Intro to Python for Financial Data Analysis from Wes McKinney
Blog post (10 years ago)
Making time zone handling palatable is surprisingly difficult to get right. The generally agreed-upon best practice for storing timestamps is to use UTC. Otherwise, you have to worry about daylight saving time ambiguities or non-existent times. The misery of time zone handling is well documented, and was summarized nicely last year by Armin Ronacher. When you work in UTC, most of your troubles go away; converting a single timestamp or array of timestamps between time zones becomes in…
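A minimal stdlib sketch of the "store in UTC, convert at the edges" practice (zoneinfo is Python 3.9+; the post itself works with pandas timestamps):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Store the timestamp in UTC...
summer = datetime(2012, 7, 1, 12, 0, tzinfo=timezone.utc)
winter = datetime(2012, 12, 1, 12, 0, tzinfo=timezone.utc)

# ...and convert to a local zone only for display. DST is handled for us:
# New York is UTC-4 in July (EDT) and UTC-5 in December (EST).
eastern = ZoneInfo("America/New_York")
print(summer.astimezone(eastern).hour)  # 8
print(winter.astimezone(eastern).hour)  # 7
```

Because the stored values are unambiguous instants, the DST gaps and overlaps only matter at conversion time, never at storage time.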
Blog post (10 years ago)
I'm on my way back from R/Finance 2012. Those guys did a nice job of organizing the conference, and it was great to meet everyone there.
As part of pandas development, I have had to develop a suite of high-performance data algorithms and implementation strategies, which are the heart and soul of the library. I get asked a lot why pandas's performance is much better than R and other data manipulation tools. The reasons are two-fold: careful implementation (in Cython and C, so minimizing…
Blog post (10 years ago)
In time series data, it's fairly common to need to compute the last known value "as of" a particular date. However, missing data is the norm, so it's a touch more complicated than a simple binary search. Here is an implementation using array operations that takes these things into account:

def asof_locs(stamps, where, mask):
    """
    Parameters
    ----------
    stamps : array of timestamps
    where : array of timestamps
        Values to determine the "as of" for…
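The idea behind that routine can be sketched with stdlib bisect (a simplified scalar stand-in for the vectorized array version; only the function name and parameters come from the excerpt): for each "where" timestamp, find the last stamp at or before it, then step back past masked-out (missing) entries.

```python
from bisect import bisect_right

def asof_locs_sketch(stamps, where, mask):
    """For each timestamp in `where`, return the index in sorted `stamps`
    of the last valid (mask[i] is True) observation at or before it,
    or -1 if there is none."""
    locs = []
    for t in where:
        i = bisect_right(stamps, t) - 1  # last stamp <= t
        while i >= 0 and not mask[i]:    # skip missing observations
            i -= 1
        locs.append(i)
    return locs

stamps = [1, 3, 5, 7]
mask = [True, False, True, True]  # the value at stamp 3 is missing
print(asof_locs_sketch(stamps, [2, 4, 8], mask))  # [0, 0, 3]
```

Note how the query at time 4 falls back to index 0: the observation at stamp 3 exists but is masked as missing, which is exactly the wrinkle that makes this more than a plain binary search.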
Blog post (10 years ago)
Yesterday I found myself adding some additional Cython methods for doing fast grouped aggregations in pandas. To my disappointment, I found myself duplicating a lot of code, without much alternative beyond cooking up some kind of ad hoc code generation framework. So, here's the problem: an array of data (possibly with NAs) and an array of labels, with the number of distinct labels known (ranging from 0 to N - 1). Aggregate the data by label using some function in {sum, min, max, product, mean, v…
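The problem statement maps to a straightforward reference implementation (a pure-Python sketch of what the Cython kernels specialize per aggregation function; the function name here is illustrative): aggregate values by integer label, skipping NAs, with labels known to lie in [0, N).

```python
def group_sum(values, labels, num_labels):
    """Sum `values` by group label, skipping None (NA) entries.
    Labels are integers in [0, num_labels)."""
    out = [0.0] * num_labels
    for v, lab in zip(values, labels):
        if v is not None:  # skip NAs, as the Cython kernels do
            out[lab] += v
    return out

values = [1.0, None, 2.5, 4.0, None]
labels = [0, 0, 1, 1, 1]
print(group_sum(values, labels, 2))  # [1.0, 6.5]
```

The code-duplication pain in the post comes from needing a near-identical loop for each function in {sum, min, max, product, mean, ...}, differing only in the accumulator update, which is what invites code generation.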