The Data Canteen: Episode 15
Matt Harrison: Effective Pandas
Matt Harrison is one of the foremost experts on Python and Pandas. He holds a degree in computer science from Stanford University, possesses experience in roles ranging from Senior Software Engineer to CTO, and operates a consulting firm named MetaSnake.
In this episode, Host Ted Hallum and Matt dive into the most crucial, powerful, challenging, and enjoyable content in his new book entitled Effective Pandas!
FEATURED GUEST:
Name: Matt Harrison
LinkedIn: https://www.linkedin.com/in/panela/
Twitter: https://twitter.com/__mharrison__
SUPPORT THE DATA CANTEEN (LIKE PBS, WE'RE LISTENER SUPPORTED!):
Donate: https://vetsindatascience.com/support-join
EPISODE LINKS:
For purchases at Matt's MetaSnake store use promo code "VET" at checkout for a 45% discount!
Effective Pandas (ebook): https://store.metasnake.com/effective-pandas-book
Effective Pandas (paperback): https://tinyurl.com/effective-pandas-paperback
Python for Data Scientists and Engineers (course): https://store.metasnake.com/py4ds
Effective Pandas (course): https://store.metasnake.com/effective-pandas
PODCAST INFO:
Host: Ted Hallum
Website: https://vetsindatascience.com/thedatacanteen
Apple Podcasts: https://podcasts.apple.com/us/podcast/the-data-canteen/id1551751086
YouTube: https://www.youtube.com/channel/UCaNx9aLFRy1h9P22hd8ZPyw
Stitcher: https://www.stitcher.com/show/the-data-canteen
CONTACT THE DATA CANTEEN:
Voicemail: https://www.speakpipe.com/datacanteen
VETERANS IN DATA SCIENCE AND MACHINE LEARNING:
Website: https://vetsindatascience.com/
Join the Community: https://vetsindatascience.com/support-join
Mentorship Program: https://vetsindatascience.com/mentorship
OUTLINE:
00:00:00 - Introduction
00:00:51 - How Matt got started with Python and Pandas
00:09:59 - What is Pandas
00:13:42 - Pandas is just as valuable, or perhaps more valuable, as an API
00:16:46 - Pandas tips for R users
00:21:53 - Prerequisite knowledge for getting the best mileage out of Pandas
00:29:50 - Promo code "VET" for 45% off at the MetaSnake store
00:30:17 - Some of Matt's strong opinions on how one should write Pandas code
00:41:01 - The most critical chapter of Effective Pandas for new learners
00:42:21 - The most useful chapter of Effective Pandas for intermediate/expert-level users
00:43:40 - The most challenging chapter of Effective Pandas for new learners
00:46:23 - The chapter of Effective Pandas that Matt found most enjoyable to write
00:50:07 - Effective Pandas the course
00:52:36 - Listener questions answered
01:13:03 - Matt's 2nd favorite Python package (jupytext)
01:16:40 - Matt's current learning focus
01:21:42 - The most exciting thing Matt sees on the horizon for Pandas
01:24:22 - Farewells
Transcript
DISCLAIMER: This is a direct, machine-generated transcript of the podcast audio and may not be grammatically correct.
[00:00:07] Ted Hallum: Welcome to the Data Canteen, a podcast focused on the careers of data scientists and machine learning engineers who share in the common bond of US military service. I'm your host, Ted Hallum. Today I'm chatting with Python and pandas expert Matt Harrison. Matt's been using Python for over 20 years.
He's a best-selling author, a sought-after conference speaker, and he runs a consulting and training firm called MetaSnake. Today, Matt and I chat primarily about pandas. We cover what pandas is, his new book entitled Effective Pandas, and how it can help you level up your data wrangling abilities in Python. Also, Matt fields pandas questions submitted by you, our listeners. We cover all that and more.
I hope you enjoy the rest of this conversation and here we go.
[00:00:51] Ted Hallum: Matt, thank you so much for coming on the Data Canteen. Looking over your background, boy, what a story. So I see multiple CTO roles. You've been a VP of data science. You've been an instructor of Python data science at multiple universities. Obviously you've published multiple books.
We're going to talk about your latest book here today. There's no question, when you look at your body of work, there's an incredible passion in you for Python and pandas. I think it's fair to say pandas specifically, 'cause you have a couple of books specifically on that. So I would love to hear just a quick summary of your data science journey through all of that, and what cultivated this love for Python and pandas.
Sure,
[00:01:37] Matt Harrison: Thanks for having me on, first of all. Happy to be here, and thanks for serving a community that I think is very interested in leveling up. And yeah, so my background is I have a computer science degree, and I graduated in the year 2000, and that was the same year that I basically started doubling down on Python.
So I went to work out of school at a company that was doing some natural language processing, and my journey to Python was this: the very smart PhD person who I was assigned to work with on a little project was a Tcl user, the programming language Tcl. And when I was in school, someone had told me, if you learn Perl, I can get you a job.
And so I went and bought a book on Perl and basically taught myself, and it turned out to be true: it got me a job. And so at that point in time, I mean, in school I had done Java and C and Lisp, but Perl was a lot easier than those. And so I was like, okay, yeah, Perl's my thing. And this colleague who I was working with, we were both sort of butting heads, because I'm like, let's just do it in Perl.
And he was like, let's do it in Tcl. And neither one of us wanted to use the other's language. And so it turned out that instead of doing that, we compromised on Python. We're like, well, there's this language called Python. Neither of us have used it, but it looks interesting. Maybe we should try that.
And that compromise turned out to be, I guess, a big turning point in my career, because I've used Python ever since and kind of never looked back. And in the meantime I've done a lot of JavaScript, and I've actually done some Clojure and some C# and Java as well, but Python has been, I guess, the thing that fits my brain.
So from the start I've had experience in data. I worked for a couple of startups, one doing some open source stuff, another one doing some business intelligence. And so, I guess, to bring in the pandas side: I wrote basically a reporting backend, an OLAP engine for generating reports, written in Python.
And a few years after I had written that, I was at PyCon and went to a talk that was about pandas. And that was, I guess, another turning point. I had my code that did data analysis with Python, but there's this pandas library. And the nice thing about this pandas library is that it is leveraging NumPy under the covers.
So for those of your listeners who aren't aware, Python's probably the most popular data science language (probably the most popular language right now, but certainly the most popular for data science), and I'm not trying to bag on any other languages or whatnot. And certainly you can do cool things with other languages.
But Python has a problem, in that Python is not a fast language. Python is actually a slow language relative to, like, C or Rust or Java, just due to how Python works under the covers. But previous to that, this library called NumPy had been created. And the, I guess I could say, dirty secret of NumPy was this: if I have a list of Python numbers, each one of those numbers, like if they're an integer, has some overhead, because Python essentially does this thing called boxing, where it doesn't give you a raw integer; it gives you a Python object, and then the integer is inside of it.
So there's quite a bit of overhead for all of those objects. It makes it easy to write code with, but it also makes it slow, and you can have memory overhead. So what NumPy did is they said, okay, instead of giving you Python integers, let's just allocate a C array in RAM. And then there won't be any overhead.
We'll just have an eight-byte integer next to another eight-byte integer, next to another eight-byte integer. And if you need to do an operation to them, you just do it to the whole array, and you can leverage modern CPU architectures and SIMD instructions. And if you need to add two to it, you don't pull out the individual number.
You just say, here's the whole block, add two to everything in the whole block. And that just happens on the CPU, and it's quick. And so NumPy does that, and pandas leverages NumPy. My library was all pure Python, so while it was nice and easy to use, it was actually pretty slow. So I'm like, okay, well, this library sort of already does the dirty work for me of speeding up my code.
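Matt's point about boxed Python integers versus NumPy's contiguous buffers can be sketched in a few lines (the array size here is arbitrary, chosen just for illustration):

```python
import numpy as np

# A plain Python list holds boxed integer objects, so "add 2"
# means visiting a million individual objects in a Python loop.
py_nums = list(range(1_000_000))
py_plus_two = [n + 2 for n in py_nums]

# A NumPy array is one contiguous block of 8-byte integers,
# so the same operation is applied to the whole buffer at once.
np_nums = np.arange(1_000_000, dtype=np.int64)
np_plus_two = np_nums + 2  # vectorized: no Python-level loop

print(np_nums.itemsize)  # 8 -- eight bytes per integer, no boxing
```

Timing either version (for example with `timeit`) shows the vectorized form running far faster, which is exactly the "do it to the whole block" behavior described above.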
And so I sort of started running with that. And along the way, at that point I had written a book on Python. I'd presented at PyCon and other conferences. I was the Utah Python user group president, I guess. And so as I was, you know, participating in the community and putting stuff out there, eventually there was enough influx of people wanting help with Python and data tools that I started MetaSnake, which does corporate training and consulting. So I've been doing that for the past while. And recently, as you mentioned, I, I guess, released my third pandas book, Effective Pandas. I say third because around 2015 or '16 I released my first pandas book, which was Learning the Pandas Library.
Later on, I was approached to write the second edition of the Pandas Cookbook. So I did not write the original Pandas Cookbook, but I had read it and I liked what was in there, and they wanted to update that. So I basically added a few chapters and did the update. However, there was, I guess, an itch in me to revisit my Learning the Pandas Library book.
Just because, you know, I'd written it a few years back, and in the meantime I'd taught thousands of people pandas in corporate environments and online trainings. I had done a bunch of consulting with pandas, and I'd basically read a lot of content about pandas and come to have some strong opinions.
We can talk about that later. And so eventually, you know, what was going to be like a second edition of Learning the Pandas Library turned out to be a complete rewrite of the book. So while it is technically, I guess, the second edition, there's very little common content in there.
So that's maybe a long-winded story of my background. I've been using Python for over 20 years, really happy with it, happy with the community there, and have been using it recently in more data science-y roles as well.
[00:09:01] Ted Hallum: I love that story. And I especially like that you went back and highlighted how it was fortuitous, in the sense that you were comfortable in Perl, which was a popular language at the time. You got into that language because it offered career prospects. And then it just so happened that, out of luck or Providence, you're working with this other person who wants to use a different language, and then you kind of land on Python as a compromise, and then really that ended up being your whole career. And so often it's like that: at first it's just some happenstance thing that ends up sparking what we're going to do for the next 10 or 15 years.
[00:09:42] Matt Harrison: Yeah. And that really was luck, because at that time, Python was not at all what it is today. It was an unknown language. We kind of had to hide that we were using it at this company. So it really was Providence or luck that Python turned out to be what it is today.
[00:09:59] Ted Hallum: Now for a lot of our users, when we talk about pandas, they're going to immediately know what we're talking about.
The spectrum of our listenership is everything from extremely experienced people, to folks who are just getting into data science. And then there are other people who have gotten into data science, but maybe they're using other languages. So you know, with pandas, we get a couple of specific data structures that add to what we get with just pure Python.
Can you go into detail about what we get with pandas in that regard? Sure.
[00:10:30] Matt Harrison: Yeah. So for those who aren't familiar with pandas, I like to describe pandas as an in-memory NoSQL database. What do I mean by that? I mean that with pandas the library, when you use it, you need to be able to hold your data in memory.
And then pandas basically allows you many of the things that you would do with a database, but you're not speaking SQL. So you can filter, you can slice, you can do pivoting, you can do aggregations, those sorts of things. And I like to say it's for tabular data. Tabular data is data that is in a table that you would see in a database or in a spreadsheet.
And so if you don't have tabular data, pandas might not be the right tool for you. But a lot of people do have tabular data; sort of, Excel rules the world. And so a lot of what you could do with Excel, you can do with pandas. And people have a lot of information in databases, and so you can suck that out and manipulate it as well.
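As an illustration of those database-style operations on a tabular dataset (the table and column names here are invented for the example):

```python
import pandas as pd

# A tiny tabular dataset, like one sheet of a spreadsheet.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "product": ["a", "a", "b", "b"],
    "sales": [10, 20, 30, 40],
})

east = df[df["region"] == "east"]             # filter (SQL: WHERE)
totals = df.groupby("region")["sales"].sum()  # aggregate (SQL: GROUP BY)
pivot = df.pivot_table(index="region", columns="product", values="sales")
```

Each line is the pandas spelling of an operation you would otherwise write as SQL or click through in a spreadsheet, which is the "in-memory NoSQL database" framing above.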
The main data structures are a data frame and a series. And if you grok the basics of those, that will put you well on your way to understanding pandas. So let's go with a series first. A series you can think of as a column from a database or a column from a spreadsheet. And again, the special sauce of pandas is that pandas does not represent that column as a list of Python objects, but rather it's using NumPy to say, here's basically a buffer of memory, and it's an optimized storage mechanism, but also optimized for doing computation.
But you get a Python interface for doing that. The other main data type is a data frame, and that's analogous to a database table or a spreadsheet, just a sheet. And so you can think of a table as a bunch of columns. And so a data frame is a bunch of series, and it has a bunch of operations that you can do on it.
And it turns out that for both the series and the data frame, you can inspect their attributes. You can use the dir built-in function in Python, and that will list the attributes that an object has or has access to. And there are over 400 attributes on both a series and a data frame. If you look at the intersection of those, what's common to both, there are over 300 attributes that are common to both.
So pandas provides a rich API, a very large API, but that can also be overwhelming for many. I think it's overwhelming for me. But my claim is that if you understand that basically a series is one-dimensional, a data frame is two-dimensional, and that a lot of things you can do with one or the other, you're just doing them on one dimension or two dimensions, that can sort of ease a lot of the cognitive burden and overhead of understanding pandas.
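The attribute counts Matt quotes can be checked directly with `dir()`; the exact numbers vary by pandas version, so treat them as ballpark figures:

```python
import pandas as pd

# dir() lists every attribute an object exposes; comparing the two
# sets shows how much of the API Series and DataFrame share.
series_attrs = set(dir(pd.Series))
frame_attrs = set(dir(pd.DataFrame))
common = series_attrs & frame_attrs

print(len(series_attrs), len(frame_attrs), len(common))
```

Shared names like `mean`, `groupby`, and `head` land in `common`, which is the "same operation, one dimension or two" idea in concrete form.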
[00:13:42] Ted Hallum: Absolutely. Now, while we're kind of on this topic of just defining what pandas is: when I was looking at your blog, I saw a really interesting point that you made. You said that you felt like, at this point, Python was just as valuable, or perhaps more valuable, as an API than as a package. So I wondered if you would just touch on that before we proceed.
[00:14:06] Matt Harrison: I think you meant pandas as an API, not Python as an API.
Yes. Yes.
[00:14:11] Ted Hallum: Thank you. I apologize. Yes. Pandas.
[00:14:13] Matt Harrison: Yeah, yeah. And so, like I said, Python is the most popular language for data science. And if you are manipulating tabular data, pandas is the most popular tool for doing that. And so at this point in time, there's a lot of people who have a lot of production code in pandas, and there's a lot of processes that are being controlled by pandas.
There's a lot of machine learning, a lot of reporting, a lot of ETL that is done with pandas now. Is pandas perfect? If you read my book or you listen to me talk about pandas or Python in general, I'm the first to admit that these tools are not perfect. And I think it does a disservice to my audience to come in and claim that if you use Python or pandas, you will never have any issues.
I mean, I'm teaching a pandas course this week, and just yesterday came across an issue that's like, oh, this is really annoying, or this is a bug, right? And so I don't think it does my audience any benefit to claim that you'll never have any problems with these. However, at this point we sort of are where we are.
This library has sort of taken the, I guess, data community by storm, but there are drawbacks to it. And so I've got a list here of some notes I've taken, and I've got a list of alternate platforms. Basically, at this point in time, you've got pandas the library, but pandas has this API, right?
That's got these 400-plus attributes. And on here I've got 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13. So I've listed like 13 different libraries that claim to implement most or some portion of the pandas API. So at this point it's not just pandas the library. And I think, going forward, making an investment in pandas is a wise investment.
Because again, you have a huge number of Python users. But also, if you're not using pandas the library, it is quite possible that you will be using pandas the API, the interface that pandas has, to leverage some other library that might be optimized in ways that pandas is not optimized.
[00:16:44] Ted Hallum: Got it. Okay.
That makes a lot of sense. Now that we've established what pandas is, for the portions of our audience who are primarily R users: you know, of course at this point in time there's two main languages in use for data science, Python and R. Now in R, there is a native data frame data structure.
And so they may be very familiar with the concept of a data frame, but within the context of R, and this instantiation in pandas might be a little bit new to them. Now, you've had tremendous experience training people. I'm sure some of them were R users who are looking to expand their knowledge base to include Python and pandas. For those folks who have hands-on experience with data frames in R, and then they start the transition of trying to learn how to leverage data frames in Python through pandas:
Are there any friction points that tend to crop up as they make that transition over to Python and pandas?
[00:17:51] Matt Harrison: That's a great question, Ted. And I'd actually say I'm probably not the best person to answer it, because, you know, like I said, I've sort of been heads-down in Python land, and Python sort of filled my needs.
However, I haven't really used R in anger, so I don't have much experience with that. But my understanding is that, for example, in R, you can use things like apply; there's an apply method, and if you use that, it makes things speedy. It allows you to vectorize operations.
It turns out that in pandas, if you use apply, it generally slows things down. And so that may be an anti-pattern, or something that does not map cleanly from R to Python. Some other things that I've seen as friction points or complaints from R people are just sort of the syntax: that, you know, R maybe has a consistent syntax, whereas pandas maybe does not.
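A quick sketch of the anti-pattern Matt mentions: the two lines below produce identical results, but `.apply` calls a Python function once per element, while the vectorized form runs as a single NumPy operation under the hood.

```python
import pandas as pd

s = pd.Series(range(100_000))

doubled_slow = s.apply(lambda x: x * 2)  # one Python call per element
doubled_fast = s * 2                     # one vectorized operation

print(doubled_slow.equals(doubled_fast))  # True -- same values either way
```

Benchmarking the two (e.g. with `%timeit` in Jupyter) makes the gap obvious, which is why the R intuition that "apply makes things fast" does not carry over.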
So there is, I would say, some minimal level of Python that's needed to be understood. And then on top of that, to really be effective with pandas, I would advise that in addition to understanding how to, you know, make instances of classes and then call methods, you also need to understand, like, slicing syntax. In my world,
there wouldn't be a slice syntax or an index operation; it would all just be methods. It just makes it easier for people who don't have a lot of programming experience to have one way to do it, rather than: if you're going to do this, you do .iloc and you slice it, otherwise you call some method. That's a lot of cognitive overhead, and just syntax for the sake of syntax.
It doesn't really make sense to people who aren't familiar with that. Also, once we start doing some more advanced things in pandas, we really do want to understand some of these functional programming constructs. So there's the notion of a first-class function: pandas does allow you to pass a function into certain operations, such as the pipe method or the assign method.
And what that allows you to do is pass in the current state of the data frame and take operations on that. So if you understand how to do lambdas, it's going to make your life a lot easier there as well. Another thing that might be a challenge or new to R users, and people in general just coming to pandas, is maybe this notion of list comprehensions. Python has a comprehension syntax that allows you to basically, if you're looping over a list or a sequence and you're doing some mapping operation or filtering operation to it, rewrite that in a single line of code using a comprehension construct.
And so those turn out to be pretty useful. And are they a hard requirement? No, they're not a hard requirement. But oftentimes, if you know how to use a comprehension, because it is not a statement in Python but an expression, you can embed it in places where you wouldn't be able to use a for loop directly.
And it allows you to write a few fewer lines of code than you would otherwise. So for those of us who are lazy programmers, who like to use the computer to automate things, a comprehension construct can be useful.
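Both of those ideas, lambdas passed to `assign`/`pipe` (so each step sees the current state of the frame) and comprehensions as embeddable expressions, can be sketched together; the column names below are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# The lambda receives the frame as it exists at that point in the
# chain -- here, `total` exists by the time pipe's lambda runs.
result = (
    df
    .assign(total=lambda d: d.price * d.qty)
    .pipe(lambda d: d[d.total > 15])
)

# A comprehension is an expression, so it can sit where a
# for statement could not -- e.g. building a column list inline.
numeric_cols = [c for c in df.columns if df[c].dtype != object]
```

The `lambda d:` pattern matters mid-chain because `df` itself never gains a `total` column; only the intermediate frame flowing through the chain does.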
[00:21:20] Ted Hallum: Well, Matt, you may not consider yourself an R aficionado, but you certainly did a great job of identifying a few key friction points for our audience who might be looking to transition from R over to Python and pandas.
So I really appreciate that you gave them some big tips and shortcuts there that they would probably have to learn the hard way otherwise. For the rest of the conversation, we're going to continue to talk about pandas, the use of data frames with pandas, within the context of your new book called Effective Pandas, which you mentioned earlier.
So as we get into this conversation, not everybody will necessarily be at the point yet where they're a good candidate for the content of the book. I would imagine there may be some prerequisite knowledge where people could get all the mileage possible out of what you've provided there. So what would you consider to be prerequisite knowledge for Effective Pandas?
[00:22:15] Matt Harrison: Yeah, that's a great question. I find, Ted, that in a lot of my training, oftentimes I am training people who want to up their Python game, who are developers or DevOps people or back-end developers, or who are Python developers; that's their day-to-day. And so I do teach a lot of that with my training, but I also find that a lot of my clients are people who are not programmers.
They didn't go to school to be programmers. They don't want to be a programmer. But they want to use Python, or various libraries in Python, not as a programmer but as a tool to get a job done. And I think that's just expanding, especially because if you go to PyPI, the Python Package Index, pypi.org, you'll see that there are over 350,000 packages that you can use with Python.
So once you've sort of bought into this ecosystem, a lot of times, for a lot of the tasks that you need to do, there's already a library that does that for you. So it makes your job really easy. So what prereqs should you have, if you want to start using pandas and maybe you don't have, you know, a lot of programming background? So there's basic Python. And you know, if you have some programming knowledge, I mean, you can teach yourself the basics of Python relatively easily.
And you can map, you know, if you have an understanding of JavaScript or C or Java, or even R, you can map these constructs at a high level. From a 30,000-foot view, most programming languages look exactly the same; you just change the syntax a little. So you can sort of do that with Python, and you can get by relatively okay. And I see a lot of people who I train who sort of self-taught and have done that. Again, as I just mentioned, there are some things that might be a little bit confusing, or that don't exist in other languages, that if you understand them, they will make your pandas life easier. So in addition to sort of the basics of Python, I would say: understanding indexing.
And so, a lot of languages have basic indexing, but Python has a thing called slicing. Indexing, which is with square brackets, lets me say, like, I've got a list and I want to take the first item of the list: you do a square bracket after the list and put a zero in there.
And that says, pull off the first item. Python is zero-based; zero is the first item. But in addition, Python allows you to do what's called a slice. And so you can say, I don't just want to pull off the first item; I want to pull off the first five items. So instead, in your square brackets you can put zero colon five, and that's going to say, start at position zero and go up to, but not including, position five. We call that the half-open interval, but it gives us five items from that.
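The indexing and half-open slicing just described, in a couple of lines:

```python
nums = [10, 20, 30, 40, 50, 60]

first = nums[0]         # indexing: the first item (Python is zero-based)
first_five = nums[0:5]  # slicing: start at 0, stop *before* 5 (half-open)

print(first)       # 10
print(first_five)  # [10, 20, 30, 40, 50]
```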
So that's slicing, and there are other languages that support that, generally through some sort of method, but Python has a syntax for it. And then if you start leveraging NumPy, NumPy has additional slicing syntax that might be confusing even for people who are used to Python slicing, because NumPy and pandas have more than one dimension.
Whereas a Python list is one-dimensional, a pandas data frame is two-dimensional. And so you can slice a data frame on what we call the index axis and on the column axis as well. So understanding the notion of slicing, and slicing not on one dimension but on two dimensions, can be something that's maybe a challenge or new to a lot of people.
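Slicing on both axes of a data frame looks like this (toy data; note that `.iloc` stops are exclusive, like list slices, while `.loc` label stops are inclusive):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

block = df.iloc[0:2, 0:2]           # rows 0-1, columns 0-1, by position
labeled = df.loc["x":"y", "a":"b"]  # the same cells, by label (inclusive)
```

That inclusive-vs-exclusive difference between `.loc` and `.iloc` is one of the two-dimensional wrinkles that trips up people coming from plain Python lists.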
And then there are those things I just talked about, Ted: lambdas. I don't think lambdas are particularly hard. They do provide some, I guess, visual syntax overload, because you're basically taking the logic of a function and putting it inline, much like a comprehension is an expression.
And you can embed an expression, rather than a Python statement: you can't put a for loop inside of a method call, but you can put a list comprehension; you can't define a function inside of a method call, but you can pass in a function, or you can define a lambda and pass in the lambda directly there.
So those are some shortcuts that, if you understand them, I would say are going to grease the skids and make your pandas use a lot easier.
[00:26:52] Ted Hallum: Now, I invited you onto the Data Canteen to talk about Effective Pandas and, you know, your broader journey, because I know that you offer such high-quality content.
Now, feel free, since we just talked about prerequisite knowledge: I happen to know you have some other books and courses about some of those topics you just talked about that would be great for our listeners who might need to get some of that foundational-level knowledge before they approach pandas.
So is there a book or a course you have that you think would really fit the bill to get people ready for what Effective Pandas has to offer?
[00:27:31] Matt Harrison: Yeah. Again, full disclosure, like you said, you can take this with a grain of salt, because I make my living, like, selling snake oil, right? I teach people how to program with Python and tell lies with data.
But I did make a course specifically as a prereq for doing data science. It's called Python for Data Scientists and Engineers. I find that a lot of Python material, including, I would say, my Python book, Illustrated Guide to Python, is generally aimed more at people who are wanting to program Python, maybe as a DevOps person or to write back-end programs. A lot of the knowledge there works for data science, but there's also a lot of knowledge, like defining classes, that, while useful, is really not something you do a lot when you're using pandas or making machine learning models. And then there are these other things, like lambdas and comprehensions, that a beginning Python book might not really go into in great detail, but that come in very useful if you're looking to understand or leverage something like pandas.
So I did create a course, Python for Data Scientists, that covers exactly that, just because I found that there was sort of a hole there for people wanting to learn Python, but not just, here's the Python course that covers everything. I mean, that's fine if you want to spend 20 hours doing that.
But I would rather have something that's catered to that.
[00:29:08] Ted Hallum: Sure. So there you go. If you're listening and you need that foundational level knowledge to get going before you jump into pandas, there's a couple of great options to get you started.
[00:29:20] Matt Harrison: And Ted, I'll put in a code, we talked about having a code, so I'll put that course on the code as well, so your listeners can get a discount.
[00:29:30] Ted Hallum: Yeah, absolutely. So this is a great point. I'll go ahead and throw out the link so that people can see the link to your store. This is the exact link, actually, where folks can get a digital copy of the book Effective Pandas. Of course, if you were to just take that last part of the URL off, that would take you to the general store where the other books and courses that Matt mentioned are located.
And then, as Matt just alluded to, he has very generously given us a discount that you can apply at checkout, and it's just capital V-E-T, and that'll get you 45% off. So I would definitely recommend making sure that you use that discount code if you make a purchase at that store, excuse me, Matt's store, which is store.metasnake.com.
Okay. So diving into the actual Effective Pandas book: as I was reading your blog entry that introduced the book, I saw where you made the statement that it's a highly opinionated book that teaches best practices using pandas. And that caught me: highly opinionated. So first I'll just ask you, when you say that the book is highly opinionated, what does that mean?
Yeah.
[00:30:46] Matt Harrison: So, like I was just saying: I had written my original pandas book, and in the meantime taught classes to thousands, read a bunch of students' code, and read a bunch of the material out there talking about pandas. And I came to have some strong opinions on how one should write pandas code.
And so if you go to, like, my Twitter and you look at the images that I post, I don't tend to post images of cats. I tend to post images of code most of the time. And a lot of that is pandas code. Some of it's just general Python code, but oftentimes when I post pandas code, it's doing something not just basic but kind of cool, like taking something and exporting it to Excel and putting a graph in there, or doing some machine learning on top of it, or cleaning it up and making some visualization.
The code that I post will elicit a strong reaction, either positive or negative. It's like there's no lukewarm here. And I've had people literally say, this is the worst code I've ever seen, what are you trying to do, I would hate working with you. And then I get the opposite response: people are like, this is awesome.
This is super clean, and this changed my world. So let me come back and talk about what this code is. What tends to bother people is this notion that I'm pretty adamant about, called chaining. So most people who are using pandas are using it in a Jupyter environment.
And what they'll do generally, Ted, is they will load up some raw data from a CSV or from a SQL query or some other dataset, and they'll have this data frame, and then they'll start doing an operation, right? And it turns out that really what they want to do is clean up their data: make sure that the missing values are removed, or add new columns, or do some prep to the data to make it ready so that they can throw it into a machine learning model.
And those steps, those processes, tend to take more than one line of code. They're often multiple lines, maybe 20 different steps or whatnot. And so what I see most people do is they'll take a Jupyter notebook, which is a notebook environment that has a bunch of cells, and basically say, like, in cell one, I'm going to do one thing.
I'm going to pull off some column; in cell two, I'm going to update some column; in cell three, I'm going to filter some column. Like, oh, I didn't do the other operation right. So they'll go back up a cell, change something, and make sure they're running their other cells. And they have this long process of these operations that they've done, and at the end, eventually, they have made their data frame.
They tweaked it enough such that it's where they want it to be. But in the meantime, they have what I would call digital cruft, all this noise that is really not important to the end result. The end result is that they transformed their data, but they have all these artifacts that are not just taking up memory but also taking up mental brainpower, because your brain has to look at all of these and say, this is important.
This is not. And the other thing that turns out to be bad, in addition to wasting your brainpower and wasting memory because you're keeping around artifacts you don't need, is that oftentimes, because Jupyter is flexible, you can put cells above and below and you can execute cells in arbitrary order.
What people will do is they'll run something up here, and then they need to come back down here, so they'll run something down here, and they won't run their notebook in a linear order. So it actually makes it kind of hard when you want to come back to your notebook, or even collaborate or share it with others.
Their notebook doesn't even run from top to bottom. And so they can't really even trace how they got their data to this state; they just know that they have the data in the state. So the idea behind chaining, Ted, is to say, we're going to put on some restrictions: you can do this transformation, but instead of doing it in 20 cells, we want you to do it in one cell.
And instead of doing it in all these steps that are discontinuous, we want you to just say, here's the first step, here's the second step, here's the third step. And if you put on those restrictions, it might make it a little bit harder to think about, and so you might need to think a little bit more, but it's going to actually improve your code.
Your code will read like a recipe, because you'll have: here's my raw data, here's step one, here's step two, here's step three, here's step four, and at the end of this, my data is clean. The other thing it does is you can take that single operation, which is now a recipe, all one thing with no intermediate steps along the way, and you can indent it and throw it into a function.
And then you just take that function and you put it right at the very top of your notebook, right after you load the raw data. So when you need to come back to your notebook the next day, you just load your data and you run that function and you're good to go. You don't need to worry about running 50 different cells
in some certain order; you don't need to worry about all these intermediate things, and it makes your life a lot easier. Now, a lot of people are like, that's just ugly. Is it ugly? I don't know. It reads like a recipe to me. My challenge to people on the internet, when they say that it's the worst code ever, is to rewrite it, right?
If this is so bad that it's causing you to have some reaction to it, rewrite it. A lot of people would probably think I'm trolling, but I'm not. As an educator, I want to write code that's easy to read and easy to understand, because code is generally written once but read multiple times.
So I don't want to optimize for making it easy for me to write by doing bad practices that are easy, right? I want to optimize for making it easy to understand and easy to share. And so that is an open challenge that I give to people. Most people just sort of ignore it, because I think they think I'm trolling. But honestly, if there is a better way to write code, I want to do it.
And what I have found, after years of using pandas and years of looking at people's code, is that if you put on these constraints, it will make your life a lot easier. And you can go look at the reviews of Effective Pandas on my website or on Amazon, and that's what I'm hearing from readers as well: that reading this book and applying the practices there
changes their code for the better. So that's, I guess, the strongest opinion there. But it is, I guess, something that I see very few people talking about. You see all of these blog posts about pandas, and people blog about pandas because it's popular, but they are really sort of misguided. They might show some interesting things, but they are pushing bad practices that inevitably lead people to be frustrated with their code and not use it in the most efficient way.
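To make the chaining-plus-function pattern concrete, here is a minimal sketch with invented car data; the columns and the `tweak_cars` name are illustrative, not taken from the book:

```python
import io
import pandas as pd

# Invented raw data standing in for a CSV load.
raw = pd.read_csv(io.StringIO(
    "make,model,price\n"
    "Ford,F150,30000\n"
    "Ford,,28000\n"
    "Toyota,Corolla,22000\n"
))

def tweak_cars(df):
    # One chained "recipe": each step on its own line, no intermediates.
    return (df
        .dropna(subset=['model'])                    # remove missing values
        .assign(price_k=lambda d: d.price / 1_000)   # add a derived column
        .query('price_k > 20')                       # filter rows
        .reset_index(drop=True)
    )

# The function sits right after the raw load, so rerunning the
# notebook is just: load data, run tweak_cars, good to go.
cars = tweak_cars(raw)
```

The whole transformation lives in one cell and one function, so there are no out-of-order intermediate cells to rerun.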
[00:38:04] Ted Hallum: So I also like chaining, but, like maybe some of the other folks that you train, when I first got exposed to chaining, my initial reaction was, this is odd. But then it took some hands-on experience and some practice, and I ultimately came to the same conclusions that you mentioned.
I found that it made my code cleaner. And, you can elaborate on this, I'm sure: for those that are new to Python, there are some tenets to Python; we talk about having Pythonic code. One of them is that explicit code is better than implicit code. And I would say that for those folks who are doing different data operations, like one data operation per cell, that is kind of taking explicit code to an extreme. But having gotten experience with chaining, I would say that if you give it a chance, your code can be just as explicit as having all those cells.
But you can do it all in one cell, like you said, and then it's not just more efficient in terms of the way the code's laid out, but it's also, you mentioned the artifacts that are hanging around in memory. It's much more efficient in that way, too. So I don't think that you lose anything on the explicitness of your code.
It's just much more concisely stated.
[00:39:24] Matt Harrison: Yeah. An analog that I like to compare it to is Python whitespace. So I teach a lot of Python courses, and oftentimes I'll get someone in a Python course who's been using C or Java for 20-plus years, and they need to use Python now, because a lot of people are using Python.
And so the company wants them to do some Python training, and oftentimes these people are like, oh, but Python has whitespace. It's like, yeah, it does have whitespace, right. And I ask them, do you indent your code? And they're like, yeah, I indent my code. I'm like, why do you indent your code? Because it makes it easy to read.
I'm like, okay, Python wants it to be easy to read. But they're like, oh, but I just hate that you're forcing me to. I'm like, okay, whatever. But then after they do it, like the next day, they don't care. It's not a big deal. And I think it's similar to that with chaining. At first it's like, oh, this is a huge restriction.
And it might be a little restrictive. You might need to think a little bit; sometimes you might even need to leverage this pipe method, because there's not a really clean way to do a chain without sticking a pipe in there, which is fine. I'm okay with that. But most people who try it come around to it, versus, I think, a lot of people who are like, oh, I can't stand that, or, it looks weird. Yeah, it might look weird, but once you try it and get used to it, you're going to have an epiphany and it will change your code for the better.
[00:40:52] Ted Hallum: Absolutely. So as we dive into the actual content of Effective Pandas: the book has 36 chapters.
As I count, there are 31 of them that deal with actual specific aspects of pandas. And so my first question is of all those chapters, which one do you think is the most critical for new learners who are just diving into pandas?
[00:41:17] Matt Harrison: Yeah. I mean, I think at some level you do need those basics of, you know, understanding a series and a data frame: again, one dimension versus two dimensions.
But I think the other critical thing to realize is, again, this API is huge. There's a lot that you can do with it, but you don't need to memorize the whole API. Also, a good chunk of the API works with both a series and a data frame; they just work on different dimensions. So if you're going to do a reduction on a series (remember, a series is one dimension),
you're going to get a scalar value out of it. Contrast that with a data frame: if you do the same reducing operation on a data frame, it's going to collapse two dimensions down to one dimension. So generally that would give you a series, which is what pandas uses for one dimension.
So if you understand those basics of series and data frame, and then sort of the common operations that you can do with both of them, that's going to put you well on your way to starting to leverage pandas.
[00:42:21] Ted Hallum: Now, for those who would like to get your book, because they have used pandas a little bit and they know they like it, but they want to go onto the intermediate or expert level of pandas usage.
What do you think is the chapter that they will get the most mileage out of?
[00:42:38] Matt Harrison: Yeah. And again, I think for me, probably the biggest thing that you'll get out of the book is looking at real-life examples of chains that are taking real-life data, not canned data, not random data, and showing these processing chains.
So that's not a single chapter per se; there's not a chaining chapter, but throughout the book, as we're looking at different examples, you will see those. And again, if you take away something from this: explore chaining, try it out, right? And maybe it doesn't work for you, but for most people it does.
[00:43:19] Ted Hallum: I would personally recommend, I remember when I got to the point where I was doing grouping, the grouping and aggregation capabilities of pandas, it's just phenomenal what you can do with those. So, just my personal opinion, I know you have a chapter on that; I would definitely recommend that for people looking to go to the next level as well. And that goes right into the chaining thing.
'Cause you can chain some of those methods too. Yeah. Now, you obviously are heavily involved in providing training to people and getting people to be proficient with Python and pandas. When you look at the content that you have on pandas and your experience training people in that material, which chapter would you say you've observed to be, on average, the most challenging for new learners?
[00:44:08] Matt Harrison: Yeah, I think the generally challenging part is probably the reshaping. So pivoting, melting, unstacking, those tend to be a little bit confusing. I mean, a lot of people have experience with those coming from Excel, doing a pivot table, or coming from a database and doing a group by and an aggregation. But anytime you're taking data and changing dimensions, that tends to require a lot of working memory to sort of grok.
And so one of the ways that I've tried to address that is with the imagery in the book. I've got a lot of images and diagrams in the book, and the book is in full color. So I think having a color to sort of abstractly represent what's going on, seeing where things move as you do those operations, that can be useful.
But another thing that I do, in addition to the imagery, is I've tried to put real-life data in the book. Oftentimes, if you look at blog posts or even the pandas documentation (I think the pandas documentation is generally good), it will use just random data. And so when you start doing some of these complex things, like a pivot or melting your data, and you're using random data, for me it takes a lot more cognitive overhead, because I have to look at the random data.
I'm grouping this random thing, and now I have this grouped random thing; what does this even mean? Whereas if I have real data, it's like, I've got all of the cars that were produced in this year, and I'm going to group them by make and model, and then look at the average and see what that looks like. That's easy to understand.
And then I can understand the grouping operation without having to think about the data that's going on there. So I think using real-world examples can be super powerful, helping people understand a lot better. And also, some people do like that visual representation, so powerful diagrams can aid that as well.
[00:46:23] Ted Hallum: Okay. Now, given the amount of heart and soul that you've obviously invested in Python and pandas, I feel like I'd be remiss if I didn't ask which chapter you found most enjoyable to write. Because I think that would be very telling: if you enjoyed it, it's probably a key one worth investing some extra time in as we read the book.
[00:46:46] Matt Harrison: Yeah. I mean, at some point when you're writing a book, it goes from, I want to get this out, to, oh, why did I decide to do this? It's a marathon. If you're looking to write a book, it's one of those things where you make sure you're committed to it. And that's a lot of the reason why, instead of self-publishing, you might want to consider having a publisher: you might have someone dangling a carrot or, conversely, pushing you along to finish the book.
I would say one of the more fun ones was the time series chapter. Again, I did want to use real-world data, so the time series chapter is actually about something that I want to do. I haven't done it yet, but I live in Utah, and there's a river that I would like to paddleboard down.
But this river basically is only floatable for like two weeks of the year. And so, the time series data: I was trying to wrangle one of my friends into doing this, and part of the prep for that was I went and pulled down all of the time series data. USGS has river flow data.
And then from NOAA I got some atmospheric data, some meteorological data, and I wanted to make some predictive models that would allow me to forecast when the river would be runnable. So the time series stuff in there doesn't have the forecasting part, but it does have taking that data, joining it, merging it, slicing and dicing it, filling in the empty things.
A lot of the common operations that you would do with time series, but it's on this real-world data, right? So again, is some arbitrary river in Utah the most important data for everyone? Maybe not, right? But what I like to encourage people to do then is, in the book there are a bunch of exercises, and, you know, the exercises are pretty open-ended, so it's not just, here's the river dataset, go do something with it.
The exercises are: go find a dataset that you are interested in, and then apply this to it. So I'm a huge fan of doing something. Science tells us that if you do something, you're going to learn a lot better than just listening to me or watching me or reading my book. And also, if it's something that you're interested in (maybe because your work is paying you to be interested in it, or because it's a hobby), you're going to be a much more effective learner.
So that's a little hack that you can do: find some little dataset that you are interested in. Maybe some of your listeners might be wondering, you know, how do I get started with this? Well, find something that you're interested in, and then start using pandas to slice and dice it, right? And that's going to be a lot more effective, probably, than just reading a book about it or watching a video about it.
[00:49:46] Ted Hallum: Yeah, absolutely. I think that's the number one pro tip: find datasets that you're actually interested in, because then the projects that you do aren't going to feel so much like work. And then if you can find an employer who wants you to analyze and do data science on data that you're interested in, then your actual work won't feel so much like work.
So yeah, that's definitely always a key takeaway. So, Matt, I'm going to throw up the URL to your store again, because when I was there getting ready for this episode, looking at your different educational offerings, I noted that in addition to the Effective Pandas book, you also have a course by that same name.
So I assume that the two are complementary, and I was just going to ask you, what's that course like in terms of format? If people who like the book and listen to this podcast want to take the course, what could they expect?
[00:50:44] Matt Harrison: Yeah. So I would say, Ted, that in general, courses have a different focus than a book.
A book is probably going to go into the weeds a little bit more, go into the details a little bit more, whereas a course might be a little bit more hands-on. So there is certainly some overlap between the book and the course, but this is a course that's going to introduce pandas, and it also has some labs for you to do along the way.
So, you know, here's a small section, and then here's a lab, and then you can try out the lab and validate that you're getting it working along the way. Whereas, you know, the book is going to cover some of the warts and more minutiae that you just can't really do in a course. But some people like learning from books; some people like learning from courses.
I guess, if someone is considering doing both, my preferred ordering would probably be to do the course first and then read the book after, just because I think once you have that hands-on experience of doing something, you're going to pay attention to the book in a different way than if you read the book and then try it out afterwards. Because I think if you've tried it out and then you start reading the book, you're going to take notes and be like, oh, I should do this, and I should do this.
So I think that would probably be the more effective ordering if someone were considering doing both.
[00:52:23] Ted Hallum: Outstanding. Well, Matt, I appreciate you humoring my interrogating you about both the book and the offerings there at your store, for courses and things like that. At this point, I'd like to transition.
I had some members of our community who hit me up, and they had questions that they said they would like for me to ask you during this podcast. So the first question comes from our community member Simon Lax, and he asks: when is pandas the wrong tool for the job? Is there ever a situation, apart from memory constraints or speed, where other tools might be better?
[00:52:59] Matt Harrison: Yeah, that's a great question. So again, I do like pandas, right? But it has certain places where it fits. As I said, one of the things is that it works for small data. And what I mean by small data is data that would fit on a machine. Now, what a machine looks like these days sort of depends on how much money you have to buy RAM, or how much you want to rent a machine.
Right? So I have a laptop that has 64 gigs, right? A few years ago that would be sort of unheard of in a laptop, but you can also go to the cloud and rent out machines that have multiples of that. And you can run pandas on that, and instead of having to scale out to multiple machines, you can do something like that.
So, you know, pandas might be the wrong tool for the job on a dataset that doesn't fit on your machine, but if you switch machines, that might be okay. Another thing, just on that note of memory, is being aware of the data types your columns are represented as. That can be important, especially when you're dealing with string data: pandas has optimized ways of storing numeric data, but it doesn't really optimize string data.
It actually sort of has buffers that point back to Python objects. One of the things that you can do, though, is if you have what we call low-cardinality categorical data. So maybe you have makes and models of cars, and you have a million rows, but you only have 20 different makes. If you just represented that as strings, you'd have a million entries, but if you represented it as a pandas category, it would represent it as a number between one and 20, which would map into a sequence of the different categories that were there.
And so you can easily see that representing that as a sequence of numbers versus a string for each of those would be a huge memory saving. So oftentimes, if you've got categorical data, you can change those data types and shrink your data, so that if it wouldn't fit on your machine before, now it might, and you might use it there.
Some other places where I would say you just don't want to use pandas: if you have non-tabular data, right? I mean, you can try and make it work, but it's probably not the case. So if you're doing video or images or text data, that's probably not the right place for pandas. Another place where pandas probably isn't the right tool is, you know, a lot of scientists are doing matrices of homogeneous data.
You know, I've got a matrix of floating point numbers or integer numbers, and I want to do math operations on those. And pandas might work for that, but if you just use raw NumPy for that, it's probably a better choice. Pandas does have some overhead on top of NumPy, and so going directly with NumPy might be a better option there.
But again, going back to that idea of pandas the API: I think we're seeing now that the sort of memory thing might be a thing of the past, and it's more, if you have tabular data, whether it's small data or big data, pandas the API is what you want to use.
[00:56:16] Ted Hallum: Okay. Awesome answer. Thank you so much.
I think Simon's going to be blown away with that. So fantastic. Our next question comes from David Vermilion, and he says: I'm an experienced and comfortable R user who likes the quality of reports that I get with R Markdown. I've tried JupyterLab and a few other things, but I've yet to find any options in Python that yield reports with the same aesthetic polish that I can get with R Markdown.
What Python solutions do you recommend, Matt, for generating beautiful reports?
[00:56:51] Matt Harrison: Yeah. So again, I haven't used R Markdown. My understanding is that you write a Markdown file and then you run it through some tool, and it generates HTML or maybe a PDF or something for you. So, yeah, Jupyter: is Jupyter the best for sharing with others?
It might be, right? If you have a technical audience, you might say, here's a notebook, and I'm just going to share this notebook with you. Alternatively, you can export from Jupyter: you can export a PDF, or you can export HTML from Jupyter. Is that the best? It might be, it might not be; maybe it's not as clean as something you get from R Markdown. What are some other options?
There are a bunch of dashboarding options that we're seeing. One of the ones that tends to get a lot of mentions these days is called Streamlit. And so if you want to share some reports, but it's more of a dashboard where you want some interactivity, you might want to check out Streamlit or some of the other dashboarding options there.
As far as documentation goes, if you're making documentation, Python has a pretty good documentation story you can look at. And in the Jupyter world, there's a thing called Jupyter Book. The intent there is to actually write books in Jupyter and then do an export and make HTML or PDF from that.
All of my books, with the exception of the ones that are published by publishers, have been written in a tool called reStructuredText. So I have made my own toolchain that will take reStructuredText. If you're not familiar with that, it's basically Python's version of Markdown.
So basically, I can take a file that looks very much like Markdown, that has code in it, and I can run it through a tool that will generate a PDF or generate an ebook from that. If I wanted to, I could generate slides from it. I have another tool for when I'm doing my courses.
All of my slides are written basically in reStructuredText as well, and I run a tool on my slides that generates my slides from that. The nice thing about using something like reStructuredText is that all of the code snippets in there I can actually test. Python has a mechanism for testing code snippets called doctest.
And so I can test my books, I can test my slides, and make sure that those work. So those are various options for taking content and making some sort of report or output from it, whether there's overlap that's sufficient with R Markdown, you know. And I've also heard that with R Markdown, or these other tools, you can embed Python code in there.
So I mean, if R Markdown is really working for you and that's what you want to use, maybe continue using it, right? But there are a lot of options in the Python world that people are using to generate PDFs and HTML.
[01:00:14] Ted Hallum: Awesome. So there you go, David, a couple of options to achieve similarly beautiful analysis results using Python.
Our next question comes from John Droescher, and John says: pandas benefits from having documentation that's as good as or better than most other Python packages. So in what ways does the Effective Pandas book add value beyond what users can already get in the pandas documentation?
[01:00:44] Matt Harrison: Yeah. I will say, one of the things I do like about pandas, and kudos to the developers, is that if you're inside of Jupyter, you can pull up the documentation.
And generally, not always (there are a few methods that are, I think, a little bit under-documented), the documentation is really good. And so I like to encourage my students: hey, if you need to pull up the documentation, don't jump to a search window; try and do it from Jupyter first, because that's going to eliminate distractions and you're going to be a lot more productive.
Anytime you switch to a search window, you're just opening your brain to distraction, and you're going to limit your productivity. So in general, the pandas documentation is pretty good. But I can give an example of something that I wish were more accessible from Jupyter, that is on the pandas website but not in Jupyter: offset aliases. So when you have data that has times in it, oftentimes you want to aggregate by all the months or all the years, or you want to do eight days at a time or whatever. Pandas lets you do that with a thing called an offset alias relatively easily.
But to get the documentation for what the offset alias names are, that's not available in the docstring. You actually need to go to the pandas website to get that. I also have it in my book, but I wish there was an easy way to get it directly. It'd be nice, because oftentimes my students are like, I forgot what the names of the offset aliases are, and then they have to start searching.
To be fair to pandas, they do put a link to that in the docstring, but I just wish they would embed it there, because I've found that people just want that information. So let's maybe look at a higher-level question: oftentimes people are like, why would I pay for something when there's something free? So maybe I can just address that.
I have a blog post that talks about that as well. And again, I'm heavily biased here, right? 'Cause I make my living selling content and helping others learn. But yeah, there's a lot of free content out there; if you like free content and that's your thing, go crazy with that. However, what I've seen is that most free content will maybe cover some aspect, and it's not about taking you on a path, right?
So my book is about taking you on a path to learn pandas. I would contrast that with the documentation for pandas, which, while pretty good, has a different purpose: the purpose of the pandas documentation is to document every feature of pandas. And while that's nice, I don't think that the normal pandas user needs to know the documentation for every feature of pandas.
If you like to memorize obscure things, that might be useful, but I would rather learn practical things that I'm going to apply. I'd rather learn the 20% that I'm going to use 80% of the time, rather than wasting my time on 400 different methods. One more contrast between my book and the pandas documentation is that a lot of the pandas documentation uses random data, which I find kind of annoying. Again, for me, when I look at just random data, it has no meaning. And so when you're trying to understand some of these more complex operations and you just have random data, it doesn't really help me learn what's going on with the operation.
Right? But if I have real data and I understand what it means, that helps me understand. I don't have to expend mental energy thinking, these were just random things, and now I am slicing random things. Instead it's like, I've got my river data, and this is how much rain fell in this month, right?
I had it at 15-minute frequencies, but I've now resampled it to monthly frequencies, and then I plotted it, and I can understand, oh, this is the rainfall for every month on this river. So it's very clear what we're doing and how to apply that. And generally, when someone's writing a book or making a course, one of the things they have to do is write a proposal, especially if they're doing it with a publisher, and they should think, I want to take the end user from here to here.
Versus documentation: documentation is, I want to document every method. And blog posts are generally, I'm going to dive into some method here or some piece of functionality. Those can be nice, but a blog post is generally not going to take you end to end; it's going to dive into a certain aspect.
So those are some of the pros and cons of the different types of documentation. For me, again, I'm highly biased, but I think if you go on a path, especially when you're learning something, that's going to be a lot more effective than just going to the pandas website and trying to memorize everything on there.
[01:05:40] Ted Hallum: Yeah, absolutely. I would say, if you're an upper-intermediate-level user or an expert, the documentation can be handy because you maybe know exactly what you need and you get that one thing. But if you're a new pandas learner, then it's like, okay, there's great documentation. Where do I start? How do I proceed?
Where do I end? Whereas with something that's thoughtfully structured, like what you've put forward here, you get the pedagogical strategy of that whole process baked in. New learners can just start, and you'll hold their hand through the learning process and take them where they need to go.
[01:06:23] Matt Harrison: Yeah. And I think one more thing, and this goes back to the real data example: a lot of the examples on the pandas website or in blogs are like, here's some method and here's how you use it. Right. Which again might be nice, and sometimes an expert might want to dive into that, but really,
you oftentimes want to do a process, right? And so my book is going to show you these real-world datasets, and it's going to give you these chains of, this is what I had to do. It wasn't just one thing. It was actually a sequence of five or ten things. Right. And this is what it looks like in the real world. It doesn't actually look like just doing one thing, right?
I mean, you do want to do that one thing, but it's part of a sequence, and the output of that one thing is really not that important. The output of the whole process is what is important. And so you're going to get that in my book. You're not going to really find a lot of that in things that deep dive into a single method.
[01:07:20] Ted Hallum: Absolutely. All right, Matt, our next question is from Collin Bardo, and he asks, what's Matt's favorite, lesser known, or most undervalued pandas capability?
[01:07:32] Matt Harrison: Yeah, that's a great question. Again, there are like 400 different methods on there, so there's a lot in there, and I actually don't claim to be an expert on all of those, because I come from the pragmatic side of, I don't want to memorize 400 things.
I just want to use what I need to use to be productive. I would say one of the ones that I find a lot of people don't know about is the query method. A lot of people know about using .loc to do slicing and filtering, but there is also a query method. And the nice thing about the query method is that it lends itself very well to chaining.
Versus if you do a .loc in the middle of a chain: oftentimes people need to filter, so they create a boolean array. But if it's in the middle of the chain and they build the boolean array based on the original data frame, it might not work, versus the query.
The query is going to operate on the current state of the data frame. And also, you can just write it like a SQL query, like, where this column is greater than five or something. It makes it pretty easy to read. For those who are listening, you can do something similar and get the current state of a data frame with .loc, by passing a lambda function into the index operation of .loc, which again has a little bit of cognitive overhead.
The lambda function will take the current state of the data frame. You just need to return something that .loc can slice or index off of. But generally, I find that query reads a lot better.
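Both patterns Matt describes can be sketched on a toy frame (the data and column names here are made up for illustration):

```python
import pandas as pd

# Toy data, just to show the mechanics.
df = pd.DataFrame({"flow": [3, 8, 1, 9], "temp": [50, 61, 45, 70]})

# .query filters on the *current* state of the frame mid-chain,
# and reads almost like a SQL WHERE clause.
high = (
    df
    .assign(flow2=lambda d: d.flow * 2)  # chain step that changes the frame
    .query("flow2 > 10")                 # sees flow2, the just-added column
)

# The .loc alternative: pass a lambda, which receives the current frame
# and returns something .loc can index with (here, a boolean array).
high_loc = (
    df
    .assign(flow2=lambda d: d.flow * 2)
    .loc[lambda d: d.flow2 > 10]
)

print(high)
```

A boolean mask built from the original `df` would not know about `flow2` at all, which is exactly the mid-chain pitfall Matt points out.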
[01:09:06] Ted Hallum: Very cool. Thank you so much for that. I don't think I've ever used that method myself, and now I'm excited to try it.
[01:09:13] Matt Harrison: It makes it look like SQL, so.
Sure. Yeah.
[01:09:16] Ted Hallum: Yeah. Well, I like to write code that's as readable as possible, and that certainly sounds like a more readable approach than embedding a lambda function inside of a .loc. Now, occasionally I'll take a liberty, since I am the podcast host here. The last question is actually one that I have for you.
And the question comes up because I had a previous role as a data scientist where I worked almost exclusively with geospatial data, and my eyes were opened to the incredible value that can be derived from it. As a result of needing to work with that data almost exclusively, I became proficient with GeoPandas, which is an extension of pandas for working with geometric types.
And so I was just curious to find out, now that the book's out and you're probably already starting to accumulate notes for what you might include in a future edition: do you think that GeoPandas might make the cut, so that people could learn how to use the power of pandas for geospatial data?
[01:10:25] Matt Harrison: That's a good question. I've had similar comments. People are like, why don't you cover extension arrays? So, again, there's a rich API in pandas, and then there's a whole swath of tools around the pandas ecosystem that people are doing some really cool things with. I mean, I've got my book here.
You'll see that I've already got a bunch of notes where I'm like, add this. So as someone who's a creator, I'm always sort of thinking about that. I'll say there are no guarantees, but one thought that is going through my head, Ted, is rather than updating Effective Pandas with something like that,
I am considering a book called Enterprise Pandas. And the idea with that would be more like case studies: taking things like, okay, let's look at a GeoPandas case study, right, or an extension array case study, or someone who is using pandas in an ETL system or machine learning, or how people are writing robust pandas, and then testing that and making sure that their pandas code works in production systems.
So that would be the idea. Again, no guarantees, but I think, you know, like I said, there's a lot of beginner stuff, a lot of stuff like, here's this method, here's that method. Right. But pulling it together, there's not a lot of that. And I think there could be value in a book that is showing, you know, this is real-world stuff,
and these are associated libraries that can enable you, like you said, to do really powerful things, to go above and beyond just basic slicing and dicing.
[01:12:22] Ted Hallum: I think a book of case studies like that would be broadly welcomed by the community. I would like to have it in my library, and I can only imagine that the members of our community, from things they've told me, would appreciate a book like that.
And I can even see it being strongly considered as a text for, like, an introductory-level data science course, because it would give people those hands-on, pragmatic, how-do-I-actually-do-this type of skills.
[01:12:50] Matt Harrison: yeah. Yeah. When
[01:12:53] Ted Hallum: do you think that'll be out Matt
[01:12:55] Matt Harrison: Next week? Yeah, I'm working on cloning myself.
And as soon as I'm done with that, that one will get it.
[01:13:03] Ted Hallum: All right. Well, I've got a short list here of a few questions as we wrap up that are just kind of like stray electrons. And the first one is: obviously, I think your favorite Python package is pandas. What would you say is your second favorite Python package?
[01:13:20] Matt Harrison: Yeah, I don't know. I guess I have sort of a love-hate relationship with pandas. Sometimes it brings me some pain. But here's a package I'd also suggest, that maybe some of your listeners don't know about if they're Python users, that I've found super useful. It's a package called Jupytext, and basically,
if you're using Jupyter, Jupyter saves itself as JSON output, which is fine in that JSON, JavaScript Object Notation, is great for things like web services. What JSON isn't good for is doing diffs on. Oftentimes, when you're committing things to source control, like Git, you need to do a diff.
And so if you change your notebook and then you do a diff on it, it's really hard to see what's going on there. What Jupytext gives you is the ability to say, I've got my notebook here, but I also want to set up a synchronization where I sync it to a Python script. So you install the plugin, and then it puts an option in your Jupyter window.
You can click on this Jupytext option and say, save it to Python. And then anytime you save your notebook, in addition to saving the JSON .ipynb file, it's also going to write this Python script file. Now, one of the nice things with that is that, like, I like to use Emacs. I've been using the Emacs editor for longer than I've been using Python.
So I have a bunch of muscle memory around that. If I open up an .ipynb notebook in Emacs, it's kind of a pain because you're editing JSON, which is kind of annoying. However, if I open up this Python file, it just looks like Python, with some comments where cells start and end, and some comments if you have Markdown in there. But otherwise it's just a Python file.
And then sometimes I might need to refactor or change my code. It's a lot easier for me to do that directly from Emacs, so I can change the Python file, save it, and then I reload the notebook and the notebook reflects what the Python file is. So it's a two-way sync. If I save the notebook, it overwrites the Python file.
If I save the Python file, it overwrites the notebook. And then if I want to push this into source control, or I'm collaborating with others, the diff on the Python file is just a normal Python diff. So it's very clear to see, versus the diff on the notebook might be a little bit convoluted, because diffing JSON is not quite as clean as diffing Python.
So check that out if you are interested in editing your notebooks from an editor where you're just editing Python text, if that's a little more efficient for you, or if you want to start using source control, which I think you should if you're doing anything in a production capacity and you want to really understand diffs and changes to your notebooks as you go along.
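For anyone curious what the paired script looks like, Jupytext's "percent" format is ordinary Python with `# %%` cell markers, which is why diffs and plain-text editing work normally. The cell contents below are illustrative, not from the episode:

```python
# %% [markdown]
# # Rainfall report
# Markdown cells become comments in the paired script.

# %%
# An ordinary code cell: everything between `# %%` markers is plain Python,
# so `git diff` and any text editor treat the file like a normal .py file.
import pandas as pd

df = pd.DataFrame({"rain_in": [0.1, 0.4, 0.0]})
total = df["rain_in"].sum()
print(total)
```

Jupytext is typically installed with `pip install jupytext`, and pairing is then enabled per notebook from the Jupyter menu, as Matt describes.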
Okay.
[01:16:18] Ted Hallum: So Matt, yeah, another tip that I'm excited to seize personally and run with as soon as we get off this call, because I have lived that pain of trying to deal with Jupyter notebooks in Git and other version control environments, and it is an absolutely indecipherable nightmare to figure out what the diffs are.
So that is very cool, thank you. The next question is: for someone who is constantly creating new courses and writing books, you've got to always be learning. There'd be no way for you not to be learning all the time so that you have new content. So I'm curious to know, what's your current learning focus?
What are you diving into?
[01:17:03] Matt Harrison: Yeah, that's a great question. A lot of it goes back to that notion of the enterprise stuff, the case studies. That's a way that I've used all along to learn. Like, when I wrote my Python OLAP engine way back in the day, I didn't just start writing it. Actually, there were certain papers that talked about slicing and dicing data and fundamental operations.
And so I would read those papers and basically implement them in Python. So, you know, for some of my material I am doing more case-study-like stuff, going out and reading what the papers or other practical books are saying. For example, I'm currently doing sales reporting with pandas, with the idea of replacing Excel. It's a course. I'm actually running it this week, but I'm running another version in March. The intent of that course is, you know, I've generated a lot of reports over the years using Python, and a lot of those have been exported as Excel.
But I'm finding that a lot of people want to either throw out Excel, or just see if they can use Python or pandas to generate those reports instead. And so that is the course I'm doing. The way I came across that: I figured I could just write my own content, but rather, what I did is I researched other people who are doing reports, and I actually read material not on Python or pandas, but material on Excel.
Right. Like, here are best practices for Excel, and here's how you make these reports in Excel. And then I did that in pandas instead. So it's leveraging others' best practices, but with different tools. Once we're in the Python ecosystem, we get the best practices of Excel, but we also get access to these 350,000 libraries that we don't have in Excel.
So that's sort of my general learning practice. When I want to learn something, I will go and research what experts have done with it. And oftentimes, implementing what experts have done is a great way for you to understand the state of the art, or the most up-to-date ways to do it. And that's sort of where I'm at right now.
You know, I've got clients who are asking for training on certain things that maybe I haven't used for a few years. And so revisiting that material certainly can be a challenge. Right. But again, my process for that would be, okay, let's go out and see what people are doing with it.
And I generally will go to, like, a book that's going to take me on a path, right, and then start applying the content of that book to a dataset or information that is useful or interesting to me, and go from there.
[01:20:20] Ted Hallum: It's so true. I'm constantly telling people that I'm an expert in whatever I've been doing the most for the last 90 to 120 days.
I've definitely forgotten way more than I know in this moment, the only consolation prize being that for some of those things I learned one, two, three years ago, I know I learned them. Those brain synapses connected. And so I know I can do it again if I have to, I think.
[01:20:48] Matt Harrison: Yeah, yeah. Generally, your brain, you know, and I've experienced this a lot:
if you've read something once, I mean, you might feel nice about it, but it's just going to go out of your brain. Really, the science tells us you need to do spaced repetition, which is taking notes and then revisiting those notes. Alternatively, I've found a great way of doing that is to actually put it into practice.
Right. And then not only are you using a different portion of your brain by typing it out and thinking about it, you're reinforcing what you learned. So, coming back to doing projects: it's so critical for learning something. But yeah, if you've learned something in the past, it's not a big deal that you forget it, unless you're in an interview situation, where you might want to review it.
But oftentimes, if you've learned something in the past, a quick review can pull up a lot of that content that you might have thought you had lost.
[01:21:42] Ted Hallum: A hundred percent. So Matt, for the sake of time, I'll close out by asking: I know you probably keep a close pulse on the horizon, looking out at the near and midterm for both Python and pandas.
So what has you most excited that you've heard is coming down the pike in that realm? And then let the listeners know how you prefer to be contacted if they want to reach out to you.
[01:22:07] Matt Harrison: Sure. Yeah. Again, I think I'd go back to, you know, as far as the pandas world, I just think pandas is a good investment right now.
There are other things that are coming along, but I don't think that for the next five to ten years, in the Python world or the data science world in general, they're going to really compete with pandas. But we also just see, like I said, I've got a list of 13 libraries that implement the pandas API.
Right? So even if you think you might have larger data in the future, is pandas worthwhile? I think it is, because of scaling that out. Spark just released the pandas API on Spark in version 3.2. You've got Dask, you've got Modin. These things offer you scale-out. I mean, Modin offers scale-out, and they want to be completely faithful to the pandas API, so much so that they will reproduce bugs in pandas,
even though that might seem weird, because they want people to have the same experience with Modin that they would have in pandas. So I'm just excited about, you know, as we see things making it easy to scale out or leverage the GPU. Will I be using pandas in three years?
I might not be using pandas, but I probably will be using the API. Maybe there's a better implementation, and, you know, I swap out an import: instead of import pandas as pd, I import whatever, cheetah, as pd. And by changing that single line, I'm now getting a 3x improvement in my speed, but my code stays the same. To be specific, there's no cheetah library; I'm just giving that as an example.
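The one-line swap Matt describes looks roughly like this in practice. Modin is a real drop-in implementation he mentions (cheetah, as he says, is made up); this sketch falls back to plain pandas when Modin isn't installed:

```python
# Code written against the pandas API can run on a drop-in replacement
# by changing only the import line.
try:
    import modin.pandas as pd  # drop-in replacement that scales across cores
except ImportError:
    import pandas as pd        # single-core fallback; the code below is identical

df = pd.DataFrame({"x": [1, 2, 3]})
total = df["x"].sum()
print(total)  # the analysis code never changes, whichever import won
```

This is the sense in which the API is arguably more valuable than the package: everything after the import is written once, regardless of which engine executes it.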
[01:24:04] Ted Hallum: I think that's the perfect way to put it. The example you gave, of doing a different import as pd and then using the syntax you're so familiar with, is perfect for people who might not have realized what we were saying earlier about the API being as important as the actual package, or perhaps more important.
It probably just became real for them what you mean when you say that. So Matt, between instructing at universities, writing books, creating courses, providing consulting to companies, and doing professional training, I know you're extremely busy, and I just want to say, from the heart, thank you so much for coming on the show to share about this important tool.
I think it's one of the best present-day tools for doing data wrangling and processing data. And especially with that last tip about Modin, I think people who love pandas and want to use it for larger data are probably going to be rushing out to Google that. So thank you for everything, and we look forward to maybe having you on the show again in the future.
[01:25:07] Matt Harrison:
Awesome. Yeah. Thanks for having me, Ted. Thanks for your good work. People can connect with me on LinkedIn, or on Twitter it's dunder M Harrison dunder, that is, __mharrison__. And I don't tweet cat photos. I tweet pandas code that makes you upset or very happy.
Thanks, Ted.
[01:25:25] Ted Hallum: Hey, thanks, Matt. Until next time. Thank you all for joining Matt and me for this conversation about pandas. If you'd like to learn more, I encourage you to check out the book introduced in this episode. There's more information, and a link where you can pick up a copy of Effective Pandas for yourself, in the show notes below. With that, until the next episode, I bid you clean data, low p-values, and Godspeed on your data journey.