Introduction to NumPy for Data Science and Analytics
Hey guys! Welcome back to our Data Science and Analytics series. We've already covered the fundamentals and some visualizations in this series, so now let's go step by step and cover each tool in detail. Let's start with NumPy, and we'll move on to the other libraries and tools gradually.
In today's article, we'll cover NumPy. Whether we're working with small datasets or huge ones, this is a tool we'll be using every day. So,
- Why is NumPy so crucial for Data Science?
- Why should we use it instead of regular Python lists?
- How does it help with data manipulation and preparation?
Stick around, and by the end of this article, you'll understand exactly how NumPy can streamline your workflow and boost your efficiency. We'll also walk through some code examples to make the concepts clearer.
Alright! Let’s get into it now.
1. The Importance of NumPy in Data Science
- Data Science involves dealing with a lot of data, often much more than we can easily handle with standard Python data structures like lists and dictionaries. This is where NumPy comes in.
- NumPy stands for Numerical Python. It is a library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on them.
Let me give you a quick example to show what I mean.
list1 = [1, 2, 3, 4]
list2 = [5, 6, 7, 8]
Here, I have two lists of numbers. If I wanted to perform an operation on all elements of these lists, like multiplying them element by element, doing that with standard Python lists is a bit slow and clunky.
mul = [list1[i] * list2[i] for i in range(len(list1))]
print(mul) # Output: [5, 12, 21, 32]
Not too bad, but now imagine doing this with hundreds of thousands of elements. This approach quickly becomes inefficient and complicated, right?
2. Why NumPy is Better than Lists for Large Datasets
NumPy, on the other hand, is optimized for performance and memory efficiency. With NumPy arrays, operations like the one we just performed can be done much faster and with much cleaner code.
Let me show you the NumPy version of the same operation. First, we need to import NumPy and create NumPy arrays, and then we'll multiply them.
import numpy as np
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([5, 6, 7, 8])
mul = arr1 * arr2
print(mul) # Output: [ 5 12 21 32]
Look at that! Much cleaner, right? But the real power of NumPy shows up when we're working with larger datasets. NumPy is optimized for numerical operations, and its core is implemented in C, which is much faster than Python's native loops.
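By the way, since we said earlier that NumPy supports multi-dimensional arrays and a whole collection of mathematical functions, here's a minimal sketch of what that looks like in practice (the values here are made up purely for illustration):
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]]) # a small 2-D array (a matrix)
print(matrix.shape) # Output: (2, 3) -> 2 rows, 3 columns
print(matrix.sum()) # Output: 21 -> sum of every element
print(matrix.mean(axis=0)) # Output: [2.5 3.5 4.5] -> column-wise means
print(np.sqrt(matrix)) # element-wise square root, no loop needed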
For larger datasets, NumPy can be 50 to 100 times faster than standard Python lists. It's also more memory-efficient, which is especially important when working with limited system resources or massive datasets.
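To make that memory point concrete, here's a minimal sketch comparing the approximate footprint of a Python list with the equivalent NumPy array, using the standard sys module. (The exact byte counts depend on your platform and Python version, so I'm not hard-coding any output here.)
import sys

nums = list(range(1000000))
numsArr = np.arange(1000000)

# A list stores pointers to separate Python int objects,
# so we add the size of the list itself and the integers it points to.
listBytes = sys.getsizeof(nums) + sum(sys.getsizeof(x) for x in nums)

# A NumPy array stores the data in one contiguous block of fixed-size elements.
arrBytes = numsArr.nbytes

print("Approximate size of the Python list:", listBytes, "bytes")
print("Size of the NumPy array:", arrBytes, "bytes")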
3. Performance Benchmark Example
I’m going to show you the time it takes to perform a large operation using both Python lists and NumPy arrays.
For this, we’ll use the time module.
from time import time
Okay, now let's create two large lists and compare the time taken by the Python list version and the NumPy array version.
largeList1 = list(range(1, 1000000))
largeList2 = list(range(1, 1000000))
startTime = time()
mulNums = [largeList1[i] * largeList2[i] for i in range(len(largeList1))]
endTime = time()
print("Time Take by Python List:", endTime - startTime)
# Output: Total Time Taken by Python Lists: 0.2608051300048828
Great! Let's now convert the lists into NumPy arrays.
largeArr1 = np.array(largeList1)
largeArr2 = np.array(largeList2)
startTime = time()
mulNums = largeArr1 * largeArr2
endTime = time()
print("Time Take by NumPy Array:", endTime - startTime)
# Output: Total Time Taken by NumPy Array: 0.023926734924316406
As we can see, the NumPy array operation is far faster than the list operation, even when we're handling about a million data points.
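One small caveat: a single time() measurement can be a bit noisy, because whatever else is running on your machine can skew it. If you want a more reliable comparison, the standard-library timeit module can repeat the measurement for you. Here's a rough sketch of how that could look (your numbers will differ from machine to machine):
import timeit

setupCode = """
import numpy as np
largeList1 = list(range(1, 1000000))
largeList2 = list(range(1, 1000000))
largeArr1 = np.array(largeList1)
largeArr2 = np.array(largeList2)
"""

# Run each version several times and keep the best (least noisy) result
listTime = min(timeit.repeat(
    "[largeList1[i] * largeList2[i] for i in range(len(largeList1))]",
    setup=setupCode, number=5, repeat=3))
arrTime = min(timeit.repeat(
    "largeArr1 * largeArr2",
    setup=setupCode, number=5, repeat=3))

print("Best list time (5 runs):", listTime)
print("Best array time (5 runs):", arrTime)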
And that's all for today! We're going to do much more exciting things in the upcoming articles. Since you've made it all the way to the end, thanks for reading! 😊