Introduction to Jupyter Notebook, Numpy, & Matplotlib

Outline

Jupyter notebook

Numpy

Matplotlib

Jupyter Notebook

What is Jupyter Notebook

  • create and share documents that contain live code, equations, visualizations and narrative text.
  • data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
  • From when you start a Jupyter notebook, variable assignments, function declarations, etc are stored and updated as you go.

Useful features

  • Hit shift+enter to execute the cell
  • Use ! to excute shell commands, try !ls, !pwd, !pip install ...
  • Use ? (e.g., ?numpy, ?numpy.mean()) for quick reference on syntax
  • Change themes
    • pip install jupyterthemes to install themes
    • jt -l to check the available themes
    • jt -t <theme> to set the theme
    • jt -r back to the default theme
  • Notebook Extensions
    • install: type pip install jupyter_contrib_nbextensions and then jupyter contrib nbextension install
    • Table of Contents
    • Hinterland
    • Split cells
  • Using LaTeX for forumlas
    • $\frac{\partial^2u}{\partial t^2}=c^2\frac{\partial^2 u}{\partial x^2}$
  • Multicursor support

Sharing your notebooks

  • Before you share
    • Click “Cell > All Output > Clear”
    • Click “Kernel > Restart & Run All”
    • Wait for your code cells to finish executing and check ran as expected

Numpy

Array programming with NumPy

Nature volume 585, pages 357–362 (2020)

Charles R. Harris, K. Jarrod Millman, and Stéfan J. van der Walt et al.

What is Numpy (Numerical Python)

  • a fundamental package for scientific computing in Python
  • a Python library includes a multidimensional array object (ndarray), various derived objects, and fast operations on arrays
  • one of the top five Python packages, used in every field of science and engineering.
  • install numpy: pip install numpy
  • import numpy: import numpy as np

Why is Numpy

  • fast
  • the core of the scientific Python and PyData ecosystems
  • the universal standard for working with numerical data in Python
  • works well with Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages
  • all kinds of matrix operations, lots of built-in functions

ndarray: the core of NumPy

  • a homogeneous n-dimensional array object (unlike Python sequences)
    • The elements in a NumPy array are of the same data type and the same size in memory.
  • NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically)
  • a NumPy array has continuous memory

A 1D array is a vector

its shape is just the number of components

In [1]:
import numpy as np

data = np.array([1,2,3])
print(data)
[1 2 3]

Screen%20Shot%202020-10-08%20at%207.49.54%20PM.png

A 2D array is a matrix; its shape is (number of rows, number of columns)

In [2]:
data = np.array([[1,2],
               [3,4],
               [5,6]])
print(data)
[[1 2]
 [3 4]
 [5 6]]

image.png

Structure of a 3D (3, 4, 2) array

image.png

Memory and strides: what make numpy fast

  • a NumPy array is stored in a contiguous block of memory
  • two key concepts relating to memory: dimensions and strides
  • strides are integer numbers of bytes to step in each dimension

An example of 2D array

In [3]:
data = np.array([[1,2,3],[4,5,6]], dtype='int16')
data.strides
Out[3]:
(6, 2)

shape of the 2D array

image.png

continuous memory and strides

image.png

The byte address of an element data[i1,i2]:

byte_address = data.strides[0] * i1 + data.strides[1] * i2

2D array shape and memory

image.png

in a NumPy array, the last axis is the fastest while the first axis is the slowest.

In [4]:
import numpy as np

a = np.random.rand(5000, 5000)
In [5]:
%%time
a[0, :].sum()
CPU times: user 34 µs, sys: 3 µs, total: 37 µs
Wall time: 40.1 µs
Out[5]:
2513.744110297411
In [6]:
%%time
a[:, 0].sum()
CPU times: user 702 µs, sys: 206 µs, total: 908 µs
Wall time: 416 µs
Out[6]:
2533.4437081558804

Rolling window, strided tricks

a simple example: taking mean for running window:

In [7]:
import numpy as np
sequence = np.random.normal(size=1000000) + np.arange(1000000) #define a 1D array

very bad idea is to do this with pure python

In [8]:
def running_average_simple(seq, window=100):
    result = np.zeros(len(seq)-window + 1)
    for i in range(len(result)):
        temp = seq[i:i+window]
        result[i] = temp.mean()
    return result

it is better to use as_strided

In [9]:
from numpy.lib.stride_tricks import as_strided
def running_average_strides(seq, window=100):
    stride = seq.strides[0]
    sequence_strides = as_strided(seq, shape=[len(seq)-window+1, window], strides=[stride, stride])
    return sequence_strides.mean(axis=1)

benchmark of the speed

In [10]:
import time

start1 = time.time()
result1 = running_average_simple(sequence)
end1 = time.time()
start2 = time.time()
result2 = running_average_strides(sequence)
end2 = time.time()
print(result1)
print(result2)
print('running_average_simple takes:'+str(end1-start1))
print('running_average_stride takes:'+str(end2-start2))
[4.93380524e+01 5.03442473e+01 5.13432677e+01 ... 9.99947588e+05
 9.99948591e+05 9.99949588e+05]
[4.93380524e+01 5.03442473e+01 5.13432677e+01 ... 9.99947588e+05
 9.99948591e+05 9.99949588e+05]
running_average_simple takes:4.855233192443848
running_average_stride takes:0.036527156829833984

Ways to create a numpy array

create an array from a regular Python sequence (list or tuple)

In [11]:
import numpy as np

a = np.array([2,3,4])
print(a.dtype)

b = np.array([2.,3.1,4.8])
print(b.dtype)
int64
float64

sequences of sequences = 2D arrays, sequences of sequences of sequences = 3D arrays, and so on

In [12]:
b = np.array([(1.5,2,3), (4,5,6)])
print(b)
print(b.dtype)
[[1.5 2.  3. ]
 [4.  5.  6. ]]
float64

create an array with a specified shape

In [13]:
shape=(3,2)
a = np.ones(shape)
print(a)
[[1. 1.]
 [1. 1.]
 [1. 1.]]
In [14]:
shape=(3,2)
b = np.zeros(shape)
print(b)
[[0. 0.]
 [0. 0.]
 [0. 0.]]
In [15]:
shape=(3,2)
c = np.random.random(shape)
print(c)
[[0.97760399 0.77848398]
 [0.69826252 0.35639333]
 [0.87414457 0.92383478]]

image.png

In [16]:
shape=(2,3)
d = np.full(shape,6)
print(d)
[[6 6 6]
 [6 6 6]]
In [17]:
shape=(3,3,2)
e = np.empty(shape) #uninitialized, may vary
print(e)
[[[-1.28822975e-231 -1.28822975e-231]
  [ 5.92878775e-323  0.00000000e+000]
  [ 0.00000000e+000  0.00000000e+000]]

 [[ 0.00000000e+000  0.00000000e+000]
  [ 0.00000000e+000  0.00000000e+000]
  [ 0.00000000e+000  0.00000000e+000]]

 [[ 0.00000000e+000  0.00000000e+000]
  [-1.28822975e-231  1.73060594e-077]
  [ 2.47032823e-323  0.00000000e+000]]]
In [18]:
a.fill(8)
b.fill(8)
c.fill(8)
d.fill(8)
print(a)
print('-------------')
print(b)
print('-------------')
print(c)
print('-------------')
print(d)
[[8. 8.]
 [8. 8.]
 [8. 8.]]
-------------
[[8. 8.]
 [8. 8.]
 [8. 8.]]
-------------
[[8. 8.]
 [8. 8.]
 [8. 8.]]
-------------
[[8 8 8]
 [8 8 8]]
In [19]:
f = np.arange(1,5,0.5)
print(f)
[1.  1.5 2.  2.5 3.  3.5 4.  4.5]
In [20]:
g = np.linspace( 0,3,5)  # 5 numbers from 0 to 3
print(g)
[0.   0.75 1.5  2.25 3.  ]

Create an array from existing data

Create a new array from existing arrays

In [21]:
data = np.array([1,  2,  3,  4,  5,  6])
arr1 = data[3:6]
print(arr1)
[4 5 6]
In [22]:
data.reshape(2,3)
Out[22]:
array([[1, 2, 3],
       [4, 5, 6]])
In [23]:
data.reshape(3,2)
Out[23]:
array([[1, 2],
       [3, 4],
       [5, 6]])

image.png

In [24]:
a1 = np.array([[1, 1],
               [2, 2]])

a2 = np.array([[3, 3],
               [4, 4]])
np.vstack((a1, a2))
Out[24]:
array([[1, 1],
       [2, 2],
       [3, 3],
       [4, 4]])
In [25]:
np.hstack((a1, a2))
Out[25]:
array([[1, 1, 3, 3],
       [2, 2, 4, 4]])

Load from .txt files

In [26]:
import numpy as np

s1_e=np.loadtxt('./data/seismicEvents/AH_CZO_SHE.txt')
s1_n=np.loadtxt('./data/seismicEvents/AH_CZO_SHN.txt')
s1_z=np.loadtxt('./data/seismicEvents/AH_CZO_SHZ.txt')
s2_e=np.loadtxt('./data/seismicEvents/AH_TAP_BHE.txt')
s2_n=np.loadtxt('./data/seismicEvents/AH_TAP_BHN.txt')
s2_z=np.loadtxt('./data/seismicEvents/AH_TAP_BHZ.txt')
print(s1_e)
print(s1_n)
print(s1_z)
print(s2_e)
print(s2_n)
print(s2_z)
[8644. 8635. 8611. ... 8772. 8782. 8804.]
[7966. 7990. 8031. ... 7991. 8003. 8003.]
[8569. 8586. 8585. ... 8686. 8648. 8630.]
[ -99. -144. -157. ... -271.  264.  283.]
[-2300. -2265. -2145. ... -3366. -2365.  -755.]
[ -761.  -728.  -707. ... -1426. -1702. -1770.]
In [27]:
import matplotlib.pyplot as plt

def plotSeismicEvent(dt,se,sn,sz,title):
    t = dt*np.arange(0.0, len(se))
    fig,(ax1,ax2,ax3) = plt.subplots(
        3,1,figsize=(12,6),constrained_layout=True)
    fig.suptitle(title,fontsize=24)
    ax1.plot(t, se)
    ax1.set_ylabel('Amplitude')
    ax1.set_title('East component')

    ax2.plot(t, sn)
    ax2.set_ylabel('Amplitude')
    ax2.set_title('North component')

    ax3.plot(t, sz)
    ax3.set_ylabel('Amplitude')
    ax3.set_title('Vertical component')
    ax3.set_xlabel('Time (s)')
    
    plt.show()
In [28]:
dt=0.01 #sampling intervale
title1 = 'Three components of the 1st event'
plotSeismicEvent(dt,s1_e,s1_n,s1_z,title1)
In [29]:
dt=0.01 #sampling intervale
title2 = 'Three components of the 2nd event'
plotSeismicEvent(dt,s2_e,s2_n,s2_z,title2)

Load from .dat files

In [30]:
import numpy as np

shape = (782,1006)
sx = np.fromfile('./data/seismicFacies/sx589.dat',dtype=np.single)
fx = np.fromfile('./data/seismicFacies/fx589.dat',dtype=np.single)
sx = np.reshape(sx,shape)
fx = np.reshape(fx,shape)
sx = np.transpose(sx)
fx = np.transpose(fx)
In [31]:
import matplotlib.pyplot as plt
import numpy as np

fig, axs = plt.subplots(1,2,figsize=(20,9),constrained_layout=True)
cm = ['gray', 'jet']
bars = ['Amplitude', 'Seismic facies']
imgs = [sx,fx]
for col in range(2):
    ax = axs[col]
    ax.xaxis.tick_top()
    ax.set_ylim(len(imgs[col]),0)
    pcm = ax.pcolormesh(imgs[col],cmap=cm[col])
    cbar=fig.colorbar(pcm, ax=ax, orientation="horizontal", aspect=20)
    cbar.set_label(bars[col])
plt.rcParams.update({'font.size':18})
plt.show()

The seismic facies identification challenge

image.png

Our team (collaborating with Huawei) is ranking NO. 4

we are still trying to improve our result!

image.png

Fast operations on numpy arrays

  • mathematical, logical
  • shape manipulation, sorting, selecting, I/O
  • discrete Fourier transforms
  • basic linear algebra
  • basic statistical operations
  • random simulation and more
  • check out the numpy tutorial
  • check out Josh Devlin's numpy cheat sheet

Indexing and slicing

the same as python lists

In [32]:
import numpy as np

data = np.array([1, 2, 3])
data[1]
Out[32]:
2
In [33]:
data[0:2]
Out[33]:
array([1, 2])
In [34]:
data[1:]
Out[34]:
array([2, 3])
In [35]:
data[-2:]
Out[35]:
array([2, 3])

image.png

Access/index array values with a condition

In [36]:
import numpy as np
a = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a[a < 5]
Out[36]:
array([1, 2, 3, 4])
In [37]:
a[(a > 2) & (a < 11)]
Out[37]:
array([ 3,  4,  5,  6,  7,  8,  9, 10])
In [38]:
a>5
Out[38]:
array([[False, False, False, False],
       [False,  True,  True,  True],
       [ True,  True,  True,  True]])

Basic array operations

addition, subtraction, multiplication, division, and more

In [39]:
data = np.array([1, 2])
ones = np.ones(2, dtype=int)
data + ones
Out[39]:
array([2, 3])

image.png

In [40]:
print(data - ones)
print(data * data)
print(data / data)
[0 1]
[1 4]
[1. 1.]

image.png

In [41]:
print(data.max())
print(data.min())
print(data.sum())
2
1
3

image.png

In [42]:
data = np.array([[1, 2], [3, 4]])
print(data.max())
print(data.min())
print(data.sum())
4
1
10

image.png

In [43]:
print(data.max(axis=0))
print(data.max(axis=1))
[3 4]
[2 4]

image.png

Broadcasting

how numpy treats arrays with different shapes during arithmetic operations

In [44]:
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])
ones_row = np.array([[1, 1]])
data + ones_row
Out[44]:
array([[2, 3],
       [4, 5],
       [6, 7]])

image.png

Broadcasting

how numpy treats arrays with different shapes during arithmetic operations

image.png

Working with mathematical formulas

$mse=\frac{1}{n}\sum^n_{i=1}(Y^p_i-Y_i)^2$

Implementing this formula is simple and straightforward in NumPy:

$mse=(1/n)*np.sum(np.square(predictions-labels))$

Fast operations on numpy arrays

  • mathematical, logical
  • shape manipulation, sorting, selecting, I/O
  • discrete Fourier transforms
  • basic linear algebra
  • basic statistical operations
  • random simulation
  • and much more.

Creating Arrays

  • np.array([1,2,3]) | One dimensional array
  • np.array([(1,2,3),(4,5,6)]) | Two dimensional array
  • np.zeros(3) | 1D array of length 3 all values 0
  • np.ones((3,4)) | 3x4 array with all values 1
  • np.eye(5) | 5x5 array of 0 with 1 on diagonal (Identity matrix)
  • np.linspace(0,100,6) | Array of 6 evenly divided values from 0 to 100
  • np.arange(0,10,3) | Array of values from 0 to less than 10 with step 3 (eg [0,3,6,9])
  • np.full((2,3),8) | 2x3 array with all values 8
  • np.random.rand(4,5) | 4x5 array of random floats between 0–1
  • np.random.rand(6,7)*100 | 6x7 array of random floats between 0–100
  • np.random.randint(5,size=(2,3)) | 2x3 array with random ints between 0–4

Inspecting Properties

  • arr.size | Returns number of elements in arr
  • arr.shape | Returns dimensions of arr (rows,columns)
  • arr.dtype | Returns type of elements in arr
  • arr.astype(dtype) | Convert arr elements to type dtype
  • arr.tolist() | Convert arr to a Python list
  • np.info(np.eye) | View documentation for np.eye

Copying/sorting/reshaping

  • np.copy(arr) | Copies arr to new memory
  • arr.view(dtype) | Creates view of arr elements with type dtype
  • arr.sort() | Sorts arr
  • arr.sort(axis=0) | Sorts specific axis of arr
  • two_d_arr.flatten() | Flattens 2D array two_d_arr to 1D
  • arr.T | Transposes arr (rows become columns and vice versa)
  • arr.reshape(3,4) | Reshapes arr to 3 rows, 4 columns without changing data
  • arr.resize((5,6)) | Changes arr shape to 5x6 and fills new values with 0

Adding/removing Elements

  • np.append(arr,values) | Appends values to end of arr
  • np.insert(arr,2,values) | Inserts values into arr before index 2
  • np.delete(arr,3,axis=0) | Deletes row on index 3 of arr
  • np.delete(arr,4,axis=1) | Deletes column on index 4 of arr

Combining/splitting

  • np.concatenate((arr1,arr2),axis=0) | Adds arr2 as rows to the end of arr1
  • np.concatenate((arr1,arr2),axis=1) | Adds arr2 as columns to end of arr1
  • np.split(arr,3) | Splits arr into 3 sub-arrays
  • np.hsplit(arr,5) | Splits arr horizontally on the 5th index

Indexing/slicing/subsetting

  • arr[5] | Returns the element at index 5
  • arr[2,5] | Returns the 2D array element on index [2][5]
  • arr[1]=4 | Assigns array element on index 1 the value 4
  • arr[1,3]=10 | Assigns array element on index [1][3] the value 10
  • arr[0:3] | Returns the elements at indices 0,1,2 (On a 2D array: returns rows 0,1,2)
  • arr[0:3,4] | Returns the elements on rows 0,1,2 at column 4
  • arr[:2] | Returns the elements at indices 0,1 (On a 2D array: returns rows 0,1)
  • arr[:,1] | Returns the elements at index 1 on all rows
  • arr<5 | Returns an array with boolean values
  • (arr1<3) & (arr2>5) | Returns an array with boolean values
  • ~arr | Inverts a boolean array
  • arr[arr<5] | Returns array elements smaller than 5

Scalar Math

  • np.add(arr,1) | Add 1 to each array element
  • np.subtract(arr,2) | Subtract 2 from each array element
  • np.multiply(arr,3) | Multiply each array element by 3
  • np.divide(arr,4) | Divide each array element by 4 (returns np.nan for division by zero)
  • np.power(arr,5) | Raise each array element to the 5th power

Vector Math

  • np.add(arr1,arr2) | Elementwise add arr2 to arr1
  • np.subtract(arr1,arr2) | Elementwise subtract arr2 from arr1
  • np.multiply(arr1,arr2) | Elementwise multiply arr1 by arr2
  • np.divide(arr1,arr2) | Elementwise divide arr1 by arr2
  • np.power(arr1,arr2) | Elementwise raise arr1 raised to the power of arr2
  • np.array_equal(arr1,arr2) | Returns True if the arrays have the same elements and shape
  • np.sqrt(arr) | Square root of each element in the array
  • np.sin(arr) | Sine of each element in the array
  • np.log(arr) | Natural log of each element in the array
  • np.abs(arr) | Absolute value of each element in the array
  • np.ceil(arr) | Rounds up to the nearest int
  • np.floor(arr) | Rounds down to the nearest int
  • np.round(arr) | Rounds to the nearest int

Statistics

  • np.mean(arr,axis=0) | Returns mean along specific axis
  • arr.sum() | Returns sum of arr
  • arr.min() | Returns minimum value of arr
  • arr.max(axis=0) | Returns maximum value of specific axis
  • np.var(arr) | Returns the variance of array
  • np.std(arr,axis=1) | Returns the standard deviation of specific axis
  • arr.corrcoef() | Returns correlation coefficient of array

Benchmark of speed

Pure python (with a list of values)

In [45]:
import time
import math
costs = np.zeros(3) #to collect the costs of the three methods
def pyStdDev(a):
    mean = sum(a) / len(a)
    return math.sqrt((sum(((x - mean)**2 for x in a)) / len(a)))
In [46]:
start = time.time()
a = [float(v) for v in range(10000000)]
pyStdDev(a)
end = time.time()
costs[0] = end-start
print(costs[0])
2.270772933959961

Numpy

In [47]:
import numpy as np

def npStdDev(a):
    return np.std(a)
In [48]:
start = time.time()
a = np.arange(1e7)
npStdDev(a)
end = time.time()
costs[1] = end-start
print(costs[1])
0.17032194137573242

Cython-naive

In [49]:
import math

def cyStdDev(a):
    m = a.mean()
    w = a - m
    wSq = w**2
    return math.sqrt(wSq.mean())
In [50]:
start = time.time()
a = np.arange(1e7)
cyStdDev(a)
end = time.time()
costs[2] = end-start
print(costs[2])
0.0803229808807373

Benchmark of speed (Python vs Numpy vs Cython)

In [51]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 6))
ax.set_title('Standard deviation of a vector with 1e7 elements')
ax.set_ylabel('Seconds')
ax.bar(['Python', 'Numpy', 'Cython'], costs)
plt.rcParams.update({'font.size': 18})
plt.show()

Matplotlib

we have used it

import matplotlib as plt

start your adventure with matplotlib by following the demos

In [52]:
import matplotlib.pyplot as plt
texts = ['Enjoy', 'python', 'numpy', 'matplotlib',
        'Pandas, SciPy, scikit-learn, scikit-image and more']
colors = ['red','green','green','green','blue']
for k in range(len(texts)):
    plt.text(0.1, 1-0.2*k, texts[k], fontsize=36, color=colors[k])
plt.text(0.10, -0.1,'This is a matplotlib figure...',fontsize=28, color='black')
plt.axis('off')
plt.rcParams.update({'font.family': 'fantasy'})
plt.savefig("plotText.pdf", dpi=300, bbox_inches = 'tight')
plt.savefig("plotText.png", dpi=300, bbox_inches = 'tight')
plt.show()