
Planet Python

Last update: January 13, 2020 09:48 PM UTC

January 13, 2020


Podcast.__init__

Using Deliberate Practice To Level Up Your Python

Summary

An effective strategy for teaching and learning is to rely on well structured exercises and collaboration for practicing the material. In this episode long time Python trainer Reuven Lerner reflects on the lessons that he has learned in the 5 years since his first appearance on the show, how his teaching has evolved, and the ways that he has incorporated more hands-on experiences into his lessons. This was a great conversation about the benefits of being deliberate in your approach to ongoing education in the field of technology, as well as having some helpful references for ways to keep your own skills sharp.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m pleased to welcome back Reuven Lerner to talk about the benefits of deliberate practice for learning and improving programming skills

Interview

  • Introductions

  • How did you get introduced to Python?

  • In your first appearance on the show back in episode 2 we talked about your experience as a Python trainer. How has your teaching style evolved in the past 5 years?

    • How has the focus and scope of your training changed in that time period?
  • What have you found to be some of the most helpful and effective tactics in your training?

  • From the learner perspective, what are some strategies that you recommend for retaining information, particularly in the context of gaining technical knowledge?

  • In-person training vs. real-time online training vs. recorded videos, advantages and disadvantages of each.

  • Blended learning, in which we combine aspects of the above

    • Beyond in-person training, what are your preferred methods for learning and maintaining new skills?
  • What is deliberate practice and how does it differ from the habits that many of us might default to?

    • What are some of the resources that you provide for students of your trainings for practicing?
    • What are some of the outside resources which you have found most useful or effective?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

January 13, 2020 09:28 PM UTC


Real Python

Logistic Regression in Python

As the amount of available data, the strength of computing power, and the number of algorithmic improvements continue to rise, so does the importance of data science and machine learning. Classification is among the most important areas of machine learning, and logistic regression is one of its basic methods. By the end of this tutorial, you’ll have learned about classification in general and the fundamentals of logistic regression in particular, as well as how to implement logistic regression in Python.

In this tutorial, you'll learn what classification is, how logistic regression works, and how to implement logistic regression in Python with scikit-learn and StatsModels.


Classification

Classification is a very important area of supervised machine learning. A large number of important machine learning problems fall within this area. There are many classification methods, and logistic regression is one of them.

What Is Classification?

Supervised machine learning algorithms define models that capture relationships among data. Classification is an area of supervised machine learning that tries to predict which class or category some entity belongs to, based on its features.

For example, you might analyze the employees of some company and try to establish a dependence on the features or variables, such as the level of education, number of years in a current position, age, salary, odds for being promoted, and so on. The set of data related to a single employee is one observation. The features or variables can take one of two forms:

  1. Independent variables, also called inputs or predictors, don’t depend on other features of interest (or at least you assume so for the purpose of the analysis).
  2. Dependent variables, also called outputs or responses, depend on the independent variables.

In the above example where you’re analyzing employees, you might presume the level of education, time in a current position, and age as being mutually independent, and consider them as the inputs. The salary and the odds for promotion could be the outputs that depend on the inputs.

Note: Supervised machine learning algorithms analyze a number of observations and try to mathematically express the dependence between the inputs and outputs. These mathematical representations of dependencies are the models.

The nature of the dependent variables differentiates regression and classification problems. Regression problems have continuous and usually unbounded outputs. An example is when you’re estimating the salary as a function of experience and education level. On the other hand, classification problems have discrete and finite outputs called classes or categories. For example, predicting if an employee is going to be promoted or not (true or false) is a classification problem.

There are two main types of classification problems:

  1. Binary or binomial classification: exactly two classes to choose between (usually 0 and 1, true and false, or positive and negative)
  2. Multiclass or multinomial classification: three or more classes of the outputs to choose from

If there’s only one input variable, then it’s usually denoted with 𝑥. For more than one input, you’ll commonly see the vector notation 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of the predictors (or independent features). The output variable is often denoted with 𝑦 and takes the values 0 or 1.

When Do You Need Classification?

You can apply classification in many fields of science and technology. For example, text classification algorithms are used to separate legitimate and spam emails, as well as positive and negative comments. You can check out Practical Text Classification With Python and Keras to get some insight into this topic. Other examples involve medical applications, biological classification, credit scoring, and more.

Image recognition tasks are often represented as classification problems. For example, you might ask if an image is depicting a human face or not, or if it’s a mouse or an elephant, or which digit from zero to nine it represents, and so on. To learn more about this, check out Traditional Face Detection With Python and Face Recognition with Python, in Under 25 Lines of Code.

Logistic Regression Overview

Logistic regression is a fundamental classification technique. It belongs to the group of linear classifiers and is somewhat similar to polynomial and linear regression. Logistic regression is fast and relatively uncomplicated, and it’s convenient for you to interpret the results. Although it’s essentially a method for binary classification, it can also be applied to multiclass problems.

Math Prerequisites

You’ll need an understanding of the sigmoid function and the natural logarithm function to understand what logistic regression is and how it works.

This image shows the sigmoid function (or S-shaped curve) of some variable 𝑥:

Sigmoid Function

The sigmoid function has values very close to either 0 or 1 across most of its domain. This fact makes it suitable for application in classification methods.
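
If you want to see those values concretely, here's a minimal sketch (not part of the tutorial itself) that evaluates the sigmoid with NumPy; the result is near 0 for large negative inputs and near 1 for large positive ones:

import numpy as np

def sigmoid(t):
    # 1 / (1 + exp(-t)), evaluated elementwise
    return 1 / (1 + np.exp(-t))

print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# approximately [0.0000454, 0.269, 0.5, 0.731, 0.99995]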

This image depicts the natural logarithm log(𝑥) of some variable 𝑥, for values of 𝑥 between 0 and 1:

Natural Logarithm

As 𝑥 approaches zero, the natural logarithm of 𝑥 drops towards negative infinity. When 𝑥 = 1, log(𝑥) is 0. The opposite is true for log(1 − 𝑥).

Note that you’ll often find the natural logarithm denoted with ln instead of log. In Python, math.log(x) and numpy.log(x) represent the natural logarithm of x, so you’ll follow this notation in this tutorial.

Problem Formulation

In this tutorial, you’ll see an explanation for the common case of logistic regression applied to binary classification. When you’re implementing the logistic regression of some dependent variable 𝑦 on the set of independent variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors ( or inputs), you start with the known values of the predictors 𝐱ᵢ and the corresponding actual response (or output) 𝑦ᵢ for each observation 𝑖 = 1, …, 𝑛.

Your goal is to find the logistic regression function 𝑝(𝐱) such that the predicted responses 𝑝(𝐱ᵢ) are as close as possible to the actual response 𝑦ᵢ for each observation 𝑖 = 1, …, 𝑛. Remember that the actual response can be only 0 or 1 in binary classification problems! This means that each 𝑝(𝐱ᵢ) should be close to either 0 or 1. That’s why it’s convenient to use the sigmoid function.

Once you have the logistic regression function 𝑝(𝐱), you can use it to predict the outputs for new and unseen inputs, assuming that the underlying mathematical dependence is unchanged.

Methodology

Logistic regression is a linear classifier, so you’ll use a linear function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, also called the logit. The variables 𝑏₀, 𝑏₁, …, 𝑏ᵣ are the estimators of the regression coefficients, which are also called the predicted weights or just coefficients.

The logistic regression function 𝑝(𝐱) is the sigmoid function of 𝑓(𝐱): 𝑝(𝐱) = 1 / (1 + exp(−𝑓(𝐱))). As such, it's often close to either 0 or 1. The function 𝑝(𝐱) is often interpreted as the predicted probability that the output for a given 𝐱 is equal to 1. Therefore, 1 − 𝑝(𝐱) is the probability that the output is 0.

Logistic regression determines the best predicted weights 𝑏₀, 𝑏₁, …, 𝑏ᵣ such that the function 𝑝(𝐱) is as close as possible to all actual responses 𝑦ᵢ, 𝑖 = 1, …, 𝑛, where 𝑛 is the number of observations. The process of calculating the best weights using available observations is called model training or fitting.

To get the best weights, you usually maximize the log-likelihood function (LLF) for all observations 𝑖 = 1, …, 𝑛. This method is called the maximum likelihood estimation and is represented by the equation LLF = Σᵢ(𝑦ᵢ log(𝑝(𝐱ᵢ)) + (1 − 𝑦ᵢ) log(1 − 𝑝(𝐱ᵢ))).

When 𝑦ᵢ = 0, the LLF for the corresponding observation is equal to log(1 − 𝑝(𝐱ᵢ)). If 𝑝(𝐱ᵢ) is close to 𝑦ᵢ = 0, then log(1 − 𝑝(𝐱ᵢ)) is close to 0. This is the result you want. If 𝑝(𝐱ᵢ) is far from 0, then log(1 − 𝑝(𝐱ᵢ)) drops significantly. You don’t want that result because your goal is to obtain the maximum LLF. Similarly, when 𝑦ᵢ = 1, the LLF for that observation is 𝑦ᵢ log(𝑝(𝐱ᵢ)). If 𝑝(𝐱ᵢ) is close to 𝑦ᵢ = 1, then log(𝑝(𝐱ᵢ)) is close to 0. If 𝑝(𝐱ᵢ) is far from 1, then log(𝑝(𝐱ᵢ)) is a large negative number.
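
To make this concrete, here's a rough sketch (with made-up responses and probabilities, not the tutorial's data) of evaluating the LLF directly with NumPy:

import numpy as np

y = np.array([0, 0, 1, 1])           # hypothetical actual responses
p = np.array([0.1, 0.4, 0.8, 0.95])  # hypothetical predicted probabilities p(x_i)

# LLF = sum over i of y_i * log(p(x_i)) + (1 - y_i) * log(1 - p(x_i))
llf = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(llf)  # about -0.89; values closer to 0 indicate a better fit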

There are several mathematical approaches that will calculate the best weights that correspond to the maximum LLF, but that’s beyond the scope of this tutorial. For now, you can leave these details to the logistic regression Python libraries you’ll learn to use here!

Once you determine the best weights that define the function 𝑝(𝐱), you can get the predicted outputs 𝑝(𝐱ᵢ) for any given input 𝐱ᵢ. For each observation 𝑖 = 1, …, 𝑛, the predicted output is 1 if 𝑝(𝐱ᵢ) > 0.5 and 0 otherwise. The threshold doesn’t have to be 0.5, but it usually is. You might define a lower or higher value if that’s more convenient for your situation.

There’s one more important relationship between 𝑝(𝐱) and 𝑓(𝐱), which is that log(𝑝(𝐱) / (1 − 𝑝(𝐱))) = 𝑓(𝐱). This equality explains why 𝑓(𝐱) is the logit. It implies that 𝑝(𝐱) = 0.5 when 𝑓(𝐱) = 0 and that the predicted output is 1 if 𝑓(𝐱) > 0 and 0 otherwise.

Classification Performance

Binary classification has four possible types of results:

  1. True negatives: correctly predicted negatives (zeros)
  2. True positives: correctly predicted positives (ones)
  3. False negatives: incorrectly predicted negatives (zeros)
  4. False positives: incorrectly predicted positives (ones)

You usually evaluate the performance of your classifier by comparing the actual and predicted outputs and counting the correct and incorrect predictions.

The most straightforward indicator of classification accuracy is the ratio of the number of correct predictions to the total number of predictions (or observations). Other indicators of binary classifiers include precision and recall (also known as the positive predictive value and sensitivity), specificity, and the F1 score.

The most suitable indicator depends on the problem of interest. In this tutorial, you’ll use the most straightforward form of classification accuracy.

Single-Variate Logistic Regression

Single-variate logistic regression is the most straightforward case of logistic regression. There is only one independent variable (or feature), which is 𝐱 = 𝑥. This figure illustrates single-variate logistic regression:

1D Logistic Regression

Here, you have a given set of input-output (or 𝑥-𝑦) pairs, represented by green circles. These are your observations. Remember that 𝑦 can only be 0 or 1. For example, the leftmost green circle has the input 𝑥 = 0 and the actual output 𝑦 = 0. The rightmost observation has 𝑥 = 9 and 𝑦 = 1.

Logistic regression finds the weights 𝑏₀ and 𝑏₁ that correspond to the maximum LLF. These weights define the logit 𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥, which is the dashed black line. They also define the predicted probability 𝑝(𝑥) = 1 / (1 + exp(−𝑓(𝑥))), shown here as the full black line. In this case, the threshold 𝑝(𝑥) = 0.5 and 𝑓(𝑥) = 0 corresponds to the value of 𝑥 slightly higher than 3. This value is the limit between the inputs with the predicted outputs of 0 and 1.

Multi-Variate Logistic Regression

Multi-variate logistic regression has more than one input variable. This figure shows the classification with two independent variables, 𝑥₁ and 𝑥₂:

2D Logistic Regression

The graph is different from the single-variate graph because both axes represent the inputs. The outputs also differ in color. The white circles show the observations classified as zeros, while the green circles are those classified as ones.

Logistic regression determines the weights 𝑏₀, 𝑏₁, and 𝑏₂ that maximize the LLF. Once you have 𝑏₀, 𝑏₁, and 𝑏₂, you can get the logit 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂ and the predicted probability 𝑝(𝑥₁, 𝑥₂) = 1 / (1 + exp(−𝑓(𝑥₁, 𝑥₂))).

The dash-dotted black line linearly separates the two classes. This line corresponds to 𝑝(𝑥₁, 𝑥₂) = 0.5 and 𝑓(𝑥₁, 𝑥₂) = 0.

Regularization

Overfitting is one of the most serious kinds of problems related to machine learning. It occurs when a model learns the training data too well. The model then learns not only the relationships among data but also the noise in the dataset. Overfitted models tend to have good performance with the data used to fit them (the training data), but they behave poorly with unseen data (or test data, which is data not used to fit the model).

Overfitting usually occurs with complex models. Regularization normally tries to reduce or penalize the complexity of the model. Regularization techniques applied with logistic regression mostly tend to penalize large coefficients 𝑏₀, 𝑏₁, …, 𝑏ᵣ, typically with an L1 (lasso) penalty on their absolute values, an L2 (ridge) penalty on their squares, or an elastic-net combination of the two.

Regularization can significantly improve model performance on unseen data.
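
Looking ahead to the scikit-learn implementation you'll use below, the regularization type and strength are controlled by the penalty and C parameters of LogisticRegression; here's a brief illustrative sketch:

from sklearn.linear_model import LogisticRegression

# Smaller C means stronger regularization; penalty selects the type.
l2_model = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')
l1_model = LogisticRegression(penalty='l1', C=0.5, solver='liblinear')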

Logistic Regression in Python

Now that you understand the fundamentals, you're ready to apply the appropriate packages as well as their functions and classes to perform logistic regression in Python. In this section, you'll get an overview of the required packages and then work through several classification examples step by step.

Let’s start implementing logistic regression in Python!

Logistic Regression Python Packages

There are several packages you’ll need for logistic regression in Python. All of them are free and open-source, with lots of available resources. First, you’ll need NumPy, which is a fundamental package for scientific and numerical computing in Python. NumPy is useful and popular because it enables high-performance operations on single- and multi-dimensional arrays.

NumPy has many useful array routines. It allows you to write elegant and compact code, and it works well with many Python packages. If you want to learn NumPy, then you can start with the official user guide. The NumPy Reference also provides comprehensive documentation on its functions, classes, and methods.

Note: To learn more about NumPy performance and the other benefits it can offer, check out Pure Python vs NumPy vs TensorFlow Performance Comparison and Look Ma, No For-Loops: Array Programming With NumPy.

Another Python package you'll use is scikit-learn. This is one of the most popular data science and machine learning libraries. You can use scikit-learn to preprocess data, reduce the dimensionality of problems, select features, and build, train, and evaluate models.

You’ll find useful information on the official scikit-learn website, where you might want to read about generalized linear models and logistic regression implementation. If you need functionality that scikit-learn can’t offer, then you might find StatsModels useful. It’s a powerful Python library for statistical analysis. You can find more information on the official website.

Finally, you’ll use Matplotlib to visualize the results of your classification. This is a Python library that’s comprehensive and widely used for high-quality plotting. For additional information, you can check the official website and user guide. There are several resources for learning Matplotlib you might find useful, like the official tutorials, the Anatomy of Matplotlib, and Python Plotting With Matplotlib (Guide).

Logistic Regression in Python With scikit-learn: Example 1

The first example is related to a single-variate binary classification problem. This is the most straightforward kind of classification problem. There are several general steps you’ll take when you’re preparing your classification models:

  1. Import packages, functions, and classes
  2. Get data to work with and, if appropriate, transform it
  3. Create a classification model and train (or fit) it with your existing data
  4. Evaluate your model to see if its performance is satisfactory

Once you've defined a sufficiently good model, you can use it to make further predictions related to new, unseen data. The above procedure is the same for classification and regression.

Step 1: Import Packages, Functions, and Classes

First, you have to import Matplotlib for visualization and NumPy for array operations. You’ll also need LogisticRegression, classification_report(), and confusion_matrix() from scikit-learn:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

Now you’ve imported everything you need for logistic regression in Python with scikit-learn!

Step 2: Get Data

In practice, you’ll usually have some data to work with. For the purpose of this example, let’s just create arrays for the input (𝑥) and output (𝑦) values:

x = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

The input and output should be NumPy arrays (instances of the class numpy.ndarray) or similar objects. numpy.arange() creates an array of consecutive, equally-spaced values within a given range. For more information on this function, check the official documentation or NumPy arange(): How to Use np.arange().

The array x is required to be two-dimensional. It should have one column for each input, and the number of rows should be equal to the number of observations. To make x two-dimensional, you apply .reshape() with the arguments -1 to get as many rows as needed and 1 to get one column. For more information on .reshape(), you can check out the official documentation. Here’s how x and y look now:

>>> x
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
>>> y
array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

x has two dimensions:

  1. One column for a single input
  2. Ten rows, each corresponding to one observation

y is one-dimensional with ten items. Again, each item corresponds to one observation. It contains only zeros and ones since this is a binary classification problem.

Step 3: Create a Model and Train It

Once you have the input and output prepared, you can create and define your classification model. You’re going to represent it with an instance of the class LogisticRegression:

model = LogisticRegression(solver='liblinear', random_state=0)

The above statement creates an instance of LogisticRegression and binds its reference to the variable model. LogisticRegression has several optional parameters that define the behavior of the model and approach, such as penalty, C, solver, max_iter, and random_state.

You should carefully match the solver and regularization method, because not every solver supports every penalty. For example, 'liblinear' supports the L1 and L2 penalties but always applies some regularization, 'newton-cg', 'sag', and 'lbfgs' work only with the L2 penalty, and 'saga' is the only solver that also supports the elastic-net penalty.

Once the model is created, you need to fit (or train) it. Model fitting is the process of determining the coefficients 𝑏₀, 𝑏₁, …, 𝑏ᵣ that correspond to the best value of the cost function. You fit the model with .fit():

model.fit(x, y)

.fit() takes x, y, and possibly observation-related weights. Then it fits the model and returns the model instance itself:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

This is the obtained string representation of the fitted model.

You can use the fact that .fit() returns the model instance and chain the last two statements. They are equivalent to the following line of code:

model = LogisticRegression(solver='liblinear', random_state=0).fit(x, y)

At this point, you have the classification model defined.

You can quickly get the attributes of your model. For example, the attribute .classes_ represents the array of distinct values that y takes:

>>> model.classes_
array([0, 1])

This is an example of binary classification, and y can be 0 or 1, as indicated above.

You can also get the value of the slope 𝑏₁ and the intercept 𝑏₀ of the linear function 𝑓 like so:

>>> model.intercept_
array([-1.04608067])
>>> model.coef_
array([[0.51491375]])

As you can see, 𝑏₀ is given inside a one-dimensional array, while 𝑏₁ is inside a two-dimensional array. You use the attributes .intercept_ and .coef_ to get these results.
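
If you want to double-check these numbers, here's a small illustrative sketch (reusing model, x, and np from the steps above) that rebuilds 𝑝(𝑥) from the intercept and slope and compares it with the probabilities the model itself predicts in the next step:

# p(x) = 1 / (1 + exp(-(b0 + b1 * x))), evaluated for every observation
p_manual = 1 / (1 + np.exp(-(model.intercept_ + model.coef_ * x)))
print(np.allclose(p_manual.ravel(), model.predict_proba(x)[:, 1]))  # True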

Step 4: Evaluate the Model

Once a model is defined, you can check its performance with .predict_proba(), which returns the matrix of probabilities that the predicted output is equal to zero or one:

>>> model.predict_proba(x)
array([[0.74002157, 0.25997843],
       [0.62975524, 0.37024476],
       [0.5040632 , 0.4959368 ],
       [0.37785549, 0.62214451],
       [0.26628093, 0.73371907],
       [0.17821501, 0.82178499],
       [0.11472079, 0.88527921],
       [0.07186982, 0.92813018],
       [0.04422513, 0.95577487],
       [0.02690569, 0.97309431]])

In the matrix above, each row corresponds to a single observation. The first column is the probability of the predicted output being zero, that is 1 - 𝑝(𝑥). The second column is the probability that the output is one, or 𝑝(𝑥).

You can get the actual predictions, based on the probability matrix and the values of 𝑝(𝑥), with .predict():

>>> model.predict(x)
array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

This function returns the predicted output values as a one-dimensional array.

The figure below illustrates the input, output, and classification results:

Result of Logistic Regression

The green circles represent the actual responses as well as the correct predictions. The red × shows the incorrect prediction. The full black line is the estimated logistic regression line 𝑝(𝑥). The grey squares are the points on this line that correspond to 𝑥 and the values in the second column of the probability matrix. The black dashed line is the logit 𝑓(𝑥).

The value of 𝑥 slightly above 2 corresponds to the threshold 𝑝(𝑥)=0.5, which is 𝑓(𝑥)=0. This value of 𝑥 is the boundary between the points that are classified as zeros and those predicted as ones.

For example, the first point has input 𝑥=0, actual output 𝑦=0, probability 𝑝=0.26, and a predicted value of 0. The second point has 𝑥=1, 𝑦=0, 𝑝=0.37, and a prediction of 0. Only the fourth point has the actual output 𝑦=0 and the probability higher than 0.5 (at 𝑝=0.62), so it’s wrongly classified as 1. All other values are predicted correctly.

When you have nine out of ten observations classified correctly, the accuracy of your model is equal to 9/10=0.9, which you can obtain with .score():

>>> model.score(x, y)
0.9

.score() takes the input and output as arguments and returns the ratio of the number of correct predictions to the number of observations.
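
Equivalently, you could compute that ratio yourself; here's a one-line sketch using the arrays already defined above:

print(np.mean(model.predict(x) == y))  # 0.9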

You can get more information on the accuracy of the model with a confusion matrix. In the case of binary classification, the confusion matrix shows the numbers of true negatives (upper-left element), false positives (upper-right), false negatives (lower-left), and true positives (lower-right).

To create the confusion matrix, you can use confusion_matrix() and provide the actual and predicted outputs as the arguments:

>>> confusion_matrix(y, model.predict(x))
array([[3, 1],
       [0, 6]])

The obtained matrix shows three true negative predictions, one false positive, zero false negatives, and six true positives.

It’s often useful to visualize the confusion matrix. You can do that with .imshow() from Matplotlib, which accepts the confusion matrix as the argument:

cm = confusion_matrix(y, model.predict(x))

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()

The code above creates a heatmap that represents the confusion matrix:

Classification Confusion Matrix

In this figure, different colors represent different numbers and similar colors represent similar numbers. Heatmaps are a nice and convenient way to represent a matrix. To learn more about them, check out the Matplotlib documentation on Creating Annotated Heatmaps and .imshow().

You can get a more comprehensive report on the classification with classification_report():

>>> print(classification_report(y, model.predict(x)))
              precision    recall  f1-score   support

           0       1.00      0.75      0.86         4
           1       0.86      1.00      0.92         6

    accuracy                           0.90        10
   macro avg       0.93      0.88      0.89        10
weighted avg       0.91      0.90      0.90        10

This function also takes the actual and predicted outputs as arguments. It returns a report on the classification as a dictionary if you provide output_dict=True or a string otherwise.
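
For example, here's a quick sketch of the dictionary form, which is convenient when you want to access the metrics programmatically:

report_dict = classification_report(y, model.predict(x), output_dict=True)
print(report_dict['accuracy'])        # 0.9
print(report_dict['1']['precision'])  # about 0.86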

Note: It’s usually better to evaluate your model with the data you didn’t use for training. That’s how you avoid bias and detect overfitting. You’ll see an example later in this tutorial.

For more information on LogisticRegression, check out the official documentation. In addition, scikit-learn offers a similar class LogisticRegressionCV, which is more suitable for cross-validation. You can also check out the official documentation to learn more about classification reports and confusion matrices.

Improve the Model

You can improve your model by setting different parameters. For example, let’s work with the regularization strength C equal to 10.0, instead of the default value of 1.0:

model = LogisticRegression(solver='liblinear', C=10.0, random_state=0)
model.fit(x, y)

Now you have another model with different parameters. It’s also going to have a different probability matrix and a different set of coefficients and predictions:

>>> model.intercept_
array([-3.51335372])
>>> model.coef_
array([[1.12066084]])
>>> model.predict_proba(x)
array([[0.97106534, 0.02893466],
       [0.9162684 , 0.0837316 ],
       [0.7810904 , 0.2189096 ],
       [0.53777071, 0.46222929],
       [0.27502212, 0.72497788],
       [0.11007743, 0.88992257],
       [0.03876835, 0.96123165],
       [0.01298011, 0.98701989],
       [0.0042697 , 0.9957303 ],
       [0.00139621, 0.99860379]])
>>> model.predict(x)
array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

As you can see, the absolute values of the intercept 𝑏₀ and the coefficient 𝑏₁ are larger. This is the case because the larger value of C means weaker regularization, or weaker penalization related to high values of 𝑏₀ and 𝑏₁.

Different values of 𝑏₀ and 𝑏₁ imply a change of the logit 𝑓(𝑥), different values of the probabilities 𝑝(𝑥), a different shape of the regression line, and possibly changes in other predicted outputs and classification performance. The boundary value of 𝑥 for which 𝑝(𝑥)=0.5 and 𝑓(𝑥)=0 is higher now. It’s above 3. In this case, you obtain all true predictions, as shown by the accuracy, confusion matrix, and classification report:

>>> model.score(x, y)
1.0
>>> confusion_matrix(y, model.predict(x))
array([[4, 0],
       [0, 6]])
>>> print(classification_report(y, model.predict(x)))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         4
           1       1.00      1.00      1.00         6

    accuracy                           1.00        10
   macro avg       1.00      1.00      1.00        10
weighted avg       1.00      1.00      1.00        10

The score (or accuracy) of 1 and the zeros in the lower-left and upper-right fields of the confusion matrix indicate that the actual and predicted outputs are the same. That’s also shown with the figure below:

Result of Logistic Regression

This figure illustrates that the estimated regression line now has a different shape and that the fourth point is correctly classified as 0. There isn’t a red ×, so there is no wrong prediction.

Logistic Regression in Python With scikit-learn: Example 2

Let’s solve another classification problem. It’s similar to the previous one, except that the output differs in the second value. The code is similar to the previous case:

# Step 1: Import packages, functions, and classes
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Step 2: Get data
x = np.arange(10).reshape(-1, 1)
y = np.array([0, 1, 0, 0, 1, 1, 1, 1, 1, 1])

# Step 3: Create a model and train it
model = LogisticRegression(solver='liblinear', C=10.0, random_state=0)
model.fit(x, y)

# Step 4: Evaluate the model
p_pred = model.predict_proba(x)
y_pred = model.predict(x)
score_ = model.score(x, y)
conf_m = confusion_matrix(y, y_pred)
report = classification_report(y, y_pred)

This classification code sample generates the following results:

>>> print('x:', x, sep='\n')
x:
[[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]
>>> print('y:', y, sep='\n', end='\n\n')
y:
[0 1 0 0 1 1 1 1 1 1]

>>> print('intercept:', model.intercept_)
intercept: [-1.51632619]
>>> print('coef:', model.coef_, end='\n\n')
coef: [[0.703457]]

>>> print('p_pred:', p_pred, sep='\n', end='\n\n')
p_pred:
[[0.81999686 0.18000314]
 [0.69272057 0.30727943]
 [0.52732579 0.47267421]
 [0.35570732 0.64429268]
 [0.21458576 0.78541424]
 [0.11910229 0.88089771]
 [0.06271329 0.93728671]
 [0.03205032 0.96794968]
 [0.0161218  0.9838782 ]
 [0.00804372 0.99195628]]

>>> print('y_pred:', y_pred, end='\n\n')
y_pred: [0 0 0 1 1 1 1 1 1 1]

>>> print('score_:', score_, end='\n\n')
score_: 0.8

>>> print('conf_m:', conf_m, sep='\n', end='\n\n')
conf_m:
[[2 1]
 [1 6]]

>>> print('report:', report, sep='\n')
report:
              precision    recall  f1-score   support

           0       0.67      0.67      0.67         3
           1       0.86      0.86      0.86         7

    accuracy                           0.80        10
   macro avg       0.76      0.76      0.76        10
weighted avg       0.80      0.80      0.80        10

In this case, the score (or accuracy) is 0.8. There are two observations classified incorrectly. One of them is a false negative, while the other is a false positive.

The figure below illustrates this example with eight correct and two incorrect predictions:

Result of Logistic Regression

This figure reveals one important characteristic of this example. Unlike the previous one, this problem is not linearly separable. That means you can’t find a value of 𝑥 and draw a straight line to separate the observations with 𝑦=0 and those with 𝑦=1. There is no such line. Keep in mind that logistic regression is essentially a linear classifier, so you theoretically can’t make a logistic regression model with an accuracy of 1 in this case.

Logistic Regression in Python With StatsModels: Example

You can also implement logistic regression in Python with the StatsModels package. Typically, you want this when you need more statistical details related to models and results. The procedure is similar to that of scikit-learn.

Step 1: Import Packages

All you need to import is NumPy and statsmodels.api:

import numpy as np
import statsmodels.api as sm

Now you have the packages you need.

Step 2: Get Data

You can get the inputs and output the same way as you did with scikit-learn. However, StatsModels doesn't add the intercept 𝑏₀ by default, so you need to include an additional column of ones in x. You do that with add_constant():

x = np.arange(10).reshape(-1, 1)
y = np.array([0, 1, 0, 0, 1, 1, 1, 1, 1, 1])
x = sm.add_constant(x)

add_constant() takes the array x as the argument and returns a new array with the additional column of ones. This is how x and y look:

>>> x
array([[1., 0.],
       [1., 1.],
       [1., 2.],
       [1., 3.],
       [1., 4.],
       [1., 5.],
       [1., 6.],
       [1., 7.],
       [1., 8.],
       [1., 9.]])
>>> y
array([0, 1, 0, 0, 1, 1, 1, 1, 1, 1])

This is your data. The first column of x corresponds to the intercept 𝑏₀. The second column contains the original values of x.

Step 3: Create a Model and Train It

Your logistic regression model is going to be an instance of the class statsmodels.discrete.discrete_model.Logit. This is how you can create one:

>>> model = sm.Logit(y, x)

Note that the first argument here is y, followed by x.

Now, you’ve created your model and you should fit it with the existing data. You do that with .fit() or, if you want to apply L1 regularization, with .fit_regularized():

>>> result = model.fit(method='newton')
Optimization terminated successfully.
         Current function value: 0.350471
         Iterations 7

The model is now ready, and the variable result holds useful data. For example, you can obtain the values of 𝑏₀ and 𝑏₁ with .params:

>>> result.params
array([-1.972805  ,  0.82240094])

The first element of the obtained array is the intercept 𝑏₀, while the second is the slope 𝑏₁. For more information, you can look at the official documentation on Logit, as well as .fit() and .fit_regularized().

Step 4: Evaluate the Model

You can use result to obtain the probabilities of the predicted outputs being equal to one:

>>> result.predict(x)
array([0.12208792, 0.24041529, 0.41872657, 0.62114189, 0.78864861,
       0.89465521, 0.95080891, 0.97777369, 0.99011108, 0.99563083])

These probabilities are calculated with .predict(). You can use their values to get the actual predicted outputs:

>>> (result.predict(x) >= 0.5).astype(int)
array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

The obtained array contains the predicted output values. As you can see, 𝑏₀, 𝑏₁, and the probabilities obtained with scikit-learn and StatsModels are different. This is the consequence of applying different iterative and approximate procedures and parameters. However, in this case, you obtain the same predicted outputs as when you used scikit-learn.

You can obtain the confusion matrix with .pred_table():

>>> result.pred_table()
array([[2., 1.],
       [1., 6.]])

This example is the same as when you used scikit-learn because the predicted outputs are equal. The confusion matrices you obtained with StatsModels and scikit-learn differ in the types of their elements (floating-point numbers and integers).

.summary() and .summary2() get output data that you might find useful in some circumstances:

>>> result.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   10
Model:                          Logit   Df Residuals:                        8
Method:                           MLE   Df Model:                            1
Date:                Sun, 23 Jun 2019   Pseudo R-squ.:                  0.4263
Time:                        21:43:49   Log-Likelihood:                -3.5047
converged:                       True   LL-Null:                       -6.1086
                                        LLR p-value:                   0.02248
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.9728      1.737     -1.136      0.256      -5.377       1.431
x1             0.8224      0.528      1.557      0.119      -0.213       1.858
==============================================================================
"""
>>> result.summary2()
<class 'statsmodels.iolib.summary2.Summary'>
"""
                        Results: Logit
===============================================================
Model:              Logit            Pseudo R-squared: 0.426   
Dependent Variable: y                AIC:              11.0094
Date:               2019-06-23 21:43 BIC:              11.6146
No. Observations:   10               Log-Likelihood:   -3.5047
Df Model:           1                LL-Null:          -6.1086
Df Residuals:       8                LLR p-value:      0.022485
Converged:          1.0000           Scale:            1.0000  
No. Iterations:     7.0000                                     
-----------------------------------------------------------------
          Coef.    Std.Err.      z      P>|z|     [0.025   0.975]
-----------------------------------------------------------------
const    -1.9728     1.7366   -1.1360   0.2560   -5.3765   1.4309
x1        0.8224     0.5281    1.5572   0.1194   -0.2127   1.8575
===============================================================

"""

These are detailed reports with values that you can obtain with appropriate methods and attributes. For more information, check out the official documentation related to LogitResults.

Logistic Regression in Python: Handwriting Recognition

The previous examples illustrated the implementation of logistic regression in Python, as well as some details related to this method. The next example will show you how to use logistic regression to solve a real-world classification problem. The approach is very similar to what you’ve already seen, but with a larger dataset and several additional concerns.

This example is about image recognition. To be more precise, you'll work on the recognition of handwritten digits. You'll use a dataset with 1797 observations, each of which is an image of one handwritten digit. Each image is 64 pixels in total, with a width of 8 px and a height of 8 px.

Note: To learn more about this dataset, check the official documentation.

The inputs (𝐱) are vectors with 64 dimensions or values. Each input vector describes one image. Each of the 64 values represents one pixel of the image. The input values are the integers between 0 and 16, depending on the shade of gray for the corresponding pixel. The output (𝑦) for each observation is an integer between 0 and 9, consistent with the digit on the image. There are ten classes in total, each corresponding to one digit.

Step 1: Import Packages

You’ll need to import Matplotlib, NumPy, and several functions and classes from scikit-learn:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

That’s it! You have all the functionality you need to perform classification.

Step 2a: Get Data

You can grab the dataset directly from scikit-learn with load_digits(). It returns a tuple of the inputs and output:

x, y = load_digits(return_X_y=True)

Now you have the data. This is how x and y look:

>>> x
array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])
>>> y
array([0, 1, 2, ..., 8, 9, 8])

That’s your data to work with. x is a multi-dimensional array with 1797 rows and 64 columns. It contains integers from 0 to 16. y is an one-dimensional array with 1797 integers between 0 and 9.

Step 2b: Split Data

It’s a good and widely-adopted practice to split the dataset you’re working with into two subsets. These are the training set and the test set. This split is usually performed randomly. You should use the training set to fit your model. Once the model is fitted, you evaluate its performance with the test set. It’s important not to use the test set in the process of fitting the model. This approach enables an unbiased evaluation of the model.

One way to split your dataset into training and test sets is to apply train_test_split():

x_train, x_test, y_train, y_test =\
    train_test_split(x, y, test_size=0.2, random_state=0)

train_test_split() accepts x and y. It also takes test_size, which determines the size of the test set, and random_state to define the state of the pseudo-random number generator, as well as other optional arguments. This function returns a list with four arrays:

  1. x_train: the part of x used to fit the model
  2. x_test: the part of x used to evaluate the model
  3. y_train: the part of y that corresponds to x_train
  4. y_test: the part of y that corresponds to x_test

Once your data is split, you can forget about x_test and y_test until you define your model.

Step 2c: Scale Data

Standardization is the process of transforming data in a way such that the mean of each column becomes equal to zero, and the standard deviation of each column is one. This way, you obtain the same scale for all columns. Take the following steps to standardize your data:

  1. Calculate the mean and standard deviation for each column.
  2. Subtract the corresponding mean from each element.
  3. Divide the obtained difference by the corresponding standard deviation.

It’s a good practice to standardize the input data that you use for logistic regression, although in many cases it’s not necessary. Standardization might improve the performance of your algorithm. It helps if you need to compare and interpret the weights. It’s important when you apply penalization because the algorithm is actually penalizing against the large values of the weights.

You can standardize your inputs by creating an instance of StandardScaler and calling .fit_transform() on it:

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)

.fit_transform() fits the instance of StandardScaler to the array passed as the argument, transforms this array, and returns the new, standardized array. Now, x_train is a standardized input array.

Step 3: Create a Model and Train It

This step is very similar to the previous examples. The only difference is that you use x_train and y_train subsets to fit the model. Again, you should create an instance of LogisticRegression and call .fit() on it:

model = LogisticRegression(solver='liblinear', C=0.05, multi_class='ovr',
                           random_state=0)
model.fit(x_train, y_train)

When you’re working with problems with more than two classes, you should specify the multi_class parameter of LogisticRegression. It determines how to solve the problem:

The last statement yields the following output since .fit() returns the model itself:

LogisticRegression(C=0.05, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2', random_state=0,
                   solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

These are the parameters of your model. It’s now defined and ready for the next step.

Step 4: Evaluate the Model

You should evaluate your model similar to what you did in the previous examples, with the difference that you’ll mostly use x_test and y_test, which are the subsets not applied for training. If you’ve decided to standardize x_train, then the obtained model relies on the scaled data, so x_test should be scaled as well with the same instance of StandardScaler:

x_test = scaler.transform(x_test)

That’s how you obtain a new, properly-scaled x_test. In this case, you use .transform(), which only transforms the argument, without fitting the scaler.

You can obtain the predicted outputs with .predict():

y_pred = model.predict(x_test)

The variable y_pred is now bound to an array of the predicted outputs. Note that you use x_test as the argument here.

You can obtain the accuracy with .score():

>>> model.score(x_train, y_train)
0.964509394572025
>>> model.score(x_test, y_test)
0.9416666666666667

Actually, you can get two values of the accuracy, one obtained with the training set and the other with the test set. It might be a good idea to compare the two, since a training set accuracy that's much higher might indicate overfitting. The test set accuracy is more relevant for evaluating the performance on unseen data since it's not biased.

You can get the confusion matrix with confusion_matrix():

>>> confusion_matrix(y_test, y_pred)
array([[27,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0, 32,  0,  0,  0,  0,  1,  0,  1,  1],
       [ 1,  1, 33,  1,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  1, 28,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0, 29,  0,  0,  1,  0,  0],
       [ 0,  0,  0,  0,  0, 39,  0,  0,  0,  1],
       [ 0,  1,  0,  0,  0,  0, 43,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 39,  0,  0],
       [ 0,  2,  1,  2,  0,  0,  0,  1, 33,  0],
       [ 0,  0,  0,  1,  0,  1,  0,  2,  1, 36]])

The obtained confusion matrix is large. In this case, it has 100 numbers. This is a situation when it might be really useful to visualize it:

cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.set_xlabel('Predicted outputs', fontsize=14, color='black')
ax.set_ylabel('Actual outputs', fontsize=14, color='black')
ax.xaxis.set(ticks=range(10))
ax.yaxis.set(ticks=range(10))
ax.set_ylim(9.5, -0.5)
for i in range(10):
    for j in range(10):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='white')
plt.show()

The code above produces the following figure of the confusion matrix:

Classification Confusion Matrix

This is a heatmap that illustrates the confusion matrix with numbers and colors. You can see that the shades of purple represent small numbers (like 0, 1, or 2), while green and yellow show much larger numbers (27 and above).

The numbers on the main diagonal (27, 32, …, 36) show the number of correct predictions from the test set. For example, there are 27 images with zero, 32 images of one, and so on that are correctly classified. Other numbers correspond to the incorrect predictions. For example, the number 1 in the third row and the first column shows that there is one image with the number 2 incorrectly classified as 0.

Finally, you can get the report on classification as a string or dictionary with classification_report():

>>> print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        27
           1       0.89      0.91      0.90        35
           2       0.94      0.92      0.93        36
           3       0.88      0.97      0.92        29
           4       1.00      0.97      0.98        30
           5       0.97      0.97      0.97        40
           6       0.98      0.98      0.98        44
           7       0.91      1.00      0.95        39
           8       0.94      0.85      0.89        39
           9       0.95      0.88      0.91        41

    accuracy                           0.94       360
   macro avg       0.94      0.94      0.94       360
weighted avg       0.94      0.94      0.94       360

This report shows additional information, like the support and precision of classifying each digit.

Beyond Logistic Regression in Python

Logistic regression is a fundamental classification technique. It's a relatively uncomplicated linear classifier. Despite its simplicity and popularity, there are cases (especially with highly complex models) where logistic regression doesn't work well. In such circumstances, you can use other classification techniques such as k-nearest neighbors, naive Bayes, support vector machines, decision trees, random forests, gradient boosting, and neural networks.

Fortunately, there are several comprehensive Python libraries for machine learning that implement these techniques. For example, the package you’ve seen in action here, scikit-learn, implements all of the above-mentioned techniques, with the exception of neural networks.

For all these techniques, scikit-learn offers suitable classes with methods like model.fit(), model.predict_proba(), model.predict(), model.score(), and so on. You can combine them with train_test_split(), confusion_matrix(), classification_report(), and others.
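
For example, here's an illustrative sketch (not covered in this tutorial) of swapping a random forest into the same workflow you used for the handwritten digits:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)
print(forest.score(x_test, y_test))  # accuracy on the test set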

Neural networks (including deep neural networks) have become very popular for classification problems. Libraries like TensorFlow, PyTorch, or Keras offer suitable, performant, and powerful support for these kinds of models.

Conclusion

You now know what logistic regression is and how you can implement it for classification with Python. You’ve used many open-source packages, including NumPy, to work with arrays and Matplotlib to visualize the results. You also used both scikit-learn and StatsModels to create, fit, evaluate, and apply models.

Generally, logistic regression in Python has a straightforward and user-friendly implementation. It usually consists of these steps:

  1. Import packages, functions, and classes
  2. Get data to work with and, if appropriate, transform it
  3. Create a classification model and train (or fit) it with existing data
  4. Evaluate your model to see if its performance is satisfactory
  5. Apply your model to make predictions

You’ve come a long way in understanding one of the most important areas of machine learning! If you have questions or comments, then please put them in the comments section below.



January 13, 2020 02:00 PM UTC


Abhijeet Pal

Python Program To Display Characters From A to Z

Problem Definition

Create a Python program to display all alphabets from A to Z.

Solution

This article will go through two Pythonic ways to generate the alphabet.

Using the string Module

Python's built-in string module comes with a number of useful string constants, one of them being string.ascii_lowercase.

Program:

import string
for i in string.ascii_lowercase:
    print(i, end=" ")

Output:

a b c d e f g h i j k l m n o p q r s t u v w x y z

string.ascii_lowercase returns all lowercase letters as a single string, abcdefghijklmnopqrstuvwxyz, so the program simply runs a for loop over the characters of the string and prints them. The same approach works for the uppercase letters A to Z.

Program:

import string
for i in string.ascii_uppercase:
    print(i, end=" ")

Output:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Using the chr() Function

The chr() function in Python returns a Unicode character for the provided ASCII value, hence chr(97) returns "a". To learn more about chr() …
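
Since the excerpt is cut off at this point, here's a minimal sketch (an assumption about where the post is heading, not the post's own code) of generating A to Z with chr():

# ord('A') is 65 and ord('Z') is 90, so chr() over this range yields A to Z
for code in range(ord('A'), ord('Z') + 1):
    print(chr(code), end=" ")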

The post Python Program To Display Characters From A to Z appeared first on Django Central.

January 13, 2020 01:16 PM UTC


Reuven Lerner

Last chance for Weekly Python Exercise A1

This is a final reminder that in a few hours, registration will close for Weekly Python Exercise A1: Data structures for beginners.

Again and again, WPE participants have said that Weekly Python Exercise was the boost they needed to become more familiar with Python.

Now, if Python fluency is your goal, then that's great. But for most people, Python fluency isn't the goal — it's a means to an end. And to what end?

The $100 you spend for this 15-week course will more than pay for itself in future earnings. But if you find that the price is out of reach, remember that I give discounts to students, seniors/pensioners/retirees, and anyone living outside of the world’s 30 richest countries. If this applies to you, then just e-mail me, and I’ll gladly give you the appropriate coupon code.

But don’t delay, because the first exercise will soon be going out to subscribers! And I won’t be offering WPE A1 again until 2021.

Click here to join Weekly Python Exercise A1: Data structures for beginners

The post Last chance for Weekly Python Exercise A1 appeared first on Reuven Lerner.

January 13, 2020 12:00 PM UTC


Ned Batchelder

Bug #915: solved!

Yesterday I pleaded, Bug #915: please help! It got posted to Hacker News, where Robert Xiao (nneonneo) did some impressive debugging and found the answer.

The user’s code used mocks to simulate an OSError when trying to make temporary files (source):

with patch('tempfile._TemporaryFileWrapper') as mock_ntf:
    mock_ntf.side_effect = OSError()

Inside tempfile.NamedTemporaryFile, the error handling misses the possibility that _TemporaryFileWrapper will fail (source):

(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
try:
    file = _io.open(fd, mode, buffering=buffering,
                    newline=newline, encoding=encoding, errors=errors)

    return _TemporaryFileWrapper(file, name, delete)
except BaseException:
    _os.unlink(name)
    _os.close(fd)
    raise

If _TemporaryFileWrapper fails, the file descriptor fd is closed, but the file object referencing it still exists. Eventually, it will be garbage collected, and the file descriptor it references will be closed again.

But file descriptors are just small integers which will be reused. The failure in bug 915 is that the file descriptor did get reused, by SQLite. When the garbage collector eventually reclaimed the file object leaked by NamedTemporaryFile, it closed a file descriptor that SQLite was using. Boom.

There are two improvements to be made here. First, the user code should be mocking public functions, not internal details of the Python stdlib. In fact, the variable is already named mock_ntf as if it had been a mock of NamedTemporaryFile at some point.

NamedTemporaryFile would be a better mock because that is the function being used by the user’s code. Mocking _TemporaryFileWrapper is relying on an internal detail of the standard library.
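
A minimal sketch of what that better-targeted mock might look like (not taken from the user’s actual code):

from unittest.mock import patch

# Simulate "temporary file creation fails" at the public API boundary,
# so no real file descriptor is created and leaked.
with patch('tempfile.NamedTemporaryFile') as mock_ntf:
    mock_ntf.side_effect = OSError()
    ...  # exercise the code under test here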

The other improvement is to close the leak in NamedTemporaryFile. That request is now bpo39318. As it happens, the leak had also been reported as bpo21058 and bpo26385.

Lessons learned:

I named Robert Xiao at the top, but lots of people chipped in effort to help get to the bottom of this. ikanobori posted it to Hacker News in the first place. Chris Caron reported the original #915 and stuck with the process as it dragged on. Thanks everybody.

January 13, 2020 11:15 AM UTC


Codementor

5 Best Text Editors for Programmers

The 5 Best Text Editors for Programmers. 1. Atom text editor 2. Vim text editor 3. VS Code text editor 4. Notepad++ text editor 5. Sublime text editor. It is essential for Software Developers and…

January 13, 2020 08:32 AM UTC

Top 3 Best Python Books You Should Read in 2019

These 3 best python books cover the python programming language. They contain quality content on python 3, data science, and machine learning techniques used in python. Python is a widely used…

January 13, 2020 08:27 AM UTC


IslandT

Small python application which will remove duplicate files from the windows 10 os

I am glad to announce that the duplicate-file remover written in Python is finally complete and has been uploaded to GitHub for all of you to enjoy. This is free software and it will always remain free. I would love to create a Linux version of it, but I do not have a Linux computer, so at the moment the software is for Windows users only. It is packaged so that you can simply download the setup.exe file, install the program, and start it up to search for and remove the duplicate files on your computer. Here are the steps to search for and remove duplicates:

All right, here is the step-by-step guide on how to use this Python application.

Click on the Remove button to select a file or files
Hold Shift to select all the files whose duplicates you want to find, then click the Open button
Select the folder from which you want to remove the duplicate files, then click the Select Folder button
That is it; now just sit back and watch the program do its job

If you spot any bug in this application, leave a comment below this post and I will fix it as fast as possible. If you are not sure how to use the program, create a new folder, copy some files into it, and practice deleting the duplicates by following the steps shown above. The program is not perfect, so your feedback is very important for improving its quality.

This program is for Windows users only, and you may need Python installed on your desktop before you can use it. If you want to run the program from source, download the entire package, open the Windows command prompt, and type ‘python path/to/Multitas.py’ to start the application on your Windows laptop; it runs with no problem at all! The latest version: download the setup.exe file to your Windows laptop and install the application by following the setup instructions. In the future, updates will ship as setup files with the version number in the name, for example setup1.exe, setup2.exe, and so on.

What is new in this latest version: a move-file feature has been added. You can now move a file from one folder to another by clicking the Move button, selecting a file, and then selecting the folder you want to move that file into.

I have just created a python program which can help you to remove the duplicate files in any folder from r/Python

The above is the latest version of the program.

The application has been uploaded to GitHub and you can now download the setup.exe file through this link. After downloading, run setup.exe to install the program and start using it.

January 13, 2020 06:43 AM UTC


Mike Driscoll

PyDev of the Week: Tyler Reddy

This week we welcome Tyler Reddy (@Tyler_Reddy) as our PyDev of the Week! Tyler is a core developer of Scipy and Numpy. He has also worked on the MDAnalysis library, which is for particle physics simulation analysis. If you’re interested in seeing some of his contributions, you can check out his Github profile. Let’s spend some time getting to know Tyler better!

Tyler Reddy

Can you tell us a little about yourself (hobbies, education, etc):

I grew up in Dartmouth, Nova Scotia, Canada and stayed there until my late twenties. My Bachelor and PhD degrees were both in biochemistry, focused on structural biology. I did travel a lot for chess, winning a few notable tournaments in my early teen years and achieving a master rating in Canada by my late teens. Dartmouth is also known as the “City of Lakes,” and I grew up paddling on the nearby Lake Banook. In the cold Canadian Winter the lake would freeze over and training would switch to a routine including distance running—this is where my biggest “hobby” really took off. I still run about 11 miles daily in the early morning.

I did an almost six year post-doc in Oxford, United Kingdom. I had started to realize during my PhD that my skill set was better suited to computational work than work on the lab bench. Formally, I was still a biologist while at Oxford, but it was becoming clear that my contributions were starting to look a lot more like applied computer science, and computational geometry in particular. I was recruited to Los Alamos National Laboratory to work on viruses (the kind that make a person, not a computer, sick), but ultimately my job here has evolved into that of an applied computer scientist, and nothing beats distance running in beautiful Santa Fe, NM.

Why did you start using Python?

I think it started during my PhD with Jan Rainey in Canada. He was pretty good about letting me explore ways to use programming to make research processes more efficient, even when I might have been better off in the short term by “just doing the science.” Eventually my curiosity grew to the point where I just read one of the editions of Mark Lutz’s “Learning Python” from cover to cover. I very rarely used the terminal to test things out while reading the book—I just kept going through chapters feverishly—I suppose Python is pretty readable! I still prefer reading books to random experimenting when approaching new problems/languages, though I don’t always have the time/luxury to do so. I remember reading Peter Seibel’s “Coders at Work,” and making a list of all the books the famous programmers interviewed there were talking about.

What other programming languages do you know and which is your favorite?

During my second postdoc at Los Alamos I read Stephen Kochan’s “Programming in C.” For that book I did basically do every single exercise in the terminal as I read it—I found that far more necessary with C than Python to get the ideas to stick. I had made an earlier attempt at reading the classic “The C Programming Language” book by K&R and found it rather hard to learn from! I thought I was doing something wrong since it was described as a classic in “Coders at Work,” I think. I’ll probably never go back to that book now, but I certainly get a lot of mileage out of my C knowledge these days.

I did a sabbatical at UC Berkeley with Stéfan van der Walt and the NumPy core team, working on open source full time for a year. NumPy is written in C under the hood, so it was essential I could at least read the source. A lot of the algorithm implementations in SciPy that I review or write are written in the hybrid Cython (C/Python) language to speed up the inner loops, etc.

I’ve also written a fair bit of tcl, and I write a lot of CMake code these days at work.

Python easily wins out as my favorite language, but C isn’t too far behind. I have to agree with the high-profile authors in “Coders at Work” who described C as “beautiful” (or similar) and C++ as, well, something else. Indeed, the NumPy team wrote a custom type templating language in C, processed by Python, instead of using C++. That said, Bjarne did visit UC Berkeley while I was there and it sounds like C++ may be taking a few more ideas from the Python world in the future!

What projects are you working on now?

I’m the release manager for SciPy, which has been my main long-term open source project focus in recent years. I’ve been trying really hard to improve the computational geometry algorithms available in SciPy—both in terms of adding new ones from the recent mathematics literature and improving the ones we already have.

A lot of my time goes into code review now though. I don’t mind—that’s kind of how it works—if I’m going to expect the other core devs and community to review my code and help me get over the finish line I should be ready to do the same for them. Indeed, as funding is now starting to show up a bit more for some OSS projects we’re quickly realizing that just dumping a bunch of new code on the core team/community will quickly cause a problem—review bandwidth is really important.

I’ve had a few rejected proposals for funding for computational geometry work in scipy.spatial, but I will keep trying! We recently wrote a paper for SciPy, which was a lot of work with such a big group/history/body of code, but probably worth it in the end.

I also try to stay involved in NumPy code review, especially for infrastructure-related changes (wheels, CI testing, etc.) and some interest I have in datetime code.

My open source journey started with the MDAnalysis library for particle physics simulation analysis. I try to help out there too, but just keeping up with the emails/notifications for 3+ OSS projects is extremely hard in mostly free time. I try to track notifications/stay somewhat involved in what is going on with OpenBLAS and asv as well, though it feels like I’m failing to keep up most of the time!

Which Python libraries are your favorite (core or 3rd party)?

I think hypothesis is probably underrated—some libraries are hesitant to incorporate it into their testing frameworks, but I think the property-based testing has real potential to catch scenarios humans would have a hard time anticipating, or at least that would take a long time to properly plan for. I find that hypothesis almost always adds a few useful test cases I hadn’t thought of that will require special error handling, for example.
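
As a rough illustration of the kind of property-based test hypothesis enables (the property and function here are made up, not from the interview):

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # hypothesis generates many lists, including edge cases like [] and
    # lists with duplicates, that a human might not think to write out
    assert sorted(sorted(xs)) == sorted(xs)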

Coverage.py is pretty important for showing line coverage, but I wish the broader CI testing ecosystem had more robust/diverse options for displaying coverage data and aggregating results from Python and compiled language source code. A number of the larger projects I work on have issues with reliability of codecov. The Azure Pipelines service has an initial coverage offering—we’ll see if that really takes off. It will be neat if we can soon mouse over a line of tested code and see the name of the test that covers it. I think I saw somewhere that this will perhaps soon be possible.

How did you get involved with SciPy?

My first substantial contribution was the implementation of Spherical Voronoi diagram calculation in scipy.spatial.SphericalVoronoi. I was working on physics simulations of spherical influenza viruses at the time, and wanted a reliable way to determine the amount of surface area that molecules were occupying. I was fortunate that my postdoc supervisor at the time, Mark Sansom at Oxford, allowed me to explore my interest in computational geom- etry algorithms like that. I gave a talk at what I believe was the second annual PyData London conference about the algorithm implementation, which was still incomplete at the time, and received some really helpful feedback from two expert computational geometers—one was an academic, the other was loosely associated with the CGAL team.
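
For reference, a minimal sketch of what calling that class looks like (the input points here are arbitrary unit vectors, not data from his research):

import numpy as np
from scipy.spatial import SphericalVoronoi

# six points on the unit sphere (the vertices of an octahedron)
points = np.array([[0, 0, 1], [0, 0, -1], [1, 0, 0],
                   [0, 1, 0], [0, -1, 0], [-1, 0, 0]], dtype=float)
sv = SphericalVoronoi(points, radius=1.0, center=np.zeros(3))
sv.sort_vertices_of_regions()
print(sv.regions)   # vertex indices of each spherical Voronoi region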

I really enjoyed the process of working with the SciPy team—I remember the first person to ever review my code there was CJ Carey, a computer scientist who is now working at Google. I was pretty intimidated, but they were quite welcoming and I was probably a little too excited when Ralf Gommers, the chair of the steering council, invited me to join the core team. I’ve been hooked ever since!

What are the pros and cons of using SciPy?

You can usually depend on SciPy to have a pretty stable API over time—we generally take changes in behavior quite seriously. A break in backwards compatibility would normally require a long deprecation cycle. The quality/robustness expected for algorithms implemented in SciPy is generally quite high and the library is well-tested, so it is usually best to use SciPy if an algorithm is already available in it. The documentation is of reasonably high quality and constantly improving, and many common questions are answered on, for example, StackOverflow.

If you want to play with experimental algorithms or advocate for a rapid change in behavior, SciPy may not be your first choice. Early adoption of immature technologies is usually not likely to happen. Stability and reliability are important at the base of the Python Scientific Computing ecosystem.

How will SciPy / NumPy be changing in the future?

The amount of activity/progress happening for these two projects is pretty staggering. The official response is usually to take a look at the roadmaps for NumPy and SciPy.

A few things that stand out off the top of my head: improving support for using different backends to perform calculations with NumPy and SciPy (for example, using GPUs or distributed infrastructure), and making it easier to use custom dtypes. You might want to speed up code with Cython or Numba or Pythran and some thought may be required for NumPy and SciPy to remain well-suited for each of those.

I think I’m starting to see indications that binary wheels will eventually become available for PowerPC and ARM architectures, but my impression was that there were still some challenges there.

I think you’ll probably see better published papers/citation targets for these two projects in the future as well. With all the efforts underway to get grants to fund these projects I think we’ll continue to see periods where there will be funded developers driving things forward more quickly, as has happened with the grant for NumPy at UC BIDS.

Thanks for doing the interview, Tyler!

The post PyDev of the Week: Tyler Reddy appeared first on The Mouse Vs. The Python.

January 13, 2020 06:05 AM UTC


Mike C. Fletcher

Started work on getting py-spy/speedscope in RunSnakeRun

So, having finally written down the thoughts on a carbon tax that kept distracting me from actually working on Open Source, I got a bit of Open Source work done on the last night of the vacation.

What I started work on was getting a sampling profiler format supported, and for that I chose py-spy, particularly its speedscope export format. The work is still early days, but it does seem to work in my initial test cases.
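
For reference, generating a speedscope file with py-spy looks roughly like this (the file and script names are just placeholders):

py-spy record --format speedscope -o profile.speedscope.json -- python myscript.py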

At the moment I'm only supporting the "sampled" mode (vs the evented mode, which is closer to coldshot) for the format. I haven't implemented the module/location tree-view yet. More annoyingly, the sample format doesn't include start-of-function information, so there's no differentiation between two functions with the same name in the same file when separating out the results. The results are also a bit confusing when you're used to the cProfile style, as the boxes are stack-line based, so you'll see separate boxes for funcname:32 and funcname:34 children next to each other even though it's the same child function involved. That's confusing enough that I'll likely group children that are calls to the same function (regardless of which line in the function they were in during the sample) into the same box.

The speedscope format would also make it pretty easy to do per-line heat-maps in the file, and obviously (given it's what speedscope normally does), a flame-graph would be a reasonable display as well. Anyway, when I have some more vacation time I can look into further work on it.

January 13, 2020 05:47 AM UTC


Codementor

Python for Beginners: Making Your First Socket Program (Client & Server Communication)

How to send a text file between client and server: Python simple example and source code download. See the video for more info!

January 13, 2020 05:39 AM UTC

January 12, 2020


Jaime Buelta

ffind v1.2.0 released!

The new version of ffind v1.2.0 is available on GitHub and PyPI. This version includes the ability to configure defaults via environment variables and to force case insensitivity in searches. You can upgrade with pip install ffind --upgrade. This will be the latest version to support Python 2.6. Happy searching!

January 12, 2020 09:32 PM UTC

Python Automation Cookbook

So, great news, I wrote a book and it’s available! It’s called Python Automation Cookbook, and it’s aimed at people who already know a bit of Python (not necessarily developers only) but would like to use it to automate common tasks like searching files, creating different kinds of documents, adding graphs, sending emails, text messages,… Read More

January 12, 2020 08:55 PM UTC

Hands-On Docker for Microservices with Python Book

Last year I published a book, and I liked the experience, so I wrote another! The book is called Hands-On Docker for Microservices with Python, and it goes through the different steps to move from a Monolith Architecture towards a Microservices one. It is written from a very practical standpoint, and aims to cover… Read More

January 12, 2020 08:54 PM UTC


Ned Batchelder

Bug #915: please help!

Updated: this was solved on Hacker News. Details in Bug #915: solved!

I just released coverage.py 5.0.3, with two bug fixes. There was another bug I really wanted to fix, but it has stumped me. I’m hoping someone can figure it out.

Bug #915 describes a disk I/O failure. Thanks to some help from Travis support, Chris Caron has provided instructions for reproducing it in Docker, and they work: I can generate disk I/O errors at will. What I can’t figure out is what coverage.py is doing wrong that causes the errors.

To reproduce it, start a Travis-based docker image:

cid=$(docker run -dti --privileged=true --entrypoint=/sbin/init \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    travisci/ci-sardonyx:packer-1542104228-d128723)
docker exec -it $cid /bin/bash

Then in the container, run these commands:

su - travis
git clone --branch=nedbat/debug-915 https://github.com/nedbat/apprise-api.git
cd apprise-api
source ~/virtualenv/python3.6/bin/activate
pip install tox
tox -e bad,good

This will run two tox environments, called good and bad. Bad will fail with a disk I/O error, good will succeed. The difference is that bad uses the pytest-cov plugin, good does not. Two detailed debug logs will be created: debug-good.txt and debug-bad.txt. They show what operations were executed in the SqliteDb class in coverage.py.

The Big Questions: Why does bad fail? What is it doing at the SQLite level that causes the failure? And most importantly, what can I change in coverage.py to prevent the failure?

Some observations and questions:

If you come up with answers to any of these questions, I will reward you somehow. I am also eager to chat if that would help you solve the mysteries. I can be reached on email, Twitter, as nedbat on IRC, or in Slack. Please get in touch if you have any ideas. Thanks.

January 12, 2020 03:17 PM UTC

January 11, 2020


Weekly Python StackOverflow Report

(ccx) stackoverflow python report

These are the ten most rated questions at Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2020-01-11 19:37:20 GMT


  1. Return or yield from a function that calls a generator? - [18/5]
  2. re.findall('(ab|cd)', string) vs re.findall('(ab|cd)+', string) - [15/3]
  3. What do * (single star) and / (slash) do as independent parameters? - [13/2]
  4. Comparing lists in two columns row-wise efficiently - [10/4]
  5. Why are some Python exceptions lower-case? - [8/1]
  6. Python split column with regex - [6/2]
  7. How to remove extra whitespace from image in opencv? - [6/2]
  8. Subregions of boolean 2d array - [6/2]
  9. mypy: Why is "int" a subtype of "float"? - [6/1]
  10. Looping through multiple arrays & concatenating values in pandas - [5/3]

January 11, 2020 07:37 PM UTC


Reinout van Rees

Github basic auth deprecation and jenkins

I have been getting periodic deprecation notice emails from github for the last few months:

Hi @nenskins,

You recently used a password to access an endpoint through the GitHub API using okhttp/2.7.5. We will deprecate basic authentication using password to this endpoint soon:

https://api.github.com/

We recommend using a personal access token (PAT) with the appropriate scope to access this endpoint instead. Visit https://github.com/settings/tokens for more information.

Thanks, The GitHub Team

Hm, that @nenskins user, that is our old jenkins instance talking to github somehow. Apparently through basic auth. Only... where? Most of the github traffic seemed to use just an access token. Jenkins calls that the "secret text" credential type. Basic auth corresponds to the "username with password" type in jenkins.

What it turned out to be was the github branch source plugin. This periodically looks at our github organisation to see if there are new projects or new branches that it missed. Normally github tells our jenkins when there's a new project or pull request or so.

Ok, on to the jenkins settings for my organisation. The confusing thing here is that the "credentials" setting says this:

Note that only "username with password" credentials are
supported. Existing credentials of other kinds will be filtered out. This
is because jenkins exercises GitHub API, and this last one does not
support other ways of authentication.

Huh? Github is refusing user/password basic auth, yet that is the only thing this plugin supports? I updated every plugin, but the problem still persisted.

I only got it after reading this bug report and especially this comment:

Isn't that message saying that you can continue to use basic auth so long as instead of using your actual password you use a personal access token. Generate a personal access token from the GitHub "Settings" page and store that personal access token in the Jenkins username / password credential as the password. Place your username as the username. Check that it works. It has been working that way for me.

Ah! So "github is refusing user/password basic auth" really means "github is refusing user/password basic auth". Using an access token instead of your password is actually fine.

The info in jenkins on those credentials actually mention that somewhat:

If your organization contains private repositories, then you need to
specify a credential from an user who have access to those
repositories. This is done by creating a "username with password"
credential where the password is GitHub personal access tokens. The
necessary scope is "repo".

So I visited https://github.com/settings/tokens and generated a new token with full "repo" rights (this is actually quite restricted in scope, despite the name).

In Jenkins I added a new global username/password credential with the github username + the access token and hurray, everything worked again.

January 11, 2020 06:01 PM UTC

PyGrunn: advanced pytest - Òscar Vilaplana

(One of my summaries of a talk at the 2019 PyGrunn conference).

Imagine being a developer who is woken up at night because your latest commit broke the website. You fix the issue, run the tests for your part of the code (which pass) and push to github. That runs all the tests, and one fails in a completely unrelated piece of the code. But what is happening? Is the test wrong? Is your code wrong? "3 is not 90": what does that mean?

What does it mean that this fails? What is the test's promise? If a test you wrote fails, it should fail beautifully. It should tell exactly what's wrong:

assert num_something == 2, "The number should match the number of added items"

You can use pytest fixtures to at least make the data the test is working with clearer.

You can make fixtures that work as context managers:

@pytest.fixture
def test_with_teardown():
    thing = create_something()  # setup
    yield thing
    thing.destroy()  # teardown

A tip: have fixtures per subsystem. Assuming you have multiple test directories, one per subsystem, give every subsystem its own conftest.py. Different subsystems might get different fixtures even though they use the same name ("product" for instance). This way you can tweak your main fixture items per subsystem (see the layout sketch after the list below).

  • Disadvantage: it is implicit instead of explicit...
  • Advantage: the fixtures can stay minimal. Otherwise your fixture has to support all use cases.
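
A rough sketch of what such a layout might look like (directory, fixture and file names are made up):

tests/
    billing/
        conftest.py       # defines a "product" fixture tailored to billing tests
        test_invoices.py
    shipping/
        conftest.py       # defines its own, smaller "product" fixture
        test_labels.py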

You can parametrize fixtures:

@pytest.fixture(params=["no-user", "disabled-user", "read-only-user"])
def unauthorized_user(request):
    # request.param holds the current value from the params list
    if request.param == "no-user":
        return ...
    if request.param == "disabled-user":
        ...

Tests using that fixture are run three times, once for every possible kind of unauthorized user!

You can do it even more elaborate. You can make a kind of a "build matrix" and use @pytest.mark.parametrize.
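
A minimal sketch of such a matrix (the parameter names and values are made up); stacking the decorators runs the test once per combination:

import pytest

@pytest.mark.parametrize("backend", ["sqlite", "postgres"])
@pytest.mark.parametrize("role", ["admin", "editor", "viewer"])
def test_permissions(role, backend):
    # 3 roles x 2 backends = 6 test runs
    ...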

If every test needs a temporary database or a temporary something, you can pass autouse=True to the fixture; that'll apply it automatically.
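
For example, a small sketch of an autouse fixture (the fixture itself is made up; tmp_path is pytest's built-in temporary directory fixture):

import pytest

@pytest.fixture(autouse=True)
def temporary_database(tmp_path):
    # applied to every test in scope without the test requesting it
    db_file = tmp_path / "test.sqlite"
    yield db_file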

Pytest can help you with mocking, but sometimes you're better off setting up dependency injection. So adding a parameter to the method you're testing to accept some mocked item instead of its regular default.
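
A minimal sketch of that idea (the function and test are invented for illustration):

import datetime

def greeting(now=None):
    # the clock is injected; tests can pass a fixed datetime instead of mocking
    now = now or datetime.datetime.now()
    return "good morning" if now.hour < 12 else "good afternoon"

def test_greeting_in_the_afternoon():
    assert greeting(now=datetime.datetime(2020, 1, 13, 15, 0)) == "good afternoon"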

If you think regular code is more important than the tests: he pro-tests :-)

You need tests because they give you a feeling of safety. If you feel safe, you dare to try things. Tests are a bit of a shared goal inside a team: you and your code want to belong. You want interaction: make sure your tests are communicative and helpful.

January 11, 2020 06:01 PM UTC

PyGrunn: python as a scientist's playground - Peter Kroon

(One of my summaries of a talk at the 2019 PyGrunn conference).

He's a scientist. Quite often, he searches for python packages.

  • If you're writing python packages, you can learn how someone might search for your package.
  • If you don't write python packages, you can learn how to investigate.

Scientists try to solve unsolved problems. When doing it with computers, you basically do three things.

  • Perform simulations.
  • Set up simulations.
  • Analyze results.

Newton said something about "standing on the shoulders of giants". So basically he predicted the python package index! So many libraries to build upon!

A problem is that there is so much software. There are multiple libraries that can handle graphs (directed graphs, not diagrams). He's going to use that as an example.

Rule one: PR is important. If you don't know a package exists, it won't come on the list. Google, github discovery, stackoverflow, scientific literature, friends, pygrunn talks, etc.

A README is critical. Without a good readme: forget it.

The five he found: graph-tool, networkx, igraph, python-graph and scipy.sparse.csgraph.

Rule two: documentation is very important. Docs should showcase the capabilities. This goes beyond explaining it, it should show it.

I must be able to learn how to use your package from the docs. Just some API documentation is not enough, you need examples.

Watch out with technical jargon and terms. On the one hand: as a scientist you're often from a different field and you might not know your way around the terms. On the other hand, you do want to mention those terms to help with further investigation.

Bonus points for references to scientific literature.

Documentation gold standard: scikit-learn!

python-graph has no online docs, so that one's off the shortlist. The other four are pretty OK.

Rule three: it must be python3 compatible. On 1 january 2020 he's going to wipe python2 from all the machines that he has write access to.

All four packages are OK.

Rule four: it must be easy to install. So pypi (or a conda channel). You want to let pip/conda deal with your dependencies. If not, at least list them.

Pure python is desirable. If you need compilation of c/c++/fortran, you need all the build dependencies. This also applies to your dependencies.

He himself is a computer scientist, so he can compile stuff. But most scientists can't really do that.

He himself actually wants to do research: he doesn't want to solve packaging problems.

graph-tool is not on pypi, networkx is pure python. scipy is fortran/c, but provides wheels. igraph has a C core and is not on pypi.

So scipy and networkx are left.

Rule five: it must be versatile. Your package must do everything. If your package does a lot, there are fewer dependencies in the future. And I have to learn fewer packages.

If it doesn't do everything, it might still be ok if it is extendable. He might even open a pull request to add the functionality that he needs.

Note: small projects that solve one problem and solve it well are OK, too.

networkx: does too much to count. Nice. scipy.sparse.csgraph has six functions. So for now, networkx is the package of choice.

The first and third rules are hard rules: if it is a python2-only package it is out and if you can't find a package, you can't find it, period.

Conclusions

  • You need to invest effort to make ME try your package.
  • "My software is so amazing, so you should invest time and effort to use it": NO :-)
  • If it doesn't work in 15 minutes: next candidate.

January 11, 2020 06:01 PM UTC

PyGrunn: data processing and visualisation of tractor data - Erik-Jan Blanksma

(One of my summaries of a talk at the 2019 PyGrunn conference).

He works for Dacom, a firm that writes software to help farmers be more effective. Precision farming is a bit of a buzzword nowadays. You can get public elevation data, you can let someone fly over your fields to take measurements, or a cart can take automatic ground samples. This way you can make a "prescription map" of where to apply more fertilizer and where less will do.

Another source of data is the equipment the farmer uses to drive over his field. As an example, the presentation looks at a potato harvester.

  • Which route did the harvester take through the field?
  • What was the yield (in potatoes per hectare) in all the various spots?

Some tools and libraries that they use:

  • Numpy: very efficient numerical processing. Arrays.
  • Pandas: dataseries.
  • Matplotlib: graph plotting library.
  • Postgis: geographical extension to the postgres databases.

Pandas is handy for reading in data, from csv for instance. It integrates nicely with matplotlib. With a one-liner you can let it create a histogram from the data.

With the .describe() function, you get basic statistics about your data.

Another example: a map (actually a graph, but it looks like a map) with color codes for the yield. The locations where the yields are lower are immediately clear this way.

When converting data, watch your performance. What can be done by pandas itself is much quicker than if it has to ask python to do it. For instance, creating a datetime from a year field, a month field, etc. takes a long time, as it basically happens per row. It is way quicker to let pandas concatenate the yyyy/mm/dd + time info into one string and then convert that one string to a datetime.

He showed the same example for creating a geometric point. It is quickest to create a textual POINT(1.234 8.234) string from two x/y fields and only then to convert it to a point.
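
A minimal sketch of that vectorised approach (column names and values are made up, not from the talk):

import pandas as pd

df = pd.DataFrame({"yyyy": [2019, 2019], "mm": [5, 5], "dd": [3, 4],
                   "time": ["08:30:00", "09:15:00"],
                   "lon": [6.56, 6.57], "lat": [53.21, 53.22]})

# One vectorised string concatenation plus a single to_datetime call,
# instead of building a datetime object per row in Python.
df["timestamp"] = pd.to_datetime(
    df["yyyy"].astype(str) + "-" + df["mm"].astype(str).str.zfill(2) + "-"
    + df["dd"].astype(str).str.zfill(2) + " " + df["time"])

# Same idea for geometry: build a WKT string per row, convert it to a real
# point later (for instance in PostGIS).
df["wkt"] = "POINT(" + df["lon"].astype(str) + " " + df["lat"].astype(str) + ")"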

Use the best tool for the job. Once he had massaged the data in pandas, he exported it to a postgis database table. Postgis has lots of geographical functions, like ST_CENTROID, ST_BUFFER, and ST_MAKELINE, which he used to do the heavy geographical lifting.

He then used the "geopandas" extension to let pandas read the resulting postgis query's result. Which could again be plotted with matplotlib.

Nice!

January 11, 2020 06:01 PM UTC

PyGrunn: embedding the python interpreter - Mark Boer

(One of my summaries of a talk at the 2019 PyGrunn conference).

Writing scripts inside applications is often hard. Some of them luckily have an embedded version of python, but not all of them.

Two important terms: extending and embedding. Lots of scientific software is made available via extending: a python wrapper. Numpy and tensorflow, for instance.

The other way around is embedding: you put python inside your application. Useful for plugins, scripting. He doesn't know if jupyter notebooks are a good example of embedding, but in any case, jupyter is doing funny things with the python interpreter.

CPython, which is the version of python we're talking about, consists of three parts:

  • Bytecode compiler
  • Python virtual machine (the one running the compiled bytecode).
  • Python's C API, which allows other programs to call python. The C API is the opposite of python: it is hard to read and write :-) Oh, and the error messages are horrendous.

But... starting python from C and sending lines to the REPL, that's quite easy: PyRun_SimpleString(). He showed a 10-line C program that reads from stdin and lets python execute it.

He then expanded it to run in a separate thread. But soon his program crashed. The solution was to explicitly acquire and release the GIL ("global interpreter lock").

A problem: multiprocessing doesn't work. At least on windows. Starting another process from within python opens another version of the whole program you're in...

A suggestion: pybind11, a handy library for helping you embed python into c++. It especially helps with managing the GIL and for embedding python modules.

Something he sees often is that it is used to parallelize code for the benefit of python:

  • Convert python types to c/c++ types
  • release GIL
  • Perform computation
  • acquire GIL
  • Convert c/c++ types to return type.

A note on deployment: just include python in your installer.

January 11, 2020 06:01 PM UTC

PyGrunn: lessons from using GraphQL in production - Niek Hoekstra & Jean-Paul van Oosten

(One of my summaries of a talk at the 2019 PyGrunn conference).

GraphQL is a different way to create APIs. So: differently from REST. You describe what you want to receive back, instead of having a fixed REST api. With REST you often get too much or too little. You may need to do a lot of different calls.

REST often leads to a tight coupling between the front-end and the back-end. Changes to a REST api often break the front-end...

What they especially like about GraphQL: it is designed to have documentation built in.

What they use: "graphene", "graphene-django" and "relay". On the front-end it is "apollo" (react-native, react, native ios/android).

With graphene-django you first have to define the data you're exposing: the various object types, the types of the attributes, the relations, etc.
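
A minimal sketch of what that looks like with graphene-django (the Contact model and field names are made up, not from the talk):

import graphene
from graphene_django import DjangoObjectType
from myapp.models import Contact   # hypothetical Django model

class ContactType(DjangoObjectType):
    class Meta:
        model = Contact   # exposes the model's fields as a GraphQL type

class Query(graphene.ObjectType):
    contacts = graphene.List(ContactType)

    def resolve_contacts(root, info):
        return Contact.objects.all()

schema = graphene.Schema(query=Query)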

A tip: differentiate between "a user" and "me". Don't add more data to a user object if it turns out to be you. Just have a separate endpoint for "me". Way easier.

Caching: that needs to be done outside of graphene; it can only do a bit of caching right at the end, on the resulting json. You're better off caching at the django object level.

A potential problem spot is the flexibility that GraphQL gives you in querying relations. You need quite a bit of clever hacking to use django's select_related/prefetch_related speedups. You need to pay attention.

Uploading files is tricky. GraphQL itself does not handle file uploads. Their solution was to have a POST or PUT endpoint somewhere and to return the info about the uploaded file as GraphQL.

A downside of GraphQL: it is hard to predict the cost of a query. You can ask for addresses of contacts living at addresses of contacts and so on: you can kill the server that way. You could prevent that by, for instance, limiting the depth of the query.

There are reasons to stay with REST:

  • GraphQL is not a silver bullet. Yes, it has advantages.
  • The django/python tooling is still not very mature.
  • Determining the cost of a query is hard to predict beforehand.

But: just use GraphQL, it is fun!

January 11, 2020 06:01 PM UTC

PyGrunn: testing your infrastructure code - Ruben Homs

(One of my summaries of a talk at the 2019 PyGrunn conference).

Servers used to be managed by proper wizards. But even wizards can be killed by a balrog. So... what happens when your sysadmin leaves?

  • The point of failure is the sysadmin.
  • Knowledge about infrastructure is centralised.
  • It is non-reproducible.

A solution is configuration management. Chef, ansible, saltstack, puppet. Configuration that's in source control instead of information in a sysadmin's head.

  • It is a reproducible way to build your infrastructure.
  • Source code, so everyone can see how a system works.
  • You can even version your infrastructure.

He'll use saltstack as an example, that's what they're using in his company. It is a master/minion system. So a central master pushes out commands to the minion systems.

For testing, he uses a tool called "kitchen", originally intended for puppet, which can however also be used with saltstack: https://kitchen.saltstack.com/ . He showed a demo where he created a couple of virtualbox machines and automatically ran the salt scripts on them.

You can then ssh to those boxes and check if they're OK.

But... that's manual work. So he started using testinfra and pytest. Testinfra helps you test infrastructure. There are built-in tests for checking whether a package has been installed, for instance:

def test_emacs_installed(host):
    assert host.package("emacs").is_installed

You can run those tests via "kitchen". They use that to test their infrastructure setup from travis-ci.com.

January 11, 2020 06:01 PM UTC

PyGrunn: a day has only 24 ± 1 hours - Miroslav Šedivý

(One of my summaries of a talk at the 2019 PyGrunn conference).

Time zones... If you do datetime.datetime.now() you'll get a date+time without timezone information. You can get different results on your laptop (set to local time) and a server (that might be set to UTC).

You can use datetime.datetime.utcnow() that returns UTC time. But... without a timezone attached. Best is to request the time in a specific timezone.
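
A minimal sketch of the difference (pytz is a third-party package; at the time of this talk the stdlib had no zoneinfo module yet):

import datetime
import pytz

naive = datetime.datetime.now()                            # no tzinfo attached
utc_aware = datetime.datetime.now(datetime.timezone.utc)   # aware, in UTC
local = datetime.datetime.now(pytz.timezone("Europe/Amsterdam"))  # aware, specific zone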

There are gotchas regarding time. Check your time only once in a calculation. If you call .utcnow() multiple times, you can get different dates when your code runs around 0:00.

Same with time.time(): if the "ntp" daemon adjusts your system clock in the meantime, you get weird results. For that, there is time.monotonic().

The original source for all time zone information is the time zone database (tzdata). You can download it and look at all the files per timezone. Interesting reading! Look at Istanbul's timezone: daylight saving time being delayed by a day in a specific year because of a nationwide school exam. It was announced a few weeks before. That's all in the time zone database.

So if you build a Docker image now and still use it in two years' time, you might run into problems because summer time might have been abolished by the EU by then. So make sure you keep your time zone library up to date.

January 11, 2020 06:01 PM UTC

PyGrunn: monitoring and profiling Flask apps - Patrick Vogel & Bogdan Petre

(One of my summaries of a talk at the 2019 PyGrunn conference).

Patrick and Bogdan are students at Groningen University and they made the Flask Monitoring Dashboard. Some questions you might be interested in:

  • What is the performance of your flask-based apps?
  • How much is my web app being used?
  • Mean response time per endpoint?
  • Are there a lot of outliers? Most customers might have 20 items and one customer might have a couple of thousands: that'll hurt performance for only that specific customer. Important to understand.
  • Monitor performance improvements in case you deploy a new version.

What are your options?

  • Commercial monitoring like google analytics or pingdom.
  • Write your own monitoring in flask middleware.
  • No middleware.
  • Best: use the flask monitoring dashboard!

It offers several levels of monitoring that you can configure per endpoint. From just monitoring the last time the endpoint has been called to full profiling including outlier detection. They showed a webpage with the profiling information: it sure looked useful. As an example, there's a table view (hours vertically, days of the week horizontally) showing the relative usage per day/hour.

It works by "monkeypatching" the view functions. Flask has an internal dictionary with all the view functions! When profiling, a separate thread is started that periodically collects stacktrace info from the function being monitored.

Such monitoring of course has a performance impact. They're actually researching that right now. The lower levels of monitoring ("time it was last used" and "amount of usage") have no discernable impact. The two levels that do profiling have more overhead. For cpu/memory intensive tasks, the overhead is around 10%. For disk intensive tasks, it can hit 60%.

January 11, 2020 06:01 PM UTC