Planet Python
Last update: October 13, 2020 01:48 PM UTC
October 13, 2020
Stack Abuse
Simple NLP in Python With TextBlob: Tokenization
Introduction
The amount of textual data on the Internet has significantly increased in the past decades. There's no doubt that the processing of this amount of information must be automated, and the TextBlob package is one of the fairly simple ways to perform NLP - Natural Language Processing.
It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, tokenization, sentiment analysis, classification, translation, and more.
No special technical prerequisites are needed to use this library. For instance, TextBlob works with both Python 2 and 3. In case you don't have any textual data for the project you want to work on, TextBlob provides the necessary corpora from the NLTK database.
Installing TextBlob
Let's start out by installing TextBlob and the NLTK corpora:
$ pip install -U textblob
$ python -m textblob.download_corpora
Note: This process can take some time due to the large number of algorithms and corpora that this library contains.
What is Tokenization?
Before going deeper into the field of NLP, you should also be able to recognize these key terms:
- Corpus (plural: corpora) - a collection of language data (e.g. texts). Corpora are normally used for training different models of text classification or sentiment analysis, for instance.
- Token - a final string that is detached from the primary text; in other words, it's an output of tokenization.
What is tokenization itself?
Tokenization or word segmentation is a simple process of separating sentences or words from the corpus into small units, i.e. tokens.
An illustration of this could be the following sentence:
- Input (corpus): The evil that men do lives after them
- Output (tokens): | The | evil | that | men | do | lives | after | them |
Here, the input sentence is tokenized on the basis of spaces between words. You can also tokenize characters from a single word (e.g. a-p-p-l-e from apple) or separate sentences from one text.
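As a rough sketch of what space-based tokenization does (plain Python, not TextBlob's actual implementation), str.split() covers the word case and list() covers the character case:

```python
# Word-level: split on runs of whitespace
corpus = "The evil that men do lives after them"
print(corpus.split())   # ['The', 'evil', 'that', 'men', 'do', 'lives', 'after', 'them']

# Character-level: break a single word into characters
print(list("apple"))    # ['a', 'p', 'p', 'l', 'e']
```

Real tokenizers also have to handle punctuation, contractions, and sentence boundaries, which is where a library like TextBlob comes in.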
Tokenization is one of the basic and crucial stages of language processing. It transforms unstructured textual material into data. This could be applied further in developing various models of machine translation, search engine optimization, or different business inquiries.
Implementing Tokenization in Code
First of all, we need to define a sample corpus that will be tokenized, and create a TextBlob object from it. For example, let's try to tokenize a part of the poem If written by R. Kipling:
from textblob import TextBlob
# Creating the corpus
corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''
This corpus is then passed as an argument to the TextBlob constructor:
blob_object = TextBlob(corpus)
Once constructed, we can perform various operations on this blob_object. It already contains our corpus, categorized to a degree.
Word Tokenization
Finally, to get the tokenized words, we simply retrieve the words attribute of the created blob_object. This gives us a list of Word objects, which behave very similarly to str objects:
from textblob import TextBlob
corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''
blob_object = TextBlob(corpus)
# Word tokenization of the sample corpus
corpus_words = blob_object.words
# To see all tokens
print(corpus_words)
# To count the number of tokens
print(len(corpus_words))
These commands should give you the following output:
['If', 'you', 'can', 'force', 'your', 'heart', 'and', 'nerve', 'and', 'sinew', 'to', 'serve', 'your', 'turn', 'long', 'after', 'they', 'are', 'gone', 'and', 'so', 'hold', 'on', 'when', 'there', 'is', 'nothing', 'in', 'you', 'except', 'the', 'Will', 'which', 'says', 'to', 'them', 'Hold', 'on']
38
It's worth noting that this approach tokenizes words on whitespace, stripping punctuation along the way. We can change this delimiter, for example, to a TAB:
from textblob import TextBlob
from nltk.tokenize import TabTokenizer
corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''
tokenizer = TabTokenizer()
blob_object = TextBlob(corpus, tokenizer=tokenizer)
# Word tokenization of the sample corpus
corpus_words = blob_object.tokens
# To see all tokens
print(corpus_words)
Note that we've added a TAB after the first sentence here. Now, the corpus of words looks something like:
["If you can force your heart and nerve and sinew to serve your turn long after they are gone.", "And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'"]
nltk.tokenize contains other tokenization options as well. By default, it uses the SpaceTokenizer which you don't need to define explicitly, but can. Other than these two, it also contains useful tokenizers such as LineTokenizer, BlankLineTokenizer and WordPunctTokenizer.
A full list can be found in their documentation.
Sentence Tokenization
To tokenize on a sentence-level, we'll use the same blob_object. This time, instead of the words attribute, we will use the sentences attribute. This returns a list of Sentence objects:
from textblob import TextBlob
corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''
blob_object = TextBlob(corpus)
# Sentence tokenization of the sample corpus
corpus_sentence = blob_object.sentences
# To identify all tokens
print(corpus_sentence)
# To count the number of tokens
print(len(corpus_sentence))
Output:
[Sentence("If you can force your heart and nerve and sinew to serve your turn long after they are gone"), Sentence("And so hold on when there is nothing in you except the Will which says to them: 'Hold on!")]
2
Conclusion
Tokenization is a very important data pre-processing step in NLP and involves breaking text down into smaller chunks called tokens. These tokens can be individual words, sentences, or characters from the original text.
TextBlob is a great library to get into NLP with since it offers a simple API that lets users quickly jump into performing NLP tasks.
In this article, we discussed just one of the NLP tasks that TextBlob deals with. In upcoming articles, we will take a look at how to solve more complex issues, such as dealing with word inflections, plural and singular forms of words, and more.
Codementor
How to Implement Role based Access Control With FastAPI
Quick Summary of RBAC concept, working code snippets and how I reached there
Kushal Das
Updates from Johnnycanencrpt development in last few weeks
In July this year, I wrote a very initial Python module in Rust for OpenPGP, Johnnycanencrypt aka jce. It had very basic encryption, decryption, signing, verification, and creation of new keys. It uses the https://sequoia-pgp.org library for the actual implementation.
I wanted to see if I can use such a Python module (which does not call out to the gpg2 executable) in the SecureDrop codebase.
First try (2 weeks ago)
Two weeks ago, on Friday, when I sat down to see if I could start using the module, within a few minutes I understood it was not possible. The module was missing basic key management and more refined control over key creation and expiration dates.
On that weekend, I wrote a KeyStore using file-based keys as backend and added most of the required functions to try again.
The last Friday
I sat down again; this time, I had a few friends (including Saptak, Nabarun) on video along with me, and together we tried to plug jce into the SecureDrop container for Focal. After around 4 hours, we had around 5 failing tests (out of 32) in the crypto-related tests. Most of the basic functionality was working, but we were stuck on the last few tests. As I was using the file system to store the keys (in simple .sec or .pub files), it was difficult to figure out the existing keys when multiple processes were creating/deleting keys in the same KeyStore.
Next try via a SQLite based KeyStore
Next, I replaced the KeyStore with an SQLite based backend. Now multiple processes can access the keys properly. With a few other updates, now I have only 1 failing test (where I have to modify the test properly) in that SecureDrop Focal patch.
While doing this experiment, I again saw the benefits of writing the library's documentation as I developed it. Most of the time, I had to double-check against it to make sure that I was doing the right calls. I also added one example where one can verify the latest (10.0) Tor Browser download via Python.
In case you already use OpenPGP encryption in your tool/application, or you want to try it, please give jce a try. Works on Python3.7+. I tested on Linux and macOS, and it should work on Windows too. I have an issue open on that, and if you know how to do that, please feel free to submit a PR.
October 12, 2020
Podcast.__init__
Cloud Native Application Delivery Using GitOps - Episode 284
The way that applications are being built and delivered has changed dramatically in recent years with the growing trend toward cloud native software. As part of this movement toward the infrastructure and orchestration that powers your project being defined in software, a new approach to operations is gaining prominence. Commonly called GitOps, the main principle is that all of your automation code lives in version control and is executed automatically as changes are merged. In this episode Victor Farcic shares details on how that workflow brings together developers and operations engineers, the challenges that it poses, and how it influences the architecture of your software. This was an interesting look at an emerging pattern in the development and release cycle of modern applications.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Tree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema you can create your data catalog and have it fully populated in under five minutes when using one of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to pythonpodcast.com/treeschema today to get your first month free, and mention this podcast to get 50% off your first three months after the trial.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Victor Farcic about using GitOps practices to manage your application and your infrastructure in the same workflow
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by giving an overview of what GitOps is?
- What are the architectural or design elements that developers need to incorporate to make their applications work well in a GitOps workflow?
- What are some of the tools that facilitate a GitOps approach to managing applications and their target environments?
- What are some useful strategies for managing local developer environments to maintain parity with how production deployments are architected?
- As developers acquire more responsibility for building the automation to provision the production environment for their applications, what are some of the operations principles that they need to understand?
- What are some of the development principles that operators and systems administrators need to acquire to be effective in contributing to an environment that is managed by GitOps?
- What are the areas for collaboration and dividing lines of responsibility between developers and platform engineers in a GitOps environment?
- Beyond the application development and deployment, what are some of the additional concerns that need to be built into an application in order for it to be manageable and maintainable once it is in production?
- What are some of the organizational principles that contribute to a successful implementation of GitOps?
- What are some of the most interesting, innovative, or unexpected ways that you have seen GitOps employed?
- What have you found to be the most challenging aspects of creating a scalable and maintainable GitOps practice?
- When is GitOps the wrong choice, and what are the alternatives?
- What resources do you recommend for anyone who wants to dig deeper into this subject?
Keep In Touch
Picks
- Tobias
- Victor
Links
- GitOps
- CodeFresh
- Kubernetes
- DevOps Paradox Podcast
- Perl
- Cloud Native
- ArgoCD
- Flux
- Observability
- Prometheus
- Helm
- KNative
- MiniKube
- Viktor’s Udemy Books and Courses
- Viktor’s YouTube channel
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Reuven Lerner
What’s the easiest way to boost your career as a software developer? Learn to touch type.
I’ve been a professional programmer for about 30 years, self-employed for 25 years, and doing full-time corporate Python training for more than a decade.
I run a small business, which involves me writing, programming, and teaching, as well as handling all of the business-related stuff.
So, what’s my most important skill, the thing that helps me get lots accomplished in a short period of time? Easy: My ability to touch type.
It all started when I was in high school in the mid-1980s. I would use my family’s computer — yes, in those days, the entire family shared one — for schoolwork, for doing some introductory programming, and even writing newsletters for my high-school youth organization. The thing is, I was doing all of this typing with two fingers, and this drove my parents bananas.
Both of my parents can touch type. In those days, it was typical for office workers to record their correspondence, give the recording to a secretary, and then review the result before sending it out. My father never did that, because he typed at least as fast as his secretary, and the whole dictation process would slow him down. It wasn’t unusual to hear the rat-tat-tat of my father typing from his study at home.
It’s no surprise that my hunting and pecking bothered my parents. I was pretty fast at it, but I was no match for my father or any other touch typist. My parents strongly encouraged me to learn to touch type, but I was a teenager, which meant that I knew better than they did. And besides, I typed fast enough, right?
Finally, my parents set a new rule: For every hour that I used the computer, I had to spend an hour doing a lesson from a touch-typing book. (How quaint, right?) I yelled. I screamed. I cried. I protested. But my parents didn’t budge.
At first, it was painful: When you start to touch type, you are learning to use your hands in a new way, one that feels completely foreign. You also type much more slowly than you did before, and feel like you’re wasting your time. I certainly had these feelings, and when I had to get something done quickly, I would refer to my old two-finger method.
But within two or three weeks, I was already touch typing as quickly as I did with two fingers. Better yet, and somewhat amazingly, I was able to type without looking at the keyboard! I could enter passages from a book without having to move my eyes from book to keyboard and back. I could talk to someone while typing. I could even sneak a peek at the TV while I was typing.
Achieving true speed didn’t happen for a while. But when I started college in the fall of 1988, I was already typing at a pretty fast clip. At the student newspaper, I was frequently drafted to take printouts from the Associated Press and type them into our “world and nation” section. And at the computer labs, where we had loud, mechanical IBM keyboards, people would ask me if I could type more slowly, because the rat-tat-tat was disturbing them.
Fast forward to 2020, and I cannot imagine my work without being able to touch type:
- Just about every day, I teach Python programming to my corporate clients. Rather than using slides, I live-code, talking while looking at my students (or the screen). I describe what I’m typing as I do it, and type at the same speed as I speak.
- Similarly, the online video courses and YouTube videos that I’ve created wouldn’t be possible were it not for touch typing.
- I can type at about the same speed as I think, meaning that when I have ideas I want to put into an article, blog post, or book, I can just sit down and write. This doesn’t mean that my text can get away without editing — but I can’t imagine the writing and editing process if typing weren’t a natural extension of my thought process.
- When I speak with a potential new client, I can take notes in real time, while holding the conversation.
- I can write and respond to e-mail quickly and easily. (This is something of a curse; I never learned to write short e-mail messages. It’s always full sentences, and typically full paragraphs, from me.)
Lots of professional writers know that they need to touch type. After all, they write for a living, and being unable to get the most out of their keyboard would seem like a crazy thing to do.
And yet, I find that only a small number of the developers in my courses can touch type. They never really thought about it much, or decided not to put time and effort into it, or thought that it was hard or impossible to learn. Either way, it's clearly not a priority for them.
Touch typing looks magical and impossible to achieve. It’s like watching a virtuoso pianist expressing themselves through the instrument, their thoughts and feelings flowing effortlessly from their brains to their hands, and then to the piano.
But here’s the thing: It’s not hard to learn. You’ll be frustrated for the weeks during which you’re learning and forcing yourself to work in a new way. But it pays for itself in spades, allowing you to write, edit, and express yourself — in code and text — more easily than you could ever imagine. And if I managed to learn from a book as an angry teenager, then you can certainly learn with the variety of online tools, many of them free, available today.
So if you want to give your career a boost, don’t go and learn the latest language, JavaScript library, or API. Rather, learn to touch type. The time that you save and the flexibility it’ll provide will more than make up for the time you spent learning.
The post What’s the easiest way to boost your career as a software developer? Learn to touch type. appeared first on Reuven Lerner.
Ned Batchelder
Ordered dict surprises
Since Python 3.6, regular dictionaries retain their insertion order: when you iterate over a dict, you get the items in the same order they were added to the dict. Before 3.6, dicts were unordered: the iteration order was seemingly random.
Here are two surprising things about these ordered dicts.
You can’t get the first item
Since the items in a dict have a specific order, it should be easy to get the first (or Nth) item, right? Wrong. It’s not possible to do this directly. You might think that d[0] would be the first item, but it’s not, it’s the value of the key 0, which could be the last item added to the dict.
The only way to get the Nth item is to iterate over the dict, and wait until you get to the Nth item. There’s no random access by ordered index. This is one place where lists are better than dicts. Getting the Nth element of a list is an O(1) operation. Getting the Nth element of a dict (even if it is ordered) is an O(N) operation.
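There are standard-library workarounds worth knowing, though. As a sketch: next(iter(d)) returns the first key without building a list, and itertools.islice skips ahead to the Nth key — still O(N) under the hood, but without materializing anything:

```python
from itertools import islice

d = {"a": 1, "b": 2, "c": 3}

first_key = next(iter(d))             # 'a' - the first key inserted
nth_key = next(islice(d, 2, None))    # 'c' - key at ordered index 2
print(first_key, nth_key)             # a c
```

Both raise StopIteration on an empty dict (or an out-of-range index), so wrap them accordingly if that's a possibility.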
OrderedDict is a little different
If dicts are ordered now, collections.OrderedDict is useless, right? Well, maybe. It won’t be removed because that would break code using that class, and it has some methods that regular dicts don’t. But there’s also one subtle difference in behavior. Regular dicts don’t take order into account when comparing dicts for equality, but OrderedDicts do:
>>> d1 = {"a": 1, "b": 2}
>>> d2 = {"b": 2, "a": 1}
>>> d1 == d2
True
>>> list(d1)
['a', 'b']
>>> list(d2)
['b', 'a']
>>> from collections import OrderedDict
>>> od1 = OrderedDict([("a", 1), ("b", 2)])
>>> od2 = OrderedDict([("b", 2), ("a", 1)])
>>> od1 == od2
False
>>> list(od1)
['a', 'b']
>>> list(od2)
['b', 'a']
>>>
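Among those extra methods that OrderedDict has and regular dicts don't are move_to_end(), and popitem() with last=False for popping from the front:

```python
from collections import OrderedDict

od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])

od.move_to_end("a")                  # send 'a' to the back; plain dicts can't do this
print(list(od))                      # ['b', 'c', 'a']

key, value = od.popitem(last=False)  # pop from the front; dict.popitem is LIFO-only
print(key, value)                    # b 2
```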
BTW, this post is the result of a surprisingly long and contentious discussion in the Python Discord.
Test and Code
134: Business Outcomes and Software Development
Within software projects, there are lots of metrics we could measure. But which ones really matter? Instead of a list, Benjamin Harding shares with us a way of thinking about business outcomes that can help us with everyday decision making.
We talk about:
- Business outcomes vs vanity metrics
- As a developer, how do you keep business outcomes in mind
- Thinking about customer value all the time
- Communicating decisions and options in terms of costs and impact on business outcomes
- Company culture and its role in reinforcing a business outcome mindset
- And even the role of team lead as impact multiplier
I really enjoyed this conversation. But I admit that at first, I didn't realize how important this is to all software development. Metrics are front and center in a web app. But what about a service, or an embedded system with no telemetry? It still matters, maybe even more so. Developers face little and big decisions every day that have an impact on costs and benefits with respect to customer value and business outcomes, even if it's difficult to measure.
Special Guest: Benjamin Harding.
Sponsored By:
- PyCharm Professional: Try PyCharm Pro for 4 months and learn how PyCharm will save you time. Promo Code: TESTANDCODE20
- monday.com: Creating a monday.com app can help thousands of people and win you prizes. Maybe even a Tesla or a MacBook.
Support Test & Code : Python Testing for Software Engineering
IslandT
Return a list of multiples with Python
In this simple exercise from CodeWars, you will build a function that takes a value, integer, and returns a list of its multiples up to another value, limit. If the limit is a multiple of integer, it should be included as well. Only positive integers will ever be passed into the function, never 0, and the limit will always be higher than the base.
For example, if the parameters passed are (2, 6), the function should return [2, 4, 6] as 2, 4, and 6 are the multiples of 2 up to 6.
Below is the solution, write down your own solution in the comment box.
def find_multiples(integer, limit):
    li = []
    mul = 1
    while True:
        number = integer * mul
        mul += 1
        if number <= limit:
            li.append(number)
        else:
            return li
The while loop will keep on running until the limit has been reached then the function will return the entire list.
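A more compact variant (not the author's solution) uses range(), which can step from integer up to limit in multiples directly:

```python
def find_multiples(integer, limit):
    # range's stop value is exclusive, so add 1 to include the limit itself
    return list(range(integer, limit + 1, integer))

print(find_multiples(2, 6))  # [2, 4, 6]
```

This avoids the explicit loop and counter entirely, since range already knows how to count in steps of integer.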
Stack Abuse
Add Legend to Figure in Matplotlib
Introduction
Matplotlib is one of the most widely used data visualization libraries in Python. Typically, when visualizing more than one variable, you'll want to add a legend to the plot, explaining what each variable represents.
In this article, we'll take a look at how to add a legend to a Matplotlib plot.
Creating a Plot
Let's first create a simple plot with two variables:
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
x = np.arange(0, 10, 0.1)
y = np.sin(x)
z = np.cos(x)
ax.plot(y, color='blue')
ax.plot(z, color='black')
plt.show()
Here, we've plotted a sine function, starting at 0 and ending at 10 with a step of 0.1, as well as a cosine function in the same interval and step. Running this code yields:
[Figure: sine and cosine waves plotted together]
Now, it would be very useful to label these and add a legend so that someone who didn't write this code can more easily discern which is which.
Add Legend to a Figure in Matplotlib
Let's add a legend to this plot. Firstly, we'll want to label these variables, so that we can refer to those labels in the legend. Then, we can simply call legend() on the ax object for the legend to be added:
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
x = np.arange(0, 10, 0.1)
y = np.sin(x)
z = np.cos(x)
ax.plot(y, color='blue', label='Sine wave')
ax.plot(z, color='black', label='Cosine wave')
leg = ax.legend()
plt.show()
Now, if we run the code, the plot will have a legend:
[Figure: the same plot with a legend added]
Notice how the legend was automatically placed in the only free space where the waves won't run over it.
Customize Legend in Matplotlib
The legend is added, but it's a little bit cluttered. Let's remove the border around it and move it to another location, as well as change the plot's size:
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(0, 10, 0.1)
y = np.sin(x)
z = np.cos(x)
ax.plot(y, color='blue', label='Sine wave')
ax.plot(z, color='black', label='Cosine wave')
leg = ax.legend(loc='upper right', frameon=False)
plt.show()
This results in:
[Figure: legend moved to the upper right, without a frame]
Here, we've used the loc argument to specify that we'd like to put the legend in the top right corner. Other values that are accepted are upper left, lower left, upper right, lower right, upper center, lower center, center left and center right.
Additionally, you can use center to put it in the dead center, or best to place the legend at the "best" free spot so that it doesn't overlap with any of the other elements. By default, best is selected.
Add Legend Outside of Axes
Sometimes, it's tricky to place the legend within the border box of a plot. Perhaps, there are many elements going on and the entire box is filled with important data.
In such cases, you can place the legend outside of the axes, and away from the elements that constitute it. This is done via the bbox_to_anchor argument, which specifies where we want to anchor the legend to:
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(0, 10, 0.1)
y = np.sin(x)
z = np.cos(x)
ax.plot(y, color='blue', label='Sine wave')
ax.plot(z, color='black', label='Cosine wave')
leg = ax.legend(loc='center', bbox_to_anchor=(0.5, -0.10), shadow=False, ncol=2)
plt.show()
This results in:
[Figure: legend centered below the axes, labels in two columns]
The bbox_to_anchor argument accepts a few arguments itself. Firstly, it accepts a tuple, which allows up to 4 elements. Here, we can specify the x, y, width and height of the legend.
We've only set the x and y values, placing the legend 0.10 below the axes and centered horizontally at 0.5 (0 being the left-hand side of the box and 1 the right-hand side). By tweaking these, you can set the legend anywhere, within or outside of the box.
Then, we've set the shadow to False. This is used to specify whether we want a small shadow rendered below the legend or not.
Finally, we've set the ncol argument to 2. This specifies the number of columns in the legend. Since we have two labels and want them side by side in a single row, we've set it to 2. If we changed this argument to 1, they'd be placed one above the other:
[Figure: legend below the axes, labels stacked in one column]
Note: The bbox_to_anchor argument is used alongside the loc argument. The loc argument will put the legend based on the bbox_to_anchor. In our case, we've put it in the center of the new, displaced, location of the border box.
Conclusion
In this tutorial, we've gone over how to add a legend to your Matplotlib plots. Firstly, we've let Matplotlib figure out where the legend should be located, after which we've used the bbox_to_anchor argument to specify our own location, outside of the axes.
If you're interested in Data Visualization and don't know where to start, make sure to check out our book on Data Visualization in Python.
Data Visualization in Python, a book for beginner to intermediate Python developers, will guide you through simple data manipulation with Pandas, cover core plotting libraries like Matplotlib and Seaborn, and show you how to take advantage of declarative and experimental libraries like Altair.
Real Python
Using ggplot in Python: Visualizing Data With plotnine
In this tutorial, you’ll learn how to use ggplot in Python to create data visualizations using a grammar of graphics. A grammar of graphics is a high-level tool that allows you to create data plots in an efficient and consistent way. It abstracts most low-level details, letting you focus on creating meaningful and beautiful visualizations for your data.
There are several Python packages that provide a grammar of graphics. This tutorial focuses on plotnine since it’s one of the most mature ones. plotnine is based on ggplot2 from the R programming language, so if you have a background in R, then you can consider plotnine as the equivalent of ggplot2 in Python.
In this tutorial, you’ll learn how to:
- Install plotnine and Jupyter Notebook
- Combine the different elements of the grammar of graphics
- Use plotnine to create visualizations in an efficient and consistent way
- Export your data visualizations to files
This tutorial assumes that you already have some experience in Python and at least some knowledge of Jupyter Notebook and pandas. To get up to speed on these topics, check out Jupyter Notebook: An Introduction and Using Pandas and Python to Explore Your Dataset.
Setting Up Your Environment
In this section, you’ll learn how to set up your environment. You’ll cover the following topics:
- Creating a virtual environment
- Installing plotnine
- Installing Jupyter Notebook
Virtual environments enable you to install packages in isolated environments. They’re very useful when you want to try some packages or projects without messing with your system-wide installation. You can learn more about them in Python Virtual Environments: A Primer.
Run the following commands to create a directory named data-visualization and a virtual environment inside it:
$ mkdir data-visualization
$ cd data-visualization
$ python3 -m venv venv
After running the above commands, you’ll find your virtual environment inside the data-visualization directory. Run the following command to activate the virtual environment and start using it:
$ source ./venv/bin/activate
When you activate a virtual environment, any package that you install will be installed inside the environment without affecting your system-wide installation.
Next, you’ll install plotnine inside the virtual environment using the pip package installer.
Install plotnine by running this command:
$ pip install plotnine
Executing the above command makes the plotnine package available in your virtual environment.
Finally, you’ll install Jupyter Notebook. While this isn’t strictly necessary for using plotnine, you’ll find Jupyter Notebook really useful when working with data and building visualizations. If you’ve never used the program before, then you can learn more about it in Jupyter Notebook: An Introduction.
To install Jupyter Notebook, use the following command:
$ pip install jupyter
Congratulations, you now have a virtual environment with plotnine and Jupyter Notebook installed! With this setup, you’ll be able to run all the code samples presented throughout this tutorial.
Building Your First Plot With ggplot and Python
In this section, you’ll learn how to build your first data visualization using ggplot in Python. You’ll also learn how to inspect and use the example datasets included with plotnine.
Read the full article at https://realpython.com/ggplot-python/ »
[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
PyCharm
Datalore by JetBrains: Online Jupyter Notebooks Editor With PyCharm’s Code Insight
If you work with Jupyter Notebooks and want to run code, produce heavy visualizations, and render markdown online – give Datalore a try. It comes with cloud storage, real-time collaboration, notebook publishing, and PyCharm’s code insight. In this blog post we’ll give you a quick introduction to what you can do in Datalore.

Jupyter notebooks in the cloud
Once you register a Datalore account, you can get your first notebook up and running in seconds. No setup is required, and the most popular data science libraries such as NumPy, Matplotlib, pandas, TensorFlow, etc., are already preinstalled.

As soon as you create a new notebook or upload an existing one, you can attach dataset files to it. In Datalore, files are uploaded to cloud storage and then attached to the notebook. The free Datalore plan comes with 10 GB of storage space.
Code insight in Datalore
One of the best features of Datalore is its coding assistance, which it borrows directly from PyCharm.
We firmly believe that code completion, quick-fixes, auto-imports, rename, and reformatting options help make your online coding experience far more productive. Try out the coding assistance and let us know what you think!

Online editor experience
Datalore supports Markdown and LaTeX. All computations are run in the cloud, which shortens the time it takes for visualizations and Markdown cells to render.

We also support common Jupyter shortcuts and documentation popups. You can find the full list of available action shortcuts in the Help → Command palette menu tab.

Collaboration in Datalore
There are 4 ways to collaborate in Datalore.
1. Share your notebooks
Share your notebooks with your team via File → Share and collaborate in real time. The cursors of your team members will appear with color highlights and name tags. If something goes wrong you can revert to a history checkpoint via Tools → History.

2. Publish your notebooks
Publish your notebooks when you want to share insights and receive comments. Published notebooks can then be shared using a link.

3. Share whole workspaces
Share whole workspaces and work together with colleagues on multiple notebooks. Notebooks and attached files will be shared among all the workspace members. You can create a shared workspace via the Workspace menu on Datalore’s home screen.

4. Publish PyCharm notebooks
Publish PyCharm notebooks to share the results with your colleagues. You can upload them to Datalore from PyCharm IDE via the pre-installed Datalore plugin. Just make sure you are using version 0.1.18 or later.

Give Datalore a try!
Learn more about Datalore’s features from the Datalore blog. The Datalore team is always eager to hear your feedback! Please don’t hesitate to write to us in the comments or post in our forum.
Enjoy your data science journey,
Your Datalore and PyCharm teams
Chris Moffitt
Case Study: Processing Historical Weather Pattern Data
Introduction
The main purpose of this blog is to show people how to use Python to solve real world problems. Over the years, I have been fortunate enough to hear from readers about how they have used tips and tricks from this site to solve their own problems. In this post, I am extremely delighted to present a real world case study. I hope it will give you some ideas about how you can apply these concepts to your own problems.
This example comes from Michael Biermann from Germany. He had the challenging task of trying to gather detailed historical weather data in order to do analysis on the relationship between air temperature and power consumption. This article will show how he used a pipeline of Python programs to automate the process of collecting, cleaning and processing gigabytes of weather data in order to perform his analysis.
Problem Background
I will turn it over to Michael to give the background for this problem.
Hi, I’m Michael, CEO of a company providing services to energy providers, especially focusing on electrical power and gas. I wanted to do an ex-post analysis to get deeper insights into the deviation of the power consumption of electrical heating systems in comparison to the air temperature. Since we provide power to other companies, we need to have a good grasp on the power consumption, which correlates to the air temperature. In short, I needed to know how well I can predict the long term temperatures and how much deviation is to be expected.
To be able to do this analysis, I needed historical temperatures, which are supplied by the German weather service, DWD. Since it would be really time consuming to download all the files and extract them by hand, I decided to give this a shot with Python. I know a few things about programming, but I am pretty far from a professional programmer. The process was a lot of trial and error, but this project turned out to be exactly the right fit for this approach. I use a lot of hardcore Excel analysis, fetching and munching data with Power Query M, but this was clearly over the limit to what can be done in Excel.
I am really happy with the results. There is hardly anything as satisfying as letting the computer do the hard work for the next 20 min, while grabbing a cup of coffee.
I am also really happy to have learned a few more things about web scraping, because I can use it in future projects to automate data fetching.
Here is a visual to help understand the process Michael created:
If you are interested in following along, all of the code examples are available here.
Downloading the Data
The first notebook in the pipeline is 1-dwd_konverter_download. This notebook pulls historical temperature data from the German Weather Service (DWD) server and formats it for future use in other projects.
The data is delivered at an hourly frequency, in one .zip file per available weather station. To use the data, we need everything in a single .csv file with all stations side by side. Also, we need the daily average.
To reduce computing time, we also crop all data earlier than 2007.
For the purposes of this article, I have limited the download to only 10 files but the full data set is over 600 files.
import requests
import re
from bs4 import BeautifulSoup
from pathlib import Path
# Set base values
download_folder = Path.cwd() / 'download'
base_url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/'
# Initiate Session and get the Index-Page
with requests.Session() as s:
    resp = s.get(base_url)

# Parse the Index-Page for all relevant <a href>
soup = BeautifulSoup(resp.content, 'html.parser')
links = soup.findAll("a", href=re.compile("stundenwerte_TU_.*_hist.zip"))

# For testing, only download 10 files
file_max = 10
dl_count = 0

# Download the .zip files to the download_folder
for link in links:
    zip_response = requests.get(base_url + link['href'], stream=True)

    # Limit the downloads while testing
    dl_count += 1
    if dl_count > file_max:
        break

    with open(download_folder / link['href'], 'wb') as file:
        for chunk in zip_response.iter_content(chunk_size=128):
            file.write(chunk)

print('Done')
This portion of code parses the download page, finds all of the zip files whose names match stundenwerte_TU_*_hist.zip, and saves them in a download directory.
Extracting the Data
After the first step is completed, the download directory contains multiple zip files.
The second notebook in the process is 2-dwd_konverter_extract, which will search each zip file for a .txt file that contains the actual temperature values. The program will then extract each file and move it to the import directory for further processing.
from pathlib import Path
import glob
import re
from zipfile import ZipFile
# Folder definitions
download_folder = Path.cwd() / 'download'
import_folder = Path.cwd() / 'import'
# Find all .zip files and generate a list
unzip_files = glob.glob('download/stundenwerte_TU_*_hist.zip')
# Set the name pattern of the file we need
regex_name = re.compile('produkt.*')
# Open all files, look for files that match the regex pattern, extract to 'import'
for file in unzip_files:
    with ZipFile(file, 'r') as zipObj:
        list_of_filenames = zipObj.namelist()
        extract_filename = list(filter(regex_name.match, list_of_filenames))[0]
        zipObj.extract(extract_filename, import_folder)

display('Done')
After running this script, the import directory will contain text files that look like this:
STATIONS_ID;MESS_DATUM;QN_9;TT_TU;RF_TU;eor
3;1950040101; 5; 5.7; 83.0;eor
3;1950040102; 5; 5.6; 83.0;eor
3;1950040103; 5; 5.5; 83.0;eor
3;1950040104; 5; 5.5; 83.0;eor
3;1950040105; 5; 5.8; 85.0;eor
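To see how such a file parses, here is a minimal sketch using an inline string in place of a real DWD file (the timestamp handling mirrors what the next notebook does):

```python
import io
import pandas as pd

# A small inline sample mimicking the DWD file layout shown above
sample = """STATIONS_ID;MESS_DATUM;QN_9;TT_TU;RF_TU;eor
3;1950040101;    5;   5.7;  83.0;eor
3;1950040102;    5;   5.6;  83.0;eor
"""

df = pd.read_csv(io.StringIO(sample), delimiter=';')
# MESS_DATUM is an hourly timestamp encoded as YYYYMMDDHH
df['MESS_DATUM'] = pd.to_datetime(df['MESS_DATUM'], format='%Y%m%d%H')
print(df[['STATIONS_ID', 'MESS_DATUM', 'TT_TU']])
```

TT_TU is the air temperature column; the other columns are dropped later in the pipeline.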
Building the DataFrame
Now that we have isolated the data we need, we must format it for further analysis.
There are three steps in this notebook, 3-dwd_konverter_build_df:
Process Individual Files
The files are imported into a single DataFrame, stripped of unnecessary columns and filtered by date. Then we set a DateTimeIndex and concatenate them into the main_df. Because the loop takes a long time, we output some status messages to ensure the process is still running.
Process the concatenated main_df
Then we display some info about the main_df so we can make sure there are no errors, mainly that all data types are recognized correctly. Also, we drop duplicate entries, in case some of the .csv files were accidentally duplicated during the development process.
Unstack and export
For the final step, we unstack the main_df and save it to a .csv and a .pkl file for the next step in the analysis process. Also, we display some output to get a grasp of what is going on.
Now let’s look at the code:
import numpy as np
import pandas as pd
from IPython.display import clear_output
from pathlib import Path
import glob
import_files = glob.glob('import/*')
out_file = Path.cwd() / "export_uncleaned" / "to_clean"
obsolete_columns = [
'QN_9',
'RF_TU',
'eor'
]
main_df = pd.DataFrame()
i = 1
for file in import_files:
    # Read in the next file
    df = pd.read_csv(file, delimiter=";")

    # Prepare the df before merging (drop obsolete columns, convert to datetime, filter by date, set index)
    df.drop(columns=obsolete_columns, inplace=True)
    df["MESS_DATUM"] = pd.to_datetime(df["MESS_DATUM"], format="%Y%m%d%H")
    df = df[df['MESS_DATUM'] >= "2007-01-01"]
    df.set_index(['MESS_DATUM', 'STATIONS_ID'], inplace=True)

    # Merge into the main_df
    main_df = pd.concat([main_df, df])

    # Display some status messages
    clear_output(wait=True)
    display('Finished file: {}'.format(file), 'This is file {}'.format(i))
    display('Shape of the main_df is: {}'.format(main_df.shape))
    i += 1
# Check if all types are correct
display(main_df['TT_TU'].apply(lambda x: type(x).__name__).value_counts())
# Make sure that no files or observations are duplicated, e.g. scan the index for duplicate entries.
# The ~ is a bitwise operation, meaning it flips all bits.
main_df = main_df[~main_df.index.duplicated(keep='last')]
# Unstack the main_df
main_df = main_df.unstack('STATIONS_ID')
display('Shape of the main_df is: {}'.format(main_df.shape))
# Save main_df to a .csv file and a pickle to continue working in the next step
main_df.to_pickle(Path(out_file).with_suffix('.pkl'))
main_df.to_csv(Path(out_file).with_suffix('.csv'), sep=";")
display(main_df.head())
display(main_df.describe())
As this program runs, here is some of the progress output:
'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt'
'This is file 10'
'Shape of the main_df is: (771356, 1)'
float    771356
Name: TT_TU, dtype: int64
'Shape of the main_df is: (113952, 9)'
Here is what the final DataFrame looks like:
| TT_TU | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| STATIONS_ID | 3 | 44 | 71 | 73 | 78 | 91 | 96 | 102 | 125 |
| MESS_DATUM | |||||||||
| 2007-01-01 00:00:00 | 11.4 | NaN | NaN | NaN | 11.0 | 9.4 | NaN | 9.7 | NaN |
| 2007-01-01 01:00:00 | 12.0 | NaN | NaN | NaN | 11.4 | 9.6 | NaN | 10.4 | NaN |
| 2007-01-01 02:00:00 | 12.3 | NaN | NaN | NaN | 9.4 | 10.0 | NaN | 9.9 | NaN |
| 2007-01-01 03:00:00 | 11.5 | NaN | NaN | NaN | 9.3 | 9.7 | NaN | 9.5 | NaN |
| 2007-01-01 04:00:00 | 9.6 | NaN | NaN | NaN | 8.6 | 10.2 | NaN | 8.9 | NaN |
At the end of this step, we have the file in a condensed format we can use for analysis.
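One detail worth isolating from the notebook above is the duplicate-index filter. A tiny sketch with made-up data shows what ~df.index.duplicated(keep='last') does:

```python
import pandas as pd

df = pd.DataFrame(
    {'TT_TU': [1.0, 2.0, 3.0]},
    index=pd.Index(['a', 'a', 'b'], name='key'),
)

# duplicated(keep='last') marks every occurrence of a repeated index label
# except the last one; ~ flips the mask, so we keep the last row per key
deduped = df[~df.index.duplicated(keep='last')]
print(deduped)
```

Only the second 'a' row survives, which is exactly the behavior the notebook relies on to discard accidentally re-imported observations.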
Final Processing
The data contains some errors, which need to be cleaned. You can see, by looking at the
output of
main_df.describe(),
that the minimum temperature on some
stations is -999. That means that there is no plausible measurement for this particular
hour. We change this to
np.nan,
so that we can safely calculate the average daily value
in the next step.
Once these values are corrected, we need to resample to daily measurements. Pandas
resample
makes this really simple.
import numpy as np
import pandas as pd
from pathlib import Path
# Import and export paths
pkl_file = Path.cwd() / "export_uncleaned" / "to_clean.pkl"
cleaned_file = Path.cwd() / "export_cleaned" / "cleaned.csv"
# Read in the pickle file from the last cell
cleaning_df = pd.read_pickle(pkl_file)
# Replace all values of -999, which indicate missing data, with NaN
cleaning_df.replace(to_replace=-999, value=np.nan, inplace=True)
# Resample to daily frequency
cleaning_df = cleaning_df.resample('D').mean().round(decimals=2)
# Save as .csv
cleaning_df.to_csv(cleaned_file, sep=";", decimal=",")
# Show some results for verification
display(cleaning_df.loc['2011-12-31':'2012-01-04'])
display(cleaning_df.describe())
display(cleaning_df)
Here is the final DataFrame with daily average values for the stations:
| TT_TU | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| STATIONS_ID | 3 | 44 | 71 | 73 | 78 | 91 | 96 | 102 | 125 |
| MESS_DATUM | |||||||||
| 2011-12-31 | NaN | 3.88 | 2.76 | 1.19 | 4.30 | 2.43 | NaN | 3.80 | NaN |
| 2012-01-01 | NaN | 10.90 | 8.14 | 4.03 | 10.96 | 10.27 | NaN | 9.01 | NaN |
| 2012-01-02 | NaN | 7.41 | 6.18 | 4.77 | 7.57 | 7.77 | NaN | 6.48 | 4.66 |
| 2012-01-03 | NaN | 6.14 | 3.61 | 4.46 | 6.38 | 5.28 | NaN | 5.63 | 3.51 |
| 2012-01-04 | NaN | 5.80 | 2.48 | 4.45 | 5.46 | 4.57 | NaN | 5.85 | 1.94 |
Summary
There are several aspects of this case study that I really like.
- Michael was not an expert programmer and decided to dedicate himself to learning the Python necessary for solving this problem.
- It took some time for him to learn how to accomplish multiple tasks but he persevered through all the challenges and built a complete solution.
- This was a real world problem that would be difficult to solve with other tools but could be automated with very few lines of Python code.
- The process could be time consuming to run so it’s broken down into multiple stages. This is a great idea to apply to other problems. This previous article actually served as the inspiration for many of the techniques used in the solution.
- This solution brings together many different concepts including web scraping, downloading files, working with zip files and cleaning & analyzing data with pandas.
- Michael now has a new skill that he can apply to other problems in his business.
Finally, I love this quote from Michael:
There is hardly anything as satisfying as letting the computer do the hard work for the next 20 min, while grabbing a cup of coffee.
I agree 100%. Thank you Michael for taking the time to share such a great example! I hope it gives you some ideas to apply to your own projects.
IslandT
Beginning steps to create a Stockfish chess application
I am a chess player, and in order to improve my chess skills I have recently decided to create a chess application that I can play against, so I can get ready to face stronger opponents on a site like lichess. This chess application will take me around a year to complete, and I will show you the progress from time to time.
This application will use the below tools to develop:-
- Python will be the programming language used to develop this application.
- The Stockfish chess engine will be needed as the central mind of this application.
- The stockfish Python module will be needed to link to the Stockfish chess engine.
- Pygame will be needed to display the graphical user interface of this chess application.
- I am using Windows OS to develop this application, so it might not work for users of other operating systems.
In this chapter, we will first download the Stockfish chess engine from this site. I am using the 64-bit version (maximally compatible but slow) to suit my laptop. The chess engine alone does not do anything; we also need a Stockfish engine wrapper, which you can get from this site! There are other modules around that do the same thing, but the stockfish module appears to be very easy to use.
With the above two tools ready, we can now open up our PyCharm IDE and input the following code.
from stockfish import Stockfish
stockfish = Stockfish(r"E:\StockFish\stockfish_20090216_x64")  # raw string so backslashes are not treated as escapes
stockfish.set_position(["e2e4", "e7e6"])
print(stockfish.get_board_visual())
As you can see, we first import the Stockfish class into our program. Then we pass the path to the Stockfish chess engine as an argument when creating a Stockfish object. Next, we set the first move for both the White and the Black players. Finally, we print out the resulting position on the chessboard.
The position on the chess board
The chessboard looks really great, but as I mentioned before, I will not use this display; instead I will use Pygame to create the chess user interface for this application.
So there you have it, we have successfully installed the Stockfish module as well as downloaded the Stockfish chess engine.
What next? Next time we will install the PyGame module and show the chess pieces on the chessboard!
Mike Driscoll
PyDev of the Week: Sean Tibor
This week we welcome Sean Tibor (@smtibor) as our PyDev of the Week! Sean is the co-host of the Teaching Python podcast. He has been a guest on other podcasts, such as Test & Code and is the founder of Red Reef Digital.
Let’s spend a few moments getting to know Sean better!

Can you tell us a little about yourself (hobbies, education, etc):
It’s funny: I never expected to be a teacher. I went to college and grad school for Information Systems and learned to code in C++, Java, PHP, and VB.NET, then spent nearly 20 years working in IT and Marketing.
A few years ago, a dear family friend asked me to consider a career change into teaching since she thought I would have an aptitude for it. This is now my third year teaching middle school computer science in Florida at a private PK-12 school. Every 11-14 year old student in my school takes 9 weeks of computer science for each year of grade 6, 7, 8.
There are few things that I find professionally more satisfying than seeing a kid discover potential within themselves. Teaching has become more about the journey that each student goes through in learning to code than the specific lessons they learn.
It’s also really fun that my hobbies of coding hardware, making and designing electronics, and 3d printing have become part of my profession. I get to bring all of these skills and knowledge to my teaching craft, so it feels like I get to play all day with the things I love.
Why did you start using Python?
When I started teaching, the school I joined had just undergone a huge revision to their Computer Science curriculum. As part of that, they chose to make Python the language that all middle school students would learn.
So over the course of the summer, I started learning as much Python as I could absorb, using everything from books like Automate the Boring Stuff to CircuitPython and MicroPython hardware to Pybites code challenges. It took several months, but I was able to start teaching right from the first day of school.
In addition to teaching Python, it’s also been very useful for integration and automation projects around the school to make things run a bit smoother. I’m also using it to work on a few side projects in the marketing automation space, so it’s enhanced other parts of my professional life.
What other programming languages do you know and which is your favorite?
I’m a strong believer in Python as a useful and efficient language for getting things done so that’s my go-to language. Over the years, I’ve dabbled in a lot of different languages like VB.NET, Java, PHP, Objective-C, C++, and Arduino. Most of that has been replaced with Python for my projects and then I add in some HTML, CSS, JS, and SQL as needed to make it all come together.
What projects are you working on now?
My favorite project right now has been a wrapper library and function library for our school’s JAMF server that handles Apple device management. Our school has over 1500 iPads in use across two campuses and my project automates many of the common tasks that used to be very hands-on and manual. Now that we have this project in place, we can hand over a brand new shrink-wrapped iPad to a teacher or student and it will automatically configure itself with apps and settings within about 5 minutes of connecting to the internet.
Which Python libraries are your favorite (core or 3rd party)?
I don’t think it gets a lot of attention, but I love the dateutil library. My final project for my undergraduate degree was a web-based personal information manager that synchronized with your PDA, and the most complex part by far was the calendar module. Ever since, I’ve been a little obsessed with getting my dates and times correct in code, and the dateutil library has so many useful features, from timezone selection to parsing strings into datetime objects and even computing interesting relative dates.
What have you learned being a host of the Teaching Python podcast?
The best thing has been meeting all of the amazing people in the Python community and doing that all with my teaching partner and co-host, Kelly Paredes. She hadn’t coded before and I hadn’t taught before when we started the podcast, so each of us were beginners at something where the other person was more of an expert.
With every person we meet, we each learn a lot more about teaching, Python, and the many, many different cool things that people are doing out there in the world. Often after an episode recording session, we’ll sit there and chat about all the interesting things we learned from our guest or from each other.
I also found it really amazing how welcoming and accessible the Python and education community can be. We started as just two teachers who wanted to try making a podcast about our experiences teaching something new to both of us. We’ve made amazing friends, had some of the most mind-blowing conversations, and no one has ever said no.
What is the hardest thing to teach in class about Python?
The hardest thing is nothing to do with the Python language. It’s overcoming a student’s belief that “I am not a coder.” With patience and persistence, I’ve found that nearly every student can find something that they like about coding and create something that they are tremendously proud of. I’ve seen students create everything from an RGB-lit umbrella, to a choose-your-own adventure game with 700 lines of code, to an Alexa voice skill that reminds them about things so their mom doesn’t have to.
I’ve found that coding is a lot like running. Many people say that they’re not a runner. However, it’s your own journey to running or coding that matters. If you run, you are a runner. If you code, you are a coder. I don’t expect every student to be a gifted coder, but I’ve seen students blow me away with what they can do once they discard the notion that they are “not a coder.”
Is there anything else you’d like to say?
Learning Python in order to teach it to others has been quite a bit different than the other times I’ve learned a new language. Every time a student asks me how something works, I think I’ve got the right answer, but then they ask me a followup question that makes me excited to go learn more. Teaching another person is absolutely the best way to keep yourself challenged and motivated to learn more.
Thanks for doing the interview, Sean!
The post PyDev of the Week: Sean Tibor appeared first on The Mouse Vs. The Python.
IslandT
Merge two dictionaries using the Dict Union operator
In this article we will create a Python function which will merge two dictionaries using the Dict Union operator.
The Dict Union operator keeps only one value per key. If a key appears twice in the same dictionary literal, only the last occurrence survives, because duplicate keys collapse before the merge even happens. If the same key appears in both dictionaries, the value from the second dictionary wins in the union.
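Before looking at the function, here is the bare operator in isolation (Python 3.9+):

```python
# Dict union operators, available since Python 3.9 (PEP 584)
a = {'x': 1, 'y': 2}
b = {'y': 20, 'z': 30}

combined = a | b          # right operand wins on shared keys
print(combined)           # {'x': 1, 'y': 20, 'z': 30}

combined |= [('y', 99)]   # |= also accepts an iterable of key/value pairs
print(combined)           # {'x': 1, 'y': 99, 'z': 30}
```

Note that `|` returns a new dictionary and leaves both operands untouched, while `|=` updates in place; the function below uses both.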
After merging the two dictionaries, the function overrides selected values if the third argument contains (key, value) tuples describing the changes to apply.
def merged(k1, k2, front):
    k3 = k1 | k2
    if front:
        k3 |= front
    return k3
Now let us try out a few examples:-
d = {'shoe': 1, 'slipper': 2, 'boot': 3, 'shoe':7}
l = {'shirt': 3, 'dress': 1, 'shoe':4}
print(merged(d, l, [('shirt', 5)]))
{'shoe': 4, 'slipper': 2, 'boot': 3, 'shirt': 5, 'dress': 1}
d = {'shoe': 1, 'slipper': 2, 'boot': 3, 'shoe':7}
l = {'shirt': 3, 'dress': 1}
print(merged(d, l, [('shirt', 5)]))
{'shoe': 7, 'slipper': 2, 'boot': 3, 'shirt': 5, 'dress': 1}
d = {'shoe': 1, 'slipper': 2, 'boot': 3, 'shoe':7}
l = {'shirt': 3, 'dress': 1}
print(merged(d, l, [('shirt', 5), ('shoe', 3)]))
{'shoe': 3, 'slipper': 2, 'boot': 3, 'shirt': 5, 'dress': 1}
d = {'shoe': 1, 'slipper': 2, 'boot': 3, 'shoe':7}
l = {'shirt': 3, 'dress': 1}
print(merged(d, l, []))
{'shoe': 7, 'slipper': 2, 'boot': 3, 'shirt': 3, 'dress': 1}
What are your thoughts on this? Leave a comment with your own solution in the comment box under this post.
Wing Tips
Debug Docker Compose Containerized Python Apps with Wing Pro
This Wing Tip describes how to configure Docker Compose so that Python code running on selected container services can be debugged with Wing Pro. This makes it easy to develop and debug containerized applications written in Python.
Prerequisites
To get started, you will need to Install Docker Compose.
You will also need an existing Docker Compose project that uses Python on at least one of the container services. If you don't already have one, it is easy to set one up as described in Getting Started with Docker Compose. However, if you use that example, you will need to switch to the official Python docker image and not the 'alpine' image, which contains a stripped-down build of Python that cannot load Wing's debugger core. This is easy to do by changing FROM python:3.7-alpine to FROM python:3.8 in the Dockerfile. You will also need to remove the RUN apk add line from the Dockerfile; it is not needed with the official Python docker image.
Configuration
To set up your Docker Compose project so it can be used with Wing's Python debugger, you will need to add some volume mounts to each container that you want to debug. These mount Wing's debugger support and cause Python to initiate debug whenever it is run on the container.
1. Prepare sitecustomize
The first step is to make a new directory sitecustomize in the same directory as your docker-compose.yml file and then add a file named __init__.py to the directory with the following contents:
from . import wingdbstub
This is the hook that will cause Python on the containers to load Wing's debugger. It is loaded by Python's Site-specific configuration hook.
2. Configure wingdbstub.py
Next, you need to configure a copy of wingdbstub.py to place into this sitecustomize directory. This module is provided by Wing as the way to start debug of any Python code that is launched from outside of the IDE, as is the case here since your code is launched in the container by docker-compose up.
You can find the master copy of wingdbstub.py at the top level of your Wing installation (or on macOS in Contents/Resources inside WingPro.app). If you don't know where this is, it is listed as the Install Directory in Wing's About box.
You will need to make a copy of this file in your sitecustomize package directory and then make two changes to it:
- Set WINGHOME='/wingpro7'
- Set kHostPort='host.docker.internal:50005'
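Applied to the copy in sitecustomize, those two edits would look roughly like this (only these lines change; the rest of wingdbstub.py is left as distributed, and the comments here are ours):

```python
# In sitecustomize/wingdbstub.py -- the two edited settings
WINGHOME = '/wingpro7'                      # where the Wing install will be mounted on the container
kHostPort = 'host.docker.internal:50005'    # host and port where the IDE listens for debug connections
```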
3. Inspect Your Installation
In order to figure out what volume mounts you need to add to your docker-compose.yml file, you first need to determine:
(1) The full path of your Wing installation on the host system, which is given in Wing's About box. This is the same place that you found wingdbstub.py earlier.
(2) The location of the site customization site-packages on each container that you want to debug. This is where you will mount the sitecustomize directory from your host system. You can determine this value by starting Python on the container and inspecting it. For example, for a service in docker-compose.yml that is called web, you can start Python on the container interactively like this:
docker run -i compose_web python -i -u
Note that the Docker image name is the same as the Docker Compose service name but with compose_ prepended.
Then type or paste in the following lines of code:
>>> import os, sys, site
>>> v = sys.version_info[:2]
>>> print(os.path.join(site.USER_BASE, 'lib', 'python{}.{}'.format(*v), 'site-packages'))
Make a note of the path that this prints; you will need it in the next step below.
4. Add Mounted Volumes
Now you can add your volume mounts in the docker-compose.yml file. You will be mounting the Wing installation directory at /wingpro7 (this must match the WINGHOME set earlier in your copy of wingdbstub.py) and your sitecustomize package directory inside the site-packages location determined above.
For example on Windows you might add the following in docker-compose.yml for each service that you want to debug:
volumes:
  - "C:\Program Files (x86)\Wing Pro 7.2:/wingpro7"
  - "./sitecustomize:/root/.local/lib/python3.8/site-packages/sitecustomize"
On macOS this might instead be:
volumes:
  - /Applications/WingPro.app/Contents/Resources:/wingpro7
  - ./sitecustomize:/root/.local/lib/python3.8/site-packages/sitecustomize
And on Linux it might be:
volumes:
  - /usr/lib/wingpro7:/wingpro7
  - ./sitecustomize:/root/.local/lib/python3.8/site-packages/sitecustomize
Example
Here's an example of these added volumes in context, within the docker-compose.yml that is used in Getting Started with Docker Compose:
version: "3.8"
services:
web:
build: .
ports:
- "5000:5000"
volumes:
- .:/code
- ./sitecustomize:/root/.local/lib/python3.8/site-packages/sitecustomize
- /Applications/WingPro.app/Contents/Resources:/wingpro7
environment:
FLASK_ENV: development
redis:
image: "redis:alpine"
Note that we're only debugging the web service and not Python code running on the redis service.
Starting Debug
Now you can start your cluster and debug your containerized Python code in Wing Pro.
To do that, first make sure Wing is listening for outside debug connections, by clicking on the bug icon in the lower left of Wing's window and enabling Accept Debug Connections.
If you are using the Flask example from Getting Started with Docker Compose (or any code that spawns multiple processes that you wish to debug) then you will also need to open Project Properties from the Project menu and set Debug Child Processes under the Debug/Execute tab to Always Debug Child Processes.
Then start your cluster with docker-compose up. Your application will start and the containers you've configured for debug should attempt to connect to Wing Pro. Wing will initially reject the connection and display a dialog for each container you are trying to debug:

Click Accept and then stop docker-compose up by pressing Ctrl-C and restart it. The second time you start your cluster, the containers should manage to connect successfully to Wing's debugger, because you've accepted the randomly generated security token used by each container.
You can now set breakpoints, step through code, and view and interact with data in the debug process using Stack Data, Debug Console, and other tools in Wing. For more information on Wing's capabilities, see the Tutorial in Wing's Help menu or take a look at the Quick Start Guide.
Trouble-Shooting
If you can't get the debugger to connect, try setting kLogFile in your copy of wingdbstub.py to "<stderr>". This will log debugger diagnostics to the output from docker-compose up and will indicate whether the debugger is failing to load or failing to connect to the IDE. You can email this output to support@wingware.com for help.
To inspect other problems, including whether your added file mounts are working correctly, you can start a shell in selected Docker containers after docker-compose up with docker-compose exec <service> <cmd>. For example, to start an interactive shell for the service web defined in docker-compose.yml:
docker-compose exec web bash
Future Directions
Part of our focus in Wing Pro 8 is to extend and improve Wing's support for containerized development. This includes automating container and cluster configuration. As of the date of this article, a subset of that functionality, for working with a single container, is available in our early access program. Future releases will extend this to support Docker Compose and possibly also other container orchestration systems. If you have requests for specific types of support for containerized development, please email us.
That's it for now! We'll be back soon with more Wing Tips for Wing Python IDE.
As always, please don't hesitate to email support@wingware.com if you run into problems, have any questions, or have topic suggestions for future Wing Tips!
ListenData
Learn Python for Data Science
Data Science with Python Tutorial
Python 2 vs. 3
Google yields thousands of articles on this topic; some bloggers oppose Python 2.7 and some favor it. If you filter your search criteria and look only at recent articles, you will find that Python 2 is no longer supported by the Python Software Foundation. Hence it does not make any sense to learn 2.7 if you start learning today. All the major packages now support Python 3. Python 3 is cleaner and faster. It is a language for the future. It fixed major issues with the Python 2 series. Python 3 was first released in 2008, and robust versions of the Python 3 series have been released for 12 years now. You should go for the latest version of Python 3.
Python for Data Science: Introduction
Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture, and back-end services. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. There are some areas, such as the number of libraries for statistical analysis, where R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first programming language of data scientists. There are several reasons to learn Python. Some of them are as follows:
- Python runs well in automating various steps of a predictive model.
- Python has awesome robust libraries for machine learning, natural language processing, deep learning, big data and artificial Intelligence.
- Python wins over R when it comes to deploying machine learning models in production.
- It can be easily integrated with big data frameworks such as Spark and Hadoop.
- Python has a great online community support.
Some of the popular products built with Python include:
- YouTube
- Dropbox
- Disqus
How to install Python?
There are two ways to download and install Python:
- Download Anaconda. It comes with Python along with preinstalled popular libraries.
- Download Python from its official website. You then have to install libraries manually.
Recommended: go for the first option and download Anaconda. It saves a lot of time in learning and coding Python.
Anaconda comes with popular coding environments such as:
- Jupyter (IPython) Notebook
- Spyder
Spyder - Python Coding Environment
October 11, 2020
Tarek Ziade
Web App Software Development Maturity Model
The Capability Maturity Model Integration (CMMI) describes different levels of maturity for the development process of any organization in a measurable way. It offers a set of best practices to improve all processes. It's been regularly updated, and the latest version includes some notions of agility.
CMMI can be applied …
IslandT
Write a python function that produces an array with the numbers 0 to N-1 in it
In this article, we will create a Python function that produces an array with the numbers 0 to N-1 in it.
For example, the following code will result in an array containing the numbers 0 to 4:
arr(5)  # => [0, 1, 2, 3, 4]
There are a few rules we need to follow here:
- when the user passes in 0 to the above function, the function will return an empty list.
- when the user passes in an empty argument into the above function, the function will also return an empty list.
- any other positive number will result in an ascending order array.
Below is the full solution to the above problem.
def arr(n=None):
    li = []
    if n == 0 or n is None:
        return li
    else:
        for i in range(n):
            li.append(i)
        return li
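For comparison, here is a shorter, more idiomatic version of the same function. Since both 0 and None are falsy, a single `if not n` check covers the first two rules, and `range(n)` already yields 0 to n-1 in ascending order:

```python
def arr(n=None):
    # 0 and None are both falsy, so one check covers both empty-list rules.
    if not n:
        return []
    # range(n) yields 0, 1, ..., n-1 in ascending order.
    return list(range(n))

print(arr(5))  # [0, 1, 2, 3, 4]
print(arr())   # []
```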
Write down your own solution in the comment box below this post 
Andrea Grandi
Python 3.9 introduces removeprefix and removesuffix
A quick tutorial to removeprefix and removesuffix methods which have been introduced with Python 3.9.0
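For example (assuming Python 3.9 or later):

```python
url = "https://example.com/index.html"

# removeprefix/removesuffix treat their argument as one literal string,
# unlike lstrip/rstrip, which treat it as a set of characters to strip.
print(url.removeprefix("https://"))  # example.com/index.html
print(url.removesuffix(".html"))     # https://example.com/index

# If the prefix or suffix is absent, the string is returned unchanged.
print("data.csv".removesuffix(".txt"))  # data.csv
```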
Codementor
Data Engineering Series #1: 10 Key tech skills you need, to become a competent Data Engineer.
Bridging the gap between Application Developers and Data Scientists, the demand for Data Engineers ro...
Ram Rachum
GridRoyale - A life simulation for exploring social dynamics
GridRoyale - A life simulation for exploring social dynamics
Another day, another project :)
This is a project that I wanted to do for years. I finally had the opportunity to do it. Check out the GridRoyale readme on GitHub for more details and a live demo.
GridRoyale is a life simulation. It’s a tool for machine learning researchers to explore social dynamics.
It’s similar to Game of Life or GridWorld, except I added game mechanics to encourage the players to behave socially. These game mechanics are similar to those in the battle royale genre of computer games, which is why it’s called GridRoyale.
The game mechanics, Python framework and visualization are pretty good. The core algorithm sucks, and I'm waiting for someone better than me to come and write a new one. If that's you, please open a pull request.
"CodersLegacy"
Scrapy vs BeautifulSoup | Python Web Crawlers
This article is Scrapy vs BeautifulSoup comparison.
If you ever come across a scenario where you need to download data off the internet, you’ll need to use a Python Web Crawler. There are two good web crawlers in Python that can be used for this purpose, Scrapy and BeautifulSoup.
What are web crawlers? What is web scraping? Which python web crawler should you be using, Scrapy or BeautifulSoup? We’ll be answering all these questions here in this article.
Web Scraping and Web Crawlers
Web scraping is the act of extracting or "scraping" data from a web page. The general process is as follows. First, the targeted web page is "fetched" or downloaded. Next, the data is retrieved and parsed into a suitable format. Finally, we navigate through the parsed data, selecting the data we want.
The web scraping process is fully automated, done through a bot which we call the "Web Crawler". Web Crawlers are created in languages like Python, using libraries such as BeautifulSoup and Scrapy.
BeautifulSoup vs Scrapy
BeautifulSoup is actually just a simple content parser. It can’t do much else, as it even requires the requests library to actually retrieve the web page for it to scrape. Scrapy on the other hand is an entire framework consisting of many libraries, as an all in one solution to web scraping. Scrapy can retrieve, parse and extract data from a web page all by itself.
By this point you might be asking, why even learn BeautifulSoup? Scrapy is an excellent framework, but its learning curve is much steeper due to the large number of features, a harder setup, and complex navigation. BeautifulSoup is both easier to learn and easier to use. Even someone who knows Scrapy well may use BeautifulSoup for simpler tasks.
The difference between the two is like the difference between a simple pistol and a rifle with advanced gear attached. The pistol, due to its simplicity, is easier and faster to use. The rifle, on the other hand, requires much more skill and training, but is ultimately much deadlier.
Scrapy Features
It’s possible that some of the below tasks are possible with BeautifulSoup through alternate means, like using other libraries. However, the point here is that Scrapy has all these features built in to it, fully supported and compatible with it’s other features.
Improved Scraping
Built upon Twisted, an asynchronous networking framework, Scrapy also outperforms other web scrapers in both speed and memory usage.
Furthermore, it's much more versatile and flexible. Websites often change their layout and structure over time. Scrapy is not affected by minor changes in a website, and will continue to work normally.
Using other classes and settings like “Rules” you can also adjust the behavior of the Scrapy Spider in many different ways.
Parallel Requests
Typically web crawlers deal with one request at a time. Scrapy has the ability to run requests in parallel, allowing for much faster scraping.
In theory, if it takes a minute to execute 60 requests one at a time, then with 6 "concurrent" requests you could get it done in 10 seconds. This isn't always the case in practice, due to overhead, latency, and the time taken to actually download each page.
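The back-of-the-envelope arithmetic above can be sketched as a small helper (the function name is ours, and real runs add overhead and latency on top of this ideal figure):

```python
import math

def ideal_scrape_time(num_requests, seconds_per_request, concurrency):
    """Idealized wall-clock time: requests complete in ceil(total / concurrency) batches."""
    batches = math.ceil(num_requests / concurrency)
    return batches * seconds_per_request

# 60 one-second requests, 6 at a time: 10 batches -> 10 seconds in theory.
print(ideal_scrape_time(60, 1, 6))
```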
Cookies and User agents
By default, web crawlers will identify themselves as web crawlers to the browser/website they access. This can be quite a problem when you’re trying to get around the bot protection on certain websites.
With the use of User Agents, Cookies and Headers in Scrapy, you can fool the website into thinking that it’s an actual human attempting to access the site.
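As a minimal sketch, a spider might override these in its custom_settings; the user-agent string below is just an example browser signature, not a recommendation:

```python
# Hypothetical custom_settings fragment for a Scrapy spider.
custom_settings = {
    # Present the crawler as a regular desktop browser instead of Scrapy's default agent.
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    # Keep cookies enabled so the site sees a continuous session.
    'COOKIES_ENABLED': True,
}
```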
AutoThrottle
One of the major reasons why websites are able to detect Scrapy Spiders (or any spider in general) is due to how fast the Requests are made. Things just get even worse when your Scrapy Spider ends up slowing down the website due to the large number of requests in a short period of time.
To prevent this, Scrapy has the AutoThrottle option. Enabling this setting will cause Scrapy to automatically adjust the scraping speed of the spider depending on the traffic load on the target website.
This benefits us because our Spider becomes a lot less noticeable and the chances of getting IP banned decrease significantly. The website also benefits, since the load is spread out more evenly instead of being concentrated at a single point.
Rate limiting
The purpose of Rate or “Request” Limiting is the same as AutoThrottle, to increase the delay between requests to keep the spider off the website’s radar. There are all kinds of different settings which you can manipulate to achieve the desired result.
The difference between this setting and AutoThrottle is that Rate limiting involves using fixed delays, whereas AutoThrottle automatically adjusts the delay based off several factors.
Another bonus fact in Scrapy is that you can actually use both AutoThrottle and the Rate limiting settings together to create a more complex crawler that’s both fast and undetectable.
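A settings sketch combining the two might look like this; the setting names are Scrapy's own, but the numeric values are illustrative, not recommendations:

```python
# settings.py fragment: fixed rate limiting plus adaptive AutoThrottle.
DOWNLOAD_DELAY = 1.0                    # fixed baseline delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True         # jitter the delay to look less robotic
AUTOTHROTTLE_ENABLED = True             # adapt the delay to observed server latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average requests Scrapy aims to have in flight
```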
Proxies and VPN’s
In cases where you need to send out a large number of requests to a website, it's extremely suspicious if they all come from one IP address. If you're not careful, your IP will get banned pretty quickly.
The solution to this is the rotating proxy and VPN support that Scrapy offers. With these you can make each request appear to have arrived from a different location. Using this is the closest you'll get to completely masking the presence of your web crawler.
XPath and CSS Selectors
XPath and CSS selectors are key to making Scrapy a complete web scraping library. These two are advanced and easy to use techniques through which one can easily scrape through the HTML content on a web page.
XPath in particular is an extremely flexible way of navigating through the HTML structure of a web page. It’s more versatile than CSS selectors, being able to traverse both forward and backward.
Debugging and Logging
Another one of Scrapy’s handy features is the inbuilt debugger and logger. Everything that happens, from the headers used, to the time taken for each page to download, the website latency etc is all printed out in the terminal and can be logged into a proper file. Any errors or potential issues that occur are also displayed.
Exception Handling
While web scraping on a large scale, you'll run into all kinds of server errors, missing pages, internet issues etc. Scrapy, with its exception handling, allows you to gracefully handle each one of these issues without breaking down. You can even pause your Scrapy spider and resume it at a later time.
Scrapy Code
Below is some example Scrapy code that we've selected from our various tutorials to demonstrate here. Each project example is accompanied by a brief description of its usage.
This first Scrapy code example features a Spider that scans through the entire quotes.toscrape.com site, extracting each and every quote along with the author's name.
We’ve used the Rules class in order to ensure that the Spider scrapes only certain pages (to save time and avoid duplicate quotes) and added some custom settings, such as AutoThrottle.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  callback='parse_filter_book', follow=True)]
    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_DEBUG': True,
    }

    def parse_filter_book(self, response):
        for quote in response.css('div.quote'):
            yield {
                'Author': quote.xpath('.//span/a/@href').get(),
                'Quote': quote.xpath('.//span[@class="text"]/text()').get(),
            }
Another important feature that Scrapy has is link following which can be implemented in different ways. For instance the example above also had link following enabled through the Rules class.
In the below example however, we’re doing it in a unique way that allows us to visit every page on Wikipedia extracting the page names from every single one of them. In short, it’s a more controlled way of link following.
The below code will not actually scrape the entire site, due to the DEPTH_LIMIT setting. We've done this simply to keep the Spider around Python-related topics and to keep the scraping time reasonable.
from scrapy.spiders import CrawlSpider

class SuperSpider(CrawlSpider):
    name = 'follower'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Web_scraping']
    base_url = 'https://en.wikipedia.org'
    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        for next_page in response.xpath('.//div/p/a'):
            yield response.follow(next_page, self.parse)
        for quote in response.xpath('.//h1/text()'):
            yield {'quote': quote.extract()}
This section doesn’t really contribute much to the Scrapy vs BeautifulSoup debate, but it does help you get an idea on what Scrapy code is like.
Conclusion
If you're a beginner, I would recommend BeautifulSoup over Scrapy. It's just easier than Scrapy in almost every way, from its setup to its usage. Once you've gained some experience, the transition to Scrapy should become easier, as they have overlapping concepts.
For simple projects, BeautifulSoup will be more than enough. However, if you’re really serious about making a proper web crawler then you’ll have to use Scrapy.
Ultimately, you should learn both (while giving preference to Scrapy) and use either one of them depending on the situation.
This marks the end of the Scrapy vs BeautifulSoup article. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
The post Scrapy vs BeautifulSoup | Python Web Crawlers appeared first on CodersLegacy.
Awesome Python Applications
Spack
Spack: Language-independent package manager for supercomputers, Mac, and Linux, designed for scientific computing.
ABlog for Sphinx
ABlog v0.10.11 released
Pull Requests merged in: