
Planet Python

Last update: October 13, 2020 01:48 PM UTC

October 13, 2020


Stack Abuse

Simple NLP in Python With TextBlob: Tokenization

Introduction

The amount of textual data on the Internet has increased significantly in the past decades. There's no doubt that processing this amount of information must be automated, and the TextBlob package is one of the fairly simple ways to perform NLP - Natural Language Processing.

It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, tokenization, sentiment analysis, classification, translation, and more.

No special technical prerequisites are needed to use this library. For instance, TextBlob works with both Python 2 and 3. In case you don't have any textual data for the project you want to work on, TextBlob provides the necessary corpora from the NLTK database.

Installing TextBlob

Let's start out by installing TextBlob and the NLTK corpora:

$ pip install -U textblob
$ python -m textblob.download_corpora

Note: This process can take some time due to the large number of algorithms and corpora that this library contains.

What is Tokenization?

Before going deeper into the field of NLP, you should be able to recognize one of its key terms. So, what is tokenization?

Tokenization, or word segmentation, is a simple process of separating a corpus into small units, i.e. tokens, such as sentences or words.

As a simple illustration, an input sentence can be tokenized on the basis of the spaces between its words. You can also tokenize characters from a single word (e.g. a-p-p-l-e from apple) or separate sentences from one text.

Tokenization is one of the basic and crucial stages of language processing. It transforms unstructured textual material into data. This could be applied further in developing various models of machine translation, search engine optimization, or different business inquiries.
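Before reaching for TextBlob, the core idea can be sketched in plain Python: splitting a sentence on whitespace yields word tokens, and a string is already a sequence of character tokens (a toy sketch, not how TextBlob tokenizes):

```python
sentence = "You can also tokenize characters from a single word"

# Word-level tokens: split the sentence on whitespace
word_tokens = sentence.split()
print(word_tokens)
# ['You', 'can', 'also', 'tokenize', 'characters', 'from', 'a', 'single', 'word']

# Character-level tokens: a str is already an iterable of characters
char_tokens = list("apple")
print(char_tokens)  # ['a', 'p', 'p', 'l', 'e']
```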

Implementing Tokenization in Code

First of all, it's necessary to establish a TextBlob object and define a sample corpus that will be tokenized later. For example, let's try to tokenize part of the poem If by Rudyard Kipling:

from textblob import TextBlob

# Creating the corpus
corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

Once the corpus is defined, it should be passed as an argument to the TextBlob constructor:

blob_object = TextBlob(corpus)

Once constructed, we can perform various operations on this blob_object. It already contains our corpus, categorized to a degree.

Word Tokenization

Finally, to get the tokenized words, we simply retrieve the words attribute of the created blob_object. This gives us a list containing Word objects, which behave very similarly to str objects:

from textblob import TextBlob

corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

blob_object = TextBlob(corpus)

# Word tokenization of the sample corpus
corpus_words = blob_object.words
# To see all tokens
print(corpus_words)
# To count the number of tokens
print(len(corpus_words))

The output commands should give you the following:

['If', 'you', 'can', 'force', 'your', 'heart', 'and', 'nerve', 'and', 'sinew', 'to', 'serve', 'your', 'turn', 'long', 'after', 'they', 'are', 'gone', 'and', 'so', 'hold', 'on', 'when', 'there', 'is', 'nothing', 'in', 'you', 'except', 'the', 'Will', 'which', 'says', 'to', 'them', 'Hold', 'on']
38

It's worth noting that this approach tokenizes words on whitespace and strips the punctuation. We can change the tokenizer, for example, to one that splits on TAB characters:

from textblob import TextBlob
from nltk.tokenize import TabTokenizer

corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. 	And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

tokenizer = TabTokenizer()
blob_object = TextBlob(corpus, tokenizer = tokenizer)

# Word tokenization of the sample corpus
corpus_words = blob_object.tokens
# To see all tokens
print(corpus_words)

Note that we've added a TAB after the first sentence here. Now, the corpus of words looks something like:

["If you can force your heart and nerve and sinew to serve your turn long after they are gone.", "And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'"]

nltk.tokenize contains other tokenization options as well. By default, it uses the SpaceTokenizer which you don't need to define explicitly, but can. Other than these two, it also contains useful tokenizers such as LineTokenizer, BlankLineTokenizer and WordPunctTokenizer.
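WordPunctTokenizer, for instance, separates runs of alphanumeric characters from runs of punctuation. A rough stdlib approximation of that behavior (a sketch, not NLTK's actual implementation) looks like this:

```python
import re

def word_punct_tokenize(text):
    # Match either a run of word characters or a run of punctuation,
    # mirroring the general idea behind NLTK's WordPunctTokenizer
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct_tokenize("'Hold on!'"))
# ["'", 'Hold', 'on', "!'"]
```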

A full list can be found in their documentation.

Sentence Tokenization

To tokenize on a sentence-level, we'll use the same blob_object. This time, instead of the words attribute, we will use the sentences attribute. This returns a list of Sentence objects:

from textblob import TextBlob

corpus = '''If you can force your heart and nerve and sinew to serve your turn long after they are gone. And so hold on when there is nothing in you except the Will which says to them: 'Hold on!'
'''

blob_object = TextBlob(corpus)

# Sentence tokenization of the sample corpus
corpus_sentence = blob_object.sentences
# To identify all tokens
print(corpus_sentence)
# To count the number of tokens
print(len(corpus_sentence))

Output:

[Sentence("If you can force your heart and nerve and sinew to serve your turn long after they are gone"), Sentence("And so hold on when there is nothing in you except the Will which says to them: 'Hold on!")]
2

Conclusion

Tokenization is a very important data pre-processing step in NLP and involves breaking a text down into smaller chunks called tokens. These tokens can be individual words, sentences, or characters from the original text.

TextBlob is a great library to get into NLP with since it offers a simple API that lets users quickly jump into performing NLP tasks.

In this article, we discussed just one of the NLP tasks that TextBlob deals with. In a future article, we'll take a look at how to solve more complex issues, such as dealing with word inflections, plural and singular forms of words, and more.

October 13, 2020 12:30 PM UTC


Codementor

How to Implement Role based Access Control With FastAPI

Quick Summary of RBAC concept, working code snippets and how I reached there

October 13, 2020 06:02 AM UTC


Kushal Das

Updates from Johnnycanencrypt development in the last few weeks

In July this year, I wrote a very initial Python module in Rust for OpenPGP, Johnnycanencrypt aka jce. It had very basic encryption, decryption, signing, verification, creation of new keys available. It uses https://sequoia-pgp.org library for the actual implementation.

I wanted to see if I can use such a Python module (which does not call out to the gpg2 executable) in the SecureDrop codebase.

First try (2 weeks ago)

Two weeks ago on Friday, when I sat down to see if I could start using the module, within a few minutes I understood it was not possible. The module was missing basic key management and more refined control over key creation and expiration dates.

On that weekend, I wrote a KeyStore using file-based keys as backend and added most of the required functions to try again.

The last Friday

I sat down again; this time, I had a few friends (including Saptak and Nabarun) on video with me, and together we tried to plug jce into the SecureDrop container for Focal. After around 4 hours, we had around 5 failing tests (out of 32) in the crypto-related tests. Most of the basic functionality was working, but we were stuck on the last few tests. As I was using the file system to store the keys (in simple .sec or .pub files), it was difficult to keep track of the existing keys when multiple processes were creating/deleting keys in the same KeyStore.

Next try via a SQLite based KeyStore

Next, I replaced the KeyStore with an SQLite based backend. Now multiple processes can access the keys properly. With a few other updates, now I have only 1 failing test (where I have to modify the test properly) in that SecureDrop Focal patch.

While doing this experiment, I again found the benefits of writing the documentation of the library as I developed it. Most of the time, I had to double-check against it to make sure that I was making the right calls. I also added one example where one can verify the latest (10.0) Tor Browser download via Python.

In case you already use OpenPGP encryption in your tool/application, or you want to try it, please give jce a try. It works on Python 3.7+. I tested it on Linux and macOS, and it should work on Windows too. I have an issue open for that, and if you know how to do it, please feel free to submit a PR.

October 13, 2020 04:32 AM UTC

October 12, 2020


Podcast.__init__

Cloud Native Application Delivery Using GitOps - Episode 284

The way that applications are being built and delivered has changed dramatically in recent years with the growing trend toward cloud native software. As part of this movement toward the infrastructure and orchestration that powers your project being defined in software, a new approach to operations is gaining prominence. Commonly called GitOps, the main principle is that all of your automation code lives in version control and is executed automatically as changes are merged. In this episode Victor Farcic shares details on how that workflow brings together developers and operations engineers, the challenges that it poses, and how it influences the architecture of your software. This was an interesting look at an emerging pattern in the development and release cycle of modern applications.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Tree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema you can create your data catalog and have it fully populated in under five minutes when using one of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to pythonpodcast.com/treeschema today to get your first month free, and mention this podcast to get 50% off your first three months after the trial.
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
  • Your host as usual is Tobias Macey and today I’m interviewing Victor Farcic about using GitOps practices to manage your application and your infrastructure in the same workflow

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by giving an overview of what GitOps is?
  • What are the architectural or design elements that developers need to incorporate to make their applications work well in a GitOps workflow?
  • What are some of the tools that facilitate a GitOps approach to managing applications and their target environments?
  • What are some useful strategies for managing local developer environments to maintain parity with how production deployments are architected?
  • As developers acquire more responsibility for building the automation to provision the production environment for their applications, what are some of the operations principles that they need to understand?
  • What are some of the development principles that operators and systems administrators need to acquire to be effective in contributing to an environment that is managed by GitOps?
  • What are the areas for collaboration and dividing lines of responsibility between developers and platform engineers in a GitOps environment?
  • Beyond the application development and deployment, what are some of the additional concerns that need to be built into an application in order for it to be manageable and maintainable once it is in production?
  • What are some of the organizational principles that contribute to a successful implementation of GitOps?
  • What are some of the most interesting, innovative, or unexpected ways that you have seen GitOps employed?
  • What have you found to be the most challenging aspects of creating a scalable and maintainable GitOps practice?
  • When is GitOps the wrong choice, and what are the alternatives?
  • What resources do you recommend for anyone who wants to dig deeper into this subject?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

October 12, 2020 11:12 PM UTC


Reuven Lerner

What’s the easiest way to boost your career as a software developer? Learn to touch type.

I’ve been a professional programmer for about 30 years, self-employed for 25 years, and doing full-time corporate Python training for more than a decade.

I run a small business, which involves me writing, programming, and teaching, as well as handling all of the business-related stuff.

So, what’s my most important skill, the thing that helps me get lots accomplished in a short period of time? Easy: My ability to touch type.

It all started when I was in high school in the mid-1980s. I would use my family’s computer — yes, in those days, the entire family shared one — for schoolwork, for doing some introductory programming, and even writing newsletters for my high-school youth organization. The thing is, I was doing all of this typing with two fingers, and this drove my parents bananas.

Both of my parents can touch type. In those days, it was typical for office workers to record their correspondence, give the recording to a secretary, and then review the result before sending it out. My father never did that, because he typed at least as fast as his secretary, and the whole dictation process would slow him down. It wasn’t unusual to hear the rat-tat-tat of my father typing from his study at home.

It’s no surprise that it bothered my parents to be hunting and pecking. I was pretty fast at it, but I was no match for my father or any other touch typist. My parents strongly encouraged me to learn to touch type, but I was a teenager, which meant that I knew better than they did. And besides, I type fast enough, right?

Finally, my parents set a new rule: For every hour that I used the computer, I had to spend an hour doing a lesson from a touch-typing book. (How quaint, right?) I yelled. I screamed. I cried. I protested. But my parents didn’t budge.

At first, it was painful: When you start to touch type, you are learning to use your hands in a new way, one that feels completely foreign. You also type much more slowly than you did before, and feel like you’re wasting your time. I certainly had these feelings, and when I had to get something done quickly, I would refer to my old two-finger method.

But within two or three weeks, I was already touch typing as quickly as I did with two fingers. Better yet, and somewhat amazingly, I was able to type without looking at the keyboard! I could enter passages from a book, without having to move my eyes from book to keyboard and back. I could talk to someone while typing. I could even sneak a peek at the TV while I was typing.

Achieving true speed didn’t happen for a while. But when I started college in the fall of 1988, I was already typing at a pretty fast clip. At the student newspaper, I was frequently drafted to take printouts from the Associated Press and type them into our “world and nation” section. And at the computer labs, where we had loud, mechanical IBM keyboards, people would ask me if I could type more slowly, because the rat-tat-tat was disturbing them.

Fast forward to 2020, and I cannot imagine my work without being able to touch type:

Lots of professional writers know that they need to touch type. After all, they write for a living, and being unable to get the most out of their keyboard would seem like a crazy thing to do.

And yet, I find that only a small number of the developers in my courses can touch type. They never really thought about it that much, or decided not to put time and effort into it, or thought that it was hard or impossible to learn. For most of them, it simply was never a priority.

Touch typing looks magical and impossible to achieve. It’s like watching a virtuoso pianist expressing themselves through the instrument, their thoughts and feelings flowing effortlessly from their brains to their hands, and then to the piano.

But here’s the thing: It’s not hard to learn. You’ll be frustrated for the weeks during which you’re learning and forcing yourself to work in a new way. But it pays for itself in spades, allowing you to write, edit, and express yourself — in code and text — more easily than you could ever imagine. And if I managed to learn from a book as an angry teenager, then you can certainly learn with the variety of online tools, many of them free, available today.

So if you want to give your career a boost, don’t go and learn the latest language, JavaScript library, or API. Rather, learn to touch type. The time that you save and the flexibility it’ll provide will more than make up for the time you spent learning.

The post What’s the easiest way to boost your career as a software developer? Learn to touch type. appeared first on Reuven Lerner.

October 12, 2020 06:58 PM UTC


Ned Batchelder

Ordered dict surprises

Since Python 3.6, regular dictionaries retain their insertion order: when you iterate over a dict, you get the items in the same order they were added to the dict. Before 3.6, dicts were unordered: the iteration order was seemingly random.

Here are two surprising things about these ordered dicts.

You can’t get the first item

Since the items in a dict have a specific order, it should be easy to get the first (or Nth) item, right? Wrong. It's not possible to do this directly. You might think that d[0] would be the first item, but it's not: it's the value for the key 0, which could be the last item added to the dict.

The only way to get the Nth item is to iterate over the dict, and wait until you get to the Nth item. There’s no random access by ordered index. This is one place where lists are better than dicts. Getting the Nth element of a list is an O(1) operation. Getting the Nth element of a dict (even if it is ordered) is an O(N) operation.
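The closest you can get is to start an iteration and stop early. As a sketch, next(iter(d)) retrieves the first key, and itertools.islice can skip ahead to the Nth one (still O(N), since it walks past the preceding items):

```python
from itertools import islice

d = {"a": 1, "b": 2, "c": 3}

# First key: start iterating and stop immediately
first_key = next(iter(d))
print(first_key)  # a

# Nth key (0-based): islice still has to walk past the first n items
nth_key = next(islice(d, 2, None))
print(nth_key)  # c
```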

OrderedDict is a little different

If dicts are ordered now, collections.OrderedDict is useless, right? Well, maybe. It won’t be removed because that would break code using that class, and it has some methods that regular dicts don’t. But there’s also one subtle difference in behavior. Regular dicts don’t take order into account when comparing dicts for equality, but OrderedDicts do:

>>> d1 = {"a": 1, "b": 2}
>>> d2 = {"b": 2, "a": 1}
>>> d1 == d2
True
>>> list(d1)
['a', 'b']
>>> list(d2)
['b', 'a']

>>> from collections import OrderedDict
>>> od1 = OrderedDict([("a", 1), ("b", 2)])
>>> od2 = OrderedDict([("b", 2), ("a", 1)])
>>> od1 == od2
False
>>> list(od1)
['a', 'b']
>>> list(od2)
['b', 'a']
>>>
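Examples of those extra methods are move_to_end() and popitem(last=False); regular dicts have popitem(), but it can only pop from the end:

```python
from collections import OrderedDict

od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])

# move_to_end reorders the dict in place; regular dicts have no equivalent
od.move_to_end("a")
print(list(od))  # ['b', 'c', 'a']

# popitem(last=False) pops from the front; dict.popitem has no such option
key, value = od.popitem(last=False)
print(key, value)  # b 2
```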

BTW, this post is the result of a surprisingly long and contentious discussion in the Python Discord.

October 12, 2020 06:48 PM UTC


Test and Code

134: Business Outcomes and Software Development

Within software projects, there are lots of metrics we could measure. But which ones really matter? Instead of a list, Benjamin Harding shares with us a way of thinking about business outcomes that can help us with everyday decision making.

We talk about:

  • Business outcomes vs vanity metrics
  • As a developer, how do you keep business outcomes in mind
  • Thinking about customer value all the time
  • Communicating decisions and options in terms of costs and impact on business outcomes
  • Company culture and its role in reinforcing a business outcome mindset
  • And even the role of team lead as impact multiplier

I really enjoyed this conversation. But I admit that at first, I didn't realize how important this is for all software development. Metrics are front and center in a web app. But what about a service, or an embedded system with no telemetry? It still matters, maybe even more so. Developers face little and big decisions every day that have an impact on costs and benefits with respect to customer value and business outcomes, even if it's difficult to measure.

Special Guest: Benjamin Harding.

Sponsored By:

Support Test & Code : Python Testing for Software Engineering


October 12, 2020 04:15 PM UTC


IslandT

Return a list of multiply numbers with Python

In this simple exercise from CodeWars, you will build a function that takes two values, integer and limit, and returns a list of the multiples of integer up to limit. If the limit is a multiple of integer, it should be included as well. Only positive integers will ever be passed into the function, never 0, and the limit will always be higher than the base.

For example, if the parameters passed are (2, 6), the function should return [2, 4, 6] as 2, 4, and 6 are the multiples of 2 up to 6.

Below is the solution, write down your own solution in the comment box.

def find_multiples(integer, limit):
    li = []
    mul = 1
    while(True):
        number = integer * mul
        mul += 1
        if number <= limit:
            li.append(number)
        else:
            return li

The while loop will keep on running until the limit has been reached then the function will return the entire list.
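Since the multiples are evenly spaced, the same result can also be produced without an explicit loop. As an alternative sketch, range with a step of integer generates the multiples directly:

```python
def find_multiples(integer, limit):
    # range steps through integer, 2*integer, ...; limit + 1 keeps the
    # limit itself in the result when it is an exact multiple
    return list(range(integer, limit + 1, integer))

print(find_multiples(2, 6))  # [2, 4, 6]
```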

October 12, 2020 03:07 PM UTC


Stack Abuse

Add Legend to Figure in Matplotlib

Introduction

Matplotlib is one of the most widely used data visualization libraries in Python. Typically, when visualizing more than one variable, you'll want to add a legend to the plot, explaining what each variable represents.

In this article, we'll take a look at how to add a legend to a Matplotlib plot.

Creating a Plot

Let's first create a simple plot with two variables:

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()

x = np.arange(0, 10, 0.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color='blue')
ax.plot(z, color='black')

plt.show()

Here, we've plotted a sine function, starting at 0 and ending at 10 with a step of 0.1, as well as a cosine function in the same interval and step. Running this code yields:

sine visualization python

Now, it would be very useful to label these and add a legend so that someone who didn't write this code can more easily discern which is which.

Add Legend to a Figure in Matplotlib

Let's add a legend to this plot. Firstly, we'll want to label these variables, so that we can refer to those labels in the legend. Then, we can simply call legend() on the ax object for the legend to be added:

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()

x = np.arange(0, 10, 0.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color='blue', label='Sine wave')
ax.plot(z, color='black', label='Cosine wave')
leg = ax.legend()

plt.show()

Now, if we run the code, the plot will have a legend:

add legend to matplotlib plot

Notice how the legend was automatically placed in the only free space where the waves won't run over it.

Customize Legend in Matplotlib

The legend is added, but it's a little bit cluttered. Let's remove the border around it and move it to another location, as well as change the plot's size:

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(0, 10, 0.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color='blue', label='Sine wave')
ax.plot(z, color='black', label='Cosine wave')
leg = ax.legend(loc='upper right', frameon=False)

plt.show()

This results in:

customizing legend in matplotlib

Here, we've used the loc argument to specify that we'd like to put the legend in the top right corner. Other values that are accepted are upper left, lower left, upper right, lower right, upper center, lower center, center left and center right.

Additionally, you can use center to put it in the dead center, or best to place the legend at the "best" free spot so that it doesn't overlap with any of the other elements. By default, best is selected.

Add Legend Outside of Axes

Sometimes, it's tricky to place the legend within the border box of a plot. Perhaps, there are many elements going on and the entire box is filled with important data.

In such cases, you can place the legend outside of the axes, and away from the elements that constitute it. This is done via the bbox_to_anchor argument, which specifies where we want to anchor the legend to:

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(0, 10, 0.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color='blue', label='Sine wave')
ax.plot(z, color='black', label='Cosine wave')
leg = ax.legend(loc='center', bbox_to_anchor=(0.5, -0.10), shadow=False, ncol=2)

plt.show()

This results in:

add legend outside of axes

The bbox_to_anchor argument accepts a few arguments itself. Firstly, it accepts a tuple, which allows up to 4 elements. Here, we can specify the x, y, width and height of the legend.

We've only set the x and y values, placing the legend 0.10 below the axes (y = -0.10) and 0.5 from the left side (0 being the left-hand side of the box and 1 the right-hand side).

By tweaking these, you can set the legend at any place, within or outside of the box.

Then, we've set the shadow to False. This is used to specify whether we want a small shadow rendered below the legend or not.

Finally, we've set the ncol argument to 2. This specifies the number of columns in the legend. Since we have two labels and want them side by side in a single row, we've set it to 2. If we changed this argument to 1, they'd be placed one above the other:

add legend outside of axes with one col

Note: The bbox_to_anchor argument is used alongside the loc argument. The loc argument will put the legend based on the bbox_to_anchor. In our case, we've put it in the center of the new, displaced, location of the border box.
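Another common recipe built from the same two arguments (a sketch along the lines of the examples above) is parking the legend just outside the right edge of the axes:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no window needed
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(0, 10, 0.1)
ax.plot(np.sin(x), color='blue', label='Sine wave')
ax.plot(np.cos(x), color='black', label='Cosine wave')

# Anchor the legend's center-left point just past the right edge of the axes
leg = ax.legend(loc='center left', bbox_to_anchor=(1.02, 0.5))
fig.tight_layout()
```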

Conclusion

In this tutorial, we've gone over how to add a legend to your Matplotlib plots. Firstly, we've let Matplotlib figure out where the legend should be located, after which we've used the bbox_to_anchor argument to specify our own location, outside of the axes.

If you're interested in Data Visualization and don't know where to start, make sure to check out our book on Data Visualization in Python.

Data Visualization in Python, a book for beginner to intermediate Python developers, will guide you through simple data manipulation with Pandas, cover core plotting libraries like Matplotlib and Seaborn, and show you how to take advantage of declarative and experimental libraries like Altair.

Data Visualization in Python

Understand your data better with visualizations! With over 275+ pages, you'll learn the ins and outs of visualizing data in Python with popular libraries like Matplotlib, Seaborn, Bokeh, and more.

October 12, 2020 02:08 PM UTC


Real Python

Using ggplot in Python: Visualizing Data With plotnine

In this tutorial, you’ll learn how to use ggplot in Python to create data visualizations using a grammar of graphics. A grammar of graphics is a high-level tool that allows you to create data plots in an efficient and consistent way. It abstracts most low-level details, letting you focus on creating meaningful and beautiful visualizations for your data.

There are several Python packages that provide a grammar of graphics. This tutorial focuses on plotnine since it’s one of the most mature ones. plotnine is based on ggplot2 from the R programming language, so if you have a background in R, then you can consider plotnine as the equivalent of ggplot2 in Python.

In this tutorial, you’ll learn how to:

  • Install plotnine and Jupyter Notebook
  • Combine the different elements of the grammar of graphics
  • Use plotnine to create visualizations in an efficient and consistent way
  • Export your data visualizations to files

This tutorial assumes that you already have some experience in Python and at least some knowledge of Jupyter Notebook and pandas. To get up to speed on these topics, check out Jupyter Notebook: An Introduction and Using Pandas and Python to Explore Your Dataset.


Setting Up Your Environment#

In this section, you’ll learn how to set up your environment. You’ll cover the following topics:

  1. Creating a virtual environment
  2. Installing plotnine
  3. Installing Jupyter Notebook

Virtual environments enable you to install packages in isolated environments. They’re very useful when you want to try some packages or projects without messing with your system-wide installation. You can learn more about them in Python Virtual Environments: A Primer.

Run the following commands to create a directory named data-visualization and a virtual environment inside it:

$ mkdir data-visualization
$ cd data-visualization
$ python3 -m venv venv

After running the above commands, you’ll find your virtual environment inside the data-visualization directory. Run the following command to activate the virtual environment and start using it:

$ source ./venv/bin/activate

When you activate a virtual environment, any package that you install will be installed inside the environment without affecting your system-wide installation.

Next, you’ll install plotnine inside the virtual environment using the pip package installer.

Install plotnine by running this command:

$ pip install plotnine

Executing the above command makes the plotnine package available in your virtual environment.

Finally, you’ll install Jupyter Notebook. While this isn’t strictly necessary for using plotnine, you’ll find Jupyter Notebook really useful when working with data and building visualizations. If you’ve never used the program before, then you can learn more about it in Jupyter Notebook: An Introduction.

To install Jupyter Notebook, use the following command:

$ pip install jupyter

Congratulations, you now have a virtual environment with plotnine and Jupyter Notebook installed! With this setup, you’ll be able to run all the code samples presented throughout this tutorial.

Building Your First Plot With ggplot and Python#

In this section, you’ll learn how to build your first data visualization using ggplot in Python. You’ll also learn how to inspect and use the example datasets included with plotnine.

Read the full article at https://realpython.com/ggplot-python/ »



October 12, 2020 02:00 PM UTC


PyCharm

Datalore by JetBrains: Online Jupyter Notebooks Editor With PyCharm’s Code Insight

If you work with Jupyter Notebooks and want to run code, produce heavy visualizations, and render markdown online – give Datalore a try. It comes with cloud storage, real-time collaboration, notebook publishing, and PyCharm’s code insight. In this blog post we’ll give you a quick introduction to what you can do in Datalore.


Jupyter notebooks in the cloud

Once you register a Datalore account, you can get your first notebook up and running in seconds. No setup is required, and the most popular data science libraries such as NumPy, Matplotlib, pandas, TensorFlow, etc., are already preinstalled.

As soon as you create a new notebook or upload an existing one, you can attach dataset files to it. In Datalore, files are uploaded to cloud storage and then attached to the notebook. The free Datalore plan comes with 10 GB of storage space.

Code insight in Datalore

One of the best features of Datalore is its coding assistance, which it borrows directly from PyCharm.

We firmly believe that code completion, quick-fixes, auto-imports, rename, and reformatting options help make your online coding experience far more productive. Try out the coding assistance and let us know what you think!

Online editor experience

Datalore supports Markdown and LaTeX. All computations are run in the cloud, which improves the time it takes for visualizations and markdown cells to be rendered.

We also support common Jupyter shortcuts and documentation popups. You can find the full list of available action shortcuts in the Help → Command palette menu tab.

Collaboration in Datalore

There are 4 ways to collaborate in Datalore.

1. Share your notebooks

Share your notebooks with your team via File → Share and collaborate in real time. The cursors of your team members will appear with color highlights and name tags. If something goes wrong you can revert to a history checkpoint via Tools → History.

2. Publish your notebooks

Publish your notebooks when you want to share insights and receive comments. Published notebooks can then be shared using a link.

3. Share whole workspaces

Share whole workspaces and work together with colleagues on multiple notebooks. Notebooks and attached files will be shared among all the workspace members. You can create a shared workspace via the Workspace menu on Datalore’s home screen.

4. Publish PyCharm notebooks

Publish PyCharm notebooks to share the results with your colleagues. You can upload them to Datalore from PyCharm IDE via the pre-installed Datalore plugin. Just make sure you are using version 0.1.18 or later.

Give Datalore a try!

Learn more about Datalore’s features from the Datalore blog. The Datalore team is always eager to hear your feedback! Please don’t hesitate to write to us in the comments or post in our forum.

Enjoy your data science journey,

Your Datalore and PyCharm teams

October 12, 2020 12:36 PM UTC


Chris Moffitt

Case Study: Processing Historical Weather Pattern Data

Introduction

The main purpose of this blog is to show people how to use Python to solve real world problems. Over the years, I have been fortunate enough to hear from readers about how they have used tips and tricks from this site to solve their own problems. In this post, I am extremely delighted to present a real world case study. I hope it will give you some ideas about how you can apply these concepts to your own problems.

This example comes from Michael Biermann from Germany. He had the challenging task of trying to gather detailed historical weather data in order to do analysis on the relationship between air temperature and power consumption. This article will show how he used a pipeline of Python programs to automate the process of collecting, cleaning and processing gigabytes of weather data in order to perform his analysis.

Problem Background

I will turn it over to Michael to give the background for this problem.

Hi, I’m Michael, CEO of a company providing services to energy providers, especially focusing on electrical power and gas. I wanted to do an ex-post analysis to get deeper insights into the deviation of the power consumption of electrical heating systems in comparison to the air temperature. Since we provide power to other companies, we need to have a good grasp on the power consumption, which correlates to the air temperature. In short, I needed to know how well I can predict the long term temperatures and how much deviation is to be expected.

To be able to do this analysis, I needed historical temperatures, which are supplied by the German weather service, DWD. Since it would be really time consuming to download all the files and extract them by hand, I decided to give this a shot with Python. I know a few things about programming, but I am pretty far from a professional programmer. The process was a lot of trial and error, but this project turned out to be exactly the right fit for this approach. I use a lot of hardcore Excel analysis, fetching and munching data with Power Query M, but this was clearly over the limit to what can be done in Excel.

I am really happy with the results. There is hardly anything as satisfying as letting the computer do the hard work for the next 20 min, while grabbing a cup of coffee.

I am also really happy to have learned a few more things about web scraping, because I can use it in future projects to automate data fetching.

Here is a visual to help understand the process Michael created:

Data Processing Pipeline

If you are interested in following along, all of the code examples are available here.

Downloading the Data

The first notebook in the pipeline is 1-dwd_konverter_download. This notebook pulls historical temperature data from the German Weather Service (DWD) server and formats it for future use in other projects.

The data is delivered in hourly frequencies in a .zip file for each of the available weather stations. To use the data, we need everything in a single .csv file with all stations side-by-side. Also, we need the daily average.

To reduce computing time, we also crop all data earlier than 2007.

For the purposes of this article, I have limited the download to only 10 files but the full data set is over 600 files.

import requests
import re
from bs4 import BeautifulSoup
from pathlib import Path

# Set base values
download_folder = Path.cwd() / 'download'
base_url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/'


# Initiate Session and get the Index-Page
with requests.Session() as s:
    resp = s.get(base_url)

# Parse the Index-Page for all relevant <a href>
soup = BeautifulSoup(resp.content, 'html.parser')
links = soup.findAll("a", href=re.compile("stundenwerte_TU_.*_hist.zip"))

# For testing, only download 10 files
file_max = 10
dl_count = 0

# Download the .zip files to the download_folder
for link in links:
    # Limit the downloads while testing; check before requesting
    # so we don't fetch an extra file past the limit
    dl_count += 1
    if dl_count > file_max:
        break
    zip_response = requests.get(base_url + link['href'], stream=True)
    with open(download_folder / link['href'], 'wb') as file:
        for chunk in zip_response.iter_content(chunk_size=128):
            file.write(chunk)

print('Done')

This portion of code will parse the download page, find all of the zip files whose names match stundenwerte_TU, and save them in a download directory.

Extracting the Data

After the first step is completed, the download directory contains multiple zip files. The second notebook in the process is 2-dwd_konverter_extract which will search each zip file for a .txt file that contains the actual temperature values.

The program will then extract each matching file and move it to the import directory for further processing.

from pathlib import Path
import glob
import re
from zipfile import ZipFile

# Folder definitions
download_folder = Path.cwd() / 'download'
import_folder = Path.cwd() / 'import'

# Find all .zip files and generate a list
unzip_files = glob.glob('download/stundenwerte_TU_*_hist.zip')

# Set the name pattern of the file we need
regex_name = re.compile('produkt.*')

# Open all files, look for files that match the regex pattern, extract to 'import'
for file in unzip_files:
    with ZipFile(file, 'r') as zipObj:
        list_of_filenames = zipObj.namelist()
        extract_filename = list(filter(regex_name.match, list_of_filenames))[0]
        zipObj.extract(extract_filename, import_folder)

display('Done')

After running this script, the import directory will contain text files that look like this:

STATIONS_ID;MESS_DATUM;QN_9;TT_TU;RF_TU;eor
        3;1950040101;    5;   5.7;  83.0;eor
        3;1950040102;    5;   5.6;  83.0;eor
        3;1950040103;    5;   5.5;  83.0;eor
        3;1950040104;    5;   5.5;  83.0;eor
        3;1950040105;    5;   5.8;  85.0;eor

Building the DataFrame

Now that we have isolated the data we need, we must format it for further analysis.

There are three steps in this notebook, 3-dwd_konverter_build_df:

Process Individual Files

The files are imported into a single DataFrame, stripped of unnecessary columns, and filtered by date. Then we set a DateTimeIndex and concatenate them into the main_df. Because the loop takes a long time, we output some status messages to ensure the process is still running.

Process the concatenated main_df

Then we display some info about the main_df so we can ensure that there are no errors, mainly that all data types are recognized correctly. We also drop duplicate entries, in case some of the .csv files were accidentally duplicated during the development process.

Unstack and export

For the final step, we unstack the main_df and save it to a .csv and a .pkl file for the next step in the analysis process. Also, we display some output to get a grasp of what is going on.

Now let’s look at the code:

import numpy as np
import pandas as pd
from IPython.display import clear_output

from pathlib import Path
import glob


import_files = glob.glob('import/*')
out_file = Path.cwd() / "export_uncleaned" / "to_clean"

obsolete_columns = [
    'QN_9',
    'RF_TU',
    'eor'
]

main_df = pd.DataFrame()
i = 1

for file in import_files:

    # Read in the next file
    df = pd.read_csv(file, delimiter=";")

    # Prepare the df before merging (Drop obsolete, convert to datetime, filter to date, set index)
    df.drop(columns=obsolete_columns, inplace=True)
    df["MESS_DATUM"] = pd.to_datetime(df["MESS_DATUM"], format="%Y%m%d%H")
    df = df[df['MESS_DATUM']>= "2007-01-01"]
    df.set_index(['MESS_DATUM', 'STATIONS_ID'], inplace=True)

    # Merge to the main_df
    main_df = pd.concat([main_df, df])

    # Display some status messages
    clear_output(wait=True)
    display('Finished file: {}'.format(file), 'This is file {}'.format(i))
    display('Shape of the main_df is: {}'.format(main_df.shape))
    i+=1

# Check if all types are correct
display(main_df['TT_TU'].apply(lambda x: type(x).__name__).value_counts())

# Make sure that no files or observations are duplicated, i.e. scan the index for duplicate entries.
# The ~ is a bitwise NOT: it inverts the boolean mask from index.duplicated().
main_df = main_df[~main_df.index.duplicated(keep='last')]


# Unstack the main_df
main_df = main_df.unstack('STATIONS_ID')
display('Shape of the main_df is: {}'.format(main_df.shape))

# Save main_df to a .csv file and a pickle to continue working in the next step
main_df.to_pickle(Path(out_file).with_suffix('.pkl'))
main_df.to_csv(Path(out_file).with_suffix('.csv'), sep=";")

display(main_df.head())
display(main_df.describe())

As this program runs, here is some of the progress output:

'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt'
'This is file 10'
'Shape of the main_df is: (771356, 1)'
float    771356
Name: TT_TU, dtype: int64
'Shape of the main_df is: (113952, 9)'

Here is what the final DataFrame looks like:

TT_TU
STATIONS_ID 3 44 71 73 78 91 96 102 125
MESS_DATUM
2007-01-01 00:00:00 11.4 NaN NaN NaN 11.0 9.4 NaN 9.7 NaN
2007-01-01 01:00:00 12.0 NaN NaN NaN 11.4 9.6 NaN 10.4 NaN
2007-01-01 02:00:00 12.3 NaN NaN NaN 9.4 10.0 NaN 9.9 NaN
2007-01-01 03:00:00 11.5 NaN NaN NaN 9.3 9.7 NaN 9.5 NaN
2007-01-01 04:00:00 9.6 NaN NaN NaN 8.6 10.2 NaN 8.9 NaN

At the end of this step, we have the file in a condensed format we can use for analysis.

Final Processing

The data contains some errors, which need to be cleaned. You can see, by looking at the output of main_df.describe(), that the minimum temperature on some stations is -999. That means that there is no plausible measurement for this particular hour. We change this to np.nan, so that we can safely calculate the average daily value in the next step.

Once these values are corrected, we need to resample to daily measurements. Pandas resample makes this really simple.
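Before the full script, here is a self-contained toy version of both steps. The TT_TU column name and the -999 sentinel mirror the DWD data; the tiny frame itself is made up for illustration:

```python
import numpy as np
import pandas as pd

# Two hourly readings per day for one station; -999 marks a missing value
idx = pd.to_datetime(['2012-01-01 00:00', '2012-01-01 12:00',
                      '2012-01-02 00:00', '2012-01-02 12:00'])
toy = pd.DataFrame({'TT_TU': [10.0, -999, 4.0, 6.0]}, index=idx)

# Step 1: turn the -999 sentinel into NaN so it is excluded from means
toy = toy.replace(to_replace=-999, value=np.nan)

# Step 2: resample the hourly data down to daily averages
daily = toy.resample('D').mean().round(decimals=2)
print(daily)
```

Because NaN values are skipped by mean(), the first day's average is computed from its single valid reading rather than being dragged down by the sentinel.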

import numpy as np
import pandas as pd
from pathlib import Path

# Import and export paths
pkl_file = Path.cwd() / "export_uncleaned" / "to_clean.pkl"
cleaned_file = Path.cwd() / "export_cleaned" / "cleaned.csv"

# Read in the pickle file from the last cell
cleaning_df = pd.read_pickle(pkl_file)

# Replace all values with "-999", which indicate missing data
cleaning_df.replace(to_replace=-999, value=np.nan, inplace=True)

# Resample to daily frequency
cleaning_df = cleaning_df.resample('D').mean().round(decimals=2)

# Save as .csv
cleaning_df.to_csv(cleaned_file, sep=";", decimal=",")

# Show some results for verification
display(cleaning_df.loc['2011-12-31':'2012-01-04'])
display(cleaning_df.describe())
display(cleaning_df)

Here is the final DataFrame with daily average values for the stations:

TT_TU
STATIONS_ID 3 44 71 73 78 91 96 102 125
MESS_DATUM
2011-12-31 NaN 3.88 2.76 1.19 4.30 2.43 NaN 3.80 NaN
2012-01-01 NaN 10.90 8.14 4.03 10.96 10.27 NaN 9.01 NaN
2012-01-02 NaN 7.41 6.18 4.77 7.57 7.77 NaN 6.48 4.66
2012-01-03 NaN 6.14 3.61 4.46 6.38 5.28 NaN 5.63 3.51
2012-01-04 NaN 5.80 2.48 4.45 5.46 4.57 NaN 5.85 1.94

Summary

There are several aspects of this case study that I really like.

  • Michael was not an expert programmer and decided to dedicate himself to learning the Python necessary for solving this problem.
  • It took some time for him to learn how to accomplish multiple tasks but he persevered through all the challenges and built a complete solution.
  • This was a real world problem that would be difficult to solve with other tools but could be automated with very few lines of Python code.
  • The process could be time consuming to run so it’s broken down into multiple stages. This is a great idea to apply to other problems. This previous article actually served as the inspiration for many of the techniques used in the solution.
  • This solution brings together many different concepts including web scraping, downloading files, working with zip files and cleaning & analyzing data with pandas.
  • Michael now has a new skill that he can apply to other problems in his business.

Finally, I love this quote from Michael:

There is hardly anything as satisfying as letting the computer do the hard work for the next 20 min, while grabbing a cup of coffee.

I agree 100%. Thank you Michael for taking the time to share such a great example! I hope it gives you some ideas to apply to your own projects.

October 12, 2020 12:25 PM UTC


IslandT

Beginning steps to create a Stockfish chess application

I am a chess player, and in order to improve my skills I have recently decided to create a chess application that I can play against, so I can get ready to face stronger opponents on a site like Lichess. The chess application below will take me around a year to complete, and I will show you the progress from time to time.

This application will be developed with the following tools:

  1. Python will be the programming language used to develop this application.
  2. The Stockfish chess engine will serve as the central mind of this application.
  3. The stockfish module will be needed to communicate with the Stockfish chess engine.
  4. Pygame will be needed to display the graphical user interface of this chess application.
  5. I am using Windows to develop this application, so it might not work on other operating systems.

In this chapter, we will first download the Stockfish chess engine from this site. I am using the 64-bit version (maximally compatible but slow) to suit my laptop. This chess engine alone does not do anything; we will need a Stockfish engine wrapper, which you can get from this site! There are other modules around that do the same thing, but the stockfish module appears to be very easy to use.

With the above two tools ready, we can now open up our PyCharm IDE and input the following code.

from stockfish import Stockfish

# Use a raw string so the backslashes in the Windows path are not treated as escapes
stockfish = Stockfish(r"E:\StockFish\stockfish_20090216_x64")
stockfish.set_position(["e2e4", "e7e6"])
print(stockfish.get_board_visual())

As you can see we first need to import the Stockfish module into our program. Next, we will pass in the path to the Stockfish chess engine as an argument when we create a Stockfish object. Next, we will set the first move for both the Milk and the Chocolate players. Finally, we will print out the chess position on the chessboard accordingly.

The position on the chess board

The chessboard looks really great but as I have mentioned before I will not use this display but instead will use PyGame to create the chess user interface for this chess application instead.

So there you have it, we have successfully installed the Stockfish module as well as downloaded the Stockfish chess engine.

What next? Next time we will install the PyGame module and show the chess pieces on the chessboard!

October 12, 2020 11:38 AM UTC


Mike Driscoll

PyDev of the Week: Sean Tibor

This week we welcome Sean Tibor (@smtibor) as our PyDev of the Week! Sean is the co-host of the Teaching Python podcast. He has been a guest on other podcasts, such as Test & Code and is the founder of Red Reef Digital.

Let’s spend a few moments getting to know Sean better!

Sean Tibor

Can you tell us a little about yourself (hobbies, education, etc):

It’s funny: I never expected to be a teacher. I went to college and grad school for Information Systems and learned to code in C++, Java, PHP, and VB.NET, then spent nearly 20 years working in IT and Marketing.

A few years ago, a dear family friend asked me to consider a career change into teaching since she thought I would have an aptitude for it. This is now my third year teaching middle school computer science in Florida at a private PK-12 school. Every 11-14 year old student in my school takes 9 weeks of computer science for each year of grade 6, 7, 8.

There are few things that I find professionally more satisfying than seeing a kid discover potential within themselves. Teaching has become more about the journey that each student goes through in learning to code than the specific lessons they learn.

It’s also really fun that my hobbies of coding hardware, making and designing electronics, and 3d printing have become part of my profession. I get to bring all of these skills and knowledge to my teaching craft, so it feels like I get to play all day with the things I love.

Why did you start using Python?

When I started teaching, the school I joined had just undergone a huge revision to their Computer Science curriculum. As part of that, they chose to make Python the language that all middle school students would learn.

So over the course of the summer, I started learning as much Python as I could absorb, using everything from books like Automate the Boring Stuff to CircuitPython and MicroPython hardware to Pybites code challenges. It took several months, but I was able to start teaching right from the first day of school.

In addition to teaching Python, it’s also been very useful for integration and automation projects around the school to make things run a bit smoother. I’m also using it to work on a few side projects in the marketing automation space, so it’s enhanced other parts of my professional life.

What other programming languages do you know and which is your favorite?

I’m a strong believer in Python as a useful and efficient language for getting things done so that’s my go-to language. Over the years, I’ve dabbled in a lot of different languages like VB.NET, Java, PHP, Objective-C, C++, and Arduino. Most of that has been replaced with Python for my projects and then I add in some HTML, CSS, JS, and SQL as needed to make it all come together.

What projects are you working on now?

My favorite project right now has been a wrapper library and function library for our school’s JAMF server that handles Apple device management. Our school has over 1500 iPads in use across two campuses and my project automates many of the common tasks that used to be very hands-on and manual. Now that we have this project in place, we can hand over a brand new shrink-wrapped iPad to a teacher or student and it will automatically configure itself with apps and settings within about 5 minutes of connecting to the internet.

Which Python libraries are your favorite (core or 3rd party)?

I don’t think it gets a lot of attention, but I love the dateutil library. My final project for my undergraduate degree was a web-based personal information manager that synchronized with your PDA, and the most complex part by far was the calendar module. Ever since, I’ve been a little obsessed with getting my dates and times correct in code, and the dateutil library has so many useful features, from timezone selection to parsing strings into datetime objects and even handling interesting relative dates.

What have you learned being a host of the Teaching Python podcast?

The best thing has been meeting all of the amazing people in the Python community and doing that all with my teaching partner and co-host, Kelly Paredes. She hadn’t coded before and I hadn’t taught before when we started the podcast, so each of us were beginners at something where the other person was more of an expert.

With every person we meet, we each learn a lot more about teaching, Python, and the many, many different cool things that people are doing out there in the world. Often after an episode recording session, we’ll sit there and chat about all the interesting things we learned from our guest or from each other.

I also found it really amazing how welcoming and accessible the Python and education community can be. We started as just two teachers who wanted to try making a podcast about our experiences teaching something new to both of us. We’ve made amazing friends, had some of the most mind-blowing conversations, and no one has ever said no.

What is the hardest thing to teach in class about Python?

The hardest thing is nothing to do with the Python language. It’s overcoming a student’s belief that “I am not a coder.” With patience and persistence, I’ve found that nearly every student can find something that they like about coding and create something that they are tremendously proud of. I’ve seen students create everything from an RGB-lit umbrella, to a choose-your-own adventure game with 700 lines of code, to an Alexa voice skill that reminds them about things so their mom doesn’t have to.

I’ve found that coding is a lot like running. Many people say that they’re not a runner. However, it’s your own journey to running or coding that matters. If you run, you are a runner. If you code, you are a coder. I don’t expect every student to be a gifted coder, but I’ve seen students blow me away with what they can do once they discard the notion that they are “not a coder.”

Is there anything else you’d like to say?

Learning Python in order to teach it to others has been quite a bit different than the other times I’ve learned a new language. Every time a student asks me how something works, I think I’ve got the right answer, but then they ask me a followup question that makes me excited to go learn more. Teaching another person is absolutely the best way to keep yourself challenged and motivated to learn more.

Thanks for doing the interview, Sean!

The post PyDev of the Week: Sean Tibor appeared first on The Mouse Vs. The Python.

October 12, 2020 05:05 AM UTC


IslandT

Merge two dictionaries using the Dict Union operator

In this article we will create a Python function which will merge two dictionaries using the Dict Union operator.

The Dict Union operator keeps only one value per key. If the same key appears twice in a dictionary literal, only the last occurrence survives, and if the same key appears in both dictionaries, the value from the second dictionary wins in the merged result.

After merging the two dictionaries, the function updates the merged values using the third argument, a list of (key, value) tuples specifying which keys to change.
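Before the function itself, the union operators (available since Python 3.9) behave like this in a minimal sketch:

```python
a = {'x': 1, 'y': 2}
b = {'y': 3, 'z': 4}

c = a | b        # new dict; on key conflicts the right-hand operand wins
print(c)         # {'x': 1, 'y': 3, 'z': 4}

c |= [('y', 9)]  # |= also accepts an iterable of (key, value) pairs
print(c)         # {'x': 1, 'y': 9, 'z': 4}
```

Note that `|` always builds a new dictionary, while `|=` updates the left-hand dictionary in place.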

def merged(k1, k2, front):
    # Requires Python 3.9+ for the dict union operators
    k3 = k1 | k2
    if front != []:
        # |= also accepts an iterable of (key, value) pairs
        k3 |= front
    return k3

Now let us try out a few examples:-

d = {'shoe': 1, 'slipper': 2, 'boot': 3, 'shoe':7}
l = {'shirt': 3, 'dress': 1, 'shoe':4}

print(merged(d, l, [('shirt', 5)]))
{'shoe': 4, 'slipper': 2, 'boot': 3, 'shirt': 5, 'dress': 1}
d = {'shoe': 1, 'slipper': 2, 'boot': 3, 'shoe':7}
l = {'shirt': 3, 'dress': 1}

print(merged(d, l, [('shirt', 5)]))
{'shoe': 7, 'slipper': 2, 'boot': 3, 'shirt': 5, 'dress': 1}
d = {'shoe': 1, 'slipper': 2, 'boot': 3, 'shoe':7}
l = {'shirt': 3, 'dress': 1}

print(merged(d, l, [('shirt', 5), ('shoe', 3)]))
{'shoe': 3, 'slipper': 2, 'boot': 3, 'shirt': 5, 'dress': 1}
d = {'shoe': 1, 'slipper': 2, 'boot': 3, 'shoe':7}
l = {'shirt': 3, 'dress': 1}

print(merged(d, l, []))
{'shoe': 7, 'slipper': 2, 'boot': 3, 'shirt': 3, 'dress': 1}

What are your thoughts on this? Leave a comment with your own solution in the comment box under this post 🙂

October 12, 2020 03:36 AM UTC


Wing Tips

Debug Docker Compose Containerized Python Apps with Wing Pro

This Wing Tip describes how to configure Docker Compose so that Python code running on selected container services can be debugged with Wing Pro. This makes it easy to develop and debug containerized applications written in Python.

Prerequisites

To get started, you will need to Install Docker Compose.

You will also need an existing Docker Compose project that uses Python on at least one of the container services. If you don't already have one, it is easy to set one up as described in Getting Started with Docker Compose. However, if you use that example you will need to change to the official Python docker image and not the 'alpine' image, which contains a stripped-down build of Python that cannot load Wing's debugger core. This is easy to do, by changing FROM python:3.7-alpine to FROM python:3.8 in the Dockerfile. You will also need to remove the RUN apk add line from the Dockerfile. This is not needed with the official Python docker image.

Configuration

To set up your Docker Compose project so it can be used with Wing's Python debugger, you will need to add some volume mounts to each container that you want to debug. These mount Wing's debugger support and cause Python to initiate debug whenever it is run on the container.

1. Prepare sitecustomize

The first step is to make a new directory sitecustomize in the same directory as your docker-compose.yml file and then add a file named __init__.py to the directory with the following contents:

from . import wingdbstub

This is the hook that will cause Python on the containers to load Wing's debugger. It is loaded by Python's Site-specific configuration hook.

2. Configure wingdbstub.py

Next, you need to configure a copy of wingdbstub.py to place into this sitecustomize directory. This module is provided by Wing as the way to start debug of any Python code that is launched from outside of the IDE, as is the case here since your code is launched in the container by docker-compose up.

You can find the master copy of wingdbstub.py at the top level of your Wing installation (or on macOS in Contents/Resources inside WingPro.app). If you don't know where this is, it is listed as the Install Directory in Wing's About box.

You will need to make a copy of this file in your sitecustomize package directory and then make two changes to it:

  • Set WINGHOME='/wingpro7'
  • Set kHostPort='host.docker.internal:50005'

3. Inspect Your Installation

In order to figure out what volume mounts you need to add to your docker-compose.yml file, you first need to determine:

(1) The full path of your Wing installation on the host system, which is given in Wing's About box. This is the same place that you found wingdbstub.py earlier.

(2) The location of the site customization site-packages on each container that you want to debug. This is where you will mount the sitecustomize directory from your host system. You can determine this value by starting Python on the container and inspecting it. For example, for a service in docker-compose.yml that is called web, you can start Python on the container interactively like this:

docker run -i compose_web python -i -u

Note that the Docker image name is the same as the Docker Compose service name but with compose_ prepended.

Then type or paste in the following lines of code:

>>> import os, sys, site
>>> v = sys.version_info[:2]
>>> print(os.path.join(site.USER_BASE, 'lib', 'python{}.{}'.format(*v), 'site-packages'))

Make a note of the path that this prints; you will need it in the next step below.
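For reference, the same computation can be wrapped in a small script (this just reproduces the interactive session above; run it on the container, not the host):

```python
import os
import site
import sys

def user_site_packages():
    # Same computation as the interactive session above: the per-user
    # site-packages directory for the Python running on the container.
    major, minor = sys.version_info[:2]
    return os.path.join(site.USER_BASE, 'lib',
                        'python{}.{}'.format(major, minor),
                        'site-packages')

print(user_site_packages())
```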

4. Add Mounted Volumes

Now you can add your volume mounts in the docker-compose.yml file. You will be mounting the Wing installation directory at /wingpro7 (this must match the WINGHOME set earlier in your copy of wingdbstub.py) and your sitecustomize package directory inside the site-packages directory determined above.

For example on Windows you might add the following in docker-compose.yml for each service that you want to debug:

volumes:
  - 'C:\Program Files (x86)\Wing Pro 7.2:/wingpro7'
  - "./sitecustomize:/root/.local/lib/python3.8/site-packages/sitecustomize"

On macOS this might instead be:

volumes:
  - /Applications/WingPro.app/Contents/Resources:/wingpro7
  - ./sitecustomize:/root/.local/lib/python3.8/site-packages/sitecustomize

And on Linux it might be:

volumes:
  - /usr/lib/wingpro7:/wingpro7
  - ./sitecustomize:/root/.local/lib/python3.8/site-packages/sitecustomize

Example

Here's an example of these added volumes in context, within the docker-compose.yml that is used in Getting Started with Docker Compose:

version: "3.8"
services:
  web:
    build: .
    ports:
      - "5000:5000"
    volumes:
      - .:/code
      - ./sitecustomize:/root/.local/lib/python3.8/site-packages/sitecustomize
      - /Applications/WingPro.app/Contents/Resources:/wingpro7
    environment:
      FLASK_ENV: development
  redis:
    image: "redis:alpine"

Note that we're only debugging the web service and not Python code running on the redis service.

Starting Debug

Now you can start your cluster and debug your containerized Python code in Wing Pro.

To do that, first make sure Wing is listening for outside debug connections, by clicking on the bug icon in the lower left of Wing's window and enabling Accept Debug Connections.

If you are using the Flask example from Getting Started with Docker Compose (or any code that spawns multiple processes that you wish to debug) then you will also need to open Project Properties from the Project menu and set Debug Child Processes under the Debug/Execute tab to Always Debug Child Processes.

Then start your cluster with docker-compose up. Your application will start and the containers you've configured for debug should attempt to connect to Wing Pro. Wing will initially reject the connection and display a dialog for each container you are trying to debug:

[Screenshot: connection rejection dialog]

Click Accept and then stop docker-compose up by pressing Ctrl-C and restart it. The second time you start your cluster, the containers should manage to connect successfully to Wing's debugger, because you've accepted the randomly generated security token used by each container.

You can now set breakpoints, step through code, and view and interact with data in the debug process using Stack Data, Debug Console, and other tools in Wing. For more information on Wing's capabilities, see the Tutorial in Wing's Help menu or take a look at the Quick Start Guide.

Trouble-Shooting

If you can't get the debugger to connect, try setting kLogFile in your copy of wingdbstub.py to "<stderr>". This will log debugger diagnostics to the output from docker-compose up and will indicate whether the debugger is failing to load or failing to connect to the IDE. You can email this output to support@wingware.com for help.

To inspect other problems, including whether your added file mounts are working correctly, you can start a shell in a selected Docker container after docker-compose up with docker-compose exec <service> <cmd>. For example, to start an interactive shell for the service web defined in docker-compose.yml:

docker-compose exec web bash

Future Directions

Part of our focus in Wing Pro 8 is to extend and improve Wing's support for containerized development. This includes automating container and cluster configuration. As of the date of this article, a subset of that functionality, for working with a single container, is available in our early access program. Future releases will extend this to support Docker Compose and possibly also other container orchestration systems. If you have requests for specific types of support for containerized development, please email us.



That's it for now! We'll be back soon with more Wing Tips for Wing Python IDE.

As always, please don't hesitate to email support@wingware.com if you run into problems, have any questions, or have topic suggestions for future Wing Tips!

October 12, 2020 01:00 AM UTC


ListenData

Learn Python for Data Science

This tutorial will help you learn Data Science with Python through examples. It is designed for beginners who want to get started with Data Science in Python. Python is an open-source language widely used as a high-level, general-purpose programming language, and it has gained great popularity in the data science world. In the PyPL Popularity of Programming Language index, Python ranks second with a 14 percent share. It is also ranked among the top 3 programming languages for advanced and predictive analytics.
Data Science Python
Data Science with Python Tutorial

Table of Contents

Python 2 vs. 3

Google yields thousands of articles on this topic, with some bloggers arguing for 2.7 and some against it. If you filter your search for recent articles, you will find that Python 2 is no longer supported by the Python Software Foundation, so it makes no sense to learn 2.7 if you are starting today. Python 3 supports all the major packages, is cleaner and faster, and fixed major issues with the Python 2 series. Python 3 was first released in 2008, and robust versions of the 3.x series have been released for 12 years now. You should go for the latest version of Python 3.
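For a concrete illustration of the incompatibility, two behaviors that changed between the series (in Python 2, print was a statement and / performed integer division on ints):

```python
# Python 3: print is a function, and / is always true division
print(7 / 2)    # 3.5 (in Python 2 this was 3)
print(7 // 2)   # 3, explicit floor division in both series
```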

Python for Data Science : Introduction

Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture, back-end development, etc. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. In some areas, such as the number of libraries for statistical analysis, R wins over Python, but Python is catching up fast. With the popularity of big data and data science, Python has become the first programming language of data scientists.
There are several reasons to learn Python. Some of them are as follows -
  1. Python runs well in automating various steps of a predictive model.
  2. Python has awesome robust libraries for machine learning, natural language processing, deep learning, big data and artificial Intelligence.
  3. Python wins over R when it comes to deploying machine learning models in production.
  4. It can be easily integrated with big data frameworks such as Spark and Hadoop.
  5. Python has a great online community support.
Do you know these sites are developed in Python?
  1. YouTube
  2. Instagram
  3. Reddit
  4. Dropbox
  5. Disqus

How to install Python?

There are two ways to download and install Python
  1. Download Anaconda. It comes with Python software along with preinstalled popular libraries.
  2. Download Python from its official website. You have to manually install libraries.
Recommended : Go for first option and download anaconda. It saves a lot of time in learning and coding Python
Coding Environments
Anaconda comes with two popular IDEs:
  1. Jupyter (Ipython) Notebook
  2. Spyder
Spyder is like RStudio for Python: it provides a user-friendly environment for writing Python code. If you are a SAS user, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs and a console to check each line of code. Under the 'Variable explorer', you can access your created data files and functions. I highly recommend Spyder!
Spyder - Python Coding Environment
Jupyter (IPython) Notebook: Jupyter is the equivalent of R Markdown in R. It is useful when you need to present your work to others or create a step-by-step project report, as it can combine code, output, text, and graphics.
READ MORE »

October 12, 2020 12:38 AM UTC

October 11, 2020


Tarek Ziade

Web App Software Development Maturity Model

The Capability Maturity Model Integration (CMMI) describes different levels of maturity for the development process of any organization in a measurable way. It offers a set of best practices to improve all processes. It's been regularly updated, and the latest version includes some notions of agility.

CMMI can be applied …

October 11, 2020 10:00 PM UTC


IslandT

Write a python function that produces an array with the numbers 0 to N-1 in it

In this article, we will create a python function that will produce an array with the numbers 0 to N-1 in it.

For example, the following code will result in an array containing the numbers 0 to 4:

arr(5) // => [0,1,2,3,4]

There are a few rules we need to follow here:-

  1. when the user passes in 0 to the above function, the function will return an empty list.
  2. when the user passes in an empty argument into the above function, the function will also return an empty list.
  3. any other positive number will result in an ascending order array.

Below is the full solution to the above problem.

def arr(n=None):
    li = []
    if n == 0 or n is None:
        return li
    else:
        for i in range(n):
            li.append(i)
        return li
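For comparison, here is a more compact version of the same function (the name arr_compact is just to distinguish it from the solution above; it satisfies the same three rules):

```python
def arr_compact(n=None):
    # n is falsy for both 0 and None, so both cases return an empty list;
    # any positive n produces the ascending range 0..n-1
    return list(range(n)) if n else []
```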

Write down your own solution in the comment box below this post 🙂

October 11, 2020 01:39 PM UTC


Andrea Grandi

Python 3.9 introduces removeprefix and removesuffix

A quick tutorial on the removeprefix and removesuffix methods introduced with Python 3.9.0.
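For a quick taste of the two methods (these examples require Python 3.9 or newer):

```python
s = "wing_tips.txt"
# Remove an exact leading/trailing substring, new in Python 3.9
print(s.removeprefix("wing_"))  # tips.txt
print(s.removesuffix(".txt"))   # wing_tips
# Unlike str.strip, the argument is a whole substring, not a character
# set, and the string is returned unchanged when it doesn't match
print(s.removeprefix("nope"))   # wing_tips.txt
```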

October 11, 2020 01:37 PM UTC


Codementor

Data Engineering Series #1: 10 Key tech skills you need, to become a competent Data Engineer.

Bridging the gap between Application Developers and Data Scientists, the demand for Data Engineers ro...

October 11, 2020 11:20 AM UTC


Ram Rachum

GridRoyale - A life simulation for exploring social dynamics

GridRoyale - A life simulation for exploring social dynamics

Another day, another project :)

This is a project that I wanted to do for years. I finally had the opportunity to do it. Check out the GridRoyale readme on GitHub for more details and a live demo.

GridRoyale is a life simulation. It’s a tool for machine learning researchers to explore social dynamics.

It’s similar to Game of Life or GridWorld, except I added game mechanics to encourage the players to behave socially. These game mechanics are similar to those in the battle royale genre of computer games, which is why it’s called GridRoyale.

The game mechanics, Python framework and visualization are pretty good. The core algorithm sucks, and I'm waiting for someone better than me to come and write a new one. If that's you, please open a pull request.

October 11, 2020 08:16 AM UTC


"CodersLegacy"

Scrapy vs BeautifulSoup | Python Web Crawlers

This article is Scrapy vs BeautifulSoup comparison.

If you ever come across a scenario where you need to download data off the internet, you’ll need to use a Python Web Crawler. There are two good web crawlers in Python that can be used for this purpose, Scrapy and BeautifulSoup.

What are web crawlers? What is web scraping? Which python web crawler should you be using, Scrapy or BeautifulSoup? We’ll be answering all these questions here in this article.


Web Scraping and Web Crawlers

Web scraping is the act of extracting or "scraping" data from a web page. The general process is as follows. First, the targeted web page is "fetched", or downloaded. Next, the data is retrieved and parsed into a suitable format. Finally, we navigate through the parsed data, selecting the data we want.

The Web scraping process is fully automated, done through a bot which we call the “Web Crawler”. Web Crawlers are created using appropriate software like Python, with the BeautifulSoup and Scrapy libraries.
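As a sketch of the parse-and-navigate steps, here is BeautifulSoup run on a small inline HTML fragment (a real crawler would first fetch the page, for example with the requests library):

```python
from bs4 import BeautifulSoup

# A tiny HTML fragment standing in for a fetched page
html = '''
<div class="quote">
  <span class="text">To be, or not to be</span>
  <small class="author">William Shakespeare</small>
</div>
'''
# Parse the HTML into a navigable tree
soup = BeautifulSoup(html, "html.parser")
# Navigate with CSS selectors to pick out the data we want
quote = soup.select_one("div.quote span.text").get_text()
author = soup.select_one("div.quote small.author").get_text()
print(f"{quote} - {author}")
```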


BeautifulSoup vs Scrapy

BeautifulSoup is actually just a simple content parser. It can’t do much else, as it even requires the requests library to actually retrieve the web page for it to scrape. Scrapy on the other hand is an entire framework consisting of many libraries, as an all in one solution to web scraping. Scrapy can retrieve, parse and extract data from a web page all by itself.

By this point you might be asking: why even learn BeautifulSoup? Scrapy is an excellent framework, but its learning curve is much steeper due to its large number of features, harder setup, and more complex navigation. BeautifulSoup is both easier to learn and easier to use. Even someone who knows Scrapy well may use BeautifulSoup for simpler tasks.

The difference between the two is like the difference between a simple pistol and a rifle with advanced gear attached. The pistol, due to its simplicity, is easier and faster to use. The rifle requires much more skill and training to use, but is ultimately much deadlier than the pistol.


Scrapy Features

It's possible that some of the below tasks can be accomplished with BeautifulSoup through alternate means, such as other libraries. The point, however, is that Scrapy has all these features built into it, fully supported and compatible with its other features.

Improved Scraping

Built upon Twisted, an asynchronous networking framework, Scrapy is also much faster than other web scrapers and lighter on memory.

Furthermore, it's much more versatile and flexible. Websites often change their layout and structure over time. Scrapy is not affected by minor changes in a website, and will continue to work normally.

Using other classes and settings like “Rules” you can also adjust the behavior of the Scrapy Spider in many different ways.

Parallel requests.

Typically web crawlers deal with one request at a time. Scrapy has the ability to run requests in parallel, allowing for much faster scraping.

In theory, if it takes a minute to execute 60 requests one at a time, then with 6 "concurrent" requests you could get it done in 10 seconds. This isn't always the case in practice, due to overhead, latency, and the time taken to actually download each page.
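Concurrency is controlled through Scrapy's settings; for example, to allow 6 requests in flight at once (the values here are illustrative):

```python
# In a Scrapy project's settings.py (a sketch)
CONCURRENT_REQUESTS = 6             # total requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 6  # cap per target domain
```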

Cookies and User agents

By default, web crawlers will identify themselves as web crawlers to the browser/website they access. This can be quite a problem when you’re trying to get around the bot protection on certain websites.

With the use of User Agents, Cookies and Headers in Scrapy, you can fool the website into thinking that it’s an actual human attempting to access the site.
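A sketch of what this looks like in a project's settings.py (the user-agent string here is just an example of a browser-like value):

```python
# Identify the crawler as a regular browser instead of a bot
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/85.0.4183.121 Safari/537.36")
COOKIES_ENABLED = True  # Scrapy manages session cookies by default
DEFAULT_REQUEST_HEADERS = {
    "Accept-Language": "en",
}
```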

AutoThrottle

One of the major reasons why websites are able to detect Scrapy Spiders (or any spider in general) is due to how fast the Requests are made. Things just get even worse when your Scrapy Spider ends up slowing down the website due to the large number of requests in a short period of time.

To prevent this, Scrapy has the AutoThrottle option. Enabling this setting will cause Scrapy to automatically adjust the scraping speed of the spider depending on the traffic load on the target website.

This benefits us because our spider becomes much less noticeable and the chance of an IP ban decreases significantly. The website also benefits, since the load is spread out evenly instead of being concentrated at a single point.

Rate limiting

The purpose of Rate or “Request” Limiting is the same as AutoThrottle, to increase the delay between requests to keep the spider off the website’s radar. There are all kinds of different settings which you can manipulate to achieve the desired result.

The difference between this setting and AutoThrottle is that Rate limiting involves using fixed delays, whereas AutoThrottle automatically adjusts the delay based off several factors.

Another bonus fact in Scrapy is that you can actually use both AutoThrottle and the Rate limiting settings together to create a more complex crawler that’s both fast and undetectable.
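Combining the two might look like this in settings.py (the delay values are illustrative):

```python
# Fixed rate limiting: a base delay between requests to the same site
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay to look less robotic

# AutoThrottle: adjust the delay dynamically from observed latencies
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```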

Proxies and VPN’s

In cases where you need to send a large number of requests to a website, it's extremely suspicious if they all come from one IP address. If you're not careful, your IP will get banned pretty quickly.

The solution to this is the Rotating Proxies and VPN support that Scrapy offers. With this you can change things so that each request appears to have arrived from a different location. Using this is the closest you’ll get to completely masking the presence of your Web crawler.

XPath and CSS Selectors

XPath and CSS selectors are key to making Scrapy a complete web scraping library. These are advanced yet easy-to-use techniques for selecting exactly the HTML content you want from a web page.

XPath in particular is an extremely flexible way of navigating through the HTML structure of a web page. It’s more versatile than CSS selectors, being able to traverse both forward and backward.
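XPath-style navigation can be sketched with the standard library's (much more limited) XPath support; Scrapy's own selectors, built on lxml, handle full XPath and CSS:

```python
import xml.etree.ElementTree as ET

html = '<div class="quote"><span class="text">Hello</span></div>'
root = ET.fromstring(html)

# Attribute predicate, similar in spirit to Scrapy's response.xpath(...)
span = root.find(".//span[@class='text']")
print(span.text)  # Hello
```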

Debugging and Logging

Another one of Scrapy’s handy features is the inbuilt debugger and logger. Everything that happens, from the headers used, to the time taken for each page to download, the website latency etc is all printed out in the terminal and can be logged into a proper file. Any errors or potential issues that occur are also displayed.

Exception Handling

While web scraping on a large scale, you'll run into all kinds of server errors, missing pages, internet issues, etc. Scrapy, with its exception handling, allows you to gracefully handle each of these issues without breaking down. You can even pause your Scrapy spider and resume it at a later time.


Scrapy Code

Below is some example Scrapy code, selected from our various tutorials. Each example is accompanied by a brief description of its usage.

Data Extractor

This first Scrapy code example features a Spider that scans through the entire quotes.toscrape.com site, extracting each and every quote along with the author's name.

We’ve used the Rules class in order to ensure that the Spider scrapes only certain pages (to save time and avoid duplicate quotes) and added some custom settings, such as AutoThrottle.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  callback='parse_filter_book', follow=True)]

    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_DEBUG': True,
    }

    def parse_filter_book(self, response):
        for quote in response.css('div.quote'):
            yield {
                'Author': quote.xpath('.//span/a/@href').get(),
                'Quote': quote.xpath('.//span[@class="text"]/text()').get(),
            }

Link Follower

Another important feature that Scrapy has is link following which can be implemented in different ways. For instance the example above also had link following enabled through the Rules class.

In the below example however, we’re doing it in a unique way that allows us to visit every page on Wikipedia extracting the page names from every single one of them. In short, it’s a more controlled way of link following.

The below code will not actually scrape the entire site due to the DEPTH_LIMIT setting. We've done this simply to limit the Spider to Python-related topics and to keep the scraping time reasonable.

from scrapy.spiders import CrawlSpider

class SuperSpider(CrawlSpider):
    name = 'follower'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Web_scraping']
    base_url = 'https://en.wikipedia.org'

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        for next_page in response.xpath('.//div/p/a'):
            yield response.follow(next_page, self.parse)

        for quote in response.xpath('.//h1/text()'):
            yield {'quote': quote.extract() }

This section doesn’t really contribute much to the Scrapy vs BeautifulSoup debate, but it does help you get an idea on what Scrapy code is like.


Conclusion

If you're a beginner, I would recommend BeautifulSoup over Scrapy. It's just easier than Scrapy in almost every way, from its setup to its usage. Once you've gained some experience, the transition to Scrapy should become easier, as they have overlapping concepts.

For simple projects, BeautifulSoup will be more than enough. However, if you’re really serious about making a proper web crawler then you’ll have to use Scrapy.

Ultimately, you should learn both (while giving preference to Scrapy) and use either one of them depending on the situation.


This marks the end of the Scrapy vs BeautifulSoup article. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.

The post Scrapy vs BeautifulSoup | Python Web Crawlers appeared first on CodersLegacy.

October 11, 2020 07:06 AM UTC


Awesome Python Applications

Spack

Spack: Language-independent package manager for supercomputers, Mac, and Linux, designed for scientific computing.

October 11, 2020 12:06 AM UTC


ABlog for Sphinx

ABlog v0.10.11 released

Pull Requests merged in:

improving glob matching and documenting it from choldgraf.

October 11, 2020 12:00 AM UTC