Planet Python
Last update: March 29, 2023 09:42 PM UTC
March 29, 2023
Ben Cook
Understanding the Data Science Process for Entrepreneurs
As an entrepreneur looking to harness the power of machine learning (ML) in your business, understanding the data science process is crucial. This process can be broken down into three main steps:
- Proof of concept (evaluate technical feasibility)
- Minimum viable product (scale up dataset size)
- Deployment (run the algorithm in production)
The goal is to move through these stages as quickly as possible so that you can gather feedback from real-world users. The longer you spend “in the lab” perfecting your algorithm, the less likely you are to build something your customers actually care about.
In this blog post, we’ll dive into each step and explore how you can apply them to your business.
Proof of Concept (Evaluate Technical Feasibility)
The proof of concept (POC) stage is all about identifying the problem you want to solve and understanding its technical feasibility. At this stage, you’ll select appropriate ML algorithms and data sources to tackle the problem.
Once you’ve chosen an algorithm, conduct a small-scale experiment to test your solution. The goal here is to validate your idea, not to build a full-fledged product. Iterate and refine your POC based on your initial findings, and don’t be afraid to make changes if something isn’t working.
Minimum Viable Product (Scale Up Dataset Size)
Once you’ve successfully proven your concept, it’s time to move on to the minimum viable product (MVP) stage. The goal here is to scale up the size of the dataset to validate your solution on a larger scale.
A more diverse and representative dataset will help you improve your ML model’s performance. As the model performance improves, gather customer feedback on your MVP and use it to make data-driven improvements. The feedback you receive at this stage is invaluable for shaping your final product.
Deployment (Run the Algorithm in Production)
With a refined MVP in hand, you’re ready to deploy your ML model. The deployment stage involves integrating the model into your existing software infrastructure and ensuring it performs well and scales to meet the demands of real-world use.
Monitor your model’s performance closely and address any issues or concerns that arise. Continuously iterate on your deployed model based on customer feedback and changing needs to ensure your product remains relevant and effective.
The Importance of a Fast, Iterative Process
Throughout the data science process, customer feedback is vital for shaping your product. By keeping the process fast and iterative, you’ll maximize the value of this feedback and increase your chances of success.
Adapt and refine your ML model based on real-world experiences, and don’t hesitate to pivot if you find that your initial approach isn’t working as expected. Embrace an agile mindset, and you’ll be well on your way to making a meaningful impact with your ML project.
Conclusion
Understanding the data science process is essential for any entrepreneur looking to leverage machine learning in their business. Apply these principles to your own projects, and always remember to keep the process fast and iterative to get the most out of customer feedback.
The post Understanding the Data Science Process for Entrepreneurs appeared first on Sparrow Computing.
Real Python
Build a Maze Solver in Python Using Graphs
If you’re up for a little challenge and would like to take your programming skills to the next level, then you’ve come to the right place! In this hands-on tutorial, you’ll practice object-oriented programming, among several other good practices, while building a cool maze solver project in Python.
From reading a maze from a binary file, to visualizing it using scalable vector graphics (SVG), to finding the shortest path from the entrance to the exit, you’ll go step by step through the guided process of building a complete and working project.
In this tutorial, you’ll learn how to:
- Use an object-oriented approach to represent the maze in memory
- Define a specialized binary file format to store the maze on disk
- Transform the maze into a traversable weighted graph
- Use graph search algorithms in the NetworkX library to find the solution
- Visualize the maze and its solution using scalable vector graphics (SVG)
Click the link below to download the complete source code for this project, along with the supporting materials, which include a few sample mazes:
Free Download: Click here to download the source code and supporting materials that you’ll use to build a maze solver in Python.
Demo: Python Maze Solver
At the end of this tutorial, you’ll have a command-line maze solver that can load your maze from a binary file and show its solution in the web browser:
You’ll learn how to build your own mazes like this from scratch and save them on disk. In the meantime, feel free to grab one of the sample mazes from the supporting materials. Now, get ready to dive in!
Project Overview
Take a glimpse at the expected file structure of your project. Once finished, your project’s file and directory tree will look as follows:
maze-solver/
│
├── mazes/
│ ├── labyrinth.maze
│ ├── miniature.maze
│ └── pacman.maze
│
├── src/
│ │
│ └── maze_solver/
│ │
│ ├── graphs/
│ │ ├── __init__.py
│ │ ├── converter.py
│ │ └── solver.py
│ │
│ ├── models/
│ │ ├── __init__.py
│ │ ├── border.py
│ │ ├── edge.py
│ │ ├── maze.py
│ │ ├── role.py
│ │ ├── solution.py
│ │ └── square.py
│ │
│ ├── persistence/
│ │ ├── __init__.py
│ │ ├── file_format.py
│ │ └── serializer.py
│ │
│ ├── view/
│ │ ├── __init__.py
│ │ ├── decomposer.py
│ │ ├── primitives.py
│ │ └── renderer.py
│ │
│ ├── __init__.py
│ └── __main__.py
│
├── pyproject.toml
└── requirements.txt
Yes, that’s a lot of files, but don’t worry! Most of them are fairly short, and some contain only a few lines of code. This helps keep things organized and makes the individual pieces reusable, letting you compose them in new ways. Such granularity also plays an important role in Python projects with larger codebases by avoiding the notorious circular dependency error that you might encounter if various parts of the code were in one big file.
The mazes/ subfolder is home to a few binary files with sample data that you’re going to use in this tutorial. You can get these files, along with the final source code and snapshots of the individual steps, by downloading the supporting materials:
Free Download: Click here to download the source code and supporting materials that you’ll use to build a maze solver in Python.
The src/ subfolder contains your Python modules and packages for the maze solver project. The maze_solver package consists of several subpackages that group logically related code fragments, including:
- graphs: The traversal and conversion of the maze to a graph representation
- models: The building blocks of the maze and its solution
- persistence: A custom binary file format for persistent maze storage
- view: The visualization of the graph with scalable vector graphics
You’ll also find the special __main__.py file, which makes the enclosing package runnable so that you can execute it directly from the command line using Python’s -m option:
$ python -m maze_solver /path/to/sample.maze
When launched like this, the package reads the specified file with your maze. After solving the maze, it renders the solution into an SVG format embedded in a temporary HTML file. The file gets automatically opened in your default web browser. You can also run the same Python code using a shortcut command:
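As a rough sketch (not the tutorial's actual code, which you'll build step by step), a __main__.py along these lines could provide that behavior; the maze loading, solving, and rendering calls below are placeholders for the project's own modules:

import argparse
import tempfile
import webbrowser
from pathlib import Path

def main() -> None:
    parser = argparse.ArgumentParser(prog="maze_solver")
    parser.add_argument("path", type=Path, help="path to a .maze file")
    args = parser.parse_args()

    # Placeholders for the project's own modules, e.g.:
    #   maze = Maze.load(args.path)
    #   solution = solve(maze)
    #   svg = render(maze, solution)
    svg = "<svg xmlns='http://www.w3.org/2000/svg'></svg>"

    # Embed the SVG in a temporary HTML file and open it in the browser.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".html", delete=False) as file:
        file.write(f"<html><body>{svg}</body></html>")

    webbrowser.open(f"file://{file.name}")

if __name__ == "__main__":
    main()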
$ solve /path/to/sample.maze
It’ll work as long as the solve command isn’t already taken or aliased by another program.
Finally, pyproject.toml provides your project’s configuration, metadata, and dependencies defined in the TOML format. The project only depends on one external library, NetworkX, which you’ll use to find the shortest path in the maze represented as a graph.
Next up, you’ll review a list of relevant resources that might become your savior in case you get stuck at any point. Also, remember the supporting materials, which contain a snapshot of each finished step. Along the way, you can compare your progress to the relevant step to ensure that you’re on the right track.
Read the full article at https://realpython.com/python-maze-solver/ »
Python Morsels
Using "any" and "all" in Python
Need to check whether all items in a list match a certain condition? You can use Python's built-in any and all functions for that!
Table of contents
- Checking a condition for all items
- Using the any and all functions in Python
- Python's any function
- Python's all function
- Let's try using a list comprehension
- Using a generator expression
- Choosing between any and all
- Using any and all in an if statement
- You might just need a containment check
- Cheat sheet: Python's any and all
- Check whether all items match a condition with the any and all functions
Checking a condition for all items
This function accepts a list (or iterable) of numbers and checks whether all the numbers are between 0 and 5 (inclusive):
def ratings_valid(ratings):
    for rating in ratings:
        if rating < 0 or rating > 5:
            return False
    return True
This code:
- Loops over all given numbers (using a for loop)
- Returns False as soon as a negative number or a number greater than 5 is found (using an if statement to check that condition)
- Returns True if all numbers are between 0 and 5
Note that this function returns as soon as it finds an invalid number, so it only iterates all the way through the numbers if all of them are valid.
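For reference (my own sketch, ahead of the article's own walkthrough), the same check can be expressed with the built-in all function and a generator expression:

def ratings_valid(ratings):
    # all() short-circuits the same way the explicit loop does:
    # it stops at the first rating that falls outside the range.
    return all(0 <= rating <= 5 for rating in ratings)

print(ratings_valid([3, 4, 5]))   # True
print(ratings_valid([3, -1, 5]))  # False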
Using the any and all functions in Python
Let's first look at a …
Read the full article: https://www.pythonmorsels.com/any-and-all/
Stack Abuse
Parsing URLs with Python
Introduction
URLs are, no doubt, an important part of the internet, as they allow us to access resources and navigate websites. If the internet were one giant graph (which it is), URLs would be the edges.
We parse URLs when we need to break down a URL into its components, such as the scheme, domain, path, and query parameters. We do this to extract information, manipulate them, or maybe to construct new URLs. This technique is essential for a lot of different web development tasks, like web scraping, integrating with an API, or general app development.
In this short tutorial, we'll explore how to parse URLs using Python.
Note: Throughout this tutorial we'll be using Python 3.x, as that is when the urllib.parse library became available.
URL Parsing in Python
Lucky for us, Python offers powerful built-in libraries for URL parsing, allowing you to easily break down URLs into components and reconstruct them. The urllib.parse library, which is part of the larger urllib module, provides a set of functions that help you to deconstruct URLs into their individual components.
To parse a URL in Python, we'll first import the urllib.parse library and use the urlparse() function:
from urllib.parse import urlparse
url = "https://example.com/path/to/resource?query=example&lang=en"
parsed_url = urlparse(url)
The parsed_url object now contains the individual components of the URL:
- Scheme: https
- Domain: example.com
- Path: /path/to/resource
- Query parameters: query=example&lang=en
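For example, continuing from the snippet above, you can read each piece off the result as an attribute (these attribute names come straight from urllib.parse):

print(parsed_url.scheme)  # https
print(parsed_url.netloc)  # example.com
print(parsed_url.path)    # /path/to/resource
print(parsed_url.query)   # query=example&lang=en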
To further process the query parameters, you can use the parse_qs function from the urllib.parse library:
from urllib.parse import parse_qs
query_parameters = parse_qs(parsed_url.query)
print("Parsed query parameters:", query_parameters)
The output would be:
Parsed query parameters: {'query': ['example'], 'lang': ['en']}
With this simple method, you have successfully parsed the URL and its components using Python's built-in urllib.parse library! Using this, you can better handle and manipulate URLs in your web development projects.
Best Practices for URL Parsing
Validating URLs: It's essential to ensure URLs are valid and properly formatted before parsing and manipulating them to prevent errors. You can use Python's built-in urllib.parse library or other third-party libraries like validators to check the validity of a URL.
Here's an example using the validators library:
import validators
url = "https://example.com/path/to/resource?query=example&lang=en"
if validators.url(url):
print("URL is valid")
else:
print("URL is invalid")
By validating URLs before parsing or using them, you can avoid issues related to working with improperly formatted URLs and ensure that your application is more stable and less prone to errors or crashes.
Properly Handling Special Characters: URLs often contain special characters that need to be properly encoded or decoded to ensure accurate parsing and processing. These special characters, such as spaces or non-ASCII characters, must be encoded using the percent-encoding format (e.g., %20 for a space) to be safely included in a URL. When parsing and manipulating URLs, it is essential to handle these special characters appropriately to avoid errors or unexpected behavior.
The urllib.parse library offers functions like quote() and unquote() to handle the encoding and decoding of special characters. Here's an example of these in use:
from urllib.parse import quote, unquote
url = "https://example.com/path/to/resource with spaces?query=example&lang=en"
# Encoding the URL
encoded_url = quote(url, safe=':/?&=')
print("Encoded URL:", encoded_url)
# Decoding the URL
decoded_url = unquote(encoded_url)
print("Decoded URL:", decoded_url)
This code will output:
Encoded URL: https://example.com/path/to/resource%20with%20spaces?query=example&lang=en
Decoded URL: https://example.com/path/to/resource with spaces?query=example&lang=en
It's always good practice to handle special characters in URLs so that you can ensure that your parsing and manipulation code remains error-free.
Conclusion
Parsing URLs with Python is an essential skill for web developers and programmers, enabling them to extract, manipulate, and analyze URLs with ease. By utilizing Python's built-in libraries, such as urllib.parse, you can efficiently break down URLs into their components and perform various operations, such as extracting information, normalizing URLs, or modifying them for specific purposes.
Additionally, following best practices like validating URLs and handling special characters ensures that your parsing and manipulation tasks are accurate and reliable.
Brett Cannon
MVPy: Minimum Viable Python
Over 32 posts spanning well over 2 years, this is the final post in my blog series on Python's syntactic sugar. I had set out to find all of the Python 3.8 syntax that could be rewritten if you were to run a tool over a single Python source file in isolation and still end up with reasonably similar semantics (i.e. no whole-program analysis, globals() having different keys was okay, don't care about performance). Surprisingly, it turns out to be easier to list what syntax you can't rewrite than to reiterate all the syntax that you can!
- Integers (as the base for other literals like bytes)
- Function calls
- =
- Function definitions
- nonlocal
- return
- yield
- try/except
- while
All other syntax can devolve to this core set of syntax. I call this subset of syntax the Minimum Viable Python (MVPy) you need to make Python function as a whole. If you can implement this subset of the language, then you can do a syntactic translation to support the rest of Python's syntax (although admittedly it might be a bit faster if you directly implemented all the syntax 😉).
If you look at what syntax is left, it pretty much aligns to what is required to implement a Turing machine:
- Read/write data (= and integers)
- Make decisions about data (while and try)
- Do things to that data (everything involving defining and using functions)
You might not be as productive in this subset of the language as you would be with all the syntax available in Python 3.8 (and later), but you should still be able to accomplish the same things given enough time and patience.
Addendum
Since the initial publication of this post on 2022-08-14, I was able to unravel even more syntax than I initially thought. This post has been updated to reflect those later realizations.
Unravelling `del`
In my post on unravelling the global statement, I mentioned how after my PyCascades 2023 talk some people came up to me about a couple of pieces of Python syntax that I had not managed to unravel. Beyond global, people thought I should be able to get rid of the del statement. Turns out that a bit of extra work I did in my global statement post lets me unravel del as well!
In my global statement post I talk about the (roughly) 3 namespaces that Python has:
- Local
- Global
- Built-in
In order to support del, I need to figure out how to delete names from the local and global namespaces (you will trigger a NameError if you try to delete something in the built-in namespace).
Deleting from the local namespace
To delete a name in the local namespace, you need to make sure that if someone tries to use that same name later on, it causes an UnboundLocalError to be raised. Since there's no way to directly manipulate the local namespace, we can assign a marker to tell us that a local name has been "deleted" (this also allows for garbage collection as expected). So del A can become _DELETED = object(); A = _DELETED. But we also have to detect if A is used after deletion to raise UnboundLocalError as appropriate.
_DELETED = object()
# `del A`
A = _DELETED
# Referencing `A`
if A is _DELETED:
    raise UnboundLocalError("cannot access local variable 'A' where it is not associated with a value")

Unravelling del A along with accessing the name later on

You might be wondering why we're checking for our _DELETED marker when we can just look at the code, see that del A was called, and treat the name A as (seemingly) useless going forward. Well, think about what happens if A is deleted inside an if block. That would mean A could be deleted, but not necessarily deleted unconditionally.
Because local names can be used in expressions, we do have to unravel uses of potentially deleted names, just like assignment expressions had to in order to make new names appear appropriately. It's a bit tedious, but it does mean Python's rules around evaluation are upheld while still allowing del to operate appropriately.
Deleting from the global namespace
In the global unravelling post, I talked about how to identify when a name was global or built-in compared to a local name. That's important here because deleting a global name is like deleting a key from a dictionary, thanks to the globals() function: globals().__delitem__("A"). And if a name happens not to be in that global dictionary, then NameError should be raised, since you can't directly delete something from the built-in namespace.
try:
    getattr(globals(), "__delitem__")("A")
except KeyError:
    raise NameError("name 'A' is not defined")

Unravelling del A for a global name
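As a quick sanity check of the expected behavior (my own module-level example, not from the post), the built-in del statement raises the same error when a global name is already gone:

A = 1
del A        # removes the global name

try:
    del A    # deleting it a second time raises NameError
except NameError as error:
    print(error)  # name 'A' is not defined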
March 28, 2023
The Python Coding Blog
What’s a Python Iterable? [Python Data Structure Series #1]
You’re familiar with data structures such as lists, dictionaries, tuples, sets, and more. You may even know about the similarities and differences between their behaviours. But how comfortable are you with the terms used for the groups they belong to, like iterables, sequences, collections, and containers? In this series, we’ll dive into these categories of data structures, starting with Python iterables.
Why bother? Isn’t it sufficient to know about the data structures themselves, like lists and dictionaries, instead of the abstract groups they belong to, such as sequences and mappings? Initially, yes. However, as you become more proficient in using data structures, understanding the key principles for each group will help you select the right data structures to solve your problems and use them more efficiently.
Overview of the Python Data Structure Series
Here’s an overview of the seven articles in this series:
- Iterable: The structure you can loop through
- Sequence: The structure where one item follows another
- Mapping: The structure where each value has a label
- Container: The structure which contains items
- Collection: The structure that’s an iterable container and has a length
- Iterator: The structure which streams through another structure
- Generator: The structure which doesn’t hold any of its data
If you’re puzzled by these one-line descriptions, get a cup of coffee or a pot of tea, sit down, relax, and get ready for the journey through these seven articles.
What’s a Python Iterable?
The simple definition of a Python iterable is a data type you can use in a for loop. If you have an object and you want to know whether it’s iterable, you can place it at the end of a for statement. If you don’t get an error, then the object is iterable!
Let’s try several common data types to see if they’re iterable.
Lists
>>> for item in [3, 4, 5]:
...     print(item)
...
3
4
5
Yes, lists are iterable. There’s no error message when you use a list in a for loop.
Integers
Let’s check whether integers are iterable:
>>> for item in 42:
...     print(item)
...
Traceback (most recent call last):
  ...
TypeError: 'int' object is not iterable
No, they’re not. You get a TypeError when you try to loop through an integer. You cannot iterate over an integer. It’s not iterable.
Strings
Strings, along with integers, are probably the first data types you’ve ever come across in Python. Let’s check whether you can iterate through a string:
>>> for item in "Stephen":
...     print(item)
...
S
t
e
p
h
e
n
Yes, you can. The for loop goes through each character in the string, and each one-letter string is assigned to the variable item in each iteration. Strings are iterable.
Dictionaries
Next, it’s the turn of the dictionary. Let’s try the infallible for loop test to check whether this data type is iterable:
>>> for item in {"Name": "Stephen", "Favourite colour": None}:
...     print(item)
...
Name
Favourite colour
The answer is yes, again. However, you’ll notice a different behaviour when looping through a dictionary compared to other iterables you’ve seen so far. The for loop iterates through the dictionary’s keys and not key-value pairs.
item is equal to "Name" in the first iteration and "Favourite colour" in the second. The values are not available in the loop.
You can use one of the dictionary methods, .items(), to loop through the key-value pairs instead. You can read more about dictionaries and different ways of looping through them in Chapter 4 in The Python Coding Book about Data, Data Types, and Data Structures.
Still, you can loop through a dictionary using a for loop without getting an error. Therefore, a dictionary is iterable.
A Bit More Detail About Python Iterables
Let’s look at a more detailed definition of a Python iterable. An iterable is a Python object which can return its elements one at a time.
Later in this series on Python data structures, you’ll learn more about another type called an iterator. Don’t let the similarity in the name confuse you. Iterables and iterators are related, but they’re different data structures.
You’ll read about iterators in more detail later in the series. For now, we can say that you can always create an iterator from an iterable.
Here’s another way you can check whether an object is iterable. You can try to convert it to an iterator using the built-in function iter(). If you don’t get an error, the object is iterable:
>>> iter([3, 4, 5])
<list_iterator object at 0x13d66ec20>
>>> iter(42)
Traceback (most recent call last):
...
TypeError: 'int' object is not iterable
>>> iter("Stephen")
<str_ascii_iterator object at 0x13d66e620>
>>> iter({"Name": "Stephen", "Favourite colour": None})
<dict_keyiterator object at 0x13d6ab7e0>
You can ignore what’s actually printed out when you display the value returned by iter(), although you can see the word “iterator” for three of the four outputs. The remaining one raises an error.
These outputs confirm the results you found earlier when using the for loop. Lists, strings, and dictionaries are iterable since iter() returns a value for these data types. However, when you use 42 as an argument for iter(), you get a TypeError. This data type cannot be used in iter() because it’s not iterable.
This result is not a coincidence! A for loop converts the iterable into an iterator behind the scenes.
When is a data type iterable?
A class will create an iterable object when it has either __iter__() or __getitem__() defined as one of its special methods. Special methods are also informally called dunder methods because they have double underscores at the start and end of their names. You can read more about special methods and defining classes in Chapter 7 of The Python Coding Book on Object-Oriented Programming.
Defining __iter__() to make a class iterable is the newer and preferred method.
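For instance, here's a minimal example (my own, not from the article) of a class made iterable by defining __iter__():

class Countdown:
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Implementing __iter__() as a generator keeps the class short:
        # each yield hands back the next item of the iteration.
        value = self.start
        while value > 0:
            yield value
            value -= 1

for number in Countdown(3):
    print(number)  # prints 3, then 2, then 1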
Etymology Corner
You won’t find "iterable" in most English dictionaries. The closest English words are "iterate" and "iteration", which we also use in programming. Incidentally, we use "iterable" as both a noun and an adjective. So ‘a list is an iterable’ and ‘a list is iterable’ are both valid.
The word "iterable" comes from the Latin iterare, which means "to repeat".
More interestingly, the Latin iter means "a journey" or "a path". So, when you iterate, you’re going on a journey through the object, forging a path through its elements.
And, since you ask, the plural of iter is itinera, from which we get the English word "itinerary".
In English, words ending in "-able" are usually adjectives showing the ability to do something. For example, something is:
- "readable" if it can be read — it’s "able" to be read
- "knowable" if it can be known
- "manageable" if it can be managed
and therefore:
- "iterable" if it can be iterated
Final Words
In summary, an iterable is an object you can iterate through. You can access each item one after another in an iterable.
Now it’s your turn. Try to think of as many data types as possible, and try them out either in a for loop or as arguments in iter() to see which ones are iterable.
Next Article: What’s a Python Sequence?
The post What’s a Python Iterable? [Python Data Structure Series #1] appeared first on The Python Coding Book.
PyCoder’s Weekly
Issue #570 (March 28, 2023)
#570 – MARCH 28, 2023
View in Browser »
Lessons Learned From Four Years Programming With Python
What are the core lessons you’ve learned along your Python development journey? What are key takeaways you would share with new users of the language? This week on the show, Duarte Oliveira e Carmo is here to discuss his recent talk, “Four Years of Python.”
REAL PYTHON podcast
Data Modeling, Parsing and Validation Using Pydantic
Pydantic is a Python library that provides data validation and settings management using Python type annotations. It allows developers to define a schema for their data, which includes the expected data types, default values, and validation rules.
SAMEER SHUKLA • Shared by Sameer Shukla
ChatGPT Outage: Here’s What Happened
On March 20th ChatGPT had an outage. It was caused by an asyncio redis-py client bug and also resulted in a data leak. Read more for details.
OPENAI.COM
Snyk Top 10: Python OSS Vulnerabilities Cheat Sheet
Deep dive into the most prevalent critical and high open source vulnerabilities found by Snyk scans of Python apps in 2022. Learn more about how these high-risk vulnerabilities might be impacting open source packages you are using today and how to fix them →
SNYK.IO sponsor
Articles & Tutorials
Ban 1+N in Django
The 1+N database anti-pattern is common: fetch some rows from the database then re-fetch specific rows to get all the items. An ORM can hide this away and make you not realize it is happening. This article talks about how to stop it in Django. With added meta-bonus: he links to how he attempted to write the article with ChatGPT.
ALEXANDER SCHEPANOVSKI
Deep Neural Nets: 33 Years Ago and 33 Years From Now
This article examines the original paper that proposed back propagation neural nets and relates what has changed and what is the same. Using that knowledge, it looks forward to what neural nets may be able to do decades from now. Includes accompanying code samples.
ANDREJ KARPATHY
The Best Way to Structure Your NoSQL Data Using Python
Data modeling can be challenging. The question that most often comes up is, “How do I structure my data?” The short answer: it depends. That’s why the Redis folks wrote a comprehensive e-book that goes through 8 different optimal scenarios and shows how to model them in Redis →
REDIS LABS sponsor
No-async async With Python
“A (reasonable) criticism of async is that it tends to proliferate in your code. In order to await something, your functions must be async all the way up the call-stack. Textual is an async framework, but doesn’t require the app developer to use the async.” Learn how Textual accomplishes async-agnosticism.
WILL MCGUGAN
reduce(): The Power of a Single Python Function
“While Python is not a pure functional programming language, you still can do a lot of functional programming in it. In fact, just one function - reduce() - can do most of it.” This article introduces you to reduce().
MARTIN HEINZ
When Should You Use .__repr__() vs .__str__() in Python?
In this tutorial, you’ll learn the difference between the string representations returned by .__repr__() vs .__str__() and understand how to use them effectively in classes that you define.
REAL PYTHON
How to Control Crowds with Python, OpenCV and InfluxDB
In this quick training, learn how to build a face recognition application using open source tools like OpenCV (Open Source Computer Vision Library) and InfluxDB time series platform. Github repository included.
INFLUXDATA sponsor
Use TOML for .env Files?
Using .env files to specify configuration environments can be handy, but problematic when it comes to multiple platforms. Some toolsets are starting to explore the use of TOML instead.
BRETT CANNON
Marketing for Developers
A few simple steps can make all the difference in whether your project gets noticed. This article is about Django projects, but most of the advice applies across all code bases.
ADAM HILL
Run a Flask Server Inside a Readonly Docker Container
Learn how to run a Python server inside a read-only Docker container and how to pre-bundle the SCSS and JS files in a separate step.
JON JAGGER
Apify Python SDK: Build and Manage Web Scraping Solutions in the Cloud
Build scrapers in the cloud and rely on the Apify platform for data storage, scheduling runs, and proxies.
APIFY sponsor
VS Code Shortcuts for Efficient Python Programmers
Learn keyboard shortcuts that will make you a more efficient and productive Python programmer with VS Code.
RODRIGO GIRÃO SERRÃO
Generate Images Using OpenAI and DALL·E 2
Learn how to use Python to interface with OpenAI’s API to do image generation.
IDOWU OMISOLA
Projects & Code
Events
Heidelberg Python Meetup
March 29, 2023
MEETUP.COM
PyStaDa
March 29, 2023
PYSTADA.GITHUB.IO
Weekly Real Python Office Hours Q&A (Virtual)
March 29, 2023
REALPYTHON.COM
SPb Python Drinkup
March 30, 2023
MEETUP.COM
PyTexas 2023
April 1 to April 3, 2023
PYTEXAS.ORG
Melbourne Python Users Group, Australia
April 3, 2023
J.MP
Happy Pythoning!
This was PyCoder’s Weekly Issue #570.
View in Browser »
death and gravity
Limiting concurrency in Python asyncio: the story of async imap_unordered()
So, you're doing some async stuff, repeatedly, many times.
Like, hundreds of thousands of times.
Maybe you're scraping some data.
Maybe it's more complicated – you're calling an API, and then passing the result to another one, and then saving the result of that.
Either way, it's a good idea to not do it all at once. For one, it's not polite to the services you're calling. For another, it'll load everything in memory, all at once.
In sync code, you might use a thread pool and imap_unordered():
pool = multiprocessing.dummy.Pool(2)
for result in pool.imap_unordered(do_stuff, things_to_do):
print(result)
Here, concurrency is limited by the fixed number of threads.
But what about async code? In this article, we'll look at a few ways of limiting concurrency in asyncio, and find out which one is best.
Tip
No, it's not Semaphore, despite what Stack Overflow may tell you.
Tip
If you're in a hurry – it's wait().
- Getting started
- asyncio.gather()
- asyncio.Semaphore
- asyncio.as_completed()
- asyncio.Queue
- Aside: backpressure
- asyncio.wait()
- Async iterables
- Bonus: exceptions
- Bonus: better decorators?
Getting started #
In order to try things out more easily, we'll start with a test harness of sorts.
Our async map_unordered() behaves pretty much like imap_unordered()
– it takes a coroutine function and an iterable of arguments,
and runs the resulting awaitables limit at a time:
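The original listings aren't reproduced in this excerpt, so here is a rough reconstruction of what map_unordered() could look like (names and details are assumptions):

def map_unordered(func, iterable, limit):
    # Build the awaitables lazily and hand them off to limit_concurrency().
    aws = map(func, iterable)
    return limit_concurrency(aws, limit)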
The actual running is done by limit_concurrency().
For now, we run them one by one
(we'll get back to this later on):
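A placeholder sketch (again a reconstruction): an async generator that ignores the limit and simply awaits each awaitable in turn:

async def limit_concurrency(aws, limit):
    # Naive version: run the awaitables one by one, yielding each result.
    for aw in aws:
        yield await aw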
To simulate work being done, we just sleep():
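A do_stuff() coroutine along these lines would do; the division is my assumption, added so that a zero delay fails (which the article uses later to simulate errors):

import asyncio

async def do_stuff(delay):
    # Fail fast on a zero delay, otherwise pretend to work by sleeping,
    # then report how long we "worked".
    1 / delay
    await asyncio.sleep(delay)
    return delay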
Putting it all together,
we get a map_unordered.py LIMIT TIME... script that does stuff in parallel,
printing timings as we get each result:
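A small driver (reconstructed to match the output below, and meant to sit in the same script as the fragments above) ties the pieces together:

import sys
import time

async def main():
    limit = int(sys.argv[1])
    times = [float(arg) for arg in sys.argv[2:]]
    start = time.monotonic()
    # Print the elapsed time and the result as each awaitable finishes.
    async for result in map_unordered(do_stuff, times, limit):
        print(f"{time.monotonic() - start:.1f}: {result}")
    print(f"{time.monotonic() - start:.1f}: done")

if __name__ == "__main__":
    asyncio.run(main())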
... like so:
$ python map_unordered.py 2
0.0: done
$ python map_unordered.py 2 .1 .2
0.1: 0.1
0.3: 0.2
0.3: done
Tip
If you need a refresher on lower level asyncio stuff related to waiting, check out Hynek Schlawack's excellent Waiting in asyncio.
asyncio.gather() #
In the Running Tasks Concurrently section of the asyncio docs, we find asyncio.gather(), which runs awaitables concurrently and returns their results.
We can use it to run limit-sized batches:
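A sketch of the batched approach (my reconstruction, matching the behavior described):

import asyncio
from itertools import islice

async def limit_concurrency(aws, limit):
    aws = iter(aws)
    while True:
        # Take the next `limit` awaitables and run them as one batch.
        batch = list(islice(aws, limit))
        if not batch:
            break
        for result in await asyncio.gather(*batch):
            yield result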
This seems to work:
$ python map_unordered.py 2 .1 .2
0.2: 0.1
0.2: 0.2
0.2: done
... except:
$ python map_unordered.py 2 .1 .2 .2 .1
0.2: 0.1
0.2: 0.2
0.4: 0.2
0.4: 0.1
0.4: done
... those should fit in 0.3 seconds:
| sleep(.1) | sleep(.2) |
| sleep(.2) | sleep(.1) |
... but we're waiting for the entire batch to finish, even if some tasks finish earlier:
| sleep(.1) |...........| sleep(.2) |
| sleep(.2) | sleep(.1) |...........|
asyncio.Semaphore #
Screw the docs, too much to read; after some googling, the first few Stack Overflow answers all point to asyncio.Semaphore.
Like its threading counterpart,
we can use it to limit
how many times the body of a with block is entered in parallel:
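A sketch of the Semaphore-based version (again a reconstruction): each awaitable is wrapped so that at most limit of them run at once, and everything is handed to gather():

import asyncio

async def limit_concurrency(aws, limit):
    semaphore = asyncio.Semaphore(limit)

    async def limited(aw):
        # Only `limit` of these bodies can be running at any given moment.
        async with semaphore:
            return await aw

    for result in await asyncio.gather(*(limited(aw) for aw in aws)):
        yield result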
This works:
$ python map_unordered.py 2 .1 .2 .2 .1
0.3: 0.1
0.3: 0.2
0.3: 0.2
0.3: 0.1
0.3: done
... except, because gather() takes a sequence,
we end up consuming the entire aws iterable
before gather() is even called.
Let's highlight this:
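One way to produce that highlight (a reconstruction; this is where the "iter end" lines in the outputs below come from) is to wrap the input iterable so it reports when it has been fully consumed:

def make_iterable(times):
    # Yield each value, then announce that the iterable has been exhausted.
    yield from times
    print("iter end")

You'd then pass make_iterable(times) instead of times to map_unordered().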
As expected:
$ python map_unordered.py 2 .1 .2 .2 .1
iter end
0.3: 0.1
0.3: 0.2
0.3: 0.2
0.3: 0.1
0.3: done
For small iterables, this is fine, but for bigger ones, creating all the tasks upfront without running them might cause memory issues. Also, if the iterable is lazy (e.g. it comes from a paginated API), we only start work after it's all consumed in memory, instead of processing it in a streaming fashion.
asyncio.as_completed() #
At a glance, asyncio.as_completed() might do what we need – it takes an iterable of awaitables, runs them concurrently, and returns an iterator of coroutines that "can be awaited to get the earliest next result from the iterable of the remaining awaitables".
Sadly, it still consumes the iterable right away:
def as_completed(fs, *, timeout=None):
... # set-up
todo = {ensure_future(f, loop=loop) for f in set(fs)}
... # actual logic
But there's another, subtler issue.
as_completed() has no limits of its own – it's up to us to limit how fast we feed it awaitables. Presumably, we could wrap the input iterable into a generator that yields awaitables only if enough results came out the other end, and waits otherwise.
However, due to historical reasons,
as_completed() takes a plain-old-sync-iterator
– we cannot await anything in its (sync) __next__(),
and sync waiting of any kind would block (and possibly deadlock)
the entire event loop.
So, no as_completed() for you.
asyncio.Queue #
Speaking of threading counterparts, how would you implement imap_unordered() if there was no Pool? Queues, of course!
And asyncio has its own Queue,
which you use in pretty much the same way:
start limit worker tasks that loop forever,
each pulling awaitables, awaiting them,
and putting the results into a queue.
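Here's one way such a Queue-based version might look (a reconstruction; in particular, pushing caught exceptions onto the queue is a simplification of the error handling discussed below):

import asyncio

async def limit_concurrency(aws, limit):
    aws = iter(aws)
    queue = asyncio.Queue()
    done_sentinel = object()

    async def worker():
        for aw in aws:
            try:
                result = await aw
            except Exception as e:
                # Catch everything so a failing awaitable doesn't kill the worker.
                result = e
            await queue.put(result)
        # Signal in-band that this worker has finished.
        await queue.put(done_sentinel)

    workers = [asyncio.create_task(worker()) for _ in range(limit)]

    finished = 0
    while finished < limit:
        item = await queue.get()
        if item is done_sentinel:
            finished += 1
        else:
            yield item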
The iterable is exhausted before the last "batch" starts:
$ python map_unordered.py 2 .1 .2 .3 .3 .2 .1
0.1: 0.1
0.2: 0.2
0.4: 0.3
0.5: 0.3
iter end
0.6: 0.1
0.6: 0.2
0.6: done
I was going to work up to this in a few steps, but I'll just point out three common bugs this type of code might have (that apply to threads too).
First, we could increment the count of finished workers from the worker itself,
but this makes await queue.get() hang forever for empty iterables,
since workers never get to run by the time we get to it;
because there's no other await, it's not even a race condition.
$ python map_unordered.py 2
iter end
Traceback (most recent call last):
...
asyncio.exceptions.TimeoutError
The solution is to signal the worker is done in-band, by putting a sentinel on the queue. I guess a good rule of thumb is that you want a put() for each get() without a timeout.1
Second, you have to catch all exceptions; otherwise, the worker gets killed, and get() waits forever for a sentinel that will never come.
$ python map_unordered.py 2 .1 .2 0 .2 .1
0.1: 0.1
0.2: 0.2
0.4: 0.2
iter end
0.5: 0.1
Traceback (most recent call last):
...
asyncio.exceptions.TimeoutError
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<limit_concurrency.<locals>.worker() done, defined at map_unordered.py:20> exception=ZeroDivisionError('float division by zero')>
Traceback (most recent call last):
...
ZeroDivisionError: float division by zero
Finally, our input iterator is synchronous (for now),
so no other task can run during next(aws).
But if it were async,
any number of tasks could await anext(aws) in parallel,
leading to concurrency issues.
The fix is the same as with threads:
either protect that call with a Lock,
or feed awaitables to workers through an input queue.
Anyway, no need to worry about any of that – a better solution awaits.
Aside: backpressure #
At this point, we're technically done – the queue solution does everything Pool.imap_unordered() does.
So much so, that, like imap_unordered(),
it lacks backpressure:
when code consuming results from map_unordered()
cannot keep up with the tasks producing them,
the results accumulate in the internal queue,
with potentially infinite memory usage.
>>> pool = multiprocessing.dummy.Pool(1)
>>> for result in pool.imap_unordered(print, range(4)):
... time.sleep(.1)
... print('got result')
...
0
1
2
3
got result
got result
got result
got result
>>> async def async_print(arg):
... print('in async_print', arg)
... return arg
...
>>> async for result in map_unordered(async_print, range(4), 1):
... await asyncio.sleep(.1)
... print('got result')
...
0
1
2
3
got result
got result
got result
got result
To fix this, we make the queue bounded, so that workers block while the queue is full.
>>> async for result in map_unordered(async_print, range(5), 1):
... await asyncio.sleep(.1)
... print('got result')
...
0
1
2
got result
3
got result
4
got result
got result
got result
Alas, we can't do the same thing for Pool.imap_unordered() because we don't have access to its queue, but that's a story for another time.
asyncio.wait() #
Pretending we're using threads works, but it's not all that idiomatic.
If only there was some sort of low level, select()-like primitive taking a set of tasks and blocking until at least one of them finishes. And of course there is – we've been purposefully avoiding it this entire time – it's asyncio.wait(), and it does exactly that.
By default, it waits until all tasks are completed, which isn't much better than gather().
But, with return_when=FIRST_COMPLETED,
it waits until at least one task is completed.
We can use this to keep a limit-sized set of running tasks
updated with new tasks as soon as the old ones finish:
We change limit_concurrency() to yield awaitables instead of results,
so it's more symmetric – awaitables in, awaitables out.
map_unordered() then becomes an async generator function,
instead of a sync function returning an async generator.
This is functionally the same,
but does make it a bit more self-documenting.
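Reconstructed, the wait()-based limit_concurrency() could look like this, yielding the finished tasks themselves, as described above:

import asyncio

async def limit_concurrency(aws, limit):
    aws = iter(aws)
    aws_ended = False
    pending = set()

    while pending or not aws_ended:
        # Top up the set of running tasks until we hit the limit.
        while len(pending) < limit and not aws_ended:
            try:
                aw = next(aws)
            except StopIteration:
                aws_ended = True
            else:
                pending.add(asyncio.ensure_future(aw))

        if not pending:
            return

        # Wait until at least one running task finishes, then yield it.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            yield task

async def map_unordered(func, iterable, limit):
    # map_unordered() is itself an async generator now: await each finished
    # task as limit_concurrency() hands it back, and yield the result.
    async for task in limit_concurrency(map(func, iterable), limit):
        yield await task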
This implementation has all the properties that the Queue one has:
$ python map_unordered.py 2 .1 .2 .2 .1
0.1: 0.1
0.2: 0.2
0.3: 0.1
0.3: 0.2
iter end
0.3: done
... and backpressure too:
>>> async for result in map_unordered(async_print, range(4), 1):
... await asyncio.sleep(.1)
... print('got result')
...
0
got result
1
got result
2
got result
3
got result
Async iterables #
OK, but what if we pass map_unordered()
an asynchronous iterable?
We are talking about async stuff, after all.
This opens up a whole looking-glass world of async iteration: instead of iter() you have aiter(), instead of next() you have anext(), some of them you await, some you don't... Thankfully, we can support both without making things much worse.
And we don't need to be particularly smart about it either;
we can just feed the current code an async iterable from main(),
and punch our way through the exceptions:
$ python map_unordered.py 2 .1 .2 .2 .1
Traceback (most recent call last):
...
File "map_unordered.py", line 9, in map_unordered
aws = map(func, iterable)
TypeError: 'async_generator' object is not iterable
map() doesn't work with async iterables, so we use a generator expression instead.
In true easier to ask for forgiveness than permission style,
we handle the exception from map() instead of, say,
checking if aws is an instance of collections.abc.Iterable.
We could wrap aws to always be an async iterable,
but limit_concurrency() is useful on it its own,
so it's better to support both.
$ python map_unordered.py 2 .1 .2 .2 .1
Traceback (most recent call last):
...
File "map_unordered.py", line 19, in limit_concurrency
aws = iter(aws)
TypeError: 'async_generator' object is not iterable
For async iterables, we need to use aiter():
$ python map_unordered.py 2 .1 .2 .2 .1
Traceback (most recent call last):
...
File "map_unordered.py", line 32, in limit_concurrency
aw = next(aws)
TypeError: 'async_generator' object is not an iterator
... and anext():
... which unlike aiter(), has to be awaited.
Here's limit_concurrency() in all its glory:
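A reconstruction of what the final version might look like, supporting both sync and async iterables of awaitables (aiter() and anext() as built-ins need Python 3.10 or later):

import asyncio

async def limit_concurrency(aws, limit):
    try:
        aws = aiter(aws)
        is_async = True
    except TypeError:
        aws = iter(aws)
        is_async = False

    aws_ended = False
    pending = set()

    while pending or not aws_ended:
        # Top up the set of running tasks until we hit the limit.
        while len(pending) < limit and not aws_ended:
            try:
                aw = await anext(aws) if is_async else next(aws)
            except (StopIteration, StopAsyncIteration):
                aws_ended = True
            else:
                pending.add(asyncio.ensure_future(aw))

        if not pending:
            return

        # Yield each task as soon as it completes.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            yield task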
Not as clean as before, but it gets the job done:
$ python map_unordered.py 2 .1 .2 .2 .1
0.1: 0.1
0.2: 0.2
0.3: 0.1
0.3: 0.2
iter end
0.3: done
Anyway, that's it for now.
Bonus: exceptions #
OK, so what about exceptions?
A lot of times,
you still want to do the rest of the things,
even if one fails.
Also, you probably want to know which one failed,
but the map_unordered() results are not in order,
so how could you tell?
The most flexible solution is to let the user handle it just like they would with Pool.imap_unordered() – by decorating the original function. Here's one way of doing it:
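One possible shape for that decorator (a reconstruction, using the partial()-based style discussed in the next section): it returns the original arguments together with either the result or the raised exception, so out-of-order results can be matched to their inputs:

import functools

async def _return_args_and_exceptions(func, *args):
    try:
        return (*args, await func(*args))
    except Exception as e:
        # Return the exception instead of letting it propagate.
        return (*args, e)

def return_args_and_exceptions(func):
    return functools.partial(_return_args_and_exceptions, func)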
$ python map_unordered.py 2 .1 .2 0 .2 .1
0.1: 0.1 -> 0.1
0.1: 0.0 -> float division by zero
0.2: 0.2 -> 0.2
0.3: 0.2 -> 0.2
0.3: 0.1 -> 0.1
iter end
0.3: done
Bonus: better decorators? #
Finally, here's a cool thing I learned from the asyncio docs.
When writing decorators, you can use partial() to bind the decorated function to an existing wrapper, instead of always returning a new one. The result is a more descriptive representation:
>>> return_args_and_exceptions(do_stuff)
functools.partial(<function _return_args_and_exceptions at 0x10647fd80>, <function do_stuff at 0x10647d8a0>)
Compare with the traditional version:
def return_args_and_exceptions(func):
async def wrapper(*args):
...
return wrapper
>>> return_args_and_exceptions(do_stuff)
<function return_args_and_exceptions.<locals>.wrapper at 0x103993560>
Does this have a fancy, academic name? Do let me know!
Stack Abuse
Rounding Decimals in Python
Introduction
Whether you're working with financial data, scientific calculations, or any other type of data that requires precise decimal arithmetic, knowing how to round decimal numbers accurately can make all the difference. In Python, there are various methods for rounding digits, each with its unique pros and cons.
In this article, we'll take a look at the different methods of rounding decimals in Python, and offer tips and best practices to help you get a better understanding of rounding decimals in this programming language. We'll discuss the round() and format() functions, as well as the decimal module.
Using the round() Function
The first approach for rounding decimals in Python we'll discuss in this article is using the round() function. It is a built-in Python function that allows you to round numbers to a specified number of decimal places. It takes two arguments - number and ndigits:
round(number, ndigits=None)
The number argument is the decimal number that you want to round, and the ndigits argument (optional) specifies the number of decimal places to round to.
Note: If ndigits is not specified, round() will round the number to the nearest integer.
Now, let's take a look at a simple example. Say we have the following decimal number:
x = 3.14159
And say we want to round x to two decimal places. We'll use the round() function with ndigits set to 2:
rounded_x = round(x, 2)
The rounded_x variable would now hold the value 3.14, as expected.
Rounding Halfway Cases
That's all fine, but what if we want to round any of the halfway cases (numbers that end in .5)? In that case, the round() function uses a "round half to even" algorithm. This means that the function rounds to the nearest even number. For example, round(2.5) will round down to 2, and round(3.5) will round up to 4.
Note: It's worth noting that this behavior can lead to unexpected results when rounding large sets of numbers. If you need to round a large set of numbers and want to avoid bias towards the even numbers, you may want to consider using another rounding method.
Rounding Floating Point Numbers
By default, most decimal numbers in Python are stored internally as the float data type, which means they're actually stored as floating point numbers. This representation can only approximate many of the values you want to store, due to the discrete nature of computers: the infinite set of decimal numbers has to be represented with a finite number of bits (zeros and ones).
To illustrate that, let's take a look at the following decimal number:
x = 3.175
It's fair to assume that if we round this number to 2 decimal places, the resulting number would be 3.18, right?
rounded_x = round(x, 2)
print(rounded_x)
But, as we can see from the output, that's actually not the case:
3.17
This shows the difficulties we face when working with floating point numbers in general - not all numbers are accurately stored. In this example, the number 3.175 was actually stored as 3.17499999999999982236431605997495353221893310546875 which explains why it was rounded down to 3.17 instead of 3.18, which we expected.
All-in-all, the round() function is certainly the most common method for rounding digits in Python, and it's usually suitable for most use cases. However, if you need more control over the rounding method, you may want to consider using the decimal module or the format() method instead.
Using the format() Method
The format() method is another built-in Python function that we can use to round decimals. It works by formatting a given number as a string, and then manipulating the string to display the desired number of decimal places.
Note: Obviously, this is not ideal if you want to actually work with numbers, but can be great approach if you need to display your rounded number in a specific way.
Now, let's take a look at the syntax of the format() method:
"{:.nf}".format(number)
The n section specifies the number of decimal places we want to round to. The number argument is the decimal number that you want to round.
Consider the same decimal number we used in the previous section:
x = 3.14159
Let's use the format() to round x to two decimal places:
rounded_x = "{:.2f}".format(x)
The rounded_x variable would now hold the string "3.14".
Examples
Take a look at a few more examples of using the format() method to round decimals:
# Round to the nearest integer
x = 3.14159
rounded_x = "{:.0f}".format(x)
# rounded_x is 3
# Round to one decimal place
x = 3.14159
rounded_x = "{:.1f}".format(x)
# rounded_x is 3.1
# Round to two decimal places
x = 3.14159
rounded_x = "{:.2f}".format(x)
# rounded_x is 3.14
One advantage of using the format() method over the round() function is that you can control the formatting of the rounded number more precisely. You can use the format() method to add leading zeros, commas for thousands separators, and other formatting options.
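For example (my own illustration), the same format specification syntax handles zero-padding and thousands separators:

x = 1234567.891

print("{:,.2f}".format(x))    # 1,234,567.89  (comma as thousands separator)
print("{:012.2f}".format(x))  # 001234567.89  (zero-padded to width 12)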
On the other hand, the format() method has some limitations. For example, it may not always produce the expected result when rounding halfway cases, and it may not be suitable for large sets of numbers. In these cases, you should probably consider using the decimal module instead.
Using the decimal Module
The decimal module is a Python module that provides support for working with decimal numbers. It offers a way to perform accurate decimal arithmetic, which floating point numbers can't perform.
To use the decimal module for rounding decimals, you first need to create a Decimal object that represents the decimal number you want to round. Then, to round the number, you can then use the quantize() method of the Decimal object:
Decimal(number).quantize(Decimal(pattern))

Here, pattern is a string such as '.01' whose number of decimal places determines how the value is rounded, and number is the decimal number that you want to round.
Let's take a look at the example number we've used in previous sections - 3.14159 and round it to two decimal places using the decimal module:
from decimal import Decimal
x = Decimal(3.14159)
rounded_x = x.quantize(Decimal('.01'))
The rounded_x variable would now hold the value 3.14.
A Few More Examples
Here are a few more examples of using the decimal module to round decimals:
# Round to the nearest integer
x = 3.14159
rounded_x = Decimal(x).quantize(Decimal('1'))
# rounded_x is 3
# Round to one decimal place
x = 3.14159
rounded_x = Decimal(x).quantize(Decimal('.1'))
# rounded_x is 3.1
# Round to two decimal places
x = 3.14159
rounded_x = Decimal(x).quantize(Decimal('.01'))
# rounded_x is 3.14
As you can see, the decimal module allows for precise rounding of decimal numbers, and it can be a good choice when accuracy is important. However, it can be slower than the other rounding methods, and it may require more code to use effectively. It's also worth noting that the decimal module may not always produce the expected result when rounding halfway cases, so you may want to test your code carefully to ensure that it behaves as expected.
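Two details worth keeping in mind here (my own notes, building on the examples above): constructing the Decimal from a string avoids inheriting the float approximation, and you can pass an explicit rounding mode for halfway cases:

from decimal import Decimal, ROUND_HALF_UP

# A string avoids the float approximation: Decimal(3.175) carries the full
# binary value 3.17499999..., while Decimal("3.175") is exactly 3.175.
print(Decimal("3.175").quantize(Decimal("0.01")))  # 3.18
print(Decimal(3.175).quantize(Decimal("0.01")))    # 3.17

# An explicit rounding mode controls halfway cases (the default is ROUND_HALF_EVEN).
print(Decimal("3.125").quantize(Decimal("0.01")))                          # 3.12
print(Decimal("3.125").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 3.13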
Best Practices for Rounding Decimals in Python
- Know your requirements: Before choosing a method to round decimals, it's important to understand your requirements. Do you need exact decimal arithmetic, or is a rough approximation sufficient? Do you need to display the rounded number as a string, or do you need to perform further calculations with it? Answering these questions can help you choose the right method for your needs.
- Use the built-in
round()function for simple cases: The built-inround()function is the simplest way to round decimal numbers in Python, and it works well for most simple cases. Use it when you don't need precise decimal arithmetic, and when you don't need to perform further calculations with the rounded number. - Use the
format()method for more control: If you need more control over the formatting of the rounded number, use theformat()method. It allows you to specify the number of decimal places to round to, and to control other aspects of the formatting as well. - Use the
decimalmodule for exact decimal arithmetic: If you need to perform exact decimal arithmetic, use thedecimalmodule. It allows you to control the precision of the decimal calculations, and it provides a way to round decimal numbers to the desired number of decimal places. - Be aware of rounding errors: Rounding decimal numbers can introduce rounding errors, especially when working with very large or very small numbers. Be aware of these errors, and test your code carefully to ensure that it produces the expected results.
- Avoid rounding halfway cases: Halfway cases occur when the number being rounded is exactly halfway between two possible rounded values. Rounding halfway cases can produce unexpected results, so it's generally better to avoid them whenever possible.
- Document your code: When rounding decimals in Python, it's important to document your code clearly. Explain why you are rounding the number, what method you are using, and any assumptions or limitations that apply. This can help ensure that your code is clear, correct, and maintainable over time.
Conclusion
In this article, we've covered the basics of rounding decimals in Python, exploring a few different methods for achieving the desired result. These include using the built-in round() function, the format() method, and finally the decimal module.
By following the best practices in this article, you can avoid rounding errors, choose the right method for your needs, and document your code for better clarity and maintainability.
All-in-all, rounding decimals may seem like a small detail, but it's definitely an important part of many Python applications.
Real Python
YAML: Python's Missing Battery
Python is often marketed as a batteries-included language because it comes with almost everything you’d ever expect from a programming language. This statement is mostly true, as the standard library and the external modules cover a broad spectrum of programming needs. However, Python lacks built-in support for the YAML data format, commonly used for configuration and serialization.
In this video course, you’ll learn how to work with YAML in Python using the available third-party libraries, with a focus on PyYAML. If you’re new to YAML or haven’t used it in a while, then you’ll have a chance to take a quick crash course before diving deeper into the topic.
In this video course, you’ll learn how to:
- Read and write YAML documents in Python
- Serialize Python’s built-in and custom data types to YAML
- Safely read YAML documents from untrusted sources
Notably, you’ll learn about YAML’s advanced, potentially dangerous features and how to protect yourself from them.
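As a tiny taste of what the course covers (my own snippet, assuming PyYAML is installed with pip install pyyaml):

import yaml

document = """
name: example
enabled: true
ports: [8080, 8443]
"""

config = yaml.safe_load(document)  # safe_load avoids constructing arbitrary Python objects
print(config["ports"])             # [8080, 8443]

print(yaml.safe_dump(config))      # serialize the dict back to YAML text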
Mike Driscoll
An Intro to Textual – Creating Text User Interfaces with Python
Textual is a Python package used to create cross-platform Text User Interfaces (TUI). This may sound like you’ll be creating a user interface with ASCII-art, but that is not the case.
Textual is quite advanced and allows you to add widgets to your terminal applications, including buttons, context switchers, scroll bars, checkboxes, inputs and more.
Getting Started
The first thing you need to do is install Textual. If you only want to run Textual applications, then the following pip command is all you need:
python3 -m pip install textual
However, if you want to write your own Textual applications, you should run this command instead:
python3 -m pip install "textual[dev]"
That funny little [dev] part will install some extra dependencies that make developing Textual applications easier.
Run the Textual Demo
The Textual package comes with a demo. You’ll find the demo is a great way to see what types of things you can do with Textual.
Here’s how you can run the demo:
python3 -m textual
When you run the command above in your terminal, you should see something like the following:

You can explore Textual in the demo and see the following features:
- Widgets
- Accelerators (i.e. CTRL+C)
- CSS styling
- and more
Creating Your Own Simple Application
While the demo is a great place to start to get a feel for what you can do with Textual, it’s always better to dive into the docs and start writing some real code.
You will understand how things work much quicker if you write the code yourself and then start changing the code piece by piece. By going through this iterative process, you’ll learn how to build a little at a time and you’ll have a series of small failures and successes, which is a great way to build up your confidence as you learn.
The first step to take when creating a Textual application is to import Textual and subclass the App() class. When you subclass App(), you create a Textual application. The App() class contains all the little bits and bobs you need to create your very own terminal application.
To start off, create a new Python file in your favorite text editor or IDE and name it hello_textual.py.
Next, enter the following code into your new file:
from textual.app import App


class HelloWorld(App):
    ...


if __name__ == "__main__":
    app = HelloWorld()
    app.run()
When you go to run your terminal application, you should run it in a terminal. Some IDEs have a terminal built-in, such as VS Code and PyCharm. Textual may or may not look correct in those terminals though.
Whenever possible, it is recommended that you run Textual applications in your external terminal. Your applications will look and behave better there most of the time. On Mac, it is recommended that you use iTerm rather than the built-in terminal as the built-in Terminal application hasn’t been updated in quite some time.
To run your new terminal application, you will need to run the following command:
python3 hello_textual.py
When you run this command, you will see the following:
Oops! That's just a blank black box! That's probably not what you wanted after all.
To exit a Textual application, press CTRL+C. When you do, you will exit the application and return to the normal terminal.
That was exciting, but the user interface was very plain. You can fix that up a bit by adding a label in the next section!
Adding a Label
Now that you are back to your original terminal, go back to your Python editor and create a new file. This time you will name it hello_textual2.py.
Enter the following code into your new Python file:
from textual.app import App, ComposeResult
from textual.widgets import Label


class HelloWorld(App):
    def compose(self) -> ComposeResult:
        yield Label("Hello Textual")


if __name__ == "__main__":
    app = HelloWorld()
    app.run()
Your HelloWorld() class was empty before. Now you added a compose() method. The compose() method is where you normally set up your widgets.
A widget is a user interface element, such as a label, a text box, or a button. In this example, you add a Label() with the text “Hello Textual” in it.
Try running your new code in your terminal and you should see something like this:
Well, that looks a little better than the original. But it would be nice to have a way to close your application without using CTRL+C.
One common way to close an application is with a Close button. You’ll learn how to add one of those next!
Adding a Close Button
When you create a user interface, you want to communicate with the user about how they can close your application. A terminal application already has a way to close the terminal itself by way of its exit button.
However, you usually want a way to close your Textual application without closing the terminal itself. You have been using CTRL+C for this.
But there’s a better way! You can add a Button widget and connect an event handler to it to close the application.
To get started, open up a new Python file in your Python editor of choice. Name this one hello_textual3.py and then enter the following code:
# hello_textual3.py
from textual.app import App, ComposeResult
from textual.widgets import Button, Label


class HelloWorld(App):
    def compose(self) -> ComposeResult:
        self.close_button = Button("Close", id="close")
        yield Label("Hello Textual")
        yield self.close_button

    def on_mount(self) -> None:
        self.screen.styles.background = "darkblue"
        self.close_button.styles.background = "red"

    def on_button_pressed(self, event: Button.Pressed) -> None:
        self.exit(event.button.id)


if __name__ == "__main__":
    app = HelloWorld()
    app.run()
The first change you’ll encounter is that you are now importing a Button in addition to your Label.
The next change is that you are assigning the Button to self.close_button in your compose() method before yielding it. By assigning the button to an instance attribute, you can more easily access it later when you want to change the widget's settings.
The on_mount() method is called when your application enters application mode. Here you set the background of your app (the screen) to “darkblue” and you set the background color of the close button to “red”.
Lastly, you create an on_button_pressed() method, which is your event handler for catching when a button is pressed. When the close button is pressed, the on_button_pressed() is called and your application exits. You pass in the button’s id to tell the application which button was used to close it, although you don’t use that information here.
It’s time to try running your code! When you do, you should see the following:
So far so good. Your application is looking great!
Now you’re ready to learn the basics of styling your application with CSS.
Adding Style with CSS
Textual lets you apply styles using Cascading Style Sheets (CSS) in much the same way that web developers use CSS. You write the styles in a separate file that ends with the .css extension.
By separating out the style from the logic, you can follow the Model-View-Controller design pattern. But even if you don’t follow that pattern, it lets you separate the logic from the design and can make iterating on your design easier.
To get started, you will first update your Python file so that it uses a CSS file. Open up your Python editor and create a new file named hello_textual_css.py, then enter the following code into it:
# hello_textual_css.py
from textual.app import App, ComposeResult
from textual.widgets import Button, Label


class HelloWorld(App):
    CSS_PATH = "hello.css"

    def compose(self) -> ComposeResult:
        self.close_button = Button("Close", id="close")
        yield Label("Hello Textual", id="hello")
        yield self.close_button

    def on_mount(self) -> None:
        self.screen.styles.background = "darkblue"
        self.close_button.styles.background = "red"

    def on_button_pressed(self, event: Button.Pressed) -> None:
        self.exit(event.button.id)


if __name__ == "__main__":
    app = HelloWorld()
    app.run()
The only change here is to add the class attribute, CSS_PATH, right at the top of your HelloWorld() class definition. The CSS_PATH can be a relative or absolute path to your CSS file.
In the example code above, you use a relative path to a file named hello.css which should be saved in the same folder as your Python file.
You can now create hello.css in your Python or text editor. Then enter the following code into it:
Screen {
    layout: grid;
    grid-size: 2;
    grid-gutter: 2;
    padding: 2;
}

#hello {
    width: 100%;
    height: 100%;
    column-span: 2;
    content-align: center bottom;
    text-style: bold;
}

Button {
    width: 100%;
    column-span: 2;
}
The Screen mentioned here maps to the self.screen object in your code. You are telling Textual that you want to use a grid layout where the number two signifies that the grid will be two columns wide and include two rows.
The spacing between rows is controlled by the grid-gutter. Finally, you set padding to add spacing around the content of the widget itself.
The #hello tag matches the hello id of a widget in your code. In this case, your Label has the id of “hello”. So everything in the curly braces that follows the #hello tag controls the style of your Label. You want the label to span across both columns and the text-style to be bold. You also set the width and height to 100%.
Finally, you have some styling to add to Button widgets. There’s only one here, but this would apply to all buttons if you had additional ones. You are setting the width of the button to 100% and telling it to span both columns.
Now that the explanation is out of the way, you are ready to try running your code. When you do, you should get something like this:
The button is now nice and large, but you could certainly make the text of the label a bit bigger. You should try and figure out how to do that as a stretch goal!
Wrapping Up
Textual is amazing! The demo has many more examples than what is covered here, as does the Textual documentation.
Here is what you learned from this article:
- The Textual Demo
- Creating a Label
- Adding a button
- Using a layout
- Styling with CSS
This article barely scratches the surface of all the amazing features that Textual has to offer. Keep an eye on this website though, as there are lots more articles on Textual coming soon!
The post An Intro to Textual – Creating Text User Interfaces with Python appeared first on Mouse Vs Python.
Lucas Cimon
__slots__ memory optimization in Python
Illustration from realpython.com
The other day, while working on fpdf2,
I used @dataclass,
a nice decorator that came in the standard library with Python 3.7,
to quickly define a class that mostly stored data.
Then a question came to my mind: is the __slots__ memory optimization compatible with …
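As a rough illustration of the question being explored (a sketch assuming Python 3.10+, where dataclasses accept a slots flag; it is not taken from the post itself):
from dataclasses import dataclass

@dataclass
class PointDict:          # regular dataclass: instances carry a per-instance __dict__
    x: float
    y: float

@dataclass(slots=True)    # Python 3.10+: generates __slots__ and drops the __dict__
class PointSlots:
    x: float
    y: float

print(hasattr(PointDict(1.0, 2.0), "__dict__"))   # True
print(hasattr(PointSlots(1.0, 2.0), "__dict__"))  # False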
— Permalink
Brett Cannon
Unravelling `global`
While preparing my talk for PyCascades 2023 on this very blog post series of Python's syntactic sugar, I had an inkling that I could unravel the global statement. After talking to some folks after my talk, I realized that I could, in fact, unravel it! The trick was realizing what made globals (and built-ins) different from locals.
Python's scopes
Before nonlocal and closures, Python had a relatively simple set of scoping rules that grouped everything into 3 namespaces (i.e. groupings of names, which is a somewhat technical name for "variables"):
- Any name created in a block (i.e. def), unless specified by a global statement, was local
- Anything at the top of a module or named in a global statement was global
- The builtins module contained everything built-in
(There is some history around when the builtins module was introduced, but that's just historical context. And while you may have seen __builtins__, it's actually an implementation detail of CPython, so I'm leaving it out of this discussion.) This was known as the LGB rule (Local, Global, Built-ins). To make the rest of this blog post easier to follow, assume that when I say "local" I am including nonlocal, since closures are just fancy locals (you can read the actual scoping rules if you want the full details).
There is one key thing to notice about my outline of the LGB rule that makes locals unique compared to globals and built-ins: they must be created in the block where they reside. What that means is a local always comes into existence thanks to an assignment, which makes = and := very obvious syntax to signal what is a local name (and it's a piece of syntax I don't think we can get rid of). Since we can look at an entire file's contents, we can also deduce what all the local names are with complete confidence and consider them taken care of by = (this is actually how Python itself decides what's local and what isn't).
Thus any name we come across which isn't a local is either a global or built-in name. Since you can't assign to the built-in namespace directly, we can disambiguate between globals and built-ins by assignment: anything that's assigned to and isn't a local is implicitly a global name. It's also important to note that all the global statement is doing is instructing Python to explicitly treat a name as a global instead of as a local when it comes to assignment. So if we can unravel assigning to a global name then we are done!
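A tiny example makes the assignment rule concrete (a minimal sketch, not from the original post):
counter = 0  # assigned at the top of the module, so it is a global name

def bump():
    global counter   # tell Python the assignment below targets the global ...
    counter += 1     # ... otherwise this assignment would create a new local

bump()
print(counter)  # 1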
Unravelling global assignment
A very important tool we are going to use for this unravelling is the globals() built-in function. What makes this such an important function for what we want to accomplish is that it "return[s] the dictionary implementing the current module namespace." Getting to treat the global namespace as a dictionary means that assigning to a global can be treated just like assigning to a dictionary key! That makes a direct unravelling of A = 42 be globals()["A"] = 42. But since we already unravelled subscription, we can unravel all the way down to just function calls: getattr(dict, "__setitem__")(globals(), "A", 42).
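Put differently, at module level all three of these spellings end up doing the same thing (a quick sketch of the unravelling above):
A = 42                                              # ordinary global assignment
globals()["A"] = 42                                 # the same, via the module namespace dict
getattr(dict, "__setitem__")(globals(), "A", 42)    # fully unravelled into function calls
print(A)  # 42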
Unravelling the reading of a global name
But it turns out we can push things a bit farther and even unravel reading a global name (although this isn't really syntactic, so this is just an academic exercise)! Things get a little tricky when you try to read from a global name thanks to us having no syntactic way to tell a global name from a built-in name like we can for assignment. But since we have a distinct way to get both the globals and built-in namespaces via globals() and the builtins module, respectively, it's straightforward to write code which looks things up appropriately. One way to do that in a single line would be globals()["A"] if "A" in globals() else builtins.A.
One little detail we do need to make sure to take care of, though, is to raise NameError if the name doesn't exist anywhere. So our one-liner is a bit too simplistic. Luckily, the full unravelling isn't tricky if we try to read the name A:
import builtins as _builtins

if "A" in globals():
    globals()["A"]
else:
    try:
        _builtins.A
    except AttributeError:
        raise NameError("name 'A' is not defined")
Unravelling the reading of the name A
March 27, 2023
Python Morsels
Implementing slicing
You can make Python objects support slicing by implementing a __getitem__ method that accepts slice objects.
Table of contents
Indexing relies on __getitem__, but so does slicing!
Python's subscript notation ([...]) relies on the __getitem__ method.
Here's a class with a __getitem__ method:
class S:
    def __getitem__(self, index):
        return index
The objects of this class support the subscript notation by just returning whatever was passed into those square brackets:
>>> s = S()
>>> s[4]
4
>>> s['a']
'a'
Objects that support key lookups or index lookups need to implement __getitem__.
But Python's slicing syntax also relies on the __getitem__ method.
Slicing uses slice objects
When you slice a sequence …
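Using the S class from above, you can see that slicing hands __getitem__ a slice object rather than a plain index (a small demonstration, not part of the original excerpt):
>>> s = S()
>>> s[1:3]
slice(1, 3, None)
>>> s[::2]
slice(None, None, 2)
>>> s[1:3].indices(10)  # slice objects can compute concrete (start, stop, step) for a given length
(1, 3, 1)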
Read the full article: https://www.pythonmorsels.com/implementing-slicing/
Real Python
How to Read Python Input as Integers
If you’ve ever coded an interactive text-based application in Python, then you’ve probably found that you need a reliable way of asking the user for integers as input. It’s not enough simply to display a prompt and then gather keystrokes. You must check that the user’s input really represents an integer. If it doesn’t, then your code must react appropriately—typically by repeating the prompt.
In this tutorial, you’ll learn how to create a reusable utility function that’ll guarantee valid integer inputs from an interactive user. Along the way, you’ll learn about Python’s tools for getting a string from the console and converting that string into an integer.
Whenever you’re writing a program that interacts with the keyboard, you must code defensively to manage invalid inputs, so you’ll also learn the most Pythonic way to deal with this situation. You’ll handle any errors robustly inside a function that’s guaranteed to return nothing but integers.
Free Download: Click here to download the sample code that you’ll use to get integer input from users in Python.
How to Get Integer Input Values in Python
Python’s standard library provides a built-in tool for getting string input from the user, the input() function. Before you start using this function, double-check that you’re on a version of Python 3. If you’d like to learn why that’s so important, then check out the collapsible section below:
Python 2’s version of the input() function was unsafe because the interpreter would actually execute the string returned by the function before the calling program had any opportunity to verify it. This allowed a malicious user to inject arbitrary code into the program.
Because of this issue, Python 2 also provided the raw_input() function as a much safer alternative, but there was always the risk that an unsuspecting programmer might choose the more obviously-named input().
Python 3 renamed raw_input() to input() and removed the old, risky version of input(). In this tutorial, you’ll use Python 3, so this pitfall won’t be a concern.
In Python 3, the input() function returns a string, so you need to convert it to an integer. You can read the string, convert it to an integer, and print the results in three lines of code:
>>> number_as_string = input("Please enter an integer: ")
Please enter an integer: 123
>>> number_as_integer = int(number_as_string)
>>> print(f"The value of the integer is {number_as_integer}")
The value of the integer is 123
When the above snippet of code is executed, the interpreter pauses at the input() function and prompts the user to input an integer. A blinking cursor shows up at the end of the prompt, and the system waits for the user to type an arbitrary string of characters.
When the user presses the Enter key, the function returns a string containing the characters as typed, without a newline. As a reminder that the received value is a string, you’ve named the receiving variable number_as_string.
Your next line attempts to parse number_as_string as an integer and store the result in number_as_integer. You use the int() class constructor to perform the conversion.
Finally, the print() function displays the result.
Dealing With Invalid Input
You’ve probably already noticed that the above code is hopelessly optimistic. You can’t always rely on users to provide the kind of input that you expect. You can help a lot by providing an explicit prompt message, but through confusion, carelessness, or malice, there’ll always be users who provide invalid input. Your program should be ready to deal with any kind of text.
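The tutorial builds toward a reusable helper for exactly this. As a preview, a minimal sketch of such a function might look like the following (the name get_integer is just an example, not necessarily what the article uses):
def get_integer(prompt="Please enter an integer: "):
    """Keep prompting until the user types something int() can parse."""
    while True:
        try:
            return int(input(prompt))
        except ValueError:
            print("That wasn't an integer. Please try again.")

# Usage:
# age = get_integer("How old are you? ")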
Read the full article at https://realpython.com/python-input-integer/ »
Python for Beginners
Convert INI File to Dictionary in Python
INI files are one of the simplest configuration files that we use in software systems. In this article, we will discuss how to convert an INI file to a python dictionary.
What is the INI File Format?
INI (short for “initialization”) file format is a plain text file format that is commonly used to store configuration settings for computer programs. INI files have a .ini file extension.
Each INI file consists of different sections containing data in the form of key-value pairs. The sections are defined using square brackets ([]) and the key-value pairs are separated by an equal sign (=) or a colon (:).
For instance, consider the following example.
[employee]
name=John Doe
age=35
[job]
title=Software Engineer
department=IT
years_of_experience=10
[address]
street=123 Main St.
city=San Francisco
state=CA
zip=94102
The above INI file represents configuration data for an employee. It consists of several sections, each containing key-value pairs.
- The [employee] section contains the employee’s name and age.
- The [job] section contains information about the employee’s job, including the job title, department, and years of experience.
- The [address] section contains the employee’s address information, including the street address, city, state, and ZIP code.
Each key-value pair within a section represents a single configuration setting. Here, the key represents the name of the setting and the value represents the value of the setting. For example, the "name=John Doe" key-value pair in the [employee] section indicates that the employee’s name is "John Doe". Similarly, the "title=Software Engineer" key-value pair in the [job] section indicates that the employee’s job title is "Software Engineer".
The above ini file can be stored in a file as shown below.
INI Configuration File
INI files are easy to read and edit with any text editor. They are widely used on Windows and other operating systems for storing configuration data for various applications. However, in recent years, INI files have been largely replaced by more sophisticated configuration file formats, such as XML, JSON, and YAML.
Convert INI File to Python dictionary
To convert an INI file to a python dictionary, we will use the configparser module. We will use the following steps to convert an INI file to a python dictionary.
- First, we will open the INI configuration file in read mode using the open() function. The open() function takes the file name as its first input argument and the python literal “r” as its second argument. After execution, it returns a file pointer.
- Next, we will create an empty ConfigParser object using the ConfigParser() function defined in the configparser module. We will also create an empty dictionary to store the output dictionary.
- After creating the ConfigParser object, we will read the file into the ConfigParser object. For this, we will use the read_file() method. The read_file() method, when invoked on an empty ConfigParser object, takes the file pointer as its input argument and loads the file contents into the ConfigParser object.
- The ConfigParser object stores data in different sections as shown in the example in the previous section. We will get the names of all the sections in the ConfigParser object using the sections() method. The sections() method, when invoked on a ConfigParser object, returns a list of section names in the configuration file.
- In each section of the configuration file, there are different key-value pairs to store the data. We can get the key-value pairs in a section using the items() method. The items() method, when invoked on a ConfigParser object, takes the section name as its input argument and returns a list of tuples containing the key-value pairs in the given section.
- Once we get the list of key-value pairs in a given section, we will convert the list of tuples into a dictionary using the dict() function. The dict() function takes the list of tuples as its input argument and returns a python dictionary.
- After creating dictionaries for the key-value pairs in each section, we will assign the section name as the key and the corresponding dictionary as the value in the output dictionary.
After executing the above steps, we will get the python dictionary containing data from the INI file. You can observe this in the following example.
import configparser

config_object = configparser.ConfigParser()
file = open("employee.ini", "r")
config_object.read_file(file)

output_dict = dict()
sections = config_object.sections()
for section in sections:
    items = config_object.items(section)
    output_dict[section] = dict(items)

print("The output dictionary is:")
print(output_dict)
Output:
The output dictionary is:
{'employee': {'name': 'John Doe', 'age': '35'}, 'job': {'title': 'Software Engineer', 'department': 'IT', 'years_of_experience': '10'}, 'address': {'street': '123 Main St.', 'city': 'San Francisco', 'state': 'CA', 'zip': '94102'}}
Instead of using a for loop, you can also use dictionary comprehension to convert an INI file to a python dictionary as shown below.
import configparser

config_object = configparser.ConfigParser()
file = open("employee.ini", "r")
config_object.read_file(file)

output_dict = {s: dict(config_object.items(s)) for s in config_object.sections()}

print("The output dictionary is:")
print(output_dict)
Output:
The output dictionary is:
{'employee': {'name': 'John Doe', 'age': '35'}, 'job': {'title': 'Software Engineer', 'department': 'IT', 'years_of_experience': '10'}, 'address': {'street': '123 Main St.', 'city': 'San Francisco', 'state': 'CA', 'zip': '94102'}}
You can observe that the above code works in a similar manner to the previous code using for loops.
Conclusion
In this article, we discussed two ways to convert an INI configuration file to a python dictionary. To learn more about file conversions, you can read this article on how to convert a python dictionary to an INI file. You might also like this article on how to convert an XML file to YAML.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!
The post Convert INI File to Dictionary in Python appeared first on PythonForBeginners.com.
Mike Driscoll
PyDev of the Week: Kevin Kho
This week we welcome Kevin Kho as our PyDev of the Week! Kevin is a core developer on the Fugue package. You can catch up with Kevin on Medium where Kevin writes about Fugue, Python, and more. You can also see what projects Kevin is working on over at GitHub.
Let’s spend some more time getting to know Kevin better!
Can you tell us a little about yourself (hobbies, education, etc):
I grew up in Manila, Philippines. Both of my grandfathers immigrated to the Philippines from China, so I am Filipino-Chinese. I did all my pre-college schooling there and came to the US for college. I studied Civil Engineering at the University of Illinois at Urbana-Champaign (both Bachelor’s and Master’s) focused on water resources. Volunteering has been a big part of me, so I was heavily involved in my school chapter of Engineering Without Borders (for water and bridge projects). When I became a data scientist, I started volunteering for DataKind.
Professionally, I was a data scientist for four years across two companies, and then I joined Prefect, a Python-based workflow orchestrator. I was there for just over a year before I left to work more on the open-source Fugue project. I currently contract part-time for Citibank around distributed computing tooling.
I love watching and playing basketball, but am mostly a homebody. If I watch a show, it’s likely to be an anime or Korean drama. I used to play more computer games (mainly DOTA and League of Legends), but I can’t keep up anymore. Since COVID, I have been infected by the keyboard bug and spend time assembling and tinkering with keyboards. It’s an expensive hobby, though!
Why did you start using Python?
The short answer is I wanted to go into data science, so I came across it when self-studying. There is a longer story around that.
March 2016 was when AlphaGo went against Lee Sedol in a five-game series. I stayed up until 3 am or 4 am those nights watching the games. I don’t even play Go and only know the basic rules, but this event inspired me to pursue machine learning. I had no idea where to start and wasn’t even a heavy coder, but it did get me looking into data science.
In May 2016, I graduated with my master’s degree. I interviewed for a couple of civil engineering jobs, but it didn’t go so well because I wanted a coding component to my job. At this point, I had been doing research with the US Geological Survey (USGS) for one year, with a lot of work in R. I decided to try to see if I could self-study and break into data science myself.
Over the following six months, I took a bunch of Coursera courses around Python, machine learning, and algorithms, and then I got my first data science job at the start of 2017. My first job primarily did things in R, so I didn’t get to use Python professionally until late 2019 when I changed jobs.
What other programming languages do you know, and which is your favorite?
I don’t know a lot, especially because I frequently don’t finish courses if I don’t have a use case.
Matlab and C were the CS 101 requirement in my college program. I know R and Python well and have tried out Javascript and Java. I definitely like Python the most because it’s very accessible yet versatile to do most things you need. It’s incredible how people new to programming can learn it quickly while still having it be capable of advanced use cases with packages like PyTorch or Dask.
What projects are you working on now?
I am primarily working on Fugue. Fugue takes SQL, Python, and Pandas code and scales it to Spark, Dask, and Ray. We make big data projects easier to develop and maintain. One of the problems with distributed computing is that the code is coupled with the infrastructure. If you write Spark code, it needs to run on the Spark engine. Fugue decouples the business logic and execution. This lets users develop on their local machine, and then bring it to the cluster just by specifying the backend.
More recently, we added BigQuery as a SQL backend, so we are interested in having combinations like BigQuery-Ray or Snowflake-Spark. Connectivity is our focus so that users can utilize the optimal combination of different tools depending on the task.
Which Python libraries are your favorite (core or 3rd party)?
I’m only going to mention libraries I have not been significantly involved in.
- docker-py – from a code standpoint, I found it interesting because of the organization and mixins.
- whylogs – I believe they are laying building blocks that will redefine data validation
- pyswmm – it’s inspiring to see Python being more adopted in civil engineering
How did you get involved with the Fugue project?
I saw the main Fugue author, Han Wang, present at the Databricks and AI Summit. I reached out to him immediately afterward because I thought it could solve some problems we had at work. We had small data projects using Pandas, and big data projects using Spark, but we were implementing the same business logic twice. One version for Spark and one version for Pandas. I wanted to consolidate that with Fugue.
I was expecting just to be an end-user, but then I talked to Han and got involved. It has been a lot of work, but I am also heavily inspired by the other open-source developers I have met through it.
What are your top three features of Fugue?
1. Incremental adoption – users frequently only really need to scale out one expensive step of their pipeline. For example, maybe you want to train ten machine learning models, and you want to bring the training time down by running them in parallel or distributedly. Fugue can run a single step distributedly because it’s non-invasive, and you can leave the rest in Pandas. Actually, one of the cool things Fugue does is read the type hints and comments to perform conversions. If users choose to move off Fugue, these just stay as helpful comments. Example here: https://fugue-tutorials.readthedocs.io/tutorials/beginner/schema.html#defining-schema
2. Interoperable SQL and Python. SQL code tends to be a second-class citizen, often invoked in-between Python code. FugueSQL elevates SQL as a first-class interface, so SQL can be the one invoking Python instead. SQL lovers can now utilize distributed backends like Spark and Dask without learning framework-specific code because of added keywords like LOAD, SAVE, PERSIST, PREPARTITION. Both the SQL and Python interfaces of Fugue can be used independently and are equivalent.
3. Easily extensible. Fugue can scale Python code, or can be used as a backend by existing code to scale. For example, libraries like whylogs, pycaret, and statsforecast all can be used with Fugue as a backend to scale to Spark, Dask, and Ray. These open-source maintainers benefit from not having to maintain three separate implementations to support all distributed backends.
Is there anything else you’d like to say?
1. Contributing to open-source is a lot easier than people think. You can always start with smaller issues, and if there are none, documentation and tutorials are always helpful and appreciated. Don’t hesitate to reach out to project maintainers (especially smaller maintainer teams). They will likely appreciate it.
2. Keeping an open mind – it’s very common for data scientists to completely avoid SQL. There are debates on Python vs. SQL, and I genuinely don’t understand this because they can be used powerfully together (enabled by Fugue, but even without it). Data practitioners can be very set in their ways for some reason, and are excessively in love with their tooling (R vs Python debates). I don’t think these debates matter as much as you’d expect with social media.
Thanks for doing the interview, Kevin!
The post PyDev of the Week: Kevin Kho appeared first on Mouse Vs Python.
March 26, 2023
Michał Bultrowicz
Separating different kinds of tests
When I work on a project I differentiate three kinds of tests: unit, integrated, and external. In this post I’ll explain how I think about them.
Glyph Lefkowitz
Telemetry Is Not Your Enemy
Part 1: A Tale of Two Metaphors
In software development “telemetry” is data collected from users of the software, almost always delivered to the authors of the software via the Internet.
In recent years, there has been a great deal of angry public discourse about telemetry. In particular, there is a lot of concern that every software vendor and network service operator collecting any data at all is spying on its users, surveilling every aspect of our lives. The media narrative has been that any tech company collecting data for any purpose is acting creepy as hell.
I am quite sympathetic to this view. In general, some concern about privacy is warranted whenever some new data-collection scheme is proposed. However it seems to me that the default response is no longer “concern and skepticism”; but rather “panic and fury”. All telemetry is seen as snooping and all snooping is seen as evil.
There’s a sense in which software telemetry is like surveillance. However, it is only like surveillance. Surveillance is a metaphor, not a description. It is far from a perfect metaphor.
In the discourse around user privacy, I feel like we have lost a lot of nuance about the specific details of telemetry when some people dismiss all telemetry as snooping, spying, or surveillance.
Here are some ways in which software telemetry is not like “snooping”:
- The data may be aggregated. The people consuming the results of telemetry are rarely looking at individual records, and individual records may not even exist in some cases. There are tools, like Prio, to do this aggregation to be as privacy-sensitive as possible.
- The data is rarely looked at by human beings. In the cases (such as ad-targeting) where the data is highly individuated, both the input (your activity) and the output (your recommendations) are mainly consumed by you, in your experience of a product, by way of algorithms acting upon the data, not by an employee of the company you’re interacting with.1
- The data is highly specific. “Here’s a record with your account ID and the number of times you clicked the Add To Cart button without checking out” is not remotely the same class of information as “Here’s several hours of video and audio, attached to your full name, recorded without your knowledge or consent”. Emotional appeals calling any data “surveillance” tend to suggest that all collected data is the latter, where in reality most of it is much closer to the former.
There are other metaphors which can be used to understand software telemetry. For example, there is also a sense in which it is like voting.
I emphasize that voting is also a metaphor here, not a description. I will also freely admit that it is in many ways a worse metaphor for telemetry than “surveillance”. But it can illuminate other aspects of telemetry, the ones that the surveillance metaphor leaves out.
Data-collection is like voting because the data can represent your interests to a party that has some power over you. Your software vendor has the power to change your software, and you probably don’t: either you don’t have access to the source code, or even if it’s open source, you almost certainly don’t have the resources to take over its maintenance.
For example, let’s consider this paragraph from some Microsoft documentation about telemetry:
We also use the insights to drive improvements and intelligence into some of our management and monitoring solutions. This improvement helps customers diagnose quality issues and save money by making fewer support calls to Microsoft.
“Examples of how Microsoft uses the telemetry data” from the Azure SDK documentation
What Microsoft is saying here is that they’re collecting the data for your own benefit. They’re not attempting to justify it on the basis that defenders of law-enforcement wiretap schemes might. Those who want literal mass surveillance tend to justify it by conceding that it might hurt individuals a little bit to be spied upon, but if we spy on everyone surely we can find the bad people and stop them from doing bad things. That’s best for society.
But Microsoft isn’t saying that.2 What Microsoft is saying here is that if you’re experiencing a problem, they want to know about it so they can fix it and make the experience better for you.
I think that is at least partially true.
Part 2: I Qualify My Claims Extensively So You Jackals Don’t Lose Your Damn Minds On The Orange Website
I was inspired to write this post due to the recent discussions in the Go community about how to collect telemetry which provoked a lot of vitriol from people viscerally reacting to any telemetry as invasive surveillance. I will therefore heavily qualify what I’ve said above to try to address some of that emotional reaction in advance.
I am not suggesting that we must take Microsoft (or indeed, the Golang team) fully at their word here. Trillion dollar corporations will always deserve skepticism. I will concede in advance that it’s possible the data is put to other uses as well, possibly to maximize profits at the expense of users. But it seems reasonable to assume that this is at least partially true; it’s not like Microsoft wants Azure to be bad.
I can speak from personal experience. I’ve been in professional conversations around telemetry. When I have, my and my teams’ motivations were overwhelmingly focused on straightforwardly making the user experience good. We wanted it to be good so that they would like our products and buy more of them.
It’s hard enough to do that without nefarious ulterior motives. Most of the people who develop your software just don’t have the resources it takes to be evil about this.
Part 3: They Can’t Help You If They Can’t See You
With those qualifications out of the way, I will proceed with these axioms:
- The developers of software will make changes to it.
- These changes will benefit some users.
- Which changes the developers select will be derived, at least in part, from the information that they have.
- At least part of the information that the developers have is derived from the telemetry they collect.
If we can agree that those axioms are reasonable, then let us imagine two user populations:
- Population A is privacy-sensitive and therefore sees telemetry as bad, and opts out of everything they possibly can.
- Population B doesn’t care about privacy, and therefore ignores any telemetry and blithely clicks through any opt-in.
When the developer goes to make changes, they will have more information about Population B. Even if they’re vaguely aware that some users are opting out (or refusing to opt in), the developer will know far less about Population A. This means that any changes the developer makes will not serve the needs of their privacy-conscious users, which means fewer features that respect privacy as time goes on.
Part 4: Free as in Fact-Free Guesses
In the world of open source software, this problem is even worse. We often have fewer resources with which to collect and analyze telemetry in the first place, and when we do attempt to collect it, a vocal minority among those users are openly hostile, with feedback that borders on harassment. So we often have no telemetry at all, and are making changes based on guesses.
Meanwhile, in proprietary software, the user population is far larger and less engaged. Developers are not exposed directly to users and therefore cannot be harassed or intimidated into dropping their telemetry. Which means that proprietary software gains a huge advantage: they can know what most of their users want, make changes to accommodate it, and can therefore make a product better than the one based on uninformed guesses from the open source competition.
Proprietary software generally starts out with a panoply of advantages already — most of which boil down to “money” — but our collective knee-jerk reaction to any attempt to collect telemetry is a massive and continuing own-goal on the part of the FLOSS community. There’s no inherent reason why free software’s design cannot be based on good data, but our community’s history and self-selection biases make us less willing to consider it.
That does not mean we need to accept invasive data collection that is more like surveillance. We do not need to allow for stockpiled personally-identifiable information about individual users that lives forever. The abuses of indiscriminate tech data collection are real, and I am not suggesting that we forget about them.
The process for collecting telemetry must be open and transparent, the data collected needs to be continuously vetted for safety. Clear data-retention policies should always be in place to avoid future unanticipated misuses of data that is thought to be safe today but may be de-anonymized or otherwise abused in the future.
I want the collaborative feedback process of open source development to result in this kind of telemetry: thoughtful, respectful of user privacy, and designed with the principle of least privilege in mind. If we have this kind of process, then we could hold it up as an example for proprietary developers to follow, and possibly improve the industry at large.
But in order to be able to produce that example, we must produce criticism of telemetry efforts that is specific, grounded in actual risks and harms to users, rather than a series of emotional appeals to slippery-slope arguments that do not correspond to the actual data being collected. We must arrive at a consensus that there are benefits to users in allowing software engineers to have enough information to do their jobs, and telemetry is not uniformly bad. We cannot allow a few users who are complaining to stop these efforts for everyone.
After all, when those proprietary developers look at the hard data that they have about what their users want and need, it’s clear that those who are complaining don’t even exist.
-
Please note that I’m not saying that this automatically makes such collection ethical. Attempting to modify user behavior or conduct un-reviewed psychological experiments on your customers is also wrong. But it’s wrong in a way that is somewhat different than simply spying on them. ↩
-
I am not suggesting that data collected for the purposes of improving the users’ experience could not be used against their interest, whether by law enforcement or by cybercriminals or by Microsoft itself. Only that that’s not what the goal is here. ↩
Kay Hayen
Nuitka Release 1.5
This is to inform you about the new stable release of Nuitka. It is the extremely compatible Python compiler, “download now”.
This release contains the long awaited 3.11 support, even if only on an experimental level. This means that where 3.10 code is used, it is expected to work equally well, but the Python 3.11 specific new features have not yet been done.
There are plenty of new features in Nuitka, e.g. much enhanced reports, Windows ARM native compilation support, and the usual slew of anti-bloat updates and newly supported packages.
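If you just want to try the release, the basic invocation is unchanged (a minimal example; --onefile and --report are existing options, but the exact flags you need depend on your project):
python -m pip install -U nuitka
python -m nuitka --onefile --report=compilation-report.xml your_program.py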
Bug Fixes
- Standalone: Added implicit dependencies for charset_normalizer package. Fixed in 1.4.1 already.
- Standalone: Added platform DLLs for sounddevice package. Fixed in 1.4.1 already.
- Plugins: The info from Qt bindings about other Qt bindings being suppressed for import was spawning multiple lines, breaking tests. Merged to a single line until we do text wrap for info messages as well. Fixed in 1.4.1 already.
- Plugins: Fix, removeDllDependencies was broken and could no longer be used to remove DLLs from inclusion. Fixed in 1.4.1 already.
- Fix, assigning methods of lists and calling them that way could crash at runtime. The same was true of dict methods, but had never been observed. Fixed in 1.4.2 already.
- Standalone: Added DLL dependencies for onnxruntime. Fixed in 1.4.2 already.
- Standalone: Added implicit dependencies for textual package. Fixed in 1.4.2 already.
- Fix, boolean tests of lists could be optimized to a wrong result when list methods got recognized, due to not annotating the escape during that pass properly. Fixed in 1.4.3 already.
- Standalone: Added missing implicit dependency of apsw. Fixed in 1.4.3 already. Note: currently apsw only works with manual workarounds and only in limited ways; there is an import level incompatible with __init__ being an extension module that Nuitka does not yet handle.
- Python3: Fix, for range arguments that fail to divide their difference, the code would have crashed. Fixed in 1.4.3 already.
- Standalone: Fix, added support for newer pkg_resources with another vendored package. Fixed in 1.4.4 already.
- Standalone: Fix, added support for newer shapely 2.0 versions. Fixed in 1.4.4 already.
- Plugins: Fix, some yaml package configurations with DLLs by code didn’t work anymore, notably old shapely 1.7.x versions were affected. Fixed in 1.4.4 already.
- Fix, for the onefile final result the --output-dir option was ignored. Fixed in 1.4.4 already.
- Standalone: Added mozilla-ca package data file. Fixed in 1.4.4 already.
- Standalone: Fix, added missing implicit dependency for newer gevent. Fixed in 1.4.4 already.
- Scons: Accept an installed Python 3.11 for Scons execution as well. Fixed in 1.4.4 already.
- Python3.7: Some importlib.resource nodes asserted against use in 3.7, expecting it to be 3.8 or higher, but this interface is present in 3.7 already. Fixed in 1.4.5 already.
- Standalone: Fix, Python DLLs installed to the Windows system folder were not included, causing the result to not be portable. Fixed in 1.4.5 already.
- Python3.9+: Fix, the metadata.resources files method joinpath is in some contexts expected to accept a variable number of arguments. Fixed in 1.4.5 already.
- Standalone: Workaround for customtkinter data files on non-Windows. Fixed in 1.4.5 already.
- Standalone: Added support for overrides package. Fixed in 1.4.6 already.
- Standalone: Added data files for strawberry package. Fixed in 1.4.7 already.
- Fix, the anti-bloat plugin caused crashes when attempting to warn about packages coming from --include-package by the user. Fixed in 1.4.7 already.
- Windows: Fix, main program filenames with an extra dot apart from the .py suffix had the part beyond that wrongly trimmed. Fixed in 1.4.7 already.
- Fix, list methods didn’t properly annotate value escape during their optimization, which could lead to wrong optimization for boolean tests. Fixed in 1.4.7 already.
- Standalone: Added support for imagej, scyjava, jpype packages. Fixed in 1.4.8 already.
- Fix, using --include-package on extension module names was not working. Fixed in 1.4.8 already.
- Standalone: Added support for the tensorflow.keras namespace as well.
- Distutils: Fix, namespace packages were not including their contained modules properly with regards to __file__ properties, making relative file access impossible.
- Onefile: On Windows the onefile binary did lock itself, which could fail with certain types of AV software. This is now avoided.
- Accessing files using the top level metadata.resources files object was not working properly, this is now supported too.
- MSYS2: Make sure mixing POSIX and Windows slashes causes no issues by hard-coding the onefile archive to use the subsystem slash rather than what MSYS prefers to use internally.
- Standalone: Added missing dependencies of newer imageio.
- Fix, side effect nodes didn’t annotate their non-exception raising nature properly, if that was the case.
New Features
Added experimental support for Python 3.11, for 3.10 language level code it should be fully usable, but the
CPython311test suite has not even been started to check newly added or changed features.Windows: Support for native Python on Windows ARM64, which needs 3.11 or higher, but standalone and therefore onefile do not yet work, due to lack of any form of binary dependency analysis tool.
This platform is relatively new in Python and generally. For the time being standalone and onefile should be done with Intel based Python, they would also be ARM64 only, whereas 32/64 Bit binaries can be run on all Windows ARM platforms.
Reports: Write compilation report even in case of Nuitka being interrupted or crashing. This then includes the exception, and a status like
completedorinterrupted. At this time this happens only when--report=was specified, but in the future we will likely write one in case of Nuitka crashes.Reports: Now the details of the used Python version, its flavor, the OS and the architecture are included. This is crucial information for analysis and can make
--versionoutput unnecessary.Reports: License reports now handle
UNKNOWNlicense by falling back to checking the classifiers, and therefore include the correct license e.g. withsetuptools. Also in case no license text is found, do not create an empty block. Added in 1.4.4 already.Reports: In case the distribution name and the contained package names differ, output the list of packages included from a distribution. Added in 1.4.4 already.
Reports: Include data file sizes in report. Added in 1.4.7 already.
Reports: Include memory usage into the compilation report as well.
macOS: Add support for downloading
ccacheon arm64 (M1/M2) too. Added in 1.4.4 already.UI: Allow
--output-filenamefor standalone mode again. Added in 1.4.3 already.Standalone: Improved isolation with Python 3.8 or higher. Using new init mechanisms of Python, we now achieve that the scan for
pyvenv.cfgon in current directory and above is not done, using it will be unwanted.Python2: Expose
__loader__for modules and register withpkg_resourcestoo which expects these to be present for custom resource handling.Python3.9+: The
metadata.resourcesfiles objects methoditerdirwas not implemented yet. Fixed in 1.4.5 already.Python3.9+: The
metadata.resourcesfiles objects methodabsolutewas not implemented yet.Added experimental ability to create virtualenv from an existing compilation report with new
--create-environment-from-reportoption. It attempts to create a requirements file with the used packages and their versions. However, sometimes it seems not to be possible to due to conflicts.
Optimization
Onefile: Use memory mapping for calculating the checksum of files on all platforms. This is faster and simpler code. So far it had only be done this way on Windows, but other platforms also benefit a lot from it.
Onefile: Use memory mapping for accessing the payload rather than file operations. This avoids differences to macOS payload handling and is much faster too.
Anti-Bloat: Avoid using
daskinjoblib.Note
Newer versions of
joblibdo not currently work yet due to their own form of multiprocessing spawn not being supported yet.Anti-Bloat: Adapt for newer
pandaspackage.Anti-Bloat: Remove more
IPythonusages in newer tensorflow.Use dedicated class bodies for Python2 and Python3, with the former has a static dict type shape, and with Python3 this needs to be traced in order to tell what the meta class put in there.
Compile time optimize dict
in/not inanddict.has_keyoperations statically where the keys of a dict are known. As a result, the class declarations of Python3 no longer created code for both branches, the one withmetaclass =in the class declaration and without. That means also a big scalability improvement.For the Python3 class bodies, the usage of
locals()was not recognized as not locally escaping all the variables, leading to variable traces where each class variable was marked as escaped for no good reason.Added support for
dict.fromkeysmethod, making the code generation understand and handle static methods as well.Added support for
os.listdirandos.path.basename. Added in 1.4.5 already for use in implementing theiterdirmethod, but they are also now optimized by themselves.Added support for trusted constant values of the
osmodule. These arecurdir,pardir,sep,extsep,altsep,pathsep,linesepwhich may enable some minor compile time optimization to happen and completes this aspect of theosmodule.Faster
digitsize checks duringfloatcode generation for better compile time performance.Faster
listoperations due to usingPyList_CheckExacteverywhere this is applicable, this mostly makes debug operations faster, but also deep copying list values, or extending lists with iterables, etc.Optimization: Collect module usages of the given module during its abstract execution. This avoids a full tree visit afterwards only to find them. It is much cheaper to collect them while we go over the tree. This enhances the scalability of large compilations by ca. 5%.
Optimization: Faster determination of loop variables. Rather than using a generic visitor, we use the children having generator codes to add traversal code that emits relevant variables to the user directly.
Cache extra search paths in order to avoid repeated directory operations as these are known to be slow at times.
Standalone: Do not include
py.typeddata files, these indicator files are for IDEs, but not needed at run time ever.Make sure that the generic attribute code optimization is also effective in cases where a Python DLL is used. Previously this was only guaranteed to be used with static libpython.
Faster list constant usage
Small immutable constants get their own code that is much faster for small sizes.
Medium sized lists get code that just is hinted the size, but takes items from a source list, still a lot faster.
For repeated lists where all elements are the same, we use a dedicated helper for all sizes, that is even faster except for small ones with LTO enabled, where the C compiler may already do that effectively.
Added optimization for
os.path.abspathandos.path.isabswhich of course have not as much potential for compile time optimization, but we needed them for providing.absolute()for the meta path loader files implementation.Faster class dictionary propagation decision. Instead of checking for trace types, let the trace object decide. Also abort immediately on first inhibit, rather than checking all variables. This improves Python2 compile time, and Python3 where this code is now starting to get used when the class dictionary is shown to have
dicttype.Specialize type method
__prepare__which is used in the Python3 re-formation of class bodies to initialize the class dictionary. Where the metaclass is resolved, we can use this to decide that the standard empty dictionary is used statically, enabling class dictionary propagation for best scalability.At this time this only happens with classes without bases, but we expect to soon do this with all compile time known base classes. At this time, these optimization to become effective, we need to optimize meta class selection from bases classes, as well as modification of base classes with
__mro_entries__methods.The
boolbuilt-in on boolean values is now optimized away.Since it’s used also for conditions being extracted, this is actually somewhat relevant, since it could keep code alive in side effects at least for no good reason and this allows a proper reduction.
Organisational
Project: Require the useful stuff for installation of Nuitka already. These are things we cannot inline really, but otherwise will frequently be warned about, e.g.
zstandardfor onefile andordered-setfor fast operation, but we do not require packages that might fail to install.User Manual: Added section about virus scanners and how to avoid false reports.
User Manual: Enhanced description for plugin module loading, the old code was too complicated and actually working only for a mode of including plugin code that is discouraged.
User Manual: Fix section for standalone finding files on wrong level.
Windows: Using the console on Python 3.4 to 3.7 is not working very well with e.g. many Asian systems. Nuitka fails to setup the encoding for stdin and stdout or this platform. It can then produce exceptions on input or output of unicode data, that doesn’t overlap with UTF-8.
We now inform the user of these older Python with a warning and mnemonic, to either disable the console or to upgrade to Python 3.8 or higher, which normally won’t be much of an issue for most users. Added in 1.4.1 already.
Debugging: Fixup debugging reference count output with Python3.4. For Python 3.11 compatibility tests, actually it was useful to compare with a version that doesn’t have coroutines yet. Never tell me, supporting old versions is not good.
Deprecating support for Python 3.3, there is no apparent use of this version, and it has gained specific bugs, that are indeed not worth our time. Python 2.6 and Python 2.7 will continue to be supported probably indefinitely.
Recommend
ordered-setfor Python 3.7 to 3.9 as well, as not only for 3.10+ because on Windows, to installordersetMSVC needs to be installed, whereasordered-sethas a wheel for ready use.Actually zstandard requirement is for a minimal version, added that to the requirement files.
Debugging: Let’s not re-execute Nuitka in case we are debugging it from Visual Code.
Debugging: Include the .pdb files in Windows standalone mode for proper C tracebacks, should that be necessary.
UI: Detect the GitHub flavor of Python as well.
Quality: Check the clang-format version to avoid older ones with bugs that made it switch whitespace for one file. Using the one from the Visual Code C extension is a good idea, since it will often be available. Running the checks on a newer Ubuntu GitHub Actions runner to have the correct version available.
Quality: Updated the version of rstfmt and isort to the latest versions.
GitHub: Added a commented-out section for enabling ssh login, which we occasionally need to git bisect problems specific to the GitHub Python flavor.
Plugins: Report the problematic plugin name with the module name or DLL name when these raise exceptions.
Use the ordered-set package for Python 3.7+ rather than only Python 3.10+, because it doesn’t need any build dependency on Windows.
UI: When showing source changes, also display the module name with the changed code.
UI: Use function intended for user query when asking about downloads too.
UI: Do not report usage of ccache for linking from newer versions; that is not relevant.
Onefile: Make sure we have proper error codes when reporting IO errors.
MSVC: Detect a version for developer prompts too. This version is needed for use in enabling version specific features.
Started UML diagrams with plantuml that will need to be completed before using them in the new and more visual parts of the Nuitka documentation.
UI: Check icon conversion capability at the start of compilation rather than exiting with an error at the very end, informing the user about the imageio packages required to convert to native icons.
Quality: Enhanced autoformat on Windows, which was susceptible to tools introducing Windows newlines before other steps were performed, which could then become confused; also enforcing the use of UTF-8 encoding when working with Nuitka source code for formatting.
Cleanups
The delvewheel plugin was still using a zmq class name from its original implementation; adapted that.
Use a common template for generator frames as well. This made them also work with 3.11, by avoiding duplication.
Applied code formatting to many more files in tests, etc.
Removed a few micro benchmarks that are instead to be covered by construct-based tests now.
Enhanced code generation for specialized in-place operations to avoid unused code for operations that do not have any shortcut where the operation would actually be performed in-place on an object with reference count 1.
Better code generation for module variable in-place operations with proper indentation and no repeated calls.
Plugins: Use the namedtuple factory that we created for informational tuples from plugins as well.
Make details of the download utils module more accessible for better reuse.
Remove the last remaining Python 3.2 version check in C code; for us this is just Python3, with 3.2 being unsupported.
Cleanup: name the generated call helper file properly, indicating that it is a generated file.
Tests
Made the CPython3.10 test suite largely executable with Python 3.11 and running that with CI now.
Allow measuring constructs without writing the code diff again. Was crashing when no filename was given.
Make Python3.11 test execution recognized by generally accepting partially supported versions to execute the tests with.
Handle also newfstat directory checks in the file usage scan. These are used on newer Linux systems.
GitHub: In actions use --report for coverage and upload the reports as artifacts.
Use the no-qt plugin to avoid warnings in the matplotlib test rather than disabling the warnings about Qt bindings.
macOS: Detect if the machine can take runtime traces, which on Apple Silicon by default it cannot.
macOS: Cover all APIs for file tracing, rather than just one for extended coverage.
Fix: the distutils test was not installing the built wheels, but the source archive, and was therefore compiling that a second time.
For the tests using pyproject.toml, Nuitka was always downloaded from PyPI rather than using the version under test.
Ignore ld info output about mismatching architecture libraries being ignored. Fixed in 1.4.1 already.
Summary
With this release, an important new avenue for scalability has been started. While for Python2 class bodies were very often reduced to just that dictionary creation, with Python3 that was not the case, due to the many new complexities. This release makes a start, and we will be able to continue on this path towards much more scalable class creation code. And while the performance does not really matter all that much for these, knowing them will ultimately lead us to “compiled classes” as our own type, and “compiled objects” that may well perform much faster.
Already now, the enhancements to class creation codes will result in smaller binaries, but much more is expected the more this is completed.
The majority of the work was of course to become Python3.11 compatible, and unfortunately the attribute lookups are not as optimized as for 3.10 yet, which may cause disappointing results for performance initially. We will need to complete that before benchmarks will make much sense.
For the next release, full Python 3.11 support is planned. I believe it should be usable. Problems with 3.11 may get hotfixes, but ultimately the develop version is probably the one to recommend when using 3.11 with Nuitka, as there will be the whole set of fixes, since not everything will be ported back.
The new reports should be used in bug reporting soon. We foresee that for issue reports, these may well become mandatory. Together with the ability to create a virtualenv from the reports, this may make reproducing issues a breeze, but first tries on complex projects were also highlighting that it may not be as simple.
Sandipan Dey
Histopathologic Cancer Detection with CNN
This problem appeared in a project in the Coursera course Introduction to Deep Learning (by the University of Colorado Boulder) and is taken from a past Kaggle competition. Brief description of the problem and data: In this mini-project, we shall use binary classification to classify an image as cancerous (class label 1) or benign (class label 0), i.e., to … Continue reading Histopathologic Cancer Detection with CNN
March 25, 2023
CodersLegacy
Changing HTML tags and content with Python BeautifulSoup
If you’re working with HTML in Python, the BeautifulSoup library is an excellent choice for parsing and manipulating HTML content. Most people only know about BeautifulSoup in the context of “parsing” HTML content. Little do they know that BeautifulSoup can also be used for changing (replacing) tags and HTML content in Python.
For example, let’s assume you want to swap out all of the “h2” tags for “h3” tags inside some HTML content. BeautifulSoup can automate that for you. There is obviously a lot more that it can do, which we will explore throughout this article.
Let’s get started.
Getting started
First, we’ll need to import the necessary modules. We’ll be using bs4 for parsing HTML and requests to fetch a web page to work with. In case you didn’t know, BeautifulSoup can’t actually acquire the HTML content. It just parses it.
from bs4 import BeautifulSoup
import requests
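As a quick aside, here is a minimal sketch of how the two libraries fit together; the URL is just a placeholder, and any page would do:
# Fetch a page with requests and hand its text to BeautifulSoup for parsing.
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title)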
Also, I hope you actually have BeautifulSoup installed. It’s not part of the standard Python library, so it needs to be downloaded and installed separately (typically with pip install beautifulsoup4).
For the purposes of this tutorial, we’ll be using a simple HTML file that contains a few HTML tags. It’s easier to explain things this way.
html = """
<!DOCTYPE html>
<html>
<head>
<title>My Web Page</title>
</head>
<body>
<p>Hello, world!</p>
</body>
</html>
"""
It’s stored in a multiline string, by the way. Another cool thing you can do is copy-paste this into a separate HTML file and read it from there, as the short sketch below shows. This reduces the clutter a bit, especially when you have larger HTML files.
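If you prefer the file approach, here is a small sketch of what that might look like (page.html is a hypothetical filename you would save the markup under):
# Read the same markup from a separate file instead of a multiline string.
with open("page.html", encoding="utf-8") as f:
    html = f.read()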
Changing HTML tags with BeautifulSoup
Now that we have our environment set up, we can begin using BeautifulSoup to parse and manipulate our HTML. We won’t be doing too much parsing here, mostly changing and replacing tags. So if you want to learn more about parsing, selectors, and other bs4 concepts, check out our main tutorial.
Back to the tutorial.
I rendered the HTML from earlier in our browser, just so we can take a look at how it currently looks before we make any modifications.
[screenshot: the original page rendered in the browser]
First we will boot up our parser, by loading the HTML content into it.
from bs4 import BeautifulSoup
html = """
<!DOCTYPE html>
<html>
<head>
<title>My Web Page</title>
</head>
<body>
<p>Hello, world!</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
Our very first goal will be to change that paragraph tag into an H1 tag. To do this, we must first locate the paragraph tag, and then change it.
There are two ways we can do this. Here is the first method:
p_tag = soup.find('p')
p_tag.name = 'h1'
Printing out this HTML content, as shown below, proves it was successfully changed.
print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title>
My Web Page
</title>
</head>
<body>
<h1>
Hello, world!
</h1>
</body>
</html>
I also rendered this HTML content in the browser so we could get a look at the real thing.
[screenshot: the page rendered with the new h1 tag]
We can also change the content of the tag in a similar way. Instead of modifying the “name” attribute, just change the “string” attribute.
p_tag = soup.find('p')
p_tag.name = 'h1'
p_tag.string = 'CodersLegacy'
I’ll just directly show you the HTML rendered output.
[screenshot: the rendered page showing the changed text]
Cool right?
The other way of doing this is using the replace_with() method. This method is a bit more complex so we will discuss it in a separate section, along with some other concepts.
Creating Tags in BeautifulSoup
Before we talk about replace_with(), I want to discuss how to “create” and “add” tags into BeautifulSoup. Earlier we just talked about modifying existing tags; this time we will be creating actual HTML elements and adding them into our content.
There are two ways of creating new Tags: either using the Tag class, or the new_tag() method. I don’t want to make an extra import for the Tag class, so let’s stick to the new_tag() method, available on the soup object.
Here is an example, where we have created a “p” tag, along with a bunch of attributes, such as an ID and Class. I don’t actually intend to use these attributes; they are just here for demonstration purposes.
p_tag = soup.new_tag("p", attrs = [("id", "1"), ("class", "meow")])
p_tag.string = "Goodbye, World"
The tag is created empty by default, so we added some text into it. Now that we have this tag, we want to add it into our HTML content somehow.
To do this, we will first select an HTML element into which we want to add this. Let’s go ahead and add this into our “body” tag, alongside the other paragraph element. To do so, we will use the “append” method.
First we locate the tag:
body_tag = soup.find("body")
Then call the append() method:
body_tag.append(p_tag)
Here is the output HTML content.
<!DOCTYPE html>
<html>
<head>
<title>
My Web Page
</title>
</head>
<body>
<p>
Hello, world!
</p>
<p class="meow" id="1">
Goodbye, World
</p>
</body>
</html>
And here is the rendered version.
[screenshot: the rendered page with the appended paragraph]
Replacing Tags in BeautifulSoup with replace_with()
The replace_with() method in BeautifulSoup can be used to replace an HTML tag or its contents. This method is called on the tag you wish to replace, and takes as a parameter the element you wish to put into the HTML content in its place. You can pass additional content (such as a string) as a further argument, but note that it is inserted after the replacement tag rather than inside it, as the output below shows. If you only pass a new empty tag, the replacement will have no content (an empty tag).
Here’s an example of how to replace a tag:
from bs4 import BeautifulSoup
html = """
<!DOCTYPE html>
<html>
<head>
<title>My Web Page</title>
</head>
<body>
<p>Hello, world!</p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
p_tag = soup.find('p')
# The new h1 replaces the p tag; the extra string argument lands after it, not inside it.
p_tag.replace_with(soup.new_tag('h1'), "This is a new Paragraph")
print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title>
My Web Page
</title>
</head>
<body>
<h1>
</h1>
This is a new Paragraph
</body>
</html>
If you only want to change the text inside an HTML tag, you can do:
soup = BeautifulSoup(html, 'html.parser')
p_tag = soup.find('p')
p_tag.string.replace_with('Goodbye, world!')
This keeps the tag the same, but changes the inner content.
This marks the end of the “Changing HTML tags and content with Python BeautifulSoup” Article. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial content can be asked in the comments section below.
The post Changing HTML tags and content with Python BeautifulSoup appeared first on CodersLegacy.
PyBites
Teaching packaging by building a Python package
Listen here
Or watch here (recommended because there will be code!)
Welcome back to our podcast. In this week’s episode we look at Python packaging.
I was teaching this on our weekly PDM Code Clinic call and we ended up building quite a useful Pybites Open Source tool.
Introducing pybites-search, a command line tool to search our content (articles, Bite exercises, podcast episodes, youtube videos and tips).
We look at how to build a package and some of the code + design that went into pybites-search and how open sourcing this is a double win: our PDM bot project can leverage it and people can now contribute to this project.
Hope you enjoy this episode and comment your thoughts below as well as preferences for more Python / Developer / Mindset content. Thanks for watching.
Links / resources:
- Packaging Python Projects docs
- Pybites search tool / package
- Check out our PDM program
- Currently reading: The Gap and The Gain
Glyph Lefkowitz
What Would You Say You Do Here?
What have I been up to?
Late last year, I launched a Patreon. Although not quite a “soft” launch — I did toot about it, after all — I didn’t promote it very much.
I started this way because I realized that if I didn’t just put something up I’d be dithering forever. I’d previously been writing a sprawling monster of an announcement post that went into way too much detail, and kept expanding to encompass more and more ideas until I came to understand that salvaging it was going to be an editing process just as brutal and interminable as the writing itself.
However, that post also included a section where I just wrote about what I was actually doing.
So, for lots of reasons1, there are a diverse array of loosely related (or unrelated) projects below which may not get finished any time soon. Or, indeed, may go unfinished entirely. Some are “done enough” now, and just won’t receive much in the way of future polish.
That is an intentional choice.
The rationale, as briefly as I can manage, is: I want to lean into my strength2 of creative, divergent thinking, and see how these ideas pan out without committing to them particularly intensely. My habitual impulse, for many years, has been to lean extremely hard on strategies that compensate for my weaknesses in organization, planning, and continued focus, and attempt to commit to finishing every project to prove that I’ll never flake on anything.
While the reward tiers for the Patreon remain deliberately ambiguous3, I think it would be fair to say that patrons will have some level of influence in directing my focus by providing feedback on these projects, and requesting that I work more on some and less on others.
So, with no further ado: what have I been working on, and what work would you be supporting if you signed up? For each project, I’ll be answering 3 questions:
- What is it?
- What have I been doing with it recently?
- What are my plans for it?
This. i.e. blog.glyph.im
What is it?
For starters, I write stuff here. I guess you’re reading this post for some reason, so you might like the stuff I write? I feel like this doesn’t require much explanation.
What have I done with it recently?
You might appreciate the explicitly patron-requested Potato Programming post, a screed about dataclass, or a deep dive on the difficulties of codesigning and notarization on macOS along with an announcement of a tool to remediate them.
What are my plans for it?
You can probably expect more of the same; just all the latest thoughts & ideas from Glyph.
Twisted
What is it?
If you know of me you probably know of me as “the Twisted guy” and yeah, I am still that. If, somehow, you’ve ended up here and you don’t know what it is, wow, that’s cool, thanks for coming, super interested to know what you do know me for.
Twisted is an event-driven networking engine written in Python, the precursor and inspiration for the asyncio module, and a suite of event-driven programming abstractions, network protocol implementations, and general utility code.
What have I done with it recently?
I’ve gotten a few things merged, including type annotations for getPrimes and making the bundled CLI OpenSSH server replacement work at all with public key authentication again, as well as some test cleanups that reduce the overall surface area of old-style Deferred-returning tests that can be flaky and slow.
I’ve also landed a posix_spawnp-based spawnProcess implementation which speeds up process spawning significantly; this can be as much as 3x faster if you do a lot of spawning of short-running processes.
I have a bunch of PRs in flight, too, including better annotations for FilePath, Deferred, and IReactorProcess, as well as a fix for the aforementioned posix_spawnp implementation.
What are my plans for it?
A lot of the projects below use Twisted in some way, and I continue to maintain it for my own uses. My particular focus is on quality-of-life improvements: issues that someone starting out with a Twisted project will bump into and find confusing or difficult. I want it to be really easy to write applications with Twisted, and I want to use my own experiences with it to guide that work.
I also do code reviews of other folks’ contributions; we do still have over 100 open PRs right now.
DateType
What is it?
DateType is a workaround for a very specific bug in the way that the datetime standard library module deals with type composition: to wit, that datetime is a subclass of date but is not Liskov-substitutable for it. There are even #type:ignore comments in the standard library type stubs to work around this problem, because if you did this in your own code, it simply wouldn’t type-check.
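Here is a tiny illustration of the trap (my own example, not taken from DateType’s documentation):
from datetime import date, datetime

def days_until(deadline: date) -> int:
    # Annotated to accept a date; a datetime also type-checks, since it subclasses date...
    return (deadline - date.today()).days

print(days_until(date(2030, 1, 1)))       # fine
try:
    days_until(datetime(2030, 1, 1))      # ...but this blows up at runtime
except TypeError as error:
    print("datetime is not substitutable for date:", error)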
What have I done with it recently?
I updated it a few months ago to expose DateTime and Time directly (as opposed to AwareDateTime and NaiveDateTime), so that users could specialize their own functions that took either naive or aware times without ugly and slightly-incorrect unions.
What are my plans for it?
This library is mostly done for the time being, but if I had to polish it a bit I’d probably do two things:
- a readthedocs page for nice documentation
- write a PEP to get this integrated into the standard library
Although the compatibility problems are obviously very tricky and a PEP would probably be controversial, this is ultimately a bug in the stdlib, and should be fixed upstream there.
Automat
What is it?
It’s a library to make deterministic finite-state automata easier to create and work with.
What have I done with it recently?
Back in the middle of last year, I opened a PR to create a new, completely different front-end API for state machine definition. Instead of something like this:
[code listing: the existing Automat API]
this branch lets you instead do something like this:
[code listing: the proposed new API]
In other words, it creates a state for every type, and type safety that much more cleanly expresses what methods can be called and by whom; no need to make everything private with tons of underscore-prefixed methods and attributes, since all the caller can see is “an implementation of MachineProtocol”; your state classes can otherwise just be normal classes, which do not require special logic to be instantiated if you want to use them directly.
Also, by making a state for every type, it’s a lot cleaner to express that certain methods require certain attributes, by simply making them available as attributes on that state and then requiring an argument of that state type; you don’t need to plot your way through the outputs generated in your state graph.
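To make the idea concrete, here is a rough sketch of the pattern in plain Python (my own illustration, not Automat’s actual API): each state is its own class, and a method only exists on the states where calling it makes sense.
from dataclasses import dataclass

@dataclass
class Locked:
    code: str
    def unlock(self, attempt: str) -> "Unlocked | Locked":
        # Only the Locked state carries the code it needs to check against.
        return Unlocked() if attempt == self.code else self

@dataclass
class Unlocked:
    def lock(self, code: str) -> Locked:
        # Transitions are expressed by returning the next state object.
        return Locked(code)

state = Unlocked().lock("1234")
state = state.unlock("1234")
print(type(state).__name__)  # Unlocked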
What are my plans for it?
I want to finish up dealing with some issues with that branch - particularly the ugly patterns for communicating portions of the state core to the caller and also the documentation; there are a lot of magic signatures which make sense in heavy usage but are a bit mysterious to understand while you’re getting started.
I’d also like the visualizer to work on it, which it doesn’t yet, because the visualizer cribs a bunch of state from MethodicalMachine when it should be working purely on core objects.
Secretly
What is it?
This is an attempt at a holistic, end-to-end secret management wrapper around Keyring. Whereas Keyring handles password storage, this handles the whole lifecycle of looking up the secret to see if it’s there, displaying UI to prompt the user (leveraging a pinentry program from GPG if available), and so on.
What have I done with it recently?
It’s been a long time since I touched it.
What are my plans for it?
- Documentation. It’s totally undocumented.
- It could be written to be a bit more abstract. It dates from a time before asyncio, so its current Twisted requirement for Deferred could be made into a generic Awaitable one.
- Better platform support for Linux & Windows when GPG’s pinentry is not available.
- Support for multiple accounts so that when the user is prompted for the relevant credential, they can store it.
- Integration with 1Password via some of their many potentially relevant APIs.
Fritter
What is it?
Fritter is a frame-rate independent timer tree.
In the course of developing Twisted, I learned a lot about time and timers. LoopingCall encodes some of this knowledge, but it’s very tightly coupled to the somewhat limited IReactorTime API.
Also, LoopingCall was originally designed with the needs of media playback (particularly network streaming audio playback) in mind, but I have used it more for background maintenance tasks and for animations. Both of these things have requirements that LoopingCall makes awkward but FRITTer is designed to meet:
- At higher loads, surprising interactions can occur with the underlying priority queue implementation, and different algorithms may make a significant difference to performance. Fritter has a pluggable implementation of a priority queue and is carefully minimally coupled to it.
- Driver selection is a first-class part of the API, with an included, public “Memory” driver for testing, rather than LoopingCall’s “testing is at least possible” .reactor attribute. This means that out of the box it supports both Twisted and asyncio, and can easily have other things added.
- The API is actually generic on what constitutes time itself, which means that you can use it for both short-term (i.e.: monotonic clock values as float-seconds) and long-term (civil times as timezone-aware datetime objects) recurring tasks. Recurrence rules can also be arbitrary functions.
- There is a recursive driver (this is the “tree” part) which both allows for:
  a. groups of timers which can be suspended and resumed together, and
  b. scaling of time, so that you can e.g. speed up or slow down the ticks for AIs, groups of animations, and so on, also in groups.
- The API is also generic on what constitutes work. This means that, for example, in a certain timer you can say “all work units scheduled on this scheduler, in addition to being callable, must also have an asJSON method”. And in fact that’s exactly what the longterm module in Fritter does (see the sketch after this list).
I can neither confirm nor deny that this project was factored out of a game engine for a secret game project which does not appear on this list.
What have I done with it recently?
Besides realizing, in the course of writing this blog post, that its CI was failing its code quality static checks (oops), the last big change was the preliminary support for recursive timers and serialization.
What are my plans for it?
- These haven’t been tested in anger yet and I want to actually use them in a larger project to make sure that they don’t have any necessary missing pieces.
- Documentation.
Encrust
What is it?
I have written about Encrust quite recently so if you want to know about it, you should probably read that post. In brief, it is a code-shipping tool for py2app. It takes care of architecture-independence, code-signing, and notarization.
What have I done with it recently?
Wrote it. It’s brand new as of this month.
What are my plans for it?
I really want this project to go away as a tool with an independent existence. Either I want its lessons to be fully absorbed into Briefcase or perhaps py2app itself, or for it to become a library that those things call into to do its thing.
Various Small Mac Utilities
What is it?
- QuickMacApp is a very small library for creating status-item “menu bar apps” in Python which don’t have much of a UI but want to run some Python code in the background and occasionally pop up a notification or ask the user a question or something. The idea is that if you have a utility that needs a minimal UI to just ask the user one or two things, you should be able to give it a GUI immediately, without thinking about it too much.
- QuickMacHotkey: this is a very minimal API to register hotkeys on macOS. This example is what comes up if you search the web for such a thing, but it hasn’t worked on a current Python for about 11 years. This isn’t the “right” way to do such a thing, since it provides no UI to set the shortcut; you’d have to hard-code it. But MASShortcut is now archived and I haven’t had the opportunity to investigate HotKey, so for the time being, it’s a handy thing, and totally adequate for the sort of quick-and-dirty applications you might make with QuickMacApp.
- VEnvDotApp is a way of giving a virtualenv its own Info.plist and bundle ID, so that command-line python tools that just need to pop up a little mac GUI, like an alert or a notification, can do so with cross-platform tools without looking like it’s an app called “Python”, or in some cases breaking entirely.
- MOPUp is a command-line updater for upstream Python.org macOS Python. For distributing third-party apps, Python.org’s version is really the one you want to use (it’s universal2, and it’s generally built with compiler options that make it a distributable thing itself) but updating it by downloading a .pkg file from a web browser is kind of annoying.
What have I done with it recently?
I’ve been releasing all these tools as they emerge and are factored out of other work, and they’re all fairly recent.
What are my plans for it?
I will continue to factor out any general-purpose tools from my platform-specific Python explorations — hopefully more Linux and Windows too, once I’ve got writing code for my own computer down, but most of the tools above are kind of “done” on their own, at the moment.
The two things that come to mind though are that QuickMacApp should have a way of owning the menubar sometimes (if you don’t have something like Bartender, menu-bar-status-item-only apps can look like they don’t do anything when you launch them), and that MOPUp should probably be upstreamed to python.org.
Pomodouroboros
What is it?
Pomodouroboros is a pomodoro timer with a highly opinionated take. It’s based on my own experience of ADHD time blindness, and is more like a therapeutic intervention for that specific condition than a typical “productivity” timer app.
In short, it has two important features that I have found lacking in other tools:
- A gigantic, absolutely impossible to ignore visual timer that presents a HUD overlay over your entire desktop. It remains low-opacity and static most of the time but pulses every 30 seconds to remind you that time is passing.
- Rather than requiring you to remember to set a timer before anything happens, it has an idea of “work hours” when you want to be time-sensitive and presents constant prompting to get started.
What have I done with it recently?
I’ve been working on it fairly consistently lately. The big things I’ve been doing have been:
- factoring things out of the Pomodouroboros-specific code and into QuickMacApp and Encrust.
- Porting the UI to the redesigned core of the application, which has been implemented and tested in platform-agnostic Python but does not have any UI yet.
- fully productionizing the build process and ensuring that Encrust is producing binary app bundles that people can use.
What are my plans for it?
In brief, “finish the app”. I want this to have its own website and find a life beyond the Python community, with people who just want a timer app and don’t care how it’s written. The top priority is to replace the current data model, which is to say the parts of the UI that set and evaluate timers and edit the list of upcoming timers (the timer countdown HUD UI itself is fine).
I also want to port it to other platforms, particularly desktop Linux, where I know there are many users interested in such a thing. I also want to do a CLI version for folks who live on the command line.
Finally: Pomodouroboros serves as a test-bed for a larger goal, which is that I want to make it easier for Python programmers, particularly beginners who are just getting into coding at all, to write code that not only interacts with their own computer, but that they can share with other users in a real way. As you can see with Encrust and other projects above, as much as I can I want my bumpy ride to production code to serve as trailblazing so that future travelers of this path find it as easy as possible.
And Here Is Where The CTA Goes
If this stuff sounds compelling, you can obviously sign up, that would be great. But also, if you’re just curious, go ahead and give some of these projects some stars on GitHub or just share this post. I’d also love to hear from you about any of this!
If a lot of people find this compelling, then pursuing these ideas will become a full-time job, but I’m pretty far from that threshold right now. In the meanwhile, I will also be doing a bit of consulting work.
I believe much of my upcoming month will be spoken for with contracting, although quite a bit of that work will also be open source maintenance, for which I am very grateful to my generous clients. Please do get in touch if you have something more specific you’d like me to work on, and you’d like to become one of those clients as well.
1. Reasons which will have to remain mysterious until I can edit about 10,000 words of abstract, discursive philosophical rambling into something vaguely readable. ↩
2. A strength which is common to many, indeed possibly most, people with ADHD. ↩
3. While I want to give myself some leeway to try out ideas without necessarily finishing them, I do not want to start making commitments that I can’t keep. Particularly commitments that are tied to money! ↩