Planet Python
Last update: November 21, 2021 04:40 PM UTC
November 21, 2021
Anarcat
The last syncmaildir crash
My syncmaildir (SMD) setup failed me one too many times (previously, previously). In an attempt to migrate to an alternative mail synchronization tool, I looked into using my IMAP server again, and found out my mail spool was in a pretty bad shape. I'm comparing mbsync and offlineimap in the next post.
The latest crash
On Monday, SMD just started failing with this error:
nov 15 16:12:19 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:12:22 angela systemd[2305]: smd-pull.service: Succeeded.
nov 15 16:12:22 angela systemd[2305]: Finished pull emails with syncmaildir.
nov 15 16:14:08 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:14:11 angela systemd[2305]: Failed to start pull emails with syncmaildir.
nov 15 16:16:14 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Unable to get any data from the other endpoint.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: This problem may be transient, please retry.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Hint: did you correctly setup the SERVERNAME variable
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: on your client? Did you add an entry for it in your ssh
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: configuration file?
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error
nov 15 16:16:17 angela smd-pull[27188]: register: smd-client@localhost: TAGS: error::context(handshake) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:16:17 angela systemd[2305]: Failed to start pull emails with syncmaildir.
What is frustrating is that there's actually no network error here. Running the command by hand I did see a different message, but now I have lost it in my backlog. It had something to do with a filename being too long, and I gave up debugging after a while. This happened suddenly too, which added to the confusion.
In a fit of rage I started this blog post and experimenting with alternatives, which led me down a lot of rabbit holes.
Reviewing my previous mail crash documentation, it seems most solutions involve talking to an IMAP server, so I figured I would just do that. Wanting to try something new, I gave isync (AKA mbsync) a try. Oh dear, I did not expect how much trouble just talking to my IMAP server would be. It's not isync's fault, for what it's worth; it's the primary tool I used to debug things, and it served me well in that regard.
mailbox corruption
The first thing I found out is that certain messages in the IMAP spool
were corrupted. mbsync would stop on a FETCH command and Dovecot
would give me those errors on the server side:
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Maildir filename has wrong W value, renamed the file from /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S to /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495:2,S
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Deleting corrupted cache record uid=1582: UID 1582: Broken virtual size in mailbox junk: read(/home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S): FETCH BODY[] got too little data: 2540 vs 2578
and:
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=mail stream)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: Deleting corrupted cache record uid=19288: UID 19288: Broken physical size in mailbox Sent: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=)
nov 16 13:53:08 marcos dovecot[3520770]: imap-login: Panic: epoll_ctl(del, 7) failed: Bad file descriptor
"wrong W value"
At least the first error was automatically healed by Dovecot (by
renaming the file without the W= flag). The problem is that the
FETCH command fails and mbsync exits noisily. So you need to
constantly restart mbsync with a silly command like:
while ! mbsync -a; do sleep 1; done
"cached message size larger than expected"
The second problem is much harder to fix, because Dovecot does not recover automatically. The above loop was taking too long: one full IMAP roundtrip (with authentication) for every corrupt message...
Workaround
So I read a lot on the Dovecot documentation on the maildir format, and wrote an extensive fix script for those two errors. The script worked and mbsync was able to sync the entire mail spool.
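The script itself is too long to reproduce here, but the general idea is simple: in Dovecot's Maildir, each filename embeds a S= (physical size) value and sometimes a W= (virtual size) value, and both errors above boil down to those values disagreeing with the file on disk. A minimal sketch of the approach (not the actual script; stop mail delivery before trying anything like this) would be:
import os
import re
import sys

# matches the ,S=<size> (physical size) and optional ,W=<size>
# (virtual size) values embedded in a Dovecot maildir filename
SIZE_RE = re.compile(r",S=\d+(,W=\d+)?")

def fixed_path(path):
    """Return path with S= corrected to the on-disk size and W= dropped."""
    directory, name = os.path.split(path)
    if not SIZE_RE.search(name):
        return path
    actual_size = os.path.getsize(path)
    new_name = SIZE_RE.sub(",S=%d" % actual_size, name, count=1)
    return os.path.join(directory, new_name)

for root, dirs, files in os.walk(sys.argv[1]):
    for name in files:
        path = os.path.join(root, name)
        new_path = fixed_path(path)
        if new_path != path:
            print("renaming %s -> %s" % (path, new_path))
            os.rename(path, new_path)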
And no, rebuilding the index files didn't work. Also tried
doveadm force-resync -u anarcat which didn't do anything.
In the end I also had to do this, because the wrong cache values were also stored elsewhere.
service dovecot stop ; find -name 'dovecot*' -delete; service dovecot start
This would have totally broken any existing clients, but thankfully I'm starting from scratch (except maybe webmail, but I'm hoping it will self-heal as well, assuming it only has a cache and not a full replica of the mail spool).
Incoherence between Maildir and IMAP
Unfortunately, the first mbsync was incomplete as it was missing about 15,000 mails:
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
384836
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
369221
As it turns out, mbsync was not at fault here either: this was yet
more mail spool corruption.
It's actually 26 folders with different sizes, which can be found with:
for folder in * .[^.]* ; do
printf "%s\t%d\n" $folder $(find "$folder" -type f -a \! -name '.*' | wc -l );
done
The special \! -name '.*' bit is there to ignore the mbsync metadata: mbsync creates .uidvalidity and .mbsyncstate files in every folder. That accounts for about 200 files, but since they are spread across all the folders, they made it impossible to see where the problem actually was.
Here is what the diff looks like:
--- Maildir-list 2021-11-17 20:42:36.504246752 -0500
+++ Maildir-mbsync-list 2021-11-17 20:18:07.731806601 -0500
@@ -6,16 +6,15 @@
[...]
.Archives 1
.Archives.2010 3553
-.Archives.2011 3583
-.Archives.2012 12593
+.Archives.2011 3582
+.Archives.2012 620
.Archives.2013 8576
.Archives.2014 11057
-.Archives.2015 8173
+.Archives.2015 8165
.Archives.2016 54
.band 34
.bitbuck 1
@@ -38,13 +37,12 @@
.couchsurfers 2
-cur 11285
+cur 11280
.current 130
.cv 2
.debbug 262
-.debian 37544
-drafts 1
-.Drafts 4
+.debian 37533
+.Drafts 2
.drone 241
.drupal 188
.drupal-devel 303
[...]
Misfiled messages
It's a bit all over the place, but we can already notice some huge
differences between mailboxes, for example in the Archives
folders. As it turns out, at least 12,000 of those missing mails were
actually misfiled: instead of being in the
Maildir/.Archives.2012/cur/ folder, they were directly in
Maildir/.Archives.2012/. This is something that doesn't matter for SMD (and I wondered whether it mattered for notmuch; it does: notmuch suddenly found 12,000 new mails), but it definitely matters to Dovecot and therefore to mbsync...
After moving those files around, we still have 4,000 messages missing:
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
381196
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
385053
The problem is that those 4,000 missing mails are harder to track. Take, for example, .Archives.2011, which has a single message missing out of 3,582. And the files are not identical: the checksums don't match after going through the IMAP transport, so we can't use a tool like hashdeep to compare the trees and find why that one file is missing.
"register" folder
One big chunk of the 4,000, however, is a special folder called
register in my spool, which I was syncing separately (see Securing
registration email for details on that setup). That actually
covers 3,700 of those messages, so I actually have a more modest 300
messages to figure out, after (easily!) configuring mbsync to sync
that folder separately:
@@ -30,9 +33,29 @@ Slave :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
-Patterns *
+#Patterns *
+Patterns * !register !.register
# Automatically create missing mailboxes, both locally and on the server
#Create Both
Create slave
# Sync the movement of messages between folders and deletions, add after making sure the sync works
#Expunge Both
+
+IMAPAccount anarcat-register
+Host imap.anarc.at
+User register
+PassCmd "pass imap.anarc.at-register"
+SSLType IMAPS
+CertificateFile /etc/ssl/certs/ca-certificates.crt
+
+IMAPStore anarcat-register-remote
+Account anarcat-register
+
+MaildirStore anarcat-register-local
+SubFolders Maildir++
+Inbox ~/Maildir-mbsync/.register/
+
+Channel anarcat-register
+Master :anarcat-register-remote:
+Slave :anarcat-register-local:
+Create slave
"tmp" folders and empty messages
After syncing the "register" messages, I end up with a measly 160 emails out of sync:
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
384900
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
385059
Argh. After more digging, I found 131 mails in the tmp/ directories of the client's mail spool. Mysterious! On the server side, there are even more files, and not the same ones. Could those be mails that were left behind during a failed delivery, a power failure, or some other crash? Who knows.
The first thing to do with those is to cleanup a bunch of empty files (21 on angela):
find .[^.]*/tmp -type f -empty -delete
As it turns out, they are all duplicates, in the sense that notmuch can easily find a copy of each file, with the same message ID, in its database. In other words, this hairy command returns nothing:
find .[^.]*/tmp -type f | while read path; do
msgid=$(grep -m 1 -i ^message-id "$path" | sed 's/Message-ID: //i;s/[<>]//g');
if notmuch count --exclude=false "id:$msgid" | grep -q 0; then
echo "$path <$msgid> not in notmuch" ;
fi;
done
... which is good. Or, to put it another way, this is safe:
find .[^.]*/tmp -type f -delete
Poof! 314 mails cleaned on the server side. Interestingly, SMD doesn't pick up on those changes at all and still sees files in tmp/ directories on the client side, so we need to apply the same twisted logic there.
notmuch to the rescue again
After cleaning that on the client, we get:
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
384928
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
384901
Ha! 27 mails difference. Those are the really sticky, unclear ones. I was hoping a full sync might clear that up, but after deleting the entire directory and starting from scratch, I end up with:
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385034
anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l
384993
That is: even more messages missing (now 37). Sigh.
Hopefully, this is something notmuch will be able to help with:
it can index all files by Message-ID (which I learned is
case-insensitive, yay) and tell us which messages don't make it
through.
Considering the corruption I found in the mail spool, I wouldn't be the least surprised those messages are just skipped by the IMAP server because of corruption. Unfortunately, there's nothing on the Dovecot server logs that would explain the discrepancy.
Here again, notmuch comes to the rescue. We can list all message IDs to figure out that discrepancy:
notmuch search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-msgids
notmuch --config=.notmuch-config-mbsync search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-mbsync-msgids
And then we can see how many messages notmuch thinks are missing:
$ wc -l *msgids
372723 Maildir-mbsync-msgids
372752 Maildir-msgids
That's 29 messages. Oddly, it doesn't exactly match the find output:
anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l
385204
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385241
That is 10 more messages. Ugh. But actually, I know what those are: more misfiled messages (in a .folder/draft/ directory, bizarrely), so the totals actually match.
In the notmuch output, there's a lot of stuff like this:
id:notmuch-sha1-fb880d673e24f5dae71b6b4d825d4a0d5d01cde4
Those are messages without a valid Message-ID. Notmuch (presumably) constructs one based on the file's checksum. Because the files differ between the IMAP server and the local mail spool (which is unfortunate, but possibly inevitable), those do not match. There are exactly the same number of those on both sides, so I'll go ahead and assume those are all accounted for.
What remains is:
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\-[^-]' | grep -v sha1 | wc -l
2
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\+[^+]' | grep -v sha1 | wc -l
21
anarcat@angela:~(main)$
i.e. 21 missing from mbsync and, surprisingly, 2 missing from the original mail spool.
Further inspection also showed they were all messages with some sort of "corruption": no body and only headers. I am not sure that is a legal email format in the first place. Since they were mostly spam or administrative emails ("You have been unsubscribed from mailing list..."), it seems fairly harmless to ignore those.
Conclusion
As we'll see in the next article, SMD has stellar performance, but this comes at a huge cost: it accesses the mail storage directly. It can create (and has created) significant problems on the mail server. It's unclear exactly why those things happen, but Dovecot expects a particular storage format for its files, and it seems unwise to bypass that.
In the future, I'll try to remember to avoid that, especially since mechanisms like SMD require special server access (SSH) which, in the long term, I am not sure I want to maintain or expect.
In other words, just talking with an IMAP server opens up a lot more hosting possibilities than setting up a custom synchronisation protocol over SSH. It's also safer and more reliable, as we have seen. Thankfully, I've been able to recover from all the errors I could find, but it could have gone differently, and it would have been possible for SMD to permanently corrupt a significant part of my mail archives.
I recommend SMD users start looking for alternatives. The project has been archived upstream, and the Debian package has been orphaned. I have seen significant mail box corruption, including entire mail spool destruction, mostly due to incorrect locking code. I have filed a release-critical bug in Debian to make sure it doesn't ship with Debian bookworm.
Alternatives like mbsync provide fast and reliable transport,
including over SSH. See the next
article for further discussion of
the alternatives.
November 21, 2021 04:04 PM UTC
John Ludhi/nbshare.io
Movie Name Generation Using GPT-2
Since its reveal in 2017 in the popular paper Attention Is All You Need (https://arxiv.org/abs/1706.03762), the Transformer quickly became the most popular model in NLP. The ability to process text in a non-sequential way (as opposed to RNNs) allowed for training of big models. The attention mechanism it introduced proved extremely useful in generalizing text.
Following the paper, several popular transformers surfaced, the most popular of which is GPT. GPT models are developed and trained by OpenAI, one of the leaders in AI research. The latest release of GPT is GPT-3, which has 175 billion parameters. The model was very advanced to the point where OpenAI chose not to open-source it. People can access it through an API after a signup process and a long queue.
However, GPT-2, their previous release, is open-source and available on many deep learning frameworks.
In this exercise, we use Huggingface and PyTorch to fine-tune a GPT-2 model for movie name generation.
Overview:
- Imports and Data Loading
- Data Preprocessing
- Setup and Training
- Movie Name Generation
- Model Saving and Loading
Please use pip install {library name} in order to install the libraries below if they are not installed. "transformers" is the Huggingface library.
import re
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch.optim as optim
We set the device to enable GPU processing.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device
movies_file = "movies.csv"
Since the file is in CSV format, we use pandas.read_csv() to read the file
raw_df = pd.read_csv(movies_file)
raw_df
We can see that we have 9742 movie names in the title column. Since the other columns are not useful for us, we will only keep the title column.
movie_names = raw_df['title']
movie_names
As seen, the movie names all end with the release year. While it may be interesting to keep the years in the names and let the model output years for generated movies, we can safely assume it does not help the model in understanding movie names.
We remove them with a simple regex expression:
movie_list = list(movie_names)
def remove_year(name):
    # use a raw string so the regex escapes are not mangled
    return re.sub(r"\([0-9]+\)", "", name).strip()
movie_list = [remove_year(name) for name in movie_list]
The final movie list looks ready for training. Notice that we do not need to tokenize or process the text any further, since GPT-2 comes with its own tokenizer that handles text in the appropriate way.
movie_list[:5]
However, we still need a fixed-length input. We use the average movie name length in words to pick a safe max length.
avg_length = sum([len(name.split()) for name in movie_list])/len(movie_list)
avg_length
Since the average movie name length in words is 3.3, we can assume that a max length of 10 will cover most of the instances.
max_length = 10
Before creating the dataset, we download the model and the tokenizer. We need the tokenizer in order to tokenize the data.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")
We send the model to the device and initialize the optimizer.
model = model.to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4)
According to the GPT-2 paper, to fine-tune the model, use a task designator.
For our purposes, the designator is simply "movie: ". This will be added to the beginning of every example.
To correctly pad and truncate the instances, we find the number of tokens used by this designator:
tokenizer.encode("movie: ")
extra_length = len(tokenizer.encode("movie: "))
We create a simple dataset that extends the PyTorch Dataset class:
class MovieDataset(Dataset):
    def __init__(self, tokenizer, init_token, movie_titles, max_len):
        self.max_len = max_len
        self.tokenizer = tokenizer
        self.eos = self.tokenizer.eos_token
        self.eos_id = self.tokenizer.eos_token_id
        self.movies = movie_titles
        self.result = []

        for movie in self.movies:
            # Encode the text using tokenizer.encode(). We add EOS at the end
            tokenized = self.tokenizer.encode(init_token + movie + self.eos)
            # Padding/truncating the encoded sequence to max_len
            padded = self.pad_truncate(tokenized)
            # Creating a tensor and adding to the result
            self.result.append(torch.tensor(padded))

    def __len__(self):
        return len(self.result)

    def __getitem__(self, item):
        return self.result[item]

    def pad_truncate(self, name):
        # length of the sequence without the task designator tokens
        name_length = len(name) - extra_length
        if name_length < self.max_len:
            difference = self.max_len - name_length
            result = name + [self.eos_id] * difference
        elif name_length > self.max_len:
            # truncate, then close with EOS so the total length matches
            # the padded case (max_len + extra_length tokens)
            result = name[:self.max_len + extra_length - 1] + [self.eos_id]
        else:
            result = name
        return result
Then, we create the dataset:
dataset = MovieDataset(tokenizer, "movie: ", movie_list, max_length)
Using a batch_size of 32, we create the dataloader:
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
GPT-2 is capable of several tasks, including summarization, generation, and translation. To train it for generation, we use the same sequence as both input and labels:
def train(model, optimizer, dl, epochs):
for epoch in range(epochs):
for idx, batch in enumerate(dl):
with torch.set_grad_enabled(True):
optimizer.zero_grad()
batch = batch.to(device)
output = model(batch, labels=batch)
loss = output[0]
loss.backward()
optimizer.step()
if idx % 50 == 0:
print("loss: %f, %d"%(loss, idx))
When training a language model, it is easy to overfit. This is because there is no clear evaluation metric. With most tasks, one can use cross-validation to guard against overfitting. For our purposes, we only use 2 epochs of training:
train(model=model, optimizer=optimizer, dl=dataloader, epochs=2)
The loss decreased consistently, which means that the model was learning.
Movie Name Generation
In order to verify, we generate 20 movie names that do not exist in the movie list.
The generation methodology is as follows:
- The task designator is initially fed into the model
- A choice from the top-k choices is selected. A common question is why not always use the highest-ranked choice. The simple answer is that introducing randomness helps the model create different outputs. There are several sampling methods in the literature, such as top-k and nucleus sampling. In this example, we use top-k, where k = 9. k is a hyperparameter that can be tuned to improve performance, so feel free to play around with it to see the effects.
- The choice is added to the sequence and the current sequence is fed to the model.
- Repeat steps 2 and 3 until either max_len is achieved or the EOS token is generated.
def topk(probs, n=9):
# The scores are initially softmaxed to convert to probabilities
probs = torch.softmax(probs, dim= -1)
# PyTorch has its own topk method, which we use here
tokensProb, topIx = torch.topk(probs, k=n)
# The new selection pool (9 choices) is normalized
tokensProb = tokensProb / torch.sum(tokensProb)
# Send to CPU for numpy handling
tokensProb = tokensProb.cpu().detach().numpy()
# Make a random choice from the pool based on the new prob distribution
choice = np.random.choice(n, 1, p = tokensProb)
tokenId = topIx[choice][0]
return int(tokenId)
def model_infer(model, tokenizer, init_token, max_length=10):
    # Preprocess the init token (task designator)
    init_id = tokenizer.encode(init_token)
    result = init_id
    init_input = torch.tensor(init_id).unsqueeze(0).to(device)

    with torch.set_grad_enabled(False):
        # Feed the init token to the model
        output = model(init_input)

        # Flatten the logits at the final time step
        logits = output.logits[0,-1]

        # Make a top-k choice and append to the result
        result.append(topk(logits))

        # For max_length times:
        for i in range(max_length):
            # Feed the current sequence to the model and make a choice
            # (named input_ids to avoid shadowing the built-in input())
            input_ids = torch.tensor(result).unsqueeze(0).to(device)
            output = model(input_ids)
            logits = output.logits[0,-1]
            res_id = topk(logits)

            # If the chosen token is EOS, return the result
            if res_id == tokenizer.eos_token_id:
                return tokenizer.decode(result)
            else: # Append to the sequence
                result.append(res_id)

    # If no EOS is generated, return after max_length iterations
    return tokenizer.decode(result)
Generating 20 unique movie names:
results = set()
while len(results) < 20:
name = model_infer(model, tokenizer, "movie:").replace("movie: ", "").strip()
if name not in movie_list:
results.add(name)
print(name)
As shown, the movie names look realistic, meaning that the model learned how to generate movie names correctly.
PyTorch makes it very easy to save the model:
torch.save(model.state_dict(), "movie_gpt.pth")
And, if you need to load the model in the future for quick inference without having to train:
model.load_state_dict(torch.load("movie_gpt.pth"))
In this tutorial, we learnt how to fine-tune the Huggingface GPT-2 model to perform movie name generation. The same methodology can be applied to any language model available at https://huggingface.co/models.
November 21, 2021 01:38 AM UTC
November 20, 2021
Evennia
The Evennia blog has moved to evennia.com!
This dev blog has moved! All past and future posts will now be found on evennia.com instead. The linked post discusses the move in more detail, including the little custom blog platform I wrote for it.
The new blog has a new RSS feed address, so if you follow this blog via RSS, update your feed link (all old entries were migrated as well).
The old posts here on Blogspot/Blogger will remain but won't be updated anymore.
Cheers,
Griatch
November 20, 2021 03:31 PM UTC
Weekly Python StackOverflow Report
(cccii) stackoverflow python report
These are the ten most rated questions at Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2021-11-20 15:01:04 GMT
- Implementation of the Max() function in Python - [15/1]
- Plot bar chart in multiple subplot rows with Pandas - [6/1]
- Python threads difference for 3.10 and others - [6/1]
- Using a postgres composite field containing a geography type for django ORM usage - [6/0]
- Any simpler way to assign multiple columns in Python like R data.table := - [5/3]
- How to build triangular matrix as df with data from dataframe - [5/3]
- How to find the number of neighbours pixels in binary array - [5/1]
- Why aren't Pandas operations in-place? - [5/0]
- Sum of 1+3+5...+n until the sum exceeds 100 - [4/9]
- Fillna if all the values of a column are null in pandas - [4/5]
November 20, 2021 03:02 PM UTC
Andre Roberge
Friendly-traceback en español
Friendly and Friendly-traceback are now partially available in Spanish thanks to the work of Martín René (https://github.com/martinvilu).
You can have a look at the Spanish translations in context for SyntaxErrors and for other exceptions.
If you are interested in contributing to translations, please join this discussion and have a look at this online collaborative site.
Update: Someone just volunteered to help with the Italian translation. Note that there are more than 600 pieces of text to translate and that more volunteers can help!
November 20, 2021 03:00 PM UTC
ItsMyCode
ValueError: too many values to unpack (expected 2)
ItsMyCode |
If you get ValueError: too many values to unpack (expected 2), it means that you are trying to unpack more values from an iterable than there are variables to hold them. ValueError is a standard exception that can occur if a method receives an argument with the correct data type but an invalid value, or if the value provided to the method falls outside the valid range.
In this article, let us look at what this error means and the scenarios you get this error and how to resolve the error with examples.
What is Unpacking in Python?
In Python, the function can return multiple values, and it can be stored in the variable. This is one of the unique features of Python when compared to other languages such as C++, Java, C# etc.
Unpacking in Python is an operation where an iterable of values will be assigned to a tuple or list of variables.
Unpacking using List in Python
In this example, we are unpacking a list of elements, where each element returned from the list is assigned to a variable on the left-hand side.
x,y,z = [5,10,15]
print(x)
print(y)
print(z)
Output
5
10
15
Unpacking list using underscore
The underscore is most commonly used to ignore values: _ is used as a variable name when we do not want to use the value at a later point.
x,y,_ = [5,10,15]
print(x)
print(y)
print(_)
Output
5
10
15
Unpacking list using an asterisk
The drawback with an underscore is that it can hold just one value, but what if you have too many values that come dynamically? The asterisk comes to the rescue here. We can prefix a variable with an asterisk to unpack all the values that are not otherwise assigned, and it will hold all these elements in a list.
x,y, *z = [5,10,15,20,25,30]
print(x)
print(y)
print(z)
Output
5
10
[15, 20, 25, 30]
What is ValueError: too many values to unpack (expected 2)?
ValueError: too many values to unpack (expected 2) occurs when there is a mismatch between the number of returned values and the number of variables declared to store these values. If you have more objects to assign and fewer variables to hold them, you get a ValueError.
The error occurs mainly in two scenarios:
Scenario 1: Unpacking the list elements
Let’s take a simple example that returns an iterable of three items instead of two. We have two variables on the left-hand side to hold these items, so Python will throw ValueError: too many values to unpack.
In the example below, we have two variables, x and y, but we are unpacking three elements from the list.
Error Scenario
x,y =[5,10,15]
Output
Traceback (most recent call last):
File "c:/Projects/Tryouts/main.py", line 1, in <module>
x,y =[5,10,15]
ValueError: too many values to unpack (expected 2)
Solution
While unpacking a list into variables, the number of variables you want to unpack must equal the number of items in the list.
If you already know the number of elements in the list, then ensure you have an equal number of variables on the left-hand side to hold these elements.
If you do not know the number of elements in the list or if your list is dynamic, then you can unpack the list with an asterisk operator. It will ensure that all the un-assigned elements will be stored in a single variable with an asterisk operator.
# In case we know the number of elements
# in the list to unpack
x,y,z =[5,10,15]
print("If we know the number of elements in list")
print(x)
print(y)
print(z)
# if the list is dynamic
a,b, *c = [5,10,15,20,25,30]
print("In case of dynamic list")
print(a)
print(b)
print(c)
Output
If we know the number of elements in list
5
10
15
In case of dynamic list
5
10
[15, 20, 25, 30]
Scenario 2: Unpacking dictionary
In Python, a dictionary is a collection of key-value pairs. Let us consider a simple example of an employee dictionary, which consists of three keys, each holding a value, as shown below.
If we need to extract and print each of the key-value pairs in the employee dictionary, we can iterate over the dictionary elements using a for loop.
Let's run our code and see what happens.
Error Scenarios
# Unpacking a dictionary
employee= {
"name":"Chandler",
"age":25,
"Salary":10000
}
for keys, values in employee:
print(keys,values)
Output
Traceback (most recent call last):
File "c:/Projects/Tryouts/main.py", line 9, in <module>
for keys, values in employee:
ValueError: too many values to unpack (expected 2)
We get a ValueError in the above code because iterating over a dictionary directly yields only its keys, so Python tries to unpack each key (a string such as "name") into the two variables. We should not expect the keys and values to come out as two separate entities when we loop over the dictionary itself.
Solution
We can resolve the error by using a method called items(). The items() method returns a view object which contains the key-value pairs stored as tuples.
# Unpacking a dictionary
employee= {
"name":"Chandler",
"age":25,
"Salary":10000
}
for keys, values in employee.items():
print(keys,values)
Output
name Chandler
age 25
Salary 10000
Note: If you are using Python 2.x, you need to use iteritems() instead of the items() function.
The post ValueError: too many values to unpack (expected 2) appeared first on ItsMyCode.
November 20, 2021 01:54 PM UTC
November 19, 2021
Python for Beginners
Graph in Python
Graphs are one of the most important data structures. Graphs are used to represent telephone networks, maps, social network connections, etc. In this article we will discuss what a graph is and how we can implement a graph in Python.
What is a graph?
In mathematics, a graph is defined as a set of vertices and edges, where vertices are particular objects and edges represent the connections between the vertices. The vertices and edges are represented using sets.
Mathematically, a graph G can be represented as G = (V, E), where V is the set of vertices and E is the set of edges.
If an edge Ei connects vertices v1 and v2, we can represent the edge as Ei = (v1, v2).
How to represent a graph?
We will use the graph given in the following figure to learn how to represent a graph.
Graph in Python
To represent a graph, we will have to find the set of vertices and edges in the graph.
First, we will find the set of vertices. For this, we can create a set using the vertices given in the above figure. In the figure, the vertices have been named A,B,C,D,E, and F. So the set of vertices can be created as V={A, B, C, D, E, F}.
To find the set of edges, first we will find all the edges in the graph. You can observe that there are 6 edges in the graph, numbered from E1 to E6. An edge Ei can be created as a tuple (v1, v2), where v1 and v2 are the vertices being connected by Ei. For the above graph, we can represent the edges as follows.
- E1 = (A, D)
- E2 = (A, B)
- E3 = (A, E)
- E4 = (A, F)
- E5 = (B, F)
- E6 = (B, C)
The set of edges E can be represented as E = {E1, E2, E3, E4, E5, E6}.
Finally, the graph G can be represented as G = (V, E), where V and E are the sets of vertices and edges.
Till now, we have discussed how to represent a graph mathematically. Can you think of a way to represent a graph in a python program? Let us look into it.
How to represent a graph in Python?
We can represent a graph using an adjacency list. An adjacency list can be thought of as a list in which each vertex stores a list of all the vertices connected to it.
We will implement the adjacency list representation of the graph in python using a dictionary and lists.
First, we will create a python dictionary with all the vertex names as keys and an empty list (adjacency list) as their associated values using the given set of vertices.
After that, we will use the given set of edges to complete the adjacency list of each vertex that has been represented using the keys of the dictionary. For every edge (v1,v2), we will add v1 to the adjacency list of v2 and v2 to the adjacency list of v1.
In this way, every key (vertex) in the dictionary will have an associated value (a list of vertices) and the dictionary will represent the whole graph in python.
Given the set of vertices and edges, we can implement a graph in python as follows.
vertices = {"A", "B", "C", "D", "E", "F"}
edges = {("A", "D"), ("A", "B"), ("A", "E"), ("A", "F"), ("B", "F"), ("B", "C")}
graph = dict()
for vertex in vertices:
graph[vertex] = []
for edge in edges:
v1 = edge[0]
v2 = edge[1]
graph[v1].append(v2)
graph[v2].append(v1)
print("The given set of vertices is:", vertices)
print("The given set of edges is:", edges)
print("Graph representation in python is:")
print(graph)
Output:
The given set of vertices is: {'F', 'D', 'B', 'E', 'A', 'C'}
The given set of edges is: {('A', 'F'), ('A', 'B'), ('B', 'C'), ('A', 'D'), ('A', 'E'), ('B', 'F')}
Graph representation in python is:
{'F': ['A', 'B'], 'D': ['A'], 'B': ['A', 'C', 'F'], 'E': ['A'], 'A': ['F', 'B', 'D', 'E'], 'C': ['B']}
In the above output, you can verify that each key of the graph has a list of vertices that are connected to it as its value.
Conclusion
In this article, we have discussed graph data structure. We also discussed the mathematical representation of a graph and how we can implement it in python. To learn more about data structures in Python, you can read this article on Linked list in python.
The post Graph in Python appeared first on PythonForBeginners.com.
November 19, 2021 02:26 PM UTC
Lucas Cimon
Hacktoberfest on fpdf2 & v2.4.6
Last month, I realized late that October was hacktoberfest month!
This online event is a month-long celebration (October 1-31) of open source software run in partnership with different software companies, with a focus on encouraging contributions to open source projects.
While I participated in the 2019 edition as a contributor …
— Permalink
November 19, 2021 01:10 PM UTC
Real Python
The Real Python Podcast – Episode #87: Building a Content Aggregator and Working With RSS in Python
Have you wanted to work with RSS feeds in Python? Maybe you're looking for a new project to build for your portfolio that uses Django, unit tests, and custom commands. This week on the show, we have Real Python author Ricky White to talk about his recent step-by-step project titled, "Build a Content Aggregator in Python."
November 19, 2021 12:00 PM UTC
November 18, 2021
"Mathspp Pydon'ts"
String formatting comparison | Pydon't 🐍
This article compares the three main string formatting methods in Python and suggests which methods to use in each situation.
(If you are new here and have no idea what a Pydon't is, you may want to read the Pydon't Manifesto.)
Introduction
The Zen of Python says that
“There should be one – and preferably only one – obvious way to do it.”
And yet, there are three main ways of doing string formatting in Python. This Pydon't will settle the score, comparing these three methods and helping you decide which one is the obvious one to use in each situation.
In this Pydon't, you will:
- learn about the old C-style formatting with %;
- learn about the string method .format;
- learn about the Python 3.6+ feature of literal string interpolation and f-strings;
- understand the key differences between each type of string formatting; and
- see where each type of string formatting really shines.
You can now get your free copy of the ebook “Pydon'ts – Write beautiful Python code” on Gumroad.
String formatting rationale
Let's pretend, for a second, that Python had zero ways of doing string formatting.
Now, I have a task for you: write a function that accepts a programming language name and returns a string saying that said programming language rocks. Can you do it? Again, without any string formatting whatsoever!
Here is a possible solution:
def language_rocks(language):
return language + " rocks!"
# ---
>>> language_rocks("Python")
'Python rocks!'
Great job!
Now, write a function that accepts a programming language name and its (estimated) number of users, and returns a string saying something along the lines of “<language> rocks! Did you know that <language> has around <users> users?!”.
Can you do it? Recall that you are not supposed to use any string formatting facilities, whatsoever!
Here is a possible solution:
def language_info(language, users_estimate):
return (
language + " rocks! Did you know that " + language +
" has around " + str(users_estimate) + " users?!"
)
# ---
>>> language_info("Python", 10)
'Python rocks! Did you know that Python has around 10 users?!'
Notice how that escalated quite quickly: the purpose of our function is still very simple, and yet we have a bunch of string concatenations happening all over the place, just because we have some pieces of information that we want to merge into the string.
This is what string formatting is for: it's meant to make your life easier when you need to put information inside strings.
Three string formatting methods
Now that we've established that string formatting is useful, let's take a look at the three main ways of doing string formatting in Python.
First, here is how you would refactor the function above:
# Using C-style string formatting:
def language_info_cstyle(language, users_estimate):
return (
"%s rocks! Did you know that %s has around %d users?!" %
(language, language, users_estimate)
)
# Using the Python 3 `.format` method from strings:
def language_info_format(language, users_estimate):
return "{} rocks! Did you know that {} has around {}...
November 18, 2021 11:00 PM UTC
PyCharm
PyCharm 2021.3 Release Candidate Is Out
PyCharm’s 2021.3 major release is right around the corner, and now the PyCharm team is fine-tuning the new features and fixing important bugs.
As we approach the end of our EAP (Early Access Program), we'd like to thank everyone who joined it, tested the new features, commented on Twitter, and submitted bug reports.
Your contribution always makes all the difference!
Previously highlighted features:
- Poetry Support
- New FastAPI Project Type
- New Jupyter Notebook Experience
- Remote Development Support (Beta)
Read our previous EAP blog posts for more information on highlighted features.
Important:
- This build requires an active JetBrains subscription.
- If you find any bugs while exploring this release candidate, please submit them to the PyCharm issue tracker.
- For the full list of issues solved in this build please read the release notes.
The PyCharm Team
November 18, 2021 01:29 PM UTC
Test and Code
170: pytest for Data Science and Machine Learning - Prayson Daniel
Prayson Daniel, a principal data scientist, discusses testing machine learning pipelines with pytest.
Prayson is using pytest for some pretty cool stuff, including:
- unit tests, of course
- testing pipeline stages
- counterfactual testing
- performance testing
All with pytest. So cool.
Special Guest: Prayson Daniel.
Sponsored By:
- PyCharm Professional: Try PyCharm Pro for 4 months and learn how PyCharm will save you time. Promo Code: TESTANDCODE22
Links:
- Python Bytes 250, with Prayson Daniel — Listen to this for more of an introduction to Prayson
November 18, 2021 05:30 AM UTC
Codementor
How to Send MMS in Python Using Plivo's Messaging API
How to Send MMS in Python Using Plivo's Messaging API
November 18, 2021 05:02 AM UTC
November 17, 2021
Trey Hunner
How to sort a dictionary in Python
Dictionaries are best used for key-value lookups: we provide a key and the dictionary very quickly returns the corresponding value.
But what if you need both key-value lookups and iteration? It is possible to loop over a dictionary and when looping, we might care about the order of the items in the dictionary.
With dictionary item order in mind, you might wonder how can we sort a dictionary?
Dictionaries are ordered
As of Python 3.6 dictionaries are ordered (technically the ordering became official in 3.7).
Dictionary keys are stored in insertion order, meaning whenever a new key is added it gets added at the very end.
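For example, with some made-up data:
>>> color_amounts = {"purple": 6, "green": 3}
>>> color_amounts["blue"] = 2
>>> color_amounts
{'purple': 6, 'green': 3, 'blue': 2}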
But if we update a key-value pair, the key remains where it was before:
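>>> color_amounts["green"] = 4
>>> color_amounts
{'purple': 6, 'green': 4, 'blue': 2}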
So if you plan to populate a dictionary with some specific data and then leave that dictionary as-is, all you need to do is make sure that original data is in the order you’d like.
For example if we have a CSV file of US state abbreviations and our file is ordered alphabetically by state name, our dictionary will be ordered the same way:
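import csv

# assumes a two-column CSV of state names and abbreviations,
# ordered alphabetically by state name
with open("states.csv") as states_file:
    states = dict(csv.reader(states_file))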
If our input data is already ordered correctly, our dictionary will end up ordered correctly as well.
How to sort a dictionary by its keys
What if our data isn’t sorted yet?
Say we have a list-of-tuples that pair meeting rooms to their corresponding room numbers:
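>>> rooms = [
...     # room names made up for the example
...     ("Pink", "Rm 403"),
...     ("Space", "Rm 201"),
...     ("Quail", "Rm 500"),
...     ("Lime", "Rm 503"),
... ]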
And we’d like to sort this dictionary by its keys.
We could use the built-in sorted function to sort it:
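>>> sorted(rooms)
[('Lime', 'Rm 503'), ('Pink', 'Rm 403'), ('Quail', 'Rm 500'), ('Space', 'Rm 201')]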
The sorted function uses the < operator to compare many items in the given iterable and return a sorted list.
The sorted function always returns a list.
To make these key-value pairs into a dictionary, we can pass them straight to the dict constructor:
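>>> dict(sorted(rooms))
{'Lime': 'Rm 503', 'Pink': 'Rm 403', 'Quail': 'Rm 500', 'Space': 'Rm 201'}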
The dict constructor will accept a list of 2-item tuples (or any iterable of 2-item iterables) and make a dictionary out of it, using the first item from each tuple as a key and the second as the corresponding value.
Key-value pairs are sorted lexicographically… what?
We’re sorting tuples of the key-value pairs before making a dictionary out of them. But how does sorting tuples work?
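A couple of quick comparisons show the behaviour:
>>> ("Lime", "Rm 503") < ("Pink", "Rm 403")
True
>>> ("Pink", "Rm 403") < ("Pink", "Rm 300")
False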
When sorting tuples, Python uses lexicographical ordering (which sounds fancier than it is). Comparing a 2-item tuple basically boils down to this algorithm:
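# a rough sketch of how Python compares two 2-item tuples
def less_than(tuple1, tuple2):
    if tuple1[0] != tuple2[0]:
        # the first items differ, so they decide the ordering
        return tuple1[0] < tuple2[0]
    # the first items are equal, so the second items decide
    return tuple1[1] < tuple2[1]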
I’ve written an article on tuple ordering that explains this in more detail.
You might be thinking: it seems like this sorts not just by keys but by keys and values. And you’re right! But only sort of.
The keys in a dictionary should always compare as unequal (if two keys are equal, they’re seen as the same key).
So as long as the keys are comparable to each other with the less than operator (<), sorting 2-item tuples of key-value pairs should always sort by the keys.
Dictionaries can’t be sorted in-place
What if we already have our items in a dictionary and we’d like to sort that dictionary?
Unlike lists, there’s no sort method on dictionaries.
We can’t sort a dictionary in-place, but we could get the items from our dictionary, sort those items using the same technique we used before, and then turn those items into a new dictionary:
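>>> rooms = {"Pink": "Rm 403", "Space": "Rm 201", "Quail": "Rm 500", "Lime": "Rm 503"}
>>> rooms = dict(sorted(rooms.items()))
>>> rooms
{'Lime': 'Rm 503', 'Pink': 'Rm 403', 'Quail': 'Rm 500', 'Space': 'Rm 201'}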
That creates a new dictionary object. If we really wanted to update our original dictionary object, we could take the items from the dictionary, sort them, clear the dictionary of all its items, and then add all the items back into the dictionary:
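>>> sorted_items = sorted(rooms.items())
>>> rooms.clear()
>>> rooms.update(sorted_items)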
But why bother? We don’t usually want to operate on data structures in-place in Python: we tend to prefer making a new data structure rather than re-using an old one (this preference is partly thanks to how variables work in Python).
How to sort a dictionary by its values
What if we wanted to sort a dictionary by its values instead of its keys?
We could make a new list of value-key tuples (actually a generator in our case below), sort that, then flip them back to key-value tuples and recreate our dictionary:
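>>> sorted_rooms = {
...     key: value
...     for value, key in sorted(
...         (value, key) for key, value in rooms.items()
...     )
... }
>>> sorted_rooms
{'Space': 'Rm 201', 'Pink': 'Rm 403', 'Quail': 'Rm 500', 'Lime': 'Rm 503'}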
This works but it’s a bit long. Also this technique actually sorts both our values and our keys (giving the values precedence in the sorting).
What if we wanted to just sort our dictionary by its values, ignoring the contents of the keys entirely?
Python’s sorted function accepts a key argument that we can use for this!
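For example, we could sort strings by their length instead of alphabetically:
>>> sorted(["python", "is", "lovely"], key=len)
['is', 'python', 'lovely']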
The key function we pass to sorted should accept an item from the iterable we’re sorting and return the key to sort by. Note that the word “key” here isn’t related to dictionary keys. Dictionary keys are used for looking up dictionary values whereas this key function returns an object that determines how to order items in an iterable.
If we want to sort the dictionary by its values, we could make a key function that accepts each item in our list of 2-item tuples and returns just the value:
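>>> def value_from_item(item):
...     # accepts a (key, value) item and returns just the value
...     key, value = item
...     return value
...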
Then we’d use our key function by passing it to the sorted function (yes functions can be passed to other functions in Python) and pass the result to dict to create a new dictionary:
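>>> sorted_rooms = dict(sorted(rooms.items(), key=value_from_item))
>>> sorted_rooms
{'Space': 'Rm 201', 'Pink': 'Rm 403', 'Quail': 'Rm 500', 'Lime': 'Rm 503'}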
If you prefer not to create a custom key function just to use it once, you could use a lambda function (which I don’t usually recommend):
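>>> sorted_rooms = dict(sorted(rooms.items(), key=lambda item: item[1]))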
Or you could use operator.itemgetter to make a key function that gets the second item from each key-value tuple:
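>>> from operator import itemgetter
>>> sorted_rooms = dict(sorted(rooms.items(), key=itemgetter(1)))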
I discussed my preference for itemgetter in my article on lambda functions.
Ordering a dictionary in some other way
What if we needed to sort our dictionary by something other than just a key or a value? For example what if our room number strings include numbers that aren’t always the same length:
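>>> rooms = {
...     # again, made-up room names; note "Rm 30" and "Rm 2000"
...     "Pink": "Rm 403",
...     "Space": "Rm 201",
...     "Quail": "Rm 500",
...     "Lime": "Rm 503",
...     "Ocean": "Rm 2000",
...     "Big": "Rm 30",
... }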
If we sorted these rooms by value, those strings wouldn’t be sorted in the numerical way we’re hoping for:
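>>> dict(sorted(rooms.items(), key=itemgetter(1)))
{'Ocean': 'Rm 2000', 'Space': 'Rm 201', 'Big': 'Rm 30', 'Pink': 'Rm 403', 'Quail': 'Rm 500', 'Lime': 'Rm 503'}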
Rm 30 should be first and Rm 2000 should be last. But we’re sorting strings, which are ordered character-by-character based on the unicode value of each character (I noted this in my article on tuple ordering).
We could customize the key function we’re using to sort numerically instead:
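>>> def by_room_number(item):
...     # assumes the values look like "Rm <number>"
...     name, room = item
...     return int(room.split()[-1])
...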
When we use this key function to sort our dictionary:
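>>> sorted_rooms = dict(sorted(rooms.items(), key=by_room_number))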
It will be sorted by the integer room number, as expected:
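>>> sorted_rooms
{'Big': 'Rm 30', 'Space': 'Rm 201', 'Pink': 'Rm 403', 'Quail': 'Rm 500', 'Lime': 'Rm 503', 'Ocean': 'Rm 2000'}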
Should you sort a dictionary?
When you’re about to sort a dictionary, first ask yourself “do I need to do this”? In fact, when you’re considering looping over a dictionary you might ask “do I really need a dictionary here”?
Dictionaries are used for key-value lookups: you can quickly get a value given a key. They’re very fast at retrieving values for keys. But dictionaries take up more space than a list of tuples.
If you can get away with using a list of tuples in your code (because you don’t actually need a key-value lookup), you probably should use a list of tuples instead of a dictionary.
But if key lookups are what you need, it’s unlikely that you also need to loop over your dictionary.
Now it’s certainly possible that right now you do in fact have a good use case for sorting a dictionary (for example maybe you’re sorting keys in a dictionary of attributes), but keep in mind that you’ll need to sort a dictionary very rarely.
Summary
Dictionaries are used for quickly looking up a value based on a key. The order of a dictionary’s items is rarely important.
In the rare case that you care about the order of your dictionary’s items, keep in mind that dictionaries are ordered by the insertion order of their keys (as of Python 3.6). So the keys in your dictionary will remain in the order they were added to the dictionary.
If you’d like to sort a dictionary by its keys, you can use the built-in sorted function along with the dict constructor:
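sorted_dictionary = dict(sorted(old_dictionary.items()))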
If you’d like to sort a dictionary by its values, you can pass a custom key function (one which returns the value for each item) to sorted:
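def value_from_item(item):
    key, value = item
    return value

sorted_dictionary = dict(sorted(old_dictionary.items(), key=value_from_item))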
But remember, it’s not often that we care about the order of a dictionary. Whenever you’re sorting a dictionary, please remember to ask yourself do I really need to sort this data structure and would a list of tuples be more suitable than a dictionary here?
November 17, 2021 03:30 PM UTC
Python for Beginners
Dataclass in Python
While programming in python, you might have used classes to create different objects. Classes in python are very helpful in depicting real world objects in our programs. In this article, we will discuss a decorator named Dataclass with which we can modify the properties of a class. We will also discuss how a dataclass is important while programming in python.
What is a dataclass?
Dataclass is a decorator defined in the dataclasses module. It was introduced in python 3.7. A dataclass decorator can be used to implement classes that define objects with only data and very minimal functionalities.
A class defined using dataclass decorator has very specific uses and properties that we will discuss in the following sections. Let us first discuss how we can implement a class using dataclass in python.
How to use dataclass in Python?
The dataclass decorator is defined in the dataclasses module, which has been part of the Python standard library since Python 3.7, so on modern versions there is nothing to install. If you are on Python 3.6, you can install the backport using PIP as follows.
pip3 install --upgrade dataclasses
You can then import the dataclass decorator using the import statement as follows.
from dataclasses import dataclass
Let us now define a class with dataclass decorator.
from dataclasses import dataclass
@dataclass
class Person:
Name: str
Country: str
Age: int
candidate = Person("Joe Biden", "USA", 78)
print("The candidate is:",candidate)
Output:
The candidate is: Person(Name='Joe Biden', Country='USA', Age=78)
You may notice that we have specified the data type of class attributes in the above code. Moreover, we do not need to implement the __init__() constructor while using the dataclass decorator. The decorator itself implements the __init__() method for us.
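To see what the decorator does for us, here is a hand-written class roughly equivalent to the generated constructor (the decorator also generates __repr__() and __eq__() methods, which we will rely on below):
class Person:
    def __init__(self, Name: str, Country: str, Age: int):
        # every declared field becomes a constructor parameter and an attribute
        self.Name = Name
        self.Country = Country
        self.Age = Age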
Benefits of using dataclass in Python
We can define classes using the dataclass decorator to represent objects.
When we have implemented a class without the dataclass decorator and want to print the attributes of an object using the print statement, we have to implement the __repr__() method ourselves. Otherwise, the output is as follows.
class Person:
def __init__(self, name, country, age):
self.Name = name
self.Country = country
self.Age = age
candidate = Person("Joe Biden", "USA", 78)
print("The candidate is:", candidate)
Output:
The candidate is: <__main__.Person object at 0x7fb7289a8070>
To print the class attributes, we will have to implement the __repr__() method as follows.
class Person:
def __init__(self, name, country, age):
self.Name = name
self.Country = country
self.Age = age
def __repr__(self):
return "Name: {}, Country: {}, Age: {}".format(self.Name, self.Country, self.Age)
candidate = Person("Joe Biden", "USA", 78)
print("The candidate is:", candidate)
Output:
The candidate is: Name: Joe Biden, Country: USA, Age: 78
But, when we use the dataclass decorator, all the class attributes are printed without implementing the __repr__() method. This can be observed in the following example.
from dataclasses import dataclass
@dataclass
class Person:
Name: str
Country: str
Age: int
candidate = Person("Joe Biden", "USA", 78)
print("The candidate is:",candidate)
Output:
The candidate is: Person(Name='Joe Biden', Country='USA', Age=78)
Another major difference between a simple class and a class with dataclass decorator is the way in which the instances of the class are compared.
For example, when we create a class and compare its instances using the == operator, the Python interpreter checks the identity or memory location of the objects, and they are considered equal only if both instances refer to the same memory location. This can be observed in the following program.
class Person:
def __init__(self, name, country, age):
self.Name = name
self.Country = country
self.Age = age
def __repr__(self):
return "Name: {}, Country: {}, Age: {}".format(self.Name, self.Country, self.Age)
candidate1 = Person("Joe Biden", "USA", 78)
candidate2 = Person("Joe Biden", "USA", 78)
print("Candidate 1 is:", candidate1)
print("Candidate 2 is:", candidate2)
print("Both the candidates are same?", candidate1 == candidate2)
Output:
Candidate 1 is: Name: Joe Biden, Country: USA, Age: 78
Candidate 2 is: Name: Joe Biden, Country: USA, Age: 78
Both the candidates are same? False
Here, you can see that Candidate 1 and Candidate 2 are considered different because they are different objects and refer to different memory locations.
On the contrary, when we define a class using the dataclass decorator, the comparison operator works very differently. When we compare two instances of the class using the == operator, the values in the class attributes of the objects are compared instead of the memory location. If the values in the corresponding attributes are equal in both instances, the objects are said to be equal. You can observe this in the following program.
from dataclasses import dataclass
@dataclass
class Person:
Name: str
Country: str
Age: int
candidate1 = Person("Joe Biden", "USA", 78)
candidate2 = Person("Joe Biden", "USA", 78)
print("The candidate 1 is:", candidate1)
print("The candidate 2 is:", candidate2)
print("Both the candidates are same?", candidate1 == candidate2)
Output:
The candidate 1 is: Person(Name='Joe Biden', Country='USA', Age=78)
The candidate 2 is: Person(Name='Joe Biden', Country='USA', Age=78)
Both the candidates are same? True
Here, both the candidates are considered equal because the attributes of the objects are equal. Thus we can easily compare data inside objects when we implement classes using the dataclass decorator.
You can see that the dataclass decorator gives us a better method to compare objects. Otherwise, we will have to define methods to compare the objects. This may result in costly execution in terms of time and space.
Conclusion
In this article, we have discussed the dataclass decorator in python. We have also implemented it and saw some of its peculiar properties that make it a useful construct to use in our programs. To learn more about python programming, you can read this article on list comprehension. You may also like this article on the linked list in Python.
The post Dataclass in Python appeared first on PythonForBeginners.com.
November 17, 2021 03:15 PM UTC
Real Python
Python News: What's New From October 2021?
A culmination of great work done by volunteers worldwide, the release of Python 3.10 dominated the Python community’s news cycle in October 2021. At the same time that this release was making new features available, Python got recognition as the top programming language for the month in the TIOBE Programming Community index.
There are also some new opportunities for you to support the community by participating in the Python Developer Survey and answering the PyCon US 2022 Call for Proposals.
Let’s dive into the biggest Python news from the past month!
Free Bonus: Click here to get a Python Cheat Sheet and learn the basics of Python 3, like working with data types, dictionaries, lists, and Python functions.
The Python 3.10 Release
New versions of Python are now released annually. We can look forward to the core developers sharing a lovely goody bag with the rest of us every October. With Python 3.10, which came out of beta on October 4th, everyone had something exciting to anticipate.
Each release of Python has a release manager who’s responsible for coordinating all changes and for building and preparing the files for distribution. The release manager for Python 3.10 and 3.11 is Pablo Galindo Salgado. In a first for Python, he built and released Python live on YouTube.
Python 3.10 Highlights
The new release includes lots of improvements to the language. Among our favorites are improved error messages, simplified syntax for type unions, and structural pattern matching.
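As a quick taste of the latter two (a sketch of ours, not from the article): the new union syntax lets you write int | str where you previously needed typing.Union, and match statements dispatch on the shape of a value:

def describe(value: int | str) -> str:
    # int | str is the Python 3.10 spelling of Union[int, str]
    match value:
        case int():
            return "an integer"
        case str():
            return "a string"

print(describe(42))      # an integer
print(describe("news"))  # a string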
Improved error messages will make your life easier, whether you’re a new Python developer or an experienced one. In particular, the feedback that you get when your code isn’t valid Python is more pointed and actionable in Python 3.10 than in previous versions. As an example, consider the following code, where there’s no closing bracket at the end of the first line:
news = ["errors", "types", "patterns"
print(", ".join(news))
In Python 3.9 and earlier, you’ll see the following if you try to run this code:
File "errors.py", line 2
print(", ".join(news))
^
SyntaxError: invalid syntax
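Python 3.10, by contrast, points at the bracket that was never closed. Running the same file on 3.10 (our own reproduction, not from the article) yields a message along these lines:

  File "errors.py", line 1
    news = ["errors", "types", "patterns"
           ^
SyntaxError: '[' was never closed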
Read the full article at https://realpython.com/python-news-october-2021/ »
[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]
November 17, 2021 02:00 PM UTC
Stack Abuse
Keras Callbacks: Save and Visualize Prediction on Each Training Epoch
Introduction
Keras is a high-level API, typically used with the TensorFlow library, that has lowered the barrier to entry for many and democratized the creation of Deep Learning models and systems.
When just starting out, a high-level API that abstracts most of the inner workings helps people get the hang of the basics and build a starting intuition. Down the line, though, practitioners naturally want to build a stronger intuition of what happens under the hood, both to gain actionable insight and to reach a deeper understanding of how their model learns.
In a lot of cases, it's useful to take a look at the learning process of a Deep Neural Network, testing how it predicts values on each learning epoch, and save the values.
These saved values can be used to visualize the predictions, using libraries like Matplotlib or Seaborn, or can be saved in a log for further analysis in smart systems, or simply analyzed by a human. We typically extract the learning curves of a model to gain a better understanding of how it performs through time - but learning curves reflect the mean loss through time, and you don't get to see how the model performs until it's done training.
Keras has a wonderful feature: callbacks, snippets of code that are called during training and can be used to customize the training process. Typically, you use callbacks to save the model if it performs well, stop the training if it's overfitting, or otherwise react to or affect the steps in the learning process.
This makes callbacks the natural choice for running predictions on each batch or epoch and saving the results. In this guide, we'll take a look at how to run a prediction on the test set, visualize the results, and save them as images, on each training epoch in Keras.
Note: We'll be building a simple Deep Learning model using Keras in the following sections, but won't put much focus on the implementation or the dataset. This isn't meant to be a guide to building regression models, but a model is needed to properly showcase how the callback works.
If you're interested in reading more about how to build these models and how to get them highly accurate instead of just accurate - read our extensive and detailed Hands-On House Price Prediction - Deep Learning in Python with Keras!
Building and Evaluating a Deep Learning Model with Keras
Let's build a simple Keras model for illustrative purposes. We'll speed through this section with minimal focus and attention - this isn't a guide on building regression models. We'll be working with the California Housing Dataset, obtained through Scikit-Learn's datasets module, which is a dataset meant for regression.
Let's go ahead and import the libraries and static methods we'll be using:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
Now, let's load in the dataset, split it into a training and testing set (we'll split out a validation set later), and visualize the locations of the houses to check if the data's been loaded correctly:
x, y = fetch_california_housing(as_frame=True, return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y)

plt.figure(figsize=(12, 8))
sns.scatterplot(data=x, x='Longitude', y='Latitude', size=y, alpha=0.5, hue=y, palette='magma')
plt.show()
[Image: scatter plot of house locations by Longitude/Latitude, sized and colored by price]
Looks like California! Since the data is loaded correctly, we can define a simple sequential Keras model:
checkpoint = keras.callbacks.ModelCheckpoint("california.h5", save_best_only=True)

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', kernel_initializer='normal', kernel_regularizer="l2", input_shape=[x_train.shape[1]]),
    keras.layers.Dropout(0.2),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(64, activation='relu', kernel_initializer='normal', kernel_regularizer="l2"),
    keras.layers.Dropout(0.2),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1)
])

model.compile(loss='mae',
              optimizer=keras.optimizers.RMSprop(learning_rate=1e-2, decay=0.1),
              metrics=['mae'])

history = model.fit(
    x_train, y_train,
    epochs=150,
    batch_size=64,
    validation_split=0.2,
    callbacks=[checkpoint]
)
Here, we've got a simple MLP, with a bit of Dropout and Batch Normalization to battle overfitting, optimized with the RMSprop optimizer and a Mean Absolute Error loss. We've fitted the model for 150 epochs, with a validation split of 0.2, and a ModelCheckpoint callback to save the weights in a file. Running this results in:
...
Epoch 150/150
387/387 [==============================] - 3s 7ms/step - loss: 0.6279 - mae: 0.5976 - val_loss: 0.6346 - val_mae: 0.6042
We could visualize the learning curves to gain some basic insight into how the training went, but it doesn't tell us the whole story - these are just aggregate means over the training and validation sets during training:
model_history = pd.DataFrame(history.history)
model_history['epoch'] = history.epoch

fig, ax = plt.subplots(1, figsize=(8,6))
num_epochs = model_history.shape[0]

ax.plot(np.arange(0, num_epochs), model_history["mae"],
        label="Training MAE")
ax.plot(np.arange(0, num_epochs), model_history["val_mae"],
        label="Validation MAE")
ax.legend()
plt.tight_layout()
plt.show()
This results in:
[Image: training and validation MAE learning curves over 150 epochs]
And we can evaluate our model with:
model.evaluate(x_test, y_test)
162/162 [==============================] - 0s 2ms/step - loss: 0.5695 - mae: 0.5451 - mape: 32.2959
The target variable is measured in multiples of $100,000, which means the network misses the price by about $54,000 on average - a Mean Absolute Percentage Error of ~32%. Most traditional Machine Learning methods, such as Random Forest Regression, achieve around $52,000 on this dataset even after more extensive data pre-processing and hyperparameter tuning - so this is actually a pretty decent result, although it could be improved with more preprocessing, better tuning, and different architectures.
The point here wasn't to build a particularly accurate model, but we deliberately chose a dataset on which the model wouldn't converge very quickly, so we can observe how its predictions dance around the target variables.
A more illustrative way to evaluate how the model works ditches the aggregate Mean Absolute Error and Mean Absolute Percentage Error entirely: we can plot a scatter plot of the predicted prices against the actual prices. If they're equal, the plotted markers will follow a straight diagonal trajectory. For reference and scope, we can also plot a diagonal line and evaluate how close each marker is to it:
test_predictions = model.predict(x_test)
test_labels = y_test
fig, ax = plt.subplots(figsize=(8,4))
plt.scatter(test_labels, test_predictions, alpha=0.6,
            color='#FF0000', lw=1, ec='black')
lims = [0, 5]
plt.plot(lims, lims, lw=1, color='#0000FF')
plt.ticklabel_format(useOffset=False, style='plain')
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.xlim(lims)
plt.ylim(lims)
plt.tight_layout()
plt.show()
Running this code results in:
[Image: scatter plot of predicted vs. actual prices, with a diagonal reference line]
The network overprices cheaper houses and underprices more expensive ones - and the estimates have a pretty wide spread (with some predictions on the right being totally out of range - though this happens because we haven't cleaned the dataset, and many house prices are capped at that value when imported).
This isn't the insight you get from the learning curves, and a network that had the opposite effect - underpricing cheaper houses and overpricing expensive ones might have the same MAE and MAPE but behave totally differently.
What we're also interested in is how the model got here and how these predictions changed through time and the learning process. This is just the end point of the training process, and there was a fair bit of training involved to get here.
Let's go ahead and write a custom callback to add to the list of callbacks in the training process, that will run a prediction on the test set on each epoch, visualize the predictions and save them as an image.
Custom Prediction Keras Callback with Plots
Just like we've used the ModelCheckpoint callback to check whether a model is in its best-performing state on each epoch, and save it into a .h5 file and persist it - we can write a custom callback that'll run predictions, visualize them, and save the images on our disk.
Creating a custom callback boils down to extending the Callback class and overriding any of the methods it provides - the ones you don't override, retain their default behavior:
class PerformancePlotCallback(keras.callbacks.Callback):
    def on_train_end(self, logs=None):
        ...

    def on_epoch_begin(self, epoch, logs=None):
        ...

    def on_epoch_end(self, epoch, logs=None):
        ...

    def on_test_begin(self, logs=None):
        ...

    def on_test_end(self, logs=None):
        ...

    # Etc.
Depending on when you'd like to predict with your in-training model, you'll choose the appropriate method. A good measure of progress is an epoch, so at the end of each training epoch, we'll test the model on our test set.
We need a way to provide the test set to the callback, since this is external data. The easiest way to do that is to define a constructor that accepts the test set and evaluates the current model on it, giving you a consistent result:
class PerformancePlotCallback(keras.callbacks.Callback):
    def __init__(self, x_test, y_test):
        self.x_test = x_test
        self.y_test = y_test

    def on_epoch_end(self, epoch, logs=None):
        print('Evaluating Model...')
        # Pass the targets as well, so evaluate() can compute real metrics
        print('Model Evaluation: ', self.model.evaluate(self.x_test, self.y_test))
This simple callback accepts the test set of houses and the relevant target variables, and evaluates the model on them at the end of each epoch, printing the result to the console right alongside the usual Keras output.
If we were to instantiate and add this callback to the model, and fit() it again, we'd see a different result from before:
performance_simple = PerformancePlotCallback(x_test, y_test)

# Model definition and compilation...

history = model.fit(
    x_train, y_train,
    epochs=150,
    validation_split=0.2,
    callbacks=[performance_simple]
)
This results in:
Epoch 1/150
387/387 [==============================] - 3s 7ms/step - loss: 1.0785 - mae: 1.0140 - val_loss: 0.9455 - val_mae: 0.8927
Evaluating Model...
162/162 [==============================] - 0s 1ms/step - loss: 0.0528 - mae: 0.0000e+00
Model Evaluation: [0.05277165770530701, 0.0]
Epoch 2/150
387/387 [==============================] - 3s 7ms/step - loss: 0.9048 - mae: 0.8553 - val_loss: 0.8547 - val_mae: 0.8077
Evaluating Model...
162/162 [==============================] - 0s 1ms/step - loss: 0.0471 - mae: 0.0000e+00
Model Evaluation: [0.04705655574798584, 0.0]
...
Awesome! The model is evaluating itself on each epoch, on the data we've passed into the callback. Now, let's modify the callback so it visualizes the predictions instead of printing them to the already cluttered output.
To simplify things, we'll get the callback to save the images to a folder, so that we can stitch them together into a video or a Gif later on. We'll also include a model_name in the constructor to help us differentiate models when generating the images and their filenames:
class PerformancePlotCallback(keras.callbacks.Callback):
    def __init__(self, x_test, y_test, model_name):
        self.x_test = x_test
        self.y_test = y_test
        self.model_name = model_name

    def on_epoch_end(self, epoch, logs=None):
        y_pred = self.model.predict(self.x_test)
        fig, ax = plt.subplots(figsize=(8,4))
        # Use the targets stored on the callback, not a global variable
        plt.scatter(self.y_test, y_pred, alpha=0.6,
                    color='#FF0000', lw=1, ec='black')
        lims = [0, 5]
        plt.plot(lims, lims, lw=1, color='#0000FF')
        plt.ticklabel_format(useOffset=False, style='plain')
        plt.xticks(fontsize=18)
        plt.yticks(fontsize=18)
        plt.xlim(lims)
        plt.ylim(lims)
        plt.tight_layout()
        plt.title(f'Prediction Visualization Keras Callback - Epoch: {epoch}')
        plt.savefig('model_train_images/' + self.model_name + "_" + str(epoch))
        plt.close()
Here, we create a Matplotlib figure on each epoch, and plot a scatter plot of the predicted prices against the actual prices. Additionally, we've added a diagonal reference line - the closer our scatter plot markers are to the diagonal line, the more accurate our model's predictions were.
The plot is then saved via plt.savefig() with the model's name and the epoch number, alongside an informative title that lets you know which epoch the model is in during training.
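One caveat (our addition, not in the original guide): plt.savefig() won't create the model_train_images folder for you, so create it once before training starts:

import os

# Create the output directory for the per-epoch plots if it doesn't exist
os.makedirs('model_train_images', exist_ok=True)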
Now, let's use this custom callback again, providing a model name in addition to the x_test and y_test sets:
checkpoint = keras.callbacks.ModelCheckpoint("california.h5", save_best_only=True)
performance = PerformancePlotCallback(x_test, y_test, "california_model")

# Model definition and compilation...

history = model.fit(
    x_train, y_train,
    epochs=150,
    validation_split=0.2,
    callbacks=[checkpoint, performance]
)
The PerformancePlotCallback goes into full swing, and in the designated folder generates an image of the performance on each epoch. The model_train_images folder is now filled with 150 plots:
[Image: grid of the per-epoch prediction plots saved to model_train_images/]
You can now use your favorite tool to stitch the images together into a video or a Gif file, or simply peruse them manually. Here's a Gif of the model we've built training on this data:
[Image: GIF of the prediction scatter plot evolving over the training epochs]
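If you'd rather script that stitching step than reach for an external tool, here's a minimal sketch using the imageio library (an assumption of ours - any GIF tool works; the glob pattern matches the filenames produced by the callback above, since plt.savefig() defaults to .png):

import glob
import imageio

# Collect the per-epoch plots in training order; the epoch number is the
# last underscore-separated token of each filename
files = sorted(
    glob.glob('model_train_images/california_model_*.png'),
    key=lambda path: int(path.rsplit('_', 1)[-1].split('.')[0])
)

# Read each frame and write them out as an animated GIF (0.1s per frame)
frames = [imageio.imread(f) for f in files]
imageio.mimsave('california_model_training.gif', frames, duration=0.1)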
Conclusion
In this guide, we've built a simple model to predict the price of a house in the California Housing Dataset with okay-ish accuracy. We've then taken a look at how to write a custom Keras callback to test a Deep Learning model's performance and visualize it during training, on each epoch.
We've proceeded to save these images to the disk and created a Gif from them, giving us a different perspective on the training process than the one we get from analyzing the learning curves of a model.
November 17, 2021 11:30 AM UTC
Quansight Labs Blog
A vision for extensibility to GPU & distributed support for SciPy, scikit-learn, scikit-image and beyond
Over the years, array computing in Python has evolved to support distributed arrays, GPU arrays, and various other kinds of arrays that work with specialized hardware, carry additional metadata, or use different internal memory representations. The foundational library for array computing in the PyData ecosystem is NumPy. But NumPy alone is a CPU-only library - and a single-threaded one at that - and in a world where it's possible to get a GPU or a CPU with a large core count in the cloud cheaply or even for free in a matter of seconds, that may not seem enough. For the past couple of years, a lot of thought and effort has been spent on devising mechanisms to tackle this problem and evolve the ecosystem gradually towards a state where PyData libraries can run on a GPU, as well as in distributed mode across multiple GPUs.
We feel like a shared vision has emerged, in bits and pieces. In this post, we aim to articulate that vision and suggest a path to making it concrete, focusing on three libraries at the core of the PyData ecosystem: SciPy, scikit-learn and scikit-image. We are also happy to share that AMD has recognized the value of this vision, and is partnering with Quansight Labs to help make it a reality.
Read more… (13 min remaining to read)
November 17, 2021 10:00 AM UTC
eGenix.com
PyDDF Python Herbst Sprint 2021 (Online)
The following announcement is for a Python sprint in Düsseldorf, Germany; the original was published in German and is translated here.
Announcement
PyDDF Python Autumn Online Sprint 2021
Saturday, 20.11.2021, 10:00-18:00
Sunday, 21.11.2021, 10:00-18:00
Information
The Python Meeting Düsseldorf (PyDDF) is organizing an online Python sprint weekend. The sprint takes place online on our Discord server on the weekend of 20-21.11.2021. The link to the Discord server will be distributed via the Meetup registration.
The following topic areas have already been suggested:
- Parsing PDFs of the NRW state parliament
Registration and further information
You can find everything else, including registration, on the Meetup sprint page:
Participants should also join the PyDDF Telegram group, since that's where we coordinate:
About the Python Meeting Düsseldorf
The Python Meeting Düsseldorf is a regular event in Düsseldorf aimed at Python enthusiasts from the region.
Our PyDDF YouTube channel, where we publish videos of the talks after each meeting, offers a good first overview of the talks. The meeting is organized by eGenix.com GmbH, Langenfeld, in cooperation with Clark Consulting & Research, Düsseldorf.
Marc-Andre Lemburg, eGenix.com
November 17, 2021 09:00 AM UTC
ItsMyCode
UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte
ItsMyCode |
The UnicodeDecodeError occurs mainly while importing and reading CSV or JSON files in your Python code. If the provided file has some special characters, Python will throw a UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte.
What is UnicodeDecodeError ‘utf8’ codec can’t decode byte?
The UnicodeDecodeError normally happens when decoding a byte string with a certain codec. Since a codec maps only a limited number of byte sequences to Unicode characters, an illegal sequence of (non-ASCII) bytes will cause the codec-specific decode() to fail.
When importing and reading a CSV file, Python tries to convert a byte array (bytes that it assumes to be a utf-8-encoded string) to a Unicode string (str). It is a decoding process according to UTF-8 rules. When it tries this, it encounters a byte sequence that is not allowed in utf-8-encoded strings (for example, the 0xa5 at position 0 from the error above).
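You can reproduce the error in isolation with a single stray byte (a sketch of ours):

>>> b'\xa5'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 0: invalid start byte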
Example
import pandas as pd
a = pd.read_csv("filename.csv")
Output
Traceback (most recent call last):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 2: invalid start byte
There are multiple solutions to resolve this issue, and it depends on the different use cases. Let’s look at the most common occurrences, and the solution to each of these use cases.
Solution for Importing and Reading CSV files using Pandas
If you are using pandas to import and read the csv files, then you need to use the proper encoding type or set it to unicode_escape to resolve the UnicodeDecodeError as shown below.
import pandas as pd

data = pd.read_csv("C:\\Employess.csv", encoding='unicode_escape')
print(data.head())
Solution for Loading and Parsing JSON files
If you are getting a UnicodeDecodeError while reading and parsing JSON file content, it means you are trying to parse a JSON file that is not in UTF-8 format. Most likely, it is encoded in ISO-8859-1. Hence, try the following encoding while loading the JSON file (note that this snippet is Python 2, as it uses the unicode() built-in), which should resolve the issue.
json.loads(unicode(opener.open(...), "ISO-8859-1"))
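On Python 3, a minimal equivalent (with a placeholder filename) is to read the raw bytes and decode them explicitly before parsing:

import json

# Read raw bytes, decode with the codec the file was actually written in,
# then parse the resulting text as JSON
with open('data.json', 'rb') as f:
    data = json.loads(f.read().decode('ISO-8859-1'))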
Solution for Loading and Parsing any other file formats
In the case of any other file format, such as logs, you can open the file in binary mode and then continue the file read operation. If you specify only read mode, the file content is read as a string and doesn't get decoded properly.
You can do the same even for CSV, log, txt, or Excel files.
with open(path, 'rb') as f:
    text = f.read()
Alternatively, you can use the decode() method on the file content and specify errors='replace' to resolve the UnicodeDecodeError:
with open(path, 'rb') as f:
    text = f.read().decode(errors='replace')
When you call .decode() on a unicode string, Python 2 tries to be helpful and decides to encode the Unicode string back to bytes (using the default encoding) so that you have something that you can really decode. This implicit encoding step doesn't use errors='replace', so if there are any characters in the Unicode string that aren't in the default encoding (probably ASCII), you'll get a UnicodeEncodeError.
(Python 3 no longer does this as it is terribly confusing.)
Check the type of message and assuming it is indeed Unicode, work back from there to find where it was decoded (possibly implicitly) to replace that with the correct decoding.
Solution for decoding the string contents efficiently
If you encounter a UnicodeDecodeError while processing a string variable, you can simply use the encode() method to encode it into UTF-8 format, which in turn resolves the error.
str.encode('utf-8').strip()
The post UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte appeared first on ItsMyCode.
November 17, 2021 08:36 AM UTC
Talk Python to Me
#341: 25 Pandas Functions You Didn’t Know Existed
Do you do anything with Jupyter notebooks? If you do, there is a very good chance you're working with the pandas library. This is one of THE primary tools of anyone doing computational work or data exploration with Python. Yet, this library is massive and knowing the idiomatic way to use it can be hard to discover.
That's why I've invited Bex Tuychiev to be our guest. He wrote an excellent article highlighting 25 idiomatic Pandas functions and properties we should all keep in our data toolkit. I'm sure there is something here for all of us to take away and use pandas that much better.
Links from the show:
- Bex Tuychiev: https://www.linkedin.com/in/bextuychiev/
- Bex's Medium profile: https://ibexorigin.medium.com/
- Numpy 25 functions article: https://towardsdatascience.com/25-numpy-functions-you-never-knew-existed-p-guarantee-0-85-64616ba92fa8
- missingno package: https://coderzcolumn.com/tutorials/data-science/missingno-visualize-missing-data-in-python
- Watch this episode on YouTube: https://www.youtube.com/watch?v=ZBC1Q_kYIvE
- Episode transcripts: https://talkpython.fm/episodes/transcript/341/25-pandas-functions-you-didn-t-know-existed
Stay in touch with us:
- Subscribe on YouTube (for live streams): https://talkpython.fm/youtube
- Follow Talk Python on Twitter: @talkpython
- Follow Michael on Twitter: @mkennedy
Sponsors:
- Shortcut: https://shortcut.com/talkpython
- Linode: https://talkpython.fm/linode
- AssemblyAI: https://talkpython.fm/assemblyai
- Talk Python Training: https://talkpython.fm/training
November 17, 2021 08:00 AM UTC
Python Bytes
#259 That argument is a little late-bound
Watch the live stream: https://www.youtube.com/watch?v=IB4RBvz8sXU
About the show
Sponsored by Shortcut - Get started at shortcut.com/pythonbytes
Special guest: Renee Teate
Michael #1: pypi-changes
- via Brian Skinn, created by Bernát Gábor
- Visually shows you which dependencies in an environment are out of date.
- See the age of everything you depend upon.
- Also, a shoutout again to pipdeptree: https://github.com/naiquevin/pipdeptree
Brian #2: Late-bound argument defaults for Python (https://lwn.net/SubscriberLink/875441/c29a2006cf489b7f/)
- Default values for arguments to functions are evaluated at function definition time.
- If a value is a short expression that uses a variable, that variable is in the scope of the function definition.
- The expression cannot use other arguments.
- Example of what you cannot do:

    def foo(a, b=None, c=len(a)):
        ...

- There's a proposal by Chris Angelico to add a =: operator for late default evaluation.
  - The syntax is still up in the air; => and ?= have also been discussed.
- However, it's non-trivial to add syntax to an established language, and this article notes: "At first blush, Angelico's idea to fix this 'wart' in Python seems fairly straightforward, but the discussion has shown that there are multiple facets to consider. It is not quite as simple as 'let's add a way to evaluate default arguments when the function is called'—likely how it was seen at the outset. That is often the case when looking at new features for an established language like Python; there is a huge body of code that needs to stay working, but there are also, sometimes conflicting, aspirations for features that could be added. It is a tricky balancing act."
Renee #3: pandas.read_sql (https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html)
- Since I wrote my book SQL for Data Scientists, I've gotten several questions about how I use SQL in my Python scripts. It's really simple:
- You can save your SQL as a text file and then import the dataset into a pandas dataframe to do the rest of your data cleaning, feature engineering, etc.
- Pandas has a built-in way to use SQL as a data source.
- You set up a connection to your database using another package like SQLAlchemy, then send the SQL string and the connection to the pandas.read_sql function.
- It returns a dataframe with the results of your query.
Michael #4: pyjion (https://talkpython.fm/episodes/show/340/time-to-jit-your-python-with-pyjion)
- by Anthony Shaw
- Pyjion is a JIT for Python based upon CoreCLR.
- Check out https://live.trypyjion.com/ to see it in action.
- Requires Python 3.10 and .NET Core 6.
- Enable it with just a couple of lines:

    >>> import pyjion
    >>> pyjion.enable()

Brian #5: Tips for debugging with print() (https://adamj.eu/tech/2021/10/08/tips-for-debugging-with-print/)
- Adam Johnson
- 7 tips altogether, but I'll highlight a few I loved reading about.
- Debug variables with f-strings and =:
  - print(f"{myvar=}")
  - Saves typing over print(f"myvar={myvar}") with the same result.
- Make output "pop" with emoji (Brilliant!):
  - print("👉 spam()")
  - Here are some cool ones to use: 👉 ❌ ✅
- Use rich.print or pprint for pretty printing.
  - Also, a cool rename example to have both print and rich.print available: from rich import print as rprint
  - Both rich.print and pprint.pprint are essential for printing structures nicely.
- Brian's addition:
  - In pytest, failed tests report the stdout contents by default from the test.
  - I love the idea of using rich.print and emoji for print statements in tests themselves.
  - Even though you can use --showlocals to print local variables for failed tests, having control of some output to help you debug something if it ever fails is a good thing.
Renee #6: SHAP (https://shap.readthedocs.io/en/latest/index.html) and the beeswarm plot
- Brought to my attention by my team member Brian Richards at HelioCampus, and now they're becoming a standard part of some of our model evaluation/explanation outputs.
- SHapley Additive exPlanations:
  - Shapley values from game theory.
- Additive: "SHAP values of all the input features will always sum up to the difference between baseline (expected) model output and the current model output for the prediction being explained"
  - Negative/positive - pushing the prediction towards one class or the other.
    - There's a SHAP value for every feature for every prediction.
  - Waterfall plots.
  - Scatterplots of input value vs SHAP value.
  - SHAP values can be outputted and pulled into other tools (I use them in Tableau).
- Correlation, not causation.
- Beeswarm plots (https://shap.readthedocs.io/en/latest/example_notebooks/api_examples/plots/beeswarm.html?highlight=beeswarm) for feature importance with input value vs SHAP value.
Extras
Brian:
- Matthew Feickert recommended pip index, and specifically pip index versions, as a cool thing to try.
  - Example: pip index versions pyhf reports:
    - all versions of pyhf available on PyPI
    - the latest version
    - your installed version
  - It's currently "experimental", so conceptually the PyPA could yank it. But we like it. I hope it stays.
Michael:
- My PyCharm webcast: https://twitter.com/pycharm/status/1460975261948723208?s=12
Renee:
- My book and companion website with interactive query editor: SQL for Data Scientists (https://sqlfordatascientists.com/)
Joke: git messages (https://twitter.com/fvoron/status/1455979278936444930)
November 17, 2021 08:00 AM UTC
Python⇒Speed
Speed up your Conda installs with Mamba
Conda installs can be very very very slow.
Every time you run conda install:
- It has to collect the package metadata.
- It has to solve the environment. … maybe you can take a coffee break here, or go work on a jigsaw puzzle to relax …
- It has to download packages.
- Eventually, finally, it will install the packages it downloaded.
By the time this is all done you’ve probably forgotten what it was you were trying to do in the first place. To be fair, Conda has gotten faster in the past few releases, but it’s still far from being fast.
Luckily, a new project called Mamba has set out to reimplement Conda functionality while running much faster. So let’s see:
- How much faster Mamba is.
- How to switch to Mamba.
- Using it in Docker to make image builds even faster.
November 17, 2021 12:00 AM UTC
November 16, 2021
PyCoder’s Weekly
Issue #499 (Nov. 16, 2021)
#499 – NOVEMBER 16, 2021
View in Browser »
The PSF Is Searching for Its Next Executive Director
After announcing earlier this summer that Ewa Jodlowska is leaving after ten years of service, the PSF has begun its search for the organization’s next Executive Director. Interested? You can apply here.
PYTHON SOFTWARE FOUNDATION
Selecting a Programming Language Can Be a Form of Premature Optimization
“Have you ever been told that Python couldn’t be used for a project because it wouldn’t be fast enough? I have, and I find it a bit frustrating as big banks, YouTube, Instagram, and plenty of other places that are performance-sensitive still manage to select Python and be happy.”
BRETT CANNON opinion
DataStax Astra DB, Built on Apache Cassandra™ Get 80 Gigabytes of Storage Free Every Month
DataStax Astra DB, built on Cassandra - now made easy in the cloud. Create a free Cassandra database in minutes for global scale on a startup budget. Get 80 gigabytes of storage free every month! Register now →
DATASTAX sponsor
Advanced Visual Studio Code for Python Developers
In this tutorial, you’ll learn how you can configure, extend, and optimize Visual Studio Code for a more effective and productive Python development environment. By digging into this customizable code editor and IDE, you’ll put yourself on track to be a VS Code power user.
REAL PYTHON
Securely Deploy a Django App With Gunicorn, Nginx, & HTTPS
Ready to take your Django app beyond development? Learn how to securely deploy your Django web app in production over HTTPS with Gunicorn and Nginx. Along the way, you’ll explore how HTTP headers can fortify your app’s security.
REAL PYTHON
How Python’s list Data Structure Really Works
This article explores the nuts and bolts of Python list operations, their time complexity, and underlying data structures.
ANTON ZHIYANOV • Shared by Anton Zhiyanov
Python Jobs
Senior Backend Software Engineer (Anywhere)
Senior Python Engineer (Anywhere)
Senior Software Engineer Backend (USA)
Full Stack Software Engineer Django/Postgres/React (Washington D.C., USA)
Senior Software Engineer (Washington D.C., USA)
Python Backend Engineer in Healthcare (Hasselt, Belgium)
Articles & Tutorials
The Legacy of OLPC and Charismatic Pitfalls in Teaching Programming
Do you remember the One Laptop Per Child program? What went wrong, and what can we learn from the program’s failure? What are the potential pitfalls of charismatic technology, and how can we avoid them when introducing students to programming? This week on the show, former guest Al Sweigart and author Morgan Ames are here to talk about her book “The Charisma Machine - The Life, Death, and Legacy of One Laptop per Child.”
REAL PYTHON podcast
Ruby vs Python Comes Down to the for Loop
“Contrasting how each language handles iteration helps understand how to work effectively in either.” Related discussion of this article on Hacker News.
DOUG TURNBULL
Find Out Why Scout’s a Developer’s Best Friend With a Free 14-Day Trial, No Credit Card Needed
Scout uses tracing logic to tie bottlenecks to source code so developers can get back to building great products instead of wasting time fixing performance issues. Real-time alerting gives you the insights you need in 4 min or less! Deploy today and we’ll donate $5 to the OSS project of your choice →
SCOUT APM sponsor
Create Distance Matrix Using Google Maps APIs
This article describes step by step how to use the Google Maps Distance Matrix API, how to parse its data to create a distance table and finally, how to store the parsed data in a data base.
JUAN ACOSTA • Shared by Juan Acosta
Cython, Rust, and More: Choosing a Language for Python Extensions
You can write Python extensions with Cython, Rust, and many other tools. In this article you’ll learn which one you should use, depending on your particular use case and needs.
ITAMAR TURNER-TRAURING
Async Python Is Not Faster
“Async Python is slower than ‘sync’ Python under a realistic benchmark. A bigger worry is that async frameworks go a bit wobbly under load.”
CAL PATERSON
Monads and Python
“Porting Monads to Python is a common hobby. But should we really do it?”
ROBERT COLLINS
CData Software – The Easiest Way to Connect Python With Data
Simple Python data access to more than 250 cloud applications, and data sources. Connect, Integrate, & Automate your data from Python, or any other application or tool.
CDATA SOFTWARE sponsor
Projects & Code
Events
Weekly Real Python Office Hours Q&A (Virtual)
November 15, 2021
REALPYTHON.COM
Women Who Code CONNECT Forward 2021
November 18 to November 20, 2021
WOMENWHOCODE.DEV
PyData Bristol Meetup
November 18, 2021
MEETUP.COM
Python Northwest
November 18, 2021
PYNW.ORG.UK
PyCon APAC 2021
November 19 to November 24, 2021
PYCON.ORG
Xtreme Python
November 24 to November 25, 2021
XTREMEPYTHON.DEV
Happy Pythoning!
This was PyCoder’s Weekly Issue #499.
View in Browser »
[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]
November 16, 2021 07:30 PM UTC
ItsMyCode
Python ValueError: could not convert string to float
ItsMyCode |
If you try to convert a string object that doesn't hold a float-like value into a floating point number in Python, you will get a ValueError: could not convert string to float. Usually this happens when the string contains spaces or commas; Python will then throw a ValueError while parsing the string into a float.
In this article, we will take a look at what this error means and how to fix this error in your code with examples.
ValueError: could not convert string to float
If we are reading and processing the data from excel or csv, we get the numbers in the form of a string, and in the code, we need to convert from string to float.
Python has a built-in float() method that can parse the string to a floating-point number. This method will be useful when we need to perform a mathematical operation on a string object.
The float() method only allows you to convert strings that hold float-like numbers. This means that you cannot convert a value if
- A value contains spaces
- A value contains a comma
- A value contains special characters
Exception could not convert string to float
order_value = '12,000'
tax_percentage = 4
tax_amount = (float(order_value)*(tax_percentage / 100))
print("The total tax amount is ", tax_amount)
Output
Traceback (most recent call last):
File "c:/Projects/Tryouts/main.py", line 4, in <module>
tax_amount = (float(order_value)*(tax_percentage / 100))
ValueError: could not convert string to float: '12,000'
Let’s take a simple example to demonstrate the ValueError exception. In the below code, we have the total order value in terms of USD, and we are accepting this in string format and performing a tax calculation.
If you see the above code, the order value has a comma-separated numeric value, and while parsing into a float, Python will throw ValueError: could not convert string to float.
There are a few other scenarios where you could get ValueError.
- Converting an empty string into a floating-point number
- Converting a non-floating string to a floating-point number
Fix ValueError: could not convert string to float
There are multiple ways to resolve the issue. Let’s take a look at each of the solutions.
Solution 1: Ensure the string has a valid floating value
The easiest way is to clean up the data or pass it in the correct format if we already know the data format before converting into float.
If the value has a comma, space, or any special characters, then it needs to be processed before converting it into a float, as in the sketch below.
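For instance, a minimal sketch (ours) of that cleanup step for a comma-separated amount:

order_value = '12,000'
cleaned = order_value.replace(',', '').strip()  # '12000'
print(float(cleaned))  # 12000.0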
In the below code, we are storing a valid float number as a string, and later we are converting that into floating-point to calculate tax.
Example:
order_value = '12000'
tax_percentage = 4
tax_amount = (float(order_value)*(tax_percentage / 100))
print("The total tax amount is ", tax_amount)
Output
The total tax amount is 480.0
Solution 2: Use try-except
The best way is to handle the exception in case of an invalid data format. In the code below, Python runs the code in the try block; if the conversion fails, it runs the except block instead.
Example:
order_value = '12,000'
tax_percentage = 4
try:
    tax_amount = (float(order_value)*(tax_percentage / 100))
    print("The total tax amount is ", tax_amount)
except ValueError:
    print("Order value is invalid")
Output
Order value is invalid
The post Python ValueError: could not convert string to float appeared first on ItsMyCode.