
Planet Python

Last update: February 25, 2015 01:46 PM

February 25, 2015


Machinalis

Reading TechCrunch

When we discussed Information Extraction and IEPY with professional peers, we noticed that the approach was often unknown to those who could benefit from it the most. Its main beneficiaries are those with large volumes of unstructured or poorly structured text, where it is very costly to go through the text manually to extract relationships (in the VC industry, for example: funding, acquisitions, or the creation or opening of offices) between entities (companies, investment funds, people, and so on).

To create an example aimed at those with perhaps less of a technical background, we processed the news articles from TechCrunch News, the main technology blog in the United States, looking for funding relationships involving U.S. companies. We published the result and found some interesting things:

VC Industry and Specialized Press

The publication of news about funding may result from investigation by specialized journalists, or may be pushed by the companies themselves, who manage to place their news within mainstream media content.

Checking the funding-related content in TechCrunch News posts and comparing it to more complete databases can then show us the editorial policies these journalists follow, or how efficient companies are at placing their own content.

For example, in the funded-companies vs. average-funding chart (currently one of the main discussion topics) you can see a growing gap between the events covered in the more general database (CrunchBase) and those covered by TechCrunch News.

Since last year there has been a tendency to cover events where the funding amount was greater than the CrunchBase average. Based on this data, higher-level funding events appear to attract more attention from journalists than below-average ones.

Considering geographical distribution of events coverage




Some of the highlights we can see include:

And so on.

In summary, what was the advantage of this approach?  

If you wanted an overall view, you could include content from other blogs (Gigaom, VentureBeat, TWSJ, Forbes Tech, Mashable, Wired, The Verge, etc.) with no extra effort, once the tool has learned to identify and predict relationships (e.g. funding to companies).

And of course, as the demo outlines, we were able to read several thousand news articles, extract the information to build a database, and make the demo without arousing the deep murderous rage that manually reading ~100k articles looking for that relationship can awaken.



February 25, 2015 01:14 PM


Ludovic Gasc

Open letter for the sync world

These days, I've seen more and more hate directed at the async community in Python, especially around AsyncIO.
I think this is sad and counter-productive.
I feel that for some people, frustration or misunderstanding about the place of this new tool might be the cause, so I'd like to share some of my thoughts about it.


Just a proven pattern, not a "who has the biggest d*" contest


Some micro-benchmarks have been published that try to show that AsyncIO isn't really efficient.
We all know that benchmarks can be made to prove almost anything, and that the world isn't black or white.
So, just for the sake of completeness, here are some macro-benchmarks based on Web application examples: http://blog.gmludo.eu/2015/02/macro-benchmark-with-django-flask-and-asyncio.html


Now, before starting a ping-pong match to determine who has the biggest, please read further:

The asynchronous/coroutine pattern isn't some fancy new thing that decreases developer productivity and performance.
In fact, the idea of asynchronous, non-blocking I/O has been around in many OSes and programming languages for years.
In Linux, for example, asynchronous I/O support was added to kernel 2.5 back in 2003, and you can even find specifications dating back to 1997 (http://pubs.opengroup.org/onlinepubs/007908799/xsh/aio.h.html).
It started to gain more visibility with (amongst others) NodeJS a couple of years ago.
This pattern is now included in most new languages (Go...) and has been made available in older languages (Python, C#...).

Async isn't a silver bullet, especially for intensive calculations, but for I/O, at least in my experience, it is much more efficient.
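To make the I/O point concrete, here is a minimal sketch (using the modern async/await syntax; the AsyncIO of early 2015 used @asyncio.coroutine and yield from) showing many simulated I/O waits overlapping on a single thread:

```python
import asyncio
import time

async def fake_io(delay):
    # Stands in for a network call or DB query: the coroutine is
    # suspended while "waiting", freeing the event loop for others.
    await asyncio.sleep(delay)
    return delay

async def main():
    # Ten 0.1s "I/O operations" run concurrently on one thread:
    # total wall time is ~0.1s instead of ~1s done sequentially.
    return await asyncio.gather(*(fake_io(0.1) for _ in range(10)))

start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start
print(len(results), elapsed)
```

This is exactly why the gains show up for I/O-bound work and not for CPU-bound work: nothing runs faster, but nothing sits idle waiting either.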


The lengthy but successful maturation process of a new standard


In the Python world, a number of alternatives were available (Gevent, Twisted, Eventlet, libevent, Stackless...), each with their own strengths and weaknesses.
Each of them went through a maturation process and could eventually be used in real production environments.

It was really clever of Guido to take the good ideas from all these async frameworks to create AsyncIO.
Instead of a number of different frameworks each reinventing the wheel on its own island,
AsyncIO should give us a "lingua franca" for doing async in Python.
This is pretty important because once you enter the async world, all your usual tools and libs (like your favourite DB lib) should also be async compliant.
AsyncIO isn't just a library; it will become the "standard" way to write async code in Python.


If async means rewriting my perfectly working code, why should I bother?


To integrate AsyncIO cleanly into your library or application, you have to rethink its internal architecture.
When you start a new project in "async mode", you can't keep part of it sync: to get all the async benefits, everything should be async.

But this isn't mandatory from day 1: you can start simple, and port your code to the async pattern step by step.
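As a sketch of that incremental path (the names here are illustrative, not from any particular project), you can keep your existing blocking code and call it from async code through an executor until you get around to porting it:

```python
import asyncio
import time

def legacy_blocking_query():
    # Pretend this is your existing, perfectly working sync code
    # (e.g. a blocking DB call) that you don't want to rewrite yet.
    time.sleep(0.05)
    return {"status": "ok"}

async def handler():
    loop = asyncio.get_running_loop()
    # Bridge step: run the blocking call in a thread pool so the
    # event loop stays responsive while this part is still sync.
    return await loop.run_in_executor(None, legacy_blocking_query)

print(asyncio.run(handler()))  # → {'status': 'ok'}
```

Once the hot paths are ported to native coroutines, the executor bridge disappears piece by piece.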

I can understand some of the haters' reactions: the Internet is a big swarm with a lot of trends and hype.
In the end, few tools and patterns will really survive the fire of production.
Meanwhile, you have already written a lot of perfectly working code, and you obviously don't want to rewrite it just for the promises of the latest buzzword.

It's like object-oriented programming years ago: it suddenly became the new "proper" way of writing your code (some said),
and you couldn't be object-oriented and procedural at the same time.
Years later, procedural isn't completely dead, because OO sometimes brings unnecessary overhead.
It really depends on what sort of thing you are writing (size matters!).
On the other hand, in 2015, who writes a full-blown application in procedural style only?

I think that one day it will be the same for the async pattern.
It is always better to drive change than to endure it.
Think organic: in the long term, it is not the strongest that survives, nor the most intelligent.
It is usually the one most open and adaptive to change.


Buzzword, or real paradigm change ?


We don't know for sure whether the async pattern is only a temporary fashion buzzword or a real paradigm shift in IT, the way virtualization has become a de-facto standard over the last few years.

But my feeling is that it is here to stay, even if it won't be relevant for all Python projects.
I think it will become the right way to build efficient and scalable I/O-bound projects.

For example, in an Internet (network) driven world, I see more and more projects centred around piping between cloud-based services.
For this type of development, I'm personally convinced a paradigm shift has become unavoidable, and for Pythonistas AsyncIO is probably the right horse to bet on.



Does anyone really care, or "will I be paid more"?


Let's face it: besides your fellow geeks, nobody cares about the tools you are using.
Your users just want features for yesterday, as few bugs as possible, and an application that is fast and responsive.
Who cares if you use async, or some other hoodoo-voodoo black magic, to reach that goal?

I think that by starting a "religious war" between sync and async Python developers, we would all waste our (precious) time.
Instead, we should cultivate healthy emulation between Pythonistas and build solutions that increase real-world performance and stability.
Then let Darwin show us the long-term path, and adapt to it.

In the end, the whole Python community will benefit if Python is considered a great language for writing business logic with ease AND with brute performance.
We are all tired of hearing people in other communities say that Python is slow, and we are all convinced this is simply not true.

This is a communication war that the Python community has to win as a team.

PS: Special thanks to Nicolas Stein, aka Nike, for reviewing this text and for his precious advice in general on taking a scientific approach to problems.

February 25, 2015 12:56 PM

Macro-benchmark with Django, Flask and AsyncIO (aiohttp.web+API-Hour)

Disclaimer: If you have some bias against and/or dislike AsyncIO, please read my previous blog post before starting a war.

Tip: If you don't have the time to read the text, scroll down to see graphics.


Context of this macro-benchmark

Today, I propose benchmarking an HTTP daemon based on AsyncIO, and comparing the results with Flask and Django versions.

For those who haven't followed AsyncIO news, aiohttp.web is a light Web framework based on aiohttp. It's like Flask, but with fewer internal layers.
aiohttp is the implementation of HTTP on top of AsyncIO.

Moreover, API-Hour helps you run multiprocess daemons with AsyncIO.
With this tool, we can compare Flask, Django and aiohttp.web under the same conditions.
This benchmark is based on a concrete need of one of our customers: they wanted a REST/JSON API to interact with their telephony server, based on Asterisk.
One of the WebServices gives the list of agents with their status. This WebService is heavily used because they call it from their public Website (which itself has serious traffic) to show who is available.
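The shape of such an async JSON endpoint can be sketched with nothing but the standard library (this is not the actual aiohttp.web/API-Hour code from the benchmark repository; the handler and payload are illustrative):

```python
import asyncio
import json

async def handle(reader, writer):
    # Read and discard the request line; a real framework like
    # aiohttp.web parses the full request for us.
    await reader.readline()
    body = json.dumps(
        {"agents": [{"name": "alice", "status": "available"}]}
    ).encode()
    writer.write(b"HTTP/1.1 200 OK\r\n"
                 b"Content-Type: application/json\r\n"
                 b"Content-Length: " + str(len(body)).encode() +
                 b"\r\n\r\n" + body)
    await writer.drain()
    writer.close()

async def main():
    # Serve the "agents list" endpoint; each connection is handled
    # as a coroutine, so slow clients don't block the others.
    server = await asyncio.start_server(handle, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

# asyncio.run(main())  # left commented so the sketch can be imported
```

The point is the concurrency model, not the HTTP plumbing: one process, one loop, thousands of in-flight connections.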

First, I made an HTTP daemon based on Flask and Gunicorn, which gave honourable results. Later on, I replaced the HTTP part and pushed into production a daemon based on aiohttp.web and API-Hour.
A subset of these daemons is used for this benchmark.
I added a Django version because, between Django and Flask, I certainly cover 90% of the tools used by Python Web developers.

I've tried to use the same parameters for each daemon: for example, I obviously use the same number of workers, 16 in this benchmark.

I don't benchmark Django's manage.py or Flask's dev HTTP server; I use Gunicorn, as most people do in production, to compare apples with apples.

Hardware

Network benchmark

I get almost 1 Gbit/s on this network:

On Server:
$ iperf -c 192.168.2.101 -d
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 28.6 MByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.2.101, TCP port 5001
TCP window size: 28.6 MByte (default)
------------------------------------------------------------
[ 5] local 192.168.2.100 port 24831 connected with 192.168.2.101 port 5001
[ 4] local 192.168.2.100 port 5001 connected with 192.168.2.101 port 16316
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.1 sec 1.06 GBytes 903 Mbits/sec
[ 5] 0.0-10.1 sec 1.11 GBytes 943 Mbits/sec

On Client:
$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 28.6 MByte (default)
------------------------------------------------------------
[ 4] local 192.168.2.101 port 5001 connected with 192.168.2.100 port 24831
------------------------------------------------------------
Client connecting to 192.168.2.100, TCP port 5001
TCP window size: 28.6 MByte (default)
------------------------------------------------------------
[ 6] local 192.168.2.101 port 16316 connected with 192.168.2.100 port 5001
[ ID] Interval Transfer Bandwidth
[ 6] 0.0-10.0 sec 1.06 GBytes 908 Mbits/sec
[ 4] 0.0-10.2 sec 1.11 GBytes 927 Mbits/sec


System configuration
It's important to configure your PostgreSQL server as you would for production.
You also need to configure your Linux kernel to handle a lot of open sockets, and to apply some TCP tricks.
Everything is in the benchmark repository.

Client benchmark tool

From my experience with AsyncIO, Apache Benchmark (ab), Siege, Funkload and other old-fashioned HTTP benchmark tools don't hit hard enough for an API-Hour daemon.
For now, I use wrk and wrk2 for benchmarking.
wrk hits as fast as possible, whereas wrk2 hits at a constant rate.

Metrics observed

I record three metrics:
  1. Requests/sec: the least interesting of the three (see below).
  2. Error rate: the sum of all errors (socket timeouts, socket read/write errors, 5XX errors...).
  3. Reactivity: certainly the most interesting of the three; it measures the time our client will actually wait.

WebServices daemons

You can find all the source code in the API-Hour repository: https://github.com/Eyepea/API-Hour/tree/master/benchmarks
Each daemon has at least two WebServices:
To the Flask daemon, I added an /agents_with_pool endpoint, to use a database connection pool with Flask, but it doesn't work out well, as you'll see later.
To the Django daemon, I added an /agents_with_orm endpoint, to measure the overhead of using the Django ORM instead of raw SQL. Warning: I didn't find a way to generate the exact same query.

Methodology

Each daemon runs alone, to preserve resources.
Between runs, the daemon is restarted, to be sure the previous test doesn't pollute the next one.

First turn

At the beginning, to get an idea of the maximum number of HTTP queries each daemon can support, I run a quick (30-second) attack on localhost.

Warning! This benchmark doesn't represent what you would get in production, because there is no network limitation or latency; it's only for calibration.

Simple JSON document

In each daemon's folder in the benchmarks repository, you can read the output of each wrk run.
To simplify reading, I summarize the captured values in a table and graphs:


                      Requests/s   Errors   Avg Latency (s)
Django+Gunicorn            70598     4489              7.7
Flask+Gunicorn             79598     4433            13.16
aiohttp.web+API-Hour      395847        0             0.03

[Graphs: Requests per second (higher is better); Errors (lower is better); Latency in seconds (lower is better)]

Agents list from database


                         Requests/s   Errors   Avg Latency (s)
Django+Gunicorn                 583     2518            0.324
Django ORM+Gunicorn             572     2798            0.572
Flask+Gunicorn                  634     2985            13.16
Flask (connection pool)        2535    79704            12.09
aiohttp.web+API-Hour           4179        0            0.098

[Graphs: Requests per second (higher is better); Errors (lower is better); Latency in seconds (lower is better)]


 Conclusions for the next round

Under high load, Django doesn't behave the same way as Flask: both handle more or less the same request rate, but Django penalizes the overall latency of HTTP queries less. The drawback is that its slow HTTP queries are very slow (26.43s for Django compared to 13.31s for Flask).
I removed the Django ORM test for the next round, because the generated SQL query isn't exactly the same and the performance difference from the raw SQL query is negligible.
I also removed the Flask DB connection pool, because its error rate is too high compared to the other tests.

Second round

Here, I use wrk2, and changed the run time to 5 minutes.
A longer run time is very important, because resource availability can change over time.
There are at least two reasons for this:

1. Your test environment runs on top of an OS that continues its activity during the test.
You therefore need a long run to be less sensitive to transient use of your test machine's resources by other things,
like another OS daemon or a cron job triggering meanwhile.

2. The ramp-up of your test will gradually consume more resources at different levels: at the level of your Python scripts and libs,
as well as at the level of your OS / (virtual) machine.
This decrease in available resources will not necessarily be instantaneous, nor linear.
This is a typical source of bad after-deployment surprises in production.
Here too, to be as close as possible to a production scenario, you need to give your test time to settle into a steady state, eventually saturating some resources.
Ideally you'd saturate the network first (which in this case is like winning the jackpot).

Here, I'm testing at a constant 4000 queries per second, this time over the network.

Simple JSON document


                      Requests/s   Errors   Avg Latency (s)
Django+Gunicorn             1799    26883               97
Flask+Gunicorn              2714    26742               52
aiohttp.web+API-Hour        3995        0            0.002

[Graphs: Requests per second (higher is better); Errors (lower is better); Latency in seconds (lower is better)]

Agents list from database


                      Requests/s   Errors   Avg Latency (s)
Django+Gunicorn              278    37480            141.6
Flask+Gunicorn               304    40951            136.8
aiohttp.web+API-Hour        3698        0             7.84

[Graphs: Requests per second (higher is better); Errors (lower is better); Latency in seconds (lower is better)]

(Extra) Third round

For fun, I used the same setup as the second round, but with only 10 requests/second for 30 seconds, to see whether under a low load the sync daemons could be quicker, since you then pay the AsyncIO overhead.

Agents list from database


                      Requests/s   Errors   Avg Latency (s)
Django+Gunicorn               10        0          0.01936
Flask+Gunicorn                10        0          0.01874
aiohttp.web+API-Hour          10        0          0.00642

[Graph: Latency in seconds (lower is better)]

Conclusion

AsyncIO with aiohttp.web and API-Hour increases the number of requests per second, but more importantly, you get no socket errors nor 5XX errors, and the waiting time for each user is much better, even under low load. This benchmark uses an ideal network setup, and therefore doesn't cover the much worse scenario where your client arrives over a slow network (think smartphone users) at your Website.

It has been said often: if your webapp is your business, reducing waiting time is a key win for you:

Some clues to improve AsyncIO performances

Even if this looks like good performance, we shouldn't rest on our laurels; we can certainly find more optimizations:

  1. Use an alternative event loop: I've tried replacing the AsyncIO event loop and network layer with aiouv and quamash. For now, it doesn't really have a huge impact; maybe it will in the future.
  2. Multiplex protocols from frontend to backend: HTTP/2 is a multiplexing protocol, meaning you can stack several HTTP queries without waiting for the first response. This pattern should increase AsyncIO performance, but that must be validated by a benchmark.
  3. If you have another idea, don't hesitate to post it in the comments.

Don't take architectural decisions based on micro-benchmarks

It's important to be very cautious with benchmarks, especially micro-benchmarks. Check several different benchmarks, using different scenarios, before settling on an architecture for your application.

Don't forget this is all about IO-bound

If I were working for an organisation with a lot of CPU-bound projects (such as a scientific organisation, for example), my speech would be totally different.
But my day-to-day challenges are more about I/O than about CPU, probably like those of most Web developers.

Don't simply take me as a mentor. The needs and problems of one person or organisation are not necessarily the same as yours, even if that person is considered a "guru" in one open source community or another.

We should all try to keep a rational, scientific approach, instead of a religious one, when selecting our tools.
I hope this post will give you some ideas to experiment with. Feel free to share your tips for increasing performance; I'd be glad to include them in my benchmarks!

I hope that these benchmarks will be an eye-opener for you.

February 25, 2015 11:50 AM


PyPy Development

Experiments in Pyrlang with RPython

Pyrlang is an Erlang BEAM bytecode interpreter written in RPython.

It implements approximately 25% of the BEAM instructions. It supports integer calculations (but not bigints), closures, exception handling, some operators on atoms, lists and tuples, user modules, and multi-process execution on a single core. Pyrlang is still in development.

There are some differences between BEAM and the VM of PyPy:

Regarding the bytecode dispatch loop, Pyrlang uses a while loop to fetch instructions and operands, call the function corresponding to each instruction, and jump back to the head of the while loop. Due to the differences between the RPython call stack and BEAM’s Y register, we decided to implement and manage the Y register by hand. PyPy, on the other hand, uses RPython’s call stack to implement Python’s call stack; as a result, the function for the dispatch loop in PyPy calls itself recursively. This does not happen in Pyrlang.

The Erlang compiler (erlc) usually compiles the bytecode instructions for function invocation into CALL (for normal invocation) and CALL_ONLY (for tail-recursive invocation). You can use trampoline semantics to implement the latter:
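In Python terms, the trampoline idea can be sketched like this (a simplification with illustrative names, not Pyrlang's actual RPython code):

```python
def call_only(func, *args):
    # CALL_ONLY-style tail call: instead of invoking func directly
    # (which would grow the host call stack on every tail call),
    # return a thunk for the driver loop to run next.
    return lambda: func(*args)

def trampoline(thunk):
    # The driver: keep invoking thunks until a non-callable
    # (the final value) comes back. Stack depth stays constant.
    while callable(thunk):
        thunk = thunk()
    return thunk

def fact(n, acc=1):
    if n == 0:
        return acc                          # final value: stops the loop
    return call_only(fact, n - 1, acc * n)  # tail-recursive step

print(trampoline(lambda: fact(10)))  # → 3628800
```

The driver loop is what turns a chain of tail calls into something the JIT can treat like an ordinary loop.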

The current implementation only inserts the JIT hint of can_enter_jit following the CALL_ONLY instruction. This means that the JIT only traces the tail-recursive invocation in Erlang code, which has a very similar semantic to the loop in imperative programming languages like Python.

We have also written a simple scheduler to implement language-level processes on a single core. The scheduler contains a runnable queue. On each iteration, the scheduler pops one element (a process object with a dispatch loop) from the queue and executes that dispatch loop. Inside the dispatch loop, however, there is a counter called the "reduction". The reduction is decremented during execution of the loop, and when it reaches 0, the dispatch loop terminates. The scheduler then pushes that element onto the runnable queue again, pops the next element from the queue, and so on.
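The scheduler just described can be sketched as follows (a toy model: instructions are simulated by a counter, and the names are illustrative rather than Pyrlang's own):

```python
from collections import deque

REDUCTION = 2000  # reduction budget per scheduling slice

class Process:
    """A language-level process whose dispatch loop runs until its
    reduction budget is spent or its program finishes."""
    def __init__(self, steps):
        self.steps = steps      # remaining "instructions" to run
        self.executed = 0

    def dispatch_loop(self, reduction):
        # Execute instructions, decrementing the reduction each time;
        # when it hits 0 the loop terminates and control returns.
        while reduction > 0 and self.steps > 0:
            self.steps -= 1
            self.executed += 1
            reduction -= 1

def scheduler(processes):
    runnable = deque(processes)
    while runnable:
        proc = runnable.popleft()       # pop one process object
        proc.dispatch_loop(REDUCTION)   # run its dispatch loop
        if proc.steps > 0:              # unfinished: push it back
            runnable.append(proc)

procs = [Process(5000), Process(100)]
scheduler(procs)
print([p.executed for p in procs])  # → [5000, 100]
```

The long process is interrupted every 2000 steps so the short one gets its turn, which is the cooperative fairness the reduction counter buys.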

We are planning to implement a multi-process scheduler for multi-core CPUs, which will require multiple schedulers and even multiple runnable queues, one per core, but that will be another story. :-)

Methods

We wrote two benchmark programs of Erlang:

  • FACT: A benchmark that calculates a factorial in a tail-recursive style; because we haven’t implemented bigints, we apply a remainder calculation to the argument for the next iteration, so the number never overflows.
  • REVERSE: The benchmark creates a reversed list of numbers, such as [20000, 19999, 19998, …], and applies a bubble sort to it.
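For reference, the FACT idea translated to Python looks roughly like this (the modulus bound is an illustrative stand-in for staying within fixed-size integers, and here it is applied to the accumulator; this is not Pyrlang's actual benchmark code):

```python
MOD = 2 ** 30  # illustrative bound standing in for the fixed-int range

def fact_mod(n, acc=1):
    # Tail-recursive factorial written as a loop: the accumulator is
    # reduced modulo the bound at every step, so it never overflows
    # even when bigints are unavailable.
    while n > 0:
        acc = (acc * n) % MOD
        n -= 1
    return acc

print(fact_mod(5))  # → 120
```

The benchmark thus exercises the tight tail-recursive loop without ever needing bigint arithmetic.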

Results

The Value of Reduction

We used REVERSE to evaluate the JIT with different values of reduction:

The X axis is the value of the reduction, and the Y axis is the execution time (in seconds).

It seems that when the value of the reduction is small, it influences the performance significantly, but as the reduction becomes larger, it increases the speed only very slightly. In fact, we use 2000 as the default reduction value (which is also the reduction value in the official Erlang interpreter).

Surprisingly, the trace is always generated, even when the reduction is very small, such as 0 (which means the dispatch loop can run only a very limited number of iterations, and the language-level process executes fewer instructions than an entire loop in one scheduler switch). The generated trace is almost the same, regardless of the reduction value.

Actually, the RPython JIT only cares what code it meets, not who executes it, so the JIT always generates the results above. The trace can even be shared among different threads if they execute the same code.

The overhead at low reduction values may come from the scheduler switching between processes too frequently, or from too-frequent switching between the bytecode interpreter and native code, but not from the JIT itself.

Here is more explanation from Armin Rigo:

“The JIT works well because you’re using a scheme where some counter is decremented (and the soft-thread interrupted when it reaches zero) only once in each app-level loop. The soft-thread switch is done by returning to some scheduler, which will resume a different soft-thread by calling it. It means the JIT can still compile each of the loops as usual, with the generated machine code containing the decrease-and-check-for-zero operation which, when true, exits the assembler.”

Fair Process Switching vs. Unfair Process Switching

We are also concerned about the timing of the reduction decrement. In our initial version of Pyrlang, we decreased the reduction value at every local function invocation, module function invocation, and BIF (built-in function) invocation, since this is what the official Erlang interpreter does. However, since the JIT in RPython basically traces the target-language loop (which in Pyrlang is the tail-recursive invocation), it is typically better to keep the loop whole across a switch of the language-level process. We modified Pyrlang so that the reduction decrement occurs only after CALL_ONLY, which is the loop boundary of the target language.

Of course, this strategy may cause an “unfair” execution among language level processes. For example, if one process has only a single long-sequence code, it executes until the end of the code. On the other hand, if a process has a very short loop, it may be executed by very limited steps then be switched out by the scheduler. However, in the real world, this “unfairness” is usually considered acceptable, and is used in many VM implementations including PyPy for improving the overall performance.

We compared these two versions of Pyrlang on the FACT benchmark. The reduction decrement behaviour is quite different because there are some BIF invocations inside the loop: in the old version the process can be suspended at loop boundaries or at any other function invocation, but in the new version it can be suspended only at loop boundaries.

We show that the strategy is effective, removing around 7% of the overhead. We also compared the two versions on REVERSE, but since there are no extra invocations inside the trace, the strategy cannot provide any performance improvement there. In the real world we believe there is usually more than one extra invocation inside a single loop, so this strategy is effective in most cases.

Comparison with Default Erlang and HiPE

We compared the performance of Pyrlang with the default Erlang interpreter and the HiPE (High Performance Erlang) compiler. HiPE is an official Erlang compiler that compiles Erlang source code to native code. It obviously improves the speed of Erlang programs, at the cost of generality.

Please note that Pyrlang is still in development, so in some situations it does less work than the default Erlang interpreter, such as not checking for integer overflow when dealing with big integers, and not checking and adding locks when accessing message queues in language-level processes, and is therefore faster. The final version of Pyrlang may be slower.

We used the two benchmark programs above, and made sure both of them executed for more than five seconds to cover the JIT warm-up time of RPython. The experiment environment is an OS X 10.10 machine with a 3.5 GHz 6-core Intel Xeon E5 CPU and 14 GB of 1866 MHz DDR3 ECC memory.

Let’s look at the result of FACT. The graph shows that Pyrlang runs 177.41% faster on average than Erlang, and at almost the same speed as HiPE. However, since we haven’t implemented big integers in Pyrlang, the arithmetic operators do not do any overflow checking. It is likely that the final version of Pyrlang will be slower than the current version and than HiPE.

As for REVERSE, the graph shows that Pyrlang runs 45.09% faster than Erlang, but 63.45% slower than HiPE on average. We think this is reasonable because there are only a few arithmetic operators in this benchmark, so the speeds of the three implementations are closer. However, we observed that at the scale of 40,000, Pyrlang slowed down significantly (111.35% slower than HiPE) compared with the other two scales (56.38% and 22.63% slower than HiPE).

So far we can only hypothesize about why Pyrlang slows down at that scale. We guess the overhead might come from the GC. The BEAM bytecode provides GC hints to help the default Erlang VM perform some GC operations immediately: for example, using GC_BIF instead of a plain BIF instruction tells the VM that there may be a GC opportunity, and how many live variables exist around the instruction. In Pyrlang we do not use these hints but rely entirely on RPython’s GC. When there are a huge number of objects at runtime (for REVERSE, the Erlang list objects), the speed therefore drops.

Ruochen Huang

February 25, 2015 11:13 AM


Kushal Das

My talk in MSF, India

Last week I gave a talk on Free and Open Source Software at the Metal and Steel Factory, Indian Ordnance Factories, Ishapore, India. I had met Mr. Amartya Talukdar, a well-known activist and blogger from Kolkata, at a bloggers' meet. He currently manages the I.T. team at the above-mentioned place, and he arranged the talk to spread more awareness of FOSS.

I reached the main gate an hour before the talk. The security guards came around to ask me why I was standing there on the road. I was sure this was going to happen again. I went into the factory along with Mr. Talukdar; at least three times the guards stopped me, guns at the ready. They also took my mobile phone; I had left my camera at home for the same reason.

I met the I.T. department and a few developers who work there before the talk. Around 9:40am we moved to the big conference room. The talk started with Mr. Talukdar giving a small introduction. I was not sure how many technical people would attend, so the talk was less technical and more demo-oriented. The room was almost full within a few minutes, and I hope my introductions to FOSS, Fedora, and Python went well. I was carrying a few Python docs with me, and a few Fedora stickers. I spent most of the time demoing various tools that can increase the productivity of management when used well. We saw reStructuredText, rst2pdf and Sphinx for managing documents. We also looked into version control systems and how we can use them. We talked a bit about ownCloud, but without a network I could not demo it. I also demoed various small Python scripts I use to keep my life simple. I learned about various FOSS tools they are already using. They use Linux on their servers; my biggest suggestion was to use Linux on the desktops too. Viruses are a constant problem that can easily be eliminated with Linux on the desktop.

My talk ended around 12pm. After lunch, while walking back to the factory, Mr. Talukdar showed me various historical places and items from the days of the Dutch and British colonies. Of course, the security guards were there again on the way out and back in.

We spent the next few hours discussing various technology and workflow related queries with the Jt. General Manager, Mr. Neeraj Agrawal. It was very nice to see that he keeps up with all the latest news and information from the FOSS and technology world. We really need more people like him, open to new ideas and capable of managing both worlds. In the future we will be doing a few workshops targeting the needs of the factory's developers.

February 25, 2015 09:06 AM


Vasudev Ram

Publish SQLite data to PDF using named tuples

By Vasudev Ram


Some time ago I had written this post:

Publishing SQLite data to PDF is easy with xtopdf.

It showed how to get data from an SQLite (Wikipedia) database and write it to PDF, using xtopdf, my open source PDF creation library for Python.

Today I was browsing the Python standard library docs, and thought of modifying that program to use the namedtuple data type from the collections module, which is described as implementing "high-performance container datatypes". The collections module was introduced in Python 2.4, and namedtuple was added to it in Python 2.6.
Here is a modified version of that program (SQLiteToPDF.py), called SQLiteToPDFWithNamedTuples.py, which uses named tuples:
# SQLiteToPDFWithNamedTuples.py
# Author: Vasudev Ram - http://www.dancingbison.com
# SQLiteToPDFWithNamedTuples.py is a program to demonstrate how to read
# SQLite database data and convert it to PDF. It uses the Python
# data structure called namedtuple from the collections module of
# the Python standard library.

from __future__ import print_function
import sys
from collections import namedtuple
import sqlite3
from PDFWriter import PDFWriter

# Helper function to output a string to both screen and PDF.
def print_and_write(pw, strng):
    print(strng)
    pw.writeLine(strng)

try:
    # Create the stocks database.
    conn = sqlite3.connect('stocks.db')
    # Get a cursor to it.
    curs = conn.cursor()

    # Create the stocks table.
    curs.execute('''DROP TABLE IF EXISTS stocks''')
    curs.execute('''CREATE TABLE stocks
        (date text, trans text, symbol text, qty real, price real)''')

    # Insert a few rows of data into the stocks table.
    curs.execute("INSERT INTO stocks VALUES ('2006-01-05', 'BUY', 'RHAT', 100, 25.1)")
    curs.execute("INSERT INTO stocks VALUES ('2007-02-06', 'SELL', 'ORCL', 200, 35.2)")
    curs.execute("INSERT INTO stocks VALUES ('2008-03-07', 'HOLD', 'IBM', 300, 45.3)")
    conn.commit()

    # Create a namedtuple to represent stock rows.
    StockRecord = namedtuple('StockRecord', 'date, trans, symbol, qty, price')

    # Run the query to get the stocks data.
    curs.execute("SELECT date, trans, symbol, qty, price FROM stocks")

    # Create a PDFWriter and set some of its fields.
    pw = PDFWriter("stocks.pdf")
    pw.setFont("Courier", 12)
    pw.setHeader("SQLite data to PDF with named tuples")
    pw.setFooter("Generated by xtopdf - https://bitbucket.org/vasudevram/xtopdf")

    # Write header info.
    hdr_flds = [ str(hdr_fld).rjust(10) + " " for hdr_fld in StockRecord._fields ]
    hdr_fld_str = ''.join(hdr_flds)
    print_and_write(pw, '=' * len(hdr_fld_str))
    print_and_write(pw, hdr_fld_str)
    print_and_write(pw, '-' * len(hdr_fld_str))

    # Now loop over the fetched data and write it to PDF.
    # Map the StockRecord namedtuple's _make class method
    # (which creates a new instance) over all the rows fetched.
    for stock in map(StockRecord._make, curs.fetchall()):
        row = [ str(col).rjust(10) + " " for col in (stock.date,
            stock.trans, stock.symbol, stock.qty, stock.price) ]
        # The line above can instead be written more simply as:
        # row = [ str(col).rjust(10) + " " for col in stock ]
        row_str = ''.join(row)
        print_and_write(pw, row_str)

    print_and_write(pw, '=' * len(hdr_fld_str))

except Exception as e:
    print("ERROR: Caught exception: " + str(e))
    sys.exit(1)

finally:
    pw.close()
    conn.close()

This time I've imported print_function so that I can use print as a function instead of as a statement.
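The namedtuple part of the program can be tried on its own, without xtopdf installed. Here is a minimal sketch using only the standard library and an in-memory SQLite database (the table and column names follow the example above):

```python
import sqlite3
from collections import namedtuple

conn = sqlite3.connect(':memory:')
curs = conn.cursor()
curs.execute("CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)")
curs.execute("INSERT INTO stocks VALUES ('2006-01-05', 'BUY', 'RHAT', 100, 25.1)")

StockRecord = namedtuple('StockRecord', 'date, trans, symbol, qty, price')
curs.execute("SELECT date, trans, symbol, qty, price FROM stocks")

# _make builds a StockRecord from each plain row tuple;
# fields are then accessible by name instead of by index.
for stock in map(StockRecord._make, curs.fetchall()):
    print(stock.symbol, stock.qty)

conn.close()
```

The payoff is readability: stock.symbol says much more than row[2].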

Here's a screenshot of the PDF output in Foxit PDF Reader:


- Vasudev Ram - Online Python training and programming

Dancing Bison Enterprises

Signup to hear about new products or services from me.

Posts about Python  Posts about xtopdf

Contact Page

February 25, 2015 03:50 AM

February 24, 2015


François Dion

J is for ... autojump!

Shell addon


At our last PYPTUG meeting, I was demoing Dshell. While at it I suggested using the j command. AKA, autojump:

https://github.com/joelthelion/autojump

On some Linux distros it can be installed from the package repos. On Debian and derived systems (i.e. Ubuntu, Mint, etc.), instructions to enable it after the install are in /usr/share/doc/autojump/README.Debian

Once installed and trained, you'll ask yourself how you've been able to live without it.

The j command itself is defined within a shell file. For example, for bash, you'll find this piece of code:

# default autojump command
j() {
    if [[ ${1} == -* ]] && [[ ${1} != "--" ]]; then
        autojump ${@}
        return
    fi
    output="$(autojump ${@})"
    if [[ -d "${output}" ]]; then
        echo -e "\\033[31m${output}\\033[0m"
        cd "${output}"
    else
        echo "autojump: directory '${@}' not found"
        echo "\n${output}\n"
        echo "Try \`autojump --help\` for more information."
        false
    fi
}

Although this part is all bash scripting, the actual autojump command is written in something else altogether.

Python powered


Autojump has been around for many years, but to this day few people are aware of it. In fact, as I typed j, a quick survey around the room confirmed my gut feeling.
I have blogged and tweeted about it before, but mostly in passing. Hopefully this post will bring a bit more exposure to this really useful tool.

And, do have a look at the python code ( https://github.com/joelthelion/autojump/blob/master/bin/autojump ). It has some interesting use of the lesser known SequenceMatcher class of the difflib module, and good use of lambdas. Oh, and yeah, it's pep8 formatted. Thank you.
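As a quick illustration of what SequenceMatcher does (this snippet is mine, not from autojump's source): it scores the similarity of two strings, the kind of scoring autojump relies on to fuzzy-match directory names:

```python
from difflib import SequenceMatcher

# ratio() is 2*M/T, where M is the number of matched characters
# and T is the total number of characters in both strings.
def similarity(needle, haystack):
    return SequenceMatcher(None, needle, haystack).ratio()

print(similarity('proj', 'projects'))    # high: 'proj' matches fully
print(similarity('proj', 'downloads'))   # low: almost no overlap
```

Ranking candidate directories by this ratio is all it takes to turn a partial name into a best guess.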

Francois
@f_dion

February 24, 2015 09:57 PM


Caktus Consulting Group

PyCon 2015 Ticket Giveaway

Caktus is giving away a PyCon 2015 ticket, valued at $350. We love going to PyCon every year. It’s the largest gathering of developers using Python, the open source programming language that Caktus relies on. This year, it’ll be held April 8th-16th at the beautiful Palais des congrès de Montréal (the inspiration we used to design the website).

To enter, follow @caktusgroup on Twitter and RT this message.

The giveaway will end Tuesday, March 3rd at 12pm EST. Winner will be notified via Twitter DM. A response via DM is required within 24 hours or entrant forfeits their ticket. Caktus employees are not eligible. Winning entrant must be 18 years of age or older. Ticket is non-transferable.

Bonne chance!

February 24, 2015 09:43 PM


PythonClub - A Brazilian collaborative blog about Python

Mutant tuples in Python

By Luciano Ramalho, author of the book Fluent Python (O'Reilly, 2014)

See also the original article in English: http://radar.oreilly.com/2014/10/python-tuples-immutable-but-potentially-changing.html

Python tuples have a surprising trait: they are immutable, but their values may change. This can happen when a tuple holds a reference to any mutable object, such as a list. If you need to explain this to a colleague who is new to Python, a good first step is to debunk the common notion that variables are like boxes in which we store data.

In 1997 I took a summer course on Java at MIT. The professor, Lynn Andrea Stein, an award-winning computer science educator, emphasized that the usual "variables as boxes" metaphor actually hinders the understanding of reference variables in OO languages. Python variables are like reference variables in Java, so it's better to think of them as labels attached to objects.

Here is an example inspired by Lewis Carroll's book Through the Looking-Glass, and What Alice Found There.

image: Through the Looking-Glass, and What Alice Found There

Tweedledum and Tweedledee are twins. From the book: "Alice knew which was which in a moment because one of them had 'DUM' embroidered on his collar, and the other 'DEE'".

example 1

Let's represent them as tuples containing the date of birth and a list of their skills:

>>> dum = ('1861-10-23', ['poesia', 'fingir-luta'])
>>> dee = ('1861-10-23', ['poesia', 'fingir-luta'])
>>> dum == dee
True
>>> dum is dee
False
>>> id(dum), id(dee)
(4313018120, 4312991048)

Clearly dum and dee refer to objects that are equal, but not to the same object. They have distinct identities.

Now, after the events witnessed by Alice, Tweedledum decided to become a rapper, adopting the stage name T-Doom. We can express this in Python like so:

>>> t_doom = dum
>>> t_doom
('1861-10-23', ['poesia', 'fingir-luta'])
>>> t_doom == dum
True
>>> t_doom is dum
True

So t_doom and dum are equal, but Alice might find it silly to say that, because t_doom and dum refer to the same person: t_doom is dum.

example 2

The names t_doom and dum are aliases. The official Python docs often refer to variables as "names", and I like that. Variables are names we give to objects. Alternate names are aliases. That helps free our minds from the idea that variables are like boxes. Anyone who thinks of variables as boxes can't make sense of what comes next.

After much practice, T-Doom is now a skilled rapper. In code, this is what happened:

>>> skills = t_doom[1]
>>> skills.append('rap')
>>> t_doom
('1861-10-23', ['poesia', 'fingir-luta', 'rap'])
>>> dum
('1861-10-23', ['poesia', 'fingir-luta', 'rap'])

T-Doom acquired the rap skill, and so did Tweedledum, obviously, since they are one and the same. If t_doom were a box containing data of type str and list, how would you explain that an append to the list in t_doom also changes the list in the dum box? However, it makes perfect sense if you understand variables as labels.

The label analogy is much better because aliasing is more simply explained as one object with two or more labels. In the example, t_doom[1] and skills are two names given to the same list object, just as dum and t_doom are two names given to the same tuple object.

Below is an alternative depiction of the objects that represent Tweedledum. This figure emphasizes the fact that a tuple holds references to objects, and not the objects themselves.

example 3

What is immutable is the physical content of a tuple, which holds only references to objects. The value of the list referenced by dum[1] changed, but the identity of the list referenced by the tuple remains the same. A tuple has no way of preventing changes to the values of its items, which are independent objects and may be reached through references outside of the tuple, like the skills name we used earlier. Lists and other mutable objects inside tuples may change, but their identities will always be the same.
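This can be verified directly with id() (a small sketch of my own, not from the original article): the list inside the tuple keeps the same identity even after it is mutated:

```python
dum = ('1861-10-23', ['poesia', 'fingir-luta'])
skills_id = id(dum[1])        # identity of the inner list

dum[1].append('rap')          # mutate the list through the tuple

assert id(dum[1]) == skills_id                       # same list object
assert dum[1] == ['poesia', 'fingir-luta', 'rap']    # but a new value
```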

This emphasizes the distinction between the concepts of identity and value, described in the Python Language Reference, in the Data model chapter:

Every object has an identity, a type and a value. An object's identity never changes once it has been created; you may think of it as the object's address in memory. The is operator compares the identity of two objects; the id() function returns an integer representing its identity.

After dum became a rapper, the twin brothers are no longer equal:

>>> dum == dee
False

Here we have two tuples that were created equal, but now they are different.

The other immutable built-in collection type in Python, frozenset, does not suffer from this problem of being immutable yet changing in value. That's because a frozenset (or a plain set, for that matter) may only hold references to hashable objects (objects that can be used as dictionary keys), and the value of hashable objects, by definition, may never change.

Tuples are commonly used as keys for dict objects, and they need to be hashable, just like set elements. So, are tuples hashable or not? The right answer is: some tuples are hashable. The value of a tuple holding a mutable object may change, and such a tuple is not hashable. To be used as a dict key or set element, a tuple must be made only of hashable objects. Our tuples named dum and dee are unhashable because each contains a reference to a list, and lists are unhashable.
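A quick check (my own sketch, not in the original article): hash() works on a tuple of immutable items, but raises TypeError for a tuple that contains a list:

```python
flat = ('1861-10-23', 'poesia')          # only immutable items
nested = ('1861-10-23', ['poesia'])      # contains a list

hash(flat)                               # fine: this tuple is hashable

try:
    hash(nested)                         # TypeError: unhashable type: 'list'
except TypeError as exc:
    print('not hashable:', exc)
```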

Now let's focus on the assignment statements that are at the heart of this whole exercise.

Assignment in Python never copies values. It only copies references. So when I wrote skills = t_doom[1] I did not copy the list referenced by t_doom[1], I only copied a reference to it, which I then used to change the list by doing skills.append('rap').

Back at MIT, Prof. Stein spoke about assignment in a very deliberate way. For example, when talking about a seesaw object in a simulation, she would say: "The variable g is assigned to the seesaw", but never "The seesaw is assigned to the variable g". With reference variables, it makes much more sense to say that the variable is assigned to the object, and not the other way around. After all, the object is created before the assignment.

In an assignment such as y = x * 10, the right-hand side is evaluated first. This creates a new object or retrieves an existing one. Only after the object is created or retrieved is the name assigned to it.

Here is proof of that. First we create a Gizmo class, and an instance of it:

>>> class Gizmo:
...     def __init__(self):
...         print('Gizmo id: %d' % id(self))
...
>>> x = Gizmo()
Gizmo id: 4328764080

Note that the __init__ method displays the id of the object just created. This will be important in the next demonstration.

Now let's instantiate another Gizmo and immediately try to perform an operation with it before binding a name to the result:

>>> y = Gizmo() * 10
Gizmo id: 4328764360
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for *: 'Gizmo' and 'int'
>>> 'y' in globals()
False

This snippet shows that the new object was instantiated (its identity is 4328764360), but before the y name could be created, a TypeError exception aborted the assignment. The 'y' in globals() check proves that there is no global name y.

To wrap up: always read the right-hand side of an assignment first. That's where the object is created or retrieved. After that, the name on the left is bound to the object, like a label stuck onto it. Just forget about the boxes.

Regarding tuples, make sure they only hold references to immutable objects before trying to use them as dictionary keys or put them in sets.

This text was originally published in English on the O'Reilly Radar blog. The translation into Portuguese was done by Paulo Henrique Rodrigues Pinheiro. The content is based on chapter 8 of my book Fluent Python. That chapter, titled "Object references, mutability and recycling", also covers the semantics of function parameter passing, best practices for handling mutable parameters, shallow and deep copies, and the concept of weak references, among other topics. The book focuses on Python 3, but most of its content also applies to Python 2.7, like everything in this text.

February 24, 2015 01:17 PM


Montreal Python User Group

PyCon Startup Row - Registration

On Tuesday, March 3rd, we're inviting Montreal startups to present to a panel of investors and VCs. Presentations will last 5 minutes, including a demonstration of the product.

There will be various startups at various stages of growth, from new startups looking for traction to growing startups.

This is a paid event, to cover our costs and to provide appetizers and wine for the networking portions.

Please get your discounted early-bird $8 tickets at https://www.eventbrite.ca/e/mtl-newtech-pycon-edition-tickets-15867698714

For more information about PyCon: https://us.pycon.org/2015/

The first 3 startups to pitch have been confirmed; others will be announced soon.

Agenda:


February 24, 2015 05:00 AM

February 23, 2015


BioPython News

OBF Google Summer of Code 2014 Wrap-up


In 2014, OBF had six students in the Google Summer of Code 2014™ (GSoC) program mentored under its umbrella of Bio* and related open-source bioinformatics community projects: Loris Cro (Bioruby) with mentors Francesco Strozzi and Raoul Bonnal; Evan Parker (Biopython) with mentors Wibowo Arindrarto and Peter Cock; Sarah Berkemer (BioHaskell) with mentors Christian Höner zu Siederdissen and Ketil Malde; and three students contributed to JSBML: Victor Kofia (mentors: Alex Thomas and Sarah Keating), Ibrahim Vazirabad (mentors: Andreas Dräger and Alex Thomas), and Leandro Watanabe (mentors: Nicolas Rodriguez and Chris Myers).

As a change from earlier years in which OBF participated in GSoC as a mentoring organization, in 2014 we purposefully defined our umbrella as much more inclusive of the wider bioinformatics open-source community, bringing it more in line with the annual Bioinformatics Open-Source Conference (BOSC). In part this was also motivated by “paying it forward“, a concept central to growing healthy open-source communities, after the larger domain-agnostic language projects such as SciRuby and PSF had extended an open hand to OBF mentors when OBF did not get admitted as a GSoC mentoring organization in 2013. In the end, four out of the six successful student applications were for projects outside of the traditional core Bio* projects, a result with which everyone won: We had a terrific crop of students, our community grew larger and stronger, and open-source bioinformatics was advanced in a more diverse way than would have been possible otherwise.

In addition to our students, huge kudos also go to our mentors (see above), and to Eric Talevich (Biopython) and Raoul Bonnal (Bioruby), who ran our program participation as administrators. They all invested significant amounts of time on behalf of our community and projects. Thank you!

Below follows a short summary of each of the 2014 student projects, starting with the three JSBML students.

JSBML and GSoC 2014

JSBML logoJSBML is an international community-driven, open-source project to develop a Java API library for reading, writing and manipulating SBML, a data format for representing and exchanging computational models in systems biology. SBML has been in use for over a decade but continues to evolve and grow, and hence so does JSBML. JSBML holds two annual development-oriented workshops, and the three 2014 JSBML GSoC students had the opportunity to participate in and present their work at the autumn event, COMBINE (Computational Modeling in Biology Network), which was held in Los Angeles, California, right at the end of GSoC. Furthermore, a scientific publication on a new JSBML release, currently under review at Bioinformatics, highlights some of the work done by the students. Hence, JSBML’s 2014 participation in GSoC was a great success and experience, both for the students as well as the JSBML project and community.

Ibrahim Y. Vazirabad – “Improving the plugin interface for CellDesigner”

CellDesigner is a frequently used program in computational systems biology. It features an easy-to-use GUI, powerful graph editing functions, and rich simulation functionality, among others. To facilitate rapid prototyping of new algorithms in third-party applications, CellDesigner provides a plug-in interface that gives Java applications access to its features. However, the design and implementation of the plug-in interface made developing software for it very difficult and time consuming. To remedy this, a draft version of a JSBML library had been created to allow developing and testing prospective plug-in modules initially as stand-alone software, which can then be turned into a CellDesigner plug-in with very little effort. The goal of Ibrahim’s project was to improve the interface provided by the library, and importantly, to revise it to support access to one of CellDesigner’s most interesting features, graphical network layout. As a result of Ibrahim’s work, new CellDesigner test cases and plugins that use this interface have already been implemented, including one that converts between CellDesigner’s proprietary data format and the official SBML layout extension.

Leandro H. Watanabe – “Arrays Package”

The arrays and dynamic package extensions to SBML have been proposed to overcome SBML’s limitation to static models, which is in contrast to the inherently dynamic nature of many biological systems. The goal of Leandro’s project was to implement the arrays package in JSBML. Rather than enabling models with new behaviors to be constructed, the purpose of the arrays package is to represent regular constructs more efficiently and more compactly than SBML core constructs can. To aid the integration of the arrays package into existing tools, Leandro also implemented the option of flattening an arrayed model to use only SBML core constructs, and a validation procedure for array constructs that checks whether a model violates any of the rules imposed on array constructs. As a consequence, his work helped solidify the Arrays Specification document of the SBML standard.

Victor Kofia – “Redesign the implementation of mathematical formulas”

JSBML uses the concept of abstract syntax trees to work with mathematical expressions. For example, the image to the right shows a syntax tree representing the formula k8 · R1. Originally, JSBML implemented different kinds of formula components all in just one complex class with diverse type attributes, which was prone to introducing errors upon code changes and generally made maintenance of the software difficult. Victor implemented a math package for JSBML, in which different kinds of tree nodes that can occur in formulas (e.g., real numbers or algebraic symbols such as ‘plus’ or ‘minus’) are represented with their own, specialized classes. This has made handling of formulas much more straightforward, and also more efficient. In the future, this new representation could even be used for symbolic or numeric calculations.

Evan Parker – “Addition of a lazy loading sequence parser to Biopython’s SeqIO package”

Though Biopython is already equipped with sequence parsers for a wide array of formats, these generally parse entire records into memory. For large sequences such as entire chromosomes this quickly degrades performance. To allow sequences to be loaded on demand, Evan designed a general lazy-loading parser by refactoring the existing object model, and then added format-specific modifications to each individual parser. The approach he devised works by pre-indexing the sequence files and then loading only those sequence regions that the user requests. Benchmarking and performance comparisons showed this approach yields significant performance gains when, as is common for genome-scale files, users are interested only in parts of the full sequence. Evan’s code is currently under review by Biopython core developers, and once merged it will make parsing large sequences in Biopython much more tractable.

Loris Cro – “An ultra-fast scalable RESTful API to query large numbers of VCF datapoints”

Variant Call Format (VCF) files are commonly generated by genome sequencing projects for sequence variations among different individuals and can get very large. The goal of Loris’ work was to develop code for Bioruby to determine the common variations (i.e., intersections) between multiple individuals and groups of individuals in a fast and scalable way. In the first phase of the project, Loris tested different technologies for storing large VCF files, from which MongoDB emerged as having superior performance. In the second phase Loris developed the code for efficiently storing VCF data into MongoDB, and then implemented algorithms for performing the intersection queries (see Github repo and Loris’ project blog). The code was developed using JRuby and uses the HTS-JDK library to parse the VCF data. In the course of the project, Loris also provided valuable feedback to the HTS-JDK team that led to improvements of the VCF parser and data model. The result of Loris’ GSoC work is now available to the community as a Ruby Gem, which has been tested and used already in large international genome re-sequencing projects, including Gene2Farm and WHEALBI.

Sarah Berkemer – “Open source high-performance BioHaskell”

One of the challenges with sequence alignments for the purposes of sequence similarity searches is that for most known genes (i.e., sequences) relatively little is known about their biology, and the few for which a lot is known therefore tend to be only remotely related to a query sequence. Transitive alignments try to ameliorate this by aligning the query sequence against a large body of known but not deeply understood sequences, the intermediate set, which in turn are then aligned against the core of well-understood sequences. However, in contrast to aligning two sequences, aligning a sequence via a vast intermediate data set to a smaller core set is slow and memory-consuming. As part of her GSoC project, Sarah dug deep into the structure of the algorithm, and rewrote core parts to make use of fusing data structures and efficient tree-like data structures (see her project blog). Her work brought down the runtime for a benchmark by a factor of 3, from 31 to 11 minutes, and, arguably even more important, reduced memory consumption from 53 to 22 gigabytes. This now allows running the program on consumer-grade high-memory PCs. With Sarah having finished her Masters degree (congrats!!) in the meantime, she and her mentors are now in the process of writing a scientific application note and are planning to make the program available as an online web-service.

As a rather small family within the much larger OBF umbrella, the chance to have a student contribute to functional programming for computational biology has been a tremendous opportunity and learning experience for the Biohaskell community as well.

February 23, 2015 11:24 PM


Ionel Cristian

The problem with packaging in Python

Packaging is currently too hard in Python, and while there's effort to improve it, it's still largely focused on the problem of installing. The current approach is to just throw docs and specs at the building part: [2]

Drown it in docs.

Let's make docs! It must be poorly documented if no one understands it.

Why do we need a damn mountain of docs? Because when building a distribution the user experience is like this:

A screenshot of Microsoft Word at its worst

Thanks for asking Mr. Clippy, I'd like to package code without going mad.

There are so many things going on in setup.py:

No one is going to read the list above, let alone understand what everything means!

We don't need a goddamn mountain of docs, we need something that's so simple even a monkey could publish a decent distribution on PyPI. But that means cutting down features ...

The perspective problem*

There are lots of improvements made in PEP-376, PEP-345, PEP-425, PEP-427 and PEP-426, but they are all improvements that allow tools like pip to work better. They still don't make my life easier, as a packager - the user of setuptools or distutils.

Don't get me wrong, it's good that we got those, but I think there should be some focus on making a simpler packaging tool. An alternative to setuptools/distutils that has fewer features and more constraints, but is way easier to use. Sure, anyone can try to make something like that, but if it's not officially sanctioned it's going to have very limited success.

It has been tried before*

There have been attempts to replace the unholy duo [1] we have now but alas, the focus was wrong. There have been two directions of improvement:

  • Internals: better architecture to make the build tools more maintainable/extensible/whatever. Distutils2 was the champion of this approach.
  • Metadata as configuration: the "avoid code" mantra. Move the metadata into a configuration file, and avoid the crazy problems that usually happen when you let users put code in setup.py. Distutils2 championed this idea and it lives on today through d2to1.

However, the way code and data files are collected didn't change. As a packager, you still have to deal with the dozen confusing buttons. [3]

d2to1 is not better in this regard. In fact, it's worse because you have to hardcode metadata and there's no automatic discovery for whatever you're trying to package. [4]
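For context, the metadata-as-configuration style moves setup.py's keyword arguments into setup.cfg. A hypothetical d2to1-style sketch (the field names follow the distutils metadata; treat the exact layout as illustrative):

```ini
[metadata]
name = mypackage
version = 0.1.0
author = Jane Doe
summary = An example package
description-file = README.rst

[files]
packages =
    mypackage
```

Note how every package still has to be listed by hand under [files], which is exactly the hardcoding complained about above.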

The current course*

PEP-426 will open up the possibility of custom build systems, something other than setuptools, that could hypothetically solve all sorts of niche problems like C extensions with unusual dependencies.

What I dream of*

What if there were a build system just for pure-Python distributions (and maybe some C extensions with no dependencies)? Something with strong conventions: code in this place, docs in that place, no exceptions. Something like cargo has. Maybe with a nice project scaffold generator.

Of course, anyone can say: PEP-426 lets you build whatever you want, just do it! However, to make something really simple to use, some conventions need to be broken, and converting an existing project would take some effort. You see, if it's not officially sanctioned it's not going to catch on. Death by lack of interest.

And if it doesn't catch on, then the vast majority of packagers are going to stick with the complicated setup.py we have now.

In a way, packaging in Python is a victim of bad habits: complex conventions and feature bloat. It's hard to simplify things because of all the historical baggage people want to carry around. But if there's some official sanctioning then it's easier to accept the hard changes.

Concretely what I want is along these lines:

  • Get rid of py_modules, packages and package_dir. Just discover automatically whatever you have in a src dir.
  • Get rid of MANIFEST, MANIFEST.in and the baffling trio of package_data, data_files and include_package_data. Just take all the files that are inside packages. Use .gitignore to exclude files.
  • Have a single way to store and retrieve metadata like the version in your code. Not a handful of ways.

In other words, one way to do it. Not one clear way, because we document the hell out of it, but one, and only one, way to do it. What do you think, could it work? Would it improve anything?

[1]Distutils and setuptools: the confusing system everyone loves to hate.
[2]

There are a ton of places where you can find information about packaging, of various quality and freshness. At least now there's a sanctioned place to go to: https://packaging.python.org/en/latest/distributing.html

Still, there's so much to read. What if there wouldn't be a need to know so much to package stuff?

[3]Does this look familiar? It has mostly the same options as distutils's setup. Too many options. Still lots of trial and error to make a distribution.
[4]Hardcoding information that you already have in the filesystem is a sure way to make mistakes. More about this: Python packaging pitfalls.

February 23, 2015 10:00 PM


PyPy Development

linalg support in pypy/numpy

Introduction

PyPy's numpy support has matured enough that it can now support the lapack/blas libraries through the numpy.linalg module. To install the version of numpy this blog post refers to, install PyPy version 2.5.0 or newer, and run this:

pypy -m pip install git+https://bitbucket.org/pypy/numpy.git

This update is a major step forward for PyPy's numpy support. Many of the basic matrix operations depend on linalg, even matplotlib requires it to display legends (a pypy-friendly version of matplotlib 1.3 is available at https://github.com/mattip/matplotlib).

A number of improvements and adaptations, some of which are in the newly-released PyPy 2.5.0, made this possible:
  • Support for an extended frompyfunc(), which in the PyPy version supports much of the ufunc API (signatures, multiple dtypes) allowing creation of pure-python, jit-friendly ufuncs. An additional keyword allows choosing between out = func(in) or func(in, out) ufunc signatures. More explanation follows.
  • Support for GenericUfuncs via PyPy's (slow) capi-compatibility layer. The underlying mechanism actually calls the internal implementation of frompyfunc().
  • A cffi version of _umath_linalg. Since cffi uses dlopen() to call into shared objects, we added support in the numpy build system to create non-python shared libraries from source code in the numpy tree. We also rewrote parts of the c-based _umath_linalg.c.src in python, renamed numpy's umath_linalg capi module to umath_linalg_capi, and use it as a shared object through cffi.

Status

We have not completely implemented all the linalg features. dtype resolution via casting is missing, especially for complex ndarrays, which leads to slight numerical errors where numpy uses a more precise type for intermediate calculations. Other missing features in PyPy's numpy support may have implications for complete linalg support.

Some OSX users have noticed they need to update pip to version 6.0.8 to overcome a regression in pip, and it is not clear if we support all combinations of blas/lapack implementations on all platforms.

Over the next few weeks we will be ironing out these issues.

Performance

A simple benchmark is shown below, but let's state the obvious: PyPy's JIT and the iterators built into PyPy's ndarray implementation will in most cases be no faster than CPython's numpy. The JIT can help where there is a mixture of python and numpy-array code. We do have plans to implement lazy evaluation and to further optimize PyPy's support for numeric python, but numpy is quite good at what it does.

HowTo for PyPy's extended frompyfunc

The magic enabling blas support is a rewrite of the _umath_linalg c-based module as a cffi-python module that creates ufuncs via frompyfunc. We extended the numpy frompyfunc to allow it to function as a replacement for the generic ufunc available in numpy only through the c-api.

We start with the basic frompyfunc, which wraps a python function into a ufunc:
 
def times2(in0):
    return in0 * 2
ufunc = frompyfunc(times2, 1, 1)

In CPython's numpy the dtype of the result is always object, which is not implemented (yet) in PyPy, so this example will fail. While the utility of object dtypes can be debated, in the meantime we add a non-numpy-compatible keyword argument dtypes to frompyfunc. If dtypes=['match'], the output dtype will match the dtype of the first input ndarray:

ufunc = frompyfunc(times2, 1, 1, dtypes=['match'])
ai = arange(24).reshape(3, 4, 2)
ao = ufunc(ai)
assert (ao == ai * 2).all()

I hear you ask "why is the dtypes keyword argument a list?" This is so we can support the Generalized Universal Function API, which allows specifying a number of specialized functions and the input-output dtypes each specialized function accepts.
Note that the ufunc feeds the values of ai to the function one at a time: the function operates on scalar values. To support more complicated ufunc calls, the generalized ufunc API allows defining a signature, which specifies the layout of the ndarray inputs and outputs. So we extended frompyfunc with a signature keyword as well.
We add one further extension to frompyfunc: we allow a Boolean keyword stack_inputs to specify the argument layout of the function itself. If the function is of the form:
 
out0, out1, ... = func(in0, in1,...)

then stack_inputs is False. If it is True the function is of the form:
 
func(in0, in1, ... out0, out1, ...)

Here is a complete example of using frompyfunc to create a ufunc, based on this link:
 
def times2(in_array, out_array):
    in_flat = in_array.flat
    out_flat = out_array.flat
    for i in range(in_array.size):
        out_flat[i] = in_flat[i] * 2
ufunc = frompyfunc([times2, times2], 1, 1,
                signature='(i)->(i)',
                dtypes=[dtype(int), dtype(int),
                        dtype(float), dtype(float),
                       ],
                stack_inputs=True,
                )
ai = arange(10, dtype=int)
ai2 = ufunc(ai)
assert all(ai2 == ai * 2)

Using this extended syntax, we rewrote the lapack calls into the blas functions in pure python, no C needed. Benchmarking showed this approach was actually much slower than using the upstream umath_linalg module via cpyext, as can be seen in the benchmarks below. This is due to the need to copy c-aligned data into Fortran-aligned format, and our __getitem__ and __setitem__ iterators are not as fast as pointer arithmetic in C. So we next tried a hybrid approach: compile and use numpy's umath_linalg C module as a shared object, and call the optimized specific wrapper function from it.

Benchmarks

Here are some benchmarks, running a tight loop of the different versions of linalg.inv(a), where a is a 10x10 double ndarray. The benchmark ran on an i7 processor running ubuntu 14.04 64 bit:
Impl.                                                Time after warmup
CPython 2.7 + numpy 1.10.dev + lapack                8.9 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via cpyext               8.6 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via pure python + cffi   19.9 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via python + c + cffi    9.5 msec/1000 loops


While no general conclusions may be drawn from a single micro-benchmark, it does indicate that there is some merit in the approach taken.

Conclusion

PyPy's numpy now includes a working linalg module. There are still some rough corners, but hopefully we have implemented the parts you need. While the speed of the isolated linalg function is no faster than CPython and upstream numpy, it should not be significantly slower either. Your use case may see an improvement if you use a mix of python and lapack, which is the usual case.

Please let us know how it goes. We love to hear success stories too.

We still have challenges at all levels of programming, and are always looking for people willing to contribute, so stop by on IRC at #pypy.

mattip and the PyPy Team

February 23, 2015 09:36 PM


Nicola Iarocci

Eve 0.5.2 ‘Giulia’ is Out

Eve 0.5.2 has just been released with a bunch of interesting fixes and documentation updates. See the changelog for details.

February 23, 2015 04:15 PM


Mike Driscoll

PyDev of the Week: Maciej Fijalkowski

This week we welcome Maciej Fijalkowski (@fijall) as our PyDev of the Week. He is a freelance programmer who spends a lot of time working on the PyPy project. I would recommend checking out some of his work on github. Let’s spend some time learning about our fellow Pythonista!

Can you tell us a little about yourself (hobbies, education, etc):

Originally from Poland, I am partly nomadic, having a semi-permanent base in Cape Town, South Africa. Got lured here by climbing, good weather, majestic landscapes, and later discovered surfing. Otherwise I can be found in various places in Europe and the US, especially Boulder, CO. I have been doing PyPy for about 8 years now (don’t know, lost track a bit), sometimes free time, sometimes permanent. These days I’m doing some consulting for both PyPy and other stuff, trying to build my own company, baroquesoftware.com.

Why did you start using Python?

I think it was early 2000s. I was using Perl and C++ at the time and a friend of mine was fighting with some programming assignments at the physics department. Doing a quick survey I found out that Python seems to be a language of choice for “beginners”. After teaching that to myself and her, I kind of discovered that Python is an actual language suitable not just for beginners. And this is how it started :-)

What other programming languages do you know and which is your favorite?

Due to the nature of my work, I am proficient in C, assembler (x86 and ARM), C++, Python, RPython. I can also read/write Java, Ruby, PHP, Ocaml, Prolog and a bunch of others I don’t quite remember. I can never make a project in JavaScript that does not turn out to be a major mess. As for the favorite, is that a trick question? Unsurprisingly, I code mostly in Python, but a lot of my work is done in RPython, which is a static-subset of Python that we use for PyPy. While I think RPython suits its niche very well, I would not recommend it as a general-purpose language, so I suppose Python stays at the top for me. I actually have various ideas how to create a language/ecosystem that would address a lot of Python shortcomings, if I ever have time :-)

What projects are you working on now?

Mostly PyPy, but more specifically:

  • improving PyPy warmup time and memory consumption
  • numpy
  • helping people with various stuff, e.g. IO performance improvements, profiling etc.

Also I’m the main contributor to hippyvm.

Which Python libraries are your favorite (core or 3rd party)?

I think the one I use the most is py.test. By now it’s an absolutely essential part of what we’re doing. As for the favorite one, I might be a bit biased since I participated in the design, but I really like cffi. According to PyPI it gets like half a million downloads a month, so it can’t just be me.

Thanks so much!

The Last 10 PyDevs of the Week

February 23, 2015 01:30 PM


Ian Ozsvald

Starting Spark 1.2 and PySpark (and ElasticSearch and PyPy)

The latest PySpark (1.2) is feeling genuinely useful. Late last year I had a crack at running Apache Spark 1.0 and PySpark, and it felt a bit underwhelming (too much fanfare, too many bugs). The media around Spark continues to grow; e.g. today’s hackernews thread on the new DataFrame API has a lot of positive discussion, and the lazily evaluated pandas-like dataframes built from a wide variety of data sources feel very powerful. Continuum have also just announced PySpark+GlusterFS.

One surprising fact is that Spark is Python 2.7 only at present; feature request 4897 is for Python 3 support (go vote!), which requires some cloud pickling to be fixed. Using the end-of-line Python release feels a bit daft. I’m using Linux Mint 17.1, which is based on Ubuntu 14.04 64bit. I’m using the pre-built spark-1.2.0-bin-hadoop2.4.tgz via their downloads page and ‘it just works’. Using my global Python 2.7.6 and additional IPython install (via apt-get):

spark-1.2.0-bin-hadoop2.4 $ IPYTHON=1 bin/pyspark
...
IPython 1.2.1 -- An enhanced Interactive Python.
...
 Welcome to
 ____              __
 / __/__  ___ _____/ /__
 _\ \/ _ \/ _ `/ __/  '_/
 /__ / .__/\_,_/_/ /_/\_\   version 1.2.0
 /_/
Using Python version 2.7.6 (default, Mar 22 2014 22:59:56)
 SparkContext available as sc.
 >>>

Note the IPYTHON=1; without it you get a vanilla shell, with it you get IPython (if it is on the search path). IPython lets you interactively explore the “sc” Spark context using tab completion, which really helps at the start. To run one of the included demos (e.g. wordcount) you can use the spark-submit script:

spark-1.2.0-bin-hadoop2.4/examples/src/main/python 
$ ../../../../bin/spark-submit wordcount.py kmeans.py  # count words in kmeans.py

For my use case we were initially after sparse matrix support; sadly that is only available for Scala/Java at present. By stepping back from my sklearn/scipy sparse solution for a minute and thinking a little more map/reduce, I could just as easily split the problem into a number of counts, and that parallelises very well in Spark (though I’d love to see sparse matrices in PySpark!).

I’m doing this with my contract-recruitment client via my ModelInsight as we automate recruitment; there’s a press release out today outlining a bit of what we do. One of the goals is to move to a more unified research+deployment approach: rather than lots of tooling in R&D which we then streamline for production, we hope to share similar tooling between R&D and production so that deployment and different scales of data are ‘easier’.

I tried the latest PyPy 2.5 (running Python 2.7) and it ran PySpark just fine. Using PyPy 2.5 a prime-search example takes 6s vs 39s with vanilla Python 2.7, so in-memory processing using RDDs rather than numpy objects might be quick and convenient (has anyone trialled this?). To run using PyPy set PYSPARK_PYTHON:

$ PYSPARK_PYTHON=~/pypy-2.5.0-linux64/bin/pypy ./pyspark

I’m used to working with Anaconda environments, and for Spark I’ve set up a Python 2.7.8 environment (“conda create -n spark27 anaconda python=2.7”) & IPython 2.2.0. Whichever Python is in the search path or is specified at the command line is used by the pyspark script.

The next challenge to solve was integration with ElasticSearch for storing outputs. The official docs are a little tough to read as a non-Java/non-Hadoop programmer and they don’t mention PySpark integration, thankfully there’s a lovely 4-part blog sequence which “just works”:

  1. ElasticSearch and Python (no Spark but it sets the groundwork)
  2. Reading & Writing ElasticSearch using PySpark
  3. Sparse Matrix Multiplication using PySpark
  4. Dense Matrix Multiplication using PySpark

To summarise the above with a trivial example, to output to ElasticSearch using a trivial local dictionary and no other data dependencies:

$ wget http://central.maven.org/maven2/org/elasticsearch/
 elasticsearch-hadoop/2.1.0.Beta2/elasticsearch-hadoop-2.1.0.Beta2.jar
$ ~/spark-1.2.0-bin-hadoop2.4/bin/pyspark --jars 
 elasticsearch-hadoop-2.1.0.Beta2.jar
>>> res = sc.parallelize([1, 2, 3, 4])
>>> res2 = res.map(lambda x: ('key', {'name': str(x), 'sim': 0.22}))
>>> res2.collect()
[('key', {'name': '1', 'sim': 0.22}),
 ('key', {'name': '2', 'sim': 0.22}),
 ('key', {'name': '3', 'sim': 0.22}),
 ('key', {'name': '4', 'sim': 0.22})]

>>> res2.saveAsNewAPIHadoopFile(path='-',
...     outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
...     keyClass="org.apache.hadoop.io.NullWritable",
...     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
...     conf={"es.resource": "myindex/mytype"})

The above creates a list of 4 dictionaries and then sends them to a local ES store using “myindex” and “mytype” for each new document.  Before I found the above I used this older solution which also worked just fine.

Running the local interactive session using a mock cluster was pretty easy. The docs for spark-standalone are a good start:

sbin $ ./start-master.sh
 #  the log (full path is reported by the script so you could `tail -f `) shows
 # 15/02/17 14:11:46 INFO Master: 
 # Starting Spark master at spark://ian-Latitude-E6420:7077
 # which gives the link to the browser view of the master machine which is 
 # probably on :8080 (as shown here http://www.mccarroll.net/blog/pyspark/).
#Next start a single worker:
sbin $ ./start-slave.sh 0 spark://ian-Latitude-E6420:7077
 # and the logs will show a link to another web page for each worker 
 # (probably starting at :4040).
#Next you can start a pySpark IPython shell for local experimentation:
$ IPYTHON=1 ~/data/libraries/spark-1.2.0-bin-hadoop2.4/bin/pyspark 
  --master spark://ian-Latitude-E6420:7077
 # (and similarly you could run a spark-shell to do the same with Scala)
#Or we can run their demo code using the master node you've configured setup:
$ ~/spark-1.2.0-bin-hadoop2.4/bin/spark-submit 
  --master spark://ian-Latitude-E6420:7077 
  ~/spark-1.2.0-bin-hadoop2.4/examples/src/main/python/wordcount.py README.txt

Note: if you tried to run the above spark-submit (which specifies the --master to connect to) and you didn’t have a master node, you’d see log messages like:

15/02/17 14:14:25 INFO AppClient$ClientActor: 
 Connecting to master spark://ian-Latitude-E6420:7077...
15/02/17 14:14:25 WARN AppClient$ClientActor: 
 Could not connect to akka.tcp://sparkMaster@ian-Latitude-E6420:7077: 
 akka.remote.InvalidAssociation: 
 Invalid address: akka.tcp://sparkMaster@ian-Latitude-E6420:7077
15/02/17 14:14:25 WARN Remoting: Tried to associate with 
 unreachable remote address 
 [akka.tcp://sparkMaster@ian-Latitude-E6420:7077]. 
 Address is now gated for 5000 ms, all messages to this address will 
 be delivered to dead letters. 
 Reason: Connection refused: ian-Latitude-E6420/127.0.1.1:7077

If you had a master node running but you hadn’t setup a worker node then after doing the spark-submit it’ll hang for 5+ seconds and then start to report:

15/02/17 14:16:16 WARN TaskSchedulerImpl: 
 Initial job has not accepted any resources; 
 check your cluster UI to ensure that workers are registered and 
 have sufficient memory

and if you google that without thinking about the worker node then you’d come to this diagnostic page, which leads down a small rabbit hole…

Stuff I’d like to know:


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

February 23, 2015 12:44 PM


PyCon

Signup for PyCon Dinners led by Jessica McKellar and Brandon Rhodes!

While the cost of PyCon includes breakfast and lunch as well as coffee and snacks, dinner is on your own, and for good reason. It's Montréal! Get out and enjoy the city, find some good food and drink, and hang out with new groups of people.

To make it even easier, this year we've organized another series of PyCon Dinners, one led by Jessica McKellar and one by Brandon Rhodes. These events, taking place Friday, April 10 at 6 PM, are a great way to wrap up the first day of PyCon with a great three-course meal among new and old friends. As 60% of attendees surveyed last year stated it was their first PyCon, these dinners are a great way to kick off the weekend, make new connections, and set up plans for more dinners or other late night festivities.

Jessica is a director of the Python Software Foundation and has been instrumental in outreach efforts around the Python community, especially when it comes to PyCon. She's also a contributor to Twisted and has worked a lot with the OpenHatch project. She's a very experienced speaker with a ton of knowledge and information to share, and will make an excellent host for an excellent meal.

Brandon is a returning veteran of running a PyCon dinner, having run last year's as a Python trivia game. He's also an experienced speaker on the Python conference circuit, and after serving as co-chair this year, he will be the chair of PyCon 2016 and 2017 when we head to Portland, Oregon.

Tickets are required for either dinner, with the meal price subsidized by the PSF for a cost of $45. Each prix fixe meal includes a delicious starter, main course, and dessert, with options available for dietary needs.

Check out the options on https://us.pycon.org/2015/events/dinners/ and sign up today! You can add a dinner ticket to your existing registration at https://us.pycon.org/2015/registration/.

If you don't have tickets to PyCon yet, hurry up because they are selling out very very soon.

February 23, 2015 12:27 PM


Django Weblog

Django sprint in Amsterdam, The Netherlands

We're very happy to announce that a two-day Django sprint will take place on March 7-8 in Amsterdam, Netherlands. This event is organized by the Dutch Django Association.

The venue is the office of DashCare, just outside the center of Amsterdam. The sprint will start on Saturday, March 7th at 9:30 CET and finish on Sunday, March 8th around 22:00 CET.

With the help of the Dutch Django Association and Divio we will have four core developers on site: Baptiste Mispelon, Markus Holtermann, Daniele Procida and Erik Romijn. Daniele will also be doing his famed “Don’t be afraid to commit” workshop, which will take people new to contributing to Django through the entire contribution process with real tickets. So please don’t hesitate to join even if you’ve never contributed to Django before.

If you'd like to join, please sign up on the meetup page. If you’re unable to come to Amsterdam, you're welcome to contribute to the sprint online. Sprinters and core developers will be available in the #django-sprint IRC channel on FreeNode.

We hope you can join us and help make the sprint as successful as possible!

February 23, 2015 10:00 AM


Omaha Python Users Group

February Meeting Notes

Here are links to a few of the topics at this month’s meeting:

FuzzyWuzzy: String matching in Python

Plumbum: Shell Combinators and More

Django: The web framework for perfectionists with deadlines.

February 23, 2015 12:24 AM

February 22, 2015


Daniel Greenfeld

Setting up LaTeX on Mac OS X

These are my notes for getting LaTeX running on Mac OS X with the components and fonts I want, which is handy when you want to generate PDFs from Sphinx. At some point I want to replace this with a Docker container similar to https://github.com/blang/latex-docker, albeit with the components in parts 3 and 4 below.

  1. Get mactex-basic.pkg from http://www.ctan.org/pkg/mactex-basic

  2. Click mactex-basic.pkg to install LaTeX.

  3. Update tlmgr:

    sudo tlmgr update --self
    
  4. Install the following tools via tlmgr:

    sudo tlmgr install titlesec
    sudo tlmgr install framed
    sudo tlmgr install threeparttable
    sudo tlmgr install wrapfig
    sudo tlmgr install multirow
    sudo tlmgr install enumitem
    sudo tlmgr install bbding
    sudo tlmgr install titling
    sudo tlmgr install tabu
    sudo tlmgr install mdframed
    sudo tlmgr install tcolorbox
    sudo tlmgr install textpos
    sudo tlmgr install import
    sudo tlmgr install varwidth
    sudo tlmgr install needspace
    sudo tlmgr install tocloft
    sudo tlmgr install ntheorem
    sudo tlmgr install environ
    sudo tlmgr install trimspaces
    
  5. Install fonts via tlmgr:

    sudo tlmgr install collection-fontsrecommended
    

note: Yes, I know I can install the basic LaTeX package using Homebrew, but sometimes I like doing things manually.

http://upload.wikimedia.org/wikipedia/commons/9/9c/Latex_example.png

February 22, 2015 10:00 PM


Al-Ahmadgaid Asaad

Python: Getting Started with Data Analysis

Analysis with Programming has recently been syndicated to Planet Python. As a first post from a contributing blog on that site, I would like to share how to get started with data analysis in Python. Specifically, I would like to do the following:
  1. Importing the data
    • Importing CSV file both locally and from the web;
  2. Data transformation;
  3. Descriptive statistics of the data;
  4. Hypothesis testing
    • One-sample t test;
  5. Visualization; and
  6. Creating custom function.

Importing the data

This is the crucial step: we need to import the data in order to proceed with the subsequent analysis. Data often come in CSV format; if not, they can at least be converted to it. In Python we can do this with the following code:

To read a CSV file locally, we need the pandas module, a Python data analysis library. Its read_csv function can read data both locally and from the web.
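A minimal sketch of both variants; the file name and URL below are hypothetical placeholders, and the in-memory buffer stands in for the post's actual dataset (only the three column names come from the post):

```python
import io

import pandas as pd

# Reading a local file (the path is a hypothetical placeholder):
# df = pd.read_csv('palay.csv')

# read_csv also accepts a URL directly (hypothetical placeholder):
# df = pd.read_csv('https://example.com/palay.csv')

# For a self-contained demo, parse CSV text from an in-memory buffer:
csv_text = "Abra,Apayao,Benguet\n1243,2934,148\n4158,9235,4287\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 3)
```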

Data transformation

Now that we have the data in the workspace, the next step is transformation. Statisticians and scientists often do this to remove data not needed in the analysis. Let's view the data first:
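Roughly, using a small hypothetical stand-in DataFrame (the post's real dataset is not reproduced here):

```python
import pandas as pd

# Hypothetical stand-in for the post's dataset.
df = pd.DataFrame({'Abra': range(30), 'Apayao': range(30, 60),
                   'Benguet': range(60, 90)})

print(df.head())      # first 5 rows -- the default is n=5, unlike R's 6
print(df.tail())      # last 5 rows
print(df.head(n=10))  # the equivalent of R's head(df, n = 10)
```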

For R programmers, the above is the equivalent of print(head(df)), which prints the first six rows of the data, and print(tail(df)), the last six rows. In Python, however, head defaults to 5 rows rather than R's 6, so the equivalent of the R code head(df, n = 10) in Python is df.head(n = 10). The same goes for the tail of the data.

Column and row names of the data are extracted using the colnames and rownames functions in R, respectively. In Python, we extract them using the columns and index attributes. That is,

The transpose of the data is obtained using the T attribute,
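A sketch of these attributes on a tiny hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Abra': [1, 2], 'Apayao': [3, 4]})

print(df.columns)  # column names, the analogue of R's colnames(df)
print(df.index)    # row labels, the analogue of R's rownames(df)
print(df.T)        # the transpose: columns become rows and vice versa
```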

Other transformations, such as sorting, can be done with the sort method. Now let's extract a specific column. In Python, we do this using either the iloc or the ix attribute; ix is more robust and thus I prefer it. Assuming we want the head of the first column of the data, we have

By the way, indexing in Python starts at 0, not 1. To slice the index and the first three columns of the 11th to 21st rows, run the following,

which is equivalent to print df.ix[10:20, ['Abra', 'Apayao', 'Benguet']].
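A sketch of equivalent slicing on a hypothetical stand-in DataFrame. Note that current pandas releases have removed the ix accessor the post uses, so iloc (positional) and loc (label-based) are shown instead:

```python
import pandas as pd

# Hypothetical stand-in with the post's three column names.
df = pd.DataFrame({'Abra': range(30), 'Apayao': range(30),
                   'Benguet': range(30)})

# Head of the first column (positional indexing):
print(df.iloc[:, 0].head())

# 11th to 21st rows of the three named columns
# (label-based slicing with loc is end-inclusive):
print(df.loc[10:20, ['Abra', 'Apayao', 'Benguet']])
```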

To drop a column in the data, say columns 1 (Apayao) and 2 (Benguet), use the drop method. That is,

The axis argument tells the function to drop columns; if axis = 0, it drops rows instead.
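A sketch of both directions of drop, again on hypothetical stand-in data:

```python
import pandas as pd

df = pd.DataFrame({'Abra': [1, 2], 'Apayao': [3, 4], 'Benguet': [5, 6]})

# Drop columns 1 (Apayao) and 2 (Benguet); axis=1 means "drop columns":
print(df.drop(df.columns[[1, 2]], axis=1))

# axis=0 (the default) drops rows by label instead:
print(df.drop([0], axis=0))
```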

Descriptive Statistics

The next step is descriptive statistics, for a preliminary analysis of our data, using the describe method:
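For example, on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Abra': [1, 2, 3, 4], 'Apayao': [10, 20, 30, 40]})

summary = df.describe()
# describe() reports count, mean, std, min, the quartiles and max
# for each numeric column, as a DataFrame.
print(summary)
print(summary.loc['mean', 'Abra'])  # 2.5
```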

Hypothesis Testing

Python has a great package for statistical inference: the stats module of scipy. The one-sample t-test is implemented in the ttest_1samp function. So if we want to test the mean of Abra's volume of palay production against a null hypothesis of an assumed population mean of 15000, we have
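A sketch of the call on randomly generated stand-in data (the real palay figures are not reproduced here, so the numbers below are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
# Hypothetical stand-in for Abra's palay production volumes.
abra = rng.normal(loc=16000, scale=5000, size=80)

t_stat, p_value = stats.ttest_1samp(abra, 15000)
print(t_stat, p_value)

# Applied to a 2-D array, the test runs column-wise by default:
# one t-statistic and one p-value per variable.
data = rng.normal(loc=15000, scale=4000, size=(80, 3))
t_all, p_all = stats.ttest_1samp(data, 15000)
print(t_all, p_all)
```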

The values returned are tuple of the following values:
  • t : float or array
        t-statistic
  • prob : float or array
        two-tailed p-value
From the above numerical output, we see that the p-value = 0.2627 is greater than $\alpha=0.05$, hence there is insufficient evidence to conclude that the average volume of palay production differs from 15000. Applying this test to all variables against the population mean of 15000, we have

The first array returned is the t-statistic of the data, and the second array is the corresponding p-values.

Visualization

There are several modules for visualization in Python, the most popular being the matplotlib library. To mention a few others, there are also the bokeh and seaborn modules to choose from. In my previous post, I demonstrated the matplotlib package, which has the following graphic for a box-whisker plot,
Now, plotting using the pandas module can beautify the above plot with the theme of the popular R plotting package ggplot. To use the ggplot theme, just add one more line to the above code,

And you'll have the following,
Even neater than the default matplotlib.pyplot theme. But in this post, I would like to introduce the seaborn module, a statistical data visualization library. With it, we have the following
sexy boxplot; scroll down for more.
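A minimal box-plot sketch in the same spirit, using matplotlib with a non-interactive backend and random stand-in data (the province names are from the post; everything else is illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; renders without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
# Three hypothetical stand-in samples, one per province column.
data = [rng.normal(15000, 4000, 100) for _ in range(3)]

fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_xticklabels(['Abra', 'Apayao', 'Benguet'])
ax.set_ylabel('Volume of production')
fig.savefig('boxplot.png')
```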

Creating custom function

To define a custom function in Python, we use the def keyword. For example, say we define a function that adds two numbers; we do it as follows,
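A sketch of such a function (the name is illustrative, not from the original post):

```python
def add_2nums(x, y):
    # the indented block is the function body (R would use braces here)
    return x + y

print(add_2nums(4, 9))  # 13
```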

By the way, indentation is significant in Python: it delimits the scope of the function, which in R is done with braces {...}. Now here's an algorithm from my previous post,
  1. Generate samples of size 10 from Normal distribution with $\mu$ = 3 and $\sigma^2$ = 5;
  2. Compute the $\bar{x}$ and $\bar{x}\mp z_{\alpha/2}\displaystyle\frac{\sigma}{\sqrt{n}}$ using the 95% confidence level;
  3. Repeat the process 100 times; then
  4. Compute the percentage of the confidence intervals containing the true mean.
Coding this in Python we have,
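A sketch following the four steps above (the original post's code is not reproduced here; the function name and the seed are illustrative choices):

```python
import numpy as np

def ci_coverage(n=10, mu=3.0, sigma_sq=5.0, trials=100, seed=0):
    """Fraction of 95% z-intervals that contain the true mean."""
    rng = np.random.RandomState(seed)
    sigma = np.sqrt(sigma_sq)
    z = 1.959964  # z_{alpha/2} for a 95% confidence level
    hits = 0
    for _ in range(trials):
        sample = rng.normal(mu, sigma, n)       # step 1: draw the sample
        xbar = sample.mean()                    # step 2: compute x-bar
        half_width = z * sigma / np.sqrt(n)     #         and the interval
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1                           # steps 3-4: tally coverage
    return hits / float(trials)

print(ci_coverage())  # should come out close to 0.95
```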

The above code might be easy to read, but it is slow in replication. For an improved version, thanks to Python gurus, see the comments on my previous post.

Update

For those who are interested in the ipython notebook version of this article, please click here. The article was converted to an ipython notebook by Nuttens Claude.

Data Source

Reference

  1. Pandas, Scipy, and Seaborn documentation.
  2. Wes McKinney & PyData Development Team (2014). pandas: powerful Python data analysis toolkit.

February 22, 2015 08:35 PM


Vasudev Ram

Excel to PDF with xlwings and xtopdf

By Vasudev Ram





Excel to PDF with xlwings and xtopdf - how many x in that? :)

I came across xlwings recently via the Net.

xlwings is by Zoomer Analytics, a startup based in Zürich, Switzerland, by a team with background in financial institutions.

Excerpt from the xlwings documentation:

[ xlwings is a BSD-licensed Python library that makes it easy to call Python from Excel and vice versa:

Interact with Excel from Python using a syntax that is close to VBA yet Pythonic.

Replace your VBA macros with Python code and still pass around your workbooks as easily as before.

xlwings fully supports NumPy arrays and Pandas DataFrames. It works with Microsoft Excel on Windows and Mac. ]

I checked out the xlwings quickstart.

Then did a quick test of using xlwings with xtopdf, my toolkit for PDF creation, to create a simple Excel spreadsheet, then read back its contents, and convert that to PDF.

Here is the code:
"""
xlwingsToPDF.py
A demo program to show how to convert the text extracted from Excel
content, using xlwings, to PDF. It uses the xlwings library, to create
and read the Excel input, and the xtopdf library to write the PDF output.
Author: Vasudev Ram - http://www.dancingbison.com
Copyright 2015 Vasudev Ram
"""

import sys
from xlwings import Workbook, Sheet, Range, Chart
from PDFWriter import PDFWriter

# Create a connection with a new workbook.
wb = Workbook()

# Create the Excel data.
# Column 1.
Range('A1').value = 'Foo 1'
Range('A2').value = 'Foo 2'
Range('A3').value = 'Foo 3'
# Column 2.
Range('B1').value = 'Bar 1'
Range('B2').value = 'Bar 2'
Range('B3').value = 'Bar 3'

pw = PDFWriter("xlwingsTo.pdf")
pw.setFont("Courier", 10)
pw.setHeader("Testing Excel conversion to PDF with xlwings and xtopdf")
pw.setFooter("xlwings: http://xlwings.org --- xtopdf: http://slid.es/vasudevram/xtopdf")

for row in Range('A1..B3').value:
    s = ''
    for col in row:
        s += col + ' | '
    pw.writeLine(s)

pw.close()
I ran it with this command:
py xlwingsToPDF.py
and here is a screenshot of the output PDF file:


Note: The xlwings library can be installed with:
pip install xlwings
But a prerequisite for it, pywin32, did not install automatically. pywin32 is a very useful and powerful Windows API wrapper library for Python, by Mark Hammond. I've used it a few times before, with Python versions earlier than the 2.7.8 I currently use, and I usually installed it directly in those versions. This time, though it was a dependency of xlwings, it did not get installed automatically, and the above Python program gave a runtime error. I had to install pywin32 manually before the program could work.

- Enjoy.

- Vasudev Ram - Dancing Bison Enterprises

Signup to hear about new products or services from me.

Contact Page

February 22, 2015 05:39 AM


Python Software Foundation

PSF Community Service Award goes to Django Girls

Last week the PSF Board passed the following resolution:
“RESOLVED, that the Python Software Foundation award the 4th Quarter 2014 Community Service Awards to Ola Sitarska and Ola Sendecka for their work creating and growing Django Girls, an educational program which has reached more than half a dozen countries, and continues to grow to many more.”
Django Girls was founded by Ola Sitarska and Ola Sendecka as a workshop for about 20 people at EuroPython 2014 in Berlin. According to the Django Girls Website
"Django Girls is a non-profit organization that empowers and helps women to organize free, one-day programming workshops by providing tools, resources and support."
I asked Ola Sitarska for her reaction to receiving the PSF award and she responded:
"Receiving the Community Service Award was a wonderful surprise and amazing honor. I never expected that to happen and I couldn't be more grateful for all the support me and Ola received from the Python community while working on Django Girls."
She also gave me some additional background on their extraordinary growth and future plans.
"So far, we've taught Python and Django to 670 women in places like Germany, Poland, Uganda, Kenya, Ukraine, Taiwan, Australia, United States, and many more.  All attendees were complete beginners in the world of technology, but a couple of them are already working as Django Developers, taking an active role in Python and Django communities. This year we hope to grow even more, develop new open source teaching materials and setup a Django Girls non-profit organization based in US. You can help us make it happen by becoming a Django Girls Patron."
The PSF isn't the only organization that has recognized the significant contribution made by Django Girls. Last year they were honored with the Django Software Foundation's "Malcolm Tredinnick Memorial Prize".
I'm also very happy to report that the Django Girls will be bringing their one-day workshop to PyCon 2015 in Montreal on April 9: PyCon Montreal 2015 Django Girls Workshop. In addition, there will be workshops all over the world next year. See Django Girls.org for the full schedule of cities and to find out how to become a Django Girls Patron.
For those unable to attend a workshop, Django Girls also provides a free online tutorial that has been used by more than 30,000 people. I, myself, as a novice programmer, have taken it and found it to be extremely understandable, effective, and fun.
A hearty congratulations and thank you to this terrific organization!

February 22, 2015 04:47 AM

Enroll as PSF Voting Member

"Membership has its privileges!"
Since the new PSF bylaws were adopted in 2013, there have been several new membership categories that allow for voting rights.
Unfortunately it has taken some time for the PSF to devise a form to allow members to report their eligibility for those categories. We apologize, but here it is at last: Voting Membership.
We know that many of you have made valuable contributions to the language and the PSF, so we hope that you will take the next step and claim your right to vote. Please review the membership criteria at Membership Bylaws.
And for those of you who are not yet PSF members, we encourage you to join under the Basic Membership category. All it takes is to sign up here: PSF Membership.
Thanks to Director David Mertz for the creation of the form and to Directors Marc-André Lemburg and Nick Coghlan for their assistance.

Addendum: Just to clarify, if you are already a voting member (e.g., as a PSF Fellow), there is no need to do anything more. This new form is for Basic Members who do not as yet have voting rights but who qualify according to the criteria.

February 22, 2015 01:15 AM

February 21, 2015


Andrzej Skupień

How to use PIPE in python subprocess.Popen objects

This is something I always have to look up, so today I'm writing it down.

The documentation for subprocess.Popen is here.

Pipes are useful when you want to do something with the output of a command run with Popen. Here is what you can do with that output:

Pass output to another bash command

Say you want to pass the output of one bash command to another. This is the equivalent of the following bash code:

$ ls /etc | grep ntp
ntp-restrict.conf
ntp.conf
ntp_opendirectory.conf

In Python you do it like this:

import subprocess

ls = subprocess.Popen('ls /etc'.split(), stdout=subprocess.PIPE)
grep = subprocess.Popen('grep ntp'.split(), stdin=ls.stdout, stdout=subprocess.PIPE)
output = grep.communicate()[0]

Do it the proper way

Call ls.stdout.close() before grep.communicate() so that ls receives SIGPIPE and can exit sooner if grep dies prematurely. Also add ls.wait() at the end to avoid leaving a zombie process:

ls = subprocess.Popen('ls /etc'.split(), stdout=subprocess.PIPE)
grep = subprocess.Popen('grep ntp'.split(), stdin=ls.stdout, stdout=subprocess.PIPE)
ls.stdout.close()
output = grep.communicate()[0]
ls.wait()

(grep.stdin.close() is called automatically by grep.communicate().)

Another way to write it:

from subprocess import Popen, PIPE

grep = Popen('grep ntp'.split(), stdin=PIPE, stdout=PIPE)
ls = Popen('ls /etc'.split(), stdout=grep.stdin)
output = grep.communicate()[0]
ls.wait()

As you can see, the order in which the commands of the pipe are declared doesn't matter.
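The same idea generalizes to pipelines of any length. Here is a small helper sketch (the pipeline function is my own illustration, not part of the subprocess module) that chains argument lists the way the examples above do:

```python
import subprocess

def pipeline(*cmds):
    """Chain commands like a shell pipeline; return the final stdout.

    Each cmd is an argument list, e.g.
    pipeline(['ls', '/etc'], ['grep', 'ntp'], ['sort']).
    """
    procs = []
    prev_stdout = None
    for cmd in cmds:
        p = subprocess.Popen(cmd, stdin=prev_stdout,
                             stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Close our copy of the upstream pipe so SIGPIPE
            # can reach earlier stages if a later one dies.
            prev_stdout.close()
        prev_stdout = p.stdout
        procs.append(p)
    output = procs[-1].communicate()[0]
    for p in procs[:-1]:
        p.wait()  # reap earlier stages to avoid zombies
    return output
```

With this helper the two-command example becomes pipeline(['ls', '/etc'], ['grep', 'ntp']).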

Alternative to communicate function

Instead of grep.communicate() you can use grep.stdout.read(), but first you have to wait for the subprocess to finish:

>>> grep.wait()
>>> print grep.stdout.read()

This approach has a disadvantage, though: reading consumes the stream, so every subsequent call will return an empty string:

>>> grep.wait()
>>> print grep.stdout.read()
ntp-restrict.conf
ntp.conf
ntp_opendirectory.conf
>>> grep.stdout.read()
''

Furthermore, there is an even bigger disadvantage: if you wait() before read() on a program that produces a lot of output (more than the OS pipe buffer, roughly 4 kilobytes), you'll get a deadlock — the child blocks writing to the full pipe while the parent blocks in wait().
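To see why communicate() is the safe choice here, consider a child that writes well over 4 kilobytes (a deterministic stand-in I'm using instead of grep). Calling p.wait() before reading would deadlock once the pipe buffer fills, but communicate() drains the pipe while waiting:

```python
import subprocess
import sys

# A child process that prints 100,000 bytes -- far more than the
# pipe buffer. communicate() reads the output while waiting for
# the child to exit, so no deadlock is possible.
p = subprocess.Popen(
    [sys.executable, '-c', "print('x' * 100000)"],
    stdout=subprocess.PIPE,
)
output = p.communicate()[0]
```

If you swapped communicate() for p.wait() followed by p.stdout.read(), this exact program would hang.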

Special thanks to Marius Gedminas and J.F. Sebastian for their help with this article.

February 21, 2015 11:00 PM