Planet Python
Last update: February 25, 2015 01:46 PM
February 25, 2015
Machinalis
Reading TechCrunch
When we discussed Information Extraction and IEPY among professional peers we noticed that the approach was often unknown to those who could benefit from it the most. Its main beneficiaries are those with large volumes of unstructured or poorly structured text, where manually going through the text to extract relationships between entities (companies, investment funds, people, and so on) is very costly; in the VC industry, for example, such relationships include funding, acquisitions, and the creation or opening of offices.
To create an example aimed at those with perhaps less of a technical background, we processed the news articles from TechCrunch News, the main technology blog in the United States, looking for funding relationships involving U.S. companies. We published the result and found some interesting things:
VC Industry and Specialized Press
The publication of news about funding may result from investigation by specialized journalists, or it may be pushed by the companies themselves, who manage to place their news within mainstream media content.
Checking the funding-related content in the TechCrunch News posts and comparing it to more complete databases can then show us the editorial policies these journalists follow, or the companies' efficiency in placing their own content.
So, for example, in the funded-companies vs. average-funding chart (currently one of the main discussion topics) you can see a growing gap between the events covered in the more general database (CrunchBase) and those covered by TechCrunch News.
Since last year there has been a tendency to cover events whose funding amount was greater than the CrunchBase average. Based on this data, higher-value funding events appear to attract more attention from journalists than below-average ones.
Considering geographical distribution of events coverage
Some of the highlights we can see include:
- The low coverage in the Massachusetts area. Close to NY and with almost the same number of events in CrunchBase, a funding event in Massachusetts is about half as likely to be reported as one in NY. (Does it make sense to hire PR agencies in one area instead of another?)
- The significantly high coverage in a state such as Utah, comparable to that of the main ones (NY and CA), but without a high volume of funding events.
- The consistent media coverage of events in the two main areas of the industry: CA and NY.
And so on.
In summary, what was the advantage of this approach?
If you wanted to have an overall view you could include content from other blogs like Gigaom, VentureBeat, TWSJ, Forbes Tech, Mashable, Wired, The Verge, etc. without extra effort once the tool has learned to identify and predict relationships (e.g. funding to companies).
And of course, as the demo outlines, we were able to read several thousand news articles, extract the information to build a database, and make the demo without arousing the deep murderous rage that reading ~100k articles looking for that relationship can awaken.
Ludovic Gasc
Open letter for the sync world
These days, I've seen more and more hate directed at the async community in Python, especially around AsyncIO.
I think this is sad and counter-productive.
I feel that for some people, frustrations or misunderstandings about the place of this new tool might be the cause, so I'd like to share some of my thoughts about it.
Just a proven pattern, not a "who has the biggest d*" contest
Some micro-benchmarks have been published to try to explain that AsyncIO isn't really efficient.
We all know that it is possible to have benchmarks prove almost anything, and that the world isn't black or white.
So just for the sake of completeness, here are some macro-benchmarks based on Web applications examples: http://blog.gmludo.eu/2015/02/macro-benchmark-with-django-flask-and-asyncio.html
Now, before starting a ping-pong match to determine who has the biggest, please read on:
The asynchronous/coroutine pattern isn't some new fancy thing that decreases developer productivity and performance.
In fact, the idea of asynchronous, non-blocking I/O has been around in many OSes and programming languages for years.
In Linux, for example, Asynchronous I/O Support was added to kernel 2.5, back in 2003; you can even find specifications going back to 1997 (http://pubs.opengroup.org/onlinepubs/007908799/xsh/aio.h.html).
It started to gain more visibility with (amongst others) NodeJS a couple of years ago.
This pattern is now included in most new languages (Go...) and is made available in older languages (Python, C#...).
Async isn't a silver bullet, especially for intensive calculations, but for I/O, at least from my experience, it seems to be much more efficient.
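As a tiny illustration of the pattern (using the async/await syntax Python later gained in 3.5; the function names here are made up for the example):

```python
import asyncio

async def fetch(delay, name):
    # Simulate a non-blocking I/O wait, e.g. a network call
    await asyncio.sleep(delay)
    return name

async def main():
    # Both "requests" wait concurrently: total time is close to the
    # longest single delay, not the sum of the delays
    return await asyncio.gather(fetch(0.1, "a"), fetch(0.1, "b"))

print(asyncio.run(main()))  # ['a', 'b']
```

While one coroutine is waiting on I/O, the event loop runs the others; that is where the efficiency for I/O-bound work comes from.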
The lengthy but successful maturation process of a new standard
In the Python world, a number of alternatives were available (Gevent, Twisted, Eventlet, libevent, stackless,...) each with their own strengths and weaknesses.
Each of them went through a maturation process and can be used in real production environments.
It was really clever of Guido to take the good ideas from all these async frameworks to create AsyncIO.
Instead of having a number of different frameworks, each of them reinventing the wheel on its own island, AsyncIO should give us a "lingua franca" for doing async in Python.
This is pretty important because once you enter the async world, all your usual tools and libs (like your favourite DB lib) should also be async compliant.
Because AsyncIO isn't just a library: it will become the "standard" way to write async code in Python.
If async means rewriting my perfectly working code, why should I bother?
To integrate AsyncIO cleanly into your library or your application, you have to rethink the internal architecture.
When you start a new project in "async mode", you can't keep part of it sync: to get all the async benefits, everything should be async.
But this isn't mandatory from day 1: you can start simple and port your code to the async pattern step by step.
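One common stepping stone for that gradual port (the function names here are hypothetical) is to keep existing sync code and call it from the event loop via an executor:

```python
import asyncio

def legacy_blocking_call(x):
    # Existing, perfectly working sync code you don't rewrite yet
    return x * 2

async def handler():
    loop = asyncio.get_running_loop()
    # Run the sync function in a thread pool so the event loop
    # stays free to serve other coroutines in the meantime
    return await loop.run_in_executor(None, legacy_blocking_call, 21)

print(asyncio.run(handler()))  # 42
```

The coroutine-based parts of the application stay responsive, and each legacy function can be made natively async later, one at a time.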
I can understand some of the haters' reactions: the Internet is a big swarm with a lot of trends and hype.
In the end, few tools and patterns will really survive the fire of production.
Meanwhile, you've already written a lot of perfectly working code, and you obviously don't want to rewrite it just for the promises of the latest buzzword.
It's like object-oriented programming years ago: it suddenly became the new "proper" way of writing your code (some said), and you couldn't be object-oriented and procedural at the same time.
Years later, procedural isn't completely dead, because OO sometimes brings unnecessary overhead.
It really depends on what sort of things you are writing (size matters!).
On the other hand, in 2015, who writes a full-blown application using only procedural style?
I think one day, it will be the same for the async pattern.
It is always better to drive the change than to endure it.
Think organic: in the long term, it is not the strongest that survives, nor the most intelligent.
It is usually the one most open and adaptive to change.
Buzzword, or real paradigm change ?
We don't know for sure whether the async pattern is just a passing buzzword or a real paradigm shift in IT, like virtualization, which has become a de-facto standard over the last few years.
But my feeling is that it is here to stay, even if it won't be relevant for all Python projects.
I think it will become the right way to build efficient and scalable I/O-bound projects.
For example, in an Internet (network) driven world, I see more and more projects centred around piping between cloud-based services.
For this type of development, I'm personally convinced a paradigm shift has become unavoidable, and for Pythonists AsyncIO is probably the right horse to bet on.
Does anyone really care, or "will I be paid more"?
Let's face it: besides your fellow geeks, nobody cares about the tools you are using.
Your users just want features for yesterday, as few bugs as possible, and they want their application to be fast and responsive.
Who cares if you use async, or some other hoodoo-voodoo black magic, to reach that goal?
I think that by starting a "religious war" between sync and async Python developers, we would all waste our (precious) time.
Instead, we should cultivate emulation among Pythonistas and build solutions that increase real-world performance and stability.
Then let Darwin show us the long term path and adapt to it.
In the end, the whole Python community will benefit if Python is considered as a great language to write business logic with ease AND with brute performance.
We are all tired of hearing people in other communities say that Python is slow; we are all convinced this is simply not true.
This is a communication war that the Python community has to win as a team.
PS: Special thanks to Nicolas Stein, aka Nike, for the review of this text and his precious advice in general, stimulating a scientific approach to problems.
Macro-benchmark with Django, Flask and AsyncIO (aiohttp.web+API-Hour)
Disclaimer: If you have some bias and/or dislike AsyncIO, please read my previous blog post before starting a war.
Tip: If you don't have the time to read the text, scroll down to see graphics.
Context of this macro-benchmark
Today, I propose to benchmark an HTTP daemon based on AsyncIO, and to compare the results with Flask and Django versions.

For those who didn't follow AsyncIO news, aiohttp.web is a light Web framework based on aiohttp. It's like Flask, but with fewer internal layers.
aiohttp is the implementation of HTTP with AsyncIO.
Moreover, API-Hour helps you to have multiprocess daemons with AsyncIO.
With this tool, we can compare Flask, Django and aiohttp.web in the same conditions.
This benchmark is based on a concrete need of one of our customers: they wanted to have a REST/JSON API to interact with their telephony server, based on Asterisk.
One of the WebServices gives the list of agents with their status. This WebService is heavily used because they use it on their public Website (itself having a serious traffic) to show who is available.
First, I've made a HTTP daemon based on Flask and Gunicorn, which gave honorable results. Later on, I replaced the HTTP part and pushed in production a daemon based on aiohttp.web and API-Hour.
A subset of these daemons is used for this benchmark.
I've added a Django version because, with Django and Flask, I certainly cover 90% of the tools used by Python Web developers.
I've tried to have the same parameters for each daemon: for example, I obviously use the same number of workers, 16 in this benchmark.
I don't benchmark Django's manage.py runserver or Flask's dev HTTP server; I use Gunicorn, as most people do in production, to compare apples with apples.
Hardware
- Server: A Dell Precision M6800 with i7 2.90GHz and 16 GB of RAM
- Client: A Dell XPS L502X with i5 2.40GHz and 6GB of RAM
- Network: RJ45 cable between server and client
Network benchmark
I get almost 1 Gb/s with this network.

On Server:
$ iperf -c 192.168.2.101 -d
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 28.6 MByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.2.101, TCP port 5001
TCP window size: 28.6 MByte (default)
------------------------------------------------------------
[ 5] local 192.168.2.100 port 24831 connected with 192.168.2.101 port 5001
[ 4] local 192.168.2.100 port 5001 connected with 192.168.2.101 port 16316
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.1 sec 1.06 GBytes 903 Mbits/sec
[ 5] 0.0-10.1 sec 1.11 GBytes 943 Mbits/sec
On Client:
$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 28.6 MByte (default)
------------------------------------------------------------
[ 4] local 192.168.2.101 port 5001 connected with 192.168.2.100 port 24831
------------------------------------------------------------
Client connecting to 192.168.2.100, TCP port 5001
TCP window size: 28.6 MByte (default)
------------------------------------------------------------
[ 6] local 192.168.2.101 port 16316 connected with 192.168.2.100 port 5001
[ ID] Interval Transfer Bandwidth
[ 6] 0.0-10.0 sec 1.06 GBytes 908 Mbits/sec
[ 4] 0.0-10.2 sec 1.11 GBytes 927 Mbits/sec
System configuration
It's important to configure PostgreSQL as a production server.
You also need to configure your Linux kernel to handle a lot of open sockets, plus some TCP tricks.
Everything is in the benchmark repository.
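For illustration, tuning of this sort usually means kernel knobs along these lines (the knob names are real Linux sysctls, but the values here are examples, not the benchmark's actual settings; see the repository for those):

```shell
# Allow large backlogs of pending connections
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535

# Widen the range of ephemeral ports available for outgoing sockets
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Raise the per-process open file descriptor limit (sockets are fds)
ulimit -n 1048576
```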
Client benchmark tool
From my experience with AsyncIO, Apache Benchmark (ab), Siege, Funkload and other old-fashioned HTTP benchmark tools don't hit hard enough for an API-Hour daemon.

For now, I use wrk and wrk2 for benchmarking: wrk hits as fast as possible, whereas wrk2 hits at a constant rate.
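For reference, the two tools are invoked along these lines (the thread/connection counts and URL are illustrative, not the exact ones used in this benchmark):

```shell
# wrk: saturate the daemon, as many requests as possible for 30 s
wrk -t4 -c256 -d30s http://127.0.0.1:8008/index

# wrk2: hold a constant rate (-R, in requests/second) for 5 minutes
wrk2 -t4 -c256 -d300s -R4000 http://127.0.0.1:8008/index
```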
Metrics observed
I record three metrics:
- Requests/sec: the least interesting of the metrics (see below)
- Error rate: the sum of all errors (socket timeouts, socket read/write errors, 5XX errors...)
- Reactivity: certainly the most interesting of the three; it measures the time our client will actually wait.
WebServices daemons
You can find all the source code in the API-Hour repository: https://github.com/Eyepea/API-Hour/tree/master/benchmarks

Each daemon has at least two WebServices:
- /index: a simple JSON document
- /agents: the list of agents with their status, retrieved in the backend with a SQL query
On the Django daemon, I added an /agents_with_orm endpoint to measure the overhead of using the Django ORM instead of raw SQL. Warning: I didn't find a way to generate the exact same query.
Methodology
Each daemon runs alone, to preserve resources. Between runs, the daemon is restarted to be sure a previous test doesn't pollute the next one.
First turn
At the beginning, to get an idea of the maximum number of HTTP queries each daemon can support, I attack quickly (30 seconds) on localhost.

Warning! This benchmark doesn't represent what you would see in production, because there is no network limitation or latency; it's only for calibration.
Simple JSON document
In each daemon folder of the benchmarks repository, you can read the output of each wrk run. To simplify reading, I summarize the captured values in tables and graphs:
| | Requests/s | Errors | Avg Latency (s) |
|---|---|---|---|
| Django+Gunicorn | 70598 | 4489 | 7.7 |
| Flask+Gunicorn | 79598 | 4433 | 13.16 |
| aiohttp.web+API-Hour | 395847 | 0 | 0.03 |
[Chart: Requests per second (higher is better)]
[Chart: Errors (lower is better)]
[Chart: Latency in seconds (lower is better)]
Agents list from database
| | Requests/s | Errors | Avg Latency (s) |
|---|---|---|---|
| Django+Gunicorn | 583 | 2518 | 0.324 |
| Django ORM+Gunicorn | 572 | 2798 | 0.572 |
| Flask+Gunicorn | 634 | 2985 | 13.16 |
| Flask (connection pool) | 2535 | 79704 | 12.09 |
| aiohttp.web+API-Hour | 4179 | 0 | 0.098 |
[Chart: Requests per second (higher is better)]
[Chart: Errors (lower is better)]
[Chart: Latency in seconds (lower is better)]
Conclusions for the next round
Under high load, Django doesn't behave the same as Flask: both handle more or less the same request rate, but Django penalizes the overall latency of HTTP queries less. The drawback is that the slow HTTP queries are very slow (26.43s for Django compared to 13.31s for Flask).

I removed the Django ORM test for the next round because the generated SQL query isn't exactly the same, and the performance difference with the raw SQL query is negligible.
I also removed the Flask DB connection pool because its error rate is too high compared to the other tests.
Second round
Here, I use wrk2, and changed the run time to 5 minutes.

A longer run time is very important because resource availability can change over time. There are at least two reasons for this:
1. Your test environment runs on top of an OS that continues its activity during the test. You therefore need a long run to be insensitive to transient use of your test machine's resources by other things, like an OS daemon or a cron job triggering meanwhile.
2. The ramp-up of your test will gradually consume more resources at different levels: at the level of your Python scripts and libs, as well as at the level of your OS / (virtual) machine. This decrease of available resources is not necessarily instantaneous, nor linear. It is a typical source of after-deployment bad surprises in production.
Here too, to stay as close as possible to a production scenario, you need to give your test time to reach a steady state, eventually saturating some resources.
Ideally you'd saturate the network first (which in this case is like winning the jackpot).
Here, I'm testing at a constant 4000 queries per second, this time through the network.
Simple JSON document
| | Requests/s | Errors | Avg Latency (s) |
|---|---|---|---|
| Django+Gunicorn | 1799 | 26883 | 97 |
| Flask+Gunicorn | 2714 | 26742 | 52 |
| aiohttp.web+API-Hour | 3995 | 0 | 0.002 |
[Chart: Requests per second (higher is better)]
[Chart: Errors (lower is better)]
[Chart: Latency in seconds (lower is better)]
Agents list from database
| | Requests/s | Errors | Avg Latency (s) |
|---|---|---|---|
| Django+Gunicorn | 278 | 37480 | 141.6 |
| Flask+Gunicorn | 304 | 40951 | 136.8 |
| aiohttp.web+API-Hour | 3698 | 0 | 7.84 |
[Chart: Requests per second (higher is better)]
[Chart: Errors (lower is better)]
[Chart: Latency in seconds (lower is better)]
(Extra) Third round
For fun, I used the same setup as the second round, but with only 10 requests/second during 30 seconds, to see if under a low load the sync daemons could be quicker, given the AsyncIO overhead.

Agents list from database
| | Requests/s | Errors | Avg Latency (s) |
|---|---|---|---|
| Django+Gunicorn | 10 | 0 | 0.01936 |
| Flask+Gunicorn | 10 | 0 | 0.01874 |
| aiohttp.web+API-Hour | 10 | 0 | 0.00642 |
[Chart: Latency in seconds (lower is better)]
Conclusion
Some clues to improve AsyncIO performances
- Use an alternative event loop: I've tried replacing the AsyncIO event loop and network layer with aiouv and quamash. For now, it doesn't really have a huge impact; maybe it will in the future.
- Use multiplexed protocols from frontend to backend: HTTP/2 is a multiplexed protocol, meaning you can send several HTTP queries without waiting for the first response. This pattern should increase AsyncIO performance, but that must be validated with a benchmark.
- If you have another idea, don't hesitate to post it in comments.
Don't take architectural decisions based on micro-benchmarks
Don't forget this is all about I/O-bound workloads
PyPy Development
Experiments in Pyrlang with RPython
Pyrlang is an Erlang BEAM bytecode interpreter written in RPython.
It implements approximately 25% of the BEAM instructions. It supports integer calculations (but not bigints), closures, exception handling, some operations on atoms, lists and tuples, user modules, and multi-process execution on a single core. Pyrlang is still in development.
There are some differences between BEAM and the VM of PyPy:
- BEAM is a register-based VM, whereas the VM in PyPy is stack-based.
- There is no traditional call-stack in BEAM. The Y register in BEAM is similar to a call-stack, but the Y register can sometimes store some variables.
- There are no typical language-level threads and OS-level threads in BEAM; only language-level processes, whose behavior is very similar to the actor model.
Regarding the bytecode dispatch loop, Pyrlang uses a while loop to fetch instructions and operands, call the function corresponding to each instruction, and jump back to the head of the while loop. Due to the differences between the RPython call stack and BEAM's Y register, we decided to implement and manage the Y register by hand. PyPy, on the other hand, uses RPython's call stack to implement Python's call stack; as a result, the function for the dispatch loop in PyPy calls itself recursively. This does not happen in Pyrlang.
The Erlang compiler (erlc) usually compiles the bytecode instructions for function invocation into CALL (for normal invocation) and CALL_ONLY (for tail-recursive invocation). You can use trampoline semantics to implement them:
- CALL instruction: The VM pushes the current instruction pointer (or called-program counter in PyPy) to the Y register, and jumps to the destination label. When encountering a RETURN instruction, the VM pops the instruction pointer from the Y register and returns to the location of the instruction pointer to continue executing the outer function.
- CALL_ONLY instruction: The VM simply jumps to the destination label, without any modification of the Y register. As a result, the tail recursive invocation never increases the Y register.
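A toy Python sketch of this trampoline (the instruction names come from the text above; the program encoding and everything else is invented for illustration):

```python
# Opcodes for the sketch; a program is a list of (opcode, argument) pairs
CALL, CALL_ONLY, RETURN, HALT = range(4)

def run(code, entry=0):
    pc = entry
    y_register = []                    # hand-managed stack of return addresses
    executed = []                      # record of opcodes, for inspection
    while True:
        op, arg = code[pc]
        executed.append(op)
        if op == CALL:
            y_register.append(pc + 1)  # remember where to come back to
            pc = arg                   # jump to the callee
        elif op == CALL_ONLY:
            pc = arg                   # tail call: Y register untouched
        elif op == RETURN:
            pc = y_register.pop()      # resume the outer function
        elif op == HALT:
            return executed

# A "function" at address 2 that just returns; main CALLs it, then halts
program = [(CALL, 2), (HALT, None), (RETURN, None)]
print(run(program))  # [0, 2, 3] i.e. CALL, RETURN, HALT
```

Because CALL_ONLY never pushes onto the Y register, a tail-recursive loop runs in constant stack space, which is exactly why it is a natural spot for the JIT's can_enter_jit hint.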
The current implementation only inserts the JIT hint of can_enter_jit following the CALL_ONLY instruction. This means that the JIT only traces the tail-recursive invocation in Erlang code, which has a very similar semantic to the loop in imperative programming languages like Python.
We have also written a single scheduler to implement language-level processes on a single core. There is a runnable queue in the scheduler. On each iteration, the scheduler pops one element (a process object with a dispatch loop) from the queue and executes its dispatch loop. Inside the dispatch loop there is a counter called the "reduction". The reduction is decremented during the execution of the loop, and when it reaches 0, the dispatch loop terminates. The scheduler then pushes that element onto the runnable queue again, pops the next element from the queue, and so on.
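In Python terms, the scheduler described above might be sketched like this (the Process class and its abstract "work units" are invented for the example; real reductions are decremented at function invocations, not abstract steps):

```python
from collections import deque

class Process:
    # Toy language-level process: `work` abstract steps remain to run
    def __init__(self, name, work):
        self.name = name
        self.work = work

    def dispatch(self, reductions):
        # Run until the reduction budget or the work is exhausted;
        # return True if the process still has work left
        self.work -= min(self.work, reductions)
        return self.work > 0

def schedule(processes, reductions=2000):
    runnable = deque(processes)    # the scheduler's runnable queue
    order = []                     # which process ran in each slot
    while runnable:
        proc = runnable.popleft()
        order.append(proc.name)
        if proc.dispatch(reductions):
            runnable.append(proc)  # budget exhausted: requeue at the back
    return order

# With a small budget, two long-running processes interleave
print(schedule([Process("a", 5), Process("b", 3)], reductions=2))
# ['a', 'b', 'a', 'b', 'a']
```

A small reduction budget gives fair interleaving at the cost of frequent switches, which is the trade-off the benchmarks below measure.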
We are planning to implement a multi-process scheduler for multi-core CPUs, which will require multiple schedulers and even multiple runnable queues for each core, but that will be another story. :-)
Methods
We wrote two benchmark programs of Erlang:
- FACT: A benchmark that calculates the factorial in a tail-recursive style; because we haven't implemented big ints, we take a remainder of the argument for the next iteration, so the number never overflows.
- REVERSE: The benchmark creates a reversed list of numbers, such as [20000, 19999, 19998, …], and applies a bubble sort to it.
Results
The Value of Reduction
We used REVERSE to evaluate the JIT with different values of reduction:
The X axis is the value of reduction, and the Y axis is the execution time (in seconds).
It seems that when the value of reduction is small, it influences the performance significantly, but once the reduction becomes larger, it only increases the speed very slightly. In fact, we use 2000 as the default reduction value (which is also the default in the official Erlang interpreter).
Surprisingly, the trace is always generated, even when the reduction is very small, such as 0, which means the dispatch loop can only run for a very limited number of iterations and the language-level process executes fewer instructions than an entire loop in one switch of the scheduler. The generated trace is almost the same regardless of the reduction value.
Actually, the RPython JIT only cares about what code it meets, not who executes it, so the JIT always generates the results above. The trace can even be shared among different threads if they execute the same code.
The overhead at low reduction values may be due to the scheduler switching between processes too frequently, or to the too-frequent switching between the bytecode interpreter and native code, but not to the JIT itself.
Here is more explanation from Armin Rigo:
“The JIT works well because you’re using a scheme where some counter is decremented (and the soft-thread interrupted when it reaches zero) only once in each app-level loop. The soft-thread switch is done by returning to some scheduler, which will resume a different soft-thread by calling it. It means the JIT can still compile each of the loops as usual, with the generated machine code containing the decrease-and-check-for-zero operation which, when true, exits the assembler.”
Fair Process Switching vs. Unfair Process Switching
We are also concerned about the timing for decreasing reduction value. In our initial version of Pyrlang, we decrease reduction value at every local function invocation, module function invocation, and BIF (built-in function) invocation, since this is what the official Erlang interpreter does. However, since the JIT in RPython basically traces the target language loop (which is the tail recursive invocation in Pyrlang) it is typically better to keep the loop whole during a switch of the language level process. We modified Pyrlang, and made the reduction decrement only occur after CALL_ONLY, which is actually the loop boundary of the target language.
Of course, this strategy may cause an “unfair” execution among language level processes. For example, if one process has only a single long-sequence code, it executes until the end of the code. On the other hand, if a process has a very short loop, it may be executed by very limited steps then be switched out by the scheduler. However, in the real world, this “unfairness” is usually considered acceptable, and is used in many VM implementations including PyPy for improving the overall performance.
We compared these two versions of Pyrlang on the FACT benchmark. The reduction decrement behaves quite differently because there are some BIF invocations inside the loop: in the old version, the process can be suspended at loop boundaries or at other function invocations, but in the new version it can be suspended only at loop boundaries.
We show that the strategy is effective, removing around 7% of the overhead. We have also compared it in REVERSE, but since there are no extra invocations inside the trace, it cannot provide any performance improvement. In the real world, we believe there is usually more than one extra invocation inside a single loop, so this strategy is effective for most cases.
Comparison with Default Erlang and HiPE
We compared the performance of Pyrlang with the default Erlang interpreter and the HiPE (High Performance Erlang) compiler. HiPE is an official Erlang compiler that compiles Erlang source code to native code. The speed of Erlang programs obviously improves, but at the cost of generality.
Please note that Pyrlang is still in development, so in some situations it does less work than the default Erlang interpreter, such as not checking integer overflow when dealing with big integers, and not checking and adding locks when accessing the message queues of language-level processes, and is therefore faster. The final version of Pyrlang may be slower.
We used the two benchmark programs above, and made sure both of them run for more than five seconds to cover the JIT warm-up time of RPython. The experiment environment is an OS X 10.10 machine with a 3.5 GHz 6-core Intel Xeon E5 CPU and 14 GB of 1866 MHz DDR3 ECC memory.
Let’s look at the result of FACT. The graph shows that Pyrlang runs 177.41% faster on average than Erlang, and runs at almost the same speed as HiPE. However, since we haven’t implemented big integer in Pyrlang, the arithmetical operators do not do any extra overflow checking. It is reasonable that the final version for Pyrlang will be slower than the current version and HiPE.
As for REVERSE, the graph shows that Pyrlang runs 45.09% faster than Erlang, but 63.45% slower than HiPE on average. We think this is reasonable because there are only a few arithmetical operators in this benchmark, so the speeds of the three implementations are closer. However, we observed that at the scale of 40,000, the speed of Pyrlang slowed down significantly (111.35% slower than HiPE) compared with the other two scales (56.38% and 22.63% slower than HiPE).
Until now we can only hypothesize about why Pyrlang slows down at that scale. We guess the overhead might come from the GC, because BEAM bytecode provides some GC hints that help the default Erlang implementation perform some GC operations immediately. For example, using GC_BIF instead of a BIF instruction tells the VM that there may be a GC opportunity, and how many live variables are around that instruction. In Pyrlang we do not use these hints and rely totally on RPython's GC. When there is a huge number of objects at runtime (for REVERSE, the Erlang list objects), the speed therefore slows down.
Ruochen Huang
Kushal Das
My talk in MSF, India
Last week I gave a talk on Free and Open Source Software at the Metal and Steel Factory, Indian Ordnance Factories, Ishapore, India. I met Mr. Amartya Talukdar, a well-known activist and blogger from Kolkata, at a bloggers' meet. He currently manages the I.T. team at the above-mentioned place, and he arranged the talk to spread more awareness about FOSS.
I reached the main gate an hour before the talk. The security guards came around to ask me why I was standing there on the road. I was sure this was going to happen again. I went into the factory along with Mr. Talukdar; at least three times the security guards stopped me, guns at the ready. They also took my mobile phone; I had left my camera at home for the same reason.
I met the I.T. department and a few developers who work there before the talk. Around 9:40am we moved to the big conference room for my talk. The talk started with Mr. Talukdar giving a small introduction. I was not sure how many technical people would attend, so it was less technical and more demo-oriented. The room was almost full within a few minutes, and I hope that my introductions to FOSS, Fedora, and Python went well. I was carrying a few Python docs with me and a few Fedora stickers. I spent most of the talk demoing various tools that can increase the productivity of management when used right. We saw reStructuredText, rst2pdf and Sphinx for managing documents. We also looked into version control systems and how we can use them. We talked a bit about Owncloud, but without network access I could not demo it. I also demoed various small Python scripts I use to keep my life simple. I learned about various FOSS tools they are already using. They use Linux on the servers; my biggest suggestion was to use Linux on the desktops too. Viruses are a perennial problem that can easily be eliminated with Linux on the desktop.
My talk ended around 12pm. After lunch, while walking back to the factory, Mr. Talukdar showed me various historical places and items from the Dutch and British colonial days. Of course, the security checks were there again on the way out and back in.
We spent the next few hours discussing various technology and workflow related queries with the Jt. General Manager, Mr. Neeraj Agrawal. It was very nice to see that he is up to date with all the latest news and information from the FOSS and technology world. We really need more people like him, open to new ideas and capable of managing both worlds. In the future we will be doing a few workshops targeting the needs of the factory's developers.
Vasudev Ram
Publish SQLite data to PDF using named tuples
By Vasudev Ram
Some time ago I had written this post:
Publishing SQLite data to PDF is easy with xtopdf.
It showed how to get data from an SQLite (Wikipedia) database and write it to PDF, using xtopdf, my open source PDF creation library for Python.
Today I was browsing the Python standard library docs and thought of modifying that program to use the namedtuple data type from the collections module, which is described as implementing "high-performance container datatypes". The collections module was introduced in Python 2.4 (namedtuple itself was added in Python 2.6).
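As a quick refresher on namedtuple before the full program (the field values here are just sample data):

```python
from collections import namedtuple

# Each row becomes a lightweight object with named fields
StockRecord = namedtuple('StockRecord', 'date, trans, symbol, qty, price')
row = StockRecord._make(('2006-01-05', 'BUY', 'RHAT', 100, 25.1))
print(row.symbol, row.qty)  # fields accessed by name instead of by index
```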
Here is a modified version of that program, SQLiteToPDF.py, called SQLiteToPDFWithNamedTuples.py, that uses named tuples:
This time I've imported print_function so that I can use print as a function instead of as a statement.

# SQLiteToPDFWithNamedTuples.py
# Author: Vasudev Ram - http://www.dancingbison.com
# SQLiteToPDFWithNamedTuples.py is a program to demonstrate how to read
# SQLite database data and convert it to PDF. It uses the Python
# data structure called namedtuple from the collections module of
# the Python standard library.
from __future__ import print_function
import sys
from collections import namedtuple
import sqlite3
from PDFWriter import PDFWriter
# Helper function to output a string to both screen and PDF.
def print_and_write(pw, strng):
    print(strng)
    pw.writeLine(strng)

pw = conn = None
try:
    # Create the stocks database.
    conn = sqlite3.connect('stocks.db')
    # Get a cursor to it.
    curs = conn.cursor()
    # Create the stocks table.
    curs.execute('''DROP TABLE IF EXISTS stocks''')
    curs.execute('''CREATE TABLE stocks
        (date text, trans text, symbol text, qty real, price real)''')
    # Insert a few rows of data into the stocks table.
    curs.execute("INSERT INTO stocks VALUES ('2006-01-05', 'BUY', 'RHAT', 100, 25.1)")
    curs.execute("INSERT INTO stocks VALUES ('2007-02-06', 'SELL', 'ORCL', 200, 35.2)")
    curs.execute("INSERT INTO stocks VALUES ('2008-03-07', 'HOLD', 'IBM', 300, 45.3)")
    conn.commit()
    # Create a namedtuple to represent stock rows.
    StockRecord = namedtuple('StockRecord', 'date, trans, symbol, qty, price')
    # Run the query to get the stocks data.
    curs.execute("SELECT date, trans, symbol, qty, price FROM stocks")
    # Create a PDFWriter and set some of its fields.
    pw = PDFWriter("stocks.pdf")
    pw.setFont("Courier", 12)
    pw.setHeader("SQLite data to PDF with named tuples")
    pw.setFooter("Generated by xtopdf - https://bitbucket.org/vasudevram/xtopdf")
    # Write header info.
    hdr_flds = [str(hdr_fld).rjust(10) + " " for hdr_fld in StockRecord._fields]
    hdr_fld_str = ''.join(hdr_flds)
    print_and_write(pw, '=' * len(hdr_fld_str))
    print_and_write(pw, hdr_fld_str)
    print_and_write(pw, '-' * len(hdr_fld_str))
    # Now loop over the fetched data and write it to PDF.
    # Map the StockRecord namedtuple's _make class method
    # (which creates a new instance) over all the rows fetched.
    for stock in map(StockRecord._make, curs.fetchall()):
        row = [str(col).rjust(10) + " " for col in (stock.date,
               stock.trans, stock.symbol, stock.qty, stock.price)]
        # The line above can be written more simply as:
        # row = [str(col).rjust(10) + " " for col in stock]
        row_str = ''.join(row)
        print_and_write(pw, row_str)
    print_and_write(pw, '=' * len(hdr_fld_str))
except Exception as e:
    # Use str(e) rather than e.message, which is deprecated.
    print("ERROR: Caught exception: " + str(e))
    sys.exit(1)
finally:
    # pw/conn may not exist if the failure happened before they were created.
    if pw is not None:
        pw.close()
    if conn is not None:
        conn.close()
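On a related note (this is not part of the original program), the sqlite3 module also lets you install a row_factory so that every fetched row already is a namedtuple, addressable by column name; a minimal sketch:

```python
import sqlite3
from collections import namedtuple

def namedtuple_factory(cursor, row):
    # Build a namedtuple type from the column names of this query.
    Row = namedtuple('Row', [col[0] for col in cursor.description])
    return Row._make(row)

conn = sqlite3.connect(':memory:')
conn.row_factory = namedtuple_factory  # every fetched row is now a namedtuple
curs = conn.cursor()
curs.execute("CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)")
curs.execute("INSERT INTO stocks VALUES ('2006-01-05', 'BUY', 'RHAT', 100, 25.1)")
rows = list(curs.execute("SELECT * FROM stocks"))
print(rows[0].symbol, rows[0].qty)  # fields are addressable by column name
conn.close()
```

This avoids declaring the StockRecord fields by hand, at the cost of building a namedtuple class per query.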
Here's a screenshot of the PDF output in Foxit PDF Reader:
- Vasudev Ram - Online Python training and programming, Dancing Bison Enterprises
February 24, 2015
François Dion
J is for ... autojump!
Shell addon
At our last PYPTUG meeting, I was demoing Dshell. While at it, I suggested using the j command, a.k.a. autojump:
https://github.com/joelthelion/autojump
# default autojump command
j() {
    if [[ ${1} == -* ]] && [[ ${1} != "--" ]]; then
        autojump ${@}
        return
    fi
    output="$(autojump ${@})"
    if [[ -d "${output}" ]]; then
        echo -e "\\033[31m${output}\\033[0m"
        cd "${output}"
    else
        echo "autojump: directory '${@}' not found"
        echo "\n${output}\n"
        echo "Try \`autojump --help\` for more information."
        false
    fi
}
Although this part is all bash scripting, the actual autojump command is written in something else altogether.
Python powered
Autojump has been around for many years, yet to this day few people are aware of it. In fact, as I typed j, a quick survey around the room confirmed my gut feeling.
I have blogged and tweeted about it before, but mostly in passing. Hopefully this post will bring a bit more exposure to this really useful tool.
And do have a look at the Python code (https://github.com/joelthelion/autojump/blob/master/bin/autojump). It makes some interesting use of the lesser-known SequenceMatcher class from the difflib module, and good use of lambdas. Oh, and yes, it's PEP 8 formatted. Thank you.
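To give a flavor of that SequenceMatcher use, here is a tiny, hypothetical sketch of fuzzy directory matching (illustrative only, not autojump's actual ranking code):

```python
from difflib import SequenceMatcher

def best_match(needle, paths):
    # Score each candidate path against the typed fragment; highest ratio wins.
    scorer = lambda p: SequenceMatcher(a=needle.lower(), b=p.lower()).ratio()
    return max(paths, key=scorer)

paths = ['/home/user/projects', '/home/user/photos', '/var/log']
print(best_match('proj', paths))  # -> /home/user/projects
```

The real tool also weights matches by how often and how recently you visited each directory; the similarity ratio is only one ingredient.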
Francois
@f_dion
Caktus Consulting Group
PyCon 2015 Ticket Giveaway
Caktus is giving away a PyCon 2015 ticket, valued at $350. We love going to PyCon every year. It’s the largest gathering of developers using Python, the open source programming language that Caktus relies on. This year, it’ll be held April 8th-16th at the beautiful Palais des congrès de Montréal (the inspiration we used to design the website).
To enter, follow @caktusgroup on Twitter and RT this message.
The giveaway will end Tuesday, March 3rd at 12pm EST. Winner will be notified via Twitter DM. A response via DM is required within 24 hours or entrant forfeits their ticket. Caktus employees are not eligible. Winning entrant must be 18 years of age or older. Ticket is non-transferable.
Bonne chance!
PythonClub - A Brazilian collaborative blog about Python
Mutant tuples in Python
By Luciano Ramalho, author of the book Fluent Python (O'Reilly, 2014)
See also the original article in English: http://radar.oreilly.com/2014/10/python-tuples-immutable-but-potentially-changing.html
Tuples in Python have a surprising trait: they are immutable, but their values may change. This can happen when a tuple holds a reference to any mutable object, such as a list. If you need to explain this to a colleague who is new to Python, a good first step is to dismantle the common-sense notion that variables are like boxes in which we store data.
In 1997 I took a summer course about Java at MIT. The professor, Lynn Andrea Stein, an award-winning computer science educator, emphasized that the usual "variables as boxes" metaphor ends up hindering the understanding of reference variables in OO languages. Python variables are like reference variables in Java, so it's better to think of them as labels attached to objects.
Here is an example inspired by the book Through the Looking-Glass, and What Alice Found There, by Lewis Carroll.
Tweedledum and Tweedledee are twins. From the book: "Alice knew which was which in a moment, because one of them had 'DUM' embroidered on his collar, and the other 'DEE'."
Let's represent them as tuples holding the date of birth and a list of their skills:
>>> dum = ('1861-10-23', ['poesia', 'fingir-luta'])
>>> dee = ('1861-10-23', ['poesia', 'fingir-luta'])
>>> dum == dee
True
>>> dum is dee
False
>>> id(dum), id(dee)
(4313018120, 4312991048)
Clearly dum and dee refer to objects that are equal, but not to the same object. They have distinct identities.
Now, after the events witnessed by Alice, Tweedledum decided to become a rapper, adopting the stage name T-Doom. We can express this in Python like so:
>>> t_doom = dum
>>> t_doom
('1861-10-23', ['poesia', 'fingir-luta'])
>>> t_doom == dum
True
>>> t_doom is dum
True
So t_doom and dum are equal, but Alice might find it silly to say that, because t_doom and dum refer to the same person: t_doom is dum.
The names t_doom and dum are aliases. I like that the official Python documentation often refers to variables as "names". Variables are names we give to objects. Alternate names are aliases. This helps free our minds from the idea that variables are like boxes. Anyone who thinks of variables as boxes can't make sense of what comes next.
After much practice, T-Doom is now an accomplished rapper. In code, this is what happened:
>>> skills = t_doom[1]
>>> skills.append('rap')
>>> t_doom
('1861-10-23', ['poesia', 'fingir-luta', 'rap'])
>>> dum
('1861-10-23', ['poesia', 'fingir-luta', 'rap'])
T-Doom acquired the rap skill, and so did Tweedledum, obviously, since they are one and the same. If t_doom were a box holding data of type str and list, how could you explain that an append to the list in t_doom also changes the list in the dum box? However, it makes perfect sense if you understand variables as labels.
The label analogy is much better because aliasing is more easily explained as an object with two or more labels attached. In the example, t_doom[1] and skills are two names given to the same list object, just as dum and t_doom are two names given to the same tuple object.
Below is an alternative depiction of the objects that represent Tweedledum. This figure emphasizes the fact that a tuple holds references to objects, rather than the objects themselves.
What is immutable is the physical content of a tuple, which holds only references to objects. The value of the list referenced by dum[1] changed, but the identity of the list referenced by the tuple remains the same. A tuple has no way of preventing changes to the values of its items, which are independent objects and may be reached through references outside of the tuple, like the skills name we used earlier. Lists and other mutable objects inside tuples may change, but their identities will always stay the same.
This highlights the difference between the concepts of identity and value, described in the Python Language Reference, in the Data model chapter:
Every object has an identity, a type and a value. An object's identity never changes once it has been created; you may think of it as the object's address in memory. The is operator compares the identity of two objects; the id() function returns an integer representing its identity.
After dum became a rapper, the twin brothers are no longer equal:
>>> dum == dee
False
Here we have two tuples that were created equal, but now they are different.
The other built-in immutable collection type in Python, frozenset, does not suffer from the problem of being immutable yet potentially changing in value. That's because a frozenset (or a plain set, for that matter) may only hold references to hashable objects (objects that can be used as dictionary keys), and the value of hashable objects, by definition, may never change.
Tuples are commonly used as keys for dict objects, and they must be hashable, just like set elements. So, are tuples hashable or not? The right answer is: some tuples are hashable. The value of a tuple holding a mutable object may change, so such a tuple is not hashable. To be used as a dict key or set element, a tuple must be made only of hashable objects. Our tuples named dum and dee are unhashable because each contains a reference to a list, and lists are unhashable.
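The claim is easy to verify in the interactive interpreter; a minimal check (the tuples mirror the examples above):

```python
# A tuple of immutable items is hashable and works as a dict key.
t_ok = ('1861-10-23', 'poesia')
d = {t_ok: 'Tweedledum'}

# A tuple holding a list is not hashable: hash() raises TypeError.
t_bad = ('1861-10-23', ['poesia', 'fingir-luta'])
try:
    hash(t_bad)
    hashable = True
except TypeError:
    hashable = False

print(d[t_ok], hashable)  # -> Tweedledum False
```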
Now let's focus on the assignment statements that are at the heart of this whole exercise.
Assignment in Python never copies values. It only copies references. So when I wrote skills = t_doom[1] I did not copy the list referenced by t_doom[1], I only copied a reference to it, which I then used to change the list by executing skills.append('rap').
Back at MIT, Prof. Stein spoke about assignment in a very deliberate way. For example, when talking about a seesaw object in a simulation, she would say: "The variable g is assigned to the seesaw", but never "The seesaw is assigned to the variable g". With reference variables, it makes much more sense to say that the variable is assigned to the object, and not the other way around. After all, the object is created before the assignment.
In an assignment such as y = x * 10, the right-hand side is evaluated first. This either creates a new object or retrieves an existing one. Only after the object is constructed or retrieved is the name assigned to it.
Here is proof of that. First we create a Gizmo class, and an instance of it:
>>> class Gizmo:
... def __init__(self):
... print('Gizmo id: %d' % id(self))
...
>>> x = Gizmo()
Gizmo id: 4328764080
Note that the __init__ method displays the id of the object as soon as it is created. This will be important in the next demonstration.
Now let's instantiate another Gizmo and immediately try to perform an operation with it before binding a name to the result:
>>> y = Gizmo() * 10
Gizmo id: 4328764360
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for *: 'Gizmo' and 'int'
>>> 'y' in globals()
False
This snippet shows that the new object was instantiated (its id was 4328764360) but before the name y could be created, a TypeError exception aborted the assignment. The 'y' in globals() check proves there is no global name y.
To wrap up: always read the right-hand side of an assignment first. That's where the object is created or retrieved. After that, the name on the left is bound to the object, like a label stuck onto it. Just forget about that idea of variables as boxes.
As for tuples, make sure they only hold references to immutable objects before trying to use them as keys in a dictionary or items in a set.
This text was originally published in English on the O'Reilly blog. The translation to Portuguese was done by Paulo Henrique Rodrigues Pinheiro. The content is based on chapter 8 of my book Fluent Python. That chapter, titled Object references, mutability and recycling, also covers the semantics of function parameter passing, best practices for handling mutable parameters, shallow copies and deep copies, and the concept of weak references, among other topics. The book focuses on Python 3 but most of its content also applies to Python 2.7, like everything in this text.
Montreal Python User Group
PyCon Startup Row - Registration
Tuesday March 3rd, we're inviting Montreal startups to present to a panel of investors and VCs. Presentations will last 5 minutes, including a demonstration of the product.
There will be startups at various stages of growth, from new startups looking for traction to growing ones.
This is a paid event to cover our costs and to provide appetizers and wine for the networking portion.
Please get your discounted early-bird $8 tickets at https://www.eventbrite.ca/e/mtl-newtech-pycon-edition-tickets-15867698714
For more information about PyCon: https://us.pycon.org/2015/
The first 3 startups to pitch are listed below; others to be announced soon.
- Elysia: matches you to the trip you need for every occasion. http://www.elysia.co/
- Warden: Your online business is important. Make sure it stays secure! https://wardenscanner.com/en/
- Erudite Science: The toolbox for educational game and app developers. http://eruditescience.com/
Agenda:
- 5:45pm Doors open
- 6:15pm Event Presentation. Each startup has 5 minutes to pitch, including a demo. Expect 1 or 2 questions from judges
- 7:30pm End of presentations. Judges deliberate
- 7:45pm Announcement of the Startup selected for Pycon
- 7:45pm Stay for networking!

February 23, 2015
BioPython News
OBF Google Summer of Code 2014 Wrap-up
In 2014, OBF had six students in the Google Summer of Code 2014™ (GSoC) program mentored under its umbrella of Bio* and related open-source bioinformatics community projects: Loris Cro (Bioruby) with mentors Francesco Strozzi and Raoul Bonnal; Evan Parker (Biopython) with mentors Wibowo Arindrarto and Peter Cock; Sarah Berkemer (BioHaskell) with mentors Christian Höner zu Siederdissen and Ketil Malde; and three students contributed to JSBML: Victor Kofia (mentors: Alex Thomas and Sarah Keating), Ibrahim Vazirabad (mentors: Andreas Dräger and Alex Thomas), and Leandro Watanabe (mentors: Nicolas Rodriguez and Chris Myers).
As a change from earlier years in which OBF participated in GSoC as a mentoring organization, in 2014 we purposefully defined our umbrella as much more inclusive of the wider bioinformatics open-source community, bringing it more in line with the annual Bioinformatics Open-Source Conference (BOSC). In part this was also motivated by "paying it forward", a concept central to growing healthy open-source communities, after the larger domain-agnostic language projects such as SciRuby and PSF had extended an open hand to OBF mentors when OBF did not get admitted as a GSoC mentoring organization in 2013. In the end, four of the six successful student applications were for projects outside the traditional core Bio* projects, an outcome in which everyone won: we had a terrific crop of students, our community grew larger and stronger, and open-source bioinformatics was advanced in a more diverse way than would otherwise have been possible.
In addition to our students, huge kudos also go to our mentors (see above), and to Eric Talevich (Biopython) and Raoul Bonnal (Bioruby), who ran our program participation as administrators. They all invested significant amounts of time on behalf of our community and projects. Thank you!
Below follows a short summary of each of the 2014 student projects, starting with the three JSBML students.
JSBML and GSoC 2014
JSBML is an international community-driven, open-source project to develop a Java API library for reading, writing and manipulating SBML, a data format for representing and exchanging computational models in systems biology. SBML has been in use for over a decade but continues to evolve and grow, and hence so does JSBML. JSBML holds two annual development-oriented workshops, and the three 2014 JSBML GSoC students had the opportunity to participate in and present their work at the autumn event, COMBINE (Computational Modeling in Biology Network), which was held in Los Angeles, California, right at the end of GSoC. Furthermore, a scientific publication on a new JSBML release, currently under review at Bioinformatics, highlights some of the work done by the students. Hence, JSBML’s 2014 participation in GSoC was a great success and experience, both for the students as well as the JSBML project and community.
Ibrahim Y. Vazirabad – “Improving the plugin interface for CellDesigner“
CellDesigner is a frequently used program in computational systems biology. It features an easy-to-use GUI, powerful graph editing functions, and a rich simulation functionality, among others. To facilitate rapid prototyping of new algorithms in third-party applications, CellDesigner provides a plug-in interface for Java applications to its robust interface and other features. However, the design and implementation of the plug-in interface made developing software for it very difficult and time consuming. To remedy this, a draft version of a JSBML library had been created to allow developing and testing prospective plug-in modules initially as stand-alone software, which can then be turned into a CellDesigner plug-in with very little effort. The goal of Ibrahim’s project was to improve the interface provided by the library, and importantly, to revise it to support access to one of CellDesigner’s most interesting features, graphical network layout. As a result of Ibrahim’s work, new CellDesigner test cases and plugins that use this interface have already been implemented, including one that converts between CellDesigner’s proprietary data format and the official SBML layout extension.
Leandro H. Watanabe – “Arrays Package“
The arrays and dynamic package extensions to SBML have been proposed to overcome SBML's limitation to static models, which is in contrast to the inherently dynamic nature of many biological systems. The goal of Leandro's project was to implement the arrays package in JSBML. Rather than enabling models with new behaviors to be constructed, the purpose of the arrays package is to represent regular constructs more efficiently and more compactly than SBML core constructs can. To aid the integration of the arrays package into existing tools, Leandro also implemented the option of flattening an arrayed model to use only SBML core constructs, and a validation procedure for array constructs that checks whether a model violates any of the rules imposed on them. As a consequence, his work helped solidify the Arrays Specification document of the SBML standard.
Victor Kofia – “Redesign the implementation of mathematical formulas“
JSBML uses the concept of abstract syntax trees to work with mathematical expressions. For example, the image to the right shows a syntax tree representing the formula k8 · R1. Originally, JSBML implemented different kinds of formula components all in just one complex class with diverse type attributes, which was prone to introducing errors upon code changes and generally made maintenance of the software difficult. Victor implemented a math package for JSBML, in which different kinds of tree nodes that can occur in formulas (e.g., real numbers or algebraic symbols such as ‘plus’ or ‘minus’) are represented with their own, specialized classes. This has made handling of formulas much more straightforward, and also more efficient. In the future, this new representation could even be used for symbolic or numeric calculations.
Evan Parker – “Addition of a lazy loading sequence parser to Biopython’s SeqIO package“
Though Biopython is already equipped with sequence parsers for a wide array of formats, these generally parsed entire records into memory. For large sequences such as entire chromosomes this quickly degrades performance. To allow sequences to be loaded on-demand, Evan designed a general lazy-loading parser by refactoring the existing object model, and then added format-specific modifications to each individual parser. The approach he devised works by pre-indexing the sequence files and then loading only those sequence regions that the user requests. Benchmarking and performance comparisons showed this approach yields significant performance gains when, as is common for genome-scale files, users are interested only in parts of the full sequence. Evan’s code is currently under review by Biopython core developers, and once merged will make parsing large sequences in Biopython much more tractable.
Loris Cro – “An ultra-fast scalable RESTful API to query large numbers of VCF datapoints“
Variant Call Format (VCF) files are commonly generated by genome sequencing projects for sequence variations among different individuals and can get very large. The goal of Loris’ work was to develop code for Bioruby to determine the common variations (i.e., intersections) between multiple individuals and groups of individuals in a fast and scalable way. In the first phase of the project, Loris tested different technologies for storing large VCF files, from which MongoDB emerged as having superior performance. In the second phase Loris developed the code for efficiently storing VCF data into MongoDB, and then implemented algorithms for performing the intersection queries (see Github repo and Loris’ project blog). The code was developed using JRuby and uses the HTS-JDK library to parse the VCF data. In the course of the project, Loris also provided valuable feedback to the HTS-JDK team that led to improvements of the VCF parser and data model. The result of Loris’ GSoC work is now available to the community as a Ruby Gem, which has been tested and used already in large international genome re-sequencing projects, including Gene2Farm and WHEALBI.
Sarah Berkemer – “Open source high-performance BioHaskell“
One of the challenges with sequence alignments for the purposes of sequence similarity searches is that for most known genes (i.e., sequences) relatively little is known about their biology, and the few for which a lot is known therefore tend to be only remotely related to a query sequence. Transitive alignments try to ameliorate this by aligning the query sequence against a large body of known but not deeply understood sequences, the intermediate set, which in turn are then aligned against the core of well-understood sequences. However, in contrast to aligning two sequences, aligning a sequence via a vast intermediate data set to a smaller core set is slow and memory-consuming. As part of her GSoC project, Sarah dug deep into the structure of the algorithm, and rewrote core parts to make use of fusing data structures and efficient tree-like data structures (see her project blog). Her work brought down the runtime for a benchmark by a factor of 3, from 31 to 11 minutes, and, arguably even more important, reduced memory consumption from 53 to 22 gigabytes. This now allows running the program on consumer-grade high-memory PCs. With Sarah having finished her Masters degree (congrats!!) in the meantime, she and her mentors are now in the process of writing a scientific application note and are planning to make the program available as an online web-service.
As a rather small family within the much larger OBF umbrella, the chance to have a student contribute to functional programming for computational biology has been a tremendous opportunity and learning experience for the Biohaskell community as well.
Ionel Cristian
The problem with packaging in Python
Packaging is currently too hard in Python, and while there's effort to improve it, it's still largely focused on the problem of installing. The current approach is to just throw docs and specs at the building part: [2]
Let's make docs! It must be poorly documented if no one understands it.
Why do we need a damn mountain of docs? Because when building a distribution the user experience is like this:
Thanks for asking Mr. Clippy, I'd like to package code without going mad.
There are so many things going on in setup.py:
- Do you use py_modules or packages?
- Do you hardcode the lists for py_modules or packages? Do you use setuptools.find_packages? What are the right arguments?
- What about package_dir?
- Do you want to distribute files that aren't code? Tough luck: more buttons!
- Do you use a MANIFEST?
- Or maybe a MANIFEST.in is better? What the hell do I put in there? There's include, recursive-include, global-include, graft. Where do I need exclude, recursive-exclude, global-exclude or prune?
- How about data_files or package_data?
- What about include_package_data?
No one is going to read the list above, let alone understand what everything means!
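For a concrete taste of the problem, a fairly ordinary MANIFEST.in ends up mixing several of those directives (the paths here are illustrative, not from any particular project):

```
include README.rst LICENSE
recursive-include docs *.rst
graft examples
global-exclude *.pyc
prune build
```

Five directives, four different verbs, just to say "ship the docs and examples, skip the junk".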
We don't need a goddamn mountain of docs, we need something that's so simple even a monkey could publish a decent distribution on PyPI. But that means cutting down features ...
The perspective problem*
There are lots of improvements made in PEP-376, PEP-345, PEP-425, PEP-427 and PEP-426, but they are all improvements that allow tools like pip to work better. They still don't make my life easier, as a packager - the user of setuptools or distutils.
Don't get me wrong, it's good that we got those, but I think there should be some focus on making a simpler packaging tool: an alternative to setuptools/distutils that has fewer features and more constraints, but is way easier to use. Sure, anyone can try to make something like that, but if it's not officially sanctioned it's going to have very limited success.
It has been tried before*
There have been attempts to replace the unholy duo [1] we have now but alas, the focus was wrong. There have been two directions of improvement:
- Internals: better architecture to make the build tools more maintainable/extensible/whatever. Distutils2 was the champion of this approach.
- Metadata as configuration: the "avoid code" mantra. Move the metadata into a configuration file, and avoid the crazy problems that usually happen when you let users put code in setup.py. Distutils2 championed this idea and it lives on today through d2to1.
However, the way code and data files are collected didn't change. As a packager, you still have to deal with the dozen confusing buttons. [3]
d2to1 is not better in this regard. In fact, it's worse because you have to hardcode metadata and there's no automatic discovery for whatever you're trying to package. [4]
The current course*
PEP-426 will open up possibilities of custom build systems, something else than setuptools, that could hypothetically solve all sorts of niche problems like C extensions with unusual dependencies.
What I dream of*
What if there were a build system just for pure-Python distributions (and maybe some C extensions with no dependencies)? Something that has some strong conventions: code in this place, docs in that place, no exceptions. Something like cargo has. Maybe with a nice project scaffold generator.
Of course, anyone can say: PEP-426 lets you build whatever you want, just do it! However, to make something really simple to use some conventions need to be broken, and if you want to convert your project some effort would be needed. You see, if it's not officially sanctioned it's not going to pick up. Death by lack of interest.
And if it doesn't pick up, then the vast majority of packagers are going to stick with the complicated setup.py we have now.
In a way, packaging in Python is a victim of bad habits, complex conventions, and feature bloat. It's hard to simplify things because of all the historical baggage people want to carry around. But if there's some official sanctioning, then it's easier to accept the hard changes.
Concretely what I want is along these lines:
- Get rid of py_modules, packages and package_dir. Just discover automatically whatever you have in a src dir.
- Get rid of MANIFEST, MANIFEST.in and the baffling trio of package_data, data_files and include_package_data. Just take all the files that are inside packages. Use .gitignore to exclude files.
- Have a single way to store and retrieve metadata like the version in your code. Not a handful of ways.
In other words, one way to do it. Not one clear way, because we document the hell out of it, but one, and only one, way to do it. What do you think, could it work? Would it improve anything?
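As a sanity check that the "just discover it" part is already feasible, here is a sketch using setuptools' find_packages pointed at a throwaway src/ layout (assumes setuptools is installed; the package names are illustrative):

```python
import os
import tempfile
from setuptools import find_packages

# Build a throwaway src layout: src/mypkg/__init__.py and src/mypkg/sub/__init__.py
root = tempfile.mkdtemp()
for pkg in ('mypkg', os.path.join('mypkg', 'sub')):
    os.makedirs(os.path.join(root, 'src', pkg))
    open(os.path.join(root, 'src', pkg, '__init__.py'), 'w').close()

# find_packages() discovers packages instead of hardcoding py_modules/packages.
found = sorted(find_packages(os.path.join(root, 'src')))
print(found)  # -> ['mypkg', 'mypkg.sub']
```

The discovery piece exists today; the complaint is that nothing makes it the default, convention-enforced path.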
| [1] | Distutils and setuptools: the confusing system everyone loves to hate. |
| [2] | There are a ton of places where you can find information about packaging, of various quality and freshness. At least now there's sanctioned place to go to: https://packaging.python.org/en/latest/distributing.html Still, there's so much to read. What if there wouldn't be a need to know so much to package stuff? |
| [3] | Does this look familiar? It has mostly the same options as distutils's setup. Too many options. Still lots of trial and error to make a distribution. |
| [4] | Hardcoding information that you already have in the filesystem is a sure way to make mistakes. More about this: Python packaging pitfalls. |
PyPy Development
linalg support in pypy/numpy
Introduction
PyPy's numpy support has matured enough that it can now support the lapack/blas libraries through the numpy.linalg module. To install the version of numpy this blog post refers to, install PyPy version 2.5.0 or newer, and run this:
pypy -m pip install git+https://bitbucket.org/pypy/numpy.git
This update is a major step forward for PyPy's numpy support. Many of the basic matrix operations depend on linalg, even matplotlib requires it to display legends (a pypy-friendly version of matplotlib 1.3 is available at https://github.com/mattip/matplotlib).
A number of improvements and adaptations, some of which are in the newly-released PyPy 2.5.0, made this possible:
- Support for an extended frompyfunc(), which in the PyPy version supports much of the ufunc API (signatures, multiple dtypes) allowing creation of pure-python, jit-friendly ufuncs. An additional keyword allows choosing between out = func(in) or func(in, out) ufunc signatures. More explanation follows.
- Support for GenericUfuncs via PyPy's (slow) capi-compatibility layer. The underlying mechanism actually calls the internal implementation of frompyfunc().
- A cffi version of _umath_linalg. Since cffi uses dlopen() to call into shared objects, we added support in the numpy build system to create non-python shared libraries from source code in the numpy tree. We also rewrote parts of the c-based _umath_linalg.c.src in python, renamed numpy's umath_linalg capi module to umath_linalg_capi, and use it as a shared object through cffi.
Status
We have not completely implemented all the linalg features. dtype resolution via casting is missing, especially for complex ndarrays, which leads to slight numerical errors where numpy uses a more precise type for intermediate calculations. Other missing features in PyPy's numpy support may have implications for complete linalg support.
Some OSX users have noticed they need to update pip to version 6.0.8 to overcome a regression in pip, and it is not clear whether we support all combinations of blas/lapack implementations on all platforms.
Over the next few weeks we will be ironing out these issues.
Performance
A simple benchmark is shown below, but let's state the obvious: PyPy's JIT and the iterators built into PyPy's ndarray implementation will in most cases be no faster than CPython's numpy. The JIT can help where there is a mixture of python and numpy-array code. We do have plans to implement lazy evaluation and to further optimize PyPy's support for numeric python, but numpy is quite good at what it does.
HowTo for PyPy's extended frompyfunc
The magic enabling blas support is a rewrite of the _umath_linalg c-based module as a cffi-python module that creates ufuncs via frompyfunc. We extended the numpy frompyfunc to allow it to function as a replacement for the generic ufunc available in numpy only through the c-api. We start with the basic frompyfunc, which wraps a python function into a ufunc:
from numpy import frompyfunc

def times2(in0):
    return in0 * 2

ufunc = frompyfunc(times2, 1, 1)
In cpython's numpy the dtype of the result is always object, which is not implemented (yet) in PyPy, so this example will fail. While the utility of object dtypes can be debated, in the meantime we add a non-numpy-compatible keyword argument dtypes to frompyfunc. If dtypes=['match'], the output dtype will match the dtype of the first input ndarray:
ufunc = frompyfunc(times2, 1, 1, dtypes=['match'])
ai = arange(24).reshape(3, 4, 2)
ao = ufunc(ai)
assert (ao == ai * 2).all()
I hear you ask "why is the dtypes keyword argument a list?" This is so we can support the Generalized Universal Function API, which allows specifying a number of specialized functions and the input-output dtypes each specialized function accepts.
Note that the values of ai are fed to the function one at a time; the function operates on scalar values. To support more complicated ufunc calls, the generalized ufunc API allows defining a signature, which specifies the layout of the ndarray inputs and outputs. So we extended frompyfunc with a signature keyword as well.
We add one further extension to frompyfunc: we allow a Boolean keyword stack_inputs to specify the argument layout of the function itself. If the function is of the form:
out0, out1, ... = func(in0, in1,...)
then stack_inputs is False. If it is True the function is of the form:
func(in0, in1, ... out0, out1, ...)
Here is a complete example of using frompyfunc to create a ufunc, based on this link:
from numpy import frompyfunc, arange, dtype

def times2(in_array, out_array):
    in_flat = in_array.flat
    out_flat = out_array.flat
    for i in range(in_array.size):
        out_flat[i] = in_flat[i] * 2

ufunc = frompyfunc([times2, times2], 1, 1,
                   signature='(i)->(i)',
                   dtypes=[dtype(int), dtype(int),
                           dtype(float), dtype(float),
                           ],
                   stack_inputs=True,
                   )
ai = arange(10, dtype=int)
ai2 = ufunc(ai)
assert all(ai2 == ai * 2)
Using this extended syntax, we rewrote the lapack calls into the blas functions in pure python, no c needed. However, this approach was actually much slower than using the upstream umath_linalg module via cpyext, as can be seen in the following benchmarks. This is due to the need to copy c-aligned data into Fortran-aligned format, and our __getitem__ and __setitem__ iterators are not as fast as pointer arithmetic in C. So we next tried a hybrid approach: compile and use numpy's umath_linalg python module as a shared object, and call the optimized specific wrapper function from it.
Benchmarks
Here are some benchmarks, running a tight loop of the different versions of linalg.inv(a), where a is a 10x10 double ndarray. The benchmark ran on an i7 processor running ubuntu 14.04 64 bit:
| Impl. | Time after warmup |
|---|---|
| CPython 2.7 + numpy 1.10.dev + lapack | 8.9 msec/1000 loops |
| PyPy 2.5.0 + numpy + lapack via cpyext | 8.6 msec/1000 loops |
| PyPy 2.5.0 + numpy + lapack via pure python + cffi | 19.9 msec/1000 loops |
| PyPy 2.5.0 + numpy + lapack via python + c + cffi | 9.5 msec/1000 loops |
While no general conclusions may be drawn from a single micro-benchmark, it does indicate that there is some merit in the approach taken.
Conclusion
PyPy's numpy now includes a working linalg module. There are still some rough corners, but hopefully we have implemented the parts you need. While the speed of the isolated linalg function is no faster than CPython and upstream numpy, it should not be significantly slower either. Your use case may see an improvement if you use a mix of python and lapack, which is the usual case. Please let us know how it goes. We love to hear success stories too.
We still have challenges at all levels of programming, and are always looking for people willing to contribute, so stop by on IRC at #pypy.
mattip and the PyPy Team
Nicola Iarocci
Eve 0.5.2 ‘Giulia’ is Out
Eve 0.5.2 has just been released with a bunch of interesting fixes and documentation updates. See the changelog for details.
Mike Driscoll
PyDev of the Week: Maciej Fijalkowski
This week we welcome Maciej Fijalkowski (@fijall) as our PyDev of the Week. He is a freelance programmer who spends a lot of time working on the PyPy project. I would recommend checking out some of his work on github. Let’s spend some time learning about our fellow Pythonista!
Can you tell us a little about yourself (hobbies, education, etc):
Originally from Poland, I am partly nomadic, having a semi-permanent base in Cape Town, South Africa. I got lured here by climbing, good weather and majestic landscapes, and later discovered surfing. Otherwise I can be found in various places in Europe and the US, especially Boulder, CO. I have been doing PyPy for about 8 years now (I don't know, I've lost track a bit), sometimes in my free time, sometimes as a job. These days I'm doing some consulting for both PyPy and other stuff, trying to build my own company, baroquesoftware.com.
Why did you start using Python?
I think it was the early 2000s. I was using Perl and C++ at the time, and a friend of mine was fighting with some programming assignments at the physics department. Doing a quick survey, I found out that Python seemed to be the language of choice for "beginners". After teaching it to myself and her, I discovered that Python is an actual language suitable not just for beginners. And this is how it started.
What other programming languages do you know and which is your favorite?
Due to the nature of my work, I am proficient in C, assembler (x86 and ARM), C++, Python, RPython. I can also read/write Java, Ruby, PHP, OCaml, Prolog and a bunch of others I don't quite remember. I can never make a project in JavaScript that does not turn out to be a major mess. As for the favorite, is that a trick question? Unsurprisingly, I code mostly in Python, but a lot of my work is done in RPython, which is a static subset of Python that we use for PyPy. While I think RPython suits its niche very well, I would not recommend it as a general-purpose language, so I suppose Python stays at the top for me. I actually have various ideas on how to create a language/ecosystem that would address a lot of Python's shortcomings, if I ever have time.
What projects are you working on now?
Mostly PyPy, but more specifically:
- improving PyPy warmup time and memory consumption
- numpy
- helping people with various stuff, e.g. IO performance improvements, profiling etc.
Also I’m the main contributor to hippyvm.
Which Python libraries are your favorite (core or 3rd party)?
I think the one I use the most is py.test. By now it's an absolutely essential part of what we're doing. As for the favorite one, I might be a bit biased since I participated in the design, but I really like cffi. According to PyPI it gets like half a million downloads a month, so it can't just be me.
Thanks so much!
The Last 10 PyDevs of the Week
Ian Ozsvald
Starting Spark 1.2 and PySpark (and ElasticSearch and PyPy)
The latest PySpark (1.2) is feeling genuinely useful. Late last year I had a crack at running Apache Spark 1.0 and PySpark and it felt a bit underwhelming (too much fanfare, too many bugs). The media around Spark continues to grow; e.g. today's hackernews thread on the new DataFrame API has a lot of positive discussion, and the lazily evaluated pandas-like dataframes built from a wide variety of data sources feel very powerful. Continuum have also just announced PySpark+GlusterFS.
One surprising fact is that Spark is Python 2.7 only at present, feature request 4897 is for Python 3 support (go vote!) which requires some cloud pickling to be fixed. Using the end-of-line Python release feels a bit daft. I’m using Linux Mint 17.1 which is based on Ubuntu 14.04 64bit. I’m using the pre-built spark-1.2.0-bin-hadoop2.4.tgz via their downloads page and ‘it just works’. Using my global Python 2.7.6 and additional IPython install (via apt-get):
spark-1.2.0-bin-hadoop2.4 $ IPYTHON=1 bin/pyspark
...
IPython 1.2.1 -- An enhanced Interactive Python.
...
Welcome to Spark version 1.2.0
Using Python version 2.7.6 (default, Mar 22 2014 22:59:56)
SparkContext available as sc.
>>>
Note the IPYTHON=1; without it you get a vanilla Python shell, and with it IPython is used if it is on the search path. IPython lets you interactively explore the "sc" Spark context using tab completion, which really helps at the start. To run one of the included demos (e.g. wordcount) you can use the spark-submit script:
spark-1.2.0-bin-hadoop2.4/examples/src/main/python $ ../../../../bin/spark-submit wordcount.py kmeans.py # count words in kmeans.py
For my use case we were initially after sparse matrix support; sadly sparse matrices are only available for Scala/Java at present. By stepping back from my sklearn/scipy sparse solution for a minute and thinking a little more map/reduce, I could just as easily split the problem into a number of counts, and that parallelises very well in Spark (though I'd love to see sparse matrices in PySpark!).
I'm doing this with my contract-recruitment client via my ModelInsight as we automate recruitment; there's a press release out today outlining a bit of what we do. One of the goals is to move to a more unified research+deployment approach: rather than having lots of tooling in R&D which we then streamline for production, we hope to share similar tooling between R&D and production so that deployment and different scales of data are 'easier'.
I tried the latest PyPy 2.5 (running Python 2.7) and it ran PySpark just fine. Using PyPy 2.5 a prime-search example takes 6s vs 39s with vanilla Python 2.7, so in-memory processing using RDDs rather than numpy objects might be quick and convenient (has anyone trialled this?). To run using PyPy set PYSPARK_PYTHON:
$ PYSPARK_PYTHON=~/pypy-2.5.0-linux64/bin/pypy ./pyspark
I'm used to working with Anaconda environments and for Spark I've set up a Python 2.7.8 environment ("conda create -n spark27 anaconda python=2.7") & IPython 2.2.0. Whichever Python is in the search path or is specified at the command line is used by the pyspark script.
The next challenge to solve was integration with ElasticSearch for storing outputs. The official docs are a little tough to read as a non-Java/non-Hadoop programmer and they don’t mention PySpark integration, thankfully there’s a lovely 4-part blog sequence which “just works”:
- ElasticSearch and Python (no Spark but it sets the groundwork)
- Reading & Writing ElasticSearch using PySpark
- Sparse Matrix Multiplication using PySpark
- Dense Matrix Multiplication using PySpark
To summarise the above with a trivial example, to output to ElasticSearch using a trivial local dictionary and no other data dependencies:
$ wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/2.1.0.Beta2/elasticsearch-hadoop-2.1.0.Beta2.jar
$ ~/spark-1.2.0-bin-hadoop2.4/bin/pyspark --jars elasticsearch-hadoop-2.1.0.Beta2.jar
>>> res=sc.parallelize([1,2,3,4])
>>> res2=res.map(lambda x: ('key', {'name': str(x), 'sim':0.22}))
>>> res2.collect()
[('key', {'name': '1', 'sim': 0.22}),
('key', {'name': '2', 'sim': 0.22}),
('key', {'name': '3', 'sim': 0.22}),
('key', {'name': '4', 'sim': 0.22})]
>>> res2.saveAsNewAPIHadoopFile(path='-',
outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf={"es.resource": "myindex/mytype"})
The above creates a list of 4 dictionaries and then sends them to a local ES store using “myindex” and “mytype” for each new document. Before I found the above I used this older solution which also worked just fine.
Running the local interactive session using a mock cluster was pretty easy. The docs for spark-standalone are a good start:
sbin $ ./start-master.sh
# the log (full path is reported by the script so you could `tail -f` it) shows
# 15/02/17 14:11:46 INFO Master: Starting Spark master at spark://ian-Latitude-E6420:7077
# which gives the link to the browser view of the master machine, which is
# probably on :8080 (as shown here http://www.mccarroll.net/blog/pyspark/).
#Next start a single worker:
sbin $ ./start-slave.sh 0 spark://ian-Latitude-E6420:7077
# and the logs will show a link to another web page for each worker
# (probably starting at :4040).
#Next you can start a pySpark IPython shell for local experimentation:
$ IPYTHON=1 ~/data/libraries/spark-1.2.0-bin-hadoop2.4/bin/pyspark --master spark://ian-Latitude-E6420:7077
# (and similarly you could run a spark-shell to do the same with Scala)
# Or we can run their demo code using the master node you've configured:
$ ~/spark-1.2.0-bin-hadoop2.4/bin/spark-submit --master spark://ian-Latitude-E6420:7077 ~/spark-1.2.0-bin-hadoop2.4/examples/src/main/python/wordcount.py README.txt
Note: if you tried to run the above spark-submit (which specifies the --master to connect to) and you didn't have a master node, you'd see log messages like:
15/02/17 14:14:25 INFO AppClient$ClientActor: Connecting to master spark://ian-Latitude-E6420:7077...
15/02/17 14:14:25 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@ian-Latitude-E6420:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@ian-Latitude-E6420:7077
15/02/17 14:14:25 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@ian-Latitude-E6420:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: ian-Latitude-E6420/127.0.1.1:7077
If you had a master node running but you hadn’t setup a worker node then after doing the spark-submit it’ll hang for 5+ seconds and then start to report:
15/02/17 14:16:16 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
and if you google that without thinking about the worker node then you’d come to this diagnostic page which leads down a small rabbit hole…
Stuff I’d like to know:
- How do I read easily from MongoDB using an RDD (in Hadoop format) in PySpark (do you have a link to an example?)
- Milos notes a Python Blaze solution – does this distribute to many nodes?
- Who else in London is using (Py)Spark? Maybe catch-up over a coffee?
Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.
PyCon
Signup for PyCon Dinners led by Jessica McKellar and Brandon Rhodes!
While the cost of PyCon includes breakfast and lunch as well as coffee and snacks, dinner is on your own, and for good reason. It's Montréal! Get out and enjoy the city, find some good food and drink, and hang out with new groups of people.
To make it even easier, this year we've organized another series of PyCon Dinners, one led by Jessica McKellar and one by Brandon Rhodes. These events, taking place Friday, April 10 at 6 PM, are a great way to wrap up the first day of PyCon with a three-course meal among new and old friends. As 60% of attendees surveyed last year said it was their first PyCon, these dinners are a great way to kick off the weekend, make new connections, and set up plans for more dinners or other late-night festivities.
Jessica is a director of the Python Software Foundation and has been instrumental in outreach efforts around the Python community, especially when it comes to PyCon. She's also a contributor to Twisted and has worked a lot with the OpenHatch project. She's a very experienced speaker with a ton of knowledge and information to share, and will make an excellent host for an excellent meal.
Brandon is a returning veteran of running a PyCon Dinner, having run last year's as a Python trivia game. He's also an experienced speaker on the Python conference circuit, and after co-chairing this year, he will chair PyCon 2016 and 2017 when we head to Portland, Oregon.
Tickets are required for either dinner, with the meal price subsidized by the PSF for a cost of $45. Each prix fixe meal includes a delicious starter, main course, and dessert, with options available for dietary needs.
Check out the options on https://us.pycon.org/2015/events/dinners/ and sign up today! You can add a dinner ticket to your existing registration at https://us.pycon.org/2015/registration/.
If you don't have tickets to PyCon yet, hurry up because they are selling out very very soon.
Django Weblog
Django sprint in Amsterdam, The Netherlands
We're very happy to announce that a two-day Django sprint will take place on March 7-8 in Amsterdam, Netherlands. This event is organized by the Dutch Django Association.
The venue is the office of DashCare just outside the center of Amsterdam. The sprint will start on Saturday, March 7th at 9:30 CET and finish on Sunday, March 8th around 22:00 CET.
With the help of the Dutch Django Association and Divio we will have four core developers on site: Baptiste Mispelon, Markus Holtermann, Daniele Procida and Erik Romijn. Daniele will also be doing his famed “Don’t be afraid to commit” workshop, which will take people new to contributing to Django through the entire contribution process with real tickets. So please don’t hesitate to join even if you’ve never contributed to Django before.
If you'd like to join, please sign up on the meetup page. If you’re unable to come to Amsterdam, you're welcome to contribute to the sprint online. Sprinters and core developers will be available in the #django-sprint IRC channel on FreeNode.
We hope you can join us and help make the sprint as successful as possible!
Omaha Python Users Group
February Meeting Notes
Here are links to a few of the topics at this month’s meeting:
FuzzyWuzzy: String matching in Python
Plumbum: Shell Combinators and More
Django: The web framework for perfectionists with deadlines.
February 22, 2015
Daniel Greenfeld
Setting up LaTeX on Mac OS X
These are my notes for getting LaTeX running on Mac OS X with the components and fonts I want, which is handy when you want to generate PDFs from Sphinx. At some point I want to replace this with a Docker container similar to https://github.com/blang/latex-docker, albeit with the components in parts 3 and 4 below.
Get mactex-basic.pkg from http://www.ctan.org/pkg/mactex-basic
Click mactex-basic.pkg to install LaTeX.
Update tlmgr:
sudo tlmgr update --self
Install the following tools via tlmgr:
sudo tlmgr install titlesec
sudo tlmgr install framed
sudo tlmgr install threeparttable
sudo tlmgr install wrapfig
sudo tlmgr install multirow
sudo tlmgr install enumitem
sudo tlmgr install bbding
sudo tlmgr install titling
sudo tlmgr install tabu
sudo tlmgr install mdframed
sudo tlmgr install tcolorbox
sudo tlmgr install textpos
sudo tlmgr install import
sudo tlmgr install varwidth
sudo tlmgr install needspace
sudo tlmgr install tocloft
sudo tlmgr install ntheorem
sudo tlmgr install environ
sudo tlmgr install trimspaces
Install fonts via tlmgr:
sudo tlmgr install collection-fontsrecommended
note: Yes, I know I can install the basic LaTeX package using Homebrew, but sometimes I like doing things manually.
Al-Ahmadgaid Asaad
Python: Getting Started with Data Analysis
- Importing the data
- Importing CSV file both locally and from the web;
- Data transformation;
- Descriptive statistics of the data;
- Hypothesis testing
- One-sample t test;
- Visualization; and
- Creating custom function.
Importing the data
This is the crucial step: we need to import the data in order to proceed with the subsequent analysis. Oftentimes data are in CSV format or can at least be converted to it. In Python we can do this using the following code. To read a CSV file locally, we need the pandas module, which is a python data analysis library. The read_csv function can read data both locally and from the web.
Data transformation
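The import step just described might be sketched as follows, using an in-memory stand-in for the CSV (the post's actual file and data aren't reproduced here; read_csv accepts a local path, a URL, or any file-like object):

```python
import pandas as pd
from io import StringIO

# Stand-in data using the column names mentioned later in the post;
# for a local file this would simply be pd.read_csv('data.csv').
csv_text = """Abra,Apayao,Benguet
1243,2934,148
4158,9235,4287
1787,1922,1955
17152,14501,3536
1266,2385,2530
5576,7452,771
"""
df = pd.read_csv(StringIO(csv_text))
print(df.head())     # first 5 rows by default (R's head() shows 6)
print(df.tail(n=2))  # last 2 rows; R's head(df, n = 10) becomes df.head(n = 10)
```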
Now that we have the data in the workspace, the next step is transformation. Statisticians and scientists often do this step to remove unnecessary data not included in the analysis. Let's view the data first. To R programmers, this is the equivalent of print(head(df)), which prints the first six rows of the data, and print(tail(df)), the last six rows. In Python, however, head shows 5 rows by default, unlike R's 6, so the equivalent of the R code head(df, n = 10) is df.head(n = 10). The same goes for the tail of the data. Column and row names of the data are extracted using the
colnames and rownames functions in R, respectively. In Python, we extract them using the columns and index attributes, that is, df.columns and df.index. Transposing the data is obtained using the T attribute (df.T). Other transformations such as sorting can be done using the
sort attribute. Now let's extract a specific column. In Python, we do it using either the iloc or the ix attribute, but ix is more robust and thus I prefer it. Assuming we want the head of the first column of the data, we have print df.ix[:, 0].head(). By the way, indexing in Python starts with 0, not 1. To slice the index and the first three columns of the 11th to 21st rows, run the following:
print df.ix[10:20, 0:3]
Which is equivalent to
print df.ix[10:20, ['Abra', 'Apayao', 'Benguet']]
To drop a column in the data, say columns 1 (Apayao) and 2 (Benguet), use the drop attribute. The axis argument tells the function to drop with respect to columns; if axis = 0, it drops with respect to rows instead.
Descriptive Statistics
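Before computing summaries, here is a sketch of the drop call just described, together with the describe step that follows; the frame below is a small stand-in, not the post's real data:

```python
import pandas as pd

# Stand-in frame with the column names used in the post.
df = pd.DataFrame({'Abra':    [1243, 4158, 1787],
                   'Apayao':  [2934, 9235, 1922],
                   'Benguet': [148, 4287, 1955]})

# Drop columns 1 (Apayao) and 2 (Benguet); axis = 1 means columns,
# axis = 0 would drop rows instead.
print(df.drop(df.columns[[1, 2]], axis=1).head())

# Summary statistics: count, mean, std, min, quartiles, max.
print(df.describe())
```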
The next step is to compute descriptive statistics for a preliminary analysis of our data, using the describe attribute:
Hypothesis Testing
Python has a great package for statistical inference: the stats library of scipy. The one-sample t-test is implemented in the ttest_1samp function. So, if we want to test the mean of Abra's volume of palay production against the null hypothesis of an assumed population mean of 15000, we have the following. The values returned are a tuple of:
- t : float or array
  t-statistic
- prob : float or array
  two-tailed p-value
The first array returned is the t-statistic of the data, and the second array is the corresponding p-values.
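The ttest_1samp call described above might look like this; the sample values below are stand-ins, not the post's actual Abra data, and 15000 is the assumed population mean:

```python
import numpy as np
from scipy import stats

# Stand-in sample for Abra's palay production volumes.
abra = np.array([12345, 15678, 14321, 16789, 13456, 15987, 14567])

# One-sample t-test against the null hypothesis mean of 15000.
t_stat, p_value = stats.ttest_1samp(abra, popmean=15000)
print(t_stat, p_value)  # t-statistic and two-tailed p-value
```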
Visualization
There are several modules for visualization in Python; the most popular one is the matplotlib library. To mention a few, we also have the bokeh and seaborn modules to choose from. In my previous post, I demonstrated the matplotlib package's box-whisker plot.
Creating custom function
To define a custom function in Python, we use the def keyword. For example, say we define a function that will add two numbers; we do it as follows:
def add_two_numbers(x, y):
    return x + y
By the way, in Python indentation is important: indentation defines the scope of the function, which in R is done with braces {...}. Now here's an algorithm from my previous post:
- Generate samples of size 10 from a Normal distribution with $\mu$ = 3 and $\sigma^2$ = 5;
- Compute the $\bar{x}$ and $\bar{x}\mp z_{\alpha/2}\displaystyle\frac{\sigma}{\sqrt{n}}$ using the 95% confidence level;
- Repeat the process 100 times; then
- Compute the percentage of the confidence intervals containing the true mean.
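The original post showed its code as images; a straightforward loop-based sketch of the four steps above might be:

```python
import numpy as np

np.random.seed(123)                    # for reproducibility
mu, sigma, n, reps = 3.0, np.sqrt(5.0), 10, 100
z = 1.96                               # z_{alpha/2} at the 95% level
hits = 0
for _ in range(reps):
    sample = np.random.normal(mu, sigma, n)
    xbar = sample.mean()
    half = z * sigma / np.sqrt(n)      # half-width: z * sigma / sqrt(n)
    if xbar - half <= mu <= xbar + half:
        hits += 1
print(100.0 * hits / reps)             # percentage of intervals covering the true mean
```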
The code above might be easy to read, but it's slow for many replications. An improved, vectorized version exists thanks to Python gurus; see the comments on my previous post.
Update
For those who are interested in the ipython notebook of this article, please click here. This article was converted to an ipython notebook by Nuttens Claude.
Vasudev Ram
Excel to PDF with xlwings and xtopdf
By Vasudev Ram
Excel to PDF with xlwings and xtopdf - how many x in that? :)
I came across xlwings recently via the Net.
xlwings is by Zoomer Analytics, a startup based in Zürich, Switzerland, by a team with background in financial institutions.
Excerpt from the xlwings documentation:
[ xlwings is a BSD-licensed Python library that makes it easy to call Python from Excel and vice versa:
Interact with Excel from Python using a syntax that is close to VBA yet Pythonic.
Replace your VBA macros with Python code and still pass around your workbooks as easily as before.
xlwings fully supports NumPy arrays and Pandas DataFrames. It works with Microsoft Excel on Windows and Mac. ]
I checked out the xlwings quickstart.
Then did a quick test of using xlwings with xtopdf, my toolkit for PDF creation, to create a simple Excel spreadsheet, then read back its contents, and convert that to PDF.
Here is the code:
"""I ran it with this command:
xlwingsToPDF.py
A demo program to show how to convert the text extracted from Excel
content, using xlwings, to PDF. It uses the xlwings library, to create
and read the Excel input, and the xtopdf library to write the PDF output.
Author: Vasudev Ram - http://www.dancingbison.com
Copyright 2015 Vasudev Ram
"""
import sys
from xlwings import Workbook, Sheet, Range, Chart
from PDFWriter import PDFWriter
# Create a connection with a new workbook.
wb = Workbook()
# Create the Excel data.
# Column 1.
Range('A1').value = 'Foo 1'
Range('A2').value = 'Foo 2'
Range('A3').value = 'Foo 3'
# Column 2.
Range('B1').value = 'Bar 1'
Range('B2').value = 'Bar 2'
Range('B3').value = 'Bar 3'
pw = PDFWriter("xlwingsTo.pdf")
pw.setFont("Courier", 10)
pw.setHeader("Testing Excel conversion to PDF with xlwings and xtopdf")
pw.setFooter("xlwings: http://xlwings.org --- xtopdf: http://slid.es/vasudevram/xtopdf")
for row in Range('A1..B3').value:
s = ''
for col in row:
s += col + ' | '
pw.writeLine(s)
pw.close()
I ran it with this command:
py xlwingsToPDF.py
and here is a screenshot of the output PDF file:
Note: The xlwings library can be installed with:
pip install xlwings
But a prerequisite for it, pywin32, did not install automatically. pywin32 is a very useful and powerful Windows API wrapper library for Python, by Mark Hammond. I've used it a few times in Python versions earlier than Python 2.7.8, which I am currently using. I usually installed it directly in those earlier versions. This time, though it was a dependency for xlwings, it did not get installed automatically, and the above Python program gave a runtime error. I had to manually install pywin32 before the program could work.
- Enjoy.
- Vasudev Ram - Dancing Bison Enterprises
Python Software Foundation
PSF Community Service Award goes to Django Girls
Enroll as PSF Voting Member
Addendum: Just to clarify, if you are already a voting member (e.g., as a PSF Fellow), there is no need to do anything more. This new form is for Basic Members who do not as yet have voting rights but who qualify according to the criteria.
February 21, 2015
Andrzej Skupień
How to use PIPE in python subprocess.Popen objects
This is something that I always have to check, so today I'm writing it down.
Documentation to subprocess.Popen is here.
Pipes are useful when you want to do something with the output of a command run by Popen. What you might want to do with the output:
- pass it to another bash command
- use it inside Python script
Pass output to another bash command
Say you want to pass the output of one bash command to another. This is the equivalent of this code in bash:
$ ls /etc | grep ntp
ntp-restrict.conf
ntp.conf
ntp_opendirectory.conf
In Python you do that like this:
import subprocess

ls = subprocess.Popen('ls /etc'.split(), stdout=subprocess.PIPE)
grep = subprocess.Popen('grep ntp'.split(), stdin=ls.stdout, stdout=subprocess.PIPE)
output = grep.communicate()[0]
Do it in the proper way
Call ls.stdout.close() before grep.communicate() so that if grep dies prematurely, ls receives SIGPIPE and can exit sooner. And add ls.wait() at the end to avoid creating a zombie:
ls = subprocess.Popen('ls /etc'.split(), stdout=subprocess.PIPE)
grep = subprocess.Popen('grep ntp'.split(), stdin=ls.stdout, stdout=subprocess.PIPE)
ls.stdout.close()
output = grep.communicate()[0]
ls.wait()
grep.stdin.close() is called by grep.communicate().
Another way to write it:
from subprocess import Popen, PIPE

grep = Popen('grep ntp'.split(), stdin=PIPE, stdout=PIPE)
ls = Popen('ls /etc'.split(), stdout=grep.stdin)
output = grep.communicate()[0]
ls.wait()
As you can see, the declaration order of the commands in the pipe doesn't matter.
Alternative to communicate function
Instead of the grep.communicate() function you can use grep.stdout.read(). But first you have to wait for the subprocess to finish:
>>> grep.wait()
>>> print grep.stdout.read()
This approach has a disadvantage, though: it moves the file pointer to the end of the file, so every subsequent call will return an empty string:
>>> grep.wait()
>>> print grep.stdout.read()
ntp-restrict.conf
ntp.conf
ntp_opendirectory.conf
>>> grep.stdout.read()
''
Furthermore, this has one more disadvantage: if you first wait() and then read() from a program that produces more output than the OS pipe buffer holds (historically 4 kilobytes), you'll get a deadlock.
Special thanks to Marius Gedminas, J.F. Sebastian for help with this article.

















