
Planet Python

Last update: September 26, 2020 01:47 AM UTC

September 25, 2020


ListenData

Python list comprehension: Learn by Examples

This tutorial covers how list comprehension works in Python. It includes many examples to help you become familiar with the concept, and by the end of this lesson you should be able to use it in your own projects.

What is list comprehension?

Python is an object-oriented programming language: almost everything in it is treated consistently as an object. Python also supports functional programming, which is very similar to the mathematical way of approaching a problem, where a function always returns the same output for the same input. Given a function f(x) = x², f(x) will always return the same result for the same x value. The function has no "side effects": it performs no operation that modifies a variable or mutable data structure outside its intended usage.

Functional programming is also good for parallel computing as there is no shared data or access to the same variable.

List comprehension is a part of functional programming which provides a crisp way to create lists without writing a for loop.
[Image: list comprehension syntax diagram]
In the image above, the for clause iterates through each item of the list. The if clause filters the list, returning only the items that meet the filter condition. The if clause is optional, so you can omit it if you don't need a conditional.

[i**3 for i in [1,2,3,4] if i>2] means: take the items of the list [1,2,3,4] one by one and check whether each is greater than 2. If it is, take its cube; otherwise ignore it. The result is a list of the cubes of 3 and 4. Output: [27, 64]
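And since the if clause is optional, the same comprehension without a filter simply cubes every item:

>>> [i**3 for i in [1, 2, 3, 4]]
[1, 8, 27, 64]
>>> [i**3 for i in [1, 2, 3, 4] if i > 2]
[27, 64]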

List Comprehension vs. For Loop vs. Lambda + map()

All three are different programming styles for iterating through each element of a list, but they serve the same purpose and return the same output. There are some differences between them, as shown below.
1. List comprehension is more readable than For Loop and Lambda function.
List Comprehension

[i**2 for i in range(2,10)]
For Loop

sqr = []
for i in range(2,10):
    sqr.append(i**2)
sqr
Lambda + Map

list(map(lambda i: i**2, range(2, 10)))

Output
[4, 9, 16, 25, 36, 49, 64, 81]
A list comprehension performs the loop and builds the list in a single line of code. It is clearer and easier to understand than the for loop or the lambda version.

range(2,10) returns 2 through 9 (excluding 10).

**2 refers to square (number raised to power of 2). sqr = [] creates empty list. append( ) function stores output of each repetition of sequence (i.e. square value) in for loop.

map( ) applies the lambda function to each item of the iterable (list). Wrap it in list( ) to generate a list as output.


September 25, 2020 04:17 PM UTC


PyCharm

Webinar Recording: “From The Docs: PyCharm Skills, Beginner to Advanced” with Alla Redko

PyCharm has broad, useful, up-to-date documentation. How does it get made? Who works on it? What are some hidden gems? Last week we had a webinar covering this with Alla Redko, technical writer for PyCharm, and the recording is now available.

We covered a bunch of ground in this video.

Lots of useful questions came from the audience, so thanks to all the PyCharmers who participated and helped.

September 25, 2020 03:34 PM UTC


Codementor

Ternary Search Algorithm: Explained with example.

Learn about this fast searching algorithm.

September 25, 2020 01:14 PM UTC

Robot Framework with Selenium and Python: All You Need to Know

Robot Framework offers an extensible, keyword-driven approach to Selenium testing. Go from beginner to advanced with our comprehensive Robot Framework tutorial.

September 25, 2020 12:09 PM UTC


Real Python

The Real Python Podcast – Episode #28: Using Pylance to Write Better Python Inside of Visual Studio Code

A big decision a developer has to make is which tool to use to write code. Would you like an editor that understands Python and is there to help with suggestions, definitions, and analysis of your code? For many developers, it's the free tool Visual Studio Code. This week on the show, we have Savannah Ostrowski, program manager for the Python Language Server and Python in Visual Studio. We discuss Pylance, a new language server with fast, feature-rich language support for Python in VS Code.



September 25, 2020 12:00 PM UTC


Andrew Dalke

Mixing text and chemistry toolkits

This is part of a series of essays about using chemfp to work with SD files at the record and simple text level. Chemfp has a text toolkit to read and write SDF and SMILES files as records, rather than molecules. It also has a chemistry toolkit I/O API to provide a consistent way to handle structure input and output when working with the OEChem, RDKit, and Open Babel toolkits. In this essay I'll combine the two: chemfp reads records from an SD file, the records are passed to a chemistry toolkit for further parsing, then chemfp adds a data item back to the original record instead of converting the toolkit's molecule into a new SDF record.

You can follow along yourself by installing chemfp (under the Base License Agreement) using:

python -m pip install chemfp -i https://chemfp.com/packages/

chemfp is a package for high-performance cheminformatics fingerprint similarity search. You'll also need at least one of the chemistry toolkits I mentioned.

A simple pipeline component to process SD files

Here's a program to add a data item to an SD file. It's meant to be one component of a pipeline, reading from stdin and writing to stdout. To keep things simple, all it does is add the data item STATUS with the value good. It uses chemfp's toolkit API (see yesterday's essay) to handle file I/O, so you can select your toolkit of choice by commenting/uncommenting the appropriate import lines:

## Select the toolkit you want to use
from chemfp import openbabel_toolkit as T
#from chemfp import rdkit_toolkit as T
#from chemfp import openeye_toolkit as T

# Use None to read from stdin or write to stdout.
# Specify "sdf" format; the default "smi" is for the SMILES file format.

with T.read_molecules(None, "sdf") as reader:
    with T.open_molecule_writer(None, "sdf") as writer:
        for mol in reader:
            T.add_tag(mol, "STATUS", "good")
            writer.write_molecule(mol)

Toolkits make non-chemically significant changes

I used this program to process the chebi16594.sdf file I created for yesterday's essay. Here's a side-by-side difference of the changes each toolkit makes to the original record, using screen shots from Apple's "FileMerge":

Changes each toolkit makes to chebi16594.sdf
Open Babel
  • Adds an OpenBabel program, timestamp, and "2D" on the second line.
  • Uses "999" for the deprecated count of additional property lines.
  • Sorts the atom indices in the bond block from smallest to largest.
  • Adds atom stereo parity of 3 (either or unmarked stereo center) to two atoms.
  • Uses two spaces instead of one for the first line of the data items.
RDKit
  • Adds an RDKit program and "2D" on the second line.
  • Uses "999" for the deprecated count of additional property lines.
  • Doesn't include "5" (meaning a charge of -1) for the first oxygen atom.
    It's unneeded because it duplicates the CHG property; the documentation
    since at least 1992 says it's "Retained for compatibility with older Ctabs".
  • Removes three bond columns which are not used or not needed for this record type.
OEChem
  • Adds an OEChem program, timestamp, and "2D" on the second line.
  • Omits the fff field which was already obsolete by 1992.
  • Uses "999" for the deprecated count of additional property lines.
  • Sorts the atom indices in the bond block from smallest to largest.

Every program changed the input but not, I'll stress, in a chemically meaningful way. I've never come across a case where one of the changes would have affected me, though I suppose there are times when you might want to preserve the program/timestamp and comment lines.

Some toolkits can't process some records

Now I put the program in the pipeline and start using it on real data sets. For example, if I process ChEBI_complete.sdf.gz with RDKit it quickly stops with the following:

% gzcat ChEBI_complete.sdf.gz | python add_status.py > /dev/null
[12:52:56] WARNING: not removing hydrogen atom without neighbors
[12:52:56] Explicit valence for atom # 12 N, 4, is greater than permitted
Traceback (most recent call last):
  File "/Users/dalke/add_status.py", line 11, in <module>
    for mol in reader:
         ... many lines deleted ...
  File "<string>", line 1, in raise_tb
chemfp.ParseError: Could not parse molecule block, file '<stdin>', line 679957, record #380

By default chemfp's toolkit API stops processing when it detects that a record cannot be processed. This follows the "Errors should never pass silently" guideline from the Zen of Python. To skip unparsable records, add errors="ignore" to the reader, like this:

with T.read_molecules(None, "sdf", errors="ignore") as reader:

I was curious about how many ChEBI records could not be parsed by the different toolkits, so I wrote the following program to test each available toolkit (including chemfp's own text toolkit, which iterates over the records in SDF and SMILES files).

import chemfp

filename = "ChEBI_complete.sdf.gz"

# Test all available toolkits
toolkits = []
for name in ("text", "rdkit", "openeye", "openbabel"):
    try:
        toolkits.append(chemfp.get_toolkit(name))
    except ValueError:
        pass

# Tell OEChem to not skip records with no atoms
reader_args = {"openeye.sdf.flavor": "Default|SuppressEmptyMolSkip"}

# Count how many records or molecules each one finds.
results = []
for toolkit in toolkits:
    with toolkit.read_molecules(filename, reader_args=reader_args, errors="ignore") as reader:
        num_records = sum(1 for _ in reader)
    results.append( (toolkit.name, num_records) )

# Report the counts
print(f"After parsing {filename!r}:")
for name, num_records in results:
    print(f"{name} toolkit found {num_records} records")

The output shows that Open Babel and OEChem could process each record, while RDKit was unable to read 244 records.

After parsing 'ChEBI_complete.sdf.gz':
text toolkit found 113902 records
rdkit toolkit found 113658 records
openeye toolkit found 113902 records
openbabel toolkit found 113902 records

OEChem's SuppressEmptyMolSkip

Be aware that the first version of this program reported that OEChem could not find 3 of the records. Further analysis showed the missing records were CHEBI:147324, CHEBI:147325, and CHEBI:156288, and all three of these records have no atoms.

That reminded me that by default OEChem skips records from an SD file with no atoms. Release 2.2.0 added the SuppressEmptyMolSkip flag. Quoting the documentation:

This input flavor suppresses the default action of skipping empty molecules in the input stream. This may be important in order to recover SDData stored on empty molecule records.

chemfp's reader_args support namespaces

The above code uses a reader_arg to configure OEChem's SDF flavor to include SuppressEmptyMolSkip, which I'll repeat here:

# Tell OEChem to not skip records with no atoms
reader_args = {"openeye.sdf.flavor": "Default|SuppressEmptyMolSkip"}

Even though it wasn't needed for this case, I decided to show how reader_args supports namespaces. This reader_args only configures the "flavor" value for the "sdf" reader in the "openeye" toolkit. It won't change the flavor for, say, OEChem's SMILES reader, or RDKit's or Open Babel's FASTA flavor.

What this means is that you can have a single reader_args dict which correctly specifies the configuration for each of the toolkit formats you might pass in.
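For example, one dict could carry settings for two toolkits at once (a sketch: the "rdkit.sdf.sanitize" key follows the same toolkit.format.option naming pattern, with sanitize being one of RDKit's default SDF reader arguments):

# Each namespaced key only affects its own toolkit and format.
reader_args = {
    "openeye.sdf.flavor": "Default|SuppressEmptyMolSkip",  # used only by OEChem's SDF reader
    "rdkit.sdf.sanitize": True,                            # used only by RDKit's SDF reader
}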

What if you must always output a record?

Sometimes it's okay to ignore uninterpretable records, sometimes it isn't. That uncertainty is why chemfp's toolkit API default is to raise an exception and force you to decide.

Let's suppose your pipeline requires the same number of output records as input records, and that the status to use for a record which could not be parsed is bad. How do you do that? (Assume you use RDKit, where there's a chance that some records cannot be parsed.)

One way is to use chemfp's text toolkit to read the records from the SD file, and pass each record to the chemistry toolkit for parsing. If the molecule cannot be parsed, add bad to the STATUS data item, otherwise add good. Here's how that program might look:

from chemfp import text_toolkit

#from chemfp import openbabel_toolkit as T
#from chemfp import openeye_toolkit as T
from chemfp import rdkit_toolkit as T

reader_args = {"openeye.sdf.flavor": "Default|SuppressEmptyMolSkip"}

with text_toolkit.read_molecules(None, "sdf") as reader:
    with text_toolkit.open_molecule_writer(None, "sdf") as writer:
        for text_record in reader:
            if T.parse_molecule(text_record.record, "sdf",
                                reader_args=reader_args, errors="ignore") is None:
                status = "bad"
            else:
                status = "good"
            text_record.add_tag("STATUS", status)
            writer.write_molecule(text_record)

Chemfp's text_toolkit has a very limited ability to modify an SDF record: all it can do is append a data item to the end of the record. That's because the feature was developed so chemfp-based programs could easily add a fingerprint, or similarity search results, to an SDF record. It doesn't (yet) have the ability to replace or remove data items.

What if you want to preserve the input record?

Remember earlier when I showed how the different toolkits change the syntax of a record in chemically insignificant ways? The biggest change was the second line, which may store the program name and date (among other information) and is always updated by the toolkit.

What if you don't want to change anything? In that case, the previous program shows the way: use chemfp to read the record, pass it off to the toolkit, and update the original record, so the only changes are any new data items you may have appended.

September 25, 2020 12:00 PM UTC


Codementor

How and why I built a menu planning application: What's on the Menu?

September 25, 2020 08:39 AM UTC


PyPy Development

PyPy 7.3.2 triple release: python 2.7, 3.6, and 3.7

 

The PyPy team is proud to release version 7.3.2 of PyPy, which includes three different interpreters: PyPy2.7, PyPy3.6, and PyPy3.7.

The interpreters are based on much the same codebase, thus the multiple release. This is a micro release, all APIs are compatible with the 7.3.0 (Dec 2019) and 7.3.1 (April 2020) releases, but read on to find out what is new.

Conda Forge now supports PyPy as a python interpreter. The support is quite complete for linux and macOS. This is the result of a lot of hard work and good will on the part of the Conda Forge team. A big shout out to them for taking this on.

Development of PyPy has transitioned to https://foss.heptapod.net/pypy/pypy. This move was covered more extensively in this blog post. We have seen an increase in the number of drive-by contributors who are able to use gitlab + mercurial to create merge requests.

The CFFI backend has been updated to version 1.14.2. We recommend using CFFI rather than c-extensions to interact with C, and using cppyy for performant wrapping of C++ code for Python.

NumPy has begun shipping wheels on PyPI for PyPy, currently for linux 64-bit only. Wheels for PyPy windows will be available from the next NumPy release. Thanks to NumPy for their support.

A new contributor took us up on the challenge to get windows 64-bit support. The work is proceeding on the win64 branch, more help in coding or sponsorship is welcome.

As always, this release fixed several issues and bugs. We strongly recommend updating. Many of the fixes are the direct result of end-user bug reports, so please continue reporting issues as they crop up.

You can find links to download the v7.3.2 releases here:

https://pypy.org/download.html

We would like to thank our donors for the continued support of the PyPy project. Please help support us at Open Collective. If PyPy is not yet good enough for your needs, we are available for direct consulting work.

We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on pypy, or general help with making RPython’s JIT even better. Since the previous release, we have accepted contributions from 8 new contributors, thanks for pitching in.

If you are a python library maintainer and use c-extensions, please consider making a cffi / cppyy version of your library that would be performant on PyPy. In any case both cibuildwheel and the multibuild system support building wheels for PyPy.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7, 3.6, and 3.7. It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

This PyPy release supports:

  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS X 64 bits, Windows 32 bits, OpenBSD, FreeBSD)
  • big- and little-endian variants of PPC64 running Linux,
  • s390x running Linux
  • 64-bit ARM machines running Linux.

PyPy does support ARM 32-bit processors, but does not release binaries for them.

What else is new?

For more information about the 7.3.2 release, see the full changelog.

Please update, and continue to help us make PyPy better.

Cheers,
The PyPy team

 

 

September 25, 2020 07:45 AM UTC


Codementor

Find all the prime numbers less than 'n' in O(n) Time complexity

Given a number n, find all prime numbers in the segment [2;n] in linear time complexity.
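For context, here is a minimal sketch of one standard linear-time approach (Euler's sieve, which crosses out each composite exactly once via its smallest prime factor); it is not necessarily the method the article itself uses:

def linear_sieve(n):
    # smallest_prime[i] holds the smallest prime factor of i (0 = not yet seen)
    smallest_prime = [0] * (n + 1)
    primes = []
    for i in range(2, n + 1):
        if smallest_prime[i] == 0:  # i has no smaller prime factor, so i is prime
            smallest_prime[i] = i
            primes.append(i)
        for p in primes:
            # mark i*p with smallest prime factor p; each composite is
            # marked exactly once, which keeps the total work at O(n)
            if p > smallest_prime[i] or i * p > n:
                break
            smallest_prime[i * p] = p
    return primes

print(linear_sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]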

September 25, 2020 06:58 AM UTC

September 24, 2020


Mike Driscoll

CodingNomads Tech Talk Series!

Recently CodingNomads invited me on their Tech Talk series. CodingNomads does online code camps for Python and Java.

The Tech Talks are a series of videos that teach or talk about tech. In my case, I got to talk about my favorite programming language, Python!

The first talk I did was on wxPython. In this video, I show how to create a simple image viewer:

Amazingly, I was invited to do a second talk. This time, I decided it would be fun to do an intro to Jupyter Notebook.

CodingNomads is not a sponsor of Mouse vs Python. They are a neat group that kindly asked me to be a part of their series after I volunteered some of my time to mentor people for them over the summer.

The post CodingNomads Tech Talk Series! appeared first on The Mouse Vs. The Python.

September 24, 2020 05:00 PM UTC


PyBites

10 Things We Picked Up From Code Reviewing

We originally sent the following 10 tips to our Friends List; we got requests to post it here for reference, so here you go ...

Ever wondered what you could learn from a code review?

Here are some things we picked up from code reviews that when addressed can make your code a lot cleaner:

  1. Break long functions (methods) into multiple smaller ones - this will make your code more reusable and easier to test.

    Remember each function should do only one thing. Example: a function that parses a csv file, builds up a result list and prints the results does 3 things and should be split accordingly.

  2. Move magic numbers sprinkled through your code to constants (at the top of your module) - again easier to reuse, more readable, fewer surprises later on.

  3. Watch out for anything that you put in the global scope; localize variables (data) as much as possible - fewer unexpected side effects.

  4. Use flake8 (or black) - more consistent (PEP8 compliant code) is easier to read and earns you more respect from fellow developers (also remember: "how you do the small things determines how you do the big things" - very true with software development).

    This also goes back to developers writing code not only for machines, but also (and more importantly) for other developers. Really long lines might annoy your colleagues that use vsplit to look at multiple code files at once.

  5. Keep try/except blocks narrow (ask yourself: "Are all those lines in between really going to throw this exception?!"). Avoid bare excepts, just using pass, or re-raising an exception without additional error handling code (e.g. at least log the error).

  6. Leverage the Python language (Pythonic code) - for example replace a try/finally with a with statement, and don't overly check conditions (look before you leap), just try/except (ask for forgiveness). Here is a great article on this topic: Idiomatic Python: EAFP versus LBYL. See the first sketch after this list.

    Another example is relying on Python's concept of truthiness (e.g. just do if my_list instead of if len(my_list) > 0).

  7. Use the right data structure - if you check for membership in a big collection, it's often better to use a set than a list, because a list is scanned sequentially and is therefore slower.

  8. Leverage the Standard Library - you don't have to reinvent the wheel.

    For example if you have a collections.Counter object you don't need to use max on it, you can use its most_common method. Counting values manually? You can use sum that receives an iterable. The all/any builtins are wonderful. Or for more complex operations, itertools is an excellent module.

  9. Long if-elif-elif-elif-elif-else's are quite ugly and hard to maintain. You can beautifully refactor those using dictionaries (mappings) - fewer lines of code, easier to maintain. See the second sketch after this list.

  10. Flat is better than nested (Zen of Python, btw pipe import this to your printer now ...) - closely related to number 1, but worth emphasizing: if you have a for inside a for, and the inner for has a bunch of nested ifs, it's time to rethink what you are trying to do, because this code will be very hard to test and maintain in the future.
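
To make number 6 concrete, here is a minimal sketch (the file name and the config dict are made up for illustration):

# try/finally replaced by a with statement - the file is closed automatically:
with open("data.txt") as f:
    data = f.read()

config = {"mode": "fast"}

# LBYL ("look before you leap"):
if "retries" in config:
    retries = config["retries"]
else:
    retries = 3

# EAFP ("easier to ask forgiveness than permission"):
try:
    retries = config["retries"]
except KeyError:
    retries = 3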

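And a sketch for number 9, using made-up handler functions:

def start():
    return "starting"

def stop():
    return "stopping"

# Instead of: if command == "start": ... elif command == "stop": ... else: ...
HANDLERS = {"start": start, "stop": stop}

def run(command):
    try:
        return HANDLERS[command]()  # look the handler up in the mapping
    except KeyError:
        raise ValueError(f"unknown command: {command}")

print(run("start"))  # starting
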
Hope that helps! What cool tips have you learned from going through code reviews? Comment below ...


Keep calm and code in Python!

-- Bob

September 24, 2020 04:43 PM UTC


PyCharm

Webinar: “virtualenv – a deep dive” with Bernat Gabor

virtualenv is a tool that builds virtual environments for Python. It was first created in September 2007 and just went through a rewrite from scratch. Did you ever want to know what parts virtual environments can be broken down into? Or how they work? And how does virtualenv differ from the Python builtin venv? This is the webinar you want.

Speaking To You

Bernat Gabor has been using Python since 2011 and has been a busy participant in the open-source Python community. He is the maintainer of the virtualenv package, which allows the creation of Python virtual environments for all Python versions and interpreter types, including CPython, Jython, and PyPy. He also maintains tox and has contributed to various other Python packages.

Bernat works at Bloomberg, a technology company with more than 6,000 software engineers around the world – 2,000 of whom use Python in their daily roles. Finally, he is part of the company’s Python Guild, a group of engineers dedicated to improving the adoption, usage, and best practices of Python within the company.

September 24, 2020 02:29 PM UTC


Stack Abuse

Facial Detection in Python with OpenCV

Introduction

Facial detection is a powerful and common use case of machine learning. It can be used to automate manual tasks such as school attendance, and it is used in law enforcement. On the other hand, it can also be used for biometric authorization.

In this article, we'll perform facial detection in Python, using OpenCV.

OpenCV

OpenCV is one of the most popular computer vision libraries. It is written in C and C++ and provides bindings for Python, as well as Java and MATLAB. While it's not the fastest library out there, it's easy to work with and provides a high-level interface, allowing developers to write stable code.

Let's install OpenCV so that we can use it in our Python code:

$ pip install opencv-contrib-python

Alternatively, you can install opencv-python for just the main modules of OpenCV. The opencv-contrib-python contains the main modules as well as the contrib modules which provide extended functionality.

Detecting Faces in an Image Using OpenCV

With OpenCV installed, we can import it as cv2 in our code.

To read in an image, we use the imread() function, along with the path to the image we want to process. The imread() function loads the image from the specified file into an ndarray. If the image could not be read, for example because of a missing file or an unsupported format, the function returns None.

We will be using an image from a Kaggle dataset:

import cv2

path_to_image = 'Parade_12.jpg'
original_image = cv2.imread(path_to_image)

The full RGB information isn't necessary for facial detection. Color holds a lot of irrelevant information, so it's more efficient to remove it and work with a grayscale image. Additionally, the Viola-Jones algorithm, which OpenCV uses under the hood, checks the difference in intensity between areas of an image, and grayscale images show this difference more dramatically.

Note: In the case of color images, the decoded images will have the channels stored in BGR order, so when changing them to grayscale, we need to use the cv2.COLOR_BGR2GRAY flag:

image = cv2.cvtColor(original_image, cv2.COLOR_BGR2GRAY)

This could have been done directly when using imread(), by setting the cv2.IMREAD_GRAYSCALE flag:

original_image = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE)

The OpenCV library comes with several pre-trained classifiers that are trained to find different things, like faces, eyes, smiles, upper bodies, etc.

The Haar features for detecting these objects are stored as XML, and depending on how you installed OpenCV, can most often be found in Lib\site-packages\cv2\data. They can also be found in the OpenCV GitHub repository.

In order to access them from code, you can use cv2.data.haarcascades and add the name of the XML file you'd like to use.

We can choose which Haar features we want to use for our object detection, by adding the file path to the CascadeClassifier() constructor, which uses pre-trained models for object detection:

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

Now, we can use this face_cascade object to detect faces in the image:

detected_faces = face_cascade.detectMultiScale(image=image, scaleFactor=1.3, minNeighbors=4)

When object detection models are trained, they are trained to detect faces of a certain size and might miss faces that are bigger or smaller than they expect. With this in mind, the image is resized several times in the hope that a face will end up at a "detectable" size. The scaleFactor lets OpenCV know how much to shrink the image at each step; in our case, 1.3 means the image is scaled down by a factor of 1.3 each time to try to match the faces better.

As for the minNeighbors parameter, it's used to control the number of false positives and false negatives. It defines the minimum number of positive rectangles (detected facial features) that need to be adjacent to a positive rectangle in order for it to be considered actually positive. If minNeighbors is set to 0, the slightest hint of a face will be counted as a definitive face, even if no other facial features are detected near it.

Both the scaleFactor and minNeighbors parameters are somewhat arbitrary and set experimentally. We have chosen values that worked well for us, and gave no false positives, with the trade-off of more false negatives (undetected faces).

The detectMultiScale() method returns a list of rectangles of all the detected objects (faces in our first case). Each element in the list represents a unique face. This list contains tuples, (x, y, w, h), where the x, y values represent the top-left coordinates of the rectangle, while the w, h values represent the width and height of the rectangle, respectively.

We can use the returned list of rectangles with the cv2.rectangle() function to easily draw the rectangles where a face was detected. Keep in mind that the color provided needs to be a tuple in BGR order, since that's the channel order OpenCV uses:

for (x, y, width, height) in detected_faces:
    cv2.rectangle(
        image,
        (x, y),
        (x + width, y + height),
        color,
        thickness=2
    )

Now, let's put that all together:

import cv2

def draw_found_faces(detected, image, color: tuple):
    for (x, y, width, height) in detected:
        cv2.rectangle(
            image,
            (x, y),
            (x + width, y + height),
            color,
            thickness=2
        )

path_to_image = 'Parade_12.jpg'
original_image = cv2.imread(path_to_image)

if original_image is not None:
    # Convert image to grayscale
    image = cv2.cvtColor(original_image, cv2.COLOR_BGR2GRAY)

    # Create Cascade Classifiers
    face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    profile_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_profileface.xml")
    
    # Detect faces using the classifiers
    detected_faces = face_cascade.detectMultiScale(image=image, scaleFactor=1.3, minNeighbors=4)
    detected_profiles = profile_cascade.detectMultiScale(image=image, scaleFactor=1.3, minNeighbors=4)

    # Filter out profiles
    profiles_not_faces = [x for x in detected_profiles if x not in detected_faces]

    # Draw rectangles around faces on the original, colored image
    draw_found_faces(detected_faces, original_image, (0, 255, 0)) # BGR - green
    draw_found_faces(profiles_not_faces, original_image, (0, 0, 255)) # BGR - red

    # Open a window to display the results
    cv2.imshow(f'Detected Faces in {path_to_image}', original_image)
    # The window will close as soon as any key is pressed (not a mouse click)
    cv2.waitKey(0) 
    cv2.destroyAllWindows()
else:
    print(f'An error occurred while trying to load {path_to_image}')

We used two different models on this picture: the default model for detecting front-facing faces, and a model built to better detect faces looking to the side.

Faces detected with the frontalface model are outlined in green, and faces detected with the profileface model are outlined in red. Most of the faces the first model found would also have been found by the second, so we only drew red rectangles where the profileface model detected a face but the frontalface model didn't:

profiles_not_faces = [x for x in detected_profiles if x not in detected_faces]

The imshow() method simply shows the passed image in a window with the provided title. With the picture we selected, this would provide the following output:

[Image: frontal and profile face detection results]

Using different values for scaleFactor and minNeighbors will give us different results. For example, using scaleFactor = 1.1 and minNeighbors = 4 gives us more false positives and true positives with both models:

[Image: face detection with a lower scale factor]

We can see that the algorithm isn't perfect, but it is very efficient. This is most notable when working with real-time data, such as a video feed from a webcam.

Real-Time Face Detection Using a Webcam

Video streams are simply streams of images. With the efficiency of the Viola-Jones algorithm, we can do face detection in real-time.

The steps we need to take are very similar to the previous example with only one image - we'll be performing this on each image in the stream.

To get the video stream, we'll use the cv2.VideoCapture class. The constructor for this class takes an integer parameter representing the video stream. On most machines, the webcam can be accessed by passing 0, but on machines with several video streams, you might need to try out different values.

Next, we need to read individual images from the input stream. This is done with the read() function, which returns retval and image. The image is simply the retrieved frame. The retval return value is used to detect whether a frame has been retrieved or not, and will be False if it hasn't.

However, it tends to be inconsistent with video input streams (it doesn't detect that the webcam has been disconnected, for example), so we will be ignoring this value.
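That said, if you do want to guard against a failed read, a minimal sketch of the loop looks like this:

while True:
    ret, frame = video_capture.read()
    if not ret:
        break  # no frame was retrieved (e.g. the stream ended), so stop
    # ... process the frame as in the full program below ...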

Let's go ahead and modify the previous code to handle a video stream:

import cv2

def draw_found_faces(detected, image, color: tuple):
    for (x, y, width, height) in detected:
        cv2.rectangle(
            image,
            (x, y),
            (x + width, y + height),
            color,
            thickness=2
        )

# Capturing the Video Stream
video_capture = cv2.VideoCapture(0)

# Creating the cascade objects
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye_tree_eyeglasses.xml")

while True:
    # Get individual frame
    _, frame = video_capture.read()
    # Convert the frame to grayscale
    grayscale_image = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Detect all the faces in that frame
    detected_faces = face_cascade.detectMultiScale(image=grayscale_image, scaleFactor=1.3, minNeighbors=4)
    detected_eyes = eye_cascade.detectMultiScale(image=grayscale_image, scaleFactor=1.3, minNeighbors=4)
    draw_found_faces(detected_faces, frame, (0, 0, 255))
    draw_found_faces(detected_eyes, frame, (0, 255, 0))

    # Display the updated frame as a video stream
    cv2.imshow('Webcam Face Detection', frame)

    # Press the ESC key to exit the loop
    # 27 is the code for the ESC key
    if cv2.waitKey(1) == 27:
        break

# Releasing the webcam resource
video_capture.release()

# Destroy the window that was showing the video stream
cv2.destroyAllWindows()

Conclusion

In this article, we've created a facial detection application using Python and OpenCV.

Using the OpenCV library is very straightforward for basic object detection programs. Experimentally adjusting the scaleFactor and minNeighbors parameters for the types of images you'd like to process can give pretty accurate results very efficiently.

September 24, 2020 12:30 PM UTC


Andrew Dalke

chemfp's chemistry toolkit I/O API

This is part of a series of essays about working with SD files at the record and simple text level. In the last two essays I showed examples of using chemfp to process SDF records and to read two record data items. In this essay I'll introduce chemfp's chemistry toolkit I/O API, which I developed to have a consistent way to handle structure input and output when working with the OEChem, RDKit, and Open Babel toolkits.

You can follow along yourself by installing chemfp (under the Base License Agreement) using:

python -m pip install chemfp -i https://chemfp.com/packages/

chemfp is a package for high-performance cheminformatics fingerprint similarity search. You'll also need at least one of the chemistry toolkits I mentioned.

Add an SDF data item using the native APIs

Every cheminformatics toolkit deserving of that description can add properties to an SDF record. Here's how to do it in several different toolkits, using the input file chebi16594.sdf (a modified version of CHEBI:16594), which contains the following:

CHEBI:16594

Shortened for demonstration purposes.
  9  8  0  0  0  0  0  0  0  0  2 V2000
   19.3348  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   20.4867  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   22.7903  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   23.9421  -19.3671    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   22.7903  -17.3721    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.1830  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.3348  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  2  1  0  0  0  0
  4  3  1  0  0  0  0
  5  4  1  0  0  0  0
  6  4  2  0  0  0  0
  7  3  1  0  0  0  0
  8  1  1  0  0  0  0
  9  1  1  0  0  0  0
M  CHG  1   5  -1
M  END
> <ChEBI ID>
CHEBI:16594

> <ChEBI Name>
2,4-diaminopentanoate

$$$$

For each toolkit I'll add an "MW" data item where the value is the molecular weight, as determined by the toolkit.

OEChem

For OEChem I create an oemolistream ("OpenEye molecule input stream") with the given filename. By default it auto-detects the format from the filename extension. The oemolistream's GetOEGraphMols() returns a molecule iterator. I'll use next() to get the first molecule, then iterate over the data items to report the existing data items:

>>> from openeye.oechem import *
>>> mol = next(oemolistream("chebi16594.sdf").GetOEGraphMols())
>>> [(data_item.GetTag(), data_item.GetValue()) for data_item in OEGetSDDataPairs(mol)]
[('ChEBI ID', 'CHEBI:16594'), ('ChEBI Name', '2,4-diaminopentanoate')]

The OECalculateMolecularWeight() function computes the molecular weight, so I'll use it to add an "MW" data item (with the weight rounded to 2 decimal digits), check that the item was added, then write the result to stdout in SD format:

>>> OECalculateMolecularWeight(mol)
131.15303999999998
>>> OEAddSDData(mol, "MW", f"{OECalculateMolecularWeight(mol):.2f}")
True
>>> [(data_item.GetTag(), data_item.GetValue()) for data_item in OEGetSDDataPairs(mol)]
[('ChEBI ID', 'CHEBI:16594'), ('ChEBI Name', '2,4-diaminopentanoate'), ('MW', '131.15')]
>>> ofs = oemolostream()
>>> ofs.SetFormat(OEFormat_SDF)
True
>>> OEWriteMolecule(ofs, mol)
CHEBI:16594
  -OEChem-09242013332D
Shortened for demonstration purposes.
  9  8  0     0  0  0  0  0  0999 V2000
   19.3348  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   20.4867  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   22.7903  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   23.9421  -19.3671    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   22.7903  -17.3721    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.1830  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.3348  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  3  4  1  0  0  0  0
  4  5  1  0  0  0  0
  4  6  2  0  0  0  0
  3  7  1  0  0  0  0
  1  8  1  0  0  0  0
  1  9  1  0  0  0  0
M  CHG  1   5  -1
M  END
> <ChEBI ID>
CHEBI:16594

> <ChEBI Name>
2,4-diaminopentanoate

> <MW>
131.15

$$$$
0

That final 0 is the interactive Python shell printing the return value of OEWriteMolecule. It is not part of what OEChem wrote to stdout.

RDKit

In RDKit you need to know which file reader to use for a given file; in this case, ForwardSDMolSupplier(). (An upcoming release will offer a generic reader function which dispatches to the appropriate file reader.) The reader is a molecule iterator, so again I'll use next() to get the first molecule, then see which data items are present:

>>> from rdkit import Chem
>>> mol = next(Chem.ForwardSDMolSupplier("chebi16594.sdf"))
>>> mol.GetPropsAsDict()
{'ChEBI ID': 'CHEBI:16594', 'ChEBI Name': '2,4-diaminopentanoate'}

I'll use Descriptors.MolWt() to compute the molecular weight and set the "MW" data item. You can see that even though I set the MW as a string, GetPropsAsDict() returns it as a float. This is because GetPropsAsDict() tries to coerce strings which look like floats or integers into native Python floats or integers (including "nan" and "-inf"). To prevent coercion, use the GetProp() method:

>>> from rdkit.Chem import Descriptors
>>> Descriptors.MolWt(mol)
131.155
>>> mol.SetProp("MW", f"{Descriptors.MolWt(mol):.2f}")
>>> mol.GetPropsAsDict()
{'ChEBI ID': 'CHEBI:16594', 'ChEBI Name': '2,4-diaminopentanoate', 'MW': 131.16}
>>> [(name, mol.GetProp(name)) for name in mol.GetPropNames()]
[('ChEBI ID', 'CHEBI:16594'), ('ChEBI Name', '2,4-diaminopentanoate'), ('MW', '131.16')]

Finally, I'll write the molecule to stdout.

>>> import sys
>>> writer = Chem.SDWriter(sys.stdout)
>>> writer.write(mol)
>>> writer.close()
CHEBI:16594
     RDKit          2D

  9  8  0  0  0  0  0  0  0  0999 V2000
   19.3348  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   20.4867  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   22.7903  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   23.9421  -19.3671    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   22.7903  -17.3721    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.1830  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.3348  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  2  1  0
  4  3  1  0
  5  4  1  0
  6  4  2  0
  7  3  1  0
  8  1  1  0
  9  1  1  0
M  CHG  1   5  -1
M  END
>  <ChEBI ID>  (1)
CHEBI:16594

>  <ChEBI Name>  (1)
2,4-diaminopentanoate

>  <MW>  (1)
131.16

$$$$

Open Babel

Open Babel, thanks to the pybel interface, is the easiest of the bunch. The following uses Open Babel 3.0, which moved pybel to a submodule of openbabel. I ask readfile() to open the given file in "sdf" format. That returns an iterator, from which I get the first molecule. The molecule has a special "data" attribute with the SD data items combined with some internal Open Babel data items. (RDKit does the same thing, but by default they are hidden.)

>>> from openbabel import pybel
>>> mol = next(pybel.readfile("sdf", "chebi16594.sdf"))
>>> mol.data
{'MOL Chiral Flag': '0', 'ChEBI ID': 'CHEBI:16594', 'ChEBI Name': '2,4-diaminopentanoate',
'OpenBabel Symmetry Classes': '8 5 7 9 1 6 3 2 4'}

Pybel molecules have a molwt attribute containing the molecular weight, or I can compute it via the underlying Open Babel OBMol object. I save it to the data attribute, export the contents as a string in "sdf" format, and write the output to stdout, asking print() not to add a terminal newline:

>>> mol.molwt
131.15304 
>>> mol.OBMol.GetMolWt()
131.15304 
>>> mol.data["MW"] = f"{mol.molwt:.2f}"
>>> mol.data
{'MOL Chiral Flag': '0', 'ChEBI ID': 'CHEBI:16594', 'ChEBI Name': '2,4-diaminopentanoate',
'OpenBabel Symmetry Classes': '8 5 7 9 1 6 3 2 4', 'MW': '131.15'}
>>> print(mol.write("sdf"), end="")
CHEBI:16594
 OpenBabel09242014012D
Shortened for demonstration purposes.
  9  8  0  0  0  0  0  0  0  0999 V2000
   19.3348  -19.3671    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   20.4867  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -19.3671    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   22.7903  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   23.9421  -19.3671    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   22.7903  -17.3721    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.1830  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.3348  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  2  1  0  0  0  0
  4  3  1  0  0  0  0
  5  4  1  0  0  0  0
  6  4  2  0  0  0  0
  7  3  1  0  0  0  0
  8  1  1  0  0  0  0
  9  1  1  0  0  0  0
M  CHG  1   5  -1
M  END
>  <ChEBI ID>
CHEBI:16594

>  <ChEBI Name>
2,4-diaminopentanoate

>  <MW>
131.15

$$$$

chemfp's chemistry toolkit API

Chemfp supports Open Babel, OEChem+OEGraphSim, and RDKit. Each toolkit has its own way of handling chemical structure I/O. Following the fundamental theorem of software engineering, I "solved" the problem by introducing an extra level of indirection - I created a chemistry toolkit I/O API and developed wrapper implementations for each of the underlying chemistry toolkits.

Here's a side-by-side comparison:

Comparison of toolkit native and chemfp wrapper APIs

OEChem, native:

from openeye.oechem import *

mol = next(oemolistream("chebi16594.sdf").GetOEGraphMols())
mw = OECalculateMolecularWeight(mol)
OEAddSDData(mol, "MW", f"{mw:.2f}")
ofs = oemolostream()
ofs.SetFormat(OEFormat_SDF)
OEWriteMolecule(ofs, mol)

OEChem, chemfp:

from chemfp import openeye_toolkit as OETK
from openeye.oechem import OECalculateMolecularWeight

mol = next(OETK.read_molecules("chebi16594.sdf"))
mw = OECalculateMolecularWeight(mol)
OETK.add_tag(mol, "MW", f"{mw:.2f}")
print(OETK.create_string(mol, "sdf"), end="")

RDKit, native:

import sys
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = next(Chem.ForwardSDMolSupplier("chebi16594.sdf"))
mw = Descriptors.MolWt(mol)
mol.SetProp("MW", f"{mw:.2f}")

writer = Chem.SDWriter(sys.stdout)
writer.write(mol)
writer.close()

RDKit, chemfp:

from chemfp import rdkit_toolkit as RDTK
from rdkit.Chem import Descriptors

mol = next(RDTK.read_molecules("chebi16594.sdf"))
mw = Descriptors.MolWt(mol)
RDTK.add_tag(mol, "MW", f"{mw:.2f}")
print(RDTK.create_string(mol, "sdf"), end="")

Open Babel, native (pybel):

from openbabel import pybel

mol = next(pybel.readfile("sdf", "chebi16594.sdf"))
mol.data["MW"] = f"{mol.molwt:.2f}"
print(mol.write("sdf"), end="")

Open Babel, chemfp:

from chemfp import openbabel_toolkit as OBTK

mol = next(OBTK.read_molecules("chebi16594.sdf"))
OBTK.add_tag(mol, "MW", f"{mol.GetMolWt():.2f}")
print(OBTK.create_string(mol, "sdf"), end="")

The point is not that chemfp's toolkit API is all that much shorter than the underlying toolkit API, but rather that it's consistent across the three toolkits. This becomes more useful when you start working with more than one toolkit and have to remember the nuances of each one.

Format and format option discovery

One of the important features I wanted in chemfp was full support for all of the formats supported by the underlying toolkits, and all of the options for each of those formats. And I wanted to make that information discoverable. For example, the following shows the formats available through chemfp for each toolkit:

>>> from chemfp import rdkit_toolkit
>>> print(", ".join(fmt.name for fmt in rdkit_toolkit.get_formats()))
smi, can, usm, sdf, smistring, canstring, usmstring, molfile,
rdbinmol, fasta, sequence, helm, mol2, pdb, xyz, mae, inchi, inchikey,
inchistring, inchikeystring
>>> from chemfp import openeye_toolkit
>>> print(", ".join(fmt.name for fmt in openeye_toolkit.get_formats()))
smi, usm, can, sdf, molfile, skc, mol2, mol2h, sln, mmod, pdb, xyz,
cdx, mopac, mf, oeb, inchi, inchikey, oez, cif, mmcif, fasta,
sequence, csv, json, smistring, canstring, usmstring, slnstring,
inchistring, inchikeystring
>>> from chemfp import openbabel_toolkit
>>> print(", ".join(fmt.name for fmt in openbabel_toolkit.get_formats()))
smi, can, usm, smistring, canstring, usmstring, sdf, inchi, inchikey,
inchistring, inchikeystring, fa, abinit, dalmol, pdbqt, mmcif, xsf,
    ... many lines removed ...
acesout, POSCAR, pcjson, gzmat, mae, pointcloud, gamess, mopcrt,
confabreport

For each format there are properties which say if it is an input format or an output format (InChIKey, for example, is only an output format), and whether the format can handle file I/O or only string-based I/O. (The "smistring" format can only parse a SMILES string, while the "smi" format can parse a SMILES file, specified by filename or by contents passed in as a string.)
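A quick check along those lines, using get_format() (shown just below) - a sketch: I'm assuming the Format object exposes boolean attributes named is_input_format and is_output_format, so verify the names against your chemfp version:

from chemfp import rdkit_toolkit

# InChIKey can be written but not parsed back into a molecule
fmt = rdkit_toolkit.get_format("inchikey")
print(fmt.is_input_format, fmt.is_output_format)  # assumed attribute names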

There are also ways to figure out the default values for the readers and writers:

>>> from chemfp import rdkit_toolkit
>>> fmt = rdkit_toolkit.get_format("sdf")
>>> fmt.get_default_reader_args()
{'sanitize': True, 'removeHs': True, 'strictParsing': True, 'includeTags': True}
>>> fmt.get_default_writer_args()
{'includeStereo': False, 'kekulize': True, 'v3k': False}

Reader and writer args

Those reader_args and writer_args can be passed to the input and output methods. For example, the RDKit writer's v3k writer_arg, if True, asks RDKit to always generate a V3000 record, even if the molecule can be expressed as a V2000 record:

>>> from chemfp import rdkit_toolkit
>>> mol = rdkit_toolkit.parse_molecule("C#N", "smistring")
>>> print(rdkit_toolkit.create_string(mol, "sdf"))

     RDKit

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  3  0
M  END
$$$$

>>> print(rdkit_toolkit.create_string(mol, "sdf", writer_args={"v3k": True}))

     RDKit

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 2 1 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0 0 0 0
M  V30 2 N 0 0 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 3 1 2
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$

and here's an example where I enable OEChem's "strict" SMILES parser so that multiple sequential bond symbols are not accepted:

>>> from chemfp import openeye_toolkit
>>> mol = openeye_toolkit.parse_molecule("C=#-C", "smistring")
>>> openeye_toolkit.create_string(mol, "smistring")
'CC'
>>> mol = openeye_toolkit.parse_molecule("C=#-C", "smistring", reader_args={"flavor": "Default|Strict"})
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C=#-C
Warning:   ^

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
       .... many lines omitted ... 
  File "<string>", line 1, in raise_tb  
chemfp.ParseError: OEChem cannot parse the smistring record: 'C=#-C'

The OpenEye flavor reader and writer args accept the raw OEChem integer flags, as well as a string-based syntax to express them symbolically. In this case Default|Strict says to start with the default flags for this format, then add the Strict option.

OEChem flavor help

The format API doesn't have a way to get detailed help about each option. For most options it's not hard to guess the meaning from the name and Python data type. That doesn't work for OEChem's flavor options, though. The quickest way to get interactive help is to pass an invalid flavor and read the error message:

>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.parse_molecule(mol, "sdf", reader_args={"flavor": "x"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
       .... many lines omitted ... 
    raise err
ValueError: OEChem sdf format does not support the 'x' flavor option.
Available flavors are: FixBondMarks, SuppressEmptyMolSkip, SuppressImp2ExpENHSTE

Why did I develop chemfp's toolkit API?

I developed the API starting with chemfp 2.0 because there was a clear need to allow users to configure input processing in a way that respected what the underlying toolkits could do.

As a somewhat extreme example, the most recent version of RDKit supports the FASTA format, with a flavor option that configures it to interpret the input as protein (0 or 1), RNA (2-5), or DNA (6-9), with different values for L- or L+D-amino acids, and different options for 3' and 5' caps on the nucleotides. Thus, AA could be dialanine or a nucleotide sequence with two adenines. The following computes the MACCS fingerprint for both cases, using the -R parameter to specify a reader argument:

% printf ">dialanine\nAA\n" | rdkit2fps --maccs -R flavor=0 --in fasta | tail -1
00000000000020000040084800201004842452fa09	dialanine
% printf ">diadenine capped RNA\nAA\n" | rdkit2fps --maccs -R flavor=5 --in fasta | tail -1
000000102084002191d41ccf33b3907bde6feb7d1f	diadenine capped RNA

There's less need to handle writer options since chemfp doesn't really need to write structure files. The closest case is when fingerprints or the results of a similarity search are added to an SDF output. In tomorrow's essay I'll describe some ways to modify the SDF output to include a new data item.

But really, that part of chemfp is probably more of a vanity project than anything else. I have some strong opinions on what a good API should be, and had the chance to implement it, show it handles the needs of multiple chemistry toolkits, and document it. Just like I drew some inspiration from pybel, perhaps others will draw some inspiration from the chemfp API.

I personally find it really satisfying to be able to develop, say, a HELM to SLN conversion tool which uses RDKit to convert the HELM string into an SDF record, then OEChem to convert the SDF record to SLN.

>>> from chemfp import rdkit_toolkit, openeye_toolkit
>>>
>>> def helm_to_sln(helm_str):
...   rdmol = rdkit_toolkit.parse_molecule(helm_str, "helm") 
...   sdf_record = rdkit_toolkit.create_string(rdmol, "sdf")
...   oemol = openeye_toolkit.parse_molecule(sdf_record, "sdf")
...   return openeye_toolkit.create_string(oemol, "slnstring")
...
>>> helm_to_sln("PEPTIDE1{[dA].[dN].[dD].[dR].[dE].[dW]}$$$$")
'NH2CH(C(=O)NHCH(C(=O)NHCH(C(=O)NHCH(C(=O)NHCH(C(=O)NHCH(C(=O)OH)CH2C[1]=CHNHC[2]:C(@1):CH:CH:CH:CH:@2)CH2CH2C(=O)OH)CH2CH2CH2NHC(=NH)NH2)CH2C(=O)OH)CH2C(=O)NH2)CH3'

This specific function may not be useful, but the ability to specify this sort of work in only a few lines makes it easier to try out new ideas.

September 24, 2020 12:00 PM UTC


Abhijeet Pal

Sending Emails With CSV Attachment Using Python

In this tutorial, we will learn how to send emails with CSV attachments using Python. Pre-requirements: I am assuming you already have an SMTP server set up; if not, you can use the Gmail SMTP, Mailgun, or anything similar ... Read more

The post Sending Emails With CSV Attachment Using Python appeared first on Django Central.

September 24, 2020 08:02 AM UTC


Python Insider

Python 3.8.6 is now available

Python 3.8.6 is the sixth maintenance release of Python 3.8. Go get it here:

https://www.python.org/downloads/release/python-386/

 

Maintenance releases for the 3.8 series will continue at regular bi-monthly intervals, with 3.8.7 planned for mid-November 2020.

What’s new?

The Python 3.8 series is the newest feature release of the Python language, and it contains many new features and optimizations. See the “What’s New in Python 3.8” document for more information about features included in the 3.8 series.

Python 3.8 is becoming more stable. Our bugfix releases are becoming smaller as we progress; this one contains 122 changes, fewer than two thirds of the previous average for a new release. Detailed information about all changes made in version 3.8.6 specifically can be found in its change log. Note that compared to 3.8.5 this release also contains all changes present in 3.8.6rc1.

We hope you enjoy Python 3.8!

Thanks to all of the many volunteers who help make Python Development and these releases possible! Please consider supporting our efforts by volunteering yourself or through organization contributions to the Python Software Foundation.

Your friendly release team,
Ned Deily @nad
Steve Dower @steve.dower
Łukasz Langa @ambv

September 24, 2020 06:55 AM UTC


Codementor

Is Python better than R for data science?

In this article you will find out whether Python is better than R for data science. https://nareshit.com/python-online-training/

September 24, 2020 06:44 AM UTC


Abhijeet Pal

Sending Email With Zip Files Using Python

In this tutorial, we will learn how to send emails with zip files using Python's built-in modules. Pre-requirements: I am assuming that you already have an SMTP (Simple Mail Transfer Protocol) server set up; if not, you can use Gmail ... Read more

The post Sending Email With Zip Files Using Python appeared first on Django Central.

September 24, 2020 05:50 AM UTC


Sebastian Witowski

Sorting Lists

There are at least two common ways to sort lists in Python: the list.sort() method and the sorted() built-in function.

Which one is faster? Let’s find out!

sorted() vs list.sort()

I will start with a list of 1 000 000 randomly shuffled integers. Later on, I will also check if the order matters.

# sorting.py
from random import sample

# List of 1 000 000 integers randomly shuffled
MILLION_RANDOM_NUMBERS = sample(range(1_000_000), 1_000_000)


def test_sort():
    return MILLION_RANDOM_NUMBERS.sort()

def test_sorted():
    return sorted(MILLION_RANDOM_NUMBERS)
$ python -m timeit -s "from sorting import test_sort" "test_sort()"
1 loop, best of 5: 6 msec per loop

$ python -m timeit -s "from sorting import test_sorted" "test_sorted()"
1 loop, best of 5: 373 msec per loop

When benchmarked with Python 3.8, sort() is around 60 times as fast as sorted() when sorting 1 000 000 numbers (373/6≈62.167).

Update: As pointed out by a vigilant reader in the comments section, I've made a terrible blunder in my benchmarks! timeit runs the code multiple times, which means that the first call to test_sort() sorts MILLION_RANDOM_NUMBERS in place, and every subsequent call re-sorts an already-sorted list.

We get completely wrong results because we compare calling list.sort() on an ordered list with calling sorted() on a random list.

Let’s fix my test functions and rerun benchmarks.

# sorting.py
from random import sample

# List of 1 000 000 integers randomly shuffled
MILLION_RANDOM_NUMBERS = sample(range(1_000_000), 1_000_000)

def test_sort():
    random_list = MILLION_RANDOM_NUMBERS[:]
    return random_list.sort()

def test_sorted():
    random_list = MILLION_RANDOM_NUMBERS[:]
    return sorted(random_list)

This time, I’m explicitly making a copy of the initial shuffled list and then sorting that copy (new_list = old_list[:] is a great little snippet to copy a list in Python). Copying a list adds a small overhead to our test functions, but as long as we call the same code in both functions, that’s acceptable.
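Incidentally, the copy idiom itself is easy to verify on a small list (an illustrative sketch, not part of the benchmarks):

old_list = [3, 1, 2]
new_list = old_list[:]   # shallow copy; old_list.copy() and list(old_list) work too

new_list.sort()          # sorting the copy...
print(old_list)          # [3, 1, 2] - ...leaves the original untouched
print(new_list)          # [1, 2, 3]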

Let’s see the results:

$ python -m timeit -s "from sorting import test_sort" "test_sort()"
1 loop, best of 5: 352 msec per loop

$ python -m timeit -s "from sorting import test_sorted" "test_sorted()"
1 loop, best of 5: 385 msec per loop

Now, sorted is less than 10% slower (385/352≈1.094). Since we only run one loop, the exact numbers are not very reliable. I have rerun the same tests a couple more times, and the results were slightly different each time. sort took around 345-355 msec and sorted took around 379-394 msec (but it was always slower than sort). This difference comes mostly from the fact that sorted creates a new list (again, as kindly pointed out by a guest reader in the comments).

Initial order matters

What happens when our initial list is already sorted?

MILLION_NUMBERS = list(range(1_000_000))
$ python -m timeit -s "from sorting import test_sort" "test_sort()"
20 loops, best of 5: 12.1 msec per loop

$ python -m timeit -s "from sorting import test_sorted" "test_sorted()"
20 loops, best of 5: 16.6 msec per loop

Now, sorting takes much less time and the difference between sort and sorted grows to 37% (16.6/12.1≈1.372). Why is sorted 37% slower this time? Well, creating a new list takes the same amount of time as before. And since the time spent on sorting has shrunk, the impact of creating that new list got bigger.

If you want to run the benchmarks on your computer, make sure to adjust the test_sort and test_sorted functions, so they use the new MILLION_NUMBERS variable (instead of the MILLION_RANDOM_NUMBERS). Make sure you do this update for each of the following tests.
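For example, the adjusted test functions would simply reference the new variable (the ordered_list name is mine; any name works):

# sorting.py
MILLION_NUMBERS = list(range(1_000_000))

def test_sort():
    ordered_list = MILLION_NUMBERS[:]
    return ordered_list.sort()

def test_sorted():
    ordered_list = MILLION_NUMBERS[:]
    return sorted(ordered_list)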

And if we try to sort a list of 1 000 000 numbers ordered in descending order:

DESCENDING_MILLION_NUMBERS = list(range(1_000_000, 0, -1))
$ python -m timeit -s "from sorting import test_sort" "test_sort()"
20 loops, best of 5: 11.7 msec per loop

$ python -m timeit -s "from sorting import test_sorted" "test_sorted()"
20 loops, best of 5: 18.1 msec per loop

The results are almost identical to before. The sorting algorithm (Timsort) is clever enough to optimize the sorting process for a descending list.

For our last test, let’s try to sort 1 000 000 numbers where 100 000 elements are shuffled, and the rest are ordered:

# 10% of numbers are random
MILLION_SLIGHTLY_RANDOM_NUMBERS = [*range(900_000), *sample(range(1_000_000), 100_000)]
$ python -m timeit -s "from sorting import test_sort" "test_sort()"
5 loops, best of 5: 61.2 msec per loop

$ python -m timeit -s "from sorting import test_sorted" "test_sorted()"
5 loops, best of 5: 71 msec per loop

Both functions get slower as the input list becomes more scrambled.

Using list.sort() is my preferred way of sorting lists - it saves some time (and memory) by not creating a new list. But that’s a double-edged sword! Sometimes you might accidentally overwrite the initial list without realizing it (as I did with my initial benchmarks 😅). So, if you want to preserve the initial list’s order, you have to use sorted instead. And sorted can be used with any iterable, while sort only works with lists. If you want to sort a set, then sorted is your only solution.
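For instance (a small illustrative sketch, not from the benchmarks above), sorted happily accepts a set, while a set has no sort method at all:

tags = {"python", "flask", "django"}

print(sorted(tags))      # ['django', 'flask', 'python'] - a new sorted list
# tags.sort()            # AttributeError: 'set' object has no attribute 'sort'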

Conclusions

sort is slightly faster than sorted, because it doesn’t create a new list. But you might still stick with sorted if:

  1. You want to preserve the initial list, since sort modifies it in place.
  2. You need to sort an iterable other than a list, such as a set or a tuple.

If you want to learn more, the Sorting HOW TO guide from Python documentation contains a lot of useful information.

September 24, 2020 12:00 AM UTC


Matt Layman

Dynamically Regrouping QuerySets In Templates - Building SaaS #73

In this episode, we worked on a new view to display course resources. While building out the template, I used some template tags to dynamically regroup a queryset into a more useful data format for rendering. I started a new view before the stream to display content, but I had not filled it in before the stream started. We added new data to the context and made some adjustments to the URL based on the required inputs for the view.
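For readers who want the gist of the technique: Django’s built-in {% regroup %} template tag is the usual tool for this. A minimal sketch (the resources queryset and its category and title fields are hypothetical, and regroup expects the queryset to already be ordered by the grouping field):

{% regroup resources by category as resource_groups %}
{% for group in resource_groups %}
  <h2>{{ group.grouper }}</h2>
  <ul>
    {% for resource in group.list %}
      <li>{{ resource.title }}</li>
    {% endfor %}
  </ul>
{% endfor %}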

September 24, 2020 12:00 AM UTC

September 23, 2020


The No Title® Tech Blog

Book review – Effective Python, by Brett Slatkin (and a free chapter for download)

Those among you who have already learned some Python or may even have used it in some projects will certainly have heard the expression “Pythonic Code”, which conveys a general and somewhat wide meaning of “clean code and good software development practices in the context of Python”. With Effective Python, the author presents you with nothing less than 90 practical examples on how to adopt a pythonic developer mindset and how to write better Python code.

September 23, 2020 11:04 PM UTC


Python Engineering at Microsoft

Python in Visual Studio Code – September 2020 Release

We are pleased to announce that the September 2020 release of the Python Extension for Visual Studio Code is now available. You can download the Python extension from the Marketplace, or install it directly from the extension gallery in Visual Studio Code. If you already have the Python extension installed, you can also get the latest update by restarting Visual Studio Code. You can learn more about Python support in Visual Studio Code in the documentation.

This was a short release where we addressed a total of 34 issues, and it includes support for colorization and auto-import improvements with Pylance, our new language server extension for Python in VS Code.

If you’re interested, you can check the full list of improvements in our changelog.

Support for semantic colorization in Pylance 

We are excited to announce that you can now get support for semantic colorization with Pylance, helping to improve the readability of your code. Semantic colorization is an extension of syntax highlighting: Pylance generates semantic tokens, which are used by themes to apply colors based on the semantic meaning of symbols (e.g. variables, functions, and modules all have different colors applied to them). To see this new feature in action, you’ll need to apply a theme that supports semantic color. Some great themes to try out semantic colorization with are the built-in Dark+ theme or One Dark Pro.

Check out the before and after on this code sample with semantic colorization!

Python code with semantic colorization

Pylance auto-import improvements 

With improved auto-import completions, you can now see a clearer preview of the import statement that will be added to your file in the completion tooltip. The way that Pylance adds imports to your file has also been improved by detecting when you’ve already imported other submodules or functions from that module. Instead of adding a duplicate import statement to your file, Pylance will now amend the existing one by adding the symbol alphabetically to the statement, helping to keep your imports organized.

Preview of the import statement on tooltip for auto import.

Other changes and enhancements 

We have also added small enhancements and fixed issues requested by users that should improve your experience working with Python in Visual Studio Code. Some notable changes include: 

We’re constantly A/B testing new features. If you see something different that was not announced by the team, you may be part of an experiment! To see if you are part of an experiment, you can check the first lines in the Python extension output channel. If you wish to opt out of A/B testing, you can open the user settings.json file (View > Command Palette… and run Preferences: Open Settings (JSON)) and set the “python.experiments.enabled” setting to false.
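For reference, a minimal user settings.json with the opt-out applied would look like this:

{
    "python.experiments.enabled": false
}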

Be sure to download the Python extension for Visual Studio Code now to try out the above improvements. If you run into any problems or have suggestions, please file an issue on the Python VS Code GitHub page.

The post Python in Visual Studio Code – September 2020 Release appeared first on Python.

September 23, 2020 05:20 PM UTC


Patrick Kennedy

Application and Request Contexts in Flask

I wrote two blog posts on TestDriven.io about how the Application and Request contexts are handled in Flask:

  1. Basics: Understanding the Application and Request Contexts in Flask
  2. Advanced: Deep Dive into Flask’s Application and Request Contexts

The first blog post provides examples of how the Application and Request contexts work, including how current_app, request, test_client, and test_request_context can be used effectively to avoid pitfalls with these contexts.

The second blog post provides a series of diagrams illustrating how the Application and Request contexts are processed when a request is handled in Flask. This post also dives into how LocalStack objects work; these are the objects that implement the Application Context Stack and the Request Context Stack.
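As a minimal sketch of the core idea from the first post (the query parameter here is illustrative, not taken from the articles): outside of a request, the current_app and request proxies are unbound, but pushing a test request context binds both.

from flask import Flask, current_app, request

app = Flask(__name__)

# Pushing a test request context makes both proxies resolve
with app.test_request_context("/?name=world"):
    print(current_app.name)       # the application's import name
    print(request.args["name"])   # 'world'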

September 23, 2020 02:49 PM UTC


Real Python

Python Community Interview With David Amos

This week I’m joined by David Amos, the content technical lead here at Real Python.

In this interview, we talk about David’s love of LEGO and mathematics. We also talk about the Python Basics book, which is soon to be out of early access, and his involvement with PyCoder’s Weekly. So, without further ado, let’s get started.

Ricky: Thank you for joining me, David. Many of our readers and members may already know your background, but for those who don’t, let’s ask the inevitable questions: How did you get into programming, and when did you start using Python?

David Amos

David: I discovered programming by accident when I came across the source code for the Gorillas game on my parents’ IBM 386 PS/2 computer. I guess I was about seven or eight years old. I found something called a .BAS file that opened up a program called QBasic and had all sorts of strange-looking text in it. I was instantly intrigued!

There was a note at the top of the file that explained how to adjust the game speed. I changed the value and ran the game. The effect was instantly noticeable. It was a thrilling experience.

I was obsessed with learning to program in QBasic. I made my own text adventure games. I even made a few animations using simple geometric shapes. It was tons of fun!

QBasic was a fantastic language for an eight-year-old kid to learn. It was challenging enough to keep me interested but easy enough to get quick results, which is really important for a child.

When I was around ten years old, I tried to teach myself C++. The ideas were too complex, and results came too slowly. After a few months of struggling, I stopped. But the idea of programming computers remained attractive to me—enough so that I took a web technology class in high school and learned the basics of HTML, CSS, and JavaScript.

In college, I decided to major in mathematics, but I needed a minor. I chose computer science because I thought having some experience with programming would make it easier to complete the degree requirements.

I learned about data structures with C++. I took an object-oriented programming class with Java. I studied operating systems and parallel computing with C. My programming horizons expanded vastly, and I found the whole subject pleasing both practically and intellectually.

At that time, I viewed programming as a tool to help me with mathematics research. In graduate school, I wrote programs to generate examples and test ideas for my research projects.

It was during graduate school, around 2013, that I found Python and pretty much instantly fell in love. I’d been using C++, MATLAB, and Mathematica as my primary research tools, but Python allowed me to focus on the research problem without getting caught up in the code.

And with Python’s awesome ecosystem of tools for scientific computing, like NumPy, SciPy, PuLP, and NetworkX, I had everything I needed to tackle problems like I would with MATLAB but in a much more expressive manner!

Ricky: You often hear the myth that a strong mathematics background is a prerequisite to be a programmer. While I think you’ll agree that it’s not always necessary for programmers to know advanced math, I’m curious to know how your math and data science background has helped you when writing code.

Read the full article at https://realpython.com/interview-david-amos/ »


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

September 23, 2020 02:00 PM UTC


PyCharm

PyCharm 2020.3 EAP – Starts now!

The Early Access Program for our next major release, PyCharm 2020.3, is now open! If you are always looking forward to the next ‘big thing’ we encourage you to join the program and share your thoughts on the latest PyCharm improvements!

pycharm EAP program

If you are not familiar with our EAP program, here are some ground rules:

Highlighted feature

Configurable syntax highlighting for inner functions (PY-33235)

  1. In the Settings/Preferences dialog, go to Editor | Color Scheme | Python.
  2. Select any code element you want to customize and clear the corresponding “Inherit values from” checkbox to override the inherited color settings for this element; then specify your own color and font settings.

For example, you can set custom highlighting for nested functions. From the list of code elements, select Nested function definitions, clear the “Inherit values from” checkbox, and specify the element’s foreground and background colors. Click OK to save the changes.

Define custom font and color settings for Python

More features and fixes present on this EAP build

Interested?

Download this EAP from our website. Alternatively, you can use the JetBrains Toolbox App to stay up to date throughout the entire EAP.
If you’re on Ubuntu 16.04 or later, you can use snap to get PyCharm EAP and stay up to date. You can find the installation instructions on our website.

September 23, 2020 01:52 PM UTC