
Planet Python

Last update: March 08, 2021 10:45 AM UTC

March 08, 2021


Mike Driscoll

PyDev of the Week: Jens Winkelmann

This week we welcome Jens Winkelmann (@jWinman) as our PyDev of the Week! Jens is a PhD researcher in the Foams and Complex System Group at Trinity College Dublin (TCD). You can find out more about what Jens does on his web page. Jens is also a conference speaker.

Let’s spend a few moments getting to know Jens better!

Can you tell us a little about yourself (hobbies, education, etc):

I was born and raised in the beautiful city of Essen, Germany, where I also currently live and work again after a couple of years abroad.

I obtained a B.Sc. and an M.Sc., both in Physics from TU Dortmund (Germany), in 2013 and 2015, respectively. At the end of 2015, I moved to Dublin, Ireland, to pursue a PhD in Physics in the Foams and Complex Systems research group of Trinity College Dublin, from which I graduated last year.

In December 2019 I returned to Essen and am working here now as a Data Scientist at talpasolutions GmbH. Talpasolutions is the leading driver of the Industrial Internet of Things in the heavy industry. We build digital products that offer actionable insights for machine manufacturers as well as operators based on collected machine sensor data.

In my free time I enjoy climbing, both rope climbing as well as bouldering. It is a great sport because it combines mental focus with physical workout and can be individual or communal as much as you like.

Why did you start using Python?

I started using Python for the data analysis and plotting parts of the Physics labs during my undergrad at TU Dortmund. Some friends of my study group who were more familiar with programming languages introduced me to it. They quickly convinced me that it reduces my stress level for the Physics labs tremendously in the long run compared to Excel.

First, I used it for typical tasks in Physics labs where you analyse and then plot experimental data using NumPy and Matplotlib. Over time the data analysis became more and more complex. I also used it for my Bachelor, Master and later on PhD thesis, where I analysed and visualised large amounts of data created by computer simulations. It was only then that I fully appreciated what a powerful tool Python can be.

What other programming languages do you know and which is your favorite?

I also learned C/C++ in an introductory coding lecture as well as part of a Computational Physics lecture. I implemented a hydrodynamic simulation in C/C++ for my Bachelor as well as my Master thesis. Computational speed was quite essential there and everything needed to be programmed from scratch, so Python was unfortunately not an option for this.

I also got a bit into functional programming through a lecture about Haskell during my Master studies. But the only learning that remained is the functools package in Python, which provides some functional programming tools.

Python is by far my favourite programming language at the moment. Since it is so straightforward, it allows me to fully focus on the problem that I’d like to solve rather than getting distracted by unnecessary boilerplate code. This, together with Python’s large ecosystem ranging from NumPy to tensorflow and keras, makes it a powerful tool in the repertoire of a Data Scientist.

What projects are you working on now?

Most of my current projects are related to my work as a Data Scientist at talpasolutions, where I analyse data from the world’s largest machines that are being used in the mining industry. Our data science solutions increase overall equipment efficiency and operational productivity, predict possible maintenance downtimes, and also have an ecological impact: for example, we help our customers to reduce their diesel consumption and thus save CO2 emissions.

There are two particular projects or use cases that I’m currently involved in:

Our activity detection algorithms are comparable to object detection in image recognition. The sensors of a heavy machine such as a truck or excavator can be used to classify its current activity state. A truck, for instance, may be loading, dumping, idle, driving loaded, or driving unloaded. Based on sensor signals such as payload, speed, and dump angle, our algorithms infer its activity state. Activity detection algorithms are crucial because they form the basis for digital monitoring of the mine’s productivity and for further analytical tools in our software. Based on these algorithms, we provide actionable insights to our users that optimise their mine operations, e.g.: What is the average loading time of a truck? What are the largest efficiency losses in the mine operation?

The goal behind predictive maintenance is to reduce the mine operator’s maintenance costs, which arise either from unplanned downtimes or from component failures. Our algorithms achieve this goal by predicting unplanned downtimes based on the machine’s historical data. The analytical results are then displayed in our software solution to inform the right person at the right time. With unplanned downtime quickly costing more than $1000 per truck per hour, the importance of this issue is indisputable. One example strategy involves live-casting sensor data and applying anomaly detection. For this strategy, we employ a neural network to detect possible anomalous behaviours in sensor signals such as the suspension pressure.

If this got you excited about my Data Science work, feel free to watch my talk at the pyjamas conference (an online conference dedicated to Python) on YouTube.

Another project, unrelated to my job as a Data Scientist, is writing an academic book titled Columnar Structures of Spheres: Fundamentals and Applications together with Professor Ho-Hei Chan from the Harbin Institute of Technology in China. The book covers the topic of my PhD thesis about so-called ordered columnar structures that we investigated using computer simulations in Python. Such structures occur when identical spheres are packed densely inside a cylindrical confinement (for more details, check out the Wikipedia article). We simulated such structures by employing optimisation algorithms in Python, which helped us to discover a novel experimental foam structure, a so-called line-slip structure.

The full range of their applications is still under discovery, but so far they have been found in foam structures (like beer foam), botany, and nanoscience. My personal favourite application is that of a photonic metamaterial. Such materials are characterised by having a negative refractive index, which allows them to be used for superlenses or cloaking. Some of our structures are potential candidates for such a material.

Because of Covid-19, we actually made good progress on the writing lately. The book is now planned to be published in the summer of 2021 by Jenny Stanford Publishing.

Which Python libraries are your favorite (core or 3rd party)?

The Python ecosystem provides an amazing variety of well-developed libraries for Data Scientists. They all serve different purposes. Among the ones I use most often:

I especially like Matplotlib because of how versatile it is in creating graphs and data visualisations. But of course, Plotly shouldn’t go unmentioned here either. Matplotlib falls a bit short when plotting large amounts of data in an interactive graph. This is where Plotly really shines.

What drew you to data science?

In retrospect, it seems like Data Science is the natural path after studying Physics. But winding the clock back to when I was starting my Physics undergrad degree, I didn’t even know what Data Science was.

During my PhD in Dublin, I came across the Python Ireland community and participated in a few of the monthly meet-ups as well as the Python Conference in 2016. The talks and discussions with people at these meet-ups made me curious about Data Science. What I really liked about Data Science was the fact that it provided a way to do Science outside of Academia. On top of this, my Python skills turned out to be quite useful for Data Science as well.

So after I finished my PhD in Dublin, I decided to apply for a couple of positions in Germany and Ireland, including my current position at talpasolutions in my hometown Essen.

Talpasolutions stood out to me from all the other companies that I applied to because talpasolutions’ mission has meaning to me. By developing digital products for the mining industry, we improve the working conditions of heavy industry workers and we make the industry more environmentally friendly by reducing its carbon footprint.

Additionally, the mining industry has a long and famous history in Essen. Even though the last mines have been closed for years, it feels like we at talpasolutions are carrying on the spirit of that era. Since Essen is my hometown, I really enjoy working here. For many other Data Science positions, I would be starving for meaning, because what lots of those companies do is make people click ads or make rich people richer.

Can people without math backgrounds get into data science? Why or why not?

I think a solid foundation of math skills, especially statistics, is essential for Data Science. It is important to understand the math behind the models that you employ as a Data Scientist. The math background helps you to optimise your model and to avoid over- or underfitting.

But you don’t need to be a math genius, because the Data Science work in most companies consists only of applying and optimising already developed (machine learning) models on their data. Data Scientists at FAANG companies or research facilities are mainly the ones developing completely new algorithms. In that case, of course, your math skills had better be in good shape.

Similar to Computer Science, Data Science spans a broad spectrum, and it will continue to broaden in the future. I’d say there are some Data Science fields that require more mathematics skills and some that require less. We at talpasolutions deal entirely with numerical data from the engineering world, which requires a certain degree of mathematical understanding from all our developers.

Is there anything else you’d like to say?

As final words, I’d like to say thank you for giving me the opportunity to answer your questions here. I hope my answers got your blog audience intrigued and more eager than ever to learn more about Data Science. I also would like to thank my friend Sanyo for proofreading my answers and making sure that they make crystal-clear sense.

Thanks for doing the interview, Jens!

The post PyDev of the Week: Jens Winkelmann appeared first on Mouse Vs Python.

March 08, 2021 06:05 AM UTC


John Ludhi/nbshare.io

Strftime and Strptime In Python

Strftime and Strptime In Python

In this post, we will learn about the strftime() and strptime() methods from the Python datetime package.

Python Strftime Format

The strftime() method converts a date object to a date string.

The syntax of the strftime() method is...

dateobject.strftime(format)

where format is the desired format of the date string. The format is built using the codes shown in the table below...

Code Meaning
%a Weekday as Sun, Mon
%A Weekday as full name as Sunday, Monday
%w Weekday as decimal no as 0,1,2...
%d Day of month as 01,02
%b Months as Jan, Feb
%B Months as January, February
%m Months as 01,02
%y Year without century as 11,12,13
%Y Year with century 2011,2012
%H 24 Hours clock from 00 to 23
%I 12 Hours clock from 01 to 12
%p AM, PM
%M Minutes from 00 to 59
%S Seconds from 00 to 59
%f Microseconds 6 decimal numbers

Datetime To String Python using strftime()

Example: Convert current time to date string...

In [8]:
from datetime import datetime
now = datetime.now()
print(now)
2021-03-07 23:24:11.192196

Let us now convert the above datetime object to a datetime string.

In [2]:
now.strftime("%Y-%m-%d %H:%M:%S")
Out[2]:
'2021-03-07 23:16:41'

If you want to print the month as the locale’s abbreviated name, replace %m with %b as shown below...

In [3]:
now.strftime("%Y-%b-%d %H:%M:%S")
Out[3]:
'2021-Mar-07 23:16:41'

Another example...

In [4]:
now.strftime("%Y/%b/%A %H:%M:%S")
Out[4]:
'2021/Mar/Sunday 23:16:41'
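As another quick illustration (not part of the original examples), the 12-hour clock codes %I and %p from the table can be used in the same way:

now.strftime("%Y-%m-%d %I:%M %p")
# e.g. '2021-03-07 11:16 PM' for the datetime above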

Date To String Python using strftime()

Date to string is quite similar to datetime to string Python conversion.

Example: Convert current date object to Python date string.

In [5]:
today = datetime.today()
print(today)
2021-03-07 23:22:15.341074

Let us convert the above date object to Python date string using strftime().

In [6]:
today.strftime("%Y-%m-%d %H:%M:%S")
Out[6]:
'2021-03-07 23:22:15'

Python Strftime Milliseconds

To get a date string with the fractional seconds (microseconds), add the %f format code at the end as shown below...

In [7]:
today = datetime.today()
today.strftime("%Y-%m-%d %H:%M:%S.%f")
Out[7]:
'2021-03-07 23:23:50.851344'

Python Strptime Format

The strptime() method is used to convert a string to a datetime object.

strptime(date_string, format)

example:

strptime("9/23/20", "%d/%m/%y")

Note - format "%d/%m/%y" represents the the corresponding "9/23/20" format. The output of the above command will be a Python datetime object.
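To double-check (a quick illustrative sketch, not from the original post), parsing the string with that format gives a datetime in September 2020:

from datetime import datetime

dt = datetime.strptime("9/23/20", "%m/%d/%y")
print(dt)  # 2020-09-23 00:00:00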

The format is constructed using the same pre-defined codes as strftime; the most important ones are listed in the table above.

Python Datetime Strptime

Example: Convert date string to Python datetime object.

In [9]:
import datetime
datetime.datetime.strptime("09/23/2030 8:28","%m/%d/%Y %H:%M")
Out[9]:
datetime.datetime(2030, 9, 23, 8, 28)

March 08, 2021 01:40 AM UTC


Codementor

How I learned Python

About me: I am a senior software developer with over 8 years of experience. I currently work at an IT company as a developer. I like picking up new technologies and challenges. I love programming and...

March 08, 2021 01:18 AM UTC


Matthew Wright

How to remove a column from a DataFrame, with some extra detail

Removing one or more columns from a pandas DataFrame is a pretty common task, but it turns out there are a number of possible ways to perform this task. I found that this StackOverflow question, along with the solutions and discussion in it, raised a number of interesting topics. It is worth digging in a little bit to the … Continue reading How to remove a column from a DataFrame, with some extra detail

The post How to remove a column from a DataFrame, with some extra detail appeared first on wrighters.io.

March 08, 2021 12:14 AM UTC

March 07, 2021


Cusy

New: Pattern Matching in Python 3.10

Python, originally an object-oriented programming language, is to receive a new feature in version 3.10 that is mainly known from functional languages: pattern matching. The change is controversial in the Python community and has triggered a heated debate.

Pattern matching is a symbol-processing method that uses a pattern to identify discrete structures or subsets, e.g. strings, trees or graphs. This procedure is found in functional or logical programming languages where a match expression is used to process data based on its structure, e.g. in Scala, Rust and F#. A match statement takes an expression and compares it to successive patterns specified as one or more cases. This is superficially similar to a switch statement in C, Java or JavaScript, but much more powerful.

Python 3.10 is now also to receive such a match expression. The implementation is described in PEP (Python Enhancement Proposal) 634. [1] Further information on the plans can be found in PEP 635 [2] and PEP 636 [3]. How pattern matching is supposed to work in Python 3.10 is shown by this very simple example, where a value is compared with several literals:

def http_error(status):
    match status:
        case 400:
            return "Bad request"
        case 401:
            return "Unauthorized"
        case 403:
            return "Forbidden"
        case 404:
            return "Not found"
        case 418:
            return "I'm a teapot"
        case _:
            return "Something else"

In the last case of the match statement, an underscore _ acts as a placeholder that intercepts everything. This has caused irritation among developers because an underscore is usually used in Python before variable names to declare them for internal use. While Python does not distinguish between private and public variables as strictly as Java does, it is still a very widely used convention that is also specified in the Style Guide for Python Code [4].

However, the proposed match statement does not only check patterns, i.e. detect a match between the value of a variable and a given pattern; it also rebinds the variables that appear in the matching pattern.

This leads to the fact that in Python we suddenly have to deal with Schrödinger constants, which only remain constant until we take a closer look at them in a match statement. The following example is intended to explain this:

NOT_FOUND = 404
retcode = 200

match retcode:
    case NOT_FOUND:
        print('not found')

print(f"Current value of {NOT_FOUND=}")

This results in the following output:

not found
Current value of NOT_FOUND=200
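For completeness (an illustrative sketch requiring a Python 3.10 interpreter, not part of the original article): constants reached through a dotted name, e.g. attributes of a class or an enum, are treated as value patterns rather than capture patterns, so they are compared instead of being rebound.

class Codes:
    NOT_FOUND = 404

retcode = 200

match retcode:
    case Codes.NOT_FOUND:
        print('not found')
    case _:
        print('no match')

print(f"Current value of {Codes.NOT_FOUND=}")
# no match
# Current value of Codes.NOT_FOUND=404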

This behaviour leads to harsh criticism of the proposal from experienced Python developers such as Brandon Rhodes, author of «Foundations of Python Network Programming»:

If this poorly-designed feature is really added to Python, we lose a principle I’ve always taught students: “if you see an undocumented constant, you can always name it without changing the code’s meaning.” The Substitution Principle, learned in algebra? It’ll no longer apply.

— Brandon Rhodes on 12 February 2021, 2:55 pm on Twitter [5]

Many long-time Python developers, however, are not only grumbling about the structural pattern-matching that is to come in Python 3.10. They generally regret developments in recent years in which more and more syntactic sugar has been sprinkled over the language. Original principles, as laid down in the Zen of Python [6], would be forgotten and functional stability would be lost.

Although Python has defined a sophisticated process with the Python Enhancement Proposals (PEPs) [7] that can be used to collaboratively steer the further development of Python, there is always criticism on Twitter and other social media, as is the case now with structural pattern matching. In fact, the topic has already been discussed intensively in the Python community. The Python Steering Council [8] recommended adoption of the Proposals as early as December 2020. Nevertheless, the topic only really boiled up with the adoption of the Proposals. The reason for this is surely the size and diversity of the Python community. Most programmers are probably only interested in discussions about extensions that solve their own problems. The other developments are overlooked until the PEPs are accepted. This is probably the case with structural pattern matching. It opens up solutions to problems that were hardly possible in Python before. For example, it allows data scientists to write matching parsers and compilers for which they previously had to resort to functional or logical programming languages.

With the adoption of the PEP, the discussion has now been taken into the wider Python community. Incidentally, Brett Cannon, a member of the Python Steering Council, pointed out in an interview [9] that the last word has not yet been spoken: until the first beta version, there is still time for changes if problems arise in practically used code. He also held out the possibility of changing the meaning of _ once again.

So maybe we will be spared Schrödinger’s constants.


[1]PEP 634: Specification
[2]PEP 635: Motivation and Rationale
[3]PEP 636: Tutorial
[4]https://pep8.org/#descriptive-naming-styles
[5]@brandon_rhodes
[6]PEP 20 – The Zen of Python
[7]Index of Python Enhancement Proposals (PEPs)
[8]Python Steering Council
[9]Python Bytes Episode #221

March 07, 2021 01:48 PM UTC


Python Pool

5 Best Ways to Find Python String Length

What is Python String Length?

Python string length is found using the len() function. len() is an inbuilt function in Python that returns the length of a given string, array, list, tuple, dictionary, etc.

The len() function is also efficient: the number of elements is stored with the object rather than recalculated on every call, so len() can return it immediately.

Syntax

len(string)

Parameters

string : the string (or other sized object) whose length will be calculated.

Return Value

It returns an integer value, i.e. the length of the given string.

Various Type Of Return Value

  1. String
  2. Empty
  3. Collection
  4. Type Error
  5. Dictionary

1. String:

len() returns the number of characters present in the string, including punctuation, spaces, and all types of special characters. However, be careful when taking the length of a None value (see Type Error below).

2. Empty:

For an empty string, len() returns 0 because it contains zero characters; note that an empty string is not the same as None.

3. Collections:

The built-in len() function returns the number of elements in the collection.

4. Type Error:

The behaviour of len() depends on the type of the object passed to it. NoneType has no len() support, so calling len(None) raises a TypeError.

5. Dictionary:

For a dictionary, each key-value pair is counted as one unit; keys and values are not counted independently.
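A short sketch (not from the original post) illustrating these cases:

# len() on different types of objects
print(len("hello!"))           # 6 -> string, punctuation included
print(len(""))                 # 0 -> empty string
print(len([1, 2, 3]))          # 3 -> collection (list)
print(len({"a": 1, "b": 2}))   # 2 -> dictionary: each key-value pair counts once

try:
    len(None)                  # NoneType has no len() support
except TypeError as error:
    print("TypeError:", error)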

Ways to find the length of string in Python


1. Using the built-in function len()

# Python code to demonstrate string length  
# using len 

str = 'Latracal'
print(len(str))

output:

8

Explanation:

In this code, we store the string ‘Latracal’ in the variable str and then pass str to len(). The output is 8, as the word ‘Latracal‘ contains 8 characters.

2. Using for loop to Find the length of the string in python

A string can be iterated over easily and directly in a for loop. Maintaining a count of the number of iterations gives the length of the string.

# Python code to demonstrate string length  
# using for loop 
  
# Returns length of string 
def findLength(str): 
    counter = 0    
    for i in str: 
        counter += 1
    return counter 
  
  
str = "Latracal"
print(findLength(str))

output:

8

Explanation:

In this code, we use a for loop to find the length of the string. First, we store the string ‘Latracal’ in the variable str. Then we call the findLength function, which starts with counter equal to 0, iterates over the string character by character, and increases counter by 1 on each iteration. Finally, we print the returned counter value.

3. Using while loop and Slicing

We slice the string, making it shorter by one character with each iteration, until it becomes the empty string. This is when the while loop stops. Maintaining a count of the number of iterations gives the length of the string.

# Python code to demonstrate string length  
# using while loop. 
  
# Returns length of string 
def findLength(str): 
    count = 0
    while str[count:]: 
        count = count + 1
    return count 
  
str = "LatracalSolutions"
print(findLength(str)) 

output:

17

Explanation:

In this code, we use a while loop to find the length of the string. First, we store the string ‘LatracalSolutions’ in the variable str. Then we call the findLength function, which sets count to 0 and applies a while loop that slices one more character off str at each iteration until the remaining string becomes empty. Finally, it returns the count value.

4. Using string methods join and count

The join method of strings takes an iterable and returns a string that is the concatenation of its items, with the string on which the method is called used as the separator between elements. By joining the original string’s characters with a known separator and then counting the separator’s occurrences (plus one), we obtain the string’s length.

# Python code to demonstrate string length  
# using join and count 
  
# Returns length of string 
def findLength(str): 
    if not str: 
        return 0
    else: 
        some_random_str = 'py'
        return ((some_random_str).join(str)).count(some_random_str) + 1
  
str = "LatracalSolutions"
print(findLength(str))

output:

17

Explanation:

In this code, we use join and count to find the length of the string. First, we store the string ‘LatracalSolutions’ in the variable str. Then we call the findLength function, where the if branch returns 0 when the string is empty; otherwise, the else branch uses a random separator string ‘py’ to join the characters of the main string, counts the occurrences of the separator, and adds 1. After that, the result gets printed.

5. Using getsizeof() method to Find Length Of String In Python

This method returns the object’s storage size, i.e. how much space it occupies in memory.

Note: This trick only works for plain ASCII strings. If the string contains a special character it will not give the right answer, as it relies on the size of the string in bytes. So be careful while using it!

import sys
s = "pythonpool"
print(sys.getsizeof(s) - sys.getsizeof(""))

Output:

10

Explanation:

Here, we use the built-in sys module. We take a string s and print its length by subtracting the size of an empty string from the result of sys.getsizeof(s).
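To see the caveat in action (an illustrative sketch, not from the original post): with a non-ASCII character, the byte-based trick no longer matches len().

import sys

s = "pythonpool"
print(len(s), sys.getsizeof(s) - sys.getsizeof(""))   # 10 10

t = "pythonpoolé"  # contains one non-ASCII character
print(len(t), sys.getsizeof(t) - sys.getsizeof(""))   # 11 and a larger, non-matching number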

Example to Find Length of String in Python

# Python code to demonstrate string length
# testing len() 
str1 = "Welcome to Latracal Solutions Python Tutorials"
print("The length of the string  is :", len(str1))

Output:

The length of the string  is : 46


Summary: Python String Length

We’ve seen 5 different ways of finding the string length, but in conclusion, only one of them is practical: the built-in len() function is the best way to find the length of a string in any format.

However, if you have any doubts or questions, do let me know in the comment section below. I will try to help you as soon as possible.

Happy Pythoning!

The post 5 Best Ways to Find Python String Length appeared first on Python Pool.

March 07, 2021 01:34 PM UTC


John Ludhi/nbshare.io

Python Generators

Python Generators

Python generators are very powerful for handling operations which would otherwise require large amounts of memory.

Let us start with a simple example. The function below yields an infinite sequence of numbers.

In [1]:
def generator_example1():
    count = 0
    while True:
        yield count
        count+=1
In [2]:
g = generator_example1()
In [3]:
next(g)
Out[3]:
0
In [4]:
next(g)
Out[4]:
1
In [5]:
next(g)
Out[5]:
2

and so on...

Python Yield

Ok, let us revisit our function 'generator_example1()'. What is happening in the code below?

Inside the while loop, we have a 'yield' statement. Yield breaks out of the loop and gives control back to whoever called the function generator_example1(). In the statement 'g = generator_example1()', g is now a generator, as shown below.

In [6]:
def generator_example1():
    count = 0
    while True:
        yield count
        count+=1
In [7]:
g = generator_example1()
In [8]:
g
Out[8]:
<generator object generator_example1 at 0x7f3334416e08>

Once you have a generator, you can iterate through it using the next() function. Since we have an infinite 'while' loop in the generator_example1() function, we can call the iterator as many times as we want. Each time we use next(), the generator resumes execution from its previous position and yields a new value.

Python Generator Expression

Python generators can also be created outside a function, without the 'yield' keyword, using generator expressions. Check out the example below.

In [9]:
g = (x for x in range(10))
In [10]:
g
Out[10]:
<generator object <genexpr> at 0x7f3334416f68>

(x for x in range(10)) is a Python generator object. The syntax is quite similar to a Python list comprehension, except that instead of square brackets, generator expressions use round brackets. As usual, once we have a generator object, we can call next() on it to get the values, as shown below.

In [11]:
next(g)
Out[11]:
0
In [12]:
next(g)
Out[12]:
1

Python Generator stop Iteration

Python generators throw a 'StopIteration' exception if there is no more value for the iterator to return.

Let us look at following example.

In [13]:
def range_one():
    for x in range(0,1):
        yield x
In [14]:
g = range_one()
In [15]:
next(g)
Out[15]:
0
In [16]:
next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-16-e734f8aca5ac> in <module>
----> 1 next(g)

StopIteration: 

To avoid the above error, we can catch the exception like this and stop the iteration.

In [17]:
g = range_one()
In [18]:
try:
    print(next(g))
except StopIteration:
    print('Iteration Stopped')
0
In [19]:
try:
    print(next(g))
except StopIteration:
    print('Iteration Stopped')
Iteration Stopped
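Alternatively (a small sketch, not part of the original notebook), a for loop over the range_one generator defined above consumes it and handles StopIteration for us:

# The for loop calls next() behind the scenes and stops cleanly
# when the generator raises StopIteration.
for value in range_one():
    print(value)   # prints 0 and then the loop ends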

Python Generator send()

We can pass a value into a Python generator using the send() method.

In [20]:
def incrment_no():
    while True:
        x = yield
        yield x + 1
In [21]:
g = incrment_no()    # Create our generator
In [22]:
next(g) # It will go to first yield
In [23]:
print(g.send(7)) # value 7 is sent to the generator and gets assigned to x; the 2nd yield statement then executes
8

Python Recursive Generator

Python generators can be used recursively. Check out the code below. In the function below, "yield from generator_factorial(n - 1)" is a recursive call to the function generator_factorial().

In [24]:
def generator_factorial(n):
    if n == 1:
        f = 1
    else:
        a = yield from generator_factorial(n - 1)
        f = n * a
    yield f
    return f
In [25]:
g = generator_factorial(3)
In [26]:
next(g)
Out[26]:
1
In [27]:
next(g)
Out[27]:
2
In [28]:
next(g)
Out[28]:
6

Python Generator throw() Error

Continuing with the above example, let us say we want the generator to throw an error for the factorial of a number greater than 100. We can raise an exception inside the generator with generator.throw(), as shown below.

In [29]:
n  = 100
if n >= 100:
    g.throw(ValueError, 'Only numbers less than 100 are allowed')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-bf449f9fafac> in <module>
      1 n  = 100
      2 if n >= 100:
----> 3     g.throw(ValueError, 'Only numbers less than 100 are allowed')

<ipython-input-24-e76bd978ab03> in generator_factorial(n)
      5         a = yield from generator_factorial(n - 1)
      6         f = n * a
----> 7     yield f
      8     return f

ValueError: Only numbers less than 100 are allowed

Python Generators Memory Efficient

Python generators take very little memory. Let us look at the following two examples. In the examples below, note the difference in the number of bytes of memory used by a 'Python list' vs a 'Python generator'.

In [30]:
import sys
In [31]:
#Python List comprehension
sequence = [x for x in range(1,1000000)]
sys.getsizeof(sequence)
Out[31]:
8697464
In [32]:
#Python Generators
sequence = (x for x in range(1,1000000))
sys.getsizeof(sequence)
Out[32]:
88

Python Generator Performance

One thing to notice here is that Python generators are slower than Python list comprehensions if there is enough memory for the computation. Let us look at the two examples below from a performance perspective.

In [33]:
#Python List comprehension
import cProfile
cProfile.run('sum([x for x in range(1,10000000)])')
         5 function calls in 0.455 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.327    0.327    0.327    0.327 <string>:1(<listcomp>)
        1    0.073    0.073    0.455    0.455 <string>:1(<module>)
        1    0.000    0.000    0.455    0.455 {built-in method builtins.exec}
        1    0.054    0.054    0.054    0.054 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


In [34]:
#generators
import cProfile
cProfile.run('sum((x for x in range(1,10000000)))')
         10000004 function calls in 1.277 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10000000    0.655    0.000    0.655    0.000 <string>:1(<genexpr>)
        1    0.000    0.000    1.277    1.277 <string>:1(<module>)
        1    0.000    0.000    1.277    1.277 {built-in method builtins.exec}
        1    0.622    0.622    1.277    1.277 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


Compare the number of function calls and the time the 'Python generator' took to compute the sum with those of the Python 'list comprehension'.

Data Pipeline with Python Generator

Let us wrap up this tutorial with Data Pipelines. Python generators are great for building the pipelines.

Let us open a CSV file and iterate through it using Python generator.

In [41]:
def generator_read_csv_file():
    for entry in open('stock.csv'):
        yield entry
In [42]:
g = generator_read_csv_file()
In [43]:
next(g)
Out[43]:
'Date,Open,High,Low,Close,Adj Close,Volume\n'
In [44]:
next(g)
Out[44]:
'1996-08-09,14.250000,16.750000,14.250000,16.500000,15.324463,1601500\n'

Let us say we want to replace the commas in each line of the CSV with spaces; we can build a pipeline for this.

In [45]:
g1 = (entry for entry in open('stock.csv'))
In [46]:
g2 = (row.replace(","," ") for row in g1)
In [47]:
next(g2)
Out[47]:
'Date Open High Low Close Adj Close Volume\n'
In [48]:
next(g2)
Out[48]:
'1996-08-09 14.250000 16.750000 14.250000 16.500000 15.324463 1601500\n'
In [50]:
next(g2)
Out[50]:
'1996-08-12 16.500000 16.750000 16.375000 16.500000 15.324463 260900\n'
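We could keep extending the pipeline. For instance (a hypothetical extra stage, not part of the original notebook), a third generator could split each cleaned row into a list of fields:

g3 = (row.split() for row in g2)
next(g3)  # e.g. ['1996-08-13', '16.500000', ...] for the next unread row of stock.csv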

Wrap Up:

It takes a little practice to get the hang of Python generators, but once mastered, they are very useful not only for building data pipelines but also for handling large data operations such as reading a large file.

March 07, 2021 01:40 AM UTC

March 06, 2021


Zero-with-Dot (Oleg Żero)

Why using SQL before using Pandas?

Introduction

Data analysis is one of the most essential steps in any data-related project. Regardless of the context (e.g. business, machine-learning, physics, etc.), there are many ways to get it right… or wrong. After all, decisions often depend on actual findings, and at the same time, nobody can tell you what to find before you have found it.

For these reasons, it is important to try to keep the process as smooth as possible. On one hand, we want to get into the essence quickly. On the other, we do not want to complicate the code. If cleaning the code takes longer than cleaning data, you know something is not right.

In this article, we focus on fetching data. More precisely, we show how to get the same results with both Python’s key analytics library, namely Pandas, and SQL. Using an example dataset (see later), we describe some common patterns related to preparation and data analysis. Then we explain how to get the same results with either of them and discuss which one may be preferred. So, irrespective of whether you know one way but not the other, or you feel familiar with both, we invite you to read this article.

The example dataset

“NIPS Papers” from Kaggle will serve as our example dataset. We have purposely chosen this dataset for the following reasons:

You are welcome to explore it yourself. However, as we focus on the methodology, we limit the discussion of the content to a bare minimum. Let’s dive in.

Fetching data

The way you pick the data depends on its format and the way it is stored. If the format is fixed, we get very little choice on how we pick it up. However, if the data sits in some database, we have more options.

The simplest and perhaps the most naive way is to fetch it table after table and store them locally as CSV files. This is not the best approach for two main reasons:

You don’t want any of these. However, if you have no clue about the data, fetching everything is the safest option.

Let’s take a look at what tables and columns the “NIPS” dataset has.

table authors

import pandas as pd

df_authors = pd.read_csv("data/authors.csv")
df_authors.info()

# returns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9784 entries, 0 to 9783
Data columns (total 2 columns):
#   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      9784 non-null   int64 
 1   name    9784 non-null   object
dtypes: int64(1), object(1)
 memory usage: 153.0+ KB

table papers

df_papers = pd.read_csv("data/papers.csv")
df_papers.info()

# returns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7241 entries, 0 to 7240
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          7241 non-null   int64 
 1   year        7241 non-null   int64 
 2   title       7241 non-null   object
 3   event_type  2422 non-null   object
 4   pdf_name    7241 non-null   object
 5   abstract    7241 non-null   object
 6   paper_text  7241 non-null   object
dtypes: int64(2), object(5)
memory usage: 396.1+ KB

table paper_authors

As mentioned earlier, this table links the former two, using author_id and paper_id foreign keys. In addition, it has its own primary key id.

Figure 1. Displaying the top 5 rows of the data from the `papers` table.

As we can see from the image (and also when digging into the analytics deeper), the pdf_name column is more or less redundant, given the title column. Furthermore, by calling df_papers["event_type"].unique(), we know there are four distinct values for this column: 'Oral', 'Spotlight', 'Poster' or NaN (which signifies a publication was indeed a paper).

Let’s say, we would like to filter away pdf_name together with any entry that represents any publication that is other than a usual paper. The code to do it in Pandas looks like this:

df = df_papers[~df_papers["event_type"] \
            .isin(["Oral", "Spotlight", "Poster"])] \
        [["year", "title", "abstract", "paper_text"]]

The line is composed of three parts. First, we pass df_papers["event_type"].isin(...), which is a condition giving us a binary mask, then we pass it on to df_papers[...] essentially filtering the rows. Finally, we attach a list of columns ["year", "title", "abstract", "paper_text"] to what is left (again using [...]) thus indicating the columns we want to preserve. Alternatively, we may also use .drop(columns=[...]) to indicate the unwanted columns.

A more elegant way to achieve the same result is to use Pandas’ .query method instead of using a binary mask.

df = df_papers \
        .query("event_type not in ('Oral', 'Spotlight', 'Poster')") \
        .drop(columns=["id", "event_type", "pdf_name"])

The code looks a bit cleaner, and a nice thing about .query is the fact that we can use the @ sign to refer to another object, for example .query("column_a > @ass and column_b not in @bees"). On the flip side, this method is a bit slower, so you may want to stick to the binary mask method when having to repeat it excessively.
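For illustration (a sketch with hypothetical variable names, not from the original article), referring to local variables with @ could look like this:

excluded = ("Oral", "Spotlight", "Poster")
earliest_year = 2000

# @ lets the query string reach the local variables defined above
df = df_papers.query("event_type not in @excluded and year >= @earliest_year")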

Using SQL for getting data

Pandas gets the job done. However, we do have databases for a reason. They are optimized to search through tables efficiently and deliver data as needed.

Coming back to our problem, all we have achieved here is simple filtering of columns and rows. Let’s delegate the task to the database itself, and use Pandas to fetch the prepared table.

Pandas provides three functions that can help us: pd.read_sql_table, pd.read_sql_query, and pd.read_sql, which accepts either a query or a table name. For SQLite, pd.read_sql_table is not supported. This is not a problem as we are interested in querying the data at the database level anyway.

import sqlite3  # or sqlalchemy.create_engine for e.g. Postgres

con = sqlite3.connect(DB_FILEPATH)
query = """
select
    year,
    title,
    abstract,
    paper_text
from papers
where trim(event_type) != ''
"""

df_papers_sql = pd.read_sql(query, con=con)

Let’s break it down.

First, we need to connect to the database. For SQLite, it is easy, as we are only providing a path to a database file. For other databases, there are other libraries (e.g. psycopg2 for Postgres or, more generically, sqlalchemy). The point is to create the database connection object that points Pandas in the right direction and sorts out the authentication.

Once that is settled, the only thing left is constructing the right SQL query. SQL filters columns through the select statement. Similarly, rows are filtered with the where clause. Here we use the trim function to strip the entries of spaces before comparing against the empty string. The reason we use trim is specific to the data content of this example, but in general where is the place to put a condition.

With read_sql, the data is automatically DataFrame‘ed with all the rows and columns prefiltered as described.

Nice, isn’t it?

Let’s move further…

Joining, merging, collecting, combining…

Oftentimes data is stored across several tables. In these cases, stitching a dataset becomes an additional step that precedes the analytics.

Here, the relationship is rather simple: there is a many-to-many relationship between authors and papers, and the two tables are linked through the third, namely papers_authors. Let’s take a look at how Pandas handles the case. For the sake of argument, let’s assume we want to find the most “productive” authors in terms of papers published.

df_authors \
    .merge(
        df_papers_authors.drop(columns=["id"]),
        left_on="id",
        right_on="author_id",
        how="inner") \
    .drop(columns=["id"]) \
    .merge(
        df_papers \
            .query("event_type in ('Oral', 'Spotlight', 'Poster')") \
            [["id", "year", "title", "abstract", "paper_text"]],
        left_on="paper_id",
        right_on="id",
        how="left") \
    .drop(columns=["id", "paper_id", "author_id"]) \
    .sort_values(by=["name", "year"], ascending=True)

… for Pandas, this is just one statement, but here we split it across several lines for clarity.

We start with the table authors and want to attach papers to it. Pandas offers three functions for “combining” data: merge, join, and concat.

To “get” to papers, we first need to inner-join the papers_authors table. However, both tables have an id column. To avoid conflict (or automatic prefixing), we remove the papers_authors.id column before joining. Then, we join on authors.id == papers_authors.author_id, after which we also drop id from the resulting table. Having access to paper_id, we perform the join again. This time, it is a left-join as we don’t want to eliminate “paperless” authors. We also take the opportunity to filter df_papers as described earlier. However, it is essential to keep papers.id or else Pandas will refuse to join them. Finally, we drop all the key columns: id, paper_id, author_id, as they don’t bring any information, and sort the records for convenience.

Using SQL for combining

Now, the same effect using SQL.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
query = """
    select
        a.name,
        p.year,
        p.title,
        p.abstract,
        p.paper_text
    from authors a
    inner join paper_authors pa on pa.author_id = a.id
    left join papers p on p.id = pa.paper_id
        and p.event_type not in ('Oral', 'Spotlight', 'Poster')
    order by name, year asc
"""
pd.read_sql(query, con=con)

Here, we build it “outwards” from line 8, subsequently joining the other tables, with the second one being trimmed using line 11. The rest is just ordering and filtering, using a, p, and pa as aliases.

The effect is the same, but with SQL, we avoid having to manage indices, which has nothing to do with analytics.

Data cleaning

Let’s take a look at the resulting dataset.

Figure 2. The top five rows of the combined table.

The newly created table contains missing values and encoding problems. Here, we skip fixing the encoding as this problem is specific to the data content. However, missing values are a very common issue. Pandas offers, among others, .fillna(...) and .dropna(...), and depending on the conventions, we may fill NaNs with different values.
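For reference (a sketch, assuming df is the merged DataFrame from the previous step), the Pandas counterpart of the SQL cleaning shown below could be a single fillna call with per-column defaults:

# Fill missing values column by column, mirroring the SQL coalesce below
df = df.fillna({
    "year": 0,
    "title": "untitled",
    "abstract": "Abstract Missing",
    "paper_text": "",
})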

Using SQL for data cleaning

Databases also have their own way to deal with the issue. Here, the equivalents of fillna and dropna are coalesce and is not null, respectively.

Using coalesce, our query cures the dataset, injecting a default value wherever one is missing.

"""
select
    a.name,
    coalesce(p.year, 0) as year,
    coalesce(p.title, 'untitled') as title,
    coalesce(p.abstract, 'Abstract Missing') as abstract,
    coalesce(p.paper_text, '') as paper_text
from authors a
join paper_authors pa on pa.author_id = a.id
left join papers p on p.id = pa.paper_id
    and p.event_type not in ('Oral', 'Spotlight', 'Poster')
order by name, year asc
"""

Aggregations

Our dataset is prepared, “healed” and fetched using SQL. Now, let’s say we would like to rank the authors based on the number of papers they write each year. In addition, we would like to calculate the total word count that every author “produced” every year.

Again, this is another standard data transformation problem. Let’s examine how Pandas handles it. The starting point is the joined and cleaned table.

1
2
3
4
5
6
7
8
9
df["paper_length"] = df["paper_text"].str.count()

df[["name", "year", "title", "paper_length"]] \
    .groupby(by=["name", "year"]) \
    .aggregate({"title": "count", "paper_length": "sum"}) \
    .reset_index() \
    .rename(columns={"title": "n_papers", "paper_length": "n_words"}) \
    .query("n_words > 0") \
    .sort_values(by=["n_papers"], ascending=False)

We calculate articles’ length by counting spaces. Although it is naive to believe that every word in a paper is separated by exactly one space, it does give us some estimation. Line 1 does that via the .str attribute, introducing a new column at the same time.

Later on, we formulate a new table by applying a sequence of operations:

  1. We narrow down the table only to the columns of interest.
  2. We aggregate the table using both name and year columns.
  3. As we apply different aggregation functions to the two remaining columns, we use the .aggregate method that accepts a dictionary with instructions.
  4. The aggregation results in a double index. Line 6 restores name and year as columns.
  5. The remaining columns’ names stay the same, but no longer reflect the meaning of the numbers. We change that in line 7.
  6. For formerly missing values, it is safe to assume they have a word count equal to zero. To build the ranking, we eliminate them using .query.
  7. Finally, we sort the table for convenience.

The aggregated table is presented in figure 3.

Figure 3. The ranking as an aggregated table.

Using SQL to aggregate

Now, once again, let’s achieve the same result using SQL.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
"""
select
    d.name,
    d.year,
    count(title) as n_papers,
    sum(paper_length) as n_words
from (
    select
        a.name,
        coalesce(p.year, 0) as year,
        coalesce(p.title, 'untitled') as title,
        length(coalesce(p.paper_text, ''))
            - length(replace(coalesce(p.paper_text, ''), ' ', ''))
            as paper_length
    from authors a
    join paper_authors pa on pa.author_id = a.id
    left join papers p on p.id = pa.paper_id
        and p.event_type not in ('Oral', 'Spotlight', 'Poster')
    ) as d
    group by name, year
    having n_words > 0
    order by n_papers desc
"""

This query may appear to be more hassle than the Pandas code, but it isn’t. We combine all our previous work plus the aggregation in a single step. Thanks to the subquery and the SQL functions, it is possible to arrange things so that, in our particular example, we get the result before we even start the analysis.

The query contains a subquery (lines 8-18), where apart from removing the abstract column and introducing the paper_length column, almost everything stays the same. SQLite does not have an equivalent of str.count(), so we work around it by counting the spaces as the difference between the text length and the length of the text with spaces removed, using length and replace. Later, in line 19, we assign d as an alias for the subquery table so we can reference it.

Next, the group by statement in combination with count and sum does what we did using Pandas’ .aggregate method. Here, we also apply the condition from line 21 using having. The having statement works similarly to where, except that it operates on the aggregated groups, whereas where is applied to the individual rows before the aggregation.

Again, the resulting tables are exactly the same.

Conclusion

Pandas and SQL may look similar, but their nature is very different. Python is an object-oriented language and Pandas stores data as table-like objects. In addition, it offers a wide variety of methods to transform them in any way possible, which makes it an excellent tool for data analysis.

On the other side, Pandas’ methods for formulating a dataset are just a different “incarnation” of what SQL is all about. SQL is a declarative language that is naturally tailored to fetch, transform and prepare a dataset. If data resides in a relational database, letting a database engine perform these steps is a better choice. Not only are these engines optimized to do that, but also letting a database prepare a clean and convenient dataset facilitates the analysis process.

The disadvantage of SQL is that it may be harder to read and harder to figure out what data to throw away and what to keep before creating a dataset. Pandas, running in Python, lets us assign fractions of the dataset to variables, inspect them, and then make further decisions.

Still, these temporary variables often creep in and clutter the workspace… Therefore, unless you are in doubt, there are strong reasons to use SQL.

And how do you analyze your data? ;)

March 06, 2021 11:00 PM UTC


Weekly Python StackOverflow Report

(cclxvi) stackoverflow python report

These are the ten most rated questions at Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2021-03-06 19:28:13 GMT


  1. How can I bulk/batch transcribe wav files using python? - [8/4]
  2. Find single number in pairs of unique numbers of a Python list in O(lg n) - [6/2]
  3. Segmentation fault when importing a C++ shared object in Python - [6/1]
  4. How to quickly get the last line of a huge csv file (48M lines)? - [5/6]
  5. How to get index of numpy multidimensional array in reverse order? - [5/2]
  6. Find the index of first non-zero element to the right of given elements in python - [5/2]
  7. Strange Behavior With Pandas Group By - Transform On String Columns - [5/1]
  8. Why do predict_proba and probA_ / probB_ not match in sklearn SVC? - [5/0]
  9. what is the most conventional way to integrate C code into a Python library using distutils? - [5/0]
  10. How to loop to check if all values in a list are bigger than the values in another list? - [4/9]

March 06, 2021 07:28 PM UTC


The Digital Cat

TDD in Python with pytest - Part 5

This is the fifth and last post in the series "TDD in Python with pytest" where I develop a simple project following a strict TDD methodology. The posts come from my book Clean Architectures in Python and have been reviewed to get rid of some bad naming choices of the version published in the book.

You can find the first post here.

In this post I will conclude the discussion about mocks, introducing patching.

Patching

Mocks are very simple to introduce in your tests whenever your objects accept classes or instances from outside. In that case, as shown in the previous sections, you just have to instantiate the class Mock and pass the resulting object to your system. However, when the external classes instantiated by your library are hardcoded this simple trick does not work. In this case you have no chance to pass a fake object instead of the real one.

This is exactly the case addressed by patching. Patching, in a testing framework, means to replace a globally reachable object with a mock, thus achieving the goal of having the code run unmodified, while part of it has been hot swapped, that is, replaced at run time.

A warm-up example

Clone the repository fileinfo that you can find here and move to the branch develop. As I did for the project simple_calculator, the branch master contains the full solution, and I use it to maintain the repository, but if you want to code along you need to start from scratch. If you prefer, you can of course fork it on GitHub and work on your own copy of the repository.

git clone https://github.com/lgiordani/fileinfo
cd fileinfo
git checkout --track origin/develop

Create a virtual environment following your preferred process and install the requirements

pip install -r requirements/dev.txt

You should at this point be able to run

pytest -svv

and get an output like

=============================== test session starts ===============================
platform linux -- Python XXXX, pytest-XXXX, py-XXXX, pluggy-XXXX --
fileinfo/venv3/bin/python3
cachedir: .cache
rootdir: fileinfo, inifile: pytest.ini
plugins: cov-XXXX
collected 0 items 

============================== no tests ran in 0.02s ==============================

Let us start with a very simple example. Patching can be complex to grasp at the beginning so it is better to start learning it with trivial use cases. The purpose of this library is to develop a simple class that returns information about a given file. The class shall be instantiated with the file path, which can be relative.

The starting point is the class with the method __init__. If you want you can develop the class using TDD, but for the sake of brevity I will not show here all the steps that I followed. This is the set of tests I have in tests/test_fileinfo.py

tests/test_fileinfo.py
from fileinfo.fileinfo import FileInfo


def test_init():
    filename = 'somefile.ext'
    fi = FileInfo(filename)
    assert fi.filename == filename


def test_init_relative():
    filename = 'somefile.ext'
    relative_path = '../{}'.format(filename)
    fi = FileInfo(relative_path)
    assert fi.filename == filename

and this is the code of the class FileInfo in the file fileinfo/fileinfo.py

fileinfo/fileinfo.py
import os


class FileInfo:
    def __init__(self, path):
        self.original_path = path
        self.filename = os.path.basename(path)

Git tag: first-version

As you can see the class is extremely simple, and the tests are straightforward. So far I haven't added anything new to what we discussed in the previous posts.

Now I want the method get_info to return a tuple with the file name, the original path the class was instantiated with, and the absolute path of the file. Pretending we are in the directory /some/absolute/path, the class should work as shown here

>>> fi = FileInfo('../book_list.txt')
>>> fi.get_info()
('book_list.txt', '../book_list.txt', '/some/absolute/book_list.txt')

You can quickly realise that you have a problem writing the test. There is no way to easily test something such as "the absolute path", since the outcome of the function called in the test is supposed to vary with the path of the test itself. Let us try to write part of the test

def test_get_info():
    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)
    fi = FileInfo(original_path)
    assert fi.get_info() == (filename, original_path, '???')

where the '???' string highlights that I cannot put something sensible to test the absolute path of the file.

Patching is the way to solve this problem. You know that the function will use some code to get the absolute path of the file. So, within the scope of this test only, you can replace that code with something different and perform the test. Since the replacement code has a known outcome, writing the test is now possible.

Patching, thus, means to inform Python that during the execution of a specific portion of the code you want a globally accessible module/object replaced by a mock. Let's see how we can use it in our example

tests/test_fileinfo.py
from unittest.mock import patch

[...]

def test_get_info():
    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)

    with patch('os.path.abspath') as abspath_mock:
        test_abspath = 'some/abs/path'
        abspath_mock.return_value = test_abspath
        fi = FileInfo(original_path)
        assert fi.get_info() == (filename, original_path, test_abspath)

You can clearly see the context in which the patching happens, as it is enclosed in a with statement. Inside this statement the function os.path.abspath will be replaced by a mock created by the function patch and called abspath_mock. So, while Python executes the lines of code enclosed by the with statement, any call to os.path.abspath will actually go to abspath_mock, and will return whatever value we configure the mock to return.

The first thing we can do, then, is to give the mock a known return_value. This way we solve the issue that we had with the initial code, that is using an external component that returns an unpredictable result. The line

    abspath_mock.return_value = test_abspath

instructs the patching mock to return the given string as a result, regardless of the real values of the file under consideration.

The code that makes the test pass is

fileinfo/fileinfo.py
class FileInfo:
    [...]

    def get_info(self):
        return (
            self.filename,
            self.original_path,
            os.path.abspath(self.original_path)
        )

When this code is executed by the test the function os.path.abspath is replaced at run time by the mock that we prepared there, which basically ignores the input value self.original_path and returns the fixed value it was instructed to use.

Git tag: patch-with-context-manager

It is worth at this point discussing outgoing messages again. The code that we are considering here is a clear example of an outgoing query, as the method get_info is not interested in changing the status of the external component. In the previous post we reached the conclusion that testing the return value of outgoing queries is pointless and should be avoided. With patch we are replacing the external component with something that we know, using it to test that our object correctly handles the value returned by the outgoing query. We are thus not testing the external component, as it has been replaced, and we are definitely not testing the mock, as its return value is already known.

Obviously to write the test you have to know that you are going to use the function os.path.abspath, so patching is somehow a "less pure" practice in TDD. In pure OOP/TDD you are only concerned with the external behaviour of the object, and not with its internal structure. This example, however, shows that this pure approach has some limitations that you have to cope with, and patching is a clean way to do it.

The patching decorator

The function patch we imported from the module unittest.mock is very powerful, as it can temporarily replace an external object. If the replacement has to or can be active for the whole test, there is a cleaner way to inject your mocks, which is to use patch as a function decorator.

This means that you can decorate the test function, passing as an argument the same target string you would pass if patch was used in a with statement. This requires, however, a small change in the test function prototype, as it has to receive an additional argument, which will become the mock.

Let's change test_get_info, removing the statement with and decorating the function with patch

tests/test_fileinfo.py
@patch('os.path.abspath')
def test_get_info(abspath_mock):
    test_abspath = 'some/abs/path'
    abspath_mock.return_value = test_abspath

    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)

    fi = FileInfo(original_path)
    assert fi.get_info() == (filename, original_path, test_abspath)

Git tag: patch-with-function-decorator

As you can see the decorator patch works like a big with statement for the whole function. The argument abspath_mock passed to the test becomes internally the mock that replaces os.path.abspath. Obviously this way you replace os.path.abspath for the whole function, so you have to decide case by case which form of the function patch you need to use.
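For completeness, patch also has a third form, not used in this post, where the patcher is started and stopped manually; this can be handy in setup/teardown code. The following is only a minimal sketch under that assumption

import os.path
from unittest.mock import patch

patcher = patch('os.path.abspath')
abspath_mock = patcher.start()
abspath_mock.return_value = 'some/abs/path'

# while the patcher is active, the real function is replaced by the mock
print(os.path.abspath('../somefile.ext'))  # prints 'some/abs/path'

patcher.stop()  # from here on os.path.abspath is restored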

Multiple patches

You can patch more than one object in the same test. For example, consider the case where the method get_info calls os.path.getsize in addition to os.path.abspath, in order to return the size of the file. You have at this point two different outgoing queries, and you have to replace both with mocks to make your class work during the test.

This can be easily done with an additional patch decorator

tests/test_fileinfo.py
@patch('os.path.getsize')
@patch('os.path.abspath')
def test_get_info(abspath_mock, getsize_mock):
    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)

    test_abspath = 'some/abs/path'
    abspath_mock.return_value = test_abspath

    test_size = 1234
    getsize_mock.return_value = test_size

    fi = FileInfo(original_path)
    assert fi.get_info() == (filename, original_path, test_abspath, test_size)

Please note that the decorator which is nearest to the function is applied first. Always remember that the decorator syntax with @ is a shortcut to replace the function with the output of the decorator, so two decorators result in

@decorator1
@decorator2
def myfunction():
    pass

which is a shortcut for

def myfunction():
    pass
myfunction = decorator1(decorator2(myfunction))

This explains why, in the test code, the function receives first abspath_mock and then getsize_mock. The first decorator applied to the function is the patch of os.path.abspath, which adds the mock we call abspath_mock as the first argument. Then the patch of os.path.getsize is applied and this adds its own mock as the second one.

The code that makes the test pass is

fileinfo/fileinfo.py
class FileInfo:
    [...]

    def get_info(self):
        return (
            self.filename,
            self.original_path,
            os.path.abspath(self.original_path),
            os.path.getsize(self.original_path)
        )

Git tag: multiple-patches

We can write the above test using two with statements as well

tests/test_fileinfo.py
def test_get_info():
    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)

    with patch('os.path.abspath') as abspath_mock:
        test_abspath = 'some/abs/path'
        abspath_mock.return_value = test_abspath

        with patch('os.path.getsize') as getsize_mock:
            test_size = 1234
            getsize_mock.return_value = test_size

            fi = FileInfo(original_path)
            assert fi.get_info() == (
                filename,
                original_path,
                test_abspath,
                test_size
            )

Using more than one with statement, however, makes the code difficult to read, in my opinion, so in general I prefer to avoid complex with trees unless I really need the limited scope that this form of patching provides.
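If you want to keep the limited scope without the nesting, Python also allows multiple context managers in a single with statement; the following sketch should be equivalent to the test above.

def test_get_info():
    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)

    # both patches live in the same with statement, so no nesting is needed
    with patch('os.path.abspath') as abspath_mock, \
            patch('os.path.getsize') as getsize_mock:
        test_abspath = 'some/abs/path'
        abspath_mock.return_value = test_abspath

        test_size = 1234
        getsize_mock.return_value = test_size

        fi = FileInfo(original_path)
        assert fi.get_info() == (
            filename, original_path, test_abspath, test_size
        )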

Checking call parameters

When you patch, your internal algorithm is not executed, as the patched method just returns the values it has been instructed to return. This is connected to what we said about testing external systems, so everything is fine, but while we don't want to test the internals of the module os.path, we do want to be sure that we are passing the correct values to the external methods.

This is why mocks provide methods like assert_called_with (and other similar methods), through which we can check the values passed to a patched method when it is called. Let's add the checks to the test

tests/test_fileinfo.py
@patch('os.path.getsize')
@patch('os.path.abspath')
def test_get_info(abspath_mock, getsize_mock):
    test_abspath = 'some/abs/path'
    abspath_mock.return_value = test_abspath

    filename = 'somefile.ext'
    original_path = '../{}'.format(filename)

    test_size = 1234
    getsize_mock.return_value = test_size

    fi = FileInfo(original_path)
    info = fi.get_info() 

    abspath_mock.assert_called_with(original_path)
    getsize_mock.assert_called_with(original_path)
    assert info == (filename, original_path, test_abspath, test_size)

As you can see, I first invoke fi.get_info, storing the result in the variable info, then check that the patched methods have been called with the correct parameters, and finally assert the format of the output.

The test passes, confirming that we are passing the correct values.

Git tag: addding-checks-for-input-values
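If you also want to verify that each external function was called exactly once, the mocks provide assert_called_once_with as well; a possible variant of the two checks above, not present in the tagged code, would be

    # also fails if the patched function was called more than once
    abspath_mock.assert_called_once_with(original_path)
    getsize_mock.assert_called_once_with(original_path)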

Patching immutable objects

The most widespread version of Python is CPython, which is written, as the name suggests, in C. Part of the standard library is also written in C, while the rest is written in Python itself.

The objects (classes, modules, functions, etc.) that are implemented in C are shared between interpreters, and this requires those objects to be immutable, so that you cannot alter them at runtime from a single interpreter.

An example of this immutability can be given easily using a Python console

>>> a = 1
>>> a.conjugate = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object attribute 'conjugate' is read-only

Here I'm trying to replace a method with an integer, which is pointless per se, but clearly shows the issue we are facing.

What has this immutability to do with patching? What patch does is actually to temporarily replace an attribute of an object (method of a class, class of a module, etc.), which also means that if we try to replace an attribute in an immutable object the patching action will fail.

A typical example of this problem is the module datetime, which is also one of the best candidates for patching, since the output of time functions is by definition time-varying.

Let me show the problem with a simple class that logs operations. I will temporarily break the TDD methodology writing first the class and then the tests, so that you can appreciate the problem.

Create a file called logger.py and put there the following code

fileinfo/logger.py
import datetime


class Logger:
    def __init__(self):
        self.messages = []

    def log(self, message):
        self.messages.append((datetime.datetime.now(), message))

This is pretty simple, but testing this code is problematic, because the method log produces results that depend on the actual execution time. The call to datetime.datetime.now is however an outgoing query, and as such it can be replaced by a mock with patch.

If we try to do it, however, we will have a bitter surprise. This is the test code, that you can put in tests/test_logger.py

tests/test_logger.py
from unittest.mock import patch

from fileinfo.logger import Logger


@patch('datetime.datetime.now')
def test_log(mock_now):
    test_now = 123
    test_message = "A test message"
    mock_now.return_value = test_now

    test_logger = Logger()
    test_logger.log(test_message)
    assert test_logger.messages == [(test_now, test_message)]

When you try to execute this test you will get the following error

TypeError: can't set attributes of built-in/extension type 'datetime.datetime'

which is raised because patching tries to replace the function now in datetime.datetime with a mock, and since the module is immutable this operation fails.

Git tag: initial-logger-not-working

There are several ways to address this problem. All of them, however, start from the fact that importing or subclassing an immutable object gives you a mutable "copy" of that object.

The easiest example in this case is the module datetime itself. In the function test_log we tried to patch directly the object datetime.datetime.now, affecting the builtin module datetime. The file logger.py, however, does import datetime, so this latter becomes a local symbol in the module logger. This is exactly the key for our patching. Let us change the code to

tests/test_logger.py
@patch('fileinfo.logger.datetime.datetime')
def test_log(mock_datetime):
    test_now = 123
    test_message = "A test message"
    mock_datetime.now.return_value = test_now

    test_logger = Logger()
    test_logger.log(test_message)
    assert test_logger.messages == [(test_now, test_message)]

Git tag: correct-patching

If you run the test now, you can see that the patching works. What we did was to inject our mock in fileinfo.logger.datetime.datetime instead of datetime.datetime.now. Two things changed, thus, in our test. First, we are patching the module imported in the file logger.py and not the module provided globally by the Python interpreter. Second, we have to patch the whole module because this is what is imported by the file logger.py. If you try to patch fileinfo.logger.datetime.datetime.now you will find that it is still immutable.

Another possible solution to this problem is to create a function that invokes the immutable object and returns its value. This wrapper function can be easily patched, because it is defined in our own module and is thus not immutable. This solution, however, requires changing the source code to allow testing, which is far from optimal. Obviously it is better to introduce a small change in the code and have it tested than to leave it untested, but whenever possible I try to avoid solutions that introduce code which wouldn't be required without tests.
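To make this alternative concrete, here is a minimal sketch of what the change might look like; the wrapper function now is an assumption made for this example and is not the code used in the fileinfo repository.

import datetime


def now():
    # thin wrapper around the immutable builtin; being a plain module-level
    # function, it can be patched without touching datetime at all
    return datetime.datetime.now()


class Logger:
    def __init__(self):
        self.messages = []

    def log(self, message):
        self.messages.append((now(), message))

A test could then patch fileinfo.logger.now directly and give it a known return_value.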

Mocks and proper TDD

Following a strict TDD methodology means writing a test before writing the code that passes that test. This can be done because we use the object under test as a black box, interacting with it through its API, and thus not knowing anything of its internal structure.

When we mock systems we break this assumption. In particular we need to open the black box every time we need to patch a hardcoded external system. Let's say, for example, that the object under test creates a temporary directory to perform some data processing. This is a detail of the implementation and we are not supposed to know it while testing the object, but since we need to mock the creation of the directory to avoid interaction with the external system (storage), we have to become aware of what happens internally.

This also means that writing a test for the object before writing the implementation of the object itself is difficult. Pretty often, then, such objects are built with TDD but iteratively, with mocks introduced only after part of the code has been written.

While this is a violation of the strict TDD methodology, I don't consider it a bad practice. TDD helps us to write better code consistently, but good code can be written even without tests. The real outcome of TDD is a test suite that is capable of detecting regressions or the removal of important features in the future. This means that breaking strict TDD for a small part of the code (patching objects) will not affect the real result of the process, only change the way we achieve it.

A warning

Mocks are a good way to approach parts of the system that are not under test but that are still part of the code that we are running. This is particularly true for parts of the code that we wrote, whose internal structure is ultimately known. When the external system is complex and completely detached from our code, mocking starts to become complicated and the risk is that we spend more time faking parts of the system than actually writing code.

In these cases we have definitely crossed the barrier between unit testing and integration testing. You may see mocks as the bridge between the two, as they allow you to keep unit-testing parts that are naturally connected ("integrated") with external systems, but there is a point where you have to recognise that you need to change approach.

This threshold is not fixed, and I can't give you a rule to recognise it, but I can give you some advice. First of all keep an eye on how many things you need to mock to make a test run, as an increasing number of mocks in a single test is definitely a sign of something wrong in the testing approach. My rule of thumb is that when I have to create more than 3 mocks, an alarm goes off in my mind and I start questioning what I am doing.

The second advice is to always consider the complexity of the mocks. You may find yourself patching a class but then having to create monsters like cls_mock().func1().func2().func3.assert_called_with(x=42), which is a sign that the part of the system that you are mocking is buried deep in code that you cannot really access, because you don't know its internal mechanisms.

The third advice is to consider mocks as "hooks" that you throw at the external system, and that break its hull to reach its internal structure. These hooks are obviously against the assumption that we can interact with a system knowing only its external behaviour, or its API. As such, you should keep in mind that each mock you create is a step back from this perfect assumption, thus "breaking the spell" of the decoupled interaction. Doing this makes it increasingly complex to create mocks, and this will contribute to keep you aware of what you are doing (or overdoing).

Final words

Mocks are a very powerful tool that allows us to test code that contains outgoing messages. In particular they allow us to test the arguments of outgoing commands. Patching is a good way to overcome the fact that some external components are hardcoded in our code and are thus unreachable through the arguments passed to the classes or the methods under analysis.

Updates

2021-03-06 GitHub user 4myhw spotted an inconsistency between the code on GitHub and the code in the post. Thanks!

Feedback

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

March 06, 2021 06:00 PM UTC

Delegation: composition and inheritance in object-oriented programming

Introduction

Object-oriented programming (OOP) is a methodology that was introduced in the 60s, though as for many other concepts related to programming languages it is difficult to give a proper date. While recent years have witnessed a second youth of functional languages, object-oriented is still a widespread paradigm among successful programming languages, and for good reasons. OOP is not the panacea for all the architectural problems in software development, but if used correctly can give a solid foundation to any system.

It might sound obvious, but if you use an object-oriented language or a language with strong OOP traits, you have to learn this paradigm well. Being very active in the Python community, I see how many times young programmers are introduced to the language, the main features, and the most important libraries and frameworks, without a proper and detailed description of OOP and how OOP is implemented in the language.

The implementation part is particularly important, as OOP is a set of concepts and features that are expressed theoretically and then implemented in the language, with specific traits or choices. It is very important, then, to keep in mind that the concepts behind OOP are generally shared among OOP languages, but are not tenets, and are subject to interpretation.

What is the core of OOP? Many books and tutorials mention the three pillars encapsulation, delegation, and polymorphism, but I believe these are traits of a more central concept, which is the collaboration of entities. In a well-designed OO system, we can observe a set of actors that send messages to each other to keep the system alive, responsive, and consistent.

These actors have a state, the data, and give access to it through an interface: this is encapsulation. Each actor can use functionalities implemented by another actor sending a message (calling a method) and when the relationship between the two is stable we have delegation. As communication happens through messages, actors are not concerned with the nature of the recipients, only with their interface, and this is polymorphism.

Alan Kay, in his "The Early History of Smalltalk", says

In computer terms, Smalltalk is a recursion on the notion of computer itself. Instead of dividing "computer stuff" into things each less strong than the whole — like data structures, procedures, and functions which are the usual paraphernalia of programming languages — each Smalltalk object is a recursion on the entire possibilities of the computer. Thus its semantics are a bit like having thousands and thousands of computers all hooked together by a very fast network.

I find this extremely enlightening, as it reveals the idea behind the three pillars, and the reason why we do or don't do certain things in OOP, why we consider good to provide some automatic behaviours or to forbid specific solutions.

By the way, if you replace the word "object" with "microservice" in the quote above, you might be surprised by the description of a very modern architecture for cloud-based systems. Once again, concepts in computer science are like fractals, they are self-similar and pop up in unexpected places.

In this post, I want to focus on the second of the pillars of object-oriented programming: delegation. I will discuss its nature and the main two strategies we can follow to implement it: composition and inheritance. I will provide examples in Python and show how the powerful OOP implementation of this language opens the door to interesting atypical solutions.

For the rest of this post, I will consider objects as mini computers and the system in which they live a "very fast network", using the words of Alan Kay. Data contained in an object is the state of the computer, its methods are the input/output devices, and calling methods is the same thing as sending a message to another computer through the network.

Delegation in OOP

Delegation is the mechanism through which an actor assigns a task or part of a task to another actor. This is not new in computer science, as any program can be split into blocks and each block generally depends on the previous ones. Furthermore, code can be isolated in libraries and reused in different parts of a program, implementing this "task assignment". In an OO system the assignee is not just the code of a function, but a full-fledged object, another actor.

The main concept to retain here is that the reason behind delegation is code reuse. We want to avoid code repetition, as it is often the source of regressions; fixing a bug in one of the repetitions doesn't automatically fix it in all of them, so keeping a single version of each algorithm is paramount to ensure the consistency of a system. Delegation helps us to keep our actors small and specialised, which makes the whole architecture more flexible and easier to maintain (if properly implemented). Changing a very big subsystem to satisfy a new requirement might affect other parts of the system in bad ways, so the smaller the subsystems the better (up to a certain point, where we run into the opposite problem, but this shall be discussed in another post).

There is a dichotomy in delegation, as it can be implemented following two different strategies, which are orthogonal from many points of view, and I believe that one of the main problems that object-oriented systems have lies in the use of the wrong strategy, in particular the overuse of inheritance. When we create a system using an object-oriented language we need to keep in mind this dichotomy at every step of the design.

There are four areas or points of views that I want to introduce to help you to visualise delegation between actors: visibility, control, relationship, and entities. As I said previously, while these concepts apply to systems at every scale, and in particular to every object-oriented language, I will provide examples in Python.

Visibility: state sharing

The first way to look at delegation is through the lenses of state sharing. As I said before the data contained in an object can be seen as its state, and if hearing this you think about components in a frontend framework or state machines you are on the right path. The state of a computer, its memory or the data on the mass storage, can usually be freely accessed by internal systems, while the access is mediated for external ones. Indeed, the level of access to the state is probably one of the best ways to define internal and external systems in a software or hardware architecture.

When using inheritance, the child class shares its whole state with the parent class. Let's have a look at a simple example

class Parent:
    def __init__(self, value):
        self._value = value  # 3

    def describe(self):  # 1
        print(f"Parent: value is {self._value}")

class Child(Parent):
    pass

>>> cld = Child(5)
>>> print(cld._value)
5
>>> cld.describe()  # 2
Parent: value is 5

As you can see, describe is defined in Parent (1), so when the instance cld calls it (2), its class Child delegates the call to the class Parent. This, in turn, uses _value as if it was defined locally (3), while it is defined in cld. This works because, from the point of view of the state, Parent has complete access to the state of Child. Please note that the state is not even enclosed in a namespace, as the state of the child class becomes the state of the parent class.

Composition, on the other side, keeps the state completely private and makes the delegated object see only what is explicitly shared through message passing. A simple example of this is

class Logger:
    def log(self, value):
        print(f"Logger: value is {value}")


class Process:
    def __init__(self, value):
        self._value = value  # 1
        self.logger = Logger()

    def info(self):
        self.logger.log(self._value)  # 2

>>> prc = Process(5)
>>> print(prc._value)
5
>>> prc.info()
Logger: value is 5

Here, instances of Process have an attribute _value (1) that is shared with the class Logger only when it comes to calling Logger.log (2) inside their info method. Logger objects have no visibility of the state of Process objects unless it is explicitly shared.

Note for advanced readers: I'm clearly mixing the concepts of instance and class here, and blatantly ignoring the resulting inconsistencies. The state of an instance is not the same thing as the state of a class, and it should also be mentioned that classes are themselves instances of metaclasses, at least in Python. What I want to point out here is that access to attributes is granted automatically to inherited classes because of the way __getattribute__ and bound methods work, while in composition such mechanisms are not present and the effect is that the state is not shared.

Control: implicit and explicit delegation

Another way to look at the dichotomy between inheritance and composition is that of the control we have over the process. Inheritance is usually provided by the language itself and is implemented according to some rules that are part of the definition of the language itself. This makes inheritance an implicit mechanism: when you make a class inherit from another one, there is an automatic and implicit process that rules the delegation between the two, which makes it run outside our control.

Let's see an example of this in action using inheritance

class Window:
    def __init__(self, title, size_x, size_y):
        self._title = title
        self._size_x = size_x
        self._size_y = size_y

    def resize(self, new_size_x, new_size_y):
        self._size_x = new_size_x
        self._size_y = new_size_y
        self.info()

    def info(self):  # 2
        print(f"Window '{self._title}' is {self._size_x}x{self._size_y}")


class TransparentWindow(Window):
    def __init__(self, title, size_x, size_y, transparency=50):
        self._title = title
        self._size_x = size_x
        self._size_y = size_y
        self._transparency = transparency

    def change_transparency(self, new_transparency):
        self._transparency = new_transparency

    def info(self):  # 1
        super().info()  # 3
        print(f"Transparency is set to {self._transparency}")

At this point we can instantiate and use TransparentWindow

>>> twin = TransparentWindow("Terminal", 640, 480, 80)
>>> twin.info()
Window 'Terminal' is 640x480
Transparency is set to 80
>>> twin.change_transparency(70)
>>> twin.resize(800, 600)
Window 'Terminal' is 800x600
Transparency is set to 70

When we call twin.info, Python is running TransparentWindow's implementation of that method (1) and is not automatically delegating anything to Window even though the latter has a method with that name (2). Indeed, we have to explicitly call it through super when we want to reuse it (3). When we use resize, though, the implicit delegation kicks in and we end up with the execution of Window.resize. Please note that this delegation doesn't propagate to the next calls: when Window.resize calls self.info, this runs TransparentWindow.info, as the original call comes from an instance of that class.
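Incidentally, the same explicit delegation through super can be used in __init__ to avoid duplicating the initialisation of the parent; this is just a possible refactoring of the class above, not the version used in the example.

class TransparentWindow(Window):
    def __init__(self, title, size_x, size_y, transparency=50):
        # delegate the shared initialisation to Window explicitly
        super().__init__(title, size_x, size_y)
        self._transparency = transparency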

Composition is on the other end of the spectrum, as any delegation performed through composed objects has to be explicit. Let's see an example

class Body:
    def __init__(self, text):
        self._text = text

    def info(self):
        return {
            "length": len(self._text)
        }


class Page:
    def __init__(self, title, text):
        self._title = title
        self._body = Body(text)

    def info(self):
        return {
            "title": self._title,
            "body": self._body.info() 1
        }

When we instantiate a Page and call info everything works

>>> page = Page("New post", "Some text for an exciting new post")
>>> page.info()
{'title': 'New post', 'body': {'length': 34}}

but as you can see, Page.info has to explicitly mention Body.info through self._body (1), as we had to do when using inheritance with super. Composition is not different from inheritance when methods are overridden, at least in Python.

Relationship: to be vs to have

The third point of view from which you can look at delegation is that of the nature of the relationship between actors. Inheritance gives the child class the same nature as the parent class, with specialised behaviour. We can say that a child class implements new features or changes the behaviour of existing ones, but generally speaking, we agree that it is like the parent class. Think about a gaming laptop: it is a laptop, only with specialised features that enable it to perform well in certain situations. On the other hand, composition deals with actors that are usually made of other actors of a different nature. A simple example is that of the computer itself, which has a CPU, has mass storage, and has memory. We can't say that the computer is the CPU, because that is reductive.

This difference in the nature of the relationship between actors in a delegation is directly mapped into inheritance and composition. When using inheritance, we implement the verb to be

class Car:
    def __init__(self, colour, max_speed):
        self._colour = colour
        self._speed = 0
        self._max_speed = max_speed

    def accelerate(self, speed):
        self._speed = min(speed, self._max_speed)


class SportsCar(Car):
    def accelerate(self, speed):
        self._speed = speed

Here, SportsCar is a Car, it can be initialised in the same way and has the same methods, though it can accelerate much more (wow, that might be a fun ride). Since the relationship between the two actors is best described by to be it is natural to use inheritance.

Composition, on the other hand, implements the verb to have and describes an object that is "physically" made of other objects

class Employee:
    def __init__(self, name):
        self._name = name


class Company:
    def __init__(self, ceo_name, cto_name):
        self._ceo = Employee(ceo_name)
        self._cto = Employee(cto_name)

We can say that a company is the sum of its employees (plus other things), and we easily recognise that the two classes Employee and Company have a very different nature. They don't have the same interface, and if they have methods with the same name it is just by chance and not because they are serving the same purpose.

Entities: classes or instances

The last point of view that I want to explore is that of the entities involved in the delegation. When we discuss a theoretical delegation, for example saying "This Boeing 747 is a plane, thus it flies" we are describing a delegation between abstract, immaterial objects, namely generic "planes" and generic "flying objects".

class FlyingObject:
    pass


class Plane(FlyingObject):
    pass


>>> boeing747 = Plane()

Since Plane and FlyingObject share the same underlying nature, their relationship is valid for all objects of that type and it is thus established between classes, which are ideas that become concrete when instantiated.

When we use composition, instead, we are putting into play a delegation that is not valid for all objects of that type, but only for those that we connected. For example, we can separate gears from the rest of a bicycle, and it is only when we put together that specific set of gears and that bicycle that the delegation happens. So, while we can think theoretically about bicycles and gears, the actual delegation happens only when dealing with concrete objects.

class Gears:
    def __init__(self):
        self.current = 1

    def up(self):
        self.current = min(self.current + 1, 8)

    def down(self):
        self.current = max(self.current - 1, 0)


class Bicycle:
    def __init__(self):
        self.gears = Gears()  # 1

    def gear_up(self):
        self.gears.up()  # 2

    def gear_down(self):
        self.gears.down()  # 3

>>> bicycle = Bicycle()

As you can see here, an instance of Bicycle contains an instance of Gears (1) and this allows us to create a delegation in the methods gear_up (2) and gear_down (3). The delegation, however, happens between bicycle and bicycle.gears, which are instances.

It is also possible, at least in Python, to have composition using pure classes, which is useful when the class is a pure helper or a simple container of methods (I'm not going to discuss here the benefits or the disadvantages of such a solution)

class Gears:
    @classmethod
    def up(cls, current):
        return min(current + 1, 8)

    @classmethod
    def down(cls, current):
        return max(current - 1, 0)


class Bicycle:
    def __init__(self):
        self.gears = Gears
        self.current_gear = 1

    def gear_up(self):
        self.current_gear = self.gears.up(self.current_gear)

    def gear_down(self):
        self.current_gear = self.gears.down(self.current_gear)

>>> bicycle = Bicycle()

Now, when we run bicycle.gear_up the delegation happens between bicycle, an instance, and Gears, a class. We might take this further and have a class whose class methods call class methods of another class, but I won't give an example of this because it sounds a bit convoluted and probably not very reasonable to do. But it can be done.

So, we might devise a pattern here and say that in composition there is no rule that states the nature of the entities involved in the delegation, but that most of the time this happens between instances.

Note for advanced readers: in Python, classes are instances of a metaclass, usually type, and type is an instance of itself, so it is correct to say that composition happens always between instances.
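A quick console session, using the classes defined above, illustrates the note; this is just an illustration of Python's object model, not part of the original example.

>>> isinstance(bicycle, Bicycle)
True
>>> isinstance(Bicycle, type)
True
>>> isinstance(type, type)
True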

Bad signs

Now that we have looked at the two delegation strategies from different points of view, it's time to discuss what happens when you use the wrong one. You might have heard of the "composition over inheritance" mantra, which comes from the fact that inheritance is often overused. This wasn't and is not helped by the fact that OOP is often presented as encapsulation, inheritance, and polymorphism; open a random OOP post or book and you will see this with your own eyes.

Please, bloggers, authors, mentors, teachers, and overall programmers: stop considering inheritance the only delegation system in OOP.

That said, I think we should avoid going from one extreme to the opposite, and in general learn to use the tools languages give us. So, let's learn how to recognise the "smell" of bad code!

You are incorrectly using inheritance when:

- you need to override or delete inherited attributes and methods
- a change in one class forces changes in too many other classes of the inheritance tree
- classes become very big and contain heavily unrelated methods

You are incorrectly using composition when:

- too many methods do nothing but wrap methods of the contained instances
- you need to pass too many arguments to methods
- classes are almost empty and just contain one instance of another class

Domain modelling

We all know that there are few cases (in computer science as well as in life) where we can draw a clear line between two options and that most of the time the separation is blurry. There are many grey shades between black and white.

The same applies to composition and inheritance. While the nature of the relationship often can guide us to the best solution, we are not always dealing with the representation of real objects, and even when we do we always have to keep in mind that we are modelling them, not implementing them perfectly.

As a colleague of mine told me once, we have to represent reality with our code, but we have to avoid representing it too faithfully, to avoid bringing reality's limitations into our programs.

I believe this is very true, so I think that when it comes to choosing between composition and inheritance we need to be guided by the nature of the relationship in our system. In this, object-oriented programming and database design are very similar. When you design a database you have to think about the domain and the way you extract information, not (only) about the real-world objects that you are modelling.

Let's consider a quick example, bearing in mind that I'm only scratching the surface of something about which people write entire books. Let's pretend we are designing a web application that manages companies and their owners, and we started with the consideration that an Owner, well, owns the Company. This is a clear composition relationship.

class Company:
    def __init__(self, name):
        self.name = name

class Owner:
    def __init__(self, first_name, last_name, company_name):
        self.first_name = first_name
        self.last_name = last_name
        self.company = Company(company_name)

>>> owner1 = Owner("John", "Doe", "Pear")

Unfortunately, this automatically limits the number of companies owned by an Owner to one. If we want to relax that requirement, the best way to do it is to reverse the composition, and make the Company contain the Owner.

class Owner:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

class Company:
    def __init__(self, name, owner_first_name, owner_last_name):
        self.name = name
        self.owner = Owner(owner_first_name, owner_last_name)

>>> company1 = Company("Pear", "John", "Doe")
>>> company2 = Company("Pulses", "John", "Doe")

As you can see this is in direct contrast with the initial modelling that comes from our perception of the relationship between the two in the real world, which in turn comes from the specific word "owner" that I used. If I used a different word like "president" or "CEO", you would immediately accept the second solution as more natural, as the "president" is one of many employees.

The code above is not satisfactory, though, as it initialises Owner every time we create a company, while we might want to use the same instance. Again, this is not mandatory, it depends on the data contained in the Owner objects and the level of consistency that we need. For example, if we add to the owner an attribute online to mark that they are currently using the website and can be reached on the internal chat, we don't want to have to cycle through all companies and set the owner's online status for each of them if the owner is the same. So, we might want to change the way we compose them, passing an instance of Owner instead of the data used to initialise it.

class Owner:
    def __init__(self, first_name, last_name, online=False):
        self.first_name = first_name
        self.last_name = last_name
        self.online = online

class Company:
    def __init__(self, name, owner):
        self.name = name
        self.owner = owner

>>> owner1 = Owner("John", "Doe")
>>> company1 = Company("Pear", owner1)
>>> company2 = Company("Pulses", owner1)

Clearly, if the class Company has no other purpose than having a name, using a class is overkill, so this design might be further reduced to an Owner with a list of company names.

class Owner:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name
        self.companies = []

>>> owner1 = Owner("John", "Doe")
>>> owner1.companies.extend(["Pear", "Pulses"])

Can we use inheritance? Now I am stretching the example to its limit, but I can accept there might be a use case for something like this.

class Owner:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

class Company(Owner):
    def __init__(self, name, owner_first_name, owner_last_name):
        self.name = name
        super().__init__(owner_first_name, owner_last_name)

>>> company1 = Company("Pear", "John", "Doe")
>>> company2 = Company("Pulses", "John", "Doe")

As I showed in the previous sections, though, this code smells as soon as we start adding something like the email address.

class Owner:
    def __init__(self, first_name, last_name, email):
        self.first_name = first_name
        self.last_name = last_name
        self.email = email

class Company(Owner):
    def __init__(self, name, owner_first_name, owner_last_name, email):
        self.name = name
        super().__init__(owner_first_name, owner_last_name, email)

>>> company1 = Company("Pear", "John", "Doe")
>>> company2 = Company("Pulses", "John", "Doe")

Is email that of the company or the personal one of its owner? There is a clash, and this is a good example of "state pollution": both attributes have the same name, but they represent different things and might need to coexist.
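For comparison, a composition-based sketch keeps the two addresses in separate namespaces, so they can coexist; the attribute layout and the example values here are purely illustrative.

class Owner:
    def __init__(self, first_name, last_name, email):
        self.first_name = first_name
        self.last_name = last_name
        self.email = email  # the owner's personal address


class Company:
    def __init__(self, name, email, owner):
        self.name = name
        self.email = email  # the company address
        self.owner = owner

>>> owner1 = Owner("John", "Doe", "john.doe@example.com")
>>> company1 = Company("Pear", "info@pear.example.com", owner1)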

In conclusion, as you can see we have to be very careful to discuss relationships between objects in the context of our domain and avoid losing connection with the business logic.

Mixing the two: composed inheritance

Speaking of blurry separations, Python offers an interesting hook to its internal attribute resolution mechanism which allows us to create a hybrid between composition and inheritance that I call "composed inheritance".

Let's have a look at what happens internally when we deal with classes that are linked through inheritance.

class Parent:
    def __init__(self, value):
        self.value = value

    def info(self):
        print(f"Value: {self.value}")

class Child(Parent):
    def is_even(self):
        return self.value % 2 == 0

>>> c = Child(5)
>>> c.info()
Value: 5
>>> c.is_even()
False

This is a trivial example of an inheritance relationship between Child and Parent, where Parent provides the methods __init__ and info and Child augments the interface with the method is_even.

Let's have a look at the internals of the two classes. Parent.__dict__ is

mappingproxy({'__module__': '__main__',
              '__init__': <function __main__.Parent.__init__(self, value)>,
              'info': <function __main__.Parent.info(self)>,
              '__dict__': <attribute '__dict__' of 'Parent' objects>,
              '__weakref__': <attribute '__weakref__' of 'Parent' objects>,
              '__doc__': None}

and Child.__dict__ is

mappingproxy({'__module__': '__main__',
              'is_even': <function __main__.Child.is_even(self)>,
              '__doc__': None})

Finally, the bond between the two is established through Child.__bases__, which has the value (__main__.Parent,).

So, when we call c.is_even the instance has a bound method that comes from the class Child, as its __dict__ contains the function is_even. Conversely, when we call c.info Python has to fetch it from Parent, as Child can't provide it. This mechanism is implemented by the method __getattribute__ that is the core of the Python inheritance system.

As I mentioned before, however, there is a hook into this system that the language provides us, namely the method __getattr__, which is not present by default. What happens is that when a class can't provide an attribute, Python first tries to get the attribute with the standard inheritance mechanism but if it can't be found, as a last resort it tries to run __getattr__ passing the attribute name.

An example can definitely clarify the matter.

class Parent:
    def __init__(self, value):
        self.value = value

    def info(self):
        print(f"Value: {self.value}")

class Child(Parent):
    def is_even(self):
        return self.value % 2 == 0

    def __getattr__(self, attr):
        if attr == "secret":
            return "a_secret_string"

        raise AttributeError

>>> c = Child(5)

Now, if we try to access c.secret, Python would normally raise an AttributeError, as neither Child nor Parent can provide that attribute. As a last resort, though, Python runs c.__getattr__("secret"), and the code of that method that we implemented in the class Child returns the string "a_secret_string". Please note that the value of the argument attr is the name of the attribute as a string.

Because of the catch-all nature of __getattr__, we eventually have to raise an AttributeError to keep the inheritance mechanism working, unless we actually need or want to implement something very special.
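To see why that final raise matters, consider a catch-all __getattr__ that never raises: builtins like hasattr, which rely on AttributeError, are then fooled into reporting that any attribute exists. This is only a minimal illustration, not code from the post.

class Broken:
    def __getattr__(self, attr):
        return None  # never raises AttributeError

>>> b = Broken()
>>> hasattr(b, "anything")
True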

This opens the door to an interesting hybrid solution where we can compose objects retaining an automatic delegation mechanism.

class Parent:
    def __init__(self, value):
        self.value = value

    def info(self):
        print(f"Value: {self.value}")

class Child:
    def __init__(self, value):
        self.parent = Parent(value)

    def is_even(self):
        return self.value % 2 == 0

    def __getattr__(self, attr):
        return getattr(self.parent, attr)

>>> c = Child(5)
>>> c.value
5
>>> c.info()
Value: 5
>>> c.is_even()
False

As you can see, here Child is composing Parent and there is no inheritance between the two. We can nevertheless access c.value and call c.info, thanks to the fact that Child.__getattr__ delegates everything that can't be found in Child to the instance of Parent stored in self.parent.

Note: don't confuse getattr with __getattr__. The former is a builtin function that gets an attribute provided its name, a replacement for the dotted notation when the name of the attribute is known as a string. The latter is the hook into the inheritance mechanism that I described in this section.
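A short console example using the Page class defined earlier might make the difference clearer; this is just an illustration.

>>> page = Page("New post", "Some text for an exciting new post")
>>> getattr(page, "info")()  # equivalent to page.info()
{'title': 'New post', 'body': {'length': 34}}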

Now, this is very powerful, but is it also useful?

I think this is not one of the techniques that will drastically change the way you write code in Python, but it can definitely help you to use composition instead of inheritance even when the number of methods that you have to wrap is high. One of the limits of composition is that you are at the opposite end of the spectrum of automatism; while inheritance is completely automatic, composition doesn't do anything for you. This means that when you compose objects you need to decide which methods or attributes of the contained objects you want to wrap, in order to expose them in the container object. In the previous example, the class Child might want to expose the attribute value and the method info, which would result in something like

class Parent:
    def __init__(self, value):
        self.value = value

    def info(self):
        print(f"Value: {self.value}")

class Child:
    def __init__(self, value):
        self.parent = Parent(value)

    def is_even(self):
        return self.value % 2 == 0

    def info(self):
        return self.parent.info()

    @property
    def value(self):
        return self.parent.value

As you can easily see, the more Child wants to expose of the Parent interface, the more wrapper methods and properties you need. To be perfectly clear, in this example the code above smells, as there are too many one-liner wrappers, which tells me it would be better to use inheritance. But if the class Child had a dozen of its own methods, suddenly it would make sense to do something like this, and in that case, __getattr__ might come in handy.

Final words

Both composition and inheritance are tools, and both exist to serve the bigger purpose of code reuse, so learn their strengths and their weaknesses, so that you might be able to use the correct one and avoid future issues in your code.

I hope this rather long discussion helped you to get a better picture of the options you have when you design an object-oriented system, and also maybe introduced some new ideas or points of view if you are already comfortable with the concepts I wrote about.

Updates

2021-03-06 Following the suggestion of Tim Morris I added the console output to the source code to make the code easier to understand. Thanks Tim for the feedback!

Feedback

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

March 06, 2021 06:00 PM UTC


Andre Roberge

Going back in history

Imagine that you wish to run a program that takes a long time to run. Just in case something goes wrong, you decide to use friendly-traceback (soon to be renamed...) in interactive mode to run it. This turns out to be a good decision:

Time to explore what might be the problem, and where exactly things might have gone wrong.

Ooops ... a silly typo. Easy enough to correct:

Unfortunately, that did not work: Friendly-traceback, like all Python programs designed to handle exceptions, only captures the last one.

This has happened to me so many times; granted, it was always with short programs so that I could easily recreate the original exception. However, I can only imagine how frustrating it might be for beginners encountering this situation.

A solution


Fortunately, you are using the latest version of Friendly-traceback, the one that records exceptions that were captured, and allows you to discard the last one recorded (rinse, and repeat as often as needed), thus going back in history.



Now that we are set, we can explore further to determine what might have gone wrong.




March 06, 2021 12:58 PM UTC


Doug Hellmann

imapautofiler 1.11.0

New Features: A configuration option has been added to disable the use of SSL/TLS for servers that do not support it. The default is to always use SSL/TLS. Contributed by Samuele Zanon.

Upgrade Notes: This release drops support for Python 3.6 and 3.7.

March 06, 2021 11:14 AM UTC


PyBites

Are You Working on Your Mindset as Much as Your Technical Skills?

Do you want to read an amazing coaching / mindset book? Check out Wooden: A Lifetime of Observations and Reflections On and Off the Court:


In this post I wanted to share some of our favorite lessons:

  1. "Too often we neglect our journey in our eagerness or anxiety about reaching our goal ... The preparation is where success is truly found."

    Sometimes we become too obsessed with results and big, flashy goals: land a developer job, build a profitable side gig, etc.

    However if you want to be successful in the long term you have to fall in love with the game, the process, the daily reps.

    Only then can you become really great at anything and sustain the challenges you inevitably face.

  2. "When you improve a little each day, eventually big things occur."

    Some people post on social media for 3 days, code for a month, do 2 coding interviews, and when they don't see significant results they throw in the towel.

    You will only see consistent results gradually though. What you see today is a reflection of the past 2 years of actions - the tip of the iceberg.

    Take our platform for example, the people that rise to the top and leave amazing stories have been coding for many days, weeks and even years. They made many mistakes along the way and experimented every day.

    Adopt the 1% rule: consistent little improvements always beat a few big improvements (which are mostly an illusion).

  3. "Your reaction to victory or defeat is an important part of how you play the game."

    The Detroit Pistons were notorious for their game and how they behaved when they lost a championship.

    Winning or losing is one thing, how you react to it is way more important. Are you defeated or do you hit the gym again the next day?

    When you hit roadblocks ("I don't grasp OOP", "my code crashes", "my app slows down with 10x the load") - do you complain about it or do you fully embrace these obstacles and see them as fuel or opportunities for growth?

  4. "Never believe you're better than anybody else, but remember that you're just as good as everybody else."

    This one should be obvious. Once you think you're better than somebody, expect to go downhill quickly.

    Stay humble, everybody you encounter can teach you a valuable lesson. Newbies can open expert Pythonistas' eyes, just by reframing a problem with a beginner mindset.

    One of my favourite stories in this context is the kid that found a simple (clever) solution to get a truck unstuck.

  5. "You cannot function physically or mentally unless your emotions are under control."

    Emotions can cloud your judgement. It's often good to cool down before making any rash decisions.

    In this context we like the "hot letter" or "unsent angry letter" hack Abraham Lincoln (and other public figures) used.


These are the kinds of important mindset lessons we include in our PyBites Developer Mindset coaching program.

If you want to be coached on mindset in addition to Python and software development, book a Strategy Session with us.

Don't ignore the mindset side of things, it's as important as (if not more important than) the technical skills!

-- Bob

March 06, 2021 06:21 AM UTC


Codementor

pytest quick tip: Adding CLI options

Quick intro to adding CLI arguments to a pytest test suite in the context of pytest-selenium.

March 06, 2021 05:48 AM UTC


Test and Code

147: Testing Single File Python Applications/Scripts with pytest and coverage

Have you ever written a single file Python application or script?
Have you written tests for it?
Do you check code coverage?

This is the topic of this week's episode, spurred on by a listener question.

The questions:

- For single file scripts, I'd like to have the test code included right there in the file. Can I do that with pytest?
- If I can, can I use code coverage on it?

The example code discussed in the episode: script.py

def foo():
    return 5


def main():
    x = foo()
    print(x)


if __name__ == '__main__': # pragma: no cover
    main()

## test code

# To test:
# pip install pytest
# pytest script.py

# To test with coverage:
# put this file (script.py) in a directory by itself, say foo
# then from the parent directory of foo:
# pip install pytest-cov
# pytest --cov=foo foo/script.py

# To show missing lines
# pytest --cov=foo --cov-report=term-missing foo/script.py


def test_foo():
    assert foo() == 5


def test_main(capsys):
    main()
    captured = capsys.readouterr()
    assert captured.out == "5\n"

Suggestion by @cfbolz if you need to import pytest:

if __name__ == '__main__': # pragma: no cover
    main()
else:
    import pytest
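One situation where the test half of such a single-file script actually needs pytest imported is when it uses helpers such as pytest.raises; the following sketch (the function divide is made up for illustration) assumes that layout.

def divide(a, b):
    return a / b


def test_divide_by_zero():
    # pytest must be importable here, hence the else branch above
    with pytest.raises(ZeroDivisionError):
        divide(1, 0)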

Sponsored By:

Support Test & Code : Python Testing


March 06, 2021 02:00 AM UTC

March 05, 2021


PyBites

Don't Blame Yourself at Work

A workplace/career thought for you to consider today.

There are times in your career when things are going to feel pretty miserable.

You may feel underappreciated, feel that you're being micromanaged, ignored, etc.

It's natural that when this situation inevitably arises you'll start to doubt yourself and think that you're doing something wrong.

You'll ask yourself, "What am I doing wrong?", "Why do they hate me?", or "Am I even good enough to be doing this?".

In these moments it's important to take a step back and consider your situation from a distance. Take the emotion out of it and really analyse what's going on.

There's likely going to be some sort of change that's occurred in your life or around you to cause the degradation of your work environment. If it's something on your end, then take the necessary steps to fix it. Hold yourself to a high standard, own the change and get things back on track.

On the other hand though, it's important to check the temperature around you. By this I mean tactfully speak with people on your team or in your immediate work environment.

Quite often, and most likely, the problem is not you.

It's so easy for us to go down a path of self-destruction thinking we're at fault in these situations. It's further exacerbated by the loneliness that you'll feel. You don't naturally want to share your perceived "failings" with your colleagues so it might take quite a while before you realise you weren't the issue in the first place.

Finding someone you can trust and speak confidentially with on your team is crucial to finding out where the problem really lies.

Is it your manager? A new process? A shift in company culture? There are many things that can influence your day-to-day at work and it's so important not to jump to the conclusion that you're the "root of all evil" if things are feeling bleak.

My point here is don't blame yourself unnecessarily. Don't do it to yourself. Take the step back, analyse the situation and give it some earnest thought. Speak with those around you about how you're feeling and you'll likely find you're not alone. There's almost always a common denominator and I'd be willing to bet it's not you.

Just remember this if you ever find yourself feeling out of it at work.

-- Julian

To receive a career tip every Thursday, subscribe here.

March 05, 2021 06:52 PM UTC


Andre Roberge

Friendly-traceback will have a new name

tl; dr: I plan to change the name from friendly_traceback to friendly.


 When I started working on Friendly-traceback, I had a simple goal in mind:

Given an error message in a Python traceback, parse it and reformulate it into something that is easier for beginners to understand and that can be easily translated into languages other than English.

A secondary goal was to help users learn how to decipher a normal Python traceback and use the information provided by Python to understand what went wrong and how to fix it.

Early on, I quickly realised that this would not be helpful when users are faced with arguably the most frustrating error message of them all: 

SyntaxError: invalid syntax

Encouraged by early adopters, I then began a quest to go well beyond simply interpreting a given error message and to try to find a more specific cause of a given traceback. As Friendly-traceback became able to provide more and more information to users, I was faced with the realisation that too much information presented all at once could be counter-productive. Thus, the information was broken down and can now be requested in a console by asking what(), where(), why(), etc. If Friendly-traceback does not recognize a given error message, one can now simply type www() [name subject to change] and an Internet search for that specific message will be done using the default web browser.

By default, Friendly-traceback uses a custom exception hook to replace sys.excepthook: this definitely works with a standard Python interpreter. However, it does not work with IPython, Jupyter notebooks, IDLE (at least, not for Python 3.9 and older), etc.  So, custom modules now exist and users have to write:

Running a program from a terminal requires writing:

python -m friendly_traceback my_program.py [additional options]

All of these are rather long to type ...
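
As general background on the mechanism mentioned above, here is a minimal, generic sketch of replacing sys.excepthook in a standard interpreter. This is not Friendly-traceback's actual code, just an illustration of the hook it relies on:

import sys

def friendlier_hook(exc_type, exc_value, exc_traceback):
    # stand-in for friendlier formatting; a real tool would inspect
    # the traceback object and the error message here
    print(f"A {exc_type.__name__} occurred: {exc_value}")

# replace the default hook so uncaught exceptions go through our function
sys.excepthook = friendlier_hook

1 / 0  # now reported by friendlier_hook instead of the standard traceback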

In addition to tracebacks, I have been thinking of including Python warnings, and in particular SyntaxWarnings.

Along the same lines, when using the "friendly" console, I have added some experimental warnings, such as those shown below.




I do not know if these warnings will be part of future versions of Friendly-traceback. What I do know is that I want to consider incorporating things other than tracebacks that might be useful to beginners and/or to non-English speakers.

Back to the name change.  I have typed "friendly_traceback" many, many times.  It is long and annoying to type. When I work at a console, I often do:

import friendly_traceback as ft

and proceed from there.

I suspect that not too many potential users would be fond of friendly_traceback as a name. Furthermore, I wonder how convenient it is to type a name with an underscore character when using a non-English keyboard. Finally, whenever I write about Friendly-traceback, it is a hyphen that is used between the two names, and not an underscore character: one more possible source of confusion.

For all these reasons, I plan to soon change the name to be simply "friendly". This will almost certainly be done when the version number increases from 0.2.xy to 0.3.0 ... which is going to happen "soon".

Such a name change will mean a major editing job on the extensive documentation, which currently includes 76 screenshots, most of which have "friendly_traceback" in them. This means that they will all have to be redone. Of course, the most important work to be done will be changing the source code itself; however, this should be fairly easy to do with a global search/replace.



March 05, 2021 03:56 PM UTC


Stack Abuse

Python: Check if Array/List Contains Element/Value

Introduction

In this tutorial, we'll take a look at how to check if a list contains an element or value in Python. We'll use a list of strings, containing a few animals:

animals = ['Dog', 'Cat', 'Bird', 'Fish']

Check if List Contains Element With for Loop

A simple and rudimentary method to check if a list contains an element is looping through it, and checking if the item we're on matches the one we're looking for. Let's use a for loop for this:

for animal in animals:
    if animal == 'Bird':
        print('Chirp!')

This code will result in:

Chirp!

Check if List Contains Element With in Operator

Now, a more succinct approach is to use the built-in in operator with an if statement instead of a for loop. The in operator evaluates to True if an element exists in a sequence and False otherwise. The syntax of the in operator looks like this:

element in list

Making use of this operator, we can shorten our previous code into a single statement:

if 'Bird' in animals: print('Chirp')

This code fragment will output the following:

Chirp

This approach has the same efficiency as the for loop, since the in operator, used like this, calls the list's __contains__ method, which loops through the list internally - though it's much more readable.

Check if List Contains Element With not in Operator

By contrast, we can use the not in operator, which is the logical opposite of the in operator. It returns True if the element is not present in a sequence.

Let's rewrite the previous code example to utilize the not in operator:

if 'Bird' not in animals: print('Chirp')

Running this code won't produce anything, since 'Bird' is present in our list.

But if we try it out with a Wolf:

if 'Wolf' not in animals: print('Howl')

This code results in:

Howl

Check if List Contains Element With Lambda

Another way you can check if an element is present is to filter out everything other than that element, just like sifting through sand and checking if there are any shells left in the end. The built-in filter() function accepts a function (here, a lambda) and an iterable as its arguments. We can use a lambda function here to check for our 'Bird' string in the animals list.

Then, we wrap the result in a list() call, since filter() returns a filter object (a lazy iterator), not the results themselves. If we pack the filter object into a list, it'll contain the elements left after filtering:

retrieved_elements = list(filter(lambda x: 'Bird' in x, animals))
print(retrieved_elements)

This code results in:

['Bird']

Now, this approach isn't the most efficient. It's noticeably slower than the previous three approaches we've used. The filter() call itself is equivalent to the generator expression:

(item for item in iterable if function(item))

The slower performance of this code comes, among other things, from the fact that we convert the results into a list at the end, as well as calling a function on each item during iteration.

Check if List Contains Element Using any()

Another great built-in approach is to use the any() function, a helper function that checks whether there is at least one instance of an element in a list. It returns True or False based on whether the element is present:

if any(element == 'Bird' for element in animals):
    print('Chirp')

Since this results in True, our print() statement is called:

Chirp

This approach is also an efficient way to check for the presence of an element. It's as efficient as the first three.

Check if List Contains Element Using count()

Finally, we can use the count() function to check if an element is present or not:

list.count(element)

This method returns the number of occurrences of the given element in a sequence. If it's greater than 0, we can be assured that the given item is in the list.

Let's check the results of the count() function:

if animals.count('Bird') > 0:
    print("Chirp")

The count() method loops through the list internally to count the occurrences, and this code results in:

Chirp

Conclusion

In this tutorial, we've gone over several ways to check if an element is present in a list or not. We've used the for loop, in and not in operators, as well as the filter(), any() and count() methods.

March 05, 2021 01:30 PM UTC


Real Python

The Real Python Podcast – Episode #50: Consuming APIs With Python and Building Microservices With gRPC

Have you wanted to get your Python code to consume data from web-based APIs? Maybe you've dabbled with the requests package, but you don't know what steps to take next. This week on the show, David Amos is back, and he's brought another batch of PyCoder's Weekly articles and projects.
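
If you have only dabbled with requests so far, the basic pattern the episode builds on looks roughly like the sketch below; the URL is a placeholder, not an API discussed on the show:

import requests

# placeholder endpoint; substitute the API you actually want to consume
url = "https://api.example.com/todos/1"

response = requests.get(url, timeout=10)
response.raise_for_status()  # raise an exception on 4xx/5xx responses
data = response.json()       # decode the JSON body into a Python dict
print(data)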


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

March 05, 2021 12:00 PM UTC


Talk Python to Me

#306 Scaling Python and Jupyter with ZeroMQ

When we talk about scaling software, threading and async get all the buzz. And while they are powerful, using asynchronous queues can often be much more effective. You might think this means creating a Celery server, maybe running RabbitMQ or Redis as well. What if you wanted this async ability and many more message exchange patterns like pub/sub, but you wanted to do zero of that server work? Then you should check out ZeroMQ.

ZeroMQ is to queuing what Flask is to web apps: a powerful and simple framework for you to build just what you need. You're almost certain to learn some new networking patterns and capabilities in this episode with our guest Min Ragan-Kelley, as we discuss using ZeroMQ from Python as well as how ZeroMQ is central to the internals of Jupyter Notebooks.

Links from the show:

Min on Twitter: https://twitter.com/minrk
Simula Lab: https://www.simula.no/research
Talk Python Binder episode: https://talkpython.fm/256
The ZeroMQ Guide: https://zguide.zeromq.org/
Binder: https://mybinder.org
IPython for parallel computing: https://ipyparallel.readthedocs.io
Messaging in Jupyter: https://jupyter-client.readthedocs.io/en/stable/messaging.html
DevWheel Package: https://pypi.org/project/delvewheel/
cibuildwheel: https://pypi.org/project/cibuildwheel/
YouTube Live Stream: https://www.youtube.com/watch?v=AIq4fO5t_ks
PyCon Ticket Contest: https://talkpython.fm/pycon2021

Sponsors:

Linode: https://talkpython.fm/linode
Mito: https://talkpython.fm/mito
Talk Python Training: https://talkpython.fm/training
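
To give a flavour of how little code ZeroMQ needs, here is a minimal request/reply sketch using the pyzmq bindings. The port and messages are placeholders and this is an illustration, not code from the episode:

import zmq

def server():
    # REP socket: waits for a request, sends back a single reply
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind("tcp://127.0.0.1:5555")
    message = sock.recv_string()
    sock.send_string(f"echo: {message}")

def client():
    # REQ socket: sends a request, blocks until the reply arrives
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect("tcp://127.0.0.1:5555")
    sock.send_string("hello")
    print(sock.recv_string())

No broker process is involved: the two sockets talk to each other directly, which is the "zero" in ZeroMQ. Other socket types such as PUB/SUB and PUSH/PULL follow the same pattern.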

March 05, 2021 08:00 AM UTC


Python Pool

Python Shutil Module: 10 Methods You Should Know

The shutil module in Python provides many functions for performing high-level operations on files and collections of files. It is a built-in module that automates tasks such as copying and removing files and directories. It also takes care of low-level semantics, like creating and closing file objects once they are copied, so you can focus on your business logic.

How does the Python shutil module work?

The basic syntax to use shutil module is as follows:

import shutil
shutil.submodule_name(arguments)

File-Directory operations

1. Python shutil.copy()

shutil.copy(): This function is used to copy the contents of the source file to the destination file or directory. It also preserves the file's permission mode, but other metadata, such as the file's creation and modification times, is not preserved.

import os 
  
# import the shutil module  
import shutil 
  
# write the path of the file
path = '/home/User'
  
# List all the files and directories in the given path
print("Before copying file:") 
print(os.listdir(path)) 
  
  
# write the Source path 
source = "/home/User/file.txt"
  
# Print the file permission of the source given
perms = os.stat(source).st_mode 
print("File Permission mode:", perms, "\n") 
  
# Write the Destination path 
destinationfile = "/home/User/file(copy).txt"
  
# Copy the content of source file to destination file 
dests = shutil.copy(source, destinationfile) 
  
# List files and directories of the path 
print("After copying file:") 
print(os.listdir(path)) 
  
# Print again all the file permission
perms = os.stat(destinationfile).st_mode 
print("File Permission mode:", perms) 
  
# Print the path of the newly created file
print("Destination path:", dests) 

Output:

Before copying file:
['hrithik.png', 'test.py', 'file.text', 'copy.cpp']
File permission mode: 33188

After copying file:
['hrithik.png', 'test.py',  'file.text', 'file(copy).txt', 'copy.cpp']
File permission mode: 33188 
Destination path: /home/User/file(copy).txt

Explanation:

In this code, we first list the files present in the directory. We then print the source file's permission mode and provide its path. Next, we give the destination path so that the contents are copied into a new file. Finally, we list all the files in the directory again to check whether the copy of the file was created.

2. Python shutil.copy2()

This function is just like copy(), except that it also preserves the metadata of the source file.

import os
import shutil
import time

def show_file_info(filename):
    stat_info = os.stat(filename)
    print('\tMode    :', stat_info.st_mode)
    print('\tCreated :', time.ctime(stat_info.st_ctime))
    print('\tAccessed:', time.ctime(stat_info.st_atime))
    print('\tModified:', time.ctime(stat_info.st_mtime))

os.mkdir('example')
print('SOURCE time:')
show_file_info('shutil_copy2.py')
shutil.copy2('shutil_copy2.py', 'example')
print('DESTINATION time:')
show_file_info('example/shutil_copy2.py')

Output:

SOURCE time:
        Mode    : 33188
        Created : Sat Jul 16 12:28:43 2020
        Accessed: Thu Feb 21 06:36:54 2021
        Modified: Sat Feb 19 19:18:23 2021
DESTINATION time:
        Mode    : 33188
        Created : Mon Mar 1 06:36:54 2021
        Accessed: Mon Mar 1 06:36:54 2021
        Modified: Tue Mar 2 19:18:23 2021 

Explanation:

In this code, we have used the copy2() function, which works the same as copy() but performs one extra step: it preserves the file's metadata.

3. Python shutil.copyfile()

This function copies the contents of the source file to the destination file name. In this example, the duplicate of the file is created in the same directory under the new name.

import os
import shutil

print('BEFORE LIST:', os.listdir('.'))
shutil.copyfile('file_copy.py', 'file_copy.py.copy')
print('AFTER LIST:', os.listdir('.'))

Output:

Latracal:shutil Latracal$ python file_copy.py
BEFORE LIST:
['.DS_Store', 'file_copy.py']
AFTER LIST:
['.DS_Store', 'file_copy.py', 'file_copy.py.copy']

Explanation:

In this code, we have used the copyfile() function: the file is copied under a new name, with ".copy" appended to the original file name, as you can see in the output.

4. Python shutil.copytree()

This function recursively copies a directory along with all of its files and subdirectories to another directory, so the tree is then present in both the source and the destination. Both parameters must be passed as strings.

import pprint
import shutil
import os

shutil.copytree('../shutil', './Latracal')
pprint.pprint(os.listdir('./Latracal'))

Output:

Latracal:shutil Latracal$ python clone_directory.py
['.DS_Store',
 'file_copy.py',
 'file_copy_new.py',
 'file_with_metadata.py',
 'clone_directory.py']

Explanation:

In this code, we have used the copytree() function so that we get a duplicate of the entire directory.

5. Python shutil.rmtree()

This function removes the specified directory along with all of its files and subdirectories, which means the entire directory tree is deleted from the system.

import pprint
import shutil
import os

print('BEFORE:')
pprint.pprint(os.listdir('.'))

shutil.rmtree('Latracal')

print('\nAFTER:')
pprint.pprint(os.listdir('.'))

Output:

Latracal:shutil Latracal$ python remove_dir.py
BEFORE:
['.DS_Store',
 'file_copy.py',
 'file_copy_new.py',
 'remove_dir.py',
 'copy_with_metadata.py',
 'Latracal',
 'clone_directory.py']

AFTER:
['.DS_Store',
 'file_copy.py',
 'file_copy_new.py',
 'remove_dir.py',
 'copy_with_metadata.py',
 'clone_directory.py']

Explanation:

In this code, we have used the rmtree() function to remove a directory. We first list all the files, then apply the function to remove the Latracal directory, and finally list the files again to confirm that the directory has been deleted.

6. Python shutil.which()

The which() function is a handy tool for finding the path of an executable on your machine: it returns the path that would be used if the given command were run, or None if the command cannot be found.

import shutil
import sys

print(shutil.which('bsondump'))
print(shutil.which('no-such-program'))

output:

Latracal:shutil Latracal$ python find_file.py
/usr/local/mongodb@3.2/bin/bsondump
None

Explanation:

In this code, we have used the which() function so that we can locate an executable's path when required.

7. Python shutil.disk_usage()

This function tells us how much space is on our file system: calling disk_usage() returns the total, used, and free disk space for the given path.

import shutil

total_mem, used_mem, free_mem = shutil.disk_usage('.')
gb = 10 **9

print('Total: {:6.2f} GB'.format(total_mem/gb))
print('Used : {:6.2f} GB'.format(used_mem/gb))
print('Free : {:6.2f} GB'.format(free_mem/gb))

Output:

Total: 499.90 GB
Used : 187.72 GB
Free : 308.26 GB

Explanation:

In this code, we have written the function disk_usage() to get to know about the total, used, and free disk space.

8. Python shutil.move()

This function moves a file or directory from one location to another and removes it from the previous location. It can also be thought of as renaming the file or directory.

import shutil
shutil.move('hello.py','newdir/')

Output:

 'newdir/hello.py'

Explanation:

In this code, we have used the move() function to move the file or directory from one place to another.

9. Python shutil.make_archive()

This function is used to build an archive (zip or tar) of files in the root directory.

import shutil
import pprint

root_directory='newdir'
shutil.make_archive("newdirabcd","zip",root_directory)

output:

'C:\\python\\latracal\\newdirabcd.zip' 

Explanation:

In this code, we have used the make_archive() function, giving it the archive name, the format, and the root directory in order to build an archive of the files under that directory.

10. Python shutil.get_archive_formats()

This function returns a list of the archive formats supported on the system.

import shutil

print(shutil.get_archive_formats())

output:

[('bztar', "bzip2'ed tar-file"), ('gztar', "gzip'ed tar-file"), ('tar', 'uncompressed tar file'), ('xztar', "xz'ed tar-file"), ('zip', 'ZIP file')]

Explanation:

In this code, we have used the get_archive_formats() function to list the archive formats supported on the system.


Conclusion

In this article, we have studied how to perform high-level file operations, such as copying the contents of a file and creating a new copy of a file, with the shutil module in Python, without diving into complex file-handling code.

However, if you have any doubts or questions, do let me know in the comment section below. I will try to help you as soon as possible.

Happy Pythoning!

The post Python Shutil Module: 10 Methods You Should Know appeared first on Python Pool.

March 05, 2021 04:30 AM UTC

The Insider’s Guide to A* Algorithm in Python

The A* algorithm, in Python or in general, is an artificial intelligence search technique used for pathfinding (getting from point A to point B) and graph traversal. This algorithm is flexible and can be used in a wide range of contexts. A* combines the actual cost of the path taken so far with a heuristic estimate of the remaining cost to the goal. It was first published by Peter Hart, Nils Nilsson, and Bertram Raphael in 1968.

Why A* Algorithm?

This algorithm is an advanced form of the BFS algorithm (breadth-first search): it explores cheaper paths before more expensive ones. It is both a complete and an optimal approach to solving path and grid problems.

Optimal – it finds the least-cost path from the starting point to the ending point. Complete – it is guaranteed to find a path from start to end if one exists.

Basic concepts of A*

A* chooses which node to expand next based on the value f(n) = g(n) + h(n), where:

g(n): the actual cost of the path from the start node to the current node.

h(n): a heuristic estimate of the cost from the current node to the goal node.

f(n): the estimated total cost of a path from the start node to the goal node that passes through the current node.

For the implementation of A* algorithm we have to use two arrays namely OPEN and CLOSE.

OPEN:

An array that contains the nodes that have been generated but not yet examined.

CLOSE:

An array that contains the nodes that have already been examined.

Algorithm

1: Firstly, Place the starting node into OPEN and find its f (n) value.

2: Then remove the node with the smallest f (n) value from OPEN. If it is a goal node, then stop and return success.

3: Otherwise, find all the successors of that node.

4: Find the f (n) value of all the successors, place them into OPEN, and place the removed node into CLOSE.

5: Goto Step-2.

6: Exit.

Advantages of A* Algorithm in Python

Disadvantages of A* Algorithm in Python

Pseudo-code of A* algorithm

let openList equal empty list of nodes
let closedList equal empty list of nodes
put startNode on the openList (leave its f at zero)
while openList is not empty
    let currentNode equal the node with the least f value
    remove currentNode from the openList
    add currentNode to the closedList
    if currentNode is the goal
        You've found the exit!
    let children of the currentNode equal the adjacent nodes
    for each child in the children
        if child is in the closedList
            continue to beginning of for loop
        child.g = currentNode.g + distance b/w child and current
        child.h = distance from child to end
        child.f = child.g + child.h
        if child.position is in the openList's nodes positions
            if child.g is higher than the openList node's g
                continue to beginning of for loop
        add the child to the openList

A* Algorithm code for Graph

The A* algorithm excels at finding paths from one place to another, and it always makes sure that the path it finds is the most efficient. Below is an implementation of A* on a graph structure.

A* Algorithm Python code for Graph

class Graph:
    def __init__(self, adjac_lis):
        self.adjac_lis = adjac_lis

    def get_neighbors(self, v):
        return self.adjac_lis[v]

    # This is the heuristic function, which has equal values for all nodes
    def h(self, n):
        H = {
            'A': 1,
            'B': 1,
            'C': 1,
            'D': 1
        }

        return H[n]

    def a_star_algorithm(self, start, stop):
        # open_lst is a set of nodes which have been visited, but whose
        # neighbours haven't all been inspected; it starts off with the start node.
        # closed_lst is a set of nodes which have been visited
        # and whose neighbours have all been inspected.
        open_lst = set([start])
        closed_lst = set([])

        # poo has present distances from start to all other nodes
        # the default value is +infinity
        poo = {}
        poo[start] = 0

        # par contains an adjac mapping of all nodes
        par = {}
        par[start] = start

        while len(open_lst) > 0:
            n = None

            # it will find a node with the lowest value of f() -
            for v in open_lst:
                if n == None or poo[v] + self.h(v) < poo[n] + self.h(n):
                    n = v

            if n == None:
                print('Path does not exist!')
                return None

            # if the current node is the stop node,
            # reconstruct the path by walking back through the parents
            if n == stop:
                reconst_path = []

                while par[n] != n:
                    reconst_path.append(n)
                    n = par[n]

                reconst_path.append(start)

                reconst_path.reverse()

                print('Path found: {}'.format(reconst_path))
                return reconst_path

            # for all the neighbors of the current node do
            for (m, weight) in self.get_neighbors(n):
                # if the neighbour m is not present in either open_lst or closed_lst,
                # add it to open_lst and record n as its parent
                if m not in open_lst and m not in closed_lst:
                    open_lst.add(m)
                    par[m] = n
                    poo[m] = poo[n] + weight

                # otherwise, check if it's quicker to first visit n, then m
                # and if it is, update par data and poo data
                # and if the node was in the closed_lst, move it to open_lst
                else:
                    if poo[m] > poo[n] + weight:
                        poo[m] = poo[n] + weight
                        par[m] = n

                        if m in closed_lst:
                            closed_lst.remove(m)
                            open_lst.add(m)

            # remove n from the open_lst, and add it to closed_lst
            # because all of its neighbours were inspected
            open_lst.remove(n)
            closed_lst.add(n)

        print('Path does not exist!')
        return None

INPUT:

adjac_lis = {
    'A': [('B', 1), ('C', 3), ('D', 7)],
    'B': [('D', 5)],
    'C': [('D', 12)]
}
graph1 = Graph(adjac_lis)
graph1.a_star_algorithm('A', 'D')

OUTPUT:

Path found: ['A', 'B', 'D']
['A', 'B', 'D']

Explanation:

In this code, we have created a class named Graph whose methods each perform a different operation: get_neighbors() returns a node's adjacent nodes, h() is the heuristic function, and a_star_algorithm() carries out the search. The algorithm repeatedly picks the open node with the lowest f() value, updates the tentative costs of its neighbours, and, once the goal is reached, reconstructs and returns the shortest path from the start node to the goal.

Conclusion

A* in Python is a powerful and versatile algorithm. However, it is only as good as its heuristic function, which varies greatly with the nature of the problem. It has found applications in software systems ranging from machine learning and search optimization to game development.

The post The Insider’s Guide to A* Algorithm in Python appeared first on Python Pool.

March 05, 2021 01:12 AM UTC

March 04, 2021


Patrick Kennedy

Server-side Sessions in Flask with Redis

I wrote a blog post on TestDriven.io about how server-side sessions can be implemented in Flask with Flask-Session and Redis:

https://testdriven.io/blog/flask-server-side-sessions/

This blog post looks at how server-side sessions can be implemented in Flask, covering the following topics:
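
The post's topic list isn't reproduced here, but as a rough illustration of the kind of setup it describes, here is a minimal sketch of Flask-Session backed by Redis. The Redis URL, secret key, and route names are placeholder assumptions, not taken from the article:

import redis
from flask import Flask, session
from flask_session import Session

app = Flask(__name__)
app.config["SECRET_KEY"] = "change-me"  # placeholder secret key
app.config["SESSION_TYPE"] = "redis"    # store session data server-side in Redis
app.config["SESSION_REDIS"] = redis.from_url("redis://127.0.0.1:6379")

Session(app)  # activate server-side sessions

@app.route("/set/")
def set_value():
    # the data goes into Redis; the browser only receives a session identifier
    session["favourite_colour"] = "blue"
    return "stored in the server-side session"

@app.route("/get/")
def get_value():
    return session.get("favourite_colour", "not set")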

March 04, 2021 04:58 PM UTC


Python Morsels

Inheriting one class from another

Watch first

Need a bit more background? Or want to dive deeper?

Watch other class-related screencasts.

Transcript:

How does class inheritance work in Python?

Creating a class that inherits from another class

We have a class called FancyCounter, that inherits from another class, Counter (which is in the collections module in the Python standard library):

from collections import Counter


class FancyCounter(Counter):
    def commonest(self):
        (value1, count1), (value2, count2) = self.most_common(2)
        if count1 == count2:
            raise ValueError("No unique most common value")
        return value1

The way we know we're inheriting from the Counter class is that when we defined FancyCounter, just after the class name we put parentheses and wrote Counter inside them.

To create a class that inherits from another class, after the class name you'll put parentheses and then list any classes that your class inherits from.

In a function definition, parentheses after the function name represent arguments that the function accepts. In a class definition the parentheses after the class name instead represent the classes being inherited from.

Usually when practicing class inheritance in Python, we inherit from just one class. You can inherit from multiple classes (that's called multiple inheritance), but it's a little bit rare. We'll only discuss single-class inheritance right now.

Methods are inherited from parent classes

To use our FancyCounter class, we can call it (just like any other class):

>>> from fancy_counter import FancyCounter
>>> letters = FancyCounter("Hello there!")

Our class will accept a string when we call it because the Counter class has implemented an __init__ method (an initializer method).

Our class also has a __repr__ method for a nice string representation:

>>> letters
FancyCounter({'e': 3, 'l': 2, 'H': 1, 'o': 1, ' ': 1, 't': 1, 'h': 1, 'r': 1, '!': 1})

It even has a bunch of other functionality too. For example, it has overridden what happens when you use square brackets to assign key-value pairs on class instances:

>>> letters['l'] = -2
>>> letters
FancyCounter({'e': 3, 'H': 1, 'o': 1, ' ': 1, 't': 1, 'h': 1, 'r': 1, '!': 1, 'l': -2})

We can assign key-value pairs because our parent class, Counter, creates dictionary-like objects.

All of that functionality was inherited from the Counter class.

Adding new functionality while inheriting

So our FancyCounter class inherited all of the functionality that the Counter class has, but we've also extended it by adding an additional method, commonest, which gives us the most common item in our counter.

When we call the commonest method, we'll get the letter e (which occurs three times in the string we originally gave to our FancyCounter object):

>>> letters.commonest()
'e'

Our commonest method relies on the most_common method, which we didn't define but which our parent class, Counter, did define:

    def commonest(self):
        (value1, count1), (value2, count2) = self.most_common(2)
        if count1 == count2:
            raise ValueError("No unique most common value")
        return value1

Our FancyCounter class has a most_common method because our parent class, Counter, defined it for us!

Overriding inherited methods

If we wanted to customize what happens when we assigned to a key-value pair in this class, we could do that by overriding the __setitem__ method. For example, let's make it so that if we assign a key to a negative value, it instead assigns it to 0.

Before, when we assigned letters['l'] to -2, it was set to -2; we'd like it to be set to 0 instead (it's -2 here because we haven't customized this yet):

>>> letters['l'] = -2
>>> letters['l']
-2

To customize this behavior we'll make a __setitem__ method that accepts self, key, and value because that's what __setitem__ is given by Python when it's called:

    def __setitem__(self, key, value):
        value = max(0, value)

The above __setitem__ method basically says: if value is negative, set it to 0.

If we stopped writing our __setitem__ at this point, it wouldn't be very useful. In fact, that __setitem__ method would do nothing at all: it wouldn't give an error, but it wouldn't actually do anything either!

In order to do something useful we need to call our parent class's __setitem__ method. We can call our parent class' __setitem__ method by using super.

    def __setitem__(self, key, value):
        value = max(0, value)
        return super().__setitem__(key, value)

We're calling super().__setitem__(key, value), which will call the __setitem__ method on our parent class (Counter) with key and our new non-negative value.

Here's a full implementation of this new version of our FancyCounter class:

from collections import Counter


class FancyCounter(Counter):
    def commonest(self):
        (value1, count1), (value2, count2) = self.most_common(2)
        if count1 == count2:
            raise ValueError("No unique most common value")
        return value1

    def __setitem__(self, key, value):
        value = max(0, value)
        return super().__setitem__(key, value)

To use this class we'll call it and pass in a string again:

>>> from fancy_counter import FancyCounter
>>> letters = FancyCounter("Hello there!")

But this time, if we assign a key to a negative value, we'll see that it will be assigned to 0 instead:

>>> letters['l'] = -2
>>> letters['l']
0

Summary

If you want to extend another class in Python, taking all of its functionality and adding more functionality to it, you can put some parentheses after your class name and then write the name of the class that you're inheriting from.

If you want to override any of the existing functionality in that class, you'll make a method with the same name as an existing method in your parent class. Usually (though not always) when overriding an existing method, you'll want to call super in order to extend the functionality of your parent class rather than completely overriding it.

Using super allows you to delegate back up to your parent class, so you can essentially wrap around the functionality that it has and tweak it a little bit for your own class's use.

That's the basics of class inheritance in Python.

March 04, 2021 04:00 PM UTC