
Planet Python

Last update: May 01, 2026 07:44 PM UTC

May 01, 2026


Rodrigo Girão Serrão

TIL #144 – Sentinel built-in

Today I learned Python 3.15 will get a new sentinel built-in.

Sentinel values are unique placeholder values that are commonly used in programming. Python 3.15 ships with a new built-in sentinel that can be used to create new sentinel values:

# Python 3.15+
>>> MISSING = sentinel("MISSING")
>>> MISSING
MISSING

Before this built-in was added, the most common sentinel idiom used the built-in object:

MISSING = object()

def my_function(some_arg=MISSING):
    if some_arg is MISSING:
        ... # Handle the sentinel

In the function above, the sentinel value MISSING is used to check whether the user passed anything for the parameter some_arg. PEP 661, which introduced this built-in, has a great discussion of the reasons why this pattern, and many other sentinel idioms, fall short. In general, each common sentinel idiom suffers from at least one of the following problems:

  1. Bad string repr: the string representation is too long and uninformative
  2. Type unsafe: the sentinels don't have a distinct type so it becomes hard or impossible to write code that uses the sentinels and is type safe
  3. Unexpected copy behaviour: the sentinels can't be copied or pickled without breaking the sentinel behaviour
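For example, two of these problems are easy to see with the object() idiom from above (a quick REPL sketch; the exact memory address will vary on your machine):

>>> MISSING = object()
>>> MISSING  # problem 1: a long, uninformative repr
<object object at 0x7f2b8c0a4e60>
>>> import pickle
>>> pickle.loads(pickle.dumps(MISSING)) is MISSING  # problem 3: identity breaks
False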

May 01, 2026 05:49 PM UTC


Mike Driscoll

Textual-cogs 0.0.5 Released

I always thought it would be fun to create my own open source libraries or applications and distribute them somehow. When I started writing my book, Creating TUI Applications with Textual and Python, I took the plunge and wrote a helper package called textual-cogs, which is a collection of reusable dialogs and widgets for Textual. Right now, it is mostly just dialogs, but I do hope to add some widgets to it as well.

Anyway, I have released two new dialogs in the past week, with one in v0.0.4 and the other in v0.0.5.

A Textual Directory Dialog

In v0.0.5, I added a directory dialog similar to wxPython’s wx.DirDialog. The dialog will display the user’s directories and allow the user to choose one. It will also allow the user to create a new folder.

Here’s a screenshot:

Textual cogs - Directory Dialog

A Textual Open File Dialog

In v0.0.4, I also added an open file dialog. Textual cogs already has a save file dialog, and I had meant to include the open file dialog originally, but only recently got it added.

Here is what that looks like:

Textual cogs - Open File Dialog

How to Install textual-cogs

You can install textual-cogs using pip or uv:

python -m pip install textual-cogs
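If you prefer uv, the equivalent command (assuming you have uv installed) is:

uv pip install textual-cogs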

Where to Get textual-cogs

You can find textual-cogs on the following websites:

The post Textual-cogs 0.0.5 Released appeared first on Mouse Vs Python.

May 01, 2026 02:58 PM UTC


Real Python

The Real Python Podcast – Episode #293: Agentic Data Science Pair Programming With marimo pair

How do you add agent skills to your data science workflow? How can a coding agent assist with data wrangling and research? This week on the show, Trevor Manz from marimo joins us to discuss marimo pair.



May 01, 2026 12:00 PM UTC

Quiz: The Factory Method Pattern and Its Implementation in Python

In this quiz, you’ll test your understanding of The Factory Method Pattern and Its Implementation in Python.

Factory Method is one of the most widely used design patterns, and it’s a powerful tool for separating object creation from object use in your code.

By working through this quiz, you’ll revisit the components of the pattern, recognize opportunities to apply it, and see how you can implement a reusable, general-purpose solution in Python.



May 01, 2026 12:00 PM UTC


Luke Plant

Inverse Sapir-Whorf and programming languages

The Sapir-Whorf hypothesis, in its simplest form, is the idea that the language you speak influences the thoughts you think. This post is about a twist on this idea, which I’m calling “Inverse Sapir-Whorf” (for want of a better term), and how we see it in computer programming languages.

Sapir-Whorf is one of those ideas that has been popularised in general culture in a rather misrepresented and exaggerated form. In the field of linguistics, not many people today take seriously the “strong” forms of Sapir-Whorf, such as “linguistic determinism” – the idea that a language controls your thoughts or limits what you can think, or that you even need certain languages to think certain thoughts.

For example, just because a language might lack grammatical tenses, it doesn’t at all follow that the speakers will be more limited in how they think about time – there are always other ways you can express time.

There is a fair amount of evidence that spoken languages can affect perception, skill and attitudes in certain areas, but it’s usually hard to demonstrate a large direct effect.

Inverse Sapir-Whorf is a bit different. I haven’t been able to track down where I first came across the idea, but it goes like this: if classic Sapir-Whorf says your language limits what you can say or think, or makes it hard to say some things, inverse Sapir-Whorf says your language limits what you can’t say, or makes it hard not to say some things, or even hard not to think about some things. Some examples might clear things up.

Examples in natural language

There are many examples to choose from, but they are not always obvious to native speakers of a language. I’ll pick just a few.

English: temporary or permanent present tense

What’s the difference between someone saying “I’m living in London” and “I live in London”? A non-native speaker may not pick this up at all, and a native speaker may pick it up only subconsciously, but “I’m living in London” reveals that the arrangement is temporary.

Now, this might not even be to do with the actual length of time you have been living there, because “temporary” is pretty relative. It might be more about how much you like London. You have to choose a tense, and because you typically do so subconsciously, the language is forcing you to reveal things – either the period of time you’ve been living somewhere, or how you feel about it.

English/Turkish/French: gendered pronouns and nouns

In English, in normal speech you are going to use “he” or “she” when referring to a specific person. “Singular they” does exist, but it’s very unnatural if you are talking about a specific person of known or assumed sex.

You can compare this to another language which doesn’t have gendered pronouns, such as Turkish, which just has “o” for he/she/it. The lack of gendered pronouns in Turkish doesn’t stop you from thinking or talking about a person’s sex, or produce a “less gendered society”, or anywhere close, so it would be difficult to find support for normal Sapir-Whorf here. But the inverse Sapir-Whorf is obvious – English pronouns push you to talk about it whether you want to or not. If you are trying to talk about someone you know, but do so anonymously, it can be very hard to avoid making their identification easier by revealing their sex with an inadvertent “him” or “her”.

Different again is French, in which nouns are gendered, which in some cases can force you to reveal information. If you translate “my friend” into French, you have to choose between “mon ami” (male friend) and “mon amie” (female friend), which are distinct, at least in written form, or “mon copain” vs “ma copine”. Possessive pronouns are also interesting – they are gendered in both English and French (his/her, son/sa), but refer to the gender of the possessor and possessee respectively, and so reveal different information.

Turkish: “mış” tense

With some simplifications, Turkish has two main past tenses: there is the normal one that is similar to “simple past” in English, and then there is the “mış” form (you can pronounce that “mish” if you want).

This has various functions, but when describing a past event, this form is used when you have second hand or unreliable information. If someone asks you “Did Fred come to work on Monday?”, then if you saw him you would use the normal past tense “geldi” (he came), but if you only heard that he came you would instead say “gelmiş” (he came, but second hand information).

The interesting thing to me as a non-native speaker was the effect of having these options, in contrast to English where you can just use simple past tense without any specific indication of reliability or where the information came from. In certain circumstances, Turkish forces you to include information about your level of certainty or whether you witnessed something – the simple past form is not neutral, because the existence of the “mış” form makes it an unnatural choice if it is not the most appropriate of the two.

Interestingly, having learned to think that way, my wife and I have noticed an effect on our English. Often in Turkish the “mış” suffix would come at the end of the last word in a sentence, so now quite frequently we get to the end of an English sentence and notice that we haven’t put in any marker for “this-is-second-hand-info-I-didn’t-actually-witness-it”, and so we tack “mış” on the end.

Of course, you can easily express the same thing in English, using words like “apparently” and other means, but English doesn’t force you to specify, while Turkish pretty much does.

Comments

You often don’t notice these things until you learn another language, or attempt to teach your language to a foreigner. You kind of just understand them subconsciously. The vast majority of times you choose simple present over present continuous, for example, you won’t be consciously thinking about what that implies.

I should also note that when a language forces you to express something, it might not be in the form of something included, but in something omitted. For example, I might say “I love cake” or “I love the cake”. In the first case, I’m talking about cake generally; in the second, about a specific cake. It is the absence of the word “the” in the first case that makes it unambiguous that I’m referring to all cake, because if I’m referring to a specific cake, I must use the word “the” or some other marker like “this”. In another language, there might not be a direct equivalent to this distinction.

Examples in programming

When it comes to programming languages, I think that the “straight” version of Sapir-Whorf is closer to being true - in some programming languages it is simply hard to express certain concepts. For example, in a language like Python or Haskell it’s hard (though not impossible) to talk about memory allocations. We often talk about the limitations of a language in terms of “things that are hard to express” in that language. Hillel Wayne has some more discussion of this in his post Sapir-Whorf does not apply to Programming Languages.

But I want to talk more about Inverse Sapir-Whorf. What is the language forcing you to talk about, even if you don’t actually care about it?

I think there are actually many, many examples of this, but seeing them can be quite hard, and often requires the “foreigner perspective” that comes from learning multiple languages.

Here are a few:

I suspect that many of the features of more “approachable” or “readable” programming languages could be analysed in these terms – they have a low inverse Sapir-Whorf barrier, and don’t force you to talk about things you don’t have an opinion on, and may not even understand yet.

Are there more examples of this that you’ve come across? How do they affect the programming languages we use, or how we perceive them?

May 01, 2026 08:22 AM UTC


Tryton News

Tryton News May 2026

During the last month we focused on fixing bugs, improving behaviour, and addressing performance issues, building on the changes from our last LTS release 8.0. We also added some new features which we would like to introduce to you in this newsletter.

For an in-depth overview of the Tryton issues, please take a look at our issue tracker or see the issues and merge requests filtered by label.

Changes for the User

Accounting, Invoicing and Payments

We updated the supported Stripe API version from 2025-09-30.clover to 2026-03-25.dahlia.

Stock, Production and Shipments

Now we include the time-sheet costs in the production work costs.

User Interface

We implemented a fallback to the model name when no name parameter is given in a Tryton URL.

We now support sending emails for chat messages, with the ability to reply to them.

Modules

We moved the account_de_skr03, account_es and account_es_sii modules to the external tryton-community project.

New Documentation

We added new documentation for the REST API.

New Releases

We released bug fixes for the currently maintained long term support series 8.0, 7.0 and 6.0, and for the penultimate series 7.8 and 7.6.

Changes for the System Administrator

We added a REST API (https://code.tryton.org/tryton/-/commit/44edc21632c653a7a0db8a0ee42a8631c6d10f31) for user applications. For more information, have a look at its documentation.

Changes for Implementers and Developers

We now fall back to the compact syntax if the RelaxNG files are not present: lxml can load the compact syntax when the rnc2rng package is installed. This avoids the need to generate the RelaxNG files when developing.

Authors: @dave @pokoli @udono

1 post - 1 participant

Read full topic

May 01, 2026 06:00 AM UTC


Python GUIs

Streamlit Buttons — Making things happen with Streamlit buttons

Streamlit is a popular choice for creating interactive web applications in Python. With its simple syntax and intuitive interface, developers can quickly create visually appealing dashboards.

One of the great things about Streamlit is how easily it handles user interaction and dynamically updates the UI in response. One of the main ways for users to trigger actions in UIs is through buttons. In Streamlit, the st.button() function creates a button that users can click to perform an action. Each button can be associated with a different action.

In this tutorial we'll look at how you can use buttons to add interactivity to your Streamlit apps.

Creating Buttons in Streamlit

To create a button in Streamlit, you use the st.button() function, which takes a label as an argument. When the button is clicked, it returns True, which you can use to control subsequent actions.

Basic Button Syntax

Here's a simple example of a button in Streamlit:

import streamlit as st

if st.button('Click Me'):
    st.write("Button clicked!")

Simple Streamlit app with a single button

st.button('Click Me') creates a button labeled Click Me. When the button is clicked, it returns True, so the if condition passes and the nested code runs, displaying the message "Button clicked!"

This basic structure is the foundation of working with buttons in Streamlit. Through this simple mechanism you can build quite complex interactivity.

Multiple Buttons for Different Actions

Building on the basic button structure, you can create multiple buttons within your Streamlit app, each associated with different actions. For instance, let's create buttons that display different messages based on which is clicked.

import streamlit as st

if st.button('Show Greeting'):
    st.write("Hello, welcome to the app!")

if st.button('Show Goodbye'):
    st.write("Goodbye! See you soon.")

Simple Streamlit app with two buttons

Each button is wrapped in a conditional statement. When a button is pressed, the corresponding action is executed. Depending on the button pressed, different messages are displayed, providing immediate feedback to the user.

This structure is versatile and can be expanded to include more buttons and actions.

Displaying Dynamic Content Based on Button Clicks

Buttons can be used to display all types of content dynamically, including text, images, and charts. For example, below is a similar example but displaying images.

import streamlit as st

img_url_1 = "https://placehold.co/150/FF0000"
img_url_2 = "https://placehold.co/150/8ACE00"

if st.button('Show Red Image'):
    st.image(img_url_1, caption="This is a red image")

if st.button('Show Green Image'):
    st.image(img_url_2, caption="This is a green image")

Simple Streamlit app with two buttons showing images

When the Show Red Image button is pressed, a red image is displayed. The same goes for the Show Green Image button. This setup allows users to switch between different images based on their preferences.

Note that the state isn't persisted between interactions. When you click Show Red Image, the green image will disappear, and vice versa. This isn't a toggle but a natural consequence of how Streamlit works: the script is executed from top to bottom on each interaction, so only one button can be in a "clicked" state at any time.

To persist state between runs of the script, you can use Streamlit's state management features. We'll cover this in a future tutorial.
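As a quick preview, here is a minimal sketch of that idea using st.session_state to persist a flag across reruns (illustrative only; the upcoming tutorial will cover state management properly):

import streamlit as st

# Initialize the flag once per session
if "show_red" not in st.session_state:
    st.session_state.show_red = False

# Clicking the button flips the flag; the flag survives script reruns
if st.button('Toggle Red Image'):
    st.session_state.show_red = not st.session_state.show_red

if st.session_state.show_red:
    st.image("https://placehold.co/150/FF0000", caption="This is a red image")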

Dynamic Forms Based on Button Press

Dynamic forms allow users to provide input in a structured way, which can vary based on user actions. This is particularly useful for collecting information without overwhelming users with multiple fields.

Here's a quick example where users can input their name and age based on button presses:

import streamlit as st

# Title
st.title("Dynamic Forms Based on Button Press")

# Name Input Field
if st.button('Enter Name'):
    name = st.text_input('What is your name?')
    if name:
        st.write(f"Hello, {name}\!")

# Age Input Field

if st.button('Enter Age'):
    age = st.number_input('What is your age?', min_value=1, max_value=120)
    if age:
        st.write(f"Your age is {age}.")

The button Enter Name triggers a text input field when clicked, allowing users to enter their names. The button Enter Age displays a number input field for users to enter their age. The app provides immediate feedback based on user input.

Handling Form Submission

For more complex collections of inputs that you want to work together, consider using st.form() to group inputs, allowing users to submit all inputs at once:

import streamlit as st

# Title
st.title("Dynamic Forms Based on Button Press")

with st.form("my_form"):
    name = st.text_input('What is your name?')
    age = st.number_input('What is your age?', min_value=1, max_value=120)
    submitted = st.form_submit_button("Submit")

    if submitted:
        st.write(f"Hello, {name}\! Your age is {age}.")

Streamlit form with submit button

Conclusion

In this tutorial, we explored how to make things happen in Streamlit using buttons. We learned how to create multiple buttons and display dynamic content based on user interaction.

Now that you have a basic understanding of buttons in Streamlit, you can add basic interaction to your Streamlit applications.

May 01, 2026 06:00 AM UTC


Antonio Cuni

Why Python Is Slow: Talking about SPy on the Behind the Commit Podcast


During EuroPython 2025 I had the pleasure to talk to Mia Bajić for her podcast Behind The Commit.

In the chat we mainly talk about Python performance and how SPy tries to improve it.

Now the full episode is live: you can watch it on YouTube or listen on Spotify.

May 01, 2026 12:00 AM UTC

April 30, 2026


Real Python

Quiz: Using Python for Data Analysis

In this quiz, you’ll test your understanding of Using Python for Data Analysis.

By working through this quiz, you’ll revisit the stages of a data analysis workflow, including cleansing raw data with pandas, spotting outliers and typos, and using regression to find relationships between variables.



April 30, 2026 12:00 PM UTC


EuroPython

EuroPython 2026: Ticket Sales Now Open

Hey hey, folks 👋

Get ready for EuroPython 2026: the conference for all things Python, Data Science, and AI! 

We’ve got an exciting week planned:

We have a special keynote this year: Łukasz Langa and Pablo Galindo Salgado will be recording the core.py podcast right on the conference stage. It will feature their special guest Guido van Rossum, the creator of Python.

Ticket sales for EuroPython 2026 are now open

People who’ve been to EuroPython will tell you that it is more than just talks and tutorials: it’s a time when the entire community is together, regardless of experience level or background. Each conference leads to new friends being made, projects gaining new contributors, and even people securing their next job. We want you all to be a part of it 💚

🎫 Grab your tickets before they sell out:

Can’t wait to see you all in Kraków and hang out with the Python crowd again 🐍💚

Cheers,

The EuroPython 2026 Organisers ✨

April 30, 2026 10:00 AM UTC


Seth Michael Larson

The Frog for Whom the Bell Tolls

Kaeru no Tame ni Kane wa Naru (カエルの為(ため)に鐘(かね)は鳴(な)る) is a Japanese-only Game Boy title published in 1992 by Nintendo and developed by Intelligent Systems. The title’s official English translation is “The Frog for Whom the Bell Tolls”. For brevity, I’ll be using the title “Frog Game” in this article.

After I finished Link’s Awakening, the Frog Game started popping up everywhere in my digital life. The first occurrence was without my knowledge: some of the characters in Link’s Awakening, Prince Richard and his frogs, are originally from the Frog Game and use the same sprites and music.


Picture of my “Kaeru no Tame ni Kane wa Naru” Game Boy cartridge.

While researching what game to play after Link’s Awakening I watched a video by AntDude detailing the history of hand-held Legend of Zelda games. The video starts by mentioning “Frog Game” instead of the actual first Zelda game on the Game Boy: Link’s Awakening. Very intriguing...

After further research I stumbled across a project by Iván Delgado (Bluesky, YouTube) to create a colorization patch for “Frog Game” that appears to still be in progress. I was already a subscriber to Iván’s blog and had previously read their series of posts about colorizing Game Boy games.

Everything I read about the game made me want to play it: the game was affordable and short (7 hours to beat), with a light and funny narrative and ties to some of my favorite games. I’ve since played Frog Game and I recommend it as a quick and fun “pocket-sized” adventure.

Playing with English translations

Kaeru no Tame ni Kane wa Naru was never released outside of Japan and despite multiple re-releases to the 3DS eShop and now Nintendo Classics, there is no official English translation.

I can’t read Japanese, but I wanted to experience the dialogue. Luckily for me, there is a fan-created English translation patch from 2011. I would need the actual game ROM to apply the patch.


Japanese title screen for “Kaeru no Tame ni Kane wa Naru”


Title screen with the English translation patch applied

I purchased the game cartridge for $10 on eBay and dumped the ROM using GB Operator. Next I applied the English translation patch (.ips) using ROM Patcher JS by Marco Bledo. I loaded the resulting ROM into the Delta Emulator and played exclusively on this platform (with RetroAchievements enabled).

Beware: There are minor spoilers beyond this point!

References

While the game’s title is a reference to “For Whom the Bell Tolls” by Ernest Hemingway, the game’s story definitely isn’t. One of the goals of the protagonists is to repair and ring the “Spring Bell” to break the curse on the princes and their army: transforming them from frogs back into humans.

The developers of Frog Game, Intelligent Systems, also developed my favorite game of all time: “Paper Mario: The Thousand Year Door”. Chapter 4 of Paper Mario is titled “For Pigs the Bell Tolls” which is another reference to Hemingway and potentially Frog Game? Chapter 4’s story in Paper Mario has the villain “Doo*liss” ringing the Creepy Steeple bell which transforms the Twilight Town inhabitants one-by-one into pigs.

Frog Game references Nintendo very directly multiple times. During your adventure you visit “Nantendo Inc.” (not a typo!) to talk to the scientists there. One of the “products” you end up needing from Nantendo is a “Mamicon”, likely a reference to the Nintendo Famicom. From the name alone you will never guess what the Mamicon does; you’ll have to play the game to find out!

Frog Game is referenced in a few other Nintendo games beyond Link’s Awakening, including an Assist Trophy and Single-Player Challenge in Super Smash Bros. Mad Scienstein from Nantendo Inc. cameos in Wario Land 3, Wario Land 4, and Dr. Mario 64.

Gameplay

The rumors about Link’s Awakening sharing an engine with Frog Game likely come from using a mix of top-down and side-scrolling platformer perspectives. Frog Game uses the top-down perspective when exploring the world map or different towns and then switches to side-scrolling when in dungeons or the castle. Folks who have dug into the assembly of both games are fairly sure the two games don’t share an engine, meaning the rumors are unlikely to be true. Still a fun story :)

Despite appearing to be a traditional RPG with stats like Health, Attack, Speed, and the ability to upgrade your equipment, this game does not play like many RPGs. There are no tactics in combat beyond being able to run away from a battle or use an item, which for most of the story is only to heal using Wine. Battles proceed automatically in a cloud of dust and will consistently resolve as either a victory or defeat.

Combat and stats are used to limit progression with difficult “boss enemies” until you’ve discovered or unlocked every new stat upgrade in an area. Stat upgrades are given out similar to any other item: hidden in chests or as a reward for defeating an enemy. You can’t increase your stats on your own using “experience points” or “leveling up” meaning the game is in control of how strong you are.

The “illusion of control” is my favorite design choice of Frog Game, and it goes beyond just combat and items, too. There are many points in the game where, without you even noticing, the game has set you on a “one-way track” where your combat ability, health, and resources are exactly managed to produce an outcome later in the story. It’s fun trying to break the flow and seeing how the game responds!

Factions

The universe of Frog Game has multiple kingdoms and three factions: humans, frogs and snakes.

Frogs are afraid of snakes, as snakes will actively pursue frogs as prey, but frogs and humans are either neutral or friendly towards each other. The antagonist, Lord Delarin, leads the “Croakian Army”, an army of soldiers who are friendly towards frogs but hostile towards humans of other kingdoms and snakes. Humans, frogs, and snakes can only converse with members of their group and this “information asymmetry” is used throughout the story.

Prince Richard, Prince Sablé, and the Custard Kingdom army are all “cursed” by Mandola the witch, transforming them into frogs. Prince Sablé eventually gains the ability to transform into a frog, snake, and human somewhat at-will from Mandola through additional “curses”. These curses end up being instrumental to your success, similar to the “curses” from Black Chests in Paper Mario or Li’l Devils from Link’s Awakening.

Story

The story of Frog Game after the initial few chapters is quite light. You’re trying to accomplish the main goal, which is to defeat Delarin and find Princess Tiramisu, but a lot of that happens in the background. The bulk of the story is solving your minute-to-minute troubles, caused either by your short-sightedness or the Croakian army. You don’t meet Delarin until the very end, and despite a few twists, the Princess does not escape her fate. At the end of the day it’s a Game Boy game, so expectations for the story are not high.



Thanks for keeping RSS alive! ♥

April 30, 2026 12:00 AM UTC

April 29, 2026


PyCharm

Using Bag-of-Words With PyCharm

Have you ever wondered how machine learning models actually work with text? After all, these models require numerical input, but text is, well, text.

Natural language processing (NLP) offers many ways to bridge this gap, from the large language models (LLMs) that are dominating headlines today all the way back to the foundational techniques of the 1950s. Those early methods fall under what we now call the bag-of-words (BoW) model, and despite their age, they remain remarkably effective for a wide range of language problems.

In this post, we’ll unpack how the bag-of-words model works, explore the techniques it uses to convert text into numerical representations, and look at where it fits relative to more modern NLP approaches. We’ll also build a text classification project using BoW techniques, and see how PyCharm’s specific features make the whole process faster and easier.

What is the bag-of-words model?

The bag-of-words model is a text representation technique that converts unstructured text into numerical vectors by tracking which words appear across a corpus (a collection of texts). Rather than preserving grammar or word order, it simply represents each document as a “bag” of its words, recording how often each one appears. The result is a vector of counts that captures what a text is about, even if it discards how that content is expressed.

This apparent limitation turns out to matter less than you might expect. For many tasks, such as text classification and sentiment analysis, the presence of certain words is often a stronger signal than their arrangement, and BoW captures that signal efficiently.

How does bag-of-words work?

To use the bag-of-words model, we need to convert each text in a corpus into a numerical vector. Let’s walk through how that works, starting with what that vector actually looks like.

Take the following sentence:

When diving into natural language processing, it is natural for beginners to feel overwhelmed by the complexity of sentiment analysis, which involves distinguishing negative from positive text. However, as you practice with libraries like NLTK or spaCy, the concepts naturally start to click.

A vector representation of this text using the BoW model might look something like this.

natural | naturally | nausea | near | neared | nearing | necessary | negative
2       | 1         | 0      | 0    | 0      | 0       | 0         | 1

If we think of this vector as a table, you may have noticed that each column represents a word in the corpus, and the row contains a number from 0 to 2. This number is a count of how many times the word occurs in the text, as we can see below:

When diving into natural language processing, it is natural for beginners to feel overwhelmed by the complexity of sentiment analysis, which involves distinguishing negative from positive text. However, as you practice with libraries like NLTK or spaCy, the concepts naturally start to click.

Each column represents a word in the vocabulary; each value records how many times that word appears. Here, “natural” appears twice, while “naturally” and “negative” each appear once.

Tokenization

Before we can build this vector, we need to split our text into tokens. In BoW modeling, this is typically straightforward: We split on whitespace and split off punctuation, so “When diving into natural language processing,” becomes seven tokens: ["When", "diving", "into", "natural", "language", "processing", ","]. This is considerably simpler than the tokenization used in LLMs.

Vocabulary creation

Applying tokenization across every text in the corpus produces a long list of words. Deduplicating this list gives us our vocabulary, which we can see in the set of columns in the vector above. This process does introduce some noise: “Natural” and “natural”, for instance, would be treated as two separate tokens. We’ll look at preprocessing steps to address this shortly.

Encoding

With a vocabulary in hand, we create a vector for each text with one element per vocabulary word. Encoding is then the process of filling in those elements by checking each vocabulary word against the text.

The simplest approach is binary vectorization: 0 if a word is absent, 1 if present. More common, however, is count vectorization, which records the actual number of occurrences, as we saw in the example above. Count vectorization carries more information, since it helps distinguish texts that merely mention a topic from those that focus on it heavily.

One practical consequence of this approach is sparsity. If a corpus contains thousands of unique words, each vector will have thousands of elements, but any individual text will only use a small fraction of them, leaving most values at zero. This signal-to-noise issue is something we’ll return to.
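To make these three steps concrete, here is a minimal pure-Python sketch of tokenization, vocabulary creation, and count encoding (illustrative only; later in this post we use scikit-learn's CountVectorizer instead):

corpus = [
    "natural language processing is natural",
    "negative text is negative text",
]

# Tokenization: split each text on whitespace
tokenized = [text.split() for text in corpus]

# Vocabulary creation: deduplicate across the corpus and fix an index per word
vocabulary = sorted({word for tokens in tokenized for word in tokens})
index = {word: i for i, word in enumerate(vocabulary)}

# Encoding: one count vector per text
vectors = []
for tokens in tokenized:
    vector = [0] * len(vocabulary)
    for word in tokens:
        vector[index[word]] += 1
    vectors.append(vector)

print(vocabulary)  # ['is', 'language', 'natural', 'negative', 'processing', 'text']
print(vectors)     # [[1, 1, 2, 0, 1, 0], [1, 0, 0, 2, 0, 2]]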

Advantages of the bag-of-words model

The bag-of-words model has remained a staple in NLP for good reason. Its greatest strength is its simplicity: Because text is represented as a collection of word counts, the approach is easy to understand and straightforward to implement, making it a natural baseline before reaching for more complex architectures.

Beyond simplicity, BoW is computationally efficient. As you saw above, the underlying math is lightweight, which means it scales well to large text collections without demanding significant computing resources. For tasks where the presence of specific words is sufficient to capture meaning, with sentiment analysis and topic categorization being the clearest examples, it remains a highly effective tool.

Applications of bag-of-words

Like many NLP approaches, the bag-of-words model can be applied to many natural language problems. These potential applications include:

As you can see, the number of potential applications is broad, making bag-of-words modeling a popular first approach to natural language problems.

Why use PyCharm for NLP?

PyCharm is particularly well-suited to bag-of-words modeling because it supports the iterative, detail-oriented workflow that text processing requires. As you’ll soon see, building a reliable BoW pipeline involves multiple steps, such as cleaning text, tokenizing, vectorizing, and validating outputs, and PyCharm’s code intelligence makes each of these smoother. Autocompletion, parameter hints, and quick navigation through specialized NLP libraries reduce friction when experimenting with different vectorizer settings, and help you understand how each component behaves.

Debugging and data inspection are equally important here, since small preprocessing mistakes can have an outsized effect on results. PyCharm lets you step through your code and examine intermediate states of things such as token lists and vocabulary at runtime, making it straightforward to verify that your feature extraction is working as intended. This visibility is especially useful when diagnosing issues like unexpected vocabulary sizes or missing terms.

PyCharm also supports exploratory work through its excellent Jupyter Notebook integration and scientific tooling. BoW modeling often involves trying different preprocessing strategies and observing their effects immediately, so the ability to run code interactively and inspect outputs inline is a genuine advantage. Combined with built-in virtual environment and package management support, this keeps experiments reproducible and well-organized.

As projects grow, PyCharm’s refactoring tools, project navigation, and version control integration help manage the added complexity. BoW models rarely exist in isolation, and they’re often embedded in larger ML pipelines. In such contexts, PyCharm’s features for working with larger applications mean you spend less time managing code and more time improving your models.

Setting up the project

To see these components in action, let’s build an actual bag-of-words project. We’ll use a classic text classification dataset, AG News, and then build a model to classify news articles into one of four categories: World, Sports, Business, or Science/Technology.

To get started in PyCharm, open the Projects and Files tool window and select New… > New Project…. Since this is a data science project, we can use PyCharm’s built-in Jupyter project type, which sets up a sensible default structure for us.

During project configuration, you’ll be asked to choose a Python interpreter. By default, PyCharm uses uv and lets you select from a range of Python versions, though all major dependency management systems are supported: pip, Anaconda, Pipenv, Poetry, and Hatch. Every project is automatically created with an attached virtual environment, so your setup will be ready to go each time you reopen the project.

With the project configured, we can install our dependencies via the Python Packages tool window. Simply search for a package by name, select it from the list, and install your desired version directly into the virtual environment. You can also see the same information about the package you’d find on PyPI directly within the IDE. For this project, we’ll need pandas and NumPy, along with datasets from Hugging Face, scikit-learn, PyTorch, and spaCy.

Implementing bag-of-words with PyCharm

There are many versions of this dataset online. We’ll be using one of the versions hosted on Hugging Face Hub.

Loading and preparing the data

We’ll use Hugging Face’s datasets package to download this dataset.

from datasets import load_dataset
ag_news_all = load_dataset("sh0416/ag_news")

This gives us a Hugging Face DatasetDict object. If we look at it, we can see it contains a training dataset with 120,000 news articles, and a test dataset containing 7,600 articles.

ag_news_all
DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 7600
    })
})

As we’ll be training a model, we also need a validation set. We’ll convert the training and test sets to pandas DataFrames, and use the train_test_split function from scikit-learn to create the validation set from the training data.

import pandas as pd
from sklearn.model_selection import train_test_split

ag_news_train = ag_news_all["train"].to_pandas()
ag_news_test = ag_news_all["test"].to_pandas()

ag_news_train, ag_news_val = train_test_split(
   ag_news_train,
   test_size=0.1,     
   random_state=456,   
   stratify=ag_news_train['label'] 
)

print(f"Training set: {len(ag_news_train)} samples")
print(f"Validation set: {len(ag_news_val)} samples")

We now have a validation set with 12,000 articles, and a training set with 108,000 articles.

Training set: 108000 samples
Validation set: 12000 samples

For those of you new to machine learning, you might be wondering why we need all of these different datasets. The reason is to give us confidence that our model will generalize well and perform as expected on unseen data. The training set is the only data the model ever learns from directly.

The validation set is used to monitor how the model performs on unseen data as we make modeling decisions, such as choosing how many epochs to train for, how large to make the hidden layer, or which preprocessing steps to apply (we’ll see all of this later). Because we look at validation performance repeatedly while building the model, there is a risk that our choices gradually become tuned to the quirks of that particular split.

This is why we need a third set (the test set), which we keep completely locked away until we’ve finished all modeling decisions and want a single, unbiased estimate of how well our model will perform on new data. Using the test set for anything other than this final evaluation would give us an overly optimistic picture of our model’s real-world performance.

Let’s now inspect our datasets. PyCharm Pro has a lot of built-in features that make working with DataFrames easier, a few of which we’ll see soon. In this DataFrame, we have three columns: the article title, the article description, and the label indicating which of the four news categories the article belongs to. You can open any of the DataFrame cells in the Value Editor to see its full text, or widen the column to prevent truncation, both of which are useful for a quick visual inspection.

At the top of each column, PyCharm displays column statistics, giving you an at-a-glance summary of the data. Switching from Compact to Detailed mode via Show Column Statistics gives you rich summary statistics about each column, and saves you from writing a lot of pandas boilerplate to get it! From these statistics, we can see that our training set is evenly split across the news categories (which is very handy when training a model). We can also see that some headlines and descriptions are not unique, which may introduce noise when classifying these duplicates.

The first step in preparing the data is basic string cleaning, which normalizes the text and reduces meaningless token variation. For instance, without cleaning, “Natural” and “natural” would be treated as two separate vocabulary entries, as we noted earlier. 

We’ll apply four cleaning steps: lowercasing, punctuation removal, number removal, and whitespace normalization. There are different string cleaning steps you can apply depending on the language and use case, but for English-language texts, these tend to be very standard. Let’s go ahead and write a function to do this.

def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
   patterns_to_remove = [
       r"[^a-zA-Z\s]",
   ]

   cleaned = dataset.str.lower()

   for pattern in patterns_to_remove:
       cleaned = cleaned.str.replace(pattern, " ", regex=True)

   cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

   return cleaned

ag_news_train["title_clean"] = apply_string_cleaning(ag_news_train["title"])
ag_news_train["description_clean"] = apply_string_cleaning(ag_news_train["description"])

This mostly works, but there’s one issue: The regex strips apostrophes entirely, turning contractions like “you’re” into “you re” and possessives like “Canada’s” into “Canada s”. The cleanest fix is a regex that preserves apostrophes in contractions while removing possessive endings, but this is not the most enjoyable thing to write by hand.

This is where PyCharm’s built-in AI Assistant comes in. Open the chat window via the AI Chat icon on the right-hand side of the IDE and enter the following prompt:

Can you please alter the @apply_string_cleaning function so that it retains apostrophes inside words when they’re used for contractions (e.g., “you’re”), but removes them when they’re used for possessives (e.g., “Canada’s” into “Canada”).

The @ notation lets you reference specific files or objects in your IDE without copying and pasting code into the prompt, including Jupyter variables like datasets and functions.

I ran this against Claude Sonnet 4.5, though JetBrains AI supports a wide range of models from OpenAI, Anthropic, Google, and xAI, as well as open models via Ollama, LM Studio, and OpenAI-compatible APIs. Below is the updated function it returned:

def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
    cleaned = dataset.str.lower()
    
    # Remove possessive apostrophes (word's -> word)
    # This pattern matches: letter(s) + 's + word boundary
    cleaned = cleaned.str.replace(r"(\w+)'s\b", r"\1", regex=True)
    
    # Remove all non-letter characters except apostrophes within words
    cleaned = cleaned.str.replace(r"[^a-zA-Z'\s]", " ", regex=True)
    
    # Clean up any apostrophes at the start or end of words
    cleaned = cleaned.str.replace(r"\s'|'\s", " ", regex=True)
    
    # Remove multiple spaces and trim
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
    
    return cleaned

ag_news_train["title_clean"] = apply_string_cleaning(ag_news_train["title"])
ag_news_train["description_clean"] = apply_string_cleaning(ag_news_train["description"])

We can insert this into our Jupyter notebook directly by clicking on Insert Snippet as Jupyter Cell in the AI chat.

Once we run this updated function on our raw text, we get the correct result:

text | text_clean
Don’t stand for racism – football chief | don’t stand for racism football chief
Canada’s Barrick Gold acquires nine per cent stake in Celtic Resources (Canadian Press) | canada barrick gold acquires nine per cent stake in celtic resources canadian press

We can see the contraction “don’t” is correctly preserved in the first example, but the possessive “Canada’s” has been removed. We apply this to both the training and validation datasets using the same function, so that the cleaning is consistent across both splits:

ag_news_val["title_clean"] = apply_string_cleaning(ag_news_val["title"])
ag_news_val["description_clean"] = apply_string_cleaning(ag_news_val["description"])

Creating the bag-of-words model

Now that we have clean text, we need to build our vocabulary and encode it. We’ll use scikit-learn’s CountVectorizer for this:

from sklearn.feature_extraction.text import CountVectorizer

# Combine the cleaned title and description into the single text_clean column
# used below (assumed here, since the columns created earlier are title_clean
# and description_clean)
ag_news_train["text_clean"] = (
    ag_news_train["title_clean"] + " " + ag_news_train["description_clean"]
)

countVectorizerNews = CountVectorizer()
countVectorizerNews.fit(ag_news_train["text_clean"])
ag_news_train_cv = countVectorizerNews.transform(ag_news_train["text_clean"]).toarray()

The process has two distinct steps. First, .fit() scans the training data and builds a vocabulary by identifying every unique word and assigning it a fixed index position (for example, “government” = column 8,901). The result is a mapping of 59,544 unique words, which you can think of as the column headers for our eventual matrix.

Second, .transform() uses that vocabulary to convert each headline into a numerical vector, counting how many times each vocabulary word appears and placing that count at the corresponding index position.

The reason these are two separate steps is important: When we later process our validation and test data, we’ll call .transform() using the vocabulary learned from the training set. This ensures that all three splits share a consistent feature space. If we re-ran .fit() on the test data, we’d get a different vocabulary, and the model’s predictions would be meaningless.

With the vectorizer fitted and our training data transformed, we can start exploring what we’ve actually built. Let’s first take a look at the vocabulary. CountVectorizer stores it as a dictionary mapping each word to its index position, accessible via vocabulary_:

countVectorizerNews.vocabulary_
{'fed': 18461,
 'up': 55833,
 'with': 58324,
 'pension': 38929,
 'defaults': 13156,
 'citing': 9475,
 'failure': 18077,
 'of': 36704,
 'two': 54804,
 'big': 5269,
 'airlines': 1139,
 'to': 53531,
 'make': 31397,
 'payments': 38686,
 'their': 52947,
...}
len(countVectorizerNews.vocabulary_)
59544

This confirms that our vocabulary contains 59,544 unique words. Browsing through it, you can start to guess what kinds of terms appear frequently in the different types of news. Country names feature heavily in the “world” news category, terms like “football” and “cricket” in the “sports” news category, terms like “profit” and “losses” in the “business” news category, and company names like “Google” and “Microsoft” in the “science/technology” category.

Next, let’s inspect the feature matrix itself. ag_news_train_cv is a NumPy array with one row per headline and one column per vocabulary word, giving us a matrix of shape (108,000 × 59,544). We can wrap it in a DataFrame to make it easier to inspect in PyCharm’s DataFrame viewer:

pd.DataFrame(ag_news_train_cv, columns=countVectorizerNews.get_feature_names_out())

As expected, the matrix is very sparse. Most values are zero, since any individual headline only contains a small fraction of the full vocabulary. In fact, you might have noticed that the number of columns is more than half the number of rows, which is never good for a feature matrix. We’ll explore how to reduce the dimensionality of the feature space in a later section.

Note that we also need to apply this vectorization to the validation dataset before moving on to modeling. Importantly, we only apply the .transform() method to the validation set, as the vectorizer was already fitted on the training data.

# As above, text_clean is assumed to be the combined cleaned title and description
ag_news_val["text_clean"] = (
    ag_news_val["title_clean"] + " " + ag_news_val["description_clean"]
)
ag_news_val_cv = countVectorizerNews.transform(ag_news_val["text_clean"]).toarray()

Visualizing the results

Before we move on to reducing the dimensionality of our feature space, let’s explore the distribution of the words in our corpus. This can help us understand the most common and rare words, and how we might further process our data to amplify the signal-to-noise ratio.

Word frequency plots

We’ll start by creating a DataFrame that aggregates word counts across all headlines and ranks them by frequency:

import numpy as np

vocab = countVectorizerNews.get_feature_names_out()
counts = np.asarray(ag_news_train_cv.sum(axis=0)).flatten()

pd.DataFrame({
  'vocab': vocab,
  'count': counts,
}).sort_values('count', ascending=False).reset_index(drop=True)

First, we retrieve the vocabulary in index order using get_feature_names_out(), so each word lines up with its corresponding column in the feature matrix. We then sum the matrix column-wise (that is, across all documents) to get the total number of times each word appears in the training set. Finally, we wrap these two arrays into a DataFrame and sort by count, giving us a ranked list of the most frequent terms.

Once this DataFrame is displayed in PyCharm, we can easily turn it into a visualization without writing a single line of code. By clicking on the Chart View button in the top left-hand corner of the DataFrame, we can explore a range of ways of visualizing our data. Go to Show Series Settings in the top right-hand corner, and adjust the parameters to get the count frequencies of the words: we set the X axis value to “vocab” (and change group and sort to none), the Y axis value to “count”, and the chart type to “Bar”.

Hovering over this chart, we can see that it has a very long-tailed distribution, which is very typical of vocabulary frequencies (this is actually so typical that it is described in something called Zipf’s law). This means that the majority of our words very rarely occur in the text, and in fact, if we hover over the right-hand side of the chart, it looks like around a third of our vocabulary terms are only used once! 

On the other hand, when we hover over the left-hand side of the chart, we can see that this is dominated by very common words, prepositions, and articles, such as “to”, “in”, “the”, and “you”. These words don’t really carry any meaning and pretty much occur in every text, so they’re unlikely to be useful for our classification task. 

Let’s have a look at some things we can do to clean up our feature space and help our semantically meaningful words stand out a bit more.

Advanced bag-of-words techniques

The basic BoW pipeline we’ve built so far is a solid foundation, but there are several techniques that can meaningfully improve its quality. This section walks through the most important ones. We’ll only be using a selection of them in our project, but you can investigate which of these seem appropriate when building your own project.

Stop word removal

Stop words are extremely common words that appear frequently across all kinds of text but carry little meaningful information. This includes words like “the”, “is”, “and”, “of”, as we saw in the histogram in the previous section. They inflate vocabulary size without adding signal, so removing them is one of the most straightforward ways to improve your BoW representation. NLTK provides a built-in stop word list for English and many other languages.
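Here is a sketch of what that looks like with NLTK (assuming the package is installed; the stop word lists need a one-time download). Note that scikit-learn's CountVectorizer also accepts stop_words="english" to do this during vectorization.

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stop word lists
english_stops = set(stopwords.words("english"))

text = "the cat sat on the mat and the dog watched"
filtered = [w for w in text.split() if w not in english_stops]
print(filtered)  # ['cat', 'sat', 'mat', 'dog', 'watched']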

Stemming and lemmatization

Another issue you might have noticed in our vocabulary is that words that are semantically equivalent have different syntactic forms, meaning that while they should be treated as the same token, they occupy additional token slots. We can resolve this through two techniques: stemming and lemmatization. Stemming reduces words to their root form using simple rule-based truncation (e.g. “running” → “run”), while lemmatization takes a linguistic approach, mapping words to their dictionary base form. Lemmatization is slower but generally produces cleaner results, particularly for irregular word forms.
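A quick comparison of the two using NLTK (a sketch; the lemmatizer needs the WordNet corpus downloaded, and the part-of-speech hint passed here is an assumption of the example):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # one-time download for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "studies", "leaves"]:
    # e.g. "studies" -> stem "studi" vs. lemma "study"
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))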

TF-IDF

Term frequency-inverse document frequency (TF-IDF) is an extension of basic count vectorization that weights each word by how informative it actually is. A word that appears frequently in one document but rarely across the corpus receives a high weight; a word that appears everywhere receives a low one. This neatly addresses one of the core weaknesses of raw count vectors: common but uninformative words can dominate the feature space even after stop-word removal.
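In scikit-learn, TfidfVectorizer is a drop-in replacement for the CountVectorizer used earlier, so swapping it into our pipeline might look like this (a sketch reusing the column names from this post):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit(ag_news_train["text_clean"])
ag_news_train_tfidf = tfidf.transform(ag_news_train["text_clean"]).toarray()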

N-grams

Standard BoW treats each word independently, which means it misses phrases whose meaning depends on word combinations. A classic example of this is “machine learning”, which has a distinct meaning to “machine” + “learning”. N-grams address this by treating sequences of adjacent words as single tokens, so a bigram model would capture “machine learning” as a feature in its own right. The trade-off is a much larger vocabulary, so in practice, bigrams are most commonly used, with trigrams reserved for cases where capturing longer phrases is important.
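With scikit-learn, this is a single parameter. For example, a sketch that keeps both unigrams and bigrams:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words and adds adjacent word pairs,
# so "machine learning" becomes a feature in its own right
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(ag_news_train["text_clean"])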

Handling out-of-vocabulary words

When you apply your fitted vectorizer to new data, any words not present in the training vocabulary are silently ignored by default. For many tasks, this is acceptable, but if your production data is likely to continue introducing new terms that carry meaningful signal, it’s worth considering alternatives. One common approach is to reserve a special <UNK> token to represent unseen words, which at least preserves the information that something unfamiliar appeared, even if its identity is unknown and multiple (perhaps unrelated) words are collapsed onto the same token. 

However, LLMs, with their more flexible approach to tokenization, tend to be a better choice if out-of-vocabulary words will be a major issue for your model once it is in production.
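A hypothetical sketch of the <UNK> idea (the token name and frequency threshold are illustrative, not part of the original pipeline): replace rare training words with a placeholder before fitting, then apply the same replacement to any new text before transforming it.

from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Words seen fewer than twice in training map to the placeholder token
word_counts = Counter(w for text in ag_news_train["text_clean"] for w in text.split())
kept_words = {w for w, c in word_counts.items() if c >= 2}

def mark_unknowns(text: str) -> str:
    return " ".join(w if w in kept_words else "unktoken" for w in text.split())

# Fit on the replaced text so "unktoken" is part of the learned vocabulary
unk_vectorizer = CountVectorizer()
unk_vectorizer.fit(ag_news_train["text_clean"].map(mark_unknowns))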

Dimensionality reduction

Even after stop word removal and other cleaning steps, BoW feature matrices are typically very high-dimensional and sparse. Two widely used techniques can help. Reducing to the top-N most frequent terms is the simplest approach, discarding low-frequency words that are unlikely to generalize well. For a more principled reduction, techniques like principal component analysis (PCA) or latent semantic analysis (LSA) project the feature matrix into a lower-dimensional space, compressing the representation while preserving as much of the meaningful variance as possible.
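Both approaches are available in scikit-learn. A sketch, with the sizes chosen purely for illustration:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Option 1: keep only the 10,000 most frequent terms
top_n_vectorizer = CountVectorizer(max_features=10_000)

# Option 2 (LSA): project the existing count matrix down to 300 dimensions
svd = TruncatedSVD(n_components=300, random_state=456)
ag_news_train_lsa = svd.fit_transform(ag_news_train_cv)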

Feature selection techniques

Rather than reducing dimensionality arbitrarily, feature selection methods identify and retain only the features most relevant to your specific task. Chi-squared testing measures the statistical dependence between each term and the target label, making it well-suited to classification tasks. Mutual information takes a similar approach, scoring each feature by how much it reduces uncertainty about the class. Both methods can substantially reduce vocabulary size while preserving model performance.
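In scikit-learn, both are available through SelectKBest. A sketch using the chi-squared score against our category labels (the value of k is illustrative):

from sklearn.feature_selection import SelectKBest, chi2

# Keep the 5,000 terms most associated with the news category label
selector = SelectKBest(chi2, k=5_000)
ag_news_train_selected = selector.fit_transform(
    ag_news_train_cv, ag_news_train["label"]
)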

Applying bag-of-words to a real-world problem

Let’s now continue the example we started earlier. We’re going to take the work we’ve done on our AG News text classification task and take it to its completion by building a model.

A common way to build a model from encoded text is a neural network, where each word in the vocabulary is treated as a feature, and the categories we want to predict (in our case, the news category) are the output. We’ll start by building a baseline model that applies only string cleaning and encoding to the text.

I had originally written this model in Keras as part of a previous BoW project from a couple of years ago. However, that code was now out of date. In order to update it and adapt it to PyTorch, I asked JetBrains AI to do the following:

Please update this neural network from Keras to Pytorch, making improvements to make the code as reusable as possible.

This gave us the following successful port of the code:

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

class MulticlassClassificationModel(nn.Module):
    def __init__(self, input_size: int, hidden_layer_size: int, num_classes: int = 4):
        super(MulticlassClassificationModel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_layer_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_layer_size, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

def train_text_classification_model(
        train_features: np.ndarray,
        train_labels: np.ndarray,
        validation_features: np.ndarray,
        validation_labels: np.ndarray,
        input_size: int,
        num_epochs: int,
        hidden_layer_size: int,
        num_classes: int = 4,
        batch_size: int = 1920,
        learning_rate: float = 0.001) -> MulticlassClassificationModel:

    # Convert labels to 0-indexed (AG News has labels 1,2,3,4 -> need 0,1,2,3)
    train_labels_indexed = train_labels - 1
    validation_labels_indexed = validation_labels - 1

    # Convert numpy arrays to PyTorch tensors
    X_train = torch.FloatTensor(train_features.copy())
    y_train = torch.LongTensor(train_labels_indexed.copy())
    X_val = torch.FloatTensor(validation_features.copy())
    y_val = torch.LongTensor(validation_labels_indexed.copy())

    # Create datasets and dataloaders
    train_dataset = TensorDataset(X_train, y_train)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # Initialize model, loss function, and optimizer
    model = MulticlassClassificationModel(input_size, hidden_layer_size, num_classes)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.RMSprop(model.parameters(), lr=learning_rate)

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0.0
        correct_train = 0
        total_train = 0

        for batch_features, batch_labels in train_loader:
            # Forward pass
            outputs = model(batch_features)
            loss = criterion(outputs, batch_labels)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Calculate training metrics
            train_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            correct_train += (predicted == batch_labels).sum().item()
            total_train += batch_labels.size(0)

        # Validation
        model.eval()
        with torch.no_grad():
            val_outputs = model(X_val)
            val_loss = criterion(val_outputs, y_val)
            _, val_predicted = torch.max(val_outputs, 1)
            correct_val = (val_predicted == y_val).sum().item()
            total_val = y_val.size(0)

        # Print epoch metrics
        train_acc = correct_train / total_train
        val_acc = correct_val / total_val
        print(f'Epoch [{epoch+1}/{num_epochs}], '
              f'Train Loss: {train_loss/len(train_loader):.4f}, '
              f'Train Acc: {train_acc:.4f}, '
              f'Val Loss: {val_loss:.4f}, '
              f'Val Acc: {val_acc:.4f}')

    return model

def generate_predictions(model: MulticlassClassificationModel,
                         validation_features: np.ndarray,
                         validation_labels: np.ndarray) -> list:
    model.eval()

    # Convert to tensors
    X_val = torch.FloatTensor(validation_features.copy())

    with torch.no_grad():
        outputs = model(X_val)
        _, predicted = torch.max(outputs, 1)

    # Convert back to 1-indexed labels to match original dataset
    predicted_labels = predicted.numpy() + 1

    print("Confusion Matrix:")
    print(pd.crosstab(validation_labels, predicted_labels,
                      rownames=['Actual'], colnames=['Predicted']))
    return predicted_labels.tolist()

Let’s walk through this code step-by-step to understand how we’re going to train our text classifier.

The model architecture

MulticlassClassificationModel is a simple two-layer feedforward neural network. It takes a BoW vector as input, with each feature corresponding to a vocabulary word, and passes it through two sequential transformations to produce a prediction. The first layer (fc1) compresses this high-dimensional input down to a smaller intermediate representation, whose size we control via hidden_layer_size. A ReLU activation is then applied, introducing the non-linearity that allows the model to learn patterns a simple weighted sum couldn’t capture. The second layer (fc2) maps this intermediate representation down to four output values, one per news category, and the category with the highest value becomes the model’s prediction.

Training and validation

train_text_classification_model handles the full training loop. It starts with a small amount of housekeeping: The AG News labels run from 1 to 4, but PyTorch expects 0-indexed classes, so these are shifted down by 1. The features and labels are then converted to PyTorch tensors, and a DataLoader is created to feed the training data to the model in batches.

Each epoch, the model processes the training data batch by batch. For each batch, it runs a forward pass to generate predictions, computes the cross-entropy loss against the true labels, and then runs a backward pass to update the model weights via the RMSprop optimizer. At the end of every epoch, the model switches into evaluation mode and runs inference over the full validation set, printing the training and validation loss and accuracy so we can monitor how training is progressing.

Generating predictions

Once training is complete, generate_predictions runs the trained model on a held-out dataset and returns the predicted class for each article. It also prints a confusion matrix, which gives us a breakdown of which categories the model is getting right and where it’s getting confused, which is a much more informative picture than accuracy alone.

Running the baseline

We can now train the baseline model. We pass in the raw count-vectorized training and validation features, specify an input size equal to the vocabulary size (59,544 columns), train for two epochs, and use a hidden layer of 5,000 nodes.

baseline_model = train_text_classification_model(
    ag_news_train_cv,
    ag_news_train["label"].to_numpy(),
    ag_news_val_cv,
    ag_news_val["label"].to_numpy(),
    ag_news_train_cv.shape[1],
    2,
    5000
)

predictions = generate_predictions(
    baseline_model,
    ag_news_val_cv,
    ag_news_val["label"].to_numpy()
)
Epoch [1/2], Train Loss: 0.3553, Train Acc: 0.8813, Val Loss: 0.2307, Val Acc: 0.9243
Epoch [2/2], Train Loss: 0.1217, Train Acc: 0.9587, Val Loss: 0.2352, Val Acc: 0.9240

Confusion Matrix:
Predicted     1     2     3     4
Actual                           
1          2774    65    89    72
2            37  2944     9    10
3           112    20  2694   174
4            97    20   207  2676

Even with the very basic data preparation we did, we can see we’ve performed very well on this prediction task, with around 92% accuracy. The confusion matrix shows that the model seems to have the easiest time distinguishing between category two (sports) and the other topics, and the hardest time distinguishing between category three (business) and category four (science/technology). This makes sense, as the words used to describe sports are very distinct and unlikely to be used in other contexts (things like football), whereas there is likely to be overlapping vocabulary between business and technology (especially company names).

As we saw above, there is a lot we can do to improve the signal-to-noise ratio in BoW modeling. Let’s apply four commonly used techniques to our data and see whether this improves our predictions: lemmatization, stop word removal, limiting our vocabulary to the top N terms, and TF-IDF weighting. As you’ll see, all of these can be done relatively simply using built-in functionality in packages such as spaCy and scikit-learn.

Lemmatization

As we discussed earlier, lemmatization collapses inflected word forms into a single vocabulary entry by mapping each word to its dictionary base form, which both shrinks the vocabulary and concentrates the signal for each concept into a single feature. We’ll use spaCy for this, which first requires downloading its small English language model:

!python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

Our lemmatise_text function passes each text through spaCy’s NLP pipeline using nlp.pipe(), which processes them in batches of 1,000 for efficiency. For each document, it extracts the .lemma_ attribute of every token and joins them back into a single string. One small detail worth noting: we preserve the original DataFrame index when constructing the output Series, so that rows stay correctly aligned when we assign the results back.
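
The function itself isn’t shown in the snippet above, but based on that description, a minimal sketch (which may differ in small details from the original) would look like this:

def lemmatise_text(texts: pd.Series) -> pd.Series:
    texts = texts.fillna("").astype(str)

    lemmatised_texts = []
    for doc in nlp.pipe(texts, batch_size=1000):
        # Join each token's dictionary base form back into a single string
        lemmatised_texts.append(" ".join(token.lemma_ for token in doc))

    # Preserve the original index so rows stay aligned on assignment
    return pd.Series(lemmatised_texts, index=texts.index)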

We apply lemmatization before string cleaning, since spaCy needs the original casing and punctuation to correctly identify grammatical structure. For example, “running” and “Running” lemmatize to the same form, but lowercasing everything or stripping punctuation before parsing can confuse the parser. Once lemmatized, we pass the output through apply_string_cleaning as before:

ag_news_train["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_train["title"]))
ag_news_train["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_train["description"]))

ag_news_val["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_val["title"]))
ag_news_val["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_val["description"]))

ag_news_train["text_clean"] = ag_news_train["title_clean"] + " " + ag_news_train["description_clean"]

ag_news_val["text_clean"] = ag_news_val["title_clean"] + " " + ag_news_val["description_clean"]

We apply this separately to the title and description columns before concatenating them into a single text_clean field. As you can see, we do this for both the training and validation sets using the same function, so that lemmatization is applied consistently across both splits.

Removing stop words

As with lemmatization, we covered the motivation for stop word removal earlier: Words like “the”, “is”, and “of” appear so frequently across all texts that they add noise rather than signal to our feature matrix. Here we’ll actually apply it to our data.

def remove_stopwords(texts: pd.Series) -> pd.Series:
    texts = texts.fillna("").astype(str)

    filtered_texts = []
    for doc in nlp.pipe(texts, batch_size=1000):
        filtered_texts.append(
            " ".join(token.text for token in doc if not token.is_stop)
        )

    return pd.Series(filtered_texts, index=texts.index)

Our remove_stopwords function again uses nlp.pipe() to process texts in batches. For each document, it filters out any token where spaCy’s is_stop attribute is True, and joins the remaining tokens back into a string. Conveniently, spaCy handles stop word detection using the same pipeline we already loaded for lemmatization, so no additional setup is needed.

We apply this to the already-cleaned and lemmatized text_clean column for both the training and validation sets, so the stop word removal builds directly on our previous preprocessing steps and is applied consistently across both splits.

ag_news_train["text_no_stopwords"] = remove_stopwords(ag_news_train["text_clean"])
ag_news_val["text_no_stopwords"] = remove_stopwords(ag_news_val["text_clean"])

Top N terms and TF-IDF vectorization

The final two improvements we’ll apply are limiting the vocabulary size and switching from raw count vectorization to TF-IDF weighting. Conveniently, scikit-learn’s TfidfVectorizer handles both in a single step.

Recall from earlier that TF-IDF downweights words that appear frequently across many documents while upweighting words that are distinctive to particular documents. This cleans up uninformative words that don’t quite qualify as stop words but add little useful information to our dataset. The max_features=20000 argument caps the vocabulary at the 20,000 terms that occur most frequently across the corpus, which discards the long tail of rare words that are unlikely to generalize well and brings our feature matrix down to a much more manageable size. (The choice of 20,000 words is arbitrary; we could easily have used a smaller or larger number, depending on our dataset and use case.)

As with CountVectorizer, we fit only on the training data and then use that fixed vocabulary to transform both the training and validation sets:

from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizerNews = TfidfVectorizer(max_features=20000)
TfidfVectorizerNews.fit(ag_news_train["text_no_stopwords"])

ag_news_train_tfidf = TfidfVectorizerNews.transform(ag_news_train["text_no_stopwords"]).toarray()
ag_news_val_tfidf = TfidfVectorizerNews.transform(ag_news_val["text_no_stopwords"]).toarray()

We can inspect the resulting vocabulary and feature matrix exactly as we did before:

TfidfVectorizerNews.vocabulary_
{'fed': np.int64(6243),
 'pension': np.int64(13134),
 'default': np.int64(4469),
 'cite': np.int64(3200),
 'failure': np.int64(6109),
 'big': np.int64(1787),
 'airline': np.int64(401),
 'payment': np.int64(13051),
 'plan': np.int64(13424),
 'government': np.int64(7306),
 'official': np.int64(12453),
 'tuesday': np.int64(18437),
 'congress': np.int64(3691),
 'hard': np.int64(7689),
 'corporation': np.int64(3901),
...}
pd.DataFrame(ag_news_train_tfidf, columns=TfidfVectorizerNews.get_feature_names_out())

Compared to our baseline feature matrix of 59,544 columns filled almost entirely with zeros, this is considerably leaner. We now have 20,000 columns of weighted scores that better reflect each word’s actual importance to the document it appears in. It is still relatively sparse, but we can see from both the feature matrix and the vocabulary list that it is much more focused on semantically rich words.

Fitting the revised model

With our improved features in hand, we can now retrain the model. The call is identical to before, except we pass in the TF-IDF feature matrices instead of the raw count vectors, and the input size is now 20,000 rather than 59,544:

baseline_model = train_text_classification_model(
    ag_news_train_tfidf,
    ag_news_train["label"].to_numpy(),
    ag_news_val_tfidf,
    ag_news_val["label"].to_numpy(),
    ag_news_train_tfidf.shape[1],
    2,
    5000
)

predictions = generate_predictions(
    baseline_model,
    ag_news_val_tfidf,
    ag_news_val["label"].to_numpy()
)
Epoch [1/2], Train Loss: 0.3183, Train Acc: 0.8932, Val Loss: 0.2301, Val Acc: 0.9225
Epoch [2/2], Train Loss: 0.1512, Train Acc: 0.9475, Val Loss: 0.2332, Val Acc: 0.9243
Confusion Matrix - Raw Counts:
Predicted     1     2     3     4
Actual                           
1          2703    71   121   105
2            20  2955    13    12
3            68    19  2691   222
4            77    17   163  2743

The results are very encouraging! Our overall validation accuracy is essentially unchanged at around 92%, but we’ve achieved this with a feature matrix roughly a third of the size. This suggests that the extra vocabulary in the baseline (including the stop words) was contributing noise rather than signal. Reducing the size of the feature matrix makes our model more stable, less prone to overfitting, and much more manageable to deploy.

Looking at the confusion matrix, the pattern of errors is similar to before: Sports (category two) is the easiest category to classify, with 98.5% accuracy, while Business (category three) and Science/Technology (category four) remain the hardest to separate, with around 7% of articles in each category being misclassified as the other. This is consistent with what we saw in the baseline, so it seems that the preprocessing improvements have tightened things up at the margins, but the fundamental difficulty of the Business/Technology boundary is a property of the data rather than the feature representation.

Applying our model to the test set

Finally, we need to validate that our model performs as well on the test set as it does on the validation set. Up to this point, we’ve deliberately kept the test set locked away. As mentioned earlier, if we had been making modeling decisions based on test set performance, we’d risk inadvertently overfitting our choices to it, and our final accuracy estimate would be optimistic.

The preprocessing steps must be applied in exactly the same order as for the training and validation data: lemmatization, string cleaning, concatenation of title and description, and stop-word removal. Crucially, we also call .transform() rather than .fit_transform() on the test text, using the vocabulary learned from the training data:

ag_news_test["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_test["title"]))
ag_news_test["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_test["description"]))
ag_news_test["text_clean"] = ag_news_test["title_clean"] + " " + ag_news_test["description_clean"]
ag_news_test["text_no_stopwords"] = remove_stopwords(ag_news_test["text_clean"])

ag_news_test_tfidf = TfidfVectorizerNews.transform(ag_news_test["text_no_stopwords"]).toarray()

We can then generate predictions and evaluate accuracy on the test set:

test_predictions = generate_predictions(
    baseline_model,
    ag_news_test_tfidf,
    ag_news_test["label"].to_numpy()
)

from sklearn.metrics import accuracy_score

test_accuracy = accuracy_score(ag_news_test["label"].to_numpy(), test_predictions)
print(f"Test Accuracy: {test_accuracy:.4f}")
Test Accuracy: 0.9183

Confusion Matrix - Raw Counts:
Predicted     1     2     3     4
Actual                           
1          1710    54    78    58
2            13  1870    10     7
3            51    12  1676   161
4            53     9   115  1723

The test accuracy of 91.8% is very close to the 92.4% we saw on the validation set, which is a reassuring sign that our model has generalized well rather than overfitting to the validation data. The confusion matrix tells the same story as before: Sports (category two) remains the easiest category to classify, with only 30 misclassified articles out of 1,900, while the Business/Technology boundary continues to be the main source of errors, with around 8% of articles in each category being misclassified as the other. The consistency between validation and test results gives us confidence that these error patterns reflect genuine properties of the data rather than artifacts of any particular split.

Limitations and alternatives

Loses word order information

The most fundamental limitation of the bag-of-words model is right there in the name: it treats text as an unordered collection of words, discarding all sequence information. This means “the dog bit the man” and “the man bit the dog” produce identical vectors, even though they describe very different events. For many classification tasks, this doesn’t matter much, but for tasks that require understanding the relationship between words, such as question answering or natural language inference, the loss of word order is a serious handicap.

Ignores semantics and context

BoW has no notion of word meaning or context. Each word is simply a column in a matrix, entirely independent of every other word. This creates two related problems. First, synonyms are treated as completely distinct features: “cheap” and “inexpensive” contribute nothing to each other’s signal, even though they mean the same thing. Second, words with multiple meanings are treated as a single feature regardless of context: “bank” means the same thing whether it appears in a sentence about rivers or finance. Both of these issues limit how well BoW representations can capture the actual semantics of a text.

Can result in large, sparse vectors

As we saw in our own example, even a moderately sized corpus of news headlines can produce a vocabulary of nearly 60,000 unique terms. The resulting feature matrix has one column per vocabulary word, but any individual document only uses a tiny fraction of them, leaving the vast majority of values at zero. This sparsity creates two practical problems: The matrices can consume a large amount of memory if stored densely, and the high dimensionality can make it harder for models to find meaningful patterns, a phenomenon sometimes called the curse of dimensionality.
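
One practical mitigation is to keep the feature matrix in scipy’s sparse format rather than densifying it with .toarray(), as we did above for simplicity. A quick sketch of the memory difference (the matrix dimensions here are invented):

import numpy as np
from scipy.sparse import csr_matrix

# A toy 1,000 x 20,000 matrix where roughly 0.1% of entries are non-zero
rng = np.random.default_rng(0)
rows = rng.integers(0, 1_000, size=20_000)
cols = rng.integers(0, 20_000, size=20_000)
vals = np.ones(20_000, dtype=np.float32)

sparse = csr_matrix((vals, (rows, cols)), shape=(1_000, 20_000))
dense = sparse.toarray()

print(f"Sparse non-zero values: {sparse.data.nbytes / 1e6:.2f} MB")
print(f"Dense:                  {dense.nbytes / 1e6:.2f} MB")

Many scikit-learn estimators accept sparse input directly, so the dense conversion is often avoidable entirely.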

Alternatives

If BoW’s limitations are a bottleneck for your task, there are several well-established alternatives worth considering, such as dense word embeddings (for example, word2vec or GloVe), which place semantically similar words close together in vector space, and transformer-based models such as BERT, which produce contextual representations that capture word order and meaning.

For tasks where BoW already performs well, as we saw here with AG News, the added complexity of these approaches may not be worth the cost. BoW remains a strong baseline, and it’s always worth establishing how far it can take you before reaching for heavier machinery.

Get started with PyCharm today

In this post, we’ve covered a lot of ground: from the fundamentals of the bag-of-words model and how it converts text into numerical vectors, through to building and iteratively improving a real text classification pipeline on the AG News dataset. Along the way, we’ve seen how preprocessing steps like lemmatization, stop word removal, vocabulary capping, and TF-IDF weighting can meaningfully improve the efficiency of your feature representation, and how PyCharm’s DataFrame viewer, column statistics, chart view, and AI Assistant make each of these steps faster and easier to inspect and debug.

If you’d like to try this yourself, PyCharm Pro comes with a 30-day trial. As we saw in this tutorial, its built-in support for Jupyter notebooks, virtual environments, and scientific libraries means you can go from a blank project to a working NLP pipeline with minimal setup friction, leaving you free to focus on the fun parts. 

You can find the full code for this project on GitHub. If you’re interested in exploring more NLP topics, check out our recent blogs here.

April 29, 2026 05:42 PM UTC


PyCon

PyCon US 2026: Call for Volunteers

Looking to make a meaningful contribution to the Python community? Look no further than PyCon US 2026! Whether you're a seasoned Python pro or a newcomer to the community and looking to get involved, there's a volunteer opportunity that's perfect for you. 

Sign-up for volunteer roles is done directly through the PyCon US website. This way, you can view and manage shifts you sign up for through your personal dashboard! You can read up on the different roles to volunteer for and how to sign up on the PyCon US website.

PyCon US is largely organized and run by volunteers. Every year, we ask to fill over 300 onsite volunteer hours to ensure everything runs smoothly at the event. And the best part? You don't need to commit a lot of time to make a difference–some shifts are as short as 45 minutes long! You can sign up for as many or as few shifts as you’d like. Even a couple of hours of your time can go a long way in helping us create an amazing experience for attendees.

Keep in mind that you need to be registered for the conference to sign up for a volunteer role.

One important way to get involved is to sign up as a Session Chair or Session Runner. This is an excellent opportunity to meet and interact with speakers while helping to ensure that sessions run smoothly. And who knows, you might just learn something new along the way. :) If you’re looking for an important yet simple-to-learn role, you may be just the person we’ve been looking for!

If you sign up for these roles, we ask that you do your absolute best to avoid canceling or, worst case, not showing up, so that we can make sure we have coverage for all the necessary time slots. You can sign up for these roles directly on the Talks schedule: Sign up for an open time slot by clicking the [+ Volunteer] button in one of the talk slots for the session of your choice.

Volunteer your time at PyCon US 2026 and you’ll be part of a fantastic community that's passionate about Python programming. You can help us make this year's conference a huge success while connecting with your fellow event attendees. It’s especially great for first-timers looking to get the most out of PyCon US. Sign up today for the shifts that call to you and join the fun!

April 29, 2026 02:00 PM UTC


Real Python

AI Coding Agents Guide: A Map of the Four Workflow Types

AI coding agents can read your code, reason about changes, and act on your behalf. To choose the right one, it helps to understand the four common workflow types: integrated development environment (IDE), terminal, pull request (PR), and cloud.

In this tutorial, you’ll:

  • Identify the four common agent interaction modes
  • Understand what makes each workflow distinct
  • Recognize which mode fits common development scenarios
  • Weigh the risks and tradeoffs of each workflow

Before exploring the four workflow types, it’s worth looking at what makes a coding tool agentic in the first place.

Take the Quiz: Test your knowledge with our interactive “AI Coding Agents Guide: A Map of the Four Workflow Types” quiz. You’ll receive a score upon completion to help you track your learning progress:


Interactive Quiz

AI Coding Agents Guide: A Map of the Four Workflow Types

Check your understanding of how AI coding agents fit into your workflow through four interaction modes: IDE, terminal, pull request, and cloud.

Get Your Cheat Sheet: Click here to download your free AI coding agents cheat sheet and keep the four workflow types at your fingertips when choosing the right agent for the job.

Understanding AI Coding Agents

While standard chatbots provide one-off answers, coding agents are designed for autonomy, operating through a continuous execution loop to solve complex tasks. This loop typically follows four distinct steps:

  1. Read: They read relevant files from your codebase to form their context.
  2. Reason: They determine the logical steps needed to achieve your goal.
  3. Act: They execute those steps by editing files, running terminal commands, or using external tools.
  4. Evaluate: They check the results of their actions to see if more work is needed.

This loop repeats until the task is completed or the agent hands control back to you. Unlike simple predictive text or one-off prompts, agents bridge the gap between suggestion and execution by autonomously navigating the development workflow.

The core agent loop will generally stay the same, but where an agent runs will shape how you interact with it:

  • In an editor, it works alongside you.
  • In a terminal, you guide it step by step.
  • In pull requests, it reviews changes asynchronously.
  • In the cloud, it works in a managed environment and reports back later.

These environments define four primary agent types, each enabling a distinct workflow: IDE agents, terminal agents, PR agents, and cloud agents.

Exploring the Four Workflow Types

The four workflow types describe interaction modes and don’t always map cleanly to product categories. The same tool often spans multiple workflows. For example, Claude Code runs in your terminal, in your editor, and in the cloud with Claude Code on the web. It can also review pull requests with Code Review.

The goal is to match the workflow to the task. The diagram below summarizes the four types at a glance:

[Diagram: The Four Coding Agent Workflows]

Read the full article at https://realpython.com/ai-coding-agents-guide/ »


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

April 29, 2026 02:00 PM UTC

Quiz: ChatterBot: Build a Chatbot With Python

In this quiz, you’ll test your understanding of ChatterBot: Build a Chatbot With Python.

You’ll revisit how ChatterBot learns from conversation data, how it picks replies based on similarity to what it’s already seen, and how it can pull in a local LLM to round out its responses.


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

April 29, 2026 12:00 PM UTC

Quiz: Python 3.13: A Modern REPL

Test your knowledge of the redesigned interactive interpreter introduced in Python 3.13: A Modern REPL, including the help system, multiline statement editing, code pasting improvements, and the history browser.

Good luck!


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

April 29, 2026 12:00 PM UTC


Python GUIs

Actions in one thread changing data in another — How to communicate between threads and windows in PyQt6

I have a main window that starts background threads (e.g., handling GPIO data). From the main window I open secondary windows using buttons. When I press a button in a secondary window, I can't change anything in the background threads. But if I press a button in the main window, everything works. How do I communicate between a secondary window and a thread that was started from the main window?

This is a common problem when building PyQt6 applications with multiple windows and background threads. The good news is that Qt's signal and slot system is designed to handle this and it works safely across threads.

The core idea is that your secondary window doesn't need direct access to the thread or the worker object. Instead the secondary window and the worker just need access to the same signals, and can then use them to communicate with one another. Qt handles the cross-thread communication automatically.

Why doesn't direct access work?

When you create a background thread from the main window, you'll often store a reference to that thread on the main window. If that main window then creates a sub-window, the sub-window doesn't have any access to the objects on its parent. Even if it did, calling methods on the thread directly is not usually the right approach.

You can access the attributes of a parent window using .parent(), but this is a bad habit because it tightly couples the parts of your application together: if you modify the structure of the parent window, you then also need to edit the sub-window. There are better ways that keep things nicely isolated.

The solution is to avoid calling methods directly across threads. Instead, use signals and slots. When a signal is emitted in one thread and connected to a slot in another, Qt automatically queues the call and delivers it safely.

Setting up a background worker

First, let's create a simple worker class that runs in a background thread. This worker simulates handling incoming data (like GPIO data) and also accepts commands from the GUI.

from PyQt6.QtCore import QObject, pyqtSignal, pyqtSlot
import time


class Worker(QObject):
    """A worker that runs in a background thread."""
    data_updated = pyqtSignal(str)

    def __init__(self):
        super().__init__()
        self.running = True
        self.current_value = 0

    @pyqtSlot()
    def run(self):
        """Simulate continuous data handling."""
        while self.running:
            self.current_value += 1
            self.data_updated.emit(f"Data: {self.current_value}")
            time.sleep(1)

    @pyqtSlot(int)
    def set_value(self, value):
        """Receive a new value from the GUI."""
        self.current_value = value
        self.data_updated.emit(f"Value set to: {self.current_value}")

The set_value slot is what we'll trigger from the secondary window. Because it's a slot connected via a signal, Qt will deliver the call on the correct thread.

Creating the secondary window

The secondary window has a button and a spin box. When the user clicks the button, the window emits a signal carrying the new value. The secondary window doesn't know anything about the worker — it just emits a signal.

from PyQt6.QtWidgets import QWidget, QVBoxLayout, QPushButton, QSpinBox, QLabel
from PyQt6.QtCore import pyqtSignal


class SecondaryWindow(QWidget):
    """A secondary window that emits a signal when the user sets a value."""
    value_changed = pyqtSignal(int)

    def __init__(self):
        super().__init__()
        self.setWindowTitle("Secondary Window")

        layout = QVBoxLayout()

        self.label = QLabel("Set a new value for the worker:")
        layout.addWidget(self.label)

        self.spinbox = QSpinBox()
        self.spinbox.setRange(0, 1000)
        layout.addWidget(self.spinbox)

        self.button = QPushButton("Send to Worker")
        self.button.clicked.connect(self.send_value)
        layout.addWidget(self.button)

        self.setLayout(layout)

    def send_value(self):
        self.value_changed.emit(self.spinbox.value())

The value_changed signal is the only interface this window exposes. This keeps things clean and decoupled.

Wiring everything together in the main window

The main window is where all the connections happen. It creates the worker, starts the thread, opens the secondary window, and connects the secondary window's signal to the worker's slot.

from PyQt6.QtWidgets import QMainWindow, QVBoxLayout, QPushButton, QLabel, QWidget
from PyQt6.QtCore import QThread


class MainWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Main Window")

        # Set up the UI
        layout = QVBoxLayout()

        self.status_label = QLabel("Waiting for data...")
        layout.addWidget(self.status_label)

        self.open_button = QPushButton("Open Secondary Window")
        self.open_button.clicked.connect(self.open_secondary)
        layout.addWidget(self.open_button)

        container = QWidget()
        container.setLayout(layout)
        self.setCentralWidget(container)

        # Keep a reference to the secondary window
        self.secondary_window = None

        # Set up the background thread and worker
        self.thread = QThread()
        self.worker = Worker()
        self.worker.moveToThread(self.thread)

        # Connect signals
        self.thread.started.connect(self.worker.run)
        self.worker.data_updated.connect(self.update_status)

        # Start the thread
        self.thread.start()

    def update_status(self, text):
        self.status_label.setText(text)

    def open_secondary(self):
        if self.secondary_window is None:
            self.secondary_window = SecondaryWindow()

            # Connect the secondary window's signal to the worker's slot.
            # This is the connection that makes cross-window,
            # cross-thread communication work.
            self.secondary_window.value_changed.connect(self.worker.set_value)

        self.secondary_window.show()

    def closeEvent(self, event):
        self.worker.running = False
        self.thread.quit()
        self.thread.wait()
        super().closeEvent(event)

The line that connects everything together is:

self.secondary_window.value_changed.connect(self.worker.set_value)

This connects a signal from the secondary window (running in the main/GUI thread) to a slot on the worker (which has been moved to a background thread). Qt sees that the sender and receiver live in different threads, so it automatically uses a queued connection. The slot call is placed into the background thread's event queue and executed there.

Understanding why the main window worked but the secondary didn't

In the original question, buttons in the main window could affect the background threads, but buttons in a secondary window could not. This usually happens because:

  1. The main window had direct signal-slot connections to the worker (set up when both the worker and the connections were created).
  2. The secondary window was created later, and its signals were never connected to the worker.

The solution is to connect its signals to the appropriate worker slots when you create the secondary window, just as you would for the main window. The worker doesn't care where the signal comes from — it just responds to whatever signals are connected to its slots. For more on managing multiple windows in PyQt6, see our tutorial on creating multiple windows.

A note about QThreadPool vs QThread

The original question mentions using QThreadPool. If you're using QRunnable with a QThreadPool, the pattern is slightly different because QRunnable doesn't inherit from QObject and can't have slots directly. In that case, you typically create a separate QObject-based signals class and attach it to your runnable. For a detailed walkthrough of that approach, see Multithreading PyQt6 applications with QThreadPool.

However, for long-running background tasks that need two-way communication with the GUI (like GPIO handling), QThread with moveToThread() is usually a better fit. It gives you a proper event loop in the background thread, which means signals and slots work naturally in both directions.
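
For reference, here is a minimal sketch of that QRunnable pattern (the class names are our own, not a Qt API):

from PyQt6.QtCore import QObject, QRunnable, QThreadPool, pyqtSignal


class WorkerSignals(QObject):
    """Holds the signals for a runnable, since QRunnable can't define them itself."""

    result = pyqtSignal(str)


class RunnableWorker(QRunnable):
    def __init__(self):
        super().__init__()
        self.signals = WorkerSignals()  # attach a signals object to the runnable

    def run(self):
        # Do the background work, then report back through the signals object
        self.signals.result.emit("done")


# Inside a running Qt application, connect to the signals before starting:
worker = RunnableWorker()
worker.signals.result.connect(print)
QThreadPool.globalInstance().start(worker)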

Complete working example

Here's everything in a single file you can copy, run, and experiment with. If you're new to PyQt6, you may want to start with creating your first window before diving in.

import sys
import time

from PyQt6.QtCore import QObject, QThread, pyqtSignal, pyqtSlot
from PyQt6.QtWidgets import (
    QApplication,
    QLabel,
    QMainWindow,
    QPushButton,
    QSpinBox,
    QVBoxLayout,
    QWidget,
)


class Worker(QObject):
    """A worker that runs in a background thread."""

    data_updated = pyqtSignal(str)

    def __init__(self):
        super().__init__()
        self.running = True
        self.current_value = 0

    @pyqtSlot()
    def run(self):
        """Simulate continuous data handling."""
        while self.running:
            self.current_value += 1
            self.data_updated.emit(f"Data: {self.current_value}")
            time.sleep(1)

    @pyqtSlot(int)
    def set_value(self, value):
        """Receive a new value from the GUI."""
        self.current_value = value
        self.data_updated.emit(f"Value set to: {self.current_value}")


class SecondaryWindow(QWidget):
    """A secondary window that emits a signal when the user sets a value."""

    value_changed = pyqtSignal(int)

    def __init__(self):
        super().__init__()
        self.setWindowTitle("Secondary Window")

        layout = QVBoxLayout()

        self.label = QLabel("Set a new value for the worker:")
        layout.addWidget(self.label)

        self.spinbox = QSpinBox()
        self.spinbox.setRange(0, 1000)
        layout.addWidget(self.spinbox)

        self.button = QPushButton("Send to Worker")
        self.button.clicked.connect(self.send_value)
        layout.addWidget(self.button)

        self.setLayout(layout)

    def send_value(self):
        self.value_changed.emit(self.spinbox.value())


class MainWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Main Window")

        # Set up the UI
        layout = QVBoxLayout()

        self.status_label = QLabel("Waiting for data...")
        layout.addWidget(self.status_label)

        self.open_button = QPushButton("Open Secondary Window")
        self.open_button.clicked.connect(self.open_secondary)
        layout.addWidget(self.open_button)

        container = QWidget()
        container.setLayout(layout)
        self.setCentralWidget(container)

        # Keep a reference to the secondary window
        self.secondary_window = None

        # Set up the background thread and worker
        self.thread = QThread()
        self.worker = Worker()
        self.worker.moveToThread(self.thread)

        # Connect signals
        self.thread.started.connect(self.worker.run)
        self.worker.data_updated.connect(self.update_status)

        # Start the thread
        self.thread.start()

    def update_status(self, text):
        self.status_label.setText(text)

    def open_secondary(self):
        if self.secondary_window is None:
            self.secondary_window = SecondaryWindow()
            # Connect the secondary window's signal to the worker's slot
            self.secondary_window.value_changed.connect(
                self.worker.set_value
            )
        self.secondary_window.show()

    def closeEvent(self, event):
        self.worker.running = False
        self.thread.quit()
        self.thread.wait()
        super().closeEvent(event)


app = QApplication(sys.argv)
window = MainWindow()
window.show()
sys.exit(app.exec())

When you run this, you'll see the main window counting up once per second. Click "Open Secondary Window", enter a number, and click "Send to Worker" — the worker's counter will jump to your chosen value and continue counting from there.

The secondary window communicates with the background thread entirely through signals and slots, with no direct method calls across threads. This pattern scales well — you can connect as many windows as you like to the same worker, or connect one window to multiple workers. As long as you use signals and slots for cross-thread communication, Qt handles the thread safety for you.

For an in-depth guide to building Python GUIs with PyQt6 see my book, Create GUI Applications with Python & Qt6.

April 29, 2026 06:00 AM UTC

April 28, 2026


Talk Python Blog

Introducing the new Talk Python web player

We expect that most people who listen to Talk Python do so through their podcast player apps on their phone or even on their laptops. But there are plenty of times that people end up on an episode page and would love to have a nice experience interacting with that episode as well. One really common example: you go back to an episode you discovered several years ago, and the chances it’s still on your device are low. Though we do keep our entire back catalog available in the RSS feed, most podcast players trim down what they keep locally.

April 28, 2026 07:40 PM UTC


PyCoder’s Weekly

Issue #732: Web Scraping, Altair Charts, OpenAI's API, and More (April 28, 2026)

#732 – APRIL 28, 2026
View in Browser »

The PyCoder’s Weekly Logo


browser-use vs. Playwright: Which to Pick for Web Scraping?

Follow along in this walk-through building a Hacker News synthesizer with browser-use, then see it fail on a harder Newegg scraping task. Includes a side-by-side comparison with Playwright and a breakdown of when each tool is the right call.
CODECUT.AI • Shared by Khuyen Tran

Altair: Declarative Charts With Python

Build interactive Python charts the declarative way with Altair. Map data to visual properties and add linked selections. No JavaScript required.
REAL PYTHON

Positron: The Data Science IDE from Posit PBC


Positron is a free IDE built for Python data science. AI assistance, interactive data frames, Jupyter notebooks, and instant app deployment, all in one place. Stop context-switching. Start shipping. Download free.
POSIT PBC sponsor

Leverage OpenAI’s API in Your Python Projects

Learn how to use the ChatGPT API with Python’s openai library to send prompts, control AI behavior with roles, and get structured outputs.
REAL PYTHON course

Quiz: Leverage OpenAI’s API in Your Python Projects

REAL PYTHON

Python Software Foundation Fellow Members for Q1 2026!

PYTHON SOFTWARE FOUNDATION

PEP 708: Extending the Repository API to Mitigate Dependency Confusion Attacks (Rejected)

PYTHON.ORG

PEP 806: Mixed Sync/Async Context Managers With Precise Async Marking (Rejected)

PYTHON.ORG

PEP 833: Freezing the HTML Simple Repository API (Draft)

PYTHON.ORG

Articles & Tutorials

Fixing a Memory “Leak” From Python 3.14’s Incremental Garbage Collection

Adam encountered an out-of-memory error while migrating a client project to Python 3.14. The issue occurred when running Django’s database migration command on a limited-resource server, and seemed to be caused by the new incremental garbage collection algorithm in Python 3.14.
ADAM JOHNSON

Logging to File and to Textual Console

When writing TUI applications in Textual you can’t just print out your debug info since the terminal is controlled by the framework. This article shows you how to log and use Textual’s built-in debug console.
MIKE DRISCOLL

Beyond Basic RAG: Build Persistent AI Agents

Master next-gen AI with Python notebooks for agentic reasoning, memory engineering, and multi-agent orchestration. Scale apps using production-ready patterns for LangChain, LlamaIndex, and high-performance vector search. Explore & Star on GitHub.
ORACLE sponsor

Read the Docs Now Supports uv Natively

Popular open source documentation site Read the Docs has announced they now support native uv in .readthedocs.yaml for Python dependency installation. Learn how to use it in your configurations.
READ THE DOCS

PyTexas 2026 Recap

Per-talk notes from PyTexas 2026 in Austin: Hynek on domain modeling, Dawn Wages on specialization, MCP security, PEP 810 lazy imports, free-threading, Ruff, ty, uv, supply chain.
BERNÁT GÁBOR

The Carbon Footprint of Wagtail AI

One of the package maintainers for Wagtail AI shares his method for measuring the carbon impact of the different AI tasks users can do and goes over the initial results.
WAGTAIL.ORG • Shared by Meagen Voss

Gemini CLI vs Claude Code: Which to Choose for Python Tasks

Gemini CLI vs Claude Code: compare setup, performance, code quality, and cost to find the right Python AI coding tool for your workflow.
REAL PYTHON

Learn the Agentic Coding Workflow That Actually Works on Real Projects

65% of Python developers are stuck using AI for small tasks that fall apart on anything real. This 2-day live course (May 6-7 via Zoom) walks you through building a complete Python CLI app with Claude Code, from an empty directory to a shipped project on GitHub.
REAL PYTHON

Implementing OpenTelemetry in FastAPI

Learn how you can observe your FastAPI web apps with OpenTelemetry, including how to integrate it and why it is important.
SIGNOZ.IO • Shared by Dhruv Ahuja

Building a Python Library in 2026

So you want to build a Python library in 2026? Here’s everything you need to know about the state of the art.
STEPHEN IF

Projects & Code

Local Usage PyPI Alternative With Vulnerability Scanning

Very interesting project
GITHUB.COM/RUSTEDBYTES • Shared by Yehor Smoliakov

typeform: Type-Safe UI/CLI Generator Powered by Pydantic

GITHUB.COM/STHITAPRAJNAS

vibescore: One-Command Quality Score for Any Python Project

GITHUB.COM/STEF41 • Shared by Anonymous

dash: Data Apps & Dashboards for Python

GITHUB.COM/PLOTLY

profiling-explorer: Table-Based Profile Exploration Tool

GITHUB.COM/ADAMCHAINZ

Events

Weekly Real Python Office Hours Q&A (Virtual)

April 29, 2026
REALPYTHON.COM

PyCamp Spain 2026

April 30 to May 4, 2026
PYCAMP.ES

PyDelhi User Group Meetup

May 2, 2026
MEETUP.COM

PyBodensee Monthly Meetup

May 4, 2026
PYBODENSEE.COM

IndyPy: Lightning Talks

May 5 to May 6, 2026
MEETUP.COM


Happy Pythoning!
This was PyCoder’s Weekly Issue #732.
View in Browser »


[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]

April 28, 2026 07:30 PM UTC


Django Weblog

Renew Your PyCharm License and Support Django

Only a few days remain to support the Django Software Foundation through our annual JetBrains fundraiser.

You can now use the offer for new purchases and annual renewals. If your PyCharm Professional subscription expires this year, this is a great time to renew or extend it for up to 12 months.

Get 30% off PyCharm Professional, and 100% of proceeds from qualifying purchases and renewals go to the DSF to help fund Django Fellows, community programs, events, and the future of Django.

👉 Offer ends May 1: Learn more about the fundraiser

👉 Claim 30% off here: Get the JetBrains offer

April 28, 2026 07:20 PM UTC


Mariatta

PyCascades 2026 Recap

PyCascades 2026 Recap

PyCascades 2026 took place in Vancouver this year. I only got to attend the first day, because I had a 5 a.m. flight to Washington DC the morning after.

Still, the first day’s talks were all very insightful and interesting. I’m waiting for all the talks to be published so that I can catch up on the ones I missed.

Here are notes on the talks I got to see.

April 28, 2026 04:36 PM UTC


Real Python

Testing Your Code With Python's unittest

The Python standard library ships with a testing framework named unittest, which you can use to write automated tests for your code. The unittest package has an object-oriented approach where test cases derive from a base class, which has several useful methods.

The framework supports many features that will help you write consistent unit tests for your code. These features include test cases, fixtures, test suites, and test discovery capabilities.

In this video course, you’ll learn how to:

  • Create test cases by subclassing unittest.TestCase
  • Check outcomes with assertion methods
  • Group tests into suites and run them with test discovery
  • Prepare test data with fixtures

To get the most out of this video course, you should be familiar with some important Python concepts, such as object-oriented programming, inheritance, and assertions. Having a good understanding of code testing is a plus.


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

April 28, 2026 02:00 PM UTC

Quiz: Use Codex CLI to Enhance Your Python Projects

In this quiz, you’ll test your understanding of Use Codex CLI to Enhance Your Python Projects.

By working through this quiz, you’ll revisit how to install and configure Codex CLI, use Plan mode to review changes before they land, and refine features through iterative prompting in your terminal.


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

April 28, 2026 12:00 PM UTC

Quiz: Testing Your Code With Python's unittest

In this quiz, you’ll test your understanding of Testing Your Code With Python’s unittest.

By working through this quiz, you’ll revisit key concepts like structuring tests with TestCase, using assertion methods, skipping tests conditionally, parameterizing with subtests, and preparing test data with fixtures.


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

April 28, 2026 12:00 PM UTC


PyPy

PyPy v7.3.22 release

PyPy v7.3.22: release of python 2.7, 3.11

The PyPy team is proud to release version 7.3.22 of PyPy after the previous release on March 13, 2026. This is a bug-fix release that fixes several issues in the JIT, among them a long-standing JIT bug that recent instance optimizations had exposed. We also cleaned up many of the remaining stdlib test suite failures, which improves CPython compatibility around line numbers in dis.dis, signatures and objclass attributes for builtins, and other quality of life features.

There is now an RPython _pickle module that mirrors the CPython one, greatly speeding up pickling operations. Where before PyPy was 5.7x slower than CPython on the pickle benchmark from the pyperformance benchmark suite, now it is only 1.6x slower [0]. We also added pypy pickler extensions to dump and load lists using list strategies, and enabled them in the ForkingPickler used by multiprocessing, speeding up cases where such objects are passed between PyPy multiprocessing instances.

We also added an RPython json encoder, speeding up json_bench from being 2.6x slower than CPython to being 0.7x (meaning faster).

The release includes two different interpreters:

  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.7
  • PyPy3.11, which is an interpreter supporting the syntax and the features of Python 3.11

The interpreters are based on much the same codebase, thus the double release. This is a micro release, all APIs are compatible with the other 7.3 releases.

We recommend updating. You can find links to download the releases here:

https://pypy.org/download.html

We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work. If PyPy is helping you out, we would love to hear about it and encourage submissions to our blog via a pull request to https://github.com/pypy/pypy.org

We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: bug fixes, PyPy and RPython documentation improvements, or general help with making RPython's JIT even better.

If you are a python library maintainer and use C-extensions, please consider making a HPy / CFFI / cppyy version of your library that would be performant on PyPy. In any case, cibuildwheel supports building wheels for PyPy.

Footnotes

[0]

Once a PR to pyperformance to use the _pickle module on PyPy is accepted

What is PyPy?

PyPy is a Python interpreter and a drop-in replacement for CPython. It's fast (see the PyPy and CPython performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

We provide binary builds for:

  • x86 machines on most common operating systems (Linux 32/64 bits, macOS 64 bits, Windows 64 bits)
  • 64-bit ARM machines running Linux (aarch64) and macOS (macos_arm64)

PyPy supports Windows 32-bit, Linux PPC64 big- and little-endian, Linux ARM 32 bit, RISC-V RV64IMAFD Linux, and s390x Linux but does not release binaries. Please reach out to us if you wish to sponsor binary releases for those platforms. Downstream packagers provide binary builds for debian, Fedora, conda, OpenBSD, FreeBSD, Gentoo, and more.

What else is new?

For more information about the 7.3.22 release, see the full changelog.

Please update, and continue to help us make pypy better.

Cheers, The PyPy Team

April 28, 2026 10:00 AM UTC