惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

P
Privacy International News Feed
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
Jina AI
Jina AI
T
Tailwind CSS Blog
WordPress大学
WordPress大学
Scott Helme
Scott Helme
C
Cybersecurity and Infrastructure Security Agency CISA
博客园 - Franky
C
CERT Recently Published Vulnerability Notes
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
雷峰网
雷峰网
Schneier on Security
Schneier on Security
博客园 - 聂微东
T
Tor Project blog
Hugging Face - Blog
Hugging Face - Blog
博客园 - 司徒正美
AI
AI
T
Troy Hunt's Blog
Security Latest
Security Latest
T
The Blog of Author Tim Ferriss
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
C
Check Point Blog
T
Threat Research - Cisco Blogs
W
WeLiveSecurity
V
Vulnerabilities – Threatpost
Recorded Future
Recorded Future
Recent Commits to openclaw:main
Recent Commits to openclaw:main
Cisco Talos Blog
Cisco Talos Blog
C
CXSECURITY Database RSS Feed - CXSecurity.com
Cloudbric
Cloudbric
J
Java Code Geeks
罗磊的独立博客
C
Cyber Attacks, Cyber Crime and Cyber Security
aimingoo的专栏
aimingoo的专栏
L
LangChain Blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
P
Privacy & Cybersecurity Law Blog
Google DeepMind News
Google DeepMind News
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
L
Lohrmann on Cybersecurity
I
InfoQ
MongoDB | Blog
MongoDB | Blog
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
The GitHub Blog
The GitHub Blog
The Hacker News
The Hacker News
H
Help Net Security
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
P
Proofpoint News Feed
N
News and Events Feed by Topic

Arpit Bhayani

Temporal Primer - Building Long-Running Systems What Matters in Production RAG Structure of Every LLM Chat How LLMs Really Work Your Monolith Is Already A Distributed System Databases Were Not Designed For This BM25 JOIN Algorithms Venting at Work Comes at a Reputation Cost Why Half Your Skills Expire Every Few Years Multi-Paxos - Consensus in Distributed Databases MySQL Replication Internals Bloom Filters When You Increase Kafka Partitions Product Quantization The Q, K, V Matrices The Day I Accidentally Deleted Production How LLM Inference Works What are Blocking Queues and Why We Need Them Heartbeats in Distributed Systems How Writes Work in Apache Cassandra Redis Replication Internals How to Handle Arrogant Colleagues at Work How Does a CDN Handle Content Replication You Can't Fix Everything on Day One When Emotions Spill Over at Work Why gRPC Uses HTTP2 Meetings With No Agenda Are a Waste of Time Career Longevity Beats Constant Job Hopping Stay Relevant at Higher Salary Levels Why Distributed Systems Need Consensus Algorithms Like Raft Why Do Databases Deadlock and How Do They Resolve It Why and How Cache Locality Can Make Your Code Faster Why Eventual Consistency is Preferred in Distributed Systems Why does DNS use both UDP and TCP Should You Do a Master's My Honest Take Empathy Makes Great Engineers Unstoppable Good Mentors Build People, Not Just Skills Why You Should Always Have Back-Burner Projects Before You Push Back, Know What You're Standing On Be the One They Can Count On How Much Are People Willing to Bet on You How to Get Leadership to Say Yes to Your Project Don't Let Your Best Ideas Die in Silence Be the Person Everyone Wants to Work With The XY Problem and How to Avoid It The Startup Hiring Lie Nobody Talks About You Won't Be Promoted Unless You Ask It's Not Enough to be Right; Learn to be Heard No One Ships Great Software Alone You Don't Win by Proving Others Wrong Appreciate Generously; It Costs Nothing, But Builds Everything Your Soft Skills Aren't Soft at All Before you form an opinion, experience it Why You Need Both Curiosity and Action to Thrive A Daily Worklog Changed Everything How We Handle Mistakes Defines Us Own Your Mistakes Don't Wait. Step Up. Temporary Fixes Are Permanent Why Interviews Are Biased And What Sets You Apart Saying 'This isn't my problem' is actually the problem How to Write Effective OKRs Never Lose a Battle due to Miscommunication When In Doubt, Code It Out How to Follow Up Without Annoying People Lead Projects That Land, Execution Over Everything Abstract Thinking Will Define Your Next Decade We Engineers Suck at Task Estimation Shiny Obect Syndrome in Tech When to Change Jobs - The 3P Framework Comfort and Competition - Know When to Switch Gears Paper Notes - On-demand Container Loading in AWS Lambda Paper Notes - SQL Has Problems. We Can Fix Them Pipe Syntax In SQL Paper Notes - NanoLog - A Nanosecond Scale Logging System Don't Wait, Learn - The Best Resource is Mythical Paper Notes - WTF - The Who to Follow Service at Twitter The Unexpected Benefit of Reading Random Engineering Articles Roadmaps Are Limiting Your Growth Stop Leaving Money on the Table - Negotiate Your Job Offer Never Bad-Mouth Your Past Employers Show You're a Culture Fit Quantify your resume, Know Your Numbers The Importance of Being Likeable in Interviews Questions to Ask Your Interviewer How to Build Trust Through Collaboration Do This, Once You Are Out of the Interview Cycle Stop Pitching Ideas, Start Pitching Projects Read Those Design Docs, Even the Ones That Seem Irrelevant The Best Engineering Lessons Happen During Outages Great Engineers Start Broad LLM Summaries are Ruining Your Learning Turn System Design Interviews into Discussions Title Inflation At Work, Find Your Own Projects 6 Simple Strategies to Cracking Any Tech Interview How to Remain Unblocked Solving the Knapsack Problem with Evolutionary Algorithms Generating Pseudorandom Numbers with LFSR Local vs Global Indexes in Partitioned Databases
How Python Uses String Interning to Keep Runtime Efficient
Arpit Bhayani · 2020-12-20 · via Arpit Bhayani

Every programming language aims to be performant in its niche and achieving superior performance requires a bunch of compiler and interpreter level optimizations. Since character Strings are an integral part of any programming language, having the ability to perform string operations quickly elevates the overall performance.

In this essay, we dive deep into Python internals and find out how Python makes its interpreter performant using a technique called String Interning. This essay not only aims to put forth Python internals but also aims to make the reader comfortable in navigating through Python’s source code; so expect a lot of code snippets taken from CPython.

String Interning

String Interning is a compiler/interpreter optimization method that makes common string processing tasks space and time efficient by caching them. Instead of creating a new copy of string every time, this optimization method dictates to keep just one copy of string for every appropriate immutable distinct value and use the pointer reference wherever referred.

The single copy of each string is called its intern and hence the name String Interning. The lookup of string intern, may or may not be exposed as a public interfaced method. Modern programming languages like Java, Python, PHP, Ruby, Julia, and many more, performs String Interning to make their compilers and interpreters performant.

https://user-images.githubusercontent.com/4745789/102705512-d1691680-42ae-11eb-825f-1e032a7c12c5.png

Why should Strings be interned?

String Interning speeds up string comparisons. Without interning if we were to compare two strings for equality the complexity of it would shoot up to O(n) where we examine every character from both the strings to decide their equality. But if the strings are interned, instead of checking every character, equal strings will have the same object reference so just a pointer quality check would be sufficient to say if two string literals are equal. Since this is a very common operation, this is typically implemented as a pointer equality test, using just a single machine instruction with no memory reference at all.

String Interning reduces the memory footprint. Instead of filling memory with redundant String objects, Python optimizes memory footprint by sharing and reusing already defined objects as dictated by the flyweight design pattern.

String Interning in Python

Just like most other modern programming languages, Python also does String Interning to gain a performance boost. In Python, we can find if two objects are referring to the same in-memory object using the is operator. So if two string objects refer to the same in-memory object, the is operator yields True otherwise False.

>>> 'python' is 'python'
True

We can use this particular operator to test which all strings are interned and which are not. In CPython, String Interning is implemented through the following function, declared in unicodeobject.h and defined in unicodeobject.c.

PyAPI_FUNC(void) PyUnicode_InternInPlace(PyObject **);

In order to check if a String is interned, CPython implements a macro named PyUnicode_CHECK_INTERNED, again defined in unicodeobject.h. The macro suggests that the Python maintains a member named interned in PyASCIIObject structure whose value suggests if the corresponding String is interned or not.

#define PyUnicode_CHECK_INTERNED(op) \
    (((PyASCIIObject *)(op))->state.interned)

Internals of String Interning

In CPython, the String references are stored, accessed, and managed using a Python dictionary named interned. This dictionary is lazily initialized upon the first String Intern invocation and holds the reference to all the interned String objects.

Interning the String

The core function responsible for interning the String is named PyUnicode_InternInPlace defined in unicodeobject.c that upon invocation lazily builds the main dictionary interned to hold all interned strings and then registers the object into it with the key and the value both set as the same object reference. The following function snippet shows the String Interning process as implemented in Python.

void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;

    .........

    // Lazily build the dictionary to hold interned Strings
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear();
            return;
        }
    }

    PyObject *t;

    // Make an entry to the interned dictionary for the
    // given object
    t = PyDict_SetDefault(interned, s, s);

    .........

    // The two references in interned dict (key and value) are
    // not counted by refcnt.
    // unicode_dealloc() and _PyUnicode_ClearInterned() take
    // care of this.
    Py_SET_REFCNT(s, Py_REFCNT(s) - 2);

    // Set the state of the string to be INTERNED
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}

Cleanup of Interned Strings

The cleanup function iterates over all the Strings held in the interned dictionary, adjusts the reference counts of the object, and marks them as NOT_INTERNED allowing them to be garbage collected. Once all the strings are marked as NOT_INTERNED, the interned dictionary is cleared and deleted. The cleanup function is defined in unicodeobject.c by the name _PyUnicode_ClearInterned.

void
_PyUnicode_ClearInterned(PyThreadState *tstate)
{
    .........

    // Get all the keys to the interned dictionary
    PyObject *keys = PyDict_Keys(interned);

    .........

    // Interned Unicode strings are not forcibly deallocated;
    // rather, we give them their stolen references back
    // and then clear and DECREF the interned dict.

    for (Py_ssize_t i = 0; i < n; i++) {
        PyObject *s = PyList_GET_ITEM(keys, i);

        .........

        switch (PyUnicode_CHECK_INTERNED(s)) {
        case SSTATE_INTERNED_IMMORTAL:
            Py_SET_REFCNT(s, Py_REFCNT(s) + 1);
            break;
        case SSTATE_INTERNED_MORTAL:
            // Restore the two references (key and value) ignored
            // by PyUnicode_InternInPlace().
            Py_SET_REFCNT(s, Py_REFCNT(s) + 2);
            break;
        case SSTATE_NOT_INTERNED:
            /* fall through */
        default:
            Py_UNREACHABLE();
        }

        // marking the string to be NOT_INTERNED
        _PyUnicode_STATE(s).interned = SSTATE_NOT_INTERNED;
    }

    // decreasing the reference to the initialized and
    // access keys object.
    Py_DECREF(keys);

    // clearing the dictionary
    PyDict_Clear(interned);

    // clearing the object interned
    Py_CLEAR(interned);
}

String Interning in Action

Now that we understand the internals of String Interning and Cleanup, we find out what all Strings are interned in Python. To discover the spots all we do is grep for the function invocation for PyUnicode_InternInPlace in the CPython source code and peek at the neighboring code. Here is a list of interesting spots where String Interning happens in Python.

Variables, Constants, and Function Names

CPython performs String Interning on constants such as Function Names, Variable Names, String Literals, etc. Following is the snippet from codeobject.c that suggests that when a new PyCode object is created the interpreter is interning all the compile-time constants, names, and literals.

PyCodeObject *
PyCode_NewWithPosOnlyArgs(int argcount, int posonlyargcount, int kwonlyargcount,
                          int nlocals, int stacksize, int flags,
                          PyObject *code, PyObject *consts, PyObject *names,
                          PyObject *varnames, PyObject *freevars, PyObject *cellvars,
                          PyObject *filename, PyObject *name, int firstlineno,
                          PyObject *linetable)
{

    ........

    if (intern_strings(names) < 0) {
        return NULL;
    }

    if (intern_strings(varnames) < 0) {
        return NULL;
    }

    if (intern_strings(freevars) < 0) {
        return NULL;
    }

    if (intern_strings(cellvars) < 0) {
        return NULL;
    }

    if (intern_string_constants(consts, NULL) < 0) {
        return NULL;
    }

    ........

}

Dictionary Keys

CPython also interns thee Strings which keys of any dictionary object. Upon putting an item in the dictionary the interpreter String Interning on the key against which item is stored. The following code is taken from dictobject.c showcasing the exact behavior.

Fun Fact: There is a comment next to the PyUnicode_InternInPlace function call that suggests if we really need to intern all the keys in all the dictionaries.

int
PyDict_SetItemString(PyObject *v, const char *key, PyObject *item)
{
    PyObject *kv;
    int err;
    kv = PyUnicode_FromString(key);
    if (kv == NULL)
        return -1;

    // Invoking String Interning on the key
    PyUnicode_InternInPlace(&kv); /* XXX Should we really? */

    err = PyDict_SetItem(v, kv, item);
    Py_DECREF(kv);
    return err;
}

Attributes of any Object

Objects in Python can have attributes that can be explicitly set using setattr function or are implicitly set as part of Class members or as pre-defined functions on data types. CPython interns all these attribute names, so as to make lookup blazing fast. Following is the snippet of the function PyObject_SetAttr responsible for setting a new attribute to a Python object, as defined in the file object.c.

int
PyObject_SetAttr(PyObject *v, PyObject *name, PyObject *value)
{

    ........

    PyUnicode_InternInPlace(&name);

    ........
}

Explicit Interning

Python also allows explicit String Interning through the function intern defined in sys module. When this function is invoked with any String object, the provided String is interned. Following is the code snippet from the file sysmodule.c that shows String Interning happening in sys_intern_impl.

static PyObject *
sys_intern_impl(PyObject *module, PyObject *s)
{

    ........

    if (PyUnicode_CheckExact(s)) {
        Py_INCREF(s);
        PyUnicode_InternInPlace(&s);
        return s;
    }

    ........
}

Only compile-time strings are interned. Strings that are specified during interpretation or compile-time are interned while dynamically created strings are not.

Strings having ASCII letters and underscores are interned. During compile time when string literals are observed for interning, CPython ensures that it only interns the literals matching the regular expression [a-zA-Z0-9_]* as they closely resemble Python identifiers.

Comments on how CPython does String Interning internally (as discussed in the Video) can be found in this PR.

References