Load a Notion Database in Pandas

Pandas is a popular data manipulation and analysis library in Python. While Notion can handle some simple sums and averages, you might want to use Pandas to do more advanced analytical work on data stored in a Notion database, like visualizations or even use data from Notion for machine learning.

Pandas loads data into a Numpy backed structure called a DataFrame, which makes column and row based analysis and aggregation on mixed data types much easier. Pandas can also be used as an intermediate format to output to Excel and CSV, however, if you just want to do conversion it may be better to do that directly.

Pandas offers a load_json function to load data from a JSON format, however, the format provided by Notion doesn’t match very closely with any of the formats Pandas understands, so we’ll do some pre-processing and use the Pandas DataFrame constructor.

Work with Pandas is often done interactively in a Jupyter notebook, so I’ve set up the full code example, with some bonus data analysis, as a notebook on GitHub Gists! Jupyter also comes with the Anaconda distribution, if that’s how you’ve gotten Pandas, and will also include requests, the other library we use in this tutorial.

In this tutorial, we’ll work with 3 classes - a NotionClient, handling the Notion queries, PandasConverter, handling conversion of the Notion responses into Pandas datatypes, and PandasLoader, which uses the client and converter to create a DataFrame out of the items in a database.

Set up the Notion Client #

As we’ve done in other tutorials, well make a NotionClient class to handle the Notion requests. In this case, we only need to query a database. The __init__ method handles setting up a session for Notion requests, while query_database allows us to make paginated requests against a Notion database.

import requests
from urllib.parse import urljoin

class NotionClient():
    def __init__(self, notion_key):
        self.notion_key = notion_key
        self.default_headers = {'Authorization': f"Bearer {self.notion_key}",
                                'Content-Type': 'application/json', 'Notion-Version': '2022-06-28'}
        self.session = requests.Session()
        self.session.headers.update(self.default_headers)
        self.NOTION_BASE_URL = "https://api.notion.com/v1/"
        
    def query_database(self, db_id, filter_object=None, sorts=None, start_cursor=None, page_size=None):
        db_url = urljoin(self.NOTION_BASE_URL, f"databases/{db_id}/query")
        params = {}
        if filter_object is not None:
            params["filter"] = filter_object
        if sorts is not None:
            params["sorts"] = sorts
        if start_cursor is not None:
            params["start_cursor"] = start_cursor
        if page_size is not None:
            params["page_size"] = page_size
            
        return self.session.post(db_url, json=params)

Convert the response to something Pandas can understand #

Next, we’ll handle pulling data out of a Notion response. For this tutorial, we’ll focus on a subset of Notion field types that are most likely to be used in data analysis: checkbox, date, number, rich_text, and title. We will process the Notion data into a list of records that we can use to create a Pandas DataFrame.

The Notion API response has a lot of nested layers of JSON data, but we want to pull out just the value in a way that makes sense to Pandas. The JSON structure in Notion defines a “type” field, and then has an value which may itself be a complex object specific to that type, with the type as a key, and sometimes that extends to multiple levels of nesting! The excerpt below shows a single Notion property for a property named “Name” which is the title.

"Name": {
    "id": "title",
    "type": "title",
    "title": [
      {
        "type": "text",
        "text": {
          "content": "Pynotion Blog Ideas",
          "link": null
        },
        "annotations": {
          "bold": false,
          "italic": false,
          "strikethrough": false,
          "underline": false,
          "code": false,
          "color": "default"
        },
        "plain_text": "Pynotion Blog Ideas",
        "href": null
      }
    ]
  }

We’ll define a class called PandasConverter class to munge the Notion JSON into an array of records we can load into Pandas,

class PandasConverter():

A text field in Notion, like a rich_text field or a title, consists of separately formatted pieces of text that need to be combined together to make a single string for Pandas (in most cases - it could be that formatting is relevant to your use case!). A method to do that looks like:

def get_text(self, text_object):
    text = ""
    text_type = text_object.get("type")
    for rt in text_object.get(text_type):
        text += rt.get("plain_text")
    return text

Notion dates can either be a single date, or a date interval defined by a start and an end date. Pandas has separate types for each of these. Notion dates are specified in ISO8601 format which is easily understood by Pandas as is, but we need to do a calculation to deal with date intervals.

def get_date(self, date_object):
    date_value = date_object.get("date")
    if date_value is not None:
        if date_value.get("end") is None:
            return date_value.get("start")
        else:            
            start = datetime.fromisoformat(date_value.get("start"))
            end = datetime.fromisoformat(date_value.get("end"))
            return end - start
    return None

Some other kinds of data in Notion, like Number or Checkbox properties, are much simpler to process, and we can simply retrieve the value directly. And if we have a property type we haven’t supported, we’ll get some JSON data in Pandas which still could be useful, but we’ll define an is_supported method to filter out properties we aren’t currently supporting. With the get_text and get_date functions above, we can get the values for the property types we plan to support using this function to check the type and call the appropriate function.

# At the top of PandasConverter
text_types =  ["rich_text", "title"]

# Further down
def is_supported(self, prop):
    if prop.get("type") in ["checkbox", "date", "number", "rich_text", "title"]:
        return True
    else:
        return False
        
def get_property_value(self, prop):
	  prop_type = prop.get("type")
	  if prop_type in self.text_types:
	      return self.get_text(prop)
	  elif prop_type == "date":
	      return self.get_date(prop)
	  else:
	      return prop.get(prop_type)

Next, we’ll parse a record out of an individual Notion item, but only using properties of fields we’ve chosen to support. You can of course extend this to other field types by defining methods to parse them. “properties” is a dictionary keyed by the name of a property, so we iterate over the names and retrieve the value into a record dictionary more friendly to Pandas.

def get_record(self, result):
    record = {}
    for name in result["properties"]:
        if self.is_supported(result["properties"][name]):
            record[name] = self.get_property_value(result["properties"][name])

    return record

Finally, we can create records out of an entire list of results:

def response_to_records(self, db_response):
    records = []
    for result in db_response["results"]:
        records.append(self.get_record(result))
    return records

That’s all of the functionality for the PandasConverter class, you can expand the section below for it all in one piece.

Full PandasConverter code #

from datetime import datetime
class PandasConverter():
    text_types =  ["rich_text", "title"]    
    
    def response_to_records(self, db_response):
        records = []
        for result in db_response["results"]:
            records.append(self.get_record(result))
        return records    

    def get_record(self, result):
        record = {}
        for name in result["properties"]:
            if self.is_supported(result["properties"][name]):
                record[name] = self.get_property_value(result["properties"][name])

        return record
    
    def is_supported(self, prop):
        if prop.get("type") in ["checkbox", "date", "number", "rich_text", "title"]:
            return True
        else:
            return False
        
    def get_property_value(self, prop):
        prop_type = prop.get("type")
        if prop_type in self.text_types:
            return self.get_text(prop)
        elif prop_type == "date":
            return self.get_date(prop)
        else:
            return prop.get(prop_type)
       
    def get_text(self, text_object):
        text = ""
        text_type = text_object.get("type")
        for rt in text_object.get(text_type):
            text += rt.get("plain_text")
        return text
            
    def get_date(self, date_object):
        date_value = date_object.get("date")
        if date_value is not None:
            if date_value.get("end") is None:
                return date_value.get("start")
            else:            
                start = datetime.fromisoformat(date_value.get("start"))
                end = datetime.fromisoformat(date_value.get("end"))
                return end - start
        return None

Load the data from Notion into a Pandas DataFrame #

PandasLoader combines the NotionClient and PandasConverter to handle retrieving all items in a database using paginated requests using the has_more and next_cursor properties of the Notion response. We create a list of records, append to it with each page of results, and return a Pandas DataFrame at the end. To show progress for large datasets we print out the page number.

import pandas as pd
class PandasLoader():
    def __init__(self, notion_client, pandas_converter):
        self.notion_client = notion_client
        self.converter = pandas_converter
        
    def load_db(self, db_id):
        page_count = 1
        print(f"Loading page {page_count}")
        db_response = self.notion_client.query_database(db_id)
        records = []
        if db_response.ok:
            db_response_obj = db_response.json()
            records.extend(self.converter.response_to_records(db_response_obj))
            
            while db_response_obj.get("has_more"):
                page_count += 1
                print(f"Loading page {page_count}")
                start_cursor = db_response_obj.get("next_cursor")
                db_response = self.notion_client.query_database(db_id, start_cursor=start_cursor)
                if db_response.ok:
                    db_response_obj = db_response.json()
                    records.extend(self.converter.response_to_records(db_response_obj))
        return pd.DataFrame(records)

Putting it all together to get your data in a Data Frame #

To put it all together and make a DataFrame, you need to get a Notion API key and connect the integration to a database, which is covered in the post on Getting Started with the Notion API Using Python. Here I’ve saved the Notion Key as an environment variable to keep it secret. Add in the ID of a database you’d like to load in the code below and get a DataFrame!

import os
client = NotionClient(os.environ.get("NOTION_KEY"))
converter = PandasConverter()
loader = PandasLoader(client, converter)
df = loader.load_db("YOUR_DATABASE_ID")
df

Now you’ve got your Notion data loaded into Pandas, you can do whatever you’d like with the Pandas data - plot it, use it for machine learning, do some statistical analysis, export it to Excel, etc. The full example notebook on GitHub Gists shows an example data analysis and plotting using a Kaggle dataset about Ted Talks.