Python Html To Markdown

Introduction

Split the file into YAML and markdown parts. Extract the meta-data from the YAML. Convert the markdown to an HTML fragment (the page content). Combine the meta-data and page content with the HTML template to create a complete HTML file.

As part of managing the PB Python newsletter, I wanted to develop a simple way to write emails once using plain text and turn them into responsive HTML emails for the newsletter. In addition, I needed to maintain a static archive page on the blog that links to the content of each newsletter. This article shows how to use python tools to transform a markdown file into a responsive HTML email suitable for a newsletter as well as a standalone page integrated into a pelican blog.

Rationale

I am a firm believer in having access to all of the content I create in a simple text format. That is part of the reason why I use pelican for the blog and write all content in restructured text. I also believe in hosting the blog using static HTML so it is fast for readers and simple to distribute. Since I spend a lot of time creating content, I want to make sure I can easily transform it into another format if needed. Plain text files are the best format for my needs.

As I wrote in my previous post, Mailchimp was getting cost prohibitive. In addition, I did not like playing around with formatting emails. I want to focus on content and turning it into a clean and responsive email - not working with an online email editor. I also want the newsletter archives available for people to view and search in a more integrated way with the blog.

One thing that Mailchimp does well is that it provides an archive of emails and the ability for the owner to download them in raw text. However, once you cancel your account, those archives will go away. It’s also not very search engine friendly so it’s hard to reference back to it and expose the content to others not subscribed to the newsletter.

With all that in mind, here is the high level process I had in mind:

HTML Email

Before I go through the python scripts, here’s some background on developing responsive HTML-based emails. Unfortunately, building a template that works well in all email clients is not easy. I naively assumed that the tips and tricks that work for a web site would work in an HTML email. Unfortunately that is not the case. The best information I could find is that you need to use HTML tables to format messages so they will look acceptable in all the email clients. Yuck. I feel like I’m back in Geocities.

This is one of the benefits that email vendors like Mailchimp provide. They will go through all the hard work of figuring out how to make templates that look good everywhere. For some this makes complete sense. For my simple needs, it was overkill. Your mileage may vary.

Along the way, I found several resources that I leveraged for portions of my final solution. Here they are for reference:

Building responsive email templates - Really useful templates that served as the basis for the final template.
Free Responsive Simple HTML Template - Another good set of simple templates.
Send email written in Markdown - A python repo that had a lot of good concepts for building the markdown email.

Besides having to use HTML tables, I learned that it is recommended that all the CSS be inlined in the email. In other words, the email needs to have all the styling included in the tags using style:

Once again this is very old school web and would be really painful if not for tools that will do the inlining for you. I used the excellent premailer library to take an embedded CSS stylesheet and inline it with the rest of the HTML.

You can find a full HTML template and all the code on github but here is a simple summary for reference. Please use the github version since this one is severely simplified and likely won’t work as is:

This is a jinja template and you will notice that there is a place for email_content and title. The next step in the process is to render a markdown text file into HTML and place that HTML snippet into a template.

Markdown Article

Now that we know how we want the HTML to look, let’s create a markdown file. The only twist with this solution is that I want to create one markdown file that can be rendered in pelican and used for the HTML email.

Here is what a simple markdown file (sample_doc.md) looks like that will work with pelican:

The required input file uses standard markdown. The one tricky aspect is that the top 5 lines contain meta-data that pelican needs to make sure the correct url and templates are used when creating the output. Our final script will need to remove them so that they do not get rendered into the newsletter email. If you are not trying to incorporate this into your blog, you can remove these lines.

If you are interested in incorporating this in your pelican blog, here is how my content is structured:

All of the newsletter markdown files are stored in the newsletter directory and the blog posts are stored in the articles directory.

The final configuration I had to make in the pelicanconf.py file was to make sure the paths were set up correctly:
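The exact settings are not reproduced here. As a hedged sketch: PATH, ARTICLE_PATHS, and PAGE_PATHS are real pelican settings, but the values below are assumptions based on the directory names mentioned above, and treating the newsletters as pelican pages is also an assumption.

```python
# pelicanconf.py (excerpt) - values are assumptions, adjust to your layout
PATH = "content"               # root folder that holds all source files
ARTICLE_PATHS = ["articles"]   # blog posts
PAGE_PATHS = ["newsletter"]    # newsletter markdown files (assumed to be pages)
```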

Now the blog is properly configured to render one of the newsletters.

Python code

Now that we have the HTML template and the markdown document, we need a short python script to pull it all together. I will be using the following libraries so make sure they are all installed:

python-markdown2 - Turn raw markdown into HTML
jinja2 - Template engine to generate HTML
premailer - Inline CSS
BeautifulSoup - Clean up the HTML. This is optional, but I show how to use it in case you choose to.

Additionally, make sure you are using python3 so you have access to pathlib and argparse.

In order to keep the article compact, I am only including the key components. Please look at the github repo for an approach that is a proper python standalone program that can take arguments from the command line.

The first step, import everything:

Set up the input files and the output HTML file:

Please refer to the pathlib article if you are not familiar with how or why to use it.
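As a rough sketch of that setup (the template and output file names are assumptions; sample_doc.md is the example file named earlier):

```python
from pathlib import Path

# Input markdown file, expected in the current working directory.
in_doc = Path.cwd() / "sample_doc.md"

# Assumed name of the jinja template that wraps the email content.
template_file = "template.html"

# Hypothetical naming scheme: reuse the input file's stem for the output.
out_file = Path.cwd() / f"{in_doc.stem}_email.html"

print(out_file.name)
```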

Now that the files are established, we need to read in the markdown file and parse out the header meta-data:

Using readlines to read the file ensures that each line in the file is stored in a list. This approach works for our small file but could be problematic if you had a massive file that you did not want to read into memory at once. For an email newsletter you should be ok with using readlines.

Here is what all_content[0:6] looks like:
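The original listing and its output are not shown here. A minimal sketch, using a hypothetical sample file (only the title value comes from this article; the other meta-data values are made up), reads the file with readlines and prints that slice:

```python
import tempfile
from pathlib import Path

# Hypothetical sample: five pelican meta-data lines, a blank line, then the body.
sample = """Title: PB Python - Newsletter Number 6
Date: 2019-12-09 10:04
Template: newsletter
URL: newsletter/issue-6.html
save_as: newsletter/issue-6.html

Welcome to this month's newsletter.

## News
"""

in_doc = Path(tempfile.mkdtemp()) / "sample_doc.md"
in_doc.write_text(sample, encoding="utf-8")

with open(in_doc, encoding="utf-8") as f:
    all_content = f.readlines()  # one list entry per line, newline included

# Show the header block exactly as it is stored in the list.
for line in all_content[0:6]:
    print(repr(line))
```

Each entry keeps its trailing newline, which is why the later steps strip the title and can join the body lines back together unchanged.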

We can clean up the title line for insertion into the template:

This renders the title PB Python - Newsletter Number 6.

The final parsing step is to get the body into a single list without the header:
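A minimal sketch of both parsing steps (title cleanup and header removal), assuming the header is five meta-data lines followed by one blank line. The variable name body_content and all meta-data values except the title are hypothetical:

```python
# Stand-in for the list returned by readlines().
all_content = [
    "Title: PB Python - Newsletter Number 6\n",
    "Date: 2019-12-09 10:04\n",
    "Template: newsletter\n",
    "URL: newsletter/issue-6.html\n",
    "save_as: newsletter/issue-6.html\n",
    "\n",
    "Welcome to this month's newsletter.\n",
]

# Split on the first ": " only, so a colon inside the title survives.
title = all_content[0].split(": ", 1)[1].strip()

# Everything after the header block is the markdown body.
body_content = all_content[6:]

print(title)
print(body_content)
```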

Convert the raw markdown into a simple HTML string:

Now that the HTML is ready, we need to insert it into our jinja template:
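A hedged sketch of the rendering step with jinja2. The tiny in-memory template stands in for the full responsive email template, and the template name is an assumption; only the email_content and title placeholders come from the article:

```python
from jinja2 import DictLoader, Environment

# A tiny stand-in template; the real one is a full responsive HTML email.
templates = {
    "template.html": (
        "<html><head><title>{{ title }}</title></head>"
        "<body>{{ email_content }}</body></html>"
    )
}

# Autoescaping is off by default, so the HTML snippet is inserted as-is.
env = Environment(loader=DictLoader(templates))
template = env.get_template("template.html")

raw_html = template.render(
    title="PB Python - Newsletter Number 6",
    email_content="<p>Welcome to this month's newsletter.</p>",
)
print(raw_html)
```

With a template stored on disk, a FileSystemLoader pointing at the template directory would take the place of the DictLoader.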

At this point, raw_html has a fully formed HTML version of the newsletter. We need to use premailer’s transform to get the CSS inlined. I am also using BeautifulSoup to do some cleaning up and formatting of the HTML. This is purely aesthetic but I think it’s simple enough to do so I am including it:

The final step is to make sure that the unsubscribe link does not get mangled. Depending on your email provider, you may not need to do this:
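The article does not show how the link gets mangled. One plausible case, sketched here as an assumption, is that the HTML processing percent-encodes the curly braces of a merge tag inside an href. The placeholder name {{UnsubscribeURL}} is hypothetical, so use whatever tag your provider expects:

```python
from urllib.parse import quote

# Assumed placeholder name; providers differ in the merge tag they expect.
placeholder = "{{UnsubscribeURL}}"

# What the tag looks like once its braces are percent-encoded.
mangled = quote(placeholder)

# Stand-in for the processed HTML containing the mangled link.
pretty_html = f'<a href="{mangled}">Unsubscribe</a>'

# Restore the literal merge tag so the provider can substitute the real URL.
final_html = pretty_html.replace(mangled, placeholder)
print(final_html)
```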

Here is an example of the final email file:

You should be able to copy and paste the raw HTML into your email marketing campaign and be good to go. In addition, this file will render properly in pelican. See this page for some past examples.

Summary

Markdown is a simple text format that can be parsed and turned into HTML using various python tools. In this case, the markdown file can be combined with a responsive HTML email template to simplify the process of generating content for newsletters. The added bonus is that the content can be included in a static blog so that it is searchable and easily available to your readers.

This solution is not limited to just building emails. Now that newer versions of pandas will include a native to_markdown method, this general approach could be extended to other uses. Using these principles you can build fairly robust reports and documents using markdown then incorporate the dataframe output into the final results. If there is interest in an example, let me know in the comments.

Python Html To Markdown Pdf

Introduction

A few months ago, I wanted to serve my own blog instead of using websites like Medium. It was a pretty basic blog and I wrote all my articles in HTML. However, one day, I came across the idea of writing my own markdown to HTML generator, which would eventually allow me to write my articles in markdown. Furthermore, extending it with features like estimated reading time would be easier. Long story short, I implemented my own markdown to HTML generator and I really like it!

In this article series, I want to show you how you can build your own markdown to HTML generator. The series consists of three parts:

Part 1 (current article) presents the implementation of the whole generation pipeline.
Part 2 extends the implemented pipeline by a module used to compute the estimated reading time for a given article.
Part 3 demonstrates how you can use the pipeline to produce your own RSS feeds.

The code used in all three parts is available on GitHub.

Note: The idea of a markdown to HTML generator for my articles is based on an implementation Anthony Shaw uses to generate his articles.

Project setup

In order to follow the current article, you need to install a few packages. We put them into a requirements.txt file.

Markdown is a package that allows you to transform your markdown code into HTML. We use Flask to serve the static files afterwards.

But before you install them, create a virtual environment to not mess up your Python installation:

Once activated, you can install the dependencies from the requirements.txt file via pip.

Great! Let’s create a few directories to better organize our code. First, we create a directory app. This directory contains our Flask app serving the blog. All subsequent directories will be created inside the app directory. Second, we create a directory called posts. This directory contains the markdown files we want to convert into HTML files. Next, we create a directory templates, which will contain the templates we serve later using Flask. Inside the templates directory, we create two more directories:

posts contains the resulting HTML files, which correspond to the ones in the posts directory in the application’s root.
shared contains HTML files which are used across many files.

Furthermore, we create a directory called services. The directory will contain modules we use in our Flask application or to generate certain things for it. Last but not least, a directory called static is created with two subdirectories, images and css. Custom CSS files and the thumbnails for the posts will be stored here.

Your final project structure should look like this:
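The original directory listing is not reproduced here. As a sketch reconstructed from the description above (not the author’s listing), the layout could be created with pathlib. The helper name create_layout is hypothetical:

```python
import tempfile
from pathlib import Path

# Directories named in the text, relative to the project root.
DIRECTORIES = [
    "app/posts",
    "app/templates/posts",
    "app/templates/shared",
    "app/services",
    "app/static/images",
    "app/static/css",
]


def create_layout(root: Path) -> None:
    """Create the directory skeleton below root (safe to run repeatedly)."""
    for name in DIRECTORIES:
        (root / name).mkdir(parents=True, exist_ok=True)


# Demonstrated in a throwaway directory; pass Path.cwd() for a real project.
project_root = Path(tempfile.mkdtemp())
create_layout(project_root)

for path in sorted(p for p in project_root.rglob("*") if p.is_dir()):
    print(path.relative_to(project_root).as_posix())
```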

Awesome! We finished the general project setup. Let’s head over to the Flask setup.

Flask setup

Routing

We already installed Flask in the last section. However, we still need a Python file which defines the endpoints the users can access. Create a new file in your app directory called main.py and copy the following content into it.

The file defines a basic Flask application with two endpoints. The first endpoint, which the user can access using the / route, returns the index page, where all posts are listed.

The second endpoint is a more generic one. It accepts a post’s name and returns the corresponding HTML file.
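A minimal sketch of such a main.py, assuming the generated pages live in templates/index.html and templates/posts/. The route pattern and function names are assumptions, not necessarily the author’s:

```python
from flask import Flask, render_template

app = Flask(__name__)


@app.route("/")
def index():
    # Serves the generated index page that lists all posts.
    return render_template("index.html")


@app.route("/posts/<string:post_name>")
def post(post_name: str):
    # Serves templates/posts/<post_name>.html produced by the converter.
    return render_template(f"posts/{post_name}.html")
```

The generic route maps a URL such as /posts/python-for-beginners to the template posts/python-for-beginners.html, so a new post needs no additional routing code.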

Next, we turn the app directory into a Python package by adding an __init__.py file to it. This file is empty. If you are on a UNIX machine, you can run the following command from your project’s root directory:

Templates

Now, we create two templates, index.html and layout.html. We store both in the templates/shared directory. The layout.html template will be used for a single blog entry, whereas the index.html template is used to generate the index page from where we can access each post. Let’s start with the index.html template.

It is a basic HTML file, where we have two meta-tags, a title, and two style sheets. Notice that we use a remote style sheet and a local one. The remote style sheet is utilized to enable the Bootstrap [¹] classes. The second one is for custom styles. We define them later.

The body of the HTML file encloses a single container, which contains Jinja2 [²] logic to generate a Bootstrap card [³] for each post. Did you notice that we do not access the values directly based on the variable names but need to add [0] to them? This is because the parsed metadata from the posts are lists. In essence, each metadata element is a list of exactly one element. We will have a look at it later. So far, so good. Let’s take a look at the layout.html template.

As you can see, it is a little bit shorter and simpler than the previous one. The head of the file is pretty similar to the index.html file except for the fact that we have a different title. Of course, we could use a common template for both, but I do not want to make things more complex at this point.

The container in the body defines only an h1-tag. Afterwards, the content we supply to the template is inserted and rendered.

Styling

As promised in the last section, we will have a look at the custom CSS file called style.css. We locate the file in static/css and customize our page as needed. Here is the content we will use for our basic example:

I do not like the default appearance of blockquotes in Bootstrap, so we add a bit more spacing and a border on the left. Additionally, the margin at the bottom of the paragraph inside the blockquote is removed. With it, it looks pretty unnatural.

Last but not least, the padding on the left and on the right of the cards is removed. With the additional padding on both sides, the thumbnails are not aligned properly, so we remove it here.

So far so good. We finished everything concerning the Flask setup. Let’s start writing some posts!

Writing the posts

As the title promises, you can write your posts in markdown - yeah! There is nothing else you need to take care of while writing your posts except that they need to be valid markdown.

After finishing the article, we need to add some metadata to the post. This metadata is added before the actual article and is separated from it by three dashes ---. Here is an extract of an example post (post1.md):

Note: You can find the complete sample article in the GitHub repository at app/posts/post1.md.

In our case the metadata consists of a title, subtitle, category, the date it will be/was published and the path to the corresponding thumbnail for the card in index.html.

We used the metadata in the HTML files, do you remember? The metadata specification needs to be valid YAML. In the example at hand, the key is followed by a colon and the value. In the end, the value after the colon is the first and only element in a list. That is why we accessed the values by the index-operator [] in the templates.
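A small sketch of why the index is needed, assuming the header is parsed by the meta extension of the Markdown package. The post content and metadata values below are made up for illustration:

```python
import markdown

# Hypothetical post: metadata between --- lines, then the article body.
text = """---
title: Python For Beginners
subtitle: A gentle start
published: 2020-03-14
---

# Hello

Some *markdown* content.
"""

md = markdown.Markdown(extensions=["meta"])
html = md.convert(text)

# Every metadata value is stored as a list, even when there is only one.
print(md.Meta["title"])
print(md.Meta["title"][0])
```

The metadata is only available on the instance after convert() has run, and the keys are lowercased by the extension.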

Let’s suppose we finished writing our articles. Before we can move on to the conversion part, there is one thing left to do: We need thumbnails for our posts! To make things easier, just pick a random picture you have on your computer or from the web, name it placeholder.jpg and put it in the static/images directory. The metadata of the two posts in the GitHub repository contain an image key-value pair with placeholder.jpg as value.

Note: In the GitHub repository you can find the two sample articles I am referring to.

Markdown to HTML converter

Finally, we can start implementing the markdown to HTML converter. To do so, we use the third-party package Markdown we installed at the beginning. Let’s start by creating a new module in which our conversion service will live. Hence, we create a new file named converter.py in our services directory. We go through the whole script step by step. You can view the whole script at once in the GitHub repository.

First, we import everything we need and create a few constants:

ROOT points to the root of our project. Hence, it is the directory which contains the app directory.
POSTS_DIR is the path to the posts written in markdown.
TEMPLATE_DIR points to the templates directory.
BLOG_TEMPLATE_FILE stores the path to the layout.html file.
INDEX_TEMPLATE_FILE is the path to the index.html.
BASE_URL is the base url of our project, e.g. https://florian-dahlitz.de. By default (if it is not provided via the environment variable DOMAIN) the value is http://0.0.0.0:5000.
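The constants described above could be defined roughly as follows. This is a sketch under two assumptions: the script is started from the project root, and the template paths are given relative to the templates directory:

```python
import os
from pathlib import Path

# Assumption: the script is started from the project root.
ROOT = Path.cwd()
POSTS_DIR = ROOT / "app" / "posts"
TEMPLATE_DIR = ROOT / "app" / "templates"

# Template names relative to TEMPLATE_DIR (assumed layout: templates/shared/).
BLOG_TEMPLATE_FILE = "shared/layout.html"
INDEX_TEMPLATE_FILE = "shared/index.html"

# Fall back to the local development address when DOMAIN is not set.
BASE_URL = os.environ.get("DOMAIN", "http://0.0.0.0:5000")

print(POSTS_DIR, TEMPLATE_DIR, BASE_URL)
```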

Next, we create a new function called generate_entries(). It is the only function we define in order to convert the posts.

Inside the function, we start by getting the paths of all markdown files in the POSTS_DIR directory. pathlib’s awesome glob() function helps us with it.
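A short sketch of that lookup, run against a throwaway directory that stands in for POSTS_DIR. The file names are placeholders:

```python
import tempfile
from pathlib import Path

# Throwaway stand-in for POSTS_DIR with two posts and one unrelated file.
posts_dir = Path(tempfile.mkdtemp())
for name in ("post1.md", "post2.md", "notes.txt"):
    (posts_dir / name).write_text("# placeholder\n", encoding="utf-8")

# glob() yields matching paths lazily; sorting makes the order predictable.
posts = sorted(posts_dir.glob("*.md"))
print([p.name for p in posts])
```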

Furthermore, we define the extensions we want the Markdown package to use. All the extensions used in this article come with the installation of it by default.

Note: You can find out more about the extensions in the documentation [⁴].

Additionally, we instantiate a new file loader and create an environment used while converting the articles. Subsequently, an empty list called all_posts is created. This list will contain all posts after we processed them. Now, we enter the for-loop and iterate over all posts we found in POSTS_DIR.

We start the for-loop by printing the path to the post we are currently processing. This is especially helpful if something breaks. Then we know which post’s conversion failed.

Next, we create the part of the url right after the base url. Let’s say we have an article with the heading “Python For Beginners”. We store the post in a file called python-for-beginners.md, so the resulting url will be http://0.0.0.0:5000/posts/python-for-beginners.

The variable url_html stores the same string as url except that we add .html at the end. We use this variable to define another one called target_file. The variable points to the location where the corresponding HTML file will be stored.
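A sketch of how these three variables might be derived from the path of a post. The exact expressions are assumptions; the variable names and the example url come from the text:

```python
from pathlib import Path

TEMPLATE_DIR = Path("app") / "templates"  # relative here to keep the demo short
BASE_URL = "http://0.0.0.0:5000"

post = Path("app/posts/python-for-beginners.md")

# The file name without its extension becomes the last url segment.
url = f"/posts/{post.stem}"
url_html = f"{url}.html"

# Drop the leading slash so the path is joined below TEMPLATE_DIR.
target_file = TEMPLATE_DIR / url_html.lstrip("/")

print(BASE_URL + url)
print(target_file.as_posix())
```

Without the lstrip call, joining an absolute-looking path would discard TEMPLATE_DIR and point at the filesystem root instead.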

Last but not least, we define a variable _md. It represents an instance of the markdown.Markdown class, which is used to convert the markdown code to HTML. You might ask yourself why we did not create this instance before the for-loop instead of inside it. Of course, for our little example here, it would make no difference (except a slightly shorter execution time). However, if you use extensions such as footnotes, it is necessary to create a new instance for each post, because footnotes, once added, are not removed from the instance. Consequently, if your first post uses some footnotes, all other posts will have the same footnotes even though you did not define them explicitly.

Let’s move on to the first with-block in the for-loop.

In essence, the with-block opens the current post and reads the content of it into the variable content. Afterwards, _md.convert() is invoked to convert the content written in markdown into HTML. Subsequently, the environment env is used to render the resulting HTML code based on the supplied template BLOG_TEMPLATE_FILE (which is layout.html if you remember).

The second with-block is used to write the document created in the first with-block to the target_file.

The following three lines of code take the publishing date (published) from the metadata, bring it into the correct format (RFC 2822) and assign it back to the metadata of the post. Furthermore, the resulting post_dict is added to the all_posts list.
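A hedged sketch of this step using only the standard library. It assumes the date is written as YYYY-MM-DD in the metadata and keeps the parsed datetime under a hypothetical date key so it stays available for sorting:

```python
from datetime import datetime
from email.utils import format_datetime

# Metadata values arrive as one-element lists, as described earlier.
meta = {"title": ["Python For Beginners"], "published": ["2020-03-14"]}

# Parse the ISO-style date, then format it according to RFC 2822.
post_date = datetime.strptime(meta["published"][0], "%Y-%m-%d")
post_dict = dict(meta, date=post_date, published=format_datetime(post_date))

all_posts = []
all_posts.append(post_dict)
print(post_dict["published"])
```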

We are now outside the for-loop, hence, we iterated over all posts we found in POSTS_DIR and processed them. Let’s have a look at the three remaining lines of code in the generate_entries() function.

We sort the posts by date but reversed, so the latest posts are shown first. Subsequently, we write the posts to a newly created index.html file in the templates directory. Do not mistake this index.html for the one in the templates/shared directory. The one in the templates/shared directory is the template, this one is the generated one we want to serve using Flask.
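The sorting step can be sketched like this, assuming each post dictionary carries a datetime under a date key (the key name and the sample posts are assumptions):

```python
from datetime import datetime

all_posts = [
    {"title": "Older post", "date": datetime(2020, 1, 5)},
    {"title": "Newest post", "date": datetime(2020, 3, 14)},
    {"title": "Middle post", "date": datetime(2020, 2, 1)},
]

# reverse=True puts the most recent date first.
all_posts.sort(key=lambda post: post["date"], reverse=True)
print([post["title"] for post in all_posts])
```

Sorting on datetime objects rather than on the formatted RFC 2822 strings keeps the order chronological, since those strings start with the weekday name.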

The last thing we add to the script is the following if-statement after the function generate_entries().

This means if we execute the file via the command-line, it calls the function generate_entries().

Awesome, we finished the converter.py script! Let’s try it out by running the following command from your project’s root directory:

You should see some output printing the paths of the files it converted. Assuming you wrote two articles or used the two posts from the GitHub repository, you should find three newly created files in the templates directory. First, the index.html, which is directly located in the templates directory, and secondly, two HTML files in the templates/posts directory, which correspond to the markdown files.

You can view them in the browser by starting your Flask application and going to http://0.0.0.0:5000.

Summary

Awesome, you made it through the first part of the series! In this article, you have learned how to utilize the Markdown package to create your own markdown to HTML generator. You implemented the whole pipeline, which is highly extensible, as you will see in the upcoming posts.

I hope you enjoyed reading the article. Make sure to share it with your friends and colleagues. If you have not already, consider following me on Twitter, where I am @DahlitzF, or subscribing to my newsletter so you will not miss any upcoming article. Stay curious and keep coding!

References

[¹] Bootstrap
[²] Primer on Jinja Templating
[³] Bootstrap Card
[⁴] Python-Markdown Extensions
