
In which Ernest reverse engineers CSS with Clojure

About 23 years ago I gave a talk to the Greater London Linux Users Group on XHTML and CSS. I didn't really know very much about it, but I'd been doing some experimentation and thought it was interesting, so when volunteers for talks were requested, I offered to share what I had learned.

I had completely forgotten about this, and it was only after writing the previous post that I remembered, so it's not quite true to say that I know absolutely nothing about CSS. But I still know almost nothing, and what little I ever knew, I have never used since I played with it.

It's also not strictly true that exporting my scribblings from org mode means there's no CSS. In fact that's just absolutely incorrect - I just hadn't thought about it.

In fact, there's a gigantic chunk of CSS embedded in the exported HTML, and I don't really know what it does or exactly how it works. So let's see if we can work it out.

What I remember about CSS

CSS is a set of rules for determining how content should be displayed

The fundamental distinction that I remembered liking when I first learned about HTML and CSS was the idea of separating content from how that content is displayed. As a person who had grown up with primitive word processing packages, and gradually seen the evolution of WYSIWYG, resetting my brain to think this way was a powerful paradigm shift.

CSS is what describes how the content looks. CSS stands for "cascading style sheets" - the style bit being obvious. The word "sheet" is, I think, a nod to the paper publishing industry, where the guidelines on how to style something might literally be on one or more sheets of paper. The "cascading" bit means that CSS is explicitly a layered concept. You might not realise it, but your web browser has opinions on how HTML should be presented - and not all browsers have the same opinion. If a user takes advantage of accessibility features on their computer - for example if they are visually impaired - those settings have opinions on style too. The designer of the website obviously has opinions as well. So the way the page actually looks is a combination of multiple inputs, with rules of precedence.
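As a rough mental model only, the layering can be sketched as merging maps in order of increasing precedence. The layer names and values below are made up for illustration, and the real cascade also weighs things like specificity, !important and source order, which this sketch ignores.

```clojure
;; A deliberately simplified model of the cascade: each layer is a map of
;; property -> value, and later layers override earlier ones.
;; All names and values here are hypothetical.
(def browser-defaults {"color" "black" "font-size" "16px" "margin" "8px"})
(def user-preferences {"font-size" "20px"})
(def site-stylesheet  {"color" "navy" "margin" "0"})

(defn cascade
  "Combine style layers, given in order of increasing precedence."
  [& layers]
  (apply merge layers))

(cascade browser-defaults user-preferences site-stylesheet)
;; => {"color" "navy", "font-size" "20px", "margin" "0"}
```

The user's font size survives because the site's stylesheet says nothing about it, while the site's colour and margin win over the browser's defaults.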

So CSS is a set of rules for determining how content should be displayed.

CSS has selectors and declarations

When I did my talk on CSS I also talked about XHTML. In those days the adoption of XHTML represented a radical shift in thinking about how content was written. In the late 1990s and early 2000s, there was not such an emphasis on unambiguous and cleanly structured code. It was quite common, and quite workable, to create websites in HTML that looked ok, but which were not really consistent or standards-compliant. XHTML took the formal structure of XML and the standards of the World Wide Web Consortium (W3C) and joined them together. This was significant for CSS because it meant there was an opinionated, validatable, consistent, and unambiguous way to refer to the specific parts of the content that you wanted to style. Of course these days we use HTML5 rather than XHTML, but the idea of explicit and clear semantics and thus targetable elements of content is a necessary pre-requisite for CSS to work well.

The way CSS works is that it makes a declaration - ie it declares a policy, a set of rules, with respect to a selector - ie a part of the written content.

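The grammar of CSS is pretty simple - the selector describes the HTML element we wish to style. There are various ways to define a selector, from one simple and specific element to complex combinations of elements, classes and IDs. The declaration is a set of one or more rules, each made up of a property, such as color, padding, or position, and one or more values. (Strictly speaking, the CSS specification calls each property-value pair a declaration, and the selector together with its block of declarations a rule, but I'll stick with my looser usage here.)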
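To make the pieces concrete, here is a minimal sketch that pulls a one-line rule apart into its selector, property and value. The function name parse-rule is my own invention, and it assumes exactly one property-value pair between the braces, so it is an illustration rather than a CSS parser.

```clojure
(require '[clojure.string :as str])

(defn parse-rule
  "Split a rule of the form `selector { property: value; }` into its parts.
  Hypothetical helper: assumes a single property-value pair."
  [rule]
  (let [[_ selector body] (re-matches #"\s*([^{]+?)\s*\{([^}]*)\}\s*" rule)
        ;; drop the trailing semicolon, then split on the first colon only
        [property value]  (map str/trim
                               (str/split (str/replace body ";" "") #":" 2))]
    {:selector selector :property property :value value}))

(parse-rule ".equation { vertical-align: middle; }")
;; => {:selector ".equation", :property "vertical-align", :value "middle"}
```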

Investigating the org-mode CSS

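It turns out that there are roughly 128 declarations and 154 rules in the default stylesheet that org embedded in my html. These are line counts: grep -c counts matching lines rather than occurrences, and a colon can also turn up in a selector such as pre.src-haskell:before, so both numbers are estimates.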

✓ ~/Developer/ernestscribbler [main]$ grep -c "{" org.css
128
✓ ~/Developer/ernestscribbler [main]$ grep -c ":" org.css
154

I wondered how many unique properties those rules used, and what they were. As I mentioned, the grammar is quite straightforward. We know that anything inside a pair of braces is a set of rules, and that each rule is a key-value pair, the key of which is a single (possibly hyphenated) word, terminated with a colon. For example:

.equation { vertical-align: middle; }

We can use that knowledge to extract the unique properties. Clojure to the rescue:

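(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn extract-properties
  "Returns the sorted, distinct property names used in a CSS file."
  [file]
  (->> (slurp (io/file file))                     ; whole file as one string
       (re-seq #"\{([^}]*)\}")                    ; every {...} block
       (map second)                               ; keep the text between the braces
       (mapcat #(re-seq #"([a-zA-Z-]+)\s*:" %))   ; every "property:" in every block
       (map second)                               ; keep just the property name
       (distinct)
       (sort)))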

Clojure is like a bunch of unix pipelines, only much better. This isn't much different from a series of grep, sed, and awk statements, separated by pipes. The ->> is the thread-last macro - it basically means: make me a pipeline with these steps.

Instead of nesting a series of functions inside each other, in a cacophony of parentheses, it takes an initial value and passes that value as the last argument to the first function, and then uses the result of that function as the last argument to the following function, and so on. This is a very idiomatic way to define a clear, linear flow, similar to a shell pipeline, where we can see how data is transformed step by step.
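A small sketch of the difference, using a made-up vector of property names: the same computation written first as nested calls and then with ->>.

```clojure
;; Nested: read from the inside out.
(def nested
  (sort (distinct (map count ["color" "margin" "color" "top"]))))

;; Threaded: read from top to bottom. Each step receives the previous
;; result as its last argument.
(def threaded
  (->> ["color" "margin" "color" "top"]
       (map count)
       (distinct)
       (sort)))

threaded
;; => (3 5 6)

(= nested threaded)
;; => true
```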

Each step in the pipeline corresponds to a transformation: first we isolate the declaration blocks, then we extract the properties, and finally we remove duplicates and sort the results.

Let's go through it step by step.

Step one

(slurp (io/file file))

Slurp is a function that returns the entire contents of a file as a single string.
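A quick way to try this out without touching a real stylesheet is to write a scratch file and read it back. The temporary file and its one-line content are stand-ins for illustration, not part of org.css as a whole.

```clojure
(require '[clojure.java.io :as io])

;; Write a tiny stylesheet to a temporary file, then read it back in one go.
(def tmp (java.io.File/createTempFile "example" ".css"))
(spit tmp ".equation { vertical-align: middle; }\n")

(def contents (slurp (io/file tmp)))

contents
;; => ".equation { vertical-align: middle; }\n"
```

slurp will also accept a plain path string, so wrapping the name in io/file is optional.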

Step two

(re-seq #"\{([^}]*)\}")

re-seq is a function that applies a regular expression to a string, and returns a lazy sequence of matches. In our case the regular expression says: match a literal opening brace, then zero or more characters that are not a closing brace, then a literal closing brace, and capture the part between the braces. Because the expression contains a capture group, each match is actually returned as a pair: the whole match, followed by the capture.

If our input is pre.src-haskell:before { content: 'Haskell'; }, the output will be (["{ content: 'Haskell'; }" " content: 'Haskell'; "])

Our input is one huge string, and re-seq goes through that string, scanning from the beginning to end, looking for all matches against the regular expression.
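To see the scanning behaviour on something smaller than the whole stylesheet, here is a sketch with a two-rule string built from the examples above.

```clojure
(def css
  (str "pre.src-haskell:before { content: 'Haskell'; }\n"
       ".equation { vertical-align: middle; }\n"))

;; One [whole-match capture] vector per pair of braces found.
(def blocks (re-seq #"\{([^}]*)\}" css))

blocks
;; => (["{ content: 'Haskell'; }" " content: 'Haskell'; "]
;;     ["{ vertical-align: middle; }" " vertical-align: middle; "])
```

Note that the selectors, including the colon in :before, sit outside the braces and so never make it into the result.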

Step three

(map second)

(map second) simply says: for each list in the list, apply the second function - this will return a list of just the bits captured by the capture group.
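Written outside the pipeline, with the sequence passed explicitly as the last argument, the step looks like this. The input is the kind of output step two produces.

```clojure
(def blocks
  [["{ content: 'Haskell'; }" " content: 'Haskell'; "]
   ["{ vertical-align: middle; }" " vertical-align: middle; "]])

;; Inside ->>, (map second) becomes (map second blocks).
(def bodies (map second blocks))

bodies
;; => (" content: 'Haskell'; " " vertical-align: middle; ")
```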

Step four

(mapcat #(re-seq #"([a-zA-Z-]+)\s*:" %))

Now we need to get rid of the stuff on the right side of the rule - we only want the property. Again, that's a straightforward regular expression: one or more characters from the class a-z, A-Z or -, followed by optional whitespace and a colon, and we only care about the stuff before the colon, so that's the bit we capture.
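One caveat worth knowing about: the expression will match any word followed by a colon, not just property names. The sketch below uses a hypothetical rule with a URL in its value, which is not something I have seen in org.css, to show the kind of false positive that could creep in.

```clojure
(def property-re #"([a-zA-Z-]+)\s*:")

;; The ordinary case: one property, one match.
(re-seq property-re " vertical-align: middle; ")
;; => (["vertical-align:" "vertical-align"])

;; A made-up value containing a colon: "http" is reported as a property too.
(re-seq property-re " background: url(http://example.com/x.png); ")
;; => (["background:" "background"] ["http:" "http"])
```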

We can use re-seq for this again, but now we have a sequence of strings, not one string. Remembering that re-seq produces a list of [match capture] pairs for each string, mapping it over the sequence would give us a list of lists of pairs, one inner list per declaration block. We would like to flatten that by one level, so we get one long sequence of pairs. We can use the mapcat function for that - that says: apply a function to every element, and concatenate the results into one list. Mapcat needs a function, so we use an anonymous function to call re-seq on each item in the sequence.

This will result in a single list, in which every item is still a pair of match and capture.
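A sketch comparing map and mapcat on two made-up declaration blocks shows the shape of the result. The helper name property-matches is mine, introduced only to keep the example readable.

```clojure
(def bodies [" margin: 0; padding: 0; " " color: red; "])

(defn property-matches [body]
  (re-seq #"([a-zA-Z-]+)\s*:" body))

;; map keeps one inner sequence per block...
(map property-matches bodies)
;; => ((["margin:" "margin"] ["padding:" "padding"]) (["color:" "color"]))

;; ...while mapcat concatenates them. Only one level is removed:
;; each element is still a [match capture] vector.
(mapcat property-matches bodies)
;; => (["margin:" "margin"] ["padding:" "padding"] ["color:" "color"])
```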

Step five

(map second)

We don't care about the match, so we can use the second function again, to return only the capture from each pair in the list.
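It matters that the items at this point are [match capture] vectors rather than bare strings, as this small sketch with made-up data shows.

```clojure
(def pairs [["margin:" "margin"] ["padding:" "padding"] ["color:" "color"]])

;; second picks the capture out of each pair.
(map second pairs)
;; => ("margin" "padding" "color")

;; On a flat list of strings, second would return the second character
;; of each string instead, which is not what we want.
(map second ["margin:" "margin"])
;; => (\a \a)
```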

Steps six and seven

(distinct)
(sort)

Finally we use the distinct and sort functions to remove duplicates, and sort the final list.
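On a short, made-up list of property names, the last two steps behave like this.

```clojure
(def properties ["margin" "color" "padding" "color" "margin"])

;; distinct keeps the first occurrence of each item, in original order.
(distinct properties)
;; => ("margin" "color" "padding")

;; sort then puts the survivors in alphabetical order.
(sort (distinct properties))
;; => ("color" "margin" "padding")
```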

Running the function

user=> (extract-properties "org.css")
("background" "background-color" "border" "border-collapse" "border-radius" "border-style" "caption-side" "color" "content" "display" "font-family" "font-size" "font-weight" "margin" "margin-bottom" "margin-left" "margin-right" "margin-top" "max-width" "overflow" "overflow-x" "padding" "position" "right" "text-align" "text-decoration" "top" "vertical-align" "white-space" "width")
user=> (count (extract-properties "org.css"))
30

That's all for now, folks

We've established that the stylesheet that org embeds in my exported HTML sets thirty unique properties. This helps us start to understand what it's doing.

Next time, we'll do a similar trick to extract all the selectors, which will give us the two dimensions of the style sheet - what elements it changes, and what properties it cares about.

Finally we can see how many of those elements appear in my actual content, and what rules apply to them, and we can start thinking about how to change them.

Having deduced both the properties and the selectors, we'll have a complete map of the stylesheet, telling us which properties are applied to which elements. Then we'll be able to identify and modify the styling to meet our design goals, whatever they end up being.

Yeah, we could literally do that by reading the stylesheet, but who wants to do that when you can write lisp?
