Split on separator but keep the separator, in Python

Posted on Fri 23 December 2011 in Coding

I have a function that reads from a file by iterating over the file object. A simplified version follows (in reality, more is going on within the generator):

def readfile(f):
    with open(f) as fd:
        for line in fd:
            yield line

If the file contains something like a\nb\nc\n, I will get back four lines, three of which include the newline and one which is the empty string. In mocking this function, I want the mock to behave in the same way, of course. Thus, I need a way to split some text on newlines, but keep the newlines.

Splitting with the string's [split][] method does not keep the separator, but it does return the empty string at the end:

>>> "a\nb\nc\n".split("\n")
['a', 'b', 'c', '']

Another string method is [splitlines][]. With its keepends argument set to True it keeps the line breaks, but it splits only on line boundaries (not on an arbitrary separator) and discards the empty string at the end:

>>> "a\nb\nc\n".splitlines(True)
['a\n', 'b\n', 'c\n']

I can try a trick at this point and append the separator “manually” to each part returned by split, but the result is not quite what I want, as the fourth part is no longer the empty string:

>>> [x + "\n" for x in "a\nb\nc\n".split("\n")]
['a\n', 'b\n', 'c\n', '\n']

Similarly, if I append an empty string to the result of using splitlines, I would need to distinguish between the cases when the string to split ends in a newline and when it doesn’t.
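To make that case distinction concrete, here is a minimal sketch of the splitlines-based workaround. The helper name splitlines_with_tail is hypothetical, and the sketch assumes the text uses only "\n" as a line break, since splitlines also splits on other line boundaries such as "\r".

```python
def splitlines_with_tail(text):
    # Hypothetical helper: keep each newline and restore the trailing empty part.
    parts = text.splitlines(True)
    # splitlines() drops the empty part after a final newline (and returns []
    # for an empty string), so add it back only in those two cases.
    if text == "" or text.endswith("\n"):
        parts.append("")
    return parts

print(splitlines_with_tail("a\nb\nc\n"))  # ['a\n', 'b\n', 'c\n', '']
print(splitlines_with_tail("a\nb\nc"))    # ['a\n', 'b\n', 'c']
```

This works, but it is tied to newlines and cannot be generalized to an arbitrary separator.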

Splitting using the split function in the re module yields the exact same result as in the first case:

>>> import re
>>> re.split("\n", "a\nb\nc\n")
['a', 'b', 'c', '']

But that function uses a regular expression pattern as the separator, not a simple string! And the documentation spells out that if the pattern contains a capturing group, the text matched by the group is also returned as part of the result list. Let’s try that:

>>> re.split("(\n)", "a\nb\nc\n")
['a', '\n', 'b', '\n', 'c', '\n', '']

Much better, except for the fact that each newline is detached from its corresponding line of text. But to work around that, I can apply the [reduce][] function:

>>> from functools import reduce  # required in Python 3, where reduce is no longer a builtin
>>> reduce(lambda acc, elem: acc[:-1] + [acc[-1] + elem] if elem == "\n" else acc + [elem], re.split("(\n)", "a\nb\nc\n"), [])
['a\n', 'b\n', 'c\n', '']

Voilà! The desired end result! While the code might seem intimidating, it’s actually quite straightforward. The reduce function applies a two-argument function cumulatively to the elements of the original list (the one resulting from the split), starting from the empty list given as the initial value, in order to reduce it to a single value, which in my case is also a list. The function I use here has two branches. If the current element is a separator, it is concatenated onto the last element of the current accumulator list: acc[:-1] + [acc[-1] + elem] if elem == "\n". Otherwise, the element is simply appended to the accumulator list: else acc + [elem].
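For readers who find the lambda hard to parse, the same reduction can be spelled out as an explicit loop. This is only an illustrative sketch: the name attach_separators is hypothetical, and the extra guard against an empty result list is an addition that the one-liner does not have.

```python
import re

def attach_separators(parts, sep):
    # Hypothetical loop equivalent of the reduce one-liner.
    result = []
    for elem in parts:
        if elem == sep and result:
            # A separator: glue it onto the part collected just before it.
            result[-1] += elem
        else:
            # Ordinary text (possibly the empty string): start a new part.
            result.append(elem)
    return result

parts = re.split("(\n)", "a\nb\nc\n")
print(parts)                           # ['a', '\n', 'b', '\n', 'c', '\n', '']
print(attach_separators(parts, "\n"))  # ['a\n', 'b\n', 'c\n', '']
```

Unlike the reduce version, the loop mutates one list instead of building a new list at every step, which also avoids repeated copying on long inputs.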

In a generic version of the one-liner above, the separator must be escaped so that no part of it is mistaken for a regular expression special character:

import re
from functools import reduce  # reduce is no longer a builtin in Python 3

def splitkeepsep(s, sep):
    return reduce(lambda acc, elem: acc[:-1] + [acc[-1] + elem] if elem == sep else acc + [elem], re.split("(%s)" % re.escape(sep), s), [])
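As a rough usage sketch, the following shows why the escaping matters, using "." as the separator. The function is restated here only so that the snippet runs on its own; the sample strings are made up for illustration.

```python
import re
from functools import reduce

def splitkeepsep(s, sep):
    # Same function as above, repeated so this snippet is self-contained.
    return reduce(lambda acc, elem: acc[:-1] + [acc[-1] + elem] if elem == sep else acc + [elem],
                  re.split("(%s)" % re.escape(sep), s), [])

# Unescaped, "." is a regex wildcard, so every character counts as a separator.
print(re.split("(.)", "a.b"))                    # ['', 'a', '', '.', '', 'b', '']

# Escaped, only the literal dot is treated as the separator.
print(re.split("(%s)" % re.escape("."), "a.b"))  # ['a', '.', 'b']

print(splitkeepsep("a.b.c", "."))                # ['a.', 'b.', 'c']
print(splitkeepsep("a\nb\nc\n", "\n"))           # ['a\n', 'b\n', 'c\n', '']
```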