Split on separator but keep the separator, in Python
Posted on Fri 23 December 2011 in Coding
I have a function that reads from a file using the file object's iterator capability (this is a simplified version; in reality, more is going on within the generator):
def readfile(f):
    # Iterating over a file object yields one line at a time.
    with open(f) as fd:
        for line in fd:
            yield line
If the file contains something like "a\nb\nc\n", I will get back four
lines, three of which include the newline and one which is the empty
string. In mocking this function, I want the mock to behave in the same
way, of course. Thus, I need a way to split some text on newlines, but
keep the newlines.
Splitting using the split function of a string does not keep the
separator but does give the empty string part in the end:
>>> "a\nb\nc\n".split("\n")
['a', 'b', 'c', '']
Another split function that can be used on strings is splitlines
(passing True keeps the line endings), but it only splits on line
boundaries and discards the empty string part in the end:
>>> "a\nb\nc\n".splitlines(True)
['a\n', 'b\n', 'c\n']
I can use a trick at this point, and append the separator “manually”, but unfortunately the result is not exactly as I want it, as the fourth part is no longer the empty string:
>>> [x + "\n" for x in "a\nb\nc\n".split("\n")]
['a\n', 'b\n', 'c\n', '\n']
Similarly, if I append an empty string to the result of using
splitlines, I would need to distinguish between the cases when the
string to split ends in a newline and when it doesn’t.
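The splitlines variant could be sketched roughly as follows. The helper name splitlines_keep_empty is hypothetical, and the sketch assumes the text only uses "\n" as its line ending, since splitlines also breaks on other boundaries such as "\r".

```python
def splitlines_keep_empty(text):
    """Split on newlines, keep them, and restore the trailing empty string."""
    parts = text.splitlines(True)  # True keeps the line endings
    # splitlines drops the empty part after a final newline (and returns an
    # empty list for an empty string), so those cases need special handling.
    if not parts or text.endswith("\n"):
        parts.append("")
    return parts


print(splitlines_keep_empty("a\nb\nc\n"))  # ['a\n', 'b\n', 'c\n', '']
print(splitlines_keep_empty("a\nb\nc"))    # ['a\n', 'b\n', 'c']
```

The extra condition is exactly the case distinction mentioned above, which is what makes this approach less attractive than a single split call.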
Splitting using the split function in the re module yields the
exact same result as in the first case:
>>> import re
>>> re.split("\n", "a\nb\nc\n")
['a', 'b', 'c', '']
But that function uses a regular expression pattern as separator, not a simple string! And in the documentation, it is spelled out that using a capturing group retains the separator pattern. Let’s try that:
>>> re.split("(\n)", "a\nb\nc\n")
['a', '\n', 'b', '\n', 'c', '\n', '']
Much better, except for the fact that each newline is detached from its
corresponding line of text. But to work around that, I can apply the
reduce function:
>>> from functools import reduce  # required on Python 3; a builtin on Python 2
>>> reduce(lambda acc, elem: acc[:-1] + [acc[-1] + elem] if elem == "\n" else acc + [elem], re.split("(\n)", "a\nb\nc\n"), [])
['a\n', 'b\n', 'c\n', '']
Voilà! The desired end result! While the code might seem intimidating,
it’s actually quite straightforward. The reduce function applies a
function to the original list (the one resulting from the split) in
order to reduce it to a single value, which in my case also is a list.
The function I use here consists of two parts. If the current element is
a separator, the function appends it to the last element of the current
accumulator list: acc[:-1] + [acc[-1] + elem] if elem == "\n".
Otherwise, it just appends the element to the accumulator list:
else acc + [elem].
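For readers who find the lambda hard to follow, the same two-part logic can be written as an explicit loop. This is only a sketch; the function name join_separators is hypothetical and not part of the original one-liner.

```python
import re


def join_separators(parts, sep):
    """Attach each separator to the text element that precedes it."""
    acc = []
    for elem in parts:
        if elem == sep:
            # Separator: glue it onto the last element collected so far.
            acc[-1] = acc[-1] + elem
        else:
            # Ordinary text (possibly empty): start a new element.
            acc.append(elem)
    return acc


parts = re.split("(\n)", "a\nb\nc\n")
print(join_separators(parts, "\n"))  # ['a\n', 'b\n', 'c\n', '']
```

Unlike the reduce version, the loop modifies one list in place instead of building a new list at every step, which may matter for long inputs.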
In a generic version of the one-liner above, the separator must be escaped so that no part of it is mistaken for a regular expression special character:
import re
from functools import reduce  # required on Python 3; a builtin on Python 2
def splitkeepsep(s, sep):
    return reduce(lambda acc, elem: acc[:-1] + [acc[-1] + elem] if elem == sep else acc + [elem], re.split("(%s)" % re.escape(sep), s), [])
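A short usage sketch, repeating the function so the block runs on its own. The mock_readfile generator is a hypothetical name, shown only to illustrate how the mock mentioned at the start might be built on top of splitkeepsep.

```python
import re
from functools import reduce


def splitkeepsep(s, sep):
    return reduce(
        lambda acc, elem: acc[:-1] + [acc[-1] + elem] if elem == sep else acc + [elem],
        re.split("(%s)" % re.escape(sep), s),
        [],
    )


def mock_readfile(text):
    """Hypothetical mock: yields the pieces of text instead of reading a file."""
    for line in splitkeepsep(text, "\n"):
        yield line


print(splitkeepsep("a\nb\nc\n", "\n"))  # ['a\n', 'b\n', 'c\n', '']
print(splitkeepsep("a.b.c", "."))       # ['a.', 'b.', 'c']
print(list(mock_readfile("a\nb\n")))    # ['a\n', 'b\n', '']
```

The "." case shows why the escaping matters: without re.escape, the pattern "(.)" would treat every character as a separator.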