I have a function that reads from a file using the file object's iterator capability (a simplified version; in reality there is more going on within the generator):
def readfile(f):
    with open(f) as fd:
        for line in fd:
            yield line
If the file contains something like
a\nb\nc\n, I will get back four
lines, three of which include the newline and one which is the empty
string. In mocking this function, I want the mock to behave in the same
way, of course. Thus, I need a way to split some text on newlines, but
keep the newlines.
Splitting using the split function of a string does not keep the
separator but does give the empty string part in the end:
>>> "a\nb\nc\n".split("\n")
['a', 'b', 'c', '']
Another split function that can be used on strings is splitlines,
but it only splits on newlines and discards the empty string part in
the end:
>>> "a\nb\nc\n".splitlines(True)
['a\n', 'b\n', 'c\n']
I can use a trick at this point, and append the separator “manually”, but unfortunately the result is not exactly as I want it, as the fourth part is no longer the empty string:
>>> [x + "\n" for x in "a\nb\nc\n".split("\n")]
['a\n', 'b\n', 'c\n', '\n']
Similarly, if I append an empty string to the result of using
splitlines, I would need to distinguish between the cases when the
string to split ends in a newline and when it doesn’t.
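As a rough sketch of what that distinction would look like, the hypothetical helper below (the name `splitlines_keep_tail` is mine, not part of any library) appends the empty string only when the text ends in a newline. Note that `str.splitlines` also breaks on other line boundaries such as `\r`, so this only matches the behaviour I am after when `\n` is the sole line terminator in the text.

```python
def splitlines_keep_tail(text):
    """Hypothetical helper: keep the newlines and add a trailing ''
    only when the text ends in a newline."""
    parts = text.splitlines(True)  # True keeps the line endings
    if text.endswith("\n"):
        # Mimic the empty last part that str.split("\n") produces.
        parts.append("")
    return parts


print(splitlines_keep_tail("a\nb\nc\n"))
print(splitlines_keep_tail("a\nb"))
```

It works, but the special case is exactly the kind of branching I would rather not carry around in a mock.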
Splitting using the
split function in the re module yields the
exact same result as in the first case:
>>> import re
>>> re.split("\n", "a\nb\nc\n")
['a', 'b', 'c', '']
But that function uses a regular expression pattern as separator, not a simple string! And in the documentation, it is spelled out that using a capturing group retains the separator pattern. Let’s try that:
>>> re.split("(\n)", "a\nb\nc\n")
['a', '\n', 'b', '\n', 'c', '\n', '']
Much better, except for the fact that each newline is detached from its
corresponding line of text. But to work around that, I can apply the
reduce function:
>>> from functools import reduce
>>> reduce(
...     lambda acc, elem: acc[:-1] + [acc[-1] + elem] if elem == "\n" else acc + [elem],
...     re.split("(\n)", "a\nb\nc\n"),
...     [],
... )
['a\n', 'b\n', 'c\n', '']
Voilà! The desired end result! While the code might seem intimidating,
it’s actually quite straightforward. The
reduce function applies a
function to the original list (the one resulting from the split) in
order to reduce it to a single value, which in my case also is a list.
The function I use here consists of two parts. If the current element is
a separator, the function appends it to the last element of the current
accumulator list:
acc[:-1] + [acc[-1] + elem] if elem == "\n"
Otherwise, it just appends the element to the accumulator list:
else acc + [elem]
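For readers who find the lambda hard to follow, the same two branches can be written as a plain loop. This is only a sketch of the idea, not a replacement for the one-liner; the function name `split_keep_newlines` is made up for illustration.

```python
import re


def split_keep_newlines(text):
    """Loop version of the reduce step: glue each "\n" onto the part before it."""
    result = []
    for elem in re.split("(\n)", text):
        if elem == "\n":
            # Separator: attach it to the last collected part.
            # re.split always yields a (possibly empty) text part first,
            # so result is never empty when a separator shows up.
            result[-1] += elem
        else:
            # Ordinary text (or the empty tail): start a new part.
            result.append(elem)
    return result


print(split_keep_newlines("a\nb\nc\n"))
```

The loop mutates one list in place, whereas the reduce version builds a new list on every step; for short test fixtures the difference hardly matters.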
In a generic version of the one-liner above, the separator must be escaped so that no part of it is mistaken for a regular expression special character:
def splitkeepsep(s, sep):
    return reduce(
        lambda acc, elem: acc[:-1] + [acc[-1] + elem] if elem == sep else acc + [elem],
        re.split("(%s)" % re.escape(sep), s),
        [],
    )
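To see why the escaping matters, here is a small usage sketch with separators that are regular expression special characters. It repeats the definition so that it runs on its own, with the imports it needs and an empty list as the initial accumulator; the sample strings are just illustrations.

```python
import re
from functools import reduce


def splitkeepsep(s, sep):
    return reduce(
        lambda acc, elem: acc[:-1] + [acc[-1] + elem] if elem == sep else acc + [elem],
        re.split("(%s)" % re.escape(sep), s),
        [],  # start from an empty accumulator list
    )


# "." would match any character if it were not escaped.
print(splitkeepsep("a.b.c.", "."))
# "|" means alternation in a pattern, so "||" needs escaping too.
print(splitkeepsep("x||y", "||"))
```

Without re.escape, the "." pattern would split between every character instead of only at the dots.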