Communication errors

What does this cover

This guide assumes that you have already connected the printer successfully, but you see issues such as unexpected pauses or aborted prints, or you have seen error messages you want to understand.

First we explain how communication works in general and what some of the communication settings do. Only if you understand how it works will you understand what goes wrong.

After the general explanation you will learn about the error sources that can occur, why they might occur and some ideas for improvements.

Communication protocol

You have two separate devices: a 3D printer and Repetier-Server running on a computer. These are connected via USB cable, serial cable, network or a third-party intermediate software (Klipper Pi app, Duet control server) with one Repetier-Server instance. In between there is also the operating system and possibly a USB-to-serial driver chip. Only if the data flows exactly as intended from the printer CPU through all involved parties to the server will you get the desired print result.

The basic communication is driven by the server side. A command is sent, and when the printer has received it and is ready for the next command, it sends back a line starting with “ok ” or only “ok”. When the server receives this ok message it knows it can send the next command, and so on. From time to time the printer also sends additional data like temperatures, status changes and more.

Printer developers know that commands might get corrupted. So we prepend a line number to each command with the N parameter and end the line with *checksum. When the printer sees this, it computes the checksum as well, and if the result differs it complains and requests a resend of the correct command.

We call this communication scheme ping-pong mode, simply because data always goes in one direction, then the sender waits for the response, and so on.
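The framing described above can be sketched in a few lines. This is a minimal illustration, not the server's actual code. It assumes the XOR checksum convention commonly used by RepRap-style firmwares (all bytes before the `*` are XOR-ed together); the function names `checksum`, `frame_command` and `is_valid` are made up for this example.

```python
def checksum(payload: str) -> int:
    """XOR of all bytes before the '*' (assumed RepRap-style convention)."""
    value = 0
    for byte in payload.encode("ascii"):
        value ^= byte
    return value


def frame_command(line_number: int, command: str) -> str:
    """Prepend N<line number> and append *<checksum> to a command."""
    payload = f"N{line_number} {command}"
    return f"{payload}*{checksum(payload)}"


def is_valid(framed: str) -> bool:
    """Printer-side check: recompute the checksum and compare it."""
    payload, sep, received = framed.rpartition("*")
    return sep == "*" and received.isdigit() and checksum(payload) == int(received)


print(frame_command(3, "T0"))   # N3 T0*57
print(is_valid("N3 T0*57"))     # True
print(is_valid("N3 T1*57"))     # False, one corrupted byte is detected
```

A line that fails this check is the kind of line for which the printer would request a resend.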
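As a rough sketch of ping-pong mode, the loop below sends one command and blocks until it is acknowledged before sending the next. `FakePrinter` is a hypothetical stand-in that answers every line with “ok”; a real connection would read the reply from the serial port or network instead.

```python
from collections import deque


class FakePrinter:
    """Hypothetical stand-in for a printer: acknowledges every line with 'ok'."""

    def __init__(self):
        self.received = []

    def handle(self, line: str) -> str:
        self.received.append(line)
        return "ok"


def send_ping_pong(commands, printer) -> int:
    """Send one command, wait for its 'ok', then send the next one."""
    queue = deque(commands)
    acknowledged = 0
    while queue:
        reply = printer.handle(queue.popleft())  # send and wait for the reply
        if reply == "ok" or reply.startswith("ok "):
            acknowledged += 1
        else:
            raise RuntimeError(f"unexpected reply: {reply!r}")
    return acknowledged


printer = FakePrinter()
print(send_ping_pong(["G28", "G1 X10", "G1 X20"], printer))  # 3
```

At no point is more than one command unacknowledged, which is what makes this mode simple but latency-bound.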

If you are familiar with our configuration, you have seen that ping-pong can be disabled. Why should you consider that and what does it do? It solves two issues.

The first is communication speed. Because of all the parts a message has to pass through, there is not only the physical transfer speed but also a latency between starting to send a message and the first byte being received. Assume a latency of 2 ms and otherwise unlimited speed. One command needs 2 ms to be sent, and then we wait 2 ms for the ok before we can send more data. That makes 4 ms per command, or 1000 ms / 4 ms = 250 commands per second. 250 lines per second is a good speed, and most prints also print well with even 100 lines per second. In some edge cases it is not enough, and you will see the printer stutter or move unevenly. This happens when you have very tiny moves. How tiny depends on the print speed. If you print at 100 mm/s with 250 lines/s, your average move length must be at least 100 mm / 250 = 0.4 mm. Very curvy models and STL models with very small triangles can produce shorter moves, which need a higher line rate.

The second issue it solves is a missing “ok” from the printer. There is no checksum for the back channel, so if “ok” becomes “k” we do not know that we can send the next command. But as long as we can send several commands in parallel, the next one can still be sent, so no timeout occurs. Ideally the printer firmware adds the line number to the ok message, so we see that we missed an ok and correct it on the fly. Without the line number it only halves the number of timeouts.
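The throughput arithmetic for the first issue can be reproduced with a short calculation. The function names are illustrative only, and the model is the simplified one from the text: one latency for the command, one for the ok, unlimited speed otherwise.

```python
def commands_per_second(latency_ms: float) -> float:
    # One round trip = latency for the command + latency for the "ok".
    return 1000.0 / (2.0 * latency_ms)


def min_average_move_mm(print_speed_mm_s: float, lines_per_second: float) -> float:
    # With shorter average moves the printer consumes lines faster
    # than ping-pong mode can deliver them.
    return print_speed_mm_s / lines_per_second


rate = commands_per_second(2.0)
print(rate)                              # 250.0
print(min_average_move_mm(100.0, rate))  # 0.4
```

Doubling the latency to 4 ms halves the rate to 125 lines per second and doubles the minimum average move length to 0.8 mm at the same print speed.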

If you disable ping-pong, we send multiple commands and count the bytes sent that have not yet been acknowledged with an “ok”. This has two limits. The first limit is the input buffer size. That buffer is on the printer side and stores received data for later processing. As long as we do not send more than fits inside, we are safe. A typical input buffer is 127 bytes. When two commands fit inside, the possible communication speed roughly doubles, and the firmware may even send two ok messages in a single data packet. The second limit is an optional cap on the number of parallel commands. It is normally disabled and only makes sense if the input buffer is really big. In that case, putting too many commands in flight delays pauses or print stops too much, simply because we cannot cancel commands that have already been sent.
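The byte counting with its two limits could look roughly like the sketch below. The class `SendWindow` and its methods are hypothetical names, and counting one extra byte per line for the newline is an assumption of this example, not a documented detail of the server.

```python
from collections import deque


class SendWindow:
    """Tracks unacknowledged bytes against the printer's input buffer size."""

    def __init__(self, buffer_size: int = 127, max_commands: int = 0):
        self.buffer_size = buffer_size
        self.max_commands = max_commands  # 0 = no limit on parallel commands
        self.in_flight = deque()          # byte size of each unacknowledged line

    def can_send(self, line: str) -> bool:
        size = len(line) + 1              # +1 for the newline (assumption)
        if self.max_commands and len(self.in_flight) >= self.max_commands:
            return False
        return sum(self.in_flight) + size <= self.buffer_size

    def on_sent(self, line: str) -> None:
        self.in_flight.append(len(line) + 1)

    def on_ok(self) -> None:
        if self.in_flight:
            self.in_flight.popleft()      # oldest command left the input buffer


window = SendWindow(buffer_size=127)
line = "G" * 50                           # placeholder for a 50-character command
window.on_sent(line)
window.on_sent(line)
print(window.can_send(line))              # False, a third line would overflow
window.on_ok()
print(window.can_send(line))              # True, the first line was acknowledged
```

With 50-character lines, two commands fit into a 127-byte buffer at once, which matches the "two commands inside" case described above.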

Debugging

To solve communication issues you need to watch the communication, so the first step is to enable logging. Only in the log can you see the full communication and detect the patterns described later.

Having the console open during a print can also help. Depending on the filters it can even show the same messages as the log, but you only see the last 1000 lines. If the print continues with all messages enabled, the relevant message may already have scrolled out of view by the time you notice the problem. So normally you keep the console open with filters enabled. The important messages are then visible, and you can note the times when they happened and check the log around those times later.

Typical errors

Checksum mismatch

As explained above, we normally send line numbers and checksums with every command. If you see this error, the printer detected a mismatch and wants the server to resend the line. In the console or log you see the resend messages.

As long as it only happens once in a while, this is nothing to worry about. It is not necessarily caused by incorrectly transmitted data; the printer may also have failed to process a byte before the next one was received. This happens if interrupts in the printer firmware were blocked for too long.

If you constantly see this error, there are two typical reasons. The first is that you have set the input buffer too high. Try activating ping-pong mode first; if the error disappears, that was the reason. Either stay in ping-pong mode, or disable it and reduce the input buffer until the error disappears. Do not go below 63 bytes. Lower values make no sense; use ping-pong mode instead.

The second is a very high error rate on the line itself, which shows when the issue remains even in ping-pong mode. Reasons are a slightly wrong baud rate, e.g. 230400 instead of 250000, or electrical noise on the cables.
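To see why 230400 versus 250000 counts as a problem, it helps to express the difference as a percentage. The helper below is only illustrative arithmetic. As a general rule of thumb, and not a figure from this guide, asynchronous serial links only tolerate a mismatch of a few percent between both ends.

```python
def baud_mismatch_percent(configured: int, expected: int) -> float:
    """Relative difference between the baud rates of both ends, in percent."""
    return abs(configured - expected) / expected * 100.0


print(round(baud_mismatch_percent(230400, 250000), 2))  # 7.84
```

A mismatch of almost 8 % can still let some bytes through, which would explain a connection that appears to work but produces checksum errors all the time.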

Timeout messages

A timeout means we sent a command to the printer and did not receive an ok message for it in the expected time. Some commands are known to be slow, like waiting for a temperature to be reached or homing. For these the server extends the timeout automatically. For all other commands the timeout you have set in the communication settings applies.

Move commands are hard to predict. They get buffered, but when the move buffer (do not confuse it with the input buffer) is full and a new move arrives, the printer must wait until the current move has finished before it can add the new one to the move buffer. So the timeout must be longer than the slowest move you expect. 30 s is a good value for most setups.

Long timeouts have the drawback of creating blobs where the extruder stands still, so shorter is better. For this reason several printer firmwares support the additional busy protocol. When we get no feedback from the printer, the server cannot know whether an ok was lost or execution is just slower than expected. With the busy protocol the printer sends “busy” to indicate that execution will take a bit longer. Normally the interval is 2 s, so you can safely set the timeout to 3 s in this case.
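A quick estimate shows how a single slow move can exceed the timeout. The sketch assumes the feedrate is given in mm/min, as is usual for the F parameter in G-code, and it ignores acceleration, so the result is a lower bound. The function names are illustrative.

```python
def move_duration_s(length_mm: float, feedrate_mm_min: float) -> float:
    """Lower bound for the duration of one move; acceleration is ignored."""
    return length_mm / (feedrate_mm_min / 60.0)


def timeout_covers_move(timeout_s: float, length_mm: float,
                        feedrate_mm_min: float) -> bool:
    return timeout_s > move_duration_s(length_mm, feedrate_mm_min)


print(move_duration_s(200.0, 300.0))            # 40.0
print(timeout_covers_move(30.0, 200.0, 300.0))  # False
```

A 200 mm move at 300 mm/min takes 40 s, so a 30 s timeout would fire even though nothing is wrong.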
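The effect of busy messages on the timeout can be sketched as a small watchdog. This is an assumed model for illustration, not the server's implementation: every “busy” restarts the countdown, so a 3 s timeout is enough as long as busy arrives every 2 s. `AckWatchdog` is a hypothetical name.

```python
class AckWatchdog:
    """Decides whether a missing 'ok' should count as a timeout."""

    def __init__(self, timeout_s: float = 3.0):
        self.timeout_s = timeout_s
        self.last_activity = None  # time of the last send or busy message

    def on_sent(self, now: float) -> None:
        self.last_activity = now

    def on_busy(self, now: float) -> None:
        self.last_activity = now   # printer is alive: restart the countdown (assumption)

    def on_ok(self) -> None:
        self.last_activity = None  # nothing is outstanding any more

    def timed_out(self, now: float) -> bool:
        if self.last_activity is None:
            return False
        return now - self.last_activity > self.timeout_s


watchdog = AckWatchdog(timeout_s=3.0)
watchdog.on_sent(0.0)
watchdog.on_busy(2.0)
watchdog.on_busy(4.0)
print(watchdog.timed_out(6.5))  # False, last busy was 2.5 s ago
print(watchdog.timed_out(7.5))  # True, more than 3 s without any sign of life
```

Without the busy messages the same command would have been reported as a timeout 3 s after it was sent.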

Timeout due to missed ok

This is the main reason timeout handling exists: it is the last defense against a broken print. Due to a communication error the server missed an ok and cannot send new commands.

The only remedy is to reduce electronic issues; see the Electronic issues chapter.

Timeout from slow moves or slow commands

You could call this a false positive. When this happens, the server assumes a missed ok although this is not the case. It then sends the next commands, probably overflowing the input buffer and causing a checksum error afterwards.

The solution is to increase the timeout so it covers the duration of this command. Or ignore it if you know it is only one exceptionally slow move and you do not want to increase the timeout just for this single exception.

Timeout from missing busy signal

If the firmware supports busy, it should always send busy when a command takes longer, but some implementations are not correct. We have seen firmware versions where, for example, it did not work for moves until the bug was fixed.

To validate, first check whether busy is enabled. Go to the console view, enable ACK and send

G4 S20

which takes 20 seconds, so you should now see busy messages in the console. For analysis during a print, enable logging and check when the last command was sent and when the timeout happened. If there are no busy messages but the print continued later, the busy implementation is faulty and you need to set the timeout as if you had no busy support.

Timeout from getting no data

This is the really bad case you do not want. The firmware is running, Repetier-Server is running and the connection is marked open, but the server gets no more responses from the operating system. It is impossible to say where exactly the error happens, in the OS driver or in the printer's USB driver. Both think the connection is open, but for some reason no data gets through.

If it is the OS driver and you are using Linux, we can power-cycle the USB port to reset the driver. That is what the option “USB Reconnect after timeout” controls. “Never” never resets USB. “Early” resets it on the first timeout, “Conservative” on the second timeout in a row. We try not to reset the printer with a DTR toggle in this case, but this does not work with all drivers. It depends on the DTR state and on whether the driver toggles DTR on its own.
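The three settings of “USB Reconnect after timeout” boil down to a simple decision rule. The sketch below only restates the behavior described above; the function name and the lowercase policy strings are made up for this example and are not the server's configuration keys.

```python
def should_reset_usb(policy: str, consecutive_timeouts: int) -> bool:
    """Return True if the USB port should be power-cycled after a timeout."""
    policy = policy.lower()
    if policy == "never":
        return False
    if policy == "early":
        return consecutive_timeouts >= 1  # reset on the first timeout
    if policy == "conservative":
        return consecutive_timeouts >= 2  # reset on the second timeout in a row
    raise ValueError(f"unknown policy: {policy!r}")


for policy in ("never", "early", "conservative"):
    print(policy, should_reset_usb(policy, 1), should_reset_usb(policy, 2))
# never False False
# early True True
# conservative False True
```

The conservative setting gives a single false-positive timeout, such as one slow move, a chance to pass without a USB reset.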

Also check the Electronic issues chapter.

Unexpected disconnects

Some users assume that we disconnect the printer. We do not do this unless you have enabled USB reconnect on timeouts. In that case you will also see “Reconnecting USB port to fix serial driver problems …” in the console and log. Of course, if you deactivate a printer in Repetier-Server, it also closes the connection. In most cases the real reason is that the operating system closed the connection and signaled this. In newer server versions you see a message like “Connection closed by OS” in the console and log.

So why would the OS disconnect a serial connection?

  • Cable disconnected or loose.
  • USB chip disconnected it for protection (EMF).
  • OS underpowered USB because the core voltage dropped too much for too long. This is a frequent issue with Raspberry Pi systems, so we added the bolt menu that shows whether you had or currently have undervoltage. A disconnect does not happen at every level of undervoltage, but if you see this warning it might be the reason.
  • Driver issues in OS made it disconnect.
  • For network connections a network problem might cause it, although it takes a while for a network connection to drop. If one of the involved parties uses Wi-Fi, this is a frequent reason, especially on bad Wi-Fi connections.

Electronic issues

The errors you cannot fix are missed bytes when the receiver was too busy to read the data in time.

The more frequent problem is data corruption or even disconnects from electric problems. There are many sources of this problem like:

  • Unshielded parallel cables. When one cable switches on or off, the changing magnetic field induces a voltage in parallel cables close by, which can push a signal across its trigger level and flip a bit. Heater and motor cables with their high currents can easily cause such signals. In printers this sometimes causes end stops to trigger without contact.
  • Voltage changes slightly when heaters or motors get enabled or disabled, so the power supply has to cope with large current changes.
  • The CPU and parallel wires on the printer board can also be sources.
  • Missing communication line terminators.
  • … we are software developers, so this is not our best field of knowledge.

What can happen:

  • Wrong data for recipient.
  • Driver hangs because it has no handling for a special pattern that should not happen.
  • USB chip disconnects the device. In Linux you see a message about EMF as a possible reason. In newer server versions we also show this in the console if it occurs at the time of a disconnect.