Node Operations FAQ

Commonly-seen node operations questions

Testnet: what makes one node generate more blocks than another on the testnet?

This is one of the common questions we see from our community, so let's dive into some of the factors governing block production.

  • Uptime - what percent of the time is your node turned on and actively participating in consensus?

  • System resources - does your node meet the minimum suggested system requirements?

  • Join time - was your node one of the earliest joiners or one of the latecomers?

  • Latency - how good is your node's connection to other nodes?

Let's discuss these one by one.

#1: Uptime

Your node needs to be on, fully synced, and participating in consensus in order to produce blocks. If it's frequently not on, not synced, or not participating in consensus, it will fall behind other nodes that have higher uptime.

One of the most common causes of low uptime is actually related to system resources, which leads us to the next factor.

#2: System Resources

The recommended setup is 4 vCPUs, 8 GB of RAM, and 200 GB of disk space. If the node runs out of any one of these resources, it will be forced to shut down and will consequently suffer from low uptime.

One of the most common issues we've encountered is the use of extremely cheap shared VPS plans (just several USD / month). On a shared VPS, you're never guaranteed to actually get the stated resources. For example, if the plan says it's allocating 4 vCPUs but you're actually only getting 2, the node runs out of resources and shuts down, which hurts your uptime. To avoid this, we recommend using dedicated resources.

#3: Join Time

If node A joined the network earlier than node B and both nodes have the exact same system resources, then node A of course will have produced more blocks than node B, simply because it got an earlier start.

There is a second, more nuanced reason that greatly amplifies this effect. Since the network maintains a relatively fixed Period timer (PBFT block time), the number of PBFT blocks produced in any given time interval is fixed as well. So if node A joined early, not only did it get an early start, it was also competing against fewer nodes to produce blocks.

For example, say the network produces 15 blocks a minute. If there is only 1 node in the system during the first minute, it produces all 15 blocks. But if 9 more nodes join in the second minute, so that there are 10 nodes, each node will on average produce 1.5 blocks during that minute.
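The arithmetic in this example can be checked with a small shell sketch. It assumes every node is equally likely to produce each block, which glosses over delegation, latency, and uptime; the function name avg_blocks_per_node is made up for illustration.

```shell
# Illustration only: assumes every node is equally likely to produce each block.
# Usage: avg_blocks_per_node BLOCKS_PER_MINUTE NODE_COUNT
avg_blocks_per_node() {
  awk -v blocks="$1" -v nodes="$2" 'BEGIN { print blocks / nodes }'
}

avg_blocks_per_node 15 1    # first minute: one node produces all 15 blocks
avg_blocks_per_node 15 10   # second minute: 10 nodes, 1.5 blocks each on average
```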

It needs to be emphasized that all block production accounting for the purposes of ranking nodes is reset at the beginning of every week (note that only the calculations are reset; the blocks are on the blockchain and are not reset). If you joined a little later this week, then as long as you keep your node running, at the beginning of next week you'll be in the same competitive position as every other node that is fully synced & participating in consensus.

#4: Latency

In order for your node to produce a block that's accepted by the network, other nodes need to hear about it in a reasonable amount of time. If your network connection is poor and you don't send out the proposal fast enough, or the connection is lossy and packets need to be resent, then the chances of your block being selected as the winning block are greatly diminished.

The development team has not yet done detailed network stress tests to see just how much latency is considered "too much". When we have that analysis we'll publish some guidance to the community.

Testnet: why are there a few nodes producing far more blocks than the other nodes on the testnet?

If you go to the Taraxa explorer's node page starting in 2022, you'll see a few nodes (6 as of this writing) that are producing far more blocks than the other nodes. Why is that?

Those nodes are maintained by the developer team. The dev-operated nodes (which are excluded from rewards, as they should be) are producing far more blocks because they hold more delegation.

On the testnet, these developer-operated nodes (dev-nodes) collectively (i.e., the 6 nodes together) hold 2x the delegation of all the community nodes combined. So if you add up all the blocks produced by the non-dev nodes, the total will be about 1/2 of the total produced by the dev-nodes.
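To see where the 1/2 figure comes from, here is a rough sketch that assumes block production is exactly proportional to delegation (a simplification, since uptime and latency also matter). The function name and the delegation units are hypothetical.

```shell
# Illustration only: assumes blocks produced are proportional to delegation.
# Usage: expected_share_percent OWN_DELEGATION TOTAL_DELEGATION
expected_share_percent() {
  awk -v own="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * own / total }'
}

# Community nodes hold 1 unit in total, dev-nodes hold 2 units (2x), so the total is 3.
expected_share_percent 2 3   # dev-nodes: about 66.7% of blocks
expected_share_percent 1 3   # community nodes: about 33.3% of blocks
```

Under that assumption the community's 33.3% is half of the dev-nodes' 66.7%.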

Why are the dev-nodes given more delegation? We need to deploy new features to test, and we want to do it quickly and see the impact, which means the dev-nodes need to hold a voting majority. This makes testing easy, because we don't have to wait for the community nodes to update (which, as we've observed, can take days or even weeks to reach a 2/3 majority) before testing new features. We also get the added benefit of seeing what happens when a large chunk of the nodes in the network are NOT updated (which happens a lot in the real world), so we can check for forking or compatibility issues.

Another question: why are the dev nodes producing SO many more blocks now than they used to? A change was made to the way PBFT blocks are proposed. Previously, block proposal didn't take delegation into account; now it does. The change was made in this PR: https://github.com/Taraxa-project/taraxa-node/pull/1382. This is why, although the dev-operated nodes had more delegation before, they weren't producing more PBFT blocks than the average community node. The change makes sure the testnet's code mirrors the mainnet's, as it should.

What errors should I NOT be concerned about?

The blockchain network exists in a constant state of flux where a lot of things are "going wrong" constantly. But the beauty of a successful blockchain network is that it can handle and fix these inconsistencies and errors gracefully. Many of the errors you see displayed from the node should not concern you because they're temporary, and the node understands them and knows how to handle them.

Commonly seen "errors" that you should NOT be concerned about are covered in the ERROR entries later in this FAQ.

Is incentivized testnet live?

We have an ongoing incentivized testnet. Please check out this step-by-step guide to participate!

How do I run a node?

Please see our node operation instructions.

We recommend everyone who wants to run a node join our Discord server and look for the #node-operations channel.

Is there a testnet?

Yes, you can look at the testnet through our explorer. It is a test network, so occasionally it will go down or get wiped. Please join our Discord server for the latest information.

How do I report a problem?

First, join Taraxa's Discord server for technical discussions.

Always try to include the following information when you're reporting a problem,

  • Your node's public address (see how to get your node's public address)

  • Your system resources: CPU (# of cores), RAM, Disk

  • Whether you're running this on a dedicated or a shared machine

  • If it's in the cloud, the cloud service provider, and your instance's physical location (e.g., Frankfurt - Germany)

  • A screenshot of the error message, or better yet the logs (see how to get the logs)

  • System resource consumption screenshot - e.g., a time-series of CPU or RAM utilization

  • Anything out of the ordinary you were doing right before this error occurred, e.g., tried to import a previous state_db.

Thanks for all your feedback!

How do I download the node's logs when reporting a problem?

Here's the command to generate logs from the node,

docker logs taraxa_compose_node_1 > logs

Note that the container is not always called taraxa_compose_node_1 in every environment. If this doesn't work, use docker ps to list all your containers and find out exactly what your container is called.

If the node has been running for a while, the log file might be too big, so it's a good idea to fetch just the latest log entries, say 50,000. You can try this,

docker logs --tail 50000 taraxa_compose_node_1 > logs

Now that you have the logs file, just send it to the dev team along with your problem report. Thanks!
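If you're not sure what your container is called, the two steps above (finding the name, then saving recent logs) can be combined. This is a minimal sketch: the helper names are made up, and the name pattern assumes the node container's name ends in node_1, node-1, or similar, which may not hold in your setup.

```shell
# Pick the first container name that ends in "node_<n>" or "node-<n>".
# Reads container names on stdin, one per line.
pick_node_container() {
  grep -E 'node[-_][0-9]+$' | head -n 1
}

# Save the latest 50,000 log lines of the detected node container to ./logs.
save_recent_logs() {
  container=$(docker ps --format '{{.Names}}' | pick_node_container)
  if [ -z "$container" ]; then
    echo "No node container found; run 'docker ps' and check the names." >&2
    return 1
  fi
  # 2>&1 also captures anything the container wrote to stderr.
  docker logs --tail 50000 "$container" > logs 2>&1
}

# Example: save_recent_logs
```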

How do I tell if I have the latest version of the node?

Taraxa nodes are published via docker images to simplify deployment. Each image comes with its own digest, which is a unique identifier for the image. As long as your current image's digest matches that of the latest image, then you have the latest version.

To determine the digest of your own node, use,

docker image ls --digests

and find the digest of your node's image, typically named something like taraxa_compose_node_1 or something extremely similar to it.

To find the latest image's digest, go to our docker hub, find the latest image at top, click into it, and there should be at the top the digest that's labeled,

DIGEST:sha256:
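Comparing two long digests by eye is error-prone, so here is a small sketch for doing it in the shell. The function name is made up, and <image> stands for whatever repository name docker image ls shows for your node. docker image inspect reports digests in the form repository@sha256:..., so the sketch strips everything up to the @ before comparing.

```shell
# Usage: digest_matches LOCAL_REPO_DIGEST LATEST_DIGEST
# Succeeds (exit 0) when the local digest equals the one shown on Docker Hub.
digest_matches() {
  local_digest=${1##*@}   # "repo@sha256:abc" -> "sha256:abc"
  [ "$local_digest" = "$2" ]
}

# Example (replace <image> and paste the digest from Docker Hub):
#   mine=$(docker image inspect --format '{{index .RepoDigests 0}}' <image>)
#   digest_matches "$mine" "sha256:..." && echo "up to date" || echo "update needed"
```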

Testnet: after I deleted my node and tried to register it again, there's an error that says "node already exists"

You cannot delete a node and add it back again. If you delete a node, you have to get a "new node". The simplest way to do that is to delete the node's wallet, and restart the node. Afterwards you should be able to register the new node.

Delete the wallet

To delete the wallet, find the wallet.json file and delete it. It is typically located here,

taraxa-compose/config/wallet.json

If you cannot find it, just go to the root directory and try,

find . -name wallet.json

Then once it's found, go to the directory and delete the wallet.json file.

Restart the node

The simplest way to restart the node is probably just to restart the entire server. But you could also use the command,

docker-compose restart

to restart the node and the dashboard app. If it gives you an error saying the YAML file is not found, then try searching for the YAML file listed, go to that directory, and execute the restart command above again.

How do I tell if my node has been synced?

Either go to the dashboard, which is located at your node's IP at port :3000, or look in the CLI log output for the ---- tl;dr ---- section; its first line should tell you the node's sync status.

Why does my node's sync percentage go down?

This is normal.

Your node determines its synchronization progress by asking the nodes it's connected to. Sometimes, and this often happens after a network recovers from a crash, the node is only connected to peers that aren't 100% synced themselves. But when the node connects to a new peer who is either fully synced or has made more sync progress than its existing (or previous) peers, the node adjusts and re-calculates its sync progress.

We recommend comparing your node's synchronization status against the network progress on the explorer to get a better sense of where your node actually is.
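A toy calculation shows why the percentage can drop. The formula below is an assumption made for illustration (local block height divided by the highest height reported by any connected peer), not the node's actual implementation, and the heights are invented.

```shell
# Illustration only: a hypothetical sync-progress formula.
# Usage: sync_percent LOCAL_HEIGHT HIGHEST_PEER_HEIGHT
sync_percent() {
  awk -v mine="$1" -v peer="$2" 'BEGIN { printf "%.1f\n", 100 * mine / peer }'
}

sync_percent 900 1000   # peers are themselves behind: looks 90.0% synced
sync_percent 950 2000   # a better-synced peer connects: drops to 47.5%
```

The node made progress between the two calls (900 to 950), yet the reported percentage fell because the reference point moved.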

How do I know if my node is producing blocks?

There are several ways to tell,

  • Go to your node's IP at port :3000 and look for "Synced - Participating in consensus", or look in your node's logs for STATUS: GOOD. NODE SYNCED AND PARTICIPATING IN CONSENSUS

  • Go to the explorer's node page and see if your address is listed. Note that the list is paginated, so your node may not be on the first page

  • Search for your node's public address on the explorer and see how many blocks (if any) it has produced

  • Go to the community site's node list and see if your node is listed active

Several things to note,

  • Sometimes the explorer is reset, which can cause your node to be missing from the explorer's node list or the community site's node list. The most reliable way to tell is to look at your local node and see if it is participating in consensus

  • You will often see messages like PARTICIPATING IN CONSENSUS BUT NO NEW FINALIZED BLOCKS, PBFT STALLED, POSSIBLY PARTITIONED. NODE HAS NOT RESTARTED SYNCING, or STUCK. NODE HAS NOT RESTARTED SYNCING. These happen from time to time and are not necessarily specific to your node

What if my node is not producing any blocks at all and just says "Synced"?

If your node is 100% synced but has not produced any block, please make sure that your node is properly registered on the community site's node page.

How come my node has "0 peers"?

There are many reasons why your node could have no peers. Here are a few common ones.

  1. Node has no internet connectivity: check the logs for a line that looks something like this: "SUMMARY [2023-08-25 06:06:50.300574] INFO: Number of discovered peers: 0". If the number is 0, the node is not able to discover any other node on the network, and most likely the node itself has internet connectivity issues. At that point you need to figure out what's happening with your node's local internet connection that's causing this. We know of many jurisdictions, and sometimes ISPs, that block internet traffic from the US, Google, and/or peer-to-peer traffic.

  2. Node has the incorrect version or a corrupted stateDB: if the number of discovered peers is greater than 0 but no peer is connected, then the node is usually either running an incorrect version, using an incorrect configuration/genesis file, or holding a corrupted stateDB. Any of these causes either other nodes to mark it as malicious based on the data it sends, or the node to mark other nodes as malicious based on the data it receives. In this case it's best to reset your node.
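To tell the two cases apart quickly, you can pull the most recent discovered-peer count out of the logs. A minimal sketch, assuming the log line looks like the SUMMARY example above; the function name is made up.

```shell
# Reads log text on stdin and prints the most recent "discovered peers" count.
last_discovered_peers() {
  grep 'Number of discovered peers:' |
    tail -n 1 |
    sed 's/.*Number of discovered peers: *\([0-9][0-9]*\).*/\1/'
}

# Example (adjust the container name to match your environment):
#   docker logs --tail 50000 taraxa_compose_node_1 2>&1 | last_discovered_peers
```

A result of 0 points to the first case; a result above 0 with no connected peers points to the second.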

How do I update / reset my node?

Testnet: instructions to update or reset a testnet node.

Mainnet: instructions to update & reset a mainnet validator node.

Why is my node shown as inactive on the community site?

A node is considered active only if it has been fully synced as well as having produced at least 1 block in the past 24 hours. If not, then it shows up as inactive on the community site.

A block-producing node should also show up on the explorer's node list.

Testnet: I received TARA on my node after registration, what does that mean?

TARA tokens on the testnet are not real tokens, so please don't try to send those out (it won't work), and please do not send any tokens from another chain (e.g., ETH) into the testnet - it won't work and you'll lose your tokens.

The tokens are sent to your node by the faucet to generate some transaction traffic on the network. Later on we will also run community-driven stress tests, which will require everyone to have some testnet tokens to send around.

Why is my node eating up so much disk space?

At the current stage we only have a FULL node implementation, which means the node stores the entirety of the blockchain's history. For a full node, our space consumption is comparable to other blockchain networks such as Ethereum.

What will permanently solve this problem is a light node, which only stores the current state and prunes (deletes) the historical transaction data. This is on the roadmap.

For now, you can ameliorate this problem by disabling snapshot generation and deleting existing snapshots if you'd like. The node generates and stores many network snapshots (for testing purposes), which can take up a lot of disk space; we're going to update it later on to stop generating them.

If you would like to save disk space, you can do two things.

Step 1: please go to the config file

/taraxa-ops-master/taraxa_compose/config/testnet.json

Inside this file, set,

"db_snapshot_each_n_pbft_block" : 0

IMPORTANT: Now restart the node, or just restart the entire machine. If you don't, this change won't take effect.
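If you'd rather not edit the file by hand, the Step 1 change can be scripted. This sketch assumes the key already exists in the file with a numeric value; the function name is made up, and the copy-then-replace usage shown in the comments is just one cautious way to apply it.

```shell
# Reads the config on stdin and writes it to stdout with snapshots disabled.
disable_snapshots() {
  sed 's/"db_snapshot_each_n_pbft_block"[[:space:]]*:[[:space:]]*[0-9][0-9]*/"db_snapshot_each_n_pbft_block": 0/'
}

# Example (keeps a backup of the original file):
#   cfg=/taraxa-ops-master/taraxa_compose/config/testnet.json
#   cp "$cfg" "$cfg.bak" && disable_snapshots < "$cfg.bak" > "$cfg"
```

The restart noted above is still required after the file is changed.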

Step 2: remove the snapshots from your existing node

On a Linux system the db files are located here,

/var/lib/docker/volumes/taraxa_compose_data/_data/db/

The only files you need to keep are db and state_db. The rest you can just clear out.

If you're on a different system, you can try searching for the file state_db and see where it's located.

My node gets killed when it runs out of disk space!

The most common problem we're seeing is that the node runs out of disk space. We're working to update our one-click install scripts to help with this problem - e.g., attaching a disk volume to the machine on VPS.

In the meantime please increase the disk allocation on your own.

Can I run multiple nodes on the same IP address?

Yes, you can have multiple nodes on the same IP address, but they need to occupy different ports. On the first node, you can just leave everything by default in the docker image. For subsequent nodes, you'll need to map the default ports to different ports (each node a different mapping, of course).

The settings are in the docker-compose.yml file. Inside you'll see a section called ports, so for example, you might do something like this on the second node,

ports:
  - "10003:10002"
  - "10003:10002/udp"
  - "7778:7777"
  - "8778:8777"

For example, the first line maps the container's port 10002 to port 10003 on the host. So if you want to set up more nodes, the host-side ports (10003, 7778, etc.) will need to be different for each node.
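Following the pattern in the example above, the mappings for additional nodes can be generated rather than typed by hand. This sketch simply adds (node number - 1) to each default host port; that numbering scheme and the function name are assumptions, and any unused host ports would work equally well.

```shell
# Usage: port_mappings NODE_NUMBER   (1 = first node, default ports)
port_mappings() {
  offset=$(( $1 - 1 ))
  printf 'ports:\n'
  printf '  - "%d:10002"\n'     "$(( 10002 + offset ))"
  printf '  - "%d:10002/udp"\n' "$(( 10002 + offset ))"
  printf '  - "%d:7777"\n'      "$(( 7777 + offset ))"
  printf '  - "%d:8777"\n'      "$(( 8777 + offset ))"
}

port_mappings 2   # reproduces the second-node example above
port_mappings 3   # third node: host ports 10004, 7779, 8779
```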

ERROR: No such container: taraxa_compose_node-1

This happens when you're trying to access the node's container (e.g., when trying to produce the prove-you-own-your-node signature), but the container's name is wrong.

Different operating systems name these containers slightly differently. When you see this the best thing to do is to try

docker container ls

and see what your node's container is actually called.

Of course, if you are running multiple nodes, then they're likely sequentially numbered, and listing all the containers will also help you find the one you're looking for.

ERROR: Vote sortition failed.

Typically speaking you don't need to worry about this error.

This means that your node has received a vote that it deems invalid. This doesn't mean something's wrong with your node; more likely it indicates something's wrong with the node that generated the invalid vote.

ERROR: Received NewBlock xxx has missing pivot or/and tips

Typically speaking you don't need to worry about this error.

It indicates that the node has received a DAG block but is missing that block's parents, i.e., the other DAG blocks it points to. Since there's no guarantee of the order in which packets arrive over the network, a node could easily see a specific block before it sees the parent. When this happens, the node will proactively request the missing parent blocks from its peers.

ERROR: DagBlockValidation failed

Typically speaking you don't need to worry about this error.

The reason for this happening is the same as the error about missing pivots or tips, and the node should naturally recover.

ERROR: Incorrect node version: 0, our node version xxxxx, host xxxxx will be disconnected

You don't need to worry about this error.

We added a versioning system, and nodes that have different versions will not connect to each other as peers. This message is your node encountering another node that's a different version, so it has decided to disconnect that node from its peers.

This design choice may be revisited later. As we progress towards mainnet, we may have to take backwards compatibility into consideration.

RangeError [ERR_HTTP_INVALID_STATUS_CODE]: Invalid status code: undefined

You don't need to worry about this error.

It's actually from the node status app and it happens when the app starts before the actual node and can't get data from it.
