The study of crystal forms is a crucial part of the research needed for understanding and developing a drug candidate compound. Some compounds are found to crystallize into more than one distinct crystal form, a phenomenon known as polymorphism. Typically, these different forms (i.e., polymorphs) also differ in physical properties, including density, morphology, solubility, and dissolution rate. Polymorphism is particularly important for pharmaceutical molecules as the efficiency of drugs can be strongly affected by the crystal form due to significant changes in bioavailability. Therefore, polymorph screening has become an essential procedure in drug development. Usually, a series of experiments in various conditions are performed, but these can take a long time, and assessing the completeness of a polymorph screening is a difficult task. To accelerate this process, one can take advantage of computation. The low-energy structures located by crystal structure prediction (CSP) could serve as clear targets for crystallization experiments.

In the past decade, crystal structure prediction (CSP) has made significant progress in both inorganic and organic crystals, the latter of which is directly related to drug development. Currently, there are two main challenges in this field: to ensure the completeness of search for crystal structures and to improve the accuracy of energy ranking.

The goal of crystal structure prediction is to identify the crystal structure with the best thermodynamic stability for a given molecule (neutral or ionic) and in given conditions (temperature and pressure). For different polymorphs, the chemical formula is the same, but different arrangements of the molecules lead to different energies. In other words, atomic coordinates and energy constitute an ultra-high dimensional potential energy surface (3N-6 dimensions for a system of N atoms), where the minimum energy point corresponds to the crystal structure with the best thermodynamic stability. The main task of the crystal structure prediction is to describe the potential energy surface accurately and search for the global and local minimum on it. The complexity of this problem grows exponentially with the increasing dimension of the potential energy surface (i.e., the degrees of freedom). With the development of theoretical methodologies, including ab initio and density functional theory, as well as the application of various empirical potential models, it is possible to calculate the relative energies of crystal structures more reliably. Therefore, the major challenge of crystal structure prediction is determined by the complexity of the potential energy surface, and whether the search succeeds is otherwise controlled by the effectiveness of the search algorithm.

For the case of organic molecular crystals, the rigid fragments in molecules result in the reducing dimension of the potential energy surface. The crystal energy can be expressed as a function of the following parameters: lattice parameters, crystal symmetry, centroid, orientation, molecular flexible dihedral angles, and more.

To improve the effectiveness and efficiency of its CSP, XtalPi uses a global optimization algorithm that combines stochastic optimization algorithms with heuristic optimization algorithms for the search in potential energy surfaces. Meanwhile, the sampling in the low energy area is enhanced automatically by the AI algorithm.

XtalPI’s CSP can include any number of space groups. However, we tend to limit the search in specific space groups as more than 90% of organic crystals belong to only 13 space groups: P1, P-1, P21, P21/c, C2, Cc, P212121, P21212, Pna21, Pca21, Pbcn, and Pbca. Therefore, limiting the search space substantially reduces the calculation needed and increases the CSP’s efficiency. Moreover, the efficiency can be improved by conformational constraints of the target molecule based on reported data and chemical knowledge, which could enhance the sampling of preferred conformations.

Parallel tempering, also known as replica exchange, is an improved Monte Carlo simulation method. After a random initialization, N replicas of a system are simulated at a series of temperatures. Then the system states at different temperatures are exchanged at intervals according to the Metropolis standard. The purpose is to ensure the lower temperature systems can access the states generated at high temperatures and vice versa, which allows a more sufficient sampling in both high and low energy regions. Therefore, the global optimum can theoretically always be reached after enough steps. However, as a stochastic algorithm, its convergence is slow, which prompts us to improve it by applying the particle swarm optimization algorithm in the searching process. In this way, the historical information is used to detect the new search space while pruning the existing search space.

When molecules are particularly complex, the random initialization is not an efficient method to generate structures. In this case, a scoring function based on the quantification of cell volume and atom collisions is applied to measure how good the crystal structure is. Then the best-score structures and other random structures take part in the following search, which improves the efficiency of structure generation. Several local optimization algorithms (including Powell, SLSQP, trust-constr, Nelder-Mead) are tested for the optimization of the scoring function, and the best one would be employed. In addition, if part of the structure-related information is known, these factors can be added to the scoring function for local optimization.

An excellent way to increase the searching efficiency is to apply the Monte Carlo trees model between different rounds of searching. The structure and energy information obtained in the previous round can act as feedback, and the lower energy parameter space for the next round of search would automatically recommend with higher efficiency. Another method of the AI-enhanced sampling is available for the system with chemical knowledge-based restrictions (specific interactions such as the hydrogen bonding) on the searching space. In this case, the parameters are not entirely independent, but the relationship is implicit. XtalPi also uses machine learning to match the pattern between molecular conformations and the intra/intermolecular interactions, which also significantly improves the searching efficiency.