You have measured your sample, the diffractometer has done its job, and now you are staring at a .xy file — two columns of numbers stretching over thousands of rows. The first column is \(2\theta\), the second is intensity. Somewhere in that noisy signal are the Bragg reflections that encode your crystal structure. But how do you find them reliably? This is the peak-finding problem, and getting it right is the foundation of everything that follows: indexing, lattice parameter refinement, Rietveld analysis, Scherrer crystallite size estimation. If you miss a peak or mistake noise for a reflection, every downstream result is compromised.

I wrote this as a practical walkthrough — the kind of guide I wished I had when I first started processing XRD data for our Nd0.7Sr0.3MnO3 nanoparticle samples. It covers the theory behind peak detection, the preprocessing steps that make or break your results, and a Python implementation you can adapt for your own data.

1. What Lives in a .xy File

A typical .xy file from a powder diffractometer contains two columns: the scattering angle \(2\theta\) (in degrees) and the measured intensity \(I\) (in counts or counts per second). The angular step size is usually 0.01° to 0.05°, and a full scan from 20° to 80° might contain 1200 to 6000 data points. Some instruments output additional columns (estimated standard deviation, for example), but the core data is always \(2\theta\) and \(I\).

Before doing anything else, load and visualize your raw data. This sounds obvious, but skipping this step is how you end up trusting automated routines that picked up detector glitches or missed broad peaks hiding in the baseline.

Python — Loading .xy data
import numpy as np
import matplotlib.pyplot as plt

# Load the .xy file (skip header lines if any)
data = np.loadtxt('sample.xy', skiprows=1)
two_theta = data[:, 0]
intensity = data[:, 1]

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(two_theta, intensity, 'k-', linewidth=0.5)
ax.set_xlabel(r'$2\theta$ (°)')
ax.set_ylabel('Intensity (counts)')
ax.set_title('Raw XRD Pattern')
plt.tight_layout()
plt.show()

2. Preprocessing: Background Subtraction and Smoothing

Raw XRD data always sits on top of a background — a slowly varying baseline caused by amorphous content, air scattering, fluorescence, and thermal diffuse scattering. The background obscures weak peaks and biases the intensity of strong ones. You need to subtract it before peak finding.

Background Estimation

There are several approaches, from simple to sophisticated:

Method Complexity Best For
Linear interpolation between minima Low Clean patterns, few peaks
Rolling ball / SNIP algorithm Medium General-purpose, automated
Polynomial fit to non-peak regions Medium Smooth backgrounds
Asymmetric least squares (ALS) High Complex backgrounds, overlapping peaks

For manganite samples with strong, well-separated peaks, I find that the SNIP (Statistics-sensitive Non-linear Iterative Peak-clipping) algorithm works reliably. It is implemented in many libraries and does not require you to manually select background regions. In Python, the pybaselines package provides a clean implementation:

Python — Background subtraction (SNIP)
from pybaselines.morphological import snip

# SNIP background estimation
# max_half_window controls smoothness — larger = smoother baseline
bkg, params = snip(intensity, max_half_window=40)
intensity_corr = intensity - bkg

fig, axes = plt.subplots(2, 1, figsize=(12, 7), sharex=True)
axes[0].plot(two_theta, intensity, 'k-', lw=0.5, label='Raw')
axes[0].plot(two_theta, bkg, 'r-', lw=1.5, label='Background')
axes[0].legend()
axes[1].plot(two_theta, intensity_corr, 'b-', lw=0.5)
axes[1].set_xlabel(r'$2\theta$ (°)')
axes[1].set_ylabel('Corrected Intensity')
plt.tight_layout()
plt.show()
⚠ Watch out: over-smoothing the background

If max_half_window is too large, the SNIP baseline will follow the peaks and subtract real signal. If too small, it will be too noisy and leave residual background. Start with a value of 30–50 and visually inspect the result. The baseline should track the slow curvature of the background without cutting into any peaks.

Noise Smoothing

After background subtraction, the corrected pattern may still have high-frequency noise (especially for weakly diffracting or nanocrystalline samples with broad peaks). A light smoothing pass can help peak detection without destroying peak shape. The Savitzky-Golay filter is the standard choice because it preserves peak positions and heights better than a simple moving average:

Python — Savitzky-Golay smoothing
from scipy.signal import savgol_filter

# window_length must be odd; polyorder < window_length
# Smaller window = less smoothing = better peak preservation
intensity_smooth = savgol_filter(intensity_corr, window_length=11, polyorder=3)
💡 Rule of thumb

For step sizes of 0.02°, a Savitzky-Golay window of 7–15 points and polynomial order 2–3 works well. For broader peaks (nanoparticles with crystallite sizes below 20 nm), you can increase the window slightly. Never smooth so aggressively that neighboring peaks merge.

3. Peak Detection: The Algorithm

With a clean, background-subtracted signal in hand, peak detection becomes a local maximum problem with constraints. The idea is simple: find points where the intensity is higher than their neighbors and exceeds a meaningful threshold above the noise floor.

SciPy's find_peaks is the workhorse here. It takes the 1D intensity array and returns the indices of local maxima satisfying conditions you specify:

Key parameters for find_peaks $$\text{height} > h_\text{min}, \quad \text{prominence} > p_\text{min}, \quad \text{distance} > d_\text{min}$$

Let me explain each parameter and why it matters:

Python — Peak detection with scipy.signal.find_peaks
from scipy.signal import find_peaks

# Estimate noise level from a peak-free region
# (e.g., 70–80° where your material has no peaks)
noise_region = intensity_corr[(two_theta > 70) & (two_theta < 80)]
noise_std = np.std(noise_region)

# Peak detection
peaks, properties = find_peaks(
intensity_smooth,
height=3 * noise_std, # above 3σ noise
prominence=0.05 * intensity_smooth.max(), # 5% of max
distance=15, # ≥0.3° apart (for 0.02° step)
width=3 # at least 3 points wide
)

# Extract peak positions and intensities
peak_positions = two_theta[peaks]
peak_intensities = intensity_smooth[peaks]

print(f'Found {len(peaks)} peaks:')
for pos, I in zip(peak_positions, peak_intensities):
print(f' 2θ = {pos:.2f}°, I = {I:.0f}')

4. Visualizing the Results

Always plot your detected peaks on top of the pattern. This is not optional — it is the primary quality check. You should see markers sitting cleanly at the apex of every diffraction peak, with no markers on noise bumps and no real peaks missed.

Python — Plotting detected peaks
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(two_theta, intensity_smooth, 'k-', lw=0.6, label='Pattern')
ax.plot(peak_positions, peak_intensities, 'rv',
markersize=8, label=f'{len(peaks)} peaks found')

# Annotate each peak with its 2θ value
for pos, I in zip(peak_positions, peak_intensities):
ax.annotate(f'{pos:.1f}°',
xy=(pos, I), xytext=(0, 12),
textcoords='offset points',
fontsize=7, ha='center',
color='#8b2500')

ax.set_xlabel(r'$2\theta$ (°)')
ax.set_ylabel('Intensity (a.u.)')
ax.legend()
plt.tight_layout()
plt.savefig('xrd_peaks.png', dpi=200)
plt.show()

If you find too many peaks, increase the prominence threshold. If you miss weak peaks (like impurity phases or superlattice reflections), lower the height threshold and check manually. There is no universal set of parameters — they depend on your data quality, step size, and the material itself.

5. From Peaks to d-spacings: Applying Bragg's Law

Once you have reliable peak positions, converting to interplanar spacings is straightforward via Bragg's law:

Bragg's Law $$n\lambda = 2d_{hkl}\sin\theta$$

For first-order diffraction (\(n=1\)) with Cu K\(\alpha\) radiation (\(\lambda = 1.5406\) Å):

Python — Converting 2θ to d-spacing
wavelength = 1.5406 # Cu Kα in Å

theta_rad = np.radians(peak_positions / 2) # θ from 2θ
d_spacings = wavelength / (2 * np.sin(theta_rad))

print('Peak positions and d-spacings:')
print(f'{"2θ (°)":>10} {"d (Å)":>10}')
for pos, d in zip(peak_positions, d_spacings):
print(f'{pos:10.3f} {d:10.4f}')

These d-spacings are your input for peak indexing — matching each reflection to its Miller indices \((hkl)\) by comparing against the ICDD/PDF database or a calculated pattern from a known CIF file. For perovskite manganites like Nd0.7Sr0.3MnO3, the strongest peaks are typically (110)/(002) for orthorhombic Pbnm or (104)/(110) for rhombohedral R-3c, appearing around \(2\theta \approx 32\)°.

6. Common Pitfalls

⚠ Pitfall #1: Kα2 splitting

At high angles (\(2\theta > 50\)°), the K\(\alpha_1\)/K\(\alpha_2\) doublet becomes resolved, creating shoulder peaks that your algorithm may detect as separate reflections. Either strip K\(\alpha_2\) before peak finding (using Rachinger correction) or account for it in your prominence threshold. This is not a deficiency of your sample — it is physics.

⚠ Pitfall #2: Confusing peak broadening with noise

Nanocrystalline samples (crystallite size < 30 nm) produce intrinsically broad peaks. If you set your width parameter too narrow, you will miss these broad, real reflections. Conversely, if your sample is highly crystalline, narrow instrumental peaks might have noisy tops that create spurious double-detections — increase the distance parameter to fix this.

⚠ Pitfall #3: Preferred orientation artifacts

If your powder sample is not properly randomized (e.g., plate-like crystallites lying flat), some peaks will be anomalously strong and others weak or absent. Peak finding will work fine, but the relative intensities will not match the PDF reference. This is a sample preparation problem, not an algorithm problem.

⚠ Pitfall #4: Not validating against a reference

Always compare your found peaks against a known reference pattern (ICDD card, CIF simulation, or literature data). If you find peaks that do not match any expected phase, you may have an impurity phase — which is scientifically interesting, not an error to ignore.

· · ·

7. The Complete Pipeline

Putting it all together, here is the full script that takes a .xy file and produces a table of peaks with their \(2\theta\) positions, d-spacings, and relative intensities:

Python — Complete XRD peak finding pipeline
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks, savgol_filter
from pybaselines.morphological import snip

# ── 1. Load data ──
data = np.loadtxt('sample.xy', skiprows=1)
two_theta, intensity = data[:, 0], data[:, 1]

# ── 2. Background subtraction ──
bkg, _ = snip(intensity, max_half_window=40)
I_corr = intensity - bkg

# ── 3. Smoothing ──
I_smooth = savgol_filter(I_corr, window_length=11, polyorder=3)

# ── 4. Peak detection ──
noise = np.std(I_corr[(two_theta > 70) & (two_theta < 80)])
peaks, props = find_peaks(I_smooth,
height=3*noise,
prominence=0.05*I_smooth.max(),
distance=15, width=3)

# ── 5. Calculate d-spacings ──
lam = 1.5406 # Cu Kα
theta = np.radians(two_theta[peaks] / 2)
d = lam / (2 * np.sin(theta))
I_rel = I_smooth[peaks] / I_smooth[peaks].max() * 100

# ── 6. Print results ──
print(f'{"#":>3} {"2θ (°)":>10} {"d (Å)":>10} {"I_rel (%)":>10}')
for i, (p, di, Ir) in enumerate(zip(two_theta[peaks], d, I_rel)):
print(f'{i+1:3d} {p:10.3f} {di:10.4f} {Ir:10.1f}')

8. What Comes Next

Peak finding is only the first step in XRD analysis. Once you have your peaks, the natural next steps are:

Each of these steps builds directly on the peak positions and shapes you extract here. Getting clean, reliable peaks is not glamorous work, but it is the foundation that makes everything else possible.

In XRD, finding the peaks is easy. Finding the right peaks is the craft.