data_quality_control.analysis module

class data_quality_control.analysis.Analyzer(datadir, nslc_code, fileunit='year')

Bases: data_quality_control.base.BaseProcessedData

Handler to view and manage processed amplitudes and PSDs.

Loads data for desired time range and provides plotting routines.

Parameters

datadir (pathlib.Path) – directory of HDF5-files
nslc_code (str) – code of format {network}.{station}.{location}.{channel}
fileunit ({"year", "month", "day", "hour"}) – flag indicating the expected time range per file. Determines the search pattern for files.

iter_time(starttime, endtime)

Generator providing start and endtime of HDF5-files within requested timerange depending on fileunit

See

fileunit="year": data_quality_control.util.iter_years() ,
fileunit="month":data_quality_control.util.iter_month() ,
fileunit="day":data_quality_control.util.iter_days() ,
fileunit="hour":data_quality_control.util.iter_hours()

Type: func

_check_if_requested_times_are_available(stime, etime): Reduce start/end time to those in self if out of available range. Issue errors

_check_times(starttimes, endtime)

Determines if given times indicate time range or time list.

Raises UserWarning if input is wrong.

_iter_open_hdf5(fnamelist=None)

Generator that returns open h5py.File object for each filename in fnamelist. Closes file before yielding next file and before error is raised.

Useful for testing or if direct access to the hdf5-files is needed.

If fnamelist=None, we use self.files. Causes error if not set.

_reset(): Set attributes amplitude_frequencies, winlen_seconds, startdate, enddate, amplitudes, psds, frequency_axis to None.

filter_psds_for_times(timelist)

Select and set PSDs only for given times.

Finds indices of times in list, extracts respective PSDs and replaces psds.

Parameters: timelist (list of UTCDateTimes) –

Warning

Overrides existing psd array! Cannot be used multiple times without reloading the data.

property fmtstr: Name pattern of hdf5-files.

get_available_datafiles(): Return list with all available HDF5-filenames for stationcode in datadir

get_available_timerange(): Get list of available files and read startdate of first and enddate of last file in sorted list.

get_data(starttimes, endtime=None)

Load amplitudes, psds and metadata from HDF5-files for specified times.

starttimes can be given as array-like object containing UTCDateTimes or a single UTCDateTime. In the first case Psds are loaded only for the windows corresponding to the times in starttimes. In the second case, endtime must be given as UTCDateTime. Psds are loaded for the entire range between start and end time.

If list of starttimes is given, starttime and endtime properties of the Analyzer are set to minimum and maximum times in list.

Amplitudes are always loaded for the entire range between starttime and endtime.

Parameters

starttimes (array-like or UTCDateTime) – If UTCDateTime, we expect also endtime as UTCDateTime. PSDs are loaded for the time range between start and endtime. if array-like, endtime is ignored. Psds are loaded for windows listed in starttimes only.
endtime (None or UTCDateTime) – only required if starttimes is UTCDateTime. End of time range for data selection.

get_filenames(starttime, endtime): Get filenames within time range.

nslc_as_dict(): Return station code as dictionary.

plot3d()

plot3d_amplitudes(func=None)

plot3d_spectrogram(func=None, colorbarlabel='', freqs=(None, None), log_freq_ax=False, vmin=None, vmax=None)

Notes

Latex rendering for me works in title but not on axis labels.

plot_spectrogram(ax=None, func=None, colorbarlabel='', freqs=(None, None), log_freq_ax=False, **kwargs)

Plot power spectral densities as spectrogram.

We use matplotlibs pcolormesh to plot the PSDs in a spectrogram-like way, i.e. time on x-axis, frequency on y-axis and PSD as color.

Parameters

ax (matplotlib.axes) – axes to plot in
func (callable) – modifies PSDS as func(self.psds.T) before plotting. If not given, we plot np.log10(self.psds.T * 1e9**2) which is log10(nm^2/s^2/Hz.
colorbarlabel (str [""]) – set colorbar label. Useful in combination with func to set correct units for color scale.
kwargs – keyword arguments passed to pcolormesh. vmax can be callable to set vmax=vmax(Z). If not given, we use: cmap=plt.cm.afmhot, shading=auto, vmax=np.nanmax(Z)

trim(starttime=None, endtime=None, fill_value=None)

Remove data outside of start/end time.

Parameters

starttime (obspy.core.UTCDateTime) – If None set starttime is used
endtime (obspy.core.UTCDateTime) – If None set endtime is used
fill_value (float, int [None]) – If given, fills arrays to requested start/endtime if available data starts after starttime or ends before endtime. If None, resulting time range may be shorter than requested.

class data_quality_control.analysis.SmoothOperator(datadir, nslc_code, fileunit='year', kernel_size=None, kernel_shift=None)

Bases: data_quality_control.analysis.Analyzer

Manage smoothing/data reduction of processed amplitudes and PSDs.

Grandparent is BaseProcessedData.

The main routine to call is smooth(). The other methods are mostly helpers to perform the smoothing/downsampling, organize the iteration over the data base and manage IO-operations. They can be useful on their own, though.

Parameters

datadir (pathlib.Path) – directory of HDF5-files
nslc_code (str) – code of format {network}.{station}.{location}.{channel}
fileunit ({"year", "month", "day", "hour"}) – flag indicating the expected time range per file. Determines the search pattern for files.
kernel_size (int > 1) – Number of samples over which median is computed. 1 sample corresponds to the window size of the original data.
kernel_shift (int >= 1) – Number of samples by which the kernel is shifted to compute the next median. See kernel_shift

Example

from data_quality_control import analysis, dqclogging

nslc_code = "GR.BFO..BHZ"
datadir = "output/"
outdir = "output/smoothd/"
kernel_size = 6
kernel_shift = 3

# Set loglevel in console, no logfile
dqclogging.configure_handlers(loglevel_console="INFO",
                            loglevel_file=None)

# Interpolation
## Initiate smoother
polly = analysis.SmoothOperator(datadir, nslc_code,
                            kernel_size=kernel_size,
                            kernel_shift=kernel_shift)

## Start interpolation over whole available time range
polly.smooth(outdir, force_new_file=True)

# View results
lyza = analysis.Analyzer(datadir, nslc_code,
                            fileunit="year")

startdate = UTC("2020-12-24")
enddate = UTC("2021-01-15")
lyza.get_data(startdate, enddate)

fig = lyza.plot_spectrogram()

kernel_size

Number of samples over which median is computed. 1 sample corresponds to the window size of the original data.

Type: int > 1

kernel_shift

Number of samples by which the kernel is shifted to compute the next median. kernel_shift=1 corresponds to a classic running median, kernel_shift>1 downsamples the original data. Determines data_quality_control.base.BaseProcessedData.winlen_seconds. For example if the seismic data were processed using winlen_seconds=3600 the new data_quality_control.base.BaseProcessedData.winlen_seconds is 6*3600s = 21600s.

Type: int >= 1

WINLEN_SECONDS

window size in seconds of the original processed data, i.e. the window size over which amplitudes and psds were computed from the seismic data. Is set based on the value in the first file and checked against data_quality_control.base.BaseProcessedData.winlen_seconds of every subsequent file. Raises RuntimeError if it changes.

Type: int

winlen_seconds

Window size in seconds of the current data. Changes with every iteration, i.e. every newly read file, between value of the original processed data read from file and kernel_shift*winlen_seconds after the median operation is applied.

Type: int

_check_framed_shape(x, X, label='')

Used in _medianfilter() to check if all data went into frames.

Parameters

x (ndarray (1D)) – data as time series
X (ndarray (2D)) – data as (overlapping) frames
kernel_shift (int) – distance between two frames in samples
label (str) – name of x for more meaningful error msg.

Raises

AssertionError – if there are samples left.

_check_kernelshiftsize()

Check if winlen_seconds and kernel_shift if resulting increment does not yield an integer quotient of possible total durations defined by fileunit.

Runs data_quality_control.util.assert_integer_quotient_of_wins_per_fileunit()

Raises: UserWarning – if kernel_shift yields inappropriate increment

_get_WINLEN_SECONDS(TSTA, TEND)

Read first file in list to get window size in seconds.

Parameters

TSTA (UTCDateTime) – beginning of time range in which we look for data.
TEND (UTCDateTime) – end of time range in which we look for data.

Example

If you want to know the winlen_seconds in your targeted data base, adjust the following code snippet:

nslc_code = "GR.BFO..BHZ"
datadir = "output/"
polly = analysis.SmoothOperator(datadir, nslc_code)
polly._get_WINLEN_SECONDS(*polly.get_available_timerange())
print("Winlen in processed data", polly.WINLEN_SECONDS)

_get_start_endtime(starttime=None, endtime=None): Replace None input parameters with start/end time from all available data.

_medianfilter(): Transform timeseries data into frames and apply median operation.

_set_check_WINLEN_SECONDS(): Sets WINLEN_SECONDS if not set, otherwise raises error if different from py:attr:.winlen_seconds. Used to monitor if window size changes between files.

iter_times_kernel(tsta, tend)

Generator providing start and endtime of data to read for interpolation.

It’s a wrapper around data_quality_control.analysis.Analyzer.iter_time() which adjusts the yielded start and endtimes by kernel_shift, kernel_size and WINLEN_SECONDS to accommodate all data to smooth over the entire fileunit.

Yields

new_tsta (UTCDateTime) – starttime of data to read for interpolation. Determined by new_tend except for first one.
new_tend (UTCDateTime) – endtime of data to read for interpolation

set_kernel(kernel_size, kernel_shift)

Set kernel parameters. Forces integers.

Preserves existing attribute if kernel_size=None or kernel_shift=None.

Parameters

kernel_size (num > 1) – Number of samples over which median is computed. 1 sample corresponds to the window size of the original data.
kernel_shift (num > 1) – Number of samples by which the kernel is shifted to compute the next median. See kernel_shift

smooth(outdir='.', starttime=None, endtime=None, kernel_size=None, kernel_shift=None, force_new_file=False)

Performs smoothing/downsampling of database for requested time range.

Parameters

outdir (str) – directory where new results are stored. Should be different from datadir because file names will be identical and would override input data.
starttime (UTCDateTime) – Begin of time range to process. If None earliest available time in database is used.
endtime (UTCDateTime) – End of time range to process. If None lastest available time in database is used.
kernel_size (int, None) – See kernel_size. Overrides existing value. Must be set here if not set on initialization.
kernel_shift (int, None) – See kernel_shift. Overrides existing value. Must be set here if not set on initialization.
force_new_file (bool [False]) – If True overwrites existing output data.