divbrowse.lib.genotype_data

Module Contents

Classes

GenotypeData

Class for managing all data structures and methods related to genotype data

Functions

calculate_mean(→ numpy.ndarray)

Calculate the mean for each variant of a variant matrix array holding the number of alternate alleles

impute_with_mean(→ numpy.ndarray)

Impute (replace) missing values in a variant matrix array with the mean for each variant

calc_pca_for_slice_of_variant_calls(...)

Calculate a PCA for a variant matrix array

calc_umap_for_slice_of_variant_calls(...[, n_neighbors])

Calculate UMAP for a variant matrix array

divbrowse.lib.genotype_data.calculate_mean(slice_of_variant_calls: numpy.ndarray) → numpy.ndarray

Calculate the mean for each variant of a variant matrix array holding the number of alternate alleles

Note

Missing variant calls are excluded from the mean calculation

Parameters

slice_of_variant_calls (numpy.ndarray) – Numpy array representing a variant matrix holding the number of alternate allele calls

Returns

Numpy array holding the means per variant

Return type

numpy.ndarray
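As an illustration of a per-variant mean that skips missing calls, here is a minimal sketch. It is not the library's implementation. It assumes that missing calls are coded as -1 and that rows are samples and columns are variants; neither is confirmed by this page. The name `mean_per_variant` is hypothetical.

```python
import numpy as np

MISSING = -1  # assumed code for a missing call; not confirmed by the source


def mean_per_variant(calls: np.ndarray) -> np.ndarray:
    """Column-wise mean that ignores entries equal to MISSING."""
    masked = np.ma.masked_equal(calls, MISSING)
    # Variants for which every call is missing come back as NaN.
    return masked.mean(axis=0).filled(np.nan)


# Assumed orientation: rows = samples, columns = variants.
calls = np.array([[0, 2, -1],
                  [1, -1, -1],
                  [2, 2, -1]])
means = mean_per_variant(calls)
```

Here the second variant is averaged over its two observed calls only, and the third variant, which has no observed calls, yields NaN.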

divbrowse.lib.genotype_data.impute_with_mean(slice_of_variant_calls: numpy.ndarray) → numpy.ndarray

Impute (replace) missing values in a variant matrix array with the mean for each variant

Parameters

slice_of_variant_calls (numpy.ndarray) – Numpy array representing a variant matrix holding the number of alternate allele calls

Returns

Imputed version of the input variant matrix array

Return type

numpy.ndarray
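Mean imputation can be sketched as follows. This is an illustration under stated assumptions, not the library's code: missing calls are assumed to be coded as -1, rows are assumed to be samples, and a variant with no observed calls is filled with 0.0 (an arbitrary choice made for this example). The name `impute_missing_with_mean` is hypothetical.

```python
import numpy as np

MISSING = -1  # assumed code for a missing call; not confirmed by the source


def impute_missing_with_mean(calls: np.ndarray) -> np.ndarray:
    """Replace MISSING entries with the mean of the observed calls per variant."""
    calls = calls.astype(float)
    missing = calls == MISSING
    counts = (~missing).sum(axis=0)                 # observed calls per variant
    sums = np.where(missing, 0.0, calls).sum(axis=0)
    # Variants without any observed call get 0.0 (choice made for this sketch).
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return np.where(missing, means, calls)


# Assumed orientation: rows = samples, columns = variants.
calls = np.array([[0, 2],
                  [-1, 2],
                  [2, -1]])
imputed = impute_missing_with_mean(calls)
```

The result is a float array, since a mean of allele counts is generally not an integer.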

divbrowse.lib.genotype_data.calc_pca_for_slice_of_variant_calls(slice_of_variant_calls, samples_selected)

Calculate a PCA for a variant matrix array

Parameters

slice_of_variant_calls (numpy.ndarray) – Numpy array representing a variant matrix holding the number of alternate allele calls

Returns

PCA result aligned with the sample IDs in the first column

Return type

numpy.ndarray
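To illustrate what a result "aligned with the sample IDs in the first column" could look like, the sketch below computes principal component scores with a plain NumPy SVD and prepends the sample IDs. This is an assumption-laden example: the page does not say which PCA implementation the function uses, and the name `pca_with_sample_ids`, the `n_components` argument and the row-per-sample orientation are all hypothetical.

```python
import numpy as np


def pca_with_sample_ids(imputed_calls, sample_ids, n_components=2):
    """PCA scores via SVD, with the sample IDs placed in the first column."""
    # Center each variant (column) before the decomposition.
    centered = imputed_calls - imputed_calls.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    ids = np.asarray(sample_ids, dtype=object).reshape(-1, 1)
    # Object dtype lets string IDs and float scores share one array.
    return np.hstack([ids, scores.astype(object)])


# Assumed orientation: rows = samples, columns = variants; no missing values.
imputed_calls = np.array([[0.0, 2.0, 1.0],
                          [1.0, 2.0, 0.0],
                          [2.0, 0.0, 1.0],
                          [0.0, 1.0, 2.0]])
sample_ids = ["s1", "s2", "s3", "s4"]
result = pca_with_sample_ids(imputed_calls, sample_ids)
```

The input is expected to be free of missing values, which is why imputation would normally run first.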

divbrowse.lib.genotype_data.calc_umap_for_slice_of_variant_calls(slice_of_variant_calls, samples_selected, n_neighbors=15)

Calculate UMAP for a variant matrix array

Parameters

slice_of_variant_calls (numpy.ndarray) – Numpy array representing a variant matrix holding the number of alternate allele calls

Returns

UMAP result aligned with the sample IDs in the first column

Return type

numpy.ndarray

class divbrowse.lib.genotype_data.GenotypeData(config)

Class for managing all data structures and methods related to genotype data

_load_data()
get_vcf_header()
_setup_sample_id_mapping()
_create_chrom_indices()
_create_list_of_chromosomes()
sample_ids_to_mask(sample_ids: list) → numpy.ndarray

Creates a boolean mask based on the input sample IDs that could be found in the samples array of the Zarr storage

Parameters

sample_ids (list) – List with sample IDs

Returns

Boolean mask, True for found sample IDs

Return type

numpy.ndarray
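One way such a mask can be built is with `numpy.isin`. The sketch below is an assumption about the general technique, not a copy of the method: `stored_samples` is a hypothetical stand-in for the samples array of the Zarr storage, and the mask is assumed to have one entry per stored sample.

```python
import numpy as np

# Hypothetical stand-in for the samples array held in the Zarr storage.
stored_samples = np.array(["A", "B", "C", "D"])

# Requested IDs; "X" does not exist in the storage.
requested = ["B", "D", "X"]

# True at every stored sample that appears in the request.
mask = np.isin(stored_samples, requested)
```

With this layout the mask can index the sample axis of a call matrix directly, and unknown IDs such as "X" are silently ignored.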

map_input_sample_ids_to_vcf_sample_ids(sample_ids: list) → list

Map input sample IDs to VCF sample IDs according to the configured mapping table

Parameters

sample_ids (list) – List with sample IDs

Returns

List of mapped sample IDs

Return type

list

map_vcf_sample_ids_to_input_sample_ids(sample_ids: list) → list

Map VCF sample IDs to input sample IDs according to the configured mapping table

Parameters

sample_ids (list) – List with sample IDs

Returns

List of mapped sample IDs

Return type

list
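The two mapping directions can be pictured as a lookup table and its inverse. The following sketch uses a hypothetical in-memory dictionary; the format of the configured mapping table and the handling of unmapped IDs are not described on this page, so passing unmapped IDs through unchanged is an assumption made for this example.

```python
# Hypothetical mapping table: input sample ID -> VCF sample ID.
mapping = {"acc_1": "VCF_0001", "acc_2": "VCF_0002"}

# Inverse table for the opposite direction (assumes a one-to-one mapping).
inverse = {vcf_id: input_id for input_id, vcf_id in mapping.items()}


def map_ids(sample_ids, table):
    """Translate IDs through the table; unmapped IDs pass through unchanged."""
    return [table.get(sample_id, sample_id) for sample_id in sample_ids]


to_vcf = map_ids(["acc_1", "acc_2", "acc_9"], mapping)
round_trip = map_ids(to_vcf, inverse)
```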

get_samples_mask(sample_ids)

Returns a tuple consisting of a boolean mask for the found sample IDs and a list of mapped sample IDs

Parameters

sample_ids (list) – List with sample IDs

Returns

Tuple of a boolean mask (numpy.ndarray, True for found sample IDs) and a list of the mapped sample IDs

Return type

Tuple[numpy.ndarray, list]

get_posidx_by_genome_coordinate(chrom, pos) → Tuple[int, str]

Returns the array coordinate for a given physical position on a given chromosome

Parameters
  • chrom (str) – ID of the chromosome

  • pos (int) – Physical position on the chromosome

Returns

  • lookup (int) – Array coordinate of the found physical position on the chromosome

  • lookup_type (str) – Type of the lookup, either ‘direct_lookup’ or ‘nearest_lookup’
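The distinction between a direct and a nearest lookup can be sketched with a binary search over sorted positions. This is an illustrative guess at the behaviour, not the method's actual code: `position_to_index` and the `positions` array are hypothetical, the array is assumed to be sorted, non-empty and limited to one chromosome, and ties are resolved toward the lower index.

```python
import numpy as np


def position_to_index(positions, pos):
    """Return (index, lookup_type) for a physical position."""
    idx = int(np.searchsorted(positions, pos))
    if idx < len(positions) and positions[idx] == pos:
        return idx, "direct_lookup"
    # No exact hit: pick whichever neighbouring position is closer.
    candidates = [i for i in (idx - 1, idx) if 0 <= i < len(positions)]
    nearest = min(candidates, key=lambda i: abs(int(positions[i]) - pos))
    return nearest, "nearest_lookup"


# Hypothetical sorted variant positions on one chromosome.
positions = np.array([100, 250, 400, 900])
exact = position_to_index(positions, 250)
between = position_to_index(positions, 300)
beyond = position_to_index(positions, 1000)
```

A position of 300 falls between 250 and 400 and resolves to the closer one, while a position past the last variant resolves to the last index.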

count_alternate_alleles(sliced_variant_calls)

Counts the alternate alleles in each call of a variant matrix array

Parameters

sliced_variant_calls (numpy.ndarray) – Variant matrix array holding the allele calls (e.g. 0/0, 0/1, 1/1)

Returns

variant matrix array holding the number of alternate allele calls

Return type

numpy.ndarray
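Counting alternate alleles amounts to collapsing the ploidy axis of a genotype array. The sketch below shows the idea under several assumptions that this page does not confirm: the input has shape (variants, samples, ploidy), allele 0 is the reference, any allele above 0 counts as alternate, and -1 marks a missing allele. The name `count_alt` is hypothetical.

```python
import numpy as np


def count_alt(genotypes: np.ndarray) -> np.ndarray:
    """Number of alternate alleles per call; -1 where the call is missing."""
    n_alt = (genotypes > 0).sum(axis=2)
    missing = (genotypes < 0).any(axis=2)
    return np.where(missing, -1, n_alt)


# One variant, four samples: 0/0, 0/1, 1/1 and a missing call.
genotypes = np.array([[[0, 0], [0, 1], [1, 1], [-1, -1]]])
n_alt = count_alt(genotypes)
```

For a diploid call this yields 0 for 0/0, 1 for 0/1 and 2 for 1/1.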

count_variants_in_window(chrom, startpos, endpos) → int

Counts number of variants in a genomic region

Parameters
  • chrom (str) – The chromosome of the genomic region.

  • startpos (int) – The first position of the genomic region.

  • endpos (int) – The last position of the genomic region.

Returns

Number of variants in the genomic region

Return type

int
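Counting the variants in a window reduces to two binary searches when the positions are sorted. The following is a sketch of that technique and not the method's actual code: `count_in_window` and `positions` are hypothetical, the positions are assumed to be sorted and to belong to a single chromosome, and both window ends are treated as inclusive.

```python
import numpy as np


def count_in_window(positions, startpos, endpos):
    """Count positions p with startpos <= p <= endpos."""
    lo = np.searchsorted(positions, startpos, side="left")
    hi = np.searchsorted(positions, endpos, side="right")
    return int(hi - lo)


# Hypothetical sorted variant positions on one chromosome.
positions = np.array([100, 250, 400, 900])
inside = count_in_window(positions, 250, 400)
empty = count_in_window(positions, 401, 899)
```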

calculate_minor_allele_freq(numbers_of_alternate_alleles)

Calculates minor allele frequency

Parameters

numbers_of_alternate_alleles (numpy.ndarray) – Numpy array representing a variant matrix holding the number of alternate allele calls

Returns

Numpy array (1d) holding the calculated minor allele frequency for each variant

Return type

numpy.ndarray
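For biallelic variants, the minor allele frequency is the smaller of the alternate allele frequency and its complement. The sketch below illustrates that calculation under assumptions not stated on this page: diploid samples, missing calls coded as -1, and rows as samples with columns as variants. The name `minor_allele_freq` and the `ploidy` argument are hypothetical.

```python
import numpy as np


def minor_allele_freq(n_alt: np.ndarray, ploidy: int = 2) -> np.ndarray:
    """Minor allele frequency per variant, ignoring missing calls (< 0)."""
    observed = n_alt >= 0
    alt_alleles = np.where(observed, n_alt, 0).sum(axis=0)
    total_alleles = ploidy * observed.sum(axis=0)
    # Variants without any observed call get NaN instead of a division error.
    alt_freq = np.divide(alt_alleles, total_alleles,
                         out=np.full(total_alleles.shape, np.nan),
                         where=total_alleles > 0)
    return np.minimum(alt_freq, 1.0 - alt_freq)


# Assumed orientation: rows = samples, columns = variants.
n_alt = np.array([[0, 2, -1],
                  [1, 2, -1],
                  [0, 1, -1],
                  [-1, 2, -1]])
maf = minor_allele_freq(n_alt)
```

In the second variant the alternate allele is the major one (frequency 0.875), so the reported value is the complement.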

calc_variants_summary_stats(numbers_of_alternate_alleles)
apply_variant_filter_settings(fs, numbers_of_alternate_alleles, _slice_variant_calls)
get_slice_of_variant_calls(chrom, startpos=None, endpos=None, count=None, samples=None, variant_filter_settings=None)